Multivariate Characterization, Modeling, and Design of Chemical
Products in Property Cluster Space
by
Subin Hada
A dissertation submitted to the graduate faculty of
Auburn University
in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
Auburn, Alabama
August 3, 2013
Keywords: Chemical Product Design, Multivariate Statistical Analysis,
Chemometric Technique, Reverse Problem Formulation, Property Clustering.
Copyright 2013 by Subin Hada
Approved by
Mario R. Eden, Chair and McMillan Professor of Chemical Engineering
Christopher B. Roberts, Uthlaut Professor of Chemical Engineering
Elizabeth Lipke, Assistant Professor of Chemical Engineering
Sushil Adhikari, Assistant Professor of Biosystems Engineering
Mahmoud ElHalwagi, McFerrin Professor of Chemical Engineering
ii
ABSTRACT??..
The focus of this dissertation is on the development of in silico approaches for
the logical and systematic solution of chemical product design problems. The
application of multivariate characterization, modeling, and design is
accomplished by utilizing interdisciplinary methods and tools that extend
through multivariate statistics, applied mathematics and computer science.
Methodologies and techniques such as spectroscopybased group contribution
methods, chemometric/chemoinformatic techniques, reverse problem
formulation, and property clustering techniques are integrated within
computeraided molecular/mixture design (CAMD) algorithms to design
chemical products in a computationally efficient manner that provides
optimum performance in terms of customer requirements. Propertybased
design techniques and multivariate datadriven modeling and optimization
strategies are presented in this dissertation covering two specific areas of
chemical product design: mixture and molecular design.
In mixture design, the property integration framework is combined with
multivariate statistical techniques and applied in a reverse problem
formulation on chemical product design problems by systematic and
insightful use of past data describing the properties of the raw materials,
their blend ratios, and the process conditions during the production of a
range of product grades to achieve new and improved products. Projection
methods, like principal component analysis (PCA) and partial least squares
(PLS) are applied to identify the underlying relationships necessary for
simultaneous optimization of all three variables. The method is illustrated
using a polymer blending problem.
iii
In molecular design, multivariate characterization techniques like
infrared (IR) spectroscopy are utilized to generate numerical descriptors of
molecular architecture in terms of IR frequency of a set of representative
samples. Models based on quantitative structureproperty relations (QSPR)
are used to elucidate structureproperty relationships. Applying principal
component analysis, high dimensional and highly correlated molecular
descriptor variables are transformed into low dimensional and statistically
independent latent variables. These latent variables are then used to
calibrate latent property models. Finally, the reverse design of molecules is
accomplished by exhaustively searching for molecular structures with target
properties, from the combinatorial building blocks. A characterizationbased
group contribution method is utilized to estimate the properties of the
formulated chemical products. The concepts and the solution methodologies
are demonstrated using two proofofconcept examples: biodiesel additive
formulation and ionic liquid design.
iv
ACKNOWLEDGEMENTS
My deepest gratitude to my advisor, Dr. Mario R. Eden, for his gentle
guidance, patience, and constant encouragement throughout my doctoral
program at Auburn University. His mentorship and support has been
invaluable on both an academic and a personal level. I will miss travelling
with him around the world for conferences.
It is a pleasure to thank my dissertation committee, and outside reader,
respectively: Dr. Christopher Roberts, Dr. Elizabeth Lipke, Dr. Sushil
Adhikari, Dr. Mahmoud ElHalwagi, and Dr. Steven Taylor. Thank you for
having patience as I endeavored to complete my work. I appreciate the
feedback and review that helped to improve and complete this dissertation.
I thank my fellow labmates in the Eden laboratory who supported me
and shared their enthusiasm for and comments on my work. My special
thanks to Robert Herring for the stimulating discussions and for the sleepless
nights we worked together before deadlines.
I gratefully acknowledge the financial support provided by the
Southeastern Regional Sun Grant Program, and the USDAAFRI (Award#
20116700920077). In addition, I thank Dr. Koji Muteki; Pfizer Inc., CT,
USA, for providing the starch blending data used in the mixture design proof
ofconcept example.
I cannot end without thanking my parents, Rajendra P. Hada and
Subrata Hada, and my brother Sujin Hada, on whose constant
encouragement and love I have relied on throughout long years of my
educational journey. Their persistent audacity and conviction will always
inspire me. It is to them that I dedicate this work. I would also like to thank
my wife, Fui Chi Yap, for her unequivocal support.
v
TABLE OF CONTENTS
Page
ABSTRACT??.. ............................................................................................... ii
ACKNOWLEDGEMENTS .............................................................................. iv
TABLE OF CONTENTS .................................................................................. v
LIST OF TABLES ......................................................................................... viii
LIST OF FIGURES........................................................................................... x
LIST OF ABBREVATIONS .......................................................................... xiv
CHAPTER 1 INTRODUCTION ................................................................. 1
1.1 Chemical Product Design ....................................................... 1
1.2 Challenges and motivations ................................................... 4
1.3 Scope and Objectives ............................................................. 11
1.4 Significance of the Research ................................................. 12
1.5 Organization .......................................................................... 13
CHAPTER 2 THEORY AND METHODOLOGY .................................... 15
2.1 ProcessProduct Design ........................................................ 15
2.2 PropertyBased Process and Product Design ....................... 16
2.2.1 Reverse Problem Formulation ................................ 17
2.2.2 Property Clustering Technique ............................... 20
2.3 Prediction of Properties ........................................................ 25
2.3.1 Group Contribution Method .................................... 27
2.4 Characterization Techniques ............................................... 29
2.4.1 Spectroscopy ............................................................ 31
2.5 General Regression Models .................................................. 34
2.6 Latent Variable Models ........................................................ 36
2.6.1 Principal Component Analysis (PCA) ..................... 39
vi
2.6.2 Principal Component Regression (PCR) ................. 46
2.6.3 Partial Least Squares (PLS) ................................... 47
2.6.4 Model Validation ..................................................... 49
2.7 Principal Properties in Cluster Space .................................. 51
2.8 ComputerAided Design using QSPR and cGCM ................ 52
2.8.1 Solution to Reverse Problem ................................... 53
2.8.2 Solution to Forward Problem .................................. 54
CHAPTER 3 MIXTURE DESIGN ............................................................ 58
3.1 Introduction ........................................................................... 58
3.2 Traditional Approach ............................................................ 59
3.3 MultiBlock Data Structure .................................................. 62
3.4 MultiBlock Regression Models ............................................ 63
3.5 Proof of Concept Example ? Starch Blending ...................... 66
3.5.1 Structure of starch blending data ........................... 66
3.5.2 Data analysis ........................................................... 70
3.5.3 Model development .................................................. 75
3.5.4 Design of desired products in score space .............. 77
3.5.5 Design of desired products in cluster space ........... 79
3.6 Conclusion ............................................................................. 83
CHAPTER 4 DESIGN OF BIODIESEL ADDITIVES ........................... 85
4.1 Introduction ........................................................................... 85
4.2 Structure Property Relationships ........................................ 89
4.3 Technical Difficulties with Biodiesel Use ............................ 90
4.4 Fuel Additives ....................................................................... 93
4.5 Additive Design ..................................................................... 93
4.5.1 Types of Additives ................................................... 94
4.5.2 Additive Property Estimation ................................. 95
4.5.3 Characterization of the Additive Molecules ........... 97
4.5.4 Additive IR Data Analysis ...................................... 98
4.5.5 Latent Variable Model Development .................... 100
4.5.6 Translating Physical properties to Latent
Properties ............................................................... 102
4.5.7 Evaluation of Desired Additive Feasibility
Region .................................................................... 104
4.5.8 Enumeration of Desired Additive Molecules ........ 106
4.6 Conclusion ........................................................................... 112
vii
CHAPTER 5 REVERSE DESIGN OF IONIC LIQUIDS .................... 113
5.1 Introduction ......................................................................... 113
5.2 Density Functional Theory ................................................. 115
5.3 Data Analysis and Model Development ............................. 120
5.4 Reverse Design of Ionic Liquids using QSPR and cGCM .. 123
5.5 Conclusion ........................................................................... 130
CHAPTER 6 FUTURE WORK ............................................................... 132
6.1 Methodology Improvements ............................................... 132
6.1.1 MultiDimensional Characterization .................... 132
6.1.2 MultiWay Modeling .............................................. 134
6.1.3 Stochastic Search and Optimization..................... 136
6.1.4 Managing and Handling Uncertainty .................. 137
6.2 Design of Inherently Benign Chemical Process Routes .... 138
REFERENCES. ............................................................................................. 141
APPENDIX A MIXTURE MODELS ......................................................... 150
A.1 Scheffe Mixture Model ................................................ 150
A.1 Cox Mixture Model ...................................................... 151
APPENDIX B SPECTRAL INTERPRETATION ................................... 153
B.1 Infrared Spectroscopy.................................................. 154
B.2 Molecular Vibrational Spectroscopy ........................... 155
B.3 Near Infrared Spectroscopy ........................................ 160
B.4 Characterizing IR Spectroscopy .................................. 161
viii
LIST OF TABLES
Table Page
Table 3.1: Starch material property data matrix. ....................................... 67
Table 3.2: Blending ratio and process condition data matrix. .................... 68
Table 3.3: Mixture product quality data matrix. ......................................... 69
Table 3.4: PCA score values for X and Yblocks. ........................................ 74
Table 3.5: PCA loading values for X and Yblocks. .................................... 75
Table 3.6: The regression coefficients for the expressions in Eqs. (3.13)
through Eqs. (3.15). ...................................................................... 76
Table 3.7: Score values for desired products in Figure 3.11........................ 78
Table 3.8: Desired product properties for desired products in Figure
3.11. .............................................................................................. 78
Table 3.9: Required mixture conditions to achieve target product
properties. ..................................................................................... 79
Table 3.10: Candidate ternary mixtures and fractional contributions of
the constituents. ........................................................................... 82
Table 4.1: Chemical structures of common fatty acids. ............................... 87
Table 4.2: Fatty acid profiles of some common biodiesel feedstock ............ 88
Table 4.3: Fuel properties of biodiesel fuels and diesel ............................... 89
Table 4.4: Specifications for biodiesel and diesel ......................................... 92
Table 4.5: Commercially available diesel additives and their estimated
properties. ..................................................................................... 97
Table 4.6: Model coefficients using PCR. ................................................... 101
Table 4.7: Biodiesel target properties. ....................................................... 102
Table 4.8: Physical and latent properties describing feasibility region. ... 103
Table 4.9: Biodiesel target latent properties.............................................. 103
Table 4.10: Crude biodiesel properties. ........................................................ 103
ix
Table 4.11: Biodiesel additive latent property feasibility region. ............... 105
Table 4.12: Molecular groups and their latent property contributions. ..... 107
Table 4.13: Results from characterizationbased molecular design. .......... 109
Table 4.14: Additive solubility in FAME at 25
o
C. ....................................... 110
Table 5.1: B3LYP/6311+G(2d,p) vibrational assignments (cm
1
) of
[emIm]PF
6
. ................................................................................. 118
Table 5.2: PCA score values for X and Yblocks. ...................................... 121
Table 5.3: Model coefficients using PCR. ................................................... 122
Table 5.4: Target ionic liquid properties. ................................................... 124
Table 5.5: Anion groups and their latent property contributions. ............ 125
Table 5.6: Cation groups and their latent property contributions. ........... 126
Table 5.7: Alkyl chain attached to cation groups and their latent
property contributions. .............................................................. 127
Table 5.8: Thirteen candidate ionic liquid molecules that match target
properties in property space. ..................................................... 130
Table B.1: IR spectrum absorption for different bond types. ..................... 160
Table B.2: IR absorbance frequencies and magnitudes of functional
groups. ........................................................................................ 161
Table B.3: Abbreviations, names and structures of investigated ionic
liquid training set. ..................................................................... 173
Table B.4: Candidate ionic liquid solutions enumerated from exhaustive
search in latent property space. ................................................ 179
x
LIST OF FIGURES
Figure Page
Figure 1.1: Stages involved in chemical product design. ................................ 2
Figure 1.2: Chemically formulated products. .................................................. 3
Figure 1.3: Hierarchy of chemical and biological informatics. ........................ 9
Figure 2.1: Conventional solution approach for process and molecular
design problems............................................................................ 16
Figure 2.2: Product and process design problem using reverse problem
formulation methodology. ............................................................ 18
Figure 2.3: A multiscale product design framework showing (I)
traditional approach and (II) RPF approach linking each of
the scales via a common property domain. ................................. 19
Figure 2.4: Representation of intra and interstream conservation of
clusters in ternary diagram. ........................................................ 22
Figure 2.5: Representation of feasibility reason with sourcesink
mapping using clusters in ternary diagram. The clustering
points are converted from ternary to Cartesian coordinate. ...... 24
Figure 2.6: Classification of property estimation methods. .......................... 25
Figure 2.7: QSAR model development steps. ................................................. 26
Figure 2.8: An overview of the interconnectivity of characterization
techniques, molecular architecture, and physical properties
and attributes of chemical and material products...................... 30
Figure 2.9: IR spectra of butylated hydroxytoluene molecule....................... 31
Figure 2.10: Unit variance scaled and meancentered variables. ................... 38
Figure 2.11: Projection of higher dimensional data onto a lower
dimensional subspace. ................................................................. 40
Figure 2.12: PCA decomposition of X matrix. ................................................. 42
Figure 2.13: Scree plot of the correlation matrix. ............................................ 43
xi
Figure 2.14: Principal component analysis: Score plot (a, left) of t
1
/t
2
and
loading plot (b, right) of p
1
/p
2
. The ellipse represents the
Hotelling T
2
with 95% confidence in score plot. .......................... 45
Figure 2.15: PLS regression on descriptive (X) and response (Y) variables. ... 47
Figure 2.16: Forward and reverse problems in computeraided molecular
design. ........................................................................................... 52
Figure 2.17: An overview on the methodology of multivariate
characterization, modeling, and design. ..................................... 57
Figure 3.1: Traditional approaches in mixture design. ................................. 61
Figure 3.2: Data structure for three manipulative variable matrixes and
a quality/response variable matrix. ............................................. 63
Figure 3.3: Data structure for combined manipulative variable matrixes
and a quality/response variable matrix. ..................................... 64
Figure 3.4: Data structure of X and R matrix for two classes of raw
materials and their blend ratios. ................................................. 67
Figure 3.5: Distribution plot, outlier box plot and normal quantile plot. ..... 70
Figure 3.6: Scree plot and pareto plot for PCA on Xvariables. .................... 71
Figure 3.7: Combined PCA score and loading plots (Biplot) on first and
second components for Xblock. ................................................... 72
Figure 3.8: Combined PCA score and loading plots (Biplot) on first and
second components for Yblock. ................................................... 72
Figure 3.9: PLS model coefficients for blend property matrix Y using four
latent factors. ............................................................................... 73
Figure 3.10: Predicted vs. actual product properties using PCR model. ........ 77
Figure 3.11: Visualization of target product properties in score space. ......... 78
Figure 3.12: Visualization of starch blending formulation in cluster space. .. 80
Figure 3.13: Visualization of starch blending formulation in cluster spac .... 81
Figure 4.1: Overall stoichiometric transesterification reaction scheme. ...... 86
Figure 4.2: Stearic acid methyl ester. ............................................................ 87
Figure 4.3: A typical triglyceride molecule with different fatty acid
chains of soybean oil. ................................................................... 88
Figure 4.4: Oleic acid methyl ester. ................................................................ 90
xii
Figure 4.5: Compositions of fats and oils and their effects on the fuel
properties. ..................................................................................... 91
Figure 4.6: Approaches to improving biodiesel fuel properties. .................... 92
Figure 4.7. Target specifications for biodiesel and its blend in terms of
cetane number, melting point, and kinematic viscosity. ............ 94
Figure 4.8: Infrared spectra of diesel additive molecules. ............................. 98
Figure 4.9: Scree plot for PCA on additive IR data. ...................................... 99
Figure 4.10: PCA score plots on first third PCs for additive IR data. ............ 99
Figure 4.11: Predicted vs. actual product properties using PCR model. ...... 101
Figure 4.12: Target feasibility region and crude biodiesel in cluster space. 104
Figure 4.13: Desired additive design feasibility region in cluster space. ..... 105
Figure 4.14: Cluster diagram for biodiesel blending problem. ...................... 108
Figure 4.15: Spatial representation of candidate additive molecules
according to B3LYP/6311++G(3df,3dp) calculations: (a)
isopropenyl acrylate (CM1), and (b) ethyl acetate (CM3)......... 111
Figure 5.1: Selection of anions, cations, and side chains attached in
cations for a task specific ionic liquid application. ................... 114
Figure 5.2: Geometry optimized molecular structure of [emIm]PF
6
at the
B3LYP/6311G(2d,p) computational level. ................................ 116
Figure 5.3: (a) Calculated vs. experimental IR frequencies and (b)
infrared spectrum at 2cm
1
resolution of [emIm]PF
6
at the
B3LYP/6311G(2d,p) computational level................................ 117
Figure 5.4: Predicted vs. actual IL properties using PCR model. ............... 122
Figure 5.5: Scenario of reverse design of ILs. .............................................. 128
Figure 5.6: Enumerated and validated candidate IL molecules. ................ 129
Figure 6.1: Improvements in the process of multivariate
characterization, modeling, and design. ................................... 133
Figure 6.2: The decomposition of Xblock by (a) PCA and (b) PARAFAC. . 135
Figure 6.3: Structure of an artificial neural network. ................................. 136
Figure 6.4: Aspects of early process design. ................................................. 138
Figure B.6.5: IR regions of the electromagnetic spectrum. .......................... 154
Figure B.6.6: Stretching and bending vibrational modes for H
2
O. .............. 155
xiii
Figure B.6.7: Stretching and bending vibrational modes for a CH
2
goup. .. 157
Figure B.6.8: Ball and spring model for atoms and bonds respectively. ..... 158
Figure B.6.9: Vibrational bands in infrared spectrum. ................................ 160
xiv
LIST OF ABBREVATIONS
ADMET Absorption, Distribution, Metabolism, elimination, and
Toxicity
ANN Artificial Neural Network
AUP Augmented Property Index
ASTM America Society for Testing and Materials
BIC Bayesian Information Criterion
BHT Butylated Hydroxytoluene
CAMD Computer Aided Molecular Design
cCAMD Characterizationbased ComputerAided Molecular Design
cGCM Characterizationbased Group Contribution Method
CI Connectivity Indices
CN Cetane Number
CFPP Cold Filter Plugging Point
CP Cloud Point
DFT Density Functional Theory
EB Elongation at Break
EEM EmissionExcitation Matrix
EG Ethylene Glycol
EGMEA Ethylene Glycol Methyl Ether Acrylate
DOE Design of Experiments
DG Diglyceride
DTBP DiTertButyl Peroxide
EHS Environment, Health, and Safety
FAME Fatty Acid Methyl Ester
xv
FBN Free Bond Number
GA Genetic Algorithm
GC Group Contribution
GCM Group Contribution Methods
GL Glycerol
IBI Inherent Benignness Indicator
ILs Ionic Liquids
IPA Iso Propyl Alcohol
IR InfraRed
LCA Life Cycle Assessment
LP Linear Programs
LVs Latent Variables
MSA Mass Separating Agent
MC Monte Carlo
MD Molecular Dynamics
MG Monoglyceride
MINLP Mixed Integer NonLinear Programs
MLR MultiLinear Regression
MM Molecular Mechanics
MO Methyl Oleate
MTBE Methyl TertButyl Ether
NLP Non Linear Programs
NMR Nuclear Magnetic Resonance
NIR Near InfraRed
NIST National Institute of Standard and Technology
NPCT Nonlinear Principal Component Regression
OA Oleic Acid
OLS Ordinary Least Square
PARAFAC PARAllel FACtor analysis
xvi
PCs Principal Components
PCA Principal Component Analysis
PCR Principal Component Regression
PLS Partial Least Square
PSE Process Systems Engineering
PMMA Poly Methyl MethAcrylate
PP Pour Point
PRESS Predicted Residual Sum of Squares
PSE Process Systems Engineering
PY Pyrogallol
QM Quantum Mechanics
QSAR Quantitative Structure Attribute Relationships
QSPR Quantitative Structure Property Relationships
RMSECV Root Mean Square Error of CrossValidation
RMSEP Root Mean Square Error of Prediction
SA Stearic Acid
SVM Support Vector Machines
TBHQ TButyl Hydro Quinone
TG Triglyceride
TI Topological Indices
TM Tensile Modulus
TS Tensile Strength
UFF Unified Force Field
XRD XRay Diffraction
1
CHAPTER 1
INTRODUCTION
1.1 Chemical Product Design
Chemical product design is a response to major changes in the chemical
industry which have occurred in recent decades. Traditionally, the chemical
industry has focused on process design involving the manufacture of bulk
commodity chemicals with an objective primarily being efficient production to
reduce costs in order to be competitive. Such chemicals are produced in
dedicated equipment and at very large scale such as ethylene, ammonia, and
sulfuric acid. The market success of such products depends on the cost of
making them. However, in recent decades, specialty, higher valueadded,
smaller volume chemical products such as pharmaceuticals, electronics, and
pigments have become increasingly important. The success for such products
depends on their discovery and their time to market. In the production of
specialty chemicals, the key is improvement in performance rather than
minimization of cost.
Unlike conventional process systems engineering (PSE) approaches that
were focused on the synthesis, design, optimization and control of chemical
processes which are based on a priori knowledge on the products, in chemical
product design, the identity of the final product is not known. Instead, the
basic idea of its behavior is known and the problem is to find the most
appropriate chemical(s) that will exhibit and/or cause the desired behavior
[1]. Several common desirable behaviors are: biodegradability, lower toxicity,
environmental benignity, and less hazard. Since customer needs and
customer attributes are the most important sources of product requirements,
2
product design should address these needs by translating them into the final
commercial products. Rapidly developing new products required by customers
with minimum cost is an increasingly important problem. According to
Cussler and Moggridge [2], chemical product design is a procedure consisting
of four stages:
1. Needs: Identification of customer needs and the translation of these
needs into product specifications.
2. Ideas: Generate ideas that satisfy this need.
3. Selection: Screen and select the best idea for manufacture.
4. Manufacture: Decide what the product should look like and how it
should be manufactured.
The 2
nd
and 3
rd
stages together represent Molecular Design and
Mixture/Blend Design problems. The 1
st
and the 4
th
stages may be considered
problem formulation and process design stages respectively. Figure 1.1 is a
schematic representation of the stages involved in chemical product design
and problems that are addressed in each stage [1].
Figure 1.1: Stages involved in chemical product design.
ProcessProduct Design
PreDesign
?define needs
and goals?
Process Design
?manufacture and
test products?
Molecular Design Mixture Design
Product Design
?generate and select alternatives?
3
The molecular and mixture/blend design problems can be solved
independent of the process design problem or as an integrated product
process design problem. In product design, we describe which product to
make; in process design, we explore how we will make it. Chemical product
design is the larger topic, and includes the process design in the final step.
Product design is now considered as an emerging paradigm in the field
of chemical engineering [1, 2, 3] because it requires a different set of tools
and skill sets from other problems traditionally encountered in the field.
Design of environmentally benign solvents and alternative media for
extraction and purification are new challenges within product design.
Examples of chemically formulated products include performance chemicals,
paints, cosmetics, pharmaceuticals, proteins, semiconductors, foods, fuel
mixtures, and many more (Figure 1.2).
Figure 1.2: Chemically formulated products.
The search for potentially new molecules that possess one or more
desired properties using traditional experimental approaches costs time and
resources. This awareness has generated momentum to find alternative ways
of numerical characterization of chemicals. Ertl [4] estimated that the
4
chemical space could be 10
20
 10
24
compounds whereas the current size of the
CAS Registry is 2 x 10
7
compounds. There are abundantly more chemical
compounds waiting to be prepared and characterized. Here, the application of
computational tools can facilitate the investigation of chemical product
formulations prior to experimentation and simulation of their manufacturing
processes, giving the flexibility to handle changing design constraints early in
the development process. In silico research and practice can alleviate
problems arising from diverse customer demands and shorter life time of
specialty chemical products. When appropriate property models are available
to describe and predict the target product properties, computer aided
methodologies can be utilized to solve the design problems, and are
considered computer aided molecular/mixture design (CAMD) problems [1].
Chemical product design that utilizes systematic approaches to integrate the
developed chemistry with manufacturing constraints is considered a new
chemical engineering paradigm [3, 5].
1.2 Challenges and motivations
Unlike the design of commodity chemicals, which have a known molecular
architecture and limited raw material options from which to build an
optimum process configuration, the design of specialty chemical products
does not have defined molecular architectures or raw material sources.
Developing techniques to discover and optimize the molecular architecture
that delivers desired attributes is the focus of research in the process systems
engineering (PSE) community. In addition, it is important to develop a
systematic methodology to produce chemicals that possess both the consumer
specified attributes and environmentally acceptable characteristics.
CAMD facilitates the application of computer algorithms to solve the
mathematical formulations of chemical product design. For computational
searching, the desired property must be calculated from a model describing
5
the structureproperty relationship. Based on the type of the property model,
the resulting CAMD problem can be a linear program (LP) or a nonlinear
program (NLP). Various CAMDbased techniques have been proposed and
can be classified as (a) mathematical optimization techniques [6, 7, 8], (b)
stochastic optimization techniques [9], and (c) exhaustive enumeration
(generate and test) [1, 10].
Traditionally, molecular and/or mixture/blend design (chemical product
design) procedures suffer from two major challenges: the ability to solve large
scale optimization problems, and the ability to predict the physical and
chemical properties of a given molecule [11]. Computational methods aimed
at finding new molecular structures that possess desired product properties
follow the experimental approach of generateandtest. These approaches
involve the minimization or maximization of an objective function with many
linear and nonlinear constraint equations. They are commonly referred to as
a MixedInteger NonLinear Programs (MINLP) [1]. The generic
mathematical programming formulation of integrated processproduct
synthesis/design problems can be presented as [1]:
( ){ }max Objective function
T
obj
F Cy f x= +
(1.1)
s.t.
( )
1
0 Process design specificationshx=
(1.2)
( )
2
0 Process model equationshx=
(1.3)
( )
3
, 0 CAMD specificationsh xy=
(1.4)
( )
11 1
Process design constraintsl gx u??
(1.5)
( )
22 2
, CAMD constraintsl g xy u??
(1.6)
6
33
Logical constraintsl By Cx u?+?
(1.7)
In the above equations, x represents continuous variables (such as
temperature, flowrate and mixture compositions, etc.) and y represents
binary integer variables (such as the existence or absence of certain process
units, raw materials, or molecular groups). Many variations of the above
mathematical formulation exist for solving different chemical product design
problems. An indepth review of the problem formulations has been
presented by Gani [1]. The presence of nonlinear constraints means that the
solution to the MINLP may only be considered a local optimum and not
necessarily a global optimum and as such may not be able to generate a
complete set of candidate molecules. This leads to suboptimal design and
inefficient iterative solution approach.
Although mathematical programming and hybrid methods provide the
framework to transform solution techniques into computeraided methods
and tools to determine the optimal design, they are computationally intensive
if the problem is not welldefined or is highly nonlinear. The major non
linearity of the problem is introduced by the property descriptors which
either need to be supplied (measured or retrieved from database) and/or
predicted through appropriate models. In addition, when considering
interfacing product and process design, most algorithms face a bottleneck
when it comes to using property models suitable for both product design and
process design [1]. If multiple models with restricted application ranges are
included for the same property, the algorithm may suffer from discontinuities
in the solution trajectory, which may make it more complicated to achieve
solution convergence [1]. Overall, purely mathematical optimizationbased
approaches are computationally intensive and require very detailed models
for all considered unit operations, thus it is desirable to reduce the search
space prior to invoking the optimization solver.
7
Almost all CAMD methods have used group contributionbased property
prediction methods to evaluate the generated compound with respect to a
specified set of desired properties [12, 13, 14]. Group contribution methods
(GCM) are simple, have acceptable accuracy for many properties and are
predictive in nature. However, GCM do not exist for all necessary properties,
and reliability of predictions is often questionable for large, complex
molecules. There are many property parameters or consumer attributes for
which group contribution (GC) data is not available, for example, attributes
such as paper softness are difficult to define in terms of properties and
molecular architecture.
In order to characterize the properties of observations, one measures
variables. In chemical and engineering practice, it is often assumed that our
systems are driven by inherent, latent variables. Changes in intrinsic
molecular properties cannot easily be measured directly. However,
macroscopic properties, which are manifestations of the intrinsic properties,
can be predicted through the use of relevant molecular descriptors. Such
descriptors can be related to chemical (reactivity, pH, enthalpy of formation,
lipophilicity, etc.), physical (boiling point, viscosity, density, etc.), and
structural (electron distribution, van der Waals interactions, hydrogen bonds,
etc.) properties of molecules. Discrete change of a substituent in one part of a
molecule or polymer does not just affect one isolated property, but several of
the above properties. Therefore, the chemical behavior of a molecule is
governed by these intrinsic, latent molecular properties [15]. Thus, it is
important to capture the intrinsic properties using relevant molecular
descriptor variables.
Chemical structures can be numerically encoded by molecular
descriptors. Thousands of molecular descriptors have been reported in the
literature, ranging from simple topological to more complicated topographical
properties of molecules. They cover features from constitution to geometry
and electronic properties [16, 17]. Relating chemical structures to physical
8
and chemical properties is not a new endeavor. However, to deal with the
enormous amount of data generated, it becomes crucial to develop and utilize
features selection algorithms, along with data visualization, interpretation,
and mining techniques to progress in computeraided molecular design.
Molecular descriptor variables can be of spectral origin (e.g. IR, NIR,
NMR, UV, Xray, etc.); chromatographic origin (e.g. HPLC, GC, TLC, etc.) or
they may be measurements from sensors in a process (e.g. temperatures,
flows, pressure, etc.). When variables are highly correlated or are highly
redundant, they are said to be collinear.
As the chemical engineering discipline moves from data poor to data
rich, the research in process systems engineering (PSE) shifts the focus from
simulation and complex experimentation toward datadriven techniques to
acquire the needed translation of product attributes into common physical
chemical properties [18]. With datadriven means gaining popularity as being
powerful and inexpensive, computer usage is being increasingly exploited for
scientific investigation. Also, the application of chemometrics and
chemoinformatics is gaining a great deal of recognition and application in
order to address problems in chemistry, chemical engineering, biochemistry,
biology, and medicine [16]. According to Wold [19], chemometrics is the
science of extracting chemically relevant information by datadriven means
and deriving the respective multivariate statistical models and descriptors.
Chemometrics is therefore a process. Chemoinformatics is a subfield of
chemometrics which involves the transformation of data into information and
information into knowledge to facilitate decision making [18, 16].
Combination of chemometrics and chemoinformatics can lead to target
focused twoand threedimensional structures from molecular composition.
The algorithm and informatics methods can be applicable and transferable to
a wide range of chemical and biological systems, regardless of whether the
starting point is DNA or a chemical element distribution as illustrated in
Figure 1.3 [19].
9
Figure 1.3: Hierarchy of chemical and biological informatics.
Chemometrics can be applied to solve both descriptive and predictive
problems involved with such data [19]. In descriptive applications, properties
of chemical systems are modeled with the intent of learning the underlying
relationships and structure of the system. In predictive applications,
properties of chemical systems are modeled with intent of predicting new
property values or behaviors of interest. Chemoinformatics can be utilized to
analyze a large volume of data generated from molecular modeling, chemical
information, and computational chemistry techniques [16].
Moreover, when a good theoretical process model is not available or
before new experiments are conducted, historical process data is often
available in industry that encompasses a wide spectrum of operating
conditions and product grades. Such multiblock data involving the properties
of the pure materials, their blend ratios, and the process conditions, could
provide an opportunity to enhance the performance of the final product in
blending operations for coatings, food, cosmetics, pharmaceuticals and fuels
[20]. The datadriven modeling approach is apt for processes and phenomena
where causeandeffect cannot be simply approached from firstprinciples.
For example, a reallife industrial process system often involves complex,
nonlinear, incomplete, and uncertain data.
Chemistry
Molecular composition
Connectivity (graph)
Molecular similarity
Chemotype
Structure
Interaction
Specific activity
Drug
Biology
DNA sequence
Protein sequence
Sequence similarity
Family
Structure
Interaction
Function
intervention
I
n
f
o
r
m
a
t
i
c
s
10
In chemical product (molecular or mixture) design, the identity of the
final product is unknown, however, the general behavior or characteristics of
the product (goal) is known. Since the properties of the component or mixture
of components dictate whether or not the design is useful, the basis for
solution approaches in this area should be based on the properties
themselves. The recognition of propertybased design came about as a direct
result of the following observations:
? Many processes are driven by properties not components.
? Performance objectives are often described in terms of measurable
physicalchemical properties.
? Often objectives cannot be described by composition alone.
? Molecular/mixture design is based on properties.
? Insights are often hidden by not integrating properties directly.
The discussion presented here provided the motivation that guided the
research, and as a result the methods and tools developed must accomplish
the following:
? Integrate process and product design problems via a methodology within
the property integration paradigm.
? The approach needs to be systematic and capable of setting up the
design performance requirements or ?targets? a priori, i.e. a targeting
approach.
? Utilize a datadriven approach combined with multivariate statistical
techniques to solve both descriptive and predictive problems.
? Apply multivariate characterization, modeling, and design.
? Incorporate the concepts of reverse problem formulation and property
clusters to aid in the decomposition of the design problem.
? The technique should take advantage of the benefits of visualization
tools in the formulation of the problem and as part of its solution
algorithms.
11
1.3 Scope and Objectives
General background information on traditional frameworks for solving
computeraided molecular/mixture design (CAMD) problems is presented in
this work. Techniques such as group contribution, reverse problem
formulation, property clustering, and multivariate statistics are presented
within CAMD to formulate and solve chemical product design problems.
These methods form the basis for the propertybased design techniques
presented in this dissertation.
Primarily, the effort of this dissertation is on the formulation of chemical
products through the application of multivariate characterization, modeling,
and design. Propertybased molecular and mixture design algorithm within
the general property clustering framework utilizing chemometric techniques
and group contribution methods in a reverse problem formulation are
presented. Databased modeling and optimization strategies will be
presented in this dissertation covering two specific areas of chemical product
design:
1. Mixture/Blend design: For mixture products, a systematic and
insightful use of past data describing the properties of the raw
materials, their blend ratios, and the process conditions during the
production of a range of product grades is investigated to explore their
effects on the final product properties and to achieve new and improved
products.
2. Molecular design: For molecular products, chemically feasible,
structured molecules are exhaustively enumerated from a set of
appropriate descriptors (represented by fragments or building blocks) to
identify compounds exhibiting certain desirable or specified behavior.
The present work is limited to systems that can be characterized by
three properties, however, algebraic and optimizationbased approaches are
available to extend the application range to include more properties [21].
12
1.4 Significance of the Research
A significant result of the developed methodology is that for problems that
can be adequately described by just three properties, the process and
molecular design problems are solved visually and simultaneously on a
ternary diagram, irrespective of how many substructural molecular
fragments are included in the search space. The primary benefits gained
through the developed methodology are as follows:
? utilization of interdisciplinary methods and tools that extend through
multivariate statistics, applied mathematics, computational expertise,
and experimentation across all scales,
? formulation and solution of chemical design problem on property basis,
? bridging domains via property models facilitate nested and nonlinear
routines to be reduced to a single subproblem in cluster space.
? identification of design targets without performing detailed calculations,
? visualization and solution of problem in lower dimensional space, which
provide valuable information by elucidating hidden relationships in data
and multivariate understanding of complex processes and phenomena,
? multivariate projectionbased regression retains better memory of the
structure of the training set data to predict molecules with similar
properties and structures,
? reverse design approach using the descriptors of group contribution type
combined with chemoinformatics enables exhaustive search of the
structures corresponding the target physical properties, and
? explicitly tracking and integrating properties in a systematic manner
relieve, the iterative nature, multiple component combinatorial
explosion, and difficulty of formulating and solving the mixed integer
nonlinear programs (MINLP) of conventional design techniques.
The aim is to allow us to see the big picture first, and the details later.
13
1.5 Organization
Chapter 2 introduces the general theoretical background on the methods and
tools that the research in this dissertation is based on. Section 2.2 introduces
the propertybased approach that facilitates the flow of information from the
process level to the molecular level, and vice versa. Section 2.3 highlights
property estimation methods based on combinatorial techniques such as the
group contribution method. Different characterization (e.g. IR spectroscopy)
based group contribution methods are presented in Section 2.4. The
characterizationbased groups serve as the descriptive application of
chemometric techniques that describe the molecular architecture of a
chemical product. Section 2.5 discusses the limitation of using general
regression models that is often encountered when the number of descriptor
variables is larger than the number of samples and when the variables
exhibit linear relationships with each other. The predictive application of
chemometric techniques is presented using statistical multivariate
techniques like PCA and PLS to model the properties of chemical systems in
Section 2.6.
Chapter 3 presents the developed methodology for solution of mixture
design problems utilizing the tools presented in Chapter 2. The multiblock
data structures (L and Tshaped) available in blending operations are
presented in Section 3.3. Combining multiple blocks using appropriate
mixing rules and matrix algebra to help simplify the analysis and design by
not having to differentiate between mixture and process variables and not
having to assume independence of the factors when multivariate analysis
techniques are used is discussed in Section 3.4. Section 3.5 presents a case
study to illustrate the method and concept using the development of thermo
plastic as a case study. The example incorporates all three degrees of freedom
available in blending operation.
Chapter 4 presents the developed molecular design framework based on
a characterizationbased group contribution method. The advantages of
14
biodiesel, a short overview on its production, chemical profiles of different
feedstocks and their effects on product quality are present in Section 4.1.
Technical difficulties with biodiesel that have impaired its use and
commercialization are presented in Section 4.3. Additive design is presented
as an alternative approach to solve the technical difficulties with biodiesel in
Section 4.4. Section 4.5 discusses the methodology for additive design that
combines the framework and the tools presented in Chapter 2. In this
Section, the types of training set additive molecules, their property
estimation methods, IRbased characterization of their molecular
architecture, multivariate statistical analysis of IR data for descriptive
application, latent variable property model development for predictive
application, evaluation of target properties for additive design, and design of
additive molecules that meet the target property specifications are discussed.
Chapter 5 introduces the use of density functional theory (DFT) based
simulation techniques to generate the IR spectra as molecular descriptors
with which to develop predictive property models. This step relieves the
dependency on measured or database values for the required IR data used in
the methodology presented in Chapter 4. The interdisciplinary and novel
framework proposed will be demonstrated using a case study focused on the
reverse design of ionic liquids with tailored properties. The methods in this
Chapter will integrate the framework and tools presented in Chapter 2 and
Chapter 4 with an additional ability to generate spectroscopic data from
quantum chemical calculations.
Chapter 6 introduces future works that can be extended from the
methodologies and tools developed in this dissertation. Section 6.1 introduces
potential improvement or extension areas in the characterization, modeling,
and design methodologies presented in this dissertation. Section 6.2
incorporates the design of inherently benign chemical process routes using
process route descriptors related to environment, health, and safety factors.
15
CHAPTER 2
THEORY AND METHODOLOGY
In this chapter, theoretical background on current methods used in chemical
product design along with tools that will be employed in this dissertation
research will be defined and discussed.
2.1 ProcessProduct Design
The molecular and mixture/blend design problems can be solved independent
of the process design problem or as an integrated productprocess design
problem as shown in Figure 2.1. In general, the objective in the design or
optimization of process is to find a balance between satisfying process unit
requirements/constraints and the use of appropriate raw materials in order
to maximize profit or to minimize cost. Traditionally, process design and
molecular/mixture design have been treated as two separate problems, with
little or no feedback between the two approaches as represented by Figure 2.1
[22]. The raw materials could be searched from a material database or can be
obtained from molecular design. Since process and molecular/product design
are decoupled, molecular design can be performed based on qualitative
process knowledge and/or experience. This can lead to suboptimal designs
and an inefficient iterative solution approach.
However, the decoupling of the process and the molecular design
problems can be addressed by systematically solving a series of reverse
problems [22, 24]. This is accomplished by a propertybased approach which
16
facilitates the flow of information from the process level to the molecular
level, and vice versa.
Figure 2.1: Conventional solution approach for process and molecular
design problems.
2.2 PropertyBased Process and Product Design
In molecular or mixture design, the identity of the final product is unknown,
however, the general behavior or characteristics of the product (goal) are
known. The objective is to find the most appropriate chemical, or mixture of
chemicals, that will satisfy these goals. Since the properties of the component
or mixture of components dictate whether or not the design is useful, the
basis for solution approaches in this area should be based on the properties
themselves. Similarly, in process design, since properties (or functionalities)
form the basis of performance of many process units (e.g., vapor pressure in
condensers; specific gravity in decantation; relative volatility in distillation;
Henry?s coefficient in absorption; density and head in pumps; density,
pressure ratio, and heat capacity ratio in compressor; etc.), it would be very
insightful to develop procedures based on key properties instead of key
components [22].
Discrete Decisions
(e.g. structural modifications)
Continuous Decisions
(e.g. Operating conditions)
Process Design
Discrete Decisions
(e.g. type of compound,
number of functional groups)
Continuous Decisions
(e.g. Operating conditions)
Molecular Design
Given set of components
to be screened
(e.g. raw materials, MSA?s)
Given set of molecular
groups to be screened
(building blocks)
Optimize process objectives
to meet desired performance
(e.g. recovery, yield, cost)
Optimize molecular
structures to meet given set
of property values
(e.g. physical, chemical)
17
However, unlike componentbased chemocentric approaches where
chemical components are conserved and mixing of components is linear; in a
propertybased approach, properties are not conserved entities and mixing of
properties is not necessarily linear. To overcome these limitations, Shelley
and ElHalwagi et al. [23] introduced the property clustering framework that
uses conserved quantities called clusters. It was later applied to process and
product design by Eden et al. [22]. To bridge the gap between the process
design and the molecular design problems, Eljack et al. [24] extended the
property integration framework by combining the property clustering
technique and Group Contribution Methods (GCM). Qin et al. [21] introduced
an algebraic approach using property clusters to relieve the limitation in the
use of more than three properties. These contributions enabled simultaneous
consideration of process performance requirements and molecular property
constraints in the cluster domain and solution of the design problem in this
reduced domain. The following sections discuss the property integration
framework based on reverse problem formulation and property clustering
techniques.
2.2.1 Reverse Problem Formulation
The reverse problem formulation technique decomposes a processproduct
design problem into two reverse problems linked by property
targets/constraints [22, 24]. This enables a twostep approach, where the
property targets that satisfy the process performance/constraints are
identified in the first step and then the molecules that match the targets are
identified in the second step. This gives the ability to identify optimum
solutions to process and product design problems more easily than solving the
conventional forward problems, which are iterative in nature [22, 24, 25, 27].
Figure 2.2 is a schematic representation of the reverse problem
formulation concept. The first reverse problem is the reverse of a simulation
problem, where the process model is solved in terms of the constitutive/design
18
variables instead of the process variables, thus providing the design targets.
The second reverse problem solves the constitutive equations (property
models) to identity candidate components by matching the design targets.
Figure 2.2: Product and process design problem using reverse problem
formulation methodology.
Prediction of macroscopic properties from molecular information is
nontrivial and is by no means extensive. In most methods, the computed
information at smaller length scales is passed to models at a larger length
scale. This is achieved by coarsegraining (removing degrees of freedom) and
information passing between low and highscales [26]. Pathway I in Figure 1
represents the dependency of overall higher scale performance on the lower
scale phenomena. This path (approach) necessitates coupling of
mathematically different models and phenomena across two or more scales.
Discrete Decisions
(e.g. structural modifications)
Continuous Decisions
(e.g. Operating conditions)
Process Design
Desired process
performance
Given set of molecular
groups to be screened
(building blocks)
Integrated design
capable of achieving
desired performance
Reverse
Property
Prediction
Constraints on property values
obtained by targeting optimum
process performance
Designed Molecules
(e.g. raw materials, MSA?s)
Discrete Decisions
(e.g. type of compound,
number of functional groups)
Continuous Decisions
(e.g. Operating conditions)
Molecular Design
Design Property Targets
Reverse
Process
Simulation
19
This results in a greater coupling in design of products at multiple scales,
thereby increasing problem complexity, and combinatorial explosion in the
number of models and parameters [12].
Figure 2.3: A multiscale product design framework showing (I)
traditional approach and (II) RPF approach linking each of the scales via a
common property domain.
The use of reverse problem formulation (RPF) helps circumvent the
challenge posed by coupling of scales by bridging them through a property
domain (pathway II in Figure 2.3) [27]. RPF uses the duality of linear
programming to reformulate the design problem as a series of reverse
problems solved in the property domain. This way, an immense
computational cost associated with the hierarchical nesting across multiple
scales is relieved leading to a much more efficient solution achieved through
reduction in the need for enumeration [27].
QM, DFT Models
Ab Initio Methods
(Quantum Scale)
Fundamental Principles of
Chemical Engineering
Process Design
(Macro Scale)
Chemometric Models
Consumer Attributes
(Macro/Meta Scale)
MM, MD, MC Simulations
Atomistic Methods
(Atomic Scale)
CoarseGraining and
Chemometric Models
Microstructure
(Meso Scale)
II
II II
II II
I
I I
I/II
GCM, Topological Indices
& Chemometric Models
CAMD
(Molecular Scale)
I/II
I/II
PROPERTIES
II
I
20
2.2.2 Property Clustering Technique
Property clustering is a propertybased visualization tool for mapping the
design problem from the nonconserved property domain into the conserved
(componentless) cluster domain [23]. The property clustering technique
utilizes property operators, which are functional relationships describing the
attributes and physicalchemical properties. The clusters map property
relationships into a lower dimensional domain, thus allowing for
visualization and insights into the problem. Details can be found in Eden et
al. [22]. Only highlights will be presented in this dissertation.
2.2.2.1 Property Operator Functions
The clustering approach utilizes property operators, ?, which are functions of
the original raw physical properties. In Eq. (2.1), the property is described by
a general linear mixing rule:
( ) ( )
1
??
=
= ?
?
n
jj i jj
Mix i
i
y xy
(2.1)
where, ?
?
= the fractional contribution of component i
?
?
= the j
th
property
Although the property operator equation must have linear mixing rules,
the property operator itself may be nonlinear. For example, if the property
operator describes density, then to meet the linear criteria imposed by Eq.
(2.1) we would use specific volume as the property operator, ignoring any
interaction effects from mixing, as shown in Eq. (2.2) and (2.3). The operator
expressions will invariably be different for molecular fragments and process
streams, however, as they both represent the same property, they can be
visualized on a common cluster domain in a similar fashion.
21
( )
1
jj
i
i
y?
?
=
(2.2)
1
11
n
i
i
Mix i
x
??
=
= ?
?
(2.3)
Since the properties may have various functional forms and units, the
operators are normalized into a dimensionless form by dividing by an
arbitrary reference operator and then summarized to yield an Augmented
Property Index (AUP) as:
( )
( ) 1
,
p
jj
i
ji i ji
j
jj
ref
y
AUP
y
?
?
=
?= = ?
?
(2.4)
A cluster is then defined by dividing the nondimensionalized property by the
AUP, as:
1
where 1 ,
k
ji
ji ji
i
i
CC
AUP
=
?
= =
?
(2.5)
Since the clusters are tailored to maintain the two fundamental rules for
intra and inter stream conservation, additive rules, e.g. leverarm rules, are
needed to ensure that the mixture property cluster of two streams with
different individual property clusters can be easily determined. The mixture
cluster, Eq. (2.6), and mixture AUP, Eq. (2.7), values can be calculated
through the linear mixing rules as follows:
,
1
where
n
ii
j Mix i ji i
i
Mix
x AUP
CC
AUP
??
=
?
?=
?
,
(2.6)
1
n
Mix i i
i
AUP x AUP
=
= ?
?
(2.7)
22
The relative cluster arm, ?
?
, which represents the fractional
contributions of the streams, can be calculated using Cartesian coordinates
as well as estimated visually from the relative lengths of the lever arms as
shown in Figure 2.5. Visually, intrastream conservation means that once two
clusters are known, the third one is automatically determined. In order to use
common tools like Microsoft Excel? that do not support ternary plot
representation, the ternary coordinate is converted to Cartesian coordinates.
Figure 2.4: Representation of intra and interstream conservation of
clusters in ternary diagram.
2.2.2.2 Ternary to Cartesian Coordinates
The coordinate transformations from ternary to Cartesian for an equilateral
triangle can be accomplished using the Pythagorian theorem in terms of
ternary coordinates defined as cluster values and dimensionless property
operators are as follows:
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
C
3
C
2
C
1
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Target
S
1
S
2
C
1s
C
2s
C
3s
?
1
?
2
23
12
,1 2
123
0.5
0.5
ss
CC s s s
sss
XC C
? + ??
=+=
? +? +?
(2.8)
2
,2
123
s
CC s s
sss
YC
?
= =
? +? +?
(2.9)
Furthermore, the relative arms can be calculated accordingly as:
22
,2 , ,2 ,
1
,2 ,1 ,2 ,1
( )( )
( )( )
CC CC mix CC CC mix
CC CC CC CC
X X YY
X X YY
?
? +?
=
? +?
(2.10)
22
,1 , ,1 ,
2
,2 ,1 ,2 ,1
( )( )
( )( )
CC CC mix CC CC mix
CC CC CC CC
X X YY
X X YY
?
? +?
=
? +?
(2.11)
In cluster space, points are used to represent discrete property values while
feasibility regions are used to represent a range of accepted property values
(Figure 2.5).
2.2.2.3 Feasibility Region Boundaries
For problems that can be adequately described by just three properties, the
desired process performance range or product property requirement range
can be visualized within the boundary of the true feasibility region defined by
six unique points, Eq. (2.12), characterized by values of dimensionless
operators. The feasibility region narrows down the search space and
guarantees that no feasible points will exist outside it. The feasibility region
boundary analysis helps identify and describe the expression for the
feasibility region a priori and requires no enumeration of an infinite number
of feasible points possible within its boundaries [22, 23]. In Figure 2.5, the
feasibility region is shown in dashed lines, to form a hexagon in a ternary
cluster diagram.
24
( ) ( ) ( )
( ) ( ) ( )
min min max min max max min max min
123 123 123
max max min max min max max min min
123 123 123
,, ,, ,,
,, ,, ,,
??? ??? ???
??? ??? ???
(2.12)
Figure 2.5: Representation of feasibility reason with sourcesink mapping
using clusters in ternary diagram. The clustering points are converted
from ternary to Cartesian coordinate.
Visualization of the problem allows for easy identification of optimum
strategies for the combination of molecular groups, while the unique feature
of linear mixing rules allow for the use of simple lever arm analysis to solve
the problem in reduced cluster space. This way, the design problem can be
solved by identifying the product properties corresponding to the desired
process performance. The property clustering methods for solving mixture
and molecular design problems form the basis for the methods developed in
this dissertation.
Design of target chemical products that have tailored structures and
that exhibit an array of unique functional properties requires the use of
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
C
3
C
2
C
1
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Feasibility
Region
S
1
S
2
Y
X
Y
CC
X
CC
25
physicalchemical properties and/or biological activities. Experimental
measurements of properties of interest for products such as drugs, ionic
liquids, additives, etc. are scarce and limited. With respect to the synthesis,
characterization, and applications of such chemical products, it is essential to
estimate properties through theoretical or empirical means.
2.3 Prediction of Properties
As processproduct design may involve both process and molecular design,
separate property models may be required. Property models can range from
easy to use datadependent regressed models whose application range is
small, to computationally complex models like ab initio (quantumchemical)
calculation, which can be applied to any chemical system. In the middle are
the most common methods that balance full empiricism with basic chemical
theory to adapt the rigorous equations that describe molecular architecture in
a form suitable for design. A comprehensive overview of a wide range of
property estimation methods can be found in Poling et al. [28]. Constantinou
and Gani [12] classified property estimation methods as shown in Figure 2.6.
Figure 2.6: Classification of property estimation methods.
Reference
SemiEmpirical Models
? Corresponding State
Theory
? Topology/Geometry
? Group/Atom/Bond additivity
y = a + b
.
log(x) + ?
Empirical Models
? Chemometrics
? Pattern matching
? Factor analysis
? QSAR/QSPR
y = a + b
.
x + ?
Fundamental Models
? Quantum Mechanics
? Molecular Mechanics
? Molecular Simulation
H? = E?
Approximate
Property Estimation
Models
26
Regardless of the type of property model used, it is essential that they
are easytouse, accurate, reliable, and computationally efficient throughout
the entire design domain. Group contribution (GC) and topological indices
(TI) are two such techniques that utilize quantitative structure
property/activity relationships (QSPR/QSAR) that have been developed using
empirical relationships between molecular architecture and physical
chemical properties found in large databases [11, 29, 30].
A QSPR model finds the relationship between structural features
(constitutional, topological, geometrical, etc.) and physical, mechanical, or
chemical properties of materials. In drug design, the QSAR model finds the
underlying relationship between molecular descriptors with
pharmacokinetic/pharmacodynamics, and ADMET (Absorption, Distribution,
Metabolism, Elimination and Toxicity) properties [16, 31, 32]. Since QSPR
models provide information on features affecting the compounds?
physicochemical properties, they can be used for screening and further
optimization. Thus, development of robust QSPR model can facilitate
conserving resources and accelerating the process of development of new and
enhanced products.
Figure 2.7: QSAR model development steps.
Training Set
(Chemical,
Biological data)
Generate Descriptors
(Spectroscopy,
DRAGON)
Modeling Methods
(MLR, PCR, PLS,
ANN)
Model Validation
(Internal or
External)
Property Prediction
(Screening new
compounds)
Knowledge
and
Chemical Intuition
Synthesis
and
Testing
Forward QSPR
Reverse QSPR
27
Figure 2.7 presents QSPR model development steps. Several machine
learning methods such as artificial neural network (ANN), Bayesian
statistics, support vector machines (SVM), genetic algorithm (GA), partial
leastsquares analysis (PLS), and multiple linear regression analysis (MLR)
have been used in literature to perform QSPR modeling [31].
However, in order for a developed model to be relevant to the target
product properties, the training set must contain optimum molecules
covering balanced variation (high diversity) of the features spanning the
chemical/property space believed to be important for interaction with the
physical, chemical, or biological target [30]. The diversity of the chemical
features can be achieved by using statistical experimental design such as
factorial design and Doptimal design. The coverage can be achieved by
employing quadratic or cubic models instead of linear model design [30, 33].
However, the requirement of high diversity and full coverage often leads to
redundancy. Since a set of structural descriptors is never complete, some
redundancy could be allowed at this level. The use of projection methods
(discussed later in Section 2.6) will circumvent the problem of redundancy in
the descriptor data.
2.3.1 Group Contribution Method
Most property estimation methods are based on the group contribution
method (GCM), where appropriate descriptors or predefined fragments
(group, bond, or atom, etc.) representing a molecule are identified and the
properties of the molecule are estimated by summing all the contributions
from each fragment that make up the molecule [12, 14, 22, 24, 25, 34]. The
GCM is a powerful product/molecular design tool, which allows prediction of
the physical properties of molecules from structural information alone. An
additive, three level, group contribution property estimation model, which
estimates the property of a compound as a linear combination of the
appropriate descriptor contributions, is as follows:
28
?
st nd rd
1 Order 2 Order 3 Order
Property Function
Group Contribution Terms
()
ii j j kk
ij k
f x NC M D OE=++
?? ?
????? ????? ?????
?????????????
(2.13)
where, C
i
= the contribution from firstorder group i
N
i
= the number of occurrence of firstorder group i
D
j
= the contribution from secondorder group j
M
j
= the number of occurrence of secondorder group j
E
k
= the contribution from thirdorder group k
O
k
= the number of occurrence of thirdorder group k
First order groups contain basic information and can be combined
linearly since they assume no interaction between groups. Second order
groups can be estimated from first order groups and correct for the
interactions between first order groups. Third order groups can be derived in
a similar manner and help to correct for polyfunctional compounds with
more than four carbon atoms in the main chain.
However, any application of group contribution relies on the availability
of atom type, molecular group type, or type of chemical bonding present to
describe the structure as well as tables giving the property contributions of
each group. There are many properties which cannot be estimated by GCM.
For instance, cetane number is an important performance indicator for
biodiesel, but GCM parameters are not available to describe this property.
Furthermore, not all possible atomic arrangements and structures can be
represented in GCM. Hence, there is a need for an efficient methodology for
the design of structured molecules. One such approach to structured product
design is combining decomposition techniques with multivariate methods
[35, 36, 37].
29
This approach first utilizes multivariate characterization techniques
such as infrared (IR) and near infrared (NIR) spectroscopy to describe a set of
representative samples, and then uses decomposition techniques such as
principal component analysis (PCA) and partial least squares onto latent
surfaces (PLS) to find the underlying latent variable models that describe the
molecule's properties. The factors are called ?latent? because they cannot be
observed directly, but can be characterized indirectly. The orthogonal nature
of these models allows for groupbased interpretations and property
predictions which can be utilized to design new molecules not found in the
original set of molecules. The structure and identity of candidate molecules
can then be identified by combining or "mixing" substructural molecular
fragments until the resulting properties match the targets [24].
2.4 Characterization Techniques
Characterization is a class of tools associated with the determination of not
only chemical constituents or molecular structure, but also of larger
structural characteristics describing the orientation and alignment of these
molecules by exploiting the fact that molecules absorb specific frequencies
that are characteristic of their structure. Some common characterization
techniques include nuclear magnetic resonance (NMR), xray diffraction
(XRD), and infrared spectroscopy (IR). Characterization techniques are often
applied to a training set of molecules used to explore a set of property
attributes. Solvason [27] proposed a general guide for managing the
complexity of the information through a systematic method for determining
which specific information on molecular architecture will be necessary to
build appropriate models for a specific application as shown in Figure 2.8.
30
Figure 2.8: An overview of the interconnectivity of characterization
techniques, molecular architecture, and physical properties and attributes
of chemical and material products.
Spectroscopy will be the primary characterization technique explored in
this dissertation. Spectroscopy such as infrared (IR) and nearinfrared (NIR)
provide specific information on the presence of functional groups, information
on the orbital configurations of the electrons, and details of the carbon
hydrogen structure of the chemical products. The added structural
information available from this characterization can also be used to
distinguish some orientation specific information in various isomer
geometries. More on spectroscopy can be found elsewhere [38]. Appendix B
gives an overview on IR spectroscopy.
IR / NIR / Raman
Spectroscopy
UVVis
Spectroscopy
NMR
Spectroscopy
Mass
Spectroscopy
GC / HPLC / SEC
Chromatography
Scattering &
Diffraction
Thermomechanical
Analysis
Rheological
Analysis
AFM / SEM / TEM
Microscopy
Electronic Structure
(1? ? 2nm)
Atomic Structure
(1? ? 1nm)
Group Structure
(5? ? 5nm)
Chain Structure
(10nm ? 2?m)
Intermolecular Structure
(2? ? 10nm)
Nanostructure
(1nm ? 100nm)
Microstructure
(1?m ? 100?m)
Mesostructure
(10?m ? 1mm)
Electrical
Thermodynamic
Toxicological
Environmental
Kinetic
Transport
viscoelastic
Mechanical
Characterization
Type
Molecular
Architecture
Attributes &
Properties
31
Figure 2.9: IR spectra of butylated hydroxytoluene molecule.
Figure 2.9 is an IR spectrum of a butylated hydroxytoluene molecule
[39] and its molecular structure showing 1
st
and 2
nd
order GC groups. IR
spectra of such molecules contain large quantities of descriptor data involving
information on molecular architecture. In addition, there is a high probability
that many of the descriptor variables will be correlated, i.e., that some
variables will be linear functions of other variables [40]. Managing such
complexity of information to design chemical products and to build
appropriate models for a specific application will require a systematic method
for capturing important features of the molecular architecture. Therefore
multivariate statistical techniques can be used to decompose large quantities
of information about the system in the initial training set. More detailed
discussion on multivariate techniques will be presented in Section 2.6.
2.4.1 Spectroscopy
Many substances in solution follow BeerLambert?s law, showing a linear
relationship between concentration and absorbance. Beer?s law relates the
absorption of light to the properties of the material through which the light is
Wavenumber [cm
1
]
A
bsor
ba
nc
e
[
%
]
32
traveling. The absorbance of a single wavelength due to the presence of a
single component can be written as [41]:
A KC= ?
(2.14)
where, A = the absorbance at that wavelength
K = the absorbance coefficient for that wavelength and component
C = the concentration of the component
This concept can be applied to any system where we can measure a
quantity, A, that is proportional to some property, C, of the sample. The
generalization of the above equation for multiple components, multiple
wavelengths and multiple samples gives:
1
n
ws wc cs
c
A KC
=
= ?
?
(2.15)
where, Aw = the absorbance at the w
th
wavelength in the s
th
sample
Kwc = the absorbance coefficient for the w
th
wavelength and for the c
th
component
C
c
= the concentration of the c
th
component in the s
th
sample
n = the total number of components
In expanded form Eq. (2.15) takes the form:
1 11 1 12 2 1
2 21 1 22 2 2
3 21 1 32 2 3
21 1 2 2
...
...
...
...
s c cs
s c cs
s c cs
ws w wc cs
A KC KC KC
A KC KC KC
A KC KC KC
A KC K C K C
= ?+ ?+ + ?
= ?+ ?+ + ?
= ?+ ?+ + ?
= ?+ ?+ + ?
? ? ???
(2.16)
It is clear from Eq. (2.16) that the absorbance at a given wavelength, w,
is simply the sum of the absorbances at that wavelength from each of the
components present. Since matrix notation can easily generalize the above
Eq. (2.16), a single matrix equation is:
33
A KC= ?
(2.17)
Note: The notations used here are limited to this section only. New notations
will be introduced in following sections that will be used throughout the
remainder of the dissertation. Boldfaced unsubscripted letters are used to
refer to vectors and matrices.
In spectroscopy, it is clear that the spectrum of a sample is the sum of
the spectra of the constituents multiplied by their concentrations in the
sample. If the concentrations are t and the spectra p, we get the latent
variable model
11 2 2
...
TT T T
cc
X t p t p t p T P noise=?+?+?=?+ (BeerLambert?s law).
In many applications this interpretation with the data explained by a
number of ?factors? (components) makes sense [41]. Chemometric modeling
methodologies can be investigated and employed as a means to derive
mathematical relationships between spectroscopic measurements and
measured product properties. Kramer [41] suggests that the power and
limitations of chemometric techniques that should be realized while solving
datadriven problems are as follows:
We can use these chemometric techniques to:
? remove as much noise as possible from the data.
? extract as much information as possible from the data.
? use the information to learn how to make accurate predictions about
unknown samples.
In order for this to work, two essential conditions must be met:
? the data must have information content.
? the information in the data must have some relationship with the
property(s) which we are trying to predict.
Because consumer attributes are difficult to quantify physically, the
relationship between them and the underlying fundamental physical
34
chemical properties and/or the molecular architecture will most likely involve
empirical relationships. Empirical models describe the underlying
phenomena?s relationship to a set of experimental data using regression
analysis.
2.5 General Regression Models
Traditionally, the modeling of the predicted response, Y, by means of a
descriptor variable, X, is done using multilinear regression (MLR), which
works well as long as the Xvariables are fairly few and fairly uncorrelated,
i.e., X has full rank (rank is a number expressing the true underlying
dimensionality of a matrix). Often, the relationship between X and Y
variables can be approximated using a linear model and can be represented
mathematically as:
YX?E
ML MK KL ML? ?? ?
= ?+
(2.18)
where, M = the number rows of sample readings or observations
L = the number columns of measured response properties
K = the number columns of descriptor variables (like components)
?
= the regression coefficients or sensitivities matrix
E = the error or residual matrix
Three cases can be distinguished as described by Geladi and Kowalski [42] in
Eq. (2.18) to determine ?:
1. ? > ?: There is no unique solution for ?
as infinite numbers of solutions
exist, unless one deletes predictor variables.
2. ? = ?: There is one unique solution provided that X has full rank.
0EYX?= ??=
3. ? < ?: There is no exact solution for ?, however, a solution can be
achieved by minimizing the residual in the following equation:
EYX?=??
35
The ordinary leastsquare (OLS) method is the most popular method to
find the regression coefficients by maximizing the model sum of squares and
minimizing the residual sum of squares. Using leastsquare, ? can be
estimated by:
( )
1
?
? XX XY
TT
?
= ? ??
(2.19)
where, the superscript T symbolizes the transpose of a matrix.
When the number of Xvariables is large compared to the number of
observations, for example in spectroscopy (?. ?.? > ?), it can lead to a
singular (?
?
?) matrix whose inverse does not exist. This is because the
number of unknown variables is greater than the number of equations,
leading to an underdetermined equation system which has an infinite
number of solutions for ?. This is the most frequent problem in MLR. One can
exclude variables ? > ? that are not significant; however, it is not a
guaranteed solution [42].
In addition, where a situation like mixture design is considered, the X
variables could be component proportions, ?
?
, in the mixture that are not
mutually independent (?.?.? ?
?
= 1
?
?=1
, where ?
?
? 0 ? i = 1,2, ? ,?). This
situation is referred to as X being rank deficient, collinearity, zero
determinant, singularity, and illconditioned. In such a situation, the
inversion of (?
?
?) matrix may not exist leading to the estimation of ? using
OLS with large variances. Here, a statistical multivariate method provides
the right tool to extract systematic variables and remove collinearity in the
data set as described in Section 2.6.
In molecular products, an important objective is to find a chemical
product that exhibits certain desirable or specified behavior. Assuming the
model in Eq. (2.18), and the parameter estimates
?
?,
a new xvariable can be
predicted from a desired yvariable such that
36
( ) ( )
11
?
,?
TT
L K KL
des new
yx
? ??
= ?
(2.20)
The model inversion of Eq. (2.20) gives
( ) ( ) ( )
1
11
?? ?
?? ?
T TT T
K L LK KL LK
new des
xy
?
? ? ?? ?
= ??
(2.21)
where,
( )
1
?? ?
?? ?
TT
?
? is the generalized inverse or pseudoinverse of
?
?
Eq. (2.20) does not contain any information about the covariance
structure within the manipulated X variables. Consequently, the solution
given by Eq. (2.21) will not respect those previous structural relationships
when solving for the new conditions ?
???
?
. Therefore, the standard regression
model (Eq. (2.20)) and its inversion (Eq. (2.21)) possess serious limitations in
solving similar problems. Multivariate projection methods like PCA and PLS
become indispensable tools in dealing with such difficulties and can be used
for exploration, calibration, and classification of multivariate data.
2.6 Latent Variable Models
The most effective tools in multivariate data analysis are Principal
Component Analysis (PCA) and Partial Least Square (PLS). PCA and PLS
are decomposition techniques that compress a large quantity of data and
extract the information by projecting them into a lowdimensional subspace
that summarizes all the important information for analysis [43, 44, 45].
Then, further design work can be conducted in the reduced subspace.
If the data used is high dimensional and noisy, and the number of
samples are small while developing a calibration model, there is the danger
of overfitting. In such cases, PCA (which can capture a dominant part of data
variance) is more appropriate to reduce the data dimensionality and then
train a regression model with the reduced latent variables. Principal
component analysis is presented in this section as a variable reduction tool
including important aspects such as pretreatment of data, validation, and
37
outlier detection that must be considered. Prior to PCA, data often needs to
be pretreated, in order to transform the data into a form suitable for
analysis.
Data Preprocessing: Preprocessing of data can make the difference
between a useful model and no model at all. Pretreating data is often
employed to transform data into a form suitable for analysis. It is general
practice to meancenter and scale the property variables prior to analysis [42,
46]. Variables often have substantially different numerical ranges. A variable
with a large range has a large variance, whereas a variable with a small
range has a small variance. Since PCA is a maximum variance projection
method, it follows that a variable with a large variance is more likely to be
expressed in the modeling than a lowvariance variable [15].
In particular, the property descriptor data matrix ?
???
, consisting of M
observations described by K descriptors, is meancentered and unit variance
scaled (UV) (also known as autoscaled) across the M properties.
? Mean centering: The mean for each variable (each column) is
calculated based on the entire sample and subtracted from each
measurement (elements of matrix ?
???
).
,
1
1
where,
M
ij ij j j i j
j
x xx x x
M
=
=?=
?
(2.22)
? Scaling: Variance scaling of the data to unity across the K descriptor
properties can be accomplished by standardizing variables. For each
measurement, this is done by dividing mean centered data by the
standard deviation (?
?
).
( )
2
2
,
1
1
where,
1
M
ij j
ij j i j j
j
j
xx
x s xx
sM
=
?
= = ?
?
?
(2.23)
38
Autoscaling puts all variables on equal footing; i.e., all variables have
the same chance of entering the model and of taking part in the model. PCA
performed on autoscaled data is referred to as a correlation PCA.
Figure 2.10: Unit variance scaled and meancentered variables.
Figure 2.10 is a simple geometrical understanding of UVscaling and
meancentering of variables [15]. In this representation, each bar corresponds
to one variable and the short horizontal line inside each bar represents the
mean value. The length of a bar (vector) is equal to its standard deviation
(square root of variance). After UVscaling, we get a shrinking of ?long?
variables and a stretching of ?short? variables. By putting all variables on a
comparable footing, no variable is allowed to dominate over another because
of its length. Prior to any preprocessing, the variables have different
variance and mean values. After UVscaling, the ?length? of each variable is
identical; however, mean values still remain different. After meancentering,
mean values are zero (i.e., the centroid of the whole data set is zero). This
improves the interpretability of the model developed.
However, it must be noted that in some cases, it is not necessarily
advantageous to use autoscaling, and some other choice might be more
appropriate. Also, the X and Yvariables can be scaled differently because
the regression coefficients absorb the differences in scaling.
Unit variance
scaling
Mean
centering
0
Measured
values
&
?length?
39
2.6.1 Principal Component Analysis (PCA)
Principal component analysis is a factor analysis method that is widely used
in identification of systematic patterns in data and provides visualization of
multivariate data by using as few variables as possible. It linearly maps
multidimensional data onto lower dimensions with minimum loss of
information [44]. The goals of PCA are to:
? extract the most important information from the data table,
? compress the size of the data set by keeping only this important
information,
? simplify the description of the data set, and
? capture and analyze the structure of the variables.
In order to achieve these goals, PCA transforms a set of correlated
variables into a new set of uncorrelated ones, known as principal components
(PCs) such that
? the first PC is the linear combination of the standardized original
variables that have the greatest possible variance,
? each subsequent PC is the linear combination of the standardized
original variables that have the greatest possible variance, is orthogonal
and has zero correlation with all previously defined PCs.
The orthogonality constraint imposed by the mathematics of PCA
ensures that each variancebased axis is independent. PCs are arranged in
order of decreasing eigenvalues. First PC is the most informative. Figure 2.11
shows the dimensionality reduction of original data to a low dimensional
subspace using PCA which is much easier to visualize and analyze.
40
Figure 2.11: Projection of higher dimensional data onto a lower
dimensional subspace.
Note: In the following discussion, matrices are denoted by capital bold
characters (X, Y), column vectors by small italic characters (t), and row
vectors by transpose vector (?
?
).
Using PCA, a structural descriptor data set of molecular architecture
information or a process condition matrix (?
???
), representing M
observations of K variables, can be decomposed to
11 2 2
1
.......X
TP
TT
MK
K
T
ii
i
T
MK KK
tp tp K
tp
?
=
??
=?+?+
= ?
= ?
?
(2.24)
where, T = the score matrix with mutually orthonormal columns
P = the loading matrix with mutually orthonormal columns
Principal components (PCs) are new lines that best approximate the
data in the least squares sense and passes through the average point [15].
When two PCs are derived they, together, define a plane as seen in Figure
2.11). The score matrix (T) represents the projections of the data onto this
Data Projections
Data Center
PCs
Initial Data
Data Plane
41
line in order to get a coordinate value along the PCline. The new coordinate
value is known as a score (?
?
). The loadings define the orientation of the PC
plane with respect to the original Xvariable. The loading matrix (L) contains
the coefficients in the linear combination of the original variable defining the
principal components (PCs).
PCAbased soft models are both linear and additive. The loadings
unravel the magnitude (large or small correlation) and the manner (positive
or negative correlation) in which the measured variables contribute to the
scores. Together the scores and loadings describe the principal components of
the data set.
Normally, the first three A (? ? ?) PCs capture most of the variance in
the data (8090% of total variance [46]). By retaining only the first A PC?s the
X matrix can be approximated by:
11 2 2
1
?
.......X
TP
TT
MK
A
T
ii
i
T
M A AK
tp tp A
tp
?
=
??
=?+?+
= ?
= ?
?
(2.25)
In order to characterize the properties of the observations one measures
variables. Observations are comprised of two parts: signal and noise. Signal
describes the property or effect of interest, and noise is everything else. Using
methods based on variance such as PCA, multivariate data can be separated
into signal and noise. Using Eq. (2.25), matrix X can be reconstructed as
?
1
?
1
1
1
X XE
E
T P E
T
MK MK MK
A
TT
i i MK
i
TT
M A AK M K
Noise
Structure
x
x tp
x
? ??
?
=
?? ?
=?+ +
=?+ ?+
=?+ ? +
?
?????
(2.26)
42
Here, 1 ???
?
represents the variable average which originates from the
pretreatment step. This way, the data matrix ?
???
, containing K highly
correlated variables, is transformed into the score matrix ?
???
, containing
only A (where, ? ? ?) mutually independent latent variables, which are
linear combinations of the original K variables, have better properties
(orthogonality) and also span the multidimensional space of ?
???
. The
residual matrix (E) comprises the distances of the original variables to their
projection onto the principal components.
Figure 2.12 graphically represents how scores and loadings form the T
.
P
(structure) part of the PC model equation represented by Eq. (2.24).
Figure 2.12: PCA decomposition of X matrix.
Here vectors are designated as column vectors and the corresponding
transposed vectors are designated as row vectors. The dashed lines in the
matrix indicate the mean centering and scaling direction.
2.6.1.1 Number of Principal Components
Before calibration of a model from PCs, it is important to determine the
number of components necessary to extract the most relevant information
from a data matrix. The errors calculated for the calibration set decrease
E
P
T
T
t
2
t
1
+= +
(M x K)
matrix
?A +
(M x K)
matrix
= +
E
p
T
1
p
T
2
(M x K)
matrix
(
M x
A
)
m
a
t
ri
x
(A x K)
matrix
43
continuously as the number of components increase. Two methods [47] to
help choose the number of components are to:
? plot the eigenvalues according to their size (the so called scree plot).
The plot provides a visual aid for deciding at what point in this graph
(often called an elbow) including additional components no longer
increases the amount of variance accounted for by a nontrivial amount
(slope of the graph goes from steep to flat). Keep only the components
before the elbow. For example, the scree plot represented in Figure
2.13 suggests that using three PCs is appropriate.
? keep only the components whose eigenvalue is larger than the average
eigenvalue. For correlation PCA, this means to keep only the
eigenvalues larger than 1.
Figure 2.13: Scree plot of the correlation matrix.
One must be aware of the fact that unwanted variability in the data set,
such as random noise, may also be taken into account by a model constructed
with too many PCs. The model is said to be overfitted, showing excellent
results for evaluating samples belonging to the calibration set but failing on
prediction of an external validation set.
0
1
2
3
4
5
0 2 4 6 8 10 12 14
E
i
g
en
v
al
ue
Number of Components
Three components
44
2.6.1.2 Scores Plot
The scores plot shows correlations between observations, measurements or
responses. Basically, it helps answer questions like:
? are observations related to each other?
Responses close to each other have similar properties, whereas those
far from each other are dissimilar with respect to descriptor variable
profile (Figure 2.14 (a)).
? are there any groups or trends?
In Figure 2.14 (a), data represented by different symbols (circle,
triangle and square) represent a group of responses with similarity in
descriptor properties.
2.6.1.3 Loading Plot
The loadings plot shows correlations between variables. It helps answer
questions like:
? how the descriptor variables are correlated?
Descriptor variables contributing similar information are grouped
together, that is, they are correlated. For instance, in Figure 2.14 (b),
variables 15 and 28 are correlated.
? which variables are influential?
If the variables are in the same quadrant, they are positively
correlated, whereas variables in opposite quadrants (opposite side of
plot origin) are negatively (inversely) correlated. For instance in Figure
2.14 (b), when the value of one variable in upperright hand corner (say
variable 13) increases or decreases, the value of the other variable (say
variable 16) has a tendency to change in the same way. Moreover, as
the distance of a variable from the plot origin increases, the stronger is
the impact of this variable on the model. This suggests that the
variables 6, 13, 16, 44, 2, and 5 (from Figure 2.14 (b)) separate the
three response groups (in Figure 2.14 (a)).
45
Figure 2.14: Principal component analysis: Score plot (a, left) of t1/t2 and
loading plot (b, right) of p1/p2. The ellipse represents the Hotelling T
2
with
95% confidence in score plot.
Figure 2.14 depicts score and loading plots obtained from PCA.
Comparing the loadings plot to the scores plot helps one understand how the
variables relate to the observations.
2.6.1.4 Outlier Removal
Real data is rarely homogeneous (outlierfree) and therefore most statistical
methods require removal of outliers prior to the calibration of a model.
Outlying samples may have a huge influence on the calibration of a model
and may decrease its predictive ability. Outlier identification is based on
distance from data centroids. Hotelling?s ?
2
can be used to detect outliers
inside the model space [48]. Hotelling?s ?
2
is a multivariate generalization of
the univariate student?s ttest, and provides a check for observations obeying
to multivariate normality. When this statistic is used in conjunction with a
score plot, Hotelling?s ?
2
defines a 95% or 99% tolerance region. In Figure
2.14 (a), one observation can be considered to be an outlier as it may not
belong to the majority of the sample population and reveals a mistake in the
property value obtained.
P [1]
(b)
P [2]
t [1]
t
[2]
(a)
46
2.6.2 Principal Component Regression (PCR)
After applying PCA to the ?
???
block for variable reduction, regression can
be used to predict a particular quantitative characteristic (like attributes) as
a function of score vectors. The multilinear regression (MLR) relationship
between the principal component scores ?
???
and the attribute properties
?
???
can be developed using a PCR model as:
( )
1
??
, where,Y T B B TT TY
M L M A AL
?
? ??
= ? = ? ??
(2.27)
Unlike, in OLS where the columns of Y are regressed onto the large and
highly correlated columns of X, in PCR the columns of Y are regressed onto
the reduced and mutually independent latent variables T.
For any desired (1 ? ?) vector of Yvariable, (?
?
)
???
, one can compute a
(1 ? ?) vector of new latent variable scores as
( ) ( ) ( )
1
11
?? ?
BB B
T TT T
A L LA AL LA
new new
ty
?
? ? ?? ?
= ??
(2.28)
and then predict the (1 ? ?) vector of new Xvariables as
( ) ( )
11
TT
K A AK
new new
x tp
? ??
= ?
(2.29)
Notice that Eq. (2.27) has the same structure as Eq. (2.18), however, instead
of finding K variables to estimate (?
?
)
???
, now only ? (where,? ? ?) latent
variables has to be found to estimate (?
?
)
???
thereby achieving reduction in
dimension of the equation system involved. Also, since Eq. (2.25) preserves
the covariance structure of X, the new Xvariable found will be consistent
with the past ones. More details can be found in [49].
47
2.6.3 Partial Least Squares (PLS)
PLS is a regression extension of principal component analysis (PCA). It
generalizes and combines features from PCA and multiple linear regressions
(MLR). Besides, relating two data matrices, X and Y, PLS also models the
common structure between them thereby, giving richer results than the
traditional multiple regression approach. PLS regression modeling has been
described extensively in the open literature and can be found in Erikson et al.
[15], Gabrielsson et al. [35], Geladi and Kowalski [42], among others. While
this approach ensures the best possible correlation between the two data sets,
it does not guarantee to best describe the X and Y data individually.
Figure 2.15 shows a PLS model being generated between the descriptor
data matrix X, which could be molecular descriptors or property descriptors,
and response data matrix Y, which could be attribute or property
information. The method fits two ?PCAlike? models at the same time, one for
X and one for Y. However, the projections differ from those obtained with
PCA on both blocks separately. The outer relation for X and Y block are:
1X TP E
TT
MK MA AK MK
x
? ?? ?
=?+ ? +
(2.30)
1
TT
ML MA AL ML
y
? ?? ?
=?+ ? +Y UVF
(2.31)
Figure 2.15: PLS regression on descriptive (X) and response (Y) variables.
y
1
y
2
y
3
u
t x
2
x
3
u
t
x
1
Attribute & PhysicalChemical
Property Variables (Y)
PLS Regression
Model
Property Structure
Descriptive Variables (X)
48
The projections of X and Y are then connected through the inner relation
U TG
MA MA MA???
= +
(2.32)
where, G is a residual and the regression coefficient is one.
The score plot in Figure 2.15 shows a linear relationship between
predictors and the responses, however, nonlinearities may exist. The PLS
score plot u/t shows linear correlation structure between the predictors and
the responses. The dashdot line is the projection if PCA was performed on X
and Y individually. In PLS, besides loadings P and V, there are additional
loadings called weights ?
?
which express the correlation between U and X
and are used to calculate T such that
T XW
MA MK KA
?
? ??
=
(2.33)
The prediction of Y can be obtained from the PLS model as:
( )
??
TT
ML MA AL MK KA AL MK KL
?
? ?? ? ?? ? ?
=?= ? ? = ?Y TV X WV X B
(2.34)
PCR and PLS can be considered standard calibration techniques for
several spectroscopic techniques, among many. The main advantage of these
techniques is to avoid collinearity problems thus allowing one to work with a
number of variables that is greater than the number of samples. A
comparison between these two techniques reveals similar results in terms of
prediction ability in multivariate calibration, with no significant difference
being reported when both employ the optimized number of principal
components (PCs) [50]. PCR yields lower accuracy (degree of closeness of a
measured value to the actual value) but higher precision (degree of closeness
of the measured values to each other) than PLS. The basis of the model
development in this dissertation will be based on PCR, however, PLS will be
used for few qualitative applications.
49
2.6.4 Model Validation
It is important to validate the predictive ability of the developed model in
addition to having a good fit. Validation can be distinguished between two
types:
? Internal validation: where the calibration data is also used as
validation data. This method is applicable when a proper validation
sample is not available, or is not used. Often the calibration sample is
separated into a learning set (for calibration) and a testing set (for
validation) [15, 51]. A popular internal validation technique is the leave
oneout crossvalidation technique in which one sample is used for
validation and the remaining samples are used for calibration (oneata
time method). This is repeated for each sample. Predicted residual sum
of squares (PRESS) is computed as:
( )
2
1
?
K
ii
i
PRESS y y
=
= ?
?
(2.35)
The number of components giving a minimum PRESS is the right
number for the model that gives optimal prediction. The root mean
square error of crossvalidation is,
=
PRESS
RMSECV
K
(2.36)
The smaller the PRESS the better the quality of the estimation for a
developed model. Details can be found in Montgomery et al. [51].
? External validation: is based on a new or independent set of
validation data to evaluate the predictive ability of the previously
developed model from the training data set. The external validation
errors are often presented as the root mean square error of prediction
(RMSEP), which is calculated as:
50
( )
1
22
1
?
K
ii
i
yy
RMSEP
K
=
??
?
??
=
??
?
(2.37)
Including more latent variable terms in the final model will always
increase the goodness of fit, ?
2
, regardless of whether the additional variable
is statistically significant or not. However, a large value of ?
2
does not
necessarily imply that the regression model is a good one. Because ?
2
always
increases as we add terms to the model, adjusted ?
2
statistics must also be
considered. The ?
adj
2
is particularly useful in the selection stage of model
building.
( )
2 22
1
1 1 where, 1
1
?
??
=?? =?
??
?
??
adj
total
K SS
R RR
A SS
(2.38)
where, SS is the residual sum of square.
Unlike ?
2
, the ?
adj
2
increases only if the necessary terms improve the
model more than could be expected by chance. Therefore, when ?
2
and ?
adj
2
differ dramatically, there is a good chance that nonsignificant terms have
been included in the model [51]. The ?
adj
2
can be negative, and will always be
less than or equal to ?
2
.
An individual principal component (PC) generated by PCA or PLS is
considered significant if its crossvalidated Q
2
value is greater than zero. Q
2
is calculated according to
2
1= ?
total
PRESS
Q
SS
(2.39)
The overall significance of each PCA or PLS model is evaluated in terms of
Q
2
(cum) as [15]:
2
( )1
??
??
= ??
????
??
??
PRESS
Q cum
SS
(2.40)
51
2.7 Principal Properties in Cluster Space
In order to solve a design problem in a single domain, all the physico
chemical attribute properties of interest have to be converted to principal
properties by using the regression coefficients from the calibration model
[27].
( )
( )
1
11
1
?? ?
, if
where,
???
, if
?
?? ?
??
? ?? ?
?
???
?
?>
?
=?=
?
?
?<
?
BB B
T Y B B
BBB
TT
LA AL LA
M A ML LA LA
TT
LA AL LA
LA
LA
(2.41)
?
?1
is a generalized inverse or pseudoinverse of a matrix
?
B [52].
In order to utilize the latent variables (LVs) in the property clustering
algorithm, it is important that the LV structures follow a linear mixing rule.
This can be achieved by standardizing the data structure to obtain a new
matrix, ?
???
. Rearranging the data decomposition represented by Eq. (2.25),
we get:
? ??
= ?T X P
MA MK KA
(2.42)
If the loadings ?
???
are thought of as the pure values of the principal
components, then scores ?
???
serves as the predicted mixture properties
represented by ?
???
in Eq. (2.1). Here, the mixture fractions of the
multivariate descriptor data ?
???
must sum to one across descriptor
variables K. To achieve this form, the latent variable structure is
standardized by dividing ?
???
and ?
???
in Eq. (2.42) by the sum total of the
property descriptors of each experimental run ?
???
such that:
? ??
= ?Q UP
MA MK KA
(2.43)
Where, ?
???
= ?
???
/?
?
, ?
???
= ?
???
/?
?
, and =
?
H
Mi
i
Sx.
A powerful chemical product design framework is achieved, by
integrating latent variable methods within property cluster domain and by
52
formulating two reverse problems. The first reverse problem identifies
product quality/performance requirements and second identifies mixing
conditions (involving selection of materials, blend ratios, or process operation
conditions) when considering mixture design, and substructural molecular
building blocks when considering molecular design. The following chapters
will demonstrate the mixture and molecular design solution framework
developed by combining methods and tools presented in this chapter.
2.8 ComputerAided Design using QSPR and cGCM
This section describes how the tools and techniques presented in this chapter
are combined within the computeraided approach to facilitate investigation
of chemical product formulations prior to experimentation and simulation.
When appropriate property models are available to describe and predict the
target product properties, computeraided molecular design methodologies
are utilized to solve the design problems systematically and efficiently. The
computeraided molecular design problems require: (1) prediction of
properties from molecular structure (solution to forward problem) and (2)
identification of optimized molecular structures that meet given a set of
property values (solution to reverse problem) [1, 9, 11, 24 27, 53]. Figure 2.16
shows the two required solutions of CAMD.
Figure 2.16: Forward and reverse problems in computeraided molecular
design.
Given set of molecular
groups/buildingblocks
(e.g. CH
3
, C=C, , ,
Set of target properties
values
(e.g. ?, ?, T
b
)
Forward Problem
Reverse Problem
6
,
?
PF
4
)
?
BF
+
IM
+
PY
53
2.8.1 Solution to Reverse Problem
The molecular design problem presented in this dissertation follows a reverse
problem of property prediction from molecular structures. In this approach, it
is required to enumerate structured molecules from a given set of
representative molecular building blocks or fragments corresponding to
target physicalchemical properties estimated by property models. First, the
values of the target properties of desired molecule needed for a specific
application are determined a priori. Lower and upper bounds on each
property, ?
?
, of molecule, i, are obtained either from process design problem
(see Section 2.2.1) or from product property requirements/constraints.
min max
??
j ij j
y yy
(2.44)
Mapping the design problem formulation from property to cluster space
(described in Section 2.2.2) enables visualization of a problem and its solution
when three properties are concerned. The normalized property operator of
molecule i using corresponding reference values give:
min max
? ?? ??
j ij j
(2.45)
where ?
?
is defined by Eq. (2.4).
Second, bruteforce search or exhaustive searches are performed using
numerous permutations or combination of atoms and molecular groups, to
systematically enumerate all possible molecular structures, and mixtures of
molecules, that satisfy specifications in terms of the normalized target
property operator values in Eq. (2.45). The maximum number of similar
groups ?
?
?
are predefined to limit the size of a generated molecule. The
possible number ?
?
?
of similar molecular fragments/groups of type g that can
be included in the design can be estimated by taking the minimum of the
nearest integer value of ??
?
???
?
?
? ? and ?
?
?
.
54
max
**
min :
???
=
??
?
??
j
gg
j
nN
(2.46)
where ?
?
???
is the target maximum latent property value of property ?.
It must be noted that exhaustive generateandtest algorithms are
adequate to solve problems with small size. For problems with slightly larger
search space, reduction in the search space (reducing the set of candidate
solutions to a manageable size) using problemspecific heuristics can make
the algorithm more efficient [54, 55]. However, for large scale molecular
design problems, stochastic or evolutionary search algorithms such as genetic
algorithm (GA), are preferred [9].
2.8.2 Solution to Forward Problem
In order to achieve predictive property models, quantitative structure
property relations (QSPR) are utilized where the training set molecular
descriptors are generated in terms of infrared (IR) frequencies. Since
molecular descriptors are obtained through characterization techniques
based on IR spectroscopy (described in Section 2.4.1), the group based
property estimation method is termed characterizationbased group
contribution method (cGCM) [27]. Since the combination of molecular groups
can be done in an infinite number of ways, the search space and the size
range of enumerated molecules can be reduced by constraining the number of
occurrence of similar and dissimilar types of groups within minimum and
maximum limit.
**
0 ??
gg
nN
(2.47)
2 ??
gg
nN
(2.48)
Note that a minimum number of two dissimilar groups, ?
?
, must be selected
to form a structurally feasible molecule.
55
When order of selection does not matter and repetition of similar groups
is allowed, the total number of possible candidate molecules that can be
generated by selecting ?
?
groups from a set of ?
?
groups is given by:
( )
( )
total
22
1!
Candidate
! 1!
= =
+?
= =
?
??
gg
g
g
gg
NN
ggN
n
nn
gg
Nn
C
nN
(2.49)
As the number of groups gets high, the problem of combinatorial
explosion arises. In order to reduce the combinatorial problem, atoms and
several first order groups (except some terminal groups like CH
3
) are
clustered into metagroups. In groupbased property estimation methods,
these metagroups are treated as first order groups. Such groups capture
inter and intraatomic and group interactions. The property operators of
molecules using first order metagroupsbased cGCM are estimated as:
1
g
N
Mix
j g jg
g
n
=
? = ??
?
(2.50)
where, ?
?
???
= the mixture property operator values of property ?
?
?
= the number of occurrences of dissimilar molecular fragment ?
?
?
= the maximum number of possible appearance of dissimilar
fragments (user defined), and
?
?
= the property contribution of the appeared group in the
formulated molecule.
When the predicted property values using Eq. (2.50) of enumerated
molecules from the combinatorial building blocks satisfy the property
constraints in Eq. (2.45), the set of molecules are considered and selected as a
candidate solution. Then the enumerated candidate molecules are screened
for structural constraints to ensure that a stable and connected molecule was
formed. One such structural constraint is to check the number of unused
bonds in a generated molecule, i.e. the free bond number (FBN) [24].
56
11
2 12
gg
NN
g g g Rings
FBN n FBN n N
= =
? ?? ?
= ? ?? ? ??
? ?? ?
? ?? ?
??
(2.51)
where, ???
?
= the unique free bond number associated with group g
?
????
= the number of rings in the formulation.
For an electronically complete molecular formulation the free bond
number must be equal to zero. A FBN of zero indicates that the electron
valency shells of all atoms in the molecule have been satisfied, which, in most
cases, indicates one of the minimum energy connectivity configurations of the
atoms in the molecule.
Finally the feasible solution can be verified through more rigorous
experimentation and/or computational studies based on molecular dynamics
and quantum chemical calculations. The methodology presented in this
dissertation provides a computationally efficient screening procedure to
narrow down potential compounds with desired attributes from a large
chemical space. Figure 2.17 is a schematic diagram that summarizes the
steps involved with variable transformation, latent variable model
calibration, and property integration to solve the computeraided reverse
design of chemical products in reduced space.
57
Figure 2.17: An overview on the methodology of multivariate
characterization, modeling, and design.
,
1
H
T
MA MA hm
h
x
??
=
=
?
QT
?
M L M A AL? ??
= ?Y TB
,,
T
new M A New M L L A? ??
= ?
T YB
MH?
X
?
T
M H M A AH? ??
= ?X TP
( )
? =
a a ma
m
qq
( ) ( )???=
am aa aa
m ref
qq
1=
= ?
?
A
m am
a
AUP
Chemometric Technique
? Dimensionality reduction and multicollinearity
removal from molecular/process descriptor data set.
? PCA to achieve a qualitative analysis of data and
PLS to develop the calibration model.
? Transform physicochemical properties to principal
properties (latent variable domain).
Property Clustering Technique
? Standardize principal properties such that they follow
linear mixing rule (suitable in cluster domain).
? Property operators as functions of standardized
principal properties.
? Dimensionless property operators follow linear
mixing rules and allow optimization based on lever
arm analysis.
? AUP values of the formulated product and the target
must match along with matching the cluster targets.
? Clusters follow intra and interstream conservation.
0=FBN
1=
? = ??
?
g
N
Mix
a g ag
g
n
,
? ?? ??
LU
a am a
Reverse Molecular Design
? Mixing of properties operators follow linear mixing.
? Selection of maximum number of dissimilar
molecular fragments.
? Solution properties must lie within the lower and
upper target property limits.
? Molecules must be chemically stable and structurally
feasible.
=?
ma ma m
C AUP
0 ??
gg
nN
58
CHAPTER 3
MIXTURE DESIGN
3.1 Introduction
Mixing/blending processes play an important role in today?s manufacturing of
valueadded chemical products such as commodity chemicals, food, cosmetics,
oil and pharmaceuticals. The product formulation design problem in
industrial research and development is one such area where the analysis of
mixture data could be utilized more effectively.
Mathematical models are commonly used to characterize a system,
which is to be controlled or optimized by a set of variables, to study the
effects of various factors and to make predictions about behavior. Generally,
models can be classified into mechanismdriven models and datadriven
models. Traditionally, the chemical engineering discipline has focused on
mechanistic models that describe underlying phenomena with a system of
differential and algebraic equations (DAEs). However, reallife industrial
process systems, which are often complex and nonlinear with incomplete
and/or uncertain data, cannot be adequately described by such models [18].
Datadriven models offer an alternative solution as we move from
limited data, which used to be obtained through timeconsuming experiments
and simulations, to massive amounts of data that can be generated from
analytical instruments, images, spectra, etc. (i.e. ?data poor? to a ?data rich?
paradigm shift due to rapid instrumentalization of science and technology).
This has been the trend in recent decades [18, 19, 35, 27, 41, 56, 57].
Industrial reality also suggests that a good theoretical process model is
often not available. Before new experiments are conducted, historical process
59
data that encompass a wide spectrum of operating conditions and existing
product grades can be utilized to achieve new and improved products (20).
Today, tremendous amounts of diverse data are readily available from
extensive monitoring of equipment, processes, and products at all scales (18).
However, managing such abundance of complex data to build appropriate
models for a specific application and exploring their effects on the final
product properties remain major challenges.
3.2 Traditional Approach
In classical nonmixture designs such as factorial and response surface
designs, all the factors are orthogonal or independent. This means that it is
possible to freely choose the level of a factor regardless of the other factors?
levels [58]. In general, the model parameters can be used to judge the effects
of the mixture components. For example, consider a mixture experiment in
which three components ?
1
, ?
2
, and ?
3
, were blended to form a product with
response variable ?. If
1 2 3 12 13 23
? 11.7 9.4 16.4 19.0 11.4 9.6y x x x xx xx xx= ++ + + ? can
adequately represent the response, it can be concluded that component 3 has
the largest contribution to the highest response value because the coefficients
are in the order ?
?
1
> ?
?
2
> ?
?
3
. Furthermore, since ?
?
12
and ?
?
13
are positive,
presence of components 1 and 2 or components 1 or 3 enhances the response.
This is an example of synergistic blending effects. Components 2 and 3 have
antagonistic blending effects because ?
?
23
is negative and the presence of both
components works against the response of interest.
However, this freedom does not exist for mixture designs, because each
component in a mixture is dependent upon the settings of the other
component settings. In mixture design, the factors are interdependent and
the effects of the factors on the responses are not separable. For example, for
a mixture containing N components, if the proportions of the first N1
components are defined, the proportion of the N
th
component cannot be freely
60
chosen. If x
i
is the percentage content of component i, the sum of all the
mixture components is given by the fundamental mixture constraint
relationship
12
1
1.0
N
iN
i
xxx x
=
= + +???+ =
?
(3.1)
0 1.0 1, 2, ...,
i
x iN? ? ?=
(3.2)
Analyzing mixture data with multiple regressions necessitates a special
model form to eliminate the mixture constraint (Eq. (3.1)). The Scheffe
canonical models [59] and Cox polynomial models [60] are the two most
commonly used to analyze mixture data with multiple regressions. Scheffe
introduced canonical models of various orders by eliminating some terms
from the complete polynomial model. Cornell [58] provides multiple standard
references to regression, modeling and analysis of mixture data. A short
review on Scheffe and Cox models is also included in Appendix B.
In most practical applications, mixture data can be noisy and highly
collinear because of process or operational constraints (20). Although the
Scheffe canonical models and the Cox polynomial models (a reparameterized
and constrained version of the Scheffe model) eliminated the true collinearity,
and enabled the use of multiple regressions for the estimation, the problem of
near collinearities with mixture data remains. Design of experiments (DOE)
with response surface methods is usually used to determine the optimum
combination of chemical constituents that give a desired response using a
minimum number of experimental runs [35]. While such a design approach is
adequate for most experimental designs, it suffers from combinatorial
explosion and visualization difficulties when dealing with multicomponent
mixtures [22,61]. Solvason et al. [61] presented a solution to the above
problems by integrating the property clustering framework with existing
mixture design techniques.
61
In addition, traditional mixture models are usually employed to
investigate the relationships between the blend ratio matrix (R) and the final
blend product property matrix (Y) only (Eq. (3.3)), given that the properties of
the pure raw materials (X) and the process conditions used to manufacture
them (Z) are already chosen. The preceding topics represent this situation.
However, for the development of new products that meet target properties
with minimum experiments and minimum total material cost, it is important
to simultaneously take into account all three degrees of freedom (X, R, and Z)
available in blending operations [57].
In general, mixture design approaches tend to treat the three problems
as separate steps. A set of raw materials is selected usually based on the
experimenters? best guess, and then a set of blending experiments with
certain constant process conditions are run to see if the target properties can
be achieved. If the results are not acceptable, another set of raw materials
and/or process conditions are selected and the process repeated. Figure 3.1
depicts the traditional mixture design approach which may lead to many
blending experiments, a very inadequate investigation of the large number of
raw materials, and a very long development time.
Figure 3.1: Traditional approaches in mixture design.
Material
Selection
Process
Condition
Selection
Blend Ratio
Selection
Target
Properties
meet?
Final Product
Select set of Raw Materials
and/or process conditions
based on best guess or
experience
Design set of blending
experiments or trial and
error
Yes
No
62
Statistical multivariate models such as principal component analysis
(PCA), partial least squares (PLS), and neural networks (NN) provide
powerful tools to extract systematic variables and remove both of the above
collinearities in the data set, thus allowing one to work with a number of
variables that is greater than the number of samples [15, 49, 56, 57].
KettanehWold [56] proposed the use of a partial least square (PLS)
model for mixture data, and showed how it effectively deals with both the
above collinearity problems, and can simultaneously incorporate process
conditions. Recently, Muteki et al. [57, 62] added the relationship between
the raw material properties and final blend properties that allows
investigation of the effect of raw material properties on the final mixture
product properties. However, it is important that the raw material properties
selected are sufficiently correlated with the final blend product properties.
The multivariate statistical methods presented in this chapter are based on
Muteki?s development on mixtureproperty models combined with the
property clustering technique and reverse problem formulation. The
nomenclatures are kept consistent from that in Muteki?s papers for less
confusion. In this chapter we shall call the Xvariables factors or predictors
and the Yvariables qualities or responses.
3.3 MultiBlock Data Structure
The data structure generally available on raw material property data and
blending data in mixture design is shown in Figure 3.2 [57]. The raw
material properties matrix, ?
???
, consists of N available raw materials with
K number of properties. The blend ratio matrix, ?
???
, consists of M number
of blends of N materials used in the formulation of the blends such that
? ?
?
= 1.0
?
?=1
, where ?
?
? 0 ? i = 1,2, ? , ?.
The process conditions matrix, ?
???
,
consists of J process conditions. The final response or property matrix, ?
???
,
consists of L properties measured on the final product. It must be noted that
63
there is no common dimension over the entire data matrix. However, ?
?
has
an indirect relationship to Y through R as it has one dimension in common
with R but no dimension in common with Y. It is referred to as Lor Tshaped
data structure [57].
Figure 3.2: Data structure for three manipulative variable matrixes and a
quality/response variable matrix.
3.4 MultiBlock Regression Models
Traditional mixture models such as the Scheffe and the Cox models [58],
have commonly been used to model the relationship between R and Y as:
()YRf ?= +
(3.3)
Muteki et al. [57] used ideal mixing rules for combining the raw material
properties matrix (X) and the blend ratios matrix (R) in order to relate all (Z,
R, ?
?
) to the Y matrix with a common dimension M.
,
( , ) where,Y XZ X R X
mix mix M K M N N K
f ?
? ??
=+=?
(3.4)
?
???
is the mixtureraw material mixtureproperty matrix. Muteki [57]
demonstrated that the mixtureproperty models account for the similarities
among the raw material properties and their effect on the blends. This can
X
T
(K x N)
matrix
R
aw
Mat
er
i
al
P
r
oper
t
y
Material
Z
(M x J)
matrixS
am
pl
e
Process
Condition
R
(M x N)
matrix
Material
Proportion
Y
(M x L)
matrix
Mixture
Property
S
am
pl
e
S
am
pl
e
64
successfully capture more inherent latent relationships between the mixture
properties (?
???
= ???) and the final product properties Y than those which
exist simply between the ratios (R) and final blend properties (Y). When a
property model does not follow linear mixing, then the model can be moved
from its nonlinear domain to a linear domain by a suitable transformation of
the model formulation [34]. If the process operating conditions (Z) change
between blending experiments, then the effect of these changes are easily
accounted for by incorporating Z into the PLS models:
[ ] ( )
( , ) where,
Y X Z X X Z RXZ
mix all mix all mix
f ?=+==???
??
(3.5)
?
??? ???
is a matrix combining ?
???
and Z in parallel. The above data
combination can be better understood graphically in Figure 3.3. (The dotted
lines in each block indicate the mean centering and scaling direction.)
Figure 3.3: Data structure for combined manipulative variable matrixes
and a quality/response variable matrix.
Y
(M x L)
matrix
X
T
(K x N)
matrix
R
(M x N)
matrix
Z
(M x J)
matrix
Y
(M x L)
matrix
X
mixall
= [(R
.
X)Z]
(M x H)
matrix
Ra
w
M
at
er
i
al
P
r
oper
t
y
Material
E
xp
.
N
o
Material
Proportion
E
xp
.
N
o
Process
Condition
E
xp
.
N
o
Mixture
Property
Adjustable Process Variables
E
xp
.
N
o
E
xp
.
N
o
Mixture
Property
65
This simplifies the analysis and design by not having to differentiate
between mixture and process variables and not having to assume
independence of the factors when multivariate analysis techniques are used
[56]. The multiblock PCR (Section 2.6.2) or PLS (Section 2.6.3) model can be
used to obtain the relationship in Eq. (3.5). PLS regression is performed by
projecting the ?
??? ???
data and Y data onto a lower dimensional subspace:
*
,
1
where,
A
TT
mix all M H i i M H M A A H M H mix all
i
tp
? ? ?? ?
=
= ?+ = ? + = ?
?
X E T P E T X W
(3.6)
1
A
TT
ML i i ML MA AL ML
i
tv
? ? ?? ?
=
= ?+ = ? +
?
Y F TV F
(3.7)
The prediction of Y can be obtained from the PLS model as:
,,
??
TT
M L M A A L mix all M H H A A L mix all M H H L? ?? ? ?? ? ?
=?= ? ?? ?
Y TVX WVX B
(3.8)
The prediction of Y can be obtained from the PCR model as:
??
M L M A AL? ??
= ?Y TB
(3.9)
This way, the data matrix ?
??? ???,???
, containing ? (= ? + ?) highly
correlated manipulated (or predictive) variables is transformed into the score
matrix, ?
???
, containing only A (where,? < ?) independent latent variables,
which are linear combinations of the original manipulated variables. The
weights of this linear combination are captured in the loading matrices,
?
???
?
, ?
???
and ?
???
. If new raw material properties, (?
?
)
???
, have to be
predicted from the desired product quality specifications, (?
?
)
???
, then the
inversion of the latent variable model gives:
1
,1 ,1
?
??()BB B P
T T T TT
new K des K L A A K
xy
?
? ? ??
= ??? ?
(3.10)
66
3.5 Proof of Concept Example ? Starch Blending
The development of thermoplastics from the mixing of starches, lactic acids
and additives using latent variables in cluster space is presented as a case
study to illustrate the method and concept described in this chapter. Data for
the polymer blend problem was obtained from Muteki [63] and involves study
of the influence of raw material properties (?
?
), blend ratios (R) and process
conditions (Z) on the product property matrix Y. This work employs
multivariate data analysis techniques and the mixtureproperty model
introduced by Muteki et al. (2006) then formulates and solves the mixture
design problem in the cluster space.
3.5.1 Structure of starch blending data
The raw material data matrix (X) consists of 5 properties (Amylose content
and 4 properties related to molecular weight distribution) on 3 starches. The
blend matrix (R) consists of different blend ratios of 3 starches and 3 other
materials (1 polylactic acid and 2 additives) in each blend of 28 mixtures. The
process condition matrix (Z) consists of the molding temperature as the
process variable. The final product property matrix (Y) consists of 4 polymer
properties (tensile strength (TS), tensile modulus (TM), elongation at break
(EB) and density (Rho)) for 28 mixture blends.
Figure 3.4 represents a more complex extension of the data structure for
the X and R matrices shown in Figure 3.2. The raw material property data
matrices result in a staircase type of the data structure and are not
overlapped with each other because their raw material properties contain
measurements on different variables due to the different classes of materials.
These blend ratio matrices were designed by a Doptimal design, and
therefore the conditioning is relatively good. In this case, only one class of
raw materials has the property data information (?
??????
?
), and the property
data matrix (?
??????
?
) of the other materials (PLA and additives) is not
67
available [63]. The data structure of the X matrix for this problem of the
mixtureproperty model using ideal mixing rule becomes ?
??? ???
=
[(???
???
)?] = [(?
??????
??
??????
)?
?????
?]. It is also known that many polymer
properties, such as the average molecular weight approximately follow ideal
mixing rules [64].
Figure 3.4: Data structure of X and R matrix for two classes of raw
materials and their blend ratios.
Table 3.1 and Table 3.2 contain the data for the raw material properties
and Table 3.3 contains mixture conditions and product properties [63].
Table 3.1: Starch material property data matrix.
Raw Material Property Data, ?
??????
Starch
I.D.
Amylose
content,
Average molecular weights (?10
3
g/mol)
wt% Mn Mw Mz
Mw/Mn
(PDI)
Starch1 70 41 214.5 463.7 5.23
Starch2 23 62.3 531.4 821.2 8.53
Starch3 0 90.1 722.2 1040 8.02
X
T
starch
(SK x SN)
S
t
ar
c
h
P
r
oper
t
y
Starch
X
T
other
(OK x ON)
O
t
her
P
r
oper
t
y
Other
R = [R
starch
R
other
]
(M x N)
E
xp
.
N
o
Material Proportion
R
other
(M x ON)
R
starch
(M x SN)
68
Table 3.2: Blending ratio and process condition data matrix.
Material Blend Ratio
Process
Condition
Mixture R
starch
R
other
Z
I.D. Starch
1
Starch
2
Starch
3
PLA Additive
1
Additive
2
Molding
temperature
1
0.50 0.00 0.00 0.00 0.50 0.00 150
2
0.41 0.00 0.00 0.18 0.41 0.00 150
3
0.41 0.00 0.00 0.18 0.41 0.00 170
4
0.33 0.00 0.00 0.33 0.33 0.00 150
5
0.33 0.00 0.00 0.33 0.33 0.00 170
6
0.23 0.00 0.00 0.54 0.23 0.00 150
7
0.23 0.00 0.00 0.54 0.23 0.00 160
8
0.23 0.00 0.00 0.54 0.23 0.00 170
9
0.39 0.00 0.00 0.17 0.39 0.05 150
10
0.32 0.00 0.00 0.32 0.32 0.05 150
11
0.22 0.00 0.00 0.51 0.22 0.05 150
12
0.16 0.00 0.00 0.63 0.16 0.05 150
13
0.36 0.00 0.00 0.36 0.24 0.05 150
14
0.36 0.00 0.00 0.36 0.24 0.05 170
15
0.38 0.00 0.00 0.38 0.19 0.05 150
16
0.38 0.00 0.00 0.38 0.19 0.05 170
17
0.36 0.00 0.00 0.36 0.18 0.09 150
18
0.36 0.00 0.00 0.36 0.18 0.09 170
19
0.00 0.52 0.00 0.22 0.26 0.00 150
20
0.00 0.52 0.00 0.22 0.26 0.00 170
21
0.00 0.40 0.00 0.40 0.20 0.00 150
22
0.00 0.40 0.00 0.40 0.20 0.00 170
23
0.00 0.26 0.00 0.61 0.13 0.00 150
24
0.00 0.26 0.00 0.61 0.13 0.00 170
25
0.40 0.00 0.00 0.40 0.20 0.00 150
26
0.40 0.00 0.00 0.40 0.20 0.00 170
27
0.00 0.00 0.40 0.40 0.20 0.00 150
28
0.00 0.00 0.40 0.40 0.20 0.00 170
69
Table 3.3: Mixture product quality data matrix.
Product Quality, Y
Mixture
I.D.
Tensile
strength
Mpa
Tensile
modulus
Mpa
Elongation
at break
%
Density
g/cm
3
1
0.888 18.59 10.1 1.367
2
3.72 162.3 10.1 1.341
3
3.25 166.1 7.6 1.332
4
9.7 629.1 6.59 1.311
5
7.55 489.1 4.02 1.316
6
22.8 1522 7.59 1.296
7
23.8 1469 8.00 1.298
8
20.0 1430 4.21 1.294
9
3.03 134.5 11.00 1.345
10
8.73 568.9 7.3 1.315
11
17.6 1327 20.1 1.296
12
22.4 1700 19.4 1.295
13
14.8 1067 3.46 1.344
14
11.5 692.3 4.15 1.327
15
16.8 1740 2.17 1.346
16
16.8 1386 1.61 1.336
17
12.9 1149 2.20 1.339
18
16.2 1291 1.90 1.331
19
5.08 250.6 13.0 1.391
20
5.09 198.9 15.5 1.377
21
15.1 926.2 6.22 1.367
22
10.4 721.7 3.02 1.347
23
24.5 1557 5.30 1.324
24
19.2 1496 1.99 1.312
25
13.85 937.5 3.32 1.353
26
11.84 854.8 2.06 1.342
27
13.79 784.8 7.03 1.363
28
14.45 968 3.17 1.367
70
3.5.2 Data analysis
The data was first mean centered and unit variance scaled as described in
Section 2.6. Data analysis was performed using the JMP? 9.0 statistical
software package by SAS [65]. PCA was used as the projection method to
analyse multivariate data and examine the overall data structure. Figure 3.5
is an outlier analysis based on Hotelling?s ?
2
. The data contains no outliers.
Figure 3.5: Distribution plot, outlier box plot and normal quantile plot.
Figure 3.6 is a scree plot of eigenvalues vs. number of principal
components (PCs) combined with a pareto plot. The pareto plot shows each
eigenvalue as a percentage of the total eigenvalue. Eigenvalues sum to the
number of variables when the principal component analysis is done on the
correlation matrix. Cumulative percent shows the cumulative percent of
variation represented by the eigenvalues. The scree plot is useful for
visualizing the dimensionality of the data space. In this example, the scree
plot suggests that using three PCs is adequate as any additional PC did not
substantially increase the amount of variance accounted for. The first three
principal components (i. e.? = 3) that are guaranteed to be orthogonal,
captured 86.82% of the total variance of ?
??? ???
[= (?
??????
??
??????
)?
????
?]
4
6
8
10
12
14
28
27
25
23
24
1
26
18
17
20
19
10
1214
9
115
321
2
22415
13
16
8
6
7
1.64 1.28 0.67 0.0 0.67 1.28 1.64
0.1 0.2 0.4 0.6 0.8 0.9
Normal Quantile Plot
71
data. The score variables generated by PCA are optimal summaries of the
original variables. Table 3.4 and Table 3.5 contain the score and the loading
values, respectively, for X and Y blocks.
Figure 3.6: Scree plot and pareto plot for PCA on Xvariables.
Figure 3.7 and Figure 3.8 present loading plots combined with score
plots for the PCA on Xvariables. There are 28 mixture observations involved
in this problem. Since a three dimensional plot is the highest possible
dimensional plot that can be constructed, only three variables can be studied
at a time. Therefore, the relationship between all 28 observations is hidden in
a highly dimensional space. The score plot projects high dimensional
variables onto low dimensional variables thereby facilitating the analysis of
all 28 observations simultaneously on a two dimensional plot (see Section
2.6.1.2). The loading plot shows the relationships among the 7 variables. The
variables that are most influential for the model are found on the periphery
whereas the less influential variables are encountered around the origin of
the loading plot (see Section 2.6.1.3). Since the score and the loading are
complementary and superimposable, a biplot can be created that combines
both plots (Figure 3.7 and Figure 3.8).
72
Figure 3.7: Combined PCA score and loading plots (Biplot) on first and
second components for Xblock.
Figure 3.8: Combined PCA score and loading plots (Biplot) on first and
second components for Yblock.
? Mixtures
? Mixture variables
? Mixtures
? Product variables
73
It is clear from Figure 3.7 that the variables PDI, Mn, Mz and Mw are
positively correlated, i.e. when one increase or decreases, the other has a
tendency to change in the same direction. On the other hand, the variable
molding temperature is negatively or inversely correlated to amylose% i.e.
increase in one will have inverse effect on the other, and vice versa.
Similarly, Figure 3.8 shows a loading plot combined with a score plot for the
PCA on Yvariables (which explained 86.79% of the variance of the final
product properties). In this loading plot all product property variables, except
the tensile strength and the tensile modulus, have the tradeoff relationship
among them.
Figure 3.9 shows the regression coefficients for the final product
properties and how the starch properties, PLA and additives are correlated.
For instance: high additive1, low molding temperature and low amylose
content leads to high elongation. High lactic acid and low additive1 leads to
high tensile strength and modulus. The observations are well known to most
experienced polymer chemists [64]. The results obtained here are consistent
with Muteki?s results [63]. Since only three properties can be represented in
the ternary cluster diagram and tensile strength and modulus are affected
similarly by the mixture factors, only one of them will be used together with
elongation and density as three targeted product properties. However, in
situations where more than three properties have to be considered, an
algebraic approach may be used [24, 21].
Figure 3.9: PLS model coefficients for blend property matrix Y using four
latent factors.
74
Table 3.4: PCA score values for X and Yblocks.
X score value Y score value
t
1
t
2
t
3
t
1
t
2
0.252 4.221 0.155 2.814 0.288
0.426 2.466 0.204 1.928 0.625
0.279 2.257 1.655 1.743 0.299
1.022 0.948 0.248 0.215 0.330
0.874 0.740 1.611 0.597 0.179
1.784 1.043 0.302 2.264 0.566
1.711 1.148 0.628 2.248 0.618
1.637 1.252 1.557 2.023 0.002
0.922 2.265 0.817 2.117 0.737
1.459 0.858 0.854 0.463 0.413
2.204 1.049 0.909 1.237 2.864
2.657 2.219 0.942 2.147 2.688
1.217 0.342 0.952 0.259 0.770
1.070 0.133 0.908 0.196 0.346
1.093 0.018 1.011 1.222 1.133
0.946 0.191 0.848 1.002 1.043
1.507 0.072 1.494 0.301 0.936
1.359 0.280 0.365 0.918 0.905
4.256 0.867 1.106 2.659 0.435
4.403 0.658 0.753 2.535 1.090
2.462 0.608 0.949 0.355 0.572
2.609 0.816 0.910 0.588 0.832
0.369 2.330 0.767 2.025 0.247
0.516 2.539 1.093 1.765 0.659
0.602 0.111 0.408 0.134 0.900
0.455 0.098 1.451 0.188 0.958
4.105 1.016 1.259 0.590 0.347
4.252 1.224 0.601 0.287 1.128
75
Table 3.5: PCA loading values for X and Yblocks.
X loading value
Y loading value
p
1
p
2
p
3
p
1
p
2
0.276 0.469 0.097 0.616 0.012
0.460 0.066 0.069 0.132 0.924
0.468 0.052 0.078 0.474 0.374
0.468 0.036 0.067
0.447 0.118 0.043
0.122 0.623 0.008
0.029 0.599 0.095
0.231 0.013 0.359
0.072 0.102 0.914
3.5.3 Model development
Latent variable models were developed using three latent variables (?
?
,
obtained from PCA) to predict three product properties: tensile strength (TS),
elongation at break (EB) and density (Rho). The general form of the equation
is:
3 33 3
2
0
11
?
i i ij i j ii i
i i j ji i
y t tt t?? ? ?
= <> =
=++ +
? ?? ?
(3.11)
Since addition of parameters result in the risk of overfitting, it is
important to select the optimum number of parameters during model
selection. Bayesian Information Criterion (BIC) introduces a penalty term for
the number of parameters chosen in the model. BIC can be expressed as:
2log ln( )BIC likelihood B M=?+
(3.12)
where, B = the number of parameters, including intercept and error terms
in the model
M = the number of observations in the data set.
76
Minimization of BIC was used as a criterion during the model selection
process in the statistical package JMP 9.0 and the detailed expressions for
the property models are as follows:
2
0 1 1 2 2 3 3 13 1 3 33 3
TS t t t t t t?? ? ? ? ?= + + + + ?+
(3.13)
22
0 1 1 2 2 3 3 12 1 2 23 2 3 11 1 22 2
EB t t t t t t t t t?? ? ? ? ? ? ?= + + + + ?+ ?+ +
(3.14)
22
0 1 1 2 2 3 3 12 1 2 11 1 22 2
Rho t t t t t t t?? ? ? ? ? ?= + + + + ?+ +
(3.15)
The estimated regression coefficients for the respective property models
are listed in Table 3.6. Figure 3.10 shows a plot with actual versus predicted
properties.
Table 3.6: The regression coefficients for the expressions in Eqs. (3.13)
through Eqs. (3.15).
Regression
coefficients
TS
(MPa)
EB
(%)
Rho
(g/cm
3
)
0
? 14.59 2.534 1.345
1
? 1.119 1.328 0.0146
2
? 3.938 0.4816 0.0073
3
? 0.1558 0.7965 0.0068
12
?  0.7488 0.0016
23
?  0.6859 
13
?
0.4749  
11
?  0.7361 0.0018
22
?  0.5084 0.0013
33
?
1.59  
?
2
0.917 0.698 0.852
?
adj
2
0.898 0.593 0.810
77
Figure 3.10: Predicted vs. actual product properties using PCR model.
3.5.4 Design of desired products in score space
It is desirable to cover a wide range of product properties with a minimum
number of products in order to minimize the manufacturing costs, inventories
and material cost. Figure 3.11 is a product properties score plot (same as
Figure 3.8) where the desired product properties can be independently
selected such that a set of product grades spans the desired property space
[63]. Here five target products are selected in the score space to provide a
wide range of final product properties. Table 3.7 shows the score values from
the selected points in Figure 3.11. Using the score and the loading values
from the PCA on Y (Table 3.5), the properties for the desired products
(tabulated in Table 3.8) can be calculated as
2
,
1
? 1
=
??
=?+ ? ?
??
??
?
T
des h des a a h
a
y x t ps
(3.16)
where, 1 ??? = the variable average
? = the standard deviation originated from the preprocessing step.
78
Figure 3.11: Visualization of target product properties in score space.
Table 3.7: Score values for desired products in Figure 3.11.
Products t
1
t
2
D1 0.0 0.0
D2 2.0 0.0
D3 1.0 2.0
D4 2.0 1.0
D5 0.0 1.0
Table 3.8: Desired product properties for desired products in Figure 3.11.
Products
TS
(MPa)
EB
(%)
Rho
(g/cm
3
)
D1 13.06 6.861 1.335
D2 21.32 5.510 1.309
D3 17.03 15.64 1.302
D4 4.725 12.94 1.350
D5 13.14 2.136 1.345
? : Existing Products Grades
? : Desired Product Properties
D4
D1
D3
D2
D5
79
From the five desired products, product D2 (from Table 3.8) was selected
as a target product (?) in score space. The mixture conditions (i.e., the raw
materials, their blend ratios, and process condition) that give the desired
product properties of the target product can be estimated by minimizing the
sum of square differences between the target and predicted properties. The
results are listed in Table 3.9. Using this information, a new blend product
can be synthesized.
Table 3.9: Required mixture conditions to achieve target product
properties.
Target R
starch
R
other
Z
Starch
1
Starch
2
Starch
3
PLA
Add
1
Add
2
Temp
o
C
D2 0.171 0.132 0.053 0.404 0.223 0.016 159.5
Muteki used a mixed integer nonlinear programming (MINLP)
algorithm to find raw materials that were not part of previous product
grades. The results can be found in [63]. Such optimization is outside of the
scope of this study.
3.5.5 Design of desired products in cluster space
The ternary cluster space provides an excellent platform for simultaneous
visualization and solution of mixture design problems. Figure 3.12 is a
ternary cluster plot with a feasibility region for blend product properties that
incorporates all the existing product grades and all five desired product
properties identified from Figure 3.11 and Table 3.8. The method to convert
score values to cluster values is discussed in Section 2.7. The mixture
feasibility region in cluster space ensures that the products within its
boundary are physically feasible, consistent with past operating strategies
and expected to yield the desired product qualities.
80
Figure 3.12: Visualization of starch blending formulation in cluster space.
Analogous to the use of the score plot to identify a product whose
properties span the desired property space, cluster space can achieve similar
objectives. Cartesian coordinate points for D2 cluster are (0.634, 0.540). Since
property clusters are tailored to maintain the fundamental rules for intra
and inter stream conservation, the mixing operation can be optimized using
leverarm analysis. For instance, if two polymer product grades are
compatible, i.e. miscible, they can be mixed together to get a polymer with
properties somewhere between those of the two polymers mixed. In latent
property cluster space, one can rapidly screen out inherently infeasible
combinations of candidate constituents visually. The reduced search space
can then be explored for the feasibility of formulating binary, ternary and
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Rho
EB
TS
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Mixture Design
Feasibility Region
? : Desired Product Properties
? : Target Product Properties
? : Existing Blend Products
D3
D2
D1
D5
D4
8
24
15
16
13
18
4
10
26
5
22
2
9
3
25
6
81
multicomponent mixtures to achieve optimum products based on leverarm
analysis without extensive enumeration.
For example, the target product (D2) synthesized in Table 3.9 can also
be formulated by mixing existing blends 24 and 25 with blend 6 (denoted as
24256 in Table 3.10). In Figure 12, M1 represents a mixture of existing
blend products 24 and 25 in a proportion 0.640 and 0.360 respectively. When
M1 is mixed with blend 6 in a proportion 0.207 and 0.793 respectively,
product target (D2) with property targets (listed in Table 3.8) can be
achieved. A simple visual analysis of Figure 3.13 suggests that the binary
mixture of existing blend products is inherently infeasible to achieve the
target product.
Figure 3.13: Visualization of starch blending formulation in cluster spac
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Rho
EB
TS
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
6
25
24
Target
M1
Mixture Design
Feasibility Region
?: Possible Mixture of Blends
? : Desired Product Properties
? : Target Product Properties
?: Existing Blend Products
?
6
?
M1
82
Table 3.10: Candidate ternary mixtures and fractional contributions of the
constituents.
S.N. Mixture (ABC) Feasible A B C
1 826 Yes 15% 5% 80%
2 836 No 12% 6% 82%
3 846 No 10% 10% 80%
4 856 No 6% 11% 83%
5 896 Yes 16% 5% 79%
6 8106 No 11% 8% 81%
7 8136 Yes 15% 8% 77%
8 8156 Yes 14% 13% 73%
9 8166 No 1% 21% 78%
10 8186 No 1% 20% 79%
11 8226 No 12% 7% 81%
12 8256 Yes 16% 7% 77%
13 8266 No 9% 9% 82%
14 2426 Yes 12% 6% 82%
15 2436 No 10% 7% 83%
16 2446 No 8% 10% 82%
17 2456 No 5% 12% 83%
18 2496 Yes 14% 6% 80%
19 24106 No 9% 9% 82%
20 24136 Yes 12% 10% 78%
21 24156 No 12% 14% 74%
22 24166 No 1% 23% 76%
23 24186 No 1% 21% 78%
24 24226 No 10% 8% 82%
25 24256 Yes 13% 8% 79%
26 24266 No 7% 10% 83%
However, a total of 26 candidate ternary mixtures were identified. In
order to validate the feasibility of the designed formulation, the AUP values
of the formulated mixtures and the target must match along with matching
the cluster targets. After performing this analysis, only 9 out of 26 identified
83
candidate ternary mixtures matched the cluster target and AUP value of the
target product and the results are summarized in Table 3.10. The 17
infeasible mixtures, although, matched the cluster targets but had AUP
values higher than ?1% of the target value (1.802). Therefore, the
discrepancy was assumed to be higher than the accuracy that can be expected
by graphical leverarm analysis.
3.6 Conclusion
This research effort has focused on the formulation and solution of product
design problems by systematic and insightful use of past data. Properties of
the raw materials, their blend ratios, and the process conditions are used to
predict and enhance the performance of a target product. Using the duality of
linear programming to solve the design problem in the lower dimensional
property domain, instead of high dimensional component space, significantly
reduced the computational complexity of the problem. Moreover, the ternary
diagram provides a quick targeting tool that aides in the evaluation, analysis,
and screening of alternatives. The approach differs from conventional
techniques because it is noniterative, avoids the combinatorial explosion
when multiple components are involved, and avoids the difficulty of
formulating and solving the mixed integer nonlinear programming involved
in many mixture design problems. Such a practice in industry results in
fewer trials and experiments to run, thereby saving resources, capital and
most importantly the product development time.
However, the method and example presented above is limited to
selecting, i.e. identifying candidates from a database of known raw materials.
The raw materials used, are selected from a list of predefined candidate
components, therefore limiting the performance to those components. The
problem here is that these decisions are made ahead of design and are
usually based on qualitative (or at best quantitative) process knowledge
84
and/or experience and thus possibly yield a suboptimal design. In order to
guarantee global optimality, all possible compounds must be considered.
Chapter 4 will present a biodiesel additive design problem to demonstrate
molecular design concepts.
85
CHAPTER 4
DESIGN OF BIODIESEL ADDITIVES
4.1 Introduction
High prices and environmental impact related to fossilbased raw materials
are the driving force for sustainable development concepts and their use in
industry. This will also have implications for the design of chemical products
and their production routes. Biofuels, mainly ethanol used in gasoline
engines and fatty acid alkyl esters (biodiesel) as well as their blends with
petrodiesel used in diesel engines, are some of the few alternatives that have
not required significant new infrastructure or change on the part of
consumers or auto manufacturers. As a result, these biofuels are the most
widely deployed substitute for conventional fossil fuels in transportation
today. In the U.S., biofuels can currently be blended up to 10% (ethanol) and
20% (biodiesel) in every gallon of fossil fuel. Among these alternatives,
biodiesel has become a fast growing market and is expected to outpace
gasoline demand [66, 67, 68].
Some advantages of biodiesel, compared to petrodiesel, include:
? renewability and domestic origin,
? biodegradability and sustainable,
? higher flash point,
? reduction of most regulated exhaust emissions due to lack of sulfur,
? miscibility in all ratios with petrodiesel,
? compatibility with the existing fuel distribution infrastructure, and
? inherent lubricity.
86
Any type of feedstock that contains free fatty acids and/or triglycerides
such as vegetable oil, waste oil, animal fat, and waste grease can be
converted into biodiesel. The American Society for Testing and Materials
(ASTM) defines biodiesel as a fuel comprised exclusively of monoalkyl esters
of longchain fatty acids derived from vegetable oils or animal fats,
designated B100 (100% pure biodiesel), meeting the requirements of ASTM
designation D6751 [69].
Biodiesel is a very good example of chemical product design. A high
viscosity and high melting point make its use directly as a fuel in common
engines difficult. The conversion of triglycerides into methyl or ethyl esters
through the transesterification (also called alcoholysis) process reduces the
molecular weight to onethird that of the triglyceride, reduces the viscosity by
a factor of about eight and marginally increases the volatility [70].
Transesterification is an ester conversion process that splits up the
triglyceride (TG); that is, it takes the glycerol (GL) of the TG and replaces it
with alkyl radical of the alcohol used [71]. In Figure 4.1, R
1
, R
2
, and R
3
represent long chain fatty acid radicals of the mixed TG used whereas R'
represents an alkyl radical of the alcohol used. Some common fatty acids,
which exist in the triglyceride molecule are shown in Table 4.1.
Figure 4.1: Overall stoichiometric transesterification reaction scheme.
Glycerol is the main byproduct in biodiesel production; almost 13% of
glycerol comes from biodiesel production. Therefore, a valuable use of glycerol
O
R
1
C O CH
2
+
3R' OH
(catalyst)
O
R
2
C O CH
O
R
3
C O CH
2
O
R
1
C O R'
O
R
2
C O R'
O
R
3
C O R'
+
CH
2
CH
2
OH
CH OH
OH
Triglyceride Alcohol Glycerol Biodiesel
87
is a very important success factor for this technology. Product design
principles have to be applied to search for profitable applications that also
fulfill the requirements of the sustainability principles [72].
Table 4.1: Chemical structures of common fatty acids.
Fatty Acid CXX:Y Chemical Structure
Myristic acid (C14:0) CH
3
(CH
2
)
12
COOH
Palmitic acid (C16:0) CH
3
(CH
2
)
14
COOH
Stearic acid (C18:0) CH
3
(CH
2
)
16
COOH
Oleic acid (C18:1) CH
3
(CH
2
)
7
CH=CH(CH
2
)
7
COOH
Linoleic acid (C18:2) CH
3
(CH
2
)
4
CH=CHCH
2
CH=CH(CH
2
)
7
COOH
Linolenic acid (C18:3) CH
3
CH
2
CH=CHCH
2
CH=CHCH
2
CH=CH(CH
2
)
7
COOH
Arachidic acid (C20:0) CH
3
(CH
2
)
18
COOH
Behenic acid (C22:0) CH
3
(CH
2
)
20
COOH
Erucic acid (C22:1) CH
3
(CH
2
)
7
CH=CH(CH
2
)
11
COOH
If methanol is used, the biodiesel produced is fatty acid methyl ester
(FAME). For example, the structure of stearic acid methyl ester can be
obtained by replacing the H atom in the COOH group with a CH
3
group as
shown in Figure 4.2.
O
CH
3
O
Figure 4.2: Stearic acid methyl ester.
Figure 4.3 shows a typical triglyceride structure of soybean oil that is
made up of mixed fatty acid fragments and a glycerol fragment.
The physical and chemical fuel properties of biodiesel basically depend
on the fatty acids distribution of the triglyceride used in the production. Fatty
acids vary in their carbon chain length and in the number of double bonds
(unsaturation level), and are represented by C XX:Y where ?XX? is the number
88
of carbon atoms and ?Y? is the number of double bonds. The fatty acid
distributions of some feedstock commonly used in biodiesel production are
shown in Table 4.2 [73].
Figure 4.3: A typical triglyceride molecule with different fatty acid chains
of soybean oil.
Since the fatty acid profile varies with different feedstock, the final
quality of the biodiesel varies depending upon the parent feedstock used. A
feedstock dependent fuel property is one of the technical difficulties
associated with biodiesel that have limited its wide usability.
Table 4.2: Fatty acid profiles of some common biodiesel feedstock
Fatty acids (wt %)
Feedstock C14:0 C16:0 C16:1 C18:0 C18:1 C18:2 C18:3
Sunflower ? 6.08 ? 3.26 16.93 73.73 ?
Rapeseed ? 3.49 ? 0.85 64.40 22.30 8.23
Soybean ? 10.58 ? 4.76 22.52 52.34 8.19
Soybean soapstock ? 17.20 ? 4.4 15.7 55.6 7.10
Used frying oil ? 12.00 ? ? 53.00 33.00 1.00
Tallow 3?6 24?32 ? 20?25 37?43 2?3 ?
Lard 1?2 28?30 ? 12?18 4?50 7?13 ?
Yellow grease 2.43 23.24 3.79 12.96 44.32 6.97 0.67
Brown grease 1.66 22.83 3.13 12.54 42.36 12.09 0.82
O
C O CH
2
O
C
O CH
O
C
O CH
2
Glycerol
fragment
CH
3
(CH
2
)
7
CH=CH(CH
2
)
7
COOH
CH
3
(CH
2
)
4
CH=CHCH
2
CH=CH(CH
2
)
7
COOH
Fatty acid
fragment
Oleic acid (C18:1)
Linoleic acid (C18:2)
Linolenic acid (C18:3)
CH
3
CH
2
CH=CHCH
2
CH=CHCH
2
CH=CH(CH
2
)
7
COOH
89
Transesterification does not alter the fatty acid composition of the
feedstocks and this composition plays an important role in some critical
parameters of biodiesel. The vegetable oils are mainly characterized by
certain fuel related properties. Some of them are tabulated in Table 4.3 [74].
Table 4.3: Fuel properties of biodiesel fuels and diesel
Vegetable
oil
Kinematic
viscosity at 38
o
C
Cetane
No.
Heating
Value
Cloud
Point
Pour
Point
Flash
Point Density
Carbon
residue
(mm
2
/s) (
o
C) (MJ/kg) (
o
C) (
o
C) (
o
C) (kg/l) (wt.%)
Sunflower 34.4 36.7 39.6 7.2 15.0 274 0.916 0.27
Rapeseed 37.3 37.5 39.7 3.9 31.7 246 0.912 0.30
Soybean 33.1 38.1 39.6 3.9 12.2 254 0.914 0.25
Peanut 40.0 34.6 39.8 12.8 6.7 271 0.903 0.24
Palm 39.6 42.0 ? 31.0 ? 267 0.918 0.23
Cottonseed 33.7 33.7 39.5 1.7 15.0 234 0.915 0.24
Corn 35.1 37.5 39.5 1.1 40.0 277 0.909 0.24
Diesel 2.04.5 51.0 43.8 ? 18 25 55
0.820
0.860
?
4.2 Structure Property Relationships
Chain length and number, position and configuration of double bonds account
for the variation in physical properties of fatty acids. Saturated chains are
highly flexible but the fully extended conformation is the most stable because
of the lack of steric interference. The tetrahedral bond angle on carbon
results in a molecular geometry for saturated fatty acids that is relatively
linear (see Figure 4.2). This molecular structure allows a close arrangement
of the fatty acid molecules with strong intermolecular interactions.
On the other hand, the introduction of double bonds with predominantly
cisconfiguration in the hydrocarbon chain in unsaturated fatty acids results
in bends of about 30 degrees in the molecular geometry (see Figure 4.4). The
molecules do not arrange very closely due to chain branching. It allows more
flexibility and weaker van der Waals force between the molecules.
90
O
CH
3
O
Figure 4.4: Oleic acid methyl ester.
4.3 Technical Difficulties with Biodiesel Use
Although, biodiesel, because of its biological origin, has many advantages
compared to its petroleum counterparts, it also has several technical
problems that have persisted to the present and have impaired its use and
commercialization. Three major limitations are [71]:
? Oxidative stability: Biodiesel undergoes oxidative degradation
over time, mainly influenced by temperature and oxygen
availability/exposure. Residual products of biodiesel such as insoluble
gums, organic acids, and aldehydes formed from the degradation may
cause engine deposits and injection problems. Biodiesel is essentially
nonaromatic. Petroleum diesel contains essentially no olefinic bonds,
while biodiesel can contain a significant number of these reactive,
unsaturated sites that provide pathways for oxidation instability.
? Lowtemperature Operability: Biodiesel contains wax molecules,
which are dissolved in the fuel at higher temperatures. As the fuel
temperature drops, the wax molecules begin to crystallize. At lower
temperature, the larger crystals fuse together and form agglomerations
that eventually prevent pouring of the fuel and plugging of filters.
? Feedstock dependent fuel property: The fatty acid profile of
biodiesel is identical to that of the parent oil or fat. Therefore,
feedstock origin will impact the final quality of the biodiesel product.
To produce fuel grade biodiesel, the characteristics of the feedstock are
91
very important during the initial research and production stage (fuel
properties of diesel and biodiesel fuels from various sources are given
in Table 4.3.)
Moreover, simultaneous solution of these problems has proven difficult
as improvements in one area tend to impair another. Due to the inverse
relationship between oxidative stability and lowtemperature operability, the
design of an optimal fuel for all environments can be a rather difficult task.
Structural factors that improve oxidative stability adversely influence low
temperature operability and vice versa. Figure 4.5 is a graphical
representation of the effects of the biodiesel feedstock profile on the biodiesel
properties published by the National Renewable Energy Laboratory in 2007.
Figure 4.5: Compositions of fats and oils and their effects on the fuel
properties.
To ensure a uniform quality of biodiesel produced from vegetable oils or
animal fats, the final products must meet stringent international and
regional quality requirements such as ASTM D 6751 [69] in the U.S. and EN
14214 [75] in Europe. These standards identify the parameters the pure
biodiesel (B100) must meet before being used as a pure fuel or being blended
with petroleumbased diesel fuel. Table 4.4 provides the specifications for
biodiesel and diesel.
92
Table 4.4: Specifications for biodiesel and diesel
Specification
EN
14214:2008
ASTM
D 6751:2009
EN
590:1999
Applies to Units FAME
a
FAAE
b
Diesel
Kinematic Viscosity mm?/s, @
40?C
3.55.0 1.96.0 2.04.5
Cetane number  51 minimum 47 minimum 51 minimum
Cloud point ?C  report 
CFPP ?C Location and time
specific
 Location and
time specific
Oxidation stability hr, @110?C 6 minimum 3 minimum N/A (25 g/m
3
)
a
refers to fatty acid alkyl esters
b
refers to fatty acid methyl esters
The listed specifications in biodiesel standards are directly influenced by
the fatty acid profile of the biodiesel fuel which is directly influenced by the
fatty acid profile of parent oil used (see Table 4.1 and Table 4.2).
Figure 4.6: Approaches to improving biodiesel fuel properties.
Several pathways are possible for improving the fuelrelated properties
of biodiesel. Figure 4.6 presents an overview of the various approaches (A to
E) that have been explored [76, 77].
A
ppr
o
a
c
he
s
Unchanged fatty
ester composition
Additives
A
Unchanged fatty
ester composition
Change alcohol
B
Change fatty acid
profile
Physical procedures
C
Inherently different
fatty acid profile
Genetic
modification
D
Alternative
feedstock
E
93
4.4 Fuel Additives
Fuel additives, such as antioxidants, cetane enhancers, or coldflow
improvers have become a common and indispensable tool not only to decrease
the drawbacks described above but also to produce specified products that
meet international and regional standards, allowing fuels trading to take
place. However, an additive solution to one problem often aggravates another
problem. In addition, the questions of additive compatibility, required
addition levels, the effect on other properties, and whether these additives
function as designed for biodiesel fuels with differing fatty acid profiles still
remain challenges that require further investigation [78].
Therefore, it is desired to molecularly design biodiesel additives to
account for the unintended effect on other fuel properties in the neat and the
blend fuel in order to achieve the performance properties of the petroleum
based fuel. In this way, biofuels can be formulated that are adaptable to a
range or blend of feedstocks and the desirable fuel characteristics like
oxidative stability and wide operating temperature range. Moreover, such a
sustainable biofuel must also meet the specifications required by the
transportation and the aviation industry.
4.5 Additive Design
Fuel characterization data obtained with near infrared spectroscopy (see
Section 2.4) is combined with property clustering techniques (see Section
2.2.2) in a reverse problem formulation (see Section 2.2.1) to design additive
molecules which, when mixed with offspec biodiesel, produce biodiesel that
meets the desired fuel specifications. The characterization data consists of a
multitude of property values (such as cetane number, melting point, and
kinematic viscosity) specified by the aviation industry to ensure adequate
performance. To facilitate an efficient design we propose consolidating these
various properties into a latent property domain using principal component
94
analysis (PCA) and principal component regression (PCR) techniques (see
Section 2.6.1 and 2.6.2). Characterizationbased molecular design using
group contribution parameters are then used to build novel additives that
match the fuel specifications in the latent property space. Sustainability may
be controlled by environmental and ecological constraints on the design of the
additives. The additives found can then be used to offset the impact of the
feedstock residuals on the biofuel blend properties.
4.5.1 Types of Additives
Different additives that are commercially available to improve diesel fuel
performance are selected as the training set molecules [79, 80]. Antioxidant
additives can help slow the degradation process and improve fuel stability.
Cold flow enhancers can improve the coldflow properties to solve the low
temperature operability problem. Cetane improvers can help improve ignition
properties thereby reducing NOx emission. Table 4.5 shows some of the
available additives used as training set in this formulation study.
Figure 4.7. Target specifications for biodiesel and its blend in terms of
cetane number, melting point, and kinematic viscosity.
95
Since cetane number, cold flow, kinematic viscosity, and oxidation
stability are critical properties for the operation of a fuel in a diesel engine,
Knothe [78] proposed a threedimensional plot of the cetane number, melting
point, and kinematic viscosity to describe the optimum properties to meet the
ASTM D6751 fuel requirements. Figure 4.7 visually represents this required
biodiesel property space by the shaded rectangular box [78].
4.5.2 Additive Property Estimation
Since many additives are proprietary products, their physicalchemical
properties are not readily available and therefore experimental data such as
melting point temperature, viscosity and cetane number are difficult to find.
Therefore, for the commercially available additive compounds, these
properties are estimated using structural information alone. The estimated
values are later used as the measured values for property model calibration.
However, the property values used in the calibration would normally be
measured in an industrial setup.
Melting temperature: The normal melting point temperature (?
?
, K) values
were estimated using a group contribution expression [34]:
1 23
exp
m
i mi j m j k m k
ij
mo
T
NT M T OT
t
??
=++
??
??
?? ?
(4.1)
where, ?
?
, ?
?
and ?
?
are the number of first, second, and thirdorder groups
of types i, j and k, respectively.
Viscosity: The dynamic viscosity (?, mPa.s) values at 300 K were estimated
using the GC
+
method. GC
+
combines the group contribution (GC) method
and the atomconnectivity indices (CI) method [81].
( )
1 23
ln( )
ii j j k k
ij k
N M OF? ? ? ??
?
=+ ++
?? ?
CI TermsGC Terms
????????????????
(4.2)
96
where, ?(?
?
) is a function of viscosity for all missing GC groups/fragments.
( )
1
()
K
k
k
k
F nF d??
?
=
= +
?
(4.3)
where, ?(?)
?
= a function of the viscosity contribution for a missing GC
group/fragment ?,
K = the number of missing groups/fragments,
n = the number of times a missing group/fragment appears in the
molecule, and
d = a constant [81].
( ) ( )
01
() 2
ii
i
F aA b c
??
? ??=++
?
(4.4)
where, ?
?
= the occurrences of the i
th
atom in the molecular structure,
?
?
= the contribution of atom i, and
b & c = adjustable parameters.
The zeroorder (atomic) connectivity index (
0?
? ) and the firstorder (bond)
connectivity index (
1?
? ) are defined by [82]
01
11
where,
LM
k
ij
k
ii
? ? ??
?
? ? ? ??
??= =
?? ??
?? ??= = =
?? ??
??
(4.5)
where, L = the number of atoms in the hydrogen suppressed graph,
M = the number of bonds in the graph, and
?
?
= the bond indices defined by atomic indices ?
?
(the values can
be found in [81]).
The kinematic viscosity can be converted into dynamic viscosity through the
density.
Cetane number: The correlation used for estimation of the cetane number of
additives is from Lapuerta et al. [83]
97
( )
22
21.157 7.965 1.785 0.235 0.099
db db c c
CN N N N N=?+? + ?
(4.6)
where, ?
??
= the number of double bonds and
?
?
= the number of carbon atoms in the molecule.
Table 4.5 tabulates the estimated properties for different type of additives.
Table 4.5: Commercially available diesel additives and their estimated
properties.
Type Compound exp(T
m
/T
mo
) ln(?) CN
Ox
ida
t
iv
e
s
t
a
b
il
it
y
impr
o
v
e
r
s
Butylated
hydroxytoluene
(BHT) 10.11 0.465 76.04
tbutylhydroquinone (TBHQ)
13.26 0.465 48.59
Isopropyl alcohol (IPA)
4.114 0.465 1.847
Pyrogallol (PY)
18.791 0.465 14.79
Methyl tertbutyl ether (MTBE)
3.484 0.465 16.19
C
old
f
low
impr
o
v
e
r
s
Polymethyl
Methacrylate
(PMMA)
3.035 1.062 15.19
Ethylene glycol methyl
ether acrylate
(EGMEA)
4.299 0.517 7.289
C
et
a
n
e n
u
m
b
er
impr
o
v
e
r
s
Ditertbutyl peroxide (DTBP) 3.138 3.376 36.23
Methyl Oleate (MO)
7.002 2.177 44.47
Ethylene glycol (EG)
6.497 3.154 5.623
Oleic acid (OA)
12.47 3.134 31.82
Stearic acid (SA)
11.84 3.924 62.24
4.5.3 Characterization of the Additive Molecules
Infrared (IR) spectroscopybased characterization was used to determine the
chemical constituents or molecular structures of the additive training set that
describe the orientation and alignment of these molecules. Using the NIST
Webbook [39], the complete IR spectral region (4,000400) cm
1
for twelve
additives including antioxidants, cetane enhancers, and coldflow improvers
98
were obtained. Ideally, both IR and NIR data would be available for the
training set to provide better predictions, but unfortunately the availability of
NIR data for the components of interest is severely limited.
Figure 4.8: Infrared spectra of diesel additive molecules.
Figure 4.8 is IR spectra of the diesel additive molecules tabulated in
Table 4.5 that serve as the descriptors of their molecular architecture. The IR
spectra were translated to discrete variables by a process of digitization. In
this process, each spectrum is fragmented into small equal fragments (2 cm
1
resolution) along the wavelength axis resulting with 1801 frequencies
(descriptor variables).
4.5.4 Additive IR Data Analysis
The principal component scores were used to describe the variation in the
multivariate characterization data with a minimum of variables to elucidate
the underlying structure of the data. Principal components (PCs) captured
the most variation possible in the smallest number of dimensions and
0
50
100
4
00
8
00
12
00
16
00
20
00
24
00
28
00
32
00
36
00
40
00
T
r
ans
m
i
t
t
anc
e (
%
)
Wavenumber (cm1)
BHT IPA PY MTBE PMMA EGMEA
DTBP MO EG OA SA TBHQ
Functional group region Fingerprint region
99
consolidate multiple property effects into single, underlying latent variables
which are devoid of collinearity. The first three principal components
captured 69.53% of the total variance of the standardized IR data (Figure
4.9).
Figure 4.9: Scree plot for PCA on additive IR data.
Figure 4.10: PCA score plots on first third PCs for additive IR data.
? : Oxidative Stability Improver
? : Cold Flow Improver
? : Cetane Number Improver
100
Figure 4.10 is a score plot that maps the three different additives types
involved in the training set. Additives that are close to each other should
have similar properties, whereas those far from each other should be
dissimilar with respect to the three properties of interest. It is clear from the
picture that cold flow additives (?) are together in all the plots. However,
oxidative stability (?) and cetane improvers (?) are scattered. This may
suggest that an additive that improves oxidative stability also improves
cetane and vice versa. This information is supported by the trend depicted in
Figure 4.5.
4.5.5 Latent Variable Model Development
As each score vector ti is a linear combination of the initial predictor variables
X (the IR data), nonlinear regression models that describe response variables
Y (the fuel properties) subjected to the X values are obtained. First, the PCA
for variable reduction and second, multiple regression for calibration model
development could be considered as a nonlinear PCR (NPCR). Nonlinear
Principal Component Regression resulted in second order regression models
for melting point (Eq. (4.7)) and kinematic viscosity (Eq. (4.8)), whereas, a
third order regression model was developed for cetane number (Eq. (4.9)). The
generalized forms of these latent variable property models are as follows:
3 33
0
1
exp
m
i i ij i j
i i jjimo
T
t tt
T
?? ?
= <>
??
??
??
=++
? ??
(4.7)
3 33
0
1
ln( )
i i ij i j
i i j ji
t tt?? ? ?
= <>
=++
? ??
(4.8)
33 3
2
0
11
33
i i ii i ijk i k
i i kj
j
i jji
CN PC t t tt?? ? ?
= = ><>
=+ ++
?? ???
(4.9)
The left hand sides (LHS) of the above models are forced to have a
particular form such that the respective properties follow linear additive
101
rules as described by the general group contribution model equation (Eq.
(2.13)). Table 4.6 tabulates the regression coefficients for the latent variable
models represented by Eq. (4.7) through Eq. (4.9). Figure 4.11 is a plot
comparing the actual versus the model predicted properties.
Table 4.6: Model coefficients using PCR.
Properties ??? (?
?
?
??
)? ??(? ) CN
0
? 8.1702 0.8132 65.89
1
? 0.0793 0.0005 0.2833
2
? 0.1758 0.0545 0.0195
3
? 0.0401 0.0021 1.3906
12
? 0.0048 0.0033 
13
? 0.0063 0.0024 
23
? 0.0144 0.0015 
11
?   0.0491
22
?   0.3826
33
?   0.0632
123
?   0.0038
?
2
0.816 0.764 0.790
?
adj
2
0.595 0.480 0.422
Figure 4.11: Predicted vs. actual product properties using PCR model.
102
4.5.6 Translating Physical properties to Latent Properties
The purpose of this case study is to identify candidate additives which, when
mixed with a crude (offspec) biodiesel, produce a fuel that meets the required
fuel specifications presented in Table 4.7. Table 4.7 has much more stringent
property constraint criteria than the one defined in Figure 4.7. Here,
kinematic viscosity of 6.0mm
2
/s at 40
o
C is the maximum limit in ASTM
D6751 (Table 4.4).
Table 4.7: Biodiesel target properties.
Fuel Property
T
m
[C]
? [mm
2
/s]
@ 40
o
C
CN
Lower Limit 60 4.51 47
Upper Limit 20 6.00 65
In order to ensure that the target property space is properly explored, a
feasibility region on the ternary diagram was evaluated. The identity of the
exact shape of the feasibility region without extensive enumeration was
evaluated by the six unique points described by Eq. (2.12). The three physical
properties (?
?
) were converted to latent properties (?
?
) by solving Eq. (4.7),
(4.8) and Eq. (4.9) simultaneously. The score variables are then standardized
(?
?
) using Eq, (2.43). The results are tabulated in Table 4.8. The feasibility
region was mapped from the threedimensional volume to a twodimensional
area utilizing the property clustering technique described in section 2.2.2.
Figure 4.12 shows the biodiesel target feasibility region described by Table
4.8.
The minimum and maximum latent properties that correspond to
physical properties in Table 4.7 are in Table 4.9.
103
Table 4.8: Physical and latent properties describing feasibility
region.
Feasibility
Region
Points
Physical Property
Latent Property
Score Standardized Score
T
m
[K]
?
[mm
2
/s] CN t
1
t
2
t
3
q
1
q
2
q
3
min,min,max 213 4.5 65 81 21 33 0.180 0.048 0.073
min,max,max 213 6.0 65 80 21 34 0.184 0.049 0.078
min,max,min 213 6.0 47 80 20 32 0.177 0.043 0.070
max,max,min 253 6.0 47 74 19 31 0.183 0.048 0.078
max,min,min 253 4.5 47 75 20 30 0.178 0.046 0.071
max,min,max 253 4.5 65 75 21 33 0.186 0.052 0.081
Table 4.9: Biodiesel target latent properties.
Latent Property
q
1
q
2
q
3
Lower Limit 0.174 0.054 0.088
Upper Limit 0.191 0.043 0.065
Reference value 0.06 0.02 0.05
The properties of a crude biodiesel feedstock were selected such that it
does not meet the fuel specification target properties defined in Table 4.9.
The standardized latent property values are tabulated in Table 4.10. The
crude biodiesel along with the petrodiesel property values are mapped onto
the cluster space as shown in Figure 4.12.
Table 4.10: Crude biodiesel properties.
Latent Property
q
1
q
2
q
3
Crude biodiesel 0.142 0.0480 0.0820
104
Figure 4.12: Target feasibility region and crude biodiesel in cluster space.
4.5.7 Evaluation of Desired Additive Feasibility Region
An infinite number of possible additives may exist that could be mixed with
crude biodiesel. However, it is difficult to narrow down the feasible additive
candidates such that the final mixture products meet the target biodiesel
properties. Since the property cluster formulation enables linear mixing and
lever arm analysis of the latent properties, it is straightforward to identify
the feasibility region for the additives. The region bounded by the black
dashed lines (shaded area) in Figure 4.13 represents the entire latent
property search space for the potential additive molecules. The corresponding
latent property ranges are presented in Table 4.11. The additive feasibility
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
C3
C2
C1
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Crude
Diesel
Biodiesel Target Region
?: Diesel
?: Crude Biodiesel
105
region will serve as the property target region in the molecular design
algorithm.
Table 4.11: Biodiesel additive latent property feasibility region.
Latent Property
q
1
q
2
q
3
Lower Limit 0.519 0.076 0.138
Upper Limit 0.174 0.000 0.000
Figure 4.13: Desired additive design feasibility region in cluster space.
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
C3
C2
C1
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Crude
a
c
Diesel
Biodiesel Target Region
?: Diesel
?: Crude Biodiesel
Additive Feasibility Region
106
4.5.8 Enumeration of Desired Additive Molecules
The characterizationbased group contribution method (cGCM) developed by
Solvason [27] was used to take advantage of the additional molecular
architecture information provided by characterization data. Characterization
based molecular groups/fragments are linearly combined to formulate a
molecule. The property of the formulated molecule is determined by the
individual contributions of each molecular group that make up the molecule.
Twentythree molecular groups were selected from the additive training set.
Some of the groups are the fundamental building blocks present in every
additive molecule and represent first order contributions, while the rest are
larger groups and represents higher level of contributions. These selected
molecular groups are considered to be a set of basic groups which represent
the chemical makeup of the training set and are listed in Table 4.12 .
The infrared (IR) descriptor data [Xmatrix] for the molecular fragments
was compiled from Socrates [84] and is listed in Appendix B, Table B.2. The
latent property contributions of each group are evaluated using the loading
matrix (P
T
) obtained from the PCA of the training set data such that:
MA MK KA? ??
= ?T XP
(4.10)
The standardized score values (?
?
) are calculated as described previously
(Eq. (2.43)) and the results are tabulated in Table 4.12. Visual Basic for
Application (VBA) codes are used to enumerate all potential molecules from
the characterizationbased groups in Table 4.12 that satisfied the target
property constraints described in Table 4.11. Using the groupbased property
model, molecular groups or fragments are added together analogous to inter
stream conservation. A maximum number of similar groups, ?
?
= 2, was
selected such that progressive combinations of similar groups are added until
the maximum is reached.
107
Table 4.12: Molecular groups and their latent property contributions.
S.N. Molecular Group
Latent Property
q
1
q
2
q
3
1
Methine Group, CH 
0.04 0.459 0.005
2
Methylene Group, CH
2

0.05 0.12 0.003
3
Methyl Group, CH
3
0.04 0.156 0.025
4
Tetramethyl Group, C(CH
3
)
3
0.025 0.359 0.061
5 Aliphatic Methoxy Group, OCH
3
0.03 0.66 0.061
6
Vinyl Group, CH=CH
2
0.125 0.43 0.106
7
Vinylidene Group, CH
2
=C 
0.04 0.339 0.002
8 cisVinylene Group, CH=CH
0.18 0.17 0.084
9 transVinylene Group, CH=CH
0.17 0.18 0.091
10 Hydroxyl Group, OH
0.02 0.37 0.229
11 Primary Alcohol Group, CH
2
OH
0.093 0.21 0.086
12 Secondary Alcohol Group,  CHOH
0.09 0.275 0.056
13 Alilphatic Ether Group, O 0.15 0.27 0.072
14 Alkyl Peroxide Group, OO 0.211 0.34 0.134
15 Saturated Aliphatic Ester Group, COO 0.247 0.328 0.155
16 Saturated Aliphatic Methyl Ester Group, COOCH
3
0.184 0.404 0.172
17 Saturated Aliphatic Ethyl Ester Group, COOCH
2
CH
3
0.162 0.5 0.139
18 Acrylate Ester Group, CH2=CHCOO 0.13 0.179 0.14
19 Methacrylate Ester Group, CH2=C(CH
3
)COO 0.111 0.547 0.105
20 oAlkyl Phenol Group (With Hbonding) 0.006 0.455 0.042
21 pAlkyl Phenol Group (With Hbonding) 0.007 0.42 0.042
22 Monosubstituted Benzenes 0.1 0.27 0.081
23 1,2,4 Trisubstituted Benzene 0.055 0.42 0.069
108
The enumerated candidate molecules were then screened for structural
constraints, such as free bond number (FBN), to ensure that a stable,
connected molecule was formed. The candidate molecular structures that
were identified to fall within the feasibility region for the additive properties
in Figure 4.14 and satisfy all the constraints, are presented in Table 4.13.
The list of potential molecules is mapped onto the cluster space and is shown
in Figure 4.14.
Figure 4.14: Cluster diagram for biodiesel blending problem.
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
C3
C2
C1
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Crude
a
c
Diesel
Biodiesel Target Region
CM3
CM4
?: Diesel
?: Crude Biodiesel
?: Candidate Molecules
CM1 : H
2
C=CHC(=O)OC(=CH
2
)CH
3
CM2 : H
2
C=CHC(=O)OCH(OH)CH
3
CM3 : H
3
CC(=O)OCH
2
CH
3
CM4 : pMethyl Phenol
Additive Feasibility Region
CM1
CM2
109
Table 4.13: Results from characterizationbased molecular design.
I.D. Candidate Molecules
T
m
[K] ? [mm
2
/s] CN
CM1
CH
2
O
O CH
3
CH
2
(Isopropenyl acrylate)
203 0.400 3.63
CM2
CH
2
O
O
CH
3
OH
(Hydroxyethyl acrylate)
261 1.04 3.04
CM3
CH
3
O
O
CH
3
(Ethyl acetate)
172 0.72 2.92
CM4
CH
3
OH
(4methyl phenol)
306 9.68 7.07
In addition, solubility parameters can be used as a simple method to
predict and calculate the dissolving power of the above candidate additive
molecules in biodiesel as a further screening criterion. As a general rule, two
substances with close solubility parameters (?) should be mutually soluble
[85]. In the thermodynamics of solution, the Hansen solubility parameters
related to dispersion force (?
?
), polar interaction (?
?
), and hydrogen bond
interaction (?
?
) have been conveniently used to estimate the solubility or
miscibility between two compounds. These parameters can be estimated from
additive group contributions [85]:
110
2
dp h
pi
di hiii i
mm m
F
FE
VV V
?? ?= = =
?
??
(4.11)
where ?
??
, ?
??
, and ?
??
are contributions from group i for calculating
dispersion, polar, and hydrogen component solubilities respectively using
Hoftyzer and Van Krevelen method [85]. The molar volume (?
?
) of a molecule
was estimated by group contribution methods [34]. The total Hansen
solubility parameter (equivalent to Hildebrand solubility parameter) can be
expressed as:
222
d ph
? ???= ++
(4.12)
Table 4.14: Additive solubility in FAME at 25
o
C.
I.D.
?
d
MPa
1/2
?
p
MPa
1/2
?
h
MPa
1/2
?
MPa
1/2
V
m
cm
3
/mol
Feasible
CM1 13.75 3.59 7.16 15.91 136.68 Yes
CM2 12.44 5.12 14.05 19.45 107.35 No
CM3 10.97 3.59 7.16 13.58 97.80 Yes
CM4 19.55 4.77 13.78 24.39 105.37 No
FAME 15.7 1.46 4.57 16.43 335.36 
Table 4.14 tabulates the solubility parameters and molar volumes of the
additives and a common fatty acid methyl ester (FAME). Since linoleic acid
(C18:2) represents a major constituent in the fatty acid profile (Table 4.2), it
is used to check the solubility of the candidate additive molecules in methyl
ester (biodiesel). Additives with lower a solubility parameter than FAME
were considered miscible in most proportions with FAME (? ? 16 MPa
1/2
).
Based on this criterion, the candidates CM1 and CM3 satisfied the screening
criteria and were considered feasible solutions.
111
From Table 4.14 it can be inferred that among the Hansen parameters,
the high dispersion parameter values (?
?
) have the major attractive factor for
FAME and the additive molecules involved in the process of solubilization.
These interactions arise from induced dipoles and their strength is related to
the polarizabilities of the molecules (dipole moment of FAME, CM1, and CM3
molecules are primarily from the dipole moment of the carbonyl group). The
polar parameter (?
?
) and the hydrogen bonding parameter (?
?
) of FAME,
CM1, and CM3 are comparatively low.
Figure 4.15: Spatial representation of candidate additive molecules
according to B3LYP/6311++G(3df,3dp) calculations: (a) isopropenyl acrylate
(CM1), and (b) ethyl acetate (CM3).
Furthermore, density functional theory (DFT) calculations were
performed for the CM1 and CM3 additive molecules to optimize their
geometry. Using the B3LYP method with the 6311++G(3df,3dp) basis set the
lowest energy conformer for each molecule is presented in Figure 4.15. The
estimated dipolar moments of CM1, CM3 and FAME are 1.54, 2.11 and 4.51
respectively. This is in agreement with the polar parameters obtained with
Hansen?s theory (Table 4.14). The difference between dipole moments
(?
????
??
????????
) is at a minimum for CM1, which corresponds to the better
affinity between CM1 and FAME and to the highest solubility limits of this
additive. Consequently, we can conclude that both the polarity and spatial
0
o
67
o
(a) (b)
112
configurations of feasible additive molecules are involved in the process of
solubilization.
4.6 Conclusion
Unlike in group contribution methods (GCM), where parameters of the
contribution are obtained by fitting the group contribution model to
experimental data for a set of chemical compounds, characterizationbased
group contribution method (cGCM) will account for the chemical information
thus extending the predictive capabilities of this method. Additionally,
insights into the molecular structures of the candidates and candidate
mixtures were obtained by incorporating the dipole moment (an important
descriptor of a molecule) indirectly through the use of IR as a descriptor of
molecules.
It should be noted that the limited amount of data and supporting
characterization information available in the training set (and thus the
molecular fragments available for molecular design) impacts the chemical
stability/feasibility of the molecules that are generated in this step.
Additional training set data can improve the quality of the predictions and
thus increase the application range but would not require a different
optimization methodology.
113
CHAPTER 5
REVERSE DESIGN OF IONIC LIQUIDS
5.1 Introduction
Design of environmentally benign solvents and alternative media for
extraction and purification are new challenges within chemical product
design. One class of novel compounds being studied for such application is
ionic liquids (ILs) [11, 86, 87, 88]. They have become the subject of an
increasing number of investigations due to their unique properties such as
high polarity, stability at high temperature, flame resistance, and negligible
vapor pressure. Ionic liquids that have tailored structures with an array of
unique functional properties can have important applications in areas such
as CO
2
capture and sequestration, sulfur removal from fuels, energy storage,
biomass pretreatment, and chemical separations. Through variation of both
cation and anion, particular ionic liquids with tunable physical properties can
be tailored. For example, the miscibility of ionic liquids with water or organic
solvents can be varied with alkyl chain lengths on the cation and the type of
anion present.
It is estimated that over 10
14
unique cation/anion combinations are
possible for use as room temperature ionic liquids, the majority of which have
never been synthesized [87]. Thus, it is essential to develop a logical and
systematic approach of selectively choosing a given ionic pair that matches a
set of desired physicochemical property targets. However, the traditional
experimental trialanderror approach of searching through this large
molecular space is unrealistic as it is both time and labor intensive. An
114
example of possible combinations of cation, anion, and the alkyl chain of the
side chain attached in the cation is represented graphically in Figure 5.1.
Figure 5.1: Selection of anions, cations, and side chains attached in
cations for a task specific ionic liquid application.
Computerassisted approaches towards the automated design of
chemically formulated products with desired physical, chemical and biological
properties hold immediate potential. In such methods, predicting the
properties of ILs would be necessary for molecular design of such compounds.
The advantage of the newly developed characterizationbased group
contribution technique [27] has previously been demonstrated to predict
physicalchemical properties and design of biodiesel additive molecules
(Section 4.5.3) [53]. By exploiting the fact that molecules absorb specific
frequencies that are characteristic of their structure, vibrational spectroscopy
is used to elucidate chemical constituents, and the orientation and alignment
of molecules. For example, infrared (IR) spectroscopybased characterization
CH
3
N N
+
CH
3
B

F
FF
F
Alkyl chain Cation Anion
?
4
BF
6
?
PF
33
?
CF SO
27
?
Al Cl
3
?
NO
4
?
AlCl
2
?
Tf N
halide
NN
+
R
1
R
2
S
+
R
1
R
2
R
3
N
+
R
2
R
2
Ionic Liquid
3
CH
2
CH
OH
22
CH CH=CH
23
CH OCH
3
CH O
115
contains large quantities of descriptor data involving information on
molecular architecture at atomic, nano, and microscales to describe
physical properties and attributes of chemical products. Some common
characterizations used to quantify molecular architectures at other length
scales include nuclear magnetic resonance (NMR) and xray diffraction (XRD)
(See Section 2.4 for more detail). In addition, the characterizationbased
group contribution method (cGCM) utilizes the latent property parameters
based on characterization data instead of conventional regressionbased
property parameters which often exhibit poor attributeproperty
relationships [27].
The choice of an appropriate spectra training set is vital to building the
latent variable structure since the training set defines the molecular
architecture building blocks that can be used in the cGCM. In order for the
method to be independent of the availability of experimental spectroscopic
data, in this Chapter we investigated the use of density functional theory
(DFT) based simulation techniques to generate the required IR spectra as
molecular descriptors to develop predictive property models that can be used
for the reverse design of ionic liquids.
5.2 Density Functional Theory
A set of ionic liquids with three properties were compiled from IUPAC Ionic
Liquid database (ILThermo) [89] and listed in Table B.3. For each of these
ionic liquid, the ionpaired structures were drawn into the Accelrys Draw 4.0
program to develop an initial 2dimensional MDL Molfile. This information
was then imported into Avogadro (v1.0.3) [90] to develop an initial geometry
utilizing a molecular mechanics (MM) forcefield for energy minimization.
The MM forcefield chosen for this task was MMFF94, also known as the
Merck Molecular ForceField. MMFF94 has been parameterized for a wide
range of organic chemistry calculations and several charged molecules have
116
been included as well, most notably imidazolium cations [91]. For IL?s with
phosphate containing anions, the unified forcefield (UFF) was utilized to
generate starting point geometries because the MMFF94 forcefield is not
suited to handle these types of molecules.
With this starting point geometry, the final molecular geometry
optimization was performed using the quantum chemical Gaussian 09
program [92] executed through a supercomputing cluster run by the Alabama
Supercomputing Authority. As a molecular model to simulate the pure ionic
liquid, ionpaired structures combining the cations and anions were
optimized as a whole. The DFT method B3LYP utilizing the 6311G(2d,p)
basis set in the ideal gas phase was used for molecular geometry optimization
of all ionpaired structures in Table B.3. The method and basis set were
chosen because similar computational level (B3LYP/6311+G(2d,p)) has
previously been demonstrated to have the lowest rootmeansquare error
(when compared to HarteeFock (HF) method utilizing similar basis sets) for
simulated fundamental molecular vibrations [93]. Figure 5.2 shows the
optimized structure of the stable gas phase (local potential energy minimum)
1ethyl3methyl imidazolium hexafluorophosphate [emIm]PF
6
ionic liquid
conformer
.
Figure 5.2: Geometry optimized molecular structure of [emIm]PF6 at the
B3LYP/6311G(2d,p) computational level.
117
Vibrational frequency calculations were performed for each ionpaired
structure to confirm the presence of an energy minimum. This is confirmed as
the resulting frequencies were all positive values (no imaginary frequencies)
[93]. Negative values appearing in this data would reveal the presence of a
transition state geometry since the vibrational modes are derived from a
square root of the force constants from the Hessian matrix (Eq. B.6).
In general, the values of harmonic vibrational frequencies determined by
ab initio computational methods are larger than the experimental
frequencies because the methods neglect anharmonicity effects, incompletely
incorporate electron correlations and uses finite basis sets [94]. The
vibrational spectra calculated using DFT (which incorporates electronic
correlation) requires a correlation factor of 0.965 at the B3LYP/6311G(2d,p)
level [95]. The base set is a mathematical function to approximate the
electronic wave function. Figure 5.3 (a) shows the high correlation (R
2
of
0.998) of the IR frequency calculated to the available experimental data [93]
for [emIm]PF
6
. Figure 5.3 (b) shows the resulting simulated vibrational
spectra using Gaussview software.
Figure 5.3: (a) Calculated vs. experimental IR frequencies and (b) infrared
spectrum at 2cm
1
resolution of [emIm]PF6 at the B3LYP/6311G(2d,p)
computational level.
? (cm
1
)
R? = 0.9983
0
500
1000
1500
2000
2500
3000
3500
0 500 1000 1500 2000 2500 3000 3500
C
a
lc
u
la
t
e
d
V
ib
r
a
t
io
n
,
?
(
c
m

1)
Experimental Vibration, ? (cm
1
)
118
Table 5.1 contains the theoretical and experimental vibrations along
with vibrational assignments for [emIm]PF
6.
Table 5.1: B3LYP/6311+G(2d,p) vibrational assignments (cm
1
) of
[emIm]PF6.
Frequency (cm
1
)
Assignment Calculated Experimental
CatAn tors 21 
CatAn bend 39 
CatAn bend 56 
CatAn bend 62 
CatAn bend, CH
2
(N) bend 99 
atAn bend, CH
2
(N) bend 106 
CH
3
(N) twist 127 
CH
3
(N) twist 162 
CH
2
(N) bend 186 
terminal CH
3
twist 210 
CH
3
(N) bend 218 
PF
6
scissors 284 
PF
6
scissors 287 
PF
6
scissors 295 
CCH bend, CH
3
(N) bend 296 
CCH bend, CH
3
(N) bend 326 
CCH bend, CH
3
(N) bend 389 
PF
6
scissors 434 
PF
6
scissors 439 
PF
6
scissors 440 
FPF sym str 447 
FPF sym str 472 
FPF sym bend 527 
FPF sym bend 528 
FPF sym bend 531 559
ring ip sym bend 569 559
ring op asym bend 570 625
119
Frequency (cm
1
)
Assignment Calculated Experimental
ring op asym bend 593 649
FPF sym str, ring op bend 695 
ring ip bend, CH
3
(N) bend, CH
2
(N) bend 710 
ring HCCH sym bend 749 741
CCH bend 812 
ring HCCH asym bend 826 836
FPF asym str, ring HCCH asym bend 875 838
FPF asym str, ring NC(H)N bend, CCH bend 900 847
FPF asym str 905 
ring NC(H)N bend 907 
CC str 960 
ring ip sym str 981 
ring sym str, CH
3
(N) str, CH
2
(N) str 1012 1033
CC str, ring ip sym str 1063 1087
ring HCCH sym bend, ring ip sym str 1082 
ring HCCH sym bend, ring ip sym str 1138 
CC str 1159 1114
CH
3
(N) HCH bend 1173 
ring sym str, CH
3
(N) str, CH
2
(N) str 1200 1172
ring ip asym str, CC bend 1216 
ring ip asym str, CH
3
(N) str 1267 
ring ip sym str, CH
2
(N) str 1298 1340
ring ip sym str, CH
2
(N) str, CH
3
(N) str 1328 
ring ip asym str, CH
2
(N) bend 1382 1387
CC str 1429 
ring ip asym str 1447 
ring ip asym str, CH
3
(N) str 1468 1432
CCH HCH sym bend, CH
3
(N)HCH sym bend 1505 
CCH HCH asym bend, CH
3
(N)HCH sym bend 1513 1468
CH
3
(N) asym bend 1522 
CC HCH bend 1533 
CH
3
(N) HCH sym bend 1538 
ring ip sym str, CH
3
(N) str, CH
2
(N) str 1547 1575
120
Frequency (cm
1
)
Assignment Calculated Experimental
ring ip asym str, CH
3
(N) str, CH
2
(N) str 1591 1575
terminal CH
3
HCH sym str 3059 2878
CH
3
(N) HCH sym str 3065 
CH
2
HCH sym str 3076 
terminal CH
3
HCH asym str 3102 2952
CH
2
HCH asym str 3109 
CH
3
(N) HCH asym str 3137 
CC HCH asym str 3145 
CH
3
(N)HCH asym str 3157 3115
ring NC(H)NCH str 3176 3134
ring HCCH asym str 3300 3168
ring HCCH sym str, ring NC(H)NCH str 3316 3179
5.3 Data Analysis and Model Development
IR data of 22 training set IL molecules generated using DFT contained 701
descriptor variables at 2 cm
1
resolution for (35000) cm
1
frequency range.
Using PCA, the first three principal components captured about 60% of the
total variance. The score values are listed in Table 5.2. The loading values
are not presented as they constitute a matrix of [701x3]. Using PCR, the
latent variable property models (Figure 5.4) were developed. The generalized
expressions are shown in Eq. 5, 6, and 7 for dynamic viscosity, density, and
melting temperature, respectively. Table 5.3 contains the respective model
coefficients. Note that the log, inverse, and exponent transformations of the
calibrated QSPR models were made to ensure linear mixing of the property
operators of these properties.
33
2
0
11
ln( )
i i ii i
ii
tt?? ? ?
= =
=++
??
(5.1)
121
33
2
0
11
1
ii iii
ii
tt?? ?
?
= =
=++
??
(5.2)
3 33 3
2
0
11
exp
m
i i ij i j ii i
i i j ji i
mo
T
t tt t
T
?? ? ?
= <> =
??
=++ +
??
??
? ?? ?
(5.3)
Table 5.2: PCA score values for X and Yblocks.
X score value
t
1
t
2
t
3
3.102 1.676 2.218
37.75 0.137 6.502
25.57 1.434 5.686
21.98 10.25 7.347
10.86 1.346 11.18
0.315 1.392 3.867
0.923 1.408 8.155
28.34 1.497 3.089
6.778 0.023 7.867
0.438 0.708 1.950
18.46 9.313 5.078
6.854 1.547 7.027
6.392 0.682 11.33
5.741 0.417 6.179
18.04 1.227 4.456
16.84 11.40 8.355
13.73 3.260 2.640
23.92 5.838 9.794
36.60 1.853 7.743
8.058 0.288 10.72
25.84 29.45 10.24
14.50 1.365 3.655
4.452 0.656 2.259
122
Figure 5.4: Predicted vs. actual IL properties using PCR model.
Table 5.3: Model coefficients using PCR.
Properties ??(?)
? ??
??? (?
?
?
??
)?
0
? 1.273 8.13e
4
6.778
1
? 0.0736 6.03e
7
0.0311
2
? 0.315 3.25e
6
0.0152
3
? 0.175 1.05e
5
0.0206
13
?   0.0137
23
?   0.0293
11
?  1.14e
7
0.0035
22
? 0.0162  0.0168
?
2
0.609 0.643 0.541
?
adj
2
0.517 0.559 0.312
The poor accuracy of predictions for the properties of interest could be
related to the quality of the experimental data used for calibrating the
property models. Moreover, the gas phase nature of IR simulation has
difficulties to take into account the structural features of ionic liquids in
liquid state (electrostatic, van der Waals interactions, hydrogen bonds, etc.).
In addition, accounting for the impact of variation in the structure of ions
(size, symmetry, conformational flexibility, etc.) on properties is always
challenging [31]. For example, melting point (?
?
) of ILs decreases with
increase in the length of alkyl substituents due to reduction of electrostatic
123
interactions between ions. However, this trend is not always true. With the
length of alkyl group increasing, van der Waals attraction between bulk alkyl
radicals favors an increase in (?
?
). Experimental observations have shown
oscillation of the melting point of ILs with the size of alkyl groups [96]. In
addition, characterization of ionic liquids using the vibrational spectroscopy
techniques like IR works better on molecules with covalent bonds. Although,
ionic liquids have some degree of covalent bonding, the ionic character
dominates the covalent character.
5.4 Reverse Design of Ionic Liquids using QSPR and cGCM
Identification of ionic liquids that possess task specific properties through
time consuming experiments and simulations of individual alternative
molecules is virtually impossible. However, there are relatively small
numbers of distinct buildingblocks/functionalgroups/fragments that can
theoretically be combined to generate a wide variation of possible molecules.
In order to capture the group contribution and interaction variability in the
22 training ILs (listed in Table B.3), a total of fifteen groups are selected.
Table 5.5, Table 5.6, and Table 5.7 contain seven anion, six cation, and two
alkyl chain cation substituents, respectively. These selected molecular groups
are considered to be a set of basic groups which represent the chemical make
up of the training set.
Anion groups are considered as a whole molecule without any possible
attachment position. Cation groups consist of imidazolium, pyridinium, and
alkyaminium bases with a maximum of four possible alkyl group
attachments. Two alkyl groups are selected as possible alkyl chains attached
to cation groups as these were the only groups present in the training set. All
these groups are considered first order in the group contributionbased
property estimation of the formulated ionic liquid molecule. Since most of the
groups cover a significant proportion of a potential IL molecule, they
124
incorporate most of the interactions between the groups. Moreover, the use of
projection methods for property model calibration also capture the most
common features and underlying latent relationships resulting from group
group, ionion, and iongroup interactions among the ionic pairs in the
training set molecules. The loadings (P) from PCA preserve the covariance
structure which can be applied to reverse design of ionic liquid molecules that
are consistent with structural attributes in the training set molecules.
The infrared frequency data for the anion, cation, and alkyl groups are
generated using B3LYP/6311G(2d,p) computational level and by following
the procedure outlined in Section 5.2. These IR descriptors values are then
transformed to score values using ? = ???. Finally, the three latent
properties are obtained by normalizing the score values to follow linear
mixing in terms of property operators (see Section 2.7 for detail). The latent
property values which are the contributions of each group to the three
properties of interest (?, ?, and ?
?
) are tabulated in Table 5.5, Table 5.6, and
Table 5.7.
As a proofofconcept example, an ionic liquid is to be designed as an
alternative benign solvent that should replace a molecular solvent with
environmental and health hazard properties. Table 5.4 presents the target
property ranges in terms of viscosity, density, and normal melting
temperature for a certain taskspecific application. The corresponding latent
property values were obtained following the steps outlined in Section 2.6.2.
Table 5.4: Target ionic liquid properties.
Physical Properties Latent Properties
?
[Pa.s]
?
[kg/m
3
]
T
m
[K]
q
1
q
2
q
3
Lower Limit 1.0 1200 230 0.0004 0.0077 0.0346
Upper Limit 1.1 1300 300 0.0671 0.4807 0.0248
125
Table 5.5: Anion groups and their latent property contributions.
Latent Property
Anion Groups
q
1
q
2
q
3
O

SO
CH
3
O
O
Methylsulfate
0.43 2.45 0.08
P

F
F
F
F
F
F
Hexafluorophosphate
0.15 1.91 0.14
B

F
FF
F
Tetrafluoroborate
0.14 2.92 0.74
NN
N

Dicyanamide
0.27 7.70 0.71
N

S S
O
O O
O
F
F
F
F
F
F
Bis(trifluromethylsulfonly)amide
0.11 3.51 1.40
S O

O
O
F
F
F
Triflate
0.06 2.67 0.70
S
O
O

O
O
CH
3
Octylsulfate
0.04 5.17 0.90
126
Table 5.6: Cation groups and their latent property contributions.
Latent Property
Cation Groups Valency
q
1
q
2
q
3
N
+ CH
3
N
R
methyl imidazolium
1
0.06 0.73 0.16
N N
+ RCH
3
methyl imidazolium (+)
1
0.20 0.27 0.00
N N
+ CH
3
R
CH
3
2,3 dimethyl imidazolium
1
0.55 0.12 0.37
N
+ R
Pyridinium
1
0.43 0.33 0.16
N
+
R
CH
3
CH
3
CH
3
Triethy alkyaminium
1
0.91 0.99 0.67
N
+
R
R
1
R
2
R
3
Trialkyl alkyaminium
4
0.23 0.90 0.04
127
Table 5.7: Alkyl chain attached to cation groups and their latent property
contributions.
Latent Property
Alkyl Groups
Valency
q
1
q
2
q
3
CH
3
Methyl
1
3.08 0.08 0.37
?CH
2
?
Methylene
2
4.08 0.92 1.37
The reverse design of potential IL molecules is accomplished by
exhaustively generating combinatorial structures from the given molecular
fragments until the resulting properties match the target property values in
Table 5.4. In exhaustive searches, selection from among numerous
permutations of anion, cation, and alkyl chain attached to cation groups is
performed. First, an anion is selected, and the properties of interest are
evaluated by using a groupbased property estimation method, changing the
cations and the length of side alkyl chain attached to the cation. For every
cation; CH
3
groups, equal to the free bond number (FBN) of the cation, are
added as a cap at the free end of the cation. Finally, varying numbers of CH
2
groups are added until the sum of property values of all the groups fall within
the target property range. In this case study, a maximum of fifteen CH
2
groups are allowed to occur in a generated ionic liquid pair. Figure 5.5 depicts
a scenario for the reverse design of ILs. The molecular property is estimated
based on the first order group contribution method:
15
1
j gg
g
n
=
? = ??
?
(5.4)
where ?
?
is the normalized latent property operator of property ? (which
includes 3 properties), ?
?
is the similar number of group appearances, and ?
?
is the normalized latent property contribution of the appeared group.
128
Figure 5.5: Scenario of reverse design of ILs.
A heuristic and exhaustive search algorithm in the Python programming
language was written to automate the generateandtest procedure to
enumerate all possible IL molecules by selecting collections of groups,
estimating their physical properties, and checking these against the target
values. A total of 10,638 possible candidate were generated. Only 26 of these
enumerated IL molecules satisfied the target property specifications in latent
space (Table 5.4) and the free bond number (FBN) structural constraint. The
structures whose FBN equals to zero ensure feasible, stable, and connected
Target Properties
Viscosity, Density, Melting
Point Temperature etc.
Generation of IL Structures from Combinatorial Building Blocks
Alkyl chain
R
1
?
4
BF
6
?
PF
33
?
CF SO
27
?
Al Cl
3
?
NO
4
?
AlCl
2
?
Tf N
halide
NN
+
R
1
R
2
S
+
R
1
R
2
R
3
N
+
R
2
R
2
Cation
R
2
+
Anion

Select appropriate
IL molecular building blocks
Exhaustive search by computer
Structured IL Molecules
Meet target property values,
Electronically complete
3
CH
2
CH
OH
22
CH CH=CH
23
CH OCH
3
CH O
129
molecules are formed (see Section 2.8.2). Out of 26 of these molecules, 23
solutions were not present in the training data set (Table B.3). Table B.4
contains 23 unique solutions in latent or principal property space. When
these 23 IL molecules were mapped from principal property space to physical
property space, only 13 IL molecules satisfied the target physical properties
(Table 5.4). It must be noted that all 23 solutions are feasible; however, only
13 are feasible in both the spaces. The loss of 10 molecules when properties of
the IL molecules were transformed from the latent space to physical space
could be contributed to the uncertainties associated with the property models.
Figure 5.6: Enumerated and validated candidate IL molecules.
Table 5.8 lists the final candidate IL molecules for which predicted
properties matches the target properties and are structurally feasible. The ID
numbers and candidate molecule names are the same as in Table B.4. These
molecules could further be subjected to experimental and/or more detailed
computational simulation tests. The enumeration and selection after
validation of IL molecules is schematically presented in Figure 5.6.
130
Table 5.8: Thirteen candidate ionic liquid molecules that match target
properties in property space.
ID Candidate Molecule Physical Properties
?
[Pa.s] ? [kg/m
3
] T
m
[K]
3 [e
3
pNH
4
]mSO
4
1.00 1248 267
4 [meIm]PF
6
1.00 1226 290
7 [meIm]BF
4
1.00 1261 270
8 [e2,3m
2
Im]BF
4
1.00 1261 266
9 [ePy]BF
4
1.00 1291 260
11 [e
3
bNH
4
]BF
4
1.00 1247 267
12 [e
3
pNH
4
]dCN 1.00 1259 261
14 [e2,3m
2
Im]Tf
2
N 1.00 1247 267
17 [e
3
bNH
4
]Tf
2
N 1.00 1233 247
18 [meIm]CF
3
SO
3
1.00 1263 266
21 [e
3
bNH
4
]CF
3
SO
3
1.00 1221 277
22 [ePy]oSO
4
1.10 1203 252
23 [e
3
pNH
4
]oSO
4
1.00 1255 262
5.5 Conclusion
Ionic liquids (ILs) as green solvents can be used in separation processes,
chemical synthesis, catalysis and electrochemistry, successfully replacing the
conventional volatile, flammable and toxic organic solvents. Within the
computeraided molecular design (CAMD) framework, a characterization
based method was combined with chemometric techniques towards the
design of IL structures corresponding to particular physical properties.
Infrared spectra (IR) generated from density functional theory (DFT)
simulations were used as molecular descriptors for capturing information on
molecular architecture, and for calibration of latent variable property models
to design ILs in a logical and systematic methodology. The use of DFT
eliminates the dependency on the availability of experimental IR data,
131
thereby extending the capabilities of a design method based on such
characterization techniques.
In addition, the design of ILs using the characterizationbased group
contribution method (cGCM) further demonstrated the advantages of using it
compared to the conventional GCM for predicting properties. For new class of
chemical compounds such as ILs, tabulated group contribution values for
many molecular groups are not available in the literature. Here, cGCM
expands the application range of general GCM to predict properties of the
enumerated IL molecules from the combination of IL molecular fragments by
using the molecular information captured from characterization technique
based on infrared spectroscopy data.
132
CHAPTER 6
FUTURE WORK
6.1 Methodology Improvements
The chemical product formulation methodology presented in this dissertation
utilizes multivariate data analytics in three areas: (1) characterization, (2)
modeling, and (3) design. Each of these aspects in product design plays an
important role at their respective levels. The contribution of this work has
been in the development of a generalized methodology that integrates all of
the above tools and techniques to design new and improved materials with
tailored properties. The applications are presented in the form of case studies
involving the design of biofuel additives, thermoplastic formulation, and ionic
liquid. Several of the tools and the techniques combined in this work may not
be an ideal choice; however, the general problem formulation as well as the
solution approach can be extended to other problems in chemical engineering.
In the future, some of the areas of improvement can be seen in each of
the multivariate analytics. Figure 6.1 presents schematics of the potential
tools and techniques in the characterization, modeling, and design.
6.1.1 MultiDimensional Characterization
In silico molecular design approaches presented in the foregoing chapters can
take advantage of an increasing amount of techniques available to
characterize molecular architectures. Recently, the advancement of high
performance computers and robust theories, such as quantum chemistry,
information theory, graph theory, etc., are accelerating a paradigm shift in
133
the molecular modeling. Today, thousands of molecular descriptors that
capture and transform the information encoded in the molecular structure
are effortlessly generated using readily available algorithms and software.
Each molecular descriptor takes into account a small part of the whole
chemical information contained within a real molecule. It is important to
consider the whole environment of the compound as a potential source of
information to investigate its interplay with physicochemical properties and
biological activities. QSPR/QSAR modeling is an integral part of in silico
molecular/drug design approaches.
Figure 6.1: Improvements in the process of multivariate characterization,
modeling, and design.
Since knowledge gained from chemical data alone is inadequate for
success in drug discovery, it is important to closely combine or merge
chemoinformatics and bioinformatics in the future works. Among many
computer programs, DRAGON
TM
computer software can generate more than
1,600 molecular descriptors, covering topological, molecular, and 3D
Multivariate
Characterization Modeling Design
? Descriptors including:
0D, 1D, 2D, 3D....
? Computer programs:
ADMET Predictor,
POLLY, DRAGON
? Multiway analysis:
Parafac, Tucker....
? Clustering:
NN, SVM, Kmean,
Decision Trees, LDA...
? Stochastic optimization:
GA, SA, Tabu search..
Mutation
Selection
Initial Molecule/Group
Final Molecule
134
properties of molecules [97]. EDRAGON applet is an electronic version of
DRAGON and is freely available at www.vcclab.org/edragon/. Likewise,
ADMET Predictor
TM
is a powerful software that can estimate a number of
properties associated with physicochemical and biological properties of drug
like chemicals and their molecular structures [98]. The addition of
descriptors with diverse chemical features and wider molecular space
coverage would make the presented methodology more flexible and robust,
but would not require a new problem formulation and solution framework.
In addition, since spectroscopic methods are fast and noninvasive, their
use has grown considerably in recent decades in both research and industries
such as food, petrochemical and pharmaceutics production. To establish more
chemical features of samples, two distinct analytical methods can also be
combined. For example, chromatography can be applied first to separate the
components, while infrared spectroscopy can be applied later to identify
them. When two distinct methods are combined, in socalled hyphenated
methods, the resulting data is multivariate and multiway. Likewise,
fluorescence excitationemission matrix (EEM) spectroscopy is a flexible,
rapid, and portable organic matter characterization tool. In order to use these
characterization techniques, however, twoway data decomposition
techniques such as principal component analysis (PCA) and partial least
square (PLS), used in this dissertation, cannot be applied without further
extension.
6.1.2 MultiWay Modeling
Decomposition of multiway data arrays can be accomplished by using
parallel factor analysis (PARAFAC), which is an extension of the twoway
PCA to threeway systems. Unlike the bilinear PCA model, the trilinear
PARAFRAC model factors do not have rotational freedom, and thus the
solution is essentially unique [99, 100]. Figure 6.2 (a) and Figure 6.2 (b) show
a schematic of twoway data and threeway tensor decomposition,
135
respectively. While PCA has a score and loading matrix (Figure 6.2 (a)),
PARAFAC has a score matrix and two loading matrices (Figure 6.2 (b)) for
threeway systems.
Figure 6.2: The decomposition of Xblock by (a) PCA and (b) PARAFAC.
Another important area where PARAFAC becomes instrumental is in
the batchwise manufacturing processes that are common in chemical,
pharmaceutical, and semiconductor industries. Batch processes give rise to
threedimensional matrices as shown in Figure 6.2 (b). The mixture
formulation problem presented in Chapter 3 involves a twoway data
structure. The PARAFAC decomposition of a threeway tensor ? is
represented by Eq. (6.1).
`
,, , , , ,,
1
1,...., ; 1,..., ; 1,...,
=
= ??+ = = =
?
A
i jk ia ja ka i jk
a
x a b c e i Ij Jk K
(6.1)
For batch processes, Eq. (6.1) contains I batches, J time points, and K
variables. For fluorescence EEM, Eq. (6.1) contains I samples with J emission
wavelength and K excitation wavelengths. Analogous to PCR regression,
= +
a
2
a
1
b
1
b
2
c
1
c
2
(b)
EEM Fluorescence
t
2
t
1
= +
p
1
p
2
(a)
NIR
136
PARAFAC can be combined with MLR to obtain a multiway regression
model: decompose ? using PARAFAC and regress ? on the PARAFAC scores
?` as presented in Sections 2.6.2 and 2.6.3.
In the other hand, when linear approximations (such as PCA or
PARAFAC) are not valid, neural networks (NNs), which use a series of
weights (?
?
) and hidden neurons to detect complex and nonlinear
relationships between inputs and outputs, can be utilized [101]. Figure 6.3 is
a schematic for an artificial neural network. NNs are nonlinear statistical
data modeling tools suitable for highdimensional and nonlinear data such
as the data generated from EEM fluorescence spectroscopy. They are an
adaptive system that changes its structure based on external or internal
information that flows through the network. They are useful when predictive
accuracy is the most important objective.
Figure 6.3: Structure of an artificial neural network.
6.1.3 Stochastic Search and Optimization
The deterministic solver is used in the molecular design problems presented
in this dissertation, involving the design of biodiesel additives and ionic
liquids. The developed methodology shifted the product design paradigm from
137
guessandtest to generateandtest approach. Molecules with desired
properties are generated using dynamic programming that searches through
molecular space defined by a set of combinatorial building blocks. The
combinatorial explosion is minimized by using metagroups (see Section
2.8.2). In groupbased property estimation methods, these metagroups are
treated as first order groups. The problem formulation using a deterministic
approach is able to handle the presented design problem because the
molecular search space is small. However, when the molecular search space
becomes combinatorially large and nonlinear, stochastic optimization
algorithms like evolutionary algorithm [9], simulated annealing (SA), tabu
search [11], and artificial neural networks (ANN) are more appropriate and
effective computer algorithms.
Genetic algorithms (GAs) are stochastic evolutionary searches that use
an analogy to chromosome encoding and Darwinian model of natural
selection and evolution [9]. GAs are useful for combinatorial problems
dealing with highly complex and a highly dimensional search space. Unlike
traditional search methods that move in the objective area deterministically
(pointbypoint forward movement), GAs move probabilistically (parallel
movement) in the optimal direction. In GAs, by stochastically favoring the
mating of a more fit population of molecules, the most promising areas of the
search space are explored at the expense of low performance regions [9].
6.1.4 Managing and Handling Uncertainty
Uncertainty accumulates through multiple steps in variable transformation
and can make the variance of the final response undesirably large. The
identification, quantification, and communication of model prediction
uncertainties are important steps in determining the usefulness of any
model. Uncertainty analysis techniques such as uncertainty propagation
(forward and inverse) need to be incorporated into model calibration and
design methodologies. By developing new design methods that are capable of
138
dealing with uncertainty in multiscale models of materials and its
propagation through subsequent design and analysis, the application range
increases and design becomes robust to uncertainties in the design process.
Importance of model uncertainty becomes greater as the extrapolation
becomes farther from the historical or training data set.
6.2 Design of Inherently Benign Chemical Process Routes
Traditionally, process design was guided by two major facets, technical and
economic decision criteria. However, the incorporation of two further
dimensions of sustainability, the ecological and the social aspects,
simultaneously into the early process design stages has become necessary in
recent years [102]. Figure 6.4 depicts different aspects involved in early
process design. Opportunities for identification and development of
inherently safer process alternatives for solvents, reaction paths, catalysts,
etc., are most abundant during the early design stage. An inherently safer
process avoids or reduces hazards instead of controlling them [103, 104].
Figure 6.4: Aspects of early process design.
Analogous to the molecular architecture descriptors used in molecular
design work, various process route descriptors related to environment,
health, and safety factors such as toxicity, reactivity, flammability, heat of
Economic
Sustainability
Ecological
Technical
Process Design
Aspects
Social
Environmental
Lifecycle
Health
Safety
139
reaction, temperature, process yield, properties of chemicals involved, etc.,
can be explored and used to design inherently safer synthesis pathways.
Here, the primary objective is to minimize the amount of raw material usage,
energy usage, waste generation, and the hazard to both life and environment.
Many indexbased approaches exist to compare and rank routes based
on environment, health, and safety (EHS) hazards and life cycle assessment
(LCA) impacts, however, they generally suffer from scaling and multivariate
related issues. The recently proposed statistical analysis based Inherent
Benignness Indicator (IBI) framework by Srinivasan and Nhan [103]
alleviates the above shortcomings. They considered principal components
(PCs) as statistically independent factors of the routes instead of traditional
empirical weighting schemes. The scores and loadings on the first two PCs
make up the IBI and help characterize the inherent benignness of a process.
The larger the IBI value, the less benign the route is.
The work in this dissertation can be easily extended to use this PCA
based IBI methodology to not only investigate the most inherently safer
process route among different routes examined, but also to identify the
similarities and differences in the EHS footprints of the available routes.
This information and insight can be used to determine modifications needed
to improve a route?s benignness through molecular/mixture design. Reverse
problem formulation could be applied as an integration procedure where the
process unit performance that minimizes the IBI could be identified in the
first step and then the molecule/mixtures that match the targets process
performance would be identified in the second step.
The computeraided molecular/mixture design (CAMD) framework
combined with the property clustering technique would then be employed to
generate products that meet the design constraints. If successful, the method
could provide a new generation of greener chemical substitutes (e.g. solvents,
catalyst, lubricants, heat transfer fluids, refrigerants, surfactants, etc.) that
result in lower safety, health and environmental impacts. The algorithm will
140
be computationally efficient as it will be based on the targeting approach
mentioned before in this dissertation. Moreover, simultaneously integrating
inherent safety principles (using safer materials and operating conditions,
minimizing inventory, etc.) into process and product design, from conception
to completion, could lead to a more sustainable process.
141
REFERENCES.
[1] Gani, R. (2004). Chemical product design: challenges and opportunities.
Computers & Chemical Engineering, 28(12), 24412457.
[2] Cussler, E. L., & Moggridge, G. D. (2001). Chemical Product Design. USA:
Cambridge University Press, New York, ISBN 0521791839.
[3] Hill, M. (2009). Chemical Product Engineering  The third paradigm.
Computers & Chemical Engineering, 33(5): 947.
[4] Ertl, P. (2003). Cheminformatics analysis of organic substituents:
identification of the most common substituents, calculation of substituent
properties, and automatic identification of druglike bioisosteric groups.
Journal of Chemical Information and Computer Sciences, 43(2), 374380.
[5] Achenie, L. E. K., Gani, R., & Venkatasubramanian, V. (Eds.). (2002).
Computer Aided Molecular Design: Theory and Practice (p. 408). Elsevier
Science.
[6] Sahinidis, N. V., Tawarmalani, M., & Yu, M. (2003). Design of alternative
refrigerants via global optimization. AIChE Journal, 49(7), 1761?1775.
[7] Biegler, L. T., Grossmann, I. E., & Westerberg, A. W. (1997). Systematic
methods of chemical process design. New Jersey (p. 796). Prentice Hall PTR.
[8] Camarda, K. V., & Maranas, C. D. (1999). Optimization in Polymer Design
Using Connectivity Indices. Industrial & Engineering Chemistry Research,
38(10), 18841892.
[9] Venkatasubramanian, V., Chan, K., & Caruthers, J. M. (1995). Evolutionary
Design of Molecules with Desired Properties Using the Genetic Algorithm.
Journal of Chemical Information and Modeling, 35(2), 188?195.
[10] Harper, P. M., & Gani, R. (2000). Computer aided tools for design/selection of
environmentally friendly substances. Process Design Tools for Environment
(pp. 371?404). Philadelphia, USA: Taylor & Francis.
[11] McLeese, S. E., Eslick, J. C., Hoffmann, N. J., Scurto, A. M., & Camarda, K.
V. (2010). Design of ionic liquids via computational molecular design.
Computers & Chemical Engineering, 34(9), 14761480. Elsevier Ltd.
142
[12] Contantinou, L., Gani, R. (1994). New group contribution method for
estimating properties of pure compounds. AIChE Journal, 40(10): 16971710.
[13] Harper, P. M., & Gani, R. (1999). Computeraided molecular design with
combined molecular modeling and group contribution. Fluid Phase
Equilibria, 337347.
[14] Joback, K. G., Reid, R. C. (1983). Estimation of PureComponent Properties
from Group Contributions. Chemical Engineering Communication 57: 233.
[15] Eriksson, L., Johansson, E., KettanehWold, N., Trygg, J., Wikstrom, C., &
Wold, S. (2006). Multi and Megavariate Data Analysis: Basic Principles and
Applications (Part I) (2nd ed., p. 425). Umetrics Academy, Umea.
[16] Lavine, B. K. (2005). Chemometrics And Chemoinformatics (p. 216). American
Chemical Society.
[17] Carrera, G., & AiresdeSousa, J. (2005). Estimation of melting points of
pyridinium bromide ionic liquids with decision trees and neural networks.
Green Chemistry, 7(1), 20.
[18] Venkatasubramanian, V. (2009). Drowning in data: Informatics and
modelling challenges in a data rich networked world. AIChE Journal, 55(1):
28.
[19] Wold, S. (1995). Chemometrics; what do we mean with it, and what do we
want from it? Chemometrics and Intelligent Laboratory Systems, 30(1), 109
115.
[20] Jaeckle, C. M., & Macgregor, J. F. (1998). Product design through
multivariate statistical analysis of process data. AIChE Journal, 44(5), 1105
1118.
[21] Qin, X., F. Gabriel, D. Harell, and M. M. ElHalwagi. 2004. Algebraic
Techniques for Property Integration via Componentless Design, Industrial &
Engineering Chemistry Research, 43 (14) (July): 3792?3798.
[22] Eden, M. R. (2003). Property Based Process and Product Synthesis and
Design. (Doctoral dissertation). CAPEC, Department of Chemical
Engineering, Technical University of Denmark, Retrieved from
http://www.orbit.dtu.dk/fedora/objects/orbit85827/datastreams/file_5486802/c
ontent
[23] Shelley, M. D., ElHalwagi, M.M. (2000). Componentless design of recovery
and allocation systems: a functionalitybased clustering approach. Computers
& Chemical Engineering, 24(910), 20812091.
143
[24] Eljack, F. T., Solvason, C. C., Chemmangattuvalappil, N., & Eden, M. R.
(2008). A PropertyBased Approach for Simultaneous Process and Molecular
Design. Chinese Journal of Chemical Engineering, 16(3), 424?434.
[25] Chemmangattuvalappil, N. G., Eljack, F. T., Solvason, C. C., & Eden, M. R.
(2009). A novel algorithm for molecular synthesis using enhanced property
operators. Computers & Chemical Engineering, 33(3), 636?643.
[26] Vlachos, D. G. (2012). Multiscale Modeling for Emergent Behavior,
Complexity, and Combinatorial Explosion. AIChE Journal, 58(5), 13141325.
[27] Solvason, C. C. (2011). Integrated Multiscale Product Design using Property
Clustering and Decomposition Techniques in a Reverse Problem Formulation.
(Doctoral dissertation). Department of Chemical Engineering, Auburn
University, Auburn, Alabama.
[28] Poling, B. E, Prausnitz, J. P., O?Connell, J. P. (2000). The Properties of Gases
and Liquids, 5
th
edition, McGrawHill, New York, USA.
[29] Matsuda, H., Yamamoto, H., Kurihara, K., & Tochigi, K. (2007). Computer
aided reverse design for ionic liquids by QSPR using descriptors of group
contribution type for ionic conductivities and viscosities. Fluid Phase
Equilibria, 261(12), 434?443.
[30] Linusson, a, Elofsson, M., Andersson, I. E., & Dahlgren, M. K. (2010).
Statistical molecular design of balanced compound libraries for QSAR
modeling. Current medicinal chemistry, 17(19), 2001?16.
[31] Varnek, A., Kireeva, N., Tetko, I. V, Baskin, I. I., & Solov?ev, V. P. (2007).
Exhaustive QSPR studies of a large diverse set of ionic liquids: how
accurately can we predict melting points? Journal of chemical information
and modeling, 47(3), 1111?22.
[32] Michielan, L., & Moro, S. (2010). Pharmaceutical perspectives of nonlinear
QSAR strategies. Journal of chemical information and modeling, 50(6), 961?
78.
[33] Linusson, a, Gottfries, J., Lindgren, F., & Wold, S. (2000). Statistical
molecular design of building blocks for combinatorial chemistry. Journal of
medicinal chemistry, 43(7), 1320?8.
[34] Marrero, J., Gani, R. (2001). Groupcontributionbased estimation of pure
component properties. Fluid Phase Equilibria, 183184: 183208.
[35] Gabrielsson, J., Lindberg, N.O., & Lundstedt, T. (2002). Multivariate
methods in pharmaceutical applications. Journal of Chemometrics, 16(3),
141160.
144
[36] Gabrielsson, J., Lindberg, N.O., P?lsson, M., Nicklasson, F., Sj?str?m, M., &
Lundstedt, T. (2004). Multivariate methods in the development of a new
tablet formulation: optimization and validation. Drug development and
industrial pharmacy, 30(10), 103749.
[37] Gabrielsson, J., Sj?str?m, M., Lindberg, N.O., Pihl, A.C., & Lundstedt, T.
(2006). Multivariate methods in the development of a new tablet formulation:
excipient mixtures and principal properties. Drug development and industrial
pharmacy, 32(1), 720.
[38] Workman, J., & Weyer, L. (2008). Practical Guide to Interpretive Near
Infrared Spectroscopy (1st ed., p. 344). CRC Press.
[39] S.E. Stein (2010), NIST Mass Spec Data Center, NIST Standard Reference
Database Number 69, Eds. P.J. Linstrom and W.G. Mallard, National
Institute of Standards and Technology.
[40] Eriksson, L., & Johansson, E. (1996). Multivariate design and modeling in
QSAR. Chemometrics and Intelligent Laboratory Systems, 34(1), 1?19.
[41] Kramer, R. (1998). Chemometric Techniques for Quantitative Analysis. New
York (p. 203). CRC.
[42] Geladi, P., & Kowalski, B. R. (1986). Partial leastsquares regression: a
tutorial. Analytica chimica acta, 185, 1?17.
[43] Kresta, J. V., MacGregor, J. F. and Marlin, T. E. (1991). Multivariate
statistical monitoring of process performance. Canadian Journal of Chemical
Engineering. 69, 35.
[44] Wold, S., Kettaneh, N., & Tjessem, K. (1996). Hierarchical multiblock PLS
and PC models for Easier Model interpretation and as an alternative to
variable selection. Journal of Chemometrics, 10, 463482.
[45] MacGregor, J. F., and Kourti, T. (1995). Statistical process control of
multivariate process. Control Engineering Practice. 3(3), 403414.
[46] Johnson, R. A. (2007). Applied Multivariate Statistical Analysis (6th ed., p.
800). Pearson.
[47] Jolliffe, I.T. (2002). Principal component analysis. New York: Springer.
[48] Jackson, J. E. (1991). A User?s Guide to Principal Components. Journal of the
Operational Research Society (Vol. 43, pp. 641641). WileyInterscience.
[49] Jaeckle, C. M., & Macgregor, J. F. (1998). Product design through
multivariate statistical analysis of process data. AIChE Journal, 44(5), 1105
1118.
145
[50] Wentzell, P. D., & Vega Montoto, L. (2003). Comparison of principal
components regression and partial least squares regression through generic
simulations of complex mixtures. Chemometrics and Intelligent Laboratory
Systems, 65(2), 257279.
[51] Montgomery, D. C. (2000). Design and Analysis of Experiments, 5th Edition.
America (pp. 15). Wiley.
[52] BenIsrael, A., & Greville, T. N. E. (2003). Generalized Inverses: Theory and
Applications. (J. Borwein & P. Borwein, Eds.) Book (Vol. 18, p. 371). Springer.
[53] Hada, S., Solvason, C. C., & Eden, M. R. (2011). Molecular Design of Biofuel
Additives for Optimization of Fuel Characteristics. In A. C. Pistikopoulos,
E.N.; Georgiadis, M.C.; Kokossis (Ed.), 21nd European Symposium on
Computer Aided Process Engineering (pp. 1633?1637).
[54] Brignole, E., Bottini, S., & Gani, R. (1986). A strategy for design and selection
for separation processes. Fluid Phase Equilibria, 13, 331?340.
[55] Joback, K. G. (1989). Designing Molecules Possessing Desired Physical
Property Values. (Doctoral dissertation). Massachusetts Institute of
Technology. Retrieved from http://dspace.mit.edu/handle/1721.1/14191
[56] KettanehWold, N. (1992). Analysis of mixture data with partial least
squares. Chemometrics and Intelligent Laboratory Systems, 14(13), 5769.
[57] Muteki, K., & Macgregor, J. (2007). Multiblock PLS modeling for Lshape
data structures with applications to mixture modeling. Chemometrics and
Intelligent Laboratory Systems, 85(2), 186194.
[58] Cornell, J. (1990), Experiments with mixtures, 2nd Ed., Wiley, New York.
[59] Scheffe, H. (1963). Simplexcentroid design for experiments with mixtures.
Journal of the Royal Statistical Society, 25 (2), 235263.
[60] Cox, D. R. (1971). A note on polynomial response functions for mixtures.
Biometrica, 58 (1), 155159.
[61] Solvason, C. C., Chemmangattuvalappil, N. G., Eljack, F. T., & Eden, M. R.
(2009). Efficient Visual Mixture Design of Experiments using Property
Clustering Techniques. Industrial & Engineering Chemistry Research, 48(4),
22452256.
[62] Muteki, K., MacGregor, J. F., & Ueda, T. (2006). Rapid Development of New
Polymer Blends: The Optimal Selection of Materials and Blend Ratios.
Industrial & Engineering Chemistry Research, 45(13), 46534660.
146
[63] Muteki, K. (2006). Mixture product design using latent variable methods.
(Doctoral dissertation). Department of Chemical Engineering, McMaster
University.
[64] Grassmann, P., Sawistowski, H., & Hardbottle, R. (1971). Physical principles
of chemical engineering. New York: Pergamon Press.
[65] JMP, Version 9.0. SAS Institute Inc., Cary, NC, 19892012.
[66] Janaun, J., & Ellis, N. (2010). Perspectives on biodiesel as a sustainable fuel.
Renewable and Sustainable Energy Reviews, 14(4), 13121320. Elsevier Ltd.
[67] Schober, S., Mitelbach, M. (2004). The impact of antioxidants on biodiesel
oxidation stability. European Journal of Lipid Science and Technology, 106,
382389.
[68] Karmee, S.K., Chadha, A. (2005). Preparation of biodiesel from crude oil of
Pongamia pinnata. Bioresource Technology, 96 (13), 14251429.
[69] American Society for Testing and Materials (ASTM), D675112 Standard
Specification for Biodiesel Fuel Blend Stock (B100) for Middle Distillate
Fuels, ASTM International, West Conshohocken, PA, 2012, www.astm.org.
[70] Fukud, H., Kondo, A., Noda, H. (2001). Biodiesel Fuel Production by
Transesterification of Oils. Journal of Bioscience and Bioengineering, 92(5),
405416.
[71] Knothe, G., Gerpen, J. H. V., & Krahl, J. (2005). The Biodiesel Handbook. (G.
Knothe, J. H. Van Gerpen, & Jurgen Krahl, Eds.) Applied Sciences (Vol. 2, p.
302). AOCS Press.
[72] Br?ckel, U., Meier, W., & Wagner, G. (2007). Product Design and
Engineering: Best Practices. John Wiley & Sons.
[73] Canakci, M., & Sanli, H. (2008). Biodiesel production from various feedstocks
and their effects on the fuel properties. Journal of industrial microbiology
biotechnology, 35(5), 431441.
[74] Demirbas, A. (2003). Chemical and Fuel Properties of Seventeen Vegetable
Oils. Energy Sources, 25(7), 721728.
[75] European Committee for Standarization (CEN), EN 14214 (2003).
Automotive fuels ? fatty acid methyl esters (FAME) for diesel engines ?
Requirements and test methods, CEN, Brussels.
[76] Wang, P.S., Tat, M.E., Van Gerpen, J. (2005). The production of fatty acid
isopropyl ester and their use as a diesel engine fuel. Journal of the American
Oil Chemists Society, 82(11), 845849.
147
[77] Lee, I., Johnson, L.A., Hammond, E.G. (1995). Use of branchedchain esters to
reduce the crystallization temperature of biodiesel. Journal of the American
Oil Chemists Society, 72, 11551160.
[78] Knothe. G. (2008), ?Designer? Biodiesel: Optimizing Fatty Ester Composition
to Improve Fuel Properties, Energy and Fuels, 22(2), 13581364.
[79] Ribeiro, M., Pinto, A. C., Quintella, C. M., Rocha, G. O., Teixeira, L. S. G.,
Guarieiro, L. N., Rangel, C., et al. (2007). The Role of Additives for Diesel and
Diesel Blended (Ethanol or Biodiesel) Fuels?: A Review Nu. Energy & Fuels,
21(4), 24332445.
[80] Suppes, G. J., Goff, M., Burkhart, M. L., & Bockwinkel, K. (2001).
Multifunctional Diesel Fuel Additives from Triglycerides. Energy & Fuels,
15(1), 151157.
[81] Conte, E., Martinho, A., Matos, H. A., & Gani, R. (2008). Combined Group
Contribution and Atom Connectivity IndexBased Methods for Estimation of
Surface Tension and Viscosity. Industrial & Engineering Chemistry Research,
47(20), 79407954.
[82] Hall, L. H., & Kier, L. B. (2001). Issues in representation of molecular
structure the development of molecular connectivity. Journal of molecular
graphics & modelling, 20(1), 418.
[83] Lapuerta, M., Rodr?guezFern?ndez, J., & De Mora, E. F. (2009). Correlation
for the estimation of the cetane number of biodiesel fuels and implications on
the iodine number. Social Sciences, 37(11), 43374344.
[84] Socrates, G. (2001). Infrared and Raman characteristic group frequencies:
tables and charts. (G. Socrates. Ed.) Journal of Raman Spectroscopy (Vol. 35.
p. 347). Wiley.
[85] Krevelen, D. W. V., & Nijenhuis, K. T. (2009). Properties of Polymers. Their
correlation with chemical structure; their numerical estimation and prediction
from additive group contributions. (Null, Eds.)Endeavour (Vol. 16, pp. 656
658).
[86] Ayala, A. E., Simoni, L. D., Lin, Y., & Brennecke, J. F. (2006). Process design
using ionic liquids?: Physical property modeling. Computer Aided Chemical
Engineering, 21, 463468.
[87] Turner, E. A., Pye, C. C., & Singer, R. D. (2003). Use of ab Initio Calculations
toward the Rational Design of Room Temperature Ionic Liquids. Journal of
Physical Chemistry, (Table 1), 2277?2288.
[88] Holbrey, J. D., & Seddon, K. R. (1999). Ionic Liquids. Clean Products and
Processes, 1, 223236.
148
[89] Ionic Liquids Database (ILThermo). NIST Standard Reference database #
147. Retrieved from http://ilthermo.boulder.nist.gov/ILThermo/mainmenu.uix
[90] Hanwell, M. D., Curtis, D. E., Lonie, D. C., Vandermeersch, T., Zurek, E., &
Hutchison, G. R. (2012). Avogadro: an advanced semantic chemical editor,
visualization, and analysis platform. Journal of cheminformatics, 4(1), 17.
[91] Halgren, T. A. (1996). Merck molecular force field. I. Basis, form, scope,
parameterization, and performance of MMFF94. Journal of Computational
Chemistry, 17(56), 490?519.
[92] Gaussian 09 (2009), Revision A.1, Frisch, M. J.; Trucks, G. W.; Schlegel, H.
B.; Scuseria, G. E.; Robb, M. A.; Cheeseman, J. R.; Scalmani, G.; Barone, V.;
Mennucci, B.; Petersson, G. A.; Nakatsuji, H.; Caricato, M.; Li, X.; Hratchian,
H. P.; Izmaylov, A. F.; Bloino, J.; Zheng, G.; Sonnenberg, J. L.; Hada, M.;
Ehara, M.; Toyota, K.; Fukuda, R.; Hasegawa, J.; Ishida, M.; Nakajima, T.;
Honda, Y.; Kitao, O.; Nakai, H.; Vreven, T.; Montgomery, Jr., J. A.; Peralta, J.
E.; Ogliaro, F.; Bearpark, M.; Heyd, J. J.; Brothers, E.; Kudin, K. N.;
Staroverov, V. N.; Kobayashi, R.; Normand, J.; Raghavachari, K.; Rendell, A.;
Burant, J. C.; Iyengar, S. S.; Tomasi, J.; Cossi, M.; Rega, N.; Millam, J. M.;
Klene, M.; Knox, J. E.; Cross, J. B.; Bakken, V.; Adamo, C.; Jaramillo, J.;
Gomperts, R.; Stratmann, R. E.; Yazyev, O.; Austin, A. J.; Cammi, R.;
Pomelli, C.; Ochterski, J. W.; Martin, R. L.; Morokuma, K.; Zakrzewski, V. G.;
Voth, G. A.; Salvador, P.; Dannenberg, J. J.; Dapprich, S.; Daniels, A. D.;
Farkas, ?.; Foresman, J. B.; Ortiz, J. V.; Cioslowski, J.; Fox, D. J. Gaussian,
Inc., Wallingford CT.
[93] Talaty, E. R., Raja, S., Storhaug, V. J., Do, A., & Carper, W. R. (2004). Raman
and Infrared Spectra and ab Initio Calculations of C 24 MIM Imidazolium
Hexafluorophosphate Ionic Liquids. Journal of Physical Chemistry, 108,
13177?13184.
[94] Scott, A. P., & Radom, L. (1996). Harmonic Vibrational Frequencies: An
Evaluation of Hartree?Fock, M?ller?Plesset, Quadratic Configuration
Interaction, Density Functional Theory, and Semiempirical Scale Factors.
Journal of Physical Chemistry, 100(41), 16502?16513.
[95] http://cccbdb.nist.gov/vibscalejust.asp
[96] Wasserscheid, P., & Welton, T. (Eds.). (2007). Ionic Liquids in Synthesis (2nd
ed., p. 776). WILEYVCH.
[97] Todeschini, R., & Consonni, V. (2009). Molecular Descriptors for
Chemoinformatics. (R. Mannhold, H. Kubinyi, & G. Folkers, Eds.) (2nd ed., p.
1257). Weinheim: WILEYVCH.
[98] ADMET Predictor (2011). User Manual. Simulations Plus, Lancaster, CA.
149
[99] Bro, R., & Kiers, H. A. L. (2003). A new efficient method for determining the
number of components in PARAFAC models. Journal of Chemometrics, 17(5),
274?286.
[100] Thygesen, J., & Van Den Berg, F. (2011). Calibration transfer for excitation
emission fluorescence measurements. Analytica Chimica Acta, 705(12), 817.
[101] Bieroza, M., Baker, A., & Bridgeman, J. (2011). Classification and calibration
of organic matter fluorescence data with multiway analysis methods and
artificial neural networks: an operational tool for improved drinking water
treatment. Environmetrics, 22(3), 256?270.
[102] Albrecht, T., Papadokonstantakis, S., Sugiyama, H., & Hungerb?hler, K.
(2010). Demonstrating multiobjective screening of chemical batch process
alternatives during early design phases. Chemical Engineering Research and
Design, 88(56), 529?550.
[103] Srinivasan, R., & Nhan, N. T. (2008). A statistical approach for evaluating
inherent benignness of chemical process routes in early design stages.
Process Safety and Environmental Protection, 86(3), 163?174.
[104] Banimostafa, A., Papadokonstantakis, S., & Hungerb?hler, K. (2012).
Evaluation of EHS hazard and sustainability metrics during early process
design stages using principal component analysis. Process Safety and
Environmental Protection, 90(1), 8?26.
150
APPENDIX A
MIXTURE MODELS
The Scheffe canonical models and Cox polynomial models are the two most
commonly used models to analyze mixture data with multiple regressions.
A.1 Scheffe Mixture Model
Scheffe
1
introduced canonical models of various orders by eliminating some
terms from the complete polynomial model. More details can be found in
Cornell
2
. The first and second order point estimate forms of the Scheffe
models are represented by Eqs (A.2) and (A.3) respectively.
1
n
ii
i
yx?
=
=
?
(A.2)
1
n nn
i i ij i j
i i j ji
y x xx??
= <>
= +
? ??
(A.3)
By removing the constant term, the primary collinearity introduced by
Eq. (3.1) is eliminated; however, it makes it impossible to center these
models, which leads to illconditioning of the ?
?
? matrix and poor estimates
of the coefficients
3
. Moreover, the Scheffe model is difficult to interpret when
the objective of the data analysis is determination of the component effects on
1 Scheffe, H. (1963). Simplexcentroid design for experiments with mixtures. J. R. Stat. Soc.
B., 25 (2), 235263.
2 Cornell, J. (1990), Experiments with mixtures, 2nd Ed., Wiley, New York.
3 KettanehWold, N. (1992). Analysis of mixture data with partial least squares.
Chemometrics and Intelligent Laboratory Systems, 14(13), 5769.
151
the response. Moreover, secondary collinearity from additional constraints
such as upper and lower limits on components, results in poor estimates of
the regression coefficients.
A.1 Cox Mixture Model
Cox
3
, recognizing the difficulties with the Scheffe mixture model, derived new
mixture models with respect to a specified reference point in the
experimental region. They are regular polynomials with constraints involving
the reference mixture. The first and second order polynomial forms are
represented by Eqs (A.4) and (A.5).
1
n
o ii
i
yx??
=
= +
?
(A.4)
1
n nn
o i i ij i j
i ij j
y x xx?? ?
= ?
=++
? ??
(A.5)
In terms of change in constituent i,
i
? , Eqs (A.4) and (A.5) can be rewritten
as:
1
()
1
n
i
i
i
i
y ys
s
?
=
??
=+?
??
?
??
?
(A.6)
( )
2
2
11
()
1
1
nn
i ii
ii
ii
i
i
y ys
s
s
??
= =
??
??
= + ?+ ???
??
?
?
??
??
??
(A.7)
where, y(s) is the expected response at the standard reference mixture.
The Cox coefficients represent the change in the response as one moves
away from the standard reference mixture, and hence is meaningful in most
applications. However, for multiple regressions, the Cox model encounters
estimation difficulties as additional constraints are involved.
152
In most practical cases, (except when X is generated according to an
experimental design), however, the Xvariables are not statistically
independent. This situation is referred to as X being rank deficient. Although
the Scheffe canonical models and the Cox polynomial models (a
reparameterized and constrained version of the Scheffe model) eliminated the
true collinearity, and enabled the use of multiple regressions for the
estimation, the problem of near collinearities with mixture data remains.
Design of experiments (DOE) with response surface methods are usually used
to determine the optimum combination of chemical constituents that give a
desired response using a minimum number of experimental runs. While such
a design approach is adequate for most experimental designs, it suffers from
combinatorial explosion and visualization difficulties when dealing with
multicomponent mixtures
4
. Solvason et al.
5
presented a solution to these
problems by integrating the property clustering framework with existing
mixture design techniques.
4 Eden, M. R. (2003). PropertyBased Process and Product Synthesis and Design. CAPEC,
Department of Chemical Engineering, Technical University of Denmark. Ph.D Thesis.
5 Solvason, C. C., Chemmangattuvalappil, N. G., Eljack, F. T., & Eden, M. R. (2009).
Efficient Visual Mixture Design of Experiments using Property Clustering Techniques.
Industrial & Engineering Chemistry Research, 48(4), 22452256.
153
APPENDIX B
SPECTRAL INTERPRETATION
The vibrational spectrum of a molecule is considered to be a unique physical
property and is a characteristic of the molecule. Any spectrum originates
from radiation energy transferred to mechanical energy associated with the
motion of atoms held together by chemical bonds in a molecule. The first
principles approach, which is based on the fact that structural features of the
molecule, whether they are the backbone of the molecule or the functional
groups attached to the molecule, produce characteristic and reproducible
absorptions in the spectrum
6
. This information can indicate whether there is
a backbone to the structure and, if so, whether the backbone consists of linear
or branched chains. Next it is possible to determine if there is unsaturation
and/or aromatic rings in the structure. Finally, it is possible to deduce
whether specific functional groups are present. If detected, one is also able to
determine the local orientation of the group and its local environment and/or
location in the structure
7
.
6 Salzer, R. (2008). Practical Guide to Interpretive NearInfrared Spectroscopy. By Jerry
Workman, Jr. and Lois Weyer. Angewandte Chemie International Edition,47(25), 4628
4629.
7 Pasquini, C. (2003). Review Near Infrared Spectroscopy?: Fundamentals , Practical
Aspects and Analytical. Applications. Spectroscopy, 14(2), 198219.
154
B.1 Infrared Spectroscopy
The fundamental absorption frequencies (also known as group frequencies)
are the key to unlocking the structure?spectral relationships of the
associated molecular vibrations. An infrared spectrum is formed as a
consequence of the absorption of electromagnetic radiation at frequencies
that correlate to the vibration of specific sets of chemical bonds from within a
molecule. Figure B.6.5
8
shows the IR region of the electromagnetic spectrum.
The distribution of energy possessed by a molecule at any given moment can
be defined as the sum of the contributing energy terms:
total electronic vibrational rotational trnaslational
EE E E E=+++
(B.1)
Figure B.6.5: IR regions of the electromagnetic spectrum.
8 Raffael, K. D. (2002). Infrared Spectroscopy: The Theory. Journal of Molecular
Spectroscopy, 214(1), 2127
Visible
Infrared
Microwave
155
B.2 Molecular Vibrational Spectroscopy
There are two types of molecular vibrations, stretching and bending. A
molecule consisting of natoms has 3n degrees of freedom, corresponding to
the Cartesian coordinates of each atom in the molecule. In a nonlinear
molecule, 3 of these degrees are rotational and 3 are translational and the
remaining corresponds to fundamental vibrations; in a linear molecule, 2
degrees are rotational and 3 are translational. This is because in a linear
molecule, all of the atoms lie on a single straight line and hence rotation
about the bond axis is not possible. The net number of vibrational degrees of
freedom for a given molecule can be determined from Eq. (B.2):
3 6 (nonlinear)
3 5 (linear)
Number of normal mode n
n
= ?
= ?
(B.2)
For example, water, which is nonlinear, has three fundamental
vibrations as shown in Figure B.6.6
8
.
Figure B.6.6: Stretching and bending vibrational modes for H2O.
If we calculate the number of modes for a simple hydrocarbon, such as
methane (nonlinear, tetrahedral structure), a value of nine is obtained. This
would imply that nine sets of absorption frequencies would be observed in the
spectrum of methane gas. In practice, infrared spectra do not normally
display separate absorption signals for each of the 3n6 fundamental
vibrational modes of a molecule. The number of observed absorptions may be
Symmetrical stretching asymmetrical stretching Scissoring (bending)
156
increased by additive and subtractive interactions leading to combination
tones and overtones of the fundamental vibrations, in much the same way
that sound vibrations from a musical instrument interact. Furthermore, the
number of observed absorptions may be decreased by molecular symmetry,
spectrometer limitations, and spectroscopic selection rules.
One selection rule that influences the intensity of infrared absorptions is
that a change in dipole moment of the molecule should occur for a vibration
to absorb infrared energy. Absorption bands associated with C=O bond
stretching are usually very strong because a large change in the dipole takes
place in that mode. The reason for the smaller than expected number is that
several of the vibrations are redundant or degenerate, that is, the same
amount of energy is required for these vibrations. Some general trends are as
follows:
? Stretching frequencies are higher than corresponding bending
frequencies. (It is easier to bend a bond than to stretch or compress it.)
? Bonds to hydrogen have higher stretching frequencies than those to
heavier atoms.
? Triple bonds have higher stretching frequencies than corresponding
double bonds, which in turn have higher frequencies than single bonds.
(Except for bonds to hydrogen).
The stretching and bending vibrations for the important organic group,
?CH
2
, are illustrated in Figure B.6.7
9
. (The 3n?6 rule does not apply since
the ?CH
2
group represents only a portion of a molecule.) Note that bending
vibrations occur at lower frequencies than the corresponding stretching
vibrations.
9 Socrates, G. (2001). Infrared and Raman characteristic group frequencies: tables and
charts. (G. Socrates. Ed.) Journal of Raman Spectroscopy (Vol. 35. p. 347). Wiley.
157
The fundamental requirement for infrared activity, leading to
absorption of infrared radiation, is that there must be a net change in dipole
moment during the vibration for the molecule or the functional group under
study. Another important form of vibrational spectroscopy is Raman
spectroscopy, which is complementary to infrared spectroscopy. The selection
rules for Raman spectroscopy are different to those for infrared spectroscopy,
and in this case a net change in bond polarizability must be observed for a
transition to be Raman active.
Figure B.6.7: Stretching and bending vibrational modes for a CH2 goup.
Both the stretching and bending vibrations of a molecule as illustrated
in the above figures can be predicted mathematically, at least to a useful
approximation using the following mathematical description of stretching
vibrations.
158
B.2.1 Stretching Vibration
The stretching frequency of a bond can be approximated by Hooke?s
Law. In this approximation, the fundamental frequency of vibration of two
atoms and the connecting covalent bond are treated as a simple harmonic
oscillator composed of 2 masses (atoms) joined by a spring and can be
expressed by:
1
2
?
?
??
=
(B.3)
Where ? is the vibrational frequency
? is the force constant of the spring, and
? is the reduced mass.
Figure B.6.8: Ball and spring model for atoms and bonds respectively.
Using vibrations and wave equations of quantum mechanics, the
potential energy can be written as:
1
()
2
En h?= +
(B.4)
where, h is the Plank?s constant, 6.6 x 10
34
J/s,
n is the vibrational quantum number (0, 1, 2, 3, . . . )
At the ground state ( 0? = )
0
12Eh?= . Following the selection rule, when
a molecule absorbs energy, there is a promotion to the first excited state
1
32Eh?= . These correspond to bands called overtones in an IR spectrum.
159
They are of lower intensity than the fundamental vibration bands. The
difference in energy levels between the vibrational quantum states exactly
matches the radiation energy expressed as:
31
22 2
h
E h hh
?
???
??
??
?= ? = =
??
??
(B.5)
Since, cc? ??= = , and Eq. (B.3) for a diatomic molecule becomes
( )
12
12
11 1 2
1
where,
2
mm
mm
c mm m m
?
??
?
+
?
= =
+
(B.6)
where, c is the speed of light, 3 x 10
8
m/s.
? is the wavenumber, inverse of the wavelength, ? .
For example, using Hooke?s law approximation, CH bond stretching
vibrations can be estimated as:
The actual range for C?H absorptions is 2850?3000 cm
?1
. The region of
an IR spectrum where bond stretching vibrations is seen depends primarily
on whether the bonds are single, double, or triple or bonds to hydrogen. The
following table shows where absorption by single, double, and triple bonds
are observed in an IR spectrum.
m
C
12gm
6.023 10
23
?
:= m
H
1gm
6.023 10
23
?
:=
?
1
2? c?
? m
C
m
H
+
( )
?
m
C
m
H
?
:=
? 3032 cm
1?
=
160
Table B.1: IR spectrum absorption for different bond types.
Bond type Force constant
dyne/cm
Bond IR absorption range
cm
1
Single 5 x 10
5
CC, CO, CN 8001300
Double 10 x 10
5
C=C, C=O, C=N, N=O 15001900
Triple 15 x 10
5
C?C, C?N 20002300
Hydrogen CH, NH, OH 27003800
The general regions of the infrared spectrum in which various kinds of
vibrational bands are observed are outlined in the Figure B.6.9.
Figure B.6.9: Vibrational bands in infrared spectrum.
B.3 Near Infrared Spectroscopy
Absorption bands in the near infrared region (NIR) (12820  4000 cm
1
) are
weak because they arise from vibrational overtones and combination
bands. Combination bands occur when two molecular vibrations are excited
simultaneously. The intensity of overtone bands reduces by one order of
overtone for each successive overtone when a molecule is excited from the
ground vibrational state to a higher vibrational state. When the vibrational
161
quantum number v is greater than or equal to 2 then an overtone absorption
results. The first overtone results from 0? = to 2? = (52 12 2h hh? ???=).
The second overtone occurs when 0? = transitions to 3? =
( )
72 12 3? ???=h hh. Transitions arising from the nearIR absorption are
weak, hence they are referred to as forbidden transitions but these
transitions are relevant when nondestructive measurements are required
such as a solid sample.
B.4 Characterizing IR Spectroscopy
Since relative intensities are primarily functions of the atom specific dipole
changes caused by the vibration of the corresponding bonds, it follows that
their size and shape are indicators of molecular architecture. The IR
absorbance frequencies and magnitudes of the functional groups spectrums
are listed in following tables and were compiled from Socrates
9
.
Table B.2: IR absorbance frequencies and magnitudes of functional
groups.
Methine Groups, CH 
Band Wavelength Region [cm
1
]
Relative
Intensity
%
Transmi
ttance
High Low Average
Bending (?) 1360 1320 1340 w 90
Stretching (?) 2890 2880 2885 w 90
Methylene Groups, CH2
Band Wavelength Region [cm
1
]
Relative
Intensity
%
Transmi
ttance
High Low Average
Scissoring Bend (?
s
) 1480 1440 1460 m 50
Symmetrical Stretching (?
s
) 2870 2840 2855 m 50
Asymmetrical Stretching (?
a
) 2940 2915 2928 ms 30
162
Methyl Groups, CH3
Band Wavelength Region [cm
1
]
Relative
Intensity
%
Transmi
ttance
High Low Average
Sym Bend (?
s
) 1390 1370 1380 ms 30
Asym. Bend. (?
a
) 1465 1440 1453 m 50
Symmetrical Stretching (?
s
) 2885 2865 2875 m 50
Asymmetrical Stretching (?
a
) 2975 2950 2963 ms 30
Tetramethyl Groups, C(CH3)3
Band Wavelength Region [cm
1
]
Relative
Intensity
%
Transmi
ttance
High Low Average
CC Skeletal Bend (?
s
) 930 925 928 m 50
CC Skeletal Bend (?
s
) 1010 990 1000 mw 70
CC Skeletal Bend (?
s
) 1225 1165 1195 m 50
CC Skeletal Bend (?
s
) 1255 1245 1250 m 50
CCH
3
Sym. Bend. (?
s
) 1395 1350 1373 ms 30
CCH
3
Sym. Bend. (?
s
) 1420 1375 1398 m 50
CCH
3
Asym. Bend. (?
a
) 1475 1435 1455 m 50
CH
Sym. Stretching (?
s
) 2885 2865 2875 m 50
CH
Asym. Stretching (?
a
) 2975 2950 2963 ms 30
Aliphatic Methoxy Groups, OCH3 (Special Methyl)
Band Wavelength Region [cm
1
]
Relative
Intensity
%
Transmi
ttance
High Low Average
CO Def. Bend. (?
d
) 580 340 460 mw 70
CH
3
/CO Rocking Bend (?
d
) 1190 1100 1145 mw 70
CH
3
Rock Bend (?
d
) 1235 1155 1195 mw 70
CH
3
Sym Bend (?
s
) 1460 1420 1440 M 50
CH
3
Asym. Bend. (?
a
) 1475 1435 1455 m 50
CH
3
Asym. Bend. (?
a
) 1485 1445 1465 m 50
CH
3
Sym. Str. (?
s
) 2880 2815 2848 m 50
163
CH
3
Asym. Str. (?
a
) 2985 2920 2953 m 50
CH Asym. Str. (?
a
) 3030 2950 2990 m 50
Vinyl Group, CH=CH2
Band Wavelength Region [cm
1
]
Relative
Intensity
%
Transmi
ttance
High Low Average
C=C Tors. Bend (?
T
) 485 410 448 ms 30
C=C Eth. Twist. Bend. (?
t
) 600 380 490 ms 30
C=C Eth. Twist. Bend. (?
t
) 720 410 565 w 90
CH
2
OoP Rock. Bend. (?
r
) 980 810 895 s 10
CH OoP Bending. (?
r
) 1010 940 975 s 10
CH IP Def. Bend. (?
d
) 1180 1010 1095 mw 70
CH
2
Def. Bend. (?
d
) 1330 1240 1285 m 50
CH
2
Sci. Bend. (?
s
) 1440 1360 1400 m 50
C=C Stretching (?) 1645 1640 1643 mw 70
CH
2
1
st
Overtone Bend (2?) 1840 1820 1830 v 90
CH 1
st
Overtone Bend (2?) 1990 1970 1980 v 90
CH
2
Sym. Stretch (?
s
) 3070 2930 3000 M 50
CH Stretch (?) 3110 2980 3045 M 50
CH
2
Asym. Stretch (?
a
) 3150 3000 3075 M 50
Vinylidene Group, CH2=C 
Band Wavelength Region [cm
1
]
Relative
Intensity
%
Transmi
ttance
High Low Average
C=C Skeletal Stretch (?) 470 435 453 mw 70
C=C Skeletal Stretch (?) 560 530 545 s 10
C=C Eth. Twist. Bend. (?
t
) 715 680 698 w 90
CH
2
OoP Rock. Bend. (?
r
) 895 885 890 s 10
CH
2
IP Def. Bend. (?
d
) 1320 1290 1305 w 90
CH
2
Sci. Def Bend. (?
s
) 1420 1405 1413 w 90
164
C=C Stretching (?) 1675 1625 1650 mw 70
CH
2
1
st
Overtone Bend (2?) 1800 1750 1775 w 90
CH
2
Sym. Stretch (?
s
) 2985 2970 2978 mw 70
CH
2
Asym. Stretch (?
a
) 3095 3075 3085 mw 70
cisVinylene Group, CH=CH
Band Wavelength Region [cm
1
]
Relative
Intensity
%
Transmi
ttance
High Low Average
CH Tors. Bend (?
T
) 490 320 405 ms 30
C=C Skeletal Bend (?
T
) 500 460 480 s 10
C=CH Def. Bend. (?
d
) 590 440 515 ms 30
C=C Eth. Twist. Bend. (?
t
) 630 570 600 s 10
CH Wag. Bend. (?
w
) 790 650 720 ms 30
CH Wag. Bend. (?
w
) 1000 850 925 mw 70
CH Def. Bend. (?
d
) 1295 1185 1240 w 90
CH Def. Bend. (?
d
) 1425 1355 1390 w 90
C=C Stretching (?) 1665 1630 1648 m 50
CH Stretch (?) 3040 2980 3010 m 50
CH Stretch (?) 3090 3010 3050 m 50
transVinylene Group, CH=CH
Band Wavelength Region [cm
1
]
Relative
Intensity
%
Transmi
ttance
High Low Average
CH Tors. Bend (?
T
) 490 320 405 ms 30
C=C Skeletal Bend (?
T
) 500 480 490 s 10
C=CH Def. Bend. (?
d
) 590 440 515 ms 30
C=C Eth. Twist. Bend. (?
t
) 580 515 548 ms 30
CH Wag. Bend. (?
w
) 850 750 800 mw 70
CH Wag. Bend. (?
w
) 1000 910 955 v 90
CH Def. Bend. (?
d
) 1305 1260 1282.5 v 90
CH Def. Bend. (?
d
) 1340 1355 1347.5 v 90
165
C=C Stretching (?) 1680 1665 1673 mw 70
CH Stretch (?) 3050 3000 3025 m 50
CH Stretch (?) 3065 3015 3040 m 50
Hydroxyl Group, OH (with intermolecular Hbonding)
Band Wavelength Region [cm
1
]
Relative
Intensity
%
Transmi
ttance
High Low Average
Bending (?) 710 570 640 m 50
Stretching (?) 3550 3230 3390 ms 30
Primary Alcohol Group, CH2OH (with intermolecular Hbonding)
Band Wavelength Region [cm
1
]
Relative
Intensity
%
Transmi
ttance
High Low Average
CO Def. Bend (?
d
) 555 395 475 mw 70
CO IP. Def. Bend (?
d
) 500 440 470 w 90
OH OoP. Def. Bending (?
d
) 710 570 640 mw 70
CCO Stretch (?) 900 800 850 m 50
CH
2
Twist Bend (?
t
) 960 800 880 mw 70
CCO Stretch (?) 1090 1000 1045 S 10
CH
2
Twist. Bending (?
t
) 1300 1280 1290 mw 70
CH
2
Wag Bend (?
w
) 1390 1280 1335 mw 70
OH Def. Bend (?
d
) 1440 1260 1350 ms 30
CH
2
Def Bend (?
d
) 1480 1410 1445 mw 70
CH
2
Sym. Stretch (?
s
) 2935 2840 2888 mw 70
CH
2
Asym. Stretch (?
a
) 2990 2900 2945 mw 70
OH Stretching (?) 3550 3230 3390 ms 30
166
Secondary Alcohol Group,  CHOH (with intermolecular Hbonding)
Band Wavelength Region [cm
1
]
Relative
Intensity
%
Transmi
ttance
High Low Average
CO OoP. Def. Bend (?
d
) 390 330 360 mw 70
CO IP. Def. Bend (?
d
) 500 440 470 w 90
OH OoP. Def. Bending (?
d
) 660 600 630 mw 70
CCO Stretch (?) 900 800 850 m 50
CO Stretch (?) 1150 1075 1113 mw 70
CH Def. Bending (?
d
) 1350 1290 1320 s 10
CH Wag Bend (?
w
) 1400 1330 1365 s 10
OH + CH
2
Coup. Bend. (?
c
) 1430 1370 1400 mw 70
OH Def. Bend (?
d
) 1440 1260 1350 mw 70
CH Stretching (?) 2890 2880 2885 ms 30
OH Stretching (?) 3550 3230 3390 mw 70
Aliphatic Ether Group, O
Band Wavelength Region [cm
1
]
Relative
Intensity
%
Transmi
ttance
High Low Average
COC def vib (?
d
) 440 420 430 w 90
Sym COC str (?
s
) 1140 820 980 w 90
Asym COC Str (?
a
) 1150 1060 1105 s 10
Rocking vib 1200 1185 1193 mw 70
Wagging vib 1400 1360 1380 m 50
Asym and Sym CH
3
def. vib 1470 1435 1453 m 50
CH
2
def vib 1475 1445 1460 m 50
Sym CH
2
str 2880 2835 2858 m 50
Sym. CH
3
Str 2900 2840 2870 m 50
Asym CH
2
str 2955 2920 2938 m 50
Asym. CH
3
Str 2995 2955 2975 m 50
167
Alkyl Peroxide Group, OO
Band Wavelength Region [cm
1
]
Relative
Intensity
%
Transmi
ttance
High Low Average
OO Stretch (?) 900 800 850 w 90
CO Stretch (?) 1150 1030 1090 ms 30
Saturated Aliphatic Ester Group, COO
Band Wavelength Region [cm
1
]
Relative
Intensity
%
Transmi
ttance
High Low Average
COC Sym. Stretch (?
s
) 1160 1050 1105 s 10
COC Asym. Stretch (?
a
) 1275 1185 1230 s 10
C=O Stretch (?) 1750 1725 1738 s 10
C=O 1
st
Overtone (2?
s
) 3460 3440 3450 w 90
Saturated Aliphatic Methyl Ester Group, COOCH3
Band Wavelength Region [cm
1
]
Relative
Intensity
%
Transmi
ttance
High Low Average
Unlisted 450 430 440 ms 30
COO Rocking Bend (?
r
) 530 340 435 w 90
CCO Sym. Stretch (?
s
) 1160 1050 1105 s 10
CO Stretch (?) 1175 1155 1165 s 10
CCO Asym. Stretch (?
a
) 1275 1185 1230 s 10
OCH
3
Stretch (?) 1315 1195 1255 s 10
Unlisted 1370 1350 1360 w 90
CH
3
Sym. Def. Bend (?
d
) 1460 1420 1440 ms 30
CH
3
Asym. Def. Bend (?
d
) 1465 1420 1443 ms 30
CH
3
Asym. Def. Bend (?
d
) 1485 1435 1460 m 50
C=O Stretch (?) 1750 1725 1738 s 10
CH
3
Sym. Stretch (?) 3000 2860 2930 m 50
168
CH
3
Asym. Stretch (?) 3030 2950 2990 mw 70
CH
3
Asym. Stretch (?) 3050 2980 3015 mw 70
C=O 1
st
Overtone (2?
s
) 3460 3440 3450 w 90
Saturated Aliphatic Ethyl Ester Group, COOCH2CH3
Band Wavelength Region [cm
1
]
Relative
Intensity
%
Transmi
ttance
High Low Average
COC Def Bend (?
d
) 370 250 310 mw 70
COC Def Bend (?
d
) 395 305 350 mw 70
COO Rocking Bend (?
r
) 485 365 425 mw 70
CO OoP Rocking Bend (?
r
) 700 550 625 w 90
CH
2
Rocking Bend (?
r
) 825 775 800 w 90
CC str (?) 940 850 895 w 90
CH
3
Rock. Bend (?
r
) 1150 1080 1115 w 90
CCO Sym. Stretch (?
s
) 1160 1050 1105 s 10
CH
3
Rock. Bend (?
r
) 1195 1135 1165 w 90
CCO Asym. Stretch (?
a
) 1275 1185 1230 s 10
CH
2
Twist. Bend (?
T
) 1340 1325 1333 mw 70
CH
2
Wag. Bend (?
w
) 1385 1335 1360 mw 70
CH
3
Sym. Def. Bend (?) 1390 1360 1375 ms 30
CH
3
Asym. Def. Bend (?) 1480 1435 1458 m 50
OCH
2
Def. Bend. (?) 1490 1460 1475 mw 70
C=O Stretch (?) 1750 1725 1738 s 10
CH
3
Stretch (?) 2920 2860 2890 w 90
CH
3
Sym. Stretch (?
s
) 2930 2890 2910 w 90
CH
3
Asym. Stretch (?
a
) 2995 2930 2963 m 50
C=O 1
st
Overtone (2?
s
) 3460 3440 3450 w 90
169
Acrylate Ester Group, CH2=CHCOO
Band Wavelength Region [cm
1
]
Relative
Intensity
%
Transmi
ttance
High Low Average
C=C Tors. Bend (?
T
) 485 410 448 ms 30
COO Rocking Bend (?
r
) 485 365 425 mw 70
C=C Eth. Twist. Bend. (?
t
) 600 380 490 ms 30
COC Def Bend (?) 675 660 668 m 50
CO OoP Rocking Bend (?
r
) 700 550 625 w 90
=CH2 Twist Bend (?
t
) 810 800 805 ms 30
CH
2
Rocking Bend (?
r
) 825 775 800 w 90
CC str (?) 940 850 895 w 90
=CH2 Wag. Bend (?
w
) 970 960 965 s 10
CH Def. Wag (?
w
) 990 980 985 m 50
CH OoP Bending. (?
r
) 1010 940 975 s 10
CC Skel. Bend (?) 1070 1065 1068 m 50
CH
3
Rock. Bend (?
r
) 1150 1080 1115 w 90
CCO Sym. Stretch (?
s
) 1160 1050 1105 s 10
CH IP Def. Bend. (?
d
) 1180 1010 1095 mw 70
Unlisted 1200 1195 1198 s 10
CCO Asym. Stretch (?
a
) 1275 1185 1230 s 10
=CH Rock. Bend (?
r
) 1290 1270 1280 m 50
Unlisted 1290 1280 1285 s 10
=CH2 Def Bend (?) 1420 1400 1410 m 50
CH
2
Sci. Bend. (?
s
) 1440 1360 1400 m 50
C=C Stretch (?) 1635 1615 1625 m 50
C=C Stretch (?) 1650 1630 1640 ms 30
C=O Stretch (?) 1725 1710 1718 s 10
CH
2
1
st
Overtone Bend (2?) 1840 1820 1830 w 90
CH 1
st
Overtone Bend (2?) 1990 1970 1980 w 90
CH
2
Sym. Stretch (?
s
) 3070 2930 3000 m 50
CH Stretch (?) 3110 2980 3045 m 50
CH
2
Asym. Stretch (?
a
) 3150 3000 3075 m 50
C=O 1
st
Overtone (2?
s
) 3460 3440 3450 w 90
170
Methacrylate Ester Group, CH2=C(CH3)COO
Band Wavelength Region [cm
1
]
Relative
Intensity
%
Transmi
ttance
High Low Average
C=C Skeletal Stretch (?) 470 435 453 mw 70
C=C Skeletal Stretch (?) 560 530 545 s 10
COC Def Bend (?) 660 645 653 m 50
C=C Eth. Twist. Bend. (?
t
) 715 680 698 w 90
CC Skel Bend (?) 825 805 815 ms 30
CH
2
OoP Rock. Bend. (?
r
) 895 885 890 s 10
=CH2 Wag. Bend (?
w
) 950 935 942.5 s 10
CC Skel. Bend (?) 1010 990 1000 m 50
CC Skel. Bend (?) 1020 1000 1010 m 50
COC Sym. Stretch (?
s
) 1160 1150 1155 s 10
COC Asym. Stretch (?
a
) 1275 1185 1230 s 10
Unlisted 1310 1290 1300 s 10
CH
2
IP Def. Bend. (?
d
) 1320 1290 1305 w 90
=CH Rock. Bend (?
r
) 1335 1315 1325 m 50
CH
3
Sym Bend (?
s
) 1390 1370 1380 ms 30
=CH2 Def Bend (?) 1420 1400 1410 m 50
CH
3
Asym. Bend. (?
a
) 1465 1440 1453 m 50
C=C Stretch (?) 1650 1630 1640 m 50
C=O Stretch (?) 1725 1710 1718 s 10
CH
2
1
st
Overtone Bend (2?) 1800 1750 1775 w 90
CH
3
Sym. Stretching (?
s
) 2885 2865 2875 m 50
CH
2
Sym. Stretch (?
s
) 2985 2970 2978 mw 70
CH
3
Asym. Stretching (?
a
) 2975 2950 2963 ms 30
CH
2
Asym. Stretch (?
a
) 3095 3075 3085 mw 70
C=O 1
st
Overtone (2?
s
) 3460 3440 3450 w 90
171
oAlkyl Phenol Group (With Hbonding)
Band Wavelength Region [cm
1
]
Relative
Intensity
%
Transmi
ttance
High Low Average
COH IP Bending (?) 450 375 413 w 90
OH OoP. Def. Bending (?
d
) 720 600 660 s 10
CO Stretch (?) 1260 1180 1220 s 10
OH IP Bending (?) 1410 1310 1360 s 10
COH bending vib 1330 1310 1320 m 50
OH Stretching (?) 3250 3000 3125 m 50
CO Str 1255 1240 1248 s 10
OH def and CO str vib 1175 1160 1168 s 10
OH def and CO str vib 760 740 750 m 50
OR substituted 3595 3470 3533 m 50
pAlkyl Phenol Group (With Hbonding)
Band Wavelength Region [cm
1
]
Relative
Intensity
%
Transmi
ttance
High Low Average
COH IP Bending (?) 450 375 413 w 90
OH OoP. Def. Bending (?
d
) 720 600 660 s 10
CO Stretch (?) 1260 1180 1220 s 10
OH IP Bending (?) 1410 1310 1360 s 10
OH Stretching (?) 3250 3000 3125 m 50
CO Str 1260 1245 1253 s 10
OH def and CO str vib 1175 1165 1170 s 10
OH def and CO str vib 835 815 825 m 50
OR substituted 3595 3470 3533 m 50
172
Monosubstituted Benzenes
Band Wavelength Region [cm
1
]
Relative
Intensity
%
Transmi
ttance
High Low Average
Ring OoP def vib 560 415 488 ms 30
Ring IP def vib 630 605 618 mw 70
=CH Ring OoP def vib 710 670 690 s 10
=CH OoP def vib 820 720 770 s 10
=CH OoP def vib 900 860 880 mw 70
=CH IP def vib 1010 990 1000 w 90
=CH IP def vib 1040 1000 1020 mw 70
=CH IP def vib 1085 1050 1068 m 50
=CH IP def vib 1175 1130 1153 w 90
=CH IP def vib 1195 1165 1180 mw 70
=CH IP def vib 1250 1230 1240 w 90
C=C Str Vib 1625 1590 1608 v 90
=CH Str. Vib 3105 3000 3053 m 50
1,2,4 Trisubstituted Benzene
Band Wavelength Region [cm
1
]
Relative
Intensity
%
Transmi
ttance
High Low Average
Ring OoP def vib 475 425 450 ms 30
=CH OoP def vib (2H) 740 690 715 mw 70
=CH OoP def vib (2H) 780 760 770 s 10
=CH OoP def vib (2H) 860 840 850 ms 30
=CH OoP def vib (1H) 940 885 913 ms 30
=CH IP def vib 1040 1020 1030 mw 70
=CH IP def vib 1160 1140 1150 mw 70
=CH IP def vib 1220 1200 1210 w 90
C=C Str Vib 1625 1590 1608 v 90
=CH Str. Vib 3105 3000 3053 m 50
173
Table B.3: Abbreviations, names and structures of investigated ionic liquid training set.
ID Abbreviation
Structure & Name Properties
Cation Anion ? @
293K
Pa.s
? @
298K
Kg/m
3
T
m
K
1 [bmIm]mSO
4
N
+ CH
3
NCH
3
O

SO
CH
3
O
O
0.2890 1212 269.1
1butyl3methyl imidazolium Methyl sulfate
2 [emIm]PF
6
N N
+ CH
3
CH
3
P

F
F
F
F
F
F
0.0234 1422 307.1
1ethyl3methyl imidazolium Hexafluorophosphate
3 [emIm]BF
4
N N
+ CH
3
CH
3
B

F
FF
F
0.0665 1248 247.1
1ethyl3methyl imidazolium Tetrafluoroborate
4 [bmIm]dCN
N
+ CH
3
NCH
3
NN
N

0.0332 1058 267.1
1butyl3methyl imidazolium Dicyanamide
174
ID Abbreviation
Structure & Name Properties
Cation Anion ? @
293K
Pa.s
? @
298K
Kg/m
3
T
m
K
5 [1,2m
2
pIm]Tf
2
N
N
+
N
CH
3
CH
3
CH
3
N

S S
O
O
O
O
F
F
F
F
F
F
0.0901 1457 288.1
1,2dimethyl3propyl imidazolium Bis(trifluromethylsulfonly)amide
6 [b2,3m
2
Im]BF
4
CH
3
N
CH
3
N
+CH
3
B

F
FF
F
0.4558 1094 310.1
1butyl2,3dimethyl imidazolium Tetrafluoroborate
7 [bm
3
NH
4
]Tf
2
N
N
+
CH
3
CH
3
CH
3
CH
3
N

S S
O
O
O
O
F
F
F
F
F
F
0.1407 1397 289.1
butyltrimethylammonium Bis(trifluromethylsulfonly)amide
8 [bmIm]PF
6
N
+ CH
3
NCH
3
P

F
F
F
F
F
F
0.4500 1360 283.1
1butyl3methyl imidazolium Hexafluorophosphate
175
ID Abbreviation
Structure & Name Properties
Cation Anion ? @
293K
Pa.s
? @
298K
Kg/m
3
T
m
K
9 [bPy]Tf
2
N
N
+
CH
3
N

S S
O
O
O
O
F
F
F
F
F
F
0.0756 1454 299.1
1butylpyridinium Bis(trifluromethylsulfonly)amide
10 [bPy]BF
4
N
+
CH
3
B

F
FF
F
0.2231 1203 279.8
1butylpyridinium Tetrafluoroborate
11 [bmIm]CF
3
SO
3
N
+ CH
3
NCH
3
S O

O
O
F
F
F
0.0990 1384 262.2
1butyl3methyl imidazolium Triflate
12 [1,3m
2
Im]Tf
2
N
N
CH
3
N
+ CH
3
N

S S
O
O
O
O
F
F
F
F
F
F
0.0475 1570 299.1
1,3dimethyl imidazolium Bis(trifluromethylsulfonly)amide
176
ID Abbreviation
Structure & Name Properties
Cation Anion ? @
293K
Pa.s
? @
298K
Kg/m
3
T
m
K
13 [m
2
Im]Tf
2
N
N
+ CH
3
N
CH
3
N

S S
O
O
O
O
F
F
F
F
F
F
0.0260 1519 240.1
1ethyl3methyl imidazolium Bis(trifluromethylsulfonly)amide
14 [e
3
oNH
4
]Tf
2
N
N
+
CH
3
CH
3
CH
3
CH
3
N

S S
O
O
O
O
F
F
F
F
F
F
0.1810 1270 287.1
N,N,Ntriethyl1octanaminium Bis(trifluromethylsulfonly)amide
15 [pmIm]PF
6
N
+ CH
3
NCH
3
B

F
FF
F
0.1030 1240 256.1
1propyl3methyl imidazolium Tetrafluoroborate
16 [emIm]dCN
N N
+ CH
3
CH
3
NN
N

0.0169 1106 255
1ethyl3methyl imidazolium Dicyanamide
177
ID Abbreviation
Structure & Name Properties
Cation Anion ? @
293K
Pa.s
? @
298K
Kg/m
3
T
m
K
17 [emIm]mSO
4
N N
+ CH
3
CH
3
O

SO
CH
3
O
O
0.0785 1234 236.3
1ethyl3methyl imidazolium Methylsulfate
18 [emIm]Cl
N N
+ CH
3
CH
3
Cl

1.583 1186 358.1
1ethyl3methyl imidazolium Chloride
19 [bmIm]Cl
N
+ CH
3
NCH
3
Cl

40.89 1080 340.1
1butyl3methyl imidazolium Chloride
20 [hmIm]Tf
2
N
N
CH
3
N
+CH
3
N

S S
O
O O
O
F
F
F
F
F
F
0.0780 1370 266
1hexyl3methylimidazolium Bis(trifluromethylsulfonly)amide
178
ID Abbreviation
Structure & Name Properties
Cation Anion ? @
293K
Pa.s
? @
298K
Kg/m
3
T
m
K
21 [m
2
Im]mSO
4
CH
3
N
+
CH
3
N
O

SO
CH
3
O
O
0.0928 1328 308.9
1,3dimethyl imidazolium Methylsulfate
22 [emIm]CF
3
SO
3
N N
+ CH
3
CH
3
S O

O
O
F
F
F
0.0500 1384 262.2
1ethyl3methyl imidazolium Triflate
23 [bmIm]oSO
4
N
+ CH
3
NCH
3
S
O
O

O
O
CH
3
0.8745 1060 307.6
1butyl3methyl imidazolium Octylsulfate
179
Table B.4: Candidate ionic liquid solutions enumerated from exhaustive search in latent property space.
Abbreviation
Structure & Name Latent Properties
Cation Anion q1 q2 q3
1 [meIm]mSO
4
N
+
N
CH
3
CH
3
O

SO
CH
3
O
O
0.0185 0.0045 0.0293
1methyl3ethyl imidazolium Methyl sulfate
2 [ePy]mSO
4
N
+
CH
3
O

SO
CH
3
O
O
0.0216 0.0039 0.0140
1ethylpyridinium Methyl sulfate
3 [e
3
pNH
4
]mSO
4
N
+
CH
3
CH
3
CH
3
CH
3
O

SO
CH
3
O
O
0.0457 0.0292 0.0147
N,N,N triethyl propanaminium Methyl sulfate
4 [meIm]PF
6
N
+
N
CH
3
CH
3
P

F
F
F
F
F
F
0.0113 0.0193 0.0314
1methyl3ethyl imidazolium
Hexafluorophosphate
180
Abbreviation
Structure & Name Latent Properties
Cation Anion q1 q2 q3
5 [ePy]PF
6
N
+
CH
3
P

F
F
F
F
F
F
0.0144 0.0109 0.0160
1ethylpyridinium
Hexafluorophosphate
6 [e
3
eNH
4
]PF
6
N
+
CH
3
CH
3
CH
3
CH
3
P

F
F
F
F
F
F
0.0384 0.0440 0.0126
N,N,N triethyl propanaminium Hexafluorophosphate
7 [meIm]BF
4
N
+
N
CH
3
CH
3
B

F
FF
F
0.0111 0.0231 0.0012
1methyl3ethyl imidazolium Tetrafluoroborate
8 [e2,3m
2
Im]BF
4
CH
3
N
CH
3
N
+CH
3
B

F
FF
F
0.0050 0.0185 0.0162
1ethyl2,3dimethyl imidazolium
Tetrafluoroborate
9 [ePy]BF
4
N
+
CH
3
B

F
FF
F
0.0142 0.0148 0.0165
1ethylpyridinium
Tetrafluoroborate
181
Abbreviation
Structure & Name Latent Properties
Cation Anion q1 q2 q3
10 [e
3
eNH
4
]BF
4
N
+
CH
3
CH
3
CH
3
CH
3
B

F
FF
F
0.0028 0.0128 0.0302
N,N,N triethyl ethanaminium
Tetrafluoroborate
11 [e
3
bNH
4
]BF
4
N
+
CH
3
CH
3
CH
3
CH
3
B

F
FF
F
0.0290 0.0557 0.0135
N,N,N triethyl butanaminium
Tetrafluoroborate
12 [e
3
pNH
4
]dCN
N
+
CH
3
CH
3
CH
3
CH
3
NN
N

0.0427 0.0068 0.0130
N,N,N triethyl propanaminium
Dicyanamide
13 [meIm]Tf
2
N
N
+
N
CH
3
CH
3
N

S S
O
O
O
O
F
F
F
F
F
F
0.0091 0.0069 0.0002
1methyl3ethyl imidazolium Bis(trifluromethylsulfonly)amide
182
Abbreviation
Structure & Name Latent Properties
Cation Anion q1 q2 q3
14 [e2,3m
2
Im]Tf
2
N
CH
3
N
CH
3
N
+CH
3
N

S S
O
O
O
O
F
F
F
F
F
F
0.0030 0.0022 0.0176
1ethyl2,3dimethyl imidazolium
Bis(trifluromethylsulfonly)amide
15 [ePy]Tf
2
N
N
+
CH
3
N

S S
O
O
O
O
F
F
F
F
F
F
0.0121 0.0015 0.0151
1ethylpyridinium
Bis(trifluromethylsulfonly)amide
16 [e
3
eNH
4
]Tf
2
N
N
+
CH
3
CH
3
CH
3
CH
3
N

S S
O
O
O
O
F
F
F
F
F
F
0.0008 0.0035 0.0316
N,N,N triethyl ethanaminium
Bis(trifluromethylsulfonly)amide
17 [e
3
bNH
4
]Tf
2
N
N
+
CH
3
CH
3
CH
3
CH
3
N

S S
O
O
O
O
F
F
F
F
F
F
0.0270 0.0394 0.0149
N,N,N triethyl butanaminium
Bis(trifluromethylsulfonly)amide
183
Abbreviation
Structure & Name Latent Properties
Cation Anion q1 q2 q3
18 [meIm]CF
3
SO
3
N
+
N
CH
3
CH
3
S O

O
O
F
F
F
0.0069 0.0182 0.0118
1methyl3ethyl imidazolium
Triflate
19 [e2,3m
2
Im]CF
3
SO
3
CH
3
N
CH
3
N
+CH
3
S O

O
O
F
F
F
0.0008 0.0135 0.0292
1ethyl2,3dimethyl imidazolium
Triflate
20 [ePy]CF
3
SO
3
N
+
CH
3
S O

O
O
F
F
F
0.0099 0.0098 0.0035
1ethylpyridinium
Triflate
21 [e
3
bNH
4
]CF
3
SO
3
N
+
CH
3
CH
3
CH
3
CH
3
S O

O
O
F
F
F
0.0248 0.0507 0.0265
N,N,N triethyl butanaminium
Triflate
184
Abbreviation
Structure & Name Latent Properties
Cation Anion q1 q2 q3
22 [ePy]oSO
4
N
+
CH
3
S
O
O

O
O
CH
3
0.0103 0.0116 0.0241
1ethylpyridinium
Octylsulfate
23 [e
3
pNH
4
]oSO
4
N
+
CH
3
CH
3
CH
3
CH
3
S
O
O

O
O
CH
3
0.0344 0.0446 0.0046
N,N,N triethyl propanaminium Octylsulfate