Multivariate Characterization, Modeling, and Design of Chemical Products in Property Cluster Space by Subin Hada A dissertation submitted to the graduate faculty of Auburn University in partial fulfillment of the requirements for the degree of Doctor of Philosophy Auburn, Alabama August 3, 2013 Keywords: Chemical Product Design, Multivariate Statistical Analysis, Chemometric Technique, Reverse Problem Formulation, Property Clustering. Copyright 2013 by Subin Hada Approved by Mario R. Eden, Chair and McMillan Professor of Chemical Engineering Christopher B. Roberts, Uthlaut Professor of Chemical Engineering Elizabeth Lipke, Assistant Professor of Chemical Engineering Sushil Adhikari, Assistant Professor of Biosystems Engineering Mahmoud El-Halwagi, McFerrin Professor of Chemical Engineering ii ABSTRACT??.. The focus of this dissertation is on the development of in silico approaches for the logical and systematic solution of chemical product design problems. The application of multivariate characterization, modeling, and design is accomplished by utilizing interdisciplinary methods and tools that extend through multivariate statistics, applied mathematics and computer science. Methodologies and techniques such as spectroscopy-based group contribution methods, chemometric/chemoinformatic techniques, reverse problem formulation, and property clustering techniques are integrated within computer-aided molecular/mixture design (CAMD) algorithms to design chemical products in a computationally efficient manner that provides optimum performance in terms of customer requirements. Property-based design techniques and multivariate data-driven modeling and optimization strategies are presented in this dissertation covering two specific areas of chemical product design: mixture and molecular design. In mixture design, the property integration framework is combined with multivariate statistical techniques and applied in a reverse problem formulation on chemical product design problems by systematic and insightful use of past data describing the properties of the raw materials, their blend ratios, and the process conditions during the production of a range of product grades to achieve new and improved products. Projection methods, like principal component analysis (PCA) and partial least squares (PLS) are applied to identify the underlying relationships necessary for simultaneous optimization of all three variables. The method is illustrated using a polymer blending problem. iii In molecular design, multivariate characterization techniques like infrared (IR) spectroscopy are utilized to generate numerical descriptors of molecular architecture in terms of IR frequency of a set of representative samples. Models based on quantitative structure-property relations (QSPR) are used to elucidate structure-property relationships. Applying principal component analysis, high dimensional and highly correlated molecular descriptor variables are transformed into low dimensional and statistically independent latent variables. These latent variables are then used to calibrate latent property models. Finally, the reverse design of molecules is accomplished by exhaustively searching for molecular structures with target properties, from the combinatorial building blocks. A characterization-based group contribution method is utilized to estimate the properties of the formulated chemical products. The concepts and the solution methodologies are demonstrated using two proof-of-concept examples: biodiesel additive formulation and ionic liquid design. iv ACKNOWLEDGEMENTS My deepest gratitude to my advisor, Dr. Mario R. Eden, for his gentle guidance, patience, and constant encouragement throughout my doctoral program at Auburn University. His mentorship and support has been invaluable on both an academic and a personal level. I will miss travelling with him around the world for conferences. It is a pleasure to thank my dissertation committee, and outside reader, respectively: Dr. Christopher Roberts, Dr. Elizabeth Lipke, Dr. Sushil Adhikari, Dr. Mahmoud El-Halwagi, and Dr. Steven Taylor. Thank you for having patience as I endeavored to complete my work. I appreciate the feedback and review that helped to improve and complete this dissertation. I thank my fellow labmates in the Eden laboratory who supported me and shared their enthusiasm for and comments on my work. My special thanks to Robert Herring for the stimulating discussions and for the sleepless nights we worked together before deadlines. I gratefully acknowledge the financial support provided by the Southeastern Regional Sun Grant Program, and the USDA-AFRI (Award# 2011-67009-20077). In addition, I thank Dr. Koji Muteki; Pfizer Inc., CT, USA, for providing the starch blending data used in the mixture design proof- of-concept example. I cannot end without thanking my parents, Rajendra P. Hada and Subrata Hada, and my brother Sujin Hada, on whose constant encouragement and love I have relied on throughout long years of my educational journey. Their persistent audacity and conviction will always inspire me. It is to them that I dedicate this work. I would also like to thank my wife, Fui Chi Yap, for her unequivocal support. v TABLE OF CONTENTS Page ABSTRACT??.. ............................................................................................... ii ACKNOWLEDGEMENTS .............................................................................. iv TABLE OF CONTENTS .................................................................................. v LIST OF TABLES ......................................................................................... viii LIST OF FIGURES........................................................................................... x LIST OF ABBREVATIONS .......................................................................... xiv CHAPTER 1 INTRODUCTION ................................................................. 1 1.1 Chemical Product Design ....................................................... 1 1.2 Challenges and motivations ................................................... 4 1.3 Scope and Objectives ............................................................. 11 1.4 Significance of the Research ................................................. 12 1.5 Organization .......................................................................... 13 CHAPTER 2 THEORY AND METHODOLOGY .................................... 15 2.1 Process-Product Design ........................................................ 15 2.2 Property-Based Process and Product Design ....................... 16 2.2.1 Reverse Problem Formulation ................................ 17 2.2.2 Property Clustering Technique ............................... 20 2.3 Prediction of Properties ........................................................ 25 2.3.1 Group Contribution Method .................................... 27 2.4 Characterization Techniques ............................................... 29 2.4.1 Spectroscopy ............................................................ 31 2.5 General Regression Models .................................................. 34 2.6 Latent Variable Models ........................................................ 36 2.6.1 Principal Component Analysis (PCA) ..................... 39 vi 2.6.2 Principal Component Regression (PCR) ................. 46 2.6.3 Partial Least Squares (PLS) ................................... 47 2.6.4 Model Validation ..................................................... 49 2.7 Principal Properties in Cluster Space .................................. 51 2.8 Computer-Aided Design using QSPR and cGCM ................ 52 2.8.1 Solution to Reverse Problem ................................... 53 2.8.2 Solution to Forward Problem .................................. 54 CHAPTER 3 MIXTURE DESIGN ............................................................ 58 3.1 Introduction ........................................................................... 58 3.2 Traditional Approach ............................................................ 59 3.3 Multi-Block Data Structure .................................................. 62 3.4 Multi-Block Regression Models ............................................ 63 3.5 Proof of Concept Example ? Starch Blending ...................... 66 3.5.1 Structure of starch blending data ........................... 66 3.5.2 Data analysis ........................................................... 70 3.5.3 Model development .................................................. 75 3.5.4 Design of desired products in score space .............. 77 3.5.5 Design of desired products in cluster space ........... 79 3.6 Conclusion ............................................................................. 83 CHAPTER 4 DESIGN OF BIODIESEL ADDITIVES ........................... 85 4.1 Introduction ........................................................................... 85 4.2 Structure Property Relationships ........................................ 89 4.3 Technical Difficulties with Biodiesel Use ............................ 90 4.4 Fuel Additives ....................................................................... 93 4.5 Additive Design ..................................................................... 93 4.5.1 Types of Additives ................................................... 94 4.5.2 Additive Property Estimation ................................. 95 4.5.3 Characterization of the Additive Molecules ........... 97 4.5.4 Additive IR Data Analysis ...................................... 98 4.5.5 Latent Variable Model Development .................... 100 4.5.6 Translating Physical properties to Latent Properties ............................................................... 102 4.5.7 Evaluation of Desired Additive Feasibility Region .................................................................... 104 4.5.8 Enumeration of Desired Additive Molecules ........ 106 4.6 Conclusion ........................................................................... 112 vii CHAPTER 5 REVERSE DESIGN OF IONIC LIQUIDS .................... 113 5.1 Introduction ......................................................................... 113 5.2 Density Functional Theory ................................................. 115 5.3 Data Analysis and Model Development ............................. 120 5.4 Reverse Design of Ionic Liquids using QSPR and cGCM .. 123 5.5 Conclusion ........................................................................... 130 CHAPTER 6 FUTURE WORK ............................................................... 132 6.1 Methodology Improvements ............................................... 132 6.1.1 Multi-Dimensional Characterization .................... 132 6.1.2 Multi-Way Modeling .............................................. 134 6.1.3 Stochastic Search and Optimization..................... 136 6.1.4 Managing and Handling Uncertainty .................. 137 6.2 Design of Inherently Benign Chemical Process Routes .... 138 REFERENCES. ............................................................................................. 141 APPENDIX A MIXTURE MODELS ......................................................... 150 A.1 Scheffe Mixture Model ................................................ 150 A.1 Cox Mixture Model ...................................................... 151 APPENDIX B SPECTRAL INTERPRETATION ................................... 153 B.1 Infrared Spectroscopy.................................................. 154 B.2 Molecular Vibrational Spectroscopy ........................... 155 B.3 Near Infrared Spectroscopy ........................................ 160 B.4 Characterizing IR Spectroscopy .................................. 161 viii LIST OF TABLES Table Page Table 3.1: Starch material property data matrix. ....................................... 67 Table 3.2: Blending ratio and process condition data matrix. .................... 68 Table 3.3: Mixture product quality data matrix. ......................................... 69 Table 3.4: PCA score values for X- and Y-blocks. ........................................ 74 Table 3.5: PCA loading values for X- and Y-blocks. .................................... 75 Table 3.6: The regression coefficients for the expressions in Eqs. (3.13) through Eqs. (3.15). ...................................................................... 76 Table 3.7: Score values for desired products in Figure 3.11........................ 78 Table 3.8: Desired product properties for desired products in Figure 3.11. .............................................................................................. 78 Table 3.9: Required mixture conditions to achieve target product properties. ..................................................................................... 79 Table 3.10: Candidate ternary mixtures and fractional contributions of the constituents. ........................................................................... 82 Table 4.1: Chemical structures of common fatty acids. ............................... 87 Table 4.2: Fatty acid profiles of some common biodiesel feedstock ............ 88 Table 4.3: Fuel properties of biodiesel fuels and diesel ............................... 89 Table 4.4: Specifications for biodiesel and diesel ......................................... 92 Table 4.5: Commercially available diesel additives and their estimated properties. ..................................................................................... 97 Table 4.6: Model coefficients using PCR. ................................................... 101 Table 4.7: Biodiesel target properties. ....................................................... 102 Table 4.8: Physical and latent properties describing feasibility region. ... 103 Table 4.9: Biodiesel target latent properties.............................................. 103 Table 4.10: Crude biodiesel properties. ........................................................ 103 ix Table 4.11: Biodiesel additive latent property feasibility region. ............... 105 Table 4.12: Molecular groups and their latent property contributions. ..... 107 Table 4.13: Results from characterization-based molecular design. .......... 109 Table 4.14: Additive solubility in FAME at 25 o C. ....................................... 110 Table 5.1: B3LYP/6-311+G(2d,p) vibrational assignments (cm -1 ) of [emIm]PF 6 . ................................................................................. 118 Table 5.2: PCA score values for X- and Y-blocks. ...................................... 121 Table 5.3: Model coefficients using PCR. ................................................... 122 Table 5.4: Target ionic liquid properties. ................................................... 124 Table 5.5: Anion groups and their latent property contributions. ............ 125 Table 5.6: Cation groups and their latent property contributions. ........... 126 Table 5.7: Alkyl chain attached to cation groups and their latent property contributions. .............................................................. 127 Table 5.8: Thirteen candidate ionic liquid molecules that match target properties in property space. ..................................................... 130 Table B.1: IR spectrum absorption for different bond types. ..................... 160 Table B.2: IR absorbance frequencies and magnitudes of functional groups. ........................................................................................ 161 Table B.3: Abbreviations, names and structures of investigated ionic liquid training set. ..................................................................... 173 Table B.4: Candidate ionic liquid solutions enumerated from exhaustive search in latent property space. ................................................ 179 x LIST OF FIGURES Figure Page Figure 1.1: Stages involved in chemical product design. ................................ 2 Figure 1.2: Chemically formulated products. .................................................. 3 Figure 1.3: Hierarchy of chemical and biological informatics. ........................ 9 Figure 2.1: Conventional solution approach for process and molecular design problems............................................................................ 16 Figure 2.2: Product and process design problem using reverse problem formulation methodology. ............................................................ 18 Figure 2.3: A multi-scale product design framework showing (I) traditional approach and (II) RPF approach linking each of the scales via a common property domain. ................................. 19 Figure 2.4: Representation of intra- and inter-stream conservation of clusters in ternary diagram. ........................................................ 22 Figure 2.5: Representation of feasibility reason with source-sink mapping using clusters in ternary diagram. The clustering points are converted from ternary to Cartesian coordinate. ...... 24 Figure 2.6: Classification of property estimation methods. .......................... 25 Figure 2.7: QSAR model development steps. ................................................. 26 Figure 2.8: An overview of the interconnectivity of characterization techniques, molecular architecture, and physical properties and attributes of chemical and material products...................... 30 Figure 2.9: IR spectra of butylated hydroxytoluene molecule....................... 31 Figure 2.10: Unit variance scaled and mean-centered variables. ................... 38 Figure 2.11: Projection of higher dimensional data onto a lower dimensional subspace. ................................................................. 40 Figure 2.12: PCA decomposition of X matrix. ................................................. 42 Figure 2.13: Scree plot of the correlation matrix. ............................................ 43 xi Figure 2.14: Principal component analysis: Score plot (a, left) of t 1 /t 2 and loading plot (b, right) of p 1 /p 2 . The ellipse represents the Hotelling T 2 with 95% confidence in score plot. .......................... 45 Figure 2.15: PLS regression on descriptive (X) and response (Y) variables. ... 47 Figure 2.16: Forward and reverse problems in computer-aided molecular design. ........................................................................................... 52 Figure 2.17: An overview on the methodology of multivariate characterization, modeling, and design. ..................................... 57 Figure 3.1: Traditional approaches in mixture design. ................................. 61 Figure 3.2: Data structure for three manipulative variable matrixes and a quality/response variable matrix. ............................................. 63 Figure 3.3: Data structure for combined manipulative variable matrixes and a quality/response variable matrix. ..................................... 64 Figure 3.4: Data structure of X and R matrix for two classes of raw materials and their blend ratios. ................................................. 67 Figure 3.5: Distribution plot, outlier box plot and normal quantile plot. ..... 70 Figure 3.6: Scree plot and pareto plot for PCA on X-variables. .................... 71 Figure 3.7: Combined PCA score and loading plots (Biplot) on first and second components for X-block. ................................................... 72 Figure 3.8: Combined PCA score and loading plots (Biplot) on first and second components for Y-block. ................................................... 72 Figure 3.9: PLS model coefficients for blend property matrix Y using four latent factors. ............................................................................... 73 Figure 3.10: Predicted vs. actual product properties using PCR model. ........ 77 Figure 3.11: Visualization of target product properties in score space. ......... 78 Figure 3.12: Visualization of starch blending formulation in cluster space. .. 80 Figure 3.13: Visualization of starch blending formulation in cluster spac .... 81 Figure 4.1: Overall stoichiometric transesterification reaction scheme. ...... 86 Figure 4.2: Stearic acid methyl ester. ............................................................ 87 Figure 4.3: A typical triglyceride molecule with different fatty acid chains of soybean oil. ................................................................... 88 Figure 4.4: Oleic acid methyl ester. ................................................................ 90 xii Figure 4.5: Compositions of fats and oils and their effects on the fuel properties. ..................................................................................... 91 Figure 4.6: Approaches to improving biodiesel fuel properties. .................... 92 Figure 4.7. Target specifications for bio-diesel and its blend in terms of cetane number, melting point, and kinematic viscosity. ............ 94 Figure 4.8: Infrared spectra of diesel additive molecules. ............................. 98 Figure 4.9: Scree plot for PCA on additive IR data. ...................................... 99 Figure 4.10: PCA score plots on first third PCs for additive IR data. ............ 99 Figure 4.11: Predicted vs. actual product properties using PCR model. ...... 101 Figure 4.12: Target feasibility region and crude biodiesel in cluster space. 104 Figure 4.13: Desired additive design feasibility region in cluster space. ..... 105 Figure 4.14: Cluster diagram for biodiesel blending problem. ...................... 108 Figure 4.15: Spatial representation of candidate additive molecules according to B3LYP/6311++G(3df,3dp) calculations: (a) isopropenyl acrylate (CM1), and (b) ethyl acetate (CM3)......... 111 Figure 5.1: Selection of anions, cations, and side chains attached in cations for a task specific ionic liquid application. ................... 114 Figure 5.2: Geometry optimized molecular structure of [emIm]PF 6 at the B3LYP/6-311G(2d,p) computational level. ................................ 116 Figure 5.3: (a) Calculated vs. experimental IR frequencies and (b) infrared spectrum at 2cm -1 resolution of [emIm]PF 6 at the B3-LYP/6-311G(2d,p) computational level................................ 117 Figure 5.4: Predicted vs. actual IL properties using PCR model. ............... 122 Figure 5.5: Scenario of reverse design of ILs. .............................................. 128 Figure 5.6: Enumerated and validated candidate IL molecules. ................ 129 Figure 6.1: Improvements in the process of multivariate characterization, modeling, and design. ................................... 133 Figure 6.2: The decomposition of X-block by (a) PCA and (b) PARAFAC. . 135 Figure 6.3: Structure of an artificial neural network. ................................. 136 Figure 6.4: Aspects of early process design. ................................................. 138 Figure B.6.5: IR regions of the electromagnetic spectrum. .......................... 154 Figure B.6.6: Stretching and bending vibrational modes for H 2 O. .............. 155 xiii Figure B.6.7: Stretching and bending vibrational modes for a CH 2 goup. .. 157 Figure B.6.8: Ball and spring model for atoms and bonds respectively. ..... 158 Figure B.6.9: Vibrational bands in infrared spectrum. ................................ 160 xiv LIST OF ABBREVATIONS ADMET Absorption, Distribution, Metabolism, elimination, and Toxicity ANN Artificial Neural Network AUP Augmented Property Index ASTM America Society for Testing and Materials BIC Bayesian Information Criterion BHT Butylated Hydroxytoluene CAMD Computer Aided Molecular Design cCAMD Characterization-based Computer-Aided Molecular Design cGCM Characterization-based Group Contribution Method CI Connectivity Indices CN Cetane Number CFPP Cold Filter Plugging Point CP Cloud Point DFT Density Functional Theory EB Elongation at Break EEM Emission-Excitation Matrix EG Ethylene Glycol EGMEA Ethylene Glycol Methyl Ether Acrylate DOE Design of Experiments DG Diglyceride DTBP Di-Tert-Butyl Peroxide EHS Environment, Health, and Safety FAME Fatty Acid Methyl Ester xv FBN Free Bond Number GA Genetic Algorithm GC Group Contribution GCM Group Contribution Methods GL Glycerol IBI Inherent Benign-ness Indicator ILs Ionic Liquids IPA Iso Propyl Alcohol IR Infra-Red LCA Life Cycle Assessment LP Linear Programs LVs Latent Variables MSA Mass Separating Agent MC Monte Carlo MD Molecular Dynamics MG Monoglyceride MINLP Mixed Integer Non-Linear Programs MLR Multi-Linear Regression MM Molecular Mechanics MO Methyl Oleate MTBE Methyl Tert-Butyl Ether NLP Non Linear Programs NMR Nuclear Magnetic Resonance NIR Near Infra-Red NIST National Institute of Standard and Technology NPCT Nonlinear Principal Component Regression OA Oleic Acid OLS Ordinary Least Square PARAFAC PARAllel FACtor analysis xvi PCs Principal Components PCA Principal Component Analysis PCR Principal Component Regression PLS Partial Least Square PSE Process Systems Engineering PMMA Poly Methyl MethAcrylate PP Pour Point PRESS Predicted Residual Sum of Squares PSE Process Systems Engineering PY Pyrogallol QM Quantum Mechanics QSAR Quantitative Structure Attribute Relationships QSPR Quantitative Structure Property Relationships RMSECV Root Mean Square Error of Cross-Validation RMSEP Root Mean Square Error of Prediction SA Stearic Acid SVM Support Vector Machines TBHQ T-Butyl Hydro Quinone TG Triglyceride TI Topological Indices TM Tensile Modulus TS Tensile Strength UFF Unified Force Field XRD X-Ray Diffraction 1 CHAPTER 1 INTRODUCTION 1.1 Chemical Product Design Chemical product design is a response to major changes in the chemical industry which have occurred in recent decades. Traditionally, the chemical industry has focused on process design involving the manufacture of bulk commodity chemicals with an objective primarily being efficient production to reduce costs in order to be competitive. Such chemicals are produced in dedicated equipment and at very large scale such as ethylene, ammonia, and sulfuric acid. The market success of such products depends on the cost of making them. However, in recent decades, specialty, higher value-added, smaller volume chemical products such as pharmaceuticals, electronics, and pigments have become increasingly important. The success for such products depends on their discovery and their time to market. In the production of specialty chemicals, the key is improvement in performance rather than minimization of cost. Unlike conventional process systems engineering (PSE) approaches that were focused on the synthesis, design, optimization and control of chemical processes which are based on a priori knowledge on the products, in chemical product design, the identity of the final product is not known. Instead, the basic idea of its behavior is known and the problem is to find the most appropriate chemical(s) that will exhibit and/or cause the desired behavior [1]. Several common desirable behaviors are: biodegradability, lower toxicity, environmental benignity, and less hazard. Since customer needs and customer attributes are the most important sources of product requirements, 2 product design should address these needs by translating them into the final commercial products. Rapidly developing new products required by customers with minimum cost is an increasingly important problem. According to Cussler and Moggridge [2], chemical product design is a procedure consisting of four stages: 1. Needs: Identification of customer needs and the translation of these needs into product specifications. 2. Ideas: Generate ideas that satisfy this need. 3. Selection: Screen and select the best idea for manufacture. 4. Manufacture: Decide what the product should look like and how it should be manufactured. The 2 nd and 3 rd stages together represent Molecular Design and Mixture/Blend Design problems. The 1 st and the 4 th stages may be considered problem formulation and process design stages respectively. Figure 1.1 is a schematic representation of the stages involved in chemical product design and problems that are addressed in each stage [1]. Figure 1.1: Stages involved in chemical product design. Process-Product Design Pre-Design ?define needs and goals? Process Design ?manufacture and test products? Molecular Design Mixture Design Product Design ?generate and select alternatives? 3 The molecular and mixture/blend design problems can be solved independent of the process design problem or as an integrated product- process design problem. In product design, we describe which product to make; in process design, we explore how we will make it. Chemical product design is the larger topic, and includes the process design in the final step. Product design is now considered as an emerging paradigm in the field of chemical engineering [1, 2, 3] because it requires a different set of tools and skill sets from other problems traditionally encountered in the field. Design of environmentally benign solvents and alternative media for extraction and purification are new challenges within product design. Examples of chemically formulated products include performance chemicals, paints, cosmetics, pharmaceuticals, proteins, semi-conductors, foods, fuel mixtures, and many more (Figure 1.2). Figure 1.2: Chemically formulated products. The search for potentially new molecules that possess one or more desired properties using traditional experimental approaches costs time and resources. This awareness has generated momentum to find alternative ways of numerical characterization of chemicals. Ertl [4] estimated that the 4 chemical space could be 10 20 - 10 24 compounds whereas the current size of the CAS Registry is 2 x 10 7 compounds. There are abundantly more chemical compounds waiting to be prepared and characterized. Here, the application of computational tools can facilitate the investigation of chemical product formulations prior to experimentation and simulation of their manufacturing processes, giving the flexibility to handle changing design constraints early in the development process. In silico research and practice can alleviate problems arising from diverse customer demands and shorter life time of specialty chemical products. When appropriate property models are available to describe and predict the target product properties, computer aided methodologies can be utilized to solve the design problems, and are considered computer aided molecular/mixture design (CAMD) problems [1]. Chemical product design that utilizes systematic approaches to integrate the developed chemistry with manufacturing constraints is considered a new chemical engineering paradigm [3, 5]. 1.2 Challenges and motivations Unlike the design of commodity chemicals, which have a known molecular architecture and limited raw material options from which to build an optimum process configuration, the design of specialty chemical products does not have defined molecular architectures or raw material sources. Developing techniques to discover and optimize the molecular architecture that delivers desired attributes is the focus of research in the process systems engineering (PSE) community. In addition, it is important to develop a systematic methodology to produce chemicals that possess both the consumer specified attributes and environmentally acceptable characteristics. CAMD facilitates the application of computer algorithms to solve the mathematical formulations of chemical product design. For computational searching, the desired property must be calculated from a model describing 5 the structure-property relationship. Based on the type of the property model, the resulting CAMD problem can be a linear program (LP) or a nonlinear program (NLP). Various CAMD-based techniques have been proposed and can be classified as (a) mathematical optimization techniques [6, 7, 8], (b) stochastic optimization techniques [9], and (c) exhaustive enumeration (generate and test) [1, 10]. Traditionally, molecular and/or mixture/blend design (chemical product design) procedures suffer from two major challenges: the ability to solve large scale optimization problems, and the ability to predict the physical and chemical properties of a given molecule [11]. Computational methods aimed at finding new molecular structures that possess desired product properties follow the experimental approach of generate-and-test. These approaches involve the minimization or maximization of an objective function with many linear and non-linear constraint equations. They are commonly referred to as a Mixed-Integer Non-Linear Programs (MINLP) [1]. The generic mathematical programming formulation of integrated process-product synthesis/design problems can be presented as [1]: ( ){ }max Objective function T obj F Cy f x= + (1.1) s.t. ( ) 1 0 Process design specificationshx= (1.2) ( ) 2 0 Process model equationshx= (1.3) ( ) 3 , 0 CAMD specificationsh xy= (1.4) ( ) 11 1 Process design constraintsl gx u?? (1.5) ( ) 22 2 , CAMD constraintsl g xy u?? (1.6) 6 33 Logical constraintsl By Cx u?+? (1.7) In the above equations, x represents continuous variables (such as temperature, flowrate and mixture compositions, etc.) and y represents binary integer variables (such as the existence or absence of certain process units, raw materials, or molecular groups). Many variations of the above mathematical formulation exist for solving different chemical product design problems. An in-depth review of the problem formulations has been presented by Gani [1]. The presence of non-linear constraints means that the solution to the MINLP may only be considered a local optimum and not necessarily a global optimum and as such may not be able to generate a complete set of candidate molecules. This leads to sub-optimal design and inefficient iterative solution approach. Although mathematical programming and hybrid methods provide the framework to transform solution techniques into computer-aided methods and tools to determine the optimal design, they are computationally intensive if the problem is not well-defined or is highly nonlinear. The major non- linearity of the problem is introduced by the property descriptors which either need to be supplied (measured or retrieved from database) and/or predicted through appropriate models. In addition, when considering interfacing product and process design, most algorithms face a bottleneck when it comes to using property models suitable for both product design and process design [1]. If multiple models with restricted application ranges are included for the same property, the algorithm may suffer from discontinuities in the solution trajectory, which may make it more complicated to achieve solution convergence [1]. Overall, purely mathematical optimization-based approaches are computationally intensive and require very detailed models for all considered unit operations, thus it is desirable to reduce the search space prior to invoking the optimization solver. 7 Almost all CAMD methods have used group contribution-based property prediction methods to evaluate the generated compound with respect to a specified set of desired properties [12, 13, 14]. Group contribution methods (GCM) are simple, have acceptable accuracy for many properties and are predictive in nature. However, GCM do not exist for all necessary properties, and reliability of predictions is often questionable for large, complex molecules. There are many property parameters or consumer attributes for which group contribution (GC) data is not available, for example, attributes such as paper softness are difficult to define in terms of properties and molecular architecture. In order to characterize the properties of observations, one measures variables. In chemical and engineering practice, it is often assumed that our systems are driven by inherent, latent variables. Changes in intrinsic molecular properties cannot easily be measured directly. However, macroscopic properties, which are manifestations of the intrinsic properties, can be predicted through the use of relevant molecular descriptors. Such descriptors can be related to chemical (reactivity, pH, enthalpy of formation, lipophilicity, etc.), physical (boiling point, viscosity, density, etc.), and structural (electron distribution, van der Waals interactions, hydrogen bonds, etc.) properties of molecules. Discrete change of a substituent in one part of a molecule or polymer does not just affect one isolated property, but several of the above properties. Therefore, the chemical behavior of a molecule is governed by these intrinsic, latent molecular properties [15]. Thus, it is important to capture the intrinsic properties using relevant molecular descriptor variables. Chemical structures can be numerically encoded by molecular descriptors. Thousands of molecular descriptors have been reported in the literature, ranging from simple topological to more complicated topographical properties of molecules. They cover features from constitution to geometry and electronic properties [16, 17]. Relating chemical structures to physical 8 and chemical properties is not a new endeavor. However, to deal with the enormous amount of data generated, it becomes crucial to develop and utilize features selection algorithms, along with data visualization, interpretation, and mining techniques to progress in computer-aided molecular design. Molecular descriptor variables can be of spectral origin (e.g. IR, NIR, NMR, UV, X-ray, etc.); chromatographic origin (e.g. HPLC, GC, TLC, etc.) or they may be measurements from sensors in a process (e.g. temperatures, flows, pressure, etc.). When variables are highly correlated or are highly redundant, they are said to be collinear. As the chemical engineering discipline moves from data poor to data rich, the research in process systems engineering (PSE) shifts the focus from simulation and complex experimentation toward data-driven techniques to acquire the needed translation of product attributes into common physical- chemical properties [18]. With data-driven means gaining popularity as being powerful and inexpensive, computer usage is being increasingly exploited for scientific investigation. Also, the application of chemometrics and chemoinformatics is gaining a great deal of recognition and application in order to address problems in chemistry, chemical engineering, biochemistry, biology, and medicine [16]. According to Wold [19], chemometrics is the science of extracting chemically relevant information by data-driven means and deriving the respective multivariate statistical models and descriptors. Chemometrics is therefore a process. Chemoinformatics is a subfield of chemometrics which involves the transformation of data into information and information into knowledge to facilitate decision making [18, 16]. Combination of chemometrics and chemoinformatics can lead to target- focused two-and three-dimensional structures from molecular composition. The algorithm and informatics methods can be applicable and transferable to a wide range of chemical and biological systems, regardless of whether the starting point is DNA or a chemical element distribution as illustrated in Figure 1.3 [19]. 9 Figure 1.3: Hierarchy of chemical and biological informatics. Chemometrics can be applied to solve both descriptive and predictive problems involved with such data [19]. In descriptive applications, properties of chemical systems are modeled with the intent of learning the underlying relationships and structure of the system. In predictive applications, properties of chemical systems are modeled with intent of predicting new property values or behaviors of interest. Chemoinformatics can be utilized to analyze a large volume of data generated from molecular modeling, chemical information, and computational chemistry techniques [16]. Moreover, when a good theoretical process model is not available or before new experiments are conducted, historical process data is often available in industry that encompasses a wide spectrum of operating conditions and product grades. Such multi-block data involving the properties of the pure materials, their blend ratios, and the process conditions, could provide an opportunity to enhance the performance of the final product in blending operations for coatings, food, cosmetics, pharmaceuticals and fuels [20]. The data-driven modeling approach is apt for processes and phenomena where cause-and-effect cannot be simply approached from first-principles. For example, a real-life industrial process system often involves complex, nonlinear, incomplete, and uncertain data. Chemistry Molecular composition Connectivity (graph) Molecular similarity Chemotype Structure Interaction Specific activity Drug Biology DNA sequence Protein sequence Sequence similarity Family Structure Interaction Function intervention I n f o r m a t i c s 10 In chemical product (molecular or mixture) design, the identity of the final product is unknown, however, the general behavior or characteristics of the product (goal) is known. Since the properties of the component or mixture of components dictate whether or not the design is useful, the basis for solution approaches in this area should be based on the properties themselves. The recognition of property-based design came about as a direct result of the following observations: ? Many processes are driven by properties not components. ? Performance objectives are often described in terms of measurable physical-chemical properties. ? Often objectives cannot be described by composition alone. ? Molecular/mixture design is based on properties. ? Insights are often hidden by not integrating properties directly. The discussion presented here provided the motivation that guided the research, and as a result the methods and tools developed must accomplish the following: ? Integrate process and product design problems via a methodology within the property integration paradigm. ? The approach needs to be systematic and capable of setting up the design performance requirements or ?targets? a priori, i.e. a targeting approach. ? Utilize a data-driven approach combined with multivariate statistical techniques to solve both descriptive and predictive problems. ? Apply multivariate characterization, modeling, and design. ? Incorporate the concepts of reverse problem formulation and property clusters to aid in the decomposition of the design problem. ? The technique should take advantage of the benefits of visualization tools in the formulation of the problem and as part of its solution algorithms. 11 1.3 Scope and Objectives General background information on traditional frameworks for solving computer-aided molecular/mixture design (CAMD) problems is presented in this work. Techniques such as group contribution, reverse problem formulation, property clustering, and multivariate statistics are presented within CAMD to formulate and solve chemical product design problems. These methods form the basis for the property-based design techniques presented in this dissertation. Primarily, the effort of this dissertation is on the formulation of chemical products through the application of multivariate characterization, modeling, and design. Property-based molecular and mixture design algorithm within the general property clustering framework utilizing chemometric techniques and group contribution methods in a reverse problem formulation are presented. Data-based modeling and optimization strategies will be presented in this dissertation covering two specific areas of chemical product design: 1. Mixture/Blend design: For mixture products, a systematic and insightful use of past data describing the properties of the raw materials, their blend ratios, and the process conditions during the production of a range of product grades is investigated to explore their effects on the final product properties and to achieve new and improved products. 2. Molecular design: For molecular products, chemically feasible, structured molecules are exhaustively enumerated from a set of appropriate descriptors (represented by fragments or building blocks) to identify compounds exhibiting certain desirable or specified behavior. The present work is limited to systems that can be characterized by three properties, however, algebraic and optimization-based approaches are available to extend the application range to include more properties [21]. 12 1.4 Significance of the Research A significant result of the developed methodology is that for problems that can be adequately described by just three properties, the process and molecular design problems are solved visually and simultaneously on a ternary diagram, irrespective of how many substructural molecular fragments are included in the search space. The primary benefits gained through the developed methodology are as follows: ? utilization of interdisciplinary methods and tools that extend through multivariate statistics, applied mathematics, computational expertise, and experimentation across all scales, ? formulation and solution of chemical design problem on property basis, ? bridging domains via property models facilitate nested and nonlinear routines to be reduced to a single sub-problem in cluster space. ? identification of design targets without performing detailed calculations, ? visualization and solution of problem in lower dimensional space, which provide valuable information by elucidating hidden relationships in data and multivariate understanding of complex processes and phenomena, ? multivariate projection-based regression retains better memory of the structure of the training set data to predict molecules with similar properties and structures, ? reverse design approach using the descriptors of group contribution type combined with chemoinformatics enables exhaustive search of the structures corresponding the target physical properties, and ? explicitly tracking and integrating properties in a systematic manner relieve, the iterative nature, multiple component combinatorial explosion, and difficulty of formulating and solving the mixed integer non-linear programs (MINLP) of conventional design techniques. The aim is to allow us to see the big picture first, and the details later. 13 1.5 Organization Chapter 2 introduces the general theoretical background on the methods and tools that the research in this dissertation is based on. Section 2.2 introduces the property-based approach that facilitates the flow of information from the process level to the molecular level, and vice versa. Section 2.3 highlights property estimation methods based on combinatorial techniques such as the group contribution method. Different characterization (e.g. IR spectroscopy) based group contribution methods are presented in Section 2.4. The characterization-based groups serve as the descriptive application of chemometric techniques that describe the molecular architecture of a chemical product. Section 2.5 discusses the limitation of using general regression models that is often encountered when the number of descriptor variables is larger than the number of samples and when the variables exhibit linear relationships with each other. The predictive application of chemometric techniques is presented using statistical multivariate techniques like PCA and PLS to model the properties of chemical systems in Section 2.6. Chapter 3 presents the developed methodology for solution of mixture design problems utilizing the tools presented in Chapter 2. The multi-block data structures (L- and T-shaped) available in blending operations are presented in Section 3.3. Combining multiple blocks using appropriate mixing rules and matrix algebra to help simplify the analysis and design by not having to differentiate between mixture and process variables and not having to assume independence of the factors when multivariate analysis techniques are used is discussed in Section 3.4. Section 3.5 presents a case study to illustrate the method and concept using the development of thermo- plastic as a case study. The example incorporates all three degrees of freedom available in blending operation. Chapter 4 presents the developed molecular design framework based on a characterization-based group contribution method. The advantages of 14 biodiesel, a short overview on its production, chemical profiles of different feedstocks and their effects on product quality are present in Section 4.1. Technical difficulties with biodiesel that have impaired its use and commercialization are presented in Section 4.3. Additive design is presented as an alternative approach to solve the technical difficulties with biodiesel in Section 4.4. Section 4.5 discusses the methodology for additive design that combines the framework and the tools presented in Chapter 2. In this Section, the types of training set additive molecules, their property estimation methods, IR-based characterization of their molecular architecture, multivariate statistical analysis of IR data for descriptive application, latent variable property model development for predictive application, evaluation of target properties for additive design, and design of additive molecules that meet the target property specifications are discussed. Chapter 5 introduces the use of density functional theory (DFT) based simulation techniques to generate the IR spectra as molecular descriptors with which to develop predictive property models. This step relieves the dependency on measured or database values for the required IR data used in the methodology presented in Chapter 4. The interdisciplinary and novel framework proposed will be demonstrated using a case study focused on the reverse design of ionic liquids with tailored properties. The methods in this Chapter will integrate the framework and tools presented in Chapter 2 and Chapter 4 with an additional ability to generate spectroscopic data from quantum chemical calculations. Chapter 6 introduces future works that can be extended from the methodologies and tools developed in this dissertation. Section 6.1 introduces potential improvement or extension areas in the characterization, modeling, and design methodologies presented in this dissertation. Section 6.2 incorporates the design of inherently benign chemical process routes using process route descriptors related to environment, health, and safety factors. 15 CHAPTER 2 THEORY AND METHODOLOGY In this chapter, theoretical background on current methods used in chemical product design along with tools that will be employed in this dissertation research will be defined and discussed. 2.1 Process-Product Design The molecular and mixture/blend design problems can be solved independent of the process design problem or as an integrated product-process design problem as shown in Figure 2.1. In general, the objective in the design or optimization of process is to find a balance between satisfying process unit requirements/constraints and the use of appropriate raw materials in order to maximize profit or to minimize cost. Traditionally, process design and molecular/mixture design have been treated as two separate problems, with little or no feedback between the two approaches as represented by Figure 2.1 [22]. The raw materials could be searched from a material database or can be obtained from molecular design. Since process and molecular/product design are decoupled, molecular design can be performed based on qualitative process knowledge and/or experience. This can lead to sub-optimal designs and an inefficient iterative solution approach. However, the decoupling of the process and the molecular design problems can be addressed by systematically solving a series of reverse problems [22, 24]. This is accomplished by a property-based approach which 16 facilitates the flow of information from the process level to the molecular level, and vice versa. Figure 2.1: Conventional solution approach for process and molecular design problems. 2.2 Property-Based Process and Product Design In molecular or mixture design, the identity of the final product is unknown, however, the general behavior or characteristics of the product (goal) are known. The objective is to find the most appropriate chemical, or mixture of chemicals, that will satisfy these goals. Since the properties of the component or mixture of components dictate whether or not the design is useful, the basis for solution approaches in this area should be based on the properties themselves. Similarly, in process design, since properties (or functionalities) form the basis of performance of many process units (e.g., vapor pressure in condensers; specific gravity in decantation; relative volatility in distillation; Henry?s coefficient in absorption; density and head in pumps; density, pressure ratio, and heat capacity ratio in compressor; etc.), it would be very insightful to develop procedures based on key properties instead of key components [22]. Discrete Decisions (e.g. structural modifications) Continuous Decisions (e.g. Operating conditions) Process Design Discrete Decisions (e.g. type of compound, number of functional groups) Continuous Decisions (e.g. Operating conditions) Molecular Design Given set of components to be screened (e.g. raw materials, MSA?s) Given set of molecular groups to be screened (building blocks) Optimize process objectives to meet desired performance (e.g. recovery, yield, cost) Optimize molecular structures to meet given set of property values (e.g. physical, chemical) 17 However, unlike component-based chemo-centric approaches where chemical components are conserved and mixing of components is linear; in a property-based approach, properties are not conserved entities and mixing of properties is not necessarily linear. To overcome these limitations, Shelley and El-Halwagi et al. [23] introduced the property clustering framework that uses conserved quantities called clusters. It was later applied to process and product design by Eden et al. [22]. To bridge the gap between the process design and the molecular design problems, Eljack et al. [24] extended the property integration framework by combining the property clustering technique and Group Contribution Methods (GCM). Qin et al. [21] introduced an algebraic approach using property clusters to relieve the limitation in the use of more than three properties. These contributions enabled simultaneous consideration of process performance requirements and molecular property constraints in the cluster domain and solution of the design problem in this reduced domain. The following sections discuss the property integration framework based on reverse problem formulation and property clustering techniques. 2.2.1 Reverse Problem Formulation The reverse problem formulation technique decomposes a process-product design problem into two reverse problems linked by property targets/constraints [22, 24]. This enables a two-step approach, where the property targets that satisfy the process performance/constraints are identified in the first step and then the molecules that match the targets are identified in the second step. This gives the ability to identify optimum solutions to process and product design problems more easily than solving the conventional forward problems, which are iterative in nature [22, 24, 25, 27]. Figure 2.2 is a schematic representation of the reverse problem formulation concept. The first reverse problem is the reverse of a simulation problem, where the process model is solved in terms of the constitutive/design 18 variables instead of the process variables, thus providing the design targets. The second reverse problem solves the constitutive equations (property models) to identity candidate components by matching the design targets. Figure 2.2: Product and process design problem using reverse problem formulation methodology. Prediction of macroscopic properties from molecular information is nontrivial and is by no means extensive. In most methods, the computed information at smaller length scales is passed to models at a larger length scale. This is achieved by coarse-graining (removing degrees of freedom) and information passing between low- and high-scales [26]. Pathway I in Figure 1 represents the dependency of overall higher scale performance on the lower scale phenomena. This path (approach) necessitates coupling of mathematically different models and phenomena across two or more scales. Discrete Decisions (e.g. structural modifications) Continuous Decisions (e.g. Operating conditions) Process Design Desired process performance Given set of molecular groups to be screened (building blocks) Integrated design capable of achieving desired performance Reverse Property Prediction Constraints on property values obtained by targeting optimum process performance Designed Molecules (e.g. raw materials, MSA?s) Discrete Decisions (e.g. type of compound, number of functional groups) Continuous Decisions (e.g. Operating conditions) Molecular Design Design Property Targets Reverse Process Simulation 19 This results in a greater coupling in design of products at multiple scales, thereby increasing problem complexity, and combinatorial explosion in the number of models and parameters [12]. Figure 2.3: A multi-scale product design framework showing (I) traditional approach and (II) RPF approach linking each of the scales via a common property domain. The use of reverse problem formulation (RPF) helps circumvent the challenge posed by coupling of scales by bridging them through a property domain (pathway II in Figure 2.3) [27]. RPF uses the duality of linear programming to reformulate the design problem as a series of reverse problems solved in the property domain. This way, an immense computational cost associated with the hierarchical nesting across multiple- scales is relieved leading to a much more efficient solution achieved through reduction in the need for enumeration [27]. QM, DFT Models Ab Initio Methods (Quantum Scale) Fundamental Principles of Chemical Engineering Process Design (Macro Scale) Chemometric Models Consumer Attributes (Macro/Meta Scale) MM, MD, MC Simulations Atomistic Methods (Atomic Scale) Coarse-Graining and Chemometric Models Microstructure (Meso Scale) II II II II II I I I I/II GCM, Topological Indices & Chemometric Models CAMD (Molecular Scale) I/II I/II PROPERTIES II I 20 2.2.2 Property Clustering Technique Property clustering is a property-based visualization tool for mapping the design problem from the non-conserved property domain into the conserved (component-less) cluster domain [23]. The property clustering technique utilizes property operators, which are functional relationships describing the attributes and physical-chemical properties. The clusters map property relationships into a lower dimensional domain, thus allowing for visualization and insights into the problem. Details can be found in Eden et al. [22]. Only highlights will be presented in this dissertation. 2.2.2.1 Property Operator Functions The clustering approach utilizes property operators, ?, which are functions of the original raw physical properties. In Eq. (2.1), the property is described by a general linear mixing rule: ( ) ( ) 1 ?? = = ? ? n jj i jj Mix i i y xy (2.1) where, ? ? = the fractional contribution of component i ? ? = the j th property Although the property operator equation must have linear mixing rules, the property operator itself may be nonlinear. For example, if the property operator describes density, then to meet the linear criteria imposed by Eq. (2.1) we would use specific volume as the property operator, ignoring any interaction effects from mixing, as shown in Eq. (2.2) and (2.3). The operator expressions will invariably be different for molecular fragments and process streams, however, as they both represent the same property, they can be visualized on a common cluster domain in a similar fashion. 21 ( ) 1 jj i i y? ? = (2.2) 1 11 n i i Mix i x ?? = = ? ? (2.3) Since the properties may have various functional forms and units, the operators are normalized into a dimensionless form by dividing by an arbitrary reference operator and then summarized to yield an Augmented Property Index (AUP) as: ( ) ( ) 1 , p jj i ji i ji j jj ref y AUP y ? ? = ?= = ? ? (2.4) A cluster is then defined by dividing the non-dimensionalized property by the AUP, as: 1 where 1 , k ji ji ji i i CC AUP = ? = = ? (2.5) Since the clusters are tailored to maintain the two fundamental rules for intra- and inter- stream conservation, additive rules, e.g. lever-arm rules, are needed to ensure that the mixture property cluster of two streams with different individual property clusters can be easily determined. The mixture cluster, Eq. (2.6), and mixture AUP, Eq. (2.7), values can be calculated through the linear mixing rules as follows: , 1 where n ii j Mix i ji i i Mix x AUP CC AUP ?? = ? ?= ? , (2.6) 1 n Mix i i i AUP x AUP = = ? ? (2.7) 22 The relative cluster arm, ? ? , which represents the fractional contributions of the streams, can be calculated using Cartesian coordinates as well as estimated visually from the relative lengths of the lever arms as shown in Figure 2.5. Visually, intra-stream conservation means that once two clusters are known, the third one is automatically determined. In order to use common tools like Microsoft Excel? that do not support ternary plot representation, the ternary coordinate is converted to Cartesian coordinates. Figure 2.4: Representation of intra- and inter-stream conservation of clusters in ternary diagram. 2.2.2.2 Ternary to Cartesian Coordinates The coordinate transformations from ternary to Cartesian for an equilateral triangle can be accomplished using the Pythagorian theorem in terms of ternary coordinates defined as cluster values and dimensionless property operators are as follows: 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 C 3 C 2 C 1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Target S 1 S 2 C 1s C 2s C 3s ? 1 ? 2 23 12 ,1 2 123 0.5 0.5 ss CC s s s sss XC C ? + ?? =+= ? +? +? (2.8) 2 ,2 123 s CC s s sss YC ? = = ? +? +? (2.9) Furthermore, the relative arms can be calculated accordingly as: 22 ,2 , ,2 , 1 ,2 ,1 ,2 ,1 ( )( ) ( )( ) CC CC mix CC CC mix CC CC CC CC X X YY X X YY ? ? +? = ? +? (2.10) 22 ,1 , ,1 , 2 ,2 ,1 ,2 ,1 ( )( ) ( )( ) CC CC mix CC CC mix CC CC CC CC X X YY X X YY ? ? +? = ? +? (2.11) In cluster space, points are used to represent discrete property values while feasibility regions are used to represent a range of accepted property values (Figure 2.5). 2.2.2.3 Feasibility Region Boundaries For problems that can be adequately described by just three properties, the desired process performance range or product property requirement range can be visualized within the boundary of the true feasibility region defined by six unique points, Eq. (2.12), characterized by values of dimensionless operators. The feasibility region narrows down the search space and guarantees that no feasible points will exist outside it. The feasibility region boundary analysis helps identify and describe the expression for the feasibility region a priori and requires no enumeration of an infinite number of feasible points possible within its boundaries [22, 23]. In Figure 2.5, the feasibility region is shown in dashed lines, to form a hexagon in a ternary cluster diagram. 24 ( ) ( ) ( ) ( ) ( ) ( ) min min max min max max min max min 123 123 123 max max min max min max max min min 123 123 123 ,, ,, ,, ,, ,, ,, ??? ??? ??? ??? ??? ??? (2.12) Figure 2.5: Representation of feasibility reason with source-sink mapping using clusters in ternary diagram. The clustering points are converted from ternary to Cartesian coordinate. Visualization of the problem allows for easy identification of optimum strategies for the combination of molecular groups, while the unique feature of linear mixing rules allow for the use of simple lever arm analysis to solve the problem in reduced cluster space. This way, the design problem can be solved by identifying the product properties corresponding to the desired process performance. The property clustering methods for solving mixture and molecular design problems form the basis for the methods developed in this dissertation. Design of target chemical products that have tailored structures and that exhibit an array of unique functional properties requires the use of 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 C 3 C 2 C 1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Feasibility Region S 1 S 2 Y X Y CC X CC 25 physical-chemical properties and/or biological activities. Experimental measurements of properties of interest for products such as drugs, ionic liquids, additives, etc. are scarce and limited. With respect to the synthesis, characterization, and applications of such chemical products, it is essential to estimate properties through theoretical or empirical means. 2.3 Prediction of Properties As process-product design may involve both process and molecular design, separate property models may be required. Property models can range from easy to use data-dependent regressed models whose application range is small, to computationally complex models like ab initio (quantum-chemical) calculation, which can be applied to any chemical system. In the middle are the most common methods that balance full empiricism with basic chemical theory to adapt the rigorous equations that describe molecular architecture in a form suitable for design. A comprehensive overview of a wide range of property estimation methods can be found in Poling et al. [28]. Constantinou and Gani [12] classified property estimation methods as shown in Figure 2.6. Figure 2.6: Classification of property estimation methods. Reference Semi-Empirical Models ? Corresponding State Theory ? Topology/Geometry ? Group/Atom/Bond additivity y = a + b . log(x) + ? Empirical Models ? Chemometrics ? Pattern matching ? Factor analysis ? QSAR/QSPR y = a + b . x + ? Fundamental Models ? Quantum Mechanics ? Molecular Mechanics ? Molecular Simulation H? = E? Approximate Property Estimation Models 26 Regardless of the type of property model used, it is essential that they are easy-to-use, accurate, reliable, and computationally efficient throughout the entire design domain. Group contribution (GC) and topological indices (TI) are two such techniques that utilize quantitative structure property/activity relationships (QSPR/QSAR) that have been developed using empirical relationships between molecular architecture and physical- chemical properties found in large databases [11, 29, 30]. A QSPR model finds the relationship between structural features (constitutional, topological, geometrical, etc.) and physical, mechanical, or chemical properties of materials. In drug design, the QSAR model finds the underlying relationship between molecular descriptors with pharmacokinetic/pharmacodynamics, and ADMET (Absorption, Distribution, Metabolism, Elimination and Toxicity) properties [16, 31, 32]. Since QSPR models provide information on features affecting the compounds? physicochemical properties, they can be used for screening and further optimization. Thus, development of robust QSPR model can facilitate conserving resources and accelerating the process of development of new and enhanced products. Figure 2.7: QSAR model development steps. Training Set (Chemical, Biological data) Generate Descriptors (Spectroscopy, DRAGON) Modeling Methods (MLR, PCR, PLS, ANN) Model Validation (Internal or External) Property Prediction (Screening new compounds) Knowledge and Chemical Intuition Synthesis and Testing Forward QSPR Reverse QSPR 27 Figure 2.7 presents QSPR model development steps. Several machine learning methods such as artificial neural network (ANN), Bayesian statistics, support vector machines (SVM), genetic algorithm (GA), partial least-squares analysis (PLS), and multiple linear regression analysis (MLR) have been used in literature to perform QSPR modeling [31]. However, in order for a developed model to be relevant to the target product properties, the training set must contain optimum molecules covering balanced variation (high diversity) of the features spanning the chemical/property space believed to be important for interaction with the physical, chemical, or biological target [30]. The diversity of the chemical features can be achieved by using statistical experimental design such as factorial design and D-optimal design. The coverage can be achieved by employing quadratic or cubic models instead of linear model design [30, 33]. However, the requirement of high diversity and full coverage often leads to redundancy. Since a set of structural descriptors is never complete, some redundancy could be allowed at this level. The use of projection methods (discussed later in Section 2.6) will circumvent the problem of redundancy in the descriptor data. 2.3.1 Group Contribution Method Most property estimation methods are based on the group contribution method (GCM), where appropriate descriptors or predefined fragments (group, bond, or atom, etc.) representing a molecule are identified and the properties of the molecule are estimated by summing all the contributions from each fragment that make up the molecule [12, 14, 22, 24, 25, 34]. The GCM is a powerful product/molecular design tool, which allows prediction of the physical properties of molecules from structural information alone. An additive, three level, group contribution property estimation model, which estimates the property of a compound as a linear combination of the appropriate descriptor contributions, is as follows: 28 ? st nd rd 1 Order 2 Order 3 Order Property Function Group Contribution Terms () ii j j kk ij k f x NC M D OE=++ ?? ? ????? ????? ????? ????????????? (2.13) where, C i = the contribution from first-order group i N i = the number of occurrence of first-order group i D j = the contribution from second-order group j M j = the number of occurrence of second-order group j E k = the contribution from third-order group k O k = the number of occurrence of third-order group k First order groups contain basic information and can be combined linearly since they assume no interaction between groups. Second order groups can be estimated from first order groups and correct for the interactions between first order groups. Third order groups can be derived in a similar manner and help to correct for poly-functional compounds with more than four carbon atoms in the main chain. However, any application of group contribution relies on the availability of atom type, molecular group type, or type of chemical bonding present to describe the structure as well as tables giving the property contributions of each group. There are many properties which cannot be estimated by GCM. For instance, cetane number is an important performance indicator for biodiesel, but GCM parameters are not available to describe this property. Furthermore, not all possible atomic arrangements and structures can be represented in GCM. Hence, there is a need for an efficient methodology for the design of structured molecules. One such approach to structured product design is combining decomposition techniques with multivariate methods [35, 36, 37]. 29 This approach first utilizes multivariate characterization techniques such as infrared (IR) and near infrared (NIR) spectroscopy to describe a set of representative samples, and then uses decomposition techniques such as principal component analysis (PCA) and partial least squares onto latent surfaces (PLS) to find the underlying latent variable models that describe the molecule's properties. The factors are called ?latent? because they cannot be observed directly, but can be characterized indirectly. The orthogonal nature of these models allows for group-based interpretations and property predictions which can be utilized to design new molecules not found in the original set of molecules. The structure and identity of candidate molecules can then be identified by combining or "mixing" substructural molecular fragments until the resulting properties match the targets [24]. 2.4 Characterization Techniques Characterization is a class of tools associated with the determination of not only chemical constituents or molecular structure, but also of larger structural characteristics describing the orientation and alignment of these molecules by exploiting the fact that molecules absorb specific frequencies that are characteristic of their structure. Some common characterization techniques include nuclear magnetic resonance (NMR), x-ray diffraction (XRD), and infrared spectroscopy (IR). Characterization techniques are often applied to a training set of molecules used to explore a set of property attributes. Solvason [27] proposed a general guide for managing the complexity of the information through a systematic method for determining which specific information on molecular architecture will be necessary to build appropriate models for a specific application as shown in Figure 2.8. 30 Figure 2.8: An overview of the interconnectivity of characterization techniques, molecular architecture, and physical properties and attributes of chemical and material products. Spectroscopy will be the primary characterization technique explored in this dissertation. Spectroscopy such as infrared (IR) and near-infrared (NIR) provide specific information on the presence of functional groups, information on the orbital configurations of the electrons, and details of the carbon- hydrogen structure of the chemical products. The added structural information available from this characterization can also be used to distinguish some orientation specific information in various isomer geometries. More on spectroscopy can be found elsewhere [38]. Appendix B gives an overview on IR spectroscopy. IR / NIR / Raman Spectroscopy UV-Vis Spectroscopy NMR Spectroscopy Mass Spectroscopy GC / HPLC / SEC Chromatography Scattering & Diffraction Thermomechanical Analysis Rheological Analysis AFM / SEM / TEM Microscopy Electronic Structure (1? ? 2nm) Atomic Structure (1? ? 1nm) Group Structure (5? ? 5nm) Chain Structure (10nm ? 2?m) Intermolecular Structure (2? ? 10nm) Nanostructure (1nm ? 100nm) Microstructure (1?m ? 100?m) Mesostructure (10?m ? 1mm) Electrical Thermodynamic Toxicological Environmental Kinetic Transport viscoelastic Mechanical Characterization Type Molecular Architecture Attributes & Properties 31 Figure 2.9: IR spectra of butylated hydroxytoluene molecule. Figure 2.9 is an IR spectrum of a butylated hydroxytoluene molecule [39] and its molecular structure showing 1 st and 2 nd order GC groups. IR spectra of such molecules contain large quantities of descriptor data involving information on molecular architecture. In addition, there is a high probability that many of the descriptor variables will be correlated, i.e., that some variables will be linear functions of other variables [40]. Managing such complexity of information to design chemical products and to build appropriate models for a specific application will require a systematic method for capturing important features of the molecular architecture. Therefore multivariate statistical techniques can be used to decompose large quantities of information about the system in the initial training set. More detailed discussion on multivariate techniques will be presented in Section 2.6. 2.4.1 Spectroscopy Many substances in solution follow Beer-Lambert?s law, showing a linear relationship between concentration and absorbance. Beer?s law relates the absorption of light to the properties of the material through which the light is Wavenumber [cm -1 ] A bsor ba nc e [ % ] 32 traveling. The absorbance of a single wavelength due to the presence of a single component can be written as [41]: A KC= ? (2.14) where, A = the absorbance at that wavelength K = the absorbance coefficient for that wavelength and component C = the concentration of the component This concept can be applied to any system where we can measure a quantity, A, that is proportional to some property, C, of the sample. The generalization of the above equation for multiple components, multiple wavelengths and multiple samples gives: 1 n ws wc cs c A KC = = ? ? (2.15) where, Aw = the absorbance at the w th wavelength in the s th sample Kwc = the absorbance coefficient for the w th wavelength and for the c th component C c = the concentration of the c th component in the s th sample n = the total number of components In expanded form Eq. (2.15) takes the form: 1 11 1 12 2 1 2 21 1 22 2 2 3 21 1 32 2 3 21 1 2 2 ... ... ... ... s c cs s c cs s c cs ws w wc cs A KC KC KC A KC KC KC A KC KC KC A KC K C K C = ?+ ?+ + ? = ?+ ?+ + ? = ?+ ?+ + ? = ?+ ?+ + ? ? ? ??? (2.16) It is clear from Eq. (2.16) that the absorbance at a given wavelength, w, is simply the sum of the absorbances at that wavelength from each of the components present. Since matrix notation can easily generalize the above Eq. (2.16), a single matrix equation is: 33 A KC= ? (2.17) Note: The notations used here are limited to this section only. New notations will be introduced in following sections that will be used throughout the remainder of the dissertation. Boldfaced unsubscripted letters are used to refer to vectors and matrices. In spectroscopy, it is clear that the spectrum of a sample is the sum of the spectra of the constituents multiplied by their concentrations in the sample. If the concentrations are t and the spectra p, we get the latent variable model 11 2 2 ... TT T T cc X t p t p t p T P noise=?+?+?=?+ (Beer-Lambert?s law). In many applications this interpretation with the data explained by a number of ?factors? (components) makes sense [41]. Chemometric modeling methodologies can be investigated and employed as a means to derive mathematical relationships between spectroscopic measurements and measured product properties. Kramer [41] suggests that the power and limitations of chemometric techniques that should be realized while solving data-driven problems are as follows: We can use these chemometric techniques to: ? remove as much noise as possible from the data. ? extract as much information as possible from the data. ? use the information to learn how to make accurate predictions about unknown samples. In order for this to work, two essential conditions must be met: ? the data must have information content. ? the information in the data must have some relationship with the property(s) which we are trying to predict. Because consumer attributes are difficult to quantify physically, the relationship between them and the underlying fundamental physical- 34 chemical properties and/or the molecular architecture will most likely involve empirical relationships. Empirical models describe the underlying phenomena?s relationship to a set of experimental data using regression analysis. 2.5 General Regression Models Traditionally, the modeling of the predicted response, Y, by means of a descriptor variable, X, is done using multi-linear regression (MLR), which works well as long as the X-variables are fairly few and fairly uncorrelated, i.e., X has full rank (rank is a number expressing the true underlying dimensionality of a matrix). Often, the relationship between X and Y variables can be approximated using a linear model and can be represented mathematically as: YX?E ML MK KL ML? ?? ? = ?+ (2.18) where, M = the number rows of sample readings or observations L = the number columns of measured response properties K = the number columns of descriptor variables (like components) ? = the regression coefficients or sensitivities matrix E = the error or residual matrix Three cases can be distinguished as described by Geladi and Kowalski [42] in Eq. (2.18) to determine ?: 1. ? > ?: There is no unique solution for ? as infinite numbers of solutions exist, unless one deletes predictor variables. 2. ? = ?: There is one unique solution provided that X has full rank. 0EYX?= ??= 3. ? < ?: There is no exact solution for ?, however, a solution can be achieved by minimizing the residual in the following equation: EYX?=?? 35 The ordinary least-square (OLS) method is the most popular method to find the regression coefficients by maximizing the model sum of squares and minimizing the residual sum of squares. Using least-square, ? can be estimated by: ( ) 1 ? ? XX XY TT ? = ? ?? (2.19) where, the superscript T symbolizes the transpose of a matrix. When the number of X-variables is large compared to the number of observations, for example in spectroscopy (?. ?.? > ?), it can lead to a singular (? ? ?) matrix whose inverse does not exist. This is because the number of unknown variables is greater than the number of equations, leading to an underdetermined equation system which has an infinite number of solutions for ?. This is the most frequent problem in MLR. One can exclude variables ? > ? that are not significant; however, it is not a guaranteed solution [42]. In addition, where a situation like mixture design is considered, the X- variables could be component proportions, ? ? , in the mixture that are not mutually independent (?.?.? ? ? = 1 ? ?=1 , where ? ? ? 0 ? i = 1,2, ? ,?). This situation is referred to as X being rank deficient, collinearity, zero determinant, singularity, and ill-conditioned. In such a situation, the inversion of (? ? ?) matrix may not exist leading to the estimation of ? using OLS with large variances. Here, a statistical multivariate method provides the right tool to extract systematic variables and remove collinearity in the data set as described in Section 2.6. In molecular products, an important objective is to find a chemical product that exhibits certain desirable or specified behavior. Assuming the model in Eq. (2.18), and the parameter estimates ? ?, a new x-variable can be predicted from a desired y-variable such that 36 ( ) ( ) 11 ? ,? TT L K KL des new yx ? ?? = ? (2.20) The model inversion of Eq. (2.20) gives ( ) ( ) ( ) 1 11 ?? ? ?? ? T TT T K L LK KL LK new des xy ? ? ? ?? ? = ?? (2.21) where, ( ) 1 ?? ? ?? ? TT ? ? is the generalized inverse or pseudo-inverse of ? ? Eq. (2.20) does not contain any information about the covariance structure within the manipulated X variables. Consequently, the solution given by Eq. (2.21) will not respect those previous structural relationships when solving for the new conditions ? ??? ? . Therefore, the standard regression model (Eq. (2.20)) and its inversion (Eq. (2.21)) possess serious limitations in solving similar problems. Multivariate projection methods like PCA and PLS become indispensable tools in dealing with such difficulties and can be used for exploration, calibration, and classification of multivariate data. 2.6 Latent Variable Models The most effective tools in multivariate data analysis are Principal Component Analysis (PCA) and Partial Least Square (PLS). PCA and PLS are decomposition techniques that compress a large quantity of data and extract the information by projecting them into a low-dimensional subspace that summarizes all the important information for analysis [43, 44, 45]. Then, further design work can be conducted in the reduced subspace. If the data used is high dimensional and noisy, and the number of samples are small while developing a calibration model, there is the danger of over-fitting. In such cases, PCA (which can capture a dominant part of data variance) is more appropriate to reduce the data dimensionality and then train a regression model with the reduced latent variables. Principal component analysis is presented in this section as a variable reduction tool including important aspects such as pretreatment of data, validation, and 37 outlier detection that must be considered. Prior to PCA, data often needs to be pre-treated, in order to transform the data into a form suitable for analysis. Data Pre-processing: Pre-processing of data can make the difference between a useful model and no model at all. Pre-treating data is often employed to transform data into a form suitable for analysis. It is general practice to mean-center and scale the property variables prior to analysis [42, 46]. Variables often have substantially different numerical ranges. A variable with a large range has a large variance, whereas a variable with a small range has a small variance. Since PCA is a maximum variance projection method, it follows that a variable with a large variance is more likely to be expressed in the modeling than a low-variance variable [15]. In particular, the property descriptor data matrix ? ??? , consisting of M observations described by K descriptors, is mean-centered and unit variance scaled (UV) (also known as auto-scaled) across the M properties. ? Mean centering: The mean for each variable (each column) is calculated based on the entire sample and subtracted from each measurement (elements of matrix ? ??? ). , 1 1 where, M ij ij j j i j j x xx x x M = =?= ? (2.22) ? Scaling: Variance scaling of the data to unity across the K descriptor properties can be accomplished by standardizing variables. For each measurement, this is done by dividing mean centered data by the standard deviation (? ? ). ( ) 2 2 , 1 1 where, 1 M ij j ij j i j j j j xx x s xx sM = ? = = ? ? ? (2.23) 38 Auto-scaling puts all variables on equal footing; i.e., all variables have the same chance of entering the model and of taking part in the model. PCA performed on auto-scaled data is referred to as a correlation PCA. Figure 2.10: Unit variance scaled and mean-centered variables. Figure 2.10 is a simple geometrical understanding of UV-scaling and mean-centering of variables [15]. In this representation, each bar corresponds to one variable and the short horizontal line inside each bar represents the mean value. The length of a bar (vector) is equal to its standard deviation (square root of variance). After UV-scaling, we get a shrinking of ?long? variables and a stretching of ?short? variables. By putting all variables on a comparable footing, no variable is allowed to dominate over another because of its length. Prior to any pre-processing, the variables have different variance and mean values. After UV-scaling, the ?length? of each variable is identical; however, mean values still remain different. After mean-centering, mean values are zero (i.e., the centroid of the whole data set is zero). This improves the interpretability of the model developed. However, it must be noted that in some cases, it is not necessarily advantageous to use auto-scaling, and some other choice might be more appropriate. Also, the X- and Y-variables can be scaled differently because the regression coefficients absorb the differences in scaling. Unit variance scaling Mean centering 0 Measured values & ?length? 39 2.6.1 Principal Component Analysis (PCA) Principal component analysis is a factor analysis method that is widely used in identification of systematic patterns in data and provides visualization of multivariate data by using as few variables as possible. It linearly maps multi-dimensional data onto lower dimensions with minimum loss of information [44]. The goals of PCA are to: ? extract the most important information from the data table, ? compress the size of the data set by keeping only this important information, ? simplify the description of the data set, and ? capture and analyze the structure of the variables. In order to achieve these goals, PCA transforms a set of correlated variables into a new set of uncorrelated ones, known as principal components (PCs) such that ? the first PC is the linear combination of the standardized original variables that have the greatest possible variance, ? each subsequent PC is the linear combination of the standardized original variables that have the greatest possible variance, is orthogonal and has zero correlation with all previously defined PCs. The orthogonality constraint imposed by the mathematics of PCA ensures that each variance-based axis is independent. PCs are arranged in order of decreasing eigenvalues. First PC is the most informative. Figure 2.11 shows the dimensionality reduction of original data to a low dimensional subspace using PCA which is much easier to visualize and analyze. 40 Figure 2.11: Projection of higher dimensional data onto a lower dimensional subspace. Note: In the following discussion, matrices are denoted by capital bold characters (X, Y), column vectors by small italic characters (t), and row vectors by transpose vector (? ? ). Using PCA, a structural descriptor data set of molecular architecture information or a process condition matrix (? ??? ), representing M observations of K variables, can be decomposed to 11 2 2 1 .......X TP TT MK K T ii i T MK KK tp tp K tp ? = ?? =?+?+ = ? = ? ? (2.24) where, T = the score matrix with mutually orthonormal columns P = the loading matrix with mutually orthonormal columns Principal components (PCs) are new lines that best approximate the data in the least squares sense and passes through the average point [15]. When two PCs are derived they, together, define a plane as seen in Figure 2.11). The score matrix (T) represents the projections of the data onto this Data Projections Data Center PCs Initial Data Data Plane 41 line in order to get a coordinate value along the PC-line. The new coordinate value is known as a score (? ? ). The loadings define the orientation of the PC plane with respect to the original X-variable. The loading matrix (L) contains the coefficients in the linear combination of the original variable defining the principal components (PCs). PCA-based soft models are both linear and additive. The loadings unravel the magnitude (large or small correlation) and the manner (positive or negative correlation) in which the measured variables contribute to the scores. Together the scores and loadings describe the principal components of the data set. Normally, the first three A (? ? ?) PCs capture most of the variance in the data (80-90% of total variance [46]). By retaining only the first A PC?s the X matrix can be approximated by: 11 2 2 1 ? .......X TP TT MK A T ii i T M A AK tp tp A tp ? = ?? =?+?+ = ? = ? ? (2.25) In order to characterize the properties of the observations one measures variables. Observations are comprised of two parts: signal and noise. Signal describes the property or effect of interest, and noise is everything else. Using methods based on variance such as PCA, multivariate data can be separated into signal and noise. Using Eq. (2.25), matrix X can be reconstructed as ? 1 ? 1 1 1 X XE E T P E T MK MK MK A TT i i MK i TT M A AK M K Noise Structure x x tp x ? ?? ? = ?? ? =?+ + =?+ ?+ =?+ ? + ? ????? (2.26) 42 Here, 1 ??? ? represents the variable average which originates from the pre-treatment step. This way, the data matrix ? ??? , containing K highly correlated variables, is transformed into the score matrix ? ??? , containing only A (where, ? ? ?) mutually independent latent variables, which are linear combinations of the original K variables, have better properties (orthogonality) and also span the multidimensional space of ? ??? . The residual matrix (E) comprises the distances of the original variables to their projection onto the principal components. Figure 2.12 graphically represents how scores and loadings form the T . P (structure) part of the PC model equation represented by Eq. (2.24). Figure 2.12: PCA decomposition of X matrix. Here vectors are designated as column vectors and the corresponding transposed vectors are designated as row vectors. The dashed lines in the matrix indicate the mean centering and scaling direction. 2.6.1.1 Number of Principal Components Before calibration of a model from PCs, it is important to determine the number of components necessary to extract the most relevant information from a data matrix. The errors calculated for the calibration set decrease E P T T t 2 t 1 += + (M x K) matrix ?A + (M x K) matrix = + E p T 1 p T 2 (M x K) matrix ( M x A ) m a t ri x (A x K) matrix 43 continuously as the number of components increase. Two methods [47] to help choose the number of components are to: ? plot the eigenvalues according to their size (the so called scree plot). The plot provides a visual aid for deciding at what point in this graph (often called an elbow) including additional components no longer increases the amount of variance accounted for by a nontrivial amount (slope of the graph goes from steep to flat). Keep only the components before the elbow. For example, the scree plot represented in Figure 2.13 suggests that using three PCs is appropriate. ? keep only the components whose eigenvalue is larger than the average eigenvalue. For correlation PCA, this means to keep only the eigenvalues larger than 1. Figure 2.13: Scree plot of the correlation matrix. One must be aware of the fact that unwanted variability in the data set, such as random noise, may also be taken into account by a model constructed with too many PCs. The model is said to be over-fitted, showing excellent results for evaluating samples belonging to the calibration set but failing on prediction of an external validation set. 0 1 2 3 4 5 0 2 4 6 8 10 12 14 E i g en v al ue Number of Components Three components 44 2.6.1.2 Scores Plot The scores plot shows correlations between observations, measurements or responses. Basically, it helps answer questions like: ? are observations related to each other? Responses close to each other have similar properties, whereas those far from each other are dissimilar with respect to descriptor variable profile (Figure 2.14 (a)). ? are there any groups or trends? In Figure 2.14 (a), data represented by different symbols (circle, triangle and square) represent a group of responses with similarity in descriptor properties. 2.6.1.3 Loading Plot The loadings plot shows correlations between variables. It helps answer questions like: ? how the descriptor variables are correlated? Descriptor variables contributing similar information are grouped together, that is, they are correlated. For instance, in Figure 2.14 (b), variables 15 and 28 are correlated. ? which variables are influential? If the variables are in the same quadrant, they are positively correlated, whereas variables in opposite quadrants (opposite side of plot origin) are negatively (inversely) correlated. For instance in Figure 2.14 (b), when the value of one variable in upper-right hand corner (say variable 13) increases or decreases, the value of the other variable (say variable 16) has a tendency to change in the same way. Moreover, as the distance of a variable from the plot origin increases, the stronger is the impact of this variable on the model. This suggests that the variables 6, 13, 16, 44, 2, and 5 (from Figure 2.14 (b)) separate the three response groups (in Figure 2.14 (a)). 45 Figure 2.14: Principal component analysis: Score plot (a, left) of t1/t2 and loading plot (b, right) of p1/p2. The ellipse represents the Hotelling T 2 with 95% confidence in score plot. Figure 2.14 depicts score and loading plots obtained from PCA. Comparing the loadings plot to the scores plot helps one understand how the variables relate to the observations. 2.6.1.4 Outlier Removal Real data is rarely homogeneous (outlier-free) and therefore most statistical methods require removal of outliers prior to the calibration of a model. Outlying samples may have a huge influence on the calibration of a model and may decrease its predictive ability. Outlier identification is based on distance from data centroids. Hotelling?s ? 2 can be used to detect outliers inside the model space [48]. Hotelling?s ? 2 is a multivariate generalization of the univariate student?s t-test, and provides a check for observations obeying to multivariate normality. When this statistic is used in conjunction with a score plot, Hotelling?s ? 2 defines a 95% or 99% tolerance region. In Figure 2.14 (a), one observation can be considered to be an outlier as it may not belong to the majority of the sample population and reveals a mistake in the property value obtained. P [1] (b) P [2] t [1] t [2] (a) 46 2.6.2 Principal Component Regression (PCR) After applying PCA to the ? ??? block for variable reduction, regression can be used to predict a particular quantitative characteristic (like attributes) as a function of score vectors. The multi-linear regression (MLR) relationship between the principal component scores ? ??? and the attribute properties ? ??? can be developed using a PCR model as: ( ) 1 ?? , where,Y T B B TT TY M L M A AL ? ? ?? = ? = ? ?? (2.27) Unlike, in OLS where the columns of Y are regressed onto the large and highly correlated columns of X, in PCR the columns of Y are regressed onto the reduced and mutually independent latent variables T. For any desired (1 ? ?) vector of Y-variable, (? ? ) ??? , one can compute a (1 ? ?) vector of new latent variable scores as ( ) ( ) ( ) 1 11 ?? ? BB B T TT T A L LA AL LA new new ty ? ? ? ?? ? = ?? (2.28) and then predict the (1 ? ?) vector of new X-variables as ( ) ( ) 11 TT K A AK new new x tp ? ?? = ? (2.29) Notice that Eq. (2.27) has the same structure as Eq. (2.18), however, instead of finding K variables to estimate (? ? ) ??? , now only ? (where,? ? ?) latent variables has to be found to estimate (? ? ) ??? thereby achieving reduction in dimension of the equation system involved. Also, since Eq. (2.25) preserves the covariance structure of X, the new X-variable found will be consistent with the past ones. More details can be found in [49]. 47 2.6.3 Partial Least Squares (PLS) PLS is a regression extension of principal component analysis (PCA). It generalizes and combines features from PCA and multiple linear regressions (MLR). Besides, relating two data matrices, X and Y, PLS also models the common structure between them thereby, giving richer results than the traditional multiple regression approach. PLS regression modeling has been described extensively in the open literature and can be found in Erikson et al. [15], Gabrielsson et al. [35], Geladi and Kowalski [42], among others. While this approach ensures the best possible correlation between the two data sets, it does not guarantee to best describe the X and Y data individually. Figure 2.15 shows a PLS model being generated between the descriptor data matrix X, which could be molecular descriptors or property descriptors, and response data matrix Y, which could be attribute or property information. The method fits two ?PCA-like? models at the same time, one for X and one for Y. However, the projections differ from those obtained with PCA on both blocks separately. The outer relation for X and Y block are: 1X TP E TT MK MA AK MK x ? ?? ? =?+ ? + (2.30) 1 TT ML MA AL ML y ? ?? ? =?+ ? +Y UVF (2.31) Figure 2.15: PLS regression on descriptive (X) and response (Y) variables. y 1 y 2 y 3 u t x 2 x 3 u t x 1 Attribute & Physical-Chemical Property Variables (Y) PLS Regression Model Property Structure Descriptive Variables (X) 48 The projections of X and Y are then connected through the inner relation U TG MA MA MA??? = + (2.32) where, G is a residual and the regression coefficient is one. The score plot in Figure 2.15 shows a linear relationship between predictors and the responses, however, non-linearities may exist. The PLS score plot u/t shows linear correlation structure between the predictors and the responses. The dash-dot line is the projection if PCA was performed on X and Y individually. In PLS, besides loadings P and V, there are additional loadings called weights ? ? which express the correlation between U and X and are used to calculate T such that T XW MA MK KA ? ? ?? = (2.33) The prediction of Y can be obtained from the PLS model as: ( ) ?? TT ML MA AL MK KA AL MK KL ? ? ?? ? ?? ? ? =?= ? ? = ?Y TV X WV X B (2.34) PCR and PLS can be considered standard calibration techniques for several spectroscopic techniques, among many. The main advantage of these techniques is to avoid collinearity problems thus allowing one to work with a number of variables that is greater than the number of samples. A comparison between these two techniques reveals similar results in terms of prediction ability in multivariate calibration, with no significant difference being reported when both employ the optimized number of principal components (PCs) [50]. PCR yields lower accuracy (degree of closeness of a measured value to the actual value) but higher precision (degree of closeness of the measured values to each other) than PLS. The basis of the model development in this dissertation will be based on PCR, however, PLS will be used for few qualitative applications. 49 2.6.4 Model Validation It is important to validate the predictive ability of the developed model in addition to having a good fit. Validation can be distinguished between two types: ? Internal validation: where the calibration data is also used as validation data. This method is applicable when a proper validation sample is not available, or is not used. Often the calibration sample is separated into a learning set (for calibration) and a testing set (for validation) [15, 51]. A popular internal validation technique is the leave- one-out cross-validation technique in which one sample is used for validation and the remaining samples are used for calibration (one-at-a- time method). This is repeated for each sample. Predicted residual sum of squares (PRESS) is computed as: ( ) 2 1 ? K ii i PRESS y y = = ? ? (2.35) The number of components giving a minimum PRESS is the right number for the model that gives optimal prediction. The root mean square error of cross-validation is, = PRESS RMSECV K (2.36) The smaller the PRESS the better the quality of the estimation for a developed model. Details can be found in Montgomery et al. [51]. ? External validation: is based on a new or independent set of validation data to evaluate the predictive ability of the previously developed model from the training data set. The external validation errors are often presented as the root mean square error of prediction (RMSEP), which is calculated as: 50 ( ) 1 22 1 ? K ii i yy RMSEP K = ?? ? ?? = ?? ? (2.37) Including more latent variable terms in the final model will always increase the goodness of fit, ? 2 , regardless of whether the additional variable is statistically significant or not. However, a large value of ? 2 does not necessarily imply that the regression model is a good one. Because ? 2 always increases as we add terms to the model, adjusted ? 2 statistics must also be considered. The ? adj 2 is particularly useful in the selection stage of model building. ( ) 2 22 1 1 1 where, 1 1 ? ?? =?? =? ?? ? ?? adj total K SS R RR A SS (2.38) where, SS is the residual sum of square. Unlike ? 2 , the ? adj 2 increases only if the necessary terms improve the model more than could be expected by chance. Therefore, when ? 2 and ? adj 2 differ dramatically, there is a good chance that non-significant terms have been included in the model [51]. The ? adj 2 can be negative, and will always be less than or equal to ? 2 . An individual principal component (PC) generated by PCA or PLS is considered significant if its cross-validated Q 2 value is greater than zero. Q 2 is calculated according to 2 1= ? total PRESS Q SS (2.39) The overall significance of each PCA or PLS model is evaluated in terms of Q 2 (cum) as [15]: 2 ( )1 ?? ?? = ?? ???? ?? ?? PRESS Q cum SS (2.40) 51 2.7 Principal Properties in Cluster Space In order to solve a design problem in a single domain, all the physico- chemical attribute properties of interest have to be converted to principal properties by using the regression coefficients from the calibration model [27]. ( ) ( ) 1 11 1 ?? ? , if where, ??? , if ? ?? ? ?? ? ?? ? ? ??? ? ?> ? =?= ? ? ?< ? BB B T Y B B BBB TT LA AL LA M A ML LA LA TT LA AL LA LA LA (2.41) ? ?1 is a generalized inverse or pseudo-inverse of a matrix ? B [52]. In order to utilize the latent variables (LVs) in the property clustering algorithm, it is important that the LV structures follow a linear mixing rule. This can be achieved by standardizing the data structure to obtain a new matrix, ? ??? . Rearranging the data decomposition represented by Eq. (2.25), we get: ? ?? = ?T X P MA MK KA (2.42) If the loadings ? ??? are thought of as the pure values of the principal components, then scores ? ??? serves as the predicted mixture properties represented by ? ??? in Eq. (2.1). Here, the mixture fractions of the multivariate descriptor data ? ??? must sum to one across descriptor variables K. To achieve this form, the latent variable structure is standardized by dividing ? ??? and ? ??? in Eq. (2.42) by the sum total of the property descriptors of each experimental run ? ??? such that: ? ?? = ?Q UP MA MK KA (2.43) Where, ? ??? = ? ??? /? ? , ? ??? = ? ??? /? ? , and = ? H Mi i Sx. A powerful chemical product design framework is achieved, by integrating latent variable methods within property cluster domain and by 52 formulating two reverse problems. The first reverse problem identifies product quality/performance requirements and second identifies mixing conditions (involving selection of materials, blend ratios, or process operation conditions) when considering mixture design, and substructural molecular building blocks when considering molecular design. The following chapters will demonstrate the mixture and molecular design solution framework developed by combining methods and tools presented in this chapter. 2.8 Computer-Aided Design using QSPR and cGCM This section describes how the tools and techniques presented in this chapter are combined within the computer-aided approach to facilitate investigation of chemical product formulations prior to experimentation and simulation. When appropriate property models are available to describe and predict the target product properties, computer-aided molecular design methodologies are utilized to solve the design problems systematically and efficiently. The computer-aided molecular design problems require: (1) prediction of properties from molecular structure (solution to forward problem) and (2) identification of optimized molecular structures that meet given a set of property values (solution to reverse problem) [1, 9, 11, 24- 27, 53]. Figure 2.16 shows the two required solutions of CAMD. Figure 2.16: Forward and reverse problems in computer-aided molecular design. Given set of molecular groups/building-blocks (e.g. CH 3 , C=C, , , Set of target properties values (e.g. ?, ?, T b ) Forward Problem Reverse Problem 6 , ? PF 4 ) ? BF + IM + PY 53 2.8.1 Solution to Reverse Problem The molecular design problem presented in this dissertation follows a reverse problem of property prediction from molecular structures. In this approach, it is required to enumerate structured molecules from a given set of representative molecular building blocks or fragments corresponding to target physical-chemical properties estimated by property models. First, the values of the target properties of desired molecule needed for a specific application are determined a priori. Lower and upper bounds on each property, ? ? , of molecule, i, are obtained either from process design problem (see Section 2.2.1) or from product property requirements/constraints. min max ?? j ij j y yy (2.44) Mapping the design problem formulation from property to cluster space (described in Section 2.2.2) enables visualization of a problem and its solution when three properties are concerned. The normalized property operator of molecule i using corresponding reference values give: min max ? ?? ?? j ij j (2.45) where ? ? is defined by Eq. (2.4). Second, brute-force search or exhaustive searches are performed using numerous permutations or combination of atoms and molecular groups, to systematically enumerate all possible molecular structures, and mixtures of molecules, that satisfy specifications in terms of the normalized target property operator values in Eq. (2.45). The maximum number of similar groups ? ? ? are predefined to limit the size of a generated molecule. The possible number ? ? ? of similar molecular fragments/groups of type g that can be included in the design can be estimated by taking the minimum of the nearest integer value of ?? ? ??? ? ? ? ? and ? ? ? . 54 max ** min : ??? = ?? ? ?? j gg j nN (2.46) where ? ? ??? is the target maximum latent property value of property ?. It must be noted that exhaustive generate-and-test algorithms are adequate to solve problems with small size. For problems with slightly larger search space, reduction in the search space (reducing the set of candidate solutions to a manageable size) using problem-specific heuristics can make the algorithm more efficient [54, 55]. However, for large scale molecular design problems, stochastic or evolutionary search algorithms such as genetic algorithm (GA), are preferred [9]. 2.8.2 Solution to Forward Problem In order to achieve predictive property models, quantitative structure- property relations (QSPR) are utilized where the training set molecular descriptors are generated in terms of infrared (IR) frequencies. Since molecular descriptors are obtained through characterization techniques based on IR spectroscopy (described in Section 2.4.1), the group based property estimation method is termed characterization-based group contribution method (cGCM) [27]. Since the combination of molecular groups can be done in an infinite number of ways, the search space and the size range of enumerated molecules can be reduced by constraining the number of occurrence of similar and dissimilar types of groups within minimum and maximum limit. ** 0 ?? gg nN (2.47) 2 ?? gg nN (2.48) Note that a minimum number of two dissimilar groups, ? ? , must be selected to form a structurally feasible molecule. 55 When order of selection does not matter and repetition of similar groups is allowed, the total number of possible candidate molecules that can be generated by selecting ? ? groups from a set of ? ? groups is given by: ( ) ( ) total 22 1! Candidate ! 1! = = +? = = ? ?? gg g g gg NN ggN n nn gg Nn C nN (2.49) As the number of groups gets high, the problem of combinatorial explosion arises. In order to reduce the combinatorial problem, atoms and several first order groups (except some terminal groups like CH 3 ) are clustered into meta-groups. In group-based property estimation methods, these meta-groups are treated as first order groups. Such groups capture inter- and intra-atomic and group interactions. The property operators of molecules using first order meta-groups-based cGCM are estimated as: 1 g N Mix j g jg g n = ? = ?? ? (2.50) where, ? ? ??? = the mixture property operator values of property ? ? ? = the number of occurrences of dissimilar molecular fragment ? ? ? = the maximum number of possible appearance of dissimilar fragments (user defined), and ? ? = the property contribution of the appeared group in the formulated molecule. When the predicted property values using Eq. (2.50) of enumerated molecules from the combinatorial building blocks satisfy the property constraints in Eq. (2.45), the set of molecules are considered and selected as a candidate solution. Then the enumerated candidate molecules are screened for structural constraints to ensure that a stable and connected molecule was formed. One such structural constraint is to check the number of unused bonds in a generated molecule, i.e. the free bond number (FBN) [24]. 56 11 2 12 gg NN g g g Rings FBN n FBN n N = = ? ?? ? = ? ?? ? ?? ? ?? ? ? ?? ? ?? (2.51) where, ??? ? = the unique free bond number associated with group g ? ???? = the number of rings in the formulation. For an electronically complete molecular formulation the free bond number must be equal to zero. A FBN of zero indicates that the electron valency shells of all atoms in the molecule have been satisfied, which, in most cases, indicates one of the minimum energy connectivity configurations of the atoms in the molecule. Finally the feasible solution can be verified through more rigorous experimentation and/or computational studies based on molecular dynamics and quantum chemical calculations. The methodology presented in this dissertation provides a computationally efficient screening procedure to narrow down potential compounds with desired attributes from a large chemical space. Figure 2.17 is a schematic diagram that summarizes the steps involved with variable transformation, latent variable model calibration, and property integration to solve the computer-aided reverse design of chemical products in reduced space. 57 Figure 2.17: An overview on the methodology of multivariate characterization, modeling, and design. , 1 H T MA MA hm h x ?? = = ? QT ? M L M A AL? ?? = ?Y TB ,, T new M A New M L L A? ?? = ? T YB MH? X ? T M H M A AH? ?? = ?X TP ( ) ? = a a ma m qq ( ) ( )???= am aa aa m ref qq 1= = ? ? A m am a AUP Chemometric Technique ? Dimensionality reduction and multicollinearity removal from molecular/process descriptor data set. ? PCA to achieve a qualitative analysis of data and PLS to develop the calibration model. ? Transform physicochemical properties to principal properties (latent variable domain). Property Clustering Technique ? Standardize principal properties such that they follow linear mixing rule (suitable in cluster domain). ? Property operators as functions of standardized principal properties. ? Dimensionless property operators follow linear mixing rules and allow optimization based on lever- arm analysis. ? AUP values of the formulated product and the target must match along with matching the cluster targets. ? Clusters follow intra- and inter-stream conservation. 0=FBN 1= ? = ?? ? g N Mix a g ag g n , ? ?? ?? LU a am a Reverse Molecular Design ? Mixing of properties operators follow linear mixing. ? Selection of maximum number of dissimilar molecular fragments. ? Solution properties must lie within the lower and upper target property limits. ? Molecules must be chemically stable and structurally feasible. =? ma ma m C AUP 0 ?? gg nN 58 CHAPTER 3 MIXTURE DESIGN 3.1 Introduction Mixing/blending processes play an important role in today?s manufacturing of value-added chemical products such as commodity chemicals, food, cosmetics, oil and pharmaceuticals. The product formulation design problem in industrial research and development is one such area where the analysis of mixture data could be utilized more effectively. Mathematical models are commonly used to characterize a system, which is to be controlled or optimized by a set of variables, to study the effects of various factors and to make predictions about behavior. Generally, models can be classified into mechanism-driven models and data-driven models. Traditionally, the chemical engineering discipline has focused on mechanistic models that describe underlying phenomena with a system of differential and algebraic equations (DAEs). However, real-life industrial process systems, which are often complex and nonlinear with incomplete and/or uncertain data, cannot be adequately described by such models [18]. Data-driven models offer an alternative solution as we move from limited data, which used to be obtained through time-consuming experiments and simulations, to massive amounts of data that can be generated from analytical instruments, images, spectra, etc. (i.e. ?data poor? to a ?data rich? paradigm shift due to rapid instrumentalization of science and technology). This has been the trend in recent decades [18, 19, 35, 27, 41, 56, 57]. Industrial reality also suggests that a good theoretical process model is often not available. Before new experiments are conducted, historical process 59 data that encompass a wide spectrum of operating conditions and existing product grades can be utilized to achieve new and improved products (20). Today, tremendous amounts of diverse data are readily available from extensive monitoring of equipment, processes, and products at all scales (18). However, managing such abundance of complex data to build appropriate models for a specific application and exploring their effects on the final product properties remain major challenges. 3.2 Traditional Approach In classical non-mixture designs such as factorial and response surface designs, all the factors are orthogonal or independent. This means that it is possible to freely choose the level of a factor regardless of the other factors? levels [58]. In general, the model parameters can be used to judge the effects of the mixture components. For example, consider a mixture experiment in which three components ? 1 , ? 2 , and ? 3 , were blended to form a product with response variable ?. If 1 2 3 12 13 23 ? 11.7 9.4 16.4 19.0 11.4 9.6y x x x xx xx xx= ++ + + ? can adequately represent the response, it can be concluded that component 3 has the largest contribution to the highest response value because the coefficients are in the order ? ? 1 > ? ? 2 > ? ? 3 . Furthermore, since ? ? 12 and ? ? 13 are positive, presence of components 1 and 2 or components 1 or 3 enhances the response. This is an example of synergistic blending effects. Components 2 and 3 have antagonistic blending effects because ? ? 23 is negative and the presence of both components works against the response of interest. However, this freedom does not exist for mixture designs, because each component in a mixture is dependent upon the settings of the other component settings. In mixture design, the factors are interdependent and the effects of the factors on the responses are not separable. For example, for a mixture containing N components, if the proportions of the first N-1 components are defined, the proportion of the N th component cannot be freely 60 chosen. If x i is the percentage content of component i, the sum of all the mixture components is given by the fundamental mixture constraint relationship 12 1 1.0 N iN i xxx x = = + +???+ = ? (3.1) 0 1.0 1, 2, ..., i x iN? ? ?= (3.2) Analyzing mixture data with multiple regressions necessitates a special model form to eliminate the mixture constraint (Eq. (3.1)). The Scheffe canonical models [59] and Cox polynomial models [60] are the two most commonly used to analyze mixture data with multiple regressions. Scheffe introduced canonical models of various orders by eliminating some terms from the complete polynomial model. Cornell [58] provides multiple standard references to regression, modeling and analysis of mixture data. A short review on Scheffe and Cox models is also included in Appendix B. In most practical applications, mixture data can be noisy and highly collinear because of process or operational constraints (20). Although the Scheffe canonical models and the Cox polynomial models (a reparameterized and constrained version of the Scheffe model) eliminated the true collinearity, and enabled the use of multiple regressions for the estimation, the problem of near collinearities with mixture data remains. Design of experiments (DOE) with response surface methods is usually used to determine the optimum combination of chemical constituents that give a desired response using a minimum number of experimental runs [35]. While such a design approach is adequate for most experimental designs, it suffers from combinatorial explosion and visualization difficulties when dealing with multi-component mixtures [22,61]. Solvason et al. [61] presented a solution to the above problems by integrating the property clustering framework with existing mixture design techniques. 61 In addition, traditional mixture models are usually employed to investigate the relationships between the blend ratio matrix (R) and the final blend product property matrix (Y) only (Eq. (3.3)), given that the properties of the pure raw materials (X) and the process conditions used to manufacture them (Z) are already chosen. The preceding topics represent this situation. However, for the development of new products that meet target properties with minimum experiments and minimum total material cost, it is important to simultaneously take into account all three degrees of freedom (X, R, and Z) available in blending operations [57]. In general, mixture design approaches tend to treat the three problems as separate steps. A set of raw materials is selected usually based on the experimenters? best guess, and then a set of blending experiments with certain constant process conditions are run to see if the target properties can be achieved. If the results are not acceptable, another set of raw materials and/or process conditions are selected and the process repeated. Figure 3.1 depicts the traditional mixture design approach which may lead to many blending experiments, a very inadequate investigation of the large number of raw materials, and a very long development time. Figure 3.1: Traditional approaches in mixture design. Material Selection Process Condition Selection Blend Ratio Selection Target Properties meet? Final Product Select set of Raw Materials and/or process conditions based on best guess or experience Design set of blending experiments or trial and error Yes No 62 Statistical multivariate models such as principal component analysis (PCA), partial least squares (PLS), and neural networks (NN) provide powerful tools to extract systematic variables and remove both of the above collinearities in the data set, thus allowing one to work with a number of variables that is greater than the number of samples [15, 49, 56, 57]. Kettaneh-Wold [56] proposed the use of a partial least square (PLS) model for mixture data, and showed how it effectively deals with both the above collinearity problems, and can simultaneously incorporate process conditions. Recently, Muteki et al. [57, 62] added the relationship between the raw material properties and final blend properties that allows investigation of the effect of raw material properties on the final mixture product properties. However, it is important that the raw material properties selected are sufficiently correlated with the final blend product properties. The multivariate statistical methods presented in this chapter are based on Muteki?s development on mixture-property models combined with the property clustering technique and reverse problem formulation. The nomenclatures are kept consistent from that in Muteki?s papers for less confusion. In this chapter we shall call the X-variables factors or predictors and the Y-variables qualities or responses. 3.3 Multi-Block Data Structure The data structure generally available on raw material property data and blending data in mixture design is shown in Figure 3.2 [57]. The raw material properties matrix, ? ??? , consists of N available raw materials with K number of properties. The blend ratio matrix, ? ??? , consists of M number of blends of N materials used in the formulation of the blends such that ? ? ? = 1.0 ? ?=1 , where ? ? ? 0 ? i = 1,2, ? , ?. The process conditions matrix, ? ??? , consists of J process conditions. The final response or property matrix, ? ??? , consists of L properties measured on the final product. It must be noted that 63 there is no common dimension over the entire data matrix. However, ? ? has an indirect relationship to Y through R as it has one dimension in common with R but no dimension in common with Y. It is referred to as L-or T-shaped data structure [57]. Figure 3.2: Data structure for three manipulative variable matrixes and a quality/response variable matrix. 3.4 Multi-Block Regression Models Traditional mixture models such as the Scheffe and the Cox models [58], have commonly been used to model the relationship between R and Y as: ()YRf ?= + (3.3) Muteki et al. [57] used ideal mixing rules for combining the raw material properties matrix (X) and the blend ratios matrix (R) in order to relate all (Z, R, ? ? ) to the Y matrix with a common dimension M. , ( , ) where,Y XZ X R X mix mix M K M N N K f ? ? ?? =+=? (3.4) ? ??? is the mixture-raw material mixture-property matrix. Muteki [57] demonstrated that the mixture-property models account for the similarities among the raw material properties and their effect on the blends. This can X T (K x N) matrix R aw Mat er i al P r oper t y Material Z (M x J) matrixS am pl e Process Condition R (M x N) matrix Material Proportion Y (M x L) matrix Mixture Property S am pl e S am pl e 64 successfully capture more inherent latent relationships between the mixture properties (? ??? = ???) and the final product properties Y than those which exist simply between the ratios (R) and final blend properties (Y). When a property model does not follow linear mixing, then the model can be moved from its non-linear domain to a linear domain by a suitable transformation of the model formulation [34]. If the process operating conditions (Z) change between blending experiments, then the effect of these changes are easily accounted for by incorporating Z into the PLS models: [ ] ( ) ( , ) where, Y X Z X X Z RXZ mix all mix all mix f ?=+==??? ?? (3.5) ? ??? ??? is a matrix combining ? ??? and Z in parallel. The above data combination can be better understood graphically in Figure 3.3. (The dotted lines in each block indicate the mean centering and scaling direction.) Figure 3.3: Data structure for combined manipulative variable matrixes and a quality/response variable matrix. Y (M x L) matrix X T (K x N) matrix R (M x N) matrix Z (M x J) matrix Y (M x L) matrix X mixall = [(R . X)Z] (M x H) matrix Ra w M at er i al P r oper t y Material E xp . N o Material Proportion E xp . N o Process Condition E xp . N o Mixture Property Adjustable Process Variables E xp . N o E xp . N o Mixture Property 65 This simplifies the analysis and design by not having to differentiate between mixture and process variables and not having to assume independence of the factors when multivariate analysis techniques are used [56]. The multi-block PCR (Section 2.6.2) or PLS (Section 2.6.3) model can be used to obtain the relationship in Eq. (3.5). PLS regression is performed by projecting the ? ??? ??? data and Y data onto a lower dimensional subspace: * , 1 where, A TT mix all M H i i M H M A A H M H mix all i tp ? ? ?? ? = = ?+ = ? + = ? ? X E T P E T X W (3.6) 1 A TT ML i i ML MA AL ML i tv ? ? ?? ? = = ?+ = ? + ? Y F TV F (3.7) The prediction of Y can be obtained from the PLS model as: ,, ?? TT M L M A A L mix all M H H A A L mix all M H H L? ?? ? ?? ? ? =?= ? ?? ? Y TVX WVX B (3.8) The prediction of Y can be obtained from the PCR model as: ?? M L M A AL? ?? = ?Y TB (3.9) This way, the data matrix ? ??? ???,??? , containing ? (= ? + ?) highly correlated manipulated (or predictive) variables is transformed into the score matrix, ? ??? , containing only A (where,? < ?) independent latent variables, which are linear combinations of the original manipulated variables. The weights of this linear combination are captured in the loading matrices, ? ??? ? , ? ??? and ? ??? . If new raw material properties, (? ? ) ??? , have to be predicted from the desired product quality specifications, (? ? ) ??? , then the inversion of the latent variable model gives: 1 ,1 ,1 ? ??()BB B P T T T TT new K des K L A A K xy ? ? ? ?? = ??? ? (3.10) 66 3.5 Proof of Concept Example ? Starch Blending The development of thermo-plastics from the mixing of starches, lactic acids and additives using latent variables in cluster space is presented as a case study to illustrate the method and concept described in this chapter. Data for the polymer blend problem was obtained from Muteki [63] and involves study of the influence of raw material properties (? ? ), blend ratios (R) and process conditions (Z) on the product property matrix Y. This work employs multivariate data analysis techniques and the mixture-property model introduced by Muteki et al. (2006) then formulates and solves the mixture design problem in the cluster space. 3.5.1 Structure of starch blending data The raw material data matrix (X) consists of 5 properties (Amylose content and 4 properties related to molecular weight distribution) on 3 starches. The blend matrix (R) consists of different blend ratios of 3 starches and 3 other materials (1 polylactic acid and 2 additives) in each blend of 28 mixtures. The process condition matrix (Z) consists of the molding temperature as the process variable. The final product property matrix (Y) consists of 4 polymer properties (tensile strength (TS), tensile modulus (TM), elongation at break (EB) and density (Rho)) for 28 mixture blends. Figure 3.4 represents a more complex extension of the data structure for the X and R matrices shown in Figure 3.2. The raw material property data matrices result in a staircase type of the data structure and are not overlapped with each other because their raw material properties contain measurements on different variables due to the different classes of materials. These blend ratio matrices were designed by a D-optimal design, and therefore the conditioning is relatively good. In this case, only one class of raw materials has the property data information (? ?????? ? ), and the property data matrix (? ?????? ? ) of the other materials (PLA and additives) is not 67 available [63]. The data structure of the X matrix for this problem of the mixture-property model using ideal mixing rule becomes ? ??? ??? = [(??? ??? )?] = [(? ?????? ?? ?????? )? ????? ?]. It is also known that many polymer properties, such as the average molecular weight approximately follow ideal mixing rules [64]. Figure 3.4: Data structure of X and R matrix for two classes of raw materials and their blend ratios. Table 3.1 and Table 3.2 contain the data for the raw material properties and Table 3.3 contains mixture conditions and product properties [63]. Table 3.1: Starch material property data matrix. Raw Material Property Data, ? ?????? Starch I.D. Amylose content, Average molecular weights (?10 -3 g/mol) wt% Mn Mw Mz Mw/Mn (PDI) Starch1 70 41 214.5 463.7 5.23 Starch2 23 62.3 531.4 821.2 8.53 Starch3 0 90.1 722.2 1040 8.02 X T starch (SK x SN) S t ar c h P r oper t y Starch X T other (OK x ON) O t her P r oper t y Other R = [R starch R other ] (M x N) E xp . N o Material Proportion R other (M x ON) R starch (M x SN) 68 Table 3.2: Blending ratio and process condition data matrix. Material Blend Ratio Process Condition Mixture R starch R other Z I.D. Starch 1 Starch 2 Starch 3 PLA Additive 1 Additive 2 Molding temperature 1 0.50 0.00 0.00 0.00 0.50 0.00 150 2 0.41 0.00 0.00 0.18 0.41 0.00 150 3 0.41 0.00 0.00 0.18 0.41 0.00 170 4 0.33 0.00 0.00 0.33 0.33 0.00 150 5 0.33 0.00 0.00 0.33 0.33 0.00 170 6 0.23 0.00 0.00 0.54 0.23 0.00 150 7 0.23 0.00 0.00 0.54 0.23 0.00 160 8 0.23 0.00 0.00 0.54 0.23 0.00 170 9 0.39 0.00 0.00 0.17 0.39 0.05 150 10 0.32 0.00 0.00 0.32 0.32 0.05 150 11 0.22 0.00 0.00 0.51 0.22 0.05 150 12 0.16 0.00 0.00 0.63 0.16 0.05 150 13 0.36 0.00 0.00 0.36 0.24 0.05 150 14 0.36 0.00 0.00 0.36 0.24 0.05 170 15 0.38 0.00 0.00 0.38 0.19 0.05 150 16 0.38 0.00 0.00 0.38 0.19 0.05 170 17 0.36 0.00 0.00 0.36 0.18 0.09 150 18 0.36 0.00 0.00 0.36 0.18 0.09 170 19 0.00 0.52 0.00 0.22 0.26 0.00 150 20 0.00 0.52 0.00 0.22 0.26 0.00 170 21 0.00 0.40 0.00 0.40 0.20 0.00 150 22 0.00 0.40 0.00 0.40 0.20 0.00 170 23 0.00 0.26 0.00 0.61 0.13 0.00 150 24 0.00 0.26 0.00 0.61 0.13 0.00 170 25 0.40 0.00 0.00 0.40 0.20 0.00 150 26 0.40 0.00 0.00 0.40 0.20 0.00 170 27 0.00 0.00 0.40 0.40 0.20 0.00 150 28 0.00 0.00 0.40 0.40 0.20 0.00 170 69 Table 3.3: Mixture product quality data matrix. Product Quality, Y Mixture I.D. Tensile strength Mpa Tensile modulus Mpa Elongation at break % Density g/cm 3 1 0.888 18.59 10.1 1.367 2 3.72 162.3 10.1 1.341 3 3.25 166.1 7.6 1.332 4 9.7 629.1 6.59 1.311 5 7.55 489.1 4.02 1.316 6 22.8 1522 7.59 1.296 7 23.8 1469 8.00 1.298 8 20.0 1430 4.21 1.294 9 3.03 134.5 11.00 1.345 10 8.73 568.9 7.3 1.315 11 17.6 1327 20.1 1.296 12 22.4 1700 19.4 1.295 13 14.8 1067 3.46 1.344 14 11.5 692.3 4.15 1.327 15 16.8 1740 2.17 1.346 16 16.8 1386 1.61 1.336 17 12.9 1149 2.20 1.339 18 16.2 1291 1.90 1.331 19 5.08 250.6 13.0 1.391 20 5.09 198.9 15.5 1.377 21 15.1 926.2 6.22 1.367 22 10.4 721.7 3.02 1.347 23 24.5 1557 5.30 1.324 24 19.2 1496 1.99 1.312 25 13.85 937.5 3.32 1.353 26 11.84 854.8 2.06 1.342 27 13.79 784.8 7.03 1.363 28 14.45 968 3.17 1.367 70 3.5.2 Data analysis The data was first mean centered and unit variance scaled as described in Section 2.6. Data analysis was performed using the JMP? 9.0 statistical software package by SAS [65]. PCA was used as the projection method to analyse multivariate data and examine the overall data structure. Figure 3.5 is an outlier analysis based on Hotelling?s ? 2 . The data contains no outliers. Figure 3.5: Distribution plot, outlier box plot and normal quantile plot. Figure 3.6 is a scree plot of eigenvalues vs. number of principal components (PCs) combined with a pareto plot. The pareto plot shows each eigenvalue as a percentage of the total eigenvalue. Eigenvalues sum to the number of variables when the principal component analysis is done on the correlation matrix. Cumulative percent shows the cumulative percent of variation represented by the eigenvalues. The scree plot is useful for visualizing the dimensionality of the data space. In this example, the scree plot suggests that using three PCs is adequate as any additional PC did not substantially increase the amount of variance accounted for. The first three principal components (i. e.? = 3) that are guaranteed to be orthogonal, captured 86.82% of the total variance of ? ??? ??? [= (? ?????? ?? ?????? )? ???? ?] 4 6 8 10 12 14 28 27 25 23 24 1 26 18 17 20 19 10 1214 9 115 321 2 22415 13 16 8 6 7 -1.64 -1.28 -0.67 0.0 0.67 1.28 1.64 0.1 0.2 0.4 0.6 0.8 0.9 Normal Quantile Plot 71 data. The score variables generated by PCA are optimal summaries of the original variables. Table 3.4 and Table 3.5 contain the score and the loading values, respectively, for X- and Y- blocks. Figure 3.6: Scree plot and pareto plot for PCA on X-variables. Figure 3.7 and Figure 3.8 present loading plots combined with score plots for the PCA on X-variables. There are 28 mixture observations involved in this problem. Since a three dimensional plot is the highest possible dimensional plot that can be constructed, only three variables can be studied at a time. Therefore, the relationship between all 28 observations is hidden in a highly dimensional space. The score plot projects high dimensional variables onto low dimensional variables thereby facilitating the analysis of all 28 observations simultaneously on a two dimensional plot (see Section 2.6.1.2). The loading plot shows the relationships among the 7 variables. The variables that are most influential for the model are found on the periphery whereas the less influential variables are encountered around the origin of the loading plot (see Section 2.6.1.3). Since the score and the loading are complementary and superimposable, a biplot can be created that combines both plots (Figure 3.7 and Figure 3.8). 72 Figure 3.7: Combined PCA score and loading plots (Biplot) on first and second components for X-block. Figure 3.8: Combined PCA score and loading plots (Biplot) on first and second components for Y-block. ? Mixtures ? Mixture variables ? Mixtures ? Product variables 73 It is clear from Figure 3.7 that the variables PDI, Mn, Mz and Mw are positively correlated, i.e. when one increase or decreases, the other has a tendency to change in the same direction. On the other hand, the variable molding temperature is negatively or inversely correlated to amylose% i.e. increase in one will have inverse effect on the other, and vice versa. Similarly, Figure 3.8 shows a loading plot combined with a score plot for the PCA on Y-variables (which explained 86.79% of the variance of the final product properties). In this loading plot all product property variables, except the tensile strength and the tensile modulus, have the trade-off relationship among them. Figure 3.9 shows the regression coefficients for the final product properties and how the starch properties, PLA and additives are correlated. For instance: high additive1, low molding temperature and low amylose content leads to high elongation. High lactic acid and low additive1 leads to high tensile- strength and modulus. The observations are well known to most experienced polymer chemists [64]. The results obtained here are consistent with Muteki?s results [63]. Since only three properties can be represented in the ternary cluster diagram and tensile strength and modulus are affected similarly by the mixture factors, only one of them will be used together with elongation and density as three targeted product properties. However, in situations where more than three properties have to be considered, an algebraic approach may be used [24, 21]. Figure 3.9: PLS model coefficients for blend property matrix Y using four latent factors. 74 Table 3.4: PCA score values for X- and Y-blocks. X score value Y score value t 1 t 2 t 3 t 1 t 2 0.252 4.221 -0.155 -2.814 0.288 -0.426 2.466 -0.204 -1.928 0.625 -0.279 2.257 1.655 -1.743 0.299 -1.022 0.948 -0.248 -0.215 0.330 -0.874 0.740 1.611 -0.597 -0.179 -1.784 -1.043 -0.302 2.264 0.566 -1.711 -1.148 0.628 2.248 0.618 -1.637 -1.252 1.557 2.023 0.002 -0.922 2.265 -0.817 -2.117 0.737 -1.459 0.858 -0.854 -0.463 0.413 -2.204 -1.049 -0.909 1.237 2.864 -2.657 -2.219 -0.942 2.147 2.688 -1.217 0.342 -0.952 0.259 -0.770 -1.070 0.133 0.908 -0.196 -0.346 -1.093 0.018 -1.011 1.222 -1.133 -0.946 -0.191 0.848 1.002 -1.043 -1.507 -0.072 -1.494 0.301 -0.936 -1.359 -0.280 0.365 0.918 -0.905 4.256 0.867 -1.106 -2.659 0.435 4.403 0.658 0.753 -2.535 1.090 2.462 -0.608 -0.949 -0.355 -0.572 2.609 -0.816 0.910 -0.588 -0.832 0.369 -2.330 -0.767 2.025 -0.247 0.516 -2.539 1.093 1.765 -0.659 -0.602 0.111 -0.408 -0.134 -0.900 -0.455 -0.098 1.451 -0.188 -0.958 4.105 -1.016 -1.259 -0.590 -0.347 4.252 -1.224 0.601 -0.287 -1.128 75 Table 3.5: PCA loading values for X- and Y-blocks. X loading value Y loading value p 1 p 2 p 3 p 1 p 2 -0.276 0.469 0.097 0.616 -0.012 0.460 0.066 -0.069 -0.132 0.924 0.468 -0.052 -0.078 -0.474 -0.374 0.468 0.036 -0.067 0.447 0.118 -0.043 -0.122 -0.623 0.008 -0.029 0.599 0.095 -0.231 -0.013 -0.359 0.072 -0.102 0.914 3.5.3 Model development Latent variable models were developed using three latent variables (? ? , obtained from PCA) to predict three product properties: tensile strength (TS), elongation at break (EB) and density (Rho). The general form of the equation is: 3 33 3 2 0 11 ? i i ij i j ii i i i j ji i y t tt t?? ? ? = <> = =++ + ? ?? ? (3.11) Since addition of parameters result in the risk of overfitting, it is important to select the optimum number of parameters during model selection. Bayesian Information Criterion (BIC) introduces a penalty term for the number of parameters chosen in the model. BIC can be expressed as: 2log ln( )BIC likelihood B M=?+ (3.12) where, B = the number of parameters, including intercept and error terms in the model M = the number of observations in the data set. 76 Minimization of BIC was used as a criterion during the model selection process in the statistical package JMP 9.0 and the detailed expressions for the property models are as follows: 2 0 1 1 2 2 3 3 13 1 3 33 3 TS t t t t t t?? ? ? ? ?= + + + + ?+ (3.13) 22 0 1 1 2 2 3 3 12 1 2 23 2 3 11 1 22 2 EB t t t t t t t t t?? ? ? ? ? ? ?= + + + + ?+ ?+ + (3.14) 22 0 1 1 2 2 3 3 12 1 2 11 1 22 2 Rho t t t t t t t?? ? ? ? ? ?= + + + + ?+ + (3.15) The estimated regression coefficients for the respective property models are listed in Table 3.6. Figure 3.10 shows a plot with actual versus predicted properties. Table 3.6: The regression coefficients for the expressions in Eqs. (3.13) through Eqs. (3.15). Regression coefficients TS (MPa) EB (%) Rho (g/cm 3 ) 0 ? 14.59 2.534 1.345 1 ? -1.119 -1.328 0.0146 2 ? -3.938 0.4816 0.0073 3 ? -0.1558 -0.7965 -0.0068 12 ? - 0.7488 0.0016 23 ? - 0.6859 - 13 ? -0.4749 - - 11 ? - 0.7361 -0.0018 22 ? - 0.5084 -0.0013 33 ? -1.59 - - ? 2 0.917 0.698 0.852 ? adj 2 0.898 0.593 0.810 77 Figure 3.10: Predicted vs. actual product properties using PCR model. 3.5.4 Design of desired products in score space It is desirable to cover a wide range of product properties with a minimum number of products in order to minimize the manufacturing costs, inventories and material cost. Figure 3.11 is a product properties score plot (same as Figure 3.8) where the desired product properties can be independently selected such that a set of product grades spans the desired property space [63]. Here five target products are selected in the score space to provide a wide range of final product properties. Table 3.7 shows the score values from the selected points in Figure 3.11. Using the score and the loading values from the PCA on Y (Table 3.5), the properties for the desired products (tabulated in Table 3.8) can be calculated as 2 , 1 ? 1 = ?? =?+ ? ? ?? ?? ? T des h des a a h a y x t ps (3.16) where, 1 ??? = the variable average ? = the standard deviation originated from the pre-processing step. 78 Figure 3.11: Visualization of target product properties in score space. Table 3.7: Score values for desired products in Figure 3.11. Products t 1 t 2 D1 0.0 0.0 D2 2.0 0.0 D3 1.0 2.0 D4 -2.0 1.0 D5 0.0 -1.0 Table 3.8: Desired product properties for desired products in Figure 3.11. Products TS (MPa) EB (%) Rho (g/cm 3 ) D1 13.06 6.861 1.335 D2 21.32 5.510 1.309 D3 17.03 15.64 1.302 D4 4.725 12.94 1.350 D5 13.14 2.136 1.345 ? : Existing Products Grades ? : Desired Product Properties D4 D1 D3 D2 D5 79 From the five desired products, product D2 (from Table 3.8) was selected as a target product (?) in score space. The mixture conditions (i.e., the raw materials, their blend ratios, and process condition) that give the desired product properties of the target product can be estimated by minimizing the sum of square differences between the target and predicted properties. The results are listed in Table 3.9. Using this information, a new blend product can be synthesized. Table 3.9: Required mixture conditions to achieve target product properties. Target R starch R other Z Starch 1 Starch 2 Starch 3 PLA Add 1 Add 2 Temp o C D2 0.171 0.132 0.053 0.404 0.223 0.016 159.5 Muteki used a mixed integer non-linear programming (MINLP) algorithm to find raw materials that were not part of previous product grades. The results can be found in [63]. Such optimization is outside of the scope of this study. 3.5.5 Design of desired products in cluster space The ternary cluster space provides an excellent platform for simultaneous visualization and solution of mixture design problems. Figure 3.12 is a ternary cluster plot with a feasibility region for blend product properties that incorporates all the existing product grades and all five desired product properties identified from Figure 3.11 and Table 3.8. The method to convert score values to cluster values is discussed in Section 2.7. The mixture feasibility region in cluster space ensures that the products within its boundary are physically feasible, consistent with past operating strategies and expected to yield the desired product qualities. 80 Figure 3.12: Visualization of starch blending formulation in cluster space. Analogous to the use of the score plot to identify a product whose properties span the desired property space, cluster space can achieve similar objectives. Cartesian coordinate points for D2 cluster are (0.634, 0.540). Since property clusters are tailored to maintain the fundamental rules for intra- and inter- stream conservation, the mixing operation can be optimized using lever-arm analysis. For instance, if two polymer product grades are compatible, i.e. miscible, they can be mixed together to get a polymer with properties somewhere between those of the two polymers mixed. In latent property cluster space, one can rapidly screen out inherently infeasible combinations of candidate constituents visually. The reduced search space can then be explored for the feasibility of formulating binary, ternary and 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Rho EB TS 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Mixture Design Feasibility Region ? : Desired Product Properties ? : Target Product Properties ? : Existing Blend Products D3 D2 D1 D5 D4 8 24 15 16 13 18 4 10 26 5 22 2 9 3 25 6 81 multi-component mixtures to achieve optimum products based on lever-arm analysis without extensive enumeration. For example, the target product (D2) synthesized in Table 3.9 can also be formulated by mixing existing blends 24 and 25 with blend 6 (denoted as 24-25-6 in Table 3.10). In Figure 12, M1 represents a mixture of existing blend products 24 and 25 in a proportion 0.640 and 0.360 respectively. When M1 is mixed with blend 6 in a proportion 0.207 and 0.793 respectively, product target (D2) with property targets (listed in Table 3.8) can be achieved. A simple visual analysis of Figure 3.13 suggests that the binary mixture of existing blend products is inherently infeasible to achieve the target product. Figure 3.13: Visualization of starch blending formulation in cluster spac 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Rho EB TS 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 6 25 24 Target M1 Mixture Design Feasibility Region ?: Possible Mixture of Blends ? : Desired Product Properties ? : Target Product Properties ?: Existing Blend Products ? 6 ? M1 82 Table 3.10: Candidate ternary mixtures and fractional contributions of the constituents. S.N. Mixture (A-B-C) Feasible A B C 1 8-2-6 Yes 15% 5% 80% 2 8-3-6 No 12% 6% 82% 3 8-4-6 No 10% 10% 80% 4 8-5-6 No 6% 11% 83% 5 8-9-6 Yes 16% 5% 79% 6 8-10-6 No 11% 8% 81% 7 8-13-6 Yes 15% 8% 77% 8 8-15-6 Yes 14% 13% 73% 9 8-16-6 No 1% 21% 78% 10 8-18-6 No 1% 20% 79% 11 8-22-6 No 12% 7% 81% 12 8-25-6 Yes 16% 7% 77% 13 8-26-6 No 9% 9% 82% 14 24-2-6 Yes 12% 6% 82% 15 24-3-6 No 10% 7% 83% 16 24-4-6 No 8% 10% 82% 17 24-5-6 No 5% 12% 83% 18 24-9-6 Yes 14% 6% 80% 19 24-10-6 No 9% 9% 82% 20 24-13-6 Yes 12% 10% 78% 21 24-15-6 No 12% 14% 74% 22 24-16-6 No 1% 23% 76% 23 24-18-6 No 1% 21% 78% 24 24-22-6 No 10% 8% 82% 25 24-25-6 Yes 13% 8% 79% 26 24-26-6 No 7% 10% 83% However, a total of 26 candidate ternary mixtures were identified. In order to validate the feasibility of the designed formulation, the AUP values of the formulated mixtures and the target must match along with matching the cluster targets. After performing this analysis, only 9 out of 26 identified 83 candidate ternary mixtures matched the cluster target and AUP value of the target product and the results are summarized in Table 3.10. The 17 infeasible mixtures, although, matched the cluster targets but had AUP values higher than ?1% of the target value (1.802). Therefore, the discrepancy was assumed to be higher than the accuracy that can be expected by graphical lever-arm analysis. 3.6 Conclusion This research effort has focused on the formulation and solution of product design problems by systematic and insightful use of past data. Properties of the raw materials, their blend ratios, and the process conditions are used to predict and enhance the performance of a target product. Using the duality of linear programming to solve the design problem in the lower dimensional property domain, instead of high dimensional component space, significantly reduced the computational complexity of the problem. Moreover, the ternary diagram provides a quick targeting tool that aides in the evaluation, analysis, and screening of alternatives. The approach differs from conventional techniques because it is non-iterative, avoids the combinatorial explosion when multiple components are involved, and avoids the difficulty of formulating and solving the mixed integer non-linear programming involved in many mixture design problems. Such a practice in industry results in fewer trials and experiments to run, thereby saving resources, capital and most importantly the product development time. However, the method and example presented above is limited to selecting, i.e. identifying candidates from a database of known raw materials. The raw materials used, are selected from a list of pre-defined candidate components, therefore limiting the performance to those components. The problem here is that these decisions are made ahead of design and are usually based on qualitative (or at best quantitative) process knowledge 84 and/or experience and thus possibly yield a sub-optimal design. In order to guarantee global optimality, all possible compounds must be considered. Chapter 4 will present a biodiesel additive design problem to demonstrate molecular design concepts. 85 CHAPTER 4 DESIGN OF BIODIESEL ADDITIVES 4.1 Introduction High prices and environmental impact related to fossil-based raw materials are the driving force for sustainable development concepts and their use in industry. This will also have implications for the design of chemical products and their production routes. Biofuels, mainly ethanol used in gasoline engines and fatty acid alkyl esters (biodiesel) as well as their blends with petro-diesel used in diesel engines, are some of the few alternatives that have not required significant new infrastructure or change on the part of consumers or auto manufacturers. As a result, these biofuels are the most widely deployed substitute for conventional fossil fuels in transportation today. In the U.S., biofuels can currently be blended up to 10% (ethanol) and 20% (biodiesel) in every gallon of fossil fuel. Among these alternatives, biodiesel has become a fast growing market and is expected to outpace gasoline demand [66, 67, 68]. Some advantages of biodiesel, compared to petro-diesel, include: ? renewability and domestic origin, ? biodegradability and sustainable, ? higher flash point, ? reduction of most regulated exhaust emissions due to lack of sulfur, ? miscibility in all ratios with petro-diesel, ? compatibility with the existing fuel distribution infrastructure, and ? inherent lubricity. 86 Any type of feedstock that contains free fatty acids and/or triglycerides such as vegetable oil, waste oil, animal fat, and waste grease can be converted into biodiesel. The American Society for Testing and Materials (ASTM) defines biodiesel as a fuel comprised exclusively of monoalkyl esters of long-chain fatty acids derived from vegetable oils or animal fats, designated B100 (100% pure biodiesel), meeting the requirements of ASTM designation D6751 [69]. Biodiesel is a very good example of chemical product design. A high viscosity and high melting point make its use directly as a fuel in common engines difficult. The conversion of triglycerides into methyl or ethyl esters through the transesterification (also called alcoholysis) process reduces the molecular weight to one-third that of the triglyceride, reduces the viscosity by a factor of about eight and marginally increases the volatility [70]. Transesterification is an ester conversion process that splits up the triglyceride (TG); that is, it takes the glycerol (GL) of the TG and replaces it with alkyl radical of the alcohol used [71]. In Figure 4.1, R 1 , R 2 , and R 3 represent long chain fatty acid radicals of the mixed TG used whereas R' represents an alkyl radical of the alcohol used. Some common fatty acids, which exist in the triglyceride molecule are shown in Table 4.1. Figure 4.1: Overall stoichiometric transesterification reaction scheme. Glycerol is the main byproduct in biodiesel production; almost 13% of glycerol comes from biodiesel production. Therefore, a valuable use of glycerol O R 1 C O CH 2 + 3R' OH (catalyst) O R 2 C O CH O R 3 C O CH 2 O R 1 C O R' O R 2 C O R' O R 3 C O R' + CH 2 CH 2 OH CH OH OH Triglyceride Alcohol Glycerol Biodiesel 87 is a very important success factor for this technology. Product design principles have to be applied to search for profitable applications that also fulfill the requirements of the sustainability principles [72]. Table 4.1: Chemical structures of common fatty acids. Fatty Acid CXX:Y Chemical Structure Myristic acid (C14:0) CH 3 -(CH 2 ) 12 -COOH Palmitic acid (C16:0) CH 3 -(CH 2 ) 14 -COOH Stearic acid (C18:0) CH 3 -(CH 2 ) 16 -COOH Oleic acid (C18:1) CH 3 -(CH 2 ) 7 -CH=CH-(CH 2 ) 7 -COOH Linoleic acid (C18:2) CH 3 -(CH 2 ) 4 -CH=CH-CH 2 -CH=CH-(CH 2 ) 7 -COOH Linolenic acid (C18:3) CH 3 -CH 2 -CH=CH-CH 2 -CH=CH-CH 2 -CH=CH-(CH 2 ) 7 -COOH Arachidic acid (C20:0) CH 3 -(CH 2 ) 18 -COOH Behenic acid (C22:0) CH 3 -(CH 2 ) 20 -COOH Erucic acid (C22:1) CH 3 -(CH 2 ) 7 -CH=CH-(CH 2 ) 11 -COOH If methanol is used, the biodiesel produced is fatty acid methyl ester (FAME). For example, the structure of stearic acid methyl ester can be obtained by replacing the H atom in the COOH- group with a CH 3 group as shown in Figure 4.2. O CH 3 O Figure 4.2: Stearic acid methyl ester. Figure 4.3 shows a typical triglyceride structure of soybean oil that is made up of mixed fatty acid fragments and a glycerol fragment. The physical and chemical fuel properties of biodiesel basically depend on the fatty acids distribution of the triglyceride used in the production. Fatty acids vary in their carbon chain length and in the number of double bonds (unsaturation level), and are represented by C XX:Y where ?XX? is the number 88 of carbon atoms and ?Y? is the number of double bonds. The fatty acid distributions of some feedstock commonly used in biodiesel production are shown in Table 4.2 [73]. Figure 4.3: A typical triglyceride molecule with different fatty acid chains of soybean oil. Since the fatty acid profile varies with different feedstock, the final quality of the biodiesel varies depending upon the parent feedstock used. A feedstock dependent fuel property is one of the technical difficulties associated with biodiesel that have limited its wide usability. Table 4.2: Fatty acid profiles of some common biodiesel feedstock Fatty acids (wt %) Feedstock C14:0 C16:0 C16:1 C18:0 C18:1 C18:2 C18:3 Sunflower ? 6.08 ? 3.26 16.93 73.73 ? Rapeseed ? 3.49 ? 0.85 64.40 22.30 8.23 Soybean ? 10.58 ? 4.76 22.52 52.34 8.19 Soybean soapstock ? 17.20 ? 4.4 15.7 55.6 7.10 Used frying oil ? 12.00 ? ? 53.00 33.00 1.00 Tallow 3?6 24?32 ? 20?25 37?43 2?3 ? Lard 1?2 28?30 ? 12?18 4?50 7?13 ? Yellow grease 2.43 23.24 3.79 12.96 44.32 6.97 0.67 Brown grease 1.66 22.83 3.13 12.54 42.36 12.09 0.82 O C O CH 2 O C O CH O C O CH 2 Glycerol fragment CH 3 -(CH 2 ) 7 -CH=CH-(CH 2 ) 7 -COOH CH 3 -(CH 2 ) 4 -CH=CH-CH 2 -CH=CH-(CH 2 ) 7 -COOH Fatty acid fragment Oleic acid (C18:1) Linoleic acid (C18:2) Linolenic acid (C18:3) CH 3 -CH 2 -CH=CH-CH 2 -CH=CH-CH 2 -CH=CH-(CH 2 ) 7 -COOH 89 Transesterification does not alter the fatty acid composition of the feedstocks and this composition plays an important role in some critical parameters of biodiesel. The vegetable oils are mainly characterized by certain fuel related properties. Some of them are tabulated in Table 4.3 [74]. Table 4.3: Fuel properties of biodiesel fuels and diesel Vegetable oil Kinematic viscosity at 38 o C Cetane No. Heating Value Cloud Point Pour Point Flash Point Density Carbon residue (mm 2 /s) ( o C) (MJ/kg) ( o C) ( o C) ( o C) (kg/l) (wt.%) Sunflower 34.4 36.7 39.6 7.2 -15.0 274 0.916 0.27 Rapeseed 37.3 37.5 39.7 -3.9 -31.7 246 0.912 0.30 Soybean 33.1 38.1 39.6 -3.9 -12.2 254 0.914 0.25 Peanut 40.0 34.6 39.8 12.8 -6.7 271 0.903 0.24 Palm 39.6 42.0 ? 31.0 ? 267 0.918 0.23 Cottonseed 33.7 33.7 39.5 1.7 -15.0 234 0.915 0.24 Corn 35.1 37.5 39.5 -1.1 -40.0 277 0.909 0.24 Diesel 2.0-4.5 51.0 43.8 ? 18 -25 55 0.820- 0.860 ? 4.2 Structure Property Relationships Chain length and number, position and configuration of double bonds account for the variation in physical properties of fatty acids. Saturated chains are highly flexible but the fully extended conformation is the most stable because of the lack of steric interference. The tetrahedral bond angle on carbon results in a molecular geometry for saturated fatty acids that is relatively linear (see Figure 4.2). This molecular structure allows a close arrangement of the fatty acid molecules with strong intermolecular interactions. On the other hand, the introduction of double bonds with predominantly cis-configuration in the hydrocarbon chain in unsaturated fatty acids results in bends of about 30 degrees in the molecular geometry (see Figure 4.4). The molecules do not arrange very closely due to chain branching. It allows more flexibility and weaker van der Waals force between the molecules. 90 O CH 3 O Figure 4.4: Oleic acid methyl ester. 4.3 Technical Difficulties with Biodiesel Use Although, biodiesel, because of its biological origin, has many advantages compared to its petroleum counterparts, it also has several technical problems that have persisted to the present and have impaired its use and commercialization. Three major limitations are [71]: ? Oxidative stability: Biodiesel undergoes oxidative degradation over time, mainly influenced by temperature and oxygen availability/exposure. Residual products of biodiesel such as insoluble gums, organic acids, and aldehydes formed from the degradation may cause engine deposits and injection problems. Biodiesel is essentially non-aromatic. Petroleum diesel contains essentially no olefinic bonds, while biodiesel can contain a significant number of these reactive, unsaturated sites that provide pathways for oxidation instability. ? Low-temperature Operability: Biodiesel contains wax molecules, which are dissolved in the fuel at higher temperatures. As the fuel temperature drops, the wax molecules begin to crystallize. At lower temperature, the larger crystals fuse together and form agglomerations that eventually prevent pouring of the fuel and plugging of filters. ? Feedstock dependent fuel property: The fatty acid profile of biodiesel is identical to that of the parent oil or fat. Therefore, feedstock origin will impact the final quality of the biodiesel product. To produce fuel grade biodiesel, the characteristics of the feedstock are 91 very important during the initial research and production stage (fuel properties of diesel and biodiesel fuels from various sources are given in Table 4.3.) Moreover, simultaneous solution of these problems has proven difficult as improvements in one area tend to impair another. Due to the inverse relationship between oxidative stability and low-temperature operability, the design of an optimal fuel for all environments can be a rather difficult task. Structural factors that improve oxidative stability adversely influence low- temperature operability and vice versa. Figure 4.5 is a graphical representation of the effects of the biodiesel feedstock profile on the biodiesel properties published by the National Renewable Energy Laboratory in 2007. Figure 4.5: Compositions of fats and oils and their effects on the fuel properties. To ensure a uniform quality of biodiesel produced from vegetable oils or animal fats, the final products must meet stringent international and regional quality requirements such as ASTM D 6751 [69] in the U.S. and EN 14214 [75] in Europe. These standards identify the parameters the pure biodiesel (B100) must meet before being used as a pure fuel or being blended with petroleum-based diesel fuel. Table 4.4 provides the specifications for biodiesel and diesel. 92 Table 4.4: Specifications for biodiesel and diesel Specification EN 14214:2008 ASTM D 6751:2009 EN 590:1999 Applies to Units FAME a FAAE b Diesel Kinematic Viscosity mm?/s, @ 40?C 3.5-5.0 1.9-6.0 2.0-4.5 Cetane number - 51 minimum 47 minimum 51 minimum Cloud point ?C - report - CFPP ?C Location and time specific - Location and time specific Oxidation stability hr, @110?C 6 minimum 3 minimum N/A (25 g/m 3 ) a refers to fatty acid alkyl esters b refers to fatty acid methyl esters The listed specifications in biodiesel standards are directly influenced by the fatty acid profile of the biodiesel fuel which is directly influenced by the fatty acid profile of parent oil used (see Table 4.1 and Table 4.2). Figure 4.6: Approaches to improving biodiesel fuel properties. Several pathways are possible for improving the fuel-related properties of biodiesel. Figure 4.6 presents an overview of the various approaches (A to E) that have been explored [76, 77]. A ppr o a c he s Unchanged fatty ester composition Additives A Unchanged fatty ester composition Change alcohol B Change fatty acid profile Physical procedures C Inherently different fatty acid profile Genetic modification D Alternative feedstock E 93 4.4 Fuel Additives Fuel additives, such as antioxidants, cetane enhancers, or cold-flow improvers have become a common and indispensable tool not only to decrease the drawbacks described above but also to produce specified products that meet international and regional standards, allowing fuels trading to take place. However, an additive solution to one problem often aggravates another problem. In addition, the questions of additive compatibility, required addition levels, the effect on other properties, and whether these additives function as designed for biodiesel fuels with differing fatty acid profiles still remain challenges that require further investigation [78]. Therefore, it is desired to molecularly design biodiesel additives to account for the unintended effect on other fuel properties in the neat and the blend fuel in order to achieve the performance properties of the petroleum- based fuel. In this way, biofuels can be formulated that are adaptable to a range or blend of feedstocks and the desirable fuel characteristics like oxidative stability and wide operating temperature range. Moreover, such a sustainable biofuel must also meet the specifications required by the transportation and the aviation industry. 4.5 Additive Design Fuel characterization data obtained with near infrared spectroscopy (see Section 2.4) is combined with property clustering techniques (see Section 2.2.2) in a reverse problem formulation (see Section 2.2.1) to design additive molecules which, when mixed with off-spec biodiesel, produce biodiesel that meets the desired fuel specifications. The characterization data consists of a multitude of property values (such as cetane number, melting point, and kinematic viscosity) specified by the aviation industry to ensure adequate performance. To facilitate an efficient design we propose consolidating these various properties into a latent property domain using principal component 94 analysis (PCA) and principal component regression (PCR) techniques (see Section 2.6.1 and 2.6.2). Characterization-based molecular design using group contribution parameters are then used to build novel additives that match the fuel specifications in the latent property space. Sustainability may be controlled by environmental and ecological constraints on the design of the additives. The additives found can then be used to offset the impact of the feedstock residuals on the biofuel blend properties. 4.5.1 Types of Additives Different additives that are commercially available to improve diesel fuel performance are selected as the training set molecules [79, 80]. Antioxidant additives can help slow the degradation process and improve fuel stability. Cold flow enhancers can improve the cold-flow properties to solve the low temperature operability problem. Cetane improvers can help improve ignition properties thereby reducing NOx emission. Table 4.5 shows some of the available additives used as training set in this formulation study. Figure 4.7. Target specifications for bio-diesel and its blend in terms of cetane number, melting point, and kinematic viscosity. 95 Since cetane number, cold flow, kinematic viscosity, and oxidation stability are critical properties for the operation of a fuel in a diesel engine, Knothe [78] proposed a three-dimensional plot of the cetane number, melting point, and kinematic viscosity to describe the optimum properties to meet the ASTM D6751 fuel requirements. Figure 4.7 visually represents this required biodiesel property space by the shaded rectangular box [78]. 4.5.2 Additive Property Estimation Since many additives are proprietary products, their physical-chemical properties are not readily available and therefore experimental data such as melting point temperature, viscosity and cetane number are difficult to find. Therefore, for the commercially available additive compounds, these properties are estimated using structural information alone. The estimated values are later used as the measured values for property model calibration. However, the property values used in the calibration would normally be measured in an industrial setup. Melting temperature: The normal melting point temperature (? ? , K) values were estimated using a group contribution expression [34]: 1 23 exp m i mi j m j k m k ij mo T NT M T OT t ?? =++ ?? ?? ?? ? (4.1) where, ? ? , ? ? and ? ? are the number of first-, second-, and third-order groups of types i, j and k, respectively. Viscosity: The dynamic viscosity (?, mPa.s) values at 300 K were estimated using the GC + method. GC + combines the group contribution (GC) method and the atom-connectivity indices (CI) method [81]. ( ) 1 23 ln( ) ii j j k k ij k N M OF? ? ? ?? ? =+ ++ ?? ? CI TermsGC Terms ???????????????? (4.2) 96 where, ?(? ? ) is a function of viscosity for all missing GC groups/fragments. ( ) 1 () K k k k F nF d?? ? = = + ? (4.3) where, ?(?) ? = a function of the viscosity contribution for a missing GC group/fragment ?, K = the number of missing groups/fragments, n = the number of times a missing group/fragment appears in the molecule, and d = a constant [81]. ( ) ( ) 01 () 2 ii i F aA b c ?? ? ??=++ ? (4.4) where, ? ? = the occurrences of the i th atom in the molecular structure, ? ? = the contribution of atom i, and b & c = adjustable parameters. The zero-order (atomic) connectivity index ( 0? ? ) and the first-order (bond) connectivity index ( 1? ? ) are defined by [82] 01 11 where, LM k ij k ii ? ? ?? ? ? ? ? ?? ??= = ?? ?? ?? ??= = = ?? ?? ?? (4.5) where, L = the number of atoms in the hydrogen suppressed graph, M = the number of bonds in the graph, and ? ? = the bond indices defined by atomic indices ? ? (the values can be found in [81]). The kinematic viscosity can be converted into dynamic viscosity through the density. Cetane number: The correlation used for estimation of the cetane number of additives is from Lapuerta et al. [83] 97 ( ) 22 21.157 7.965 1.785 0.235 0.099 db db c c CN N N N N=?+? + ? (4.6) where, ? ?? = the number of double bonds and ? ? = the number of carbon atoms in the molecule. Table 4.5 tabulates the estimated properties for different type of additives. Table 4.5: Commercially available diesel additives and their estimated properties. Type Compound exp(T m /T mo ) ln(?) CN Ox ida t iv e s t a b il it y impr o v e r s Butylated hydroxytoluene (BHT) 10.11 0.465 76.04 t-butylhydroquinone (TBHQ) 13.26 0.465 48.59 Isopropyl alcohol (IPA) 4.114 0.465 1.847 Pyrogallol (PY) 18.791 0.465 14.79 Methyl tert-butyl ether (MTBE) 3.484 0.465 16.19 C old f low impr o v e r s Polymethyl Methacrylate (PMMA) 3.035 -1.062 15.19 Ethylene glycol methyl ether acrylate (EGMEA) 4.299 -0.517 7.289 C et a n e n u m b er impr o v e r s Di-tert-butyl peroxide (DTBP) 3.138 -3.376 36.23 Methyl Oleate (MO) 7.002 2.177 44.47 Ethylene glycol (EG) 6.497 3.154 -5.623 Oleic acid (OA) 12.47 3.134 31.82 Stearic acid (SA) 11.84 3.924 62.24 4.5.3 Characterization of the Additive Molecules Infrared (IR) spectroscopy-based characterization was used to determine the chemical constituents or molecular structures of the additive training set that describe the orientation and alignment of these molecules. Using the NIST Webbook [39], the complete IR spectral region (4,000-400) cm -1 for twelve additives including antioxidants, cetane enhancers, and cold-flow improvers 98 were obtained. Ideally, both IR and NIR data would be available for the training set to provide better predictions, but unfortunately the availability of NIR data for the components of interest is severely limited. Figure 4.8: Infrared spectra of diesel additive molecules. Figure 4.8 is IR spectra of the diesel additive molecules tabulated in Table 4.5 that serve as the descriptors of their molecular architecture. The IR spectra were translated to discrete variables by a process of digitization. In this process, each spectrum is fragmented into small equal fragments (2 cm -1 resolution) along the wavelength axis resulting with 1801 frequencies (descriptor variables). 4.5.4 Additive IR Data Analysis The principal component scores were used to describe the variation in the multivariate characterization data with a minimum of variables to elucidate the underlying structure of the data. Principal components (PCs) captured the most variation possible in the smallest number of dimensions and 0 50 100 4 00 8 00 12 00 16 00 20 00 24 00 28 00 32 00 36 00 40 00 T r ans m i t t anc e ( % ) Wavenumber (cm-1) BHT IPA PY MTBE PMMA EGMEA DTBP MO EG OA SA TBHQ Functional group region Fingerprint region 99 consolidate multiple property effects into single, underlying latent variables which are devoid of collinearity. The first three principal components captured 69.53% of the total variance of the standardized IR data (Figure 4.9). Figure 4.9: Scree plot for PCA on additive IR data. Figure 4.10: PCA score plots on first third PCs for additive IR data. ? : Oxidative Stability Improver ? : Cold Flow Improver ? : Cetane Number Improver 100 Figure 4.10 is a score plot that maps the three different additives types involved in the training set. Additives that are close to each other should have similar properties, whereas those far from each other should be dissimilar with respect to the three properties of interest. It is clear from the picture that cold flow additives (?) are together in all the plots. However, oxidative stability (?) and cetane improvers (?) are scattered. This may suggest that an additive that improves oxidative stability also improves cetane and vice versa. This information is supported by the trend depicted in Figure 4.5. 4.5.5 Latent Variable Model Development As each score vector ti is a linear combination of the initial predictor variables X (the IR data), nonlinear regression models that describe response variables Y (the fuel properties) subjected to the X values are obtained. First, the PCA for variable reduction and second, multiple regression for calibration model development could be considered as a nonlinear PCR (NPCR). Nonlinear Principal Component Regression resulted in second order regression models for melting point (Eq. (4.7)) and kinematic viscosity (Eq. (4.8)), whereas, a third order regression model was developed for cetane number (Eq. (4.9)). The generalized forms of these latent variable property models are as follows: 3 33 0 1 exp m i i ij i j i i jjimo T t tt T ?? ? = <> ?? ?? ?? =++ ? ?? (4.7) 3 33 0 1 ln( ) i i ij i j i i j ji t tt?? ? ? = <> =++ ? ?? (4.8) 33 3 2 0 11 33 i i ii i ijk i k i i kj j i jji CN PC t t tt?? ? ? = = ><> =+ ++ ?? ??? (4.9) The left hand sides (LHS) of the above models are forced to have a particular form such that the respective properties follow linear additive 101 rules as described by the general group contribution model equation (Eq. (2.13)). Table 4.6 tabulates the regression coefficients for the latent variable models represented by Eq. (4.7) through Eq. (4.9). Figure 4.11 is a plot comparing the actual versus the model predicted properties. Table 4.6: Model coefficients using PCR. Properties ??? (? ? ? ?? )? ??(? ) CN 0 ? 8.1702 0.8132 65.89 1 ? 0.0793 -0.0005 -0.2833 2 ? -0.1758 -0.0545 -0.0195 3 ? -0.0401 0.0021 -1.3906 12 ? 0.0048 0.0033 - 13 ? 0.0063 -0.0024 - 23 ? 0.0144 -0.0015 - 11 ? - - -0.0491 22 ? - - 0.3826 33 ? - - -0.0632 123 ? - - 0.0038 ? 2 0.816 0.764 0.790 ? adj 2 0.595 0.480 0.422 Figure 4.11: Predicted vs. actual product properties using PCR model. 102 4.5.6 Translating Physical properties to Latent Properties The purpose of this case study is to identify candidate additives which, when mixed with a crude (off-spec) biodiesel, produce a fuel that meets the required fuel specifications presented in Table 4.7. Table 4.7 has much more stringent property constraint criteria than the one defined in Figure 4.7. Here, kinematic viscosity of 6.0mm 2 /s at 40 o C is the maximum limit in ASTM D6751 (Table 4.4). Table 4.7: Biodiesel target properties. Fuel Property T m [C] ? [mm 2 /s] @ 40 o C CN Lower Limit -60 4.51 47 Upper Limit -20 6.00 65 In order to ensure that the target property space is properly explored, a feasibility region on the ternary diagram was evaluated. The identity of the exact shape of the feasibility region without extensive enumeration was evaluated by the six unique points described by Eq. (2.12). The three physical properties (? ? ) were converted to latent properties (? ? ) by solving Eq. (4.7), (4.8) and Eq. (4.9) simultaneously. The score variables are then standardized (? ? ) using Eq, (2.43). The results are tabulated in Table 4.8. The feasibility region was mapped from the three-dimensional volume to a two-dimensional area utilizing the property clustering technique described in section 2.2.2. Figure 4.12 shows the biodiesel target feasibility region described by Table 4.8. The minimum and maximum latent properties that correspond to physical properties in Table 4.7 are in Table 4.9. 103 Table 4.8: Physical and latent properties describing feasibility region. Feasibility Region Points Physical Property Latent Property Score Standardized Score T m [K] ? [mm 2 /s] CN t 1 t 2 t 3 q 1 q 2 q 3 min,min,max 213 4.5 65 81 -21 -33 0.180 -0.048 -0.073 min,max,max 213 6.0 65 80 -21 -34 0.184 -0.049 -0.078 min,max,min 213 6.0 47 80 -20 -32 0.177 -0.043 -0.070 max,max,min 253 6.0 47 74 -19 -31 0.183 -0.048 -0.078 max,min,min 253 4.5 47 75 -20 -30 0.178 -0.046 -0.071 max,min,max 253 4.5 65 75 -21 -33 0.186 -0.052 -0.081 Table 4.9: Biodiesel target latent properties. Latent Property q 1 q 2 q 3 Lower Limit 0.174 -0.054 -0.088 Upper Limit 0.191 -0.043 -0.065 Reference value 0.06 -0.02 -0.05 The properties of a crude biodiesel feedstock were selected such that it does not meet the fuel specification target properties defined in Table 4.9. The standardized latent property values are tabulated in Table 4.10. The crude biodiesel along with the petro-diesel property values are mapped onto the cluster space as shown in Figure 4.12. Table 4.10: Crude biodiesel properties. Latent Property q 1 q 2 q 3 Crude biodiesel 0.142 -0.0480 -0.0820 104 Figure 4.12: Target feasibility region and crude biodiesel in cluster space. 4.5.7 Evaluation of Desired Additive Feasibility Region An infinite number of possible additives may exist that could be mixed with crude biodiesel. However, it is difficult to narrow down the feasible additive candidates such that the final mixture products meet the target biodiesel properties. Since the property cluster formulation enables linear mixing and lever arm analysis of the latent properties, it is straightforward to identify the feasibility region for the additives. The region bounded by the black dashed lines (shaded area) in Figure 4.13 represents the entire latent property search space for the potential additive molecules. The corresponding latent property ranges are presented in Table 4.11. The additive feasibility 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 C3 C2 C1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Crude Diesel Biodiesel Target Region ?: Diesel ?: Crude Biodiesel 105 region will serve as the property target region in the molecular design algorithm. Table 4.11: Biodiesel additive latent property feasibility region. Latent Property q 1 q 2 q 3 Lower Limit 0.519 -0.076 -0.138 Upper Limit 0.174 0.000 0.000 Figure 4.13: Desired additive design feasibility region in cluster space. 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 C3 C2 C1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Crude a c Diesel Biodiesel Target Region ?: Diesel ?: Crude Biodiesel Additive Feasibility Region 106 4.5.8 Enumeration of Desired Additive Molecules The characterization-based group contribution method (cGCM) developed by Solvason [27] was used to take advantage of the additional molecular architecture information provided by characterization data. Characterization- based molecular groups/fragments are linearly combined to formulate a molecule. The property of the formulated molecule is determined by the individual contributions of each molecular group that make up the molecule. Twenty-three molecular groups were selected from the additive training set. Some of the groups are the fundamental building blocks present in every additive molecule and represent first order contributions, while the rest are larger groups and represents higher level of contributions. These selected molecular groups are considered to be a set of basic groups which represent the chemical make-up of the training set and are listed in Table 4.12 . The infrared (IR) descriptor data [X-matrix] for the molecular fragments was compiled from Socrates [84] and is listed in Appendix B, Table B.2. The latent property contributions of each group are evaluated using the loading matrix (P T ) obtained from the PCA of the training set data such that: MA MK KA? ?? = ?T XP (4.10) The standardized score values (? ? ) are calculated as described previously (Eq. (2.43)) and the results are tabulated in Table 4.12. Visual Basic for Application (VBA) codes are used to enumerate all potential molecules from the characterization-based groups in Table 4.12 that satisfied the target property constraints described in Table 4.11. Using the group-based property model, molecular groups or fragments are added together analogous to inter- stream conservation. A maximum number of similar groups, ? ? = 2, was selected such that progressive combinations of similar groups are added until the maximum is reached. 107 Table 4.12: Molecular groups and their latent property contributions. S.N. Molecular Group Latent Property q 1 q 2 q 3 1 Methine Group, -CH- - 0.04 0.459 0.005 2 Methylene Group, -CH 2 - -0.05 -0.12 0.003 3 Methyl Group, -CH 3 -0.04 0.156 -0.025 4 Tetramethyl Group, -C(CH 3 ) 3 0.025 0.359 0.061 5 Aliphatic Methoxy Group, -O-CH 3 -0.03 -0.66 0.061 6 Vinyl Group, -CH=CH 2 0.125 -0.43 0.106 7 Vinylidene Group, CH 2 =C- - -0.04 0.339 0.002 8 cis-Vinylene Group, -CH=CH- -0.18 -0.17 0.084 9 trans-Vinylene Group, -CH=CH- -0.17 -0.18 0.091 10 Hydroxyl Group, -OH -0.02 -0.37 -0.229 11 Primary Alcohol Group, -CH 2 OH 0.093 -0.21 -0.086 12 Secondary Alcohol Group, - -CHOH 0.09 0.275 -0.056 13 Alilphatic Ether Group, -O- 0.15 -0.27 0.072 14 Alkyl Peroxide Group, -O-O- 0.211 -0.34 0.134 15 Saturated Aliphatic Ester Group, -CO-O- 0.247 0.328 0.155 16 Saturated Aliphatic Methyl Ester Group, -CO-O-CH 3 0.184 0.404 0.172 17 Saturated Aliphatic Ethyl Ester Group, -CO-O-CH 2 CH 3 0.162 0.5 0.139 18 Acrylate Ester Group, CH2=CH-CO-O- 0.13 0.179 0.14 19 Methacrylate Ester Group, CH2=C(CH 3 )-CO-O- 0.111 0.547 0.105 20 o-Alkyl Phenol Group (With H-bonding) 0.006 0.455 -0.042 21 p-Alkyl Phenol Group (With H-bonding) 0.007 0.42 -0.042 22 Monosubstituted Benzenes -0.1 -0.27 0.081 23 1,2,4- Trisubstituted Benzene 0.055 -0.42 0.069 108 The enumerated candidate molecules were then screened for structural constraints, such as free bond number (FBN), to ensure that a stable, connected molecule was formed. The candidate molecular structures that were identified to fall within the feasibility region for the additive properties in Figure 4.14 and satisfy all the constraints, are presented in Table 4.13. The list of potential molecules is mapped onto the cluster space and is shown in Figure 4.14. Figure 4.14: Cluster diagram for biodiesel blending problem. 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 C3 C2 C1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Crude a c Diesel Biodiesel Target Region CM3 CM4 ?: Diesel ?: Crude Biodiesel ?: Candidate Molecules CM1 : H 2 C=CH-C(=O)O-C(=CH 2 )-CH 3 CM2 : H 2 C=CH-C(=O)O-CH(-OH)-CH 3 CM3 : H 3 C-C(=O)O-CH 2 -CH 3 CM4 : p-Methyl Phenol Additive Feasibility Region CM1 CM2 109 Table 4.13: Results from characterization-based molecular design. I.D. Candidate Molecules T m [K] ? [mm 2 /s] CN CM1 CH 2 O O CH 3 CH 2 (Isopropenyl acrylate) 203 0.400 3.63 CM2 CH 2 O O CH 3 OH (Hydroxyethyl acrylate) 261 1.04 3.04 CM3 CH 3 O O CH 3 (Ethyl acetate) 172 0.72 2.92 CM4 CH 3 OH (4-methyl phenol) 306 9.68 7.07 In addition, solubility parameters can be used as a simple method to predict and calculate the dissolving power of the above candidate additive molecules in biodiesel as a further screening criterion. As a general rule, two substances with close solubility parameters (?) should be mutually soluble [85]. In the thermodynamics of solution, the Hansen solubility parameters related to dispersion force (? ? ), polar interaction (? ? ), and hydrogen bond interaction (? ? ) have been conveniently used to estimate the solubility or miscibility between two compounds. These parameters can be estimated from additive group contributions [85]: 110 2 dp h pi di hiii i mm m F FE VV V ?? ?= = = ? ?? (4.11) where ? ?? , ? ?? , and ? ?? are contributions from group i for calculating dispersion, polar, and hydrogen component solubilities respectively using Hoftyzer and Van Krevelen method [85]. The molar volume (? ? ) of a molecule was estimated by group contribution methods [34]. The total Hansen solubility parameter (equivalent to Hildebrand solubility parameter) can be expressed as: 222 d ph ? ???= ++ (4.12) Table 4.14: Additive solubility in FAME at 25 o C. I.D. ? d MPa 1/2 ? p MPa 1/2 ? h MPa 1/2 ? MPa 1/2 V m cm 3 /mol Feasible CM1 13.75 3.59 7.16 15.91 136.68 Yes CM2 12.44 5.12 14.05 19.45 107.35 No CM3 10.97 3.59 7.16 13.58 97.80 Yes CM4 19.55 4.77 13.78 24.39 105.37 No FAME 15.7 1.46 4.57 16.43 335.36 - Table 4.14 tabulates the solubility parameters and molar volumes of the additives and a common fatty acid methyl ester (FAME). Since linoleic acid (C18:2) represents a major constituent in the fatty acid profile (Table 4.2), it is used to check the solubility of the candidate additive molecules in methyl ester (biodiesel). Additives with lower a solubility parameter than FAME were considered miscible in most proportions with FAME (? ? 16 MPa 1/2 ). Based on this criterion, the candidates CM1 and CM3 satisfied the screening criteria and were considered feasible solutions. 111 From Table 4.14 it can be inferred that among the Hansen parameters, the high dispersion parameter values (? ? ) have the major attractive factor for FAME and the additive molecules involved in the process of solubilization. These interactions arise from induced dipoles and their strength is related to the polarizabilities of the molecules (dipole moment of FAME, CM1, and CM3 molecules are primarily from the dipole moment of the carbonyl group). The polar parameter (? ? ) and the hydrogen bonding parameter (? ? ) of FAME, CM1, and CM3 are comparatively low. Figure 4.15: Spatial representation of candidate additive molecules according to B3LYP/6311++G(3df,3dp) calculations: (a) isopropenyl acrylate (CM1), and (b) ethyl acetate (CM3). Furthermore, density functional theory (DFT) calculations were performed for the CM1 and CM3 additive molecules to optimize their geometry. Using the B3LYP method with the 6311++G(3df,3dp) basis set the lowest energy conformer for each molecule is presented in Figure 4.15. The estimated dipolar moments of CM1, CM3 and FAME are 1.54, 2.11 and 4.51 respectively. This is in agreement with the polar parameters obtained with Hansen?s theory (Table 4.14). The difference between dipole moments (|? ???? ?? ???????? |) is at a minimum for CM1, which corresponds to the better affinity between CM1 and FAME and to the highest solubility limits of this additive. Consequently, we can conclude that both the polarity and spatial 0 o -67 o (a) (b) 112 configurations of feasible additive molecules are involved in the process of solubilization. 4.6 Conclusion Unlike in group contribution methods (GCM), where parameters of the contribution are obtained by fitting the group contribution model to experimental data for a set of chemical compounds, characterization-based group contribution method (cGCM) will account for the chemical information thus extending the predictive capabilities of this method. Additionally, insights into the molecular structures of the candidates and candidate mixtures were obtained by incorporating the dipole moment (an important descriptor of a molecule) indirectly through the use of IR as a descriptor of molecules. It should be noted that the limited amount of data and supporting characterization information available in the training set (and thus the molecular fragments available for molecular design) impacts the chemical stability/feasibility of the molecules that are generated in this step. Additional training set data can improve the quality of the predictions and thus increase the application range but would not require a different optimization methodology. 113 CHAPTER 5 REVERSE DESIGN OF IONIC LIQUIDS 5.1 Introduction Design of environmentally benign solvents and alternative media for extraction and purification are new challenges within chemical product design. One class of novel compounds being studied for such application is ionic liquids (ILs) [11, 86, 87, 88]. They have become the subject of an increasing number of investigations due to their unique properties such as high polarity, stability at high temperature, flame resistance, and negligible vapor pressure. Ionic liquids that have tailored structures with an array of unique functional properties can have important applications in areas such as CO 2 capture and sequestration, sulfur removal from fuels, energy storage, biomass pre-treatment, and chemical separations. Through variation of both cation and anion, particular ionic liquids with tunable physical properties can be tailored. For example, the miscibility of ionic liquids with water or organic solvents can be varied with alkyl chain lengths on the cation and the type of anion present. It is estimated that over 10 14 unique cation/anion combinations are possible for use as room temperature ionic liquids, the majority of which have never been synthesized [87]. Thus, it is essential to develop a logical and systematic approach of selectively choosing a given ionic pair that matches a set of desired physico-chemical property targets. However, the traditional experimental trial-and-error approach of searching through this large molecular space is unrealistic as it is both time and labor intensive. An 114 example of possible combinations of cation, anion, and the alkyl chain of the side chain attached in the cation is represented graphically in Figure 5.1. Figure 5.1: Selection of anions, cations, and side chains attached in cations for a task specific ionic liquid application. Computer-assisted approaches towards the automated design of chemically formulated products with desired physical, chemical and biological properties hold immediate potential. In such methods, predicting the properties of ILs would be necessary for molecular design of such compounds. The advantage of the newly developed characterization-based group contribution technique [27] has previously been demonstrated to predict physical-chemical properties and design of biodiesel additive molecules (Section 4.5.3) [53]. By exploiting the fact that molecules absorb specific frequencies that are characteristic of their structure, vibrational spectroscopy is used to elucidate chemical constituents, and the orientation and alignment of molecules. For example, infrared (IR) spectroscopy-based characterization CH 3 N N + CH 3 B - F FF F Alkyl chain Cation Anion ? 4 BF 6 ? PF 33 ? CF SO 27 ? Al Cl 3 ? NO 4 ? AlCl 2 ? Tf N halide NN + R 1 R 2 S + R 1 R 2 R 3 N + R 2 R 2 Ionic Liquid 3 CH 2 CH OH 22 CH CH=CH 23 CH OCH 3 CH O 115 contains large quantities of descriptor data involving information on molecular architecture at atomic-, nano-, and micro-scales to describe physical properties and attributes of chemical products. Some common characterizations used to quantify molecular architectures at other length scales include nuclear magnetic resonance (NMR) and x-ray diffraction (XRD) (See Section 2.4 for more detail). In addition, the characterization-based group contribution method (cGCM) utilizes the latent property parameters based on characterization data instead of conventional regression-based property parameters which often exhibit poor attribute-property relationships [27]. The choice of an appropriate spectra training set is vital to building the latent variable structure since the training set defines the molecular architecture building blocks that can be used in the cGCM. In order for the method to be independent of the availability of experimental spectroscopic data, in this Chapter we investigated the use of density functional theory (DFT) based simulation techniques to generate the required IR spectra as molecular descriptors to develop predictive property models that can be used for the reverse design of ionic liquids. 5.2 Density Functional Theory A set of ionic liquids with three properties were compiled from IUPAC Ionic Liquid database (ILThermo) [89] and listed in Table B.3. For each of these ionic liquid, the ion-paired structures were drawn into the Accelrys Draw 4.0 program to develop an initial 2-dimensional MDL Molfile. This information was then imported into Avogadro (v1.0.3) [90] to develop an initial geometry utilizing a molecular mechanics (MM) force-field for energy minimization. The MM force-field chosen for this task was MMFF94, also known as the Merck Molecular Force-Field. MMFF94 has been parameterized for a wide range of organic chemistry calculations and several charged molecules have 116 been included as well, most notably imidazolium cations [91]. For IL?s with phosphate containing anions, the unified force-field (UFF) was utilized to generate starting point geometries because the MMFF94 force-field is not suited to handle these types of molecules. With this starting point geometry, the final molecular geometry optimization was performed using the quantum chemical Gaussian 09 program [92] executed through a supercomputing cluster run by the Alabama Supercomputing Authority. As a molecular model to simulate the pure ionic liquid, ion-paired structures combining the cations and anions were optimized as a whole. The DFT method B3-LYP utilizing the 6-311G(2d,p) basis set in the ideal gas phase was used for molecular geometry optimization of all ion-paired structures in Table B.3. The method and basis set were chosen because similar computational level (B3-LYP/6-311+G(2d,p)) has previously been demonstrated to have the lowest root-mean-square error (when compared to Hartee-Fock (HF) method utilizing similar basis sets) for simulated fundamental molecular vibrations [93]. Figure 5.2 shows the optimized structure of the stable gas phase (local potential energy minimum) 1-ethyl-3-methyl imidazolium hexafluorophosphate [emIm]PF 6 ionic liquid conformer . Figure 5.2: Geometry optimized molecular structure of [emIm]PF6 at the B3LYP/6-311G(2d,p) computational level. 117 Vibrational frequency calculations were performed for each ion-paired structure to confirm the presence of an energy minimum. This is confirmed as the resulting frequencies were all positive values (no imaginary frequencies) [93]. Negative values appearing in this data would reveal the presence of a transition state geometry since the vibrational modes are derived from a square root of the force constants from the Hessian matrix (Eq. B.6). In general, the values of harmonic vibrational frequencies determined by ab initio computational methods are larger than the experimental frequencies because the methods neglect anharmonicity effects, incompletely incorporate electron correlations and uses finite basis sets [94]. The vibrational spectra calculated using DFT (which incorporates electronic correlation) requires a correlation factor of 0.965 at the B3-LYP/6-311G(2d,p) level [95]. The base set is a mathematical function to approximate the electronic wave function. Figure 5.3 (a) shows the high correlation (R 2 of 0.998) of the IR frequency calculated to the available experimental data [93] for [emIm]PF 6 . Figure 5.3 (b) shows the resulting simulated vibrational spectra using Gaussview software. Figure 5.3: (a) Calculated vs. experimental IR frequencies and (b) infrared spectrum at 2cm -1 resolution of [emIm]PF6 at the B3-LYP/6-311G(2d,p) computational level. ? (cm -1 ) R? = 0.9983 0 500 1000 1500 2000 2500 3000 3500 0 500 1000 1500 2000 2500 3000 3500 C a lc u la t e d V ib r a t io n , ? ( c m - 1) Experimental Vibration, ? (cm -1 ) 118 Table 5.1 contains the theoretical and experimental vibrations along with vibrational assignments for [emIm]PF 6. Table 5.1: B3LYP/6-311+G(2d,p) vibrational assignments (cm -1 ) of [emIm]PF6. Frequency (cm -1 ) Assignment Calculated Experimental Cat-An tors 21 - Cat-An bend 39 - Cat-An bend 56 - Cat-An bend 62 - Cat-An bend, CH 2 (N) bend 99 - at-An bend, CH 2 (N) bend 106 - CH 3 (N) twist 127 - CH 3 (N) twist 162 - CH 2 (N) bend 186 - terminal CH 3 twist 210 - CH 3 (N) bend 218 - PF 6 scissors 284 - PF 6 scissors 287 - PF 6 scissors 295 - CCH bend, CH 3 (N) bend 296 - CCH bend, CH 3 (N) bend 326 - CCH bend, CH 3 (N) bend 389 - PF 6 scissors 434 - PF 6 scissors 439 - PF 6 scissors 440 - FPF sym str 447 - FPF sym str 472 - FPF sym bend 527 - FPF sym bend 528 - FPF sym bend 531 559 ring ip sym bend 569 559 ring op asym bend 570 625 119 Frequency (cm -1 ) Assignment Calculated Experimental ring op asym bend 593 649 FPF sym str, ring op bend 695 - ring ip bend, CH 3 (N) bend, CH 2 (N) bend 710 - ring HCCH sym bend 749 741 CCH bend 812 - ring HCCH asym bend 826 836 FPF asym str, ring HCCH asym bend 875 838 FPF asym str, ring NC(H)N bend, CCH bend 900 847 FPF asym str 905 - ring NC(H)N bend 907 - CC str 960 - ring ip sym str 981 - ring sym str, CH 3 (N) str, CH 2 (N) str 1012 1033 CC str, ring ip sym str 1063 1087 ring HCCH sym bend, ring ip sym str 1082 - ring HCCH sym bend, ring ip sym str 1138 - CC str 1159 1114 CH 3 (N) HCH bend 1173 - ring sym str, CH 3 (N) str, CH 2 (N) str 1200 1172 ring ip asym str, CC bend 1216 - ring ip asym str, CH 3 (N) str 1267 - ring ip sym str, CH 2 (N) str 1298 1340 ring ip sym str, CH 2 (N) str, CH 3 (N) str 1328 - ring ip asym str, CH 2 (N) bend 1382 1387 CC str 1429 - ring ip asym str 1447 - ring ip asym str, CH 3 (N) str 1468 1432 CCH HCH sym bend, CH 3 (N)HCH sym bend 1505 - CCH HCH asym bend, CH 3 (N)HCH sym bend 1513 1468 CH 3 (N) asym bend 1522 - CC HCH bend 1533 - CH 3 (N) HCH sym bend 1538 - ring ip sym str, CH 3 (N) str, CH 2 (N) str 1547 1575 120 Frequency (cm -1 ) Assignment Calculated Experimental ring ip asym str, CH 3 (N) str, CH 2 (N) str 1591 1575 terminal CH 3 HCH sym str 3059 2878 CH 3 (N) HCH sym str 3065 - CH 2 HCH sym str 3076 - terminal CH 3 HCH asym str 3102 2952 CH 2 HCH asym str 3109 - CH 3 (N) HCH asym str 3137 - CC HCH asym str 3145 - CH 3 (N)HCH asym str 3157 3115 ring NC(H)NCH str 3176 3134 ring HCCH asym str 3300 3168 ring HCCH sym str, ring NC(H)NCH str 3316 3179 5.3 Data Analysis and Model Development IR data of 22 training set IL molecules generated using DFT contained 701 descriptor variables at 2 cm -1 resolution for (3500-0) cm -1 frequency range. Using PCA, the first three principal components captured about 60% of the total variance. The score values are listed in Table 5.2. The loading values are not presented as they constitute a matrix of [701x3]. Using PCR, the latent variable property models (Figure 5.4) were developed. The generalized expressions are shown in Eq. 5, 6, and 7 for dynamic viscosity, density, and melting temperature, respectively. Table 5.3 contains the respective model coefficients. Note that the log, inverse, and exponent transformations of the calibrated QSPR models were made to ensure linear mixing of the property operators of these properties. 33 2 0 11 ln( ) i i ii i ii tt?? ? ? = = =++ ?? (5.1) 121 33 2 0 11 1 ii iii ii tt?? ? ? = = =++ ?? (5.2) 3 33 3 2 0 11 exp m i i ij i j ii i i i j ji i mo T t tt t T ?? ? ? = <> = ?? =++ + ?? ?? ? ?? ? (5.3) Table 5.2: PCA score values for X- and Y-blocks. X score value t 1 t 2 t 3 -3.102 -1.676 2.218 37.75 -0.137 6.502 25.57 -1.434 5.686 -21.98 -10.25 7.347 -10.86 -1.346 -11.18 -0.315 -1.392 3.867 -0.923 1.408 -8.155 28.34 1.497 3.089 6.778 0.023 -7.867 -0.438 -0.708 1.950 18.46 9.313 -5.078 -6.854 -1.547 -7.027 -6.392 0.682 -11.33 -5.741 -0.417 -6.179 18.04 -1.227 4.456 -16.84 -11.40 8.355 -13.73 -3.260 2.640 -23.92 -5.838 9.794 -36.60 -1.853 7.743 -8.058 0.288 -10.72 -25.84 29.45 10.24 14.50 -1.365 3.655 -4.452 -0.656 -2.259 122 Figure 5.4: Predicted vs. actual IL properties using PCR model. Table 5.3: Model coefficients using PCR. Properties ??(?) ? ?? ??? (? ? ? ?? )? 0 ? -1.273 8.13e -4 6.778 1 ? -0.0736 -6.03e -7 -0.0311 2 ? 0.315 -3.25e -6 0.0152 3 ? 0.175 1.05e -5 -0.0206 13 ? - - -0.0137 23 ? - - 0.0293 11 ? - -1.14e -7 0.0035 22 ? -0.0162 - -0.0168 ? 2 0.609 0.643 0.541 ? adj 2 0.517 0.559 0.312 The poor accuracy of predictions for the properties of interest could be related to the quality of the experimental data used for calibrating the property models. Moreover, the gas phase nature of IR simulation has difficulties to take into account the structural features of ionic liquids in liquid state (electrostatic, van der Waals interactions, hydrogen bonds, etc.). In addition, accounting for the impact of variation in the structure of ions (size, symmetry, conformational flexibility, etc.) on properties is always challenging [31]. For example, melting point (? ? ) of ILs decreases with increase in the length of alkyl substituents due to reduction of electrostatic 123 interactions between ions. However, this trend is not always true. With the length of alkyl group increasing, van der Waals attraction between bulk alkyl radicals favors an increase in (? ? ). Experimental observations have shown oscillation of the melting point of ILs with the size of alkyl groups [96]. In addition, characterization of ionic liquids using the vibrational spectroscopy techniques like IR works better on molecules with covalent bonds. Although, ionic liquids have some degree of covalent bonding, the ionic character dominates the covalent character. 5.4 Reverse Design of Ionic Liquids using QSPR and cGCM Identification of ionic liquids that possess task specific properties through time consuming experiments and simulations of individual alternative molecules is virtually impossible. However, there are relatively small numbers of distinct building-blocks/functional-groups/fragments that can theoretically be combined to generate a wide variation of possible molecules. In order to capture the group contribution and interaction variability in the 22 training ILs (listed in Table B.3), a total of fifteen groups are selected. Table 5.5, Table 5.6, and Table 5.7 contain seven anion, six cation, and two alkyl chain cation substituents, respectively. These selected molecular groups are considered to be a set of basic groups which represent the chemical make- up of the training set. Anion groups are considered as a whole molecule without any possible attachment position. Cation groups consist of imidazolium, pyridinium, and alkyaminium bases with a maximum of four possible alkyl group attachments. Two alkyl groups are selected as possible alkyl chains attached to cation groups as these were the only groups present in the training set. All these groups are considered first order in the group contribution-based property estimation of the formulated ionic liquid molecule. Since most of the groups cover a significant proportion of a potential IL molecule, they 124 incorporate most of the interactions between the groups. Moreover, the use of projection methods for property model calibration also capture the most common features and underlying latent relationships resulting from group- group, ion-ion, and ion-group interactions among the ionic pairs in the training set molecules. The loadings (P) from PCA preserve the covariance structure which can be applied to reverse design of ionic liquid molecules that are consistent with structural attributes in the training set molecules. The infrared frequency data for the anion, cation, and alkyl groups are generated using B3-LYP/6-311G(2d,p) computational level and by following the procedure outlined in Section 5.2. These IR descriptors values are then transformed to score values using ? = ???. Finally, the three latent properties are obtained by normalizing the score values to follow linear mixing in terms of property operators (see Section 2.7 for detail). The latent property values which are the contributions of each group to the three properties of interest (?, ?, and ? ? ) are tabulated in Table 5.5, Table 5.6, and Table 5.7. As a proof-of-concept example, an ionic liquid is to be designed as an alternative benign solvent that should replace a molecular solvent with environmental and health hazard properties. Table 5.4 presents the target property ranges in terms of viscosity, density, and normal melting temperature for a certain task-specific application. The corresponding latent property values were obtained following the steps outlined in Section 2.6.2. Table 5.4: Target ionic liquid properties. Physical Properties Latent Properties ? [Pa.s] ? [kg/m 3 ] T m [K] q 1 q 2 q 3 Lower Limit 1.0 1200 230 0.0004 -0.0077 -0.0346 Upper Limit 1.1 1300 300 0.0671 0.4807 0.0248 125 Table 5.5: Anion groups and their latent property contributions. Latent Property Anion Groups q 1 q 2 q 3 O - SO CH 3 O O Methylsulfate 0.43 2.45 -0.08 P - F F F F F F Hexafluorophosphate 0.15 -1.91 -0.14 B - F FF F Tetrafluoroborate 0.14 -2.92 0.74 NN N - Dicyanamide 0.27 7.70 -0.71 N - S S O O O O F F F F F F Bis(trifluromethylsulfonly)-amide 0.11 3.51 1.40 S O - O O F F F Triflate -0.06 -2.67 0.70 S O O - O O CH 3 Octylsulfate -0.04 -5.17 -0.90 126 Table 5.6: Cation groups and their latent property contributions. Latent Property Cation Groups Valency q 1 q 2 q 3 N + CH 3 N R methyl imidazolium 1 -0.06 0.73 0.16 N N + RCH 3 methyl imidazolium (+) 1 -0.20 -0.27 0.00 N N + CH 3 R CH 3 2,3 dimethyl imidazolium 1 0.55 0.12 0.37 N + R Pyridinium 1 -0.43 0.33 -0.16 N + R CH 3 CH 3 CH 3 Triethy alkyaminium 1 0.91 0.99 0.67 N + R R 1 R 2 R 3 Trialkyl alkyaminium 4 0.23 -0.90 -0.04 127 Table 5.7: Alkyl chain attached to cation groups and their latent property contributions. Latent Property Alkyl Groups Valency q 1 q 2 q 3 CH 3 Methyl 1 -3.08 0.08 -0.37 ?CH 2 ? Methylene 2 4.08 0.92 1.37 The reverse design of potential IL molecules is accomplished by exhaustively generating combinatorial structures from the given molecular fragments until the resulting properties match the target property values in Table 5.4. In exhaustive searches, selection from among numerous permutations of anion, cation, and alkyl chain attached to cation groups is performed. First, an anion is selected, and the properties of interest are evaluated by using a group-based property estimation method, changing the cations and the length of side alkyl chain attached to the cation. For every cation; CH 3 groups, equal to the free bond number (FBN) of the cation, are added as a cap at the free end of the cation. Finally, varying numbers of CH 2 groups are added until the sum of property values of all the groups fall within the target property range. In this case study, a maximum of fifteen CH 2 groups are allowed to occur in a generated ionic liquid pair. Figure 5.5 depicts a scenario for the reverse design of ILs. The molecular property is estimated based on the first order group contribution method: 15 1 j gg g n = ? = ?? ? (5.4) where ? ? is the normalized latent property operator of property ? (which includes 3 properties), ? ? is the similar number of group appearances, and ? ? is the normalized latent property contribution of the appeared group. 128 Figure 5.5: Scenario of reverse design of ILs. A heuristic and exhaustive search algorithm in the Python programming language was written to automate the generate-and-test procedure to enumerate all possible IL molecules by selecting collections of groups, estimating their physical properties, and checking these against the target values. A total of 10,638 possible candidate were generated. Only 26 of these enumerated IL molecules satisfied the target property specifications in latent space (Table 5.4) and the free bond number (FBN) structural constraint. The structures whose FBN equals to zero ensure feasible, stable, and connected Target Properties Viscosity, Density, Melting Point Temperature etc. Generation of IL Structures from Combinatorial Building Blocks Alkyl chain R 1 ? 4 BF 6 ? PF 33 ? CF SO 27 ? Al Cl 3 ? NO 4 ? AlCl 2 ? Tf N halide NN + R 1 R 2 S + R 1 R 2 R 3 N + R 2 R 2 Cation R 2 + Anion - Select appropriate IL molecular building blocks Exhaustive search by computer Structured IL Molecules Meet target property values, Electronically complete 3 CH 2 CH OH 22 CH CH=CH 23 CH OCH 3 CH O 129 molecules are formed (see Section 2.8.2). Out of 26 of these molecules, 23 solutions were not present in the training data set (Table B.3). Table B.4 contains 23 unique solutions in latent or principal property space. When these 23 IL molecules were mapped from principal property space to physical property space, only 13 IL molecules satisfied the target physical properties (Table 5.4). It must be noted that all 23 solutions are feasible; however, only 13 are feasible in both the spaces. The loss of 10 molecules when properties of the IL molecules were transformed from the latent space to physical space could be contributed to the uncertainties associated with the property models. Figure 5.6: Enumerated and validated candidate IL molecules. Table 5.8 lists the final candidate IL molecules for which predicted properties matches the target properties and are structurally feasible. The ID numbers and candidate molecule names are the same as in Table B.4. These molecules could further be subjected to experimental and/or more detailed computational simulation tests. The enumeration and selection after validation of IL molecules is schematically presented in Figure 5.6. 130 Table 5.8: Thirteen candidate ionic liquid molecules that match target properties in property space. ID Candidate Molecule Physical Properties ? [Pa.s] ? [kg/m 3 ] T m [K] 3 [e 3 pNH 4 ]mSO 4 1.00 1248 267 4 [meIm]PF 6 1.00 1226 290 7 [meIm]BF 4 1.00 1261 270 8 [e2,3m 2 Im]BF 4 1.00 1261 266 9 [ePy]BF 4 1.00 1291 260 11 [e 3 bNH 4 ]BF 4 1.00 1247 267 12 [e 3 pNH 4 ]dCN 1.00 1259 261 14 [e2,3m 2 Im]Tf 2 N 1.00 1247 267 17 [e 3 bNH 4 ]Tf 2 N 1.00 1233 247 18 [meIm]CF 3 SO 3 1.00 1263 266 21 [e 3 bNH 4 ]CF 3 SO 3 1.00 1221 277 22 [ePy]oSO 4 1.10 1203 252 23 [e 3 pNH 4 ]oSO 4 1.00 1255 262 5.5 Conclusion Ionic liquids (ILs) as green solvents can be used in separation processes, chemical synthesis, catalysis and electrochemistry, successfully replacing the conventional volatile, flammable and toxic organic solvents. Within the computer-aided molecular design (CAMD) framework, a characterization based method was combined with chemometric techniques towards the design of IL structures corresponding to particular physical properties. Infrared spectra (IR) generated from density functional theory (DFT) simulations were used as molecular descriptors for capturing information on molecular architecture, and for calibration of latent variable property models to design ILs in a logical and systematic methodology. The use of DFT eliminates the dependency on the availability of experimental IR data, 131 thereby extending the capabilities of a design method based on such characterization techniques. In addition, the design of ILs using the characterization-based group contribution method (cGCM) further demonstrated the advantages of using it compared to the conventional GCM for predicting properties. For new class of chemical compounds such as ILs, tabulated group contribution values for many molecular groups are not available in the literature. Here, cGCM expands the application range of general GCM to predict properties of the enumerated IL molecules from the combination of IL molecular fragments by using the molecular information captured from characterization technique based on infrared spectroscopy data. 132 CHAPTER 6 FUTURE WORK 6.1 Methodology Improvements The chemical product formulation methodology presented in this dissertation utilizes multivariate data analytics in three areas: (1) characterization, (2) modeling, and (3) design. Each of these aspects in product design plays an important role at their respective levels. The contribution of this work has been in the development of a generalized methodology that integrates all of the above tools and techniques to design new and improved materials with tailored properties. The applications are presented in the form of case studies involving the design of biofuel additives, thermoplastic formulation, and ionic liquid. Several of the tools and the techniques combined in this work may not be an ideal choice; however, the general problem formulation as well as the solution approach can be extended to other problems in chemical engineering. In the future, some of the areas of improvement can be seen in each of the multivariate analytics. Figure 6.1 presents schematics of the potential tools and techniques in the characterization, modeling, and design. 6.1.1 Multi-Dimensional Characterization In silico molecular design approaches presented in the foregoing chapters can take advantage of an increasing amount of techniques available to characterize molecular architectures. Recently, the advancement of high performance computers and robust theories, such as quantum chemistry, information theory, graph theory, etc., are accelerating a paradigm shift in 133 the molecular modeling. Today, thousands of molecular descriptors that capture and transform the information encoded in the molecular structure are effortlessly generated using readily available algorithms and software. Each molecular descriptor takes into account a small part of the whole chemical information contained within a real molecule. It is important to consider the whole environment of the compound as a potential source of information to investigate its interplay with physicochemical properties and biological activities. QSPR/QSAR modeling is an integral part of in silico molecular/drug design approaches. Figure 6.1: Improvements in the process of multivariate characterization, modeling, and design. Since knowledge gained from chemical data alone is inadequate for success in drug discovery, it is important to closely combine or merge chemoinformatics and bioinformatics in the future works. Among many computer programs, DRAGON TM computer software can generate more than 1,600 molecular descriptors, covering topological, molecular, and 3D Multivariate Characterization Modeling Design ? Descriptors including: 0D, 1D, 2D, 3D.... ? Computer programs: ADMET Predictor, POLLY, DRAGON ? Multi-way analysis: Parafac, Tucker.... ? Clustering: NN, SVM, K-mean, Decision Trees, LDA... ? Stochastic optimization: GA, SA, Tabu search.. Mutation Selection Initial Molecule/Group Final Molecule 134 properties of molecules [97]. E-DRAGON applet is an electronic version of DRAGON and is freely available at www.vcclab.org/edragon/. Likewise, ADMET Predictor TM is a powerful software that can estimate a number of properties associated with physicochemical and biological properties of drug- like chemicals and their molecular structures [98]. The addition of descriptors with diverse chemical features and wider molecular space coverage would make the presented methodology more flexible and robust, but would not require a new problem formulation and solution framework. In addition, since spectroscopic methods are fast and non-invasive, their use has grown considerably in recent decades in both research and industries such as food, petrochemical and pharmaceutics production. To establish more chemical features of samples, two distinct analytical methods can also be combined. For example, chromatography can be applied first to separate the components, while infrared spectroscopy can be applied later to identify them. When two distinct methods are combined, in so-called hyphenated methods, the resulting data is multivariate and multi-way. Likewise, fluorescence excitation-emission matrix (EEM) spectroscopy is a flexible, rapid, and portable organic matter characterization tool. In order to use these characterization techniques, however, two-way data decomposition techniques such as principal component analysis (PCA) and partial least square (PLS), used in this dissertation, cannot be applied without further extension. 6.1.2 Multi-Way Modeling Decomposition of multi-way data arrays can be accomplished by using parallel factor analysis (PARAFAC), which is an extension of the two-way PCA to three-way systems. Unlike the bi-linear PCA model, the tri-linear PARAFRAC model factors do not have rotational freedom, and thus the solution is essentially unique [99, 100]. Figure 6.2 (a) and Figure 6.2 (b) show a schematic of two-way data and three-way tensor decomposition, 135 respectively. While PCA has a score and loading matrix (Figure 6.2 (a)), PARAFAC has a score matrix and two loading matrices (Figure 6.2 (b)) for three-way systems. Figure 6.2: The decomposition of X-block by (a) PCA and (b) PARAFAC. Another important area where PARAFAC becomes instrumental is in the batch-wise manufacturing processes that are common in chemical, pharmaceutical, and semi-conductor industries. Batch processes give rise to three-dimensional matrices as shown in Figure 6.2 (b). The mixture formulation problem presented in Chapter 3 involves a two-way data structure. The PARAFAC decomposition of a three-way tensor ? is represented by Eq. (6.1). ` ,, , , , ,, 1 1,...., ; 1,..., ; 1,..., = = ??+ = = = ? A i jk ia ja ka i jk a x a b c e i Ij Jk K (6.1) For batch processes, Eq. (6.1) contains I batches, J time points, and K variables. For fluorescence EEM, Eq. (6.1) contains I samples with J emission wavelength and K excitation wavelengths. Analogous to PCR regression, = + a 2 a 1 b 1 b 2 c 1 c 2 (b) EEM Fluorescence t 2 t 1 = + p 1 p 2 (a) NIR 136 PARAFAC can be combined with MLR to obtain a multi-way regression model: decompose ? using PARAFAC and regress ? on the PARAFAC scores ?` as presented in Sections 2.6.2 and 2.6.3. In the other hand, when linear approximations (such as PCA or PARAFAC) are not valid, neural networks (NNs), which use a series of weights (? ? ) and hidden neurons to detect complex and non-linear relationships between inputs and outputs, can be utilized [101]. Figure 6.3 is a schematic for an artificial neural network. NNs are non-linear statistical data modeling tools suitable for high-dimensional and non-linear data such as the data generated from EEM fluorescence spectroscopy. They are an adaptive system that changes its structure based on external or internal information that flows through the network. They are useful when predictive accuracy is the most important objective. Figure 6.3: Structure of an artificial neural network. 6.1.3 Stochastic Search and Optimization The deterministic solver is used in the molecular design problems presented in this dissertation, involving the design of biodiesel additives and ionic liquids. The developed methodology shifted the product design paradigm from 137 guess-and-test to generate-and-test approach. Molecules with desired properties are generated using dynamic programming that searches through molecular space defined by a set of combinatorial building blocks. The combinatorial explosion is minimized by using meta-groups (see Section 2.8.2). In group-based property estimation methods, these meta-groups are treated as first order groups. The problem formulation using a deterministic approach is able to handle the presented design problem because the molecular search space is small. However, when the molecular search space becomes combinatorially large and nonlinear, stochastic optimization algorithms like evolutionary algorithm [9], simulated annealing (SA), tabu search [11], and artificial neural networks (ANN) are more appropriate and effective computer algorithms. Genetic algorithms (GAs) are stochastic evolutionary searches that use an analogy to chromosome encoding and Darwinian model of natural selection and evolution [9]. GAs are useful for combinatorial problems dealing with highly complex and a highly dimensional search space. Unlike traditional search methods that move in the objective area deterministically (point-by-point forward movement), GAs move probabilistically (parallel movement) in the optimal direction. In GAs, by stochastically favoring the mating of a more fit population of molecules, the most promising areas of the search space are explored at the expense of low performance regions [9]. 6.1.4 Managing and Handling Uncertainty Uncertainty accumulates through multiple steps in variable transformation and can make the variance of the final response undesirably large. The identification, quantification, and communication of model prediction uncertainties are important steps in determining the usefulness of any model. Uncertainty analysis techniques such as uncertainty propagation (forward and inverse) need to be incorporated into model calibration and design methodologies. By developing new design methods that are capable of 138 dealing with uncertainty in multi-scale models of materials and its propagation through subsequent design and analysis, the application range increases and design becomes robust to uncertainties in the design process. Importance of model uncertainty becomes greater as the extrapolation becomes farther from the historical or training data set. 6.2 Design of Inherently Benign Chemical Process Routes Traditionally, process design was guided by two major facets, technical and economic decision criteria. However, the incorporation of two further dimensions of sustainability, the ecological and the social aspects, simultaneously into the early process design stages has become necessary in recent years [102]. Figure 6.4 depicts different aspects involved in early process design. Opportunities for identification and development of inherently safer process alternatives for solvents, reaction paths, catalysts, etc., are most abundant during the early design stage. An inherently safer process avoids or reduces hazards instead of controlling them [103, 104]. Figure 6.4: Aspects of early process design. Analogous to the molecular architecture descriptors used in molecular design work, various process route descriptors related to environment, health, and safety factors such as toxicity, reactivity, flammability, heat of Economic Sustainability Ecological Technical Process Design Aspects Social Environmental Life-cycle Health Safety 139 reaction, temperature, process yield, properties of chemicals involved, etc., can be explored and used to design inherently safer synthesis pathways. Here, the primary objective is to minimize the amount of raw material usage, energy usage, waste generation, and the hazard to both life and environment. Many index-based approaches exist to compare and rank routes based on environment, health, and safety (EHS) hazards and life cycle assessment (LCA) impacts, however, they generally suffer from scaling and multivariate related issues. The recently proposed statistical analysis based Inherent Benign-ness Indicator (IBI) framework by Srinivasan and Nhan [103] alleviates the above shortcomings. They considered principal components (PCs) as statistically independent factors of the routes instead of traditional empirical weighting schemes. The scores and loadings on the first two PCs make up the IBI and help characterize the inherent benign-ness of a process. The larger the IBI value, the less benign the route is. The work in this dissertation can be easily extended to use this PCA- based IBI methodology to not only investigate the most inherently safer process route among different routes examined, but also to identify the similarities and differences in the EHS footprints of the available routes. This information and insight can be used to determine modifications needed to improve a route?s benign-ness through molecular/mixture design. Reverse problem formulation could be applied as an integration procedure where the process unit performance that minimizes the IBI could be identified in the first step and then the molecule/mixtures that match the targets process performance would be identified in the second step. The computer-aided molecular/mixture design (CAMD) framework combined with the property clustering technique would then be employed to generate products that meet the design constraints. If successful, the method could provide a new generation of greener chemical substitutes (e.g. solvents, catalyst, lubricants, heat transfer fluids, refrigerants, surfactants, etc.) that result in lower safety, health and environmental impacts. The algorithm will 140 be computationally efficient as it will be based on the targeting approach mentioned before in this dissertation. Moreover, simultaneously integrating inherent safety principles (using safer materials and operating conditions, minimizing inventory, etc.) into process and product design, from conception to completion, could lead to a more sustainable process. 141 REFERENCES. [1] Gani, R. (2004). Chemical product design: challenges and opportunities. Computers & Chemical Engineering, 28(12), 2441-2457. [2] Cussler, E. L., & Moggridge, G. D. (2001). Chemical Product Design. USA: Cambridge University Press, New York, ISBN 0521791839. [3] Hill, M. (2009). Chemical Product Engineering - The third paradigm. Computers & Chemical Engineering, 33(5): 947. [4] Ertl, P. (2003). Cheminformatics analysis of organic substituents: identification of the most common substituents, calculation of substituent properties, and automatic identification of drug-like bioisosteric groups. Journal of Chemical Information and Computer Sciences, 43(2), 374-380. [5] Achenie, L. E. K., Gani, R., & Venkatasubramanian, V. (Eds.). (2002). Computer Aided Molecular Design: Theory and Practice (p. 408). Elsevier Science. [6] Sahinidis, N. V., Tawarmalani, M., & Yu, M. (2003). Design of alternative refrigerants via global optimization. AIChE Journal, 49(7), 1761?1775. [7] Biegler, L. T., Grossmann, I. E., & Westerberg, A. W. (1997). Systematic methods of chemical process design. New Jersey (p. 796). Prentice Hall PTR. [8] Camarda, K. V., & Maranas, C. D. (1999). Optimization in Polymer Design Using Connectivity Indices. Industrial & Engineering Chemistry Research, 38(10), 1884-1892. [9] Venkatasubramanian, V., Chan, K., & Caruthers, J. M. (1995). Evolutionary Design of Molecules with Desired Properties Using the Genetic Algorithm. Journal of Chemical Information and Modeling, 35(2), 188?195. [10] Harper, P. M., & Gani, R. (2000). Computer aided tools for design/selection of environmentally friendly substances. Process Design Tools for Environment (pp. 371?404). Philadelphia, USA: Taylor & Francis. [11] McLeese, S. E., Eslick, J. C., Hoffmann, N. J., Scurto, A. M., & Camarda, K. V. (2010). Design of ionic liquids via computational molecular design. Computers & Chemical Engineering, 34(9), 1476-1480. Elsevier Ltd. 142 [12] Contantinou, L., Gani, R. (1994). New group contribution method for estimating properties of pure compounds. AIChE Journal, 40(10): 1697-1710. [13] Harper, P. M., & Gani, R. (1999). Computer-aided molecular design with combined molecular modeling and group contribution. Fluid Phase Equilibria, 337-347. [14] Joback, K. G., Reid, R. C. (1983). Estimation of Pure-Component Properties from Group Contributions. Chemical Engineering Communication 57: 233. [15] Eriksson, L., Johansson, E., Kettaneh-Wold, N., Trygg, J., Wikstrom, C., & Wold, S. (2006). Multi- and Megavariate Data Analysis: Basic Principles and Applications (Part I) (2nd ed., p. 425). Umetrics Academy, Umea. [16] Lavine, B. K. (2005). Chemometrics And Chemoinformatics (p. 216). American Chemical Society. [17] Carrera, G., & Aires-de-Sousa, J. (2005). Estimation of melting points of pyridinium bromide ionic liquids with decision trees and neural networks. Green Chemistry, 7(1), 20. [18] Venkatasubramanian, V. (2009). Drowning in data: Informatics and modelling challenges in a data rich networked world. AIChE Journal, 55(1): 2-8. [19] Wold, S. (1995). Chemometrics; what do we mean with it, and what do we want from it? Chemometrics and Intelligent Laboratory Systems, 30(1), 109- 115. [20] Jaeckle, C. M., & Macgregor, J. F. (1998). Product design through multivariate statistical analysis of process data. AIChE Journal, 44(5), 1105- 1118. [21] Qin, X., F. Gabriel, D. Harell, and M. M. El-Halwagi. 2004. Algebraic Techniques for Property Integration via Componentless Design, Industrial & Engineering Chemistry Research, 43 (14) (July): 3792?3798. [22] Eden, M. R. (2003). Property Based Process and Product Synthesis and Design. (Doctoral dissertation). CAPEC, Department of Chemical Engineering, Technical University of Denmark, Retrieved from http://www.orbit.dtu.dk/fedora/objects/orbit85827/datastreams/file_5486802/c ontent [23] Shelley, M. D., El-Halwagi, M.M. (2000). Component-less design of recovery and allocation systems: a functionality-based clustering approach. Computers & Chemical Engineering, 24(9-10), 2081-2091. 143 [24] Eljack, F. T., Solvason, C. C., Chemmangattuvalappil, N., & Eden, M. R. (2008). A Property-Based Approach for Simultaneous Process and Molecular Design. Chinese Journal of Chemical Engineering, 16(3), 424?434. [25] Chemmangattuvalappil, N. G., Eljack, F. T., Solvason, C. C., & Eden, M. R. (2009). A novel algorithm for molecular synthesis using enhanced property operators. Computers & Chemical Engineering, 33(3), 636?643. [26] Vlachos, D. G. (2012). Multiscale Modeling for Emergent Behavior, Complexity, and Combinatorial Explosion. AIChE Journal, 58(5), 1314-1325. [27] Solvason, C. C. (2011). Integrated Multiscale Product Design using Property Clustering and Decomposition Techniques in a Reverse Problem Formulation. (Doctoral dissertation). Department of Chemical Engineering, Auburn University, Auburn, Alabama. [28] Poling, B. E, Prausnitz, J. P., O?Connell, J. P. (2000). The Properties of Gases and Liquids, 5 th edition, McGraw-Hill, New York, USA. [29] Matsuda, H., Yamamoto, H., Kurihara, K., & Tochigi, K. (2007). Computer- aided reverse design for ionic liquids by QSPR using descriptors of group contribution type for ionic conductivities and viscosities. Fluid Phase Equilibria, 261(1-2), 434?443. [30] Linusson, a, Elofsson, M., Andersson, I. E., & Dahlgren, M. K. (2010). Statistical molecular design of balanced compound libraries for QSAR modeling. Current medicinal chemistry, 17(19), 2001?16. [31] Varnek, A., Kireeva, N., Tetko, I. V, Baskin, I. I., & Solov?ev, V. P. (2007). Exhaustive QSPR studies of a large diverse set of ionic liquids: how accurately can we predict melting points? Journal of chemical information and modeling, 47(3), 1111?22. [32] Michielan, L., & Moro, S. (2010). Pharmaceutical perspectives of nonlinear QSAR strategies. Journal of chemical information and modeling, 50(6), 961? 78. [33] Linusson, a, Gottfries, J., Lindgren, F., & Wold, S. (2000). Statistical molecular design of building blocks for combinatorial chemistry. Journal of medicinal chemistry, 43(7), 1320?8. [34] Marrero, J., Gani, R. (2001). Group-contribution-based estimation of pure component properties. Fluid Phase Equilibria, 183-184: 183-208. [35] Gabrielsson, J., Lindberg, N.-O., & Lundstedt, T. (2002). Multivariate methods in pharmaceutical applications. Journal of Chemometrics, 16(3), 141-160. 144 [36] Gabrielsson, J., Lindberg, N.-O., P?lsson, M., Nicklasson, F., Sj?str?m, M., & Lundstedt, T. (2004). Multivariate methods in the development of a new tablet formulation: optimization and validation. Drug development and industrial pharmacy, 30(10), 1037-49. [37] Gabrielsson, J., Sj?str?m, M., Lindberg, N.-O., Pihl, A.-C., & Lundstedt, T. (2006). Multivariate methods in the development of a new tablet formulation: excipient mixtures and principal properties. Drug development and industrial pharmacy, 32(1), 7-20. [38] Workman, J., & Weyer, L. (2008). Practical Guide to Interpretive Near- Infrared Spectroscopy (1st ed., p. 344). CRC Press. [39] S.E. Stein (2010), NIST Mass Spec Data Center, NIST Standard Reference Database Number 69, Eds. P.J. Linstrom and W.G. Mallard, National Institute of Standards and Technology. [40] Eriksson, L., & Johansson, E. (1996). Multivariate design and modeling in QSAR. Chemometrics and Intelligent Laboratory Systems, 34(1), 1?19. [41] Kramer, R. (1998). Chemometric Techniques for Quantitative Analysis. New York (p. 203). CRC. [42] Geladi, P., & Kowalski, B. R. (1986). Partial least-squares regression: a tutorial. Analytica chimica acta, 185, 1?17. [43] Kresta, J. V., MacGregor, J. F. and Marlin, T. E. (1991). Multivariate statistical monitoring of process performance. Canadian Journal of Chemical Engineering. 69, 35. [44] Wold, S., Kettaneh, N., & Tjessem, K. (1996). Hierarchical multiblock PLS and PC models for Easier Model interpretation and as an alternative to variable selection. Journal of Chemometrics, 10, 463-482. [45] MacGregor, J. F., and Kourti, T. (1995). Statistical process control of multivariate process. Control Engineering Practice. 3(3), 403-414. [46] Johnson, R. A. (2007). Applied Multivariate Statistical Analysis (6th ed., p. 800). Pearson. [47] Jolliffe, I.T. (2002). Principal component analysis. New York: Springer. [48] Jackson, J. E. (1991). A User?s Guide to Principal Components. Journal of the Operational Research Society (Vol. 43, pp. 641-641). Wiley-Interscience. [49] Jaeckle, C. M., & Macgregor, J. F. (1998). Product design through multivariate statistical analysis of process data. AIChE Journal, 44(5), 1105- 1118. 145 [50] Wentzell, P. D., & Vega Montoto, L. (2003). Comparison of principal components regression and partial least squares regression through generic simulations of complex mixtures. Chemometrics and Intelligent Laboratory Systems, 65(2), 257-279. [51] Montgomery, D. C. (2000). Design and Analysis of Experiments, 5th Edition. America (pp. 1-5). Wiley. [52] Ben-Israel, A., & Greville, T. N. E. (2003). Generalized Inverses: Theory and Applications. (J. Borwein & P. Borwein, Eds.) Book (Vol. 18, p. 371). Springer. [53] Hada, S., Solvason, C. C., & Eden, M. R. (2011). Molecular Design of Biofuel Additives for Optimization of Fuel Characteristics. In A. C. Pistikopoulos, E.N.; Georgiadis, M.C.; Kokossis (Ed.), 21nd European Symposium on Computer Aided Process Engineering (pp. 1633?1637). [54] Brignole, E., Bottini, S., & Gani, R. (1986). A strategy for design and selection for separation processes. Fluid Phase Equilibria, 13, 331?340. [55] Joback, K. G. (1989). Designing Molecules Possessing Desired Physical Property Values. (Doctoral dissertation). Massachusetts Institute of Technology. Retrieved from http://dspace.mit.edu/handle/1721.1/14191 [56] Kettaneh-Wold, N. (1992). Analysis of mixture data with partial least squares. Chemometrics and Intelligent Laboratory Systems, 14(1-3), 57-69. [57] Muteki, K., & Macgregor, J. (2007). Multi-block PLS modeling for L-shape data structures with applications to mixture modeling. Chemometrics and Intelligent Laboratory Systems, 85(2), 186-194. [58] Cornell, J. (1990), Experiments with mixtures, 2nd Ed., Wiley, New York. [59] Scheffe, H. (1963). Simplex-centroid design for experiments with mixtures. Journal of the Royal Statistical Society, 25 (2), 235-263. [60] Cox, D. R. (1971). A note on polynomial response functions for mixtures. Biometrica, 58 (1), 155-159. [61] Solvason, C. C., Chemmangattuvalappil, N. G., Eljack, F. T., & Eden, M. R. (2009). Efficient Visual Mixture Design of Experiments using Property Clustering Techniques. Industrial & Engineering Chemistry Research, 48(4), 2245-2256. [62] Muteki, K., MacGregor, J. F., & Ueda, T. (2006). Rapid Development of New Polymer Blends: The Optimal Selection of Materials and Blend Ratios. Industrial & Engineering Chemistry Research, 45(13), 4653-4660. 146 [63] Muteki, K. (2006). Mixture product design using latent variable methods. (Doctoral dissertation). Department of Chemical Engineering, McMaster University. [64] Grassmann, P., Sawistowski, H., & Hardbottle, R. (1971). Physical principles of chemical engineering. New York: Pergamon Press. [65] JMP, Version 9.0. SAS Institute Inc., Cary, NC, 1989-2012. [66] Janaun, J., & Ellis, N. (2010). Perspectives on biodiesel as a sustainable fuel. Renewable and Sustainable Energy Reviews, 14(4), 1312-1320. Elsevier Ltd. [67] Schober, S., Mitelbach, M. (2004). The impact of antioxidants on biodiesel oxidation stability. European Journal of Lipid Science and Technology, 106, 382-389. [68] Karmee, S.K., Chadha, A. (2005). Preparation of biodiesel from crude oil of Pongamia pinnata. Bioresource Technology, 96 (13), 1425-1429. [69] American Society for Testing and Materials (ASTM), D6751-12 Standard Specification for Biodiesel Fuel Blend Stock (B100) for Middle Distillate Fuels, ASTM International, West Conshohocken, PA, 2012, www.astm.org. [70] Fukud, H., Kondo, A., Noda, H. (2001). Biodiesel Fuel Production by Transesterification of Oils. Journal of Bioscience and Bioengineering, 92(5), 405-416. [71] Knothe, G., Gerpen, J. H. V., & Krahl, J. (2005). The Biodiesel Handbook. (G. Knothe, J. H. Van Gerpen, & Jurgen Krahl, Eds.) Applied Sciences (Vol. 2, p. 302). AOCS Press. [72] Br?ckel, U., Meier, W., & Wagner, G. (2007). Product Design and Engineering: Best Practices. John Wiley & Sons. [73] Canakci, M., & Sanli, H. (2008). Biodiesel production from various feedstocks and their effects on the fuel properties. Journal of industrial microbiology biotechnology, 35(5), 431-441. [74] Demirbas, A. (2003). Chemical and Fuel Properties of Seventeen Vegetable Oils. Energy Sources, 25(7), 721-728. [75] European Committee for Standarization (CEN), EN 14214 (2003). Automotive fuels ? fatty acid methyl esters (FAME) for diesel engines ? Requirements and test methods, CEN, Brussels. [76] Wang, P.S., Tat, M.E., Van Gerpen, J. (2005). The production of fatty acid isopropyl ester and their use as a diesel engine fuel. Journal of the American Oil Chemists Society, 82(11), 845-849. 147 [77] Lee, I., Johnson, L.A., Hammond, E.G. (1995). Use of branched-chain esters to reduce the crystallization temperature of biodiesel. Journal of the American Oil Chemists Society, 72, 1155-1160. [78] Knothe. G. (2008), ?Designer? Biodiesel: Optimizing Fatty Ester Composition to Improve Fuel Properties, Energy and Fuels, 22(2), 1358-1364. [79] Ribeiro, M., Pinto, A. C., Quintella, C. M., Rocha, G. O., Teixeira, L. S. G., Guarieiro, L. N., Rangel, C., et al. (2007). The Role of Additives for Diesel and Diesel Blended (Ethanol or Biodiesel) Fuels?: A Review Nu. Energy & Fuels, 21(4), 2433-2445. [80] Suppes, G. J., Goff, M., Burkhart, M. L., & Bockwinkel, K. (2001). Multifunctional Diesel Fuel Additives from Triglycerides. Energy & Fuels, 15(1), 151-157. [81] Conte, E., Martinho, A., Matos, H. A., & Gani, R. (2008). Combined Group- Contribution and Atom Connectivity Index-Based Methods for Estimation of Surface Tension and Viscosity. Industrial & Engineering Chemistry Research, 47(20), 7940-7954. [82] Hall, L. H., & Kier, L. B. (2001). Issues in representation of molecular structure the development of molecular connectivity. Journal of molecular graphics & modelling, 20(1), 4-18. [83] Lapuerta, M., Rodr?guez-Fern?ndez, J., & De Mora, E. F. (2009). Correlation for the estimation of the cetane number of biodiesel fuels and implications on the iodine number. Social Sciences, 37(11), 4337-4344. [84] Socrates, G. (2001). Infrared and Raman characteristic group frequencies: tables and charts. (G. Socrates. Ed.) Journal of Raman Spectroscopy (Vol. 35. p. 347). Wiley. [85] Krevelen, D. W. V., & Nijenhuis, K. T. (2009). Properties of Polymers. Their correlation with chemical structure; their numerical estimation and prediction from additive group contributions. (Null, Eds.)Endeavour (Vol. 16, pp. 656- 658). [86] Ayala, A. E., Simoni, L. D., Lin, Y., & Brennecke, J. F. (2006). Process design using ionic liquids?: Physical property modeling. Computer Aided Chemical Engineering, 21, 463-468. [87] Turner, E. A., Pye, C. C., & Singer, R. D. (2003). Use of ab Initio Calculations toward the Rational Design of Room Temperature Ionic Liquids. Journal of Physical Chemistry, (Table 1), 2277?2288. [88] Holbrey, J. D., & Seddon, K. R. (1999). Ionic Liquids. Clean Products and Processes, 1, 223-236. 148 [89] Ionic Liquids Database- (ILThermo). NIST Standard Reference database # 147. Retrieved from http://ilthermo.boulder.nist.gov/ILThermo/mainmenu.uix [90] Hanwell, M. D., Curtis, D. E., Lonie, D. C., Vandermeersch, T., Zurek, E., & Hutchison, G. R. (2012). Avogadro: an advanced semantic chemical editor, visualization, and analysis platform. Journal of cheminformatics, 4(1), 17. [91] Halgren, T. A. (1996). Merck molecular force field. I. Basis, form, scope, parameterization, and performance of MMFF94. Journal of Computational Chemistry, 17(5-6), 490?519. [92] Gaussian 09 (2009), Revision A.1, Frisch, M. J.; Trucks, G. W.; Schlegel, H. B.; Scuseria, G. E.; Robb, M. A.; Cheeseman, J. R.; Scalmani, G.; Barone, V.; Mennucci, B.; Petersson, G. A.; Nakatsuji, H.; Caricato, M.; Li, X.; Hratchian, H. P.; Izmaylov, A. F.; Bloino, J.; Zheng, G.; Sonnenberg, J. L.; Hada, M.; Ehara, M.; Toyota, K.; Fukuda, R.; Hasegawa, J.; Ishida, M.; Nakajima, T.; Honda, Y.; Kitao, O.; Nakai, H.; Vreven, T.; Montgomery, Jr., J. A.; Peralta, J. E.; Ogliaro, F.; Bearpark, M.; Heyd, J. J.; Brothers, E.; Kudin, K. N.; Staroverov, V. N.; Kobayashi, R.; Normand, J.; Raghavachari, K.; Rendell, A.; Burant, J. C.; Iyengar, S. S.; Tomasi, J.; Cossi, M.; Rega, N.; Millam, J. M.; Klene, M.; Knox, J. E.; Cross, J. B.; Bakken, V.; Adamo, C.; Jaramillo, J.; Gomperts, R.; Stratmann, R. E.; Yazyev, O.; Austin, A. J.; Cammi, R.; Pomelli, C.; Ochterski, J. W.; Martin, R. L.; Morokuma, K.; Zakrzewski, V. G.; Voth, G. A.; Salvador, P.; Dannenberg, J. J.; Dapprich, S.; Daniels, A. D.; Farkas, ?.; Foresman, J. B.; Ortiz, J. V.; Cioslowski, J.; Fox, D. J. Gaussian, Inc., Wallingford CT. [93] Talaty, E. R., Raja, S., Storhaug, V. J., Do, A., & Carper, W. R. (2004). Raman and Infrared Spectra and ab Initio Calculations of C 2-4 MIM Imidazolium Hexafluorophosphate Ionic Liquids. Journal of Physical Chemistry, 108, 13177?13184. [94] Scott, A. P., & Radom, L. (1996). Harmonic Vibrational Frequencies: An Evaluation of Hartree?Fock, M?ller?Plesset, Quadratic Configuration Interaction, Density Functional Theory, and Semiempirical Scale Factors. Journal of Physical Chemistry, 100(41), 16502?16513. [95] http://cccbdb.nist.gov/vibscalejust.asp [96] Wasserscheid, P., & Welton, T. (Eds.). (2007). Ionic Liquids in Synthesis (2nd ed., p. 776). WILEY-VCH. [97] Todeschini, R., & Consonni, V. (2009). Molecular Descriptors for Chemoinformatics. (R. Mannhold, H. Kubinyi, & G. Folkers, Eds.) (2nd ed., p. 1257). Weinheim: WILEY-VCH. [98] ADMET Predictor (2011). User Manual. Simulations Plus, Lancaster, CA. 149 [99] Bro, R., & Kiers, H. A. L. (2003). A new efficient method for determining the number of components in PARAFAC models. Journal of Chemometrics, 17(5), 274?286. [100] Thygesen, J., & Van Den Berg, F. (2011). Calibration transfer for excitation- emission fluorescence measurements. Analytica Chimica Acta, 705(1-2), 81-7. [101] Bieroza, M., Baker, A., & Bridgeman, J. (2011). Classification and calibration of organic matter fluorescence data with multiway analysis methods and artificial neural networks: an operational tool for improved drinking water treatment. Environmetrics, 22(3), 256?270. [102] Albrecht, T., Papadokonstantakis, S., Sugiyama, H., & Hungerb?hler, K. (2010). Demonstrating multi-objective screening of chemical batch process alternatives during early design phases. Chemical Engineering Research and Design, 88(5-6), 529?550. [103] Srinivasan, R., & Nhan, N. T. (2008). A statistical approach for evaluating inherent benign-ness of chemical process routes in early design stages. Process Safety and Environmental Protection, 86(3), 163?174. [104] Banimostafa, A., Papadokonstantakis, S., & Hungerb?hler, K. (2012). Evaluation of EHS hazard and sustainability metrics during early process design stages using principal component analysis. Process Safety and Environmental Protection, 90(1), 8?26. 150 APPENDIX A MIXTURE MODELS The Scheffe canonical models and Cox polynomial models are the two most commonly used models to analyze mixture data with multiple regressions. A.1 Scheffe Mixture Model Scheffe 1 introduced canonical models of various orders by eliminating some terms from the complete polynomial model. More details can be found in Cornell 2 . The first- and second- order point estimate forms of the Scheffe models are represented by Eqs (A.2) and (A.3) respectively. 1 n ii i yx? = = ? (A.2) 1 n nn i i ij i j i i j ji y x xx?? = <> = + ? ?? (A.3) By removing the constant term, the primary collinearity introduced by Eq. (3.1) is eliminated; however, it makes it impossible to center these models, which leads to ill-conditioning of the ? ? ? matrix and poor estimates of the coefficients 3 . Moreover, the Scheffe model is difficult to interpret when the objective of the data analysis is determination of the component effects on 1 Scheffe, H. (1963). Simplex-centroid design for experiments with mixtures. J. R. Stat. Soc. B., 25 (2), 235-263. 2 Cornell, J. (1990), Experiments with mixtures, 2nd Ed., Wiley, New York. 3 Kettaneh-Wold, N. (1992). Analysis of mixture data with partial least squares. Chemometrics and Intelligent Laboratory Systems, 14(1-3), 57-69. 151 the response. Moreover, secondary collinearity from additional constraints such as upper and lower limits on components, results in poor estimates of the regression coefficients. A.1 Cox Mixture Model Cox 3 , recognizing the difficulties with the Scheffe mixture model, derived new mixture models with respect to a specified reference point in the experimental region. They are regular polynomials with constraints involving the reference mixture. The first and second order polynomial forms are represented by Eqs (A.4) and (A.5). 1 n o ii i yx?? = = + ? (A.4) 1 n nn o i i ij i j i ij j y x xx?? ? = ? =++ ? ?? (A.5) In terms of change in constituent i, i ? , Eqs (A.4) and (A.5) can be rewritten as: 1 () 1 n i i i i y ys s ? = ?? =+? ?? ? ?? ? (A.6) ( ) 2 2 11 () 1 1 nn i ii ii ii i i y ys s s ?? = = ?? ?? = + ?+ ??? ?? ? ? ?? ?? ?? (A.7) where, y(s) is the expected response at the standard reference mixture. The Cox coefficients represent the change in the response as one moves away from the standard reference mixture, and hence is meaningful in most applications. However, for multiple regressions, the Cox model encounters estimation difficulties as additional constraints are involved. 152 In most practical cases, (except when X is generated according to an experimental design), however, the X-variables are not statistically independent. This situation is referred to as X being rank deficient. Although the Scheffe canonical models and the Cox polynomial models (a reparameterized and constrained version of the Scheffe model) eliminated the true collinearity, and enabled the use of multiple regressions for the estimation, the problem of near collinearities with mixture data remains. Design of experiments (DOE) with response surface methods are usually used to determine the optimum combination of chemical constituents that give a desired response using a minimum number of experimental runs. While such a design approach is adequate for most experimental designs, it suffers from combinatorial explosion and visualization difficulties when dealing with multi-component mixtures 4 . Solvason et al. 5 presented a solution to these problems by integrating the property clustering framework with existing mixture design techniques. 4 Eden, M. R. (2003). Property-Based Process and Product Synthesis and Design. CAPEC, Department of Chemical Engineering, Technical University of Denmark. Ph.D Thesis. 5 Solvason, C. C., Chemmangattuvalappil, N. G., Eljack, F. T., & Eden, M. R. (2009). Efficient Visual Mixture Design of Experiments using Property Clustering Techniques. Industrial & Engineering Chemistry Research, 48(4), 2245-2256. 153 APPENDIX B SPECTRAL INTERPRETATION The vibrational spectrum of a molecule is considered to be a unique physical property and is a characteristic of the molecule. Any spectrum originates from radiation energy transferred to mechanical energy associated with the motion of atoms held together by chemical bonds in a molecule. The first principles approach, which is based on the fact that structural features of the molecule, whether they are the backbone of the molecule or the functional groups attached to the molecule, produce characteristic and reproducible absorptions in the spectrum 6 . This information can indicate whether there is a backbone to the structure and, if so, whether the backbone consists of linear or branched chains. Next it is possible to determine if there is unsaturation and/or aromatic rings in the structure. Finally, it is possible to deduce whether specific functional groups are present. If detected, one is also able to determine the local orientation of the group and its local environment and/or location in the structure 7 . 6 Salzer, R. (2008). Practical Guide to Interpretive Near-Infrared Spectroscopy. By Jerry Workman, Jr. and Lois Weyer. Angewandte Chemie International Edition,47(25), 4628- 4629. 7 Pasquini, C. (2003). Review Near Infrared Spectroscopy?: Fundamentals , Practical Aspects and Analytical. Applications. Spectroscopy, 14(2), 198-219. 154 B.1 Infrared Spectroscopy The fundamental absorption frequencies (also known as group frequencies) are the key to unlocking the structure?spectral relationships of the associated molecular vibrations. An infrared spectrum is formed as a consequence of the absorption of electromagnetic radiation at frequencies that correlate to the vibration of specific sets of chemical bonds from within a molecule. Figure B.6.5 8 shows the IR region of the electromagnetic spectrum. The distribution of energy possessed by a molecule at any given moment can be defined as the sum of the contributing energy terms: total electronic vibrational rotational trnaslational EE E E E=+++ (B.1) Figure B.6.5: IR regions of the electromagnetic spectrum. 8 Raffael, K. D. (2002). Infrared Spectroscopy: The Theory. Journal of Molecular Spectroscopy, 214(1), 21-27 Visible Infrared Microwave 155 B.2 Molecular Vibrational Spectroscopy There are two types of molecular vibrations, stretching and bending. A molecule consisting of n-atoms has 3n degrees of freedom, corresponding to the Cartesian coordinates of each atom in the molecule. In a nonlinear molecule, 3 of these degrees are rotational and 3 are translational and the remaining corresponds to fundamental vibrations; in a linear molecule, 2 degrees are rotational and 3 are translational. This is because in a linear molecule, all of the atoms lie on a single straight line and hence rotation about the bond axis is not possible. The net number of vibrational degrees of freedom for a given molecule can be determined from Eq. (B.2): 3 6 (nonlinear) 3 5 (linear) Number of normal mode n n = ? = ? (B.2) For example, water, which is nonlinear, has three fundamental vibrations as shown in Figure B.6.6 8 . Figure B.6.6: Stretching and bending vibrational modes for H2O. If we calculate the number of modes for a simple hydrocarbon, such as methane (nonlinear, tetrahedral structure), a value of nine is obtained. This would imply that nine sets of absorption frequencies would be observed in the spectrum of methane gas. In practice, infrared spectra do not normally display separate absorption signals for each of the 3n-6 fundamental vibrational modes of a molecule. The number of observed absorptions may be Symmetrical stretching asymmetrical stretching Scissoring (bending) 156 increased by additive and subtractive interactions leading to combination tones and overtones of the fundamental vibrations, in much the same way that sound vibrations from a musical instrument interact. Furthermore, the number of observed absorptions may be decreased by molecular symmetry, spectrometer limitations, and spectroscopic selection rules. One selection rule that influences the intensity of infrared absorptions is that a change in dipole moment of the molecule should occur for a vibration to absorb infrared energy. Absorption bands associated with C=O bond stretching are usually very strong because a large change in the dipole takes place in that mode. The reason for the smaller than expected number is that several of the vibrations are redundant or degenerate, that is, the same amount of energy is required for these vibrations. Some general trends are as follows: ? Stretching frequencies are higher than corresponding bending frequencies. (It is easier to bend a bond than to stretch or compress it.) ? Bonds to hydrogen have higher stretching frequencies than those to heavier atoms. ? Triple bonds have higher stretching frequencies than corresponding double bonds, which in turn have higher frequencies than single bonds. (Except for bonds to hydrogen). The stretching and bending vibrations for the important organic group, ?CH 2 , are illustrated in Figure B.6.7 9 . (The 3n?6 rule does not apply since the ?CH 2 group represents only a portion of a molecule.) Note that bending vibrations occur at lower frequencies than the corresponding stretching vibrations. 9 Socrates, G. (2001). Infrared and Raman characteristic group frequencies: tables and charts. (G. Socrates. Ed.) Journal of Raman Spectroscopy (Vol. 35. p. 347). Wiley. 157 The fundamental requirement for infrared activity, leading to absorption of infrared radiation, is that there must be a net change in dipole moment during the vibration for the molecule or the functional group under study. Another important form of vibrational spectroscopy is Raman spectroscopy, which is complementary to infrared spectroscopy. The selection rules for Raman spectroscopy are different to those for infrared spectroscopy, and in this case a net change in bond polarizability must be observed for a transition to be Raman active. Figure B.6.7: Stretching and bending vibrational modes for a CH2 goup. Both the stretching and bending vibrations of a molecule as illustrated in the above figures can be predicted mathematically, at least to a useful approximation using the following mathematical description of stretching vibrations. 158 B.2.1 Stretching Vibration The stretching frequency of a bond can be approximated by Hooke?s Law. In this approximation, the fundamental frequency of vibration of two atoms and the connecting covalent bond are treated as a simple harmonic oscillator composed of 2 masses (atoms) joined by a spring and can be expressed by: 1 2 ? ? ?? = (B.3) Where ? is the vibrational frequency ? is the force constant of the spring, and ? is the reduced mass. Figure B.6.8: Ball and spring model for atoms and bonds respectively. Using vibrations and wave equations of quantum mechanics, the potential energy can be written as: 1 () 2 En h?= + (B.4) where, h is the Plank?s constant, 6.6 x 10 -34 J/s, n is the vibrational quantum number (0, 1, 2, 3, . . . ) At the ground state ( 0? = ) 0 12Eh?= . Following the selection rule, when a molecule absorbs energy, there is a promotion to the first excited state 1 32Eh?= . These correspond to bands called overtones in an IR spectrum. 159 They are of lower intensity than the fundamental vibration bands. The difference in energy levels between the vibrational quantum states exactly matches the radiation energy expressed as: 31 22 2 h E h hh ? ??? ?? ?? ?= ? = = ?? ?? (B.5) Since, cc? ??= = , and Eq. (B.3) for a diatomic molecule becomes ( ) 12 12 11 1 2 1 where, 2 mm mm c mm m m ? ?? ? + ? = = + (B.6) where, c is the speed of light, 3 x 10 8 m/s. ? is the wavenumber, inverse of the wavelength, ? . For example, using Hooke?s law approximation, C-H bond stretching vibrations can be estimated as: The actual range for C?H absorptions is 2850?3000 cm ?1 . The region of an IR spectrum where bond stretching vibrations is seen depends primarily on whether the bonds are single, double, or triple or bonds to hydrogen. The following table shows where absorption by single, double, and triple bonds are observed in an IR spectrum. m C 12gm 6.023 10 23 ? := m H 1gm 6.023 10 23 ? := ? 1 2? c? ? m C m H + ( ) ? m C m H ? := ? 3032 cm 1? = 160 Table B.1: IR spectrum absorption for different bond types. Bond type Force constant dyne/cm Bond IR absorption range cm -1 Single 5 x 10 5 C-C, C-O, C-N 800-1300 Double 10 x 10 5 C=C, C=O, C=N, N=O 1500-1900 Triple 15 x 10 5 C?C, C?N 2000-2300 Hydrogen C-H, N-H, O-H 2700-3800 The general regions of the infrared spectrum in which various kinds of vibrational bands are observed are outlined in the Figure B.6.9. Figure B.6.9: Vibrational bands in infrared spectrum. B.3 Near Infrared Spectroscopy Absorption bands in the near infrared region (NIR) (12820 - 4000 cm -1 ) are weak because they arise from vibrational overtones and combination bands. Combination bands occur when two molecular vibrations are excited simultaneously. The intensity of overtone bands reduces by one order of overtone for each successive overtone when a molecule is excited from the ground vibrational state to a higher vibrational state. When the vibrational 161 quantum number v is greater than or equal to 2 then an overtone absorption results. The first overtone results from 0? = to 2? = (52 12 2h hh? ???=). The second overtone occurs when 0? = transitions to 3? = ( ) 72 12 3? ???=h hh. Transitions arising from the near-IR absorption are weak, hence they are referred to as forbidden transitions but these transitions are relevant when nondestructive measurements are required such as a solid sample. B.4 Characterizing IR Spectroscopy Since relative intensities are primarily functions of the atom specific dipole changes caused by the vibration of the corresponding bonds, it follows that their size and shape are indicators of molecular architecture. The IR absorbance frequencies and magnitudes of the functional groups spectrums are listed in following tables and were compiled from Socrates 9 . Table B.2: IR absorbance frequencies and magnitudes of functional groups. Methine Groups, -CH- - Band Wavelength Region [cm -1 ] Relative Intensity % Transmi ttance High Low Average Bending (?) 1360 1320 1340 w 90 Stretching (?) 2890 2880 2885 w 90 Methylene Groups, -CH2- Band Wavelength Region [cm -1 ] Relative Intensity % Transmi ttance High Low Average Scissoring Bend (? s ) 1480 1440 1460 m 50 Symmetrical Stretching (? s ) 2870 2840 2855 m 50 Asymmetrical Stretching (? a ) 2940 2915 2928 m-s 30 162 Methyl Groups, -CH3 Band Wavelength Region [cm -1 ] Relative Intensity % Transmi ttance High Low Average Sym Bend (? s ) 1390 1370 1380 m-s 30 Asym. Bend. (? a ) 1465 1440 1453 m 50 Symmetrical Stretching (? s ) 2885 2865 2875 m 50 Asymmetrical Stretching (? a ) 2975 2950 2963 m-s 30 Tetramethyl Groups, -C(CH3)3 Band Wavelength Region [cm -1 ] Relative Intensity % Transmi ttance High Low Average C-C Skeletal Bend (? s ) 930 925 928 m 50 C-C Skeletal Bend (? s ) 1010 990 1000 m-w 70 C-C Skeletal Bend (? s ) 1225 1165 1195 m 50 C-C Skeletal Bend (? s ) 1255 1245 1250 m 50 C-CH 3 Sym. Bend. (? s ) 1395 1350 1373 m-s 30 C-CH 3 Sym. Bend. (? s ) 1420 1375 1398 m 50 C-CH 3 Asym. Bend. (? a ) 1475 1435 1455 m 50 C-H Sym. Stretching (? s ) 2885 2865 2875 m 50 C-H Asym. Stretching (? a ) 2975 2950 2963 m-s 30 Aliphatic Methoxy Groups, -O-CH3 (Special Methyl) Band Wavelength Region [cm -1 ] Relative Intensity % Transmi ttance High Low Average C-O Def. Bend. (? d ) 580 340 460 m-w 70 CH 3 /CO Rocking Bend (? d ) 1190 1100 1145 m-w 70 CH 3 Rock Bend (? d ) 1235 1155 1195 m-w 70 CH 3 Sym Bend (? s ) 1460 1420 1440 M 50 CH 3 Asym. Bend. (? a ) 1475 1435 1455 m 50 CH 3 Asym. Bend. (? a ) 1485 1445 1465 m 50 C-H 3 Sym. Str. (? s ) 2880 2815 2848 m 50 163 C-H 3 Asym. Str. (? a ) 2985 2920 2953 m 50 C-H Asym. Str. (? a ) 3030 2950 2990 m 50 Vinyl Group, -CH=CH2 Band Wavelength Region [cm -1 ] Relative Intensity % Transmi ttance High Low Average C=C Tors. Bend (? T ) 485 410 448 m-s 30 C=C Eth. Twist. Bend. (? t ) 600 380 490 m-s 30 C=C Eth. Twist. Bend. (? t ) 720 410 565 w 90 C-H 2 OoP Rock. Bend. (? r ) 980 810 895 s 10 C-H OoP Bending. (? r ) 1010 940 975 s 10 C-H IP Def. Bend. (? d ) 1180 1010 1095 m-w 70 C-H 2 Def. Bend. (? d ) 1330 1240 1285 m 50 C-H 2 Sci. Bend. (? s ) 1440 1360 1400 m 50 C=C Stretching (?) 1645 1640 1643 m-w 70 C-H 2 1 st Overtone Bend (2?) 1840 1820 1830 v 90 C-H 1 st Overtone Bend (2?) 1990 1970 1980 v 90 C-H 2 Sym. Stretch (? s ) 3070 2930 3000 M 50 C-H Stretch (?) 3110 2980 3045 M 50 C-H 2 Asym. Stretch (? a ) 3150 3000 3075 M 50 Vinylidene Group, CH2=C- - Band Wavelength Region [cm -1 ] Relative Intensity % Transmi ttance High Low Average C=C Skeletal Stretch (?) 470 435 453 m-w 70 C=C Skeletal Stretch (?) 560 530 545 s 10 C=C Eth. Twist. Bend. (? t ) 715 680 698 w 90 C-H 2 OoP Rock. Bend. (? r ) 895 885 890 s 10 C-H 2 IP Def. Bend. (? d ) 1320 1290 1305 w 90 C-H 2 Sci. Def Bend. (? s ) 1420 1405 1413 w 90 164 C=C Stretching (?) 1675 1625 1650 m-w 70 C-H 2 1 st Overtone Bend (2?) 1800 1750 1775 w 90 C-H 2 Sym. Stretch (? s ) 2985 2970 2978 m-w 70 C-H 2 Asym. Stretch (? a ) 3095 3075 3085 m-w 70 cis-Vinylene Group, -CH=CH- Band Wavelength Region [cm -1 ] Relative Intensity % Transmi ttance High Low Average C-H Tors. Bend (? T ) 490 320 405 m-s 30 C=C Skeletal Bend (? T ) 500 460 480 s 10 -C=CH Def. Bend. (? d ) 590 440 515 m-s 30 C=C Eth. Twist. Bend. (? t ) 630 570 600 s 10 C-H Wag. Bend. (? w ) 790 650 720 m-s 30 C-H Wag. Bend. (? w ) 1000 850 925 m-w 70 C-H Def. Bend. (? d ) 1295 1185 1240 w 90 C-H Def. Bend. (? d ) 1425 1355 1390 w 90 C=C Stretching (?) 1665 1630 1648 m 50 C-H Stretch (?) 3040 2980 3010 m 50 C-H Stretch (?) 3090 3010 3050 m 50 trans-Vinylene Group, -CH=CH- Band Wavelength Region [cm -1 ] Relative Intensity % Transmi ttance High Low Average C-H Tors. Bend (? T ) 490 320 405 m-s 30 C=C Skeletal Bend (? T ) 500 480 490 s 10 -C=CH Def. Bend. (? d ) 590 440 515 m-s 30 C=C Eth. Twist. Bend. (? t ) 580 515 548 m-s 30 C-H Wag. Bend. (? w ) 850 750 800 m-w 70 C-H Wag. Bend. (? w ) 1000 910 955 v 90 C-H Def. Bend. (? d ) 1305 1260 1282.5 v 90 C-H Def. Bend. (? d ) 1340 1355 1347.5 v 90 165 C=C Stretching (?) 1680 1665 1673 m-w 70 C-H Stretch (?) 3050 3000 3025 m 50 C-H Stretch (?) 3065 3015 3040 m 50 Hydroxyl Group, -OH (with intermolecular H-bonding) Band Wavelength Region [cm -1 ] Relative Intensity % Transmi ttance High Low Average Bending (?) 710 570 640 m 50 Stretching (?) 3550 3230 3390 m-s 30 Primary Alcohol Group, -CH2OH (with intermolecular H-bonding) Band Wavelength Region [cm -1 ] Relative Intensity % Transmi ttance High Low Average C-O Def. Bend (? d ) 555 395 475 m-w 70 C-O IP. Def. Bend (? d ) 500 440 470 w 90 O-H OoP. Def. Bending (? d ) 710 570 640 m-w 70 C-CO Stretch (?) 900 800 850 m 50 C-H 2 Twist Bend (? t ) 960 800 880 m-w 70 C-C-O Stretch (?) 1090 1000 1045 S 10 C-H 2 Twist. Bending (? t ) 1300 1280 1290 m-w 70 C-H 2 Wag Bend (? w ) 1390 1280 1335 m-w 70 O-H Def. Bend (? d ) 1440 1260 1350 m-s 30 C-H 2 Def Bend (? d ) 1480 1410 1445 m-w 70 C-H 2 Sym. Stretch (? s ) 2935 2840 2888 m-w 70 C-H 2 Asym. Stretch (? a ) 2990 2900 2945 m-w 70 O-H Stretching (?) 3550 3230 3390 m-s 30 166 Secondary Alcohol Group, - -CHOH (with intermolecular H-bonding) Band Wavelength Region [cm -1 ] Relative Intensity % Transmi ttance High Low Average C-O OoP. Def. Bend (? d ) 390 330 360 m-w 70 C-O IP. Def. Bend (? d ) 500 440 470 w 90 O-H OoP. Def. Bending (? d ) 660 600 630 m-w 70 C-CO Stretch (?) 900 800 850 m 50 C-O Stretch (?) 1150 1075 1113 m-w 70 C-H Def. Bending (? d ) 1350 1290 1320 s 10 C-H Wag Bend (? w ) 1400 1330 1365 s 10 O-H + C-H 2 Coup. Bend. (? c ) 1430 1370 1400 m-w 70 O-H Def. Bend (? d ) 1440 1260 1350 m-w 70 C-H Stretching (?) 2890 2880 2885 m-s 30 O-H Stretching (?) 3550 3230 3390 m-w 70 Aliphatic Ether Group, -O- Band Wavelength Region [cm -1 ] Relative Intensity % Transmi ttance High Low Average C-O-C def vib (? d ) 440 420 430 w 90 Sym C-O-C str (? s ) 1140 820 980 w 90 Asym C-O-C Str (? a ) 1150 1060 1105 s 10 Rocking vib 1200 1185 1193 m-w 70 Wagging vib 1400 1360 1380 m 50 Asym and Sym -CH 3 def. vib 1470 1435 1453 m 50 CH 2 def vib 1475 1445 1460 m 50 Sym CH 2 str 2880 2835 2858 m 50 Sym. -CH 3 Str 2900 2840 2870 m 50 Asym CH 2 str 2955 2920 2938 m 50 Asym. -CH 3 Str 2995 2955 2975 m 50 167 Alkyl Peroxide Group, -O-O- Band Wavelength Region [cm -1 ] Relative Intensity % Transmi ttance High Low Average O-O Stretch (?) 900 800 850 w 90 C-O Stretch (?) 1150 1030 1090 m-s 30 Saturated Aliphatic Ester Group, -CO-O- Band Wavelength Region [cm -1 ] Relative Intensity % Transmi ttance High Low Average C-O-C Sym. Stretch (? s ) 1160 1050 1105 s 10 C-O-C Asym. Stretch (? a ) 1275 1185 1230 s 10 C=O Stretch (?) 1750 1725 1738 s 10 C=O 1 st Overtone (2? s ) 3460 3440 3450 w 90 Saturated Aliphatic Methyl Ester Group, -CO-O-CH3 Band Wavelength Region [cm -1 ] Relative Intensity % Transmi ttance High Low Average Unlisted 450 430 440 m-s 30 CO-O Rocking Bend (? r ) 530 340 435 w 90 C-C-O Sym. Stretch (? s ) 1160 1050 1105 s 10 C-O Stretch (?) 1175 1155 1165 s 10 C-C-O Asym. Stretch (? a ) 1275 1185 1230 s 10 O-CH 3 Stretch (?) 1315 1195 1255 s 10 Unlisted 1370 1350 1360 w 90 CH 3 Sym. Def. Bend (? d ) 1460 1420 1440 m-s 30 CH 3 Asym. Def. Bend (? d ) 1465 1420 1443 m-s 30 CH 3 Asym. Def. Bend (? d ) 1485 1435 1460 m 50 C=O Stretch (?) 1750 1725 1738 s 10 CH 3 Sym. Stretch (?) 3000 2860 2930 m 50 168 CH 3 Asym. Stretch (?) 3030 2950 2990 m-w 70 CH 3 Asym. Stretch (?) 3050 2980 3015 m-w 70 C=O 1 st Overtone (2? s ) 3460 3440 3450 w 90 Saturated Aliphatic Ethyl Ester Group, -CO-O-CH2CH3 Band Wavelength Region [cm -1 ] Relative Intensity % Transmi ttance High Low Average C-O-C Def Bend (? d ) 370 250 310 m-w 70 C-O-C Def Bend (? d ) 395 305 350 m-w 70 CO-O Rocking Bend (? r ) 485 365 425 m-w 70 CO OoP Rocking Bend (? r ) 700 550 625 w 90 CH 2 Rocking Bend (? r ) 825 775 800 w 90 C-C str (?) 940 850 895 w 90 CH 3 Rock. Bend (? r ) 1150 1080 1115 w 90 C-C-O Sym. Stretch (? s ) 1160 1050 1105 s 10 CH 3 Rock. Bend (? r ) 1195 1135 1165 w 90 C-C-O Asym. Stretch (? a ) 1275 1185 1230 s 10 CH 2 Twist. Bend (? T ) 1340 1325 1333 m-w 70 CH 2 Wag. Bend (? w ) 1385 1335 1360 m-w 70 CH 3 Sym. Def. Bend (?) 1390 1360 1375 m-s 30 CH 3 Asym. Def. Bend (?) 1480 1435 1458 m 50 OCH 2 Def. Bend. (?) 1490 1460 1475 m-w 70 C=O Stretch (?) 1750 1725 1738 s 10 CH 3 Stretch (?) 2920 2860 2890 w 90 CH 3 Sym. Stretch (? s ) 2930 2890 2910 w 90 CH 3 Asym. Stretch (? a ) 2995 2930 2963 m 50 C=O 1 st Overtone (2? s ) 3460 3440 3450 w 90 169 Acrylate Ester Group, CH2=CH-CO-O- Band Wavelength Region [cm -1 ] Relative Intensity % Transmi ttance High Low Average C=C Tors. Bend (? T ) 485 410 448 m-s 30 CO-O Rocking Bend (? r ) 485 365 425 m-w 70 C=C Eth. Twist. Bend. (? t ) 600 380 490 m-s 30 C-O-C Def Bend (?) 675 660 668 m 50 CO OoP Rocking Bend (? r ) 700 550 625 w 90 =CH2 Twist Bend (? t ) 810 800 805 m-s 30 CH 2 Rocking Bend (? r ) 825 775 800 w 90 C-C str (?) 940 850 895 w 90 =CH2 Wag. Bend (? w ) 970 960 965 s 10 C-H Def. Wag (? w ) 990 980 985 m 50 C-H OoP Bending. (? r ) 1010 940 975 s 10 C-C Skel. Bend (?) 1070 1065 1068 m 50 CH 3 Rock. Bend (? r ) 1150 1080 1115 w 90 C-C-O Sym. Stretch (? s ) 1160 1050 1105 s 10 C-H IP Def. Bend. (? d ) 1180 1010 1095 m-w 70 Unlisted 1200 1195 1198 s 10 C-C-O Asym. Stretch (? a ) 1275 1185 1230 s 10 =CH Rock. Bend (? r ) 1290 1270 1280 m 50 Unlisted 1290 1280 1285 s 10 =CH2 Def Bend (?) 1420 1400 1410 m 50 C-H 2 Sci. Bend. (? s ) 1440 1360 1400 m 50 C=C Stretch (?) 1635 1615 1625 m 50 C=C Stretch (?) 1650 1630 1640 m-s 30 C=O Stretch (?) 1725 1710 1718 s 10 C-H 2 1 st Overtone Bend (2?) 1840 1820 1830 w 90 C-H 1 st Overtone Bend (2?) 1990 1970 1980 w 90 C-H 2 Sym. Stretch (? s ) 3070 2930 3000 m 50 C-H Stretch (?) 3110 2980 3045 m 50 C-H 2 Asym. Stretch (? a ) 3150 3000 3075 m 50 C=O 1 st Overtone (2? s ) 3460 3440 3450 w 90 170 Methacrylate Ester Group, CH2=C(CH3)-CO-O- Band Wavelength Region [cm -1 ] Relative Intensity % Transmi ttance High Low Average C=C Skeletal Stretch (?) 470 435 453 m-w 70 C=C Skeletal Stretch (?) 560 530 545 s 10 C-O-C Def Bend (?) 660 645 653 m 50 C=C Eth. Twist. Bend. (? t ) 715 680 698 w 90 C-C Skel Bend (?) 825 805 815 m-s 30 C-H 2 OoP Rock. Bend. (? r ) 895 885 890 s 10 =CH2 Wag. Bend (? w ) 950 935 942.5 s 10 C-C Skel. Bend (?) 1010 990 1000 m 50 C-C Skel. Bend (?) 1020 1000 1010 m 50 C-O-C Sym. Stretch (? s ) 1160 1150 1155 s 10 C-O-C Asym. Stretch (? a ) 1275 1185 1230 s 10 Unlisted 1310 1290 1300 s 10 C-H 2 IP Def. Bend. (? d ) 1320 1290 1305 w 90 =CH Rock. Bend (? r ) 1335 1315 1325 m 50 CH 3 Sym Bend (? s ) 1390 1370 1380 m-s 30 =CH2 Def Bend (?) 1420 1400 1410 m 50 CH 3 Asym. Bend. (? a ) 1465 1440 1453 m 50 C=C Stretch (?) 1650 1630 1640 m 50 C=O Stretch (?) 1725 1710 1718 s 10 C-H 2 1 st Overtone Bend (2?) 1800 1750 1775 w 90 CH 3 Sym. Stretching (? s ) 2885 2865 2875 m 50 C-H 2 Sym. Stretch (? s ) 2985 2970 2978 m-w 70 CH 3 Asym. Stretching (? a ) 2975 2950 2963 m-s 30 C-H 2 Asym. Stretch (? a ) 3095 3075 3085 m-w 70 C=O 1 st Overtone (2? s ) 3460 3440 3450 w 90 171 o-Alkyl Phenol Group (With H-bonding) Band Wavelength Region [cm -1 ] Relative Intensity % Transmi ttance High Low Average C-OH IP Bending (?) 450 375 413 w 90 O-H OoP. Def. Bending (? d ) 720 600 660 s 10 C-O Stretch (?) 1260 1180 1220 s 10 O-H IP Bending (?) 1410 1310 1360 s 10 COH bending vib 1330 1310 1320 m 50 O-H Stretching (?) 3250 3000 3125 m 50 CO Str 1255 1240 1248 s 10 OH def and CO str vib 1175 1160 1168 s 10 OH def and CO str vib 760 740 750 m 50 OR substituted 3595 3470 3533 m 50 p-Alkyl Phenol Group (With H-bonding) Band Wavelength Region [cm -1 ] Relative Intensity % Transmi ttance High Low Average C-OH IP Bending (?) 450 375 413 w 90 O-H OoP. Def. Bending (? d ) 720 600 660 s 10 C-O Stretch (?) 1260 1180 1220 s 10 O-H IP Bending (?) 1410 1310 1360 s 10 O-H Stretching (?) 3250 3000 3125 m 50 CO Str 1260 1245 1253 s 10 OH def and CO str vib 1175 1165 1170 s 10 OH def and CO str vib 835 815 825 m 50 OR substituted 3595 3470 3533 m 50 172 Monosubstituted Benzenes Band Wavelength Region [cm -1 ] Relative Intensity % Transmi ttance High Low Average Ring OoP def vib 560 415 488 m-s 30 Ring IP def vib 630 605 618 m-w 70 =C-H Ring OoP def vib 710 670 690 s 10 =C-H OoP def vib 820 720 770 s 10 =C-H OoP def vib 900 860 880 m-w 70 =C-H IP def vib 1010 990 1000 w 90 =C-H IP def vib 1040 1000 1020 m-w 70 =C-H IP def vib 1085 1050 1068 m 50 =C-H IP def vib 1175 1130 1153 w 90 =C-H IP def vib 1195 1165 1180 m-w 70 =C-H IP def vib 1250 1230 1240 w 90 -C=C- Str Vib 1625 1590 1608 v 90 =C-H Str. Vib 3105 3000 3053 m 50 1,2,4- Trisubstituted Benzene Band Wavelength Region [cm -1 ] Relative Intensity % Transmi ttance High Low Average Ring OoP def vib 475 425 450 m-s 30 =C-H OoP def vib (2H) 740 690 715 m-w 70 =C-H OoP def vib (2H) 780 760 770 s 10 =C-H OoP def vib (2H) 860 840 850 m-s 30 =C-H OoP def vib (1H) 940 885 913 m-s 30 =C-H IP def vib 1040 1020 1030 m-w 70 =C-H IP def vib 1160 1140 1150 m-w 70 =C-H IP def vib 1220 1200 1210 w 90 -C=C- Str Vib 1625 1590 1608 v 90 =C-H Str. Vib 3105 3000 3053 m 50 173 Table B.3: Abbreviations, names and structures of investigated ionic liquid training set. ID Abbreviation Structure & Name Properties Cation Anion ? @ 293K Pa.s ? @ 298K Kg/m 3 T m K 1 [bmIm]mSO 4 N + CH 3 NCH 3 O - SO CH 3 O O 0.2890 1212 269.1 1-butyl-3-methyl imidazolium Methyl sulfate 2 [emIm]PF 6 N N + CH 3 CH 3 P - F F F F F F 0.0234 1422 307.1 1-ethyl-3-methyl imidazolium Hexafluorophosphate 3 [emIm]BF 4 N N + CH 3 CH 3 B - F FF F 0.0665 1248 247.1 1-ethyl-3-methyl imidazolium Tetrafluoroborate 4 [bmIm]dCN N + CH 3 NCH 3 NN N - 0.0332 1058 267.1 1-butyl-3-methyl imidazolium Dicyanamide 174 ID Abbreviation Structure & Name Properties Cation Anion ? @ 293K Pa.s ? @ 298K Kg/m 3 T m K 5 [1,2m 2 pIm]Tf 2 N N + N CH 3 CH 3 CH 3 N - S S O O O O F F F F F F 0.0901 1457 288.1 1,2-dimethyl-3-propyl imidazolium Bis(trifluromethylsulfonly)-amide 6 [b2,3m 2 Im]BF 4 CH 3 N CH 3 N +CH 3 B - F FF F 0.4558 1094 310.1 1-butyl-2,3-dimethyl imidazolium Tetrafluoroborate 7 [bm 3 NH 4 ]Tf 2 N N + CH 3 CH 3 CH 3 CH 3 N - S S O O O O F F F F F F 0.1407 1397 289.1 butyl-trimethyl-ammonium Bis(trifluromethylsulfonly)-amide 8 [bmIm]PF 6 N + CH 3 NCH 3 P - F F F F F F 0.4500 1360 283.1 1-butyl-3-methyl imidazolium Hexafluorophosphate 175 ID Abbreviation Structure & Name Properties Cation Anion ? @ 293K Pa.s ? @ 298K Kg/m 3 T m K 9 [bPy]Tf 2 N N + CH 3 N - S S O O O O F F F F F F 0.0756 1454 299.1 1-butylpyridinium Bis(trifluromethylsulfonly)-amide 10 [bPy]BF 4 N + CH 3 B - F FF F 0.2231 1203 279.8 1-butylpyridinium Tetrafluoroborate 11 [bmIm]CF 3 SO 3 N + CH 3 NCH 3 S O - O O F F F 0.0990 1384 262.2 1-butyl-3-methyl imidazolium Triflate 12 [1,3m 2 Im]Tf 2 N N CH 3 N + CH 3 N - S S O O O O F F F F F F 0.0475 1570 299.1 1,3-dimethyl imidazolium Bis(trifluromethylsulfonly)-amide 176 ID Abbreviation Structure & Name Properties Cation Anion ? @ 293K Pa.s ? @ 298K Kg/m 3 T m K 13 [m 2 Im]Tf 2 N N + CH 3 N CH 3 N - S S O O O O F F F F F F 0.0260 1519 240.1 1-ethyl-3-methyl imidazolium Bis(trifluromethylsulfonly)-amide 14 [e 3 oNH 4 ]Tf 2 N N + CH 3 CH 3 CH 3 CH 3 N - S S O O O O F F F F F F 0.1810 1270 287.1 N,N,N-triethyl-1-octanaminium Bis(trifluromethylsulfonly)-amide 15 [pmIm]PF 6 N + CH 3 NCH 3 B - F FF F 0.1030 1240 256.1 1-propyl-3-methyl imidazolium Tetrafluoroborate 16 [emIm]dCN N N + CH 3 CH 3 NN N - 0.0169 1106 255 1-ethyl-3-methyl imidazolium Dicyanamide 177 ID Abbreviation Structure & Name Properties Cation Anion ? @ 293K Pa.s ? @ 298K Kg/m 3 T m K 17 [emIm]mSO 4 N N + CH 3 CH 3 O - SO CH 3 O O 0.0785 1234 236.3 1-ethyl-3-methyl imidazolium Methylsulfate 18 [emIm]Cl N N + CH 3 CH 3 Cl - 1.583 1186 358.1 1-ethyl-3-methyl imidazolium Chloride 19 [bmIm]Cl N + CH 3 NCH 3 Cl - 40.89 1080 340.1 1-butyl-3-methyl imidazolium Chloride 20 [hmIm]Tf 2 N N CH 3 N +CH 3 N - S S O O O O F F F F F F 0.0780 1370 266 1-hexyl-3-methylimidazolium Bis(trifluromethylsulfonly)-amide 178 ID Abbreviation Structure & Name Properties Cation Anion ? @ 293K Pa.s ? @ 298K Kg/m 3 T m K 21 [m 2 Im]mSO 4 CH 3 N + CH 3 N O - SO CH 3 O O 0.0928 1328 308.9 1,3-dimethyl imidazolium Methylsulfate 22 [emIm]CF 3 SO 3 N N + CH 3 CH 3 S O - O O F F F 0.0500 1384 262.2 1-ethyl-3-methyl imidazolium Triflate 23 [bmIm]oSO 4 N + CH 3 NCH 3 S O O - O O CH 3 0.8745 1060 307.6 1-butyl-3-methyl imidazolium Octylsulfate 179 Table B.4: Candidate ionic liquid solutions enumerated from exhaustive search in latent property space. Abbreviation Structure & Name Latent Properties Cation Anion q1 q2 q3 1 [meIm]mSO 4 N + N CH 3 CH 3 O - SO CH 3 O O 0.0185 0.0045 -0.0293 1-methyl-3-ethyl imidazolium Methyl sulfate 2 [ePy]mSO 4 N + CH 3 O - SO CH 3 O O 0.0216 -0.0039 -0.0140 1-ethylpyridinium Methyl sulfate 3 [e 3 pNH 4 ]mSO 4 N + CH 3 CH 3 CH 3 CH 3 O - SO CH 3 O O 0.0457 0.0292 0.0147 N,N,N- triethyl propanaminium Methyl sulfate 4 [meIm]PF 6 N + N CH 3 CH 3 P - F F F F F F 0.0113 0.0193 -0.0314 1-methyl-3-ethyl imidazolium Hexafluorophosphate 180 Abbreviation Structure & Name Latent Properties Cation Anion q1 q2 q3 5 [ePy]PF 6 N + CH 3 P - F F F F F F 0.0144 0.0109 -0.0160 1-ethylpyridinium Hexafluorophosphate 6 [e 3 eNH 4 ]PF 6 N + CH 3 CH 3 CH 3 CH 3 P - F F F F F F 0.0384 0.0440 0.0126 N,N,N- triethyl propanaminium Hexafluorophosphate 7 [meIm]BF 4 N + N CH 3 CH 3 B - F FF F 0.0111 0.0231 0.0012 1-methyl-3-ethyl imidazolium Tetrafluoroborate 8 [e2,3m 2 Im]BF 4 CH 3 N CH 3 N +CH 3 B - F FF F 0.0050 0.0185 -0.0162 1-ethyl-2,3-dimethyl imidazolium Tetrafluoroborate 9 [ePy]BF 4 N + CH 3 B - F FF F 0.0142 0.0148 0.0165 1-ethylpyridinium Tetrafluoroborate 181 Abbreviation Structure & Name Latent Properties Cation Anion q1 q2 q3 10 [e 3 eNH 4 ]BF 4 N + CH 3 CH 3 CH 3 CH 3 B - F FF F 0.0028 0.0128 -0.0302 N,N,N- triethyl ethanaminium Tetrafluoroborate 11 [e 3 bNH 4 ]BF 4 N + CH 3 CH 3 CH 3 CH 3 B - F FF F 0.0290 0.0557 -0.0135 N,N,N- triethyl butanaminium Tetrafluoroborate 12 [e 3 pNH 4 ]dCN N + CH 3 CH 3 CH 3 CH 3 NN N - 0.0427 0.0068 -0.0130 N,N,N- triethyl propanaminium Dicyanamide 13 [meIm]Tf 2 N N + N CH 3 CH 3 N - S S O O O O F F F F F F 0.0091 0.0069 -0.0002 1-methyl-3-ethyl imidazolium Bis(trifluromethylsulfonly)-amide 182 Abbreviation Structure & Name Latent Properties Cation Anion q1 q2 q3 14 [e2,3m 2 Im]Tf 2 N CH 3 N CH 3 N +CH 3 N - S S O O O O F F F F F F 0.0030 0.0022 -0.0176 1-ethyl-2,3-dimethyl imidazolium Bis(trifluromethylsulfonly)-amide 15 [ePy]Tf 2 N N + CH 3 N - S S O O O O F F F F F F 0.0121 -0.0015 0.0151 1-ethylpyridinium Bis(trifluromethylsulfonly)-amide 16 [e 3 eNH 4 ]Tf 2 N N + CH 3 CH 3 CH 3 CH 3 N - S S O O O O F F F F F F 0.0008 -0.0035 -0.0316 N,N,N- triethyl ethanaminium Bis(trifluromethylsulfonly)-amide 17 [e 3 bNH 4 ]Tf 2 N N + CH 3 CH 3 CH 3 CH 3 N - S S O O O O F F F F F F 0.0270 0.0394 -0.0149 N,N,N- triethyl butanaminium Bis(trifluromethylsulfonly)-amide 183 Abbreviation Structure & Name Latent Properties Cation Anion q1 q2 q3 18 [meIm]CF 3 SO 3 N + N CH 3 CH 3 S O - O O F F F 0.0069 0.0182 -0.0118 1-methyl-3-ethyl imidazolium Triflate 19 [e2,3m 2 Im]CF 3 SO 3 CH 3 N CH 3 N +CH 3 S O - O O F F F 0.0008 0.0135 -0.0292 1-ethyl-2,3-dimethyl imidazolium Triflate 20 [ePy]CF 3 SO 3 N + CH 3 S O - O O F F F 0.0099 0.0098 0.0035 1-ethylpyridinium Triflate 21 [e 3 bNH 4 ]CF 3 SO 3 N + CH 3 CH 3 CH 3 CH 3 S O - O O F F F 0.0248 0.0507 -0.0265 N,N,N- triethyl butanaminium Triflate 184 Abbreviation Structure & Name Latent Properties Cation Anion q1 q2 q3 22 [ePy]oSO 4 N + CH 3 S O O - O O CH 3 0.0103 0.0116 -0.0241 1-ethylpyridinium Octylsulfate 23 [e 3 pNH 4 ]oSO 4 N + CH 3 CH 3 CH 3 CH 3 S O O - O O CH 3 0.0344 0.0446 0.0046 N,N,N- triethyl propanaminium Octylsulfate