A Systematic Property Based Approach for Molecular Synthesis Using Higher Order Molecular Groups and Molecular Descriptors by Nishanth Gopalakrishnan Chemmangattuvalappil A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy Auburn, Alabama December 13, 2010 Key Words: Molecular Design, Topological Indices, Molecular Signatures Copyright 2010 by Nishanth Gopalakrishnan Chemmangattuvalappil Approved by Mario R. Eden, Associate Professor of Chemical Engineering, Auburn University, Chair Christopher B. Roberts, Professor of Chemical Engineering, Auburn University Ram B. Gupta, Professor of Chemical Engineering, Auburn University Jin Wang, Assistant Professor of Chemical Engineering, Auburn University Mahmoud M. El-Halwagi, Professor of Chemical Engineering, Texas A&M University ii Abstract In this work, algorithms have been developed for the design of molecules corresponding to the optimum performance of a process. The concept of property clustering has been extended into molecular design based on second and third order group contribution methods. An algebraic approach has been developed utilizing higher order molecular groups built from first order groups. The significant aspect of the aforementioned method is that both the application range and reliability of the molecular property clustering technique are considerably increased by incorporating second and third order estimation. A methodology has been developed for incorporating the property contribution predicted using combined group contribution and connectivity indices into the design framework in case the property contributions of any of the molecular groups of interest are not available in literature. For the design of simple mono-functional molecules, a modified visual approach has been used whereas for the design of more complicated structures and/or for treating more than three properties at a time, an algebraic method has been developed. Until now, most reverse property prediction algorithms are based on group contribution methods. However, a variety of properties can be predicted using Quantitative Structure Activity/Property Relationships (QSAR/QSPR) models. QSAR models make use of topological indices to predict physical properties and biological activities. In this dissertation, a new algorithm has been developed to include topological iii index based property models into the reverse problem formulation framework. This algorithm makes use of the concept of molecular signature descriptors to incorporate a variety of different topological indices on a common platform. A large number of environmental, safety and health related constraints can be now investigated as a part of the integrated process and molecular design. An algorithm for the enumeration of the molecular structures has been developed with very low degeneracy. In the last part, a general framework has been proposed to simultaneously integrate process and product design problems with flowsheet design. This methodology will identify the best candidate molecules that provide the optimum process performance with minimum energy utilization. The dissertation concludes with a list of potential areas where more study can be conducted based on the developed algorithms. iv Acknowledgments I would like to express my profound gratitude to Dr. Mario R. Eden for his constant support, encouragement and assistance. Special recognition is given for his guidance and direction. He has always been a great source of information and inspiration. I would like to thank my research committee members, Dr. Mahmoud El- Halwagi at Texas A&M University, Dr. Ram Gupta, Dr. Christopher Roberts and Dr. Jin Wang for their valuable comments and suggestions. My sincere thanks to my collaborators Charles Solvason, Dr. Fadwa Eljack and Susilpa Bommareddy for their brilliant ideas and feedback. Thanks are also due to my friends and co-workers, Dr. Norman Sammons, Dr. Jeffrey Seay, Wei Yuan, Subin Hada and Gregory Vaughan at Auburn University. Special gratitude and appreciation is given to my parents, Gopalakrishnan C.C. and Vijayakumari T.V., my wife Suchithra, and my sister Dhanya for all their support and encouragement. Finally, I would like to thank the faculty and staff in the Department of Chemical Engineering at Auburn University for making my graduate research experience at Auburn a memorable and rewarding one. v Table of Contents Abstract............................................................................................................................... ii Acknowledgments.............................................................................................................. iv List of Figures.................................................................................................................... ix List of Tables ..................................................................................................................... xi Nomenclature................................................................................................................... xiv 1. Introduction......................................................................................................................1 2. Theoretical Background...................................................................................................9 2.1. Chemical Product Design .......................................................................................9 2.2. Mathematical Formulation of Chemical Product Design .. ..................................14 2.3. Types of Properties and Estimation Techniques...................................................17 2.4. Mixture Design .....................................................................................................19 2.4.1. Design of Experiments.................................................................................19 2.4.2. Mixture Design of Experiments...................................................................21 2.5. Group Contribution Methods................................................................................23 2.5.1. Initial Efforts................................................................................................23 2.5.2. Group Contribution Models with Higher Levels.........................................25 2.6. Topological Indices and Property Prediction........................................................29 2.6.1. Connectivity Indices ....................................................................................32 vi 2.6.2. Edge Adjacency Index .................................................................................37 2.6.3. Shape Indices ...............................................................................................39 2.6.4. Wiener Indices .............................................................................................40 2.6.5. The Hoyosa Topolological Index.................................................................40 2.7. Connectivity Indices and GC+ Method .................................................................41 2.8. Molecular Signature Descriptors ..........................................................................43 2.8.1. Current Status In Inverse Design .................................................................43 2.8.2. Development of Molecular Signature..........................................................49 2.8.3. Application of Signature Descriptors in Property Prediction ......................55 2.9. Flowsheet Property Model....................................................................................56 2.10. Summary.............................................................................................................59 3. Basics of Computer Aided Molecular Design ...............................................................62 3.1. Computer Aided Molecular Design Framework...................................................62 3.2. Computer Aided Molecular Design Techniques...................................................66 3.3. Property Models....................................................................................................67 3.4. Reverse Problem Formulation ..............................................................................69 3.5. Reverse Problem Formulation Methodology........................................................70 3.6. Summary...............................................................................................................71 4. Integrated Process and Molecular Design......................................................................73 4.1. Property Clustering Techniques............................................................................73 4.1.1. Property Operator and Cluster Formulation ................................................73 4.1.2. Conservation Rules ......................................................................................76 4.1.3. Visualization Techniques.............................................................................78 vii 4.1.4. Identification of Feasibility Region .............................................................80 4.2. Molecular Property Operators and Clusters..........................................................82 4.3. Visual Solution of Molecular Design Problem.....................................................85 4.4. Limitations of the Visual Approach in Molecular Design....................................90 4.5. An Algebraic Approach for Molecular Synthesis with Higher Order Groups .....92 4.5.1. General Problem Statement .........................................................................95 4.5.2. Algebraic Approach for Solving the Molecular Design Problem................95 4.5.3. Algebraic Molecular Design Algorithm ....................................................102 4.6. Introduction of GC+ Models into the Cluster Space ..........................................104 4.6.1. GC+ Algorithm for Visual Solution of a Molecular Design problem........106 4.6.2. GC+ Algorithm for Algebraic Solution of a Molecular Design Problem ..109 4.7. Molecular Signatures in Reverse Problem Formulations ...................................110 4.7.1. First Order Connectivity Index ..................................................................111 4.7.2. Kier-Hall Shape Index of Order 1..............................................................112 4.7.3. Reverse Problem Formulation using Molecular Signatures ......................114 4.7.4. Signature Based Algorithm for Molecular Design ....................................115 4.7.5. Expression of Group Contribution Models with Signatures......................126 4.7.6. Property Models with Different Signature Heights ...................................129 4.7.7. Enumeration of Molecular Structures from Signatures .............................131 4.7.8. Stepwise Procedure for Solving a Molecular Design Problem..................136 4.8. General Framework for Integrated Flowsheet and Molecular Design................137 4.9. Summary.............................................................................................................139 5. Case Studies.................................................................................................................141 viii 5.1. Design of Blanket Wash Solvent ........................................................................141 5.2. Metal Degreasing Solvent Design ......................................................................151 5.2.1. Visual Solution...........................................................................................154 5.2.2. Algebraic Solution .....................................................................................158 5.3. Design of Alkyl Substituent for the Fungicide DD ............................................162 5.3.1. Problem Statement.....................................................................................162 5.3.2. Solution of Design Problem with Two Types of Property Models ...........163 5.3.3. Solution of Design Problem with Different Topological Indices ..............173 5.4. Acid Gas Removal ..............................................................................................175 5.4.1. Problem Statement.....................................................................................175 5.4.2. Process Design...........................................................................................177 5.4.3. Molecular Design.......................................................................................178 5.4.4. Proof of Concept for Integrated Flowsheet and Molecular Design ...........187 6. Conclusions and Future Work .....................................................................................191 6.1. Major Achievements...........................................................................................191 6.2. Future Work........................................................................................................195 6.2.1. Statistical Tools for Product Design ..........................................................195 6.2.2. Integrated Process and Product design ......................................................196 6.2.3. Inclusion of More Sophisticated Descriptors.............................................198 6.2.4. Exploration of Biochemical Reaction Pathways........................................199 6.2.5. Design of Reactions ...................................................................................201 References........................................................................................................................203 Appendix: Group Contribution and Connectivity Indices Data ......................................218 ix List of Figures Figure 2.1 General Chemical Product Design ...................................................................12 Figure 2.2 Product Design Steps .......................................................................................13 Figure 2.3 Property Estimation Models.............................................................................19 Figure 2.4 Response Surface of the Second Order Model.................................................21 Figure 2.5 Mixture Design Plots........................................................................................22 Figure 2.6 Multilevel Approach for Property Estimation using GC Method ....................26 Figure 2.7 Example of Hydrogen Suppressed Graphs.......................................................31 Figure 2.8 Molecular Skeleton of 3,3 Dimethyl Pentane...................................................34 Figure 2.9 Atomic Signatures up to Height 3 ...................................................................52 Figure 2.10 Molecular Signature Tree...............................................................................54 Figure 3.1 Iterative Molecular Design...............................................................................62 Figure 3.2 Multilevel Approach for Product Design .........................................................65 Figure 3.3 Simultaneous Consideration of Process and Product Design...........................70 Figure 3.4 Reverse Problem Formulation..........................................................................71 Figure 4.1 Visualization of Intra-stream Conservation of Clusters ...................................77 Figure 4.2 Visualization of Inter-stream Conservation of Clusters ...................................77 Figure 4.3 Mixing of Streams ............................................................................................79 Figure 4.4 Feasibility Region on a Ternary Diagram ........................................................82 Figure 4.5 Mixing of Molecular Groups............................................................................88 x Figure 4.6 Simultaneous Process and Product Design Framework ...................................90 Figure 4.7 Second Order Group Formation .......................................................................94 Figure 4.8 Mixing of CI Group with GC Group..............................................................108 Figure 4.9 Coloring of Atomic Signature ........................................................................120 Figure 4.10 Illustration of Connectivity Principles..........................................................123 Figure 4.11 Integrated Process-Product-Flowsheet Design Framework .........................138 Figure 4.12 Flowchart of the Integrated Process-Product-Flowsheet Design ................139 Figure 5.1 Cluster Diagram for Degreaser Design ..........................................................157 Figure 5.2 Visual Solution of Molecular Design Problem ..............................................158 Figure 5.3 Acid Gas Removal Flowsheet ........................................................................176 Figure 5.4 Best Five Solutions to Acid Gas Removal Problem.......................................186 xi List of Tables Table 2.1 Types of Products ..............................................................................................10 Table 2.2 Group Contribution Models ..............................................................................27 Table 2.3 Adjustable Parameters in Group Contribution Models .....................................28 Table 2.4 Values of KC-X Parameters ................................................................................39 Table 4.1 Visual Molecular Design Algorithm .................................................................85 Table 4.2 Algebraic Approach Algorithm ......................................................................103 Table 4.3 GC+ Model Algorithm ....................................................................................109 Table 4.4 CI Calculation using Signatures ......................................................................111 Table 4.5 Signature Equivalent of Topological Indices ..................................................113 Table 4.6 Consistency of Signatures ..............................................................................124 Table 5.1 Property Constraints for Blanket Wash Solvent .............................................143 Table 5.2 Property Operators and Reference Values ......................................................143 Table 5.3 Adjustable Parameters ....................................................................................143 Table 5.4 Normalized Molecular Property Operator Values ..........................................144 Table 5.5 Property Data of Selected Molecular Fragments.............................................144 Table 5.6 Second Order Groups and Their Contributions ..............................................147 Table 5.7 Valid Formulations and Their Properties ........................................................147 Table 5.8 Groups for Ring Compounds ..........................................................................148 Table 5.9 Second Order Groups and Their Contributions for Cyclic Structures.............149 xii Table 5.10 Valid Cyclic Compounds and Their Properties ............................................150 Table 5.11 Property Constraints for the Degreaser Problem ..........................................152 Table 5.12 Property Operators and Reference Values for Degreaser Design..................154 Table 5.13 Normalized Molecular Property Operator Values ........................................154 Table 5.14 Atom and Bond Indices ...............................................................................155 Table 5.15 First Order Connectivity Indices ...................................................................156 Table 5.16 CI Property Contributions of CI Groups .......................................................156 Table 5.17 Possible Higher Order Groups and Their Property Contributions ................159 Table 5.18 Final Solution Set for Degreaser Design .......................................................160 Table 5.19 Upper and Lower Bounds for Fungicide Properties .....................................163 Table 5.20 Signatures of Height Two for Alkanes ..........................................................166 Table 5.21 Solution in Terms of Signatures ....................................................................171 Table 5.22 Possible Alkyl Substituents ...........................................................................172 Table 5.23 New Solution for Alkyl Substituent ..............................................................175 Table 5.24 Property and Flowrate Data for Acid Gas Removal Problem ......................177 Table 5.25 Property Targets for Molecular design .........................................................178 Table 5.26 Property Operators and Targets ....................................................................181 Table 5.27 New Property Targets for Acid Gas Absorbent Design ...............................187 Table 5.28 Driving Force Data .......................................................................................189 Table 5.29 Energy Index Values .....................................................................................189 Table A.1 First Order Group Contribution Property Data...............................................219 Table A.2 Second Order Group Contribution Property Data ..........................................223 Table A.3 Third Order Group Contribution Property Data .............................................226 xiii Table A.4 First Order GC Data for Acentric Factors and Liquid Molar Volume ...........228 Table A.5 Second Order GC Data for Acentric Factors and Liquid Molar Volume .......230 Table A.6 Regressed Parameters for Different Atom Types in the CI Method...............232 xiv Nomenclature AUP Augmented property index Cj Property cluster for property j D Number of degrees deg Degree FBN Free bond number FBNg Free bond number associated with group g GCM Group contribution method G(x) Molecular subgraph of atom x h Height of signature Hfus Heat of fusion hfus0 Adjustment parameter used in the estimation of heat of fusion hfus1 Contribution of first order group for the estimation of heat of fusion hfus2 Contribution of second order group for the estimation of heat of fusion ?Hv Standard heat of vaporization at 298 K hv0 Adjustment parameter used in the estimation of heat of vaporization hv1 First order group contribution for estimation of heat of vaporization hv2 Second order group contribution for estimation of heat of vaporization xv M Number of edges Ng Total number of first order molecular groups Ns Total number of second order molecular groups Nr Number of rings in molecular structure nf Number of functional groups in the main chain ng Molecular group ngr Groups forming the ring no Number of carbon atoms in the main chain Pjg Contribution to property j from group g R Number of rings Tb Boiling point tbo Adjustment parameter used in the estimation of boiling point tb1 Contribution of first order group for estimation of boiling point tb2 Contribution of second order group for estimation of boiling point TI Topological index V Vertex Tm Melting point tmo Adjustment parameter used in the estimation of melting point tm1 Contribution of first order group for estimation of melting point tm2 Contribution of second order group for estimation of melting point xs Fractional contribution Greek symbols ?j (Pj) Molecular property operator of the jth property xvi )( jirefj P? Reference operator for jth property ?j Normalized property operator for property j ? Number of occurrences of one first order group in a second order group ? Molecular signature ? Number of each signature ? Property function ? Shape index ? Connectivity index 1 1. Introduction The selection of products/product mixtures that give the optimum performance of a process is a critical issue for a design engineer. The process performance is usually understood in terms of physical properties and on many occasions, the physical properties of the product rather than their chemical structure determine the suitability of a specific product as the input to the process. For example, in the design of a blanket wash solvent, the primary focus of designer are the solubility parameter, flammability, vapor pressure etc. of the solvent. Molecular design algorithms generally require target properties to design the molecules. At the same time, to identify the target properties that give the optimum process performance, the process parameters are to be considered as well. Therefore, for obtaining the optimal solution for this type of problem, it is necessary to have a methodology to represent the product performance in terms of measurable physical properties and identify the molecule/mixture that gives the property targets corresponding to the optimum process performance. In spite of the relationship between the process design and product design problems, they have been traditionally considered as two separate problems because the product design part is generally considered as outside the scope of chemical engineering design. Therefore, the product identified by chemists (without considering the process design aspects) may not provide the best solution corresponding to the optimum process performance and this makes the process of identifying the suitable molecule/mixture an iterative process. Therefore, on most occasions, the chemists try to design the product 2 based more on expert knowledge, trial and error, heuristics and experimentation based on intuition rather than any specific scientific reasoning. The attempts to develop a procedure to pass the information between the process and product designer brought about the development of reverse problem formulations (Eden et al., 2004). The conventional way to treat a combined process-product design problem was through mathematical modeling. This approach considers the molecular design as an optimization problem. The objective in such an optimization problem is to minimize the error between the sought values and the values attained by the current design. The advantage with this approach is in most of the cases, it is possible to represent the problem in terms of known mathematical expressions. The system of equations formed consists of balance equations, constraint expressions and constitutive equations (Russel et al., 2000). The non-linear nature of the constitutive equations, which is often the case with most structure-property relationships, makes it difficult to obtain convergence during the computational stages. In such cases, by following decomposition techniques, the values of a subset of intensive variables are determined that match the required property targets. This can be considered as the reverse of a property estimation problem. The target properties are estimated by solving a reverse simulation problem, where for the given values of design and input variables, the desired range of property values can be obtained (Gani & Pistikopoulos, 2002). The reverse problem formulation decouples the complicated property models from the system of equations and the conventional forward problem can be divided into two reverse problems. The first reverse problem solves the balance and constraint equations in terms of properties to provide the design targets. The second reverse 3 problem then solves the constitutive equations to identify the operating conditions/products to match the property targets set from the first reverse problem. Decoupling the constitutive expressions from the system of equations makes the system linear and achieving convergence is easier. In this way, the reverse problem formulation (RPF) lowers the complexity of the problem without compromising the accuracy (Eden et al., 2004). The RPF provides a property-based platform to link process and product design problems since the process performance can be represented in terms of the properties, and the properties form the input to an molecular design problem. However, such an algorithm can be followed only if there is a way to track the properties. The concept of property clustering provides the necessary tools to track properties. Property clustering techniques have recently been extended into molecular design to develop molecular clusters that allow the molecular groups to be combined to match a set of target properties (Eljack et al., 2007). The property models available in group contribution methods formed the link connecting the molecular structure and properties in the above- mentioned algorithm. The purpose of this work is to expand the range of applications of the integrated process and product synthesis approach. The product design part of this approach, even though it conceptually bridged the gap between process and product design is limited in terms of its practical applications in its present form. Three major limitations of the present method are being addressed in this dissertation. The first one is the limitations of the property models that can be applied in that algorithm. The algorithm by Eljack et al. (2007) is based on first order group contribution methods. This method provides a good 4 starting tool in several product design problems. Nevertheless, its applicability is limited to the design of simple molecular structures because the first order group contribution method has limited accuracy especially when dealing with polyfunctional molecules and cyclic molecules. In addition, first order groups cannot capture proximity effects or differentiate between isomers. A variety of newer group contribution methods tried to address this issue by considering the effects of the combinations of certain molecular groups in the chemical structure (Constantinou & Gani, 1994; Marrero & Gani, 2001; Conte et al., 2008). These combinations of molecular groups are known as higher order groups. In order to increase the range and applicability of the design, higher levels of group contribution methods also have to be represented in the cluster domain and considered in generating molecular structures. However, unlike first order groups, the second and third order groups are not linearly represented in a molecule. In addition, since the higher order molecular groups are based on the effects due to the combination of the constituent molecular building blocks, their effects cannot be predicted before knowing the complete molecular structure. Therefore, a new algorithm must be developed for their systematic inclusion into the cluster domain. In the visual approach developed by Eljack et al. (2007), the number of property targets has to be three for the simultaneous consideration of process and product requirements. In their approach, different molecular groups are mixed/combined according to a set of rules developed to obtain different candidate solutions. However, the number of properties of interest may be more than three on many occasions. Similarly, the generation of a complete potential candidate set is very significant for product design problems because, as mentioned in section three (figure 3.2), there are further stages in a 5 product design algorithm after the elimination based on the group contribution based techniques. Therefore, it is very important to make sure that none of the potential candidates are overlooked in the initial stages of product design because the possibility that a molecule not considered could be the optimum candidate based on parameters other than group contribution method based tools cannot be ignored. Therefore, there should be a generalized procedure to generate all potential solutions of a problem to match the given set of property targets. The motivation to produce more sustainable and environmental friendly chemicals to meet the consumer needs has increased considerably over the last decade (Kokossis & Yang, 2009). Therefore, it is important to have a systematic methodology to design chemicals that possess both the consumer specified attributes and environmentally acceptable characteristics. Most biological and environmental properties are structure dependent and group contribution techniques are not available or reliable for the determination of these properties. However, a lot of work has already been done to categorize atoms or molecules systematically based on their structure and to relate these assignments to their biological activities and properties. These relationships are termed as Quantitative Structure-Activity Relationships (QSAR) and Quantitative Structure- Property Relationships (QSPR) (Kier & Hall, 1986). QSAR and QSPR can provide viable tools for the determination of many properties from molecular structure information. However, most QSAR and QSPR techniques are very property specific and thousands of molecular structural descriptors are now available in literature corresponding to different properties (Randic & Basak, 2001). In spite of this, very few attempts have been made to make use of the available QSAR/QSPR relationships to solve inverse design problems. 6 This is because, compared to the group contribution models, the QSAR/QSPR relationships have very complicated formulations due to the highly non-linear nature of most of the topological indices used in developing such relationships. In addition, unlike molecular groups, there is no one to one mapping possible from the solution of an optimization problem to the final molecular structures (Visco et al., 2002; Faulon et al., 2003a). The recently introduced concept of molecular signature description (Visco et al., 2002; Faulon et al., 2003b) is a potential tool for the reverse problem formulation techniques explained in chapter 3 of this dissertation. The motivation behind developing algorithms to introduce signature description into the reverse problem formulation techniques is that, it is an already proven fact that many molecular descriptors can be written in terms of their signatures and many such relationships are linear in nature. Therefore, one single algorithm will have the potential to handle different molecular descriptors (Visco et al., 2002; Faulon et al., 2003b). In addition, the enumeration of molecular structures from the solution of an inverse problem is challenging. Even though, a few stochastic techniques are available to solve this problem (Venkatasubramanian et al., 1994; Sheridan & Kearsley, 1995; Venkatasubramanian et al., 1995), there have been very few attempts to solve this problem using a deterministic approach. Therefore, an algorithm that can solve this type of problems using a deterministic approach would widen the applicability of reverse problem formulations to a different domain of problems. The dissertation has been distributed in six chapters. Chapter 2 covers most of the background information including details of the nature of process design problems, some current product design techniques, the basics of group contribution methods, topological 7 indices and their applications in QSAR/QSPR expressions and a brief description of molecular signature descriptors and their application to property prediction. The chapter 3 covers the current state of the art in the field of computer aided molecular design, the role of property models, the concept of reverse problem formulations and the integrated process and product design framework. Chapter 4 starts with the basics of property clustering and first order molecular property clusters. Then, it covers the systematic development of second and third order molecular property operators. The systematic procedure for the development of an algebraic algorithm for the application of these operators in the solution of integrated process and product design is also presented. The next section describes the procedure followed for the introduction of connectivity index based models into the clustering framework and the steps followed in the development of an algorithm for their application in solving integrated process and product design. The developed visual and algebraic approaches are described. In the next section, the application of molecular signature descriptors in solving molecular design problems is described. Finally, the general framework to integrate flowsheet design techniques to process and molecular design problems is presented. Chapter 5 provides four application examples for the algorithms developed in chapter 4. The first example is a blanket wash solvent design problem to illustrate the development of higher order molecular groups and the algorithm for their introduction into an algebraic solution framework. The second example illustrates the application of the algorithm developed for the introduction of combined connectivity index/group contribution models into the reverse problem algorithm. The third example is a simple molecular design problem that shows the applications of molecular signature descriptors. The fourth example is an integrated 8 flowsheet and molecular design problem. The last chapter covers the major achievements and conclusions from this project and highlights some of the future works that can be done based on the techniques developed in this dissertation. Most of the group contribution data and connectivity index data are provided in appendix. 9 2. Theoretical Background 2.1. Chemical Product Design Chemical product design is an emerging branch of chemical engineering. In the past, the development of new chemical products has always been left to chemists and the chemical engineering community generally focused on the process design aspects and ignored all product related issues other than purity. Due to this, the product design has always been considered separately from the process design with no feedback between each other. This approach often leads to the generation of sub-optimal solutions. In addition, most chemical products currently in use have been developed after scientific experimentation based on knowledge of existing products that has been largely based on heuristics and expert knowledge. This approach often limits the scope of the output solutions because of the inability to produce non-intuitive solutions. The innumerable options available for such a methodology ultimately make this technique researcher specific and many times based on intuition. Therefore, there is a need for a comprehensive and systematic methodology for solving product design problems. Cussler has classified all the chemical products into three broad classes (Cussler et al., 2010): commodities, molecular products and performance chemicals. Commodities are bulk chemicals and the focus of chemical engineers while producing these chemicals are traditionally on designing the processes to produce them economically. The second class of chemicals are molecules with specific applications like pharmaceutical products, and the key to market them depends on the speed of the discovery and the ability to 10 introduce them into the market immediately after the discovery. In the third class of products, the value will be added depends on the specific microstructure. The key to the marketability will be the function and the benefits that they provide like the flavor a chemical provides to the ice cream or the shine a certain polishing material can provide to the shoe. A summary of the types and focus involved in the different classes of products are given in table 2.1 (Cussler et al., 2010): Table 2.1: Types of Products Hill (2009) identified the need for a new mindset along with new chemical engineering approaches for the solution of product design problems and termed the emergence of this field as a new paradigm. This is because, the chemical engineering community generally ignored product related issues with the exception of purity. The process related issues were the only areas of interest for chemical engineers. The mindset behind solving the different classes of process design problems follows the same approach. Here, the focus of design will be on obtaining the process with minimum cost. The design of a new product however, requires a different understanding of the profit, which may not be readily converted into a set of mathematical expressions. In order to successfully solve a product design problem, the designer has to identify both the process requirements and product specifications and to systematically generate a finite number of Commodities Molecules Performance Key Cost Speed Function Basis Unit operations Chemistry Microstructure Risk Feedstock Discovery Science 11 potential candidate solutions to satisfy the problem requirements. The experimentation can now be limited to the systematically obtained candidates because it is not possible to conduct experiments with all available options. A chemical product design problem can be stated like this: Identify a chemical product (molecule or mixture) that satisfies a set of desired needs. So, a product design problem can be considered as an inverse property prediction problem where the attributes are represented in terms of physical properties (Gani & O'Connell, 2001). The purpose of product engineering is not to substitute the traditional experimental techniques and/or heuristics followed to design a molecule. Instead, the product design engineers aim to systematically eliminate the numerous unsuitable options and reduce the search space to a finite set of options. Therefore, the chemical product design can be considered as a phase in the overall product development operation that should precede a well defined experimental program, design and analysis (Hill, 2009). Cussler and Moggridge (2001) have identified the principal steps involved in a product design process. Specific to each problem, solution strategies are to be developed in each step: 1. Identify customer needs 2. Generate ideas to meet the needs 3. Select among ideas 4. Manufacture product 12 This framework is a simplified yet generalized representation of what product design is all about. Each step must be defined more elaborately for different specific classes of problems. However, the framework applies to all kinds of problems. The first step is traditionally considered as a topic handled by marketing experts and chemical engineer?s tasks usually start from step 2. However, a recent work has incorporated the first step as a part of an optimization framework as shown in figure 2.1 (Smith & Ierapetritou, 2009). The motivation behind forming such a customer integrated approach for product design is the realization that, the driving force for a product- centered industry is the consumer needs (Stephanopoulos, 2003). Smith and Ierapetritou (2009) developed a product design methodology for which the inputs are the consumer inputs and economic criteria. The design problem has been formulated as a biobjective optimization problem that ensures the consumer influence in design trade-off considerations. Figure 2.1: General Chemical Product Design (Smith & Ierapetritou, 2009) In addition, chemical engineers are in a better position to refine the understanding of customer needs because of their ability to analyze what is physically possible (Hill, 13 2009). Similarly, once the product to be manufactured is finalized, the actual production of it is more specific to the compound and needs to be treated accordingly. The heart of the product design problem is steps 2 and 3 and so, the focus of the work in this area has to be to generate and select among ideas. Step two has been traditionally performed by chemists according to the specifications provided by process engineers and are typically based on heuristics and expert knowledge. The compounds they identify in this step are analyzed by process design engineers to decide the suitability of the supplied options in step three. If none of the options are practical, their conclusions will be send back to the chemists for more options. Therefore, this is an iterative process as described in figure 2.2. Figure 2.2: Product Design Steps (Gani, 2004a) 14 The computational complexity involved in identifying a suitable molecule with desirable properties can never be underestimated. Without a systematic approach to track the specific solutions, the number of potential solutions will be prohibitively high even with very restrictive criteria. For instance, if the search space is limited to all alkane molecules up to C22, there will be 38 million different isomers (Davidson, 2002). As the number of atoms under consideration increases, the number of candidates increases and results in combinatorial explosion. For example, even if the search restricted to acyclic molecules with three double bonds, the number of structural isomers for C4H10N2O3S2 will be 16 million (Contreras et al., 1994) The traditional approach above points out that, since the process and product design steps are done separately, the product identified may not be corresponding to the optimum process performance since the product design has been done without the knowledge of property targets corresponding to the optimum process performance. A recently developed integrated process and product design approach ( Eljack et al., 2007b; Eljack & Eden, 2008) provides an alternate way to solve this problem. This approach and the various tools used will be discussed later in this dissertation. 2.2. Mathematical Formulation of Chemical Product Design If the focus of the product is on the macroscopic properties, the product design can be considered as a combination of molecular design and mixture design (Achenie et al., 2003). All different types of product design problems can be represented using the following set of generalized mathematical expressions (Gani, 2004a). 15 { } (2.7) (2.6) ),( (2.5) )( (2.4) 0),( (2.3) 0)( (2.2) 0)( (2.1) )(max 33 222 111 3 2 1 uCxByl uyxgl uxgl yxh xh xh xfyCF TObj ?+? ?? ?? = = = += In the above expressions, x is the vector of continuous variables like fraction in a mixture, flowrates etc., y is the vector represents the presence or absence of a group, compound, operation, etc., h1 (x) is a set of equality constraints corresponding to process design specifications, h2(x) is a set of equality constraints corresponding to process model equations, h3(x, y) is a set of equality constraints related to molecular structure generation, mixing rules for properties, etc, g1 (x) is a set of inequality constraints related to process design specifications, g2 (x, y) is a set of inequality expressions corresponding to specific problems related to the product design and f(x) is the vector of objective functions. For process design problems, f(x) will typically be a non-linear function and for an integrated process and product design problem, f(x) typically represent more than one non-linear expression. B and C represent the matrix containing fixed data in constraint defined by eq. (2.7). 16 Depending on the specific nature of the problem, some or all of the equations/constraints listed above may be used. A few different types of product design problems are as follows (Gani, 2004a): 1) Satisfy only the constraint in eq. (2.6): A product design problem based on a database search. Here, the objective would be to identify from the dataset, the compounds matching the property constraints. Here, the molecular structure generation or the application of property models is not necessary. 2) Satisfy constraints in eqs. (2.4) and (2.6): Here, the molecular structures are generated on the basis of property model in eq. (2.4) subject to the constraints in eq. (2.6). 3) Satisfy the objective function and constraints in eqs. (2.4) and (2.6): In this problem, the optimum molecular structure has been identified according to the objective function given in eq. (2.1) subjected to the product design constraints in eq. (2.6). The property model given in eq.(2.4) is used to generate the molecular structure. However, there is no guarantee that the solution obtained is an optimal solution because of the non-linear equations that constitute the property models in eq. (2.4). 4) Satisfy all the constraints: This type of problems identifies the set of products corresponding to the process requirements. Therefore, this is a simultaneous process-product design problem. This will generate feasible, but not necessarily optimal solutions because of the non-linear nature of the property models in both process and product design specifications. 17 5) Solve all the equations: This is an integrated process-product design problem. The non-linear nature of the objective function and the process model equations will make this a complex mixed integer non-linear programming (MINLP) problem. In all these kinds of approaches, the properties either need to be provided or predicted through property models (Achenie et al., 2003). Therefore, except for the first type of product design problems, the application range of any developed product design methodology is limited to the accuracy of the involved property models. Therefore, the success of product design methodologies will depend on the included property constraints and the process/product model. 2.3. Types of Properties and Estimation Techniques Gani and Constantinou (1996) proposed a three tier classification for different properties as primary, secondary and functional. Primary properties Properties that can be estimated from the molecular structure variables. Examples include normal boiling point, normal melting point, heat of vaporization at 298 K etc. Secondary properties Pure component properties which are dependent on other properties. Examples are solubility parameter, density at a given temperature, etc. Functional properties Pure component properties which are dependent on temperature and/or pressure. Examples are density, vapor pressure, enthalpy, etc Apart from these general classifications, there are a number of high level performance characteristics, which are difficult to estimate. These kinds of properties 18 involve the taste of food products, aroma of fragrances, various mechanical properties, etc. Since many of these properties are dynamic and the design objectives can only be specified in terms of a time-evolution profile, the modeling process will be very challenging. Different types of hybrid approaches can be applied for solving such problems such as combining molecular modeling with kinetic phenomena to obtain prediction accuracy (Ghosh et al., 2000). Depending on the types of target properties and expected range of accuracy, different types of property models can be used for product design. Gani and Constantinou (1996) have proposed the classification of property models shown in figure 2.3. However, most of the types of models shown in figure 2.3 are suitable only for forward problems because of the computational complexities involved in the quantum mechanics calculations. The semi-empirical and empirical models also posses many computational difficulties in a traditional mathematical programming based approach. However, different tools have been developed recently to incorporate many of such methods into inverse design formulations (Venkatasubramanian et al., 1995; Camarda & Maranas, 1999; Sahinidis et al., 2003; Papadopoulos & Linke, 2006; Eljack & Eden, 2008; Solvason et al., 2009). 19 Classification of Estimation Methods Reference Approximate Mechanical models Semi-empirical models Empirical models Quantum mechanics Molecular Mechanics Molecular Simulation Corresponding States Theory Topology/Geometry Group/Atom/Bond additivity Chemometrics Pattern matching Factor analysis QSAR/QSPR Figure 2.3: Property Estimation Models (Achenie et al., 2003) 2.4. Mixture Design 2.4.1. Design of Experiments Design of experiments (DOE) is a statistical method to plan and execute experiments in a systematic manner so that maximum information can be gained from the experiments. DOE is a potential tool in the field of product design because, as mentioned previously, the product design in actual practice depends heavily on results obtained through experiments. To develop a model, the first step is to identify the factors that affect the variable of interest. In the next step, a model is postulated to represent the effect of the factors on the response of interest of the variable. The objective is to optimize the response. In the next step, the experimental points are placed to which the model can be fitted. In the final step, the model adequacy is tested. There may be many iterations until the experimenter decides the accuracy is good enough (Cornell, 2002). 20 The accuracy of this method depends on the adequacy of the model equation and the location of the design points. The polynomial model is the most commonly selected model to represent a response surface since it can be expanded through a Taylor series to improve accuracy (Cornell, 2002). Generally, first or second degree models will be adequate to represent the surface (Montgomery, 2005). Figure 2.4 shows one example of a response surface plot for a second degree model where the response y corresponding to the factors x1 and x2 are plotted. The fitted first order and second order equations have the following general form: ??? ++= ? = u i iio xy 1 (2.8) ???? +++= ??? ? ?= u ji u ij jiij u i iio xxxy 1 (2.9) Here, y is the response, xi and xj are the factors affecting the response. The ? values are the regression coefficients and ? is the error observed in the response. Note that, in the second order equation, there is a term corresponding to the interaction effects of the factors involved. 21 x2 -4 -2 0 2 4 x1 -4. 2 -2. 1 0. 0 2. 1 4. 2 Figure 2.4: Response Surface of the Second Order Model 2.4.2 Mixture Design of Experiments Mixture design of experiments (MDOE) is an extension of DOE in which the factors are the chemical constituents. Therefore, the constituent fractions will sum to one and every constituent fraction must have a value between zero and one. Figure 2.5 shows one example of a mixture design plot where the density of a mixture made from the components x1, x2 and x3 is plotted. However, this relationship violates the condition that the factors are independent and random and thereby imposes a colinearity effect. Therefore, even though the model can still be used, it will affect the interpretation of the regression coefficients. However, because of this condition, it is possible to represent the mixture data on a simplex. Scheffe developed the first simplex-lattice designs, which many researchers consider to be the foundation of mixture design (Cornell, 2002). 22 According to Scheffe models, the response surface can be represented in terms of only the pure component and interaction terms. However, due to the colinearity, the regressors will not provide the true interpretation of the pure component or interaction effects. Figure 2.5: Mixture Design Plots In order to address the limitations of Scheffe models, the Cox model (Cox, 1971) was introduced. Even though the Cox model removes the colinearities introduced by the relation between the mixture constituents, it leaves the secondary colinearities introduced by the constraints in the constituent ranges (Solvason et al., 2008). One solution to this issue may be the use of decomposition techniques like Principal Component Regression (PCR) and Partial Least Squares (PLS) (Kettanah-Wold, 1992). Applying the property clustering technique is another way of treating this issue (Solvason et al., 2008). X1 X2 X3 23 2.5. Group Contribution Methods 2.5.1. Initial Efforts Computer Aided Molecular Design (CAMD) techniques have become a significant part of process and product design because of their ability to predict and design molecules with a given set of properties. In all CAMD algorithms, it is necessary to have a systematic method to evaluate whether the designed structures satisfy the property constraints set by the process from well-defined molecular building blocks. Therefore, almost all CAMD techniques use group contribution methods (GCM) to verify whether the generated molecules exhibit the specified set of desirable properties (Harper et al., 1999). In additive group contribution methods ( Benson, 1968; Ambrose, 1978, 1980; Joback & Reid, 1987; Horvath, 1992), a molecule is considered as a collection of various simple groups. The property function of a molecule has been estimated as a summation of the property contributions of all the molecular groups present in the molecular structure. i i iCNXf ?=)( (2.10) Here, f(X) is a function of the actual property X, Ci is the contribution of the molecular group i that occurs Ni times These contributions are estimated through regression of large amounts of experimental data. Group contribution methods are indispensable tools for property prediction of molecules from their structures especially when the experimental values for the properties are not available. They are simple and yet provide reasonably accurate 24 results for many properties. These methods can provide quick estimates of properties without much computational complexity and errors (Constantinou et al., 1993). In the case of simple compounds, GCM can provide accurate trends due to the addition of new functional groups to the existing structure. During molecular synthesis, this will help to generate molecular structures that meet a specific property in a systematic way from basic molecular groups (Joback & Stephanopoulos, 1989). However, as the complexity of the molecule increases, the accuracy of first order GCM becomes less reliable. They generally cannot capture proximity effects or differentiate between isomers (Kehiaian, 1983; Wu & Sandler, 1989, 1991). The simplified representation of molecular structure in any of the above-mentioned methods ignores many of the concepts in organic chemistry and quantum mechanics like resonance, conjugation and various interactions among groups (Mavrovouniotis, 1990). So, several attempts have been made to make the GCM more general and reliable (Fedors, 1982; Reid et al., 1987; Constantinou et al., 1993). The ABC method introduced by Constantinou et al. (1993) is of particular importance. The ABC method is based on the contributions of atoms and bonds in the properties of different conjugate forms of a molecular structure. Here, the property of a molecule has been estimated as the linear combination of contributions from all the conjugate forms of the molecule. However, the generation and enumeration of the conjugate forms is computationally challenging. Nevertheless, this method provided the basis for future methods, which did not require such computational effort (Constantinou & Gani, 1994). 25 2.5.2. Group Contribution Models with Higher Levels A new GC approach was put forward by Constantinou and Gani (1994) in which the property estimation is done in two stages. In this approach, two types of molecular building blocks are defined. The basic level is known as first order groups and the next higher level is called second order groups. The second order groups have first order groups as their building blocks. They essentially represent different types of interactions among the first order groups and the effects of certain molecular group combinations to the property of the final molecule. The second order groups can provide a better description of compounds with many functional groups and differentiate among isomers. However, even the second order groups may not be able to provide a good representation of poly-ring compounds and open-chain polyfunctional compounds with more than four carbon atoms in the main chain (Marrero & Gani, 2001). Therefore, a further level of molecular groups have been identified and their property contributions have been regressed (Marrero & Gani, 2001) for use in group contribution methods. The formation of third order groups is analogous to the second order groups, but the focus is on a different class of molecular groups. The third order groups focus on multi-ring compounds, fused ring compounds and compounds with many functional groups in the structure. Similar to the second order groups, third order groups also have first order groups as their building blocks. The property estimation model developed in this approach has the following form: k k kj j ji i i EOzDMwCNXf ??? ++=)( (2.11) 26 Here, f(X) is a function of the actual property X, Ci is the contribution of first order group i that occurs Ni times, Dj the contribution of second order group j that occurs Mj times and Ek the contribution of third order group k that occurs Ok times in the molecule. The constants w and z can have values zero or unity depending on how many levels of estimation are of interest. The pictorial representation of the property estimation technique using higher order group contribution techniques has been shown in figure 2.6 (Conte et al., 2008). The properties estimated using this technique and the corresponding property functions are listed in table 2.2. The universal constants for each property function are given in table 2.3. Four properties, for which property functions and group contributions are estimated in two different articles (Constantinou et al., 1995; Conte et al., 2008) up to second order level, are also included in the table 2.2. Figure 2.6: Multilevel Approach for Property Estimation using GC Method First level Groups representing simple and monofunctional compounds Second level Groups representing conjugates, employed to distinguish between isomers Third level Groups representing large, complex and multifunctional compounds 27 Table 2.2: Group Contribution Models Property Property Function Group Contribution Terms Normal melting point, Tm ( )0exp mm TT km k kjm j jim i i TOTMTN 321 ??? ++ Normal boiling point, Tb ( )0exp bb TT kb k kjb j jib i i TOTMTN 321 ??? ++ Critical temperature, Tc ( )0exp cc TT kc k kjc j jic i i TOTMTN 321 ??? ++ Viscosity, ? ln(?) 21 j j ji i i MN ?? ?? + Critical volume, Vc 0cc VV ? kc k kjc j jic i i VOVMVN 321 ??? ++ Standard Gibbs energy, Gf 0ff GG ? kf k kjf j jif i i GOGMGN 321 ??? ++ Critical pressure, Pc ( ) 25.01 ccc PPP ?? ? kc k kjc j jic i i POPMPN 321 ??? ++ Standard enthalpy of formation, Hf 0ff HH ? kf k kjf j jif i i HOHMHN 321 ??? ++ Standard enthalpy of vaporization, Hv 0vv HH ? jv j jiv i i HMHN 21 ?? + Standard enthalpy of fusion, Hfus 0fusfus HH ? kfus k kjfus j jifus i i HOHMHN 321 ??? ++ Acentric factor, w ( ) Caw b ?/exp 21 j j ji i i wMwN ?? + Liquid molar volume, Vm dVm ? 21 mj j jmi i i VMVN ?? + Surface Tension, ? ? 21 j j ji i i MN ?? ?? + 28 Table 2.3: Adjustable Parameters in Group Contribution Models Adjustable Parameter Value Tm0 147.45 K Tb0 222.543 K Tc 231.239 K Pc1 5.9827 bar Pc2 0.108998 bar-0.5 Vc0 7.95 cm3/mol Gf0 -34.967 kJ/mol Hf0 5.549 kJ/mol Hv0 11.733 kJ/mol Hfus0 -2.806 kJ/mol a 0.4085 b 0.505 c 1.1507 d 0.01211 The application of group contribution based CAMD techniques rely on the availability of molecular groups and the estimated property contributions corresponding to each group. The properties that can be predicted based on the molecular structure alone are called primary properties and the properties that can be estimated as a function of primary properties and molecular structural data are called secondary properties (Constantinou & Gani, 1994). 29 2.6. Topological Indices and Property Prediction The chemical structure of a molecule can provide a lot of information about the properties that it possesses. The information available from a structural formula includes (1) total number of atoms (2) number of each type of atoms and (3) the bonding between the atoms. These sets of information enable the structure to be represented in a graphical form (Kier & Hall, 1986). The representation of a molecule in the form of a graph is the first step in the development of topological indices. This would allow the conversion of the structural formula into indices and a potential opportunity to relate the structure to properties (Kier & Hall, 1986). Once the chemical structure is represented in the form of a graph, a number of graph theoretical matrices can be formed from the chemical structure. The most commonly used matrices in the formation of topological indices are adjacency matrix and distance matrix (Trinajstic, 1992). The vertex adjacency matrix can be represented as shown in eq. (2.12): ( ) ? ? ? ??? ? = otherwise 0 weighted-k is ) vand (v edge andadjacent are vand v verticesifk adjacent are vand v verticesif 1 jiji ji ijA (2.12) Here, A(G) is the vertex adjacency matrix of the connected molecular graph G. This matrix is an N x N symmetrical matrix, where N is the number of vertices. The topological indices can be calculated from the graph theoretical matrices by performing different operations over the matrix. The topological index of a molecular graph is a single number that can be used to characterize that graph. Therefore, this 30 number must have the same value regardless of the way in which the graph is labeled (Trinajstic, 1992). A topological index thus is a convenient way for representing the chemical constitution in the form of a number. The challenge in developing topological indices is that the descriptor should be able to form Quantitative Structure Property Relationships (QSPR) or Quantitative Structure Activity Relationships (QSAR). Randic and Basak (2001) suggested that, in order for a topological index (TI) to be of practical significance, it should possess a set of desirable attributes. The important qualities that make a meaningful TI include direct structural interpretation, correlation with at least one property, linearly independent, non-triviality, a basis on structural concepts, etc. The topological indices are defined based on the topology of a molecule. These are developed based on the principles in chemical graph theory. Graph theory is a branch of mathematics that deals with objects that are connected (Wilson, 1986). The objects in the graph are called vertices, the lines used to connect the objects are called edges, and the diagram thus obtained is called a graph. The relationships developed pertaining to graphs have been extended to different disciplines. The principles in graph theory applied to analyze the consequences of connectivity in a chemical graph are termed as chemical graph theory (Trinajstic, 1992). Here, the sites may be atoms, molecules, molecular groups, etc. and the connection between those sites may be bonds, interactions etc. The branch of chemical graphs that represent the constitution of molecules are called molecular graphs. More details on applications of molecular graphs in molecular synthesis are given in section 2.8. The molecular graphs are generally represented as hydrogen-suppressed graphs where only the molecular skeletons without hydrogen atoms (except for heteroatoms) are 31 used. Double and triple bonds are also not shown in the hydrogen-suppressed graph. The presence of hydrogen atoms and multiple bonds are handled in the general formulation of molecular indices. The difference between normal molecular structures and hydrogen- suppressed graphs is shown in figure 2.7. The first figure is the molecular structure of acetone and its hydrogen suppressed graph representation is shown in the second figure. In the hydrogen-suppressed graph, the numbers 1, 2 and 4 represent the carbon atoms without hydrogen atoms on it and 3 represents the oxygen atom. The edges a, b and c represent the bonds connecting these atoms. Even though the double bond and single bonds are represented identically, the definition of bond indices is defined in such a way to take care of that difference. CH3 C O CH3 Figure 2.7: Example of Hydrogen Suppressed Graphs The objective in developing quantitative structure activity relationships (QSAR) and quantitative structure property relationships (QSPR) is to develop practical tools to relate the properties to chemical structure. The property of a molecule can only be explained in terms of the three dimensional aspects known as molecular topography such as shape, volume and surface area. However, the topographical characteristics are indeed related to the nature of individual atoms and the bonds between them. The effect of the bonding pattern is such that it controls the topography and thereby the properties. 32 Therefore, the properties also must be related to the topology of the molecule (Kier & Hall, 1986). An exhaustive study has been conducted by Katritzky and Gordeeva (1993) with a variety of classical topological indices and geometric/electric descriptors for their ability to provide meaningful QSAR/QSPR relationships. It has been confirmed that, the classical topological indices give the best correlations for the determination of physical properties whereas a combination of topological indices and geometrical descriptors give the best quality regression expressions even though the relationships with topological indices alone also did not perform too bad. However, for most of the topological indices, no unambiguous criterion has been followed for their selection and verification (Gutman & Polansky, 1986). Therefore, many of them may contain the same kind of structural information with the difference being only in the scaling factor (Trinajstic, 1992). Most topological indices are developed either from the adjacency matrix or from the distance matrix of the molecular graph (Trinajstic, 1992). However, since the distance matrix can be developed from an adjacency matrix (Bondy & Murty, 2008), the topological indices can be developed from the adjacency matrix alone. Some of the common topological indices are described in the following sections. 2.6.1. Connectivity Indices The basic assumption employed in the connectivity index method is that, the structural formula of a molecule has enough information for relating it to its properties. Therefore, the efforts to obtain non-empirical expressions for such index values are logical. In one of the initial works, Randic (1975) found that in the alkane skeletons, the 33 number of adjacent carbon atoms to one specific atom can give a description of the branching of the molecule. In that work, he proposed to use a molecular descriptor called the delta value. The delta value is the count of formally bonded carbon atoms. The delta value is obtained from the individual atomic valencies of the atoms that form the bond. The product of the atomic valencies is raised to the power of -0.5 to obtain the delta value. The sum of all delta values in the molecule provide an index associated with that molecule. The larger the branching in a structure, the lower will be the branching index corresponding to that structure because of the inverse relationship. The following example illustrates the calculation of the branching index value for a 3,3 dimethyl pentane molecule: CH3 C CH2 CH2 CH3 CH3 CH3 The structure is described with a molecular skeleton with the count of all bonded carbon atoms in figure 2.8: 34 a b 2 d e f 1 4 1 1 2 1 c Figure 2.8: Molecular Skeleton of 3,3 Dimethyl Pentane Here, a, b, c, d, e, f represents different bonds. According to the definition above, the delta value corresponding to each bond can be calculated as follows: ( ) ( ) ( ) ( ) ( ) ( ) 0.707 12: Bond 5.0 22: Bond 0.3535 42: Bond 0.5 41: Bond .50 41: Bond 0.5 41: Bond 5.0 5.0 5.0 5.0 5.0 5.0 =? =? =? =? =? =? ? ? ? ? ? ? f e d c b a Branching Index of 3,3 dimethyl pentane molecule = 0.5 + 0.5 + 0.5 + 0.3535 + 0.5 + 0.7 = 3.0605 The branching number was found to correlate with properties like boiling point, Kovats constants and a calculated surface area (Kier & Hall, 1986). However, additional descriptors developed based on delta values have the limitation that they would not differentiate among saturated and unsaturated bonds. 35 Therefore, a new concept was developed by Kier and Hall (1976). The new structural descriptor is termed as the valence delta (?v) which is based on the explicit counting of each bond to a nearby atom and it is estimated as follows: hZ Vv ?=? (2.13) Here, ZV of an atom is the count of all adjacent bonded atoms and all pi and lone pair electrons and h is the number of hydrogen atoms bonded to that atom. While dealing with high atomic weight atoms, the effect of non-valence core electrons to the atomic size and properties must be considered because such core electrons also have a significant role in influencing the properties. For that, the valence delta is redefined for an atom of atomic weight Z as follows: ( ) ( )1?? ?= v v v ZZ hZ? (2.14) The valence delta values for heteroatoms at higher oxidation states will be different from the values given by the above formula. Some empirical values for a few atoms at different oxidation levels have been estimated and are available in literature (Kier & Hall, 1986). If i and j are the atoms involved in the bond, then, bond indices, ?k are defined through the pair of valence delta values (Kier & Hall, 1986). v j v i k ??? = (2.15) 36 The zero?th order connectivity index of an atom has been estimated from the individual valence deltas and the first order CI is formed from the possible bonds present in the molecule as follows (Kier & Hall, 1986): vi v ?? 10 = (2.16) k v ?? 11 = (2.17) Zero order and first order connectivity indices are the most commonly used molecular descriptors. However, higher orders of connectivity indices can be calculated by following the same methodology. The nth order connectivity index is calculated using eq. (2.18): ( )nvivnv ??? .... 1= (2.18) The zero order connectivity indices (CI) for a molecule are obtained by summation of the CI values of each atom: ( ) ( )?= n nvmoleculev 1 00 ?? (2.19) 37 The first order CI for a molecule is obtained as the sum of CI?s of all edges: ( ) ( )?= n n v molecule v 1 11 ?? (2.20) It is to be noted that, if the first order connectivity indices for the different groups are written separately, their sum will not give the CI value of the molecule. This is because, the contribution to first order CI by the bonds between two separate groups is not represented in the expression. To account for that, an additional term for the bond between different groups has to be included in the expression (Gani et al., 2005). ( ) ? ? ?? ? ? ? ? ? ? +?? ? ? ??? ?= k m group v 1 m groups ofout bonds k bonds internal 1 5.01 ??? (2.21) Here, k is the number of bonds inside the group for which the expression is written and m is the number of free bonds the group has. 2.6.2. Edge Adjacency Index The edge adjacency index (?) is a topological index developed by Estrada (1995a). The reported correlation coefficient (R>0.99) of ? in QSPRs to estimate molar volume is significantly higher than any of the available group contribution methods. The calculation of ? index is described below. The development of the vertex adjacency matrix has been explained in section 2.6. Similar to that, an edge adjacency matrix can also be developed. Two edges are 38 adjacent if one vertex in a chemical graph is incident (having one common vertex) to both the edges. If there are m edges in a graph and gij are the elements in that graph, the edge adjacency matrix, E= [gij]mxm can be defined as shown in eq. (2.22): ? ? ?= Otherwise 0 adjacent are and If 1 ji ij eeg (2.22) The edge degree ?(ek) has been defined in eq. (2.23): ( ) ?= i ikk ge? (2.23) This can be calculated as the sum of elements of kth row in the matrix E. Now, edge adjacency index can be calculated using eq. (2.24): ( ) ( )[ ] 21?? ?= ll ji kee ??? (2.24) where the sum is over all l adjacent edges in the graph. Here, k is a constant, which is defined in eq. (2.25): ? ? ?= Otherwise 0 adjacent are and if 1 ji eek (2.25) 39 If the graph contains heteroatoms, it is necessary to account for the differences in bonds formed between heteroatoms and carbon from the other types of bonds. In that case, the values of gik are replaced with KC-X , which are the values corresponding to the bonds between carbon and the heteroatom (Estrada, 1995b). The KC-X parameters are related to the resonance integral associated with the bond between the heteroatoms and the carbon atom (R. Daudel et al., 1959). Different values have been reported in the literature for KC-X. The values reported by Ortiz and Perez (1982) are shown in table 2.4: Table 2.4: Values of KC-X Parameters 2.6.3. Shape Indices The shape index is a molecular descriptor used for the quantification of the molecular shape. Depending on different aspects of the shape, shape indices of different orders have been developed. Shape index of order one is defined as shown in eq. (2.26): C-X Bond KC-X C-N 0.9 C-O 0.8 C=O 1.6 C-S 0.7 C-F 0.7 C-Cl 0.4 C-Br 0.3 40 ( ) ( )21 2 1 1 P nn ?=? (2.26) 1P is the number of paths of length 1. The other shape indices can also be calculated in the similar manner. 2.6.4. Wiener Indices The Wiener number is introduced as the path number, which is the number of bonds between all pairs of atoms in an acyclic molecule. The main significance of this index is that, this was the first time the significance of paths in a molecular skeleton had been recognized (Wiener, 1947). The Wiener number, W is defined as one half of the off- diagonal elements of the molecular distance matrix: ?? = = = N k N l klDW 1 1 5.0 (2.27) where Dkl is the off-diagonal elements of the distance matrix, D. Together with other molecular descriptors, the Weiner number can make predictions on alkane properties like boiling points, heats of formation, heats of vaporization, molar refractions and molar volume. 2.6.5. The Hosoya Topological Index The Hosoya topological index is defined by the count of all possible patterns of considering k disjoint bonds in a molecule (Hosoya, 1971): 41 KZZZZ ++++= ....1 21 (2.28) Here, Z1 is the number of bonds in a graph, Z2 is the number of pairs of disjoint bonds, Z3 is the number of triples of disjoint bonds and so on. The disjoint bonds are any two or more bonds in the structure for which there is no incident vertex. A variety of other connectivity indices is available in many published works. A good review and the classification and applications of different topological indices can be found in Trinajstic (1992) and Balaban (2001). 2.7. Connectivity Indices and GC+ Method The GCM can provide fairly accurate estimates of the properties of the molecules if the group contribution values of all the building blocks are known. However, if a group, whose property contribution is not available, makes up at least one part of the molecule the property estimation cannot be completed. In practice, this is a very common frustration encountered while using GCM for property estimation or in reverse problems because, property contributions of many common molecular groups are not available in literature. So, recent works on the correlation between connectivity indices and some physical properties can be used in such situations (Gani et al., 2005). In a recent work to correlate the CI?s to physical properties, the following pure component property model was proposed by Gani et al (2005): ( ) ( ) ( ) dcbAaYf vv i ii +++= ? 10 2)( ?? (2.29) 42 Here, Y is the sought property, Ai is the number of atom i, ai is the estimated contribution of atom i while b, c and d are adjustable parameters. The property functions are defined the same way as in the GCM. The values of the constants are given in the appendix (Table A.6). It is to be noted that the constants in the above equation are regressed using the same pure-component property data used by Marrero and Gani (2001) in their group contribution model development. Here, as the derived equation is on the atomic level, the expected accuracy of prediction is less than the group contribution model. So, the CI model is used only for deriving the property contributions of groups not available in existing group contribution methods. Since the property functions are defined the same way in both GCM and CI methods, the formulation of a combined approach is straightforward. The combined GC- CI model, known as the GC+ model, has been written as follows (Gani et al., 2005): ( ) ( ) ( )mvmv i imimm cbAaYf 10 ,, 2)( ?? ++= ? (2.30) dYfnYf m mm +?? ?? ? ?= ? )()( * (2.31) ? ? ?? ? ?+? ? ?? ? ?++? ? ?? ? ?= ??? t tt s ss i ii CNCNYfCNYf )()( * (2.32) 43 where m is the number of different missing groups and nm is the number of times the missing group is present in the molecule. f (Y*) is the property function of the missing group. 2.8. Molecular Signature Descriptors 2.8.1. Current Status in Inverse Design The inverse design of identifying the molecules from property constraints is a relatively new problem from a chemical engineering perspective. In this section, the different methodologies developed for designing molecules with a set of target properties will be reviewed. It should be noted that, the current methodologies are very specific to certain classes of property models and even in such restricted situations, the accuracy and computational expenses requires a lot of improvement. Most of the existing methods used group contribution based approaches for solving the inverse design problem (Achenie et al., 2003). A number of recent publications have used the group contribution based techniques to solve for different classes of molecular design problems ( Sahinidis et al., 2003; Achenie & Sinha, 2004; Eljack & Eden, 2008; Chemmangattuvalappil et al., 2009). However, the suitability of group contribution methods for molecular design is limited because of the following reasons: 1. It is not always possible to find a suitable correlation between the molecular groups and properties 2. Not all possible atomic arrangements are represented in GCM 3. Many group contribution models have limited ranges of accuracy As mentioned before, the representation of a molecule in the form of a graph can provide lot of information about its properties through the use of molecular descriptors. 44 Molecular descriptors are operators developed from the molecular graph to characterize the properties of the molecule. The numerical value obtained after performing the operations suggested by the descriptor on the molecular graph can generally be used to correlate and predict physical properties and biological activities (Faulon et al., 2003b). There are thousands of molecular descriptors available and that makes it difficult to select the appropriate one(s) for a specific problem. A lot of work has been done in molecular design using topological indices as structural descriptors (Baskin et al., 1990; Gordeeva et al., 1990; Kvasnicka & Pospichal, 1990; Kier et al., 1993a; Skvortsova et al., 1993). In all these approaches, the descriptors' structural features have been used to generate the feasible molecular structures. However, the inverse relationships between the topological indices generally do not provide a unique molecular graph (Trinajstic, 1992). Therefore, the degeneracy in these approaches is very large. In addition, most topological indices exhibit highly non-linear functional dependence on the elements of the vertex-adjacency matrix (Raman & Maranas, 1998). Because of that, obtaining a global solution when employing mathematical programming techniques is difficult. The techniques developed recently for obtaining unique molecular structures from the molecular descriptors use stochastic approaches. In the algorithms developed by Venkatasubramanian et al. (1994, 1995), and Sheridan and Kearsley (1995), a genetic algorithm based approach has been introduced for large scale molecular design. Genetic algorithms are stochastic optimization methods based on the Darwinian model of evolution. In summary, genetic algorithms identify the population of best candidates from an earlier population following the rules of crossover and mutation and thus producing the best offspring for the next generation (survival of the fittest). A fitness function based 45 on the target properties has been used to evaluate the fitness of the candidate solutions. This approach is expected to identify better offspring after each generation and eventually end up with the optimal solution. These approaches were the first efficient methodology for solving large-scale molecular design problems. They have the ability to locate optimal designs for those problems with multiple target specifications. However, because of the heuristic nature of the developed algorithms, there is no guarantee that a solution will be obtained after running the algorithm. In addition, even though these algorithms can obtain a near optimal solution very fast, the efficiency is very limited in obtaining the final solution. Algorithms based on Monte Carlo simulations (Faulon, 1996; Kvasnicka & Pospichal, 1996) have also been published. A simulated annealing based algorithm has been published by the group of Kokossis (Marcoulaki & Kokossis, 1998) for the design of refrigerants and liquid-liquid extraction solvents. One advantage of simulated annealing over other stochastic optimization methods is that, it can provide probabilistic guarantee on the quality of the final solutions. These algorithms can theoretically provide the solution in polynomial time. A variety of other papers have been published later using stochastic techniques, however, the reconstruction of molecular structures using deterministic techniques has rarely been attempted. The notable contributions that use deterministic techniques for the exhaustive generation of molecular graphs corresponding to the predefined molecular descriptors have come from Kier?s group (Hall et al., 1993a; Hall et al., 1993b; Kier et al., 1993b) and from Skvortsova?s group (Skvortsova et al., 1993). The works from the former group compute the possible degree sequences that match the paths of the target descriptors up to 46 the length of two that can track degree sequences up to length three. These contributions applied the developed techniques to chi-indices. Once the different paths of length two are obtained, the degree sequences corresponding to all the molecular structures will be generated using an isomer generator. Only those structures that match the path of length 3 (which is the maximum path length that can be tracked using this technique) are accepted as the final solutions. In the contribution from Skvortsova?s group, apart from the degree sequence, an edge sequence also has been generated, which simultaneously formed the input to the isomer generator to build the exhaustive list of corresponding structures. The edge sequence can estimate the number of edges between each combination of atoms. This can decrease the degeneracy of the solutions significantly. However, the descriptional features in these methods do not always produce feasible or unique structures. In more recent works (Raman & Maranas, 1998; Camarda & Maranas, 1999), methodologies have been developed to incorporate topological indices within an optimization framework. In the work by Raman and Maranas (1998), many hydrocarbon properties are correlated with connectivity indices and shape indices of different orders. In this work, the molecular structure has been represented in the form of a vertex adjacency matrix that can completely explain the molecular interconnectivity. Even though the actual topological indices used in this work are non-linear, the matrix representation has been used to systematically transform them into linear form. A mixed integer linear program (MILP) formulation has been formed which ensures that a global optimum solution can be achieved. However, its application has been limited to the design of alkyl structures. In the work by Camarda and Maranas (1999), nonlinearities 47 due to the expressions for connectivity indices led to MINLP formulations, which make the solution methodology computationally expensive and susceptible to local optima traps. Another important inverse problem solution technique had been developed by forming a target scaffold (Garg & Achenie, 2001) for drug design. This is the first attempt to apply mathematical programming techniques in the area of drug design. In this work, a target scaffold based on a drug molecule has been used to generate a QSAR and the inverse problem has been solved for the best values of selectivity by changing the substituents on the scaffold. The limitations of this approach was that it was not able to provide nonintuitive solutions because the scaffold limits the type of molecules obtained as solutions. However, this approach was effective in controlling the combinatorial explosion. In a later contribution, a fitness function was directly incorporated into an optimization framework (Siddhaye et al., 2004). However, in many formulations, a globally optimal solution cannot be guaranteed in this approach. In another work, second order connectivity indices had been used for the design of value added soybean oil products (Camarda & Sunderesan, 2005). The interesting aspect of this work is that, the highly non linear second order connectivity indices have been represented with the exact linear equivalent expressions using Glover transformation. Glover transformation is a technique used in non-linear integer programming to represent non-linear expressions in terms of linear equations that captures the essential non-linearities of the original problem. By applying Glover transformation, the non-convex terms have been converted into products of binary variables. Even though this approach eliminates the possibility of 48 local minima traps by avoiding non-convex terms, the computational requirements are relatively high. To decrease the computational complexity, an approach known as ?templating? has been applied in this work. In templating, a portion of the vertex adjacency matrix has been predefined to control the generation of a large number of structures. However, because of templating, the generation of non-intuitive solutions may be eliminated. In some recent works, heuristic methods have been used in order to handle the non-linear constraints. (Lin et al., 2005; Eslick et al., 2009; McLeese et al., 2010). These methods cannot ensure a global optimum solution. However, in order to ensure that the solution is near optimum, Tabu search methods have been employed while solving the problem. In Tabu search, a library will be generated that keeps track of the recently generated local optima solutions that will prevent the generation of the same local minima solution as the search proceeds (Lin et al., 2005). Even though this technique ensures better quality of the solutions, the optimum solution can never be guaranteed because of the non-linear constraints. Apart from these limitations, the above-mentioned techniques can be used only when the QSAR/QSPRs are based on one topological index. For many properties, the QSAR/QSPRs are formed based on more than one topological index. In addition, the topological indices required to form the QSAR/QSPRs may be different for different properties of interest. For instance, the QSPR for one property target may be based on connectivity indices and the second property target may be based on shape indices. Since different topological indices are formulated using different mathematical expressions, there is no standard way to combine everything on a common platform and solve it all 49 simultaneously. Therefore, the current techniques that employ QSAR/QSPR models in reverse problem formulations can handle those property models with one topological index. Different topological indices have to be formulated using different mathematical transformations. In addition, the degeneracy of the solutions in all these methodologies is very high. That means, for a specific solution, there could be many possible molecular structures. The degeneracy increases with the size of the molecules in the solution. Here, we are looking for a computationally efficient algorithm that can simultaneously incorporate different topological indices based on QSAR/QSPRs. The recent works on molecular signature descriptors (Visco et al., 2002; Faulon et al., 2003a; Faulon et al., 2003b; Weis et al., 2005) provides a convenient way to represent a variety of TIs as linear combinations of molecular signatures. Therefore, this approach has the potential to develop an efficient methodology for the design of molecules with QSARs/QSPRs related to diverse TIs. 2.8.2. Development of Molecular Signature The concept of molecular signature ( Visco et al., 2002; Faulon et al., 2003b) is significant for the reverse problem formulation framework because, it forms a finite set of not highly correlated descriptors based on the molecular structure from which all other TI?s can be calculated. The molecular signature is a systematic way of representing the atoms in a molecule using the extended valancies to a pre-defined height. The systematic procedure for the identification of the signature of a molecule developed by Visco et al., (2002) is explained in the next section. 50 One of the characteristics that make the molecular signature descriptors unique among other molecular descriptors is that the building blocks of the molecular signatures complement each other. If G is a molecular graph and x is an atom of G, the atomic signature of height h of x is a canonical representation of the subgraph of G containing all atoms that are at a distance h from x. This canonical representation can be obtained by the following systematic procedure: 1. All atoms (vertices) in the graph are labeled in a canonical order starting with atom x 2. For the atom x for which the atomic signature is to be constructed, all atoms and bonds will be shown up to the height h in the sub graph )(xGh . 3. Construct the tree that spans over all the edges in the sub graph. The root of the tree is the atom x itself. The tree is constructed one layer at a time up to level h. The first layer of the tree are the nearest neighbors of atom x, the second layer of the tree consists of all the neighbors of the vertices in layer 1 except the atom x. In general, when the tree has been constructed up to level h-n, then, layer h-n+1 will be constructed considering each vertex of layer h-n. All vertices in the tree are labeled and colored with the necessary coloring function. The vertex color of a graph is the assignment of a unique description provided to its vertices in order to distinguish among different groups of atoms. The coloring function will be selected based on the type of the molecule. Typical coloring functions include the type of atoms, type of bonds and valency. It is possible to have one vertex more than once in the graph. However, no edge should be repeated in the same graph. 51 4. The signature can be written by reading the tree from the atom x. The child level vertices will be enclosed in parenthesis at each level. The vertex color must be written along with the vertices in each level. After each level, the next level must be followed until the signature reaches the required height. While writing the signatures, all the neighbors including the root atom must be written. One example of the construction of atomic signature is shown in figure 2.9. Here, the stepwise procedure for obtaining the atomic signature of atom N (nitrogen) up to height 3 in a molecule is illustrated. In the first step, the atoms are labeled to distinguish between them when writing the molecular signatures in the later stages. In the second step, all the atoms at height 3 from atom N are extracted. In other words, the neighbors of height three includes the atoms bonded to N (say y), the atoms bonded to all atoms in y (say z) and all the atoms bonded to all atoms in z. In the third step, the molecular groups are replaced with vertices. All the vertices are colored with the atom type, valency and the type of bond. Here the atoms types are C and N. The different carbon types are distinguished with their valancies and the type of bond. Note that the bond type has been retained in the canonical representation in step 2. In the final step, the signatures of different heights have been formed by reading the tree starting from the root N atom. The atomic signature of height zero is the root N atom itself. The atomic signature of height one is the root atom followed by its nearest neighbors (in this case, two carbon atoms) enclosed in parenthesis. In signature of height two, the neighbors of the carbon atoms (including the root N atom) are listed in the 52 parenthesis. Finally, signature of height three is obtained by adding the neighbors of the atoms in the previous layer. While writing the signatures, the vertices at the different levels have been color coded for clarification of the levels. CH2 NH C CH2 CHCH2 CH2CH2 CH3 CH3 CH3 C2 N2 C4 C2 C3C2 CC C CH2 NH C CH2 CHCH2 CH2CH2 CH3 . CH3 CH2 2 NH 1 C 3 CH2 5 CH 6CH24 CH2 9CH27 CH3 10 CH3 8 CH3 11 0?(x) = N Height 0 1?(x) = N2 (CC) Height 1 Molecular structure Step 1 Step 2 Step 3 Step 4 53 2?(x) = N2 (C2 (NC) C4 (=CCN)) Height 2 3?(x) = N2 (C2 (N2(CC) C1(C)) C4 (=C3(=CC) C1(C) N2(CC))) Height 3 Figure 2.9: Atomic Signatures up to Height 3 It is clear that the set of atomic signatures up to a given height is of finite size. So, any molecule can be represented by its coordinates in a vectorial space where the base vectors are its atomic signatures. Thus, the signature of a molecule is defined as the linear combination of atomic signatures (Visco et al., 2002; Faulon et al., 2003b). If )( ihGh X? is a base vector, ih? is the number of atoms having the signature of the base vector and Gh K is the number of base vectors, then the molecular signature )(Gh? is represented as: )()()( 1 i h G K i h i h G Vx hh XxG G h G ???? ?? =? == (2.33) A simple example is presented to show molecular signatures of different heights using eq. (2.33). In the first step, the subgraphs have been generated for all the atoms that produce signatures up to the required height. Then, individual atomic signatures for all the atoms have been generated. Finally, eq. (2.33) is applied to generate the molecular signature. CH3CH2CH(CH3)CH2NH2 54 CH3 1 CH2 2 CH 3 CH3 4 CH 25 NH26 CH3 CH2 CH CH3 CH2 CH3 CH2 CH3 CH CH3 CH2 NH2 CH CH2 CH3 CH3 CH2 NH2 CH3 CH CH2 CH3 CH2 NH2 CH2 CH CH3 CH2 CH3 NH2 NH2 CH2 CH CH3 CH2 CH3 0? = 5C + N Height 0 1? = 2C1(C) + C2(CC) + C3(CCC) + C2(NC) + N1(C) Height 1 2? = C1(C2(CC)) + C2(C1(C)C3(CCC)) + C3(C1(C)C2(CC)C2(NC)) + C1(C3(CCC)) + C2(C3(CCC)N1(C)) + N1(C2(NC)) Height 2 3? = C1(C2(C1(C)C3(CCC))) + 55 C2(C1(C2(CC)) C3(C1(C)C2(CC)C2(NC))) + C3(C1(C3(CCC))C2(C3(CCC)N1(C))C2(C1(C)C3(CCC))) + C1(C3(C1(C)C2(CC)C2(NC))) + C2(C3(C1(C)C2(CC)C2(NC)) N1(C2(NC))) + N1(C2(C3(CCC)N1(C))) Height 3 Figure 2.10: Molecular Signature Tree 2.8.3. Application of Signature Descriptors in Property Prediction The atomic signature concept is useful in QSAR/QSPR studies because of its applicability in defining many topological indices. Consider a molecule G with known atomic signature up to height h as defined in section 2.6.1. Suppose the number of vertices in an intermediate layer k of graph G is )(xVk . Note that k