This Is Auburn: Electronic Theses and Dissertations

Machine Learning Algorithms for QSPR/QSAR Predictive Model Development Involving High-Dimensional Data




Datta, Shounak

Type of Degree

PhD Dissertation


Chemical Engineering


With advances in fields such as computational chemistry, computer-aided molecular design, and chemoinformatics, the scientific community now has access to a very large set of molecular descriptors. The advantage of such a large set is that computational modelers can capture different characteristics of molecules of varying sizes in different solvent and reaction media. The drawback is that, during model development, the number of descriptors can exceed the number of instances in a dataset. Such a dataset is known as a high-dimensional data matrix. This is especially the case when data generation is complex, time-consuming, and/or resource-intensive. It can also happen when a product needs to be developed for a very specific use (e.g., drugs for a specific physical condition, polymers with a specific property, reactions in a specific environment). These cases tend to be very condition-specific, depending on, e.g., the type of chemical species, the activity or response in a specific environment, temperature, and pressure. The challenges of modeling such cases include, but are not limited to, difficulty in generating a generalizable model, large model uncertainty, and overfitting of the generated model(s). To address these drawbacks and challenges, in this work we have developed hybrid algorithms that are efficient and can generate generalizable models. These algorithms overcome the disadvantage of traditional modeling techniques, which break down when the number of descriptors exceeds the sample size. The developed algorithms can be incorporated into software platforms for the automated design of product-centric industrial processes. Such software should be capable of analyzing experimental data and generating the best possible molecular structure for the specified constraints and objectives, and it must be both fast and accurate.
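The breakdown of traditional regression when descriptors outnumber samples can be illustrated with a small numerical sketch. The example below is not taken from the dissertation: the matrix sizes, the random data, and the use of NumPy's least-squares solver are all assumptions chosen for illustration. It shows that with more descriptors than samples, ordinary least squares can reproduce pure noise on the training set while failing on new data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 20 samples, 50 descriptors (more descriptors than samples).
n_samples, n_descriptors = 20, 50
X = rng.normal(size=(n_samples, n_descriptors))
y = rng.normal(size=n_samples)  # pure noise: there is no real relationship to learn

# The rank of X cannot exceed the number of samples, so the 50 x 50
# matrix X^T X is singular and the normal equations have no unique solution.
rank = np.linalg.matrix_rank(X)

# lstsq returns the minimum-norm solution among the infinitely many exact fits.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
train_residual = np.linalg.norm(X @ coef - y)

# Evaluate the same coefficients on freshly drawn data.
X_new = rng.normal(size=(n_samples, n_descriptors))
y_new = rng.normal(size=n_samples)
test_mse = np.mean((X_new @ coef - y_new) ** 2)

print("rank of X:", rank)
print("training residual norm:", train_residual)
print("MSE on new data:", test_mse)
```

The training residual is zero up to floating-point error even though the response is noise, which is the overfitting behavior that motivates feature selection and shrinkage methods for high-dimensional data.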
In the past, such situations were tackled with ab initio calculations, later replaced by DFT (Density Functional Theory) based calculations. Apart from being computationally expensive, such methods involve manual handling of data for molecular design operations. To address these limitations, molecular descriptors (0D-7D) became attractive alternatives. However, the complexity of calculating descriptors increases with the complexity of the molecular structure. 2D (two-dimensional) descriptors, such as connectivity index descriptors, have been shown to be efficient in generating models with significant accuracy, and the associated design calculation steps are not computationally expensive. For these reasons, the models generated in this work are based on 2D molecular descriptors. Two unique condition-specific situations are discussed. Case 1 relates reactant and solvent structures to the reaction rate constants of Diels-Alder reactions. As reaction rates tend to depend on inter-atom connectivity, connectivity index descriptors were used to develop this model. A hybrid GA-DT (Genetic Algorithm-Decision Tree) algorithm was developed for feature selection and model development. This case is unique in that it involves three different chemical species in generating the predictive model, and is hence a challenge for both traditional and newly developed hybrid algorithms. Further improvements to the model were proposed using the Multi-Gene Genetic Programming (MGGP) algorithm to derive non-linear models. Case 2 develops a model relating the structures of 9-anilinoacridine derivatives to their respective DNA-drug binding affinity values. Although this case has only one group of chemical species under consideration, challenges emerge when two or more models with similar metrics are generated.
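As an illustration of why connectivity index descriptors are inexpensive to compute, the sketch below evaluates the first-order Randić connectivity index, a classic 2D descriptor defined as the sum over bonds of 1/sqrt(d_i * d_j), where d_i and d_j are the numbers of non-hydrogen neighbors of the two bonded atoms. The abstract does not state which connectivity indices were used, so this particular index, the function name `randic_index`, and the bond-list representation are assumptions made for illustration only.

```python
from math import sqrt

def randic_index(bonds):
    """First-order Randic connectivity index of a hydrogen-suppressed graph.

    `bonds` is a list of (atom_a, atom_b) pairs between non-hydrogen atoms.
    """
    # Vertex degree = number of non-hydrogen neighbors of each atom.
    degree = {}
    for a, b in bonds:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    # Sum 1 / sqrt(d_a * d_b) over all bonds.
    return sum(1.0 / sqrt(degree[a] * degree[b]) for a, b in bonds)

# Carbon skeletons of two C4H10 isomers (atoms numbered arbitrarily).
n_butane = [(0, 1), (1, 2), (2, 3)]   # linear chain C-C-C-C
isobutane = [(0, 1), (0, 2), (0, 3)]  # central carbon bonded to three methyls

print(round(randic_index(n_butane), 3))   # 1.914
print(round(randic_index(isobutane), 3))  # 1.732
```

The two isomers share a molecular formula but receive different index values, so the descriptor encodes branching using only the bond list and a single pass over it.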
Although the genetic algorithm was initially used for feature selection, a novel adaptive version of the LASSO (Least Absolute Shrinkage and Selection Operator) algorithm was also developed. This adaptive correlation-based LASSO (CorrLASSO) was used to perform regression and shrinkage calculations. To evaluate model fitness, R² and Q² values were calculated, representing internal and external model validation, respectively. For the second case, the mean squared error (MSE) was also calculated to compare the performance of the LASSO and CorrLASSO algorithms.
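The shrinkage and validation steps described above can be sketched with the standard LASSO, solved here by cyclic coordinate descent. The correlation-based weighting that distinguishes CorrLASSO is not specified in this abstract and is therefore not reproduced. The synthetic data, the penalty value `alpha`, the omission of an intercept, and the convention of computing R² on the training set and Q² on a held-out set against the training mean are all assumptions for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; values with |z| <= t become exactly 0."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, alpha, n_iter=200):
    """Minimize (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 by coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    residual = y.copy()  # equals y - X @ w, since w starts at zero
    for _ in range(n_iter):
        for j in range(p):
            residual += X[:, j] * w[j]      # remove descriptor j's contribution
            rho = X[:, j] @ residual / n    # its correlation with the partial residual
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
            residual -= X[:, j] * w[j]      # add the updated contribution back
    return w

# Synthetic high-dimensional data: 30 samples, 60 descriptors, 3 of them relevant.
rng = np.random.default_rng(1)
n, p = 30, 60
true_w = np.zeros(p)
true_w[:3] = [3.0, -2.0, 1.5]
X_train = rng.normal(size=(n, p))
y_train = X_train @ true_w + 0.1 * rng.normal(size=n)
X_test = rng.normal(size=(n, p))
y_test = X_test @ true_w + 0.1 * rng.normal(size=n)

w = lasso_cd(X_train, y_train, alpha=0.1)
n_selected = int(np.count_nonzero(w))

# R^2 on the training set; Q^2 and MSE on the held-out set.
pred_train, pred_test = X_train @ w, X_test @ w
r2 = 1 - np.sum((y_train - pred_train) ** 2) / np.sum((y_train - y_train.mean()) ** 2)
q2 = 1 - np.sum((y_test - pred_test) ** 2) / np.sum((y_test - y_train.mean()) ** 2)
mse = np.mean((y_test - pred_test) ** 2)

print("descriptors kept:", n_selected, "of", p)
print("R2:", r2, "Q2:", q2, "test MSE:", mse)
```

Because the L1 penalty sets many coefficients exactly to zero, the fit performs descriptor selection and regression in one step, which is what makes LASSO-type methods usable when descriptors outnumber samples.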