Surrogate models are used to map input data to output data when the actual relationship between the two is unknown or computationally expensive to evaluate (Han & Zhang, 2012). Surrogate models can also be constructed for surrogate-based optimization when a closed analytical form of the relationship between input and output data does not exist or is not conducive to traditional gradient-based optimization methods. The overall goal of this dissertation is to comprehensively investigate and compare the performance of several surrogate modeling techniques for both approximating functional relationships and surrogate-based optimization, and to link that performance to the characteristics of the data involved in the application. Using the results of the performance comparisons, surrogate modeling techniques are incorporated into a derivative-free optimization framework for the surrogate-based optimization of chemical processes.
The research activities described here focused on comparing the performance of eight surrogate modeling techniques on a collection of generated datasets and on constructing a tool that recommends appropriate modeling techniques for a dataset based only on the characteristics of the data being modeled. The surrogate modeling techniques include multivariate adaptive regression splines (MARS), random forests (RF), single-hidden-layer feedforward artificial neural networks (ANN), extreme learning machines (ELM), Gaussian process regression (GP), support vector machines (SVM), Automated Learning of Algebraic Models using Optimization (ALAMO), and radial basis function networks (RBFN). In general, MARS, ANN, and GP models provide the most accurate predictions for approximation, and RF models locate the optimum of a dataset most often. Several of the surrogate modeling techniques were applied to predicting the outcomes of cardiac differentiation experiments. RF and GP models were found to provide the most accurate predictions of those outcomes. With feature selection and data-driven modeling using the surrogate modeling techniques, we were able to build models that could predict insufficient yield for a bioreactor differentiation on day seven (out of 10) of the differentiation protocol with up to 90% accuracy and 90% precision, using only 16% of the collected bioreactor features.
Based on the results of the surrogate model comparison study, we identified dataset attributes appropriate for selecting surrogate models for both surface approximation and surrogate-based optimization. Using these attributes, a recommendation tool, PRESTO, was constructed. PRESTO recommends surrogate modeling techniques for approximating a dataset with 91% accuracy and 90% precision, and for performing surrogate-based optimization with 98% accuracy and 99% precision. A surrogate-based, derivative-free optimization algorithm, pyBOUND, was developed for the solution of expensive black-box optimization problems. pyBOUND combines the ability of random forest models to accurately locate the optima of a wide variety of problems with the high predictive accuracy of MARS models.