This Is Auburn: Electronic Theses and Dissertations

Variable selection for industrial process modeling and monitoring




Wang, Zixiu

Type of Degree



Chemical Engineering


In recent years, rapid developments in technology have facilitated the collection of vast amounts of data from different industrial processes. Data-driven soft sensors have been widely used in both academic research and industrial applications for predicting hard-to-measure variables or replacing physical sensors to reduce cost. It has been shown that the performance of these data-driven soft sensors can be greatly improved by selecting only the vital variables that strongly affect the primary variables, rather than using all the available process variables. Consequently, variable selection has become one of the most important practical concerns in data-driven approaches. By identifying and removing irrelevant and redundant variables, variable selection can improve the prediction performance, reduce the model complexity and computational load, provide better insight into the nature of the process, and lower the cost of measurements. Given the importance of variable selection, a systematic evaluation of variable selection performance becomes essential. However, the existing performance indicators all have limitations. In this work, a comprehensive evaluation of different variable selection methods for PLS-based soft sensor development is presented, and a new metric is proposed to assess the performance of different variable selection methods. The new performance indicator incorporates information entropy to measure how consistently a variable selection method performs over multiple Monte Carlo runs. When the ground truth of the data is not available, only the consistency index, along with the prediction capability, can be used to evaluate the variable selection performance.
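The idea of an entropy-based consistency measure can be illustrated with a minimal sketch. The abstract does not give the formula of the proposed indicator, so the version below is only an assumed formulation: for each variable, the selection frequency over the Monte Carlo runs is converted to a binary entropy, and the index is one minus the average entropy. The function names `binary_entropy` and `consistency_index` are hypothetical.

```python
import math


def binary_entropy(p):
    """Binary entropy in bits of a selection frequency p, with 0*log(0) taken as 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def consistency_index(selections):
    """Assumed consistency measure, not the definition used in the thesis.

    selections: list of runs, each a list of 0/1 flags (1 = variable selected).
    Returns 1.0 when every variable is always or never selected, and 0.0 when
    every variable is selected in exactly half of the runs.
    """
    n_runs = len(selections)
    n_vars = len(selections[0])
    # Selection frequency of each variable across the Monte Carlo runs.
    freqs = [sum(run[j] for run in selections) / n_runs for j in range(n_vars)]
    mean_entropy = sum(binary_entropy(p) for p in freqs) / n_vars
    return 1.0 - mean_entropy


# Hypothetical selection outcomes from four Monte Carlo runs on three variables.
stable = [[1, 0, 1], [1, 0, 1], [1, 0, 1], [1, 0, 1]]
unstable = [[1, 0, 1], [0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(consistency_index(stable), consistency_index(unstable))
```

Under this assumed definition, the index needs only the selected subsets from each run, which is why a measure of this kind remains usable when the truly relevant variables are unknown.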
The following seven variable selection methods are compared: stepwise regression (SR), partial least squares (PLS) with regression coefficients (PLS-BETA), PLS with variable importance in projection (PLS-VIP), uninformative variable elimination with PLS (UVE-PLS), genetic algorithm with PLS (GA-PLS), competitive adaptive reweighted sampling with PLS (CARS-PLS), and least absolute shrinkage and selection operator (Lasso). The algorithms of these variable selection methods and their characteristics are presented. In addition, their strengths and limitations when applied to soft sensor development are demonstrated through static case studies (a simulated case and an industrial polyester production process) and dynamic case studies (a digester simulator and an industrial Kamyr digester). A simple simulation case is used to investigate the properties of the selected variable selection methods. The dataset is generated to mimic the typical characteristics of industrial data by considering four factors: the proportion of relevant predictors, the magnitude of correlation between predictors, the signal-to-noise ratio, and the structure of the regression coefficients. In addition, the algorithms are applied to an industrial case study, the production of polyester resin, to test their performance. In both the simulated and the industrial polyester case studies, Monte Carlo (MC) simulation is adopted to generate different combinations of training, tuning, and testing datasets. Independent tuning datasets are used to optimize each method and to analyze the sensitivity of each method to its tuning parameters. Independent test datasets are then used to compare the prediction performances of PLS models built from the different subsets of regressors retained by these variable selection methods. The results show that PLS-VIP is the most consistent method in terms of both selection and prediction performance.
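The Monte Carlo resampling into training, tuning, and testing datasets described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the thesis: the split fractions, the number of runs, and the function name `monte_carlo_splits` are hypothetical, and the samples are assumed to be exchangeable (which would not hold for the dynamic case studies).

```python
import random


def monte_carlo_splits(n_samples, n_runs, frac_train=0.5, frac_tune=0.25, seed=0):
    """Yield (train, tune, test) index lists for each Monte Carlo run.

    The fractions are assumed values; whatever is left after the training
    and tuning portions becomes the test set.
    """
    rng = random.Random(seed)  # fixed seed so the runs are reproducible
    n_train = int(n_samples * frac_train)
    n_tune = int(n_samples * frac_tune)
    for _ in range(n_runs):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # a fresh random permutation for every run
        train = idx[:n_train]
        tune = idx[n_train:n_train + n_tune]
        test = idx[n_train + n_tune:]
        yield train, tune, test


# Each run would fit a selection method on `train`, tune it on `tune`,
# and score the resulting PLS model on `test`.
splits = list(monte_carlo_splits(n_samples=100, n_runs=5))
for train, tune, test in splits:
    print(len(train), len(tune), len(test))
```

Keeping the three index sets disjoint within each run is what makes the tuning and test datasets independent of the data used to fit the model.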
Along with data preprocessing and correlation removal, an improvement of around 30% is obtained on the polyester case study. Moreover, the effect of process dynamics on variable selection is examined using a digester simulator and an industrial Kamyr digester as case studies. Due to the dynamic nature of these processes, the selection performances are less consistent than in the static cases, which indicates that a new variable selection technique is needed. The performances of the different variable selection methods are compared, and their advantages and disadvantages are discussed, with the aim of providing useful insights to practitioners in the field.