This Is Auburn Electronic Theses and Dissertations

On Variable Selection for Data-Driven Soft Sensor Development with Application to Industrial Processes




Wang, Zi Xiu

Type of Degree



Chemical Engineering


In recent years, rapid developments in technology have facilitated the collection of vast amounts of data from different industrial processes. These data have been utilized in many areas, such as data-driven soft sensor development and process monitoring, to control and optimize the process. The performance of these data-driven schemes can be greatly improved by selecting only the vital variables that strongly affect the primary variables, rather than using all the available process variables. Consequently, variable selection has become one of the most important practical concerns in data-driven approaches. By identifying irrelevant and redundant variables, variable selection can improve prediction performance, reduce computational load and model complexity, provide better insight into the nature of the process, and lower the cost of measurements [1], [2]. This work presents a comprehensive evaluation of different variable selection methods for soft sensor development. Seven algorithms are investigated: stepwise regression, PLS-BETA, PLS-VIP, UVE-PLS, PLS-SA, CARS-PLS, and GA, as discussed below. Stepwise regression methods are often used for variable selection in linear regression [3]. The procedure sequentially introduces individual predictor (secondary) variables into the model to observe their relation to the primary variables. Variable selection based on Partial Least Squares (PLS) regression is a model-parameter-based approach. Both the regression coefficients estimated by PLS (PLS-BETA) and the variable importance in projection (PLS-VIP) are discussed [4]. Another model-parameter-based method, Uninformative Variable Elimination by PLS (UVE-PLS), is also related to the regression coefficients. However, instead of looking at the regression coefficients alone, it explores the reliability of the coefficients [5].
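As an illustration of the sequential idea behind stepwise regression, the following is a minimal sketch of greedy forward selection, not the procedure used in this work. It assumes ordinary least squares as the underlying model and a hypothetical stopping rule (`min_rel_gain`, a minimum relative reduction in the residual sum of squares); classical stepwise regression typically uses F-tests or p-values instead. The function name and the synthetic data are illustrative only.

```python
import numpy as np

def forward_stepwise(X, y, min_rel_gain=0.05):
    """Greedy forward selection: at each step, add the variable that
    most reduces the residual sum of squares (RSS) of a least-squares fit."""
    n, p = X.shape
    selected = []
    # RSS of the intercept-only model is the starting point.
    best_rss = float(np.sum((y - y.mean()) ** 2))
    while len(selected) < p:
        trial = {}
        for j in range(p):
            if j in selected:
                continue
            # Fit intercept + already selected variables + candidate j.
            A = np.column_stack([np.ones(n), X[:, selected + [j]]])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            trial[j] = float(np.sum((y - A @ coef) ** 2))
        j_best = min(trial, key=trial.get)
        # Assumed stopping rule: stop when the best candidate removes
        # less than min_rel_gain of the current RSS.
        if best_rss - trial[j_best] < min_rel_gain * best_rss:
            break
        selected.append(j_best)
        best_rss = trial[j_best]
    return selected

# Synthetic example: only variables 0 and 3 influence the primary variable.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)
selected = forward_stepwise(X, y)
print(selected)
```

With this synthetic data the two informative variables dominate the reduction in RSS, so the sketch should stop after selecting them; on correlated process data the order of entry and the stopping threshold matter far more.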
Variable selection algorithms based on sensitivity analysis (PLS-SA) are also studied. In these approaches, the importance of a variable is given by its sensitivity, defined as the change in the primary variables when the secondary variable is varied over its allowable range [6]. Furthermore, the properties of genetic algorithms (GA), which have recently been proposed for variable selection applications [7], are investigated. The algorithms of these variable selection methods and their characteristics are presented, and their strengths and limitations when applied to soft sensor development are studied. The soft sensor prediction performance of the models developed with these variable selection methods is compared using PLS. A simple simulation case is used to investigate the properties of the selected variable selection methods. The dataset is generated to mimic typical characteristics of process data, such as the magnitude of the correlations between variables and the signal-to-noise ratio [4]. In addition, the algorithms are applied to an industrial soft sensor case study. In both cases, independent test sets are used to provide a fair comparison and analysis of the different algorithms. The final performances are compared to demonstrate the advantages and disadvantages of the different methods and to provide useful insights to practitioners in the field.
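The sensitivity definition above (the change in the primary variable when one secondary variable is swept over its allowable range) can be sketched as follows. This is a simplified illustration under stated assumptions: an ordinary least-squares model stands in for the PLS model, the allowable range is taken to be the observed minimum and maximum of each variable, and all other variables are held at their mean. The helper names `fit_linear` and `sensitivity_ranking` are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares stand-in for the PLS model (an assumption of this sketch)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Z: coef[0] + Z @ coef[1:]

def sensitivity_ranking(predict, X):
    """Sensitivity of variable j: absolute change in the prediction when
    variable j moves from its observed minimum to its observed maximum
    while all other variables stay at their mean."""
    base = X.mean(axis=0)
    sens = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        lo, hi = base.copy(), base.copy()
        lo[j], hi[j] = X[:, j].min(), X[:, j].max()
        sens[j] = abs(predict(hi[None, :])[0] - predict(lo[None, :])[0])
    return sens

# Synthetic example: variable 1 has a strong effect, variable 4 a weak one.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(300, 5))
y = 4.0 * X[:, 1] + 1.0 * X[:, 4] + 0.05 * rng.normal(size=300)

predict = fit_linear(X, y)
sens = sensitivity_ranking(predict, X)
ranking = np.argsort(sens)[::-1]  # most sensitive variable first
print(ranking, np.round(sens, 2))
```

Because `sensitivity_ranking` only needs a `predict` callable, the same loop could in principle be applied to a fitted PLS model; for a purely linear model the sensitivity reduces to the absolute coefficient times the variable's range.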