On Variable Selection for Data-driven Soft Sensor Development with Application to
Industrial Processes
by
Zi Xiu Wang
A thesis submitted to the Graduate Faculty of
Auburn University
in partial fulfillment of the
requirements for the Degree of
Master of Science
Auburn, Alabama
December 8, 2012
Keywords: Variable Selection, Soft Sensor, Data-Driven Models, Process Industry, Model Sampling
Copyright 2012 by Zi Xiu Wang
Approved by
Jin Wang, Chair, Associate Professor of Chemical Engineering
Qinghua He, Associate Professor of Chemical Engineering, Tuskegee University
Mario R. Eden, Professor of Chemical Engineering
Abstract
In recent years, rapid developments in technology have facilitated the collection of vast amounts of data from different industrial processes. These data have been utilized in many different areas, such as data-driven soft sensor development and process monitoring, to control and optimize the process. The performance of these data-driven schemes can be greatly improved by selecting only the vital variables that strongly affect the primary variables, rather than all the available process variables. Consequently, variable selection has been one of the most important practical concerns in data-driven approaches. By identifying the irrelevant and redundant variables, variable selection can improve the prediction performance, reduce the computational load and model complexity, provide better insight into the nature of the process, and lower the cost of measurements [1], [2].
A comprehensive evaluation of different variable selection methods for soft sensor development is presented in this work. Seven algorithms are investigated: stepwise regression, PLS-BETA, PLS-VIP, UVE-PLS, PLS-SA, CARS-PLS, and GA-PLS, as discussed below. Stepwise regression methods are often used for variable selection in linear regression [3]; the procedure sequentially introduces individual predictor (secondary) variables into the model to observe their relation to the primary variables. Partial Least Squares (PLS) regression is a model-parameter-based algorithm; both the regression coefficients estimated by PLS (PLS-BETA) and the variable importance in projection (PLS-VIP) are discussed [4]. Another model-parameter-based method, Uninformative Variable Elimination by PLS (UVE-PLS), is also related to regression coefficients; however, instead of looking at the regression coefficients alone, it explores the reliability of the coefficients [5]. Variable selection algorithms based on sensitivity analysis (PLS-SA) are also studied. In these approaches, the importance of a variable is defined by its sensitivity, i.e., the change in the primary variables when the secondary variable is varied over its allowable range [6]. Furthermore, the properties of genetic algorithms (GA), which have recently been proposed for variable selection applications [7], are also investigated.
The algorithms of these variable selection methods and their characteristics are presented. In addition, their strengths and limitations when applied to soft sensor development are studied. The soft sensor prediction performance of the models developed by these variable selection methods is compared using PLS.
A simple simulation case is used to investigate the properties of the selected variable selection methods. The dataset is generated to mimic typical characteristics of process data, such as the magnitude of correlation between variables and the magnitude of the signal-to-noise ratio [4]. In addition, the algorithms are applied to an industrial soft sensor case study. In both cases, independent test sets are used to provide a fair comparison and analysis of the different algorithms. The final performances are compared to demonstrate the advantages and disadvantages of the different methods in order to provide useful insights to practitioners in the field.
Acknowledgments
First and foremost, I would like to express my deepest gratitude to my advisor, Dr. Jin Wang, for her guidance and constant supervision, which have made this a rewarding and meaningful journey. Her dedication and hard work have set a great example for me to become a better researcher.
I would like to express my deepest appreciation and gratitude to my other research advisor, Dr. Qinghua He. His experience, support, and encouragement have truly made a difference in my journey. Through all the meetings I have had with him, I have always left the office more encouraged and motivated, and I am very thankful for that. I would also like to thank Dr. Mario R. Eden for agreeing to be on my committee and for providing me the opportunity to join the Chemical Engineering Department.
Certainly, this journey was not easy; through struggles and successes, my fellow group members and I shared disappointments and victories together. Thus, I would like to extend my gratitude to my group members: Hector Galicia Escobar, Meng Liang, Min Hea Kim, and Andrew Damiani. Special thanks go to Hector Galicia for his endless guidance, assistance, and support; he has been instrumental in the successful completion of this thesis.
Friends have always been important to me, and thankfully great friendships were always there during my journey. I would like to thank my friends: Jimmy Tran, Arianna Tieppo, Pengfei Zhao, Achintya Sujan, and Chuan Cai Zou. They have brought laughter, happiness, and comfort not only to this journey but to my life.
Last but not least, my love and gratitude go to my parents, Hai Bin Wang and Yan Yun Yu, for raising me to always strive for excellence, to value my education, and for putting my success and happiness before their own. I am also grateful for their unconditional love and support wherever I was. I am thankful to my brother, Yu Dong (Jeffrey) Wang, for his support and understanding. Their love and encouragement have always been a constant source of comfort and support when times were difficult.
Table of Contents
Abstract ............................................................................................................................... ii
Acknowledgments.............................................................................................................. iv
List of Tables ................................................................................................................... viii
List of Figures .................................................................................................................... ix
List of Nomenclature ....................................................................................................... xiv
Chapter 1. Introduction ....................................................................................................... 1
Chapter 2. Soft Sensor Development .................................................................................. 7
2.1 Multiple Linear Regression ...................................................................................... 7
2.2 Principal Component Analysis ................................................................................. 8
2.3 Principal Component Regression (PCR) .................................................................. 8
2.4 Partial Least Squares Regression .............................................................................. 8
Chapter 3. Variable Selection Theory and Algorithm ...................................................... 10
3.1 Stepwise Regression ............................................................................................... 13
3.2 Genetic Algorithm .................................................................................................. 15
3.3 Uninformative Variable Elimination ...................................................................... 17
3.4 Partial Least Squares with Sensitivity Analysis ..................................................... 18
3.5 Competitive Adaptive Reweighted Sampling with Partial Least Squares .............. 20
3.6 Partial Least Squares with Variable Importance in Projection ................................. 22
3.7 Partial Least Squares with Regression Coefficients ............................................... 25
Chapter 4. Variable Selection Method with Its Application to Simulated and Industrial
Dataset............................................................................................................................... 27
4.1 Introduction ............................................................................................................. 27
4.2 Simulated Case Study ............................................................................................. 28
4.2.1 Results .............................................................................................................. 30
4.2.2 Conclusion and Discussion .............................................................................. 75
4.3 Industrial Case Study .............................................................................................. 76
4.3.1 Data Preprocessing ........................................................................................... 79
4.3.2 Results .............................................................................................................. 81
4.3.3 Conclusion and Discussion ............................................................................ 105
Chapter 5. Conclusions and Future Works ..................................................................... 107
5.1 Conclusions ........................................................................................................... 107
5.2 Future Works ........................................................................................................ 109
Bibliography ................................................................................................................... 110
List of Tables
Table 3.1 Confusion Matrix and Descriptions of Its Entries ............................................ 24
Table 4.1 Comparison of Sensitivity of Different Variable Selection Methods to
Proportion of Relevant Predictors ..................................................................................... 32
Table 4.2 Comparison of Sensitivity of Different Variable Selection Methods to
Magnitude of Correlation between Predictors .................................................................. 33
Table 4.3 Comparison of Sensitivity of Different Variable Selection Methods to
Regression Coefficient Structure ...................................................................................... 34
Table 4.4 Comparison of Sensitivity of Different Variable Selection Methods to
Magnitude of Signal to Noise Ratio.................................................................................. 35
Table 4.5 List of Process Variables Included in Polyester Resin Dataset ........................ 77
Table 4.6 Comparison of Different Variable Selection Methods for Preprocessing Method 1 ....... 83
Table 4.7 Comparison of Different Variable Selection Methods for Preprocessing Method 2 ....... 83
Table 4.8 Comparison of Different Variable Selection Methods for Preprocessing Method 3 ....... 84
Table 5.1 Limitations and Strengths of Each Variable Selection Method ...................... 108
List of Figures
Figure 3.1 Stepwise Regression Algorithm ...................................................................... 15
Figure 3.2 Genetic Algorithm with PLS ........................................................................... 17
Figure 3.3 Procedure of Uninformative Variable Elimination with PLS ......................... 18
Figure 3.4 PLS-SA Algorithm .......................................................................... 20
Figure 3.5 Graphical illustration of the exponentially decreasing function ...................... 21
Figure 3.6 Illustration of adaptive reweighted sampling technique using five variables in
three cases as an example. The variables with larger weights will be selected with higher
frequency........................................................................................................................... 22
Figure 3.7 General Procedure of CARS-PLS ..................................................... 22
Figure 3.8 Procedure of PLS-VIP ..................................................................... 25
Figure 3.9 Procedure of PLS-BETA ................................................................. 26
Figure 4.1 Sensitivity of Proportion of Relevant Predictors in Terms of Average G ....... 36
Figure 4.2 Sensitivity of Magnitude of Correlation between Predictors in Terms of
Average G ......................................................................................................................... 36
Figure 4.3 Sensitivity of Regression Coefficient Structure in Terms of Average G ........ 37
Figure 4.4 Sensitivity of Magnitude of Signal to Noise Ratio in Terms of Average G.... 37
Figure 4.5 Sensitivity of Proportion of Relevant Predictors in Terms of Average RMSE in
Training Set ....................................................................................................................... 38
Figure 4.6 Sensitivity of Magnitude of Correlation between Predictors in Terms of
Average RMSE in Training Set ........................................................................................ 38
Figure 4.7 Sensitivity of Regression Coefficient Structure in Terms of Average RMSE in
Training Set ....................................................................................................................... 39
Figure 4.8 Sensitivity of Magnitude of Signal to Noise Ratio in Terms of Average RMSE
In Training Set .................................................................................................................. 39
Figure 4.9 Sensitivity of Proportion of Relevant Predictors in Terms of Average MAPE
in Training Set................................................................................................................... 40
Figure 4.10 Sensitivity of Magnitude of Correlation between Predictors in Terms of
Average MAPE in Training Set ........................................................................................ 40
Figure 4.11 Sensitivity of Regression Coefficient Structure in Terms of Average MAPE
in Training Set................................................................................................... 41
Figure 4.12 Sensitivity of Magnitude of Signal to Noise Ratio in Terms of Average
MAPE in Training Set ...................................................................................................... 41
Figure 4.13 Sensitivity of Proportion of Relevant Predictors in Terms of Average RMSE
in Validation Set ............................................................................................................... 42
Figure 4.14 Sensitivity of Magnitude of Correlation between Predictors in Terms of
Average RMSE in Validation Set ..................................................................................... 42
Figure 4.15 Sensitivity of Regression Coefficient Structure in Terms of Average RMSE
in Validation Set ............................................................................................................... 43
Figure 4.16 Sensitivity of Magnitude of Signal to Noise Ratio in Terms of Average
RMSE in Validation Set.................................................................................................... 43
Figure 4.17 Sensitivity of Proportion of Relevant Predictors in Terms of Average MAPE
in Validation Set ............................................................................................................... 44
Figure 4.18 Sensitivity of Magnitude of Correlation between Predictors in Terms of
Average MAPE in Validation Set ..................................................................................... 44
Figure 4.19 Sensitivity of Regression Coefficient Structure in Terms of Average MAPE
in Validation Set ............................................................................................................... 45
Figure 4.20 Sensitivity of Magnitude of Signal to Noise Ratio in Terms of Average
MAPE in Validation Set ................................................................................................... 45
Figure 4.21 Frequency of Variables Selected by SR ........................................................ 46
Figure 4.22 Frequency of Variables Selected by SR ........................................................ 47
Figure 4.23 Frequency of Variables Selected by SR ........................................................ 48
Figure 4.24 Frequency of Variables Selected by SR ........................................................ 49
Figure 4.25 Frequency of Variables Selected by GA-PLS ............................... 50
Figure 4.26 Frequency of Variables Selected by GA-PLS ............................... 51
Figure 4.27 Frequency of Variables Selected by GA-PLS ............................... 52
Figure 4.28 Frequency of Variables Selected by GA-PLS ............................... 53
Figure 4.29 Frequency of Variables Selected by UVE-PLS............................. 54
Figure 4.30 Frequency of Variables Selected by UVE-PLS............................. 55
Figure 4.31 Frequency of Variables Selected by UVE-PLS............................. 56
Figure 4.32 Frequency of Variables Selected by UVE-PLS............................. 57
Figure 4.33 Frequency of Variables Selected by PLS-SA ................................ 58
Figure 4.34 Frequency of Variables Selected by PLS-SA ................................ 59
Figure 4.35 Frequency of Variables Selected by PLS-SA ................................ 60
Figure 4.36 Frequency of Variables Selected by PLS-SA ................................ 61
Figure 4.37 Frequency of Variables Selected by CARS-PLS .......................... 62
Figure 4.38 Frequency of Variables Selected by CARS-PLS .......................... 63
Figure 4.39 Frequency of Variables Selected by CARS-PLS .......................... 64
Figure 4.40 Frequency of Variables Selected by CARS-PLS .......................... 65
Figure 4.41 Frequency of Variables Selected by PLS-VIP .............................. 66
Figure 4.42 Frequency of Variables Selected by PLS-VIP .............................. 67
Figure 4.43 Frequency of Variables Selected by PLS-VIP .............................. 68
Figure 4.44 Frequency of Variables Selected by PLS-VIP .............................. 69
Figure 4.45 Frequency of Variables Selected by PLS-BETA .......................... 70
Figure 4.46 Frequency of Variables Selected by PLS-BETA .......................... 71
Figure 4.47 Frequency of Variables Selected by PLS-BETA .......................... 72
Figure 4.48 Frequency of Variables Selected by PLS-BETA .......................... 73
Figure 4.49 Visualization of Autoscaled Process Data from a Reference Batch in
Polyester Production ......................................................................................... 78
Figure 4.50 Product Quality Variables from a Reference Batch in Polyester Production.
(a) is the acidity number in g NaOH/g resin; (b) is the viscosity in poise. .................. 78
Figure 4.51 Illustration of Unfolding a Three-Dimensional Array to Preserve the
Direction of Variables ....................................................................................... 80
Figure 4.52 Dynamic Parallel Coordinate Plot of Autoscaled Unfolded Process Data of a
Permutation Run of the Polyester Resin Dataset ............................................... 80
Figure 4.53 Product Quality Variables of a Permutation Run of Polyester Production. (a)
is the acidity number in g NaOH/g resin; (b) is the viscosity in poise. ..................... 81
Figure 4.54 Comparison of Acidity Number Full Models from Each Preprocessing
Method in Terms of RMSE ............................................................................... 85
Figure 4.55 Comparison of Acidity Number Full Models from Each Preprocessing
Method in Terms of MAPE ............................................................................... 85
Figure 4.56 Comparison of Viscosity Full Models from Each Preprocessing Method in
Terms of RMSE ................................................................................................ 86
Figure 4.57 Comparison of Viscosity Full Models from Each Preprocessing Method in
Terms of MAPE ................................................................................................ 86
Figure 4.58 Comparison of Different Preprocessing Methods of the Acidity Number
Model in Terms of RMSE in Training Set ........................................................ 87
Figure 4.59 Comparison of Different Preprocessing Methods of the Acidity Number
Model in Terms of MAPE in Training Set ........................................................ 87
Figure 4.60 Comparison of Different Preprocessing Methods of the Acidity Number
Model in Terms of RMSE in Validation Set ..................................................... 88
Figure 4.61 Comparison of Different Preprocessing Methods of the Acidity Number
Model in Terms of MAPE in Validation Set ..................................................... 88
Figure 4.62 Comparison of Different Preprocessing Methods of the Viscosity Model in
Terms of RMSE in Training Set ....................................................................... 89
Figure 4.63 Comparison of Different Preprocessing Methods of the Viscosity Model in
Terms of MAPE in Training Set ....................................................................... 89
Figure 4.64 Comparison of Different Preprocessing Methods of the Viscosity Model in
Terms of RMSE in Validation Set .................................................................... 90
Figure 4.65 Comparison of Different Preprocessing Methods of the Viscosity Model in
Terms of MAPE in Validation Set .................................................................... 90
Figure 4.66 Selection Frequency of SR in Acidity Number Model ................... 91
Figure 4.67 Selection Frequency of SR in Viscosity Model ............................. 92
Figure 4.68 Selection Frequency of GA in Acidity Number Model .................. 93
Figure 4.69 Selection Frequency of GA in Viscosity Model ............................ 94
Figure 4.70 Selection Frequency of UVE in Acidity Number Model ............... 95
Figure 4.71 Selection Frequency of UVE in Viscosity Model .......................... 96
Figure 4.72 Selection Frequency of PLS-SA in Acidity Number Model .......... 97
Figure 4.73 Selection Frequency of PLS-SA in Viscosity Model ..................... 98
Figure 4.74 Selection Frequency of CARS-PLS in Acidity Number Model ..... 99
Figure 4.75 Selection Frequency of CARS-PLS in Viscosity Model .............. 100
Figure 4.76 Selection Frequency of PLS-VIP in Acidity Number Model ....... 101
Figure 4.77 Selection Frequency of PLS-VIP in Viscosity Model .................. 102
Figure 4.78 Selection Frequency of PLS-BETA in Acidity Number Model ... 103
Figure 4.79 Selection Frequency of PLS-BETA in Viscosity Model .............. 104
List of Nomenclature
Symbols Descriptions
A Number of PLS components
ARS Adaptive reweighted sampling
BETA Regression coefficients
CARS Competitive adaptive reweighted sampling
EDF Exponentially decreasing function
G Geometric mean of sensitivity and specificity
GA Genetic algorithm
GAVDS Genetic algorithm-based process variables and dynamics selection
m Number of variables
n Number of batches
Reciprocal of signal-to-noise ratio
KRR Kernel ridge regression
MAPE Mean absolute percentage error
MLR Multiple linear regression
MC Monte Carlo
MSE Mean square error
N Total number of simulation runs
w Normalized regression coefficient
PCA Principal component analysis
PCR Principal component regression
PLS Partial least squares
PLS-LDA Partial least squares linear discriminant analysis
RMSEP Root mean square error of prediction
RV Random variable added in PLS-SA
SPA Statistics pattern analysis or successive projection algorithm
SR Stepwise regression
SS Sum square error
T Score matrix
USE Uninformative sample elimination
UVE Uninformative variable elimination
v Cutoff value of the VIP score
var(·) Sample variance
VIP Variable importance in projection
W Weighting matrix
X Independent variable matrix
[X R] Extended matrix of the experimental and random variables in UVE
Y Dependent variable matrix
α Confidence level in statistical testing
Σ Variance-covariance matrix
ε Normally distributed random noise
ρ Magnitude of correlation between predictors
Variable subset
σ Standard deviation of error
Chapter 1. Introduction
In recent years, rapid developments in technology have facilitated the collection of vast amounts of data from different industrial processes. These data have been utilized in many different areas, such as data-driven soft sensor development and process monitoring, to control and optimize the process. The performance of these data-driven schemes can be greatly improved by selecting only the vital variables that strongly affect the primary variables, rather than all the available process variables. Consequently, variable selection has been one of the most important practical concerns in data-driven approaches. By identifying the irrelevant and redundant variables, variable selection can improve the prediction performance, reduce the computational load and model complexity, provide better insight into the nature of the process, and lower the cost of measurements [1], [2].
A comprehensive evaluation of different variable selection methods for soft sensor development is presented in this work. Seven algorithms are investigated: stepwise regression, PLS-BETA, PLS-VIP, UVE-PLS, PLS-SA, CARS-PLS, and GA-PLS, as discussed below. Stepwise regression methods are often used for variable selection in linear regression [3]; the procedure sequentially introduces individual predictor (secondary) variables into the model to observe their relation to the primary variables. Partial Least Squares (PLS) regression is a model-parameter-based algorithm; both the regression coefficients estimated by PLS (PLS-BETA) and the variable importance in projection (PLS-VIP) are discussed [4]. Another model-parameter-based method, Uninformative Variable Elimination by PLS (UVE-PLS), is also related to regression coefficients; however, instead of looking at the regression coefficients alone, it explores the reliability of the coefficients [5]. Variable selection algorithms based on sensitivity analysis (PLS-SA) are also studied. In these approaches, the importance of a variable is defined by its sensitivity, i.e., the change in the primary variables when the secondary variable is varied over its allowable range [6]. Furthermore, the properties of genetic algorithms (GA), which have recently been proposed for variable selection applications [7], are also investigated.
Stepwise regression has been applied to the selection of predictors for both classification and multivariate calibration [8], especially for near-infrared (NIR) spectra. Gauchi and Chagnon proposed a stepwise variable selection method based on a maximum Q2 criterion and applied it to manufacturing processes in the oil, chemical, and food industries [9].
Broadhurst et al. applied a genetic algorithm to pyrolysis mass spectrometric data and showed that GA is able to determine the optimal subset of variables to provide better or equal prediction performance [10]. Arcos et al. successfully applied GA to wavelength selection for PLS calibration of mixtures of indomethacin and acemethacin, in spite of the fact that the two compounds have almost identical spectra [11]. A modified genetic algorithm-based wavelength selection method has been proposed by Hiromasa Kaneko and Kimito Funatsu to select process variables and dynamics simultaneously [12]. This method is named the genetic algorithm-based process variables and dynamics selection method (GAVDS). The results of GAVDS, based on its application to a dynamic distillation column process at Mitsubishi Chemical Corporation, show its robustness to the presence of nonlinearity and multicollinearity in process data. GA has also been well recognized in molecular modeling; Jones et al. have shown three applications of GA in chemical structure handling and molecular recognition [13].
A modified uninformative variable elimination method based on the principle of Monte Carlo (MC) sampling was applied to the quantitative analysis of NIR spectra by Cai et al. [14]. UVE-MC proved capable of selecting important wavelengths and making the prediction more robust and accurate in quantitative analysis. Some researchers have also suggested combining UVE with the wavelet transform to further simplify the model and reduce computation time [14], [15]. Koshoubu et al. extended UVE to eliminate uninformative samples (USE) that do not contribute much to the calibration model [16], [17]. They proposed an algorithm in which the uninformative wavelengths/variables are eliminated first by UVE-PLS, and then the uninformative samples, determined by the standard deviation of their prediction errors calculated from leave-one-out cross-validation, are eliminated from the calibration. Another new method combining UVE with the successive projection algorithm (SPA) has been proposed in [18]; UVE is implemented to remove uninformative variables before the application of SPA, improving the efficiency of variable selection by SPA.
Sensitivity analysis has become more popular for the selection of optimal variable subsets in recent years. Zamprogna et al. introduced a novel methodology based on principal component analysis (PCA) to select the most suitable soft sensor inputs [19]. Instead of using the secondary variables directly, the instantaneous sensitivity of each secondary variable to the primary variables is estimated and utilized as the regressor input. Li and Shao proposed a novel method that uses sensitivity analysis to select the optimal secondary variables as inputs to kernel ridge regression (KRR) for online soft sensing of distillation column compositions [20].
The competitive adaptive reweighted sampling (CARS) method has been proposed by Li et al. [21]. CARS is model independent; in other words, it can be combined with any regression or classification model. In [22], [23], CARS has been applied in combination with partial least squares linear discriminant analysis (PLS-LDA) to effectively classify two classes of samples in colorectal cancer data.
Variable importance in the projection (VIP) and regression coefficients (BETA) have been broadly adopted as criteria for variable selection in the partial least squares modeling paradigm. Both PLS-VIP and PLS-BETA are model-based variable selection methods. Mehmood et al. presented an algorithm that balances the parsimony and predictive ability of the model using variable selection based on PLS-VIP [24]; it is shown that the proposed method increases the understandability and consistency of the model and reduces the classification error. Lindgren et al. also implemented PLS-VIP on a benchmark dataset for variable selection, the Selwood dataset [25]; in their study, PLS-VIP is combined with a permutation test to investigate the technique extensively. A bootstrap-PLS-VIP has been implemented as a wavelength interval selection method in spectral imaging applications by Gosselin et al. [26]; their results demonstrate its ability to identify relevant spectral intervals as well as its simplicity and relatively low computational cost. PLS-VIP and PLS-BETA have also been applied in food science: Andersen and Bro applied PLS-VIP and PLS-BETA to NIR spectra of beer samples and obtained useful insight into the process [27]. A variable selection algorithm based on standardized regression coefficients is proposed in [28]; the developed models are optimized by leave-one-out Q2 values and validated by an external testing set.
The algorithms of these variable selection methods and their characteristics are presented. In addition, their strengths and limitations when applied to soft sensor development are studied. The soft sensor prediction performance of the models developed by these variable selection methods is compared using PLS.
A simple simulation case is used to investigate the properties of the selected variable selection methods. The dataset is generated to mimic typical characteristics of process data, such as the magnitude of correlation between variables and the magnitude of the signal-to-noise ratio [4]. In addition, the algorithms are applied to an industrial soft sensor case study. In both cases, independent test sets are used to provide a fair comparison and analysis of the different algorithms. The final performances are compared to demonstrate the advantages and disadvantages of the different methods in order to provide useful insights to practitioners in the field.
This work is structured as follows. In Chapter 2, a brief review of the multivariate statistical techniques required for the discussion of variable selection methods is presented. Chapter 3 provides detailed descriptions of the algorithms of the different variable selection methods covered in this work: Stepwise Regression (SR), Genetic Algorithm with Partial Least Squares (GA-PLS), Uninformative Variable Elimination by Partial Least Squares (UVE-PLS), Partial Least Squares with Sensitivity Analysis (PLS-SA), Competitive Adaptive Reweighted Sampling with Partial Least Squares (CARS-PLS), Partial Least Squares with Variable Importance in Projection (PLS-VIP), and Partial Least Squares with regression coefficients (PLS-BETA). In Chapter 4, the application of all seven variable selection methods to a simulated case study and an industrial case study is investigated. The simulation case is generated to mimic the typical characteristics of industrial data by considering four factors: the proportion of relevant predictors, the magnitude of correlation between predictors, the structure of the regression coefficients, and the magnitude of the signal-to-noise ratio. A detailed description of the data generation is provided. The industrial case study focuses on process data from a polyester resin production plant. A brief specification of the plant is included, followed by a discussion of the characteristics of batch processes. The results and comparison of variable selection on both the simulated and industrial case studies are investigated. Chapter 5 concludes this work with the major discussion and contributions; furthermore, suggestions for future work are provided.
Chapter 2. Soft Sensor Development
Soft sensors have been developed and implemented for decades; predictive models are built based on the large amounts of data measured and stored in the process industries [29], [30]. Soft sensors can be classified into two categories: model-driven and data-driven. Model-driven soft sensors are based on first-principles models that describe the physical and chemical characteristics of the process, whereas data-driven soft sensors are based on the data measured and collected within the plants [29–32]. The most popular soft sensor techniques include principal component analysis (PCA) [33], partial least squares (PLS) [34], artificial neural networks [35], neuro-fuzzy systems [36], and support vector machines [37]. In this work, only linear models are considered.
2.1 Multiple Linear Regression
The goal of multiple linear regression (MLR) is to establish a linear relationship between the secondary variables and the primary variables in the form of Equation (2.1), where $x_j$ is the $j$-th secondary variable, $y$ is the primary variable, $b_j$ is the sensitivity (regression coefficient), and $e$ is the residual.
$y = \sum_{j=1}^{m} b_j x_j + e$    (2.1)
The above linear relationship can also be written in matrix form as
$Y = XB + E$    (2.2)
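As a minimal illustration, the coefficients of Equation (2.2) can be estimated by ordinary least squares. The following Python sketch uses hypothetical data; the variable names and dimensions are not from the thesis:

```python
import numpy as np

# Hypothetical data: 50 samples, 5 secondary variables X and one primary variable y.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
b_true = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
y = X @ b_true + 0.1 * rng.normal(size=50)

# Least-squares estimate of B in Y = XB + E (Equation 2.2).
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b_hat)  # close to b_true
```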
2.2 Principal Component Analysis
Principal component analysis (PCA) is a linear technique that transforms the original data matrix $X$ into a smaller set of uncorrelated variables that capture most of the information in the original space. This linear transformation can be expressed as in Equation (2.3), where $T$ is the score matrix, $P$ is the loading matrix, and $E$ is the residual.
$X = TP^{T} + E$    (2.3)
2.3 Principal Component Regression (PCR)
Principal component regression (PCR) is a combination of PCA and MLR: the MLR model is written in terms of the score matrix $T$, which has better numerical properties than the original data matrix.
$Y = TB + E$    (2.4)
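A minimal PCR sketch in Python follows (assuming scikit-learn; the data and the choice of three components are hypothetical):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Hypothetical, highly correlated predictors.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))
X[:, 5:] = X[:, :5] + 0.05 * rng.normal(size=(50, 5))
y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=50)

# PCA gives the score matrix T (Equation 2.3); MLR is then run on T (Equation 2.4).
pca = PCA(n_components=3)
T = pca.fit_transform(X)
pcr = LinearRegression().fit(T, y)
print(pcr.predict(pca.transform(X))[:5])
```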
2.4 Partial Least Squares Regression
Partial least squares (PLS) regression has established itself as a valuable alternative for analyzing secondary variables that are highly correlated, measured with high noise, and of high dimensionality. The PLS model is built based on the properties of the NIPALS algorithm by letting the score matrix represent the data matrix [38]. In PLS, the decomposition of the matrices $X$ and $Y$ is done in such a way that their covariance is maximized. The PLS algorithm was developed by Wold et al. [39]. The decomposition of the data matrix $X$ is done by Equation (2.3), and the decomposition of $Y$ is done in a similar way by Equation (2.5), where $U$ and $Q$ are the score and loading matrices of $Y$, respectively, and $F$ is the residual.
$Y = UQ^{T} + F$    (2.5)
The objective of PLS is to describe the maximum amount of variation in $Y$ and to obtain a useful relation between $X$ and $Y$ simultaneously. This is done by introducing a linear inner model between the score matrices of $X$ and $Y$:
$U = TB$    (2.6)
Consequently, the matrix $Y$ can be estimated as in Equation (2.7), where the residual $F$ is to be minimized. The detailed PLS algorithm can be found in [38–40].
$\hat{Y} = TBQ^{T}$    (2.7)
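A minimal PLS sketch in Python follows (assuming scikit-learn, whose PLSRegression implements a NIPALS-based algorithm; the data and component count are hypothetical):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Hypothetical data with correlated, noisy predictors.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 8))
y = 2 * X[:, 0] - X[:, 3] + 0.1 * rng.normal(size=60)

# PLS with 3 latent components; the scores maximize the X-Y covariance.
pls = PLSRegression(n_components=3).fit(X, y)
print(pls.predict(X)[:5].ravel())
```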
Chapter 3. Variable Selection Theory and Algorithm
Due to the rapid development of technology, thousands of process measurements are collected by process computers every day. Researchers have been utilizing these data to build soft sensors, also known as data-driven soft sensors. By correlating the secondary variables with the primary variables, soft sensors can provide information on important variables that cannot be measured directly. Furthermore, soft sensors can provide predictions of infrequently measured variables so that control actions can be taken to prevent process failure. Many studies have shown that the performance of a soft sensor can be tremendously improved if only the few vital variables are included in its development. Consequently, variable selection has been one of the most important practical concerns in data-driven approaches. By identifying the relevant variables, variable selection can improve the prediction performance of the soft sensor, reduce the computational load and model complexity, provide better insight into the nature of the process, and lower the measurement cost [2], [27].
Seven variable selection methods are explored in this work. They were selected based on their popularity, practicability of implementation, complexity, and type of selection criterion. They can be categorized into four groups: iterative methods (stepwise regression and the genetic algorithm combined with PLS), methods based on an artificial standard (uninformative variable elimination combined with PLS, and PLS based on sensitivity analysis), enforced variable elimination methods (competitive adaptive reweighted sampling with PLS), and methods based on predictive properties (PLS based on variable importance in projection and PLS based on regression coefficients).
Stepwise regression has been applied to the selection of predictors for both classification and multivariate calibration [8], especially for near-infrared (NIR) spectra. Gauchi and Chagnon proposed a stepwise variable selection method based on a maximum Q2 criterion and applied it to manufacturing processes in the oil, chemical, and food industries [9].
Broadhurst et al. applied a genetic algorithm to pyrolysis mass spectrometric data and showed that GA is able to determine the optimal subset of variables to provide better or equal prediction performance [10]. Arcos et al. successfully applied GA to wavelength selection for PLS calibration of mixtures of indomethacin and acemethacin, in spite of the fact that the two compounds have almost identical spectra [11]. A modified genetic algorithm-based wavelength selection method has been proposed by Hiromasa Kaneko and Kimito Funatsu to select process variables and dynamics simultaneously [12]. This method is named the genetic algorithm-based process variables and dynamics selection method (GAVDS). The results of GAVDS, based on its application to a dynamic distillation column process at Mitsubishi Chemical Corporation, show its robustness to the presence of nonlinearity and multicollinearity in process data. GA has also been well recognized in molecular modeling; Jones et al. have shown three applications of GA in chemical structure handling and molecular recognition [13].
A modified uninformative variable elimination method based on the principle of Monte Carlo (MC) sampling was applied to the quantitative analysis of NIR spectra by Cai et al. [14]. UVE-MC proved capable of selecting important wavelengths and making the prediction more robust and accurate in quantitative analysis. Some researchers have also suggested combining UVE with the wavelet transform to further simplify the model and reduce computation time [14], [15]. Koshoubu et al. extended UVE to eliminate uninformative samples (USE) that do not contribute much to the calibration model [16], [17]. They proposed an algorithm in which the uninformative wavelengths/variables are eliminated first by UVE-PLS, and then the uninformative samples, determined by the standard deviation of their prediction errors calculated from leave-one-out cross-validation, are eliminated from the calibration. Another new method combining UVE with the successive projection algorithm (SPA) has been proposed in [18]; UVE is implemented to remove uninformative variables before the application of SPA, improving the efficiency of variable selection by SPA.
Sensitivity analysis has become more popular for the selection of optimal variable subsets in recent years. Zamprogna et al. introduced a novel methodology based on principal component analysis (PCA) to select the most suitable soft sensor inputs [19]. Instead of using the secondary variables directly, the instantaneous sensitivity of each secondary variable to the primary variables is estimated and utilized as the regressor input. Li and Shao proposed a novel method that uses sensitivity analysis to select the optimal secondary variables as inputs to kernel ridge regression (KRR) for online soft sensing of distillation column compositions [20].
The competitive adaptive reweighted sampling (CARS) method has been proposed by Li et al. [21]. CARS is model independent; in other words, it can be combined with any regression or classification model. In [22], [23], CARS has been applied in combination with partial least squares linear discriminant analysis (PLS-LDA) to effectively classify two classes of samples in colorectal cancer data.
Variable importance in the projection (VIP) and regression coefficients (BETA) have been broadly adopted as criteria for variable selection in the partial least squares modeling paradigm. Both PLS-VIP and PLS-BETA are model-based variable selection methods. Mehmood et al. presented an algorithm that balances the parsimony and predictive ability of the model using variable selection based on PLS-VIP [24]; it is shown that the proposed method increases the understandability and consistency of the model and reduces the classification error. Lindgren et al. also implemented PLS-VIP on a benchmark dataset for variable selection, the Selwood dataset [25]; in their study, PLS-VIP is combined with a permutation test to investigate the technique extensively. A bootstrap-PLS-VIP has been implemented as a wavelength interval selection method in spectral imaging applications by Gosselin et al. [26]; their results demonstrate its ability to identify relevant spectral intervals as well as its simplicity and relatively low computational cost. PLS-VIP and PLS-BETA have also been applied in food science: Andersen and Bro applied PLS-VIP and PLS-BETA to NIR spectra of beer samples and obtained useful insight into the process [27]. A variable selection algorithm based on standardized regression coefficients is proposed in [28]; the developed models are optimized by leave-one-out Q2 values and validated by an external testing set.
3.1 Stepwise Regression
Stepwise regression has been widely used for variable selection in linear regression [3]. Stepwise regression is a combination of the forward selection and backward elimination methods [9], both of which are well-known methods for variable selection in multiple regression; they introduce or eliminate variables one by one according to specific thresholds. In stepwise regression, a sequence of regression models is constructed iteratively by adding or removing variables. The variables are selected according to their statistical significance in a regression [8]; a partial F-test or t-test is used to determine significance.
The standard stepwise regression procedure is illustrated in Figure 3.1 and summarized as follows (a code sketch is given after the list):
1. Define the thresholds on the probability of incorrectly rejecting a true null hypothesis (Type I error). The threshold for adding a variable to the model is $\alpha_{in} = 0.05$, and the threshold for removing a variable from the model is $\alpha_{out} = 0.1$.
2. Assume the total number of variables is $m$, and $\{x_1, \ldots, x_k\}$ is the subset of variables currently included in the linear regression model. The unselected variables are examined by calculating their partial F-statistics using Equations (3.1) and (3.2), where $SSR(x_j \mid x_1, \ldots, x_k)$ is the extra sum of squares due to regression on $x_j$, $SSE$ is the residual sum of squares, and $MSE$ is the mean square error of the model including $x_j$. The variable with the maximum F-statistic among all the unselected ones is added to the model, provided that its significance level passes $\alpha_{in}$.
$F_j = \dfrac{SSR(x_j \mid x_1, \ldots, x_k)}{MSE(x_1, \ldots, x_k, x_j)}$    (3.1)
$SSR(x_j \mid x_1, \ldots, x_k) = SSE(x_1, \ldots, x_k) - SSE(x_1, \ldots, x_k, x_j)$    (3.2)
3. Once a new subset of variables is determined, the same procedure is carried out to check whether any of the variables inside the model should be removed. The variable with the smallest F-statistic is removed, provided that its significance level fails $\alpha_{out}$; otherwise, the variable is retained in the model.
4. Repeat Steps 2 and 3 until no variable can be added to or removed from the model.
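A minimal sketch of this procedure in Python is shown below (assuming numpy and statsmodels; it uses t-test p-values, which for a single variable are equivalent to the partial F-test, together with the thresholds listed above; the example data are hypothetical):

```python
import numpy as np
import statsmodels.api as sm

def stepwise(X, y, alpha_in=0.05, alpha_out=0.10):
    """Bidirectional stepwise selection driven by t-test p-values (equivalent
    to the single-variable partial F-test of Equations (3.1)-(3.2))."""
    selected, remaining = [], list(range(X.shape[1]))
    while True:
        changed = False
        # Forward step: add the most significant candidate, if it passes alpha_in.
        pvals = {}
        for j in remaining:
            fit = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
            pvals[j] = fit.pvalues[-1]            # p-value of the candidate
        if pvals:
            best = min(pvals, key=pvals.get)
            if pvals[best] < alpha_in:
                selected.append(best)
                remaining.remove(best)
                changed = True
        # Backward step: drop the least significant variable, if it fails alpha_out.
        if selected:
            fit = sm.OLS(y, sm.add_constant(X[:, selected])).fit()
            worst = int(np.argmax(fit.pvalues[1:]))   # skip the intercept
            if fit.pvalues[1:][worst] > alpha_out:
                remaining.append(selected.pop(worst))
                changed = True
        if not changed:
            return selected

# Hypothetical example: only variables 1 and 4 are truly relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = 2 * X[:, 1] - X[:, 4] + 0.1 * rng.normal(size=100)
print(stepwise(X, y))
```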
[Flowchart: build an MLR model on a subset of variables; calculate the F-statistic of each variable; add the variable with the largest F-value to the model; remove the variable with the smallest F-value from the model; evaluate the prediction performance; repeat while the performance improves.]
Figure 3.1 Stepwise Regression Algorithm
3.2 Genetic Algorithm
Genetic algorithms have been widely used for solving complex optimization and search problems [41]. More recently, GA has been used to find the optimum subset of regressor variables for a given modeling method, based on cost function evaluations for all candidate genetic chromosomes [42].
The original algorithm can be found in [43–45]. Generally speaking, there are five steps in GA: coding of variables, initiation of the population, evaluation of the responses, reproduction, and mutation [46]. The last three steps are implemented iteratively until a termination criterion is reached. In this work, GA combined with a PLS regression model is studied. The following terms must be defined:
1. Initiation of the population: the percentage of variables included in the initial population (30%–50%).
2. Population size: this value depends on the total number of variables; there is a trade-off between the initial coverage of the original space and the computational load.
3. Maximum number of generations (50–500): this can serve as one of the termination criteria.
4. Percentage of the population retained after each generation (50%–80%): this defines the top fraction of the population to be kept unchanged in each generation; in other words, only the remaining chromosomes go through reproduction.
5. Breeding crossover rule (single or double crossover): analogous to reproduction, this is a genetic operator used to vary the programming of chromosomes from one generation to the next.
6. Mutation rate (0.001–0.01): the chance of alteration of genes after crossover.
An initial population is generated by randomly choosing 30% of the total variables; this is repeated multiple times, depending on the population size. A PLS model is built for each population member (chromosome). The chromosomes are then sorted in descending order by their cross-validation metrics. Only the top percentage of the population remains unchanged; the rest undergo crossover/reproduction, and a new generation of chromosomes is produced. This is done iteratively until a termination criterion is reached, which can be based on the maximum number of generations or on a lack of prediction improvement. The algorithm is shown in Figure 3.2 and sketched in code below.
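The following is a minimal GA-PLS sketch in Python (assuming scikit-learn; the data, population size, and rates are hypothetical choices within the ranges listed above, not values from the thesis):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

def ga_pls(X, y, pop_size=30, n_gen=50, keep=0.5, mut_rate=0.01, seed=0):
    """Minimal GA-PLS sketch: chromosomes are boolean masks over the variables."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    pop = rng.random((pop_size, m)) < 0.3        # ~30% of variables per chromosome

    def fitness(mask):
        if mask.sum() < 2:
            return -np.inf
        pls = PLSRegression(n_components=min(2, int(mask.sum())))
        return cross_val_score(pls, X[:, mask], y, cv=5, scoring="r2").mean()

    for _ in range(n_gen):
        order = np.argsort([fitness(c) for c in pop])[::-1]
        pop = pop[order]
        n_keep = int(keep * pop_size)            # elite chromosomes kept unchanged
        children = []
        while len(children) < pop_size - n_keep:
            p1, p2 = pop[rng.integers(n_keep, size=2)]
            cut = rng.integers(1, m)             # single-point crossover
            child = np.concatenate([p1[:cut], p2[cut:]])
            child ^= rng.random(m) < mut_rate    # mutation flips a few genes
            children.append(child)
        pop = np.vstack([pop[:n_keep], children])
    scores = [fitness(c) for c in pop]
    return pop[int(np.argmax(scores))]

# Hypothetical example: variables 0 and 3 carry the signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 10))
y = 2 * X[:, 0] - X[:, 3] + 0.1 * rng.normal(size=60)
print(np.where(ga_pls(X, y, n_gen=20))[0])
```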
[Flowchart: generate initial populations by randomly choosing ~30% of the variables; build a PLS model for each population and sort them by performance; retain the top 50% of the population; the remainder undergoes reproduction to obtain a new generation; repeat until the termination criterion is reached.]
Figure 3.2 Genetic Algorithm with PLS
3.3 Uninformative Variable Elimination
A method for eliminating uninformative variables by comparing them with artificial variables was proposed by Centner et al. [5]. Models are built using both experimental and artificial variables, and the analysis is based on the regression coefficients from the model.
In this work, uninformative variable elimination by partial least squares (UVE-PLS) is studied. The procedure is illustrated in Figure 3.3 and summarized as follows (a code sketch is given after the list):
1. For a given set of $m$ experimental variables $X$ with $n$ samples, generate an artificial random variable matrix $R$ of very small magnitude and the same dimensions as the experimental variables. This results in a matrix of dimension $n$ by $2m$, $[X\ R]$.
2. Build PLS models for $[X\ R]$ based on a leave-one-out procedure. This yields a regression coefficient matrix $B$, with one coefficient vector per left-out sample.
3. Calculate the reliability index $c_j$ of each variable using Equation (3.3), where $\bar{b}_j$ and $s(b_j)$ are the mean and standard deviation of the coefficients of variable $j$ obtained from the leave-one-out procedure.
$c_j = \dfrac{\bar{b}_j}{s(b_j)}$    (3.3)
$\bar{b}_j = \dfrac{1}{n}\sum_{i=1}^{n} b_{ij}$    (3.4)
$s(b_j) = \sqrt{\dfrac{\sum_{i=1}^{n}\left(b_{ij}-\bar{b}_j\right)^2}{n-1}}$    (3.5)
4. Determine the maximum absolute reliability index of the artificial variables, $\max(\mathrm{abs}(c_{arti}))$. The experimental variables with absolute reliability index below this value are eliminated, i.e., those with $\mathrm{abs}(c_j) < \max(\mathrm{abs}(c_{arti}))$.
5. A new PLS model is built using only the remaining variables.
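The following is a minimal UVE-PLS sketch in Python (assuming scikit-learn; scale=False is used so that the artificial columns keep their small magnitude; the data and dimensions are hypothetical):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def uve_pls(X, y, n_components=2, rng=None):
    """Minimal UVE-PLS sketch following Steps 1-4 above."""
    rng = rng or np.random.default_rng(0)
    n, m = X.shape
    # Step 1: append artificial random variables of very small magnitude.
    Z = np.hstack([X, 1e-10 * rng.normal(size=(n, m))])
    # Step 2: leave-one-out PLS models, one coefficient vector per left-out sample.
    B = np.empty((n, 2 * m))
    for i in range(n):
        keep = np.arange(n) != i
        pls = PLSRegression(n_components=n_components, scale=False).fit(Z[keep], y[keep])
        B[i] = pls.coef_.ravel()
    # Step 3: reliability index c_j = mean(b_j) / s(b_j) (Equation 3.3).
    c = B.mean(axis=0) / B.std(axis=0, ddof=1)
    # Step 4: keep experimental variables whose |c_j| exceeds the largest
    # |c_j| found among the artificial variables.
    cutoff = np.abs(c[m:]).max()
    return np.where(np.abs(c[:m]) > cutoff)[0]

# Hypothetical example: variables 0 and 3 are informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
y = X[:, 0] + 2 * X[:, 3] + 0.1 * rng.normal(size=40)
print(uve_pls(X, y))
```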
[Flowchart: add an artificial variable matrix with the same dimensions; employ Monte Carlo sampling and build a PLS model for each subset; calculate the reliability c_j of each regression coefficient from its mean and standard deviation; if c_j < max(c_arti), remove the variable, otherwise keep it; rebuild the PLS model on the retained variables.]
Figure 3.3 Procedure of Uninformative Variable Elimination with PLS
3.4 Partial Least Squares with Sensitivity Analysis
A novel variable selection algorithm proposed in [6], which combines partial least squares with sensitivity analysis [47] (PLS-SA), is investigated. The sensitivity of each variable is often expressed in terms of its regression coefficient in linear regression models; in PLS models, however, the calculated coefficients are a mixture of the original variables. Hence, an alternative measure of the sensitivity of an individual variable is proposed. In Rueda's work, the sensitivity of each variable is defined as the absolute maximum change in the PLS prediction (the maximum minus the minimum value) when the value of $x_j$ is varied over its allowable range and all other variables are kept constant at their mean/median values [6]. This measurement is referred to as $\Delta Y_{x_j}$.
In PLS-SA, the value of $\Delta Y_{x_j}$ is computed over the training set only; the remaining samples are used to measure the predictive power of the model. The relevance of each variable is determined by comparing its sensitivity to that of a random variable, RV. The effect of RV on the response variable should be insignificant, since it is random. In sensitivity analysis, an RV is added to the original dataset, and its sensitivity $\Delta Y_{RV}$ is then computed.
To balance comparison fairness against computational load, the extended data are divided into several subsets, and PLS models are built for each subset. The sensitivity values, along with their averages and standard deviations, are computed. A variable $x_j$ is found significant if inequality (3.6) is satisfied. The process is carried out in an iterative manner: in every iteration, the variables with sensitivity values below the sensitivity of the random variable are eliminated, and the predictive power of the new set of variables is determined. The variables are eliminated permanently only if the predictive power improves. The process stops when no more variables can be dropped from the model. The stepwise algorithm of PLS-SA is illustrated in Figure 3.4, and a code sketch of the sensitivity measure is given below.
$\overline{\Delta Y}_{x_j} - \sigma_{\Delta Y_{x_j}} > \overline{\Delta Y}_{RV} + \sigma_{\Delta Y_{RV}}$    (3.6)
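Here is a minimal sketch of this sensitivity measure in Python (assuming scikit-learn; the sweep resolution, the data, and the appended random variable are hypothetical):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def pls_sensitivity(pls, X, n_steps=20):
    """Sensitivity of each variable: the span (max - min) of the PLS prediction
    when one variable sweeps its observed range and all others are held at
    their training mean."""
    m = X.shape[1]
    base = X.mean(axis=0)
    sens = np.empty(m)
    for j in range(m):
        grid = np.tile(base, (n_steps, 1))
        grid[:, j] = np.linspace(X[:, j].min(), X[:, j].max(), n_steps)
        pred = pls.predict(grid).ravel()
        sens[j] = pred.max() - pred.min()
    return sens

# Hypothetical usage: compare every variable against an appended random variable (RV).
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 6))
y = 3 * X[:, 0] + X[:, 2] + 0.1 * rng.normal(size=80)
Xrv = np.hstack([X, rng.normal(size=(80, 1))])      # add RV as the last column
pls = PLSRegression(n_components=3).fit(Xrv, y)
s = pls_sensitivity(pls, Xrv)
print(s[:-1] > s[-1])   # True where a variable is more sensitive than RV
```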
[Flowchart: add a random variable (RV) to the original dataset; split the extended dataset into small subsets; compute PLS on each training set and predict Y on the validation set; calculate the sensitivity of each variable on the extended data including RV; compute the RMSE (or Q2) and the average sensitivity of each variable; for each regressor, keep it if its average sensitivity exceeds that of RV, otherwise drop it; iterate while variables are dropped and the RMSE improves.]
Figure 3.4 PLS-SA Algorithm
3.5 Competitive Adaptive Reweighted Sampling with Partial Least Squares
Hongdong Li et al. proposed a novel strategy based on the principle of "survival of the fittest", named competitive adaptive reweighted sampling (CARS) [21], [22]. This method utilizes the absolute values of the regression coefficients to evaluate the importance of the variables. In an iterative manner, subsets of variables are selected by CARS from $N$ Monte Carlo (MC) sampling runs. At the end, cross-validation is employed to evaluate each subset. The general procedure can be described as follows and is shown in Figure 3.7:
1. In each MC sampling run, a PLS model is built using 80–90% of randomly selected samples. The regression coefficients are normalized using Equation (3.7), where $m$ is the total number of variables:
$w_i = \dfrac{|b_i|}{\sum_{j=1}^{m} |b_j|}$    (3.7)
2. In CARS, an exponentially decreasing function (EDF) is introduced as in Equation (3.8). The EDF is utilized to forcibly eliminate variables with relatively small absolute regression coefficients. The ratio of variables retained in the $i$-th sampling run is calculated by Equations (3.8) to (3.10), where the constants $a$ and $k$ are fixed by the boundary conditions that all $m$ variables are kept in the first run ($r_1 = 1$) and only two are kept in the last of the $N$ runs ($r_N = 2/m$):
$r_i = a e^{-ki}$    (3.8)
$a = \left(\dfrac{m}{2}\right)^{\frac{1}{N-1}}$    (3.9)
$k = \dfrac{\ln(m/2)}{N-1}$    (3.10)
3. Adaptive reweighted sampling (ARS) follows the EDF-based reduction to further eliminate variables in a competitive way; in other words, variables with larger regression coefficients are selected with higher frequency.
Figure 3.5 Graphical illustration of the exponentially decreasing function. [Annotations: Stage 1, fast selection; Stage 2, refined selection.]
The EDF process in Step 2 is roughly divided into two stages. In the first stage, variables are eliminated rapidly, so it is called fast selection. In the second stage, variables are eliminated much more slowly, so it is called refined selection. An example of an EDF is shown in Figure 3.5. The EDF is therefore a very efficient mechanism for removing variables that carry little information.

The ARS in Step 3 mimics the "survival of the fittest" principle. The idea of ARS is illustrated in Figure 3.6. Three scenarios are considered: equal weights, small weight differences, and large weight differences.
Variable:           1     2     3     4     5    Sampled variables
Case 1 (weights):   0.20  0.20  0.20  0.20  0.20   2 1 3 4 5
Case 2 (weights):   0.30  0.30  0.20  0.10  0.10   1 1 2 3 2
Case 3 (weights):   0.40  0.05  0.40  0.10  0.05   1 3 3 3 1
Figure 3.6 Illustration of the adaptive reweighted sampling technique using five variables in three cases as an example. The variables with larger weights are selected with higher frequency.
[Figure 3.7 is a flowchart: employ the Monte Carlo sampling method; build PLS sub-models; implement the EDF to calculate the ratio of variables to keep; apply ARS to further eliminate variables; cross validation to evaluate each subset; end.]
Figure 3.7 General Procedure of CARS-PLS
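As an illustration of Steps 1 to 3, the following Python sketch performs one CARS sampling run under Equations (3.7)-(3.10) as reconstructed above; the function name, the 80% sampling fraction, and the use of scikit-learn's PLSRegression are assumptions made here for illustration, not the authors' code.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def cars_run(X, y, retained, i, N, rng, n_components=2):
    """One CARS sampling run: MC sampling, EDF-forced elimination, ARS.
    retained: integer indices of the currently surviving variables."""
    n, p = X.shape
    # Monte Carlo sampling: fit PLS on ~80% of randomly chosen samples
    idx = rng.choice(n, size=int(0.8 * n), replace=False)
    pls = PLSRegression(n_components=n_components)
    pls.fit(X[np.ix_(idx, retained)], y[idx])
    b = np.abs(pls.coef_).ravel()
    w = b / b.sum()                          # normalized weights, Eq. (3.7)
    # EDF: ratio of variables kept at run i out of N, Eqs. (3.8)-(3.10)
    a = (p / 2.0) ** (1.0 / (N - 1))
    k = np.log(p / 2.0) / (N - 1)
    n_keep = min(len(retained), max(2, int(round(a * np.exp(-k * i) * p))))
    # forced elimination: keep the n_keep variables with largest weights
    order = np.argsort(w)[::-1][:n_keep]
    # ARS: weighted sampling with replacement; variables with small weights
    # tend to disappear because duplicate draws collapse under np.unique
    picks = rng.choice(order, size=n_keep, replace=True,
                       p=w[order] / w[order].sum())
    return retained[np.unique(picks)]

Running cars_run for i = 1, ..., N and evaluating each returned subset by cross validation completes the procedure sketched in Figure 3.7.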
3.6 Partial Least Squares with Variable Importance in Projection
The variable importance in projection (VIP) score estimates the importance of each variable in the projection used in a PLS model. It was first published in [48]. The VIP score for the jth variable can be calculated using Equation (3.11), where p is the total number of variables and A is the number of retained PLS components; t_a is the ath column vector of the score matrix T, b_a is the ath element of the inner regression coefficient vector b, and w_a is the ath column vector of the weighting matrix W. The VIP score gives the weighted variability of variable j in the retained dimensions; it measures the contribution of each variable according to the variance explained by each PLS component [26]. The expression (w_{aj} / ||w_a||)^2 represents the importance of variable j in the ath PLS component, SS_a = b_a^2 t_a^T t_a is the variance of y explained by the ath PLS component, and the summation of SS_a in the denominator is the total variance explained by the PLS model with A components:

VIP_j = \sqrt{ p \sum_{a=1}^{A} SS_a (w_{aj} / ||w_a||)^2 / \sum_{a=1}^{A} SS_a }    (3.11)
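For reference, a compact Python sketch of Equation (3.11) is given below; it assumes the score, weight, and y-loading matrices exposed by scikit-learn's PLSRegression (x_scores_, x_weights_, y_loadings_), which is an illustrative choice rather than the software used in this work.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def vip_scores(pls, X):
    """VIP score of each variable, following Equation (3.11)."""
    T = pls.x_scores_                    # score matrix, columns t_a
    W = pls.x_weights_                   # weighting matrix, columns w_a
    q = pls.y_loadings_.ravel()          # inner coefficients b_a
    p = X.shape[1]
    # variance of y explained by each component: SS_a = b_a^2 t_a' t_a
    ss = (q ** 2) * np.sum(T ** 2, axis=0)
    wnorm2 = (W / np.linalg.norm(W, axis=0)) ** 2
    return np.sqrt(p * (wnorm2 @ ss) / ss.sum())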
A variable selection method based on VIP scores estimated by a PLS regression model is known as PLS-VIP. In general, the "greater than one" rule is used as the criterion for variable selection; in other words, only variables with VIP values greater than one are considered significant. However, Il-Gyo Chong et al. have suggested that a properly chosen cutoff value for VIP can improve the performance of PLS-VIP [4]. This value is defined by Equation (3.12):

v* = (1/2) ( \min\{ v : G(v) = \max_u G(u) \} + \max\{ v : G(v) = \max_u G(u) \} )    (3.12)

where v varies from 0.01 to 3 with increments of 0.01, and G, the geometric mean of sensitivity and specificity, is a function of v defined by Equation (3.13). Sensitivity is defined as the proportion of selected relevant predictors among the relevant predictors; specificity is the proportion of unselected irrelevant predictors among the irrelevant predictors. Both are calculated from the confusion matrix shown in Table 3.1. For every value of v chosen, the elements of the confusion matrix change, and therefore the sensitivity, specificity, and G change as well. The value of G ranges between 0 and 1, where 1 indicates that all the predictors are classified correctly. For every run/replication, the values of v that maximize G are identified, and the optimal cutoff value for VIP, v*, is obtained by averaging the identified v's using Equation (3.12).

G = \sqrt{ sensitivity \times specificity }    (3.13)

sensitivity = d / (c + d)    (3.14)

specificity = a / (a + b)    (3.15)
Table 3.1 Confusion Matrix and Descriptions of Its Entries

                                  Predicted classes
True classes                 Irrelevant predictor (IR)            Relevant predictor (R)
Irrelevant predictor (IR)    a: number of irrelevant predictors   b: number of irrelevant predictors
                             classified correctly                 classified incorrectly
Relevant predictor (R)       c: number of relevant predictors     d: number of relevant predictors
                             classified incorrectly               classified correctly
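The cutoff search of Equations (3.12)-(3.15) can be sketched in Python as follows; the tie tolerance used to collect all maximizing v values is an implementation detail introduced here, and the sketch is only usable where the relevant predictors are known.

import numpy as np

def optimal_cutoff(vip, relevant_mask):
    """Scan v over 0.01..3 and return the midpoint of the maximizing v's."""
    best_g, best_vs = -1.0, []
    for v in np.arange(0.01, 3.001, 0.01):
        selected = vip > v
        d = np.sum(selected & relevant_mask)      # relevant, selected
        c = np.sum(~selected & relevant_mask)     # relevant, missed
        a = np.sum(~selected & ~relevant_mask)    # irrelevant, excluded
        b = np.sum(selected & ~relevant_mask)     # irrelevant, selected
        sens = d / (c + d) if (c + d) else 0.0    # Eq. (3.14)
        spec = a / (a + b) if (a + b) else 0.0    # Eq. (3.15)
        g = np.sqrt(sens * spec)                  # Eq. (3.13)
        if g > best_g + 1e-12:
            best_g, best_vs = g, [v]
        elif abs(g - best_g) <= 1e-12:
            best_vs.append(v)
    # Eq. (3.12): midpoint of the smallest and largest maximizing v
    return 0.5 * (min(best_vs) + max(best_vs)), best_g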
The overall PLS-VIP procedure can be described as follows and is presented in Figure 3.8:
1. Build a PLS model using all the variables. Apply cross validation to determine the optimal number of PCs.
2. Calculate the VIP score for each variable using Equation (3.11).
3. Select the variables with VIP scores greater than the cutoff value.
4. Calculate G and the proper cutoff value using Equations (3.13) and (3.12), and repeat Step 3 with the new cutoff value. (Note: this step is only applicable for the simulated case study, where the relevant predictors are known.)
5. Rebuild the PLS model with only the retained variables.
6. Evaluate the model performance using different indexes.
[Figure 3.8 is a flowchart: build the PLS model; calculate VIP scores using the PLS model parameters; if VIP_j > cutoff, keep the variable, otherwise remove it; rebuild the PLS model on the retained variables; end.]
Figure 3.8 Procedure of PLS-VIP
3.7 Partial Least Squares with Regression Coefficients
Partial least squares with regression coefficients is a variable selection method
that is very similar to PLSVIP. It is also known as PLSBETA. The only difference is
PLSBETA utilizes the regression coefficients estimated by PLS regression instead of
VIP scores. The significant variables are selected according to the magnitude of the abso
lute values of the regression coefficients. The procedure of PLSBETA is illustrated in
Figure 3.9.
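A minimal sketch of this selection step, assuming autoscaled data and scikit-learn's PLSRegression (both assumptions made here for illustration), follows; the cutoff is left as a user choice, as noted in Table 5.1 later.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def pls_beta_select(X, y, cutoff, n_components=2):
    """Keep variables whose absolute PLS regression coefficient exceeds
    the cutoff, then refit on the retained variables. Assumes X and y
    are autoscaled so coefficient magnitudes are comparable, and that
    the cutoff leaves at least one variable."""
    pls = PLSRegression(n_components=n_components).fit(X, y)
    beta = np.abs(pls.coef_).ravel()
    keep = np.where(beta > cutoff)[0]
    reduced = PLSRegression(n_components=min(n_components, len(keep)))
    reduced.fit(X[:, keep], y)
    return keep, reduced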
26
B u i l d
P L S
M o d e l
O b t a i n t h e
a b s o l u t e
r e g r e s s i o n
c o e f f i c i e n t
I f B E T A
j
> c u t o f f
R e b u i l d P L S
m o d e l o n t h e
r e t a i n e d
v a r i a b l e s
K e e p t h e v a r i a b l e
R e m o v e t h e v a r i a b l e
Y E S
N O
E n d
Figure 3.9 Procedure of PLSBETA
27
Chapter 4. Variable Selection Methods with Their Application to Simulated and Industrial Datasets
4.1 Introduction
The characteristics of these seven variable selection methods will be illustrated using a simulated case study and an industrial case study. The results presented here are based on each method's rule-of-thumb choices of model parameters, for the purpose of comparing the baseline performance of the methods studied. Further tuning of the parameters can be done to optimize the performance of each model.
The simulation case is generated to mimic the typical characteristics of industrial data by considering four factors: the proportion of relevant predictors, the magnitude of correlation between predictors, the structure of the regression coefficients, and the magnitude of the signal-to-noise ratio. A detailed description of the data generation will be provided. The industrial case study focuses on process data from a polyester resin production plant. A brief specification of the plant will be included, followed by a discussion of the characteristics of batch processes and the necessary preprocessing steps. The results of variable selection on both the simulated and industrial case studies will then be compared. Two aspects are studied: the ability to correctly identify the relevant variables, and the prediction performance. The former is only applicable in the simulated case study, where the ground truth of the data is known; it is evaluated by the geometric mean of sensitivity and specificity discussed in Chapter 3. In this aspect, we also look at the consistency of the models produced by each variable selection method; in other words, the robustness of each method to data selection is explored. For both case studies, the data are permuted 100 times to generate different combinations of training and validation sets, and frequency plots of the selection of each variable are generated to assess the consistency of the models. The second aspect is also one of the most important factors in soft sensor development, since the ultimate goal is to improve the prediction performance of soft sensor schemes. Two performance metrics are considered to evaluate the prediction performance: the root mean square error (RMSE) and the mean absolute percentage error (MAPE):

RMSE = \sqrt{ (1/n) \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 }    (4.1)

MAPE = (100% / n) \sum_{i=1}^{n} | (y_i - \hat{y}_i) / y_i |    (4.2)
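For reference, Equations (4.1) and (4.2) correspond directly to the following small Python functions:

import numpy as np

def rmse(y, yhat):
    """Root mean square error, Eq. (4.1)."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return np.sqrt(np.mean((y - yhat) ** 2))

def mape(y, yhat):
    """Mean absolute percentage error, Eq. (4.2)."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return 100.0 * np.mean(np.abs((y - yhat) / y))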
4.2 Simulated Case Study
Four factors are considered in the data generation to mimic the characteristics of industrial data: the proportion of relevant predictors, the magnitude of correlation between predictors, the structure of the regression coefficients, and the magnitude of the signal-to-noise ratio. The dataset is generated following the linear model in Equation (4.3),

y = X \beta + \epsilon    (4.3)

where \epsilon is a normally distributed random error with zero mean and a specified standard deviation (described below). The data matrix X of 500 sample points is generated considering the four factors.

- For convenience, the number of relevant predictors is set to 10. The total number of predictors, p, in the data matrix X is varied in three levels, 20, 40 and 100, which yields different proportions of relevant predictors:

  proportion of relevant predictors = 10 / p    (4.4)

- The data matrix X is generated from a multivariate normal distribution with a zero mean vector and variance-covariance matrix \Sigma. The elements of \Sigma are a function of the magnitude of correlation between predictors, \rho, which is also varied in three levels, 0.5, 0.7 and 0.9:

  \Sigma_{ij} = \rho^{|i-j|}    (4.5)

Two types of coefficients, equal and unequal, are compared. Each type has two levels according to the locations of the relevant predictors: in the middle of the range and at the extremes. All the irrelevant predictors have zero coefficients in both types. For the case with 10 relevant predictors, the regression coefficients are generated as follows:

- Equal coefficients in the middle of the range: the 10 middle predictors share a common nonzero coefficient, and all others are zero. (4.6)
- Equal coefficients at the extremes: the first five and last five predictors share a common nonzero coefficient, and all others are zero. (4.7)
- Unequal coefficients in the middle of the range: the 10 middle predictors take distinct nonzero values covering a wide range, and all others are zero. (4.8)
- Unequal coefficients at the extremes: the same distinct values are assigned to the first five and last five predictors, and all others are zero. (4.9)
- The magnitude of the signal-to-noise ratio is introduced by manipulating the standard deviation of the error term \epsilon, where \gamma is the reciprocal of the signal-to-noise ratio; \gamma is varied in three levels as well, 0.33, 0.74 and 1.22:

  \epsilon ~ N( 0, ( \gamma \, \sigma_{X\beta} )^2 )    (4.10)
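A sketch of the generator is given below, under the reconstructions above; in particular, the \rho^{|i-j|} covariance structure and the scaling of the noise standard deviation by that of X\beta are assumptions, and only the equal-coefficients-in-the-middle scenario of (4.6) is shown.

import numpy as np

def simulate(p=40, n=500, rho=0.5, gamma=0.33, seed=0):
    """Generate one simulated dataset following Eqs. (4.3)-(4.10)."""
    rng = np.random.default_rng(seed)
    # covariance with correlation decaying as rho^|i-j|, Eq. (4.5)
    idx = np.arange(p)
    sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), sigma, size=n)
    # 10 equal coefficients placed in the middle of the range, Eq. (4.6)
    beta = np.zeros(p)
    start = (p - 10) // 2
    beta[start:start + 10] = 1.0
    signal = X @ beta
    # noise std set by gamma, the reciprocal signal-to-noise ratio, Eq. (4.10)
    y = signal + rng.normal(0.0, gamma * signal.std(), size=n)
    return X, y, beta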
4.2.1 Results
All seven variable selection methods are implemented on the simulated case study. The four parameters are varied one at a time while holding the others constant. The sensitivity results for these four parameters are summarized in Table 4.1 to Table 4.4, and each result is compared between the models before and after variable selection. PLS-BETA performs the best in sensitivity to the proportion of relevant predictors: it outperforms the other variable selection methods at all three levels of the total number of predictors, with improvements in the MAPE of the validation set running from 1% to 9%. PLS-VIP performs the best when the correlation is at its highest level, although the improvement is not significant; PLS-BETA yields the best performance at the lower two correlation levels, by 2-3% in MAPE. UVE-PLS gives the best performance in the case with unequal regression coefficients, with an improvement of 3% in MAPE. Surprisingly, CARS-PLS shows a performance improvement of 7% in terms of MAPE when the signal-to-noise ratio is at its lowest.
To visualize the results, they are also presented in Figure 4.1 to Figure 4.20. The performances are evaluated using the geometric mean of sensitivity and specificity (G), the root mean square error (RMSE), and the mean absolute percentage error (MAPE). The values plotted in Figure 4.5 to Figure 4.20 are the improvements relative to the full models: values higher than zero indicate improvement of the reduced models over the full model, and values lower than zero imply performance deterioration of the reduced models.

From Figure 4.1 to Figure 4.4, one can see that all the variable selection methods yield relatively high G values except PLS-SA and CARS-PLS. The sensitivities of the prediction performance of each variable selection method to the different data generation parameters are illustrated in Figure 4.5 to Figure 4.20; the results are presented as the percentage improvement in the average performance metrics, calculated by comparing the reduced models to their corresponding full model for each case. The results of the calibration models shown in Figure 4.5 to Figure 4.12 indicate no improvement from the models produced by the variable selection methods compared with the full models; for PLS-SA in particular, the performance deteriorates by 75% in RMSE and 77% in MAPE. The prediction performance of the validation models shown in Figure 4.13 to Figure 4.20 indicates that the reduced models improve on the full models, with the exception of PLS-SA. The performance of PLS-BETA worsens significantly when the regression coefficients are unequal: PLS-BETA only selects the variables with larger regression coefficients, so even if a variable is relevant to the primary variable, it is not selected by PLS-BETA when its coefficient is relatively small.
Table 4.1 Comparison of Sensitivity of Different Variable Selection Methods to Proportion of Relevant Predictors

                                     Training            Validation
Model   p     No. Sel    G          RMSE     MAPE       RMSE     MAPE
Full    20    20         -          1.5357   3.0418     1.5724   3.1222
        40    40         -          1.4802   3.0229     1.6177   3.3173
        100   100        -          1.3914   2.8289     1.7140   3.4910
SR      20    10 ± 1     0.9767     1.5463   3.0641     1.5613   3.1014
        40    11 ± 1     0.9784     1.5131   3.0884     1.5828   3.2434
        100   14 ± 2     0.9765     1.4920   3.0349     1.6052   3.2716
GA      20    12 ± 1     0.8721     1.5858   3.1446     1.6110   3.2032
        40    16 ± 2     0.8746     1.5510   3.1687     1.6417   3.3622
        100   26 ± 5     0.9008     1.4840   3.0181     1.6446   3.3517
UVE     20    12 ± 1     0.9989     1.5465   3.0639     1.5612   3.1021
        40    12 ± 1     0.9856     1.5199   3.1008     1.5758   3.2303
        100   12 ± 1     0.9947     1.5234   3.0991     1.5723   3.2036
SA      20    14 ± 2     0.4580     2.1686   4.3078     2.2198   4.4048
        40    23 ± 3     0.4720     2.5455   5.2190     2.6748   5.4793
        100   55 ± 6     0.4991     2.4337   4.9977     2.7491   5.6427
CARS    20    18 ± 4     0.1722     1.5381   3.0471     1.5700   3.1177
        40    20 ± 13    0.7057     1.5065   3.0738     1.5900   3.2584
        100   17 ± 17    0.9384     1.5024   3.0560     1.5933   3.2448
VIP     20    10 ± 0     0.9918     1.5807   3.1335     1.5930   3.1667
        40    11 ± 1     0.9874     1.5230   3.1075     1.5721   3.2219
        100   13 ± 1     0.9849     1.5247   3.1023     1.5710   3.2007
BETA    20    10 ± 0     1          1.5504   3.0732     1.5571   3.0936
        40    10 ± 0     1          1.5241   3.1097     1.5712   3.2204
        100   10 ± 0     1          1.5287   3.1097     1.5659   3.1893
Table 4.2 Comparison of Sensitivity of Different Variable Selection Methods to Magnitude of Correlation between Predictors

                                     Training            Validation
Model   rho   No. Sel    G          RMSE     MAPE       RMSE     MAPE
Full    0.5   40         -          1.4802   3.0229     1.6177   3.3173
        0.7   40         -          1.7618   3.0547     1.8912   3.2944
        0.9   40         -          2.1383   3.1542     2.2216   3.2872
SR      0.5   11 ± 1     0.9784     1.5131   3.0884     1.5828   3.2434
        0.7   11 ± 1     0.9836     1.7854   3.0947     1.8581   3.2348
        0.9   10 ± 1     0.9526     2.1409   3.1561     2.2252   3.2908
GA      0.5   16 ± 2     0.8746     1.5510   3.1687     1.6417   3.3622
        0.7   15 ± 3     0.8873     1.7979   3.1205     1.8937   3.2956
        0.9   15 ± 3     0.8344     2.1432   3.1594     2.2484   3.3246
UVE     0.5   12 ± 1     0.9856     1.5199   3.1008     1.5758   3.2303
        0.7   14 ± 2     0.9636     1.7882   3.0994     1.8568   3.2331
        0.9   28 ± 3     0.8452     2.1383   3.1511     2.2220   3.2879
SA      0.5   23 ± 3     0.4720     2.5455   5.2190     2.6748   5.4793
        0.7   28 ± 3     0.4566     2.2565   3.9257     2.3950   4.1815
        0.9   36 ± 3     0.2399     2.1798   3.2195     2.2625   3.3511
CARS    0.5   20 ± 13    0.7057     1.5065   3.0738     1.5900   3.2584
        0.7   20 ± 13    0.6731     1.7800   3.0856     1.8697   3.2553
        0.9   18 ± 12    0.7439     2.1441   3.1603     2.2220   3.2850
VIP     0.5   11 ± 1     0.9874     1.5230   3.1075     1.5721   3.2219
        0.7   13 ± 1     0.9536     1.7913   3.1040     1.8532   3.2277
        0.9   16 ± 1     0.9000     2.1412   3.1543     2.2087   3.2693
BETA    0.5   10 ± 0     1          1.5241   3.1097     1.5712   3.2204
        0.7   10 ± 0     0.9995     1.7959   3.1132     1.8492   3.2198
        0.9   10 ± 0     0.9851     2.1552   3.1772     2.2162   3.2759
Table 4.3 Comparison of Sensitivity of Different Variable Selection Methods to Regression Coefficient Structure
(EM = equal coefficients, middle; EE = equal, extremes; UM = unequal, middle; UE = unequal, extremes)

                                     Training            Validation
Model   Coeff.  No. Sel    G        RMSE     MAPE       RMSE     MAPE
Full    EM      40         -        1.6078   2.9768     1.7471   3.2427
        EE      40         -        1.4802   3.0229     1.6177   3.3173
        UM      40         -        21.80    3.0353     23.54    3.2896
        UE      40         -        18.50    3.0014     20.19    3.2909
SR      EM      11 ± 1     0.9779   1.6442   3.0432     1.7093   3.1696
        EE      11 ± 1     0.9784   1.5131   3.0884     1.5828   3.2434
        UM      9 ± 1      0.8830   22.34    3.1090     23.11    3.2238
        UE      9 ± 1      0.8861   18.93    3.0704     19.84    3.2307
GA      EM      16 ± 3     0.8865   1.6558   3.0641     1.7377   3.2224
        EE      16 ± 2     0.8746   1.5510   3.1687     1.6417   3.3622
        UM      14 ± 3     0.8065   22.52    3.1327     23.54    3.2850
        UE      14 ± 3     0.8091   19.20    3.1129     20.28    3.3039
UVE     EM      12 ± 1     0.9928   1.6502   3.0546     1.7021   3.1576
        EE      12 ± 1     0.9856   1.5199   3.1008     1.5758   3.2303
        UM      9 ± 2      0.9281   22.46    3.1262     22.95    3.2012
        UE      9 ± 1      0.8944   19.06    3.0897     19.70    3.2098
SA      EM      24 ± 3     0.4866   2.5117   4.6580     2.6385   4.8940
        EE      23 ± 3     0.4720   2.5455   5.2190     2.6748   5.4793
        UM      22 ± 3     0.4950   36.44    5.1019     38.03    5.3456
        UE      22 ± 3     0.4797   35.04    5.7135     36.74    6.0067
CARS    EM      19 ± 12    0.7196   1.6373   3.0301     1.7154   3.1844
        EE      20 ± 13    0.7057   1.5065   3.0738     1.5900   3.2584
        UM      16 ± 11    0.6912   22.40    3.1169     23.23    3.2568
        UE      17 ± 12    0.6584   19.00    3.0799     20.03    3.2644
VIP     EM      10 ± 1     0.9948   1.6557   3.0650     1.6963   3.1466
        EE      11 ± 1     0.9874   1.5230   3.1075     1.5721   3.2219
        UM      7 ± 1      0.8618   22.74    3.1654     23.06    3.2189
        UE      8 ± 1      0.9095   19.13    3.1034     19.71    3.2112
BETA    EM      10 ± 0     1        1.6563   3.0658     1.6957   3.1458
        EE      10 ± 0     1        1.5241   3.1097     1.5712   3.2204
        UM      5 ± 1      0.6977   25.09    3.4904     25.67    3.5808
        UE      5 ± 1      0.6932   22.23    3.6071     22.85    3.7205
Table 4.4 Comparison of Sensitivity of Different Variable Selection Methods to Magnitude of Signal to Noise Ratio

                                     Training            Validation
Model   gamma  No. Sel    G         RMSE     MAPE       RMSE     MAPE
Full    0.33   40         -         1.6078   2.9768     1.7471   3.2427
        0.74   40         -         3.6020   5.6126     3.9212   6.1351
        1.22   40         -         5.9373   7.3677     6.4675   8.0715
SR      0.33   11 ± 1     0.9779    1.6442   3.0432     1.7093   3.1696
        0.74   11 ± 1     0.9757    3.6872   5.7435     3.8349   5.9951
        1.22   9 ± 2      0.8775    6.1020   7.5759     6.3888   7.9648
GA      0.33   16 ± 3     0.8865    1.6558   3.0641     1.7377   3.2224
        0.74   16 ± 2     0.8781    3.6853   5.7405     3.8758   6.0561
        1.22   15 ± 3     0.8435    6.0636   7.5197     6.3973   7.9827
UVE     0.33   12 ± 1     0.9928    1.6502   3.0546     1.7021   3.1576
        0.74   11 ± 1     0.9928    3.6988   5.7612     3.8195   5.9728
        1.22   11 ± 2     0.9928    6.1032   7.5697     6.3022   7.8635
SA      0.33   24 ± 3     0.4866    2.5117   4.6580     2.6385   4.8940
        0.74   22 ± 3     0.4837    3.6020   5.6126     3.9212   6.1351
        1.22   21 ± 3     0.4939    5.9373   7.3677     6.4675   8.0715
CARS    0.33   19 ± 12    0.7196    1.6373   3.0301     1.7154   3.1844
        0.74   24 ± 13    0.5429    3.6630   5.7062     3.9015   6.1003
        1.22   20 ± 13    0.5948    6.0540   7.5116     6.4418   7.5116
VIP     0.33   10 ± 1     0.9948    1.6557   3.0650     1.6963   3.1466
        0.74   10 ± 1     0.9936    3.7122   5.7838     3.8046   5.9502
        1.22   10 ± 1     0.9902    6.1185   7.5934     6.2758   7.8297
BETA    0.33   10 ± 0     1         1.6563   3.0658     1.6957   3.1458
        0.74   9 ± 1      0.9596    3.7381   5.8259     3.8548   6.0291
        1.22   8 ± 1      0.8462    6.1565   7.6452     6.4054   7.9873
Figure 4.1 Sensitivity of Proportion of Relevant Predictors in Terms of Average G
Figure 4.2 Sensitivity of Magnitude of Correlation between Predictors in Terms of Average G
Figure 4.3 Sensitivity of Regression Coefficient Structure in Terms of Average G
Figure 4.4 Sensitivity of Magnitude of Signal to Noise Ratio in Terms of Average G
Figure 4.5 Sensitivity of Proportion of Relevant Predictors in Terms of Average RMSE in Training Set
Figure 4.6 Sensitivity of Magnitude of Correlation between Predictors in Terms of Average RMSE in Training Set
Figure 4.7 Sensitivity of Regression Coefficient Structure in Terms of Average RMSE in Training Set
Figure 4.8 Sensitivity of Magnitude of Signal to Noise Ratio in Terms of Average RMSE in Training Set
Figure 4.9 Sensitivity of Proportion of Relevant Predictors in Terms of Average MAPE in Training Set
Figure 4.10 Sensitivity of Magnitude of Correlation between Predictors in Terms of Average MAPE in Training Set
Figure 4.11 Sensitivity of Regression Coefficient Structure in Terms of Average MAPE in Training Set
Figure 4.12 Sensitivity of Magnitude of Signal to Noise Ratio in Terms of Average MAPE in Training Set
Figure 4.13 Sensitivity of Proportion of Relevant Predictors in Terms of Average RMSE in Validation Set
Figure 4.14 Sensitivity of Magnitude of Correlation between Predictors in Terms of Average RMSE in Validation Set
Figure 4.15 Sensitivity of Regression Coefficient Structure in Terms of Average RMSE in Validation Set
Figure 4.16 Sensitivity of Magnitude of Signal to Noise Ratio in Terms of Average RMSE in Validation Set
Figure 4.17 Sensitivity of Proportion of Relevant Predictors in Terms of Average MAPE in Validation Set
Figure 4.18 Sensitivity of Magnitude of Correlation between Predictors in Terms of Average MAPE in Validation Set
Figure 4.19 Sensitivity of Regression Coefficient Structure in Terms of Average MAPE in Validation Set
Figure 4.20 Sensitivity of Magnitude of Signal to Noise Ratio in Terms of Average MAPE in Validation Set
Figure 4.21 Frequency of Variables Selected by SR
Figure 4.22 Frequency of Variables Selected by SR
Figure 4.23 Frequency of Variables Selected by SR
Figure 4.24 Frequency of Variables Selected by SR
Figure 4.25 Frequency of Variables Selected by GA-PLS
Figure 4.26 Frequency of Variables Selected by GA-PLS
Figure 4.27 Frequency of Variables Selected by GA-PLS
Figure 4.28 Frequency of Variables Selected by GA-PLS
Figure 4.29 Frequency of Variables Selected by UVE-PLS
Figure 4.30 Frequency of Variables Selected by UVE-PLS
Figure 4.31 Frequency of Variables Selected by UVE-PLS
Figure 4.32 Frequency of Variables Selected by UVE-PLS
Figure 4.33 Frequency of Variables Selected by PLS-SA
Figure 4.34 Frequency of Variables Selected by PLS-SA
Figure 4.35 Frequency of Variables Selected by PLS-SA
Figure 4.36 Frequency of Variables Selected by PLS-SA
Figure 4.37 Frequency of Variables Selected by CARS-PLS
Figure 4.38 Frequency of Variables Selected by CARS-PLS
Figure 4.39 Frequency of Variables Selected by CARS-PLS
Figure 4.40 Frequency of Variables Selected by CARS-PLS
Figure 4.41 Frequency of Variables Selected by PLS-VIP
Figure 4.42 Frequency of Variables Selected by PLS-VIP
Figure 4.43 Frequency of Variables Selected by PLS-VIP
Figure 4.44 Frequency of Variables Selected by PLS-VIP
Figure 4.45 Frequency of Variables Selected by PLS-BETA
Figure 4.46 Frequency of Variables Selected by PLS-BETA
Figure 4.47 Frequency of Variables Selected by PLS-BETA
Figure 4.48 Frequency of Variables Selected by PLS-BETA
Another aspect of the investigation is to examine the models' consistency, i.e., if a variable is selected once, will it be selected in the next run? To check the models' consistency, frequency-of-selection plots are generated and illustrated in Figure 4.21 to Figure 4.48. As shown in Figure 4.21 to Figure 4.24, SR is able to correctly identify most of the relevant predictors with 100% frequency; however, it also selects irrelevant predictors at times. When the correlation between predictors increases and the signal-to-noise ratio decreases, SR no longer selects the relevant predictors with 100% frequency. None of the relevant predictors are selected by GA-PLS with 100% frequency, as shown in Figure 4.25 to Figure 4.28, and its performance worsens when the correlation between predictors and the reciprocal of the signal-to-noise ratio increase. Based on Figure 4.29 to Figure 4.32, UVE-PLS performs fairly well in the cases with unequal regression coefficients and low signal-to-noise ratio; however, UVE-PLS selects some irrelevant predictors nonrandomly when the correlation between predictors is at its higher level. From Figure 4.33 to Figure 4.36, one can see that PLS-SA selects both relevant and irrelevant predictors with the same frequency in all cases. From Figure 4.37, CARS-PLS selects almost all the variables when the proportion of relevant predictors is high, and it selects irrelevant predictors with around 30-40% frequency in all other cases, as can be seen in Figure 4.37 to Figure 4.40. PLS-VIP and PLS-BETA have the cleanest frequency plots: in many cases, most of the irrelevant predictors are selected with 0% frequency, and only a few irrelevant predictors are selected with very low frequency in the case with low signal-to-noise ratio.
4.2.2 Conclusion and Discussion
Based on the results of the simulated case study, PLS-VIP and PLS-BETA yield the best results among all seven variable selection methods studied. However, the performance of PLS-BETA does decay significantly when the regression coefficients are unequal and cover a wide range: PLS-BETA only selects the variables with large regression coefficients and discards the ones with smaller coefficients even though they are related to the primary variables, and the prediction performance deteriorates accordingly. In contrast, UVE-PLS performs the best in the case with unequal regression coefficients. The performance of SR is in the middle range among the seven variable selection methods; its consistency of selection is quite good, although irrelevant predictors are selected by SR at low frequency. The performance of GA-PLS is also in the middle range, with behavior similar to that of SR; however, in terms of computational effort, SR requires less computation time. CARS-PLS is sensitive to the proportion of relevant predictors: it selects all the irrelevant predictors with higher than 80% frequency when the proportion of relevant predictors is high, and it selects irrelevant predictors with around 30-40% frequency in all other cases. Among all the variable selection methods considered in this work, PLS-SA yields the worst performance on most training and validation sets, and its computation time is quite intensive compared to the other methods.

These conclusions are based purely on the results obtained from this simulated case study.
4.3 Industrial Case Study
This industrial dataset was obtained upon request from Dr. Barolo. The process is the production of polyester resin used in the manufacturing of coatings, via batch polycondensation between a diol and a long-chain dicarboxylic acid [49]. The main part of this plant is a 12 m3 stirred tank reactor, which is used for the production of different resins. Water is also formed in the polycondensation reaction as a byproduct; a packed distillation column, along with an external water-cooled condenser and a scrubber, is installed to remove the water. In addition, a vacuum pump is installed to maintain the vacuum in the reactor.

Several online measurement sensors are installed in the plant. Thirty-four variables are routinely measured online and recorded by a process computer every 30 seconds, and the number of samples ranges between 4500 and 7500 from batch to batch. These are process measurements (temperatures, pressures, valve openings, etc.) and controller settings (which are adjusted manually by the operators). A list of these thirty-four variables is shown in Table 4.5. The product quality measurements, acidity number and viscosity, are not measured online and are not available for the entire duration of the batch. The product samples are taken manually by the operators, so the sampling is uneven and infrequent: there are only 15 to 25 measurements available per batch. Thirty-three batches, collected over a 16-month period, are available. The autoscaled process data of one of the batches is shown in Figure 4.49, and the quality variables are plotted in Figure 4.50. More details about this process can be found in [49], [50].
Table 4.5 List of Process Variables Included in Polyester Resin Dataset

Online Monitored Variable               Column   MA-PLS
Date and time of the day                   1
Mixing rate (%)                            2       X
Mixing rate                                3       X
Mixing rate SP                             4
Vacuum line temperature (°C)               5       X
Inlet dowtherm temperature (°C)            6       X
Outlet dowtherm temperature (°C)           7       X
Reactor temperature (sensor 1) (°C)        8       X
(dummy)                                    9
Column head temperature (°C)              10       X
Valve V25 temperature (°C)                11
Scrubber top temperature (°C)             12       X
Inlet water temperature (°C)              13       X
Column bottom temperature (°C)            14       X
Scrubber bottom temperature (°C)          15       X
Reactor temperature (sensor 2) (°C)       16       X
Condenser inlet temperature (°C)          17       X
Valve V14 temperature (°C)                18       X
Valve V15 temperature (°C)                19       X
Reactor differential pressure             20       X
(dummy)                                   21
Column top temperature PV (°C)            22       X
Column top temperature SP (°C)            23
V42 way1 valve opening (%)                24       X
Inlet dowtherm temperature PV (°C)        25       X
Inlet dowtherm temperature SP (°C)        26
V42 way2 valve opening (%)                27       X
Reactor temperature PV (°C)               28       X
Reactor temperature SP (°C)               29
(dummy)                                   30
Valve V25 temperature PV (°C)             31
Valve V25 temperature SP (°C)             32
Valve V42 valve opening (%)               33       X
Reactor vacuum PV (mbar)                  34       X
Reactor vacuum SP (mbar)                  35
Figure 4.49 Visualization of Autoscaled Process Data from a Reference Batch in Polyester Production
Figure 4.50 Product Quality Variables from a Reference Batch in Polyester Production. (a) acidity number in g NaOH/g resin; (b) viscosity in poise.
4.3.1 Data Preprocessing
For a batch process, the data are stored in a three-dimensional array, X (I x J x K_i), as shown in Figure 4.51. Each row corresponds to one of the I batches, while each column contains one of the J variables; K_i is the total number of samples taken in the ith batch. This is one of the typical characteristics of a batch process: the batch duration is not fixed. Thus, a preprocessing step is required to synchronize the batch-to-batch durations. Three preprocessing methods are considered in this work:
1. Retain only the process samples taken when the quality variables are available; all process samples without corresponding quality measurements are eliminated.
2. Instead of eliminating those sample points, their average is taken and used as the soft sensor input.
3. Similar to the previous method, but the integral over time is taken as the soft sensor input.
The approach taken to unfold the three-way array is to preserve the direction of the variables [51]. The resulting matrix has dimensions N x J, where N = sum_i K_i. An earlier approach, proposed by Nomikos and MacGregor [52], [53], is to unfold the three-way array so that the batch direction is preserved, which results in a matrix with dimensions I x (J x K). Since variable selection is our purpose, the approach that preserves the direction of the variables is adopted; a small sketch of the two options follows.
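The sketch below assumes each batch is stored as a K_i x J array; this representation is an illustrative choice, not the storage format used in this work.

import numpy as np

def unfold_variable_wise(batches):
    """Variable-wise unfolding: stack batches on top of each other so the
    variable direction is preserved, giving a (sum of K_i) x J matrix.
    Batch lengths K_i may differ."""
    return np.vstack(batches)

# The batch-wise (Nomikos-MacGregor) alternative instead reshapes each
# batch into one long row, giving an I x (J*K) matrix; it requires equal
# batch durations K:
# unfolded = np.vstack([b.ravel() for b in batches])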
To provide a fair comparison, the batches are permuted 100 times before unfolding to generate different combinations of training and validation sets. In every permutation, the first 27 batches are used for training, and the remaining batches are used for validation. The visualization of the autoscaled process data using the first preprocessing method and the quality data from one of the permutation runs are shown in Figure 4.52 and Figure 4.53, respectively.
Figure 4.51 Illustration of Unfolding the Three-Dimensional Array to Preserve the Direction of Variables
Figure 4.52 Dynamic Parallel Coordinate Plot of Autoscaled Unfolded Process Data of a Permutation Run of the Polyester Resin Dataset
Figure 4.53 Product Quality Variables of a Permutation Run of Polyester Production. (a) acidity number in g NaOH/g resin; (b) viscosity in poise.
4.3.2 Results
The variable selection methods are applied individually to each quality variable. Only the training set is used for variable selection; models developed by these variable selection schemes are then validated using the validation set. The results for the three different preprocessing methods are presented in Table 4.6 to Table 4.8.

Based on the results obtained with the first preprocessing method, all the variable selection methods are able to identify a subset of variables that improves the model prediction performance, with the exception of PLS-SA. For the acidity number model, the best prediction performance is produced by PLS-VIP: over 100 permutation runs, models with an average of 13 variables are created, and the prediction performances on the external validation set are improved by 23.2% and 27.1% in RMSE and MAPE, respectively. For the calibration models, the best results are actually given by SR, with an average size of 12 variables; nevertheless, its results on the validation set are not as good as those of PLS-VIP, which indicates that the models produced by SR may tend to overfit the training data. For the viscosity model, the best performing model is again produced by PLS-VIP, with a model size of 14 variables on average; the prediction performances on the external validation set are improved by 28.1% and 23.3% in RMSE and MAPE, respectively. In addition, this is the only model that gives such superior performance; all the other methods improve the prediction performance by at most 6%.

The results of the acidity number and viscosity models from the second preprocessing method show that the best prediction performance is still given by PLS-VIP. The model size increases from 13 to 16 variables for the acidity number model and from 14 to 16 for the viscosity model. The prediction performances of the acidity number model are improved by 26% and 13% in RMSE and MAPE, respectively; for the viscosity model, the prediction performances are improved by 36.5% and 29.2% in RMSE and MAPE, respectively. SR again produces the best calibration models for both acidity number and viscosity with this preprocessing method.

Once again, the best prediction performance of the acidity number model with the third preprocessing method is provided by PLS-VIP, by 10% in RMSE and 8% in MAPE, with an average model size of only 9 variables. The results of the viscosity model differ from the previous two methods: the best prediction performance is actually provided by CARS-PLS, with an improvement of 13% in both RMSE and MAPE and a relatively small model of 9 variables. However, the standard deviation of the model size is almost half of the average value, which means CARS-PLS is sensitive to data selection. Furthermore, the prediction errors of the last method are almost double those of the previous ones.
Table 4.6 Comparison of Different Variable Selection Methods for Preprocessing Method 1

                                         Training            Validation
          Model       No. of Variables  RMSE     MAPE       RMSE     MAPE
Acidity   PLS         34                1.7031   24.4044    2.2175   32.9926
Number    Stepwise    12 ± 2            1.6166   22.1661    1.8587   26.4972
          GA-PLS      13 ± 2            1.6327   23.0392    1.8183   26.6951
          UVE-PLS     18 ± 2            1.6795   23.5050    2.0449   29.4195
          PLS-SA      27 ± 2            1.7752   25.1340    2.2777   33.8493
          CARS-PLS    8 ± 2             1.7402   23.9638    1.8849   27.2733
          PLS-VIP     13 ± 1            1.6653   22.5347    1.7021   24.0513
          PLS-BETA    16 ± 1            1.6737   22.7862    1.9947   28.0227
Viscosity PLS         34                0.6885   11.7796    1.0662   17.5609
          Stepwise    11 ± 2            0.6869   12.0458    1.0068   17.0644
          GA-PLS      11 ± 3            0.6936   11.6718    0.9951   16.5112
          UVE-PLS     18 ± 3            0.7475   13.2438    1.0368   17.4080
          PLS-SA      29 ± 2            0.7084   12.1746    1.0951   17.9966
          CARS-PLS    9 ± 4             0.7270   12.1278    1.0234   16.6214
          PLS-VIP     14 ± 1            0.7344   13.0173    0.7661   13.4600
          PLS-BETA    16 ± 2            0.6884   11.8257    1.0341   17.1061
Table 4.7 Comparison of Different Variable Selection Methods for Preprocessing Method 2

                                         Training            Validation
          Model       No. of Variables  RMSE     MAPE       RMSE     MAPE
Acidity   PLS         34                1.4498   22.8617    2.0488   29.1595
Number    Stepwise    11 ± 2            1.4200   22.5903    1.6558   25.9971
          GA-PLS      11 ± 2            1.4292   23.1971    1.5874   26.4053
          UVE-PLS     20 ± 1            1.4604   23.4277    1.8858   29.4339
          PLS-SA      28 ± 2            1.5008   23.7984    2.0197   29.8083
          CARS-PLS    9 ± 3             1.4511   23.7552    1.5500   25.6226
          PLS-VIP     16 ± 0            1.4605   23.8432    1.5245   25.3108
          PLS-BETA    17 ± 1            1.4406   22.6302    1.7180   26.9167
Viscosity PLS         34                0.7079   12.6513    1.1777   19.3988
          Stepwise    11 ± 2            0.6902   12.0718    1.0320   17.5915
          GA-PLS      10 ± 2            0.6965   12.2571    1.0391   17.5482
          UVE-PLS     21 ± 3            0.7196   13.0370    1.1227   18.7715
          PLS-SA      28 ± 2            0.7539   13.7273    1.2775   21.1849
          CARS-PLS    9 ± 5             0.7413   13.4055    1.0976   18.7016
          PLS-VIP     16 ± 1            0.7228   13.2627    0.7476   13.7396
          PLS-BETA    15 ± 2            0.6950   12.4778    1.0176   17.6031
Table 4.8 Comparison of Different Variable Selection Methods for Preprocessing Method 3

                                         Training            Validation
          Model       No. of Variables  RMSE     MAPE       RMSE     MAPE
Acidity   PLS         34                3.0584   53.1167    3.7931   57.7093
Number    Stepwise    14 ± 2            3.0806   53.1855    4.0081   58.4953
          GA-PLS      11 ± 2            2.9637   49.3168    4.0584   56.3860
          UVE-PLS     28 ± 1            3.1057   52.4875    4.5632   62.0594
          PLS-SA      32 ± 1            3.0722   53.4689    3.7646   57.7600
          CARS-PLS    12 ± 7            3.0216   49.0159    3.7139   53.9567
          PLS-VIP     9 ± 1             3.0576   50.6028    3.4143   52.9578
          PLS-BETA    14 ± 1            2.9966   51.0542    3.8230   56.2184
Viscosity PLS         34                1.7437   32.7257    2.2371   43.3408
          Stepwise    10 ± 2            1.6001   31.0598    2.4125   45.0485
          GA-PLS      11 ± 2            1.5842   30.7862    2.3106   43.2504
          UVE-PLS     27 ± 2            1.6804   31.7349    2.3896   44.1047
          PLS-SA      29 ± 5            1.7580   32.9440    2.1602   41.7313
          CARS-PLS    9 ± 5             1.5539   30.3764    1.9422   37.7724
          PLS-VIP     14 ± 1            1.7392   32.5318    1.9951   39.9705
          PLS-BETA    17 ± 1            1.6217   31.0373    2.1733   41.9445
Based on the results summarized in the tables above and in Figure 4.54 to Figure 4.57, the third preprocessing method yields the largest prediction error among the three, and the improvement from variable selection is least significant for it. The results from the first two methods are comparable: the second preprocessing method yields the best performance for the acidity number model, while the first preprocessing method performs the best for the viscosity model. The sizes of the models produced by the second preprocessing method are slightly larger than those of the first method; however, more significant improvement in prediction performance is observed with the second preprocessing method, and the computation times are equivalent. The results are also illustrated in Figure 4.58 to Figure 4.65, where the performance indicators are compared with the full model from each preprocessing method; positive values imply improvement in prediction performance and negative values imply deterioration.
Figure 4.54 Comparison of Acidity Number Full Models from Each Preprocessing Method in Terms of RMSE
Figure 4.55 Comparison of Acidity Number Full Models from Each Preprocessing Method in Terms of MAPE
Figure 4.56 Comparison of Viscosity Full Models from Each Preprocessing Method in Terms of RMSE
Figure 4.57 Comparison of Viscosity Full Models from Each Preprocessing Method in Terms of MAPE
Figure 4.58 Comparison of Different Preprocessing Methods for the Acidity Number Model in Terms of RMSE in Training Set
Figure 4.59 Comparison of Different Preprocessing Methods for the Acidity Number Model in Terms of MAPE in Training Set
Figure 4.60 Comparison of Different Preprocessing Methods for the Acidity Number Model in Terms of RMSE in Validation Set
Figure 4.61 Comparison of Different Preprocessing Methods for the Acidity Number Model in Terms of MAPE in Validation Set
Figure 4.62 Comparison of Different Preprocessing Methods for the Viscosity Model in Terms of RMSE in Training Set
Figure 4.63 Comparison of Different Preprocessing Methods for the Viscosity Model in Terms of MAPE in Training Set
Figure 4.64 Comparison of Different Preprocessing Methods for the Viscosity Model in Terms of RMSE in Validation Set
Figure 4.65 Comparison of Different Preprocessing Methods for the Viscosity Model in Terms of MAPE in Validation Set
Figure 4.66 Selection Frequency of SR in Acidity Number Model
Figure 4.67 Selection Frequency of SR in Viscosity Model
Figure 4.68 Selection Frequency of GA in Acidity Number Model
Figure 4.69 Selection Frequency of GA in Viscosity Model
Figure 4.70 Selection Frequency of UVE in Acidity Number Model
Figure 4.71 Selection Frequency of UVE in Viscosity Model
Figure 4.72 Selection Frequency of PLS-SA in Acidity Number Model
Figure 4.73 Selection Frequency of PLS-SA in Viscosity Model
Figure 4.74 Selection Frequency of CARS-PLS in Acidity Number Model
Figure 4.75 Selection Frequency of CARS-PLS in Viscosity Model
Figure 4.76 Selection Frequency of PLS-VIP in Acidity Number Model
Figure 4.77 Selection Frequency of PLS-VIP in Viscosity Model
Figure 4.78 Selection Frequency of PLS-BETA in Acidity Number Model
Figure 4.79 Selection Frequency of PLS-BETA in Viscosity Model
The consistency of variable selection is also studied for the industrial case. The frequency plots are presented in Figure 4.66 through Figure 4.79, with the frequency of selection shown as a percentage. As the results show, PLS-VIP is the most consistent among the seven variable selection methods: variables are either selected with extremely high frequency or not selected at all, and only a few variables are selected in the lower or middle range. The consistency of PLS-BETA is better for the acidity number model than for the viscosity model, and the models produced by PLS-BETA are generally larger than those of PLS-VIP. CARS-PLS produces the smallest models, but its consistency is unacceptable: only a few variables are selected with high frequency, and many variables are selected with frequencies in the lower range. This agrees with the finding from the simulated case study that CARS-PLS is sensitive to data selection. The performances of SR and GA-PLS are in the middle range. UVE-PLS produces the models with the second largest size, and its frequency of selection is relatively consistent. PLS-SA generates the largest models, and the prediction performance actually worsens after variable selection by PLS-SA; it is the only method that yields worse performance than the original model.
4.3.3 Conclusion and Discussion
According to the analysis of the results obtained from the three preprocessing methods, the first two preprocessing methods should be adopted. Their performances are very competitive, while the third method is not comparable: its prediction errors are roughly double those of the first two preprocessing methods, and the improvement after variable selection is not as significant as for the other two.

Based on the results obtained from the industrial case study, PLS-VIP yields the best performance in terms of both prediction and selection consistency. For the first two preprocessing methods, PLS-VIP outperforms the other variable selection methods for both the acidity number model and the viscosity model. Even though CARS-PLS gives the best model with the third preprocessing method, due to its inconsistency, CARS-PLS should be applied with care.
Chapter 5. Conclusions and Future Work

5.1 Conclusions

The goal of this project is to implement variable selection algorithms in data-driven soft sensors to improve their predictive power. Seven variable selection methods are investigated: stepwise regression (SR), genetic algorithm (GA) with PLS, uninformative variable elimination (UVE) with PLS, PLS with sensitivity analysis (SA), competitive adaptive reweighted sampling (CARS) with PLS, and PLS with variable importance in projection (VIP) and with regression coefficients (BETA). The characteristics of these methods are explored using a simulated case study and an industrial case study.

Based on the analysis results, PLS-VIP gives the best performance in both the simulated and the industrial case. PLS-VIP is very straightforward: it selects the relevant predictors based on their importance in the PLS projection. It is shown that the sub-models produced by PLS-VIP outperform the other methods significantly, especially in the industrial case study. Furthermore, the models produced by PLS-VIP are very consistent from one sampling run to another, which shows its robustness to data selection. On the other hand, CARS-PLS is quite sensitive to data selection: the standard deviations of the model sizes produced by CARS-PLS are much larger than those of the other methods. Next in line are PLS-BETA and SR. PLS-BETA performs very well when the contributions of the relevant predictors are in the same range; however, this is not always the case in industrial processes. PLS-BETA tends to select only the variables with dominating contributions, which may oversimplify the model and cause overfitting. SR also has issues with overfitting: from the results shown in the industrial case study, SR always gives the best calibration models, but its prediction performance on the external validation set is not ideal. GA-PLS yields results similar to SR; however, the computation time of GA-PLS is much longer than that of SR, and GA-PLS requires more tedious preliminary setting of the GA parameters. UVE-PLS and PLS-SA generate models of relatively large size; nonetheless, UVE-PLS is able to identify a subset of variables that improves the prediction performance, whereas the sub-models produced by PLS-SA worsen it. The strengths and limitations of each method are summarized in Table 5.1.
Table 5.1 Limitations and Strengths of Each Variable Selection Method

Model      Pros                                            Cons
SR         Produces high-performance training models;      Developed models tend to overfit
           relatively consistent selection                 the data
GA-PLS     Performance can be improved by tuning           Requires substantial user input to
           the parameters                                  optimize performance; selects
                                                           irrelevant predictors; heavy
                                                           computation load
UVE-PLS    Relatively consistent selection;                Large model size
           improvement observed in prediction
           performance
PLS-SA     Consistent selection                            Low prediction performance; heavy
                                                           computation load
CARS-PLS   Usable as a preliminary variable                Inconsistent selection
           reduction step in wavelength selection
PLS-VIP    Consistent selection; high prediction           May select some irrelevant
           performance; least computation load             predictors around the relevant ones
PLS-BETA   Consistent selection; high prediction           User input required for the cutoff
           performance; low computation load               value of BETA
5.2 Future Work

Variable reduction can be carried out prior to variable selection based on two rules: elimination of variables with zero variance, and elimination of highly correlated variables. The variable selection methods can also be improved by applying the modeling power approach, which balances the predictive and descriptive abilities of the model.

Application to wavelength selection is also of interest. The next step of this research is to implement these seven variable selection methods on a benchmark dataset, the NIR spectra of diesel fuel.

The ultimate goal of our study is to implement variable selection in the framework of Statistics Pattern Analysis (SPA). Due to the characteristics of SPA, it is very likely that the number of regressors will be greater than the number of samples. Variable selection could be implemented to eliminate uninformative variables prior to SPA; in addition, variable selection can also be employed to select useful statistics in statistics pattern generation.
Bibliography
[1] C. M. Andersen and R. Bro, "Variable selection in regression - a tutorial," Journal of Chemometrics, vol. 24, no. 11-12, pp. 728-737, 2010.
[2] J. Reunanen, "Overfitting in making comparisons between variable selection methods," The Journal of Machine Learning Research, vol. 3, pp. 1371-1382, 2003.
[3] M.-D. Ma et al., "Development of adaptive soft sensor based on statistical identification of key variables," Control Engineering Practice, vol. 17, no. 9, pp. 1026-1034, Sep. 2009.
[4] I.-G. Chong and C.-H. Jun, "Performance of some variable selection methods when multicollinearity is present," Chemometrics and Intelligent Laboratory Systems, vol. 78, no. 1-2, pp. 103-112, Jul. 2005.
[5] V. Centner, D. L. Massart, O. E. de Noord, S. de Jong, B. M. Vandeginste, and C. Sterna, "Elimination of uninformative variables for multivariate calibration," Analytical Chemistry, vol. 68, no. 21, pp. 3851-3858, Nov. 1996.
[6] F. A. Arciniegas, M. Embrechts, and I. E. A. Rueda, "Variable Selection with Partial Least Squares Sensitivity Analysis: An Application to Currency Crises' Real Effects," SSRN eLibrary, Jun. 2006.
[7] L. H. Chiang and R. J. Pell, "Genetic algorithms combined with discriminant analysis for key variable identification," Analytical Sciences, vol. 14, pp. 143-155, 2004.
[8] M. Forina, S. Lanteri, M. Casale, and M. C. Cerrato Oliveros, "Stepwise orthogonalization of predictors in classification and regression techniques: An 'old' technique revisited," Chemometrics and Intelligent Laboratory Systems, vol. 87, no. 2, pp. 252-261, Jun. 2007.
[9] J.-P. Gauchi and P. Chagnon, "Comparison of selection methods of explanatory variables in PLS regression with application to manufacturing process data," Chemometrics and Intelligent Laboratory Systems, vol. 58, no. 2, pp. 171-193, Oct. 2001.
[10] D. Broadhurst, R. Goodacre, A. Jones, J. J. Rowland, and D. B. Kell, "Genetic algorithms as a method for variable selection in multiple linear regression and partial least squares regression, with applications to pyrolysis mass spectrometry," Analytica Chimica Acta, vol. 348, no. 1-3, pp. 71-86, Aug. 1997.
[11] M. J. Arcos, M. C. Ortiz, B. Villahoz, and L. A. Sarabia, "Genetic-algorithm-based wavelength selection in multicomponent spectrometric determinations by PLS: application on indomethacin and acemethacin mixture," 1997.
[12] H. Kaneko and K. Funatsu, "A New Process Variable and Dynamics Selection Method Based on a Genetic Algorithm-Based Wavelength Selection Method," vol. 58, no. 6, 2012.
[13] G. Jones, P. Willett, and R. Glen, "Genetic algorithms for chemical structure handling and molecular recognition," in Genetic Algorithms in Molecular Modeling, 1996, pp. 211-242.
[14] W. Cai, Y. Li, and X. Shao, "A variable selection method based on uninformative variable elimination for multivariate calibration of near-infrared spectra," Chemometrics and Intelligent Laboratory Systems, vol. 90, no. 2, pp. 188-194, Feb. 2008.
[15] X. Shao, F. Wang, D. Chen, and Q. Su, "A method for near-infrared spectral calibration of complex plant samples with wavelet transform and elimination of uninformative variables," Analytical and Bioanalytical Chemistry, vol. 378, no. 5, pp. 1382-1387, 2004.
[16] J. Koshoubu, T. Iwata, and S. Minami, "Application of the Modified UVE-PLS Method for a Mid-Infrared Absorption Spectral Data Set of Water-Ethanol Mixtures," Applied Spectroscopy, vol. 54, no. 1, pp. 148-152, Jan. 2000.
[17] J. Koshoubu, T. Iwata, and S. Minami, "Elimination of the uninformative calibration sample subset in the modified UVE (Uninformative Variable Elimination)-PLS (Partial Least Squares) method," Analytical Sciences: the international journal of the Japan Society for Analytical Chemistry, vol. 17, no. 2, pp. 319-322, Feb. 2001.
[18] S. Ye, D. Wang, and S. Min, "Successive projections algorithm combined with uninformative variable elimination for spectral variable selection," Chemometrics and Intelligent Laboratory Systems, vol. 91, no. 2, pp. 194-199, Apr. 2008.
[19] E. Zamprogna, M. Barolo, and D. E. Seborg, "Optimal selection of soft sensor inputs for batch distillation columns using principal component analysis," Journal of Process Control, vol. 15, no. 1, pp. 39-52, Feb. 2005.
[20] Q. Li and C. Shao, "Soft sensing modelling based on optimal selection of secondary variables and its application," International Journal of Automation and Computing, vol. 6, no. 4, pp. 379-384, Oct. 2009.
[21] H. Li, Y. Liang, Q. Xu, and D. Cao, "Key wavelengths screening using competitive adaptive reweighted sampling method for multivariate calibration," Analytica Chimica Acta, vol. 648, no. 1, pp. 77-84, Aug. 2009.
[22] H.-D. Li, Y.-Z. Liang, Q.-S. Xu, and D.-S. Cao, "Model population analysis for variable selection," Journal of Chemometrics, vol. 24, no. 7-8, pp. 418-423, Jul. 2010.
[23] H.-D. Li, Y.-Z. Liang, and Q.-S. Xu, "Model Population Analysis for Statistical Model Comparison," no. 1, pp. 3-21.
[24] T. Mehmood, H. Martens, S. Sæbø, J. Warringer, and L. Snipen, "A Partial Least Squares based algorithm for parsimonious variable selection," Algorithms for Molecular Biology: AMB, vol. 6, no. 1, p. 27, Jan. 2011.
[25] F. Lindgren, B. Hansen, and W. Karcher, "Model validation by permutation tests," Journal of Chemometrics, vol. 10, pp. 521-532, 1996.
[26] R. Gosselin, D. Rodrigue, and C. Duchesne, "A Bootstrap-VIP approach for selecting wavelength intervals in spectral imaging applications," Chemometrics and Intelligent Laboratory Systems, vol. 100, no. 1, pp. 12-21, Jan. 2010.
[27] C. M. Andersen and R. Bro, "Variable selection in regression - a tutorial," Journal of Chemometrics, vol. 24, no. 11-12, pp. 728-737, Nov. 2010.
[28] P. P. Roy and K. Roy, "On Some Aspects of Variable Selection for Partial Least Squares Regression Models," QSAR & Combinatorial Science, vol. 27, no. 3, pp. 302-313, 2008.
[29] D. Wang and R. Srinivasan, "Data-Driven Soft Sensor Approach for Quality Prediction in a Refining Process," IEEE Transactions on Industrial Informatics, vol. 6, no. 1, pp. 11-17, Feb. 2010.
[30] P. Kadlec, B. Gabrys, and S. Strandt, "Data-driven Soft Sensors in the process industry," Computers & Chemical Engineering, vol. 33, no. 4, pp. 795-814, Apr. 2009.
[31] P. Kadlec, R. Grbić, and B. Gabrys, "Review of adaptation mechanisms for data-driven soft sensors," Computers & Chemical Engineering, vol. 35, no. 1, pp. 1-24, Jan. 2011.
[32] P. Kadlec and B. Gabrys, "Adaptive Local Learning Soft Sensor for Inferential Control Support," 2008, pp. 243-248.
[33] I. T. Jolliffe, Principal Component Analysis. Springer, 2002.
[34] S. Wold, M. Sjöström, and L. Eriksson, "PLS-regression: a basic tool of chemometrics," Chemometrics and Intelligent Laboratory Systems, vol. 58, pp. 109-130, 2001.
[35] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer.
[36] C. Lin and C. Lee, Neural Fuzzy Systems: A Neuro-Fuzzy Synergism to Intelligent Systems. Upper Saddle River: Prentice-Hall Inc., 1996.
[37] V. N. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[38] P. Geladi and B. R. Kowalski, "Partial least-squares regression: a tutorial," Analytica Chimica Acta, vol. 185, pp. 1-17, 1986.
[39] S. Wold, H. Martens, and H. Russwurm Jr., Food Research and Data Analysis. London: Applied Science Publishers, 1983.
[40] S. Wold and B. Kowalski, Chemometrics: Mathematics and Statistics in Chemistry. Dordrecht: Reidel, 1984.
[41] R. Leardi and A. Lupiáñez González, "Genetic algorithms applied to feature selection in PLS regression: how and when to use them," Chemometrics and Intelligent Laboratory Systems, vol. 41, no. 2, pp. 195-207, Jul. 1998.
[42] D. Broadhurst, J. J. Rowland, and D. B. Kell, "Genetic algorithms as a method for variable selection in multiple linear regression and partial least squares regression, with applications to pyrolysis mass spectrometry," Analytica Chimica Acta, vol. 348, pp. 71-86, 1997.
[43] L. Davis, Genetic Algorithms and Simulated Annealing. 1987.
[44] D. E. Goldberg, Genetic Algorithms in Search, Optimization & Machine Learning. Addison-Wesley, 1988.
[45] D. E. Goldberg and J. H. Holland, "Genetic Algorithms and Machine Learning," in Machine Learning, vol. 3, Kluwer Academic Publishers, 1988, pp. 95-99.
[46] R. Leardi, R. Boggia, and M. Terrile, "Genetic algorithms as a strategy for feature selection," Journal of Chemometrics, vol. 6, no. 5, pp. 267-281, Sep. 1992.
[47] R. H. Kewley, M. J. Embrechts, and C. Breneman, "Data strip mining for the virtual design of pharmaceuticals with neural networks," IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 668-679, Jan. 2000.
[48] S. Wold, E. Johansson, and M. Cocchi, "PLS - Partial Least Squares Projections to Latent Structures," in 3D QSAR in Drug Design: Theory, Methods and Applications, ESCOM, 1993, pp. 523-550.
[49] P. Facco, F. Doplicher, F. Bezzo, and M. Barolo, "Moving average PLS soft sensor for online product quality estimation in an industrial batch polymerization process," Journal of Process Control, vol. 19, no. 3, pp. 520-529, Mar. 2009.
[50] P. Facco, F. Bezzo, and M. Barolo, "Nearest-Neighbor Method for the Automatic Maintenance of Multivariate Statistical Soft Sensors in Batch Processing," Industrial & Engineering Chemistry Research, vol. 49, no. 5, pp. 2336-2347, Mar. 2010.
[51] S. Wold, N. Kettaneh, H. Fridén, and A. Holmberg, "Modelling and diagnostics of batch processes and analogous kinetic experiments," Chemometrics and Intelligent Laboratory Systems, vol. 44, no. 1-2, pp. 331-340, Dec. 1998.
[52] P. Nomikos and J. F. MacGregor, "Multivariate SPC Charts for Monitoring Batch Processes," Technometrics, vol. 37, no. 1, pp. 41-59, Feb. 1995.
[53] P. Nomikos and J. F. MacGregor, "Multi-way partial least squares in monitoring batch processes," Chemometrics and Intelligent Laboratory Systems, vol. 30, no. 1, pp. 97-108, Nov. 1995.