Robust Partial Least Squares For Regression and Classification
Except where reference is made to the work of others, the work described in this
dissertation is my own or was done in collaboration with my advisory committee. This
dissertation does not include proprietary or classified information.
Asuman Seda Turkmen
Certificate of Approval:
Asheber Abebe
Associate Professor
Mathematics and Statistics
Nedret Billor, Chair
Associate Professor
Mathematics and Statistics
Mark Carpenter
Associate Professor
Mathematics and Statistics
Chris Rodger
Professor
Mathematics and Statistics
George T. Flowers
Interim Dean
Graduate School
Robust Partial Least Squares For Regression and Classification
Asuman Seda Turkmen
A Dissertation
Submitted to
the Graduate Faculty of
Auburn University
in Partial Fulfillment of the
Requirements for the
Degree of
Doctor of Philosophy
Auburn, Alabama
August 9, 2008
Robust Partial Least Squares For Regression and Classification
Asuman Seda Turkmen
Permission is granted to Auburn University to make copies of this dissertation at its
discretion, upon the request of individuals or institutions and at
their expense. The author reserves all publication rights.
Signature of Author
Date of Graduation
Vita
Asuman Seda Türkmen, daughter of Ümit Peyda Topçubaşı and Vicdan Topçubaşı,
was born July 24, 1977, in Adana, Turkey. She graduated with honors from Çukurova
University in 1999 with a Bachelor of Science degree in Mathematics. After graduation, she
continued her graduate studies in the same department from which she received her Master
of Science degree in Statistics in 2001. She came to Auburn University in August 2003 and
she entered the doctoral program in the Department of Mathematics and Statistics where
she completed her Ph.D. in Statistics in 2008.
Dissertation Abstract
Robust Partial Least Squares For Regression and Classification
Asuman Seda Turkmen
Doctor of Philosophy, August 9, 2008
(M.A., Çukurova University, 2002)
(B.S., Çukurova University, 1999)
133 Typed Pages
Directed by Nedret Billor
Partial Least Squares (PLS) is a class of methods for modeling relations between sets
of observed variables by means of latent variables where the explanatory variables are
highly collinear and where they outnumber the observations. In general, PLS methods aim
to derive orthogonal components using the cross-covariance matrix between the response
variable(s) and the explanatory variables, a quantity that is known to be affected by unusual
observations (outliers) in the data set. In this study, robustified versions of PLS methods,
for regression and classification, are introduced.
For regression with a quantitative response, a robust PLS regression method (RoPLS),
based on weights calculated by the BACON or PCOUT algorithm, is proposed. A robust criterion
is suggested to determine the optimal number of PLS components, which is an important
issue in building a PLS regression model. In addition, diagnostic plots are constructed to
visualize and classify outliers. Robustness of the proposed method, RoPLS, is studied in
detail. The influence function for the RoPLS estimator is derived for low dimensional data and
empirical robustness properties are provided for high dimensional data.
PLS was originally designed for regression problems with a quantitative response;
however, it is also used as a classification technique where the response variable is qualitative.
Although several robust PLS methods have been proposed for regression problems, to our
knowledge, there has been no study on the robustness of PLS classification methods. In
this study, the effect of outliers on existing PLS classification methods is investigated and
a new robust PLS algorithm (RoCPLS) for classification is introduced.
The performances of the proposed methods, RoPLS and RoCPLS, are assessed
by employing several benchmark data sets and extensive simulation experiments.
Acknowledgments
I would like to express my sincere appreciation to my advisor and mentor, Dr. Nedret
Billor, who has always provided her enthusiastic assistance, persistent encouragement and
help throughout my graduate education. Words cannot express how thankful I am for her
support in every aspect of my academic development. Without her and her continuous
guidance, I never would have seen the end of this study.
Special thanks go to Dr. Asheber Abebe for his valuable suggestions and opinions. I
am also grateful to Dr. Mark Carpenter and Dr. Chris Rodger for kindly consenting to
serve on the dissertation committee. Much-deserved gratitude is expressed to my parents
and family for their endless love and support throughout my life. I am grateful for the
loving kindness of my husband, Özgür Türkmen, for his patience, love, and support; he has
been my true joy in this journey together. Finally, my sincere appreciation is also extended
to all of the people who have always been there for me whenever I needed them.
I dedicate this dissertation to the memory of my grandparents: Güzide & Mustafa
Topçubaşı, Rabia & Lokman Karagöl, who truly believed in me.
Style manual or journal used Journal of Approximation Theory (together with the style
known as "aums"). Bibliography follows van Leunen's A Handbook for Scholars.
Computer software used The document preparation package TeX (specifically LaTeX)
together with the departmental style-file aums.sty.
Table of Contents
List of Figures xi
1 Introduction 1
2 An Overview on Partial Least Squares 5
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Partial Least Squares Regression (PLSR) . . . . . . . . . . . . . . . . . . . 6
2.2.1 NIPALS Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 SIMPLS Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Determining the Optimal Number of Components in PLSR . . . . . . . . . 13
2.4 PLS Regression Among Other Biased Methods . . . . . . . . . . . . . . . . 16
2.5 Statistical Properties of the PLSR Estimator . . . . . . . . . . . . . . . . . 21
2.5.1 Shrinkage Properties of the PLSR Estimator . . . . . . . . . . . . . 22
2.5.2 The Asymptotic Variance of the PLSR Estimator . . . . . . . . . . . 28
2.5.3 Consistency of the PLSR Estimator . . . . . . . . . . . . . . . . . . 30
3 RoPLS: Robust Partial Least Squares Regression 32
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Outlier Detection Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.1 BACON . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.2 PCOUT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Description of the Proposed Algorithm: RoPLS . . . . . . . . . . . . . . . . 39
3.3.1 RoPLS Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3.2 Selecting Number of Components . . . . . . . . . . . . . . . . . . . . 43
3.3.3 Diagnostic Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4 Numerical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.1 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.2 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4 Robustness Properties of RoPLS Estimator 60
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Influence Function for the SIMPLS Based Regression Estimator . . . . . . . 63
4.3 Robustness of the RoPLS Estimator of β . . . . . . . . . . . . . . . . . . . 70
4.3.1 Influence Functions for Low Dimension . . . . . . . . . . . . . . . . . 72
4.3.2 Empirical Influence Function for High Dimension . . . . . . . . . . . 85
4.3.3 Finite-Sample Breakdown Properties of RoPLS1 Estimator . . . . . 87
5 RoCPLS: Robust Partial Least Squares Classification 89
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.2 Classification Methods Based on Dimension Reduction . . . . . . . . . . . . 94
5.3 Description of the Proposed Algorithm: RoCPLS . . . . . . . . . . . . . . . 98
5.4 Numerical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.4.1 Low Dimension: Wine Recognition Data . . . . . . . . . . . . . . . . 101
5.4.2 High Dimension: Colon Data . . . . . . . . . . . . . . . . . . . . . . 104
5.4.3 High Dimension: Leukemia Data . . . . . . . . . . . . . . . . . . . . 105
6 Conclusions and Future Work 111
Bibliography 114
Appendix 121
List of Figures
2.1 Scatter plots of y versus PC1, PC2, PC3, PC4. . . . . . . . . . . . . . . . . 20
2.2 Scatter plots of ŷ versus y using PCR with k=2 (left) and PLSR with k=2
(right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1 Types of outliers when 3 dimensional data set X is projected on 2 dimensional
PLS-subspace (left) and corresponding Orthogonal-Score distance plot (right). 47
3.2 Diagnostic plots for original Tobacco data. . . . . . . . . . . . . . . . . . . . 51
3.3 Diagnostic plots for contaminated Tobacco data. . . . . . . . . . . . . . . . 52
3.4 Boxplots of norms (left) and angles (right) between slope estimates before
and after 12%, 20% and 52% contamination. . . . . . . . . . . . . . . . . . 53
3.5 The scatter plot of the first two columns of Fish data. . . . . . . . . . . . . 54
3.6 RRMSE (left) and RR2 (right) index plot for Fish data. . . . . . . . . . . 55
3.7 Diagnostic plots for Fish data (RoPLS1). . . . . . . . . . . . . . . . . . . . 56
3.8 Diagnostic plots for Fish data (SIMPLS). . . . . . . . . . . . . . . . . . . . 56
3.9 (a) Original Biscuit-dough data (b) Preprocessed Biscuit-dough data. . . . . 57
3.10 RRMSE (left) and RR2 (right) index plot for Biscuit-dough data. . . . . . 58
3.11 Diagnostic plots for Biscuit-dough data. . . . . . . . . . . . . . . . . . . . . 59
4.1 (a) The norm of the IF of r1 (b) The norm of the IF of β̂1 (c) The IF of
β̂11 (d) The IF of β̂12. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 (a) Graph of w? for robust distances (b) Graph of w? for scaled errors, ". . 73
4.3 (a) Graph of w(0)F ( ?) versus (u1,v) (b) Graph of w(0)F ( ?) . . . . . . . . . . 78
4.4 Norms of the empirical influence functions for the (a) SIMPLS, n=20, p=200,
k=3 (b) RoPLS1, n=20, p=200, k=3 (c) SIMPLS, n=25, p=125, k=2 (d)
RoPLS1, n=25, p=125, k=2. . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.5 The finite-sample breakdown values of the SIMPLS, RoPLS1 and PRM es-
timators for (a) {n,p}={30,6} (b) {n,p}={20,200}. . . . . . . . . . . . . . . 87
5.1 Misclassification error rates for PLS and PCA for k=1. . . . . . . . . . . . . 98
5.2 Cross validation error rates obtained for Wine data. . . . . . . . . . . . . . 102
5.3 Orthogonal-score distance plots based on SIMPLS (left) and RoCPLS (right)
for Wine data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.4 Scree plots for original and deleted Wine data. . . . . . . . . . . . . . . . . 103
5.5 Orthogonal-score distance plots based on SIMPLS (left) and RoCPLS (right)
for Colon data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.6 3D scatter plot of the first three components for Colon data obtained from
RoCPLS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.7 Boxplots of the error rates for SIMPLS, RoCPLS, PCA and DG based
classification. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.8 Scatter plot of the first two components for Leukemia data. . . . . . . . . . 107
5.9 The scatter plot of the first two components (left) and orthogonal-score plot
(right) for Leukemia data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.10 Boxplots of the error rates for (a) no outliers (b) 1 outlier (c) 2 outliers (d) 3
outliers in each class. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.11 The angle between the first PLS weight vectors for clean and contaminated data. 110
Chapter 1
Introduction
Most traditional statistical techniques are especially designed for low dimensional data
sets where the number of observations (n) is greater than the number of variables (p).
Application of these statistical methods to problems such as survival time or tumor class
prediction for a patient, based on high dimensional data (n << p), is a difficult and
challenging task. On the other hand, data sets in many scientific fields are now
high dimensional because advances in technology have made simultaneous
monitoring of thousands of features (variables) possible. Therefore, analyzing such data
sets has become a focus for many researchers in a wide range of scientific fields, due to the
need to resolve the various statistical problems that have emerged.
Recently, partial least squares (PLS) has become an important statistical tool for mod-
eling relations between sets of observed variables by means of latent variables especially for
statistical problems dealing with high dimensional data sets. PLS is a member of nonlinear
iterative least squares (NILES) procedures developed by Wold ([87], [88]). The use of PLS
methods for regression problems began in the early 1980s. The main idea in PLS
regression (PLSR) is to summarize high dimensional and/or collinear explanatory variables into
a smaller set of uncorrelated, so-called latent variables, which have the "best" predictive
power.
Although PLSR was initially developed for social and economic science problems with
scarce information, it has received a great deal of attention in the chemometrics
literature. The main application of PLSR in chemometrics is the prediction of constituent
concentrations of a sample based on its spectrum obtained by spectroscopic techniques, such
as near infrared (NIR) spectroscopy, energy-dispersive X-ray fluorescence spectroscopy, and
ultraviolet (UV) spectroscopy. Spectral data contain a large amount of information since a
spectrum typically ranges over a large number of wavelengths (variables) with limited num-
ber of concentrations (observations). The spectroscopic techniques, in combination with
PLSR analysis, have proved to be a powerful analytical tool for analyzing on-line industrial
processes. Its speed, relatively good performance, and ability to handle data sets with more
variables than observations have resulted in many applications of PLSR in other scientific
areas such as bioinformatics, food research, medicine, and pharmacology. For instance, in
the area of drug design, a large number of chemicals need to be evaluated for their toxicity
and effectiveness before they are used by pharmacists. Quantitative Structure-Activity Re-
lationships (QSAR) analysis employs theoretical molecular descriptors for reliable estimates
of the toxic and therapeutic potential of chemicals. QSAR data sets contain some degree
of multicollinearity, which can be handled by using PLSR. Besides this, because of the large
time scale of the data collection process, QSAR data often contain outliers.
Although PLSR handles the multicollinearity problem, it fails to deal
with data containing outliers, since it is based on maximizing the sample covariance
between the response(s) and a set of explanatory variables, a statistic that is known to be
sensitive to outliers. The existence of multicollinearity and outliers is no exception in real data
sets, and it leads to a need for robust PLSR methods ([13], [35], [38], [48], [76], [83])
in chemometrics as well as other application areas.
The problem of classifying entities into one of several groups has been another important
goal in many scientific investigations. It is important that this activity is done in a
manner that minimizes the misclassification error rate. High dimensionality and collinearity
make the application of most statistical classification methods difficult, or even impossible,
in some cases. The procedure for classification of high dimensional data often consists of
two steps: the first step is to construct a few components from a large number of explanatory
variables by using dimension reduction techniques, and the second step is to employ
classical classification methods on the constructed components. Although PLS was originally
designed for regression problems, it has come to be used frequently as a dimension
reduction tool for classification problems, and recent studies have shown that classification
via PLS performs quite well ([4], [8], [60], [65], [66]), especially for microarray data analysis.
An important application of microarray technology is tumor diagnosis. A reliable and
precise classification of tumors is potentially life-saving and hence essential to physicians.
In the presence of outliers, e.g., tissue-specific genes whose expression profile is considerably
different (whether "erroneous" or "genuine") in particular tissue(s) compared with others,
dimension reduction via PLS would yield unreliable results, since PLS is known to be sensitive
to outliers. Although several robust PLS methods have been proposed when the response
variable is quantitative, to our knowledge, there has been no study on the robustness of
PLS when the response variable is qualitative.
The main contribution of this work is the construction of robust algorithms for partial
least squares methods. The chapters are organized as follows. In Chapter 2, a detailed
literature review on PLSR is given; it covers the existing PLS algorithms; the asymptotic variance,
consistency, geometric, and peculiar shrinkage properties of the PLSR estimator; and the rela-
tionship between PLSR and other biased estimation methods such as principal component
regression, ridge regression, and continuum regression. A robust PLSR method
(RoPLS) is introduced in Chapter 3, and its robustness properties are explored in Chapter 4.
The effect of outliers on existing PLS classification methods is investigated, and a new
robust PLS algorithm for classification (RoCPLS) is proposed in Chapter 5. Finally, conclusions
and proposed future work are given in Chapter 6. A recapitulation of the notation
used throughout this work can be found in the Appendix.
Chapter 2
An Overview on Partial Least Squares
2.1 Introduction
The standard multiple regression model is defined by the equation

y = Xβ + ε   (2.1)

where X is an n × p matrix of explanatory variables (predictors), y is an n × 1 vector of
responses, β is a p × 1 vector of unknown parameters, and ε is an n × 1 vector of
error terms whose entries are independently and identically distributed.
In multivariate models, the response vector y in model (2.1) is replaced by the n × g
response matrix Y, where g ≥ 2. Throughout this study, although PLS algorithms are given
for multivariate models in general, only the multiple univariate regression model is considered
and emphasized, due to its simplicity. Furthermore, multiple univariate models always give
better results than multivariate models in terms of the variance explained ([9], [31], [34], [59])
as long as the response variables are unrelated.
The outline of this chapter is as follows. In Section 2.2 most commonly used PLS
algorithms, NIPALS and SIMPLS, are described. Popular approaches for determining the
optimal number of components are discussed in Section 2.3. The relationship between PLSR
and other biased regression techniques is summarized in Section 2.4, followed by the statistical
properties of the PLSR estimator in Section 2.5.
2.2 Partial Least Squares Regression (PLSR)
The ordinary least squares (OLS) estimator of β, β̂_OLS, in the model given by (2.1) is
the solution of the following optimization problem:

β̂_OLS = argmax_b corr{Xb, y}².   (2.2)
In many applications of multiple regression (e.g., spectral data analysis in chemometrics),
multicollinearity is inevitable as a result of the large number of variables collected by
modern technologies of computers, networks, and sensors. Despite having desirable properties,
the OLS estimator can have an extremely large variance and result in imprecise
prediction when the data are multicollinear. Moreover, the solution of (2.2) is not unique when
n < p.
One solution to the multicollinearity and/or dimensionality problem is regressing
the response variable y on a subset of k orthogonal (latent) vectors stored in a score
matrix of size n × k, by which important features of X are retained. The score matrix is
formed by taking linear combinations of the columns of X. PLS regression (PLSR) constructs
the columns of the score matrix, T = [t1, t2, ..., tk], by solving the following optimization
problem for h = 1, 2, ..., k (k ≤ p):
r_h = argmax_{‖r‖=1} cov(Xr, y)² = argmax_{‖r‖=1} (r′X′y y′X r)   (2.3)

subject to t′_h t_j = r′_h X′X r_j = 0 for 1 ≤ j < h.
Thus, PLSR balances the maximal correlation criterion for OLS in (2.2) with the requirement
of explaining as much variability as possible in both the X- and y-spaces.
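Though not part of the dissertation, a short numerical check may help here: for a univariate response, the first maximizer in (2.3) has a closed form, r1 = X′y/‖X′y‖, since the objective reduces to (r′X′y)² and the Cauchy-Schwarz inequality applies. The sketch below (toy data and names of our own) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 8
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)
X -= X.mean(axis=0); y -= y.mean()        # PLS assumes centered data

s = X.T @ y                               # X'y, proportional to the covariances cov(x_j, y)
r1 = s / np.linalg.norm(s)                # first PLS weight vector

obj = lambda r: float((r @ s) ** 2)       # the criterion r'X'y y'X r of (2.3)
best_random = max(obj(v / np.linalg.norm(v))
                  for v in rng.standard_normal((200, p)))
print(obj(r1) >= best_random)             # True: no unit vector beats X'y / ||X'y||
```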
Several iterative procedures have been proposed to solve the nonlinear optimization problem
in (2.3), such as the PLS Mode A, PLS-SB, NIPALS and SIMPLS algorithms, which differ
in the deflation scheme required for the orthogonality of the derived components. The PLS Mode A
algorithm ([88]) aims to model existing relationships between variables rather than to model
for prediction. PLS-SB computes all eigenvectors at once, and the score vectors obtained by
this method are not necessarily orthogonal. The most commonly used methods, NIPALS
and SIMPLS, consist of two steps, which may be called calibration (deriving components) and
prediction. These algorithms, for both univariate and multivariate responses, are explained
in the following subsections. The extension of the two-block PLS model, where X and y (or Y
for the multivariate model) are block variables, to the multi-block PLS model is also given in the
literature ([84], [88]) and is not discussed in this study.
2.2.1 NIPALS Algorithm
The NIPALS algorithm ( [87]) was developed as an alternative to principal component
algorithms. NIPALS employs sequential simple linear regressions instead of singular value
decomposition to calculate principal components. The PLS algorithm can be considered as
carrying out two simultaneous NIPALS principal component analyses, one for X and one
for Y, while interchanging the results from X for the analysis of Y and vice versa ([50], [34]), to
solve the following maximization problem

max_{‖r‖=‖s‖=1} cov(Xr, Ys)²

under the orthogonality constraint on the derived components, where s = 1 and Y = y for the
univariate model. Since both X and Y are used in the computation of the components,
PLS is presented as a member of the bilinear class of methods and the bilinear model can
be written as:
X = TP′ + E,   (2.4)
Y = UQ′ + F.   (2.5)
The equations given in (2.4) and (2.5) are called outer relations, where T and U are score
matrices derived from X and Y, respectively; P (the x-loadings) contains the regression coefficients
of X on T; Q (the y-loadings) contains the regression coefficients of Y on U; and E and F
are matrices of errors. It is assumed that the score matrix T is a good predictor for Y
and that a linear, inner relationship between the score matrices T and U exists, i.e., U = TB + H,
where B is a k × k diagonal matrix and H is a matrix of errors. The mixed relation then
becomes:

Y = UQ′ + F = (TB + H)Q′ + F = TA′ + F*   (2.6)

where A′ = BQ′ is a matrix of regression coefficients and F* = HQ′ + F is a matrix of errors.
For the univariate case (g = 1), the matrix B in the model (2.6) becomes the identity matrix,
so equation (2.6) represents both the outer and mixed relationships, which can be rewritten
as

y = Ta + f*   (2.7)

where a is a vector of regression coefficients and f* is a vector of errors. The NIPALS
algorithm for a univariate response variable (y) based on the mixed relationship in (2.7) is called
PLS1, whereas the NIPALS algorithm for a multivariate response variable (Y) is called PLS2.
The calibration step of the PLS2 algorithm for k components is described as follows:
Algorithm 2.1 (PLS2)
Step 1: Let E0 and F0 be copies of X and Y, respectively.
Step 2: For h = 1, 2, ..., k, do steps 2.1-2.4:
Step 2.1: Let u_h be a column of F_{h-1}, e.g., the one having maximum variance.
Step 2.2: Repeat steps 2.2.1-2.2.4 until convergence of w_h (or t_h).
Step 2.2.1: Perform the regression E_{h-1} = u_h w′_h + ε1, yielding the least squares
solution

w_h = E′_{h-1} u_h / (u′_h u_h)   (2.8)

and normalize w_h := w_h / ‖w_h‖.
Step 2.2.2: Perform the regression E′_{h-1} = w_h t′_h + ε2, yielding the least squares
solution t_h = E_{h-1} w_h.
Step 2.2.3: Perform the regression F_{h-1} = t_h q′_h + ε3, with the least squares solution

q_h = F′_{h-1} t_h / (t′_h t_h)   (2.9)

and normalize q_h := q_h / ‖q_h‖.
Step 2.2.4: Perform the regression F′_{h-1} = q_h u′_h + ε4, yielding the solution
u_h = F_{h-1} q_h.
Step 2.3: Regress E_{h-1} and F_{h-1} on t_h separately to compute the residuals

E_h = E_{h-1} − t_h p′_h
F_h = F_{h-1} − b_h t_h q′_h

where p_h is the regression coefficient vector obtained by regressing E_{h-1} on t_h (outer
relation) and b_h is the regression coefficient obtained by regressing u_h on t_h (inner relation),
i.e.,

p_h = E′_{h-1} t_h / (t′_h t_h)
b_h = u′_h t_h / (t′_h t_h)

Step 2.4: Store the vectors w_h, t_h, and p_h in the matrices W_h = [w1, w2, ..., w_h],
T_h = [t1, t2, ..., t_h], and P_h = [p1, p2, ..., p_h], respectively. Set h := h + 1.
For the univariate case (PLS1), y and f0 are used instead of Y and F0 in the first step
of Algorithm 2.1. Steps 2.1 and 2.2 are replaced by 2.2.1 (with u_h = f_{h-1}) and 2.2.2,
respectively, while steps 2.3 and 2.4 remain the same with b_h = 1. In other words,
convergence of w_h is obtained in the first iteration.
The score matrix, T_k, can be written in terms of linear combinations of the columns of
E0 = X, that is, T_k = X R_k, where the hth column of R_k is called the hth PLS weight vector.
The matrix R_k is related to P_k and W_k, stored in the NIPALS algorithm, via the formula

R_k = W_k (P′_k W_k)^{-1}   (2.10)

which follows from the fact that R_k and W_k share the same column space and that P′_k R_k
is equal to the identity matrix ([41], [45], [55]).
In the prediction step, the ordinary least squares estimate of a in the univariate mixed
model (2.7) is obtained by regressing the response vector y onto these k components,
which yields

â(k) = (T′_k T_k)^{-1} T′_k y = (R′_k X′X R_k)^{-1} R′_k X′y.

Therefore, the PLS estimate of β in the model (2.1) is

β̂(k)_PLS = R_k â(k).
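As a concrete illustration (not from the dissertation), the calibration and prediction steps above can be sketched in Python; the function name and the toy data are our own. A classical sanity check is that with k equal to the rank of X, the PLS1 coefficients coincide with OLS:

```python
import numpy as np

def pls1_nipals(X, y, k):
    """PLS1 (univariate NIPALS) on centered X, y; returns the coefficient vector."""
    E, f = X.copy(), y.copy()
    W, P, T = [], [], []
    for _ in range(k):
        w = E.T @ f
        w /= np.linalg.norm(w)            # weight vector (Step 2.2.1; one pass suffices for PLS1)
        t = E @ w                         # score vector (Step 2.2.2)
        p = E.T @ t / (t @ t)             # x-loading (Step 2.3, outer relation)
        f = f - t * (t @ f) / (t @ t)     # deflate the response (b_h = 1 for PLS1)
        E = E - np.outer(t, p)            # deflate the data matrix
        W.append(w); P.append(p); T.append(t)
    W, P, T = np.array(W).T, np.array(P).T, np.array(T).T
    R = W @ np.linalg.inv(P.T @ W)        # R_k = W_k (P_k' W_k)^{-1}, equation (2.10)
    a = np.linalg.solve(T.T @ T, T.T @ y) # regress y on the k scores
    return R @ a                          # PLS estimate of beta

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 5)); X -= X.mean(axis=0)
y = X @ np.arange(1.0, 6.0) + 0.01 * rng.standard_normal(40); y -= y.mean()
beta_pls = pls1_nipals(X, y, k=5)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(beta_pls, beta_ols, atol=1e-6))   # True: k = p recovers OLS
```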
Alternative formulations of PLS1 are suggested by Helland ([41]) and Garthwaite
([33]). An extensive simulation study by Breiman and Friedman ([9]) comparing
univariate and multivariate regression methods, including PLS1, PLS2, OLS, and other
biased regression methods, demonstrated that performing separate PLS1 regressions on each
individual response is a better strategy than employing PLS2 (see also [31], [34], and
[59]).
The major drawback of the NIPALS algorithm (PLS1 and PLS2) is that the columns of
the score matrix, T, are obtained as linear combinations of the deflated data matrix X. Since
one loses sight of what is in the depleted data, the interpretation of the components becomes
complicated. The SIMPLS algorithm, given in the next subsection, resolves this drawback by
using a different deflation scheme.
2.2.2 SIMPLS Algorithm
The SIMPLS algorithm ([15]) is an alternative to the NIPALS algorithm that aims to derive
the PLS components directly in terms of the original data, which results in faster computation,
lower memory requirements, and easier interpretation. SIMPLS deflates the cross-
covariance matrix, S_xy ∝ X′Y, whereas NIPALS deflates the original data matrix X, to
obtain orthogonal components.
SIMPLS algorithm can be summarized as follows:
Algorithm 2.2 (SIMPLS)
Step 1: Compute the cross-product matrix S⁰_xy = X′Y (X and Y are centered).
Step 2: Repeat steps 2.1-2.6 for h = 1, 2, ..., k:
Step 2.1: Compute the first left singular vector of S^{h-1}_xy as the hth PLS weight vector r_h.
Step 2.2: Compute the hth score, t_h = X r_h, and normalize t_h := t_h / ‖t_h‖.
Step 2.3: Update the hth PLS weight, r_h := r_h / √(r′_h X′X r_h).
Step 2.4: Compute the hth x-loading by regressing X on t_h: p_h = X′t_h.
Step 2.5: Store the vectors r_h, t_h, and p_h in the matrices R_h = [r1, r2, ..., r_h],
T_h = [t1, t2, ..., t_h], and P_h = [p1, p2, ..., p_h], respectively.
Step 2.6: Set h := h + 1 and S^{h-1}_xy = (I_p − V_{h-1} V′_{h-1}) X′y, where the columns of V_{h-1} form
an orthonormal basis for P_{h-1}.
The orthogonality constraint on the components is fulfilled when the PLS weight vector r_h
is orthogonal to all previous x-loadings P_{h-1} = [p1, p2, ..., p_{h-1}]. As a result, the
hth SIMPLS weight vector r_h for h = 2, ..., k is obtained as the first left singular
vector of S^{h-1}_xy, which is the projection of S^{h-2}_xy onto a subspace orthogonal to P_{h-1}. Therefore,
if the columns of V_{h-1} = [v1, v2, ..., v_{h-1}] form an orthonormal basis of P_{h-1} obtained by
the Gram-Schmidt method, then

S^{h-1}_xy = (I_p − v_{h-1} v′_{h-1}) S^{h-2}_xy = (I_p − V_{h-1} V′_{h-1}) X′Y.   (2.11)
Once k components are derived, the PLS estimate is obtained by using the score matrix as the
explanatory variables, as in the prediction step described for the NIPALS algorithm.
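A minimal Python sketch of Algorithm 2.2 for a univariate response may clarify the deflation scheme (function and variable names are our own, not from the dissertation). By construction the scores come out orthonormal, and at k = p the estimate again reduces to OLS:

```python
import numpy as np

def simpls(X, y, k):
    """SIMPLS for a univariate response on centered X, y; returns (R, T, beta_hat)."""
    s = X.T @ y                        # cross-product vector S0_xy (Step 1)
    R, T, V = [], [], []
    for _ in range(k):
        r = s / np.linalg.norm(s)      # left singular vector of a p x 1 matrix (Step 2.1)
        t = X @ r
        nt = np.linalg.norm(t)
        t, r = t / nt, r / nt          # normalize score and rescale weight (Steps 2.2-2.3)
        p = X.T @ t                    # x-loading (Step 2.4)
        v = p.copy()                   # Gram-Schmidt against previous loadings
        for vj in V:
            v -= (vj @ p) * vj
        v /= np.linalg.norm(v)
        s = s - v * (v @ s)            # deflate the cross-product matrix, equation (2.11)
        R.append(r); T.append(t); V.append(v)
    R, T = np.array(R).T, np.array(T).T
    return R, T, R @ (T.T @ y)         # T has orthonormal columns, so a_hat = T'y

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 4)); X -= X.mean(axis=0)
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + 0.01 * rng.standard_normal(30); y -= y.mean()
R, T, beta = simpls(X, y, k=4)
print(np.allclose(T.T @ T, np.eye(4), atol=1e-8))   # scores are orthonormal
print(np.allclose(beta, np.linalg.lstsq(X, y, rcond=None)[0], atol=1e-6))
```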
After h components are derived, the data matrix is implicitly reduced to X(I_p − V_h V′_h)
by the SIMPLS algorithm, as can be seen from (2.11). In the PLS1 algorithm, the hth derived
component, t_h, is equal to E_{h-1} w_h, where w_h is the normalized form of:

E′_{h-1} f_{h-1} = X′(I_n − T_{h-1}(T′_{h-1}T_{h-1})^{-1}T′_{h-1})² y = X′(I_n − T_{h-1}(T′_{h-1}T_{h-1})^{-1}T′_{h-1}) y.
(2.12)

Therefore, the data matrix is explicitly reduced to (I_n − T_h(T′_hT_h)^{-1}T′_h)X by PLS1. Although
the residual matrices differ, application of both algorithms to data sets demonstrates that
the SIMPLS and NIPALS algorithms are equivalent in the univariate case, as stated in
the next proposition.
Proposition 2.1 (De Jong, [15]) SIMPLS is equivalent to PLS1.
Proposition 2.1 can be proven by induction. However, when applied to a multivariate set of
response variables (g > 1), the SIMPLS results differ from the results of PLS2 ([15]).
In this study, the SIMPLS algorithm is employed because of its speed and efficiency.
2.3 Determining the Optimal Number of Components in PLSR
The decision on the optimal number of components, k, is a very important issue in
building the PLSR model. Although it is possible to calculate as many components as
the rank of X, this does not make sense in practice, because data are never noise-free
and some of the smaller components will only describe noise. Due to the uncertain statistical
behavior of PLSR, explained in Section 2.5, it is difficult to perform inferential tasks such
as assessing the number of components. Consequently, developing and comparing
PLS component selection rules has been, and apparently continues to be, a subject of active
research in chemometrics. Cross validation, the adjusted Wold's criterion, and the randomization
test are the leading methods proposed to seek out the optimal dimensionality of PLS
models.
Among the many approaches proposed in the past, the cross-validation (CV) scheme
stands out in particular. In M-fold cross-validation, the original sample is partitioned into
M sub-samples. Of the M subsamples, a single sub-sample is retained as the validation set
for testing the model, and the remaining M ? 1 sub-samples are used as learning set for
estimating the model. The cross-validation process is then repeated M times (number of
folds), with each of M sub-samples used exactly once as the validation set. The M results
from the folds then can be combined to produce a single estimate for the optimal number of
components. In particular, n-fold cross validation (M = n), where only one observation
is deleted and the process is repeated as many times as there are samples, is called leave-one-out cross
validation. The resulting residual sum of squares, PRESS, is a measure of the predictive
power of the components in the model. The PRESS value for an h-component univariate PLSR
using leave-one-out cross validation is:

PRESS(h) = Σ_{i=1}^{n} (y_i − ŷ_{−i}(h))²   (2.13)

where the predicted value ŷ_{−i}(h) is based on the parameter estimates obtained
from the data set that excludes observation i, using a PLSR model with h components.
The optimal number of components is the one that yields the minimum PRESS or
14
root mean square error, RMSE,
k = argmin
h
fPRESS(h)g = argmin
h
fRMSE(h)g (2.14)
where
RMSE(h) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_{-i}(h) \right)^2} = \sqrt{\frac{1}{n} PRESS(h)}.    (2.15)
A simple and classical method is Wold's R criterion ([89]), which compares two successive values of PRESS via their ratio, that is,

R = \frac{PRESS(h+1)}{PRESS(h)}    (2.16)

where PRESS(h) is given in equation (2.13). When R is greater than 1, the optimal number of components is taken to be h. Instead of comparing this ratio to unity, it was proposed in [89] to fix the threshold at 0.90 or 0.95, which is named the adjusted Wold's criterion.
The randomization test ([85]) is a more recent method that assesses the statistical significance of each individual component entering the model. Theoretical approaches to this goal (using a t- or F-test) have been put forth, but they all rest on distributional assumptions. The randomization test is a data-driven approach and is therefore ideally suited for avoiding such assumptions.
Denham ([19]) evaluated the performance of several mean squared error (MSE) estimation approaches in terms of their accuracy and usefulness in determining the optimal number of components to include in a PLSR model. He concluded that all methods perform comparably for data sets with few variables, while cross-validation yields better MSE estimates for data sets with a large number of variables. One area where cross-validation works poorly is the design of experiments, where the randomization test should have merit ([85]).
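The leave-one-out scheme and the PRESS-RMSE identity of equations (2.13)-(2.15) can be sketched numerically as follows. This is only an illustration: for brevity the inner fit is plain OLS rather than an h-component PLSR, so the `fit`/`predict` stand-ins and the synthetic data are assumptions of the sketch, not the actual PLSR machinery.

```python
import numpy as np

def loo_press(X, y, fit, predict):
    """Leave-one-out PRESS, eq. (2.13): refit with observation i held out,
    then accumulate the squared prediction error at x_i."""
    n = len(y)
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        model = fit(X[mask], y[mask])
        press += (y[i] - predict(model, X[i])) ** 2
    return float(press)

# OLS stands in for the h-component PLSR fit purely for illustration.
fit_ols = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict_ols = lambda b, x: float(x @ b)

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(30)

press = loo_press(X, y, fit_ols, predict_ols)
rmse = float(np.sqrt(press / len(y)))   # eq. (2.15): RMSE = sqrt(PRESS / n)
```

In practice this loop would be run for each candidate h, and the h minimizing PRESS(h) chosen as in (2.14).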
2.4 PLS Regression Among Other Biased Methods

In multiple linear regression, the OLS estimator of the regression coefficient vector has minimum variance in the class of unbiased estimators. Multicollinearity, however, results in large variances of the coefficient estimators. Therefore, several biased estimation methods have been proposed as alternatives to the OLS estimator when multicollinearity is present. The main goal of biased methods is to decrease the mean squared error of prediction by introducing a reasonable amount of bias into the model. This is done by shrinking the coefficient vector away from the OLS solution toward directions in which the projected data have larger spread. PLSR is a biased regression method, and it is related to other biased methods such as principal component regression (PCR, [62]) and ridge regression (RR, [44]).
Most of these methods can be unified under a generalized approach called continuum regression. Continuum regression (CR) is a technique that can generate a range of models including OLS, PLSR, and PCR. The CR weight vectors r_h for h = 1, 2, \ldots, k are defined, as proposed by Stone and Brooks ([80]), according to the criterion

r_h = \arg\max_{\|r\|=1} \mathrm{cov}(Xr, y)^2 \left[\mathrm{Var}(Xr)\right]^{\frac{\alpha}{1-\alpha}-1} = \arg\max_{\|r\|=1} (r'X'y)^2 (r'X'Xr)^{\frac{\alpha}{1-\alpha}-1}    (2.17)

under the constraint that \mathrm{cov}(Xr_h, Xr_j) = 0 for h > j. The parameter \alpha (0 \le \alpha \le 1) adjusts the amount of information in the regressors to be considered for predicting the response variable.
The single vector (h = 1) that maximizes the squared sample correlation between the response and the corresponding linear combination of the predictor variables is the OLS solution, obtained by taking \alpha = 0 in equation (2.17). Similarly, for h = k, \alpha = 0.5 and \alpha = 1 yield the PLSR and PCR solutions based on k components, respectively.
Alternatively, the OLS estimator can be obtained as the solution of the normal equations

X'X b = X'y.    (2.18)

Throughout this section it is assumed that n > p and that X has full column rank, i.e. rank(X) = p; however, with some minor modifications, the results given in this section can be established for the other cases as well. The OLS estimator of \beta is then

\hat{\beta}_{OLS} = (X'X)^{-1} X'y.    (2.19)

If rank(X) < p, then (X'X)^{-1} in (2.19) is replaced by the Moore-Penrose inverse (X'X)^{+}, which yields the unique minimum length least squares (MLLS) solution.
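A minimal numerical check of (2.18)-(2.19) and of the MLLS variant might look as follows; the data are synthetic, and the rank-deficient case is manufactured by duplicating a column.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 4))
y = rng.standard_normal(20)

# Full-rank case: solve the normal equations X'X b = X'y (eqs. 2.18-2.19).
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Rank-deficient case: duplicating a column makes rank(X) < p, and
# (X'X)^{-1} is replaced by the Moore-Penrose inverse (X'X)^+, which
# yields the minimum length least squares (MLLS) solution.
Xd = np.column_stack([X, X[:, 0]])
beta_mlls = np.linalg.pinv(Xd.T @ Xd) @ (Xd.T @ y)
```

The MLLS solution coincides with the minimum-norm solution that an SVD-based least squares routine returns for the rank-deficient design.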
The idea of PCR is to replace the original regressors by h \le p principal components (PCs), stored in the score matrix Z_h = X V_h, where the first h eigenvectors of X'X form V_h. These eigenvectors are the solutions of (2.17) for \alpha = 1. Therefore, the PCR estimator of \beta in (2.1) based on h components is

\hat{\beta}^{(h)}_{PCR} = V_h (Z_h' Z_h)^{-1} Z_h' y = V_h V_h' \hat{\beta}_{OLS}.    (2.20)

It can be seen from (2.20) that the PCR estimator based on h components is the orthogonal projection of the OLS estimator onto the space spanned by the first h eigenvectors of X'X, since V_h V_h' = V_h (V_h' V_h)^{-1} V_h'. On the other hand, the PLSR estimator of \beta for h components is given by

\hat{\beta}^{(h)}_{PLS} = R_h P_h' \hat{\beta}_{OLS} = W_h (P_h' W_h)^{-1} P_h' \hat{\beta}_{OLS}.    (2.21)
The matrix W_h (P_h' W_h)^{-1} P_h' is idempotent, hence it is a projection matrix. However, it is not symmetric, so it is called an oblique projector. Therefore, the PLSR estimator for h components is the oblique projection of \hat{\beta}_{OLS} onto the column space of W_h along P_h^{\perp} ([69]). In PCR, as well as in PLSR, the degree of bias is controlled by the dimension h of the space onto which \hat{\beta}_{OLS} is projected: the smaller the value of h, the larger the bias.

In PCR, the columns of the score matrix are derived without reference to the response variable, so that the derived components are optimal in the sense of maximizing the amount of explained variation in X. In PLSR, on the other hand, a set of linear combinations of X and another set of linear combinations of y are derived, and they are optimally related in yet another sense. This is an advantage of PLSR over PCR, especially in cases where components obtained by maximizing variation in X have no relevance for modeling y, as demonstrated in the next example.
Example 2.1

Hald's data set ([23]), consisting of four standardized regressors and one response variable, is used to demonstrate this drawback of PCR. The original response variable is replaced by y = 2 X v_4 + \varepsilon, where \varepsilon \sim N(0, 1) and v_4 is the eigenvector corresponding to the smallest eigenvalue of X'X, as suggested by Hadi et al. ([39]). It can be seen from Figure 2.1 that the scatter plots of y versus each of the first three PCs display a completely random pattern, while the relationship between y and the last PC is perfectly linear. PCR and PLSR based on k = 2 components resulted in mean squared errors of 5.3228 and 0.2277, respectively, which can also be concluded from Figure 2.2. In general, if the true regression coefficient is in the direction of the ith eigenvector of X'X, then the ith component alone will contribute everything to the fit, while the remaining PCs contribute nothing ([39]). In such cases, PLSR is expected to perform better than PCR, since the optimal directions are determined by taking y into account.
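The spirit of this example can be reproduced with a short simulation. Synthetic Gaussian regressors stand in for Hald's data (so the numbers differ from those quoted above), and the two-component PLSR fit uses a bare-bones PLS1/NIPALS sketch; both substitutions are assumptions of the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 4
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                        # centered regressors

lam, V = np.linalg.eigh(X.T @ X)           # eigenvalues in ascending order
v_small = V[:, 0]                          # eigenvector of the smallest eigenvalue
y = 2 * X @ v_small + 0.05 * rng.standard_normal(n)
y -= y.mean()

# PCR with k = 2: project onto the two leading PCs, which by construction
# carry essentially no information about y.
Z = X @ V[:, -2:]
yhat_pcr = Z @ np.linalg.solve(Z.T @ Z, Z.T @ y)

def pls1_fitted(X, y, h):
    """Fitted values of a univariate PLSR (PLS1 / NIPALS sketch)."""
    Xr, yr, yhat = X.copy(), y.copy(), np.zeros_like(y)
    for _ in range(h):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)             # weight from the cross-covariance
        t = Xr @ w
        tt = t @ t
        p_load = Xr.T @ t / tt
        q = (yr @ t) / tt
        Xr -= np.outer(t, p_load)          # deflate X
        yr -= q * t                        # deflate y
        yhat += q * t
    return yhat

mse_pcr = float(np.mean((y - yhat_pcr) ** 2))
mse_pls = float(np.mean((y - pls1_fitted(X, y, 2)) ** 2))
```

Because the first PLS weight vector is proportional to X'y, and here X'y points almost exactly along v_small, PLSR picks up the relevant direction with its very first component, while two-component PCR misses it entirely.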
Another advantage of PLSR over PCR is that the vector of fitted values from PLSR is closer to the fitted values from OLS, and hence to y, than its PCR counterpart. The PLS model always gives a closer fit, in terms of the coefficient of determination R^2, than a PCR model with the same number of components ([16], [31], [70]).
Ridge regression (RR) is another well-known biased regression method. The method replaces the covariance matrix X'X by a better conditioned matrix, X'X + \lambda I_p, for a positive value of \lambda called the ridge constant. The aim is to stabilize the inverse of the possibly ill-conditioned covariance matrix by adding a multiple of I_p. As in OLS, the solution is defined by a single vector,

\hat{\beta}_{RR} = \arg\max_{\|w\|=1} \mathrm{corr}(Xw, y)^2 \frac{\mathrm{Var}(Xw)}{\mathrm{Var}(Xw) + \lambda} = (X'X + \lambda I_p)^{-1} X'y.    (2.22)
Figure 2.1: Scatter plots of y versus PC1, PC2, PC3, PC4.
Figure 2.2: Scatter plots of \hat{y} versus y using PCR with k = 2 (left; correlation 0.0017) and PLSR with k = 2 (right; correlation 0.9867).
Setting \lambda = 0 yields the unbiased OLS solution, while larger values of \lambda introduce bias into the model. The relationship between the first factor of CR and RR is described in [81], where it is concluded that there is a one-to-one correspondence between \alpha (the CR parameter) and \lambda ([81]).
RR differs from PLSR in two respects. First, it does not derive orthogonal components; it applies explicit shrinkage to the coefficient vector. Second, RR is usually applied to univariate regression models, although a generalized RR for multivariate response models has been proposed. One of the disadvantages of RR is its heavier computational burden, especially for high dimensional problems.
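The closed form on the right-hand side of (2.22) is one line of code. The sketch below, on synthetic data, checks the two qualitative facts just discussed: \lambda = 0 recovers OLS, and a positive \lambda shrinks the coefficient norm.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed form from eq. (2.22): (X'X + lam * I_p)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(3)
X = rng.standard_normal((40, 5))
y = rng.standard_normal(40)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_rr = ridge(X, y, 10.0)   # positive lambda -> shorter coefficient vector
```
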
A comparison of OLS, PLSR, PCR, and RR is given by Frank and Friedman ([31]) and Almøy ([1]). Simulation studies indicated that none of the methods is uniformly better than the others. Frank and Friedman concluded that RR, PLSR, and PCR provide substantial improvement over OLS when multicollinearity exists. In all settings, PLSR usually performed comparably to RR and was closely followed by PCR. The main result of Almøy's study was that PLSR performs better when the irrelevant eigenvalues are large, whereas PCR performs better when the irrelevant eigenvalues are small.
2.5 Statistical Properties of the PLSR Estimator

Since the PLSR estimator of \beta is a non-linear function of y, it is very difficult to derive the exact distribution of the estimator, which leads to difficulties in inference-based tasks. Although PLSR is a very popular tool among chemometricians, it used to be overlooked by statisticians. More recently, however, statisticians have attempted to shed some light on the method and its properties. In this section, the shrinkage structure of the PLSR estimator ([17], [31], [53], [70]), its asymptotic variance ([18], [71], [75]), and its consistency properties ([42], [64]) are conveyed.
2.5.1 Shrinkage Properties of the PLSR Estimator

Characterization of the PLSR estimator in terms of its shrinkage properties is useful for providing a link between PLSR and other shrinkage estimators.
In general, if the singular value decomposition of X is given by X = U \Lambda^{1/2} V', where the columns of U, u_j, and the columns of V, v_j, are the left and right singular vectors of X, respectively, and \Lambda is the diagonal matrix with the ordered eigenvalues of X'X, \lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_p, on the diagonal, then a shrinkage estimator of \beta has the form

\hat{\beta} = \sum_{j=1}^{p} f(\lambda_j) \frac{u_j' y}{\sqrt{\lambda_j}} v_j = \sum_{j=1}^{p} f(\lambda_j) \alpha_j v_j    (2.23)

where \alpha_j = u_j' y / \sqrt{\lambda_j}, the f(\lambda_j) are shrinkage factors, and the u_j' y are called Fourier coefficients ([53]). Since the OLS estimator in (2.19) can be rewritten as \sum_{j=1}^{p} \alpha_j v_j, all shrinkage factors f(\lambda_j) equal 1 for the OLS estimator. Setting f(\lambda_j) to values other than 1 introduces bias into the estimation process, which is beneficial when the increase in bias is offset by a decrease in mean squared error. Shrinkage factors less than 1 lead to a reduction in the variance of the estimator, whereas factors greater than 1 result in a simultaneous increase in both variance and bias.
The shrinkage factors for the RR estimator are

f(\lambda_j) = \frac{\lambda_j}{\lambda_j + \lambda}

for j = 1, 2, \ldots, p, while the factors of the PCR estimator for h \le p components are

f(\lambda_j) = \begin{cases} 0, & j > h \\ 1, & j \le h. \end{cases}

It can easily be seen that the shrinkage factors for PCR and RR are between 0 and 1. Therefore, they are regarded as shrinkage methods, since they shrink \hat{\beta} by shrinking some of the \alpha_j.
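The representation (2.23) can be verified numerically on synthetic data: plugging f(\lambda_j) = 1 for all j reproduces the OLS estimator, and plugging the RR shrinkage factors \lambda_j / (\lambda_j + \lambda) reproduces the ridge closed form of (2.22).

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((30, 4))
y = rng.standard_normal(30)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
lam_j = s ** 2                         # eigenvalues of X'X
alpha = (U.T @ y) / s                  # alpha_j = u_j'y / sqrt(lambda_j)

def shrinkage_estimator(f):
    # eq. (2.23): beta_hat = sum_j f(lambda_j) * alpha_j * v_j
    return Vt.T @ (f * alpha)

beta_ols = shrinkage_estimator(np.ones_like(lam_j))     # all factors = 1
ridge_const = 2.0
beta_rr = shrinkage_estimator(lam_j / (lam_j + ridge_const))
```
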
Frank and Friedman ([31]) were the first statisticians to state the shrinkage property of the PLSR estimator, using simulation studies, but they did not provide a theoretical proof. The shrinkage properties of the PLSR estimator can be investigated theoretically through its relationship with Krylov spaces and the conjugate gradient (CG) method ([70]); some brief background on these concepts is therefore appropriate. The space spanned by the columns of z, Az, A^2 z, \ldots, A^{m-1} z is called the m-dimensional Krylov space of a square matrix A and a vector z, and is denoted by K_m(A, z). An alternative form of the PLSR estimator with h components can be given in terms of the h-dimensional Krylov space K_h(X'X, X'y). The matrix W_h obtained from the NIPALS algorithm is an orthonormal basis for the space K_h(X'X, X'y) ([41]), and it is central to the connection between PLSR and the CG method.

The CG method aims to solve a system of linear equations Cb = c arising from the minimization of the quadratic function \phi(b) = \frac{1}{2} b'Cb - c'b for a positive semi-definite matrix C. The CG solution after h iterations can be obtained from the following algorithm:
Algorithm 2.3 (CG)

Step 1: Let b_0 = 0 and d_0 = e_0 = c - C b_0 = c. Repeat Steps 2-5 for j = 0, \ldots, h-1.

Step 2: Calculate the step length z_j:

z_j = \frac{d_j' e_j}{d_j' C d_j}.    (2.24)

Step 3: Calculate b_{j+1}:

b_{j+1} = b_j + z_j d_j.    (2.25)

Step 4: Calculate the residual e_{j+1}:

e_{j+1} = c - C b_{j+1} = c - C(b_j + z_j d_j) = e_j - z_j C d_j.    (2.26)

Step 5: Calculate the new direction d_{j+1}:

d_{j+1} = e_{j+1} - \left( \frac{e_{j+1}' C d_j}{d_j' C d_j} \right) d_j,    (2.27)

and set j := j + 1.
When C = X'X and c = X'y, the normal equations given in (2.18) are obtained, and for any arbitrary initial solution, b_p converges (in exact arithmetic) to (X'X)^{+} X'y, which equals \hat{\beta}_{OLS} if X is a full rank matrix ([70]).
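Algorithm 2.3 is short enough to state directly in code. The sketch below, on synthetic data with C = X'X and c = X'y, checks that p iterations reproduce the OLS solution, and also that the solution norms grow with h, which is the norm-increase property proved later in this section.

```python
import numpy as np

def cg(C, c, h):
    """Algorithm 2.3: h conjugate gradient iterations for C b = c."""
    b = np.zeros_like(c)
    d = c.copy()
    e = c.copy()                         # Step 1: b_0 = 0, d_0 = e_0 = c
    for _ in range(h):
        Cd = C @ d
        z = (d @ e) / (d @ Cd)           # step length, eq. (2.24)
        b = b + z * d                    # update, eq. (2.25)
        e = e - z * Cd                   # new residual, eq. (2.26)
        d = e - ((e @ Cd) / (d @ Cd)) * d   # new direction, eq. (2.27)
    return b

rng = np.random.default_rng(5)
X = rng.standard_normal((50, 4))
y = rng.standard_normal(50)
C, c = X.T @ X, X.T @ y

beta_ols = np.linalg.solve(C, c)
b_full = cg(C, c, 4)                     # p iterations recover OLS
norms = [np.linalg.norm(cg(C, c, h)) for h in range(1, 5)]
```

Since b_h here equals the h-component PLSR estimator (as explained next), the increasing sequence `norms` is exactly the shrinkage statement of Proposition 2.2.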
The vectors generated during the CG algorithm have important properties ([36]). For instance, the space spanned by the vectors \{d_0, d_1, \ldots, d_{h-1}\} is K_h(C, c), and the residual vectors \{e_0, e_1, \ldots, e_{h-1}\} span the same Krylov space. Besides this, since the residuals are orthogonal, these vectors are, up to a scale factor, the w_j's from the NIPALS algorithm when C = X'X and c = X'y for j = 1, 2, \ldots, h. Therefore, the CG estimator b_j for j = 1, 2, \ldots, h is identical to the PLSR estimator for j components, \hat{\beta}^{(j)}_{PLS}, and the PLSR estimator for h = p components yields the OLS estimator. Another important property of the CG estimator b_j is that it minimizes (\hat{b} - b)' C (\hat{b} - b) over b \in K_j(C, c), j = 1, 2, \ldots, h, where \hat{b} solves Cb = c. Therefore, the PLSR estimator based on h components is given by

\hat{\beta}^{(h)}_{PLS} = \arg\min_{b \in K_h(X'X, X'y)} (y - Xb)'(y - Xb).    (2.28)
Furthermore, the relationship between CG and PLSR can be used to prove the following proposition.

Proposition 2.2 (De Jong, [17]) The norm of \hat{\beta}^{(h)}_{PLS} is a strictly increasing function of the number of components h; i.e., the PLSR estimator of \beta shrinks.

This result was proven algebraically by De Jong ([17]). Another proof, by Phatak and De Hoog ([70]), given below, is more compact and uses the relationship between the CG method and the PLSR estimator.
Proof:
It suffices to show that the norms of the CG estimators for the normal equations, b_h = \hat{\beta}^{(h)}_{PLS}, increase with h, that is, \|b_1\| < \|b_2\| < \ldots < \|b_p\| = \|\hat{\beta}_{OLS}\|. In general, (2.25) gives

\|b_{j+1}\|^2 = \|b_j\|^2 + z_j^2 \|d_j\|^2 + 2 z_j d_j' b_j.

Hence, to prove that \|b_{j+1}\|^2 > \|b_j\|^2, it is sufficient to show that z_j d_j' b_j > 0, since z_j^2 \|d_j\|^2 \ge 0. Multiplying both sides of (2.27) by e_{j+1}' yields

e_{j+1}' d_{j+1} = e_{j+1}' e_{j+1} - \left( \frac{e_{j+1}' X'X d_j}{d_j' X'X d_j} \right) e_{j+1}' d_j.    (2.29)
Since the residuals are orthogonal and span\{e_0, e_1, \ldots, e_{j+1}\} = span\{d_0, d_1, \ldots, d_{j+1}\}, e_{j+1} is orthogonal to d_j. Therefore e_{j+1}' d_j = 0 in (2.29), and e_{j+1}' d_{j+1} = e_{j+1}' e_{j+1} for j = 1, 2, \ldots, p-1, from which it follows that

z_j = \frac{d_j' e_j}{d_j' X'X d_j} = \frac{e_j' e_j}{d_j' X'X d_j} > 0.    (2.30)

From (2.25), b_j = \sum_{i=0}^{j-1} z_i d_i, so d_j' b_j = \sum_{i=0}^{j-1} z_i d_j' d_i. Therefore, to show that d_j' b_j > 0, it is sufficient to prove that d_j' d_i > 0 for i \ne j. This can be proven by expressing the d_j vectors in terms of the residuals. If

l_j = -\frac{e_{j+1}' X'X d_j}{d_j' X'X d_j},

then (2.27) gives

d_j = e_j + \sum_{i=0}^{j-1} \left( \prod_{k=i}^{j-1} l_k \right) e_i.    (2.31)

So d_j' d_i > 0 provided that l_j > 0 for all j. Again, using (2.26), we get

e_{j+1} = e_j - z_j X'X d_j \implies X'X d_j = z_j^{-1}(e_j - e_{j+1}).

Therefore, l_j can be rewritten as

l_j = -\frac{e_{j+1}' X'X d_j}{d_j' X'X d_j} = -\frac{e_{j+1}'(e_j - e_{j+1})}{z_j d_j' X'X d_j} = \frac{\|e_{j+1}\|^2}{z_j d_j' X'X d_j} > 0,    (2.32)

which completes the proof. \square
It is also possible to express the shrinkage factors of the PLSR estimator in terms of the eigenvalues of the matrices X'X and W_h' X'X W_h, and this expression, given in the next proposition ([53]), provides valuable information on the shrinkage behavior of the PLSR estimator.
Proposition 2.3 (Lingjærde and Christophersen, [53]) The shrinkage factors from (2.23) for the PLSR estimator with h components are given by

f(\lambda_j)^{(h)} = 1 - \prod_{i=1}^{h} \left( 1 - \frac{\lambda_j}{\theta_i^{(h)}} \right)    (2.33)

for j = 1, 2, \ldots, p, where \lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_p and \theta_1^{(h)} \ge \theta_2^{(h)} \ge \ldots \ge \theta_h^{(h)} are the eigenvalues of X'X and W_h' X'X W_h, respectively.
The shrinkage factors of the PLSR estimator are determined using the relationship between PLSR and the Lanczos method ([41], [55]). The Lanczos method transforms a symmetric matrix C into a symmetric tridiagonal matrix; after applying it to C = X'X, the eigenvalues and corresponding eigenvectors of C = X'X can be calculated easily owing to the tridiagonal structure. Given the starting vector c = X'y, the procedure constructs tridiagonal matrices W_h' X'X W_h for h = 1, 2, \ldots, p, where the columns of W_h form an orthonormal basis for K_h(C, c) = K_h(X'X, X'y). If \theta_i^{(h)} and \eta_i^{(h)}, i = 1, 2, \ldots, h, are the eigenvalues and unit-norm eigenvectors of W_h' X'X W_h, respectively, then the pairs (\theta_i^{(h)}, W_h \eta_i^{(h)}) are called Ritz pairs.

Although it is known that the PLSR estimator shrinks relative to the OLS estimator (Proposition 2.2), it may expand some of the \alpha_j, that is, f(\lambda_j) > 1. The eigenvalues of X'X and the Fourier coefficients play an important role in the shrinkage properties of the PLSR estimator ([53]): anti-shrinkage is observed especially in cases where the corresponding singular value is large and the corresponding Fourier coefficient is small. On the other hand, since the contribution of such terms to the PLSR estimator is small, anti-shrinkage may not be a serious issue in practice.
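Formula (2.33) can be checked numerically. The sketch below builds an orthonormal Krylov basis by a QR factorization (rather than an explicit Lanczos recursion, which is an implementation shortcut of this illustration), computes the Ritz values, and verifies that at h = p the Ritz values coincide with the eigenvalues of X'X, so that every shrinkage factor collapses to 1 and OLS is recovered. A generic starting vector is assumed, so the Krylov space has full dimension.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((40, 4))
y = rng.standard_normal(40)
C, c = X.T @ X, X.T @ y

lam = np.sort(np.linalg.eigvalsh(C))[::-1]    # lambda_1 >= ... >= lambda_p

def pls_shrinkage_factors(h):
    # Orthonormal basis W_h for the Krylov space K_h(C, c).
    K = np.column_stack([np.linalg.matrix_power(C, j) @ c for j in range(h)])
    W, _ = np.linalg.qr(K)
    theta = np.linalg.eigvalsh(W.T @ C @ W)   # Ritz values theta_i^(h)
    # eq. (2.33): f(lambda_j)^(h) = 1 - prod_i (1 - lambda_j / theta_i^(h))
    return 1.0 - np.prod(1.0 - lam[:, None] / theta[None, :], axis=1)

f_full = pls_shrinkage_factors(4)   # h = p: all factors should equal 1
```

For h < p the same function returns factors that may exceed 1 for some j, which is the anti-shrinkage phenomenon noted above.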
2.5.2 The Asymptotic Variance of the PLSR Estimator

Unlike the OLS estimator, the PLSR estimator of \beta in (2.1) is a non-linear function of the response variable. Thus, the covariance matrix of the PLSR estimator cannot be determined easily. This leads to difficulties in inference-based tasks such as choosing the optimal number of components, testing the significance of coefficients, and constructing confidence intervals for the regression coefficients.

Methods based on re-sampling techniques, such as bootstrapping and cross-validation, utilize the original data to gain information about the variability of the estimator. However, these methods are computationally intensive. Furthermore, their application to typical chemometrics problems is not practical, since there are often very few observations.
Another approach is to linearize the non-linear estimator in order to estimate its variance. Denham ([18]) provided a locally linear approximation to the covariance matrix based on the first derivative of the PLSR vector. A more compact expression for the asymptotic covariance matrix is given by Phatak et al. ([71]). In their approach, the approximate covariance matrix of \hat{\beta}^{(h)}_{PLS} is calculated using the delta method stated below.

Theorem 2.1 Let \{Y_n\} be a sequence of random vectors. Assume that \sqrt{n}[Y_n - \mu] converges in distribution to N(0, \Sigma) and that g(\cdot) is a vector function whose derivatives exist in a neighborhood of z = \mu. Then \sqrt{n}[g(Y_n) - g(\mu)] converges in distribution to N(0, J \Sigma J'), where J = \partial g(z)/\partial z' |_{z=\mu}.

Therefore, assuming that \mathrm{var}(\varepsilon) = \sigma^2 I_n, the approximate covariance matrix of \hat{\beta}^{(h)}_{PLS} can be written as

\mathrm{var}(\hat{\beta}^{(h)}_{PLS}) \approx J_o J_o' \sigma^2    (2.34)
where J_o is the Jacobian matrix consisting of the derivatives of each element of \hat{\beta}^{(h)}_{PLS} with respect to the elements of y, evaluated at the given data (X_o, y_o). The Jacobian matrix J_o is derived using matrix differential calculus and is given as

J_o = \frac{\partial \hat{\beta}^{(h)}_{PLS}}{\partial y'} \Big|_{y=y_o} = \left\{ [y'X W_h G_h \otimes L] + [W_h G_h \otimes y'X L] \right\} (M^{-1} \otimes I_p) U_h' + H_h X'    (2.35)

where G_h = (W_h' X'X W_h)^{-1}, H_h = W_h G_h W_h', L = (I_p - H_h X'X), M is the matrix such that K_h = W_h M with

K_h = [X'y, (X'X)X'y, \ldots, (X'X)^{h-1} X'y],

and U_h = [X, X(X'X), \ldots, X(X'X)^{h-1}] is an n \times hp matrix ([71]). A reasonable estimate of \sigma^2 in (2.34) can be calculated by dividing the residual sum of squares by an appropriate number of degrees of freedom. Although n - p has been suggested as the degrees of freedom, due to the non-linear form of the PLSR estimator the use of the trace of (I_n - X J_o)'(I_n - X J_o) has been suggested instead ([18]).
The results of simulation studies indicated that the covariance matrix estimate based on the Jacobian matrix J_o provides a useful approximation to the true covariance matrix ([71]). However, since J_o is evaluated using the data from a single experiment, how well J_o J_o' \hat{\sigma}^2 approximates the actual covariance matrix is directly related to the optimality of the number of components, h.

The covariance matrix of \hat{\beta}^{(h)}_{PLS} can also be calculated from the influence function ([75]). This approach has advantages over linearization-based methods, such as independence from model assumptions and computational ease. The relationship between the influence function of an estimator and its variance is given in Chapter 4.
2.5.3 Consistency of the PLSR Estimator

A consistent estimator is one that converges in probability to the parameter being estimated as the sample size grows without bound. In this section, it is shown that \hat{\beta}^{(h)}_{PLS} is a consistent estimator of \beta in the model (2.1) under specific assumptions.

The linear model in (2.1) can be rewritten as

y_i = x_i' \beta + \varepsilon_i    (2.36)

where x_i is the ith row of X for i = 1, 2, \ldots, n. Assuming that the x_i are independently and identically distributed random vectors with positive definite covariance matrix \Sigma_{xx} and are independent of the \varepsilon_i, then

\Sigma_{xy} = \mathrm{cov}(x_i, y_i) = \mathrm{cov}(x_i, x_i'\beta) = \Sigma_{xx} \beta.    (2.37)

Therefore,

\beta = \Sigma_{xx}^{-1} \Sigma_{xy} = \sum_{h=1}^{p} \frac{v_h v_h'}{\lambda_h} \Sigma_{xy}

where \lambda_h and v_h are the hth eigenvalue and the corresponding eigenvector of \Sigma_{xx} for h = 1, 2, \ldots, p. In general, the eigenvectors with v_h' \Sigma_{xy} \ne 0, one for each \lambda_h, are called the relevant eigenvectors of \Sigma_{xx} for the prediction of y ([42], [59]); i.e., the eigenvectors of \Sigma_{xx} with non-zero components along \Sigma_{xy}.
Helland ([41]) provided a general form of \hat{\beta}^{(h)}_{PLS} as

\hat{\beta}^{(h)}_{PLS} = \hat{D}_h (\hat{D}_h' S_{xx} \hat{D}_h)^{-1} \hat{D}_h' S_{xy}    (2.38)

where S_{xx} is the sample covariance estimator of \Sigma_{xx}, S_{xy} is the sample estimator of \Sigma_{xy}, and \hat{D}_h is a p \times h matrix whose columns span K_h(S_{xx}, S_{xy}). Helland ([42]) also proved that if m is the number of relevant eigenvectors of \Sigma_{xx} for the prediction of y, then m is the least integer h such that \beta = \Sigma_{xx}^{-1} \Sigma_{xy} belongs to K_h(\Sigma_{xx}, \Sigma_{xy}). These facts are used to prove the consistency of the PLSR estimator, given in the next lemma.

Lemma 2.1 Assume that S_{xx} and S_{xy} converge in probability to \Sigma_{xx} and \Sigma_{xy}, respectively. If h = m is the number of relevant eigenvectors of \Sigma_{xx} for the prediction of y, then \hat{\beta}^{(h)}_{PLS} is a consistent estimator of \beta.
Proof: Since S_{xx} and S_{xy} converge in probability to \Sigma_{xx} and \Sigma_{xy}, respectively, the h-component PLSR estimator in (2.38), \hat{\beta}^{(h)}_{PLS}, converges in probability to

D_h (D_h' \Sigma_{xx} D_h)^{-1} D_h' \Sigma_{xy} = D_h (D_h' \Sigma_{xx} D_h)^{-1} D_h' \Sigma_{xx} \beta

where the columns of D_h span K_h(\Sigma_{xx}, \Sigma_{xy}). If h = m then, as proved by Helland ([42]), \beta \in K_h(\Sigma_{xx}, \Sigma_{xy}). Consequently, the space spanned by D_h^* = \Sigma_{xx}^{1/2} D_h contains \Sigma_{xx}^{1/2} \beta, which implies

D_h^* (D_h^{*\prime} D_h^*)^{-1} D_h^{*\prime} \Sigma_{xx}^{1/2} \beta = \Sigma_{xx}^{1/2} \beta.

Multiplying each side by \Sigma_{xx}^{-1/2} from the left, we obtain

D_h (D_h^{*\prime} D_h^*)^{-1} D_h' \Sigma_{xx} \beta = D_h (D_h' \Sigma_{xx} D_h)^{-1} D_h' \Sigma_{xx} \beta = \beta.

So \hat{\beta}^{(h)}_{PLS} converges in probability to \beta. \square
Chapter 3
RoPLS: Robust Partial Least Squares Regression
3.1 Introduction

Although PLS regression handles the multicollinearity problem, it fails to deal with data containing outliers, since PLS algorithms are based on the empirical cross-covariance matrix between the response variable(s) and the regressors. The joint presence of multicollinearity and outliers in data sets creates a need for robust PLS methods in many application areas. Consequently, several robust PLS regression methods have been proposed in the literature. In general, these methods can be classified into two groups: those that use an iterative reweighting technique and those that use the idea of robustifying the covariance matrix.
The methods using the iterative reweighting idea assign a weight between 0 and 1 to each data point, in such a way that outliers, points which are sufficiently far away from the bulk of the data, receive smaller weights than inliers, the points forming the bulk of the data. The first robust PLS method in this group, proposed by Wakeling and Macfie ([83]), is based on the idea of replacing a set of ordinary regression steps in the NIPALS algorithm by robust counterparts. The main drawback of the method is its large computation time. Therefore, Griep et al. ([38]) suggested a semi-robust method that replaces only the first ordinary regression step by a robust regression method, for the sake of computation. However, this method looks for outliers in the planes [y, X_j], where j = 1, 2, \ldots, p, while ignoring the multivariate nature of the data. Cummins and Andrews ([13]) gave a slightly different version of the iterative reweighting method, calculating weights after performing an ordinary PLS algorithm and using these weights in the next PLS run. Since the two aforementioned methods employ a robust method within the PLS algorithm, they are called "internal iteratively reweighting" methods, whereas the method of Cummins and Andrews is called an "external iteratively reweighting" method ([35]). Although the latter method has advantages over the internal reweighting methods, it suffers from a lack of resistance to outliers in the X space. Only recently was another external iterative method, called Partial Robust M Regression (PRM), proposed by Serneels et al. ([76]); it is robust to outliers in the X space.
The second group of robust methods, introduced by Gil and Romera ([35]), estimates the regression coefficients with the help of a robust covariance matrix instead of applying a robust regression method within the PLS algorithm. Gil and Romera estimated the covariance matrix using the Stahel-Donoho estimator ([22], [78]). However, this method cannot be applied to high dimensional data sets, since it uses a resampling scheme that draws subsets of size p + 2. Hubert and Vanden Branden ([48]) proposed another two-step algorithm (RSIMPLS), which can be used for both low and high dimensional data, estimating the covariance matrix using the MCD (minimum covariance determinant, [74]) estimator and the idea of projection pursuit. Once the score matrix is obtained, robust regression is used to estimate a in (2.7), and hence \beta in equation (2.1).
In this chapter, robustified versions of the SIMPLS algorithm, RoPLS1 and RoPLS2, are introduced. The proposed methods are external iteratively reweighting algorithms based on the idea of the reweighted least squares method given by Billor et al. ([6]). RoPLS1 uses weights calculated from the BACON (blocked adaptive computationally-efficient outlier nominators) algorithm ([5]), whereas RoPLS2 uses weights calculated from the PCOUT algorithm ([26]). RoPLS2 is the first method that incorporates the PCOUT algorithm as an integral component of a robust estimation procedure rather than just as an outlier detection method. Both RoPLS1 and RoPLS2 have computational advantages over the recent robust methods PRM and RSIMPLS, and they are resistant to the masking problem.

The rest of the chapter is organized as follows. The outlier identification algorithms BACON and PCOUT are reviewed in Section 3.2. The detailed algorithm for the proposed method, RoPLS, including a robust method to determine the number of components and two graphical methods to diagnose outliers, is given in Section 3.3. Real and simulated data sets are used to demonstrate the performance of the proposed method in Section 3.4.
3.2 Outlier Detection Methods

The outlier challenge is one of the earliest of statistical interests and, since data sets often contain outliers, it continues to be one of the most important. The outlier detection problem has important applications in the fields of fraud detection, astronomy, bioinformatics (e.g., microarray experiments), and countless other areas. For instance, a great interest of astronomers is to discover unusual, rare, or unknown types of astronomical objects or phenomena; outlier detection approaches for multi-terabyte, and even multi-petabyte, data sets directly meet this need.

Outlier detection algorithms fall into two broad classes: those that employ distance-based methods, such as MULTOUT ([90]), MCD ([74]), and BACON ([5]); and those that rely on projection pursuit ideas, such as Kurtosis1 ([68]) and the Stahel-Donoho estimator ([56]). Distance-based methods classify outliers as points which are sufficiently far away from the bulk of the data, whereas projection-pursuit methods use lower dimensional projections of the data to reveal outliers. The primary goals of all of these algorithms are explicit outlier detection and/or robust estimation. Most distance-based methods are designed for low dimensional data sets, that is, data in which the number of observations is greater than the number of variables, and the identification of outliers becomes more complicated as the dimension increases. On the other hand, data sets in many scientific fields are nowadays high dimensional. Although projection pursuit based methods can be applied in such situations, their computational difficulties make them impractical to use. A recent outlier detection method, PCOUT ([26]), which detects outliers very efficiently in high dimensional data sets, combines the advantages of distance-based and projection pursuit methods. In the following subsections, detailed information on the BACON and PCOUT methods is provided, since these algorithms are used to build the new robust PLS algorithms in Section 3.3.
3.2.1 BACON

The BACON algorithm ([5]) is a distance-based method that starts with an outlier-free subset of the data (the initial basic subset), from which robust distances can be calculated. The initial basic subset can be found algorithmically in one of two ways: from Mahalanobis distances based on the classical mean and covariance estimates (Version 1), or from Euclidean distances from the coordinate-wise medians (Version 2). The advantages of Version 1 are its affine equivariance and its low computational cost. Then, based on the mean and covariance matrix of the current basic subset, discrepancies are computed, and all observations with discrepancy less than the corrected cutoff C_{npr} \chi_{p, \alpha/n} form the new basic subset, where \chi^2_{p, \alpha/n} is the 1 - \alpha/n percentile of the chi-square distribution with p degrees of freedom and

C_{npr} = C_{np} + C_{hr} = \max[0, (h - r)/(h + r)] + 1 + \frac{p+1}{n-p} + \frac{1}{n-h-p}

where h = [(n + p + 1)/2] and r is the size of the current basic subset. This iterative process is repeated until the size of the basic subset, r, no longer changes. The observations excluded from the final basic subset are nominated as outliers. The BACON estimators of the location vector (\tilde{\mu}) and the covariance matrix (\tilde{\Sigma}), based on this final subset, are employed to calculate the robust Mahalanobis distance vector d^B, whose ith element, for i = 1, 2, \ldots, n, is given as

d_i^B = d(x_i; \tilde{\mu}, \tilde{\Sigma}) = \sqrt{(x_i - \tilde{\mu})' \tilde{\Sigma}^{-1} (x_i - \tilde{\mu})}    (3.1)

where x_i \in \mathbb{R}^p is the ith row of the data matrix X. Simulation studies demonstrate that BACON is a robust technique with a 40% breakdown point. The method is fast, easy to implement, and thus practical for data sets of even millions of cases.
Although the BACON algorithm was originally designed for low dimensional data, an extension to high dimensional data is also possible. A simple solution is to run the BACON algorithm on a reduced data set \tilde{X} of size n \times p^*, where \tilde{X} is the score matrix based on the spectral decomposition of a covariance matrix estimate and p^* denotes the number of score vectors retained in the reduced data set. The robust distance vector d^B is then calculated based on \tilde{X}.
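The iteration described above can be sketched as follows. This is a simplified illustration, not the full BACON implementation: the Version-2 start is used, the chi-square percentile is hard-coded for this two-dimensional example, and a fixed iteration cap replaces the formal stopping rule.

```python
import numpy as np

def bacon(X, chi=3.91, niter=20):
    """Simplified BACON sketch (Version-2 start).  chi is hard-coded to an
    approximate chi_{2, alpha/n} percentile for this p = 2 example; a full
    implementation would compute it from alpha, p and n."""
    n, p = X.shape
    h = (n + p + 1) // 2
    # Initial basic subset: the h points closest (in Euclidean distance)
    # to the coordinate-wise median.
    d0 = np.linalg.norm(X - np.median(X, axis=0), axis=1)
    subset = np.sort(np.argsort(d0)[:h])
    for _ in range(niter):
        mu = X[subset].mean(axis=0)
        S = np.cov(X[subset], rowvar=False)
        diff = X - mu
        d = np.sqrt(np.einsum('ij,ij->i', np.linalg.solve(S, diff.T).T, diff))
        r = len(subset)
        cnpr = (max(0.0, (h - r) / (h + r)) + 1.0
                + (p + 1) / (n - p) + 1.0 / (n - h - p))
        new = np.flatnonzero(d < cnpr * chi)   # corrected cutoff C_npr * chi
        if np.array_equal(new, subset):        # subset stopped changing
            break
        subset = new
    return np.setdiff1d(np.arange(n), subset)  # nominated outliers

rng = np.random.default_rng(7)
X = np.vstack([rng.standard_normal((100, 2)),
               rng.standard_normal((5, 2)) + 10.0])   # 5 planted outliers
flagged = bacon(X)
```

On this synthetic data the basic subset grows to cover the clean points while the five planted outliers remain excluded and are nominated as outliers.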
3.2.2 PCOUT

PCOUT is a recent outlier identification algorithm that is particularly effective in high dimensions ([26]). PCOUT is based on the idea of detecting outliers in the principal component space, which does not require matrix inversion. The method starts by robustly sphering the original data columnwise, using the median and the median absolute deviation (MAD), for j = 1, 2, \ldots, p:

x_{ij}^* = \frac{x_{ij} - \mathrm{median}_i(x_{ij})}{1.4826 \, \mathrm{mad}_i(x_{ij})}    (3.2)

where

\mathrm{mad}_i(x_{ij}) = \mathrm{median}_i(|x_{ij} - \mathrm{median}_i(x_{ij})|).    (3.3)

Then, PCA is applied to the sphered data matrix X^* = [x_{ij}^*], and p^* < p semi-robust components, Z = [z_1, z_2, \ldots, z_{p^*}], that contribute at least 99% of the total variance are retained in the analysis. The semi-robust component matrix Z is sphered similarly to (3.2), as

z_{ij}^* = \frac{z_{ij} - \mathrm{median}_i(z_{ij})}{1.4826 \, \mathrm{mad}_i(z_{ij})}.    (3.4)
The rest of the algorithm uses the sphered component matrix, Z? = [z?ij], to detect location
and scatter outliers. The location outlier detection is initiated by calculating the absolute
value of a robust kurtosis measure for each component:
kj = j1n
nX
i=1
z?ij4 ?3j (3.5)
where j = 1;2;:::;p?. Kurtosis coe?cient measures "heaviness" of distribution tails. Small
value of kurtosis is an indicator of large amount of asymmetric contamination, whereas
37
large value of kurtosis is an indicator of either symmetric contamination or small amount
of asymmetric contamination. Since the presence of location outliers is likely to cause the
kurtosis to become difierent from zero, the relative weight vj = kj=Pp?j=1 kj for each of the
sphered component, z?j, is used to reveal outliers. Robust distances, for detecting location
outliers, are calculated based on the weighted component matrix:
RDLi =
vu
uu
t
p?X
j=1
v2jz?ij2 (3.6)
and then the following equation:
dLi = RDLi
q
(?2p?;0:5)
mediani(RDLi ); (3.7)
suggested by Maronna and Zamar ([57]), is used to transform distances.
To detect scatter outliers, sphered component matrix, Z?, is directly used to calculate
robust distances, RDSi , by plugging vj = 1 for all components in (3.6). Similarly, dSi is
calculated using the transformation given in equation (3.7) by replacing RDLi by RDSi .
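Steps (3.2) through (3.7) for the location-outlier part can be sketched as follows. Only the transformed location distances are computed here; the downweighting step and the scatter part are omitted, and the helper name is ours, not from [26].

```python
import numpy as np
from scipy.stats import chi2, median_abs_deviation

def pcout_location_distances(X):
    """Sketch of PCOUT's transformed location-outlier distances, (3.2)-(3.7)."""
    # (3.2)-(3.3): robust sphering by columnwise medians and MADs
    Xs = (X - np.median(X, axis=0)) / (1.4826 * median_abs_deviation(X, axis=0))
    # PCA on the sphered data; keep components covering at least 99% of variance
    Xc = Xs - Xs.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    p_star = np.searchsorted(np.cumsum(s**2) / np.sum(s**2), 0.99) + 1
    Z = Xc @ Vt[:p_star].T
    # (3.4): sphere the semi-robust components
    Zs = (Z - np.median(Z, axis=0)) / (1.4826 * median_abs_deviation(Z, axis=0))
    # (3.5): absolute robust kurtosis per component and relative weights v_j
    k = np.abs((Zs**4).mean(axis=0) - 3.0)
    v = k / k.sum()
    # (3.6): weighted robust distances
    RD = np.sqrt((v**2 * Zs**2).sum(axis=1))
    # (3.7): Maronna-Zamar transformation of the distances
    return RD * np.sqrt(chi2.ppf(0.5, p_star)) / np.median(RD)
```

Large transformed distances point to location outliers; the scatter distances follow the same pattern with all $v_j$ set to 1.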
After calculating the robust distances, the translated biweight function ([72]),

$$w(d;c,M) = \begin{cases} 1, & d < M \\ \Big[1-\Big(\dfrac{d-M}{c-M}\Big)^{2}\Big]^{2}, & M \le d < c \\ 0, & d \ge c \end{cases}$$
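The piecewise definition above translates directly into code. This is a minimal sketch, with $c$ and $M$ treated as given tuning constants satisfying $M<c$:

```python
import numpy as np

def translated_biweight(d, c, M):
    """Translated biweight weight function w(d; c, M):
    full weight below M, smooth descent on [M, c), zero at or beyond c."""
    d = np.asarray(d, dtype=float)
    w = np.zeros_like(d)
    w[d < M] = 1.0
    mid = (d >= M) & (d < c)
    w[mid] = (1.0 - ((d[mid] - M) / (c - M)) ** 2) ** 2
    return w
```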
An estimator $\hat\beta_n$ can be written as a functional, i.e., $T(F_n)=\hat\beta_n$. In this section, the notation $A^\top$ is used for the transpose of a matrix $A$, while $'$ is used as the derivative operator. Here $F_n$ is the empirical distribution function given by

$$F_{x,n}(t) = \frac{1}{n}\sum_{i=1}^{n} I_{(-\infty,t]}(x_i)$$
where $I_A(x)=1$ for $x\in A$ and $0$ otherwise. Thus, if the purpose is to obtain an estimator that is relatively unaffected by small shifts in the cumulative distribution function, this can be achieved by choosing an estimator represented by a continuous functional; an estimator with this property is said to have "qualitative robustness".

The second tool for robustness arises when two further restrictions, existence and boundedness of the functional derivative, are imposed so that small changes in the cumulative distribution do not result in large changes in the value of the functional $T$. In order to assess theoretically the influence that an observation $z_0$ has on the value of a statistical functional, the derivative of the functional, called the influence function, is used ([40]).

Definition 4.1 The influence function (IF) of a functional $T$ at $F\in\mathcal{F}$ is given by

$$IF(z_0;T,F) = \frac{d}{d\varepsilon}T(F_\varepsilon)\Big|_{\varepsilon=0} = \lim_{\varepsilon\downarrow 0}\frac{T(F_\varepsilon)-T(F)}{\varepsilon} = \lim_{\varepsilon\downarrow 0}\frac{T((1-\varepsilon)F+\varepsilon\delta_{z_0})-T(F)}{\varepsilon} \qquad (4.1)$$

at those $z_0$ where this limit exists. Here $\delta_{z_0}$ is the point mass distribution at $z_0$ and $\downarrow$ denotes the limit from the right.

The influence function given in (4.1) corresponds to the directional (Gateaux) derivative of $T$ at $F$ in the direction of $H=\delta_{z_0}-F$ and can also be written as

$$IF(z_0;T,F) = T'(H) = \frac{d}{d\varepsilon}T(F+\varepsilon H)\Big|_{\varepsilon=0}. \qquad (4.2)$$

The functional, and hence the estimator, is said to have "infinitesimal robustness" if $IF(z_0;T,F)$ is a bounded function of $z_0$. The influence function of $T$ can also be used to determine an explicit formula for the asymptotic variance of $T$ ([47]), since

$$\mathrm{Var}(T,F) \approx \frac{\int IF(x;T,F)^\top IF(x;T,F)\,dF(x)}{n} \qquad (4.3)$$

which can be estimated by

$$\widehat{\mathrm{Var}}(T,F) = \frac{\sum_{i=1}^{n} IF(x_i;T,F_{x,n})^\top IF(x_i;T,F_{x,n})}{n^{2}}. \qquad (4.4)$$
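The distinction between bounded and unbounded influence can be illustrated with a finite-sample analogue of (4.1), the sensitivity curve $n\,[T(x\cup z)-T(x)]$: for the sample mean it grows without bound in the contaminating point $z$, while for the sample median it stays bounded. The following is an illustrative sketch, not part of the dissertation's derivations.

```python
import numpy as np

def sensitivity_curve(stat, x, z):
    """Finite-sample analogue of the influence function:
    n * (stat of the sample with z appended minus stat of the sample)."""
    return len(x) * (stat(np.append(x, z)) - stat(x))

rng = np.random.default_rng(0)
x = rng.normal(size=101)
zs = [1.0, 10.0, 100.0, 1000.0]
sc_mean = [abs(sensitivity_curve(np.mean, x, z)) for z in zs]
sc_med = [abs(sensitivity_curve(np.median, x, z)) for z in zs]
```

As $z$ grows, `sc_mean` grows roughly linearly in $z$ while `sc_med` stabilizes, mirroring the bounded-influence criterion discussed above.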
The third tool, the breakdown point, addresses the notion of quantitative robustness; it can loosely be defined as the smallest fraction of the sample (with respect to $n$) that can render the estimator useless. It describes how greatly a small change in the underlying distribution $F$ changes the distribution of an estimator. The higher the breakdown point of an estimator, the more robust it is. In this chapter, the finite-sample version of the breakdown point given in Definition 4.2 is preferred because of its simplicity.

Definition 4.2 The finite-sample breakdown point, $\varepsilon^{*}(x,T)$, of an estimator $T$ at a sample $x=[x_1,x_2,\ldots,x_n]^\top$ is given by

$$\varepsilon^{*}(x,T) = \frac{1}{n}\min_m\{m : \sup_{\tilde x}\|T(x)-T(\tilde x)\| = \infty\} \qquad (4.5)$$

where $\tilde x$ is obtained by replacing $m$ ($1\le m\le n$) observations of $x$ by arbitrary observations.

The outline of this chapter is as follows. In Section 4.2 the influence function for the classical PLSR estimator of $\beta$ is given. The robustness properties of the RoPLS estimator of $\beta$ (its influence function in low dimensions, its empirical influence curve in the high-dimensional case, and its finite-sample breakdown properties) are discussed in Section 4.3.
4.2 Influence Function for the SIMPLS-Based Regression Estimator

In this section, the influence function for the SIMPLS estimator of $\beta$ is given. Assume that $x\in\mathbb{R}^p$ and $y\in\mathbb{R}$ are centered random variables. Let $F$ be the cumulative distribution function of the random vector $z=(x^\top,y)^\top$ and let $d=p+1$. Then the covariance matrix of $z$ is given by

$$\Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \sigma_{yy} \end{pmatrix}.$$

A functional $S$ can be defined on a suitable class of probability distributions $\mathcal{F}$, mapping an arbitrary distribution $G\in\mathcal{F}$ into a positive definite symmetric matrix $S(G)=E_G(zz^\top)$, under the assumption that $S(F)=E_F(zz^\top)=\Sigma$; that is, $S$ is Fisher-consistent for $\Sigma$ at $F$. Using the functional $S$, other functionals can also be defined:

$$S_{xx}(G) = E_G(xx^\top) = [I_p : 0]\,S(G)\,[I_p : 0]^\top$$
$$S_{xy}(G) = E_G(xy) = [I_p : 0]\,S(G)\,e_d$$

where $0$ is the $p\times 1$ vector of zeros and $e_d$ is the $d$th standard basis vector of $\mathbb{R}^d$. Fisher consistency of $S$ at $F$ implies Fisher consistency of $S_{xy}$ at $F$ for $\Sigma_{xy}$, that is, $S_{xy}(F)=E_F(xy)=\Sigma_{xy}$. Similarly, $S_{yx}(G)=E_G(yx^\top)=S_{xy}(G)^\top$, and $S_{yx}$ is also Fisher-consistent for $\Sigma_{yx}$ at $F$. Since the existence of second moments guarantees the existence of the functional $S$, it is assumed that $\mathcal{F}$ consists of probability distributions $G$ for which the positive definite matrix $E_G(zz^\top)$ exists, which requires the existence of second moments.

Throughout this section, the functional for an estimator is denoted by the parameter being estimated unless stated otherwise. $\lambda_h=\sigma_h^{2}$ and $\gamma_h$ stand for the eigenvalues and eigenvectors of $\Sigma_{yx}^{h-1}\Sigma_{xy}^{h-1}$, respectively, satisfying $\Sigma_{xy}^{h-1}\gamma_h=\sigma_h r_h$ for $h=1,2,\ldots,k$. Since $S(F)=\Sigma$, we have $\lambda_h(F)=\lambda_h$, $\gamma_h(F)=\gamma_h$ and $r_h(F)=r_h$.

The functional for the SIMPLS estimator of $\beta$ for $h$ components is given by

$$\hat\beta_h(G) = R_h(G)R_h(G)^\top S_{xy}(G) \qquad (4.6)$$

where $R_h(G)=[r_1(G),r_2(G),\ldots,r_h(G)]$ and $G\in\mathcal{F}$. For the sake of clarity, the influence function of $\hat\beta_h$ is studied in three steps. In the first step, the influence functions of $S_{yx}^{0}S_{xy}^{0}$ and of the first PLS weight vector, $r_1$, are derived. In the second step, a similar approach is followed to find the influence function of the $h$th PLS weight vector, $r_h$. Finally, in the third step, the influence functions of the scaled PLS weight vector $\tilde r_h$ and of the PLS slope estimator $\hat\beta_h$ are given.
Step 1: Influence functions for $S_{yx}S_{xy}$ and $r_1$

Since $S_{xy}^{0}\gamma_1=\sigma_1 r_1$, determining the influence function of $r_1$ requires the influence functions of $\sigma_1$ and $\gamma_1$, which in turn require the influence function of $S_{yx}^{0}S_{xy}^{0}$. The following lemmas are used to obtain the influence function for $S_{yx}^{0}S_{xy}^{0}=S_{yx}S_{xy}$.

Lemma 4.1 The influence function of $S$ at $F\in\mathcal{F}$ is

$$IF(z_0;S,F) = z_0 z_0^\top - \Sigma. \qquad (4.7)$$

Corollary 4.1 Let $z_0=(u^\top,v)^\top$ be an arbitrary $d$-dimensional point. Then,

i. $IF(z_0;S_{xy},F) = uv-\Sigma_{xy} = IF(z_0;S_{yx},F)^\top$,
ii. $IF(z_0;S_{xx},F) = uu^\top-\Sigma_{xx}$,
iii. $IF(z_0;S_{yx}S_{xy},F) = 2[vu^\top\Sigma_{xy}-\Sigma_{yx}\Sigma_{xy}]$.

Proof:
i. $IF(z_0;S_{xy},F) = [I_p:0]\,IF(z_0;S,F)\,e_d = uv-\Sigma_{xy}$, and
$IF(z_0;S_{yx},F) = e_d^\top\,IF(z_0;S,F)\,[I_p:0]^\top = vu^\top-\Sigma_{yx} = IF(z_0;S_{xy},F)^\top$,
ii. $IF(z_0;S_{xx},F) = [I_p:0]\,IF(z_0;S,F)\,[I_p:0]^\top = uu^\top-\Sigma_{xx}$,
iii. $IF(z_0;S_{yx}S_{xy},F) = IF(z_0;S_{yx},F)\Sigma_{xy}+\Sigma_{yx}IF(z_0;S_{xy},F) = \{vu^\top-\Sigma_{yx}\}\Sigma_{xy}+\Sigma_{yx}\{uv-\Sigma_{xy}\} = 2[vu^\top\Sigma_{xy}-\Sigma_{yx}\Sigma_{xy}]$. $\Box$

For the influence functions of the eigenvalue and the eigenvector of $S_{yx}S_{xy}$ ($\lambda_1$ and $\gamma_1$), the following lemmas are used.

Lemma 4.2 (Sibson, [77]) Let $A$ and $B$ be two symmetric matrices, $\lambda_i$ the $i$th simple eigenvalue of $A$, and $v_i$ the associated eigenvector of unit length. Let $A$ be perturbed to

$$A(\varepsilon) = A+\varepsilon B+o(\varepsilon^{2})$$

and assume that the corresponding perturbations of $\lambda_i$ and $v_i$ are

$$\lambda_i(\varepsilon) = \lambda_i+\varepsilon\dot\lambda_i+o(\varepsilon^{2}), \qquad v_i(\varepsilon) = v_i+\varepsilon\dot v_i+o(\varepsilon^{2}).$$

Then $\dot\lambda_i = v_i^\top B v_i$ and $\dot v_i = -(A-\lambda_i I)^{+}Bv_i = -\Big[\sum_{j\ne i}\dfrac{v_j v_j^\top}{\lambda_j-\lambda_i}\Big]Bv_i$.
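Lemma 4.2's first-order expansion is easy to check numerically: for small $\varepsilon$, the shift in $\lambda_i$ should agree with $\varepsilon\,v_i^\top B v_i$ up to $O(\varepsilon^2)$. An illustrative check on an arbitrary symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
# Symmetric A with well-separated eigenvalues, and a symmetric perturbation B
Q = np.linalg.qr(rng.normal(size=(3, 3)))[0]
A = Q @ np.diag([1.0, 3.0, 7.0]) @ Q.T
B = rng.normal(size=(3, 3))
B = (B + B.T) / 2.0

eps = 1e-6
lam, V = np.linalg.eigh(A)
lam_eps = np.linalg.eigh(A + eps * B)[0]
# First-order prediction from Lemma 4.2: lambda_i(eps) ~ lambda_i + eps v_i' B v_i
pred = lam + eps * np.einsum('ji,jk,ki->i', V, B, V)
```

The agreement is to within $O(\varepsilon^2)$, which is what the lemma asserts.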
Lemma 4.3 (Croux and Haesbroeck, [12]) Let $C:\mathcal{F}\to PDS(m)$ be a statistical functional, $F$ an $m$-dimensional distribution and $z_0\in\mathbb{R}^m$. Suppose that $IF(z_0;C,F)$ exists and $C(F)=\Theta$. Denote the $i$th simple eigenvalue and eigenvector of $\Theta$ by $\lambda_i$ and $\gamma_i$, respectively. Then the influence functions of the functionals $\lambda_i$ and $\gamma_i$ at $F\in\mathcal{F}$ are given by

$$IF(z_0;\lambda_i,F) = \gamma_i^\top IF(z_0;C,F)\gamma_i \qquad (4.8)$$
$$IF(z_0;\gamma_i,F) = \Big[\sum_{j\ne i}\frac{\gamma_j\gamma_j^\top}{\lambda_i-\lambda_j}\Big]IF(z_0;C,F)\gamma_i \qquad (4.9)$$

Proof: The perturbation of $\Theta$, $\Theta(\varepsilon)=C(F_\varepsilon)$, can be approximated as

$$C(F_\varepsilon) \approx C(F)+\int IF(z_0;C,F)\,d(F_\varepsilon-F)+o(\varepsilon^{2}) = \Theta+\varepsilon\,IF(z_0;C,F)+o(\varepsilon^{2})$$

([47] and [54]). Similarly, $\lambda_i$ and $\gamma_i$ can be written as

$$\lambda_i(F_\varepsilon) = \lambda_i+\varepsilon\dot\lambda_i+o(\varepsilon^{2}) = \lambda_i+\varepsilon\,IF(z_0;\lambda_i,F)+o(\varepsilon^{2})$$
$$\gamma_i(F_\varepsilon) = \gamma_i+\varepsilon\dot\gamma_i+o(\varepsilon^{2}) = \gamma_i+\varepsilon\,IF(z_0;\gamma_i,F)+o(\varepsilon^{2}),$$

where $\dot\lambda_i$ and $\dot\gamma_i$ are obtained from Lemma 4.2 as

$$\dot\lambda_i = IF(z_0;\lambda_i,F) = \gamma_i^\top IF(z_0;C,F)\gamma_i$$
$$\dot\gamma_i = IF(z_0;\gamma_i,F) = -\Big[\sum_{j\ne i}\frac{\gamma_j\gamma_j^\top}{\lambda_j-\lambda_i}\Big]IF(z_0;C,F)\gamma_i = \Big[\sum_{j\ne i}\frac{\gamma_j\gamma_j^\top}{\lambda_i-\lambda_j}\Big]IF(z_0;C,F)\gamma_i. \quad\Box$$

Lemma 4.3 demonstrates that once $IF(z_0;S_{yx}S_{xy},F)$ is known, the influence functions for all PLS weight vectors, and consequently the influence function of the PLS slope estimator, can be determined. Using Lemma 4.3, it is easily seen that the influence function of $\gamma_1$, the eigenvector of $S_{yx}S_{xy}$, is $0$. Since $\gamma_1=1$, the influence function of $\lambda_1$ can be given as

$$IF(z_0;\lambda_1,F) = IF(z_0;S_{yx}S_{xy},F). \qquad (4.10)$$

Using $IF(z_0;S_{yx}S_{xy},F)$ given in Corollary 4.1 and $\Sigma_{xy}=\sqrt{\lambda_1}\,r_1$, (4.10) can be rewritten as

$$IF(z_0;\lambda_1,F) = 2[vu^\top\Sigma_{xy}-\Sigma_{yx}\Sigma_{xy}] = 2\sqrt{\lambda_1}\,[vu^\top r_1-\sqrt{\lambda_1}]. \qquad (4.11)$$

Furthermore, $\sigma_1=\sqrt{\lambda_1}$, so (4.11) can be written in terms of $\sigma_1$ as

$$IF(z_0;\lambda_1,F) = IF(z_0;\sigma_1^{2},F) = \Big[\frac{d}{d\sigma_1}\sigma_1^{2}\Big]IF(z_0;\sigma_1,F) = 2\sigma_1[vu^\top r_1-\sigma_1] \qquad (4.12)$$

which implies

$$IF(z_0;\sigma_1,F) = vu^\top r_1-\sigma_1 = vr_1^\top u-\sigma_1. \qquad (4.13)$$

The functional for the first PLS weight vector, $r_1$, satisfies $S_{xy}(F)\gamma_1(F)=\sigma_1(F)r_1(F)$, because it is the left singular vector of $S_{xy}$. Hence the influence function for $r_1$, given next, follows immediately from $IF(z_0;S_{xy},F)=IF(z_0;\sigma_1,F)\,r_1+\sigma_1\,IF(z_0;r_1,F)$.

Corollary 4.2 The influence function of the first weight vector, $r_1$, at $F\in\mathcal{F}$ is

$$IF(z_0;r_1,F) = \frac{v}{\sigma_1}\,[I_p-r_1r_1^\top]\,u. \qquad (4.14)$$

It can be seen from (4.14) that $IF(z_0;r_1,F)$ is unbounded, since it is a function of the arbitrary values of $u$ and $v$.
Step 2: Influence functions for $S_{yx}^{h-1}S_{xy}^{h-1}$ and $r_h$

In general, for $h>1$ components, the $h$th PLS weight vector is obtained as the left singular vector of $\Sigma_{xy}^{h-1}$. Thus, the functional for the $h$th PLS weight vector, $r_h$, satisfies $S_{xy}^{h-1}(F)\gamma_h(F)=\sigma_h(F)r_h(F)$, where $S_{xy}^{h-1}=[I_p-V_{h-1}V_{h-1}^\top]S_{xy}$ with $S_{xy}^{0}=S_{xy}$. Here, the columns of $V_{h-1}=[v_1,v_2,\ldots,v_{h-1}]$ form an orthonormal basis for the space spanned by $[p_1,p_2,\ldots,p_{h-1}]$, where $p_i\propto\Sigma_{xx}r_i$ for $i=1,2,\ldots,h-1$.

The influence function for $S_{yx}^{h-1}S_{xy}^{h-1}$, which is the same as the influence function for $\lambda_h$, is

$$IF(z_0;S_{yx}^{h-1}S_{xy}^{h-1},F) = IF(z_0;\lambda_h,F) = IF(z_0;S_{yx}[I_p-\Pi_{h-1}]S_{xy},F) \qquad (4.15)$$

where $\Pi_{h-1}=V_{h-1}V_{h-1}^\top=\sum_{i=1}^{h-1}v_iv_i^\top$. Using the product rule, (4.15) can be rewritten as

$$IF(z_0;\lambda_h,F) = 2\,IF(z_0;S_{yx},F)[I_p-\Pi_{h-1}]\Sigma_{xy} - \Sigma_{yx}\,IF(z_0;\Pi_{h-1},F)\,\Sigma_{xy}. \qquad (4.16)$$

Since $2\sigma_h\,IF(z_0;\sigma_h,F)=IF(z_0;\lambda_h,F)$, after plugging $IF(z_0;S_{yx},F)$ into (4.16), the influence function for $\sigma_h$ can be derived as

$$IF(z_0;\sigma_h,F) = vr_h^\top u-\sigma_h-\frac{1}{2\sigma_h}\Sigma_{yx}\,IF(z_0;\Pi_{h-1},F)\,\Sigma_{xy}.$$

The next corollary gives the influence function for $r_h$, which requires the use of $IF(z_0;S_{xy}^{h-1},F)$ and $IF(z_0;\sigma_h,F)$.

Corollary 4.3 The influence function of the $h$th ($h>1$) weight vector, $r_h$, at $F\in\mathcal{F}$ is

$$IF(z_0;r_h,F) = \frac{1}{\sigma_h}\Big\{[I_p-r_hr_h^\top]vu - \Big[I_p-\frac{r_h}{2\sigma_h}\Sigma_{yx}\Big]IF(z_0;\Pi_{h-1},F)\,\Sigma_{xy} - \Pi_{h-1}uv\Big\}. \qquad (4.17)$$

Here, $IF(z_0;\Pi_{h-1},F)$ can be calculated recursively. This influence function is obtained directly by using the equality $IF(z_0;S_{xy}^{h-1},F)=IF(z_0;\sigma_h,F)\,r_h+IF(z_0;r_h,F)\,\sigma_h$. Therefore, the influence function for $r_h$ given in (4.17) depends explicitly on the arbitrary point $(u^\top,v)^\top$, as well as implicitly on the influence functions of all previous PLS weight vectors. Hence, $IF(z_0;r_h,F)$ in (4.17) is unbounded.

Step 3: Influence functions for $\tilde r_h$ and $\hat\beta_h$

The PLS weight vectors $r_h$, $h\ge 1$, should be scaled by the functional $\sqrt{r_h^\top S_{xx}r_h}$ to make the $h$th component have unit norm. The scaled version of the PLS weight vector, denoted $\tilde r_h$, has influence function of the form

$$IF(z_0;\tilde r_h,F) = \frac{[\tau(h)I_p-r_hr_h^\top\Sigma_{xx}]\,IF(z_0;r_h,F)}{\tau(h)^{3/2}} - \frac{r_hr_h^\top\,IF(z_0;S_{xx},F)\,r_h}{2\tau(h)^{3/2}} \qquad (4.18)$$

where $\tau(h)=r_h^\top\Sigma_{xx}r_h$ for $h\ge 1$.

Corollary 4.4 The influence function for the PLS slope estimator, represented by the functional $\hat\beta_h=\tilde R_h\tilde R_h^\top S_{xy}$ with $\tilde R_h=[\tilde r_1,\tilde r_2,\ldots,\tilde r_h]$ for $h$ components, at $F\in\mathcal{F}$ is

$$IF(z_0;\hat\beta_h,F) = IF(z_0;\tilde R_h,F)\tilde R_h^\top\Sigma_{xy} + \tilde R_h\,IF(z_0;\tilde R_h,F)^\top\Sigma_{xy} + \tilde R_h\tilde R_h^\top\,IF(z_0;S_{xy},F) \qquad (4.19)$$

where the $i$th column of $IF(z_0;\tilde R_h,F)$ is $IF(z_0;\tilde r_i,F)$ for $1\le i\le h$.

Similarly, $IF(z_0;\tilde r_h,F)$ in (4.18) is unbounded, because it depends on the influence function of $r_h$, which is unbounded. Finally, $\hat\beta_h$ has an unbounded influence function. The following example ([82]) demonstrates the unboundedness of the influence functions of the SIMPLS estimators.
Example 4.1
Assume that $z\sim N_3(\mu,\Sigma)$ where $\mu=(0,0,0)^\top$ and

$$\Sigma = \begin{pmatrix} 5 & 0.5 & 3 \\ 0.5 & 2 & 0.3\sqrt{3} \\ 3 & 0.3\sqrt{3} & 2 \end{pmatrix}.$$

The norms of the theoretical influence functions for $r_1$ given in (4.14) and for $\hat\beta_1=[\hat\beta_{11},\hat\beta_{12}]^\top$ given in (4.19) are calculated at $z_0=(u^\top,v)^\top$ where $u^\top=(i,0)$ and $v=j$, with $i$ and $j$ taking values between $-10$ and $10$. The unbounded shapes of the influence functions can be seen in Figure 4.1. Hence, it can be concluded that the SIMPLS estimator of the slope is not robust against outlying observations.
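The surface in Figure 4.1(a) can be reproduced directly from (4.14), since $\|IF(z_0;r_1,F)\| = (|v|/\sigma_1)\,\|(I_p-r_1r_1^\top)u\|$. The sketch below evaluates this norm for the stated $\Sigma$ (the partially garbled off-diagonal entry is read here as $0.3\sqrt{3}$, which is an assumption) and confirms that the norm grows linearly in both $u$ and $v$:

```python
import numpy as np

a = 0.3 * np.sqrt(3)          # assumed value of the garbled matrix entry
Sigma = np.array([[5.0, 0.5, 3.0],
                  [0.5, 2.0, a],
                  [3.0, a,   2.0]])
Sxy = Sigma[:2, 2]            # Sigma_xy: x is 2-dimensional, y the last coordinate
sigma1 = np.linalg.norm(Sxy)  # first singular value of Sigma_xy
r1 = Sxy / sigma1             # first PLS weight vector
P = np.eye(2) - np.outer(r1, r1)

def if_r1_norm(u, v):
    """Norm of IF(z0; r1, F) from (4.14) at z0 = (u', v)'."""
    return abs(v) / sigma1 * np.linalg.norm(P @ u)

small = if_r1_norm(np.array([1.0, 0.0]), 1.0)
large = if_r1_norm(np.array([10.0, 0.0]), 10.0)
```

Scaling both $u$ and $v$ by 10 scales the norm by exactly 100, which is the unboundedness visible in Figure 4.1.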
4.3 Robustness of the RoPLS Estimator of $\beta$

SIMPLS regression is a scale and orthogonal equivariant method ([43]). The first version of the BACON algorithm is scale and affine equivariant, and therefore orthogonal equivariant ([5]). The equivariance properties of BACON and SIMPLS guarantee the scale and orthogonal equivariance of the entire RoPLS1 (BACON-based RoPLS) estimator of $\beta$. Therefore, the RoPLS1 estimator $\tilde\beta$ computed from a transformed response vector $\tilde y=\alpha y$ ($\alpha\in\mathbb{R}\setminus\{0\}$) and data matrix $\tilde X=XA$ ($A$ a $p\times p$ orthogonal matrix) is given by

$$\tilde\beta = \alpha A^\top\hat\beta \qquad (4.20)$$

Figure 4.1: (a) The norm of the IF of $r_1$; (b) the norm of the IF of $\hat\beta_1$; (c) the IF of $\hat\beta_{11}$; (d) the IF of $\hat\beta_{12}$.

where $\hat\beta$ is the RoPLS1 estimator computed from the original data matrix $X$ and response vector $y$. However, (4.20) does not hold for the RoPLS2 (PCOUT-based RoPLS) estimator, since the coordinatewise median is not orthogonal equivariant. Because of this equivariance advantage over RoPLS2, only the robustness properties of the RoPLS1 estimator are explored in this section.
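The transformation law (4.20) can be verified numerically for any scale and orthogonal equivariant slope estimator. Since RoPLS1 itself is not reproduced here, the sketch below uses ordinary least squares, which enjoys the same equivariance, purely as a stand-in:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

def ols(X, y):
    # Closed-form equivariant slope estimator used as a stand-in for RoPLS1
    return np.linalg.lstsq(X, y, rcond=None)[0]

alpha = 2.5
A = np.linalg.qr(rng.normal(size=(p, p)))[0]   # random orthogonal matrix
beta = ols(X, y)
beta_t = ols(X @ A, alpha * y)                 # estimator on transformed data
```

Algebraically, $\hat\beta(XA,\alpha y)=\alpha A^\top\hat\beta(X,y)$ for any estimator with the stated equivariance, which the check below confirms up to floating-point error.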
4.3.1 Influence Functions for Low Dimension

The RoPLS1 algorithm, introduced in Chapter 3, is an iterative algorithm that starts with an initial estimator. This section is presented in two parts. In the first part, the influence functions for the initial estimators of the PLS weight vector and the slope are derived for $h\ge 1$ components. In the second part, after deriving the influence functions for the estimators of the PLS weight vector and the slope obtained in the $i$th iteration, it is demonstrated that the influence functions in the following iterations are directly related to the influence function of the initial estimators, where $0\le i\le a$ and $a$ is the iteration at which the algorithm converges.

In general, the functional for the $h$-component RoPLS1 estimator of $\beta$ obtained in the $i$th iteration is denoted by $\hat\beta_h^{(i)}$, while the estimators of $r_h$, $\sigma_h$ and $\lambda_h$ are represented by the functionals $r_h^{(i)}$, $\sigma_h^{(i)}$ and $\lambda_h^{(i)}$, respectively.

Part I: Influence Function for the Initial Estimator $\hat\beta_h^{(0)}$

The influence function of the initial estimator $\hat\beta_h^{(0)}$ is derived in three steps, similar to Section 4.2.

Figure 4.2: (a) Graph of $w^{*}$ for robust distances; (b) graph of $w^{*}$ for scaled errors $\varepsilon$.
Step 1: Influence functions for $S_{wyx}S_{wxy}$ and $r_1^{(0)}$

A functional $S_{w(0)}$ can be defined as $S_{w(0)}(G)=E_G\big(w_G^{(0)}(z)\,zz^\top\big)$ for a random vector $z=(x^\top,y)^\top$ and $G\in\mathcal{F}$, where $\mathcal{F}$ is the same class of $d$-variate probability distributions defined in Section 4.2. The weight functional $w_G^{(0)}(z)$ is

$$w_G^{(0)}(z) = w^{*}\big(d_z^{B}(G)\big) = w^{*}\Big(\sqrt{z^\top S_B(G)^{-1}z}\Big) \qquad (4.21)$$

where $S_B$ is the functional representing the BACON estimator of the covariance matrix of $z$ ($\Sigma$), whose range is the positive definite matrices, and $w^{*}(\cdot)$ is the function given in (3.11).

The weight function $w_G^{(0)}$ is a decreasing function of the distance, as can be seen from Figure 4.2(a), with $md(G)=\max\{1,\mathrm{median}(d_z^{B}(G))\}$. Furthermore, $0\le w_G^{(0)}(z)\le 1$ for any $G\in\mathcal{F}$. Existence of $S(G)=E_G(zz^\top)$ implies existence of $S_{w(0)}(G)$, since $w_G^{(0)}(z)$ is bounded for $G\in\mathcal{F}$. Therefore, the functionals $S_{wxy(0)}(G)=E_G\big(w_G^{(0)}(z)\,xy\big)$ and $S_{wxx(0)}(G)=E_G\big(w_G^{(0)}(z)\,xx^\top\big)$ exist, and for $h>0$, $S_{wxy(0)}^{h-1}(G)=\sigma_h^{(0)}(G)\,r_h^{(0)}(G)$. Moreover, $S_{w(0)}(G)$ is a non-negative definite matrix because $0\le w_G^{(0)}(z)$.

Throughout this section, the following are assumed to hold for $i\ge 0$:

1. $S_{w(i)}(F)=\Sigma$,
2. $S_B(F)=\Sigma$ and $IF(z_0;S_B^{-1},F)$ exists and is bounded,
3. $P_F\{z : q^\top z\ne 0 \text{ and } w_F^{(i)}(z)>0\}>0$ for every nonzero $q\in\mathbb{R}^d$,
4. $P_F\{z : \sqrt{z^\top\Sigma^{-1}z}=md(F)\}=0$.

Assumption (1) implies $S_{wxy(i)}(F)=\Sigma_{xy}$, $S_{wxx(i)}(F)=\Sigma_{xx}$, $r_h^{(i)}(F)=r_h$, $\sigma_h^{(i)}(F)=\sigma_h$, and $\lambda_h^{(i)}(F)=\lambda_h$. When assumption (2) holds, $d_z^{B\prime}(H)$ exists and is bounded. If for every $q\in\mathbb{R}^d$ the probability of $\{z : q^\top z\ne 0 \text{ and } w_G^{(i)}(z)>0\}$ is positive under $G$, then $q^\top S_{w(i)}(G)q>0$, implying that $S_{w(i)}(G)$ is positive definite. Therefore, assumption (3) is needed for the positive definiteness of $S_{w(i)}(G)$. The last assumption is required for the differentiability of $w_G^{(0)}$ over the support of the random vector $z$.

Similarly to Section 4.2, the influence functions of $S_{wyx}S_{wxy}=S_{wyx(0)}^{0}S_{wxy(0)}^{0}$ and of $\sigma_1^{(0)}$ are used to obtain the influence function of $r_1^{(0)}$. The notation $M'(H)$ is used to denote $IF(z_0;M,F)$, where $H=\delta_{z_0}-F$.
Lemma 4.4 The influence function of $S_{wyx}S_{wxy}$ at $F\in\mathcal{F}$ is

$$IF(z_0;S_{wyx}S_{wxy},F) = 2w_F^{(0)}(z_0)\,vu^\top\Sigma_{xy} - 2\Sigma_{yx}\Sigma_{xy} + 2C^\top\Sigma_{xy} \qquad (4.22)$$

where $C = E_F\Big(\dfrac{\partial w^{*}}{\partial d_z^{B}(F)}\,d_z^{B\prime}(H)\,xy\Big)$.

Proof:
The influence function for $S_{wxy}$ is

$$S_{wxy}'(H) = \frac{\partial}{\partial t}E_{F+tH}\big(w_{F+tH}^{(0)}(z)\,xy\big)\Big|_{t=0} = E_F\big(w^{(0)\prime}(H)\,xy\big) + \big\{w_F^{(0)}(z_0)\,uv-\Sigma_{xy}\big\}$$

where

$$w^{(0)\prime}(H) = \frac{\partial}{\partial t}w_{F+tH}^{(0)}(z)\Big|_{t=0} = \frac{\partial}{\partial t}w^{*}\big(d_z^{B}(F+tH)\big)\Big|_{t=0} = \frac{\partial w^{*}}{\partial d_z^{B}(F+tH)}\,\frac{\partial d_z^{B}(F+tH)}{\partial t}\Big|_{t=0}.$$

Therefore,

$$S_{wxy}'(H) = \big\{w_F^{(0)}(z_0)\,uv-\Sigma_{xy}\big\} + E_F\Big(\frac{\partial w^{*}}{\partial d_z^{B}(F)}\,d_z^{B\prime}(H)\,xy\Big)$$

with the derivative

$$\frac{\partial w^{*}(d)}{\partial d_z^{B}(F)} = \begin{cases} 0, & d<md(F) \\ -(d^{-2}), & d>md(F). \end{cases}$$

The derivative exists everywhere except at $d=md(F)$, and it is bounded. Assumption (4) guarantees the existence of the derivative over the support of $z$, whereas assumption (2) guarantees the boundedness of $d_z^{B\prime}(H)$. Therefore $E_F\big(\frac{\partial w^{*}}{\partial d_z^{B}(F)}\,d_z^{B\prime}(H)\,xy\big)$ exists. Let $C=E_F\big(\frac{\partial w^{*}}{\partial d_z^{B}(F)}\,d_z^{B\prime}(H)\,xy\big)$; then

$$(S_{wyx}S_{wxy})'(H) = S_{wyx}'(H)\,S_{wxy}(F) + S_{wyx}(F)\,S_{wxy}'(H)$$

which is equal to

$$\big\{w_F^{(0)}(z_0)\,vu^\top-\Sigma_{yx}\big\}\Sigma_{xy} + C^\top\Sigma_{xy} + \Sigma_{yx}\big\{w_F^{(0)}(z_0)\,uv-\Sigma_{xy}\big\} + \Sigma_{yx}C,$$

and this can be simplified to

$$(S_{wyx}S_{wxy})'(H) = 2w_F^{(0)}(z_0)\,vu^\top\Sigma_{xy} - 2\Sigma_{yx}\Sigma_{xy} + 2C^\top\Sigma_{xy}. \quad\Box$$

By Lemma 4.3, the influence function of $\lambda_1^{(0)}=\sigma_1^{(0)2}$ is the same as the influence function of $S_{wyx}S_{wxy}$, and $\Sigma_{xy}=\sigma_1 r_1$; thus the influence function of $\lambda_1^{(0)}$ is given as

$$\lambda_1^{(0)\prime}(H) = 2\sigma_1\,\sigma_1^{(0)\prime}(H) = 2\sigma_1\big\{w_F^{(0)}(z_0)\,vr_1^\top u-\sigma_1+C^\top r_1\big\}$$

and this yields the following:

$$\sigma_1^{(0)\prime}(H) = w_F^{(0)}(z_0)\,vr_1^\top u-\sigma_1+C^\top r_1 = w_F^{(0)}(z_0)\,vr_1^\top u-\sigma_1+r_1^\top C.$$
The following lemma gives the influence function of $r_1^{(0)}$.

Lemma 4.5 The influence function of $r_1^{(0)}$ at $F\in\mathcal{F}$ is

$$IF(z_0;r_1^{(0)},F) = \frac{1}{\sigma_1}\,[I_p-r_1r_1^\top]\,\big[w_F^{(0)}(z_0)\,vu+C\big] \qquad (4.23)$$

Proof:
Using the equality $S_{wxy}(F+tH)=\sigma_1^{(0)}(F+tH)\,r_1^{(0)}(F+tH)$ and taking derivatives with respect to $t$ on both sides, the following is obtained:

$$S_{wxy}'(H) = \sigma_1^{(0)\prime}(H)\,r_1 + \sigma_1\,r_1^{(0)\prime}(H). \qquad (4.24)$$

After plugging $S_{wxy}'(H)$ and $\sigma_1^{(0)\prime}(H)$ into (4.24), $r_1^{(0)\prime}(H)$ can be written as

$$r_1^{(0)\prime}(H) = \frac{1}{\sigma_1}\Big\{w_F^{(0)}(z_0)\,uv-\Sigma_{xy}+C - \big\{w_F^{(0)}(z_0)\,vr_1^\top u-\sigma_1+r_1^\top C\big\}\,r_1\Big\}$$

and, after simplification, the influence function of $r_1^{(0)}$ is obtained as

$$r_1^{(0)\prime}(H) = \frac{1}{\sigma_1}\,[I_p-r_1r_1^\top]\,\big[w_F^{(0)}(z_0)\,vu+C\big]. \quad\Box$$
The influence function of $r_1^{(0)}$ given in (4.23) consists of two parts. The first part, $\frac{1}{\sigma_1}[I_p-r_1r_1^\top]\,w_F^{(0)}(z_0)\,vu$, is directly related to the influence function of the classical PLS weight vector given in (4.14). The main difference between (4.14) and (4.23) is the presence of the additional weight, $w_F^{(0)}(z_0)$, in (4.23), which allows the first term of (4.23) to be bounded. The second term, $\frac{1}{\sigma_1}[I_p-r_1r_1^\top]\,C$, is obviously independent of $z_0$. Since $C$ exists, the second term is a finite-valued vector for given $F$. The following example demonstrates how $w_F^{(0)}(z_0)$ becomes smaller for extreme observations, so that it makes the first part of $r_1^{(0)\prime}(H)$ bounded.

Example 4.2
Consider the same setting as in Example 4.1: $z\sim N_3(\mu,\Sigma)$ where $\mu=(0,0,0)^\top$. In this example, it is shown that $w_F^{(0)}(z_0)$ is small for extreme $z_0=(u^\top,v)^\top$. Since $z^\top\Sigma^{-1}z$ has a $\chi^2$ distribution with $\mathrm{rank}(\Sigma\Sigma^{-1})=3$ degrees of freedom, $\sqrt{z^\top\Sigma^{-1}z}$ has a chi distribution with 3 degrees of freedom, whose median is approximately $1.54$; i.e., $md=\mathrm{median}\big(\sqrt{z^\top\Sigma^{-1}z}\big)\approx 1.54$. The weight function is calculated at $z_0$ with $u^\top=(i,0)$ and $v=j$, where $i$ and $j$ take values between $-10$ and $10$. Figure 4.3 shows the behavior of the weights for different $z_0$ values. Clearly, $w_F^{(0)}(z_0)$ takes values close to zero for extreme values of $z_0$.
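Example 4.2 can be mimicked with a representative weight of the form $w^{*}(d)=\min\{1,\,md/d\}$. The exact $w^{*}$ is the function (3.11) defined earlier in the dissertation, so this specific form, like the $0.3\sqrt{3}$ matrix entry, is an assumption made here for illustration only:

```python
import numpy as np
from scipy.stats import chi2

a = 0.3 * np.sqrt(3)   # assumed value of the garbled entry, as in Example 4.1
Sigma = np.array([[5.0, 0.5, 3.0],
                  [0.5, 2.0, a],
                  [3.0, a,   2.0]])
Sinv = np.linalg.inv(Sigma)
# Median of the chi distribution with 3 degrees of freedom (about 1.54)
md = max(1.0, np.sqrt(chi2.ppf(0.5, 3)))

def weight(z):
    """Representative decreasing weight w*(d) = min(1, md/d) at distance d(z)."""
    d = np.sqrt(z @ Sinv @ z)
    return min(1.0, md / d)

w_center = weight(np.array([0.5, 0.0, 0.5]))
w_extreme = weight(np.array([10.0, 0.0, 10.0]))
```

Points near the center keep full weight, while extreme points are driven toward zero, which is the behavior shown in Figure 4.3.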
Step 2: Influence functions for $S_{wyx(0)}^{h-1}S_{wxy(0)}^{h-1}$ and $r_h^{(0)}$

The functional for the $h$th PLS weight vector, $r_h^{(0)}$, satisfies $S_{wxy(0)}^{h-1}(F)=\sigma_h^{(0)}(F)\,r_h^{(0)}(F)$, where $S_{wxy(0)}^{h-1}=[I_p-V_{w(0)}^{h-1}V_{w(0)}^{h-1\top}]\,S_{wxy}$. Here, the columns of $V_{w(0)}^{h-1}$ form an orthonormal basis for the space spanned by $\{p_1^{(0)},p_2^{(0)},\ldots,p_{h-1}^{(0)}\}$. Therefore, the influence function for $S_{wxy(0)}^{h-1}$ is

$$S_{wxy(0)}^{h-1\prime}(H) = -\Pi_{w(0)}^{h-1\prime}(H)\,\Sigma_{xy} + [I_p-\Pi_{h-1}]\,S_{wxy}'(H) \qquad (4.25)$$

with $\Pi_{w(0)}^{h-1}=V_{w(0)}^{h-1}V_{w(0)}^{h-1\top}$ and $\Pi_{w(0)}^{h-1}(F)=\Pi_{h-1}$.

Figure 4.3: (a) Graph of $w_F^{(0)}(z_0)$ versus $(u_1,v)$; (b) graph of $w_F^{(0)}(z_0)$ versus robust distances.

Similarly, using the product rule, the influence function for $S_{wyx(0)}^{h-1}S_{wxy(0)}^{h-1}=S_{wyx}[I_p-\Pi_{w(0)}^{h-1}]S_{wxy}$ is obtained as

$$2\sigma_h\Big\{w_F^{(0)}(z_0)\,vr_h^\top u-\sigma_h+C^\top r_h-\frac{1}{2\sigma_h}\Sigma_{yx}\Pi_{w(0)}^{h-1\prime}(H)\Sigma_{xy}\Big\}$$

which yields the influence function of $\sigma_h^{(0)}$ as

$$\sigma_h^{(0)\prime}(H) = w_F^{(0)}(z_0)\,vr_h^\top u-\sigma_h+C^\top r_h-\frac{1}{2\sigma_h}\Sigma_{yx}\Pi_{w(0)}^{h-1\prime}(H)\Sigma_{xy}. \qquad (4.26)$$

The general form of the influence function of $r_h^{(0)}$ is given in the next lemma.

Lemma 4.6 The influence function of $r_h^{(0)}$ at $F\in\mathcal{F}$ is

$$r_h^{(0)\prime}(H) = \frac{1}{\sigma_h}\Big\{[I_p-r_hr_h^\top]\big[w_F^{(0)}(z_0)\,vu+C\big] - \Big[I_p-\frac{r_h}{2\sigma_h}\Sigma_{yx}\Big]\Pi_{w(0)}^{h-1\prime}(H)\Sigma_{xy}\Big\} - \frac{1}{\sigma_h}\Big\{\Pi_{h-1}\big[w_F^{(0)}(z_0)\,uv+C\big]\Big\}$$

where $\Pi_{w(0)}^{h-1\prime}(H)$ is calculated recursively.

Proof:
Using $S_{wxy(0)}^{h-1}=\sigma_h^{(0)}r_h^{(0)}$, taking derivatives of both sides and plugging in (4.25) and (4.26), we obtain

$$r_h^{(0)\prime}(H) = \frac{1}{\sigma_h}\Big\{[I_p-r_hr_h^\top]\big[w_F^{(0)}(z_0)\,vu+C\big] - \Big[I_p-\frac{r_h}{2\sigma_h}\Sigma_{yx}\Big]\Pi_{w(0)}^{h-1\prime}(H)\Sigma_{xy}\Big\} - \frac{1}{\sigma_h}\Big\{\Pi_{h-1}\big[w_F^{(0)}(z_0)\,uv+C\big]\Big\}. \quad\Box$$
Step 3: Influence functions for $\tilde r_h^{(0)}$ and $\hat\beta_h^{(0)}$

$\tilde r_h^{(0)}$ is the scaled version of $r_h^{(0)}$, that is,

$$\tilde r_h^{(0)} = \frac{r_h^{(0)}}{\sqrt{r_h^{(0)\top}S_{wxx}\,r_h^{(0)}}}$$

and, using the quotient rule, the influence function of $\tilde r_h^{(0)}$ is

$$\frac{[\tau(h)I_p-r_hr_h^\top\Sigma_{xx}]\,r_h^{(0)\prime}(H)}{\tau(h)^{3/2}} - \frac{r_hr_h^\top\,S_{wxx}'(H)\,r_h}{2\tau(h)^{3/2}}$$

where $\tau(h)=r_h^\top\Sigma_{xx}r_h$ and $S_{wxx}'(H)=\big[w_F^{(0)}(z_0)\,uu^\top-\Sigma_{xx}+C_x\big]$ with

$$C_x = E_F\Big(\frac{\partial w^{*}}{\partial d_z^{B}(F)}\,d_z^{B\prime}(H)\,xx^\top\Big).$$

$C_x$ exists under the same conditions under which $C$ exists. As with $r_1^{(0)}$, the influence function of $r_h^{(0)}$ is comparable to the influence function of the classical PLS weight vector, $r_h$, in (4.17). It consists of two parts: a weighted version of equation (4.17), and a part depending on $C$ and $C_x$, which are assumed to be a finite-valued vector and matrix, respectively. $r_h^{(0)\prime}(H)$ is bounded, and so is the influence function of $\tilde r_h^{(0)}$, as long as $r_j^{(0)\prime}(H)$ for $j=1,2,\ldots,h-1$ and $S_{wxx}'(H)$ are bounded. These conditions also imply the boundedness of the influence function for $\hat\beta_h^{(0)}$, given in the next lemma.

Lemma 4.7 The influence function for the initial RoPLS1 slope estimator for $h$ ($h\ge 1$) components, represented by the functional $\hat\beta_h^{(0)}=\tilde R_h^{(0)}\tilde R_h^{(0)\top}S_{wxy}$ with $\tilde R_h^{(0)}=[\tilde r_1^{(0)},\tilde r_2^{(0)},\ldots,\tilde r_h^{(0)}]$, at $F\in\mathcal{F}$ is

$$\hat\beta_h^{(0)\prime}(H) = \tilde R_h^{(0)\prime}(H)\,\tilde R_h^\top\Sigma_{xy} + \tilde R_h\,\tilde R_h^{(0)\prime}(H)^\top\Sigma_{xy} + \tilde R_h\tilde R_h^\top\big[w_F^{(0)}(z_0)\,uv-\Sigma_{xy}+C\big] \qquad (4.27)$$

where the $j$th column of $\tilde R_h^{(0)\prime}(H)$ is $\tilde r_j^{(0)\prime}(H)$ for $1\le j\le h$.
Part II: Influence Function for $\hat\beta_h^{(i)}$

Assume that the slope estimator for the $h$-component model at iteration $i-1$, $\hat\beta_h^{(i-1)}$, is given, where $i\ge 1$. In this part, the influence function for $\hat\beta_h^{(i)}$ is derived as in Part I.

Step 1: Influence functions for $S_{wyx(i)}S_{wxy(i)}$ and $r_1^{(i)}$

A functional $S_{w(i)}$ can be defined as $S_{w(i)}(G)=E_G\big(w_G^{(i)}(z)\,zz^\top\big)$ for a random vector $z=(x^\top,y)^\top$ and $G\in\mathcal{F}$, where $\mathcal{F}$ is the same class of $d$-variate probability distributions defined in Section 4.2. The weight functional $w_G^{(i)}(z)$, for $i\ge 1$, is

$$w_G^{(i)}(z) = \big(1-d_B(G)\big)\,w^{*}\!\left(\frac{\nu_h^{(i)}(G)}{\mathrm{mad}\big(\nu_h^{(i)}(G)\big)}\right) = \big(1-d_B(G)\big)\,w^{*}\big(\varepsilon_h^{(i)}(G)\big) \qquad (4.28)$$

where $\nu_h^{(i)}(G)=y-x^\top\hat\beta_h^{(i-1)}(G)$ and $d_B(G)$ is the normalized version of $d_x^{B}(G)=\sqrt{x^\top\big([I_p:0]\,S_B(G)\,[I_p:0]^\top\big)^{-1}x}$, which lies between 0 and 1.

$w^{*}\big(\varepsilon_h^{(i)}(G)\big)$ is a decreasing function of $|\varepsilon_h^{(i)}(G)|$ and lies between 0 and 1, as can be seen from Figure 4.2(b), where $m_{\varepsilon_h}^{(i)}(G)=\max\{1,\mathrm{median}(|\varepsilon_h^{(i)}(G)|)\}$, under the assumption that the marginal distribution of $\nu_h^{(i)}(G)=y-x^\top\hat\beta_h^{(i-1)}(G)$ is symmetric about 0. Similarly, $0\le 1-d_B(G)\le 1$ decreases as the distances in the $X$ space increase. Therefore, the weight function $w_G^{(i)}$ decreases with extreme residuals and/or large distances. Furthermore, $0\le w_G^{(i)}(z)\le 1$ for any $G\in\mathcal{F}$. Existence and non-negative definiteness of $S_{w(i)}$ can be shown as in Part I. The assumption made in this section, in addition to (1), (2), (3) and (4) given in Part I, is:

5. $P_F\{z : |\varepsilon_h^{(i)}|=m_{\varepsilon_h}^{(i)}(F)\}=0$.

This assumption is required for the differentiability of $w^{*}\big(\varepsilon_h^{(i)}(F)\big)$ over the support of the random vector $z$.
Similarly to the previous subsection, the influence function for $S_{wyx(i)}^{0}S_{wxy(i)}^{0}=S_{wyx(i)}S_{wxy(i)}$ is needed to determine the influence function of $r_1^{(i)}$ for $i\ge 1$.

Lemma 4.8 The influence function of $S_{wyx(i)}S_{wxy(i)}$ at $F\in\mathcal{F}$ is

$$IF(z_0;S_{wyx(i)}S_{wxy(i)},F) = 2w_F^{(i)}(z_0)\,vu^\top\Sigma_{xy} - 2\Sigma_{yx}\Sigma_{xy} + 2\Sigma_{yx}A_d^{i} + 2\Sigma_{yx}A_s^{i} + 2\Sigma_{yx}A_\beta^{i}\,\hat\beta_h^{(i-1)\prime}(H) \qquad (4.29)$$

with

$$A_d^{i} = E_F\big(-d_B'(H)\,w^{*}(\varepsilon_h^{(i)})\,xy\big), \qquad A_s^{i} = E_F\Big(-(1-d_B)\,\frac{\partial w^{*}}{\partial\varepsilon_h^{(i)}}\,\frac{\nu_h^{(i)}\,s'(H)}{s^{2}}\,xy\Big),$$
$$A_\beta^{i} = E_F\Big(-(1-d_B)\,\frac{\partial w^{*}}{\partial\varepsilon_h^{(i)}}\,\frac{y}{s}\,xx^\top\Big)$$

where $s=\mathrm{mad}\big(\nu_h^{(i)}(F)\big)$.

Proof:
The influence function for $S_{wxy(i)}$ is

$$S_{wxy(i)}'(H) = E_F\big(w^{(i)\prime}(H)\,xy\big) + E_H\big(w_F^{(i)}(z)\,xy\big) = E_F\big(w^{(i)\prime}(H)\,xy\big) + \big\{w_F^{(i)}(z_0)\,uv-\Sigma_{xy}\big\}$$

with

$$w^{(i)\prime}(H) = \frac{\partial}{\partial t}w_{F+tH}^{(i)}(z)\Big|_{t=0} = \frac{\partial}{\partial t}\Big\{\big(1-d_B(F+tH)\big)\,w^{*}\big(\varepsilon_h^{(i)}(F+tH)\big)\Big\}\Big|_{t=0}$$
$$= -d_B'(H)\,w^{*}(\varepsilon_h^{(i)}) + (1-d_B)\,\frac{\partial w^{*}}{\partial\varepsilon_h^{(i)}}\,\varepsilon_h^{(i)\prime}(H)$$

where $\varepsilon_h^{(i)}(F)=\nu_h^{(i)}(F)/s(F)=\varepsilon_h^{(i)}$ with scaling factor $s(F)=\mathrm{mad}\big(\nu_h^{(i)}(F)\big)=s$, $\nu_h^{(i)}(F)=y-x^\top\beta_h^{(i-1)}(F)=\nu_h^{(i)}$, $d_B(F)=d_B$, and

$$\varepsilon_h^{(i)\prime}(H) = \frac{\nu_h^{(i)\prime}(H)\,s-\nu_h^{(i)}\,s'(H)}{s^{2}} = \frac{-x^\top\hat\beta_h^{(i-1)\prime}(H)\,s-\nu_h^{(i)}\,s'(H)}{s^{2}}.$$

Therefore $S_{wxy(i)}'(H)$ can be written as

$$S_{wxy(i)}'(H) = A_d^{i}+A_s^{i}+A_\beta^{i}\,\hat\beta_h^{(i-1)\prime}(H) + \big\{w_F^{(i)}(z_0)\,uv-\Sigma_{xy}\big\}$$

with $A_d^{i}$, $A_s^{i}$ and $A_\beta^{i}$ as defined above. $A_d^{i}$ exists since $d_B'(H)\,w^{*}(\varepsilon_h^{(i)})$ is bounded. From Figure 4.2(b) it can be seen that $\frac{\partial w^{*}}{\partial\varepsilon_h^{(i)}}$ exists everywhere except at $\pm m_{\varepsilon_h}^{(i)}(F)$; hence $A_\beta^{i}$ exists because of assumption (5). $A_s^{i}$ exists if $A_\beta^{i}$ exists, since $s'(H)$, the influence function of the robust scale measure median absolute deviation, is known to be bounded. Using the product rule and the fact that $S_{wyx(i)}'(H)=S_{wxy(i)}'(H)^\top$, the following influence function is obtained for $S_{wyx(i)}S_{wxy(i)}$:

$$2w_F^{(i)}(z_0)\,vu^\top\Sigma_{xy} - 2\Sigma_{yx}\Sigma_{xy} + 2\Sigma_{yx}A_d^{i} + 2\Sigma_{yx}A_s^{i} + 2\Sigma_{yx}A_\beta^{i}\,\hat\beta_h^{(i-1)\prime}(H). \quad\Box$$

Since the influence function of $S_{wyx(i)}S_{wxy(i)}$ is equal to $2\sigma_1\,\sigma_1^{(i)\prime}(H)$,

$$\sigma_1^{(i)\prime}(H) = w_F^{(i)}(z_0)\,vr_1^\top u-\sigma_1+r_1^\top A_d^{i}+r_1^\top A_s^{i}+r_1^\top A_\beta^{i}\,\hat\beta_h^{(i-1)\prime}(H).$$

The next lemma gives the influence function for $r_1^{(i)}$.

Lemma 4.9 The influence function of $r_1^{(i)}$ at $F$ is

$$IF(z_0;r_1^{(i)},F) = \frac{1}{\sigma_1}\big[I_p-r_1r_1^\top\big]\big[w_F^{(i)}(z_0)\,uv+A_d^{i}+A_s^{i}+A_\beta^{i}\,\hat\beta_h^{(i-1)\prime}(H)\big] \qquad (4.30)$$

The proof can be given in a way similar to that of Lemma 4.5.
Step 2: Influence function for $r_h^{(i)}$

The next lemma gives the influence function for $r_h^{(i)}$; the proof is similar to that of Lemma 4.6.

Lemma 4.10 The influence function of $r_h^{(i)}$ at $F\in\mathcal{F}$ is

$$r_h^{(i)\prime}(H) = \frac{1}{\sigma_h}\Big\{[I_p-r_hr_h^\top]\big[w_F^{(i)}(z_0)\,vu+A_d^{i}+A_s^{i}+A_\beta^{i}\,\hat\beta_h^{(i-1)\prime}(H)\big]\Big\} - \frac{1}{\sigma_h}\Big\{\Big[I_p-\frac{r_h}{2\sigma_h}\Sigma_{yx}\Big]\Pi_{w(i)}^{h-1\prime}(H)\Sigma_{xy}\Big\} - \frac{1}{\sigma_h}\Big\{\Pi_{h-1}\big[w_F^{(i)}(z_0)\,uv+A_d^{i}+A_s^{i}+A_\beta^{i}\,\hat\beta_h^{(i-1)\prime}(H)\big]\Big\}$$

where $\Pi_{w(i)}^{h-1\prime}(H)$ is calculated recursively.
Step 3: Influence functions for $\tilde r_h^{(i)}$ and $\hat\beta_h^{(i)}$

$\tilde r_h^{(i)}$ is the scaled version of $r_h^{(i)}$, that is,

$$\tilde r_h^{(i)} = \frac{r_h^{(i)}}{\sqrt{r_h^{(i)\top}S_{wxx(i)}\,r_h^{(i)}}}$$

and, using the quotient rule, the influence function of $\tilde r_h^{(i)}$ is

$$\frac{[\tau(h)I_p-r_hr_h^\top\Sigma_{xx}]\,r_h^{(i)\prime}(H)}{\tau(h)^{3/2}} - \frac{r_hr_h^\top\,S_{wxx(i)}'(H)\,r_h}{2\tau(h)^{3/2}}$$

where $\tau(h)=r_h^\top\Sigma_{xx}r_h$ and $S_{wxx(i)}'(H)=\big[w_F^{(i)}(z_0)\,uu^\top-\Sigma_{xx}+A_{xd}^{i}+A_{xs}^{i}+A_{x\beta}^{i}(H)\big]$ with

$$A_{xd}^{i} = E_F\big(-d_B'(H)\,w^{*}(\varepsilon_h^{(i)})\,xx^\top\big), \qquad A_{xs}^{i} = E_F\Big(-(1-d_B)\,\frac{\partial w^{*}}{\partial\varepsilon_h^{(i)}}\,\frac{\nu_h^{(i)}\,s'(H)}{s^{2}}\,xx^\top\Big),$$
$$A_{x\beta}^{i}(H) = E_F\Big(-(1-d_B)\,\frac{\partial w^{*}}{\partial\varepsilon_h^{(i)}}\,\frac{x^\top\hat\beta_h^{(i-1)\prime}(H)}{s}\,xx^\top\Big).$$

$A_{xd}^{i}$, $A_{xs}^{i}$ and $A_{x\beta}^{i}$ exist under the same conditions under which $A_d^{i}$, $A_s^{i}$ and $A_\beta^{i}$ exist. Finally, the influence function for $\hat\beta_h^{(i)}$ is given in the next lemma.

Lemma 4.11 The influence function for the RoPLS1 slope estimator at the $i$th iteration for $h$ ($h\ge 1$) components, represented by the functional $\hat\beta_h^{(i)}=\tilde R_h^{(i)}\tilde R_h^{(i)\top}S_{wxy(i)}$ with $\tilde R_h^{(i)}=[\tilde r_1^{(i)},\tilde r_2^{(i)},\ldots,\tilde r_h^{(i)}]$, at $F\in\mathcal{F}$ is

$$\hat\beta_h^{(i)\prime}(H) = \tilde R_h^{(i)\prime}(H)\,\tilde R_h^\top\Sigma_{xy} + \tilde R_h\,\tilde R_h^{(i)\prime}(H)^\top\Sigma_{xy} + \tilde R_h\tilde R_h^\top\big[w_F^{(i)}(z_0)\,uv-\Sigma_{xy}+A_d^{i}+A_s^{i}+A_\beta^{i}\,\hat\beta_h^{(i-1)\prime}(H)\big] \qquad (4.31)$$

where the $j$th column of $\tilde R_h^{(i)\prime}(H)$ is $\tilde r_j^{(i)\prime}(H)$ for $1\le j\le h$.

The boundedness of $\hat\beta_h^{(i)\prime}(H)$ for $i\ge 1$ can be proven by induction. For $i=1$, since $w_F^{(1)}$ becomes smaller for observations lying far from the bulk of the data (large $d_B$) and/or observations that are not fitted well by the model (large residuals), the influence function given in (4.30) is bounded as long as $\hat\beta_h^{(0)\prime}(H)$ is bounded. Therefore, $r_1^{(1)\prime}(H)$ has a bounded influence function. Similarly, $S_{wxx(1)}$ has a bounded influence function. These facts imply the boundedness of $\hat\beta_h^{(1)\prime}(H)$.

If we assume that $\hat\beta_h^{(i)\prime}(H)$ is bounded, the same argument shows that $\hat\beta_h^{(i+1)\prime}(H)$ is bounded. So it can be concluded that, under the conditions described in Parts I and II, the influence function for the $h$-component RoPLS1 slope estimator, determined implicitly, exists and the estimator has infinitesimal robustness.
4.3.2 Empirical In uence Function for High Dimension
In uence function obtained in Section 4.3 is shown to be bounded, which demonstrates
the robustness of RoPLS1 for low dimensional data. For high dimensional case with n
observations, the following empirical in uence function, deflned as
^fl(e?)? ^fl(?)
1=n ; (4.32)
is used where e? is the contaminated data set obtained from varying one observation of ?,
^fl(e?) and ^fl(?) are the estimated slope vectors for the contaminated and the clean data,
respectively. Data are generated by the same simulation setting described in Chapter 3
where the error terms are generated from standard normal distributions, that is, " ? N(0;1).
The contamination added to a randomly chosen observation by varying the explanatory
variable and response variable between ?50 to 50. For each contamination level, the norm
of the empirical in uence function given in (4.32) is calculated for fn,p,kg=f20,200,3g and
f25,125,2g . Then three dimensional plots in Figure 4.4 are constructed. Figure 4.4 (a) and
(c) clearly illustrate non-robustness of the SIMPLS estimator. However, empirical in uence
curves for RoPLS1 estimator are clearly bounded which can be seen in Figure 4.4 (b) and
(d). The maximal norms of the empirical in uence functions for the SIMPLS, RoPLS1,
PRM, and RSIMPLS estimators of fl are summarized in Table 4.1. RoPLS1 yielded the
smallest upper bound for the norm of the empirical in uence function.
fn,p,kgn Method SIMPLS RoPLS1 PRM RSIMPLS
f20,200,3g 218.13 0.16 1.48 0.62
f25,125,2g 88.74 0.17 0.93 1.76
Table 4.1: The maximal norms for SIMPLS, RoPLS1, PRM, and RSIMPLS estimators.
Figure 4.4: Norms of the empirical influence functions for (a) SIMPLS, n=20, p=200, k=3; (b) RoPLS1, n=20, p=200, k=3; (c) SIMPLS, n=25, p=125, k=2; (d) RoPLS1, n=25, p=125, k=2.
Figure 4.5: The finite-sample breakdown values of the SIMPLS, RoPLS1 and PRM estimators for (a) {n,p} = {30,6}; (b) {n,p} = {20,200}.
4.3.3 Finite-Sample Breakdown Properties of the RoPLS1 Estimator
The resistance of a robust statistical method to groups of outliers is another important property, measured by the breakdown point. In this section, the finite-sample breakdown value given in (4.5) is investigated for the RoPLS1 estimator. After X and y are generated as in Section 4.3.2, various amounts of contamination are added to the generated data by replacing the first i observations (i = 1,2,...,n/2) of the response variable with 50. For each amount of contamination, the norm of the difference between the slope estimates for the contaminated and the clean data is calculated, and a plot of this norm versus the contamination percentage is constructed. The sample size and the number of variables are taken as {n,p} = {30,6} and {20,200} for a k = 2 component PLS model. It is clear from Figure 4.5 that the SIMPLS estimator of β is not robust, whereas the RoPLS1 estimator copes with up to 43% of irregular observations in the low dimensional case and about 40% in the high dimensional case. PRM yields results comparable with RoPLS1.
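The breakdown experiment itself can be sketched as follows, with a plain one-component PLS slope standing in as a hypothetical substitute for SIMPLS; a robust fit such as RoPLS1 would have to be swapped in to reproduce the bounded curve in Figure 4.5.

```python
import numpy as np

def pls1_slope(X, y):
    # Plain one-component PLS slope (non-robust stand-in for SIMPLS)
    Xc, yc = X - X.mean(0), y - y.mean()
    w = Xc.T @ yc
    w /= np.linalg.norm(w)
    t = Xc @ w
    return w * (t @ yc) / (t @ t)

rng = np.random.default_rng(1)
n, p = 30, 6
X = rng.normal(size=(n, p))
y = X @ np.ones(p) + rng.normal(size=n)
clean = pls1_slope(X, y)

# Replace the first i responses with 50 and track the shift in the slope
curve = []
for i in range(1, n // 2 + 1):
    yc = y.copy()
    yc[:i] = 50.0
    curve.append(np.linalg.norm(pls1_slope(X, yc) - clean))
```

Plotting `curve` against the contamination percentage i/n gives a breakdown plot of the kind shown in Figure 4.5.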
Chapter 5
RoCPLS: Robust Partial Classification
5.1 Introduction
The problem of classifying entities into one of several groups has been one of the main goals of many scientific investigations. For instance, predicting whether someone will have a heart attack on the basis of demographic and clinical measurements, or identifying a tumor as one of many different possibilities on the basis of DNA expression values, are potentially life-saving tasks and hence indispensable to physicians. Numerous other interesting applications of classification can be found in a broad range of scientific areas such as chemistry, economics, marketing research, bioinformatics, image analysis, pattern recognition and data mining.
Classification is a multivariate method for distinguishing among classes of objects by developing a decision rule that assigns a new object with unknown class membership to the most likely group. In this study, only two-class problems are considered. Let (x_i', y_i) denote the observed data, with x_i = [x_i1, x_i2, ..., x_ip]' ∈ R^p consisting of p characteristics sampled from two populations, and y_i the class membership of observation i, where i = 1,2,...,n. I_g = {i : y_i = g} denotes the set of indices of the n_g observations in class g, where g = 1,2 and n = n_1 + n_2.
It is important that classification is done so that the proportion of misclassified observations (the misclassification error rate) is minimal. In general, the performance of a classification method can be evaluated by its misclassification error rate, which can be obtained in different ways. The optimum error rate (OER) and the actual error rate (AER) are two quantities that can be used for determining misclassification probabilities. However, they cannot, in general, be calculated, because they depend on the unknown density functions of the populations. Another method, employed in this study since it does not depend on the form of the populations, is to split the data set randomly into two non-overlapping sets, the learning set, (X_L, y_L), and the test set, (X_T, y_T). The learning set X_L is used to construct a decision rule δ that assigns a new vector x ∈ R^p to one of the two classes, that is,

    δ(x; X_L, y_L) : R^p → {1, 2},

where y_L is the vector of class labels of the observations in the learning set X_L. Based on this classification rule, the fraction of misclassified observations in the test set X_T is computed. By repeating this process N times, the estimated misclassification error rate is obtained as
    MÊR = (1/(N n_T)) Σ_{r=1}^{N} Σ_{j=1}^{n_T} I_{{−1,1}}(y_j − ŷ_j),   (5.1)

where n_T is the number of observations in the test set, y_j is the known class label of the jth observation in X_T, ŷ_j is its estimated class label, and I_{{−1,1}}(b) is the indicator function taking the value 1 if b ∈ {−1,1} and 0 otherwise. The apparent error rate (APER) is another performance measure that can be used for any classification procedure; it is the fraction of observations in X_L misclassified by the rule δ(·; X_L, y_L). Although it is easy to calculate, it underestimates the error rate. Cross-validation is also a popular approach: the data set is split into m non-overlapping subsets, m − 1 of which form the learning set used to construct the decision rule, while the remaining subset is used as the test set. If m = n, the procedure is called leave-one-out cross-validation. Throughout this study, leave-one-out cross-validation is employed to estimate the value of the meta-parameter k, i.e., the optimal number of components.
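The split-and-repeat estimate in (5.1) can be sketched as follows. The decision rule `classify` (nearest class mean) is a hypothetical stand-in; any rule δ(x; X_L, y_L) can be plugged in.

```python
import numpy as np

def classify(x, XL, yL):
    # Hypothetical stand-in decision rule: assign to the nearest class mean
    means = {g: XL[yL == g].mean(0) for g in (1, 2)}
    return min(means, key=lambda g: np.linalg.norm(x - means[g]))

def estimate_mer(X, y, N=100, test_frac=0.3, seed=0):
    # Equation (5.1): average the test-set misclassification fraction
    # over N random learning/test splits.
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = int(test_frac * n)
    errors = 0
    for _ in range(N):
        idx = rng.permutation(n)
        test, learn = idx[:n_test], idx[n_test:]
        for j in test:
            errors += classify(X[j], X[learn], y[learn]) != y[j]
    return errors / (N * n_test)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 1, (30, 5)), rng.normal(2, 1, (30, 5))])
y = np.repeat([1, 2], 30)
mer = estimate_mer(X, y)
```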
Since the introduction of Fisher's discriminant (FD) analysis in 1936 ([28]), several classification rules have been proposed and studied in the literature. FD analysis is based on the idea of finding the directions in multivariate space that yield the best discrimination between the groups of samples. This idea can be written as the optimization problem

    argmax_{a ∈ R^p} (a'Ba)/(a'Wa) = [a'{Σ_{g=1}^{2} n_g(x̄_g − x̄)(x̄_g − x̄)'}a] / [a'{Σ_{g=1}^{2} Σ_{i ∈ I_g}(x_i − x̄_g)(x_i − x̄_g)'}a],   (5.2)

where B is the sample between-group matrix, W is the sample within-group matrix, x̄_g is the sample mean vector of the gth class (g = 1,2), and x̄ is the overall sample mean vector. In general, if λ_1 is the largest eigenvalue and e_1 the corresponding eigenvector of W^{-1}B, then a = e_1 = S_p^{-1}(x̄_1 − x̄_2) is the solution of the optimization problem in (5.2), where S_p = W/(n − 2) is the sample pooled covariance matrix. Therefore the FD rule can be given as

    δ_FD(x; X, y) = 1 if e_1'x ≥ 0.5(x̄_1 − x̄_2)'S_p^{-1}(x̄_1 + x̄_2), and 2 otherwise.

The FD rule is developed under the assumption that the two populations have a common covariance matrix, and it does not explicitly assume any form for the underlying distributions.
Bayes classification is another approach; it requires prior probabilities, π_g, and estimates of the probabilistic structure of each class. The Bayesian discriminant rule assigns an observation x ∈ R^p to the population for which the posterior probability, P(y|x), is maximal. Under the assumption that each class comes from a multivariate normal distribution with equal covariance matrices, the allocation rule is

    δ_LD(x; X, y) = argmax_g LD_g(x; X, y),   (5.3)

where

    LD_g(x; X, y) = x̄_g'S_p^{-1}x − 0.5 x̄_g'S_p^{-1}x̄_g + ln(π_g),   (5.4)

and this is called the linear discriminant (LD) rule. Provided that the two classes come from normal distributions with the same covariance matrix and equal prior probabilities, δ_FD is equivalent to the linear discriminant rule, δ_LD. When the assumption of equal covariance matrices is not satisfied, an individual covariance matrix for each group can be used in (5.4); this leads to the so-called quadratic discriminant (QD) analysis, as the discriminating boundaries are quadratic curves.
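A minimal sketch of the LD rule in (5.3) and (5.4) for two classes, assuming the priors π_g are estimated by the sample proportions n_g/n:

```python
import numpy as np

def ld_rule(x, X, y):
    # LD rule (5.3)-(5.4): pooled covariance S_p = W/(n-2),
    # priors pi_g estimated by the sample proportions n_g/n
    n = len(y)
    groups = [X[y == g] for g in (1, 2)]
    means = [G.mean(0) for G in groups]
    W = sum((G - m).T @ (G - m) for G, m in zip(groups, means))
    Sp_inv = np.linalg.inv(W / (n - 2))
    scores = [m @ Sp_inv @ x - 0.5 * m @ Sp_inv @ m + np.log(len(G) / n)
              for G, m in zip(groups, means)]
    return int(np.argmax(scores)) + 1   # class label 1 or 2

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-3, 1, (25, 3)), rng.normal(3, 1, (25, 3))])
y = np.repeat([1, 2], 25)
label = ld_rule(np.array([3.0, 3.0, 3.0]), X, y)
```

Inverting S_p directly is exactly what fails when X is multicollinear or p >> n, which motivates the dimension reduction methods discussed next.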
Over the last decade, many sophisticated classification methods, such as support vector machines, neural networks, and classification and regression trees (CART), have been proposed. In spite of these refined methods, δ_LD, which yields optimal discrimination between two classes, is still widely used and very popular because of its simplicity, freedom from strict assumptions, ease of interpretation, and good performance in many applications. From the point of view of optimality, LD analysis should of course be used for classification whenever it can be. However, it becomes a serious challenge to use LD analysis in settings where the data matrix X is multicollinear or p >> n, because the sample covariance matrix estimate is nearly singular under high collinearity, and high dimensionality makes direct matrix operations difficult. Many solutions have been proposed to deal with these problems, such as variable selection, penalized estimation, and dimension reduction.
Variable selection is a very popular method due to its simplicity and interpretability. The most commonly used variable selection methods are based on a score (such as the t-statistic, Wilcoxon's rank-sum statistic, or the false discovery rate) that measures the discriminating power of each variable individually; the variables with the best scores are selected (see [20], [24], [25]). These methods are called univariate ranking methods. Their major drawback is that variables are selected according to an individual relevance score that ignores the correlations and interactions among variables. Therefore, criteria more complex than individual scores have been proposed (optimal subset selection methods), which are generally computationally expensive and suffer from overfitting ([7], [11], [52]).
Penalization (regularization) methods can also be employed to stabilize the pertinent covariance matrices so that the classical discrimination paradigms can be implemented (see [32]). These methods reduce the variance associated with the sample-based estimate at the expense of potentially increased bias.
Dimension reduction (feature extraction) is another way to deal with the dimensionality problem. It allows the visualization of data in a low dimension, takes into account the correlation structure of the data and, most importantly, utilizes the information in all variables. This topic, particularly PLS as a dimension reduction tool, is examined in Section 5.2. Although PLS solves the dimensionality problem by constructing the orthogonal components described in Section 5.2, it fails to deal with data containing outliers. Therefore, in Section 5.3, a new robust method, RoCPLS, is proposed. To our knowledge, there has been no study on the robustness of PLS based classification methods. The performances of the existing PLS based classification methods and RoCPLS are compared using benchmark data sets in Section 5.4.
5.2 Classification Methods Based on Dimension Reduction
A typical DNA microarray data set in tumor tissue classification studies consists of expression measurements on thousands of genes over a relatively small number of tissue samples. Similarly, in food research, classical classification methodologies cannot be used to predict the presence or absence of a preservative in a particular food product from spectral data, in which the number of variables is very large and the correlation among them is substantial.
One approach to classification problems involving high dimensional and/or collinear data sets is to employ a dimension reduction method and then perform a standard classification method in the reduced space. In this section, we study dimension reduction for classification based on PLS and PCA, followed by LD implemented in the reduced subspace. Another classification method, such as logistic regression, could be employed instead of LD; however, logistic regression does not perform well when the classes are completely or quasi-completely separated, which is a quite common configuration in microarray data.
Although PLS was originally designed for problems with a quantitative response, it is now frequently used as a dimension reduction tool for classification problems, where the response variable is qualitative. There are mainly two approaches when PLS is employed as a dimension reduction method for classification purposes.
One approach is to utilize the NIPALS algorithm to determine the components. However, since the NIPALS algorithm consists of regression steps (see Chapter 2), it seems unappealing to use an algorithm designed for continuous response models that do not suffer from heteroscedasticity. Marx ([60]) therefore proposed an extension of the NIPALS algorithm to handle qualitative response models, essentially incorporating the original NIPALS algorithm into the framework of generalized linear models by employing iteratively reweighted least squares (IRLS). The main drawback of the method is a convergence problem. Ding and Gentleman ([21]) therefore modified Marx's method by applying Firth's procedure ([27]) to resolve the complete or quasi-complete separation that causes the convergence problem. A recent method by Fort and Lacroix, RPLS ([30]), combines the NIPALS algorithm with ridge-penalized logistic regression. They also provided an extensive simulation study comparing the existing NIPALS based classification methods and concluded that the misclassification error rates for IRPLS and Ding and Gentleman's method are lower and less variable.
The other commonly used approach to determining PLS components for classification is to apply the original SIMPLS algorithm, described in Chapter 2. Barker and Rayens ([4]), Nguyen and Rocke ([65]) and Boulesteix ([8]) proposed dimension reduction based on SIMPLS as a preliminary step in classification problems. In this chapter, SIMPLS based classification is considered, not only because SIMPLS has computational advantages over the NIPALS algorithm (see Chapter 2), but also because the optimal directions obtained by SIMPLS are related to Fisher's optimal directions, so there is a relationship between classification based on SIMPLS and Fisher's discrimination. The following lemma gives this relationship.
Lemma 5.1 (Boulesteix, [8]) If the common covariance matrix, Σ, is assumed to be of the form Σ = σ²I_p for a non-zero constant σ, then a = e_1 and the first PLS weight vector, r_1, are collinear.
Proof:
Let X and y be centered. r̂_1 is the direction that maximizes the squared covariance between the projected explanatory variables and the response variable, i.e.,

    r̂_1 = argmax_a cov(Xa, y)² = X'y / (y'XX'y)^{1/2}.

The centered y is given by

    y_i = −n_2/n for i ∈ I_1,  and  y_i = n_1/n for i ∈ I_2.

Therefore, the jth entry of the p × 1 vector r_1 is

    −(n_2/n) Σ_{i ∈ I_1} x_ij + (n_1/n) Σ_{i ∈ I_2} x_ij = (n_1 n_2 / n)(x̄_2j − x̄_1j),

and

    r̂_1 = (x̄_2 − x̄_1) / ||x̄_1 − x̄_2||.

Therefore r_1 is proportional to the normalized form of μ_1 − μ_2, which is the dominant eigenvector of the between-groups matrix, B. Since e_1 = Σ^{-1}(μ_1 − μ_2) and Σ = σ²I_p, r_1 and e_1 are collinear. □
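The collinearity in Lemma 5.1 is easy to verify numerically: with X and y centered as in the proof, X'y is exactly proportional to x̄_2 − x̄_1, whatever the data.

```python
import numpy as np

rng = np.random.default_rng(4)
n1, n2, p = 30, 20, 5
X = rng.normal(size=(n1 + n2, p))
labels = np.repeat([1, 2], [n1, n2])

Xc = X - X.mean(0)                                 # center X
y = np.where(labels == 1, -n2, n1) / (n1 + n2)     # centered class indicator
r1 = Xc.T @ y                                      # unnormalized first PLS weight
diff = X[labels == 2].mean(0) - X[labels == 1].mean(0)

# r1 = (n1*n2/n) * (xbar_2 - xbar_1), hence cosine exactly 1
cos = r1 @ diff / (np.linalg.norm(r1) * np.linalg.norm(diff))
```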
Lemma 5.1 implies that SIMPLS depends on the between-groups matrix. It is also clear that the within-group information is not utilized in constructing the SIMPLS components; that is, since only B, and not W, is involved, classification based on SIMPLS depends only on the between-groups matrix. LD therefore outperforms it in situations where LD can be implemented. In the presence of multicollinearity, however, the optimality advantage of LD over SIMPLS based classification reverses.
PCA reduces the dimension of the data set by retaining as much as possible of the variation present in the data. PCA is thus only capable of identifying the total variability, i.e., B + W, and not of distinguishing between-groups from within-groups variability, which is the main goal of Fisher's discrimination ([4], [8]). In particular, if the within-groups variability, W, dominates the total variability, PCA will no longer perform well as a classification tool. Since SIMPLS depends on the between-groups matrix, SIMPLS is to be preferred to PCA when discrimination is the major goal after dimension reduction. The following example also indicates that PLS outperforms PCA as within-groups variability increases.
Example 5.1
This example is a modified version of the example given by Barker and Rayens, [4]. Fifty observations from each of two multivariate normal distributions with means μ_1 = (−2,0,0)' and μ_2 = (2,0,0)' and common covariance matrix

    Σ = ( 1   0   0
          0   1   r
          0   r   σ² )

are generated N = 100 times, where σ² = 1,2,...,6 is the variance of the third variable and r = 0.9 is the correlation between the second and third variables. Misclassification error rates are calculated using leave-one-out cross-validation, and these rates are averaged over the 100 randomly generated data sets. It can be seen from Figure 5.1 that as σ², the variance of the third variable, varies from 1 to 6, the misclassification rate of PCA based classification increases, since PCA loses sight of the discrimination information when the within-group matrix dominates the total variability.
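Example 5.1 can be sketched as follows, with a nearest-class-mean rule on the single projected component standing in for the full leave-one-out LD step; at σ² = 6 the PCA direction is dominated by the third variable and should misclassify far more often.

```python
import numpy as np

rng = np.random.default_rng(5)
s2, r = 6.0, 0.9
Sigma = np.array([[1, 0, 0], [0, 1, r], [0, r, s2]])
X = np.vstack([rng.multivariate_normal([-2, 0, 0], Sigma, 50),
               rng.multivariate_normal([ 2, 0, 0], Sigma, 50)])
y = np.repeat([1, 2], 50)

Xc = X - X.mean(0)
yc = np.where(y == 1, -0.5, 0.5)
w_pls = Xc.T @ yc                                # first PLS direction
w_pca = np.linalg.eigh(np.cov(Xc.T))[1][:, -1]   # first principal axis

def err(w):
    # project on the single component, classify by the nearest class mean
    t = Xc @ w
    m1, m2 = t[y == 1].mean(), t[y == 2].mean()
    pred = np.where(np.abs(t - m1) <= np.abs(t - m2), 1, 2)
    return np.mean(pred != y)

e_pls, e_pca = err(w_pls), err(w_pca)
```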
PCA and SIMPLS are both linear dimension reduction methods, but SIMPLS uses the class information, y, to construct its components (supervised), while PCA does not (unsupervised). Several other dimension reduction methods can be applied in the context of classification. For instance, sliced inverse regression
Figure 5.1: Misclassification error rates for PLS and PCA for k = 1.
(SIR) is one of the sufficient dimension reduction methods ([51]), which represent a family of dimension reduction procedures. A simulation study by Dai et al. ([14]) demonstrates that SIMPLS and SIR are both effective in dimension reduction for classification, and more effective than PCA, which is not surprising since both SIMPLS and SIR are supervised methods. Considering both accuracy and computational efficiency, that study concluded that SIMPLS provides the best performance among SIMPLS, PCA and SIR.
5.3 Description of the Proposed Algorithm: RoCPLS
In Chapter 3, RoPLS was shown to be successful in the regression framework when the data contain outliers. Partial least squares is also frequently used as a classification method, as described in Section 5.2. In the presence of outliers, dimension reduction via PLS yields unreliable results, since PLS is known to be sensitive to outliers. Although several robust PLS methods have been proposed for a quantitative response variable, to our knowledge there has been no study on the robustness of PLS when the response variable is qualitative. In this section, the sensitivity of PLS based classification methods to outliers is demonstrated, and an extension of the robust method introduced in Chapter 3 is given.
The proposed algorithm, RoCPLS, is a robustified version of SIMPLS based classification. The main differences between RoPLS and RoCPLS are that weights for the response variable, y, are immaterial and weights for the data matrix, X, are computed for each class separately. The detailed algorithm is given below:
Algorithm: RoCPLS
Input: n × p data matrix X, n × 1 response vector y, a new observation x ∈ R^p
Output: score matrix T, p × k PLS weight matrix R, and the class label ŷ for the new observation

Step 1: Let X_g = {x_ij : i ∈ I_g, j = 1,2,...,p} for g = 1,2. Apply the PCOUT algorithm, described in Section 3.2.2, to X_1 and X_2 to obtain the weight vectors w_1 and w_2, respectively. Take p* = n_g − 1 for high dimensional data and p* = rank(X_g) for low dimensional data, where g = 1,2. Within each group, any observation with final weight less than 0.25 is flagged as an outlier. Let X*_1 and X*_2 be the clean matrices of observations with weights greater than 0.25, and form the merged matrix

    X* = {x*_ij} = [X*_1 ; X*_2]  (X*_1 stacked on X*_2),

with the vector of class labels y*, where n_c1 and n_c2 are the numbers of observations in X*_1 and X*_2, respectively, and n_c = n_c1 + n_c2.

Step 2: Let X^0 = {x^0_ij} = X*, X^0_1j = {x^0_ij : 1 ≤ i ≤ n_c1} and X^0_2j = {x^0_ij : n_c1 + 1 ≤ i ≤ n_c} for 1 ≤ j ≤ p. Repeat Steps 2.1 to 2.6 for h = 1,2,...,k:

Step 2.1: Compute the PLS weight vector r_h with jth entry

    r_h(j) = (m_1j − m_2j) / ||m_1 − m_2||,

where

    m_gj = Σ_{i ∈ I_jg} x^{h−1}_ij / n_jg

with

    I_j1 = {i : 1 ≤ i ≤ n_c1 and q_25(X^{h−1}_1j) − 1.5 IQR(X^{h−1}_1j) ≤ x^{h−1}_ij ≤ q_75(X^{h−1}_1j) + 1.5 IQR(X^{h−1}_1j)},
    I_j2 = {i : n_c1 + 1 ≤ i ≤ n_c and q_25(X^{h−1}_2j) − 1.5 IQR(X^{h−1}_2j) ≤ x^{h−1}_ij ≤ q_75(X^{h−1}_2j) + 1.5 IQR(X^{h−1}_2j)},

where q_25 and q_75 are the 25th and 75th sample quantiles, IQR is the interquartile range, and n_jg is the size of I_jg for g = 1,2 and j = 1,2,...,p.

Step 2.2: Compute the hth score, t_h = X* r_h, and normalize: t_h := t_h / ||t_h||.
Step 2.3: Update the hth PLS weight: r_h := r_h / (r_h' X*' X* r_h)^{1/2}.
Step 2.4: Compute the hth x-loading by regressing X* on t_h: p_h = X*' t_h.
Step 2.5: Store the vectors r_h, t_h and p_h in the matrices R_h = [r_1, r_2, ..., r_h], T_h = [t_1, t_2, ..., t_h] and P_h = [p_1, p_2, ..., p_h], respectively.
Step 2.6: Set h := h + 1 and X^{h−1} = X*(I_p − V_{h−1} V'_{h−1}) = {x^{h−1}_ij}, where the columns of V_{h−1} form an orthonormal basis for the column space of P_{h−1}.

Step 3: ŷ = δ_LD(x'R_h; T_h, y*).
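The component-construction loop (Step 2) can be sketched as follows, assuming X* and y* have already been cleaned by the PCOUT step; this is an illustrative reading of the algorithm, not the author's reference implementation.

```python
import numpy as np

def trimmed_mean(col):
    # Step 2.1 trimming: average only the values inside the 1.5*IQR fences
    q25, q75 = np.percentile(col, [25, 75])
    iqr = q75 - q25
    keep = (col >= q25 - 1.5 * iqr) & (col <= q75 + 1.5 * iqr)
    return col[keep].mean()

def rocpls_components(Xstar, ystar, k):
    n, p = Xstar.shape
    Xh = Xstar.copy()
    R, T, P = [], [], []
    for _ in range(k):
        # Step 2.1: robust weight from trimmed columnwise class means
        m1 = np.array([trimmed_mean(Xh[ystar == 1, j]) for j in range(p)])
        m2 = np.array([trimmed_mean(Xh[ystar == 2, j]) for j in range(p)])
        r = (m1 - m2) / np.linalg.norm(m1 - m2)
        t = Xstar @ r
        nt = np.linalg.norm(t)
        t, r = t / nt, r / nt              # Steps 2.2-2.3: normalize score and weight
        ph = Xstar.T @ t                   # Step 2.4: x-loading
        R.append(r); T.append(t); P.append(ph)   # Step 2.5
        V, _ = np.linalg.qr(np.column_stack(P))  # orthonormal basis for P_h
        Xh = Xstar @ (np.eye(p) - V @ V.T)       # Step 2.6: deflation
    return np.column_stack(R), np.column_stack(T)

rng = np.random.default_rng(6)
Xs = np.vstack([rng.normal(-1, 1, (20, 8)), rng.normal(1, 1, (20, 8))])
ys = np.repeat([1, 2], 20)
R, T = rocpls_components(Xs, ys, 3)
```

Step 3 then applies the LD rule to the projected new observation x'R_h, using the scores T_h and the cleaned labels y*.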
The constructed component matrix, T_h, is not only used to determine the classification rule, but also to plot the first two or three components, which helps to display relationships, possible groupings and potential outliers in the data. After projecting the original data matrix, X, onto the robustly calculated directions, R, the orthogonal-score distance plot described in Section 3.3 can be constructed to distinguish regular, good PLS-leverage, orthogonal and bad PLS-leverage points. It can also be used to visualize outlying observations within each class.
The numerical examples with diagnostic plots in Section 5.4 demonstrate the robustness and efficiency of the proposed method.
5.4 Numerical Examples
In this section, three benchmark data sets are utilized to compare the performances of the existing dimension reduction based classification methods and the proposed method, RoCPLS.
5.4.1 Low Dimension: Wine Recognition Data
The wine recognition data ([29]) are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents (e.g., the level of alcohol, the level of magnesium, the color intensity) found in each type of wine. In this study, only two cultivars, with sample sizes 59 and 71, are considered.
To determine the optimal number of components, the SIMPLS based cross-validation error rate is calculated for h = 1,2,...,10 components, giving the scree plot in Figure 5.2. Based on the figure, the optimal number of components is determined as 7, i.e., k = 7, which yields the lowest error rate. The orthogonal-score plots for SIMPLS and RoCPLS based on the k = 7 component model can be seen in Figure 5.3. It is obvious that none of the PLS-bad leverage points can be identified when SIMPLS is employed, while
Figure 5.2: Cross-validation error rates obtained for the Wine data.
observations 74, 79, 96, 111 and 122 are identified as PLS-bad leverage points when RoCPLS is employed. We deleted all PLS-bad leverage points as well as the two good leverage points, 70 and 97, identified by RoCPLS, and recalculated the SIMPLS based cross-validation error rate for the reduced data, which yields Figure 5.4. Excluding these observations clearly yields smaller error rates, and it also indicates that k = 4 is the optimal number of components. This example demonstrates how outlying observations can affect the misclassification error rate based on SIMPLS as well as the choice of k. The orthogonal-score plots based on k = 4 for SIMPLS and RoCPLS yield patterns very similar to those in Figure 5.3 and are therefore not repeated here.
In order to compare classification accuracies, 100 random partitions into a learning set containing 70% of the data and a test set containing the remaining observations are generated. Observations 70, 74, 79, 96, 97, 111 and 122 are kept in the learning set for each partition. The classification rule is then calculated from the learning set, and the estimated class memberships of the observations in the test set are evaluated after projecting them onto
Figure 5.3: Orthogonal-score distance plots based on SIMPLS (left) and RoCPLS (right) for the Wine data.
Figure 5.4: Scree plots for the original and deleted Wine data.
the directions calculated from the learning set and applying the rule based on the learning set. For h = 1,2,3,4, the error rates given in Table 5.1 are obtained. Both SIMPLS and RoCPLS give lower error rates than PCA does. Moreover, RoCPLS yields the smallest error rate for each h, which indicates the robustness of the method.
h   SIMPLS   RoCPLS   PCA
1   0.0703   0.0677   0.0703
2   0.0736   0.0462   0.0744
3   0.0374   0.0292   0.0659
4   0.0187   0.0100   0.0362
Table 5.1: The mean misclassification error rates for the Wine data based on SIMPLS, RoCPLS and PCA classification.
5.4.2 High Dimension: Colon Data
The colon data set ([2]) contains the expression levels of p = 2000 genes for n = 62 patients from two classes: 22 patients are healthy and 40 have colon cancer. After the data preprocessing described in Dudoit ([24]), only 1224 variables remain. Cross-validation error rates indicate that k = 4 components result in the minimum error rate. Therefore, orthogonal-score plots for k = 4 components are constructed using SIMPLS and RoCPLS, given in Figure 5.5. Neither plot indicates the existence of extreme outliers. The scatter plot of the first three components in Figure 5.6 does not display any outlying observations either. It can also be seen from Figure 5.6 that the classes are not completely separated. This also explains the high misclassification error rates obtained in Table 5.2 (k = 4), where DG stands for Ding and Gentleman's method ([21]). As in the previous example, 100 randomly split data sets are employed to calculate the error rates. The optimal value of k is estimated in each iteration based on the learning set. The boxplots given in Figure 5.7
Figure 5.5: Orthogonal-score distance plots based on SIMPLS (left) and RoCPLS (right) for the Colon data.
summarize the error rates calculated for each method. Clearly, all methods give better results than PCA. The numerical results and graphics show the comparable performances of SIMPLS and RoCPLS, which demonstrates the efficiency of the proposed method on uncontaminated data sets.

SIMPLS   RoCPLS   PCA      DG
0.1326   0.1363   0.2289   0.1389
Table 5.2: The mean misclassification error rates for SIMPLS, RoCPLS, PCA and DG based classification.
5.4.3 High Dimension: Leukemia Data
This data set was introduced in Golub et al. ([37]) and contains the expression levels of 7129 genes for 47 ALL-leukemia patients and 25 AML-leukemia patients. After data preprocessing, only 500 variables remain. Leave-one-out cross-validation on the whole data set indicates that k = 2 components should be retained in the model. For k = 2 components,
Figure 5.6: 3D scatter plot of the first three components for the Colon data obtained from RoCPLS.
Figure 5.7: Boxplots of the error rates for SIMPLS, RoCPLS, PCA and DG based classification.
Figure 5.8: Scatter plot of the first two components for the Leukemia data.
the scatter plot of the first two components is constructed (Figure 5.8), where the separation between the two groups can be clearly seen. The orthogonal-score plots obtained from SIMPLS and RoCPLS do not reveal any extreme observations in the data, as in the colon data set.
The leukemia data set is randomly divided into a learning set of size 50 and a test set of size 22, N = 250 times. For this simulation study, we introduce 0, 1, 2 and 3 outlying observations into each class, corresponding to 0%, 4%, 7% and 10% contamination, respectively. For class g, a contaminated observation is generated from a multivariate normal distribution with mean 10·1_500 + x̄_g (where 1_500 is the 500-vector of ones) and covariance matrix I_500, with class label y = g, for g = 1,2. For example, when 2 outliers are introduced into each class, the first two component plot and the orthogonal-score plot obtained from RoCPLS in Figure 5.9 indicate that these observations are PLS-bad leverage points.
Figure 5.9: The scatter plot of the first two components (left) and the orthogonal-score plot (right) for the Leukemia data.
As in the colon data example, the leave-one-out method is applied to the clean learning set of size 50 to determine the optimal number of components within each division; 60% of the time k is determined as 2, while 40% of the time k = 1. For the optimal value of k, the error rates based on SIMPLS, RoCPLS, PCA and DG are calculated for the test set. The boxplots in Figure 5.10 summarize the simulation results. For the clean data (no outliers), all methods yield very comparable results. Once again, it can be seen that RoCPLS is an effective method for uncontaminated data. As the number of outlying observations increases, the error rates for SIMPLS, PCA and DG increase as well. However, adding outlying observations does not affect the error rates based on RoCPLS. The main reason is that the optimal directions obtained by RoCPLS are robust to outliers. To show this, the angle between the first PLS weight vectors, r_1, obtained from the clean and the contaminated data is calculated for SIMPLS and RoCPLS. The boxplots in Figure 5.11 are constructed for each contamination level based on the N = 250 divisions. The angle for SIMPLS tends to increase as the contamination level increases. However, RoCPLS yields smaller
Figure 5.10: Boxplots of the error rates for (a) no outliers, (b) 1 outlier, (c) 2 outliers, (d) 3 outliers in each class.
angles than SIMPLS does, and as the contamination level increases, its results remain almost the same.
Figure 5.11: The angle between the first PLS weight vectors for clean and contaminated data.
Chapter 6
Conclusions and Future Work
In this dissertation, different aspects of partial least squares methods have been studied. In this chapter, final conclusions on the results obtained throughout the dissertation are summarized. We also discuss some possibilities for future research.
In Chapter 2, the main concepts of PLS are introduced and a detailed overview of its applications to different data analysis problems is given. Two important algorithms, SIMPLS and NIPALS (PLS1 and PLS2), are described. It is stated that the optimal number of components is an important issue in PLSR model building, and several approaches proposed in the literature for determining the optimal number of components, k, are reviewed. The connections among the biased estimation methods PLSR, PCR and RR are examined in detail. Statistical properties of PLSR such as shrinkage, asymptotic variance and consistency are discussed. In conclusion, the computational and implementation simplicity of PLS is a strong aspect of the approach, which favours the use of PLS as a first step in understanding existing relations and analysing real world data.
In Chapter 3, a new iterative robust external reweighting algorithm for the regression coefficient vector, which gives low weights to points with high leverage and/or large residuals, is proposed. The algorithm is carried out in two main parts: (1) obtain initial weights based on robust distances from recent outlier detection methods, BACON (RoPLS1) or PCOUT (RoPLS2), to downweight outlying points in the predictor X-space and/or response y-space and to get an initial PLS estimate of the regression coefficient vector; (2) perform reweighted PLS regression iteratively, using the initial PLS estimate of the regression coefficient vector
obtained in the first part. Both RoPLS1 and RoPLS2 can be applied to low- and high-dimensional explanatory variables. Simulations have shown that they are resistant to many types of contamination, while their performance on uncontaminated data sets is also good. RoPLS1 is scale and orthogonal equivariant; it can therefore be preferred over RoPLS2, which is not orthogonal equivariant.
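The two-part scheme can be sketched as follows. This is a simplified stand-in, not RoPLS itself: coordinatewise median/MAD distances replace the BACON/PCOUT distances of part 1, hard 0/1 weights replace the actual weighting scheme, and the helper names are illustrative only:

```python
import numpy as np

def pls1_coef(X, y, k):
    # plain (non-robust) NIPALS PLS1 coefficients for centered data
    Xk = X - X.mean(axis=0); yk = y - y.mean()
    p = X.shape[1]
    W = np.zeros((p, k)); P = np.zeros((p, k)); q = np.zeros(k)
    for a in range(k):
        w = Xk.T @ yk; w /= np.linalg.norm(w)
        t = Xk @ w; tt = t @ t
        P[:, a] = Xk.T @ t / tt; q[a] = yk @ t / tt; W[:, a] = w
        Xk = Xk - np.outer(t, P[:, a]); yk = yk - q[a] * t
    return W @ np.linalg.inv(P.T @ W) @ q

def ropls_sketch(X, y, k=1, n_iter=5, c=2.5):
    # Part 1: initial downweighting from robust distances in X-space
    # (median/MAD distances stand in for BACON/PCOUT here).
    med = np.median(X, axis=0)
    mad = 1.4826 * np.median(np.abs(X - med), axis=0) + 1e-12
    d = np.sqrt((((X - med) / mad) ** 2).sum(axis=1))
    keep = d <= np.median(d) + c * 1.4826 * np.median(np.abs(d - np.median(d)))
    # Part 2: iteratively refit PLS on the currently trusted points,
    # downweighting observations with large standardized residuals.
    for _ in range(n_iter):
        beta = pls1_coef(X[keep], y[keep], k)
        r = (y - y[keep].mean()) - (X - X[keep].mean(axis=0)) @ beta
        s = 1.4826 * np.median(np.abs(r - np.median(r))) + 1e-12
        keep = np.abs(r - np.median(r)) <= c * s
    return beta
```

On data with a few bad leverage points, such a scheme recovers the bulk regression coefficients while a plain PLS fit is pulled away from them.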
In Chapter 4, it is shown that the SIMPLS algorithm is highly non-robust to outlying observations. It is illustrated that a single sample can change the direction of the SIMPLS weight vectors and the regression estimates arbitrarily, which also appears in their unbounded influence functions. Robustness properties of the RoPLS estimator of β, including the influence function for the low-dimensional case, the empirical influence curve for the high-dimensional case, and finite-sample breakdown properties, are provided. It is shown that the influence functions of all pairs of PLS weight vectors and of the regression estimator are bounded, which makes the method resistant to point contamination. For high-dimensional data, it is illustrated on simulated data sets that the empirical influence function remains bounded and that the method can resist large fractions of contamination. The resistance of a robust statistical method to groups of outliers is another important issue, measured by the breakdown point; hence the finite-sample breakdown point is determined for RoPLS1, which is approximately 40% for both low- and high-dimensional settings.
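The empirical influence computation summarized above can be illustrated for the classical (non-robust) first PLS weight vector, whose estimate is the normalized covariance between the centered predictors and the response. The function names below are ours; the sketch shows how a single added point increasingly displaces the weight vector, consistent with the unbounded influence of SIMPLS noted above:

```python
import numpy as np

def first_pls_weight(X, y):
    # classical first PLS weight: normalized covariance of X with y
    Xc = X - X.mean(axis=0)
    w = Xc.T @ (y - y.mean())
    return w / np.linalg.norm(w)

def empirical_influence(X, y, z, zy):
    """Empirical influence of adding one observation (z, zy):
    n * (estimate with the point added - original estimate)."""
    n = len(y)
    w0 = first_pls_weight(X, y)
    w1 = first_pls_weight(np.vstack([X, z]), np.append(y, zy))
    if w1 @ w0 < 0:       # weight vectors are defined only up to sign
        w1 = -w1
    return n * (w1 - w0)
```

For the classical estimator, the displacement grows with the magnitude of the added point; for a robust estimator with a bounded influence function, one would expect it to level off instead.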
In Chapter 5, the effect of outliers on existing PLS classification methods is investigated and a new robust PLS algorithm for classification (RoCPLS) is proposed. It is shown that the proposed method is very effective for uncontaminated data and that it yields better results when the data contain outliers.
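A generic PLS-discrimination scheme of the kind studied in Chapter 5 (dummy-coded response, PLS dimension reduction, then a simple classifier on the scores) can be sketched as below. This is not the RoCPLS algorithm itself; the 0/1 coding, the nearest-centroid rule, and all names are simplifying assumptions for illustration:

```python
import numpy as np

def pls_da_fit(X, labels, k=2):
    # dummy-code a binary response, then extract k PLS components
    classes = np.unique(labels)
    y = (labels == classes[1]).astype(float)
    xm = X.mean(axis=0)
    Xk = X - xm; yk = y - y.mean()
    p = X.shape[1]
    W = np.zeros((p, k)); P = np.zeros((p, k))
    for a in range(k):
        w = Xk.T @ yk; w /= np.linalg.norm(w)
        t = Xk @ w; tt = t @ t
        P[:, a] = Xk.T @ t / tt
        Xk = Xk - np.outer(t, P[:, a]); yk = yk - (yk @ t / tt) * t
        W[:, a] = w
    R = W @ np.linalg.inv(P.T @ W)      # maps centered X to scores
    T = (X - xm) @ R
    centroids = np.vstack([T[labels == c].mean(axis=0) for c in classes])
    return {"xm": xm, "R": R, "centroids": centroids, "classes": classes}

def pls_da_predict(model, Xnew):
    # assign each observation to the nearest class centroid in score space
    T = (Xnew - model["xm"]) @ model["R"]
    d2 = ((T[:, None, :] - model["centroids"]) ** 2).sum(axis=2)
    return model["classes"][d2.argmin(axis=1)]
```

A robust variant in the spirit of RoCPLS would replace the classical means and PLS weights in this pipeline with their robust counterparts.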
In this dissertation, we have shown promising results for RoPLS and RoCPLS as a data mining tool and a high-dimensional classifier, respectively. There is, of course, more research to be done. We would like to extend RoPLS to the multivariate case and RoCPLS to the multi-class case. We would also like to employ the influence function of the RoPLS1 estimator for the robust estimation of its variance. The relationship between the PLS components and variable selection will also be explored, with the aim of building a robust variable selection method in the future.
Bibliography
[1] Almøy, T., "A Simulation Study on Comparison of Prediction Methods When Only a Few Components are Relevant", Computational Statistics and Data Analysis, 21, 87-107, 1996.
[2] Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., and Levine,
A., "Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and
Normal Colon Tissues Probed by Oligonucleotide Arrays", Proceedings of the National
Academy of Sciences, 96, 6745-6750, 1999.
[3] Anderson, R.L. and Bancroft, T.A., Statistical Theory in Research. McGraw-Hill, New
York, 1952.
[4] Barker, M., and Rayens, W., "Partial Least Squares for Discrimination", Journal of
Chemometrics, 17, 166-173, 2003.
[5] Billor, N., Hadi, A., and Velleman, P., "BACON: Blocked Adaptive Computationally-Efficient Outlier Nominators", Computational Statistics and Data Analysis, 34, 279-298, 2000.
[6] Billor, N., Chatterjee, S., and Hadi, A. S., "Iteratively Re-weighted Least Squares
Method for Outlier Detection in Linear Regression", American Journal for Mathemat-
ical and Management Science, 26, 3, 229-252, 2006.
[7] Bo, T.H., and Jonassen, I., "New Feature Subset Selection Procedures for Classification of Expression Profiles", Genome Biology, 3, R17, 2002.
[8] Boulesteix, A.L., "PLS Dimension Reduction for Classification with Microarray Data", Statistical Applications in Genetics and Molecular Biology, 3, Article: 33, 2004.
[9] Breiman, L. and Friedman, J., "Predicting Multivariate Responses in Multiple Linear
Regression", Journal of the Royal Statistical Society, B, 59, 3-54, 1997.
[10] Chatterjee, S. and Mächler, M., "Robust Regression: A Weighted Least Squares Approach", Communications in Statistics: Theory and Methods, 26, 1381-1394, 1997.
[11] Chilingaryan, A., Gevorgyan, N., Vardanyan, A., Jones, D., and Szabo, A., "Multivariate Approach for Selecting Sets of Differentially Expressed Genes", Mathematical Biosciences, 176, 59-72, 2002.
[12] Croux, C. and Haesbroeck, G., "Principal Components Analysis Based on Robust Estimators of the Covariance or Correlation Matrix: Influence Functions and Efficiencies", Biometrika, 87, 603-618, 2000.
[13] Cummins, D.J. and Andrews, C.W., "Iteratively Reweighted Partial Least Squares:
A Performance Analysis by Monte Carlo Simulation", Journal of Chemometrics, 9,
489-507, 1995.
[14] Dai, J. J., Lieu, L., and Rocke, D., "Dimension Reduction for Classification with Gene Expression Microarray Data", Statistical Applications in Genetics and Molecular Biology, 5, Article: 6, 2006.
[15] De Jong, S., "SIMPLS: An Alternative Approach to Partial Least Squares Regression",
Chemometrics and Intelligent Laboratory Systems, 18, 251-263, 1993.
[16] De Jong, S., "PLS Fits Closer than PCR", Journal of Chemometrics, 7, 551-557, 1993.
[17] De Jong, S., "PLS Shrinks", Journal of Chemometrics, 9, 323-326, 1995.
[18] Denham, M.C., "Prediction Intervals in Partial Least Squares", Journal of Chemomet-
rics, 11, 39-42, 1997.
[19] Denham, M. C., "Choosing the Number of Factors in Partial Least Squares Regres-
sion: Estimating and Minimizing the Mean Squared Error of Prediction", Journal of
Chemometrics, 14, 351-361, 2000.
[20] Dettling, M., Bühlmann, P., "Boosting for Tumor Classification with Gene Expression Data", Bioinformatics, 19, 1061-1069, 2003.
[21] Ding, B. and Gentleman, R., "Classification Using Generalized Partial Least Squares", Bioconductor Project Working Papers, Technical Report 5, 2004.
[22] Donoho D.L., "Breakdown Properties of Multivariate Location Estimators", Ph.D.
Qualifying paper, Harvard University, 1982.
[23] Draper, N. and Smith, H., Regression Analysis by Example, Wiley, New York, 1981.
[24] Dudoit, S., Fridlyand, J., and Speed, T. P., "Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data", Journal of the American Statistical Association, 97, 77-87, 2002.
[25] Dudoit, S., Shaffer, J. P., and Boldrick, J. C., "Multiple Hypothesis Testing in Microarray Experiments", Statistical Science, 18, 71-103, 2003.
[26] Filzmoser, P., Maronna, R. and Werner, M., "Outlier Identification in High Dimensions", Computational Statistics and Data Analysis, 52, 1694-1711, 2008.
[27] Firth, D., "Bias Reduction of Maximum Likelihood Estimates", Biometrika, 80, 27-38,
1993.
[28] Fisher, R. A., "The Use of Multiple Measurements in Taxonomic Problems", Annals
of Eugenics, 7, 179-188, 1936.
[29] Forina, M., Armanino, C., Castino, M. and Ubigli, M., "Parvus - An Extendible Package for Data Exploration, Classification and Correlation", Vitis, 25, 189-201, 1986.
[30] Fort, G. and Lambert-Lacroix, S., "Classification Using Partial Least Squares with Penalized Logistic Regression", Bioinformatics, 21(7), 1104-1111, 2005.
[31] Frank, I.E. and Friedman, J.H., "A Statistical View of Some Chemometrics Regression
Tools", Technometrics, 35, 109-147, 1993.
[32] Friedman, J., "Regularized Discriminant Analysis", Journal of American Statistical
Association, 84, 165-175, 1989.
[33] Garthwaite, P. H., "An Interpretation of Partial Least Squares", Journal of the American Statistical Association, 89, 122-127, 1994.
[34] Geladi, P. and Kowalski, B. R., "Partial Least Squares Regression: A Tutorial", Ana-
lytica Chimica Acta, 185, 1-17, 1986.
[35] Gil, J.A. and Romera, R., "On Robust Partial Least Squares Methods". Journal of
Chemometrics, 12, 365-378, 1998.
[36] Golub, G.H. and Van Loan C.F., Matrix Computations , Johns Hopkins University
Press, Baltimore, 1996.
[37] Golub, T., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J., Caligiuri, M. A., Bloomfield, C. D., and Lander, E. S., "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring", Science, 286, 531-537, 1999.
[38] Griep, M.I., Wakeling, I.N., Vankeerberghen, P., and Massart, D.L., "Comparison of Semirobust and Robust Partial Least Squares Procedures", Chemometrics and Intelligent Laboratory Systems, 29, 37-40, 1995.
[39] Hadi, A. S. and Ling, R. F., "Some Cautionary Notes on the Use of Principal Compo-
nent Regression", The American Statistician, 52, 15-19, 1998.
[40] Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. and Stahel, W. A., Robust Statistics: The Approach Based on Influence Functions, John Wiley and Sons, New York, 1986.
[41] Helland, I., "On the Structure of Partial Least Squares Regression", Communications
in Statistics: Simulation and Computation, 17, 581-607, 1988.
[42] Helland, I., "Partial Least Squares Regression and Statistical Models", Scandinavian Journal of Statistics, 17, 97-114, 1990.
[43] Helland, I., "Rotational Symmetry, Model Reduction and Optimality of Prediction
from the PLS Population Model", Chemometrics and Intelligent Laboratory Systems,
68, 53-60, 2003.
[44] Hoerl, A.E. and Kennard, R. W., "Ridge Regression: Biased Estimation for Nonorthogonal Problems", Technometrics, 12, 55-67, 1970.
[45] Høskuldsson, A., "PLS Regression Methods", Journal of Chemometrics, 2, 211-228, 1988.
[46] Høskuldsson, A., "Dimension of Linear Models", Chemometrics and Intelligent Laboratory Systems, 32, 37-55, 1996.
[47] Huber, P.J., Robust Statistics, John Wiley and Sons, New York, 1981.
[48] Hubert, M. and Vanden Branden, K., "Robust Methods For Partial Least Squares Regression", Journal of Chemometrics, 17, 537-549, 2003.
[49] Hubert, M., Rousseeuw, P. J., and Vanden Branden, K., "ROBPCA: A New Approach to Robust Principal Component Analysis", Technometrics, 47(1), 64-79, 2005.
[50] Kavšek, B., Partial Least Squares Regression and Its Robustification, Diploma Thesis, Technische Universität Wien, 2002.
[51] Li, K.C., "Sliced Inverse Regression for Dimension Reduction", Journal of American
Statistical Association, 86, 316-342, 1991.
[52] Li, L., Weinberg, C. R., Darden, T. A., and Pedersen, L. G., "Gene Selection for
Sample Classiflcation Based on Gene Expression: Study of Sensitivity to Choice of
Parameters of the GA/KNN Method", Bioinformatics, 17, 1131-1142, 2001.
[53] Lingjærde, O.C. and Christophersen, N., "Shrinkage Structure of Partial Least Squares Regression", Scandinavian Journal of Statistics, 27, 459-473, 2000.
[54] Mallows, C. P., "On Some Topics in Robustness", Technical Memorandum , Murray
Hill, New Jersey, 1975.
[55] Manne, R., "Analysis of Two Partial-Least-Squares Algorithms for Multivariate Cali-
bration", Chemometrics and Intelligent Laboratory Systems, 2, 187-197, 1987.
[56] Maronna, R. and Yohai, V., "The Behavior of the Stahel-Donoho Robust Multivariate
Estimator", Journal of the American Statistical Association, 90, 330-341, 1995.
[57] Maronna, R. and Zamar, V., "Robust Estimates of Location and Dispersion for High
Dimensional Data Sets", Technometrics, 44(4), 307-317, 2002.
[58] Maronna, R., Martin, D. and Yohai, V., Robust Statistics: Theory and Methods, John Wiley and Sons, New York, 2006.
[59] Martens, H. and Naes, T., Multivariate Calibration, Wiley, Chichester, 1989.
[60] Marx, B.D., "Iteratively Reweighted Partial Least Squares Estimation for Generalized
Linear Regression", Technometrics, 38(4), 374-381, 1996.
[61] Marx, B.D. and Eilers, P.H., "Generalized Linear Regression on Sampled Signals and
Curves: A P-spline Approach". Technometrics, 41, 1-13, 1999.
[62] Massy, W.F., "Principal Component Regression in Explanatory Statistical Research",
Journal of the American Statistical Association, 60, 234-246, 1965.
[63] Naes, T., "Multivariate Calibration When the Error Covariance Matrix is Structured",
Technometrics, 27, 301-311, 1985.
[64] Naik, P. and Tsai, C., "Partial Least Squares Estimator for Single-Index Models",
Journal of the Royal Statistical Society: B , 62, 763-771, 2000.
[65] Nguyen, D.V. and Rocke, D.M., "Tumor Classification by Partial Least Squares Using Microarray Gene Expression Data", Bioinformatics, 18, 39-50, 2002(a).
[66] Nguyen, D.V. and Rocke, D.M., "Multi-Class Cancer Classification Via Partial Least Squares Using Gene Expression Profiles", Bioinformatics, 18, 1216-1226, 2002(b).
[67] Osborne, B.G., Fearn, T., Miller, A.R. and Douglas, S., "Application of Near Infrared Reflectance Spectroscopy to the Compositional Analysis of Biscuits and Biscuit Dough", Journal of Scientific Food Agriculture, 35, 99-105, 1984.
[68] Peña, D. and Prieto, F., "Multivariate Outlier Detection and Robust Covariance Matrix Estimation", Technometrics, 43, 286-310, 2001.
[69] Phatak, A. and De Jong, S., "The Geometry of Partial Least Squares", Journal of
Chemometrics, 11, 311-338, 1997.
[70] Phatak, A. and De Hoog, F., "Exploiting the Connection Between PLS, Lanczos Meth-
ods and Conjugate Gradients: Alternative Proofs of Some Properties of PLS", Journal
of Chemometrics, 16, 361-367, 2002.
[71] Phatak, A., Reilly, P. M., and Pendilis, A., "The Asymptotic Variance of the Univariate
PLS Estimator", Linear Algebra and Its Applications, 354, 245-253, 2002.
[72] Rocke, D., "Robustness Properties of S-estimators of Multivariate Location and Shape
in High Dimension", Annals of Statistics, 24, 1327-1345, 1996.
[73] Rousseeuw, P. J. and Leroy, A. M., Robust Regression and Outlier Detection, John
Wiley and Sons, New York, 1987.
[74] Rousseeuw P. J. and Van Driessen K., "A Fast Algorithm for the Minimum Covariance
Determinant Estimator", Technometrics, 41, 212-223, 1999.
[75] Serneels, S., Croux, C., and Van Espen, P.J., "Influence Properties of Partial Least Squares Regression", Chemometrics and Intelligent Laboratory Systems, 71, 13-20, 2004.
[76] Serneels, S., Croux, C., Filzmoser, P., and Van Espen, P.J., "Partial Robust M-
Regression", Chemometrics and Intelligent Laboratory Systems, 79, 55-64, 2005.
[77] Sibson, R., "Studies in the Robustness of Multidimensional Scaling: Perturbation Anal-
ysis of Classical Scaling", Journal of the Royal Statistical Society: B 41, 217-229, 1979.
[78] Stahel, W.A., Robust Estimation: Infinitesimal Optimality and Covariance Matrix Estimators, Ph.D. thesis, ETH, Zürich, 1981.
[79] Staudte, R. G. and Sheather, S. J., "Robust Estimation and Testing", John Wiley and
Sons, New York, 1990.
[80] Stone, M. and Brooks, R.J., "Continuum Regression: Cross-validated Sequentially
Constructed Prediction Embracing Ordinary Least Squares, Partial Least Squares and
Principal Components Regression", Journal of the Royal Statistical Society: B, 52,
237-269, 1990.
[81] Sundberg, R., "Continuum Regression and Ridge Regression", Journal of the Royal
Statistical Society: B,55, 653-659, 1993.
[82] Vanden Branden, K. and Hubert, M., "Robustness Properties of a Robust PLS Regression Method", Analytica Chimica Acta, 515, 229-241, 2004.
[83] Wakeling, I. N. and Macfie, H.J.H., "A Robust PLS Procedure", Journal of Chemometrics, 6, 189-198, 1992.
[84] Wangen, L.E. and Kowalski, B.R., "A Multiblock Partial Least Squares Algorithm for Investigating Complex Chemical Systems", Journal of Chemometrics, 3, 3-20, 1989.
[85] Wiklund, S., Nilsson, D., Eriksson, L., Sjöström, M., Wold, S. and Faber, K., "A Randomization Test for PLS Component Selection", Journal of Chemometrics, 21, 427-439, 2007.
[86] Wilcox, R. R., Introduction to Robust Estimation and Hypothesis Testing, 2nd Edition,
Elsevier, 2005.
[87] Wold, H., "Estimation of Principal Components and Related Models by Iterative Least
Squares", Multivariate Analysis, Academic Press, New York, 391-420, 1966.
[88] Wold H., "Path Models with Latent Variables: The NIPALS Approach", Quantitative
Sociology: International perspectives on mathematical and statistical model building,
Academic Press, 307-357, 1975.
[89] Wold S., "Cross-validatory Estimation of the Number of Components in Factor and
Principal Components Models", Technometrics, 20, 397-405, 1978.
[90] Woodruff, D. and Rocke, D., "Computable Robust Estimation of Multivariate Location and Shape in High Dimension Using Compound Estimators", Journal of the American Statistical Association, 89, 888-896, 1994.
Appendix
X : n × p matrix of explanatory variables
Y : n × q matrix of response variables
y : n × 1 vector of response variable
aj : jth column of a matrix A (column vector)
ai : ith row of a matrix A (column vector)
A′ : Transpose of matrix A
A⊥ : Orthogonal complement of A
A⁻¹ : Inverse of matrix A
A+ : Moore-Penrose inverse of matrix A
Im : m × m identity matrix
1m : m × 1 vector of ones
‖a‖ : Euclidean norm of a vector a
∝ : Proportional to