Robust Partial Least Squares For Regression and Classification

Except where reference is made to the work of others, the work described in this dissertation is my own or was done in collaboration with my advisory committee. This dissertation does not include proprietary or classified information.

Asuman Seda Turkmen

Certificate of Approval:

Asheber Abebe, Associate Professor, Mathematics and Statistics
Nedret Billor, Chair, Associate Professor, Mathematics and Statistics
Mark Carpenter, Associate Professor, Mathematics and Statistics
Chris Rodger, Professor, Mathematics and Statistics
George T. Flowers, Interim Dean, Graduate School

Robust Partial Least Squares For Regression and Classification

Asuman Seda Turkmen

A Dissertation Submitted to the Graduate Faculty of Auburn University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Auburn, Alabama
August 9, 2008

Robust Partial Least Squares For Regression and Classification

Asuman Seda Turkmen

Permission is granted to Auburn University to make copies of this dissertation at its discretion, upon the request of individuals or institutions and at their expense. The author reserves all publication rights.

Signature of Author

Date of Graduation

Vita

Asuman Seda Türkmen, daughter of Ümit Peyda Topçubaşı and Vicdan Topçubaşı, was born July 24, 1977, in Adana, Turkey. She graduated with honors from Çukurova University in 1999 with a Bachelor of Science degree in Mathematics. After graduation, she continued her graduate studies in the same department, from which she received her Master of Science degree in Statistics in 2001. She came to Auburn University in August 2003 and entered the doctoral program in the Department of Mathematics and Statistics, where she completed her Ph.D. in Statistics in 2008.

Dissertation Abstract

Robust Partial Least Squares For Regression and Classification

Asuman Seda Turkmen

Doctor of Philosophy, August 9, 2008
(M.A., Çukurova University, 2002)
(B.S., Çukurova University, 1999)

133 Typed Pages

Directed by Nedret Billor

Partial Least Squares (PLS) is a class of methods for modeling relations between sets of observed variables by means of latent variables in settings where the explanatory variables are highly collinear and where they outnumber the observations. In general, PLS methods aim to derive orthogonal components using the cross-covariance matrix between the response variable(s) and the explanatory variables, a quantity that is known to be affected by unusual observations (outliers) in the data set. In this study, robustified versions of PLS methods, for regression and classification, are introduced.

For regression with a quantitative response, a robust PLS regression method (RoPLS), based on weights calculated by the BACON or PCOUT algorithm, is proposed. A robust criterion is suggested for determining the optimal number of PLS components, which is an important issue in building a PLS regression model. In addition, diagnostic plots are constructed to visualize and classify outliers. The robustness of the proposed method, RoPLS, is studied in detail. The influence function of the RoPLS estimator is derived for low dimensional data, and empirical robustness properties are provided for high dimensional data.

PLS was originally designed for regression problems with a quantitative response; however, it is also used as a classification technique when the response variable is qualitative.
Although several robust PLS methods have been proposed for regression problems, to our knowledge there has been no study on the robustness of PLS classification methods. In this study, the effect of outliers on existing PLS classification methods is investigated and a new robust PLS algorithm (RoCPLS) for classification is introduced. The performances of the proposed methods, RoPLS and RoCPLS, are assessed by employing several benchmark data sets and extensive simulation experiments.

Acknowledgments

I would like to express my sincere appreciation to my advisor and mentor, Dr. Nedret Billor, who has always provided her enthusiastic assistance, persistent encouragement and help throughout my graduate education. Words cannot express how thankful I am for her support in every aspect of my academic development. Without her and her continuous guidance, I never would have seen the end of this study. Special thanks go to Dr. Asheber Abebe for his valuable suggestions and opinions. I am also grateful to Dr. Mark Carpenter and Dr. Chris Rodger for kindly consenting to serve on the dissertation committee. Much-deserved gratitude is expressed to my parents and family for their endless love and support throughout my life. I am grateful for the loving kindness of my husband, Özgür Türkmen, for his patience, love, and support; he has been my true joy throughout this journey together. Finally, my sincere appreciation is also extended to all of the people who have always been there for me whenever I needed them. I dedicate this dissertation to the memory of my grandparents, Güzide & Mustafa Topçubaşı and Rabia & Lokman Karagöl, who truly believed in me.

Style manual or journal used: Journal of Approximation Theory (together with the style known as "aums"). Bibliography follows van Leunen's A Handbook for Scholars.

Computer software used: The document preparation package TeX (specifically LaTeX) together with the departmental style file aums.sty.

Table of Contents

List of Figures

1 Introduction

2 An Overview on Partial Least Squares
2.1 Introduction
2.2 Partial Least Squares Regression (PLSR)
2.2.1 NIPALS Algorithm
2.2.2 SIMPLS Algorithm
2.3 Determining the Optimal Number of Components in PLSR
2.4 PLS Regression Among Other Biased Methods
2.5 Statistical Properties of the PLSR Estimator
2.5.1 Shrinkage Properties of the PLSR Estimator
2.5.2 The Asymptotic Variance of the PLSR Estimator
2.5.3 Consistency of the PLSR Estimator

3 RoPLS: Robust Partial Least Squares Regression
3.1 Introduction
3.2 Outlier Detection Methods
3.2.1 BACON
3.2.2 PCOUT
3.3 Description of the Proposed Algorithm: RoPLS
3.3.1 RoPLS Algorithm
3.3.2 Selecting Number of Components
3.3.3 Diagnostic Plots
3.4 Numerical Examples
3.4.1 Simulation
3.4.2 Data Sets

4 Robustness Properties of RoPLS Estimator
4.1 Introduction
4.2 Influence Function for the SIMPLS Based Regression Estimator
4.3 Robustness of the RoPLS Estimator of β
4.3.1 Influence Functions for Low Dimension
4.3.2 Empirical Influence Function for High Dimension
4.3.3 Finite-Sample Breakdown Properties of RoPLS1 Estimator

5 RoCPLS: Robust Partial Classification
5.1 Introduction
5.2 Classification Methods Based on Dimension Reduction
5.3 Description of the Proposed Algorithm: RoCPLS
5.4 Numerical Examples
5.4.1 Low Dimension: Wine Recognition Data
5.4.2 High Dimension: Colon Data
5.4.3 High Dimension: Leukemia Data

6 Conclusions and Future Work

Bibliography

Appendix

List of Figures

2.1 Scatter plots of y versus PC1, PC2, PC3, PC4.
2.2 Scatter plots of ŷ versus y using PCR with k=2 (left) and PLSR with k=2 (right).
3.1 Types of outliers when a 3-dimensional data set X is projected on a 2-dimensional PLS subspace (left) and the corresponding orthogonal-score distance plot (right).
3.2 Diagnostic plots for the original Tobacco data.
3.3 Diagnostic plots for the contaminated Tobacco data.
3.4 Boxplots of norms (left) and angles (right) between slope estimates before and after 12%, 20% and 52% contamination.
3.5 Scatter plot of the first two columns of the Fish data.
3.6 RRMSE (left) and RR2 (right) index plots for the Fish data.
3.7 Diagnostic plots for the Fish data (RoPLS1).
3.8 Diagnostic plots for the Fish data (SIMPLS).
3.9 (a) Original Biscuit-dough data (b) Preprocessed Biscuit-dough data.
3.10 RRMSE (left) and RR2 (right) index plots for the Biscuit-dough data.
3.11 Diagnostic plots for the Biscuit-dough data.
4.1 (a) The norm of the IF of r1 (b) The norm of the IF of β̂1 (c) The IF of β̂11 (d) The IF of β̂12.
4.2 (a) Graph of w for robust distances (b) Graph of w for scaled errors, ε.
4.3 (a) Graph of w(0)F(·) versus (u1, v) (b) Graph of w(0)F(·).
4.4 Norms of the empirical influence functions for (a) SIMPLS, n=20, p=200, k=3 (b) RoPLS1, n=20, p=200, k=3 (c) SIMPLS, n=25, p=125, k=2 (d) RoPLS1, n=25, p=125, k=2.
4.5 The finite-sample breakdown values of the SIMPLS, RoPLS1 and PRM estimators for (a) {n,p}={30,6} (b) {n,p}={20,200}.
5.1 Misclassification error rates for PLS and PCA for k=1.
5.2 Cross-validation error rates obtained for the Wine data.
5.3 Orthogonal-score distance plots based on SIMPLS (left) and RoCPLS (right) for the Wine data.
5.4 Scree plots for the original and deleted Wine data.
5.5 Orthogonal-score distance plots based on SIMPLS (left) and RoCPLS (right) for the Colon data.
5.6 3D scatter plot of the first three components for the Colon data obtained from RoCPLS.
5.7 Boxplots of the error rates for SIMPLS, RoCPLS, PCA and DG based classification.
5.8 Scatter plot of the first two components for the Leukemia data.
5.9 Scatter plot of the first two components (left) and orthogonal-score plot (right) for the Leukemia data.
5.10 Boxplots of the error rates for (a) no outliers (b) 1 outlier (c) 2 outliers (d) 3 outliers in each class.
5.11 The angle between the first PLS weight vectors for clean and contaminated data.

Chapter 1
Introduction

Most traditional statistical techniques are designed for low dimensional data sets in which the number of observations (n) is greater than the number of variables (p). Applying statistical methods to problems such as predicting the survival time or the tumor class of a patient from high dimensional data (n << p) is a difficult and challenging task. At the same time, data sets in many scientific fields are now high dimensional, because advances in technology have made simultaneous monitoring of thousands of features (variables) possible. Analyzing such data sets has therefore become a focus for many researchers in a wide range of scientific fields, owing to the variety of statistical problems that have emerged and require resolution.

Recently, partial least squares (PLS) has become an important statistical tool for modeling relations between sets of observed variables by means of latent variables, especially for statistical problems dealing with high dimensional data sets. PLS is a member of the nonlinear iterative least squares (NILES) procedures developed by Wold ([87], [88]). The use of PLS methods for regression problems began in the early 1980s. The main idea in PLS regression (PLSR) is to summarize high dimensional and/or collinear explanatory variables into a smaller set of uncorrelated, so-called latent variables, which have the "best" predictive power.

Although PLSR was initially developed for social and economic science problems having scarce information, it has received a great amount of attention in the chemometrics literature. The main application of PLSR in chemometrics is the prediction of constituent concentrations of a sample based on its spectrum obtained by spectroscopic techniques, such as near infrared (NIR) spectroscopy, energy-dispersive X-ray fluorescence spectroscopy, and ultraviolet (UV) spectroscopy.
Spectral data contain a large amount of information, since a spectrum typically ranges over a large number of wavelengths (variables) with a limited number of concentrations (observations). Spectroscopic techniques, in combination with PLSR analysis, have proved to be a powerful analytical tool for analyzing on-line industrial processes. Its speed, relatively good performance, and ability to handle data sets with more variables than observations have led to many applications of PLSR in other scientific areas such as bioinformatics, food research, medicine, and pharmacology. For instance, in the area of drug design, a large number of chemicals need to be evaluated for their toxicity and effectiveness before they are used by pharmacists. Quantitative Structure-Activity Relationships (QSAR) analysis employs theoretical molecular descriptors to obtain reliable estimates of the toxic and therapeutic potential of chemicals. QSAR data sets contain some degree of multicollinearity, which can be handled by using PLSR. Besides this, because of the long time scale of the data collection process, QSAR data often contain outliers. Although PLSR handles the multicollinearity problem, it fails to deal with data containing outliers, since it is based on maximizing the sample covariance between the response(s) and a set of explanatory variables, a statistic that is known to be sensitive to outliers. The joint occurrence of multicollinearity and outliers is no exception in real data sets, and it leads to a need for robust PLSR methods ([13], [35], [38], [48], [76], [83]) in chemometrics as well as in other application areas.

The problem of classifying entities into one of several groups has been another important goal in many scientific investigations. It is important that this be done in a manner that minimizes the misclassification error rate. High dimensionality and collinearity make the application of most statistical classification methods difficult, or even impossible, in some cases. The procedure for classification of high dimensional data often consists of two steps: the first step is to construct a few components from a large number of explanatory variables by using dimension reduction techniques, and the second step is to employ classical classification methods on the constructed components. Although PLS was originally designed for regression problems, it has come to be used frequently as a dimension reduction tool for classification problems, and recent studies have shown that classification via PLS performs quite well ([4], [8], [60], [65], [66]), especially for microarray data analysis. An important application of microarray technology is tumor diagnosis. A reliable and precise classification of tumors is potentially life-saving and hence is essential to physicians. In the presence of outliers, e.g. tissue-specific genes whose expression profile is considerably different (whether "erroneous" or "genuine") in particular tissues than in others, dimension reduction via PLS would yield unreliable results, since PLS is known to be sensitive to outliers. Although several robust PLS methods have been proposed for the case where the response variable is quantitative, to our knowledge there has been no study on the robustness of PLS when the response variable is qualitative.

The main contribution of this work is the construction of robust algorithms for partial least squares methods.
The chapters are organized as follows. Chapter 2 gives a detailed literature review of PLSR, including the existing PLS algorithms; the asymptotic variance, consistency, geometric and peculiar shrinkage properties of the PLSR estimator; and the relationship between PLSR and other biased estimation methods such as principal component regression, ridge regression and continuum regression. A robust PLSR method (RoPLS) is introduced in Chapter 3, and its robustness properties are explored in Chapter 4. The effect of outliers on existing PLS classification methods is investigated, and a new robust PLS algorithm for classification (RoCPLS) is proposed, in Chapter 5. Finally, conclusions and proposed future work are given in Chapter 6. A recapitulation of the notation used throughout the work can be found in the Appendix.

Chapter 2
An Overview on Partial Least Squares

2.1 Introduction

The standard multiple regression model is defined by the equation

y = Xβ + ε     (2.1)

where X is an n × p matrix of explanatory variables (predictors), y is an n × 1 vector of the response variable, β is a p × 1 vector of unknown parameters, and ε is an n × 1 vector of error terms whose entries are independently and identically distributed. In multivariate models, the response vector y in model (2.1) is replaced by the n × g response matrix Y, where g ≥ 2. Throughout this study, although the PLS algorithms are given for multivariate models in general, only the multiple univariate regression model is considered and emphasized, because of its simplicity. Furthermore, multiple univariate models always give better results than multivariate models in terms of the variance explained ([9], [31], [34], [59]), as long as the response variables are unrelated.

The outline of this chapter is as follows. In Section 2.2 the most commonly used PLS algorithms, NIPALS and SIMPLS, are described. Popular approaches for determining the optimal number of components are discussed in Section 2.3. The relationship between PLSR and other biased regression techniques is summarized in Section 2.4, followed by the statistical properties of the PLSR estimator in Section 2.5.

2.2 Partial Least Squares Regression (PLSR)

The ordinary least squares (OLS) estimator of β, denoted β̂_OLS, in the model given by (2.1) is the solution of the following optimization problem:

β̂_OLS = argmax_b corr{Xb, y}²     (2.2)

In many applications of multiple regression (e.g., spectral data analysis in chemometrics), multicollinearity is inevitable as a result of the large number of variables collected by modern technologies of computers, networks, and sensors. Despite having desirable properties, the OLS estimator can have an extremely large variance and yields imprecise predictions when the data are multicollinear. Moreover, the solution of (2.2) is not unique when n ≤ p.

One way to deal with the multicollinearity and/or dimensionality problem is to regress the response variable y on a subset of k orthogonal (latent) vectors stored in a score matrix of size n × k, in which the important features of X have been retained. The score matrix is formed by taking linear combinations of the columns of X. PLS regression (PLSR) constructs the columns of the score matrix, T = [t_1, t_2, ..., t_k], by solving the following optimization problem for h = 1, 2, ..., k (k ≤ p):

r_h = argmax_{||r||=1} cov(Xr, y)² = argmax_{||r||=1} r'X'y y'Xr     (2.3)

subject to t_h't_j = r_h'X'Xr_j = 0 for 1 ≤ j < h.
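For h = 1 the solution of (2.3) has a simple closed form; the short derivation below is added for completeness, since the algorithms in the following subsections implicitly rely on it.

```latex
% For h = 1 and centered data, (2.3) is a rank-one maximization problem:
% cov(Xr, y)^2 is proportional to (r'X'y)^2, and by the Cauchy-Schwarz inequality
\[
  (r'X'y)^2 \;\le\; \|r\|^2\,\|X'y\|^2 \;=\; \|X'y\|^2
  \qquad \text{for } \|r\| = 1,
\]
% with equality if and only if r is proportional to X'y, so that
\[
  r_1 \;=\; \frac{X'y}{\|X'y\|},
\]
% i.e. the first PLS weight vector is the normalized cross-covariance vector.
% Subsequent weight vectors solve the same problem under the orthogonality
% constraint t_h' t_j = 0, which NIPALS and SIMPLS impose through their
% respective deflation schemes.
```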
Thus PLSR balances the maximal correlation criterion for OLS given in (2.2) against the requirement of explaining as much variability as possible in both the X- and y-spaces.

Several iterative procedures have been proposed to solve the nonlinear optimization problem in (2.3), such as the PLS Mode A, PLS-SB, NIPALS and SIMPLS algorithms, which differ in the deflation scheme required for the orthogonality of the derived components. The PLS Mode A algorithm ([88]) aims to model existing relationships between variables rather than to model for prediction. PLS-SB computes all eigenvectors at once, and the score vectors obtained by this method are not necessarily orthogonal. The most commonly used methods, NIPALS and SIMPLS, consist of two steps that may be called calibration (deriving the components) and prediction. These algorithms, for both univariate and multivariate responses, are explained in the following subsections. The extension of the two-block PLS model, where X and y (or Y for the multivariate model) are block variables, to the multi-block PLS model is also given in the literature ([84], [88]) and is not discussed in this study.

2.2.1 NIPALS Algorithm

The NIPALS algorithm ([87]) was developed as an alternative to principal component algorithms. NIPALS employs sequential simple linear regressions instead of the singular value decomposition to calculate principal components. The PLS algorithm can be viewed as carrying out two simultaneous NIPALS principal component analyses, one for X and one for Y, while interchanging the results from X for the analysis of Y and vice versa ([50], [34]), in order to solve the maximization problem

max_{||r||=||s||=1} cov(Xr, Ys)²

under the orthogonality constraint on the derived components, where s = 1 and Y = y for the univariate model. Since both X and Y are used in the computation of the components, PLS is presented as a member of the bilinear class of methods, and the bilinear model can be written as

X = TP' + E,     (2.4)
Y = UQ' + F.     (2.5)

The equations given in (2.4) and (2.5) are called the outer relations, where T and U are score matrices derived from X and Y, respectively; P (the x-loadings) contains the regression coefficients of X on T; Q (the y-loadings) contains the regression coefficients of Y on U; and E and F are matrices of errors. It is assumed that the score matrix T is a good predictor for Y and that a linear, inner relationship between the score matrices T and U exists, i.e. U = TB + H, where B is a k × k diagonal matrix and H is a matrix of errors. The mixed relation then becomes

Y = UQ' + F = (TB + H)Q' + F = TA' + F*     (2.6)

where A' = BQ' is a matrix of regression coefficients and F* = HQ' + F is a matrix of errors. For the univariate case (g = 1), the matrix B in the model (2.6) becomes the identity matrix, so equation (2.6) represents both the outer and the mixed relationships, which can be rewritten as

y = Ta + f*     (2.7)

where a is a vector of regression coefficients and f* is a vector of errors. The NIPALS algorithm for a univariate response variable (y), based on the mixed relationship in (2.7), is called PLS1, whereas the NIPALS algorithm for a multivariate response variable (Y) is called PLS2. The calibration step of the PLS2 algorithm for k components is described as follows:

Algorithm 2.1 (PLS2)

Step 1: Let E_0 and F_0 be copies of X and Y, respectively.
Step 2: For h = 1, 2, ..., k do Steps 2.1-2.4:
Step 2.1: Let u_h be a column of F_{h-1}, e.g., the one having maximum variance.
Step 2.2: Repeat Steps 2.2.1-2.2.4 until convergence of w_h (or t_h).
Step 2.2.1: Perform the regression E_{h-1} = u_h w_h' + ε_1, yielding the least squares solution

w_h = E_{h-1}'u_h / (u_h'u_h)     (2.8)

and normalize w_h := w_h / ||w_h||.

Step 2.2.2: Perform the regression E_{h-1}' = w_h t_h' + ε_2, yielding the least squares solution t_h = E_{h-1}w_h.

Step 2.2.3: Perform the regression F_{h-1} = t_h q_h' + ε_3, with the least squares solution

q_h = F_{h-1}'t_h / (t_h't_h)     (2.9)

and normalize q_h := q_h / ||q_h||.

Step 2.2.4: Perform the regression F_{h-1}' = q_h u_h' + ε_4, yielding the solution u_h = F_{h-1}q_h.

Step 2.3: Regress E_{h-1} and F_{h-1} on t_h separately to compute the residuals

E_h = E_{h-1} - t_h p_h'
F_h = F_{h-1} - b_h t_h q_h'

where p_h is the regression coefficient vector obtained by regressing E_{h-1} on t_h (outer relation) and b_h is the regression coefficient obtained by regressing u_h on t_h (inner relation), i.e.,

p_h = E_{h-1}'t_h / (t_h't_h),     b_h = u_h't_h / (t_h't_h).

Step 2.4: Store the vectors w_h, t_h, and p_h in the matrices W_h = [w_1, w_2, ..., w_h], T_h = [t_1, t_2, ..., t_h], and P_h = [p_1, p_2, ..., p_h], respectively. Set h := h + 1.

For the univariate case (PLS1), y and f_0 are used instead of Y and F_0 in the first step of Algorithm 2.1. Steps 2.1 and 2.2 are replaced by 2.2.1 (with u_h = f_{h-1}) and 2.2.2, respectively, while Steps 2.3 and 2.4 remain the same with b_h = 1. In other words, convergence of w_h is obtained in the first iteration.

The score matrix T_k can be written in terms of linear combinations of the columns of E_0 = X, that is, T_k = XR_k, where the hth column of R_k is called the hth PLS weight vector. The matrix R_k is related to the matrices P_k and W_k stored in the NIPALS algorithm via the formula

R_k = W_k (P_k'W_k)^{-1}     (2.10)

which follows from the fact that R_k and W_k share the same column space and that P_k'R_k is equal to the identity matrix ([41], [45], [55]). In the prediction step, the ordinary least squares estimate of a in the univariate mixed model (2.7) is obtained by regressing the response vector y onto these k components, which yields

â(k) = (T_k'T_k)^{-1}T_k'y = (R_k'X'XR_k)^{-1}R_k'X'y.

Therefore, the PLS estimate of β in the model given in (2.1) is β̂(k)_PLS = R_k â(k). Alternative formulations of PLS1 are suggested by Helland ([41]) and Garthwaite ([33]). An extensive simulation study by Breiman and Friedman ([9]) comparing univariate and multivariate regression methods, including PLS1, PLS2, OLS, and other biased regression methods, demonstrated that performing separate PLS1 regressions on each individual response is a better strategy than employing PLS2 (see also [31], [34], and [59]).

The major drawback of the NIPALS algorithm (PLS1 and PLS2) is that the columns of the score matrix T are obtained as linear combinations of the deflated data matrix X. Since one loses sight of what is in the depleted data, the interpretation of the components becomes complicated. The SIMPLS algorithm, given in the next subsection, resolves this drawback by using a different deflation scheme.
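To make the calibration and prediction steps concrete, here is a minimal sketch of PLS1 for a single response, following the steps above (weights from E_{h-1}'f_{h-1}, scores, loadings, deflation, then (2.10) and the regression of y on the scores). It is an illustrative re-implementation in Python/NumPy, not the code used for the numerical work in this dissertation, and the function and variable names are assumptions.

```python
import numpy as np

def pls1(X, y, k):
    """PLS1 estimate of beta with k components (illustrative sketch)."""
    X = X - X.mean(axis=0)                  # E_0
    y = y - y.mean()                        # f_0
    E = X.copy()
    W, T, P = [], [], []
    for _ in range(k):
        w = E.T @ y                         # cf. (2.8) with u_h = f_{h-1}; the scalar
        w = w / np.linalg.norm(w)           # denominator is absorbed by normalization
        t = E @ w                           # score
        p = E.T @ t / (t @ t)               # x-loading
        E = E - np.outer(t, p)              # deflate the data matrix
        W.append(w); T.append(t); P.append(p)
        # deflating y is not needed in the univariate case, since E_h' t_h = 0
    W, T, P = map(np.column_stack, (W, T, P))
    R = W @ np.linalg.inv(P.T @ W)          # (2.10)
    a = np.linalg.solve(T.T @ T, T.T @ y)   # regress y on the k scores
    return R @ a                            # beta_hat_PLS with k components
```

A fitted value for centered data is then simply X_centered @ pls1(X, y, k) plus the mean of y.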
2.2.2 SIMPLS Algorithm

The SIMPLS algorithm ([15]) is an alternative to NIPALS that aims to derive the PLS components directly in terms of the original data, which results in faster computation, smaller memory requirements and easier interpretation. SIMPLS deflates the cross-covariance matrix, S_xy ∝ X'Y, whereas NIPALS deflates the original data matrix X, to obtain orthogonal components. The SIMPLS algorithm can be summarized as follows:

Algorithm 2.2 (SIMPLS)

Step 1: Compute the cross-product matrix S⁰_xy = X'Y (X and Y are centered).
Step 2: Repeat Steps 2.1-2.6 for h = 1, 2, ..., k:
Step 2.1: Compute the first left singular vector of S^{h-1}_xy as the hth PLS weight vector r_h.
Step 2.2: Compute the hth score, t_h = Xr_h, and normalize t_h := t_h / ||t_h||.
Step 2.3: Update the hth PLS weight, r_h := r_h / sqrt(r_h'X'Xr_h).
Step 2.4: Compute the hth x-loading by regressing X on t_h: p_h = X't_h.
Step 2.5: Store the vectors r_h, t_h, and p_h in the matrices R_h = [r_1, r_2, ..., r_h], T_h = [t_1, t_2, ..., t_h], and P_h = [p_1, p_2, ..., p_h], respectively.
Step 2.6: Set h := h + 1 and S^{h-1}_xy = (I_p - V_{h-1}V'_{h-1})X'Y, where the columns of V_{h-1} form an orthonormal basis for P_{h-1}.

The orthogonality constraint on the components is fulfilled when the PLS weight vector r_h is orthogonal to all previous x-loadings P_{h-1} = [p_1, p_2, ..., p_{h-1}]. As a result, the hth SIMPLS weight vector r_h for h = 2, ..., k is obtained as the first left singular vector of S^{h-1}_xy, which is the projection of S^{h-2}_xy onto a subspace orthogonal to P_{h-1}. Therefore, if the columns of V_{h-1} = [v_1, v_2, ..., v_{h-1}] form an orthonormal basis of P_{h-1} obtained by the Gram-Schmidt method, then

S^{h-1}_xy = (I_p - v_{h-1}v'_{h-1})S^{h-2}_xy = (I_p - V_{h-1}V'_{h-1})X'Y.     (2.11)

Once k components are derived, the PLS estimate is obtained by using the score matrix as the explanatory variables, as in the prediction step described for the NIPALS algorithm. After h components are derived, the data matrix is implicitly reduced to X(I_p - V_hV_h') by the SIMPLS algorithm, as can be seen from (2.11). In the PLS1 algorithm, the hth derived component, t_h, is equal to E_{h-1}w_h, where w_h is the normalized form of

E_{h-1}'f_{h-1} = X'(I_n - T_{h-1}(T_{h-1}'T_{h-1})^{-1}T_{h-1}')² y = X'(I_n - T_{h-1}(T_{h-1}'T_{h-1})^{-1}T_{h-1}') y.     (2.12)

Therefore, the data matrix is explicitly reduced to (I_n - T_h(T_h'T_h)^{-1}T_h')X by PLS1. Although the residual matrices differ, application of both algorithms to data sets demonstrates that the SIMPLS and NIPALS algorithms are equivalent for the univariate case, which is also stated in the next proposition.

Proposition 2.1 (De Jong, [15]) SIMPLS is equivalent to PLS1.

Proposition 2.1 can be proven by induction. However, when applied to a multivariate set of response variables (g > 1), the SIMPLS results differ from the results of PLS2 ([15]). In this study, the SIMPLS algorithm is employed because of its speed and efficiency.
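For a single response, Algorithm 2.2 can be sketched in the same style; by Proposition 2.1 it reproduces the PLS1 fit of the previous sketch. Again, this is only an illustrative reading of the steps above, with assumed function and variable names, not reference code.

```python
import numpy as np

def simpls(X, y, k):
    """SIMPLS estimate of beta with k components, univariate response (sketch)."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    s = X.T @ y                                  # cross-product vector, Step 1
    R, T, V = [], [], []
    for _ in range(k):
        r = s / np.linalg.norm(s)                # first left singular vector of s, Step 2.1
        t = X @ r
        nt = np.linalg.norm(t)
        t, r = t / nt, r / nt                    # Steps 2.2 and 2.3
        p = X.T @ t                              # x-loading, Step 2.4
        v = p.copy()                             # Gram-Schmidt against previous loadings
        for v_prev in V:
            v = v - (v_prev @ p) * v_prev
        v = v / np.linalg.norm(v)
        V.append(v)
        s = s - v * (v @ s)                      # deflation of the cross-product, cf. (2.11)
        R.append(r); T.append(t)
    R, T = np.column_stack(R), np.column_stack(T)
    return R @ (T.T @ y)                         # scores are orthonormal, so a = T'y
```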
2.3 Determining the Optimal Number of Components in PLSR

The decision on the optimal number of components, k, is a very important issue in building a PLSR model. Although it is possible to calculate as many components as the rank of X, this does not make sense in practice, because data are never noise-free and some of the smaller components will only describe noise. Due to the uncertain statistical behavior of PLSR, explained in Section 2.5, it is difficult to perform inferential tasks such as assessing the number of components. Consequently, developing and comparing PLS component selection rules have been, and apparently continue to be, subjects of active research in chemometrics. Cross-validation, the adjusted Wold's criterion and the randomization test are the leading methods that have been proposed to seek out the optimal dimensionality of PLS models.

Among the many approaches proposed in the past, the cross-validation (CV) scheme stands out in particular. In M-fold cross-validation, the original sample is partitioned into M sub-samples. Of the M sub-samples, a single sub-sample is retained as the validation set for testing the model, and the remaining M - 1 sub-samples are used as the learning set for estimating the model. The cross-validation process is then repeated M times (the number of folds), with each of the M sub-samples used exactly once as the validation set. The M results from the folds can then be combined to produce a single estimate of the optimal number of components. In particular, n-fold cross-validation (M = n), where only one observation is deleted at a time and the process is repeated as many times as there are samples, is called leave-one-out cross-validation. The resulting residual sum of squares, PRESS, is a measure of the predictive power of the components in the model. The PRESS value for an h-component univariate PLSR using leave-one-out cross-validation is

PRESS(h) = Σ_{i=1}^{n} (y_i - ŷ_{-i}(h))²     (2.13)

where the predicted values ŷ_{-i}(h) are based on the parameter estimates obtained from the data set that excludes observation i, using a PLSR model with h components. The optimal number of components is the one that yields the minimum PRESS or root mean squared error, RMSE,

k = argmin_h {PRESS(h)} = argmin_h {RMSE(h)}     (2.14)

where

RMSE(h) = sqrt( (1/n) Σ_{i=1}^{n} (y_i - ŷ_{-i}(h))² ) = sqrt( PRESS(h)/n ).     (2.15)

A simple and classical method is Wold's R criterion ([89]), which compares two successive values of PRESS via their ratio, that is,

R = PRESS(h+1) / PRESS(h)     (2.16)

where PRESS(h) is given in equation (2.13). When R is greater than 1, the optimal number of components is taken to be h. Instead of comparing this ratio to unity, it was proposed in ([89]) to fix the threshold at 0.90 or 0.95, which is called the Adjusted Wold's Criterion.

The randomization test ([85]) is a recent method that assesses the statistical significance of each individual component that enters the model. Theoretical approaches to achieve this goal (using a t- or F-test) have been put forth, but they are all based on some assumptions. The randomization test is a data-driven approach and is therefore ideally suited for avoiding assumptions. Denham ([19]) evaluated the performance of several mean squared error (MSE) estimation approaches in terms of their accuracy and usefulness in determining the optimal number of components to include in a PLSR model. It was concluded that all methods perform very comparably for data sets with few variables, while the cross-validation method yields better MSE estimates for data sets with a large number of variables. One area where the method of cross-validation works poorly is design of experiments, where the randomization test should have merit ([85]).

2.4 PLS Regression Among Other Biased Methods

In multiple linear regression, the OLS estimator of the regression coefficient vector has minimum variance in the class of unbiased estimators. The presence of multicollinearity results in large variances of the coefficient estimators. Therefore, several biased estimation methods have been proposed as alternatives to the OLS estimator when multicollinearity is present. The main goal of biased methods is to decrease the mean squared error of prediction by introducing a reasonable amount of bias into the model. This is done by shrinking the solution coefficient vector away from the OLS solution toward directions in which the projected data have larger spread. PLSR is a biased regression method, and it is related to other biased methods such as principal component regression (PCR, [62]) and ridge regression (RR, [44]). Most of these methods can be unified under a generalized approach called continuum regression.

Continuum regression (CR) is a technique that can generate a range of models including OLS, PLSR and PCR. The CR weight vectors r_h for h = 1, 2, ..., k are defined, as proposed by Stone and Brooks ([80]), according to the criterion

r_h = argmax_{||r||=1} cov(Xr, y)² [Var(Xr)]^{δ/(1-δ) - 1} = argmax_{||r||=1} (r'X'y)² (r'X'Xr)^{δ/(1-δ) - 1}     (2.17)

under the constraint that cov(Xr_h, Xr_j) = 0 for h > j. The parameter δ (0 ≤ δ ≤ 1) adjusts the amount of information in the regressors that is considered for predicting the response variable. The single vector (h = 1) that maximizes the squared sample correlation between the response and the corresponding linear combination of the predictor variables is the OLS solution, which is obtained by taking δ = 0 in equation (2.17). Similarly, for h = k, δ = 0.5 and δ = 1 yield the PLSR and PCR based solutions with k components, respectively.
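The following small numerical sketch illustrates these special cases for the first component: the closed-form maximizers of (2.17) at δ = 0, δ = 0.5 and δ → 1 are the OLS direction, the normalized X'y direction (PLSR) and the first eigenvector of X'X (PCR), respectively. The synthetic data and the brute-force check are assumptions used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 5
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)
y = X @ rng.standard_normal(p) + rng.standard_normal(n); y -= y.mean()

def unit(v):
    return v / np.linalg.norm(v)

def cr_objective(r, delta):
    """Criterion (2.17) for a single direction r and a parameter delta < 1."""
    expo = delta / (1.0 - delta) - 1.0
    return float((r @ X.T @ y) ** 2 * (r @ X.T @ X @ r) ** expo)

r_ols = unit(np.linalg.solve(X.T @ X, X.T @ y))    # delta = 0: generalized Rayleigh quotient
r_pls = unit(X.T @ y)                              # delta = 0.5: exponent is zero
r_pcr = unit(np.linalg.eigh(X.T @ X)[1][:, -1])    # delta -> 1: dominant eigenvector of X'X

# sanity check at delta = 0.5: no unit vector beats the X'y direction
Rdirs = rng.standard_normal((5000, p))
Rdirs /= np.linalg.norm(Rdirs, axis=1, keepdims=True)
assert all(cr_objective(r, 0.5) <= cr_objective(r_pls, 0.5) for r in Rdirs)
```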
The continuum regression (CR) is a technique that can generate a range of models including OLS, PLSR and PCR. CR weight vectors rh for h = 1;2;:::;k are deflned as proposed by Stone and Brooks ([80]), according to the criterion: rh = argmax krk=1 cov(Xr;y)2[Var(Xr)] ?1???1 = argmax krk=1 (r0X0y)2(r0X0Xr) ?1???1 (2.17) under the constraint that cov(Xrh;Xrj) = 0 for h > j. The parameter ? (0 ? ? ? 1) adjusts the amount of information of the regressors to be considered for predicting the response variable. 16 The single vector (h = 1) that maximizes the squared sample correlation between the response and the corresponding linear combination of the predictor variables is the OLS solution which is obtained by taking ? = 0 in equation (2.17). Similarly, for h = k, ?=0.5 and ?=1 yield PLSR and PCR based solutions on k components, respectively. Alternatively, the OLS estimator can be obtained as the solution of the normal equa- tions X0Xb = X0y: (2.18) Throughthissection, itisassumedthatn > pandXhasfull-column rank, i.e. rank(X) = p, however, with some minor modiflcations, the results given in this section can be established for other cases, as well. Then, the OLS estimator of fl is bflOLS = (X0X)?1X0y: (2.19) If rank(X) < p, then (X0X)?1 in (2.19) is replaced by (X0X)+ which yields unique mini- mum length least squares (MLLS) solution. The idea of PCR is to replace the original regressors by h ? p principal components (PCs), stored in the score matrix, Zh = XVh, where the flrst h eigenvectors of X0X form Vh. These eigenvectors are the solutions of (2.17) for ?=1. Therefore, the PCR estimator of fl in (2.1) based on h components is: ^fl(h)PCR = (Z0hZh)?1Z0hy = VhV0h^flOLS: (2.20) It can be seen from (2.20) that the PCR estimator based on h components is the orthogonal projection of OLS estimator onto the space spanned by the flrst h eigenvectors 17 of X0X, since VhV0h = Vh(V0hVh)?1Vh. On the other hand, the PLSR estimator of fl for h components is given by ^fl(h)PLS = RhP0h^flOLS = Wh(P0hWh)?1P0h^flOLS: (2.21) The matrix Wh(P0hWh)?1P0h is an idempotent matrix, hence it is a projection matrix. How- ever, it is not a symmetric matrix, so it is called oblique projector. Therefore, the PLSR estimator for h components is the oblique projection of ^flOLS onto Wh along to the space P?h ([69]). In PCR, as well as in PLSR, the degree of bias is controlled by the dimension of the space , h, on which orthogonal projection of ^flOLS is taken. The smaller the value of h, the larger the bias. In PCR, the columns of score matrix are derived without the reference to the response variables so that the derived components are optimal in the sense of maximizing the amount of explained variation in X. On the other hand, in PLSR, a set of linear combinations for X and another set of linear combination for y are derived and they are optimally related in yet another sense. This is an advantage of PLSR over PCR especially in the cases, where components obtained by maximizing variation in X, may have no relevance for modeling y which is demonstrated in the next example. 18 Example 2.1 Hald?s data set ([23]), consists of four standardized regressors and one response variable, is used to demonstrate this drawback of PCR. The original response variable is replaced by y = 2Xv4 + ", where " ? N(0;1) and v4 is the eigenvector corresponding to the smallest eigenvalue of X0X, as suggested by Hadi et al. ([39]). 
It can be seen from Figure 2.1 that the scatter plots of y versus each of the first three PCs display a completely random pattern, while the relationship between y and the last PC is perfectly linear. PCR and PLSR based on k = 2 components resulted in mean squared errors of 5.3228 and 0.2277, respectively, which can also be concluded from Figure 2.2. In general, if the true regression coefficient lies in the direction of the ith eigenvector of X'X, then the ith component alone contributes everything to the fit, while the remaining PCs contribute nothing ([39]). In such cases, PLSR is expected to perform better than PCR, since the optimal directions are determined by taking y into account.

Another advantage of PLSR over PCR is that the vector of fitted values from PLSR is closer to the fitted values from OLS, and hence to y, than its PCR counterpart. The PLS model always gives a closer fit, in terms of the coefficient of determination R², than a PCR model with the same number of components ([16], [31], [70]).

Ridge regression (RR) is another well-known biased regression method. The method replaces the covariance matrix X'X by a better conditioned matrix, X'X + λI_p, for a value λ, called the ridge constant, that lies between 0 and 1. The aim is to stabilize the inverse of the possibly ill-conditioned covariance matrix by adding a multiple of I_p. As in OLS, the solution is defined by a single vector:

β̂_RR = argmax_{||w||=1} corr(Xw, y)² Var(Xw)/(Var(Xw) + λ) = (X'X + λI_p)^{-1}X'y.     (2.22)

Figure 2.1: Scatter plots of y versus PC1, PC2, PC3, PC4.

Figure 2.2: Scatter plots of ŷ versus y using PCR with k=2 (left; correlation 0.0017) and PLSR with k=2 (right; correlation 0.9867).

Setting λ = 0 yields the unbiased OLS solution, while larger values of λ introduce bias into the model. The relationship between the first factor of CR and RR is described in [81], where it is concluded that there is a one-to-one correspondence between δ (the CR parameter) and λ ([81]). RR differs from PLSR in two respects. First, it does not derive orthogonal components; it applies explicit shrinkage to the coefficient vector. Second, RR is usually applied to univariate regression models, although a generalized RR for the multivariate response model has been proposed. One of the disadvantages of RR is its heavier computation, especially for high dimensional problems.
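For completeness, a minimal sketch of the ridge estimator in (2.22) is shown below; lam stands for the ridge constant λ, and the mean-centering mirrors the treatment of X and y used throughout this chapter. It is only an illustration of the closed-form formula, not a full RR implementation (no procedure for choosing λ, for example).

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate (X'X + lam I)^(-1) X'y for centered data, cf. (2.22)."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# lam = 0 recovers the OLS solution (when X'X is invertible); increasing lam
# shrinks the coefficient vector toward zero, trading bias for variance.
```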
The comparison of OLS, PLSR, PCR, and RR is given by Frank and Friedman ([31]) and Almøy ([1]). Simulation studies indicated that none of the methods is uniformly better than the others. Frank and Friedman concluded that RR, PLSR and PCR provide substantial improvement over OLS when multicollinearity exists. In all settings, PLSR usually performed very comparably with RR, and it was closely followed by PCR. The main result of Almøy's study was that PLSR performs better when the irrelevant eigenvalues are large, whereas PCR performs better when the irrelevant eigenvalues are small.

2.5 Statistical Properties of the PLSR Estimator

Since the PLSR estimator of β is a non-linear function of y, it is very difficult to derive the exact distribution of the estimator, which leads to difficulties in inference-based tasks. Although PLSR is a very popular tool among chemometricians, it used to be overlooked by statisticians. More recently, however, statisticians have attempted to shed some light on the method and its properties. In this section, the shrinkage structure of the PLSR estimator ([17], [31], [53], [70]), its asymptotic variance ([18], [71], [75]) and its consistency ([42], [64]) are conveyed.

2.5.1 Shrinkage Properties of the PLSR Estimator

Characterizing the PLSR estimator in terms of shrinkage properties is useful for providing a link between PLSR and other shrinkage estimators. In general, if the singular value decomposition of X is given by X = UΛ^{1/2}V', where the columns u_j of U and the columns v_j of V are the left and right singular vectors of X, respectively, and Λ is the diagonal matrix with the ordered eigenvalues of X'X, λ_1 ≥ λ_2 ≥ ... ≥ λ_p, on the diagonal, then a shrinkage estimator of β has the form

β̂ = Σ_{j=1}^{p} f(λ_j) (u_j'y / sqrt(λ_j)) v_j = Σ_{j=1}^{p} f(λ_j) α_j v_j     (2.23)

where α_j = u_j'y / sqrt(λ_j), the f(λ_j) are shrinkage factors, and the u_j'y are called Fourier coefficients ([53]). Since the OLS estimator in (2.19) can be rewritten as Σ_{j=1}^{p} α_j v_j, all shrinkage factors f(λ_j) are equal to 1 for the OLS estimator. Setting f(λ_j) to values other than 1 introduces bias into the estimation process, which is beneficial when the increase in bias results in a decrease in mean squared error. Shrinkage factors less than 1 lead to a reduction in the variance of the estimator, whereas factors greater than 1 result in a simultaneous increase in both variance and bias. The shrinkage factors for the RR estimator are

f(λ_j) = λ_j / (λ_j + λ)

for j = 1, 2, ..., p, while the factors of the PCR estimator for h ≤ p components are

f(λ_j) = 1 for j ≤ h, and f(λ_j) = 0 for j > h.

It is easily seen that the shrinkage factors for PCR and RR are between 0 and 1. Therefore, they are regarded as shrinkage methods, since they shrink β̂ by shrinking some of the α_j.

Frank and Friedman ([31]) were the first statisticians to state the shrinkage property of the PLSR estimator, on the basis of simulation studies, but they did not provide a theoretical proof. The shrinkage properties of the PLSR estimator can be investigated theoretically through its relationship with Krylov spaces and the conjugate gradient (CG) method ([70]). Some brief background on these concepts is therefore given first.

The space spanned by the vectors z, Az, A²z, ..., A^{m-1}z is called the m-dimensional Krylov space of a square matrix A and a vector z, and is denoted by K_m(A, z). An alternative form of the PLSR estimator with h components can be given in terms of the h-dimensional Krylov space K_h(X'X, X'y). The matrix W_h obtained from the NIPALS algorithm is an orthonormal basis for the space K_h(X'X, X'y) ([41]), and it is central to the connection between PLSR and the conjugate gradient (CG) method. The CG method aims to solve a system of linear equations Cb = c arising from the minimization of the quadratic function φ(b) = (1/2)b'Cb - c'b for a positive semi-definite matrix C. The solution after h iterations of the CG algorithm is obtained as follows:

Algorithm 2.3 (CG)

Step 1: Let b_0 = 0 and d_0 = e_0 = c - Cb_0 = c. Repeat Steps 2-5 for j = 0, ..., h-1:

Step 2: Calculate

z_j = d_j'e_j / (d_j'Cd_j).     (2.24)

Step 3: Calculate

b_{j+1} = b_j + z_j d_j.     (2.25)

Step 4: Calculate the residual

e_{j+1} = c - Cb_{j+1} = c - C(b_j + z_j d_j) = e_j - z_j Cd_j.     (2.26)

Step 5: Calculate

d_{j+1} = e_{j+1} - (e_{j+1}'Cd_j / d_j'Cd_j) d_j,     (2.27)

and set j := j + 1.

When C = X'X and c = X'y, the normal equations given in (2.18) are obtained, and for any arbitrary initial solution, b_p converges (in exact arithmetic) to (X'X)^+X'y, which equals β̂_OLS if X is a full rank matrix ([70]).

The vectors generated during the CG algorithm have important properties ([36]). For instance, the space spanned by the vectors {d_0, d_1, ..., d_{h-1}} is K_h(C, c), and the residual vectors {e_0, e_1, ..., e_{h-1}} span the same Krylov space. Besides this, since the residuals are orthogonal, these vectors are actually the w_j's from the NIPALS algorithm when C = X'X and c = X'y, for j = 1, 2, ..., h, except for a scale factor. Therefore, the CG estimator b_j, for j = 1, 2, ..., h, is identical to the PLSR estimator with j components, β̂(j)_PLS, and the PLSR estimator with h = p components yields the OLS estimator. Another important property of the CG estimator b_j is that it minimizes (b - b_j)'C(b - b_j) over K_j(C, c) for j = 1, 2, ..., h. Therefore, the PLSR estimator based on h components is given by

β̂(h)_PLS = argmin_{b ∈ K_h(X'X, X'y)} (y - Xb)'(y - Xb).     (2.28)

Furthermore, the relationship between CG and PLSR can be used to prove the following proposition.

Proposition 2.2 (De Jong, [17]) The norm of β̂(h)_PLS is a strictly non-decreasing function of the number of components h; i.e., the PLSR estimator of β shrinks.

This result was proven algebraically by De Jong ([17]). Another proof, by Phatak and De Hoog ([70]), given below, is more compact and uses the relationship between the CG method and the PLSR estimator.

Proof: It is necessary to prove that the norms of the CG estimators for the normal equations, b_h = β̂(h)_PLS, increase as h increases, that is, ||b_1|| < ||b_2|| < ... < ||b_p|| = ||β̂_OLS||. In general, (2.25) gives

||b_{j+1}||² = ||b_j||² + z_j²||d_j||² + 2z_j d_j'b_j.

Hence, to prove that ||b_{j+1}||² > ||b_j||², it is sufficient to show that z_j d_j'b_j > 0, since z_j²||d_j||² ≥ 0. Multiplying both sides of (2.27) by e_{j+1}', the following equality is obtained:

e_{j+1}'d_{j+1} = e_{j+1}'e_{j+1} - (e_{j+1}'X'Xd_j / d_j'X'Xd_j) e_{j+1}'d_j.     (2.29)

Since the residuals are orthogonal and span{e_0, e_1, ..., e_{j+1}} = span{d_0, d_1, ..., d_{j+1}}, e_{j+1} is orthogonal to d_j. Therefore e_{j+1}'d_j = 0 in (2.29), and e_{j+1}'d_{j+1} = e_{j+1}'e_{j+1} for j = 1, 2, ..., p-1, from which it follows that

z_j = d_j'e_j / (d_j'X'Xd_j) = e_j'e_j / (d_j'X'Xd_j) > 0.     (2.30)

From (2.25), b_j = Σ_{i=0}^{j-1} z_i d_i, so d_j'b_j = Σ_{i=0}^{j-1} z_i d_j'd_i. Therefore, it is sufficient to prove that d_j'd_i > 0 for i < j in order to show that d_j'b_j > 0. This can be proven by expressing the d_j vectors in terms of the residuals. If l_j = -e_{j+1}'X'Xd_j / (d_j'X'Xd_j), then (2.27) gives

d_j = e_j + Σ_{i=0}^{j-1} ( Π_{k=i}^{j-1} l_k ) e_i.     (2.31)

So d_j'd_i > 0 as long as the l_k are positive. Again, using (2.26), we get e_{j+1} = e_j - z_j X'Xd_j, so X'Xd_j = z_j^{-1}(e_j - e_{j+1}). Therefore, l_j can be rewritten as

l_j = -e_{j+1}'X'Xd_j / (d_j'X'Xd_j) = -e_{j+1}'(e_j - e_{j+1}) / (z_j d_j'X'Xd_j) = ||e_{j+1}||² / (z_j d_j'X'Xd_j) > 0,     (2.32)

which completes the proof. □
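The proposition can also be checked numerically through the Krylov-space characterization (2.28): for each h, the PLS estimator is the least squares fit of y restricted to K_h(X'X, X'y), so its norm can be computed for h = 1, ..., p and compared. The sketch below does this on synthetic data; the data and the numerical tolerance in the final check are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 6
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)
y = X @ rng.standard_normal(p) + rng.standard_normal(n); y -= y.mean()

v = X.T @ y                      # first Krylov direction
cols, norms = [], []
for h in range(1, p + 1):
    cols.append(v.copy())
    K = np.column_stack(cols)                        # basis of K_h(X'X, X'y)
    a = np.linalg.lstsq(X @ K, y, rcond=None)[0]     # least squares over the Krylov space, (2.28)
    norms.append(np.linalg.norm(K @ a))              # norm of beta_hat_PLS with h components
    v = X.T @ (X @ v)                                # next Krylov direction, (X'X) v

print(np.all(np.diff(norms) >= -1e-8))   # True: the norm is non-decreasing in h
# For h = p the fit coincides with the OLS estimator, as noted above.
```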
It is also possible to express the shrinkage factors of the PLSR estimator in terms of the eigenvalues of the matrices X'X and W_h'X'XW_h; this expression, given in the next proposition ([53]), provides valuable information on the shrinkage behavior of the PLSR estimator.

Proposition 2.3 (Lingjærde and Christophersen, [53]) The shrinkage factors from (2.23) for the PLSR estimator with h components are given by

f(λ_j)^(h) = 1 - Π_{i=1}^{h} (1 - λ_j / θ_i^(h))     (2.33)

for j = 1, 2, ..., p, where λ_1 ≥ λ_2 ≥ ... ≥ λ_p and θ_1^(h) ≥ θ_2^(h) ≥ ... ≥ θ_h^(h) are the eigenvalues of X'X and W_h'X'XW_h, respectively.

The shrinkage factors of the PLSR estimator are determined using the relationship between PLSR and the Lanczos method ([41], [55]). The Lanczos method transforms a symmetric matrix C into a symmetric tridiagonal matrix. Therefore, after applying the Lanczos method to C = X'X, the eigenvalues and corresponding eigenvectors of C = X'X can be calculated easily, owing to the tridiagonal structure. Given the starting vector c = X'y, the procedure constructs the tridiagonal matrices W_h'X'XW_h for h = 1, 2, ..., p, where the columns of W_h form an orthonormal basis for K_h(C, c) = K_h(X'X, X'y). If θ_i^(h) and q_i^(h), i = 1, 2, ..., h, are the eigenvalues and unit-norm eigenvectors, respectively, of W_h'X'XW_h, then the pairs (θ_i^(h), W_h q_i^(h)) are called Ritz pairs.

Although it is known that the PLSR estimator shrinks relative to the OLS estimator (Proposition 2.2), it may expand some of the α_j, that is, f(λ_j) > 1. The eigenvalues of X'X and the Fourier coefficients play an important role in the shrinkage properties of the PLSR estimator ([53]). In particular, the anti-shrinkage property is observed in cases where the corresponding singular value is large and the corresponding Fourier coefficient is small. On the other hand, since the contribution of such terms to the PLSR estimator is small, the anti-shrinkage may not be seriously effective.

2.5.2 The Asymptotic Variance of the PLSR Estimator

Unlike the OLS estimator, the PLSR estimator of β in (2.1) is a non-linear function of the response variable. Thus, the covariance matrix of the PLSR estimator cannot be determined easily. This leads to difficulties in inference-based tasks such as choosing the optimal number of components, testing the significance of coefficients and constructing confidence intervals for the regression coefficients. Methods based on re-sampling techniques, such as bootstrapping and cross-validation, use the original data to gain information about the variability of the estimator. However, these methods are computationally intensive. Furthermore, their application to typical chemometrics problems is not practical, since there are often very few observations.

Another approach is to linearize the non-linear estimator in order to estimate its variance. Denham ([18]) provided a locally linear approximation to the covariance matrix based on the first derivative of the PLSR vector. A more compact expression for the asymptotic covariance matrix is given by Phatak et al. ([71]). In their approach, the approximate covariance matrix of β̂(h)_PLS is calculated using the delta method stated below.

Theorem 2.1 Let {Y_n} be a sequence of random vectors. Assume that sqrt(n)[Y_n - μ] converges in distribution to N(0, Σ) and that g(·) is a vector function whose derivatives exist in a neighborhood of z = μ. Then sqrt(n)[g(Y_n) - g(μ)] converges in distribution to N(0, JΣJ'), where J = ∂g(z)/∂z' evaluated at z = μ.

Therefore, assuming that var(ε) = σ²I_n, the approximate covariance matrix of β̂(h)_PLS can be written as

var(β̂(h)_PLS) = J_oJ_o' σ²     (2.34)

where J_o is the Jacobian matrix consisting of the derivatives of each element of β̂(h)_PLS with respect to the elements of y, evaluated at the given data (X_o, y_o).
The Jacobian matrix J_o is derived by using matrix differential calculus and is given by

J_o = ∂β̂(h)_PLS/∂y' |_{y=y_o} = {[y'XW_hG_h ⊗ L] + [W_hG_h ⊗ y'XL]} ((M^{-1})' ⊗ I_p) U_h' + H_hX'     (2.35)

where G_h = (W_h'X'XW_h)^{-1}, H_h = W_hG_hW_h', L = I_p - H_hX'X, M is the matrix such that K_h = W_hM with K_h = [X'y, (X'X)X'y, ..., (X'X)^{h-1}X'y], and U_h = [X, X(X'X), ..., X(X'X)^{h-1}] is an n × hp matrix ([71]). A reasonable estimate of σ² in (2.34) can be calculated by dividing the residual sum of squares by an appropriate number of degrees of freedom. Although n - p has been suggested as the degrees of freedom, the use of the trace of (I_n - XJ_o)'(I_n - XJ_o) is suggested instead, because of the non-linear form of the PLSR estimator ([18]). The results of simulation studies indicate that the covariance matrix estimate based on the Jacobian matrix J_o provides a useful approximation to the true covariance matrix ([71]). However, since J_o is evaluated using the data from a single experiment, how well J_oJ_o'σ̂² approximates the actual covariance matrix is directly related to the optimality of the number of components h.

The covariance matrix of β̂(h)_PLS can also be calculated from the influence function ([75]). This approach has advantages over methods based on linearization, such as independence from model assumptions and computational ease. The relationship between the influence function of an estimator and its variance is given in Chapter 4.

2.5.3 Consistency of the PLSR Estimator

A consistent estimator is an estimator that converges in probability to the parameter being estimated as the sample size grows without bound. In this section it is shown that β̂(h)_PLS is a consistent estimator of β in the model (2.1) when specific assumptions hold. The linear model given in (2.1) can be rewritten as

y_i = x_i'β + ε_i     (2.36)

where x_i is the ith row of X for i = 1, 2, ..., n. Assuming that the x_i are independently and identically distributed random vectors with positive definite covariance matrix Σ_xx and are independent of the ε_i, then

Σ_xy = cov(x_i, y_i) = cov(x_i, x_i'β) = Σ_xx β.     (2.37)

Therefore, β = Σ_xx^{-1}Σ_xy = Σ_{h=1}^{p} (v_h v_h'/λ_h) Σ_xy, where λ_h and v_h are the hth eigenvalue and the corresponding eigenvector of Σ_xx for h = 1, 2, ..., p. In general, the eigenvectors with v_h'Σ_xy ≠ 0, one for each λ_h, are called the relevant eigenvectors of Σ_xx for prediction of y ([42], [59]), i.e., the eigenvectors of Σ_xx with non-zero components along Σ_xy. Helland ([41]) provided a general form of β̂(h)_PLS as

β̂(h)_PLS = D̂_h (D̂_h'S_xxD̂_h)^{-1} D̂_h'S_xy     (2.38)

where S_xx is the sample estimator of Σ_xx, S_xy is the sample estimator of Σ_xy, and D̂_h is a p × h matrix whose columns span K_h(S_xx, S_xy). Helland ([42]) also proved that if m is the number of relevant eigenvectors of Σ_xx for prediction of y, then m is the smallest integer h such that β = Σ_xx^{-1}Σ_xy belongs to K_h(Σ_xx, Σ_xy). These facts are used to prove the consistency of the PLSR estimator, given in the next lemma.

Lemma 2.1 Assume that S_xx and S_xy converge to Σ_xx and Σ_xy in probability, respectively. If h = m is the number of relevant eigenvectors of Σ_xx for prediction of y, then β̂(h)_PLS is a consistent estimator of β.

Proof: Since S_xx and S_xy converge in probability to Σ_xx and Σ_xy, respectively, the h-component PLSR estimator in (2.38), β̂(h)_PLS, converges in probability to D_h(D_h'Σ_xxD_h)^{-1}D_h'Σ_xy = D_h(D_h'Σ_xxD_h)^{-1}D_h'Σ_xxβ, where the columns of D_h span K_h(Σ_xx, Σ_xy). If h = m then, as proved by Helland ([42]), β ∈ K_h(Σ_xx, Σ_xy). Consequently, the space spanned by D*_h = Σ_xx^{1/2}D_h contains Σ_xx^{1/2}β, which implies D*_h(D*_h'D*_h)^{-1}D*_h'Σ_xx^{1/2}β = Σ_xx^{1/2}β.
Multiplying each side by Σ_xx^{-1/2} from the left, we obtain D_h(D*_h'D*_h)^{-1}D_h'Σ_xxβ = D_h(D_h'Σ_xxD_h)^{-1}D_h'Σ_xxβ = β. Thus β̂(h)_PLS converges in probability to β. □

Chapter 3
RoPLS: Robust Partial Least Squares Regression

3.1 Introduction

Although PLS regression handles the multicollinearity problem, it fails to deal with data containing outliers, since PLS algorithms are based on the empirical cross-covariance matrix between the response variable(s) and the regressors. The existence of multicollinearity and outliers in data sets leads to a need for robust PLS methods in many application areas. Consequently, several robust PLS regression methods have been proposed in the literature. In general, these methods can be classified into two groups: those which use an iterative reweighting technique and those which use the idea of robustifying the covariance matrix.

The methods using the iterative reweighting idea assign a weight between 0 and 1 to each data point in such a way that outliers, points which are sufficiently far away from the bulk of the data, receive smaller weights than inliers, the points that form the bulk of the data. The first robust PLS method in this group, proposed by Wakeling and Macfie ([83]), is based on the idea of replacing a set of ordinary regression steps in the NIPALS algorithm by robust counterparts. The main drawback of the method is the large amount of computation time. Therefore, Griep et al. ([38]) suggested a semi-robust method that replaces only the first ordinary regression step by a robust regression method, for the sake of computation. However, this method looks for outliers in the planes [y, X_j], j = 1, 2, ..., p, while ignoring the multivariate nature of the data. Cummins and Andrews ([13]) gave a slightly different version of the iterative reweighting method by calculating weights after performing an ordinary PLS algorithm and using these weights in the next PLS run. Since the two aforementioned methods employ a robust method within the PLS algorithm, they are called "internal iteratively reweighting" methods, whereas the method of Cummins and Andrews is called an "external iteratively reweighting" method ([35]). Although the latter method has advantages over the internal reweighting methods, it suffers from a lack of resistance to outliers in the X-space. Only recently was another external iterative method, called "Partial Robust M Regression" (PRM), proposed by Serneels et al. ([76]); it is robust to outliers in the X-space.

The second group of robust methods, introduced by Gil and Romera ([35]), estimates the regression coefficients with the help of a robust covariance matrix instead of applying a robust regression method within the PLS algorithm. Gil and Romera estimated the covariance matrix using the Stahel-Donoho estimator ([22], [78]). However, this method cannot be applied to high dimensional data sets, since it uses a resampling scheme that draws subsets of size p + 2. Hubert and Branden ([48]) proposed another two-step algorithm (RSIMPLS), which can be used for both low and high dimensional data, by estimating the covariance matrix using the MCD (minimum covariance determinant, [74]) estimator and the idea of projection pursuit. Once the score matrix is obtained, robust regression is used to estimate a in (2.7), and hence β in equation (2.1).

In this chapter, robustified versions of the SIMPLS algorithm, RoPLS1 and RoPLS2, are introduced. The proposed methods are external iteratively reweighting algorithms based on the idea of the reweighted least squares method given by Billor et al. ([6]). RoPLS1 uses weights calculated from the BACON (blocked adaptive computationally efficient outlier nominators) algorithm ([5]), whereas RoPLS2 uses weights calculated from the PCOUT algorithm ([26]).
RoPLS1 uses weights calculated from the BACON (blocked adaptive computationally efficient outlier nominators) algorithm ([5]), whereas RoPLS2 uses weights calculated from the PCOUT algorithm ([26]). RoPLS2 is the first method that incorporates the PCOUT algorithm as an integral component of a robust estimation procedure rather than using it merely as an outlier detection method. Both RoPLS1 and RoPLS2 have computational advantages over the recent robust methods PRM and RSIMPLS, and both are resistant to the masking problem.

The rest of the chapter is organized as follows. The outlier identification algorithms, BACON and PCOUT, are reviewed in Section 3.2. The detailed algorithm for the proposed method, RoPLS, including a robust method to determine the number of components and two graphical methods to diagnose outliers, is given in Section 3.3. Real and simulated data sets are utilized to demonstrate the performance of the proposed method in Section 3.4.

3.2 Outlier Detection Methods

The outlier problem is one of the earliest statistical interests and, since data sets often contain outliers, it continues to be one of the most important. Outlier detection has important applications in fraud detection, astronomy, bioinformatics (e.g., microarray experiments), and countless other areas. For instance, a great interest of astronomers is to discover unusual, rare, or unknown types of astronomical objects or phenomena, and outlier detection approaches that scale to multi-terabyte, and even multi-petabyte, data sets directly meet this need.

Outlier detection algorithms fall into two broad classes: those that employ distance-based methods, such as MULTOUT ([90]), MCD ([74]) and BACON ([5]); and those that rely upon projection pursuit ideas, such as Kurtosis1 ([68]) and the Stahel-Donoho estimator ([56]). Distance-based methods classify outliers as points that lie sufficiently far away from the bulk of the data, whereas projection pursuit methods use lower dimensional projections of the data to reveal outliers. The primary goals of all of these algorithms are explicit outlier detection and/or robust estimation. Most distance-based methods are designed for low dimensional data sets, that is, data in which the number of observations is greater than the number of variables, and identification of outliers becomes more complicated as the dimension increases. On the other hand, data sets in many scientific fields are nowadays high dimensional. Although projection pursuit based methods can be applied in such situations, their computational difficulties make them impractical to use. A recent outlier detection method, PCOUT ([26]), which detects outliers very efficiently in high dimensional data sets, combines the advantages of distance-based and projection pursuit methods. In the following subsections, detailed information on the BACON and PCOUT methods is provided, since these algorithms are used to build the new robust PLS algorithms in Section 3.3.

3.2.1 BACON

The BACON algorithm ([5]) is a distance-based method that starts with an outlier-free subset of the data (the initial basic subset), from which robust distances can be calculated. The initial basic subset can be found algorithmically in one of two ways: from Mahalanobis distances based on the classical mean and covariance estimates (Version 1) or from Euclidean distances from the coordinatewise medians (Version 2). The advantages of Version 1 are its affine equivariance and its low computational cost.
Then, based on the mean and covariance matrix of the basic subsetof size m, discrepancies are computed and all observations with discrepancy less than correction factor, Cnpr?p;fi=n, form the new basic subset, where ?p;fi=n is the 1 ?fi percentile of the 35 chi-square distribution with p degrees of freedom and, Cnpr = Cnp +Chr = max[0;(h?r)=(h+r)]+1+ p+1n?p + 1n?h?p where h = [(n+p+1)=2)] and r is the size of current basic subset. This iterative method is repeated until the size of basic subset, r, no longer changes. The observations excluded by the flnal basic subset are nominated as outliers. BACON estimators of the location vector (e?) and the covariance matrix (e?), based on this flnal subset, are employed to calculate robust Mahalanobis distance vector dB with ith row for i = 1;2;:::;n given as diB = d(xi;e?; e?) = q (xi ? e?)0e??1(xi ? e?) (3.1) where xi 2 Rp is the ith row of data matrix, X, for i = 1;2;:::;n. Simulation studies demonstrate that BACON is a robust technique with 40% breakdown point. The method is fast, easy to implement and thus practical for data sets of even million of cases. Although BACON algorithm is originally designed for low dimensional data, an exten- sion of BACON algorithm to high dimensional data is also possible. A simple solution to this problem is to run the BACON algorithm on the reduced data set, eX, of size n?p?, where eX is the score matrix based on the spectral decomposition of covariance matrix es- timate and p? denotes the number of such score vectors in the reduced data set. Robust distance vector, dB, is calculated based on eX. 36 3.2.2 PCOUT PCOUT is a recent outlier identiflcation algorithm that is particularly efiective in high dimensions([26]). PCOUT is based on the idea of outlier detection on principal component space which does not require matrix inversion. The method starts by robustly sphering the original data by columnwise median of absolute deviances for j = 1;2;:::;p: x?ij = xij ?mediani(xij)(1:4826)mad i(xij) (3.2) where madi(xij) = mediani(jxij ?mediani(xij)j): (3.3) Then, PCA is applied to sphered data matrix, X? = [x?ij], and p? < p semi-robust compo- nents, Z = [z1;z2;:::;zp?], that contribute to at least 99% of the total variance are retained in the analysis. The semi-robust component matrix, Z, is sphered similar to (3.2) as z?ij = zij ?mediani(zij)(1:4826)mad i(zij) : (3.4) The rest of the algorithm uses the sphered component matrix, Z? = [z?ij], to detect location and scatter outliers. The location outlier detection is initiated by calculating the absolute value of a robust kurtosis measure for each component: kj = j1n nX i=1 z?ij4 ?3j (3.5) where j = 1;2;:::;p?. Kurtosis coe?cient measures "heaviness" of distribution tails. Small value of kurtosis is an indicator of large amount of asymmetric contamination, whereas 37 large value of kurtosis is an indicator of either symmetric contamination or small amount of asymmetric contamination. Since the presence of location outliers is likely to cause the kurtosis to become difierent from zero, the relative weight vj = kj=Pp?j=1 kj for each of the sphered component, z?j, is used to reveal outliers. Robust distances, for detecting location outliers, are calculated based on the weighted component matrix: RDLi = vu uu t p?X j=1 v2jz?ij2 (3.6) and then the following equation: dLi = RDLi q (?2p?;0:5) mediani(RDLi ); (3.7) suggested by Maronna and Zamar ([57]), is used to transform distances. 
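The location-outlier stage just described can be summarized in a short sketch. The following Python fragment is a minimal illustration of equations (3.2)-(3.7) — robust sphering, kurtosis-weighted principal component distances, and the chi-square median rescaling — and is not the published PCOUT implementation: the scatter-outlier stage and the final weighting are omitted, and the function name pcout_location_distances is ours.

import numpy as np
from scipy.stats import chi2

def pcout_location_distances(X, var_explained=0.99):
    """Sketch of the location-outlier stage of PCOUT (eqs. 3.2-3.7)."""
    n, p = X.shape
    # (3.2)-(3.3): robustly sphere each column by its median and MAD
    med = np.median(X, axis=0)
    mad = 1.4826 * np.median(np.abs(X - med), axis=0)
    X_star = (X - med) / mad
    # PCA on the sphered data; keep p* components explaining >= 99% of the variance
    eigval, eigvec = np.linalg.eigh(np.cov(X_star, rowvar=False))
    order = np.argsort(eigval)[::-1]
    eigval, eigvec = eigval[order], eigvec[:, order]
    p_star = np.searchsorted(np.cumsum(eigval) / eigval.sum(), var_explained) + 1
    Z = X_star @ eigvec[:, :p_star]
    # (3.4): sphere the semi-robust components the same way
    z_med = np.median(Z, axis=0)
    z_mad = 1.4826 * np.median(np.abs(Z - z_med), axis=0)
    Z_star = (Z - z_med) / z_mad
    # (3.5): absolute robust kurtosis of each component and relative weights v_j
    k = np.abs(np.mean(Z_star**4, axis=0) - 3.0)
    v = k / k.sum()
    # (3.6): kurtosis-weighted robust distances for location outliers
    RDL = np.sqrt(np.sum((v**2) * (Z_star**2), axis=1))
    # (3.7): rescale so that the median matches the chi-square(p*) median
    return RDL * np.sqrt(chi2.ppf(0.5, p_star)) / np.median(RDL)

# Example: large distances flag candidate location outliers
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
X[:3] += 6.0                       # three shifted observations
print(np.argsort(pcout_location_distances(X))[-3:])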
To detect scatter outliers, sphered component matrix, Z?, is directly used to calculate robust distances, RDSi , by plugging vj = 1 for all components in (3.6). Similarly, dSi is calculated using the transformation given in equation (3.7) by replacing RDLi by RDSi . After calculating robust distances, the translated biweight function ([72]), w(d;c;M) = 8> >>> >>< >>>> >>: 0 d?c 1? ?d?M c?M ?2?2 M can be written as a functional, i.e., T(Fn) = b n. In this section, notation A> is used for transpose of matrix A, while the notation (0) is used as 60 derivative operator. Here Fn is the empirical distribution function given by: Fx;n(t) = 1n nX i=1 I(?1;t](xi) where IA(x) = 1 for x 2 A and 0 otherwise. Thus, if the purpose is to obtain an estimator that is relatively unafiected by small shifts in the cumulative distribution function, this can be achieved by choosing an estimator represented by a continuous functional and an estimator with this property is said to have "qualitative robustness". The second tool for robustness arises when the two other restrictions, existence and boundedness of the functional derivative, are imposed so that small changes in the cumu- lative distribution do not result in large changes in the value of the functional, T. In order to theoretically assess the in uence that an observation z? has on the value of a statistical functional, the derivative of the functional called in uence function is used ([40]). Deflnition 4.1 The in uence function (IF) of a functional T at F 2F is given by IF(z?;T;F) = dd"T(F") "=0 = lim "#0 T(F")?T(F) " = lim"#0 T((1?")F +"?z?)?T(F) " (4.1) in those z? where this limit exists. Here ?z? is point mass distribution at z? and # denotes the limit from the right. In uence function given in (4.1) corresponds to the directional (Gateaux) derivative of T at F in the direction of H = ?z? ?F and can also be written as: IF(z?;T;F) = T0(H) = dd"T(F +"H) j"=0 : (4.2) 61 The functional, so the estimator, is said to have "inflnitesimal robustness" if IF(z?;T;F) is a bounded function of z?. The in uence function of T can also be used to determine an explicit formula for the asymptotic variance of T ([47]) since Var(T;F) ? R IF(x;T;F)>IF(x;T;F)dF(x) n (4.3) which can be estimated by dVar(T;F) = Pn i=1 IF(xi;T;Fx;n)>IF(xi;T;Fx;n) n2 : (4.4) The third tool, breakdown point, addresses the notion of quantitative robustness and it can loosely be deflned as the smallest fraction of samples (with respect to n) that can render the estimator useless. It describes how greatly a small change in the underlying distribution F changes the distribution of an estimator. The higher the breakdown point of an estimator, the more robust it is. In this chapter, flnite-sample version of the breakdown point given in Deflnition 4.2 is preferred because of the simplicity. Deflnition 4.2 The flnite-sample breakdown point, "?(x;T), of an estimator T at a sample x=[x1;x2;:::;xn]> is given by "?(x;T) = 1n minm fm : supex k T(x)?T(ex) k= 1g (4.5) where ex is obtained by replacing m (1 ? m ? n) observations of x by arbitrary observations. The outline of this chapter is as follows. In Section 4.2 in uence functions for classical PLSR estimator of fl is given. Robustness properties of RoPLS estimator of fl: in uence function 62 for low dimension, empirical in uence curve for high dimensional case and flnite-sample breakdown properties, are discussed in Section 4.3. 
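The three notions just defined can be illustrated on a toy example that is independent of PLS. The sketch below computes a finite-sample analogue of the influence function in Definition 4.1 and the finite-sample breakdown point of Definition 4.2 for the univariate mean and median; the helper names are ours and the numerical thresholds are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)

def empirical_influence(stat, x, z, eps=None):
    """Finite-sample analogue of Definition 4.1: effect of mixing a point
    mass at z into the empirical distribution with weight eps."""
    n = len(x)
    eps = 1.0 / (n + 1) if eps is None else eps
    contaminated = np.append(x, z)        # crude (1-eps)F_n + eps*delta_z
    return (stat(contaminated) - stat(x)) / eps

# The mean's influence grows without bound in z; the median's stays bounded.
for z in (2.0, 10.0, 100.0):
    print(z, empirical_influence(np.mean, x, z), empirical_influence(np.median, x, z))

def finite_sample_breakdown(stat, x, huge=1e12, tol=1e6):
    """Smallest replacement fraction m/n (Definition 4.2) that drives the
    statistic beyond 'tol' when m observations are replaced by 'huge'."""
    n = len(x)
    for m in range(1, n + 1):
        x_bad = x.copy()
        x_bad[:m] = huge
        if abs(stat(x_bad) - stat(x)) > tol:
            return m / n
    return 1.0

print(finite_sample_breakdown(np.mean, x))    # 1/n: one bad point is enough
print(finite_sample_breakdown(np.median, x))  # 0.5: half the sample must be replaced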
4.2 In uence Function for the SIMPLS Based Regression Estimator In this section, in uence function for the SIMPLS estimator of fl is given. Assume that x 2Rp and y 2R are centered random variables. Let F be a cumulative distribution function for a random vector = (x>;y)> and d = p+1. Then the covariance matrix of is given by ? = 0 BB @ ?xx ?xy ?yx yy 1 CC A: A functional, S, can be deflned on a suitable class of probability distributions, F, which maps an arbitrary distribution G 2 F into a positive deflnite symmetric matrix, S(G) = EG( >), under the assumption that S(F) = EF( >) = ?, that is; S is Fisher- consistent for ? at F. Using functional S, other functionals can also be deflned as: Sxx(G) = EG(xx>) = [Ip : 0]S(G)[Ip : 0]> Sxy(G) = EG(xy) = [Ip : 0]S(G)ed where 0 is p?1 vector of zeros and ed is the dth standard basis vector of Rd. Fisher consis- tency of S at F implies Fisher-consistency of Sxy at F for ?xy, that is Sxy(F) = EF(xy) = ?xy. Similarly, Syx(G) = EG(yx>) = Sxy(G)> and Syx is also Fisher-consistent for ?yx at F. Since the existence of second moments guarantees the existence of the functional S, it is assumed that F consists of probability functions, G, at where positive deflnite matrix EG( >) exists which requires the existence of second moments. 63 Throughout this section, functional for an estimator is denoted by the parameter being estimated unless stated otherwise. ?h = ?2h and ?h stand for eigenvalues and eigenvectors of ?h?1yx ?h?1xy , respectively, satisfying ?h?1xy ?h = ?hrh for h = 1;2;:::;k. Since S(F) = ?, ?h(F) = ?h, ?h(F) = ?h and rh(F) = rh. The functional for the SIMPLS estimator of fl for h component is given by ^flh(G) = Rh(G)R>h (G)Sxy (4.6) where Rh(G) = [r1(G);r2(G);:::;rh(G)] and G 2 F. So, for the sake of clarity, in uence function of ^flh is studied in three steps. In the flrst step, in uence functions of S0yxS0xy and the flrst PLS-weight vector, r1, are derived. In the second step, similar approach is followed to flnd the in uence function of hth PLS-weight vector, rh. Finally, in the third step, in uence functions for scaled PLS-weight vector, ~rh and PLS slope estimator, ^flh, are given. Step 1: In uence functions for SyxSxy and r1 Since S0xy?1 = ?1r1, to determine the in uence function of r1, it is necessary to deter- mine the in uence functions of ?1 and ?1 that require in uence function of S0yxS0xy. The following lemmas are used to obtain in uence function for S0yxS0xy = SyxSxy. Lemma 4.1 The in uence function of S at F 2F is IF( ?;S;F) = ? ?> ??: (4.7) 64 Corollary 4.1 Let ? = (u>;v)> be an arbitrary d-dimensional point. Then, i. IF( ?;Sxy;F) = uv ??xy = IF( ?;Syx;F)>, ii. IF( ?;Sxx;F) = uu> ??xx, iii. IF( ?;SyxSxy;F) = 2[vu>?xy ??yx?xy]. Proof: i. IF( ?;Sxy;F) = [Ip : 0]IF( ?;S;F)ed = uv ??xy, IF( ?;Syx;F) = e>d S(F)[Ip : 0]> = vu> ??yx = IF( ?;Sxy;F)>, ii. IF( ?;Sxx;F) = [Ip : 0]IF( ?;S;F)[Ip : 0]> = uu> ??xx, iii. IF( ?;SyxSxy;F) = IF( ?;Syx;F)?xy +?yxIF( ?;Sxy;F), = fvu> ??yxg?xy +?yxfuv ??xyg = 2[vu>?xy ??yx?xy] 2. For the in uence functions of the eigenvalue and the eigenvector of SyxSxy (?1 and ?1), the following lemmas are used. Lemma 4.2 (Sibson, [77]) Let A and B be two symmetric matrices, ?i be the ith simple eigenvalue of A and vi be the associated eigenvector of unit length. Let A be perturbed to A(") = A+"B +o("2) and assume that the corresponding perturbations of ?i and vi are ?i(") = ?i +"e?i +o("2) vi(") = vi +"evi +o("2). Then, e?i = v>i Bvi and evi = ?(A??iI)+Bvi = ?[Pj6=i( vjv > j ?j??i)]Bvi. 
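Lemma 4.2 can be checked numerically: perturb a symmetric matrix A by epsilon*B and compare the exact changes in a simple eigenvalue and its eigenvector with the first-order terms v_i'Bv_i and -(A - lambda_i I)^+ B v_i. The sketch below is our own illustration, not part of the source.

import numpy as np

rng = np.random.default_rng(2)
p = 5
A = rng.normal(size=(p, p)); A = (A + A.T) / 2        # symmetric matrix
B = rng.normal(size=(p, p)); B = (B + B.T) / 2        # symmetric perturbation
eps = 1e-5

lam, V = np.linalg.eigh(A)
i = np.argmax(lam)                                    # a simple (largest) eigenvalue
v = V[:, i]

# First-order terms from Lemma 4.2
dlam_lin = v @ B @ v
dv_lin = -np.linalg.pinv(A - lam[i] * np.eye(p)) @ B @ v

# Exact eigendecomposition of the perturbed matrix A + eps*B
lam_eps, V_eps = np.linalg.eigh(A + eps * B)
v_eps = V_eps[:, np.argmax(lam_eps)]
v_eps = v_eps if v_eps @ v > 0 else -v_eps            # fix the sign convention

print(abs((lam_eps.max() - lam[i]) / eps - dlam_lin))  # close to zero: first-order match
print(np.linalg.norm((v_eps - v) / eps - dv_lin))      # close to zero: first-order match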
The following lemma mimics Lemma 4.2 and provides the in uence functions of eigen- values and corresponding eigenvectors of a positive deflnite symmetric (PDS) matrix, ?. 65 Lemma 4.3 (Croux and Haesbroeck, [12]) Let C : F ! PDS(m) be a statistical func- tional, F a m-dimensional distribution and ? F. Suppose that IF( ?;C;F) exists and C(F)=?. Denote the ith simple eigenvalue and eigenvector of ? by ?i and ?i, respectively. Then in uence functions of functionals ?i and ?i at F 2F are given by IF( ?;?i;F) = ?>i IF( ?;C;F)?i (4.8) IF( ?;?i;F) = 2 4X j6=i ( ?j? >j ?i ??j) 3 5IF( ?;C;F)?i (4.9) Proof: The perturbation of ?, ?(") = C(F"), can be approximated as: C(F") ? C(F)+R IF( ?;C;F)d(F" ?F)+o("2) = ?+"IF( ?;C;F)+o("2) ([47] and [54]). Similarly ?i and ?i can be written as ?i(F") = ?i +"e?i +o("2) = ?i +"IF( ?;?i;F)+o("2) ?i(F") = ?i +"e?i +o("2) = ?i +"IF( ?;?i;F)+o("2), where e?i and e?i are obtained from Lemma 4.2 as, e?i = IF( ?;?i;F) = ?>i IF( ?;C;F)?i e?i = IF( ?;?i;F) = ? ?P j6=i( ?j?>j ?j??i) ? IF( ?;C;F)?i = ?P j6=i( ?j?>j ?i??j) ? IF( ?;C;F)?i 2: Lemma 4.3 demonstrates that once IF( ?;SyxSxy;F) is known, in uence functions for all PLS-weight vectors, and as a result of this, in uence function of the PLS slope estimator can be determined. Using Lemma 4.3, it can be easily seen that in uence function for ?1, 66 the eigenvector of SyxSxy, is 0. Since ?1 = 1, in uence function for ?1 can be given as IF( ?;?1;F) = IF( ?;SyxSxy;F): (4.10) Using IF( ?;SyxSxy;F) given in Corollary 4.1 and using ?xy = p?1r1, (4.10) can be rewritten as IF( ?;?1;F) = 2[vu>?xy ??yx?xy] = 2p?1[vu>r1 ?p?1]: (4.11) Furthermore, ?1 = p?1 so (4.11) can be written in terms of ?1 as IF( ?;?1;F) = IF( ?;?21;F) = [ dd? 1 ?21]IF( ?;?1;F) = 2?1[vu>r1 ??1] (4.12) which implies IF( ?;?1;F) = vu>r1 ??1 = vr>1 u??1: (4.13) The functional for the flrst PLS weight vector, r1, satisfles Sxy(F)?1(F) = ?1(F)r1(F), because it is the left singular vector of Sxy. Hence, in uence function for r1, given next, follows immediately from IF( ?;Sxy;F) = IF( ?;?1;F)r1 +?1IF( ?;r1;F). Corollary 4.2 The in uence function for the flrst weight vector, r1 at F 2F is IF( ?;r1;F) = v? 1 [Ip ?r1r1>]u: (4.14) It can be seen from (4.14) that IF( ?;r1;F) is unbounded since it is a function of arbitrary values of u and v. 67 Step 2: In uence functions for Sh?1yx Sh?1xy and rh In general, for h > 1 components, the hth PLS weight vector is obtained as the left singular vector of ?xyh?1. Thus, the functional for the hth PLS weight vector, rh, satisfles Sh?1xy (F)?h(F) = ?h(F)rh(F) where Sh?1xy = [Ip ?Vh?1V>h?1]Sxy with S0xy = Sxy. Here, the columns of Vh?1 = [v1;v2;:::;vh?1] form an orthonormal basis for the space spanned by [p1;p2;:::;ph?1] where pi / ?xxri for i = 1;2:::;h?1. In uence function for Sh?1yx Sh?1xy , which is same as the in uence function for ?h, is: IF( ?;Sh?1yx Sh?1xy ;F) = IF( ?;?h;F) = IF( ?;Syx[Ip ?currency1h?1]Sxy;F) (4.15) where currency1h?1 = Vh?1V>h?1 = Ph?1i=1 viv>i . Using multiplication rule, (4.15) can be rewritten as IF( ?;?h;F) = 2IF( ?;Syx;F)[Ip ?currency1h?1]?xy ??yxIF( ?;currency1h?1;F)?xy: (4.16) Since 2?hIF( ?;?h;F) = IF( ?;?h;F), after plugging IF( ?;Syx;F) in (4.16), in uence function for ?h can be derived as IF( ?;?h;F) = vr>h u??h ? 12?h?yxIF( ?;currency1h?1;F)?xy. Thenextcorollaryentailsthein uencefunctionforrh thatrequirestheuseofIF( ?;Sh?1xy ;F) and IF( ?;?h;F). Corollary 4.3 The in uence function for the hth (h > 1) weight vector, rh, at F 2F is IF( ?;rh;F) = 1? h ? [Ip ?rhrh>]vu?[Ip ? rh2? 
h ?yx]IF( ?;currency1h?1;F)?xy ?currency1h?1uv : (4.17) 68 Here, IF( ?;currency1h?1;F) can be calculated recursively. This function is obtained directly by using the equality IF( ?;Sh?1xy ;F) = IF( ?;?h;F)rh +IF( ?;rh;F)?h. Therefore, in uence function for rh given in (4.17) depends on explicitly to arbitrary point (u>;v)> as well as implicitly to the in uence functions for all previous PLS-weight vectors. So, IF( ?;rh;F) in (4.17) is unbounded. Step 3: In uence functions for ~rh and ^flh The PLS weight vectors, rh, for h ? 1 should be scaled by functional q r>h Sxxrh to be able make the hth component unit norm. The scaled version of the PLS-weight vector, denoted by ~rh, has the in uence function in the form of IF( ?; ~rh;F) = [?(h)Ip ?rhrh >?xx]IF( ?;rh;F) ?(h)3=2 ? rhrh>IF( ?;Sxx;F)rh 2?(h)3=2 (4.18) where ?(h) = rh>?xxrh for h ? 1. Corollary 4.4 In uence function for the PLS slope estimator, represented by functional ^flh = fRhfRh>Sxy with fRh = [~r1; ~r2;:::; ~rh] for h components at F 2F is IF( ?; ^flh;F) = IF( ?; fRh;F)fRh>?xy + fRhIF( ?; fRh;F)>?xy + fRhfRh>IF( ?;Sxy;F) (4.19) where ith column of IF( ?; fRh;F) is IF( ?; ~ri;F) for h ? 1. 69 Similarly, ~rh in (4.18) is unbounded because it depends on the in uence function for rh which is unbounded. Finally, ^flh has unbounded in uence function. The following example ([82]) demonstrates the unboundedness of the in uence functions of SIMPLS estimators. Example 4.1 Assume that ? N3(?;?) where ? = (0 0 0)> and ? = 0 BB BB BB @ 5 0:5 3 0:5 2 0:3?3 3 0:3?3 2 1 CC CC CC A : The norms of the theoretical in uence functions for r1 given in (4.14) and ^fl1 = [^fl11; ^fl12]> given in (4.19) are calculated for ? = (u>;v)> where u> = (i;0) and v = j with i and j take values between ?10 to 10. The unbounded shapes of the in uence functions, can be seen in Figure (4.1). Hence, it can be concluded that the SIMPLS estimator of slope is non-robust towards outlying observations. 4.3 Robustness of the RoPLS Estimator of fl SIMPLS regression is scale and orthogonal equivariant method ([43]). The flrst version of BACON algorithm is scale and a?ne equivariant, therefore orthogonal equivariant ([5]). Equivariance properties of BACON and SIMPLS guarantee the scale and orthogonal equiv- ariance of the entire RoPLS1 , BACON based RoPLS, estimator of fl. Therefore, RoPLS1 estimator, efl, that is computed from a transformed response vector ey = fiy (fi 2 R?f0g) and data matrix eX = XA (A: p?p orthogonal matrix), is given by efl = fiA>bfl (4.20) 70 Figure 4.1: (a) The norm of IF of the r1 (b) The norm of IF of the ^fl1 (c) The IF of the ^fl11 (d) The IF of the ^fl12. 71 where bfl is RoPLS1 estimator that is computed from original data matrix, X, and response vector, y. However, (4.20) does not hold for RoPLS2, PCOUT based RoPLS, estimator since the coordinatewise median is not orthogonal equivariant. Because of its equivariance advantage over RoPLS2, only robustness properties of RoPLS1 estimator are explored in this section. 4.3.1 In uence functions for Low Dimension RoPLS1 algorithm, introduced in Chapter 3, is an iterative algorithm that starts with an initial estimator. This section is presented into two parts. In the flrst part, in uence functions for the initial estimators of PLS-weight vector and slope are derived for h ? 1 components. 
Then, in the second part, after deriving in uence functions for the estimators of PLS-weight vector and slope obtained in ith iteration, it is demonstrated that in uence functions in the following iterations are directly related to in uence function of the initial estimators where 0 ? i ? a and a is the number at which iteration converges. In general, functional for the h component RoPLS1 estimator of fl obtained in the ith iteration is denoted by ^fl(i)h , while the estimators of rh, ?h and ?h are represented by functionals r(i)h , ?(i)h , and ?(i)h , respectively. Part I: In uence Function for the Initial Estimator, ^fl(0)h In uence function of the initial estimator ^fl(0)h is derived in three steps similar as in Section 4.2. 72 Figure 4.2: (a) Graph of w? for robust distances (b) Graph of w? for scaled errors, ". Step 1: In uence functions for SwyxSwxy and r(0)1 A functional, Sw(0), can be deflned as Sw(0)(G) = EG(w(0)G ( ) >) for a random vector =(x>;y)> and G 2 F, where F is the same class of d-variate probability distributions deflned in Section 4.2. The weight functional, w(0)G ( ), is w(0)G ( ) = w? ? d B(G) ? = w? q >SB(G)?1 ? (4.21) where SB is the functional representative of the BACON estimator of covariance matrix of (?) with range of positive deflnite matrices and w?(:) is the function given in (3.11). The weight function, w(0)G , is decreasing function of the distance which can be seen from Figure 4.2 (a) with md(G) = max n 1;median(d B(G)) o . Furthermore, 0 ? w(0)G ( ) ? 1 for any G 2 F. Existence of S(G) = EG( >) implies existence of Sw(0)(G), since w(0)G ( ) is bounded for G 2F. Therefore, functionals Swxy(0)(G) = EG ? w(0)G ( )xy ? and Swxx(0)(G) = 73 EG ? w(0)G ( )xx> ? exist, and for h > 0, Sh?1wxy(0)(G) = ?(0)h (G)r(0)h (G). Moreover, Sw(0)(G) is a non-negative matrix because of the fact that 0 ? w(0)G ( ). Throughout this section, the following are assumed to be held for i ? 0: 1. Sw(i)(F) = ?, 2. SB(F) = ? and IF( ?;SB?1;F) exists with a boundary, 3. PFf : q> 6= 0 and w(i)F ( ) > 0g > 0, 4. PFf : p >??1 = md(F)g = 0. Assumption (1) implies Swxy(i)(F) = ?xy, Swxx(i)(F) = ?xx, r(i)h (F) = rh, ?(i)h (F) = ?h, ?(i)h (F) = ?h. When assumption (2) holds, d B0(H) exists with boundary. If for every q 2 Rd, the probability of f : q> 6= 0 and w(i)G ( ) > 0g is positive under G, then q>Sw(i)(G)q > 0 implying that Sw(i)(G) is positive deflnite. Therefore, assumption (3) is needed for positive deflniteness of Sw(i)(G). The last assumption is required for difierentiability of w(0)G over the support of the random vector . Similar to Section 4.2, the in uence functions of SwyxSwxy = S0wyx(0)S0wxy(0) and the in uence function for ?1(0) are used to obtain in uence function of r(0)1 . The notation M0(H) is used to denote IF( ?;M;F) where H = ? ? ?F. Lemma 4.4 In uence function of SwyxSwxy at F 2F is IF( ?;SwyxSwxy;F) = 2w(0)F ( ?)u>v?xy ?2?yx?xy +2C>?xy (4.22) where C = EF ? @w? @d B(F)d B0(H)xy?. Proof: The in uence function for Swxy is: 74 Swxy0(H) = @@tEF+tH ? w(0)F+tH( )xy ? jt=0= EF ? w(0)0(H)xy ? + n w(0)F ( ?)uv ??xy o where w(0)0(H) = @@tw(0)F+tH( ) jt=0= @@tw? ? d B(F +tH) ? jt=0= @w?@d B(F+tH) @d B(F+tH) @t jt=0. Therefore, Swxy0(H) = n w(0)F ( ?)uv ??xy o +EF ? @w? @d B(F)d B0(H)xy? with the derivative @w?(d) @d B(F) = 8 >>< >>: 0; d < md(F) ?(d?2); d > md(F)g: The derivative exists everywhere except d = md(F) and it is bounded. Assumption (4) guarantees the existence of the derivative over the support of , whereas the assumption (2) guarantees the boundedness of d B0(H). 
Therefore EF ? @w? @d B(F)d B0(H)xy? exists. Let C = EF ? @w? @d B(F)d B0(H)xy? then; SwyxSwxy0(H) = Swyx0(H)Swxy(F)+Swyx(F)Swxy0(H) which is equal to n w(0)F ( ?)u>v ??yx o ?xy +C>?xy +?yxfwF( ?)uv ??xyg+?yxC and this can be simplifled as SwyxSwxy0(H) = 2w(0)F ( ?)u>v?xy ?2?yx?xy +2C>?xy 2. By Lemma 4.3, the in uence function of ?1(0) = ?(0)1 2 is the same as the in uence function of SwyxSwxy and ?xy = ?1r1, thus the in uence function of ?(0)1 is given as ?(0)1 0(H) = 2?1?(0)1 0(H) = 2?1 n w(0)F ( ?)u>vr1 ??1 +C>r1 o 75 and this yields the following ?(0)1 0(H) = w(0)F ( ?)u>vr1 ??1 +C>r1 = w(0)F ( ?)vr1>u??1 +r1>C. The following lemma gives the in uence function for r(0)1 . Lemma 4.5 The in uence function of r(0)1 at F 2F is IF( ?;r(0)1 ;F) = 1? 1 [1?r1r1>][w(0)F ( ?)vu+C] (4.23) Proof: Using the equality that Swxy(F +tH) = ?(0)1 (F+tH)r(0)1 (F +tH) and taking the derivatives with respect to t for both sides, the following is obtained S0wxy(H) = ?(0)1 0(H)r1 +?1r(0)1 0(H): (4.24) After plugging S0wxy(H) and ?(0)1 0(H) in (4.24), r(0)1 0(H) can be written as r(0)1 0(H) = 1? 1 n w(0)F ( ?)uv ??xy +C ? n w(0)F ( ?)vr1>u??1 +r1>C o r1 o and after simpliflcations, the in uence function for r(0)1 is obtained as r(0)1 0(H) = 1? 1 [Ip ?r1r1>][w(0)F ( ?)vu+C] 2. The in uence function of r(0)1 , given in (4.23), consists of two parts. The flrst part, 1 ?1[Ip?r1r1 >]w(0) F ( ?)vu, is directly related to the in uence function of classical PLS weight 76 vector given in (4.14). The main difierence between (4.14) and (4.23) is the presence of additional weight vector, w(0)F ( ?), in (4.23) which allows the flrst term of (4.23) to be bounded. The second term, 1?1[Ip?r1r1>]C, is obviously independent of ?. Since C exists, the second term is a flnite valued vector for given F. The following example demonstrates how wF( ?) gets smaller for extreme observations, so that it makes the flrst part of the r(0)1 0(H) bounded. Example 4.2 Consider the same setting given in Example 4.1. ? N3(?;?) where ? = (0 0 0)>. In this example, it is going to be shown that w(0)F ( ?) is small for extreme ? = (u>;v)>. Since, >??1 is distributed with ?2 distribution with degrees of freedom rank(???1) = 3, the p >??1 has chi-distribution with degrees of freedom 3 and median is approximately 1:15, i.e., md = median(p 0??1 ) ? 1:15. The weight function is calculated for ? where u> = (i;0) and v = j with i and j take values between ?10 to 10. Figure 4.3 shows the behavior of weights for difierent ? values. Obviously, w(0)F ( ?) takes values close to zero for extreme values of ?. Step 2: In uence functions for Sh?1wyx(0)Sh?1wxy(0) and r(0)h The functional for the hth PLS weight vector, r(0)h , satisfles Sh?1wxy(0)(F) = ?(0)h (F)r(0)h (F) where Sh?1wxy(0) = [Ip ?V h?1w(0)V h?1w(0)>]Swxy. Here, the columns of V h?1w(0) form an orthonormal basis for the space spanned by fp(0)1 ;p(0)2 ;:::;p(0)h?1g. Therefore, the in uence function for Sh?1wxy(0) is: Sh?1wxy(0)0(H) = ?currency1h?1w(0)0(H)?xy +[Ip ?currency1h?1]Swxy0(H) (4.25) 77 Figure 4.3: (a) Graph of w(0)F ( ?) versus (u1,v) (b) Graph of w(0)F ( ?) versus robust distances. with currency1h?1w(0) = V h?1w(0)V h?1w(0)> and currency1w(0)h?1(F) = currency1h?1. Similarly, using multiplication rule, the in uence function for Sh?1wyx(0)S(h?1)wxy(0) = Swyx[Ip ?currency1h?1w(0)]Swxy is obtained as 2?h n w(0)F ( ?)vrh>u??h +C>rh ? 12?h?yxcurrency1h?1w(0)0(H)?xy o which yields the in uence function of ?(0)h as ?(0)0h (H) = ? w(0)F ( ?)vrh>u??h +C>rh ? 12? 
h ?yxcurrency1h?1w(0)0(H)?xy : (4.26) So, the general form for the in uence functions of r(0)h is given in the next lemma. Lemma 4.6 The in uence function of r(0)h at F 2F is r(0)h 0(H) = 1?h n [Ip ?rhrh>][vw(0)F ( ?)u+C]?[Ip ? rh2?h?yx]currency1h?1w(0)0(H)?xy o ? 1?h n currency1h?1[w(0)F ( ?)uv +C] o where currency1h?1w(0)0(H) is calculated recursively. 78 Proof: Using Sh?1wxy(0) = ?(0)h r(0)h , taking derivatives of both sides and plugging (4.25) and (4.26), we obtain; r(0)h 0(H) = 1?h n [Ip ?rhrh>][vw(0)F ( ?)u+C]?[Ip ? rh2?h?yx]currency1w(h?1)0(H)?xy o ? 1?h n currency1h?1[w(0)F ( ?)uv +C] o . 2. Step 3: In uence functions for ~r(0)h and ^fl(0)h ~rh(0) is the scaled version of rh(0) that is ~rh(0) = rh (0) q rh(0)>Swxxrh(0) and using the quotient rule, the in uence function of ~rh(0) is [?(h)Ip ?rhrh>?xx]r(0)h 0(H) ?(h)3=2 ? rhrh>Swxx0(H)rh 2?(h)3=2 where ?(h) = rh>?xxrh and Swxx0(H) = [w(0)F ( ?)uu> ??xx +Cx] with Cx = EF ? @w? @d B(F)d B0(H)xx>?: Cx exists under the same conditions that C exists. Similar to r(0)1 , the in uence function of r(0)h is comparable to the in uence function of the classical PLS weight function, rh, in (4.17). It consists of two parts: weighted version of equation (4.17) and the part depending on C and Cx which are assumed to be flnite valued vector and matrix, respectively. r(0)h 0(H) is bounded, so is in uence function of ~rh(0), as long as r(0)j 0(H) for j = 1;2;:::;h ? 1 79 and Swxx0(H) are bounded. These conditions also imply the boundedness of the in uence function for ^fl(0)h given in the next lemma. Lemma 4.7 In uence function for the initial RoPLS1 slope estimator for h (h ? 1) compo- nent, represented by functional ^fl(0)h = fRh(0)fRh(0)>Swxy with fRh(0) = [~r1(0); ~r2(0);:::; ~rh(0)], at F 2F is ^fl(0)0h (H) = fRh(0)0(H)fRh>?xy + fRhfRh(0)0(H)>?xy + fRhfRh>[w(0)F ( ?)uv??xy +C] (4.27) where jth column of fRh(0) 0 (H) is ~rj(0)0(H) for 1 ? j ? h. Part II: In uence Function for ^fl(i)h Assume that the slope estimator for h component model at iteration i? 1, ^fl(i?1)h , is given where i ? 1. In this part, in uence function for ^fl(i)h , is derived as in Part I. Step 1: In uence functions for Swyx(i)Swxy(i) and r(i)1 A functional, Sw(i), can be deflned as Sw(i)(G) = EG(w(i)G ( ) >) for a random vector =(x>;y)> and G 2 F, where F is the same class of d-variate probability distributions deflned in Section 4.2. The weight functional, w(i)G ( ), for i ? 1 is w(i)G ( ) = ? 1?dB(G) ? w? ? ?h(i)(G) mad(?h(i)(G)) ! = ? 1?dB(G) ? w? ? "h(i)(G) ? (4.28) where ?h(i)(G) = y ? X^fl(i?1)h (G) and dB(G) is the normalized version of dxB(G) = q x>([Ip : 0]SB(G)[Ip : 0]>)?1 x that lies between 0 and 1. w?("h(i)(G)) is decreasing function of the j"h(i)(G)j and lie between 0 and 1 which can be seen from Figure 4.2(b) where m"h(i)(G) = max n 1;median(j"h(i)(G)j) o , under the 80 assumption that the marginal distribution of ?h(i)(G) = y?X^fl(i?1)h (G) is symmetric about 0. Similarly, 0 ? 1?dB(G) ? 1 decreases when distances in X space increase. Therefore, the weight function, w(i)G , decreases with extreme residuals and/or large distances. Furthermore, 0 ? w(0)G ( ) ? 1 for any G 2 F. Existence and non-negative deflniteness of Sw(i) can be shown as in Part I. Other assumption made in this section, additional to the (1), (2), (3) and (4) given in Part I, is that 5. PFf : j"(i)h j = m(i)"h(F)g = 0. This assumption is required for difierentiability of w?("h(i)(F)) over the support of the random vector . 
Similar to previous subsection, the in uence function for S0wyx(i)S0wxy(i) = Swyx(i)Swxy(i) is needed to determine the in uence function of the r(i)1 for i ? 1. Lemma 4.8 In uence function of Swyx(i)Swxy(i) at F 2F is IF( ?;Swyx(i)Swxy(i);F) = 2w(i)F ( ?)u>v?xy?2?yx?xy+2?yxAid+2?yxAis+2?yxAifl ^fl(i?1)0h (H) (4.29) with Aid = EF ? ?dB0(H)w?("h(i))xy ? , Ais = EF ?(1?dB) @w?@" h(i) ?(i)h s0(H) s2 xy ? , and Aifl = EF ? ?(1?dB) @w?@" h(i) y sxx > ? where s = mad(?(i)h (F)). Proof: The in uence function for Swxy(i) is: S0wxy(i)(H) = EF ? w(i)0(H)xy ? +EH ? w(i)F ( )xy ? = EF ? w(i)0(H)xy ? + n w(i)F ( ?)uv ??xy o with w(i)0(H) = @@tw(i)F+tH( ) jt=0= @@t n (1?dB(F +tH))w? ? "h(i)(F +tH) ?o jt=0 81 = ?dB0(H)w?("h(i))+(1?dB) @w?@" h(i) "h(i)0(H) where "h(i)(F) = ? (i) h (F)s(F) = "(i)h with scaling factor s(F) = mad(?(i)h (F)) = s, ?(i)h (F) = y ?x>fl(i?1)h (F) = ?(i?1)h , dB(F) = dB, and "h(i)0(H) is "h(i)0(H) = ? (i)0 h (H)s?? (i) h s 0(H) s2 = ?x>^fl(i?1)0h s??(i)h s0(H) s2 : Therefore S0wxy(i)(H) can be written as S0wxy(i)(H) = Aid +Ais +Aifl ^fl(i?1)0h + n w(i)F ( ?)uv ??xy o with Aid = EF ? ?dB0(H)w?("h(i))xy ? , Ais = EF ?(1?dB) @w?@" h(i) ?(i)h s0(H) s2 xy ? , and Aifl = EF ? ?(1?dB) @w?@" h(i) y sxx > ? . Aid exists since dB0(H)w?("h(i)) is bounded. From Figure 4.2(b), it can be seen that @w? @"h(i) exists everywhere except ?m (i)" h(F). Hence, A ifl exists because of the assumption (5). Ais existsifAifl existssinces0(H)isknowntobeboundedwhichisin uencefunctionofrobust scatter measure median absolute deviation. Using multiplication rule and the fact that S0wyx(i)(H) = S0wxy(i)(H)>; the following in uence function is obtained for Swyx(i)Swxy(i): 2w(i)F ( ?)u>v?xy ?2?yx?xy +2?yxAid +2?yxAis +2?yxAifl ^fl(i?1)0(H)h 2. Since the in uence function of Swyx(i)Swxy(i) is equal to 2?(i)01 (H), ?(i)01 (H) = w(i)F ( ?)vr>1 u??1 +r>1 Aid +r>1 Ais +r>1 Aifl ^fl(i?1)0(H)h . The next lemma gives the in uence function for r(i)1 . 82 Lemma 4.9 In uence function of r(i)1 at F is IF( ?;r(i)1 ;F) = 1? 1 h Ip ?r1r>1 ih w(i)F ( ?)uv +Aid +Ais +Aifl ^fl(i?1)0(H)h i (4.30) The proof can be given in a similar way to that of the proof of Lemma 4.5. Step 2: In uence function for r(i)h The next lemma gives the in uence function for r(i)h and the proof is similar to the proof of Lemma 4.6. Lemma 4.10 The in uence function of r(i)h at F 2F is r(i)h 0(H) = 1?h n [Ip ?rhrh>][vw(i)F ( ?)u+Aid +Ais +Aifl ^fl(i?1)0h (H)] o ? 1?h n [Ip ? rh2?h?yx]currency1h?1w(i)0(H)?xy o ? 1?h n currency1h?1[w(i)F ( ?)uv +Aid +Ais +Aifl ^fl(i?1)0h (H)] o where currency1h?1w(i)0(H) is calculated recursively. Step 3: In uence functions for ~r(i)h and ^fl(i)h ~rh(i) is the scaled version of rh(i) that is ~rh(i) = rh (i) q rh(i)>Swxx(i)rh(i) and using the quotient rule, the in uence function of ~rh(i) is [?(h)Ip ?rhrh>?xx]r(i)h 0(H) ?(h)3=2 ? rhrh>Swxx(i)0(H)rh 2?(h)3=2 83 where ?(h) = rh>?xxrh and Swxx0(H) = [w(i)F ( ?)uu>??xx +Aixd +Aixs +Aixfl ^fl(i?1)0h (H)] with Aixd = EF ? ?dB0(H)w?("h(i))xx> ? , Aixs = EF ?(1?dB) @w?@" h(i) ?(i)h s0(H) s2 xx > ? , and Aixfl = EF ? ?(1?dB) @w?@" h(i) yx>^fl(i0)h (H) s xx > ! . Aixd, Aixs, and Aixfl exist under the same conditions that Aid, Ais, and Aifl exist. Finally, the in uence function for ^fl(i)h is given in the next lemma. Lemma 4.11 In uence function for the RoPLS1 slope estimator at the ith iteration for h (h ? 
1) component, represented by functional ^fl(i)h = fRh(i)fRh(i)>Swxy(i) with fRh(i) = [~r1(i); ~r2(i);:::; ~rh(i)], at F 2F is ^fl(i)0h (H) = fRh(i)0(H)fRh>?xy+fRhfRh(i)0(H)>?xy+fRhfRh>[w(i)F ( ?)uv??xy+Aid+Ais+Aifl ^fl(i?1)0h (H)] (4.31) where jth column of fRh(i) 0 (H) is ~rj(i)0(H) for 1 ? j ? h. The boundedness of ^fl(i)0h (H) for i ? 1 can be proven by induction. For i = 1, since w(1)F gets smaller for the observations lying far from the data (large dB) and/or the ones which are not fltted well by the model (large residuals), in uence function given (4.30) is bounded as long as fl(0)h 0(H) is bounded. Therefore, r(1)01 (H) has bounded in uence function. Similarly, Swxx(1) has a bounded in uence function. These facts imply the boundedness of ^fl(1)0h (H). If we assume that ^fl(i)0h (H) is bounded, using the similar argument it can be proven that ^fl(i+1)0h (H) is bounded. So, it can be concluded that under conditions described in Part I & II, in uence function for the h component RoPLS1 slope estimator, determined implicitly, exists and it has inflnitesimal robustness. 84 4.3.2 Empirical In uence Function for High Dimension In uence function obtained in Section 4.3 is shown to be bounded, which demonstrates the robustness of RoPLS1 for low dimensional data. For high dimensional case with n observations, the following empirical in uence function, deflned as ^fl(e?)? ^fl(?) 1=n ; (4.32) is used where e? is the contaminated data set obtained from varying one observation of ?, ^fl(e?) and ^fl(?) are the estimated slope vectors for the contaminated and the clean data, respectively. Data are generated by the same simulation setting described in Chapter 3 where the error terms are generated from standard normal distributions, that is, " ? N(0;1). The contamination added to a randomly chosen observation by varying the explanatory variable and response variable between ?50 to 50. For each contamination level, the norm of the empirical in uence function given in (4.32) is calculated for fn,p,kg=f20,200,3g and f25,125,2g . Then three dimensional plots in Figure 4.4 are constructed. Figure 4.4 (a) and (c) clearly illustrate non-robustness of the SIMPLS estimator. However, empirical in uence curves for RoPLS1 estimator are clearly bounded which can be seen in Figure 4.4 (b) and (d). The maximal norms of the empirical in uence functions for the SIMPLS, RoPLS1, PRM, and RSIMPLS estimators of fl are summarized in Table 4.1. RoPLS1 yielded the smallest upper bound for the norm of the empirical in uence function. fn,p,kgn Method SIMPLS RoPLS1 PRM RSIMPLS f20,200,3g 218.13 0.16 1.48 0.62 f25,125,2g 88.74 0.17 0.93 1.76 Table 4.1: The maximal norms for SIMPLS, RoPLS1, PRM, and RSIMPLS estimators. 85 Figure 4.4: Norms of the empirical in uence functions for the (a)SIMPLS, n=20, p=200, k=3 (b) RoPLS1 n=20, p=200, k=3 (c) SIMPLS, n=25, p=125, k=2 (d) RoPLS1, n=25, p=125, k=2. 86 Figure 4.5: The flnite-sample breakdown values of the SIMPLS, RoPLS1 and PRM estima- tors for (a)fn,pg=f30,6g (b)fn,pg=f20,200g. 4.3.3 Finite-Sample Breakdown Properties of RoPLS1 Estimator The resistance of a robust statistical method to groups of outliers is another important issue which is measured by breakdown point. In this section, flnite-sample breakdown value, given in (4.5), is investigated for RoPLS1 estimator. After X and y are generated as in Section 4.3.2, various amounts of contamination are added to generated data by replacing flrst i observations (i = 1;2;:::;n=2) of the response variable with 50. 
For each amount of contamination, the norm of the difference between the slope estimates for the contaminated and the clean data is calculated, and a plot of this norm versus the contamination percentage is constructed. The sample size and the number of variables are taken as {n,p} = {30,6} and {20,200} for a k = 2 component PLS model. It is clear from Figure 4.5 that the SIMPLS estimator of $\beta$ is not robust, whereas the RoPLS1 estimator copes with up to 43% of irregular observations in the low dimensional case and about 40% in the high dimensional case. PRM yields results comparable with RoPLS1.

Chapter 5
RoCPLS: Robust Partial Least Squares Classification

5.1 Introduction

The problem of classifying entities into one of several groups has been one of the main goals of many scientific investigations. For instance, predicting whether someone will have a heart attack on the basis of demographic and clinical measurements, or identifying a tumor as one of many different possibilities on the basis of DNA expression values, are potentially life-saving tasks and hence indispensable to physicians. Numerous other interesting applications of classification can be found in a broad range of scientific areas such as chemistry, economics, marketing research, bioinformatics, image analysis, pattern recognition and data mining.

Classification is a multivariate method for distinguishing among classes of objects by developing a decision rule that assigns a new object with unknown class membership to the most likely group. In this study, only two-class problems are considered. $(x_i', y_i)$ denotes the observed data, with $x_i = [x_{i1}, x_{i2}, \ldots, x_{ip}]' \in \mathbb{R}^p$ consisting of the $p$ characteristics sampled from the two populations and $y_i$ the class membership of observation $i$, for $i = 1, 2, \ldots, n$. $I_g = \{i : y_i = g\}$ denotes the set of indices for the $n_g$ observations in class $g$, where $g = 1, 2$ and $n = n_1 + n_2$.

It is important that classification is done in such a way that the proportion of misclassified observations (the misclassification error rate) is minimized. In general, the performance of a classification method can be evaluated through its misclassification error rate, which can be obtained by several approaches. The optimum error rate (OER) and the actual error rate (AER) are two quantities that can be used for determining misclassification probabilities. However, they cannot, in general, be calculated, because they depend on the unknown density functions of the populations. Another approach, employed in this study since it does not depend on the form of the populations, is to split the data set randomly into two non-overlapping sets, called the learning set, $(X_L, y_L)$, and the test set, $(X_T, y_T)$. The learning set, $X_L$, is used to construct a decision rule, $\delta$, that assigns a new vector $x \in \mathbb{R}^p$ to one of the two classes, that is,

$$\delta(x; X_L, y_L) : \mathbb{R}^p \to \{1, 2\},$$

where $y_L$ is the vector containing the class labels of the observations in the learning set, $X_L$. Based on the resulting classification rule, the fraction of misclassified observations in the test set, $X_T$, is computed. By repeating this process $N$ times, the estimated misclassification error rate is obtained as

$$\widehat{MER} = \frac{1}{N n_T} \sum_{r=1}^{N} \sum_{j=1}^{n_T} I_{\{-1,1\}}(y_j - \hat{y}_j) \qquad (5.1)$$

where $n_T$ is the number of observations in the test set, $y_j$ is the known class label of the $j$th observation in $X_T$, $\hat{y}_j$ is its estimated class label, and $I_{\{-1,1\}}(b)$ is the indicator function that takes the value 1 if $b = -1$ or $1$ and 0 otherwise.
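The random-split estimate in (5.1) amounts to the following loop. The sketch uses a deliberately simple stand-in decision rule (nearest class mean) so that it is self-contained; the function names and the toy data are ours, and any of the classifiers discussed below could be substituted for the rule.

import numpy as np

def nearest_mean_rule(X_learn, y_learn):
    """A simple stand-in decision rule delta(. ; X_L, y_L): assign to the
    class whose learning-set mean vector is closest (labels are 1 and 2)."""
    m1 = X_learn[y_learn == 1].mean(axis=0)
    m2 = X_learn[y_learn == 2].mean(axis=0)
    def delta(X_new):
        d1 = np.linalg.norm(X_new - m1, axis=1)
        d2 = np.linalg.norm(X_new - m2, axis=1)
        return np.where(d1 <= d2, 1, 2)
    return delta

def estimate_mer(X, y, n_test, N=100, rule=nearest_mean_rule, seed=0):
    """Estimate of (5.1): average test-set error over N random splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    errors = 0
    for _ in range(N):
        perm = rng.permutation(n)
        test, learn = perm[:n_test], perm[n_test:]
        delta = rule(X[learn], y[learn])
        errors += np.sum(delta(X[test]) != y[test])   # counts I_{-1,1}(y_j - yhat_j)
    return errors / (N * n_test)

# Toy example: two shifted Gaussian classes
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1, 1, size=(50, 5)), rng.normal(1, 1, size=(50, 5))])
y = np.repeat([1, 2], 50)
print(estimate_mer(X, y, n_test=30))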
Apparent error rate (APER) is another measure of the performance that can be used for any classiflcation procedure. APER is the fraction of observations in the XL that are misclassifled by the classiflcation rule, ?(:;XL;yL). Although, it is easy to calculate, it underestimates the error rate. Cross-validation is also a popular approach that consists to split data set into m non- overlapping subsets where m? 1 subsets form learning set to construct decision rule and 90 remaining subset is used as test set. If m = n, the procedure is called leave-one-out cross validation. Throughout this study, leave-one-out classiflcation is employed to estimate the value of meta parameter, k, i.e. optimal number of components. Since the introduction of the Fisher?s discriminant (FD) analysis in 1936 ([28]), several classiflcation rules have been proposed and studied in the literature. FD analysis is based on the idea of flnding the directions in multivariate space that yield the best discrimination between the groups of samples. This idea can be written as the optimization problem given by argmax a2Rp a0Ba a0Wa = a0 nP2 g=1 ng(xg ?x)(xg ?x)0 o a a0 nP2 g=1 P i2Ig(xi ?xg)(xi ?xg)0 o a (5.2) where B is the sample between-group matrix, W is the sample within-group matrix, xg sample mean vector for gth class with g = 1;2 and x is the overall sample mean vector. In general, if fi1 is the largest eigenvalue and e1 is the corresponding eigenvector of W?1B, then a = e1 = S?1p (x1 ? x2) is the solution of the optimization problem in (5.2) where Sp = W=(n?2) is the sample pooled covariance matrix. Therefore FD rule can be given as ?FD(x;X;y) = 8 >>< >>: 1; e01x ? 0:5(x1 ?x2)0S?1p (x1 +x2) 2; otherwise: FD rule is developed under the assumption that the two populations have a common co- variance matrix and it does not explicitly assume any form for the underlying distributions. Bayes classiflcation is another approach that needs prior probabilities, ?g, and prob- abilistic structure estimates for each class. The Bayesian discriminant rule assigns an ob- servation x 2 Rp to the population for which posterior probability, P(yjx), is maximal. Under the assumption that each class comes from multivariate normal distribution with 91 equal covariance matrix, the allocation rule is ?LD(x;X;y) = argmax g LDg(x;X;y); (5.3) where LDg(x;X;y) = x0gS?1p x?0:5x0gS?1p xg +ln(?g); (5.4) and this is called linear discriminant (LD) rule. Provided that the two classes come from the two normal distributions with the same covariance matrix and equal prior probabilities, ?FD is equivalent to linear discriminant rule, ?LD. When the assumption of equality of covariance matrices is not satisfled, an individual covariance matrix for each group can be used in (5.4) and this leads to the so-called quadratic discriminant (QD) analysis as the discriminating boundaries are quadratic curves. Over the last decade, many sophisticated classiflcation methods, like support vector machine, neural networks, classiflcation and regression trees (CART), have been proposed. In spite of these reflned methods, ?LD, that yields optimal discrimination between two classes, is still often used and very popular because of the simplicity, unnecessity of strict assumptions, interpretation easiness and its good performance in many applications. Of course from the point of view of optimality, LD analysis should be used for classiflcation when it can be used. 
However, it becomes a serious challenge to use LD analysis in the set- tings where the data matrix X is multicollinear or p >> n. Because, the sample covariance matrix estimate is near singular if high collinearity exists and high dimensionality makes direct matrix operation di?cult. Many solutions have been proposed to deal with these problems such as variable selection, penalized estimation, and dimension reduction. 92 Variable selection is a very popular method due to its simplicity and interpretabil- ity. The most commonly used variable selection methods are based on a score (such as t-statistic, Wilcoxon?s rank-sum statistic, false discovery rate) which measures discriminat- ing power of each variable individually and the variables with the best scores are selected (see [20], [24], [25]). These methods are called univariate ranking methods. The major drawback is the selection of variables according to an individual relevance score that ig- nores the correlations and interactions among variables. Therefore, more complex criteria than the individual scores have been proposed (optimal subset selection methods), which are generally computationally expensive and sufier from over fltting problem ([7], [11], [52]). Penalization (regularization) methods can be also employed to stabilize the pertinent covariance matrices so that the classical discrimination paradigms might be implemented (see [32]). These methods reduce the variance associated with the sample based estimate at the expense of potentially increased bias. Dimension reduction (feature extraction) is another alternative to deal with dimen- sionality problem. It allows the visualization of data in a low dimension, takes into account the correlation structure of the data and the most importantly, utilizes the information on all variables. This topic, particularly PLS as a dimension reduction tool, is examined in Section 5.2. Although, PLS solves dimensionality problem by constructing orthogonal components described in Section 5.2, it fails to deal with data containing outliers. There- fore, in Section 5.3, a new robust method, RoCPLS, is proposed. To our knowledge, there has been no study on the robustness of PLS based classiflcation methods. Performances of the existing PLS based classiflcation methods and RoCPLS are compared using benchmark data sets in Section 5.4. 93 5.2 Classiflcation Methods Based on Dimension Reduction A typical DNA microarray data set in tumor tissue classiflcation studies consists of expression measurements on thousands of genes over a relatively small number of tissue samples. Similarly, in food research, classical classiflcation methodologies can not be used for the prediction of presence/absence of a preservative in a particular food product based on spectral data in which number of variables is very large and the correlation among them is substantial. One approach to classiflcation problems, dealing with high dimensional and/or collinear data sets, is to employ a dimension reduction method and then perform a standard clas- siflcation method in the reduced space. In this section, we study dimension reduction for classiflcation based on PLS and PCA followed by LD implemented in the reduced subspace. Another classiflcation method such as logistic regression can also be employed instead of LD, however logistic regression does not perform well when the classes are completely or quasi-completely separated which is quite common conflguration in microarray data. 
Although PLS was originally designed for problems with a quantitative response, it is now frequently used as a dimension reduction tool for classification problems, where the response variable is qualitative. There are mainly two approaches when PLS is employed as a dimension reduction method for classification.

One approach is to use the NIPALS algorithm to determine the components. However, since the NIPALS algorithm consists of regression steps (see Chapter 2), it seems unappealing to apply, to a qualitative response, an algorithm designed for continuous-response models that do not suffer from heteroscedasticity. Marx ([60]) therefore proposed an extension of the NIPALS algorithm to handle qualitative-response models. He incorporated the original NIPALS algorithm into the framework of generalized linear models by employing iteratively reweighted least squares (IRLS). The main drawback of the method is a convergence problem. Ding and Gentleman ([21]) modified Marx's method by applying Firth's procedure ([27]) to resolve the complete or quasi-complete separation problem that causes the convergence problem. A recent method by Fort and Lacroix, RPLS ([30]), combines the NIPALS algorithm with ridge-penalized logistic regression. They also provided an extensive simulation study comparing the existing NIPALS based classification methods and concluded that the misclassification error rates of IRPLS and of Ding and Gentleman's method are lower and less variable.

The other commonly used approach to determining PLS components for classification is to apply the original SIMPLS algorithm, described in Chapter 2. Barker and Rayens ([4]), Nguyen and Rocke ([65]) and Boulesteix ([8]) proposed the use of SIMPLS for dimension reduction as a preliminary step in classification problems. In this chapter, SIMPLS based classification is considered, not only because SIMPLS has computational advantages over the NIPALS algorithm (see Chapter 2), but also because the optimal directions obtained by SIMPLS are related to Fisher's optimal directions, so there is a relationship between classification based on SIMPLS and Fisher's discrimination. The following lemma gives this relationship.

Lemma 5.1 (Boulesteix, [8]) If the common covariance matrix, $\Sigma$, is assumed to be of the form $\Sigma = \sigma^2 I_p$ for a non-zero constant $\sigma$, then $a = e_1$ and the first PLS-weight vector, $r_1$, are collinear.

Proof: Let X and y be centered. $\hat{r}_1$ is the direction that maximizes the square of the covariance between the projected explanatory variables and the response variable, i.e.,

$$\hat{r}_1 = \arg\max_a \widehat{\mathrm{cov}}(Xa, y) = \frac{X'y}{\sqrt{y'XX'y}}.$$

The centered response is given by

$$y_i = \begin{cases} -n_2/n, & i \in I_1 \\ \;\; n_1/n, & i \in I_2. \end{cases}$$

Therefore, the $j$th element of the $p \times 1$ vector $X'y$ is

$$-\frac{n_2}{n}\sum_{i \in I_1} x_{ij} + \frac{n_1}{n}\sum_{i \in I_2} x_{ij} = \frac{n_1 n_2}{n}(\bar{x}_{2j} - \bar{x}_{1j}),$$

so that

$$\hat{r}_1 = \frac{\bar{x}_2 - \bar{x}_1}{\|\bar{x}_1 - \bar{x}_2\|}.$$

Therefore $r_1$ is proportional to the normalized form of $\mu_1 - \mu_2$, which is the dominant eigenvector of the between-groups matrix, B. Since $e_1 = \Sigma^{-1}(\mu_1 - \mu_2)$ and $\Sigma = \sigma^2 I_p$, $r_1$ and $e_1$ are collinear. $\Box$

Lemma 5.1 implies that SIMPLS depends on the between-groups matrix. It is also clear that the within-group matrix is not utilized to construct the SIMPLS components; that is, since only B and not W is involved, classification based on SIMPLS depends only on the between-groups matrix. Hence LD outperforms SIMPLS based classification in the situations where LD can be implemented. In the presence of multicollinearity, however, the optimality advantage of LD over SIMPLS based classification reverses.
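The computation at the heart of Lemma 5.1 — that for a centered group-membership response the first PLS weight vector is the normalized difference of the class mean vectors — is easy to verify numerically. The sketch below is our own illustration under that setting.

import numpy as np

rng = np.random.default_rng(5)
n1, n2, p = 30, 45, 10
X = np.vstack([rng.normal(0, 1, (n1, p)), rng.normal(1, 1, (n2, p))])
n = n1 + n2

# Centered class-membership response: -n2/n for class 1, n1/n for class 2
y = np.concatenate([np.full(n1, -n2 / n), np.full(n2, n1 / n)])
Xc = X - X.mean(axis=0)

# First PLS weight vector: normalized X'y (the maximizer of cov(Xa, y))
r1 = Xc.T @ y
r1 /= np.linalg.norm(r1)

# Difference of the class mean vectors, normalized
diff = X[n1:].mean(axis=0) - X[:n1].mean(axis=0)
diff /= np.linalg.norm(diff)

# The two directions coincide up to numerical error, as the proof shows
print(np.abs(r1 @ diff))   # ~1.0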
PCA reduces the dimension of the data set by retaining as much as possible of the variation present in the data. PCA is therefore only capable of identifying total variability, i.e., B + W, and is not capable of distinguishing between-groups from within-groups variability, which is the main goal of Fisher's discrimination ([4], [8]). In particular, if the within-groups variability, W, dominates the total variability, PCA no longer performs well as a classification tool. Since SIMPLS depends on the between-groups matrix, SIMPLS is to be preferred to PCA when discrimination is the major goal after dimension reduction. The following example also indicates that PLS outperforms PCA as the within-groups variability increases.

Example 5.1 This example is a modified version of the example given by Barker and Rayens [4]. Fifty observations from the two multivariate normal distributions with means $\mu_1 = (-2, 0, 0)'$ and $\mu_2 = (2, 0, 0)'$ and common covariance matrix

$$\Sigma = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & r \\ 0 & r & \sigma^2 \end{pmatrix}$$

are generated N = 100 times, where $\sigma^2 = 1, 2, \ldots, 6$ is the variance of the third variable and r = 0.9 is the covariance between the second and third variables. Misclassification error rates are calculated using leave-one-out cross-validation and averaged over the 100 randomly generated data sets. It can be seen from Figure 5.1 that, as $\sigma^2$, the variance of the third variable, varies from 1 to 6, the misclassification rate of PCA based classification increases, since PCA loses sight of the discrimination information when the within-group matrix dominates the total variability.

Figure 5.1: Misclassification error rates for PLS and PCA for k = 1.

PCA and SIMPLS are both linear dimension reduction methods, but SIMPLS uses the class information, y, to construct its components (it is supervised), while PCA does not use the class information (it is unsupervised). Several other dimension reduction methods can be applied in the context of classification. For instance, sliced inverse regression (SIR) is one of the sufficient dimension reduction methods ([51]), which represent a family of dimension reduction procedures. A simulation study by Dai et al. ([14]) demonstrates that SIMPLS and SIR are both effective in dimension reduction for classification, and both are more effective than PCA, which is not surprising since SIMPLS and SIR are supervised methods. Considering both accuracy and computational efficiency, that study concluded that SIMPLS provides the best performance among SIMPLS, SIR and PCA.

5.3 Description of the Proposed Algorithm: RoCPLS

In Chapter 3, it was shown that RoPLS is successful in the regression framework when the data contain outliers. Partial least squares is also frequently used as a classification method, as described in Section 5.2. In the presence of outliers, dimension reduction via PLS yields unreliable results, since PLS is known to be sensitive to outliers. Although several robust PLS methods have been proposed for a quantitative response variable, to our knowledge there has been no study on the robustness of PLS when the response variable is qualitative. In this section, the sensitivity of PLS based classification methods to outliers is demonstrated and an extension of the robust method introduced in Chapter 3 is given. The proposed algorithm, RoCPLS, is a robustified version of SIMPLS based classification.
The main difierences between RoPLS and RoCPLS are that weights for the response variable, y, are immaterial and weights for data matrix, X, are computed for each class separately. The detailed algorithm is given below: Algorithm: RoCPLS Input: n?p data matrix, X, n?1 vector of response variable, y, a new observation, x 2Rp Output: Score matrix, T, and p ? k PLS weight matrix, R and class label for the new observation, ^y Step 1: Let Xg = fxij;i 2 Ig;j = 1;2;:::;pg for g = 1;2. Apply PCOUT algorithm, de- scribed in Section 3.2.2, to X1 and X2 to obtain weight vectors w1 and w2, respectively. Take p? = ng ?1 for high dimensional data and p? = rank(Xg) for low dimensional data where g = 1;2. Within each group, any observation with flnal weight less than 0:25 is assigned as an outlier. So, let X?1 and X?2 be the clean matrices of observations with corresponding weights greater than 0:25, and merged matrix X? = fx?ijg = 0 BB @ X?1 X?2 1 CC A, with the vector of class labels y? and nc1 and nc2 are the number of observations in X?1 and X?2, respectively with nc = nc1 +nc2. Step 2: Let X0 = fx0ijg = X?; X01j = fx0ij;1 ? i ? nc1g and X02j = fx0ij;nc1 + 1 ? i ? ncg for 1 ? j ? p. Repeat steps 2:1?2:6 for h = 1;2;:::;k: Step 2.1: Compute PLS weight vector rh with jth row equal to: 99 rh(j) = m 1j ?m2j k m1j ?m2j k where mgj = P i2Ijg x h?1 ij njg with Ij1 = fi;1 ? i ? nc1 and q25(Xh?11j )?1:5IQR(X01j) ? xh?1ij ? q75(X01j)+1:5IQR(X01j)g Ij2 = fi;nc1 +1 ? i ? nc and q25(Xh?12j )?1:5IQR(X02j) ? xh?1ij ? q75(X02j)+1:5IQR(X02j)g where q25 and q75 are the 25th and 75th sample quantiles; IQR is the interquartile range and njg is the size of Ijg for g = 1;2 and j = 1;2;:::;p. Step 2.2: Compute hth score, th = X?rh, and normalize th =: th=kthk, Step 2.3: Update hth PLS weight, rh =: rh= q r0hX?0X?rh, Step 2.4: Compute hth x-loading by regressing X? on th: ph = X?0th , Step 2.5: Store vectors rh, th, and ph into matrices Rh = [r1;r2;:::;rh], Th = [t1;t2;:::;th], and Ph = [p1;p1;:::;ph], respectively. Step 2.6: h =: h + 1 and Xh?1 = X?(Ip ? Vh?1V0h?1) = fxh?1ij g where columns of Vh?1 form an orthonormal basis for Ph?1. Step 3: ^y = ?LD(x0Rh;Th;y?) The constructed component matrix, Th, is not only utilized to determine classiflcation rule, but also used to plot the flrst two or three components which helps to display relation- ships, possible groupings and potential outliers in the data. After projecting the original data matrix, X, on the robustly calculated directions, R, orthogonal-score distance plot, 100 described in Section 3.3, can be constructed to distinguish regular, good PLS-leverage, or- thogonal and bad PLS-leverage points. It can be also used to visualize outlying observations within each class. Numerical examples with diagnostic plots in Section 5.4 demonstrated the robustness and e?ciency of the proposed method. 5.4 Numerical Examples In this section, two benchmark data sets are utilized to compare performances of the existing dimension reduction based classiflcation methods and proposed method, RoCPLS. 5.4.1 Low Dimension: Wine Recognition Data The wine recognition data ([29]) are the results of a chemical analysis of wines grown in the same region in Italy but derived from three difierent cultivars. The analysis determined the quantities of 13 constituents (i.e., the level of alcohol, the level of magnesium, the color intensity ) found in each of the types of wines. In this study, only two cultivars with sample sizes 59 and 71 are considered. 
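Before turning to the numerical results, the core computation of Step 2.1 above — the weight vector built from coordinatewise class means computed after dropping values outside the 1.5 IQR fences — can be sketched as follows. This is only our reading of that single step in isolation (the deflation, scoring and LD steps of the algorithm are omitted), and the function names trimmed_class_mean and rocpls_weight are ours.

import numpy as np

def trimmed_class_mean(col):
    """Mean of one variable within one class, using only values inside the
    fences [q25 - 1.5*IQR, q75 + 1.5*IQR] (cf. RoCPLS Step 2.1)."""
    q25, q75 = np.percentile(col, [25, 75])
    iqr = q75 - q25
    keep = (col >= q25 - 1.5 * iqr) & (col <= q75 + 1.5 * iqr)
    return col[keep].mean()

def rocpls_weight(X1, X2):
    """Robust PLS-type weight vector: normalized difference of the
    coordinatewise trimmed class means m1 and m2."""
    p = X1.shape[1]
    m1 = np.array([trimmed_class_mean(X1[:, j]) for j in range(p)])
    m2 = np.array([trimmed_class_mean(X2[:, j]) for j in range(p)])
    r = m1 - m2
    return r / np.linalg.norm(r)

# Example: a few gross outliers barely move the robust weight direction
rng = np.random.default_rng(6)
X1 = rng.normal(0, 1, (40, 8)); X2 = rng.normal(1, 1, (40, 8))
r_clean = rocpls_weight(X1, X2)
X1_out = X1.copy(); X1_out[:3] += 25.0                # contaminate three rows of class 1
print(np.abs(rocpls_weight(X1_out, X2) @ r_clean))    # stays close to 1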
5.4 Numerical Examples

In this section, benchmark data sets are utilized to compare the performances of existing dimension reduction based classification methods and the proposed method, RoCPLS.

5.4.1 Low Dimension: Wine Recognition Data

The wine recognition data ([29]) are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents (e.g., the level of alcohol, the level of magnesium, the color intensity) found in each type of wine. In this study, only two cultivars, with sample sizes 59 and 71, are considered.

To determine the optimal number of components, the SIMPLS based cross-validation error rate is calculated for h = 1, 2, ..., 10 components and the scree plot in Figure 5.2 is obtained. Based on the figure, the optimal number of components is determined as 7, i.e., k = 7, which yields the lowest error rate.

Figure 5.2: Cross-validation error rates obtained for the Wine data.

The orthogonal-score plots for SIMPLS and RoCPLS based on the k = 7 component model can be seen in Figure 5.3. It is obvious that none of the PLS-bad leverage points can be identified when SIMPLS is employed, while observations 74, 79, 96, 111 and 122 are identified as PLS-bad leverage points when RoCPLS is employed. We deleted all PLS-bad leverage points as well as the two good leverage points, 70 and 97, identified by RoCPLS, and recalculated the SIMPLS based cross-validation error rate for the reduced data, which yields Figure 5.4. Excluding these observations clearly yields smaller error rates, and it also indicates that k = 4 is the optimal number of components. This example demonstrates how outlying observations can affect the misclassification error rate based on SIMPLS as well as the choice of k. The orthogonal-score plots based on k = 4 for SIMPLS and RoCPLS show patterns very similar to those in Figure 5.3 and are therefore not repeated here.

Figure 5.3: Orthogonal-score distance plots based on SIMPLS (left) and RoCPLS (right) for the Wine data.

Figure 5.4: Scree plots for the original and the deleted Wine data.

In order to compare the classification accuracies, 100 random partitions into a learning set containing 70% of the data and a test set containing the remaining observations are generated. Observations 70, 74, 79, 96, 97, 111 and 122 are kept in the learning set for each partition. The classification rule is then calculated from the learning set, and the estimated class memberships of the observations in the test set are evaluated after projecting them onto the directions calculated from the learning set and applying that rule. For h = 1, 2, 3, 4, the error rates given in Table 5.1 are obtained. Both SIMPLS and RoCPLS give lower error rates than PCA. Moreover, RoCPLS yields the smallest error rate for each h, which indicates the robustness of the method.

  h   SIMPLS   RoCPLS   PCA
  1   0.0703   0.0677   0.0703
  2   0.0736   0.0462   0.0744
  3   0.0374   0.0292   0.0659
  4   0.0187   0.0100   0.0362

Table 5.1: The mean misclassification error rates for the Wine data based on SIMPLS, RoCPLS and PCA classification.
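The repeated random-split evaluation described above can be sketched as follows. This is an illustration only, not the dissertation's implementation: it uses scikit-learn's built-in copy of the wine recognition data, ordinary PLS followed by linear discriminant analysis in place of the SIMPLS and RoCPLS classifiers, stratified 70/30 splits, and no special treatment of the leverage points, so the resulting numbers will not match Table 5.1 exactly.

```python
# Sketch of the 100-fold random 70/30 evaluation for the Wine data
# (illustrative): dimension reduction is fitted on the learning set only,
# the test observations are projected onto the learned directions, and an
# LDA rule built on the learning-set scores is applied to them.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedShuffleSplit

data = load_wine()
keep = data.target < 2                      # the two cultivars of sizes 59 and 71
X, y = data.data[keep], data.target[keep]

splitter = StratifiedShuffleSplit(n_splits=100, test_size=0.3, random_state=0)
for h in range(1, 5):
    errs = []
    for train, test in splitter.split(X, y):
        pls = PLSRegression(n_components=h).fit(X[train], y[train])
        lda = LinearDiscriminantAnalysis().fit(pls.transform(X[train]), y[train])
        errs.append(np.mean(lda.predict(pls.transform(X[test])) != y[test]))
    print(f"h = {h}: mean misclassification rate = {np.mean(errs):.4f}")
```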
5.4.2 High Dimension: Colon Data

The colon data set ([2]) contains the expression levels of p = 2000 genes for n = 62 patients from two classes: 22 patients are healthy and 40 have colon cancer. After the data preprocessing described in Dudoit et al. ([24]), only 1224 variables remain. Cross-validation error rates indicate that k = 4 components result in the minimum error rate. Therefore, orthogonal-score plots for k = 4 components are constructed using SIMPLS and RoCPLS; they are given in Figure 5.5. None of the plots indicates the existence of extreme outliers.

Figure 5.5: Orthogonal-score distance plots based on SIMPLS (left) and RoCPLS (right) for the Colon data.

The scatter plot of the first three components in Figure 5.6 does not display any outlying observations either. It can also be seen from Figure 5.6 that the classes are not completely separated, which explains the high misclassification error rates obtained in Table 5.2 (k = 4), where DG stands for Ding and Gentleman's method ([21]). As in the previous example, 100 randomly split data sets are employed to calculate the error rates, and the optimal value of k is estimated in each iteration from the learning set. The boxplots in Figure 5.7 summarize the error rates calculated for each method. Clearly, all methods give better results than PCA. The numerical results and graphics show comparable performances for SIMPLS and RoCPLS, which demonstrates the efficiency of the proposed method on uncontaminated data sets.

  SIMPLS   RoCPLS   PCA      DG
  0.1326   0.1363   0.2289   0.1389

Table 5.2: The mean misclassification error rates for SIMPLS, RoCPLS, PCA and DG based classification.

Figure 5.6: 3D scatter plot of the first three components for the Colon data obtained from RoCPLS.

Figure 5.7: Boxplots of the error rates for SIMPLS, RoCPLS, PCA and DG based classification.

5.4.3 High Dimension: Leukemia Data

This data set was introduced in Golub et al. ([37]) and contains the expression levels of 7129 genes for 47 ALL-leukemia patients and 25 AML-leukemia patients. After data preprocessing, only 500 variables remain. Leave-one-out cross-validation on the whole data set indicates that k = 2 components should be retained in the model. For k = 2 components, the scatter plot of the first two components is constructed (Figure 5.8), where the separation between the two groups can be clearly seen. As in the colon data set, the orthogonal-score plots obtained from SIMPLS and RoCPLS do not indicate any extreme observations in the data.

Figure 5.8: Scatter plot of the first two components for the Leukemia data.

The Leukemia data set is randomly divided into a learning set of size 50 and a test set of size 22, N = 250 times. For this simulation study, we introduce 0, 1, 2, and 3 outlying observations to each class, corresponding to 0%, 4%, 7% and 10% contamination, respectively. For class g, a contaminated observation is generated from a multivariate normal distribution with mean $10 \cdot \mathbf{1}_{500} + x_g$ and covariance matrix $I_{500}$, and is given the class label $y = g$, for $g = 1, 2$. For example, when 2 outliers are introduced for each class, the plot of the first two components and the orthogonal-score plot obtained from RoCPLS in Figure 5.9 indicate that these observations are PLS-bad leverage points.

Figure 5.9: The scatter plot of the first two components (left) and the orthogonal-score plot (right) for the Leukemia data.
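The contamination scheme just described, together with the weight-vector comparison reported below in Figure 5.11, can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: scikit-learn's PLSRegression stands in for SIMPLS, the per-class mean is used as the shift centre $x_g$ (an assumption about the notation above), and a small simulated matrix replaces the 500-gene Leukemia data.

```python
# Sketch of the contamination experiment (illustrative): outliers for class
# g are drawn from N(10*1_p + center_g, I_p) and keep the label g, so they
# act as bad leverage points.  The angle between the first PLS weight
# vectors fitted on clean and contaminated data measures how much the
# direction is distorted, as in Figure 5.11.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def contaminate(X, y, n_out_per_class, shift=10.0, seed=None):
    """Append n_out_per_class shifted outlying rows per class to (X, y)."""
    rng = np.random.default_rng(seed)
    X_new, y_new = [X], [y]
    for g in np.unique(y):
        center = X[y == g].mean(axis=0)          # assumed shift centre x_g
        bad = rng.normal(loc=shift + center, scale=1.0,
                         size=(n_out_per_class, X.shape[1]))
        X_new.append(bad)
        y_new.append(np.full(n_out_per_class, g))
    return np.vstack(X_new), np.concatenate(y_new)

def first_weight(X, y):
    """Unit-norm first PLS weight vector for a binary 0/1-coded response."""
    response = (y == np.unique(y)[0]).astype(float)
    pls = PLSRegression(n_components=1).fit(X, response)
    w = pls.x_weights_[:, 0]
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (25, 50)), rng.normal(1, 1, (47, 50))])
y = np.repeat([1, 2], [25, 47])

Xc, yc = contaminate(X, y, n_out_per_class=2, seed=1)
r_clean, r_cont = first_weight(X, y), first_weight(Xc, yc)
cosine = np.clip(abs(r_clean @ r_cont), -1.0, 1.0)
print(f"angle between clean and contaminated first weight vectors: "
      f"{np.degrees(np.arccos(cosine)):.1f} degrees")
```

Even two such points per class typically rotate the non-robust first weight vector by a large angle, which is the behaviour summarized for SIMPLS in Figure 5.11 and the behaviour RoCPLS is designed to avoid.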
As in the colon data example, the leave-one-out method is applied to the clean learning set of size 50 to determine the optimal number of components within each division; 60% of the time k is determined as 2, while 40% of the time k = 1. For the optimal value of k, the error rates based on SIMPLS, RoCPLS, PCA and DG are calculated for the test set. The boxplots in Figure 5.10 summarize the simulation results. For the clean data (no outliers), all methods yield very comparable results; once again, RoCPLS is an effective method for uncontaminated data. As the number of outlying observations increases, the error rates for SIMPLS, PCA and DG increase as well. However, adding outlying observations does not affect the error rates based on RoCPLS. The main reason is that the optimal directions obtained by RoCPLS are robust to outliers. To show this, the angle between the first PLS weight vectors, $r_1$, obtained from the clean and the contaminated data is calculated for SIMPLS and RoCPLS. Boxplots, shown in Figure 5.11, are constructed for each contamination level based on the N = 250 divisions. The angle for SIMPLS tends to increase as the contamination level increases, whereas RoCPLS yields smaller angles than SIMPLS and its results remain almost the same as the contamination level increases.

Figure 5.10: Boxplots of the error rates for (a) no outliers, (b) 1 outlier, (c) 2 outliers, (d) 3 outliers in each class.

Figure 5.11: The angle between the first PLS weight vectors for clean and contaminated data.

Chapter 6
Conclusions and Future Work

In this dissertation, different aspects of partial least squares methods have been studied. In this chapter, the final conclusions on the results obtained throughout the dissertation are summarized, and some possibilities for future research are discussed.

In Chapter 2, the main concepts of PLS are introduced and a detailed overview of its applications to different data analysis problems is given. Two important algorithms, SIMPLS and NIPALS (PLS1 and PLS2), are described. It is emphasized that the optimal number of components is an important issue in PLSR model building, and several approaches proposed in the literature for determining the optimal number of components, k, are reviewed. The connections among the biased estimation methods PLSR, PCR and RR are examined in detail, and statistical properties of PLSR such as shrinkage, asymptotic variance and consistency are discussed. In conclusion, the computational and implementation simplicity of PLS is a strong aspect of the approach, which favours PLS as a first step to understand existing relations and to analyse real-world data.

In Chapter 3, a new iterative robust external reweighting algorithm for the regression coefficient vector, which gives low weights to points with high leverage and/or large residuals, is proposed. This algorithm is carried out in two main parts:

1. obtain initial weights, as robust distances from the recent outlier detection methods BACON (RoPLS1) or PCOUT (RoPLS2), to downweight outlying points in the predictor X-space and/or the response y-space and to obtain an initial PLS estimate of the regression coefficient vector;
2. perform reweighted PLS regression iteratively, using the initial PLS estimate of the regression coefficient vector obtained in the first part.

Both RoPLS1 and RoPLS2 can be applied to low and high dimensional explanatory variables. Simulations have shown that they are resistant towards many types of contamination, while their performance is also good on uncontaminated data sets. RoPLS1 is scale and orthogonal equivariant, and can therefore be preferred over RoPLS2, which is not orthogonal equivariant.

In Chapter 4, it is shown that the SIMPLS algorithm is highly non-robust towards outlying observations. It is illustrated that a single sample can change the direction of the SIMPLS weight vectors and the regression estimates arbitrarily, which is also reflected in their unbounded influence functions. Robustness properties of the RoPLS estimator of $\beta$, including the influence function for low dimensions, the empirical influence curve for the high dimensional case and finite-sample breakdown properties, are provided. It is shown that the influence functions of all pairs of PLS weight vectors and of the regression estimator are bounded, which makes the method resistant towards point contamination. For high dimensional data, it is illustrated on simulated data sets that the empirical influence function remains bounded and that the method can resist large fractions of contamination. The resistance of a robust statistical method to groups of outliers, measured by the breakdown point, is another important issue; the finite-sample breakdown point of RoPLS1 is approximately 40% for both low and high dimensional settings.

In Chapter 5, the effect of outliers on existing PLS classification methods is investigated and a new robust PLS algorithm for classification (RoCPLS) is proposed. It is shown that the proposed method is very effective for uncontaminated data and that it yields better results when the data contain outliers.

In this dissertation, we have shown promising results for RoPLS and RoCPLS as a data mining tool and a high-dimensional classifier, respectively. There is, of course, more research to be done. We would like to extend RoPLS to the multivariate case and RoCPLS to the multi-class case. We would also like to employ the influence function of the RoPLS1 estimator for the robust estimation of its variance. Moreover, the relationship between the PLS components and variable selection will be explored in the future to build a robust variable selection method.

Bibliography

[1] Almøy, T., "A Simulation Study on Comparison of Prediction Methods When Only a Few Components are Relevant", Computational Statistics and Data Analysis, 21, 87-107, 1996.

[2] Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., and Levine, A., "Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays", Proceedings of the National Academy of Sciences, 96, 6745-6750, 1999.

[3] Anderson, R.L. and Bancroft, T.A., Statistical Theory in Research, McGraw-Hill, New York, 1952.

[4] Barker, M. and Rayens, W., "Partial Least Squares for Discrimination", Journal of Chemometrics, 17, 166-173, 2003.

[5] Billor, N., Hadi, A., and Velleman, P., "BACON: Blocked Adaptive Computationally-Efficient Outlier Nominators", Computational Statistics and Data Analysis, 34, 279-298, 2000.

[6] Billor, N., Chatterjee, S., and Hadi, A. S., "Iteratively Re-weighted Least Squares Method for Outlier Detection in Linear Regression", American Journal of Mathematical and Management Sciences, 26(3), 229-252, 2006.
S., "Iteratively Re-weighted Least Squares Method for Outlier Detection in Linear Regression", American Journal for Mathemat- ical and Management Science, 26, 3, 229-252, 2006. [7] Bo, T.H., andJonassen, I., "NewFeatureSubsetSelectionProceduresforClassiflcation of Expression Proflles", Genome Biology , 3, R17, 2002. [8] Boulesteix, A.L., "PLS Dimension Reduction for Classiflcation with Microarray Data", Statistical Applications in Genetics and Molecular Biology, 3, Article: 33, 2004. [9] Breiman, L. and Friedman, J., "Predicting Multivariate Responses in Multiple Linear Regression", Journal of the Royal Statistical Society, B, 59, 3-54, 1997. [10] Chatterjee, S. and M?achler, M., "Robust Regression: A Weighted Least Squares Ap- proach", Communications in Statistics: Theory and Methods, 26, 1381-1394, 1997. [11] Chilingaryan, A., Gevorgyan, N., Vardanyan, A., Jones, D., and Szabo, A., "Multi- variate Approach for Selecting Sets of Difierentially Expressed Genes", Mathematical Biosciences, 176, 59-72, 2002. 114 [12] Croux, C. and Haesbroeck, G., "Principal Components Analysis Based on Robust Esti- mators of the Covariance or Correlation Matrix: In uence Functions and E?ciencies", Biometrika, 87, 603-618, 2000. [13] Cummins, D.J. and Andrews, C.W., "Iteratively Reweighted Partial Least Squares: A Performance Analysis by Monte Carlo Simulation", Journal of Chemometrics, 9, 489-507, 1995. [14] Dai, J. J., Lieu, L., and Rocke, D., "Dimension Reduction for Clasiflcation with Gene Expression Microarray Data", Statistical Applications in Genetics and Molecular Bi- ology, 5, Article: 6, 2006. [15] De Jong, S., "SIMPLS: An Alternative Approach to Partial Least Squares Regression", Chemometrics and Intelligent Laboratory Systems, 18, 251-263, 1993. [16] De Jong, S., "PLS Fits Closer than PCR", Journal of Chemometrics, 7, 551-557, 1993. [17] De Jong, S., "PLS Shrinks", Journal of Chemometrics, 9, 323-326, 1995. [18] Denham, M.C., "Prediction Intervals in Partial Least Squares", Journal of Chemomet- rics, 11, 39-42, 1997. [19] Denham, M. C., "Choosing the Number of Factors in Partial Least Squares Regres- sion: Estimating and Minimizing the Mean Squared Error of Prediction", Journal of Chemometrics, 14, 351-361, 2000. [20] Dettling, M., B?uhlmann, P., "Boosting for Tumor Classiflcation with Gene Expression Data", Bioinformatics, 19, 1061-1069, 2003. [21] Ding, B. and Gentleman, R., "Classiflcation Using Generalized Partial Least Squares", Bioconductor Project Working Papers, Technical Report 5, 2004. [22] Donoho D.L., "Breakdown Properties of Multivariate Location Estimators", Ph.D. Qualifying paper, Harvard University, 1982. [23] Draper, N. and Smith, H., Regression Analysis by Example, Wiley, New York, 1981. [24] Dudoit, S., Fridlyand, J., and Speed, T. P., "Comparison of Discrimination Methods for the Classiflcation of Tumors Using Gene Expression Data", Journal of the American Statistical Association, 97, 77-87, 2002. [25] Dudoit, S., Shafier, J. P., and Boldrick, J. C., "Multiple Hypothesis Testing in Mi- croarray Experiments", Statistical Science, 18, 71-103, 2003. [26] Filzmoser, P., Maronna, R. and Werner, M., "Outlier Identiflcation in High Dimen- sions", Computational Statistics and Data Analysis, 52, 1694-1711, 2008. 115 [27] Firth, D., "Bias Reduction of Maximum Likelihood Estimates", Biometrika, 80, 27-38, 1993. [28] Fisher, R. A., "The Use of Multiple Measurements in Taxonomic Problems", Annals of Eugenics, 7, 179-188, 1936. [29] Forina, M., Armanino, C., Castino, M. 
[30] Fort, G. and Lambert-Lacroix, S., "Classification Using Partial Least Squares with Penalized Logistic Regression", Bioinformatics, 21(7), 1104-1111, 2005.

[31] Frank, I.E. and Friedman, J.H., "A Statistical View of Some Chemometrics Regression Tools", Technometrics, 35, 109-147, 1993.

[32] Friedman, J., "Regularized Discriminant Analysis", Journal of the American Statistical Association, 84, 165-175, 1989.

[33] Garthwaite, P. H., "An Interpretation of Partial Least Squares", Journal of the American Statistical Association, 89, 122-127, 1994.

[34] Geladi, P. and Kowalski, B. R., "Partial Least Squares Regression: A Tutorial", Analytica Chimica Acta, 185, 1-17, 1986.

[35] Gil, J.A. and Romera, R., "On Robust Partial Least Squares Methods", Journal of Chemometrics, 12, 365-378, 1998.

[36] Golub, G.H. and Van Loan, C.F., Matrix Computations, Johns Hopkins University Press, Baltimore, 1996.

[37] Golub, T., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J., Caligiuri, M. A., Bloomfield, C. D., and Lander, E. S., "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring", Science, 286, 531-537, 1999.

[38] Griep, M.I., Wakeling, I.N., Vankeerberghen, P., and Massart, D.L., "Comparison of Semirobust and Robust Partial Least Squares Procedures", Chemometrics and Intelligent Laboratory Systems, 29, 37-40, 1995.

[39] Hadi, A. S. and Ling, R. F., "Some Cautionary Notes on the Use of Principal Component Regression", The American Statistician, 52, 15-19, 1998.

[40] Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A., Robust Statistics: The Approach Based on Influence Functions, John Wiley and Sons, New York, 1986.

[41] Helland, I., "On the Structure of Partial Least Squares Regression", Communications in Statistics: Simulation and Computation, 17, 581-607, 1988.

[42] Helland, I., "Partial Least Squares Regression and Statistical Models", Scandinavian Journal of Statistics, 17, 97-114, 1990.

[43] Helland, I., "Rotational Symmetry, Model Reduction and Optimality of Prediction from the PLS Population Model", Chemometrics and Intelligent Laboratory Systems, 68, 53-60, 2003.

[44] Hoerl, A.E. and Kennard, R. W., "Ridge Regression: Biased Estimation for Nonorthogonal Problems", Technometrics, 8, 27-51, 1970.

[45] Høskuldsson, A., "PLS Regression Methods", Journal of Chemometrics, 2, 211-228, 1988.

[46] Høskuldsson, A., "Dimension of Linear Models", Chemometrics and Intelligent Laboratory Systems, 32, 37-55, 1996.

[47] Huber, P.J., Robust Statistics, John Wiley and Sons, New York, 1981.

[48] Hubert, M. and Vanden Branden, K., "Robust Methods for Partial Least Squares Regression", Journal of Chemometrics, 17, 537-549, 2003.

[49] Hubert, M., Rousseeuw, P. J., and Vanden Branden, K., "ROBPCA: A New Approach to Robust Principal Component Analysis", Technometrics, 47(1), 64-79, 2005.

[50] Kavšek, B., Partial Least Squares Regression and Its Robustification, Diploma Thesis, Technische Universität Wien, 2002.

[51] Li, K.C., "Sliced Inverse Regression for Dimension Reduction", Journal of the American Statistical Association, 86, 316-342, 1991.

[52] Li, L., Weinberg, C. R., Darden, T. A., and Pedersen, L. G., "Gene Selection for Sample Classification Based on Gene Expression: Study of Sensitivity to Choice of Parameters of the GA/KNN Method", Bioinformatics, 17, 1131-1142, 2001.
G., "Gene Selection for Sample Classiflcation Based on Gene Expression: Study of Sensitivity to Choice of Parameters of the GA/KNN Method", Bioinformatics, 17, 1131-1142, 2001. [53] Lindj?erde, O.C. and Christophersen, N., "Shrinkage Structure of Partial Least Squares Regression", Scandinavian Journal of Statistics, 27, 459-473, 2000. [54] Mallows, C. P., "On Some Topics in Robustness", Technical Memorandum , Murray Hill, New Jersey, 1975. [55] Manne, R., "Analysis of Two Partial-Least-Squares Algorithms for Multivariate Cali- bration", Chemometrics and Intelligent Laboratory Systems, 2, 187-197, 1987. [56] Maronna, R. and Yohai, V., "The Behavior of the Stahel-Donoho Robust Multivariate Estimator", Journal of the American Statistical Association, 90, 330-341, 1995. [57] Maronna, R. and Zamar, V., "Robust Estimates of Location and Dispersion for High Dimensional Data Sets", Technometrics, 44(4), 307-317, 2002. 117 [58] Maronna,R., Martin, D. and Yohai,V., Robust Statistics Theory and Methods, John Wiley and Sons, New York, 2006. [59] Martens, H. and Naes, T., Multivariate Calibration, Wiley, Chichester, 1989. [60] Marx, B.D., "Iteratively Reweighted Partial Least Squares Estimation for Generalized Linear Regression", Technometrics, 38(4), 374-381, 1996. [61] Marx, B.D. and Eilers, P.H., "Generalized Linear Regression on Sampled Signals and Curves: A P-spline Approach". Technometrics, 41, 1-13, 1999. [62] Massy, W.F., "Principal Component Regression in Explanatory Statistical Research", Journal of the American Statistical Association, 60, 234-246, 1965. [63] Naes, T., "Multivariate Calibration When the Error Covariance Matrix is Structured", Technometrics, 27, 301-311, 1985. [64] Naik, P. and Tsai, C., "Partial Least Squares Estimator for Single-Index Models", Journal of the Royal Statistical Society: B , 62, 763-771, 2000. [65] Nguyen, D.V. and Rocke, D.M., "Tumor Classiflcation by Partial Least Squares Using Microarray Gene Expression Data", Bioinformatics, 18, 39-50, 2002(a). [66] Nguyen, D.V. and Rocke, D.M., "Multi-Class Cancer Classiflcation Via Partial Least Squares Using Gene Expression Profles".Bioinformatics, 18, 1216-1226, 2002(b). [67] Osborne, B.G., Fearn, T. , Miller, A.R. and Douglas, S., "Application of Near In- frared Re ectance Spectroscopy to the Compositional Analysis of Biscuits and Biscuit Dough", Journal of Scientiflc Food Agriculture, 35, 99-105, 1984. [68] Pe~na, D. and Prieto, F.,"Multivariate Outlier Detection and Robust Covariance Matrix Estimation", Technometrics , 43, 286-310, 2001. [69] Phatak, A. and De Jong, S., "The Geometry of Partial Least Squares", Journal of Chemometrics, 11, 311-338, 1997. [70] Phatak, A. and De Hoog, F., "Exploiting the Connection Between PLS, Lanczos Meth- ods and Conjugate Gradients: Alternative Proofs of Some Properties of PLS", Journal of Chemometrics, 16, 361-367, 2002. [71] Phatak, A., Reilly, P. M., and Pendilis, A., "The Asymptotic Variance of the Univariate PLS Estimator", Linear Algebra and Its Applications, 354, 245-253, 2002. [72] Rocke, D., "Robustness Properties of S-estimators of Multivariate Location and Shape in High Dimension", Annals of Statistics, 24, 1327-1345, 1996. [73] Rousseeuw, P. J. and Leroy, A. M., Robust Regression and Outlier Detection, John Wiley and Sons, New York, 1987. 118 [74] Rousseeuw P. J. and Van Driessen K., "A Fast Algorithm for the Minimum Covariance Determinant Estimator", Technometrics, 41, 212-223, 1999. 
[75] Serneels, S., Croux, C., and Van Espen, "Influence Properties of Partial Least Squares Regression", Chemometrics and Intelligent Laboratory Systems, 71, 13-20, 2004.

[76] Serneels, S., Croux, C., Filzmoser, P., and Van Espen, P.J., "Partial Robust M-Regression", Chemometrics and Intelligent Laboratory Systems, 79, 55-64, 2005.

[77] Sibson, R., "Studies in the Robustness of Multidimensional Scaling: Perturbation Analysis of Classical Scaling", Journal of the Royal Statistical Society: B, 41, 217-229, 1979.

[78] Stahel, W.A., Robust Estimation: Infinitesimal Optimality and Covariance Matrix Estimators, Ph.D. thesis, ETH, Zürich, 1981.

[79] Staudte, R. G. and Sheather, S. J., Robust Estimation and Testing, John Wiley and Sons, New York, 1990.

[80] Stone, M. and Brooks, R.J., "Continuum Regression: Cross-validated Sequentially Constructed Prediction Embracing Ordinary Least Squares, Partial Least Squares and Principal Components Regression", Journal of the Royal Statistical Society: B, 52, 237-269, 1990.

[81] Sundberg, R., "Continuum Regression and Ridge Regression", Journal of the Royal Statistical Society: B, 55, 653-659, 1993.

[82] Vanden Branden, K. and Hubert, M., "Robustness Properties of a Robust PLS Regression Method", Analytica Chimica Acta, 515, 229-241, 2004.

[83] Wakeling, I. N. and Macfie, H.J.H., "A Robust PLS Procedure", Journal of Chemometrics, 6, 189-198, 1992.

[84] Wangen, L.E. and Kowalski, B.R., "A Multiblock Partial Least Squares Algorithm for Investigating Complex Chemical Systems", Journal of Chemometrics, 3, 3-20, 1989.

[85] Wiklund, S., Nilsson, D., Eriksson, L., Sjöström, M., Wold, S., and Faber, K., "A Randomization Test for PLS Component Selection", Journal of Chemometrics, 21, 427-439, 2007.

[86] Wilcox, R. R., Introduction to Robust Estimation and Hypothesis Testing, 2nd Edition, Elsevier, 2005.

[87] Wold, H., "Estimation of Principal Components and Related Models by Iterative Least Squares", in Multivariate Analysis, Academic Press, New York, 391-420, 1966.

[88] Wold, H., "Path Models with Latent Variables: The NIPALS Approach", in Quantitative Sociology: International Perspectives on Mathematical and Statistical Model Building, Academic Press, 307-357, 1975.

[89] Wold, S., "Cross-validatory Estimation of the Number of Components in Factor and Principal Components Models", Technometrics, 20, 397-405, 1978.

[90] Woodruff, D. and Rocke, D., "Computable Robust Estimation of Multivariate Location and Shape in High Dimension Using Compound Estimators", Journal of the American Statistical Association, 89, 888-896, 1994.

Appendix

$X$ : $n \times p$ matrix of explanatory variables
$Y$ : $n \times q$ matrix of response variables
$y$ : $n \times 1$ vector of response variable
$a_j$ : $j$th column of a matrix $A$ (column vector)
$a_i$ : $i$th row of a matrix $A$ (column vector)
$A'$ : transpose of matrix $A$
$A^{\perp}$ : orthogonal complement of $A$
$A^{-1}$ : inverse of matrix $A$
$A^{+}$ : Moore-Penrose inverse of matrix $A$
$I_m$ : $m \times m$ identity matrix
$\mathbf{1}_m$ : $m \times 1$ vector of ones
$\|a\|$ : Euclidean norm of a vector $a$
$\propto$ : proportional to