Robust Methods for Functional Data Analysis

Except where reference is made to the work of others, the work described in this thesis is my own or was done in collaboration with my advisory committee. This thesis does not include proprietary or classified information.

Pallavi Sawant

Certificate of Approval:

Asheber Abebe, Associate Professor, Mathematics and Statistics
Nedret Billor (Chair), Associate Professor, Mathematics and Statistics
Hyejin Shin, Assistant Professor, Mathematics and Statistics
George T. Flowers, Dean, Graduate School

Robust Methods for Functional Data Analysis

Pallavi Sawant

A Thesis Submitted to the Graduate Faculty of Auburn University in Partial Fulfillment of the Requirements for the Degree of Master of Science

Auburn, Alabama
August 10, 2009

Permission is granted to Auburn University to make copies of this thesis at its discretion, upon the request of individuals or institutions and at their expense. The author reserves all publication rights.

Signature of Author / Date of Graduation

Vita

Pallavi Rajesh Sawant was born on the 9th of October, 1972 in Thane, Maharashtra, India. She graduated with a Bachelor of Science in Statistics from the University of Mumbai, Mumbai, Maharashtra, India in April 1993. She continued her Master's program at the University of Mumbai and was awarded a Master of Science in Statistics in 1995. Following her graduation she taught computer programming courses at computer institutes in Mumbai, India. She joined Auburn University in fall 2006 for the Master's program in the Department of Mathematics and Statistics.

Thesis Abstract

Robust Methods for Functional Data Analysis

Pallavi Sawant

Master of Science, August 10, 2009
(M.S., Mumbai University, 1995)
(B.S., Mumbai University, 1993)

91 Typed Pages

Directed by Nedret Billor

Functional data consist of functions or curves observed at a finite subset of an interval.
Each functional observation is a realization of a stochastic process. This thesis aims to develop suitable statistical methodologies for functional data analysis in the presence of outliers.

Statistical methodologies assume that functional data are homogeneous, but in reality they contain functional outliers, and exploratory methods in functional data analysis are sensitive to outliers. In this thesis we explore the effect of outliers on functional principal component analysis and propose a tool for identifying functional outliers by using robust functional principal components. This is done by means of robust multivariate principal component analysis. Diagnostic plots based on functional principal component analysis are also found to be useful for the identification and classification of functional outliers. An extensive simulation study is conducted to evaluate the performance of the proposed procedures, and a real dataset is employed to illustrate the goodness of the method.

In addition, regression diagnostics are discussed for a functional regression model in which the regressors are functional data, such as curves, and the response is a scalar. We propose a robust principal component based method for the estimation of the functional parameter in this type of functional regression model. Further, we introduce robust diagnostic measures for identifying influential observations. A real dataset is also used to illustrate the usefulness of the proposed robust measures for detecting influential observations.

Acknowledgments

I express my sincere gratitude to my advisor, Dr. Nedret Billor, who suggested this thesis and generously gave continued guidance and support during the course of this work. I highly appreciate her insights and valuable suggestions for this work, without which I could not have completed this thesis. My special thanks to Dr. Hyejin Shin, who served on my committee, for the many helpful discussions and advice.
I would also like to acknowledge my committee member Dr. Asheber Abebe for his valuable support. Many thanks to the faculty, staff and fellow students at the Department of Mathematics and Statistics for making my study at Auburn such a wonderful experience. I would like to express my special gratitude to my parents for their never-ending love and support throughout my life. I feel lucky to have Rajesh as my life partner, whose companionship and everlasting support and commitment during all these years of studying abroad have been a source of strength and encouragement. Finally, it is the grace and blessings of God that have helped me fulfill my aspirations.

Style manual or journal used: Journal of Approximation Theory (together with the style known as "aums"). Bibliography follows van Leunen's A Handbook for Scholars.

Computer software used: The document preparation package TeX (specifically LaTeX) together with the departmental style-file aums.sty.

Table of Contents

List of Figures
List of Tables
1 Introduction
2 Robust Functional Principal Component Analysis and Outlier Detection
  2.1 Introduction
  2.2 Classical and Robust Principal Component Analysis
    2.2.1 Classical Principal Component Analysis
    2.2.2 Robust Principal Component Analysis
  2.3 Classical and Robust Functional Principal Component Analysis and Outlier Detection
    2.3.1 Classical Functional Principal Component Analysis
    2.3.2 Robust Functional Principal Component Analysis and Outlier Detection
3 Numerical Examples
  3.1 Dataset: NOx Data
  3.2 Simulation
4 Robust Functional Principal Component Regression
  4.1 Introduction
  4.2 Estimation of Functional Parameter β
  4.3 Functional Regression Diagnostic Measures
  4.4 Numerical Example
5 Conclusion
Bibliography

List of Figures

1.1 Example of (a) Functional observation and (b) Functional dataset
2.1 (a) Different types of outliers when a 3-dimensional dataset is projected on a robust 2-dimensional PCA subspace. (b) Different types of outliers in the orthogonal-score distance plot (Hubert et al. [22])
3.1 Sample curves of NOx data by using (a) B-spline basis and (b) Fourier basis
3.2 Generalized cross-validation by using (a) B-spline basis and (b) Fourier basis for NOx data
3.3 Orthogonal-score plot for the whole sample by using B-spline basis computed with (a) CPCA, (b) ROBPCA, (c) BACONPCA
3.4 Orthogonal-score plot for the whole sample by using Fourier basis computed with (a) CPCA, (b) ROBPCA, (c) BACONPCA
3.5 Working days of NOx data by using (a) B-spline basis and (b) Fourier basis
3.6 Orthogonal-score plot for working days by using B-spline basis computed with (a) CPCA, (b) ROBPCA, (c) BACONPCA
3.7 Orthogonal-score plot for working days by using Fourier basis computed with (a) CPCA, (b) ROBPCA, (c) BACONPCA
3.8 Non-working days of NOx data by using (a) B-spline basis and (b) Fourier basis
3.9 Orthogonal-score plot for non-working days by using B-spline basis computed with (a) CPCA, (b) ROBPCA, (c) BACONPCA
3.10 Orthogonal-score plot for non-working days by using Fourier basis computed with (a) CPCA, (b) ROBPCA, (c) BACONPCA
3.11 Outliers detected by the proposed method for (a) Whole sample, (b) Working days, (c) Non-working days
3.12 Curves generated from model 1 (without contamination), model 2 (asymmetric contamination), model 3 (symmetric contamination), model 4 (partial contamination) and model 5 (peak contamination) with n=50, p=100, M=10 and q=0.1
3.13 Boxplots of norm when there is no contamination (0%) and symmetric contamination (5%, 10% and 15%) for high dimensional cases for CPCA (C), ROBPCA (R) and BACONPCA (B)
3.14 Boxplots of norm when there is no contamination (0%) and symmetric contamination (5%, 10% and 15%) for low dimensional cases for CPCA (C), ROBPCA (R) and BACONPCA (B)
4.1 Potential-Residual (P-R) plot
4.2 (a) Sample curves (X-data) of the Canadian data; (b) Boxplot of the log-precipitation (y-data)
4.3 Diagnostic plot of the Canadian dataset based on (a) Classical score plot; (b) Robust score plot
4.4 (a) Estimated beta function by robust and classical FPCA; (b) First eigenfunction and (c) Second eigenfunction by robust and classical FPCR
4.5 Fitted values against the residuals (a) Classical; (b) Robust
4.6 (a) Leverages; (b) Cook's influence measures; (c) Hadi's influence measures (Classical (o) and Robust (+))
4.7 P-R plot (a) Classical (o); (b) Robust (+); (c) Classical (o) and Robust (+)
4.8 Sample curves of the Canadian data
List of Tables

3.1 Outliers detected by three PCA methods. (.) denotes case number for the three datasets
3.2 Outliers detected by Febrero et al. [13] for the NOx data
3.3 Simulation results of the mean proportion of explained variability when there is no contamination (0%) and with symmetric contamination (5%, 10% and 15%) for low and high dimensional cases
3.4 Simulation results of the mean proportion of explained variability when there is no contamination (0%) and with asymmetric contamination (5%, 10% and 15%) for low and high dimensional cases
3.5 Simulation results of the mean proportion of explained variability when there is no contamination (0%) and with partial contamination (5%, 10% and 15%) for low and high dimensional cases
3.6 Simulation results of the mean proportion of explained variability when there is no contamination (0%) and with peak contamination (5%, 10% and 15%) for low and high dimensional cases
4.1 Names of the Canadian weather stations
4.2 Classical and robust influence measures for the Canadian weather stations

Chapter 1
Introduction

Functional Data Analysis (FDA), a comparatively new area in statistical modeling, has become more popular in recent years. FDA is an assemblage of different methods of statistical analysis for analyzing curves or functional data. The complexity of data generated and the large size of databases mandate the use of new tools for analysis such as FDA [32, 26]. Functional data analysis helps to extract additional information from densely sampled observations over time or space. In standard statistical methodology the focus is on the set of data vectors, whereas in FDA the focus is on the type of data structure, such as curves, shapes, images, or sets of functional observations.
In FDA, each observed curve is thought of as a single observation rather than a collection of individual observations. A curve can be regarded as an infinite-dimensional vector, whose dimensions may not be countable (Figure 1.1(a),(b)). In traditional statistical methodology, the usual data types are univariate and multivariate. A univariate dataset contains numbers as its observations, while a multivariate dataset contains vectors as its observations. A number is one-dimensional while a vector is multi-dimensional. Multivariate Data Analysis (MDA) is an extension of univariate data analysis, and FDA is an extension of multivariate analysis, where the random vectors are of infinite dimensions.

In a number of situations functional data can be treated as multivariate data. However, treating data directly as multivariate data may pose difficulty, for example when the design points are not the same across subjects. Direct multivariate treatment may not be possible in this case, which calls for the development of functional data analysis.

Figure 1.1: Example of (a) Functional observation and (b) Functional dataset

When each functional observation is sampled at the same set of design points, the resulting functional data may look like multivariate data. But functional data are in general different from multivariate data in the following aspects:

1. For a functional observation, the observed data are sampled from an underlying smooth function, whereas in a multivariate dataset there is no such structure for an observed vector.
2. The dimension of a functional observation is so large that it is regarded as a continuous function. This can be seen in Figure 1.1(a). This dimension is often larger than the sample size.
3. The time points can be different from one data point to another.

All of these different aspects necessitate the development of functional data analysis [39].
There are three advantages in treating data in functional form. First, representing data in functional form with a small number of parameters reduces its size considerably. Second, since FDA deals with continuous functions, information between observed points is not lost. For finite sets of observations, FDA first estimates functions from the observed data, and then discretizes each function at any suitable choice of time points for further analysis. The free choice of analyzed points is attractive when the observational points differ between subjects. Third, when particularly interesting features occur in some time interval, it is useful to place more closely spaced analysis points there.

In the functional data framework, the random variables are defined on a functional space. To model the population of these random functions we think of a functional data observation as a realization of a stochastic process, $X(t)$, $t \in T$, where $T$ is a bounded interval in $\mathbb{R}$.

Some mathematical concepts used for FDA are explained here. In FDA, we work with a functional Hilbert space $L^2$ (an inner product space) which is determined by an inner product $\langle x, y \rangle$ [25]. In finite dimensions, with $x = (x_1,\ldots,x_n)$ and $y = (y_1,\ldots,y_n)$, the Euclidean inner product is defined as $\langle x, y \rangle = \sum_{i=1}^{n} x_i y_i$, while for functions it is defined as $\langle x, y \rangle = \int_T x(t)y(t)\,dt$.

The functional sample descriptive statistics, where we have an $n$-dimensional subspace for the sample space, $x_i(t)$, $i = 1,\ldots,n$, are defined as:

$\bar{x}(t) = \frac{1}{n}\sum_{i=1}^{n} x_i(t)$, $t \in T$

$\widehat{\mathrm{Var}}_x(t) = \frac{1}{n-1}\sum_{i=1}^{n} [x_i(t) - \bar{x}(t)]^2$, $t \in T$

$\widehat{\mathrm{Cov}}_x(s,t) = \frac{1}{n-1}\sum_{i=1}^{n} [x_i(s) - \bar{x}(s)][x_i(t) - \bar{x}(t)]$, $s, t \in T$

$\widehat{\mathrm{Corr}}_x(s,t) = \dfrac{\widehat{\mathrm{Cov}}_x(s,t)}{\sqrt{\widehat{\mathrm{Var}}_x(s)\,\widehat{\mathrm{Var}}_x(t)}}$, $s, t \in T$.

The majority of statistical techniques used in traditional and functional data analysis assume that the dataset is free of outliers. However, outliers occur very frequently in real data. An outlier is defined as a data point appearing to be inconsistent with the rest of the data.
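For a sample of curves observed on a common grid, these descriptive statistics can be computed directly from the discretized values. The following is a minimal sketch in NumPy (the function name and variable names are ours, not from the thesis):

```python
import numpy as np

def functional_descriptives(X):
    """Pointwise descriptive statistics for a discretized functional
    sample X of shape (n, p): n curves observed on a common grid of
    p time points. Illustrative sketch, not the thesis code."""
    n = X.shape[0]
    mean = X.mean(axis=0)                      # sample mean function
    Xc = X - mean                              # centered curves
    cov = Xc.T @ Xc / (n - 1)                  # Cov_x(s, t), p x p
    var = np.diag(cov)                         # Var_x(t)
    corr = cov / np.sqrt(np.outer(var, var))   # Corr_x(s, t)
    return mean, var, cov, corr
```

With a dense grid these matrices are the discretized versions of the mean, variance, covariance and correlation functions above.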
Possible sources of outliers are errors in recording and measurement, incorrect distributional assumptions, unknown data structure, or a novel phenomenon [23]. In any statistical data analysis, investigation of outliers is important. Since traditional statistical methods are sensitive to outliers, the presence of outliers in a dataset makes estimators and statistical conclusions unreliable. In addition, the presence of outliers severely affects modeling and prediction.

High dimensional data occur naturally in some practical applications, such as studies involving image analysis and microarray datasets in genomic studies. In such applications the dimension p is greater than the sample size n. FDA and high-dimensionality are closely related, as functional data generally come in a discretized manner, so that a function $x_i$ in the sample is in fact given by $(x_i(t_1),\ldots,x_i(t_p))$. The high dimensionality problem has two distinct features: first, the dimension p depends on the discretization order, which is not given in advance and can be arbitrarily increased. Second, the data from the discretized functions are likely to be highly correlated, creating difficulty in the estimation of the covariance matrices.

Functional Principal Component Analysis (FPCA) is a useful statistical technique for understanding the structure of data and is an effective dimension reduction tool for functional data. FPCA aims to explain the covariance structure of the data by means of a small number of functional components. These functional components are linear combinations of the original variables, which gives better interpretation of the different sources of variation. The effectiveness of FPCA in data reduction thus makes it useful in the analysis of high dimensional data. In the presence of outliers, however, dimension reduction via FPCA would yield untrustworthy results, since FPCA is known to be sensitive to outliers.
The main contribution of this work is the construction of a method to detect outliers in functional data through robust FPCA. We have used the library developed by Ramsay and Silverman [31] to construct our code for functional data analysis. This thesis is organized as follows. Chapter 2 reviews functional principal component analysis and explores the sensitivity of FPCA to outliers. We also propose a functional outlier detection procedure based on robust multivariate techniques, and describe accompanying diagnostic plots that can be used to detect and classify possible outliers. Chapter 3 consists of numerical examples, a real dataset and a simulation study for the proposed procedure. Chapter 4 reviews the functional linear model with scalar response. In this chapter we propose a robust technique for parameter function estimation based on the robust functional principal components method. We also introduce robust diagnostic measures for identifying influential observations. Finally, in Chapter 5, we give the conclusions of this work.

Chapter 2
Robust Functional Principal Component Analysis and Outlier Detection

2.1 Introduction

In various areas such as chemometrics, biometrics, engineering, genetics, and e-commerce, data come from the observation of continuous phenomena over time or space, known as functional data. Due to the advancement of new techniques it is now possible to record a large number of variables simultaneously. In many applications the nature of these data is high dimensional, where the number of variables (p) is greater than the number of observations (n) (n ≪ p). The focus of researchers is on the analysis of such data due to the statistical problems that emerge when applying various statistical tools. Functional principal component analysis (FPCA) is a useful tool to reduce the dimension of functional data.
In functional data, the first step is to represent the data in a lower dimensional space in order to have better interpretation. This is done by performing FPCA to capture the main modes of variability of the data by means of a small number of components which are linear combinations of the original variables, allowing for better interpretation of the various sources of variation.

The sensitivity of the variance and the covariance matrix to irregular observations makes FPCA vulnerable to outliers, so that it may not capture the variation of the regular observations. Therefore data reduction based on FPCA becomes unreliable. This necessitates a robust FPCA method.

Locantore et al. [27] proposed a robust functional principal component analysis based on spherical (SPHER) and elliptical (ELL) PCA. However, these methods have two drawbacks. The first is that SPHER and ELL only estimate the principal components and not their eigenvalues. The second is that SPHER and ELL PCA are influenced by outliers [22]. Febrero et al. [13, 14] also proposed two methods for outlier detection that are based on the ideas of functional depths and distance measures.

The main contribution of our work is to construct a robust PCA method to achieve dimension reduction of the data and to develop tools for the detection of functional outliers. The outline of this chapter is as follows. In Section 2.2, a brief description of the classical principal component analysis (CPCA) method and robust PCA methods is given. The main concepts of PCA which are used in FPCA are discussed in Section 2.3. In this section, the outlier detection procedure via robust FPCA is also described.

2.2 Classical and Robust Principal Component Analysis

Principal Component Analysis (PCA) is used for understanding the structure of a multivariate dataset. PCA is a useful tool for data reduction, which is achieved by identifying the main modes of variability of a given dataset.
Unfortunately, if the data contain outliers then data reduction based on classical principal component analysis becomes unreliable. The goal of robust PCA methods is to obtain principal components that are not affected by outliers.

2.2.1 Classical Principal Component Analysis

In multivariate data the central notion is to find weight vectors $\gamma_j \in \mathbb{R}^p$ that maximize the variance of the projected data $x'\gamma_j$, subject to $\|\gamma_j\| = 1$ and orthogonality to the previously found weight vectors. These weight vectors are the eigenvectors of the sample covariance matrix, sorted in descending order of the corresponding eigenvalues.

2.2.2 Robust Principal Component Analysis

MCD-based PCA in low dimensions (n > p)

The minimum covariance determinant (MCD) estimator [35, 36] of multivariate location and covariance is popular for this case because of its high resistance to outliers. A fast algorithm (FAST-MCD) [37] is also available for its computation. In this algorithm, the initial MCD estimators, the mean $\hat{\mu}_0$ and the covariance matrix $\hat{\Sigma}_0$, are defined based on h observations (h < n). The covariance matrix of these h observations has the lowest determinant, and h should be at least $\lfloor (n+p+1)/2 \rfloor$. The MCD estimator can resist n−h outliers, and with this choice the MCD estimator has a breakdown value of (n−h+1)/n. The value of h is taken approximately between 0.5n and 0.75n: h ≈ 0.5n when there is 50% contamination, and h ≈ 0.75n when there is 25% contamination. When there is a smaller number of outliers, the value of h is increased for more precise estimates.

Reweighting is then done to increase the finite sample efficiency. Each data point receives weight 1 if its robust distance $RD(x_i) = \sqrt{(x_i - \hat{\mu}_0)'\hat{\Sigma}_0^{-1}(x_i - \hat{\mu}_0)}$ is $\le \sqrt{\chi^2_{p,0.975}}$, and weight 0 otherwise. For the observations with weight one, the reweighted MCD estimator is then defined as the classical mean $\hat{\mu}_M$ and covariance matrix $\hat{\Sigma}_M$. The robust loadings are the first $k_1$ eigenvectors of the MCD estimate $\hat{\Sigma}_M$, sorted in descending order of the eigenvalues [20, 21, 22].

ROBPCA for high dimensional data (p > n)

For data with high dimension (p > n), the MCD estimator cannot be used, because the covariance matrix of h < p observations is always singular and its determinant cannot be minimized. In this case the ROBPCA method suggested by Hubert et al. [20, 21, 22] is used on the $X_{n,p}$ data. The ROBPCA method is a combination of the projection pursuit technique (PP) [19] and MCD covariance estimation in lower dimensions. PP is used first to reduce the dimension. The MCD method is then applied in this low dimensional subspace to estimate the center and the scatter of the data.
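The reweighting step described above can be sketched as follows. The initial robust fit $(\hat{\mu}_0, \hat{\Sigma}_0)$ is taken as given (in practice it would come from FAST-MCD), and the $\chi^2$ cutoff is passed in as a plain number to keep the sketch dependency-free; the function and variable names are ours:

```python
import numpy as np

def reweighted_estimates(X, mu0, Sigma0, cutoff):
    """One reweighting step: points whose robust distance from the
    initial estimates (mu0, Sigma0) exceeds `cutoff` (typically
    sqrt of the 0.975 chi-square quantile with p df) get weight 0;
    the reweighted mean/covariance are the classical estimates of
    the remaining points. Sketch only -- the initial MCD fit itself
    is assumed given."""
    d = X - mu0
    rd = np.sqrt(np.einsum('ij,jk,ik->i', d, np.linalg.inv(Sigma0), d))
    keep = rd <= cutoff                      # weight 1 vs weight 0
    mu = X[keep].mean(axis=0)
    Xc = X[keep] - mu
    Sigma = Xc.T @ Xc / (keep.sum() - 1)
    # robust loadings: eigenvectors of Sigma, sorted by eigenvalue
    vals, vecs = np.linalg.eigh(Sigma)
    order = np.argsort(vals)[::-1]
    return mu, Sigma, vecs[:, order], vals[order]
```

The first $k_1$ columns of the returned eigenvector matrix then play the role of the robust loadings.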
Initial data preprocessing is done by applying a singular value decomposition of $X_{n,p}$. This results in a huge dimension reduction, as the data are represented using at most $n-1 = \mathrm{rank}(\tilde{X}_{n,p})$ variables without loss of information.

By applying PP, the high dimensional data points are projected onto many univariate directions. Then the robust center $\hat{\mu}_r$ and scale $\hat{\sigma}_r$ (based on the univariate MCD method) of the projected data points on every direction are computed. For each data point, the Stahel-Donoho outlyingness measure

$\mathrm{Outl}(z_i) = \max_{v} \dfrac{|z_i'v - \hat{\mu}_r|}{\hat{\sigma}_r}, \quad i = 1,\ldots,n,$

is used to form the h subset that has the smallest outlyingness. An optimal number $k_0 \le p$ of principal components is then selected from the covariance matrix of the final h subset, and the data are projected onto this $k_0$ dimensional subspace. Next the reweighted MCD estimator is used to compute the center and the scatter of the data points in this low-dimensional subspace. The dominant $k_1$ eigenvectors of this covariance matrix are the $k_1$ robust principal components, and the MCD location and covariance estimates serve as robust estimates of the location vector $\mu$ and covariance matrix $\Sigma$.

BACONPCA in low dimensions (n > p)

The Blocked Adaptive Computationally Efficient Outlier Nominators (BACON) algorithm developed by Billor et al. [2] is used for this robust procedure. BACON is a cost-effective, fast computational method with a high breakdown point, based on measuring robust distances from a basic subset which is free of outliers. The initial basic subset is derived algorithmically in one of two ways, by Mahalanobis distances or by Euclidean distances.

For BACONPCA in low dimensions, an initial dimension reduction of the mean centered data matrix $X_c$ is done by singular value decomposition (SVD):

$X_c = (X - 1_n\hat{\mu}') = UD\Gamma',$

where $\hat{\mu}$ is the classical mean vector, $D$ is a $p \times p$ diagonal matrix whose squared entries are the eigenvalues of $X_c'X_c$, $U'U = I_p = \Gamma'\Gamma$, and $\Gamma$ is the matrix of the eigenvectors corresponding to the eigenvalues of $X_c'X_c$.
$I_p$ is the $p \times p$ identity matrix. The next step is to obtain the score matrix, $Z = X_c\Gamma$. The robust mean, $\hat{\mu}_B$, and the robust variance-covariance matrix, $\hat{\Sigma}_B$, are computed from the clean observations obtained from the BACON algorithm. From the BACON based covariance matrix $\hat{\Sigma}_B$, the number of robust PCs is determined as $k_1$. $\Gamma_1$ is the matrix of the eigenvectors corresponding to the nonzero eigenvalues of $\hat{\Sigma}_B$. Finally, the robust score matrix, $Z_1 = (Z - 1_n\hat{\mu}_B')\Gamma_1$, is obtained [3].

BACONPCA in high dimensions (p > n)

In this case the BACONPCA method, suggested by Billor et al. [3], is used on the centered $X_c$ data. The mean centered data matrix $X_c$ is preprocessed by a singular value decomposition (SVD) based on the eigenvalues and eigenvectors of $X_cX_c'$ instead of $X_c'X_c$, since the decomposition of $X_cX_c'$ is much faster than that of $X_c'X_c$. Then the score matrix $Z = X_c\Gamma$ is obtained, where $\Gamma$ is the matrix of the eigenvectors corresponding to the eigenvalues of $X_cX_c'$.

Since the BACON and MCD methods are useful only when n > p, these methods cannot be used to determine clean observations of Z when n < p, because of the singularity of the covariance matrix. The Stahel-Donoho outlyingness measure is useful to determine a clean set of h observations of Z (Hubert et al. [22]). The high dimensional data points, $z_i$, are projected onto many univariate directions v. For every direction v, a robust center $\hat{\mu}_r$ and robust standard deviation $\hat{\sigma}_r$ (based on the univariate BACON method) are obtained for the projected observations, $z_i'v$ ($i = 1,\ldots,n$). The outlyingness measure based on these robust center and scale values can be defined as:

$\mathrm{Outl}(z_i) = \max_{v} \dfrac{|z_i'v - \hat{\mu}_r|}{\hat{\sigma}_r}, \quad i = 1,\ldots,n.$

This measure will detect the points which are outlying along the projection directions. It results in a clean set of h observations (h = 0.75n). For these h observations the mean, $\hat{\mu}_1$, and the scatter matrix, $\hat{\Sigma}_1$, of the Z matrix are obtained.
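A projection-pursuit approximation of this outlyingness measure can be sketched as follows. For brevity the median and MAD stand in for the univariate BACON (or MCD) location/scale estimates, and random unit directions replace an exhaustive search over v; names are ours:

```python
import numpy as np

def outlyingness(Z, n_dirs=500, seed=0):
    """Approximate Stahel-Donoho outlyingness: project the rows of Z
    onto many random unit directions and take, per point, the largest
    standardized distance from a robust center/scale of the projected
    values. Illustrative sketch of the measure in the text."""
    rng = np.random.default_rng(seed)
    n, p = Z.shape
    V = rng.standard_normal((p, n_dirs))
    V /= np.linalg.norm(V, axis=0)           # unit directions v
    P = Z @ V                                # projections z_i'v
    med = np.median(P, axis=0)               # robust center per direction
    mad = np.median(np.abs(P - med), axis=0) * 1.4826  # robust scale
    return np.max(np.abs(P - med) / mad, axis=1)
```

The h-subset is then simply the h points with the smallest outlyingness values.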
The spectral decomposition of $\hat{\Sigma}_1$ gives:

$\hat{\Sigma}_1 = \Gamma_1\Lambda_1\Gamma_1',$

where $\Gamma_1$ is the matrix of the eigenvectors of $\hat{\Sigma}_1$ and $\Lambda_1 = \mathrm{diag}(\lambda_1,\ldots,\lambda_p)$ is the diagonal matrix of its eigenvalues. Then we determine the number of principal components $k_0 < p$ to retain by using some technique, such as a scree plot. The data are then projected onto the subspace spanned by the first $k_0$ eigenvectors of the covariance matrix $\hat{\Sigma}_1$, that is,

$Z_2 = (Z_{n \times p} - 1_n\hat{\mu}_1')\Gamma_{p \times k_0},$

where $\Gamma_{p \times k_0}$ is the matrix of the first $k_0$ eigenvectors of $\hat{\Sigma}_1$. Next, the BACON algorithm is applied to find the mean vector, $\hat{\mu}_B$, and scatter matrix, $\hat{\Sigma}_B$, of the matrix $Z_2$. Based on the robust covariance matrix $\hat{\Sigma}_B$, the robust PCs are obtained as:

$Z_3 = (Z_2 - 1_n\hat{\mu}_B')\Gamma^{*}_{p \times k_1},$

where $\Gamma^{*}_{p \times k_1}$ is the matrix of eigenvectors corresponding to the first $k_1$ eigenvalues, determined by a selection criterion (e.g., a scree plot), of the robust BACON based covariance matrix $\hat{\Sigma}_B$ [3].

2.3 Classical and Robust Functional Principal Component Analysis and Outlier Detection

When the dataset is in the form of curves, the procedure for classical PCA can be generalized to functional principal component analysis (FPCA) to obtain the main modes of variability of the curves.

2.3.1 Classical Functional Principal Component Analysis

Instead of the variable values $x_{ij}$ used in PCA, functional values $x_i(t)$ are used in FPCA, so that the discrete index j of the multivariate context is replaced by a continuous index t. Unlike multivariate PCA, the components in functional PCA are functions rather than vectors, so summations over j are replaced by integrations over t.

Let $\{x(t),\, t \in T\}$ be a stochastic process, where $T$ is some index set which is a bounded interval on $\mathbb{R}$.
The principal component scores corresponding to the weight function $\gamma$ are generalized to an integral form,

$Z_i = \int \gamma(t) x_i(t)\,dt. \quad (2.2)$

The weight function $\gamma_j(t)$ is obtained by solving

$\max_{\langle \gamma_j, \gamma_m \rangle = I(j=m),\; m \le j}\; N^{-1}\sum_i \left( \int \gamma_j x_i \right)^2, \quad (2.3)$

or, equivalently, by solving the functional eigenequation

$\int \Gamma(s,t)\gamma(t)\,dt = \lambda\gamma(s), \quad \gamma \in L^2, \quad (2.4)$

where $\Gamma$ is the covariance function of $x(t)$. The sequence of eigenfunctions $\gamma_i$, $i = 1, 2, \ldots$, sorted with respect to the corresponding eigenvalues $\lambda_1 \ge \lambda_2 \ge \ldots$, solves the FPCA problem (2.3). The eigenequation is the same general equation as in PCA, except that $\gamma$ is now an eigenfunction rather than an eigenvector. There is a major difference between multivariate and functional eigenanalysis: in the multivariate case there are p (the number of variables) eigenvalue-eigenvector pairs, whereas in the functional case there are infinitely many (the number of functional values). In practice, the unknown covariance function $\Gamma$ needs to be estimated from the sample values $x_i(t)$, $1 \le i \le n$, where each $x_i(t)$ is observed on a discrete set of points $t = \{t_1,\ldots,t_p\}$ for finite p.

The FPCA problem can be represented in terms of a basis function approach, in which the first k basis functions of a basis $\{\phi_1,\ldots,\phi_k\}$ are used, where k is large enough that these functions are able to describe most of the features of the data. The basis is selected based on the nature of the data; for example, if the data are smooth and periodic then a Fourier basis might be ideal, while for data with a lot of local features B-splines might work better. Approximate each $x_i$ by:

$\hat{x}_i(t) = \sum_{K=1}^{k} c_{iK}\phi_K(t). \quad (2.5)$

We can express all n curves simultaneously by defining the vector-valued function x to have components $x_1, x_2, \ldots, x_n$ and the vector-valued function $\phi$ to have components $\phi_1,\ldots,\phi_k$:

$x = C\phi, \quad (2.6)$

where the coefficient matrix C is $n \times k$.
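Once each curve is reduced to its coefficient vector, the FPCA eigenproblem can be solved entirely in the coefficient space; the derivation that follows arrives at a symmetric eigenproblem involving the basis Gram matrix W. A sketch of that standard computation (our own code, not the thesis implementation):

```python
import numpy as np

def fpca_from_coefficients(C, W):
    """Solve the symmetric coefficient-space eigenproblem
    (1/n) W^{1/2} C'C W^{1/2} u = lambda u, then map back via
    b = W^{-1/2} u, so each column of B holds the basis coefficients
    of one eigenfunction. C: n x k coefficient matrix; W: k x k
    symmetric positive definite Gram matrix of the basis."""
    n = C.shape[0]
    w_vals, w_vecs = np.linalg.eigh(W)
    W_half = w_vecs @ np.diag(np.sqrt(w_vals)) @ w_vecs.T    # W^{1/2}
    W_mhalf = w_vecs @ np.diag(1 / np.sqrt(w_vals)) @ w_vecs.T
    M = W_half @ (C.T @ C / n) @ W_half
    lam, U = np.linalg.eigh(M)
    order = np.argsort(lam)[::-1]            # descending eigenvalues
    B = W_mhalf @ U[:, order]                # b = W^{-1/2} u
    return lam[order], B
```

With an orthonormal basis, W = I and this reduces to ordinary PCA of the coefficient matrix C, as noted in the text.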
In matrix terms the variance-covariance function is:

$\Gamma(s,t) = n^{-1}\phi(s)'C'C\phi(t). \quad (2.7)$

Define W as a symmetric matrix of order k,

$W = \int \phi\phi'. \quad (2.8)$

Suppose that the weight function $\gamma$ has the expansion

$\gamma(s) = \sum_K b_K\phi_K(s), \quad (2.9)$

or, in matrix notation, $\gamma(s) = \phi(s)'b$. Using equations (2.6)-(2.9), the left hand side of the eigenequation (2.4) becomes

$\int \Gamma(s,t)\gamma(t)\,dt = \int n^{-1}\phi(s)'C'C\phi(t)\phi(t)'b\,dt = \phi(s)'n^{-1}C'CWb.$

The eigenequation can then be written as:

$\phi(s)'n^{-1}C'CWb = \lambda\phi(s)'b. \quad (2.10)$

As this equation holds for all s, it can be written in matrix form as:

$n^{-1}C'CWb = \lambda b. \quad (2.11)$

Note that $\|\gamma\| = 1$ implies $b'Wb = 1$ and, similarly, two functions $\gamma_1$ and $\gamma_2$ are orthogonal if and only if the corresponding vectors of coefficients satisfy $b_1'Wb_2 = 0$. We define $u = W^{1/2}b$ to obtain the required principal components by solving the equivalent symmetric eigenvalue problem

$n^{-1}W^{1/2}C'CW^{1/2}u = \lambda u \quad (2.12)$

and computing $b = W^{-1/2}u$ for each eigenvector. If the basis is orthonormal, then $W = I$ and the functional PCA problem reduces to the standard multivariate PCA of the coefficient array C.

In this section, we examined FPCA as a dimension reduction tool. Although FPCA solves the dimensionality problem, it fails to deal with data containing outliers. In the next section a new method, the robust FPCA method, is given to overcome this problem.

2.3.2 Robust Functional Principal Component Analysis and Outlier Detection

In this section, an outlier detection method for functional data via robust FPCA is given. The aim of the robust FPCA method is to obtain functional principal components that are less influenced by outliers. The outlier detection method proposed by Febrero et al. [13] is discussed first, and then the construction of the robust FPCA method is described.

Outlier Detection using Likelihood Ratio Test

An outlier detection procedure for functional data using a Likelihood Ratio Test (LRT), based on a distance measure, was developed by Febrero et al. [13].
Let the functional sample be $x_1, \ldots, x_n$. The statistic is given as

$O_\alpha(x_i) = \Big\| \dfrac{x_i - \hat{\mu}_{TM,\alpha}}{\hat{\sigma}_{TSD,\alpha}} \Big\|, \qquad \Lambda = \max_{1 \le i \le n} O_\alpha(x_i)$,

where $\|\cdot\|$ is a norm in the functional space ($\|\cdot\|_1$, $\|\cdot\|_2$ or $\|\cdot\|_\infty$), $\hat{\mu}_{TM,\alpha}$ is the $\alpha$-trimmed mean and $\hat{\sigma}_{TSD,\alpha}$ is the $\alpha$-trimmed standard deviation. Hence $O_\alpha(x_i)$ is the distance between $x_i$ and $\hat{\mu}_{TM,\alpha}$ relative to $\hat{\sigma}_{TSD,\alpha}$. The presence of outliers is determined by comparing the test statistic $\Lambda$ with some threshold, combined with an iterative procedure.

Description of the Proposed Method

Consider the functions as processes in continuous time defined over an interval $T = [t_{min}, t_{max}]$. The $i$th replication of the functional observation is denoted $x_i(t) \in L^2[T]$, $i = 1, \ldots, n$. In practice it is impossible to observe the functional values in continuous time; we usually obtain data only on a finite, discrete grid $t = \{t_1, t_2, \ldots, t_p\} \subset T$ in the following manner:

$y_i = x_i(t) + \epsilon_i, \qquad 1 \le i \le n$,

where $\epsilon_i$ is a random error (noise) with zero mean and constant variance function $\sigma_i^2(t)$. For simplicity we assume that all processes are observed at the same time points, which are equally spaced on $T$. The function $x_i$ can be represented as a linear combination of the first $k$ basis functions $\phi_K$, $K = 1, \ldots, k$, where $k$ is large enough and $k < p$, using the basis expansion method given in Chapter 1:

$x_i(t) = \sum_{K=1}^{k} c_{iK}\, \phi_K(t)$,

where $\phi$ is the vector-valued function with components $\phi_1, \ldots, \phi_k$, and $C = [c_{iK}]$, $1 \le i \le n$, $1 \le K \le k$, is the $n \times k$ coefficient matrix of the expansion. The simultaneous expansion of all $n$ curves can be expressed in matrix notation as $x = C\phi$, where $x$ is a vector-valued function with components $x_i$, $1 \le i \le n$. We assume that the basis is orthonormal. To select the optimal number of basis functions $k$, the GCV criterion developed by Craven and Wahba [9] is used, as described in the following section.
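The distance statistic $O_\alpha$ above can be sketched on a discretized grid. Note that this is only a rough stand-in: the trimming here is done pointwise at each grid point for simplicity, whereas the procedure of Febrero et al. trims whole curves; all names are hypothetical.

```python
import numpy as np

def trimmed_outlyingness(X, alpha=0.1):
    """Rough sketch of O_alpha(x_i): compare each curve with an alpha-trimmed
    mean curve, scaled by an alpha-trimmed standard deviation curve, and
    approximate the L2 norm on the grid. X is an (n, p) array of curves
    evaluated on a common grid (an assumption of this illustration)."""
    n, p = X.shape
    m = int(np.floor(n * alpha))        # number of values trimmed per tail
    Xs = np.sort(X, axis=0)
    core = Xs[m:n - m]                  # trimmed sample at each grid point
    mu = core.mean(axis=0)              # trimmed mean curve
    sd = core.std(axis=0, ddof=1)       # trimmed standard deviation curve
    O = np.sqrt(np.mean(((X - mu) / sd) ** 2, axis=1))   # one value per curve
    return O, O.max()                   # statistics O_alpha(x_i) and Lambda
```

A curve shifted far from the bulk of the sample receives a large $O_\alpha$ value and drives the maximum $\Lambda$.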
Coefficient Estimation: For the partially observed functions the coefficients $c_{iK}$ are computed using the least squares approach, for $i = 1, \ldots, n$ and $K = 1, \ldots, k$, by minimizing

$\sum_{i=1}^{n} \Big[ y_i(t) - \sum_{K=1}^{k} c_{iK}\, \phi_K(t) \Big]^2 = (y - C\phi)'(y - C\phi) = \| y - C\phi \|^2$.

Since we deal with a basis that is orthonormal, the functional PCA problem reduces to the standard multivariate PCA of the coefficient matrix $C$ (see Section 2.3.1). Instead of dealing with FPCA directly, we apply classical PCA on $C$; applying PCA on $C$ provides equivalent information about the structure of the covariance function of the functional data $x(t)$, and outliers in $C$ are equivalent to outliers in the functional data $x(t)$. Therefore, the diagnostic plots developed to detect outliers using multivariate PCA methods can also be used to detect functional outliers. The orthogonal-score diagnostic plot [22], a by-product of the robust PCA method, is used for identification and classification of outliers. Using the PCA method we obtain robust scores $Z$ in the following manner:

$Z = C \cdot V$,

where $Z$ is an $n \times k_1$ matrix, $C$ is the $n \times k$ matrix of coefficients, $V$ is the $k \times k_1$ matrix of robust eigenvectors, and $k_1 \le k$. The selection criterion for the number of components $k_1$ is based on the eigenvalues, with a predetermined threshold of 90%: the optimal number of components $k_1$ is the minimal value for which the cumulative percentage of total variance is greater than or equal to 90%. Robust coefficients are obtained by transforming the scores back to the original coordinate system. The cutoff value for the score distances is $\sqrt{\chi^2_{k_1, 0.975}}$. For the cutoff value of the orthogonal distances the Wilson-Hilferty approximation for a $\chi^2$ distribution is used, whereby $Od^{2/3} \approx N(\mu, \sigma^2)$. Estimates $\hat{\mu}$ and $\hat{\sigma}^2$ are obtained using the univariate MCD. The cutoff value (the vertical line in the plot) equals $(\hat{\mu} + \hat{\sigma} z_{0.975})^{3/2}$, where $z_{0.975}$ is the 97.5% quantile of the standard normal distribution.

Figure 2.1: (a) Different types of outliers when a 3-dimensional dataset is projected on a robust 2-dimensional PCA subspace.
(b) Different types of outliers in the orthogonal-score distance plot (Hubert et al. [22]).

Chapter 3

Numerical Examples

In this chapter a real dataset and a simulation study are given to demonstrate the performance of the proposed method for outlier detection in FDA.

3.1 Dataset: NOx Data

The aim of our analysis is to illustrate the performance of the robust FPCA on the NOx data, which was used by Febrero et al. [13, 14]. The NOx emission levels collected by a control station near a power plant in Barcelona in 2005 are analyzed using techniques for functional data. The dataset consists of NOx levels ($\mu g/m^3$) measured every hour for the period February 23, 2005 to June 26, 2005. Only the NOx levels for 115 days are available, because observations are missing for several consecutive hours on some days. The NOx emission levels are displayed in Figure 3.1. The whole NOx dataset is divided into working days ($n_1 = 76$ curves) and non-working days ($n_2 = 39$ curves); non-working days are weekends and holidays during the given data period. To use functional data analysis it is essential first to convert the discrete data to functional form (i.e., continuous functions) by using basis functions. We utilize these three datasets 1) to explore the effect of different basis expansions (Fourier and B-spline) on outlier detection, and 2) to compare the proposed method for the detection of outliers with Febrero's results.

Figure 3.1: Sample curves of NOx data by using (a) B-spline basis and (b) Fourier basis

From Figure 3.1 we can say that NOx levels increase in the morning and reach a peak value around 8:00 am, then decrease until 2:00 pm, and increase again in the evening. Figures 3.5 and 3.8 exhibit the sample curves for working days and non-working days, respectively.
In Figures 3.1, 3.5 and 3.8, the group of curves shows the presence of a few trajectories that are in some way different from the rest.

NOx (Whole sample): The initial dimension of the dataset is $115 \times 24$. The GCV method is used to determine the optimal number of bases for the B-spline and Fourier bases; the resulting values of $k$ based on GCV are 12 for the B-spline basis (Figure 3.2(a)) and 10 for the Fourier basis (Figure 3.2(b)). The whole-sample data curves using 12 B-spline bases and 11 Fourier bases are shown in Figure 3.1(a) and (b), respectively. Since there is high correlation among the variables of the coefficient matrix, we apply CPCA, ROBPCA and BACONPCA on the coefficient matrix for dimension reduction and outlier detection.

Figure 3.2: Generalized cross-validation by using (a) B-spline basis and (b) Fourier basis for NOx data

For the B-spline and Fourier bases, six and five principal components were retained, respectively, for each of CPCA, ROBPCA and BACONPCA, yielding classical and robust explained-variance percentages of more than 90%. The resulting diagnostic plots for the three PCA methods using the B-spline and Fourier bases are given in Figures 3.3 and 3.4, respectively. All bad leverage points detected by these diagnostic plots (orthogonal-score plots) formed using the three PCA methods based on both bases are listed in Table 3.1.

NOx (Working Days): The initial dimension of the dataset is $76 \times 24$. The optimal number of bases based on GCV is $k = 12$ for the B-spline basis and $k = 10$ for the Fourier basis. The data curves for working days using 12 B-spline bases and 11 Fourier bases are given in Figure 3.5(a) and (b), respectively. Due to the high correlation among the variables in $C$, we apply CPCA, ROBPCA and BACONPCA on the coefficient matrix.
For the B-spline and Fourier bases, seven and five principal components were retained, respectively, for each of CPCA, ROBPCA and BACONPCA, yielding classical and robust explained-variance percentages of more than 90%.

Figure 3.3: Orthogonal-score plot for the whole sample by using B-spline basis computed with (a) CPCA, (b) ROBPCA, (c) BACONPCA

Figure 3.4: Orthogonal-score plot for the whole sample by using Fourier basis computed with (a) CPCA, (b) ROBPCA, (c) BACONPCA

Figure 3.5: Working days of NOx data by using (a) B-spline basis and (b) Fourier basis

The resulting diagnostic plots for the three PCA methods using these bases are displayed in Figures 3.6 and 3.7, respectively. All bad leverage points detected by these diagnostic plots (orthogonal-score plots) formed using the three PCA methods based on both bases are listed in Table 3.1.

NOx (Non-Working Days): The initial dimension of the dataset is $39 \times 24$. By the GCV method, the optimal numbers of bases are $k = 12$ for the B-spline basis and $k = 6$ for the Fourier basis.
The data curves for non-working days using 12 B-spline bases and 7 Fourier bases are given in Figure 3.8(a) and (b), respectively. We again apply CPCA, ROBPCA and BACONPCA on the coefficient matrix. For the B-spline and Fourier bases, five and three principal components were retained, respectively, for each of CPCA, ROBPCA and BACONPCA, yielding classical and robust explained-variance percentages of more than 90%.

Figure 3.6: Orthogonal-score plot for working days by using B-spline basis computed with (a) CPCA, (b) ROBPCA, (c) BACONPCA

Figure 3.7: Orthogonal-score plot for working days by using Fourier basis computed with (a) CPCA, (b) ROBPCA, (c) BACONPCA

Figure 3.8: Non-working days of NOx data by using (a) B-spline basis and (b) Fourier basis

The resulting diagnostic plots for the three PCA methods using the B-spline and Fourier bases, given in Figures 3.9 and 3.10, respectively, illustrate all types of outliers in this dataset.
All bad leverage points detected by these diagnostic plots (orthogonal-score plots) formed using the three PCA methods based on both bases are listed in Table 3.1. Figure 3.11 shows the outliers identified by the proposed method for the three datasets. Outliers detected by Febrero et al. [13] for the same dataset are also given in Table 3.2 for comparison purposes.

Figure 3.9: Orthogonal-score plot for non-working days by using B-spline basis computed with (a) CPCA, (b) ROBPCA, (c) BACONPCA

Figure 3.10: Orthogonal-score plot for non-working days by using Fourier basis computed with (a) CPCA, (b) ROBPCA, (c) BACONPCA

From Table 3.1 we conclude that both ROBPCA and BACONPCA detected similar outliers in the whole sample using the Fourier basis. These results match the results obtained by Febrero et al. [13] (see Table 3.2). However, we have detected
one additional outlier (03/09) using both the ROBPCA and BACONPCA methods. The orthogonal-score plot based on CPCA detects only one bad leverage point. For the whole sample using the B-spline basis, four similar outliers are detected by both ROBPCA and BACONPCA, while CPCA detected three outliers; the outliers detected by these three PCA methods using the B-spline basis do not conform with the outliers detected by Febrero et al. [13].

Figure 3.11: Outliers detected by the proposed method for (a) whole sample, (b) working days, (c) non-working days

For the working days dataset we obtained five similar outliers for both ROBPCA and BACONPCA using the Fourier basis. The outliers thus detected match the outliers detected by Febrero et al. [13] for the working days dataset using the Fourier basis, except that we have detected two additional outliers (05/02 and 03/09) (see Table 3.1). The CPCA method using the Fourier basis detected only one outlier. Under the B-spline basis, four outliers are detected by ROBPCA and five outliers by BACONPCA for the working days dataset; the results obtained by both PCA methods do not match those of Febrero et al. [13]. Here also CPCA detected only one outlier, converting bad leverage points detected by ROBPCA and BACONPCA into good leverage points.

In the non-working days dataset under the Fourier basis, ROBPCA detected two outliers and BACONPCA detected three. These results match the findings of Febrero et al. [13] except that BACONPCA detected one additional outlier (05/01). Under the Fourier basis CPCA failed to detect any outlier. Similarly, for the non-working days dataset using the B-spline basis, the CPCA method did not detect any outlier.
The outliers detected by ROBPCA and BACONPCA using the B-spline basis are different from those obtained by Febrero et al. [13].

After detecting the outliers, we checked for the sources of the abnormal values of these curves, hoping to explain the abnormally large NOx emissions on these particular days. We found that Friday, March 11 is the beginning of a weekend. Friday, March 18 and Saturday, March 19 are the beginning of the Easter vacation in Spain in 2005. Also Friday, April 29, Saturday, April 30, Sunday, May 1, and Monday, May 2 correspond to a long weekend. There is a sudden increase in traffic during these short vacation periods, so we conclude that the abnormal observations on these specific days can be attributed to the increase in traffic due to short vacation periods. We have also detected an outlier on Wednesday, March 9; high NOx emissions are recorded on March 9 after 8:00 pm.

Table 3.1: Outliers detected by the three PCA methods; (.) denotes the case number within each of the three datasets

                              Fourier basis                          B-spline basis
Dataset           CPCA       ROBPCA     BACONPCA      CPCA       ROBPCA     BACONPCA
Whole sample      05/02(59)  03/18(22)  03/18(22)     04/29(56)  04/29(56)  04/29(56)
                             04/29(56)  04/29(56)     03/11(16)  03/11(16)  03/11(16)
                             03/11(16)  03/11(16)     05/02(59)  05/02(59)  05/02(59)
                             05/02(59)  05/02(59)                03/16(20)  03/16(20)
                             03/09(15)  03/09(15)
Working days      05/02(38)  03/18(16)  03/18(16)     05/03(39)  04/29(37)  04/29(37)
                             04/29(37)  04/29(37)                03/11(12)  03/11(12)
                             03/11(12)  03/11(12)                05/02(38)  05/02(38)
                             05/02(38)  05/02(38)                03/09(11)  03/09(11)
                             03/09(11)  03/09(11)                           03/18(16)
Non-working days    --       04/30(20)  04/30(20)       --       04/30(20)  05/15(26)
                             03/19(07)  03/19(07)                03/19(07)
                                        05/01(21)

Table 3.2: Outliers detected by Febrero et al. [13] for the NOx data

Dataset            ||.||_1   ||.||_2   ||.||_inf
Whole sample       03/18     03/18     03/18
                   04/29     04/29     04/29
                   03/11     03/11     03/11
                   05/02     05/02
Working days       03/18     03/18     03/18
                   04/29     04/29     04/29
                   03/11     03/11     03/11
Non-working days   04/30     04/30     04/30
                   03/19     03/19     03/19
Since the observation on March 10 is missing and thus not included in the analysis, we could not pinpoint the reason behind the abnormal observation on March 9.

3.2 Simulation

A simulation study is conducted to compare the performance of ROBPCA and BACONPCA with classical PCA (CPCA) on the coefficient matrix. The simulation setting given by Fraiman and Muniz [12, 28], with a few changes, is used here. For the simulation we consider functional data $x_1, \ldots, x_n$ obtained as realizations from a stochastic process $X(\cdot)$ with continuous paths on the observation period $[t_{min}, t_{max}] = [0, 1]$. Curves are generated from different models: model 1 is generated without contamination, and several other models are generated with different types of contamination.

Model 1 (no contamination): $X_i(t) = g(t) + e_i(t)$, $1 \le i \le n$, where the model error term $e_i(t)$ is a stochastic Gaussian process with zero mean and covariance function $\gamma(s,t) = (1/2)(1/2)(0.9)^{|t-s|}$, and $g(t) = 4t$ with $t \in [0,1]$.

Model 2 (asymmetric contamination): $Y_i(t) = X_i(t) + c_i M$, $1 \le i \le n$, where $c_i$ is 1 with probability $q$ and 0 with probability $1-q$, and $M$ is the contamination size constant.

Model 3 (symmetric contamination): $Y_i(t) = X_i(t) + c_i \sigma_i M$, $1 \le i \le n$, where $c_i$ and $M$ are defined as in model 2 and $\sigma_i$ is a sequence of random variables, independent of $c_i$, taking the values 1 and -1 each with probability 1/2.

Model 4 (partial contamination): $Y_i(t) = X_i(t) + c_i \sigma_i M$ if $t \ge T_i$, and $Y_i(t) = X_i(t)$ if $t < T_i$, $1 \le i \le n$, where $T_i$ is a random number generated from a uniform distribution on $[0,1]$.

Model 5 (peak contamination): $Y_i(t) = X_i(t) + c_i \sigma_i M$ if $T_i \le t \le T_i + \ell$, and $Y_i(t) = X_i(t)$ if $t \notin [T_i, T_i + \ell]$, $1 \le i \le n$, where $\ell = 2/30$ and $T_i$ is a random number from a uniform distribution on $[0, 1-\ell]$.

Figure 3.12 exhibits curves simulated from these five models. For each model we generated 100 replications, with two settings each for low- and high-dimensional data.
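The five simulation models can be sketched as follows. This is an illustrative reimplementation, not the original simulation code; in particular, reading the covariance function as $(1/2)(1/2)(0.9)^{|t-s|} = 0.25\,(0.9)^{|t-s|}$, as well as the function and parameter names, are assumptions.

```python
import numpy as np

def simulate_curves(n=50, p=100, model=2, q=0.1, M=10, seed=0):
    """Sketch of models 1-5: Gaussian base curves with mean g(t) = 4t and
    covariance 0.25 * 0.9**|t-s| on [0, 1], plus the stated contaminations."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, p)
    cov = 0.25 * 0.9 ** np.abs(t[:, None] - t[None, :])
    E = rng.multivariate_normal(np.zeros(p), cov, size=n)
    X = 4.0 * t + E                              # model 1: no contamination
    if model == 1:
        return X
    c = rng.random(n) < q                        # contaminate with probability q
    sigma = rng.choice([-1.0, 1.0], size=n)      # random sign (models 3-5)
    if model == 2:                               # asymmetric shift
        X[c] += M
    elif model == 3:                             # symmetric shift
        X[c] += (sigma[c] * M)[:, None]
    elif model == 4:                             # contaminated for t >= T_i
        T = rng.random(n)
        mask = c[:, None] & (t[None, :] >= T[:, None])
        X += mask * (sigma * M)[:, None]
    elif model == 5:                             # peak on [T_i, T_i + 2/30]
        ell = 2.0 / 30.0
        T = rng.uniform(0.0, 1.0 - ell, size=n)
        mask = c[:, None] & (t[None, :] >= T[:, None]) & (t[None, :] <= (T + ell)[:, None])
        X += mask * (sigma * M)[:, None]
    return X
```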
For low-dimensional data we consider the settings 1) $n = 100$, $p = 12$, $k = 8$ and 2) $n = 50$, $p = 5$, $k = 4$. For high-dimensional data we also consider two settings, 1) $n = 50$, $p = 100$, $k = 51$ and 2) $n = 50$, $p = 500$, $k = 151$. For model 1 the contamination percentage is $q = 0$ and the contamination constant is $M = 0$. For each contaminated model (2, 3, 4 and 5) we considered several levels of contamination, $q = 5, 10, 15$ percent, and contamination constants $M = 10$ and $M = 25$. Classical PCA and the robust methods ROBPCA and BACONPCA are applied to the simulated functional data from the five models. Two quantitative measures of the goodness of the methods are considered. The first is the mean proportion of explained variability (MPV):

$MPV = \dfrac{1}{N} \sum_{m=1}^{N} \dfrac{\hat{\lambda}_1^m + \hat{\lambda}_2^m + \cdots + \hat{\lambda}_k^m}{\lambda_1^m + \lambda_2^m + \cdots + \lambda_k^m + \cdots + \lambda_p^m}$,

where $N = 100$ denotes the number of replications and $\lambda_j^m$ is the $j$th eigenvalue at the $m$th replication obtained from the covariance matrix of the coefficient matrix of the uncontaminated model. $\hat{\lambda}_j^m$ is the estimated value of $\lambda_j^m$ at the $m$th replication, obtained by applying classical or robust multivariate techniques to the coefficient matrix of the contaminated or uncontaminated model. In each setting, 90% of the variability is explained by the first three components, so the optimal value of the mean proportion of explained variability is 0.9 for both low- and high-dimensional data. The second quantitative measure is the norm of the difference between $\hat{\lambda}_1^m$ and $\lambda_1^m$, given as $\|\hat{\lambda}_1^m - \lambda_1^m\|$, where $\lambda_1^m$ is the largest eigenvalue obtained from the covariance matrix of the coefficient matrix of the uncontaminated model and $\hat{\lambda}_1^m$ is its estimate at the $m$th replication, obtained by applying classical or robust multivariate techniques to the coefficient matrix of the contaminated or uncontaminated model. The optimal value is zero or near zero. Model 1 is compared with models 2, 3, 4 and 5. The simulation results of the mean proportion of variability for the four comparisons are given in Tables 3.3 - 3.6.
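The MPV criterion can be sketched directly from its definition. This is an illustration with hypothetical names; it assumes the eigenvalues of each replication have already been computed and stored row-wise (one row per replication).

```python
import numpy as np

def mean_proportion_of_variability(est_eigs, ref_eigs, k):
    """Sketch of the MPV criterion: for each replication m, sum the first k
    estimated eigenvalues and divide by the total of all p reference
    (uncontaminated) eigenvalues, then average over the N replications.
    est_eigs, ref_eigs: (N, p) arrays of eigenvalues, largest first."""
    num = est_eigs[:, :k].sum(axis=1)     # estimated variability of first k PCs
    den = ref_eigs.sum(axis=1)            # total reference variability
    return float(np.mean(num / den))
```

When the estimated eigenvalues equal the reference ones, the result is simply the true proportion of variance carried by the first $k$ components; inflated estimates push the MPV above 1 (i.e., above 100%).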
It is clear that CPCA provides the best mean proportion of explained variability when there is no contamination in the data, which is expected; for uncontaminated data the robust methods also yield comparable results. However, when contamination is introduced to the data (models 2-5), the eigenvalues obtained with CPCA are overestimated, as the estimated MPV values are far larger than 100%. With ROBPCA and BACONPCA we obtain an MPV of about 90% for low-dimensional data both with and without contamination, and for high-dimensional data the mean percentage of explained variability is similarly about 90% with and without contamination. The main reason is that the optimal directions obtained by ROBPCA and BACONPCA are robust to outliers. CPCA clearly fails and provides the worst possible result, with a mean proportion of variability above 100%. It is also clear from Tables 3.3 - 3.5 that the MPV for ROBPCA at the 15% contamination level is above 100% in most cases except for model 5; BACONPCA gives better results than ROBPCA at 15% in both the low- and high-dimensional cases. From these results we deduce that BACONPCA and ROBPCA outperform CPCA.

Simulation results for the norm with $N = 100$ replications and different contamination levels for the comparison of model 1 vs. model 2 are summarized in Figures 3.13 - 3.14. For this comparison we used two high- and two low-dimensional settings with the two values of $M$ (10 and 25). The ideal value of the norm is very small or near zero. We conclude that the norm is near zero when there is no contamination for all methods, an indication that ROBPCA and BACONPCA are also effective for uncontaminated data. The norm for CPCA tends to increase as the contamination level increases. For contaminated data, the norms corresponding to the ROBPCA and BACONPCA methods attain minimum values near zero for both the high- and low-dimensional settings.
The comparisons of model 1 vs. models 3-5 for the low- and high-dimensional settings yielded results very similar to those observed in Figures 3.13-3.14, and are therefore not repeated here. We also conducted the simulation above with $M = 5$, but the results were not satisfactory, because in that case the outliers are not yet well separated from the regular data group; as soon as the contamination lies somewhat further away, the robust methods are able to distinguish the outliers. The results of the simulation for this case are therefore not reported in this thesis, since we aimed at distinguishing outliers from the regular points.

Table 3.3: Simulation results of the mean proportion of explained variability when there is no contamination (0%) and with symmetric contamination (5%, 10% and 15%) for low- and high-dimensional cases.

High dimension: n=50, p=100, k=51
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.943    0.790    0.912       0.944     0.801    0.914
5%              13.213   0.833    0.918       71.297    0.838    0.915
10%             21.955   0.874    0.910       121.020   0.866    0.917
15%             32.354   2.472    0.918       195.299   12.720   0.927

High dimension: n=50, p=500, k=151
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.949    0.793    0.907       0.948     0.785    0.909
5%              11.749   0.821    0.917       73.250    0.804    0.912
10%             21.131   0.842    0.908       126.076   0.840    0.910
15%             31.715   1.512    0.919       201.703   14.303   0.928

Low dimension: n=100, p=12, k=8
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.914    0.833    0.892       0.917     0.830    0.898
5%              10.441   0.856    0.901       55.600    0.852    0.897
10%             18.859   0.860    0.889       114.111   0.870    0.903
15%             30.120   1.707    0.899       169.042   0.885    0.896

Low dimension: n=50, p=5, k=4
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.955    0.801    0.913       0.956     0.812    0.918
5%              9.736    0.837    0.923       56.574    0.830    0.913
10%             18.992   0.866    0.910       107.322   0.875    0.922
15%             25.839   1.360    0.921       161.370   3.781    0.920

Table 3.4: Simulation results of the mean proportion of explained variability when there is no
contamination (0%) and with asymmetric contamination (5%, 10% and 15%) for low- and high-dimensional cases.

High dimension: n=50, p=100, k=51
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.943    0.815    0.921       0.944     0.801    0.914
5%              10.254   0.837    0.920       70.863    0.839    0.914
10%             20.233   0.862    0.919       115.722   0.869    0.914
15%             26.345   0.895    0.919       165.056   9.011    0.921

High dimension: n=50, p=500, k=151
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.948    0.785    0.909       0.948     0.785    0.909
5%              12.728   0.804    0.909       74.551    0.806    0.905
10%             20.080   0.849    0.910       120.222   0.847    0.911
15%             28.722   2.110    0.916       173.636   9.202    0.917

Low dimension: n=100, p=12, k=8
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.914    0.833    0.892       0.915     0.834    0.896
5%              9.856    0.846    0.888       57.283    0.847    0.897
10%             17.285   0.870    0.896       106.619   0.868    0.897
15%             24.770   0.887    0.914       152.289   3.153    0.901

Low dimension: n=50, p=5, k=4
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.953    0.800    0.899       0.954     0.824    0.923
5%              10.930   0.826    0.908       51.805    0.844    0.922
10%             16.986   0.853    0.907       105.488   0.873    0.920
15%             24.150   1.923    0.920       135.844   0.903    0.921

Table 3.5: Simulation results of the mean proportion of explained variability when there is no contamination (0%) and with partial contamination (5%, 10% and 15%) for low- and high-dimensional cases.
High dimension: n=50, p=100, k=51
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.944    0.793    0.913       0.942     0.781    0.911
5%              5.641    0.831    0.925       34.047    0.819    0.917
10%             10.920   0.865    0.910       60.144    0.843    0.905
15%             15.836   1.280    0.908       99.545    1.487    0.918

High dimension: n=50, p=500, k=151
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.948    0.785    0.909       0.949     0.822    0.911
5%              6.514    0.820    0.909       33.517    0.850    0.929
10%             11.222   0.861    0.920       69.062    0.872    0.922
15%             15.815   1.095    0.919       98.250    1.209    0.927

Low dimension: n=100, p=12, k=8
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.915    0.805    0.882       0.915     0.829    0.892
5%              5.792    0.830    0.885       30.162    0.842    0.895
10%             11.325   0.856    0.885       62.695    0.860    0.894
15%             14.734   0.855    0.873       90.655    1.119    0.897

Low dimension: n=50, p=5, k=4
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.955    0.819    0.910       0.956     0.816    0.917
5%              6.745    0.831    0.903       35.184    0.841    0.912
10%             11.983   0.856    0.918       65.041    0.870    0.929
15%             18.445   1.577    0.931       100.715   4.946    0.920

Table 3.6: Simulation results of the mean proportion of explained variability when there is no contamination (0%) and with peak contamination (5%, 10% and 15%) for low- and high-dimensional cases.
High dimension: n=50, p=100, k=51
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.945    0.802    0.922       0.944     0.812    0.916
5%              1.505    0.832    0.913       5.285     0.851    0.923
10%             1.792    0.882    0.904       7.427     0.871    0.917
15%             2.070    0.921    0.873       8.989     0.912    0.898

High dimension: n=50, p=500, k=151
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.949    0.822    0.911       0.948     0.785    0.909
5%              1.462    0.855    0.925       5.123     0.824    0.910
10%             1.752    0.878    0.902       7.006     0.860    0.900
15%             1.998    0.933    0.901       8.347     0.896    0.896

Low dimension: n=100, p=12, k=8
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.916    0.841    0.893       0.916     0.823    0.895
5%              1.703    0.852    0.893       6.506     0.834    0.896
10%             2.419    0.850    0.887       10.273    0.851    0.893
15%             2.917    0.862    0.889       15.164    0.871    0.899

Low dimension: n=50, p=5, k=4
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.954    0.799    0.901       0.954     0.814    0.898
5%              1.591    0.808    0.904       5.652     0.818    0.901
10%             2.456    0.811    0.903       9.499     0.828    0.897
15%             3.369    0.820    0.908       15.931    0.831    0.896

Figure 3.12: Curves generated from model 1 (without contamination), model 2 (asymmetric contamination), model 3 (symmetric contamination), model 4 (partial contamination) and model 5 (peak contamination) with n=50, p=100, M=10 and q=0.1.
Figure 3.13: Boxplots of the norm when there is no contamination (0%) and symmetric contamination (5%, 10% and 15%) for the high-dimensional cases, for CPCA (C), ROBPCA (R) and BACONPCA (B).

Figure 3.14: Boxplots of the norm when there is no contamination (0%) and symmetric contamination (5%, 10% and 15%) for the low-dimensional cases, for CPCA (C), ROBPCA (R) and BACONPCA (B).

Chapter 4

Robust Functional Principal Component Regression

4.1 Introduction

Recently researchers have paid more attention to functional linear models, in which the regressors and/or the response are of a functional nature, and have proposed several methods for estimating the functional parameter [10, 11, 15, 16, 30]. Functional regressors are infinite-dimensional in nature. The problem with the infinite dimensionality of the regressor is that it results in infinitely many sets of solutions, or suffers from multicollinearity. Therefore the first step in this regression setup is to reduce the dimension using FPCA, and then to regress the response onto the components obtained from FPCA. The presence of outliers or influential observations has a serious effect on the estimation and prediction of the functional linear model. In the presence of outliers, the decomposition of the classical covariance matrix is unreliable.
In such a situation, both the FPCA stage and the regression stage, together called functional principal component regression (FPCR), yield unreliable results. Influential observations in a given dataset can have a strong impact on an analysis, and removing these outlying or influential observations from the data may substantially affect the statistical inference. Functional versions of the diagnostic measures based on Cook's distance [8] were introduced by Chiou and Müller [7] and Shen and Xu [38] for models where the regressors are real or curves and the responses are functional. Recently, Febrero et al. [16] reviewed estimation based on the classical functional principal components method and then analyzed influence in the functional linear model with scalar response; they proposed three measures of influence by generalizing the measures proposed for the standard regression model by Cook [8] and Peña [29].

In this chapter we propose a robust functional principal component regression method which consists of two parts: first we apply a robust FPCA method on the regressors, and then we regress the response variable on the resulting scores, as discussed in Section 4.2. In Section 4.3 we propose robustified influence diagnostic measures to detect which observations have strong influence. In Section 4.4 the practical use of these measures is illustrated by means of a real dataset.

4.2 Estimation of the Functional Parameter $\beta$

The functional linear model with a scalar response is a regression model in which the regressor is a random curve and the response is a real random variable defined on the same probability space. We assume that $(X, y)$ is a pair of random variables, where $X = (X(t))$, $X \in L^2(T)$, $t \in T = [t_{min}, t_{max}] \subset \mathbb{R}$, and $y$ is a real random variable. For ease of computation we assume that both $X$ and $y$ are centered, i.e., $E[X(t)] = 0$ and $E[y] = 0$.
Assuming $E(\|X\|^2) < \infty$, the dependence between the scalar response $y$ and the functional random variable $X$ is written as

    $y = \langle X, \beta \rangle + \varepsilon = \int_T X(t)\,\beta(t)\,dt + \varepsilon$,    (4.1)

where $\langle \cdot,\cdot \rangle$ denotes the $L^2(T)$ inner product, $\beta$ is a square integrable function defined on $T$, and the error $\varepsilon$ is a real random variable with $E[\varepsilon] = 0$, $E[X(t)\varepsilon] = 0$ and finite variance $\sigma^2$. Suppose that a random sample of pairs $(X_i, y_i)$, $i = 1,\ldots,n$, is observed, where the $X_i$ and $y_i$ are realizations of $X$ and $y$, respectively.

The estimate of $\beta$ can be obtained by finding the $\beta$ that minimizes the residual sum of squares

    $\mathrm{RSS}(\beta) = \sum_{i=1}^{n} \big( y_i - \langle X_i, \beta \rangle \big)^2$.    (4.2)

Since $\beta$ is a functional parameter, this minimization has a high dimensionality problem; it can be handled by a principal component approach. The sample covariance operator of $X$, denoted by $\Gamma_X$, allows a spectral decomposition into orthonormal eigenfunctions $\gamma_1, \gamma_2, \ldots$ [33]. By Mercer's theorem, an orthogonal expansion of $\Gamma_X$ in $L^2$ is given by

    $\Gamma_X(s,t) = \sum_{K=1}^{\infty} \lambda_K\, \gamma_K(s)\, \gamma_K(t)$,    (4.3)

with ordered nonnegative eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0 = \lambda_{n+1} = \cdots$. The sequence of eigenvalue-eigenfunction pairs satisfies the eigenequation $\Gamma_X \gamma_K = \lambda_K \gamma_K$ for $K \geq 1$, with $\langle \gamma_K, \gamma_l \rangle = 1$ if $K = l$ and $\langle \gamma_K, \gamma_l \rangle = 0$ otherwise.

By the Karhunen-Loève expansion [1], the functional variables $X_i$ and the functional parameter $\beta$ can be written in terms of the eigenfunctions $\gamma_K$ in the following manner:

    $X_i = \sum_{K=1}^{\infty} \xi_{iK}\, \gamma_K$,    (4.4)

    $\beta = \sum_{K=1}^{\infty} \beta_K\, \gamma_K$,    (4.5)

where $\xi_{iK} = \langle X_i, \gamma_K \rangle$ (with $\xi_{iK} = 0$ for $K > n$) and $\beta_K = \langle \beta, \gamma_K \rangle$, for $i = 1,\ldots,n$ and $K = 1, 2, \ldots$. The eigenfunctions form an orthonormal basis of the functional space $L^2(T)$ [30].

Using these expansions of $X_i$ and $\beta$, the residual sum of squares in equation (4.2) can be written as

    $\mathrm{RSS}(\beta) = \sum_{i=1}^{n} \Big( y_i - \sum_{K=1}^{n} \xi_{iK}\beta_K \Big)^2$.    (4.6)

The dimension of $\beta$ in equation (4.6) is thus reduced from $\infty$ to $n$, but minimizing this expression would give a perfect fit of the response variable.
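For curves observed on a common, equally spaced grid, the decomposition (4.3) and the scores in (4.4) can be approximated by an eigendecomposition of the discretized sample covariance matrix. The following Python sketch illustrates this; the function name and the quadrature-weight convention are ours, and any FPCA implementation could be substituted:

```python
import numpy as np

def fpca(X, dt):
    """Empirical version of (4.3)-(4.4) on a grid: eigendecompose the
    discretized sample covariance operator of the centered curves X
    (an n x p array, one curve per row, grid spacing dt) and return
    eigenvalues lam_K, eigenfunctions gamma_K and scores xi_iK."""
    n, _ = X.shape
    Xc = X - X.mean(axis=0)              # enforce E[X(t)] = 0
    M = (Xc.T @ Xc) / n * dt             # Gamma_X on the grid (quadrature weight dt)
    lam, V = np.linalg.eigh(M)
    order = np.argsort(lam)[::-1]        # lambda_1 >= lambda_2 >= ...
    lam, V = lam[order], V[:, order]
    gamma = V / np.sqrt(dt)              # <gamma_K, gamma_K> = 1 in L2(T)
    scores = Xc @ gamma * dt             # xi_iK = <X_i, gamma_K>
    return lam, gamma, scores
```

With this normalization the sample variance of the $K$th score column equals $\lambda_K$ and the score columns are uncorrelated, which is exactly what the regression step below relies on.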
To avoid this problem, the alternative method proposed by Cardot et al. [4] to estimate $\beta$ is used. In this method $\beta_K = 0$ for $K \geq k_n + 1$, where $0 < k_n < n$ and $\lambda_{k_n} > 0$. The coefficients $\beta_K$, for $K = 1,\ldots,k_n$, are obtained by minimizing the residual sum of squares

    $\mathrm{RSS}(\beta_{1:k_n}) = \sum_{i=1}^{n} \Big( y_i - \sum_{K=1}^{k_n} \xi_{iK}\beta_K \Big)^2 = \| Y - \Xi_{(1:k_n)}\beta_{(1:k_n)} \|^2$,    (4.7)

where $Y = (y_1,\ldots,y_n)'$ is the $n \times 1$ response vector, $\beta_{(1:k_n)} = (\beta_1,\ldots,\beta_{k_n})'$ is a $k_n \times 1$ vector, and $\Xi_{(1:k_n)}$ is the $n \times k_n$ score matrix whose $K$th column is the vector $\xi_{\cdot K} = (\xi_{1K},\ldots,\xi_{nK})'$, the $K$th principal component score, which satisfies $\mathrm{Var}(\xi_{\cdot K}) = \lambda_K$ and $\widehat{\mathrm{cov}}(\xi_{\cdot K}, \xi_{\cdot m}) = 0$ for $K \neq m$. The least-squares estimate of $\beta_{(1:k_n)}$ is then given by

    $\hat{\beta}_{(1:k_n)} = (\Xi_{(1:k_n)}'\Xi_{(1:k_n)})^{-1}\Xi_{(1:k_n)}'Y$,

where $\Xi_{(1:k_n)}'\Xi_{(1:k_n)}$ is a $k_n \times k_n$ diagonal matrix whose $(K,K)$th element is $\xi_{\cdot K}'\xi_{\cdot K} = n\lambda_K$, with $\xi_{\cdot K}'\xi_{\cdot m} = 0$ for $K \neq m$, and $\Xi_{(1:k_n)}'Y$ is a $k_n \times 1$ vector whose $K$th element is $n\,\widehat{\mathrm{cov}}(\xi_{\cdot K}, Y)$. Hence

    $\hat{\beta}_{(1:k_n)} = \Big( \frac{\xi_{\cdot 1}'Y}{n\lambda_1}, \ldots, \frac{\xi_{\cdot k_n}'Y}{n\lambda_{k_n}} \Big)'$.    (4.8)

This expression defines the least-squares estimate of the slope $\beta$, denoted by $\hat{\beta}_{(k_n)}$, as follows:

    $\hat{\beta}_{(k_n)} = \sum_{K=1}^{k_n} \hat{\beta}_K\, \gamma_K = \sum_{K=1}^{k_n} \frac{\xi_{\cdot K}'Y}{n\lambda_K}\, \gamma_K = \sum_{K=1}^{k_n} \frac{\widehat{\mathrm{cov}}(\xi_{\cdot K}, Y)}{\lambda_K}\, \gamma_K$.    (4.9)

$\hat{\beta}_{(k_n)}$ is the estimator of the functional regression parameter obtained by classical functional principal component regression (CFPCR). Cardot et al. [4] showed that, under several conditions, $\langle X, \hat{\beta}_{(k_n)} \rangle$ converges in probability and almost surely to $\langle X, \beta \rangle$.

The principal components become rougher as $K$ increases, so $k_n$ acts as a smoothing parameter which has to be determined. There are many ways to determine the value of $k_n$; here the selection criterion is based on the eigenvalues, with a predetermined threshold of 90%: the optimal number of components $k_n$ is the minimal value for which the cumulative percentage of total variance is at least 90%.

If a dataset contains outliers, then the spectral decomposition of the covariance in (4.3) is unreliable, and consequently neither of the expansions (4.4) and (4.5) is reliable.
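The closed forms (4.8)-(4.9), together with the 90% rule for choosing $k_n$, can be sketched in a few lines of Python. The sketch (names ours) assumes centered data and the outputs of some FPCA step with $\mathrm{Var}(\xi_{\cdot K}) = \lambda_K$, as in (4.7):

```python
import numpy as np

def cfpcr(scores, lam, gamma, y, threshold=0.90):
    """CFPCR estimate (4.9): beta_hat = sum_{K<=kn} (xi_.K' y / (n lam_K)) gamma_K.
    scores, lam, gamma come from an FPCA of the (centered) curves; y is the
    centered scalar response. kn is the smallest K whose cumulative share of
    total variance reaches `threshold` (the 90% rule of the text)."""
    n = len(y)
    pos = lam[lam > 1e-12]                           # drop numerically zero eigenvalues
    frac = np.cumsum(pos) / pos.sum()
    kn = int(np.searchsorted(frac, threshold) + 1)   # minimal kn with frac >= threshold
    beta_k = scores[:, :kn].T @ y / (n * lam[:kn])   # coefficients (4.8)
    beta_hat = gamma[:, :kn] @ beta_k                # (4.9) evaluated on the grid
    return beta_hat, beta_k, kn
```

Because the score columns are uncorrelated with $\xi_{\cdot K}'\xi_{\cdot K} = n\lambda_K$, no matrix inversion is needed: each coefficient is a simple ratio, which is what makes the closed form (4.8) possible.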
To obtain a robust estimate of $\hat{\beta}_{(k_n)}$ we employ a robust principal components method (e.g., BACONPCA, Section 2.2.2) on $X$. This yields robust scores $\xi^{(r)}_{iK}$, $i = 1,\ldots,n$, $K = 1,\ldots$, and robust eigenfunctions $\gamma^{(r)}_K$, $K = 1,\ldots$. Using these robust values, the expansions in equations (4.4) and (4.5) can be written as

    $X^{(r)}_i = \sum_{K=1}^{\infty} \xi^{(r)}_{iK}\, \gamma^{(r)}_K$,    (4.10)

    $\beta^{(r)} = \sum_{K=1}^{\infty} \beta^{(r)}_K\, \gamma^{(r)}_K$.    (4.11)

Both expansions allow us to write the residual sum of squares in equation (4.7) as

    $\mathrm{RSS}(\beta^{(r)}_{1:k_n}) = \sum_{i=1}^{n} \Big( y_i - \sum_{K=1}^{k_n} \xi^{(r)}_{iK}\beta^{(r)}_K \Big)^2$.    (4.12)

Minimizing this residual sum of squares gives the robust estimate of the slope $\beta_{(k_n)}$, denoted by $\hat{\beta}^{(r)}_{(k_n)}$:

    $\hat{\beta}^{(r)}_{(k_n)} = \sum_{K=1}^{k_n} \hat{\beta}^{(r)}_K\, \gamma^{(r)}_K = \sum_{K=1}^{k_n} \frac{\xi^{(r)\prime}_{\cdot K} Y}{n \lambda^{(r)}_K}\, \gamma^{(r)}_K$.    (4.13)

$\hat{\beta}^{(r)}_{(k_n)}$ is the estimator of the functional regression parameter obtained by robust functional principal component regression (RFPCR).

4.3 Functional Regression Diagnostic Measures

The $i$th observation $(X_i, y_i)$ is called influential if its deletion would lead to a noticeable change in the regression parameter estimates. To detect the presence of influential observations or outliers, diagnostic measures defined for the ordinary linear regression model can be extended to the functional linear model with scalar response [16]. We start by defining robustified versions of residuals, fitted values and leverages, and then other diagnostic measures such as Cook's D [8] and Hadi's potential-residual measure [17] for functional data. As in the standard regression model, the fitted values and residuals are useful for defining measures of the influence of a single observation in the functional linear model with scalar response.

Fitted Values and Leverages

Let the fitted values of the response variable, denoted by $\hat{y}_i$, be obtained from equations (4.1) and (4.9) in the following manner:

    $\hat{y}_i = \langle X_i, \hat{\beta}_{(k_n)} \rangle = \sum_{K=1}^{k_n} \xi_{iK}\hat{\beta}_K = \sum_{K=1}^{k_n} \xi_{iK}\, \frac{\xi_{\cdot K}' Y}{n \lambda_K}$,    (4.14)

for $i = 1,\ldots,n$, which defines the $n \times 1$ vector of fitted values $\hat{Y} = (\hat{y}_1,\ldots,\hat{y}_n)'$. In matrix form, $\hat{Y} = H_{(k_n)} Y$, where $H_{(k_n)}$ is the $n \times n$ hat matrix, given by

    $H_{(k_n)} = \Xi_{(1:k_n)}(\Xi_{(1:k_n)}'\Xi_{(1:k_n)})^{-1}\Xi_{(1:k_n)}' = \tilde{\Xi}_{(1:k_n)}\tilde{\Xi}_{(1:k_n)}'$,

where $\tilde{\Xi}_{(1:k_n)}$ is the $n \times k_n$ matrix whose $K$th column is $\tilde{\xi}_{\cdot K} = \xi_{\cdot K}/\sqrt{n\lambda_K}$, so that

    $H_{(k_n)} = \frac{1}{n}\Big( \frac{\xi_{\cdot 1}\xi_{\cdot 1}'}{\lambda_1} + \cdots + \frac{\xi_{\cdot k_n}\xi_{\cdot k_n}'}{\lambda_{k_n}} \Big)$.    (4.15)

The diagonal elements of $H_{(k_n)}$ are the leverage values, denoted by $H_{(k_n),ii}$ and given by

    $H_{(k_n),ii} = \frac{1}{n}\Big( \frac{\xi_{i1}^2}{\lambda_1} + \cdots + \frac{\xi_{ik_n}^2}{\lambda_{k_n}} \Big)$,    (4.16)

where $0 \leq H_{(k_n),ii} \leq 1$ and $\mathrm{Trace}(H_{(k_n)}) = k_n$. The leverage can be used as a quick way to flag observations that are influential for prediction: the smaller the hat diagonal of the pair $(X_i, y_i)$, the better $y_i$ is predicted. If the leverage value $H_{(k_n),ii}$ of an observation $(X_i, y_i)$ exceeds $2k_n/n$, then that observation might be influential. However, an observation with a high leverage value is not necessarily influential. Observations with high leverage values are outliers in the $X$-space, but the converse is not necessarily true.

Residuals (Ordinary and other)

The residuals are defined as $e = Y - \hat{Y} = (I_n - H_{(k_n)})Y$, where $I_n$ is the $n \times n$ identity matrix. Using equation (4.1), the relationship between $\varepsilon$ and $e$ can be established in the following manner:

    $e = (I_n - H_{(k_n)})Y = \Xi_{(k_n+1:n)}\beta_{(k_n+1:n)} + (I_n - H_{(k_n)})\varepsilon$,    (4.17)

where $\Xi_{(k_n+1:n)}$ is the $n \times (n - k_n)$ matrix whose columns are the vectors $\xi_{\cdot K}$, for $K = k_n+1,\ldots,n$, and $\beta_{(k_n+1:n)} = (\beta_{k_n+1},\ldots,\beta_n)'$. Since $\varepsilon$ has zero mean and covariance $\sigma^2 I_n$, the vector of residuals $e$ conditional on $X_1,\ldots,X_n$ has mean $\Xi_{(k_n+1:n)}\beta_{(k_n+1:n)}$ and covariance $\sigma^2(I_n - H_{(k_n)})$. If $n$ is large, the term $\Xi_{(k_n+1:n)}\beta_{(k_n+1:n)}$ can be omitted [5, 18].
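Because $\Xi_{(1:k_n)}'\Xi_{(1:k_n)}$ is diagonal, the leverages (4.16) and the residuals (4.17) can be computed directly from the score matrix without any matrix inversion. A minimal sketch (function names are ours):

```python
import numpy as np

def leverages(scores, lam, kn):
    """Hat-matrix diagonal (4.16): H_ii = (1/n) sum_{K<=kn} xi_iK^2 / lam_K."""
    n = scores.shape[0]
    return (scores[:, :kn] ** 2 / lam[:kn]).sum(axis=1) / n

def residuals(scores, lam, y, kn):
    """Residuals e = (I - H_(kn)) Y, via the diagonal-score form of (4.8)."""
    n = len(y)
    beta_k = scores[:, :kn].T @ y / (n * lam[:kn])
    return y - scores[:, :kn] @ beta_k

# flagging rule from the text: an observation with
# leverage H_ii > 2*kn/n might be influential
```

Feeding these functions robust scores and eigenvalues instead of the classical ones gives the robust leverages and residuals used later in this section.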
Since $\mathrm{Trace}(I_n - H_{(k_n)}) = n - k_n$,

    $E[e'e \mid X_1,\ldots,X_n] = n\big( \beta_{k_n+1}^2\lambda_{k_n+1} + \cdots + \beta_n^2\lambda_n \big) + (n - k_n)\sigma^2$.    (4.18)

Therefore, the error variance $\sigma^2$ may be estimated by the functional residual variance estimate $\hat{s}^2$, given by

    $\hat{s}^2 = \frac{e'e}{n - k_n}$.    (4.19)

Internally studentized residual

The squared internally studentized residual is given by

    $r_i^2 = \frac{e_i^2}{\hat{s}^2 (1 - H_{(k_n),ii})}$.    (4.20)

Cook's D Measure

Cook's distance is the standardized difference between the estimates of $\beta$ with and without the observation $(X_i, y_i)$:

    $CP_i = \frac{(\hat{y} - \hat{y}_{(-i,k_n)})'(\hat{y} - \hat{y}_{(-i,k_n)})}{k_n \hat{s}^2}$,    (4.21)

where $\hat{y}_{(-i,k_n)}$ denotes the prediction of the response vector $y$ when the $i$th observation $(X_i, y_i)$ is excluded from the estimation. A high value of $CP_i$ indicates that the $i$th observation influences the estimated responses, in the sense that deleting it from the dataset alters the predicted values. Computing $CP_i$ from equation (4.21) requires $n$ standard linear regressions, each with one observation deleted. However, Cook's distance can also be written as a function of quantities related to the full dataset:

    $CP_i = \frac{e_i^2}{k_n \hat{s}^2 (1 - H_{(k_n),ii})} \cdot \frac{H_{(k_n),ii}}{1 - H_{(k_n),ii}} = \frac{r_i^2\, H_{(k_n),ii}}{k_n (1 - H_{(k_n),ii})}$,    (4.22)

where $r_i^2$ is the $i$th squared internally studentized residual. Observations with large residual values $r_i^2$ are called outliers.

Hadi's Potential-Residual Measure

This measure combines two quantities: $H_{(k_n),ii}$, which provides information about high-leverage points, and $r_i$, which contains information about outliers. Both high leverage points and outliers may influence the regression results and the conclusions based on them. High leverage points are likely to have small residuals, so examining residuals alone is not sufficient to find influential observations. Cook's measure is a multiplicative function of the residual and the leverage value. The drawback of a multiplicative function is that when one of the two components is small or near zero, it suppresses the other: if either component is very small, the multiplicative measure is also small.
Observations with large leverage values are likely to have small residuals, which results in small values of a multiplicative measure and may lead to the incorrect conclusion that these observations are not influential. An additive measure that is a function of the residual and the leverage value, by contrast, is large if either or both components are large. Instead of a multiplicative measure we therefore use the additive measure of influence suggested by Hadi [17, 6], defined as

    $HM_i^2 = \frac{k_n}{1 - H_{(k_n),ii}} \cdot \frac{d_i^2}{1 - d_i^2} + \frac{H_{(k_n),ii}}{1 - H_{(k_n),ii}}$,    (4.23)

$i = 1,\ldots,n$, where $d_i^2 = e_i^2 / e'e$ is the square of the $i$th normalized residual and $\sum d_i^2 = 1$. The first component is a function of the $i$th normalized residual. The second component is the ratio of $\mathrm{Var}(\hat{y}_i)$ to $\mathrm{Var}(e_i)$ and is known as the "potential". $HM_i^2$ is monotonically increasing in both the leverage value and the squared residual; therefore a large value of $HM_i^2$ may be due to a large value of $e_i$, a large value of $H_{(k_n),ii}$, or both. The additive measure can identify outliers in the $X$-space, in the $y$-space, or in both.

The cut-off value for $HM_i^2$ suggested by Hadi [17] is $\mathrm{mean}(HM_i^2) + c\,\sqrt{\mathrm{Var}(HM_i^2)}$. Since the mean and variance are nonrobust, the median and the median absolute deviation are used instead to estimate the cut-off value.

The Potential-Residual (P-R) plot can be used to distinguish between regular observations, outliers and leverage points (Figure 4.1).

[Figure 4.1: Potential-Residual (P-R) plot.]

This diagnostic plot is a scatter plot of $\frac{H_{(k_n),ii}}{1 - H_{(k_n),ii}}$ versus $\frac{k_n}{1 - H_{(k_n),ii}} \cdot \frac{d_i^2}{1 - d_i^2}$. Since the first component is known as the potential and the second component is a function of the residual, the plot is referred to as the Potential-Residual (P-R) plot. In this plot, high leverage points are located in the upper left area and observations with high prediction error are located in the lower right area [17].
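The quantities (4.19), (4.20), (4.22) and (4.23), together with the median/MAD cut-off just described, depend only on the residuals, the leverages and $k_n$; the same code therefore serves the robustified measures of the next section when fed robust residuals and leverages. A minimal sketch (names ours; the constant 1.4826 makes the MAD consistent at the normal, a standard choice the text does not spell out):

```python
import numpy as np

def influence_measures(e, h, kn):
    """Full-sample diagnostics of Section 4.3:
    s2   -- functional residual variance estimate (4.19)
    r2   -- squared internally studentized residuals (4.20)
    cook -- Cook's distance in its full-data form (4.22)
    hm2  -- Hadi's additive potential-residual measure (4.23)."""
    n = len(e)
    s2 = e @ e / (n - kn)
    r2 = e**2 / (s2 * (1.0 - h))
    cook = r2 * h / (kn * (1.0 - h))
    d2 = e**2 / (e @ e)                  # squared normalized residuals
    hm2 = kn * d2 / ((1.0 - h) * (1.0 - d2)) + h / (1.0 - h)
    return s2, r2, cook, hm2

def robust_cutoff(hm2, c=3.0):
    """Median/MAD cut-off for HM2, replacing the nonrobust
    mean + c*sqrt(var) rule of Hadi [17], as suggested in the text."""
    med = np.median(hm2)
    return med + c * 1.4826 * np.median(np.abs(hm2 - med))
```

The identity between the deletion form (4.21) and the full-data form (4.22) is the standard least-squares result, so no case-deletion refits are needed in practice.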
If the dataset contains outliers, then all the diagnostic measures defined above are sensitive to them. To obtain robust estimates of these diagnostic measures we apply a robust principal components method (e.g., Section 2.2.2) to $X$, which yields robust eigenfunctions $\gamma^{(r)}_K$, robust eigenvalues $\lambda^{(r)}_K$, $K = 1,\ldots$, and robust scores $\xi^{(r)}_{iK}$, $i = 1,\ldots,n$, $K = 1,\ldots$. Using these robust values, the expressions in equations (4.14) and (4.16) can be redefined. The robust fitted values are given by

    $\hat{y}^{(r)}_i = \langle X_i, \hat{\beta}^{(r)}_{(k_n)} \rangle = \sum_{K=1}^{k_n} \xi^{(r)}_{iK}\, \frac{\xi^{(r)\prime}_{\cdot K} Y}{n \lambda^{(r)}_K}$.    (4.24)

The robust leverages are defined as

    $H^{(r)}_{(k_n),ii} = \frac{1}{n}\Big( \frac{\xi^{(r)2}_{i1}}{\lambda^{(r)}_1} + \cdots + \frac{\xi^{(r)2}_{ik_n}}{\lambda^{(r)}_{k_n}} \Big)$.    (4.25)

The robust version of Cook's measure can then be defined as

    $CP^{(r)}_i = \frac{e^{(r)2}_i}{k_n \hat{s}^{(r)2} (1 - H^{(r)}_{(k_n),ii})} \cdot \frac{H^{(r)}_{(k_n),ii}}{1 - H^{(r)}_{(k_n),ii}}$,    (4.26)

where $e^{(r)}_i = y_i - \hat{y}^{(r)}_i$, $i = 1,\ldots,n$, is the $i$th robust residual and $\hat{s}^{(r)2}$ is computed from the robust residuals as in (4.19). The robust version of Hadi's measure is given by

    $HM^{2(r)}_i = \frac{k_n}{1 - H^{(r)}_{(k_n),ii}} \cdot \frac{d^{2(r)}_i}{1 - d^{2(r)}_i} + \frac{H^{(r)}_{(k_n),ii}}{1 - H^{(r)}_{(k_n),ii}}$,    (4.27)

where $d^{2(r)}_i = e^{(r)2}_i / e^{(r)\prime} e^{(r)}$ is the square of the $i$th normalized robust residual. Section 4.4 compares the classical and robustified versions of the diagnostic measures on a real dataset.

4.4 Numerical Example

In this section we demonstrate the practical use of the proposed influence measures on a dataset analyzed by Ramsay and Silverman [32]. The data come from the thirty-five Canadian weather stations listed in Table 4.1. The regressor set of curves consists of the mean daily temperatures (in degrees Celsius) and the response is the logarithm (base 10) of the total annual precipitation (in mm). The temperature is assumed to be periodic because of its cyclical behavior over the year; therefore we used Fourier basis functions.
Figure 4.2(a) shows the curves, built from 65 Fourier basis functions, while a boxplot of the log-precipitation data is shown in Figure 4.2(b). The curves are assumed to be observed on the interval $[t_{\min}, t_{\max}] = [0, 1]$; the time unit is one year, discretized into 365 days. For this analysis both the regressor and the response variable are centered by subtracting their means.

To select the cutoff value $k_n$ we estimate the coefficients by the least squares method and then compute the total variance. The optimal number of components is $k_n = 2$, as the cumulative percentage of total variance for two components is 96%.

Orthogonal-score plots for two components, constructed using CPCA and BACONPCA, are given in Figure 4.3. The orthogonal-score plot for CPCA indicates no bad leverage points, while BACONPCA reveals three bad leverage points (21, 11 and 10). The estimated beta functions from classical and robust FPCA and the two eigenfunctions from CFPCR and RFPCR are depicted in Figure 4.4. The estimated beta function and the first eigenfunction differ between the two methods, while the second eigenfunctions do not differ much. This clearly indicates that the classical method is not robust to outliers and that a robust method is needed to estimate the parameter function [20].

Table 4.2 contains the values of the diagnostic measures. Residual plots by the classical (Figure 4.5(a)) and robust (Figure 4.5(b)) methods show that observation 12 (Kamloops) is an outlier, as it has a high residual value compared to the other stations. From both figures we can see that the residual values are much larger for the robust method than for the classical method. The leverages, the diagonal elements of the hat matrix, indicate which stations might be influential a priori. We can clearly see that the leverage values for observations 21 (Resolute), 11 (Iqaluit) and 10 (Inuvik) computed by the robust method are larger than those from the classical method (Figure 4.6(a)). The influential observations detected by Cook's measure are 21 (Resolute) and 11 (Iqaluit) using the robust method, and only 21 (Resolute) using the classical method (Figure 4.6(b)); observations 21 and 11 have much higher $CP^{(r)}_i$ values than $CP_i$ values. The cut-off value for $HM_i^2$ is 0.43 and for $HM_i^{2(r)}$ is 0.46. The influential observations detected by Hadi's influence measure are 21 (Resolute) and 11 (Iqaluit) using the robust method and 21 (Resolute) using the classical method (Figure 4.6(c)); again, observations 21 and 11 have much higher $HM^{2(r)}_i$ values than $HM_i^2$ values.

Comparing Hadi's measure with Cook's measure for the classical method, we can conclude that Hadi's measure is superior at distinguishing influential observations from regular observations: observation 11 is clearly separated from the regular observations by Hadi's measure (Figure 4.6(c)) but not by Cook's measure (Figure 4.6(b)). Therefore the additive form used by Hadi's measure is more efficient than the multiplicative form used by Cook's measure.

Table 4.1: Names of the Canadian weather stations
1) Arvida    2) Bagottvi   3) Calgary   4) Charlott   5) Churchil   6) Dawson    7) Edmonton
8) Frederic  9) Halifax   10) Inuvik   11) Iqaluit   12) Kamloops  13) London   14) Montreal
15) Ottawa  16) Princeal  17) Princege 18) Princeru  19) Quebec    20) Regina   21) Resolute
22) Scheffer 23) Sherbroo 24) Stjohns  25) Sydney    26) Thepas    27) Thunderb 28) Toronto
29) Uraniumc 30) Vancouvr 31) Victoria 32) Whitehor  33) Winnipeg  34) Yarmouth 35) Yellowkn
The P-R plot for the classical method (Figure 4.7(a)) indicates that the five observations with high values of $HM_i^2$ can be classified as follows: observation 21 is a high leverage point, observation 12 is an outlier, and observations 11, 18 and 31 are combinations of both. The P-R plot for the robust method indicates that the three observations with large $HM_i^{2(r)}$ values can be classified as follows: observation 21 is a high leverage point, observation 12 is an outlier, and observation 11 is a combination of both (Figure 4.7(b)). The P-R plot thus helps to separate outliers, leverage points and regular observations. Comparing the classical and robust P-R plots (Figure 4.7(c)), we find that observations 21, 12 and 11 yield much larger values under the robust method than under the classical method. The observations identified as bad leverage points in the orthogonal score plots (Figure 4.3), and those identified as outliers and leverage points in the P-R plots by both methods (Figure 4.7), are highlighted in Figure 4.8.
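The visual classification read off the P-R plots can also be mimicked numerically. The following sketch is our own illustration, not part of the thesis: it computes the two P-R coordinates and flags points that are extreme on one axis, the other, or both, using per-axis median/MAD cut-offs (the cut-off rule and the labels are ours):

```python
import numpy as np

def pr_classify(e, h, kn, c=3.0):
    """Sketch of the P-R plot logic: the residual axis is
    kn/(1-h) * d^2/(1-d^2) and the potential axis is h/(1-h).
    Points extreme on one axis only are outliers or leverage points;
    points extreme on both are combinations."""
    d2 = e**2 / (e @ e)
    res = kn * d2 / ((1.0 - h) * (1.0 - d2))   # residual component
    pot = h / (1.0 - h)                        # potential component

    def cut(x):                                # median + c * MAD cut-off per axis
        med = np.median(x)
        return med + c * 1.4826 * np.median(np.abs(x - med))

    hi_r, hi_p = res > cut(res), pot > cut(pot)
    return np.where(hi_r & hi_p, "combination",
           np.where(hi_r, "outlier",
           np.where(hi_p, "leverage", "regular")))
```

In the Canadian weather example this kind of rule would put Kamloops on the residual side, Resolute on the potential side, and Iqaluit in between, matching the visual reading of Figure 4.7.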
Table 4.2: Classical and robust influence measures for the Canadian weather stations

  n    e_i    e_i^(r)   H_ii    H_ii^(r)   CP_i   CP_i^(r)   HM2_i   HM2_i^(r)
  1    0.16    0.11     0.01     0.01      0.01    0.00      0.02     0.01
  2    0.19    0.16     0.01     0.01      0.01    0.01      0.01     0.01
  3   -0.30   -0.33     0.01     0.00      0.01    0.01      0.02     0.01
  4    0.13    0.09     0.01     0.01      0.00    0.00      0.01     0.01
  5    0.06    0.23     0.07     0.15      0.00    0.24      0.07     0.19
  6    0.03    0.08     0.10     0.12      0.00    0.02      0.12     0.13
  7   -0.14   -0.16     0.00     0.00      0.00    0.00      0.00     0.00
  8    0.15    0.08     0.01     0.01      0.00    0.00      0.01     0.01
  9    0.18    0.13     0.02     0.02      0.01    0.01      0.02     0.02
 10   -0.03    0.16     0.10     0.20      0.00    0.16      0.12     0.24
 11    0.05    0.32     0.17     0.34      0.01    1.62      0.20     0.53
 12   -0.57   -0.71     0.04     0.05      0.21    0.63      0.22     0.28
 13    0.02   -0.08     0.02     0.03      0.00    0.00      0.02     0.03
 14    0.09   -0.01     0.03     0.04      0.00    0.00      0.03     0.04
 15    0.09   -0.01     0.03     0.04      0.00    0.00      0.03     0.04
 16   -0.08   -0.10     0.04     0.04      0.00    0.01      0.04     0.04
 17   -0.13   -0.16     0.01     0.01      0.00    0.00      0.01     0.01
 18    0.26    0.25     0.15     0.14      0.18    0.25      0.18     0.17
 19    0.24    0.18     0.01     0.01      0.01    0.01      0.02     0.01
 20   -0.18   -0.25     0.04     0.04      0.02    0.05      0.04     0.04
 21   -0.26    0.15     0.41     0.77      1.03    7.38      0.71     3.42
 22    0.27    0.41     0.04     0.10      0.05    0.45      0.05     0.14
 23    0.18    0.13     0.01     0.00      0.00    0.00      0.01     0.00
 24    0.14    0.16     0.08     0.08      0.02    0.05      0.08     0.09
 25    0.17    0.15     0.03     0.03      0.01    0.02      0.04     0.03
 26   -0.01   -0.02     0.04     0.04      0.00    0.00      0.04     0.04
 27    0.04    0.02     0.00     0.00      0.00    0.00      0.00     0.00
 28   -0.07   -0.17     0.02     0.03      0.00    0.02      0.02     0.03
 29    0.01    0.06     0.07     0.08      0.00    0.01      0.07     0.09
 30   -0.11   -0.21     0.10     0.09      0.02    0.10      0.11     0.10
 31   -0.26   -0.33     0.12     0.11      0.15    0.31      0.14     0.13
 32   -0.30   -0.26     0.00     0.02      0.01    0.02      0.02     0.02
 33   -0.01   -0.08     0.06     0.06      0.00    0.01      0.06     0.07
 34    0.05    0.00     0.05     0.05      0.00    0.00      0.05     0.05
 35   -0.07    0.00     0.08     0.10      0.01    0.00      0.08     0.11

(Here $e_i$ are the residuals, $H_{ii} = H_{(k_n),ii}$ the leverages, $CP_i$ Cook's measure and $HM_i^2$ Hadi's measure; the superscript $(r)$ marks the robust versions.)

[Figure 4.2: (a) Sample curves (X-data) of the Canadian data; (b) boxplot of the log-precipitation (y-data).]

[Figure 4.3: Diagnostic plots of the Canadian dataset: (a) classical score plot; (b) robust score plot (BACONPCA).]

[Figure 4.4: (a) Estimated beta function by robust and classical FPCA; (b) first eigenfunction and (c) second eigenfunction by robust and classical FPCR.]

[Figure 4.5: Fitted values against the residuals: (a) classical; (b) robust.]
[Figure 4.6: (a) Leverages; (b) Cook's influence measures; (c) Hadi's influence measures (classical (o) and robust (+)).]

[Figure 4.7: P-R plots: (a) classical (o); (b) robust (+); (c) classical (o) and robust (+).]

[Figure 4.8: Sample curves of the Canadian data.]

Chapter 5

Conclusion

Robust PCA based functional data analysis has been developed for dimension reduction. This method can also be used to detect functional outliers and to classify these outliers into three classes. An extensive simulation study was conducted and a real dataset was used to assess the performance of the robust FPCA.
From this simulation study, based on different contamination configurations (symmetric, asymmetric, partial and peak), we concluded that robust PCA based functional data analysis yields better results than CPCA based functional data analysis.

In addition, we have developed a robust FPCR method and constructed robustified influence regression diagnostic measures. The practical use of robust FPCR and the diagnostic measures was illustrated by means of a real data example. In view of this example, we concluded that the proposed robust FPCR method performs much better than classical FPCR, and that the robustified influence measures are more useful diagnostic tools for detecting heterogeneity in the functional regression model with scalar response.

As future work, different methods to estimate the beta function can be explored, and a simulation study can be conducted to assess the performance of robust FPCR in the presence of outliers under a variety of scenarios. The optimal number of basis functions is an important issue in FDA; since the least squares method is sensitive to outliers, the choice of the number of basis functions is also affected by outliers, and a robust version of this selection criterion can be developed.

Bibliography

[1] Ash, R. B. and Gardner, M. F., Topics in Stochastic Processes. New York: Academic Press, 1975.
[2] Billor, N., Hadi, A. S. and Velleman, P. F., "BACON: blocked adaptive computationally efficient outlier nominators", Computational Statistics and Data Analysis, 34, 279-298, 2000.
[3] Billor, N., Kiral, G. and Turkmen, A., "Outlier Detection Using Principal Components", Twelfth International Conference on Statistics, Combinatorics, Mathematics and Applications, Auburn, 2005. (Unpublished manuscript.)
[4] Cardot, H., Ferraty, F. and Sarda, P., "Functional linear model", Statistics and Probability Letters, 45, 11-22, 1999.
[5] Cardot, H., Ferraty, F. and Sarda, P., "Spline estimators for the functional linear model", Statistica Sinica, 13, 571-591, 2003.
[6] Chatterjee, S. and Hadi, A. S., Sensitivity Analysis in Linear Regression. New York: Wiley, 1988.
[7] Chiou, J. M. and Müller, H. G., "Diagnostics for functional regression via residual processes", Computational Statistics and Data Analysis, 51, 4849-4863, 2006.
[8] Cook, R. D., "Detection of influential observations in linear regression", Technometrics, 19, 15-18, 1977.
[9] Craven, P. and Wahba, G., "Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation", Numerische Mathematik, 31, 377-403, 1979.
[10] Cuevas, A., Febrero, M. and Fraiman, R., "Linear functional regression: the case of fixed design and functional response", Canadian Journal of Statistics, 30, 285-300, 2002.
[11] Faraway, J. J., "Regression analysis for a functional response", Technometrics, 39, 254-261, 1997.
[12] Fraiman, R. and Muniz, G., "Trimmed means for functional data", Test, 10, 419-440, 2001.
[13] Febrero, M., Galeano, P. and González-Manteiga, W., "A functional analysis of NOx levels: location and scale estimation and outlier detection", Computational Statistics, 22, 411-427, 2007.
[14] Febrero, M., Galeano, P. and González-Manteiga, W., "Outlier detection in functional data by depth measures, with application to identify abnormal NOx levels", Environmetrics, 19, 331-345, 2008.
[15] Febrero, M., Galeano, P. and González-Manteiga, W., "Influence in the functional linear model with scalar response", in Functional and Operatorial Statistics, edited by Sophie Dabo-Niang and Frédéric Ferraty. Heidelberg: Physica-Verlag, 1st edition, 165-171, 2008.
[16] Febrero, M., Galeano, P. and González-Manteiga, W., "Measures of influence for the functional linear model with scalar response", Journal of Multivariate Analysis, in press, corrected proof, available online 25 December 2008.
[17] Hadi, A. S., "A new measure of overall potential influence in linear regression", Computational Statistics and Data Analysis, 14, 1-27, 1992.
[18] Hall, P. and Hosseini-Nasab, M., "On properties of functional principal components analysis", Journal of the Royal Statistical Society, Series B, 68, 109-126, 2006.
[19] Huber, P. J., "Projection pursuit", Annals of Statistics, 13, no. 2, 435-525, 1985.
[20] Hubert, M. and Verboven, S., "A robust PCR method for high-dimensional regressors", Journal of Chemometrics, 17, 438-452, 2003.
[21] Hubert, M. and Engelen, S., "Robust PCA and classification in biosciences", Bioinformatics, 20, 1728-1736, 2004.
[22] Hubert, M., Rousseeuw, P. J. and Branden, K. V., "ROBPCA: a new approach to robust principal component analysis", Technometrics, 47, no. 1, 64-79, 2005.
[23] Iglewicz, B. and Hoaglin, D. C., How to Detect and Handle Outliers. Milwaukee, WI: American Society for Quality Control, 1993.
[24] Jolliffe, I. T., Principal Component Analysis. New York: Springer-Verlag, 1986.
[25] Kolmogorov, A. N. and Fomin, S. V., Introductory Real Analysis [English translation]. Englewood Cliffs, NJ: Prentice-Hall, 1968.
[26] Levitin, D. J., Nuzzo, R. L., Vines, B. W. and Ramsay, J. O., "Introduction to functional data analysis", Canadian Psychology, 48, no. 3, 135-155, 2007.
[27] Locantore, N., Marron, J. S., Simpson, D. G., Tripoli, N., Zhang, J. T. and Cohen, K. L., "Robust principal component analysis for functional data", TEST, 8, no. 1, 1-73, 1999.
[28] López-Pintado, S. and Romo, J., "Depth based inference for functional data", Computational Statistics and Data Analysis, 51, no. 10, 4957-4968, 2007.
[29] Peña, D., "A new statistic for influence in linear regression", Technometrics, 47, 1-12, 2005.
[30] Ramsay, J. O. and Dalzell, C. J., "Some tools for functional data analysis", Journal of the Royal Statistical Society, Series B, 53, no. 3, 539-572, 1991.
[31] Ramsay, J. O. and Silverman, B. W., Functional Data Analysis Software, MATLAB edition. Online at http://www.psych.mcgill.ca/faculty/ramsay/software.html, 2001.
[32] Ramsay, J. O. and Silverman, B. W., Functional Data Analysis, second edition. New York: Springer-Verlag, 2005.
[33] Rice, J. A. and Silverman, B. W., "Estimating the mean and covariance structure nonparametrically when the data are curves", Journal of the Royal Statistical Society, Series B, 53, 233-243, 1991.
[34] Roman, D. R., "A Tool for Functional Data Analysis and Experimentation", M.S. thesis, University of Puerto Rico, 2005.
[35] Rousseeuw, P. J., "Least median of squares regression", Journal of the American Statistical Association, 79, 871-880, 1984.
[36] Rousseeuw, P. J., "Multivariate estimation with high breakdown point", in Mathematical Statistics and Applications, Vol. B, edited by W. Grossmann et al. Dordrecht: Reidel, 283-297, 1985.
[37] Rousseeuw, P. J. and Van Driessen, K., "A fast algorithm for the minimum covariance determinant estimator", Technometrics, 41, 212-223, 1999.
[38] Shen, Q. and Xu, H., "Diagnostics for linear models with functional responses", Technometrics, 49, 26-33, 2007.
[39] Zhang, J.-T., "Smoothed Functional Data Analysis", Ph.D. dissertation, The University of North Carolina at Chapel Hill, 1999.