Robust Methods for Functional Data Analysis

Except where reference is made to the work of others, the work described in this thesis is my own or was done in collaboration with my advisory committee. This thesis does not include proprietary or classified information.

Pallavi Sawant

Certificate of Approval:

Asheber Abebe, Associate Professor, Mathematics and Statistics
Nedret Billor (Chair), Associate Professor, Mathematics and Statistics
Hyejin Shin, Assistant Professor, Mathematics and Statistics
George T. Flowers, Dean, Graduate School

Robust Methods for Functional Data Analysis

Pallavi Sawant

A Thesis Submitted to the Graduate Faculty of Auburn University in Partial Fulfillment of the Requirements for the Degree of Master of Science

Auburn, Alabama
August 10, 2009

Permission is granted to Auburn University to make copies of this thesis at its discretion, upon the request of individuals or institutions and at their expense. The author reserves all publication rights.

Signature of Author / Date of Graduation

Vita

Pallavi Rajesh Sawant was born on the 9th of October, 1972 in Thane, Maharashtra, India. She graduated with a Bachelor of Science in Statistics from the University of Mumbai, Mumbai, Maharashtra, India in April 1993. She continued her Master's program at the University of Mumbai and was awarded a Master of Science in Statistics in 1995. Following her graduation she taught computer programming courses at computer institutes in Mumbai, India. She joined Auburn University in fall 2006 for the Master's program in the Department of Mathematics and Statistics.

Thesis Abstract

Robust Methods for Functional Data Analysis

Pallavi Sawant

Master of Science, August 10, 2009
(M.S., Mumbai University, 1995)
(B.S., Mumbai University, 1993)

91 Typed Pages

Directed by Nedret Billor

Functional data consist of functions or curves observed at a finite subset of an interval.
Each functional observation is a realization of a stochastic process. This thesis aims to develop suitable statistical methodologies for functional data analysis in the presence of outliers.

Statistical methodologies assume that functional data are homogeneous, but in reality they contain functional outliers, and exploratory methods in functional data analysis are sensitive to outliers. In this thesis we explore the effect of outliers on functional principal component analysis and propose a tool for identifying functional outliers by using robust functional principal components. This is done by means of robust multivariate principal component analysis. Diagnostic plots based on functional principal component analysis are also found to be useful for the identification and classification of functional outliers. An extensive simulation study is conducted to evaluate the performance of the proposed procedures, and a real dataset is employed to illustrate the goodness of the method.

In addition, regression diagnostics are discussed for a functional regression model in which the regressors are functional data, such as curves, and the response is a scalar. We propose a robust principal component based method for the estimation of the functional parameter in this type of functional regression model. Further, we introduce robust diagnostic measures for identifying influential observations. A real dataset is also used to illustrate the usefulness of the proposed robust measures for detecting influential observations.

Acknowledgments

I express my sincere gratitude to my advisor, Dr. Nedret Billor, who suggested this thesis and generously gave continued guidance and support during the course of this work. I highly appreciate her insights and valuable suggestions for this work, without which I could not have completed this thesis. My special thanks to Dr. Hyejin Shin, who served on my committee, for the many helpful discussions and advice.
I would also like to acknowledge my committee member Dr. Asheber Abebe for his valuable support. Many thanks to the faculty, staff and fellow students at the Department of Mathematics and Statistics for making my study at Auburn such a wonderful experience. I would like to express my special gratitude to my parents for their never-ending love and support throughout my life. I feel lucky to have Rajesh as my life partner, whose companionship and everlasting support and commitment during all these years of studying abroad have been a source of strength and encouragement. Finally, it is the grace and blessings of God that have helped me fulfill my aspirations.

Style manual or journal used: Journal of Approximation Theory (together with the style known as "aums"). Bibliography follows van Leunen's A Handbook for Scholars.

Computer software used: The document preparation package TeX (specifically LaTeX) together with the departmental style-file aums.sty.

Table of Contents

List of Figures
List of Tables
1 Introduction
2 Robust Functional Principal Component Analysis and Outlier Detection
  2.1 Introduction
  2.2 Classical and Robust Principal Component Analysis
    2.2.1 Classical Principal Component Analysis
    2.2.2 Robust Principal Component Analysis
  2.3 Classical and Robust Functional Principal Component Analysis and Outlier Detection
    2.3.1 Classical Functional Principal Component Analysis
    2.3.2 Robust Functional Principal Component Analysis and Outlier Detection
3 Numerical Examples
  3.1 Dataset: NOx Data
  3.2 Simulation
4 Robust Functional Principal Component Regression
  4.1 Introduction
  4.2 Estimation of Functional Parameter β
  4.3 Functional Regression Diagnostic Measures
  4.4 Numerical Example
5 Conclusion
Bibliography

List of Figures

1.1 Example of (a) Functional observation and (b) Functional dataset
2.1 (a) Different types of outliers when a 3-dimensional dataset is projected on a robust 2-dimensional PCA subspace. (b) Different types of outliers in the orthogonal-score distance plot (Hubert et al. [22])
3.1 Sample curves of NOx data by using (a) B-spline basis and (b) Fourier basis
3.2 Generalized cross-validation by using (a) B-spline basis and (b) Fourier basis for NOx data
3.3 Orthogonal-score plot for the whole sample by using B-spline basis computed with (a) CPCA, (b) ROBPCA, (c) BACONPCA
3.4 Orthogonal-score plot for the whole sample by using Fourier basis computed with (a) CPCA, (b) ROBPCA, (c) BACONPCA
3.5 Working days of NOx data by using (a) B-spline basis and (b) Fourier basis
3.6 Orthogonal-score plot for working days by using B-spline basis computed with (a) CPCA, (b) ROBPCA, (c) BACONPCA
3.7 Orthogonal-score plot for working days by using Fourier basis computed with (a) CPCA, (b) ROBPCA, (c) BACONPCA
3.8 Non-working days of NOx data by using (a) B-spline basis and (b) Fourier basis
3.9 Orthogonal-score plot for non-working days by using B-spline basis computed with (a) CPCA, (b) ROBPCA, (c) BACONPCA
3.10 Orthogonal-score plot for non-working days by using Fourier basis computed with (a) CPCA, (b) ROBPCA, (c) BACONPCA
3.11 Outliers detected by the proposed method for (a) Whole sample, (b) Working days, (c) Non-working days
3.12 Curves generated from model 1 (without contamination), model 2 (asymmetric contamination), model 3 (symmetric contamination), model 4 (partial contamination) and model 5 (peak contamination) with n=50, p=100, M=10 and q=0.1
3.13 Boxplots of norm when there is no contamination (0%) and symmetric contamination (5%, 10% and 15%) for high dimensional cases for CPCA (C), ROBPCA (R) and BACONPCA (B)
3.14 Boxplots of norm when there is no contamination (0%) and symmetric contamination (5%, 10% and 15%) for low dimensional cases for CPCA (C), ROBPCA (R) and BACONPCA (B)
4.1 Potential-Residual (P-R) plot
4.2 (a) Sample curves (X-data) of the Canadian data; (b) Boxplot of the log-precipitation (y-data)
4.3 Diagnostic plot of the Canadian dataset based on (a) Classical score plot; (b) Robust score plot
4.4 (a) Estimated beta function by robust and classical FPCA; (b) First eigenfunction and (c) Second eigenfunction by robust and classical FPCR
4.5 Fitted values against the residuals (a) Classical; (b) Robust
4.6 (a) Leverages; (b) Cook's influence measures; (c) Hadi's influence measures (Classical (o) and Robust (+))
4.7 P-R plot (a) Classical (o); (b) Robust (+); (c) Classical (o) and Robust (+)
4.8 Sample curves of the Canadian data
List of Tables

3.1 Outliers detected by three PCA methods. (.) denotes case number for the three datasets
3.2 Outliers detected by Febrero et al. [13] for the NOx data
3.3 Simulation results of the mean proportion of explained variability when there is no contamination (0%) and with symmetric contamination (5%, 10% and 15%) for low and high dimensional cases
3.4 Simulation results of the mean proportion of explained variability when there is no contamination (0%) and with asymmetric contamination (5%, 10% and 15%) for low and high dimensional cases
3.5 Simulation results of the mean proportion of explained variability when there is no contamination (0%) and with partial contamination (5%, 10% and 15%) for low and high dimensional cases
3.6 Simulation results of the mean proportion of explained variability when there is no contamination (0%) and with peak contamination (5%, 10% and 15%) for low and high dimensional cases
4.1 Names of the Canadian weather stations
4.2 Classical and robust influence measures for the Canadian weather stations

Chapter 1
Introduction

Functional Data Analysis (FDA), a comparatively new area in statistical modeling, has become more popular in recent years. FDA is an assemblage of different methods of statistical analysis for analyzing curves or functional data. The complexity of data generated and the large size of databases mandate the use of new tools for analysis such as FDA [32, 26]. Functional data analysis helps to extract additional information from densely sampled observations over time or space. In standard statistical methodology the focus is on the set of data vectors, whereas in FDA the focus is on the type of data structure, such as curves, shapes, images, or sets of functional observations.
In FDA, each observed curve is thought of as a single observation rather than a collection of individual observations. A curve can be regarded as an infinite-dimensional vector, whose dimensions may not be countable (Figure 1.1(a),(b)). In traditional statistical methodology, the usual data types are univariate and multivariate. A univariate dataset contains numbers as its observations, while a multivariate dataset contains vectors as its observations. A number is one-dimensional while a vector is multi-dimensional. Multivariate Data Analysis (MDA) is an extension of univariate data analysis, and FDA is an extension of multivariate analysis, where the random vectors are of infinite dimensions.

In a number of situations functional data can be treated as multivariate data. However, treating data directly as multivariate data may pose difficulty, for example when the design points are not the same across subjects. Direct multivariate treatment may not be possible in this case, which calls for the development of functional data analysis.

Figure 1.1: Example of (a) Functional observation and (b) Functional dataset

When each functional observation is sampled at the same set of design points, the resulting functional data may look like multivariate data. But functional data are in general different from multivariate data in the following aspects:

1. For a functional observation, the observed data are sampled from an underlying smooth function, whereas in a multivariate dataset there is no such structure for an observed vector.
2. The dimension of a functional observation is so large that it is regarded as a continuous function. This can be seen in Figure 1.1(a). This dimension is often larger than the sample size.
3. The time points can be different from one data point to another.

All of these different aspects necessitate the development of functional data analysis [39].
There are three advantages in treating data in functional form. First, representing data in functional form with a small number of parameters reduces its size considerably. Second, since FDA deals with continuous functions, information between observed points is not lost. For finite sets of observations, FDA first estimates functions from the observed data, and then discretizes each function at any suitable choice of time points for further analysis. The free choice of analyzed points is attractive when the observational points differ between subjects. Third, when particularly interesting features occur in some time interval, it is useful to place more closely spaced analysis points there.

In the functional data framework, the random variables are defined on a functional space. To model the population of these random functions we think of a functional data observation as a realization of a stochastic process, $X(t)$, $t \in T$, where $T$ is a bounded interval in $\mathbb{R}$.

Some mathematical concepts used for FDA are explained here. In FDA, we work with a functional Hilbert space $L^2$ (an inner product space) which is determined by an inner product $\langle x, y \rangle$ [25]. In finite dimensions, with $x = (x_1,\ldots,x_n)$ and $y = (y_1,\ldots,y_n)$, the Euclidean inner product is defined as $\langle x, y \rangle = \sum_{i=1}^{n} x_i y_i$, while for functions it is defined as $\langle x, y \rangle = \int_T x(t)y(t)\,dt$.

The functional sample descriptive statistics, where we have an $n$-dimensional subspace for the sample space, $x_i(t)$, $i = 1,\ldots,n$, are defined as:

$\bar{x}(t) = \frac{1}{n}\sum_{i=1}^{n} x_i(t)$, $t \in T$

$\widehat{\mathrm{Var}}_x(t) = \frac{1}{n-1}\sum_{i=1}^{n} [x_i(t) - \bar{x}(t)]^2$, $t \in T$

$\widehat{\mathrm{Cov}}_x(s,t) = \frac{1}{n-1}\sum_{i=1}^{n} [x_i(s) - \bar{x}(s)][x_i(t) - \bar{x}(t)]$, $s, t \in T$

$\widehat{\mathrm{Corr}}_x(s,t) = \dfrac{\widehat{\mathrm{Cov}}_x(s,t)}{\sqrt{\widehat{\mathrm{Var}}_x(s)\,\widehat{\mathrm{Var}}_x(t)}}$, $s, t \in T$.

The majority of statistical techniques used in traditional and functional data analysis assume that the dataset is free of outliers. However, outliers occur very frequently in real data. An outlier is defined as a data point appearing to be inconsistent with the rest of the data.
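For a sample of curves observed on a common grid, these descriptive statistics can be computed directly from the discretized values. The following is a minimal sketch in NumPy (the function name and variable names are ours, not from the thesis):

```python
import numpy as np

def functional_descriptives(X):
    """Pointwise descriptive statistics for a discretized functional
    sample X of shape (n, p): n curves observed on a common grid of
    p time points. Illustrative sketch, not the thesis code."""
    n = X.shape[0]
    mean = X.mean(axis=0)                      # sample mean function
    Xc = X - mean                              # centered curves
    cov = Xc.T @ Xc / (n - 1)                  # Cov_x(s, t), p x p
    var = np.diag(cov)                         # Var_x(t)
    corr = cov / np.sqrt(np.outer(var, var))   # Corr_x(s, t)
    return mean, var, cov, corr
```

With a dense grid these matrices are the discretized versions of the mean, variance, covariance and correlation functions above.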
Possible sources of outliers are errors in recording and measurement, incorrect distributional assumptions, unknown data structure, or a novel phenomenon [23]. In any statistical data analysis, investigation of outliers is important. Since traditional statistical methods are sensitive to outliers, the presence of outliers in a dataset makes estimators and statistical conclusions unreliable. In addition, the presence of outliers severely affects modeling and prediction.

High dimensional data occur naturally in some practical applications, such as studies involving image analysis and microarray datasets in genomic studies. In such applications the dimension p is greater than the sample size n. FDA and high-dimensionality are closely related, as functional data generally come in a discretized manner, so that a function $x_i$ in the sample is in fact given by $(x_i(t_1),\ldots,x_i(t_p))$. The high dimensionality problem has two distinct features: first, the dimension p depends on the discretization order, which is not given in advance and can be arbitrarily increased. Second, the data from the discretized functions are likely to be highly correlated, creating difficulty in the estimation of the covariance matrices.

Functional Principal Component Analysis (FPCA) is a useful statistical technique for understanding the structure of data and is an effective dimension reduction tool for functional data. FPCA aims to explain the covariance structure of the data by means of a small number of functional components. These functional components are linear combinations of the original variables, which gives better interpretation of the different sources of variation. The effectiveness of FPCA in data reduction thus makes it useful in the analysis of high dimensional data. In the presence of outliers, however, dimension reduction via FPCA would yield untrustworthy results, since FPCA is known to be sensitive to outliers.
The main contribution of this work is the construction of a method to detect outliers in functional data through robust FPCA. We have used the library developed by Ramsay and Silverman [31] to construct our code for functional data analysis. This thesis is organized as follows. Chapter 2 reviews functional principal component analysis and explores the sensitivity of FPCA to outliers. We also propose a functional outlier detection procedure based on robust multivariate techniques, and describe accompanying diagnostic plots that can be used to detect and classify possible outliers. Chapter 3 consists of numerical examples, a real dataset and a simulation study for the proposed procedure. Chapter 4 reviews the functional linear model with scalar response. In this chapter we propose a robust technique for parameter function estimation based on the robust functional principal components method. We also introduce robust diagnostic measures for identifying influential observations. Finally, in Chapter 5, we give the conclusions of this work.

Chapter 2
Robust Functional Principal Component Analysis and Outlier Detection

2.1 Introduction

In various areas such as chemometrics, biometrics, engineering, genetics, and e-commerce, data come from the observation of continuous phenomena over time or space, known as functional data. Due to the advancement of new techniques it is now possible to record a large number of variables simultaneously. In many applications the nature of these data is high dimensional, where the number of variables (p) is greater than the number of observations (n) (n ≪ p). The focus of researchers is on the analysis of such data due to the statistical problems that emerge when applying various statistical tools. Functional principal component analysis (FPCA) is a useful tool to reduce the dimension of functional data.
In functional data, the first step is to represent the data in a lower dimensional space in order to have better interpretation. This is done by performing FPCA to capture the main modes of variability of the data by means of a small number of components which are linear combinations of the original variables, allowing for better interpretation of the various sources of variation.

The sensitivity of the variance and the covariance matrix to irregular observations makes FPCA vulnerable to outliers, so that it may not capture the variation of the regular observations. Therefore data reduction based on FPCA becomes unreliable. This necessitates a robust FPCA method.

Locantore et al. [27] proposed a robust functional principal component analysis based on spherical (SPHER) and elliptical (ELL) PCA. However, these methods have two drawbacks. The first is that SPHER and ELL only estimate the principal components and not their eigenvalues. The second is that SPHER and ELL PCA are influenced by outliers [22]. Febrero et al. [13, 14] also proposed two methods for outlier detection that are based on the ideas of functional depths and distance measures.

The main contribution of our work is to construct a robust PCA method to achieve dimension reduction of the data and to develop tools for the detection of functional outliers. The outline of this chapter is as follows. In Section 2.2, a brief description of the classical principal component analysis (CPCA) method and robust PCA methods is given. The main concepts of PCA which are used in FPCA are discussed in Section 2.3. In this section, the outlier detection procedure via robust FPCA is also described.

2.2 Classical and Robust Principal Component Analysis

Principal Component Analysis (PCA) is used for understanding the structure of a multivariate dataset. PCA is a useful tool for data reduction, which is achieved by identifying the main modes of variability of a given dataset.
Unfortunately, if the data contain outliers then data reduction based on classical principal component analysis becomes unreliable. The goal of robust PCA methods is to obtain principal components that are not affected by outliers.

2.2.1 Classical Principal Component Analysis

In multivariate data the central notion is to find weight vectors $\gamma_j \in \mathbb{R}^p$ that maximize the variance of the projected data $x'\gamma_j$, subject to $\|\gamma_j\| = 1$ and orthogonality to the previously found weight vectors. These weight vectors are the eigenvectors of the sample covariance matrix, sorted in descending order of the corresponding eigenvalues.

2.2.2 Robust Principal Component Analysis

MCD-based PCA in low dimensions (n > p)

The minimum covariance determinant (MCD) estimator [35, 36] of multivariate location and covariance is popular for this case because of its high resistance to outliers. A fast algorithm (FAST-MCD) [37] is also available for its computation. In this algorithm, the initial MCD estimators, the mean $\hat{\mu}_0$ and the covariance matrix $\hat{\Sigma}_0$, are defined based on h observations (h < n). The covariance matrix of these h observations has the lowest determinant, and h should be at least $\lfloor (n+p+1)/2 \rfloor$. The MCD estimator can resist n−h outliers, and with this choice the MCD estimator has a breakdown value of (n−h+1)/n. The value of h is taken approximately between 0.5n and 0.75n: h ≈ 0.5n when there is 50% contamination, and h ≈ 0.75n when there is 25% contamination. When there is a smaller number of outliers, the value of h is increased for more precise estimates.

Reweighting is then done to increase the finite sample efficiency. Each data point receives weight 1 if its robust distance $RD(x_i) = \sqrt{(x_i - \hat{\mu}_0)'\hat{\Sigma}_0^{-1}(x_i - \hat{\mu}_0)}$ is $\le \sqrt{\chi^2_{p,0.975}}$, and weight 0 otherwise. For the observations with weight one, the reweighted MCD estimator is then defined as the classical mean $\hat{\mu}_M$ and covariance matrix $\hat{\Sigma}_M$. The robust loadings are the first $k_1$ eigenvectors of the MCD estimate $\hat{\Sigma}_M$, sorted in descending order of the eigenvalues [20, 21, 22].

ROBPCA for high dimensional data (p > n)

For data with high dimension (p > n), the MCD estimator cannot be used, because the covariance matrix of h < p observations is always singular and its determinant cannot be minimized. In this case the ROBPCA method suggested by Hubert et al. [20, 21, 22] is used on the $X_{n,p}$ data. The ROBPCA method is a combination of the projection pursuit technique (PP) [19] and MCD covariance estimation in lower dimensions. PP is used first to reduce the dimension. The MCD method is then applied in this low dimensional subspace to estimate the center and the scatter of the data.
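The reweighting step described above can be sketched as follows. The initial robust fit $(\hat{\mu}_0, \hat{\Sigma}_0)$ is taken as given (in practice it would come from FAST-MCD), and the $\chi^2$ cutoff is passed in as a plain number to keep the sketch dependency-free; the function and variable names are ours:

```python
import numpy as np

def reweighted_estimates(X, mu0, Sigma0, cutoff):
    """One reweighting step: points whose robust distance from the
    initial estimates (mu0, Sigma0) exceeds `cutoff` (typically
    sqrt of the 0.975 chi-square quantile with p df) get weight 0;
    the reweighted mean/covariance are the classical estimates of
    the remaining points. Sketch only -- the initial MCD fit itself
    is assumed given."""
    d = X - mu0
    rd = np.sqrt(np.einsum('ij,jk,ik->i', d, np.linalg.inv(Sigma0), d))
    keep = rd <= cutoff                      # weight 1 vs weight 0
    mu = X[keep].mean(axis=0)
    Xc = X[keep] - mu
    Sigma = Xc.T @ Xc / (keep.sum() - 1)
    # robust loadings: eigenvectors of Sigma, sorted by eigenvalue
    vals, vecs = np.linalg.eigh(Sigma)
    order = np.argsort(vals)[::-1]
    return mu, Sigma, vecs[:, order], vals[order]
```

The first $k_1$ columns of the returned eigenvector matrix then play the role of the robust loadings.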
Initial data preprocessing is done by applying a singular value decomposition of $X_{n,p}$. This results in a huge dimension reduction, as the data are represented using at most $n-1 = \mathrm{rank}(\tilde{X}_{n,p})$ variables without loss of information.

By applying PP, the high dimensional data points are projected onto many univariate directions. Then the robust center $\hat{\mu}_r$ and scale $\hat{\sigma}_r$ (based on the univariate MCD method) of the projected data points on every direction are computed. For each data point, the Stahel-Donoho outlyingness measure

$\mathrm{Outl}(z_i) = \max_{v} \dfrac{|z_i'v - \hat{\mu}_r|}{\hat{\sigma}_r}, \quad i = 1,\ldots,n,$

is used to form the h subset that has the smallest outlyingness. An optimal number $k_0 \le p$ of principal components is then selected from the covariance matrix of the final h subset, and the data are projected onto this $k_0$ dimensional subspace. Next the reweighted MCD estimator is used to compute the center and the scatter of the data points in this low-dimensional subspace. The dominant $k_1$ eigenvectors of this covariance matrix are the $k_1$ robust principal components, and the MCD location and covariance estimates serve as robust estimates of the location vector $\mu$ and covariance matrix $\Sigma$.

BACONPCA in low dimensions (n > p)

The Blocked Adaptive Computationally Efficient Outlier Nominators (BACON) algorithm developed by Billor et al. [2] is used for this robust procedure. BACON is a cost-effective, fast computational method with a high breakdown point, based on measuring robust distances from a basic subset which is free of outliers. The initial basic subset is derived algorithmically in one of two ways, by Mahalanobis distances or by Euclidean distances.

For BACONPCA in low dimensions, an initial dimension reduction of the mean centered data matrix $X_c$ is done by singular value decomposition (SVD):

$X_c = (X - 1_n\hat{\mu}') = UD\Gamma',$

where $\hat{\mu}$ is the classical mean vector, $D$ is a $p \times p$ diagonal matrix whose squared entries are the eigenvalues of $X_c'X_c$, $U'U = I_p = \Gamma'\Gamma$, and $\Gamma$ is the matrix of the eigenvectors corresponding to the eigenvalues of $X_c'X_c$.
$I_p$ is the $p \times p$ identity matrix. The next step is to obtain the score matrix, $Z = X_c\Gamma$. The robust mean, $\hat{\mu}_B$, and the robust variance-covariance matrix, $\hat{\Sigma}_B$, are computed from the clean observations obtained from the BACON algorithm. From the BACON based covariance matrix $\hat{\Sigma}_B$, the number of robust PCs is determined as $k_1$. $\Gamma_1$ is the matrix of the eigenvectors corresponding to the nonzero eigenvalues of $\hat{\Sigma}_B$. Finally, the robust score matrix, $Z_1 = (Z - 1_n\hat{\mu}_B')\Gamma_1$, is obtained [3].

BACONPCA in high dimensions (p > n)

In this case the BACONPCA method, suggested by Billor et al. [3], is used on the centered $X_c$ data. The mean centered data matrix $X_c$ is preprocessed by a singular value decomposition (SVD) based on the eigenvalues and eigenvectors of $X_cX_c'$ instead of $X_c'X_c$, since the decomposition of $X_cX_c'$ is much faster than that of $X_c'X_c$. Then the score matrix $Z = X_c\Gamma$ is obtained, where $\Gamma$ is the matrix of the eigenvectors corresponding to the eigenvalues of $X_cX_c'$.

Since the BACON and MCD methods are useful only when n > p, these methods cannot be used to determine clean observations of Z when n < p, because of the singularity of the covariance matrix. The Stahel-Donoho outlyingness measure is useful to determine a clean set of h observations of Z (Hubert et al. [22]). The high dimensional data points, $z_i$, are projected onto many univariate directions v. For every direction v, a robust center $\hat{\mu}_r$ and robust standard deviation $\hat{\sigma}_r$ (based on the univariate BACON method) are obtained for the projected observations, $z_i'v$ ($i = 1,\ldots,n$). The outlyingness measure based on these robust center and scale values can be defined as:

$\mathrm{Outl}(z_i) = \max_{v} \dfrac{|z_i'v - \hat{\mu}_r|}{\hat{\sigma}_r}, \quad i = 1,\ldots,n.$

This measure will detect the points which are outlying along the projection directions. It results in a clean set of h observations (h = 0.75n). For these h observations the mean, $\hat{\mu}_1$, and the scatter matrix, $\hat{\Sigma}_1$, of the Z matrix are obtained.
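A projection-pursuit approximation of this outlyingness measure can be sketched as follows. For brevity the median and MAD stand in for the univariate BACON (or MCD) location/scale estimates, and random unit directions replace an exhaustive search over v; names are ours:

```python
import numpy as np

def outlyingness(Z, n_dirs=500, seed=0):
    """Approximate Stahel-Donoho outlyingness: project the rows of Z
    onto many random unit directions and take, per point, the largest
    standardized distance from a robust center/scale of the projected
    values. Illustrative sketch of the measure in the text."""
    rng = np.random.default_rng(seed)
    n, p = Z.shape
    V = rng.standard_normal((p, n_dirs))
    V /= np.linalg.norm(V, axis=0)           # unit directions v
    P = Z @ V                                # projections z_i'v
    med = np.median(P, axis=0)               # robust center per direction
    mad = np.median(np.abs(P - med), axis=0) * 1.4826  # robust scale
    return np.max(np.abs(P - med) / mad, axis=1)
```

The h-subset is then simply the h points with the smallest outlyingness values.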
The spectral decomposition of $\hat{\Sigma}_1$ gives:

$\hat{\Sigma}_1 = \Gamma_1\Lambda_1\Gamma_1',$

where $\Gamma_1$ is the matrix of the eigenvectors of $\hat{\Sigma}_1$ and $\Lambda_1 = \mathrm{diag}(\lambda_1,\ldots,\lambda_p)$ is the diagonal matrix of its eigenvalues. Then we determine the number of principal components $k_0 < p$ to retain by using some technique, such as a scree plot. The data are then projected onto the subspace spanned by the first $k_0$ eigenvectors of the covariance matrix $\hat{\Sigma}_1$, that is,

$Z_2 = (Z_{n \times p} - 1_n\hat{\mu}_1')\Gamma_{p \times k_0},$

where $\Gamma_{p \times k_0}$ is the matrix of the first $k_0$ eigenvectors of $\hat{\Sigma}_1$. Next, the BACON algorithm is applied to find the mean vector, $\hat{\mu}_B$, and scatter matrix, $\hat{\Sigma}_B$, of the matrix $Z_2$. Based on the robust covariance matrix $\hat{\Sigma}_B$, the robust PCs are obtained as:

$Z_3 = (Z_2 - 1_n\hat{\mu}_B')\Gamma^{*}_{p \times k_1},$

where $\Gamma^{*}_{p \times k_1}$ is the matrix of eigenvectors corresponding to the first $k_1$ eigenvalues, determined by a selection criterion (e.g., a scree plot), of the robust BACON based covariance matrix $\hat{\Sigma}_B$ [3].

2.3 Classical and Robust Functional Principal Component Analysis and Outlier Detection

When the dataset is in the form of curves, the procedure for classical PCA can be generalized to functional principal component analysis (FPCA) to obtain the main modes of variability of the curves.

2.3.1 Classical Functional Principal Component Analysis

Instead of the variable values $x_{ij}$ used in PCA, functional values $x_i(t)$ are used in FPCA, so that the discrete index j of the multivariate context is replaced by a continuous index t. Unlike multivariate PCA, the components in functional PCA are functions rather than vectors, so summations over j are replaced by integrations over t.

Let $\{x(t),\, t \in T\}$ be a stochastic process, where $T$ is some index set which is a bounded interval on $\mathbb{R}$.
The principal component scores corresponding to the weight function $\gamma$ are generalized to an integral form,

$Z_i = \int \gamma(t) x_i(t)\,dt. \quad (2.2)$

The weight function $\gamma_j(t)$ is obtained by solving

$\max_{\langle \gamma_j, \gamma_m \rangle = I(j=m),\; m \le j}\; N^{-1}\sum_i \left( \int \gamma_j x_i \right)^2, \quad (2.3)$

or, equivalently, by solving the functional eigenequation

$\int \Gamma(s,t)\gamma(t)\,dt = \lambda\gamma(s), \quad \gamma \in L^2, \quad (2.4)$

where $\Gamma$ is the covariance function of $x(t)$. The sequence of eigenfunctions $\gamma_i$, $i = 1, 2, \ldots$, sorted with respect to the corresponding eigenvalues $\lambda_1 \ge \lambda_2 \ge \ldots$, solves the FPCA problem (2.3). The eigenequation is the same general equation as in PCA, except that $\gamma$ is now an eigenfunction rather than an eigenvector. There is a major difference between multivariate and functional eigenanalysis: in the multivariate case there are p (the number of variables) eigenvalue-eigenvector pairs, whereas in the functional case there are infinitely many (the number of functional values). In practice, the unknown covariance function $\Gamma$ needs to be estimated from the sample values $x_i(t)$, $1 \le i \le n$, where each $x_i(t)$ is observed on a discrete set of points $t = \{t_1,\ldots,t_p\}$ for finite p.

The FPCA problem can be represented in terms of a basis function approach, in which the first k basis functions of a basis $\{\phi_1,\ldots,\phi_k\}$ are used, where k is large enough that these functions are able to describe most of the features of the data. The basis is selected based on the nature of the data; for example, if the data are smooth and periodic then a Fourier basis might be ideal, while for data with a lot of local features B-splines might work better. Approximate each $x_i$ by:

$\hat{x}_i(t) = \sum_{K=1}^{k} c_{iK}\phi_K(t). \quad (2.5)$

We can express all n curves simultaneously by defining the vector-valued function x to have components $x_1, x_2, \ldots, x_n$ and the vector-valued function $\phi$ to have components $\phi_1,\ldots,\phi_k$:

$x = C\phi, \quad (2.6)$

where the coefficient matrix C is $n \times k$.
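Once each curve is reduced to its coefficient vector, the FPCA eigenproblem can be solved entirely in the coefficient space; the derivation that follows arrives at a symmetric eigenproblem involving the basis Gram matrix W. A sketch of that standard computation (our own code, not the thesis implementation):

```python
import numpy as np

def fpca_from_coefficients(C, W):
    """Solve the symmetric coefficient-space eigenproblem
    (1/n) W^{1/2} C'C W^{1/2} u = lambda u, then map back via
    b = W^{-1/2} u, so each column of B holds the basis coefficients
    of one eigenfunction. C: n x k coefficient matrix; W: k x k
    symmetric positive definite Gram matrix of the basis."""
    n = C.shape[0]
    w_vals, w_vecs = np.linalg.eigh(W)
    W_half = w_vecs @ np.diag(np.sqrt(w_vals)) @ w_vecs.T    # W^{1/2}
    W_mhalf = w_vecs @ np.diag(1 / np.sqrt(w_vals)) @ w_vecs.T
    M = W_half @ (C.T @ C / n) @ W_half
    lam, U = np.linalg.eigh(M)
    order = np.argsort(lam)[::-1]            # descending eigenvalues
    B = W_mhalf @ U[:, order]                # b = W^{-1/2} u
    return lam[order], B
```

With an orthonormal basis, W = I and this reduces to ordinary PCA of the coefficient matrix C, as noted in the text.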
In matrix terms the variance-covariance function is:

$\Gamma(s,t) = n^{-1}\phi(s)'C'C\phi(t). \quad (2.7)$

Define W as a symmetric matrix of order k,

$W = \int \phi\phi'. \quad (2.8)$

Suppose that the weight function $\gamma$ has the expansion

$\gamma(s) = \sum_K b_K\phi_K(s), \quad (2.9)$

or, in matrix notation, $\gamma(s) = \phi(s)'b$. Using equations (2.6)-(2.9), the left hand side of the eigenequation (2.4) becomes

$\int \Gamma(s,t)\gamma(t)\,dt = \int n^{-1}\phi(s)'C'C\phi(t)\phi(t)'b\,dt = \phi(s)'n^{-1}C'CWb.$

The eigenequation can then be written as:

$\phi(s)'n^{-1}C'CWb = \lambda\phi(s)'b. \quad (2.10)$

As this equation holds for all s, it can be written in matrix form as:

$n^{-1}C'CWb = \lambda b. \quad (2.11)$

Note that $\|\gamma\| = 1$ implies $b'Wb = 1$ and, similarly, two functions $\gamma_1$ and $\gamma_2$ are orthogonal if and only if the corresponding vectors of coefficients satisfy $b_1'Wb_2 = 0$. We define $u = W^{1/2}b$ to obtain the required principal components by solving the equivalent symmetric eigenvalue problem

$n^{-1}W^{1/2}C'CW^{1/2}u = \lambda u \quad (2.12)$

and computing $b = W^{-1/2}u$ for each eigenvector. If the basis is orthonormal, then $W = I$ and the functional PCA problem reduces to the standard multivariate PCA of the coefficient array C.

In this section, we examined FPCA as a dimension reduction tool. Although FPCA solves the dimensionality problem, it fails to deal with data containing outliers. In the next section a new method, the robust FPCA method, is given to overcome this problem.

2.3.2 Robust Functional Principal Component Analysis and Outlier Detection

In this section, an outlier detection method for functional data via robust FPCA is given. The aim of the robust FPCA method is to obtain functional principal components that are less influenced by outliers. The outlier detection method proposed by Febrero et al. [13] is discussed first, and then the construction of the robust FPCA method is described.

Outlier Detection using Likelihood Ratio Test

An outlier detection procedure for functional data using a Likelihood Ratio Test (LRT), based on a distance measure, was developed by Febrero et al. [13].
Let the functional sample be $x_1, \ldots, x_n$. The statistic is given as

$O_\alpha(x_i) = \Big\| \dfrac{x_i - \hat{\mu}_{TM,\alpha}}{\hat{\sigma}_{TSD,\alpha}} \Big\|, \qquad \Lambda = \max_{1 \le i \le n} O_\alpha(x_i)$,

where $\|\cdot\|$ is a norm in the functional space ($\|\cdot\|_1$, $\|\cdot\|_2$ or $\|\cdot\|_\infty$), $\hat{\mu}_{TM,\alpha}$ is the $\alpha$-trimmed mean and $\hat{\sigma}_{TSD,\alpha}$ is the $\alpha$-trimmed standard deviation. Hence $O_\alpha(x_i)$ is the distance between $x_i$ and $\hat{\mu}_{TM,\alpha}$ relative to $\hat{\sigma}_{TSD,\alpha}$. The presence of outliers is determined by comparing the test statistic $\Lambda$ with some threshold, combined with an iterative procedure.

Description of the Proposed Method

Consider the functions as processes in continuous time defined over an interval $T = [t_{min}, t_{max}]$. The $i$th replication of the functional observation is denoted $x_i(t) \in L^2[T]$, $i = 1, \ldots, n$. In practice it is impossible to observe the functional values in continuous time; we usually obtain data only on a finite, discrete grid $t = \{t_1, t_2, \ldots, t_p\} \subset T$ in the following manner:

$y_i = x_i(t) + \epsilon_i, \qquad 1 \le i \le n$,

where $\epsilon_i$ is a random error (noise) with zero mean and constant variance function $\sigma_i^2(t)$. For simplicity we assume that all processes are observed at the same time points, which are equally spaced on $T$. The function $x_i$ can be represented as a linear combination of the first $k$ basis functions $\phi_K$, $K = 1, \ldots, k$, where $k$ is large enough and $k < p$, using the basis expansion method given in Chapter 1:

$x_i(t) = \sum_{K=1}^{k} c_{iK}\, \phi_K(t)$,

where $\phi$ is the vector-valued function with components $\phi_1, \ldots, \phi_k$, and $C = [c_{iK}]$, $1 \le i \le n$, $1 \le K \le k$, is the $n \times k$ coefficient matrix of the expansion. The simultaneous expansion of all $n$ curves can be expressed in matrix notation as $x = C\phi$, where $x$ is a vector-valued function with components $x_i$, $1 \le i \le n$. We assume that the basis is orthonormal. To select the optimal number of basis functions $k$, the GCV criterion developed by Craven and Wahba [9] is used, as described in the following section.
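The distance statistic $O_\alpha$ above can be sketched on a discretized grid. Note that this is only a rough stand-in: the trimming here is done pointwise at each grid point for simplicity, whereas the procedure of Febrero et al. trims whole curves; all names are hypothetical.

```python
import numpy as np

def trimmed_outlyingness(X, alpha=0.1):
    """Rough sketch of O_alpha(x_i): compare each curve with an alpha-trimmed
    mean curve, scaled by an alpha-trimmed standard deviation curve, and
    approximate the L2 norm on the grid. X is an (n, p) array of curves
    evaluated on a common grid (an assumption of this illustration)."""
    n, p = X.shape
    m = int(np.floor(n * alpha))        # number of values trimmed per tail
    Xs = np.sort(X, axis=0)
    core = Xs[m:n - m]                  # trimmed sample at each grid point
    mu = core.mean(axis=0)              # trimmed mean curve
    sd = core.std(axis=0, ddof=1)       # trimmed standard deviation curve
    O = np.sqrt(np.mean(((X - mu) / sd) ** 2, axis=1))   # one value per curve
    return O, O.max()                   # statistics O_alpha(x_i) and Lambda
```

A curve shifted far from the bulk of the sample receives a large $O_\alpha$ value and drives the maximum $\Lambda$.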
Coefficient Estimation: For the partially observed functions the coefficients $c_{iK}$ are computed using the least squares approach, for $i = 1, \ldots, n$ and $K = 1, \ldots, k$, by minimizing

$\sum_{i=1}^{n} \Big[ y_i(t) - \sum_{K=1}^{k} c_{iK}\, \phi_K(t) \Big]^2 = (y - C\phi)'(y - C\phi) = \| y - C\phi \|^2$.

Since we deal with a basis that is orthonormal, the functional PCA problem reduces to the standard multivariate PCA of the coefficient matrix $C$ (see Section 2.3.1). Instead of dealing with FPCA directly, we apply classical PCA on $C$; applying PCA on $C$ provides equivalent information about the structure of the covariance function of the functional data $x(t)$, and outliers in $C$ are equivalent to outliers in the functional data $x(t)$. Therefore, the diagnostic plots developed to detect outliers using multivariate PCA methods can also be used to detect functional outliers. The orthogonal-score diagnostic plot [22], a by-product of the robust PCA method, is used for identification and classification of outliers. Using the PCA method we obtain robust scores $Z$ in the following manner:

$Z = C \cdot V$,

where $Z$ is an $n \times k_1$ matrix, $C$ is the $n \times k$ matrix of coefficients, $V$ is the $k \times k_1$ matrix of robust eigenvectors, and $k_1 \le k$. The selection criterion for the number of components $k_1$ is based on the eigenvalues, with a predetermined threshold of 90%: the optimal number of components $k_1$ is the minimal value for which the cumulative percentage of total variance is greater than or equal to 90%. Robust coefficients are obtained by transforming the scores back to the original coordinate system. The cutoff value for the score distances is $\sqrt{\chi^2_{k_1, 0.975}}$. For the cutoff value of the orthogonal distances the Wilson-Hilferty approximation for a $\chi^2$ distribution is used, whereby $Od^{2/3} \approx N(\mu, \sigma^2)$. Estimates $\hat{\mu}$ and $\hat{\sigma}^2$ are obtained using the univariate MCD. The cutoff value (the vertical line in the plot) equals $(\hat{\mu} + \hat{\sigma} z_{0.975})^{3/2}$, where $z_{0.975}$ is the 97.5% quantile of the standard normal distribution.

Figure 2.1: (a) Different types of outliers when a 3-dimensional dataset is projected on a robust 2-dimensional PCA subspace.
(b) Different types of outliers in the orthogonal-score distance plot (Hubert et al. [22]).

Chapter 3

Numerical Examples

In this chapter a real dataset and a simulation study are given to demonstrate the performance of the proposed method for outlier detection in FDA.

3.1 Dataset: NOx Data

The aim of our analysis is to illustrate the performance of the robust FPCA on the NOx data, which was used by Febrero et al. [13, 14]. The NOx emission levels collected by a control station near a power plant in Barcelona in 2005 are analyzed using techniques for functional data. The dataset consists of NOx levels ($\mu g/m^3$) measured every hour for the period February 23, 2005 to June 26, 2005. Only the NOx levels for 115 days are available, because observations are missing for several consecutive hours on some days. The NOx emission levels are displayed in Figure 3.1. The whole NOx dataset is divided into working days ($n_1 = 76$ curves) and non-working days ($n_2 = 39$ curves); non-working days are weekends and holidays during the given data period. To use functional data analysis it is essential first to convert the discrete data to functional form (i.e., continuous functions) by using basis functions. We utilize these three datasets 1) to explore the effect of different basis expansions (Fourier and B-spline) on outlier detection, and 2) to compare the proposed method for the detection of outliers with Febrero's results.

Figure 3.1: Sample curves of NOx data by using (a) B-spline basis and (b) Fourier basis

From Figure 3.1 we can say that NOx levels increase in the morning and reach a peak value around 8:00 am, then decrease until 2:00 pm, and increase again in the evening. Figures 3.5 and 3.8 exhibit the sample curves for working days and non-working days, respectively.
In Figures 3.1, 3.5 and 3.8, the group of curves shows the presence of a few trajectories that are in some way different from the rest.

NOx (Whole sample): The initial dimension of the dataset is $115 \times 24$. The GCV method is used to determine the optimal number of bases for the B-spline and Fourier bases; the resulting values of $k$ based on GCV are 12 for the B-spline basis (Figure 3.2(a)) and 10 for the Fourier basis (Figure 3.2(b)). The whole-sample data curves using 12 B-spline bases and 11 Fourier bases are shown in Figure 3.1(a) and (b), respectively. Since there is high correlation among the variables of the coefficient matrix, we apply CPCA, ROBPCA and BACONPCA on the coefficient matrix for dimension reduction and outlier detection.

Figure 3.2: Generalized cross-validation by using (a) B-spline basis and (b) Fourier basis for NOx data

For the B-spline and Fourier bases, six and five principal components were retained, respectively, for each of CPCA, ROBPCA and BACONPCA, yielding classical and robust explained-variance percentages of more than 90%. The resulting diagnostic plots for the three PCA methods using the B-spline and Fourier bases are given in Figures 3.3 and 3.4, respectively. All bad leverage points detected by these diagnostic plots (orthogonal-score plots) formed using the three PCA methods based on both bases are listed in Table 3.1.

NOx (Working Days): The initial dimension of the dataset is $76 \times 24$. The optimal number of bases based on GCV is $k = 12$ for the B-spline basis and $k = 10$ for the Fourier basis. The data curves for working days using 12 B-spline bases and 11 Fourier bases are given in Figure 3.5(a) and (b), respectively. Due to the high correlation among the variables in $C$, we apply CPCA, ROBPCA and BACONPCA on the coefficient matrix.
For the B-spline and Fourier bases, seven and five principal components were retained, respectively, for each of CPCA, ROBPCA and BACONPCA, yielding classical and robust explained-variance percentages of more than 90%.

Figure 3.3: Orthogonal-score plot for the whole sample by using B-spline basis computed with (a) CPCA, (b) ROBPCA, (c) BACONPCA

Figure 3.4: Orthogonal-score plot for the whole sample by using Fourier basis computed with (a) CPCA, (b) ROBPCA, (c) BACONPCA

Figure 3.5: Working days of NOx data by using (a) B-spline basis and (b) Fourier basis

The resulting diagnostic plots for the three PCA methods using these bases are displayed in Figures 3.6 and 3.7, respectively. All bad leverage points detected by these diagnostic plots (orthogonal-score plots) formed using the three PCA methods based on both bases are listed in Table 3.1.

NOx (Non-Working Days): The initial dimension of the dataset is $39 \times 24$. By the GCV method, the optimal numbers of bases are $k = 12$ for the B-spline basis and $k = 6$ for the Fourier basis.
The data curves for non-working days using 12 B-spline bases and 7 Fourier bases are given in Figure 3.8(a) and (b), respectively. We again apply CPCA, ROBPCA and BACONPCA on the coefficient matrix. For the B-spline and Fourier bases, five and three principal components were retained, respectively, for each of CPCA, ROBPCA and BACONPCA, yielding classical and robust explained-variance percentages of more than 90%.

Figure 3.6: Orthogonal-score plot for working days by using B-spline basis computed with (a) CPCA, (b) ROBPCA, (c) BACONPCA

Figure 3.7: Orthogonal-score plot for working days by using Fourier basis computed with (a) CPCA, (b) ROBPCA, (c) BACONPCA

Figure 3.8: Non-working days of NOx data by using (a) B-spline basis and (b) Fourier basis

The resulting diagnostic plots for the three PCA methods using the B-spline and Fourier bases, given in Figures 3.9 and 3.10, respectively, illustrate all types of outliers in this dataset.
All bad leverage points detected by these diagnostic plots (orthogonal-score plots) formed using the three PCA methods based on both bases are listed in Table 3.1. Figure 3.11 shows the outliers identified by the proposed method for the three datasets. Outliers detected by Febrero et al. [13] for the same dataset are also given in Table 3.2 for comparison purposes.

Figure 3.9: Orthogonal-score plot for non-working days by using B-spline basis computed with (a) CPCA, (b) ROBPCA, (c) BACONPCA

Figure 3.10: Orthogonal-score plot for non-working days by using Fourier basis computed with (a) CPCA, (b) ROBPCA, (c) BACONPCA

From Table 3.1 we conclude that both ROBPCA and BACONPCA detected similar outliers in the whole sample using the Fourier basis. These results match the results obtained by Febrero et al. [13] (see Table 3.2). However, we have detected
one additional outlier (03/09) using both the ROBPCA and BACONPCA methods. The orthogonal-score plot based on CPCA detects only one bad leverage point. For the whole sample using the B-spline basis, four similar outliers are detected by both ROBPCA and BACONPCA, while CPCA detected three outliers; the outliers detected by these three PCA methods using the B-spline basis do not conform with the outliers detected by Febrero et al. [13].

Figure 3.11: Outliers detected by the proposed method for (a) whole sample, (b) working days, (c) non-working days

For the working days dataset we obtained five similar outliers for both ROBPCA and BACONPCA using the Fourier basis. The outliers thus detected match the outliers detected by Febrero et al. [13] for the working days dataset using the Fourier basis, except that we have detected two additional outliers (05/02 and 03/09) (see Table 3.1). The CPCA method using the Fourier basis detected only one outlier. Under the B-spline basis, four outliers are detected by ROBPCA and five outliers by BACONPCA for the working days dataset; the results obtained by both PCA methods do not match those of Febrero et al. [13]. Here also CPCA detected only one outlier, converting bad leverage points detected by ROBPCA and BACONPCA into good leverage points.

In the non-working days dataset under the Fourier basis, ROBPCA detected two outliers and BACONPCA detected three. These results match the findings of Febrero et al. [13] except that BACONPCA detected one additional outlier (05/01). Under the Fourier basis CPCA failed to detect any outlier. Similarly, for the non-working days dataset using the B-spline basis, the CPCA method did not detect any outlier.
The outliers detected by ROBPCA and BACONPCA using the B-spline basis are different from those obtained by Febrero et al. [13].

After detecting the outliers, we checked for the sources of the abnormal values of these curves, hoping to explain the abnormally large NOx emissions on these particular days. We found that Friday, March 11 is the beginning of a weekend. Friday, March 18 and Saturday, March 19 are the beginning of the Easter vacation in Spain in 2005. Also Friday, April 29, Saturday, April 30, Sunday, May 1, and Monday, May 2 correspond to a long weekend. There is a sudden increase in traffic during these short vacation periods, so we conclude that the abnormal observations on these specific days can be attributed to the increase in traffic due to short vacation periods. We have also detected an outlier on Wednesday, March 9; high NOx emissions are recorded on March 9 after 8:00 pm.

Table 3.1: Outliers detected by the three PCA methods; (.) denotes the case number within each of the three datasets

                              Fourier basis                          B-spline basis
Dataset           CPCA       ROBPCA     BACONPCA      CPCA       ROBPCA     BACONPCA
Whole sample      05/02(59)  03/18(22)  03/18(22)     04/29(56)  04/29(56)  04/29(56)
                             04/29(56)  04/29(56)     03/11(16)  03/11(16)  03/11(16)
                             03/11(16)  03/11(16)     05/02(59)  05/02(59)  05/02(59)
                             05/02(59)  05/02(59)                03/16(20)  03/16(20)
                             03/09(15)  03/09(15)
Working days      05/02(38)  03/18(16)  03/18(16)     05/03(39)  04/29(37)  04/29(37)
                             04/29(37)  04/29(37)                03/11(12)  03/11(12)
                             03/11(12)  03/11(12)                05/02(38)  05/02(38)
                             05/02(38)  05/02(38)                03/09(11)  03/09(11)
                             03/09(11)  03/09(11)                           03/18(16)
Non-working days    --       04/30(20)  04/30(20)       --       04/30(20)  05/15(26)
                             03/19(07)  03/19(07)                03/19(07)
                                        05/01(21)

Table 3.2: Outliers detected by Febrero et al. [13] for the NOx data

Dataset            ||.||_1   ||.||_2   ||.||_inf
Whole sample       03/18     03/18     03/18
                   04/29     04/29     04/29
                   03/11     03/11     03/11
                   05/02     05/02
Working days       03/18     03/18     03/18
                   04/29     04/29     04/29
                   03/11     03/11     03/11
Non-working days   04/30     04/30     04/30
                   03/19     03/19     03/19
Since the observation on March 10 is missing and thus not included in the analysis, we could not pinpoint the reason behind the abnormal observation on March 9.

3.2 Simulation

A simulation study is conducted to compare the performance of ROBPCA and BACONPCA with classical PCA (CPCA) on the coefficient matrix. The simulation setting given by Fraiman and Muniz [12, 28], with a few changes, is used here. For the simulation we consider functional data $x_1, \ldots, x_n$ obtained as realizations from a stochastic process $X(\cdot)$ with continuous paths on the observation period $[t_{min}, t_{max}] = [0, 1]$. Curves are generated from different models: model 1 is generated without contamination, and several other models are generated with different types of contamination.

Model 1 (no contamination): $X_i(t) = g(t) + e_i(t)$, $1 \le i \le n$, where the model error term $e_i(t)$ is a stochastic Gaussian process with zero mean and covariance function $\gamma(s,t) = (1/2)(1/2)(0.9)^{|t-s|}$, and $g(t) = 4t$ with $t \in [0,1]$.

Model 2 (asymmetric contamination): $Y_i(t) = X_i(t) + c_i M$, $1 \le i \le n$, where $c_i$ is 1 with probability $q$ and 0 with probability $1-q$, and $M$ is the contamination size constant.

Model 3 (symmetric contamination): $Y_i(t) = X_i(t) + c_i \sigma_i M$, $1 \le i \le n$, where $c_i$ and $M$ are defined as in model 2 and $\sigma_i$ is a sequence of random variables, independent of $c_i$, taking the values 1 and -1 each with probability 1/2.

Model 4 (partial contamination): $Y_i(t) = X_i(t) + c_i \sigma_i M$ if $t \ge T_i$, and $Y_i(t) = X_i(t)$ if $t < T_i$, $1 \le i \le n$, where $T_i$ is a random number generated from a uniform distribution on $[0,1]$.

Model 5 (peak contamination): $Y_i(t) = X_i(t) + c_i \sigma_i M$ if $T_i \le t \le T_i + \ell$, and $Y_i(t) = X_i(t)$ if $t \notin [T_i, T_i + \ell]$, $1 \le i \le n$, where $\ell = 2/30$ and $T_i$ is a random number from a uniform distribution on $[0, 1-\ell]$.

Figure 3.12 exhibits curves simulated from these five models. For each model we generated 100 replications, with two settings each for low- and high-dimensional data.
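The five simulation models can be sketched as follows. This is an illustrative reimplementation, not the original simulation code; in particular, reading the covariance function as $(1/2)(1/2)(0.9)^{|t-s|} = 0.25\,(0.9)^{|t-s|}$, as well as the function and parameter names, are assumptions.

```python
import numpy as np

def simulate_curves(n=50, p=100, model=2, q=0.1, M=10, seed=0):
    """Sketch of models 1-5: Gaussian base curves with mean g(t) = 4t and
    covariance 0.25 * 0.9**|t-s| on [0, 1], plus the stated contaminations."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, p)
    cov = 0.25 * 0.9 ** np.abs(t[:, None] - t[None, :])
    E = rng.multivariate_normal(np.zeros(p), cov, size=n)
    X = 4.0 * t + E                              # model 1: no contamination
    if model == 1:
        return X
    c = rng.random(n) < q                        # contaminate with probability q
    sigma = rng.choice([-1.0, 1.0], size=n)      # random sign (models 3-5)
    if model == 2:                               # asymmetric shift
        X[c] += M
    elif model == 3:                             # symmetric shift
        X[c] += (sigma[c] * M)[:, None]
    elif model == 4:                             # contaminated for t >= T_i
        T = rng.random(n)
        mask = c[:, None] & (t[None, :] >= T[:, None])
        X += mask * (sigma * M)[:, None]
    elif model == 5:                             # peak on [T_i, T_i + 2/30]
        ell = 2.0 / 30.0
        T = rng.uniform(0.0, 1.0 - ell, size=n)
        mask = c[:, None] & (t[None, :] >= T[:, None]) & (t[None, :] <= (T + ell)[:, None])
        X += mask * (sigma * M)[:, None]
    return X
```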
For low-dimensional data we consider the settings 1) $n = 100$, $p = 12$, $k = 8$ and 2) $n = 50$, $p = 5$, $k = 4$. For high-dimensional data we also consider two settings, 1) $n = 50$, $p = 100$, $k = 51$ and 2) $n = 50$, $p = 500$, $k = 151$. For model 1 the contamination percentage is $q = 0$ and the contamination constant is $M = 0$. For each contaminated model (2, 3, 4 and 5) we considered several levels of contamination, $q = 5, 10, 15$ percent, and contamination constants $M = 10$ and $M = 25$. Classical PCA and the robust methods ROBPCA and BACONPCA are applied to the simulated functional data from the five models. Two quantitative measures of the goodness of the methods are considered. The first is the mean proportion of explained variability (MPV):

$MPV = \dfrac{1}{N} \sum_{m=1}^{N} \dfrac{\hat{\lambda}_1^m + \hat{\lambda}_2^m + \cdots + \hat{\lambda}_k^m}{\lambda_1^m + \lambda_2^m + \cdots + \lambda_k^m + \cdots + \lambda_p^m}$,

where $N = 100$ denotes the number of replications and $\lambda_j^m$ is the $j$th eigenvalue at the $m$th replication obtained from the covariance matrix of the coefficient matrix of the uncontaminated model. $\hat{\lambda}_j^m$ is the estimated value of $\lambda_j^m$ at the $m$th replication, obtained by applying classical or robust multivariate techniques to the coefficient matrix of the contaminated or uncontaminated model. In each setting, 90% of the variability is explained by the first three components, so the optimal value of the mean proportion of explained variability is 0.9 for both low- and high-dimensional data. The second quantitative measure is the norm of the difference between $\hat{\lambda}_1^m$ and $\lambda_1^m$, given as $\|\hat{\lambda}_1^m - \lambda_1^m\|$, where $\lambda_1^m$ is the largest eigenvalue obtained from the covariance matrix of the coefficient matrix of the uncontaminated model and $\hat{\lambda}_1^m$ is its estimate at the $m$th replication, obtained by applying classical or robust multivariate techniques to the coefficient matrix of the contaminated or uncontaminated model. The optimal value is zero or near zero. Model 1 is compared with models 2, 3, 4 and 5. The simulation results of the mean proportion of variability for the four comparisons are given in Tables 3.3 - 3.6.
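The MPV criterion can be sketched directly from its definition. This is an illustration with hypothetical names; it assumes the eigenvalues of each replication have already been computed and stored row-wise (one row per replication).

```python
import numpy as np

def mean_proportion_of_variability(est_eigs, ref_eigs, k):
    """Sketch of the MPV criterion: for each replication m, sum the first k
    estimated eigenvalues and divide by the total of all p reference
    (uncontaminated) eigenvalues, then average over the N replications.
    est_eigs, ref_eigs: (N, p) arrays of eigenvalues, largest first."""
    num = est_eigs[:, :k].sum(axis=1)     # estimated variability of first k PCs
    den = ref_eigs.sum(axis=1)            # total reference variability
    return float(np.mean(num / den))
```

When the estimated eigenvalues equal the reference ones, the result is simply the true proportion of variance carried by the first $k$ components; inflated estimates push the MPV above 1 (i.e., above 100%).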
It is clear that CPCA provides the best mean proportion of explained variability when there is no contamination in the data, which is expected; for uncontaminated data the robust methods also yield comparable results. However, when contamination is introduced to the data (models 2-5), the eigenvalues obtained with CPCA are overestimated, as the estimated MPV values are far larger than 100%. With ROBPCA and BACONPCA we obtain an MPV of about 90% for low-dimensional data both with and without contamination, and for high-dimensional data the mean percentage of explained variability is similarly about 90% with and without contamination. The main reason is that the optimal directions obtained by ROBPCA and BACONPCA are robust to outliers. CPCA clearly fails and provides the worst possible result, with a mean proportion of variability above 100%. It is also clear from Tables 3.3 - 3.5 that the MPV for ROBPCA at the 15% contamination level is above 100% in most cases except for model 5; BACONPCA gives better results than ROBPCA at 15% in both the low- and high-dimensional cases. From these results we deduce that BACONPCA and ROBPCA outperform CPCA.

Simulation results for the norm with $N = 100$ replications and different contamination levels for the comparison of model 1 vs. model 2 are summarized in Figures 3.13 - 3.14. For this comparison we used two high- and two low-dimensional settings with the two values of $M$ (10 and 25). The ideal value of the norm is very small or near zero. We conclude that the norm is near zero when there is no contamination for all methods, an indication that ROBPCA and BACONPCA are also effective for uncontaminated data. The norm for CPCA tends to increase as the contamination level increases. For contaminated data, the norms corresponding to the ROBPCA and BACONPCA methods attain minimum values near zero for both the high- and low-dimensional settings.
The comparisons of model 1 vs. models 3-5 for the low- and high-dimensional settings yielded results very similar to those observed in Figures 3.13-3.14, and are therefore not repeated here. We also conducted the simulation above with $M = 5$, but the results were not satisfactory, because in that case the outliers are not yet well separated from the regular data group; as soon as the contamination lies somewhat further away, the robust methods are able to distinguish the outliers. The results of the simulation for this case are therefore not reported in this thesis, since we aimed at distinguishing outliers from the regular points.

Table 3.3: Simulation results of the mean proportion of explained variability when there is no contamination (0%) and with symmetric contamination (5%, 10% and 15%) for low- and high-dimensional cases.

High dimension: n=50, p=100, k=51
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.943    0.790    0.912       0.944     0.801    0.914
5%              13.213   0.833    0.918       71.297    0.838    0.915
10%             21.955   0.874    0.910       121.020   0.866    0.917
15%             32.354   2.472    0.918       195.299   12.720   0.927

High dimension: n=50, p=500, k=151
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.949    0.793    0.907       0.948     0.785    0.909
5%              11.749   0.821    0.917       73.250    0.804    0.912
10%             21.131   0.842    0.908       126.076   0.840    0.910
15%             31.715   1.512    0.919       201.703   14.303   0.928

Low dimension: n=100, p=12, k=8
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.914    0.833    0.892       0.917     0.830    0.898
5%              10.441   0.856    0.901       55.600    0.852    0.897
10%             18.859   0.860    0.889       114.111   0.870    0.903
15%             30.120   1.707    0.899       169.042   0.885    0.896

Low dimension: n=50, p=5, k=4
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.955    0.801    0.913       0.956     0.812    0.918
5%              9.736    0.837    0.923       56.574    0.830    0.913
10%             18.992   0.866    0.910       107.322   0.875    0.922
15%             25.839   1.360    0.921       161.370   3.781    0.920

Table 3.4: Simulation results of the mean proportion of explained variability when there is no
contamination (0%) and with asymmetric contamination (5%, 10% and 15%) for low- and high-dimensional cases.

High dimension: n=50, p=100, k=51
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.943    0.815    0.921       0.944     0.801    0.914
5%              10.254   0.837    0.920       70.863    0.839    0.914
10%             20.233   0.862    0.919       115.722   0.869    0.914
15%             26.345   0.895    0.919       165.056   9.011    0.921

High dimension: n=50, p=500, k=151
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.948    0.785    0.909       0.948     0.785    0.909
5%              12.728   0.804    0.909       74.551    0.806    0.905
10%             20.080   0.849    0.910       120.222   0.847    0.911
15%             28.722   2.110    0.916       173.636   9.202    0.917

Low dimension: n=100, p=12, k=8
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.914    0.833    0.892       0.915     0.834    0.896
5%              9.856    0.846    0.888       57.283    0.847    0.897
10%             17.285   0.870    0.896       106.619   0.868    0.897
15%             24.770   0.887    0.914       152.289   3.153    0.901

Low dimension: n=50, p=5, k=4
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.953    0.800    0.899       0.954     0.824    0.923
5%              10.930   0.826    0.908       51.805    0.844    0.922
10%             16.986   0.853    0.907       105.488   0.873    0.920
15%             24.150   1.923    0.920       135.844   0.903    0.921

Table 3.5: Simulation results of the mean proportion of explained variability when there is no contamination (0%) and with partial contamination (5%, 10% and 15%) for low- and high-dimensional cases.
High dimension: n=50, p=100, k=51
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.944    0.793    0.913       0.942     0.781    0.911
5%              5.641    0.831    0.925       34.047    0.819    0.917
10%             10.920   0.865    0.910       60.144    0.843    0.905
15%             15.836   1.280    0.908       99.545    1.487    0.918

High dimension: n=50, p=500, k=151
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.948    0.785    0.909       0.949     0.822    0.911
5%              6.514    0.820    0.909       33.517    0.850    0.929
10%             11.222   0.861    0.920       69.062    0.872    0.922
15%             15.815   1.095    0.919       98.250    1.209    0.927

Low dimension: n=100, p=12, k=8
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.915    0.805    0.882       0.915     0.829    0.892
5%              5.792    0.830    0.885       30.162    0.842    0.895
10%             11.325   0.856    0.885       62.695    0.860    0.894
15%             14.734   0.855    0.873       90.655    1.119    0.897

Low dimension: n=50, p=5, k=4
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.955    0.819    0.910       0.956     0.816    0.917
5%              6.745    0.831    0.903       35.184    0.841    0.912
10%             11.983   0.856    0.918       65.041    0.870    0.929
15%             18.445   1.577    0.931       100.715   4.946    0.920

Table 3.6: Simulation results of the mean proportion of explained variability when there is no contamination (0%) and with peak contamination (5%, 10% and 15%) for low- and high-dimensional cases.
High dimension: n=50, p=100, k=51
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.945    0.802    0.922       0.944     0.812    0.916
5%              1.505    0.832    0.913       5.285     0.851    0.923
10%             1.792    0.882    0.904       7.427     0.871    0.917
15%             2.070    0.921    0.873       8.989     0.912    0.898

High dimension: n=50, p=500, k=151
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.949    0.822    0.911       0.948     0.785    0.909
5%              1.462    0.855    0.925       5.123     0.824    0.910
10%             1.752    0.878    0.902       7.006     0.860    0.900
15%             1.998    0.933    0.901       8.347     0.896    0.896

Low dimension: n=100, p=12, k=8
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.916    0.841    0.893       0.916     0.823    0.895
5%              1.703    0.852    0.893       6.506     0.834    0.896
10%             2.419    0.850    0.887       10.273    0.851    0.893
15%             2.917    0.862    0.889       15.164    0.871    0.899

Low dimension: n=50, p=5, k=4
                        M=10                          M=25
Contamination   CPCA     ROBPCA   BACONPCA    CPCA      ROBPCA   BACONPCA
0%              0.954    0.799    0.901       0.954     0.814    0.898
5%              1.591    0.808    0.904       5.652     0.818    0.901
10%             2.456    0.811    0.903       9.499     0.828    0.897
15%             3.369    0.820    0.908       15.931    0.831    0.896

Figure 3.12: Curves generated from model 1 (without contamination), model 2 (asymmetric contamination), model 3 (symmetric contamination), model 4 (partial contamination) and model 5 (peak contamination) with n=50, p=100, M=10 and q=0.1.
Figure 3.13: Boxplots of the norm when there is no contamination (0%) and symmetric contamination (5%, 10% and 15%) for the high-dimensional cases, for CPCA (C), ROBPCA (R) and BACONPCA (B).

Figure 3.14: Boxplots of the norm when there is no contamination (0%) and symmetric contamination (5%, 10% and 15%) for the low-dimensional cases, for CPCA (C), ROBPCA (R) and BACONPCA (B).

Chapter 4

Robust Functional Principal Component Regression

4.1 Introduction

Recently researchers have paid more attention to functional linear models, in which the regressors and/or the response are of a functional nature, and have proposed several methods for estimating the functional parameter [10, 11, 15, 16, 30]. Functional regressors are infinite-dimensional in nature. The problem with the infinite dimensionality of the regressor is that it results in infinitely many sets of solutions, or suffers from multicollinearity. Therefore the first step in this regression setup is to reduce the dimension using FPCA, and then to regress the response onto the components obtained from FPCA. The presence of outliers or influential observations has a serious effect on the estimation and prediction of the functional linear model. In the presence of outliers, the decomposition of the classical covariance matrix is unreliable.
In such a situation, both the FPCA stage and the regression stage, together called functional principal component regression (FPCR), yield unreliable results. Influential observations in a given dataset can have a strong impact on an analysis, and removing these outlying or influential observations from the data may substantially affect the statistical inference. Functional versions of the diagnostic measures based on Cook's distance [8] were introduced by Chiou and Müller [7] and Shen and Xu [38] for models where the regressors are real or curves and the responses are functional. Recently, Febrero et al. [16] reviewed estimation based on the classical functional principal components method and then analyzed influence in the functional linear model with scalar response; they proposed three measures of influence by generalizing the measures proposed for the standard regression model by Cook [8] and Peña [29].

In this chapter we propose a robust functional principal component regression method which consists of two parts: first we apply a robust FPCA method on the regressors, and then we regress the response variable on the resulting scores, as discussed in Section 4.2. In Section 4.3 we propose robustified influence diagnostic measures to detect which observations have strong influence. In Section 4.4 the practical use of these measures is illustrated by means of a real dataset.

4.2 Estimation of the Functional Parameter $\beta$

The functional linear model with a scalar response is a regression model in which the regressor is a random curve and the response is a real random variable defined on the same probability space. We assume that $(X, y)$ is a pair of random variables, where $X = (X(t))$, $X \in L^2(T)$, $t \in T = [t_{min}, t_{max}] \subset \mathbb{R}$, and $y$ is a real random variable. For ease of computation we assume that both $X$ and $y$ are centered, i.e., $E[X(t)] = 0$ and $E[y] = 0$.
Assuming $E(\|X\|^2) < \infty$, the dependence between the scalar response $y$ and the functional random variable $X$ is written as

    $y = \langle X, \beta \rangle + \varepsilon = \int_T X(t)\,\beta(t)\,dt + \varepsilon$,    (4.1)

where $\langle \cdot,\cdot \rangle$ denotes the $L^2(T)$ inner product, $\beta$ is a square integrable function defined on $T$, and the error $\varepsilon$ is a real random variable with $E[\varepsilon] = 0$, $E[X(t)\varepsilon] = 0$ and finite variance $\sigma^2$. Suppose that a random sample of pairs $(X_i, y_i)$, $i = 1,\ldots,n$, is observed, where the $X_i$ and $y_i$ are realizations of $X$ and $y$, respectively.

The estimate of $\beta$ can be obtained by finding the $\beta$ that minimizes the residual sum of squares

    $\mathrm{RSS}(\beta) = \sum_{i=1}^{n} \big( y_i - \langle X_i, \beta \rangle \big)^2$.    (4.2)

Since $\beta$ is a functional parameter, this minimization has a high dimensionality problem; it can be handled by a principal component approach. The sample covariance operator of $X$, denoted by $\Gamma_X$, allows a spectral decomposition into orthonormal eigenfunctions $\gamma_1, \gamma_2, \ldots$ [33]. By Mercer's theorem, an orthogonal expansion of $\Gamma_X$ in $L^2$ is given by

    $\Gamma_X(s,t) = \sum_{K=1}^{\infty} \lambda_K\, \gamma_K(s)\, \gamma_K(t)$,    (4.3)

with ordered nonnegative eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0 = \lambda_{n+1} = \cdots$. The sequence of eigenvalue-eigenfunction pairs satisfies the eigenequation $\Gamma_X \gamma_K = \lambda_K \gamma_K$ for $K \geq 1$, with $\langle \gamma_K, \gamma_l \rangle = 1$ if $K = l$ and $\langle \gamma_K, \gamma_l \rangle = 0$ otherwise.

By the Karhunen-Loève expansion [1], the functional variables $X_i$ and the functional parameter $\beta$ can be written in terms of the eigenfunctions $\gamma_K$ in the following manner:

    $X_i = \sum_{K=1}^{\infty} \xi_{iK}\, \gamma_K$,    (4.4)

    $\beta = \sum_{K=1}^{\infty} \beta_K\, \gamma_K$,    (4.5)

where $\xi_{iK} = \langle X_i, \gamma_K \rangle$ (with $\xi_{iK} = 0$ for $K > n$) and $\beta_K = \langle \beta, \gamma_K \rangle$, for $i = 1,\ldots,n$ and $K = 1, 2, \ldots$. The eigenfunctions form an orthonormal basis of the functional space $L^2(T)$ [30].

Using these expansions of $X_i$ and $\beta$, the residual sum of squares in equation (4.2) can be written as

    $\mathrm{RSS}(\beta) = \sum_{i=1}^{n} \Big( y_i - \sum_{K=1}^{n} \xi_{iK}\beta_K \Big)^2$.    (4.6)

The dimension of $\beta$ in equation (4.6) is thus reduced from $\infty$ to $n$, but minimizing this expression would give a perfect fit of the response variable.
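For curves observed on a common, equally spaced grid, the decomposition (4.3) and the scores in (4.4) can be approximated by an eigendecomposition of the discretized sample covariance matrix. The following Python sketch illustrates this; the function name and the quadrature-weight convention are ours, and any FPCA implementation could be substituted:

```python
import numpy as np

def fpca(X, dt):
    """Empirical version of (4.3)-(4.4) on a grid: eigendecompose the
    discretized sample covariance operator of the centered curves X
    (an n x p array, one curve per row, grid spacing dt) and return
    eigenvalues lam_K, eigenfunctions gamma_K and scores xi_iK."""
    n, _ = X.shape
    Xc = X - X.mean(axis=0)              # enforce E[X(t)] = 0
    M = (Xc.T @ Xc) / n * dt             # Gamma_X on the grid (quadrature weight dt)
    lam, V = np.linalg.eigh(M)
    order = np.argsort(lam)[::-1]        # lambda_1 >= lambda_2 >= ...
    lam, V = lam[order], V[:, order]
    gamma = V / np.sqrt(dt)              # <gamma_K, gamma_K> = 1 in L2(T)
    scores = Xc @ gamma * dt             # xi_iK = <X_i, gamma_K>
    return lam, gamma, scores
```

With this normalization the sample variance of the $K$th score column equals $\lambda_K$ and the score columns are uncorrelated, which is exactly what the regression step below relies on.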
To avoid this problem, the alternative method proposed by Cardot et al. [4] to estimate $\beta$ is used. In this method $\beta_K = 0$ for $K \geq k_n + 1$, where $0 < k_n < n$ and $\lambda_{k_n} > 0$. The coefficients $\beta_K$, for $K = 1,\ldots,k_n$, are obtained by minimizing the residual sum of squares

    $\mathrm{RSS}(\beta_{1:k_n}) = \sum_{i=1}^{n} \Big( y_i - \sum_{K=1}^{k_n} \xi_{iK}\beta_K \Big)^2 = \| Y - \Xi_{(1:k_n)}\beta_{(1:k_n)} \|^2$,    (4.7)

where $Y = (y_1,\ldots,y_n)'$ is the $n \times 1$ response vector, $\beta_{(1:k_n)} = (\beta_1,\ldots,\beta_{k_n})'$ is a $k_n \times 1$ vector, and $\Xi_{(1:k_n)}$ is the $n \times k_n$ score matrix whose $K$th column is the vector $\xi_{\cdot K} = (\xi_{1K},\ldots,\xi_{nK})'$, the $K$th principal component score, which satisfies $\mathrm{Var}(\xi_{\cdot K}) = \lambda_K$ and $\widehat{\mathrm{cov}}(\xi_{\cdot K}, \xi_{\cdot m}) = 0$ for $K \neq m$. The least-squares estimate of $\beta_{(1:k_n)}$ is then given by

    $\hat{\beta}_{(1:k_n)} = (\Xi_{(1:k_n)}'\Xi_{(1:k_n)})^{-1}\Xi_{(1:k_n)}'Y$,

where $\Xi_{(1:k_n)}'\Xi_{(1:k_n)}$ is a $k_n \times k_n$ diagonal matrix whose $(K,K)$th element is $\xi_{\cdot K}'\xi_{\cdot K} = n\lambda_K$, with $\xi_{\cdot K}'\xi_{\cdot m} = 0$ for $K \neq m$, and $\Xi_{(1:k_n)}'Y$ is a $k_n \times 1$ vector whose $K$th element is $n\,\widehat{\mathrm{cov}}(\xi_{\cdot K}, Y)$. Hence

    $\hat{\beta}_{(1:k_n)} = \Big( \frac{\xi_{\cdot 1}'Y}{n\lambda_1}, \ldots, \frac{\xi_{\cdot k_n}'Y}{n\lambda_{k_n}} \Big)'$.    (4.8)

This expression defines the least-squares estimate of the slope $\beta$, denoted by $\hat{\beta}_{(k_n)}$, as follows:

    $\hat{\beta}_{(k_n)} = \sum_{K=1}^{k_n} \hat{\beta}_K\, \gamma_K = \sum_{K=1}^{k_n} \frac{\xi_{\cdot K}'Y}{n\lambda_K}\, \gamma_K = \sum_{K=1}^{k_n} \frac{\widehat{\mathrm{cov}}(\xi_{\cdot K}, Y)}{\lambda_K}\, \gamma_K$.    (4.9)

$\hat{\beta}_{(k_n)}$ is the estimator of the functional regression parameter obtained by classical functional principal component regression (CFPCR). Cardot et al. [4] showed that, under several conditions, $\langle X, \hat{\beta}_{(k_n)} \rangle$ converges in probability and almost surely to $\langle X, \beta \rangle$.

The principal components become rougher as $K$ increases, so $k_n$ acts as a smoothing parameter which has to be determined. There are many ways to determine the value of $k_n$; here the selection criterion is based on the eigenvalues, with a predetermined threshold of 90%: the optimal number of components $k_n$ is the minimal value for which the cumulative percentage of total variance is at least 90%.

If a dataset contains outliers, then the spectral decomposition of the covariance in (4.3) is unreliable, and consequently neither of the expansions (4.4) and (4.5) is reliable.
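The closed forms (4.8)-(4.9), together with the 90% rule for choosing $k_n$, can be sketched in a few lines of Python. The sketch (names ours) assumes centered data and the outputs of some FPCA step with $\mathrm{Var}(\xi_{\cdot K}) = \lambda_K$, as in (4.7):

```python
import numpy as np

def cfpcr(scores, lam, gamma, y, threshold=0.90):
    """CFPCR estimate (4.9): beta_hat = sum_{K<=kn} (xi_.K' y / (n lam_K)) gamma_K.
    scores, lam, gamma come from an FPCA of the (centered) curves; y is the
    centered scalar response. kn is the smallest K whose cumulative share of
    total variance reaches `threshold` (the 90% rule of the text)."""
    n = len(y)
    pos = lam[lam > 1e-12]                           # drop numerically zero eigenvalues
    frac = np.cumsum(pos) / pos.sum()
    kn = int(np.searchsorted(frac, threshold) + 1)   # minimal kn with frac >= threshold
    beta_k = scores[:, :kn].T @ y / (n * lam[:kn])   # coefficients (4.8)
    beta_hat = gamma[:, :kn] @ beta_k                # (4.9) evaluated on the grid
    return beta_hat, beta_k, kn
```

Because the score columns are uncorrelated with $\xi_{\cdot K}'\xi_{\cdot K} = n\lambda_K$, no matrix inversion is needed: each coefficient is a simple ratio, which is what makes the closed form (4.8) possible.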
To obtain a robust estimate of $\hat{\beta}_{(k_n)}$ we employ a robust principal components method (e.g., BACONPCA, Section 2.2.2) on $X$. This yields robust scores $\xi^{(r)}_{iK}$, $i = 1,\ldots,n$, $K = 1,\ldots$, and robust eigenfunctions $\gamma^{(r)}_K$, $K = 1,\ldots$. Using these robust values, the expansions in equations (4.4) and (4.5) can be written as

    $X^{(r)}_i = \sum_{K=1}^{\infty} \xi^{(r)}_{iK}\, \gamma^{(r)}_K$,    (4.10)

    $\beta^{(r)} = \sum_{K=1}^{\infty} \beta^{(r)}_K\, \gamma^{(r)}_K$.    (4.11)

Both expansions allow us to write the residual sum of squares in equation (4.7) as

    $\mathrm{RSS}(\beta^{(r)}_{1:k_n}) = \sum_{i=1}^{n} \Big( y_i - \sum_{K=1}^{k_n} \xi^{(r)}_{iK}\beta^{(r)}_K \Big)^2$.    (4.12)

Minimizing this residual sum of squares gives the robust estimate of the slope $\beta_{(k_n)}$, denoted by $\hat{\beta}^{(r)}_{(k_n)}$:

    $\hat{\beta}^{(r)}_{(k_n)} = \sum_{K=1}^{k_n} \hat{\beta}^{(r)}_K\, \gamma^{(r)}_K = \sum_{K=1}^{k_n} \frac{\xi^{(r)\prime}_{\cdot K} Y}{n \lambda^{(r)}_K}\, \gamma^{(r)}_K$.    (4.13)

$\hat{\beta}^{(r)}_{(k_n)}$ is the estimator of the functional regression parameter obtained by robust functional principal component regression (RFPCR).

4.3 Functional Regression Diagnostic Measures

The $i$th observation $(X_i, y_i)$ is called influential if its deletion would lead to a noticeable change in the regression parameter estimates. To detect the presence of influential observations or outliers, diagnostic measures defined for the ordinary linear regression model can be extended to the functional linear model with scalar response [16]. We start by defining robustified versions of residuals, fitted values and leverages, and then other diagnostic measures such as Cook's D [8] and Hadi's potential-residual measure [17] for functional data. As in the standard regression model, the fitted values and residuals are useful for defining measures of the influence of a single observation in the functional linear model with scalar response.

Fitted Values and Leverages

Let the fitted values of the response variable, denoted by $\hat{y}_i$, be obtained from equations (4.1) and (4.9) in the following manner:

    $\hat{y}_i = \langle X_i, \hat{\beta}_{(k_n)} \rangle = \sum_{K=1}^{k_n} \xi_{iK}\hat{\beta}_K = \sum_{K=1}^{k_n} \xi_{iK}\, \frac{\xi_{\cdot K}' Y}{n \lambda_K}$,    (4.14)

for $i = 1,\ldots,n$, which defines the $n \times 1$ vector of fitted values $\hat{Y} = (\hat{y}_1,\ldots,\hat{y}_n)'$. In matrix form, $\hat{Y} = H_{(k_n)} Y$, where $H_{(k_n)}$ is the $n \times n$ hat matrix, given by

    $H_{(k_n)} = \Xi_{(1:k_n)}(\Xi_{(1:k_n)}'\Xi_{(1:k_n)})^{-1}\Xi_{(1:k_n)}' = \tilde{\Xi}_{(1:k_n)}\tilde{\Xi}_{(1:k_n)}'$,

where $\tilde{\Xi}_{(1:k_n)}$ is the $n \times k_n$ matrix whose $K$th column is $\tilde{\xi}_{\cdot K} = \xi_{\cdot K}/\sqrt{n\lambda_K}$, so that

    $H_{(k_n)} = \frac{1}{n}\Big( \frac{\xi_{\cdot 1}\xi_{\cdot 1}'}{\lambda_1} + \cdots + \frac{\xi_{\cdot k_n}\xi_{\cdot k_n}'}{\lambda_{k_n}} \Big)$.    (4.15)

The diagonal elements of $H_{(k_n)}$ are the leverage values, denoted by $H_{(k_n),ii}$ and given by

    $H_{(k_n),ii} = \frac{1}{n}\Big( \frac{\xi_{i1}^2}{\lambda_1} + \cdots + \frac{\xi_{ik_n}^2}{\lambda_{k_n}} \Big)$,    (4.16)

where $0 \leq H_{(k_n),ii} \leq 1$ and $\mathrm{Trace}(H_{(k_n)}) = k_n$. The leverage can be used as a quick way to flag observations that are influential for prediction: the smaller the hat diagonal of the pair $(X_i, y_i)$, the better $y_i$ is predicted. If the leverage value $H_{(k_n),ii}$ of an observation $(X_i, y_i)$ exceeds $2k_n/n$, then that observation might be influential. However, an observation with a high leverage value is not necessarily influential. Observations with high leverage values are outliers in the $X$-space, but the converse is not necessarily true.

Residuals (Ordinary and other)

The residuals are defined as $e = Y - \hat{Y} = (I_n - H_{(k_n)})Y$, where $I_n$ is the $n \times n$ identity matrix. Using equation (4.1), the relationship between $\varepsilon$ and $e$ can be established in the following manner:

    $e = (I_n - H_{(k_n)})Y = \Xi_{(k_n+1:n)}\beta_{(k_n+1:n)} + (I_n - H_{(k_n)})\varepsilon$,    (4.17)

where $\Xi_{(k_n+1:n)}$ is the $n \times (n - k_n)$ matrix whose columns are the vectors $\xi_{\cdot K}$, for $K = k_n+1,\ldots,n$, and $\beta_{(k_n+1:n)} = (\beta_{k_n+1},\ldots,\beta_n)'$. Since $\varepsilon$ has zero mean and covariance $\sigma^2 I_n$, the vector of residuals $e$ conditional on $X_1,\ldots,X_n$ has mean $\Xi_{(k_n+1:n)}\beta_{(k_n+1:n)}$ and covariance $\sigma^2(I_n - H_{(k_n)})$. If $n$ is large, the term $\Xi_{(k_n+1:n)}\beta_{(k_n+1:n)}$ can be omitted [5, 18].
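Because $\Xi_{(1:k_n)}'\Xi_{(1:k_n)}$ is diagonal, the leverages (4.16) and the residuals (4.17) can be computed directly from the score matrix without any matrix inversion. A minimal sketch (function names are ours):

```python
import numpy as np

def leverages(scores, lam, kn):
    """Hat-matrix diagonal (4.16): H_ii = (1/n) sum_{K<=kn} xi_iK^2 / lam_K."""
    n = scores.shape[0]
    return (scores[:, :kn] ** 2 / lam[:kn]).sum(axis=1) / n

def residuals(scores, lam, y, kn):
    """Residuals e = (I - H_(kn)) Y, via the diagonal-score form of (4.8)."""
    n = len(y)
    beta_k = scores[:, :kn].T @ y / (n * lam[:kn])
    return y - scores[:, :kn] @ beta_k

# flagging rule from the text: an observation with
# leverage H_ii > 2*kn/n might be influential
```

Feeding these functions robust scores and eigenvalues instead of the classical ones gives the robust leverages and residuals used later in this section.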
Since $\mathrm{Trace}(I_n - H_{(k_n)}) = n - k_n$,

    $E[e'e \mid X_1,\ldots,X_n] = n\big( \beta_{k_n+1}^2\lambda_{k_n+1} + \cdots + \beta_n^2\lambda_n \big) + (n - k_n)\sigma^2$.    (4.18)

Therefore, the error variance $\sigma^2$ may be estimated by the functional residual variance estimate $\hat{s}^2$, given by

    $\hat{s}^2 = \frac{e'e}{n - k_n}$.    (4.19)

Internally studentized residual

The squared internally studentized residual is given by

    $r_i^2 = \frac{e_i^2}{\hat{s}^2 (1 - H_{(k_n),ii})}$.    (4.20)

Cook's D Measure

Cook's distance is the standardized difference between the estimates of $\beta$ with and without the observation $(X_i, y_i)$:

    $CP_i = \frac{(\hat{y} - \hat{y}_{(-i,k_n)})'(\hat{y} - \hat{y}_{(-i,k_n)})}{k_n \hat{s}^2}$,    (4.21)

where $\hat{y}_{(-i,k_n)}$ denotes the prediction of the response vector $y$ when the $i$th observation $(X_i, y_i)$ is excluded from the estimation. A high value of $CP_i$ indicates that the $i$th observation influences the estimated responses, in the sense that deleting it from the dataset alters the predicted values. Computing $CP_i$ from equation (4.21) requires $n$ standard linear regressions, each with one observation deleted. However, Cook's distance can also be written as a function of quantities related to the full dataset:

    $CP_i = \frac{e_i^2}{k_n \hat{s}^2 (1 - H_{(k_n),ii})} \cdot \frac{H_{(k_n),ii}}{1 - H_{(k_n),ii}} = \frac{r_i^2\, H_{(k_n),ii}}{k_n (1 - H_{(k_n),ii})}$,    (4.22)

where $r_i^2$ is the $i$th squared internally studentized residual. Observations with large residual values $r_i^2$ are called outliers.

Hadi's Potential-Residual Measure

This measure combines two quantities: $H_{(k_n),ii}$, which provides information about high-leverage points, and $r_i$, which contains information about outliers. Both high leverage points and outliers may influence the regression results and the conclusions based on them. High leverage points are likely to have small residuals, so examining residuals alone is not sufficient to find influential observations. Cook's measure is a multiplicative function of the residual and the leverage value. The drawback of a multiplicative function is that when one of the two components is small or near zero, it suppresses the other: if either component is very small, the multiplicative measure is also small.
Observations with large leverage values are likely to have small residuals, which results in small values of a multiplicative measure and may lead to the incorrect conclusion that these observations are not influential. An additive measure that is a function of the residual and the leverage value, by contrast, is large if either or both components are large. Instead of a multiplicative measure we therefore use the additive measure of influence suggested by Hadi [17, 6], defined as

    $HM_i^2 = \frac{k_n}{1 - H_{(k_n),ii}} \cdot \frac{d_i^2}{1 - d_i^2} + \frac{H_{(k_n),ii}}{1 - H_{(k_n),ii}}$,    (4.23)

$i = 1,\ldots,n$, where $d_i^2 = e_i^2 / e'e$ is the square of the $i$th normalized residual and $\sum d_i^2 = 1$. The first component is a function of the $i$th normalized residual. The second component is the ratio of $\mathrm{Var}(\hat{y}_i)$ to $\mathrm{Var}(e_i)$ and is known as the "potential". $HM_i^2$ is monotonically increasing in both the leverage value and the squared residual; therefore a large value of $HM_i^2$ may be due to a large value of $e_i$, a large value of $H_{(k_n),ii}$, or both. The additive measure can identify outliers in the $X$-space, in the $y$-space, or in both.

The cut-off value for $HM_i^2$ suggested by Hadi [17] is $\mathrm{mean}(HM_i^2) + c\,\sqrt{\mathrm{Var}(HM_i^2)}$. Since the mean and variance are nonrobust, the median and the median absolute deviation are used instead to estimate the cut-off value.

The Potential-Residual (P-R) plot can be used to distinguish between regular observations, outliers and leverage points (Figure 4.1).

[Figure 4.1: Potential-Residual (P-R) plot.]

This diagnostic plot is a scatter plot of $\frac{H_{(k_n),ii}}{1 - H_{(k_n),ii}}$ versus $\frac{k_n}{1 - H_{(k_n),ii}} \cdot \frac{d_i^2}{1 - d_i^2}$. Since the first component is known as the potential and the second component is a function of the residual, the plot is referred to as the Potential-Residual (P-R) plot. In this plot, high leverage points are located in the upper left area and observations with high prediction error are located in the lower right area [17].
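The quantities (4.19), (4.20), (4.22) and (4.23), together with the median/MAD cut-off just described, depend only on the residuals, the leverages and $k_n$; the same code therefore serves the robustified measures of the next section when fed robust residuals and leverages. A minimal sketch (names ours; the constant 1.4826 makes the MAD consistent at the normal, a standard choice the text does not spell out):

```python
import numpy as np

def influence_measures(e, h, kn):
    """Full-sample diagnostics of Section 4.3:
    s2   -- functional residual variance estimate (4.19)
    r2   -- squared internally studentized residuals (4.20)
    cook -- Cook's distance in its full-data form (4.22)
    hm2  -- Hadi's additive potential-residual measure (4.23)."""
    n = len(e)
    s2 = e @ e / (n - kn)
    r2 = e**2 / (s2 * (1.0 - h))
    cook = r2 * h / (kn * (1.0 - h))
    d2 = e**2 / (e @ e)                  # squared normalized residuals
    hm2 = kn * d2 / ((1.0 - h) * (1.0 - d2)) + h / (1.0 - h)
    return s2, r2, cook, hm2

def robust_cutoff(hm2, c=3.0):
    """Median/MAD cut-off for HM2, replacing the nonrobust
    mean + c*sqrt(var) rule of Hadi [17], as suggested in the text."""
    med = np.median(hm2)
    return med + c * 1.4826 * np.median(np.abs(hm2 - med))
```

The identity between the deletion form (4.21) and the full-data form (4.22) is the standard least-squares result, so no case-deletion refits are needed in practice.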
If the dataset contains outliers, then all the diagnostic measures defined above are sensitive to them. To obtain robust estimates of these diagnostic measures we apply a robust principal components method (e.g., Section 2.2.2) to $X$, which yields robust eigenfunctions $\gamma^{(r)}_K$, robust eigenvalues $\lambda^{(r)}_K$, $K = 1,\ldots$, and robust scores $\xi^{(r)}_{iK}$, $i = 1,\ldots,n$, $K = 1,\ldots$. Using these robust values, the expressions in equations (4.14) and (4.16) can be redefined. The robust fitted values are given by

    $\hat{y}^{(r)}_i = \langle X_i, \hat{\beta}^{(r)}_{(k_n)} \rangle = \sum_{K=1}^{k_n} \xi^{(r)}_{iK}\, \frac{\xi^{(r)\prime}_{\cdot K} Y}{n \lambda^{(r)}_K}$.    (4.24)

The robust leverages are defined as

    $H^{(r)}_{(k_n),ii} = \frac{1}{n}\Big( \frac{\xi^{(r)2}_{i1}}{\lambda^{(r)}_1} + \cdots + \frac{\xi^{(r)2}_{ik_n}}{\lambda^{(r)}_{k_n}} \Big)$.    (4.25)

The robust version of Cook's measure can then be defined as

    $CP^{(r)}_i = \frac{e^{(r)2}_i}{k_n \hat{s}^{(r)2} (1 - H^{(r)}_{(k_n),ii})} \cdot \frac{H^{(r)}_{(k_n),ii}}{1 - H^{(r)}_{(k_n),ii}}$,    (4.26)

where $e^{(r)}_i = y_i - \hat{y}^{(r)}_i$, $i = 1,\ldots,n$, is the $i$th robust residual and $\hat{s}^{(r)2}$ is computed from the robust residuals as in (4.19). The robust version of Hadi's measure is given by

    $HM^{2(r)}_i = \frac{k_n}{1 - H^{(r)}_{(k_n),ii}} \cdot \frac{d^{2(r)}_i}{1 - d^{2(r)}_i} + \frac{H^{(r)}_{(k_n),ii}}{1 - H^{(r)}_{(k_n),ii}}$,    (4.27)

where $d^{2(r)}_i = e^{(r)2}_i / e^{(r)\prime} e^{(r)}$ is the square of the $i$th normalized robust residual. Section 4.4 compares the classical and robustified versions of the diagnostic measures on a real dataset.

4.4 Numerical Example

In this section we demonstrate the practical use of the proposed influence measures on a dataset analyzed by Ramsay and Silverman [32]. The data come from the thirty-five Canadian weather stations listed in Table 4.1. The regressor set of curves consists of the mean daily temperatures (in degrees Celsius) and the response is the logarithm (base 10) of the total annual precipitation (in mm). The temperature is assumed to be periodic because of its cyclical behavior over the year; therefore we used Fourier basis functions.
Figure 4.2(a) shows the curves, built from 65 Fourier basis functions, while a boxplot of the log-precipitation data is shown in Figure 4.2(b). The curves are assumed to be observed on the interval $[t_{\min}, t_{\max}] = [0, 1]$; the time unit is one year, discretized into 365 days. For this analysis both the regressor and the response variable are centered by subtracting their means.

To select the cutoff value $k_n$ we estimate the coefficients by the least squares method and then compute the total variance. The optimal number of components is $k_n = 2$, as the cumulative percentage of total variance for two components is 96%.

Orthogonal-score plots for two components, constructed using CPCA and BACONPCA, are given in Figure 4.3. The orthogonal-score plot for CPCA indicates no bad leverage points, while BACONPCA reveals three bad leverage points (21, 11 and 10). The estimated beta functions from classical and robust FPCA and the two eigenfunctions from CFPCR and RFPCR are depicted in Figure 4.4. The estimated beta function and the first eigenfunction differ between the two methods, while the second eigenfunctions do not differ much. This clearly indicates that the classical method is not robust to outliers and that a robust method is needed to estimate the parameter function [20].

Table 4.2 contains the values of the diagnostic measures. Residual plots by the classical (Figure 4.5(a)) and robust (Figure 4.5(b)) methods show that observation 12 (Kamloops) is an outlier, as it has a high residual value compared to the other stations. From both figures we can see that the residual values are much larger for the robust method than for the classical method. The leverages, the diagonal elements of the hat matrix, indicate which stations might be influential a priori. We can clearly see that the leverage values for observations 21 (Resolute), 11 (Iqaluit) and 10 (Inuvik) computed by the robust method are larger than those from the classical method (Figure 4.6(a)). The influential observations detected by Cook's measure are 21 (Resolute) and 11 (Iqaluit) using the robust method, and only 21 (Resolute) using the classical method (Figure 4.6(b)); observations 21 and 11 have much higher $CP^{(r)}_i$ values than $CP_i$ values. The cut-off value for $HM_i^2$ is 0.43 and for $HM_i^{2(r)}$ is 0.46. The influential observations detected by Hadi's influence measure are 21 (Resolute) and 11 (Iqaluit) using the robust method and 21 (Resolute) using the classical method (Figure 4.6(c)); again, observations 21 and 11 have much higher $HM^{2(r)}_i$ values than $HM_i^2$ values.

Comparing Hadi's measure with Cook's measure for the classical method, we can conclude that Hadi's measure is superior at distinguishing influential observations from regular observations: observation 11 is clearly separated from the regular observations by Hadi's measure (Figure 4.6(c)) but not by Cook's measure (Figure 4.6(b)). Therefore the additive form used by Hadi's measure is more efficient than the multiplicative form used by Cook's measure.

Table 4.1: Names of the Canadian weather stations
1) Arvida    2) Bagottvi   3) Calgary   4) Charlott   5) Churchil   6) Dawson    7) Edmonton
8) Frederic  9) Halifax   10) Inuvik   11) Iqaluit   12) Kamloops  13) London   14) Montreal
15) Ottawa  16) Princeal  17) Princege 18) Princeru  19) Quebec    20) Regina   21) Resolute
22) Scheffer 23) Sherbroo 24) Stjohns  25) Sydney    26) Thepas    27) Thunderb 28) Toronto
29) Uraniumc 30) Vancouvr 31) Victoria 32) Whitehor  33) Winnipeg  34) Yarmouth 35) Yellowkn
The P-R plot for the classical method (Figure 4.7(a)) indicates that the five observations with high values of $HM_i^2$ can be classified as follows: observation 21 is a high leverage point, observation 12 is an outlier, and observations 11, 18 and 31 are combinations of both. The P-R plot for the robust method indicates that the three observations with large $HM_i^{2(r)}$ values can be classified as follows: observation 21 is a high leverage point, observation 12 is an outlier, and observation 11 is a combination of both (Figure 4.7(b)). The P-R plot thus helps to separate outliers, leverage points and regular observations. Comparing the classical and robust P-R plots (Figure 4.7(c)), we find that observations 21, 12 and 11 yield much larger values under the robust method than under the classical method. The observations identified as bad leverage points in the orthogonal score plots (Figure 4.3), and those identified as outliers and leverage points in the P-R plots by both methods (Figure 4.7), are highlighted in Figure 4.8.
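The visual classification read off the P-R plots can also be mimicked numerically. The following sketch is our own illustration, not part of the thesis: it computes the two P-R coordinates and flags points that are extreme on one axis, the other, or both, using per-axis median/MAD cut-offs (the cut-off rule and the labels are ours):

```python
import numpy as np

def pr_classify(e, h, kn, c=3.0):
    """Sketch of the P-R plot logic: the residual axis is
    kn/(1-h) * d^2/(1-d^2) and the potential axis is h/(1-h).
    Points extreme on one axis only are outliers or leverage points;
    points extreme on both are combinations."""
    d2 = e**2 / (e @ e)
    res = kn * d2 / ((1.0 - h) * (1.0 - d2))   # residual component
    pot = h / (1.0 - h)                        # potential component

    def cut(x):                                # median + c * MAD cut-off per axis
        med = np.median(x)
        return med + c * 1.4826 * np.median(np.abs(x - med))

    hi_r, hi_p = res > cut(res), pot > cut(pot)
    return np.where(hi_r & hi_p, "combination",
           np.where(hi_r, "outlier",
           np.where(hi_p, "leverage", "regular")))
```

In the Canadian weather example this kind of rule would put Kamloops on the residual side, Resolute on the potential side, and Iqaluit in between, matching the visual reading of Figure 4.7.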
Table 4.2: Classical and robust influence measures for the Canadian weather stations

  n    e_i    e_i^(r)   H_ii    H_ii^(r)   CP_i   CP_i^(r)   HM2_i   HM2_i^(r)
  1    0.16    0.11     0.01     0.01      0.01    0.00      0.02     0.01
  2    0.19    0.16     0.01     0.01      0.01    0.01      0.01     0.01
  3   -0.30   -0.33     0.01     0.00      0.01    0.01      0.02     0.01
  4    0.13    0.09     0.01     0.01      0.00    0.00      0.01     0.01
  5    0.06    0.23     0.07     0.15      0.00    0.24      0.07     0.19
  6    0.03    0.08     0.10     0.12      0.00    0.02      0.12     0.13
  7   -0.14   -0.16     0.00     0.00      0.00    0.00      0.00     0.00
  8    0.15    0.08     0.01     0.01      0.00    0.00      0.01     0.01
  9    0.18    0.13     0.02     0.02      0.01    0.01      0.02     0.02
 10   -0.03    0.16     0.10     0.20      0.00    0.16      0.12     0.24
 11    0.05    0.32     0.17     0.34      0.01    1.62      0.20     0.53
 12   -0.57   -0.71     0.04     0.05      0.21    0.63      0.22     0.28
 13    0.02   -0.08     0.02     0.03      0.00    0.00      0.02     0.03
 14    0.09   -0.01     0.03     0.04      0.00    0.00      0.03     0.04
 15    0.09   -0.01     0.03     0.04      0.00    0.00      0.03     0.04
 16   -0.08   -0.10     0.04     0.04      0.00    0.01      0.04     0.04
 17   -0.13   -0.16     0.01     0.01      0.00    0.00      0.01     0.01
 18    0.26    0.25     0.15     0.14      0.18    0.25      0.18     0.17
 19    0.24    0.18     0.01     0.01      0.01    0.01      0.02     0.01
 20   -0.18   -0.25     0.04     0.04      0.02    0.05      0.04     0.04
 21   -0.26    0.15     0.41     0.77      1.03    7.38      0.71     3.42
 22    0.27    0.41     0.04     0.10      0.05    0.45      0.05     0.14
 23    0.18    0.13     0.01     0.00      0.00    0.00      0.01     0.00
 24    0.14    0.16     0.08     0.08      0.02    0.05      0.08     0.09
 25    0.17    0.15     0.03     0.03      0.01    0.02      0.04     0.03
 26   -0.01   -0.02     0.04     0.04      0.00    0.00      0.04     0.04
 27    0.04    0.02     0.00     0.00      0.00    0.00      0.00     0.00
 28   -0.07   -0.17     0.02     0.03      0.00    0.02      0.02     0.03
 29    0.01    0.06     0.07     0.08      0.00    0.01      0.07     0.09
 30   -0.11   -0.21     0.10     0.09      0.02    0.10      0.11     0.10
 31   -0.26   -0.33     0.12     0.11      0.15    0.31      0.14     0.13
 32   -0.30   -0.26     0.00     0.02      0.01    0.02      0.02     0.02
 33   -0.01   -0.08     0.06     0.06      0.00    0.01      0.06     0.07
 34    0.05    0.00     0.05     0.05      0.00    0.00      0.05     0.05
 35   -0.07    0.00     0.08     0.10      0.01    0.00      0.08     0.11

(Here $e_i$ are the residuals, $H_{ii} = H_{(k_n),ii}$ the leverages, $CP_i$ Cook's measure and $HM_i^2$ Hadi's measure; the superscript $(r)$ marks the robust versions.)

[Figure 4.2: (a) Sample curves (X-data) of the Canadian data; (b) boxplot of the log-precipitation (y-data).]

[Figure 4.3: Diagnostic plots of the Canadian dataset: (a) classical score plot; (b) robust score plot (BACONPCA).]

[Figure 4.4: (a) Estimated beta function by robust and classical FPCA; (b) first eigenfunction and (c) second eigenfunction by robust and classical FPCR.]

[Figure 4.5: Fitted values against the residuals: (a) classical; (b) robust.]
[Figure 4.6: (a) Leverages; (b) Cook's influence measures; (c) Hadi's influence measures (classical (o) and robust (+)).]

[Figure 4.7: P-R plots: (a) classical (o); (b) robust (+); (c) classical (o) and robust (+).]

[Figure 4.8: Sample curves of the Canadian data.]

Chapter 5

Conclusion

Robust PCA based functional data analysis has been developed for dimension reduction. This method can also be used to detect functional outliers and to classify these outliers into three classes. An extensive simulation study was conducted and a real dataset was used to assess the performance of the robust FPCA.
From this simulation study, based on different contamination configurations (symmetric, asymmetric, partial and peak), we concluded that robust PCA based functional data analysis yields better results than CPCA based functional data analysis.

In addition, we have developed a robust FPCR method and constructed robustified influence regression diagnostic measures. The practical use of robust FPCR and the diagnostic measures was illustrated by means of a real data example. In view of this example, we concluded that the proposed robust FPCR method performs much better than classical FPCR, and that the robustified influence measures are more useful diagnostic tools for detecting heterogeneity in the functional regression model with scalar response.

As future work, different methods to estimate the beta function can be explored, and a simulation study can be conducted to assess the performance of robust FPCR in the presence of outliers under a variety of scenarios. The optimal number of basis functions is an important issue in FDA; since the least squares method is sensitive to outliers, the choice of the number of basis functions is also affected by outliers, and a robust version of this selection criterion can be developed.

Bibliography

[1] Ash, R. B. and Gardner, M. F., Topics in Stochastic Processes. New York: Academic Press, 1975.
[2] Billor, N., Hadi, A. S. and Velleman, P. F., "BACON: blocked adaptive computationally efficient outlier nominators", Computational Statistics and Data Analysis, 34, 279-298, 2000.
[3] Billor, N., Kiral, G. and Turkmen, A., "Outlier Detection Using Principal Components", Twelfth International Conference on Statistics, Combinatorics, Mathematics and Applications, Auburn, 2005. (Unpublished manuscript.)
[4] Cardot, H., Ferraty, F. and Sarda, P., "Functional linear model", Statistics and Probability Letters, 45, 11-22, 1999.
[5] Cardot, H., Ferraty, F. and Sarda, P., "Spline estimators for the functional linear model", Statistica Sinica, 13, 571-591, 2003.
[6] Chatterjee, S. and Hadi, A. S., Sensitivity Analysis in Linear Regression. New York: Wiley, 1988.
[7] Chiou, J. M. and Müller, H. G., "Diagnostics for functional regression via residual processes", Computational Statistics and Data Analysis, 51, 4849-4863, 2006.
[8] Cook, R. D., "Detection of influential observations in linear regression", Technometrics, 19, 15-18, 1977.
[9] Craven, P. and Wahba, G., "Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation", Numerische Mathematik, 31, 377-403, 1979.
[10] Cuevas, A., Febrero, M. and Fraiman, R., "Linear functional regression: the case of fixed design and functional response", Canadian Journal of Statistics, 30, 285-300, 2002.
[11] Faraway, J. J., "Regression analysis for a functional response", Technometrics, 39, 254-261, 1997.
[12] Fraiman, R. and Muniz, G., "Trimmed means for functional data", Test, 10, 419-440, 2001.
[13] Febrero, M., Galeano, P. and González-Manteiga, W., "A functional analysis of NOx levels: location and scale estimation and outlier detection", Computational Statistics, 22, 411-427, 2007.
[14] Febrero, M., Galeano, P. and González-Manteiga, W., "Outlier detection in functional data by depth measures, with application to identify abnormal NOx levels", Environmetrics, 19, 331-345, 2008.
[15] Febrero, M., Galeano, P. and González-Manteiga, W., "Influence in the functional linear model with scalar response", in Functional and Operatorial Statistics, edited by Sophie Dabo-Niang and Frédéric Ferraty. Heidelberg: Physica-Verlag, 1st edition, 165-171, 2008.
[16] Febrero, M., Galeano, P. and González-Manteiga, W., "Measures of influence for the functional linear model with scalar response", Journal of Multivariate Analysis, in press, corrected proof, available online 25 December 2008.
[17] Hadi, A. S., "A new measure of overall potential influence in linear regression", Computational Statistics and Data Analysis, 14, 1-27, 1992.
[18] Hall, P. and Hosseini-Nasab, M., "On properties of functional principal components analysis", Journal of the Royal Statistical Society, Series B, 68, 109-126, 2006.
[19] Huber, P. J., "Projection pursuit", Annals of Statistics, 13, no. 2, 435-525, 1985.
[20] Hubert, M. and Verboven, S., "A robust PCR method for high-dimensional regressors", Journal of Chemometrics, 17, 438-452, 2003.
[21] Hubert, M. and Engelen, S., "Robust PCA and classification in biosciences", Bioinformatics, 20, 1728-1736, 2004.
[22] Hubert, M., Rousseeuw, P. J. and Branden, K. V., "ROBPCA: a new approach to robust principal component analysis", Technometrics, 47, no. 1, 64-79, 2005.
[23] Iglewicz, B. and Hoaglin, D. C., How to Detect and Handle Outliers. Milwaukee, WI: American Society for Quality Control, 1993.
[24] Jolliffe, I. T., Principal Component Analysis. New York: Springer-Verlag, 1986.
[25] Kolmogorov, A. N. and Fomin, S. V., Introductory Real Analysis [English translation]. Englewood Cliffs, NJ: Prentice-Hall, 1968.
[26] Levitin, D. J., Nuzzo, R. L., Vines, B. W. and Ramsay, J. O., "Introduction to functional data analysis", Canadian Psychology, 48, no. 3, 135-155, 2007.
[27] Locantore, N., Marron, J. S., Simpson, D. G., Tripoli, N., Zhang, J. T. and Cohen, K. L., "Robust principal component analysis for functional data", TEST, 8, no. 1, 1-73, 1999.
[28] López-Pintado, S. and Romo, J., "Depth based inference for functional data", Computational Statistics and Data Analysis, 51, no. 10, 4957-4968, 2007.
[29] Peña, D., "A new statistic for influence in linear regression", Technometrics, 47, 1-12, 2005.
[30] Ramsay, J. O. and Dalzell, C. J., "Some tools for functional data analysis", Journal of the Royal Statistical Society, Series B, 53, no. 3, 539-572, 1991.
[31] Ramsay, J. O. and Silverman, B. W., Functional Data Analysis Software, MATLAB edition. Online at http://www.psych.mcgill.ca/faculty/ramsay/software.html, 2001.
[32] Ramsay, J. O. and Silverman, B. W., Functional Data Analysis, second edition. New York: Springer-Verlag, 2005.
[33] Rice, J. A. and Silverman, B. W., "Estimating the mean and covariance structure nonparametrically when the data are curves", Journal of the Royal Statistical Society, Series B, 53, 233-243, 1991.
[34] Roman, D. R., "A Tool for Functional Data Analysis and Experimentation", M.S. thesis, University of Puerto Rico, 2005.
[35] Rousseeuw, P. J., "Least median of squares regression", Journal of the American Statistical Association, 79, 871-880, 1984.
[36] Rousseeuw, P. J., "Multivariate estimation with high breakdown point", in Mathematical Statistics and Applications, Vol. B, edited by W. Grossmann et al. Dordrecht: Reidel, 283-297, 1985.
[37] Rousseeuw, P. J. and Van Driessen, K., "A fast algorithm for the minimum covariance determinant estimator", Technometrics, 41, 212-223, 1999.
[38] Shen, Q. and Xu, H., "Diagnostics for linear models with functional responses", Technometrics, 49, 26-33, 2007.
[39] Zhang, J.-T., "Smoothed Functional Data Analysis", Ph.D. dissertation, The University of North Carolina at Chapel Hill, 1999.