Robust Statistical Methods for the Functional Logistic Model
by
Melody B. Denhere
A dissertation submitted to the Graduate Faculty of
Auburn University
in partial fulfillment of the
requirements for the Degree of
Doctor of Philosophy
Auburn, Alabama
August 3, 2013
Keywords: functional data, outliers, diagnostics, logistic regression
Copyright 2013 by Melody B. Denhere
Approved by
Nedret Billor, Chair, Associate Professor of Mathematics and Statistics
Asheber Abebe, Associate Professor of Mathematics and Statistics
Peng Zeng, Associate Professor of Mathematics and Statistics
Abstract
Over the last decade or so, considerable interest has emerged in the field of functional data
analysis. This interest spans a broad spectrum of fields such as brain imaging studies,
biometrics, genetics, e-commerce and computer science. Statistical tools, models and methods
whose strength is in recognizing this structural aspect of data are being discussed and
developed, ranging from functional linear regression, functional ANOVA and functional
principal component analysis to functional outlier detection. In this work, we discuss statistical
methods for the functional logistic regression model, a model where the response is binary
and the covariate(s) functional. Essentially, we consider ways that allow the parameter
estimator to be resistant to outliers, in addition to eliminating the multicollinearity and high
dimensionality problems inherent in functional data. The methods include robust techniques
for estimating the parameter function of the model as well as diagnostic measures to assess
the fit of the model. Two estimation approaches are discussed; the first makes use of robust
principal component estimation techniques and the second uses a robust penalization
approach. Results from a simulation study and a real world example are also presented to
illustrate the performance of the proposed estimators.
Acknowledgments
The completion of this dissertation would not have been possible without the support
and encouragement from the following individuals who have been instrumental in my grad-
uate studies at Auburn University.
To my advisor, Dr. Nedret Billor, I am thankful for the many discussions and insights
provided in this work. For stirring the interest in this field of research, for knowing how
to keep me going and for always being available with immeasurable guidance, support and
encouragement. I sincerely appreciate that. To the supervisory committee made up of Drs.
Ash Abebe and Peng Zeng, whose knowledge and advice in this work were invaluable.
To all my fellow classmates, and in particular Guy-vanie Miakonkana, Achard Bindele,
Jasdeep Pannu, Brice Nguelifack, Pallavi Sawant, Ruchika Sabharwal and Julian Allagan;
the journey would not have been the same without you. To Rose and Dan Brauss, you were
God-sent! No words could ever describe how much I appreciate you both. For always being
there and believing in me, especially when I became discouraged. The road trips, the movie
nights, the endless talks, the many many many wonderful memories - I am forever grateful.
To my amazing sister Chiedza, you kept me sane through it all. Thank you for indulging
me and being everything a sister could ask for and more. To my brothers Godfrey,
Ngonidzashe and Michael, I appreciate your support and prayers. To my mom, for nurturing
my love for numbers at an early age and guiding me to the One who gave me wings to fly.
And to my dad, for seeing so much more in me than I ever did and showing me that even
the sky is not the limit!
This work is dedicated to the memory of my grandfather, the late B.M. Denhere, who
encouraged us to pursue education and excel in all we do. I know you would be proud
Sekuru. To God be the Glory, great things He has done! (Psalm 100:5)
Table of Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
1 Functional Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Estimation of Smooth Functions . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Smoothing with Roughness Penalty . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Functional Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Estimation of Regression Parameters . . . . . . . . . . . . . . . . . . . . . . 11
2.3.1 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . 13
2.3.2 Principal Component Estimation . . . . . . . . . . . . . . . . . . . . 15
2.3.3 Ridge Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3 Robust Principal Component Functional Logistic Regression . . . . . . . . . . . 24
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Estimation of X(t) and β(t) . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Model Selection Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4 Numerical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.1 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.2 Canadian Weather Data Set . . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4 Robust Penalized Functional Logistic Regression . . . . . . . . . . . . . . . . . . 45
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Estimation of X(t) and β(t) . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 Robust Penalized Maximum Likelihood Estimation . . . . . . . . . . . . . . 48
4.3.1 Details of the Estimation Technique . . . . . . . . . . . . . . . . . . . 50
4.3.2 Tuning Parameter Selection . . . . . . . . . . . . . . . . . . . . . . . 52
4.4 Numerical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4.1 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.2 Poblenou NOX Levels Data Set . . . . . . . . . . . . . . . . . . . . . 55
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5 Diagnostic Methods for the Functional Logistic Model . . . . . . . . . . . . . . 60
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.2 Measures of Goodness-of-Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.3 Regression Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.4 Numerical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.4.1 Canadian Weather Diagnostics . . . . . . . . . . . . . . . . . . . . . . 67
5.4.2 Poblenou NOX Emission Diagnostics . . . . . . . . . . . . . . . . . . 68
6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
List of Figures
2.1 50 sample curves generated from the stochastic process X(t) = Z(t) + t/4 + E . 20
2.2 Comparison of the median β(t) estimates for the three different techniques against
the true parameter function sin(t + π/4). . . . . . . . . . . . . . . . . . . . . . 22
2.3 Median β(t) estimates (a) Maximum Likelihood Estimation (b) Principal Com-
ponent Estimation and (c) Ridge Estimation . . . . . . . . . . . . . . . . . . . . 23
3.1 50 sample curves generated from the overlapping stochastic process X_i(t) = a_{i1} +
a_{i2}t + W_i(t) and differentiated for each class of the response . . . . . . . . . . 32
3.2 Sampling curves for Models 1-4 at q = 5% . . . . . . . . . . . . . . . . . . . . . 34
3.3 Side-by-side boxplots on the effect of outlier curves on the maximum likelihood
estimate (MLE), classical PCA (CPCA) and robust PCA (RPCA) estimates of β₀ at
q = 5% contamination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Overlay comparison of the effect of outlier curves on classical PCR and robust
PCR when the first 4 PCs are included in the model at q = 5% contamination. . 38
3.5 The mean monthly temperatures for 23 Canadian weather stations used to predict
the risk of drought . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.6 Churchill weather station's temperatures are altered . . . . . . . . . . . . . . . 40
3.7 Parameter estimation without using principal component estimation . . . . . . 41
3.8 Parameter estimation when Churchill weather station's temperatures are shifted
and its effect on the principal component estimation methods . . . . . . . . . . 42
3.9 CPCA vs RPCA when Churchill weather station's temperatures are shifted . . . 42
4.1 Comparison of the penalized method and robust penalized method at differing
contamination levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2 NOX emission levels for non-working (top) and working (bottom) days . . . . . 57
4.3 Functional form of the hourly trajectories with the outliers detected in the NOX
dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4 A comparison of the estimated β(t) with and without the 5 outliers for the non-
robust and robust penalized methods . . . . . . . . . . . . . . . . . . . . . . . . 59
5.1 Index plots for the Canadian Weather data comparing residuals of the principal-
component based approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.2 Weather stations with the larger residuals shown as peaks in Figure (5.1) . . . . 69
5.3 Diagnostic plots that show the influence of each observation on the different
diagnostic measures versus h_ii . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.4 Index plot of the deviance residuals for the NOX penalized maximum likelihood
fitted models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.5 Diagnostic plots that show the influence of each observation on the different
diagnostic measures versus h_ii for the NOX model . . . . . . . . . . . . . . . . 72
List of Tables
3.1 Median MSEB (standard error) for the estimation of the functional parameter
based on the optimum model for Model 1 and Model 2 . . . . . . . . . . . . . . 35
3.2 Median MSEB (standard error) for the estimation of the functional parameter
based on the optimum model for Model 3 and Model 4 . . . . . . . . . . . . . . 35
3.3 The model details for each estimation technique . . . . . . . . . . . . . . . . . . 40
3.4 Goodness of fit measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1 Mean MSEB (standard error) for the estimation of the functional parameter,
β(t), for Models 2, 3 and 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Chapter 1
Functional Data Analysis
1.1 Introduction
In the last decade, a substantial amount of attention has been drawn to the field of
functional data analysis, resulting in the development and generalization of many statistical
techniques for this type of data. Much work has been devoted to this field, with publications
on functional regression models forthcoming from James (22); Cardot and Sarda (4); Müller
and Stadtmüller (29); Escabias et al. (9); and Ferraty and Vieu (13), to name just a few. Ramsay
and Silverman (34) present functional data ideas and statistical methods in their book which
gave impetus to the functional data analysis community. Interest has come from a broad
spectrum of fields such as biometrics, genetics, e-commerce and computer science. Statistical
tools, models and methods whose strength is in recognizing this structural aspect of the data
have been discussed, ranging from functional linear regression, functional ANOVA and functional
principal component analysis to functional outlier detection. In fact, functional regression
methods have prompted a reexamination of some of the ways used to analyze longitudinal data.
There are many steps that arise in the analysis of functional data that differentiate it
from analysis approaches taken for non-functional data. An important first step in working
with functional data is understanding its properties.
The term "functional" is used to describe the intrinsic structure of the data; one thinks of
the observed data functions as whole entities and the analysis is based on these functions.
Naturally, the functional data is observed and recorded as discrete observations. However,
this data is viewed as having been generated by some function and so the analysis is made on
the function rather than viewing the data as a sequence of individual observations. Ramsay
and Silverman (34) discuss the details of the first steps in functional data analysis (FDA)
which include data representation issues, data registration and plotting pairs of derivatives.
In this chapter, we look at some of the major concepts that are integral in FDA. In particular,
we explore issues on estimation of the observed function from the discrete time points.
1.2 Estimation of Smooth Functions
We have already established that whilst analysis in FDA is based on observed functions,
in practice the data are recorded as discrete observations. Therefore, for n observations,
discrete recordings of (t_{ij}, X_{ij}) pairs are made, and X_{ij} is viewed as a snapshot of
the function X_i(t) at time t_j, where i = 1,…,n indexes the observations and
j = 1,…,n_i indexes the replicates for each observation. It is important to note
that time is not the only continuum over which functional data can be recorded. Other
continua such as space, position and frequency may be involved, and the term time is
being used loosely.
An initial step in FDA is the estimation of the function X_i(t) based on the observed
discrete points X_i(t_j). We assume that the function X_i(t) is smooth and therefore possesses
one or more derivatives. Without the smoothness property, not much is gained by treating
the data as functional rather than multivariate. This smoothness property ensures that any
two adjacent data values X_i(t_j) and X_i(t_{j+1}) are unlikely to be very different from each
other and are linked together to a certain extent. To simplify notation for this chapter,
we will denote the function to be estimated as X(t), since the estimation of the functional
observation from the discrete data is done independently for each i. There are several issues
to consider in estimating the smooth function X(t) from the discrete observed data. The
actual recorded data might be far from smooth due to the presence of measurement
error; the data might be sparsely sampled or few in number, making it difficult to give a
stable estimate of the function; and the time points at which the data are recorded might
be unevenly spaced. All these and other issues call for techniques that work around these
specific problems with the sampled data.
Since smoothness is a property that is essential for the function X(t), there is a need to
consider smoothing techniques in estimating this function from the raw discrete data in the
presence of measurement error or noise. The observed discrete data are expressed as

X_j = X(t_j) + ε_j,  j = 1,…,n_i,  i = 1,…,n,

where the measurement error ε_j introduces roughness into the observed data. One might choose
to filter out the noise or, alternatively, to leave the noise but ensure that the smoothness
property is met in the analysis of the results. The approach taken in this dissertation work
for estimating the smooth function X(t) is by way of some basis function system.
By definition, a basis function system is a set of known functions, φ_k, that are mathematically
independent of each other, with the property that one can approximate arbitrarily
well any function by taking a linear combination of a sufficiently large number of these
functions. Therefore, the basis function approach represents a function X(t) (in L²(T)) in terms
of K known basis functions φ_k as

X(t) = Σ_{k=1}^{K} c_k φ_k(t) = c′φ(t),    (1.1)

where c and φ are column vectors of length K. Two important issues arise from this basis
representation of a function. The first is the choice of the basis function
system. The determination of the appropriate basis function system is based on the
characteristics that are inherent in the data. In general, Fourier basis functions are used to model
periodic data and a B-spline basis is used for non-periodic data. Other basis choices such as
wavelets, trigonometric functions or even polynomial functions can also be used should that
be appropriate (Ramsay and Silverman (34)).
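As an illustration of the basis representation in (1.1), the sketch below evaluates X(t) = Σ c_k φ_k(t) for a Fourier basis; the grid, the dimension K and the coefficient vector c are hypothetical choices for demonstration, not values used elsewhere in this work.

```python
import numpy as np

def fourier_basis(t, K, period=1.0):
    """Evaluate K Fourier basis functions at the points t.

    Returns a len(t) x K matrix Phi with columns
    1, sin(w t), cos(w t), sin(2w t), ... where w = 2*pi/period.
    """
    t = np.asarray(t, dtype=float)
    Phi = np.ones((t.size, K))
    w = 2.0 * np.pi / period
    for k in range(1, K):
        m = (k + 1) // 2              # harmonic number
        if k % 2 == 1:
            Phi[:, k] = np.sin(m * w * t)
        else:
            Phi[:, k] = np.cos(m * w * t)
    return Phi

# X(t) = sum_k c_k phi_k(t) = Phi @ c for illustrative coefficients c
t = np.linspace(0.0, 1.0, 101)
K = 5
c = np.array([0.5, 1.0, -0.3, 0.2, 0.1])   # hypothetical coefficients
X = fourier_basis(t, K) @ c
```

Any other basis (B-splines, wavelets, polynomials) fits the same template: build the matrix of basis values and take a linear combination of its columns.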
A second important aspect in this representation is the determination of the dimension
of the expansion, K. An interpolation is achieved when K = n_i. Therefore, K can be viewed
as a smoothing parameter, since the degree to which the data X_j are smoothed is determined
by the number of basis functions. When an appropriate basis function system is selected
for the observed data, fewer basis functions are required to make a good approximation of
the function, i.e. the smaller K can be. There are two opposing considerations when it comes to
choosing the number, K, of basis functions. On the one hand, the larger the K, the better
the fit. This is, however, at the expense of fitting unnecessary noise that should ideally be
ignored. On the other hand, the smaller the K, the smoother the function (Ramsay and
Silverman (34)). The problem with this is that if K is made too small, then we risk
missing some important aspects of the function being estimated. Therefore, a
trade-off between fit and smoothness is necessary. An assortment of algorithms exists for this
purpose, including methods such as cross-validation and generalized cross-validation (Craven
and Wahba (7)).
In order to estimate the coefficients of the linear expansion in (1.1), the ordinary least
squares criterion can be used. This criterion is expressed as

SMSSE(X|c) = Σ_{j=1}^{n_i} [ X_j − Σ_{k=1}^{K} c_k φ_k(t_j) ]² = (X − Φc)′(X − Φc),    (1.2)

where Φ is the n_i × K matrix of basis function values φ_k(t_j). From this, the estimate, ĉ,
that minimizes SMSSE is

ĉ = (Φ′Φ)⁻¹Φ′X.
This approach is appropriate when we assume the standard model for error, i.e. the errors
ε_j are independently and identically distributed with mean zero and constant variance σ²_ε.
In the presence of non-stationary and/or autocorrelated errors, a weighted least squares
fit would be more appropriate. Thus, the weighted least squares estimate is

ĉ = (Φ′WΦ)⁻¹Φ′WX,

where W is a symmetric positive definite matrix that incorporates the unequal weighting of
the errors.
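The (weighted) least squares recovery of the basis coefficients from noisy discrete data can be sketched as follows; the test curve, noise level and polynomial basis are illustrative assumptions (the chapter itself favors Fourier or B-spline bases for real data).

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete noisy recordings X_j = X(t_j) + eps_j of a smooth curve
t = np.linspace(0.0, 1.0, 50)
X_true = np.sin(2.0 * np.pi * t)
X_obs = X_true + rng.normal(scale=0.1, size=t.size)

# Basis matrix Phi with entries phi_k(t_j) (polynomial basis, K = 6)
K = 6
Phi = np.vander(t, K, increasing=True)

# Ordinary least squares: c_hat = (Phi' Phi)^{-1} Phi' X
c_ols = np.linalg.solve(Phi.T @ Phi, Phi.T @ X_obs)

# Weighted least squares: c_hat = (Phi' W Phi)^{-1} Phi' W X
W = np.eye(t.size)              # identity weights reduce WLS to OLS
c_wls = np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W @ X_obs)

X_hat = Phi @ c_ols             # the smoothed curve evaluated at t
```

With identity weights the two estimates coincide; a non-trivial W would downweight time points with larger or correlated errors.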
The next section looks at a more powerful option for approximating discrete data by a
function. The shortcoming of the (weighted) least squares criterion is that the degree
of smoothness is established in a discontinuous manner; better results can be obtained
with a roughness penalty.
1.3 Smoothing with Roughness Penalty
As discussed in the previous section, there are two competing objectives in function
estimation: good fit vs. smooth fit. The competing objectives can be viewed in terms of the
following statistical principle,

MSE = Bias² + Variance,

where MSE is the mean squared error, and the bias and sampling variance are defined as

Bias[X̂(t)] = X(t) − E[X̂(t)],
Var[X̂(t)] = E[(X̂(t) − E[X̂(t)])²].

Whilst an unbiased estimate is desirable, when the standard error model is not met, the
high variance associated with the curve will result in a compromised MSE. Therefore, the
MSE might be drastically reduced by sacrificing some bias so as to reduce some sampling
variance. This is achieved by imposing some smoothness on the estimated curve. The
roughness penalty makes explicit what is sacrificed in bias to achieve an improved MSE.
The penalized residual sum of squares defines the compromise of trading smoothness
for fit as

PENSSE(X|c) = (X − Φc)′W(X − Φc) + λ Pen(X(t)),

where c is the coefficient vector; Φ is the matrix of values of the appropriate basis functions;
W is a symmetric positive definite matrix that incorporates the unequal weighting of the
errors; λ is a smoothing parameter that measures the rate of exchange between fit to the
data and smoothness of the function; and Pen(X(t)) is a measure of the function's roughness.
The curvature of a function is usually used to quantify the idea of the roughness of a function.
Therefore, a natural measure of a function's roughness is given as

Pen(X(t)) = ∫_T [X″(s)]² ds = c′Pc,

where

P = ∫_T φ″(s){φ″(s)}′ ds

is the roughness penalty matrix. Thus, the objective function being optimized becomes

PENSSE(X|c) = (X − Φc)′W(X − Φc) + λc′Pc,

resulting in an estimate of the coefficient vector of

ĉ = (Φ′WΦ + λP)⁻¹Φ′WX.

When λ = 0, this reduces to the weighted least squares estimate. It can also be noticed
that as λ → ∞ and more weight is given to the roughness term, the estimated function
approaches the standard linear regression; as λ → 0, the estimated function approaches
an interpolation of the data. The selection of λ can be achieved by cross-validation and
generalized cross-validation methods, which will be discussed in more detail in later chapters
of this dissertation work.
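A minimal numerical sketch of the penalized estimate ĉ = (Φ′WΦ + λP)⁻¹Φ′WX, with the penalty matrix P approximated by quadrature over the second derivatives of the basis functions; the monomial basis, the grid and the value of λ are illustrative assumptions (in practice λ would be chosen by (generalized) cross-validation, as noted above).

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy discrete observations of a smooth curve
t = np.linspace(0.0, 1.0, 60)
X_obs = np.sin(2.0 * np.pi * t) + rng.normal(scale=0.2, size=t.size)

K = 8

def poly_basis(s, K):
    """Monomial basis 1, s, s^2, ..., s^(K-1)."""
    return np.vander(np.asarray(s, dtype=float), K, increasing=True)

def poly_basis_dd(s, K):
    """Second derivatives of the monomials: k(k-1) s^(k-2)."""
    s = np.asarray(s, dtype=float)
    D = np.zeros((s.size, K))
    for k in range(2, K):
        D[:, k] = k * (k - 1) * s ** (k - 2)
    return D

# Roughness penalty matrix P = int phi''(s){phi''(s)}' ds (trapezoid rule)
s_fine = np.linspace(0.0, 1.0, 2001)
w = np.full(s_fine.size, s_fine[1] - s_fine[0])
w[0] *= 0.5
w[-1] *= 0.5
Dd = poly_basis_dd(s_fine, K)
P = Dd.T @ (Dd * w[:, None])

Phi = poly_basis(t, K)
W = np.eye(t.size)              # identity weights for illustration
lam = 1e-6                      # smoothing parameter lambda

# Penalized estimate: c_hat = (Phi' W Phi + lam P)^{-1} Phi' W X
c_hat = np.linalg.solve(Phi.T @ W @ Phi + lam * P, Phi.T @ W @ X_obs)
X_hat = Phi @ c_hat             # the smoothed curve at the observation points
```

Increasing λ shrinks the fitted curvature toward a straight-line fit, while λ → 0 recovers the weighted least squares estimate.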
1.4 Conclusion
In this chapter we introduced the concept of functional data and discussed some of the
pertinent issues in the field of functional data analysis. The rest of this work is organized
as follows. Chapter 2 gives some background on the functional logistic model and discusses
some of the estimation techniques in the literature, as well as making a case for the contribution
of this work. Chapter 3 explores the concept of robust principal component estimation which
is one of the results of this work. The estimation technique is developed and its performance
considered in a simulation study as well as with a real world example. Chapter 4 examines
and proposes a robust penalized approach to the estimation of the parameter function in the
functional logistic model, with simulation results and application to a real world example.
Chapter 5 studies diagnostic issues for the functional logistic model and discusses some
model validation methods, extending results from the standard logistic model. In the last
chapter, there is a discussion of other aspects of the functional logistic model that may be
explored as future work.
Chapter 2
Functional Logistic Regression
2.1 Introduction
Ramsay and Silverman (34) discussed functional linear regression models, a case in which
the linear relationship between random variables and functions is explored. The different
models that can arise from this set-up include:

- A functional response and scalar independent variable(s). In this model, the observed
  values are of the form {X_{ij}, Y_i(t) : t ∈ T} for i = 1,…,n, j = 1,…,p, where T is the
  support of the functional predictors. The relationship between the functional response
  and scalar predictors can be formulated as

  Y_i(t) = Σ_{j=1}^{p} X_{ij} β_j(t) + ε_i(t),  i = 1,…,n.

- A scalar response and functional independent variable(s). The observed values in
  this case are of the form {X_i(t), Y_i} for i = 1,…,n, and this relationship is formulated as

  Y_i = ∫_T X_i(t)β(t)dt + ε_i,  i = 1,…,n.

- A functional response and functional independent variable(s). This last form occurs
  when both the response and predictor are functional, i.e. {X_i(t), Y_i(t)} for i = 1,…,n.
  The model is defined as

  Y_i(t) = ∫_T X_i(t)β(t)dt + ε_i(t),  i = 1,…,n.
Our work is focused on the functional logistic regression (FLR) model, where the response
is binary and the predictor(s) functional. Different approaches have been developed
for the estimation of the parameters of the functional logistic model.
Escabias et al. (9) discuss two approaches to parameter estimation that employ principal
component estimation. James (22) and Müller and Stadtmüller (29) discuss the generalized
functional linear model and consider estimation methods for its parameters. These estimation
techniques, however, are not resistant to outliers, and thus there is a need for
robust estimation methods.
We focus on functional logistic regression in particular as there are some interesting
problems in FDA that could benefit from this model. In addition to modeling functional
covariates with binary responses, functional logistic regression is also useful for classification
methods. An increasing amount of research is focusing on functional logistic regression and
its application to functional Magnetic Resonance Imaging (fMRI) data. Ratcliffe et al. (36),
Reiss et al. (37), Escabias et al. (10), Leng and Müller (25) and Tian (41) are some examples
of work that deal with applications of the functional logistic model to a variety of areas
including fMRI data.
The presence of outliers in any data set is almost inevitable, and their effect on the
model unquestionable. One school of thought proposes developing outlier detection methods
and using these to eliminate identified outliers when fitting the model. However, there are
many drawbacks to this approach. Firstly, there might be sample curves that are just at the
limit of outlyingness, and their removal from the data set might be too drastic an approach
to take. Over and above this, the process might need to be repeated before the data
set can be declared outlier-free. Secondly, there might be no consistency among different
researchers in the approaches taken to eliminate outliers, and therefore there is an element
of subjectivity. Lastly, making inferences from a model where data have been removed
is problematic due to the dependence between the "good" observations and
the "bad" observations (Victoria-Feser (44)). Therefore, robust estimation methods are more
viable in that the effect of the outliers is minimized without removing them from the data
set. Whilst there has been some work that addressed estimation of the parameter function
for this model, as cited above, to our knowledge there is no contribution to the field as far
as robust estimation methods for this model are concerned. It is with this void in mind that
this dissertation work is dedicated to contributing to the field of functional data analysis by
proposing robust estimation methods for the functional logistic model.
2.2 The Model
We consider independent functional covariates X₁(t),…,X_n(t), where t ∈ T
and T is the support of the sample curves X_i(t), i = 1,2,…,n; and a random sample of
a binary response variable Y given by Y₁,…,Y_n, that is, Y_i ∈ {0,1}, i = 1,2,…,n. Then the
random variable Y is such that the observations satisfy

Y_i ~ Bernoulli(π_i),

where π_i is the probability of a positive response given X_i(t), which is given as

π_i = P(Y = 1 | {X_i(t) : t ∈ T})
    = exp{β₀ + ∫_T X_i(t)β(t)dt} / (1 + exp{β₀ + ∫_T X_i(t)β(t)dt}),  i = 1,…,n.

β₀ is a real parameter and β(t) is a smooth function of t, both of which are unknown
parameters. The logit transformation is

l_i = log( π_i / (1 − π_i) ) = β₀ + ∫_T X_i(t)β(t)dt,  i = 1,…,n.    (2.1)
This can be viewed in the more general sense as a generalized functional linear model with
the logit link function. In this work, we consider the no-intercept model (i.e. β₀ = 0).
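To make the model concrete, the sketch below computes π_i for one sampled curve by approximating ∫_T X_i(t)β(t)dt with the trapezoid rule, under the no-intercept convention; the curve X_i(t) and the parameter function β(t) used here are hypothetical.

```python
import numpy as np

def flr_probability(x_vals, beta_vals, t_grid):
    """P(Y = 1 | X) for the no-intercept functional logistic model:
    eta = int_T X(t) beta(t) dt, approximated by the trapezoid rule,
    followed by the logistic transform."""
    f = x_vals * beta_vals
    eta = np.sum((f[:-1] + f[1:]) * np.diff(t_grid)) / 2.0
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical sampled curve X_i(t) and parameter function beta(t)
t = np.linspace(0.0, np.pi, 200)
x = np.sin(t)                        # X_i(t) sampled on the grid
beta = np.full(t.size, 0.5)          # a flat beta(t) = 0.5 for illustration
pi_i = flr_probability(x, beta, t)   # eta ~ 0.5 * int_0^pi sin(t) dt = 1
```

Since the integrand here gives η ≈ 1, the resulting probability is approximately 1/(1 + e⁻¹) ≈ 0.731.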
2.3 Estimation of Regression Parameters
Estimation of the parameters cannot be achieved by the usual method of maximum
likelihood [Ramsay and Silverman (34)], necessitating a different approach for functional
data. In particular, we consider the functional covariates, X_i(t), and the functional parameter,
β(t), as belonging to finite-dimensional spaces generated by (not necessarily the same)
bases. We assume X_i(t) ∈ L²(T), the space of square-integrable functions with the inner product

⟨f, g⟩ = ∫_T f(s)g(s)ds,  ∀ f, g ∈ L²(T),

such that

X_i(t) = Σ_{j=1}^{K_X} c_{ij} φ_j(t),  i = 1,…,n,    (2.2)

where {φ_j(t)} is an appropriate basis which is selected to reflect the characteristics of the
data.
With this assumption, we are able to reconstruct the functional form of the functional
covariates from the observed discrete points using two different approaches. On the one
hand, if the functional covariate(s) is observed with some measurement error, then the ith
subject at the kth replication is

X_{ik} = X_i(t_k) + ε_i(t_k),  k = 0,1,…,n_i,  i = 1,…,n,

where ε(t) and X(t) are independent. In this case, where the functional covariate is observed
with some noise, we use a least squares approximation approach to obtain the functional
form of the covariates by approximating the basis coefficients {c_{ij}} from the discrete
observations. On the other hand, if the functional covariate(s) is observed without error, then the
ith subject at the kth replication is

X_{ik} = X_i(t_k),  k = 0,…,n_i,  i = 1,…,n.

In this case, some interpolation method such as cubic spline interpolation can be used to get
the functional form of the predictors.
We also define

β(t) = Σ_{k=1}^{K_b} b_k ψ_k(t),    (2.3)

where {ψ_k(t)}, k = 1,…,K_b, is a known basis system.
Since estimates for the basis coefficients {c_{ij}} can be found either by smoothing (as
discussed in Chapter 1) or by interpolation, using the basis expansions of the covariates and
parameter function as defined in (2.2) and (2.3) in the regression model (2.1), the functional
logistic model reduces to a standard multiple logistic one,

l_i = β₀ + ∫_T X_i(t)β(t)dt = β₀ + Σ_{j=1}^{K_X} Σ_{k=1}^{K_b} c_{ij} ψ_{jk} b_k,  i = 1,…,n,

where ψ_{jk} = ∫_T φ_j(t)ψ_k(t)dt for j = 1,…,K_X, k = 1,…,K_b; c_{ij} is a basis coefficient; β₀ is an
unknown real parameter; and b_k is the unknown basis coefficient used to estimate the parameter
function β(t). In matrix form, this can be written as

L = β₀1 + CΨb,    (2.4)

where L = (l₁,…,l_n)′, C = {c_{ij}}_{n×K_X}, Ψ = {ψ_{jk}}_{K_X×K_b}, 1 = (1,…,1)′ and b = (b₁,…,b_{K_b})′.
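The entries ψ_{jk} = ∫_T φ_j(t)ψ_k(t)dt and the reduced design matrix CΨ can be approximated numerically; the monomial bases and the random coefficient matrix C in the sketch below are illustrative assumptions only.

```python
import numpy as np

# Monomial bases phi_j(t) = t^(j-1) and psi_k(t) = t^(k-1) on T = [0, 1]
K_X, K_b = 4, 3
t = np.linspace(0.0, 1.0, 2001)
w = np.full(t.size, t[1] - t[0])    # trapezoid quadrature weights
w[0] *= 0.5
w[-1] *= 0.5

Phi_vals = np.vander(t, K_X, increasing=True)    # columns phi_j(t)
Psi_vals = np.vander(t, K_b, increasing=True)    # columns psi_k(t)

# psi_jk = int_T phi_j(t) psi_k(t) dt, giving the K_X x K_b matrix Psi
Psi = Phi_vals.T @ (Psi_vals * w[:, None])

# Reduced design matrix C Psi for hypothetical basis coefficients c_ij
rng = np.random.default_rng(2)
C = rng.normal(size=(10, K_X))                   # n = 10 curves
design = C @ Psi                                 # n x K_b
```

For monomials on [0, 1] the entries are known exactly, ψ_{jk} = 1/(j + k − 1) with 1-based indices, which provides a useful check on the quadrature.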
Three different approaches to estimation can be discussed for this model (2.4). Firstly,
there is maximum likelihood estimation (MLE), which results in unstable and inaccurate
estimators due to the highly collinear nature of the covariates. To deal with this shortcoming
of the MLE, the second approach of principal component estimation, as discussed by Escabias
et al. (9), can be considered. Alternatively, a third approach is the penalized maximum
likelihood estimation approach, which is also explored in this work; an empirical comparison
of the three approaches is given by way of a simulation study.
2.3.1 Maximum Likelihood Estimation
The FLR model in (2.4) is now in the same form as the standard logistic regression model
and, therefore, the maximum likelihood parameter estimates can be found. The responses,
Y_i, are assumed to follow a Bernoulli(π_i) distribution, such that the log-likelihood function
can be written as

L(β; Y) = Σ_{i=1}^{n} [ Y_i log( π_i / (1 − π_i) ) + log(1 − π_i) ].    (2.5)

The derivative of this log-likelihood function with respect to π_i is

∂L/∂π_i = Σ_{i=1}^{n} (Y_i − π_i) / (π_i(1 − π_i)).

Using the fact that

log( π_i / (1 − π_i) ) = l_i = β₀ + C′_i Ψb,

together with the chain rule, the derivative of the log-likelihood with respect to β₀ is

∂L/∂β₀ = Σ_{i=1}^{n} (∂L/∂π_i)(∂π_i/∂l_i)(∂l_i/∂β₀)
       = Σ_{i=1}^{n} [ (Y_i − π_i) / (π_i(1 − π_i)) ] π_i(1 − π_i) · 1
       = Σ_{i=1}^{n} (Y_i − π_i).
Similarly, the derivative of the log-likelihood with respect to b is

∂L/∂b = Σ_{i=1}^{n} (∂L/∂π_i)(∂π_i/∂l_i)(∂l_i/∂b)
      = Σ_{i=1}^{n} [ (Y_i − π_i) / (π_i(1 − π_i)) ] π_i(1 − π_i) Ψ′C_i
      = Σ_{i=1}^{n} (Y_i − π_i)Ψ′C_i.

Therefore the likelihood equations for model (2.4) are:

Σ_{i=1}^{n} [Y_i − π_i] = 0,    (2.6)

Σ_{i=1}^{n} [(Y_i − π_i)Ψ′C_i] = 0.    (2.7)

Since these equations are non-linear in the parameters of interest, β₀ and b, an iterative
method is used to solve them and obtain estimates of the parameters. One such
method is the Newton-Raphson method, which is used to solve the non-linear likelihood
equations

X′(Y − π) = 0,
where, from (2.4), $X = (\mathbf{1} \mid C\Psi)$, $Y = (Y_1,\ldots,Y_n)'$ and $\pi = (\pi_1,\ldots,\pi_n)'$. The approximated parameter function estimate for our functional logistic model will then be,
$$ \hat\beta(t) = \sum_{k=1}^{K_b} \hat b_k \theta_k(t) = \hat b'\theta, \qquad (2.8) $$
where $\theta$ is the vector of known basis functions and $\hat b$ is the maximum likelihood estimate from the reduced functional logistic model (2.4).
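To make the Newton-Raphson step concrete, the following is a minimal pure-Python sketch for a model with an intercept and a single scalar covariate (a stand-in for the columns of $C\Psi$); `fit_logistic_newton` and the simulated data are illustrative assumptions, not part of the dissertation.

```python
import math, random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))          # guard against overflow
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_newton(x, y, iters=15):
    """Newton-Raphson for a logistic model with intercept b0 and slope b1,
    solving the score equations sum(y_i - pi_i) = 0 and
    sum((y_i - pi_i) x_i) = 0 described in the text."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        # score vector (gradient of the log-likelihood)
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # observed information (negative Hessian), a 2x2 matrix
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: beta_new = beta + H^{-1} g
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

rng = random.Random(7)
x = [i / 50.0 for i in range(-100, 100)]
y = [1 if rng.random() < sigmoid(0.5 + 2.0 * xi) else 0 for xi in x]
b0_hat, b1_hat = fit_logistic_newton(x, y)
```

With well-separated classes the iteration can diverge, which is one symptom of the instability discussed next; here the data overlap enough for a stable fit.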
The estimation of the parameter function obtained using the maximum likelihood approach is not very accurate in the presence of highly correlated data. In fact, the design matrix $C\Psi$ in (2.4) has high correlation among its columns. Thus, there is a need to eliminate multicollinearity in order to obtain a more reliable estimate of the parameter function, $\beta(t)$, in our functional model. One such approach is discussed by Escabias et al. (9) and uses the idea of principal component estimation. Another well-known adapted approach, which we explore in this chapter, is ridge estimation.
2.3.2 Principal Component Estimation
The reduced functional logistic model in (2.4) is considered and, in order to deal with the multicollinearity problem highlighted, the principal components (PCs) of the design matrix $C\Psi$ are utilized instead. We let $Z = \{\xi_{ij}\}_{n\times K_b}$ be the matrix of PCs of the design matrix, such that
$$ Z = C\Psi V, $$
where $V$ is a $K_b\times K_b$ matrix whose columns are the eigenvectors associated with the eigenvalues of the covariance matrix of $C\Psi$. Then the model becomes,
$$ L = \beta_0\mathbf{1} + Z\gamma, \qquad (2.9) $$
where $\gamma = V'b$. As with the MLE, the parameter function $\beta(t)$ is then estimated as,
$$ \hat\beta(t) = \hat b'\theta, $$
where $\hat b = V\hat\gamma$. A selected number of PCs are included in the model instead, so that (2.9) becomes,
$$ L^{(s)} = \beta_0^{(s)}\mathbf{1} + Z^{(s)}\gamma^{(s)}, \qquad (2.10) $$
where $s$ denotes the number of selected PCs retained in the model. The number of selected PCs can be determined either by the cross-validation method or the generalized cross-validation method. These and other methods for selecting the number of PCs are discussed in greater detail in Chapter 3.
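As an illustration of the principal component step, the sketch below mean-centers a small design matrix, extracts the leading PC by power iteration (a simple stand-in for the eigendecomposition of the covariance matrix described above), and returns the scores that would replace the collinear columns; `pc_scores` and the toy matrix are assumptions of this sketch.

```python
import math

def pc_scores(X, n_pc=1, iters=200):
    """Leading principal component scores of a data matrix X (list of rows),
    via mean-centering, a sample covariance matrix, and power iteration."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    # sample covariance matrix of the centered columns
    S = [[sum(Xc[i][a] * Xc[i][b_] for i in range(n)) / (n - 1)
          for b_ in range(p)] for a in range(p)]
    scores = []
    for _ in range(n_pc):
        v = [1.0] * p
        for _ in range(iters):  # power iteration -> leading eigenvector
            w = [sum(S[a][b_] * v[b_] for b_ in range(p)) for a in range(p)]
            norm = math.sqrt(sum(wi * wi for wi in w))
            v = [wi / norm for wi in w]
        lam = sum(v[a] * sum(S[a][b_] * v[b_] for b_ in range(p))
                  for a in range(p))
        scores.append([sum(Xc[i][a] * v[a] for a in range(p))
                       for i in range(n)])
        # deflate to expose the next component
        S = [[S[a][b_] - lam * v[a] * v[b_] for b_ in range(p)]
             for a in range(p)]
    return scores

# two nearly collinear columns: the first PC captures almost all variability
X = [[t, 2 * t + 0.01 * ((-1) ** t)] for t in range(20)]
z1 = pc_scores(X, n_pc=1)[0]
```

For these two nearly collinear columns, the variance of the first score vector is close to the total variance, which is why a few PCs can replace the full collinear design.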
2.3.3 Ridge Estimation
A second approach that can be used to deal with the highly correlated covariates in model (2.4) takes the form of penalized maximum likelihood estimation. Le Cessie and van Houwelingen (23) extended the ridge regression theory used in standard linear regression to logistic regression. We adopt this approach to define a ridge estimator for the reduced functional logistic regression model. In earlier work, Schaefer (39) also discussed the ridge logit estimator to handle collinear data. In that work, the parameter estimator is given as,
$$ \hat\beta = (X'VX + \lambda I)^{-1}X'VX\,\hat\beta_{MLE}, $$
where $\hat\beta_{MLE}$ denotes the maximum likelihood estimator of the parameter; $X$ is the design matrix; $V$ is the diagonal matrix of MLEs of the success probabilities; $I$ is the identity matrix and $\lambda$ is the ridge parameter.
We consider the reduced functional logistic regression model (2.4) obtained by considering the functional observations, $X_i(t)$, and the parameter function, $\beta(t)$, as belonging to the finite-dimensional space generated by the bases $\{\phi_j(t)\}_{j=1}^{K_X}$ and $\{\theta_k(t)\}_{k=1}^{K_b}$, respectively. Under this model, the log-likelihood function as stated earlier is,
$$ L(\beta;Y) = \sum_{i=1}^{n}\left[ Y_i\log\left(\frac{\pi_i}{1-\pi_i}\right) + \log(1-\pi_i)\right]. $$
Using the approach discussed by Le Cessie and van Houwelingen (23), we consider the log-likelihood function above with a penalty on the norm of $b$, the unknown parameter in model (2.4), as,
$$ L_\lambda(\beta;Y) = L(\beta;Y) - \lambda\|b\|^2, \qquad (2.11) $$
where $\|b\| = \left(\sum_k b_k^2\right)^{1/2}$. Thus, instead of maximizing the log-likelihood function (2.5), we maximize $L_\lambda(\beta;Y)$, with a penalty on the norm of $b$. In this case, $\lambda$ controls the shrinkage of the norm of the parameter $b$. When $\lambda = 0$, the solution is the ordinary MLE. As the ridge parameter, $\lambda$, approaches $\infty$, the $b_k$'s tend to 0. As with the concept of ridge regression, the idea is to introduce some bias so as to obtain a biased estimator whose variance is smaller than that of the unbiased estimator (the MLE in this case). The choice of $\lambda$ would be such that MSE($\hat\beta_\lambda$) is smaller than MSE($\hat\beta_{MLE}$).

Several measures can be used to assess the predictive ability of the fitted model:

(i) Classification Error,
$$ CE = \begin{cases} 1 & : Y_{new} = 1 \text{ and } \hat\pi_{new} < 0.5, \\ \tfrac{1}{2} & : \hat\pi_{new} = 0.5, \\ 0 & : \text{otherwise.} \end{cases} $$
(ii) Squared Error,
$$ SE = (Y_{new} - \hat\pi_{new})^2. $$
(iii) Minus Log-Likelihood Error,
$$ ML = -\left\{Y_{new}\log(\hat\pi_{new}) + (1 - Y_{new})\log(1-\hat\pi_{new})\right\}, $$
where $Y_{new}$ denotes the response for a new observation and $\hat\pi_{new}$ denotes the estimated probability that the response is positive for a new covariate $X_{new}(t)$. Other measures include the cross-validation method, and this is the criterion we used in the simulation study
presented in the next section. This is defined for the mean squared error as,
$$ MSE_{CV} = \frac{1}{n}\sum_{i=1}^{n}\left\{Y_i - \hat\pi_{(-i)}\right\}^2, $$
where $\hat\pi_{(-i)}$ is the estimated probability of a positive response when the model is fit without the $i$th observation.
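The three prediction-error measures can be sketched directly in code; note that the classification error below is symmetrized to also count $Y_{new} = 0$ with $\hat\pi_{new} > 0.5$ as an error, which is an assumption beyond the definition given above.

```python
import math

def classification_error(y_new, p_new):
    """CE: 1 for a misclassification, 1/2 at p = 0.5, else 0.
    Symmetrized over both response classes (an assumption of this sketch)."""
    if p_new == 0.5:
        return 0.5
    wrong = (y_new == 1 and p_new < 0.5) or (y_new == 0 and p_new > 0.5)
    return 1.0 if wrong else 0.0

def squared_error(y_new, p_new):
    # SE = (Y_new - pi_new)^2
    return (y_new - p_new) ** 2

def minus_log_likelihood(y_new, p_new):
    # ML = -{Y_new log(pi_new) + (1 - Y_new) log(1 - pi_new)}
    return -(y_new * math.log(p_new) + (1 - y_new) * math.log(1 - p_new))

# a positive observation that the model rated as unlikely
ce = classification_error(1, 0.2)
se = squared_error(1, 0.2)
ml = minus_log_likelihood(1, 0.2)
```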
We compare the performance of each of these estimation methods discussed in this
section by way of a simulation study.
2.4 Simulation Study
The following steps were carried out in this simulation study in order to investigate how the different estimation techniques discussed in the previous section compare in the estimation of the parameters of the functional logistic regression model.
Step 1: Generate the simulated data.
(a) Generate n sample curves for the functional logistic model.
We generate 50 sample functional observations of a known stochastic process $X(\cdot)$, considered over the interval $[0,10]$ with 21 equally spaced time points. We define this stochastic process as $X(t) = Z(t) + t/4 + E$, where $Z(t)$ is a zero-mean Gaussian stochastic process with covariance function $C(s,t) = \left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right)^{80|s-t|}$ and $E$ is a Bernoulli random variable with $p = 0.1$. This model is illustrated in Figure 2.1. To obtain the functional form of these sample curves, we consider them to belong to the finite-dimensional space generated by a cubic B-spline basis defined on equally spaced nodes on the $[0,10]$ interval. We used the generalized cross-validation (GCV) method to determine the number of basis functions ($K_X$) to use, as well as the smoothing parameter ($\lambda$), in the estimation of the functional form of the covariate, $X(t)$. The criterion is expressed as,
$$ \mathrm{GCV}(\lambda) = \left(\frac{n}{n - df(\lambda)}\right)\left(\frac{SSE}{n - df(\lambda)}\right), $$
Figure 2.1: 50 sample curves generated from the stochastic process $X(t) = Z(t) + t/4 + E$.
where $df(\lambda) = \mathrm{trace}\{S_{\phi,\lambda}\}$ is the equivalent degrees of freedom and $S_{\phi,\lambda}$ is the projection operator such that $\hat X = S_{\phi,\lambda}X$; $SSE$ is the residual sum of squares; and $\lambda$ is the smoothing parameter that controls the trade-off between fit to the data and smoothness. The optimal number of basis functions was achieved at $K_X = 10$ with $\lambda = 2048$, which is what was used for all the replications.
(b) Select a parameter function.
The natural cubic spline interpolation of the parameter function $\sin(t + \pi/4)$ is selected. The basis coefficients, $b = (b_1,\ldots,b_{K_b})'$, of the parameter function are known and are used to assess the estimation techniques presented in this work.
(c) Generate the values of the response variable.
The probabilities are,
$$ \pi_i = \frac{\exp\{l_i\}}{1 + \exp\{l_i\}}, $$
where $l_i$ is as defined in (2.1), $\beta_0$ is fixed at 0 and $i = 1,\ldots,n$. The $n$ values of the response are obtained by simulating observations of a Bernoulli distribution with probabilities $\pi_i$. The reduced functional logistic model in (2.4) was fit with the coefficient matrix as the covariate. The Hosmer-Lemeshow goodness-of-fit test was also carried out [Hosmer and Lemeshow (20)], which confirmed the validity of the model for the generated data.
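Step 1(c) above can be sketched as follows; `simulate_responses` and the example logits are illustrative, not taken from the simulation code.

```python
import math, random

def simulate_responses(logits, seed=0):
    """Simulate Y_i ~ Bernoulli(pi_i) from given logits l_i, with
    pi_i = exp(l_i) / (1 + exp(l_i)), as in step (c) of the text."""
    rng = random.Random(seed)
    probs = [math.exp(l) / (1.0 + math.exp(l)) for l in logits]
    ys = [1 if rng.random() < p else 0 for p in probs]
    return probs, ys

# hypothetical logits l_i = beta_0 + integral X_i(t) beta(t) dt
logits = [-2.0, -0.5, 0.0, 0.5, 2.0]
probs, ys = simulate_responses(logits)
```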
Step 2: Obtain the approximated estimates of the parameter function.
(a) Use maximum likelihood estimation with the coefficient matrix as the covariate of the logistic model.
(b) Perform PCA on the covariates in (2.4) and fit the logistic model with the retained $s$ PCs as the covariates.
(c) Use penalized maximum likelihood estimation with the coefficient matrix as the covariate.
Small values of MSEB indicate better estimation when the three different estimation techniques are compared.
All simulations were implemented using R [R Development Core Team (33)]. The packages fda [Ramsay et al. (35)], rrcov [Todorov (42)] and penalized [Goeman (17)] were particularly useful in obtaining the functional form of the data, performing principal component estimation and performing penalized maximum likelihood estimation, respectively. The cross-validated likelihood criterion was used to determine the optimal shrinkage parameter for the penalized ML estimation.
The median estimates of $\beta(t)$ for 250 simulations are shown in Figures 2.2 and 2.3, illustrating how the estimates compare with the true parameter function. As is evident in the graphs, the ML estimation method does not give accurate estimates, since no consideration is taken of the multicollinearity issue that is inherent in the design matrix.
Figure 2.2: Comparison of the median $\beta(t)$ estimates for the three different techniques against the true parameter function $\sin(t + \pi/4)$.
Both the PC-based and penalized ML estimation approaches give more accurate results, with the estimates of the parameter function being much closer to the true $\beta(t)$ function. This is especially evident when the mean $\beta(t)$ estimates (results not shown) are used. The median MSEB for each of the techniques is 2.1311 for MLE, 0.1762 for PC estimation, and 0.2934 for ridge estimation.
2.5 Conclusion
In this chapter, we have introduced the FLR model and reviewed some of the estimation methods discussed in the literature. It has been shown that, due to the formulation of the design matrix of the FLR model developed in this work, the estimation of the parameter function, $\beta(t)$, is greatly affected by the presence of multicollinearity. Two approaches discussed to minimize the effect of the collinear covariates were principal component estimation [Escabias et al. (9)] and penalized maximum likelihood estimation. These alternative approaches provide more reliable estimates of $\beta(t)$.

Figure 2.3: Median $\beta(t)$ estimates: (a) Maximum Likelihood Estimation, (b) Principal Component Estimation and (c) Ridge Estimation.

These results are consistent with the literature [Barker and Brown (2), Le Cessie and van Houwelingen (23), Escabias et al. (9) and Vágó and Kemény (43)] and we conclude that the parameter function $\beta(t)$ can be estimated with better accuracy using principal component estimation or penalized maximum likelihood estimation techniques. However, in the presence of outlying sample curves, there is a need to consider robust estimation methods. In Chapters 3 and 4, we turn our focus to this by proposing robust methods for the functional logistic regression model.
Chapter 3
Robust Principal Component Functional Logistic Regression
3.1 Introduction
Inherent in most functional data, and the estimation techniques discussed in this work, is the fact that we are often dealing with highly correlated data, which poses problems in parameter estimation, resulting in estimates that may be inaccurate and unreliable. The presence of functional observations that deviate from the overall pattern of the data creates additional inefficiencies in many procedures. In this work, we propose a robust estimation approach for functional data with a binary response and functional covariates that addresses these inadequacies. To our knowledge, no work has been done in this area of robust estimation for the functional logistic regression model. There are some robust methods proposed for the functional linear regression model and for functional principal component analysis; Boente and Fraiman (3), Gervini (15), Gervini (16), Bali et al. (1), Sawant et al. (38) and Lee et al. (24) are examples of such work. This work differs from these in that it looks at robust methods for the functional logistic regression model, for which no known robust approach has been proposed. Functional outliers are, however, inevitable, and it is therefore important that robust techniques for this model be developed.
The first approach taken in this chapter to develop robust estimators is that of using robust principal component estimation. The functional logistic model is reduced to a standard multiple logistic model by approximating the functional covariates as a linear combination of an appropriate basis [Ramsay and Silverman (34)]. This estimation approach for the functional data results in collinear and (potentially) high-dimensional data and, therefore, an initial focus is the elimination of such problems. Some of the approaches proposed in the literature to eliminate these issues are principal component estimation and
ridge estimation, as discussed in Chapter 2. However, in the presence of atypical curves, these estimates are unstable and inaccurate. Febrero et al. (11) define a functional outlier as a curve that has been generated by a stochastic process with a different distribution than the rest of the curves, which are assumed to be identically distributed. This definition appears to be all-encompassing, as it refers to curves that could be far away from most of the curves, curves that have a different pattern from the rest, or even curves that are atypical in some sub-interval of the period of interest. With real-world data, outliers are inevitable, and this makes it necessary to develop robust estimators; we consider robust estimation techniques using principal component estimation.
3.2 Estimation of $X(t)$ and $\beta(t)$
We consider the functional covariates, $X_i(t)$, and the functional parameter, $\beta(t)$, as belonging to a finite-dimensional space generated by (not necessarily) the same bases $\{\phi_j(t)\}_{j=1}^{K_X}$ and $\{\theta_k(t)\}_{k=1}^{K_b}$, respectively. We consider $X_i(t)\in L^2(T)$ with the usual inner product, such that
$$ X_i(t) \approx \sum_{j=1}^{K_X} c_{ij}\phi_j(t), \qquad (3.1) $$
where $\phi_j(t)$, $j = 1,\ldots,K_X$, is an appropriate basis, selected to reflect the characteristics and main features of the data. As discussed in Chapter 1, the truncation lag, $K_X$, is a parameter that is selected based on the features and characteristics of the data. This determines the dimension of the expansion; the larger $K_X$ is, the better the fit to the data. However, a large $K_X$ can result in over-fitting, capturing noise or variation in the data that ought to be ignored. A smaller $K_X$, on the other hand, whilst desirable, should not be so small that there is a risk of overlooking important features of the data.
The selection of the basis system is an important aspect of functional data analysis, in that the features evident in the observed data should be adequately captured by the selection
of an appropriate basis system. A good basis system selection potentially results in a smaller $K_X$, which means less computational time in estimating the functional covariate, and it ensures that the basis coefficients, $c_{ij}$, serve as interesting descriptors of the given data. In general, Fourier basis functions are used to model periodic data and the B-spline basis is used for non-periodic data. However, this is by no means the status quo, and careful consideration needs to be given in this regard. Thus, with the assumption that $X_i(t)$ belongs to the Hilbert space, we are able to reconstruct the functional form of the functional covariates from the observed discrete points using two different approaches: either smoothing or interpolation.
We also define the functional parameter,
$$ \beta(t) = \sum_{k=1}^{K_b} b_k\theta_k(t), \qquad (3.2) $$
where $\theta_k(t)$, $k = 1,\ldots,K_b$, is a known basis function, and $K_b$ is selected in such a way that $K_b \le K_X$. Since the estimates of the basis coefficients $\{c_{ij}\}$ can be found either by smoothing or interpolation, using the basis expansions of the covariates and parameter function as defined in (3.1) and (3.2) in the regression model (2.1), the functional model reduces to a standard multiple one,
$$ l_i = \beta_0 + \int_T X_i(t)\beta(t)\,dt = \beta_0 + \sum_{j=1}^{K_X}\sum_{k=1}^{K_b} c_{ij}\psi_{jk}b_k, \qquad i = 1,\ldots,n, $$
where $\psi_{jk} = \int_T \phi_j(t)\theta_k(t)\,dt$, $j = 1,\ldots,K_X$, $k = 1,\ldots,K_b$; $c_{ij}$ is the basis coefficient; $\beta_0$ is an unknown real parameter; and $b_k$ is the unknown basis coefficient used to estimate the parameter function $\beta(t)$. In matrix form, this can be written as,
$$ L = \beta_0\mathbf{1} + C\Psi b, \qquad (3.3) $$
where $L = (l_1,\ldots,l_n)'$, $C = \{c_{ij}\}_{n\times K_X}$, $\Psi = \{\psi_{jk}\}_{K_X\times K_b}$, $\mathbf{1} = (1,\ldots,1)'$ and $b = (b_1,\ldots,b_{K_b})'$.
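The entries $\psi_{jk} = \int_T \phi_j(t)\theta_k(t)\,dt$ can be approximated by numerical quadrature; the sketch below uses the trapezoidal rule with simple Fourier-type bases on $[0,10]$ purely for illustration (the text's bases are B-splines).

```python
import math

def trapezoid(f, a, b, n=2000):
    """Composite trapezoidal rule for the integral of f over [a, b]."""
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return s * h

# psi_jk = integral over T of phi_j(t) * theta_k(t) dt; illustrative bases
T0, T1 = 0.0, 10.0
phi = [lambda t: 1.0, lambda t: math.sin(2 * math.pi * t / 10)]
theta = [lambda t: 1.0, lambda t: math.cos(2 * math.pi * t / 10)]
Psi = [[trapezoid(lambda t: pj(t) * tk(t), T0, T1) for tk in theta]
       for pj in phi]
```

For these orthogonal trigonometric bases most cross-products integrate to zero over a full period; with B-splines the matrix $\Psi$ is instead banded and non-trivial.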
We develop the principal component estimation technique discussed in Escabias et al. (9) further, to cater for cases where there are functional outliers in the data. Because principal component estimation makes use of the eigendecomposition of the covariance matrix of the design matrix, $C\Psi$, the presence of outliers will greatly influence the PCs. This sensitivity to outliers results in the first few PCs being attracted towards the outliers, and this approach might therefore miss the main modes of variability of the rest of the observations.
Our proposed approach uses robust PC estimation techniques that eliminate multicollinearity and reduce the effect of functional outliers, resulting in a more accurate estimator in the presence of outliers. We use robust PCA methods on the covariate matrix to obtain robust PCs, which are used as the design matrix in the standard multiple logistic model. Robust Principal Component Analysis, ROBPCA, by Hubert et al. (21) is one such approach; it combines projection pursuit with robust covariance estimation based on the Minimum Covariance Determinant (MCD) method. The MCD method seeks an $h$-subset of observations whose classical covariance matrix has the smallest determinant.
The three basic steps in ROBPCA are given in the following algorithm:
Input: Data matrix $A = C\Psi$ of dimension $n\times K_b$, where $n$ is the number of observations and $K_b$ represents the initial number of variables.
Output: Robust PC scores $Z_{n\times k_0}$, where $k_0 < K_b$ is the number of eigenvectors retained.
1. A singular value decomposition (SVD) of the data is performed so as to project the observations onto the space spanned by themselves. This step is especially useful when $K_b \gg n$, as it yields a huge dimension reduction.
2. A measure of outlyingness is defined for each data point. This is achieved by projecting all the points onto many univariate directions, $v$, and then determining the standardized distance of each projected point to the center of the data. The $h\,(<n)$ least outlying data points are determined and retained, where outlyingness is defined as,
$$ \mathrm{Out}(a_i) = \max_{v}\frac{|a_i'v - \hat\mu_r|}{\hat\sigma_r}, \qquad i = 1,\ldots,n, $$
where $\hat\mu_r$ and $\hat\sigma_r$ are the univariate MCD-based location and scale estimates for the projected data points, $a_i'v$, respectively. The $h$ data points are projected onto the subspace spanned by the first $k_0$ eigenvectors of the sample covariance matrix of the $h$-subset.
3. The covariance matrix of the mean-centered data matrix, $\tilde A_{n\times k_0}$, obtained in the second step, is robustly estimated using the MCD estimator, and PCA is applied to this data matrix.
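Step 2's outlyingness measure can be sketched as below; for simplicity, the median and MAD stand in for the univariate MCD location and scale estimates, and random Gaussian directions stand in for the projection directions, both of which are assumptions of this sketch.

```python
import math, random

def median(xs):
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])

def outlyingness(A, n_dirs=500, seed=0):
    """Projection-based outlyingness: max over unit directions v of
    |a_i'v - loc| / scale, with median/MAD as robust location and scale
    (a stand-in for the univariate MCD used by ROBPCA)."""
    rng = random.Random(seed)
    p = len(A[0])
    out = [0.0] * len(A)
    for _ in range(n_dirs):
        v = [rng.gauss(0, 1) for _ in range(p)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
        proj = [sum(ai * vi for ai, vi in zip(row, v)) for row in A]
        loc = median(proj)
        scale = 1.4826 * median([abs(z - loc) for z in proj])  # MAD
        if scale == 0:
            continue
        for i, z in enumerate(proj):
            out[i] = max(out[i], abs(z - loc) / scale)
    return out

# a Gaussian cloud plus one gross outlier at (8, 8)
rng = random.Random(1)
A = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(40)] + [[8.0, 8.0]]
scores = outlyingness(A)
```

The $h$ points with the smallest scores would then be retained for the covariance estimation in Step 3.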
We consider the functional logistic regression model as defined before in (2.1), where the functional covariate is defined as in (3.1), resulting in the standard multiple logistic regression model (3.3). We let $Z_{(r)} = \{\xi_{ij}\}_{n\times k_0}$ be the matrix of robust PCs of the design matrix, such that
$$ Z_{(r)} = \tilde A V_{(r)}, $$
where $\tilde A$ is the design matrix as described in Step 3 of the ROBPCA algorithm; $V_{(r)}$ is a $K_b\times k_0$ matrix whose columns are the eigenvectors associated with the eigenvalues of the robust covariance estimate based on the MCD of $\tilde A_{n\times k_0}$, the mean-centered data matrix. The logit transformation of the reduced functional logistic model becomes,
$$ L^{(s)}_{(r)} = \beta^{(s)}_0\mathbf{1} + Z^{(s)}_{(r)}\gamma^{(s)}, \qquad (3.4) $$
where $\gamma = V'_{(r)}b$ and $s$ is the number of retained PCs in the model.
The design matrix of this model is now void of collinear columns and, because of the robust approach taken to compute the PCs, the effect of outlying curves is minimized. Therefore, the estimate of the functional parameter is given by,
$$ \hat\beta(t) = \hat b'\theta, $$
where $\hat b = V_{(r)}\hat\gamma$.
3.3 Model Selection Issues
There are different criteria used in deciding which PCs should be included in the model. The natural order is to include PCs based on the explained variability. Other criteria are available, as discussed by Hocking (19) and by Müller and Stadtmüller (29), which take into consideration the predictive ability of the PCs. In the simulation study carried out in the next section, the simple natural order of explained variability is used to decide which PCs to include in the model.
Another issue in model selection is the decision concerning the number of PCs to include in the model. One of the measures that can be used is the integrated mean squared error of the $\beta(t)$ parameter function (IMSEB). This is defined as,
$$ \mathrm{IMSEB}(s) = \frac{1}{T}\int_T \left(\beta(t) - \hat\beta^{(s)}(t)\right)^2 dt, $$
where $\hat\beta^{(s)}(t)$ is the estimated parameter function for the logistic model with $s$ PCs.
Another available measure is the mean squared error of the $\beta(t)$ parameters (MSEB), which is defined as,
$$ \mathrm{MSEB}(s) = \frac{1}{K_b + 1}\left[\left(\beta_0 - \hat\beta^{(s)}_0\right)^2 + \left(\beta(t) - \hat\beta^{(s)}(t)\right)^2\right], $$
where $K_b$ is the number of basis functions, $\hat\beta^{(s)}_0$ is the estimated intercept in the model with $s$ PCs and $\hat\beta^{(s)}(t)$ is the estimated parameter function in the functional logistic model with $s$ PCs. The optimal model is selected as the model whose IMSEB or MSEB is smallest. In
the simulation study, we use the MSEB to determine s, the number of PCs to include in the
model. The PCs are then added based on the explained variability of each PC, starting with
the one with the largest explained variability until the optimal number of PCs is attained.
However, in the case of real data, both these measures cannot be computed, and therefore more practical approaches are required. Escabias et al. (9) suggest the use of the estimated variance of the estimated parameters, which is defined as,
$$ \widehat{\mathrm{Var}}\left(\hat\beta^{(s)}(t)\right) = V^{(s)}_{(r)}\left(Z'^{(s)}_{(r)} W^{(s)}_{(r)} Z^{(s)}_{(r)}\right)^{-1} V'^{(s)}_{(r)}, $$
where $W^{(s)}_{(r)} = \mathrm{diag}\left[\hat\pi^{(s)}(1-\hat\pi^{(s)})\right]$, $Z^{(s)}_{(r)}$ is the matrix of $s$ robust PC scores from the design matrix and $V^{(s)}_{(r)}$ is the matrix whose columns are the eigenvectors associated with the eigenvalues of the robust covariance estimate based on the MCD of the design matrix. Escabias et al. (9) suggested selecting the optimal number of PCs by plotting the estimated variance against the number of PCs, $s$, and selecting the $s$ just before a significant increase in the estimated variance. In the case where there are potentially multiple such points, the smallest $s$ is selected. The percentage of variance explained (PVE) is an alternative approach to determining the number of PCs to retain in the model; in this case, PCs with eigenvalues greater than 1.0 can be retained. In both these instances, the idea is to ensure that much of the variability in the model is retained.
Another method is the cross-validation (CV) method. Cross-validation involves partitioning the data into two sets; the first set, known as the training set, is used to determine a predictive model, whilst the second set, known as the test set, is used to validate the predictive model. The leave-one-out cross-validation method leaves out one observation and fits the model with the remaining $n-1$ observations. A prediction is then made for the left-out observation using this model, and this procedure is repeated for all the observations. For
the logistic model, this is defined as,
$$ \mathrm{CV}(s) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat\pi^{(s)}_{i,-i}\right)^2, $$
where $\hat\pi_{i,-i}$ indicates the predicted response with observation $i$ missing from the predictive model. The optimal number of PCs is selected as that with the minimum CV. The Information Criterion (IC) method is another alternative. The IC can be viewed as a compromise between the goodness of fit and the complexity of the model. The Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are among the widely used ICs; to select the optimal number of PCs, the IC is computed for varying values of $s$ and the optimal $s$ is chosen at the minimum.
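The IC-based choice of $s$ can be sketched as follows; the log-likelihood values are hypothetical, and $k = s + 1$ parameters (intercept plus $s$ PCs) is an assumption of this sketch.

```python
import math

def aic_bic(log_lik, n_params, n_obs):
    """AIC = -2 logL + 2k ; BIC = -2 logL + k log(n)."""
    aic = -2.0 * log_lik + 2.0 * n_params
    bic = -2.0 * log_lik + n_params * math.log(n_obs)
    return aic, bic

# hypothetical log-likelihoods of logistic fits with s = 1..4 retained PCs;
# each extra PC improves the fit, but with diminishing returns
n = 50
loglik_by_s = {1: -30.2, 2: -24.1, 3: -23.6, 4: -23.5}
best_s = min(loglik_by_s,
             key=lambda s: aic_bic(loglik_by_s[s], s + 1, n)[1])  # min BIC
```

BIC's $\log(n)$ penalty stops adding PCs once the fit gain no longer justifies the extra parameter, which is exactly the goodness-of-fit versus complexity compromise described above.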
3.4 Numerical Examples
In this section, we study the performance of our proposed estimation approach by way of a simulation study, as well as by applying the methodology to the Canadian weather data. We show the improved accuracy of the parameter function estimation in the case where outliers are present.
3.4.1 Simulation Study
The following steps were carried out in this simulation study in order to investigate how the proposed estimation technique performs and compares to other existing approaches in the estimation of the parameter function of the functional logistic regression model. In the first step, to generate the data, the following were done:
(a) Generate $n = 50$ sample curves for the functional logistic model.
We generate 50 sample functional observations of a known stochastic process $X(\cdot)$, considered over the interval $[0,10]$ with 21 equally spaced time points. We define this
Figure 3.1: 50 sample curves generated from the overlapping stochastic process $X_i(t) = a_{i1} + a_{i2}t + W_i(t)$, differentiated for each class of the response.
process as,
$$ X_i(t) = a_{i1} + a_{i2}t + W_i(t), \qquad W_i(t) = \sum_{r=1}^{10}\left[ b_{i1}\sin\left(\tfrac{2\pi}{10}rt\right) + b_{i2}\cos\left(\tfrac{2\pi}{10}rt\right)\right], $$
where $a_{i1}\sim U[1,4]$ or $a_{i1}\sim U[2,4]$, $a_{i2}\sim N[1,0.2]$ or $a_{i2}\sim N[1,0.6]$, and $b_{i1}, b_{i2}\sim N[0, 1/r^2]$. This model is illustrated in Figure 3.1. Since the response is binary, the sample curves are differentiated for the two classes. To obtain the functional form of these sample curves, we consider them to belong to the finite-dimensional space generated by some basis, where $\phi(t)$ is a cubic B-spline basis defined on equally spaced nodes on the $[0,10]$ interval.
We used the generalized cross-validation (GCV) method to determine the number of basis functions to use, as well as the smoothing parameter, in estimating the functional covariate, $X(t)$. The criterion is expressed as,
$$ \mathrm{GCV}(\lambda) = \left(\frac{n}{n - df(\lambda)}\right)\left(\frac{SSE}{n - df(\lambda)}\right), $$
where $df(\lambda) = \mathrm{trace}\{S_{\phi,\lambda}\}$ is the equivalent degrees of freedom and $S_{\phi,\lambda}$ is the projection operator such that $\hat X = S_{\phi,\lambda}X$; $SSE$ is the residual sum of squares; and $\lambda$ is the smoothing parameter that controls the trade-off between fit to the data and smoothness. The minimization of GCV with respect to $\lambda$ is achieved by doing a grid search over a set of values of $\lambda$ and selecting the one with the minimum GCV value.
(b) The natural cubic spline interpolation of the parameter function $\sin(t + \pi/4)$ is selected as the true $\beta(t)$ function. The basis coefficients, $b = (b_1,\ldots,b_{K_b})'$, of the parameter function are known and are used to assess the estimation techniques presented in this work.
(c) The probabilities of a positive response ($Y_i = 1$) given $X_i(t)$ are,
$$ \pi_i = \frac{\exp\{l_i\}}{1 + \exp\{l_i\}}, \qquad i = 1,\ldots,n, $$
where $l_i$ is as defined before in (3.3), and $\beta_0$ is fixed at 0. The values of the response, $Y_i$, are obtained by simulating observations of a Bernoulli distribution with probabilities $\pi_i$.
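The GCV grid search used in step (a) can be sketched numerically; the (SSE, df) grid below is entirely hypothetical, chosen only to show the trade-off that the grid search minimizes.

```python
def gcv(sse, df, n):
    """GCV(lambda) = (n / (n - df)) * (SSE / (n - df)), as in the text."""
    return (n / (n - df)) * (sse / (n - df))

# hypothetical grid: heavier smoothing (larger lambda) raises SSE but
# lowers the equivalent degrees of freedom df(lambda)
n = 50
grid = [(2 ** k, 40.0 + 0.001 * 2 ** k, 20.0 - 0.8 * k) for k in range(5, 13)]
best_lam = min(grid, key=lambda g: gcv(g[1], g[2], n))[0]
```

With these made-up numbers the criterion bottoms out in the interior of the grid, mirroring the grid-search procedure described above.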
The second step involves contamination of the simulated data. We adopted the contamination process discussed by Fraiman and Muniz (14):
(i) Model 0: No Contamination
$X(t) = a_1 + a_2t + W(t)$ is the generated data, as discussed before.
(ii) Model 1: Asymmetric Contamination
$Z(t) = X(t) + cM$, where $c$ is 1 with probability $q$ and 0 with probability $1-q$; $q \in \{0\%, 5\%, 10\%, 15\%, 20\%\}$ is the contamination level; $M$ is the contamination constant, taking the value 25; and $X(t)$ is as defined in Model 0.
Figure 3.2: Sample curves for Models 1-4 at $q = 5\%$: (a) Model 1, (b) Model 2, (c) Model 3, (d) Model 4.
(iii) Model 2: Symmetric Contamination
$Z(t) = X(t) + c\sigma M$, where $X(t)$, $c$ and $M$ are as defined before and $\sigma$ is a sequence of random variables, independent of $c$, that takes the values 1 and $-1$ each with probability 0.5.
(iv) Model 3: Partial Contamination
$Z(t) = X(t) + c\sigma M$ if $t > T$, and $Z(t) = X(t)$ otherwise.

The robust PCA approach has the highest area under the
ROC curves, especially in the presence of the outlying sample curve. These goodness-of-fit measures indicate that the model using the robust approach has an AUC of 0.7125, whilst the one using the non-robust approach has an AUC of 0.6755. This is considered excellent discrimination by Hosmer and Lemeshow (20), an indication that the model based on the robust PCs better predicts the risk of drought.
3.5 Conclusion
The objective of this chapter was to propose a robust estimation technique for the functional logistic regression model. The estimation of the functional parameter in this model cannot be achieved by the regular method of maximum likelihood. Therefore, we approximate the functional observations and define the parameter function in a finite-dimensional space generated by a known basis. This reduces the functional model to a multiple model, albeit one with highly collinear covariates. The presence of multicollinearity and outliers causes the maximum likelihood estimate from this multiple logistic model to be unstable and, therefore, unreliable.
Robust estimation methods for the functional logistic model are therefore an important tool in the estimation of the parameter function derived in this manner. In this work, we have proposed an approach that makes use of robust principal component estimation; it reduces dimensionality and improves the estimation of the parameter function in the presence of multicollinearity and outliers. From the simulation study, we have shown that, in the presence of outliers, this approach results in better estimates of the parameter function and, subsequently, better interpretation of the model. We also illustrated the improved performance of the proposed method by analyzing a real data set.
Chapter 4
Robust Penalized Functional Logistic Regression
4.1 Introduction
In Chapter 3, we proposed an estimation technique that makes use of robust principal component estimation. This method was shown to be a more reliable approach for estimating the functional parameter in the functional logistic regression model, especially since it became necessary to use an approach that eliminated multicollinearity, as the resulting design matrix in the reduced standard logistic model was riddled with collinear columns. Even though the use of a robust principal component estimation technique resulted in better estimates for the model in the presence of outliers, the influence of outliers is still noticeable. This can be attributed to the fact that this approach still makes use of maximum likelihood estimation.
It has been shown that, for the standard logistic model, the MLE is sensitive to outliers [Hosmer and Lemeshow (20), Croux et al. (8), Carroll and Pederson (5), Pregibon (32)]. The robustness properties of the MLE have been analyzed in terms of the influence function. The influence function (IF) of an estimator is an asymptotic approximation to the behavior of that estimator when the sample contains a small fraction of identical outliers. An estimator with an unbounded IF is non-robust, since an infinitesimal model deviation can cause the bias to be arbitrarily large. The IF of the MLE for the standard logistic model is,
$$ \mathrm{IF}(y, x; \beta) = M^{-1}\left(y - \pi(\beta'x)\right)x, $$
where $M$ is the Fisher information matrix, $M = E[\pi'(\beta'x)xx']$. Therefore, the MLE is unbounded in $x$ and bounded in $y$. In fact, the kinds of outliers whose influence is large for the logistic model are such that either $\|x_i\|\to\infty$, $y_i = 1$ and $\beta'x_i$ is bounded away from $+\infty$, or $\|x_i\|\to\infty$, $y_i = 0$ and $\beta'x_i$ is bounded away from $-\infty$. In other words, extreme values in the design space lead to a biased MLE. Moreover, it has also been shown that misclassification errors, i.e. errors in the response, lead to a biased MLE [Pregibon (32), Copas (6)]. Therefore, there is a need to develop robust estimation methods that deal with these issues.
Many approaches have been discussed in the literature that look at ways to turn the MLE into an estimate with bounded influence. Such methods involve techniques for down-weighting high-leverage observations. In this chapter, we look at one such alternative approach, a Mallows-type estimate. We adopt the same estimation procedure as before, which reduces the functional logistic model to a standard logistic model by approximating the functional covariate and the functional parameter as linear combinations of appropriate bases. A Huber-type loss function is then used to down-weight the high-leverage observations, and a roughness penalty term ensures that the estimated parameter function, $\hat\beta(t)$, is smooth.
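To illustrate the idea of down-weighting, a common Huber-type weight function, shown here only as a generic example rather than the chapter's exact loss, is flat inside a central region and decays outside it:

```python
def huber_weight(r, k=1.345):
    """Generic Huber-type weight: 1 inside [-k, k], decaying as k/|r|
    outside, so observations far from the bulk receive reduced weight
    (k = 1.345 is a conventional tuning constant, an assumption here)."""
    return 1.0 if abs(r) <= k else k / abs(r)

ws = [huber_weight(r) for r in (-3.0, -1.0, 0.0, 2.0)]
```

Applying such weights to the score equations bounds the contribution of each observation, which is precisely what gives the estimator a bounded influence function.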
4.2 Estimation of $X(t)$ and $\beta(t)$
We consider the functional covariates, $X_i(t)$, and the functional parameter, $\beta(t)$, as belonging to a finite-dimensional space generated by (not necessarily) the same basis. We consider $X_i(t)\in H$ (i.e. a Hilbert space), such that,
$$ X_i(t) \approx \sum_{j=1}^{K_X} c_{ij}\phi_j(t), \qquad (4.1) $$
where $\{\phi_j(t)\}_{j=1}^{K_X}$ is an appropriate basis, selected to reflect the characteristics of the data.
We assume β(t) ∈ H and that the parameter is defined as a linear combination of a cubic B-spline basis {θ_k(t)}_{k=1}^{K_b}, such that

    β(t) = Σ_{k=1}^{K_b} b_k θ_k(t),

where the order of expansion, K_b, is estimated either by cross-validation or by generalized cross-validation, and is such that K_b ≤ K_X. This reduces the functional logistic model to a standard logistic model,
    π_i = P(Y_i = 1 | X_i(t)), with X_i(t) ∈ L²(T),
        = exp(β_0 + Σ_{j=1}^{K_X} Σ_{k=1}^{K_b} c_ij J_jk b_k) / [1 + exp(β_0 + Σ_{j=1}^{K_X} Σ_{k=1}^{K_b} c_ij J_jk b_k)],    i = 1, …, n,    (4.2)
whose logit transformation is given by

    l_i = β_0 + Σ_{j=1}^{K_X} Σ_{k=1}^{K_b} c_ij J_jk b_k,    i = 1, …, n,

or, in matrix form,

    L = β_0 1 + CJb,    (4.3)

where L = (l_1, …, l_n)′; C = {c_ij}_{n×K_X}; J = {J_jk}_{K_X×K_b} with J_jk = ∫_T φ_j(t) θ_k(t) dt; 1 = (1, …, 1)′; and b = (b_1, …, b_{K_b})′. In this reduced form, the maximum likelihood estimate can be derived and the unknown parameters obtained using an iterative method such as Newton-Raphson. The next section explains a robust penalization method that minimizes the effect of outliers on the estimate, resulting in a smooth robust estimate for the functional logistic model.
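As a concrete illustration, the reduction to (4.3) can be sketched numerically. The bases, grid and data below are hypothetical stand-ins (a trigonometric φ-basis and rescaled monomials in place of B-splines), not the dissertation's actual choices.

```python
import numpy as np

# Sketch: approximate J_jk = \int_T phi_j(t) theta_k(t) dt by trapezoidal
# quadrature on a grid, then form the logits L = beta_0 1 + C J b as in (4.3).
# Basis choices and coefficients are illustrative, not the actual bases used.

t = np.linspace(0.0, 10.0, 201)                   # grid over T = [0, 10]
dt = t[1] - t[0]
K_X, K_b = 5, 4

Phi = np.column_stack([np.sin((j + 1) * np.pi * t / 10) for j in range(K_X)])
Theta = np.column_stack([(t / 10) ** k for k in range(K_b)])  # stand-in basis

w = np.full(t.size, dt)
w[0] = w[-1] = dt / 2                             # trapezoidal weights
J = np.einsum("t,tj,tk->jk", w, Phi, Theta)       # (K_X, K_b) inner products

rng = np.random.default_rng(0)
n = 50
C = rng.normal(size=(n, K_X))                     # basis coefficients c_ij
b = 0.1 * rng.normal(size=K_b)                    # parameter coefficients b_k
beta0 = 0.5

logits = beta0 + C @ J @ b                        # l_i, one per curve
pi = 1.0 / (1.0 + np.exp(-logits))                # success probabilities
print(J.shape, pi.shape)
```

The same quadrature-weight trick evaluates any of the basis inner products appearing later in the chapter.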
4.3 Robust Penalized Maximum Likelihood Estimation
Cardot and Sarda (4), Goldsmith et al. (18) and Ogden and Reiss (30) have considered penalization in the estimation of the functional logistic model or the generalized functional linear model. The penalty term is introduced in the maximization of the log-likelihood function to ensure that the smoothness property of the parameter function is met. This regularization approach to obtaining smooth functional estimates is common, especially in estimating the functional form of the predictor, X(t), as discussed in Chapter 1, and also in functional principal component analysis [Ramsay and Silverman (34)].

The general form of the penalized likelihood is

    L_λ(β; Y) = L(β; Y) − λ Pen(β(t)),

where L(β; Y) is the log-likelihood function, λ is the smoothing parameter and Pen(β(t)) is the penalty term that ensures that the parameter function is smooth. Cardot and Sarda (4) considered a roughness penalty of the form λ‖θ^{(m)′}b‖², where θ denotes the vector of all the B-splines and θ^{(m)} the vector of derivatives of order m of all the B-splines, for some integer m. The GCV criterion is then used to determine the value of the smoothing parameter, λ.
Goldsmith et al. (18), on the other hand, proposed an automated way of deducing the smoothing parameter in the penalty term. They considered the parameter function, β(t), as the truncated power series spline basis expansion

    β(t) = b_0 + b_1 t + Σ_{k=3}^{K_b} b_k (t − κ_k)_+,

where {κ_k}_{k=3}^{K_b} are knots. They also considered the reduced functional model as a mixed effects model with K_b − 2 random effects, {b_k}_{k=3}^{K_b} ~ N(0, σ_b² I). The penalty term is therefore given as (1/(2σ_b²)) b′Db, where D is the penalty matrix and 1/σ_b² is the smoothing parameter. Thus, the smoothing parameter is estimated as a variance component in the mixed effects model.
We build on these ideas and propose a robust penalization method. It is important to note that the penalized likelihood methods discussed above, and similar methods in the literature, are sensitive to outlying observations. From (4.2), we denote the parameter vector as β = (β_0, b′)′ and the covariate vector as a_i = (1, {CJ}_i′)′ to simplify the notation. The proposed estimator [RPMLE] is defined as

    β̂ = arg max_β Σ_{i=1}^n w_i {y_i log(π_i) + (1 − y_i) log(1 − π_i)} − (λ/2) β′D₀β,    (4.4)

where π_i is short for π(a_i′β); λ is a smoothing parameter; the w_i are weights that may depend on a_i, y_i or both; and D₀ is the penalty matrix augmented by a leading column and row of K_b + 1 zeros. When w_i = 1 ∀i and λ = 0, (4.4) yields the maximum likelihood estimator. We consider a Mallows-class estimator [Mallows (26)] that addresses the problem of outliers in the design space for the reduced functional logistic model. The main idea of a Mallows-type estimator is to down-weight observations which have high leverage. In our proposed robust estimator in (4.4), this is achieved by the following weighting scheme, as proposed by Stefanski (40),
    w_i = W(h_n(x_i)),    h_n(x) = [(x − μ̂_n)′ Σ̂_n⁻¹ (x − μ̂_n)]^{1/2},

with μ̂_n and Σ̂_n being a robust location vector and a robust dispersion matrix estimate of the design matrix, respectively. W is a non-increasing function such that W(u)u is bounded. We use the weight function corresponding to Huber's ψ, which is defined as

    W_k(u) = min(1, k/|u|),
where k = 1.345 is a tuning constant. It is noted that while the Mallows-class estimator can be less efficient than the usual logistic MLE, its bias can be smaller than that of the MLE. Carroll and Pederson (5) illustrated that, when the design has extreme design points, selective down-weighting can lead to less biased estimates, and this decrease in bias, together with the robustness property, may be worth the lower efficiency.
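The weighting scheme above can be sketched as follows. Since a full MCD fit is beyond the scope of a short sketch, a coordinatewise median/MAD construction stands in for the robust location and scatter; this substitution is an assumption of the sketch, not the method used in the chapter (which is MCD-based).

```python
import numpy as np

# Sketch of the Mallows weights: Huber-type W_k(u) = min(1, k/|u|) applied
# to a robust Mahalanobis distance h_n(x) of each design row. The crude
# median/MAD-based scatter below is a stand-in for the MCD estimates.

def robust_location_scatter(X):
    """Crude robust location/scatter (illustrative stand-in for MCD)."""
    mu = np.median(X, axis=0)
    mad = 1.4826 * np.median(np.abs(X - mu), axis=0)  # consistency-scaled MAD
    R = np.corrcoef((X - mu).T)                       # shape from correlations
    Sigma = R * np.outer(mad, mad)
    return mu, Sigma

def mallows_weights(X, k=1.345):
    mu, Sigma = robust_location_scatter(X)
    d = X - mu
    h = np.sqrt(np.einsum("ij,jk,ik->i", d, np.linalg.inv(Sigma), d))
    return np.minimum(1.0, k / np.maximum(h, 1e-12))  # Huber weight W_k

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
X[0] = 20.0                                           # one gross leverage point
w = mallows_weights(X)
print(w[0], w.min() == w[0])
```

The leverage point receives a weight far below 1, while typical observations are down-weighted only mildly or not at all.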
The estimation of the smoothing parameter, λ, can be achieved by cross-validation or by the GCV method. The smoothness of the parameter function, β(t), is based on the idea of the curvature of a function. This curvature is quantified as

    ∫_T {β″(t)}² dt = Σ_{k=1}^{K_b} Σ_{l=1}^{K_b} b_k [∫_T θ″_k(t) θ″_l(t) dt] b_l    (4.5)
                    = b′Db,    (4.6)

where D = {∫_T θ″_k(t) θ″_l(t) dt}_{K_b×K_b} is the smoother. The approximation of the smoother is possible by means of a numerical quadrature scheme.
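One possible quadrature sketch for the smoother D follows, assuming finite-difference second derivatives and trapezoidal weights on a fine grid; a monomial basis stands in for the cubic B-splines, so the numbers are purely illustrative.

```python
import numpy as np

# Sketch: approximate D_kl = \int_T theta_k''(t) theta_l''(t) dt by
# central-difference second derivatives plus the trapezoidal rule.
# A monomial basis stands in for the cubic B-splines of the chapter.

t = np.linspace(0.0, 1.0, 1001)
dt = t[1] - t[0]
K_b = 5
Theta = np.column_stack([t ** k for k in range(K_b)])   # illustrative basis

# Second derivatives of each basis column (second-order accurate edges).
Theta_d = np.gradient(Theta, dt, axis=0, edge_order=2)
Theta_dd = np.gradient(Theta_d, dt, axis=0, edge_order=2)

w = np.full(t.size, dt)
w[0] = w[-1] = dt / 2                                   # trapezoidal weights
D = np.einsum("t,tj,tk->jk", w, Theta_dd, Theta_dd)     # (K_b, K_b) smoother

# b' D b then approximates \int {beta''(t)}^2 dt, as in (4.5)-(4.6).
b = np.array([0.0, 0.0, 1.0, 0.0, 0.0])                 # beta(t) = t^2, beta'' = 2
print(float(b @ D @ b))                                 # ~ \int_0^1 4 dt = 4
```

Because the test function is quadratic, the finite-difference second derivative is exact and b′Db recovers the curvature integral ∫₀¹ 4 dt = 4.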
4.3.1 Details of the Estimation Technique
Iterative algorithms are usually used for the computation of M-estimators. Since the likelihood equations for (4.4) are non-linear in the parameter of interest, β, an iterative method has to be used to solve them and obtain estimates of the parameters. A modified scoring algorithm or the Newton-Raphson method may be used to solve these non-linear likelihood equations.
To make use of the Newton-Raphson method, the gradient and Hessian matrices are required. We consider the likelihood function in (4.4) and note that

    log(π_i) = log[exp(a_i′β) / (1 + exp(a_i′β))] = a_i′β − log(1 + exp(a_i′β)),

and similarly,

    log(1 − π_i) = log[1 / (1 + exp(a_i′β))] = −log(1 + exp(a_i′β)).

The likelihood function in (4.4) can then be re-written as

    L(β; y) = Σ_{i=1}^n w_i {y_i log(π_i) + (1 − y_i) log(1 − π_i)} − (λ/2) β′D₀β
            = Σ_{i=1}^n w_i {y_i a_i′β − log(1 + exp(a_i′β))} − (λ/2) β′D₀β.

The first derivative (the gradient) then becomes

    L′(β; y) = Σ_{i=1}^n w_i [y_i a_i − a_i exp(a_i′β) / (1 + exp(a_i′β))] − λD₀β
             = Σ_{i=1}^n w_i a_i {y_i − π_i} − λD₀β
             = A′Wr − λD₀β,
where r = y − π, with π = (π_1, …, π_n)′, and W = diag(w_i). The second derivative (the Hessian) becomes

    L″(β; y) = −Σ_{i=1}^n w_i a_i a_i′ {exp(a_i′β)[1 + exp(a_i′β)] − [exp(a_i′β)]²} / [1 + exp(a_i′β)]² − λD₀
             = −Σ_{i=1}^n w_i a_i π_i(1 − π_i) a_i′ − λD₀
             = −(A′V^{1/2}WV^{1/2}A + λD₀),

where V = diag(π_i(1 − π_i)). The derivation of the Hessian matrix can be computationally expensive, and an approximation based on the gradient may be used instead. The Newton-Raphson method leads to the iterative scheme

    β^{(k+1)} = β^{(k)} + (A′V^{1/2}WV^{1/2}A + λD₀)⁻¹ (A′Wr − λD₀β^{(k)}),    (4.7)

in which V and r are evaluated at β^{(k)}. When all the weights are unity, i.e., w_i = 1 ∀i, and λ = 0, the above iterative scheme simplifies to the usual maximum likelihood iterative procedure for β̂. A grid search is done to obtain the optimal smoothing parameter, λ, by way of the cross-validation method. This selection criterion is discussed in the following section.
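A minimal sketch of the iteration (4.7) on simulated data follows; the weights, penalty matrix and design are illustrative placeholders rather than the functional-data quantities of the chapter.

```python
import numpy as np

# Sketch of the iterative scheme (4.7):
# beta <- beta + (A' V^{1/2} W V^{1/2} A + lam D0)^{-1} (A' W r - lam D0 beta),
# with r = y - pi and V = diag(pi_i (1 - pi_i)). Weights w_i, penalty D0 and
# the design below are illustrative placeholders.

def rpmle_newton(A, y, w, lam, D0, n_iter=50):
    beta = np.zeros(A.shape[1])
    W = np.diag(w)
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-A @ beta))
        V = np.diag(pi * (1.0 - pi))
        H = A.T @ np.sqrt(V) @ W @ np.sqrt(V) @ A + lam * D0
        g = A.T @ W @ (y - pi) - lam * D0 @ beta
        step = np.linalg.solve(H, g)
        beta = beta + step
        if np.linalg.norm(step) < 1e-8:           # converged
            break
    return beta

rng = np.random.default_rng(2)
n, p = 200, 4
A = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([0.5, 1.0, -1.0, 0.5])
y = (rng.random(n) < 1 / (1 + np.exp(-A @ beta_true))).astype(float)

D0 = np.eye(p)
D0[0, 0] = 0.0                                    # leading zero: no intercept penalty
beta_hat = rpmle_newton(A, y, w=np.ones(n), lam=0.1, D0=D0)
print(np.round(beta_hat, 2))
```

With unit weights and λ = 0 the same loop reproduces the usual maximum likelihood iteration, matching the remark above.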
4.3.2 Tuning Parameter Selection
The influence of the penalty depends on the magnitude of the smoothing parameter, λ, which is non-negative. When λ = 0 we have un-penalized, and thus unbiased, estimates in which smoothness is not emphasized, whereas large values of λ indicate that more weight is given to the smoothness of the estimate. Increasing λ makes β(t) smoother, and an optimum is reached by a grid search over λ, varying it systematically and monitoring the prediction error using techniques such as cross-validation. The factor 1/2 in the penalty is used to cancel the factor 2 that appears when the penalty is differentiated.
The leave-one-out cross-validation method leaves out one observation and fits the model with the remaining n − 1 observations. A prediction is then made for the left-out observation using this fitted model, and the procedure is repeated for all observations. For the logistic model, this is defined as

    CV = (1/n) Σ_{i=1}^n (Y_i − π̂_{(−i)})²,

where π̂_{(−i)} corresponds to the predicted probability of a positive response given the predictor, with observation i omitted from the fitted model. The CV is computed for a range of λ values, and the smoothing parameter with minimum CV is selected. Another option for exponential family models is minimization of an information criterion (IC). An IC can be viewed as a compromise between the goodness of fit and the complexity of the model; the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are among the most widely used. To select the optimal smoothing parameter, a logarithmic grid search on the non-negative penalty parameter λ is performed and the IC computed; the optimal λ is chosen at the minimum.
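The grid search could be sketched as below. The inner estimator is a plain ridge-penalized logistic fit (penalizing all coefficients, intercept included, for brevity), a deliberate simplification of the weighted, D₀-penalized estimator of this chapter.

```python
import numpy as np

# Sketch of the lambda search: leave-one-out cross-validation
# CV(lambda) = (1/n) sum_i (y_i - pihat_{-i})^2 over a logarithmic grid.
# The inner ridge-penalized logistic fit is an illustrative simplification.

def fit_penalized_logistic(A, y, lam, n_iter=25):
    p = A.shape[1]
    beta = np.zeros(p)
    for _ in range(n_iter):                       # Newton iterations
        eta = np.clip(A @ beta, -30, 30)          # guard against overflow
        pi = 1.0 / (1.0 + np.exp(-eta))
        H = A.T @ (A * (pi * (1 - pi))[:, None]) + lam * np.eye(p)
        beta = beta + np.linalg.solve(H, A.T @ (y - pi) - lam * beta)
    return beta

def loocv(A, y, lam_grid):
    n = len(y)
    cv = np.empty(len(lam_grid))
    for g, lam in enumerate(lam_grid):
        err = 0.0
        for i in range(n):                        # refit without observation i
            keep = np.arange(n) != i
            beta = fit_penalized_logistic(A[keep], y[keep], lam)
            pihat = 1.0 / (1.0 + np.exp(-np.clip(A[i] @ beta, -30, 30)))
            err += (y[i] - pihat) ** 2
        cv[g] = err / n
    return cv

rng = np.random.default_rng(3)
n = 60
A = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = (rng.random(n) < 1 / (1 + np.exp(-A @ np.array([0.0, 1.0, -1.0])))).astype(float)

lam_grid = 10.0 ** np.arange(-3.0, 3.0)           # logarithmic grid
cv = loocv(A, y, lam_grid)
lam_opt = lam_grid[int(np.argmin(cv))]
print(lam_opt)
```

An AIC/BIC-based search would replace the squared-error criterion with the penalized deviance plus a complexity term, scanned over the same grid.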
4.4 Numerical Examples
To display the robustness properties of the proposed estimator, we conduct a simulation study and compare the robust approach discussed in this chapter with the existing maximum likelihood estimate (MLE),

    β̂ = arg max_β Σ_{i=1}^n {y_i log(π_i) + (1 − y_i) log(1 − π_i)},    (4.8)

as well as the penalized maximum likelihood estimate (PMLE),

    β̂ = arg max_β Σ_{i=1}^n {y_i log(π_i) + (1 − y_i) log(1 − π_i)} − (λ/2) β′D₀β.    (4.9)

An application of the estimation method to a real data set is also explored, and the results are discussed in this section.
4.4.1 Simulation Study
We use the same data generated in Chapter 3 for this simulation study. That is, we generate 50 sample functional observations of a known stochastic process X(·), considered over the interval [0, 10] with 21 equally spaced time points, where the process is defined as

    X_i(t) = a_{i1} + a_{i2}t + W_i(t),
    W_i(t) = Σ_{r=1}^{10} [b_{i1} sin(2πrt/10) + b_{i2} cos(2πrt/10)],

where a_{i1} ~ U[1, 4] or a_{i1} ~ U[2, 4], a_{i2} ~ N(1, 0.2) or a_{i2} ~ N(1, 0.6), and b_{i1}, b_{i2} ~ N(0, 1/r²).
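The simulation design can be sketched as follows. The frequency term 2πrt/10 and the reading of N(0, 1/r²) as a variance are our reconstruction of the garbled source and should be checked against Chapter 3.

```python
import numpy as np

# Sketch of the simulation design: 50 curves X_i(t) = a_i1 + a_i2 t + W_i(t)
# on 21 equally spaced points of [0, 10]. The frequency scaling 2*pi*r*t/10
# and sd = 1/r (variance 1/r^2) are assumptions in this sketch.

rng = np.random.default_rng(4)
n = 50
t = np.linspace(0.0, 10.0, 21)

a1 = rng.uniform(1, 4, size=n)               # or U[2, 4] in the second setting
a2 = rng.normal(1, 0.2, size=n)              # or N(1, 0.6)
X = np.empty((n, t.size))
for i in range(n):
    W = np.zeros_like(t)
    for r in range(1, 11):
        b1, b2 = rng.normal(0, 1 / r, size=2)     # coefficients shrink with r
        W += b1 * np.sin(2 * np.pi * r * t / 10) + b2 * np.cos(2 * np.pi * r * t / 10)
    X[i] = a1[i] + a2[i] * t + W
print(X.shape)
```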
The functional predictor is estimated using a cubic B-spline basis expansion, where the order of expansion, K_X, is determined by the GCV method. The true β(t) function is taken to be sin(t + π/4), and its natural cubic spline is used as an approximation of the function. Outlying sample curves are introduced for the stochastic process X(·) using the models of Fraiman and Muniz (14), as discussed in Chapter 3: asymmetric contamination (Model 1), symmetric contamination (Model 2), partial contamination (Model 3) and peak contamination (Model 4).

The smoothing parameter is determined by the cross-validation method. The Huber-type weights make use of the Mahalanobis distance, where the robust location and dispersion
Table 4.1: Mean MSEB (standard error) for the estimation of the functional parameter, β(t), for Models 2, 3 and 4
Model 2 Model 3 Model 4
Cont. (%) MLE PMLE RPMLE MLE PMLE RPMLE MLE PMLE RPMLE
0 26.3043 1.3211 0.5894 26.3043 1.3211 0.5894 26.3043 1.3211 0.5894
(9.5266) (0.5115) (0.1081) (9.5266) (0.5115) (0.1081) (9.5266) (0.5115) (0.1081)
5 20.5941 1.0453 0.5681 16.3381 0.9292 0.7444 20.5692 1.0591 0.5308
(12.7574) (0.3335) (0.1072) (6.2826) (0.2919) (0.3015) (6.9599) (0.3239) (0.0883)
10 8.1512 0.8327 0.4756 12.4196 0.8676 0.7055 18.2714 0.8676 0.6798
(3.6822) (0.2573) (0.0386) (4.9123) (0.2793) (0.2218) (4.7792) (0.2182) (0.2067)
15 4.3307 0.7522 0.4958 11.5515 1.0341 0.6168 17.0797 0.9183 0.6318
(2.9919) (0.2844) (0.0469) (4.7697) (0.3466) (0.1429) (4.4364) (0.2361) (0.1510)
20 3.7119 0.6544 0.4858 7.8925 0.8528 0.6115 17.5679 0.9504 0.6578
(4.4867) (0.2118) (0.0214) (3.3326) (0.2776) (0.1223) (5.1051) (0.2697) (0.1408)
estimates are both MCD-based. Table 4.1 summarizes the average of the MSEB measures for the three estimation methods discussed in this chapter, i.e., the maximum likelihood estimate (MLE), the penalized maximum likelihood estimate (PMLE) and our proposed robust penalized approach (RPMLE). There is improved estimation with the inclusion of the smoothing term, as expected and as discussed by Cardot and Sarda (4) and Goldsmith et al. (18). In the presence of outliers, however, our robust approach is a markedly better estimator.
A comparison of our robust penalized estimation approach with the non-robust penalized approach is shown in Figure 4.1. It is evident that, in the presence of outliers, down-weighting these leverage points results in a more reliable estimate of the parameters (with the notable exception of high contamination levels for Model 1). Essentially, the effect of the outlying curves on the estimation of the functional parameter, β(t), is minimized, resulting in a more reliable functional logistic model for the observed data.
4.4.2 Poblenou NOX Levels Data Set
(a) Model 1 (b) Model 2 (c) Model 3 (d) Model 4
Figure 4.1: Comparison of the penalized method and robust penalized method at differing contamination levels

This dataset was used by Febrero et al. (11) and is a collection of NOX emissions in Poblenou, Barcelona (Spain) over a period of 115 days, with recordings starting on 23 February and ending on 26 June 2005. Over that period, hourly measurements of NOX were recorded; we therefore split the whole sample of hourly measures into a dataset of functional trajectories of 24-hour observations (each curve represents the evolution of the levels in one day). Twelve curves that contained missing data were excluded from the study.
We consider the functional logistic regression model where the functional predictor is the functional trajectory of NOX levels over a 24-hour period, with a binary response indicating whether the day is a working day (Y = 1) or a non-working day (Y = 0). This data set is known to have outliers; a robust functional principal component method by Sawant et al. (38) identified the outlying curves shown in Figure 4.3. These outliers were identified for the following five days: 03/09, 03/11, 03/18, 04/29 and 05/02, all of which were working days.

Figure 4.2: NOX emission levels for non-working (top) and working (bottom) days
Figure 4.3: Functional form of the hourly trajectories with the outliers detected in the NOX dataset
The functional trajectories are estimated using a truncated Karhunen-Loève decomposition. Let Σ_{k=1}^∞ λ_k φ_k(s)φ_k(t) be the spectral decomposition of the covariance function C(s, t), where λ_1 ≥ λ_2 ≥ ⋯ are the non-increasing eigenvalues associated with the orthonormal eigenfunctions φ_1(t), φ_2(t), …. Supposing X_i(t) ∈ L²(T) and centered, then

    X_i(t) = Σ_{j=1}^{K_X} c_ij φ_j(t),

where the φ_j(t) are the first K_X eigenfunctions of the smooth covariance function C(s, t) = cov[X_i(s), X_i(t)] and c_ij = ∫_T X_i(t)φ_j(t) dt. The truncation lag, K_X, is determined using a 99% cut-off of the percent variance explained (PVE), defined as

    PVE(K_X) = Σ_{p=1}^{K_X} λ_p / Σ_{p=1}^{24} λ_p,

where K_X is the minimum number of principal components needed to explain 99% of the total variation in the model.
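The truncation rule can be sketched on simulated data; the eigendecomposition here is the classical one, and a robust variant would simply replace the covariance estimate.

```python
import numpy as np

# Sketch: choose the truncation lag K_X as the smallest number of
# covariance eigenvalues whose cumulative share (the PVE) reaches 99%.
# The 103 x 24 data matrix below is an illustrative stand-in for the
# NOX trajectories (days x hours).

rng = np.random.default_rng(5)
X = rng.normal(size=(103, 24)) @ np.diag(np.linspace(2, 0.1, 24))
Xc = X - X.mean(axis=0)                       # center the curves

lam = np.linalg.eigvalsh(np.cov(Xc.T))[::-1]  # eigenvalues, non-increasing
pve = np.cumsum(lam) / lam.sum()
K_X = int(np.searchsorted(pve, 0.99) + 1)     # minimum K with PVE >= 0.99
print(K_X, float(pve[K_X - 1]))
```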
The estimated parameter functions for the three methods discussed in this chapter are shown in Figure 4.4. The pattern of the parameter is similar for all three methods, with slight differences over the late-afternoon-to-early-morning interval (i.e., 18:00 hrs to 24:00 hrs and also between 01:00 hrs and 03:00 hrs). This difference is more pronounced for the MLE and PMLE in the presence of outliers, which inevitably gives a different model interpretation for those sub-intervals. Due to the down-weighting of high-leverage observations, the effect of outlying curves is minimized in our robust approach. It is interesting to note that the affected sub-interval for the estimators is the same sub-interval where the outlying functional trajectories behave differently from the rest of the sample curves.
4.5 Conclusion
In this chapter we proposed a robust estimation approach that down-weights high-leverage points. This was achieved by reducing the functional logistic model by way of basis expansion.
(a) PMLE (b) RPMLE
Figure 4.4: A comparison of the estimated β(t) with and without the 5 outliers for the non-robust and robust penalized methods
The influence function of the maximum likelihood estimator for this reduced logistic model is known to be unbounded in the x-direction. Therefore, functional observations whose pattern differs from the rest of the observations, even over some sub-interval, greatly influence the estimator. It was with this in mind that weights based on the design space were used, together with some regularization to impose the smoothness property on the estimated parameter.

The proposed robust penalized method is a Mallows-type estimator that uses Huber-type weights to down-weight outliers in the x-direction by means of the robust Mahalanobis distance of the covariate. The penalty term makes use of the curvature of the parameter function and ensures smoothness of the estimator. The Monte Carlo study showed the increased efficiency in estimation of the robust penalized estimator; the same conclusion was reached for a real-world data example.
Chapter 5
Diagnostic Methods for the Functional Logistic Model
5.1 Introduction
Often, after fitting a model, one needs to assess the fit of the model before using it to make any inference. For this to be done, it is imperative that there is an understanding of the model as well as of the fitting procedure used. The idea is to assess how good the fit is, i.e., how well the fitted model describes the outcome. In this chapter, we adapt diagnostic methods for the standard logistic model and extend these ideas to the functional logistic model, thereby contributing to the diagnostic measures for this model.
Febrero-Bande et al. (12) proposed two statistics that measure the influence of each curve on the functional slope of the functional linear model, with a scalar response and functional predictor(s), when the principal components method is used in the estimation of the model. Müller and Chiou (28) developed diagnostic measures, based on residual processes, for functional regression models where the response is functional and the predictor is either a multivariate vector or a random function. Malloy et al. (27) proposed a method that extends the Box-Tidwell score test and involves the construction of residual plots to detect non-linearity of the functional predictor(s) in the functional generalized linear model, which has a scalar response and functional predictors. Our proposed diagnostic measures differ from all this work in two primary aspects. Firstly, our measures are for the functional logistic model, which has a binary response and functional predictor(s). Secondly, we focus on measures of goodness-of-fit as well as measures to identify ill-fitting observations and/or observations that have a dominant influence on the fit of the model.
Pregibon (32) made a significant contribution to diagnostic measures for the standard logistic model by extending ideas from linear regression. We explore ways that these and other diagnostic methods can be adapted for data where the predictor is functional and the response binary. We base our measures on the principal component based method discussed in Chapters 2 and 3. This methodology can be extended to other estimation methods, and we illustrate this by assessing the fit of the numerical examples analyzed in Chapters 3 and 4.
5.2 Measures of Goodness-of-Fit
We consider the functional logistic model where the functional predictor, X(t) ∈ L²(T), is defined on the closed interval T = [a, b] ⊂ ℝ. We assume that the unknown parameter function β(t) ∈ L²(T) and that both functions are represented as linear combinations of known bases, such that the probability of a positive response given the functional predictor is as in (2.4). As discussed in Chapter 2, the maximum likelihood estimate of this model is unstable and inefficient due to multicollinearity issues with the design matrix X = (1|C). Therefore, we consider the principal component based approach, where we let Z = {z_ij}_{n×K_X} be the matrix of PCs of the design matrix, such that Z = CV, where V is a K_X × K_X matrix whose columns are the eigenvectors associated with the eigenvalues of the covariance matrix of C. Then the logit model (2.4) becomes, in matrix form,

    L^{(s)} = γ_0 1 + Z^{(s)} γ^{(s)},    (5.1)

where γ = V′b and s denotes the number of principal components retained in the model. The maximum likelihood estimate is then

    γ̂ = arg max_γ Σ_{i=1}^n y_i log[π(z_i′γ)] + (1 − y_i) log[1 − π(z_i′γ)].    (5.2)
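The principal component step can be sketched on an illustrative near-collinear coefficient matrix; the data and dimensions are placeholders, not the actual design of the chapter.

```python
import numpy as np

# Sketch: replace the ill-conditioned coefficient matrix C by its principal
# component scores Z = C V, where V holds the eigenvectors of cov(C); the
# logit model is then fitted on the first s columns of Z.

rng = np.random.default_rng(6)
n, K_X = 40, 6
B = rng.normal(size=(n, 2))
C = B @ rng.normal(size=(2, K_X)) + 0.01 * rng.normal(size=(n, K_X))  # near-collinear

Cc = C - C.mean(axis=0)
evals, V = np.linalg.eigh(np.cov(Cc.T))
order = np.argsort(evals)[::-1]               # sort eigenvalues, non-increasing
evals, V = evals[order], V[:, order]

Z = Cc @ V                                    # PC scores; columns are uncorrelated
s = 2                                         # components retained in the model
Z_s = Z[:, :s]
print(Z_s.shape)
```

The retained columns Z_s carry nearly all of the variation of C while being mutually uncorrelated, which removes the multicollinearity from the reduced design.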
Goodness-of-fit summary measures are useful in giving an indication of the fit of the model. In order to assess the fit of the model, covariate patterns need to be considered. These are distinct groupings of the covariates, whose number we denote by J; should some subjects share the same covariate values, then J < n. The number of observations with the same covariate pattern is denoted by m_j for j = 1, …, J, such that Σ m_j = n. The number of positive responses among the m_j observations having the same covariate pattern is denoted by y_j, so that the sum of the y_j is the total number of positive responses. The maximum likelihood estimate in (5.2) can then be re-written as

    γ̂ = arg max_γ Σ_{j=1}^J y_j log[π(z_j′γ)] + (m_j − y_j) log[1 − π(z_j′γ)].    (5.3)
We discuss two measures that can be used to quantify the difference between the observed and fitted values. We denote the fitted values as

    ŷ_j = m_j π̂_j = m_j exp(γ̂_0 + Σ_{k=1}^s z_jk γ̂_k) / [1 + exp(γ̂_0 + Σ_{k=1}^s z_jk γ̂_k)],    j = 1, …, J.

The Pearson residual is defined as

    r(y_j, π̂_j) = (y_j − m_j π̂_j) / √(m_j π̂_j(1 − π̂_j)),    j = 1, …, J.

For the case y_j = 0, this reduces to

    r(y_j, π̂_j) = −√(m_j π̂_j / (1 − π̂_j)),

whereas for y_j = m_j we have

    r(y_j, π̂_j) = √(m_j (1 − π̂_j) / π̂_j).

The Pearson chi-square statistic is based on it,

    X² = Σ_{j=1}^J r(y_j, π̂_j)².    (5.4)
The deviance residual is defined as

    d(y_j, π̂_j) = ± {2 [y_j log(y_j / (m_j π̂_j)) + (m_j − y_j) log((m_j − y_j) / (m_j(1 − π̂_j)))]}^{1/2},    j = 1, …, J,

where the sign is that of (y_j − m_j π̂_j). In the case of no positive response, i.e., y_j = 0,

    d(y_j, π̂_j) = −{2 m_j |log(1 − π̂_j)|}^{1/2},

whereas when all observations with that covariate pattern have a positive response, i.e., y_j = m_j,

    d(y_j, π̂_j) = {2 m_j |log(π̂_j)|}^{1/2}.

The deviance is then the summary statistic based on the deviance residual and is given as

    D = Σ_{j=1}^J d(y_j, π̂_j)².    (5.5)
Both summary statistics are distributed as χ² with J − (s + 1) degrees of freedom, under the assumption that the fitted model is correct. These statistics measure the goodness of fit of the model: X² measures the relative deviation between the observed and fitted values, whilst D measures the disagreement between the maxima of the observed and fitted log-likelihood functions. Large values indicate that the observations are poorly accounted for by the model.
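These residuals and the two summary statistics can be sketched as follows on illustrative grouped data (the pattern counts and fitted probabilities are placeholders).

```python
import numpy as np

# Sketch of the goodness-of-fit summaries: Pearson and deviance residuals
# over J covariate patterns, and X^2 = sum r_j^2, D = sum d_j^2 as in
# (5.4)-(5.5). Inputs (y_j, m_j, pihat_j) are illustrative.

def pearson_residuals(y, m, pi):
    return (y - m * pi) / np.sqrt(m * pi * (1 - pi))

def deviance_residuals(y, m, pi):
    # Terms of the form 0 * log(0) are taken as 0 (y_j = 0 or y_j = m_j cases).
    with np.errstate(divide="ignore", invalid="ignore"):
        t1 = np.where(y > 0, y * np.log(y / (m * pi)), 0.0)
        t2 = np.where(m - y > 0, (m - y) * np.log((m - y) / (m * (1 - pi))), 0.0)
    return np.sign(y - m * pi) * np.sqrt(2 * (t1 + t2))

y = np.array([0, 2, 5, 3], dtype=float)    # positive responses per pattern
m = np.array([4, 6, 5, 3], dtype=float)    # pattern sizes, sum(m) = n
pi = np.array([0.2, 0.4, 0.7, 0.9])        # fitted probabilities pihat_j

r = pearson_residuals(y, m, pi)
d = deviance_residuals(y, m, pi)
X2, D = float(np.sum(r**2)), float(np.sum(d**2))
print(np.round(r, 3), np.round(d, 3))
```

Note the first pattern (y_j = 0) produces negative residuals under both definitions, matching the sign convention above.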
Graphical plots of these goodness-of-fit measures can be used to allow for easier detection of observations that are ill-fitting to the model. A plot of the Pearson residuals against standard normal quantiles is one such approach, where deviation from a linear pattern indicates lack of fit. Other useful plots are index plots of the residuals against the observation index or, more commonly, plots of the residuals against the fitted values, i.e., r_j vs. π̂_j or d_j vs. π̂_j. Ill-fitting observations will have large residuals.
5.3 Regression Diagnostics
The diagnostic measures discussed in this section assist in identifying observations that are not well explained by the model, and in pointing out observations that dominate some aspect of the fit and quantifying their effect on it. The work by Pregibon (32) was key in establishing the theory that extended diagnostics from the linear regression model to logistic regression. We discuss these ideas for the functional logistic model.

The analog of the projection matrix for the reduced functional logistic model (5.1) is given by

    P = I − H = I − U^{1/2} Z (Z′UZ)⁻¹ Z′ U^{1/2},

where U is a J × J diagonal matrix with general element u_j = m_j π̂_j(1 − π̂_j) and H is the hat matrix. A diagonal element of this hat matrix, h_j, is therefore of the form

    h_j = m_j π̂_j(1 − π̂_j) z_j′(Z′UZ)⁻¹z_j = u_j q_j,

where q_j = z_j′(Z′UZ)⁻¹z_j. Pregibon (32) showed that, just as in the linear model, P is symmetric, idempotent and spans the residual space. Therefore, small values of p_jj detect extreme points in the design space. Diagnostic plots useful for identifying outlying and influential curves include the Pearson residual (r_j) and the deviance residual (d_j), as discussed in the last section, as well as the diagonal elements of the projection matrix (p_jj).
Finally, we discuss diagnostic measures that look at the effect of infinitesimal model perturbations on the model fit and thereby quantify the effect of each observation (or subset of observations) on the fit. The diagnostic that measures the sensitivity of the estimated parameter in the reduced functional logistic model to infinitesimal perturbations of all observations with the same covariate pattern is given by

    Δβ̂_j = r_j² h_j / (1 − h_j)² = r_sj² h_j / (1 − h_j),

where r_sj is the studentized Pearson residual, defined as

    r_sj = r_j / √(1 − h_j),

and r_j is the Pearson residual. This measure, Δβ̂_j, is analogous to Cook's distance for the linear model. Peña (31) proposed a measure that differs from Cook's measure for the linear regression model: whereas Cook's distance determines the influence of an observation by deleting it and measuring the effect on the predictions, Peña's measure looks at the influence of an observation based on how that observation is itself influenced by the rest of the data. In the case of the reduced functional logistic model, the influence of the jth covariate pattern is measured as

    P_j = (ŷ_j − ŷ_{j(−j)})² / (m_j π̂_j(1 − π̂_j)),

where ŷ_{j(−j)} is the predicted response when the jth covariate pattern is removed from the sample.
If the observations with the jth covariate pattern are not well fit by the model, infinitesimal perturbations cause changes in the deviance and chi-square statistics discussed in the previous section, which most often can be isolated to the residual components. To measure the effect of each covariate pattern on the model fit, the rates of change of the deviance statistic and the chi-square statistic can be approximated by

    ΔD_j = d_j² + r_j² h_j / (1 − h_j)

and

    ΔX²_j = r_j² / (1 − h_j),

respectively. Covariate patterns that are poorly fit by the model are identified by large values of ΔD_j and ΔX²_j. Similarly, covariate patterns having the greatest influence on the estimated parameter values are identified by large values of Δβ̂_j.
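The hat values and the three influence measures can be sketched together; the grouped data below are illustrative and π̂ is taken as given rather than fitted.

```python
import numpy as np

# Sketch of the influence diagnostics: hat diagonal h_j from
# H = U^{1/2} Z (Z'UZ)^{-1} Z' U^{1/2}, Pearson and deviance residuals, and
# dbeta_j = r_j^2 h_j/(1-h_j)^2, dD_j = d_j^2 + r_j^2 h_j/(1-h_j),
# dX2_j = r_j^2/(1-h_j). Grouped data (y_j, m_j, pihat_j) are illustrative.

def logistic_diagnostics(Z, y, m, pi):
    u = m * pi * (1 - pi)                              # u_j = m_j pi_j (1 - pi_j)
    M = np.linalg.inv(Z.T @ (Z * u[:, None]))          # (Z'UZ)^{-1}
    h = u * np.einsum("jk,kl,jl->j", Z, M, Z)          # hat diagonal h_j = u_j q_j
    r = (y - m * pi) / np.sqrt(u)                      # Pearson residuals
    t1 = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / (m * pi)), 0.0)
    t2 = np.where(m > y, (m - y) * np.log(np.where(m > y, m - y, 1.0) / (m * (1 - pi))), 0.0)
    d = np.sign(y - m * pi) * np.sqrt(np.maximum(2 * (t1 + t2), 0.0))
    dbeta = r**2 * h / (1 - h) ** 2
    dD = d**2 + r**2 * h / (1 - h)
    dX2 = r**2 / (1 - h)
    return h, dbeta, dD, dX2

rng = np.random.default_rng(7)
Jn, cols = 30, 3                                       # J patterns, s + 1 = 3 columns
Z = np.column_stack([np.ones(Jn), rng.normal(size=(Jn, cols - 1))])
m = np.full(Jn, 5.0)
pi = 1 / (1 + np.exp(-Z @ np.array([0.2, 0.8, -0.5])))
y = rng.binomial(5, pi).astype(float)

h, dbeta, dD, dX2 = logistic_diagnostics(Z, y, m, pi)
print(h.sum())                                         # trace of H = rank of Z
```

Index plots of dbeta, dD and dX2 against j (or against h_j) are then the plots used in the numerical examples below.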
5.4 Numerical Examples
The diagnostic measures discussed in this chapter are applied to data sets used in earlier chapters. The Canadian Weather data set with the outlier is fitted using the (robust) principal component based approach. The Poblenou NOX emission data, which have been identified to contain 5 outlying curves, are fitted using the (robust) penalized maximum likelihood approach. In both instances, we compute the diagnostic and goodness-of-fit measures for the robust and non-robust approaches discussed in this dissertation, utilizing the two different estimation methods: principal component based estimation and penalized maximum likelihood based estimation.

(a) r_i vs. index (b) d_i vs. index
Figure 5.1: Index plots for the Canadian Weather data comparing residuals of the principal component based approaches
5.4.1 Canadian Weather Diagnostics
This data set has 23 samples representing weather stations, each with 12 mean monthly temperatures recorded. Of these sample curves, n₁ = 9 stations have drought risk and the remaining n₂ = 14 do not. The functional logistic model from this data set was used to predict the risk of drought for these 23 locations based on the functional temperature predictor. An outlier was introduced into this data set by shifting and stretching a random curve.
The classical principal component based estimation, as explained in this chapter, and our proposed robust principal component based estimation technique from Chapter 3 are used to fit these data, and their diagnostic measures are compared. No covariate patterns were distinguished, so J = n. Figure 5.1 displays the index plots for the two types of residuals discussed, i.e., the Pearson residual and the deviance residual. The residuals in both instances are relatively small, indicating that both models fit the data well. On average, the robust PCA fit has smaller residuals than the classical PCA fit of the functional logistic model. The summary statistics measuring the appropriateness of the fitted models were D = 2.8414 × 10⁻⁹ and X² = 1.4207 × 10⁻⁹ for the fit that used the classical PCA approach, whilst the fit that used our proposed robust PCA method had D = 8.6886 × 10⁻¹⁰ and X² = 4.344 × 10⁻¹⁰, on 19 degrees of freedom in both cases.

The weather stations identified as having the higher residuals by both fits are shown in Figure 5.2. The identified stations are Schefferville, Bagottville, Kamloops, Yellowknife and Resolute; the first two are classified as having no risk of drought whilst the last three are identified as having risk of drought. However, it is important to note that these residuals are still relatively small, so the appropriateness of the fit holds.

The diagnostic measures Δβ̂_j, ΔX²_j and ΔD_j were evaluated, and summaries of these statistics are plotted against h_jj in Figure 5.3. The values of these statistics are small, and therefore we can conclude that no observations are poorly fit or exert a large influence on the estimated parameter values. The temperature function for Schefferville (#7) has noticeably higher values for the non-robust model fit.
5.4.2 Poblenou NOX Emission Diagnostics
The Poblenou NOX example is considered as a functional logistic regression model where the functional predictor is the functional trajectory of NOX levels over a 24-hour period, with a binary response indicating whether the day is a working day (Y = 1) or a non-working day (Y = 0).
(a) No drought risk (b) Drought risk (c) All
Figure 5.2: Weather stations with the larger residuals shown as peaks in Figure 5.1
The approach taken to fit this model was that of (robust) penalized maximum likelihood, as discussed in Chapter 4. With this example we demonstrate that the diagnostic methods discussed in this chapter can be extended to other estimation approaches. Figure 5.4 displays the index plots of the deviance residuals (d_j) against the days on which the NOX levels were recorded. There are several days with large values for the deviance residuals; three noticeable days are 03/23 (#27), 03/25 (#29) and 05/16 (#72), with deviance residuals above 10.

Figure 5.3: Diagnostic plots that show the influence of each observation on the different diagnostic measures versus h_jj
The goodness-of-fit summary statistics for the non-robust penalized approach and the robust penalized approach were extremely large in both cases (p-value ≪ 0.01). Therefore, both models are poor fits of the data, and drawing inferences on the odds of a working day based on these hourly NOX emission levels would be inappropriate.
(a) Non-Robust (b) Robust (c) Overlay
Figure 5.4: Index plots of the deviance residuals for the NOX penalized maximum likelihood fitted models
The diagnostic plots in Figure 5.5 also indicate the observations with the largest influence on the parameter estimate as well as on the diagnostic summaries. These observations are consistent with those in the index plot.
Figure 5.5: Diagnostic plots that show the influence of each observation on the different diagnostic measures versus h_jj for the NOX model
Chapter 6
Conclusion
The aim of this dissertation work was to propose robust statistical methods for the functional logistic model, which has functional predictors and a binary response. Whilst an increasing amount of work has been directed toward functional data analysis, we are not aware of any work specifically looking at robust methods for the functional logistic model.
Firstly, we proposed a robust estimation technique that was principal component based. Both the functional predictor and the functional parameter were estimated using basis expansion, where the order of expansion was determined by the generalized cross-validation method. The reduced functional logistic model had multicollinearity issues, which were eliminated by robustly obtaining principal components that were then used as the covariates in the model. The resulting estimator of the functional parameter was shown to be more efficient in a simulation study as well as in an application to a real dataset.
Secondly, we proposed a Mallows-type estimator. This second robust estimation approach made use of a penalization technique as well as down-weighting of high leverage points. The penalty term, based on the curvature properties of the β̂(t) function, was introduced to ensure that the estimator retained the smoothness property. Huber-type weights based on the robust Mahalanobis distance of the covariate were used to minimize the effect of observations that were outliers in the design space. The proposed estimator was shown to perform better and to be more efficient in a simulation study and on real functional data with outlying curves.
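The core computation behind such a Mallows-type estimator can be sketched as a weighted, penalized logistic fit. The following is a minimal Newton-iteration sketch, not the dissertation's implementation: X stands for the reduced (basis-expanded) design matrix, w for the observation weights, and P for a curvature-based penalty matrix; all names and tuning choices are illustrative:

```python
import numpy as np

def mallows_penalized_fit(X, y, w, P, lam=1.0, n_iter=50):
    """Maximize the weighted logistic log-likelihood minus lam * beta' P beta
    by Newton iterations (illustrative sketch)."""
    X, y, w = (np.asarray(a, dtype=float) for a in (X, y, w))
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        # gradient of the penalized weighted log-likelihood
        grad = X.T @ (w * (y - pi)) - 2.0 * lam * (P @ beta)
        # Newton (negative Hessian) matrix: X' diag(w pi (1-pi)) X + 2 lam P
        W = w * pi * (1.0 - pi)
        hess = X.T @ (X * W[:, None]) + 2.0 * lam * P
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta
```

The penalty matrix keeps the fitted coefficient (and hence the reconstructed β̂(t)) smooth, while the weights w reduce the contribution of high-leverage curves to both the gradient and the Hessian.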
Lastly, we explored goodness-of-fit and diagnostic measures for the functional logistic model. These measures were generalizations of widely used and well-known diagnostic measures for the standard logistic model. We showed that some adaptations are necessary when dealing with the functional logistic model, and illustrated their usefulness and performance by analyzing the fit of the real-world examples discussed in prior chapters.
6.1 Future Work
The robust estimation approaches proposed in Chapters 3 and 4 only considered outliers in the design space. However, for the functional logistic model, outliers can be viewed either as extreme sample curves in the design space or as misclassification errors in the response. The robust methods proposed, whilst resistant to outlying curves in the design space, are not resistant to misclassification errors. These errors occur when observations in the y = 0 class are recorded as being in the y = 1 class, and vice versa. To simultaneously address the problem of misclassification and deal with extreme observations in the design space, a weight function of the form wi = wyi wxi, where wyi denotes weights for misclassification problems and wxi denotes weights to down-weight extreme data points in the design space, can be incorporated into the robust penalized approach proposed in Chapter 4.
Croux et al. (8) showed that the breakdown point for the MLE is 2(p − 1)/n, where p indicates the number of predictors in the standard logistic model. The basis for the wxi is therefore the hat matrix H = X(X'X)^(−1)X', where X is the n × (p + 1) design matrix for our reduced functional model. In particular, one can consider the Huber-type weights

wxi = min{1, 2K bn / hii},

wyi = min{1, c |(yi − πi) / [πi(1 − πi)]^(1/2)|^(−1)},

where c is a tuning constant that controls the degree of robustness.
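A minimal sketch of these two weight components follows. The function name, the choice bn = (p − 1)/n tied to the breakdown point of Croux et al. (8), and the default values of the tuning constants K and c are all illustrative assumptions, not specifications from the dissertation:

```python
import numpy as np

def huber_weights(X, y, pi_hat, K=2.0, c=1.345):
    """Huber-type weights: w_xi down-weights high-leverage rows via the
    hat values h_ii; w_yi down-weights large Pearson residuals (potential
    misclassifications).  Returns the combined weights w_i = w_yi * w_xi."""
    X = np.asarray(X, dtype=float)
    n, p1 = X.shape
    # hat values: diagonal of H = X (X'X)^{-1} X'
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    b_n = (p1 - 1) / n                      # illustrative breakdown-based bound
    w_x = np.minimum(1.0, 2.0 * K * b_n / h)
    # Pearson residuals r_i = (y_i - pi_i) / sqrt(pi_i (1 - pi_i))
    pi_hat = np.clip(np.asarray(pi_hat, dtype=float), 1e-12, 1 - 1e-12)
    r = (np.asarray(y, dtype=float) - pi_hat) / np.sqrt(pi_hat * (1.0 - pi_hat))
    w_y = np.minimum(1.0, c / np.abs(r))
    return w_x * w_y
```

Observations that are unexceptional in both leverage and residual receive weight one, so the weighted fit reduces to the ordinary penalized fit on clean data.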
The estimation methods proposed in this dissertation work have taken an approach that reduces the functional predictors to multivariate predictors by way of basis expansion. Therefore, in both cases robust multivariate techniques were used. Zhang and Chen (45) showed that the smoothing step in these approaches may result in some bias. A fully functional approach would be ideal, especially in the case of the principal component approach. Escabias et al. (9) contrasted a functional PCA approach with one that used multivariate PCA techniques on a reduced functional logistic regression. Gervini (15) proposed a fully functional approach for spherical PCA. Lee et al. (24) also recently proposed an M-estimation type functional principal component analysis method. These methods result in robust principal functions that can be used as functional predictors in the functional logistic model.
The nature of the functional logistic model allows it to be used for classification purposes where the data have two classes and the observations are functional. Robust functional classification methods would be applicable in diverse fields where there are two or more classes. The functional logistic model is part of the generalized functional linear model family; therefore, a broader focus on robust estimation methods for this family could be pursued.
Finally, the attraction to working with functional data stemmed from brain imaging studies and the application of statistical methods and tools to functional Magnetic Resonance Imaging (fMRI) data. Interest in brain images and other neuro-imaging studies continues to grow. In future work, the application of these functional statistical methods to this particular type of data will be a main focus.
Bibliography
[1] J. L. Bali, G. Boente, D. E. Tyler, and J.-L. Wang. Robust functional principal components: A projection-pursuit approach. The Annals of Statistics, 39:2852–2882, 2011.

[2] L. Barker and C. Brown. Logistic regression when binary predictor variables are highly correlated. Statistics in Medicine, 20:1431–1442, 2001.

[3] G. Boente and R. Fraiman. Discussion of robust principal components for functional data by Locantore et al. Test, 8:28–35, 1999.

[4] H. Cardot and P. Sarda. Estimation in generalized linear models for functional data via penalized likelihood. Journal of Multivariate Analysis, 92(1):24–41, 2005.

[5] R.J. Carroll and S. Pederson. On robustness in the logistic regression model. Journal of the Royal Statistical Society (B), 55:693–706, 1993.

[6] J.B. Copas. Binary regression models for contaminated data. Journal of the Royal Statistical Society (B), 50:225–265, 1988.

[7] P. Craven and G. Wahba. Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31:377–403, 1979.

[8] C. Croux, G. Dhaene, and D. Hoorelbeke. The breakdown behaviour of the maximum likelihood estimator in the logistic regression model. Statistics and Probability Letters, 60:377–386, 2002.

[9] M. Escabias, A.M. Aguilera, and M.J. Valderrama. Principal component estimation of functional logistic regression: Discussion of two different approaches. Journal of Nonparametric Statistics, 16(3–4):365–384, 2004.

[10] M. Escabias, A.M. Aguilera, and M.J. Valderrama. Modeling environment data by functional principal component logistic regression. Environmetrics, 16:95–107, 2005.

[11] M. Febrero, P. Galeano, and W. González-Manteiga. Outlier detection in functional data by depth measures, with application to identify abnormal NOx levels. Environmetrics, 19:331–345, 2007.

[12] M. Febrero-Bande, P. Galeano, and W. González-Manteiga. Influence in the functional linear model with scalar response. Journal of Multivariate Analysis, 101(2):327–339, 2010.

[13] F. Ferraty and P. Vieu. Nonparametric Functional Data Analysis: Theory and Practice. Springer, 2006.

[14] R. Fraiman and G. Muniz. Trimmed means for functional data. Test, 10:419–440, 2001.

[15] D. Gervini. Robust functional estimation using the median and spherical principal components. Biometrika, 95:587–600, 2008.

[16] D. Gervini. Detecting and handling outlying trajectories in irregularly sampled functional datasets. The Annals of Applied Statistics, 3:1758–1775, 2009.

[17] J. J. Goeman. L1 penalized estimation in the Cox proportional hazards model. Biometrical Journal, 52(1):70–84, 2010.

[18] J. Goldsmith, J. Bobb, C.M. Crainiceanu, B. Caffo, and D. Reich. Penalized functional regression. Journal of Computational and Graphical Statistics, 20:830–851, 2011.

[19] R.R. Hocking. The analysis and selection of variables in linear regression. Biometrics, 32:1–49, 1976.

[20] D.W. Hosmer and S. Lemeshow. Applied Logistic Regression. Wiley, second edition, 2000.

[21] M. Hubert, P.J. Rousseeuw, and K.V. Branden. ROBPCA: a new approach to robust principal component analysis. Technometrics, 47(1):64–79, 2005.

[22] G.M. James. Generalized linear models with functional predictors. Journal of the Royal Statistical Society, Series B, 64(3):411–432, 2002.

[23] S. Le Cessie and J.C. van Houwelingen. Ridge estimators in logistic regression. Journal of the Royal Statistical Society, 41(1):191–201, 1992.

[24] S. Lee, H. Shin, and N. Billor. M-type smoothing spline estimation for principal functions. Computational Statistics and Data Analysis, 2013.

[25] X. Leng and H.G. Müller. Classification using functional data analysis for temporal gene expression data. Bioinformatics, 22:68–76, 2006.

[26] C.L. Mallows. On some topics in robustness. Technical Memorandum, 1975.

[27] E.J. Malloy, E.J. Bedrick, and T. Goldsmith. Diagnostics for the scale of functional predictors in generalized linear models. Technometrics, 49(4):480–489, 2007.

[28] H-G. Müller and J-M. Chiou. Diagnostics for functional regression via residual processes. Computational Statistics and Data Analysis, 51:4849–4863, 2007.

[29] H.G. Müller and U. Stadtmüller. Generalized functional linear models. The Annals of Statistics, 33(2):774–805, 2005.

[30] R.T. Ogden and P.T. Reiss. Functional generalized linear models with images as predictors. Biometrics, 66:61–69, 2010.

[31] D. Peña. A new statistic for influence in linear regression. Technometrics, 47(1):1–12, 2005.

[32] D. Pregibon. Logistic regression diagnostics. The Annals of Statistics, 9(4):705–724, 1981.

[33] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2011. URL http://www.R-project.org. ISBN 3-900051-07-0.

[34] J.O. Ramsay and B.W. Silverman. Functional Data Analysis. Springer, second edition, 2005.

[35] J.O. Ramsay, H. Wickham, S. Graves, and G. Hooker. Functional Data Analysis, 2012. URL http://www.cran.r-project.org/web/packages/fda.

[36] S.J. Ratcliffe, G.Z. Heller, and L.R. Leader. Functional data analysis with application to periodically stimulated foetal heart rate data II: Functional logistic regression. Statistics in Medicine, 21:1115–1127, 2002.

[37] P.T. Reiss, R.T. Ogden, J.J. Mann, and R.V. Parsey. Functional logistic regression with PET imaging data: A voxel-level clinical diagnostic tool. Journal of Cerebral Blood Flow and Metabolism, 25(S635), 2005.

[38] P. Sawant, N. Billor, and H. Shin. Functional outlier detection with robust functional principal component analysis. Computational Statistics, 27(1):83–102, 2012.

[39] R.L. Schaefer. Alternate estimators in logistic regression when the data are collinear. Journal of Statistical Computation and Simulation, 25(1–2):75–91, 1986.

[40] L.A. Stefanski. Influence and measurement error in logistic regression. PhD Thesis, 1983.

[41] T.S. Tian. Functional data analysis in brain imaging studies. Frontiers in Psychology, 1(35), 2010.

[42] V. Todorov. Robust Location and Scatter Estimation and Robust Multivariate Analysis with High Breakdown Point, 2012. URL http://www.cran.r-project.org/web/packages/rrcov.

[43] E. Vágó and S. Kemény. Logistic ridge regression for clinical data analysis (a case study). Applied Ecology and Environmental Research, 4(2):171–179, 2006.

[44] M. Victoria-Feser. Robust logistic regression for binomial responses. 2000. URL http://dx.doi.org/10.2139/ssrn.1763301.

[45] J.T. Zhang and J. Chen. Statistical inferences for functional data. The Annals of Statistics, 35:1052–1079, 2007.