Group Lasso for Functional Logistic Regression
by
Jessica Godwin
A thesis submitted to the Graduate Faculty of
Auburn University
in partial fulfillment of the
requirements for the Degree of
Master of Science
Auburn, Alabama
August 3, 2013
Keywords: functional data analysis, group lasso, functional logistic regression, fMRI
Copyright 2013 by Jessica Godwin
Approved by
Nedret Billor, Chair, Associate Professor of Mathematics and Statistics
Dmitry Glotov, Associate Professor of Mathematics and Statistics
Asheber Abebe, Associate Professor of Mathematics and Statistics
George Flowers, Dean of the Graduate School
Abstract
Functional datasets consist of data that have been sampled discretely over a
continuum, usually time. While the recorded data are discrete, it is assumed that there is a
smooth, underlying curve describing the observations. In this thesis an attempt is made to
develop a variable selection technique for the functional logistic regression model.
The functional logistic regression model is the functional analog of logistic regression.
In this model, the responses are binary and represent two separate classes; the predictors are
functional. Due to the nature of functional data, observations at neighboring time points
are correlated, leading to redundant information within each observation and each functional
predictor. In a dataset with many variables, it is necessary to be able to select a smaller
subset of informative variables. In this thesis, we attempt to remove the autocorrelation
between neighboring observations and perform variable selection. We do this by employing
a principal component analysis on binary data with multiple functional predictors. The
data are then subjected to a variant of the group lasso, an L1 regularization method that
estimates the logistic model and selects variables simultaneously. We assess our method
with a simulation study and an application to a real dataset.
Acknowledgments
I would like to thank my advisor, Dr. Nedret Billor. You have been an incredible mentor
to me. From the first semester you taught me, you have encouraged me in the direction of
research. Thank you to my friends and family for all of your support. To my parents,
my deepest gratitude for supporting me throughout my life and teaching me to work hard.
Cory, thank you for all the love, laughter and constant encouragement. Thank you to the
rest of my committee, Dr. Dmitry Glotov and Dr. Ash Abebe. Finally, thank you to Dr.
Gopikrishna Deshpande and his student, Yun Wang, for sharing your data with me. I could
not have done any of this without all of you. War Eagle!
Table of Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Basic Functional Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Basis Expansion of Functional Data . . . . . . . . . . . . . . . . . . . . . . . 4
2 Functional Principal Component Logistic Regression . . . . . . . . . . . . . . . 7
2.1 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Functional Principal Component Analysis . . . . . . . . . . . . . . . . . . . 7
2.3 The Functional Logistic Regression Model . . . . . . . . . . . . . . . . . . . 9
2.4 Principal Component Functional Logistic Regression for Multiple Functional
Predictors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3 Group Lasso for Functional Logistic Regression . . . . . . . . . . . . . . . . . . 13
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Group Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.4 Group Lasso for Functional Logistic Regression . . . . . . . . . . . . . . . . 15
4 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.3 fMRI Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
A Simulation of Preprocessed fMRI Data . . . . . . . . . . . . . . . . . . . . . . . 24
B Simulation Analysis Code for R . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
List of Figures
1.1 Berkeley Growth Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3.1 Group Lasso Algorithm: Outline of the block coordinate gradient descent algo-
rithm used to perform grouped variable selection in the logistic regression model
[6]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.1 Simulation Results: The values reported are the means, followed by their stan-
dard deviations in parentheses. . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2 fMRI Nine Dot Experiment: Subjects were asked to use four and five lines to
connect all nine dots in the figure above. . . . . . . . . . . . . . . . . . . . . . . 20
4.3 Voxel Mask of the Anterior Temporal Lobe: The right anterior temporal lobe is
in red, the left in blue. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Chapter 1
Introduction
Functional data analysis (FDA) is a relatively new area within the discipline of statistics.
The first in-depth text on the subject, Functional Data Analysis, was published as recently
as 1997 [8]. Functional data are data that have been measured discretely over a continuum,
usually time. Instead of treating the many discrete measurements as individual observations,
one assumes that these measurements represent a smooth, underlying curve. This curve is
then treated as a single observation.
Much of this work was motivated by Functional Magnetic Resonance Imaging (fMRI),
a noninvasive imaging technique which allows an experimenter to take images of a subject's
brain over time. These images are taken while the subject performs a task such as finger
tapping or correctly identifying images of human faces amidst a series of images containing
both human faces and objects. fMRI is currently being used to assess which areas of the
brain are activated while performing certain tasks. This is done by dividing the brain into
voxels, the three-dimensional analog of pixels, and measuring brain activation in each.
Depending on how one defines a voxel, a typical fMRI image has over 1,000,000 voxels. As
fMRI studies usually have a small number of subjects, this results in datasets that are
incredibly high-dimensional. High dimensionality is one of the biggest problems in the
statistical analysis of fMRI data [10].
Initial statistical analysis of fMRI data was univariate in nature [3]. This is obviously
simplistic for data that take into account four dimensions: three spatial dimensions and one
temporal dimension. The second decade of fMRI research focused on multivariate data
analysis. Since the images are captured over time, the next logical step in this progression
is functional data analysis. In an fMRI session, images of a subject's brain are taken at
discrete time points. However, one would expect the brain's response to a stimulus to be
continuous in nature. This makes fMRI imaging data a great candidate for functional data
analysis. In 2005, Viviani et al. published a paper using functional principal component
analysis in fMRI; they showed the results to be much more interpretable than those of
multivariate PCA [12]. Since then, more statistical analysis of fMRI data has been functional
in nature. There is a need to attack the problem of high dimensionality of brain imaging
data, as well as a need for the development of better classification methods [10]. One of the
greatest strengths of Functional Magnetic Resonance Imaging is its noninvasiveness. If
statistical classification methods are improved, they could aid the advancement of
noninvasive diagnostic techniques for mental illness or even degenerative diseases such as
Alzheimer's. There is some research being done in this area, but there is room for more
advancement [10].
1.1 Basic Functional Data Analysis
Consider sample curves of the form {x_i(t), t ∈ T, i = 1, ..., n}, where T is an interval
over which the observations were measured. The observations belong to the Hilbert space
L^2(T) of square-integrable functions, with the inner product

\langle f, g \rangle = \int_T f\,g\,dt, \qquad \forall f, g \in L^2(T). \tag{1.1}
A vector x_i = (x_{i1}, ..., x_{iN}) represents the discrete measurements for the ith subject of
one variable, x, at N points in T. There are functional analogs of the traditional summary
statistics. The functional sample mean, \bar{x}(t), is defined below:

\bar{x}(t) = n^{-1} \sum_{i=1}^{n} x_i(t). \tag{1.2}

The sample mean is computed point-wise at each t ∈ T. Similarly, one can compute the
covariance between measurements at two time points s and t:

\widehat{\mathrm{cov}}(s,t) = (n-1)^{-1} \sum_{i=1}^{n} \big(x_i(s) - \bar{x}(s)\big)\big(x_i(t) - \bar{x}(t)\big). \tag{1.3}

Figure 1.1: Berkeley Growth Study. These are the curves describing the height in cm of
girls followed in the Berkeley Growth Study beginning in 1929. The circles mark the
discrete height measurements over time.
When s = t, this is simply the variance of x computed at that time point. If we further
assume that the observations, x_i(t), belong to an L^2 space, we can define the inner product

\langle x, y \rangle = \int x(t)\,y(t)\,dt. \tag{1.4}

Once we consider the observations as smooth functions, a natural extension is to
estimate their derivatives. In the case of the human growth data in Figure 1.1, one might
want to estimate the velocity or acceleration of growth for inference as well [8].
1.2 Basis Expansion of Functional Data

Let observations {x_i(t), t ∈ T, i = 1, ..., n} belong to a subspace of L^2 spanned by the
p-dimensional basis system of independent functions, {φ_1, ..., φ_p}. The assumed smooth
functional observation, or linear expansion, x_i(t), can be expressed as the sum

x_i(t) = \sum_{k=1}^{p} a_{ik}\,\phi_k. \tag{1.5}

Here, each a_{ik} is called a basis coefficient. When derivative estimation is required, the
derivative must be estimated separately; differentiating the smoothed function x_i(t) does
not often give a good derivative estimate.
There are many different basis systems that can be used in a basis expansion. One of
the most common is the Fourier basis system. A Fourier basis is defined by a Fourier
series,

1, sin(ωt), cos(ωt), sin(2ωt), cos(2ωt), sin(3ωt), cos(3ωt), . . . .

Fourier bases are useful for data that are periodic, e.g. weather patterns [8]. Additionally,
derivative estimation is easier with a Fourier basis, as the derivatives of sin(t) and cos(t) are
known and easy to compute.

For data that are not cyclical in nature, the most common choice is the B-spline basis
system. B-splines are computationally efficient and flexible enough to approximate most
non-periodic functions. Splines are defined over an interval T subdivided into subintervals
at points called knots. A basis system with L interior knots contains L + 1 subintervals. Over
each subinterval, a polynomial of order m is defined, and the polynomials are constrained
to join at the knots. The number of knots, L, and the order of the splines, m, are the two
important parameters defining a B-spline basis system. In this work, only B-spline bases
will be used.
After a basis system has been chosen, the basis coefficients must be estimated. One
method of basis coefficient estimation is least squares estimation. Denote the discrete
observations of a functional curve y_j, j = 1, ..., N, where N is the number of time
points at which observations were taken. Assume independently and identically distributed
measurement errors, ε_j, with E[ε_j] = 0 and Var(ε_j) = σ². Then,

y_j = x(t_j) + \varepsilon_j. \tag{1.6}

Defining x(t_j) in terms of (1.5), the sum of squared errors can be written as

SSE(y \mid a) = \| y - \Phi a \|^2, \tag{1.7}

where Φ is the N × p matrix with entries Φ_{jk} = φ_k(t_j). Setting the first derivative to zero,
the least squares solution minimizing SSE(y | a) is

\hat{a} = (\Phi'\Phi)^{-1}\Phi' y. \tag{1.8}

However, least squares smoothing is inappropriate if the error assumptions do not hold. In
the case of functional data measured over time, observations at adjacent time points are
likely correlated, violating the standard error assumptions.
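To make the least squares route in (1.8) concrete, here is a minimal sketch in Python/NumPy. It is illustrative only: the analyses in this thesis were done in R, and a simple monomial basis stands in for the B-splines discussed above.

```python
import numpy as np

# Discrete observations y_j = x(t_j) + noise, sampled at N time points.
rng = np.random.default_rng(0)
N, p = 50, 4                       # time points, basis dimension
t = np.linspace(0.0, 1.0, N)
y = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal(N)

# Design matrix Phi with Phi[j, k] = phi_k(t_j); monomials stand in
# for the B-spline basis used in the thesis.
Phi = np.vander(t, p, increasing=True)

# Least squares solution a_hat = (Phi'Phi)^{-1} Phi' y, as in (1.8).
a_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# The smoothed curve evaluated back at the sampling points.
x_smooth = Phi @ a_hat
```

The same computation applies to any basis system: only the construction of the design matrix Φ changes.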
A more common method of spline smoothing is smoothing with a roughness penalty.
This method estimates a curve that is rough enough to describe the observed features
of the data, but suppresses high-frequency features, including noise. To find the
coefficients of a smooth approximation of this type, the sum of squared errors is minimized
with the added constraint of a roughness penalty. Roughness of a function is described by
its curvature, or squared second derivative; the quantity penalized is the integrated
squared second derivative,

PEN_2(x) = \int [D^2 x(s)]^2\,ds. \tag{1.9}

The corresponding penalized sum of squared errors to be minimized is as follows:

PENSSE_\lambda(x \mid y) = [y - x(t)]'\,W\,[y - x(t)] + \lambda\,PEN_2(x), \tag{1.10}

where W is the matrix of weights describing the covariance structure of the errors and λ is
a smoothing parameter. The smoothing parameter is chosen by the method of generalized
cross-validation developed by Craven and Wahba [1]. This is done by choosing the λ that
minimizes

GCV(\lambda) = \frac{n\,SSE}{(n - df(\lambda))^2}. \tag{1.11}

Here df(λ) = trace(S_{\phi,\lambda}), where

S_{\phi,\lambda} = \Phi\,(\Phi' W \Phi + \lambda R)^{-1}\,\Phi' W, \qquad R_{kl} = \int D^2\phi_k(t)\,D^2\phi_l(t)\,dt, \tag{1.12}

is the hat matrix of the spline smoother.
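The penalized criterion (1.10) and the GCV search over λ can be sketched as follows. This is an illustration under simplifying assumptions (W = I, a monomial basis in place of B-splines, and a second-difference approximation to the curvature penalty matrix R), not the thesis's exact construction:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 60, 8
t = np.linspace(0.0, 1.0, N)
y = np.cos(2 * np.pi * t) + 0.2 * rng.standard_normal(N)

# Basis evaluated at the sampling points (monomials stand in for B-splines).
Phi = np.vander(t, p, increasing=True)

# Second differences of the fitted values approximate the curvature penalty
# PEN_2(x), giving a discrete stand-in for the penalty matrix R.
D2 = np.diff(np.eye(N), n=2, axis=0)
R = Phi.T @ (D2.T @ D2) @ Phi

def gcv(lam):
    # Hat matrix S_lambda = Phi (Phi'Phi + lam R)^{-1} Phi'  (taking W = I).
    S = Phi @ np.linalg.solve(Phi.T @ Phi + lam * R, Phi.T)
    df = np.trace(S)                       # df(lambda) = trace(S_lambda)
    sse = np.sum((y - S @ y) ** 2)
    return N * sse / (N - df) ** 2         # GCV(lambda), as in (1.11)

# Choose lambda on a coarse grid by minimizing GCV.
grid = 10.0 ** np.arange(-8.0, 2.0)
lam_best = min(grid, key=gcv)
```

Larger λ increases the penalty and lowers the effective degrees of freedom df(λ); GCV balances that smoothness against the residual error.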
Once the functional observations have been smoothed, statistical analysis can begin.
As with traditional statistics, early FDA focused on univariate methods. Our focus,
however, is on a set of multiple functional predictors. In any functional dataset with
multiple functional predictors, and especially in fMRI, dimensionality is an issue. It may
be important, for the sake of interpretation and computational expense, to select a smaller
subset of important variables from the dataset. In any dataset, classification is often of
interest, and this particularly resonates within fMRI, where classification is currently used
to explore brain functionality. For example, suppose the subjects in a study are given one
of two stimuli. Does the brain behave differently enough in the presence of each stimulus
to predict which stimulus a subject was presented with? If so, it is important to identify
which areas of the brain are associated with this difference. As classification improves
within the realm of fMRI, it could contribute to the diagnosis of brain degeneration or
mental illness in clinical settings. The rest of this thesis focuses on a method of dimension
reduction and variable selection for a dataset with multiple functional predictors and a
binary response.
Chapter 2
Functional Principal Component Logistic Regression
2.1 Principal Component Analysis
Traditional principal component analysis (PCA) is a method of data reduction for
multivariate datasets [4]. Consider a data matrix X_{n×m}, where n is the sample size and m is
the number of variables. Let E[X] = 0 and let C = (n − 1)^{-1} X'X be the variance-covariance
matrix. PCA is performed on the covariance matrix or correlation matrix of X. It reduces
dimension by finding the linear combination of the variables that has maximum variance;
this linear combination is the first principal component (PC). The next PC is the linear
combination of the variables that is uncorrelated with the first PC and has the next largest
variance. This continues until min{n − 1, m} PCs have been found. The procedure is
equivalent to solving the following eigenequation:

C f = \lambda f. \tag{2.1}

The solutions to this equation are the eigenvalues, λ, and eigenvectors, f, of the covariance
or correlation matrix. Dimension reduction occurs when a number of principal components,
s ≤ m, is chosen. Often this is done by examining the cumulative proportion of variance
explained by each additional principal component.
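As a concrete illustration of (2.1) and of choosing s by cumulative variance, here is a Python/NumPy sketch on simulated data (the thesis's own computations were done in R):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 100, 5
X = rng.standard_normal((n, m)) @ rng.standard_normal((m, m))
X = X - X.mean(axis=0)                      # center, so E[X] = 0

C = X.T @ X / (n - 1)                       # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)        # solves C f = lambda f
order = np.argsort(eigvals)[::-1]           # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = X @ eigvecs                        # principal component scores
cumvar = np.cumsum(eigvals) / eigvals.sum()
s = int(np.searchsorted(cumvar, 0.90) + 1)  # smallest s explaining 90%
```

The first s columns of `scores` then replace the m original variables in subsequent modeling; the 90% cutoff is one common, but arbitrary, choice.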
2.2 Functional Principal Component Analysis
Consider a sample containing observations of one functional predictor. There is
correlation between observations at adjacent points in T; in a linear model framework, this
leads to the problem of high multicollinearity. Escabias et al. propose a method of functional
principal component analysis (fPCA) for the logistic regression model with one functional
predictor that alleviates this issue [6]. We will extend this method to a functional logistic
regression model with multiple functional predictors.
As outlined by Ramsay and Silverman, fPCA is the functional analog of traditional
multivariate principal component analysis [8]. Assume functional observations of one
variable, x_i(t) ∈ L², i = 1, ..., n, with the usual functional sample mean, x̄(t), and
sample covariance function, Ĉ(s,t), s, t ∈ T. Without loss of generality, assume x̄(t) = 0.
The functional principal components are found by solving the following functional
equivalent of (2.1):

\int_T \hat{C}(s,t)\,f(s)\,ds = \lambda f(t). \tag{2.2}

The solutions to (2.2) are the eigenvalues, λ, and eigenfunctions, f(t), of the sample
covariance function Ĉ. The number of nonzero eigenvalues is at most n − 1. The score of
the ith observation on the jth principal component, ξ_{ij}, is expressed as

\xi_{ij} = \int_T x_i(t)\,f_j(t)\,dt. \tag{2.3}

The solution of (2.2) cannot always be computed analytically.

When the n sample functions of a functional predictor belong to the space L²(T)
spanned by the basis {φ_1, ..., φ_p}, the functional PCs are equivalent to the multivariate
PCs of the matrix AΨ^{1/2} [2]. Here, A is the n × p matrix of coefficients of the basis
expansions and Ψ is the p × p matrix whose components are defined as

\Psi_{ij} = \int_T \phi_i\,\phi_j\,dt. \tag{2.4}
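The equivalence above suggests a simple recipe: form AΨ^{1/2} and run ordinary multivariate PCA on it. A sketch follows, with an arbitrary positive definite matrix standing in for the true Gram matrix Ψ of the basis (an assumption of this illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 6
A = rng.standard_normal((n, p))       # basis coefficients of n curves
A = A - A.mean(axis=0)                # centered curves

# Gram matrix Psi_ij = \int phi_i phi_j dt; here any symmetric positive
# definite matrix serves as a stand-in.
M = rng.standard_normal((p, p))
Psi = M @ M.T + p * np.eye(p)

# Symmetric square root of Psi via its eigendecomposition.
w, U = np.linalg.eigh(Psi)
Psi_half = U @ np.diag(np.sqrt(w)) @ U.T

# Functional PCA = multivariate PCA of A Psi^{1/2}.
B = A @ Psi_half
C = B.T @ B / (n - 1)
lam, V = np.linalg.eigh(C)
lam, V = lam[::-1], V[:, ::-1]        # decreasing eigenvalues
scores = B @ V                        # functional PC scores xi_ij
```

With an orthonormal basis, Ψ is the identity and the computation reduces to a PCA of the coefficient matrix A itself.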
2.3 The Functional Logistic Regression Model
Escabias et al. develop a principal component functional logistic regression model for
one functional predictor [2]. We describe this method before extending it to the case with
multiple functional predictors. Consider observations {(y_i, x_i(t)), t ∈ T, i = 1, ..., n},
where x_i(t) is a functional predictor, x_i(t_j) is the observation for the ith subject at the jth
time point, and each y_i ∈ {0, 1}. The conditional distribution of Y_i | X_i(t) is Bernoulli(π_i),
with

\pi_i = E[Y_i \mid X_i(t)] = \frac{\exp\{\alpha + \int_T x_i(t)\beta(t)\,dt\}}{1 + \exp\{\alpha + \int_T x_i(t)\beta(t)\,dt\}}, \qquad i = 1, \ldots, n, \tag{2.5}

where α ∈ R and the parameter β(t) is a function. Applying the logit transform, a
generalized linear model is formed [7]:

l_i = \ln\frac{\pi_i}{1 - \pi_i} = \alpha + \int_T x_i(t)\beta(t)\,dt, \qquad i = 1, \ldots, n. \tag{2.6}
Under the assumption that β(t) belongs to the same L² space spanned by {φ_1, ..., φ_p},

\beta(t) = \sum_{k=1}^{p} \beta_k\,\phi_k. \tag{2.7}

The l_i's can then be expressed in terms of the A matrix,

L = \alpha 1_{n\times 1} + A\Psi\beta, \tag{2.8}

where β is a p × 1 vector containing the coefficients β_k of the basis expansion of β(t).
The basis coefficients β_k can be estimated using a Newton-Raphson algorithm that
maximizes the likelihood, i.e., solves the score equation

X'(Y - \pi) = 0. \tag{2.9}

In (2.9), Y = (y_1, ..., y_n)', π = (π_1, ..., π_n)' and X = (1 | AΨ) [6]. Once the coefficients
are estimated,

\hat{\beta}(t) = \sum_{k=1}^{p} \hat{\beta}_k\,\phi_k. \tag{2.10}
The vector L in (2.8) can be re-expressed in terms of the principal components of AΨ:

L = \alpha 1_{n\times 1} + \Gamma V' \beta = \alpha 1_{n\times 1} + \Gamma\gamma, \tag{2.11}

where Γ = (ξ_{ij}) is the matrix of the p principal components, V is the matrix of eigenvectors,
and γ = V'β. This notation allows for estimation of the components of β,

\hat{\beta} = V\hat{\gamma}. \tag{2.12}

Reduction of the effects of multicollinearity occurs when a number of PCs, s ≤ p, is chosen.
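The estimation route of this section can be sketched end to end: compute PC scores from the basis coefficients, fit a logistic regression on the first s scores by Newton-Raphson, and map γ̂ back to basis coefficients via (2.12). The data below are simulated and the code is an illustration, not the thesis's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, s = 200, 6, 3

A = rng.standard_normal((n, p))              # centered basis coefficients
C = A.T @ A / (n - 1)
lam, V = np.linalg.eigh(C)
lam, V = lam[::-1], V[:, ::-1]
Gamma = A @ V                                # PC scores

y = rng.integers(0, 2, size=n).astype(float) # toy binary response

# Newton-Raphson (IRLS) for logistic regression on intercept + s PCs.
Z = np.column_stack([np.ones(n), Gamma[:, :s]])
theta = np.zeros(s + 1)
for _ in range(25):
    pi = 1.0 / (1.0 + np.exp(-Z @ theta))
    W = pi * (1 - pi)
    H = Z.T @ (Z * W[:, None])               # observed information
    theta = theta + np.linalg.solve(H, Z.T @ (y - pi))

alpha_hat, gamma_hat = theta[0], theta[1:]
beta_hat = V[:, :s] @ gamma_hat              # beta_hat = V gamma_hat, (2.12)
```

Only s of the p PC directions carry estimated coefficients; the back-transformed β̂ still lives in the full p-dimensional basis coefficient space.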
2.4 Principal Component Functional Logistic Regression for Multiple Functional
Predictors

The extension of PCA to the model with multiple functional predictors lies in the
definition of the inner product [4]. Let {(y_i, x_i^m(t)), t ∈ T, i = 1, ..., n, m = 1, ..., M}
belong to L²_M(T) spanned by {φ_1, ..., φ_p}, where each x_i^m(t) is a functional predictor and
each y_i ∈ {0, 1}. The inner product, then, of two functional PCs can be defined as follows:

\langle f_i, f_j \rangle = \sum_{m=1}^{M} \int_T f_i^m\,f_j^m\,dt. \tag{2.13}

The functional logistic regression model with multiple predictors is still defined as in (2.5).
The definition of π_i, however, changes with the change in inner product:

\pi_i = E[Y_i \mid X_i(t)] = \frac{\exp\{\alpha + \sum_{m=1}^{M}\int_T x_i^m(t)\beta^m(t)\,dt\}}{1 + \exp\{\alpha + \sum_{m=1}^{M}\int_T x_i^m(t)\beta^m(t)\,dt\}}, \qquad i = 1, \ldots, n. \tag{2.14}
Applying the logit transform,

l_i = \alpha + \sum_{m=1}^{M} \int_T x_i^m(t)\beta^m(t)\,dt, \qquad i = 1, \ldots, n. \tag{2.15}

To perform the dimension-reducing PCA, we redefine the design matrix AΨ. Here,

A_{n\times(Mp)} = \big[\,A^1 \;\; \cdots \;\; A^M\,\big], \tag{2.16}

and

\Psi_{(Mp)\times(Mp)} = \begin{pmatrix} \Psi^1 & 0 & \cdots & 0 \\ 0 & \Psi^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Psi^M \end{pmatrix}. \tag{2.17}

Each A^m is an n × p matrix of basis coefficients, and Ψ is a block diagonal matrix with
each block Ψ^m of dimension p × p. Now β = (β^{1\prime}, ..., β^{M\prime})'. Redefining (2.8),

L = \alpha 1_{n\times 1} + A\Psi\beta = \alpha 1_{n\times 1} + \sum_{m=1}^{M} A^m\Psi^m\beta^m. \tag{2.18}

This definition depends on the assumption that each observation x_i^m(t) and the respective
parameter function, β^m(t), can be defined by the same set of basis functions
{φ_1^m(t), ..., φ_p^m(t)}. Below, we express L in terms of the principal components:

L = \alpha 1_{n\times 1} + \sum_{m=1}^{M} \Gamma^m V^{m\prime}\beta^m, \tag{2.19}
where Γ^m = (ξ_{ij}^m)_{n×p} are the principal components of A^mΨ^m, and V^m is the
corresponding matrix of eigenvectors. The number of PCs, s_m ≤ p, to be chosen for each of
the M predictors should be determined by cumulative variance. For simplicity, we chose
the same number of PCs, s, for each of the M predictors. After the dimension has been
reduced at the within-variable level, (2.14) becomes

\pi_i^{(s)} = \frac{\exp\{\alpha^{(s)} + \sum_{m=1}^{M}\sum_{j=1}^{s} \xi_{ij}^{m}\gamma_{j}^{m(s)}\}}{1 + \exp\{\alpha^{(s)} + \sum_{m=1}^{M}\sum_{j=1}^{s} \xi_{ij}^{m}\gamma_{j}^{m(s)}\}}, \qquad i = 1, \ldots, n. \tag{2.20}
We can re-express (2.19), after s PCs have been chosen, as

L = \alpha^{(s)} 1_{n\times 1} + \sum_{m=1}^{M} \Gamma^{m(s)} V^{m(s)\prime}\beta^{m(s)}. \tag{2.21}

The vectors β^m can then be estimated as in (2.12),

\hat{\beta}^m = V^m\hat{\gamma}^m. \tag{2.22}
Although multicollinearity has been dealt with on a within-variable basis, as M gets
large there could be multiple predictors providing similar information. In the case of fMRI
data, the time series of adjacent voxels are expected to be similar. There is a need to reduce
dimension at the multiple-variable level by selecting only those functional predictors which
are relevant to the response; this alleviates another potential source of multicollinearity.
The following chapter focuses on a method of simultaneous model estimation and variable
selection.
Chapter 3
Group Lasso for Functional Logistic Regression
3.1 Introduction
The principal component analysis allows for removal of redundant information on a
within-predictor basis. As stated before, this is necessary due to the autocorrelation of
observations between time points. There is also a need to select only those predictors which
provide relevant information to the model.
There are many methods of variable selection used in linear models and generalized
linear models. Model selection techniques, such as stepwise and forward selection, cycle
algorithmically through subsets of variables until certain criteria are met. The variables
included in the various steps of the algorithm are determined by previously selected p-values.
These methods are inherently subjective, as it is up to the person analyzing the model to
choose the "best" model based on a set of criteria. The criteria that determine the quality
of the model are also chosen by the analyst.
3.2 Lasso
A more objective method of variable selection, the lasso, was introduced by Tibshirani
in 1996. The lasso simultaneously performs model selection and parameter estimation. It
is an L1 regularization technique that performs variable selection by shrinking certain
coefficients to exactly 0, excluding those predictors from the model; the remaining non-zero
coefficients represent variables that are relevant to the model. This is done by solving the
least squares problem subject to a constraint on the coefficients. Assume a standard
regression model with independent observations {(y_i, x_i), i = 1, ..., n}, where
x_i = (x_{i1}, ..., x_{ip}). The lasso estimates of the regression coefficients, (α̂, β̂), are

(\hat{\alpha}, \hat{\beta}) := \arg\min \Big\{ \sum_{i=1}^{n} \Big( y_i - \alpha - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \Big\}, \tag{3.1}

subject to \sum_j |\beta_j| \le t, where t ≥ 0. Note that α is not penalized.
Tibshirani also applied the lasso to the logistic regression model [11]. Consider
independent observations {(y_i, x_i), i = 1, ..., n}, where y_i ∈ {0, 1}. Variable selection and
model estimation are performed by maximizing the log-likelihood function,

l(\beta) = \sum_{i=1}^{n} \big[ y_i (x_i'\beta) - \ln(1 + \exp\{x_i'\beta\}) \big], \tag{3.2}

subject to \sum_j |\beta_j| \le t, where t ≥ 0. An iteratively reweighted least squares algorithm
is used to compute β̂ under these conditions.
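The shrink-to-exactly-zero behavior of the lasso is easiest to see in the orthonormal-design case, where the solution reduces to soft-thresholding of the least squares estimates. A minimal sketch (λ here is the Lagrangian counterpart of the bound t):

```python
import numpy as np

def soft_threshold(b, lam):
    # Shrinks each coefficient toward zero; entries with |b_j| <= lam
    # become exactly 0, which is how the lasso excludes predictors.
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

b_ols = np.array([2.5, -0.3, 0.0, 1.1, -2.0])   # least squares estimates
b_lasso = soft_threshold(b_ols, lam=0.5)
# Surviving coefficients shrink by 0.5; |b_j| <= 0.5 is set exactly to 0.
```

This contrasts with ridge regression, whose L2 penalty shrinks every coefficient but never sets one exactly to zero.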
3.3 Group Lasso
Consider a linear model with multiple predictors, some of them categorical. A
categorical predictor with l levels will be represented in the model by l − 1 variables. The
lasso can only shrink individual regression coefficients to zero, which has little
interpretation in the case of a categorical predictor: if the categorical predictor is not
relevant to the response, all l − 1 variables should be removed from the model.

Consider independent observations {(y_i, x_i), i = 1, ..., n}, where x_i = (x_{i1}', ..., x_{iM}')'.
Each x_{im} represents a group of predictors. The linear regression model is defined as

Y = \alpha + \sum_{m=1}^{M} X_m\beta_m + \varepsilon, \tag{3.3}

where α ∈ R is the intercept, each β_m is a vector whose components are the regression
coefficients for the mth group of predictors, and Y_{n×1} is the vector of responses. Yuan and
Lin [14] developed a method of variable selection, called the group lasso, that considers each
of the M groups of variables for inclusion in or exclusion from the model. The coefficient
estimates are defined as

\hat{\theta} = \arg\min \Big( \|Y - X\theta\|_2^2 + \lambda \sum_{m=1}^{M} \|\beta_m\|_2 \Big), \tag{3.4}

where λ is a tuning parameter and θ = (α, β_1', ..., β_M')'. The penalty is a mixture of the
L1 and L2 regularization penalties used by the lasso and ridge regression.
3.4 Group Lasso for Functional Logistic Regression
Meier et al. [6] describe a group lasso method for the multivariate logistic regression
model. Consider independent observations {(y_i, x_i), i = 1, ..., n}, where y_i ∈ {0, 1} and
x_i = (x_{i1}', ..., x_{iM}')'. Each x_{im} represents a group of predictors. The group lasso is
performed by minimizing the following convex function:

S_\lambda(\beta) = -l(\beta) + \lambda \sum_{m=1}^{M} s(df_m)\,\|\beta_m\|_2, \tag{3.5}

where df_m is the degrees of freedom of the mth group of predictors; the use of
s(df_m) = df_m^{1/2} is suggested [6]. The solution to this minimization is the logistic group
lasso estimator, \hat{\beta}_\lambda. It is found using a block coordinate gradient descent
algorithm based on the second-order Taylor expansion

S_\lambda(\hat{\beta}^{(t)} + d) \approx -\big\{ l(\hat{\beta}^{(t)}) + d'\nabla l(\hat{\beta}^{(t)}) + \tfrac{1}{2}\,d' H^{(t)} d \big\} + \lambda \sum_{m=1}^{M} s(df_m)\,\|\hat{\beta}^{(t)}_m + d_m\|_2 = M^{(t)}_\lambda(d), \tag{3.6}

where H^{(t)} approximates the Hessian of the log-likelihood at \hat{\beta}^{(t)}. The algorithm
begins with an initial parameter vector β^{(0)}. For each of the M groups of variables, the
algorithm finds the d that minimizes M^{(t)}_\lambda(d). If this d is not identically 0, the
estimate of β is updated: the new estimate \hat{\beta}^{(t+1)} is the previous estimate
\hat{\beta}^{(t)} plus a scalar step times d. This proceeds over the groups until a convergence
criterion is met. The choice of λ depends on n and the degrees of freedom of each of the M
groups. In the multivariate model, the group lasso is performed on a dataset containing M
groups of discrete predictors, and a number of groups less than M is selected. This version
of the group lasso has been shown to be asymptotically consistent [6]. The minimization
can be carried out in R using the package grplasso written by Meier et al. [6].

Figure 3.1: Group Lasso Algorithm: Outline of the block coordinate gradient descent
algorithm used to perform grouped variable selection in the logistic regression model [6].
We apply this group lasso method to the functional logistic regression model with
multiple functional predictors. Recall the observations {(y_i, x_i^m(t)), t ∈ T, i = 1, ..., n,
m = 1, ..., M} ∈ L²_M(T) spanned by {φ_1, ..., φ_p}, where each x_i^m(t) is a functional
predictor and each y_i ∈ {0, 1}, and consider the model in (2.19), after s principal
components have been chosen. The log-likelihood function, l(β), of the functional logistic
regression model is

l(\beta) = \sum_{i=1}^{n} \Big[ y_i \Big( \alpha^{(s)} + \sum_{m=1}^{M} \int x_i^m(t)\beta^{m(s)}(t)\,dt \Big) - \ln\Big( 1 + \exp\Big\{ \alpha^{(s)} + \sum_{m=1}^{M} \int x_i^m(t)\beta^{m(s)}(t)\,dt \Big\} \Big) \Big]. \tag{3.7}

Using the definition of l(β) in (3.7), we minimize the objective function
S_\lambda(\beta) = -l(\beta) + \lambda \sum_{m=1}^{M} s(df_m)\,\|\beta^m\|_2. \tag{3.8}

In our method of principal component logistic regression with multiple functional
predictors, the degrees of freedom df_m in (3.8) is equivalent to the number of chosen
PCs, s. In the functional case, each of the M functional predictors is represented by a
group of s coefficients. When an entire group of coefficients is shrunk to 0, the
corresponding functional predictor is excluded from the model; for any group of s
coefficients that are not all equal to zero, the corresponding functional predictor is
included. In essence, the group lasso performs single-variable selection in the functional
logistic regression model with multiple functional predictors.
Chapter 4
Application
4.1 Introduction
In a Functional Magnetic Resonance Imaging experiment, an experimenter aims to
measure the amount of activation in each voxel of the brain. When a part of the brain is
active, there is increased blood flow to the area. fMRI measures the change in blood flow
using the blood-oxygen-level-dependent (BOLD) contrast [3]. Assessing which parts of the
brain are active during an fMRI experiment allows researchers to determine which parts of
the brain respond to certain stimuli. There is a need for classification tools in the statistical
analysis of fMRI. Logistic regression could be used to distinguish between brains at rest and
those presented with stimuli. Another application would be to distinguish between subjects
receiving one of two particular stimuli. For example, an experimenter may play pieces of
music or speech to a subject [9]. Being able to classify which stimulus was presented allows
one to learn more about the way the brain works. Classification also has applications in
the diagnosis of mental illness or degenerative disease.
4.2 Simulation Study
We assess our methodology using a simulation study. Using the R package neuRosim,
we simulated preprocessed fMRI data. The package can be used to simulate fMRI time
series or complete 4D fMRI volumes. With neuRosim, one can define the onset and
duration of stimuli, and can specify the effect size of the stimuli, the repetition time (TR),
and the type of spatial and temporal noise [13].
We simulated preprocessed four-dimensional fMRI data for an area of 4000 voxels
containing two non-overlapping regions of activation. We used neuRosim to create block
designs of a stimulus followed by rest. Design 1 presented an effect size that was larger in
Region 1 than in Region 2; Design 2 created activation that was larger in Region 2 than in
Region 1. Half of the observations in each simulation were simulated under Design 1, the
other half under Design 2. Our goal was to use the developed method of group lasso for
functional logistic regression with multiple functional predictors to classify the validation
set into the proper groups. We simulated 15 datasets at each of two levels of signal-to-noise
ratio (SNR), 0.75 and 3.87, and two levels of subject size, 30 and 50. An observation
simulated under Design 1 was given a y value of 1; otherwise y_i = 0. Two-fold
cross-validation was then used to form training and validation sets. All analysis was
performed in R; the package grplasso was used to perform the final variable selection [6].
Observations with π̂_i > 0.5 were classified as y_i = 1, otherwise y_i = 0.
For each of the four settings, two-fold cross validation resulted in 30 models. We report the
number of voxels selected out of the initial 4000 voxels. We also report sensitivity, defined
as the ratio of true positives to true positives plus false negatives; false positive rate, the
ratio of false positives to false positives plus true negatives; and accuracy, defined as
the ratio of true positives plus true negatives to the number of observations in the validation
set. These findings can be seen in Figure 4.1. We did not report the number of principal
components selected in the table. In the cases where n = 50, the original number of basis
functions was 43; after PCA, 8, 9 or 10 principal components were selected every time. In
the cases where n = 30, 7 or 8 PCs were chosen. The method classifies well, even with a
small number of voxels. As expected, the cases with fewer subjects have lower sensitivity
and accuracy and a higher false positive rate. In the two simulations with lower SNR,
classification appears to improve, which is surprising.
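The three reported measures can be computed from 0/1 predicted and true labels as below; the function name is illustrative and not part of the thesis code.

```r
# Sensitivity, false positive rate, and accuracy from 0/1 label vectors.
classification_metrics <- function(yhat, y) {
  tp <- sum(yhat == 1 & y == 1)   # true positives
  tn <- sum(yhat == 0 & y == 0)   # true negatives
  fp <- sum(yhat == 1 & y == 0)   # false positives
  fn <- sum(yhat == 0 & y == 1)   # false negatives
  c(sensitivity = tp / (tp + fn),
    fpr         = fp / (fp + tn),
    accuracy    = (tp + tn) / length(y))
}
```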
                     No. Voxels Selected  Sensitivity    False Positive Rate  Accuracy
SNR = 3.87, n = 50   5.414 (1.842)        0.935 (0.096)  0.053 (0.079)        0.936 (0.060)
SNR = 0.75, n = 50   5.267 (1.617)        0.953 (0.099)  0.044 (0.082)        0.949 (0.670)
SNR = 3.87, n = 30   3.867 (1.332)        0.825 (0.188)  0.166 (0.197)        0.840 (0.158)
SNR = 0.75, n = 30   3.967 (1.449)        0.874 (0.153)  0.108 (0.159)        0.871 (0.106)
Figure 4.1: Simulation Results: The values reported are the means, followed by their standard deviations in parentheses.
Figure 4.2: fMRI Nine Dot Experiment: Subjects were asked to use four and five lines to connect all nine dots in the figure above.
4.3 fMRI Example
To test this methodology on a real dataset, we used fMRI data collected by Auburn
University's MRI Research Center. The data were collected from 6 subjects on a 7T MRI
scanner, and each scanning session lasted 1000 s. The data were preprocessed by the experi-
menters using SPM8: slice timing correction was applied; spatial realignment, normalization,
and smoothing were performed; and, finally, the data were detrended. The complete raw
voxel-wise time series were then extracted using MarsBaR.
The experimental design was a block design with two conditions. In one condition, sub-
jects were asked to use four lines to connect all dots in Figure 4.2; in the other, they
were asked to use five lines. Conditions were presented in a random sequence, and each
condition was followed by a period of rest. All subjects were able to connect the nine dots
using five lines, but only one (Subject 5) was able to solve the puzzle using four lines.
Our aim was to use group lasso for functional logistic regression to
classify the six subjects as having solved the four-line puzzle, y = 1, or as being unable to
solve the puzzle, y = 0.
Figure 4.3: Voxel Mask of the Anterior Temporal Lobe: The right anterior temporal lobe is
in red, the left in blue.
According to the experimenters, there were two important regions of interest (ROIs) in
this study: the left and right anterior temporal lobes (ATL), which can be seen in
Figure 4.3. They are associated with semantic memory, the knowledge of objects and
facts. From the right ATL, 7560 voxel time series were extracted; from the left ATL, 6584.
This gave time series from a total of 14144 voxels. To perform PCA on the 14144 A matrices
of the spline smooths, the number of basis functions, p, must be less than the number of
subjects. We chose to use 5 basis functions. After performing the principal component
analysis, 3 PCs were chosen. From the 14144 voxels, the group lasso procedure selected
11 voxels. From these 11 voxel time series, the classification procedure correctly identified
Subject 5 as having solved the puzzle.
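The number of retained components was chosen by a cumulative proportion-of-variance rule with a 95% threshold, as in Appendix B; a compact version, with illustrative names, is:

```r
# Smallest number of PCs whose cumulative proportion of variance reaches 95%.
npc_by_variance <- function(pca, threshold = 0.95) {
  v <- pca$sdev^2                  # component variances
  propvar <- cumsum(v) / sum(v)    # cumulative proportion of variance
  which(propvar >= threshold)[1]   # first count reaching the threshold
}
```

For example, standard deviations c(2, 1, 1) give variances 4, 1, 1 and cumulative proportions 0.667, 0.833, 1, so all 3 components are retained.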
4.4 Conclusion
We have developed a viable method of variable selection for functional logistic regression
by employing the group lasso on functional principal component scores. There are obvious
limitations with small sample sizes: being limited to a number of spline basis functions that
is less than the sample size could lead to poor estimation of the functional observations. The
method is also computationally expensive for a large number of functional variables and
subjects, since the spline smoothing procedure must be performed for every variable and
subject. Additionally, we would like to explore the statistical properties of the estimators
of the parameter functions. The application in the field of functional magnetic resonance
imaging is exciting; in future studies on real data, it would be interesting to study the
neurological significance of the voxels selected by the group lasso.
Bibliography
[1] Craven, P., Wahba, G., 1979. Smoothing Noisy Data with Spline Functions. Numerische
Mathematik, 31, 377-403.
[2] Escabias, M., Aguilera, A. M., Valderrama, M. J., 2004. Principal component estimation
of functional logistic regression: discussion of two di erent approaches. Nonparametric
Statistics, 16(3-4), 365-384.
[3] Huettel, S.A., Song, A.W., McCarthy, G., 2009. Functional Magnetic Resonance Imag-
ing (2nd Edition). Sunderland, MA: Sinauer Associates.
[4] Jolliffe, I. T., 2002. Principal Component Analysis, 2nd ed., Springer-Verlag, New York.
[5] Matsui, H., Konishi, S., 2011. Variable selection for functional regression models via the
L1 regularization. Computational Statistics and Data Analysis, 55(12), 3304-3310.
[6] Meier, L., van de Geer, S., Bühlmann, P., 2008. The group lasso for logistic regression.
Journal of the Royal Statistical Society: Series B, 70(Part 1), 53-71.
[7] Müller, H.-G., Stadtmüller, U., 2005. Generalized functional linear models. The Annals of
Statistics, 33(2), 774-805.
[8] Ramsay, J.O., Silverman, B.W., 2005. Functional Data Analysis, 2nd ed., Springer-
Verlag, New York.
[9] Ryali, S., Supekar, K., Abrams, D. A., Menon, V., 2010. Sparse logistic regression for
whole-brain classification of fMRI data. NeuroImage, 51(2), 752-764.
[10] Tian, T.S, 2010. Functional data analysis in brain imaging studies. Frontiers in Psy-
chology, 1(35), 1-11.
[11] Tibshirani, R., 1996. Regression Shrinkage and Selection via the Lasso. Journal of the
Royal Statistical Society B, 58(1), 267-288.
[12] Viviani, R., Grön, G., Spitzer, M., 2005. Functional Principal Component Analysis of
fMRI Data. Human Brain Mapping, 24, 109-129.
[13] Welvaert, M., Durnez, J., Moerkerke, B., Verdoolaege, G., Rosseel, Y., 2011. neuRosim:
An R Package for Generating fMRI Data. Journal of Statistical Software, 40(10).
[14] Yuan, M., Lin, Y., 2006. Model selection and estimation in regression with grouped
variables. Journal of the Royal Statistical Society B, 68(1), 49-67.
Appendix A
Simulation of Preprocessed fMRI Data
#Fill in n1, n2, and SNR below before running
seed1 <- NULL
seed0 <- NULL
n1 <-
n2 <-
n <- n1 + n2
SNR <-
#Draw and store a random seed for each observation
for (i in 1:n1){
seedi<-sample(1:10000,1)
seed1[[i]] <- seedi
}
for (i in 1:n2){
seedi<-sample(1:10000,1)
seed0[[i]] <- seedi
}
library(neuRosim)
#Parameters
dimx <- 20
dimy <- 20
dimz <- 10
nscan <- 100
TR <- 2
total <- TR*nscan
os <- seq(1, total, 20)
dur <- 7
regions <- simprepSpatial(regions = 2, coord = list(c(2, 2, 2),
c(-2,-3,1)), radius = c(3,1), form = "sphere")
onset <- list(os,os)
duration <- list(dur, dur)
effect1 <- list(7, 15)
effect2 <- list(12, 8)
design1 <- simprepTemporal (regions = 2, onsets = list(os,os),
durations = duration, TR = TR, hrf = "double-gamma", effectsize =
effect1, totaltime=total)
design2 <- simprepTemporal (regions = 2, onsets = list(os,os),
durations = duration, TR = TR, hrf = "double-gamma", effectsize =
effect2, totaltime=total)
#Simulate subjects 1 to n1 under Design 1
data <- NULL
a <- NULL
for(i in 1:n1){
set.seed(seed1[i])
a <- simVOLfmri(design = design1, image = regions, base = 100,
dim= c(dimx,dimy,dimz),SNR = SNR, noise = "mixture", type =
"rician", rho.temp = c(0.142,0.108,0.084), rho.spat = 0.4, w=
c(0.05,0.1,0.01,0.09,0.05,0.7))
data[[i]] <- a
}
#Simulate subjects n1+1 to n under Design 2
for(i in 1:n2){
set.seed(seed0[i])
a <- simVOLfmri(design = design2, image = regions, base = 100,
dim= c(dimx,dimy,dimz),SNR = SNR, noise = "mixture", type =
"rician", rho.temp = c(0.142,0.108,0.084), rho.spat = 0.4, w=
c(0.05,0.1,0.01,0.09,0.05,0.7))
data[[i+n1]] <- a
}
Appendix B
Simulation Analysis Code for R
#Sort data all subjects
vts<- NULL
vtsoutijk <- NULL
a <- NULL
for(m in 1:n){
for (i in 1:dimx){
for (j in 1:dimy){
for (k in 1:dimz){
vtsoutijk<- data[[m]][i,j,k,]
vtsoutijk <- t(vtsoutijk)
a <- cbind(a, vtsoutijk)
}
}
}
vts[[m]] <- a
a <- NULL
}
vtsALL <- NULL
for(i in 1:n){
vtsALL <- rbind(vtsALL,vts[[i]])
}
vtsV <- NULL
a <- NULL
for(i in 1:(dimx*dimy*dimz)){
a <-matrix(vtsALL[,(100*i-99):(100*i)], n, 100)
vtsV[[i]] <- a
}
vtsV <- t(vtsV)
#begin basis expansion
library(fda)
knots <- seq(0,200,10)
norder <- 4
nbasis <- length(knots) + norder - 2
T <- seq(2,200,2)
Trange = c(0,200)
lambda=1e6
#Create b-spline basis
bbasisV <- create.bspline.basis(Trange,nbasis,norder,knots)
curv.Lfd <- int2Lfd(2)
curv.fdParV <- fdPar(bbasisV, curv.Lfd,lambda)
#choose smoothing parameter
lambdas = 10^seq(-4,4,by=0.5)
mean.gcv = rep(0,length(lambdas))
#Initialize dataframes
SmoothV <- NULL
SmoothVfd <-NULL
A <- NULL
for(j in 1:(dimx*dimy*dimz)){
for(ilam in 1:length(lambdas)){
# Set lambda
curv.fdParVi <- curv.fdParV
curv.fdParVi$lambda <- lambdas[ilam]
# Smooth
Smoothi <- smooth.basis(T,t(vtsV[[j]]),curv.fdParVi)
# Record average gcv
mean.gcv[ilam] <- mean(Smoothi$gcv)
}
#plot(lambdas, mean.gcv, type="b", log="x")
best = which.min(mean.gcv)
lambdabest = lambdas[best]
curv.fdParV$lambda = lambdabest
bj = smooth.basis(T,t(vtsV[[j]]),curv.fdParV)
SmoothV[[j]] <-bj
#Create A matrices
c <- SmoothV[[j]]$fd
SmoothVfd[[j]] <- c
d <- SmoothVfd[[j]]$coefs
d <- matrix(d,n,nbasis)
A[[j]] <- d
}
#Create PSI matrix
Psi1 <- inprod(bbasisV,bbasisV)
e<- NULL
APsi <- NULL
#multiply Am by Psim
for(i in 1:(dimx*dimy*dimz)){
e <- A[[i]]%*%Psi1
APsi[[i]] <- e
}
#Perform PCA in APsi matrices
APsiPCA <- NULL
for(i in 1:(dimx*dimy*dimz)){
h <- princomp(APsi[[i]], cor = TRUE, scores = TRUE)
APsiPCA[[i]] <- h
}
#Calculate cumulative proportion of variance for each voxel
PropVar <- NULL
for(j in 1:(dimx*dimy*dimz)){
v <- (APsiPCA[[j]]$sdev)^2
PropVar[[j]] <- cumsum(v/sum(v))
}
#Count, for each voxel, the number of leading PCs whose cumulative
#proportion of variance is below the 95% threshold
NPCV <- NULL
NPC <- NULL
for(j in 1:(dimx*dimy*dimz)){
d <- as.numeric(PropVar[[j]] < .95)
NPCV[[j]] <- d
NPC[j] <- sum(d)
}
#Select # PCs
min(NPC)
max(NPC)
numPC <-
#Extract the first numPC principal component scores for each voxel
PC <- NULL
for(i in 1:(dimx*dimy*dimz)){
d <- APsiPCA[[i]]$scores[,1:numPC]
PC[[i]]<-d
}
#Create Y vector of responses (1 = Design 1, 0 = Design 2)
Y <- c(rep(1,n1),rep(0,n2))
#Select training and validation sets
train <- sample(1:n,n1)
valid <- NULL
for(i in 1:n){
if (i %in% train){}
else{
valid <- rbind(valid,i)
}
}
PCtrain <- NULL
PCvalid <- NULL
x<- NULL
a <- NULL
#Training Design
for(j in 1:(dimx*dimy*dimz)){
a <- NULL
x <- NULL
for(i in 1:n1){
a <- PC[[j]][train[i],]
x<-rbind(x,a)
}
PCtrain[[j]]<-x
}
#Validation Design
for(j in 1:(dimx*dimy*dimz)){
a <- NULL
x <- NULL
for(i in 1:n1){
a <- PC[[j]][valid[i],]
x<-rbind(x,a)
}
PCvalid[[j]]<-x
}
#Training Y
Ytrain <- NULL
for(i in 1:n1){
Ytrain[i] <- Y[train[i]]
}
#Valid Y
Yvalid <- NULL
for(i in 1:n1){
Yvalid[i] <- Y[valid[i]]
}
#grplasso
library(grplasso)
library(calibrate)
Int <- matrix(1, n1, 1)  #column of ones for the intercept
#Validation Set as Validation
lasX<-NULL
lasX<-cbind(lasX,Int)
for(i in 1:(dimx*dimy*dimz)){
lasX <- cbind(lasX,PCtrain[[i]])
}
index <- NULL
index <- c(index,NA)
for(i in 1:(dimx*dimy*dimz)){
b <-rep(i,numPC)
index <- c(index,b)
}
lambdalas <- lambdamax(lasX, y=Ytrain, index=index, penscale = sqrt, model = LogReg()) * 0.5^(0:5)
lasfit <- grplasso(lasX, y=Ytrain, index= index, lambda = lambdalas, model= LogReg(), penscale=sqrt, control = grpl.control(update.hess = "lambda", trace=0))
Coefs <- lasfit$coefficients
dim(Coefs)
CoSUM <- NULL
for(i in 1:6){
CoSUM[i]<-sum(Coefs[2:(dimx*dimy*dimz*numPC + 1),i])
}
CoCo <- NULL
VarNUM <- NULL
for(j in 1:6){
g <- NULL
for(i in 1:(numPC*dimx*dimy*dimz + 1)){
if(Coefs[i,j] ==0){
g[i] <- 0}
else{g[i] <- 1}
}
CoCo[[j]]<-g
}
SumCoCo <- NULL
for(i in 1:6){
SumCoCo[i] <- sum(CoCo[[i]])
}
#View number of functional variables selected by grplasso
SumCoCo
#Find Lihat and Pihat and classify yhat
ModelCoef <- Coefs[,2]
Alpha <- ModelCoef[1]
Betastar <- ModelCoef[2:(numPC*dimx*dimy*dimz +1)]
Betastar <- matrix(Betastar)  #coerce to a column vector
M<-NULL
for(i in 1:(dimx*dimy*dimz)){
b <- PCvalid[[i]]%*%Betastar[(numPC*(i-1)+1):(numPC*i)]
M[[i]] <- b
}
APsiB <- matrix(0, n1, 1)
for(i in 1:(dimx*dimy*dimz)){
APsiB <- APsiB + M[[i]]
}
#Find Lhat
Lhatvalid <- Alpha + APsiB
#Compute fitted probabilities Pi(i) and classify
Pstar <- NULL
YstarV <- NULL
for (i in 1:n1){
Pstar[[i]] <- (exp(Lhatvalid[i]))/(exp(Lhatvalid[i])+1)
if(Pstar[i]>0.5){
YstarV[[i]] <- 1
}
else{YstarV[[i]]<- 0}
}
#Training Set as Validation
lasX<-NULL
lasX<-cbind(lasX,Int)
for(i in 1:(dimx*dimy*dimz)){
lasX <- cbind(lasX,PCvalid[[i]])
}
index <- NULL
index <- c(index,NA)
for(i in 1:(dimx*dimy*dimz)){
b <-rep(i,numPC)
index <- c(index,b)
}
#Group Lasso
lambdalas <- lambdamax(lasX, y=Yvalid, index=index, penscale = sqrt, model = LogReg()) * 0.5^(0:5)
lasfit <- grplasso(lasX, y=Yvalid, index= index, lambda = lambdalas, model= LogReg(), penscale=sqrt, control = grpl.control(update.hess = "lambda", trace=0))
CoefsV <- lasfit$coefficients
dim(CoefsV)
CoSUM <- NULL
for(i in 1:6){
CoSUM[i]<-sum(CoefsV[2:(dimx*dimy*dimz*numPC + 1),i])
}
VCoCo <- NULL
VarNUM <- NULL
for(j in 1:6){
g <- NULL
for(i in 1:(numPC*dimx*dimy*dimz + 1)){
if(CoefsV[i,j] ==0){
g[i] <- 0}
else{g[i] <- 1}
}
VCoCo[[j]]<-g
}
SumVCoCo <- NULL
for(i in 1:6){
SumVCoCo[i] <- sum(VCoCo[[i]])
}
#View number of functional variables selected by grplasso
SumVCoCo
#Find Lihat and Pihat and classify yhat
ModelCoefV <- CoefsV[,2]
AlphaV <- ModelCoefV[1]
BetastarV <- ModelCoefV[2:(numPC*dimx*dimy*dimz +1)]
BetastarV <- matrix(BetastarV)  #coerce to a column vector
M<-NULL
for(i in 1:(dimx*dimy*dimz)){
b <- PCtrain[[i]]%*%BetastarV[(numPC*(i-1)+1):(numPC*i)]
M[[i]] <- b
}
APsiBV <- matrix(0, n1, 1)
for(i in 1:(dimx*dimy*dimz)){
APsiBV <- APsiBV + M[[i]]
}
#Find Lhat
Lhattrain <- AlphaV + APsiBV
#Find Pi(i)'s
PstarT <- NULL
YstarT <- NULL
for (i in 1:n1){
PstarT[[i]] <- (exp(Lhattrain[i]))/(exp(Lhattrain[i])+1)
if(PstarT[i]>0.5){
YstarT[[i]] <- 1
}
else{YstarT[[i]]<- 0}
}