Group Lasso for Functional Logistic Regression
by
Jessica Godwin
A thesis submitted to the Graduate Faculty of
Auburn University
in partial fulfillment of the
requirements for the Degree of
Master of Science
Auburn, Alabama
August 3, 2013
Keywords: functional data analysis, group lasso, functional logistic regression, fMRI
Copyright 2013 by Jessica Godwin
Approved by
Nedret Billor, Chair, Associate Professor of Mathematics and Statistics
Dmitry Glotov, Associate Professor of Mathematics and Statistics
Asheber Abebe, Associate Professor of Mathematics and Statistics
George Flowers, Dean of the Graduate School
Abstract
Functional datasets consist of data that have been sampled discretely over a
continuum, usually time. While the recorded data are discrete, it is assumed that there is a
smooth, underlying curve describing the observations. In this thesis an attempt is made to
develop a variable selection technique for the functional logistic regression model.
The functional logistic regression model is the functional analog of logistic regression.
In this model, the responses are binary and represent two separate classes; the predictors are
functional. Due to the nature of functional data, observations at neighboring time points
are correlated, leading to redundant information within each observation and each functional
predictor. In a dataset with many variables, it is necessary to be able to select a smaller
subset of informative variables. In this thesis, we attempt to remove the autocorrelation
between neighboring observations and perform variable selection. We do this by employing
a principal component analysis on binary data with multiple functional predictors. The
data are then subjected to a variant of the group lasso, an L1 regularization method that
estimates the logistic model and selects variables simultaneously. We assess our method
with a simulation study and an application to a real dataset.
Acknowledgments
I would like to thank my advisor, Dr. Nedret Billor. You have been an incredible mentor
to me. From the first semester you taught me, you have encouraged me in the direction of
research. Thank you to my friends and family for all of your support. To my parents,
my deepest gratitude for supporting me throughout my life and teaching me to work hard.
Cory, thank you for all the love, laughter and constant encouragement. Thank you to the
rest of my committee, Dr. Dmitry Glotov and Dr. Ash Abebe. Finally, thank you to Dr.
Gopikrishna Deshpande and his student, Yun Wang, for sharing your data with me. I could
not have done any of this without all of you. War Eagle!
Table of Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Basic Functional Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Basis Expansion of Functional Data . . . . . . . . . . . . . . . . . . . . . . . 4
2 Functional Principal Component Logistic Regression . . . . . . . . . . . . . . . 7
2.1 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Functional Principal Component Analysis . . . . . . . . . . . . . . . . . . . 7
2.3 The Functional Logistic Regression Model . . . . . . . . . . . . . . . . . . . 9
2.4 Principal Component Functional Logistic Regression for Multiple Functional
Predictors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3 Group Lasso for Functional Logistic Regression . . . . . . . . . . . . . . . . . . 13
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Group Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.4 Group Lasso for Functional Logistic Regression . . . . . . . . . . . . . . . . 15
4 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.3 fMRI Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
A Simulation of Preprocessed fMRI Data . . . . . . . . . . . . . . . . . . . . . . . 24
B Simulation Analysis Code for R . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
List of Figures
1.1 Berkeley Growth Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3.1 Group Lasso Algorithm: Outline of the block coordinate gradient descent algo-
rithm used to perform grouped variable selection in the logistic regression model
[6]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.1 Simulation Results: The values reported are the means, followed by their stan-
dard deviations in parentheses. . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2 fMRI Nine Dot Experiment: Subjects were asked to use four and five lines to
connect all nine dots in the figure above. . . . . . . . . . . . . . . . . . . . . . . 20
4.3 Voxel Mask of the Anterior Temporal Lobe: The right anterior temporal lobe is
in red, the left in blue. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Chapter 1
Introduction
Functional data analysis (FDA) is a relatively new area within the discipline of statistics.
The first in-depth text on the subject, Functional Data Analysis, was published as recently
as 1997 [8]. Functional data are data that have been measured discretely over a continuum,
usually time. Instead of treating the many discrete measurements as individual observations,
one assumes that these measurements represent a smooth, underlying curve. This curve is
then treated as a single observation.
Much of this work was motivated by Functional Magnetic Resonance Imaging (fMRI),
a noninvasive imaging technique which allows an experimenter to take images of a subject's
brain over time. These images are taken while the subject performs a task such as finger
tapping or correctly identifying images of human faces amidst a series of images containing
both human faces and objects. fMRI is currently being used to assess which areas of the
brain are activated while performing certain tasks. This is done by dividing the brain into
voxels, the three-dimensional analog of pixels, and measuring brain activation in each.
Depending on how one defines a voxel, a typical fMRI image has over 1,000,000 voxels. As
fMRI studies usually have a small number of subjects, this results in datasets that are
incredibly high-dimensional. High dimensionality is one of the biggest problems in the
statistical analysis of fMRI data [10].
Initial statistical analysis of fMRI data was univariate in nature [3]. This is obviously
simplistic for data that take into account four dimensions: three spatial dimensions and one
temporal dimension. The second decade of fMRI research focused on multivariate data
analysis. Since the images are captured over time, the next logical step in this progression
is functional data analysis. In an fMRI session, images of a subject's brain are taken at
discrete time points. However, one would expect the brain's response to a stimulus to be
continuous in nature. This makes fMRI imaging data a great candidate for functional data
analysis. In 2005, Viviani et al. published a paper using functional principal component
analysis in fMRI; they showed the results to be much more interpretable than those of
multivariate PCA [12]. Since then, more statistical analysis of fMRI data has been functional
in nature. There is a need to attack the problem of high dimensionality of brain imaging
data, as well as a need for the development of better classification methods [10]. One of the
greatest strengths of Functional Magnetic Resonance Imaging is its noninvasiveness. If
statistical classification methods are improved, they could aid the advancement of
noninvasive diagnostic techniques for mental illness or even degenerative diseases such as
Alzheimer's. There is some research being done in this area, but there is room for more
advancement [10].
1.1 Basic Functional Data Analysis
Consider sample curves of the form {x_i(t), t ∈ T, i = 1, ..., n}, where T is an interval
over which the observations were measured. The observations belong to the Hilbert space
L^2(T) of square-integrable functions, with the inner product

\langle f, g \rangle = \int_T f\,g\,dt, \qquad \forall f, g \in L^2(T). \tag{1.1}
A vector x_i = (x_{i1}, ..., x_{iN}) represents the discrete measurements for the ith subject of
one variable, x, at N points in T. There are functional analogs of the traditional summary
statistics. The functional sample mean, \bar{x}(t), is defined below:

\bar{x}(t) = n^{-1} \sum_{i=1}^{n} x_i(t). \tag{1.2}

The sample mean is computed point-wise at each t ∈ T. Similarly, one can compute the
covariance between measurements at two time points s and t:

\widehat{\mathrm{cov}}(s,t) = (n-1)^{-1} \sum_{i=1}^{n} \big(x_i(s) - \bar{x}(s)\big)\big(x_i(t) - \bar{x}(t)\big). \tag{1.3}

Figure 1.1: Berkeley Growth Study. These are the curves describing the height in cm of
girls followed in the Berkeley Growth Study beginning in 1929. The circles mark the
discrete height measurements over time.
When s = t, this is simply the variance of x computed at that time point. If we further
assume that the observations, x_i(t), belong to an L^2 space, we can define the inner product

\langle x, y \rangle = \int x(t)\,y(t)\,dt. \tag{1.4}

Once we consider the observations as smooth functions, a natural extension is to
estimate their derivatives. In the case of the human growth data in Figure 1.1, one might
want to estimate the velocity or acceleration of growth for inference as well [8].
1.2 Basis Expansion of Functional Data

Let observations {x_i(t), t ∈ T, i = 1, ..., n} belong to a subspace of L^2 spanned by the
p-dimensional basis system of independent functions, {φ_1, ..., φ_p}. The assumed smooth
functional observation, or linear expansion, x_i(t), can be expressed as the sum

x_i(t) = \sum_{k=1}^{p} a_{ik}\,\phi_k. \tag{1.5}

Here, each a_{ik} is called a basis coefficient. When derivative estimation is required, the
derivative must be estimated separately; differentiating the smoothed function x_i(t) does
not often give a good derivative estimate.
There are many different basis systems that can be used in a basis expansion. One of
the most common is the Fourier basis system. A Fourier basis is defined by a Fourier
series,

1, sin(ωt), cos(ωt), sin(2ωt), cos(2ωt), sin(3ωt), cos(3ωt), . . . .

Fourier bases are useful for data that are periodic, e.g. weather patterns [8]. Additionally,
derivative estimation is easier with a Fourier basis, as the derivatives of sin(t) and cos(t) are
known and easy to compute.

For data that are not cyclical in nature, the most common choice is the B-spline basis
system. B-splines are computationally efficient and flexible enough to approximate most
non-periodic functions. Splines are defined over an interval T subdivided into subintervals
at points called knots. A basis system with L interior knots contains L + 1 subintervals. Over
each subinterval, a polynomial of order m is defined, and the polynomials are constrained
to join at the knots. The number of knots, L, and the order of the splines, m, are the two
important parameters defining a B-spline basis system. In this work, only B-spline bases
will be used.
After a basis system has been chosen, the basis coefficients must be estimated. One
method of basis coefficient estimation is least squares estimation. Denote the discrete
observations of a functional curve y_j, j = 1, ..., N, where N is the number of time
points at which observations were taken. Assume independently and identically distributed
measurement errors, ε_j, with E[ε_j] = 0 and Var(ε_j) = σ². Then,

y_j = x(t_j) + \varepsilon_j. \tag{1.6}

Defining x(t_j) in terms of (1.5), the sum of squared errors can be written as

SSE(y \mid a) = \| y - \Phi a \|^2, \tag{1.7}

where Φ is the N × p matrix with entries Φ_{jk} = φ_k(t_j). Setting the first derivative to zero,
the least squares solution minimizing SSE(y | a) is

\hat{a} = (\Phi'\Phi)^{-1}\Phi' y. \tag{1.8}

However, least squares smoothing is inappropriate if the error assumptions do not hold. In
the case of functional data measured over time, observations at adjacent time points are
likely correlated, violating the standard error assumptions.
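To make the least squares route in (1.8) concrete, here is a minimal sketch in Python/NumPy. It is illustrative only: the analyses in this thesis were done in R, and a simple monomial basis stands in for the B-splines discussed above.

```python
import numpy as np

# Discrete observations y_j = x(t_j) + noise, sampled at N time points.
rng = np.random.default_rng(0)
N, p = 50, 4                       # time points, basis dimension
t = np.linspace(0.0, 1.0, N)
y = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal(N)

# Design matrix Phi with Phi[j, k] = phi_k(t_j); monomials stand in
# for the B-spline basis used in the thesis.
Phi = np.vander(t, p, increasing=True)

# Least squares solution a_hat = (Phi'Phi)^{-1} Phi' y, as in (1.8).
a_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# The smoothed curve evaluated back at the sampling points.
x_smooth = Phi @ a_hat
```

The same computation applies to any basis system: only the construction of the design matrix Φ changes.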
A more common method of spline smoothing is smoothing with a roughness penalty.
This method estimates a curve that is rough enough to describe the observed features
of the data, but suppresses high-frequency features, including noise. To find the
coefficients of a smooth approximation of this type, the sum of squared errors is minimized
with the added constraint of a roughness penalty. Roughness of a function is described by
its curvature, or squared second derivative; the quantity penalized is the integrated
squared second derivative,

PEN_2(x) = \int [D^2 x(s)]^2\,ds. \tag{1.9}

The corresponding penalized sum of squared errors to be minimized is as follows:

PENSSE_\lambda(x \mid y) = [y - x(t)]'\,W\,[y - x(t)] + \lambda\,PEN_2(x), \tag{1.10}

where W is the matrix of weights describing the covariance structure of the errors and λ is
a smoothing parameter. The smoothing parameter is chosen by the method of generalized
cross-validation developed by Craven and Wahba [1]. This is done by choosing the λ that
minimizes

GCV(\lambda) = \frac{n\,SSE}{(n - df(\lambda))^2}. \tag{1.11}

Here df(λ) = trace(S_{\phi,\lambda}), where

S_{\phi,\lambda} = \Phi\,(\Phi' W \Phi + \lambda R)^{-1}\,\Phi' W, \qquad R_{kl} = \int D^2\phi_k(t)\,D^2\phi_l(t)\,dt, \tag{1.12}

is the hat matrix of the spline smoother.
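The penalized criterion (1.10) and the GCV search over λ can be sketched as follows. This is an illustration under simplifying assumptions (W = I, a monomial basis in place of B-splines, and a second-difference approximation to the curvature penalty matrix R), not the thesis's exact construction:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 60, 8
t = np.linspace(0.0, 1.0, N)
y = np.cos(2 * np.pi * t) + 0.2 * rng.standard_normal(N)

# Basis evaluated at the sampling points (monomials stand in for B-splines).
Phi = np.vander(t, p, increasing=True)

# Second differences of the fitted values approximate the curvature penalty
# PEN_2(x), giving a discrete stand-in for the penalty matrix R.
D2 = np.diff(np.eye(N), n=2, axis=0)
R = Phi.T @ (D2.T @ D2) @ Phi

def gcv(lam):
    # Hat matrix S_lambda = Phi (Phi'Phi + lam R)^{-1} Phi'  (taking W = I).
    S = Phi @ np.linalg.solve(Phi.T @ Phi + lam * R, Phi.T)
    df = np.trace(S)                       # df(lambda) = trace(S_lambda)
    sse = np.sum((y - S @ y) ** 2)
    return N * sse / (N - df) ** 2         # GCV(lambda), as in (1.11)

# Choose lambda on a coarse grid by minimizing GCV.
grid = 10.0 ** np.arange(-8.0, 2.0)
lam_best = min(grid, key=gcv)
```

Larger λ increases the penalty and lowers the effective degrees of freedom df(λ); GCV balances that smoothness against the residual error.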
Once the functional observations have been smoothed, statistical analysis can begin.
As with traditional statistics, early FDA focused on univariate methods. Our focus,
however, is on a set of multiple functional predictors. In any functional dataset with
multiple functional predictors, and especially in fMRI, dimensionality is an issue. It may
be important, for the sake of interpretation and computational expense, to select a smaller
subset of important variables from the dataset. In any dataset, classification is often of
interest, and this particularly resonates within fMRI, where classification is currently used
to explore brain functionality. For example, suppose the subjects in a study are given one
of two stimuli. Does the brain behave differently enough in the presence of each stimulus
to predict which stimulus a subject was presented with? If so, it is important to identify
which areas of the brain are associated with this difference. As classification improves
within the realm of fMRI, it could contribute to the diagnosis of brain degeneration or
mental illness in clinical settings. The rest of this thesis focuses on a method of dimension
reduction and variable selection for a dataset with multiple functional predictors and a
binary response.
Chapter 2
Functional Principal Component Logistic Regression
2.1 Principal Component Analysis
Traditional principal component analysis (PCA) is a method of data reduction for
multivariate datasets [4]. Consider a data matrix X_{n×m}, where n is the sample size and m is
the number of variables. Let E[X] = 0 and let C = (n − 1)^{-1} X'X be the variance-covariance
matrix. PCA is performed on the covariance matrix or correlation matrix of X. It reduces
dimension by finding the linear combination of the variables that has maximum variance;
this linear combination is the first principal component (PC). The next PC is the linear
combination of the variables that is uncorrelated with the first PC and has the next largest
variance. This continues until min{n − 1, m} PCs have been found. The procedure is
equivalent to solving the following eigenequation:

C f = \lambda f. \tag{2.1}

The solutions to this equation are the eigenvalues, λ, and eigenvectors, f, of the covariance
or correlation matrix. Dimension reduction occurs when a number of principal components,
s ≤ m, is chosen. Often this is done by examining the cumulative proportion of variance
explained by each additional principal component.
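As a concrete illustration of (2.1) and of choosing s by cumulative variance, here is a Python/NumPy sketch on simulated data (the thesis's own computations were done in R):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 100, 5
X = rng.standard_normal((n, m)) @ rng.standard_normal((m, m))
X = X - X.mean(axis=0)                      # center, so E[X] = 0

C = X.T @ X / (n - 1)                       # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)        # solves C f = lambda f
order = np.argsort(eigvals)[::-1]           # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = X @ eigvecs                        # principal component scores
cumvar = np.cumsum(eigvals) / eigvals.sum()
s = int(np.searchsorted(cumvar, 0.90) + 1)  # smallest s explaining 90%
```

The first s columns of `scores` then replace the m original variables in subsequent modeling; the 90% cutoff is one common, but arbitrary, choice.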
2.2 Functional Principal Component Analysis
Consider a sample containing observations of one functional predictor. There is
correlation between observations at adjacent points in T; in a linear model framework, this
leads to the problem of high multicollinearity. Escabias et al. propose a method of functional
principal component analysis (fPCA) for the logistic regression model with one functional
predictor that alleviates this issue [6]. We will extend this method to a functional logistic
regression model with multiple functional predictors.
As outlined by Ramsay and Silverman, fPCA is the functional analog of traditional
multivariate principal component analysis [8]. Assume functional observations of one
variable, x_i(t) ∈ L², i = 1, ..., n, with the usual functional sample mean, x̄(t), and
sample covariance function, Ĉ(s,t), s, t ∈ T. Without loss of generality, assume x̄(t) = 0.
The functional principal components are found by solving the following functional
equivalent of (2.1):

\int_T \hat{C}(s,t)\,f(s)\,ds = \lambda f(t). \tag{2.2}

The solutions to (2.2) are the eigenvalues, λ, and eigenfunctions, f(t), of the sample
covariance function Ĉ. The number of nonzero eigenvalues is at most n − 1. The score of
the ith observation on the jth principal component, ξ_{ij}, is expressed as

\xi_{ij} = \int_T x_i(t)\,f_j(t)\,dt. \tag{2.3}

The solution of (2.2) cannot always be computed analytically.

When the n sample functions of a functional predictor belong to the space L²(T)
spanned by the basis {φ_1, ..., φ_p}, the functional PCs are equivalent to the multivariate
PCs of the matrix AΨ^{1/2} [2]. Here, A is the n × p matrix of coefficients of the basis
expansions and Ψ is the p × p matrix whose components are defined as

\Psi_{ij} = \int_T \phi_i\,\phi_j\,dt. \tag{2.4}
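The equivalence above suggests a simple recipe: form AΨ^{1/2} and run ordinary multivariate PCA on it. A sketch follows, with an arbitrary positive definite matrix standing in for the true Gram matrix Ψ of the basis (an assumption of this illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 6
A = rng.standard_normal((n, p))       # basis coefficients of n curves
A = A - A.mean(axis=0)                # centered curves

# Gram matrix Psi_ij = \int phi_i phi_j dt; here any symmetric positive
# definite matrix serves as a stand-in.
M = rng.standard_normal((p, p))
Psi = M @ M.T + p * np.eye(p)

# Symmetric square root of Psi via its eigendecomposition.
w, U = np.linalg.eigh(Psi)
Psi_half = U @ np.diag(np.sqrt(w)) @ U.T

# Functional PCA = multivariate PCA of A Psi^{1/2}.
B = A @ Psi_half
C = B.T @ B / (n - 1)
lam, V = np.linalg.eigh(C)
lam, V = lam[::-1], V[:, ::-1]        # decreasing eigenvalues
scores = B @ V                        # functional PC scores xi_ij
```

With an orthonormal basis, Ψ is the identity and the computation reduces to a PCA of the coefficient matrix A itself.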
2.3 The Functional Logistic Regression Model
Escabias et al. develop a principal component functional logistic regression model for
one functional predictor [2]. We describe this method before extending it to the case with
multiple functional predictors. Consider observations {(y_i, x_i(t)), t ∈ T, i = 1, ..., n},
where x_i(t) is a functional predictor, x_i(t_j) is the observation for the ith subject at the jth
time point, and each y_i ∈ {0, 1}. The conditional distribution of Y_i | X_i(t) is Bernoulli(π_i),
with

\pi_i = E[Y_i \mid X_i(t)] = \frac{\exp\{\alpha + \int_T x_i(t)\beta(t)\,dt\}}{1 + \exp\{\alpha + \int_T x_i(t)\beta(t)\,dt\}}, \qquad i = 1, \ldots, n, \tag{2.5}

where α ∈ R and the parameter β(t) is a function. Applying the logit transform, a
generalized linear model is formed [7]:

l_i = \ln\frac{\pi_i}{1 - \pi_i} = \alpha + \int_T x_i(t)\beta(t)\,dt, \qquad i = 1, \ldots, n. \tag{2.6}
Under the assumption that β(t) belongs to the same L² space spanned by {φ_1, ..., φ_p},

\beta(t) = \sum_{k=1}^{p} \beta_k\,\phi_k. \tag{2.7}

The l_i's can then be expressed in terms of the A matrix,

L = \alpha 1_{n\times 1} + A\Psi\beta, \tag{2.8}

where β is a p × 1 vector containing the coefficients β_k of the basis expansion of β(t).
The basis coefficients β_k can be estimated using a Newton-Raphson algorithm that
maximizes the likelihood, i.e., solves the score equation

X'(Y - \pi) = 0. \tag{2.9}

In (2.9), Y = (y_1, ..., y_n)', π = (π_1, ..., π_n)' and X = (1 | AΨ) [6]. Once the coefficients
are estimated,

\hat{\beta}(t) = \sum_{k=1}^{p} \hat{\beta}_k\,\phi_k. \tag{2.10}
The vector L in (2.8) can be re-expressed in terms of the principal components of AΨ:

L = \alpha 1_{n\times 1} + \Gamma V' \beta = \alpha 1_{n\times 1} + \Gamma\gamma, \tag{2.11}

where Γ = (ξ_{ij}) is the matrix of the p principal components, V is the matrix of eigenvectors,
and γ = V'β. This notation allows for estimation of the components of β,

\hat{\beta} = V\hat{\gamma}. \tag{2.12}

Reduction of the effects of multicollinearity occurs when a number of PCs, s ≤ p, is chosen.
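The estimation route of this section can be sketched end to end: compute PC scores from the basis coefficients, fit a logistic regression on the first s scores by Newton-Raphson, and map γ̂ back to basis coefficients via (2.12). The data below are simulated and the code is an illustration, not the thesis's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, s = 200, 6, 3

A = rng.standard_normal((n, p))              # centered basis coefficients
C = A.T @ A / (n - 1)
lam, V = np.linalg.eigh(C)
lam, V = lam[::-1], V[:, ::-1]
Gamma = A @ V                                # PC scores

y = rng.integers(0, 2, size=n).astype(float) # toy binary response

# Newton-Raphson (IRLS) for logistic regression on intercept + s PCs.
Z = np.column_stack([np.ones(n), Gamma[:, :s]])
theta = np.zeros(s + 1)
for _ in range(25):
    pi = 1.0 / (1.0 + np.exp(-Z @ theta))
    W = pi * (1 - pi)
    H = Z.T @ (Z * W[:, None])               # observed information
    theta = theta + np.linalg.solve(H, Z.T @ (y - pi))

alpha_hat, gamma_hat = theta[0], theta[1:]
beta_hat = V[:, :s] @ gamma_hat              # beta_hat = V gamma_hat, (2.12)
```

Only s of the p PC directions carry estimated coefficients; the back-transformed β̂ still lives in the full p-dimensional basis coefficient space.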
2.4 Principal Component Functional Logistic Regression for Multiple Functional
Predictors

The extension of PCA to the model with multiple functional predictors lies in the
definition of the inner product [4]. Let {(y_i, x_i^m(t)), t ∈ T, i = 1, ..., n, m = 1, ..., M}
belong to L²_M(T) spanned by {φ_1, ..., φ_p}, where each x_i^m(t) is a functional predictor and
each y_i ∈ {0, 1}. The inner product, then, of two functional PCs can be defined as follows:

\langle f_i, f_j \rangle = \sum_{m=1}^{M} \int_T f_i^m\,f_j^m\,dt. \tag{2.13}

The functional logistic regression model with multiple predictors is still defined as in (2.5).
The definition of π_i, however, changes with the change in inner product:

\pi_i = E[Y_i \mid X_i(t)] = \frac{\exp\{\alpha + \sum_{m=1}^{M}\int_T x_i^m(t)\beta^m(t)\,dt\}}{1 + \exp\{\alpha + \sum_{m=1}^{M}\int_T x_i^m(t)\beta^m(t)\,dt\}}, \qquad i = 1, \ldots, n. \tag{2.14}
Applying the logit transform,

l_i = \alpha + \sum_{m=1}^{M} \int_T x_i^m(t)\beta^m(t)\,dt, \qquad i = 1, \ldots, n. \tag{2.15}

To perform the dimension-reducing PCA, we redefine the design matrix AΨ. Here,

A_{n\times(Mp)} = \big[\,A^1 \;\; \cdots \;\; A^M\,\big], \tag{2.16}

and

\Psi_{(Mp)\times(Mp)} = \begin{pmatrix} \Psi^1 & 0 & \cdots & 0 \\ 0 & \Psi^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Psi^M \end{pmatrix}. \tag{2.17}

Each A^m is an n × p matrix of basis coefficients, and Ψ is a block diagonal matrix with
each block Ψ^m of dimension p × p. Now β = (β^{1\prime}, ..., β^{M\prime})'. Redefining (2.8),

L = \alpha 1_{n\times 1} + A\Psi\beta = \alpha 1_{n\times 1} + \sum_{m=1}^{M} A^m\Psi^m\beta^m. \tag{2.18}

This definition depends on the assumption that each observation x_i^m(t) and the respective
parameter function, β^m(t), can be defined by the same set of basis functions
{φ_1^m(t), ..., φ_p^m(t)}. Below, we express L in terms of the principal components:

L = \alpha 1_{n\times 1} + \sum_{m=1}^{M} \Gamma^m V^{m\prime}\beta^m, \tag{2.19}
where Γ^m = (ξ_{ij}^m)_{n×p} are the principal components of A^mΨ^m, and V^m is the
corresponding matrix of eigenvectors. The number of PCs, s_m ≤ p, to be chosen for each of
the M predictors should be determined by cumulative variance. For simplicity, we chose
the same number of PCs, s, for each of the M predictors. After the dimension has been
reduced at the within-variable level, (2.14) becomes

\pi_i^{(s)} = \frac{\exp\{\alpha^{(s)} + \sum_{m=1}^{M}\sum_{j=1}^{s} \xi_{ij}^{m}\gamma_{j}^{m(s)}\}}{1 + \exp\{\alpha^{(s)} + \sum_{m=1}^{M}\sum_{j=1}^{s} \xi_{ij}^{m}\gamma_{j}^{m(s)}\}}, \qquad i = 1, \ldots, n. \tag{2.20}
We can re-express (2.19), after s PCs have been chosen, as

L = \alpha^{(s)} 1_{n\times 1} + \sum_{m=1}^{M} \Gamma^{m(s)} V^{m(s)\prime}\beta^{m(s)}. \tag{2.21}

The vectors β^m can then be estimated as in (2.12),

\hat{\beta}^m = V^m\hat{\gamma}^m. \tag{2.22}
Although multicollinearity has been dealt with on a within-variable basis, as M gets
large there could be multiple predictors providing similar information. In the case of fMRI
data, the time series of adjacent voxels are expected to be similar. There is a need to reduce
dimension at the multiple-variable level by selecting only those functional predictors which
are relevant to the response; this alleviates another potential source of multicollinearity.
The following chapter focuses on a method of simultaneous model estimation and variable
selection.
Chapter 3
Group Lasso for Functional Logistic Regression
3.1 Introduction
The principal component analysis allows for removal of redundant information on a
within-predictor basis. As stated before, this is necessary due to the autocorrelation of
observations between time points. There is also a need to select only those predictors which
provide relevant information to the model.
There are many methods of variable selection used in linear models and generalized
linear models. Model selection techniques, such as stepwise and forward selection, cycle
algorithmically through subsets of variables until certain criteria are met. The variables
included in the various steps of the algorithm are determined by previously selected p-values.
These methods are inherently subjective, as it is up to the person analyzing the model to
choose the "best" model based on a set of criteria. The criteria that determine the quality
of the model are also chosen by the analyst.
3.2 Lasso
A more objective method of variable selection, the lasso, was introduced by Tibshirani
in 1996. The lasso simultaneously performs model selection and parameter estimation. It
is an L1 regularization technique that performs variable selection by shrinking certain
coefficients to exactly 0, excluding those predictors from the model; the remaining non-zero
coefficients represent variables that are relevant to the model. This is done by solving the
least squares problem subject to a constraint on the coefficients. Assume a standard
regression model with independent observations {(y_i, x_i), i = 1, ..., n}, where
x_i = (x_{i1}, ..., x_{ip}). The lasso estimates of the regression coefficients, (α̂, β̂), are

(\hat{\alpha}, \hat{\beta}) := \arg\min \Big\{ \sum_{i=1}^{n} \Big( y_i - \alpha - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \Big\}, \tag{3.1}

subject to \sum_j |\beta_j| \le t, where t ≥ 0. Note that α is not penalized.
Tibshirani also applied the lasso to the logistic regression model [11]. Consider
independent observations {(y_i, x_i), i = 1, ..., n}, where y_i ∈ {0, 1}. Variable selection and
model estimation are performed by maximizing the log-likelihood function,

l(\beta) = \sum_{i=1}^{n} \big[ y_i (x_i'\beta) - \ln(1 + \exp\{x_i'\beta\}) \big], \tag{3.2}

subject to \sum_j |\beta_j| \le t, where t ≥ 0. An iteratively reweighted least squares algorithm
is used to compute β̂ under these conditions.
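The shrink-to-exactly-zero behavior of the lasso is easiest to see in the orthonormal-design case, where the solution reduces to soft-thresholding of the least squares estimates. A minimal sketch (λ here is the Lagrangian counterpart of the bound t):

```python
import numpy as np

def soft_threshold(b, lam):
    # Shrinks each coefficient toward zero; entries with |b_j| <= lam
    # become exactly 0, which is how the lasso excludes predictors.
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

b_ols = np.array([2.5, -0.3, 0.0, 1.1, -2.0])   # least squares estimates
b_lasso = soft_threshold(b_ols, lam=0.5)
# Surviving coefficients shrink by 0.5; |b_j| <= 0.5 is set exactly to 0.
```

This contrasts with ridge regression, whose L2 penalty shrinks every coefficient but never sets one exactly to zero.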
3.3 Group Lasso
Consider a linear model with multiple predictors, some of them categorical. A
categorical predictor with l levels will be represented in the model by l − 1 variables. The
lasso can only shrink individual regression coefficients to zero, which has little
interpretation in the case of a categorical predictor: if the categorical predictor is not
relevant to the response, all l − 1 variables should be removed from the model.

Consider independent observations {(y_i, x_i), i = 1, ..., n}, where x_i = (x_{i1}', ..., x_{iM}')'.
Each x_{im} represents a group of predictors. The linear regression model is defined as

Y = \alpha + \sum_{m=1}^{M} X_m\beta_m + \varepsilon, \tag{3.3}

where α ∈ R is the intercept, each β_m is a vector whose components are the regression
coefficients for the mth group of predictors, and Y_{n×1} is the vector of responses. Yuan and
Lin [14] developed a method of variable selection, called the group lasso, that considers each
of the M groups of variables for inclusion in or exclusion from the model. The coefficient
estimates are defined as

\hat{\theta} = \arg\min \Big( \|Y - X\theta\|_2^2 + \lambda \sum_{m=1}^{M} \|\beta_m\|_2 \Big), \tag{3.4}

where λ is a tuning parameter and θ = (α, β_1', ..., β_M')'. The penalty is a mixture of the
L1 and L2 regularization penalties used by the lasso and ridge regression.
3.4 Group Lasso for Functional Logistic Regression
Meier et al. [6] describe a group lasso method for the multivariate logistic regression
model. Consider independent observations {(y_i, x_i), i = 1, ..., n}, where y_i ∈ {0, 1} and
x_i = (x_{i1}', ..., x_{iM}')'. Each x_{im} represents a group of predictors. The group lasso is
performed by minimizing the following convex function:

S_\lambda(\beta) = -l(\beta) + \lambda \sum_{m=1}^{M} s(df_m)\,\|\beta_m\|_2, \tag{3.5}

where df_m is the degrees of freedom of the mth group of predictors; the use of
s(df_m) = df_m^{1/2} is suggested [6]. The solution to this minimization is the logistic group
lasso estimator, \hat{\beta}_\lambda. It is found using a block coordinate gradient descent
algorithm based on the second-order Taylor expansion

S_\lambda(\hat{\beta}^{(t)} + d) \approx -\big\{ l(\hat{\beta}^{(t)}) + d'\nabla l(\hat{\beta}^{(t)}) + \tfrac{1}{2}\,d' H^{(t)} d \big\} + \lambda \sum_{m=1}^{M} s(df_m)\,\|\hat{\beta}^{(t)}_m + d_m\|_2 = M^{(t)}_\lambda(d), \tag{3.6}

where H^{(t)} approximates the Hessian of the log-likelihood at \hat{\beta}^{(t)}. The algorithm
begins with an initial parameter vector β^{(0)}. For each of the M groups of variables, the
algorithm finds the d that minimizes M^{(t)}_\lambda(d). If this d is not identically 0, the
estimate of β is updated: the new estimate \hat{\beta}^{(t+1)} is the previous estimate
\hat{\beta}^{(t)} plus a scalar step times d. This proceeds over the groups until a convergence
criterion is met. The choice of λ depends on n and the degrees of freedom of each of the M
groups. In the multivariate model, the group lasso is performed on a dataset containing M
groups of discrete predictors, and a number of groups less than M is selected. This version
of the group lasso has been shown to be asymptotically consistent [6]. The minimization
can be carried out in R using the package grplasso written by Meier et al. [6].

Figure 3.1: Group Lasso Algorithm: Outline of the block coordinate gradient descent
algorithm used to perform grouped variable selection in the logistic regression model [6].
We apply this group lasso method to the functional logistic regression model with
multiple functional predictors. Recall the observations {(y_i, x_i^m(t)), t ∈ T, i = 1, ..., n,
m = 1, ..., M} ∈ L²_M(T) spanned by {φ_1, ..., φ_p}, where each x_i^m(t) is a functional
predictor and each y_i ∈ {0, 1}, and consider the model in (2.19), after s principal
components have been chosen. The log-likelihood function, l(β), of the functional logistic
regression model is

l(\beta) = \sum_{i=1}^{n} \Big[ y_i \Big( \alpha^{(s)} + \sum_{m=1}^{M} \int x_i^m(t)\beta^{m(s)}(t)\,dt \Big) - \ln\Big( 1 + \exp\Big\{ \alpha^{(s)} + \sum_{m=1}^{M} \int x_i^m(t)\beta^{m(s)}(t)\,dt \Big\} \Big) \Big]. \tag{3.7}

Using the definition of l(β) in (3.7), we minimize the objective function
S_\lambda(\beta) = -l(\beta) + \lambda \sum_{m=1}^{M} s(df_m)\,\|\beta^m\|_2. \tag{3.8}

In our method of principal component logistic regression with multiple functional
predictors, the degrees of freedom df_m in (3.8) is equivalent to the number of chosen
PCs, s. In the functional case, each of the M functional predictors is represented by a
group of s coefficients. When an entire group of coefficients is shrunk to 0, the
corresponding functional predictor is excluded from the model; for any group of s
coefficients that are not all equal to zero, the corresponding functional predictor is
included. In essence, the group lasso performs single-variable selection in the functional
logistic regression model with multiple functional predictors.
Chapter 4
Application
4.1 Introduction
In a Functional Magnetic Resonance Imaging experiment, an experimenter aims to
measure the amount of activation in each voxel of the brain. When a part of the brain is
active, there is increased blood flow to the area. fMRI measures the change in blood flow
using the blood-oxygen-level-dependent (BOLD) contrast [3]. Assessing which parts of the
brain are active during an fMRI experiment allows researchers to determine which parts of
the brain respond to certain stimuli. There is a need for classification tools in the statistical
analysis of fMRI. Logistic regression could be used to distinguish between brains at rest and
those presented with stimuli. Another application would be to distinguish between subjects
receiving one of two particular stimuli. For example, an experimenter may play pieces of
music or speech to a subject [9]. Being able to classify which stimulus was presented allows
one to learn more about the way the brain works. Classification also has applications in
the diagnosis of mental illness or degenerative disease.
4.2 Simulation Study
We assess our methodology using a simulation study. Using the R package neuRosim,
we simulated preprocessed fMRI data. The package can be used to simulate fMRI time
series or complete 4D fMRI volumes. With neuRosim, one can define the onset and
duration of stimuli, and can specify the effect size of the stimuli, the repetition time (TR),
and the type of spatial and temporal noise [13].
We simulated preprocessed four-dimensional fMRI data for an area of 4000 voxels
containing two non-overlapping regions of activation. We used neuRosim to create block
designs of a stimulus followed by rest. Design 1 presented an effect size that was larger in
Region 1 than in Region 2; Design 2 created activation that was larger in Region 2 than in
Region 1. Half of the observations in each simulation were simulated under Design 1, the
other half under Design 2. Our goal was to use the developed method of group lasso for
functional logistic regression with multiple functional predictors to classify the validation
set into the proper groups. We simulated 15 datasets at each of two levels of signal-to-noise
ratio (SNR), 0.75 and 3.87, and two levels of subject size, 30 and 50. An observation
simulated under Design 1 was given a y value of 1; otherwise y_i = 0. Two-fold
cross-validation was then used to form training and validation sets. All analysis was
performed in R; the package grplasso was used to perform the final variable selection [6].
Observations with π̂_i > 0.5 were classified as y_i = 1, otherwise y_i = 0.
For each of the four settings, two-fold cross validation resulted in 30 models. We report the
number of voxels selected out of the initial 4000 voxels. We also report sensitivity, defined
as the ratio of true positives to true positives plus false negatives; false positive rate, the
ratio of false positives to false positives plus true negatives; and accuracy, defined as
the ratio of true positives plus true negatives to the number of observations in the validation
set. These findings can be seen in Figure 4.1. We did not report the number of principal
components selected in the table. In the cases where n = 50, the original number of basis
functions was 43; after PCA, 8, 9 or 10 principal components were selected every time. In
the cases where n = 30, 7 or 8 PCs were chosen. The method classifies well, even with a
small number of voxels. As expected, the cases with fewer subjects have lower sensitivity
and accuracy and a higher false positive rate. In the two simulations with lower SNR,
classification appears to improve, which is surprising.
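The three reported measures can be computed from 0/1 predicted and true labels as below; the function name is illustrative and not part of the thesis code.

```r
# Sensitivity, false positive rate, and accuracy from 0/1 label vectors.
classification_metrics <- function(yhat, y) {
  tp <- sum(yhat == 1 & y == 1)   # true positives
  tn <- sum(yhat == 0 & y == 0)   # true negatives
  fp <- sum(yhat == 1 & y == 0)   # false positives
  fn <- sum(yhat == 0 & y == 1)   # false negatives
  c(sensitivity = tp / (tp + fn),
    fpr         = fp / (fp + tn),
    accuracy    = (tp + tn) / length(y))
}
```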
                     No. Voxels Selected  Sensitivity    False Positive Rate  Accuracy
SNR = 3.87, n = 50   5.414 (1.842)        0.935 (0.096)  0.053 (0.079)        0.936 (0.060)
SNR = 0.75, n = 50   5.267 (1.617)        0.953 (0.099)  0.044 (0.082)        0.949 (0.670)
SNR = 3.87, n = 30   3.867 (1.332)        0.825 (0.188)  0.166 (0.197)        0.840 (0.158)
SNR = 0.75, n = 30   3.967 (1.449)        0.874 (0.153)  0.108 (0.159)        0.871 (0.106)
Figure 4.1: Simulation Results: The values reported are the means, followed by their standard deviations in parentheses.
Figure 4.2: fMRI Nine Dot Experiment: Subjects were asked to use four and five lines to connect all nine dots in the figure above.
4.3 fMRI Example
To test this methodology on a real dataset, we used fMRI data collected by Auburn
University's MRI Research Center. The data were collected from 6 subjects on a 7T MRI
scanner, and each scanning session lasted 1000 s. The data were preprocessed by the experi-
menters using SPM8: slice timing correction was applied; spatial realignment, normalization,
and smoothing were performed; and, finally, the data were detrended. The complete raw
voxel-wise time series were then extracted using MarsBaR.
The experimental design was a block design with two conditions. In one condition, sub-
jects were asked to use four lines to connect all dots in Figure 4.2; in the other, they
were asked to use five lines. Conditions were presented in a random sequence, and each
condition was followed by a period of rest. All subjects were able to connect the nine dots
using five lines, but only one (Subject 5) was able to solve the puzzle using four lines.
Our aim was to use group lasso for functional logistic regression to
classify the six subjects as having solved the four-line puzzle, y = 1, or as being unable to
solve the puzzle, y = 0.
Figure 4.3: Voxel Mask of the Anterior Temporal Lobe: The right anterior temporal lobe is
in red, the left in blue.
According to the experimenters, there were two important regions of interest (ROIs) in
this study: the left and right anterior temporal lobes (ATL), which can be seen in
Figure 4.3. They are associated with semantic memory, the knowledge of objects and
facts. From the right ATL, 7560 voxel time series were extracted; from the left ATL, 6584.
This gave time series from a total of 14144 voxels. To perform PCA on the 14144 A matrices
of the spline smooths, the number of basis functions, p, must be less than the number of
subjects. We chose to use 5 basis functions. After performing the principal component
analysis, 3 PCs were chosen. From the 14144 voxels, the group lasso procedure selected
11 voxels. From these 11 voxel time series, the classification procedure correctly identified
Subject 5 as having solved the puzzle.
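The number of retained components was chosen by a cumulative proportion-of-variance rule with a 95% threshold, as in Appendix B; a compact version, with illustrative names, is:

```r
# Smallest number of PCs whose cumulative proportion of variance reaches 95%.
npc_by_variance <- function(pca, threshold = 0.95) {
  v <- pca$sdev^2                  # component variances
  propvar <- cumsum(v) / sum(v)    # cumulative proportion of variance
  which(propvar >= threshold)[1]   # first count reaching the threshold
}
```

For example, standard deviations c(2, 1, 1) give variances 4, 1, 1 and cumulative proportions 0.667, 0.833, 1, so all 3 components are retained.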
4.4 Conclusion
We have developed a viable method of variable selection for functional logistic regression
by employing the group lasso on functional principal component scores. There are obvious
limitations with small sample sizes: being limited to a number of spline basis functions that
is less than the sample size could lead to poor estimation of the functional observations. The
method is also computationally expensive for a large number of functional variables and
subjects, since the spline smoothing procedure must be performed for every variable and
subject. Additionally, we would like to explore the statistical properties of the estimators
of the parameter functions. The application in the field of functional magnetic resonance
imaging is exciting; in future studies on real data, it would be interesting to study the
neurological significance of the voxels selected by the group lasso.
Bibliography
[1] Craven, P., Wahba, G., 1979. Smoothing Noisy Data with Spline Functions. Numerische
Mathematik, 31, 377-403.
[2] Escabias, M., Aguilera, A. M., Valderrama, M. J., 2004. Principal component estimation
of functional logistic regression: discussion of two di erent approaches. Nonparametric
Statistics, 16(3-4), 365-384.
[3] Huettel, S.A., Song, A.W., McCarthy, G., 2009. Functional Magnetic Resonance Imag-
ing (2nd Edition). Sunderland, MA: Sinauer Associates.
[4] Jolliffe, I. T., 2002. Principal Component Analysis, 2nd ed., Springer-Verlag, New York.
[5] Matsui, H., Konishi, S., 2011. Variable selection for functional regression models via the
L1 regularization. Computational Statistics and Data Analysis, 55(12), 3304-3310.
[6] Meier, L., van de Geer, S., Bühlmann, P., 2008. The group lasso for logistic regression.
Journal of the Royal Statistical Society: Series B, 70(Part 1), 53-71.
[7] Müller, H.-G., Stadtmüller, U., 2005. Generalized functional linear models. The Annals of
Statistics, 33(2), 774-805.
[8] Ramsay, J.O., Silverman, B.W., 2005. Functional Data Analysis, 2nd ed., Springer-
Verlag, New York.
[9] Ryali, S., Supekar, K., Abrams, D. A., Menon, V., 2010. Sparse logistic regression for
whole-brain classification of fMRI data. NeuroImage, 51(2), 752-764.
[10] Tian, T.S, 2010. Functional data analysis in brain imaging studies. Frontiers in Psy-
chology, 1(35), 1-11.
[11] Tibshirani, R., 1996. Regression Shrinkage and Selection via the Lasso. Journal of the
Royal Statistical Society B, 58(1), 267-288.
[12] Viviani, R., Grön, G., Spitzer, M., 2005. Functional Principal Component Analysis of
fMRI Data. Human Brain Mapping, 24, 109-129.
[13] Welvaert, M., Durnez, J., Moerkerke, B., Verdoolaege, G., Rosseel, Y., 2011. neuRosim:
An R Package for Generating fMRI Data. Journal of Statistical Software, 40(10).
[14] Yuan, M., Lin, Y., 2006. Model selection and estimation in regression with grouped
variables. Journal of the Royal Statistical Society B, 68(1), 49-67.
Appendix A
Simulation of Preprocessed fMRI Data
#Fill in n1, n2, and SNR below before running
seed1 <- NULL
seed0 <- NULL
n1 <-
n2 <-
n <- n1 + n2
SNR <-
#Draw and store a random seed for each observation
for (i in 1:n1){
seedi<-sample(1:10000,1)
seed1[[i]] <- seedi
}
for (i in 1:n2){
seedi<-sample(1:10000,1)
seed0[[i]] <- seedi
}
library(neuRosim)
#Parameters
dimx <- 20
dimy <- 20
dimz <- 10
nscan <- 100
TR <- 2
total <- TR*nscan
os <- seq(1, total, 20)
dur <- 7
regions <- simprepSpatial(regions = 2, coord = list(c(2, 2, 2),
c(-2,-3,1)), radius = c(3,1), form = "sphere")
onset <- list(os,os)
duration <- list(dur, dur)
effect1 <- list(7, 15)
effect2 <- list(12, 8)
design1 <- simprepTemporal (regions = 2, onsets = list(os,os),
durations = duration, TR = TR, hrf = "double-gamma", effectsize =
effect1, totaltime=total)
design2 <- simprepTemporal (regions = 2, onsets = list(os,os),
durations = duration, TR = TR, hrf = "double-gamma", effectsize =
effect2, totaltime=total)
#Simulate subjects 1 to n1 under Design 1
data <- NULL
a <- NULL
for(i in 1:n1){
set.seed(seed1[i])
a <- simVOLfmri(design = design1, image = regions, base = 100,
dim= c(dimx,dimy,dimz),SNR = SNR, noise = "mixture", type =
"rician", rho.temp = c(0.142,0.108,0.084), rho.spat = 0.4, w=
c(0.05,0.1,0.01,0.09,0.05,0.7))
data[[i]] <- a
}
#Simulate subjects n1+1 to n under Design 2
for(i in 1:n2){
set.seed(seed0[i])
a <- simVOLfmri(design = design2, image = regions, base = 100,
dim= c(dimx,dimy,dimz),SNR = SNR, noise = "mixture", type =
"rician", rho.temp = c(0.142,0.108,0.084), rho.spat = 0.4, w=
c(0.05,0.1,0.01,0.09,0.05,0.7))
data[[i+n1]] <- a
}
Appendix B
Simulation Analysis Code for R
#Sort data all subjects
vts<- NULL
vtsoutijk <- NULL
a <- NULL
for(m in 1:n){
for (i in 1:dimx){
for (j in 1:dimy){
for (k in 1:dimz){
vtsoutijk<- data[[m]][i,j,k,]
vtsoutijk <- t(vtsoutijk)
a <- cbind(a, vtsoutijk)
}
}
}
vts[[m]] <- a
a <- NULL
}
vtsALL <- NULL
for(i in 1:n){
vtsALL <- rbind(vtsALL,vts[[i]])
}
vtsV <- NULL
a <- NULL
for(i in 1:(dimx*dimy*dimz)){
a <-matrix(vtsALL[,(100*i-99):(100*i)], n, 100)
vtsV[[i]] <- a
}
vtsV <- t(vtsV)
#begin basis expansion
library(fda)
knots <- seq(0,200,10)
norder <- 4
nbasis <- length(knots) + norder - 2
T <- seq(2,200,2)
Trange = c(0,200)
lambda=1e6
#Create b-spline basis
bbasisV <- create.bspline.basis(Trange,nbasis,norder,knots)
curv.Lfd <- int2Lfd(2)
curv.fdParV <- fdPar(bbasisV, curv.Lfd,lambda)
#choose smoothing parameter
lambdas = 10^seq(-4,4,by=0.5)
mean.gcv = rep(0,length(lambdas))
#Initialize dataframes
SmoothV <- NULL
SmoothVfd <-NULL
A <- NULL
for(j in 1:(dimx*dimy*dimz)){
for(ilam in 1:length(lambdas)){
# Set lambda
curv.fdParVi <- curv.fdParV
curv.fdParVi$lambda <- lambdas[ilam]
# Smooth
Smoothi <- smooth.basis(T,t(vtsV[[j]]),curv.fdParVi)
# Record average gcv
mean.gcv[ilam] <- mean(Smoothi$gcv)
}
#plot(lambdas, mean.gcv, type="b", log="x")
best = which.min(mean.gcv)
lambdabest = lambdas[best]
curv.fdParV$lambda = lambdabest
bj = smooth.basis(T,t(vtsV[[j]]),curv.fdParV)
SmoothV[[j]] <-bj
#Create A matrices
c <- SmoothV[[j]]$fd
SmoothVfd[[j]] <- c
d <- SmoothVfd[[j]]$coefs
d <- matrix(d,n,nbasis)
A[[j]] <- d
}
#Create PSI matrix
Psi1 <- inprod(bbasisV,bbasisV)
e<- NULL
APsi <- NULL
#multiply Am by Psim
for(i in 1:(dimx*dimy*dimz)){
e <- A[[i]]%*%Psi1
APsi[[i]] <- e
}
#Perform PCA in APsi matrices
APsiPCA <- NULL
for(i in 1:(dimx*dimy*dimz)){
h <- princomp(APsi[[i]], cor = TRUE, scores = TRUE)
APsiPCA[[i]] <- h
}
#Calculate cumulative proportion of variance for each voxel
PropVar <- NULL
for(j in 1:(dimx*dimy*dimz)){
v <- (APsiPCA[[j]]$sdev)^2
PropVar[[j]] <- cumsum(v/sum(v))
}
#Count, for each voxel, the number of leading PCs whose cumulative
#proportion of variance is below the 95% threshold
NPCV <- NULL
NPC <- NULL
for(j in 1:(dimx*dimy*dimz)){
d <- as.numeric(PropVar[[j]] < .95)
NPCV[[j]] <- d
NPC[j] <- sum(d)
}
#Select # PCs
min(NPC)
max(NPC)
numPC <-
#Extract the first numPC principal component scores for each voxel
PC <- NULL
for(i in 1:(dimx*dimy*dimz)){
d <- APsiPCA[[i]]$scores[,1:numPC]
PC[[i]]<-d
}
#Create Y vector of responses (1 = Design 1, 0 = Design 2)
Y <- c(rep(1,n1),rep(0,n2))
#Select training and validation sets
train <- sample(1:n,n1)
valid <- NULL
for(i in 1:n){
if (i %in% train){}
else{
valid <- rbind(valid,i)
}
}
PCtrain <- NULL
PCvalid <- NULL
x<- NULL
a <- NULL
#Training Design
for(j in 1:(dimx*dimy*dimz)){
a <- NULL
x <- NULL
for(i in 1:n1){
a <- PC[[j]][train[i],]
x<-rbind(x,a)
}
PCtrain[[j]]<-x
}
#Validation Design
for(j in 1:(dimx*dimy*dimz)){
a <- NULL
x <- NULL
for(i in 1:n1){
a <- PC[[j]][valid[i],]
x<-rbind(x,a)
}
PCvalid[[j]]<-x
}
#Training Y
Ytrain <- NULL
for(i in 1:n1){
Ytrain[i] <- Y[train[i]]
}
#Valid Y
Yvalid <- NULL
for(i in 1:n1){
Yvalid[i] <- Y[valid[i]]
}
#grplasso
library(grplasso)
library(calibrate)
Int <- matrix(1, n1, 1)  #column of ones for the intercept
#Validation Set as Validation
lasX<-NULL
lasX<-cbind(lasX,Int)
for(i in 1:(dimx*dimy*dimz)){
lasX <- cbind(lasX,PCtrain[[i]])
}
index <- NULL
index <- c(index,NA)
for(i in 1:(dimx*dimy*dimz)){
b <-rep(i,numPC)
index <- c(index,b)
}
lambdalas <- lambdamax(lasX, y=Ytrain, index=index, penscale = sqrt, model = LogReg()) * 0.5^(0:5)
lasfit <- grplasso(lasX, y=Ytrain, index= index, lambda = lambdalas, model= LogReg(), penscale=sqrt, control = grpl.control(update.hess = "lambda", trace=0))
Coefs <- lasfit$coefficients
dim(Coefs)
CoSUM <- NULL
for(i in 1:6){
CoSUM[i]<-sum(Coefs[2:(dimx*dimy*dimz*numPC + 1),i])
}
CoCo <- NULL
VarNUM <- NULL
for(j in 1:6){
g <- NULL
for(i in 1:(numPC*dimx*dimy*dimz + 1)){
if(Coefs[i,j] ==0){
g[i] <- 0}
else{g[i] <- 1}
}
CoCo[[j]]<-g
}
SumCoCo <- NULL
for(i in 1:6){
SumCoCo[i] <- sum(CoCo[[i]])
}
#View number of functional variables selected by grplasso
SumCoCo
#Find Lihat and Pihat and classify yhat
ModelCoef <- Coefs[,2]
Alpha <- ModelCoef[1]
Betastar <- ModelCoef[2:(numPC*dimx*dimy*dimz +1)]
Betastar <- matrix(Betastar)  #coerce to a column vector
M<-NULL
for(i in 1:(dimx*dimy*dimz)){
b <- PCvalid[[i]]%*%Betastar[(numPC*(i-1)+1):(numPC*i)]
M[[i]] <- b
}
APsiB <- matrix(0, n1, 1)
for(i in 1:(dimx*dimy*dimz)){
APsiB <- APsiB + M[[i]]
}
#Find Lhat
Lhatvalid <- Alpha + APsiB
#Compute fitted probabilities Pi(i) and classify
Pstar <- NULL
YstarV <- NULL
for (i in 1:n1){
Pstar[[i]] <- (exp(Lhatvalid[i]))/(exp(Lhatvalid[i])+1)
if(Pstar[i]>0.5){
YstarV[[i]] <- 1
}
else{YstarV[[i]]<- 0}
}
#Training Set as Validation
lasX<-NULL
lasX<-cbind(lasX,Int)
for(i in 1:(dimx*dimy*dimz)){
lasX <- cbind(lasX,PCvalid[[i]])
}
index <- NULL
index <- c(index,NA)
for(i in 1:(dimx*dimy*dimz)){
b <-rep(i,numPC)
index <- c(index,b)
}
#Group Lasso
lambdalas <- lambdamax(lasX, y=Yvalid, index=index, penscale = sqrt, model = LogReg()) * 0.5^(0:5)
lasfit <- grplasso(lasX, y=Yvalid, index= index, lambda = lambdalas, model= LogReg(), penscale=sqrt, control = grpl.control(update.hess = "lambda", trace=0))
CoefsV <- lasfit$coefficients
dim(CoefsV)
CoSUM <- NULL
for(i in 1:6){
CoSUM[i]<-sum(CoefsV[2:(dimx*dimy*dimz*numPC + 1),i])
}
VCoCo <- NULL
VarNUM <- NULL
for(j in 1:6){
g <- NULL
for(i in 1:(numPC*dimx*dimy*dimz + 1)){
if(CoefsV[i,j] ==0){
g[i] <- 0}
else{g[i] <- 1}
}
VCoCo[[j]]<-g
}
SumVCoCo <- NULL
for(i in 1:6){
SumVCoCo[i] <- sum(VCoCo[[i]])
}
#View number of functional variables selected by grplasso
SumVCoCo
#Find Lihat and Pihat and classify yhat
ModelCoefV <- CoefsV[,2]
AlphaV <- ModelCoefV[1]
BetastarV <- ModelCoefV[2:(numPC*dimx*dimy*dimz +1)]
BetastarV <- matrix(BetastarV)  #coerce to a column vector
M<-NULL
for(i in 1:(dimx*dimy*dimz)){
b <- PCtrain[[i]]%*%BetastarV[(numPC*(i-1)+1):(numPC*i)]
M[[i]] <- b
}
APsiBV <- matrix(0, n1, 1)
for(i in 1:(dimx*dimy*dimz)){
APsiBV <- APsiBV + M[[i]]
}
#Find Lhat
Lhattrain <- AlphaV + APsiBV
#Find Pi(i)'s
PstarT <- NULL
YstarT <- NULL
for (i in 1:n1){
PstarT[[i]] <- (exp(Lhattrain[i]))/(exp(Lhattrain[i])+1)
if(PstarT[i]>0.5){
YstarT[[i]] <- 1
}
else{YstarT[[i]]<- 0}
}