Nonparametric Methods for Classification and Related Feature Selection
Procedures
by
Shuxin Yin
A dissertation submitted to the Graduate Faculty of
Auburn University
in partial fulfillment of the
requirements for the Degree of
Doctor of Philosophy
Auburn, Alabama
August 6, 2010
Keywords: Gene expression, Principal component, Misclassification, Rank-based
Copyright 2011 by Shuxin Yin
Approved by
Asheber Abebe, Chair, Associate Professor of Mathematics and Statistics
Peng Zeng, Associate Professor of Mathematics and Statistics
Ming Liao, Professor of Mathematics and Statistics
Abstract
One important application of gene expression microarray data is the classification of
samples into categories, such as tumor types. Gene selection procedures become
crucial since gene expression data from DNA microarrays are characterized by
thousands of genes measured on only a few subjects. Not all of these genes are
thought to determine a specific genetic trait. In this dissertation, I develop a novel
nonparametric procedure for selecting such genes. This rank-based forward selection
procedure rewards genes for their contribution towards determining the trait but
penalizes them for their similarity to genes that are already selected. I show that my
method gives lower misclassification error rates than dimension reduction methods
such as principal component analysis and partial least square analysis. I also explore
further properties of the Wilcoxon-Mann-Whitney (WMW) statistic and propose a
new WMW-based classifier that reduces the misclassification error rate. Real data
analyses and Monte Carlo simulations demonstrate the superiority of the proposed
methods to classical methods in several situations.
Acknowledgments
My greatest thanks go to Dr. Asheber Abebe for his patience and guidance
throughout this endeavor. Dr. Abebe devoted considerable time and energy to
furthering my studies in Statistics. I also wish to express my appreciation for my
parents and other family members, Man Peng and Shuyu Yin, who have given me
their love and constant support throughout my life. I would also like to thank
Dr. Peng Zeng and Dr. Ming Liao for serving on my committee and for their
contributions to this dissertation.
Table of Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Nonparametric Methods . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Dimension Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Variable Screening Procedures . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Dimension Reduction Methods . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Principal Component Analysis . . . . . . . . . . . . . . . . . . 9
2.2.2 Partial Least Square Analysis . . . . . . . . . . . . . . . . . . 10
2.3 Classification Methods . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.1 Non-Robust Classification . . . . . . . . . . . . . . . . . . . . 11
2.3.2 Rank Based Classification using Projections . . . . . . . . . . 13
2.3.3 Maximum Depth Classifiers . . . . . . . . . . . . . . . . . . . 17
3 Applications of Wilcoxon-Mann-Whitney Statistic (WMW) . . . . . . . . 19
3.1 Feature Annealed Independence Rules (Fan and Fan, 2008) . . . . . 19
3.2 WMW-Based Feature Annealed Classifier . . . . . . . . . . . . . . . . 21
3.2.1 Variable Selection Based on WMW Statistic . . . . . . . . . . 22
3.2.2 Projection-free WMW-based Classifier . . . . . . . . . . . . . 24
3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3.1 Lung Cancer . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3.2 Prostate Cancer . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.3 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 Smoothed Projection-Free WMW-Based Classifier . . . . . . . . . . . 29
3.4.1 Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.2 Monte Carlo Simulation . . . . . . . . . . . . . . . . . . . . . 31
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4 WMW Forward Variable Selection Method . . . . . . . . . . . . . . . . 35
4.1 Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3.1 Caribbean Food Data . . . . . . . . . . . . . . . . . . . . . . . 39
4.3.2 Colon Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3.3 Leukemia Data . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3.4 Monte Carlo Study . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . 47
List of Figures
3.1 Scree Plot of WFAC for Lung Cancer Data . . . . . . . . . . . . . . . . 26
3.2 Scree Plot of WFAC for Prostate Cancer Data . . . . . . . . . . . . . . . 28
3.3 Misclassification Error Rate . . . . . . . . . . . . . . . . . . . . . . . 30
3.4 Misclassification Error Rate (Smoothed with Equal Size) . . . . . . . . 33
3.5 Misclassification Error Rate (Smoothed with Unequal Size) . . . . . . . 34
4.1 Misclassification error rates (Black=QDA, Gray=MaxD, White=GGT) . . 46
List of Tables
3.1 Lung Cancer data. The minimum number of incorrectly classified samples
out of 149 testing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Prostate Cancer data. The minimum number of incorrectly classified
samples out of 102 samples using leave-one-out cross validation . . . . . . 27
4.1 Caribbean food data. Misclassification error rates using leave-one-out
cross validation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 Colon data. The number of incorrectly classified samples out of 62 samples
using leave-one-out cross validation. Screening for PCA and PLS uses
WMW and t statistics. The numbers in parentheses are the number of
genes kept after the screening. . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3 Leukemia training data. The number of incorrectly classified samples out
of 38 training samples using leave-one-out cross validation. Screening for
PCA and PLS uses WMW and t statistics. The numbers in parentheses
are the number of genes kept after the screening. . . . . . . . . . . . . . 42
4.4 Leukemia training and testing data. The number of incorrectly
classified samples out of 24 testing samples based on training samples.
Screening for PCA and PLS uses WMW and t statistics. The numbers in
parentheses are the number of genes kept after the screening. . . . . . . 43
Chapter 1
Introduction
1.1 Nonparametric Methods
Nonparametric methods were originally introduced in the mid-1930s. Rank
correlation without the normality assumption was discussed in Hotelling and Pabst
(1936), which is considered the true beginning of the topic of nonparametric
statistics. Friedman (1940) developed the Friedman test, a nonparametric statistical
test to detect differences in treatments across multiple test attempts in the complete
block design. Durbin (1951) proposed a nonparametric test for the incomplete block
design that reduces to the Friedman test in the case of a complete block design.
Benard and van Elteren (1953) generalized the Durbin test to the case in which
several observations are taken on some experimental units. During the same period,
Wilcoxon (1945) proposed the signed-rank statistic, named the Wilcoxon statistic,
to test the significance of the location difference between two samples. Later, Mann
and Whitney (1947) introduced the Mann-Whitney test statistic, which is equivalent
to the Wilcoxon rank test statistic. The Wilcoxon statistic played a central role in
many nonparametric approaches in the 1950s and 1960s.
Nonparametric methods have emerged as the preferred methodology in many
scientific areas due to their outstanding advantages:

• Nonparametric procedures require fewer assumptions than traditional methods,
so they can be used more widely than the corresponding parametric methods.
In particular, nonparametric procedures are applicable, and more efficient, in
many situations where the normality assumption is violated.
• Nonparametric methods are distribution-free: they do not rely on the
assumption that the data are drawn from a certain probability distribution,
so they can be used in many complicated situations where the distribution
theory is not attainable.

• Nonparametric methods are resistant to outliers. When the data contain
outliers or have longer tails than the normal distribution, some traditional
statistical procedures are inefficient, even though they perform well when the
errors in the model follow a normal distribution.
• Another advantage of nonparametric methods is simplicity. In some cases,
even when the use of parametric methods is justified, nonparametric methods
may be easier to use. Due both to this simplicity and to their greater
robustness, nonparametric methods are preferred by some statisticians.
In recent years, nonparametric analysis has gained popularity in the analysis of
linear models (Sievers and Abebe, 2004), nonlinear models (Abebe and McKean,
2007), classification (Nudurupati and Abebe, 2009; Montanari, 2004), and
generalized estimating equations (Abebe et al., 2011), because it leaves less room
for improper use and misinterpretation.
1.2 Classification
Some of the basic ideas and history of classification are discussed in the following.
Suppose we have two populations; the main goal of classification is to determine
the membership of a new observation based on a training data set. A discriminant
function is needed to provide the assignment criterion. It generally projects the
multidimensional real space onto the one-dimensional real line so that a clear cutoff
value of the discriminant function can be applied to determine which class the new
observation most probably belongs to. Fisher (1936) gave the linear discriminant
classifier. He found the optimal projection direction by maximizing the two-sample
t-statistic and allocated new observations based on the Euclidean distance between
the new observations and the population centers. If the covariance matrices of the
populations are not equal, the quadratic discriminant classifier is preferred. Bickel
and Levina (2004) proposed the independence classifier by setting the off-diagonal
entries of the common covariance matrix to zero.
However, the methods above are adversely inefficient in many circumstances
where the normality assumption is not appropriate, because of their sensitivity to
skewness and outliers. Their limitations call for robust rank-based classifiers, which
are closely related to the idea of transvariation probability given in Gini (1916).
Transvariation probability is originally defined for the univariate case and can be
extended to the multivariate case via a suitable projection pursuit. Montanari
(2004) proposed the transvariation-distance classifier, which allocates a new
observation according to its Euclidean distance to the population centers in the
projected space. He found the optimal projection direction by maximizing the
two-sample Wilcoxon-Mann-Whitney (WMW) statistic (Lehmann, 2006). He also
proposed the point-group classifier, which determines, based on the same projection
pursuit, which population the new observation most likely belongs to. The results
of the point-group classifier, however, can be biased when the two sample sizes are
very different. An improved allocation method was proposed by Nudurupati and
Abebe (2009): they place the new observation in each of the two samples separately
to smooth the data depth. One can also use depth functions to measure group
separation in order to classify a new observation, as shown in Liu et al. (1999). A
few popular depth functions are Mahalanobis depth (Mahalanobis, 1936; Liu and
Singh, 1993), halfspace depth (Tukey, 1974), simplicial depth (Liu, 1990), majority
depth (Singh, 1991), projection depth (Donoho, 1982), and spatial or L1 depth
(Vardi and Zhang, 2000).
1.3 Dimension Reduction
Gene expression data, however, are usually ultrahigh dimensional: the sample size
N is far smaller than the data dimension p, which makes some classifiers
inapplicable. High dimension can easily cause overflow in the calculation of the
inverse matrices required by some classifiers, such as those involving a projection
pursuit. Typically, the computational workload can increase dramatically by adding
even one more gene if a projection pursuit is needed. Besides, it is well known that
only a few genes carry the useful information that determines a specific genetic
trait, such as susceptibility to cancer, while most genes carry nothing but noise.
Taking all the genes into account in the classification process, instead of only the
most informative ones, does not provide better accuracy but results in widespread
inefficiency. Usually, a smaller set of genes is selected, based on the amount of
information they carry in terms of group separation, to be considered the most
important genes in the classification process. Basically, there are two ways to
reduce the dimension of the data:

• Select a subset of the original variables (genes) based on their power of class
determination;

• Create new variables by combining the information of all the variables (genes)
without losing much information from the original variables.
Many statisticians prefer to first select a smaller set of variables by a variable
screening method and then create optimal linear combinations of the selected
variables to carry out the classification, while others perform the classification
directly after the variable screening.
Dudoit et al. (2002) performed gene screening based on the ratio of between-group
to within-group sums of squares. Many statisticians (Fan and Fan, 2008; Nguyen
and Rocke, 2002; Ding and Gentleman, 2005) applied the two-sample t-statistic,
which measures the distance between two populations and can be used as a
criterion to preliminarily select the most important genes, while others (Liao et al.,
2007) selected variables based on the Wilcoxon-Mann-Whitney statistic, which is
also a good measure of group separation. Usually the variable screening method
using the WMW statistic is only slightly less efficient than the one using the
t-statistic when the underlying populations are normal, and it can be mildly or
wildly more efficient than its competitors when the underlying populations are not
normal.
Because some classifiers are sensitive to the dimension, the dimension needs to be
reduced further even after the initial variable screening. Based on the genes selected
by the variable screening procedures, dimension reduction methods such as principal
component analysis (PCA) and partial least square analysis (PLS) can be introduced
to reduce the dimension by performing a linear mapping of the data to a lower
dimensional space. In PCA (Massey, 1965; Jolliffe, 1986), a small set of orthogonal
linear combinations of the original predictor variables is found by maximizing the
variance of these linear combinations. In practice, the correlation matrix of the data
is constructed and its eigenvectors are computed. The eigenvectors that correspond
to the largest eigenvalues can then be used to reconstruct a large fraction of the
variance of the original data. The first few eigenvectors are selected, can often be
considered optimal linear combinations, and are used in classification. Thus the
original space is reduced to the lower dimensional space spanned by a few
eigenvectors without much loss of the information carried by the original variables.
In PCA, however, the correlation between the predictor variables and the response
variable specifying the class of the observations is not considered, which may be
inefficient; an efficient method must not treat the predictors separately from the
response. Nguyen and Rocke (2002) proposed an approach that obtains the optimal
combinations by maximizing the covariance between the linear combinations and
the response vector, which is referred to as partial least square analysis. PLS
surpasses PCA by taking the relation to the response variable into account. A
numerical algorithm to obtain the components is also included in that paper.
1.4 Organization
In Chapter 2, I discuss seven different classification methods as well as several
dimension reduction methods and gene screening procedures. In Chapter 3, I prove
that the WMW statistic can pick up all the important variables with probability
tending to 1, and I compare the performance of the WMW statistic for variable
screening and classification with the procedure given in Fan and Fan (2008) using
two real data sets and a large simulation study. In addition, a smoother is
recommended when the two sample sizes are very different. In Chapter 4, I propose
a new forward variable selection method and demonstrate its superiority using real
data analyses and simulation.
1.5 Notations
Here is some notation used throughout my dissertation:

• Consider two populations Π_X and Π_Y with underlying distributions F and G
with common support R^p.

• μ_x and μ_y are the mean vectors of Π_X and Π_Y.

• Σ_x and Σ_y are the covariance matrices of Π_X and Π_Y.

• X and Y are two samples from Π_X and Π_Y with sample sizes n_x and n_y,
respectively.

• X and Y are the predictor matrices of the two samples, and R is the response
vector (indicator of tumor versus normal tissue).
Chapter 2
Background
During the past decade and a half, classification and clustering methods have
gained popularity for cancer classification based on gene expression profiles
obtained via DNA microarray technology. But the dimension of microarray data is
usually ultrahigh, which makes some classifiers inapplicable. Moreover, we believe
that there is a subset of genes whose expression suffices for accurately predicting
the response. Dimension reduction allows us to replace a very large number of
predictors with a small number of linear combinations and to obtain better
predictions based on those optimal linear combinations. For the sake of simplicity,
before applying any dimension reduction approach, some variable screening methods
are applied to get rid of irrelevant variables whose variation is smaller than the
noise level.
2.1 Variable Screening Procedures
The two-sample t-statistic can be considered a measure of group separation since it
evaluates the difference in means between two groups. A larger t-statistic value
implies better group separation. Thus we first rank all the variables based on their
two-sample t-statistics. The two-sample t-statistic for variable k is defined as

T_k = \frac{\bar{x}_k - \bar{y}_k}{\sqrt{s_{x_k}^2/n_x + s_{y_k}^2/n_y}}, \qquad k = 1, \ldots, p \qquad (2.1)

where x̄_k and ȳ_k are the means of the two samples for variable k, and s²_{x_k}
and s²_{y_k} are the variances of the two samples for variable k. I select the
variables with the larger values of the t-statistic as the most informative ones.
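As an illustrative sketch (not code from the dissertation), the screening statistic in (2.1) can be computed for all p variables at once; the function name and the use of NumPy are my own choices:

```python
import numpy as np

def t_statistics(X, Y):
    """Two-sample t-statistic (2.1) for each of the p variables.

    X is the n_x-by-p matrix of the first sample and Y the n_y-by-p
    matrix of the second; returns a length-p array of T_k values.
    """
    nx, ny = X.shape[0], Y.shape[0]
    num = X.mean(axis=0) - Y.mean(axis=0)                        # mean difference per variable
    den = np.sqrt(X.var(axis=0, ddof=1) / nx + Y.var(axis=0, ddof=1) / ny)
    return num / den
```

Variables would then be ranked by these values and the top-ranked ones retained for classification.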
However, the t-statistic is sensitive to outliers and skewness, so nonparametric
alternatives are recommended for measuring the difference between two groups
when the normality assumption is violated. The most commonly used is the WMW
statistic, which measures the difference between two groups robustly.
Rank the variables based on the two-sample WMW statistic

W_k = 1 - \frac{2}{n_x n_y} \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \psi\{(x_{ki} - y_{kj})\, m_{xy}^k\} \qquad (2.2)

where m_{xy}^k is the median of {x_{ki} − y_{kj}} for k = 1, …, p. The function ψ is
defined as

\psi(x) = \begin{cases} 1 & \text{if } x < 0 \\ 0 & \text{if } x \geq 0 \end{cases} \qquad (2.3)

The most informative variables are the ones with the larger WMW statistic, which
indicates less overlap of the areas under the density curves of the two populations.
More details are discussed in Chapter 3.
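A minimal sketch of the statistic in (2.2) for a single variable, assuming the two groups are given as one-dimensional arrays (the function name is mine, not the dissertation's):

```python
import numpy as np

def wmw_statistic(x, y):
    """Median-centered WMW screening statistic (2.2) for one variable.

    x, y are 1-D arrays for the two groups. Returns W in [-1, 1];
    values near 1 indicate little overlap between the group densities.
    """
    d = np.subtract.outer(x, y)       # all pairwise differences x_i - y_j
    m = np.median(d)                  # m_xy, the median pairwise difference
    psi = (d * m < 0)                 # psi(t) = 1 if t < 0, else 0
    return 1.0 - 2.0 * psi.mean()
```

For two well-separated samples every pairwise difference shares the sign of the median difference, so the double sum vanishes and W equals 1.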
2.2 Dimension Reduction Methods
After the initial variable screening procedure, most noninformative variables are
eliminated and only a few important ones are left. But the dimension is still too
high for some classifiers, especially those requiring a projection pursuit. I therefore
apply dimension reduction approaches to reduce the dimension further. The
purpose of dimension reduction is to create a smaller set of linear combinations of
the original variables without losing too much of the information they carry. This is
achieved by optimizing a defined objective criterion. PCA and PLS are two
well-known dimension reduction methods.
2.2.1 Principal Component Analysis
In PCA (Massey, 1965; Jolliffe, 1986), orthogonal linear combinations are
constructed to sequentially maximize the variance of linear combinations of the
predictor variables:

c_k = \arg\max_{c'c = 1} \mathrm{Var}(Pc), \quad \text{where } P = X \cup Y

subject to the orthogonality constraint

c' S c_j = 0 \text{ for all } 1 \le j < k, \quad \text{where } S = P'P.
In fact, let n = n_x + n_y and let V be the n × p matrix formed by stacking X and
Y; that is,

V = \begin{pmatrix} x_1 \\ \vdots \\ x_{n_x} \\ y_1 \\ \vdots \\ y_{n_y} \end{pmatrix},

where x_i and y_j are the rows of X and Y, respectively. Let λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_r
be the eigenvalues of V'V and α_1, …, α_r be the corresponding eigenvectors, where
r is the rank of V'V. The ith principal component is then Vα_i, which is a linear
combination of the original columns of V. The first principal component accounts
for the direction of maximum variability in the data, and each succeeding
component accounts for an increasingly smaller amount of variability. Several
optimal linear combinations can be obtained by using the eigenvectors
corresponding to the largest eigenvalues.
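The eigendecomposition step above can be sketched as follows; this is an illustrative implementation of my own, assuming the columns of V have already been centered:

```python
import numpy as np

def principal_components(V, k):
    """First k principal components of the stacked data matrix V (n x p).

    Returns (scores, directions): the n-by-k component matrix V @ A and
    the p-by-k matrix A whose columns are eigenvectors of V'V ordered
    by decreasing eigenvalue.
    """
    # V'V is symmetric, so eigh returns real eigenvalues in ascending order
    evals, evecs = np.linalg.eigh(V.T @ V)
    order = np.argsort(evals)[::-1][:k]   # keep the k largest eigenvalues
    A = evecs[:, order]
    return V @ A, A
```

On data whose variability is concentrated along one coordinate, the leading direction essentially recovers that coordinate axis.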
PCA, however, only measures the variability in the predictor data V without any
consideration of the contribution of the variables towards the classification
problem. Ignoring the relation to the response variable may leave the components
selected by PCA short of information in terms of group separation and
consequently result in inaccurate classification. A dimension reduction approach
that considers the correlation between the predictors and the response variable is
needed.
2.2.2 Partial Least Square Analysis
Nguyen and Rocke (2002) proposed the method of PLS, which sequentially
maximizes the covariance between the response variable and a linear combination
of the predictor variables. In our case, the response variable R is made up of n_x
zeros and n_y ones: R = (0, …, 0, 1, …, 1)'. Thus, in place of the eigenvectors used
in PCA, PLS uses

\alpha_i = \arg\max_{\|\alpha\| = 1} \mathrm{cov}(V\alpha, R)

subject to the constraints (Vα)'(Vα_j) = 0 for j = 1, …, i − 1. This objective
criterion for dimension reduction may be more appropriate for prediction since the
relation between the predictors and the response variable is taken into account. A
basic algorithm implementing PLS is given in Nguyen and Rocke (2002). For
details, see also Helland (1988), Garthwaite (1994), Höskuldsson (1988), and
Martens and Naes (1989).
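One common way to extract such directions sequentially is NIPALS-style deflation; the sketch below is my own illustration under the assumption that V and R are centered, and is not the specific algorithm of Nguyen and Rocke (2002):

```python
import numpy as np

def pls_components(V, R, k):
    """Sequential PLS component extraction (NIPALS-style sketch).

    V : n x p centered predictor matrix; R : length-n centered response.
    Each direction w is proportional to V'R, which maximizes cov(Vw, R)
    over unit vectors; deflating V makes successive scores orthogonal.
    Returns the n x k matrix of component scores.
    """
    V = V.astype(float).copy()
    R = R.astype(float).copy()
    scores = []
    for _ in range(k):
        w = V.T @ R                       # direction of maximal covariance with R
        w /= np.linalg.norm(w)
        t = V @ w                         # component scores
        scores.append(t)
        p = V.T @ t / (t @ t)             # loadings
        V -= np.outer(t, p)               # deflate predictors
        R = R - t * (t @ R) / (t @ t)     # deflate response
    return np.column_stack(scores)
```

The deflation step guarantees the orthogonality constraint on successive score vectors.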
2.3 Classification Methods
In general, classification solves the problem of assigning a new observation
z ∈ Π_X ∪ Π_Y to either Π_X or Π_Y. We need to define a discriminant function
that projects the multidimensional data space onto the one-dimensional real line so
that we can decide where the new observation belongs:

Definition 2.1. A discriminant function D(z; F, G) : R^p → R is such that z is
classified in Π_X if D(z; F, G) > 0 and in Π_Y otherwise.

The discriminant function gives a combination of the predictor variables whose
values are as close as possible within populations and as far apart as possible
between populations. Several popular classifiers are discussed in the following.
2.3.1 Non-Robust Classification

Linear Discriminant Analysis (LDA)

Fisher (1936) looked at a linear combination of the p covariates that maximizes the
separation between the two populations Π_X and Π_Y. This gives rise to the linear
discriminant function

L(z; F, G) \equiv (\mu_x - \mu_y)' \Sigma^{-1} \left( z - \tfrac{1}{2}(\mu_x + \mu_y) \right),

where μ_x and μ_y are as defined in Section 1.5, and Σ is the pooled covariance
matrix of F and G. A new observation z is classified in Π_X if L(z; F, G) > 0 and
in Π_Y otherwise. Such classification is referred to as Linear Discriminant Analysis
(LDA).
Given samples X and Y from Π_X and Π_Y, respectively, the discriminant function
of LDA is estimated by L(z; F_{n_x}, G_{n_y}), where F_{n_x} is the empirical
distribution function of X, obtained by putting mass 1/n_x on each x sample point,
and G_{n_y} is the empirical distribution function of Y. Henceforth, it will be
assumed that estimates of discriminant functions are obtained by replacing the
distribution functions by the corresponding empirical distribution functions.
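A sample version of the LDA rule can be sketched as below (an illustration of my own, using the pooled covariance estimate; the function name is hypothetical):

```python
import numpy as np

def lda_discriminant(X, Y, z):
    """Estimated linear discriminant L(z; F, G).

    X, Y : (n_x, p) and (n_y, p) training samples; returns a positive
    value when z is allocated to the X-population.
    """
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    nx, ny = len(X), len(Y)
    # pooled covariance estimate of the common Sigma
    S = ((nx - 1) * np.cov(X, rowvar=False) +
         (ny - 1) * np.cov(Y, rowvar=False)) / (nx + ny - 2)
    w = np.linalg.solve(S, mx - my)
    return w @ (z - 0.5 * (mx + my))
```

A point near the X-sample mean yields a positive value and is classified in Π_X.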
Quadratic Discriminant Analysis (QDA)

In LDA, we assume the two populations share the same covariance matrix. If this
assumption does not hold, another commonly used classification method, Quadratic
Discriminant Analysis (QDA), can be applied; it is based on the classical
multivariate normal model for each class. Assume P_X = P_Y, that is, equiprobable
priors for the two populations. The quadratic discriminant function is defined as

Q(z; F, G) = Q(z; F) - Q(z; G),

where

Q(z; F) = -\tfrac{1}{2}(z - \mu_x)' \Sigma_x^{-1} (z - \mu_x) - \tfrac{1}{2}\log|\Sigma_x|,
Q(z; G) = -\tfrac{1}{2}(z - \mu_y)' \Sigma_y^{-1} (z - \mu_y) - \tfrac{1}{2}\log|\Sigma_y|.

Here Σ_x and Σ_y are as defined in Section 1.5. A new observation z is classified in
Π_X if Q(z; F, G) > 0 and in Π_Y otherwise.
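The sample version, plugging each group's own mean and covariance into Q, can be sketched as follows (an illustration of my own, not the dissertation's code):

```python
import numpy as np

def qda_discriminant(X, Y, z):
    """Estimated quadratic discriminant Q(z; F, G) with equal priors.

    Each group contributes its own Gaussian log-density kernel
    -(1/2)(z - mu)' Sigma^{-1} (z - mu) - (1/2) log|Sigma|.
    """
    def q(S):
        mu = S.mean(axis=0)
        Sigma = np.cov(S, rowvar=False)
        d = z - mu
        sign, logdet = np.linalg.slogdet(Sigma)   # stable log-determinant
        return -0.5 * d @ np.linalg.solve(Sigma, d) - 0.5 * logdet
    return q(X) - q(Y)
```

When the two group covariances happen to be equal, this rule agrees in sign with LDA on the same data.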
Independence Classifier

Both LDA and QDA require a projection pursuit, which is not an option for high
dimensional data. In particular, in order to evaluate variable selection methods, we
usually need to add more and more variables into the classification model to find
the optimal number of variables for prediction. The independence classifier can be
applied for this purpose:

I(z; F, G) \equiv (\mu_x - \mu_y)' D^{-1} \left( z - \tfrac{1}{2}(\mu_x + \mu_y) \right),

where μ_x and μ_y are as defined in Section 1.5, and D = diag(Σ). A new
observation z is classified in Π_X if I(z; F, G) > 0 and in Π_Y otherwise. As a
matter of fact, the linear discriminant function reduces to the independence
discriminant function when the off-diagonal entries of the common covariance
matrix are set to zero. This classifier is discussed in Bickel and Levina (2004), who
also show that it is superior to LDA when the number of variables is large.
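Because D is diagonal, the sample rule needs only elementwise division rather than a matrix inverse, which is the point of the classifier for large p. A sketch of my own:

```python
import numpy as np

def independence_discriminant(X, Y, z):
    """Estimated independence discriminant I(z; F, G): LDA with the
    pooled covariance replaced by its diagonal of pooled variances."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    nx, ny = len(X), len(Y)
    # diagonal of the pooled covariance: pooled per-variable variances
    d = ((nx - 1) * X.var(axis=0, ddof=1) +
         (ny - 1) * Y.var(axis=0, ddof=1)) / (nx + ny - 2)
    return ((mx - my) / d) @ (z - 0.5 * (mx + my))
```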
2.3.2 Rank Based Classification using Projections

LDA amounts to finding u ∈ R^p, say û_0, that maximizes the square of the
two-sample t-statistic between the two projected samples
u'X = {u'x_1, …, u'x_{n_x}} and u'Y = {u'y_1, …, u'y_{n_y}}; that is,

\hat{u}_0 = \arg\max_{\|u\|=1} \frac{[u'(\bar{x} - \bar{y})]^2}{u' S_p u \left( \frac{n_x + n_y}{n_x n_y} \right)},

where S_p is the pooled covariance matrix, and x̄ and ȳ are the means of the two
samples. The data are then reduced to one dimension by projecting them in the
direction given by û_0, and one classifies a new observation z into Π_X if
|z_0 − x̄_0| < |z_0 − ȳ_0|, where x_{0i} = û_0'x_i, y_{0j} = û_0'y_j, and
z_0 = û_0'z, for i = 1, …, n_x and j = 1, …, n_y. Otherwise, one classifies z into
Π_Y. Here x̄_0 = n_x^{-1} Σ_{i=1}^{n_x} x_{0i} and
ȳ_0 = n_y^{-1} Σ_{j=1}^{n_y} y_{0j}.
When the underlying distributions are spherically symmetric, the direction of
maximum separation is along the line that connects the centers of the
distributions, and LDA is equivalent to classifying z based on its Euclidean
distance from the means. In the case of the normal distribution, the projection
direction can be obtained easily as
u_0 = S_p^{-1/2}(\bar{x} - \bar{y}) / \|S_p^{-1/2}(\bar{x} - \bar{y})\|. In other
situations, the projection direction is not obvious and has to be determined
numerically. This search for "interesting" low dimensional projections of high
dimensional data is known as projection pursuit (Friedman and Tukey, 1974).
"Interestingness" is measured through a suitable function known as the projection
index. For LDA, this index is the two-sample t-statistic.

Montanari (2004) and Chen et al. (1994) used a two-sample WMW type statistic
as a projection index to measure group separation. They showed that their
projection pursuit method is not sensitive to deviations from the homoscedasticity
and normality assumptions. Their method is related to the idea of transvariation
probability given in Gini (1916):
Definition 2.2. For univariate distributions F and G, the transvariation probability
is defined as

\tau(F, G) = \int_{\mathbb{R}} \int_{\mathbb{R}} \psi\big( (x - y)(\theta(F) - \theta(G)) \big)\, dF(x)\, dG(y),

where θ(F) and θ(G) are the medians of F and G.

The transvariation probability measures the common area under the underlying
distribution curves of the two populations: the smaller the transvariation
probability, the better the group separation.
In practice, instead of using the theoretical underlying distributions of the two
populations, we can use the empirical distributions of X and Y to estimate τ:

\hat{\tau} = \frac{1}{n_x n_y} \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \psi\{(x_i - y_j)(m_x - m_y)\},

where m_x and m_y are the medians of the two samples.
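The empirical estimate τ̂ is simply the fraction of pairs whose difference has the opposite sign to the difference of sample medians; a one-line sketch of my own:

```python
import numpy as np

def transvariation_estimate(x, y):
    """Empirical transvariation probability tau-hat for two univariate
    samples x and y: fraction of pairs (x_i, y_j) with (x_i - y_j)
    opposite in sign to the median difference m_x - m_y."""
    d = np.subtract.outer(x, y)           # all pairwise differences
    sign = np.median(x) - np.median(y)    # m_x - m_y
    return float((d * sign < 0).mean())
```

Well-separated samples give τ̂ = 0, while heavily overlapping samples give values near 1/2.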
However, this transvariation probability is only defined on a univariate space, so for
multidimensional data we need to define a general transvariation probability. This
general transvariation probability can be defined on a vector u through a certain
projection pursuit. The direction of minimum overlap measured by the general
transvariation probability is given by

\hat{u}_1 = \arg\min_{\|u\|=1} \left\{ \sum \psi\{[u'x - u'y][m_x(u) - m_y(u)]\} \right\}, \qquad (2.4)

where the sum is over the set {x ∈ X, y ∈ Y}, and m_x(u) and m_y(u) are the
medians of the two projected samples u'X and u'Y. The function ψ is defined as
in (2.3).
Once the direction of maximum separation is found, the next step is to project
all the data (including the new sample point) onto that direction and allocate the
new point to one of the two populations. Three allocation schemes are discussed in
the following.
Transvariation-Distance (TD) Classifier

Montanari (2004) proposed classifying a new observation z in Π_X if

|\hat{u}_1' z - m_x(\hat{u}_1)| < |\hat{u}_1' z - m_y(\hat{u}_1)|

and in Π_Y otherwise, where m_x(û_1) and m_y(û_1) are the medians of the two
projected groups. Hereafter the classifier obtained by using this allocation method
will be referred to as the Transvariation-Distance (TD) classifier. As shown in
Nudurupati and Abebe (2009), this method can be adversely affected by skewness
and outliers.
Point-Group Transvariation (PGT) Classifier

Another allocation method suggested by Montanari (2004) is based on a
comparison of the ranking of the new observation z among X and among Y. This
utilizes the point-group transvariation. The observation z is classified in Π_X if
T(z; F, G) ≡ T(z; F) − T(z; G) > 0, where

T(z; F) = \frac{1}{n_x} \sum_{x \in X} \psi\{[\hat{u}_1' x - \hat{u}_1' z][m_x(\hat{u}_1) - \hat{u}_1' z]\}

and

T(z; G) = \frac{1}{n_y} \sum_{y \in Y} \psi\{[\hat{u}_1' y - \hat{u}_1' z][m_y(\hat{u}_1) - \hat{u}_1' z]\}.

This allocation scheme is robust against skewness and outliers. However, it does
not perform well for data with unequal sample sizes, because the vote of each
member of X is either 0 or 1/n_x whereas the vote of each member of Y is 0 or
1/n_y. This allocation scheme also has a problem of ties between T(·; F) and
T(·; G). The likelihood of ties is greatest in the case of equal sample sizes, which
happens to be the only situation where this scheme works efficiently. We will use
random tie breaking, where a coin is flipped to decide allocation in the case of a
tie. The classifier obtained by using this allocation scheme will be referred to as the
Point-Group Transvariation (PGT) classifier.
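Assuming the projection direction û_1 has already been found, the PGT allocation rule on the projected one-dimensional data can be sketched as follows (an illustration of my own; the function name is hypothetical):

```python
import numpy as np

def pgt_classify(x_proj, y_proj, z_proj, rng=None):
    """Point-group transvariation allocation on projected (1-D) data.

    x_proj, y_proj are the projected training samples u'X and u'Y, and
    z_proj the projected new point. Returns "X" when T(z;F) - T(z;G) > 0,
    with a coin flip on ties, as described in the text.
    """
    def T(sample, z):
        m = np.median(sample)
        # psi: count pairs with (sample_i - z)(m - z) < 0, averaged over the group
        return np.mean((sample - z) * (m - z) < 0)
    diff = T(x_proj, z_proj) - T(y_proj, z_proj)
    if diff == 0:                         # tie: flip a fair coin
        rng = np.random.default_rng() if rng is None else rng
        return "X" if rng.random() < 0.5 else "Y"
    return "X" if diff > 0 else "Y"
```

A projected point lying inside the X cloud picks up transvariation votes only from X and is therefore allocated to Π_X.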
Group-Group Transvariation (GGT) Classifier
An improved allocation method was proposed by Nudurupati and Abebe (2009) to
eliminate the problem caused by unequal sample sizes. Define two augmented samples
$\mathcal{X}^*$ and $\mathcal{Y}^*$ by including the new point $z$ in the two samples; that is, $\mathcal{X}^* = \mathcal{X} \cup \{z\}$ and
$\mathcal{Y}^* = \mathcal{Y} \cup \{z\}$. The point $z$ is then classified in $\Pi_X$ if $T^*(z;F,G) \equiv T_1^*(z;F,G) - T_2^*(z;F,G) > 0$, where
$$T_1^*(z;F,G) = \frac{1}{(1+n_x)n_y} \sum \ell\{[\hat{u}_1' x^* - \hat{u}_1' y][m_x(\hat{u}_1) - \hat{u}_1' z]\}$$
and
$$T_2^*(z;F,G) = \frac{1}{n_x(1+n_y)} \sum \ell\{[\hat{u}_1' x - \hat{u}_1' y^*][m_y(\hat{u}_1) - \hat{u}_1' z]\} .$$
The two sums are over the sets $\{x^* \in \mathcal{X}^*,\, y \in \mathcal{Y}\}$ and $\{x \in \mathcal{X},\, y^* \in \mathcal{Y}^*\}$, respectively.
The classifier obtained by using this allocation scheme will be referred to as the
Group-Group Transvariation (GGT) classifier. Note that we do not have the unequal
voting problem here: the vote of every observation is either 0 or approximately
$(n_x n_y)^{-1}$. In essence, here we are smoothing one sample using the empirical distribution
of the other before applying the PGT rule.
2.3.3 Maximum Depth Classifiers
In the univariate setting, statistical methods that use rank-based nonparametric
techniques do not depend on restrictive distributional assumptions and hence are
robust to deviations from these assumptions. For higher dimensions, statistical depth
functions give a multivariate version of ranks (Liu, 1992). Depth functions give a
measure of the "centrality" of a given multivariate sample point with respect to its
underlying distribution (Liu et al., 1999). In particular, a depth function assigns higher
values to points that are more central with respect to a data cloud. This naturally
gives a center-outward ranking of the sample points. A number of depth functions
are available in the literature. A few popular depth functions are Mahalanobis depth
(Mahalanobis, 1936; Liu and Singh, 1993), halfspace depth (Tukey, 1974), simplicial
depth (Liu, 1990), majority depth (Singh, 1991), projection depth (Donoho, 1982),
and spatial or $L_1$ depth (Vardi and Zhang, 2000).
In this paper, I use the maximum depth (MaxD) classification method (Ghosh
and Chaudhuri, 2005) based on the spatial ($L_1$) depth function defined as
$$S(x;F) = 1 - \left\| E_F\!\left[ \frac{x - X}{\|x - X\|} \right] \right\| , \qquad (2.5)$$
where $X \sim F$ and $\|\cdot\|$ is the Euclidean norm on $\mathbb{R}^p$.
The classifier MaxD uses the discriminant function
$$S^*(z;F,G) = S(z;F) - S(z;G)$$
and classifies $z$ in $\Pi_X$ if $S^*(z;F,G) > 0$. A major drawback of this classifier is
that it lacks affine invariance. Thus it is necessary to transform the data so that all
the variables are similarly scaled before using the spatial depth function. Vardi and
Zhang (2000) suggest making the spatial depth function affine invariant by taking
$\Sigma_x^{-1/2}(z - X)$ and $\Sigma_y^{-1/2}(z - Y)$ in place of $z - X$ and $z - Y$ before computing
$S(z;F)$ and $S(z;G)$, respectively, using equation (2.5). Note that one can use any
affine equivariant estimators of $\Sigma_x$ and $\Sigma_y$ when computing the discriminant function.
If the scatter estimator of Tyler (1987) is used, then the resulting maximum spatial
depth classifier resembles the classifier given by Crimin et al. (2007). An alternative
method of obtaining affine invariance is to scale the data along its principal component
directions (PCA-scaling) as given in Hugg et al. (2006).
An estimate of the MaxD discriminant function $S^*(z;F,G)$ is given by
$$S^*(z;F_{n_x},G_{n_y}) = \left\| \frac{1}{n_y} \sum_{j=1}^{n_y} \frac{z - y_j}{\|z - y_j\|} \right\| - \left\| \frac{1}{n_x} \sum_{i=1}^{n_x} \frac{z - x_i}{\|z - x_i\|} \right\| .$$
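The empirical rule follows directly from (2.5). A minimal Python version (the affine-invariance scaling discussed above is omitted; sample points coinciding with $z$ are skipped to avoid division by zero):

```python
import numpy as np

def spatial_depth(z, sample):
    """Spatial (L1) depth (2.5): one minus the norm of the average unit
    vector pointing from the sample points toward z."""
    diffs = z - sample                       # shape (n, p)
    norms = np.linalg.norm(diffs, axis=1)
    mask = norms > 0                         # skip points coinciding with z
    units = diffs[mask] / norms[mask, None]
    return 1.0 - np.linalg.norm(units.mean(axis=0))

def maxd_classify(z, X, Y):
    """MaxD rule: allocate z to the group in which it is deeper.
    Returns 0 for Pi_X and 1 for Pi_Y."""
    return 0 if spatial_depth(z, X) - spatial_depth(z, Y) > 0 else 1
```

A point at the geometric center of a symmetric cloud has depth close to 1 (the unit vectors cancel), while a point far outside has depth close to 0 (the unit vectors align).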
Chapter 3
Applications of Wilcoxon-Mann-Whitney Statistic (WMW)
3.1 Feature Annealed Independence Rules (Fan and Fan, 2008)
The two sample t-statistic can be applied as an initial variable screening index
because it measures each variable's contribution to group separation. Fan and Fan (2008)
proved that, theoretically, the t-statistic can pick up all the informative variables with
probability approaching 1.
They consider the $p$-dimensional classification problem between two populations
$\Pi_X$ and $\Pi_Y$. $\mathcal{X}$ and $\mathcal{Y}$ are two samples from $\Pi_X$ and $\Pi_Y$ with sample sizes $n_x$ and $n_y$
respectively. Write the $i$th observation in $\Pi_X$ as
$$x_i = \mu_x + \varepsilon_{xi}$$
and the $i$th observation in $\Pi_Y$ as
$$y_i = \mu_y + \varepsilon_{yi} ,$$
where $\varepsilon_{xi} = (\varepsilon_{xij})$ and $\varepsilon_{yi} = (\varepsilon_{yij})$ are iid with mean 0 and covariance matrices $\Sigma_x$ and
$\Sigma_y$ respectively. They assume that all the observations are independent across samples
and, in addition, that within each population the observations are identically distributed.
They also assume that the two classes have comparable sample sizes.
They first proved that, without variable selection, discrimination based on linear
projections onto almost all directions performs nearly the same as random guessing
under some assumptions. They then show that, using the t-statistic defined in Section 2.1,
all the informative variables can be selected, as stated in Theorem 3.1 below. In order to prove
their theorem, they need the following conditions.
Condition 1
- Assume that the vector $\alpha = \mu_x - \mu_y$ is sparse and, without loss of generality,
only the first $s$ entries are nonzero.
- Suppose that $\varepsilon_{xij}$ and $\varepsilon_{xij}^2 - 1$ satisfy Cramér's condition; that is, there exist
constants $\eta_1, \eta_2, M_1$, and $M_2$ such that $E|\varepsilon_{xij}|^m \le m!\, M_1^{m-2} \eta_1/2$ and
$E|\varepsilon_{xij}^2 - \sigma_{xj}^2|^m \le m!\, M_2^{m-2} \eta_2/2$ for all $m = 1, 2, \ldots$, where the $\sigma_{xj}^2$ are the
diagonal entries of $\Sigma_x$. The assumptions on $\varepsilon_{yij}$ and $\varepsilon_{yij}^2 - 1$ are the same as those on
$\varepsilon_{xij}$ and $\varepsilon_{xij}^2 - 1$, respectively.
- Assume that the diagonal elements of both $\Sigma_x$ and $\Sigma_y$ are bounded away from 0.
Under Condition 1, they have the following theorem.
Theorem 3.1. Let $s$ be a sequence such that $\log(p-s) = o(n^\gamma)$ and $\log s = o(n^{1/2-\gamma}\beta_n)$
for some $\beta_n \to \infty$ and $0 < \gamma < 1/3$, and suppose the nonzero entries of $\alpha$ do not decay too fast. Then for a threshold $t \asymp c\, n^{\gamma/2}$ with some constant $c > 0$,
$$P\left(\min_{j \le s} |T_j| \ge t \ \text{ and }\ \max_{j > s} |T_j| < t\right) \to 1 ,$$
where $n = n_x + n_y$.
Theorem 3.1 indicates that the two sample t-statistic can potentially pick all the
important variables as long as the rate of decay of the signal is not too fast and the
sample size is not too small.
In order to demonstrate the performance of the two sample t-statistic for variable
screening, they then apply the independence classifier mentioned in Section 2.3 and
calculate the misclassification error rate based on the most informative variables
selected by the t-statistic.
They assume both populations are Gaussian with a common identity covariance
matrix, that is, $\Sigma_x = \Sigma_y = I$. All the variables are ranked according to the
two sample t-statistic and the most informative $m$ variables are picked, assuming that
those $m$ variables are all the variables that matter for classification. The theoretical
misclassification error rate of this truncated independence classifier is given in the
following theorem.
Theorem 3.2. Consider $\Sigma_x = \Sigma_y = I$ and use the truncated independence classifier
$$\hat{\delta}_{m_n}(z) = (z^{m_n} - \hat{\mu}^{m_n})'(\hat{\mu}_1^{m_n} - \hat{\mu}_2^{m_n})$$
for a given sequence $m_n$. Suppose that $n \sum_{j=1}^{m_n} \alpha_j^2 / \sqrt{m_n} \to \infty$ as $m_n \to \infty$. Then the
classification error of $\hat{\delta}_{m_n}$ is
$$W = 1 - \Phi\!\left( \frac{(1+o_P(1)) \sum_{j=1}^{m_n} \alpha_j^2 + m_n(n_1 - n_2)/(n_1 n_2)}{2 \{ (1+o_P(1)) \sum_{j=1}^{m_n} \alpha_j^2 + n\, m_n/(n_1 n_2) \}^{1/2}} \right) ,$$
where $\Phi(\cdot)$ is the standard Gaussian distribution function. They call this truncated
classifier the feature annealed independence rule (FAIR). The precise value of $m$ can be
obtained by minimizing this theoretical misclassification error rate. In practice,
however, this minimization has no closed-form solution and can only be carried out
numerically. Fan and Fan (2008) used a simulation study and three real data analyses
to demonstrate their theoretical results and to show the superiority of their method over
the nearest shrunken centroid method (Tibshirani et al., 2002).
3.2 WMW-Based Feature Annealed Classifier
As defined in Section 2.1, the two sample Wilcoxon-Mann-Whitney statistic provides
more useful information than the two sample t-statistic in terms of group separation
in certain circumstances where the normality assumption does not hold.
Inspired by Fan and Fan (2008), I expect that the WMW statistic can also pick
up all the important variables. If so, then WMW could be used more widely than the
t-statistic, because most gene expression data present heavier tails than the normal
distribution (Salas-Gonzalez et al., 2009).
3.2.1 Variable Selection Based on WMW Statistic
Using a strategy similar to that of Fan and Fan (2008), I prove that, theoretically,
as the data dimension grows without bound, the Wilcoxon-Mann-Whitney statistic
can pick up all the informative variables with probability approaching 1. The result
is given in the following theorem.
Theorem 3.3. Assume that the vector $\alpha = \mu_x - \mu_y$ is sparse and, without loss of
generality, only the first $s$ entries are nonzero. Let $s$ be a sequence such that $\log(p-s) =
o(n^\gamma)$ and $\log s = o(n^{1/2-\gamma}\beta_n)$ for some $\beta_n \to \infty$ and $0 < \gamma < 1/3$. For $w \asymp c\, n^{\gamma/2}$ with
some constant $c > 0$, we have
$$P\left(\min_{j \le s} |W_j| \ge w \ \text{ and }\ \max_{j > s} |W_j| < w\right) \to 1 .$$
Proof. I divide the proof into two parts.
(a) Let us first look at the probability $P(\max_{j>s} |W_j| > w)$. Clearly,
$$P\left(\max_{j>s} |W_j| > w\right) \le \sum_{j=s+1}^{p} P(|W_j| \ge w) .$$
By Corollary 3.2 of Froda and van Eeden (2000), there exists a constant $a > 0$ such
that, for $M_0 < w < a\, n^{1/6}$ with $M_0 > 1$,
$$P(|W_j| \ge w) = (1 - \Phi(w))(1 + O(w^3/n^{1/2})) .$$
For the normal distribution, we have the tail probability inequality
$$1 - \Phi(w) \le \frac{1}{\sqrt{2\pi}} \frac{1}{w} e^{-w^2/2} .$$
Combining this with the symmetry of $W_j$, if we let $w \asymp c\, n^{\gamma/2}$, then we have
$$\sum_{j=s+1}^{p} P(|W_j| \ge w) \le (p-s)\, \frac{2}{\sqrt{2\pi}} \frac{1}{w} e^{-w^2/2} (1 + O(w^3/n^{1/2})) \to 0 ,$$
since $\log(p-s) = o(n^\gamma)$ with $0 < \gamma < 1/3$. Thus, we have
$$P\left(\max_{j>s} |W_j| > w\right) \to 0 .$$
(b) Next, we consider $P(\min_{j \le s} |W_j| \le w)$. Let $\theta_j = \alpha_j \sqrt{n_1 n_2 (n_1 + n_2 + 1)/12}$
and define $\widetilde{W}_j = W_j - \theta_j$.
Then, following the same lines as in (a), we have
$$\sum_{j \le s} P(|\widetilde{W}_j| \ge w) \le s\, \frac{2}{\sqrt{2\pi}} \frac{1}{w} e^{-w^2/2} (1 + O(w^3/n^{1/2})) \to 0 .$$
Let $\alpha_0 = \min_{j \le s} \theta_j$. Then it follows that
$$P\left(\min_{j \le s} |W_j| \le w\right) \le P\left(\max_{j \le s} |\widetilde{W}_j| \ge \min_{j \le s} |\theta_j| - w\right) \le P\left(\max_{j \le s} |\widetilde{W}_j| \ge \alpha_0 - w\right) .$$
If $w \asymp c\, n^{\gamma/2}$ and $\alpha_0 \ge n^{\gamma} \beta_n$ for some $\beta_n \to \infty$, then, similarly to part (a), we have
$$P\left(\min_{j \le s} |W_j| \le w\right) \to 0 .$$
Combining parts (a) and (b) completes the proof.
Compared with Fan and Fan (2008), Theorem 3.3 requires fewer assumptions. For
example, there are no assumptions on the random errors and no assumptions on the
covariance matrices of the two populations, which makes the WMW statistic more
efficient at identifying the minimal subset of variables that succinctly predicts the
categories of new observations.
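The screening step of Theorem 3.3 can be sketched as follows. This is a minimal illustration, assuming the standardized form $W_j = (U_j - n_x n_y/2)/\sqrt{n_x n_y (n_x + n_y + 1)/12}$ with $U_j$ the Mann-Whitney count for variable $j$; ties are handled crudely via strict inequality:

```python
import numpy as np

def wmw_screen(X, Y, w):
    """Keep variable j when |W_j| >= w, the threshold of Theorem 3.3.
    X is an n_x-by-p matrix, Y an n_y-by-p matrix."""
    nx, ny = len(X), len(Y)
    # Mann-Whitney count U_j = #{(i, k): x_ij > y_kj} for each variable j
    U = (X[:, None, :] > Y[None, :, :]).sum(axis=(0, 1))
    mean = nx * ny / 2.0
    sd = np.sqrt(nx * ny * (nx + ny + 1) / 12.0)
    W = (U - mean) / sd                  # standardized WMW statistics
    return np.flatnonzero(np.abs(W) >= w)
```

For gene expression data the broadcasted comparison builds an $n_x \times n_y \times p$ array, so for very large $p$ one would loop over variables instead.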
3.2.2 Projection-free WMW-based Classifier
Fan and Fan (2008) used the minimal misclassification error rate to explicitly evaluate
the performance of the two sample t-statistic. They named their procedure the Feature
Annealed Independence Rule (FAIR): allocate new observations using the independence
classifier based on the variables chosen by the t-statistic. However, the independence
classifier is strongly tied to the t-statistic, which is not appropriate when the normality
assumption does not hold. That is why I propose a new nonparametric classifier
based simply on the WMW statistic.
Even though the WMW statistic measures group separation very well, it cannot
be used as a classifier directly by itself. Here I borrow the idea of the Group-Group
Transvariation classifier (Nudurupati and Abebe, 2009) to translate the measure of
group separation into a discriminant function.
To classify a new observation $z$, define $\mathcal{X}^* = \mathcal{X} \cup \{z\}$ and $\mathcal{Y}^* = \mathcal{Y} \cup \{z\}$. Classify
$z \in \Pi_X$ if
$$\sum_{k=1}^{p} w_k(\mathcal{X}^*, \mathcal{Y}) > \sum_{k=1}^{p} w_k(\mathcal{X}, \mathcal{Y}^*) ,$$
where $w_k(\mathcal{X}^*, \mathcal{Y})$ is the WMW statistic of variable $k$ based on $\mathcal{X}^*$ and $\mathcal{Y}$, while
$w_k(\mathcal{X}, \mathcal{Y}^*)$ is the WMW statistic of variable $k$ based on $\mathcal{X}$ and $\mathcal{Y}^*$.
This rule classifies the new observation $z$ into $\Pi_X$ if a better group separation is
achieved by adding $z$ to $\Pi_X$ than by adding it to $\Pi_Y$.
The advantage of this classifier is that it is robust to deviations from the usual
assumptions. On the other hand, it is a projection-free classifier, so it can be used to
numerically find the minimal misclassification error rate by picking the proper number
of informative variables. The combination of WMW-statistic variable screening and the
WMW-based classifier results in the WMW-based feature annealed classifier (WFAC).
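A sketch of the augmented-sample rule. For illustration the per-variable separation measure $w_k$ is taken here to be the absolute standardized WMW statistic; the document's exact $w_k$ from Section 2.1 is not reproduced in this chapter:

```python
import numpy as np

def wmw_abs(x, y):
    """|standardized two-sample WMW statistic| for one variable, used as a
    stand-in for the per-variable separation measure w_k (an assumption)."""
    nx, ny = len(x), len(y)
    U = (x[:, None] > y[None, :]).sum()
    return abs((U - nx * ny / 2.0) / np.sqrt(nx * ny * (nx + ny + 1) / 12.0))

def wfac_classify(z, X, Y):
    """Projection-free WMW-based rule: add z to each sample in turn and keep
    the assignment giving the larger total separation over all variables.
    Returns 0 for Pi_X and 1 for Pi_Y."""
    p = X.shape[1]
    sep_x = sum(wmw_abs(np.append(X[:, k], z[k]), Y[:, k]) for k in range(p))
    sep_y = sum(wmw_abs(X[:, k], np.append(Y[:, k], z[k])) for k in range(p))
    return 0 if sep_x > sep_y else 1
```

Placing $z$ in the wrong group drags that group's ranks toward the other group and lowers the corresponding WMW separation, which is exactly what the comparison exploits.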
3.3 Results
Two real data analyses and a Monte Carlo simulation are provided to demonstrate
the superiority of WFAC over FAIR.
3.3.1 Lung Cancer
I first use the Lung Cancer data to compare the performance of WFAC and FAIR
in classifying between two types of lung cancer: malignant pleural mesothelioma
(MPM) and adenocarcinoma (ADCA). In total, the data set contains 181 samples,
31 from MPM and 150 from ADCA. The training set contains 32 samples (16 MPM
and 16 ADCA), while the testing set contains 149 samples (15 MPM and 134 ADCA).
Each sample is described by 12533 genes. These data are available at
http://www.chestsurg.org.
Fan and Fan (2008) set the classification rule using the independence classifier
with t-statistic variable screening based on the training data and predicted each sample
in the testing data to be MPM or ADCA by following this rule.
I, instead, set the classification rule using the WMW-based classifier with WMW-statistic
variable screening based on the training data and predict each sample in the
testing data to be MPM or ADCA by following the new classifier proposed above.
The numbers of incorrect classifications out of the 149 testing samples are given in
Table 3.1.
Table 3.1: Lung Cancer data. The minimum number of incorrectly classified samples
out of 149 testing samples
Method Test Error No. of Selected Genes
FAIR 11/149 26
WFAC 0/149 78
Figure 3.1: Scree Plot of WFAC for Lung Cancer Data (number of misclassified samples versus number of genes selected)
FAIR reaches its minimal misclassification error (11 samples) by selecting 26 genes,
while WFAC achieves zero misclassification error by picking 78 genes. In terms of
misclassification error rate, our method is superior to FAIR. However, WFAC seems
to need many more genes to reach this result, which could make our method inefficient.
For this reason, a scree plot is drawn to show how the misclassification error changes
as more and more genes are added.
Figure 3.1 shows that only 1 testing sample is misclassified based on the top 17
selected genes. As shown in Table 3.1, the minimum number of samples misclassified by
FAIR is 11, attained with 26 selected variables, which indicates that FAIR uses more
genes yet misclassifies more samples than WFAC.
Table 3.2: Prostate Cancer data. The minimum number of incorrectly classified
samples out of 102 samples using leave-one-out cross validation
Method Training Error No. of Selected Genes
FAIR 10/102 2
WFAC 5/102 4
3.3.2 Prostate Cancer
I then use the Prostate Cancer data, which is available at
http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi. The prostate cancer data
contains 102 patient samples, 52 of which are prostate tumor samples and 50 of which
are normal prostate samples. Each sample contains 12600 genes. The minimum number
of misclassified samples is calculated by FAIR and WFAC respectively using
leave-one-out cross validation.
The numbers of incorrect classifications out of the 102 samples are given in
Table 3.2.
The table shows that by selecting the top 2 genes FAIR gives 10 misclassified samples,
while by selecting the top 4 genes WFAC gives only 5 misclassified samples. It still seems
that WFAC sacrifices efficiency to obtain accuracy. For the same reason, a scree
plot is drawn to illustrate the efficiency of WFAC.
As shown in Figure 3.2, WFAC gives 7 misclassified samples when selecting only the
top 2 genes, which implies that WFAC obtains a lower misclassification error rate
using the same number of genes.
3.3.3 Simulation
I perform a large Monte Carlo simulation to study the optimality (in terms of
misclassification error) of FAIR and WFAC under a variety of distributional settings.
To that end, I generated two classes of data from normal, Cauchy, and
t with two degrees of freedom ($t_2$) distributions with dimension $p = 200$. I set
the center of one class at the origin $(0, 0, \ldots, 0)$ and the center of the second class
at $(1/4, 1/2, 3/4, 1, 5/4, 3/2, 0, 0, \ldots, 0)$. I considered the variance-covariance matrices
Figure 3.2: Scree Plot of WFAC for Prostate Cancer Data (number of misclassified samples versus number of genes selected)
$\Sigma_1 = I_{200}$ and
$$\Sigma_2 = \begin{pmatrix} 1 & 1/2 & 1/2 & \cdots & 1/2 \\ 1/2 & 1 & 1/2 & \cdots & 1/2 \\ 1/2 & 1/2 & 1 & \cdots & 1/2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1/2 & 1/2 & 1/2 & \cdots & 1 \end{pmatrix} .$$
In the simulation, training samples of sizes 20 and 30 were generated, as well as
testing samples of size 1000 for each group. FAIR and WFAC are used to set the
classification rule based on the training data, and the minimum misclassification error
rate is calculated as the proportion of misclassified testing samples in each group.
The Monte Carlo simulation generates 50 different training and testing data sets having
the same structure for each distributional setting, and the same procedure is applied to
each of those samples separately. Comparison boxplots of the misclassification error
rates are given in Figure 3.3.
It is clear from the plots that WFAC provides lower misclassification error rates
for the heavier tailed distributions (Cauchy, $t_2$). In particular, for Cauchy data, FAIR
leads to misclassification error rates consistently around 50%, which is nearly the same
as random guessing. This improves somewhat for the $t_2$ distribution, although WFAC is
still better than FAIR. As expected, for normal data FAIR performs better
than WFAC.
3.4 Smoothed Projection-Free WMW-Based Classifier
3.4.1 Description
As discussed above, WFAC provides an improvement over FAIR in dealing with
non-normally distributed data. As shown in (2.2), in the calculation of WFAC the sign
function $\ell$ is used to count the number of transvariated observations. To
simplify the following discussion, assume the median of the differences of the
observations from $\Pi_X$ and $\Pi_Y$ is positive. Then two observations $x$ and $y$ from
$\Pi_X$ and $\Pi_Y$ respectively are treated as a transvariation as long as $x < y$,
regardless of whether $y$ is barely greater than $x$ or much greater than $x$. A weight
associated with the magnitude of the difference is needed, and I assign such weights
by replacing $\ell$ with a $[0,1]$-valued, nondecreasing function that is continuously
differentiable on an interval $(-\delta, \delta)$ for some $\delta > 0$. Inspired by Abebe and Nudurupati
(2011), continuous cumulative distribution functions defined on $\mathbb{R}$ are applied.
Using smoothed WFAC, I classify $z$ to $\Pi_X$ if
$$\sum_{k=1}^{p} w_k(\mathcal{X}^*, \mathcal{Y}) > \sum_{k=1}^{p} w_k(\mathcal{X}, \mathcal{Y}^*) ,$$
where
$$w_k(\mathcal{X}^*, \mathcal{Y}) = 1 - \frac{2}{n_x n_y} \sum_{x^* \in \mathcal{X}^*} \sum_{y \in \mathcal{Y}} K_\alpha\{(x_{ki} - y_{kj})\, m_k\}$$
and
$$w_k(\mathcal{X}, \mathcal{Y}^*) = 1 - \frac{2}{n_x n_y} \sum_{x \in \mathcal{X}} \sum_{y^* \in \mathcal{Y}^*} K_\alpha\{(x_{ki} - y_{kj})\, m_k\} ,$$
with $m_k$ the median of the pairwise differences for variable $k$.
Figure 3.3: Misclassification Error Rate. Comparison boxplots of FAIR and WFAC under six settings: Cauchy (equal variance), $t_2$ (equal variance), Normal (equal variance), Cauchy (unequal variance), $t_2$ (unequal variance), and Normal (unequal variance)
Here $K_\alpha$ is a $[0,1]$-valued function that is nondecreasing on $\mathbb{R}$, where $\alpha$ is the
smoothing parameter. For example, one could take the cumulative distribution function of
$N(0, \alpha)$. Here I use the cumulative distribution function of the t-distribution with $\alpha$
degrees of freedom.
For data with training and testing samples, the training samples are used
to find the best smoother for each group. As in Abebe and
Nudurupati (2011), I use a bivariate grid $(\alpha_1, \alpha_2)$ and apply leave-one-out cross
validation to the training samples to find the misclassification error for each possible
pair of degrees of freedom. The combination with the least misclassification error is
then selected as the pair of smoothing constants.
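The smoothed per-variable statistic can be sketched as follows. This is a minimal illustration using `scipy.stats.t` for $K_\alpha$; the sign convention for $\ell$ (defined in (2.2)-(2.3), not reproduced here) is assumed to be $\ell(u) = 1$ for $u < 0$, so the CDF is applied to the negated products:

```python
import numpy as np
from scipy.stats import t as t_dist

def smoothed_wk(x, y, df):
    """Smoothed separation measure for one variable: the sign-counting
    function l is replaced by a t-distribution CDF with `df` degrees of
    freedom, so near-ties contribute fractional transvariation counts."""
    diffs = x[:, None] - y[None, :]     # pairwise differences x_ki - y_kj
    m = np.median(diffs)                # median difference m_k
    # K_df(-u) smoothly approximates the indicator of u < 0
    return 1.0 - 2.0 / diffs.size * t_dist.cdf(-diffs * m, df).sum()
```

Small `df` gives heavy-tailed, gentle smoothing; large `df` approaches a sharp cutoff, which is why a grid over degrees of freedom is cross-validated as described above.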
3.4.2 Monte Carlo Simulation
To demonstrate the benefit of the smoother, several common simulation settings
are used: the normal distribution, the t distribution (heavy tails), and the log-normal
distribution (skewness). I set the center of one class at the origin
$(0, 0, \ldots, 0)$ and the center of the second class at $(1/4, 1/2, 3/4, 1, 5/4, 3/2, 0, 0, \ldots, 0)$. I
considered the variance-covariance matrix $\Sigma_1 = I_{200}$.
I consider normal, $t_2$, and log-normal distributions to generate 200-dimensional
data. In the simulation, training samples of sizes 30 and 30, as well as testing samples
of size 100 for each group, were generated following the distributional settings
mentioned above. We use WFAC to set the classification rule based on the training
data and calculate the minimum misclassification error rate as the proportion of
misclassified testing samples in each group. I then calculate the minimum
misclassification error rate for the testing sample including the optimal smoother
determined from the training data. The Monte Carlo simulation generates 10 different
training and testing data sets having the same structure for each distributional setting,
and the same procedure is applied to each of those samples separately. Comparison
boxplots of the misclassification error rates are given in Figure 3.4.
The results show that with equal sample sizes the smoother does not improve the
performance of WFAC very much.
I then change the training sample sizes to 20 and 30, respectively, and repeat
the same simulation as described above to show how the smoother behaves in the
unequal sample size case. Comparison boxplots of the misclassification error
rates are given in Figure 3.5.
Figure 3.5 shows that the smoothed version works much better for the skewed
data (log-normal), while it is slightly more efficient for the heavy-tailed data ($t_2$).
3.5 Conclusion
Both real data analyses show the better efficiency and accuracy of WFAC
compared with FAIR. A large Monte Carlo simulation study further demonstrates the
clear advantage of WFAC for heavier tailed data and its slight disadvantage for normally
distributed data. Finally, the necessity of smoothed WFAC in circumstances
where the two sample sizes are unequal is discussed via a Monte Carlo simulation study.
Figure 3.4: Misclassification Error Rate (Smoothed, Equal Sample Sizes). Comparison boxplots of WFAC without and with the smoother under five settings: Normal-Normal, LogNormal-LogNormal, $t_2$-$t_2$, Normal-$t_2$, and $t_2$-LogNormal
Figure 3.5: Misclassification Error Rate (Smoothed, Unequal Sample Sizes). Comparison boxplots of WFAC without and with the smoother under five settings: Normal-Normal, LogNormal-LogNormal, $t_2$-$t_2$, Normal-$t_2$, and $t_2$-LogNormal
Chapter 4
WMW Forward Variable Selection Method
The two sample t-statistic and the WMW statistic can be used as variable screening
indices, and their properties are discussed in Chapter 3. But both variable selection
procedures fail to take the correlations among predictor variables into account. They
are not recommended when many of the variables depend on each other.
As shown in Section 2.2, two popular dimension reduction methods, PCA and PLS, can
also be used to reduce the dimension by finding a smaller set of uncorrelated linear
combinations of the original variables. However, the problem with both methods
is that they use linear combinations of all the predictor variables to
create new variables. Those created variables contain information from all the predictor
variables and are very hard to interpret, especially in biological studies.
That is why I propose a new WMW forward variable selection (WFS) procedure to
improve on these existing methods.
4.1 Description
Consider two populations $\Pi_X$ and $\Pi_Y$. $\mathcal{X}$ and $\mathcal{Y}$ are two samples from $\Pi_X$ and $\Pi_Y$
with sample sizes $n_x$ and $n_y$ respectively. $X = \{x_{ij}\}$ and $Y = \{y_{ij}\}$ are the predictor
matrices of the two samples. Write $V$ as a matrix of column vectors $V = [v_1\ v_2\ \cdots\ v_p]$,
where each $v_i$ is an $n \times 1$ vector with $n = n_x + n_y$. I would like to order the
variables $v_1, \ldots, v_p$ in decreasing order of the amount of information
they provide for class determination, say $v_{[1]}, \ldots, v_{[p]}$. The most informative
variable is the one that gives maximum separation between the two groups. For the
second most informative variable, it seems reasonable to pick the variable that is
most dissimilar to $v_{[1]}$ while at the same time giving the highest contribution to
distinguishing between the two groups.
The approach I propose uses WMW statistics to measure overlap and dissimilarity
in the sense of classification. For $k = 1, \ldots, p$, define
$$t_{kij} = x_{ki} - y_{kj} , \qquad i = 1, \ldots, n_x;\ j = 1, \ldots, n_y ,$$
and, as a measure of overlap between $\Pi_X$ and $\Pi_Y$, consider the two sample WMW
statistic based on $v_k$
$$w(v_k) = 1 - \frac{2}{n_x n_y} \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \ell(t_{kij}\, m_k) , \qquad (4.1)$$
where $m_k = \mathrm{median}\{t_{kij} : 1 \le i \le n_x,\ 1 \le j \le n_y\}$. It can be seen that $0 \le w(v_k) \le 1$.
Higher values of $w(v_k)$ indicate smaller overlap between $\{x_{k1}, \ldots, x_{kn_x}\}$ and
$\{y_{k1}, \ldots, y_{kn_y}\}$. The most informative variable is the one with the least overlap
and hence the highest $w(\cdot)$.
To measure the dissimilarity between two variables $v_r$ and $v_s$ I use
$$d(v_r, v_s) = \frac{1}{n_x n_y} \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \ell\{(t_{rij}\, m_r)(t_{sij}\, m_s)\} , \qquad (4.2)$$
where $r, s = 1, \ldots, p$. This quantity $d(v_r, v_s)$ resembles the measure of dissimilarity
studied by Sokal and Michener (1958) and Rand (1971) (see the discussion in Albatineh
et al., 2006) that is given by $(n_x n_y)^{-1} \sum\sum t_{rij} t_{sij}$. The measure $d(v_r, v_s)$ counts how
often observations $i$ and $j$ in $v_r$ and $v_s$ behave in opposite directions with respect
to their medians for $i = 1, \ldots, n_x;\ j = 1, \ldots, n_y$. It is clear that $0 \le d(v_r, v_s) \le 1$,
where large values of $d(v_r, v_s)$ indicate large dissimilarity between $v_s$ and $v_r$. It is
also easily observed that $d(v, v) = 0 = d(v, -v)$. Moreover, $d(v_r, v_s) = 1$ if $t_{rij} m_r < 0$
whenever $t_{sij} m_s > 0$, for all $i$ and $j$, and vice versa. In such cases, variables $v_r$ and $v_s$
are totally dissimilar in the sense that they provide opposing information about class
membership.
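The two measures (4.1) and (4.2) can be sketched directly. This assumes $\ell(u) = 1$ if $u < 0$ and $0$ otherwise (the definition in (2.3) is not reproduced in this chapter), which makes both quantities lie in $[0,1]$ as stated:

```python
import numpy as np

def overlap_w(x, y):
    """Overlap measure (4.1): one minus twice the fraction of pairwise
    differences t_kij that disagree in sign with their median m_k."""
    diffs = x[:, None] - y[None, :]     # t_kij
    m = np.median(diffs)                # m_k
    return 1.0 - 2.0 / diffs.size * np.count_nonzero(diffs * m < 0)

def dissim_d(xr, yr, xs, ys):
    """Dissimilarity (4.2): fraction of pairs (i, j) whose differences fall on
    opposite sides of their respective medians in the two variables."""
    tr = xr[:, None] - yr[None, :]
    ts = xs[:, None] - ys[None, :]
    prod = (tr * np.median(tr)) * (ts * np.median(ts))
    return np.count_nonzero(prod < 0) / prod.size
```

Under this convention a fully separated variable attains $w = 1$, and a variable compared with itself (or with its negation) attains $d = 0$, matching the properties noted above.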
4.2 Algorithm
An algorithm to implement WFS is as follows:
Algorithm 4.1.
Step 1: Let
$$v_{[1]} = \operatorname*{Argmax}_{k=1,\ldots,p} w(v_k) ,$$
where $w$ is defined in (4.1).
Step 2: Use (4.2) to compute $d(v_s, v_{[1]})$ for $s \ne [1]$. Let
$$v_{[2]} = \operatorname*{Argmax}_{k=1,\ldots,p;\ k \ne [1]} \{w(v_k)\, d(v_k, v_{[1]})\} .$$
Step 3: For $c = 2, \ldots, p$, find the direction of maximum separation $\hat{u}_1 \in \mathbb{R}^c$ given
in (2.4) using $v_{[1]}, \ldots, v_{[c]}$ and set
$$v_{[c+1]} = \operatorname*{Argmax}_{k \notin \{[1],\ldots,[c]\}} w(v_k)\, d(v_k, \hat{u}_1'[v_{[1]} \cdots v_{[c]}]) .$$
One may use stability of the misclassification error rate as a stopping criterion.
The downside of Algorithm 4.1 is that one needs to search higher and higher
dimensional spaces as the value of $c$ in Step 3 increases, which introduces a huge
computational burden. The following modification avoids high-dimensional projections
by combining the selected variables using the direction of maximum separation:
Algorithm 4.2.
Step 1: Set $c = 1$. Let
$$v_{[1]} = \operatorname*{Argmax}_{k=1,\ldots,p} w(v_k) ,$$
where $w$ is defined in (4.1).
Step 2: If $c \ge p$, then STOP. Otherwise use (4.2) to compute $d(v_s, v_{[1]})$ for
$s \in \{[c+1], \ldots, [p]\}$. Let
$$v_{[2]} = \operatorname*{Argmax}_{k \in \{[c+1],\ldots,[p]\}} \{w(v_k)\, d(v_k, v_{[1]})\} .$$
Step 3: Find the direction of maximum separation $\hat{u}_1 \in \mathbb{R}^2$ given in equation (2.4)
using $[v_{[1]}\ v_{[2]}]$ and set
$$v_{[1]} \leftarrow \hat{u}_1'[v_{[1]}\ v_{[2]}] , \qquad c \leftarrow c + 1 .$$
Go back to Step 2.
This algorithm is convenient for selecting variables in high dimensional data
since it only requires two-dimensional projections. This is especially useful when
performing classification based on gene expression data. Moreover, forward selection
allows one to start with fewer variables and proceed to higher dimensions if necessary.
As a stopping rule, one may use a predetermined dimension (Nguyen and Rocke, 2002)
or cross validation using the misclassification error rate.
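The greedy loop of Algorithm 4.2 can be sketched as follows. The overlap and dissimilarity measures (4.1)-(4.2) are re-implemented here so the sketch is self-contained (with $\ell(u) = 1$ for $u < 0$ assumed), and `max_sep_direction` is a simple stand-in for the direction of maximum separation in (2.4), which is not reproduced in this chapter:

```python
import numpy as np

def overlap_w(x, y):
    # WMW overlap (4.1), assuming l(u) = 1 if u < 0
    diffs = x[:, None] - y[None, :]
    m = np.median(diffs)
    return 1.0 - 2.0 / diffs.size * np.count_nonzero(diffs * m < 0)

def dissim_d(xr, yr, xs, ys):
    # dissimilarity (4.2) between two candidate variables
    tr, ts = xr[:, None] - yr[None, :], xs[:, None] - ys[None, :]
    prod = (tr * np.median(tr)) * (ts * np.median(ts))
    return np.count_nonzero(prod < 0) / prod.size

def max_sep_direction(Vx, Vy):
    # Stand-in for (2.4): normalized difference of coordinate-wise medians.
    u = np.median(Vx, axis=0) - np.median(Vy, axis=0)
    return u / np.linalg.norm(u)

def wfs(X, Y, n_select):
    """Sketch of Algorithm 4.2: greedy WMW forward selection using only
    two-dimensional projections.  Returns indices of selected variables."""
    p = X.shape[1]
    remaining = list(range(p))
    first = max(remaining, key=lambda k: overlap_w(X[:, k], Y[:, k]))
    selected = [first]
    remaining.remove(first)
    combo_x, combo_y = X[:, first].copy(), Y[:, first].copy()  # current v_[1]
    while remaining and len(selected) < n_select:
        # Step 2: reward separation, penalize similarity to the combined variable
        nxt = max(remaining, key=lambda k: overlap_w(X[:, k], Y[:, k])
                  * dissim_d(combo_x, combo_y, X[:, k], Y[:, k]))
        selected.append(nxt)
        remaining.remove(nxt)
        # Step 3: fold the new variable into v_[1] via a 2-D projection
        Vx = np.column_stack([combo_x, X[:, nxt]])
        Vy = np.column_stack([combo_y, Y[:, nxt]])
        u = max_sep_direction(Vx, Vy)
        combo_x, combo_y = Vx @ u, Vy @ u
    return selected
```

Because a variable that duplicates the current combined variable has $d = 0$, its reward $w \cdot d$ vanishes and the procedure moves on to genuinely new information, which is the behavior the chapter motivates.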
4.3 Results
One real data analysis demonstrates the advantage of WFS over both simple
t-statistic and WMW-statistic variable screening. Then two real data analyses and
a Monte Carlo simulation are used to compare the performance of PCA, PLS,
and WFS.
4.3.1 Caribbean Food Data
The Caribbean food data are used to compare the performance of the t-statistic,
the WMW statistic, and WFS. This data set contains information from the Food and Drug
Administration (FDA) and the U.S. Department of Agriculture (USDA) on the number
of rejections by country for certain Latin American and Caribbean (LAC) countries
for the years 1992 to 2003. This data set was investigated in Jolly et al. (2007) using
zero-inflated count data mixed models. The variables considered in the current study
are
- t = year (1992 - 2003)
- FDI = Foreign direct investment, net inflows (Balance of Payments (BoP), current US $)
- Fertcons = Fertilizer consumption (metric tons)
- USImp = U.S. Imports by Country (1985-03; Millions of Dollars)
- AgImp = Total Agricultural Import to the US (million $)
- GNI = Gross national income per capita, Atlas method (current US $)
- Y = Detention Status (Y=0 no detention; Y=1 detention)
Consider the variable Y as the response variable and the other six variables as predictor
variables. I first select the top 2 variables based on the t-statistic, the WMW statistic, and WFS.
Table 4.1: Caribbean food data. Misclassification error rates using leave-one-out cross
validation.
t Selection WMW Selection WFS Without Selection
LDA 0.4155 0.3873 0.3873 0.3803
MaxD 0.3873 0.3170 0.3169 0.3170
PGT 0.3944 0.3099 0.3028 0.3592
GGT 0.3803 0.3099 0.3028 0.3451
TD 0.3732 0.3310 0.3310 0.3310
The variables USImp and AgImp are selected by the t-statistic, while the WMW statistic
chooses FDI and USImp. WFS, instead, picks FDI and Fertcons. The
classifiers LDA, TD, MaxD, PGT, and GGT discussed in Section 2.3 are then applied to
calculate the proportion of misclassified samples using leave-one-out cross validation
based on the top 2 variables selected by the t-statistic, the WMW statistic, and WFS
respectively. I also calculate the corresponding misclassification error rates based on
all six predictor variables to show the necessity of a variable selection procedure.
The misclassification error rates are given in Table 4.1.
It is clear that using the t-statistic to select the variables gives worse classification
than using no variable selection at all, which indicates that the t-statistic fails to identify
the most informative variables. The results improve when PGT and GGT are applied
to determine class membership based on the important variables (FDI and USImp)
selected by the WMW statistic. Finally, WFS followed by PGT or GGT gives the minimal
misclassification error rate (0.3028). In the sense of classification, the variables FDI
and Fertcons chosen by WFS should be considered the most informative ones for
predicting food detention.
4.3.2 Colon Data
A two-way cluster study was conducted by Alon et al. (1999) using a data set composed
of 40 colon tumor samples and 22 normal colon tissue samples, analyzed with
an Affymetrix oligonucleotide array complementary to more than 6,500 human genes.
Table 4.2: Colon data. The number of incorrectly classified samples out of 62 samples
using leave-one-out cross validation. Screening for PCA and PLS uses the WMW and
t statistics. The numbers in parentheses are the numbers of genes kept after the
screening.
WMW Selection t Selection
PCA PLS PCA PLS PCA PLS PCA PLS WFS
(50) (50) (100) (100) (50) (50) (100) (100)
LDA 10 8 8 8 9 8 8 7 5
QDA 10 8 10 7 10 8 9 8 11
MaxD 9 12 10 11 10 12 12 11 10
PGT 8 8 8 8 8 8 8 8 5
GGT 9 9 9 9 9 10 9 9 5
TD 8 8 8 8 8 8 8 9 7
Alon et al. (1999) used a deterministic-annealing algorithm to
partition the data set into two clusters. One cluster consisted of 35 tumor and 3 normal
samples, while the other cluster contained 19 normal and 5 tumor samples.
Leave-one-out cross validation is used to determine the misclassification error
rates based on 4 genes selected by WFS and 4 gene components selected by PCA and
PLS. Prior to using PCA and PLS, the top 50 and 100 genes were selected based on
the values of the WMW and t statistics. The numbers of incorrect classifications out of
the 62 samples are given in Table 4.2.
The results show that WFS followed by PGT, LDA, or GGT gives the fewest
(5) misclassified samples of any combination of dimension reduction/selection and
classifier. MaxD results in between 9 and 12 misclassified samples; 12 misclassified
samples is the highest in the study. The fewest samples misclassified
by any classifier following PCA is 8. The minimum number of samples
misclassified following PLS is 7, achieved by LDA when 100 genes were selected by
the t-statistic and by QDA when 100 genes were selected by the WMW statistic.
Table 4.3: Leukemia training data. The number of incorrectly classified samples
out of 38 training samples using leave-one-out cross validation. Screening for PCA
and PLS uses the WMW and t statistics. The numbers in parentheses are the
numbers of genes kept after the screening.

              WMW Selection                       t Selection
      PCA(50) PLS(50) PCA(100) PLS(100)   PCA(50) PLS(50) PCA(100) PLS(100)   WFS
LDA       1      1       1        2          1       2       1        2        4
QDA       2      2       2        0          3       2       2        3        5
MaxD      4      3       3        3          4       4       4        4        1
PGT       1      2       1        1          2       3       2        3        0
GGT       1      2       1        1          2       4       2        4        0
TD        1      2       2        1          1       1       2        2        3
4.3.3 Leukemia Data
Golub et al. (1999) used a classification procedure to discover the distinction
between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL).
The original (training) data set consisted of 38 bone marrow samples (27 ALL
and 11 AML) obtained from acute leukemia patients at the time of diagnosis. The
independent (testing) data set consisted of 24 bone marrow samples as well as 10
peripheral blood specimens from adults and children (20 ALL and 14 AML).
I first apply leave-one-out cross validation to the training data as described
earlier. The numbers of incorrect classifications out of 38 samples were calculated
for WFS as well as for PCA and PLS, each based on four genes or gene components.
The results are shown in Table 4.3.
The results show that WFS followed by PGT or GGT gave no misclassified
samples. The same result is attained by QDA applied to PLS components after
screening 100 genes with the WMW statistic. The minimum number of samples
misclassified following PCA is 1.
I then calculated the number of misclassified testing samples using the variables
selected by WFS and the top four components obtained by PCA and PLS from the
training samples. The results are shown in Table 4.4.
Table 4.4: Leukemia training and testing data. The number of incorrectly
classified samples out of 24 testing samples based on the training samples.
Screening for PCA and PLS uses the WMW and t statistics. The numbers in
parentheses are the numbers of genes kept after the screening.

              WMW Selection                       t Selection
      PCA(50) PLS(50) PCA(100) PLS(100)   PCA(50) PLS(50) PCA(100) PLS(100)   WFS
LDA       1      3       1        1          1       3       2        3        5
QDA       2      2       3        2          2       2       1        2        3
MaxD      5      3       2        2          3       2       3        3        3
PGT       2      3       1        1          2       2       1        2        3
GGT       2      3       1        1          2       2       1        3        3
TD        2      1       1        1          2       2       2        3        3
We note that WFS gives inferior performance to PCA and PLS in this experiment
involving training and testing data.
The results of the analyses on real data demonstrate the need to characterize cases
where WFS performs better than PCA and PLS and vice versa. This is investigated
in the following section using simulated data.
4.3.4 Monte Carlo Study
I perform a Monte Carlo simulation to study the optimality (in terms of misclassification
error) of PCA, PLS, and WFS under a variety of distributional settings. Two
classes of data are generated from normal, Cauchy, and t with two degrees of freedom
(t2) distributions with dimension p = 200. The center of one class is set at the origin
(0, 0, ..., 0) and the center of the second class at (1/4, 1/2, 3/4, 1, 5/4, 3/2, 0, 0, ..., 0). I
then consider the variance-covariance matrices $\Sigma_1 = I_{200}$ and
\[
\Sigma_2 = \begin{pmatrix}
1 & 1/2 & 1/2 & \cdots & 1/2 \\
1/2 & 1 & 1/2 & \cdots & 1/2 \\
1/2 & 1/2 & 1 & \cdots & 1/2 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1/2 & 1/2 & 1/2 & \cdots & 1
\end{pmatrix},
\]
the equicorrelation matrix with unit variances and common correlation 1/2.
In the simulation, training samples of sizes 20 and 30 were generated. After an
initial screening down to 50 variables using the WMW and t statistics, the samples
were used to determine the PCA and PLS loadings and to set the classification rules
based on the top four components. Testing samples of size 1000 from each group were
then generated, and the loadings found from the training samples were applied. The
misclassification error rate is calculated based on the top four components by computing
the proportion of misclassified testing-sample observations in each group. For WFS,
I directly selected the top 4 variables without any screening; these same variables
were retained for the testing samples. The entire process is replicated 50 times.
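The simulation design above, two p = 200 classes whose centers differ in the first six coordinates, with an equicorrelation covariance and normal or heavy-tailed (Cauchy, t2) samples, can be sketched with standard-library tools. The shared-factor construction of the equicorrelation structure and the function name `sample_class` are my own choices for illustration, not the dissertation's code; `rho=0` reproduces the identity-covariance case.

```python
import math
import random

def sample_class(n, center, rho=0.5, df=None, rng=None):
    """Draw n points with unit variances and common correlation rho.
    The equicorrelation structure is induced by a shared factor:
    X_j = sqrt(rho)*Z0 + sqrt(1-rho)*Z_j.  With df=None the sample is
    multivariate normal; a finite df divides each point by an independent
    sqrt(chi2_df / df), giving elliptical t_df tails (df=1 is the
    Cauchy-type case, df=2 the t2 case of the simulation)."""
    rng = rng or random.Random(0)
    p = len(center)
    points = []
    for _ in range(n):
        z0 = rng.gauss(0, 1)  # factor shared by all coordinates
        x = [math.sqrt(rho) * z0 + math.sqrt(1 - rho) * rng.gauss(0, 1)
             for _ in range(p)]
        if df is not None:
            # divide by an independent chi-square factor -> heavy tails
            w = sum(rng.gauss(0, 1) ** 2 for _ in range(df)) / df
            x = [xi / math.sqrt(w) for xi in x]
        points.append([c + xi for c, xi in zip(center, x)])
    return points

p = 200
mu0 = [0.0] * p
mu1 = [0.25, 0.5, 0.75, 1.0, 1.25, 1.5] + [0.0] * (p - 6)  # shifted center

rng = random.Random(42)
train0 = sample_class(20, mu0, df=2, rng=rng)  # t2 class, correlated case
train1 = sample_class(20, mu1, df=2, rng=rng)
print(len(train0), len(train0[0]))  # → 20 200
```

Repeating such draws for training and testing sets, and over the six distribution/covariance combinations, yields the replicates summarized in Figure 4.1.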
For the sake of brevity, I only report the results for QDA, MaxD, and GGT. The
performance of LDA was similar to QDA, and that of PGT and TD was similar to
GGT. Comparison boxplots of the misclassification error rates are given in
Figure 4.1.
It is clear from the plots that WFS provides lower misclassification error rates for
the heavier-tailed distributions (Cauchy, t2). For Cauchy data, PCA and PLS lead
to misclassification error rates consistently around 50%. This is akin to flipping a
coin to decide group membership without regard to the information contained in the
variables. This is somewhat improved for the t2 distribution, even though WFS is still
the best among the methods considered. As expected, for normal data PCA and PLS
provide better performance than WFS. In the homoscedastic normal case, GGT is
the classifier with the lowest misclassification error rate, while for the heteroscedastic
normal case the best method is QDA. The latter is expected under normality.
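The coin-flip behavior of PCA and PLS on Cauchy data is consistent with these methods' reliance on sample moments, which never stabilize for the Cauchy distribution (its mean and variance do not exist), whereas ranks are unaffected by heavy tails. A quick standard-library illustration of this instability (my own demonstration, not part of the dissertation's simulation):

```python
import random

def sample_var(xs):
    """Unbiased sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

rng = random.Random(7)
# a ratio of independent standard normals is standard Cauchy
cauchy = [rng.gauss(0, 1) / rng.gauss(0, 1) for _ in range(10000)]
normal = [rng.gauss(0, 1) for _ in range(10000)]

# the normal sample variance settles near 1; the Cauchy one keeps jumping
for n in (100, 1000, 10000):
    print(n, round(sample_var(normal[:n]), 2), round(sample_var(cauchy[:n]), 2))
```

Any method whose loadings are built from such variance estimates inherits this instability, which is exactly what the Cauchy panels of Figure 4.1 show.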
4.4 Conclusion
Using two real data sets, it is shown that transvariation-based classifiers following
the rank-based forward variable selection procedure provide better class prediction
than LDA and QDA following dimension reduction using PCA and PLS. The forward
selection procedure also provides superior performance when the data come from
heavy-tailed distributions.
Because it starts with low dimensions, the use of forward selection makes intuitive
sense for variable selection in very high dimensional data. Even then, the original
formulation of the proposed forward selection procedure required projection pursuit
in high dimensional spaces. This becomes computationally very expensive, especially
for gene expression data, which are ultra-high dimensional: complicated methods of
mesh generation and a large number of points are required to effectively cover high
dimensional spaces. In this paper, an alternative algorithm that sequentially combines
the information in two variables using the most informative direction is given as a way
to reduce the computation. This modified algorithm only requires projections in two
dimensions, which can be done by picking evenly spaced points on the unit circle.
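The two-dimensional projection step just described can be sketched as follows: candidate directions are evenly spaced points on (half of) the unit circle, and a pair of variables is combined by projecting onto whichever direction maximizes a separation criterion. The `score` callback and the toy example below are placeholders for whatever separation measure (e.g., a WMW-type statistic between the projected classes) the procedure actually uses.

```python
import math

def circle_directions(m):
    """m evenly spaced unit vectors covering the half circle [0, pi);
    opposite directions give the same projection axis, so this suffices."""
    return [(math.cos(math.pi * k / m), math.sin(math.pi * k / m))
            for k in range(m)]

def best_direction(x, y, score, m=180):
    """Project the 2-D points (x_i, y_i) onto each candidate direction and
    return the direction maximizing the user-supplied separation `score`."""
    best, best_val = None, -math.inf
    for a, b in circle_directions(m):
        proj = [a * xi + b * yi for xi, yi in zip(x, y)]
        val = score(proj)
        if val > best_val:
            best, best_val = (a, b), val
    return best, best_val

# toy check: the first variable separates the groups, the second is noise,
# so the best direction should be (1, 0)
x = [0, 0, 0, 5, 5, 5]       # first three points: class A; last three: class B
y = [1, -1, 0, 1, -1, 0]
gap = lambda proj: abs(sum(proj[3:]) - sum(proj[:3])) / 3  # mean separation
direction, value = best_direction(x, y, gap)
print(direction, value)  # → (1.0, 0.0) 5.0
```

The combined variable (the projection onto the chosen direction) can then be paired with the next candidate variable, so the search never leaves two dimensions.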
Figure 4.1: Misclassification error rates (Black=QDA, Gray=MaxD, White=GGT).
[Six comparison boxplot panels: Cauchy, t2, and normal data, each under equal and
unequal variance; within each panel the methods compared are WFS, t-PCA, W-PCA,
t-PLS, and W-PLS.]
Chapter 5
Conclusion and Future Work
Gene expression data usually contain a large number of genes but a small
number of samples. It is well known that not all of these genes contribute to determining
a specific genetic trait. Feature selection for gene expression data aims at finding a
set of genes that best discriminate between biological samples of different types.
In this dissertation, inspired by the FAIR procedure of Fan and Fan (2008), a new
nonparametric classifier (WFAC) is proposed to classify new observations based on the
most informative variables selected by the Wilcoxon-Mann-Whitney statistic. Its
similarities to and differences from FAIR are discussed theoretically and via real data
analysis and a Monte Carlo simulation study. I also introduced a smoothed version of
WFAC to improve its performance when there is a large discrepancy between the two
sample sizes. I then developed a nonparametric forward selection procedure (WFS)
for selecting features to be used for classification. This rank-based forward selection
procedure rewards genes for their contribution towards determining the trait but
penalizes them for their similarity to genes that are already selected. Lower
misclassification error rates are achieved by WFS compared to dimension reduction
methods such as PCA and PLS.
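The reward/penalty idea can be illustrated with a greedy relevance-minus-redundancy criterion. The simplified score below is my own (closer in spirit to mRMR-style selection) and stands in for the actual rank-based WFS criterion; the toy relevance and similarity numbers are invented.

```python
def greedy_forward_select(relevance, similarity, k):
    """Greedy forward selection: repeatedly add the candidate gene whose
    relevance (reward for association with the trait) minus its average
    similarity to already-selected genes (redundancy penalty) is largest."""
    selected = []
    candidates = set(relevance)
    while candidates and len(selected) < k:
        def score(g):
            if not selected:
                return relevance[g]
            penalty = sum(similarity[g][h] for h in selected) / len(selected)
            return relevance[g] - penalty
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# g1 and g2 are both informative but nearly redundant; g3 is weaker but
# carries independent information, so it is picked second
relevance = {"g1": 0.90, "g2": 0.85, "g3": 0.40}
similarity = {
    "g1": {"g2": 0.95, "g3": 0.10},
    "g2": {"g1": 0.95, "g3": 0.10},
    "g3": {"g1": 0.10, "g2": 0.10},
}
print(greedy_forward_select(relevance, similarity, 2))  # → ['g1', 'g3']
```

In WFS the reward and penalty are computed from ranks rather than fixed numbers, but the greedy structure is the same.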
It is of interest to find a specific rule for determining the number of variables to
select with WFS. This requires a theoretical description of the misclassification
error rate, which could then be minimized with respect to the number of variables. So
far there is no clear stopping rule, and one may only use predetermined dimensions
or cross validation based on the misclassification error rate. This is currently being
studied by the author.
Bibliography
Abebe, A. and McKean, J. W. (2007) Highly efficient nonlinear regression based on
the Wilcoxon norm.
Abebe, A. and Nudurupati, S. V. (2011) Smooth nonparametric allocation of
classification, J. Stat. Comp. Simul., 40: 1-16.
Abebe, A., McKean, J. W., and Kloke, J. D. (2011) Iterated reweighted rank-based
estimates of GEE models, submitted.
Albatineh, A. N., Niewiadomska-Bugaj, M. and Mihalko, D. (2006) On similarity
indices and correction for chance agreement. J. Classif., 23: 301-313.
Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D. and Levine,
A. J. (1999) Broad patterns of gene expression revealed by clustering analysis of
tumor and normal colon tissues probed by oligonucleotide arrays, Proc. Natl. Acad.
Sci. USA, 96, 6745-6750.
Benard, A. and Elteren, P. V. (1953) A generalization of the method of m rankings,
Nederl. Akad. Wetensch. Proc. Ser. A (Indag. Math. 15), 56, 358-369.
Bickel, P. J. and Levina, E. (2004) Some theory for Fisher's linear discriminant
function, "naive Bayes," and some alternatives when there are many more variables
than observations. Bernoulli, 10: 989-1010.
Chen, Z. Y. and Muirhead, R. J. (1994) Comparison of robust linear discriminant
procedures using projection pursuit methods, in Multivariate Analysis and Its
Applications (Hong Kong, 1992), volume 24 of IMS Lecture Notes Monogr. Ser.,
pages 163-176. Inst. Math. Stat., Hayward, CA.
Crimin, K., McKean, J. W., and Sheather, S. J. (2007) Discriminant procedures based
on efficient robust discriminant coordinates, J. Nonparametr. Stat., 19(4-5): 199-213.
Ding, B. and Gentleman, R. (2005) Classification using generalized partial least
squares. J. Comput. Graph. Stat., 14(2): 280-298.
Donoho, D. (1982) Breakdown properties of multivariate location estimators, PhD
qualifying paper, Department of Statistics, Harvard University.
Dudoit, S., Fridlyand, J. and Speed, T. P. (2002) Classification efficiencies for robust
linear discriminant analysis, Stat. Sinica, 18(2): 581-599.
Durbin, J. (1951) Incomplete blocks in ranking experiments, Br. J. Psychol.
(Statistical Section), 4, 85-90.
Fan, J. and Fan, Y. (2008) High-dimensional classification using features annealed
independence rules. Ann. Stat., 36, 2605-2637.
Fan, J. and Lv, J. (2008) Sure independence screening for ultrahigh dimensional
feature space, J. R. Stat. Soc. Ser. B, 70(5): 849-911.
Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems, Ann.
Eugenics, 7(2): 179-188.
Friedman, M. (1940) A comparison of alternative tests of significance for the problem
of m rankings, Ann. Math. Stat., 11(1): 86-92.
Friedman, J. H. and Tukey, J. W. (1974) A projection pursuit algorithm for
exploratory data analysis, IEEE Trans. Comput., C-23(9): 881-890.
Froda, S. and Eeden, C. V. (2000) A uniform saddlepoint expansion for the null
distribution of the Wilcoxon-Mann-Whitney statistic, Canadian J. Stat., 28(1): 137-149.
Garthwaite, P. H. (1994) An interpretation of partial least squares, J. Am. Stat.
Assoc., 89, 122-127.
Ghosh, A. K. and Chaudhuri, P. (2005) On maximum depth and related classifiers.
Scand. J. Stat., 32(2): 327-350.
Gini, C. (1916) Il concetto di transvariazione e le sue prime applicazioni. Giornale
degli Economisti e Rivista di Statistica.
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P.,
Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and
Lander, E. S. (1999) Molecular classification of cancer: class discovery and class
prediction by gene expression monitoring, Science, 286, 531-537.
Helland, I. S. (1988) On the structure of partial least squares, Commun. Stat. Simul.
Comput., 17, 581-607.
Höskuldsson, A. (1988) PLS regression methods, J. Chemometrics, 2, 211-228.
Hotelling, H. and Pabst, M. R. (1936) Rank correlation and tests of significance
involving no assumption of normality. Ann. Math. Stat., 7: 29-43.
Hugg, J., Rafalin, E., Seyboth, K., and Souvaine, D. (2006) An experimental study of
old and new depth measures, Workshop on Algorithm Engineering and Experiments
(ALENEX06), Lecture Notes in Computer Science. New York: Springer-Verlag,
pp. 51-64.
Jolliffe, I. T. (1986) Principal Component Analysis. Springer, New York.
Jolly, C. M., Namugabo, E., and Abebe, A. (2007) Food safety issues between Latin
American and Caribbean countries. Paper presented at the 27th West Indies
Agricultural Economics Conference.
Lehmann, E. L. (2006) Nonparametrics: Statistical Methods Based on Ranks. Revised
first edition. Springer, New York.
Liao, C., Li, S. and Luo, Z. (2007) Gene selection using Wilcoxon rank sum test and
support vector machine for cancer classification. In Y. Wang, Y.-M. Cheung, and
H. Liu, editors, Computational Intelligence and Security, volume 4456 of Lecture
Notes in Computer Science, pages 57-66. Springer, Berlin/Heidelberg.
Liu, R. Y. (1990) On a notion of data depth based on random simplices, Ann. Stat.,
18(1): 405-414.
Liu, R. Y. (1992) Data depth and multivariate rank tests, in L1-Statistical Analysis
and Related Methods (Neuchâtel, 1992), 279-294, North-Holland, Amsterdam.
Liu, R. Y., Parelius, J. M., and Singh, K. (1999) Multivariate analysis by data depth:
descriptive statistics, graphics and inference (with discussion and a rejoinder by
Liu and Singh), Ann. Stat., 27(3): 783-858.
Liu, R. Y. and Singh, K. (1993) A quality index based on data depth and multivariate
rank tests, J. Amer. Stat. Assoc., 88(421): 252-260.
Mahalanobis, P. C. (1936) On the generalized distance in statistics, Proc. Natl. Inst.
Sci. India, 2, 49-55.
Mann, H. B. and Whitney, D. R. (1947) On a test of whether one of two random
variables is stochastically larger than the other. Ann. Math. Stat., 18(1): 50-60.
Martens, H. and Naes, T. (1989) Multivariate Calibration. Wiley, New York.
Massy, W. F. (1965) Principal components regression in exploratory statistical
research, J. Am. Stat. Assoc., 60, 234-246.
Montanari, A. (2004) Linear discriminant analysis and transvariation, J. Classif.,
21(1): 71-88.
Nguyen, D. V. and Rocke, D. M. (2002) Tumor classification by partial least squares
using microarray gene expression data. Bioinformatics, 18(1): 39-50.
Nudurupati, S. V. and Abebe, A. (2009) A nonparametric allocation scheme for
classification based on transvariation probabilities, J. Stat. Comp. Simul., 79(8):
977-987.
Rand, W. M. (1971) Objective criteria for the evaluation of clustering methods. J.
Amer. Stat. Assoc., 66(336): 846-850.
Salas-Gonzalez, D., Kuruoglu, E. E., and Ruiz, D. P. (2009) A heavy-tailed empirical
Bayes method for replicated microarray data, Comput. Stat. Data Anal., 53:
1535-1546.
Sievers, G. L. and Abebe, A. (2004) Rank estimation of regression coefficients using
iterated reweighted least squares, J. Stat. Comp. Simul., 74, 821-831.
Singh, K. (1991) A notion of majority depth, Technical Report, Department of
Statistics, Rutgers University.
Sokal, R. R. and Michener, C. D. (1958) A statistical method for evaluating
systematic relationships. University of Kansas Science Bulletin, 38: 1409-1438.
Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2002) Diagnosis of multiple
cancer types by shrunken centroids of gene expression, Proc. Natl. Acad. Sci. USA,
99, 6567-6572.
Tukey, J. (1974) Address to the International Congress of Mathematicians, Vancouver.
Tyler, D. E. (1987) A distribution-free M-estimator of multivariate scatter, Ann.
Stat., 15(1): 234-251.
Vardi, Y. and Zhang, C. H. (2000) The multivariate L1-median and associated data
depth, Proc. Natl. Acad. Sci. USA, 97, 1423-1426.
Wilcoxon, F. (1945) Individual comparisons by ranking methods. Biometrics Bulletin,
1, 80-83.