Robust Nonparametric Discriminant Analysis Procedures
Except where reference is made to the work of others, the work described in this
dissertation is my own or was done in collaboration with my advisory committee.
This dissertation does not include proprietary or classi ed information.
Sai Vamshidhar Nudurupati
Certi cate of Approval:
Nedret Billor
Associate Professor
Mathematics and Statistics
Asheber Abebe, Chair
Associate Professor
Mathematics and Statistics
Peng Zeng
Assistant Professor
Mathematics and Statistics
George T. Flowers
Dean
Graduate School
Robust Nonparametric Discriminant Analysis Procedures
Sai Vamshidhar Nudurupati
A Dissertation
Submitted to
the Graduate Faculty of
Auburn University
in Partial Ful llment of the
Requirements for the
Degree of
Doctor of Philosophy
Auburn, Alabama
May 9, 2009
Robust Nonparametric Discriminant Analysis Procedures
Sai Vamshidhar Nudurupati
Permission is granted to Auburn University to make copies of this dissertation at its
discretion, upon the request of individuals or institutions and at
their expense. The author reserves all publication rights.
Signature of Author
Date of Graduation
iii
Vita
Sai Vamshidhar Nudurupati, son of Sri. Sai Baba Nudurupati and Smt. Sarada
Nuduruapti, was born on April 8, 1980, in Visakhapatnam, India. He graduated from
Little Flower Junior College, Hyderabad, India in May 1997. He joined MJCET, Os-
mania University, Hyderabad, India in November 1997 and graduated with a Bachelor
of Engineering in Mechanical Engineering in May 2001. He graduated with a Masters
in Industrial Engineering from Arizona State University in May 2003. He joined the
doctoral program in Department of Mathematics and Statistics at Auburn University
in January 2004 under the guidance of Dr. Asheber Abebe. He is married to Sheetal
Paliwal, daughter of Sri. Hansraj Paliwal and Smt. Savita Paliwal.
iv
Dissertation Abstract
Robust Nonparametric Discriminant Analysis Procedures
Sai Vamshidhar Nudurupati
Doctor of Philosophy, May 9, 2009
(M.S., Arizona State University, 2003)
(B.S., MJCET - Osmania University, 2001)
130 Typed Pages
Directed by Asheber Abebe
In this study, a nonparametric discriminant analysis procedure that is less sensi-
tive than traditional procedures to deviations from the usual assumptions is proposed.
The procedure uses the projection pursuit methodology where the projection index is
the two-group transvariation probability. Montanari (2004) proposed and used this
projection index to measure group separation but allocated the new observation using
simple Euclidean distances from projected centers. Our procedure employs a method
of allocation based on the centrality of the new point measured using two versions of
the transvariation probability: a symmetrized two-group transvariation and a smooth
version of point-group transvariation. It is shown by simulation that the procedures
proposed in this study provide lower misclassi cation error rates than classical pro-
cedures such as linear discriminant analysis and quadratic discriminant analysis and
recent procedures like maximum depth and Montanari?s transvariation-based classi-
ers under a variety of distributional settings. A di erent rank-based procedure for
v
classi cation is considered where ranking is applied on classical classi ers as well as
recently introduced classi ers such as the maximum L1 depth and quadratic discrimi-
nant function based on the minimum covariance determinant (MCD) estimates of the
mean and covariance. An extensive simulation study shows that not only does the
ranking method provide balance between misclassi cation error rates for each group
but also yields lower total probabilities of misclassi cation and higher consistency of
correct classi cation for heavy-tailed distributions. A theoretical evaluation of the
in uence function shows that this new procedure is robust against local in nitesimal
contaminations.
vi
Acknowledgments
I would like to express my earnest gratitude to Dr. Asheber Abebe for his
continuous support and guidance throughout the course of my degree. I would like to
acknowledge my committee members, Dr. Nedret Billor and Dr. Peng Zeng for their
valuable time and support. My sincere gratitude to Dr. Curtis Jolly for acting as an
external reader. I am deeply indebted to my family, who went far beyond their normal
boundaries to provide me with all comforts. I thank them for all the sacri ces they
have made and their devotion. The unconditional love and support of my siblings,
Kiran and Chaitanya helped me accomplish this degree. My parents-in law, deserve
a special mention for their immense support and a ection. Words fail to express my
appreciation to my wife, Sheetal, whose commitment and persistence have helped me
realize my aspirations. Her endless love and patience kept me focused and always
steered me in the right direction. I consider myself fortunate to have her as my life
partner. Thanks to one and all. Finally, thanks to Lord Sri Sairam for his grace and
blessings.
vii
Style manual or journal used Journal of Approximation Theory (together with the
style known as \aums"). Bibliograpy follows van Leunen?s A Handbook for Scholars.
Computer software used The document preparation package TEX (speci cally
LATEX) together with the departmental style- le aums.sty.
viii
Table of Contents
List of Figures xi
List of Tables xii
1 Introduction 1
2 Background 4
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Rank Based Procedures for Classi cation . . . . . . . . . . . . . . . . 10
2.2.1 Projection Pursuit Based Classi ers . . . . . . . . . . . . . . . 10
2.2.2 Procedures Based on Multivariate Ranking . . . . . . . . . . . 20
2.3 Classi ers Based on Robust Estimators of Location and Scale . . . . 27
2.3.1 M Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.2 S Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.3 MCD Estimators . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3.4 Other . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3 Proposed Robust Projection Pursuit Based Allocation Schemes 36
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2 NT-Net . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Allocation Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.1 Allocation Based on Distance . . . . . . . . . . . . . . . . . . 43
3.3.2 Allocation Based on Point-Group Transvariation . . . . . . . . 44
3.4 Proposed Symmetrized Allocation Scheme . . . . . . . . . . . . . . . 46
3.5 Proposed Smoothed Allocation Scheme . . . . . . . . . . . . . . . . . 48
3.6 Application on Real Data and Monte Carlo Simulation . . . . . . . . 53
3.6.1 Application on Real Data Sets . . . . . . . . . . . . . . . . . . 54
3.6.2 Monte Carlo Simulation . . . . . . . . . . . . . . . . . . . . . 56
4 Robust and Balanced Classification 60
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Ranking Discriminant Function . . . . . . . . . . . . . . . . . . . . . 61
4.3 Application on Real Data and Monte Carlo Simulation . . . . . . . . 66
4.3.1 Application on Real Data Sets . . . . . . . . . . . . . . . . . . 66
4.3.2 Monte Carlo Simulation . . . . . . . . . . . . . . . . . . . . . 67
ix
5 Influence Function 72
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2 IF for QDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.3 IF for RD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.4 Sensitivity Curves for the Projection Based Classi er . . . . . . . . . 82
6 Conclusion 86
Appendices 104
x
List of Figures
3.1 An NT-net on S2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 Graphical Depiction of Transvariation and Smoothed Transvariation . 50
5.1 Sensitivity Curve for the Projection Based Methods . . . . . . . . . . 83
5.2 A Zoomed View of the Sensitivity Curve for the Projection Based
Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
xi
List of Tables
1 Application on Real Data: Performance of Projection Based Methods
Using Leave-one-out Misclassi cation Error Rates . . . . . . . . . . . 105
2 Performance of Projection Based Methods Using Monte Carlo Simu-
lation: Average Misclassi cation Error Rates and Standard Errors on
50 Replications of 1000 Test Cases per Group . . . . . . . . . . . . . 106
3 Application on Real Data: Performance of Rank Based Methods Using
Leave-one-out Misclassi cation Error Rates . . . . . . . . . . . . . . . 110
4 Performance of Rank Based Methods Using Monte Carlo Simulation:
Average Misclassi cation Error Rates and Standard Errors for Bivari-
ate Data on 50 Replications of 1000 Test Cases per Group . . . . . . 111
5 Performance of Rank Based Methods Using Monte Carlo Simulation:
Average Misclassi cation Error Rates and Standard Errors for 4D Data
on 50 Replications of 1000 Test Cases per Group . . . . . . . . . . . . 115
xii
Chapter 1
Introduction
Discriminant analysis is the process of devising rules to assign a new individual
data point into one of K (K > 1) known groups. The method is usually based on
previously known information related to the K groups, known as training data whose
correct classi cation information is known. In this dissertation, the focus would be
on the two-group discrimination problem (K = 2). Classi cation or discrimination
has a wide range of applications. A very small list of applications includes: spam
lters for an email engine that sends good emails to the inbox and bad emails to a
spam folder; voice recognition software used to distinguish the source of the voice
from among several speakers; methods to distinguish who is a bad risk for credit and
who is credit worthy; methods to classify a patient?s tumor as cancerous or benign.
A discriminant analysis procedure uses all the variables that the training data
contains and use their correct classi cation information to create a discriminant
model, a discriminant rule or a classi er. Classi cation is done by feeding new ob-
servations into this classi er or model and getting the group membership to which
the new observation belongs. A classi er is said to be a good classi er if it provides
low misclassi cation error rates not just for the best of the situations but also under
various conditions such as shape of the groups, number of groups, size of the groups
etc. Fisher (1936) came up with what is considered to be the rst scienti c discrimi-
nant model. His classi er performs well for data that follow normal distributions, the
1
groups share the same covariance structure and if there are no outliers. Violations of
any of these assumptions may make Fisher?s method unstable. There has been a lot of
research done in the eld of classi cation ever since then in an e ort to come up with
classi ers that are robust, ones that are not sensitive to violations of certain assump-
tions. A wide range of techniques have been used to develop various such classi ers
that are robust to deviations in the parametric, semi-parametric and nonparametric
realms. Most of the parametric methods become sensitive with deviations from their
underlying assumptions. Researchers devised methods that are more robust to such
deviations using semi-parametric and nonparametric techniques which address issues
that existing methods did not consider.
In this dissertation, we propose modi cations to existing semi-parametric classi-
ers in an e ort to create classi ers that are robust to deviations. The rst proposed
method provides two nonparametric alternatives for the allocation process provided
in the classi er by Montanari (2004). The second method is a rank based method
where discriminant functions are ranked and we wish to show that ranking can pro-
vide equal misclassi cation error rates in each group which might be very pertinent
in some situations.
The optimality of our classi ers is shown by comparing the proposed classi er
with existing classi ers. This is done by the use of real data sets and also via extensive
Monte Carlo simulation studies. Sensitivity curves will be used to show the e ect
of local and gross perturbations on the probability of misclassi cation error rate.
2
In uence functions are used to ascertain the robustness of the rank based discriminant
function theoretically.
3
Chapter 2
Background
2.1 Introduction
Let us start with the univariate two sample location problem. SupposeX1;:::;Xnx
are independent and identically distributed (iid) random variables that are normally
distributed with mean x and variance 2 (N( x; 2)) and Y1;:::;Yny iid N( y; 2).
Suppose also that Xi is independent of Yj for i = 1;:::;nx and j = 1;:::;ny.
Our desire is to test the hypothesis H0 : x = y versus the alternative hypothe-
sis H1 : x 6= y. One then may employ the two sample t test that uses the test
statistic
txy =
X Y
Sp
qn
x+ny
nxny
;
where Sp is the pooled standard deviation, to obtain an estimate of the standardized
separation between the two locations. The null is rejected in favor of the alternative if
jtxyjis larger than t (df), the critical value, where is the allowable type-I error rate
and df are the degrees of freedom. Alternatively, one may use the non-parametric
counterpart of the two-sample t test known as the Mann-Whitney test (Hollander
and Wolfe, 1999) which uses the test statistic
uxy =
nxX
i=1
nyX
j=1
(Xi;Yj) ; (2.1)
4
where (r;s) = Ifr 0 :
The function D is known as a discriminant function.
The probability of an observation from x being misclassi ed in y is
PDyjx = PfD(Z;F;G) < 0 jZ Fg
and the probability of misclassifying a random variable from population y into pop-
ulation x is
PDxjy = PfD(Z;F;G) > 0 jZ Gg :
The total cost of misclassi cation is then CxPDyjx+(1 )CyPDxjy where is the prior
probability that an observation comes from x and Cx and Cy are the costs of mis-
classi cation of an observation from x in y and from y in x, respectively. Under
the assumption that the priors are equal and that the costs of misclassi cation being
6
equal for the two populations (Cx = Cy = 1), the total probability of misclassi cation
(TPM) is PD = 12PDyjx+ 12PDxjy. Hereafter, we will assume the costs Cx and Cy to equal
1.
The following two de nitions describe the notions of optimality and robustness
in the framework of discriminant analysis.
De nition 2.1. We say a discriminant function D is optimal if PD PD for
any other discriminant function D and we say that D is more optimal than D if
PD C and to y otherwise.
The cuto C is usually 0 when the prior probabilities are unknown. This method is
built to be optimal in classifying the new observation Z under the assumption that
F and G are both multivariate normal distributions that are di erent in location
7
but have the same scatter. In particular, if F and G are Nd( x; x) and Nd( y; y),
respectively and under the assumption that = x = y.
For the situation where x6= y is true, the optimal procedure uses a quadratic
combination of the d-covariates that maximizes the separation between the popula-
tions x and y. This is known as Quadratic Discriminant Function (QDF) which is
given by:
Q(z;F;G) ln
j
yj
j xj
(z x)0 1x (z x) + (z y)0 1y (z y): (2.4)
This becomes the optimal rule for classi cation under the conditions where a new
observation Z = z is assigned to x if Q(z;F;G) > 0 and to y otherwise.
Given the random samples X and Y, the sample versions of LDF and QDF are
given by
L(z;Fnx;Gny) = ( x y)0^ 1
z 12( x + y)
and
Q(z;Fnx;Gny) = ln
j^ yj
j^ xj
!
(z x)0^ 1x (z x) + (z y)0^ 1y (z y) ;
respectively, where ^ x, ^ y, and ^ are the estimators of x, y, and , respectively.
Linear and Quadratic discriminant functions are optimal under certain conditions
and assumptions and may be very sensitive to any deviation from the assumptions.
Deviations from normality, equal covariance in the case of LDF, existence of outliers
8
in the case of QDF are examples where LDF and QDF are not optimal. In fact the
more serious the deviations and the greater the number of deviations, the more sen-
sitive these methods get. Some work that considered the issue of robustness of LDF
and QDF is Lachenbruch et al. (1973), Lachenbruch (1975), Hills (1967), McLachlan
(1992), Anderson (1984), Dillon (1979), Johnson et al. (1979) and Seber (1984). A
number of these authors investigated the robustness of LDF and QDF with respect to
their TPM to some non-linear transformations of the normal distribution suggested
in Johnson (1949). Hills (1967) looked at discrimination in data that are non-normal,
speci cally when the data are discrete. Lachenbruch et al. (1973) investigated the
performance of LDF under certain non-normality conditions, speci cally, log normal,
logit normal and the inverse hyperbolic sine normal distributions. Optimal misclas-
si cation probabilities in these cases are calculated by taking an appropriate inverse
transformation. In such cases, nding the cuto C theoretically using a minimax rule
is a very di cult problem. So Lachenbruch et al. (1973) determined the value of C,
approximately, by using 25 di erent discrete points. They also found that the trans-
formation makes the two populations heteroscedastic, so they proposed to assign the
new observation Z = z to x if Q(z;F;G) >C and to y otherwise. They provided
some theoretical results in addition to presenting a Monte Carlo study. Their work
and the work of others who extended this research, found that both LDF and QDF
are greatly a ected by these types of non-normality. As a solution to the sensitivity of
the linear and quadratic discriminant functions, several authors have proposed robust
procedures.
9
There are various classi ers, both parametric and nonparametric, which are
proposed in literature, like kernel-based classi cation rule (Mojirsheibani, 2000), k-
Nearest Neighbor classi cation rule (Hellman, 1970), decision trees (Ting, 2002),
neural networks (Pao, 1989), logistic regression (Brzezinski, 1999), support vector
machines (Gunn, 1998), combined classi ers (LeBlanc and Tibshirani, 1996; Mojir-
sheibani, 1999, 1997) just to name a few. In this dissertation, we will consider various
rank-based procedures for performing classi cation in an optimal and robust manner.
2.2 Rank Based Procedures for Classi cation
In this section we will introduce two types of rank-based procedures for classi ca-
tion. The rst will be classi cation based on projection pursuit that uses a rank-based
projection index and the second will be based on multivariate ranking.
2.2.1 Projection Pursuit Based Classi ers
Fisher?s idea of picking a linear or quadratic combination that maximizes the
separation between the two samples is in a way nding the projection that maximizes
the separation between the groups based on a particular criterion. In fact, Fisher?s
LDF could be reframed as nding u, say u0, the projection direction that maximizes
t2(u) =
u0( X Y) 2
u0Spu
nx+ny
nxny
10
where X Nd( x; ), Y Nd( y; ) and S2p = (nx 1)s2x+(ny 1)s2ynx+ny 2 . Here sx and sy are
estimates of standard deviations of X and Y respectively. One observes that this is
the same two sample t test statistic that was given in Equation (2.2).
The data is then reduced to one dimension by projecting it in the direction given
by u0 and one would classify a new observation Z = z into x ifjZ0 X0j 0, then the most \interesting" direction is
given by
u0 = S
1
p ( X Y)
kS 1p ( X Y)k
In other situations, the projection direction is not generally obvious. This search for
\interesting" low dimensional projection of high dimensional data, is more popularly
known as projection pursuit (Friedman and Tukey, 1974; Posse, 1992). \Interesting-
ness" is de ned based on a criterion which is measured through a suitable function
known as a projection index. It turns out that the projection index for LDF is the
two-sample t statistic. Friedman and Tukey (1974) in their study used an algorithm
that associated with each direction in high dimensional space, a continuous index
that measures its \usefulness" as a projection axis. The projection direction is then
11
varied to maximize the index. The hill-climbing algorithms for maximization given
by Rosenbrock (1960) and Powell (1964) were then used as the projection index was
su ciently continuous. Posse (1992) used projection pursuit for 2-group discrimina-
tion and he did it using kernel estimation. He used fast fourier transform (FFT) to
solve the kernel estimation problem of nding the projection direction that is most
interesting.
Projection pursuit is a computationally intensive procedure as it is a thorough
process of searching all the projection directions to nd the most \interesting" pro-
jection. This method gets complicated further as the dimension increases, but this
technique is gaining popularity with the increased attention given to savvy computer
programming techniques and improvements in computer technology.
Any method that e ectively helps reduce the dimension from high to low can
be treated as a form of projection pursuit. Principal component analysis (PCA)
can be treated as a special case of projection pursuit (Huber, 1985). The aim of
classical PCA is to nd a linear combinations of the original variables that have
maximal sample variance (Nason, 1995). The rst principal component a is the
vector that maximizes the variance of the data X projected along that direction.
i.e. a = arg max Varxfa0Xg. Here the projection index happens to be variance of
the projected data. The only other constraint in this method is the need for the
projection directions to be orthogonal to each other. It happens so that the rst
principal component has the highest eigenvalue and hence explains the most variance
than any other direction. In that sense, eigenvalues are used as projection indices.
12
Jones (1983) and Huber (1985) gave strong heuristic arguments indicating that
a projection is less interesting the more it is normal.
In addition to lowering the dimension, projection pursuit also allows us to over-
come problems associated with sparsity of data in high dimensions (Huber, 1985)
which is also termed \curse of dimensionality" (Goldstein, 1987). With increasing di-
mension, the need for more and more data to meet the requirement of su cient data
increases like a curse. Many techniques fail to perform well under the conditions of
sparse data. There are also situations where the number of variables or the dimension
(d) is much higher than the amount of data or number of observation (n). These are
more popularly known as datasets with d n data. As an example (Huber, 1985)
assume that a large number of points is distributed uniformly in the 10-dimensional
unit ball. Then the radius of a ball containing 5% of the points is (0:05)0:1 = 0:74. By
using projection pursuit techniques, one can reduce the dimension to one and elimi-
nate the problem of sparsity of data. As noted in Huber (1985), \Projection pursuit
is the most powerful technique that can lift a one-dimensional technique to higher
dimensions" (Chen, 1989). By this he means that a projection pursuit technique
can be used to reduce the dimension to one and then any one-dimensional statistical
technique can be applied.
Every traditional projection pursuit methodology mainly di ers in the choice of
projection index. There are certain demands of a good projection index (Posse, 1995):
Robust to deviations
13
Approximately a ne invariant
Consistent
Simple enough to permit quick computation even for large data sets.
The problem with Fisher?s LDF is that it uses the t-statistic as a projection index
which is known to be very sensitive to the underlying distributional assumptions. In
an attempt to make Fisher?s LDF more robust to deviations, many researchers have
proposed to use robust version of projection index in place of the t-statistic. Posse
(1990) proposed a projection pursuit technique based on a global optimization algo-
rithm and used a chi-squared projection index to nd the most interesting plane (two
dimensional view). This optimization algorithm was later modi ed by Posse (1995) by
combining it with a structure removal procedure given in Friedman (1987) and used a
modi ed chi-squared index which satis es all the demands of a good projection index.
Chen and Muirhead (1994) and Chen (1989) in his PhD dissertation used various ro-
bust estimates in the likes of the median location estimator (Andrews et al., 1972;
Huber, 1981), trimmed location estimator, Huber type location M-estimator, the me-
dian absolute scale estimator and Huber type scale M-estimator. Montanari (2004)
and Chen and Muirhead (1994) used a two-sample Mann-Whitney type statistic as
a projection index to measure group separation. They show that their projection
pursuit methods are not sensitive to deviations from the homoscedasticity and nor-
mality assumptions that are required for the optimality of Fisher?s LDF. The method
14
proposed by Montanari was motivated by the idea of what is known as transvariation
(Gini, 1916; Montanari, 2004).
Transvariation
Consider two continuous univariate populations x and y with distributions F
and G, respectively, de ned on R. Suppose we have a random sample X1;:::;Xnx
from x and, independent of the rst sample, a random sample Y1;:::;Yny from y.
The two samples are said to transvariate with respect to their measures of centers
mx and my if there is at least one pair (i;j) such that (Xi Yj)(mx my) < 0,
i = 1;:::nx; j = 1;:::;ny. Any di erence satisfying this condition is called a two-
group transvariation. Similarly, the sample X1;:::;Xnx and a given constant c2R
transvariate with respect to mx, if there is at least one i such that (Xi c)(mx c) < 0,
i = 1;:::;nx. This is known as a point-group transvariation.
The two-group transvariation probability between F and G is de ned as
xy := (F;G) = Pf(Y X)( y x) < 0g
=
Z
R
Z
R
If(y x)( (G) (F)) < 0gdF(x)dG(y) ; (2.5)
where X F, Y G and x = (F), y = (G) are the location parameters of
F and G, respectively. If Fnx and Gny be the two empirical distributions of the two
15
random samples. Then an estimate of xy is given as
Txy := (Fnx;Gny) =
Z
R
Z
R
If(y x)( (Gny) (Fnx)) < 0gdFnx(x)dGny(y)
= 1n
xny
nxX
i=1
nyX
j=1
If(Xi Yj)(mx my) < 0g;
where IfAg= 1 if A is true and is 0 otherwise. Here mx = (Fnx) and my = (Gny)
are estimators of x and y, respectively. Txy is a nonparametric estimator of the
overlap between the distributions Fx and Fy. In particular, nxnyTxy gives the number
of observations that need to be interchanged so that there will be no overlap between
the two samples. If we assume without loss of generality that y < x, then xy =
P(X DF(x2;F).
This is true for any depth function DF. Let F be the class of distributions on the
Borel sets of Rd. A statistical depth function is a bounded, nonnegative mapping
D : Rd F!R and there are certain properties that are desired of depth functions
(Zuo and Ser ing, 2000; Hoberg, 2003):
A ne Invariance: The depth of a point x 2 Rd should not depend on the
underlying coordinate system or, in particular, on the scales of the underlying
21
measurements.
DF(Ax + b;FAX+b) = DF(x;FX)
Maximality at center: For a distribution having a uniquely de ned \center"
(e.g., the point of symmetry with respect to some notion of symmetry), the
depth function should attain maximum value at this center. If is the center
of F, then
DF( ;F) = sup
x2Rd
DF(x;F)
Monotonicity relative to deepest point: As a point x 2 Rd moves away from
the \deepest point" (the point at which the depth function attains maximum
value; in particular, for a symmetric distribution, the center) along any xed
ray through the center, the depth at x should decrease monotonically.
DF(x;F) DF( + (x );F) for 2[0;1]
Vanishing at in nity: The depth of a point x should approach zero as kxk
approaches in nity.
DF(x;F)!0 as kxk!1
The interested reader may nd an extensive list of depth functions along with
their de nitions in Liu et al. (1999), Zuo and Ser ing (2000) or Ghosh and Chaud-
huri (2005). Among the numerous depth functions that are in existence, Mahalanobis
22
depth and L1 depth (spatial depth) are two of the most attractive ones due to their
ease of computation. They can be computed exactly for any dimension. The com-
putation of many other depth functions may require algorithms that provide only
approximations. This is especially true for higher dimensional data. For example,
one usually has to construct very complicated approximate algorithms to compute
the halfspace depth of points in three or higher dimensions.
Taking advantage of this notion of ordering multivariate data in a center-outward
manner, J ornsten (2004) proposed the maximum L1 depth classi er that uses the
discriminant function
S(z;F;G) =D1(z;F) D1(z;G) =
EG
z Y
kz Yk
EF
z X
kz Xk
=
Z
Rd
z y
kz ykdG(y)
Z
Rd
z x
kz xkdF(x)
: (2.9)
The new observation Z = z is then classi ed in x if S(z;F;G) > 0 and in y
otherwise. Despite its computational ease, a major drawback of this classi er is that
it lacks a ne invariance because L1 depth is not a ne invariant. It can, however, be
made a ne invariant by taking 1=2x (z X) and 1=2y (z Y) in place of z X and
z Y, respectively, in equation (2.9) (Vardi and Zhang, 2000; Ghosh and Chaudhuri,
2005). Note that one can use any a ne equivariant estimators of x and y when
computing the discriminant function. If the scatter estimator of Tyler (1987) is used,
then the resulting maximum L1 depth classi er is very similar to the classi er given by
Crimin et al. (2007). An alternative method of obtaining a ne invariance is to scale
23
the data along its principal component directions (PCA-scaling) as given in Hugg
et al. (2006). One could use robust principal component analysis (eg: Robust PCA
given by Croux et al. (2007)) or scale the data with the robust estimate of covariance
structure which will make the L1 depth function a ne invariant in addition to making
it robust against deviations.
Once again, for practical purposes, given two independent random training sam-
ples X1;:::;Xnx and Y1;:::;Yny from x and y, respectively, de ned on Rd (d 1),
the sample version ofD1(x;F) given in (2.8) can be found by replacing the empirical
cdf in place of F and G resulting in the sample version of S(z;F;G) given by
S(z;Fnx;Gny) =
Z
Rd
z y
kz ykdGny(y)
Z
Rd
z x
kz xkdFnx(x)
=
1
ny
nyX
j=1
z Yj
kz Yjk
1
nx
nxX
i=1
z Xi
kz Xik
:
It must be noted that the maximum L1 depth classi er is in the class of classi ers
known as maximum depth classi ers (Ghosh and Chaudhuri, 2005) in that any depth
function DF can be used in place of the L1 depth function. The optimality of the
classi er is dependent on the choice of the depth function. The choice of depth
functions could be based on various properties like robustness. You obtain QDF if
Mahalanobis depth is used in place of the L1 depth in (2.9). One would assign the
new observation Z = z in x if DF(z;F;G) > 0 and in y otherwise.
In Billor et al. (2008), it is shown that the robustness of the maximum depth
classi er can be improved if one considers the largest order of depth instead of just
24
the maximum depth. They de ne the transvariation probability between the depth
of Z and the depth of X relative to the distribution of X as
x(Z) = Pf(DF(X;F) DF(Z;F))(DF( x;F) DF(Z;F)g< 0)
= PfDF(X;F) 0. This
classi er is robust just like most rank-based procedures and can be directly extended
to more than two groups unlike the transvariation-based method.
The use of data depth for classi cation purposes has also been considered by
Mosler and Hoberg (2003). They de ne a combination of two existing depth functions
zonoid and Mahalanobis to come up with a new depth function and they call it
zonoid-Mahalanobis depth. Mahalanobis depth is parametric depth function that is
sensitive to deviations, while zonoid depth function is nonparametric and based on
convex hulls. In this regard, halfspace, simplicial and convex-hull peeling depths are
in the same class of depths as zonoid depth. A drawback of such depth functions is
that they vanish outside the convex hull, so points lying outside the convex hulls of
all classes cannot be classi ed using such depth functions. The zonoid-Mahalanobis
depth function uses a combination as in one would use the zonoid depth as long as
the point lies inside at least one of the convex hulls, else the Mahalanobis depth is
used for classi cation. Halfspace depth based classi cation for two populations has
been suggested by Christmann and Rousseeuw (2001) and Christmann et al. (2002).
Donoho and Gasko (1992) showed that the computation of HD in higher dimensions
can be performed using the projection pursuit method. In particular, the sample
half-space depth in one-dimension is de ned as
HD1(x;X) = min(#fi : Xi xg;#fi : Xi xg)
26
and for x2Rd the sample half-space depth is de ned as,
HDd(x; X) = min
kuk=1
HD1(u0x; u0X) (2.10)
where the dataset u0X is the one-dimensional projection of the d-dimensional dataset
X.
This means that the maximum depth classi er based on the half-sapce depth
(Ghosh and Chaudhuri, 2005) is equivalent to using projection pursuit with point-
group transvariation (Montanari, 2004) and allocating using a minmax rule. To
that end, let F and G be the data generating distributions and Fu and Gu be the
distributions of the projected data as de ned above. Then using the de nition given
in Equation (2.6) we can de ne the discriminant function
Dpgt(z;F;G) = min
kuk=1
Z
R
If(x u0z)( (Fu) u0z) < 0gdFu(x)
min
kuk=1
Z
R
If(y u0z)( (Gu) u0z) < 0gdGu(y)
and allocate z in x if Dpgt(z;Fnx;Gny) > 0.
2.3 Classi ers Based on Robust Estimators of Location and Scale
The reason LDF and QDF are sensitive to deviations from underlying assump-
tions is due to the fact that the estimators of mean and covariance that they use
are sensitive to such deviations. So a natural idea is to replace these quantities with
27
robust quantities to attain robustness in the classi er. Perhaps the earliest such pro-
cedure was proposed by Randles et al. (1978b) where Huber?s M estimates of the
mean vector and covariance matrix (Huber, 1977) are used in place of the mean and
covariance used in LDF and QDF.
2.3.1 M Estimators
We will give a brief discussion of Huber?s M estimates in the following. M
estimates are solutions, ^ , that maximize
nX
i=1
(xi; ) ;
where is a function that depends on the sample and a vector of parameters . The
properties of this function are discussed below. Often it is simpler to di erentiate
with respect to and solve for the root of the derivative. When this di erentiation
is possible, the M-estimator is said to be of -type. Otherwise, the M-estimator is
said to be of -type.
type
For positive integer d, let (X; ) and ( Rd;S) be measure spaces. 2 is a
vector of parameters. An M-estimator of -type T is de ned through a measurable
function : X ! R. It maps a probability distribution F on X to the value
28
T(F)2 (if it exists) that minimizes RX (x; )dF(x) :
^ = T(F) := arg min
2
Z
X
(x; )dF(x) :
type
If is di erentiable, then the computation of ^ is usually much easier. An M-
estimator of -type T is de ned through a measurable function :X !Rd. It
is assumed that the true parameter satis es RX (x; )dF(x) = 0. Then a -type
M estimator T(F) is de ned implicitly as the solution of the vector equation
Z
X
(x;T(F))dF(x) = 0 :
For many choices of or , no closed form solution exists and an iterative approach
to computation is required. It is possible to use standard function optimization algo-
rithms or an iteratively re-weighted least squares tting algorithm typically happens
to be the preferred method.
2.3.2 S Estimators
S-estimator based classi ers are given by He and Fung (2000) and Croux and De-
hon (2001). S estimators were rst de ned in the context of regression by Rousseeuw
and Yohai (1984). Let (x; a;C) = f(x a)0C 1(x a)g12 for any x 2Rd, a 2Rd
and C 2 S(d), where S(d) is the set of all symmetric positive de nite matrices in
29
Rd Rd. Let s(a;C) be the solution of
1
m
mX
i=1
(x
i; a;C)
s(a;C)
= E
n
(jjxjj12 )
o
;
where is such that (0) = 0, where is symmetric about 0 and is nondecreasing
on (0;c) and constant on (c;1) for some constant c > 0. Here the expectation on
the right hand side is taken at the standard d-variate normal distribution. Now let
(a ;C ) be the minimizers of s(a;C) subject to det(C) = 1, then
~ S = a and
~ S = s(a ;C )C
are the S-estimators of and .
2.3.3 MCD Estimators
The MCD (Maximum Covariance Determinant) based LDF and QDF given in
Hubert and Van Driessen (2004), the classi er based on a robust version of the
Lawley-Hotelling test uses the spatial median estimator of Hettmansperger and Ran-
dles (2002) and the related scatter estimator of Tyler (1987) given by Crimin et al.
(2007).
Hubert and Van Driessen (2004) used the re-weighted MCD estimate of mul-
tivariate location and scale (Rousseeuw, 1984, 1985), because of its good statistical
30
properties and the FAST-MCD algorithm (Rousseeuw and Van Driessen, 1999) which
provides an e cient algorithm of computing the estimates for large data sets.
For the X sample, the MCD estimator is de ned as the mean ^ x;0 and the
covariance matrix Sx;0 of the hx observations (out of nx) whose covariance matrix has
the lowest determinant. The quantity hx should be larger than b(nx p + 1) = 2c
and nx hx should be smaller than the number of outliers in the X population. With
this choice the MCD attains its maximal breakdown value of b(nx p + 1) = 2c
50%. The breakdown value of an estimator is de ned as the largest percentage of
contamination it can withstand (Rousseeuw and Leroy, 1987). If one suspects, less
than 25% contamination in the X sample, it is advised to take hx 0:75nx as this
yields a higher nite-sample e ciency (Croux and Haesbroeck, 2000). Based on the
initial estimates ^ x;0 and Sx;0 one computes, for each observation xi, its (preliminary)
robust distance (Rousseeuw and Vanzomeren, 1990)
RD0xi =
q
(xi ^ x;0)0S 1x;0(xi ^ x;0) ; i = 1;:::;nx
They assign weight 1 to xi if RD0xi
q
2p;0:975 and weight 0 otherwise. The
reweighted MCD estimator is then obtained as the mean ^ x;MCD and the covari-
ance matrix ^ x;MCD of those observations with weight 1. It is shown by Croux and
Haesbroeck (2000) that this reweighting step increases the nite-sample e ciency of
the MCD estimator considerably, whereas the breakdown value remains the same.
This can also be used to ag outliers and so can be used to detect outliers.
31
The Robust Quadratic Discriminant Rule (M) is: Z = z2 y if dM2 (z) >dM1 (z)
where
dMx (z) = 12lnj^ x;MCDj 12(z ^ x;MCD)0^ 1x;MCD(z ^ x;MCD) (2.11)
and z2 x otherwise. The quantity dMy (z) is de ned analogously as dMx (z).
Robusti ed Fisher?s linear discriminant rule (RLDR), which can be described as
z2 x if
(^ x ^ y)0^ 1(z (^ x ^ y)=2) > 0;
and z2 y otherwise.
To construct RLDR they rst look for initial estimates of the group means and
the common covariance matrix, denoted by ^ x;0, ^ y;0 and ^ 0. This will already yield
a discriminant rule based on ^dRLx (z; ^ x;0; ^ 0) and ^dRLy (z; ^ y;0; ^ 0). We will then also
consider the reweighting procedure based on the robust distances
RD0xi =
q
(xi ^ x;0)0^ 10 (xi ^ x;0) ; i = 1;:::;nx
and
RD0yj =
q
(yj ^ y;0)0^ 10 (yj ^ y;0) ; j = 1;:::;ny
For each observation in X sample, let wxi = 1 if RD0xi
q
2p;0:975 and wxi = 0
otherwise. wyj are de ned similarly for the Y sample. The nal estimates are then
obtained as the mean and the pooled covariance matrix of the observations with
32
weight 1, i.e.
^ x =
Pnx
i=1 wxixiPn
x
i=1wxi
; ^ y =
Pny
j=1wyjyjP
ny
j=1wyi
;
^ =
Pnx
i=1 wxi(xi ^ x)(xi ^ x)
0 +Pny
j=1 wyj(yj ^ y)(yj ^ y)
0
Pnx
i=1 wxi +
Pny
j=1wyj
(2.12)
and the resulting linear discriminant rule is then based on ^dRLx (z; ^ x; ^ ) and
^dRLy (z; ^ y; ^ ).
To obtain the initial covariance estimate ^ 0, they consider three di erent meth-
ods. The rst approach is straightforward, and has been applied by Chork and
Rousseeuw (1992) using the Minimum Volume Ellipsoid estimator (Rousseeuw, 1984),
and Croux and Dehon (2001) using S-estimators (Rousseeuw and Yohai, 1984). The
MCD estimates ^ x;MCD, ^ y;MCD, ^ x;MCD and ^ y;MCD are obtained, and then the
individual covariance matrices are pooled, yielding
^ PCOV = nx^ x;MCD +ny^ y;MCD
nx +ny
For the second approach, they adapt one of the proposals of He and Fung (2000)
who use S-estimators to robustify Fisher?s linear discriminant function. The idea is
based on pooling the observations instead of the group covariance matrices.
The third estimator combines the two previous approaches and is aimed to nd a
fast approximation to the Minimum Within-group Covariance Determinant criterion
of Hawkins and McLachlan (1997). Instead of applying the same trimming proportion
to each group, they proposed to nd the h observations out of the whole data set of
33
size n, such that the pooled within-group sample covariance matrix ^ H has minimal
determinant. The algorithm described in Hawkins and McLachlan (1997) is very
time-consuming because it is based on pairwise swaps. Hubert and Van Driessen
(2004) proposed the fast approximation for two groups which is much faster than
the algorithm given in Hawkins and McLachlan (1997). It has to be noted that this
algorithm can fail if some of the groups are very small where there is a possibility
that the nal subset H does not contain p+ 1 observations from each group, making
those group covariance matrices singular.
Similar to Hubert and Van Driessen (2004), Joossens and Croux (2004) performed
simulation studies to compare LDF and QDF with the MCD based LDF and QDF
as well as the S-estimator based classi ers.
2.3.4 Other
There are other estimates like the R-estimates and the L-estimates (Ser ing,
1980) which could also be used in classi cation in a similar fashion.
An issue that was discussed in Lachenbruch et al. (1973) is the problem of im-
balance between the two misclassi cation error rates PDyjx and PDxjy resulting from the
use of D = QDF and D = LDF when the parent populations are non-normal. Ideally
one would want PDyjx = PDxjy or a situation where a classi er can de nitely con rm
to provide lower misclassi cation rates for the samples of choice. Since this is not a
possibility, the only logical option is to maintain the balance in tact. A remedy for
this undesirable property of classi ers was given in Randles et al. (1978a) and Ng and
34
Randles (1983). In these papers, it is shown that balance is attained (asymptotically)
through ranking of discriminant functions such as LDF and QDF while keeping the
TPM relatively low. Randles et al. (1978b) use this ranking technique on robust
linear and quadratic discriminant functions that use M estimates and they showed
that the resulting classi er is robust compared to LDF and QDF and balance.
35
Chapter 3
Proposed Robust Projection Pursuit Based Allocation Schemes
3.1 Introduction
There are two main aspects of a projection pursuit technique - a projection
index and the most e cient way to search for the best projection direction. Works
in two-dimensional projection pursuit are discussed in Friedman and Tukey (1974),
Jones (1983), Jones and Sibson (1987) and Friedman (1987), and suggestions are
made by Yenyukov (1988), Yenyukov (1989). An e ective algorithm for the same
was provide by Posse (1990). He uses the chi-square measure as his projection index.
He uses random search for projection directions to nd local optima and restart the
search process repeatedly to make sure they nd the optimum. The two-dimensional
problem might not be as complicated for the modern computational power, but as the
dimension increases there is a need for more e cient projection pursuit algorithms.
Posse (1992) used the algorithm given in Huber (1989) which is similar to that given
by Posse (1990). A quickly generated random walk on the d-dimensional hypersphere
Sd sweeps the entire space where areas of interest are identi ed to nd the local
optima. This happens to be very e cient as it saves computational time while still
maintaining the e ciency of the algorithm.
For our projection pursuit based classi er, we use two-group transvariation (Gini,
1916) as projection index. We use similar principles of Posse (1992) where we sweep
36
the high dimensional space and then nd the projection direction that maximizes
separation by minimizing two-group transvariation. In order for the sweep to be suc-
cessful, the projection directions need to be uniformly distributed on the hypersphere.
One such method that allows us to do this is NT-net given by Fang and Wang (1994).
3.2 NT-Net
To perform projection pursuit in Rd we need points that are \uniformly scat-
tered" on the surface of the unit hypersphere Sd 1 as given in Fang and Wang (1994).
If W NIDd(0;Id), then
(W0W) 1=2W
is uniformly distributed on Sd 1.
If V1;:::;Vk is a number theoretic net (NT-net) on the d 1 dimensional unit
cube Cd 1, then
fWi =S(V1;:::;Vk) ; i = 1;:::;kg;
where S is the spherical coordinate transformation operator, is an NT-net on
Sd 1.
Let = ( 1;:::; d 1) be a point on Cd 1. We use the spherical coordinate
transformation
Xj =
j 1Y
i=1
SiCj; j = 1;:::;d 1;
37
Xd =
d 1Y
i=1
Si;
where
Si = sin( i); Ci = cos( i) i = 1;:::;d 2;
Sd 1 = sin(2 d 1); Cd 1 = cos(2 d 1:)
Then X = (X1;:::;Xd) is a point on Sd 1. If 1;:::; k forms an NT-net on
Cd 1, then X1;:::;Xk forms an NT-net on Sd 1. For example, to obtain an NT-net
for d = 3, iff(vi1;vi2) ; i = 1;:::;kgis a NT-net on [0;1]2, thenf(wi1;wi2;wi3) ; i =
1;:::kg is a NT-net on S2 as given in Figure 3.1, where
wi1 = 1 2vi1
wi2 = 2
p
vi1(1 vi1) cos(2 vi2)
wi3 = 2
p
vi1(1 vi1) sin(2 vi2)
To measure whether the points on the hypersphere are uniformly distributed,
measures such as F-Discrepancy are used (Fang and Wang, 1994). Let F(x) be a cdf
in Rd and } = xk;k = 1;:::;n be a set of points on Rd, then
F(n;}) = sup
x2Rd
jFn(x) F(x)j
is called the F-Discrepancy of } with respect to F(x), where Fn(x) is the empirical
distribution of x1;:::;xn. F-Discrepancy is a measure of the representation of } with
38
Figure 3.1: An NT-net on S2
39
respect to F(x). We chose to use NT-net to provide us with our projection directions
for our projection pursuit method because of this property.
The interested reader may nd alternative ways of generating projection direc-
tions in Marsaglia (1972), where three di erent Monte Carlo point picking methods
are given.
It is well known that projection pursuit based methods are computationally ex-
pensive. Although, advanced computational techniques can be e ectively used to
minimize the number of projections considered without a ecting the e ciency of the
technique (Filzmoser et al., 2006). Inspired by Posse (1992), we provide one such al-
gorithm that would brie y sweep the entire space using a small number of projection
directions and then localize to nd the best projection direction that optimizes the
projection index. We use two-group transvariation as a projection index as given in
Montanari (2004).
Algorithm:
Step 1: Uniformly spread points that cover the entire unit hyper cube Cd 1 are
generated.
Step 2: These points on the unit hyper cube are spherically transformed onto a unit
hyper sphere Sd 1 as described above. These points are uniformly distributed
as given in Fang and Wang (1994) using the F-discrepancy criteria.
Step 3: The projection index for each of the projections on the hyper sphere is calcu-
lated and the projection that provides the optimum projection index is chosen.
40
The optimum projection is the direction that produces maximum separation or
minimum overlap between the training samples.
Step 4: In the case of more than one projection direction providing the optimum
projection index, the projection direction that provides the most spread for the
data is considered. One can choose a robust version of spread to make the
algorithm more robust.
Step 5: The point on the unit hyper sphere that provides the projection direction in
?Step 4? is identi ed and easily traced back to the point on the unit hyper cube.
In the unit hyper cube, the points immediately surrounding this point are used
to create more points. The number of points generated here could be the same
as in ?Step 1?.
Step 6: A grid of uniformly spread points is now generated on this localized region
of the unit hyper cube.
Step 7: These points on the localized region of the unit hyper sphere are spherically
transformed, which now form a unit hyper arc instead of a unit hyper sphere.
This gives a localized view of the region that gave the optimum projection
direction in the previous iteration.
Step 8: ?Step 3? through ?Step 7? are then repeated until a convergence criterion is
reached. If transvariation probability is used as a projection index, the conver-
gence criterion could be to continue until the projection index stops decreasing.
41
It has to be noted that the convergence criterion will take you to the projection
direction that gives a local optimum projection index. For data that follows a certain
unimodal shape, there will only be a global optimum for the projection index and
the algorithm takes you to the projection direction that achieves this global opti-
mum. Since transvariation is discrete, the algorithm ensures that there exists only
one projection direction that provides the least two-group transvariation.
The number of points needed per dimension for this algorithm depends on the di-
mension under consideration and the computational power. While a thorough search
would con rm an optimum projection index, Monte Carlo simulations done at the
end of this chapter have shown us that for certain unimodal distributions, 5 projec-
tion directions per dimension seem to be more than su cient. We tried the process
with 7 and 9 projection directions per dimension but did not see any noticeable im-
provement. If 5 projection directions per dimension are used, the total number of
projection directions would be 5d 1, which means that as the dimension increases,
the number of projection directions required also increases. The beauty of using NT-
net for projection directions is that you do not have to look at the entire region of
space at one time. You could break the entire space that needs to be explored into
regions, which can be controlled easily by controlling the points on the unit hyper
cube.
Once the vector ^uopt that gives us the most interesting view of the data in a
single dimension is found using (2.7), we project the test samples in this direction
42
and allocate the new observation into one of the two groups based on a particular
allocation scheme.
3.3 Allocation Schemes
We will consider four di erent schemes of allocation in this study. The rst
method was proposed by Montanari (2004) and is based on projected distances of the
new observation from the centers of the training samples. The second is a suggestion
given in Montanari (2004), wherein a nonparametric allocation scheme was proposed
and immediately abandoned. We will discuss the reason why she had to abandon her
suggested scheme and we will provide two alternative allocation schemes which are
completely nonparametric. These are based on the positions of the projected points
relative to the projected centers of the training samples.
It must be noted that all the procedures given in this chapter are a ne invariant.
Invariance to translations is immediate. Moreover, any rotation of the data will result
in the same rotation of the projection direction that gives the \most interesting"
view and the resulting projected data do not depend on the coordinate system used.
A formal argument showing the a ne invariance of these procedures is found in
Lemma 2.1 of Donoho and Gasko (1992) and Theorem 2.1 of Zuo and Ser ing (2000).
3.3.1 Allocation Based on Distance
Given two independent random training samples X1;:::;Xnx and Y1;:::;Yny
from x and y, respectively, de ned on Rd (d 1), a new observation Z = z is
43
classi ed in x ifj^u0optZ mx(^uopt)jTy(Z); otherwise, it is assigned to y where
Tx(Z) = 1n
x
nxX
i=1
If(^u0optXi ^u0optZ)(mx(^uopt) ^u0optZ) < 0g and
Ty(Z) = 1n
y
nyX
i=1
If(^u0optYi ^u0optZ)(my(^uopt) ^u0optZ) < 0g (3.1)
As argued earlier, this allocation scheme is based on ranks. This gets rid of
the non optimality problem that TD has when skewed distributions are considered.
Although this allocation scheme makes this method completely nonparametric and
works better than TD for skewed distributions, it does not perform as well for data
with unequal sample sizes. This is due to the fact that an equal prior restriction is
imposed by counting and we neglect group two(one) when we nd the ranking of the
new point in group one(two). So the priors are not necessarily taken into account and
the e ect shows in the misclassi cation error rate especially when the sample sizes
are unequal. Montanari (2004) abandoned this scheme for this very reason. Adding
to the problem of priors that this allocation scheme has is another problem of rather
high likelihood of ties between Tx and Ty given in (3.1). The likelihood of ties is the
most noticeable in equal sample sizes cases and these cases happen to be the only
cases that this allocation scheme works e ciently. This problem of ties requires some
kind of a tie breaking strategy. The one employed by us is to randomly assign the
observation into one of the groups using a coin ip. The classi er obtained using this
allocation scheme will be referred to as Point-Group Transvariation (PGT) classi er.
45
3.4 Proposed Symmetrized Allocation Scheme
The schemes, TD and PGT, one being parametric and the other being nonpara-
metric, still have problems with skewed cases and problems with unequal sample
size cases, respectively. In spite of being nonparametric, the PGT scheme did not
include any information about the second group when nding point-group transvari-
ation between the new observation and the rst group. We needed to come up with a
new allocation scheme that considers both the groups while nding a nonparametric
alternative to the existing schemes.
We had a nonparametric way of looking at two groups at the same time: the
two-group or group-group transvariation de ned by Gini (1916). The next issue we
faced was to somehow include the new observation in the calculation of the group-
group transvariation and be able to use it for allocation. So we propose a new
nonparametric alternative that is based on ranking of the new observations in the
groups while considering both the groups.
To allocate a new observation Z, we de ne X =fX1;:::;Xnx;Zgand similarly
de ne Y =fY1;:::;Yny;Zg. The idea is to nd Tx y, the transvariation probability
between X and Y given by
Tx y = 1(n
x + 1)ny
X
x 2X
X
y2Y
If(^u0optx ^u0opty)(mx(^uopt) my(^uopt)) < 0g (3.2)
46
and Txy , the transvariation probability between X and Y given by
Txy = 1n
x(ny + 1)
X
x2X
X
y 2Y
If(^u0optx ^u0opty )(mx(^uopt) my(^uopt)) < 0g (3.3)
and see the e ect of the new observation on the quantities Tx y and Txy . These two
quantities have Txy in common, where
Txy = 1n
xny
X
x2X
X
y2Y
If(^u0optx ^u0opty)(mx(^uopt) my(^uopt)) < 0g; (3.4)
is the transvariation probability between X and Y as given in (2.5). The quantity
(nx + 1)nyTx y nxnyTxy is the number of observations in group X with which the
new observation transvariates. Similarly, the quantity nx(ny + 1)Txy nxnyTxy is
the number of observations in group Y with which the new observation transvariates.
So, the di erence Tx y Txy is due to the addition of the new observation Z in X
and the di erence Txy Txy is due to the addition of the new observation Z to
Y. These di erences can be thought of as the contributions that Z makes towards
the transvariations if it were to belong in the groups in which it is placed. These
di erences could be positive or negative based on the location of the new observation
but cannot both be positive or both negative for a particular new observation. If Z
belongs to the group it was placed in, it will not contribute to the transvariation with
the other group and thus creating a negative di erence, else it would create a positive
di erence.
47
A negative di erence indicates that the new observation Z is placed in the correct
group. So a new observation Z = z is allocated to x if Tx y Txy < Txy Txy,
else it is allocated to y. Since both the di erences contain Txy, regardless of the
di erences, we allocate the new observation to x if Tx y Ty(z) where
Tx(c) = 1n
x
nxX
i=1
1 Kxf(^u0optXi ^u0optz)(mx(^uopt) ^u0optz)g
and
Ty(c) = 1n
y
nyX
j=1
1 Kyf(^u0optYj ^u0optz)(my(^uopt) ^u0optz)g
; (3.5)
else classify it in y.
-3 -2 -1 0 1 2 3
0
0 . 1
0 . 2
0 . 3
0 . 4
0 . 5
0 . 6
0 . 7
0 . 8
0 . 9
1
X
W
e
ig
h
t
S M O O T H - P G T
P G T
Figure 3.2: Graphical Depiction of Transvariation and Smoothed Transvariation
Kx and Ky need not necessarily be CDFs. They can be taken to be functions that
satisfy certain requirements or any CDF de ned on R with pdf kx and ky, respectively,
50
that is continuous on ( ; ) for some > 0 and kx(0) > 0 and ky(0) > 0. It is not
a requirement that both the samples need to follow a certain shape. Note that PGT
uses the CDF of the bernoulli random variable I(X 0).
The conversion of a 0 or 1 transvariation into a number in the interval (0, 1) is
based on a smoothing parameter. Smoothing works well in avoiding the problem that
PGT had with allocation in the case of unequal sample sizes. Kernel smoothing was
used for classi cation by Mojirsheibani (2000). In our study, we use the CDF of the
t-distribution for K with degrees of freedom (df) as the smoothing constant. This
gives us a wide range of CDFs with varying scale that could be used to t various
distributions.
For a data set with training and testing samples, the process of nding the
optimum smoothing constant (df) is as follows:
Algorithm
Step 1: After nding the optimum projection direction using NT-net, the data are
projected onto this direction. You would then allocate the testing sample using
equation (3.5).
Step 2: To allocate, you need the df that de nes the t-CDF, which provides the best
smoother for that particular shape of data. It makes a lot of sense to use two
di erent smoothing df, one for each group of data. We use the training sample
to train and nd the best smoother for each group.
51
Step 3: Finding the smoother theoretically is a very di cult problem. In fact, there
exists no closed form solution when the t-CDF is used in place of the indi-
cator function in transvariation. Instead we use a bivariate grid, containing
df(df1;df2), one for each group.
Step 4: We then use the training data and apply leave-one-out cross validation to
nd the probability of misclassi cation for each possible pair of df. The combi-
nation that provided the least probability of misclassi cation is then picked as
the pair of smoothing constants.
Step 5: We nally use equation (3.5) with the smoothing constants found in Step 4,
to nd the total probability of misclassi cation for this data.
When the data are presented as a single sample, as is the case for real data
sets, we remove one observation from the data and try to use the rest of the data
as a training sample to allocate this observation. This is like leave-one-out cross
validation. So, we end up using a double one-at-a-time cross validation, once for
nding the misclassi cation for the data and again to nd the best set of smoothing
df for the training sample.
We tried the normal cdf which worked really well for distributions that are normal
or near normal. As the distributions deviated from normality, the e ciency of the
normal cdf with the spread as the smoother went down. Based on our simulations,
this was partly because normal distribution with sigma as a smoother, lacks the range
to cover a variety of distributions. This was evident when most of the smoothing
52
constants ended up being very close to zero, which makes the normal distribution
almost like a point mass and hence a transvariation. In such cases, smooth becomes
PGT. We chose to use t-cdf with df as a smoother for its wide range. We are convinced
that a better smoother exists out there and we leave the solution of nding it as an
open problem.
We refer to the classi er that uses this allocation scheme as smooth-PGT (SPGT).
This classi er has all the positives of PGT but gets rid of the problem with PGT, that
is, with unequal sample sizes. By counting transvariations, PGT puts a restriction of
equal priors. As discussed above, transvariation with one group is scaled by 1nx, while
the transvariation with the other group is scaled by 1ny. This creates a problem when
the sample sizes are not equal, especially when the sample sizes are highly unequal.
In the case of the SPGT allocation scheme, each transvariation is smoothed, in our
case using a t-CDF. This takes care of the shape of the distributions by weighing each
point based on their relative position and the distance from the new observation. The
two smoother df0s de ning the t-CDf?s, which provide the least misclassi cation are
found using the training samples. This process of nding the best smoother that
smoothes the data, closes the gap between the di erence in sample sizes.
3.6 Application on Real Data and Monte Carlo Simulation
We would like to show the optimality of the proposed classi ers by applying the
methods on some real data sets and a variety of simulation settings.
53
3.6.1 Application on Real Data Sets
The data sets that we chose to consider are
Leukemia : This dataset is given by Golub et al. (1999) and comes from a study of
gene expression in two types of acute leukemias: acute lymphoblastic leukemia
(ALL) and acute myeloid leukemia (AML). The dataset is available at
http://www.genome.wi.mit.edu/MPR
Gene expression levels were measured using A ymetrix high-density oligonu-
cleotide arrays containing 7129 human genes. The data are comprised of 47
cases of ALL (38 B-cell ALL and 9 T-cell ALL) and 25 cases of AML. We used
the method of gene selection given by Nguyen and Rocke (2002) that uses a
t-statistic to select expressed genes. After preprocessing, the data are sum-
marized by a 72x1751 matrix. To reduce the dimension of the data set, we
considered the rst 10 principal components.
Colon : This data set contains gene expression in 40 tumor and 22 normal colon
tissue samples analyzed with an A ymetrix oligonucleotide array (Alon et al.,
1999). The expression level of 6500 genes were measured and the 2000 genes
with the highest minimal intensity were retained. We again applied the gene
selection method given by Nguyen and Rocke (2002) to select the genes that
exhibit maximum variation among the 62 tissues. The resulting data matrix
contained 250 genes. We then kept the rst 9 principal components for the
purpose of classi cation.
54
Fisher?s Iris : This is the classic Iris dataset given by Fisher (1936). It consists of
three species (Setosa, Versicolor, and Virginica) of Iris owers with 50 observa-
tions of each species. The dataset contains four variables: sepal length, sepal
width, petal length, and petal width. We wish to identify a new Iris ower as
Versicolor or not based on measurements on these four variables. We considered
the original data as well as a contaminated version where the contamination is
done by placing a single outlier in the Versicolor group in a similar manner as
Crimin et al. (2007).
We will consider the proposed methods, GGT and SPGT and as comparison
we will consider the allocation scheme, TD, proposed by Montanari (2004) and her
suggested allocation, PGT. We will also consider maximum depth classi er, MaxD,
based on L1 depth given by Ghosh and Chaudhuri (2005) and J ornsten (2004) and the
classical LDF and QDF. The results are presented in Table 1 given in the appendix.
For the original Iris data, QDF and MaxD gave the lowest TPM at 3.33%. The
least optimal performer was TD with a misclassi cation error rate of 31.33%.
When a cluster of ve outliers was added to the Iris data, GGT gave the lowest
TPM at 3.87% with PGT really close at 3.97%. The misclassi cation error rate
for SPGT went down slightly, while the error rates for MaxD, LDF and QDF
exploded. For LDF and MaxD, the error rate was doubled and for QDF, the
error rates increased almost 4 times.
55
For the Leukemia data, GGT happened to be the method that provided the
least TPM of 1.39%; The SPGT method had the next best misclassi cation
error rate at 2.78%. PGT is close with 2.9% while TD happened to be the least
optimal among the methods considered with an error rate of 11.11%. MaxD,
LDF and QDF shared the same error rate at 4.17%.
Smooth-PFT, GGT, TD and LDF shared the lowest TPM at 12.9%. The least
optimal method being MaxD with a misclassi cation error rate of 17.74%.
3.6.2 Monte Carlo Simulation
The common simulation settings used by authors in literature looks at normal
distributions and as a deviation they include outliers, look at distributions with heavy
tail and log-normal distribution as a case of skewed distribution. In all those cases,
most often than not, authors choose to use the same distributions for both the groups
and/or keep most of the other aspects simple. We perform a very extensive Monte
Carlo simulation to study the optimality (in terms of misclassi cation error rates) of
the proposed classi cation procedures under a variety of distributional settings:
Homoscedastic (vs) Heteroscedastic
Equal sample sizes (vs) Unequal sample sizes
Same distributions (vs) Di erent distributions for the two groups
Symmetric (vs) Skewed distributions
56
Normal tail (vs) Long tailed symetric distributions
Combinations therein
The simulation study follows the same setup given in Montanari (2004). We start
o by generating training samples of the given sizes which are used to formulate the
classi cation rules. Testing samples of size 1000 from each group are then generated
and the misclassi cation error rates are calculated by computing the proportion of
misclassi ed testing sample observations in each group. This process is then repeated
50 times and the mean and standard error of the misclassi cation error rates are
computed.
We consider Cauchy (C, which is t with 1 degree of freedom), t with 2 degrees of
freedom (t2), Normal (N) distributions, and log-normal (LN) distributions. Train-
ing samples of equal (150;150) and unequal (50;250) sizes were generated from the
distributions.
We consider four-dimensional distribution to generate the data. We consider
centers (0;0;0;0)0 and (2;0;0;0)0 and covariance matrices I = I4 and
W =
0
BB
BB
BB
BB
@
1 1 1 1
1 2 2 :25
1 2 3 :5
1 :25 :5 4
1
CC
CC
CC
CC
A
:
57
In the reporting of the results we use the notation K(A);H(B) to represent distribu-
tions K and H (K and H are C, t2, N or LN) with covariance matrices A and B (A
and B are I = I4, W), respectively. We consider all the methods as mentioned in the
real data sets. A summary of the results is given in Table 2 given in the appendix.
The following observations are noted:
A look at the results for the PGT classi er indicates that it is not optimal
for almost all of the unequal sample size cases con rming Montanari?s suspicion
that led her to abandon the point-group transvariation allocation scheme. GGT
and SPGT classi ers have error rates comparable for equal sample size cases
compared to the PGT classi er, lower error rates in almost all cases for the
unequal sample size cases and much lower error rates in most cases for the
unequal sample size cases. For these reasons, we will not be considering PGT
as a contender for misclassi cation error rate.
Considering the standard errors of the unequal sample size cases, MaxD and
PGT classi ers are less precise than the other classi ers, MaxD being the least
precise classi er. Except in a few cases, PGT is less precise than GGT and
SPGT.
Error rates for GGT and SPGT classi ers are comparable to TD for symmetric
and same distributions. GGT, SPGT and TD have better error rates compared
to LDF, QDF and MaxD for heavy tailed distributions (Cauchy and t(2)) while
58
for the normal case, GGT and SPGT have error rates comparable with LDF,
QDF and TD classi ers but superior to MaxD classi er.
For skewed but same distribution cases, SPGT and GGT are the best meth-
ods among the methods considered in terms of misclassi cation error rate and
standard error. The error rates and standard errors of SPGT and GGT are
comparable.
For di erent but homoscedastic distributions, there are cases where SPGT is the
best method, where GGT is the best method, where TD is the best method.
One of the three methods SPGT, GGT or TD is the best method in these
cases. Best in terms of misclassi cation error rate and standard error. For
these settings, it can also be noted that GGT acts up for the unequal sample
size cases. The error rates for GGT decreases for some cases with unequal
sample sizes and increases substantially for some other cases.
For di erent and heteroscedastic distributions, it can be noted that QDF does
fairly better than it usually does. As mentioned in Chapter 2, QDF works
better for heteroscedastic cases, but the di erent distributions in ates the error
rates in some cases. Except for QDF, SPGT happens to be the best method
among the methods considered. The only exception is GGT for some unequal
sample size cases and where log-normal distribution is considered with equal
sample sizes.
59
Chapter 4
Robust and Balanced Classification
4.1 Introduction
Extensive research has been done in the area of classi cation. To name a few of
the main categories that authors looked at are parametric methods, non-parametric
alternatives to the parametric methods, using robust estimates in place of the sensitive
ones. These methods try to create a method that provides the lowest possible TPM.
Not much has been done in regard to the issue of balance of the misclassi cation
error rates within each group. This issue was brie y talked about in Lachenbruch
et al. (1973) and then a solution was provided by Ng and Randles (1983) where they
rank LDF and QDF to create balance in the misclassi cation error rates, that is,
PDyjx = PDxjy.
The issue of balance needs more importance than it has been given in literature
in terms of research done on this topic. The issue of balance becomes very pertinent
in situations where one of misclassi cations (PDyjx, PDxjy) is costlier than the other. For
example, in a two group classi cation problem with a group of patients having cancer
and the another group not having cancer, a misclassi cation of a cancer patient
into a noncancerous group could prove costly in terms of life. Ideally there is a
need for a method that can control the ratio of the misclassi cation error rates at
a required level. When the investigator does not have any information on the costs
60
of misclassi cation and prior probabilities, then it is best to maintain the balance
between the two misclassi cation error rates. Some existing methods can provide a
low TPM but there are not many methods, especially robust methods that can create
this balance while minimizing the overall misclassi cation error rate. We propose one
such method that can maintain the balance in the two misclassi cation error rates
while controlling the overall TPM. We show that balance can be achieved as long
as ranking is incorporated and both groups are considered while allocating the new
observation (Randles et al., 1978a).
4.2 Ranking Discriminant Function
Given two independent random training samples X of sizenx given by X1;:::;Xnx
from population x with distribution F and Y of size ny given by Y1;:::;Yny from
y with distribution G, de ned on Rd (d 1). Let Fnx and Gny be the empirical
distribution functions of X and Y, respectively. Suppose, as before, that we have
an absolutely continuous and real valued discriminant function D such that Z = z
is classi ed in x if D(z;F;G) > 0 and in y otherwise. Now let us de ne a rank
discriminant function RD as
RD(z;F;G) = PfD(z;F;G) D(X;F;G) jX Fg
PfD(z;F;G) D(Y;F;G) jY Gg : (4.1)
61
Naturally, one classi es Z = z in x if RD(z;F;G) > 0 and in y otherwise. This
classi er is looking at whether the new observation belongs more to X or to Y and
the calculation depends on how D(:) is chosen.
To de ne the sample version, we will form two augmented samples by placing
the new observation Z = z in X forming X = fX1;:::;Xn1;Zg and in Y forming
Y =fY1;:::;Yn2;Zg. Let F nx+1 and G ny+1 be the empirical distribution functions
of X and Y , respectively. We then have
RD(z;Fnx;Gny) = 1n
x + 1
X
x2X
I D(z;F nx+1;Gny) D(x;F nx+1;Gny)
1n
y + 1
X
y2Y
I
n
D(z;Fnx;G ny+1) D(y;Fnx;G ny+1)
o
;
(4.2)
where IfAg is the indicator function of the event A. We would then allocate Z = z
to x if RD(z;Fnx;Gny) > 0 and to y otherwise.
Under certain mild regularity conditions given in Section 4 of Ng and Randles
(1983), the two misclassi cation error rates of RD(z;Fnx;Gny) are asymptotically
equal or controlled to be a speci c constant r. In particular, assume the regularity
conditions given below are true:
1. The sample versions of the discriminant functions must be symmetric in their
argument. That is,
D( jX1;:::;Xnx; Y1;:::;Yny) = D( jXs1;:::;Xsnx; Yt1;:::;Ytny)
62
for any permutation (s1;:::;snx) of (1;:::;nx) and (t1;:::;tny) of (1;:::;ny)
2. For each z 2Rd,
D(z;Fnx;G) P !D(z;F;G) and D(z;F;Gny) P !D(z;F;G) ;
where D( ) is a real valued function de ned on Rd.
3. For all z1;z2 2Rd, we have
D(z1;F;G) D(z2;G;F) :
4. Z, Xi?s and Yj?s are all mutually independent
5. The random variables D(X;F;G) and D(Y;F;G) are absolutely continuous
6. The equalities
P Fnx(D(X;F;G)) = rGny(D(X;G;F))) = 0 ; and
P Fnx(D(Y;F;G)) = rGny(D(Y;G;F))) = 0 ;
hold for some r2R+.
Then we have
PfRD(Z;F;G) < 0 jZ Fg= rPfRD(Z;F;G) > 0 jZ Gg:
63
The rank discriminant functions that we will study are:
RL: This uses the LDF L(z;F;G) given in (2.3) in place of D(z;F;G) in (4.1). This
would be the method proposed by Randles et al. (1978a) where he ranks LDF.
RQ: This uses the QDF Q(z;F;G) given in (2.4) in place of D(z;F;G) in (4.1).This
would be the method proposed by Randles et al. (1978a) where he ranks QDF.
RS: This uses the ranking of the L1 depth function de ned in (2.8). However, to com-
ply with the regularity conditions of Ng and Randles (1983) regarding e ective
sample separating ability of the discriminant function, we will use
~S(z;F;G) = 1
D1(z;G)
1
D1(z;F)
in place of D(z;F;G) in (4.1). This is compatible with all the other discriminant
functions we use since 1=D1 is some kind of a measure of distance from the
center of the distribution. If Mahalanobis depth is used the distance measure
would become Mahalanobis distance and the classi er using this distance would
exactly be RQ. In fact any robust a ne invariant measure of distance can be
used in place of this distance and the method still works. The more robust the
distance, the more robust the classi er.
RM: Here we use the QDF given in (2.4) in place of D(z;F;G) in (4.1). However,
in the sample version, we will use the minimum covariance determinant (MCD)
estimators of location and covariance matrix (Rousseeuw, 1984, 1985) as done in
64
Hubert and Van Driessen (2004). The quadratic discriminant function of Hubert
and Van Driessen (2004) will be denoted by M(z;F;G) as given in Equation
(2.11). We use the FAST-MCD algorithm of Rousseeuw and Van Driessen
(1999) with the default number of observations 0:75nx and 0:75ny for MCD
computations. This assumption is safe to assume as long as you do not suspect
more than 25% of the data to be outliers. The MCD estimates are essentially
trimmed mean and covariance structure as they are looking at the central 75%
of the data only. A very good property of MCD estimates is that a method
using these estimates will remain a ne invariant as the MCD estimates are
a ne invariant.
RL and RQ are ranking LDF and QDF respectively as given in Randles et al.
(1978a). Although ranking is nonparametric, the base, as in the quantities being
ranked are completely parametric. That makes this method semi-parametric. We
prove via simulation in Section 4.3 that although ranking makes the method ro-
bust, the quantities being ranked matters and controls the overall TPM. Inspired by
Randles et al. (1978a), we prove via simulation that the ranked methods are more
balanced than the unranked counterparts. We also show that RS and RM are more
robust than RL and RQ due to the fact that RL and RQ have a parametric base that
is sensitive to deviations and can only be so much better under deviations.
65
4.3 Application on Real Data and Monte Carlo Simulation
We would like to show the optimality of the proposed classi ers by applying the
methods on some real data sets and a variety of simulation settings.
4.3.1 Application on Real Data Sets
We will use the same three datasets as described in Subsection 3.6.1. We will
consider the proposed classi ers, RM and RS, which are robust ranked versions of the
unranked classi ers, M and S. M is the classi er created by replacing the parametric
quantities in QDF with MCD estimates. S is the maximum depth classi er based on
L1 depth given by Ghosh and Chaudhuri (2005) and J ornsten (2004). We consider
the classical L (LDF) and Q (QDF) and their ranked versions RL and RQ given by
Randles et al. (1978a). The results are shown in Table 3 given in the appendix.
For the original Iris data, Q gave the lowest TPM at 4% but the two mis-
classi cation error rates remain unbalanced at 1% and 10%. The method that
provided the most balance was RM where each misclassi cation error rate is
equal to 6%. The worst performer was L with misclassi cation error rates of
28% and 26%. When an outlier was added to the Iris data, M and RM gave
the lowest TPM at 8% with RM giving greater balance (8%, 7.8%) than M
(6%, 11.8%).
For the Leukemia data, RL and L gave the lowest TPM of 4.2%; however, the
error rates of L were unbalanced (2.1%, 8%) compared to those of RL (4.3%,
66
4%). All the remaining ranked discriminant functions gave a TPM of 8.3%.
The method that performed poorly for this data was M with TPM of 12.5%.
Considering the Colon data, both RM and M yielded TPM equal to 9.7%
outperforming all other classi ers. Once again, RM had balanced error rates
(9.1%, 10%) compared to M (13.6%, 7.5%). The least optimal method for this
data was S with unbalanced error rates (40.9%, 10%) giving a TPM of 21%.
We also performed leave-one-out cross validation for the Leukemia and Colon
data after using the gene selection scheme given by Dudoit et al. (2002) as well as
one that uses a Wilcoxon statistic in place of the t-statistic. The results were very
similar and hence not reported here.
4.3.2 Monte Carlo Simulation
For the same reasons mentioned in Subsection 3.6.2, we perform a very exten-
sive Monte Carlo simulation to study the optimality (in terms of misclassi cation
error rates) of the proposed classi cation procedures under a variety of distributional
settings:
Homoscedastic (vs) Heteroscedastic
Equal sample sizes (vs) Unequal sample sizes
Same distributions (vs) Di erent distributions for the two groups
Normal tail (vs) heavy tailed symmetric distributions
67
Sparse data (vs) Dense data
Combinations therein
The simulation study follows the same setup given in Montanari (2004). We start
o by generating training samples of the given sizes which are used to formulate the
classi cation rules. Testing samples of size 1000 from each group are then generated
and the misclassi cation error rates are calculated by computing the proportion of
misclassi ed testing sample observations in each group. This process is then repeated
50 times and the mean and standard error of the misclassi cation error rates are
computed.
We considered a two-dimensional case in addition to the four-dimensional case
considered for the projection pursuit methodology to study the e ect of data abun-
dance with the misclassi cation error rate.
We consider Cauchy (C, which is t with 1 degree of freedom), t with 2 degrees of
freedom (t2), and Normal (N) distributions. Training samples of equal (50;50) and
unequal (25;75) sizes were generated from the distributions.
In the two-dimensional setting, we used centers (0;0)0 and (2;0)0 for the two
distributions whereas the covariance matrices considered are the independence case
I2 and a correlated data case
V =
0
B@ 1 1
1 4
1
CA :
68
For the four-dimensional study we considered centers (0;0;0;0)0 and (2;0;0;0)0 and
covariance matrices I4 and
W =
0
BB
BB
BB
BB
@
1 1 1 1
1 2 2 :25
1 2 3 :5
1 :25 :5 4
1
CC
CC
CC
CC
A
:
In the reporting of the results we use the notation K(A);H(B) to represent distri-
butions K and H (K and H are C, t2, or N) with covariance matrices A and B (A
and B are I2, I4, W, or V), respectively. Thus, in our study, the two groups could
come from the same distribution with di erent means and the same covariance matrix
(homoscedastic), the same distribution with di erent means and di erent covariance
matrices (heteroscedastic), and/or two di erent distributions.
A summary of the results for the two-dimensional case is given in Table 4 whereas
Table 5 (given in the appendix) contains the results from the four-dimensional study.
We infer the following from the results:
S versus RS: S is heavily unbalanced, especially when the sample sizes are
not the same, the tail thickness of one or more of the distributions increases,
the distributions are heteroscedastic, or the two groups do not share the same
distribution. The balance of RS is not a ected by any of these situations. RS
generally has lower TPM than S, and the di erence is pronounced when at least
one of the distributions is t2 or Cauchy. Moreover, RS has better consistency
69
of correct classi cation in that the standard errors of its misclassi cation error
rates are generally much lower in comparison to S.
L versus RL: As expected, Lgives poor performance in the case of heteroscedas-
tic populations. Outside of balancing the misclassi cation error rates, RL does
not appear to improve the TPM of L with the exception of the Cauchy distribu-
tion cases. In fact, RL gives worse TPM than L when two di erent distributions
are considered and one of the distributions is normal.
Q versus RQ: The misclassi cation error rates of Q are severely unbalanced
especially when one of the distributions is heavy tailed. In addition to providing
balance, RQ gives lower error rates than Q in heavy tailed (C or t2) situations.
The standard errors of the misclassi cation error rates of Q are three to four
times higher than those of RQ when both distributions are heavy tailed.
M versus RM: The TPM of RM is generally lower than that of M in the case
of heavy tailed distributions. This is especially visible in the more sparse four-
dimensional case. Otherwise, the TPMs of RM and R are comparable, with
RM doing a better job of maintaining the balance between the misclassi cation
error rates. The imbalance between the error rates of M increases when the two
distributions are di erent and at least one of them is heavy tailed. Moreover,
the standard errors of the misclassi cation error rates of M are 1.5 to 2 times
larger than those of RM in the cases where a Cauchy distribution is involved.
70
In general, RM and M provide superior performance in terms of TPM when a
heavy tailed distribution is involved. RM gives the best performance of all the
methods studied in terms of TPM, balance, and standard errors when both the
distributions are Cauchy. This is true in both homoscedastic and heteroscedastic
cases.
Not surprisingly, as shown in Randles et al. (1978a), ranking provides balance
between the group misclassi cation error rates. Moreover, in the cases where
heavy tailed or two di erent distributions are considered, ranking itself appears
to provide added optimality (smaller TPM and standard errors) even when the
original method is robust.
71
Chapter 5
Influence Function
5.1 Introduction
The focus of this chapter is the analysis of the in uence functions of some of
the discriminant functions that were introduced in earlier chapters. As discussed in
Hubert and Van Driessen (2004), the robustness of a classi er directly depends on the
robustness of the discriminant function used. The in uence on the LDF of a single
point of perturbation added to one of the two populations was studied by Campbell
(1978). Croux and Dehon (2001) studied the in uence function of the LDF where
the underlying distributions are assumed to be multivariate normal and where both
populations are contaminated. Croux et al. (2008) followed a similar approach but
with a penalty term included to make the classi er optimal. Recently Huang et al.
(2007) studied the pair-perturbation in uence function of the LDF where a pair of
points of perturbation are included in one of the two populations.
On the other hand, the in uence function of quadratic discrimination appears
to be slightly more complicated. Croux and Joossens (2005) studied the in uence
of perturbing a point in one of the populations on the misclassi cation error rate of
quadratic discriminant analysis. Although more complex, this is QDF version of the
work of Croux and Dehon (2001) where underlying normality is assumed.
72
The purpose of this chapter is two-fold: to provide the in uence function for
the QDF thus lling this gap in the literature and to provide the in uence function
for the rank based discriminant function introduced in Chapter 4. Both of these are
given without making any underlying distributional assumptions.
We will suppose that both distributions are -contaminated (Van Ness and Yang,
1998) and de ne
Fy; = (1 )F + y and Gy; = (1 )G+ y ;
where y is the distribution function of the point mass at y. Now if T(F;G) is a
functional, then the in uence function is de ned as
IF(y;T(F;G)) = lim
#0
T(Fy; ;Gy; ) T(F;G)
: (5.1)
5.2 IF for QDF
The quadratic discriminant function is given in Equation (2.4) that we will re-
produce here using a functional notation for convenience
Q(z;F;G) = ln
j (G)j
j (F)j
(z (F))0[ (F)] 1(z (F))+
(z (G))0[ (G)] 1(z (G)) :
73
De ne the function
F(x;v) = (x (F))0[ (F)] 1(v (F)) :
The following theorem gives the in uence function of the QDF.
Theorem 5.1. For a xed value of z, the in uence function of Q(z;F;G) is given by
IF(y;Q(z;F;G)) = G(z;z) 2 G(z;y) 2G(z;y) F(z;z) + 2 F(z;y)+
2F(z;y) + G(y;y) F(y;y) :
Proof. Write
Q(z;Fy; ;Gy; ) = ln
j (G
y; )j
j (Fy; )j
Fy; (z;z) + Gy; (z;z) :
Straight forward algebra shows that (Fy; ) = (F) + (y (F)) and (Fy; ) =
(1 ) (F) + (y (F))(y (F))0. As shown in Campbell (1978), writing =
(F); = (F), and w = y (F), we have
[ (Fy; )] 1 = (1 ) 1
1
1ww0 1
(1 ) + w0 1w
= (1 + ) 1 1ww0 1
74
up to order . Thus
Fy; (z;z) =f(z ) wg0 (1 + ) 1 1ww0 1 f(z ) wg
= (1 + ) F(z;z) 2 (1 + ) F(z;y) + 2(1 + ) F(y;y) 2F(z;y)+
2 2 F(z;y) F(y;y) 3 2F(y;y) :
This gives
@
@ Fy; (z;z)
=0
= F(z;z) 2 F(z;y) 2F(z;y) : (5.2)
Using similar steps
@
@ Gy; (z;z)
=0
= G(z;z) 2 G(z;y) 2G(z;y) : (5.3)
Now using a result of Golberg (1972) [p. 1125, Equation (9)] we nd
@
@ j (Fy; )j= tr
1(Fy; ) @@ (Fy; )
j (Fy; )j:
Moreover
@
@ (Fy; ) = ww
0 ;
75
where w and are as de ned above. Thus
@
@ lnj (Fy; )j=
1
j (Fy; )j tr
1(Fy; ) @@ (Fy; )
j (Fy; )j
= tr (1 + ) 1 1ww0 1 (ww0 )
that gives
@
@ lnj (Fy; )j
=0
= tr 1(ww0 ) = F(y;y) d: (5.4)
Similarly
@
@ lnj (Gy; )j
=0
= G(y;y) d: (5.5)
Putting equations (5.2), (5.3), (5.4), and (5.5) together gives the desired result.
It is clear that IF(y;Q(z;F;G)) is unbounded in y. In fact it is a quadratic in
y. This shows that quadratic discriminant analysis is a non-robust procedure and
may provide misleading results if the data are contaminated.
5.3 IF for RD
Recall the rank based discriminant function given in Equation 4.1
RD(z;F;G) = PFfD(z;F;G) D(X;F;G)g PGfD(z;F;G) D(Y;F;G)g ;
76
where D(X;F;G) is a discriminant function. We will restrict our attention to the
cases where D(z;F;G) is de ned as a di erence between the Type D depth of z in F
and z in G
D(z;F;G) = DF(z;F) DF(z;G) :
The Type D depth of the point z in F is de ned as (Zuo and Ser ing, 2000)
DF(z;F) = inffPF(C) : z2C2 g
where is a speci ed class of subsets of Rd. For instance, if is the class of halfspaces
in Rd, then DF is the halfspace depth. As in Wang and Ser ing (2006), we assume
that
1. If C2 , then Cc2 , and
2. maxzDF(z;F) < 1,
where Ac and A denote the complement and the closure of the set A.
Let
HF(t;F;G) = PFfD(X;F;G) tg and HG(t;G;F) = PGfD(X;G;F) tg :
Note that
RD(z;F;G) = HF(D(z;F;G);F;G) HG(D(z;G;F);G;F) :
77
Theorem 5.2. If D(v;F;G) is continuous in v, then the in uence function of
HF(t;F;G) for a xed t is given by
IF(y;HF(t;F;G)) =
8
>>>
<
>>>
:
t hF(t) HF(t) if D(y;F;G) >t
1 +t hF(t) HF(t) if D(y;F;G) t;
where hF(t) = dHF(t)=dt.
Proof. It is easy to see that for Type D depth functions DF,
D(z;Fy; ;Gy; ) = (1 )DF(z;F) + (1 )DF(z;G) = (1 )D(z;F;G) :
Then following an approach similar to Wang and Ser ing (2006) we obtain
HFy; (t;Fy; ;Gy; ) = PFy; f(1 )D(X;F;G) tg
= (1 )PFf(1 )D(X;F;G) tg+ If(1 )D(y;F;G) tg
= (1 )PFf(1 )D(X;F;G) t \ X2Syg+
(1 )PFf(1 )D(X;F;G) t \ X =2Syg+
If(1 )D(y;F;G) tg ;
where Sy =fx : D(x;F;G) D(y;F;G)g. Consider the following two cases:
78
(1) D(y;F;G) >t : For > 0 su ciently small
fv : D(v;F;G) t=(1 )g\Sy =;
fv : D(v;F;G) t=(1 )g\Scy =fv : D(v;F;G) t=(1 )g
y =2fv : D(v;F;G) t=(1 )g
Thus
HFy; (t;Fy; ;Gy; ) = (1 )HF(t=(1 ))
which gives
IF(y;HF(t;F;G)) = @@ HFy; (t;Fy; ;Gy; )
=0
= t hF(t) HF(t) :
(2) D(y;F;G) t : Once again for > 0 su ciently small
fv : D(v;F;G) t=(1 )g\Sy =fv : D(y;F;G) D(v;F;G) t=(1 )g
fv : D(v;F;G) t=(1 )g\Scy =fv : D(v;F;G) D(y;F;G)g
y2fv : D(v;F;G) t=(1 )g
Thus
HFy; (t;Fy; ;Gy; ) = (1 )HF(t=(1 )) +
79
which gives
IF(y;HF(t;F;G)) = @@ HFy; (t;Fy; ;Gy; )
=0
= 1 +t hF(t) HF(t) :
The in uence function of HG(t;G;F) may be obtained following similar steps.
We now give the in uence function of the rank discriminant function.
Theorem 5.3. If D(z;F;G) is continuous, then the in uence function of RD(z;F;G)
is
IF(y;RD(z;F;G)) =
8
>>>>
>>>
>>>>
>>>
>>>
>>><
>>>
>>>>
>>>
>>>>
>>>
>>>:
HG(D(z;G;F);G;F) HF(D(z;F;G);F;G) + 1 ;
if D(y;F;G) D(z;F;G) :
80
Proof. Following steps that are similar to the proof of Theorem 5.2 and some algebraic
manipulation, we have
RD(z;Fy; ;Gy; ) = HFy; (D(z;Fy; ;Gy; );Fy; ;Gy; )
HGy; (D(z;Gy; ;Fy; );Gy; ;Fy; )
= (1 ) [HF(D(z;F;G);F;G) HG(D(z;G;F);G;F)] +
[IfD(y;F;G) D(z;F;G)g
IfD(y;G;F) D(z;G;F)g] :
Note that IfD(y;G;F) D(z;G;F)g = IfD(y;F;G) D(z;F;G)g and therefore
IfD(y;F;G) D(z;F;G)g IfD(y;G;F) D(z;G;F)gequals 0, 1, or -1 depend-
ing on whether D(y;F;G) = D(z;F;G), D(y;F;G) < D(z;F;G), or D(y;F;G)j>
D(z;F;G), respectively. Di erentiating with respect to and letting = 0 gives the
desired result.
Since HF and HG are CDFs, this in uence function remains bounded in y as long
as the Type D depth function used is continuous. As IF(y;RD(z;F;G)) is a step
function, RD has nite gross error sensitivity but in nite local shift sensitivity. This
behavior is similar to the behavior of the in uence function of univariate quantiles.
81
5.4 Sensitivity Curves for the Projection Based Classi er
To view the e ect of a single point contamination introduced into the training
sample on the probability of misclassi cation error graphically, one needs a sample
version of the in uence function provided in Equation (5.1). The quantity:
SCn(y) = n[Tn(y1;:::;yn 1;y) Tn 1(y1;:::;yn 1)] ; (5.6)
where y1;:::;yn 1 represents the training samples and y is a new point that is intro-
duced into each of the two training samples (Tukey, 1971). Tn is the probability of
misclassi cation error. SCn is known as the sensitivity curve and it shows the e ect
of the added new point on the classi er as the value of this new point changes. The
sensitivity curve of a robust procedure is bounded in much the same way the in uence
function is bounded.
We use a simple case where two groups with training sample of sizes 100 each
and testing samples of sizes 1000 each are generated from four dimensional normal
distributions with means (0;0;0;0) and (2;0;0;0). To simplify things further, we
used the identity covariance structure for both the distributions. We introduced a
single point (i;0;0;0);i = 200;:::;200 into both the training samples and for each
point we calculate the probability of misclassi cation error with and without the
added point using a leave-one-out cross validation for the uncontaminated testing
samples. The sensitivity of the probability of misclassi cation for each classi er is
then calculated using the Equation (5.6) and all these values are plotted in a graph.
82
?200 ?150 ?100 ?50 0 50 100 150 200?5
0
5
10
15
20
25
30
35
New Point
Sensitivity
SPGT GGT PGT TD MaxD LDF QDF
Figure 5.1: Sensitivity Curve for the Projection Based Methods
83
?20 ?15 ?10 ?5 0 5 10 15 20 25
?2.5
?2
?1.5
?1
?0.5
0
0.5
1
1.5
2
New Point
Sensitivity
SPGT GGT PGT TD MaxD LDF QDF
Figure 5.2: A Zoomed View of the Sensitivity Curve for the Projection Based Methods
84
The sensitivity curve in Figure 5.1 clearly shows that the classi ers LDF and QDF
appear to have an unbounded misclassi cation error sensitivity. As the magnitude
of the new point increases, so does the sensitivity of the misclassi cation error for
these classi ers. On the other hand MaxD and all the projection based classi ers:
SPGT, GGT, PGT and TD seem to be bounded. Although they are a ected by the
presence of the outlier, their sensitivity is eventually bounded. A zoomed-in view
of the plot when the newly added point is around the origin is shown in Figure 5.2.
It shows that the sensitivity curves for the projection based methods are bounded,
with PGT being the least a ected. A theoretical evaluation of the in uence function
of these projection based classi ers appears to be extremely complicated although
the sensitivity curves give us a strong indication that the in uence function will be
bounded.
85
Chapter 6
Conclusion
Two types of nonparametric procedures for discriminant analysis were consid-
ered; one based on transvariation probabilities and the other based on ranking dis-
criminant functions.
The rst method is a nonparametric discriminant analysis procedure that uses
the method of projection pursuit in tandem with the idea of transvariation probabil-
ities. In particular, group separation is measured using the two-group transvariation
probability (Gini, 1916) as a projection index. Allocation of the new observation is
performed either using a symmetrized group-group transvariation probability or a
smoothed version of point-group transvariation. These procedures are shown to give
smaller misclassi cation error rates compared to linear and quadratic discriminant
functions and maximum L1 depth classi er when the training samples are drawn
from symmetric and skewed distributions and especially when the di erence between
the training sample sizes gets larger. Moreover these methods give smaller misclassi-
cation error rates compared to Montanari?s (Montanari, 2004) transvariation based
classi er that uses distances to allocate the new observation when the training samples
are drawn from skewed distributions. An extensive simulation study and applications
on three real data sets (Fisher?s IRIS, Leukemia and Colon datasets) are used to
illustrate this behavior.
86
Although computation time can be a hinderance for projection pursuit tech-
niques, especially for large dimensions, we were able to optimize our computation
by generating projection directions using a multiscale procedure. To begin with, we
proceed by rst generating a few points on the hypercube, that are projected onto
the hypersphere and then repeatedly \zooming-in" to interesting hyperarcs on the
surface of the d-dimensional hypersphere where more directions are considered. We
took ve lattice points per dimension on the hypercube. No signi cant improvement
in misclassi cation error rate was noticed when seven or nine lattice points were used
instead of ve. We have found that this makes for a faster computation time without
compromising the results.
To illustrate the robustness of the proposed procedures, we provide sensitivity
curves (Tukey, 1971) where a new point is introduced into each training sample.
Then this new point is changed and its e ect on the misclassi cation error rate is
measured. The curves provided in Chapter 5 show that the newly proposed classi-
ers have bounded sensitivity to the inclusion of the new point with respect to the
misclassi cation error rate.
The second method is another nonparametric method that is based on rank-
ing discriminant functions. In discriminant analysis, when an investigator has no
prior reason to prefer one population to another, balance among the groups in terms
of the probabilities of correct classi cation is a desirable property of a discriminant
function. We asked ourselves whether the method of ranking discriminant functions
87
given in Randles et al. (1978a) would work in balancing the probabilities of cor-
rect classi cation for the recently proposed maximum L1 depth classi er (J ornsten,
2004) and MCD (Minimum Covariance Determinant) based quadratic discriminant
function (Hubert and Van Driessen, 2004). Not surprisingly, our extensive simula-
tion study showed that balance between the probabilities of correct classi cation is
achieved, while maintaining the overall robustness of the procedure under a variety
of distributional settings.
As it turns out, ranking discriminant functions does more than just provide bal-
ance between the the probabilities of correct classi cation while maintaining the total
probability of misclassi cation error at the level of the original discriminant functions.
Ranking also gives smaller and more consistent total probabilities of misclassi cation
error rates in cases where the underlying distributions are heavy tailed. This is
reminiscent of the use of ranking in univariate and certain multivariate problems to
arrive at procedures that are robust against a variety of violation of assumptions
(Hettmansperger and McKean, 1998). A somewhat surprising revelation was that
this bene t persisted even when the underlying discriminant function itself was ro-
bust such as the one given by Hubert and Van Driessen (2004). In particular, in
the simulation settings that we investigated, the ranked version of the MCD based
classi er of Hubert and Van Driessen (2004) gives the best performance of all the
methods studied in terms of total probability of correct classi cation, balance, and
standard errors when the two underlying distributions are Cauchy.
88
In Chapter 5 we derived the in uence function for the quadratic discriminant
function as well as the rank based discriminant function that is given in Chapter 4.
The in uence function for the quadratic discriminant function is unbounded, as ex-
pected, but the in uence function for the rank based discriminant function is a step
function and hence robust against gross error contamination.
Future Work
The work presented in this dissertation is by no stretch of imagination complete.
A plethora of things can be done taking the work presented here as a starting point.
For the projection based classi ers, more e cient projection pursuit techniques
that would let one classify observations from dimensions much higher are desired.
Projection indices that are robust and at the same time amenable to a more e cient
method of searching for the most interesting view of the data will make projection
pursuit based classi cation more appealing. Moreover, each transvariation in the
GGT classi er can be smoothed as in the SPGT classi er. A better smoothing func-
tion that works better than the t-CDF proposed in SPGT classi er can be found.
Finding the optimal smoother for the t-CDF or for any other function theoretically
is also an interesting problem.
For the rank based classi ers, one can always nd robust distance measures that
are more robust to deviations that we can then rank. That way the ranked version
would be that much better and balanced. As mentioned in Ng and Randles (1983),
the ratio of the misclassi cation error rates for the two samples can be controlled.
89
This idea could be applied to classify data from the medical eld where the ratio of
false-positive to false-negative can be controlled at any desired level.
Both types of classi ers only consider the two group classi cation. The extension
to more than two groups is not immediate and is interesting. Theoretically, the
in uence function for the discriminant function can be derived for the projection
based classi ers. This will con rm the results that are seen in the sensitivity curves.
Finally, the training data breakdown point could be another theoretical result that
could be found for both types of classi ers.
90
Bibliography
Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., and Levine,
A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of
tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad
Sci USA, 96(12):6745{6750.
Anderson, T. W. (1984). An introduction to multivariate statistical analysis. Wiley
Series in Probability and Mathematical Statistics: Probability and Mathematical
Statistics. John Wiley & Sons Inc., New York, second edition.
Andrews, D. F., Bickel, P. J., Hampel, F. R., Huber, P. J., Rogers, W. H., and
Tukey, J. W. (1972). Robust estimates of location: Survey and advances. Princeton
University Press, Princeton, N.J.
Billor, N., Abebe, A., Turkmen, A., and Nudurupati, S. V. (2008). Classi cation
based on depth transvariations. J. Classi cation, (in press).
Brzezinski, J. R. (1999). Logistic Regression Modeling for Context-Based Classi ca-
tion. IEEE Computer Society, Washington, DC, USA.
Campbell, N. A. (1978). The in uence function as an aid in outlier detection in
discriminant analysis. Applied Statistics, 27(3):251{258.
Chen, Z. Y. (1989). Robust linear discriminant procedures using projection pursuit
methods. Unpublished PhD Dissertation, University of Michigan.
91
Chen, Z.-Y. and Muirhead, R. J. (1994). A comparison of robust linear discriminant
procedures using projection pursuit methods. In Multivariate analysis and its ap-
plications (Hong Kong, 1992), volume 24 of IMS Lecture Notes Monogr. Ser., pages
163{176. Inst. Math. Statist., Hayward, CA.
Chork, C. Y. and Rousseeuw, P. J. (1992). Integrating a high-breakdown option
into discriminant analysis in exploration geochemistry. Journal of Geochemical
Exploration, 43(3):191{203.
Christmann, A., Fischer, P., and Joachims, T. (2002). Comparison between vari-
ous regression depth methods and the support vector machine to approximate the
minimum number of misclassi cations. Comput. Statist., 17(2):273{287.
Christmann, A. and Rousseeuw, P. J. (2001). Measuring overlap in binary regression.
Comput. Statist. Data Anal., 37(1):65{75.
Crimin, K., McKean, J. W., and Sheather, S. J. (2007). Discriminant procedures
based on e cient robust discriminant coordinates. J. Nonparametr. Stat., 19(4-
5):199{213.
Croux, C. and Dehon, C. (2001). Robust linear discriminant analysis using S-
estimators. Canad. J. Statist., 29(3):473{493.
Croux, C., Filzmoser, P., and Joossens, K. (2008). Classi cation e ciencies for robust
linear discriminant analysis. Statist. Sinica, 18(2):581{599.
92
Croux, C., Filzmoser, P., and Oliveira, M. R. (2007). Algorithms for projection
pursuit: robust principal component analysis. Chemometrics and Intelligent Labo-
ratory Systems, 87(2):218{225.
Croux, C. and Haesbroeck, G. (2000). Principal component analysis based on ro-
bust estimators of the covariance or correlation matrix: In uence functions and
e ciencies. Biometrika, 87(3):603{618.
Croux, C. and Joossens, K. (2005). In uence of observations on the misclassi cation
probability in quadratic discriminant analysis. J. Multivariate Anal., 96(2):384{
403.
Dillon, W. R. (1979). The performance of the linear discriminant function in nonop-
timal situations and the estimation of classi cation error rates: A review of recent
ndings. Journal of Marketing Research, 16:370{381.
Donoho, D. (1982). Breakdown properties of multivariate location estimators. In
PhD Qualifying paper. Department of Statistics, Harvard University.
Donoho, D. L. and Gasko, M. (1992). Breakdown properties of location estimates
based on halfspace depth and projected outlyingness. Ann. Statist., 20(4):1803{
1827.
Dudoit, S., Fridlyand, J., and Speed, T. P. (2002). Comparison of discrimination
methods for the classi cation of tumors using gene expression data. J. Amer.
Statist. Assoc., 97(457):77{87.
93
Eddy, W. F. (1985). Ordering of multivariate data. In In Computer Science and
Statistics: The Interface(L. Billard, ed.), pages 25{30. North-Holland, Amsterdam.
Fang, K. T. and Wang, Y. (1994). Number-theoritic methods in statistics. Chapman
and Hall, London.
Filzmoser, P., Serneels, S., Croux, C., and Van Espen, P. J. (2006). Robust multi-
variate methods: The projection pursuit approach. From Data and Information
Analysis to Knowledge Engineering, pages 270{277.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems.
Annals of Eugenics, VII(II):179{188.
Friedman, J. H. (1987). Exploratory projection pursuit. Journal of the American
Statistical Association, 82(397):249{266.
Friedman, J. H. and Tukey, J. W. (1974). Projection pursuit algorithm for exploratory
data analysis. IEEE Transactions on Computers, C 23(9):881{890.
Ghosh, A. K. and Chaudhuri, P. (2005). On maximum depth and related classi ers.
Scand. J. Statist., 32(2):327{350.
Gini, C. (1916). Il concetto di transvariazione e le sue prime applicazioni. Giornale
degli economisti Rivista di statistica.
Golberg, M. A. (1972). The derivative of a determinant. Amer. Math. Monthly,
79:1124{1126.
94
Goldstein, M. (1987). [a review of multivariate analysis]: Comment. Statistical Sci-
ence, 2(4):418{420.
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P.,
Coller, H., Loh, M., Downing, J. R., Caligiuri, M. A., Bloom eld, C. D., and
Lander, E. S. (1999). Molecular classi cation of cancer: class discovery and class
prediction by gene expression monitoring. Science, 286:531{ 537.
Gunn, S. R. (1998). Support vector machines for classi cation and regression. Te-
chinical Report, University of Southampton.
Hardy, A. (1991). On tests concerning the existence of a classi cation. Belgian Journal
of Operations Research, Statistics and Computer Sciences, (31):111{126.
Hawkins, D. M. and McLachlan, G. J. (1997). High-breakdown linear discriminant
analysis. Journal of the American Statistical Association, 92(437):136{143.
He, X. and Fung, W. K. (2000). High breakdown estimation for multiple populations
with applications to discriminant analysis. J. Multivariate Anal., 72(2):151{162.
Hellman, M. E. (1970). Nearest neighbor classi cation rule with a reject option. IEEE
Transactions on Systems Science and Cybernetics, SSC6(3):179{&.
Hettmansperger, T. P. and McKean, J. W. (1998). Robust nonparametric statistical
methods, volume 5 of Kendall?s Library of Statistics. Edward Arnold, London.
95
Hettmansperger, T. P. and Randles, R. H. (2002). A practical a ne equivariant
multivariate median. Biometrika, 89(4):851{860.
Hills, M. (1967). Discrimination and allocation with discrete data. Applied Statistics,
16(3):237{250.
Hoberg, R. (2003). Clusteranalyse. In Klassi kation und Datentiefe. Eul, Lohmar.
Hollander, M. and Wolfe, D. A. (1999). Nonparametric statistical methods. Wiley
Series in Probability and Statistics: Texts and References Section. John Wiley &
Sons Inc., New York, second edition. A Wiley-Interscience Publication.
Huang, Y., Kao, T.-L., and Wang, T.-H. (2007). In uence functions and local in u-
ence in linear discriminant analysis. Comput. Statist. Data Anal., 51(8):3844{3861.
Huber, P. J. (1977). Robust covariances. In Statistical decision theory and related
topics, II (Proc. Sympos., Purdue Univ., Lafayette, Ind., 1976), pages 165{191.
Academic Press, New York.
Huber, P. J. (1981). Robust statistics. John Wiley & Sons Inc., New York. Wiley
Series in Probability and Mathematical Statistics.
Huber, P. J. (1985). Projection pursuit. Ann. Statist., 13(2):435{525. With discussion.
Huber, P. J. (1989). Programs for Exploratory Projection Pursuit. Artemis Systems,
Carlisle, MA.
96
Hubert, M. and Van Driessen, K. (2004). Fast and robust discriminant analysis.
Comput. Statist. Data Anal., 45(2):301{320.
Hugg, J., Rafalin, R., Seyboth, K., and Souvaine, D. (2006). An experimental study
of old and new depth measures. In Workshop on Algorithm Engineering and Ex-
periments (ALENEX06), pages 51{64. Springer-Verlag Lecture Notes in Computer
Science, New York.
Johnson, M. E., Wang, C., and Ramberg, J. S. (1979). Robustness of sher?s linear
discriminant function to departures from normality. In Los Alamos Techinical
Report LA-8068-MS. Los Alamos, NM 87545.
Johnson, N. L. (1949). Systems of frequency curves generated by methods of trans-
lation. Biometrika, 36:149{176.
Jones, M. C. (1983). The projection pursuit algorithm for exploratory data anlaysis.
PhD Thesis, University of Bath.
Jones, M. C. and Sibson, R. (1987). What is projection pursuit? J. Roy. Statist. Soc.
Ser. A, 150(1):1{36. With discussion.
Joossens, K. and Croux, C. (2004). Empirical comparison of the classi cation per-
formance of robust linear and quadratic discriminant analysis. In Theory and ap-
plications of recent robust methods, Stat. Ind. Technol., pages 131{140. Birkh auser,
Basel.
97
J ornsten, R. (2004). Clustering and classi cation based on the L1 data depth. J.
Multivariate Anal., 90(1):67{89.
Lachenbruch, P. A. (1975). Discriminant analysis. Hafner Press, New York.
Lachenbruch, P. A., Sneeringer, C., and Revo, L. T. (1973). Robustness of the linear
and quadratic discriminant function to certain types of non-normality. Comm.
Statist., 1(1):39{56.
LeBlanc, M. and Tibshirani, R. (1996). Combining estimates in regression and clas-
si cation. J. Amer. Statist. Assoc., 91(436):1641{1650.
Liu, R. Y. (1990). On a notion of data depth based on random simplices. Ann.
Statist., 18(1):405{414.
Liu, R. Y. (1992). Data depth and multivariate rank tests. In L1-statistical analysis
and related methods (Neuch^atel, 1992), pages 279{294. North-Holland, Amsterdam.
Liu, R. Y., Parelius, J. M., and Singh, K. (1999). Multivariate analysis by data depth:
descriptive statistics, graphics and inference. Ann. Statist., 27(3):783{858. With
discussion and a rejoinder by Liu and Singh.
Liu, R. Y. and Singh, K. (1993). A quality index based on data depth and multivariate
rank tests. J. Amer. Statist. Assoc., 88(421):252{260.
Mahalanobis, P. C. (1936). On the generalized distance in statistics. Proceedings of
the National Institute of Science of India, pages 49{55.
98
Marsaglia, G. (1972). Choosing a point from the surface of a sphere. The Annals of
Mathematical Statistics, 43(2):645{646.
McLachlan, G. J. (1992). Discriminant analysis and statistical pattern recognition.
Wiley Series in Probability and Mathematical Statistics: Applied Probability and
Statistics. John Wiley & Sons Inc., New York. A Wiley-Interscience Publication.
Mojirsheibani, M. (1997). A consistent combined classi cation rule. Statist. Probab.
Lett., 36(1):43{47.
Mojirsheibani, M. (1999). Combining classi ers via discretization. J. Amer. Statist.
Assoc., 94(446):600{609.
Mojirsheibani, M. (2000). A kernel-based combined classi cation rule. Statist. Probab.
Lett., 48(4):411{419.
Montanari, A. (2004). Linear discriminant analysis and transvariation. J. Classi ca-
tion, 21(1):71{88.
Mosler, K. (2002). Multivariate dispersion, central regions and depth, volume 165 of
Lecture Notes in Statistics. Springer-Verlag, Berlin. The lift zonoid approach.
Mosler, K. and Hoberg, R. (2003). Classi cation based on data depth, Bulliten of the
ISI 54th session.
Nason, G. (1995). Three-dimensional projection pursuit. Applied Statistics,
44(4):411{430.
99
Ng, T. H. and Randles, R. H. (1983). Rank procedures in many population forced
discrimination problems. Comm. Statist. A|Theory Methods, 12(17):1943{1959.
Nguyen, D. V. and Rocke, D. M. (2002). Tumor classi cation by partial least squares
using microarray gene expression data. Bioinformatics, 18:39{50.
Pao, Y. H. (1989). Adaptive pattern Recognition and Neural Network. Addition-
Wesley Publishing Company, Inc.
Posse, C. (1990). An e ective two-deimensional projection pursuit algorithm. Com-
munications in Statistics: Simulation and Computation, 19:1143{1164.
Posse, C. (1992). Projection pursuit discriminant analysis for two groups. Comm.
Statist. Theory Methods, 21(1):1{19.
Posse, C. (1995). Projection pursuit exploratory data analysis. Computational Statis-
tics & Data Analysis, 20(6):669{687.
Powell, M. J. D. (1964). E cient method for nding minimum of function of several-
variables without calculating derivatives. Computer Journal, 7(2):155{&.
Randles, R. H., Bro tt, J. D., Ramberg, J. S., and Hogg, R. V. (1978a). Discriminant
analysis based on ranks. J. Amer. Statist. Assoc., 73(362):379{384.
Randles, R. H., Bro tt, J. D., Ramberg, J. S., and Hogg, R. V. (1978b). General-
ized linear and quadratic discriminant functions using robust estimates. J. Amer.
Statist. Assoc., 73(363):564{568.
100
R atsch, G. (1998). Ensemble learning methods for classi cation. Diploma thesis (in
german).
R atsch, G., Onoda, T., and M uller, K.-R. (1998). Soft margins for AdaBoost. Tech-
nical Report NC-TR-1998-021, Department of Computer Science, Royal Holloway,
University of London, Egham, UK. Submitted to Machine Learning.
Rosenbrock, H. H. (1960). An automatic method for nding the gretest or least value
of a function. Computer Journal, 3(3):175{184.
Rousseeuw, P. (1985). Multivariate estimation with high breakdown point. In Mathe-
matical statistics and applications, Vol. B (Bad Tatzmannsdorf, 1983), pages 283{
297. Reidel, Dordrecht.
Rousseeuw, P. and Van Driessen, K. (1999). A fast algorithm for the minimum
covariance determinant estimator. Technometrics, 41(3):212{223.
Rousseeuw, P. J. (1984). Least median of squares regression. J. Amer. Statist. Assoc.,
79(388):871{880.
Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection.
Wiley, New York.
Rousseeuw, P. J. and Vanzomeren, B. C. (1990). Unmasking multivariate outliers and
levereage points. Journal of the American Statistical Association, 85(411):633{639.
101
Rousseeuw, P. J. and Yohai, V. (1984). Robust regression by means of S-estimators.
Robust and Nonlinear Time Series Analysis. Lecture Notes in Statistics, 26:256{
272.
Seber, G. A. F. (1984). Multivariate Observations. John Wiley, New York.
Ser ing, R. J. (1980). Approximation theorems of mathemtical statistics. John Wiley
& Sons Inc., New York. Wiley Series in Probability and Mathematical Statistics.
Singh, K. (1991). A notion of majority depth. In Technical Report. Department of
Statistics, Rutgers University.
Ting, K. (2002). Cost-sensitive classi cation using decision trees. Idea Group Pub-
lishing.
Tukey, J. (1974). Address to international congress of mathematicians. Vancouver.
Tukey, J. W. (1971). Explaratory Data Analysis. Addison-Wesley, Reading.
Tyler, D. E. (1987). A distribution-free M-estimator of multivariate scatter. Ann.
Statist., 15(1):234{251.
Van Ness, J. W. and Yang, J. J. (1998). Robust discriminant analysis: training data
breakdown point. J. Statist. Plann. Inference, 67(1):67{83.
Vardi, Y. and Zhang, C.-H. (2000). The multivariate L1-median and associated data
depth. Proc. Natl. Acad. Sci. USA, 97(4):1423{1426 (electronic).
102
Wang, J. and Ser ing, R. (2006). In uence functions for a general class of depth-based
generalized quantile functions. J. Multivariate Anal., 97(4):810{826.
Yenyukov, I. S. (1988). Detecting Structures by Means of Projection Pursuit. Pro-
ceeding of COMPSTAT, 88:48{58.
Yenyukov, I. S. (1989). Indices for projection pursuit. E. Diday (Ed.), Data Analysis,
Learning Symbolic and Numeric Language, pages 181{189.
Zuo, Y. and Ser ing, R. (2000). General notions of statistical depth function. Ann.
Statist., 28(2):461{482.
103
Appendices
104
Table 1: Application on Real Data: Performance of Projection Based Methods Using
Leave-one-out Misclassi cation Error Rates
Dist SPGT PGT GGT TD MaxD LDF QDF
Iris
Original 0.0933 0.0467 0.0400 0.3133 0.0333 0.1067 0.0333
Outliers 0.0774 0.0397 0.0387 0.2968 0.0774 0.2129 0.1226
Leukemia 0.0278 0.0290 0.0139 0.1111 0.0417 0.0417 0.0417
Colon 0.1290 0.1525 0.1290 0.1290 0.1774 0.1290 0.1613
105
Table 2: Performance of Projection Based Methods Using Monte Carlo Simulation:
Average Misclassi cation Error Rates and Standard Errors on 50 Replications of 1000
Test Cases per Group
Dist SPGT PGT GGT TD MaxD LDF QDF
N(I),N(I) 0.1641 0.1625 0.1624 0.1624 0.1665 0.1635 0.1659
(150,150) 0.0091 0.0086 0.0087 0.0083 0.0079 0.0075 0.007
N(I),N(I) 0.1638 0.1953 0.1696 0.1607 0.1711 0.1623 0.1685
(50,250) 0.0097 0.015 0.0122 0.0094 0.0114 0.0085 0.0098
t(I),t(I) 0.2165 0.216 0.2137 0.213 0.2932 0.2465 0.3698
(150,150) 0.0097 0.0109 0.011 0.0107 0.0372 0.0405 0.0738
t(I),t(I) 0.2258 0.4075 0.2227 0.2177 0.3093 0.26 0.3833
(50,250) 0.018 0.0711 0.0177 0.0151 0.0387 0.0501 0.0673
C(I),C(I) 0.2617 0.2643 0.2565 0.2557 0.4208 0.4492 0.5
(150,150) 0.0116 0.0152 0.0114 0.0116 0.0468 0.0777 0.0056
C(I),C(I) 0.2747 0.4938 0.2659 0.2615 0.4201 0.4501 0.4984
(50,250) 0.024 0.0239 0.0218 0.0208 0.0482 0.072 0.0095
LN(I),LN(I) 0.1595 0.1588 0.1581 0.1783 0.2361 0.2576 0.2273
(150,150) 0.0104 0.0099 0.0088 0.0119 0.0338 0.0243 0.0249
LN(I),LN(I) 0.1644 0.1971 0.1713 0.1856 0.2713 0.2761 0.2426
(50,250) 0.0106 0.0173 0.0139 0.0112 0.0405 0.0362 0.0338
106
Table 2 Continued
Dist SPGT PGT GGT TD MaxD LDF QDF
N(I),N(W) 0.0954 0.1033 0.105 0.1003 0.0663 0.0969 0.0656
(150,150) 0.0095 0.0124 0.012 0.0088 0.0062 0.007 0.0059
N(I),N(W) 0.0993 0.1001 0.0974 0.1039 0.0698 0.0999 0.0686
(50,250) 0.012 0.0171 0.0124 0.0111 0.0052 0.0086 0.0051
t(I),t(W) 0.2095 0.2093 0.2043 0.2045 0.247 0.2222 0.3152
(150,150) 0.012 0.015 0.0131 0.0119 0.054 0.0425 0.0693
t(I),t(W) 0.2151 0.337 0.2093 0.2077 0.267 0.2249 0.3313
(50,250) 0.0145 0.0818 0.0119 0.0129 0.0553 0.0354 0.0554
C(I),C(W) 0.2601 0.2567 0.2531 0.2505 0.4007 0.4264 0.4888
(150,150) 0.0145 0.0173 0.0141 0.0131 0.0542 0.0706 0.0176
C(I),C(W) 0.2645 0.4967 0.2546 0.2525 0.3728 0.4326 0.4675
(50,250) 0.0191 0.0186 0.0132 0.0126 0.0478 0.1004 0.0207
LN(I),LN(W) 0.1271 0.1253 0.1258 0.1493 0.1729 0.2273 0.2382
(150,150) 0.01 0.01 0.0093 0.0141 0.0228 0.0216 0.0293
LN(I),LN(W) 0.132 0.144 0.1313 0.1537 0.1889 0.2376 0.2339
(50,250) 0.0136 0.0144 0.0137 0.0139 0.036 0.0284 0.039
107
Table 2 Continued
Dist SPGT PGT GGT TD MaxD LDF QDF
N(I),t(I) 0.1817 0.1932 0.19 0.1851 0.3331 0.2054 0.207
(150,150) 0.0115 0.0163 0.0108 0.0117 0.0623 0.0443 0.0251
N(I),t(I) 0.1868 0.4262 0.2093 0.1893 0.3418 0.2022 0.2123
(50,250) 0.0104 0.0688 0.0132 0.0097 0.0458 0.0184 0.0267
N(I),C(I) 0.1844 0.2312 0.2213 0.2074 0.4516 0.3288 0.2356
(150,150) 0.0099 0.0253 0.0133 0.0082 0.0505 0.0841 0.0263
N(I),C(I) 0.1884 0.5016 0.2366 0.2057 0.4691 0.3314 0.2357
(50,250) 0.0171 0.0044 0.0172 0.0091 0.0341 0.0819 0.0238
LN(I),N(I) 0.131 0.1293 0.1286 0.131 0.1506 0.1506 0.2098
(150,150) 0.0091 0.0085 0.0087 0.0083 0.0169 0.0143 0.0256
LN(I),N(I) 0.1487 0.1529 0.1715 0.1375 0.1523 0.1522 0.2004
(50,250) 0.0163 0.0141 0.0153 0.0156 0.0248 0.0202 0.0271
t(I),C(I) 0.2333 0.2434 0.2392 0.2354 0.4077 0.3543 0.3918
(150,150) 0.0108 0.0212 0.0088 0.0084 0.0534 0.0796 0.0154
t(I),C(I) 0.2341 0.5024 0.2429 0.233 0.399 0.3405 0.3841
(50,250) 0.013 0.002 0.0159 0.0106 0.0532 0.0882 0.0177
LN(I),t(I) 0.1979 0.2032 0.204 0.1894 0.2268 0.2077 0.2452
(150,150) 0.0156 0.0154 0.0161 0.0125 0.0279 0.017 0.0709
LN(I),t(I) 0.225 0.2483 0.3979 0.1993 0.2546 0.2158 0.263
(50,250) 0.0234 0.023 0.0717 0.0204 0.04 0.0246 0.0653
LN(I),C(I) 0.2532 0.2789 0.2742 0.2433 0.3916 0.2805 0.3628
(150,150) 0.0233 0.0188 0.0253 0.0225 0.0426 0.0446 0.016
LN(I),C(I) 0.3024 0.3286 0.5018 0.2698 0.4091 0.3232 0.3608
(50,250) 0.0327 0.0328 0.0013 0.0374 0.0411 0.0765 0.0203
108
Table 2 Continued
Dist SPGT PGT GGT TD MaxD LDF QDF
N(W),t(I) 0.1552 0.1793 0.1782 0.1655 0.1908 0.1746 0.0848
(150,150) 0.0248 0.0156 0.023 0.0182 0.0417 0.0305 0.0104
N(W),t(I) 0.1566 0.1672 0.139 0.157 0.1741 0.1502 0.0807
(50,250) 0.026 0.0164 0.0303 0.0195 0.0498 0.0242 0.0072
N(W),C(I) 0.1804 0.2203 0.2194 0.205 0.4146 0.3474 0.1149
(150,150) 0.0185 0.0163 0.0302 0.0139 0.0566 0.0985 0.0228
N(W),C(I) 0.1861 0.2189 0.1689 0.2046 0.3866 0.3238 0.0991
(50,250) 0.0291 0.0181 0.0288 0.0168 0.0741 0.0991 0.0232
LN(W),N(I) 0.0145 0.0141 0.0116 0.0681 0.0733 0.1327 0.047
(150,150) 0.004 0.0042 0.0043 0.0142 0.0247 0.027 0.0106
LN(W),N(I) 0.0155 0.0188 0.0177 0.0725 0.0886 0.1417 0.0483
(50,250) 0.0047 0.0076 0.0084 0.0128 0.0353 0.026 0.0149
t(W),C(I) 0.2296 0.2365 0.2406 0.231 0.3288 0.3734 0.4131
(150,150) 0.0145 0.012 0.022 0.0105 0.0512 0.0875 0.0183
t(W),C(I) 0.2425 0.238 0.3266 0.2331 0.3274 0.3587 0.4031
(50,250) 0.0154 0.0135 0.0734 0.0104 0.0535 0.0996 0.0349
LN(W),t(I) 0.0584 0.0555 0.0551 0.084 0.1107 0.1532 0.2045
(150,150) 0.0075 0.0063 0.0065 0.0134 0.0239 0.03 0.0487
LN(W),t(I) 0.0637 0.0594 0.0599 0.0915 0.1125 0.159 0.1933
(50,250) 0.0106 0.0119 0.0135 0.0159 0.0246 0.0292 0.0428
LN(W),C(I) 0.1193 0.1156 0.1145 0.1373 0.1648 0.2324 0.4103
(150,150) 0.0135 0.0125 0.0127 0.0204 0.0427 0.0573 0.0735
LN(W),C(I) 0.1277 0.1493 0.2292 0.1445 0.1814 0.2564 0.4515
(50,250) 0.0187 0.0201 0.0495 0.0288 0.0561 0.0741 0.0369
109
Table 3: Application on Real Data: Performance of Rank Based Methods Using
Leave-one-out Misclassi cation Error Rates
Data RS1 RS2 RS S1 S2 S RL1 RL2 RL L1 L2 L
Iris
Original .060 .040 .053 .000 .500 .167 .270 .260 .267 .280 .260 .273
Outlier .130 .098 .119 .010 .549 .192 .510 .451 .490 .490 .294 .424
Leukemia .064 .120 .083 .064 .160 .097 .043 .040 .042 .021 .080 .042
Colon .136 .200 .177 .409 .100 .210 .091 .125 .113 .091 .125 .113
Table 3 Continued
Data RQ1 RQ2 RQ Q1 Q2 Q RM1 RM2 RM M1 M2 M
Iris
Original .060 .040 .053 .010 .100 .040 .060 .060 .060 .060 .100 .073
Outlier .110 .137 .119 .010 .843 .291 .080 .078 .080 .060 .118 .080
Leukemia .064 .120 .083 .064 .160 .097 .064 .120 .083 .106 .160 .125
Colon .136 .175 .161 .273 .100 .161 .091 .100 .097 .136 .075 .097
110
Table 4: Performance of Rank Based Methods Using Monte Carlo Simulation: Av-
erage Misclassi cation Error Rates and Standard Errors for Bivariate Data on 50
Replications of 1000 Test Cases per Group
Dist RS1 RS2 RS S1 S2 S RL1 RL2 RL L1 L2 L
N(I2),N(I2) .168 .166 .167 .160 .175 .167 .163 .165 .164 .159 .166 .162
(50,50) .036 .037 .010 .037 .045 .012 .035 .038 .009 .024 .027 .009
N(I2),N(I2) .157 .177 .167 .184 .154 .169 .153 .173 .163 .159 .164 .162
(25,75) .038 .039 .011 .047 .038 .015 .036 .035 .009 .028 .028 .009
t2(I2),t2(I2) .268 .267 .268 .276 .261 .268 .252 .242 .247 .241 .254 .248
(50,50) .060 .063 .035 .100 .099 .037 .083 .073 .061 .082 .094 .059
t2(I2),t2(I2) .250 .279 .265 .292 .246 .269 .233 .259 .246 .269 .226 .248
(25,75) .065 .053 .030 .114 .109 .039 .083 .076 .054 .114 .082 .063
C(I2),C(I2) .350 .337 .343 .363 .330 .347 .385 .397 .391 .417 .412 .415
(50,50) .067 .072 .046 .127 .114 .050 .149 .143 .136 .242 .224 .102
C(I2),C(I2) .345 .352 .349 .427 .282 .354 .371 .397 .384 .433 .363 .398
(25,75) .070 .082 .056 .154 .151 .056 .140 .138 .126 .240 .228 .102
N(I2),N(V) .126 .134 .130 .190 .077 .133 .146 .150 .148 .162 .129 .145
(50,50) .035 .035 .008 .042 .026 .011 .036 .041 .009 .028 .027 .008
N(I2),N(V) .133 .141 .137 .216 .072 .144 .149 .156 .152 .168 .128 .148
(25,75) .044 .047 .011 .052 .024 .018 .044 .053 .012 .029 .028 .009
t2(I2),t2(V) .237 .237 .237 .342 .145 .244 .236 .239 .238 .236 .240 .238
(50,50) .041 .055 .029 .100 .063 .032 .052 .062 .044 .062 .081 .044
t2(I2),t2(V) .237 .261 .249 .383 .127 .255 .234 .228 .231 .246 .213 .230
(25,75) .064 .054 .026 .116 .055 .042 .075 .054 .046 .078 .069 .041
C(I2),C(V) .318 .320 .319 .409 .237 .323 .342 .375 .358 .357 .409 .383
(50,50) .058 .053 .040 .163 .135 .050 .114 .121 .110 .237 .227 .089
C(I2),C(V) .320 .378 .349 .490 .227 .358 .380 .418 .399 .449 .392 .421
(25,75) .077 .078 .049 .204 .142 .068 .155 .148 .144 .295 .263 .089
111
Table 4 Continued
Dist RQ1 RQ2 RQ Q1 Q2 Q RM1 RM2 RM M1 M2 M
N(I2),N(I2) .166 .165 .165 .162 .166 .164 .172 .165 .169 .165 .169 .167
(50,50) .038 .037 .009 .026 .029 .010 .043 .036 .012 .034 .037 .012
N(I2),N(I2) .171 .159 .165 .166 .164 .165 .176 .168 .172 .180 .158 .169
(25,75) .046 .042 .011 .029 .031 .012 .051 .047 .013 .038 .031 .012
t2(I2),t2(I2) .276 .287 .282 .289 .347 .318 .257 .244 .250 .252 .253 .253
(50,50) .065 .070 .049 .165 .210 .080 .048 .045 .026 .050 .053 .023
t2(I2),t2(I2) .268 .280 .274 .324 .315 .320 .259 .247 .253 .265 .238 .251
(25,75) .070 .074 .049 .207 .188 .085 .064 .058 .028 .044 .048 .027
C(I2),C(I2) .405 .395 .400 .526 .436 .481 .344 .335 .340 .348 .340 .344
(50,50) .099 .084 .068 .376 .379 .044 .058 .057 .037 .108 .077 .056
C(I2),C(I2) .388 .435 .411 .312 .648 .480 .356 .332 .344 .391 .312 .351
(25,75) .094 .090 .078 .303 .316 .033 .076 .071 .039 .125 .102 .055
N(I2),N(V) .128 .135 .131 .133 .120 .126 .130 .137 .133 .138 .124 .131
(50,50) .035 .039 .010 .027 .027 .006 .040 .048 .011 .028 .027 .009
N(I2),N(V) .122 .141 .131 .148 .118 .133 .152 .120 .136 .154 .116 .135
(25,75) .030 .036 .010 .032 .028 .011 .040 .038 .015 .038 .036 .012
t2(I2),t2(V) .255 .252 .253 .242 .368 .305 .226 .217 .221 .214 .246 .230
(50,50) .059 .061 .041 .204 .197 .085 .047 .050 .021 .032 .074 .032
t2(I2),t2(V) .250 .253 .251 .222 .371 .296 .231 .218 .225 .238 .213 .225
(25,75) .054 .072 .039 .150 .165 .062 .047 .053 .020 .040 .056 .025
C(I2),C(V) .358 .384 .371 .391 .541 .466 .303 .314 .308 .277 .379 .328
(50,50) .082 .087 .066 .374 .347 .051 .051 .060 .029 .038 .106 .047
C(I2),C(V) .362 .405 .384 .300 .649 .475 .301 .316 .309 .331 .333 .332
(25,75) .100 .097 .079 .354 .326 .026 .053 .065 .034 .088 .114 .040
112
Table 4 Continued
Dist RS1 RS2 RS S1 S2 S RL1 RL2 RL L1 L2 L
N(I2),C(I2) .197 .221 .209 .651 .045 .348 .335 .352 .344 .144 .499 .321
(50,50) .056 .044 .025 .170 .029 .072 .136 .141 .133 .134 .221 .100
N(I2),C(I2) .179 .243 .211 .633 .046 .339 .336 .392 .364 .131 .498 .314
(25,75) .056 .050 .025 .149 .022 .065 .149 .180 .157 .102 .231 .109
N(I2),t2(I2) .194 .215 .205 .356 .112 .234 .204 .215 .210 .170 .239 .204
(50,50) .052 .042 .019 .116 .038 .043 .062 .044 .032 .054 .049 .031
N(I2),t2(I2) .206 .239 .223 .410 .105 .257 .206 .235 .221 .177 .247 .212
(25,75) .051 .052 .029 .134 .035 .054 .061 .069 .045 .049 .059 .044
C(I2),t2(I2) .328 .276 .302 .131 .508 .320 .380 .354 .367 .472 .242 .357
(50,50) .065 .055 .036 .062 .145 .054 .143 .151 .142 .196 .150 .110
C(I2),t2(I2) .344 .276 .310 .199 .468 .334 .357 .318 .338 .498 .184 .341
(25,75) .080 .063 .042 .132 .193 .068 .144 .120 .115 .217 .096 .091
N(I2),C(V) .140 .183 .161 .620 .028 .324 .301 .327 .314 .111 .477 .294
(50,50) .043 .039 .017 .193 .018 .088 .148 .128 .129 .106 .214 .092
N(I2),C(V) .143 .223 .183 .648 .027 .338 .308 .382 .345 .113 .486 .300
(25,75) .065 .049 .029 .190 .021 .086 .151 .140 .135 .097 .215 .086
N(I2),t2(V) .144 .176 .160 .359 .065 .212 .178 .209 .193 .151 .228 .189
(50,50) .042 .033 .014 .125 .025 .052 .050 .032 .024 .046 .048 .021
N(I2),t2(V) .153 .191 .172 .420 .052 .236 .185 .215 .200 .167 .221 .194
(25,75) .052 .047 .024 .121 .024 .051 .047 .059 .035 .045 .049 .029
C(I2),t2(V) .288 .259 .273 .213 .344 .278 .358 .349 .354 .445 .277 .361
(50,50) .047 .053 .033 .086 .128 .040 .122 .135 .121 .198 .154 .120
C(I2),t2(V) .323 .269 .296 .258 .346 .302 .340 .325 .332 .480 .216 .348
(25,75) .067 .061 .049 .127 .158 .051 .131 .123 .117 .235 .127 .107
113
Table 4 Continued
Dist RQ1 RQ2 RQ Q1 Q2 Q RM1 RM2 RM M1 M2 M
N(I2),C(I2) .197 .218 .208 .027 .457 .242 .177 .179 .178 .109 .236 .172
(50,50) .044 .034 .020 .031 .113 .043 .054 .033 .020 .031 .055 .020
N(I2),C(I2) .195 .229 .212 .025 .468 .246 .201 .165 .183 .139 .232 .185
(25,75) .050 .043 .023 .024 .100 .040 .071 .038 .027 .056 .058 .030
N(I2),t2(I2) .200 .201 .200 .113 .285 .199 .190 .182 .186 .146 .221 .183
(50,50) .053 .029 .020 .037 .072 .025 .049 .030 .017 .030 .037 .014
N(I2),t2(I2) .204 .225 .215 .125 .310 .217 .207 .189 .198 .190 .212 .201
(25,75) .066 .041 .024 .053 .090 .030 .068 .042 .025 .065 .043 .029
C(I2),t2(I2) .333 .294 .313 .723 .093 .408 .272 .268 .270 .323 .230 .277
(50,50) .069 .057 .043 .135 .134 .040 .059 .051 .029 .094 .041 .039
C(I2),t2(I2) .318 .316 .317 .681 .110 .395 .279 .263 .271 .370 .205 .287
(25,75) .060 .080 .046 .169 .100 .058 .062 .044 .029 .112 .047 .043
N(I2),C(V) .161 .163 .162 .010 .414 .212 .154 .129 .142 .067 .215 .141
(50,50) .057 .030 .022 .011 .077 .034 .056 .030 .023 .026 .044 .018
N(I2),C(V) .144 .194 .169 .013 .418 .216 .151 .138 .144 .097 .202 .149
(25,75) .040 .042 .021 .017 .084 .036 .053 .040 .022 .053 .055 .019
N(I2),t2(V) .151 .161 .156 .061 .275 .168 .155 .137 .146 .111 .177 .144
(50,50) .034 .034 .015 .025 .072 .027 .045 .031 .016 .030 .033 .012
N(I2),t2(V) .140 .185 .163 .080 .256 .168 .166 .149 .157 .113 .177 .145
(25,75) .045 .041 .016 .040 .091 .030 .052 .046 .021 .043 .043 .013
C(I2),t2(V) .354 .308 .331 .654 .143 .398 .269 .240 .254 .301 .216 .259
(50,50) .080 .086 .069 .197 .185 .056 .040 .046 .019 .053 .053 .029
C(I2),t2(V) .333 .291 .312 .634 .165 .399 .290 .253 .272 .326 .206 .266
(25,75) .066 .061 .049 .236 .189 .070 .058 .062 .029 .080 .056 .038
114
Table 5: Performance of Rank Based Methods Using Monte Carlo Simulation: Aver-
age Misclassi cation Error Rates and Standard Errors for 4D Data on 50 Replications
of 1000 Test Cases per Group
Dist RS1 RS2 RS S1 S2 S RL1 RL2 RL L1 L2 L
N(I4),N(I4) .186 .171 .179 .190 .169 .180 .177 .159 .168 .169 .163 .166
(50,50) .039 .039 .014 .048 .054 .017 .038 .034 .012 .031 .027 .011
N(I4),N(I4) .188 .185 .187 .243 .143 .193 .168 .169 .169 .177 .162 .170
(25,75) .039 .040 .014 .073 .042 .022 .032 .033 .010 .029 .027 .011
t2(I4),t2(I4) .317 .326 .321 .349 .302 .325 .270 .273 .271 .262 .280 .271
(50,50) .065 .054 .039 .151 .132 .046 .073 .081 .060 .065 .095 .060
t2(I4),t2(I4) .322 .329 .326 .408 .261 .335 .277 .286 .282 .314 .251 .283
(25,75) .067 .057 .039 .148 .126 .039 .075 .078 .060 .090 .080 .061
C(I4),C(I4) .420 .423 .422 .440 .396 .418 .418 .410 .414 .415 .425 .420
(50,50) .069 .077 .050 .150 .150 .045 .081 .112 .083 .158 .169 .080
C(I4),C(I4) .414 .441 .427 .509 .349 .429 .425 .457 .441 .560 .343 .452
(25,75) .073 .070 .048 .164 .163 .043 .109 .115 .092 .233 .213 .069
N(I4),N(W) .079 .077 .078 .100 .047 .073 .103 .118 .111 .177 .030 .104
(50,50) .020 .033 .009 .020 .023 .008 .021 .046 .016 .028 .025 .012
N(I4),N(W) .081 .080 .081 .115 .033 .074 .096 .119 .108 .184 .016 .100
(25,75) .025 .037 .010 .026 .016 .008 .019 .048 .018 .030 .011 .011
t2(I4),t2(W) .245 .246 .246 .402 .111 .257 .245 .252 .248 .263 .235 .249
(50,50) .057 .062 .037 .134 .073 .052 .072 .076 .060 .061 .100 .055
t2(I4),t2(W) .246 .271 .258 .463 .086 .274 .244 .268 .256 .296 .203 .250
(25,75) .043 .060 .032 .139 .055 .054 .085 .091 .075 .111 .067 .070
C(I4),C(W) .363 .408 .386 .547 .246 .396 .409 .435 .422 .382 .477 .430
(50,50) .065 .067 .042 .186 .137 .055 .118 .094 .088 .204 .172 .078
C(I4),C(W) .352 .413 .383 .573 .196 .385 .400 .443 .421 .492 .372 .432
(25,75) .074 .077 .049 .191 .138 .058 .129 .129 .117 .247 .183 .104
115
Table 5 Continued
Dist RQ1 RQ2 RQ Q1 Q2 Q RM1 RM2 RM M1 M2 M
N(I4),N(I4) .177 .176 .176 .180 .171 .175 .188 .192 .190 .177 .197 .187
(50,50) .038 .036 .015 .034 .033 .013 .043 .047 .017 .038 .040 .017
N(I4),N(I4) .183 .186 .185 .215 .161 .188 .215 .212 .213 .223 .172 .198
(25,75) .045 .048 .018 .045 .029 .017 .064 .054 .029 .070 .038 .026
t2(I4),t2(I4) .336 .324 .330 .359 .416 .387 .293 .304 .298 .308 .303 .306
(50,50) .071 .074 .051 .236 .218 .063 .050 .052 .028 .058 .069 .036
t2(I4),t2(I4) .322 .335 .329 .366 .393 .380 .317 .315 .316 .400 .268 .334
(25,75) .081 .067 .051 .203 .206 .063 .065 .063 .036 .093 .063 .042
C(I4),C(I4) .443 .444 .443 .499 .481 .490 .408 .403 .405 .420 .440 .430
(50,50) .082 .082 .056 .345 .347 .024 .067 .065 .043 .118 .126 .044
C(I4),C(I4) .452 .459 .456 .323 .654 .489 .415 .416 .415 .535 .340 .438
(25,75) .069 .081 .051 .261 .275 .023 .088 .079 .034 .123 .104 .046
N(I4),N(W) .078 .077 .077 .094 .051 .072 .082 .084 .083 .097 .056 .076
(50,50) .023 .037 .012 .016 .018 .007 .023 .042 .014 .020 .019 .009
N(I4),N(W) .074 .092 .083 .102 .044 .073 .096 .087 .091 .107 .052 .080
(25,75) .021 .038 .012 .020 .016 .007 .031 .040 .022 .024 .020 .011
t2(I4),t2(W) .234 .285 .259 .216 .379 .297 .214 .213 .214 .203 .236 .220
(50,50) .050 .058 .037 .144 .152 .059 .038 .064 .024 .038 .057 .026
t2(I4),t2(W) .258 .285 .272 .213 .386 .299 .226 .226 .226 .249 .190 .220
(25,75) .056 .068 .040 .075 .133 .041 .054 .069 .031 .043 .055 .028
C(I4),C(W) .389 .421 .405 .360 .581 .471 .326 .364 .345 .301 .407 .354
(50,50) .071 .085 .055 .347 .322 .033 .045 .057 .033 .072 .101 .036
C(I4),C(W) .388 .440 .414 .213 .702 .458 .354 .346 .350 .393 .351 .372
(25,75) .064 .063 .031 .190 .183 .024 .071 .068 .029 .095 .095 .042
116
Table 5 Continued
Dist RS1 RS2 RS S1 S2 S RL1 RL2 RL L1 L2 L
N(I4),C(I4) .159 .225 .192 .899 .007 .453 .344 .402 .373 .151 .495 .323
(50,50) .045 .033 .023 .091 .007 .043 .120 .114 .110 .106 .164 .087
N(I4),C(I4) .159 .264 .212 .884 .009 .446 .314 .442 .378 .097 .554 .326
(25,75) .062 .049 .029 .138 .012 .063 .124 .126 .115 .082 .171 .079
N(I4),t2(I4) .177 .241 .209 .627 .046 .337 .215 .256 .236 .172 .282 .227
(50,50) .049 .039 .021 .177 .033 .075 .078 .068 .056 .060 .074 .053
N(I4),t2(I4) .195 .291 .243 .658 .040 .349 .201 .270 .235 .186 .259 .222
(25,75) .072 .052 .038 .159 .023 .069 .076 .068 .053 .057 .067 .047
C(I4),t2(I4) .384 .303 .343 .087 .705 .396 .393 .340 .367 .487 .214 .351
(50,50) .047 .057 .036 .076 .175 .064 .111 .103 .094 .170 .100 .083
C(I4),t2(I4) .402 .315 .359 .154 .665 .410 .431 .353 .392 .609 .161 .385
(25,75) .073 .065 .037 .137 .214 .060 .115 .115 .106 .178 .092 .078
N(I4),C(W) .108 .174 .141 .881 .002 .442 .314 .401 .357 .085 .509 .297
(50,50) .037 .034 .017 .137 .004 .067 .128 .137 .124 .066 .191 .079
N(I4),C(W) .107 .227 .167 .911 .002 .457 .341 .453 .397 .104 .526 .315
(25,75) .054 .051 .026 .108 .003 .053 .122 .152 .118 .087 .173 .078
N(I4),t2(W) .121 .150 .135 .529 .013 .271 .174 .218 .196 .164 .226 .195
(50,50) .041 .036 .018 .191 .013 .091 .054 .080 .058 .047 .103 .057
N(I4),t2(W) .123 .190 .156 .618 .009 .313 .179 .224 .201 .172 .209 .190
(25,75) .044 .043 .017 .163 .010 .078 .069 .073 .051 .055 .078 .048
C(I4),t2(W) .346 .291 .319 .121 .585 .353 .397 .381 .389 .485 .278 .381
(50,50) .069 .063 .043 .106 .171 .059 .113 .117 .107 .172 .120 .108
C(I4),t2(W) .335 .284 .310 .190 .476 .333 .378 .347 .362 .539 .188 .363
(25,75) .063 .070 .044 .105 .175 .057 .133 .115 .112 .194 .095 .104
117
Table 5 Continued
Dist RQ1 RQ2 RQ Q1 Q2 Q RM1 RM2 RM M1 M2 M
N(I4),C(I4) .165 .203 .184 .007 .396 .201 .142 .171 .156 .077 .233 .155
(50,50) .055 .039 .022 .009 .063 .028 .050 .038 .021 .036 .042 .020
N(I4),C(I4) .162 .233 .197 .012 .411 .211 .179 .170 .174 .141 .206 .174
(25,75) .066 .042 .028 .019 .073 .030 .064 .032 .027 .067 .052 .027
N(I4),t2(I4) .207 .217 .212 .090 .314 .202 .190 .192 .191 .158 .228 .193
(50,50) .047 .036 .024 .042 .081 .025 .058 .037 .025 .050 .054 .028
N(I4),t2(I4) .221 .244 .233 .130 .307 .218 .227 .209 .218 .230 .219 .224
(25,75) .070 .047 .029 .065 .074 .031 .073 .047 .039 .082 .054 .040
C(I4),t2(I4) .385 .325 .355 .665 .088 .376 .327 .303 .315 .406 .246 .326
(50,50) .046 .056 .038 .067 .032 .022 .065 .053 .021 .081 .050 .029
C(I4),t2(I4) .389 .303 .346 .664 .115 .389 .343 .308 .325 .432 .220 .326
(25,75) .073 .067 .041 .119 .125 .036 .081 .057 .034 .085 .047 .028
N(I4),C(W) .111 .154 .132 .003 .293 .148 .103 .101 .102 .035 .171 .103
(50,50) .040 .033 .017 .009 .058 .027 .035 .029 .014 .018 .033 .013
N(I4),C(W) .119 .170 .145 .004 .297 .151 .137 .114 .125 .094 .140 .117
(25,75) .053 .040 .022 .005 .041 .019 .057 .032 .022 .063 .039 .231
N(I4),t2(W) .131 .144 .137 .041 .241 .141 .111 .114 .112 .086 .139 .113
(50,50) .040 .036 .019 .021 .061 .023 .039 .033 .015 .028 .039 .014
N(I4),t2(W) .129 .186 .157 .070 .222 .146 .131 .126 .128 .140 .131 .136
(25,75) .053 .036 .019 .044 .081 .026 .052 .047 .028 .063 .043 .029
C(I4),t2(W) .370 .314 .342 .646 .114 .380 .243 .256 .250 .271 .233 .252
(50,50) .074 .066 .056 .103 .044 .037 .053 .057 .031 .070 .053 .032
C(I4),t2(W) .360 .308 .334 .643 .115 .379 .287 .232 .259 .331 .182 .257
(25,75) .076 .084 .060 .162 .105 .054 .068 .042 .028 .069 .031 .030
118