Robust Nonparametric Discriminant Analysis Procedures Except where reference is made to the work of others, the work described in this dissertation is my own or was done in collaboration with my advisory committee. This dissertation does not include proprietary or classi ed information. Sai Vamshidhar Nudurupati Certi cate of Approval: Nedret Billor Associate Professor Mathematics and Statistics Asheber Abebe, Chair Associate Professor Mathematics and Statistics Peng Zeng Assistant Professor Mathematics and Statistics George T. Flowers Dean Graduate School Robust Nonparametric Discriminant Analysis Procedures Sai Vamshidhar Nudurupati A Dissertation Submitted to the Graduate Faculty of Auburn University in Partial Ful llment of the Requirements for the Degree of Doctor of Philosophy Auburn, Alabama May 9, 2009 Robust Nonparametric Discriminant Analysis Procedures Sai Vamshidhar Nudurupati Permission is granted to Auburn University to make copies of this dissertation at its discretion, upon the request of individuals or institutions and at their expense. The author reserves all publication rights. Signature of Author Date of Graduation iii Vita Sai Vamshidhar Nudurupati, son of Sri. Sai Baba Nudurupati and Smt. Sarada Nuduruapti, was born on April 8, 1980, in Visakhapatnam, India. He graduated from Little Flower Junior College, Hyderabad, India in May 1997. He joined MJCET, Os- mania University, Hyderabad, India in November 1997 and graduated with a Bachelor of Engineering in Mechanical Engineering in May 2001. He graduated with a Masters in Industrial Engineering from Arizona State University in May 2003. He joined the doctoral program in Department of Mathematics and Statistics at Auburn University in January 2004 under the guidance of Dr. Asheber Abebe. He is married to Sheetal Paliwal, daughter of Sri. Hansraj Paliwal and Smt. Savita Paliwal. iv Dissertation Abstract Robust Nonparametric Discriminant Analysis Procedures Sai Vamshidhar Nudurupati Doctor of Philosophy, May 9, 2009 (M.S., Arizona State University, 2003) (B.S., MJCET - Osmania University, 2001) 130 Typed Pages Directed by Asheber Abebe In this study, a nonparametric discriminant analysis procedure that is less sensi- tive than traditional procedures to deviations from the usual assumptions is proposed. The procedure uses the projection pursuit methodology where the projection index is the two-group transvariation probability. Montanari (2004) proposed and used this projection index to measure group separation but allocated the new observation using simple Euclidean distances from projected centers. Our procedure employs a method of allocation based on the centrality of the new point measured using two versions of the transvariation probability: a symmetrized two-group transvariation and a smooth version of point-group transvariation. It is shown by simulation that the procedures proposed in this study provide lower misclassi cation error rates than classical pro- cedures such as linear discriminant analysis and quadratic discriminant analysis and recent procedures like maximum depth and Montanari?s transvariation-based classi- ers under a variety of distributional settings. A di erent rank-based procedure for v classi cation is considered where ranking is applied on classical classi ers as well as recently introduced classi ers such as the maximum L1 depth and quadratic discrimi- nant function based on the minimum covariance determinant (MCD) estimates of the mean and covariance. 
An extensive simulation study shows that not only does the ranking method provide balance between misclassi cation error rates for each group but also yields lower total probabilities of misclassi cation and higher consistency of correct classi cation for heavy-tailed distributions. A theoretical evaluation of the in uence function shows that this new procedure is robust against local in nitesimal contaminations. vi Acknowledgments I would like to express my earnest gratitude to Dr. Asheber Abebe for his continuous support and guidance throughout the course of my degree. I would like to acknowledge my committee members, Dr. Nedret Billor and Dr. Peng Zeng for their valuable time and support. My sincere gratitude to Dr. Curtis Jolly for acting as an external reader. I am deeply indebted to my family, who went far beyond their normal boundaries to provide me with all comforts. I thank them for all the sacri ces they have made and their devotion. The unconditional love and support of my siblings, Kiran and Chaitanya helped me accomplish this degree. My parents-in law, deserve a special mention for their immense support and a ection. Words fail to express my appreciation to my wife, Sheetal, whose commitment and persistence have helped me realize my aspirations. Her endless love and patience kept me focused and always steered me in the right direction. I consider myself fortunate to have her as my life partner. Thanks to one and all. Finally, thanks to Lord Sri Sairam for his grace and blessings. vii Style manual or journal used Journal of Approximation Theory (together with the style known as \aums"). Bibliograpy follows van Leunen?s A Handbook for Scholars. Computer software used The document preparation package TEX (speci cally LATEX) together with the departmental style- le aums.sty. viii Table of Contents List of Figures xi List of Tables xii 1 Introduction 1 2 Background 4 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.2 Rank Based Procedures for Classi cation . . . . . . . . . . . . . . . . 10 2.2.1 Projection Pursuit Based Classi ers . . . . . . . . . . . . . . . 10 2.2.2 Procedures Based on Multivariate Ranking . . . . . . . . . . . 20 2.3 Classi ers Based on Robust Estimators of Location and Scale . . . . 27 2.3.1 M Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.3.2 S Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.3.3 MCD Estimators . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.3.4 Other . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3 Proposed Robust Projection Pursuit Based Allocation Schemes 36 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.2 NT-Net . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.3 Allocation Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.3.1 Allocation Based on Distance . . . . . . . . . . . . . . . . . . 43 3.3.2 Allocation Based on Point-Group Transvariation . . . . . . . . 44 3.4 Proposed Symmetrized Allocation Scheme . . . . . . . . . . . . . . . 46 3.5 Proposed Smoothed Allocation Scheme . . . . . . . . . . . . . . . . . 48 3.6 Application on Real Data and Monte Carlo Simulation . . . . . . . . 53 3.6.1 Application on Real Data Sets . . . . . . . . . . . . . . . . . . 54 3.6.2 Monte Carlo Simulation . . . . . . . . . . . . . . . . . . . . . 56 4 Robust and Balanced Classification 60 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
60 4.2 Ranking Discriminant Function . . . . . . . . . . . . . . . . . . . . . 61 4.3 Application on Real Data and Monte Carlo Simulation . . . . . . . . 66 4.3.1 Application on Real Data Sets . . . . . . . . . . . . . . . . . . 66 4.3.2 Monte Carlo Simulation . . . . . . . . . . . . . . . . . . . . . 67 ix 5 Influence Function 72 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 5.2 IF for QDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 5.3 IF for RD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 5.4 Sensitivity Curves for the Projection Based Classi er . . . . . . . . . 82 6 Conclusion 86 Appendices 104 x List of Figures 3.1 An NT-net on S2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.2 Graphical Depiction of Transvariation and Smoothed Transvariation . 50 5.1 Sensitivity Curve for the Projection Based Methods . . . . . . . . . . 83 5.2 A Zoomed View of the Sensitivity Curve for the Projection Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 xi List of Tables 1 Application on Real Data: Performance of Projection Based Methods Using Leave-one-out Misclassi cation Error Rates . . . . . . . . . . . 105 2 Performance of Projection Based Methods Using Monte Carlo Simu- lation: Average Misclassi cation Error Rates and Standard Errors on 50 Replications of 1000 Test Cases per Group . . . . . . . . . . . . . 106 3 Application on Real Data: Performance of Rank Based Methods Using Leave-one-out Misclassi cation Error Rates . . . . . . . . . . . . . . . 110 4 Performance of Rank Based Methods Using Monte Carlo Simulation: Average Misclassi cation Error Rates and Standard Errors for Bivari- ate Data on 50 Replications of 1000 Test Cases per Group . . . . . . 111 5 Performance of Rank Based Methods Using Monte Carlo Simulation: Average Misclassi cation Error Rates and Standard Errors for 4D Data on 50 Replications of 1000 Test Cases per Group . . . . . . . . . . . . 115 xii Chapter 1 Introduction Discriminant analysis is the process of devising rules to assign a new individual data point into one of K (K > 1) known groups. The method is usually based on previously known information related to the K groups, known as training data whose correct classi cation information is known. In this dissertation, the focus would be on the two-group discrimination problem (K = 2). Classi cation or discrimination has a wide range of applications. A very small list of applications includes: spam lters for an email engine that sends good emails to the inbox and bad emails to a spam folder; voice recognition software used to distinguish the source of the voice from among several speakers; methods to distinguish who is a bad risk for credit and who is credit worthy; methods to classify a patient?s tumor as cancerous or benign. A discriminant analysis procedure uses all the variables that the training data contains and use their correct classi cation information to create a discriminant model, a discriminant rule or a classi er. Classi cation is done by feeding new ob- servations into this classi er or model and getting the group membership to which the new observation belongs. A classi er is said to be a good classi er if it provides low misclassi cation error rates not just for the best of the situations but also under various conditions such as shape of the groups, number of groups, size of the groups etc. 
Fisher (1936) came up with what is considered to be the rst scienti c discrimi- nant model. His classi er performs well for data that follow normal distributions, the 1 groups share the same covariance structure and if there are no outliers. Violations of any of these assumptions may make Fisher?s method unstable. There has been a lot of research done in the eld of classi cation ever since then in an e ort to come up with classi ers that are robust, ones that are not sensitive to violations of certain assump- tions. A wide range of techniques have been used to develop various such classi ers that are robust to deviations in the parametric, semi-parametric and nonparametric realms. Most of the parametric methods become sensitive with deviations from their underlying assumptions. Researchers devised methods that are more robust to such deviations using semi-parametric and nonparametric techniques which address issues that existing methods did not consider. In this dissertation, we propose modi cations to existing semi-parametric classi- ers in an e ort to create classi ers that are robust to deviations. The rst proposed method provides two nonparametric alternatives for the allocation process provided in the classi er by Montanari (2004). The second method is a rank based method where discriminant functions are ranked and we wish to show that ranking can pro- vide equal misclassi cation error rates in each group which might be very pertinent in some situations. The optimality of our classi ers is shown by comparing the proposed classi er with existing classi ers. This is done by the use of real data sets and also via extensive Monte Carlo simulation studies. Sensitivity curves will be used to show the e ect of local and gross perturbations on the probability of misclassi cation error rate. 2 In uence functions are used to ascertain the robustness of the rank based discriminant function theoretically. 3 Chapter 2 Background 2.1 Introduction Let us start with the univariate two sample location problem. SupposeX1;:::;Xnx are independent and identically distributed (iid) random variables that are normally distributed with mean x and variance 2 (N( x; 2)) and Y1;:::;Yny iid N( y; 2). Suppose also that Xi is independent of Yj for i = 1;:::;nx and j = 1;:::;ny. Our desire is to test the hypothesis H0 : x = y versus the alternative hypothe- sis H1 : x 6= y. One then may employ the two sample t test that uses the test statistic txy = X Y Sp qn x+ny nxny ; where Sp is the pooled standard deviation, to obtain an estimate of the standardized separation between the two locations. The null is rejected in favor of the alternative if jtxyjis larger than t (df), the critical value, where is the allowable type-I error rate and df are the degrees of freedom. Alternatively, one may use the non-parametric counterpart of the two-sample t test known as the Mann-Whitney test (Hollander and Wolfe, 1999) which uses the test statistic uxy = nxX i=1 nyX j=1 (Xi;Yj) ; (2.1) 4 where (r;s) = Ifr 0 : The function D is known as a discriminant function. The probability of an observation from x being misclassi ed in y is PDyjx = PfD(Z;F;G) < 0 jZ Fg and the probability of misclassifying a random variable from population y into pop- ulation x is PDxjy = PfD(Z;F;G) > 0 jZ Gg : The total cost of misclassi cation is then CxPDyjx+(1 )CyPDxjy where is the prior probability that an observation comes from x and Cx and Cy are the costs of mis- classi cation of an observation from x in y and from y in x, respectively. 
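For readability, the misclassification probabilities and the expected-cost expression in the preceding paragraph can be restated in clean form (this only re-typesets the quantities already defined there; pi denotes the prior probability that an observation comes from population pi_x):

% restatement of the misclassification probabilities and total expected cost
P^{D}_{y|x} = P\{\, D(Z;F,G) < 0 \mid Z \sim F \,\}, \qquad
P^{D}_{x|y} = P\{\, D(Z;F,G) > 0 \mid Z \sim G \,\},
\qquad
\mathrm{TC} = \pi\, C_x\, P^{D}_{y|x} + (1-\pi)\, C_y\, P^{D}_{x|y}.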
Under the assumption that the priors are equal and that the costs of misclassi cation being 6 equal for the two populations (Cx = Cy = 1), the total probability of misclassi cation (TPM) is PD = 12PDyjx+ 12PDxjy. Hereafter, we will assume the costs Cx and Cy to equal 1. The following two de nitions describe the notions of optimality and robustness in the framework of discriminant analysis. De nition 2.1. We say a discriminant function D is optimal if PD PD for any other discriminant function D and we say that D is more optimal than D if PD C and to y otherwise. The cuto C is usually 0 when the prior probabilities are unknown. This method is built to be optimal in classifying the new observation Z under the assumption that F and G are both multivariate normal distributions that are di erent in location 7 but have the same scatter. In particular, if F and G are Nd( x; x) and Nd( y; y), respectively and under the assumption that = x = y. For the situation where x6= y is true, the optimal procedure uses a quadratic combination of the d-covariates that maximizes the separation between the popula- tions x and y. This is known as Quadratic Discriminant Function (QDF) which is given by: Q(z;F;G) ln j yj j xj (z x)0 1x (z x) + (z y)0 1y (z y): (2.4) This becomes the optimal rule for classi cation under the conditions where a new observation Z = z is assigned to x if Q(z;F;G) > 0 and to y otherwise. Given the random samples X and Y, the sample versions of LDF and QDF are given by L(z;Fnx;Gny) = ( x y)0^ 1 z 12( x + y) and Q(z;Fnx;Gny) = ln j^ yj j^ xj ! (z x)0^ 1x (z x) + (z y)0^ 1y (z y) ; respectively, where ^ x, ^ y, and ^ are the estimators of x, y, and , respectively. Linear and Quadratic discriminant functions are optimal under certain conditions and assumptions and may be very sensitive to any deviation from the assumptions. Deviations from normality, equal covariance in the case of LDF, existence of outliers 8 in the case of QDF are examples where LDF and QDF are not optimal. In fact the more serious the deviations and the greater the number of deviations, the more sen- sitive these methods get. Some work that considered the issue of robustness of LDF and QDF is Lachenbruch et al. (1973), Lachenbruch (1975), Hills (1967), McLachlan (1992), Anderson (1984), Dillon (1979), Johnson et al. (1979) and Seber (1984). A number of these authors investigated the robustness of LDF and QDF with respect to their TPM to some non-linear transformations of the normal distribution suggested in Johnson (1949). Hills (1967) looked at discrimination in data that are non-normal, speci cally when the data are discrete. Lachenbruch et al. (1973) investigated the performance of LDF under certain non-normality conditions, speci cally, log normal, logit normal and the inverse hyperbolic sine normal distributions. Optimal misclas- si cation probabilities in these cases are calculated by taking an appropriate inverse transformation. In such cases, nding the cuto C theoretically using a minimax rule is a very di cult problem. So Lachenbruch et al. (1973) determined the value of C, approximately, by using 25 di erent discrete points. They also found that the trans- formation makes the two populations heteroscedastic, so they proposed to assign the new observation Z = z to x if Q(z;F;G) >C and to y otherwise. They provided some theoretical results in addition to presenting a Monte Carlo study. 
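Before turning to the robustness issues discussed next, the sample rules L(z; F_nx, G_ny) and Q(z; F_nx, G_ny) defined above can be sketched computationally. This is a minimal Python/NumPy illustration, not the code used in this dissertation; the function names are illustrative only.

import numpy as np

def ldf(z, X, Y):
    """Sample linear discriminant function L(z); assign z to pi_x when L(z) > 0.
    Illustrative sketch of the sample LDF defined in this section."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    nx, ny = len(X), len(Y)
    # pooled covariance estimate
    Sp = ((nx - 1) * np.cov(X, rowvar=False) + (ny - 1) * np.cov(Y, rowvar=False)) / (nx + ny - 2)
    return (mx - my) @ np.linalg.inv(Sp) @ (z - 0.5 * (mx + my))

def qdf(z, X, Y):
    """Sample quadratic discriminant function Q(z); assign z to pi_x when Q(z) > 0."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Sx, Sy = np.cov(X, rowvar=False), np.cov(Y, rowvar=False)
    return (np.log(np.linalg.det(Sy) / np.linalg.det(Sx))
            - (z - mx) @ np.linalg.inv(Sx) @ (z - mx)
            + (z - my) @ np.linalg.inv(Sy) @ (z - my))

# usage: assign z to pi_x if ldf(z, X, Y) > 0 (or qdf(z, X, Y) > 0), else to pi_y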
Their work and the work of others who extended this research, found that both LDF and QDF are greatly a ected by these types of non-normality. As a solution to the sensitivity of the linear and quadratic discriminant functions, several authors have proposed robust procedures. 9 There are various classi ers, both parametric and nonparametric, which are proposed in literature, like kernel-based classi cation rule (Mojirsheibani, 2000), k- Nearest Neighbor classi cation rule (Hellman, 1970), decision trees (Ting, 2002), neural networks (Pao, 1989), logistic regression (Brzezinski, 1999), support vector machines (Gunn, 1998), combined classi ers (LeBlanc and Tibshirani, 1996; Mojir- sheibani, 1999, 1997) just to name a few. In this dissertation, we will consider various rank-based procedures for performing classi cation in an optimal and robust manner. 2.2 Rank Based Procedures for Classi cation In this section we will introduce two types of rank-based procedures for classi ca- tion. The rst will be classi cation based on projection pursuit that uses a rank-based projection index and the second will be based on multivariate ranking. 2.2.1 Projection Pursuit Based Classi ers Fisher?s idea of picking a linear or quadratic combination that maximizes the separation between the two samples is in a way nding the projection that maximizes the separation between the groups based on a particular criterion. In fact, Fisher?s LDF could be reframed as nding u, say u0, the projection direction that maximizes t2(u) = u0( X Y) 2 u0Spu nx+ny nxny 10 where X Nd( x; ), Y Nd( y; ) and S2p = (nx 1)s2x+(ny 1)s2ynx+ny 2 . Here sx and sy are estimates of standard deviations of X and Y respectively. One observes that this is the same two sample t test statistic that was given in Equation (2.2). The data is then reduced to one dimension by projecting it in the direction given by u0 and one would classify a new observation Z = z into x ifjZ0 X0j 0, then the most \interesting" direction is given by u0 = S 1 p ( X Y) kS 1p ( X Y)k In other situations, the projection direction is not generally obvious. This search for \interesting" low dimensional projection of high dimensional data, is more popularly known as projection pursuit (Friedman and Tukey, 1974; Posse, 1992). \Interesting- ness" is de ned based on a criterion which is measured through a suitable function known as a projection index. It turns out that the projection index for LDF is the two-sample t statistic. Friedman and Tukey (1974) in their study used an algorithm that associated with each direction in high dimensional space, a continuous index that measures its \usefulness" as a projection axis. The projection direction is then 11 varied to maximize the index. The hill-climbing algorithms for maximization given by Rosenbrock (1960) and Powell (1964) were then used as the projection index was su ciently continuous. Posse (1992) used projection pursuit for 2-group discrimina- tion and he did it using kernel estimation. He used fast fourier transform (FFT) to solve the kernel estimation problem of nding the projection direction that is most interesting. Projection pursuit is a computationally intensive procedure as it is a thorough process of searching all the projection directions to nd the most \interesting" pro- jection. This method gets complicated further as the dimension increases, but this technique is gaining popularity with the increased attention given to savvy computer programming techniques and improvements in computer technology. 
Any method that e ectively helps reduce the dimension from high to low can be treated as a form of projection pursuit. Principal component analysis (PCA) can be treated as a special case of projection pursuit (Huber, 1985). The aim of classical PCA is to nd a linear combinations of the original variables that have maximal sample variance (Nason, 1995). The rst principal component a is the vector that maximizes the variance of the data X projected along that direction. i.e. a = arg max Varxfa0Xg. Here the projection index happens to be variance of the projected data. The only other constraint in this method is the need for the projection directions to be orthogonal to each other. It happens so that the rst principal component has the highest eigenvalue and hence explains the most variance than any other direction. In that sense, eigenvalues are used as projection indices. 12 Jones (1983) and Huber (1985) gave strong heuristic arguments indicating that a projection is less interesting the more it is normal. In addition to lowering the dimension, projection pursuit also allows us to over- come problems associated with sparsity of data in high dimensions (Huber, 1985) which is also termed \curse of dimensionality" (Goldstein, 1987). With increasing di- mension, the need for more and more data to meet the requirement of su cient data increases like a curse. Many techniques fail to perform well under the conditions of sparse data. There are also situations where the number of variables or the dimension (d) is much higher than the amount of data or number of observation (n). These are more popularly known as datasets with d n data. As an example (Huber, 1985) assume that a large number of points is distributed uniformly in the 10-dimensional unit ball. Then the radius of a ball containing 5% of the points is (0:05)0:1 = 0:74. By using projection pursuit techniques, one can reduce the dimension to one and elimi- nate the problem of sparsity of data. As noted in Huber (1985), \Projection pursuit is the most powerful technique that can lift a one-dimensional technique to higher dimensions" (Chen, 1989). By this he means that a projection pursuit technique can be used to reduce the dimension to one and then any one-dimensional statistical technique can be applied. Every traditional projection pursuit methodology mainly di ers in the choice of projection index. There are certain demands of a good projection index (Posse, 1995): Robust to deviations 13 Approximately a ne invariant Consistent Simple enough to permit quick computation even for large data sets. The problem with Fisher?s LDF is that it uses the t-statistic as a projection index which is known to be very sensitive to the underlying distributional assumptions. In an attempt to make Fisher?s LDF more robust to deviations, many researchers have proposed to use robust version of projection index in place of the t-statistic. Posse (1990) proposed a projection pursuit technique based on a global optimization algo- rithm and used a chi-squared projection index to nd the most interesting plane (two dimensional view). This optimization algorithm was later modi ed by Posse (1995) by combining it with a structure removal procedure given in Friedman (1987) and used a modi ed chi-squared index which satis es all the demands of a good projection index. 
Chen and Muirhead (1994) and Chen (1989) in his PhD dissertation used various ro- bust estimates in the likes of the median location estimator (Andrews et al., 1972; Huber, 1981), trimmed location estimator, Huber type location M-estimator, the me- dian absolute scale estimator and Huber type scale M-estimator. Montanari (2004) and Chen and Muirhead (1994) used a two-sample Mann-Whitney type statistic as a projection index to measure group separation. They show that their projection pursuit methods are not sensitive to deviations from the homoscedasticity and nor- mality assumptions that are required for the optimality of Fisher?s LDF. The method 14 proposed by Montanari was motivated by the idea of what is known as transvariation (Gini, 1916; Montanari, 2004). Transvariation Consider two continuous univariate populations x and y with distributions F and G, respectively, de ned on R. Suppose we have a random sample X1;:::;Xnx from x and, independent of the rst sample, a random sample Y1;:::;Yny from y. The two samples are said to transvariate with respect to their measures of centers mx and my if there is at least one pair (i;j) such that (Xi Yj)(mx my) < 0, i = 1;:::nx; j = 1;:::;ny. Any di erence satisfying this condition is called a two- group transvariation. Similarly, the sample X1;:::;Xnx and a given constant c2R transvariate with respect to mx, if there is at least one i such that (Xi c)(mx c) < 0, i = 1;:::;nx. This is known as a point-group transvariation. The two-group transvariation probability between F and G is de ned as xy := (F;G) = Pf(Y X)( y x) < 0g = Z R Z R If(y x)( (G) (F)) < 0gdF(x)dG(y) ; (2.5) where X F, Y G and x = (F), y = (G) are the location parameters of F and G, respectively. If Fnx and Gny be the two empirical distributions of the two 15 random samples. Then an estimate of xy is given as Txy := (Fnx;Gny) = Z R Z R If(y x)( (Gny) (Fnx)) < 0gdFnx(x)dGny(y) = 1n xny nxX i=1 nyX j=1 If(Xi Yj)(mx my) < 0g; where IfAg= 1 if A is true and is 0 otherwise. Here mx = (Fnx) and my = (Gny) are estimators of x and y, respectively. Txy is a nonparametric estimator of the overlap between the distributions Fx and Fy. In particular, nxnyTxy gives the number of observations that need to be interchanged so that there will be no overlap between the two samples. If we assume without loss of generality that y < x, then xy = P(X DF(x2;F). This is true for any depth function DF. Let F be the class of distributions on the Borel sets of Rd. A statistical depth function is a bounded, nonnegative mapping D : Rd F!R and there are certain properties that are desired of depth functions (Zuo and Ser ing, 2000; Hoberg, 2003): A ne Invariance: The depth of a point x 2 Rd should not depend on the underlying coordinate system or, in particular, on the scales of the underlying 21 measurements. DF(Ax + b;FAX+b) = DF(x;FX) Maximality at center: For a distribution having a uniquely de ned \center" (e.g., the point of symmetry with respect to some notion of symmetry), the depth function should attain maximum value at this center. If is the center of F, then DF( ;F) = sup x2Rd DF(x;F) Monotonicity relative to deepest point: As a point x 2 Rd moves away from the \deepest point" (the point at which the depth function attains maximum value; in particular, for a symmetric distribution, the center) along any xed ray through the center, the depth at x should decrease monotonically. DF(x;F) DF( + (x );F) for 2[0;1] Vanishing at in nity: The depth of a point x should approach zero as kxk approaches in nity. 
DF(x;F)!0 as kxk!1 The interested reader may nd an extensive list of depth functions along with their de nitions in Liu et al. (1999), Zuo and Ser ing (2000) or Ghosh and Chaud- huri (2005). Among the numerous depth functions that are in existence, Mahalanobis 22 depth and L1 depth (spatial depth) are two of the most attractive ones due to their ease of computation. They can be computed exactly for any dimension. The com- putation of many other depth functions may require algorithms that provide only approximations. This is especially true for higher dimensional data. For example, one usually has to construct very complicated approximate algorithms to compute the halfspace depth of points in three or higher dimensions. Taking advantage of this notion of ordering multivariate data in a center-outward manner, J ornsten (2004) proposed the maximum L1 depth classi er that uses the discriminant function S(z;F;G) =D1(z;F) D1(z;G) = EG z Y kz Yk EF z X kz Xk = Z Rd z y kz ykdG(y) Z Rd z x kz xkdF(x) : (2.9) The new observation Z = z is then classi ed in x if S(z;F;G) > 0 and in y otherwise. Despite its computational ease, a major drawback of this classi er is that it lacks a ne invariance because L1 depth is not a ne invariant. It can, however, be made a ne invariant by taking 1=2x (z X) and 1=2y (z Y) in place of z X and z Y, respectively, in equation (2.9) (Vardi and Zhang, 2000; Ghosh and Chaudhuri, 2005). Note that one can use any a ne equivariant estimators of x and y when computing the discriminant function. If the scatter estimator of Tyler (1987) is used, then the resulting maximum L1 depth classi er is very similar to the classi er given by Crimin et al. (2007). An alternative method of obtaining a ne invariance is to scale 23 the data along its principal component directions (PCA-scaling) as given in Hugg et al. (2006). One could use robust principal component analysis (eg: Robust PCA given by Croux et al. (2007)) or scale the data with the robust estimate of covariance structure which will make the L1 depth function a ne invariant in addition to making it robust against deviations. Once again, for practical purposes, given two independent random training sam- ples X1;:::;Xnx and Y1;:::;Yny from x and y, respectively, de ned on Rd (d 1), the sample version ofD1(x;F) given in (2.8) can be found by replacing the empirical cdf in place of F and G resulting in the sample version of S(z;F;G) given by S(z;Fnx;Gny) = Z Rd z y kz ykdGny(y) Z Rd z x kz xkdFnx(x) = 1 ny nyX j=1 z Yj kz Yjk 1 nx nxX i=1 z Xi kz Xik : It must be noted that the maximum L1 depth classi er is in the class of classi ers known as maximum depth classi ers (Ghosh and Chaudhuri, 2005) in that any depth function DF can be used in place of the L1 depth function. The optimality of the classi er is dependent on the choice of the depth function. The choice of depth functions could be based on various properties like robustness. You obtain QDF if Mahalanobis depth is used in place of the L1 depth in (2.9). One would assign the new observation Z = z in x if DF(z;F;G) > 0 and in y otherwise. In Billor et al. (2008), it is shown that the robustness of the maximum depth classi er can be improved if one considers the largest order of depth instead of just 24 the maximum depth. They de ne the transvariation probability between the depth of Z and the depth of X relative to the distribution of X as x(Z) = Pf(DF(X;F) DF(Z;F))(DF( x;F) DF(Z;F)g< 0) = PfDF(X;F) 0. 
This classi er is robust just like most rank-based procedures and can be directly extended to more than two groups unlike the transvariation-based method. The use of data depth for classi cation purposes has also been considered by Mosler and Hoberg (2003). They de ne a combination of two existing depth functions zonoid and Mahalanobis to come up with a new depth function and they call it zonoid-Mahalanobis depth. Mahalanobis depth is parametric depth function that is sensitive to deviations, while zonoid depth function is nonparametric and based on convex hulls. In this regard, halfspace, simplicial and convex-hull peeling depths are in the same class of depths as zonoid depth. A drawback of such depth functions is that they vanish outside the convex hull, so points lying outside the convex hulls of all classes cannot be classi ed using such depth functions. The zonoid-Mahalanobis depth function uses a combination as in one would use the zonoid depth as long as the point lies inside at least one of the convex hulls, else the Mahalanobis depth is used for classi cation. Halfspace depth based classi cation for two populations has been suggested by Christmann and Rousseeuw (2001) and Christmann et al. (2002). Donoho and Gasko (1992) showed that the computation of HD in higher dimensions can be performed using the projection pursuit method. In particular, the sample half-space depth in one-dimension is de ned as HD1(x;X) = min(#fi : Xi xg;#fi : Xi xg) 26 and for x2Rd the sample half-space depth is de ned as, HDd(x; X) = min kuk=1 HD1(u0x; u0X) (2.10) where the dataset u0X is the one-dimensional projection of the d-dimensional dataset X. This means that the maximum depth classi er based on the half-sapce depth (Ghosh and Chaudhuri, 2005) is equivalent to using projection pursuit with point- group transvariation (Montanari, 2004) and allocating using a minmax rule. To that end, let F and G be the data generating distributions and Fu and Gu be the distributions of the projected data as de ned above. Then using the de nition given in Equation (2.6) we can de ne the discriminant function Dpgt(z;F;G) = min kuk=1 Z R If(x u0z)( (Fu) u0z) < 0gdFu(x) min kuk=1 Z R If(y u0z)( (Gu) u0z) < 0gdGu(y) and allocate z in x if Dpgt(z;Fnx;Gny) > 0. 2.3 Classi ers Based on Robust Estimators of Location and Scale The reason LDF and QDF are sensitive to deviations from underlying assump- tions is due to the fact that the estimators of mean and covariance that they use are sensitive to such deviations. So a natural idea is to replace these quantities with 27 robust quantities to attain robustness in the classi er. Perhaps the earliest such pro- cedure was proposed by Randles et al. (1978b) where Huber?s M estimates of the mean vector and covariance matrix (Huber, 1977) are used in place of the mean and covariance used in LDF and QDF. 2.3.1 M Estimators We will give a brief discussion of Huber?s M estimates in the following. M estimates are solutions, ^ , that maximize nX i=1 (xi; ) ; where is a function that depends on the sample and a vector of parameters . The properties of this function are discussed below. Often it is simpler to di erentiate with respect to and solve for the root of the derivative. When this di erentiation is possible, the M-estimator is said to be of -type. Otherwise, the M-estimator is said to be of -type. type For positive integer d, let (X; ) and ( Rd;S) be measure spaces. 2 is a vector of parameters. An M-estimator of -type T is de ned through a measurable function : X ! R. 
It maps a probability distribution F on X to the value 28 T(F)2 (if it exists) that minimizes RX (x; )dF(x) : ^ = T(F) := arg min 2 Z X (x; )dF(x) : type If is di erentiable, then the computation of ^ is usually much easier. An M- estimator of -type T is de ned through a measurable function :X !Rd. It is assumed that the true parameter satis es RX (x; )dF(x) = 0. Then a -type M estimator T(F) is de ned implicitly as the solution of the vector equation Z X (x;T(F))dF(x) = 0 : For many choices of or , no closed form solution exists and an iterative approach to computation is required. It is possible to use standard function optimization algo- rithms or an iteratively re-weighted least squares tting algorithm typically happens to be the preferred method. 2.3.2 S Estimators S-estimator based classi ers are given by He and Fung (2000) and Croux and De- hon (2001). S estimators were rst de ned in the context of regression by Rousseeuw and Yohai (1984). Let (x; a;C) = f(x a)0C 1(x a)g12 for any x 2Rd, a 2Rd and C 2 S(d), where S(d) is the set of all symmetric positive de nite matrices in 29 Rd Rd. Let s(a;C) be the solution of 1 m mX i=1 (x i; a;C) s(a;C) = E n (jjxjj12 ) o ; where is such that (0) = 0, where is symmetric about 0 and is nondecreasing on (0;c) and constant on (c;1) for some constant c > 0. Here the expectation on the right hand side is taken at the standard d-variate normal distribution. Now let (a ;C ) be the minimizers of s(a;C) subject to det(C) = 1, then ~ S = a and ~ S = s(a ;C )C are the S-estimators of and . 2.3.3 MCD Estimators The MCD (Maximum Covariance Determinant) based LDF and QDF given in Hubert and Van Driessen (2004), the classi er based on a robust version of the Lawley-Hotelling test uses the spatial median estimator of Hettmansperger and Ran- dles (2002) and the related scatter estimator of Tyler (1987) given by Crimin et al. (2007). Hubert and Van Driessen (2004) used the re-weighted MCD estimate of mul- tivariate location and scale (Rousseeuw, 1984, 1985), because of its good statistical 30 properties and the FAST-MCD algorithm (Rousseeuw and Van Driessen, 1999) which provides an e cient algorithm of computing the estimates for large data sets. For the X sample, the MCD estimator is de ned as the mean ^ x;0 and the covariance matrix Sx;0 of the hx observations (out of nx) whose covariance matrix has the lowest determinant. The quantity hx should be larger than b(nx p + 1) = 2c and nx hx should be smaller than the number of outliers in the X population. With this choice the MCD attains its maximal breakdown value of b(nx p + 1) = 2c 50%. The breakdown value of an estimator is de ned as the largest percentage of contamination it can withstand (Rousseeuw and Leroy, 1987). If one suspects, less than 25% contamination in the X sample, it is advised to take hx 0:75nx as this yields a higher nite-sample e ciency (Croux and Haesbroeck, 2000). Based on the initial estimates ^ x;0 and Sx;0 one computes, for each observation xi, its (preliminary) robust distance (Rousseeuw and Vanzomeren, 1990) RD0xi = q (xi ^ x;0)0S 1x;0(xi ^ x;0) ; i = 1;:::;nx They assign weight 1 to xi if RD0xi q 2p;0:975 and weight 0 otherwise. The reweighted MCD estimator is then obtained as the mean ^ x;MCD and the covari- ance matrix ^ x;MCD of those observations with weight 1. It is shown by Croux and Haesbroeck (2000) that this reweighting step increases the nite-sample e ciency of the MCD estimator considerably, whereas the breakdown value remains the same. 
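The reweighting step just described can be sketched as follows, using scikit-learn's MinCovDet (an implementation of FAST-MCD) as a stand-in for the estimator of Hubert and Van Driessen (2004); this is only an illustration of the idea, not their implementation, and the 0.975 chi-square cutoff is the rule quoted in the text.

import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def reweighted_mcd(X, support_fraction=0.75):
    """Illustrative sketch: reweighted MCD location/scatter for one group.
    support_fraction = 0.75 mirrors the h_x ~ 0.75 n_x recommendation in the text."""
    p = X.shape[1]
    mcd = MinCovDet(support_fraction=support_fraction).fit(X)
    loc0, cov0 = mcd.raw_location_, mcd.raw_covariance_      # initial MCD estimates
    diff = X - loc0
    rd0 = np.sqrt(np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov0), diff))
    weights = (rd0 <= np.sqrt(chi2.ppf(0.975, df=p))).astype(int)
    kept = X[weights == 1]
    mu_rw = kept.mean(axis=0)                                  # reweighted location
    sigma_rw = np.cov(kept, rowvar=False)                      # reweighted scatter
    return mu_rw, sigma_rw, weights                            # weight 0 flags a suspected outlier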
This can also be used to ag outliers and so can be used to detect outliers. 31 The Robust Quadratic Discriminant Rule (M) is: Z = z2 y if dM2 (z) >dM1 (z) where dMx (z) = 12lnj^ x;MCDj 12(z ^ x;MCD)0^ 1x;MCD(z ^ x;MCD) (2.11) and z2 x otherwise. The quantity dMy (z) is de ned analogously as dMx (z). Robusti ed Fisher?s linear discriminant rule (RLDR), which can be described as z2 x if (^ x ^ y)0^ 1(z (^ x ^ y)=2) > 0; and z2 y otherwise. To construct RLDR they rst look for initial estimates of the group means and the common covariance matrix, denoted by ^ x;0, ^ y;0 and ^ 0. This will already yield a discriminant rule based on ^dRLx (z; ^ x;0; ^ 0) and ^dRLy (z; ^ y;0; ^ 0). We will then also consider the reweighting procedure based on the robust distances RD0xi = q (xi ^ x;0)0^ 10 (xi ^ x;0) ; i = 1;:::;nx and RD0yj = q (yj ^ y;0)0^ 10 (yj ^ y;0) ; j = 1;:::;ny For each observation in X sample, let wxi = 1 if RD0xi q 2p;0:975 and wxi = 0 otherwise. wyj are de ned similarly for the Y sample. The nal estimates are then obtained as the mean and the pooled covariance matrix of the observations with 32 weight 1, i.e. ^ x = Pnx i=1 wxixiPn x i=1wxi ; ^ y = Pny j=1wyjyjP ny j=1wyi ; ^ = Pnx i=1 wxi(xi ^ x)(xi ^ x) 0 +Pny j=1 wyj(yj ^ y)(yj ^ y) 0 Pnx i=1 wxi + Pny j=1wyj (2.12) and the resulting linear discriminant rule is then based on ^dRLx (z; ^ x; ^ ) and ^dRLy (z; ^ y; ^ ). To obtain the initial covariance estimate ^ 0, they consider three di erent meth- ods. The rst approach is straightforward, and has been applied by Chork and Rousseeuw (1992) using the Minimum Volume Ellipsoid estimator (Rousseeuw, 1984), and Croux and Dehon (2001) using S-estimators (Rousseeuw and Yohai, 1984). The MCD estimates ^ x;MCD, ^ y;MCD, ^ x;MCD and ^ y;MCD are obtained, and then the individual covariance matrices are pooled, yielding ^ PCOV = nx^ x;MCD +ny^ y;MCD nx +ny For the second approach, they adapt one of the proposals of He and Fung (2000) who use S-estimators to robustify Fisher?s linear discriminant function. The idea is based on pooling the observations instead of the group covariance matrices. The third estimator combines the two previous approaches and is aimed to nd a fast approximation to the Minimum Within-group Covariance Determinant criterion of Hawkins and McLachlan (1997). Instead of applying the same trimming proportion to each group, they proposed to nd the h observations out of the whole data set of 33 size n, such that the pooled within-group sample covariance matrix ^ H has minimal determinant. The algorithm described in Hawkins and McLachlan (1997) is very time-consuming because it is based on pairwise swaps. Hubert and Van Driessen (2004) proposed the fast approximation for two groups which is much faster than the algorithm given in Hawkins and McLachlan (1997). It has to be noted that this algorithm can fail if some of the groups are very small where there is a possibility that the nal subset H does not contain p+ 1 observations from each group, making those group covariance matrices singular. Similar to Hubert and Van Driessen (2004), Joossens and Croux (2004) performed simulation studies to compare LDF and QDF with the MCD based LDF and QDF as well as the S-estimator based classi ers. 2.3.4 Other There are other estimates like the R-estimates and the L-estimates (Ser ing, 1980) which could also be used in classi cation in a similar fashion. An issue that was discussed in Lachenbruch et al. 
(1973) is the problem of im- balance between the two misclassi cation error rates PDyjx and PDxjy resulting from the use of D = QDF and D = LDF when the parent populations are non-normal. Ideally one would want PDyjx = PDxjy or a situation where a classi er can de nitely con rm to provide lower misclassi cation rates for the samples of choice. Since this is not a possibility, the only logical option is to maintain the balance in tact. A remedy for this undesirable property of classi ers was given in Randles et al. (1978a) and Ng and 34 Randles (1983). In these papers, it is shown that balance is attained (asymptotically) through ranking of discriminant functions such as LDF and QDF while keeping the TPM relatively low. Randles et al. (1978b) use this ranking technique on robust linear and quadratic discriminant functions that use M estimates and they showed that the resulting classi er is robust compared to LDF and QDF and balance. 35 Chapter 3 Proposed Robust Projection Pursuit Based Allocation Schemes 3.1 Introduction There are two main aspects of a projection pursuit technique - a projection index and the most e cient way to search for the best projection direction. Works in two-dimensional projection pursuit are discussed in Friedman and Tukey (1974), Jones (1983), Jones and Sibson (1987) and Friedman (1987), and suggestions are made by Yenyukov (1988), Yenyukov (1989). An e ective algorithm for the same was provide by Posse (1990). He uses the chi-square measure as his projection index. He uses random search for projection directions to nd local optima and restart the search process repeatedly to make sure they nd the optimum. The two-dimensional problem might not be as complicated for the modern computational power, but as the dimension increases there is a need for more e cient projection pursuit algorithms. Posse (1992) used the algorithm given in Huber (1989) which is similar to that given by Posse (1990). A quickly generated random walk on the d-dimensional hypersphere Sd sweeps the entire space where areas of interest are identi ed to nd the local optima. This happens to be very e cient as it saves computational time while still maintaining the e ciency of the algorithm. For our projection pursuit based classi er, we use two-group transvariation (Gini, 1916) as projection index. We use similar principles of Posse (1992) where we sweep 36 the high dimensional space and then nd the projection direction that maximizes separation by minimizing two-group transvariation. In order for the sweep to be suc- cessful, the projection directions need to be uniformly distributed on the hypersphere. One such method that allows us to do this is NT-net given by Fang and Wang (1994). 3.2 NT-Net To perform projection pursuit in Rd we need points that are \uniformly scat- tered" on the surface of the unit hypersphere Sd 1 as given in Fang and Wang (1994). If W NIDd(0;Id), then (W0W) 1=2W is uniformly distributed on Sd 1. If V1;:::;Vk is a number theoretic net (NT-net) on the d 1 dimensional unit cube Cd 1, then fWi =S(V1;:::;Vk) ; i = 1;:::;kg; where S is the spherical coordinate transformation operator, is an NT-net on Sd 1. Let = ( 1;:::; d 1) be a point on Cd 1. We use the spherical coordinate transformation Xj = j 1Y i=1 SiCj; j = 1;:::;d 1; 37 Xd = d 1Y i=1 Si; where Si = sin( i); Ci = cos( i) i = 1;:::;d 2; Sd 1 = sin(2 d 1); Cd 1 = cos(2 d 1:) Then X = (X1;:::;Xd) is a point on Sd 1. If 1;:::; k forms an NT-net on Cd 1, then X1;:::;Xk forms an NT-net on Sd 1. 
For example, to obtain an NT-net for d = 3, if {(v_{i1}, v_{i2}); i = 1,...,k} is an NT-net on [0,1]^2, then {(w_{i1}, w_{i2}, w_{i3}); i = 1,...,k} is an NT-net on S^2, as shown in Figure 3.1, where

w_{i1} = 1 - 2 v_{i1},
w_{i2} = 2\sqrt{v_{i1}(1 - v_{i1})}\,\cos(2\pi v_{i2}),
w_{i3} = 2\sqrt{v_{i1}(1 - v_{i1})}\,\sin(2\pi v_{i2}).

(A computational sketch of this construction is given after the algorithm below.) To measure whether the points on the hypersphere are uniformly distributed, measures such as the F-discrepancy are used (Fang and Wang, 1994). Let F(x) be a cdf on R^d and P = {x_k; k = 1,...,n} be a set of points in R^d. Then

D_F(n, P) = \sup_{x \in R^d} |F_n(x) - F(x)|

is called the F-discrepancy of P with respect to F(x), where F_n(x) is the empirical distribution of x_1,...,x_n. The F-discrepancy measures how well P represents F(x). We chose the NT-net to supply the projection directions for our projection pursuit method because of this property.

Figure 3.1: An NT-net on S^2

The interested reader may find alternative ways of generating projection directions in Marsaglia (1972), where three different Monte Carlo point-picking methods are given. It is well known that projection pursuit based methods are computationally expensive; however, advanced computational techniques can be used to reduce the number of projections considered without affecting the efficiency of the technique (Filzmoser et al., 2006). Inspired by Posse (1992), we provide one such algorithm that briefly sweeps the entire space using a small number of projection directions and then localizes the search to find the projection direction that optimizes the projection index. We use two-group transvariation as the projection index, as in Montanari (2004).

Algorithm:

Step 1: Uniformly spread points that cover the entire unit hypercube C^{d-1} are generated.

Step 2: These points on the unit hypercube are spherically transformed onto the unit hypersphere S^{d-1} as described above. The transformed points are uniformly distributed, as shown in Fang and Wang (1994) using the F-discrepancy criterion.

Step 3: The projection index is calculated for each candidate direction on the hypersphere, and the direction providing the optimum projection index is chosen. The optimum projection is the direction that produces the maximum separation, or minimum overlap, between the training samples.

Step 4: If more than one projection direction provides the optimum projection index, the direction that gives the largest spread of the projected data is chosen. One can use a robust measure of spread to make the algorithm more robust.

Step 5: The point on the unit hypersphere that provides the direction chosen in Step 4 is identified and traced back to the corresponding point on the unit hypercube. The points immediately surrounding this point on the unit hypercube are used to create more points; the number of points generated here could be the same as in Step 1.

Step 6: A grid of uniformly spread points is now generated on this localized region of the unit hypercube.

Step 7: These points on the localized region of the unit hypercube are spherically transformed; they now form a hyperarc rather than the full hypersphere. This gives a localized view of the region that gave the optimum projection direction in the previous iteration.

Step 8: Steps 3 through 7 are repeated until a convergence criterion is reached. If the transvariation probability is used as the projection index, a natural criterion is to continue until the projection index stops decreasing.
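The following is a minimal sketch of the d = 3 construction above. It is an illustration only: the grid placed on [0,1]^2 here is a simple lattice standing in for a number-theoretic net, but it plays the same role of covering the unit square evenly, and with k_per_dim = 5 it yields the 5^(d-1) = 25 candidate directions mentioned later in this chapter.

import numpy as np

def sphere_directions(k_per_dim=5):
    """Illustrative sketch: map an even grid on [0,1]^2 onto candidate directions on S^2
    via w1 = 1 - 2*v1, (w2, w3) = 2*sqrt(v1*(1-v1)) * (cos, sin)(2*pi*v2)."""
    v1, v2 = np.meshgrid(np.linspace(0.05, 0.95, k_per_dim),          # avoid degenerate endpoints
                         np.linspace(0.0, 1.0, k_per_dim, endpoint=False))
    v1, v2 = v1.ravel(), v2.ravel()
    r = 2.0 * np.sqrt(v1 * (1.0 - v1))
    W = np.column_stack([1.0 - 2.0 * v1,
                         r * np.cos(2.0 * np.pi * v2),
                         r * np.sin(2.0 * np.pi * v2)])
    return W                                                           # each row is a unit vector

directions = sphere_directions()
assert np.allclose(np.linalg.norm(directions, axis=1), 1.0)            # points lie on S^2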
It has to be noted that the convergence criterion takes you to the projection direction that gives a local optimum of the projection index. For data with a unimodal shape, there is only a global optimum of the projection index, and the algorithm converges to the projection direction achieving this global optimum. Since transvariation is discrete, the algorithm is designed to return a single projection direction providing the least two-group transvariation.

The number of points needed per dimension for this algorithm depends on the dimension under consideration and on the available computational power. While an exhaustive search would confirm the optimum projection index, the Monte Carlo simulations at the end of this chapter show that, for certain unimodal distributions, 5 projection directions per dimension are more than sufficient. We tried the process with 7 and 9 projection directions per dimension but did not see any noticeable improvement. If 5 projection directions per dimension are used, the total number of projection directions is 5^{d-1}, so the number of directions required grows with the dimension. The advantage of using an NT-net for projection directions is that the entire space need not be examined at once: the space to be explored can be broken into regions, which are controlled easily through the points on the unit hypercube. Once the vector \hat{u}_{opt} that gives the most interesting one-dimensional view of the data is found using (2.7), we project the test samples in this direction and allocate the new observation into one of the two groups based on a particular allocation scheme.

3.3 Allocation Schemes

We consider four different allocation schemes in this study. The first was proposed by Montanari (2004) and is based on the projected distances of the new observation from the projected centers of the training samples. The second is a suggestion in Montanari (2004), wherein a nonparametric allocation scheme was proposed and immediately abandoned. We discuss why she abandoned this scheme, and we provide two alternative allocation schemes that are completely nonparametric. These are based on the positions of the projected points relative to the projected centers of the training samples. It must be noted that all the procedures given in this chapter are affine invariant. Invariance to translations is immediate; moreover, any rotation of the data results in the same rotation of the projection direction that gives the "most interesting" view, and the resulting projected data do not depend on the coordinate system used. A formal argument showing the affine invariance of these procedures is found in Lemma 2.1 of Donoho and Gasko (1992) and Theorem 2.1 of Zuo and Serfling (2000).

3.3.1 Allocation Based on Distance

Given two independent random training samples X_1,...,X_{n_x} and Y_1,...,Y_{n_y} from \pi_x and \pi_y, respectively, defined on R^d (d \ge 1), a new observation Z = z is classified in \pi_x if |\hat{u}'_{opt} z - m_x(\hat{u}_{opt})| < |\hat{u}'_{opt} z - m_y(\hat{u}_{opt})| and in \pi_y otherwise. We refer to the classifier obtained using this distance-based allocation as TD.

3.3.2 Allocation Based on Point-Group Transvariation

Here the centrality of the projected new observation within each projected training sample is measured by point-group transvariation, as defined in Chapter 2: the new observation Z is assigned to \pi_x if T_x(Z) > T_y(Z); otherwise, it is assigned to \pi_y, where

T_x(Z) = \frac{1}{n_x} \sum_{i=1}^{n_x} I\{(\hat{u}'_{opt} X_i - \hat{u}'_{opt} Z)(m_x(\hat{u}_{opt}) - \hat{u}'_{opt} Z) < 0\}

and

T_y(Z) = \frac{1}{n_y} \sum_{j=1}^{n_y} I\{(\hat{u}'_{opt} Y_j - \hat{u}'_{opt} Z)(m_y(\hat{u}_{opt}) - \hat{u}'_{opt} Z) < 0\}.   (3.1)

As argued earlier, this allocation scheme is based on ranks, and it removes the non-optimality problem that TD has when skewed distributions are considered; a computational sketch is given below.
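A minimal sketch of the point-group transvariation allocation in (3.1), applied to data already projected onto the optimum direction. The use of np.median for the projected center m is an assumption made here for illustration; any location estimator could be substituted. The random tie break mirrors the coin-flip strategy discussed in the next paragraph.

import numpy as np

def point_group_transvariation(z, x, center):
    """Proportion of projected points x_i on the opposite side of z from the group center (eq. 3.1)."""
    return np.mean((x - z) * (center - z) < 0)

def pgt_allocate(z, x_proj, y_proj, rng=np.random.default_rng(0)):
    """Illustrative sketch: assign projected point z to group 'x' or 'y' by point-group transvariation."""
    tx = point_group_transvariation(z, x_proj, np.median(x_proj))   # median as center is an assumption
    ty = point_group_transvariation(z, y_proj, np.median(y_proj))
    if tx == ty:
        return rng.choice(['x', 'y'])                                # coin-flip tie break
    return 'x' if tx > ty else 'y'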
Although this allocation scheme makes the method completely nonparametric and works better than TD for skewed distributions, it does not perform as well for data with unequal sample sizes. This is because an equal-prior restriction is imposed by the counting, and group two (one) is ignored when finding the ranking of the new point within group one (two). The priors are therefore not taken into account, and the effect shows in the misclassification error rate, especially when the sample sizes are unequal. Montanari (2004) abandoned this scheme for this very reason. Adding to the problem of priors is the rather high likelihood of ties between T_x and T_y in (3.1). Ties are most noticeable when the sample sizes are equal, which happen to be the only cases in which this allocation scheme works efficiently. The ties require some kind of tie-breaking strategy; the one we employ is to assign the observation to one of the groups at random using a coin flip. The classifier obtained using this allocation scheme will be referred to as the Point-Group Transvariation (PGT) classifier.

3.4 Proposed Symmetrized Allocation Scheme

The schemes TD and PGT, one parametric and the other nonparametric, still have problems with skewed distributions and with unequal sample sizes, respectively. In spite of being nonparametric, the PGT scheme does not use any information about the second group when finding the point-group transvariation between the new observation and the first group. We therefore need an allocation scheme that considers both groups while remaining a nonparametric alternative to the existing schemes. A nonparametric way of looking at both groups at the same time already exists: the two-group, or group-group, transvariation defined by Gini (1916). The remaining issue is how to include the new observation in the calculation of the group-group transvariation and use it for allocation. We propose a new nonparametric alternative based on the ranking of the new observation within the groups while considering both groups; a computational sketch is given below.

To allocate a new observation Z, define X^* = {X_1,...,X_{n_x}, Z} and, similarly, Y^* = {Y_1,...,Y_{n_y}, Z}. The idea is to find T_{x^*y}, the transvariation probability between X^* and Y, given by

T_{x^*y} = \frac{1}{(n_x+1) n_y} \sum_{x^* \in X^*} \sum_{y \in Y} I\{(\hat{u}'_{opt} x^* - \hat{u}'_{opt} y)(m_x(\hat{u}_{opt}) - m_y(\hat{u}_{opt})) < 0\}   (3.2)

and T_{xy^*}, the transvariation probability between X and Y^*, given by

T_{xy^*} = \frac{1}{n_x (n_y+1)} \sum_{x \in X} \sum_{y^* \in Y^*} I\{(\hat{u}'_{opt} x - \hat{u}'_{opt} y^*)(m_x(\hat{u}_{opt}) - m_y(\hat{u}_{opt})) < 0\}   (3.3)

and to examine the effect of the new observation on the quantities T_{x^*y} and T_{xy^*}. These two quantities have T_{xy} in common, where

T_{xy} = \frac{1}{n_x n_y} \sum_{x \in X} \sum_{y \in Y} I\{(\hat{u}'_{opt} x - \hat{u}'_{opt} y)(m_x(\hat{u}_{opt}) - m_y(\hat{u}_{opt})) < 0\}   (3.4)

is the transvariation probability between X and Y as given in (2.5). The quantity (n_x+1) n_y T_{x^*y} - n_x n_y T_{xy} is the number of additional transvariating pairs contributed by the new observation when it is placed in group X; similarly, n_x (n_y+1) T_{xy^*} - n_x n_y T_{xy} is the number of additional transvariating pairs contributed when it is placed in group Y. The difference T_{x^*y} - T_{xy} is thus due to the addition of the new observation Z to X, and the difference T_{xy^*} - T_{xy} is due to the addition of Z to Y. These differences can be thought of as the contributions that Z makes to the transvariation if it were to belong to the group in which it is placed. Each difference can be positive or negative depending on the location of the new observation, but the two differences cannot both be positive or both be negative for a given new observation. If Z belongs to the group it was placed in, it does not contribute to the transvariation with the other group, creating a negative difference; otherwise it creates a positive difference.
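The following sketch computes the two contributions of Z described above, on projected one-dimensional data, and allocates Z to the group for which that contribution is smaller; this is the rule formalized in the next paragraph. As before, the medians standing in for the centers m_x, m_y are an assumption made for illustration.

import numpy as np

def two_group_transvariation(x, y, mx, my):
    """T_xy of (3.4): proportion of pairs (x_i, y_j) with (x_i - y_j)(m_x - m_y) < 0."""
    diff = x[:, None] - y[None, :]
    return np.mean(diff * (mx - my) < 0)

def ggt_allocate(z, x_proj, y_proj):
    """Illustrative sketch of the symmetrized scheme: place z in the group whose
    augmented transvariation grows least when z is added to it."""
    mx, my = np.median(x_proj), np.median(y_proj)                    # centers: assumption
    t_xy = two_group_transvariation(x_proj, y_proj, mx, my)          # (3.4)
    t_xstar_y = two_group_transvariation(np.append(x_proj, z), y_proj, mx, my)   # (3.2)
    t_x_ystar = two_group_transvariation(x_proj, np.append(y_proj, z), mx, my)   # (3.3)
    return 'x' if (t_xstar_y - t_xy) < (t_x_ystar - t_xy) else 'y'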
These di erences could be positive or negative based on the location of the new observation but cannot both be positive or both negative for a particular new observation. If Z belongs to the group it was placed in, it will not contribute to the transvariation with the other group and thus creating a negative di erence, else it would create a positive di erence. 47 A negative di erence indicates that the new observation Z is placed in the correct group. So a new observation Z = z is allocated to x if Tx y Txy < Txy Txy, else it is allocated to y. Since both the di erences contain Txy, regardless of the di erences, we allocate the new observation to x if Tx y Ty(z) where Tx(c) = 1n x nxX i=1 1 Kxf(^u0optXi ^u0optz)(mx(^uopt) ^u0optz)g and Ty(c) = 1n y nyX j=1 1 Kyf(^u0optYj ^u0optz)(my(^uopt) ^u0optz)g ; (3.5) else classify it in y. -3 -2 -1 0 1 2 3 0 0 . 1 0 . 2 0 . 3 0 . 4 0 . 5 0 . 6 0 . 7 0 . 8 0 . 9 1 X W e ig h t S M O O T H - P G T P G T Figure 3.2: Graphical Depiction of Transvariation and Smoothed Transvariation Kx and Ky need not necessarily be CDFs. They can be taken to be functions that satisfy certain requirements or any CDF de ned on R with pdf kx and ky, respectively, 50 that is continuous on ( ; ) for some > 0 and kx(0) > 0 and ky(0) > 0. It is not a requirement that both the samples need to follow a certain shape. Note that PGT uses the CDF of the bernoulli random variable I(X 0). The conversion of a 0 or 1 transvariation into a number in the interval (0, 1) is based on a smoothing parameter. Smoothing works well in avoiding the problem that PGT had with allocation in the case of unequal sample sizes. Kernel smoothing was used for classi cation by Mojirsheibani (2000). In our study, we use the CDF of the t-distribution for K with degrees of freedom (df) as the smoothing constant. This gives us a wide range of CDFs with varying scale that could be used to t various distributions. For a data set with training and testing samples, the process of nding the optimum smoothing constant (df) is as follows: Algorithm Step 1: After nding the optimum projection direction using NT-net, the data are projected onto this direction. You would then allocate the testing sample using equation (3.5). Step 2: To allocate, you need the df that de nes the t-CDF, which provides the best smoother for that particular shape of data. It makes a lot of sense to use two di erent smoothing df, one for each group of data. We use the training sample to train and nd the best smoother for each group. 51 Step 3: Finding the smoother theoretically is a very di cult problem. In fact, there exists no closed form solution when the t-CDF is used in place of the indi- cator function in transvariation. Instead we use a bivariate grid, containing df(df1;df2), one for each group. Step 4: We then use the training data and apply leave-one-out cross validation to nd the probability of misclassi cation for each possible pair of df. The combi- nation that provided the least probability of misclassi cation is then picked as the pair of smoothing constants. Step 5: We nally use equation (3.5) with the smoothing constants found in Step 4, to nd the total probability of misclassi cation for this data. When the data are presented as a single sample, as is the case for real data sets, we remove one observation from the data and try to use the rest of the data as a training sample to allocate this observation. This is like leave-one-out cross validation. 
So, we end up using a double one-at-a-time cross validation, once for nding the misclassi cation for the data and again to nd the best set of smoothing df for the training sample. We tried the normal cdf which worked really well for distributions that are normal or near normal. As the distributions deviated from normality, the e ciency of the normal cdf with the spread as the smoother went down. Based on our simulations, this was partly because normal distribution with sigma as a smoother, lacks the range to cover a variety of distributions. This was evident when most of the smoothing 52 constants ended up being very close to zero, which makes the normal distribution almost like a point mass and hence a transvariation. In such cases, smooth becomes PGT. We chose to use t-cdf with df as a smoother for its wide range. We are convinced that a better smoother exists out there and we leave the solution of nding it as an open problem. We refer to the classi er that uses this allocation scheme as smooth-PGT (SPGT). This classi er has all the positives of PGT but gets rid of the problem with PGT, that is, with unequal sample sizes. By counting transvariations, PGT puts a restriction of equal priors. As discussed above, transvariation with one group is scaled by 1nx, while the transvariation with the other group is scaled by 1ny. This creates a problem when the sample sizes are not equal, especially when the sample sizes are highly unequal. In the case of the SPGT allocation scheme, each transvariation is smoothed, in our case using a t-CDF. This takes care of the shape of the distributions by weighing each point based on their relative position and the distance from the new observation. The two smoother df0s de ning the t-CDf?s, which provide the least misclassi cation are found using the training samples. This process of nding the best smoother that smoothes the data, closes the gap between the di erence in sample sizes. 3.6 Application on Real Data and Monte Carlo Simulation We would like to show the optimality of the proposed classi ers by applying the methods on some real data sets and a variety of simulation settings. 53 3.6.1 Application on Real Data Sets The data sets that we chose to consider are Leukemia : This dataset is given by Golub et al. (1999) and comes from a study of gene expression in two types of acute leukemias: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). The dataset is available at http://www.genome.wi.mit.edu/MPR Gene expression levels were measured using A ymetrix high-density oligonu- cleotide arrays containing 7129 human genes. The data are comprised of 47 cases of ALL (38 B-cell ALL and 9 T-cell ALL) and 25 cases of AML. We used the method of gene selection given by Nguyen and Rocke (2002) that uses a t-statistic to select expressed genes. After preprocessing, the data are sum- marized by a 72x1751 matrix. To reduce the dimension of the data set, we considered the rst 10 principal components. Colon : This data set contains gene expression in 40 tumor and 22 normal colon tissue samples analyzed with an A ymetrix oligonucleotide array (Alon et al., 1999). The expression level of 6500 genes were measured and the 2000 genes with the highest minimal intensity were retained. We again applied the gene selection method given by Nguyen and Rocke (2002) to select the genes that exhibit maximum variation among the 62 tissues. The resulting data matrix contained 250 genes. We then kept the rst 9 principal components for the purpose of classi cation. 
54 Fisher?s Iris : This is the classic Iris dataset given by Fisher (1936). It consists of three species (Setosa, Versicolor, and Virginica) of Iris owers with 50 observa- tions of each species. The dataset contains four variables: sepal length, sepal width, petal length, and petal width. We wish to identify a new Iris ower as Versicolor or not based on measurements on these four variables. We considered the original data as well as a contaminated version where the contamination is done by placing a single outlier in the Versicolor group in a similar manner as Crimin et al. (2007). We will consider the proposed methods, GGT and SPGT and as comparison we will consider the allocation scheme, TD, proposed by Montanari (2004) and her suggested allocation, PGT. We will also consider maximum depth classi er, MaxD, based on L1 depth given by Ghosh and Chaudhuri (2005) and J ornsten (2004) and the classical LDF and QDF. The results are presented in Table 1 given in the appendix. For the original Iris data, QDF and MaxD gave the lowest TPM at 3.33%. The least optimal performer was TD with a misclassi cation error rate of 31.33%. When a cluster of ve outliers was added to the Iris data, GGT gave the lowest TPM at 3.87% with PGT really close at 3.97%. The misclassi cation error rate for SPGT went down slightly, while the error rates for MaxD, LDF and QDF exploded. For LDF and MaxD, the error rate was doubled and for QDF, the error rates increased almost 4 times. 55 For the Leukemia data, GGT happened to be the method that provided the least TPM of 1.39%; The SPGT method had the next best misclassi cation error rate at 2.78%. PGT is close with 2.9% while TD happened to be the least optimal among the methods considered with an error rate of 11.11%. MaxD, LDF and QDF shared the same error rate at 4.17%. Smooth-PFT, GGT, TD and LDF shared the lowest TPM at 12.9%. The least optimal method being MaxD with a misclassi cation error rate of 17.74%. 3.6.2 Monte Carlo Simulation The common simulation settings used by authors in literature looks at normal distributions and as a deviation they include outliers, look at distributions with heavy tail and log-normal distribution as a case of skewed distribution. In all those cases, most often than not, authors choose to use the same distributions for both the groups and/or keep most of the other aspects simple. We perform a very extensive Monte Carlo simulation to study the optimality (in terms of misclassi cation error rates) of the proposed classi cation procedures under a variety of distributional settings: Homoscedastic (vs) Heteroscedastic Equal sample sizes (vs) Unequal sample sizes Same distributions (vs) Di erent distributions for the two groups Symmetric (vs) Skewed distributions 56 Normal tail (vs) Long tailed symetric distributions Combinations therein The simulation study follows the same setup given in Montanari (2004). We start o by generating training samples of the given sizes which are used to formulate the classi cation rules. Testing samples of size 1000 from each group are then generated and the misclassi cation error rates are calculated by computing the proportion of misclassi ed testing sample observations in each group. This process is then repeated 50 times and the mean and standard error of the misclassi cation error rates are computed. We consider Cauchy (C, which is t with 1 degree of freedom), t with 2 degrees of freedom (t2), Normal (N) distributions, and log-normal (LN) distributions. 
Training samples of equal (150, 150) and unequal (50, 250) sizes were generated from these distributions. The data are four-dimensional, with centers $(0,0,0,0)'$ and $(2,0,0,0)'$ and covariance matrices $I = I_4$ and
\[
W = \begin{pmatrix}
1 & 1 & 1 & 1 \\
1 & 2 & 2 & 0.25 \\
1 & 2 & 3 & 0.5 \\
1 & 0.25 & 0.5 & 4
\end{pmatrix}.
\]
In reporting the results we use the notation $K(A), H(B)$ to represent distributions $K$ and $H$ ($K$ and $H$ are C, t2, N, or LN) with covariance matrices $A$ and $B$ ($A$ and $B$ are $I = I_4$ or $W$), respectively. We consider all the methods mentioned for the real data sets. A summary of the results is given in Table 2 in the appendix. The following observations are noted.

The results for the PGT classifier indicate that it is not competitive in almost all of the unequal sample size cases, confirming Montanari's suspicion that led her to abandon the point-group transvariation allocation scheme. The GGT and SPGT classifiers have error rates comparable to PGT in the equal sample size cases and lower, often much lower, error rates in almost all of the unequal sample size cases. For these reasons we do not consider PGT a contender in terms of misclassification error rate.

Considering the standard errors in the unequal sample size cases, the MaxD and PGT classifiers are less precise than the other classifiers, with MaxD being the least precise. Except in a few cases, PGT is less precise than GGT and SPGT.

Error rates for the GGT and SPGT classifiers are comparable to TD for symmetric and identical distributions. GGT, SPGT, and TD have better error rates than LDF, QDF, and MaxD for heavy-tailed distributions (Cauchy and t(2)), while for the normal case GGT and SPGT have error rates comparable with the LDF, QDF, and TD classifiers and superior to the MaxD classifier.

For skewed but identical distributions, SPGT and GGT are the best of the methods considered in terms of both misclassification error rate and standard error; the error rates and standard errors of SPGT and GGT are comparable.

For different but homoscedastic distributions, one of SPGT, GGT, or TD is the best method, in terms of both misclassification error rate and standard error, depending on the setting. In these settings GGT behaves erratically for the unequal sample size cases: its error rate decreases in some cases and increases substantially in others.

For different and heteroscedastic distributions, QDF does noticeably better than it usually does. As mentioned in Chapter 2, QDF is well suited to heteroscedastic cases, although the use of different distributions inflates its error rate in some settings. Apart from QDF, SPGT is the best of the methods considered; the only exceptions are GGT in some unequal sample size cases and when the log-normal distribution is considered with equal sample sizes.

Chapter 4
Robust and Balanced Classification

4.1 Introduction

Extensive research has been done in the area of classification. The main categories that authors have considered include parametric methods, nonparametric alternatives to the parametric methods, and the use of robust estimates in place of the sensitive classical ones. These methods all aim to provide the lowest possible TPM.
Not much has been done regarding the balance of the misclassification error rates within each group. The issue was briefly discussed in Lachenbruch et al. (1973), and a solution was provided by Ng and Randles (1983), who rank the LDF and QDF to create balance in the misclassification error rates, that is, $P_D(y \mid x) = P_D(x \mid y)$. The issue of balance deserves more attention than it has received in the literature. It becomes very pertinent when one of the misclassifications ($P_D(y \mid x)$ or $P_D(x \mid y)$) is costlier than the other. For example, in a two-group classification problem with one group of patients having cancer and the other group not having cancer, misclassifying a cancer patient into the noncancerous group could prove costly in terms of life. Ideally, one would like a method that can control the ratio of the misclassification error rates at a required level. When the investigator has no information on the costs of misclassification or the prior probabilities, it is best to maintain balance between the two misclassification error rates.

Some existing methods can provide a low TPM, but there are few methods, especially robust methods, that can create this balance while minimizing the overall misclassification error rate. We propose one such method that maintains the balance between the two misclassification error rates while controlling the overall TPM. We show that balance can be achieved as long as ranking is incorporated and both groups are considered when allocating the new observation (Randles et al., 1978a).

4.2 Ranking Discriminant Function

Let $X$, of size $n_x$, given by $X_1,\ldots,X_{n_x}$, and $Y$, of size $n_y$, given by $Y_1,\ldots,Y_{n_y}$, be two independent random training samples from populations $\pi_x$ and $\pi_y$ with distributions $F$ and $G$, respectively, defined on $\mathbb{R}^d$ ($d \ge 1$). Let $F_{n_x}$ and $G_{n_y}$ be the empirical distribution functions of $X$ and $Y$, respectively. Suppose, as before, that we have an absolutely continuous, real-valued discriminant function $D$ such that $Z = z$ is classified in $\pi_x$ if $D(z;F,G) > 0$ and in $\pi_y$ otherwise. Now define a rank discriminant function $RD$ as
\[
RD(z;F,G) = P\{D(z;F,G) \ge D(X;F,G) \mid X \sim F\} - P\{D(z;F,G) \le D(Y;F,G) \mid Y \sim G\}. \tag{4.1}
\]
Naturally, one classifies $Z = z$ in $\pi_x$ if $RD(z;F,G) > 0$ and in $\pi_y$ otherwise. This classifier asks whether the new observation belongs more to $X$ or to $Y$, and the calculation depends on how $D(\cdot)$ is chosen.

To define the sample version, we form two augmented samples by placing the new observation $Z = z$ in $X$, forming $X^* = \{X_1,\ldots,X_{n_x},Z\}$, and in $Y$, forming $Y^* = \{Y_1,\ldots,Y_{n_y},Z\}$. Let $F^*_{n_x+1}$ and $G^*_{n_y+1}$ be the empirical distribution functions of $X^*$ and $Y^*$, respectively. We then have
\[
RD(z;F_{n_x},G_{n_y}) = \frac{1}{n_x+1}\sum_{x \in X^*} I\bigl\{D(z;F^*_{n_x+1},G_{n_y}) \ge D(x;F^*_{n_x+1},G_{n_y})\bigr\}
- \frac{1}{n_y+1}\sum_{y \in Y^*} I\bigl\{D(z;F_{n_x},G^*_{n_y+1}) \le D(y;F_{n_x},G^*_{n_y+1})\bigr\}, \tag{4.2}
\]
where $I\{A\}$ is the indicator function of the event $A$. We allocate $Z = z$ to $\pi_x$ if $RD(z;F_{n_x},G_{n_y}) > 0$ and to $\pi_y$ otherwise.

Under certain mild regularity conditions, given in Section 4 of Ng and Randles (1983), the two misclassification error rates of $RD(z;F_{n_x},G_{n_y})$ are asymptotically equal, or their ratio is controlled at a specified constant $r$. In particular, assume the regularity conditions below hold.

1. The sample versions of the discriminant functions must be symmetric in their arguments; that is,
\[
D(\,\cdot \mid X_1,\ldots,X_{n_x};\, Y_1,\ldots,Y_{n_y}) = D(\,\cdot \mid X_{s_1},\ldots,X_{s_{n_x}};\, Y_{t_1},\ldots,Y_{t_{n_y}})
\]
for any permutation $(s_1,\ldots,s_{n_x})$ of $(1,\ldots,n_x)$ and $(t_1,\ldots,t_{n_y})$ of $(1,\ldots,n_y)$.
2. For each $z \in \mathbb{R}^d$,
\[
D(z;F_{n_x},G) \xrightarrow{\;P\;} D(z;F,G) \quad\text{and}\quad D(z;F,G_{n_y}) \xrightarrow{\;P\;} D(z;F,G),
\]
where $D(\cdot;F,G)$ is a real-valued function defined on $\mathbb{R}^d$.

3. For all $z_1, z_2 \in \mathbb{R}^d$, $D(z_1;F,G) \le D(z_2;F,G)$ if and only if $D(z_1;G,F) \ge D(z_2;G,F)$.

4. $Z$, the $X_i$'s, and the $Y_j$'s are all mutually independent.

5. The random variables $D(X;F,G)$ and $D(Y;F,G)$ are absolutely continuous.

6. The equalities
\[
P\bigl\{F_{n_x}(D(X;F,G)) = r\,G_{n_y}(D(X;G,F))\bigr\} = 0 \quad\text{and}\quad P\bigl\{F_{n_x}(D(Y;F,G)) = r\,G_{n_y}(D(Y;G,F))\bigr\} = 0
\]
hold for some $r \in \mathbb{R}^+$.

Then we have
\[
P\{RD(Z;F,G) < 0 \mid Z \sim F\} = r\,P\{RD(Z;F,G) > 0 \mid Z \sim G\}.
\]

The rank discriminant functions that we study are the following.

RL: This uses the LDF $L(z;F,G)$ given in (2.3) in place of $D(z;F,G)$ in (4.1). This is the method proposed by Randles et al. (1978a), in which the LDF is ranked.

RQ: This uses the QDF $Q(z;F,G)$ given in (2.4) in place of $D(z;F,G)$ in (4.1). This is the method proposed by Randles et al. (1978a), in which the QDF is ranked.

RS: This uses the ranking of the $L_1$ depth function defined in (2.8). However, to comply with the regularity conditions of Ng and Randles (1983) regarding the effective sample-separating ability of the discriminant function, we use
\[
\tilde S(z;F,G) = \frac{1}{D_1(z;G)} - \frac{1}{D_1(z;F)}
\]
in place of $D(z;F,G)$ in (4.1). This is compatible with the other discriminant functions we use, since $1/D_1$ is a measure of distance from the center of the distribution. If the Mahalanobis depth were used, the distance measure would become the Mahalanobis distance and the classifier based on it would be exactly RQ. In fact, any robust affine invariant measure of distance can be used in its place and the method still works; the more robust the distance, the more robust the classifier.

RM: Here we use the QDF given in (2.4) in place of $D(z;F,G)$ in (4.1). However, in the sample version we use the minimum covariance determinant (MCD) estimators of location and covariance (Rousseeuw, 1984, 1985), as done in Hubert and Van Driessen (2004). The quadratic discriminant function of Hubert and Van Driessen (2004) is denoted by $M(z;F,G)$, as given in Equation (2.11). We use the FAST-MCD algorithm of Rousseeuw and Van Driessen (1999) with the default subset sizes of $0.75\,n_x$ and $0.75\,n_y$ observations for the MCD computations. This choice is safe as long as no more than 25% of the data are suspected to be outliers. The MCD estimates are essentially a trimmed mean and covariance structure, as they look only at the central 75% of the data. A useful property of the MCD estimates is that they are affine invariant, so a method based on them remains affine invariant.

RL and RQ rank the LDF and QDF, respectively, as in Randles et al. (1978a). Although ranking is nonparametric, the quantities being ranked are completely parametric, which makes these methods semiparametric. We show via simulation in Section 4.3 that although ranking makes a method robust, the quantity being ranked matters and drives the overall TPM. Following Randles et al. (1978a), we show via simulation that the ranked methods are more balanced than their unranked counterparts. We also show that RS and RM are more robust than RL and RQ, because RL and RQ have a parametric base that is sensitive to deviations and can only improve so much under such deviations. A small illustrative sketch of the sample rule (4.2) is given below.

4.3 Application on Real Data and Monte Carlo Simulation

We would like to demonstrate the performance of the proposed classifiers by applying the methods to some real data sets and to a variety of simulation settings.
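To illustrate the sample rank discriminant rule in (4.2), the sketch below ranks a plug-in quadratic discriminant score within the two augmented samples and allocates by the sign of the difference. The Gaussian plug-in score used for $D$ and the helper names are assumptions made for this illustration; substituting MCD estimates of location and scatter into the score, while leaving the ranking step unchanged, would give a sketch in the spirit of RM rather than RQ.

```python
import numpy as np

def qdf_score(z, mean_x, cov_x, mean_y, cov_y):
    """Plug-in quadratic discriminant score D(z); positive values favour group x.
    Illustrative Gaussian plug-in; robust (e.g., MCD) estimates could be substituted."""
    def maha(v, m, S):
        d = v - m
        return float(d @ np.linalg.solve(S, d))
    return (np.log(np.linalg.det(cov_y) / np.linalg.det(cov_x))
            - maha(z, mean_x, cov_x) + maha(z, mean_y, cov_y))

def rank_discriminant(z, X, Y):
    """Sample rank discriminant RD(z) of (4.2), built on the plug-in QDF above."""
    Xs, Ys = np.vstack([X, z]), np.vstack([Y, z])      # augmented samples X*, Y*
    # discriminant fitted on (X*, Y): used for the x-side ranking
    d_x = lambda v: qdf_score(v, Xs.mean(0), np.cov(Xs, rowvar=False),
                              Y.mean(0), np.cov(Y, rowvar=False))
    # discriminant fitted on (X, Y*): used for the y-side ranking
    d_y = lambda v: qdf_score(v, X.mean(0), np.cov(X, rowvar=False),
                              Ys.mean(0), np.cov(Ys, rowvar=False))
    dzx, dzy = d_x(z), d_y(z)
    term_x = np.mean([dzx >= d_x(x) for x in Xs])      # standing of z in X*
    term_y = np.mean([dzy <= d_y(y) for y in Ys])      # standing of z in Y*
    return term_x - term_y                             # allocate to group x if positive

rng = np.random.default_rng(2)
X = rng.standard_t(2, size=(50, 2))
Y = rng.standard_t(2, size=(50, 2)) + np.array([2.0, 0.0])
rd = rank_discriminant(np.array([0.3, -0.1]), X, Y)
print("RD(z) =", rd, "-> group x" if rd > 0 else "-> group y")
```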
4.3.1 Application on Real Data Sets We will use the same three datasets as described in Subsection 3.6.1. We will consider the proposed classi ers, RM and RS, which are robust ranked versions of the unranked classi ers, M and S. M is the classi er created by replacing the parametric quantities in QDF with MCD estimates. S is the maximum depth classi er based on L1 depth given by Ghosh and Chaudhuri (2005) and J ornsten (2004). We consider the classical L (LDF) and Q (QDF) and their ranked versions RL and RQ given by Randles et al. (1978a). The results are shown in Table 3 given in the appendix. For the original Iris data, Q gave the lowest TPM at 4% but the two mis- classi cation error rates remain unbalanced at 1% and 10%. The method that provided the most balance was RM where each misclassi cation error rate is equal to 6%. The worst performer was L with misclassi cation error rates of 28% and 26%. When an outlier was added to the Iris data, M and RM gave the lowest TPM at 8% with RM giving greater balance (8%, 7.8%) than M (6%, 11.8%). For the Leukemia data, RL and L gave the lowest TPM of 4.2%; however, the error rates of L were unbalanced (2.1%, 8%) compared to those of RL (4.3%, 66 4%). All the remaining ranked discriminant functions gave a TPM of 8.3%. The method that performed poorly for this data was M with TPM of 12.5%. Considering the Colon data, both RM and M yielded TPM equal to 9.7% outperforming all other classi ers. Once again, RM had balanced error rates (9.1%, 10%) compared to M (13.6%, 7.5%). The least optimal method for this data was S with unbalanced error rates (40.9%, 10%) giving a TPM of 21%. We also performed leave-one-out cross validation for the Leukemia and Colon data after using the gene selection scheme given by Dudoit et al. (2002) as well as one that uses a Wilcoxon statistic in place of the t-statistic. The results were very similar and hence not reported here. 4.3.2 Monte Carlo Simulation For the same reasons mentioned in Subsection 3.6.2, we perform a very exten- sive Monte Carlo simulation to study the optimality (in terms of misclassi cation error rates) of the proposed classi cation procedures under a variety of distributional settings: Homoscedastic (vs) Heteroscedastic Equal sample sizes (vs) Unequal sample sizes Same distributions (vs) Di erent distributions for the two groups Normal tail (vs) heavy tailed symmetric distributions 67 Sparse data (vs) Dense data Combinations therein The simulation study follows the same setup given in Montanari (2004). We start o by generating training samples of the given sizes which are used to formulate the classi cation rules. Testing samples of size 1000 from each group are then generated and the misclassi cation error rates are calculated by computing the proportion of misclassi ed testing sample observations in each group. This process is then repeated 50 times and the mean and standard error of the misclassi cation error rates are computed. We considered a two-dimensional case in addition to the four-dimensional case considered for the projection pursuit methodology to study the e ect of data abun- dance with the misclassi cation error rate. We consider Cauchy (C, which is t with 1 degree of freedom), t with 2 degrees of freedom (t2), and Normal (N) distributions. Training samples of equal (50;50) and unequal (25;75) sizes were generated from the distributions. 
In the two-dimensional setting we used centers $(0,0)'$ and $(2,0)'$ for the two distributions, and the covariance matrices considered are the independence case $I_2$ and a correlated-data case
\[
V = \begin{pmatrix} 1 & 1 \\ 1 & 4 \end{pmatrix}.
\]
For the four-dimensional study we considered centers $(0,0,0,0)'$ and $(2,0,0,0)'$ and covariance matrices $I_4$ and
\[
W = \begin{pmatrix}
1 & 1 & 1 & 1 \\
1 & 2 & 2 & 0.25 \\
1 & 2 & 3 & 0.5 \\
1 & 0.25 & 0.5 & 4
\end{pmatrix}.
\]
In reporting the results we use the notation $K(A), H(B)$ to represent distributions $K$ and $H$ ($K$ and $H$ are C, t2, or N) with covariance matrices $A$ and $B$ ($A$ and $B$ are $I_2$, $I_4$, $W$, or $V$), respectively. Thus, in our study, the two groups could come from the same distribution with different means and the same covariance matrix (homoscedastic), the same distribution with different means and different covariance matrices (heteroscedastic), and/or two different distributions. A summary of the results for the two-dimensional case is given in Table 4, whereas Table 5 (given in the appendix) contains the results from the four-dimensional study. We infer the following from the results.

S versus RS: S is heavily unbalanced, especially when the sample sizes are not the same, the tail thickness of one or more of the distributions increases, the distributions are heteroscedastic, or the two groups do not share the same distribution. The balance of RS is not affected by any of these situations. RS generally has a lower TPM than S, and the difference is pronounced when at least one of the distributions is t2 or Cauchy. Moreover, RS has better consistency of correct classification, in that the standard errors of its misclassification error rates are generally much lower than those of S.

L versus RL: As expected, L performs poorly in the case of heteroscedastic populations. Apart from balancing the misclassification error rates, RL does not appear to improve on the TPM of L, with the exception of the Cauchy distribution cases. In fact, RL gives a worse TPM than L when two different distributions are considered and one of them is normal.

Q versus RQ: The misclassification error rates of Q are severely unbalanced, especially when one of the distributions is heavy tailed. In addition to providing balance, RQ gives lower error rates than Q in heavy-tailed (C or t2) situations. The standard errors of the misclassification error rates of Q are three to four times higher than those of RQ when both distributions are heavy tailed.

M versus RM: The TPM of RM is generally lower than that of M in the case of heavy-tailed distributions. This is especially visible in the more sparse four-dimensional case. Otherwise, the TPMs of RM and M are comparable, with RM doing a better job of maintaining the balance between the misclassification error rates. The imbalance between the error rates of M increases when the two distributions are different and at least one of them is heavy tailed. Moreover, the standard errors of the misclassification error rates of M are 1.5 to 2 times larger than those of RM in the cases where a Cauchy distribution is involved.

In general, RM and M provide superior performance in terms of TPM when a heavy-tailed distribution is involved. RM gives the best performance of all the methods studied in terms of TPM, balance, and standard errors when both distributions are Cauchy; this is true in both homoscedastic and heteroscedastic cases. Not surprisingly, as shown in Randles et al. (1978a), ranking provides balance between the group misclassification error rates.
Moreover, in the cases where heavy-tailed or two different distributions are considered, ranking itself appears to provide added optimality (smaller TPM and standard errors) even when the original method is robust.

Chapter 5
Influence Function

5.1 Introduction

The focus of this chapter is the analysis of the influence functions of some of the discriminant functions introduced in earlier chapters. As discussed in Hubert and Van Driessen (2004), the robustness of a classifier depends directly on the robustness of the discriminant function used. The influence on the LDF of a single point of perturbation added to one of the two populations was studied by Campbell (1978). Croux and Dehon (2001) studied the influence function of the LDF where the underlying distributions are assumed to be multivariate normal and both populations are contaminated. Croux et al. (2008) followed a similar approach but included a penalty term to make the classifier optimal. Recently, Huang et al. (2007) studied the pair-perturbation influence function of the LDF, where a pair of perturbation points is included in one of the two populations. The influence function of quadratic discrimination, on the other hand, appears to be somewhat more complicated. Croux and Joossens (2005) studied the influence of perturbing a point in one of the populations on the misclassification error rate of quadratic discriminant analysis; although more complex, this is the QDF version of the work of Croux and Dehon (2001), where underlying normality is assumed.

The purpose of this chapter is two-fold: to provide the influence function for the QDF, thus filling this gap in the literature, and to provide the influence function for the rank based discriminant function introduced in Chapter 4. Both are given without making any underlying distributional assumptions. We suppose that both distributions are $\epsilon$-contaminated (Van Ness and Yang, 1998) and define
\[
F_{y,\epsilon} = (1-\epsilon)F + \epsilon\,\delta_y \quad\text{and}\quad G_{y,\epsilon} = (1-\epsilon)G + \epsilon\,\delta_y,
\]
where $\delta_y$ is the distribution function of the point mass at $y$. Now, if $T(F,G)$ is a functional, then the influence function is defined as
\[
IF(y;T(F,G)) = \lim_{\epsilon \downarrow 0} \frac{T(F_{y,\epsilon},G_{y,\epsilon}) - T(F,G)}{\epsilon}. \tag{5.1}
\]

5.2 IF for QDF

The quadratic discriminant function is given in Equation (2.4); we reproduce it here using functional notation for convenience:
\[
Q(z;F,G) = \ln\frac{|\Sigma(G)|}{|\Sigma(F)|} - (z-\mu(F))'[\Sigma(F)]^{-1}(z-\mu(F)) + (z-\mu(G))'[\Sigma(G)]^{-1}(z-\mu(G)).
\]
Define the function
\[
\Delta_F(x,v) = (x-\mu(F))'[\Sigma(F)]^{-1}(v-\mu(F)).
\]
The following theorem gives the influence function of the QDF.

Theorem 5.1. For a fixed value of $z$, the influence function of $Q(z;F,G)$ is given by
\[
IF(y;Q(z;F,G)) = \Delta_G(z,z) - 2\Delta_G(z,y) - \Delta_G^2(z,y) - \Delta_F(z,z) + 2\Delta_F(z,y) + \Delta_F^2(z,y) + \Delta_G(y,y) - \Delta_F(y,y).
\]

Proof. Write
\[
Q(z;F_{y,\epsilon},G_{y,\epsilon}) = \ln\frac{|\Sigma(G_{y,\epsilon})|}{|\Sigma(F_{y,\epsilon})|} - \Delta_{F_{y,\epsilon}}(z,z) + \Delta_{G_{y,\epsilon}}(z,z).
\]
Straightforward algebra shows that $\mu(F_{y,\epsilon}) = \mu(F) + \epsilon(y - \mu(F))$ and $\Sigma(F_{y,\epsilon}) = (1-\epsilon)\Sigma(F) + \epsilon(y-\mu(F))(y-\mu(F))'$. As shown in Campbell (1978), writing $\mu = \mu(F)$, $\Sigma = \Sigma(F)$, and $w = y - \mu(F)$, we have
\[
[\Sigma(F_{y,\epsilon})]^{-1} = (1-\epsilon)^{-1}\left[\Sigma^{-1} - \frac{\epsilon\,\Sigma^{-1}ww'\Sigma^{-1}}{(1-\epsilon) + \epsilon\,w'\Sigma^{-1}w}\right] = (1+\epsilon)\left[\Sigma^{-1} - \epsilon\,\Sigma^{-1}ww'\Sigma^{-1}\right]
\]
up to order $\epsilon$. Thus
\[
\begin{aligned}
\Delta_{F_{y,\epsilon}}(z,z) &= \{(z-\mu) - \epsilon w\}'\,(1+\epsilon)\left[\Sigma^{-1} - \epsilon\,\Sigma^{-1}ww'\Sigma^{-1}\right]\{(z-\mu) - \epsilon w\} \\
&= (1+\epsilon)\Delta_F(z,z) - 2\epsilon(1+\epsilon)\Delta_F(z,y) + \epsilon^2(1+\epsilon)\Delta_F(y,y) - \epsilon\Delta_F^2(z,y) + 2\epsilon^2\Delta_F(z,y)\Delta_F(y,y) - \epsilon^3\Delta_F^2(y,y).
\end{aligned}
\]
This gives
\[
\left.\frac{\partial}{\partial\epsilon}\Delta_{F_{y,\epsilon}}(z,z)\right|_{\epsilon=0} = \Delta_F(z,z) - 2\Delta_F(z,y) - \Delta_F^2(z,y). \tag{5.2}
\]
Using similar steps,
\[
\left.\frac{\partial}{\partial\epsilon}\Delta_{G_{y,\epsilon}}(z,z)\right|_{\epsilon=0} = \Delta_G(z,z) - 2\Delta_G(z,y) - \Delta_G^2(z,y). \tag{5.3}
\]
Now, using a result of Golberg (1972) [p. 1125, Equation (9)], we find
\[
\frac{\partial}{\partial\epsilon}|\Sigma(F_{y,\epsilon})| = \operatorname{tr}\!\left(\Sigma^{-1}(F_{y,\epsilon})\,\frac{\partial}{\partial\epsilon}\Sigma(F_{y,\epsilon})\right)|\Sigma(F_{y,\epsilon})|.
\]
Moreover,
\[
\frac{\partial}{\partial\epsilon}\Sigma(F_{y,\epsilon}) = ww' - \Sigma,
\]
where $w$ and $\Sigma$ are as defined above.
Thus
\[
\frac{\partial}{\partial\epsilon}\ln|\Sigma(F_{y,\epsilon})| = \frac{1}{|\Sigma(F_{y,\epsilon})|}\operatorname{tr}\!\left(\Sigma^{-1}(F_{y,\epsilon})\,\frac{\partial}{\partial\epsilon}\Sigma(F_{y,\epsilon})\right)|\Sigma(F_{y,\epsilon})|
= \operatorname{tr}\!\left((1+\epsilon)\left[\Sigma^{-1} - \epsilon\,\Sigma^{-1}ww'\Sigma^{-1}\right](ww' - \Sigma)\right),
\]
which gives
\[
\left.\frac{\partial}{\partial\epsilon}\ln|\Sigma(F_{y,\epsilon})|\right|_{\epsilon=0} = \operatorname{tr}\!\left(\Sigma^{-1}(ww' - \Sigma)\right) = \Delta_F(y,y) - d. \tag{5.4}
\]
Similarly,
\[
\left.\frac{\partial}{\partial\epsilon}\ln|\Sigma(G_{y,\epsilon})|\right|_{\epsilon=0} = \Delta_G(y,y) - d. \tag{5.5}
\]
Putting equations (5.2), (5.3), (5.4), and (5.5) together gives the desired result.

It is clear that $IF(y;Q(z;F,G))$ is unbounded in $y$; in fact, it is quadratic in $y$. This shows that quadratic discriminant analysis is a non-robust procedure and may provide misleading results if the data are contaminated.

5.3 IF for RD

Recall the rank based discriminant function given in Equation (4.1),
\[
RD(z;F,G) = P_F\{D(z;F,G) \ge D(X;F,G)\} - P_G\{D(z;F,G) \le D(Y;F,G)\},
\]
where $D(\cdot;F,G)$ is a discriminant function. We restrict our attention to the cases where $D(z;F,G)$ is defined as the difference between the Type D depth of $z$ in $F$ and in $G$,
\[
D(z;F,G) = \mathrm{DF}(z;F) - \mathrm{DF}(z;G).
\]
The Type D depth of the point $z$ in $F$ is defined as (Zuo and Serfling, 2000)
\[
\mathrm{DF}(z;F) = \inf\{P_F(C) : z \in C \in \mathcal{C}\},
\]
where $\mathcal{C}$ is a specified class of subsets of $\mathbb{R}^d$. For instance, if $\mathcal{C}$ is the class of halfspaces in $\mathbb{R}^d$, then $\mathrm{DF}$ is the halfspace depth. As in Wang and Serfling (2006), we assume that

1. if $C \in \mathcal{C}$, then $\overline{C^c} \in \mathcal{C}$, and
2. $\max_z \mathrm{DF}(z;F) < 1$,

where $A^c$ and $\overline{A}$ denote the complement and the closure of the set $A$. Let
\[
H_F(t;F,G) = P_F\{D(X;F,G) \le t\} \quad\text{and}\quad H_G(t;G,F) = P_G\{D(X;G,F) \le t\}.
\]
Note that
\[
RD(z;F,G) = H_F(D(z;F,G);F,G) - H_G(D(z;G,F);G,F).
\]

Theorem 5.2. If $D(v;F,G)$ is continuous in $v$, then the influence function of $H_F(t;F,G)$ for a fixed $t$ is given by
\[
IF(y;H_F(t;F,G)) =
\begin{cases}
t\,h_F(t) - H_F(t), & \text{if } D(y;F,G) > t, \\
1 + t\,h_F(t) - H_F(t), & \text{if } D(y;F,G) \le t,
\end{cases}
\]
where $h_F(t) = dH_F(t)/dt$.

Proof. It is easy to see that, for Type D depth functions $\mathrm{DF}$,
\[
D(z;F_{y,\epsilon},G_{y,\epsilon}) = (1-\epsilon)\mathrm{DF}(z;F) - (1-\epsilon)\mathrm{DF}(z;G) = (1-\epsilon)D(z;F,G).
\]
Then, following an approach similar to Wang and Serfling (2006), we obtain
\[
\begin{aligned}
H_{F_{y,\epsilon}}(t;F_{y,\epsilon},G_{y,\epsilon}) &= P_{F_{y,\epsilon}}\{(1-\epsilon)D(X;F,G) \le t\} \\
&= (1-\epsilon)P_F\{(1-\epsilon)D(X;F,G) \le t\} + \epsilon\,I\{(1-\epsilon)D(y;F,G) \le t\} \\
&= (1-\epsilon)P_F\{(1-\epsilon)D(X;F,G) \le t \,\cap\, X \in S_y\} + (1-\epsilon)P_F\{(1-\epsilon)D(X;F,G) \le t \,\cap\, X \notin S_y\} \\
&\quad + \epsilon\,I\{(1-\epsilon)D(y;F,G) \le t\},
\end{aligned}
\]
where $S_y = \{x : D(x;F,G) \ge D(y;F,G)\}$. Consider the following two cases.

(1) $D(y;F,G) > t$: For $\epsilon > 0$ sufficiently small,
\[
\{v : D(v;F,G) \le t/(1-\epsilon)\} \cap S_y = \emptyset,\qquad
\{v : D(v;F,G) \le t/(1-\epsilon)\} \cap S_y^c = \{v : D(v;F,G) \le t/(1-\epsilon)\},
\]
and $y \notin \{v : D(v;F,G) \le t/(1-\epsilon)\}$. Thus $H_{F_{y,\epsilon}}(t;F_{y,\epsilon},G_{y,\epsilon}) = (1-\epsilon)H_F(t/(1-\epsilon))$, which gives
\[
IF(y;H_F(t;F,G)) = \left.\frac{\partial}{\partial\epsilon}H_{F_{y,\epsilon}}(t;F_{y,\epsilon},G_{y,\epsilon})\right|_{\epsilon=0} = t\,h_F(t) - H_F(t).
\]

(2) $D(y;F,G) \le t$: Once again, for $\epsilon > 0$ sufficiently small,
\[
\{v : D(v;F,G) \le t/(1-\epsilon)\} \cap S_y = \{v : D(y;F,G) \le D(v;F,G) \le t/(1-\epsilon)\},\qquad
\{v : D(v;F,G) \le t/(1-\epsilon)\} \cap S_y^c = \{v : D(v;F,G) < D(y;F,G)\},
\]
and $y \in \{v : D(v;F,G) \le t/(1-\epsilon)\}$. Thus $H_{F_{y,\epsilon}}(t;F_{y,\epsilon},G_{y,\epsilon}) = (1-\epsilon)H_F(t/(1-\epsilon)) + \epsilon$, which gives
\[
IF(y;H_F(t;F,G)) = \left.\frac{\partial}{\partial\epsilon}H_{F_{y,\epsilon}}(t;F_{y,\epsilon},G_{y,\epsilon})\right|_{\epsilon=0} = 1 + t\,h_F(t) - H_F(t).
\]
The influence function of $H_G(t;G,F)$ may be obtained following similar steps.

We now give the influence function of the rank discriminant function.

Theorem 5.3. If $D(z;F,G)$ is continuous, then the influence function of $RD(z;F,G)$ is
\[
IF(y;RD(z;F,G)) =
\begin{cases}
H_G(D(z;G,F);G,F) - H_F(D(z;F,G);F,G) + 1, & \text{if } D(y;F,G) < D(z;F,G), \\
H_G(D(z;G,F);G,F) - H_F(D(z;F,G);F,G), & \text{if } D(y;F,G) = D(z;F,G), \\
H_G(D(z;G,F);G,F) - H_F(D(z;F,G);F,G) - 1, & \text{if } D(y;F,G) > D(z;F,G).
\end{cases}
\]

Proof. Following steps similar to the proof of Theorem 5.2 and some algebraic manipulation, we have
\[
\begin{aligned}
RD(z;F_{y,\epsilon},G_{y,\epsilon}) &= H_{F_{y,\epsilon}}(D(z;F_{y,\epsilon},G_{y,\epsilon});F_{y,\epsilon},G_{y,\epsilon}) - H_{G_{y,\epsilon}}(D(z;G_{y,\epsilon},F_{y,\epsilon});G_{y,\epsilon},F_{y,\epsilon}) \\
&= (1-\epsilon)\bigl[H_F(D(z;F,G);F,G) - H_G(D(z;G,F);G,F)\bigr] \\
&\quad + \epsilon\bigl[I\{D(y;F,G) \le D(z;F,G)\} - I\{D(y;G,F) \le D(z;G,F)\}\bigr].
\end{aligned}
\]
Note that $I\{D(y;G,F) \le D(z;G,F)\} = I\{D(y;F,G) \ge D(z;F,G)\}$, and therefore $I\{D(y;F,G) \le D(z;F,G)\} - I\{D(y;G,F) \le D(z;G,F)\}$ equals $0$, $1$, or $-1$ depending on whether $D(y;F,G) = D(z;F,G)$, $D(y;F,G) < D(z;F,G)$, or $D(y;F,G) > D(z;F,G)$, respectively. Differentiating with respect to $\epsilon$ and letting $\epsilon = 0$ gives the desired result.
Since $H_F$ and $H_G$ are CDFs, this influence function remains bounded in $y$ as long as the Type D depth function used is continuous. As $IF(y;RD(z;F,G))$ is a step function, $RD$ has finite gross error sensitivity but infinite local shift sensitivity. This behavior is similar to that of the influence function of univariate quantiles.

5.4 Sensitivity Curves for the Projection Based Classifier

To view graphically the effect of a single contaminating point introduced into the training sample on the probability of misclassification, one needs a sample version of the influence function given in Equation (5.1). This is provided by the quantity
\[
SC_n(y) = n\bigl[T_n(y_1,\ldots,y_{n-1},y) - T_{n-1}(y_1,\ldots,y_{n-1})\bigr], \tag{5.6}
\]
where $y_1,\ldots,y_{n-1}$ represent the training samples and $y$ is a new point introduced into each of the two training samples (Tukey, 1971); here $T_n$ is the probability of misclassification. $SC_n$ is known as the sensitivity curve, and it shows the effect of the added point on the classifier as the value of that point changes. The sensitivity curve of a robust procedure is bounded, in much the same way as its influence function is bounded.

We use a simple case in which two groups, with training samples of size 100 each and testing samples of size 1000 each, are generated from four-dimensional normal distributions with means $(0,0,0,0)$ and $(2,0,0,0)$. To simplify things further, we used the identity covariance structure for both distributions. We introduced a single point $(i,0,0,0)$, $i = -200,\ldots,200$, into both training samples, and for each point we calculated the probability of misclassification with and without the added point using leave-one-out cross-validation on the uncontaminated testing samples. The sensitivity of the probability of misclassification for each classifier was then calculated using Equation (5.6), and the resulting values were plotted.

Figure 5.1: Sensitivity Curve for the Projection Based Methods (sensitivity of the misclassification error versus the value of the new point, for SPGT, GGT, PGT, TD, MaxD, LDF, and QDF).

Figure 5.2: A Zoomed View of the Sensitivity Curve for the Projection Based Methods.

The sensitivity curves in Figure 5.1 clearly show that the LDF and QDF classifiers appear to have unbounded misclassification error sensitivity: as the magnitude of the new point increases, so does the sensitivity of their misclassification error. On the other hand, MaxD and all the projection based classifiers (SPGT, GGT, PGT, and TD) appear to be bounded; although they are affected by the presence of the outlier, their sensitivity is eventually bounded. A zoomed-in view of the plot for newly added points near the origin is shown in Figure 5.2. It shows that the sensitivity curves for the projection based methods are bounded, with PGT being the least affected. A theoretical evaluation of the influence function of these projection based classifiers appears to be extremely complicated, although the sensitivity curves give a strong indication that the influence function is bounded.

Chapter 6
Conclusion

Two types of nonparametric procedures for discriminant analysis were considered: one based on transvariation probabilities and the other based on ranking discriminant functions.

The first method is a nonparametric discriminant analysis procedure that uses the method of projection pursuit in tandem with the idea of transvariation probabilities.
In particular, group separation is measured using the two-group transvariation probability (Gini, 1916) as a projection index. Allocation of the new observation is performed either using a symmetrized group-group transvariation probability or a smoothed version of point-group transvariation. These procedures are shown to give smaller misclassi cation error rates compared to linear and quadratic discriminant functions and maximum L1 depth classi er when the training samples are drawn from symmetric and skewed distributions and especially when the di erence between the training sample sizes gets larger. Moreover these methods give smaller misclassi- cation error rates compared to Montanari?s (Montanari, 2004) transvariation based classi er that uses distances to allocate the new observation when the training samples are drawn from skewed distributions. An extensive simulation study and applications on three real data sets (Fisher?s IRIS, Leukemia and Colon datasets) are used to illustrate this behavior. 86 Although computation time can be a hinderance for projection pursuit tech- niques, especially for large dimensions, we were able to optimize our computation by generating projection directions using a multiscale procedure. To begin with, we proceed by rst generating a few points on the hypercube, that are projected onto the hypersphere and then repeatedly \zooming-in" to interesting hyperarcs on the surface of the d-dimensional hypersphere where more directions are considered. We took ve lattice points per dimension on the hypercube. No signi cant improvement in misclassi cation error rate was noticed when seven or nine lattice points were used instead of ve. We have found that this makes for a faster computation time without compromising the results. To illustrate the robustness of the proposed procedures, we provide sensitivity curves (Tukey, 1971) where a new point is introduced into each training sample. Then this new point is changed and its e ect on the misclassi cation error rate is measured. The curves provided in Chapter 5 show that the newly proposed classi- ers have bounded sensitivity to the inclusion of the new point with respect to the misclassi cation error rate. The second method is another nonparametric method that is based on rank- ing discriminant functions. In discriminant analysis, when an investigator has no prior reason to prefer one population to another, balance among the groups in terms of the probabilities of correct classi cation is a desirable property of a discriminant function. We asked ourselves whether the method of ranking discriminant functions 87 given in Randles et al. (1978a) would work in balancing the probabilities of cor- rect classi cation for the recently proposed maximum L1 depth classi er (J ornsten, 2004) and MCD (Minimum Covariance Determinant) based quadratic discriminant function (Hubert and Van Driessen, 2004). Not surprisingly, our extensive simula- tion study showed that balance between the probabilities of correct classi cation is achieved, while maintaining the overall robustness of the procedure under a variety of distributional settings. As it turns out, ranking discriminant functions does more than just provide bal- ance between the the probabilities of correct classi cation while maintaining the total probability of misclassi cation error at the level of the original discriminant functions. 
Ranking also gives smaller and more consistent total probabilities of misclassi cation error rates in cases where the underlying distributions are heavy tailed. This is reminiscent of the use of ranking in univariate and certain multivariate problems to arrive at procedures that are robust against a variety of violation of assumptions (Hettmansperger and McKean, 1998). A somewhat surprising revelation was that this bene t persisted even when the underlying discriminant function itself was ro- bust such as the one given by Hubert and Van Driessen (2004). In particular, in the simulation settings that we investigated, the ranked version of the MCD based classi er of Hubert and Van Driessen (2004) gives the best performance of all the methods studied in terms of total probability of correct classi cation, balance, and standard errors when the two underlying distributions are Cauchy. 88 In Chapter 5 we derived the in uence function for the quadratic discriminant function as well as the rank based discriminant function that is given in Chapter 4. The in uence function for the quadratic discriminant function is unbounded, as ex- pected, but the in uence function for the rank based discriminant function is a step function and hence robust against gross error contamination. Future Work The work presented in this dissertation is by no stretch of imagination complete. A plethora of things can be done taking the work presented here as a starting point. For the projection based classi ers, more e cient projection pursuit techniques that would let one classify observations from dimensions much higher are desired. Projection indices that are robust and at the same time amenable to a more e cient method of searching for the most interesting view of the data will make projection pursuit based classi cation more appealing. Moreover, each transvariation in the GGT classi er can be smoothed as in the SPGT classi er. A better smoothing func- tion that works better than the t-CDF proposed in SPGT classi er can be found. Finding the optimal smoother for the t-CDF or for any other function theoretically is also an interesting problem. For the rank based classi ers, one can always nd robust distance measures that are more robust to deviations that we can then rank. That way the ranked version would be that much better and balanced. As mentioned in Ng and Randles (1983), the ratio of the misclassi cation error rates for the two samples can be controlled. 89 This idea could be applied to classify data from the medical eld where the ratio of false-positive to false-negative can be controlled at any desired level. Both types of classi ers only consider the two group classi cation. The extension to more than two groups is not immediate and is interesting. Theoretically, the in uence function for the discriminant function can be derived for the projection based classi ers. This will con rm the results that are seen in the sensitivity curves. Finally, the training data breakdown point could be another theoretical result that could be found for both types of classi ers. 90 Bibliography Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., and Levine, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA, 96(12):6745{6750. Anderson, T. W. (1984). An introduction to multivariate statistical analysis. 
Wiley Series in Probability and Mathematical Statistics: Probability and Mathematical Statistics. John Wiley & Sons Inc., New York, second edition. Andrews, D. F., Bickel, P. J., Hampel, F. R., Huber, P. J., Rogers, W. H., and Tukey, J. W. (1972). Robust estimates of location: Survey and advances. Princeton University Press, Princeton, N.J. Billor, N., Abebe, A., Turkmen, A., and Nudurupati, S. V. (2008). Classi cation based on depth transvariations. J. Classi cation, (in press). Brzezinski, J. R. (1999). Logistic Regression Modeling for Context-Based Classi ca- tion. IEEE Computer Society, Washington, DC, USA. Campbell, N. A. (1978). The in uence function as an aid in outlier detection in discriminant analysis. Applied Statistics, 27(3):251{258. Chen, Z. Y. (1989). Robust linear discriminant procedures using projection pursuit methods. Unpublished PhD Dissertation, University of Michigan. 91 Chen, Z.-Y. and Muirhead, R. J. (1994). A comparison of robust linear discriminant procedures using projection pursuit methods. In Multivariate analysis and its ap- plications (Hong Kong, 1992), volume 24 of IMS Lecture Notes Monogr. Ser., pages 163{176. Inst. Math. Statist., Hayward, CA. Chork, C. Y. and Rousseeuw, P. J. (1992). Integrating a high-breakdown option into discriminant analysis in exploration geochemistry. Journal of Geochemical Exploration, 43(3):191{203. Christmann, A., Fischer, P., and Joachims, T. (2002). Comparison between vari- ous regression depth methods and the support vector machine to approximate the minimum number of misclassi cations. Comput. Statist., 17(2):273{287. Christmann, A. and Rousseeuw, P. J. (2001). Measuring overlap in binary regression. Comput. Statist. Data Anal., 37(1):65{75. Crimin, K., McKean, J. W., and Sheather, S. J. (2007). Discriminant procedures based on e cient robust discriminant coordinates. J. Nonparametr. Stat., 19(4- 5):199{213. Croux, C. and Dehon, C. (2001). Robust linear discriminant analysis using S- estimators. Canad. J. Statist., 29(3):473{493. Croux, C., Filzmoser, P., and Joossens, K. (2008). Classi cation e ciencies for robust linear discriminant analysis. Statist. Sinica, 18(2):581{599. 92 Croux, C., Filzmoser, P., and Oliveira, M. R. (2007). Algorithms for projection pursuit: robust principal component analysis. Chemometrics and Intelligent Labo- ratory Systems, 87(2):218{225. Croux, C. and Haesbroeck, G. (2000). Principal component analysis based on ro- bust estimators of the covariance or correlation matrix: In uence functions and e ciencies. Biometrika, 87(3):603{618. Croux, C. and Joossens, K. (2005). In uence of observations on the misclassi cation probability in quadratic discriminant analysis. J. Multivariate Anal., 96(2):384{ 403. Dillon, W. R. (1979). The performance of the linear discriminant function in nonop- timal situations and the estimation of classi cation error rates: A review of recent ndings. Journal of Marketing Research, 16:370{381. Donoho, D. (1982). Breakdown properties of multivariate location estimators. In PhD Qualifying paper. Department of Statistics, Harvard University. Donoho, D. L. and Gasko, M. (1992). Breakdown properties of location estimates based on halfspace depth and projected outlyingness. Ann. Statist., 20(4):1803{ 1827. Dudoit, S., Fridlyand, J., and Speed, T. P. (2002). Comparison of discrimination methods for the classi cation of tumors using gene expression data. J. Amer. Statist. Assoc., 97(457):77{87. 93 Eddy, W. F. (1985). Ordering of multivariate data. 
In In Computer Science and Statistics: The Interface(L. Billard, ed.), pages 25{30. North-Holland, Amsterdam. Fang, K. T. and Wang, Y. (1994). Number-theoritic methods in statistics. Chapman and Hall, London. Filzmoser, P., Serneels, S., Croux, C., and Van Espen, P. J. (2006). Robust multi- variate methods: The projection pursuit approach. From Data and Information Analysis to Knowledge Engineering, pages 270{277. Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, VII(II):179{188. Friedman, J. H. (1987). Exploratory projection pursuit. Journal of the American Statistical Association, 82(397):249{266. Friedman, J. H. and Tukey, J. W. (1974). Projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, C 23(9):881{890. Ghosh, A. K. and Chaudhuri, P. (2005). On maximum depth and related classi ers. Scand. J. Statist., 32(2):327{350. Gini, C. (1916). Il concetto di transvariazione e le sue prime applicazioni. Giornale degli economisti Rivista di statistica. Golberg, M. A. (1972). The derivative of a determinant. Amer. Math. Monthly, 79:1124{1126. 94 Goldstein, M. (1987). [a review of multivariate analysis]: Comment. Statistical Sci- ence, 2(4):418{420. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M., Downing, J. R., Caligiuri, M. A., Bloom eld, C. D., and Lander, E. S. (1999). Molecular classi cation of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531{ 537. Gunn, S. R. (1998). Support vector machines for classi cation and regression. Te- chinical Report, University of Southampton. Hardy, A. (1991). On tests concerning the existence of a classi cation. Belgian Journal of Operations Research, Statistics and Computer Sciences, (31):111{126. Hawkins, D. M. and McLachlan, G. J. (1997). High-breakdown linear discriminant analysis. Journal of the American Statistical Association, 92(437):136{143. He, X. and Fung, W. K. (2000). High breakdown estimation for multiple populations with applications to discriminant analysis. J. Multivariate Anal., 72(2):151{162. Hellman, M. E. (1970). Nearest neighbor classi cation rule with a reject option. IEEE Transactions on Systems Science and Cybernetics, SSC6(3):179{&. Hettmansperger, T. P. and McKean, J. W. (1998). Robust nonparametric statistical methods, volume 5 of Kendall?s Library of Statistics. Edward Arnold, London. 95 Hettmansperger, T. P. and Randles, R. H. (2002). A practical a ne equivariant multivariate median. Biometrika, 89(4):851{860. Hills, M. (1967). Discrimination and allocation with discrete data. Applied Statistics, 16(3):237{250. Hoberg, R. (2003). Clusteranalyse. In Klassi kation und Datentiefe. Eul, Lohmar. Hollander, M. and Wolfe, D. A. (1999). Nonparametric statistical methods. Wiley Series in Probability and Statistics: Texts and References Section. John Wiley & Sons Inc., New York, second edition. A Wiley-Interscience Publication. Huang, Y., Kao, T.-L., and Wang, T.-H. (2007). In uence functions and local in u- ence in linear discriminant analysis. Comput. Statist. Data Anal., 51(8):3844{3861. Huber, P. J. (1977). Robust covariances. In Statistical decision theory and related topics, II (Proc. Sympos., Purdue Univ., Lafayette, Ind., 1976), pages 165{191. Academic Press, New York. Huber, P. J. (1981). Robust statistics. John Wiley & Sons Inc., New York. Wiley Series in Probability and Mathematical Statistics. Huber, P. J. (1985). Projection pursuit. Ann. 
Statist., 13(2):435{525. With discussion. Huber, P. J. (1989). Programs for Exploratory Projection Pursuit. Artemis Systems, Carlisle, MA. 96 Hubert, M. and Van Driessen, K. (2004). Fast and robust discriminant analysis. Comput. Statist. Data Anal., 45(2):301{320. Hugg, J., Rafalin, R., Seyboth, K., and Souvaine, D. (2006). An experimental study of old and new depth measures. In Workshop on Algorithm Engineering and Ex- periments (ALENEX06), pages 51{64. Springer-Verlag Lecture Notes in Computer Science, New York. Johnson, M. E., Wang, C., and Ramberg, J. S. (1979). Robustness of sher?s linear discriminant function to departures from normality. In Los Alamos Techinical Report LA-8068-MS. Los Alamos, NM 87545. Johnson, N. L. (1949). Systems of frequency curves generated by methods of trans- lation. Biometrika, 36:149{176. Jones, M. C. (1983). The projection pursuit algorithm for exploratory data anlaysis. PhD Thesis, University of Bath. Jones, M. C. and Sibson, R. (1987). What is projection pursuit? J. Roy. Statist. Soc. Ser. A, 150(1):1{36. With discussion. Joossens, K. and Croux, C. (2004). Empirical comparison of the classi cation per- formance of robust linear and quadratic discriminant analysis. In Theory and ap- plications of recent robust methods, Stat. Ind. Technol., pages 131{140. Birkh auser, Basel. 97 J ornsten, R. (2004). Clustering and classi cation based on the L1 data depth. J. Multivariate Anal., 90(1):67{89. Lachenbruch, P. A. (1975). Discriminant analysis. Hafner Press, New York. Lachenbruch, P. A., Sneeringer, C., and Revo, L. T. (1973). Robustness of the linear and quadratic discriminant function to certain types of non-normality. Comm. Statist., 1(1):39{56. LeBlanc, M. and Tibshirani, R. (1996). Combining estimates in regression and clas- si cation. J. Amer. Statist. Assoc., 91(436):1641{1650. Liu, R. Y. (1990). On a notion of data depth based on random simplices. Ann. Statist., 18(1):405{414. Liu, R. Y. (1992). Data depth and multivariate rank tests. In L1-statistical analysis and related methods (Neuch^atel, 1992), pages 279{294. North-Holland, Amsterdam. Liu, R. Y., Parelius, J. M., and Singh, K. (1999). Multivariate analysis by data depth: descriptive statistics, graphics and inference. Ann. Statist., 27(3):783{858. With discussion and a rejoinder by Liu and Singh. Liu, R. Y. and Singh, K. (1993). A quality index based on data depth and multivariate rank tests. J. Amer. Statist. Assoc., 88(421):252{260. Mahalanobis, P. C. (1936). On the generalized distance in statistics. Proceedings of the National Institute of Science of India, pages 49{55. 98 Marsaglia, G. (1972). Choosing a point from the surface of a sphere. The Annals of Mathematical Statistics, 43(2):645{646. McLachlan, G. J. (1992). Discriminant analysis and statistical pattern recognition. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. John Wiley & Sons Inc., New York. A Wiley-Interscience Publication. Mojirsheibani, M. (1997). A consistent combined classi cation rule. Statist. Probab. Lett., 36(1):43{47. Mojirsheibani, M. (1999). Combining classi ers via discretization. J. Amer. Statist. Assoc., 94(446):600{609. Mojirsheibani, M. (2000). A kernel-based combined classi cation rule. Statist. Probab. Lett., 48(4):411{419. Montanari, A. (2004). Linear discriminant analysis and transvariation. J. Classi ca- tion, 21(1):71{88. Mosler, K. (2002). Multivariate dispersion, central regions and depth, volume 165 of Lecture Notes in Statistics. Springer-Verlag, Berlin. 
The lift zonoid approach. Mosler, K. and Hoberg, R. (2003). Classi cation based on data depth, Bulliten of the ISI 54th session. Nason, G. (1995). Three-dimensional projection pursuit. Applied Statistics, 44(4):411{430. 99 Ng, T. H. and Randles, R. H. (1983). Rank procedures in many population forced discrimination problems. Comm. Statist. A|Theory Methods, 12(17):1943{1959. Nguyen, D. V. and Rocke, D. M. (2002). Tumor classi cation by partial least squares using microarray gene expression data. Bioinformatics, 18:39{50. Pao, Y. H. (1989). Adaptive pattern Recognition and Neural Network. Addition- Wesley Publishing Company, Inc. Posse, C. (1990). An e ective two-deimensional projection pursuit algorithm. Com- munications in Statistics: Simulation and Computation, 19:1143{1164. Posse, C. (1992). Projection pursuit discriminant analysis for two groups. Comm. Statist. Theory Methods, 21(1):1{19. Posse, C. (1995). Projection pursuit exploratory data analysis. Computational Statis- tics & Data Analysis, 20(6):669{687. Powell, M. J. D. (1964). E cient method for nding minimum of function of several- variables without calculating derivatives. Computer Journal, 7(2):155{&. Randles, R. H., Bro tt, J. D., Ramberg, J. S., and Hogg, R. V. (1978a). Discriminant analysis based on ranks. J. Amer. Statist. Assoc., 73(362):379{384. Randles, R. H., Bro tt, J. D., Ramberg, J. S., and Hogg, R. V. (1978b). General- ized linear and quadratic discriminant functions using robust estimates. J. Amer. Statist. Assoc., 73(363):564{568. 100 R atsch, G. (1998). Ensemble learning methods for classi cation. Diploma thesis (in german). R atsch, G., Onoda, T., and M uller, K.-R. (1998). Soft margins for AdaBoost. Tech- nical Report NC-TR-1998-021, Department of Computer Science, Royal Holloway, University of London, Egham, UK. Submitted to Machine Learning. Rosenbrock, H. H. (1960). An automatic method for nding the gretest or least value of a function. Computer Journal, 3(3):175{184. Rousseeuw, P. (1985). Multivariate estimation with high breakdown point. In Mathe- matical statistics and applications, Vol. B (Bad Tatzmannsdorf, 1983), pages 283{ 297. Reidel, Dordrecht. Rousseeuw, P. and Van Driessen, K. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41(3):212{223. Rousseeuw, P. J. (1984). Least median of squares regression. J. Amer. Statist. Assoc., 79(388):871{880. Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection. Wiley, New York. Rousseeuw, P. J. and Vanzomeren, B. C. (1990). Unmasking multivariate outliers and levereage points. Journal of the American Statistical Association, 85(411):633{639. 101 Rousseeuw, P. J. and Yohai, V. (1984). Robust regression by means of S-estimators. Robust and Nonlinear Time Series Analysis. Lecture Notes in Statistics, 26:256{ 272. Seber, G. A. F. (1984). Multivariate Observations. John Wiley, New York. Ser ing, R. J. (1980). Approximation theorems of mathemtical statistics. John Wiley & Sons Inc., New York. Wiley Series in Probability and Mathematical Statistics. Singh, K. (1991). A notion of majority depth. In Technical Report. Department of Statistics, Rutgers University. Ting, K. (2002). Cost-sensitive classi cation using decision trees. Idea Group Pub- lishing. Tukey, J. (1974). Address to international congress of mathematicians. Vancouver. Tukey, J. W. (1971). Explaratory Data Analysis. Addison-Wesley, Reading. Tyler, D. E. (1987). A distribution-free M-estimator of multivariate scatter. Ann. 
Statist., 15(1):234{251. Van Ness, J. W. and Yang, J. J. (1998). Robust discriminant analysis: training data breakdown point. J. Statist. Plann. Inference, 67(1):67{83. Vardi, Y. and Zhang, C.-H. (2000). The multivariate L1-median and associated data depth. Proc. Natl. Acad. Sci. USA, 97(4):1423{1426 (electronic). 102 Wang, J. and Ser ing, R. (2006). In uence functions for a general class of depth-based generalized quantile functions. J. Multivariate Anal., 97(4):810{826. Yenyukov, I. S. (1988). Detecting Structures by Means of Projection Pursuit. Pro- ceeding of COMPSTAT, 88:48{58. Yenyukov, I. S. (1989). Indices for projection pursuit. E. Diday (Ed.), Data Analysis, Learning Symbolic and Numeric Language, pages 181{189. Zuo, Y. and Ser ing, R. (2000). General notions of statistical depth function. Ann. Statist., 28(2):461{482. 103 Appendices 104 Table 1: Application on Real Data: Performance of Projection Based Methods Using Leave-one-out Misclassi cation Error Rates Dist SPGT PGT GGT TD MaxD LDF QDF Iris Original 0.0933 0.0467 0.0400 0.3133 0.0333 0.1067 0.0333 Outliers 0.0774 0.0397 0.0387 0.2968 0.0774 0.2129 0.1226 Leukemia 0.0278 0.0290 0.0139 0.1111 0.0417 0.0417 0.0417 Colon 0.1290 0.1525 0.1290 0.1290 0.1774 0.1290 0.1613 105 Table 2: Performance of Projection Based Methods Using Monte Carlo Simulation: Average Misclassi cation Error Rates and Standard Errors on 50 Replications of 1000 Test Cases per Group Dist SPGT PGT GGT TD MaxD LDF QDF N(I),N(I) 0.1641 0.1625 0.1624 0.1624 0.1665 0.1635 0.1659 (150,150) 0.0091 0.0086 0.0087 0.0083 0.0079 0.0075 0.007 N(I),N(I) 0.1638 0.1953 0.1696 0.1607 0.1711 0.1623 0.1685 (50,250) 0.0097 0.015 0.0122 0.0094 0.0114 0.0085 0.0098 t(I),t(I) 0.2165 0.216 0.2137 0.213 0.2932 0.2465 0.3698 (150,150) 0.0097 0.0109 0.011 0.0107 0.0372 0.0405 0.0738 t(I),t(I) 0.2258 0.4075 0.2227 0.2177 0.3093 0.26 0.3833 (50,250) 0.018 0.0711 0.0177 0.0151 0.0387 0.0501 0.0673 C(I),C(I) 0.2617 0.2643 0.2565 0.2557 0.4208 0.4492 0.5 (150,150) 0.0116 0.0152 0.0114 0.0116 0.0468 0.0777 0.0056 C(I),C(I) 0.2747 0.4938 0.2659 0.2615 0.4201 0.4501 0.4984 (50,250) 0.024 0.0239 0.0218 0.0208 0.0482 0.072 0.0095 LN(I),LN(I) 0.1595 0.1588 0.1581 0.1783 0.2361 0.2576 0.2273 (150,150) 0.0104 0.0099 0.0088 0.0119 0.0338 0.0243 0.0249 LN(I),LN(I) 0.1644 0.1971 0.1713 0.1856 0.2713 0.2761 0.2426 (50,250) 0.0106 0.0173 0.0139 0.0112 0.0405 0.0362 0.0338 106 Table 2 Continued Dist SPGT PGT GGT TD MaxD LDF QDF N(I),N(W) 0.0954 0.1033 0.105 0.1003 0.0663 0.0969 0.0656 (150,150) 0.0095 0.0124 0.012 0.0088 0.0062 0.007 0.0059 N(I),N(W) 0.0993 0.1001 0.0974 0.1039 0.0698 0.0999 0.0686 (50,250) 0.012 0.0171 0.0124 0.0111 0.0052 0.0086 0.0051 t(I),t(W) 0.2095 0.2093 0.2043 0.2045 0.247 0.2222 0.3152 (150,150) 0.012 0.015 0.0131 0.0119 0.054 0.0425 0.0693 t(I),t(W) 0.2151 0.337 0.2093 0.2077 0.267 0.2249 0.3313 (50,250) 0.0145 0.0818 0.0119 0.0129 0.0553 0.0354 0.0554 C(I),C(W) 0.2601 0.2567 0.2531 0.2505 0.4007 0.4264 0.4888 (150,150) 0.0145 0.0173 0.0141 0.0131 0.0542 0.0706 0.0176 C(I),C(W) 0.2645 0.4967 0.2546 0.2525 0.3728 0.4326 0.4675 (50,250) 0.0191 0.0186 0.0132 0.0126 0.0478 0.1004 0.0207 LN(I),LN(W) 0.1271 0.1253 0.1258 0.1493 0.1729 0.2273 0.2382 (150,150) 0.01 0.01 0.0093 0.0141 0.0228 0.0216 0.0293 LN(I),LN(W) 0.132 0.144 0.1313 0.1537 0.1889 0.2376 0.2339 (50,250) 0.0136 0.0144 0.0137 0.0139 0.036 0.0284 0.039 107 Table 2 Continued Dist SPGT PGT GGT TD MaxD LDF QDF N(I),t(I) 0.1817 0.1932 0.19 0.1851 0.3331 0.2054 0.207 (150,150) 0.0115 0.0163 0.0108 0.0117 0.0623 0.0443 
N(I),t(I) 0.1868 0.4262 0.2093 0.1893 0.3418 0.2022 0.2123
(50,250) 0.0104 0.0688 0.0132 0.0097 0.0458 0.0184 0.0267
N(I),C(I) 0.1844 0.2312 0.2213 0.2074 0.4516 0.3288 0.2356
(150,150) 0.0099 0.0253 0.0133 0.0082 0.0505 0.0841 0.0263
N(I),C(I) 0.1884 0.5016 0.2366 0.2057 0.4691 0.3314 0.2357
(50,250) 0.0171 0.0044 0.0172 0.0091 0.0341 0.0819 0.0238
LN(I),N(I) 0.131 0.1293 0.1286 0.131 0.1506 0.1506 0.2098
(150,150) 0.0091 0.0085 0.0087 0.0083 0.0169 0.0143 0.0256
LN(I),N(I) 0.1487 0.1529 0.1715 0.1375 0.1523 0.1522 0.2004
(50,250) 0.0163 0.0141 0.0153 0.0156 0.0248 0.0202 0.0271
t(I),C(I) 0.2333 0.2434 0.2392 0.2354 0.4077 0.3543 0.3918
(150,150) 0.0108 0.0212 0.0088 0.0084 0.0534 0.0796 0.0154
t(I),C(I) 0.2341 0.5024 0.2429 0.233 0.399 0.3405 0.3841
(50,250) 0.013 0.002 0.0159 0.0106 0.0532 0.0882 0.0177
LN(I),t(I) 0.1979 0.2032 0.204 0.1894 0.2268 0.2077 0.2452
(150,150) 0.0156 0.0154 0.0161 0.0125 0.0279 0.017 0.0709
LN(I),t(I) 0.225 0.2483 0.3979 0.1993 0.2546 0.2158 0.263
(50,250) 0.0234 0.023 0.0717 0.0204 0.04 0.0246 0.0653
LN(I),C(I) 0.2532 0.2789 0.2742 0.2433 0.3916 0.2805 0.3628
(150,150) 0.0233 0.0188 0.0253 0.0225 0.0426 0.0446 0.016
LN(I),C(I) 0.3024 0.3286 0.5018 0.2698 0.4091 0.3232 0.3608
(50,250) 0.0327 0.0328 0.0013 0.0374 0.0411 0.0765 0.0203

Table 2 Continued
Dist SPGT PGT GGT TD MaxD LDF QDF
N(W),t(I) 0.1552 0.1793 0.1782 0.1655 0.1908 0.1746 0.0848
(150,150) 0.0248 0.0156 0.023 0.0182 0.0417 0.0305 0.0104
N(W),t(I) 0.1566 0.1672 0.139 0.157 0.1741 0.1502 0.0807
(50,250) 0.026 0.0164 0.0303 0.0195 0.0498 0.0242 0.0072
N(W),C(I) 0.1804 0.2203 0.2194 0.205 0.4146 0.3474 0.1149
(150,150) 0.0185 0.0163 0.0302 0.0139 0.0566 0.0985 0.0228
N(W),C(I) 0.1861 0.2189 0.1689 0.2046 0.3866 0.3238 0.0991
(50,250) 0.0291 0.0181 0.0288 0.0168 0.0741 0.0991 0.0232
LN(W),N(I) 0.0145 0.0141 0.0116 0.0681 0.0733 0.1327 0.047
(150,150) 0.004 0.0042 0.0043 0.0142 0.0247 0.027 0.0106
LN(W),N(I) 0.0155 0.0188 0.0177 0.0725 0.0886 0.1417 0.0483
(50,250) 0.0047 0.0076 0.0084 0.0128 0.0353 0.026 0.0149
t(W),C(I) 0.2296 0.2365 0.2406 0.231 0.3288 0.3734 0.4131
(150,150) 0.0145 0.012 0.022 0.0105 0.0512 0.0875 0.0183
t(W),C(I) 0.2425 0.238 0.3266 0.2331 0.3274 0.3587 0.4031
(50,250) 0.0154 0.0135 0.0734 0.0104 0.0535 0.0996 0.0349
LN(W),t(I) 0.0584 0.0555 0.0551 0.084 0.1107 0.1532 0.2045
(150,150) 0.0075 0.0063 0.0065 0.0134 0.0239 0.03 0.0487
LN(W),t(I) 0.0637 0.0594 0.0599 0.0915 0.1125 0.159 0.1933
(50,250) 0.0106 0.0119 0.0135 0.0159 0.0246 0.0292 0.0428
LN(W),C(I) 0.1193 0.1156 0.1145 0.1373 0.1648 0.2324 0.4103
(150,150) 0.0135 0.0125 0.0127 0.0204 0.0427 0.0573 0.0735
LN(W),C(I) 0.1277 0.1493 0.2292 0.1445 0.1814 0.2564 0.4515
(50,250) 0.0187 0.0201 0.0495 0.0288 0.0561 0.0741 0.0369

Table 3: Application on Real Data: Performance of Rank Based Methods Using Leave-one-out Misclassification Error Rates

Data RS1 RS2 RS S1 S2 S RL1 RL2 RL L1 L2 L
Iris Original .060 .040 .053 .000 .500 .167 .270 .260 .267 .280 .260 .273
Outlier .130 .098 .119 .010 .549 .192 .510 .451 .490 .490 .294 .424
Leukemia .064 .120 .083 .064 .160 .097 .043 .040 .042 .021 .080 .042
Colon .136 .200 .177 .409 .100 .210 .091 .125 .113 .091 .125 .113

Table 3 Continued
Data RQ1 RQ2 RQ Q1 Q2 Q RM1 RM2 RM M1 M2 M
Iris Original .060 .040 .053 .010 .100 .040 .060 .060 .060 .060 .100 .073
Outlier .110 .137 .119 .010 .843 .291 .080 .078 .080 .060 .118 .080
Leukemia .064 .120 .083 .064 .160 .097 .064 .120 .083 .106 .160 .125
Colon .136 .175 .161 .273 .100 .161 .091 .100 .097 .136 .075 .097

Table 4: Performance of
Rank Based Methods Using Monte Carlo Simulation: Av- erage Misclassi cation Error Rates and Standard Errors for Bivariate Data on 50 Replications of 1000 Test Cases per Group Dist RS1 RS2 RS S1 S2 S RL1 RL2 RL L1 L2 L N(I2),N(I2) .168 .166 .167 .160 .175 .167 .163 .165 .164 .159 .166 .162 (50,50) .036 .037 .010 .037 .045 .012 .035 .038 .009 .024 .027 .009 N(I2),N(I2) .157 .177 .167 .184 .154 .169 .153 .173 .163 .159 .164 .162 (25,75) .038 .039 .011 .047 .038 .015 .036 .035 .009 .028 .028 .009 t2(I2),t2(I2) .268 .267 .268 .276 .261 .268 .252 .242 .247 .241 .254 .248 (50,50) .060 .063 .035 .100 .099 .037 .083 .073 .061 .082 .094 .059 t2(I2),t2(I2) .250 .279 .265 .292 .246 .269 .233 .259 .246 .269 .226 .248 (25,75) .065 .053 .030 .114 .109 .039 .083 .076 .054 .114 .082 .063 C(I2),C(I2) .350 .337 .343 .363 .330 .347 .385 .397 .391 .417 .412 .415 (50,50) .067 .072 .046 .127 .114 .050 .149 .143 .136 .242 .224 .102 C(I2),C(I2) .345 .352 .349 .427 .282 .354 .371 .397 .384 .433 .363 .398 (25,75) .070 .082 .056 .154 .151 .056 .140 .138 .126 .240 .228 .102 N(I2),N(V) .126 .134 .130 .190 .077 .133 .146 .150 .148 .162 .129 .145 (50,50) .035 .035 .008 .042 .026 .011 .036 .041 .009 .028 .027 .008 N(I2),N(V) .133 .141 .137 .216 .072 .144 .149 .156 .152 .168 .128 .148 (25,75) .044 .047 .011 .052 .024 .018 .044 .053 .012 .029 .028 .009 t2(I2),t2(V) .237 .237 .237 .342 .145 .244 .236 .239 .238 .236 .240 .238 (50,50) .041 .055 .029 .100 .063 .032 .052 .062 .044 .062 .081 .044 t2(I2),t2(V) .237 .261 .249 .383 .127 .255 .234 .228 .231 .246 .213 .230 (25,75) .064 .054 .026 .116 .055 .042 .075 .054 .046 .078 .069 .041 C(I2),C(V) .318 .320 .319 .409 .237 .323 .342 .375 .358 .357 .409 .383 (50,50) .058 .053 .040 .163 .135 .050 .114 .121 .110 .237 .227 .089 C(I2),C(V) .320 .378 .349 .490 .227 .358 .380 .418 .399 .449 .392 .421 (25,75) .077 .078 .049 .204 .142 .068 .155 .148 .144 .295 .263 .089 111 Table 4 Continued Dist RQ1 RQ2 RQ Q1 Q2 Q RM1 RM2 RM M1 M2 M N(I2),N(I2) .166 .165 .165 .162 .166 .164 .172 .165 .169 .165 .169 .167 (50,50) .038 .037 .009 .026 .029 .010 .043 .036 .012 .034 .037 .012 N(I2),N(I2) .171 .159 .165 .166 .164 .165 .176 .168 .172 .180 .158 .169 (25,75) .046 .042 .011 .029 .031 .012 .051 .047 .013 .038 .031 .012 t2(I2),t2(I2) .276 .287 .282 .289 .347 .318 .257 .244 .250 .252 .253 .253 (50,50) .065 .070 .049 .165 .210 .080 .048 .045 .026 .050 .053 .023 t2(I2),t2(I2) .268 .280 .274 .324 .315 .320 .259 .247 .253 .265 .238 .251 (25,75) .070 .074 .049 .207 .188 .085 .064 .058 .028 .044 .048 .027 C(I2),C(I2) .405 .395 .400 .526 .436 .481 .344 .335 .340 .348 .340 .344 (50,50) .099 .084 .068 .376 .379 .044 .058 .057 .037 .108 .077 .056 C(I2),C(I2) .388 .435 .411 .312 .648 .480 .356 .332 .344 .391 .312 .351 (25,75) .094 .090 .078 .303 .316 .033 .076 .071 .039 .125 .102 .055 N(I2),N(V) .128 .135 .131 .133 .120 .126 .130 .137 .133 .138 .124 .131 (50,50) .035 .039 .010 .027 .027 .006 .040 .048 .011 .028 .027 .009 N(I2),N(V) .122 .141 .131 .148 .118 .133 .152 .120 .136 .154 .116 .135 (25,75) .030 .036 .010 .032 .028 .011 .040 .038 .015 .038 .036 .012 t2(I2),t2(V) .255 .252 .253 .242 .368 .305 .226 .217 .221 .214 .246 .230 (50,50) .059 .061 .041 .204 .197 .085 .047 .050 .021 .032 .074 .032 t2(I2),t2(V) .250 .253 .251 .222 .371 .296 .231 .218 .225 .238 .213 .225 (25,75) .054 .072 .039 .150 .165 .062 .047 .053 .020 .040 .056 .025 C(I2),C(V) .358 .384 .371 .391 .541 .466 .303 .314 .308 .277 .379 .328 (50,50) .082 .087 .066 .374 .347 .051 .051 .060 .029 .038 .106 .047 C(I2),C(V) .362 .405 .384 .300 .649 .475 .301 
.316 .309 .331 .333 .332 (25,75) .100 .097 .079 .354 .326 .026 .053 .065 .034 .088 .114 .040 112 Table 4 Continued Dist RS1 RS2 RS S1 S2 S RL1 RL2 RL L1 L2 L N(I2),C(I2) .197 .221 .209 .651 .045 .348 .335 .352 .344 .144 .499 .321 (50,50) .056 .044 .025 .170 .029 .072 .136 .141 .133 .134 .221 .100 N(I2),C(I2) .179 .243 .211 .633 .046 .339 .336 .392 .364 .131 .498 .314 (25,75) .056 .050 .025 .149 .022 .065 .149 .180 .157 .102 .231 .109 N(I2),t2(I2) .194 .215 .205 .356 .112 .234 .204 .215 .210 .170 .239 .204 (50,50) .052 .042 .019 .116 .038 .043 .062 .044 .032 .054 .049 .031 N(I2),t2(I2) .206 .239 .223 .410 .105 .257 .206 .235 .221 .177 .247 .212 (25,75) .051 .052 .029 .134 .035 .054 .061 .069 .045 .049 .059 .044 C(I2),t2(I2) .328 .276 .302 .131 .508 .320 .380 .354 .367 .472 .242 .357 (50,50) .065 .055 .036 .062 .145 .054 .143 .151 .142 .196 .150 .110 C(I2),t2(I2) .344 .276 .310 .199 .468 .334 .357 .318 .338 .498 .184 .341 (25,75) .080 .063 .042 .132 .193 .068 .144 .120 .115 .217 .096 .091 N(I2),C(V) .140 .183 .161 .620 .028 .324 .301 .327 .314 .111 .477 .294 (50,50) .043 .039 .017 .193 .018 .088 .148 .128 .129 .106 .214 .092 N(I2),C(V) .143 .223 .183 .648 .027 .338 .308 .382 .345 .113 .486 .300 (25,75) .065 .049 .029 .190 .021 .086 .151 .140 .135 .097 .215 .086 N(I2),t2(V) .144 .176 .160 .359 .065 .212 .178 .209 .193 .151 .228 .189 (50,50) .042 .033 .014 .125 .025 .052 .050 .032 .024 .046 .048 .021 N(I2),t2(V) .153 .191 .172 .420 .052 .236 .185 .215 .200 .167 .221 .194 (25,75) .052 .047 .024 .121 .024 .051 .047 .059 .035 .045 .049 .029 C(I2),t2(V) .288 .259 .273 .213 .344 .278 .358 .349 .354 .445 .277 .361 (50,50) .047 .053 .033 .086 .128 .040 .122 .135 .121 .198 .154 .120 C(I2),t2(V) .323 .269 .296 .258 .346 .302 .340 .325 .332 .480 .216 .348 (25,75) .067 .061 .049 .127 .158 .051 .131 .123 .117 .235 .127 .107 113 Table 4 Continued Dist RQ1 RQ2 RQ Q1 Q2 Q RM1 RM2 RM M1 M2 M N(I2),C(I2) .197 .218 .208 .027 .457 .242 .177 .179 .178 .109 .236 .172 (50,50) .044 .034 .020 .031 .113 .043 .054 .033 .020 .031 .055 .020 N(I2),C(I2) .195 .229 .212 .025 .468 .246 .201 .165 .183 .139 .232 .185 (25,75) .050 .043 .023 .024 .100 .040 .071 .038 .027 .056 .058 .030 N(I2),t2(I2) .200 .201 .200 .113 .285 .199 .190 .182 .186 .146 .221 .183 (50,50) .053 .029 .020 .037 .072 .025 .049 .030 .017 .030 .037 .014 N(I2),t2(I2) .204 .225 .215 .125 .310 .217 .207 .189 .198 .190 .212 .201 (25,75) .066 .041 .024 .053 .090 .030 .068 .042 .025 .065 .043 .029 C(I2),t2(I2) .333 .294 .313 .723 .093 .408 .272 .268 .270 .323 .230 .277 (50,50) .069 .057 .043 .135 .134 .040 .059 .051 .029 .094 .041 .039 C(I2),t2(I2) .318 .316 .317 .681 .110 .395 .279 .263 .271 .370 .205 .287 (25,75) .060 .080 .046 .169 .100 .058 .062 .044 .029 .112 .047 .043 N(I2),C(V) .161 .163 .162 .010 .414 .212 .154 .129 .142 .067 .215 .141 (50,50) .057 .030 .022 .011 .077 .034 .056 .030 .023 .026 .044 .018 N(I2),C(V) .144 .194 .169 .013 .418 .216 .151 .138 .144 .097 .202 .149 (25,75) .040 .042 .021 .017 .084 .036 .053 .040 .022 .053 .055 .019 N(I2),t2(V) .151 .161 .156 .061 .275 .168 .155 .137 .146 .111 .177 .144 (50,50) .034 .034 .015 .025 .072 .027 .045 .031 .016 .030 .033 .012 N(I2),t2(V) .140 .185 .163 .080 .256 .168 .166 .149 .157 .113 .177 .145 (25,75) .045 .041 .016 .040 .091 .030 .052 .046 .021 .043 .043 .013 C(I2),t2(V) .354 .308 .331 .654 .143 .398 .269 .240 .254 .301 .216 .259 (50,50) .080 .086 .069 .197 .185 .056 .040 .046 .019 .053 .053 .029 C(I2),t2(V) .333 .291 .312 .634 .165 .399 .290 .253 .272 .326 .206 .266 (25,75) .066 .061 .049 .236 .189 
.070 .058 .062 .029 .080 .056 .038 114 Table 5: Performance of Rank Based Methods Using Monte Carlo Simulation: Aver- age Misclassi cation Error Rates and Standard Errors for 4D Data on 50 Replications of 1000 Test Cases per Group Dist RS1 RS2 RS S1 S2 S RL1 RL2 RL L1 L2 L N(I4),N(I4) .186 .171 .179 .190 .169 .180 .177 .159 .168 .169 .163 .166 (50,50) .039 .039 .014 .048 .054 .017 .038 .034 .012 .031 .027 .011 N(I4),N(I4) .188 .185 .187 .243 .143 .193 .168 .169 .169 .177 .162 .170 (25,75) .039 .040 .014 .073 .042 .022 .032 .033 .010 .029 .027 .011 t2(I4),t2(I4) .317 .326 .321 .349 .302 .325 .270 .273 .271 .262 .280 .271 (50,50) .065 .054 .039 .151 .132 .046 .073 .081 .060 .065 .095 .060 t2(I4),t2(I4) .322 .329 .326 .408 .261 .335 .277 .286 .282 .314 .251 .283 (25,75) .067 .057 .039 .148 .126 .039 .075 .078 .060 .090 .080 .061 C(I4),C(I4) .420 .423 .422 .440 .396 .418 .418 .410 .414 .415 .425 .420 (50,50) .069 .077 .050 .150 .150 .045 .081 .112 .083 .158 .169 .080 C(I4),C(I4) .414 .441 .427 .509 .349 .429 .425 .457 .441 .560 .343 .452 (25,75) .073 .070 .048 .164 .163 .043 .109 .115 .092 .233 .213 .069 N(I4),N(W) .079 .077 .078 .100 .047 .073 .103 .118 .111 .177 .030 .104 (50,50) .020 .033 .009 .020 .023 .008 .021 .046 .016 .028 .025 .012 N(I4),N(W) .081 .080 .081 .115 .033 .074 .096 .119 .108 .184 .016 .100 (25,75) .025 .037 .010 .026 .016 .008 .019 .048 .018 .030 .011 .011 t2(I4),t2(W) .245 .246 .246 .402 .111 .257 .245 .252 .248 .263 .235 .249 (50,50) .057 .062 .037 .134 .073 .052 .072 .076 .060 .061 .100 .055 t2(I4),t2(W) .246 .271 .258 .463 .086 .274 .244 .268 .256 .296 .203 .250 (25,75) .043 .060 .032 .139 .055 .054 .085 .091 .075 .111 .067 .070 C(I4),C(W) .363 .408 .386 .547 .246 .396 .409 .435 .422 .382 .477 .430 (50,50) .065 .067 .042 .186 .137 .055 .118 .094 .088 .204 .172 .078 C(I4),C(W) .352 .413 .383 .573 .196 .385 .400 .443 .421 .492 .372 .432 (25,75) .074 .077 .049 .191 .138 .058 .129 .129 .117 .247 .183 .104 115 Table 5 Continued Dist RQ1 RQ2 RQ Q1 Q2 Q RM1 RM2 RM M1 M2 M N(I4),N(I4) .177 .176 .176 .180 .171 .175 .188 .192 .190 .177 .197 .187 (50,50) .038 .036 .015 .034 .033 .013 .043 .047 .017 .038 .040 .017 N(I4),N(I4) .183 .186 .185 .215 .161 .188 .215 .212 .213 .223 .172 .198 (25,75) .045 .048 .018 .045 .029 .017 .064 .054 .029 .070 .038 .026 t2(I4),t2(I4) .336 .324 .330 .359 .416 .387 .293 .304 .298 .308 .303 .306 (50,50) .071 .074 .051 .236 .218 .063 .050 .052 .028 .058 .069 .036 t2(I4),t2(I4) .322 .335 .329 .366 .393 .380 .317 .315 .316 .400 .268 .334 (25,75) .081 .067 .051 .203 .206 .063 .065 .063 .036 .093 .063 .042 C(I4),C(I4) .443 .444 .443 .499 .481 .490 .408 .403 .405 .420 .440 .430 (50,50) .082 .082 .056 .345 .347 .024 .067 .065 .043 .118 .126 .044 C(I4),C(I4) .452 .459 .456 .323 .654 .489 .415 .416 .415 .535 .340 .438 (25,75) .069 .081 .051 .261 .275 .023 .088 .079 .034 .123 .104 .046 N(I4),N(W) .078 .077 .077 .094 .051 .072 .082 .084 .083 .097 .056 .076 (50,50) .023 .037 .012 .016 .018 .007 .023 .042 .014 .020 .019 .009 N(I4),N(W) .074 .092 .083 .102 .044 .073 .096 .087 .091 .107 .052 .080 (25,75) .021 .038 .012 .020 .016 .007 .031 .040 .022 .024 .020 .011 t2(I4),t2(W) .234 .285 .259 .216 .379 .297 .214 .213 .214 .203 .236 .220 (50,50) .050 .058 .037 .144 .152 .059 .038 .064 .024 .038 .057 .026 t2(I4),t2(W) .258 .285 .272 .213 .386 .299 .226 .226 .226 .249 .190 .220 (25,75) .056 .068 .040 .075 .133 .041 .054 .069 .031 .043 .055 .028 C(I4),C(W) .389 .421 .405 .360 .581 .471 .326 .364 .345 .301 .407 .354 (50,50) .071 .085 .055 .347 .322 .033 .045 .057 .033 .072 
.101 .036 C(I4),C(W) .388 .440 .414 .213 .702 .458 .354 .346 .350 .393 .351 .372 (25,75) .064 .063 .031 .190 .183 .024 .071 .068 .029 .095 .095 .042 116 Table 5 Continued Dist RS1 RS2 RS S1 S2 S RL1 RL2 RL L1 L2 L N(I4),C(I4) .159 .225 .192 .899 .007 .453 .344 .402 .373 .151 .495 .323 (50,50) .045 .033 .023 .091 .007 .043 .120 .114 .110 .106 .164 .087 N(I4),C(I4) .159 .264 .212 .884 .009 .446 .314 .442 .378 .097 .554 .326 (25,75) .062 .049 .029 .138 .012 .063 .124 .126 .115 .082 .171 .079 N(I4),t2(I4) .177 .241 .209 .627 .046 .337 .215 .256 .236 .172 .282 .227 (50,50) .049 .039 .021 .177 .033 .075 .078 .068 .056 .060 .074 .053 N(I4),t2(I4) .195 .291 .243 .658 .040 .349 .201 .270 .235 .186 .259 .222 (25,75) .072 .052 .038 .159 .023 .069 .076 .068 .053 .057 .067 .047 C(I4),t2(I4) .384 .303 .343 .087 .705 .396 .393 .340 .367 .487 .214 .351 (50,50) .047 .057 .036 .076 .175 .064 .111 .103 .094 .170 .100 .083 C(I4),t2(I4) .402 .315 .359 .154 .665 .410 .431 .353 .392 .609 .161 .385 (25,75) .073 .065 .037 .137 .214 .060 .115 .115 .106 .178 .092 .078 N(I4),C(W) .108 .174 .141 .881 .002 .442 .314 .401 .357 .085 .509 .297 (50,50) .037 .034 .017 .137 .004 .067 .128 .137 .124 .066 .191 .079 N(I4),C(W) .107 .227 .167 .911 .002 .457 .341 .453 .397 .104 .526 .315 (25,75) .054 .051 .026 .108 .003 .053 .122 .152 .118 .087 .173 .078 N(I4),t2(W) .121 .150 .135 .529 .013 .271 .174 .218 .196 .164 .226 .195 (50,50) .041 .036 .018 .191 .013 .091 .054 .080 .058 .047 .103 .057 N(I4),t2(W) .123 .190 .156 .618 .009 .313 .179 .224 .201 .172 .209 .190 (25,75) .044 .043 .017 .163 .010 .078 .069 .073 .051 .055 .078 .048 C(I4),t2(W) .346 .291 .319 .121 .585 .353 .397 .381 .389 .485 .278 .381 (50,50) .069 .063 .043 .106 .171 .059 .113 .117 .107 .172 .120 .108 C(I4),t2(W) .335 .284 .310 .190 .476 .333 .378 .347 .362 .539 .188 .363 (25,75) .063 .070 .044 .105 .175 .057 .133 .115 .112 .194 .095 .104 117 Table 5 Continued Dist RQ1 RQ2 RQ Q1 Q2 Q RM1 RM2 RM M1 M2 M N(I4),C(I4) .165 .203 .184 .007 .396 .201 .142 .171 .156 .077 .233 .155 (50,50) .055 .039 .022 .009 .063 .028 .050 .038 .021 .036 .042 .020 N(I4),C(I4) .162 .233 .197 .012 .411 .211 .179 .170 .174 .141 .206 .174 (25,75) .066 .042 .028 .019 .073 .030 .064 .032 .027 .067 .052 .027 N(I4),t2(I4) .207 .217 .212 .090 .314 .202 .190 .192 .191 .158 .228 .193 (50,50) .047 .036 .024 .042 .081 .025 .058 .037 .025 .050 .054 .028 N(I4),t2(I4) .221 .244 .233 .130 .307 .218 .227 .209 .218 .230 .219 .224 (25,75) .070 .047 .029 .065 .074 .031 .073 .047 .039 .082 .054 .040 C(I4),t2(I4) .385 .325 .355 .665 .088 .376 .327 .303 .315 .406 .246 .326 (50,50) .046 .056 .038 .067 .032 .022 .065 .053 .021 .081 .050 .029 C(I4),t2(I4) .389 .303 .346 .664 .115 .389 .343 .308 .325 .432 .220 .326 (25,75) .073 .067 .041 .119 .125 .036 .081 .057 .034 .085 .047 .028 N(I4),C(W) .111 .154 .132 .003 .293 .148 .103 .101 .102 .035 .171 .103 (50,50) .040 .033 .017 .009 .058 .027 .035 .029 .014 .018 .033 .013 N(I4),C(W) .119 .170 .145 .004 .297 .151 .137 .114 .125 .094 .140 .117 (25,75) .053 .040 .022 .005 .041 .019 .057 .032 .022 .063 .039 .231 N(I4),t2(W) .131 .144 .137 .041 .241 .141 .111 .114 .112 .086 .139 .113 (50,50) .040 .036 .019 .021 .061 .023 .039 .033 .015 .028 .039 .014 N(I4),t2(W) .129 .186 .157 .070 .222 .146 .131 .126 .128 .140 .131 .136 (25,75) .053 .036 .019 .044 .081 .026 .052 .047 .028 .063 .043 .029 C(I4),t2(W) .370 .314 .342 .646 .114 .380 .243 .256 .250 .271 .233 .252 (50,50) .074 .066 .056 .103 .044 .037 .053 .057 .031 .070 .053 .032 C(I4),t2(W) .360 .308 .334 .643 .115 .379 .287 .232 
.259 .331 .182 .257 (25,75) .076 .084 .060 .162 .105 .054 .068 .042 .028 .069 .031 .030 118