Nonparametric Methods for Classification and Related Feature Selection
Procedures
by
Shuxin Yin
A dissertation submitted to the Graduate Faculty of
Auburn University
in partial fulfillment of the
requirements for the Degree of
Doctor of Philosophy
Auburn, Alabama
August 6, 2010
Keywords: Gene expression, Principal component, Misclassification, Rank-based
Copyright 2011 by Shuxin Yin
Approved by
Asheber Abebe, Chair, Associate Professor of Mathematics and Statistics
Peng Zeng, Associate Professor of Mathematics and Statistics
Ming Liao, Professor of Mathematics and Statistics
Abstract
One important application of gene expression microarray data is the classification of
samples into categories, such as tumor types. Gene selection procedures become
crucial since gene expression data from DNA microarrays are characterized by
thousands of genes measured on only a few subjects. Not all of these genes are
thought to determine a specific genetic trait. In this dissertation, I develop a novel
nonparametric procedure for selecting such genes. This rank-based forward selection
procedure rewards genes for their contribution towards determining the trait but
penalizes them for their similarity to genes that are already selected. I show that my
method gives lower misclassification error rates than dimension reduction methods
such as principal component analysis and partial least square analysis. I also explore
further properties of the Wilcoxon-Mann-Whitney (WMW) statistic and propose a
new WMW-based classifier that reduces the misclassification error rate. Real data
analyses and Monte Carlo simulations demonstrate the superiority of the proposed
methods to classical methods in several situations.
Acknowledgments
My greatest thanks go to Dr. Asheber Abebe for his patience and guidance
throughout this endeavor. Dr. Abebe devoted considerable time and energy to
furthering my studies in Statistics. I also wish to express my appreciation for my
parents and other family members, Man Peng and Shuyu Yin, who have given me
their love and constant support throughout my life. I would also like to thank
Dr. Peng Zeng and Dr. Ming Liao for serving on my committee and for their
contributions to this dissertation.
Table of Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Nonparametric Methods . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Dimension Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Variable Screening Procedures . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Dimension Reduction Methods . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Principal Component Analysis . . . . . . . . . . . . . . . . . . 9
2.2.2 Partial Least Square Analysis . . . . . . . . . . . . . . . . . . 10
2.3 Classification Methods . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.1 Non-Robust Classification . . . . . . . . . . . . . . . . . . . . 11
2.3.2 Rank Based Classification using Projections . . . . . . . . . . 13
2.3.3 Maximum Depth Classifiers . . . . . . . . . . . . . . . . . . . 17
3 Applications of Wilcoxon-Mann-Whitney Statistic (WMW) . . . . . . . . 19
3.1 Feature Annealed Independence Rules (Fan and Fan, 2008) . . . . . 19
3.2 WMW-Based Feature Annealed Classifier . . . . . . . . . . . . . . . . 21
3.2.1 Variable Selection Based on WMW Statistic . . . . . . . . . . 22
3.2.2 Projection-free WMW-based Classifier . . . . . . . . . . . . . 24
3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3.1 Lung Cancer . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3.2 Prostate Cancer . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.3 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 Smoothed Projection-Free WMW-Based Classifier . . . . . . . . . . . 29
3.4.1 Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.2 Monte Carlo Simulation . . . . . . . . . . . . . . . . . . . . . 31
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4 WMW Forward Variable Selection Method . . . . . . . . . . . . . . . . 35
4.1 Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3.1 Caribbean Food Data . . . . . . . . . . . . . . . . . . . . . . . 39
4.3.2 Colon Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3.3 Leukemia Data . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3.4 Monte Carlo Study . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . 47
List of Figures
3.1 Scree Plot of WFAC for Lung Cancer Data . . . . . . . . . . . . . . . . 26
3.2 Scree Plot of WFAC for Prostate Cancer Data . . . . . . . . . . . . . . . 28
3.3 Misclassification Error Rate . . . . . . . . . . . . . . . . . . . . . . . 30
3.4 Misclassification Error Rate (Smoothed with Equal Size) . . . . . . . . 33
3.5 Misclassification Error Rate (Smoothed with Unequal Size) . . . . . . . 34
4.1 Misclassification error rates (Black=QDA, Gray=MaxD, White=GGT) . . 46
List of Tables
3.1 Lung Cancer data. The minimum number of incorrectly classified samples
out of 149 testing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Prostate Cancer data. The minimum number of incorrectly classified
samples out of 102 samples using leave-one-out cross validation . . . . . . 27
4.1 Caribbean food data. Misclassification error rates using leave-one-out
cross validation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 Colon data. The number of incorrectly classified samples out of 62 samples
using leave-one-out cross validation. Screening for PCA and PLS uses
WMW and t statistics. The numbers in parentheses are the number of
genes kept after the screening. . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3 Leukemia training data. The number of incorrectly classified samples out
of 38 training samples using leave-one-out cross validation. Screening for
PCA and PLS uses WMW and t statistics. The numbers in parentheses
are the number of genes kept after the screening. . . . . . . . . . . . . . 42
4.4 Leukemia training and testing data. The number of incorrectly
classified samples out of 24 testing samples based on training samples.
Screening for PCA and PLS uses WMW and t statistics. The numbers in
parentheses are the number of genes kept after the screening. . . . . . . 43
Chapter 1
Introduction
1.1 Nonparametric Methods
Nonparametric methods were originally introduced in the mid-1930s. Rank
correlation without the normality assumption was discussed in Hotelling and Pabst
(1936), which is considered the true beginning of the topic of nonparametric
statistics. Friedman (1940) developed the Friedman test, a nonparametric statistical
test to detect differences in treatments across multiple test attempts in the complete
block design. Durbin (1951) proposed a nonparametric test for the incomplete block
design that reduces to the Friedman test in the case of a complete block design.
Benard and van Elteren (1953) generalized the Durbin test to the case in which
several observations are taken on some experimental units. During the same period,
Wilcoxon (1945) proposed the signed-rank statistic, named the Wilcoxon statistic,
to test the significance of the location difference between two samples. Later, Mann
and Whitney (1947) introduced the Mann-Whitney test statistic, which is equivalent
to the Wilcoxon rank test statistic. The Wilcoxon statistic played a central role in
many nonparametric approaches in the 1950s and 1960s.
Nonparametric methods have emerged as the preferred methodology in many
scientific areas due to their outstanding advantages:

• Nonparametric procedures require fewer assumptions than traditional methods,
so they can be used more widely than the corresponding parametric methods.
In particular, nonparametric procedures are applicable, and more efficient, in
many situations where the normality assumption is violated.
• Nonparametric methods are distribution-free: they do not rely on the
assumption that the data are drawn from a certain probability distribution,
so they can be used in many complicated situations where the distribution
theory is not attainable.

• Nonparametric methods are resistant to outliers. When the data contain
outliers or have longer tails than the normal distribution, some traditional
statistical procedures are inefficient, even though they perform well when the
errors in the model follow a normal distribution.
• Another advantage of nonparametric methods is simplicity. In some cases,
even when the use of parametric methods is justified, nonparametric methods
may be easier to use. Due both to this simplicity and to their greater
robustness, nonparametric methods are preferred by some statisticians.
In recent years, nonparametric analysis has gained popularity in the analysis of
linear models (Sievers and Abebe, 2004), nonlinear models (Abebe and McKean,
2007), classification (Nudurupati and Abebe, 2009; Montanari, 2004), and
generalized estimating equations (Abebe et al., 2011), because it leaves less room
for improper use and misinterpretation.
1.2 Classification
Some of the basic ideas and history of classification are discussed in the following.
Suppose we have two populations; the main goal of classification is to determine
the membership of a new observation based on a training data set. A discriminant
function is needed to provide the assignment criterion. It generally projects the
multidimensional real space onto the one-dimensional real line so that a clear cutoff
value of the discriminant function can be applied to determine which class the new
observation most probably belongs to. Fisher (1936) gave the linear discriminant
classifier. He found the optimal projection direction by maximizing the two-sample
t-statistic and allocated new observations based on the Euclidean distance between
the new observations and the population centers. If the covariance matrices of the
populations are not equal, the quadratic discriminant classifier is preferred. Bickel
and Levina (2004) proposed the independence classifier by setting the off-diagonal
entries of the common covariance matrix to zero.
However, the methods above are adversely inefficient in many circumstances
where the normality assumption is not appropriate, because of their sensitivity to
skewness and outliers. Their limitations call for robust rank-based classifiers, which
are closely related to the idea of transvariation probability given in Gini (1916).
Transvariation probability is originally defined for the univariate case and can be
extended to the multivariate case via a suitable projection pursuit. Montanari
(2004) proposed the transvariation-distance classifier, which allocates a new
observation according to its Euclidean distance to the population centers in the
projected space. He found the optimal projection direction by maximizing the
two-sample Wilcoxon-Mann-Whitney (WMW) statistic (Lehmann, 2006). He also
proposed the point-group classifier, which determines, based on the same projection
pursuit, which population the new observation most likely belongs to. The results
of the point-group classifier, however, can be biased when the two sample sizes are
very different. An improved allocation method was proposed by Nudurupati and
Abebe (2009): they place the new observation in each of the two samples separately
to smooth the data depth. One can also use depth functions to measure group
separation in order to classify a new observation, as shown in Liu et al. (1999). A
few popular depth functions are Mahalanobis depth (Mahalanobis, 1936; Liu and
Singh, 1993), halfspace depth (Tukey, 1974), simplicial depth (Liu, 1990), majority
depth (Singh, 1991), projection depth (Donoho, 1982), and spatial or L1 depth
(Vardi and Zhang, 2000).
1.3 Dimension Reduction
Gene expression data, however, are usually ultrahigh dimensional: the sample size
N is far smaller than the data dimension p, which makes some classifiers
inapplicable. High dimension can easily cause overflow in the calculation of the
inverse matrices required by some classifiers, such as those involving a projection
pursuit. Typically, the computational workload can increase dramatically by adding
even one more gene if a projection pursuit is needed. Besides, it is well known that
only a few genes carry the useful information that determines a specific genetic
trait, such as susceptibility to cancer, while most genes carry nothing but noise.
Taking all the genes into account in the classification process, instead of only the
most informative ones, does not provide better accuracy but results in widespread
inefficiency. Usually, a smaller set of genes is selected, based on the amount of
information they carry in terms of group separation, to be considered the most
important genes in the classification process. Basically, there are two ways to
reduce the dimension of the data:

• Select a subset of the original variables (genes) based on their power of class
determination;

• Create new variables by combining the information of all the variables (genes)
without losing much information from the original variables.
Many statisticians prefer to first select a smaller set of variables by a variable
screening method and then create optimal linear combinations of the selected
variables to carry out the classification, while others perform the classification
directly after the variable screening.
Dudoit et al. (2002) performed gene screening based on the ratio of between-group
to within-group sums of squares. Many statisticians (Fan and Fan, 2008; Nguyen
and Rocke, 2002; Ding and Gentleman, 2005) applied the two-sample t-statistic,
which measures the distance between two populations and can be used as a
criterion to preliminarily select the most important genes, while others (Liao et al.,
2007) selected variables based on the Wilcoxon-Mann-Whitney statistic, which is
also a good measure of group separation. Usually the variable screening method
using the WMW statistic is only slightly less efficient than the one using the
t-statistic when the underlying populations are normal, and it can be mildly or
wildly more efficient than its competitors when the underlying populations are not
normal.
Because some classifiers are sensitive to the dimension, the dimension needs to be
reduced further even after the initial variable screening. Based on the genes selected
by the variable screening procedures, dimension reduction methods such as principal
component analysis (PCA) and partial least square analysis (PLS) can be introduced
to reduce the dimension by performing a linear mapping of the data to a lower
dimensional space. In PCA (Massey, 1965; Jolliffe, 1986), a small set of orthogonal
linear combinations of the original predictor variables is found by maximizing the
variance of these linear combinations. In practice, the correlation matrix of the data
is constructed and its eigenvectors are computed. The eigenvectors that correspond
to the largest eigenvalues can then be used to reconstruct a large fraction of the
variance of the original data. The first few eigenvectors are selected, can often be
considered optimal linear combinations, and are used in classification. Thus the
original space is reduced to the lower dimensional space spanned by a few
eigenvectors without much loss of the information carried by the original variables.
In PCA, however, the correlation between the predictor variables and the response
variable specifying the class of the observations is not considered, which may be
inefficient; an efficient method must not treat the predictors separately from the
response. Nguyen and Rocke (2002) proposed an approach that obtains the optimal
combinations by maximizing the covariance between the linear combinations and
the response vector, which is referred to as partial least square analysis. PLS
surpasses PCA by taking the relation to the response variable into account. A
numerical algorithm to obtain the components is also included in that paper.
1.4 Organization
In Chapter 2, I discuss seven different classification methods as well as several
dimension reduction methods and gene screening procedures. In Chapter 3, I prove
that the WMW statistic can pick up all the important variables with probability
tending to 1, and I compare the performance of the WMW statistic for variable
screening and classification with the procedure given in Fan and Fan (2008) using
two real data sets and a large simulation study. In addition, a smoother is
recommended when the two sample sizes are very different. In Chapter 4, I propose
a new forward variable selection method and demonstrate its superiority using real
data analyses and simulation.
1.5 Notations
Here is some notation used throughout my dissertation:

• Consider two populations Π_X and Π_Y with underlying distributions F and G
with common support R^p.

• μ_x and μ_y are the mean vectors of Π_X and Π_Y.

• Σ_x and Σ_y are the covariance matrices of Π_X and Π_Y.

• X and Y are two samples from Π_X and Π_Y with sample sizes n_x and n_y,
respectively.

• X and Y are the predictor matrices of the two samples, and R is the response
vector (indicator of tumor versus normal tissue).
Chapter 2
Background
During the past decade and a half, classification and clustering methods have
gained popularity for cancer classification based on gene expression profiles
obtained via DNA microarray technology. But the dimension of microarray data is
usually ultrahigh, which makes some classifiers inapplicable. Moreover, we believe
that there is a subset of genes whose expression suffices for accurately predicting
the response. Dimension reduction allows us to replace a very large number of
predictors with a small number of linear combinations and to obtain better
predictions based on those optimal linear combinations. For the sake of simplicity,
before applying any dimension reduction approach, some variable screening methods
are applied to get rid of irrelevant variables whose variation is smaller than the
noise level.
2.1 Variable Screening Procedures
The two-sample t-statistic can be considered a measure of group separation since it
evaluates the difference in means between two groups. A larger t-statistic value
implies better group separation. Thus we first rank all the variables based on their
two-sample t-statistics. The two-sample t-statistic for variable k is defined as

T_k = \frac{\bar{x}_k - \bar{y}_k}{\sqrt{s_{x_k}^2/n_x + s_{y_k}^2/n_y}}, \qquad k = 1, \ldots, p \qquad (2.1)

where x̄_k and ȳ_k are the means of the two samples for variable k, and s²_{x_k}
and s²_{y_k} are the variances of the two samples for variable k. I select the
variables with the larger values of the t-statistic as the most informative ones.
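As an illustrative sketch (not code from the dissertation), the screening statistic in (2.1) can be computed for all p variables at once; the function name and the use of NumPy are my own choices:

```python
import numpy as np

def t_statistics(X, Y):
    """Two-sample t-statistic (2.1) for each of the p variables.

    X is the n_x-by-p matrix of the first sample and Y the n_y-by-p
    matrix of the second; returns a length-p array of T_k values.
    """
    nx, ny = X.shape[0], Y.shape[0]
    num = X.mean(axis=0) - Y.mean(axis=0)                        # mean difference per variable
    den = np.sqrt(X.var(axis=0, ddof=1) / nx + Y.var(axis=0, ddof=1) / ny)
    return num / den
```

Variables would then be ranked by these values and the top-ranked ones retained for classification.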
However, the t-statistic is sensitive to outliers and skewness, so nonparametric
alternatives are recommended for measuring the difference between two groups
when the normality assumption is violated. The most commonly used is the WMW
statistic, which measures the difference between two groups robustly.
Rank the variables based on the two-sample WMW statistic

W_k = 1 - \frac{2}{n_x n_y} \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \psi\{(x_{ki} - y_{kj})\, m_{xy}^k\} \qquad (2.2)

where m_{xy}^k is the median of {x_{ki} − y_{kj}} for k = 1, …, p. The function ψ is
defined as

\psi(x) = \begin{cases} 1 & \text{if } x < 0 \\ 0 & \text{if } x \geq 0 \end{cases} \qquad (2.3)

The most informative variables are the ones with the larger WMW statistic, which
indicates less overlap of the areas under the density curves of the two populations.
More details are discussed in Chapter 3.
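A minimal sketch of the statistic in (2.2) for a single variable, assuming the two groups are given as one-dimensional arrays (the function name is mine, not the dissertation's):

```python
import numpy as np

def wmw_statistic(x, y):
    """Median-centered WMW screening statistic (2.2) for one variable.

    x, y are 1-D arrays for the two groups. Returns W in [-1, 1];
    values near 1 indicate little overlap between the group densities.
    """
    d = np.subtract.outer(x, y)       # all pairwise differences x_i - y_j
    m = np.median(d)                  # m_xy, the median pairwise difference
    psi = (d * m < 0)                 # psi(t) = 1 if t < 0, else 0
    return 1.0 - 2.0 * psi.mean()
```

For two well-separated samples every pairwise difference shares the sign of the median difference, so the double sum vanishes and W equals 1.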
2.2 Dimension Reduction Methods
After the initial variable screening procedure, most noninformative variables are
eliminated and only a few important ones are left. But the dimension is still too
high for some classifiers, especially those requiring a projection pursuit. I therefore
apply dimension reduction approaches to reduce the dimension further. The
purpose of dimension reduction is to create a smaller set of linear combinations of
the original variables without losing too much of the information they carry. This is
achieved by optimizing a defined objective criterion. PCA and PLS are two
well-known dimension reduction methods.
2.2.1 Principal Component Analysis
In PCA (Massey, 1965; Jolliffe, 1986), orthogonal linear combinations are
constructed to sequentially maximize the variance of linear combinations of the
predictor variables:

c_k = \arg\max_{c'c = 1} \mathrm{Var}(Pc), \quad \text{where } P = X \cup Y

subject to the orthogonality constraint

c' S c_j = 0 \text{ for all } 1 \le j < k, \quad \text{where } S = P'P.
In fact, let n = n_x + n_y and let V be the n × p matrix formed by stacking X and
Y; that is,

V = \begin{pmatrix} x_1 \\ \vdots \\ x_{n_x} \\ y_1 \\ \vdots \\ y_{n_y} \end{pmatrix},

where x_i and y_j are the rows of X and Y, respectively. Let λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_r
be the eigenvalues of V'V and α_1, …, α_r be the corresponding eigenvectors, where
r is the rank of V'V. The ith principal component is then Vα_i, which is a linear
combination of the original columns of V. The first principal component accounts
for the direction of maximum variability in the data, and each succeeding
component accounts for an increasingly smaller amount of variability. Several
optimal linear combinations can be obtained by using the eigenvectors
corresponding to the largest eigenvalues.
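The eigendecomposition step above can be sketched as follows; this is an illustrative implementation of my own, assuming the columns of V have already been centered:

```python
import numpy as np

def principal_components(V, k):
    """First k principal components of the stacked data matrix V (n x p).

    Returns (scores, directions): the n-by-k component matrix V @ A and
    the p-by-k matrix A whose columns are eigenvectors of V'V ordered
    by decreasing eigenvalue.
    """
    # V'V is symmetric, so eigh returns real eigenvalues in ascending order
    evals, evecs = np.linalg.eigh(V.T @ V)
    order = np.argsort(evals)[::-1][:k]   # keep the k largest eigenvalues
    A = evecs[:, order]
    return V @ A, A
```

On data whose variability is concentrated along one coordinate, the leading direction essentially recovers that coordinate axis.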
PCA, however, only measures the variability in the predictor data V without any
consideration of the contribution of the variables towards the classification
problem. Ignoring the relation to the response variable may leave the components
selected by PCA short of information in terms of group separation and
consequently result in inaccurate classification. A dimension reduction approach
that considers the correlation between the predictors and the response variable is
needed.
2.2.2 Partial Least Square Analysis
Nguyen and Rocke (2002) proposed the method of PLS, which sequentially
maximizes the covariance between the response variable and a linear combination
of the predictor variables. In our case, the response variable R is made up of n_x
zeros and n_y ones: R = (0, …, 0, 1, …, 1)'. Thus, in place of the eigenvectors used
in PCA, PLS uses

\alpha_i = \arg\max_{\|\alpha\| = 1} \mathrm{cov}(V\alpha, R)

subject to the constraints (Vα)'(Vα_j) = 0 for j = 1, …, i − 1. This objective
criterion for dimension reduction may be more appropriate for prediction since the
relation between the predictors and the response variable is taken into account. A
basic algorithm implementing PLS is given in Nguyen and Rocke (2002). For
details, see also Helland (1988), Garthwaite (1994), Höskuldsson (1988), and
Martens and Naes (1989).
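One common way to extract such directions sequentially is NIPALS-style deflation; the sketch below is my own illustration under the assumption that V and R are centered, and is not the specific algorithm of Nguyen and Rocke (2002):

```python
import numpy as np

def pls_components(V, R, k):
    """Sequential PLS component extraction (NIPALS-style sketch).

    V : n x p centered predictor matrix; R : length-n centered response.
    Each direction w is proportional to V'R, which maximizes cov(Vw, R)
    over unit vectors; deflating V makes successive scores orthogonal.
    Returns the n x k matrix of component scores.
    """
    V = V.astype(float).copy()
    R = R.astype(float).copy()
    scores = []
    for _ in range(k):
        w = V.T @ R                       # direction of maximal covariance with R
        w /= np.linalg.norm(w)
        t = V @ w                         # component scores
        scores.append(t)
        p = V.T @ t / (t @ t)             # loadings
        V -= np.outer(t, p)               # deflate predictors
        R = R - t * (t @ R) / (t @ t)     # deflate response
    return np.column_stack(scores)
```

The deflation step guarantees the orthogonality constraint on successive score vectors.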
2.3 Classification Methods
In general, classification solves the problem of assigning a new observation
z ∈ Π_X ∪ Π_Y to either Π_X or Π_Y. We need to define a discriminant function
that projects the multidimensional data space onto the one-dimensional real line so
that we can decide where the new observation belongs:

Definition 2.1. A discriminant function D(z; F, G) : R^p → R is such that z is
classified in Π_X if D(z; F, G) > 0 and in Π_Y otherwise.

The discriminant function gives a combination of the predictor variables whose
values are as close as possible within populations and as far apart as possible
between populations. Several popular classifiers are discussed in the following.
2.3.1 Non-Robust Classification

Linear Discriminant Analysis (LDA)

Fisher (1936) looked at a linear combination of the p covariates that maximizes the
separation between the two populations Π_X and Π_Y. This gives rise to the linear
discriminant function

L(z; F, G) \equiv (\mu_x - \mu_y)' \Sigma^{-1} \left( z - \tfrac{1}{2}(\mu_x + \mu_y) \right),

where μ_x and μ_y are as defined in Section 1.5, and Σ is the pooled covariance
matrix of F and G. A new observation z is classified in Π_X if L(z; F, G) > 0 and
in Π_Y otherwise. Such classification is referred to as Linear Discriminant Analysis
(LDA).
Given samples X and Y from Π_X and Π_Y, respectively, the discriminant function
of LDA is estimated by L(z; F_{n_x}, G_{n_y}), where F_{n_x} is the empirical
distribution function of X, obtained by putting mass 1/n_x on each x sample point,
and G_{n_y} is the empirical distribution function of Y. Henceforth, it will be
assumed that estimates of discriminant functions are obtained by replacing the
distribution functions by the corresponding empirical distribution functions.
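A sample version of the LDA rule can be sketched as below (an illustration of my own, using the pooled covariance estimate; the function name is hypothetical):

```python
import numpy as np

def lda_discriminant(X, Y, z):
    """Estimated linear discriminant L(z; F, G).

    X, Y : (n_x, p) and (n_y, p) training samples; returns a positive
    value when z is allocated to the X-population.
    """
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    nx, ny = len(X), len(Y)
    # pooled covariance estimate of the common Sigma
    S = ((nx - 1) * np.cov(X, rowvar=False) +
         (ny - 1) * np.cov(Y, rowvar=False)) / (nx + ny - 2)
    w = np.linalg.solve(S, mx - my)
    return w @ (z - 0.5 * (mx + my))
```

A point near the X-sample mean yields a positive value and is classified in Π_X.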
Quadratic Discriminant Analysis (QDA)

In LDA, we assume the two populations share the same covariance matrix. If this
assumption does not hold, another commonly used classification method, Quadratic
Discriminant Analysis (QDA), can be applied; it is based on the classical
multivariate normal model for each class. Assume P_X = P_Y, that is, equiprobable
priors for the two populations. The quadratic discriminant function is defined as

Q(z; F, G) = Q(z; F) - Q(z; G),

where

Q(z; F) = -\tfrac{1}{2}(z - \mu_x)' \Sigma_x^{-1} (z - \mu_x) - \tfrac{1}{2}\log|\Sigma_x|,
Q(z; G) = -\tfrac{1}{2}(z - \mu_y)' \Sigma_y^{-1} (z - \mu_y) - \tfrac{1}{2}\log|\Sigma_y|.

Here Σ_x and Σ_y are as defined in Section 1.5. A new observation z is classified in
Π_X if Q(z; F, G) > 0 and in Π_Y otherwise.
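The sample version, plugging each group's own mean and covariance into Q, can be sketched as follows (an illustration of my own, not the dissertation's code):

```python
import numpy as np

def qda_discriminant(X, Y, z):
    """Estimated quadratic discriminant Q(z; F, G) with equal priors.

    Each group contributes its own Gaussian log-density kernel
    -(1/2)(z - mu)' Sigma^{-1} (z - mu) - (1/2) log|Sigma|.
    """
    def q(S):
        mu = S.mean(axis=0)
        Sigma = np.cov(S, rowvar=False)
        d = z - mu
        sign, logdet = np.linalg.slogdet(Sigma)   # stable log-determinant
        return -0.5 * d @ np.linalg.solve(Sigma, d) - 0.5 * logdet
    return q(X) - q(Y)
```

When the two group covariances happen to be equal, this rule agrees in sign with LDA on the same data.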
Independence Classifier

Both LDA and QDA require a projection pursuit, which is not an option for high
dimensional data. In particular, in order to evaluate variable selection methods, we
usually need to add more and more variables into the classification model to find
the optimal number of variables for prediction. The independence classifier can be
applied for this purpose:

I(z; F, G) \equiv (\mu_x - \mu_y)' D^{-1} \left( z - \tfrac{1}{2}(\mu_x + \mu_y) \right),

where μ_x and μ_y are as defined in Section 1.5, and D = diag(Σ). A new
observation z is classified in Π_X if I(z; F, G) > 0 and in Π_Y otherwise. As a
matter of fact, the linear discriminant function reduces to the independence
discriminant function when the off-diagonal entries of the common covariance
matrix are set to zero. This classifier is discussed in Bickel and Levina (2004), who
also show that it is superior to LDA when the number of variables is large.
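Because D is diagonal, the sample rule needs only elementwise division rather than a matrix inverse, which is the point of the classifier for large p. A sketch of my own:

```python
import numpy as np

def independence_discriminant(X, Y, z):
    """Estimated independence discriminant I(z; F, G): LDA with the
    pooled covariance replaced by its diagonal of pooled variances."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    nx, ny = len(X), len(Y)
    # diagonal of the pooled covariance: pooled per-variable variances
    d = ((nx - 1) * X.var(axis=0, ddof=1) +
         (ny - 1) * Y.var(axis=0, ddof=1)) / (nx + ny - 2)
    return ((mx - my) / d) @ (z - 0.5 * (mx + my))
```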
2.3.2 Rank Based Classification using Projections

LDA amounts to finding u ∈ R^p, say û_0, that maximizes the square of the
two-sample t-statistic between the two projected samples
u'X = {u'x_1, …, u'x_{n_x}} and u'Y = {u'y_1, …, u'y_{n_y}}; that is,

\hat{u}_0 = \arg\max_{\|u\|=1} \frac{[u'(\bar{x} - \bar{y})]^2}{u' S_p u \left( \frac{n_x + n_y}{n_x n_y} \right)},

where S_p is the pooled covariance matrix, and x̄ and ȳ are the means of the two
samples. The data are then reduced to one dimension by projecting them in the
direction given by û_0, and one classifies a new observation z into Π_X if
|z_0 − x̄_0| < |z_0 − ȳ_0|, where x_{0i} = û_0'x_i, y_{0j} = û_0'y_j, and
z_0 = û_0'z, for i = 1, …, n_x and j = 1, …, n_y. Otherwise, one classifies z into
Π_Y. Here x̄_0 = n_x^{-1} Σ_{i=1}^{n_x} x_{0i} and
ȳ_0 = n_y^{-1} Σ_{j=1}^{n_y} y_{0j}.
When the underlying distributions are spherically symmetric, the direction of
maximum separation is along the line that connects the centers of the
distributions, and LDA is equivalent to classifying z based on its Euclidean
distance from the means. In the case of the normal distribution, the projection
direction can be obtained easily as
u_0 = S_p^{-1/2}(\bar{x} - \bar{y}) / \|S_p^{-1/2}(\bar{x} - \bar{y})\|. In other
situations, the projection direction is not obvious and has to be determined
numerically. This search for "interesting" low dimensional projections of high
dimensional data is known as projection pursuit (Friedman and Tukey, 1974).
"Interestingness" is measured through a suitable function known as the projection
index. For LDA, this index is the two-sample t-statistic.

Montanari (2004) and Chen et al. (1994) used a two-sample WMW type statistic
as a projection index to measure group separation. They showed that their
projection pursuit method is not sensitive to deviations from the homoscedasticity
and normality assumptions. Their method is related to the idea of transvariation
probability given in Gini (1916):
Definition 2.2. For univariate distributions F and G, the transvariation probability
is defined as

\tau(F, G) = \int_{\mathbb{R}} \int_{\mathbb{R}} \psi\big( (x - y)(\theta(F) - \theta(G)) \big)\, dF(x)\, dG(y),

where θ(F) and θ(G) are the medians of F and G.

The transvariation probability measures the common area under the underlying
distribution curves of the two populations: the smaller the transvariation
probability, the better the group separation.
In practice, instead of using the theoretical underlying distributions of the two
populations, we can use the empirical distributions of X and Y to estimate τ:

\hat{\tau} = \frac{1}{n_x n_y} \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \psi\{(x_i - y_j)(m_x - m_y)\},

where m_x and m_y are the medians of the two samples.
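The empirical estimate τ̂ is simply the fraction of pairs whose difference has the opposite sign to the difference of sample medians; a one-line sketch of my own:

```python
import numpy as np

def transvariation_estimate(x, y):
    """Empirical transvariation probability tau-hat for two univariate
    samples x and y: fraction of pairs (x_i, y_j) with (x_i - y_j)
    opposite in sign to the median difference m_x - m_y."""
    d = np.subtract.outer(x, y)           # all pairwise differences
    sign = np.median(x) - np.median(y)    # m_x - m_y
    return float((d * sign < 0).mean())
```

Well-separated samples give τ̂ = 0, while heavily overlapping samples give values near 1/2.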
However, this transvariation probability is only defined on a univariate space, so for
multidimensional data we need to define a general transvariation probability. This
general transvariation probability can be defined on a vector u through a certain
projection pursuit. The direction of minimum overlap measured by the general
transvariation probability is given by

\hat{u}_1 = \arg\min_{\|u\|=1} \left\{ \sum \psi\{[u'x - u'y][m_x(u) - m_y(u)]\} \right\}, \qquad (2.4)

where the sum is over the set {x ∈ X, y ∈ Y}, and m_x(u) and m_y(u) are the
medians of the two projected samples u'X and u'Y. The function ψ is defined as
in (2.3).
Once the direction of maximum separation is found, the next step is to project
all the data (including the new sample point) onto that direction and allocate the
new point to one of the two populations. Three allocation schemes are discussed in
the following.
Transvariation-Distance (TD) Classifier

Montanari (2004) proposed classifying a new observation z in Π_X if

|\hat{u}_1' z - m_x(\hat{u}_1)| < |\hat{u}_1' z - m_y(\hat{u}_1)|

and in Π_Y otherwise, where m_x(û_1) and m_y(û_1) are the medians of the two
projected groups. Hereafter the classifier obtained by using this allocation method
will be referred to as the Transvariation-Distance (TD) classifier. As shown in
Nudurupati and Abebe (2009), this method can be adversely affected by skewness
and outliers.
Point-Group Transvariation (PGT) Classifier

Another allocation method suggested by Montanari (2004) is based on a
comparison of the ranking of the new observation z among X and among Y. This
utilizes the point-group transvariation. The observation z is classified in Π_X if
T(z; F, G) ≡ T(z; F) − T(z; G) > 0, where

T(z; F) = \frac{1}{n_x} \sum_{x \in X} \psi\{[\hat{u}_1' x - \hat{u}_1' z][m_x(\hat{u}_1) - \hat{u}_1' z]\}

and

T(z; G) = \frac{1}{n_y} \sum_{y \in Y} \psi\{[\hat{u}_1' y - \hat{u}_1' z][m_y(\hat{u}_1) - \hat{u}_1' z]\}.

This allocation scheme is robust against skewness and outliers. However, it does
not perform well for data with unequal sample sizes, because the vote of each
member of X is either 0 or 1/n_x whereas the vote of each member of Y is 0 or
1/n_y. This allocation scheme also has a problem of ties between T(·; F) and
T(·; G). The likelihood of ties is greatest in the case of equal sample sizes, which
happens to be the only situation where this scheme works efficiently. We will use
random tie breaking, where a coin is flipped to decide allocation in the case of a
tie. The classifier obtained by using this allocation scheme will be referred to as the
Point-Group Transvariation (PGT) classifier.
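Assuming the projection direction û_1 has already been found, the PGT allocation rule on the projected one-dimensional data can be sketched as follows (an illustration of my own; the function name is hypothetical):

```python
import numpy as np

def pgt_classify(x_proj, y_proj, z_proj, rng=None):
    """Point-group transvariation allocation on projected (1-D) data.

    x_proj, y_proj are the projected training samples u'X and u'Y, and
    z_proj the projected new point. Returns "X" when T(z;F) - T(z;G) > 0,
    with a coin flip on ties, as described in the text.
    """
    def T(sample, z):
        m = np.median(sample)
        # psi: count pairs with (sample_i - z)(m - z) < 0, averaged over the group
        return np.mean((sample - z) * (m - z) < 0)
    diff = T(x_proj, z_proj) - T(y_proj, z_proj)
    if diff == 0:                         # tie: flip a fair coin
        rng = np.random.default_rng() if rng is None else rng
        return "X" if rng.random() < 0.5 else "Y"
    return "X" if diff > 0 else "Y"
```

A projected point lying inside the X cloud picks up transvariation votes only from X and is therefore allocated to Π_X.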
Group-Group Transvariation (GGT) Classifier
An improved allocation method was proposed by Nudurupati and Abebe (2009) to
eliminate the problem caused by unequal sample sizes. Define two augmented samples
$\mathcal{X}^*$ and $\mathcal{Y}^*$ by including the new point $z$ in the two samples; that is, $\mathcal{X}^* = \mathcal{X} \cup \{z\}$ and
$\mathcal{Y}^* = \mathcal{Y} \cup \{z\}$. The point $z$ is then classified in $\Pi_X$ if $T^*(z;F,G) \equiv T_1^*(z;F,G) - T_2^*(z;F,G) > 0$, where
$$T_1^*(z;F,G) = \frac{1}{(1+n_x)n_y} \sum \ell\{[\hat{u}_1' x^* - \hat{u}_1' y][m_x(\hat{u}_1) - \hat{u}_1' z]\}$$
and
$$T_2^*(z;F,G) = \frac{1}{n_x(1+n_y)} \sum \ell\{[\hat{u}_1' x - \hat{u}_1' y^*][m_y(\hat{u}_1) - \hat{u}_1' z]\} .$$
The two sums are over the sets $\{x^* \in \mathcal{X}^*,\, y \in \mathcal{Y}\}$ and $\{x \in \mathcal{X},\, y^* \in \mathcal{Y}^*\}$, respectively.
The classifier obtained by using this allocation scheme will be referred to as the
Group-Group Transvariation (GGT) classifier. Note that we do not have the unequal
voting problem here: the vote of every observation is either 0 or approximately
$(n_x n_y)^{-1}$. In essence, here we are smoothing one sample using the empirical distribution
of the other before applying the PGT rule.
2.3.3 Maximum Depth Classifiers
In the univariate setting, statistical methods that use rank-based nonparametric
techniques do not depend on restrictive distributional assumptions and hence are
robust to deviations from these assumptions. For higher dimensions, statistical depth
functions give a multivariate version of ranks (Liu, 1992). Depth functions give a
measure of the "centrality" of a given multivariate sample point with respect to its
underlying distribution (Liu et al., 1999). In particular, a depth function assigns higher
values to points that are more central with respect to a data cloud. This naturally
gives a center-outward ranking of the sample points. A number of depth functions
are available in the literature. A few popular depth functions are Mahalanobis depth
(Mahalanobis, 1936; Liu and Singh, 1993), halfspace depth (Tukey, 1974), simplicial
depth (Liu, 1990), majority depth (Singh, 1991), projection depth (Donoho, 1982),
and spatial or $L_1$ depth (Vardi and Zhang, 2000).
In this paper, I use the maximum depth (MaxD) classification method (Ghosh
and Chaudhuri, 2005) based on the spatial ($L_1$) depth function defined as
$$S(x;F) = 1 - \left\| E_F\!\left[ \frac{x - X}{\|x - X\|} \right] \right\| , \qquad (2.5)$$
where $X \sim F$ and $\|\cdot\|$ is the Euclidean norm on $\mathbb{R}^p$.
The classifier MaxD uses the discriminant function
$$S^*(z;F,G) = S(z;F) - S(z;G)$$
and classifies $z$ in $\Pi_X$ if $S^*(z;F,G) > 0$. A major drawback of this classifier is
that it lacks affine invariance. Thus it is necessary to transform the data so that all
the variables are similarly scaled before using the spatial depth function. Vardi and
Zhang (2000) suggest making the spatial depth function affine invariant by taking
$\Sigma_x^{-1/2}(z - X)$ and $\Sigma_y^{-1/2}(z - Y)$ in place of $z - X$ and $z - Y$ before computing
$S(z;F)$ and $S(z;G)$, respectively, using equation (2.5). Note that one can use any
affine equivariant estimators of $\Sigma_x$ and $\Sigma_y$ when computing the discriminant function.
If the scatter estimator of Tyler (1987) is used, then the resulting maximum spatial
depth classifier resembles the classifier given by Crimin et al. (2007). An alternative
method of obtaining affine invariance is to scale the data along its principal component
directions (PCA-scaling) as given in Hugg et al. (2006).
An estimate of the MaxD discriminant function $S^*(z;F,G)$ is given by
$$S^*(z;F_{n_x},G_{n_y}) = \left\| \frac{1}{n_y} \sum_{j=1}^{n_y} \frac{z - y_j}{\|z - y_j\|} \right\| - \left\| \frac{1}{n_x} \sum_{i=1}^{n_x} \frac{z - x_i}{\|z - x_i\|} \right\| .$$
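The empirical rule follows directly from (2.5). A minimal Python version (the affine-invariance scaling discussed above is omitted; sample points coinciding with $z$ are skipped to avoid division by zero):

```python
import numpy as np

def spatial_depth(z, sample):
    """Spatial (L1) depth (2.5): one minus the norm of the average unit
    vector pointing from the sample points toward z."""
    diffs = z - sample                       # shape (n, p)
    norms = np.linalg.norm(diffs, axis=1)
    mask = norms > 0                         # skip points coinciding with z
    units = diffs[mask] / norms[mask, None]
    return 1.0 - np.linalg.norm(units.mean(axis=0))

def maxd_classify(z, X, Y):
    """MaxD rule: allocate z to the group in which it is deeper.
    Returns 0 for Pi_X and 1 for Pi_Y."""
    return 0 if spatial_depth(z, X) - spatial_depth(z, Y) > 0 else 1
```

A point at the geometric center of a symmetric cloud has depth close to 1 (the unit vectors cancel), while a point far outside has depth close to 0 (the unit vectors align).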
Chapter 3
Applications of Wilcoxon-Mann-Whitney Statistic (WMW)
3.1 Feature Annealed Independence Rules (Fan and Fan, 2008)
The two sample t-statistic can be applied as an initial variable screening index
because it measures each variable's contribution to group separation. Fan and Fan (2008)
proved that, theoretically, the t-statistic can pick up all the informative variables with
probability approaching 1.
They consider the $p$-dimensional classification problem between two populations
$\Pi_X$ and $\Pi_Y$. $\mathcal{X}$ and $\mathcal{Y}$ are two samples from $\Pi_X$ and $\Pi_Y$ with sample sizes $n_x$ and $n_y$
respectively. Write the $i$th observation in $\Pi_X$ as
$$x_i = \mu_x + \varepsilon_{xi}$$
and the $i$th observation in $\Pi_Y$ as
$$y_i = \mu_y + \varepsilon_{yi} ,$$
where $\varepsilon_{xi} = (\varepsilon_{xij})$ and $\varepsilon_{yi} = (\varepsilon_{yij})$ are iid with mean 0 and covariance matrices $\Sigma_x$ and
$\Sigma_y$ respectively. They assume that all the observations are independent across samples
and, in addition, that within each population the observations are identically distributed.
They also assume that the two classes have comparable sample sizes.
They first proved that, without variable selection, discrimination based on linear
projections onto almost all directions performs nearly the same as random guessing
under some assumptions. They then show that, using the t-statistic defined in Section 2.1,
all the informative variables can be selected, as stated in Theorem 3.1 below. In order to prove
their theorem, they need the following conditions.
Condition 1
- Assume that the vector $\alpha = \mu_x - \mu_y$ is sparse and, without loss of generality,
only the first $s$ entries are nonzero.
- Suppose that $\varepsilon_{xij}$ and $\varepsilon_{xij}^2 - 1$ satisfy Cramér's condition; that is, there exist
constants $\eta_1, \eta_2, M_1$, and $M_2$ such that $E|\varepsilon_{xij}|^m \le m!\, M_1^{m-2} \eta_1/2$ and
$E|\varepsilon_{xij}^2 - \sigma_{xj}^2|^m \le m!\, M_2^{m-2} \eta_2/2$ for all $m = 1, 2, \ldots$, where the $\sigma_{xj}^2$ are the
diagonal entries of $\Sigma_x$. The assumptions on $\varepsilon_{yij}$ and $\varepsilon_{yij}^2 - 1$ are the same as those on
$\varepsilon_{xij}$ and $\varepsilon_{xij}^2 - 1$, respectively.
- Assume that the diagonal elements of both $\Sigma_x$ and $\Sigma_y$ are bounded away from 0.
Under Condition 1, they have the following theorem.
Theorem 3.1. Let $s$ be a sequence such that $\log(p-s) = o(n^\gamma)$ and $\log s = o(n^{1/2-\gamma}\beta_n)$
for some $\beta_n \to \infty$ and $0 < \gamma < 1/3$, and suppose the nonzero entries of $\alpha$ do not decay too fast. Then for a threshold $t \asymp c\, n^{\gamma/2}$ with some constant $c > 0$,
$$P\left(\min_{j \le s} |T_j| \ge t \ \text{ and }\ \max_{j > s} |T_j| < t\right) \to 1 ,$$
where $n = n_x + n_y$.
Theorem 3.1 indicates that the two sample t-statistic can potentially pick all the
important variables as long as the rate of decay of the signal is not too fast and the
sample size is not too small.
In order to demonstrate the performance of the two sample t-statistic for variable
screening, they then apply the independence classifier mentioned in Section 2.3 and
calculate the misclassification error rate based on the most informative variables
selected by the t-statistic.
They assume both populations are Gaussian with a common identity covariance
matrix, that is, $\Sigma_x = \Sigma_y = I$. All the variables are ranked according to the
two sample t-statistic and the most informative $m$ variables are picked, assuming that
those $m$ variables are all the variables that matter for classification. The theoretical
misclassification error rate of this truncated independence classifier is given in the
following theorem.
Theorem 3.2. Consider $\Sigma_x = \Sigma_y = I$ and use the truncated independence classifier
$$\hat{\delta}_{m_n}(z) = (z^{m_n} - \hat{\mu}^{m_n})'(\hat{\mu}_1^{m_n} - \hat{\mu}_2^{m_n})$$
for a given sequence $m_n$. Suppose that $n \sum_{j=1}^{m_n} \alpha_j^2 / \sqrt{m_n} \to \infty$ as $m_n \to \infty$. Then the
classification error of $\hat{\delta}_{m_n}$ is
$$W = 1 - \Phi\!\left( \frac{(1+o_P(1)) \sum_{j=1}^{m_n} \alpha_j^2 + m_n(n_1 - n_2)/(n_1 n_2)}{2 \{ (1+o_P(1)) \sum_{j=1}^{m_n} \alpha_j^2 + n\, m_n/(n_1 n_2) \}^{1/2}} \right) ,$$
where $\Phi(\cdot)$ is the standard Gaussian distribution function. They call this truncated
classifier the feature annealed independence rule (FAIR). The precise value of $m$ can be
obtained by minimizing this theoretical misclassification error rate. In practice,
however, this minimization has no closed-form solution and can only be carried out
numerically. Fan and Fan (2008) used a simulation study and three real data analyses
to demonstrate their theoretical results and to show the superiority of their method over
the nearest shrunken centroid method (Tibshirani et al., 2002).
3.2 WMW-Based Feature Annealed Classifier
As defined in Section 2.1, the two sample Wilcoxon-Mann-Whitney statistic provides
more useful information than the two sample t-statistic in terms of group separation
in certain circumstances where the normality assumption does not hold.
Inspired by Fan and Fan (2008), I expect that the WMW statistic can also pick
up all the important variables. If so, then WMW could be used more widely than the
t-statistic, because most gene expression data present heavier tails than the normal
distribution (Salas-Gonzalez et al., 2009).
3.2.1 Variable Selection Based on WMW Statistic
Using a strategy similar to that of Fan and Fan (2008), I prove that, theoretically,
as the data dimension grows without bound, the Wilcoxon-Mann-Whitney statistic
can pick up all the informative variables with probability approaching 1. The result
is given in the following theorem.
Theorem 3.3. Assume that the vector $\alpha = \mu_x - \mu_y$ is sparse and, without loss of
generality, only the first $s$ entries are nonzero. Let $s$ be a sequence such that $\log(p-s) =
o(n^\gamma)$ and $\log s = o(n^{1/2-\gamma}\beta_n)$ for some $\beta_n \to \infty$ and $0 < \gamma < 1/3$. For $w \asymp c\, n^{\gamma/2}$ with
some constant $c > 0$, we have
$$P\left(\min_{j \le s} |W_j| \ge w \ \text{ and }\ \max_{j > s} |W_j| < w\right) \to 1 .$$
Proof. I divide the proof into two parts.
(a) Let us first look at the probability $P(\max_{j>s} |W_j| > w)$. Clearly,
$$P\left(\max_{j>s} |W_j| > w\right) \le \sum_{j=s+1}^{p} P(|W_j| \ge w) .$$
By Corollary 3.2 of Froda and van Eeden (2000), there exists a constant $a > 0$ such
that, for $M_0 < w < a\, n^{1/6}$ with $M_0 > 1$,
$$P(|W_j| \ge w) = (1 - \Phi(w))(1 + O(w^3/n^{1/2})) .$$
For the normal distribution, we have the tail probability inequality
$$1 - \Phi(w) \le \frac{1}{\sqrt{2\pi}} \frac{1}{w} e^{-w^2/2} .$$
Combining this with the symmetry of $W_j$, if we let $w \asymp c\, n^{\gamma/2}$, then we have
$$\sum_{j=s+1}^{p} P(|W_j| \ge w) \le (p-s)\, \frac{2}{\sqrt{2\pi}} \frac{1}{w} e^{-w^2/2} (1 + O(w^3/n^{1/2})) \to 0 ,$$
since $\log(p-s) = o(n^\gamma)$ with $0 < \gamma < 1/3$. Thus, we have
$$P\left(\max_{j>s} |W_j| > w\right) \to 0 .$$
(b) Next, we consider $P(\min_{j \le s} |W_j| \le w)$. Let $\theta_j = \alpha_j \sqrt{n_1 n_2 (n_1 + n_2 + 1)/12}$
and define $\widetilde{W}_j = W_j - \theta_j$.
Then, following the same lines as in (a), we have
$$\sum_{j \le s} P(|\widetilde{W}_j| \ge w) \le s\, \frac{2}{\sqrt{2\pi}} \frac{1}{w} e^{-w^2/2} (1 + O(w^3/n^{1/2})) \to 0 .$$
Let $\alpha_0 = \min_{j \le s} \theta_j$. Then it follows that
$$P\left(\min_{j \le s} |W_j| \le w\right) \le P\left(\max_{j \le s} |\widetilde{W}_j| \ge \min_{j \le s} |\theta_j| - w\right) \le P\left(\max_{j \le s} |\widetilde{W}_j| \ge \alpha_0 - w\right) .$$
If $w \asymp c\, n^{\gamma/2}$ and $\alpha_0 \ge n^{\gamma} \beta_n$ for some $\beta_n \to \infty$, then, similarly to part (a), we have
$$P\left(\min_{j \le s} |W_j| \le w\right) \to 0 .$$
Combining parts (a) and (b) completes the proof.
Compared with Fan and Fan (2008), Theorem 3.3 requires fewer assumptions. For
example, there are no assumptions on the random errors and no assumptions on the
covariance matrices of the two populations, which makes the WMW statistic more
efficient at identifying the minimal subset of variables that succinctly predicts the
categories of new observations.
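The screening step of Theorem 3.3 can be sketched as follows. This is a minimal illustration, assuming the standardized form $W_j = (U_j - n_x n_y/2)/\sqrt{n_x n_y (n_x + n_y + 1)/12}$ with $U_j$ the Mann-Whitney count for variable $j$; ties are handled crudely via strict inequality:

```python
import numpy as np

def wmw_screen(X, Y, w):
    """Keep variable j when |W_j| >= w, the threshold of Theorem 3.3.
    X is an n_x-by-p matrix, Y an n_y-by-p matrix."""
    nx, ny = len(X), len(Y)
    # Mann-Whitney count U_j = #{(i, k): x_ij > y_kj} for each variable j
    U = (X[:, None, :] > Y[None, :, :]).sum(axis=(0, 1))
    mean = nx * ny / 2.0
    sd = np.sqrt(nx * ny * (nx + ny + 1) / 12.0)
    W = (U - mean) / sd                  # standardized WMW statistics
    return np.flatnonzero(np.abs(W) >= w)
```

For gene expression data the broadcasted comparison builds an $n_x \times n_y \times p$ array, so for very large $p$ one would loop over variables instead.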
3.2.2 Projection-free WMW-based Classifier
Fan and Fan (2008) used the minimal misclassification error rate to explicitly evaluate
the performance of the two sample t-statistic. They named their procedure the Feature
Annealed Independence Rule (FAIR): allocate new observations using the independence
classifier based on the variables chosen by the t-statistic. However, the independence
classifier is strongly tied to the t-statistic, which is not appropriate when the normality
assumption does not hold. That is why I propose a new nonparametric classifier
based simply on the WMW statistic.
Even though the WMW statistic measures group separation very well, it cannot
be used as a classifier directly by itself. Here I borrow the idea of the Group-Group
Transvariation classifier (Nudurupati and Abebe, 2009) to translate the measure of
group separation into a discriminant function.
To classify a new observation $z$, define $\mathcal{X}^* = \mathcal{X} \cup \{z\}$ and $\mathcal{Y}^* = \mathcal{Y} \cup \{z\}$. Classify
$z \in \Pi_X$ if
$$\sum_{k=1}^{p} w_k(\mathcal{X}^*, \mathcal{Y}) > \sum_{k=1}^{p} w_k(\mathcal{X}, \mathcal{Y}^*) ,$$
where $w_k(\mathcal{X}^*, \mathcal{Y})$ is the WMW statistic of variable $k$ based on $\mathcal{X}^*$ and $\mathcal{Y}$, while
$w_k(\mathcal{X}, \mathcal{Y}^*)$ is the WMW statistic of variable $k$ based on $\mathcal{X}$ and $\mathcal{Y}^*$.
This rule classifies the new observation $z$ into $\Pi_X$ if a better group separation is
achieved by adding $z$ to $\Pi_X$ than by adding it to $\Pi_Y$.
The advantage of this classifier is that it is robust to deviations from the usual
assumptions. On the other hand, it is a projection-free classifier, so it can be used to
numerically find the minimal misclassification error rate by picking the proper number
of informative variables. The combination of WMW-statistic variable screening and the
WMW-based classifier results in the WMW-based feature annealed classifier (WFAC).
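A sketch of the augmented-sample rule. For illustration the per-variable separation measure $w_k$ is taken here to be the absolute standardized WMW statistic; the document's exact $w_k$ from Section 2.1 is not reproduced in this chapter:

```python
import numpy as np

def wmw_abs(x, y):
    """|standardized two-sample WMW statistic| for one variable, used as a
    stand-in for the per-variable separation measure w_k (an assumption)."""
    nx, ny = len(x), len(y)
    U = (x[:, None] > y[None, :]).sum()
    return abs((U - nx * ny / 2.0) / np.sqrt(nx * ny * (nx + ny + 1) / 12.0))

def wfac_classify(z, X, Y):
    """Projection-free WMW-based rule: add z to each sample in turn and keep
    the assignment giving the larger total separation over all variables.
    Returns 0 for Pi_X and 1 for Pi_Y."""
    p = X.shape[1]
    sep_x = sum(wmw_abs(np.append(X[:, k], z[k]), Y[:, k]) for k in range(p))
    sep_y = sum(wmw_abs(X[:, k], np.append(Y[:, k], z[k])) for k in range(p))
    return 0 if sep_x > sep_y else 1
```

Placing $z$ in the wrong group drags that group's ranks toward the other group and lowers the corresponding WMW separation, which is exactly what the comparison exploits.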
3.3 Results
Two real data analyses and a Monte Carlo simulation are provided to demonstrate
the superiority of WFAC over FAIR.
3.3.1 Lung Cancer
I first use the Lung Cancer data to compare the performance of WFAC and FAIR
in classifying between two types of lung cancer: malignant pleural mesothelioma
(MPM) and adenocarcinoma (ADCA). In total, the data set contains 181 samples,
31 from MPM and 150 from ADCA. The training set contains 32 samples (16 MPM
and 16 ADCA), while the testing set contains 149 samples (15 MPM and 134 ADCA).
Each sample is described by 12533 genes. These data are available at
http://www.chestsurg.org.
Fan and Fan (2008) set the classification rule using the independence classifier
with t-statistic variable screening based on the training data and predicted each sample
in the testing data to be MPM or ADCA by following this rule.
I, instead, set the classification rule using the WMW-based classifier with WMW-statistic
variable screening based on the training data and predict each sample in the
testing data to be MPM or ADCA by following the new classifier proposed above.
The numbers of incorrect classifications out of the 149 testing samples are given in
Table 3.1.
Table 3.1: Lung Cancer data. The minimum number of incorrectly classified samples
out of 149 testing samples
Method Test Error No. of Selected Genes
FAIR 11/149 26
WFAC 0/149 78
Figure 3.1: Scree Plot of WFAC for Lung Cancer Data (number of misclassified samples versus number of genes selected)
FAIR reaches its minimal misclassification error (11 samples) by selecting 26 genes,
while WFAC achieves zero misclassification error by picking 78 genes. In terms of
misclassification error rate, our method is superior to FAIR. However, WFAC seems
to need many more genes to reach this result, which could make our method inefficient.
For this reason, a scree plot is drawn to show how the misclassification error changes
as more and more genes are added.
Figure 3.1 shows that only 1 testing sample is misclassified based on the top 17
selected genes. As shown in Table 3.1, the minimum number of samples misclassified by
FAIR is 11, attained with 26 selected variables, which indicates that FAIR uses more
genes yet misclassifies more samples than WFAC.
Table 3.2: Prostate Cancer data. The minimum number of incorrectly classified
samples out of 102 samples using leave-one-out cross validation
Method Training Error No. of Selected Genes
FAIR 10/102 2
WFAC 5/102 4
3.3.2 Prostate Cancer
I then use the Prostate Cancer data, which is available at
http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi. The prostate cancer data
contains 102 patient samples, 52 of which are prostate tumor samples and 50 of which
are normal prostate samples. Each sample contains 12600 genes. The minimum number
of misclassified samples is calculated by FAIR and WFAC respectively using
leave-one-out cross validation.
The numbers of incorrect classifications out of the 102 samples are given in
Table 3.2.
The table shows that by selecting the top 2 genes FAIR gives 10 misclassified samples,
while by selecting the top 4 genes WFAC gives only 5 misclassified samples. It still seems
that WFAC sacrifices efficiency to obtain accuracy. For the same reason, a scree
plot is drawn to illustrate the efficiency of WFAC.
As shown in Figure 3.2, WFAC gives 7 misclassified samples when selecting only the
top 2 genes, which implies that WFAC obtains a lower misclassification error rate
using the same number of genes.
3.3.3 Simulation
I perform a large Monte Carlo simulation to study the optimality (in terms of
misclassification error) of FAIR and WFAC under a variety of distributional settings.
To that end, I generated two classes of data from normal, Cauchy, and
t with two degrees of freedom ($t_2$) distributions with dimension $p = 200$. I set
the center of one class at the origin $(0, 0, \ldots, 0)$ and the center of the second class
at $(1/4, 1/2, 3/4, 1, 5/4, 3/2, 0, 0, \ldots, 0)$. I considered the variance-covariance matrices
Figure 3.2: Scree Plot of WFAC for Prostate Cancer Data (number of misclassified samples versus number of genes selected)
$\Sigma_1 = I_{200}$ and
$$\Sigma_2 = \begin{pmatrix} 1 & 1/2 & 1/2 & \cdots & 1/2 \\ 1/2 & 1 & 1/2 & \cdots & 1/2 \\ 1/2 & 1/2 & 1 & \cdots & 1/2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1/2 & 1/2 & 1/2 & \cdots & 1 \end{pmatrix} .$$
In the simulation, training samples of sizes 20 and 30 were generated, as well as
testing samples of size 1000 for each group. FAIR and WFAC are used to set the
classification rule based on the training data, and the minimum misclassification error
rate is calculated as the proportion of misclassified testing samples in each group.
The Monte Carlo simulation generates 50 different training and testing data sets having
the same structure for each distributional setting, and the same procedure is applied to
each of those samples separately. Comparison boxplots of the misclassification error
rates are given in Figure 3.3.
It is clear from the plots that WFAC provides lower misclassification error rates
for the heavier tailed distributions (Cauchy, $t_2$). In particular, for Cauchy data, FAIR
leads to misclassification error rates consistently around 50%, which is nearly the same
as random guessing. This improves somewhat for the $t_2$ distribution, although WFAC is
still better than FAIR. As expected, for normal data FAIR performs better
than WFAC.
3.4 Smoothed Projection-Free WMW-Based Classifier
3.4.1 Description
As discussed above, WFAC provides an improvement over FAIR in dealing with
non-normally distributed data. As shown in (2.2), in the calculation of WFAC the sign
function $\ell$ is used to count the number of transvariated observations. To
simplify the following discussion, assume the median of the differences of the
observations from $\Pi_X$ and $\Pi_Y$ is positive. Then two observations $x$ and $y$ from
$\Pi_X$ and $\Pi_Y$ respectively are treated as a transvariation as long as $x < y$,
regardless of whether $y$ is barely greater than $x$ or much greater than $x$. A weight
associated with the magnitude of the difference is needed, and I assign such weights
by replacing $\ell$ with a $[0,1]$-valued, nondecreasing function that is continuously
differentiable on an interval $(-\delta, \delta)$ for some $\delta > 0$. Inspired by Abebe and Nudurupati
(2011), continuous cumulative distribution functions defined on $\mathbb{R}$ are applied.
Using smoothed WFAC, I classify $z$ to $\Pi_X$ if
$$\sum_{k=1}^{p} w_k(\mathcal{X}^*, \mathcal{Y}) > \sum_{k=1}^{p} w_k(\mathcal{X}, \mathcal{Y}^*) ,$$
where
$$w_k(\mathcal{X}^*, \mathcal{Y}) = 1 - \frac{2}{n_x n_y} \sum_{x^* \in \mathcal{X}^*} \sum_{y \in \mathcal{Y}} K_\alpha\{(x_{ki} - y_{kj})\, m_k\}$$
and
$$w_k(\mathcal{X}, \mathcal{Y}^*) = 1 - \frac{2}{n_x n_y} \sum_{x \in \mathcal{X}} \sum_{y^* \in \mathcal{Y}^*} K_\alpha\{(x_{ki} - y_{kj})\, m_k\} ,$$
with $m_k$ the median of the pairwise differences for variable $k$.
Figure 3.3: Misclassification Error Rate. Comparison boxplots of FAIR and WFAC under six settings: Cauchy (equal variance), $t_2$ (equal variance), Normal (equal variance), Cauchy (unequal variance), $t_2$ (unequal variance), and Normal (unequal variance)
Here $K_\alpha$ is a $[0,1]$-valued function that is nondecreasing on $\mathbb{R}$, where $\alpha$ is the
smoothing parameter. For example, one could take the cumulative distribution function of
$N(0, \alpha)$. Here I use the cumulative distribution function of the t-distribution with $\alpha$
degrees of freedom.
For data with training and testing samples, the training samples are used
to find the best smoother for each group. As in Abebe and
Nudurupati (2011), I use a bivariate grid $(\alpha_1, \alpha_2)$ and apply leave-one-out cross
validation to the training samples to find the misclassification error for each possible
pair of degrees of freedom. The combination with the least misclassification error is
then selected as the pair of smoothing constants.
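The smoothed per-variable statistic can be sketched as follows. This is a minimal illustration using `scipy.stats.t` for $K_\alpha$; the sign convention for $\ell$ (defined in (2.2)-(2.3), not reproduced here) is assumed to be $\ell(u) = 1$ for $u < 0$, so the CDF is applied to the negated products:

```python
import numpy as np
from scipy.stats import t as t_dist

def smoothed_wk(x, y, df):
    """Smoothed separation measure for one variable: the sign-counting
    function l is replaced by a t-distribution CDF with `df` degrees of
    freedom, so near-ties contribute fractional transvariation counts."""
    diffs = x[:, None] - y[None, :]     # pairwise differences x_ki - y_kj
    m = np.median(diffs)                # median difference m_k
    # K_df(-u) smoothly approximates the indicator of u < 0
    return 1.0 - 2.0 / diffs.size * t_dist.cdf(-diffs * m, df).sum()
```

Small `df` gives heavy-tailed, gentle smoothing; large `df` approaches a sharp cutoff, which is why a grid over degrees of freedom is cross-validated as described above.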
3.4.2 Monte Carlo Simulation
To demonstrate the benefit of the smoother, several common simulation settings
are used: the normal distribution, the t distribution (heavy tails), and the log-normal
distribution (skewness). I set the center of one class at the origin
$(0, 0, \ldots, 0)$ and the center of the second class at $(1/4, 1/2, 3/4, 1, 5/4, 3/2, 0, 0, \ldots, 0)$. I
considered the variance-covariance matrix $\Sigma_1 = I_{200}$.
I consider normal, $t_2$, and log-normal distributions to generate 200-dimensional
data. In the simulation, training samples of sizes 30 and 30, as well as testing samples
of size 100 for each group, were generated following the distributional settings
mentioned above. We use WFAC to set the classification rule based on the training
data and calculate the minimum misclassification error rate as the proportion of
misclassified testing samples in each group. I then calculate the minimum
misclassification error rate for the testing sample including the optimal smoother
determined from the training data. The Monte Carlo simulation generates 10 different
training and testing data sets having the same structure for each distributional setting,
and the same procedure is applied to each of those samples separately. Comparison
boxplots of the misclassification error rates are given in Figure 3.4.
The results show that with equal sample sizes the smoother does not improve the
performance of WFAC very much.
I then change the training sample sizes to 20 and 30, respectively, and repeat
the same simulation as described above to show how the smoother behaves in the
unequal sample size case. Comparison boxplots of the misclassification error
rates are given in Figure 3.5.
Figure 3.5 shows that the smoothed version works much better for the skewed
data (log-normal), while it is slightly more efficient for the heavy-tailed data ($t_2$).
3.5 Conclusion
Both real data analyses show the better efficiency and accuracy of WFAC
compared with FAIR. A large Monte Carlo simulation study further demonstrates the
clear advantage of WFAC for heavier tailed data and its slight disadvantage for normally
distributed data. Finally, the necessity of smoothed WFAC in circumstances
where the two sample sizes are unequal is discussed via a Monte Carlo simulation study.
Figure 3.4: Misclassification Error Rate (Smoothed, Equal Sample Sizes). Comparison boxplots of WFAC without and with the smoother under five settings: Normal-Normal, LogNormal-LogNormal, $t_2$-$t_2$, Normal-$t_2$, and $t_2$-LogNormal
Figure 3.5: Misclassification Error Rate (Smoothed, Unequal Sample Sizes). Comparison boxplots of WFAC without and with the smoother under five settings: Normal-Normal, LogNormal-LogNormal, $t_2$-$t_2$, Normal-$t_2$, and $t_2$-LogNormal
Chapter 4
WMW Forward Variable Selection Method
The two sample t-statistic and the WMW statistic can be used as variable screening
indices, and their properties are discussed in Chapter 3. But both variable selection
procedures fail to take the correlations among predictor variables into account. They
are not recommended when many of the variables depend on each other.
As shown in Section 2.2, two popular dimension reduction methods, PCA and PLS, can
also be used to reduce the dimension by finding a smaller set of uncorrelated linear
combinations of the original variables. However, the problem with both methods
is that they use linear combinations of all the predictor variables to
create new variables. Those created variables contain information from all the predictor
variables and are very hard to interpret, especially in biological studies.
That is why I propose a new WMW forward variable selection (WFS) procedure to
improve on these existing methods.
4.1 Description
Consider two populations $\Pi_X$ and $\Pi_Y$. $\mathcal{X}$ and $\mathcal{Y}$ are two samples from $\Pi_X$ and $\Pi_Y$
with sample sizes $n_x$ and $n_y$ respectively. $X = \{x_{ij}\}$ and $Y = \{y_{ij}\}$ are the predictor
matrices of the two samples. Write $V$ as a matrix of column vectors $V = [v_1\ v_2\ \cdots\ v_p]$,
where each $v_i$ is an $n \times 1$ vector with $n = n_x + n_y$. I would like to order the
variables $v_1, \ldots, v_p$ in decreasing order of the amount of information
they provide for class determination, say $v_{[1]}, \ldots, v_{[p]}$. The most informative
variable is the one that gives maximum separation between the two groups. For the
second most informative variable, it seems reasonable to pick the variable that is
most dissimilar to $v_{[1]}$ while at the same time giving the highest contribution to
distinguishing between the two groups.
The approach I propose uses WMW statistics to measure overlap and dissimilarity
in the sense of classification. For $k = 1, \ldots, p$, define
$$t_{kij} = x_{ki} - y_{kj} , \qquad i = 1, \ldots, n_x;\ j = 1, \ldots, n_y ,$$
and, as a measure of overlap between $\Pi_X$ and $\Pi_Y$, consider the two sample WMW
statistic based on $v_k$
$$w(v_k) = 1 - \frac{2}{n_x n_y} \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \ell(t_{kij}\, m_k) , \qquad (4.1)$$
where $m_k = \mathrm{median}\{t_{kij} : 1 \le i \le n_x,\ 1 \le j \le n_y\}$. It can be seen that $0 \le w(v_k) \le 1$.
Higher values of $w(v_k)$ indicate smaller overlap between $\{x_{k1}, \ldots, x_{kn_x}\}$ and
$\{y_{k1}, \ldots, y_{kn_y}\}$. The most informative variable is the one with the least overlap
and hence the highest $w(\cdot)$.
To measure the dissimilarity between two variables $v_r$ and $v_s$ I use
$$d(v_r, v_s) = \frac{1}{n_x n_y} \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \ell\{(t_{rij}\, m_r)(t_{sij}\, m_s)\} , \qquad (4.2)$$
where $r, s = 1, \ldots, p$. This quantity $d(v_r, v_s)$ resembles the measure of dissimilarity
studied by Sokal and Michener (1958) and Rand (1971) (see the discussion in Albatineh
et al., 2006) that is given by $(n_x n_y)^{-1} \sum\sum t_{rij} t_{sij}$. The measure $d(v_r, v_s)$ counts how
often observations $i$ and $j$ in $v_r$ and $v_s$ behave in opposite directions with respect
to their medians for $i = 1, \ldots, n_x;\ j = 1, \ldots, n_y$. It is clear that $0 \le d(v_r, v_s) \le 1$,
where large values of $d(v_r, v_s)$ indicate large dissimilarity between $v_s$ and $v_r$. It is
also easily observed that $d(v, v) = 0 = d(v, -v)$. Moreover, $d(v_r, v_s) = 1$ if $t_{rij} m_r < 0$
whenever $t_{sij} m_s > 0$, for all $i$ and $j$, and vice versa. In such cases, variables $v_r$ and $v_s$
are totally dissimilar in the sense that they provide opposing information about class
membership.
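The two measures (4.1) and (4.2) can be sketched directly. This assumes $\ell(u) = 1$ if $u < 0$ and $0$ otherwise (the definition in (2.3) is not reproduced in this chapter), which makes both quantities lie in $[0,1]$ as stated:

```python
import numpy as np

def overlap_w(x, y):
    """Overlap measure (4.1): one minus twice the fraction of pairwise
    differences t_kij that disagree in sign with their median m_k."""
    diffs = x[:, None] - y[None, :]     # t_kij
    m = np.median(diffs)                # m_k
    return 1.0 - 2.0 / diffs.size * np.count_nonzero(diffs * m < 0)

def dissim_d(xr, yr, xs, ys):
    """Dissimilarity (4.2): fraction of pairs (i, j) whose differences fall on
    opposite sides of their respective medians in the two variables."""
    tr = xr[:, None] - yr[None, :]
    ts = xs[:, None] - ys[None, :]
    prod = (tr * np.median(tr)) * (ts * np.median(ts))
    return np.count_nonzero(prod < 0) / prod.size
```

Under this convention a fully separated variable attains $w = 1$, and a variable compared with itself (or with its negation) attains $d = 0$, matching the properties noted above.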
4.2 Algorithm
An algorithm to implement WFS is as follows:
Algorithm 4.1.
Step 1: Let
$$v_{[1]} = \operatorname*{Argmax}_{k=1,\ldots,p} w(v_k) ,$$
where $w$ is defined in (4.1).
Step 2: Use (4.2) to compute $d(v_s, v_{[1]})$ for $s \ne [1]$. Let
$$v_{[2]} = \operatorname*{Argmax}_{k=1,\ldots,p;\ k \ne [1]} \{w(v_k)\, d(v_k, v_{[1]})\} .$$
Step 3: For $c = 2, \ldots, p$, find the direction of maximum separation $\hat{u}_1 \in \mathbb{R}^c$ given
in (2.4) using $v_{[1]}, \ldots, v_{[c]}$ and set
$$v_{[c+1]} = \operatorname*{Argmax}_{k \notin \{[1],\ldots,[c]\}} w(v_k)\, d(v_k, \hat{u}_1'[v_{[1]} \cdots v_{[c]}]) .$$
One may use stability of the misclassification error rate as a stopping criterion.
The downside of Algorithm 4.1 is that one needs to search higher and higher
dimensional spaces as the value of $c$ in Step 3 increases, which introduces a huge
computational burden. The following modification avoids high-dimensional projections
by combining the selected variables using the direction of maximum separation:
Algorithm 4.2.
Step 1: Set $c = 1$. Let
$$v_{[1]} = \operatorname*{Argmax}_{k=1,\ldots,p} w(v_k) ,$$
where $w$ is defined in (4.1).
Step 2: If $c \ge p$, then STOP. Otherwise use (4.2) to compute $d(v_s, v_{[1]})$ for
$s \in \{[c+1], \ldots, [p]\}$. Let
$$v_{[2]} = \operatorname*{Argmax}_{k \in \{[c+1],\ldots,[p]\}} \{w(v_k)\, d(v_k, v_{[1]})\} .$$
Step 3: Find the direction of maximum separation $\hat{u}_1 \in \mathbb{R}^2$ given in equation (2.4)
using $[v_{[1]}\ v_{[2]}]$ and set
$$v_{[1]} \leftarrow \hat{u}_1'[v_{[1]}\ v_{[2]}] , \qquad c \leftarrow c + 1 .$$
Go back to Step 2.
This algorithm is convenient for selecting variables in high dimensional data
since it only requires two-dimensional projections. This is especially useful when
performing classification based on gene expression data. Moreover, forward selection
allows one to start with fewer variables and proceed to higher dimensions if necessary.
As a stopping rule, one may use a predetermined dimension (Nguyen and Rocke, 2002)
or cross validation using the misclassification error rate.
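The greedy loop of Algorithm 4.2 can be sketched as follows. The overlap and dissimilarity measures (4.1)-(4.2) are re-implemented here so the sketch is self-contained (with $\ell(u) = 1$ for $u < 0$ assumed), and `max_sep_direction` is a simple stand-in for the direction of maximum separation in (2.4), which is not reproduced in this chapter:

```python
import numpy as np

def overlap_w(x, y):
    # WMW overlap (4.1), assuming l(u) = 1 if u < 0
    diffs = x[:, None] - y[None, :]
    m = np.median(diffs)
    return 1.0 - 2.0 / diffs.size * np.count_nonzero(diffs * m < 0)

def dissim_d(xr, yr, xs, ys):
    # dissimilarity (4.2) between two candidate variables
    tr, ts = xr[:, None] - yr[None, :], xs[:, None] - ys[None, :]
    prod = (tr * np.median(tr)) * (ts * np.median(ts))
    return np.count_nonzero(prod < 0) / prod.size

def max_sep_direction(Vx, Vy):
    # Stand-in for (2.4): normalized difference of coordinate-wise medians.
    u = np.median(Vx, axis=0) - np.median(Vy, axis=0)
    return u / np.linalg.norm(u)

def wfs(X, Y, n_select):
    """Sketch of Algorithm 4.2: greedy WMW forward selection using only
    two-dimensional projections.  Returns indices of selected variables."""
    p = X.shape[1]
    remaining = list(range(p))
    first = max(remaining, key=lambda k: overlap_w(X[:, k], Y[:, k]))
    selected = [first]
    remaining.remove(first)
    combo_x, combo_y = X[:, first].copy(), Y[:, first].copy()  # current v_[1]
    while remaining and len(selected) < n_select:
        # Step 2: reward separation, penalize similarity to the combined variable
        nxt = max(remaining, key=lambda k: overlap_w(X[:, k], Y[:, k])
                  * dissim_d(combo_x, combo_y, X[:, k], Y[:, k]))
        selected.append(nxt)
        remaining.remove(nxt)
        # Step 3: fold the new variable into v_[1] via a 2-D projection
        Vx = np.column_stack([combo_x, X[:, nxt]])
        Vy = np.column_stack([combo_y, Y[:, nxt]])
        u = max_sep_direction(Vx, Vy)
        combo_x, combo_y = Vx @ u, Vy @ u
    return selected
```

Because a variable that duplicates the current combined variable has $d = 0$, its reward $w \cdot d$ vanishes and the procedure moves on to genuinely new information, which is the behavior the chapter motivates.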
4.3 Results
One real data analysis demonstrates the advantage of WFS over both simple
t-statistic and WMW-statistic variable screening. Then two real data analyses and
a Monte Carlo simulation are used to compare the performance of PCA, PLS,
and WFS.
4.3.1 Caribbean Food Data
The Caribbean food data are used to compare the performance of the t-statistic,
the WMW statistic, and WFS. This data set contains information from the Food and Drug
Administration (FDA) and the U.S. Department of Agriculture (USDA) on the number
of rejections by country for certain Latin American and Caribbean (LAC) countries
for the years 1992 to 2003. This data set was investigated in Jolly et al. (2007) using
zero-inflated count data mixed models. The variables considered in the current study
are
- t = year (1992 - 2003)
- FDI = Foreign direct investment, net inflows (Balance of Payments (BoP), current US $)
- Fertcons = Fertilizer consumption (metric tons)
- USImp = U.S. Imports by Country (1985-03; Millions of Dollars)
- AgImp = Total Agricultural Import to the US (million $)
- GNI = Gross national income per capita, Atlas method (current US $)
- Y = Detention Status (Y=0 no detention; Y=1 detention)
Consider the variable Y as the response variable and the other six variables as predictor
variables. I first select the top 2 variables based on the t-statistic, the WMW statistic, and WFS.
Table 4.1: Caribbean food data. Misclassification error rates using leave-one-out cross
validation.
t Selection WMW Selection WFS Without Selection
LDA 0.4155 0.3873 0.3873 0.3803
MaxD 0.3873 0.3170 0.3169 0.3170
PGT 0.3944 0.3099 0.3028 0.3592
GGT 0.3803 0.3099 0.3028 0.3451
TD 0.3732 0.3310 0.3310 0.3310
The variables USImp and AgImp are selected by the t-statistic, while the WMW statistic
chooses FDI and USImp. WFS, instead, picks FDI and Fertcons. The
classifiers LDA, TD, MaxD, PGT, and GGT discussed in Section 2.3 are then applied to
calculate the proportion of misclassified samples using leave-one-out cross validation
based on the top 2 variables selected by the t-statistic, the WMW statistic, and WFS
respectively. I also calculate the corresponding misclassification error rates based on
all six predictor variables to show the necessity of a variable selection procedure.
The misclassification error rates are given in Table 4.1.
It is clear that using the t-statistic to select the variables gives worse classification
than using no variable selection at all, which indicates that the t-statistic fails to identify
the most informative variables. The results improve when PGT and GGT are applied
to determine class membership based on the important variables (FDI and USImp)
selected by the WMW statistic. Finally, WFS followed by PGT or GGT gives the minimal
misclassification error rate (0.3028). In the sense of classification, the variables FDI
and Fertcons chosen by WFS should be considered the most informative ones for
predicting food detention.
4.3.2 Colon Data
A two-way cluster study was conducted by Alon et al. (1999) using a data set composed
of 40 colon tumor samples and 22 normal colon tissue samples, analyzed with
an Affymetrix oligonucleotide array complementary to more than 6,500 human genes.
Table 4.2: Colon data. The number of incorrectly classified samples out of 62 samples
using leave-one-out cross validation. Screening for PCA and PLS uses the WMW and
t statistics. The numbers in parentheses are the numbers of genes kept after the
screening.
WMW Selection t Selection
PCA PLS PCA PLS PCA PLS PCA PLS WFS
(50) (50) (100) (100) (50) (50) (100) (100)
LDA 10 8 8 8 9 8 8 7 5
QDA 10 8 10 7 10 8 9 8 11
MaxD 9 12 10 11 10 12 12 11 10
PGT 8 8 8 8 8 8 8 8 5
GGT 9 9 9 9 9 10 9 9 5
TD 8 8 8 8 8 8 8 9 7
Alon et al. (1999) used a deterministic-annealing algorithm to
partition the data set into two clusters. One cluster consisted of 35 tumor and 3 normal
samples, while the other cluster contained 19 normal and 5 tumor samples.
Leave-one-out cross validation is used to determine the misclassification error
rates based on 4 genes selected by WFS and 4 gene components selected by PCA and
PLS. Prior to using PCA and PLS, the top 50 and 100 genes were selected based on
the values of the WMW and t statistics. The numbers of incorrect classifications out of
the 62 samples are given in Table 4.2.
The results show that WFS followed by PGT, LDA, or GGT gives the fewest
(5) misclassified samples of any combination of dimension reduction/selection and
classifier. MaxD results in between 9 and 12 misclassified samples; 12 misclassified
samples is the highest in the study. The fewest samples misclassified
by any classifier following PCA is 8. The minimum number of samples
misclassified following PLS is 7, achieved by LDA when 100 genes were selected by
the t-statistic and by QDA when 100 genes were selected by the WMW statistic.
Table 4.3: Leukemia training data. The number of incorrectly classified samples
out of 38 training samples using leave-one-out cross validation. Screening for PCA
and PLS uses the WMW and t statistics. The numbers in parentheses are the
numbers of genes kept after the screening.

              WMW Selection                       t Selection
      PCA(50) PLS(50) PCA(100) PLS(100)   PCA(50) PLS(50) PCA(100) PLS(100)   WFS
LDA       1      1       1        2          1       2       1        2        4
QDA       2      2       2        0          3       2       2        3        5
MaxD      4      3       3        3          4       4       4        4        1
PGT       1      2       1        1          2       3       2        3        0
GGT       1      2       1        1          2       4       2        4        0
TD        1      2       2        1          1       1       2        2        3
4.3.3 Leukemia Data
Golub et al. (1999) used a classification procedure to discover the distinction
between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL).
The original (training) data set consisted of 38 bone marrow samples (27 ALL
and 11 AML) obtained from acute leukemia patients at the time of diagnosis. The
independent (testing) data set consisted of 24 bone marrow samples as well as 10
peripheral blood specimens from adults and children (20 ALL and 14 AML).
I first apply leave-one-out cross validation to the training data as described
earlier. The numbers of incorrect classifications out of 38 samples were calculated
for WFS as well as for PCA and PLS, each based on four genes or gene components.
The results are shown in Table 4.3.
The results show that WFS followed by PGT or GGT gave no misclassified
samples. The same result is attained by QDA applied to PLS components after
screening 100 genes with the WMW statistic. The minimum number of samples
misclassified following PCA is 1.
I then calculated the number of misclassified testing samples using the variables
selected by WFS and the top four components obtained by PCA and PLS from the
training samples. The results are shown in Table 4.4.
Table 4.4: Leukemia training and testing data. The number of incorrectly
classified samples out of 24 testing samples based on the training samples.
Screening for PCA and PLS uses the WMW and t statistics. The numbers in
parentheses are the numbers of genes kept after the screening.

              WMW Selection                       t Selection
      PCA(50) PLS(50) PCA(100) PLS(100)   PCA(50) PLS(50) PCA(100) PLS(100)   WFS
LDA       1      3       1        1          1       3       2        3        5
QDA       2      2       3        2          2       2       1        2        3
MaxD      5      3       2        2          3       2       3        3        3
PGT       2      3       1        1          2       2       1        2        3
GGT       2      3       1        1          2       2       1        3        3
TD        2      1       1        1          2       2       2        3        3
We note that WFS gives inferior performance to PCA and PLS in this experiment
involving training and testing data.
The results of the analyses on real data demonstrate the need to characterize cases
where WFS performs better than PCA and PLS and vice versa. This is investigated
in the following section using simulated data.
4.3.4 Monte Carlo Study
I perform a Monte Carlo simulation to study the optimality (in terms of misclassification
error) of PCA, PLS, and WFS under a variety of distributional settings. Two
classes of data are generated from normal, Cauchy, and t with two degrees of freedom
(t2) distributions with dimension p = 200. The center of one class is set at the origin
(0, 0, ..., 0) and the center of the second class at (1/4, 1/2, 3/4, 1, 5/4, 3/2, 0, 0, ..., 0). I
then consider the variance-covariance matrices $\Sigma_1 = I_{200}$ and
\[
\Sigma_2 = \begin{pmatrix}
1 & 1/2 & 1/2 & \cdots & 1/2 \\
1/2 & 1 & 1/2 & \cdots & 1/2 \\
1/2 & 1/2 & 1 & \cdots & 1/2 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1/2 & 1/2 & 1/2 & \cdots & 1
\end{pmatrix},
\]
the equicorrelation matrix with unit variances and common correlation 1/2.
In the simulation, training samples of sizes 20 and 30 were generated. After an
initial screening down to 50 variables using the WMW and t statistics, the samples
were used to determine the PCA and PLS loadings and to set the classification rules
based on the top four components. Testing samples of size 1000 from each group were
then generated, and the loadings found from the training samples were applied. The
misclassification error rate is calculated based on the top four components by computing
the proportion of misclassified testing-sample observations in each group. For WFS,
I directly selected the top 4 variables without any screening; these same variables
were retained for the testing samples. The entire process is replicated 50 times.
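The simulation design above, two p = 200 classes whose centers differ in the first six coordinates, with an equicorrelation covariance and normal or heavy-tailed (Cauchy, t2) samples, can be sketched with standard-library tools. The shared-factor construction of the equicorrelation structure and the function name `sample_class` are my own choices for illustration, not the dissertation's code; `rho=0` reproduces the identity-covariance case.

```python
import math
import random

def sample_class(n, center, rho=0.5, df=None, rng=None):
    """Draw n points with unit variances and common correlation rho.
    The equicorrelation structure is induced by a shared factor:
    X_j = sqrt(rho)*Z0 + sqrt(1-rho)*Z_j.  With df=None the sample is
    multivariate normal; a finite df divides each point by an independent
    sqrt(chi2_df / df), giving elliptical t_df tails (df=1 is the
    Cauchy-type case, df=2 the t2 case of the simulation)."""
    rng = rng or random.Random(0)
    p = len(center)
    points = []
    for _ in range(n):
        z0 = rng.gauss(0, 1)  # factor shared by all coordinates
        x = [math.sqrt(rho) * z0 + math.sqrt(1 - rho) * rng.gauss(0, 1)
             for _ in range(p)]
        if df is not None:
            # divide by an independent chi-square factor -> heavy tails
            w = sum(rng.gauss(0, 1) ** 2 for _ in range(df)) / df
            x = [xi / math.sqrt(w) for xi in x]
        points.append([c + xi for c, xi in zip(center, x)])
    return points

p = 200
mu0 = [0.0] * p
mu1 = [0.25, 0.5, 0.75, 1.0, 1.25, 1.5] + [0.0] * (p - 6)  # shifted center

rng = random.Random(42)
train0 = sample_class(20, mu0, df=2, rng=rng)  # t2 class, correlated case
train1 = sample_class(20, mu1, df=2, rng=rng)
print(len(train0), len(train0[0]))  # → 20 200
```

Repeating such draws for training and testing sets, and over the six distribution/covariance combinations, yields the replicates summarized in Figure 4.1.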
For the sake of brevity, I only report the results for QDA, MaxD, and GGT. The
performance of LDA was similar to QDA, and that of PGT and TD was similar to
GGT. Comparison boxplots of the misclassification error rates are given in
Figure 4.1.
It is clear from the plots that WFS provides lower misclassification error rates for
the heavier-tailed distributions (Cauchy, t2). For Cauchy data, PCA and PLS lead
to misclassification error rates consistently around 50%. This is akin to flipping a
coin to decide group membership without regard to the information contained in the
variables. This is somewhat improved for the t2 distribution, even though WFS is still
the best among the methods considered. As expected, for normal data PCA and PLS
provide better performance than WFS. In the homoscedastic normal case, GGT is
the classifier with the lowest misclassification error rate, while for the heteroscedastic
normal case the best method is QDA. The latter is expected under normality.
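The coin-flip behavior of PCA and PLS on Cauchy data is consistent with these methods' reliance on sample moments, which never stabilize for the Cauchy distribution (its mean and variance do not exist), whereas ranks are unaffected by heavy tails. A quick standard-library illustration of this instability (my own demonstration, not part of the dissertation's simulation):

```python
import random

def sample_var(xs):
    """Unbiased sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

rng = random.Random(7)
# a ratio of independent standard normals is standard Cauchy
cauchy = [rng.gauss(0, 1) / rng.gauss(0, 1) for _ in range(10000)]
normal = [rng.gauss(0, 1) for _ in range(10000)]

# the normal sample variance settles near 1; the Cauchy one keeps jumping
for n in (100, 1000, 10000):
    print(n, round(sample_var(normal[:n]), 2), round(sample_var(cauchy[:n]), 2))
```

Any method whose loadings are built from such variance estimates inherits this instability, which is exactly what the Cauchy panels of Figure 4.1 show.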
4.4 Conclusion
Using two real data sets, it is shown that transvariation-based classifiers following
the rank-based forward variable selection procedure provide better class prediction
than LDA and QDA following dimension reduction using PCA and PLS. The forward
selection procedure also provides superior performance when the data come from
heavy-tailed distributions.
Because it starts with low dimensions, the use of forward selection makes intuitive
sense for variable selection in very high dimensional data. Even then, the original
formulation of the proposed forward selection procedure required projection pursuit
in high dimensional spaces. This becomes computationally very expensive, especially
for gene expression data, which are ultra-high dimensional: complicated methods of
mesh generation and a large number of points are required to effectively cover high
dimensional spaces. In this paper, an alternative algorithm that sequentially combines
the information in two variables using the most informative direction is given as a way
to reduce the computation. This modified algorithm only requires projections in two
dimensions, which can be done by picking evenly spaced points on the unit circle.
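The two-dimensional projection step just described can be sketched as follows: candidate directions are evenly spaced points on (half of) the unit circle, and a pair of variables is combined by projecting onto whichever direction maximizes a separation criterion. The `score` callback and the toy example below are placeholders for whatever separation measure (e.g., a WMW-type statistic between the projected classes) the procedure actually uses.

```python
import math

def circle_directions(m):
    """m evenly spaced unit vectors covering the half circle [0, pi);
    opposite directions give the same projection axis, so this suffices."""
    return [(math.cos(math.pi * k / m), math.sin(math.pi * k / m))
            for k in range(m)]

def best_direction(x, y, score, m=180):
    """Project the 2-D points (x_i, y_i) onto each candidate direction and
    return the direction maximizing the user-supplied separation `score`."""
    best, best_val = None, -math.inf
    for a, b in circle_directions(m):
        proj = [a * xi + b * yi for xi, yi in zip(x, y)]
        val = score(proj)
        if val > best_val:
            best, best_val = (a, b), val
    return best, best_val

# toy check: the first variable separates the groups, the second is noise,
# so the best direction should be (1, 0)
x = [0, 0, 0, 5, 5, 5]       # first three points: class A; last three: class B
y = [1, -1, 0, 1, -1, 0]
gap = lambda proj: abs(sum(proj[3:]) - sum(proj[:3])) / 3  # mean separation
direction, value = best_direction(x, y, gap)
print(direction, value)  # → (1.0, 0.0) 5.0
```

The combined variable (the projection onto the chosen direction) can then be paired with the next candidate variable, so the search never leaves two dimensions.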
Figure 4.1: Misclassification error rates (Black=QDA, Gray=MaxD, White=GGT).
[Six comparison boxplot panels: Cauchy, t2, and normal data, each under equal and
unequal variance; within each panel the methods compared are WFS, t-PCA, W-PCA,
t-PLS, and W-PLS.]
Chapter 5
Conclusion and Future Work
Gene expression data usually contain a large number of genes but a small
number of samples. It is well known that not all of these genes contribute to determining
a specific genetic trait. Feature selection for gene expression data aims at finding a
set of genes that best discriminate between biological samples of different types.
In this dissertation, inspired by the FAIR procedure of Fan and Fan (2008), a new
nonparametric classifier (WFAC) is proposed to classify new observations based on the
most informative variables selected by the Wilcoxon-Mann-Whitney statistic. Its
similarities to and differences from FAIR are discussed theoretically and via real data
analysis and a Monte Carlo simulation study. I also introduced a smoothed version of
WFAC to improve its performance when there is a large discrepancy between the two
sample sizes. I then developed a nonparametric forward selection procedure (WFS)
for selecting features to be used for classification. This rank-based forward selection
procedure rewards genes for their contribution towards determining the trait but
penalizes them for their similarity to genes that are already selected. Lower
misclassification error rates are achieved by WFS compared to dimension reduction
methods such as PCA and PLS.
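The reward/penalty idea can be illustrated with a greedy relevance-minus-redundancy criterion. The simplified score below is my own (closer in spirit to mRMR-style selection) and stands in for the actual rank-based WFS criterion; the toy relevance and similarity numbers are invented.

```python
def greedy_forward_select(relevance, similarity, k):
    """Greedy forward selection: repeatedly add the candidate gene whose
    relevance (reward for association with the trait) minus its average
    similarity to already-selected genes (redundancy penalty) is largest."""
    selected = []
    candidates = set(relevance)
    while candidates and len(selected) < k:
        def score(g):
            if not selected:
                return relevance[g]
            penalty = sum(similarity[g][h] for h in selected) / len(selected)
            return relevance[g] - penalty
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# g1 and g2 are both informative but nearly redundant; g3 is weaker but
# carries independent information, so it is picked second
relevance = {"g1": 0.90, "g2": 0.85, "g3": 0.40}
similarity = {
    "g1": {"g2": 0.95, "g3": 0.10},
    "g2": {"g1": 0.95, "g3": 0.10},
    "g3": {"g1": 0.10, "g2": 0.10},
}
print(greedy_forward_select(relevance, similarity, 2))  # → ['g1', 'g3']
```

In WFS the reward and penalty are computed from ranks rather than fixed numbers, but the greedy structure is the same.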
It is of interest to find a specific rule for determining the number of variables to
select with WFS. This requires a theoretical description of the misclassification
error rate, which could then be minimized with respect to the number of variables. So
far there is no clear stopping rule, and one may only use predetermined dimensions
or cross validation based on the misclassification error rate. This is currently being
studied by the author.
Bibliography
Abebe, A. and McKean, J. W. (2007) Highly efficient nonlinear regression based on
the Wilcoxon norm.
Abebe, A. and Nudurupati, S. V. (2011) Smooth nonparametric allocation of
classification, J. Stat. Comp. Simul., 40: 1-16.
Abebe, A., McKean, J. W., and Kloke, J. D. (2011) Iterated reweighted rank-based
estimates of GEE models, submitted.
Albatineh, A. N., Niewiadomska-Bugaj, M. and Mihalko, D. (2006) On similarity
indices and correction for chance agreement. J. Classif., 23: 301-313.
Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D. and Levine,
A. J. (1999) Broad patterns of gene expression revealed by clustering analysis of
tumor and normal colon tissues probed by oligonucleotide arrays, Proc. Natl. Acad.
Sci. USA, 96, 6745-6750.
Benard, A. and Elteren, P. V. (1953) A generalization of the method of m rankings,
Nederl. Akad. Wetensch. Proc. Ser. A (Indag. Math. 15), 56, 358-369.
Bickel, P. J. and Levina, E. (2004) Some theory for Fisher's linear discriminant
function, "naive Bayes," and some alternatives when there are many more variables
than observations. Bernoulli, 10: 989-1010.
Chen, Z. Y. and Muirhead, R. J. (1994) Comparison of robust linear discriminant
procedures using projection pursuit methods, in Multivariate Analysis and Its
Applications (Hong Kong, 1992), volume 24 of IMS Lecture Notes Monogr. Ser.,
pages 163-176. Inst. Math. Stat., Hayward, CA.
Crimin, K., McKean, J. W., and Sheather, S. J. (2007) Discriminant procedures based
on efficient robust discriminant coordinates, J. Nonparametr. Stat., 19(4-5): 199-213.
Ding, B. and Gentleman, R. (2005) Classification using generalized partial least
squares. J. Comput. Graph. Stat., 14(2): 280-298.
Donoho, D. (1982) Breakdown properties of multivariate location estimators, PhD
qualifying paper, Department of Statistics, Harvard University.
Dudoit, S., Fridlyand, J. and Speed, T. P. (2002) Classification efficiencies for robust
linear discriminant analysis, Stat. Sinica, 18(2): 581-599.
Durbin, J. (1951) Incomplete blocks in ranking experiments, Br. J. Psychol.
(Statistical Section), 4, 85-90.
Fan, J. and Fan, Y. (2008) High-dimensional classification using features annealed
independence rules. Ann. Stat., 36, 2605-2637.
Fan, J. and Lv, J. (2008) Sure independence screening for ultrahigh dimensional
feature space, J. R. Stat. Soc. Ser. B, 70(5): 849-911.
Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems, Ann.
Eugenics, 7(2): 179-188.
Friedman, M. (1940) A comparison of alternative tests of significance for the problem
of m rankings, Ann. Math. Stat., 11(1): 86-92.
Friedman, J. H. and Tukey, J. W. (1974) A projection pursuit algorithm for
exploratory data analysis, IEEE Trans. Comput., C-23(9): 881-890.
Froda, S. and Eeden, C. V. (2000) A uniform saddlepoint expansion for the null
distribution of the Wilcoxon-Mann-Whitney statistic, Canadian J. Stat., 28(1): 137-149.
Garthwaite, P. H. (1994) An interpretation of partial least squares, J. Am. Stat.
Assoc., 89, 122-127.
Ghosh, A. K. and Chaudhuri, P. (2005) On maximum depth and related classifiers.
Scand. J. Stat., 32(2): 327-350.
Gini, C. (1916) Il concetto di transvariazione e le sue prime applicazioni. Giornale
degli Economisti e Rivista di Statistica.
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P.,
Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and
Lander, E. S. (1999) Molecular classification of cancer: class discovery and class
prediction by gene expression monitoring, Science, 286, 531-537.
Helland, I. S. (1988) On the structure of partial least squares, Commun. Stat. Simul.
Comput., 17, 581-607.
Höskuldsson, A. (1988) PLS regression methods, J. Chemometrics, 2, 211-228.
Hotelling, H. and Pabst, M. R. (1936) Rank correlation and tests of significance
involving no assumption of normality. Ann. Math. Stat., 7: 29-43.
Hugg, J., Rafalin, E., Seyboth, K., and Souvaine, D. (2006) An experimental study of
old and new depth measures, Workshop on Algorithm Engineering and Experiments
(ALENEX06), Lecture Notes in Computer Science. New York: Springer-Verlag,
pp. 51-64.
Jolliffe, I. T. (1986) Principal Component Analysis. Springer, New York.
Jolly, C. M., Namugabo, E., and Abebe, A. (2007) Food safety issues between Latin
American and Caribbean countries. Paper presented at the 27th West Indies
Agricultural Economics Conference.
Lehmann, E. L. (2006) Nonparametrics: Statistical Methods Based on Ranks. Revised
first edition. Springer, New York.
Liao, C., Li, S. and Luo, Z. (2007) Gene selection using Wilcoxon rank sum test and
support vector machine for cancer classification. In Y. Wang, Y.-M. Cheung, and
H. Liu, editors, Computational Intelligence and Security, volume 4456 of Lecture
Notes in Computer Science, pages 57-66. Springer, Berlin/Heidelberg.
Liu, R. Y. (1990) On a notion of data depth based on random simplices, Ann. Stat.,
18(1): 405-414.
Liu, R. Y. (1992) Data depth and multivariate rank tests, in L1-Statistical Analysis
and Related Methods (Neuchâtel, 1992), 279-294, North-Holland, Amsterdam.
Liu, R. Y., Parelius, J. M., and Singh, K. (1999) Multivariate analysis by data depth:
descriptive statistics, graphics and inference (with discussion and a rejoinder by
Liu and Singh), Ann. Stat., 27(3): 783-858.
Liu, R. Y. and Singh, K. (1993) A quality index based on data depth and multivariate
rank tests, J. Amer. Stat. Assoc., 88(421): 252-260.
Mahalanobis, P. C. (1936) On the generalized distance in statistics, Proc. Natl. Inst.
Sci. India, 2, 49-55.
Mann, H. B. and Whitney, D. R. (1947) On a test of whether one of two random
variables is stochastically larger than the other. Ann. Math. Stat., 18(1): 50-60.
Martens, H. and Naes, T. (1989) Multivariate Calibration. Wiley, New York.
Massy, W. F. (1965) Principal components regression in exploratory statistical
research, J. Am. Stat. Assoc., 60, 234-246.
Montanari, A. (2004) Linear discriminant analysis and transvariation, J. Classif.,
21(1): 71-88.
Nguyen, D. V. and Rocke, D. M. (2002) Tumor classification by partial least squares
using microarray gene expression data. Bioinformatics, 18(1): 39-50.
Nudurupati, S. V. and Abebe, A. (2009) A nonparametric allocation scheme for
classification based on transvariation probabilities, J. Stat. Comp. Simul., 79(8):
977-987.
Rand, W. M. (1971) Objective criteria for the evaluation of clustering methods. J.
Amer. Stat. Assoc., 66(336): 846-850.
Salas-Gonzalez, D., Kuruoglu, E. E., and Ruiz, D. P. (2009) A heavy-tailed empirical
Bayes method for replicated microarray data, Comput. Stat. Data Anal., 53:
1535-1546.
Sievers, G. L. and Abebe, A. (2004) Rank estimation of regression coefficients using
iterated reweighted least squares, J. Stat. Comp. Simul., 74, 821-831.
Singh, K. (1991) A notion of majority depth, Technical Report, Department of
Statistics, Rutgers University.
Sokal, R. R. and Michener, C. D. (1958) A statistical method for evaluating
systematic relationships. University of Kansas Science Bulletin, 38: 1409-1438.
Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2002) Diagnosis of multiple
cancer types by shrunken centroids of gene expression, Proc. Natl. Acad. Sci. USA,
99, 6567-6572.
Tukey, J. (1974) Address to the International Congress of Mathematicians, Vancouver.
Tyler, D. E. (1987) A distribution-free M-estimator of multivariate scatter, Ann.
Stat., 15(1): 234-251.
Vardi, Y. and Zhang, C. H. (2000) The multivariate L1-median and associated data
depth, Proc. Natl. Acad. Sci. USA, 97, 1423-1426.
Wilcoxon, F. (1945) Individual comparisons by ranking methods. Biometrics Bulletin,
1, 80-83.