High-Dimensional Classification Methods for Sparse Signals and Their Applications in Text and Data Mining
Type of Degree: dissertation
Mathematics and Statistics
Classification using high-dimensional features arises frequently in many contemporary statistical studies, such as tumor classification using microarray or other high-throughput data. In this dissertation we conduct a rigorous performance analysis, in both theory and simulation, of two linear methods for high-dimensional classification: the independence rule (naive Bayes) and the Fisher discriminant. For the normal population model with all parameters known, Fisher is optimal and naive Bayes is suboptimal. In this dissertation, however, we give conditions under which naive Bayes is also optimal, and we show through theory and simulation that naive Bayes performs better than Fisher under broader conditions.

We also study the associated feature selection methods. The two-sample t-test is a widely used feature selection method, but it depends heavily on the normality assumption, so we propose a generalized feature selection algorithm that works regardless of the distribution. The two-sample t-test, the Wilcoxon-Mann-Whitney statistic, and the two-sample proportion statistic all arise as special cases of our generalized procedure.

Singular value decomposition (SVD) is a popular dimension reduction method in text mining problems: researchers retain the first few singular vectors, which explain the largest variation. In this dissertation we argue that the first few singular vectors are not necessarily the most important ones for classification, and we give a new feature selection algorithm for the data matrix in high-dimensional text mining problems.
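To make the feature selection setting concrete, the following is a minimal sketch (not the dissertation's actual procedure) of two-sample t-test screening for a sparse signal: two normal classes differ only in the first five of 200 features, and features are ranked by the absolute t-statistic. The sample sizes, effect size, and cutoff `k` are illustrative choices, not values from the dissertation.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n, p, k = 50, 200, 10  # samples per class, features, features to keep

# Sparse signal: class means differ only in the first 5 features.
mu = np.zeros(p)
mu[:5] = 2.0
X0 = rng.normal(0.0, 1.0, size=(n, p))   # class 0
X1 = rng.normal(mu, 1.0, size=(n, p))    # class 1

# Rank features by the absolute two-sample t-statistic and keep the top k.
t_stats, _ = ttest_ind(X1, X0, axis=0)
selected = np.argsort(-np.abs(t_stats))[:k]
```

With a strong mean shift, the five true signal features should appear among the selected set; under non-normal data this ranking can degrade, which is the motivation given for a distribution-free generalization.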
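The claim that the leading singular vectors need not be the most discriminative can be illustrated with a toy construction (my own example, not from the dissertation): the highest-variance direction is class-independent noise, while a lower-variance direction carries the class signal, so the second right singular vector separates the classes better than the first.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
y = np.repeat([0, 1], n)  # two classes

# Feature 1: high-variance noise shared by both classes.
# Feature 2: low-variance but class-separating signal.
noise = rng.normal(0.0, 5.0, size=2 * n)
signal = np.where(y == 0, -1.0, 1.0) + rng.normal(0.0, 0.3, size=2 * n)
X = np.column_stack([noise, signal])

# Center and take the SVD; rows of Vt are the principal directions.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

def separation(direction):
    """Standardized between-class mean gap along a projection direction."""
    proj = Xc @ direction
    return abs(proj[y == 1].mean() - proj[y == 0].mean()) / proj.std()

# The top singular vector tracks the high-variance noise axis, so the
# second singular vector yields the larger class separation here.
```

Here `separation(Vt[1])` exceeds `separation(Vt[0])`, even though the first singular vector explains far more variation.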