This Is Auburn | Electronic Theses and Dissertations


High-Dimensional Classification Methods for Sparse Signals and Their Applications in Text and Data Mining


Metadata Field | Value | Language
dc.contributor.advisor | Carpenter, Mark |
dc.contributor.author | Tadesse, Dawit |
dc.date.accessioned | 2014-07-22T20:24:49Z |
dc.date.available | 2014-07-22T20:24:49Z |
dc.date.issued | 2014-07-22 |
dc.identifier.uri | http://hdl.handle.net/10415/4276 |
dc.description.abstract | Classification using high-dimensional features arises frequently in many contemporary statistical studies, such as tumor classification using microarray or other high-throughput data. In this dissertation we conduct a rigorous performance analysis, in both theory and simulation, of two linear methods for high-dimensional classification: the Independence Rule (or Naive Bayes) and the Fisher discriminant. For the normal population model, when all the parameters are known, Fisher is optimal and Naive Bayes is suboptimal. However, we give the conditions under which Naive Bayes is optimal. Through theory and simulation, we further show that Naive Bayes performs better than Fisher under broader conditions. We also study the associated feature selection methods. The two-sample t-test is a widely used feature selection method, but it depends heavily on the normality assumption, so we propose a generalized feature selection algorithm that works regardless of the distribution. The two-sample t-test, the Wilcoxon-Mann-Whitney statistic, and the two-sample proportion statistic are special cases of our generalized feature selection. Singular Value Decomposition (SVD) is a popular dimension reduction method in text mining problems. Researchers typically take the first few SVD components, which explain the largest variation. However, in this dissertation we argue that the first few SVD components are not necessarily the most important ones for classification. We then give a new feature selection algorithm for the data matrix in high-dimensional text mining problems. | en_US
dc.rights | EMBARGO_NOT_AUBURN | en_US
dc.subject | Mathematics and Statistics | en_US
dc.title | High-Dimensional Classification Methods for Sparse Signals and Their Applications in Text and Data Mining | en_US
dc.type | dissertation | en_US
dc.embargo.length | MONTHS_WITHHELD:12 | en_US
dc.embargo.status | EMBARGOED | en_US
dc.embargo.enddate | 2015-07-22 | en_US
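
The abstract names two standard building blocks: two-sample t-test feature screening and the Independence Rule (a linear discriminant with a diagonal covariance matrix). The sketch below illustrates how these two pieces are commonly combined on simulated sparse-signal data. It is not the dissertation's generalized feature selection algorithm, and it does not reproduce any of its results. The function names (simulate, t_statistics, select_features, fit_independence_rule, predict) and all simulation settings are assumptions chosen for illustration.

```python
import math
import random
from statistics import mean, variance


def simulate(n, p, s, shift, rng):
    """Draw n samples per class with p Gaussian features (hypothetical setup).

    Only the first s features carry signal: class 1 is shifted by `shift`.
    """
    class0 = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    class1 = [[rng.gauss(shift if j < s else 0.0, 1.0) for j in range(p)]
              for _ in range(n)]
    return class0, class1


def t_statistics(class0, class1):
    """Welch two-sample t-statistic for each feature (column)."""
    n0, n1 = len(class0), len(class1)
    stats = []
    for j in range(len(class0[0])):
        a = [row[j] for row in class0]
        b = [row[j] for row in class1]
        se = math.sqrt(variance(a) / n0 + variance(b) / n1)
        stats.append((mean(a) - mean(b)) / se)
    return stats


def select_features(class0, class1, k):
    """Indices of the k features with the largest absolute t-statistic."""
    t = t_statistics(class0, class1)
    return sorted(range(len(t)), key=lambda j: abs(t[j]), reverse=True)[:k]


def fit_independence_rule(class0, class1, features):
    """Per-feature class means and pooled variance (diagonal covariance)."""
    n0, n1 = len(class0), len(class1)
    params = []
    for j in features:
        a = [row[j] for row in class0]
        b = [row[j] for row in class1]
        pooled = ((n0 - 1) * variance(a) + (n1 - 1) * variance(b)) / (n0 + n1 - 2)
        params.append((j, mean(a), mean(b), pooled))
    return params


def predict(params, x):
    """Classify x as 0 or 1 with the diagonal linear discriminant score.

    Assumes equal class priors; a positive score favors class 0.
    """
    score = 0.0
    for j, m0, m1, var in params:
        score += (m0 - m1) * (x[j] - (m0 + m1) / 2.0) / var
    return 0 if score > 0 else 1


# Small demonstration: 200 features, of which only the first 5 carry signal.
rng = random.Random(0)
train0, train1 = simulate(40, 200, 5, 3.0, rng)
chosen = select_features(train0, train1, 5)
model = fit_independence_rule(train0, train1, chosen)

test0, test1 = simulate(100, 200, 5, 3.0, rng)
correct = sum(predict(model, x) == 0 for x in test0)
correct += sum(predict(model, x) == 1 for x in test1)
print("selected features:", sorted(chosen))
print("test accuracy:", correct / 200)
```

Because the rule uses only per-feature variances, it never has to invert a full covariance matrix, which is why it remains usable when the number of features far exceeds the sample size. A Fisher discriminant on the same data would need an estimate of the full covariance matrix, which is singular in that regime unless it is regularized.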
