This Is Auburn: Electronic Theses and Dissertations

Semi-Supervised Classification Techniques in Big Data Text Analytics

Date

2013-04-22

Author

Li, Geng

Type of Degree

dissertation

Department

Mathematics and Statistics

Abstract

Imagine trying to read everything you received in 2011 as quickly as possible: doing so would take the first three months of 2012. How can we access this vast unstructured literature, process it automatically, and digest and make sense of it with less effort? Covering the full preprocessing and classification pipeline, the hybrid semi-supervised text classification approach proposed in this dissertation is intended to help readers stay afloat in a rising sea of information. The model integrates Porter stemming, a new adaptive TFIDF-LDA weighting, Zipf's law based dimension reduction, a multinomial naive Bayes classifier, and the expectation-maximization (EM) algorithm. Starting from a small set of “known” labeled papers, the mixture model can be used to assign new “unknown” unlabeled papers to predefined categories. Extensive experimental results show that the proposed system dramatically reduces the feature dimension and improves classification accuracy.
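To make the semi-supervised step concrete, the following is a minimal sketch of how a multinomial naive Bayes classifier can be combined with expectation-maximization so that unlabeled documents refine a model trained on a few labeled ones. It is an illustration under stated assumptions, not the dissertation's implementation: the function names (`m_step`, `e_step`, `em_naive_bayes`, `predict`), the toy documents, and the whitespace tokenizer are all hypothetical, and the Porter stemming, adaptive TFIDF-LDA weighting, and Zipf's law based dimension reduction named in the abstract are omitted (raw word counts are used instead).

```python
import math
from collections import Counter

def m_step(docs, resp, vocab, n_classes, alpha=1.0):
    """Estimate log priors and log word probabilities from (soft) labels."""
    priors, word_logp = [], []
    for c in range(n_classes):
        mass = sum(r[c] for r in resp)  # expected number of docs in class c
        priors.append(math.log((mass + 1.0) / (len(docs) + n_classes)))
        counts = {w: alpha for w in vocab}  # Laplace smoothing
        for doc, r in zip(docs, resp):
            for w, n in doc.items():
                counts[w] += r[c] * n
        total = sum(counts.values())
        word_logp.append({w: math.log(v / total) for w, v in counts.items()})
    return priors, word_logp

def e_step(doc, priors, word_logp):
    """Return the posterior class probabilities for one document."""
    scores = [p + sum(n * wl[w] for w, n in doc.items() if w in wl)
              for p, wl in zip(priors, word_logp)]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]

def em_naive_bayes(labeled, labels, unlabeled, n_classes, n_iter=10):
    """Fit on labeled texts, then refine with unlabeled texts via EM."""
    l_docs = [Counter(t.split()) for t in labeled]
    u_docs = [Counter(t.split()) for t in unlabeled]
    vocab = set().union(*l_docs, *u_docs)
    # Labeled documents keep fixed one-hot responsibilities throughout.
    one_hot = [[1.0 if c == y else 0.0 for c in range(n_classes)]
               for y in labels]
    priors, word_logp = m_step(l_docs, one_hot, vocab, n_classes)
    for _ in range(n_iter):
        soft = [e_step(d, priors, word_logp) for d in u_docs]   # E-step
        priors, word_logp = m_step(l_docs + u_docs, one_hot + soft,
                                   vocab, n_classes)            # M-step
    return priors, word_logp

def predict(text, priors, word_logp):
    """Return the index of the most probable class for a text."""
    post = e_step(Counter(text.split()), priors, word_logp)
    return max(range(len(post)), key=post.__getitem__)

# Toy data (hypothetical): class 0 = biology, class 1 = finance.
labeled = ["gene protein cell", "stock market price"]
labels = [0, 1]
unlabeled = ["gene dna genome", "protein dna genome",
             "stock trade bond", "market trade bond"]
model = em_naive_bayes(labeled, labels, unlabeled, n_classes=2)
print(predict("dna genome", *model))
print(predict("trade bond", *model))
```

In this toy setup the words "dna", "genome", "trade", and "bond" never appear in a labeled document, so a classifier trained on the two labeled texts alone cannot separate them; the EM iterations let their co-occurrence with labeled vocabulary in the unlabeled texts inform the class-conditional word probabilities.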