Semi-Supervised Classification Techniques in Big Data Text Analytics
Date
2013-04-22
Type of Degree
Dissertation
Department
Mathematics and Statistics
Abstract
Imagine trying to read everything you received in 2011 as quickly as possible. That alone would take the first three months of 2012. How can we access the vast unstructured literature, process it automatically, and make sense of it with less effort? This dissertation proposes a hybrid semi-supervised text classification approach that covers the full preprocessing and classification pipeline and is intended to help readers cope with a rising sea of information. The hybrid model integrates Porter stemming, a new adaptive TFIDF-LDA weighting, Zipf's-law-based dimension reduction, a multinomial Naive Bayes classifier, and the expectation-maximization (EM) algorithm. Starting from a small set of “known” labeled papers, the mixture model is used to classify new, “unknown” unlabeled papers into predefined categories. Extensive experimental results show that the proposed system dramatically reduces the feature dimension and improves classification accuracy.
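To make the classification step concrete, the following is a minimal sketch of how a multinomial Naive Bayes classifier can be combined with EM to learn from a few labeled documents and many unlabeled ones. It is a generic textbook-style illustration, not the dissertation's implementation: it omits Porter stemming, the adaptive TFIDF-LDA weighting, and the Zipf's-law dimension reduction, and every function name, parameter (such as `alpha` and `n_iter`), and the toy documents are assumptions made up for this example.

```python
import math
from collections import Counter


def train_nb(docs, weights, n_classes, vocab, alpha=1.0):
    """M-step: fit multinomial NB from (possibly fractional) class weights."""
    class_mass = [alpha] * n_classes            # Laplace-smoothed class mass
    word_counts = [Counter() for _ in range(n_classes)]
    for tokens, w in zip(docs, weights):
        counts = Counter(tokens)
        for c in range(n_classes):
            class_mass[c] += w[c]
            for term, n in counts.items():
                word_counts[c][term] += w[c] * n
    total = sum(class_mass)
    log_prior = [math.log(m / total) for m in class_mass]
    log_like = []
    for c in range(n_classes):
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like.append(
            {t: math.log((word_counts[c][t] + alpha) / denom) for t in vocab}
        )
    return log_prior, log_like


def posterior(tokens, log_prior, log_like):
    """E-step for one document: class posteriors, computed in log space."""
    scores = [
        log_prior[c] + sum(log_like[c][t] for t in tokens if t in log_like[c])
        for c in range(len(log_prior))
    ]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def em_naive_bayes(labeled, labels, unlabeled, n_classes, n_iter=10):
    """Initialise NB on labeled docs, then alternate E- and M-steps."""
    vocab = sorted({t for doc in labeled + unlabeled for t in doc})
    hard = [[1.0 if c == y else 0.0 for c in range(n_classes)] for y in labels]
    log_prior, log_like = train_nb(labeled, hard, n_classes, vocab)
    for _ in range(n_iter):
        # E-step: soft labels for the unlabeled documents.
        soft = [posterior(doc, log_prior, log_like) for doc in unlabeled]
        # M-step: refit on labeled (fixed labels) plus unlabeled (soft labels).
        log_prior, log_like = train_nb(
            labeled + unlabeled, hard + soft, n_classes, vocab
        )
    return log_prior, log_like


def predict(tokens, log_prior, log_like):
    """Return the index of the most probable class."""
    p = posterior(tokens, log_prior, log_like)
    return max(range(len(p)), key=p.__getitem__)


# Toy corpus (hypothetical): class 0 = mathematics, class 1 = biology.
labeled = [["matrix", "algebra", "proof"], ["gene", "protein", "cell"]]
labels = [0, 1]
unlabeled = [
    ["matrix", "theorem", "proof"],
    ["protein", "cell", "enzyme"],
    ["theorem", "algebra"],
    ["enzyme", "gene"],
]

log_prior, log_like = em_naive_bayes(labeled, labels, unlabeled, n_classes=2)
print(predict(["theorem"], log_prior, log_like))
print(predict(["enzyme"], log_prior, log_like))
```

In this toy setup the words "theorem" and "enzyme" never appear in a labeled document, so a classifier trained on the labeled set alone has no evidence about them; the EM iterations pick up their association with each class from co-occurrence in the unlabeled documents.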