Improving Prediction Accuracy Using Class-specific Ensemble Feature Selection

Soares, Caio

Date

2010-08-03

Author

Soares, Caio

Type of Degree

dissertation

Department

Computer Science

Abstract

As data accumulates significantly faster than it can be processed, data preprocessing techniques such as feature selection become increasingly important and beneficial. Moreover, given the well-known gains of feature selection, any further improvement can positively affect a wide array of fields and applications. This research therefore explores a novel feature selection architecture, Class-specific Ensemble Feature Selection (CEFS), which finds a class-specific subset of features optimal for each class in the dataset. Each subset is then combined with a classifier to create an ensemble feature selection model, which is then used to predict unseen instances. CEFS attempts to provide the diversity and base classifier disagreement sought after in effective ensemble models by providing highly useful, yet highly exclusive, feature subsets. CEFS is not a feature selection algorithm but rather a unique way of performing feature selection. It is therefore algorithm independent, suggesting that various machine learners and feature selection algorithms can benefit from this architecture. To test the architecture, a comprehensive experiment is conducted, implementing it with two different classifiers and three different feature selection algorithms on ten different datasets. The results of this experiment show that the CEFS architecture outperforms the traditional feature selection architecture in every algorithmic combination and for every dataset. Moreover, the inclusion of high-dimensional datasets in the experiment suggests that CEFS will scale up. Finally, the feature results obtained from the experiment suggest that vital class-specific information can be lost if feature selection is performed on the dataset as a whole rather than in a class-specific manner.
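
To make the architecture described in the abstract more concrete, the following is a minimal illustrative sketch, not the dissertation's implementation. Everything specific in it is an assumption: the abstract does not say which feature selection algorithms, classifiers, or combination rule were used, so the sketch substitutes a simple one-vs-rest mean-difference ranking for feature selection, a nearest-centroid rule as the base classifier, and a largest-margin vote to combine the per-class members. The function names (`class_specific_subsets`, `fit_cefs`, `predict`) and the toy data are hypothetical.

```python
from statistics import mean


def class_specific_subsets(X, y, k):
    """Pick, for each class, the k features that best separate it from the rest.

    Assumption: features are ranked by the absolute difference between the
    class mean and the mean of all other classes (a stand-in filter).
    """
    subsets = {}
    n_features = len(X[0])
    for c in sorted(set(y)):
        scores = []
        for j in range(n_features):
            inside = [row[j] for row, label in zip(X, y) if label == c]
            outside = [row[j] for row, label in zip(X, y) if label != c]
            scores.append((abs(mean(inside) - mean(outside)), j))
        top = sorted(scores, reverse=True)[:k]
        subsets[c] = sorted(j for _, j in top)
    return subsets


def fit_cefs(X, y, k):
    """Train one one-vs-rest member per class on that class's own feature subset."""
    model = {}
    for c, feats in class_specific_subsets(X, y, k).items():
        pos = [[row[j] for j in feats] for row, label in zip(X, y) if label == c]
        neg = [[row[j] for j in feats] for row, label in zip(X, y) if label != c]
        pos_centroid = [mean(col) for col in zip(*pos)]
        neg_centroid = [mean(col) for col in zip(*neg)]
        model[c] = (feats, pos_centroid, neg_centroid)
    return model


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def predict(model, row):
    """Return the class whose member is most confident about the instance.

    Assumption: confidence is the margin between the distance to the
    'rest' centroid and the distance to the class centroid.
    """
    def margin(c):
        feats, pos_centroid, neg_centroid = model[c]
        x = [row[j] for j in feats]
        return sq_dist(x, neg_centroid) - sq_dist(x, pos_centroid)

    return max(model, key=margin)


# Toy data (hypothetical): each class is marked by a different feature.
X = [
    [5.0, 0.1, 1.0],
    [5.2, 0.0, 1.1],
    [1.0, 4.0, 1.0],
    [1.1, 4.2, 0.9],
    [1.0, 0.1, 6.0],
    [0.9, 0.0, 6.2],
]
y = ["a", "a", "b", "b", "c", "c"]

model = fit_cefs(X, y, k=1)
print(class_specific_subsets(X, y, k=1))  # {'a': [0], 'b': [1], 'c': [2]}
print(predict(model, [5.1, 0.2, 1.0]))    # a
```

In this toy case each class receives a different single-feature subset, which illustrates the "highly useful, yet highly exclusive" subsets the abstract refers to. Margins computed in different feature subspaces are not strictly comparable, so a real implementation would likely need calibrated member outputs or a different combination rule.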

URI

http://hdl.handle.net/10415/2273