A CLASS-SPECIFIC ENSEMBLE FEATURE SELECTION APPROACH FOR CLASSIFICATION PROBLEMS Except where reference is made to the work of others, the work described in this thesis is my own or was done in collaboration with my advisory committee. This thesis does not include proprietary or classified information. ____________________________________ Caio Soares Certificate of Approval: ______________________________ ______________________________ Juan E. Gilbert, Chair Cheryl Seals Full Professor Associate Professor Computer Science and Computer Science and Software Engineering Software Engineering ______________________________ ______________________________ Gerry Dozier George T. Flowers Associate Professor Dean Computer Science and Graduate School Software Engineering A CLASS-SPECIFIC ENSEMBLE FEATURE SELECTION APPROACH FOR CLASSIFICATION PROBLEMS Caio Soares A Thesis Submitted to the Graduate Faculty of Auburn University in Partial Fulfillment of the Requirements for the Degree of Master of Science Auburn, Alabama May 9, 2009 iii A CLASS-SPECIFIC ENSEMBLE FEATURE SELECTION APPROACH FOR CLASSIFICATION PROBLEMS Caio Soares Permission is granted to Auburn University to make copies of this thesis at its discretion, upon the request of individuals or institutions and at their expense. The author reserves all publication rights. ____________________________________ Signature of Author ____________________________________ Date of Graduation iv VITA Caio Vinicius Soares, son of Darwin Soares and Yara Panzone, was born March 18, 1983 in Sao Paulo, Brazil. In 2001, he graduated from Dr. Phillips High School in Orlando, Florida. He attended Berry College in Rome, Georgia, graduating Magna Cum Laude with Bachelor of Science degrees in Computer Science and Mathematics in 2004. The same year, he entered Graduate School in the Computer Science and Software Engineering Department at Auburn University in Auburn, Alabama. v THESIS ABSTRACT A CLASS-SPECIFIC ENSEMBLE FEATURE SELECTION APPROACH FOR CLASSIFICATION PROBLEMS Caio Soares Master of Science, May 9, 2009 (B.S. Computer Science, Berry College, 2004) (B.S. Mathematics, Berry College, 2004) 58 Typed Pages Directed by Juan E. Gilbert This research proposes a new feature selection algorithm, Class-specific Ensemble Feature Selection (CEFS), which finds class-specific subsets of features optimal to each available classification in the dataset. Each subset is then combined with a classifier to create an ensemble feature selection model which is further used to predict unseen instances. CEFS attempts to provide the diversity and base classifier disagreement sought after in effective ensemble models by providing highly useful, yet highly exclusive feature subsets. Also, the use of a wrapper method gives each subset the chance to perform optimally under the respective base classifier. Preliminary experiments implementing this innovative approach suggest potential improvements of more than 10% over existing methods. vi ACKNOWLEDGEMENTS The author thanks the following: Lacey Weaver for her unyielding love, support and encouragement. Dr. Juan Gilbert and Dr. Gerry Dozier for letting me borrow their great minds and great ideas. Yara, George, Bruna, Junior, Lysandra, and Duda for being the greatest family a person could ask for. vii Style manual or journal used IEEE Computer Society Transactions. Computer software used Microsoft Word 2003 and Microsoft Word 2007. 
viii TABLE OF CONTENTS LIST OF FIGURES ............................................................................................................. ix LIST OF TABLES .................................................................................................................x 1. INTRODUCTION ............................................................................................................1 1.1. Motivation .............................................................................................................3 1.2. Problem Description ..............................................................................................5 1.3. Goals and Contributions ........................................................................................6 1.4. Organization ..........................................................................................................7 2. FEATURE SELECTION ..................................................................................................8 2.1. Misconceptions ....................................................................................................12 2.2. Definition of relevance ........................................................................................13 2.3. Feature Selection in Unsupervised Learning .......................................................17 2.4. Feature Selection Methods ..................................................................................18 2.4.1. Filter Methods ...........................................................................................18 2.4.2. Wrapper Methods .....................................................................................20 2.4.3. Embedded Methods ..................................................................................22 2.4.4. Hybrid Methods ........................................................................................23 2.5. Feature Selection in Ensemble Methods ...............................................................25 2.5.1. Ensemble Methods ....................................................................................25 2.5.2. Ensemble Feature Selection ......................................................................27 3. CLASS-SPECIFIC ENSEMBLE FEATURE SELECTION (CEFS) ................................30 4. IMPLEMENTATION .....................................................................................................33 4.1. Proposed Hypothesis ...........................................................................................33 4.2. Preliminary Experiment .......................................................................................34 4.2.1. Data ...........................................................................................................34 4.2.2. Tools .........................................................................................................34 4.2.3. Procedure ..................................................................................................35 4.2.4. Experiment Results ...................................................................................36 5. 
CONCLUSION ..............................................................................................................39 BIBLIOGRAPHY ................................................................................................................41 ix LIST OF FIGURES Figure 1: Feature Selection Procedure .................................................................................9 Figure 2: Feature Subset Space .........................................................................................15 Figure 3: Generalized Filter Algorithm .............................................................................19 Figure 4: Generalized Wrapper Algorithm ........................................................................22 Figure 5: Generalized Hybrid Algorithm ..........................................................................24 Figure 6: A logical view of the ensemble learning method ...............................................26 Figure 7: CEFS Procedure .................................................................................................31 Figure 8: Ensemble Classifier Agreement .........................................................................37 Figure 9: Prediction Accuracy based on Algorithm ..........................................................38 x LIST OF TABLES Table 1: Required Sample Size for given number of dimensions .......................................4 Table 2: Optimal Feature Subsets under Breast Cancer Dataset .......................................36 1 CHAPTER 1 INTRODUCTION Prediction and classification problems appear throughout our daily lives; physicians predict (diagnose) health problems, gamblers predict (pick) sporting events, stock traders predict (buy/sell) stock, and etc. The storage and processing of past known data gives humans and machines alike the ability to predict and classify unseen or future data. Now, add to it the fact that data storage has become increasingly easier and increasingly inexpensive. The result is a dramatic shift in the mindset of data storage; a shift which considered a one-megabyte database as being very large in 1989 to numerous multi-terabyte databases being mined regularly by the year 2000 [1]. Unfortunately, data processing has not kept up with data acquisition. That is, as computer and database technologies continue to improve, data accumulates in a speed significantly faster than the capacity of data processing [2]. Such data accumulation produces significantly large databases. These databases can be large due to a high number of instances1, a high number of features2, or a combination of both. An example of a high-instance database is Google index, which reported having indexed one billion unique URLs by 2000 and one trillion by 2008 [3]. On the other hand, microarray 1 For the remainder of this work, instances may also be referred to as ?cases?, ?vectors?, ?samples?, or ?observations?. 2 For the remainder of this work, features may also be referred to as ?variables?, ?dimensions?, ?attributes?, or ?genes? (in case of microarray data). 2 datasets present a fitting example of high-dimensional datasets, often containing a small number of instances but each with thousands of attributes [4]. Although various methods for dealing with both do exist, this research will focus solely on the issue of high- dimensional data. 
More specifically, it will do so through the use of feature selection, a technique which can reduce the cost of recognition and often provide better classification accuracy by reducing the number of features [5]. Up to now, however, this technique has been almost always carried out on the entire dataset at once, assuming that the most salient features for learning will be uniform among all instances, regardless of classification. So, assume dataset D, with instances classified as either x or y. The hypothesis investigated in this research is two-fold. First, it is believed that instances classified as x will have a different set of relevant3 features than will instances classified as y, contradicting the aforementioned methodology, which would suggest that features relevant to the entire dataset must be relevant to instances classified x and instances classified as y. Moreover, under existing methods, features relevant to a single classification, but not to the entire dataset, may be deemed irrelevant, removed and thus unused at prediction time. Second, given the well documented benefits of ensemble methods [6] [7] [8], an ensemble system which takes advantage of class-specific features should provide the data diversity and classifier disagreement necessary for ensemble systems to succeed, thereby increasing prediction performance (better accuracy). Allowing each classification to have its own set of class-specific optimal features, and 3 The formal definition of relevance as it pertains to features will be detailed in Section 2.2 of this work. 3 thus each its own ensemble classifier, will create an ensemble system which should allow for a more accurate classification of unseen instances. As an added benefit, new information may be learned as to the relevance of certain features on certain classifications, information which may have been otherwise unknown due to performing feature selection on all instances, and thus, all classifications at once. This would be of particularly good use in problems such as cancer prediction [9] [10], where certain features may be optimal in detecting certain types of cancer, but irrelevant in others. 1.1 Motivation According to [11], a learning algorithm is good if it produces hypotheses that do a good job of predicting classifications of unseen examples. With that in mind, the primary goal of this research is to increase prediction accuracy, thereby creating a better learning algorithm. Note, however, that the focus here is not on the learning algorithm itself, but rather on the pre-processing methodology used to prepare the training data for optimal learning. In fact, research suggests that much of the power in classification comes not from the specific learning method, but from proper formulation of the problems and crafting the representation to make learning tractable [12]. In addition, the same research suggests that performance is roughly equivalent between numerous learning algorithms, across many domains. In other words, the preparation and representation of the data is of the utmost importance given that no single learning algorithm outperforms all others on every domain. In a nutshell, this represents the well-known theorems of ?No Free Lunch? [13]. 4 Although a dataset may be well prepared and represented, it may also fall victim to Belman?s ?curse of dimensionality?, a problem which often plagues high-dimensional datasets. The curse of dimensionality refers to the exponential growth of hypervolume as a function of dimensionality [14]. 
That is, the more dimensions in a dataset, the more representation needed by the predicting model. For classification, this can mean that there are not enough data objects to allow the creation of a model that reliably assigns a class to all possible objects [15]. In the context of Neural Networks (NN), for example, the curse of dimensionality causes networks with irrelevant inputs to behave badly: the dimension of the input space is high, and the network uses almost all its resources to represent irrelevant portions of the space [16]. In [17], Silverman illustrates the difficulty of kernel estimation in high dimensions. As shown in Table 1, even at a small dimensionality of 10, roughly 840,000 samples are required to estimate the density at 0 with a given accuracy [17]. Table 1: Required Sample Size for given Number of Dimensions [17] Dimensionality Required Sample Size 1 4 2 19 5 786 7 10,700 10 842,000 In 2003, research conducted by Guyon and Elisseeff reported on gene expression datasets up to 60,000 variables wide, and text classification datasets 15,000 variables wide and nearly 1 million instances deep [18]. Those numbers continue to grow daily, as well as in other fields such as satellite imagery, hyperspectral imagery, financial data, 5 consumer data, and etc [19]. Improving prediction accuracy within problems of high data dimensionality is the primary motivation for using feature selection. The key benefit of feature selection is that it directly targets the curse of dimensionality, since it can eliminate irrelevant and redundant features and reduce noise in the dataset. Moreover, as the dimension of the feature space increases so does the probability of over-fitting, and as shown by [20], feature selection provides a powerful means to avoid over-fitting. Feature selection can also lead to a more understandable model, allow the data to be more easily visualized, and with a reduction in dimensionality, can decrease the amount of time and memory required by learning algorithms [15]. The motivation for this research, however, goes deeper into the methodology of feature selection, a topic which will be further explored in the subsequent section. 1.2 Problem Description Although there are a plethora of feature selection algorithms, all differing in some shape or form, they all share a commonality; they all attempt to reduce the number of features by selecting an optimal subset of the original feature set. This subset, however, almost always contains features which are assumed to be optimal to the entire dataset. This then leads to the assumption that the most optimal features to the dataset must be optimal for instances across all classifications. Therein lies the problem and thus the basis for this research. 6 The hypothesis is that by allowing the feature selection algorithm to focus its attention on the prediction accuracy of a single classification at a time, the algorithm will produce a separate subset of features optimal to each classification. These separate subsets can then be utilized to create separate classification models, thereby taking advantage of another technique proven to improve classification accuracy; ensemble learning (or in this case, ensemble feature selection). In summary, this research will attempt to improve on the problem of prediction accuracy by proposing a unique approach to performing ensemble feature selection 1.3 Goals and Contributions The research conducted and detailed through this work will attempt to achieve the following comprehensive goals: 1. 
To design and develop a unique and effective approach to feature selection. 2. To examine the potential benefits of class-specific ensemble feature selection. 3. To enhance dataset understandability with the creation of class-specific optimal subsets as opposed to dataset-specific optimal subsets. 4. To test the effectiveness of class-specific ensemble feature selection on an existing classification problem. The immediate contribution of this research will be in the introduction of a unique algorithm to the areas of feature selection, ensemble learning, and machine learning. 7 However, as dataset sizes continue to grow in every discipline, the proposed algorithm can further contribute in a much more widespread manner as it is data independent. Given its class-specific design, the algorithm may have even farther reaching implications to fields in science, medicine, vision, finance, etc. 1.4 Organization The outline of this thesis first began with a brief introduction on the methods explored in this research. The remainder of this thesis is organized as described below. Chapter 2 explains feature selection in detail. Following a brief introduction to feature selection, common misconceptions and the definition of relevance are provided. The use of feature selection in unsupervised learning is then discussed, concluding with thorough explanations of the different types of feature selection algorithms and existent work in ensemble feature selection. Chapter 3 provides a thorough explanation of the Class-specific Ensemble Feature Selection (CEFS) algorithm proposed in this research. A detailed description of the suggested implementation is also provided. Chapter 4 presents an initial implementation of the suggested algorithm, along with results attained from a preliminary experiment. Moreover, experiment parameters, analysis, and conclusions are discussed. Lastly, Chapter 5 concludes with final observations and analysis of the conducted research, and possible future work. 8 CHAPTER 2 FEATURE SELECTION Many machine learning algorithms, including top-down induction of decision tree algorithms such as ID3 [21], C4.5 [22], and CART [23], and instance-based algorithms, such as IBL [24] [25], are known to lose prediction accuracy when faced with many features unnecessary for predicting the output, i.e. irrelevant features [26]. These are features which contain almost no useful information for the learning task at hand. For example, students? ID numbers are irrelevant for predicting students? GPAs [15]. On the other hand, algorithms such as Na?ve-Bayes [27] [28] [29] are robust with respect to irrelevant features, but their performance may degrade quickly if correlated features, i.e. redundant features, are added [26]. Such features duplicate much or all of information contained in one or more other attributes. For instance, the purchase price of a product and the amount of sales tax paid contain much of the same information [15]. Feature selection targets those very problems as it is a technique which reduces the number of features and removes irrelevant, redundant or noisy data by selecting a subset of the original feature set [2]. Most feature selection methods follow a four step process: subset generation, subset evaluation, stopping criterion, and result validation (Figure 1) [30]. 9 Figure 1: Feature Selection Process [30] Beginning with subset generation, a selected search strategy produces candidate feature subsets. 
Each subset is then evaluated and compared to others according to a given evaluation criterion. The best subset is kept and this process is repeated until a stopping criterion is reached, at which point the selected subset is validated using the pre-selected classifier. Formally, feature selection can be defined as follows [2]:

Definition 1 (feature selection) Let Y be the original set of features, with cardinality n. Let d represent the desired number of features in the selected subset X, X ⊆ Y. Let the feature selection criterion function for the set X be represented by J(X). Without loss of generality, let us consider a higher value of J to indicate a better feature set. Formally, the problem of feature selection is to find a subset X ⊆ Y such that |X| = d and J(X) = max{ J(Z) : Z ⊆ Y, |Z| = d }.

Although optimal feature selection is typically intractable [26] and many problems related to it have been shown to be NP-hard [31] [32], some feature selection algorithms have proved to be significantly more stable than others [33] [34]. Furthermore, feature selection has been shown to yield benefits such as facilitating data visualization and data understanding, reducing measurement and storage requirements, reducing training and utilization times, and defying the curse of dimensionality to improve prediction performance [18]. As a result, feature selection can be found in many areas of research such as bioinformatics [35] [36] [37] [38], machine learning [35] [39] [40] [41], text classification/mining [42] [43] [44] [45] [46], case-based reasoning [47], statistics [48] [49] [50], security [44] [51], pattern recognition [5] [36] [52] [53], data mining [54] [55] [56], and various others. In addition, numerous reviews and comparative studies on existing feature selection algorithms have been conducted [2] [30] [57] [58] [59]. Feature selection methods can be structured into four main factions: filter, wrapper, embedded, and hybrid methods [2] [38]. Filter and wrapper methods are the most widely known and used. Embedded methods are also frequently used, but typically with the user having no knowledge that feature selection is being employed. Hybrid methods are fairly new on the scene, so they have not yet fully made their mark. Filter methods rank each feature according to some univariate metric, and only the highest ranking features are used while the remaining low ranking features are eliminated [38]. Some commonly used metrics for filter methods include Information Gain, Chi-Squared [42], and the Pearson Correlation Coefficient [60]. Since such metrics are used as the evaluation criterion to select an optimal subset of features, it is important to point out that filter methods do not involve the use of a learning algorithm as the evaluation criterion, which defines their main distinction from wrapper methods. As such, filter methods do not inherit bias from learning algorithms and are also more computationally efficient than wrapper methods [2]. Popularized by Kohavi and John in [26], wrapper methods utilize a learning algorithm to assess the goodness of a subset of features [18]. Whereas a filter method will use metrics to determine the usefulness of each feature, wrapper methods will use a learning method to determine the usefulness of a candidate set of features. At the end of the process, not only will the algorithm produce an optimal set of features, but one which will perform optimally under the aforementioned learning algorithm.
Well known wrapper algorithms include Sequential Forward Selection, Sequential Backward Selection [5], and using a hill-climber to optimize feature selection [26]. Although wrapper methods have shown to be more computationally expensive than filter methods due to their use of a separate learning algorithm for evaluation, research has also shown that wrapper methods tend to give superior performance as feature subsets are better suited to the predetermined learning algorithm [2]. Embedded methods perform feature selection as part of the learning algorithm itself. That is, during the process of the learning algorithm, the algorithm decides which attributes to use and which to ignore [15]. Decision Trees are perhaps the most well known example of embedded methods [21] [22] [23]. To take advantage of the benefits of filter and wrapper models, and to avoid the pre-specification of a stopping criterion, hybrid methods have been proposed [61]. A typical hybrid method uses both an independent measure and a learning algorithm to evaluate feature subsets [2]. As such, this work will focus most of its attention on filter and wrapper algorithms, the latter, being implemented on the proposed algorithm. The remainder of this chapter is organized as follows. First, common misconceptions of the feature selection are addressed, followed by an in depth discussion on the subject of relevance as it pertains to feature selection. Next, a brief synopsis of the use of feature selection in unsupervised learning is given. Detailed descriptions of each type of feature selection 12 algorithm are then provided. Finally, previous research conducted in ensemble feature selection is summarized and explained, concluding with a few remarks regarding feature selection as described in this chapter. 2.1 Misconceptions Techniques such as dimensionality reduction, feature extraction, feature construction, feature weighting, feature creation and others are often misused as synonymous to feature selection. It should be pointed out that feature selection strictly selects a subset of the existing feature set, without creation of new features [26] [62] [63]. However, to attain a better understanding of feature selection, a brief explanation of some of the aforementioned techniques is noteworthy. The first two techniques, dimensionality reduction and feature extraction, go hand in hand as various algorithms are often interchangeably characterized under both. Tan et al comments that dimensionality reduction commonly uses techniques from linear algebra to project the data from a high-dimensional space into a lower-dimensional space [15]. Similarly, Jain et al describes feature extraction algorithms as algorithms which create new features based on transformations or combinations of the original feature set [5]. For example, Principal Component Analysis (PCA) finds new attributes which are linear combinations of the original attributes, are orthogonal to each other, and capture the maximum amount of variation in the data [15]. In other words, PCA is an orthogonal transformation of the coordinate system, which is obtained by projection onto the principal components, or features, of the data. Since a small number of principal 13 components are often sufficient, a reduction in dimensionality occurs [64]. Further reviews of PCA literature can also be found at [65] and [66]. Now consider another example; suppose you have a set of photographs, where each photograph is to be classified as to whether or not it contains a human face. 
By using a feature extraction algorithm, pixels can be transformed to provide higher-level features, such as edges and areas highly correlated with the presence of human faces. Other examples include applying a Fourier transform to time series data to identify underlying frequencies in order to detect certain periodic patterns [15] and using Independent Component Analysis (ICA) in conjunction with Support Vector Machines (SVM) in order to improve prediction accuracy in series forecasting [67]. Other techniques such as feature weighting [15] and feature construction [18] also provide similar distinctions, but for brevity will not be discussed at this time. In summary, feature selection is not to be confused with other techniques which go beyond the original feature set, creating new features; that is, feature selection algorithms always yield a direct subset of the original feature set.

2.2 Definition of Relevance

Relevance, as it pertains to feature selection, is a bit of a loaded term. Defining it is not trivial, nor is it widely agreed upon. In fact, numerous authors have provided a number of different definitions for relevance. A brief summary of some of these definitions will be provided in the subsequent paragraphs. Detailed explanations and further definitions of relevance can be found in works such as [26] by Kohavi and John, [68] by Blum and Langley, [36] by Nilsson et al, and most recently in [69] by Bell and Wang. Although the word relevance is often used casually and without formal definition in most feature selection studies, its definition is as important as its use. At the heart of this matter lies the question of relevance vs. usefulness. Selecting only the most relevant features will often produce suboptimal results, especially if the features are redundant. Conversely, a subset of useful features may leave out many redundant, yet relevant, features [18]. In other words, the relevance of a feature does not imply that it should be in the optimal feature subset, while irrelevance does not imply it should not be in the optimal feature subset [26]. For clarification purposes, provided next are some formal definitions acquired from [26], [36], and [68] which may aid in understanding the role of relevance in feature selection.

Definition 2 (conditional independence) A variable Xi is conditionally independent of a variable Y given (conditioned on) the set of variables S ⊆ X iff it holds that P( p(Y | Xi, S) = p(Y | S) ) = 1. This is denoted Y ⊥ Xi | S.

Conditional independence is a measure of irrelevance, but it is difficult to use as an operational definition since this measure depends on the conditioning set S [36]. The definitions of strong and weak relevance, given next, will help refine what it means for a feature to be considered irrelevant.

Definition 3 (strong relevance) A feature Xi is strongly relevant iff there exist some xi, y, and si for which p( Xi = xi, Si = si ) > 0 such that p( Y = y | Xi = xi, Si = si ) ≠ p( Y = y | Si = si ).

Definition 4 (weak relevance) A feature Xi is weakly relevant iff it is not strongly relevant and there exists a subset of features S′i of Si for which there exist some xi, y, and s′i with p( Xi = xi, S′i = s′i ) > 0 such that p( Y = y | Xi = xi, S′i = s′i ) ≠ p( Y = y | S′i = s′i ).
In other words, a feature X is strongly relevant if the removal of X alone will yield a decrease in the performance of an optimal classifier. Conversely, a feature X is weakly relevant if it is not strongly relevant and there exists a subset of features, S, such that the performance of a classifier on S is worse than the performance on S ∪ {X}. Thus, a feature is relevant if it is either weakly or strongly relevant; otherwise it is irrelevant [26]. Figure 2 provides a useful visual representation of the feature subset space divided into irrelevant, weakly relevant and strongly relevant feature subsets [26]. Figure 2: Feature Subset Space [26] Recall, however, that a distinction between the relevance of a feature and the usefulness of a feature is still necessary. As such, the following definition derived from [70] will provide clarification.

Definition 5 (incremental usefulness) Given a sample of data S, a learning algorithm L, and a feature set A, feature xi is incrementally useful to L with respect to A if the accuracy of the hypothesis that L produces using the feature set {xi} ∪ A is better than the accuracy achieved using just the feature set A.

Accordingly, the definition of incremental usefulness will be the focal point for determining whether or not a feature should be included in the optimal feature subset. This is further apparent given the fact that the proposed algorithm later described in this research will employ a wrapper method, which tailors directly to the learning algorithm used to evaluate each feature subset. Therefore, features will be assessed according to their usefulness with respect to the learning algorithm rather than their overall relevance. As a final illustration of the issue of relevance vs. usefulness, refer to an example provided in [26]. One of the artificial datasets (m-of-n-3-7-10) represents a symmetric target function, implying that all features should be ranked equally by any filtering method. However, Naïve-Bayes improves if a single feature (any one of them) is removed. Consequently, that is the precise motivation behind the selection of a wrapper method over a filter method in the implementation of the proposed algorithm in this research (Chapter 3). Please note, however, that relevance according to these definitions does not imply membership in the optimal feature subset, and that irrelevance does not imply that a feature cannot be in the optimal feature subset [26]. The underlying purpose of all of this is to design the algorithm in such a way that it will be allowed to choose usefulness over relevance.
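To make the notion of incremental usefulness concrete, a minimal, hypothetical Java sketch of a greedy wrapper-style search is given below: a feature is added to the working subset only when it improves the estimated accuracy of the learning algorithm, i.e., only when it is incrementally useful in the sense of Definition 5. The accuracy function abstracts away the classifier and its evaluation; the names used here are illustrative assumptions and are not the implementation developed later in this thesis.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

/** Minimal sketch of a greedy wrapper search driven by incremental usefulness:
 *  a feature is kept only if adding it improves the estimated accuracy of the learner. */
public class IncrementalUsefulnessSketch {

    static Set<Integer> sequentialForwardSelection(int numFeatures,
                                                   Function<Set<Integer>, Double> accuracy) {
        Set<Integer> best = new HashSet<>();
        double bestScore = accuracy.apply(best);
        boolean improved = true;
        while (improved) {
            improved = false;
            Integer bestFeature = null;
            for (int f = 0; f < numFeatures; f++) {
                if (best.contains(f)) continue;
                Set<Integer> candidate = new HashSet<>(best);
                candidate.add(f);
                double score = accuracy.apply(candidate);   // learner's estimated accuracy on the candidate subset
                if (score > bestScore) {                     // feature f is incrementally useful
                    bestScore = score;
                    bestFeature = f;
                    improved = true;
                }
            }
            if (bestFeature != null) best.add(bestFeature);
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy evaluator: pretends features 2 and 5 are the only useful ones and penalizes subset size.
        Function<Set<Integer>, Double> toyAccuracy =
                s -> 0.5 + (s.contains(2) ? 0.2 : 0.0) + (s.contains(5) ? 0.1 : 0.0) - 0.01 * s.size();
        System.out.println(sequentialForwardSelection(10, toyAccuracy));   // expected: [2, 5] (order may vary)
    }
}
```

In a real wrapper method, the accuracy function would train the chosen learning algorithm on the candidate subset and estimate its accuracy (e.g., by cross-validation), which is precisely what makes wrapper methods more expensive, and usually more accurate, than filter methods.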
A number of heuristic measures exist for estimating the quality of clustering results, such as cluster compactness, scatter separability, and maximum likelihood [2]. Further work on developing dependent criteria in feature selection for clustering can also be found in [54], [72], and [73]. Since filter methods don?t depend on the use of a separate learning algorithm for evaluation, the methodology for using filter feature selection on unsupervised learning does not differ from that of supervised learning. Given the infancy of this research area, advances continue to occur. 18 2.4 Feature Selection Methods Feature selection methods can be grouped into one of four groups: filter, wrapper, embedded, and hybrid methods [2] [38]. The subsequent sections will describe each method in detail. 2.4.1 Filter Methods Coined by John, Kohavi and Pfleger in [77], filter methods have their name because they filter out irrelevant features before induction occurs. The process uses general characteristics of the training set to select some features and exclude others. Since filtering methods do not involve the use of a learning algorithm to evaluate candidate sets, they can be combined with any learning algorithm after the filtering is complete. Moreover, filter methods are a computationally effective form of data pre- processing, especially when compared to wrapper methods [38] [68]. Figure 3 describes a generalized form of a filter algorithm, provided by [2]. Given dataset D, begin with a given subset S0 (an empty set, a full set, or any randomly selected subset) and search through the feature space using a particular search strategy. Each generated subset S is evaluated by an independent measure M and compared with the previous best. If better, it?s regarded as the current best subset. The search iterates until a predefined stopping criterion is reached. The algorithm outputs the last current best subset Sbest as the final feature subset [2]. 19 Figure 3: Generalized Filter Algorithm [2] One of the simplest filtering processes is to evaluate each feature individually based on its correlation with the target function and then select the k features with the highest values. This method has been shown to achieve good results in text categorization tasks [78] [79], often used in combination with either a Na?ve-Bayesian classifier or a nearest neighbor classifier [68]. Other widely used metrics include Information Gain, Odds Ratio, Log Probability Ratio [42], FOCUS [80], RELIEF [81] [82], Potential Difference [83], Pearson Correlation Coefficient [60], etc. RELIEF, for instance, assigns a ?relevance? weight to each feature, which represents the relevance of the feature to the target concept. It samples instances randomly and updates the relevance values based on the difference between the selected instance and the nearest instances of the same and opposite class (?near-hit? and ?near-miss?) [26]. The algorithm does require the problem to only contain two classes however. Another example, FOCUS, looks for minimal combinations of attributes that perfectly 20 discriminate among the classes. It begins by looking at each feature individually, goes on to pairs of features, triples, and so forth, stopping only when it finds a combination which generates pure partitions of the training set. There are some drawbacks to filter methods however. First, most filtering methods require the pre-selection of a set number of features to be chosen. 
If the number is too high, irrelevant features will be kept and accuracy may suffer. If the number is too low, useful features may not be selected, once again affecting accuracy. Some remedies to this problem include using a hold-out set to test for the best k, or using a hill-climber or some other evolutionary computation algorithm to find an optimal k. These solutions, however, will negatively impact what is perhaps the biggest benefit of a filtering algorithm, its relative speed. A second drawback is that filtering algorithms may miss features that would otherwise be useful to the learning algorithm which will predict unseen instances. Since the algorithm bases its selection purely on metrics, it may miss features deemed useful by some learning algorithms and perhaps irrelevant in others. This is why wrapper methods are widely known to outperform filter methods (in terms of prediction accuracy) [2] [38].

2.4.2 Wrapper Methods

Wrapper methods occur outside the basic learning algorithm, but also use said learning algorithm as a subroutine, rather than just as a post-processor. For this reason, John, Kohavi, and Pfleger [77] refer to them as wrapper methods. Each candidate subset is evaluated by running the selected data through the learning algorithm and using the estimated accuracy of the resulting classifier as its evaluation metric. As aforementioned, the biggest benefit to using wrapper methods is their tendency to outperform (in terms of prediction accuracy) their filter and embedded counterparts. The general argument is that the classifier which will use the feature subset should provide a better estimate of accuracy than a separate metric that may have an entirely different bias [68]. Figure 4 provides the pseudocode for a generalized wrapper algorithm [2], which is quite similar to the generalized filter algorithm (Figure 3, Section 2.4.1) except that it uses a predefined learning algorithm A instead of an independent measure M for the evaluation of each candidate subset. For each generated subset S, the algorithm evaluates its goodness by applying the learning algorithm to the data with subset S and evaluating the accuracy. Thus, different learning algorithms may produce different feature selection results. Varying the search strategy via the function generate(D) and the learning algorithm (A) can result in different wrapper algorithms. Since learning algorithms are used to control the selection of feature subsets, the wrapper model tends to give superior performance, as the feature subsets found are better suited to the predetermined learning algorithm. As a result, it is also more computationally expensive than a filter method [2]. Figure 4: Generalized Wrapper Algorithm [2] Many wrapper method variations exist, such as the brute force method, branch and bound, sequential backward/forward search, and the sequential floating search method. Variations can also be found in the learning algorithm: neural networks, Bayesian networks, and the SVM are often applied to different wrapper problems [84]. Research done in [56] also discusses overfitting and dynamic search in regards to wrapper methods.

2.4.3 Embedded Methods

Much like wrapper methods, embedded methods interact directly with a specific learning algorithm. In other words, the feature selection algorithm is built (embedded) into the classifier model itself rather than using the classifier to evaluate candidate feature sets.
Moreover, these methods have the advantage that they interact directly with the classifier while also being less computationally expensive than wrapper methods [38]. A generalized form of this technique is not provided here because embedded feature selection methods may be highly different in implementation depending on the learning algorithms encompassing them (e.g., decision tree vs. SVM). Recursive partitioning methods for induction such as decision trees [21] [22] [23], for instance, employ a greedy search through the space of decision trees, at each stage using an evaluation function to select the feature which has the best ability to discriminate among the classes. They partition the data based on this feature and repeat the process on each subset, extending the tree downward until no further discrimination is possible. Other embedded methods include Separate-and-Conquer for learning decision lists [85] and Support Vector Machines with Recursive Feature Elimination [35].

2.4.4 Hybrid Methods

Hybrid feature selection algorithms combine the use of filter and wrapper methods in an attempt to exploit the benefits of both. By combining both techniques at different stages, hybrids are able to take advantage of the speed of a filter method and the accuracy of a wrapper method [37] [61]. Other implementations may also include the use of an embedded method, such as a decision tree, instead of a wrapper method [86] [87]. Similar to the generalized algorithms previously provided, Figure 5 describes the pseudocode for a generalized hybrid algorithm [2]. The algorithm first uses a filter method to find the best subset for a given predetermined cardinality and then uses the learning algorithm to select the final best subset among the best subsets across different cardinalities. The algorithm begins with a given subset S0 (typically an empty set in sequential forward selection) and iterates to find the best subsets at each increasing cardinality. In each round for a best subset with cardinality c, it searches through all possible subsets of cardinality c + 1 by adding one feature from the remaining features. Each newly generated subset S with cardinality c + 1 is evaluated by an independent measure M and compared with the previous best. If S is better, it becomes the current best subset S′best at level c + 1. At the end of each iteration, a learning algorithm A is applied on S′best at level c + 1 and the quality of the mined result is compared with that from the best subset at level c. If S′best is better, the algorithm continues to find the best subset at the next level; otherwise, it stops and outputs the current best subset as the final best subset. The quality of results from a learning algorithm provides a natural stopping criterion in this model [2]. Figure 5: Generalized Hybrid Algorithm [2] For example, [84] begins by pre-processing the dataset using two different filter methods, F-score and Information Gain. The separate feature subsets are then combined and processed through a sequential floating wrapper method, which yields the final feature subset. A Support Vector Machine (SVM) then utilizes the resulting feature subset to compute the classification accuracy.
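As an illustration of the filter-then-wrapper pattern just described, the sketch below ranks features with a cheap univariate score and hands only the top-ranked survivors to a wrapper-style evaluator. It is a hypothetical sketch (the method names and the use of absolute Pearson correlation as the filter score are assumptions made purely for illustration), not the procedure used in [84].

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

/** Minimal sketch of a hybrid method: a univariate filter prunes the feature set,
 *  then a wrapper-style evaluator (a classifier's estimated accuracy) refines the survivors. */
public class HybridSketch {

    /** Filter stage: absolute Pearson correlation between each feature column of X and the label vector y. */
    static double[] filterScores(double[][] X, double[] y) {
        int n = X.length, d = X[0].length;
        double[] scores = new double[d];
        double yMean = 0.0;
        for (double v : y) yMean += v / n;
        for (int j = 0; j < d; j++) {
            double xMean = 0.0;
            for (double[] row : X) xMean += row[j] / n;
            double cov = 0, vx = 0, vy = 0;
            for (int i = 0; i < n; i++) {
                double dx = X[i][j] - xMean, dy = y[i] - yMean;
                cov += dx * dy;
                vx += dx * dx;
                vy += dy * dy;
            }
            scores[j] = (vx == 0 || vy == 0) ? 0.0 : Math.abs(cov / Math.sqrt(vx * vy));
        }
        return scores;
    }

    /** Hybrid stage: keep the k best-ranked features, then let the wrapper evaluator
     *  greedily drop any survivor whose removal does not hurt estimated accuracy. */
    static List<Integer> hybridSelect(double[][] X, double[] y, int k,
                                      Function<List<Integer>, Double> wrapperAccuracy) {
        double[] scores = filterScores(X, y);
        List<Integer> subset = IntStream.range(0, scores.length).boxed()
                .sorted((a, b) -> Double.compare(scores[b], scores[a]))
                .limit(k)
                .collect(Collectors.toCollection(ArrayList::new));
        for (Integer f : new ArrayList<>(subset)) {
            List<Integer> without = new ArrayList<>(subset);
            without.remove(f);                                 // remove by value (feature index), not by position
            if (wrapperAccuracy.apply(without) >= wrapperAccuracy.apply(subset)) subset = without;
        }
        return subset;
    }

    public static void main(String[] args) {
        double[][] X = { {1, 0, 5}, {2, 0, 3}, {3, 1, 8}, {4, 1, 1} };   // 4 instances, 3 features
        double[] y = { 0, 0, 1, 1 };
        // Toy wrapper evaluator: rewards subsets containing feature 1 and penalizes subset size.
        Function<List<Integer>, Double> acc = s -> (s.contains(1) ? 0.9 : 0.6) - 0.01 * s.size();
        System.out.println(hybridSelect(X, y, 2, acc));                   // expected: [1]
    }
}
```

The design point the sketch makes is the one argued above: the filter stage is cheap and shrinks the search space, while the final decision is still deferred to the learning algorithm that will actually use the features.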
An ensemble system, on the other hand, is composed of a set of multiple classifiers and performs classification by selecting from the predictions made by each of the classifiers [15]. Since wide research has shown that ensemble systems are often more accurate than any of the individual classifiers of the system alone [6] [7] [8], it is only natural that ensemble systems and feature selection would be combined at some point. 2.5.1 Ensemble Methods The main goal of an ensemble is to construct multiple classifiers from the original data and then aggregate their predictions when classifying unknown instances. Figure 6 shows a basic view of an ensemble method [15]. As depicted, three main steps exist: training set generation, learning, and integration. Step 1 begins with the original training set D. From this training set, t data subsets are created (D1, D2, ?, Dt). Bagging and boosting are common ways to accomplish this step [15]. Then in Step 2, t base classifiers are generated (C1, C2, ?, Ct). These classifiers may all be the same, all different, or 26 contain any combination of the same or different classifiers. Each classifier Ci is trained using the subset Di. Finally in Step 3, the prediction of each classifier is combined in a predetermined way to produce the resulting classification. Figure 6: A logical view of the ensemble learning method [15] Two primary approaches exist to the integration phase: combination and selection. In the combination approach, the base classifiers produce their class predictions and the final outcome is composed using those predictions. In the selection approach, one of the classifiers is selected and the final prediction is the one produced by it [88]. The simplest and most common combination method is voting, also known as majority voting. In voting, the classification predicted by a base classifier is counted as a vote for that particular class value. The class value with the most votes becomes the final classification [88]. A simple and popular selection method is cross-validation majority (CVM) [89], which estimates the accuracy of each base classifier using cross-validation 27 and selects the classifier with the highest accuracy. Although CVM is a selection method which chooses one classifier for the whole data space, more sophisticated selection methods which estimate local accuracy [90] or meta-level classifiers do exist [91]. Perhaps the most commonly used integration techniques are voting [6], simple and weighted averaging, and a posteriori [92] [93]. According to [94], the main objective when building the base classifiers is to maximize the coverage of the data, which is the percentage of the instances which at least one base classifier can classify correctly. Reaching coverage greater than the accuracy of the best classifier, however, requires diversity among the base classifiers [8] [92] [93]. Although research is always ongoing on new approaches to increase diversity, some methods include training on different subsets of the training set, using different learning algorithms, injecting randomness, and training on different sets of input features [95]. The latter is where ensemble feature selection can be successfully applied. 2.5.2 Ensemble Feature Selection An effective way of generating a diverse, yet accurate, ensemble of base classifiers is to use ensemble feature selection [88]. 
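Before going further, a concrete illustration of majority voting, the simplest integration approach described in Section 2.5.1 above, may be helpful. The following hypothetical Java sketch combines the predictions of several base classifiers; each base classifier is abstracted as a function from a feature vector to a class label, which is an assumption made for illustration only.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Minimal sketch of majority-vote integration over an ensemble of base classifiers.
 *  Each base classifier is abstracted as a function from an instance (feature vector) to a class label. */
public class MajorityVote {

    static String vote(List<Function<double[], String>> baseClassifiers, double[] instance) {
        Map<String, Integer> counts = new HashMap<>();
        for (Function<double[], String> c : baseClassifiers) {
            counts.merge(c.apply(instance), 1, Integer::sum);   // one vote per base classifier
        }
        // The class value with the most votes becomes the final classification.
        return Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        // Three toy base classifiers that deliberately disagree.
        List<Function<double[], String>> ensemble = Arrays.asList(
                x -> x[0] > 0.5 ? "recurrence" : "no-recurrence",
                x -> x[1] > 0.5 ? "recurrence" : "no-recurrence",
                x -> "no-recurrence");
        System.out.println(vote(ensemble, new double[]{0.9, 0.2}));   // prints no-recurrence (2 votes to 1)
    }
}
```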
To recall, theoretical and empirical research has shown that an efficient ensemble should consist not only of high-accuracy classifiers, but of classifiers which also err in different parts of the input space [6]. Providing different feature subsets allows base classifiers to make their classification errors in different subareas of the instance space. While feature selection algorithms attempt to find an optimal feature subset for the learning algorithm, ensemble feature selection has the additional goal of finding a set of feature subsets which will promote disagreement among the base classifiers [88] [96]. This is perhaps the most important point in understanding the motivation and goals of the research presented in this work. The approach proposed in Chapter 3 will attempt to provide the aforementioned disagreement among the base classifiers, but with a set of highly optimal class-specific feature subsets. Various successful attempts have been made at implementing ensemble feature selection. In [97], Ho proposes a technique called the Random Subspace Method (RSM), which randomly selects a predetermined number of features from the original feature set. This is repeated n times to create n feature subsets which are used to generate the n base classifiers used in the ensemble. Ho goes on to show that RSM ensembles can reach effective results presumably because the lack of accuracy in the ensemble members is compensated for by the diversity created by the RSM [97]. In another implementation, Opitz begins by using RSM to generate an initial population of feature subsets, which he then uses and optimizes in his Genetic Algorithm (GA). Opitz uses GAs in conjunction with Neural Networks (NN) as a wrapper method to find optimal feature subsets, which he then uses as training data to build an ensemble model of NNs. Opitz comments that the initial population, acquired through RSM, was surprisingly good and produced better ensembles on average than the popular and powerful ensemble approaches of bagging and boosting [96]. Other ensemble feature selection studies have included implementations such as a hill-climber and Bayesian classifiers with cross-validation integration [88], GAs and k-Nearest Neighbor classifiers [98] [99], sequential-search-based strategies and Bayesian classifiers [100], and comparative ensemble feature selection studies such as [101]. Perhaps the work most related to the algorithm proposed in this research is that done by Vale et al in [102]. In that study, the authors implement a filter method with an emphasis on class-based feature selection to generate the feature subsets used by each base classifier. The use of a filter method is perhaps the most evident shortcoming of the implementation, as wrapper methods have often been shown to provide more accurate results. In addition, the authors implement a combination-based ensemble integration technique, although the specific details of the combination technique are not provided. Depending on the specific combination technique, further improvement can be seen here as well. In summary, Chapter 2 has provided an in-depth description of feature selection. First, common misconceptions and various definitions of relevance were discussed. This was followed by the uses of feature selection in unsupervised learning and explanations of the different types of feature selection techniques. Lastly, the use of feature selection in combination with ensemble methods was reviewed.
CHAPTER 3
CLASS-SPECIFIC ENSEMBLE FEATURE SELECTION (CEFS)

A thorough summary of the intricacies of feature selection was presented in Chapter 2. This included an overview of some of the benefits and drawbacks of certain methods and the latest work being done in the field. The algorithm proposed in this research, named Class-specific Ensemble Feature Selection (CEFS), will seek to exploit some of those shortcomings, as well as build upon the most recent successful findings. The CEFS algorithm will focus mainly on improving prediction accuracy on classification problems. As an added benefit, the class-specific design will select features optimal to each separate classification, thereby providing more information and understanding regarding the most useful features for each classification and for the model. Presented next is the design of CEFS (Figure 7). First, assume dataset D, which contains n different classifications. CEFS will use a wrapper method such as Sequential Backward Selection (SBS) [47] or a Genetic Algorithm (GA) [96] to find n optimal feature subsets. That is, the feature selection algorithm will be run n times, each time searching for the subset of features which will maximize the classification accuracy, not of the entire dataset D, but of one of the n classifications, thus producing n optimal class-specific subsets. This is done by evaluating the prediction accuracy of the classifier on a single classification at a time, instead of the prediction accuracy of the classifier on the entire test set (which groups all classifications together). To clarify, the training set and test set do not ever change; they still contain all of the instances as though the algorithm were going to predict as normal. The facet which changes is the evaluation criterion. By using only the prediction accuracy of a single classification at a time, the algorithm is able to find an optimal set of class-specific features.

Figure 7: CEFS Procedure (for each classification i = 1, ..., n: class-based subset generation, subset evaluation, and a stopping criterion check; once all n class-specific subsets are found, the ensemble is built from the n subsets and the result is validated)

The classifier used for the wrapper method evaluation is a Naïve Bayesian Classifier (NBC) [28] [29], which is perfectly suited for this research, as lazy learners tend to benefit greatly from feature selection [103]. Then, each class-specific subset will be used to build a separate base classifier, creating n classifiers in an ensemble feature selection model. Finally, the ensemble will use a combination technique to predict the outputs of the unseen instances. The algorithm itself is similar to the one presented in [102], but with two main improvements. First, a wrapper method is used for feature selection instead of a filter method. Since the features will be selected using the classifier as the evaluation criterion, this should produce features more useful for predicting each class, which, most importantly, should lead to a higher accuracy, as has typically been shown to occur with the use of wrapper methods over filter methods [2]. Refer to Section 2.2 for a more in-depth discussion of the issue of relevance vs. usefulness in regards to feature selection. Using a wrapper method over a filter method will increase the overall running time of the algorithm; however, this is an acceptable drawback if it will improve prediction accuracy. The second distinction is the integration strategy used for the ensemble machine.
The research presented in [102] uses a fusion-based integration strategy, although the details on the specific implementation of this strategy are not provided. The research proposed in this work will not implement a specific ensemble integration technique, but rather explore the potential accuracy if an appropriate technique were selected.

CHAPTER 4
IMPLEMENTATION

In order to gauge the usefulness and effectiveness of the algorithm and to test the hypothesis aforementioned in this text, an implementation of Class-specific Ensemble Feature Selection (CEFS) is necessary. This chapter provides preliminary results of an initial implementation of CEFS. Moreover, this initial implementation is tested on a limited amount of data to show a proof of concept. Further experimentation will be necessary to fully test the impact the proposed algorithm can have on prediction accuracy. The remainder of the chapter is arranged as follows. First, a brief description of the proposed hypothesis is provided. Next, the experimental data, tools, and procedure are discussed. Finally, preliminary results, findings and conclusions are detailed.

4.1 Proposed Hypothesis

To reiterate the proposed hypothesis from Chapter 1, the hypothesis examined in this research is two-fold. First is the theory that instances classified as x will have a different set of useful features than will instances classified as y, contradicting the aforementioned methodology, which suggests that features useful to the entire dataset must be useful to instances classified as x and instances classified as y and vice-versa. Second, utilizing class-specific feature subsets to assemble and train ensemble base classifiers will create a system which will further improve prediction accuracy over existing models.

4.2 Preliminary Experiment

An initial preliminary experiment was conducted on a small scale to show, as a proof of concept, that the proposed approach is successful in confirming the aforementioned hypothesis and achieving the suggested goals.

4.2.1 Data

The dataset used in this experiment was acquired from the UCI Machine Learning Repository [104]. The breast cancer dataset was chosen, containing 286 instances with two possible classifications: recurrence (85) and no recurrence (201). Furthermore, each instance contained nine unique features: age, menopause, tumor-size, inv-nodes, node-caps, deg-malig, breast side, breast-quad, and irradiat. A Leave-One-Out cross-validation scheme, which is detailed in Section 4.2.3, was used to partition the data.

4.2.2 Tools

The dataset used in this experiment was stored and accessed using Microsoft Access 2007. The CEFS algorithm was written using Java JDK 1.6 as the programming language of choice and jGRASP as the IDE.

4.2.3 Procedure

Given the relatively small number of instances and the imbalance between classes in the dataset (29.7% recurrence vs. 70.3% no recurrence), a leave-one-out cross-validation scheme was selected. In a leave-one-out approach, each test set contains only one instance from the overall dataset. This is repeated until all instances have been used as a test set. The accuracy is then the number of instances correctly predicted divided by the total number of instances in the dataset. This approach has the advantage of utilizing as much data as possible for training (i.e., n - 1 instances, where n is the number of instances in the original dataset). In addition, test sets are mutually exclusive and they effectively cover the entire dataset [15].
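A minimal sketch of the leave-one-out scheme just described is shown below; the classifier is abstracted behind a training function, and the names are illustrative assumptions rather than the thesis implementation.

```java
import java.util.function.BiFunction;
import java.util.function.Function;

/** Minimal sketch of leave-one-out cross-validation: each instance is held out once,
 *  a model is trained on the remaining n - 1 instances, and accuracy is the fraction
 *  of held-out instances predicted correctly. */
public class LeaveOneOut {

    /** trainer: given training rows and labels, returns a predictor for a single instance. */
    static double accuracy(double[][] X, String[] y,
                           BiFunction<double[][], String[], Function<double[], String>> trainer) {
        int n = X.length, correct = 0;
        for (int held = 0; held < n; held++) {
            double[][] trainX = new double[n - 1][];
            String[] trainY = new String[n - 1];
            for (int i = 0, j = 0; i < n; i++) {
                if (i == held) continue;
                trainX[j] = X[i];
                trainY[j++] = y[i];
            }
            Function<double[], String> model = trainer.apply(trainX, trainY);
            if (model.apply(X[held]).equals(y[held])) correct++;   // score the single held-out instance
        }
        return (double) correct / n;
    }

    public static void main(String[] args) {
        double[][] X = { {0}, {1}, {0}, {1} };
        String[] y = { "no-recurrence", "recurrence", "no-recurrence", "recurrence" };
        // Toy "trainer": ignores the training data and simply thresholds the first feature.
        BiFunction<double[][], String[], Function<double[], String>> toyTrainer =
                (trainX, trainY) -> v -> v[0] > 0.5 ? "recurrence" : "no-recurrence";
        System.out.println(accuracy(X, y, toyTrainer));   // prints 1.0
    }
}
```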
The CEFS algorithm implemented in this experiment uses a Sequential Backward Selection (SBS) algorithm [5] in conjunction with a Naïve Bayesian Classifier (NBC) to create the feature selection wrapper method. The only portion of CEFS not implemented was the ensemble integration technique, for two reasons. First, by not implementing an integration technique and simply measuring whether either of the classifiers correctly classified the instance, we are able to attain an upper bound on performance. Second, further research and experimentation are needed in order to choose the most appropriate integration methods to be used in this system. In terms of metrics, the overall accuracy of CEFS and of a baseline model, also using SBS and NBC, were measured. In addition to overall accuracy, several other metrics were recorded to support the hypothesis proposed in this thesis. These metrics will be further detailed in the subsequent section.

4.2.4 Experiment Results

Several comparative metrics were taken for this experiment. The first was to compare the optimal feature subsets acquired by the baseline algorithm to those acquired by CEFS. Table 2 gives the optimal feature subsets acquired by each algorithm. Although the dataset is considered small on a dimensional scale, the results serve to illustrate several points. To begin, the only feature useful to both class-specific subsets was feature 8 (breast quadrant). Interestingly, this feature was not deemed useful in the optimal subset for the entire dataset. In contrast, features 5 and 6 were the only ones selected by the baseline algorithm, yet neither is deemed useful for instances classified as recurrence. Feature 8 was the only feature deemed useful for recurrence instances, yet this feature was not deemed at all useful under the baseline algorithm. The results from this dataset alone support the idea of feature exclusivity among classifications: features useful for one classification may differ from those useful for another, and may be lost at prediction time if the baseline type of feature selection is used.

Table 2: Optimal Feature Subsets under the Breast Cancer Dataset
Entire dataset (baseline): features 5 and 6
No-recurrence specific (CEFS): seven of the nine features, including feature 8
Recurrence specific (CEFS): feature 8
(1 = age, 2 = menopause, 3 = tumor size, 4 = inv nodes, 5 = node caps, 6 = deg malig, 7 = breast side, 8 = breast quad, 9 = irradiat)

The next metric, depicted in Figure 8, measured the percentage of times the classifiers agreed or disagreed on their prediction. This is important because the ensemble system accurately predicted about 80% of the instances when the classifiers agreed, which was 73% of the time.

Figure 8: Ensemble Classifier Agreement (pie chart: 73% agreed, 27% did not agree)

According to Figure 8, the other 27% of the time the classifiers disagreed. The prediction accuracy when the classifiers disagree depends on which type of ensemble integration method is used. Since no ensemble integration method was implemented, this figure is not available. However, if the selected integration technique were to allow the ensemble machine to correctly predict 100% of the instances when the classifiers disagreed, then the algorithm would reach an overall accuracy of 85.3% (giving us an upper bound on accuracy). If, at worst, the integration technique predicted only as well as when the classifiers agree, that would still yield an overall accuracy of roughly 80%.
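For concreteness, these bounds follow from decomposing the overall accuracy by agreement. Using the rounded figures of 73% agreement and roughly 80% accuracy on agreed instances, perfect integration on disagreements gives 0.73 x 0.80 + 0.27 x 1.00, or about 85%, in line with the reported 85.3% upper bound (which comes from the exact, unrounded counts), while an integration technique no better than the agreed case gives roughly 80%. A small sketch of this arithmetic follows (in Java; the class name and the rounded values are illustrative):

    /* Overall accuracy decomposed by classifier agreement, using the rounded figures from the text. */
    public class AgreementBoundSketch {
        public static void main(String[] args) {
            double agreeRate = 0.73;        // fraction of instances on which the base classifiers agreed
            double accWhenAgree = 0.80;     // approximate accuracy on the agreed instances
            double upper = agreeRate * accWhenAgree + (1 - agreeRate) * 1.00;          // perfect integration
            double lower = agreeRate * accWhenAgree + (1 - agreeRate) * accWhenAgree;  // no better than agreement
            System.out.printf("upper ~ %.1f%%, lower ~ %.1f%%%n", 100 * upper, 100 * lower);
        }
    }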
Figure 9 displays the prediction accuracy of an NBC without any feature selection, the aforementioned baseline algorithm, the case where disagreeing classifiers predict as well as agreeing classifiers, and the upper bound on prediction accuracy (when disagreeing classifiers predict correctly 100% of the time).

Figure 9: Prediction Accuracy based on Algorithm (bar chart of prediction accuracy, 0-100%, for NBC with no feature selection, NBC with feature selection as baseline, CEFS with agree/disagree accuracy equal, and the CEFS upper bound)

As shown by Figure 9, choosing an effective integration technique for the ensemble system can provide significantly better accuracy than existing methods. Moreover, the figure shows that there is considerable room for improvement under the proposed CEFS approach.

CHAPTER 5

CONCLUSION

Given the motivation and objectives presented in Chapter 1, a unique ensemble feature selection algorithm is proposed and presented. This algorithm utilizes a wrapper method to build an ensemble of models, each with a feature subset optimized for performance with respect to a separate classification and under a specific base classifier. The proposed method seeks to outperform the state of the art in terms of prediction accuracy by bringing the sought-after diversity and disagreement to the ensemble model while supplying that same model with feature subsets containing features highly useful to the base classifiers themselves. Preliminary results show promise in terms of possible increases in prediction accuracy, with potential improvements of more than 10% over existing methods.

In terms of future work, this research can evolve in a variety of ways. First, a thorough comparative study of how performance is affected by different wrapper methods, higher and lower numbers of classifications, and different ensemble integration techniques should be conducted. Second, it would be worthwhile to investigate an ensemble of models with feature subsets derived from both filter and wrapper methods. Third, a comparative study of different evolutionary computation algorithms utilized as wrapper methods within ensemble feature selection models might provide valuable information as to better-performing wrapper methods. Finally, given that ensemble feature selection is still a relatively new area of feature selection, and class-specific selection even newer, fresh and innovative combinations may provide even better-performing algorithms.

BIBLIOGRAPHY

[1] G. Piatetsky-Shapiro, "Knowledge Discovery in Databases: 10 years later," SIGKDD Explorations, vol. 1, no. 2, pp. 59-61, Jan. 2000.
[2] H. Liu and L. Yu, "Toward Integrating Feature Selection Algorithms for Classification and Clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 4, pp. 491-502, Apr. 2005.
[3] J. Alpert and N. Hajaj. (2008, Jul.) The Official Google Blog: We knew the web was big... [Online]. http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html
[4] C. Soares, L. Montgomery, K. Rouse, and J. E. Gilbert, "Automating Classification using General Regression Neural Networks," in Proceedings of the 2008 Seventh International Conference on Machine Learning and Applications, San Diego, CA, 2008, pp. 508-513.
[5] A. Jain and D. Zongker, "Feature Selection: Evaluation, Application, and Small Sample Performance," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 2, pp. 153-158, 1997.
[6] E. Bauer and R. Kohavi, "An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants," Machine Learning, vol. 36, no. 1-2, pp. 105-139, 1999.
[7] T. G. Dietterich, "Ensemble Learning Methods," in Handbook of Brain Theory and Neural Networks, M. A. Arbib, Ed. MIT Press, 2001.
[8] L. Hansen and P. Salamon, "Neural Network Ensembles," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, pp. 993-1001, 1990.
[9] T. R. Golub, et al., "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring," Science, vol. 286, no. 5439, pp. 531-537, 1999.
[10] U. Alon, et al., "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays," PNAS, vol. 96, no. 12, pp. 6745-6750, 1999.
[11] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ: Pearson Education, 2003.
[12] P. Langley and H. A. Simon, "Applications of machine learning and rule induction," Communications of the ACM, vol. 38, no. 11, pp. 55-64, 1995.
[13] D. H. Wolpert and W. G. Macready, "No free lunch theorems for search," Santa Fe Institute, Santa Fe, NM, Technical Report SFI-TR-95-02-010, 1995.
[14] R. Bellman, Adaptive Control Processes: A Guided Tour. Princeton University Press, 1961.
[15] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison-Wesley, 2006.
[16] J. Sinkkonen and W. S. Sarle. (2002, Oct.) Neural Network FAQ, part 2 of 7: Learning. [Online]. ftp://ftp.sas.com/pub/neural/FAQ2.html
[17] B. W. Silverman, Density Estimation for Statistics and Data Analysis, 1st ed. Chapman and Hall/CRC Press, 1986.
[18] I. Guyon and A. Elisseeff, "An Introduction to Variable and Feature Selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[19] D. L. Donoho, "High-dimensional data analysis: The curses and blessings of dimensionality," in Aide-Memoire of a Lecture at AMS Conference on Math Challenges of the 21st Century, 2000.
[20] A. Y. Ng, "Feature selection, L1 vs. L2 regularization, and rotational invariance," in International Conference on Machine Learning, 2004.
[21] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, pp. 81-106, 1986.
[22] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[23] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Wadsworth International Group, 1984.
[24] B. V. Dasarathy, Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. Los Alamitos, CA: IEEE Computer Society Press, 1990.
[25] D. W. Aha, D. Kibler, and M. K. Albert, "Instance-Based learning algorithms," Machine Learning, vol. 6, no. 1, pp. 37-66, 1991.
[26] R. Kohavi and G. H. John, "Wrappers for feature subset selection," Artificial Intelligence: Special Issue on Relevance, vol. 97, no. 1-2, pp. 273-324, 1997.
[27] P. Langley, W. Iba, and K. Thompson, "An analysis of Bayesian classifiers," in Proceedings of the Tenth National Conference on Artificial Intelligence, 1992, pp. 223-228.
[28] R. Duda and P. Hart, Pattern Classification and Scene Analysis. Wiley, 1973.
[29] I. J. Good, The Estimation of Probabilities: An Essay on Modern Bayesian Methods. MIT Press, 1965.
[30] M. Dash and H. Liu, "Feature selection for classification," Intelligent Data Analysis, vol. 1, no. 1-4, pp. 131-156, 1997.
Liu, "Feature selection for classification," Intelligent Data Analysis, vol. 1, no. 1 - 4, pp. 131-156, 1997. 43 [31] A. L. Blum and R. L. Rivest, "Training a 3-Node Neural Networks is NP- Complete," Neural Networks, vol. 5, pp. 117-127, 1992. [32] E. Amaldi and V. Kann, "On the approximation of minimizing non zero variables or unsatisfied relations in linear systems," Theoretical Computer Science, vol. 209, pp. 237-260, 1998. [33] A. Kalousis, J. Prados, and M. Hilario, "Stability of Feature Selection Algorithms," in Proceedings of the Fifth IEEE International Conference on Data Mining, Houston, TX, 2005. [34] A. Kalousis, J. Prados, and M. Hilario, "Stability of feature selection algorithms: a study on high-dimensional spaces," Knowledge and Information Systems, vol. 12, no. 1, pp. 95-116, 2006. [35] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1 - 3, pp. 389-422, 2002. [36] R. Nilsson, J. M. Pena, J. Bjorkegren, and J. Tegner, "Consistent Feature Selection for Pattern Recognition," Journal of Machine Learning Research, vol. 8, pp. 589-612, 2007. [37] E. Xing, M. Jordan, and R. Karp, "Feature Selection for High-Dimensional Genomic Microarray Data," in Proceedings of the 15th International Conference on Machine Learning, 2001, pp. 601-608. [38] F. K. Ahmad, N. M. Norwawi, S. Deris, and N. H. Othman, "A review of feature selection techniques via gene expression profiles," International Symposium on Information Technology, vol. 2, pp. 1-7, Aug. 2008. [39] P. Langley, "Selection of relevant features in Machine Learning," in AAAI Fall Symposium on Relevance, 1994, pp. 140-144. [40] R. Caruana and D. Freitag, "Greedy attribute selection," in Machine Learning: Proceedings of the Eleventh International Conference, 1994, pp. 28-36. [41] D. Koller and M. Sahami, "Toward optimal feature selection," in Proceedings of the 13th International Conference on Machine Learning, San Francisco, CA, 1996, pp. 284-292. [42] G. Forman, "An Extensive Empirical Study of Feature Selection Metrics for Text Classification," Journal of Machine Learning Research, vol. 3, pp. 1289-1305, 2003. [43] H. Lu, et al., "The Effects of Domain Knowledge Relations on Domain Text Classification," in Proceedings of the 27th Chinese Control Conference, Kumming, Yunnan, China, 2008, pp. 460-463. [44] Q. Chen, "Feature Selection for the Topic-Based Mixture Model in Factored Classification," in Proceedings of the 2006 International Conference on Computational Intelligence and Security, 2006, pp. 39-44. 44 [45] E. Gabrilovich and S. Markovitch, "Text categorization with many redundant features: using aggressive feature selection to make SVMs competitive with C4.5," in Proceedings of the 21st International Conference on Machine Learning, 2004, pp. 41-48. [46] A. Dasgupta, P. Drineas, B. Harb, V. Josifovski, and M. W. Mahoney, "Feature Selection Methods for Text Classification," in Proceedings of the 13th SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, 2007, pp. 230-239. [47] D. W. Aha and R. L. Bankert, "Feature selection for the case-based classification of cloud types: An empirical comparison," in Working Notes of the AAAI-94 Workshop on Case-Based Reasoning, 1994, pp. 106-112. [48] M. Ben-Bassat, Use of distance measures, information measures and error bounds in feature evaluation, 2nd ed., P. R. Krishnaiah and L. N. Kanal, Eds. North-Holland Publishing Company, 1982. [49] A. 
[50] N. R. Draper and H. Smith, Applied Regression Analysis, 2nd ed. John Wiley and Sons, 1981.
[51] J. Doak, "An evaluation of feature selection methods and their application to computer security," University of California at Davis, Technical Report CSE-92-18, 1992.
[52] P. Pudil, J. Novovicova, and J. Kittler, "Floating search methods in feature selection," Pattern Recognition Letters, vol. 15, pp. 1119-1125, 1994.
[53] M. Kudo and J. Sklansky, "Comparison of algorithms that select features for pattern classifiers," Pattern Recognition, vol. 33, pp. 25-41, 2000.
[54] Y. Kim, W. N. Street, and F. Menczer, "Feature selection in unsupervised learning via evolutionary search," in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000, pp. 365-369.
[55] Y. Kim, W. N. Street, and F. Menczer, "An evolutionary multi-objective local selection algorithm for customer targeting," in Proceedings of the Congress on Evolutionary Computation, 2001, pp. 759-766.
[56] R. Kohavi and D. Sommerfield, "Feature subset selection using the wrapper model: Overfitting and dynamic search space topology," in The First International Conference on Knowledge Discovery and Data Mining, 1995, pp. 192-197.
[57] D. W. Aha and R. L. Bankert, "A comparative evaluation of sequential feature selection algorithms," in Proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, 1995, pp. 1-7.
[58] H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining. Springer, 1998.
[59] A. Arauzo-Azofra and J. M. Benitez, "Empirical Study of Feature Selection Methods in Classification," in Proceedings of the Eighth International Conference on Hybrid Intelligent Systems, 2008, pp. 584-589.
[60] S. B. Cho and H. H. Won, "Machine Learning in DNA Microarray Analysis for Cancer Classification," in Proceedings of the First Asia-Pacific Bioinformatics Conference, 2003.
[61] S. Das, "Filters, Wrappers and a Boosting-Based Hybrid for Feature Selection," in Proceedings of the 18th International Conference on Machine Learning, 2001, pp. 74-81.
[62] J. Kittler, Feature Selection and Extraction. Academic Press, 1986.
[63] L. Rendell and R. Seshu, "Learning hard concepts through constructive induction: framework and rationale," Computational Intelligence, vol. 6, no. 4, pp. 247-270, Nov. 1990.
[64] B. Scholkopf and A. J. Smola, Learning with Kernels. Cambridge, MA: MIT Press, 2002.
[65] I. T. Jolliffe, Principal Component Analysis. New York, NY: Springer-Verlag, 1986.
[66] K. I. Diamantaras and S. Y. Kung, Principal Component Neural Networks. New York: Wiley, 1996.
[67] L. J. Cao and W. K. Chong, "Feature Extraction in Support Vector Machine: A comparison of PCA, KPCA, and ICA," in Proceedings of the 9th International Conference on Neural Information Processing, 2002, pp. 1001-1005.
[68] A. L. Blum and P. Langley, "Selection of Relevant Features and Examples in Machine Learning," Artificial Intelligence, vol. 97, pp. 245-271, 1997.
[69] D. A. Bell and H. Wang, "A formalism for relevance and its application in feature subset selection," Machine Learning, vol. 41, pp. 175-195, 2000.
[70] R. A. Caruana and D. Freitag, "How useful is relevance?," in Working Notes of the AAAI Fall Symposium on Relevance, New Orleans, LA, 1994, pp. 25-29.
[71] M. Steinbach, L. Ertoz, and V. Kumar, "The challenges of clustering high-dimensional data," in New Vistas in Statistical Physics: Applications in Econophysics, Bioinformatics, and Pattern Recognition, 2003.
Kumar, "The challenges of clustering high- dimensional data," New Vistas in Statistical Physics: Applications in Econophysics, Bioinformatics, and Pattern Recognition, 2003. [72] M. Dash and H. Liu, "Feature Selection for Clustering," in Proceedings of the Fourth Pacific Asia Conference on Knowledge Discovery and Data Mining, 2000, pp. 110-121. 46 [73] J. G. Dy and C. E. Brodley, "Feature Subset Selection and Order Identification for Unsupervised Learning," in Proceedings of the 17th International Conference on Machine Learning, 2000, pp. 247-254. [74] L. Talavera, "Feature Selection as a Preprocessing Step for Hierarchical Clustering," in Proceedings of the International Conference on Machine Learning, 1999, pp. 389-397. [75] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Prentice Hall, 1988. [76] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley and Sons, 1990. [77] G. H. John, R. Kohavi, and K. Pfleger, "Irrelevant features and the subset selection problem," in Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ, 1994, pp. 121- 129. [78] D. D. Lewis, "Representation and learning in information retrival.," Department of Computer Science, University of Massachusetts, Amherst, Doctoral dissertation and Technical Report UM-CS-1991- 093, 1992. [79] D. D. Lewis, "Feature selection and feature extraction for text categorization," in Proceedings of Speech and Natural Language Workshop, San Francisco, CA, 1992, pp. 212-217. [80] H. Almuallim and T. G. Dietterich, "Learning with many irrelevant features," in Proceedings of the Ninth National Conference on Artificial Intelligence, San Jose, CA, 1991, pp. 547-552. [81] K. Kira and L. A. Rendell, "The feature selection problem: Traditional methods and a new algorithm," in Tenth National Conference on Artificial Intelligence, 1992, pp. 129-134. [82] I. Kononenko, "Estimating attributes: Analysis and extensions of RELIEF," Machine Learning: ECML-94, vol. 784/1994, pp. 171-182, 1994. [83] G. Liu, L. Dong, S. Yuan, Y. Liu, and Y. Li, "New Feature Selection Algorithm based on Potential Difference," in Proceedings of the 2007 IEEE International Conference on Mechatronics and Automation, Harbin, China, 2007, pp. 566-570. [84] H.-H. Hsu, C.-W. Hsieh, and M.-D. Lu, "A Hybrid Feature Selection Mechanism," in Eighth International Conference on Intelligent Systems Design and Applications, 2008, pp. 271-276. [85] G. Pagallo and D. Haussler, "Boolean feature discovery in empirical learning," Machine Learning, vol. 5, pp. 71-99, 1990. [86] D. Lei, Y. Xiaochun, and X. Jun, "Optimizing Traffic Classification Using Hybrid Feature Selection," in Proceedings of the Ninth International Conference on Web-Age Information Management, 2008, pp. 520-525. 47 [87] M. Baglioni, B. Furletti, and F. Turini, "DrC4.5: Improving C4.5 by means of prior knowledge," in Proceedings of the 2005 ACM Symposium on Applied Computing, Santa Fe, CA, 2005, pp. 474-481. [88] A. Tsymbal, S. Puuronen, and D. Patterson, "Ensemble feature selection with the simple Bayesian classification," Information Fusion, vol. 4, pp. 87-100, 2003. [89] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in Proceedings of the International Joint Conference on Artificial Intelligence, 1995. [90] C. Merz, "Dynamical selection of learning algorithms," in Learnin from Data, Artificial Intelligence and Statistics, D. Fisher and H. .-J. Lenz, Eds. 
[91] M. Koppel and S. P. Engelson, "Integrating multiple classifiers by finding their areas of expertise," in AAAI-96 Workshop on Integrating Multiple Learning Models, 1996, pp. 53-58.
[92] A. Krogh and J. Vedelsby, "Neural Network ensembles, cross validation, and active learning," in Advances in Neural Information Processing Systems, D. Touretzky and T. Leen, Eds. Cambridge, MA: MIT Press, 1995, vol. 7, pp. 231-238.
[93] K. Tumer and J. Ghosh, "Error correlation and error reduction in ensemble classifiers," Connection Science, Special Issue on Combining Artificial Neural Networks: Ensemble Approaches, vol. 9, no. 3-4, pp. 385-404, 1996.
[94] C. Brodley and T. Lane, "Creating and exploiting coverage and diversity," in AAAI-96 Workshop on Integrating Multiple Learned Models, 1996, pp. 8-14.
[95] L. Asker and R. Maclin, "Ensembles as a sequence of classifiers," in Proceedings of the 15th International Joint Conference on Artificial Intelligence, San Francisco, CA, 1997, pp. 860-865.
[96] D. W. Opitz, "Feature Selection for Ensembles," in Proceedings of the 16th National Conference on Artificial Intelligence, 1999, pp. 379-384.
[97] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832-844, 1998.
[98] C. Guerra-Salcedo and D. Whitley, "Genetic Approach for Feature Selection for Ensemble Creation," in Proceedings of the Genetic and Evolutionary Computation Conference, Orlando, FL, 1999, pp. 236-243.
[99] C. Guerra-Salcedo and D. Whitley, "Feature Selection Mechanisms for Ensemble Creation: A Genetic Search Perspective," in Data Mining with Evolutionary Algorithms: Research Directions. Papers from the AAAI Workshop, 1999.
[100] A. Tsymbal, P. Cunningham, M. Pechenizkiy, and S. Puuronen, "Search Strategies for Ensemble Feature Selection in Medical Diagnostics," in Proceedings of the 16th IEEE Symposium on Computer-Based Medical Systems, 2003, pp. 124-129.
[101] L. E. A. Santana, D. F. d. Oliveira, A. M. P. Canuto, and M. C. P. d. Souto, "A Comparative Analysis of Feature Selection Methods for Ensembles with Different Combination Methods," in .
[102] K. M. O. Vale, F. G. Dias, A. M. P. Canuto, and M. C. P. Souto, "A Class-Based Feature Selection Method for Ensemble Systems," in Proceedings of the Eighth International Conference on Hybrid Intelligent Systems, 2008, pp. 596-601.
[103] P. Cunningham and J. Carney, "Diversity versus quality in classification ensembles based on feature selection," in Proceedings of the 11th European Conference on Machine Learning, 2000, pp. 109-116.
[104] T. G. Dietterich, "Machine Learning Research: four current directions," AI Magazine, vol. 18, no. 4, pp. 97-136, 1997.