Improving Prediction Accuracy Using Class-specific Ensemble Feature Selection

by

Caio Vinicius Soares

A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Auburn, Alabama
December 13, 2010

Keywords: Machine Learning, Data Mining, Feature Selection, Ensemble Machines, Classification

Copyright 2010 by Caio Vinicius Soares

Approved by
Juan E. Gilbert, Chair, Professor of Computer Science and Software Engineering
Cheryl Seals, Associate Professor of Computer Science and Software Engineering
Gerry Dozier, Chair, Professor of Computer Science and Software Engineering

Abstract

As data accumulates at a speed significantly faster than it can be processed, data preprocessing techniques such as feature selection become increasingly important and beneficial. Moreover, given the well-known gains of feature selection, any further improvements can positively affect a wide array of fields and applications. This research therefore explores a novel feature selection architecture, Class-specific Ensemble Feature Selection (CEFS), which finds class-specific subsets of features optimal to each available classification in the dataset. Each subset is then combined with a classifier to create an ensemble feature selection model which is further used to predict unseen instances. CEFS attempts to provide the diversity and base classifier disagreement sought after in effective ensemble models by providing highly useful, yet highly exclusive, feature subsets. CEFS is not a feature selection algorithm but rather a unique way of performing feature selection. Hence, it is also algorithm independent, suggesting that various machine learners and feature selection algorithms can benefit from the use of this architecture. To test this architecture, a comprehensive experiment is conducted, implementing the architecture with two different classifiers, three different feature selection algorithms, and ten different datasets. The results of this experiment show that the CEFS architecture outperforms the traditional feature selection architecture in every algorithmic combination and for every dataset. Moreover, the inclusion of high-dimensional datasets suggests that CEFS will scale up. Finally, the feature results obtained from the experiment suggest that vital class-specific information can be lost if feature selection is performed on the entire dataset as a whole, as opposed to in a class-specific manner.

Acknowledgements

The author thanks the following: The Almighty Lord for providing patience, wisdom, knowledge, and for placing such amazing people in my path. My wife Lacey for her unyielding love, support, and endless encouragement, for being so selfless, and for doing everything possible to help, every step of the way. Dr. Juan Gilbert for the constant push and drive to finish, and to excel in all I do; also, for caring enough to always check in and offer help, by whatever means necessary. This never would have gotten finished without your help. My amazing family (Yara, George, Bruna, Junior, Lysandra, and Duda), for being the greatest family a person could ask for and for always making me believe I could do anything I set my mind to. Winard Britt for his insightful ideas and great help, despite the long discussions. The Southern Regional Education Board (SREB) and the State of Alabama for helping to fund the last three years of my life.
The Doctoral Fellowship was the biggest blessing I could hope for and it helped in ways which can't be described. And to so many others who helped along the way, thank you!!!

Table of Contents

Abstract
Acknowledgements
List of Figures
List of Tables
1. Introduction
  1.1. Motivation
  1.2. Problem Description
  1.3. Goals and Contributions
  1.4. Organization
2. Supervised Learning: Classification
  2.1. Background
  2.2. General Classification Process
  2.3. Performance Evaluation
  2.4. Algorithmic Framework
    2.4.1. Logic based Algorithms
      2.4.1.1. Decision Trees
      2.4.1.2. Other rule-based systems
    2.4.2. Perceptron-based Algorithms
    2.4.3. Statistical-based Algorithms
    2.4.4. Instance-based Algorithms
    2.4.5. Support Vector Machines
  2.5. Overfitting
  2.6. Ensemble Methods
3. Feature Selection
  3.1. Background
  3.2. General Feature Selection Process
    3.2.1. Subset Generation
      3.2.1.1. Complete Search
      3.2.1.2. Sequential Search
      3.2.1.3. Random Search
    3.2.2. Subset Evaluation
      3.2.2.1. Independent Criteria
      3.2.2.2. Dependent Criteria
      3.2.2.3. Hybrid Criteria
    3.2.3. Stopping Criterion
    3.2.4. Result Validation
  3.3. Embedded Methods
  3.4. General Algorithmic Framework
  3.5. Common Misconceptions
  3.6. Relevance and Optimality of Features
  3.7. Feature Selection in Unsupervised Learning
  3.8. Feature Selection in Supervised Learning
  3.9. Ensemble Feature Selection
4. Class-specific Ensemble Feature Selection (CEFS)
5. Implementation
  5.1. Proposed Hypothesis
  5.2. Experiment
    5.2.1. Data
      5.2.1.1. Breast Cancer Dataset
      5.2.1.2. Wine Dataset
      5.2.1.3. Credit Approval Dataset
      5.2.1.4. Congressional Voting Dataset
      5.2.1.5. Montgomery County Jails Dataset
      5.2.1.6. Musk (version 1) Dataset
      5.2.1.7. Arrhythmia Dataset
      5.2.1.8. Madelon Dataset
      5.2.1.9. Colon Cancer Dataset
      5.2.1.10. Lymphoma Dataset
    5.2.2. Tools
    5.2.3. Procedure
    5.2.4. Procedure Validation
    5.2.5. Algorithms
      5.2.5.1. Algorithm Procedural Specifications
      5.2.5.2. Algorithm Constraints
    5.2.6. Results
      5.2.6.1. Performance
      5.2.6.2. Domain Understandability
6. Conclusion
References
Appendix A. Performance during Individual Runs
Appendix B. Best found overall and class-specific feature subsets per individual runs

List of Figures

Figure 1: General classification process
Figure 2: ID3 Decision Tree
Figure 3: Multilayer Neural Network with one hidden layer and four inputs
Figure 4: Maximum Margin Hyperplane
Figure 5: A logical view of the ensemble learning method
Figure 6: Feature Selection within general classification process
Figure 7: Feature Selection Process
Figure 8: Pseudo-code for Sequential Forward Selection (SFS) Algorithm
Figure 9: Pseudo-code for Sequential Backward Selection (SBS) Algorithm
Figure 10: Pseudo-code for GA driven Feature Selection Algorithm
Figure 11: Filter approach to feature selection
Figure 12: Generalized Filter Algorithm
Figure 13: Wrapper approach to feature selection
Figure 14: Generalized Wrapper Algorithm
Figure 15: Generalized Hybrid Algorithm
Figure 16: Taxonomy of Feature Selection Algorithms
Figure 17: Feature Subset Space
Figure 18: CEFS Procedure
Figure 19: Graphical comparison of feature selection architectures based on dataset accuracy averages (under NBC/SBS algorithms)
Figure 20: Graphical comparison of feature selection architectures based on dataset accuracy averages (under kNN/SBS algorithms)
Figure 21: Graphical comparison of feature selection architectures based on dataset accuracy averages (under kNN/SFS algorithms)
Figure 22: Graphical comparison of feature selection architectures based on dataset accuracy averages (under kNN/GA algorithms)

List of Tables

Table 1: Required Sample Size for given Number of Dimensions
Table 2: Play Tennis dataset
Table 3: Confusion Matrix for a two-class prediction problem (with class labels)
Table 4: Confusion Matrix for a two-class prediction problem (statistical representation)
Table 5: Categorical Framework of Feature Selection Algorithms
Table 6: Dataset characteristics
Table 7: Tabularized comparison of feature selection architectures based on dataset accuracy averages (under NBC/SBS algorithms)
Table 8: Tabularized comparison of feature selection architectures based on dataset accuracy averages (under kNN/SBS algorithms)
Table 9: Tabularized comparison of feature selection architectures based on dataset accuracy averages (under kNN/SFS algorithms)
Table 10: Tabularized comparison of feature selection architectures based on dataset accuracy averages (under kNN/GA algorithms)
Table 11: Best found subsets for run #1 of kNN/SFS (using Congressional Voting dataset)

Chapter 1
Introduction

As data acquisition has become increasingly easier and data storage has become increasingly inexpensive, the usage and storage of exceedingly large databases has become widespread. This has caused a dramatic shift in the mindset of data storage and acquisition: a shift from a one-megabyte database being considered very large in 1989 to numerous multi-terabyte databases being mined regularly by the year 2000 [1]. In fact, approximately 15 petabytes of new data are created every day worldwide, more than eight times the amount of data contained in all U.S. libraries [2]. As a result, many routine information technology problems such as inventory management and systems automation have given way to multi-dimensional problems of analysis such as genomic research, deciphering global trade patterns from foreign exchange flows, or predicting disease states [2] [3].

Unfortunately, data processing has not kept up with data acquisition. That is, as computer and database technologies continue to improve, data accumulates at a speed significantly faster than the capacity of data processing [4]. Such data accumulation produces significantly large databases. These databases can be large due to a high number of instances1, a high number of features2, or a combination of both. An example of a high-instance database is the Google index, which reportedly contained one billion unique URLs by 2000 and one trillion by 2008 [5]. On the other hand, microarray datasets present a fitting example of high-dimensional datasets, often containing a small number of instances but each with thousands of attributes [6].

1 For the remainder of this work, instances may also be referred to as "cases", "vectors", "samples", or "observations".
2 For the remainder of this work, features may also be referred to as "variables", "dimensions", "attributes", or "genes" (in the case of microarray data).

This research will focus solely on the issue of high-dimensional data as it pertains to classification tasks within supervised learning. More specifically, it will do so through the use of feature selection, a technique which can reduce the cost of recognition and often provide better classification accuracy by reducing the number of features [7].
Traditionally, feature selection has been carried out on the entire dataset at once, assuming that the most salient features for learning will be uniform among all instances, regardless of classification (i.e. class label). Research recently conducted in [8], however, has shown that classification accuracy can be improved when feature selection is conducted in a class-specific manner and organized into an ensemble machine for final classification. The research further states that the potential for improvement rests largely on how the ensemble's modules are recombined when they face disagreement [8]. Based on the aforementioned research, the hypothesis explored in this research contains four parts:

1. There exists a recombination technique such that, when applied to a Class-specific Ensemble Feature Selection (CEFS) architecture, it will allow the system to outperform a traditional feature selection architecture in terms of prediction accuracy.
2. The same CEFS architecture will scale up by continuing to outperform the traditional feature selection architecture when the number of features is increased, as well as when the number of class labels is increased.
3. The union of the sets of relevant3 class-specific features may not be inclusive of the set of relevant features found traditionally for the entire dataset.
4. Conversely, the set of relevant features found traditionally for the entire dataset may not be inclusive of the union of the sets of relevant features for each class label.

3 The formal definition of relevance as it pertains to features will be detailed in Section 3.6 of this work.

Although all parts of the hypothesis revolve around exploring the notion of a class-specific feature selection architecture, each part will provide useful insight into the benefits afforded by such a system. Parts 1 and 2 of the hypothesis will explore the realization and development of a more accurate and scalable learning system through CEFS. Parts 3 and 4 will investigate the potential information that may be gained by CEFS, and conversely, the information which may be missed or lost through the use of a traditional feature selection architecture. Moreover, the combination of class-specific optimal features and an ensemble machine will allow the system to take advantage of the well documented benefits of ensemble methods [9] [10] [11] by providing the data diversity and classifier disagreement necessary for ensemble systems to succeed, thereby increasing prediction performance (better accuracy). Allowing each classification to have its own set of class-specific optimal features, and thus its own ensemble classifier, will create an ensemble system that should allow for more accurate classification of unseen instances. As an added benefit, new information may be learned as to the relevance of certain features to certain classifications, information that may otherwise have remained unknown due to performing feature selection on all instances, and thus all classifications, at once. This would be of particularly good use in problems such as cancer prediction [12] [13], where certain features may be optimal in detecting certain types of cancer, but irrelevant for other types.

1.1 Motivation

According to [14], a learning algorithm is good if it produces hypotheses that do a good job of predicting classifications of unseen examples. With that in mind, the primary goal of this research is to increase prediction accuracy, thereby creating a better learning algorithm. Note,
Note, 4 however, that the focus here is not on the learning algorithm itself, but rather on the pre- processing methodology used to prepare the training data for optimal learning. The first reason is because the architecture presented here is algorithm independent and can be combined with just about any learning algorithm. The second reason is research suggests that much of the power in classification comes not from the specific learning method, but from proper formulation of the problems and crafting the representation to make learning tractable [15]. In addition, the same research suggests that performance is roughly equivalent between various learning algorithms, across many domains. In other words, the preparation and representation of the data is of the utmost importance given that no single learning algorithm outperforms all others on every domain. In a nutshell, this represents the well-known theorems of ?No Free Lunch? [16]. Although a dataset may be well prepared and represented, it may also fall victim to Belman?s ?curse of dimensionality?, a problem which often plagues high-dimensional datasets. The curse of dimensionality refers to the exponential growth of hypervolume as a function of dimensionality [17]. That is, the more dimensions in a dataset, the more representation needed by the predicting model. For classification, this can mean that there are not enough data objects to allow the creation of a model that reliably assigns a class to all possible objects [18]. Moreover, when the set of features in the data is sufficiently large, many induction algorithms become simply intractable [19]. In the context of Neural Networks (NN), for example, the curse of dimensionality causes networks with irrelevant inputs to behave badly: the dimension of the input space is high, and the network uses almost all its resources to represent irrelevant portions of the space [20]. In [21], Silverman illustrates the difficulty of kernel estimation in high dimensions. As shown in Table 1, even at a small dimensionality of 10, roughly 840,000 samples are required to estimate the density at 0 with a given accuracy [21]. 5 Table 1: Required Sample Size for given Number of Dimensions [21] Dimensionality Required Sample Size 1 4 2 19 5 786 7 10,700 10 842,000 In 2003, research conducted by Guyon and Elisseeff reported on gene expression datasets up to 60,000 variables wide, and text classification datasets 15,000 variables wide and nearly 1 million instances deep [22]. Those numbers continue to grow daily, as well as in other fields such as satellite imagery, hyperspectral imagery, financial data, consumer data, and etc [23]. In fact, the UCI Machine Learning Repository, where many of the datasets used in this research are attained from, contains two datasets 100,000 features wide and one dataset spanning over 3 million features [24]. Improving prediction accuracy within problems of high data dimensionality is the primary motivation for using feature selection. Feature selection can also significantly improve the comprehensibility of resulting classifier models, generated rules, or relationships among features. For instance, methods such as ID3 [25] or C4.5 [26] can create very complex tree structures on high-dimensional datasets, making it almost impossible to comprehend with the naked eye. By removing many of the unnecessary and noisy features, the resulting structure will be simpler and comprehensible, allowing for a better understanding of the problem domain and the relationships between features. 
The key benefit of feature selection is that it directly targets the curse of dimensionality, since it can eliminate irrelevant and redundant features and reduce noise in the dataset. Moreover, as the dimension of the feature space increases, so does the probability of over-fitting, and as shown by [27], feature selection provides a powerful means to avoid over-fitting. Feature selection can also lead to a more understandable model, allow the data to be more easily visualized, and, with a reduction in dimensionality, decrease the amount of time and memory required by learning algorithms [18]. The motivation for this research, however, goes deeper into the methodology of feature selection, a topic that will be further explored later in this work.

1.2 Problem Description

Although there are a plethora of feature selection algorithms, all differing in some shape or form, they all share a commonality: they all attempt to reduce the number of features by selecting an optimal subset of the original feature set. This subset, however, almost always contains features that are assumed to be optimal for the entire dataset. This leads to the assumption that the features most optimal for the dataset must be optimal for instances across all classifications. Therein lies the problem, and thus the basis for this research. The hypothesis presented in this work is based on the notion that by allowing the feature selection algorithm to focus its attention on the prediction accuracy of a single classification at a time, the algorithm will produce a separate subset of features optimal to each classification, and thus create a more robust system, one better fit to classify unseen instances. These class-specific subsets can then be utilized to create separate classification models, thereby taking advantage of another technique proven to improve classification accuracy: ensemble learning (or, in this case, ensemble feature selection). In summary, this research will attempt to improve on the problem of prediction accuracy by successfully developing and analyzing a class-specific ensemble feature selection architecture.

1.3 Goals and Contributions

As is the case with most novel learning systems, the primary goal of this research is to improve prediction accuracy. This will be done through the development and analysis of the aforementioned ensemble feature selection architecture. Accordingly, this research will attempt to achieve the following detailed goals through the use of a Class-specific Ensemble Feature Selection (CEFS) architecture:

• Find an effective recombination technique for occurrences of disagreement between the ensemble classifiers
• Improve prediction accuracy over baseline traditional feature selection methods
• Examine the scalability of CEFS in terms of number of features and number of classifications
• Explore the inclusiveness/exclusiveness between feature subsets attained through CEFS versus those attained through traditional feature selection methods

The immediate contribution of this research will be the introduction of a novel architecture to the areas of feature selection, ensemble learning, and machine learning. Additionally, as dataset sizes continue to grow in every discipline, the proposed architecture can further contribute in a much more widespread manner, as it is data and algorithm independent. Given its class-specific design, the architecture may also have far reaching implications for fields in science, medicine, vision, finance, security, biometrics, etc.
1.4 Organization

This thesis began with a brief introduction to the disciplines and methods explored in this research. The remainder of this thesis is organized as described below.

Chapter 2 begins by discussing classification tasks as part of supervised learning. The chapter provides a brief background on classification, as well as a general procedure for conducting classification tasks. It then details classifier performance metrics and presents a variety of classification algorithms. It further provides a brief discussion of overfitting within supervised learning tasks and concludes with key background information on ensemble learning machines.

Chapter 3 explains feature selection in detail. Following a brief introduction and background on feature selection, a general procedure for feature selection is provided. Then, the chapter describes embedded feature selection methods and presents various other feature selection algorithms. Sections on common misconceptions, and on the relevance and optimality of features, are also presented. The use of feature selection in unsupervised and supervised learning is then discussed, concluding with thorough explanations of the different types of feature selection algorithms and existing work in ensemble feature selection.

Chapter 4 provides an in depth explanation of the methodology behind Class-specific Ensemble Feature Selection (CEFS). This chapter includes the theoretical, architectural, and algorithmic frameworks necessary for the design and implementation of CEFS.

Chapter 5 presents an implementation of the suggested architecture, along with results attained from an experiment spanning several algorithms and ten datasets. Moreover, experiment parameters, analysis, and conclusions are discussed. Lastly, Chapter 6 concludes with final observations and analysis of the conducted research, and possible future work.

Chapter 2
Supervised Learning: Classification

The field of machine learning is often divided into three general disciplines: supervised, unsupervised, and reinforcement learning. Supervised learning involves learning a function from examples of its inputs and outputs. Unsupervised learning involves learning patterns in the input when no specific output values are supplied. Reinforcement learning is typically the broadest of the three; rather than being told what to do by a teacher, a reinforcement learning agent must learn from reinforcement, often including the subproblem of learning how the environment works [14]. Supervised learning can be further divided based on the function's output values. Learning a function with continuous output values is called regression learning; conversely, learning a discrete valued function is called classification learning [14]. The architecture presented in this research will concentrate on classification learning only.

Accordingly, this chapter is organized as follows. First, background information on classification learning is provided. Next, a general process for classification tasks is detailed, followed by common metrics for evaluating classifier performance. Then, an algorithmic framework for classifiers is presented, describing a variety of different classifiers. The chapter will then discuss model overfitting and conclude with general information regarding ensemble learning methods.

2.1 Background

Classification is a well studied, thoroughly researched, and pervasive problem which encompasses many diverse applications.
Examples include cancer prediction based on MRIs, galaxy classification based on galaxy shapes, and flower species categorization based on petal size and shape [18]. The input data in classification tasks is a collection of records or instances. Each instance is represented by a tuple (x, y), where x is the feature set and y is the target output or class label. Formally, classification is defined as the task of learning a target function f that maps each feature set x to one of the predefined class labels y. The target function can also be referred to as a classification model [18]. Table 2 displays a sample dataset used for predicting whether or not to play tennis based on weather outlook, temperature, humidity, and wind.

Table 2: Play Tennis dataset [26]
Day  Outlook   Temperature  Humidity  Wind    Play Tennis?
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         Normal    Weak    Yes
D5   Rain      Cool         High      Strong  Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Weak    Yes
D8   Sunny     Mild         High      Strong  Yes
D9   Sunny     Cool         Normal    Weak    No
D10  Rain      Mild         Normal    Weak    No

Classification also requires that the class labels be of a categorical nature. Although target values are most often already finite and discrete, they can also sometimes be continuous. In that case, a process called discretization must be performed to transform the output values from continuous to categorical. In this process, the continuous range of output values is typically divided into a fixed set of intervals, each then represented by a discrete class label. Additionally, classification methods are best suited for predicting datasets with unordered nominal or binary class labels. They are less effective in predicting ordered categories (e.g., classifying humidity as high, medium, or low), as they do not consider the implicit ordering among the categories.

2.2 General Classification Process

Most classification techniques follow a similar progression of steps when performing classification tasks. This progression is shown in Figure 1. The classification technique itself (i.e. the classifier) is a systematic approach which builds classification models from a training set with known target outcomes. The training set is first sent to the classifier so that induction can take place. The classifier then builds a model that best fits the relationship between the training data's feature set and its target outcomes. The finished model then receives a test set for deduction and final performance validation.

Figure 1: General classification process

It is also of vital importance that the model generated by the classifier both fits the training data and correctly predicts the class labels of the test data. Thus, a key objective of the classifier is to build models that generalize well, i.e., models that accurately predict unseen instances. Models that fail to provide this generalizing capability are said to overfit the data. A more thorough discussion of overfitting will be provided in Section 2.5 of this chapter.

2.3 Performance Evaluation

Evaluation of the performance of the model is based on the number of instances correctly and incorrectly classified by the model [18]. These counts are organized into a table known as a confusion matrix. Table 3 displays a confusion matrix for a two-class (i.e. binary) classification problem. Each entry fij in Table 3 represents the number of instances from class i classified as class j.
Based on these entries, the total number of correct predictions made by the model is (f11 + f00), and conversely, the total number of incorrect predictions is (f10 + f01).

Table 3: Confusion Matrix for a two-class prediction problem (with class labels)
                    Predicted Class = 1    Predicted Class = 0
Actual Class = 1    f11                    f10
Actual Class = 0    f01                    f00

Although a confusion matrix provides the necessary information to gauge performance, it is much more useful to have a single measure which can summarize the matrix's information, even more so on problems containing a higher number of classifications. Performance metrics such as accuracy and error rate, as defined below, provide this useful measure of performance.

$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{f_{11} + f_{00}}{f_{11} + f_{00} + f_{10} + f_{01}}$ (Eq. 1)

$\text{Error rate} = \frac{\text{Number of incorrect predictions}}{\text{Total number of predictions}} = \frac{f_{10} + f_{01}}{f_{11} + f_{00} + f_{10} + f_{01}}$ (Eq. 2)

A slightly different representation of the confusion matrix is displayed in Table 4. Although similar information is provided, it is important to note the different terminology, as it is prevalent in machine learning research. Moreover, certain classification tasks, such as cancer prediction, may actually measure performance not solely based on accuracy or error rate, but also on the number of false negatives. For example, diagnosing an individual with cancer when in fact they don't have it (a false positive) will certainly bring emotional pain and anguish; however, diagnosing an individual as cancer free when in fact they have the disease (a false negative) may very well cost that individual's life. Hence, many classification tasks attempt to minimize the number of false negatives, also known as Type II Error. Conversely, the number of false positives is also known as Type I Error.

Table 4: Confusion Matrix for a two-class prediction problem (statistical representation)
              Predicted YES                       Predicted NO
Actual YES    tp (true positive)                  fn (false negative) *Type II Error
Actual NO     fp (false positive) *Type I Error   tn (true negative)
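To make the confusion-matrix bookkeeping concrete, the short Python sketch below (not taken from the dissertation; the function names and the toy label vectors are illustrative assumptions) tallies the fij counts of Table 3 and computes accuracy and error rate exactly as in Eqs. 1 and 2.

# Minimal sketch: building the two-class confusion matrix of Table 3 and
# computing Eqs. 1 and 2 from its counts.
def confusion_counts(actual, predicted, positive=1):
    """Return (f11, f10, f01, f00), where f_ij = # of class-i instances predicted as class j."""
    f11 = f10 = f01 = f00 = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            f11 += 1
        elif a == positive:
            f10 += 1          # actual positive, predicted negative
        elif p == positive:
            f01 += 1          # actual negative, predicted positive
        else:
            f00 += 1
    return f11, f10, f01, f00

def accuracy(f11, f10, f01, f00):
    return (f11 + f00) / (f11 + f00 + f10 + f01)      # Eq. 1

def error_rate(f11, f10, f01, f00):
    return (f10 + f01) / (f11 + f00 + f10 + f01)      # Eq. 2

# Example: 10 predictions against 10 true labels
y_true = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(accuracy(*counts), error_rate(*counts))         # 0.8 and 0.2

The same four counts can simply be relabeled as tp, fn, fp, and tn to obtain the statistical representation of Table 4.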
Although a variety of other performance metrics exist (e.g. Balanced Error Rate or Area under the ROC curve [28]), two other important metrics must be further defined, as they are discussed later in this document. These two metrics are precision and recall. Precision can be seen as a measure of fidelity, whereas recall is a measure of completeness. In a classification task, a precision score of 1.0 for a class C means that every item labeled as belonging to class C does indeed belong to class C (but does not specify the number of items from class C that were not labeled correctly). Conversely, a recall score of 1.0 suggests that every item from class C was labeled as belonging to class C (but says nothing about how many other items were also incorrectly labeled as belonging to class C). Often, the two have an inverse relationship, where an increase in one comes at the cost of a reduction in the other. Precision and recall are formally defined in Eqs. 3 and 4, respectively [29] [28].

$\text{Precision} = \frac{\text{total number of true positives}}{\text{total number of predicted positives}} = \frac{tp}{tp + fp}$ (Eq. 3)

$\text{Recall} = \frac{\text{total number of true positives}}{\text{total number of actual positives}} = \frac{tp}{tp + fn}$ (Eq. 4)

2.4 Algorithmic Framework

Given the plethora of induction algorithms existing within the machine learning canon, a complete and detailed assessment of the algorithmic framework in classification is outside the scope of this research. However, given the extensibility and flexibility of the CEFS architecture, it is important to provide a broad collection of learners which adequately represent the discipline. Accordingly, the remainder of this section will discuss logic based algorithms (i.e. Decision Trees and other rule-based classifiers), perceptron-based algorithms (i.e. Neural Networks), statistical-based algorithms (i.e. Naïve Bayes classifiers), instance-based algorithms (i.e. k-nearest neighbor), and Support Vector Machines.

2.4.1 Logic based Algorithms

2.4.1.1 Decision Trees

Decision Trees (DT) [30] [25] [26] are perhaps one of the two most recognizable classification algorithms (Neural Networks being the other). They are trees which classify instances by sorting them based on feature values, where each node in the tree represents a feature in an instance to be classified, and each branch represents a value which the node can assume. The feature that best divides the training data becomes the root node of the tree. The next best feature becomes an internal node, if it contains outgoing branches, or a leaf node, if it contains no outgoing branches. There are numerous ways of measuring how well features divide the training data (e.g. Information Gain [31] or GINI [30]). Instances are then classified beginning at the root node and sorted based on their feature values [32]. Figure 2 depicts a decision tree generated using an ID3 implementation [25], based on the Play Tennis dataset shown previously in Table 2. As seen, the biggest advantage of DTs is their comprehensibility.

Figure 2: ID3 Decision Tree

Using the decision tree displayed in Figure 2, the instance (Outlook = Rain, Temperature = Hot, Humidity = Normal, and Wind = Weak) would be classified as "Yes". Also note that the Temperature feature is not part of the generated tree. This can often be due to techniques such as pruning, which removes nodes to prevent overfitting and reduce tree complexity [32]. It may also be the case that Temperature simply does not offer any further discriminating power beyond what is already in the tree, and thus it is not included.
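As a concrete illustration of the split measures mentioned above, the following sketch (illustrative only, not the dissertation's implementation; the helper names and the truncated four-row sample of Table 2 are assumptions) computes the entropy-based Information Gain that ID3-style tree builders use to pick the best splitting feature.

import math
from collections import Counter

# Minimal sketch: entropy-based information gain, the kind of measure ID3
# uses to decide which feature should become the root split.
def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    """Gain = H(labels) minus the weighted entropy of labels within each feature value."""
    base = entropy(labels)
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[feature_index], []).append(label)
    remainder = sum(len(subset) / len(labels) * entropy(subset) for subset in by_value.values())
    return base - remainder

# Rows follow Table 2: (Outlook, Temperature, Humidity, Wind) -> Play Tennis?
rows = [("Sunny", "Hot", "High", "Weak"), ("Sunny", "Hot", "High", "Strong"),
        ("Overcast", "Hot", "High", "Weak"), ("Rain", "Mild", "Normal", "Weak")]
labels = ["No", "No", "Yes", "Yes"]
for i, name in enumerate(["Outlook", "Temperature", "Humidity", "Wind"]):
    print(name, information_gain(rows, labels, i))

The feature with the largest gain would be chosen as the root; the procedure is then repeated recursively on each branch.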
The problem of constructing optimal binary decision trees is NP-complete. Moreover, decision trees are typically univariate and do not perform well on problems which require diagonal partitioning. They also tend to perform better on problems dealing with discrete/categorical features [32], as continuous valued features must first go through discretization in order to be utilized by the algorithm. For further and more complete reviews of Decision Trees, refer to [33] or [34].

2.4.1.2 Other rule-based systems

Although decision trees can create a set of rules by traversing down each branch of the tree, rule sets can also be induced directly by a variety of rule-based algorithms. Rules come in the form if antecedent, then consequence. The antecedent contains values taken from the feature set, whereas the consequence contains the target outcome [35]. The difference between the heuristics for decision trees and most other rule-based learners is that the former evaluate the quality of disjoint sets, while the latter evaluate the quality of the set of instances covered by each candidate rule. Thus, it is important for rule-based systems to generate rules which have high predictability or reliability. It is also important to build the smallest rule set that is consistent with the training data, as a large number of rules is often a sign that the algorithm is attempting to "remember" the training data, thereby not generalizing well (i.e. overfitting) [32]. RIPPER [36], a well-known rule-based algorithm, forms rules through repeated growing and pruning. In the growing phase, rules are made more restrictive in order to fit the training data as closely as possible. Conversely, in the pruning phase, rules are made less restrictive in order to avoid overfitting. Much like DTs, the most appealing characteristic of rule-based systems is their easy comprehensibility. For more on rule-based systems, please refer to [37].

2.4.2 Perceptron-based Algorithms

As aforementioned, perceptron-based algorithms (e.g. Neural Networks) are among the most well known and often implemented classification algorithms. A single layer perceptron can be described as follows: if x1 through xn are input feature values and w1 through wn are connection weights between the values and the predicted output, then the output of the perceptron is computed by its weighted sum, i.e., $\sum_{i=1}^{n} x_i w_i$, followed by an adjustable threshold function. If the sum is above the threshold, the output is 1; otherwise it is 0 [32]. Single layer perceptrons are limited, however, in that the algorithm can only properly classify problems which are linearly separable [14]. In an attempt to solve this problem, multilayered perceptrons (i.e. Artificial Neural Networks) were devised [38]. Multilayer artificial neural networks employ the addition of hidden units between the input and output units. The advantage of adding hidden layers is that it enlarges the hypothesis space which the network can represent. Each hidden unit can be thought of as a perceptron that represents a soft threshold function in the input space; the output can then be thought of as a soft-threshold linear combination of several such functions [14]. Figure 3 displays a common multilayer neural network with one hidden layer and four inputs.

Figure 3: Multilayer Neural Network with one hidden layer and four inputs

Many variations can be implemented in neural networks.
For example, any given implementation can have a different number of hidden units, a different number of connections, extra neurons, a different number of layers, different activation functions, and beyond. Moreover, the weights between each layer have to be trained before classification can take place. A wide variety of weight training techniques also exist (e.g. back propagation [14] or training weights using Genetic Algorithms [39]). Although Neural Networks can be trained to be great learners, given all of their varying characteristics, their biggest drawback is training time. A more in depth review of Neural Networks can be found in [40].

2.4.3 Statistical-based Algorithms

Statistical learning algorithms are unique in that they provide a probability that an instance belongs to each class, rather than simply predicting a specific classification. Maximum Entropy, for instance, works on the notion that when nothing is known, the distribution should be as uniform as possible. In this technique, training data is used to derive a constraint set for the model which calculates the class-specific expectations for the distribution. Likely the most well known statistical learning algorithm is the Bayesian Network [32]. Bayesian systems are based first on the notion of conditional probabilities. The notation used for conditional probabilities is P(a|b), where a and b are any propositions. Conditional probability is defined in Eq. 5. From this definition, Bayes' rule (Eq. 6) can be derived. This simple rule underlies all modern AI systems for probabilistic inference and is the second basis for Bayesian systems [14].

$P(x|y) = \frac{P(x \wedge y)}{P(y)}$ (Eq. 5)

$P(y|x) = \frac{P(x|y)\,P(y)}{P(x)}$ (Eq. 6)

A Bayesian Network is then made up of a structural model and a set of conditional probabilities. The model is a directed graph with nodes representing features and arcs representing feature dependencies. These dependencies are quantified by conditional probabilities for each node given its parents [41]. So, first assume that A1, A2, …, An are the n features in the training set. An example E can be represented by <a1, a2, …, an>, where ai is the value of Ai. Also, let C represent the class label in the training set, let c represent a value that C takes, and let c(E) denote the class of E. The classifier represented by a general Bayesian network can then be defined according to Eq. 7. Furthermore, if all features are assumed to be independent given the class (the assumption of conditional independence), the resulting classifier is a Naïve Bayes (Eq. 8) [41].

$c(E) = \arg\max_{c \in C} P(c)\,P(a_1, a_2, \ldots, a_n \mid c)$ (Eq. 7)

$c(E) = \arg\max_{c \in C} P(c)\prod_{i=1}^{n} P(a_i \mid c)$ (Eq. 8)

Simply put, the conditional independence assumption states that each variable in the dataset is conditionally independent of the others given the class. Under this assumption, instead of computing the class-conditional probability of every combination of a, the algorithm needs only to estimate the conditional probability of each ai given c. This approach is more practical since it does not require a large training set to obtain a good estimate of the probability [18]. However, this approach is also what often makes the algorithm less accurate than many of its counterparts (e.g. Neural Networks) [32]. This is because most domains exhibit dependent relationships within their features, breaking down the foundation of conditional independence among variables, and thus negatively affecting the classifier.
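To see how the factorization in Eq. 8 plays out in practice, here is a minimal Naïve Bayes sketch (an illustration, not the dissertation's code; the class name, the add-one smoothing, and the four-row Play Tennis sample are assumptions) that estimates P(c) and P(ai|c) by counting and predicts with the argmax of their product.

from collections import Counter, defaultdict

# Minimal sketch: a categorical Naive Bayes classifier implementing Eq. 8
# with simple frequency estimates.
class NaiveBayes:
    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.total = len(labels)
        # value_counts[c][i][v] = # of class-c instances whose i-th feature equals v
        self.value_counts = defaultdict(lambda: defaultdict(Counter))
        for row, c in zip(rows, labels):
            for i, v in enumerate(row):
                self.value_counts[c][i][v] += 1
        return self

    def predict(self, row):
        best_class, best_score = None, -1.0
        for c, n_c in self.class_counts.items():
            score = n_c / self.total                    # P(c)
            for i, v in enumerate(row):
                # P(a_i | c), with add-one smoothing so unseen values do not zero the product
                score *= (self.value_counts[c][i][v] + 1) / (n_c + len(self.value_counts[c][i]) + 1)
            if score > best_score:
                best_class, best_score = c, score
        return best_class

# Toy usage on four rows of the Play Tennis data (Table 2)
rows = [("Sunny", "Hot", "High", "Weak"), ("Sunny", "Hot", "High", "Strong"),
        ("Overcast", "Hot", "High", "Weak"), ("Rain", "Mild", "Normal", "Weak")]
labels = ["No", "No", "Yes", "Yes"]
model = NaiveBayes().fit(rows, labels)
print(model.predict(("Sunny", "Mild", "High", "Weak")))

Because every P(ai|c) is estimated independently, training amounts to a single counting pass, which is precisely why the approach needs so little data.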
2.4.4 Instance-based Algorithms

Instance-based algorithms typically differ from other learning algorithms in that no training phase is necessary and no explicit model of the training data is created. That is, an instance-based learner reuses the training set each time a new instance needs to be classified. For this reason, instance-based learners are also called lazy learners, as the training data is stored at training time and the learning is delayed until classification time [42] [43]. Perhaps the most unique characteristic of an instance-based algorithm is its use of a proximity measure [18]. This measure determines how close (distance) or how similar one instance is to another. The most common way to measure this is by calculating the Euclidean distance, which is the distance between two points in n-dimensional space. Instance-based algorithms are implemented under the assumption that unseen instances will share the classification of other instances that are nearby in Euclidean space [42]. The standard Euclidean distance d(x, y) between instances x and y in n-dimensional space is defined as follows [43]:

$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$ (Eq. 9)

Based on the aforementioned assumption of distance, the basic approach of a k-nearest neighbor (kNN) classifier [44] is to calculate the distance (or similarity) between each test case z = (x', y') and all of the training cases (x, y) ∈ D to determine its nearest neighbor list, Dz [18]. Then, from Dz (which contains the k closest training instances to z), the classification of the new instance is computed. In other words, the outcomes of the k nearest training instances determine the outcome of the new instance that needs to be classified. Perhaps the most common way of determining the final output is by majority voting, also known as Discrete kNN. The Discrete k-nearest neighbor algorithm specifies that the most common outcome of the k nearest instances will be used as the outcome for the new instance. In other words, whichever outcome occurs most frequently will be the outcome for the instance to be classified. Once Dz is obtained, the outcome of the test case is based on the majority vote of its nearest neighbors, which is formally defined as:

$y' = \arg\max_{v} \sum_{(x_i, y_i) \in D_z} I(v = y_i)$ (Eq. 10)

In Eq. 10, v ranges over all of the possible outcomes (classes), yi represents the class label of one nearest neighbor, and I(·) is an indicator function that returns one (1) if its argument is true and zero (0) if not. The class with the highest value is then chosen as the outcome. Using majority voting, each neighbor affects the classification equally [18]. The majority voting approach is applied solely to the Discrete kNN algorithm.
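Before turning to Support Vector Machines, the instance-based approach above can be summarized in a short sketch (illustrative assumptions only: the helper names and toy training pairs are not from the dissertation). It computes the Euclidean distance of Eq. 9, gathers the nearest-neighbor list Dz, and applies the majority vote of Eq. 10.

import math
from collections import Counter

# Minimal sketch: a discrete kNN classifier built directly from Eq. 9 and Eq. 10.
def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))   # Eq. 9

def knn_predict(train, z, k=3):
    """train is a list of (feature_vector, label) pairs; z is the unseen feature vector."""
    # D_z: the k training cases closest to z
    neighbors = sorted(train, key=lambda case: euclidean(case[0], z))[:k]
    # Eq. 10: majority vote over the labels of D_z
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy usage with two numeric features
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((4.0, 4.2), "B"), ((3.8, 4.0), "B")]
print(knn_predict(train, (1.1, 0.9), k=3))   # the nearest neighbors are mostly "A"

Note that all of the work happens at prediction time; nothing is precomputed, which is exactly what makes the learner "lazy".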
2.4.5 Support Vector Machines

A considerable amount of attention has been afforded to Support Vector Machines (SVM) recently. In part, this is because SVMs work so well with high-dimensional data by avoiding the curse of dimensionality [18]. SVMs revolve around the concept of a maximal margin hyperplane. That is, if the dataset is linearly separable, then there exists a hyperplane such that all of the instances from one class can be separated from all instances of another. An example of this is shown in Figure 4 [32]. However, as also seen from this example, there exists an infinite number of hyperplanes with the same property. Moreover, although their training errors are zero, there is no guarantee that these hyperplanes will perform as well at test time. So, the algorithm searches for the hyperplane which offers the greatest class separation, i.e., the maximum margin hyperplane.

Figure 4: Maximum Margin Hyperplane

The rationale for choosing the hyperplane with the largest margin is that decision boundaries with large margins tend to generalize better than those with small margins. Intuitively, if the margin is small, any slight change in the decision boundary can significantly affect the system's performance. Hence, classifiers that produce decision boundaries with small margins are more susceptible to poor generalization and overfitting. Linear separability is not a requirement for SVMs, however. A soft margin approach allows SVMs to construct a linear decision boundary even in situations where the classes are not linearly separable [18]. More information on SVMs can be found in [45] or [46].

2.5 Overfitting

Errors committed by a classifier can be categorized as either training or generalization errors. Training errors are instances misclassified during training, and generalization errors are unseen instances misclassified at test time. As aforementioned, it is important for a model to not only fit the training data well, but to also accurately classify unseen records. When a classifier creates a model that fits the training data too well, essentially just "remembering" the training data, it can result in poor generalization. This is known as overfitting [18]. On the path to understanding model overfitting, note that the training error can typically be reduced by simply increasing the model complexity. For instance, the leaf nodes of a decision tree can be expanded until they perfectly fit the training data. Such a tree would yield a training error of zero, but at the cost of a much more complex tree and the high likelihood of a significant increase in generalization error. This is due to the addition of nodes which accidentally fit noise points in the training data, thus degrading generalization performance [18]. Overfitting can occur for a variety of reasons: noise in the training set, lack of representative samples, etc. However, there are also steps for decreasing the chances of generating an overfit model. Perhaps the most prevalent step is to reduce model complexity (e.g. pruning). The system can also use a separate validation set, rather than the training set itself, to estimate how well the model will generalize. Regardless of the technique, model overfitting is a problem which must be carefully watched in any classification task.

2.6 Ensemble Methods

The main goal of an ensemble is to construct multiple classifiers from the original data and then aggregate their predictions when classifying unknown instances. Figure 5 shows a basic view of an ensemble method [18]. As depicted, three main steps exist: training set generation, learning, and integration. Step 1 begins with the original training set D. From this training set, t data subsets are created (D1, D2, …, Dt).
Bagging and boosting are common ways to accomplish this step [18]. Then in Step 2, t base classifiers are generated (C1, C2, ..., Ct). These classifiers may all be the same, all different, or contain any combination of the same or different classifiers. Each classifier Ci is trained using the subset Di. Finally, in Step 3, the prediction of each classifier is combined in a predetermined way to produce the resulting classification.

Figure 5: A logical view of the ensemble learning method [18]

Two primary approaches exist for the integration phase: combination and selection. In the combination approach, the base classifiers produce their class predictions and the final outcome is composed using those predictions. In the selection approach, one of the classifiers is selected and the final prediction is the one produced by it [47]. The simplest and most common combination method is voting, also known as majority voting. In voting, the classification predicted by a base classifier is counted as a vote for that particular class value. The class value with the most votes becomes the final classification [47]. A simple and popular selection method is cross-validation majority (CVM) [48], which estimates the accuracy of each base classifier using cross-validation and selects the classifier with the highest accuracy. Although CVM is a selection method which chooses one classifier for the whole data space, more sophisticated selection methods which estimate local accuracy [49] or use meta-level classifiers do exist [50]. Perhaps the most commonly used integration techniques are voting [9], simple and weighted averaging, and a posteriori approaches [51] [52].

According to [53], the main objective when building the base classifiers is to maximize the coverage of the data, which is the percentage of the instances that at least one base classifier can classify correctly. Reaching coverage greater than the accuracy of the best classifier, however, requires diversity among the base classifiers [11] [51] [52]. Although research is always ongoing on new approaches to increase diversity, some methods include training on different subsets of the training set, using different learning algorithms, injecting randomness, and training on different sets of input features [54]. The latter is where ensemble feature selection is successfully applied.

Chapter 3
Feature Selection

Recall from Chapter 1 that research has shown that much of the power in classification comes not from the specific learning method, but from proper formulation of the problems and crafting the representation to make learning tractable. This suggests that the preparation and representation of the data is of the utmost importance for successful learning, given that many learning algorithms perform similarly on most domains [15]. In the context of machine learning, the formulation and transformation of data prior to learning is known as data preprocessing. Data preprocessing is itself a broad research area consisting of a number of different strategies and techniques interrelated in complex ways. These include techniques such as aggregation, sampling, and variable transformation. The key commonality within these and other data preprocessing techniques is that they all seek to improve learning with respect to time, cost, and quality. Simply put, data is of high quality if it is suitable for its intended use [18].
This is analogous to statistics and the experimental sciences, where particular emphasis is placed on the careful design of experiments to collect data relevant to the specific proposed hypothesis. Similarly, data preprocessing techniques improve the quality and representation of data so that classification tasks can be more efficient and effective. Perhaps one of the most studied and explored data preprocessing techniques is feature selection.

The remainder of this chapter is organized as follows. First, background information is provided on feature selection, followed by an in-depth explanation of the general feature selection process. Next, feature selection embedded methods are discussed. A general algorithmic framework and common feature selection misconceptions are addressed next. Then, a thorough discussion on the subject of relevance and optimality as it pertains to feature selection is presented. Next, brief sections on feature selection in unsupervised and supervised learning follow. Finally, previous research conducted in ensemble feature selection is summarized and explained, as it pertains highly to the architecture explored here.

3.1 Background

Feature selection can be described as a technique which finds a good subset of the original input features under some objective measure, such as prediction accuracy, structure size, or minimal use of input features [55]. Consistent with notions such as Occam's razor [57], minimum description length [56], and minimum message length, feature selection attempts to find structures which correctly predict unseen instances, yet are not so complex that they overfit the data [58]. Within the general classification process, previously shown in Figure 1, feature selection occurs just prior to induction, between the training data and the classifier (Figure 6).

Figure 6: Feature Selection within the general classification process

Furthermore, by extracting irrelevant and redundant information from a given dataset, significant computing time can be saved and models that generalize better to unseen points can be built [59]. Formally, feature selection can be defined as follows [7]:

Definition 1 (feature selection). Let Y be the original set of features, with cardinality n. Let d represent the desired number of features in the selected subset X, X ⊆ Y. Let the feature selection criterion function for the set X be represented by J(X). Without loss of generality, let us consider a higher value of J to indicate a better feature set. Formally, the problem of feature selection is to find a subset X ⊆ Y such that |X| = d and

$J(X) = \max_{Z \subseteq Y,\, |Z| = d} J(Z)$. (Eq. 11)

The relationship between feature selection and the performance of a learning algorithm, however, depends largely on whether or not the assumption of monotonicity holds, i.e.,

$J(A \cup B) \geq J(A) \quad \forall A, B \subseteq Y$. (Eq. 12)

This assumption implies that additional features always provide additional discriminating power, thus improving the performance of a learning algorithm [60]. It also assumes that the full distribution of the data is known to the learning algorithm. The optimal Bayes rule, for example, is monotonic; adding features cannot decrease the accuracy. In this case, feature selection would not be recommended.
In the vast majority of practical and real-world scenarios, however, monotonicity does not hold, often because learning algorithms are not given access to the underlying distribution [61]. In addition, noisy features can increase noise in training examples, making it difficult for the learning algorithm to separate useful signal from noise [60]. In fact, many supervised learning algorithms, including decision tree algorithms such as ID3 [25], C4.5 [26], and CART [30], and instance-based algorithms, such as IBL [62] [63], are known to lose prediction accuracy when faced with many features unnecessary for predicting the output, i.e., irrelevant features [61]. These are features that contain almost no useful information for the learning task at hand. For example, students' ID numbers are irrelevant for predicting students' GPAs [18]. Conversely, algorithms such as Naïve-Bayes [64] [65] [66] are robust with respect to irrelevant features, but their performance may degrade quickly if correlated features, i.e., redundant features, are added [61]. Such features duplicate much or all of the information contained in one or more other features. For instance, the purchase price of a product and the amount of sales tax paid contain much of the same information [18].

In the context of machine learning, irrelevant and redundant features typically serve no purpose except to increase induction time. In fact, such features may also cause other problems, as they can confuse the learning algorithm by helping to obscure the distributions of the small set of truly relevant features for the task at hand [19]. Feature selection targets those very problems, as it is a technique that reduces the number of features and removes irrelevant, redundant, or noisy data by selecting a subset of the original feature set [4]. Although optimal feature selection is typically intractable [61] and many problems related to it have been shown to be NP-hard [67] [68], some feature selection algorithms have proved to be significantly more stable than others [69] [70]. Furthermore, feature selection has been shown to yield benefits such as facilitating data visualization and data understanding, reducing measurement and storage requirements, reducing training and utilization times, and defying the curse of dimensionality to improve prediction performance [22]. As a result, it can be found in many areas of research such as bioinformatics [71] [72] [73] [74], machine learning [71] [75] [76] [19], text classification/mining [77] [78] [79] [80] [81], case-based reasoning [82], statistics [83] [84] [85], security [79] [86], pattern recognition [7] [72] [87] [88], data mining [59] [89] [90], and various others. In addition, numerous reviews and comparative studies on existing feature selection algorithms have been conducted [4] [91] [92] [93] [94].

3.2 General Feature Selection Process

Most feature selection methods follow a four-step process: subset generation, subset evaluation, stopping criterion, and result validation (Figure 7) [91]. Beginning with subset generation, the selected search strategy produces candidate feature subsets. Each subset is then evaluated and compared to others according to a given evaluation criterion. The best subset is kept, and this process is repeated until a stopping criterion is reached, at which point the selected subset is validated using the pre-selected classifier. The subsequent subsections will provide detailed explanations of each step.

Figure 7: Feature Selection Process [91]
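As a hedged sketch of this four-step loop, the Python skeleton below treats the search strategy, evaluation criterion, stopping rule, and validation step as caller-supplied functions; every name here is illustrative rather than a component prescribed by this research.

def feature_selection(features, generate, evaluate, stop, validate):
    """Generic skeleton of the four-step process sketched in Figure 7.
    generate(features, state) -> candidate subset, evaluate(subset) -> score,
    stop(state) -> bool, validate(subset) -> final check with a classifier.
    All four callables are supplied by the caller."""
    best_subset, best_score = None, float("-inf")
    state = {"iteration": 0}
    while not stop(state):                      # Step 3: stopping criterion
        candidate = generate(features, state)   # Step 1: subset generation
        score = evaluate(candidate)             # Step 2: subset evaluation
        if score > best_score:                  # keep the best subset so far
            best_subset, best_score = candidate, score
        state["iteration"] += 1
    validate(best_subset)                       # Step 4: result validation
    return best_subset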
3.2.1 Subset Generation

Subset generation is essentially a heuristic search problem, with each location in the search space signifying a candidate subset for evaluation. Furthermore, it can largely be determined by answering two basic questions: where to start and how to search [4]. First, a starting point (or points) must be selected. Some algorithms start with an empty set and incrementally add features (i.e., forward), some start with a full set and incrementally remove features (i.e., backward), and some start at both ends and add and remove features simultaneously (i.e., bidirectional). Other algorithms may start with a predetermined number of randomly selected subsets in order to avoid being trapped by local optima [86]. Second, the algorithm must specify its search strategy. Since a dataset with N features has 2^N unique candidate subsets, exhaustive search is typically not feasible and thus rarely considered or used. So, three main strategies have been previously studied: complete, sequential, and random search [4].

3.2.1.1 Complete Search

Perhaps the biggest advantage of a complete search strategy is that it is guaranteed to find the optimal feature subset based on the evaluation criterion. While an exhaustive search is complete, a strategy does not have to be exhaustive in order to be complete [4]. In fact, algorithms such as beam search [86] and branch-and-bound (BB) [95] can find the optimal subset much more quickly than exhaustive search. However, such algorithms are impractical for problems with large feature sets, as the worst-case complexity is still exponential. Moreover, methods like BB require the feature selection criterion to be monotonic, although [96] has shown that BB can still work well even in cases where the feature selection criterion is nonmonotonic.

3.2.1.2 Sequential Search

Although sequential strategies tend to be much faster than complete strategies, the loss of completeness can also mean the loss of optimality, as it is no longer guaranteed that the optimal solution will be found. These strategies are deterministic and iteratively add or remove features until a stopping criterion is reached. Most sequential search strategies follow a greedy hill-climbing approach, such as sequential forward selection (SFS) [7] [87], sequential backward selection (SBS) [7] [87], or bidirectional selection [93]. The pseudo-code for SFS and SBS is provided in Figure 8 and Figure 9, respectively. These strategies are the most commonly used methods for performing feature selection given that they are simple to implement and fast in producing results, as the order of the search space is typically O(N^2) or less [4].

Figure 8: Pseudo-code for Sequential Forward Selection (SFS) Algorithm

Figure 9: Pseudo-code for Sequential Backward Selection (SBS) Algorithm

3.2.1.3 Random Search

Random strategies begin with a collection of randomly chosen candidate subsets. Then, the strategy can follow one of two directions: sequential or stochastic search. Examples such as random-start hill-climbing and simulated annealing follow a deterministic sequential search pattern to generate subsequent candidate solutions, while also injecting randomness throughout the process [86]. Stochastic methods generate new candidate subsets in a completely random manner (i.e., new subsets do not grow or shrink based on deterministic or sequential changes to previous subsets).
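Stepping back to the sequential strategies for a moment, the greedy loops of Figures 8 and 9 can be sketched as follows; this is a hedged illustration in which J stands for whichever evaluation criterion has been chosen and d for the desired subset size.

def sfs(features, J, d):
    """Sequential Forward Selection (cf. Figure 8): start empty and greedily
    add the single feature that most improves the criterion J until |X| = d."""
    selected = []
    remaining = list(features)
    while len(selected) < d and remaining:
        # Pick the feature whose addition maximizes the evaluation criterion
        best = max(remaining, key=lambda f: J(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

def sbs(features, J, d):
    """Sequential Backward Selection (cf. Figure 9): start with all features
    and greedily drop the one whose removal keeps J highest until |X| = d."""
    selected = list(features)
    while len(selected) > d:
        best_drop = max(selected, key=lambda f: J([g for g in selected if g != f]))
        selected.remove(best_drop)
    return selected

Random and stochastic strategies replace these deterministic loops with randomized subset generation, as in the examples that follow.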
Examples such as the Las Vegas algorithm [97] and LVW [98] use the aid of randomness to help escape local optima, allowing them to more adequately search for the global optima [4]. LVW, for instance, uses a probabilistic approach to explore high-order relationships among features. It guarantees the optimal solution(s) at the (k + 1)th experiment with probability l / (2^D − k), where D is the number of original features and l is the number of optima [98].

Perhaps the most applied and studied stochastic strategy is that of a Genetic Algorithm (GA) driven structure [99] [100] [101]. Popularized by Holland and Goldberg, GAs have been shown to be effective in searching complex high-dimensional spaces [102] [103]. Moreover, the use of crossover and mutation operators allows GAs to efficiently traverse the search space without getting trapped in local optima. First combined with feature selection by [99], the GA approach represents candidate subsets (i.e., "chromosomes") as binary strings of length N, where N is the number of features in the dataset. Each position in the binary string contains a zero or a one representing the absence or the presence of the corresponding feature. As described by the pseudo-code provided in Figure 10, a predetermined number of chromosomes (i.e., the "population") is first generated randomly and then updated through each iteration (i.e., "generation") until a stopping criterion is reached. In every generation, each chromosome is assigned a fitness value based on the predetermined evaluation criterion. This value often determines how likely that chromosome is to survive and breed into the next generation. New chromosomes (i.e., "offspring") are then created in one of two ways: crossover and/or mutation. In crossover, two existing chromosomes are selected as parents and their binary strings are mixed to create new chromosomes. In mutation, random bits ("genes") of a single parent or a newly created child are mutated by simply flipping the binary value, thus creating a new chromosome. Once the new chromosomes have been generated, a new population is selected for the next generation and the process begins all over again. Note that depending on the selection strategy, old and new chromosomes may or may not survive each generation.

Figure 10: Pseudo-code for GA driven Feature Selection Algorithm

3.2.2 Subset Evaluation

As aforementioned, each new candidate subset needs to be evaluated by an evaluation criterion to determine the goodness of the subset.
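Before turning to the evaluation criteria themselves, the GA-driven loop of Figure 10 can be fleshed out into a small, hedged sketch; the population size, rates, tournament selection, and elitism below are illustrative choices rather than settings prescribed by this research, and fitness(mask) stands in for whichever subset evaluation criterion is used.

import random

def ga_feature_selection(n_features, fitness, pop_size=20, generations=50,
                         crossover_rate=0.9, mutation_rate=0.02):
    """GA-driven search over binary feature masks (cf. Figure 10).
    A real implementation would cache fitness values rather than recompute them."""
    def random_mask():
        return [random.randint(0, 1) for _ in range(n_features)]

    population = [random_mask() for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        next_gen = scored[:2]                       # elitism: keep the two best
        while len(next_gen) < pop_size:
            # Tournament selection of two parents
            p1 = max(random.sample(scored, 3), key=fitness)
            p2 = max(random.sample(scored, 3), key=fitness)
            child = list(p1)
            if random.random() < crossover_rate:    # one-point crossover
                cut = random.randrange(1, n_features)
                child = p1[:cut] + p2[cut:]
            for i in range(n_features):             # bit-flip mutation
                if random.random() < mutation_rate:
                    child[i] = 1 - child[i]
            next_gen.append(child)
        population = next_gen
    best = max(population, key=fitness)
    return [i for i, bit in enumerate(best) if bit == 1]  # selected feature indices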
Since a great number of different evaluation techniques exist, it should first be pointed out that although a candidate subset may be optimal or near optimal under one criterion, it may or may not be considered optimal under others. The following three major groups of evaluation criteria exist and will be discussed in subsequent sections: independent, dependent, and hybrid.

3.2.2.1 Independent Criteria

Independent criteria are used to evaluate candidate solutions based on intrinsic characteristics of the features themselves, such as distance, information, dependency, and consistency measures [104] [105] [106] [93]. Distance measures (i.e., separability, divergence, or discrimination) try to find features that can separate the dataset's class labels as far as possible. For example, in a two-class problem, a feature X is preferred over feature Y if X generates a greater difference between the two class conditional probabilities than Y. Information measures normally quantify the information that can be gained from each feature. For example, the information gain from feature X is determined by the difference between the prior uncertainty and the expected posterior uncertainty using X. Dependency measures (i.e., correlation or similarity) gauge a feature's ability to predict the value of one variable from the value of another. For instance, feature X is chosen over feature Y if the association between feature X and class C is higher than the association between feature Y and class C. Consistency measures rely on class information and a minimum-features bias in selecting subsets. These measures try to find a minimum number of features which separate classes as consistently as the full set of features can [4].

Independent criteria are typically used in, and most widely referred to as, filter methods. Coined by John, Kohavi, and Pfleger in [58], filter methods have their name because they filter out irrelevant features before induction occurs, using a predefined metric. The process uses general characteristics of the training set to select some features and exclude others. As displayed in Figure 11, filter methods evaluate candidate subsets through metrics independent of the induction or learning algorithm and select the best subset accordingly. Since filtering methods do not involve the use of a learning algorithm to evaluate candidate sets, they can be combined with any learning algorithm after the filtering is complete. Moreover, they are a computationally effective form of data pre-processing, especially when compared to wrapper methods (which use dependent criteria) [74] [107].
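As a toy illustration of an independent criterion, the sketch below ranks each feature by a simple univariate score, here the absolute Pearson correlation with a numerically encoded class label, which is only one of many possible metrics, and keeps the top k features.

import numpy as np

def filter_top_k(X, y, k):
    """Simple univariate filter: rank features by the absolute Pearson
    correlation between each feature column and the (numerically encoded)
    class labels, then keep the k highest-ranked features.
    Constant features yield NaN scores and would need special handling."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    ranked = np.argsort(scores)[::-1]   # feature indices, best first
    return ranked[:k]                   # indices of the selected features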
Filter methods rely on abstract metrics of features, which indicate important properties of promising features such as orthogonality, large variance, multi-modality of marginal distributions, high kurtosis, low entropy, etc. [60]. As such, filter methods do not inherit bias from learning algorithms and are also more computationally efficient than wrapper methods [4].

Figure 11: Filter approach to feature selection

Figure 12 describes a generalized form of a filter algorithm, provided by [4]. Given dataset D, begin with a given subset S0 (an empty set, a full set, or any randomly selected subset) and search through the feature space using a particular search strategy. Each generated subset S is evaluated by an independent measure M and compared with the previous best. If better, it is regarded as the current best subset. The search iterates until a predefined stopping criterion is reached. The algorithm outputs the last current best subset Sbest as the final feature subset [4].

Figure 12: Generalized Filter Algorithm [4]

One of the simplest filtering processes is to evaluate each feature individually based on its correlation with the target function and then select the k features with the highest values. This method has been shown to achieve good results in text categorization tasks [108] [109] [110], often used in combination with either a Naïve-Bayesian classifier or a nearest neighbor classifier [107]. Other widely used metrics include Information Gain, Odds Ratio, Chi-Squared [77], Log Probability Ratio [77], FOCUS [111], RELIEF [112] [113], Potential Difference [114], the Pearson Correlation Coefficient [115], etc. RELIEF, for instance, assigns a "relevance" weight to each feature, which represents the relevance of the feature to the target concept. It samples instances randomly and updates the relevance values based on the difference between the selected instance and the nearest instances of the same and opposite class ("near-hit" and "near-miss") [61]. The algorithm does, however, require the problem to contain only two classes. Another example, FOCUS, looks for minimal combinations of attributes that perfectly discriminate among the classes. It begins by looking at each feature individually, goes on to pairs of features, triples, and so forth, stopping only when it finds a combination which generates pure partitions of the training set.

There are some drawbacks to filter methods, however. First, most filtering methods require the pre-selection of a set number of features to be chosen. If the number is too high, irrelevant features will be kept and accuracy may suffer. If the number is too low, useful features may not be selected, once again affecting accuracy. Some remedies to this problem include using a hold-out set to test for the best k, or using a hill-climber or some other evolutionary computation algorithm to find an optimal k. These solutions, however, will negatively impact what is perhaps the biggest benefit of a filtering algorithm, its relative speed. A second drawback is that filtering algorithms may miss features that would otherwise be useful to the learning algorithm which will predict unseen instances.
Since the algorithm bases its selection purely on feature metrics, it may miss features deemed useful by some learning algorithms but irrelevant by others. This is why wrapper methods are widely known to outperform filter methods (in terms of prediction accuracy) [4] [74].

3.2.2.2 Dependent Criteria

Dependent criteria primarily involve the use of a learning algorithm to assess the goodness of candidate subsets. In most cases, predictive accuracy, or some variation of it, is used as the evaluation criterion. Accordingly, methods that utilize a dependent criterion as their subset evaluation criterion are also known as wrapper methods. These methods occur outside the basic learning algorithm, but also use said learning algorithm as a subroutine, rather than just as a post-processor. For this reason, John, Kohavi, and Pfleger [58] refer to them as wrapper methods. As displayed in Figure 13, each candidate feature subset is evaluated by running the selected training or validation data through the learning algorithm and using the estimated accuracy of the algorithm as its evaluation metric.

Figure 13: Wrapper approach to feature selection

As aforementioned, the biggest benefit to using wrapper methods is their tendency to outperform (in terms of prediction accuracy) their filter counterparts. The general argument is that the learning algorithm, which uses the feature subset, should provide a better estimate of accuracy than a separate metric that may have an entirely different bias [107]. Whereas a filter method will use metrics to determine the usefulness of each feature, wrapper methods will use the learning method to determine the usefulness of a candidate set of features for the learning method itself. At the end of the process, not only will the algorithm produce an optimal set of features, but one which will perform most optimally under the aforementioned learning method.

Although wrapper methods have been shown to be more computationally expensive than filter methods due to their repeated calls to the learning algorithm for evaluation, research has shown that wrapper methods tend to give superior performance, as feature subsets are better suited to the predetermined learning algorithm [4]. They have been shown to significantly boost predictive performance in classifiers such as C4.5 and Naïve-Bayes [61] and in instance-based algorithms such as nearest-neighbor, as the latter would otherwise always use all available features for classification [116]. Figure 14 provides the pseudo-code for a generalized wrapper algorithm [4], which is quite similar to the generalized filter algorithm (Figure 12, Section 3.2.2.1) except that it uses a learning algorithm A instead of an independent measure M for the evaluation of each candidate subset. For each generated subset S, the algorithm evaluates its goodness by applying the learning algorithm to the data with subset S and evaluating the accuracy. Thus, different learning algorithms may produce different feature selection results. Varying the search strategies via the function generate(D) and the learning algorithm A can result in different wrapper algorithms.

Figure 14: Generalized Wrapper Algorithm [4]

Just as variations can be found in many wrapper methods, they can also be found in the learning algorithm.
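A hedged sketch of such a dependent criterion follows; the k-fold split and the train_fn/predict_fn callables, which stand in for whichever learning algorithm is being wrapped, are illustrative assumptions rather than components of the original text.

import numpy as np

def wrapper_score(X, y, subset, train_fn, predict_fn, folds=5):
    """Dependent (wrapper) criterion: the cross-validated accuracy of the
    chosen learning algorithm restricted to the candidate feature subset.
    train_fn(X, y) returns a model; predict_fn(model, X) returns labels."""
    Xs = X[:, list(subset)]                       # keep only candidate features
    idx = np.arange(len(y))
    np.random.shuffle(idx)
    accs = []
    for fold in np.array_split(idx, folds):       # simple k-fold split
        train = np.setdiff1d(idx, fold)
        model = train_fn(Xs[train], y[train])
        accs.append(np.mean(predict_fn(model, Xs[fold]) == y[fold]))
    return float(np.mean(accs))                   # "goodness" of the subset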
Variations of Neural Networks, Bayesian networks, and SVMs are often applied to different wrapper problems [117]. Research done in [90] also discusses overfitting and dynamic search in regards to wrapper methods. Finally, wrapper methods have also been widely used in unsupervised classification, typically using measures such as cluster compactness, scatter separability, or maximum likelihood as the evaluation criterion [118] [59].

3.2.2.3 Hybrid Criteria

Hybrid feature selection algorithms combine the use of filter and wrapper methods in an attempt to exploit the benefits of both. By combining both techniques at different stages, hybrids are able to take advantage of the speed of a filter method and the accuracy of a wrapper method [73] [119]. Other implementations may also include the use of an embedded method, such as a decision tree, instead of a wrapper method [120] [121]. Similar to the generalized algorithms previously provided, Figure 15 describes the pseudo-code for a generalized hybrid algorithm [4]. The algorithm first uses a filter method to find the best subset for a given predetermined cardinality and then uses the learning algorithm to select the final best subset among the best subsets across different cardinalities.

Figure 15: Generalized Hybrid Algorithm [4]

The algorithm begins with a given subset S0 (typically an empty set in sequential forward selection) and iterates to find the best subsets at each increased cardinality. In each round for a best subset with cardinality c, it searches through all possible subsets of cardinality c + 1 by adding one feature from the remaining features. Each newly generated subset S with cardinality c + 1 is evaluated by an independent measure M and compared with the previous best. If S is better, it becomes the current best subset S′best at level c + 1. At the end of each iteration, a learning algorithm A is applied to S′best at level c + 1 and the quality of the mined result is compared with that from the best subset at level c. If S′best is better, the algorithm continues to find the best subset at the next level; otherwise, it stops and outputs the current best subset as the final best subset. The quality of results from a learning algorithm provides a natural stopping criterion in this model [4].

As a hybrid method, the method presented in [117] begins by pre-processing the dataset using two different filter methods, F-score and Information Gain. The separate feature subsets are then combined and processed through a sequential floating wrapper method, which yields the final feature subset. A Support Vector Machine (SVM) then utilizes the resulting feature subset to compute the classification accuracy.

3.2.3 Stopping Criterion

A preselected stopping criterion determines when the feature selection process needs to stop. There is often little variation in the stopping criteria used for most feature selection methods. The subsequent list, provided by [4], details a few of the most frequently used ones.
1. The search completes.
2. A given bound is reached, such as a minimum number of features or a maximum number of iterations.
3. Subsequent addition or deletion of any feature does not produce a better subset.
4. A sufficiently good subset is found (e.g., the classification error rate for a candidate subset is less than the allowable error rate for a particular task).

3.2.4 Result Validation

Given the case that prior knowledge about the data is available (e.g.,
synthetic data), then a simple way to validate the selected subset is by comparing it to the known optimal subset. Knowledge about irrelevant and redundant features can also aid in validation, as those features are not expected to be selected. In most applications, however, such prior knowledge is not available. In such settings, the validation task typically falls to the comparison of the learner's error rate on the full set of features versus the error rate on the selected set of features [4] [93].

3.3 Embedded Methods

Embedded methods follow a slightly different feature selection pattern than the general process discussed in Section 3.2, so they are presented here separately. Such methods perform feature selection as part of the learning algorithm itself. That is, within the process of the learning algorithm, the algorithm decides which attributes to use and which to ignore [18]. Decision Trees are perhaps the most well-known example of embedded methods [25] [26] [30]. Much like wrapper methods, embedded methods interact directly with a specific learning algorithm. In other words, the feature selection algorithm is embedded into the classifier model itself rather than using the model to evaluate candidate feature sets. Moreover, these methods have the advantage that they interact directly with the classifier while also being less computationally expensive than wrapper methods [74]. A generalized form of this technique is not provided here, as embedded methods may be highly different in implementation depending on the learning algorithms encompassing them (e.g., decision tree vs. SVM).

Mao, Mohiuddin, and Jain, for instance, use a multilayer feedforward Neural Network with back propagation to simultaneously develop both an optimal feature set and an optimum classifier [122]. They define and utilize a "node saliency" measure to prune the least salient nodes, thus reducing the complexity of the network after it has been trained. Recursive partitioning methods, such as decision trees [25] [26] [30], employ a greedy search through the space of decision trees, at each stage using an evaluation function to select the feature which has the best ability to discriminate among the classes. They partition the data based on this feature and repeat the process on each subset, extending the tree downward until no further discrimination is possible. Other embedded methods include Separate-and-Conquer for decision lists [123] and Support Vector Machines with Recursive Feature Elimination [71].

3.4 General Algorithmic Framework

Given the previously provided general procedure for feature selection algorithms, it is now possible to discuss an algorithmic framework. Accordingly, this chapter presents a general feature selection taxonomy (Figure 16) and a categorical table (Table 5) which portrays many of the existing feature selection algorithms. Moreover, although the categorical table of algorithms is by no means exhaustive, it does provide a vast and broad representation of proposed and studied feature selection algorithms.

Figure 16: Taxonomy of Feature Selection Algorithms

As mentioned, a broad taxonomy of feature selection algorithms is presented in Figure 16. As shown, feature selection is first divided into stand-alone and embedded methods.
Stand-alone methods are then split according to their respective guarantees of optimality (optimal and suboptimal). Suboptimal methods are further divided into those using sequential search and those using random search. A final distinction is then made among random methods, dividing them into methods that search in a stochastic manner and methods that begin randomly but then follow a sequential pattern for search. Also note that all leaf nodes include at least one example of an existing method related to its category. It should be stated that although the taxonomy does not mention subset evaluation methods such as filter, wrapper, and hybrid methods, it does not do so because the majority of these methods are structured so that they may be combined with most stand-alone methods.

Subsequently, Table 5 provides an in-depth categorical table of existing feature selection methods [4]. The table is divided along two dimensions (Search Strategies and Evaluation Criteria), as these are the two most dominating factors in feature selection algorithms. Within Search Strategies, algorithms are categorized as Complete, Sequential, and Random. Under Evaluation Criteria, algorithms are categorized into Filter, Wrapper, and Hybrid. Filter algorithms are further divided into Distance, Information, Dependency, and Consistency. Although most of the algorithms listed in this section are not implemented in this research, it is noteworthy to mention that the CEFS architecture proposed in this work is structured in such a way that most, if not all, of the listed algorithms can utilize it and benefit from it. This is an important aspect which speaks to the wide extensibility of the proposed methodology and architecture of CEFS.

Table 5: Categorical Framework of Feature Selection Algorithms [4]

Filter (Distance): Complete: B&B [95], BFF [124]. Sequential: RELIEF [112], RELIEFF [113], RELIEFS [125], SFS [126], Segen's [127].
Filter (Information): Complete: MDLM [128]. Sequential: DTM [129], Koller's [19], SFG [93], FCBF [130].
Filter (Dependency): Complete: Bobrowski's [131]. Sequential: CFS [106], RRESET [132], POE+ACC [133], DVMM [134].
Filter (Consistency): Complete: Focus [111], ABB [135], MIFESI [136], Schlimmer's [137]. Sequential: Set Cover [138]. Random: LVI [93], QBB [93], LVF [139].
Wrapper (Predictive Accuracy): Complete: BS [86], AMB&B [140], FSLC [141], FSBC [142]. Sequential: SFS-SPLASH [13], WSFG [143], WSBG [143], BDS [86], PQSS [86], RC [144], SS [145], Queiros' [146]. Random: SA [86], RGSS [86], LVW [98], RMHC-PF [147], GA [100] [101], RVE [148], ELSA [60].
Hybrid (Filter + Wrapper): BBHFS [119], Xing [73].

3.5 Common Misconceptions

Techniques such as dimensionality reduction, feature extraction, feature construction, feature weighting, feature creation, and others are often misused as synonyms for feature selection. It should be pointed out that feature selection strictly selects a subset of the existing feature set, without the creation of new features [61] [149] [150]. However, to better understand feature selection, a brief explanation of some of the aforementioned techniques is noteworthy.

The first two techniques, dimensionality reduction and feature extraction, go hand in hand, as various algorithms are often interchangeably characterized under both. Tan et al. comment that dimensionality reduction commonly uses techniques from linear algebra to project the data from a high-dimensional space into a lower-dimensional space [18]. Similarly, Jain et al. describe feature extraction algorithms as algorithms which create new features based on transformations or combinations of the original feature set [7].
For example, Principal Component Analysis (PCA) finds new attributes which are linear combinations of the original attributes, are orthogonal to each other, and capture the maximum amount of variation in the data [18]. In other words, PCA is an orthogonal transformation of the coordinate system, which is obtained by projection onto the principal components, or features, of the data. Since a small number of principal components is often sufficient, a reduction in dimensionality occurs [151]. Further reviews of the PCA literature can be found in [152] and [153].

Consider another example: suppose you have a set of photographs, where each photograph is to be classified as to whether or not it contains a human face. By using a feature extraction algorithm, pixels can be transformed to provide higher-level features, such as edges and areas highly correlated with the presence of human faces. Other examples include applying a Fourier transform to time series data to identify underlying frequencies in order to detect certain periodic patterns [18] and using Independent Component Analysis (ICA) in conjunction with Support Vector Machines (SVM) in order to improve prediction accuracy in series forecasting [154]. Other techniques such as feature weighting [18] and feature construction [22] also provide similar distinctions but, for the brevity of this research, will not be discussed at this time. In summary, feature selection is not to be confused with other techniques which go beyond the original feature set, creating new features; that is, feature selection algorithms always yield a direct subset of the original feature set.

3.6 Relevance and Optimality of Features

Relevance, as it pertains to feature selection, is a bit of a loaded term. Defining it is not trivial, nor is it widely agreed upon. In fact, numerous authors have provided a number of different definitions for relevance. A brief summary of some of these definitions is provided in the subsequent paragraphs. Detailed explanations and further definitions of relevance can be found in works such as [61] by Kohavi and John, [107] by Blum and Langley, [72] by Nilsson et al., and most recently in [155] by Bell and Wang.

Although the word relevance is often used casually and without formal definition in most feature selection studies, its definition is as important as its use. At the heart of this matter lies the question of relevance vs. usefulness. Selecting only the most relevant features will often produce suboptimal results, especially if the features are redundant. Conversely, a subset of useful features may leave out many redundant, yet relevant, features [22]. In other words, the relevance of a feature does not imply that it should be in the optimal feature subset, while irrelevance does not imply that it should not be in the optimal feature subset [61]. For clarification purposes, provided next are some formal definitions acquired from [61], [72], and [107] which may aid in understanding the role of relevance in feature selection.

Definition 2 (conditional independence). A variable Xi is conditionally independent of a variable Y given (conditioned on) the set of variables S ⊆ X iff it holds that

$P\big(p(Y \mid X_i, S) = p(Y \mid S)\big) = 1$. (Eq. 13)

This is denoted $Y \perp X_i \mid S$.

Conditional independence is a measure of irrelevance, but it is difficult to use as an operational definition since it depends on the conditioning set S [72].
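On a small discrete example, Eq. 13 can be checked directly by comparing the two conditional distributions estimated from data; the sketch below is illustrative only (finite samples, observed value combinations) and is not part of the original text.

from collections import Counter

def conditionally_independent(samples, tol=1e-9):
    """Toy check of Eq. 13 on discrete samples: Y is conditionally
    independent of X given S if p(y | x, s) == p(y | s) for every observed
    combination. `samples` is a list of (x, s, y) tuples from the joint."""
    joint_xsy = Counter(samples)
    joint_xs = Counter((x, s) for x, s, _ in samples)
    joint_sy = Counter((s, y) for _, s, y in samples)
    joint_s = Counter(s for _, s, _ in samples)
    for (x, s, y), n in joint_xsy.items():
        p_y_given_xs = n / joint_xs[(x, s)]
        p_y_given_s = joint_sy[(s, y)] / joint_s[s]
        if abs(p_y_given_xs - p_y_given_s) > tol:
            return False
    return True

# Y simply copies S and ignores X, so Y is conditionally independent of X given S
samples = [(x, s, s) for x in (0, 1) for s in (0, 1) for _ in range(5)]
print(conditionally_independent(samples))  # True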
The definitions of strong and weak relevance, defined next, will help refine what it means for a feature to be considered irrelevant.

Definition 3 (strong relevance). A feature Xi is strongly relevant iff there exists some xi, y, and si for which p(Xi = xi, Si = si) > 0, where Si denotes the set of all features except Xi, such that

$p(Y = y \mid X_i = x_i, S_i = s_i) \neq p(Y = y \mid S_i = s_i)$. (Eq. 14)

Definition 4 (weak relevance). A feature Xi is weakly relevant iff it is not strongly relevant and there exists a subset of features $S_i'$ of $S_i$ for which there exists some xi, y, and $s_i'$ with p(Xi = xi, $S_i' = s_i'$) > 0 such that

$p(Y = y \mid X_i = x_i, S_i' = s_i') \neq p(Y = y \mid S_i' = s_i')$. (Eq. 15)

In other words, a feature X is strongly relevant if the removal of X alone will yield a decrease in the performance of an optimal classifier. Conversely, a feature X is weakly relevant if it is not strongly relevant and there exists a subset of features, S, such that the performance of a classifier on S is worse than the performance on S ∪ {X}. Thus, a feature is relevant if it is either weakly or strongly relevant; otherwise, it is irrelevant [61]. The importance lies not only with the definition itself: research completed in [156] has also shown that algorithms such as C4.5 tend to generate deeper decision trees with lower performance when weakly relevant features are not deleted. Figure 17 provides a useful visual representation of the feature subset space divided into irrelevant, weakly relevant, and strongly relevant feature subsets [61].

Figure 17: Feature Subset Space [61]

A Bayes classifier, for instance, must use all strongly relevant features and possibly some weakly relevant features (but recall that such classifiers are monotonic and have access to the underlying distribution). Classifiers induced from data, however, are likely to be suboptimal as they have no access to the same distribution. Accordingly, such induction algorithms may benefit from the omission of features, including strongly relevant ones [61]. Thus, it is important to define what it means for a feature subset to be optimal.

Definition 5 (optimal feature subset). Given a learning algorithm L and a dataset D, an optimal feature set, S*, is a subset of features such that the predictive accuracy of D is highest under L.

As previously mentioned, however, relevant features are not necessarily optimal, and conversely, optimal features are not necessarily relevant. This is shown through Examples 1 and 2, acquired from [61].

Example 1 (relevance does not imply optimality). Let all possible instances be {0, 1}^3, that is, three Boolean features, say X1, X2, X3. Let the distribution of instances be uniform, and assume the target concept is f(X1, X2, X3) = (X1 ∧ X2) ∨ X3. Under any reasonable definition of relevance, all features are relevant to this target function. If the hypothesis space is the space of monomials, i.e., conjunctions of literals, the only optimal feature subset is {X3}. Moreover, the accuracy of the monomial X3 is 87.5%, the highest accuracy achievable within this space, and adding another feature to the monomial will simply decrease its accuracy. Therefore, this example shows that relevance, even strong relevance, does not imply that a feature should be in an optimal feature subset.

Example 2 (optimality does not imply relevance). Assume feature Xi is always set to the value of 1.
Under all definitions of relevance described above, this feature is irrelevant. Now consider a limited Perceptron classifier that has an associated weight for each feature and classifies instances based upon whether the linear combination is greater than zero. Also assume that the threshold is zero. Given that Xi is always set to 1, the limited Perceptron is equivalent in representational power to the regular Perceptron. However, removal of all irrelevant features would remove that crucial feature. Since cases such as this are believed to be rare in practice, it still holds true that irrelevant features should generally be removed.

Recall, however, that an objective measure still needs to be selected in order to implement and deploy a feature selection algorithm. So, for the scope of this research, the usefulness of a feature set will provide the necessary objective function for subset comparison and selection. As such, the following definition, derived from [157], provides this criterion.

Definition 6 (incremental usefulness). Given a sample of data D, a learning algorithm L, and a feature set S, feature xi is incrementally useful to L with respect to S if the accuracy of the hypothesis that L produces using the feature set {xi} ∪ S is better than the accuracy achieved using just the feature set S.

Accordingly, the definition of incremental usefulness will be the focal point for determining whether or not a feature should be included in the optimal feature subset. This is further apparent given the fact that the proposed algorithm later described in this research will employ a wrapper method, which tailors directly to the learning algorithm used to evaluate each feature subset. Therefore, features will be assessed according to their usefulness with respect to the learning algorithm rather than their overall relevance.

As a final illustration of the issue of relevance vs. usefulness, refer to an example provided in [61]. One of the artificial datasets (m-of-n-3-7-10) represents a symmetric target function, implying that all features should be ranked equally by any filtering method. However, Naïve-Bayes improves if a single feature (any one of them) is removed. Consequently, that is the precise motivation behind the selection of a wrapper method over a filter method in the implementation of the proposed architecture in this research (Chapter 4). Note, however, that relevance according to these definitions does not imply membership in the optimal feature subset, and that irrelevance does not imply that a feature cannot be in the optimal feature subset [61]. The underlying purpose in any of this is to design the architecture in such a way that it is allowed to choose usefulness over relevance.

3.7 Feature Selection in Unsupervised Learning

Although the use of feature selection in unsupervised learning will not be implemented in this research, a brief discussion of the subject is important given the amount of attention it has received [4] [158] [118] [159] [160]. Unsupervised learning, more specifically clustering [161] [162], groups instances based on the information describing the instances or based on their relationships. The goal is to obtain groups with instances similar or related to one another and different from instances in other groups. The greater the similarity (or homogeneity) within a group and the greater the difference between groups, the better the clustering [158].
Wrapper methods used in conjunction with clustering evaluate the goodness of a feature subset by the quality of the clusters that result from applying the clustering algorithm on the candidate feature subset. A number of heuristic measures exist for estimating the quality of clustering results, such as cluster compactness, scatter separability, and maximum likelihood [4]. Further work on developing dependent criteria in feature selection for clustering can also be found in [59], [118], and [159]. Since filter methods do not depend on the use of a separate learning algorithm for evaluation, the methodology for using filter feature selection in unsupervised learning does not differ from that of supervised learning. Given the infancy of this research area, advances continue to occur.

3.8 Feature Selection in Supervised Learning

It has been shown that CART [30], ID3 [25], C4.5 [26], and nearest neighbor [44] algorithms suffer in the presence of irrelevant features. Naive Bayesian classifiers [163] [65], which assume independence of features given the instance label, also degrade when using correlated and redundant features [60]. Even learners such as SVMs, which are considered more robust in high-dimensional settings, can improve classification performance when combined with feature selection techniques [164] [80]. Feature selection is also of particular interest when using nonparametric classification methods such as k nearest neighbor [165], Parzen's windows [166], or, more generally, methods based on geometrical models that have a reputation for having high computational and storage costs [167]. Feature selection algorithms have also been widely combined with evolutionary computation algorithms in supervised learning tasks. The research conducted in [168], for example, combines a Genetic Algorithm and a k-Nearest Neighbor method to identify genes which can jointly discriminate between different classes of cancer. This and many other applications of feature selection to supervised learning tasks, such as text categorization [108] or Parkinsonian sleep analysis [169], are extensively available but outside the scope of this particular project.

3.9 Ensemble Feature Selection

The feature selection methods discussed up to this point employ the use of a single classifier. An ensemble system, on the other hand, is composed of a set of multiple classifiers and performs classification by selecting from the predictions made by each of the classifiers [18]. Since wide research has shown that ensemble systems are often more accurate than any of the individual classifiers of the system alone [9] [10] [11], it is only natural that ensemble systems and feature selection would be combined at some point. An effective way of generating a diverse, yet accurate, ensemble of base classifiers is to use ensemble feature selection [47]. To recall, theoretical and empirical research has shown that an efficient ensemble should consist not only of high-accuracy classifiers, but of classifiers which also err in different parts of the input space [9]. Providing different feature subsets allows base classifiers to make their classification errors in different subareas of the instance space. While feature selection algorithms attempt to find an optimal feature subset for the learning algorithm, ensemble feature selection has the additional goal of finding a set of feature subsets which will promote disagreement among the base classifiers [47] [170].
This is perhaps the most important point in understanding the motivation and goals of the research presented in this work. The approach proposed in Chapter 4 will attempt to provide the aforementioned disagreement among the base classifiers, but with a set of highly optimal class-specific feature subsets. The expected results are higher prediction accuracy and better domain understandability.

Various successful attempts have been made at implementing ensemble feature selection. In [171], Ho proposes a technique called the Random Subspace Method (RSM), which randomly selects a predetermined number of features from the original feature set. This is repeated n times to create n feature subsets that are used to generate the n base classifiers used in the ensemble. Ho goes on to show that RSM ensembles can reach effective results, presumably because the lack of accuracy in the ensemble members is compensated for by the diversity created by the RSM [171]. In another implementation, Opitz begins by using RSM to generate an initial population of feature subsets, which he then uses and optimizes in his Genetic Algorithm (GA). Opitz uses GAs in conjunction with Neural Networks (NN) as a wrapper method to find optimal feature subsets, which he then uses as training data to build an ensemble model of NNs. Opitz comments that the initial population, acquired through RSM, was surprisingly good and produced better ensembles on average than the popular and powerful ensemble approaches of bagging and boosting [170]. Other ensemble feature selection studies have included implementations such as a hill-climber and Bayesian classifiers with cross-validation integration [47], GAs and k-Nearest Neighbor classifiers [172] [173], sequential-search-based strategies and Bayesian classifiers [174], and comparative ensemble feature selection studies such as [175].

Perhaps the work most related to the architecture proposed in this research is that done by Vale et al. in [176]. In that study, the authors implement a filter method with an emphasis on class-based feature selection to generate the feature subsets used by each base classifier. The use of a filter method is perhaps the most evident shortcoming of the implementation, as wrapper methods have often been shown to provide more accurate results. In addition, the authors implement a combination-based ensemble integration technique, although the specific details of the combination technique are not provided. Depending on the specific combination technique, further improvement can be seen here as well.

Chapter 4
Class-specific Ensemble Feature Selection (CEFS)

The approach presented and explored in this research, Class-specific Ensemble Feature Selection (CEFS), provides a novel methodology in which to perform feature selection. Specifically, CEFS takes advantage of the notion that the features found to be most useful for classifying each classification separately may be different from the features found to be most useful for the overall dataset. This difference is crucial, as there exist features that are optimal for classifying a specific classification but which may not have been included in the overall best feature set. This loss of information becomes vital in classification tasks. By performing feature selection in a class-specific manner, the algorithm becomes more robust, more effective, and more informative. Moreover, one of the key aspects of CEFS is its extensibility and flexibility in regards to machine learners and feature selection algorithms.
In other words, CEFS is designed in such a way that most, if not all, machine learners and feature selection algorithms can be coupled with it. The CEFS algorithm will focus mainly on improving prediction accuracy in classification problems. As an added benefit, the class-specific design will select features optimal to each separate classification, thereby providing more information and understanding in regards to the most useful features for each classification and for the model. To better understand the theory and design behind CEFS, the remainder of this chapter provides the theoretical, architectural, and algorithmic frameworks for CEFS.

4.1 Theoretical Framework

Let $Y = \{Y_1, Y_2, \ldots, Y_n\}$ represent the original set of features, with cardinality n, and let $I = \{I_1, I_2, \ldots, I_m\}$ represent the original training set of instances, with cardinality m. Let $t = \{t_1, t_2, \ldots, t_m\}$ represent the set of known target outcomes (i.e., classifications) for I. Then, $I_j = [i_1, i_2, \ldots, i_n, t_j]$ represents the j-th instance in I, as an assignment of values projected onto Y and with target outcome $t_j$. Moreover, let d represent the desired number of features in a candidate subset X, such that $X \subseteq Y$. Now, let the feature selection evaluation criterion for the set X be denoted by J(X) and be measured with respect to I. Without loss of generality, consider a higher value of J to indicate a better feature set. Then, traditional feature selection seeks to find a subset $X \subseteq Y$ such that $|X| = d$ and $J(X) = \max_{Z \subseteq Y,\, |Z| = d} J(Z)$.

Now, let $t' = \{t'_1, t'_2, \ldots, t'_k\}$ represent the set of distinct known target outcomes for I, where $t' \subseteq t$ and $k \leq m$. Furthermore, $\forall t'_k \in t'$, let $d_{t'_k}$ represent the desired number of features in the corresponding subset $X_{t'_k}$, such that $X_{t'_k} \subseteq Y$. Then, let the feature selection evaluation criterion for $X_{t'_k}$ be denoted by $J'(X_{t'_k})$, but be measured with respect to $I_{t'_k} = [I_{t'_k,1}, I_{t'_k,2}, \ldots, I_{t'_k,p}]$ with cardinality p, where $I_{t'_k} \subseteq I$, $I_{t'_k,j} = [i_{t'_k,j,1}, i_{t'_k,j,2}, \ldots, i_{t'_k,j,n}, t_{t'_k,j}]$, and $I_{t'_k,j} \in I_{t'_k} \Leftrightarrow t_{t'_k,j} = t'_k$. Without loss of generality, consider a higher value of $J'$ to indicate a better feature set. Then, class-specific feature selection seeks to find a subset $X_{t'_k} \subseteq Y$ such that $|X_{t'_k}| = d_{t'_k}$ and $J'(X_{t'_k}) = \max_{Z' \subseteq Y,\, |Z'| = d_{t'_k}} J'(Z')$, $\forall t'_k \in t'$.

Now, let $C(A)$ represent a learning classifier C using feature set A. Moreover, let $G(C(A))$ represent the generalization evaluation criterion of C using feature set A. Without loss of generality, consider a higher value of G to indicate better generalization for feature set A under classifier C.
Also, let $E = [C(X_{t'_1}), C(X_{t'_2}), \ldots, C(X_{t'_k})]$ represent a class-specific ensemble learning machine which predicts according to $C(X_{t'_k})$ when the classifiers all agree and according to $C(X)$ when there is disagreement. Research has already shown that, generally, $G(X) \geq G(Y)$ and $|X| < |Y|$. The goal of this research is to show that, generally, $G(E) \geq G(X)$, and that for $X_{t'} = X_{t'_1} \cup X_{t'_2} \cup \cdots \cup X_{t'_k}$, $X_{t'} \not\subseteq X$ and $X_{t'} \not\supseteq X$.

4.2 Architectural Framework

Presented next is the procedural design of CEFS (Figure 18). First, assume dataset D, which contains n different classifications. CEFS will use an all-purpose subset generation method such as Sequential Backward Selection (SBS) [82] or a Genetic Algorithm (GA) [170] to produce candidate subsets. These subsets will then be evaluated using a dependent (i.e., wrapper) subset evaluation criterion. Note that independent methods of evaluation are also applicable if chosen for use. This process is repeated until an optimal class-specific feature subset is found for each of the n distinct classifications in D. In other words, the feature selection algorithm will be run n times, each time searching for the subset of features which will maximize the classification accuracy, not of the entire dataset D, but of one of the n classifications, thus producing n optimal class-specific subsets. This is done by evaluating the prediction accuracy of the classifier on a single classification at a time, instead of the prediction accuracy of the classifier on the entire test set (which groups all classifications together). To clarify, the training set and test set never change; they still contain all of the instances, just as if the algorithm were going to predict as normal. The facet that changes is the evaluation criterion. By using only the prediction accuracy of a single classification at a time, the algorithm is able to find an optimal set of class-specific features.

The n optimal class-specific feature subsets are each coupled with the same type of classifier, and all are combined to form an ensemble machine. New and unseen instances are then sent to the ensemble machine for classification. If all class-specific classifiers agree on the classification of an instance, then that instance is classified accordingly. If any classifier disagrees with the remaining classifiers, the algorithm classifies the instance using the overall best feature subset for the entire dataset. The notion behind this approach is that class-specific feature sets have superior discriminating power in classifying instances when in agreement, but inferior discriminating power when in disagreement. Hence, a feature set that can generalize well to the entire dataset is used in such instances.
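To make the decision rule concrete, the following is a minimal sketch, in Java (the implementation language used in Chapter 5), of an ensemble that falls back to the overall-best-subset classifier on any disagreement. The Classifier interface and all names here are illustrative assumptions for this sketch, not the dissertation's actual code.

```java
import java.util.List;

/** Minimal sketch of the CEFS decision rule; all names are illustrative. */
public class CefsEnsemble {

    /** A classifier already trained on one particular feature subset. */
    public interface Classifier {
        String classify(double[] instance);
    }

    private final List<Classifier> classSpecificModules; // one module per classification
    private final Classifier overallBestClassifier;      // trained on the overall best subset

    public CefsEnsemble(List<Classifier> classSpecificModules, Classifier overallBestClassifier) {
        this.classSpecificModules = classSpecificModules;
        this.overallBestClassifier = overallBestClassifier;
    }

    /** Unanimous agreement -> use that label; any disagreement -> fall back to the overall best subset. */
    public String classify(double[] instance) {
        String first = classSpecificModules.get(0).classify(instance);
        for (Classifier module : classSpecificModules) {
            if (!module.classify(instance).equals(first)) {
                return overallBestClassifier.classify(instance); // disagreement
            }
        }
        return first; // all modules agree
    }
}
```

Note that the fallback classifier is exactly the traditional architecture's classifier, so under this rule CEFS can only differ from the traditional architecture on instances where the class-specific modules agree.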
Figure 18: CEFS Procedure

4.3 Algorithmic Framework

It should first be noted that CEFS itself is not a feature selection algorithm, but rather a procedural framework in which to conduct feature selection in classification tasks. Thus, it is classifier and feature selection algorithm independent. So, although a broad range of machine learning and feature selection algorithms can be utilized in a class-specific architecture, this section will discuss the algorithms which are implemented to test the hypothesis presented in this research.

Three different techniques are tested and used to generate candidate subsets for CEFS: Sequential Backward Selection (SBS), Sequential Forward Selection (SFS), and a Genetic Algorithm (GA). The first two are greedy, sequential search algorithms, and are chosen because they follow a strictly deterministic pattern, so results are easily reproducible. However, for sufficiently large datasets, SBS and SFS are infeasible in regards to run time. So, a third algorithm, a GA, is implemented to handle such datasets. GAs begin with a random sampling of the feature space and typically proceed in a stochastic manner, attempting to converge on an optimal solution. The implementation of a GA is not added solely for its scalability, however. It is of interest to this research to test the aforementioned CEFS architecture under two sets of structurally different subset generation techniques (greedy/deterministic vs. random/stochastic).

Additionally, two classifiers are implemented in this research to evaluate candidate feature subsets: a Naïve Bayesian Classifier (NBC) and a Discrete k-Nearest Neighbor Algorithm (kNN). Both classifiers are described in detail in Chapter 2 of this work. The implementation of these algorithms is well suited for this research, as lazy learners tend to benefit greatly from feature selection [177]. Additionally, this implementation of CEFS evaluates candidate subsets using a wrapper method. Specifically, the subset evaluation criterion is the leave-one-out (LOO) validation error from the classifier using the training set. In other words, suppose dataset D contains m instances; then the LOO validation error is the average error across m runs, where in each run, one of the m instances is removed from the training set to become the validation set, and then returned to the training set to be part of the training data for the next run. It is important to note that the classifier is retrained for each of the m runs, preventing the classifier model from having prior knowledge of the instance in the validation set. This approach has the advantage of utilizing as much data as possible for training (i.e., m - 1 instances). In addition, the validation sets are mutually exclusive and they effectively cover the entire dataset [18].

Once again, it is of the utmost importance to understand that CEFS is classifier and feature selection algorithm independent. Moreover, the emphasis of this research is not on the robustness or efficiency of the feature selection algorithm or of the classifier, but rather on the notion that any chosen combination of the two can perform better when implemented in a class-specific architecture than when implemented in a traditional feature selection architecture.
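As an illustration of this wrapper criterion, the following is a minimal sketch of the leave-one-out validation error for a single candidate feature subset. The Trainer and Classifier abstractions are assumptions made for the example, not the dissertation's actual code.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch of the leave-one-out (LOO) wrapper criterion; names are illustrative. */
public class LeaveOneOut {

    public interface Trainer {
        /** Trains a classifier on the given instances, restricted to the given feature subset. */
        Classifier train(List<double[]> instances, List<String> labels, boolean[] featureSubset);
    }

    public interface Classifier {
        String classify(double[] instance);
    }

    /** Returns the LOO validation error (fraction misclassified) for one candidate subset. */
    public static double looError(Trainer trainer, List<double[]> instances,
                                  List<String> labels, boolean[] featureSubset) {
        int errors = 0;
        for (int held = 0; held < instances.size(); held++) {
            List<double[]> trainX = new ArrayList<>(instances);
            List<String> trainY = new ArrayList<>(labels);
            trainX.remove(held);                   // hold out one instance
            trainY.remove(held);
            Classifier c = trainer.train(trainX, trainY, featureSubset); // retrained every run
            if (!c.classify(instances.get(held)).equals(labels.get(held))) {
                errors++;
            }
        }
        return (double) errors / instances.size();
    }
}
```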
One of the reasons that SBS, SFS, NBC, and kNN are chosen is that they are all deterministic algorithms, which provides an easier setting for reproducible results. Furthermore, kNNs are equipped to handle both discrete and continuous input data, allowing for a more thorough and diverse collection of datasets to be tested. Conversely, the GA is chosen so that high-dimensional datasets could be handled and so that a stochastic method could also be tested under CEFS.

Chapter 5
Implementation

In order to gauge the usefulness and effectiveness of the CEFS framework and to test the aforementioned hypothesis in this text, an implementation of a Class-specific Ensemble Feature Selection architecture is presented here. As mentioned in Section 4.3, this implementation is tested using three different feature selection subset generators (SBS, SFS, and GA) and two different classifiers (NBC and kNN). Moreover, the implementation is tested over 10 datasets, varying in the number of instances, features, and classifications. It must be noted, however, that the NBC is only combined with the SBS algorithm and only tested over three of the ten datasets. This is due to the fact that the NBC/SBS combination is used mainly as a preliminary experiment to ascertain the potential performance benefits of CEFS and to begin exploring the difference in optimal feature sets between CEFS and traditional feature selection.

One of the other goals of this chapter is to show a proof of concept in regards to class-specific feature selection. That is, if the CEFS architecture can outperform a traditional feature selection architecture on different learners and under different selection algorithms, then it can be generalized that similar performance trends can be seen under any learner and any feature selection algorithm, making the CEFS architecture algorithm independent. So, to achieve these goals, the remainder of the chapter is arranged as follows. First, a brief reiteration of the proposed hypothesis is provided. Next, the experimental data, tools, procedure, algorithms, algorithmic specifications, and algorithmic constraints are discussed. Finally, the chapter concludes by presenting and detailing results regarding the performance and domain understandability of this experiment. Also, note that the aforementioned hypothesis is discussed throughout the presentation of the experimental results.

5.1 Proposed Hypothesis

As the implementation presented in this chapter will seek to test the hypothesis proposed earlier in Chapter 1, it is worthwhile to first revisit that hypothesis. Recall that the following four-part hypothesis is suggested:

1. There exists a recombination technique such that, when applied to a Class-specific Ensemble Feature Selection (CEFS) architecture, it will allow the system to outperform a traditional feature selection architecture in terms of prediction accuracy.
2. The same CEFS architecture will scale up by continuing to outperform the traditional feature selection architecture when the number of features is increased, as well as when the number of class labels is increased.
3. The union of the sets of relevant class-specific features may not be contained in the set of relevant features found traditionally for the entire dataset.
4. Conversely, the set of relevant features found traditionally for the entire dataset may not be contained in the union of the sets of relevant class-specific features.
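In the notation of Section 4.1, where $X$ is the overall best subset, $X_{t'}$ is the union of the class-specific best subsets, and $E$ is the class-specific ensemble, parts 1, 3, and 4 of the hypothesis can be restated compactly; part 2 concerns scaling behavior and is assessed empirically rather than formally.

```latex
% Hypothesis parts 1, 3, and 4 restated in the notation of Section 4.1
\begin{align*}
\text{Part 1:}\quad & G(E) \geq G(X)\\
\text{Part 3:}\quad & X_{t'} \not\subseteq X
  && \text{(some class-specific feature is absent from the overall best subset)}\\
\text{Part 4:}\quad & X_{t'} \not\supseteq X
  && \text{(some overall-best feature is absent from every class-specific subset)}
\end{align*}
```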
5.2 Experiment

The experiment conducted here implements a CEFS architecture using the aforementioned feature selection and classification algorithms and compares it against a traditional feature selection architecture utilizing the same algorithms. The subsequent sections describe the details surrounding this experiment.

5.2.1 Data

A wide variety of datasets are tested in this experiment. These datasets range in the number of instances (i.e., depth), the number of features (i.e., width), the number of classifications, and whether or not they contain missing values. Moreover, some datasets include strictly discrete input data, some strictly continuous input data, and some contain mixed input data. The notion behind the selection of datasets is to provide a collection that represents many varying dataset characteristics, thereby testing the algorithm under several possible scenarios. Table 6 contains the descriptive information specific to each dataset. A few additional comments also need to be stated regarding the preparation of each dataset:

1. Each dataset is randomly sampled, based on a predetermined size, to create a training set and a test set. This task is completed 15 times for each dataset, thus creating 15 sets of randomly sampled data. The training sets are balanced such that they contain an equal number of instances from each classification (the Arrhythmia dataset is the only exception, as the number of instances in one of its three classifications was simply too small for proper training). For example, the Congressional Voting training set contains 268 instances, which can be classified as either "republican" or "democrat". This means that 134 instances have a classification of "republican" and 134 have a classification of "democrat". This is especially important since some of the algorithms implemented here have a strong statistical emphasis that could be negatively and unfairly affected by an unbalanced training set.
2. A leave-one-out validation scheme (as described in Section 4.3) is used during training to evaluate candidate k values for the kNN algorithm and to evaluate candidate feature subsets under each classifier.
3. Datasets with missing values are typically handled by replacing the missing values with the median feature value (if the feature is discrete) or the mean feature value (if the feature is continuous). Further information specific to each dataset is provided in subsequent sections.
4. Most continuous datasets are normalized. In other words, each value in a feature is divided by the maximum absolute value in that feature. This scales all values to between -1 and 1 if negative values exist, and to between 0 and 1 if only positive values exist. Further information is also provided in subsequent sections.

Table 6: Dataset characteristics

Dataset Name           Total Instances   Features   Class Labels   Training   Testing   Missing Values
Breast Cancer                      286          9              2        136       150   No
Wine                               178         13              3        114        64   No
Credit Approval                    690         15              2        490       200   Yes
Congressional Voting               435         16              2        268       167   Yes
Alabama County                     340         16              2        188       152   Yes
Musk v1                            476        166              2        248       228   No
Arrhythmia                         452        279              3        235       217   Yes
Madelon                           2600        500              2        194      2404   No
Colon Cancer                        62       2000              2         36        26   No
Lymphoma                            42       4026              2         34         8   Yes

5.2.1.1 Breast Cancer Dataset

The Breast Cancer dataset is acquired from the UCI Machine Learning Repository [24].
It was originally created by Matjaz Zwitter and Milan Soklic of the Institute of Oncology at the University Medical Center in Ljubljana, Yugoslavia. It contains 286 instances with two possible classifications: "recurrence" (85 instances) and "no-recurrence" (201 instances). Furthermore, it contains nine nominal features: age bracket, menopause, tumor-size bracket, inv-nodes, node-caps, deg-malig, breast side, breast-quad, and irradiat. Although it contains missing values, the dataset is left as is and classified as such.

5.2.1.2 Wine Dataset

The Wine dataset is acquired from the UCI Machine Learning Repository [24]; however, it originates from [178]. It contains 178 instances with three possible classifications: "class 1" (59 instances), "class 2" (71 instances), and "class 3" (48 instances). Moreover, it contains 13 continuous attributes: alcohol, malic acid, ash, alkalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, and proline. The dataset does not contain missing values.

5.2.1.3 Credit Approval Dataset

The Credit Approval dataset is also acquired from the UCI Machine Learning Repository [24]; however, the original source is proprietary and confidential. It contains 690 instances with two possible classifications: "+" (307 instances) and "-" (383 instances). It also spans 15 features, six of which are continuous and nine of which are discrete. Feature names are also proprietary. The missing values in discrete features are replaced by the median value of the feature, whereas the missing values in continuous features are replaced by the mean value of the feature. All continuous features are normalized as previously mentioned in this chapter.

5.2.1.4 Congressional Voting Dataset

The Congressional Voting Records dataset is acquired from the UCI Machine Learning Repository [24] and originates from [179]. It contains 435 instances separated into two classifications: "republican" (168 instances) and "democrat" (267 instances). It also contains the following 16 Boolean-valued (yes or no) features: handicapped-infants, water-project-cost-sharing, adoption-of-the-budget-resolution, physician-fee-freeze, El-Salvador-aid, religious-groups-in-schools, anti-satellite-test-ban, aid-to-Nicaraguan-contras, mx-missile, immigration, synfuels-corporation-cutback, education-spending, superfund-right-to-sue, crime, duty-free-exports, and export-administration-act-South-Africa. Missing values are denoted by "?" and are not replaced by anything.

5.2.1.5 Alabama County Jails Dataset

The Alabama County Jails dataset is a proprietary dataset acquired from a Regional County Corrections Department in Alabama. It is first introduced and described in [42]. The data obtained from the Alabama County Corrections Program consists of 340 instances spanning 18 discrete attributes, which include the offender's personal information and current and previous charges and convictions. The end result, or predicted outcome, is determined by whether or not the individual completes the program (represented by a one signifying completion, 222 instances, or a negative one signifying failure to complete the program, 118 instances). Missing values are left as is and classified accordingly.

5.2.1.6 Musk (version 1) Dataset

The Musk v1 dataset is obtained from the UCI Repository [24], although it originates from the AI Group at Arris Pharmaceutical Corp. in San Francisco, CA.
The dataset contains 476 conformations of molecules and is divided into two classifications: 0 => "non-musk" (269 instances) or 1 => "musk" (207 instances). The instances are described by 166 features, all continuous and normalized as previously mentioned. No values are missing in this dataset.

5.2.1.7 Arrhythmia Dataset

The Arrhythmia dataset is acquired from the UCI Repository [24], but is originally owned by Guvenir, Acar, and Muderrisoglu and first reported in [180]. It contains 279 features, of which 206 are continuous and 73 are discrete. Although the original dataset is comprised of 16 different classifications, the classifications are organized such that they can be condensed to three. Class 01 refers to a "normal" ECG, classes 02 to 15 refer to different classes of arrhythmia, and class 16 refers to the remaining unclassified instances. So, the adjusted dataset contains three classifications: "N" (245 instances), "A" (185 instances), and "U" (22 instances). Moreover, missing values in discrete features are replaced by the median value of the feature, whereas missing values in continuous features are replaced by the mean value of the feature. Continuous features are then normalized as previously described. It is also noteworthy that this is the only dataset without a balanced training set. Since there exist only 22 instances classified as "U", a balanced training set would prove inappropriate for training, as the number of instances would be far too few. So, the training set is comprised of 111 instances classified as "N", 111 classified as "A", and 13 classified as "U". The remaining instances are placed in the test set.

5.2.1.8 Madelon Dataset

Although the Madelon dataset is also acquired from the UCI Repository [24], it stems from the Feature Selection Challenge at the 2003 Neural Information Processing Systems Conference. The data is synthetic and generated by the conference organizers. The dataset contains 2600 instances with a binary classification: "-1" (1300 instances) and "1" (1300 instances). All 500 features are continuous and normalized, and there are no missing values.

5.2.1.9 Colon Cancer Dataset

The Colon Cancer dataset originates from [13] as a well-known microarray dataset. It contains the expression of 2000 genes (i.e., features) across 62 tissues (i.e., instances). The identity (i.e., classification) of each instance is represented by either a -1 => "no cancer" (40 instances) or a 1 => "cancer" (22 instances). No values are reported missing and all 2000 features are continuous, and thus normalized according to the previously mentioned process.

5.2.1.10 Lymphoma Dataset

The B-Cell Lymphoma dataset stems from [181] and can be acquired from http://llmpp.nih.gov/lymphoma. It is comprised of gene expression data utilized to diagnose diffuse large B-cell lymphoma (DLBCL). The disease is an aggressive malignancy of mature B lymphocytes, with an annual incidence of over 25,000 cases, accounting for roughly 40% of cases of non-Hodgkin's lymphoma. The dataset contains 42 instances which can be classified as "GC_B-Like" (21 instances) or "Activated_B-like" (21 instances). Moreover, instances are described across 4026 continuous features. The original data is first "filtered" and log-transformed (base 2) [181]. Then, the data is standardized so that the mean and standard deviation are 0 and 1, respectively [168]. Missing values are replaced with the feature's mean value to keep the column's mean and standard deviation intact.
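To make these preparation steps concrete, the following is a minimal sketch of mean imputation for a continuous feature and of the maximum-absolute-value normalization described in Section 5.2.1 (median replacement for discrete features and the mean/standard-deviation standardization used for the Lymphoma data follow the same column-wise pattern). The array-based representation and method names are assumptions made for illustration only.

```java
import java.util.Arrays;

/** Minimal sketch of the preprocessing described in Section 5.2.1; names are illustrative. */
public class Preprocessing {

    /** Replaces missing values (encoded as NaN) in one continuous feature column with the column mean. */
    public static void imputeMean(double[] column) {
        double sum = 0.0;
        int count = 0;
        for (double v : column) {
            if (!Double.isNaN(v)) { sum += v; count++; }
        }
        double mean = (count > 0) ? sum / count : 0.0;
        for (int i = 0; i < column.length; i++) {
            if (Double.isNaN(column[i])) column[i] = mean;
        }
    }

    /** Divides each value in a continuous feature column by the column's maximum absolute value,
     *  scaling the column into [-1, 1] (or [0, 1] if all values are non-negative). */
    public static void normalizeMaxAbs(double[] column) {
        double maxAbs = 0.0;
        for (double v : column) maxAbs = Math.max(maxAbs, Math.abs(v));
        if (maxAbs == 0.0) return; // all zeros, nothing to scale
        for (int i = 0; i < column.length; i++) column[i] /= maxAbs;
    }

    public static void main(String[] args) {
        double[] feature = {2.0, Double.NaN, -4.0, 8.0};
        imputeMean(feature);        // NaN -> (2 - 4 + 8) / 3 = 2.0
        normalizeMaxAbs(feature);   // divide every value by 8
        System.out.println(Arrays.toString(feature)); // [0.25, 0.25, -0.5, 1.0]
    }
}
```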
5.2.2 Tools

The datasets used in this experiment are stored and accessed using SQL Maestro for MySQL. The CEFS algorithm is written using Java JDK 1.6 as the programming language of choice and jGRASP as the IDE.

5.2.3 Procedure

This section presents the procedural organization of this experiment, beginning at a high level and ending with low-level details. As previously mentioned, the available classifiers for this experiment are the NBC and kNN, and the available feature selection subset generators are SBS, SFS, and GA. The experimental process begins by separating each dataset into 15 randomly sampled sets of training and testing data, as described in Section 5.2.1. As the first phase of testing, three of the ten datasets (Breast Cancer, Credit Approval, and Congressional Voting) are deployed using the NBC/SBS combination. This is due, in part, to the reasons previously mentioned at the beginning of this chapter, but also because these three datasets contain strictly discrete input values, thus bypassing the need for discretization for the NBC. Given the positive results attained from the first phase of testing, five of the ten datasets are then tested under the kNN/SBS and kNN/SFS combinations. The five datasets include the aforementioned three used with NBC/SBS plus the Wine and Alabama County Jails datasets. These datasets are added because they contain discrete and continuous input data, and because the Wine dataset contains more than two distinct classifications. Following further promising results from the second phase of testing, the remaining five datasets (Musk v1, Arrhythmia, Madelon, Colon Cancer, and Lymphoma) are added to the previous five and are all tested under the kNN/GA combination. Note that these five remaining datasets are not tested under the SBS or SFS frameworks because the feature cardinality of each dataset makes the runtime simply infeasible. Conversely, the addition of these five datasets allows CEFS to be tested on medium- and high-dimensional data, thereby testing the scalability of the framework.

In terms of lower-level processing tasks, each run begins by selecting the dataset and the classifier. Once this is complete, the classifier is trained and a classification model is typically built. In the case of an NBC or a kNN, no formal model is built, as they are both lazy learners. Instead, a kNN will take this time to find an optimal k for the classifier, using a leave-one-out validation scheme to measure performance. The available k values are discussed in the Algorithm Constraints section of this chapter. Once the classifier is ready, the baseline accuracy is measured by applying the test set to the classifier.

The feature selection subset generator is then chosen. Using the traditional feature selection architecture, the overall best feature set is found. Another measure of accuracy is then taken, but this time using the overall best feature set. The class-specific ensemble machine then begins to be built. First, the same classifier used above is allocated to each ensemble module. Just as before, each module also searches for an optimal k if using a kNN classifier. However, the performance of each k is measured by the leave-one-out class-specific accuracy instead of the leave-one-out overall accuracy of the training set. Once the classifiers are ready, each ensemble module then uses the same feature selection subset generator to find an optimal class-specific feature subset. Just as before, this means measuring the performance of each candidate subset by calculating the accuracy on the class-specific instances only.
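The following is a minimal sketch of this class-specific evaluation, in which accuracy is computed only over the validation instances whose true label matches the classification assigned to the ensemble module; the names are illustrative assumptions rather than the actual implementation.

```java
import java.util.List;

/** Minimal sketch of the class-specific evaluation criterion; names are illustrative. */
public class ClassSpecificAccuracy {

    public interface Classifier {
        String classify(double[] instance);
    }

    /**
     * Accuracy computed only over the instances whose true label equals targetClass;
     * the overall (traditional) criterion would instead average over every instance.
     */
    public static double accuracyForClass(Classifier classifier, List<double[]> instances,
                                          List<String> labels, String targetClass) {
        int relevant = 0;
        int correct = 0;
        for (int j = 0; j < instances.size(); j++) {
            if (!labels.get(j).equals(targetClass)) continue; // ignore instances of other classes
            relevant++;
            if (classifier.classify(instances.get(j)).equals(targetClass)) correct++;
        }
        return relevant == 0 ? 0.0 : (double) correct / relevant;
    }
}
```

Note that this criterion is effectively the recall for the target classification, which is why, as discussed in Section 5.2.5.2 and in Chapter 6, a module can overfit by labeling every instance with its own classification.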
As soon as all class-specific subsets have been found, the system is ready to classify the test set. Each instance in the test set is then sent to the ensemble machine. Each ensemble module classifies the instance separately and sends its classification to the decision phase. If all modules classify the instance the same way, the ensemble machine uses that as its classification. If any module disagrees with the others, the test instance is sent back out to the classifier using the traditionally acquired overall best feature subset as the tie-breaker. The instance is then classified according to the latter classifier. The three test set accuracies are then printed and compared, along with the overall best feature subset and the class-specific best feature subsets.

5.2.4 Procedure Validation

Two steps are taken to validate the procedure and algorithms mentioned in the previous section. First, each algorithmic combination implemented in this experiment is traced, step by step, using the Play Tennis dataset (Table 2). This is possible through the use of the Java debugger (included in jGRASP), and due to the fact that the Play Tennis dataset is sufficiently small to warrant manual calculations. Accordingly, all calculations are first processed by hand and then validated through the IDE's debugger. The second step also requires the use of the IDE's debugger. This step randomly selects instances from the training and test sets and once again computes and compares the manual calculations to the ones provided by the algorithm at runtime. Although this step is conceptually similar to the first step, recall that the Play Tennis dataset is strictly discrete, so this step allows instances which are strictly continuous or of a mixed combination to also be validated.

5.2.5 Algorithms

As previously detailed in Section 4.3, two classifiers (NBC and kNN) and three feature selection subset generators (SBS, SFS, and GA) are implemented in this work. Although the NBC, SBS, and SFS algorithms have been sufficiently described to allow reproducible results, the other two algorithms, kNN and GA, need further description in regards to specific implementation details in order to ensure similarly reproducible results.

5.2.5.1 Algorithm Procedural Specifications

As commonly seen in most k Nearest Neighbor implementations, the algorithm implemented here utilizes the Euclidean distance to measure the likeness between instances. Specifically, the distance between two points for a continuous feature is computed as described in Section 2.4.4 of this work. Conversely, the distance between two points for a discrete feature is computed as either 0 (if the values are the same) or 1 (if the values are different). Moreover, the normalization or standardization of continuous features ensures that the calculated distance in those features also ranges between 0 and 1. Finally, it should be noted that if the classifier encounters a tie among the k nearest classifications, the final prediction is selected as the classification of the instance closest to the test/validation instance.
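A minimal sketch of this mixed distance computation is shown below: continuous features contribute a squared Euclidean term, while discrete features contribute 0 on a match and 1 on a mismatch. The array-based representation, the feature-subset mask, and the names are assumptions made for illustration.

```java
/** Minimal sketch of the mixed-feature distance used by the kNN classifier; names are illustrative. */
public class MixedDistance {

    /**
     * Continuous features contribute (a - b)^2; discrete features contribute
     * 0 if equal and 1 if different. Only features in the candidate subset count.
     */
    public static double distance(double[] a, double[] b, boolean[] isDiscrete, boolean[] featureSubset) {
        double sum = 0.0;
        for (int f = 0; f < a.length; f++) {
            if (!featureSubset[f]) continue;        // skip features outside the candidate subset
            if (isDiscrete[f]) {
                sum += (a[f] == b[f]) ? 0.0 : 1.0;  // discrete: simple mismatch
            } else {
                double diff = a[f] - b[f];          // continuous: Euclidean component
                sum += diff * diff;
            }
        }
        return Math.sqrt(sum);
    }
}
```

When a tie occurs among the k nearest labels, this distance would then be used to return the classification of the single closest instance, as described above.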
Several notes also need to be made in regards to the specifications of the GA. First, the initial randomly sampled population is generated based on percentages of the overall number of features. In other words, 1/5 of the initial population contains roughly 10% of the features, 1/5 contains 20%, and the remaining fifths contain 40%, 60%, and 80%, respectively. This creates a higher amount of diversity in the initial population, as the number of features ranges broadly. Second, candidate feature subsets with a higher leave-one-out validation accuracy (LOOVA) are deemed to be better (i.e., more fit). Moreover, if two candidate subsets yield the same LOOVA, then the one with the smaller number of features is deemed to be better. Parent selection is implemented using binary tournament selection, where two different candidate parents are randomly selected and the more fit of the two becomes one of the parents chosen for procreation. This occurs once for each candidate subset in the population (i.e., population-size times). Offspring are then created using crossover and mutation based on the probabilities detailed in the subsequent section. Finally, a generational selection approach is implemented, where the newly created offspring replace the parents. It should also be noted that the highest-ranking parent can survive each generation based on the probability for elitism, also detailed in the next section.

5.2.5.2 Algorithm Constraints

Two of the algorithms implemented in this experiment, kNN and GA, require constraints to be set or optimized before the algorithm is ready to be deployed. In the case of the kNN, a k value is optimized for the overall classifier and a different one is optimized for each class-specific ensemble module. During the preliminary stages of this research, the potential list of k values was as follows:

k: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 25, 0.25*n, 0.5*n, and 0.75*n (where n is the number of instances in the dataset and such that k < n)

However, it was observed that some of the class-specific ensemble modules were selecting k such that the classifier simply became a recaller, with no regard for precision. In other words, the class-specific ensemble modules exhibiting this trend classified every instance as their class-specific classification, severely overfitting the training data and generalizing poorly to test data. To adjust for these occurrences, k values greater than 10 were removed from the list of potential k values for class-specific modules.

A Genetic Algorithm (GA) also requires its constraints to be optimized in order to perform at peak capability. Most important in a GA, or in any Evolutionary Computation algorithm, is that candidate solutions converge toward a global optimum instead of getting trapped in a local optimum or simply jumping around the search space with little to no improvement. So, as a starting point, the following constraints were acquired from [101]:

Population size (i.e., number of candidate subsets): 50
Number of generations (i.e., number of runs): 20
Probability of crossover: 0.6
Probability of mutation: 0.001
Probability of selection of highest ranked individual (i.e., elitism): 0.6

However, the above settings did not allow for successful convergence in the majority of datasets. So, the constraints were modified until convergence was observed.
For all datasets except Colon Cancer and Lymphoma, the new constraints were as follows:

Population size (i.e., number of candidate subsets): 20
Number of generations (i.e., number of runs): 40
Probability of crossover: 0.6
Probability of mutation: 0.001
Probability of selection of highest ranked individual (i.e., elitism): 0.8

It was observed that the algorithm needed less diversity in the initial population and more time to explore the search space in order to converge on a near-optimal solution. Moreover, the higher elitism rate yielded greater selection pressure, which also aided in convergence. Given the significantly larger search space of the Colon Cancer and Lymphoma datasets, the population size is doubled and the elitism rate is raised to 0.9 for those two datasets. Although the number of features in both datasets would normally make this infeasible in terms of time, the small number of instances in each dataset allows it to be implemented.

5.2.6 Results

The results in this section provide a basis for comparative analysis in two areas: performance and domain understandability. The results are also utilized to test the hypothesis previously presented in this text. Individual run results are provided in Appendices A and B.

5.2.6.1 Performance

The central performance metric utilized to compare architectures in this study is the average prediction accuracy obtained from the 15 randomly sampled test sets of each dataset. The average values are presented in the subsequent tables and figures in this section. Refer to Appendix A for a full list of accuracy values for each individual run. The first set of measures is provided in Table 7 and depicted in Figure 19. They contain the average accuracies using the NBC/SBS combination for classification. For these, and for the remaining results in this section, a higher accuracy represents better performance. Note that CEFS outperforms traditional feature selection in all three datasets. Although this trend continues throughout the remaining results, these results already show that part 1 of the hypothesis can be accepted. Specifically, an effective recombination technique has been found which allows the CEFS architecture to outperform a traditional feature selection architecture in terms of prediction accuracy.

Table 7: Tabularized comparison of feature selection architectures based on dataset accuracy averages (under NBC/SBS algorithms)

Dataset                 No Feature Selection   Traditional Feature Selection   CEFS
Breast Cancer                          66.44                           68.89   69.20
Credit Approval                        85.33                           83.77   84.23
Congressional Voting                   89.74                           94.49   94.61

Figure 19: Graphical comparison of feature selection architectures based on dataset accuracy averages (under NBC/SBS algorithms)

Also, note that the accuracy of the Credit Approval dataset under the architecture without feature selection is actually higher than under either architecture containing feature selection (Table 7 and Figure 19). Although this is not common and happens rarely in this study, it is a clear reminder that there is no silver bullet in machine learning and that there are certain cases where feature selection can actually hinder performance. As will be repeatedly shown in this study, however, this is much more the exception than the norm.
Presented next, in Table 8 and Table 9 and in Figure 20 and Figure 21, are the results from the kNN/SBS and kNN/SFS combinations. Much like the previous results, the CEFS architecture continues to outperform the traditional feature selection architecture in all datasets.

Table 8: Tabularized comparison of feature selection architectures based on dataset accuracy averages (under kNN/SBS algorithms)

Dataset                 No Feature Selection   Traditional Feature Selection   CEFS
Breast Cancer                          77.20                           78.44   81.64
Wine                                   94.06                           93.44   93.75
Credit Approval                        74.40                           84.47   85.50
Congressional Voting                   91.30                           94.01   94.53
Alabama County                         49.52                           62.37   63.42

Table 9: Tabularized comparison of feature selection architectures based on dataset accuracy averages (under kNN/SFS algorithms)

Dataset                 No Feature Selection   Traditional Feature Selection   CEFS
Breast Cancer                          77.20                           80.89   83.64
Wine                                   94.06                           91.56   92.92
Credit Approval                        74.40                           84.23   85.20
Congressional Voting                   91.30                           94.13   94.49
Alabama County                         49.52                           66.36   67.50

Figure 20: Graphical comparison of feature selection architectures based on dataset accuracy averages (under kNN/SBS algorithms)

Figure 21: Graphical comparison of feature selection architectures based on dataset accuracy averages (under kNN/SFS algorithms)

Next, the results from the kNN/GA combination are presented in Table 10 and in Figure 22. The overall performance improvement trend seen in the previous algorithmic combinations is also seen here, as CEFS outperforms traditional feature selection and no feature selection in all 10 datasets. Additionally, the accuracy improvements seen on the five large datasets and the two multi-classification datasets suggest that part 2 of the hypothesis can also be accepted. In other words, CEFS may successfully scale up using the framework presented here.

Table 10: Tabularized comparison of feature selection architectures based on dataset accuracy averages (under kNN/GA algorithms)

Dataset                 No Feature Selection   Traditional Feature Selection   CEFS
Breast Cancer                          77.20                           79.96   81.24
Wine                                   94.06                           94.69   95.42
Credit Approval                        74.40                           85.20   85.50
Congressional Voting                   91.30                           93.93   94.09
Alabama County                         49.52                           62.37   64.17
Musk v1                                78.77                           80.06   80.18
Arrhythmia                             67.10                           68.36   69.59
Madelon                                61.32                           62.40   62.82
Colon Cancer                           65.90                           67.69   68.97
Lymphoma                               75.83                           75.00   77.50

Figure 22: Graphical comparison of feature selection architectures based on dataset accuracy averages (under kNN/GA algorithms)

As shown by the results in this section, choosing an effective recombination technique for the ensemble system allows CEFS to become a better-performing architecture than existing architectures. Moreover, the performance improvements on datasets ranging from 9 features to over 4000 features show that CEFS can scale up. These findings support the validity of the first two parts of the hypothesis presented in this research.
The next section explores the remaining two parts.

5.2.6.2 Domain Understandability

One of the great advantages of feature selection is the improved data domain understandability afforded by removing irrelevant and redundant features. This key benefit allows domain experts to shift their focus strictly to the most resilient and discriminating features while also reducing model complexity. However, parts 3 and 4 of the hypothesis suggest that traditional feature selection methods may actually miss crucially relevant information when all classifications are considered at once.

To test parts 3 and 4, the best found subset was recorded after each run and for each dataset. The detailed list of these subsets is found in Appendix B. So, consider the best found subsets for run #1 of the kNN/SFS algorithms, using the Congressional Voting dataset, tabularized in Table 11. The feature subset data from this run alone provides the necessary information to accept parts 3 and 4 of the hypothesis. As shown, the best overall feature set includes features 4, 7, and 12. Conversely, the union of the class-specific best feature sets includes features 4, 5, 7, and 11. Since feature 11 is included in the union of the class-specific sets but not in the overall best set, it follows that the union of the class-specific subsets is not contained in the overall best set, thereby proving part 3 of the hypothesis. On the other hand, feature 12 is included in the best overall set but not in the union of the class-specific sets, thus showing that the overall best set is not contained in the union of the class-specific best sets, and therefore proving part 4 of the hypothesis. (These two relations are restated in the notation of Section 4.1 at the end of this section.)

Table 11: Best found subsets for run #1 of kNN/SFS (using Congressional Voting dataset)

Class label    F1  F2  F3  F4  F5  F6  F7  F8  F9  F10  F11  F12  F13  F14  F15  F16
Overall         0   0   0   X   0   0   X   0   0    0    0    X    0    0    0    0
republican      0   0   0   X   0   0   X   0   0    0    0    0    0    0    0    0
democrat        0   0   0   X   X   0   0   0   0    0    X    0    0    0    0    0

If a feature is included in the overall best feature set, it should intuitively follow that the feature is relevant to predicting at least one of the classifications in the dataset. The proof and acceptance of part 4 of the hypothesis clearly disproves this notion. On the other hand, a feature relevant to a single classification should intuitively be relevant enough to appear in the overall best feature set. Once again, this notion is rejected as part 3 of the hypothesis is accepted. The question then becomes whether this improves or diminishes the understandability of the data domain. The answer is that compelling arguments can be made for both cases. On one hand, the union of the best class-specific subsets will generate a higher number of features, as the features relevant to one classification may be significantly different from the features relevant to another. This may in turn increase model complexity and decrease understandability. On the other hand, by performing feature selection in a class-specific manner, features can be isolated to individual classifications, thereby greatly improving classification understandability, a characteristic not available in traditional feature selection architectures. Moreover, certain classification tasks may wish to focus on one classification over another, a task that can be implemented in a CEFS architecture, but not in a traditional feature selection architecture.
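Restating the Table 11 example in the notation of Section 4.1 makes the two relations behind parts 3 and 4 explicit:

```latex
% Run #1, kNN/SFS, Congressional Voting (Table 11), in the notation of Section 4.1
\begin{align*}
X &= \{4, 7, 12\}, &
X_{t'} &= X_{\text{republican}} \cup X_{\text{democrat}} = \{4, 7\} \cup \{4, 5, 11\} = \{4, 5, 7, 11\},\\
11 &\in X_{t'} \setminus X \;\Rightarrow\; X_{t'} \not\subseteq X \text{ (part 3)}, &
12 &\in X \setminus X_{t'} \;\Rightarrow\; X_{t'} \not\supseteq X \text{ (part 4)}.
\end{align*}
```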
Chapter 6
Conclusion

As discussed in Chapter 1: Introduction, the significant growth in data storage and the widening disparity between data processing and data acquisition have produced ample reasons to employ feature selection techniques when performing machine learning tasks. By effectively removing irrelevant and redundant features from the dataset, such techniques provide essential benefits such as improved learner accuracy, improved domain understandability, decreased model complexity, decreased training time, and decreased storage requirements. As the potential for increased benefit is significant, this research subsequently provides in-depth discussions of Classification and Feature Selection in Chapters 2 and 3, respectively.

Given the clear motivation and objectives presented in Chapters 1 through 3, a novel class-specific ensemble feature selection architecture is proposed and comprehensively presented in Chapter 4. This architecture builds an ensemble of models, each with a feature subset optimized with respect to a distinct classification and under a specific base classifier. The proposed method seeks to outperform (in terms of prediction accuracy) a traditional feature selection architecture by providing an approach which brings sought-after diversity and disagreement to the ensemble model while supplying the same model with feature subsets that are highly useful to the base classifiers themselves.

So, to test the effectiveness of the aforementioned architecture and the hypothesis surrounding the potential benefits of such a design, a comprehensive experiment is constructed. This experiment is presented in detail in Chapter 5 and consists of two classifiers, three feature selection subset generators, and ten datasets. The results obtained from the experiment solidify the potential benefits of a class-specific feature selection architecture and allow for the acceptance of the four-part hypothesis proposed at the beginning of this study. In summary, a strong case is made for the implementation of a class-specific architecture when performing feature selection tasks.

Plenty of work remains to be completed, however. One of the main drawbacks of a class-specific architecture is the ease with which an ensemble module can overfit the training data. This problem is exacerbated by the fact that a module can sometimes become a pure recaller, with no regard for precision (essentially classifying every instance it sees as the class-specific classification). Given the greedy nature of choosing the best-performing class-specific subset, 100% recall means a perfect class-specific evaluation score for that particular classification, which can clearly be a problem. So, the class-specific subset evaluation criterion must be balanced in the future, as precision must also be taken into account.

It should also be noted that although the CEFS architecture outperformed the traditional feature selection architecture on average, it most often did not do so with statistical significance. Specifically, p-values ranged from 0.04 to 0.91 using two-tailed t-tests. This was especially true as the search space of features scaled up. In other words, the stability of the architecture decreased as the number of features increased. Creating a more stable architecture would likely positively affect the overall performance, thus providing statistically significant results.
Finally, it is important to continue to extend CEFS to other classifiers and other feature selection algorithms so that the architecture can be tested on a wider base of techniques and so that further benefits can be explored. 84 References [1] G. Piatetsky-Shapiro, "Knowledge Discovery in Databases: 10 years later," SIGKDD Explorations, vol. 1, no. 2, pp. 59-61, Jan. 2000. [2] Thomson Reuters. (2009, May) What Makes Information Intelligent?. [Online]. http://thomsonreuters.com/content/corporate/articles/What_makes_information_in tel [3] R. C. Craddock, P. Holtzheimer III, X. Hu, and H. Mayberg, "Disease state prediction from resting state functional connectivity," Magnetic Resonance in Medicine, vol. 62, no. 6, pp. 1619-1628, Oct. 2009. [4] H. Liu and L. Yu, "Toward Integrating Feature Selection Algorithms for Classification and Clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 4, pp. 491-502, Apr. 2005. [5] J. Alpert and N. Hajaj. (2008, Jul.) The Official Google Blog: We knew the web was big.... [Online]. http://googleblog.blogspot.com/2008/07/we-knew-web-was- big.html [6] C. Soares, L. Montgomery, K. Rouse, and J. E. Gilbert, "Automating Classification using General Regression Neural Networks," in Proceedings of the 2008 Seventh International Conference on Machine Learning and Applications, San Diego, CA, 2008, pp. 508-513. [7] A. Jain and D. Zongker, "Feature Selection: Evaluation, Application, and Small Sample Performance," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 2, pp. 153-158, 1997. [8] C. Soares, P. Williams, J. E. Gilbert, and G. Dozier, "A Class-Specific Ensemble Feature Selection Approach for Classification Problems," in Proceedings of the 48th Annual Southeast Regional Conference, Oxford, MS, April 15-17, 2010. [9] E. Bauer and R. Kohavi, "An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants," Machine Learning, vol. 36, no. 1, 2, pp. 105-139, 1999. [10] T. G. Dietterich, "Ensemble Learning Methods," in Handbook of Brain Theory and Neural Networks, M. A. Arbib, Ed. MIT Press, 2001. [11] L. Hansen and P. Salamon, "Neural Network Ensembles," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, pp. 993-1001, 1990. [12] T. R. Golub, et al., "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring," Science, vol. 286, no. 5439, pp. 531- 537, 1999. 85 [13] U. Alon, et al., "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays," PNAS, vol. 96, no. 12, pp. 6745-6750, 1999. [14] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ: Pearson Education, 2003. [15] P. Langley and H. A. Simon, "Applications of machine learning and rule induction," Communications of the ACM, vol. 38, no. 11, pp. 55-64, 1995. [16] D. H. Wolpert and W. G. Macready, "No free lunch theorems for search," Santa Fe Institute, Santa Fe, CA, Technical Report SFI-TR-95-02-010, 1995. [17] R. Bellman, Adaptive Control Processes: A Guided Tour. Princeton University Press, 1961. [18] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison- Wesley, 2006. [19] D. Koller and M. Sahami, "Toward optimal feature selection," in Proceedings of the 13th International Conference on Machine Learning, San Francisco, CA, 1996, pp. 284-292. [20] J. Sinkkonen and W. S. Sarle. (2002, Oct.) 
Neural Network FAQ, part 2 of 7: Learning. [Online]. ftp://ftp.sas.com/pub/neural/FAQ2.html [21] B. W. Silverman, Density Estimation for Statistics and Data Analysis, 1st ed. Chapman and Hall/CRC Press, 1986. [22] I. Guyon and A. Elisseeff, "An Introduction to Variable and Feature Selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003. [23] D. L. Donoho, "High-dimensional data analysis: The curses and blessings of dimensionality," in Aide-Memoire of a Lecture at AMS Conference on Math Challenges of the 21st Century, 2000. [24] A. Asuncion and D. J. Newman. (2007) UCI Machine Learning Repository. [Online]. http://archive.ics.uci.edu/ml/ [25] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, pp. 81-106, 1986. [26] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993. [27] A. Y. Ng, "Feature selection, L1 vs. L2 regularization, and rotational invariance," in International Conference on Machine Learning, 2004. [28] D. L. Olson and D. Delen, Advanced Data Mining Techniques. Springer, 2010. [29] Wikipedia. (2010, Jun.) Precision and Recall. [Online]. http://en.wikipedia.org/wiki/Precision_and_recall [30] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Wadsworth International Group, 1984. [31] E. Hunt, J. Martin, and P. Stone, Experiments in Induction. New York: Academic Press, 1966. [32] S. B. Kotsiantis, "Supervised Machine Learning: A Review of Classification Techniques," Informatica, vol. 31, pp. 249-268, Jul. 2007. 86 [33] S. K. Murthy, "Automatic Construction of Decision Trees from Data: A Multi- Disciplinary Survey," Data Mining and Knowledge Discovery, vol. 2, no. 4, pp. 345-389, Dec. 1998. [34] S. R. Safavin and D. Landgrebe, "A survey of decision tree classifier methodology," IEEE Transactions on Systems, Man, and Cybernetics, vol. 21, no. 3, pp. 660- 674, 1991. [35] D. Larose, Discovering Knowledge in Data: An Introduction to Data Mining. Hoboken, NJ: John Wiley & Sons, 2005. [36] W. Cohen, "Fast Effective Rule Induction," in Proceedings of the International Conference on Machine Learning, 1995, pp. 115-123. [37] J. Furnkranz, "Separate-and-Conquer Rule Learning," Artificial Intelligence Review, vol. 13, pp. 3-54, 1999. [38] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D. E. Rumelhart and J. L. McClelland, Eds. Cambridge, MA: MIT Press, 1986, pp. 318-362. [39] M. N. Siddique and M. O. Tokhi, "Training Neural Networks: Backpropagation vs. Genetic Algorithms," IEEE International Joint Conference on Neural Networks, vol. 4, pp. 2673-2678, 2001. [40] G. Zhang, "Neural Networks for classification: a survey," IEEE Transactions on Systems, Man, and Cybernetics, vol. 30, no. 4, pp. 451-462, 2000. [41] L. Jiang, H. Zhang, and Z. Cai, "A Novel Bayes Model: Hidden Naive Bayes," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 10, pp. 1361- 1371, Oct. 2009. [42] C. Soares, L. Montgomery, C. Hamilton, and J. E. Gilbert, "Improving Accuracy in the Montgomery County Corrections Program Using Case-Based Reasoning," in Proceedings of the 2008 Seventh International Conference on Machine Learning and Applications, San Diego, CA, December 11 - 13, 2008, pp. 318-323. [43] L. Jiang, Z. Cai, D. Wang, and S. 
Jiang, "Survey of Improving K-Nearest Neighbor for Classification," in Proceedings of the Fourth International Conference on Fuzzy Systems and Knowledge Discovery, 2007, pp. 679-683. [44] T. Cover and P. Hart, "Nearest Neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21-27, 1967. [45] C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121-167, Jun. 1998. [46] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge: Cambridge University Press, 2000. [47] A. Tsymbal, S. Puuronen, and D. Patterson, "Ensemble feature selection with the simple Bayesian classification," Information Fusion, vol. 4, pp. 87-100, 2003. [48] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in Proceedings of the International Joint Conference on Artificial Intelligence, 1995. 87 [49] C. Merz, "Dynamical selection of learning algorithms," in Learnin from Data, Artificial Intelligence and Statistics, D. Fisher and H. .-J. Lenz, Eds. NY: Springer-Verlag, 1996. [50] M. Koppel and S. P. Engelson, "Integrating multiple classifiers by finding their areas of expertise," in AAAI-96 Workshop On Integrating Multiple Learning Models, 1996, pp. 53-58. [51] A. Krogh and J. Vedelsby, "Neural Network ensembles, cross validation, and active learning," in Advances in Neural Information Processing Systems, D. Touretzky and T. Leen, Eds. Cambridge, MA: MIT Press, 1995, vol. 7, pp. 231-238. [52] K. Tumer and J. Ghosh, "Error correlation and error reduction in ensemble classifiers," Connection Science, Special Issue on Combining Artificial Neural Networks: Ensemble Approaches, vol. 9, no. 3, 4, pp. 385-404, 1996. [53] C. Brodley and T. Lane, "Creating and exploiting coverage and diversity," in AAAI- 96 Workshop on Integrating Multiple Learned Models, 1996, pp. 8-14. [54] L. Asker and R. Maclin, "Ensembles as a sequence of classifiers," in Proceedings of the 15th International Joint Conference on Artificial Intelligence, San Francisco, CA, 1997, pp. 860-865. [55] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth, "Occam's razor," Information Processing Letters, no. 24, pp. 377-380, 1987. [56] J. Rissanen, "Stochastic complexity and modeling," The Annals of Statistics, no. 14, pp. 1080-1100, 1986. [57] C. Wallace and P. Freeman, "Estimation and inference by compact coding," Journal of the Royal Statistical Society, no. 49, pp. 240-265, 1987. [58] G. H. John, R. Kohavi, and K. Pfleger, "Irrelevant features and the subset selection problem," in Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ, 1994, pp. 121-129. [59] Y. Kim, W. N. Street, and F. Menczer, "Feature selection in unsupervised learning via evolutionary search," in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000, pp. 365-369. [60] Y. Kim, "Feature Selection in Supervised Learning and Unsupervised Learning via Evolutionary Search," University of Iowa Thesis, 2001. [61] R. Kohavi and G. H. John, "Wrappers for feature subset selection," Artificial Intelligence: Special Issue on relevance, vol. 97, no. 1 - 2, pp. 273-324, 1997. [62] B. V. Dasarathy, Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. Los Alamitos, CA: IEEE Computer Society Press, 1990. [63] D. W. Aha, D. Kibler, and M. K. 
Albert, "Instance Based learning algorithms," Machine Learning, vol. 6, no. 1, pp. 37-66, 1991. [64] P. Langley, W. Iba, and K. Thompson, "An analysis of Bayesian classifiers," in Proceedings of the Tenth National Conference on Artificial Intelligence, 1992, pp. 223-228. [65] R. Duda and P. Hart, Pattern Classification and Scene Analysis. Wiley, 1973. [66] I. J. Good, The Estimation of Probabilities: An Essay on Modern Bayesian Methods. MIT Press, 1965. 88 [67] A. L. Blum and R. L. Rivest, "Training a 3-Node Neural Networks is NP-Complete," Neural Networks, vol. 5, pp. 117-127, 1992. [68] E. Amaldi and V. Kann, "On the approximation of minimizing non zero variables or unsatisfied relations in linear systems," Theoretical Computer Science, vol. 209, pp. 237-260, 1998. [69] A. Kalousis, J. Prados, and M. Hilario, "Stability of Feature Selection Algorithms," in Proceedings of the Fifth IEEE International Conference on Data Mining, Houston, TX, 2005. [70] A. Kalousis, J. Prados, and M. Hilario, "Stability of feature selection algorithms: a study on high-dimensional spaces," Knowledge and Information Systems, vol. 12, no. 1, pp. 95-116, 2006. [71] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1 - 3, pp. 389-422, 2002. [72] R. Nilsson, J. M. Pena, J. Bjorkegren, and J. Tegner, "Consistent Feature Selection for Pattern Recognition," Journal of Machine Learning Research, vol. 8, pp. 589- 612, 2007. [73] E. Xing, M. Jordan, and R. Karp, "Feature Selection for High-Dimensional Genomic Microarray Data," in Proceedings of the 15th International Conference on Machine Learning, 2001, pp. 601-608. [74] F. K. Ahmad, N. M. Norwawi, S. Deris, and N. H. Othman, "A review of feature selection techniques via gene expression profiles," International Symposium on Information Technology, vol. 2, pp. 1-7, Aug. 2008. [75] P. Langley, "Selection of relevant features in Machine Learning," in AAAI Fall Symposium on Relevance, 1994, pp. 140-144. [76] R. Caruana and D. Freitag, "Greedy attribute selection," in Machine Learning: Proceedings of the Eleventh International Conference, 1994, pp. 28-36. [77] G. Forman, "An Extensive Empirical Study of Feature Selection Metrics for Text Classification," Journal of Machine Learning Research, vol. 3, pp. 1289-1305, 2003. [78] H. Lu, et al., "The Effects of Domain Knowledge Relations on Domain Text Classification," in Proceedings of the 27th Chinese Control Conference, Kumming, Yunnan, China, 2008, pp. 460-463. [79] Q. Chen, "Feature Selection for the Topic-Based Mixture Model in Factored Classification," in Proceedings of the 2006 International Conference on Computational Intelligence and Security, 2006, pp. 39-44. [80] E. Gabrilovich and S. Markovitch, "Text categorization with many redundant features: using aggressive feature selection to make SVMs competitive with C4.5," in Proceedings of the 21st International Conference on Machine Learning, 2004, pp. 41-48. 89 [81] A. Dasgupta, P. Drineas, B. Harb, V. Josifovski, and M. W. Mahoney, "Feature Selection Methods for Text Classification," in Proceedings of the 13th SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, 2007, pp. 230-239. [82] D. W. Aha and R. L. Bankert, "Feature selection for the case-based classification of cloud types: An empirical comparison," in Working Notes of the AAAI-94 Workshop on Case-Based Reasoning, 1994, pp. 106-112. [83] M. 
Ben-Bassat, Use of distance measures, information measures and error bounds in feature evaluation, 2nd ed., P. R. Krishnaiah and L. N. Kanal, Eds. North-Holland Publishing Company, 1982. [84] A. Miller, Subset Selection in Regression, 2nd ed. Chapman & Hall/CRC, 2002. [85] N. R. Draper and H. Smith, Applied Regression Analysis, 2nd ed. John Wiley and Sons, 1981. [86] J. Doak, "An evaluation of feature selection methods and their application to computer security," University of California at Davis, Technical Report CSE-92- 18, 1992. [87] P. Pudil, J. Novovicova, and J. Kittler, "Floating search methods in feature selection," Pattern Recognition Letters, vol. 15, pp. 1119-1125, 1994. [88] M. Kudo and J. Sklansky, "Comparison of algorithms that select features for pattern classifiers," Patter Recognition, vol. 33, pp. 25-41, 2000. [89] Y. Kim, W. N. Street, and F. Menczer, "An evolutionary multi-objective local selection algorithm for customer targeting," in Proceedings of Congress on Evolutionary Computation, 2001, pp. 759-766. [90] R. Kohavi and D. Sommerfield, "Feature subset selection using the wrapper model: Overfitting and dynamic search space topology," in The First International Conference on Knowledge Discovery and Data Mining, 1995, pp. 192-197. [91] M. Dash and H. Liu, "Feature selection for classification," Intelligent Data Analysis, vol. 1, no. 1 - 4, pp. 131-156, 1997. [92] D. W. Aha and R. L. Bankert, "A comparative evaluation of sequential feature selection algorithms," in Proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, 1995, pp. 1-7. [93] H. Liu and H. Motoda, Feature selection for Knowledge Discovery and Data Mining. Springer, 1998. [94] A. Arauzo-Azofra and J. M. Benitez, "Empirical Study of Feature Selection Methods in Classification," in Proceedings of the Eighth International Conference on Hybrid Intelligent Systems, 2008, pp. 584-589. [95] P. M. Narendra and K. Fukunaga, "A Branch and Bound Algorithm for Feature Subset Selection," IEEE Transactions on Computers, vol. 26, no. 9, pp. 917-922, Sep. 1977. [96] Y. Hamamoto, S. Uchimura, Y. Matsunra, T. Kanaoka, and S. Tomita, "Evaluation of the Branch and Bound Algorithm for Feature Selection," Pattern Recognition Letters, vol. 11, pp. 453-456, Jul. 1990. 90 [97] G. Brassard and P. Bratley, Fundamentals of Algorithms. New Jersey: Prentice Hall, 1996. [98] H. Liu and R. Setiono, "Feature Selection and Classification - a probabilistic wrapper approach," in Proceedings of the 9th International Conference on Industrail and Engineering Applications of AI and Expert Systems, Fukuoka, Japan, 1996, pp. 419-424. [99] W. Siedlecki and J. Sklansky, "A Note on Genetic Algorithms for Large-Scale Feature Selection," Pattern Recognition Letters, vol. 10, pp. 335-347, Nov. 1989. [100] H. Vafaie and I. F. Imam, "Feature Selection Methods: Genetic Algorithms vs. Greedy-Like Search," in Proceedings of the International Conference on Fuzzy and Intelligent Control Systems, 1994. [101] J. Yang and V. Honavar, "Feature Subset Selection Using a Genetic Algorithm," in Feature Extraction, Construction and Selection: A Data Mining Perspective, 1998, pp. 117-136. [102] D. E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley Professional, 1989. [103] J. H. Holland, Adaptation in natural and Artificial Systems. Ann Arbor, IL: The University of Michigan Press, 1975. [104] H. Almuallim and T. G. 
Dietterich, "Learning Boolean Concepts in the Presence of Many Irrelevant Features," Artificial Intelligence, vol. 69, no. 1-2, pp. 279-305, 1994. [105] M. Ben-Bassat, "Pattern Recognition and Reduction of Dimensionality," in Handbook of Statistics-II, P. R. Krishnaiah and L. N. Kanal, Eds. North Holland, 1982, pp. 773-791. [106] M. A. Hall, "Correlation-Based Feature Selection for Discrete and Numeric Class Machine Learning," in Proceedings of the 17th International Conference on Machine Learning, 2000, pp. 359-366. [107] A. L. Blum and P. Langley, "Selection of Relevant Features and Examples in Machine Learning," Artificial Intelligence, vol. 97, pp. 245-271, 1997. [108] C. Apte, F. J. Damerau, and S. M. Weiss, "Automated Learning of Decision Rules for Text Categorization," ACM Transactions on Information Systems, vol. 12, no. 3, p. 233, 1994. [109] D. D. Lewis, "Representation and learning in information retrival.," Department of Computer Science, University of Massachusetts, Amherst, Doctoral dissertation and Technical Report UM-CS-1991-093, 1992. [110] D. D. Lewis, "Feature selection and feature extraction for text categorization," in Proceedings of Speech and Natural Language Workshop, San Francisco, CA, 1992, pp. 212-217. [111] H. Almuallim and T. G. Dietterich, "Learning with many irrelevant features," in Proceedings of the Ninth National Conference on Artificial Intelligence, San Jose, CA, 1991, pp. 547-552. 91 [112] K. Kira and L. A. Rendell, "The feature selection problem: Traditional methods and a new algorithm," in Tenth National Conference on Artificial Intelligence, 1992, pp. 129-134. [113] I. Kononenko, "Estimating attributes: Analysis and extensions of RELIEF," Machine Learning: ECML-94, vol. 784/1994, pp. 171-182, 1994. [114] G. Liu, L. Dong, S. Yuan, Y. Liu, and Y. Li, "New Feature Selection Algorithm based on Potential Difference," in Proceedings of the 2007 IEEE International Conference on Mechatronics and Automation, Harbin, China, 2007, pp. 566-570. [115] S. B. Cho and H. H. Won, "Machine Learning in DNA Microarray Analysis for Cancer Classification," in Proceedings of the First Asia-Pacific Bioinformatics Conference, 2003. [116] P. Langeley and S. Sage, "Oblivious Decision Trees and Abstract Cases," in Working Notes of the AAAI-94 Workshop on Case-Based Reasoning, Seattle, WA, 1994, pp. 113-117. [117] H.-H. Hsu, C.-W. Hsieh, and M.-D. Lu, "A Hybrid Feature Selection Mechanism," in Eighth International Conference on Intelligent Systems Design and Applications, 2008, pp. 271-276. [118] M. Dash and H. Liu, "Feature Selection for Clustering," in Proceedings of the Fourth Pacific Asia Conference on Knowledge Discovery and Data Mining, 2000, pp. 110-121. [119] S. Das, "Filters, Wrappers and a Boosting-Based Hybrid for Feature Selection," in Proceedings of the 18th International Conference on Machine Learning, 2001, pp. 74-81. [120] D. Lei, Y. Xiaochun, and X. Jun, "Optimizing Traffic Classification Using Hybrid Feature Selection," in Proceedings of the Ninth International Conference on Web- Age Information Management, 2008, pp. 520-525. [121] M. Baglioni, B. Furletti, and F. Turini, "DrC4.5: Improving C4.5 by means of prior knowledge," in Proceedings of the 2005 ACM Symposium on Applied Computing, Santa Fe, CA, 2005, pp. 474-481. [122] J. Mao, K. Mohiuddin, and A. K. Jain, "Parsimonious Network Design and Feature Selection Through Node Pruning," in Proceedings of the International Conference on Pattern Recognition, Jerusalem, 1994, pp. 622-624. [123] G. Pagallo and D. 
Haussler, "Boolean feature discovery in empirical learning," Machine Learning, vol. 5, pp. 71-99, 1990. [124] L. Xu, P. Yan, and T. Chang, "Best First Strategy for Feature Selection," in Proceedings of the Ninth International Conference on Pattern Recognition, 1988, pp. 706-708. [125] H. Liu, H. Motoda, and L. Yu, "Feature Selection with Selective Sampling," in Proceedings of the 19th International Conference on Machine Learning, 2002, pp. 395-402. [126] P. Pudil and J. Novovicova, "Novel Methods for Subset Selection with Respect to Problem Knowledge," in Feature Extraction, Construction and Selection: A Data Mining Perspective, H. Liu and H. Motoda, Eds. Springer, 2001, pp. 101-116. 92 [127] J. Segen, "Feature Selection and Constructive Inference," in Proceedings of the Seventh International Conference on Pattern Recognition, 1984, pp. 1344-1346. [128] J. Sheinvald, B. Dom, and W. Niblack, "A Modeling Approach to Feature Selection," in Proceedings of the 10th International Conference on Pattern Recognition, 1990, pp. 535-539. [129] C. Cardie, "Using Decision Trees to Improve Case-Based Learning," in Proceedings of the 10th International Conference on Machine Learning, 1993, pp. 25-32. [130] L. Yu and H. Liu, "Feature Selection for High-Dimensional Data: A Fast Correlation- Based Filter Solution," in Proceedings of the 20th International Conference on Machine Learning, 2003, pp. 856-863. [131] L. Bobrowski, "Feature Selection Based on Some Homogeneity Coefficient," in Proceedings of the Ninth International Conference on Pattern Recognition, 1988, pp. 544-546. [132] M. Modrzejewski, "Feature Selection Using Rough Sets Theory," in Proceedings of the European Conference on Machine Learning, 1993, pp. 213-226. [133] A. N. Mucciardi and E. E. Gose, "A Comparison of Seven Techniques for Choosing Subsets of Pattern Recognition," IEEE Transactions on Computers, vol. 20, pp. 1023-1031, 1971. [134] N. Slonim, G. Bejerano, S. Fine, and N. Tishbym, "Discriminative Feature Selection via Multiclass Variable Memory Markov Model," in Proceedings of the 19th International Conference on Machine Learning, 2002, pp. 578-585. [135] H. Liu, H. Motoda, and M. Dash, "A Monotonic Measure for Optimal Feature Selection," in Proceedings of the 10th European Conference on Machine Learning, 1998, pp. 101-106. [136] A. L. Oliveira and A. S. Vincentelli, "Constructive Induction Using a Non-Greedy Strategy for Feature Selection," in Proceedings of the Ninth International Conference on Machine Learning, 1992, pp. 355-360. [137] J. C. Schlimmer, "Efficiently Inducing Determinations: A Completeand Systematic Search Algorithm that Uses Optimal Pruning," in Proceedings of the 10th International Conference on Machine Learning, 1993, pp. 284-290. [138] M. Dash, "Feature Selection via Set Cover," in Proceedings of the IEEE Knowledge and Data Engineering Exchange Workshop, 1997, pp. 165-171. [139] H. Liu and R. Setiono, "A Probabilistic Approach to Feature Selection - A Filter Solution," in Proceedings of the 13th International Conference on Machine Learning, 1996, pp. 319-327. [140] I. Foroutan and J. Sklansky, "Feature Selection for Automatic Classification of Non- Gaussian Data," IEEE Transactions of Systems, Man, and Cybernatics, vol. 17, no. 2, pp. 187-198, 1987. [141] M. Ichino and J. Sklansky, "Feature Selection for Linear Classifier," in Proceedings of the Seventh International Conference on Pattern Recognition, 1984, pp. 124- 127. 93 [142] M. Ichino and J. 
Sklansky, "Optimum Feature Selection by Zero-One Programming," IEEE Transactions on Systems, Man, and Cybernetics, vol. 14, no. 5, pp. 737- 746, 1984. [143] P. A. Devijver and J. Kittler, attern Recognition: A Statistical Approach. Prentice Hall International, 1982. [144] P. Domingos, "Context Sensitive Feature Selection for Lazy Learners," AI Revolution, vol. 14, pp. 227-253, 1997. [145] A. W. Moore and M. S. Lee, "Efficient Algorithms for Minimizing Cross Validation Error," in Proceedings of the 11th International Conference on Machine Learning, 1994, pp. 190-198. [146] C. E. Queiros and E. S. Gelsema, "On Feature Selection," in Proceedings of Seventh International Conference on Pattern Recognition, 1984, pp. 128-130. [147] D. B. Skalak, "Prototype and Feature Selection by Sampling and Random Mutation Hill Climbing Algorithms," in Proceedings of the 11th International Conference on Machine Learning, 1994, pp. 293-301. [148] D. J. Stracuzzi and P. E. Utgoff, "Randomized Variable Elimination," in Proceedings of the 19th International Conference on Machine Learning, 2002, pp. 594-601. [149] J. Kittler, Feature Selection and Extraction. Academic Press, 1986. [150] L. Rendell and R. Seshu, "Learning hard concepts through constructive induction: framework and rationale," Computational Intelligence, vol. 6, no. 4, pp. 247-270, Nov. 1990. [151] B. Scholkopf and A. J. Smola, Learning with Kernels. Cambridge, MA: MIT Press, 2002. [152] I. T. Jolliffe, Principal Component Analysis. New York, NY: Springer-Verlag, 1986. [153] K. I. Diamantaras and S. Y. Kung, Principal Component Neural Networks. New York: Wiley, 1996. [154] L. J. Cao and W. K. Chong, "Feature Extraction in Support Vector Machine: A comparison of PCA, KPCA, and ICA," in Proceedings of the 9th International Conference on Neural Information Processing, 2002, pp. 1001-1005. [155] D. A. Bell and H. Wang, "A formalism for relevance and its application in feature subset selection," Machine Learning, vol. 41, pp. 175-195, 2000. [156] S. B. Thrun, et al., "The MONK's Problems: A Performance Comparison of Different Learning Algorithms," Carnegie Mellon University, Technical Report CMU-CS- 91-197, 1991. [157] R. A. Caruana and D. Freitag, "How useful is relevance?," in Working Notes of the AAAI Fall Symposium on Relevance, New Orleans, LA, 1994, pp. 25-29. [158] M. Steinbach, L. Ertoz, and V. Kumar, "The challenges of clustering high- dimensional data," New Vistas in Statistical Physics: Applications in Econophysics, Bioinformatics, and Pattern Recognition, 2003. [159] J. G. Dy and C. E. Brodley, "Feature Subset Selection and Order Identification for Unsupervised Learning," in Proceedings of the 17th International Conference on Machine Learning, 2000, pp. 247-254. 94 [160] L. Talavera, "Feature Selection as a Preprocessing Step for Hierarchical Clustering," in Proceedings of the International Conference on Machine Learning, 1999, pp. 389-397. [161] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Prentice Hall, 1988. [162] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley and Sons, 1990. [163] P. Langeley, W. Iba, and K. Thompson, "An analysis of Bayesian classifiers," in Proceedings of the 10th National Conference on Artificial Intelligence, 1992, pp. 223-228. [164] J. Weston, et al., "Feature Selection for SVMs," AT&T, 2001. [165] P. E. Hart, "The Condensed Nearest Neighbor Rule," IEEE Transactions on Information Theory, pp. 515-516, 1968. [166] E. 
Parzen, "On Estimation of a Probability Density Function and Mode," Annals of Mathematics and Statistics, no. 33, pp. 1065-1076, 1962. [167] M. Sebban, D. A. Zighed, and S. Di Palma, "Selection and Statistical Validation of Features and Prototypes," in Principles of Data Mining and Knowledge Discovery. Prague, Czech Republic: Springer, 1999, pp. 184-192. [168] L. Li, C. Weinberg, T. Darden, and L. Pedersen, "Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method," Bioinformatics, vol. 17, no. 12, pp. 1131- 1142, 2001. [169] J. Fairley, G. Georgoulas, and G. Vachtsevanos, "Sequential Feature Selection Methods for Parkinsonian Human Sleep Analysis," in Proceedings of the 17th Conference on Control and Automation, Thessaloniki, Greece, 2009, pp. 1468- 1473. [170] D. W. Opitz, "Feature Selection for Ensembles," in Proceedings of 16th National Conference on Artificial Intelligence, 1999, pp. 379-384. [171] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, pp. 832-844, 1998. [172] C. Guerra-Salcedo and D. Whitley, "Genetic Approach for Feature Selection for Ensemble Creation," in Proceedings of the Genetic and Evolutionary Computation Conference, Orlando, FL, 1999, pp. 236-243. [173] C. Guerra-Salcedo and D. Whitley, "Feature Selection Mechanisms for Ensemble Creation: A Genetic Search Perspective,," Data Mining with Evolutionary Algorithms: Research Directions. Papers from the AAAI Workshop., 1999. [174] A. Tsymbal, P. Cunningham, M. Pechenizkiy, and S. Puuronen, "Search Strategies for Ensemble Feature Selection in Medical Diagnostics," in Proceedings of the 16th IEEE Symposium on Computer-Based Medical Systems, 2003, pp. 124-129. [175] L. E. A. Santana, D. F. d. Oliveira, A. M. P. Canuto, and M. C. P. d. Souto, "A Comparative Analysis of Feature Selection Methods for Ensembles with Different Combination Methods," in . 95 [176] K. M. O. Vale, F. G. Dias, A. M. P. Canuto, and M. C. P. Souto, "A Class-Based Feature Selection Method for Ensemble Systems," in Proceedings of the Eighth International Conference on Hybrid Intelligent Systems, 2008, pp. 596-601. [177] P. Cunningham and J. Carney, "Diversity versus quality in classification ensembles based on feature selection," in Proceedings of the 11th European Conference on Machine Learning, 2000, pp. 109-116. [178] M. Forina, R. Leardi, C. Armanino, and S. Lanteri. (1988) PARVUS: An extendable package of programs for data exploration, classification and correlation. Software Package. [179] C. Q. Inc., Congressional Quarterly Almanac, 98th Congress, 2nd session 1984, Volume XL ed. Washington, D.C., 1985. [180] H. A. Guvenir, B. Acar, G. Demiroz, and A. Cekin, "A Supervised Machine Learning Algorithm for Arrhythmia Analysis," in Proceedings of the Computers in Cardiology Conference, Lund, Sweden, 1997. [181] A. Alizadeh, et al., "Distinct types of diffuse large B-cell lymphoma identifed by gene expression profling," Nature, vol. 403, pp. 503-511, Feb. 2000. 96 Appendices Appendix A: Performance during Individual Runs Note: the table headings for Appendix A are as follows: ? NBC ? Na?ve Bayesian Classifier ? kNN ? k Nearest Neighbor algorithm ? SBS ? Sequential Backward Selection ? SFS ? Sequential Forward Selection ? GA ? Genetic Algorithm ? No FS ? Use of classifier alone, with no feature selection ? Trad. FS ? 
Use of classifier with traditional feature selection architecture ? CEFS ? Use of classifier with CEFS architecture ? Diff ? Performance Difference between traditional feature selection architecture and CEFS architecture. Lymphoma Dataset LYMPHOMA (kNN - GA) Run No FS Trad. FS CEFS Diff 1 62.5 62.5 75 12.5 2 62.5 75 62.5 -12.5 3 75 87.5 87.5 0 4 87.5 100 100 0 5 62.5 75 87.5 12.5 6 87.5 87.5 87.5 0 7 75 62.5 50 -12.5 8 87.5 75 87.5 12.5 9 75 75 75 0 10 87.5 87.5 87.5 0 11 75 62.5 62.5 0 12 62.5 75 75 0 13 75 62.5 75 12.5 14 87.5 62.5 75 12.5 15 75 75 75 0 Avg 75.833 75 77.5 2.5 Colon Cancer Dataset COLON CANCER (kNN - GA) Run No FS Trad. FS CEFS Diff 1 53.846 57.692 61.538 3.846 2 61.538 69.231 69.231 0 3 61.538 73.077 73.077 0 4 76.923 69.231 73.077 3.846 5 57.692 61.538 61.538 0 6 73.077 76.923 76.923 0 7 61.538 57.692 61.538 3.846 8 73.077 76.923 73.077 -3.846 9 76.923 73.077 73.077 0 10 61.538 65.385 69.231 3.846 11 69.231 69.231 65.385 -3.846 12 61.538 69.231 76.923 7.692 13 57.692 57.692 61.538 3.846 14 76.923 73.077 73.077 0 15 65.385 65.385 65.385 0 Avg 65.897 67.692 68.974 1.282 97 Madelon Dataset MADELON (kNN - GA) Run No FS Trad. FS CEFS Diff 1 60.973 62.677 64.256 1.579 2 59.85 61.347 62.594 1.247 3 61.056 58.811 60.973 2.162 4 62.968 62.053 63.342 1.289 5 60.058 62.677 64.713 2.036 6 60.432 59.933 61.513 1.58 7 60.64 60.515 62.677 2.162 8 61.388 62.51 62.427 -0.083 9 60.848 66.874 63.425 -3.449 10 62.469 61.264 61.139 -0.125 11 59.518 60.682 61.305 0.623 12 61.222 63.549 63.092 -0.457 13 62.843 63.716 61.929 -1.787 14 62.552 66.002 64.381 -1.621 15 62.926 63.342 64.589 1.247 Avg 61.316 62.397 62.824 0.426867 Musk (version 1) Dataset MUSK V1 (kNN - GA) Run No FS Trad. FS CEFS Diff 1 82.018 81.579 80.263 -1.316 2 77.193 77.193 75 -2.193 3 80.263 82.456 80.702 -1.754 4 81.14 79.825 84.211 4.386 5 75.877 79.386 77.193 -2.193 6 79.386 84.211 85.526 1.315 7 78.509 82.018 82.018 0 8 76.316 79.825 81.14 1.315 9 73.684 77.193 79.385 2.192 10 78.947 84.211 82.456 -1.755 11 75 78.07 77.632 -0.438 12 77.193 75 75.439 0.439 13 83.772 80.702 83.333 2.631 14 79.825 78.07 77.193 -0.877 15 82.456 81.14 81.14 0 Avg 78.772 80.059 80.175 0.1168 Arrhythmia Dataset ARRHYTHMIA (kNN - GA) Run No FS Trad. FS CEFS Diff 1 68.203 64.977 66.359 1.382 2 68.203 73.272 74.194 0.922 3 65.899 66.359 70.968 4.609 4 67.742 69.585 69.585 0 5 68.203 67.281 67.742 0.461 6 66.82 65.438 66.359 0.921 7 63.134 63.134 64.977 1.843 8 64.977 68.203 68.203 0 9 70.507 67.742 70.507 2.765 10 64.977 70.046 70.968 0.922 11 64.055 61.751 64.977 3.226 12 70.968 72.811 73.733 0.922 13 68.664 73.733 75.115 1.382 14 70.507 71.429 70.507 -0.922 15 63.594 69.585 69.585 0 Avg 67.097 68.356 69.585 1.228867 98 Montgomery County Jails Dataset MONTGOMERY COUNTY (kNN - GA) Run No FS Trad. FS CEFS Diff 1 50.658 44.737 55.263 10.526 2 46.053 84.211 71.711 -12.5 3 50 65.132 65.132 0 4 50.658 36.184 54.605 18.421 5 50 84.211 84.211 0 6 50 19.079 35.526 16.447 7 42.105 84.211 83.553 -0.658 8 40.789 36.842 48.684 11.842 9 50 65.789 54.605 -11.184 10 50 76.316 75 -1.316 11 56.579 73.684 73.684 0 12 56.579 59.211 66.447 7.236 13 44.079 57.895 43.421 -14.474 14 46.053 63.816 66.447 2.631 15 59.211 84.211 84.211 0 Avg 49.518 62.369 64.167 1.798067 MONTGOMERY COUNTY (kNN - SFS) Run No FS Trad. 
FS CEFS Diff 1 50.658 73.684 79.605 5.921 2 46.053 84.211 82.237 -1.974 3 50 65.132 65.132 0 4 50.658 57.895 68.421 10.526 5 50 84.211 84.211 0 6 50 84.211 78.947 -5.264 7 42.105 84.211 84.211 0 8 40.789 36.842 41.447 4.605 9 50 49.342 43.421 -5.921 10 50 73.026 71.053 -1.973 11 56.579 73.684 73.684 0 12 56.579 59.211 59.211 0 13 44.079 23.684 32.237 8.553 14 46.053 63.158 65.789 2.631 15 59.211 82.895 82.895 0 Avg 49.518 66.360 67.500 1.140267 MONTGOMERY COUNTY (kNN - SBS) Run No FS Trad. FS CEFS Diff 1 50.658 64.474 69.737 5.263 2 46.053 69.737 66.447 -3.29 3 50 65.132 55.921 -9.211 4 50.658 65.132 75 9.868 5 50 84.211 84.211 0 6 50 55.263 57.895 2.632 7 42.105 71.711 68.421 -3.29 8 40.789 36.842 43.421 6.579 9 50 50.658 42.105 -8.553 10 50 59.868 60.526 0.658 11 56.579 57.237 68.421 11.184 12 56.579 59.211 59.211 0 13 44.079 48.684 50 1.316 14 46.053 63.158 67.105 3.947 15 59.211 84.211 82.895 -1.316 Avg 49.518 62.369 63.421 1.052467 99 Congressional Voting Dataset CONGRESSIONAL VOTING (kNN - GA) Run No FS Trad. FS CEFS Diff 1 89.222 92.814 94.012 1.198 2 90.419 91.617 91.617 0 3 88.623 94.012 94.611 0.599 4 92.216 94.012 95.21 1.198 5 92.216 95.808 94.611 -1.197 6 91.018 92.216 91.018 -1.198 7 91.018 94.012 94.012 0 8 93.413 94.611 95.21 0.599 9 92.814 94.012 95.808 1.796 10 95.21 97.605 95.808 -1.797 11 92.216 95.21 94.012 -1.198 12 90.419 93.413 94.012 0.599 13 92.216 92.814 94.611 1.797 14 91.617 94.611 94.611 0 15 86.826 92.216 92.216 0 Avg 91.298 93.932 94.092 0.159733 CONGRESSIONAL VOTING (kNN - SFS) Run No FS Trad. FS CEFS Diff 1 89.222 94.012 94.611 0.599 2 90.419 95.21 94.611 -0.599 3 88.623 94.012 94.012 0 4 92.216 95.21 95.21 0 5 92.216 94.611 94.611 0 6 91.018 89.82 89.82 0 7 91.018 93.413 95.21 1.797 8 93.413 95.21 95.21 0 9 92.814 95.808 95.808 0 10 95.21 98.204 98.204 0 11 92.216 95.21 94.611 -0.599 12 90.419 94.611 94.012 -0.599 13 92.216 91.617 94.012 2.395 14 91.617 93.413 94.012 0.599 15 86.826 91.617 93.413 1.796 Avg 91.298 94.132 94.491 0.359267 CONGRESSIONAL VOTING (kNN - SBS) Run No FS Trad. FS CEFS Diff 1 89.222 92.216 94.611 2.395 2 90.419 94.012 94.611 0.599 3 88.623 94.012 94.611 0.599 4 92.216 95.21 95.808 0.598 5 92.216 94.611 94.611 0 6 91.018 90.419 91.018 0.599 7 91.018 94.012 94.012 0 8 93.413 95.21 95.21 0 9 92.814 95.808 97.006 1.198 10 95.21 96.407 95.808 -0.599 11 92.216 92.216 94.012 1.796 12 90.419 94.611 94.012 -0.599 13 92.216 92.216 93.413 1.197 14 91.617 95.808 95.808 0 15 86.826 93.413 93.413 0 Avg 91.298 94.012 94.531 0.518867 CONGRESSIONAL VOTING (NBC - SBS) Run No FS Trad. FS CEFS Diff 1 88.024 94.611 94.611 0 2 89.222 94.012 94.012 0 3 88.024 94.012 94.611 0.599 4 89.82 95.21 95.21 0 5 90.419 93.413 93.413 0 6 88.623 94.611 95.808 1.197 7 87.425 94.611 94.611 0 8 91.617 95.808 95.808 0 9 91.617 96.407 96.407 0 10 92.814 97.006 97.006 0 11 94.012 93.413 93.413 0 12 89.82 94.012 94.012 0 13 87.425 94.012 93.413 -0.599 14 89.82 93.413 94.012 0.599 15 87.425 92.814 92.814 0 Avg 89.740 94.491 94.611 0.119733 100 Credit Approval Dataset CREDIT APPROVAL (kNN - GA) Run No FS Trad. FS CEFS Diff 1 76 84 83 -1 2 74 85 84.5 -0.5 3 79.5 88.5 89 0.5 4 72.5 92 90.5 -1.5 5 77 80.5 80 -0.5 6 69.5 88.5 86.5 -2 7 67.5 82.5 85 2.5 8 75.5 83.5 89 5.5 9 75.5 86.5 85.5 -1 10 75.5 86 87 1 11 75.5 80.5 81 0.5 12 75.5 86 86 0 13 73.5 85.5 84.5 -1 14 73.5 83 82.5 -0.5 15 75.5 86 88.5 2.5 Avg 74.4 85.2 85.5 0.3 CREDIT APPROVAL (kNN - SFS) Run No FS Trad. 
FS CEFS Diff 1 76 86.5 85.5 -1 2 74 79.5 83 3.5 3 79.5 88.5 89.5 1 4 72.5 87.5 89 1.5 5 77 81 83 2 6 69.5 84 84.5 0.5 7 67.5 82 82.5 0.5 8 75.5 89.5 88 -1.5 9 75.5 83 83 0 10 75.5 82.5 83 0.5 11 75.5 80 82 2 12 75.5 86 86.5 0.5 13 73.5 84.5 85.5 1 14 73.5 85 87.5 2.5 15 75.5 84 85.5 1.5 Avg 74.4 84.233 85.2 0.966667 CREDIT APPROVAL (kNN - SBS) Run No FS Trad. FS CEFS Diff 1 76 85 85 0 2 74 84.5 83 -1.5 3 79.5 83.5 87 3.5 4 72.5 87.5 86 -1.5 5 77 83 84 1 6 69.5 79.5 83 3.5 7 67.5 81.5 84 2.5 8 75.5 89 88 -1 9 75.5 82.5 85.5 3 10 75.5 84.5 86.5 2 11 75.5 84.5 85.5 1 12 75.5 86 84.5 -1.5 13 73.5 84 85 1 14 73.5 84.5 87 2.5 15 75.5 87.5 88.5 1 Avg 74.4 84.46667 85.5 1.033333 CREDIT APPROVAL (NBC - SBS) Run No FS Trad. FS CEFS Diff 1 85 85.5 85 -0.5 2 83 80.5 83 2.5 3 85 83.5 83.5 0 4 89 88 89 1 5 83 81.5 81.5 0 6 87 88.5 88.5 0 7 81.5 80.5 80.5 0 8 87 88 88.5 0.5 9 85 84.5 84.5 0 10 87 82 83.5 1.5 11 84 81.5 81.5 0 12 87.5 85.5 85.5 0 13 85.5 84 83.5 -0.5 14 85.5 80.5 81 0.5 15 85 82.5 84.5 2 Avg 85.333 83.767 84.233 0.466667 101 Wine Dataset WINE (kNN - GA) Run No FS Trad. FS CEFS Diff 1 96.875 93.75 96.875 3.125 2 92.188 95.313 95.313 0 3 92.188 95.313 93.75 -1.563 4 96.875 93.75 95.313 1.563 5 90.625 90.625 90.625 0 6 96.875 93.75 93.75 0 7 93.75 96.875 98.438 1.563 8 95.313 89.063 90.625 1.562 9 95.313 95.313 95.313 0 10 95.313 98.438 100 1.562 11 90.625 93.75 93.75 0 12 92.188 98.438 98.438 0 13 92.188 89.063 92.188 3.125 14 95.313 98.438 98.438 0 15 95.313 98.438 98.438 0 Avg 94.063 94.688 95.417 0.729133 WINE (kNN - SFS) Run No FS Trad. FS CEFS Diff 1 96.875 92.188 92.188 0 2 92.188 81.25 92.188 10.938 3 92.188 90.625 90.625 0 4 96.875 89.063 90.625 1.562 5 90.625 89.063 92.188 3.125 6 96.875 93.75 95.313 1.563 7 93.75 96.875 100 3.125 8 95.313 89.063 90.625 1.562 9 95.313 92.188 93.75 1.562 10 95.313 95.313 95.313 0 11 90.625 93.75 90.625 -3.125 12 92.188 92.188 92.188 0 13 92.188 95.313 95.313 0 14 95.313 87.5 87.5 0 15 95.313 95.313 95.313 0 Avg 94.063 91.563 92.917 1.354133 WINE (kNN - SBS) Run No FS Trad. FS CEFS Diff 1 96.875 96.875 96.875 0 2 92.188 93.75 95.313 1.563 3 92.188 95.313 95.313 0 4 96.875 93.75 93.75 0 5 90.625 92.188 92.188 0 6 96.875 87.5 87.5 0 7 93.75 96.875 96.875 0 8 95.313 96.875 96.875 0 9 95.313 95.313 95.313 0 10 95.313 98.438 96.875 -1.563 11 90.625 89.063 89.063 0 12 92.188 87.5 89.063 1.563 13 92.188 90.625 92.188 1.563 14 95.313 92.188 93.75 1.562 15 95.313 95.313 95.313 0 Avg 94.063 93.438 93.750 0.312533 102 Breast Cancer Dataset BREAST CANCER (kNN - GA) Run No FS Trad. FS CEFS Diff 1 80.667 80.667 81.333 0.666 2 70.667 84 86.667 2.667 3 77.333 74.667 84 9.333 4 82 79.333 65.333 -14 5 84.667 82 83.333 1.333 6 78 81.333 81.333 0 7 82.667 82.667 79.333 -3.334 8 79.333 88.667 88.667 0 9 74.667 85.333 88.667 3.334 10 78.667 75.333 84 8.667 11 71.333 72 72.667 0.667 12 80 76.667 83.333 6.666 13 65.333 80.667 80.667 0 14 78 80 81.333 1.333 15 74.667 76 78 2 Avg 77.20007 79.9556 81.2444 1.2888 BREAST CANCER (kNN - SFS) Run No FS Trad. FS CEFS Diff 1 80.667 79.333 78.667 -0.666 2 70.667 84 86.667 2.667 3 77.333 77.333 85.333 8 4 82 88.667 80.667 -8 5 84.667 84.667 84.667 0 6 78 81.333 83.333 2 7 82.667 82.667 84 1.333 8 79.333 88.667 88.667 0 9 74.667 78 88.667 10.667 10 78.667 75.333 88 12.667 11 71.333 80 81.333 1.333 12 80 78 82.667 4.667 13 65.333 80.667 81.333 0.666 14 78 80 78 -2 15 74.667 74.667 82.667 8 Avg 77.200 80.889 83.645 2.7556 BREAST CANCER (kNN - SBS) Run No FS Trad. 
FS CEFS Diff 1 80.667 74.667 76 1.333 2 70.667 82 87.333 5.333 3 77.333 77.333 87.333 10 4 82 78.667 80 1.333 5 84.667 88 88.667 0.667 6 78 82 80 -2 7 82.667 81.333 88 6.667 8 79.333 88.667 85.333 -3.334 9 74.667 78 80 2 10 78.667 75.333 88 12.667 11 71.333 72 71.333 -0.667 12 80 86.667 87.333 0.666 13 65.333 65.333 72.667 7.334 14 78 71.333 74.667 3.334 15 74.667 75.333 78 2.667 Avg 77.200 78.444 81.644 3.2 BREAST CANCER (NBC - SBS) Run No FS Trad. FS CEFS Diff 1 67.333 68 68.667 0.667 2 70 62 62.667 0.667 3 61.333 71.333 74 2.667 4 70.667 73.333 74.667 1.334 5 72.667 71.333 73.333 2 6 68.667 78 70.667 -7.333 7 64.667 75.333 73.333 -2 8 71.333 70.667 71.333 0.666 9 62 72 69.333 -2.667 10 67.333 65.333 66 0.667 11 67.333 68 69.333 1.333 12 64 65.333 65.333 0 13 57.333 63.333 70 6.667 14 62.667 64.667 64.667 0 15 69.333 64.667 64.667 0 Avg 66.444 68.889 69.200 0.3112 103 Appendix B: Best found overall and class-specific feature subsets per individual runs Note 1: a zero value represents a feature that was not used, whereas, a non-zero value represents the index of a used feature. Note 2: the best found subsets for the five larger datasets (Musk v1, Arrhythmia, Madelon, Colon Cancer, and Lymphoma) are not provided here as they are extremely large. Breast Cancer Dataset (NBC/SBS) Overall: 0 0 0 0 5 6 0 0 0 no-recurrence-events: 0 2 0 0 5 0 0 0 0 recurrence-events: 1 2 3 0 0 6 0 0 9 Overall: 0 0 3 4 0 6 0 0 0 no-recurrence-events: 0 0 0 0 5 0 0 0 0 recurrence-events: 1 0 3 0 0 6 0 0 0 Overall: 0 0 0 0 0 6 0 8 0 no-recurrence-events: 0 0 0 4 0 0 0 0 0 recurrence-events: 0 0 3 0 0 6 0 8 9 Overall: 0 2 0 4 0 6 0 0 9 no-recurrence-events: 0 0 0 4 0 0 0 0 0 recurrence-events: 1 0 3 4 0 6 0 0 9 Overall: 0 0 0 4 0 0 0 8 0 no-recurrence-events: 0 0 0 0 5 0 0 0 0 recurrence-events: 0 0 3 0 0 6 0 8 9 Overall: 0 0 0 0 0 6 0 0 0 no-recurrence-events: 0 0 0 0 5 0 0 0 0 recurrence-events: 0 0 0 4 5 6 0 0 0 Overall: 0 2 0 0 0 6 0 0 0 no-recurrence-events: 0 0 0 0 5 0 0 0 0 recurrence-events: 0 2 3 0 5 0 0 0 9 Overall: 0 2 0 4 0 6 0 0 9 no-recurrence-events: 0 0 0 4 0 0 0 0 0 recurrence-events: 0 0 3 0 5 6 0 0 9 Overall: 0 2 0 0 0 6 0 0 0 no-recurrence-events: 0 0 0 0 5 0 0 0 0 recurrence-events: 1 2 0 0 0 6 0 0 0 104 Overall: 0 0 0 0 5 6 7 0 0 no-recurrence-events: 0 0 0 0 5 0 0 0 0 recurrence-events: 1 2 3 0 0 6 0 0 9 Overall: 0 0 0 0 5 6 0 0 0 no-recurrence-events: 0 0 0 0 5 0 0 0 0 recurrence-events: 1 0 3 0 5 6 7 8 0 Overall: 0 0 0 4 5 6 0 0 0 no-recurrence-events: 0 0 0 4 0 0 0 0 0 recurrence-events: 0 0 0 4 5 6 0 0 0 Overall: 0 0 0 4 5 6 0 8 0 no-recurrence-events: 0 0 0 0 5 0 0 0 0 recurrence-events: 0 0 3 4 0 0 0 0 9 Overall: 0 0 3 0 5 6 0 0 0 no-recurrence-events: 0 0 0 0 5 0 0 0 0 recurrence-events: 0 0 3 4 0 6 0 0 0 Overall: 0 2 0 0 0 6 0 0 9 no-recurrence-events: 0 0 0 0 0 0 0 0 9 recurrence-events: 0 0 0 4 0 6 0 8 9 (kNN/SBS) Overall: 1 2 3 0 0 6 7 8 9 no-recurrence-events: 1 2 3 4 0 6 7 8 9 recurrence-events: 0 2 3 0 5 6 7 0 0 Overall: 0 0 0 0 5 0 0 0 0 no-recurrence-events: 0 0 0 4 0 0 7 8 0 recurrence-events: 0 2 3 0 0 6 0 8 0 Overall: 1 0 0 0 5 6 0 0 0 no-recurrence-events: 0 0 0 0 5 0 7 0 9 recurrence-events: 0 2 0 4 5 6 0 8 9 Overall: 0 0 0 0 5 6 7 0 0 no-recurrence-events: 1 0 3 4 5 0 0 0 0 recurrence-events: 0 0 0 0 5 6 0 0 0 Overall: 1 0 3 4 5 6 7 8 9 no-recurrence-events: 1 0 0 4 0 0 0 0 0 recurrence-events: 0 0 0 0 5 0 0 0 0 105 Overall: 0 0 0 4 0 0 0 0 0 no-recurrence-events: 0 0 0 4 0 0 7 8 9 recurrence-events: 0 0 3 0 0 0 7 0 0 Overall: 1 2 0 0 5 6 7 0 0 no-recurrence-events: 0 0 0 
4 0 0 0 0 0 recurrence-events: 0 2 3 0 5 6 0 8 9 Overall: 0 0 0 0 0 6 0 0 0 no-recurrence-events: 1 2 0 4 0 6 7 8 9 recurrence-events: 0 0 0 0 0 6 0 8 9 Overall: 1 2 3 4 5 6 0 8 9 no-recurrence-events: 0 0 0 4 0 0 0 0 0 recurrence-events: 1 0 3 0 5 6 0 8 0 Overall: 0 2 0 0 0 6 0 0 0 no-recurrence-events: 1 0 0 4 0 0 0 0 0 recurrence-events: 0 0 0 0 5 0 0 0 0 Overall: 1 2 3 0 5 6 7 8 9 no-recurrence-events: 1 2 3 4 5 6 7 8 9 recurrence-events: 1 0 3 0 5 0 0 0 0 Overall: 1 0 0 4 0 0 0 0 0 no-recurrence-events: 0 0 0 0 0 0 0 0 9 recurrence-events: 0 2 3 4 5 6 7 0 0 Overall: 0 2 3 0 5 6 7 0 9 no-recurrence-events: 0 0 0 4 5 0 0 8 0 recurrence-events: 1 2 3 4 0 6 7 0 9 Overall: 1 2 0 0 5 6 7 0 0 no-recurrence-events: 1 0 0 4 5 0 7 8 0 recurrence-events: 0 2 3 0 0 6 7 8 0 Overall: 0 0 0 4 5 6 0 8 9 no-recurrence-events: 0 0 0 4 0 0 0 8 0 recurrence-events: 0 0 0 0 0 6 0 0 0 (kNN/SFS) Overall: 1 0 3 4 5 6 7 8 0 no-recurrence-events: 0 0 0 0 5 6 0 0 0 recurrence-events: 0 0 0 4 5 0 0 0 9 106 Overall: 0 0 0 4 0 0 0 0 0 no-recurrence-events: 1 0 0 4 0 0 0 0 0 recurrence-events: 0 0 0 0 0 6 0 0 0 Overall: 0 2 0 0 5 6 7 8 0 no-recurrence-events: 0 0 0 4 0 0 0 8 0 recurrence-events: 1 0 0 0 0 0 0 8 0 Overall: 0 0 0 4 0 6 0 0 0 no-recurrence-events: 0 2 0 4 0 6 7 8 9 recurrence-events: 1 0 0 0 0 0 0 0 0 Overall: 1 2 3 4 5 6 7 8 9 no-recurrence-events: 1 0 0 4 0 0 0 0 0 recurrence-events: 0 0 0 4 0 0 0 0 0 Overall: 0 0 0 0 5 0 0 0 0 no-recurrence-events: 1 2 0 4 5 0 7 0 9 recurrence-events: 0 0 0 0 0 6 0 0 0 Overall: 1 2 3 4 5 6 7 8 9 no-recurrence-events: 0 2 0 4 5 6 0 0 9 recurrence-events: 0 0 0 0 0 6 0 0 0 Overall: 0 0 0 0 5 0 0 0 0 no-recurrence-events: 0 0 0 0 5 0 0 0 0 recurrence-events: 0 0 0 0 5 0 0 0 9 Overall: 1 0 3 0 5 6 0 0 0 no-recurrence-events: 0 0 0 0 5 0 0 0 0 recurrence-events: 0 0 0 0 0 0 0 0 9 Overall: 0 2 0 0 0 6 0 0 0 no-recurrence-events: 1 0 0 4 0 0 0 0 0 recurrence-events: 0 0 0 0 5 0 0 0 0 Overall: 0 0 0 4 0 0 0 0 0 no-recurrence-events: 1 2 3 4 5 6 7 8 9 recurrence-events: 0 0 0 0 5 0 0 0 0 Overall: 0 0 3 4 0 0 0 0 0 no-recurrence-events: 0 0 0 0 0 0 7 0 0 recurrence-events: 0 0 0 0 0 0 0 0 9 107 Overall: 0 2 0 0 5 0 0 0 0 no-recurrence-events: 1 2 0 4 5 0 0 0 0 recurrence-events: 1 2 3 4 5 6 7 0 9 Overall: 0 0 3 0 5 0 0 0 0 no-recurrence-events: 1 0 0 0 5 0 0 0 0 recurrence-events: 0 0 3 0 5 0 0 0 0 Overall: 0 0 0 4 5 6 7 8 9 no-recurrence-events: 1 0 3 4 5 0 0 0 9 recurrence-events: 0 0 0 0 5 6 0 0 0 (kNN/GA) Overall: 1 2 3 4 5 6 7 8 9 no-recurrence-events: 1 2 0 4 5 6 0 8 9 recurrence-events: 0 0 0 4 5 0 0 0 9 Overall: 0 0 0 4 0 0 0 0 0 no-recurrence-events: 1 0 0 4 0 0 0 0 0 recurrence-events: 0 0 0 0 0 6 0 0 0 Overall: 1 0 3 4 5 6 0 8 9 no-recurrence-events: 1 0 0 0 5 0 0 0 0 recurrence-events: 0 0 3 0 0 0 7 0 9 Overall: 1 0 0 0 5 6 0 0 0 no-recurrence-events: 0 2 0 0 0 0 0 0 0 recurrence-events: 0 0 0 0 5 6 0 0 0 Overall: 0 0 3 0 5 6 7 8 9 no-recurrence-events: 0 0 0 0 5 0 0 8 9 recurrence-events: 0 0 0 4 0 0 0 0 0 Overall: 0 0 0 0 5 0 0 0 0 no-recurrence-events: 0 0 0 0 5 0 0 0 0 recurrence-events: 0 2 3 0 0 6 0 0 0 Overall: 1 2 0 4 0 6 0 0 9 no-recurrence-events: 0 0 0 4 0 6 7 0 9 recurrence-events: 0 0 0 0 0 6 0 0 0 Overall: 0 0 0 0 5 0 0 0 0 no-recurrence-events: 0 0 0 0 5 0 0 0 0 recurrence-events: 0 0 0 0 5 0 0 0 9 108 Overall: 0 0 0 4 5 6 0 0 0 no-recurrence-events: 0 0 0 0 5 0 0 0 0 recurrence-events: 0 0 0 0 0 0 0 0 9 Overall: 0 0 0 4 5 6 0 0 0 no-recurrence-events: 1 0 0 4 5 0 7 8 9 recurrence-events: 0 0 0 0 5 0 0 0 0 Overall: 1 2 3 0 5 6 7 8 9 
no-recurrence-events: 1 2 3 4 5 6 7 8 9 recurrence-events: 0 0 0 0 5 0 0 0 0 Overall: 1 0 3 4 5 6 7 0 0 no-recurrence-events: 0 0 0 0 0 0 7 0 0 recurrence-events: 0 0 0 0 0 0 0 0 9 Overall: 0 2 0 0 5 0 0 0 0 no-recurrence-events: 0 2 0 4 5 6 0 8 9 recurrence-events: 1 0 0 0 5 0 0 0 0 Overall: 0 0 3 0 5 0 0 0 0 no-recurrence-events: 1 0 0 4 0 0 0 0 0 recurrence-events: 0 0 3 0 5 0 0 0 0 Overall: 0 0 3 4 5 6 0 8 9 no-recurrence-events: 1 0 3 4 5 0 0 0 9 recurrence-events: 0 0 0 0 5 0 7 8 9 Wine Dataset (kNN/SBS) Overall: 1 2 3 4 0 0 0 0 0 10 11 12 13 1: 1 2 3 4 0 6 0 0 0 0 0 0 13 2: 0 2 3 0 0 0 0 0 0 10 0 0 13 3: 0 0 0 0 0 0 7 0 0 0 11 0 0 Overall: 1 0 0 0 0 0 7 0 0 10 0 12 13 1: 1 0 0 0 0 0 7 0 0 0 0 0 0 2: 1 2 0 4 5 0 7 0 0 10 0 12 13 3: 1 0 0 0 5 0 7 0 0 0 0 0 0 Overall: 0 2 0 0 0 0 7 0 0 10 11 12 13 1: 1 2 3 4 0 0 7 0 0 0 0 0 0 2: 0 0 0 0 5 0 0 0 0 10 0 0 0 3: 0 2 0 0 0 0 0 0 0 0 11 12 0 109 Overall: 1 2 3 4 0 0 7 0 0 10 0 0 13 1: 1 2 3 4 0 6 0 0 0 0 0 0 13 2: 1 2 3 4 0 0 7 8 0 10 0 12 13 3: 1 0 3 0 0 0 7 0 9 0 0 0 0 Overall: 0 2 0 4 5 0 7 8 9 10 0 0 13 1: 1 0 0 0 0 6 0 0 0 0 0 0 0 2: 0 0 0 4 5 0 7 0 0 0 0 0 13 3: 0 0 0 0 0 0 7 0 0 10 0 0 0 Overall: 1 0 3 4 0 6 0 0 0 0 0 12 13 1: 1 0 3 0 0 0 7 0 0 0 0 0 13 2: 1 0 3 4 0 6 0 0 0 0 0 12 13 3: 1 0 3 0 0 0 7 0 0 0 0 0 0 Overall: 1 2 3 0 0 0 0 0 0 10 0 12 13 1: 1 2 0 0 0 6 0 0 0 0 0 0 13 2: 0 2 3 0 0 0 0 0 0 10 0 0 13 3: 0 2 0 0 0 0 7 0 0 10 0 0 0 Overall: 1 2 0 0 0 0 0 0 0 0 11 12 13 1: 1 0 0 0 0 0 7 0 0 0 0 0 0 2: 1 0 0 0 0 0 0 0 0 10 0 12 13 3: 0 0 3 4 0 0 0 8 9 10 0 0 0 Overall: 0 2 3 4 0 0 7 8 9 10 0 12 13 1: 1 2 3 0 0 0 7 0 0 0 0 0 0 2: 0 0 0 0 0 0 0 0 0 10 0 0 13 3: 0 2 0 0 0 0 7 8 0 0 0 0 0 Overall: 0 2 3 4 0 0 7 0 9 10 0 12 13 1: 1 2 3 4 0 0 7 0 0 0 0 0 0 2: 0 2 3 0 0 0 7 0 9 10 0 0 13 3: 0 0 3 4 0 0 7 0 0 0 0 0 0 Overall: 0 2 3 0 5 0 7 8 0 10 11 0 13 1: 1 0 0 4 5 0 7 0 0 0 0 0 0 2: 1 2 3 0 5 0 7 0 0 10 0 0 13 3: 1 0 0 0 5 0 7 0 0 0 0 0 0 Overall: 0 0 3 0 5 6 7 0 9 0 0 0 13 1: 1 2 3 4 0 0 7 0 0 0 0 0 0 2: 1 2 3 0 5 6 7 0 0 10 0 12 13 3: 1 0 0 0 5 0 7 0 0 0 0 0 0 110 Overall: 0 2 3 4 5 0 7 8 9 10 11 0 13 1: 1 0 0 0 0 0 7 0 0 0 0 0 0 2: 1 2 3 0 5 0 7 8 0 10 0 0 13 3: 0 0 3 0 0 0 7 0 0 0 11 0 0 Overall: 1 2 3 0 0 0 0 0 9 10 0 12 13 1: 1 0 0 4 5 0 7 0 0 0 0 0 0 2: 1 0 0 4 5 0 0 0 9 10 0 0 0 3: 0 0 0 0 0 0 7 0 0 10 0 0 0 Overall: 0 2 0 0 0 6 0 0 9 10 0 0 13 1: 1 0 0 4 5 0 7 0 0 0 0 0 0 2: 1 0 0 0 5 0 0 0 0 10 0 0 0 3: 1 2 0 0 5 0 7 8 0 0 0 0 0 (kNN/SFS) Overall: 1 0 3 4 0 0 7 0 0 10 11 0 0 1: 1 2 3 4 5 6 7 8 9 10 11 12 13 2: 1 2 0 0 0 0 0 0 0 10 0 0 0 3: 1 2 3 4 5 6 7 8 9 10 11 12 13 Overall: 1 0 0 0 0 0 7 0 0 0 0 0 13 1: 1 2 3 4 5 6 7 8 9 10 11 12 13 2: 0 2 3 0 0 0 0 0 0 10 0 0 13 3: 1 2 3 4 5 6 7 8 9 10 11 12 13 Overall: 1 0 0 0 0 6 7 0 0 10 0 0 13 1: 1 2 3 4 5 6 7 8 9 10 11 12 13 2: 0 0 0 0 5 0 0 0 0 10 0 0 0 3: 1 2 3 4 5 6 7 8 9 10 11 12 13 Overall: 1 0 0 0 5 6 7 0 0 10 0 12 0 1: 1 2 3 4 5 6 7 8 9 10 11 12 13 2: 0 0 0 0 0 6 0 0 0 10 0 0 0 3: 1 2 3 4 5 6 7 8 9 10 11 12 13 Overall: 1 2 3 0 5 0 7 0 0 0 11 0 13 1: 1 2 3 4 5 6 7 8 9 10 11 12 13 2: 1 0 3 0 5 0 7 0 0 0 11 0 13 3: 1 2 3 4 5 6 7 8 9 10 11 12 13 Overall: 1 0 3 4 0 0 7 0 0 0 11 0 13 1: 1 2 3 4 5 6 7 8 9 10 11 12 13 2: 1 2 0 0 5 0 7 0 0 10 11 12 13 3: 1 2 3 4 5 6 7 8 9 10 11 12 13 111 Overall: 1 0 0 4 5 6 7 0 0 10 11 0 13 1: 1 2 3 4 5 6 7 8 9 10 11 12 13 2: 0 0 0 4 0 0 0 0 0 10 11 0 13 3: 1 2 3 4 5 6 7 8 9 10 11 12 13 Overall: 0 0 0 0 0 0 7 8 0 10 0 0 0 1: 1 2 3 4 5 6 7 8 9 10 11 12 13 2: 0 0 0 0 0 6 0 0 0 10 0 0 0 3: 1 2 3 4 5 6 7 8 9 10 11 12 13 Overall: 1 0 0 0 0 0 7 0 0 10 0 0 
0 1: 1 2 3 4 5 6 7 8 9 10 11 12 13 2: 1 0 3 0 0 0 0 0 0 10 0 0 13 3: 1 2 3 4 5 6 7 8 9 10 11 12 13 Overall: 1 2 3 4 5 6 7 0 9 10 0 12 13 1: 1 2 3 4 5 6 7 8 9 10 11 12 13 2: 1 0 0 0 5 0 0 0 0 10 0 0 0 3: 1 2 3 4 5 6 7 8 9 10 11 12 13 Overall: 1 0 3 0 5 0 7 0 0 10 11 12 13 1: 1 2 3 4 5 6 7 8 9 10 11 12 13 2: 1 2 3 0 5 0 7 0 0 10 0 0 13 3: 1 2 3 4 5 6 7 8 9 10 11 12 13 Overall: 0 0 0 0 5 0 7 0 0 10 0 0 13 1: 1 2 3 4 5 6 7 8 9 10 11 12 13 2: 0 0 0 4 0 6 0 0 0 10 0 0 0 3: 1 2 3 4 5 6 7 8 9 10 11 12 13 Overall: 1 0 0 4 5 0 7 0 0 10 11 12 13 1: 1 2 3 4 5 6 7 8 9 10 11 12 13 2: 0 0 0 0 0 0 0 0 0 10 0 0 13 3: 1 2 3 4 5 6 7 8 9 10 11 12 13 Overall: 1 0 0 0 0 0 7 8 0 10 0 12 0 1: 1 2 3 4 5 6 7 8 9 10 11 12 13 2: 1 0 0 4 5 0 0 0 9 10 0 0 0 3: 1 2 3 4 5 6 7 8 9 10 11 12 13 Overall: 1 2 3 4 0 0 7 8 9 10 11 0 13 1: 1 2 3 4 5 6 7 8 9 10 11 12 13 2: 1 0 0 0 5 0 0 0 0 10 0 0 0 3: 1 2 3 4 5 6 7 8 9 10 11 12 13 112 (kNN/GA) Overall: 1 0 3 4 0 0 0 0 9 10 0 12 13 1: 0 2 3 4 0 0 0 8 9 10 0 0 13 2: 1 2 0 4 0 6 0 8 0 10 0 12 13 3: 0 0 0 0 0 0 0 0 9 10 0 12 0 Overall: 1 2 0 4 0 0 7 0 9 10 11 12 13 1: 1 2 0 0 0 6 0 0 9 10 0 0 0 2: 1 2 0 4 5 0 7 0 0 10 0 12 13 3: 0 0 0 0 0 0 7 0 9 0 0 12 13 Overall: 0 2 0 0 0 0 7 0 0 10 11 12 13 1: 1 0 3 0 0 0 0 0 0 0 11 12 13 2: 1 0 0 0 5 0 7 0 0 10 11 0 13 3: 0 0 3 0 0 6 7 0 0 0 0 0 0 Overall: 1 2 0 0 0 0 7 8 9 10 0 12 13 1: 1 2 0 0 5 6 7 0 0 0 0 0 13 2: 1 2 0 4 0 0 0 8 0 10 11 12 13 3: 0 0 0 0 5 6 7 0 9 0 11 0 0 Overall: 1 0 0 4 0 6 7 0 9 10 0 12 13 1: 0 2 0 4 5 6 0 0 0 10 11 0 0 2: 0 0 0 4 5 6 7 0 0 10 0 0 13 3: 1 0 0 0 0 0 7 0 9 10 0 12 13 Overall: 0 2 3 0 5 0 7 0 0 10 11 12 13 1: 0 0 3 0 0 0 7 8 0 0 0 0 13 2: 1 2 3 4 0 0 0 0 9 10 11 0 13 3: 1 0 3 0 5 0 7 0 9 0 0 12 0 Overall: 1 0 3 4 0 0 7 0 0 10 11 0 13 1: 0 0 0 0 0 6 0 0 0 10 0 0 13 2: 1 0 0 4 5 0 7 0 0 10 0 12 13 3: 0 2 0 4 5 0 7 0 9 0 0 0 13 Overall: 1 0 3 0 0 6 7 0 0 10 0 12 0 1: 1 2 0 0 5 0 7 0 0 10 11 0 0 2: 0 0 0 4 0 0 7 0 0 10 0 0 0 3: 0 0 0 0 0 0 7 0 9 0 0 12 13 Overall: 1 2 0 0 5 6 0 0 9 10 11 12 13 1: 0 0 3 0 0 0 7 8 0 10 0 0 0 2: 1 2 3 0 5 6 0 0 0 10 11 12 13 3: 0 0 0 0 5 0 7 0 9 0 0 0 13 113 Overall: 0 2 0 4 5 0 7 0 9 10 0 12 13 1: 1 2 0 0 5 0 7 8 0 0 0 0 0 2: 1 0 0 0 0 0 7 8 0 10 0 0 13 3: 0 0 3 0 5 0 0 0 9 0 11 12 0 Overall: 1 0 3 0 0 0 0 0 0 10 11 12 13 1: 1 0 0 0 0 0 0 0 9 10 0 12 13 2: 1 0 3 4 5 0 0 0 0 10 11 12 13 3: 0 0 3 0 0 6 7 0 0 0 0 0 0 Overall: 0 2 0 4 5 0 7 0 9 10 0 12 13 1: 0 0 0 0 5 6 7 0 0 10 0 0 13 2: 0 0 0 0 5 0 0 0 0 10 0 0 13 3: 0 2 0 0 0 0 7 0 0 0 11 12 13 Overall: 0 2 0 4 5 6 7 8 0 10 0 12 13 1: 1 0 3 0 0 6 0 0 0 0 0 0 13 2: 0 2 0 4 5 0 7 0 0 10 0 12 13 3: 0 0 0 0 0 0 7 8 0 10 0 12 0 Overall: 1 0 0 4 5 0 7 0 0 10 11 12 13 1: 1 0 0 4 5 6 0 0 9 10 0 0 0 2: 1 0 0 0 5 0 0 0 9 10 0 0 13 3: 1 0 0 4 0 6 7 0 0 0 0 12 0 Overall: 1 0 3 4 0 6 0 0 9 10 11 12 13 1: 1 0 0 4 5 0 7 0 0 10 11 0 0 2: 1 0 0 4 5 6 7 0 0 0 0 0 13 3: 0 0 0 0 0 6 0 0 9 10 0 12 0 Credit Approval Dataset (NBC/SBS) Overall: 0 0 0 4 5 0 7 0 9 10 0 0 0 0 0 +: 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 -: 0 2 0 4 0 0 0 0 9 10 11 0 0 0 0 Overall: 0 0 0 0 0 0 7 0 9 0 0 0 0 0 0 +: 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 -: 0 2 0 0 0 0 0 0 9 10 11 0 0 0 0 Overall: 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 +: 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 -: 1 0 0 4 5 6 7 8 9 10 11 0 13 14 15 114 Overall: 0 0 0 4 0 6 7 0 9 0 11 0 0 0 0 +: 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 -: 0 2 0 0 0 0 0 0 9 10 11 0 0 0 0 Overall: 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 +: 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 -: 0 0 0 0 0 0 0 0 9 10 11 0 0 0 15 Overall: 0 0 0 4 0 0 7 0 9 10 0 0 0 0 0 +: 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 -: 0 0 0 0 0 
0 0 0 9 10 11 0 0 0 0 Overall: 0 0 0 4 0 0 7 0 9 10 0 12 13 0 15 +: 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 -: 0 0 0 0 0 0 0 0 9 10 11 0 0 0 15 Overall: 0 0 0 4 0 6 0 0 9 0 11 0 13 0 0 +: 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 -: 0 0 0 0 0 0 0 0 9 10 11 0 0 0 15 Overall: 0 0 0 4 0 0 0 0 9 10 0 0 0 0 0 +: 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 -: 0 0 0 0 0 0 0 0 9 10 11 0 0 0 0 Overall: 0 0 0 4 5 0 0 0 9 10 0 0 13 0 15 +: 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 -: 0 2 0 4 0 6 7 8 9 10 11 0 0 0 15 Overall: 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 +: 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 -: 1 0 3 4 5 6 7 8 9 10 11 12 13 14 15 Overall: 1 0 0 4 0 0 0 0 9 10 0 0 13 0 0 +: 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 -: 1 2 3 4 0 6 7 8 9 10 11 0 0 14 15 Overall: 0 0 0 4 0 0 7 0 9 10 0 0 0 0 0 +: 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 -: 0 2 0 0 0 0 0 0 9 10 11 0 13 0 0 Overall: 1 0 0 4 0 0 0 0 9 0 11 0 0 0 0 +: 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 -: 0 0 0 4 5 6 0 8 9 10 11 0 0 14 15 115 Overall: 1 0 0 0 0 0 0 0 9 10 0 0 0 0 15 +: 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 -: 0 0 0 0 0 0 0 0 9 10 11 0 0 0 0 (kNN/SBS) Overall: 0 0 0 0 0 0 0 0 9 10 0 12 0 14 0 +: 0 2 0 4 0 0 7 8 9 0 0 0 0 14 0 -: 0 0 3 4 0 6 0 8 9 10 11 12 0 0 0 Overall: 1 0 0 4 0 6 7 0 9 10 0 0 0 0 0 +: 1 0 0 4 0 6 0 0 9 0 0 0 0 14 0 -: 1 0 3 4 0 6 0 8 9 10 11 0 0 0 0 Overall: 1 0 0 0 0 0 0 0 9 10 0 0 0 0 0 +: 0 0 0 4 0 0 0 8 9 10 0 12 0 14 0 -: 0 0 3 4 5 0 7 0 9 0 11 12 13 0 0 Overall: 1 0 0 4 0 0 0 0 9 10 0 12 0 0 0 +: 1 0 3 0 0 0 7 0 9 0 0 0 0 0 0 -: 1 0 3 0 0 6 7 8 9 0 11 0 0 0 15 Overall: 0 0 0 4 5 0 7 0 9 10 0 12 13 14 0 +: 1 2 0 4 5 0 0 8 9 0 0 0 0 0 0 -: 1 0 3 4 0 0 7 0 9 0 11 0 0 0 0 Overall: 1 0 3 0 0 0 7 8 9 10 0 12 0 14 0 +: 0 0 0 4 0 0 7 0 9 0 0 12 0 14 0 -: 1 2 3 4 0 0 0 8 9 0 11 0 0 0 0 Overall: 1 0 0 4 0 6 7 0 9 0 0 0 0 14 0 +: 1 2 0 4 0 6 0 8 9 0 0 12 0 14 0 -: 0 0 3 4 0 6 7 0 9 0 11 12 0 0 0 Overall: 0 0 0 4 0 6 7 0 9 10 0 12 0 14 0 +: 0 0 3 0 0 6 7 0 9 10 0 12 0 0 0 -: 1 0 0 4 5 6 7 8 9 10 11 12 0 14 0 Overall: 1 0 0 4 0 6 7 0 9 10 0 12 0 0 0 +: 0 2 0 4 0 6 7 0 9 10 0 12 13 14 0 -: 1 2 3 4 0 0 7 8 9 0 11 12 13 0 0 Overall: 1 0 0 4 5 6 7 0 9 10 0 12 13 0 0 +: 0 2 0 4 0 0 7 0 9 0 0 0 0 14 0 -: 0 0 3 4 0 0 0 0 9 10 11 12 0 0 0 116 Overall: 0 0 0 4 0 0 7 0 9 10 0 0 0 14 0 +: 0 2 0 4 0 6 0 0 9 10 0 12 0 0 0 -: 1 0 3 4 5 6 0 0 9 10 11 12 13 0 0 Overall: 0 0 0 4 0 6 0 0 9 10 0 0 13 14 0 +: 0 0 0 4 0 6 7 0 9 0 0 0 0 14 0 -: 0 0 3 4 0 0 7 8 0 0 11 12 0 0 0 Overall: 0 2 0 4 0 6 7 0 9 10 0 0 13 14 0 +: 0 2 0 0 0 6 0 0 9 0 0 0 0 14 0 -: 0 0 3 0 0 0 0 8 9 0 11 12 0 0 0 Overall: 0 0 0 4 0 6 0 0 9 10 0 0 13 14 0 +: 0 0 0 0 0 0 7 0 9 0 0 0 0 0 0 -: 0 0 3 0 0 0 0 8 9 0 11 12 0 0 0 Overall: 1 0 0 4 0 6 0 0 9 10 0 0 13 0 0 +: 0 0 0 4 0 6 7 0 9 10 0 12 13 14 0 -: 1 0 3 4 0 6 0 8 9 0 11 12 0 0 0 (kNN/SFS) Overall: 1 2 0 0 0 6 0 0 9 10 0 0 13 14 0 +: 0 2 3 4 0 6 0 0 9 10 0 0 0 0 0 -: 0 2 0 0 0 6 0 0 9 10 11 0 13 0 0 Overall: 1 2 0 0 0 6 7 0 9 0 0 0 13 0 0 +: 0 0 0 0 0 0 7 0 9 0 0 0 0 14 0 -: 0 0 3 0 0 0 0 8 0 0 11 12 0 0 0 Overall: 0 2 0 4 0 6 7 0 9 10 0 0 13 14 0 +: 0 0 0 0 0 0 7 8 9 0 0 12 0 14 0 -: 1 2 3 0 0 0 0 0 0 0 11 12 0 0 0 Overall: 1 2 0 0 0 6 0 0 9 10 0 0 0 0 0 +: 0 2 3 4 5 0 7 0 9 0 0 12 13 14 0 -: 0 2 0 0 0 6 7 0 9 10 0 0 13 0 0 Overall: 1 2 0 4 0 6 0 0 9 0 0 12 0 14 0 +: 0 2 3 4 0 0 7 0 9 0 0 0 13 14 0 -: 1 2 3 4 5 6 7 0 9 10 11 12 13 14 0 Overall: 0 0 0 0 0 0 7 0 9 10 0 0 13 0 0 +: 0 0 3 4 0 0 0 0 9 0 0 12 13 14 0 -: 1 2 0 0 0 6 0 0 0 10 11 12 0 0 0 117 Overall: 1 2 0 4 0 6 7 0 9 0 0 12 13 14 0 +: 0 2 3 4 0 0 7 0 9 0 0 0 13 14 0 -: 0 2 3 4 0 0 0 0 0 0 11 12 0 0 15 Overall: 1 2 0 4 5 6 7 0 9 10 0 12 0 14 0 +: 0 0 3 0 0 
0 7 0 9 0 0 0 13 14 0 -: 0 2 3 4 5 6 0 0 9 10 11 12 13 0 0 Overall: 1 2 0 4 0 6 0 0 9 0 0 0 0 14 0 +: 0 0 0 0 0 6 7 0 9 0 0 0 0 14 0 -: 0 2 3 4 0 0 0 8 0 0 11 12 0 0 0 Overall: 0 2 0 4 0 6 0 0 9 0 0 0 0 0 0 +: 0 2 0 4 5 0 7 0 9 10 0 0 13 14 0 -: 0 2 0 4 0 6 7 0 9 10 0 0 0 0 0 Overall: 0 0 3 4 0 0 7 0 9 0 0 0 13 0 0 +: 0 2 3 0 0 0 7 0 9 10 0 0 0 14 0 -: 0 2 0 4 0 6 0 0 9 10 0 0 13 0 0 Overall: 0 0 0 4 0 6 0 0 9 10 0 0 13 14 0 +: 0 0 0 0 0 6 0 0 9 0 0 12 13 14 0 -: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 Overall: 1 2 0 4 0 6 7 0 9 10 0 12 13 14 0 +: 0 0 3 0 0 6 0 0 9 0 0 0 0 0 0 -: 0 2 0 4 5 6 7 0 9 10 0 0 13 14 0 Overall: 0 2 0 4 0 6 0 0 9 10 0 0 13 14 0 +: 0 0 0 0 0 0 7 0 9 10 0 0 13 0 0 -: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 Overall: 0 0 3 4 0 0 7 0 9 0 0 0 13 14 0 +: 0 0 0 0 0 0 7 0 9 0 0 0 0 14 0 -: 1 2 3 4 5 0 7 0 9 10 11 12 13 14 0 (kNN/GA) Overall: 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 +: 1 2 0 0 5 6 7 8 9 0 0 0 13 14 0 -: 0 2 3 4 5 0 0 8 9 0 11 12 0 0 0 Overall: 0 2 0 4 0 0 0 0 9 10 0 0 0 0 0 +: 0 0 0 0 0 0 7 0 9 0 0 0 0 14 0 -: 1 2 3 4 0 6 0 8 9 0 11 0 13 0 0 118 Overall: 1 2 0 4 0 6 0 0 9 10 0 12 13 0 0 +: 0 2 3 0 5 0 7 0 9 0 0 0 13 14 0 -: 1 2 3 0 0 0 0 0 9 0 11 12 13 14 0 Overall: 0 0 0 0 0 0 7 0 9 10 0 0 0 14 0 +: 0 0 0 4 5 6 7 0 9 10 0 12 0 14 0 -: 0 0 3 0 5 0 7 8 9 10 11 0 13 0 15 Overall: 1 0 0 0 5 6 0 0 9 0 0 12 13 14 0 +: 0 0 3 4 0 0 7 0 9 0 0 12 0 14 0 -: 0 0 3 0 5 0 7 8 0 0 11 0 0 0 0 Overall: 0 2 0 4 0 6 7 0 9 10 0 12 0 0 0 +: 0 0 3 0 5 0 0 0 9 0 0 12 13 14 0 -: 0 0 3 4 0 0 0 8 9 0 11 12 0 14 0 Overall: 0 2 0 0 0 0 7 0 9 10 0 0 0 0 0 +: 0 2 0 0 5 0 7 8 9 0 0 0 0 14 0 -: 0 0 3 0 5 6 0 0 9 10 11 12 13 0 0 Overall: 1 0 0 0 0 0 7 8 9 10 0 12 0 0 0 +: 0 0 3 0 0 0 7 0 9 0 0 0 13 14 0 -: 1 0 0 4 5 6 7 0 9 10 0 12 13 14 0 Overall: 0 0 0 4 5 6 7 0 9 10 0 0 0 14 0 +: 0 2 0 4 0 6 0 0 9 10 11 12 0 14 0 -: 0 0 3 0 0 0 0 8 9 0 11 12 13 0 0 Overall: 1 0 0 4 0 6 7 0 9 10 0 0 13 14 0 +: 0 2 0 4 0 0 7 0 9 10 0 0 13 14 0 -: 0 0 3 4 5 0 7 0 9 0 11 12 0 14 0 Overall: 0 0 3 0 5 0 0 0 9 0 0 0 0 0 0 +: 1 0 0 0 5 0 7 8 9 0 0 0 0 0 0 -: 0 0 3 4 0 0 0 8 9 0 11 12 13 14 0 Overall: 0 0 0 4 0 6 0 0 9 10 0 0 13 14 0 +: 1 0 0 4 5 6 0 0 9 0 0 12 13 14 0 -: 0 0 3 0 0 0 7 8 9 0 11 12 13 0 0 Overall: 0 0 0 0 0 0 0 0 9 10 0 0 0 14 0 +: 0 2 0 0 0 6 7 0 9 0 0 0 0 14 0 -: 1 0 3 0 5 6 7 8 9 0 11 0 0 0 0 119 Overall: 0 0 3 0 0 0 0 0 9 0 0 0 13 0 0 +: 1 2 0 4 5 6 0 0 9 0 0 0 0 14 0 -: 0 0 3 0 5 0 0 8 9 10 11 12 0 0 0 Overall: 0 2 0 0 0 6 7 0 9 10 0 12 13 14 0 +: 1 0 0 0 0 6 0 0 9 0 0 0 13 14 0 -: 0 0 3 0 5 6 7 8 9 0 11 12 0 0 0 Congressional Voting Dataset (NBC/SBS) Overall: 0 0 0 4 0 0 0 0 0 0 11 12 0 0 0 0 republican: 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 democrat: 0 0 0 4 0 0 0 0 0 0 11 12 0 0 0 0 Overall: 0 0 0 4 0 0 0 0 0 0 11 0 0 0 0 0 republican: 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 democrat: 0 0 0 4 5 0 0 0 0 0 0 12 0 0 0 0 Overall: 0 0 0 4 0 0 0 0 0 0 0 12 0 0 0 0 republican: 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 democrat: 0 0 0 4 5 0 0 0 0 0 0 12 0 0 0 0 Overall: 0 2 0 4 0 0 0 0 9 0 11 0 0 0 0 0 republican: 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 democrat: 0 0 0 4 5 0 0 0 0 0 11 0 0 0 0 0 Overall: 1 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 republican: 0 0 0 4 0 6 0 0 0 0 0 0 0 0 0 0 democrat: 1 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 Overall: 0 0 0 4 0 0 0 0 9 0 11 0 0 0 0 0 republican: 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 democrat: 1 0 0 4 5 0 0 0 0 10 11 12 0 0 0 0 Overall: 0 0 0 4 0 0 0 0 0 0 11 0 0 0 0 0 republican: 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 democrat: 1 0 0 4 0 0 0 0 0 0 11 12 0 0 0 0 Overall: 0 0 0 4 0 0 0 0 0 0 11 0 0 0 0 0 republican: 0 0 0 4 0 0 0 0 0 
0 0 0 0 0 0 0 democrat: 0 0 0 4 0 0 0 0 0 0 0 12 0 0 15 0 Overall: 0 0 3 4 0 0 0 0 0 0 11 0 0 0 0 0 republican: 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 democrat: 0 0 3 4 0 0 0 0 0 0 11 0 0 0 0 0 120 Overall: 0 0 0 4 0 0 0 0 0 0 0 12 0 0 0 0 republican: 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 democrat: 0 0 0 4 0 0 0 0 0 0 11 12 0 0 0 0 Overall: 0 0 0 4 0 0 0 0 0 0 0 12 0 0 0 0 republican: 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 democrat: 0 0 0 4 0 0 0 8 0 0 0 12 0 0 0 0 Overall: 1 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 republican: 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 democrat: 1 0 0 4 0 0 0 0 0 0 0 12 0 0 0 0 Overall: 0 0 0 4 0 0 0 0 0 0 11 0 0 0 0 0 republican: 0 2 0 4 0 0 0 0 0 0 0 0 0 0 0 0 democrat: 0 0 0 4 0 0 7 0 0 0 0 12 0 0 0 0 Overall: 0 0 0 4 0 0 0 0 0 0 11 0 0 0 0 0 republican: 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 democrat: 0 0 0 4 0 6 0 0 0 0 0 12 0 0 0 0 Overall: 0 2 0 4 0 0 0 0 0 0 0 0 0 0 0 0 republican: 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 democrat: 1 0 0 4 0 0 0 0 0 0 11 12 0 0 0 0 (kNN/SBS) Overall: 0 0 0 4 0 0 0 0 0 0 11 12 13 0 0 0 republican: 0 0 0 4 0 0 7 0 0 0 0 0 0 0 0 0 democrat: 0 0 0 4 5 0 0 0 0 0 11 0 0 0 0 0 Overall: 0 2 3 4 0 0 7 0 0 0 0 0 0 0 0 0 republican: 1 0 3 4 0 6 0 0 9 0 11 12 13 14 0 0 democrat: 0 0 3 4 0 0 0 0 0 0 11 12 0 0 0 0 Overall: 0 0 0 4 0 0 0 8 0 0 0 0 0 14 0 0 republican: 1 0 0 4 0 0 0 8 0 0 0 0 0 0 0 0 democrat: 0 0 3 4 0 6 0 8 9 10 0 12 0 0 0 0 Overall: 0 0 0 4 0 0 0 0 9 0 11 0 0 0 0 0 republican: 0 0 3 4 0 0 0 0 0 0 0 0 0 14 0 0 democrat: 1 0 0 4 5 0 0 8 0 0 0 12 0 0 0 0 Overall: 0 0 0 4 0 0 0 0 0 0 0 12 0 0 0 0 republican: 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 democrat: 0 0 0 4 0 0 7 0 0 0 0 0 0 0 0 0 121 Overall: 1 2 3 4 0 0 0 8 9 0 11 12 0 14 0 0 republican: 1 0 0 4 0 0 0 8 9 0 0 0 13 14 0 0 democrat: 1 2 0 4 0 0 0 0 0 0 11 12 0 0 0 0 Overall: 0 0 3 4 0 0 0 8 0 0 11 0 13 14 15 0 republican: 0 2 3 4 5 0 7 0 0 10 11 12 0 14 0 0 democrat: 0 0 3 4 0 0 7 0 0 10 11 0 0 14 0 0 Overall: 0 2 3 4 0 6 0 0 9 10 11 0 0 0 15 0 republican: 1 0 0 4 0 0 0 8 9 0 0 0 0 14 0 0 democrat: 0 0 3 4 5 6 7 0 0 10 0 12 0 0 15 0 Overall: 0 0 3 4 0 0 0 0 0 0 11 0 0 0 0 0 republican: 0 2 0 4 0 0 0 0 0 0 0 0 0 0 0 0 democrat: 0 0 3 4 0 0 7 0 9 0 11 0 0 0 0 0 Overall: 0 2 3 4 0 6 7 0 9 0 11 12 13 14 0 0 republican: 0 2 3 4 0 0 7 0 0 0 0 0 0 0 0 0 democrat: 0 0 3 4 0 0 7 0 0 0 11 0 0 0 0 0 Overall: 0 0 3 4 0 0 7 0 0 0 0 12 0 14 0 0 republican: 0 2 0 4 0 0 7 0 0 0 0 0 13 0 0 0 democrat: 0 2 3 4 0 0 7 0 0 0 0 0 0 0 0 0 Overall: 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 republican: 0 0 0 4 0 0 0 8 0 0 0 0 0 0 0 0 democrat: 0 0 3 4 0 0 0 0 0 0 11 0 0 0 0 0 Overall: 1 0 3 4 5 6 0 0 9 10 11 0 13 14 0 16 republican: 1 0 0 4 0 6 0 0 9 10 11 12 13 14 0 0 democrat: 0 2 3 4 5 0 0 8 0 0 11 0 0 0 0 0 Overall: 0 0 3 4 0 0 0 0 0 10 11 0 13 0 0 0 republican: 0 2 0 4 0 0 0 0 0 0 0 0 0 0 0 0 democrat: 0 0 3 4 5 0 0 0 0 0 11 12 13 0 0 0 Overall: 0 0 0 4 0 0 0 0 0 0 0 12 0 14 0 0 republican: 0 2 0 4 0 0 0 0 0 0 0 0 0 0 0 0 democrat: 1 2 0 4 0 0 0 0 0 0 11 12 0 0 0 0 (kNN/SFS) Overall: 0 0 0 4 0 0 7 0 0 0 0 12 0 0 0 0 republican: 0 0 0 4 0 0 7 0 0 0 0 0 0 0 0 0 democrat: 0 0 0 4 5 0 0 0 0 0 11 0 0 0 0 0 122 Overall: 0 0 3 4 0 0 7 0 0 10 11 0 0 0 0 0 republican: 0 2 0 4 0 0 0 0 0 0 0 0 0 0 0 0 democrat: 0 0 3 4 0 0 0 0 0 0 11 12 0 0 0 0 Overall: 0 0 0 4 0 0 0 0 9 0 0 0 0 0 0 16 republican: 0 0 0 4 0 0 0 0 0 10 0 0 0 0 0 0 democrat: 0 0 0 4 0 0 7 0 0 0 11 12 0 0 15 0 Overall: 0 2 0 4 0 0 0 0 0 0 11 12 0 0 0 0 republican: 1 2 3 4 0 0 0 0 0 0 0 0 0 14 0 0 democrat: 0 0 0 4 5 0 0 0 0 0 11 0 13 0 0 0 Overall: 0 0 0 4 0 0 0 0 0 0 0 12 0 0 0 0 republican: 0 2 
3 4 0 0 0 0 0 0 0 0 0 0 15 16 democrat: 0 0 3 4 0 0 0 0 9 0 0 0 13 0 0 0 Overall: 1 0 0 4 5 6 0 0 0 10 0 12 0 14 0 0 republican: 1 0 0 4 5 0 0 8 0 0 0 0 0 14 0 16 democrat: 0 0 0 4 0 0 0 0 0 10 0 0 0 0 0 0 Overall: 1 2 3 4 0 6 0 0 0 0 11 0 0 0 0 0 republican: 1 0 0 4 0 6 7 0 9 0 11 0 0 14 0 0 democrat: 0 0 3 4 5 6 7 8 0 10 11 0 0 14 0 0 Overall: 0 0 0 4 0 0 0 0 0 0 0 12 0 0 0 0 republican: 0 0 0 4 0 0 0 0 0 0 0 12 0 0 0 0 democrat: 0 2 3 4 5 6 7 0 9 0 11 12 0 0 0 0 Overall: 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 16 republican: 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 16 democrat: 0 2 3 4 0 0 0 0 9 0 11 0 13 0 0 0 Overall: 0 0 0 4 0 0 0 0 0 0 0 12 0 0 0 16 republican: 0 0 0 4 0 0 0 0 0 10 0 0 0 0 0 0 democrat: 0 0 3 4 0 0 7 0 0 10 11 0 0 0 15 0 Overall: 0 0 3 4 0 6 7 8 0 10 0 0 0 14 0 0 republican: 0 2 0 4 0 0 0 0 0 0 0 0 0 0 0 0 democrat: 0 0 3 4 0 0 0 0 0 0 11 0 0 0 0 0 Overall: 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 republican: 0 2 0 4 0 0 0 0 0 0 0 0 0 0 0 0 democrat: 0 2 3 4 0 0 0 0 0 10 0 0 0 0 0 0 123 Overall: 1 2 3 4 0 6 0 0 9 10 11 12 13 14 0 16 republican: 0 2 0 4 0 0 0 0 0 0 0 0 0 0 0 0 democrat: 0 0 3 4 0 0 0 0 0 0 0 12 0 0 0 0 Overall: 0 0 0 4 0 0 0 0 0 0 11 0 0 0 0 0 republican: 1 2 3 4 5 6 7 0 0 0 11 12 0 14 0 0 democrat: 1 0 0 4 5 0 0 0 9 0 11 12 0 0 0 0 Overall: 1 0 0 4 0 0 0 0 0 0 0 0 0 0 0 16 republican: 0 2 3 4 0 6 7 0 9 0 0 0 0 14 0 0 democrat: 0 0 3 4 0 6 7 0 0 0 0 12 0 0 0 0 (kNN/GA) Overall: 0 2 0 4 0 0 0 0 9 0 0 0 0 14 0 0 republican: 0 0 0 4 5 0 0 8 0 10 0 0 13 14 0 0 democrat: 1 0 0 4 0 6 7 0 0 10 11 12 0 0 0 0 Overall: 0 2 3 4 5 0 7 0 0 0 0 0 13 14 0 0 republican: 1 2 3 4 0 6 0 0 9 10 11 12 13 14 0 0 democrat: 0 2 3 4 5 0 0 0 0 0 11 0 13 14 15 0 Overall: 0 0 0 4 0 0 7 0 0 0 11 12 13 0 0 0 republican: 0 0 0 4 0 0 0 0 9 10 0 12 0 0 0 16 democrat: 0 0 3 4 0 6 0 0 9 10 0 12 0 0 0 0 Overall: 0 0 0 4 5 6 0 0 0 0 0 0 0 14 0 0 republican: 0 2 0 4 0 0 7 8 0 0 11 12 0 14 0 0 democrat: 0 0 0 4 0 0 0 0 0 0 11 0 0 0 0 0 Overall: 0 0 0 4 0 6 0 0 9 0 0 12 0 0 0 0 republican: 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 democrat: 0 0 0 4 0 0 0 0 0 0 0 12 13 14 0 0 Overall: 0 2 3 4 0 0 0 0 0 0 0 0 0 0 15 16 republican: 1 0 3 4 0 0 0 0 0 10 0 0 0 14 15 0 democrat: 1 0 3 4 5 0 0 0 0 10 11 12 0 0 0 0 Overall: 0 0 0 4 0 6 0 0 0 0 0 12 13 0 0 16 republican: 0 2 3 4 0 0 0 0 9 0 0 0 0 0 15 16 democrat: 0 0 3 4 5 0 7 8 9 10 11 0 13 14 15 0 Overall: 0 0 3 4 5 0 7 8 0 10 11 0 0 14 0 0 republican: 0 0 0 4 0 0 7 0 0 0 0 12 0 0 0 16 democrat: 0 0 3 4 5 6 0 8 0 10 0 0 0 14 15 0 124 Overall: 1 0 3 4 5 6 7 0 0 0 11 12 0 0 15 16 republican: 1 2 0 4 5 0 7 0 0 0 0 12 13 14 0 0 democrat: 0 0 3 4 0 0 7 0 0 0 11 12 0 0 0 0 Overall: 0 0 0 4 0 0 0 8 9 10 0 0 0 0 0 0 republican: 1 2 3 4 0 6 7 8 9 10 11 12 0 0 0 0 democrat: 0 0 3 4 5 6 7 0 0 0 11 12 0 14 15 0 Overall: 0 0 0 4 0 0 0 0 0 0 0 12 0 14 15 16 republican: 0 2 0 4 0 6 7 0 0 0 0 0 13 14 0 0 democrat: 0 2 3 4 0 0 7 0 0 0 0 0 0 0 0 0 Overall: 0 2 3 4 0 0 0 0 0 10 0 0 0 0 0 0 republican: 0 2 3 4 0 0 7 0 0 10 0 0 13 14 15 16 democrat: 0 0 0 4 0 0 0 8 0 0 0 12 0 0 0 0 Overall: 0 0 3 4 0 6 0 0 9 10 0 12 0 0 0 0 republican: 0 0 0 4 0 0 7 0 0 0 0 0 13 14 0 0 democrat: 1 0 3 4 5 0 7 0 0 10 11 0 0 0 15 16 Overall: 0 0 0 4 0 0 0 0 9 0 11 0 13 0 0 0 republican: 0 0 0 4 0 6 0 0 0 10 11 0 0 0 0 0 democrat: 0 0 0 4 5 6 0 0 0 0 11 12 0 0 0 0 Overall: 0 0 0 4 0 0 0 8 9 0 11 12 13 14 0 0 republican: 0 0 0 4 0 6 7 0 9 0 0 12 13 14 15 16 democrat: 0 0 3 4 0 0 0 0 0 0 11 12 0 0 0 0 Montgomery County Jails Dataset (kNN/SBS) Overall: 1 0 0 0 5 6 0 0 0 0 11 12 0 0 0 0 -1: 0 2 3 0 5 0 0 0 0 0 11 0 0 0 15 16 1: 
0 0 0 4 5 6 0 8 0 0 0 12 0 14 0 0 Overall: 0 0 0 4 5 6 0 0 0 0 11 0 13 0 0 0 -1: 1 0 0 4 5 6 7 0 0 0 11 0 13 0 15 0 1: 0 0 0 4 5 6 7 8 9 10 0 0 13 14 15 16 Overall: 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 -1: 0 2 3 0 5 0 0 0 0 0 11 0 13 0 15 16 1: 0 0 0 4 5 6 0 8 0 0 11 12 0 0 0 0 Overall: 0 2 0 4 5 6 7 0 0 10 11 12 13 14 0 16 -1: 0 2 0 0 5 6 0 0 9 0 11 0 0 0 15 16 1: 0 0 0 4 0 0 0 0 0 0 0 0 0 14 0 0 125 Overall: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 -1: 1 2 0 4 0 0 0 0 0 0 11 0 0 0 0 0 1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 Overall: 1 2 0 4 5 6 7 0 0 0 0 0 0 0 15 0 -1: 1 2 3 0 0 6 7 8 0 10 11 12 13 14 15 16 1: 1 0 0 4 5 6 7 8 9 0 11 0 13 14 15 0 Overall: 1 2 0 4 5 6 7 8 0 10 11 12 13 14 0 16 -1: 0 0 3 0 0 6 0 0 0 10 11 0 13 0 15 16 1: 1 0 0 4 0 0 0 8 0 0 0 12 13 14 0 0 Overall: 0 0 0 0 0 0 0 0 0 0 11 0 0 0 0 0 -1: 0 2 3 0 5 6 0 0 0 0 11 0 13 0 0 0 1: 1 2 0 4 5 0 7 8 9 10 11 0 13 14 15 0 Overall: 0 2 3 4 5 6 7 0 9 10 11 0 13 14 15 16 -1: 0 2 3 0 0 0 0 0 9 10 11 0 13 0 15 16 1: 1 0 0 4 0 6 0 8 0 0 11 12 13 14 15 0 Overall: 1 0 0 4 5 0 7 0 0 0 0 0 0 14 15 16 -1: 0 2 3 0 5 6 0 0 0 0 0 0 0 0 0 16 1: 1 0 0 4 5 6 7 8 0 0 11 0 13 14 15 16 Overall: 1 0 0 0 0 0 0 0 0 10 0 0 13 14 0 0 -1: 1 2 0 0 5 0 7 0 0 0 11 0 0 0 0 16 1: 0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 Overall: 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 -1: 0 0 3 0 5 0 0 0 0 0 0 0 0 0 15 16 1: 1 0 0 4 0 0 0 8 0 0 11 0 13 0 0 0 Overall: 1 2 3 4 5 0 7 0 0 10 11 12 13 0 0 0 -1: 0 2 3 0 5 6 0 0 0 0 11 0 0 0 15 16 1: 1 0 3 4 5 0 7 8 0 10 11 12 13 14 0 16 Overall: 0 2 0 0 0 0 0 8 0 0 0 0 0 0 15 0 -1: 0 2 3 0 5 6 7 0 0 10 11 12 13 14 15 0 1: 0 0 0 0 0 6 7 8 0 10 0 0 0 14 15 0 Overall: 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1: 1 2 0 0 5 6 7 0 0 10 0 12 13 0 15 16 1: 0 2 0 0 0 6 7 8 0 10 0 0 0 14 0 0 126 (kNN/SFS) Overall: 0 0 0 0 5 0 0 0 0 0 11 12 0 14 0 0 -1: 0 2 0 0 5 6 7 0 9 0 0 0 0 14 15 16 1: 0 0 0 0 0 0 0 0 9 0 0 12 0 0 0 0 Overall: 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1: 0 0 0 0 0 0 0 0 0 0 11 0 0 0 15 0 1: 0 0 0 0 5 0 0 8 0 0 0 0 0 0 0 0 Overall: 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 -1: 0 2 3 0 5 6 7 0 9 10 11 12 0 0 15 16 1: 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 Overall: 1 2 0 0 0 6 0 0 9 0 0 12 0 14 0 0 -1: 0 2 3 0 5 6 0 0 0 10 11 0 0 0 15 16 1: 1 0 0 4 5 0 0 8 9 10 0 12 0 14 0 0 Overall: 0 0 0 0 0 0 0 0 0 10 0 0 0 0 0 0 -1: 1 2 0 0 5 6 0 0 9 0 11 12 13 14 15 16 1: 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 0 Overall: 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1: 1 2 0 0 0 0 0 0 9 0 11 12 0 14 15 16 1: 1 0 0 0 5 6 0 8 9 10 0 12 0 14 0 16 Overall: 0 0 0 0 0 0 0 0 0 0 0 12 0 0 0 0 -1: 0 0 3 0 0 0 0 8 9 10 0 12 13 14 15 16 1: 0 0 0 0 5 0 0 0 0 0 0 0 0 14 0 0 Overall: 0 0 0 0 0 0 0 0 0 0 11 0 0 0 0 0 -1: 1 2 3 0 5 6 7 8 9 10 11 12 13 14 15 16 1: 1 0 0 0 0 0 0 0 0 0 11 0 0 0 0 0 Overall: 0 0 0 0 0 0 0 0 0 0 0 12 13 14 0 16 -1: 0 2 3 0 0 0 0 0 9 10 11 0 13 0 15 16 1: 1 2 0 4 0 6 0 8 9 10 11 12 13 14 15 16 Overall: 0 0 0 0 0 0 0 0 0 10 11 0 0 0 0 16 -1: 0 0 0 0 0 0 7 0 0 10 11 0 0 14 15 16 1: 0 0 0 0 5 0 7 0 0 0 0 0 0 0 0 16 Overall: 0 0 0 0 0 0 0 8 9 0 0 0 0 0 0 0 -1: 0 2 0 0 5 6 0 8 9 10 11 12 0 0 0 16 1: 0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 127 Overall: 0 0 0 0 5 0 0 0 9 0 0 0 0 0 0 0 -1: 0 2 3 0 5 0 0 0 9 10 11 0 13 0 15 16 1: 0 0 0 0 5 0 0 0 9 0 0 0 0 0 0 0 Overall: 0 0 0 0 0 0 0 8 9 0 0 0 0 14 0 0 -1: 0 2 3 0 5 0 7 0 9 0 11 0 0 0 15 16 1: 0 0 0 0 0 0 0 8 9 10 0 0 0 14 0 0 Overall: 0 2 0 0 0 0 0 8 0 0 0 0 0 0 15 0 -1: 1 2 3 0 5 6 7 8 9 10 11 12 13 14 15 16 1: 0 0 0 0 0 0 0 8 9 0 0 0 0 0 0 0 Overall: 0 0 0 0 0 0 0 0 0 10 0 0 0 0 0 0 -1: 0 0 0 0 0 0 0 8 0 10 0 0 0 0 0 0 1: 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 (kNN/GA) Overall: 
0 0 0 0 0 0 0 0 9 0 0 12 0 14 15 16 -1: 0 2 0 0 0 0 7 0 0 0 0 0 0 14 15 16 1: 0 0 0 0 0 0 0 0 9 0 0 12 0 0 0 0 Overall: 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1: 0 2 0 0 5 0 7 0 9 0 0 0 13 0 15 16 1: 0 0 0 0 0 0 0 8 0 0 0 0 0 14 15 0 Overall: 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 -1: 0 2 3 0 5 0 0 0 0 0 11 0 13 0 15 16 1: 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 Overall: 1 2 0 0 0 0 0 0 0 0 0 0 0 14 0 0 -1: 0 2 0 0 0 6 0 0 0 0 0 0 0 0 15 16 1: 0 2 0 4 0 0 0 8 0 10 0 12 13 14 0 0 Overall: 0 0 0 0 0 0 0 0 0 10 0 0 0 0 0 0 -1: 0 0 0 0 0 6 0 0 9 0 11 12 0 0 15 0 1: 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 0 Overall: 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 0 -1: 0 2 0 0 0 6 0 0 9 0 11 12 0 14 15 16 1: 0 0 0 0 0 0 0 8 9 0 0 12 0 0 15 0 Overall: 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 0 -1: 0 2 0 0 5 0 7 0 0 0 11 12 0 0 15 16 1: 0 0 0 0 0 0 7 0 9 0 0 0 0 0 0 0 128 Overall: 0 0 0 0 0 0 0 0 0 0 11 0 0 0 0 0 -1: 1 2 3 0 0 0 0 8 9 10 0 12 13 0 15 16 1: 0 2 0 0 0 0 0 0 0 0 0 0 0 0 15 0 Overall: 1 0 0 0 0 0 0 8 0 0 0 0 13 14 0 16 -1: 0 2 3 0 5 0 0 0 9 10 11 0 13 14 15 16 1: 0 0 0 4 0 6 0 8 9 10 11 0 13 14 15 0 Overall: 0 0 0 0 5 0 0 0 9 0 0 0 0 14 15 0 -1: 1 2 3 0 5 0 0 8 9 10 11 0 0 0 0 16 1: 0 0 0 0 0 0 0 0 0 10 11 0 0 0 0 16 Overall: 0 0 0 0 0 0 0 8 9 0 0 0 0 0 0 0 -1: 0 0 0 0 0 0 7 0 9 0 0 12 13 0 0 16 1: 0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 Overall: 0 0 0 0 5 0 0 0 9 0 0 0 0 0 0 0 -1: 0 2 3 0 5 0 0 0 9 10 11 0 13 0 15 16 1: 0 0 3 4 5 0 0 8 9 10 0 0 0 14 0 0 Overall: 0 0 0 4 5 6 0 0 0 10 0 12 0 14 0 16 -1: 0 0 3 0 5 0 0 0 9 0 0 0 13 0 15 0 1: 0 0 0 0 0 0 0 8 9 0 11 0 0 0 0 0 Overall: 0 0 0 0 0 0 0 8 0 0 0 0 0 0 15 0 -1: 0 2 3 0 5 6 7 0 9 10 11 12 13 14 15 0 1: 0 2 0 0 0 0 0 8 0 0 0 0 0 0 0 0 Overall: 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1: 0 0 0 0 0 0 0 8 0 10 0 0 0 0 0 0 1: 0 0 0 0 0 0 7 8 9 10 0 0 0 14 0 0
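Reading the appendices: the following minimal Python sketch is offered only as an illustration and is not part of the dissertation's implementation; the function names are hypothetical, and the handful of example values are taken from the first Breast Cancer (NBC/SBS) "Overall" subset and the first three Lymphoma (kNN - GA) runs reported above. It shows how the Appendix B encoding (a zero marks an unused feature, a non-zero entry is the index of a used feature) can be decoded into the set of selected feature indices, and how the Diff column of the Appendix A tables corresponds to the CEFS accuracy minus the traditional feature selection accuracy for each run.

def decode_subset(encoded):
    # Appendix B, Note 1: zeros mark features that were not used; any non-zero
    # value is the index of a feature included in the selected subset.
    return {v for v in encoded if v != 0}

def diff_per_run(trad_fs, cefs):
    # Appendix A: Diff = CEFS accuracy minus traditional-FS accuracy, per run.
    return [c - t for t, c in zip(trad_fs, cefs)]

# First "Overall" subset listed for the Breast Cancer dataset (NBC/SBS):
print(decode_subset([0, 0, 0, 0, 5, 6, 0, 0, 0]))   # prints {5, 6}

# First three runs of the Lymphoma (kNN - GA) table:
trad = [62.5, 75.0, 87.5]
cefs = [75.0, 62.5, 87.5]
print(diff_per_run(trad, cefs))                     # prints [12.5, -12.5, 0.0]
print(sum(cefs) / len(cefs))                        # average CEFS accuracy over these runs

The same decoding applies unchanged to the class-specific rows (for example, "no-recurrence-events" or "republican"), since they use the same zero/non-zero convention as the "Overall" rows.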