A Clustering Rule Based Approach for Classification Problems

by

Philicity Kapryelle Williams

A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Auburn, Alabama
August 9, 2010

Keywords: Data Mining, Classification, Rule Based Classifiers, Data Clustering

Copyright 2010 by Philicity Kapryelle Williams

Approved by
Juan E. Gilbert, Chair, Professor of Computer Science and Software Engineering
N. Hari Narayanan, Professor of Computer Science and Software Engineering
Cheryl D. Seals, Associate Professor of Computer Science and Software Engineering

Abstract

Today's data storage and collection abilities have allowed the accumulation of enormous amounts of data. Data mining can be a useful tool in transforming these large amounts of raw data into useful information. Predictive modeling is a very popular area in data mining. The results of these types of tasks can contain helpful information that can be used in decision making. Problems arise when the data sets that are used to build these models are not as complete (e.g., have erroneous or missing values) as the data used to evaluate the model. Rule based classifiers are a widely used and accepted type of predictive model. We present a method to reduce the severity of the effects of missing data on the performance of rule based classifiers using divisive data clustering. The Clustering Rule based Approach (CRA) clusters the original training data and builds a separate rule based model on each cluster's data. The individual models are combined into a larger model and evaluated against test data. We evaluate the effects of missing attribute information for ordered and unordered rule sets. We experimentally show that the collective model is less affected by missing attribute information in the test data.

Acknowledgements

First and foremost, I would like to thank God, for he is my strength and through him I can do all things. I would like to thank Dr. Juan Gilbert for all his support, encouragement, and patience. The many times when I did not know how I was going to make it, or didn't think I could do it, he was there to guide me. Thank you to my family for their love, support, and encouragement. They have been very supportive and encouraging throughout my academic tenure here at Auburn. Last but not least, I would like to say thank you to all of the members of the Human Centered Computing Lab. Their support, encouragement, and just being there will forever be appreciated!!

Table of Contents

Abstract
Acknowledgements
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Motivation
1.2 Problem Description and Background
1.3 Organization
Chapter 2 Literature Review
2.1 Data Mining Overview
2.2 Predictive Modeling
2.3 Association Analysis
2.4 Cluster Analysis
2.5 Anomaly Detection
2.6 Ensemble Methods
2.7 Other Related Information and Work
2.8 Data Mining Tools
Chapter 3 Research Details
3.1 Conceptual Overview
Chapter 4 Experiment Design
4.1 Data
4.2 Materials and Tools
4.2.1 PRISM
4.2.2 Applications Quest
4.3 Procedure
4.4 Expected Outcomes
Chapter 5 Research Findings
5.1 Results
Chapter 6 Summary
6.1 Summary and Conclusion
6.2 Future Work
References

List of Tables

Table 1 Tennis Training Data
Table 2 Vertebrate Unlabeled/Unseen Data
Table 3 Data Sets
Table 4 Similarity Matrix with Difference Measures Example
Table 5 Example Nominal Population Matrix
Table 6 Nominal Population Matrix Example
Table 7 Example Nominal Population Matrix
Table 8 Example Nominal Population Matrix Adjusted
Table 9 Congressional Voting Results for Complete Data (No Missing Attributes)
Table 10 Congressional Voting 1 Missing Attribute
Table 11 Congressional Data Results 2 Missing Attributes
Table 12 Congressional Voting Data Results 3 Missing Attributes
Table 13 Congressional Voting Results 4 Missing Attributes
Table 14 Congressional Voting Example Traditional Rule Set
Table 15 Congressional Voting Example Collective Rule Set
Table 16 Anneal Results Complete Data (No Missing Attributes)
Table 17 Anneal Results 1 Missing Attribute
Table 18 Anneal Results 2 Missing Attributes
Table 19 Anneal Results 3 Missing Attributes
Table 20 Anneal Results 4 Missing Attributes
Table 21 Example Anneal Traditional Rule Set
Table 22 Example Anneal Collective Rule Set
Table 23 Thyroid Results Complete Data (No Missing Attributes)
Table 24 Thyroid Results 1 Missing Attribute
Table 25 Thyroid Results 2 Missing Attributes
Table 26 Thyroid Results 3 Missing Attributes
Table 27 Thyroid Results 4 Missing Attributes
Table 28 Thyroid Example Traditional Model
Table 29 Thyroid Example Collective Model
Table 30 Mushroom Results Complete Data (No Missing Attributes)
Table 31 Mushroom Results 1 Missing Attribute
Table 32 Mushroom Results 2 Missing Attributes
Table 33 Mushroom Results 3 Missing Attributes
Table 34 Mushroom Results 4 Missing Attributes
Table 35 Mushroom Example Traditional Model
Table 36 Mushroom Example Collective Model
Table 37 Tic-Tac-Toe Results Complete Data (No Missing Attributes)
Table 38 Tic-Tac-Toe Results 1 Missing Attribute
Table 39 Tic-Tac-Toe Results 2 Missing Attributes
Table 40 Tic-Tac-Toe Results 3 Missing Attributes
Table 41 Tic-Tac-Toe Results 4 Missing Attributes
Table 42 Tic-Tac-Toe Example Traditional Model
Table 43 Tic-Tac-Toe Example Collective Model
Table 44 Best Performance for Best and Worst Conditions Ordered Models
Table 45 Best Performance for Best and Worst Conditions Unordered Models
Table 46 Results of Welch Two Sample t-test (Ordered Models)
Table 47 Results of Welch Two Sample t-test (Unordered Models)

List of Figures

Figure 1 ID3 Decision Tree
Figure 2 Data mining's relationship to other disciplines
Figure 3 CRISP-DM Model
Figure 4 Four Core Data Mining Tasks
Figure 5 Vertebrate Training Data (Labeled/Known Outcomes)
Figure 6 Vertebrate Rule Set/Model
Figure 7 Simple Decision Tree
Figure 8 Basic Sequential Covering Algorithm
Figure 9 Rule Growing Strategies
Figure 10 Simple Neural Network
Figure 11 Lattice Structure
Figure 12 Apriori Principle
Figure 13 Support Based Pruning
Figure 14 Records to be clustered
Figure 15 Cluster Hierarchy
Figure 16 Linkage Examples
Figure 17 K-means Algorithm
Figure 18 Ensemble Method Process
Figure 19 Bagging and Boosting Example
Figure 20 CRA Flow
Figure 21 PRISM Pseudocode
Figure 22 Congressional Voting Accuracies Unordered Model
Figure 23 Congressional Voting Accuracies Ordered Model
Figure 24 Congressional Voting Average Rule Lengths
Figure 25 Anneal Accuracies Unordered Model
Figure 26 Anneal Accuracies Ordered Model
Figure 27 Anneal Average Rule Lengths
Figure 28 Thyroid Accuracies Unordered Models
Figure 29 Thyroid Accuracies Ordered Models
Figure 30 Thyroid Average Rule Lengths
Figure 31 Mushroom Accuracies Unordered Models
Figure 32 Mushroom Accuracies Ordered Models
Figure 33 Mushroom Average Rule Lengths
Figure 34 Tic-Tac-Toe Accuracies Unordered Models
Figure 35 Tic-Tac-Toe Accuracies Ordered Models
Figure 36 Tic-Tac-Toe Average Rule Lengths
Figure 37 Percentage Drop in Accuracy Ordered Models
Figure 38 Percentage Drop in Accuracy Unordered Models
Figure 39 Ordered vs. Unordered Accuracies Overall

Chapter 1
Introduction

1.1 Motivation

Advances in data collection and storage capabilities have enabled organizations, companies, institutions, etc., to collect immense amounts of data. The challenge lies in trying to turn all of that raw data into useful information. One way to address this problem is with the use of data mining. Data mining can be defined as "sorting through data to identify patterns and establish relationships" (TechTarget, 2008). Data mining blends traditional data analysis methods with sophisticated algorithms for processing large amounts of data. These algorithms search large data repositories in order to find new and useful patterns that might otherwise have gone unidentified.

Data mining methods can be and are being used in many domains to address a variety of tasks. One of the more popular uses of data mining techniques is predictive modeling or classification. Certain types of data mining algorithms (i.e., decision trees, neural networks, etc.) have the ability to predict future outcomes using the attributes in the data set. For example, let's assume that a long distance company wants to promote its new long distance plan. The company can apply predictive modeling algorithms to specifically target customers who are more likely to sign up for the plan based on their past behavior.

As one could imagine, the accuracy of predictive modeling algorithms is very important. Their results contain information that can be very valuable and have the ability to help make decisions and drive change. Rule based algorithms are widely used and accepted for classification tasks due to their ease of interpretability and understanding. These classifiers build a model or rule set from training data as a set of high-quality rules, which can be used to predict the class labels of unlabeled instances (Wang & Karypis, 2003). Many rule based classification systems have been studied and proposed which yield adequate classification accuracy in a variety of applications, such as RIPPER (Cohen, 1995), FOIL (Quinlan & Gaetz, 1993), and CPAR (Yin & Han, 2003). However, if the test data used to evaluate the model is not as complete as the training data used to build the model, classification accuracy suffers, as the model tends to follow the training data too closely.
This is problematic, as in most real world applications data is often incorrect or has missing parts. Thus, it is advantageous to have a rule set or model that is robust and can make reasonably accurate predictions when the test data is incomplete. This research explores the notion of adding robustness to rule based classifiers via divisive clustering.

1.2 Problem Description and Background

Rule based classifiers are used widely because of the ease of interpretability of the models or sets of rules they generate. These classifiers perform exceptionally well on complete data sets, meaning the data is clean, correct, and does not have missing attribute values. To generate the model or train the classifier, the attributes and values in the training data are used relative to each other to segregate the data and generate rules relative to a particular class. This poses a problem, as real world data sets are not clean and correct and often have missing attribute values. The models produced by traditional rule based classifiers are sensitive to missing attribute values in new/unseen data. For example, Table 1 and Figure 1 depict a popular data set and the resulting decision tree (Quinlan J., 1993).

Table 1 Tennis Training Data

Figure 1 ID3 Decision Tree

The rules derived from this tree are:

If outlook is sunny and humidity is high, then do not play tennis.
If outlook is sunny and humidity is normal, then play tennis.
If outlook is overcast, then play tennis.
If outlook is rain and wind is strong, then do not play tennis.
If outlook is rain and wind is weak, then play tennis.

Notice that all of the rules derived contain the outlook attribute. Suppose there is some test data in which the outlook information is missing. This model is not robust, as none of the rules generated would be able to make a prediction. In fact, as the number of attributes within a data set that have missing data increases, the classifier's performance/accuracy decreases. The model that the classifier generates is unable to classify all of the instances in new/unseen test data. For the purposes of this research we call this a hard classification problem.

This research proposes a new method for adding robustness to rule based classifiers using divisive clustering for hard classification problems. With this approach the training data set is broken into clusters based on holistic similarities such that items within a particular cluster are more alike than those outside of the cluster. Each cluster is used to train a model on a smaller, more homogeneous data set. Multiple models are created using the clustered data and then combined into a single model (the experiment). The hypothesis for this research is that the resulting combined rule base will be more resilient against missing attribute information. Resiliency refers to minimizing any decrease in classification accuracy, as the number of missing attributes increases, versus the traditional (the control) model. It is believed that the collective model is more resilient than the traditional model because it will have an added level of robustness and the ability to classify new/unseen data instances that would not have been classified with a traditional classifier. The collective model contains rules that otherwise would not have been created because of the alternate views of the data set used to train the multiple individual classifiers.
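To make this failure mode concrete, the short sketch below (written in Python for this discussion; it is not part of the dissertation's tooling, and the attribute names and the first_match helper are illustrative assumptions) encodes the five tennis rules above and attempts to classify a record whose outlook value is missing.

```python
# Hypothetical sketch: the five tennis rules derived from the ID3 tree above,
# each expressed as (conditions, predicted class).
RULES = [
    ({"outlook": "sunny", "humidity": "high"}, "do not play"),
    ({"outlook": "sunny", "humidity": "normal"}, "play"),
    ({"outlook": "overcast"}, "play"),
    ({"outlook": "rain", "wind": "strong"}, "do not play"),
    ({"outlook": "rain", "wind": "weak"}, "play"),
]

def first_match(record, rules):
    """Return the prediction of the first rule whose conditions all match."""
    for conditions, label in rules:
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return label
    return None  # no rule covers the record

complete = {"outlook": "rain", "humidity": "high", "wind": "weak"}
missing_outlook = {"humidity": "high", "wind": "weak"}  # outlook not recorded

print(first_match(complete, RULES))         # -> "play"
print(first_match(missing_outlook, RULES))  # -> None: every rule tests outlook
```

Because every rule tests outlook, a single missing attribute leaves the traditional model with no applicable rule, which is exactly the hard classification problem the collective model is intended to soften.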
1.3 Organization

In the following chapters the research agenda is examined. Chapter 2 provides an overview of data mining techniques and practices, more specifically predictive modeling/classification and data clustering, and discusses work pertinent to this research. Chapter 3 provides a conceptual view of the approach studied in this research. Chapters 4 and 5 describe the design of the experiment conducted to study the validity of the approach and the results of that experiment, respectively. Finally, Chapter 6 provides a summary and conclusions and discusses areas of future work.

Chapter 2
Literature Review

2.1 Data Mining Overview

Data mining is the process of discovering useful information in large amounts of data. It is fundamentally built on the principles of algorithms used in statistics, artificial intelligence, pattern recognition, and machine learning. From statistics, data mining incorporates characteristics such as sampling, estimation, and hypothesis testing. From artificial intelligence, pattern recognition, and machine learning, data mining incorporates the following: search algorithms, modeling techniques, and learning theories. Figure 2 conveys data mining's relationship to other areas (Tan, Steinbach, & Kumar, 2006).

Figure 2 Data mining's relationship to other disciplines

Data mining is being used in many domains in a variety of ways. Due to this, the Cross Industry Standard Process for Data Mining (CRISP-DM) was developed. This standard is tool, industry, and application neutral. The life cycle according to the standard has six phases: Business/Research Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. Figure 3 depicts CRISP-DM (CRISP-DM, 2000).

Figure 3 CRISP-DM Model

Most data mining tasks are either predictive or descriptive in nature. Predictive tasks are ones in which the goal is to predict the value of a particular attribute based on the values of other attributes. The attribute that is being predicted is frequently referred to as the target or dependent variable. The other attributes that are used for making the prediction are referred to as the explanatory or independent variables. Descriptive tasks are ones in which the goal is to derive patterns (i.e., correlations, clusters, trends) that will provide a summary of the underlying relationships in the data. These tasks are usually exploratory and often require some post-processing to confirm as well as explain the results.

There are four tasks that are at the heart of data mining: Predictive Modeling, Cluster Analysis, Association Analysis, and Anomaly Detection (see Figure 4) (Tan, Steinbach, & Kumar, 2006).

Figure 4 Four Core Data Mining Tasks

2.2 Predictive Modeling

Predictive modeling is one of the most popular subfields in data mining. It is the process of using the patterns found in the data set to predict future outcomes. Algorithms used for predictive modeling build a model for the dependent variable as a function of the independent variables (Berson, Smith, & Thearling, 1999). There are two types of predictive modeling tasks: classification and regression. Classification tasks are used when the target or dependent variable is discrete. Regression tasks are used when the target variable is continuous (Tan, Steinbach, & Kumar, 2006).

Predictive modeling problems are comprised of four things: a dependent variable, independent variables, a learning/training data set, and a test data set (Lewis, 2000). The learning/training data set contains values for both the dependent and independent variables and is used to build the model. This model is then applied to the test set for evaluation. The test set is a held-out portion of the learning data. The performance of the model is based on the counts of the test records that are correctly or incorrectly predicted or classified.
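As a minimal illustration of that evaluation step (an example constructed for this overview rather than taken from the dissertation's experiments), the snippet below scores a list of predicted labels against the true labels of a test set; the toy labels are made up.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)

# Toy test set: 4 of 5 predictions are correct -> accuracy 0.8
true_labels = ["play", "play", "do not play", "play", "do not play"]
predicted   = ["play", "do not play", "do not play", "play", "do not play"]
print(accuracy(true_labels, predicted))  # 0.8
```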
There are several predictive modeling techniques. The next sections provide an overview of some of them.

2.2.1 Rule based Classifiers

A rule based classifier is a method used for classification using a set of "if...then" rules. A rule can be expressed as ri: (Condition) → yi. The left hand side of the rule is the antecedent or condition. The right hand side is the consequent. A rule covers an instance if the condition matches the attributes of that instance. An example of classification using a rule based method is depicted below. Figure 5 shows the training data, Figure 6 displays the rule set (model) generated, and Table 2 contains the unlabeled instances for the vertebrate classification problem. The goal of this model is to determine what type of vertebrate each instance will be classified as. There are five possible outcomes: mammals, reptiles, fishes, amphibians, and birds (Tan, Steinbach, & Kumar, 2006). R1 covers the hawk instance: a hawk does not give birth and can fly, and will be classified as a bird. R3 covers the grizzly bear instance: a grizzly bear gives birth and is warm blooded, and will be classified as a mammal.

Figure 5 Vertebrate Training Data (Labeled/Known Outcomes)

Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
human          warm        yes         no       no             mammals
python         cold        no          no       no             reptiles
salmon         cold        no          no       yes            fishes
whale          warm        yes         no       yes            mammals
frog           cold        no          no       sometimes      amphibians
komodo         cold        no          no       no             reptiles
bat            warm        yes         yes      no             mammals
pigeon         warm        no          yes      no             birds
cat            warm        yes         no       no             mammals
leopard shark  cold        yes         no       yes            fishes
turtle         cold        no          no       sometimes      reptiles
penguin        warm        no          no       sometimes      birds
porcupine      warm        yes         no       no             mammals
eel            cold        no          no       yes            fishes
salamander     cold        no          no       sometimes      amphibians
gila monster   cold        no          no       no             reptiles
platypus       warm        no          no       no             mammals
owl            warm        no          yes      no             birds
dolphin        warm        yes         no       yes            mammals
eagle          warm        no          yes      no             birds

Figure 6 Vertebrate Rule Set/Model

Table 2 Vertebrate Unlabeled/Unseen Data

Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
hawk           warm        no          yes      no             ?
grizzly bear   warm        yes         no       no             ?

Rule based classification algorithms tend to fall into two categories: decision trees and covering algorithms (Li, Topor, & Shen, 2002) (Hu & Li, 2005) (Tan, Steinbach, & Kumar, 2006) (Witten & Frank, 2005).

2.2.1.1 Decision Trees

A decision tree is a predictive model in which the results are displayed as a tree type structure (Berson, Smith, & Thearling, 1999). A decision tree consists of a collection of decision nodes, which are connected by branches, descending from the root node until coming to an end at the leaf nodes. Each branch of the tree is a classification question and the leaves are the partitions or segments of the dataset with the classification or decision (Larose, 2005) (Quinlan J. R., 1986).

The first step in the decision tree building process is to "grow" the tree. The goal is to create a tree that works as close to perfect as possible with the data provided. When growing the tree, the main task is to find the best question to ask at each branch or split point in the tree. This task is repeated until there is either only one record in the segment, each of the records in the segment is the same, or there is no significant gain in making a split.
When one of these conditions is met, the tree stops growing (Berson, Smith, & Thearling, 1999). Figure 7 conveys a simple decision tree example (DMS, 2008).

Figure 7 Simple Decision Tree

There are certain requirements that must be adhered to in order to apply a decision tree algorithm. A training data set must be provided. This training data should be varied in nature and provide the algorithm with a robust sampling of the types of records that will need classification in the future. This is very important, as decision trees learn by example. If the algorithm is provided bad data, then its ability to correctly classify future data will greatly diminish. Also, the target variable must be discrete (Larose, 2005).

Decision trees are a part of data mining technology that has existed in some form for decades. Originally, these techniques were created for statisticians to automate the process of determining which attributes were useful or in correlation with the problem they were attempting to solve or understand. They are also particularly adept at handling raw data with little or no pre-processing (Berson, Smith, & Thearling, 1999).

Perhaps the most appealing aspect of decision trees is how easy they are to interpret. This is especially evident when deriving decision rules. Rules can be derived by simply traversing the tree from the root to any leaf node and come in the form of if antecedent, then consequent. The antecedent contains the attribute values from the branches taken by a particular path through the tree. The consequent contains the classification value for the dependent variable given by that particular leaf node (Larose, 2005) (Tan, Steinbach, & Kumar, 2006) (Witten & Frank, 2005). Using the simple decision tree in Figure 7, an example of a decision rule would be: if A = red and B < 4.5 then K = y.

There are several algorithms that are used to produce decision trees. These include ID3, C4.5, Classification and Regression Trees (CART), and Chi-Square Automatic Interaction Detector (CHAID). The next sections describe each.

ID3

Developed in the 1970s, ID3 was one of the first decision tree algorithms. Initially, the algorithm was used for things like learning good game play strategies for chess end games. In chess, the end game is the portion of the game where there are only a few pieces left on the board (Quinlan & Gaetz, 1993) (Chess Endgame, 2008). However, ID3 has been used for a variety of tasks in various domains and has been modified and improved many times. ID3 works by choosing the predictors and their corresponding split values based on the gain in information that the split(s) will make available. Gain is the difference in the amount of information that will be needed to properly make an accurate prediction before and after a split is made (Berson, Smith, & Thearling, 1999).
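To make the notion of gain concrete, the following sketch (an illustration written for this review, not ID3's actual implementation) computes the entropy of a segment before a split and the size-weighted entropy of the resulting segments, and reports the difference as the gain; the helper names and the toy labels are assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Information (in bits) needed to describe the class mix of a segment."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent_labels, child_segments):
    """Entropy before the split minus the size-weighted entropy after it."""
    total = len(parent_labels)
    after = sum(len(child) / total * entropy(child) for child in child_segments)
    return entropy(parent_labels) - after

# Toy example: splitting 10 records on some predictor yields two segments.
parent = ["play"] * 5 + ["do not play"] * 5
children = [["play"] * 4 + ["do not play"], ["play"] + ["do not play"] * 4]
print(information_gain(parent, children))  # gain > 0, so the split is informative
```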
C4.5

C4.5 is an enhancement of the ID3 algorithm. It improved the ID3 algorithm by allowing predictors with missing or continuous values to be used, introducing tree pruning, and enabling rule derivation. The trees produced by this algorithm are also not restricted to binary splits, so they are more variable in shape (Berson, Smith, & Thearling, 1999) (Larose, 2005) (Quinlan J., 1993).

Classification and Regression Trees

Classification and Regression Trees (CART) was developed in 1984 by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone (Breiman, Freidman, Olshen, & Stone, 1984). Many of the C4.5 techniques appear in CART. The trees that are produced by the CART algorithm are strictly binary, meaning that each node in the tree can have only two branches. CART analysis is a form of recursive partitioning. The algorithm recursively partitions the data in the training set into subsets of records with similar values for the target or dependent variable. To grow the tree, CART performs an exhaustive search of available variables to pick the best or most optimal split variable. The fully grown tree that is initially produced will yield the lowest error rate when compared against the training data. However, this model may be too complex and usually results in overfitting. An overfit model is one that too closely follows all of the traits of the training data and is not general enough to represent the overall population (Berson, Smith, & Thearling, 1999) (Larose, 2005) (Lewis, 2000).

CART uses a cross validation approach and pruning to combat overfitting. Cross validation is a computationally intensive method for confirming a procedure for building a model (Lewis, 2000). In cross validation, the training data is divided randomly into N segments, stratified by the variable of interest. One of the segments is set aside for use as an independent test data set. The other N-1 segments are combined for use as the training data set. The entire model building process is repeated N times, using a different segment of data as the test data each time. N different models are produced, each of which can be tested against the independent test data. Based on the results of the cross validation, the initial complex tree is pruned to produce the most optimal general tree. The most complex tree rarely performs best on the test data. By using cross validation, the tree that is most likely to perform well on new data is produced (Lewis, 2000).

Chi-Square Automatic Interaction Detector

Chi-Square Automatic Interaction Detector (CHAID) was published in 1980 by Gordon V. Kass (Kass, 1980). CHAID is very similar to CART but differs in the way splits are determined. The algorithm uses the chi square test to choose the best split. The chi square test is used on contingency tables to determine which categorical predictor is the furthest from independence with the prediction values. All predictors must either be categorical or forced into a categorical form because of CHAID's reliance on contingency tables to shape its test of significance for each predictor (Berson, Smith, & Thearling, 1999) (CHAID).
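As a rough illustration of how a chi square test can rank categorical predictors (a sketch for this discussion, not CHAID's actual implementation), the snippet below scores a predictor by the chi square statistic of its contingency table against the target; it assumes SciPy is available and the counts are made up.

```python
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows are categories of a candidate predictor,
# columns are counts of the two target classes observed for each category.
table = [
    [30, 10],   # category A: 30 of class 1, 10 of class 2
    [12, 28],   # category B
    [20, 20],   # category C
]

chi2, p_value, dof, expected = chi2_contingency(table)
# A larger chi2 (smaller p-value) means the predictor is further from
# independence with the target, so CHAID-style splitting would favor it.
print(chi2, p_value)
```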
2.2.1.2 Covering Algorithms

Rules can be extracted directly from the training data using a sequential covering algorithm. These algorithms learn a set of classification rules one rule at a time and use a sequential covering method to remove instances in the training data that are covered by each new rule found (Wang & Karypis, 2003) (Han & Kamber, 2006). This process repeats until there are no instances left in the training data set. There are many types of covering algorithms, such as AQ15 (Michalski, Mozetic, Hong, & Lavrac, 1986), CN2 (Clark & Niblett, 1989), PRISM (Witten & Frank, 2005), and RIPPER (Cohen, 1995). A basic sequential covering algorithm is depicted in Figure 8.

The algorithm begins with an empty rule list. The best rule for class y that will cover the training records is mined. During this process all the training records that belong to class y are deemed positive examples, while those belonging to other classes are negative. A rule is of interest if it covers the majority of the positive examples and none or very few of the negative examples. The newly found rule is added to the bottom of the rule list and the training instances it covers are removed. This process repeats until some stopping criterion is met, and the algorithm then proceeds to generate rules for the remaining classes (Tan, Steinbach, & Kumar, 2006).

Let E be the training records and A be the set of attribute-value pairs, {(Aj, vj)}.
Let Yo be an ordered set of classes {y1, y2, ..., yk}.
Let R = {} be the initial rule list.
for each class y ∈ Yo − {yk} do
  while stopping condition is not met do
    r ← Learn-One-Rule(E, A, y).
    Remove training records from E that are covered by r.
    Add r to the bottom of the rule list: R ← R ∨ r.
  end while
end for
Insert the default rule, {} → yk, to the bottom of the rule list R.

Figure 8 Basic Sequential Covering Algorithm

There are two commonly used methods to "grow" a rule: general to specific or specific to general. The general to specific strategy starts by creating a rule r: {} → y. The left side of the rule is an empty set and the right side contains the target classification. This initial rule has poor quality, as it covers all of the instances in the training data. New conditions or conjuncts are added to improve the quality of the rule. For example, Figure 9(a) depicts the general to specific rule growing method. The condition (conjunct) Body Temperature = warm-blooded is added initially to the rule. The algorithm then explores all possibilities and greedily chooses the next condition to add. In the specific to general method, one of the positive examples is chosen at random as the seed to grow the rule. There is a refinement step in which the rule is generalized by removing a condition (conjunct) to allow it to cover more positive examples. Figure 9(b) depicts this process (Tan, Steinbach, & Kumar, 2006).

Figure 9 Rule Growing Strategies

Some evaluation metric (i.e., accuracy, coverage, etc.) is used to determine which conjuncts to add or remove to improve rule quality during the process of growing a rule.
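A minimal Python sketch of this sequential covering loop is shown below. It is an interpretation of the Figure 8 pseudocode written for this discussion; the learn_one_rule here uses a deliberately naive greedy search and is an assumption, not the Learn-One-Rule procedure of any particular system.

```python
# Records are dicts of attribute values plus a "class" key.
def covers(rule, record):
    return all(record.get(a) == v for a, v in rule.items())

def accuracy_of(rule, records, target):
    covered = [r for r in records if covers(rule, r)]
    if not covered:
        return 0.0
    return sum(r["class"] == target for r in covered) / len(covered)

def learn_one_rule(records, attributes, target):
    """Greedily add one attribute-value conjunct at a time (illustrative only)."""
    rule = {}
    candidates = [(a, r[a]) for r in records for a in attributes if a in r]
    while True:
        best, best_score = None, accuracy_of(rule, records, target)
        for a, v in candidates:
            if a in rule:
                continue
            trial = dict(rule, **{a: v})
            score = accuracy_of(trial, records, target)
            if score > best_score:
                best, best_score = trial, score
        if best is None:
            return rule
        rule = best

def sequential_covering(records, attributes, ordered_classes):
    """Figure 8: learn rules class by class, removing covered records each time."""
    rule_list, remaining = [], list(records)
    for target in ordered_classes[:-1]:
        while any(r["class"] == target for r in remaining):
            rule = learn_one_rule(remaining, attributes, target)
            rule_list.append((rule, target))
            remaining = [r for r in remaining if not covers(rule, r)]
    rule_list.append(({}, ordered_classes[-1]))  # default rule for the last class
    return rule_list
```

The PRISM algorithm used later in this research (Chapter 4) is a specific covering algorithm; the sketch above only mirrors the control flow of Figure 8, and its greedy rule learner is far simpler than a real Learn-One-Rule procedure.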
2.2.1.3 Ordered and Unordered Rule Sets

Rule based classifier rule sets are either ordered or unordered. The rules within an ordered rule set are arranged in decreasing order of priority based on some metric (i.e., accuracy). When there is a new unlabeled instance, the first rule that matches makes the prediction. An ordered rule set is also known as a decision list. These classifiers usually have a "default class" which is used when no classification can be made (no rule matches) (Tan, Steinbach, & Kumar, 2006) (Hu & Li, 2005). C4.5rules (Quinlan J., 1993) and CBA (Liu, Hsu, & Ma, 1998) both use this method.

Unordered models do not arrange the rules in any type of sequence or order. All of the rules that match are used to make the prediction (i.e., voting). The class that receives the most votes wins and is used as the prediction. The vote for each class may be weighted by the rule's accuracy in some instances. There are both advantages and disadvantages when using an unordered rule base. Unordered rule sets are not as vulnerable to misclassifications as a result of choosing the wrong rule. Building the model is also less expensive, as the rules do not have to be kept in any particular order. However, classifying a new test instance can be expensive, as all of the attributes of that instance must be compared against every rule in the model (Tan, Steinbach, & Kumar, 2006) (Witten & Frank, 2005).

2.2.2 Other Classifiers

Naïve Bayesian Classifiers

Bayesian classifiers are based on the Bayes theorem. This theorem is a simple mathematical formula that is used for calculating conditional probabilities. The naïve Bayes classifier learns the conditional probability for each attribute given a particular class label. For example, if A and B are two random events, then P(A|B) is the conditional probability of A occurring given B. Classification is performed by using the Bayes theorem to calculate the probability of a class given a particular instance. The class with the highest posterior probability is then used as the prediction. Using the previous example, P(A|B) is the posterior probability (Friedman & Goldszmidt, Building classifiers using Bayesian networks, 1996) (Langley & Sage, 1994) (Friedman, Geiger, & Goldszmidt, Bayesian Network Classifiers, 1997).

Neural Networks

An example of an authentic neural network is the human brain. The brain can recognize patterns, make predictions, and learn. Artificial neural networks seek to emulate these capabilities. Neural networks in data mining are essentially computer programs applying pattern recognition and machine learning algorithms to build predictive models (Berson, Smith, & Thearling, 1999).

There are two main structures in a neural network: nodes and links. Nodes are artificial neurons and links are the connections between them. To make a prediction, the neural network accepts the values of the independent variables or predictors at the input nodes. The values of these nodes are then multiplied by values stored in the links. These values are added together at the output node, after which some threshold function is applied and the resulting number is the prediction. Figure 10 (Berson, Smith, & Thearling, 1999) is an example of a simple neural network. In this example Age and Income are the input nodes and Default is the output node, which predicts whether a person will default on a bank loan.

Figure 10 Simple Neural Network

Most neural networks are not as simple as the one depicted above. There is usually a hidden layer of nodes between the input and output nodes. They are deemed "hidden" because their contents are not made known to the end user. It is also possible to have more than one hidden layer, thus making the network very complex.
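The prediction step just described can be written as a few lines of arithmetic. The sketch below is a toy forward pass for a network shaped like Figure 10; the weights, threshold, and input values are made-up assumptions used only to show the mechanics.

```python
def forward(age, income, weights, threshold=0.5):
    """Toy forward pass: weight the inputs, sum them, apply a step threshold."""
    total = age * weights["age"] + income * weights["income"]
    return "default" if total > threshold else "no default"

# Hypothetical link weights learned during training (values are illustrative).
weights = {"age": -0.002, "income": -0.00001}
print(forward(age=30, income=60000, weights=weights))  # -> "no default"
```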
2.3 Association Analysis

Association analysis is useful for finding interesting relationships that are hidden in large data sets. The goal of this type of analysis is to uncover rules (or associations) for quantifying the relationship between two or more attributes (Larose, 2005). These relationships are displayed in the form of an association rule. An association rule is an implication expression of the form X → Y : sup, conf, where X and Y are disjoint itemsets (i.e., X ∩ Y = ∅), sup is the rule's support, and conf is its confidence. The strength or goodness of a rule is measured by its support and confidence. Support defines how many times a rule is pertinent to a particular data set. Confidence defines how often the items in Y appear in instances that contain X. For example, a store may find that out of 100 customers shopping on a Tuesday night, 20 bought diapers, and of the 20 who bought diapers, 5 bought beer. The association rule for this example would be: if buy diapers, then buy beer. This rule would have a support of 5/100 = 5% and a confidence of 5/20 = 25% (diapers → beer: 0.05, 0.25) (Tan, Steinbach, & Kumar, 2006).

Mining association rules from large data repositories involves a two step process: frequent itemset generation and rule generation. First, all of the frequent itemsets must be found. Frequent itemsets are those that satisfy some minimum support threshold. Then, from the frequent itemsets found in the first step, association rules are generated that satisfy the minimum support and confidence conditions. In association analysis an itemset is a collection of zero or more items. Frequent itemset generation is generally more expensive than rule generation. A dataset that has k items could produce up to 2^k - 1 itemsets. Because k can be quite large, the search space of itemsets that needs to be investigated is exponentially large. To enumerate the list of all possible itemsets a lattice structure can be used (Figure 11). The brute force method of finding frequent itemsets involves determining the support count for each candidate itemset in the lattice structure.

There are ways of reducing the computational complexity of frequent itemset generation. One of these is to reduce the number of candidate itemsets using the Apriori principle. The Apriori principle states that if an itemset is frequent, then all of its subsets must also be frequent (Figure 12). On the other hand, if an itemset is infrequent, then all of its supersets must be infrequent too. For example, in Figure 13, because {a, b} is an infrequent itemset the entire subgraph containing the supersets of {a, b} can be pruned. This method of reducing the exponential search space based on the support measure is known as support-based pruning. Apriori was the first rule mining algorithm to use support based pruning (Tan, Steinbach, & Kumar, 2006), employing it to control the exponential growth of candidate itemsets. Apriori uses a bottom up technique in which the frequent itemsets are extended one item at a time. This is known as candidate itemset generation. This process continues until no further successful extensions are found (Larose, 2005).

Figure 11 Lattice Structure

Figure 12 Apriori Principle

Figure 13 Support Based Pruning
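The support and confidence arithmetic from the diapers/beer example, and the subset test implied by the Apriori principle, can be sketched as follows (illustrative code written for this review; the transaction list is fabricated and prunable is a hypothetical helper).

```python
from itertools import combinations

transactions = [
    {"diapers", "beer"}, {"diapers"}, {"beer", "chips"}, {"diapers", "beer"},
    {"milk"}, {"diapers", "chips"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """How often rhs appears in transactions that contain lhs."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"diapers", "beer"}, transactions))        # 2/6
print(confidence({"diapers"}, {"beer"}, transactions))   # 2/4

# Apriori principle: a candidate itemset can be skipped if any of its
# (k-1)-item subsets failed to meet the minimum support threshold.
def prunable(candidate, frequent_smaller):
    """frequent_smaller is a list of the frequent (k-1)-itemsets (as sets)."""
    return any(set(sub) not in frequent_smaller
               for sub in combinations(candidate, len(candidate) - 1))
```

A full Apriori implementation also generates candidates level by level; the helper above shows only the pruning test.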
2.4 Cluster Analysis

Clustering can be defined as "a division of data into groups of similar objects" (Berkhin, 2002). Instances within these groups or clusters are more similar to each other than to instances belonging to other clusters. Clustering and classification differ from one another in that there is no target or dependent variable in clustering. Clustering does not attempt to classify or predict the value of a target variable. These algorithms attempt to segment the whole data set into homogeneous clusters. The more similarity within a cluster and the bigger the difference between clusters, the better the clustering. For the best possible performance, clustering algorithms require that the data be normalized so that no single attribute or variable dominates the analysis (Jain, Murty, & Flynn) (Anderberg, 1973). There are two main types of clustering algorithms: hierarchical and partitional. The next sections describe each.

2.4.1 Hierarchical Clustering

Hierarchical clustering creates a tree of clusters known as a dendrogram. With this type of clustering, the smallest clusters in the tree join together to create the next level of clusters. At this level, the clusters then join together to create the next level of clusters. The top or root of this tree is the cluster that contains all the records. There are two types of hierarchical clustering: agglomerative and divisive (Berkhin, 2002) (Berson, Smith, & Thearling, 1999).

Agglomerative clustering algorithms begin with as many clusters as there are records, with each cluster containing one record. Then the clusters that are closest to one another, based on some distance, are joined together to create the next largest cluster. This process is continued until the hierarchy is built with a single cluster, which contains all records, at the top. Figures 14 and 15 describe the agglomerative clustering process (Cluster Analysis, 2008). Figure 14 contains a set of records that will be clustered. Figure 15 shows the cluster hierarchy after the agglomerative approach has been applied.

Figure 14 Records to be clustered

Figure 15 Cluster Hierarchy

Divisive clustering algorithms apply the opposite approach. These types of algorithms start with all of the records in one cluster and then recursively split the most suitable cluster. This process continues until some stopping criterion is met. How appropriate a cluster is for merging or splitting depends on how similar or dissimilar the items in each cluster are. The distance between individual records has to be generalized to the distance between clusters in order for splitting or merging to occur. This proximity measure is called a linkage metric. The major linkage metrics include Single Linkage, Complete Linkage, and Average Linkage (see Figure 16) (Tan, Steinbach, & Kumar, 2006).

Single linkage (nearest neighbor) is based on the minimum distance between any record in one cluster and any record in another cluster. Cluster similarity is based on the similarity of the most similar members from each cluster. Complete linkage (farthest neighbor) is based on the maximum distance between any record in one cluster and any record in another cluster. Cluster similarity is based on the similarity of the most dissimilar members from each cluster. Average linkage was designed to decrease the dependence of the cluster linkage criteria on extreme values. The criterion here is the average distance of all the records in one cluster from all of the records in another cluster.

Figure 16 Linkage Examples
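A small sketch of the three linkage metrics just described (illustrative code for this review; the clusters and the use of one-dimensional points with absolute difference as the distance are assumptions made to keep the example short):

```python
def single_linkage(c1, c2, dist):
    return min(dist(a, b) for a in c1 for b in c2)   # nearest neighbor

def complete_linkage(c1, c2, dist):
    return max(dist(a, b) for a in c1 for b in c2)   # farthest neighbor

def average_linkage(c1, c2, dist):
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

# Toy one-dimensional clusters with absolute difference as the distance.
c1, c2 = [1.0, 2.0], [6.0, 9.0]
d = lambda a, b: abs(a - b)
print(single_linkage(c1, c2, d), complete_linkage(c1, c2, d), average_linkage(c1, c2, d))
# -> 4.0 8.0 6.0
```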
Partitional clustering algorithms begin with -Means and Expectation Maximization (EM) (Berkhin, 2002) K clusters is represented by the mean of the points within -means algorithm is very simple and (Tan, Steinbach, & Kumar, 2006) Figure 17 K-means Algorithm . - . 29 The first step is to pick K initial centroids. K is the number of clusters wanted and is a user specified parameter. Next each data instance or point is assigned to the centroid it is closest to. This collection of points is a cluster. The centroid of each of these clusters is updated based on the points assigned to the cluster. These steps are repeated until there is no change in the centroids. Instances are assigned to a centroid based on a proximity measure that quantifies closeness for a particular data set. There are several types of proximity measures such as Euclidean distance and Cosine similarity (Berkhin, 2002). Expectation Maximization (EM) The EM technique is very similar to the k-means approach. They are both iterative in nature and start with a random guess. The difference lies in the idea of hard versus soft cluster membership. In k-means a hard membership approach is adopted. This means that an instance belongs to one and only one cluster. In EM a soft membership approach is used, which means that membership of an instance can be spread amongst many clusters. The EM algorithm is a two step process: Expectation and Maximization. The expectation portion is the first step and involves calculating cluster probabilities (the expected class values). The maximization step involves calculating the distribution parameters. This step is the maximization of the possibilities of the distribution given the data. These steps are repeated until a ?log-likelihood convergence is achieved.? (Berkhin, 2002) (Witten & Frank, 2005). 2.5 Anomaly Detection The main objective in anomaly detection is to locate instances that are different from most of the other instances. These abnormal objects are known as outliers and have attribute values that deviate considerably from the norm or expected values. Some areas where this type 30 of analysis is very important are fraud detection, intrusion detection, public health and medicine. Some common causes of anomalies are things such as data from different types, natural variation and data measurement or collection errors (Tan, Steinbach, & Kumar, 2006) There are three fundamental approaches to anomaly detection: supervised, unsupervised, and semi-supervised. The difference in these techniques is the degree to which known outcomes or classifications (class labels) are available for at least some portion of the data (Tan, Steinbach, & Kumar, 2006) The supervised anomaly detection approach requires a training set that contains both normal and abnormal instances. Unsupervised anomaly detection techniques seek to assign a score to each instance that reflects how abnormal that instance is. With semi-supervised anomaly detection techniques the goal is to find an anomaly score (label) for a set or group of instances using information from labeled normal instances. 2.6 Ensemble Methods As one could imagine the accuracy of predictive modeling algorithms is quite important. Ensemble methods seek to improve the accuracy of classifiers (such as decision trees) by combining the predictions of multiple classifiers (Figure 18). These methods construct a set of base classifiers using the training data and make classifications for new instances by voting on the prediction each base classifier makes. 
2.5 Anomaly Detection

The main objective in anomaly detection is to locate instances that are different from most of the other instances. These abnormal objects are known as outliers and have attribute values that deviate considerably from the norm or expected values. Some areas where this type of analysis is very important are fraud detection, intrusion detection, public health, and medicine. Some common causes of anomalies are data from different classes, natural variation, and data measurement or collection errors (Tan, Steinbach, & Kumar, 2006).

There are three fundamental approaches to anomaly detection: supervised, unsupervised, and semi-supervised. The difference in these techniques is the degree to which known outcomes or classifications (class labels) are available for at least some portion of the data (Tan, Steinbach, & Kumar, 2006). The supervised anomaly detection approach requires a training set that contains both normal and abnormal instances. Unsupervised anomaly detection techniques seek to assign a score to each instance that reflects how abnormal that instance is. With semi-supervised anomaly detection techniques the goal is to find an anomaly score (label) for a set or group of instances using information from labeled normal instances.

2.6 Ensemble Methods

As one could imagine, the accuracy of predictive modeling algorithms is quite important. Ensemble methods seek to improve the accuracy of classifiers (such as decision trees) by combining the predictions of multiple classifiers (Figure 18). These methods construct a set of base classifiers using the training data and make classifications for new instances by voting on the predictions the base classifiers make (Tan, Steinbach, & Kumar, 2006). Research has shown that ensembles have a tendency to perform better than the single classifiers within them (Opitz & Macline, 1999) (Quinlan J., 2006) (Freund & Schapire, 1999) (Berk, 2004). Bagging, boosting, and stacking are well known ensemble techniques.

Figure 18 Ensemble Method Process

2.6.1 Bagging

Bagging or bootstrap aggregation trains each base classifier with a random redistribution of the training data (Opitz & Macline, 1999). Each of the training sets for the base classifiers is generated by randomly selecting, with replacement, N examples, where N is the size of the original training set. Since sampling is done with replacement, some of the instances may appear many times within the same training set. Likewise, some instances may be omitted from the training set (see Figure 19).

As previously stated, bagging involves combining multiple models (for example, decision trees). These models can be combined by having each model vote on each new instance. The class that receives the most votes is deemed to be the correct one. When the target variable is numeric, the average is used. The predictions or classifications made by voting become more dependable as more votes are considered. Researchers have determined that bagging is effective on learning algorithms for which small changes in the training data can produce big changes in predictions (Tan, Steinbach, & Kumar, 2006) (Witten & Frank, 2005) (Opitz & Macline, 1999).

2.6.2 Boosting

This ensemble method gets its name from its capacity to take a "weak" learning algorithm and "boost" it into a "strong" learning algorithm (Freund & Schapire, 1999). A weak learning algorithm is one that performs slightly better than random guessing. Like bagging, boosting uses voting to combine the output of multiple models of the same type (for example, decision trees). Boosting, however, is iterative: unlike bagging, in which each base classifier is built separately, each new model is influenced by the performance of the models built before it. New models are pushed to become experts for instances mishandled by earlier models. Boosting weights a model's contribution by its performance, instead of giving equal weight to all models as in bagging (Witten & Frank, 2005). For example, in Figure 19 (Opitz & Macline, 1999), assume instance 1 is an outlier and hard to classify correctly. This instance appears more often in later training sets because boosting will focus more (increase its weight) on correctly classifying it.

Figure 19 Bagging and Boosting Example

2.6.3 Stacking

Stacking or Stacked Generalization differs from bagging and boosting in that it involves combining classifiers of different types (e.g., decision trees and neural networks). It is less commonly used because it can be difficult to analyze and there is no standard or accepted best practice. Stacking eliminates voting by introducing the idea of a metalearner. Stacking attempts to "learn" which base classifiers are the most reliable by using the metalearner to determine how best to combine the output of the base classifiers (Witten & Frank, 2005). The inputs to the metalearner (level 1 model) are the predictions of the base classifiers (level 0 models). Each level 1 instance will have as many attributes as there are level 0 learners. When classifying, the new instance is given to each of the level 0 learners and those results are given to the level 1 learner. The level 1 learner then "learns" the best way to combine the predictions to make the final prediction (Witten & Frank, 2005).
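The bootstrap-and-vote idea behind bagging, described above, can be sketched in a few lines (illustrative code for this review; train_classifier stands in for any base learner, such as a decision tree inducer, and is an assumption rather than a specific library call).

```python
import random
from collections import Counter

def bootstrap_sample(training_data, rng):
    """Sample N records with replacement from a training set of size N."""
    return [rng.choice(training_data) for _ in range(len(training_data))]

def bagging(training_data, train_classifier, n_models=10, seed=0):
    """Train each base classifier on its own bootstrap sample."""
    rng = random.Random(seed)
    return [train_classifier(bootstrap_sample(training_data, rng))
            for _ in range(n_models)]

def predict_by_vote(models, instance):
    """Each base classifier votes; the majority class is the ensemble prediction."""
    votes = Counter(model(instance) for model in models)
    return votes.most_common(1)[0][0]

# Usage (assuming some learner that returns a callable model):
#   models = bagging(train_records, train_classifier=my_tree_learner)
#   predict_by_vote(models, new_record)
```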
2.7 Other Related Information and Work

Discretization

Many of the data mining algorithms used for classification tasks cannot tolerate numeric data. For these algorithms the data must be transformed into a categorical format. This transformation is known as discretization. Discretization has two steps:
• Decide how many categories to create
• Determine how to map the values to the categories

There are many ways to discretize continuous attributes (Dougherty, Kohavi, & Sahami, 1995). One of the simplest ways is to make use of the standard deviation to create categories. For example, suppose attribute A is a continuous (numeric) attribute. First obtain the minimum, maximum, and standard deviation values for A. Let min = 1, max = 5, and stddev = 0.25. The initial category would be 1 - 1.25 (min <= A < (min + stddev)). Any values greater than or equal to 1 and less than 1.25 would be updated to reflect this category in the dataset. The next category would be 1.25 - 1.50 ((min + stddev) <= A < (min + stddev + stddev)). Again, any values that fall within this range would be reflected in the dataset. This process continues until the categories cover the maximum value. (A short code sketch of this binning procedure appears at the end of this discussion.)

Association Based Classification

There have been efforts made to combine association and traditional rule mining techniques. (Liu, Hsu, & Ma, 1998) showed that this integration could be performed efficiently without loss in classification accuracy. The combination focuses on the association rules where the right side contains the target class attribute. These rules are identified as class association rules. Data mining with this associative classification framework has 3 basic steps:
• Discretize any continuous attributes
• Generate all class association rules
• Build a classifier based on the rules generated

Association based algorithms, e.g. Apriori (Agrawal & Srikant, 1994), can be adapted for use in classification. CBA (Liu, Hsu, & Ma, 1998), CMAR (Li, Han, & Pei, 2001), and HARMONY (Wang & Karypis, 2003) are examples of applications that utilize some form of this technique (Hu & Li, 2005).

Data Clustering and Predictive Modeling

For some classification or regression problems it is often advantageous to divide the data into relatively homogenous groups or clusters and build a model for each group. The advantages of building multiple models in this manner are improved accuracy, reliability, and interpretability. (Baumann & Germond, 1993), (Djukanovic, Babic, Sobajic, & Pao, 1993), (Oh & Han, 2001), and (Lokmic & Smith, 2000) all explored this two step method in their research. (Deodhar & Ghosh, 2007) investigated the notion of infusing the data clustering process with the creation of predictive models. Their work focused on problems in which the independent variables could be naturally sectioned into two or more groups and makes use of co-clustering. The authors partitioned the independent variables into groups with respect to their modes, then simultaneously clustered along each mode and built a predictive model for each co-cluster. Co-clustering is a method that simultaneously clusters data along multiple axes (Cheng & Church, 2000) (Cho, Dhillon, Guan, & Sra, 2004). This technique is usually applied to a matrix of data points and takes advantage of the duality between the two axes to improve upon traditional single sided clustering. The rows in the matrix are data points and the columns are features.

The aforementioned approaches are all methods that have successfully combined some aspect of data clustering and predictive modeling/classification. However, none of them address the issue of attempting to make classifications when the data that is used to evaluate the model is not as complete as the data used to build the model.
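The sketch below is the standard-deviation binning example promised earlier in this section, written as a minimal Java method. It is illustrative only and assumes the simple bin-walking scheme described in the text; the class and method names (DiscretizationSketch, discretize) are hypothetical.

    // Minimal sketch of the standard-deviation binning described above.
    public class DiscretizationSketch {
        // Returns a category label such as "1.25-1.50" for a continuous value.
        public static String discretize(double value, double min, double max, double stddev) {
            double lower = min;
            while (lower + stddev < max) {                 // walk bins of width stddev until max is covered
                double upper = lower + stddev;
                if (value >= lower && value < upper)
                    return String.format("%.2f-%.2f", lower, upper);
                lower = upper;
            }
            return String.format("%.2f-%.2f", lower, max); // the last bin closes at the maximum value
        }

        public static void main(String[] args) {
            // Example from the text: min = 1, max = 5, stddev = 0.25.
            System.out.println(discretize(1.1, 1, 5, 0.25));  // prints 1.00-1.25
            System.out.println(discretize(1.3, 1, 5, 0.25));  // prints 1.25-1.50
        }
    }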
Robust Rule Based Classification

(Hu & Li, 2005) and (Li, Topor, & Shen, 2002) focus on building a more robust classifier. The authors proposed using association rules as a way to make rule based classifiers more robust. The idea is that a traditional rule based classifier can be improved by adding a certain level of redundancy to the rule set. This robust classifier contains more rules than the generalized model and combats possible classification error due to missing and/or incomplete data in the new/unseen instances. In terms of missing values, robustness involves the following:
• Ability to tolerate missing values in test data
• Ability to tolerate missing values in training data

Missing data is a well known issue in the field of data mining. There has been research performed in regards to handling missing values in training data, such as the work presented by (Clark & Niblett, 1989), (Mingers, 1989), and (Batista & Monard, 2003). General methods for managing missing values involve pre-process substitution via estimations utilizing an approach such as nearest neighbors. The work presented by (Hu & Li, 2005) and (Li, Topor, & Shen, 2002) differs from these existing methods in that no estimations or substitutions are made for any missing values. This research employs a larger rule set to make the classifier more resilient against missing test data. The authors designed, implemented, and evaluated a practical robust rule based classifier for 0, 1, 2, 3, and 4 missing attributes using an ordered rule set built from association classification rules. The rules were ordered by rule accuracy in descending order. A rule set was generated using the training data and evaluated against test data with added missing values. The missing values were added by randomly omitting some values in the test data. Each test record in the test data had the same number of missing attributes on average. The results of the experiment were positive and showed that their proposed method of adding redundancy via association rules was indeed more resilient against missing attributes in the test data.

The work by (Hu & Li, 2005) is similar to the research presented in this dissertation as both are focused on creating models that are resilient against missing attribute information. However, this research differs in that we present a method to add resiliency via divisive clustering and rule based models created using a sequential covering algorithm. Multiple rule sets are generated, combined, and used as a collective model. The creation of multiple models on different segments of the data allows for the creation of rules that may not have been created using a more traditional approach, which aids in adding resiliency against missing data.

2.8 Data Mining Tools

There are many data mining tools available. This section gives a brief description of some of them.

Microsoft SQL Server

Microsoft SQL Server is "comprehensive, integrated data management and analysis software that enables organizations to reliably manage mission-critical information and confidently run today's increasingly complex business applications" (Microsoft, 2007).
SQL Server 2005 is the platform leader in a variety of areas including business intelligence and database management systems.

Rapid Miner

"Rapid Miner (formerly YALE) is the world-leading open-source system for knowledge discovery and data mining" (Rapid-i, 2008). This tool is not only available as a stand-alone application for data analysis but also as a data mining engine that can be integrated into your own products.

Weka

Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software that is written in Java. The software was developed at the University of Waikato and is free under the GNU General Public License (Witten & Frank, 2005).

ORACLE Data Mining

Oracle Data Mining (ODM) is an option of Oracle Database 10g Enterprise Edition. This tool provides the ability to "produce actionable predictive information and build integrated business intelligence applications" (ORACLE, 2007).

Chapter 3
Research Details

3.1 Conceptual Overview

This section describes the approach taken to test the hypothesis discussed in section 1.2. The objective of this research is to develop a new method for classification problems, in which the test data may not be as complete as the data used to build a model, that is still able to achieve fairly accurate results. The problem is more formally expressed here.

Let A = {a1, a2, …, am} be a set of distinct attributes found in a table, D, where D = {d1, d2, …, dn} indicates the set of all instances within the table. Attributes capture the basic characteristics of an instance. Each attribute ai contains a set of distinct values Vi = {v1, v2, …, vn}. A pattern is a set of one or more attribute-value equality constraints that are conjunctively joined (ai = vij and at = vtk ...). A rule, r, is an implication of the form P → c, where P is a pattern (the condition) and c is the consequence, such that c is a member of A where c = ak and c is not a member of P. A traditional rule based classifier would build a rule set R = {r1, r2, …, rn}. Given a test instance T, a rule r covers T if condition(r) ⊆ T. A rule can make predictions on the instances it covers, r(T) → consequence(r). If consequence(r) is the actual class of T then the rule has made the correct prediction. If it is not the actual class, it is an incorrect prediction. Given a test instance T′ with one or more missing attribute values from A, where V′ (the unknown attribute values) is not a subset of V (the known attribute values), denoted V′ ⊄ V, r cannot make predictions for T′. Therefore T′ is unclassifiable by R.

Using the proposed approach, multiple rule sets would be generated via divisive data clustering. A cluster, clustj (1 ≤ j ≤ k), where clustj ⊆ D, is defined such that each instance is a member of one and only one cluster, Clust = {clust1, clust2, …, clustk}, where k is the designated number of initial clusters. An instance is assigned to a cluster, clustj, based on holistic similarities such that instances within a cluster are more similar than those outside the cluster. A rule set, Rj, is generated for each cluster. The multiple rule sets are combined into a super rule set, R′ = R1 ∪ R2 ∪ … ∪ Rk. R′ is more resilient against missing attribute information as there exists a k for which the attributes in R′ are sufficiently diverse. That is, there exist two or more rule sets Ri and Rj within R′ (1 ≤ i ≤ k, 1 ≤ j ≤ k, i ≠ j) that are not proper subsets of the other rule sets within R′ with respect to their individual attribute sets.
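The following minimal Java sketch illustrates the coverage definitions above: a rule covers a test instance only if every attribute in its condition is present with a matching value, so an instance missing an attribute referenced by every matching rule goes unclassified. The class names and the weather-style attribute values in the example are hypothetical and not drawn from the data sets used later in this work.

    import java.util.*;

    // Illustrative sketch of the rule-coverage definitions above (hypothetical names).
    public class RuleCoverageSketch {
        public static class Rule {
            final Map<String, String> condition;   // pattern: conjunction of attribute=value constraints
            final String consequence;              // predicted class
            Rule(Map<String, String> condition, String consequence) {
                this.condition = condition;
                this.consequence = consequence;
            }
            // r covers T only if every attribute in the condition is present in T with the same value.
            boolean covers(Map<String, String> instance) {
                for (Map.Entry<String, String> c : condition.entrySet()) {
                    String v = instance.get(c.getKey());   // null when the attribute value is missing
                    if (v == null || !v.equals(c.getValue())) return false;
                }
                return true;
            }
        }

        // First-match prediction; returns null when the instance is unclassifiable by the rule set.
        public static String classify(List<Rule> ruleSet, Map<String, String> instance) {
            for (Rule r : ruleSet)
                if (r.covers(instance)) return r.consequence;
            return null;
        }

        public static void main(String[] args) {
            Rule r = new Rule(Map.of("outlook", "sunny", "humidity", "high"), "no");
            Map<String, String> complete = Map.of("outlook", "sunny", "humidity", "high", "windy", "false");
            Map<String, String> missing  = Map.of("outlook", "sunny", "windy", "false"); // humidity missing
            System.out.println(classify(List.of(r), complete)); // no
            System.out.println(classify(List.of(r), missing));  // null -> unclassifiable by this rule
        }
    }

Combining rule sets built on different clusters increases the chance that at least one rule avoids the missing attribute and can still cover the instance, which is the intuition behind the diversity condition on R′.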
Clustering Rule Based Approach

As previously mentioned, the premise of this research is to add robustness and resiliency to rule based classification via divisive clustering. The idea is that there exists a k that will create a model that is more resilient against missing attributes. We present the Clustering Rule based Approach (CRA) for classification. In CRA, the data is first clustered into k clusters using divisive clustering. The clustered data is then used to train individual rule based classifiers using a sequential covering algorithm. The rules are then merged into a combined rule set. Finally, each test instance is evaluated against the combined rule set, which predicts its output. Figure 20 depicts this process. It is important to note that prior to the cluster creation any continuous attributes are discretized.

Figure 20 CRA Flow

Chapter 4
Experiment Design

We conducted an experiment to test the validity of the CRA process. This section describes the design of that experiment.

4.1 Data

The data for this experiment was obtained from the UCI Repository (Blake & Merz, 1998). These data sets varied in size, attributes, and attribute value data types (e.g. continuous, nominal). Table 3 gives a description of each data set used.

Data Set, Number of Records, Number of Attributes, Number of Classifications
Annealing, 898, 38, 5
Thyroid, 3,164, 25, 2
Mushroom, 8,124, 22, 2
Tic-Tac-Toe, 958, 9, 2
Congressional Voting, 435, 16, 2
Table 3 Data Sets

4.2 Materials and Tools

All of the coding used to create and test the CRA process was completed utilizing Java. All data was stored and manipulated using MySQL database tables and Microsoft Excel. The sequential covering algorithm used in this experiment was the PRISM algorithm. The divisive clustering algorithm used is the approach employed by Applications Quest (AppsQuest).

4.2.1 PRISM

PRISM is a simple sequential covering algorithm described by (Witten & Frank, 2005). The algorithm follows the process of the basic covering algorithm presented in chapter 2. The pseudocode for the PRISM algorithm is shown in Figure 21. PRISM generates only correct or perfect rules. The success of a rule is measured by the accuracy formula p/t, where t is all of the instances covered by the new rule, of which p are positive examples of the class. Rules with less than 100% accuracy are incorrect, meaning that cases which do not actually belong to the class in question are assigned to it. PRISM continues to add conjuncts (conditions) to the rule until it is perfect or there are no more attributes available for use.

For each class C
  Initialize E to the instance set
  While E contains instances in class C
    Create a rule R with an empty left-hand side that predicts class C
    Until R is perfect (or there are no more attributes to use) do
      For each attribute A not mentioned in R, and each value v,
        Consider adding the condition A=v to the left-hand side of R
      Select A and v to maximize the accuracy p/t
        (break ties by choosing the condition with the largest p)
      Add A=v to R
    Remove the instances covered by R from E

Figure 21 PRISM Pseudocode
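The following is a minimal, illustrative Java rendering of the covering procedure in Figure 21. It is not the dissertation's implementation: it assumes purely categorical attributes, and the class and record names (PrismSketch, Instance, Rule) are hypothetical.

    import java.util.*;

    // Minimal PRISM sketch following the Figure 21 pseudocode (categorical attributes only).
    public class PrismSketch {
        public record Instance(Map<String, String> values, String label) {}
        public record Rule(Map<String, String> condition, String predictedClass) {}

        public static List<Rule> learnRules(List<Instance> data, Set<String> attributes) {
            List<Rule> rules = new ArrayList<>();
            Set<String> classes = new LinkedHashSet<>();
            for (Instance d : data) classes.add(d.label());

            for (String c : classes) {
                List<Instance> E = new ArrayList<>(data);          // initialize E to the instance set
                while (E.stream().anyMatch(d -> d.label().equals(c))) {
                    Map<String, String> condition = new LinkedHashMap<>();
                    List<Instance> covered = new ArrayList<>(E);
                    // Grow the rule until it is perfect or no attributes remain.
                    while (!isPerfect(covered, c) && condition.size() < attributes.size()) {
                        String bestAttr = null, bestVal = null;
                        double bestAcc = -1; int bestP = -1;
                        for (String a : attributes) {
                            if (condition.containsKey(a)) continue;
                            for (String v : valuesOf(covered, a)) {
                                int t = 0, p = 0;
                                for (Instance d : covered)
                                    if (v.equals(d.values().get(a))) {
                                        t++;
                                        if (d.label().equals(c)) p++;
                                    }
                                double acc = t == 0 ? 0 : (double) p / t;
                                // maximize p/t, break ties by the largest p
                                if (acc > bestAcc || (acc == bestAcc && p > bestP)) {
                                    bestAcc = acc; bestP = p; bestAttr = a; bestVal = v;
                                }
                            }
                        }
                        if (bestAttr == null) break;
                        condition.put(bestAttr, bestVal);
                        final String fa = bestAttr, fv = bestVal;
                        covered.removeIf(d -> !fv.equals(d.values().get(fa)));
                    }
                    rules.add(new Rule(condition, c));
                    E.removeAll(covered);                          // remove the instances covered by R
                }
            }
            return rules;
        }

        private static boolean isPerfect(List<Instance> covered, String c) {
            return !covered.isEmpty() && covered.stream().allMatch(d -> d.label().equals(c));
        }

        private static Set<String> valuesOf(List<Instance> data, String attr) {
            Set<String> vals = new LinkedHashSet<>();
            for (Instance d : data) if (d.values().get(attr) != null) vals.add(d.values().get(attr));
            return vals;
        }
    }

As described above, each grown rule is perfect on the training instances it covers (accuracy p/t = 1) unless the attributes are exhausted first.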
4.2.2 Applications Quest™

Applications Quest™ (AppsQuest) is a data mining tool developed by Juan E. Gilbert to aid institutions, agencies, and businesses in the admissions/hiring process using data clustering (Gilbert J. E., Applications Quest: A Case Study on Holistic Admissions, 2008) (Gilbert J. E., Applications Quest: Computing Diversity, 2006). The goal of the system is to recommend either the most diverse or most similar group of applicants, depending upon the needs of the organization, by comparing applicants in a holistic way.

There are many elements of an application that can be easily compared. Take GPA (grade point average), for example. If one student has a GPA of 3.5 and another has 2.7, the GPA of 3.5 is clearly higher. This comparison is not as straightforward when there are non numeric elements of the application to consider. AppsQuest differs from other divisive clustering applications in its treatment of nominal (non-numeric) attributes. Traditionally, data clustering applications require that the nominal attributes be assigned a value that allows them to be compared (e.g. 0 if the values are different and 1 if the values are the same). City, state, race, and gender are examples of nominal attributes. To provide a more accurate measure to compare nominal attributes, the Nominal Population Metric (NPM) was created.

The NPM process begins by identifying the nominal attributes. For clarification purposes, let's use Race as an example. There are 20 applicants in the pool where Race is an attribute. This attribute has the following unique values: B-Black, W-White, H-Hispanic, A-Asian, and O-Other. Using the AppsQuest approach every application is compared to every other application in its entirety. For 20 applications compared 2 at a time there would be 190 comparisons. This result is obtained using the formula n C r = n!/[(n-r)! r!]. The NPM processes the nominal attributes in the following manner (Gilbert J. E., 2008):

1. Calculate the total number of combinations for all applications using n C r. Continuing the example, the total number of combinations is 190; the combination (application pair) results are placed into a matrix of application pairs. This matrix is called the similarity matrix (see Table 4).

Application 1, Application 2, Difference Measure
1, 2, 20%
1, 3, 15%
1, 4, 75%
Table 4 Similarity Matrix with Difference Measures Example

2. Calculate the number of unique attribute values. For this example, the Race attribute has 5.

3. Calculate the number of combinations for the unique nominal attribute values using n C r. The result is 10 combinations for the Race attribute.

4. For the 10 combinations of the nominal attribute value pairs, compute the coverage percentage within the application similarity matrix. This equates to the number of rows in the application similarity matrix that have a value for the Race pairs W-B, W-H, B-A, etc. These figures are placed into a nominal population matrix (see Table 5).

Race, Coverage
W-B, 30%
W-H, 15%
B-A, 5%
Table 5 Example Nominal Population Matrix

5. The nominal population matrix contains the attribute pair coverage across all comparisons. This is an accurate measure of the impact of the nominal attribute value pairs based on their existence within the data set.

6. If necessary, the Coverage values are adjusted. This adjustment is made when the application is measuring difference rather than similarity. The W-B Race pair had the greatest coverage in the nominal population matrix and is therefore the most common attribute value pair. The B-A value pair is the least common or most novel. The B-A pair should receive more "credit" as it is the most novel. The credit is calculated proportionately in relation to the other attribute value pairs. To accomplish this, the Coverage values are subtracted from 100% or 1.00 (see Table 6). This inverts the Coverage values such that B-A is given more credit compared to W-B relative to their actual coverage within the application similarity matrix.

Race, Coverage
W-B, 70%
W-H, 85%
B-A, 95%
Table 6 Nominal Population Matrix Example

7. The Coverage values in the nominal population matrix are now the Nominal Population Metrics, which can be used in clustering algorithms as a method to accurately compare nominal attributes. The question, "What's the difference between a Black and an Asian along the Race attribute?" can be answered. The NPM value is 95% in this example.

An alternative approach is to use the distribution of the values for each attribute. Using the previous example, Table 7 depicts the distribution of race within the data set (e.g. 70% of the instances have a W for race).

Race, Distribution
B, 10%
W, 70%
A, 15%
H, 5%
Table 7 Example Nominal Population Matrix

Race, Distribution
B, 90%
W, 30%
A, 85%
H, 95%
Table 8 Example Nominal Population Matrix Adjusted

1. Compute the distribution for each nominal attribute (see Table 7).
2. Use the distribution value for each attribute value at the time of comparison. For example, if calculating the difference between B and A along the Race attribute, .1 and .15 would be used, respectively, in the calculation.
3. Compute the nominal difference by identifying the two smallest distribution values. In this example this corresponds to H and B with values .05 and .1 respectively. This identifies the 2 most novel attribute values based on their frequency of occurrence in the data set.
4. Calculate the sum of the 2 most novel attribute values to determine their combined frequency (.05 + .1 = .15). This frequency becomes the baseline for normalization as these are the most novel pair-wise nominal attribute values (for Race in this example).
5. Calculate the normalization by subtracting the combined frequency from 1. In this example, 1 - .15 = .85. This value becomes the normalization factor. All pair-wise comparisons are divided by this value. The H-B comparison yields a difference value of 1 (.85/.85). The W-H comparison yields a difference of .294 ([1 - (.70 + .05)]/.85 = .294). The difference between W-H is .294 vs. 1.00 for H-B.

In general, the NPM covers the use of the distribution of nominal attribute values in efforts to determine the similarity or difference between nominal attribute values.
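The short Java sketch below reproduces the distribution-based nominal difference calculation just described, using the Race distribution from Table 7. It is illustrative only; the class and method names (NominalDifferenceSketch, difference) are hypothetical, not part of AppsQuest.

    import java.util.*;

    // Numerical sketch of the distribution-based nominal difference described above.
    public class NominalDifferenceSketch {
        // Difference between two nominal values given the value distribution for the attribute.
        public static double difference(Map<String, Double> distribution, String v1, String v2) {
            // Identify the two smallest distribution values (the two most novel values).
            List<Double> sorted = new ArrayList<>(distribution.values());
            Collections.sort(sorted);
            double normalization = 1.0 - (sorted.get(0) + sorted.get(1)); // e.g. 1 - (.05 + .10) = .85
            return (1.0 - (distribution.get(v1) + distribution.get(v2))) / normalization;
        }

        public static void main(String[] args) {
            Map<String, Double> race = Map.of("B", 0.10, "W", 0.70, "A", 0.15, "H", 0.05);
            System.out.println(difference(race, "H", "B")); // 1.0, the most novel pair
            System.out.println(difference(race, "W", "H")); // approximately 0.294
        }
    }

The printed values match the worked example in step 5: H-B normalizes to 1.00 and W-H to roughly .294.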
4.3 Procedure

The data was loaded into MySQL database tables. The data was discretized using the standard deviation to create the corresponding categories for each continuous attribute (if any). The data was then split into training and test data sets: 80% of the data was used for training and 20% of the data was used for testing purposes. Proportionate samples for each class were used. The training data was clustered using the AppsQuest divisive clustering algorithm. PRISM was then run on the clustered data. The resulting rule sets were then merged into a collective model. It is important to note that during the merging process any duplicate rules were removed. Contradictory rules were also resolved. Contradictory rules are rules in which the conditions for two or more rules are the same but the resulting classes are different. If this occurred, the rule with the highest coverage was kept and the other rules were removed. We examined the effects of the clustering rule based approach using both ordered and unordered rule sets for 0, 1, 2, 3, and 4 missing attributes. The attributes determined to be "missing" were chosen at random and modified in the test data prior to each iteration of the experiment (Hu & Li, 2005).
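A minimal sketch of the merging step just described, with duplicates dropped and contradictions resolved by coverage, is shown below. The Rule record and method names are illustrative stand-ins, not the dissertation's code.

    import java.util.*;

    // Sketch of the rule-merging step: duplicate rules are dropped and contradictory rules
    // (same condition, different class) are resolved by keeping the rule with the highest coverage.
    public class RuleMergeSketch {
        public record Rule(Map<String, String> condition, String predictedClass,
                           double accuracy, int coverage) {}

        public static List<Rule> merge(List<List<Rule>> clusterRuleSets) {
            // Key rules by their condition; keep one rule per condition.
            Map<Map<String, String>, Rule> byCondition = new LinkedHashMap<>();
            for (List<Rule> ruleSet : clusterRuleSets) {
                for (Rule r : ruleSet) {
                    Rule existing = byCondition.get(r.condition());
                    if (existing == null) {
                        byCondition.put(r.condition(), r);                 // first time this condition is seen
                    } else if (existing.predictedClass().equals(r.predictedClass())) {
                        // duplicate rule: keep the first occurrence
                    } else if (r.coverage() > existing.coverage()) {
                        // contradiction: same condition, different class -> keep the higher coverage rule
                        byCondition.put(r.condition(), r);
                    }
                }
            }
            return new ArrayList<>(byCondition.values());
        }
    }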
The accuracy of each rule was determined using the calculation p/t. In this instance t corresponds to the total number of instances covered in relation to the entire training data set, instead of a particular class, and p is the number of instances that are positive examples of the target classification. A voting scheme was used for the unordered rule set, using the accuracy of each rule that matched the test instance. The class with the highest sum of accuracies was used. If a tie occurred, the rule with the highest coverage was used to make the prediction. In the ordered rule set the rules were arranged in decreasing order of accuracy. The first rule that matched was used to make the prediction. The results reported are based on an average of ten runs per dataset. That is, for each data set the process was executed with 10 individual training and test sets.

4.4 Expected Outcomes

It is expected that for some k, the CRA process will produce a collective model that will be more tolerant of missing attribute information in the new, unseen test instances. The collective model will be larger and have shorter, simpler rules, which often achieve high classification accuracy and aid in adding robustness (Witten & Frank, 2005). The rule length refers to the number of conditions (conjuncts) the rule contains. Longer rules are more specific and shorter rules are more general.

Chapter 5
Research Findings

5.1 Results

The goal of this research is to introduce a new method for adding robustness to rule based classifiers using divisive clustering. The resulting model would be more resilient against missing attribute information in new/unseen data. This section describes the results of the experiment conducted to test the validity of this method.

k values

Given the number of different datasets used in the experiment, there are no guarantees that the best performing k on one dataset will translate to the best performing k on other datasets. As such, the performance of the CRA process is measured against a set of k values: 1, 3, 5, 7, 9, 11, 13, and 15; however, it is possible to build the model using k values across a larger range, i.e. 1, 2, 3, …, N-1, but that approach was not necessary. The results for k value 1 reflect the performance of the model using the traditional method (control group).

Results for Congressional Voting Data

Tables 9 through 13 display the results for the tests performed using the Congressional Voting dataset. The traditional classifier performed very well on the complete data set, as shown in Table 9, achieving an accuracy of 97% and 99% for the unordered and ordered models respectively. The CRA process performed well and was able to match the performance of the traditional unordered classifier with an accuracy of 97%. For 1, 2, 3, and 4 missing attributes there existed a k that produced a collective model that obtained higher classification accuracy for either type of model. These values are highlighted in green. Instances in which the CRA process "tied" with the traditional model are highlighted in yellow. (Note: this highlighting scheme, green for wins and yellow for ties, will be consistent for the remaining results.) Figures 22 and 23 depict a graphical representation of the classification accuracy for each k. As the number of missing attributes increased, the performance of the traditional method (k value 1 for the ordered and unordered models) suffered while the CRA process performed more consistently. Figure 24 depicts the average rule lengths for the models.
The rule lengths were smaller for models built using the CRA method. On average, the rule lengths were 2.81 and 1.84 for the traditional and CRA methods respectively. Additionally, the size of the CRA generated models were larger (contained more rules) than the traditional models. The CRA models contained, on average, approximately 25 rules while the traditional model contained 22. Also, there was a decrease in the number of instances not classified when the CRA method was used. Number Clusters Unordered Accuracy Ordered Accuracy Average Number Not Classified Average Number of Rules Average Number Unique Attributes Average Rule Length 1 .97 .99 1 23.9 33.7 2.83 3 .97 .98 0 24.4 30.2 2.22 5 .95 .95 0 26.8 31.5 2.01 7 .95 .96 0 26.8 30.6 1.94 9 .93 .95 0 26 29 1.77 11 .91 .97 0 24.8 28.3 1.69 13 .92 .95 0 24.8 28.6 1.71 15 .91 .94 0 24.9 28.5 1.68 Table 9 Congressional Voting Results for Complete Data (No Missing Attributes) 51 Number Clusters Unordered Accuracy Ordered Accuracy Average Number Not Classified Average Number of Rules Average Number Unique Attributes Average Rule Length 1 .92 .96 1.75 21.6 31.5 2.78 3 .94 .97 1 22.8 29 2.17 5 .94 .95 0 24.9 29.3 1.98 7 .93 .95 0 24.9 28.6 1.91 9 .91 .91 0 23.3 25.7 1.69 11 .91 .92 0 22.7 25.6 1.63 13 .9 .93 0 22.6 25.4 1.65 15 .92 .93 0 23.9 25.8 1.59 Table 10 Congressional Voting 1 Missing Attribute Number Clusters Unordered Accuracy Ordered Accuracy Average Number Not Classified Average Number of Rules Average Number Unique Attributes Average Rule Length 1 .87 .89 4 21.45 30.18 2.68 3 .89 .91 1.36 23.18 30.45 2.29 5 .9 .92 0 25.18 31.64 2.1 7 .89 .92 0 25.91 20.45 1.96 9 .9 .92 0 25.91 29 1.76 11 .87 .91 0 25.82 28.73 1.73 13 .87 .9 0 24.36 27.36 1.68 15 .89 .91 0 24.91 27.27 1.63 Table 11 Congressional Data Results 2 Missing Attributes Number Clusters Unordered Accuracy Ordered Accuracy Average Number Not Classified Average Number of Rules Average Number Unique Attributes Average Rule Length 1 .85 .86 6.64 21.27 32.09 2.86 3 .92 .92 .82 24.09 29.82 2.05 5 .92 .93 0 25.73 30.18 2.06 7 .91 .91 0 26.27 29.45 1.96 9 .88 .91 0 25.73 28.27 1.83 11 .85 .89 0 24.82 26.36 1.67 13 .87 .91 0 24.91 26.64 1.59 15 .87 .9 0 25 25.91 1.57 Table 12 Congressional Voting Data Results 3 Missing Attributes Number Clusters Unordered Accuracy Ordered Accuracy 1 .77 3 .86 5 .89 7 .89 9 .85 11 .86 13 .87 15 .88 Table 13 Congressional Voting Results 4 Missing Attributes Figure 22 Congressional Voting 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1 3 Ac cu rac y 52 Average Number Not Classified Average Number of Rules Average Number Unique Attributes .79 9.18 24.81 34.18 .88 2 25.82 32.27 .9 0 25.55 29.55 .9 0 27.18 30.64 .89 0 26.45 28.55 .89 0 24.73 27.45 .9 0 24.73 27.36 .9 0 25.36 26.73 Accuracies Unordered Model 5 7 9 11 13 15 k value Average Rule Length 2.9 2.3 1.92 1.91 1.76 1.75 1.7 1.56 0 Missing 1 Missing 2 Missing 3 Missing 4 Missing Figure 23 Congressional Voting Figure 24 Congressional Voting Average Rule Lengths 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1 3 Ac cu rac y 0 0.5 1 1.5 2 2.5 3 1 3 Av era ge R ule Le ng th 53 Accuracies Ordered Model 5 7 9 11 13 15 k value 54 Cluster Rules Accuracy Coverage 1 0 If phys_fee_freeze=`n` and adopt_budget_res=`y` then class=`democrat` 1 173 2 0 If phys_fee_freeze=`n` and crime=`n` then class=`democrat` 1 126 3 0 If syn_corp_cuts=`y` and immigration=`n` and water_proj_cost_shar=`n` and handicaped_infants=`n` and exp_admin_safrica=`n` then class=`democrat` 1 2 4 0 If phys_fee_freeze=`n` and syn_corp_cuts=`y` then 
class=`democrat` 1 94 5 0 If el_salvador_aid=`n` and syn_corp_cuts=`y` and immigration=`n` then class=`democrat` 1 35 6 0 If phys_fee_freeze=`?` and education_spending=`n` then class=`democrat` 1 6 7 0 If exp_admin_safrica=`?` and adopt_budget_res=`y` and aid_nic_contras=`n` then class=`democrat` 1 6 8 0 If education_spending=`n` and exp_admin_safrica=`n` and crime=`y` then class=`democrat` 1 5 9 0 If phys_fee_freeze=`n` and education_spending=`y` then class=`democrat` 1 23 10 0 If superfund_right_to_sue=`?` and syn_corp_cuts=`y` then class=`democrat` 1 7 11 0 If superfund_right_to_sue=`?` and exp_admin_safrica=`y` then class=`democrat` 1 7 12 0 If phys_fee_freeze=`n` and exp_admin_safrica=`?` then class=`democrat` 1 60 13 0 If adopt_budget_res=`?` and mx_missile=`y` then class=`democrat` 1 5 14 0 If syn_corp_cuts=`y` and superfund_right_to_sue=`n` and adopt_budget_res=`n` then class=`democrat` 1 8 15 0 If exp_admin_safrica=`?` and handicaped_infants=`y` and adopt_budget_res=`n` then class=`democrat` 1 3 16 0 If adopt_budget_res=`n` and education_spending=`y` and handicaped_infants=`n` then class=`republican` 0.92391 92 17 0 If relig_grp_schools=`y` and syn_corp_cuts=`n` then class=`republican` 0.75781 128 18 0 If adopt_budget_res=`n` and relig_grp_schools=`n` then class=`republican` 0.71429 14 19 0 If superfund_right_to_sue=`y` and exp_admin_safrica=`y` and syn_corp_cuts=`n` and immigration=`y` and mx_missile=`y` then class=`republican` 0.46154 13 20 0 If crime=`y` and handicaped_infants=`y` and syn_corp_cuts=`y` and relig_grp_schools=`n` then class=`republican` 0.25 4 55 21 0 If el_salvador_aid=`y` and immigration=`n` and mx_missile=`n` and handicaped_infants=`y` and duty_free_exports=`n` and relig_grp_schools=`y` then class=`republican` 0.66667 12 22 0 If el_salvador_aid=`y` and water_proj_cost_shar=`n` and education_spending=`y` and adopt_budget_res=`y` then class=`republican` 0.85714 7 23 0 If duty_free_exports=`?` and handicaped_infants=`n` and exp_admin_safrica=`?` and relig_grp_schools=`y` then class=`republican` 0.75 4 24 0 If handicaped_infants=`n` and relig_grp_schools=`n` and immigration=`y` and water_proj_cost_shar=`n` and duty_free_exports=`y` and exp_admin_safrica=`y` and education_spending=`n` then class=`republican` 0.2 5 25 0 If water_proj_cost_shar=`y` then class=`republican` 0.39264 163 26 0 If exp_admin_safrica=`?` then class=`republican` 0.17284 81 Table 14 Congressional Voting Example Traditional Rule Set Table 14 displays one of the rule sets generated in the traditional manner using the Congressional Voting data. Table 15 displays one of rule sets generated with the CRA process. The rule set in Table 15 is a collective model derived from 7 individual models (7 clusters). 
Cluster Rules Accuracy Coverage 1 0 If water_proj_cost_shar=`n` and duty_free_exports=`n` and handicaped_infants=`n` and exp_admin_safrica=`n` then class=`democrat` 0.125 16 2 0 If phys_fee_freeze=`n` and education_spending=`y` then class=`democrat` 1 23 3 0 If adopt_budget_res=`y` and duty_free_exports=`n` then class=`democrat` 0.85915 71 4 0 If education_spending=`n` and duty_free_exports=`y` then class=`democrat` 0.9619 105 5 0 If superfund_right_to_sue=`n` and anti_sat_test_ban=`n` then class=`democrat` 0.75 20 6 0 If education_spending=`n` and exp_admin_safrica=`?` then class=`democrat` 0.96364 55 7 0 If el_salvador_aid=`?` then class=`democrat` 0.91667 12 8 0 If mx_missile=`y` then class=`democrat` 0.9125 160 9 0 If water_proj_cost_shar=`y` and exp_admin_safrica=`y` and duty_free_exports=`n` then class=`republican` 0.54386 57 10 0 If phys_fee_freeze=`y` and adopt_budget_res=`n` and superfund_right_to_sue=`y` then class=`republican` 0.97143 105 12 0 If phys_fee_freeze=`y` and adopt_budget_res=`n` then 0.95041 121 56 class=`republican` 13 0 If exp_admin_safrica=`y` and anti_sat_test_ban=`n` and education_spending=`y` then class=`republican` 0.86275 51 15 1 If adopt_budget_res=`y` then class=`republican` 0.07576 198 16 2 If crime=`n` then class=`democrat` 0.98485 132 17 2 If duty_free_exports=`y` then class=`democrat` 0.91729 133 18 2 If handicaped_infants=`n` then class=`democrat` 0.42932 191 19 2 If immigration=`n` then class=`democrat` 0.65143 175 20 2 If phys_fee_freeze=`n` and adopt_budget_res=`y` then class=`democrat` 1 173 21 2 If immigration=`y` and duty_free_exports=`n` and superfund_right_to_sue=`n` and mx_missile=`y` and anti_sat_test_ban=`y` and water_proj_cost_shar=`y` and syn_corp_cuts=`y` then class=`republican` 0.33333 3 22 2 If crime=`y` and syn_corp_cuts=`n` and duty_free_exports=`n` and water_proj_cost_shar=`n` then class=`republican` 0.85417 48 23 3 If phys_fee_freeze=`n` then class=`democrat` 0.9898 196 24 3 If crime=`y` then class=`republican` 0.62871 202 26 4 If aid_nic_contras=`y` then class=`democrat` 0.91935 186 27 4 If adopt_budget_res=`?` then class=`democrat` 0.77778 9 28 5 If superfund_right_to_sue=`y` then class=`democrat` 0.36047 172 29 5 If adopt_budget_res=`n` then class=`democrat` 0.17021 141 30 5 If handicaped_infants=`n` and education_spending=`n` and el_salvador_aid=`y` and adopt_budget_res=`y` and mx_missile=`n` and relig_grp_schools=`y` and water_proj_cost_shar=`y` then class=`republican` 0.2 5 31 6 If aid_nic_contras=`n` then class=`democrat` 0.26351 148 32 6 If aid_nic_contras=`y` and el_salvador_aid=`y` then class=`republican` 0.47619 21 33 6 If anti_sat_test_ban=`y` and education_spending=`n` and immigration=`y` then class=`republican` 0.125 72 Table 15 Congressional Voting Example Collective Rule Set Results for Anneal data Tables 16 through 20 contain the results for the performed using the Anneal data set. The traditional classifiers both performed well on the complete data set achieving a 90% and 94% for the unordered and ordered models respectively. The CRA process achieved a higher accuracy 57 than the traditional model for tests with 2, 3 and 4 missing attributes (highlighted in green). Figures 25 and 26 depict a graphical representation of the classification accuracy for each k. There was a decrease in the number of instances not classified when the CRA method was utilized with the occurrence of missing attributes. 
Number Clusters Unordered Accuracy Ordered Accuracy Average Number Not Classified Average Number of Rules Average Number Unique Attributes Average Rule Length 1 .9 .94 0 28 46.4 3.7 3 .82 .9 0 46.4 52.6 2.75 5 .78 .86 0 52.3 53.7 2.44 7 .76 .86 0 51.4 52.9 2.54 9 .76 .85 0 52.8 51.6 2.21 11 .76 .86 0 53.4 52.6 2.17 13 .76 .85 0 54.4 52.7 2.07 15 .76 .85 0 52.5 49.8 1.93 Table 16 Anneal Results Complete Data (No Missing Attributes) Number Clusters Unordered Accuracy Ordered Accuracy Average Number Not Classified Average Number of Rules Average Number Unique Attributes Average Rule Length 1 .87 .91 .6 31 46.3 3.46 3 .83 .9 0 44.6 51.7 2.79 5 .77 .87 0 51.7 55.7 2.58 7 .76 .86 0 52 53 2.38 9 .76 .86 0 50.8 50.2 2.23 11 .76 .84 0 49.4 50.7 2.1 13 .77 .84 0 50.8 49.5 1.9 15 .76 .84 0 51.6 50.2 1.86 Table 17 Anneal Results 1 Missing Attribute Number Clusters Unordered Accuracy Ordered Accuracy Average Number Not Classified Average Number of Rules Average Number Unique Attributes Average Rule Length 1 .85 .87 2.1 33.6 50.7 3.88 3 .8 .88 0 49.3 55.4 3 5 .78 .86 0 55.7 55.7 2.64 7 .78 .86 0 54.4 54.9 2.62 9 .77 .87 0 53.8 51.7 2.36 58 11 .76 .87 0 53.5 50 2.13 13 .76 .85 0 52.7 48.2 1.98 15 .76 .85 0 52.3 47.1 1.81 Table 18 Anneal Results 2 Missing Attributes Number Clusters Unordered Accuracy Ordered Accuracy Average Number Not Classified Average Number of Rules Average Number Unique Attributes Average Rule Length 1 .85 .88 2.2 30.6 47.8 3.49 3 .82 .9 0 44.2 52.8 2.97 5 .78 .87 0 51.7 52.6 2.48 7 .76 .86 0 53.5 54.6 2.41 9 .76 .86 0 53.8 53.4 2.27 11 .76 .84 0 52.5 51.6 2.16 13 .76 .85 0 54.9 51.3 1.98 15 .76 .84 53 48.8 1.9 Table 19 Anneal Results 3 Missing Attributes Number Clusters Unordered Accuracy Ordered Accuracy Average Number Not Classified Average Number of Rules Average Number Unique Attributes Average Rule Length 1 .82 .84 4.8 33.3 48.3 3.61 3 .81 .87 0 47.3 56.7 3.19 5 .78 .84 0 46.6 54.7 2.76 7 .77 .84 0 50.3 51.4 2.41 9 .76 .83 0 52.7 51.5 2.23 11 .76 .83 0 54 51 2.12 13 .76 .83 0 56 51.8 2 15 .76 .83 0 51.6 48.4 1.89 Table 20 Anneal Results 4 Missing Attributes Figure 25 Figure 26 Figure 27 depicts the average set. The CRA method produced shorter rules. methods were 2.32 and 3.63 on average. 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1 3 Ac cu rac y 0.76 0.78 0.8 0.82 0.84 0.86 0.88 0.9 0.92 0.94 0.96 1 3 Ac cu rac y 59 Anneal Accuracies Unordered Model Anneal Accuracies Ordered Model rule length for the models created using the Anneal data The rule lengths for the collective and traditional Also, the models created via the CRA method were 5 7 9 11 13 15 k value 5 7 9 11 13 15 k value 0 Missing 1 Missing 2 Missing 3 Missing 4 Missing 0 Missing 1 Missing 2 Missing 3 Missing 4 Missing generally larger than those created with the traditional method. 52 rules, on average, while the traditional model contained 31. examples of the traditional and collective models created. 
Figure Cluster Rules 1 0 If steel=`S` and carbon=`0 2 0 If squality=`?` and bf=`Y` and steel=`A` then class=`1` 3 0 If exptl=`Y` then class=`1` 4 0 If squality=`?` and family=`?` and carbon=`0 strength=`0-114.95` then class=`2` 5 0 If con=`?` and temper_rolling=`?` and carbon=`0 and family=`?` and squality=`?` then class=`2` 6 0 If width=`810.61- temper_rolling=`?` and con=`?` then class=`2` 7 0 If strength=`574.76 8 0 If formal_ability=`2` and len=`0 enamelability=`?` then class=`3` 9 0 If bl=`Y` then class=`3` 10 0 If bf=`Y` then class=`3` 0 0.5 1 1.5 2 2.5 3 3.5 4 1 3 Av era ge R ule Le ng th 60 The collective model contained Tables 21 and 22 27 Anneal Average Rule Lengths Accuracy -13.72` then class=`1` -13.72` and 0.81481 -13.72` 0.42857 1215.92` and carbon=`0-13.72` and 0.71429 -689.71` then class=`2` -1870.92` and 0.90476 0.85124 0.95192 5 7 9 11 13 15 k value display Coverage 1 4 1 1 1 1 81 14 7 1 8 210 121 104 0 Missing 1 Missing 2 Missing 3 Missing 4 Missing 61 11 0 If con=`S` and thick=`0.24-1.11` then class=`3` 0.91228 285 12 0 If family=`?` and strength=`0-114.95` and width=`810.61-1215.92` and hardness=`0-24.75` then class=`3` 0.86111 36 13 0 If family=`?` and width=`405.31-810.61` and temper_rolling=`?` and strength=`0-114.95` then class=`3` 0.88333 240 14 0 If family=`?` and width=`1215.92-1525` and oil=`?` then class=`3` 0.8253 166 15 0 If steel=`R` and exptl=`?` then class=`3` 0.71362 213 16 0 If squality=`G` and thick=`0.24-1.11` and len=`0- 1870.92` then class=`3` 0.93548 93 17 0 If thick=`1.98-2.85` and blue_bright_varn_clean=`?` then class=`3` 0.80488 41 18 0 If formal_ability=`?` and bt=`?` and oil=`?` and width=`405.31-810.61` and bw_me=`?` then class=`3` 0.85507 69 19 0 If width=`0-405.31` and family=`?` then class=`3` 0.82143 84 20 0 If ferro=`?` and chrom=`?` and hardness=`0-24.75` and phos=`?` and width=`405.31-810.61` and bw_me=`?` then class=`3` 0.84153 183 21 0 If con=`S` then class=`3` 0.84966 439 22 0 If hardness=`49.49-74.24` then class=`3` 1 66 23 0 If formal_ability=`2` then class=`3` 0.86139 303 24 0 If con=`?` and steel=`A` and bt=`?` and width=`405.31- 810.61` and lustre=`?` and oil=`?` and bf=`?` then class=`5` 0.51429 35 25 0 If con=`?` and temper_rolling=`?` and carbon=`0-13.72` and bt=`?` and sfinish=`?` and bore=`0000` and bf=`?` then class=`5` 0.57143 70 26 0 If steel=`?` and squality=`?` and blue_bright_varn_clean=`?` and len=`0-1870.92` then class=`5` 0.92308 13 27 0 If thick=`0.24-1.11` then class=`5` 0.06865 437 28 0 If con=`?` and squality=`G` and bt=`?` and bl=`?` and lustre=`?` and bore=`0000` and bw_me=`?` and bf=`?` then class=`U` 0.31373 51 29 0 If bl=`?` then class=`U` 0.05369 596 Table 21 Example Anneal Traditional Rule Set RuleID Cluster Rules Accuracy Coverage 2 0 If squality=`?` and steel=`R` and con=`S` then class=`2` 1 61 3 0 If sfinish=`P` then class=`2` 1 6 62 4 0 If enamelability=`2` then class=`2` 1 5 5 0 If shape=`COIL` then class=`3` 0.82121 330 6 0 If thick=`0.24-1.11` and family=`?` and width=`405.31-810.61` and enamelability=`?` then class=`3` 0.8984 187 7 0 If family=`?` then class=`3` 0.82736 614 8 0 If con=`S` then class=`3` 0.84966 439 9 0 If thick=`1.11-1.98` and bl=`?` and width=`1215.92-1525` and bf=`?` and enamelability=`?` and strength=`0-114.95` then class=`5` 0.31818 22 10 0 If non_ageing=`N` and thick=`1.11-1.98` and formal_ability=`3` and len=`0-1870.92` and shape=`SHEET` then class=`5` 0.83333 12 11 0 If non_ageing=`N` and bw_me=`B` then class=`5` 0.53333 15 12 0 If steel=`A` and bw_me=`?` then class=`5` 
0.12075 265 13 0 If thick=`0.24-1.11` then class=`5` 0.06865 437 14 0 If bc=`?` then class=`U` 0.04469 716 15 1 If steel=`S` and carbon=`0-13.72` then class=`1` 1 4 16 1 If exptl=`Y` then class=`1` 1 1 17 1 If squality=`?` and carbon=`0-13.72` and family=`?` and width=`0-405.31` then class=`2` 0.75 12 18 1 If steel=`R` and squality=`?` and bt=`?` and shape=`COIL` then class=`2` 0.64286 14 19 1 If con=`?` and carbon=`0-13.72` and bore=`0000` and family=`?` and squality=`?` and bt=`?` and width=`405.31-810.61` then class=`2` 0.18182 11 20 1 If strength=`574.76-689.71` then class=`2` 1 8 21 1 If family=`?` and temper_rolling=`?` and width=`405.31-810.61` and formal_ability=`?` then class=`3` 0.90625 64 22 1 If con=`S` and thick=`0.24-1.11` then class=`3` 0.91228 285 23 1 If thick=`1.98-2.85` and blue_bright_varn_clean=`?` then class=`3` 0.80488 41 24 1 If formal_ability=`2` then class=`3` 0.86139 303 25 1 If strength=`459.81-574.76` then class=`3` 0.61538 13 26 1 If family=`?` and thick=`0.24-1.11` then class=`3` 0.8973 370 27 1 If bt=`Y` and hardness=`0-24.75` then class=`3` 0.76923 39 28 1 If thick=`1.11-1.98` then class=`3` 0.55621 169 29 1 If carbon=`41.15-54.86` then class=`3` 1 13 30 1 If con=`A` then class=`5` 0.46667 30 31 1 If thick=`0.24-1.11` and carbon=`0-13.72` and strength=`0-114.95` and hardness=`0-24.75` then 0.09772 307 63 class=`5` 32 1 If squality=`G` and strength=`0-114.95` then class=`U` 0.14483 145 33 1 If width=`1215.92-1525` then class=`U` 0.03286 213 34 2 If squality=`?` and steel=`A` and bf=`Y` then class=`1` 1 1 35 2 If squality=`?` and family=`?` and carbon=`0-13.72` and bc=`?` and bl=`?` and bore=`0000` and bt=`?` and bw_me=`?` and cbond=`?` and chrom=`?` and corr=`?` and enamelability=`?` and exptl=`?` and ferro=`?` and hardness=`0-24.75` and jurofm=`?` and lustre=`?` and marvi=`?` and non_ageing=`?` and oil=`?` and packing=`?` and phos=`?` and product_type=`C` and len=`0-1870.92` and blue_bright_varn_clean=`?` and sfinish=`?` and strength=`0-114.95` and bf=`Y` and width=`405.31- 810.61` then class=`2` 0.5 2 36 2 If steel=`M` then class=`2` 0.5 16 37 2 If blue_bright_varn_clean=`B` then class=`2` 1 3 38 2 If con=`S` and shape=`SHEET` then class=`2` 0.19244 291 39 2 If width=`0-405.31` and ferro=`?` and hardness=`0- 24.75` then class=`3` 0.75676 74 40 2 If thick=`0.24-1.11` and family=`?` then class=`3` 0.8973 370 41 2 If bt=`?` and formal_ability=`?` and oil=`?` and shape=`SHEET` and sfinish=`?` then class=`3` 0.98485 66 42 2 If thick=`1.98-2.85` and con=`?` then class=`3` 0.88571 35 43 2 If ferro=`?` and hardness=`0-24.75` and phos=`?` and thick=`1.11-1.98` then class=`3` 0.60769 130 44 2 If hardness=`49.49-74.24` then class=`3` 1 66 45 2 If shape=`SHEET` and strength=`0-114.95` and bt=`?` and len=`0-1870.92` and bw_me=`?` and oil=`?` and carbon=`0-13.72` then class=`5` 0.176 125 46 2 If shape=`COIL` and bf=`?` and lustre=`?` and bt=`?` then class=`5` 0.08511 235 47 2 If steel=`A` and bf=`?` and bore=`0000` and lustre=`?` then class=`U` 0.10425 259 Table 22 Example Anneal Collective Rule Set 64 Results for Thyroid Data Tables 23 through 28 display the results of the test performed using the Thyroid data set. The traditional models performed very well achieving accuracies of 97% and 98% for the unordered and ordered models respectively. A collective model was created that was able to match the performance of the traditional model in the presence of 0, 1, 2, and 3 missing attributes (highlighted in yellow). 
The CRA process created a model that achieved a higher accuracy than the traditional method in the presence of 4 missing attributes. Figures 28 and 29 provide a graphical representation of the accuracies achieved for each k. Number Clusters Unordered Accuracy Ordered Accuracy Average Number Not Classified Average Number of Rules Average Number Unique Attributes Average Rule Length 1 .97 .98 .1 52.3 59.7 2.22 3 .96 .98 0 59 64.9 2.13 5 .95 .97 0 51.7 51.4 2.01 7 .95 .97 0 50.3 49 1.78 9 .95 .96 0 48.2 46.1 1.7 11 .95 .96 0 47.2 46 1.72 13 .95 .96 0 47.7 45.7 1.7 15 .95 .96 0 47.9 45.8 1.62 Table 23 Thyroid Results Complete Data (No Missing Attributes) Number Clusters Unordered Accuracy Ordered Accuracy Average Number Not Classified Average Number of Rules Average Number Unique Attributes Average Rule Length 1 .97 .97 1.7 53.4 60.1 2.29 3 .95 .97 .3 58.4 64.2 2.18 5 .95 .97 .1 50.2 49.5 1.92 7 .95 .96 0 47.1 46.2 1.81 9 .95 .96 0 49 46.8 1.71 11 .95 .96 0 47.4 44.5 1.67 13 .95 .96 0 47.4 43.8 1.66 15 .96 .96 0 46.3 42.2 1.63 Table 24 Thyroid Results 1 Missing Attribute 65 Number Clusters Unordered Accuracy Ordered Accuracy Average Number Not Classified Average Number of Rules Average Number Unique Attributes Average Rule Length 1 .96 .97 3.4 57.9 59.8 2.3 3 .96 .97 .8 62.7 61 2.17 5 .95 .97 0 49.3 49.2 2.12 7 .95 .96 0 44.7 44.2 1.72 9 .95 .96 0 44.9 42.1 1.64 11 .95 .96 0 43.7 40.4 1.63 13 .95 .96 0 43.1 40 1.61 15 .95 .96 0 41.4 38.1 1.56 Table 25 Thyroid Results 2 Missing Attributes Number Clusters Unordered Accuracy Ordered Accuracy Average Number Not Classified Average Number of Rules Average Number Unique Attributes Average Rule Length 1 .95 .97 1.9 45.6 52.8 2.34 3 .95 .97 .8 48 54.6 2.26 5 .95 .96 .4 56.4 55.7 2.07 7 .95 .96 .2 52.7 50.6 1.87 9 .95 .96 0 53.6 50.6 1.81 11 .95 .96 0 54.5 49.9 1.77 13 .95 .96 0 52.9 46.6 1.72 15 .95 .96 0 49 43.8 1.65 Table 26 Thyroid Results 3 Missing Attributes Number Clusters Unordered Accuracy Ordered Accuracy Average Number Not Classified Average Number of Rules Average Number Unique Attributes Average Rule Length 1 .94 .95 11.9 55.1 59.3 2.34 3 .95 .97 .5 61.5 63.4 2.39 5 .95 .97 .1 56.1 53.3 1.89 7 .95 .96 0 55 51.2 1.84 9 .95 .96 0 52.2 48.2 1.76 11 .95 .96 0 47.7 46.1 1.74 13 .95 .96 0 46.4 43.2 1.67 15 .95 .96 0 43.3 41.5 1.64 Table 27 Thyroid Results 4 Missing Attributes Figure 28 Figure 29 Figure 30 displays the average rule lengths of the models created using the set. The rules that were generated via the CRA method were shorter. 0.925 0.93 0.935 0.94 0.945 0.95 0.955 0.96 0.965 0.97 0.975 1 3 Ac cu rac y 0.935 0.94 0.945 0.95 0.955 0.96 0.965 0.97 0.975 0.98 0.985 1 3 Ac cu rac y 66 Thyroid Accuracies Unordered Models Thyroid Accuracies Ordered Models The rule lengths were 2.3 5 7 9 11 13 15 k value 5 7 9 11 13 15 k value Thyroid data 0 Missing 1 Missing 2 Missing 3 Missing 4 Missing 0 Missing 1 Missing 2 Missing 3 Missing 4 Missing and 1.8, on average, for the traditional and collective models respectively. The models created by the traditional method were larger. On average, t while the collective models contained 50. this data set via the traditional and CRA methods. 
Figure Cluster Rules 1 0 If TSH=`164` then class=`hypothyroid` 2 0 If FTI=`0-60.23` and TSH=`23.9 2.99` then class=`hypothyroid` 3 0 If FTI=`0-60.23` and age=`58.87 1.59` and tumor=`f` 4 0 If TSH=`150` then class=`hypothyroid` 5 0 If TT4=`5.30` then class=`hypothyroid` 6 0 If TSH=`47.79-71.69` and T3=`0 query_hypothyroid=`f` then class=`hypothyroid` 7 0 If FTI=`0-60.23` and TSH=`23.9 then class=`hypothyroid` 8 0 If FTI=`0-60.23` and age=`78.16 on_thyroxine=`f` then class=`hypothyroid` 9 0 If thyroid_surgery=`t` and age=`1 1.99` then class=`hypothyroid` 0 0.5 1 1.5 2 2.5 1 3 Av era ge Ru le Le ng th 67 he traditional method models contained 52 rules Tables 28 and 29 are examples of models generated for 30 Thyroid Average Rule Lengths Accuracy -47.79` and T3=`1.99- -78.16` and T4U=`1.36- then class=`hypothyroid` -1` and -47.79` and T3=`0-1` -97.45` and -20.29` and T3=`1- 5 7 9 11 13 15 k value Coverage 1 1 1 6 1 4 1 2 1 1 1 13 1 13 1 6 1 3 0 Missing 1 Missing 2 Missing 3 Missing 4 Missing 68 10 0 If query_hypothyroid=`t` and age=`78.16-97.45` and T3=`1.99-2.99` and FTI=`60.23-99` then class=`hypothyroid` 1 2 11 0 If TSH=`235` then class=`hypothyroid` 1 1 12 0 If TSH=`216` then class=`hypothyroid` 1 1 13 0 If FTI=`0-60.23` and age=`58.87-78.16` and TT4=`10- 55.48` and T3=`1-1.99` then class=`hypothyroid` 1 10 14 0 If TSH=`213` then class=`hypothyroid` 1 1 15 0 If T4U=`0.91-1.13` and on_thyroxine=`f` and query_hypothyroid=`t` and TT4=`55.48-99` and T3=`1- 1.99` and age=`58.87-78.16` and sex=`F` then class=`hypothyroid` 0.5 2 16 0 If T4U=`0.91-1.13` and age=`39.58-58.87` and query_hypothyroid=`t` and TT4=`55.48-99` and T3=`1.99-2.99` then class=`hypothyroid` 1 1 17 0 If TSH=`100` and on_thyroxine=`f` then class=`hypothyroid` 1 3 18 0 If TSH=`47.79-71.69` and FTI=`0-60.23` and on_antithyroid_medication=`f` and query_hypothyroid=`f` then class=`hypothyroid` 1 20 19 0 If FTI=`0-60.23` and TSH=`23.9-47.79` and age=`20.29- 39.58` then class=`hypothyroid` 1 7 20 0 If TSH=`288` then class=`hypothyroid` 1 1 21 0 If TSH=`145` then class=`hypothyroid` 1 3 22 0 If TSH=`71.69-95.58` and age=`58.87-78.16` then class=`hypothyroid` 1 1 23 0 If TT4=`6` then class=`hypothyroid` 1 2 24 0 If FTI=`0-60.23` and TSH_measured=`y` and age=`39.58-58.87` and T4U=`0.91-1.13` and TT4=`10- 55.48` and on_antithyroid_medication=`f` then class=`hypothyroid` 0.72727 11 25 0 If thyroid_surgery=`t` and age=`1-20.29` and FTI=`60.23-99` then class=`hypothyroid` 0.66667 3 26 0 If TT4=`230` and age=`20.29-39.58` and query_hypothyroid=`f` then class=`hypothyroid` 1 1 27 0 If TT4=`4` and age=`20.29-39.58` then class=`hypothyroid` 1 2 28 0 If TSH=`95.58-96` then class=`hypothyroid` 1 1 29 0 If FTI=`0-60.23` and TSH_measured=`y` and on_thyroxine=`f` and sick=`f` and sex=`M` and age=`39.58-58.87` and query_hyperthyroid=`f` then class=`hypothyroid` 0.75 8 30 0 If TSH=`109` then class=`hypothyroid` 1 1 31 0 If T4U=`1.13-1.36` and TT4=`55.48-99` and age=`58.87- 78.16` and T3=`1.99-2.99` then class=`hypothyroid` 0.5 2 69 32 0 If TT4=`10-55.48` and T3=`0-1` and TSH=`23.9-47.79` then class=`hypothyroid` 1 12 33 0 If FTI=`0-60.23` and T4U=`0.45-0.68` and sick=`f` then class=`hypothyroid` 1 2 34 0 If age=`?` then class=`hypothyroid` 0.02924 342 35 0 If TT4=`7.50` then class=`hypothyroid` 1 1 36 0 If TSH=`143` then class=`hypothyroid` 1 1 37 0 If FTI=`0-60.23` and T4U=`1.13-1.36` and sex=`M` then class=`hypothyroid` 1 5 38 0 If TT4=`10-55.48` and T3=`0-1` and sex=`F` and T4U=`0.68-0.91` and age=`58.87-78.16` then class=`hypothyroid` 0.66667 3 39 
0 If FTI=`0-60.23` and thyroid_surgery=`t` and age=`20.29-39.58` then class=`hypothyroid` 1 1 40 0 If TT4=`10-55.48` and thyroid_surgery=`t` then class=`hypothyroid` 0.85714 7 41 0 If FTI=`60.23-99` and age=`20.29-39.58` and sex=`M` and T3=`1-1.99` and T4U=`0.91-1.13` then class=`hypothyroid` 0.14286 7 42 0 If goitre=`t` and TT4=`55.48-99` and T3=`?` and age=`39.58-58.87` and TSH=`0-23.9` then class=`hypothyroid` 1 1 43 0 If TSH=`530` then class=`hypothyroid` 1 1 44 0 If FTI=`0-60.23` and thyroid_surgery=`t` then class=`hypothyroid` 0.75 8 45 0 If TSH=`165` then class=`hypothyroid` 1 1 46 0 If query_hypothyroid=`t` and T4U=`0.68-0.91` and T3=`1.99-2.99` and age=`58.87-78.16` then class=`hypothyroid` 1 2 47 0 If TT4=`109` and TSH=`23.9-47.79` then class=`hypothyroid` 1 1 48 0 If TSH=`117` then class=`hypothyroid` 1 1 49 0 If TSH=`71.69-95.58` and age=`1-20.29` then class=`hypothyroid` 1 1 50 0 If T3=`3.99-4.98` then class=`negative` 1 43 51 0 If TSH=`0-23.9` and age=`?` and sex=`M` then class=`negative` 1 78 52 0 If FTI=`121` then class=`negative` 1 30 53 0 If TSH=`0-23.9` then class=`negative` 0.98279 2034 54 0 If TSH=`?` then class=`negative` 1 371 55 0 If T3=`1-1.99` and TSH=`23.9-47.79` and query_hypothyroid=`f` and on_thyroxine=`f` and age=`?` then class=`negative` 1 2 70 56 0 If T3=`1-1.99` and TSH=`23.9-47.79` and query_hypothyroid=`f` and age=`58.87-78.16` and sex=`F` then class=`negative` 0.8 5 57 0 If TSH=`23.9-47.79` and T3=`1-1.99` and query_hypothyroid=`f` and on_thyroxine=`f` and thyroid_surgery=`f` then class=`negative` 0.76923 13 58 0 If query_hyperthyroid=`f` then class=`negative` 0.9503 2334 Table 28 Thyroid Example Traditional Model Cluster Rules Accuracy Coverage 1 0 If TSH=`164` then class=`hypothyroid` 1 1 2 0 If FTI=`0-60.23` and TSH=`23.9-47.79` and T3=`1.99- 2.99` then class=`hypothyroid` 1 6 3 0 If FTI=`0-60.23` and age=`58.87-78.16` and T4U=`1.36- 1.59` and tumor=`f` then class=`hypothyroid` 1 4 4 0 If TSH=`150` then class=`hypothyroid` 1 2 5 0 If TT4=`5.30` then class=`hypothyroid` 1 1 6 0 If TSH=`47.79-71.69` and T3=`0-1` and query_hypothyroid=`f` then class=`hypothyroid` 1 13 7 0 If FTI=`0-60.23` and TSH=`23.9-47.79` and T3=`0-1` then class=`hypothyroid` 1 13 8 0 If FTI=`0-60.23` and age=`78.16-97.45` and on_thyroxine=`f` then class=`hypothyroid` 1 6 9 0 If thyroid_surgery=`t` and age=`1-20.29` and T3=`1- 1.99` then class=`hypothyroid` 1 3 10 0 If query_hypothyroid=`t` and age=`78.16-97.45` and T3=`1.99-2.99` and FTI=`60.23-99` then class=`hypothyroid` 1 2 11 0 If TSH=`235` then class=`hypothyroid` 1 1 12 0 If TSH=`216` then class=`hypothyroid` 1 1 13 0 If FTI=`0-60.23` and age=`58.87-78.16` and TT4=`10- 55.48` and T3=`1-1.99` then class=`hypothyroid` 1 10 14 0 If TSH=`213` then class=`hypothyroid` 1 1 15 0 If T4U=`0.91-1.13` and on_thyroxine=`f` and query_hypothyroid=`t` and TT4=`55.48-99` and T3=`1- 1.99` and age=`58.87-78.16` and sex=`F` then class=`hypothyroid` 0.5 2 16 0 If T4U=`0.91-1.13` and age=`39.58-58.87` and query_hypothyroid=`t` and TT4=`55.48-99` and T3=`1.99-2.99` then class=`hypothyroid` 1 1 17 0 If TSH=`100` and on_thyroxine=`f` then class=`hypothyroid` 1 3 71 18 0 If TSH=`47.79-71.69` and FTI=`0-60.23` and on_antithyroid_medication=`f` and query_hypothyroid=`f` then class=`hypothyroid` 1 20 19 0 If FTI=`0-60.23` and TSH=`23.9-47.79` and age=`20.29- 39.58` then class=`hypothyroid` 1 7 20 0 If TSH=`288` then class=`hypothyroid` 1 1 21 0 If TSH=`145` then class=`hypothyroid` 1 3 22 0 If TSH=`71.69-95.58` and age=`58.87-78.16` then 
class=`hypothyroid` 1 1 23 0 If TT4=`6` then class=`hypothyroid` 1 2 24 0 If FTI=`0-60.23` and TSH_measured=`y` and age=`39.58-58.87` and T4U=`0.91-1.13` and TT4=`10- 55.48` and on_antithyroid_medication=`f` then class=`hypothyroid` 0.72727 11 25 0 If thyroid_surgery=`t` and age=`1-20.29` then class=`hypothyroid` 0.66667 6 26 0 If TT4=`230` and age=`20.29-39.58` and query_hypothyroid=`f` then class=`hypothyroid` 1 1 27 0 If TT4=`4` and age=`20.29-39.58` then class=`hypothyroid` 1 2 28 0 If TSH=`95.58-96` then class=`hypothyroid` 1 1 29 0 If FTI=`0-60.23` and TSH_measured=`y` and on_thyroxine=`f` and sick=`f` and sex=`M` and age=`39.58-58.87` and query_hyperthyroid=`f` then class=`hypothyroid` 0.75 8 30 0 If TSH=`109` then class=`hypothyroid` 1 1 31 0 If T4U=`1.13-1.36` and TT4=`55.48-99` and age=`58.87- 78.16` and T3=`1.99-2.99` then class=`hypothyroid` 0.5 2 32 0 If TT4=`10-55.48` and T3=`0-1` and TSH=`23.9-47.79` then class=`hypothyroid` 1 12 33 0 If FTI=`0-60.23` and T4U=`0.45-0.68` and sick=`f` then class=`hypothyroid` 1 2 34 0 If age=`?` then class=`hypothyroid` 0.02924 342 35 0 If TT4=`7.50` then class=`hypothyroid` 1 1 36 0 If TSH=`143` then class=`hypothyroid` 1 1 37 0 If FTI=`0-60.23` and T4U=`1.13-1.36` and sex=`M` then class=`hypothyroid` 1 5 38 0 If TT4=`10-55.48` and T3=`0-1` and sex=`F` and T4U=`0.68-0.91` and age=`58.87-78.16` then class=`hypothyroid` 0.66667 3 39 0 If FTI=`0-60.23` and thyroid_surgery=`t` and age=`20.29-39.58` then class=`hypothyroid` 1 1 40 0 If TT4=`10-55.48` and thyroid_surgery=`t` then class=`hypothyroid` 0.85714 7 72 41 0 If FTI=`60.23-99` and age=`20.29-39.58` and sex=`M` and T3=`1-1.99` and T4U=`0.91-1.13` then class=`hypothyroid` 0.14286 7 42 0 If goitre=`t` and T3=`?` and TT4=`55.48-99` and age=`39.58-58.87` then class=`hypothyroid` 0.5 2 43 0 If TSH=`530` then class=`hypothyroid` 1 1 44 0 If FTI=`0-60.23` and thyroid_surgery=`t` then class=`hypothyroid` 0.75 8 45 0 If TSH=`165` then class=`hypothyroid` 1 1 46 0 If query_hypothyroid=`t` and T4U=`0.68-0.91` and T3=`1.99-2.99` and age=`58.87-78.16` then class=`hypothyroid` 1 2 47 0 If TT4=`109` and TSH=`23.9-47.79` then class=`hypothyroid` 1 1 48 0 If TSH=`117` then class=`hypothyroid` 1 1 49 0 If TSH=`71.69-95.58` and age=`1-20.29` then class=`hypothyroid` 1 1 50 0 If FTI=`124` then class=`negative` 1 26 51 0 If FTI=`109` then class=`negative` 1 31 52 0 If TSH=`0-23.9` and T4U=`0.68-0.91` and sex=`M` then class=`negative` 1 259 53 0 If T3=`3.99-4.98` then class=`negative` 1 43 54 0 If TSH=`0-23.9` and age=`?` and sex=`M` then class=`negative` 1 78 55 0 If FTI=`121` then class=`negative` 1 30 56 0 If TSH=`0-23.9` then class=`negative` 0.98279 2034 57 0 If TSH=`?` then class=`negative` 1 371 58 0 If T3=`1-1.99` and TSH=`23.9-47.79` and query_hypothyroid=`f` and on_thyroxine=`f` and age=`?` then class=`negative` 1 2 59 0 If T3=`1-1.99` and TSH=`23.9-47.79` and query_hypothyroid=`f` and age=`58.87-78.16` and sex=`F` then class=`negative` 0.8 5 60 0 If TSH=`23.9-47.79` and T3=`1-1.99` and query_hypothyroid=`f` and on_thyroxine=`f` and thyroid_surgery=`f` then class=`negative` 0.76923 13 61 0 If query_hyperthyroid=`f` then class=`negative` 0.9503 2334 62 1 If age=`58.87-78.16` then class=`hypothyroid` 0.06417 748 63 1 If FTI_measured=`y` then class=`negative` 0.9482 2336 64 2 If age=`20.29-39.58` then class=`hypothyroid` 0.03978 553 65 2 If lithium=`f` then class=`negative` 0.95217 2530 Table 29 Thyroid Example Collective Model 73 Results for Mushroom Data Tables 30 through 34 display the results of the 
test performed using the Mushroom data set. The traditional models achieved accuracies of 89% and 100% for the unordered and ordered models respectively. The CRA process generated a collective model that achieved a higher accuracy than the traditional method with the occurrence of 2, 3, and 4 missing attributes. Additionally, the CRA method matched the performance of the traditional method when no attributes were missing. Figures 31 and 32 provide a graphical representation of the accuracies achieved for each k.

Number Clusters | Unordered Accuracy | Ordered Accuracy | Average Number Not Classified | Average Number of Rules | Average Number Unique Attributes | Average Rule Length
1 | .89 | 1 | 0 | 23.5 | 33.4 | 2.06
3 | .89 | .9 | 0 | 28.4 | 32.6 | 1.54
5 | .88 | .91 | 0 | 26.4 | 29.7 | 1.41
7 | .87 | .89 | 0 | 25.8 | 28.2 | 1.36
9 | .85 | .9 | 0 | 25.2 | 27.9 | 1.36
11 | .85 | .91 | 0 | 24.5 | 27.1 | 1.31
13 | .8 | .91 | 0 | 25.2 | 27.7 | 1.31
15 | .76 | .9 | 0 | 25.8 | 28.1 | 1.35
Table 30 Mushroom Results Complete Data (No Missing Attributes)

Number Clusters | Unordered Accuracy | Ordered Accuracy | Average Number Not Classified | Average Number of Rules | Average Number Unique Attributes | Average Rule Length
1 | .89 | .96 | 40.8 | 23 | 33.4 | 2.01
3 | .87 | .92 | .3 | 26.7 | 32.1 | 1.56
5 | .86 | .89 | .1 | 25.2 | 29.3 | 1.45
7 | .82 | .87 | 0 | 25.3 | 28.7 | 1.41
9 | .77 | .87 | 0 | 23.8 | 27.3 | 1.36
11 | .81 | .9 | 0 | 23.3 | 25.5 | 1.31
13 | .81 | .9 | 0 | 23.6 | 26.3 | 1.31
15 | .8 | .9 | 0 | 24.4 | 26.9 | 1.35
Table 31 Mushroom Results 1 Missing Attribute

Number Clusters | Unordered Accuracy | Ordered Accuracy | Average Number Not Classified | Average Number of Rules | Average Number Unique Attributes | Average Rule Length
1 | .86 | .92 | 75.1 | 22.2 | 31.3 | 1.95
3 | .87 | .88 | 2.3 | 28.9 | 36 | 1.73
5 | .8 | .87 | .2 | 25.4 | 30.2 | 1.47
7 | .78 | .87 | 0 | 24.1 | 27 | 1.37
9 | .79 | .86 | 0 | 22.8 | 25 | 1.28
11 | .77 | .88 | 0 | 23.2 | 25.4 | 1.32
13 | .76 | .87 | 0 | 24.8 | 26.7 | 1.32
15 | .76 | .87 | 0 | 24.4 | 27.1 | 1.36
Table 32 Mushroom Results 2 Missing Attributes

Number Clusters | Unordered Accuracy | Ordered Accuracy | Average Number Not Classified | Average Number of Rules | Average Number Unique Attributes | Average Rule Length
1 | .8 | .86 | 157 | 21.3 | 30.2 | 1.92
3 | .82 | .85 | 1.7 | 28 | 34.1 | 1.69
5 | .8 | .88 | .7 | 25.9 | 29.8 | 1.49
7 | .76 | .81 | 0 | 25.6 | 28.7 | 1.38
9 | .76 | .86 | 0 | 24.6 | 27.1 | 1.28
11 | .76 | .86 | .2 | 25.2 | 27.4 | 1.29
13 | .76 | .86 | 0 | 26.5 | 28.4 | 1.31
15 | .77 | .86 | .1 | 25.9 | 27.9 | 1.29
Table 33 Mushroom Results 3 Missing Attributes

Number Clusters | Unordered Accuracy | Ordered Accuracy | Average Number Not Classified | Average Number of Rules | Average Number Unique Attributes | Average Rule Length
1 | .77 | .82 | 201.7 | 24.3 | 35.2 | 2.06
3 | .82 | .87 | 2.4 | 25.1 | 29.9 | 1.52
5 | .8 | .86 | 1 | 28.4 | 32.5 | 1.59
7 | .75 | .87 | .2 | 25.8 | 26.9 | 1.36
9 | .74 | .84 | .3 | 23.6 | 26.8 | 1.36
11 | .75 | .85 | .5 | 22.6 | 25.7 | 1.4
13 | .71 | .85 | .2 | 23.5 | 27 | 1.39
15 | .7 | .85 | .1 | 24.6 | 28 | 1.39
Table 34 Mushroom Results 4 Missing Attributes

Figure 31 Mushroom Accuracies Unordered Models
Figure 32 Mushroom Accuracies Ordered Models
Figure 33 Mushroom Average Rule Lengths

Figure 33 depicts the average rule lengths of the models generated. The collective models created with the CRA process contained shorter rules. On average, the rule lengths were 2 and 1.4 for the traditional and collective models respectively. The models created via the CRA method were generally larger than the traditional models; the collective models contained 25 rules while the traditional model contained 22 on average. Tables 35 and 36 display examples of the traditional and collective models created using this data set.
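Before turning to those example models, note that the accuracies in Tables 30 through 34 were obtained by evaluating each model against test data in which a fixed number of attribute values per instance had been removed. The sketch below is a rough, hypothetical illustration of that setup; the function name, the use of `?` as the missing-value marker, and the random selection of attributes are assumptions rather than the dissertation's actual implementation.

```python
import random

def inject_missing(instances, attributes, n_missing, missing_token="?", seed=0):
    """Return copies of the test instances with n_missing attribute values
    per instance replaced by the missing-value token."""
    rng = random.Random(seed)
    corrupted = []
    for row in instances:
        row = dict(row)  # copy so the original test set is left untouched
        for attr in rng.sample(attributes, n_missing):
            row[attr] = missing_token
        corrupted.append(row)
    return corrupted

# Example: blank out two attribute values in each Mushroom test record.
test = [{"odor": "n", "stalk_shape": "t", "habitat": "u", "ring_number": "o"}]
print(inject_missing(test, ["odor", "stalk_shape", "habitat", "ring_number"], n_missing=2))
```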
RuleID Cluster Rule Accuracy Coverage
1 0 If odor=`a` then class=`e` 1 319
2 0 If odor=`n` and stalk_shape=`t` then class=`e` 1 2020
3 0 If habitat=`w` then class=`e` 1 145
4 0 If odor=`l` then class=`e` 1 317
5 0 If stalk_color_above_ring=`o` then class=`e` 1 143
6 0 If ring_number=`t` and spore_print_color=`w` then class=`e` 1 417
7 0 If odor=`n` and stalk_surface_below_ring=`f` then class=`e` 1 362
8 0 If cap_shape=`s` then class=`e` 1 25
9 0 If odor=`n` and habitat=`u` then class=`e` 1 72
10 0 If cap_color=`c` and gill_size=`n` then class=`e` 1 21
11 0 If gill_spacing=`w` and cap_color=`n` then class=`e` 1 223
12 0 If cap_color=`y` then class=`p` 0.63392 855
13 0 If cap_surface=`s` and gill_spacing=`c` and gill_attachment=`f` and ring_number=`o` then class=`p` 0.8406 1261
14 0 If gill_size=`n` and stalk_shape=`t` then class=`p` 0.9455 1468
15 0 If habitat=`p` and ring_number=`o` then class=`p` 0.91551 864
16 0 If stalk_shape=`e` and population=`y` and habitat=`g` and gill_color=`p` then class=`p` 0.81081 74
17 0 If stalk_shape=`e` and ring_number=`o` and habitat=`g` then class=`p` 0.6118 729
18 0 If habitat=`m` and gill_color=`w` then class=`p` 0.15789 57
19 0 If population=`v` and stalk_color_above_ring=`w` and ring_type=`p` and cap_shape=`x` and ring_number=`o` and gill_attachment=`f` and stalk_surface_above_ring=`s` and stalk_surface_below_ring=`s` and veil_color=`w` and veil_type=`p` and cap_surface=`f` and habitat=`d` and stalk_root=`b` and gill_color=`u` and spore_print_color=`k` then class=`p` 0.41667 12
20 0 If stalk_root=`b` and cap_shape=`x` then class=`p` 0.50772 1554
21 0 If cap_color=`g` and bruises=`f` and ring_number=`o` then class=`p` 0.69004 813
22 0 If population=`v` and stalk_color_below_ring=`w` and stalk_color_above_ring=`w` and bruises=`t` then class=`p` 0.60682 440
23 0 If population=`s` and stalk_surface_below_ring=`s` and stalk_root=`e` then class=`p` 0.40892 269
24 0 If habitat=`d` then class=`p` 0.39825 2521
25 0 If ring_number=`o` and gill_spacing=`w` and stalk_surface_below_ring=`s` and stalk_surface_above_ring=`s` then class=`p` 0.23939 330
Table 35 Mushroom Example Traditional Model

RuleID Cluster Rule Accuracy Coverage
1 0 If odor=`n` and stalk_shape=`t` then class=`e` 1 2020
2 0 If population=`y` then class=`e` 0.62127 1373
3 0 If odor=`n` and cap_color=`g` then class=`e` 1 834
4 0 If habitat=`w` then class=`e` 1 145
5 0 If stalk_color_below_ring=`n` and gill_spacing=`w` then class=`e` 1 39
6 0 If odor=`n` and stalk_surface_below_ring=`k` then class=`e` 1 120
7 0 If cap_shape=`s` then class=`e` 1 25
8 0 If odor=`l` then class=`e` 1 317
9 0 If odor=`n` and cap_shape=`x` then class=`e` 0.99442 1255
10 0 If odor=`a` then class=`e` 1 319
11 0 If ring_type=`f` then class=`e` 1 39
12 0 If ring_number=`t` and spore_print_color=`w` then class=`e` 1 417
13 0 If odor=`n` and habitat=`u` then class=`e` 1 72
14 0 If stalk_surface_below_ring=`k` then class=`p` 0.9346 1835
15 0 If population=`v` and stalk_color_below_ring=`w` and cap_color=`e` then class=`p` 0.82641 409
16 0 If
stalk_color_below_ring=`p` and stalk_color_above_ring=`p` and cap_color=`e` then class=`p` 0.78112 233 17 0 If stalk_color_above_ring=`w` and gill_spacing=`c` then class=`p` 0.50811 2527 18 0 If gill_size=`n` and bruises=`f` and cap_shape=`x` then class=`p` 0.91828 673 19 0 If ring_type=`e` then class=`p` 0.63653 2234 20 0 If gill_color=`w` then class=`p` 0.20329 974 21 1 If ring_type=`p` then class=`e` 0.79388 3168 22 1 If veil_color=`w` then class=`p` 0.49228 6348 23 1 If bruises=`f` then class=`p` 0.69286 3793 24 2 If ring_number=`o` then class=`e` 0.49175 5997 25 2 If spore_print_color=`w` then class=`e` 0.23862 1911 26 2 If bruises=`t` then class=`p` 0.18632 2705 Table 36 Mushroom Example Collective Model 79 Results for Tic-Tac-Toe Data Tables 37 through 41 display the results of the test performed using the Tic-Tac-Toe data set. The traditional models achieved accuracies of 82% and 99% for the unordered and ordered models respectively. The CRA process generated a collective model that achieved a higher accuracy than the traditional method with the occurrence of 1, 2, 3, and 4 missing attributes. Additionally, the number of instances not classified decreased when using the collective models created with the CRA process. Figures 34 and 35 provide a graphical representation of the accuracies achieved for each k while Figure 36 displays the average rule lengths. The rules created using the CRA method were shorter. On average the rule lengths were 3.35 and 2.38 for the traditional and collective models respectively. The CRA models were larger than the traditional models. The collective model contained 97 rules while the traditional model contained 52 rules on average. Tables 42 and 43 provide examples of the traditional and collective models generated. 
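The example models in Tables 35, 36, 42, and 43 list rules of the form "If attribute=value and ... then class=...", each tagged with a cluster identifier, an accuracy, and a coverage count. The sketch below shows one plausible way such a rule list could be applied to a test instance: the first-match loop corresponds to an ordered rule set, while the accuracy-weighted vote corresponds to an unordered one. The data structures, the treatment of a missing value (`?`) as a non-match, and the fallback to a default class are illustrative assumptions, not PRISM's or the CRA's actual implementation.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Rule:
    conditions: dict   # e.g. {"odor": "n", "stalk_shape": "t"}
    label: str         # predicted class
    accuracy: float    # rule accuracy on the training data
    cluster: int = 0   # cluster the rule came from (0 for a traditional model)

def matches(rule, instance):
    # A rule matches only if every tested attribute is present and equal,
    # so a missing value ("?") prevents the rule from firing.
    return all(instance.get(attr) == val for attr, val in rule.conditions.items())

def classify_ordered(rules, instance, default):
    """First-match semantics for an ordered rule list."""
    for rule in rules:
        if matches(rule, instance):
            return rule.label
    return default

def classify_unordered(rules, instance, default):
    """Accuracy-weighted voting over all matching rules."""
    votes = defaultdict(float)
    for rule in rules:
        if matches(rule, instance):
            votes[rule.label] += rule.accuracy
    return max(votes, key=votes.get) if votes else default

# Two rules taken from Table 36; the sample instance has stalk_shape blanked out,
# which keeps the first rule from firing.
rules = [Rule({"odor": "n", "stalk_shape": "t"}, "e", 1.0),
         Rule({"ring_type": "e"}, "p", 0.63653)]
sample = {"odor": "n", "stalk_shape": "?", "ring_type": "e"}
print(classify_ordered(rules, sample, default="e"), classify_unordered(rules, sample, default="e"))
```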
Number Clusters | Unordered Accuracy | Ordered Accuracy | Average Number Not Classified | Average Number of Rules | Average Number Unique Attributes | Average Rule Length
1 | .82 | .99 | .1 | 54.6 | 26.8 | 3.4
3 | .79 | .96 | 0 | 78.9 | 27 | 2.73
5 | .76 | .94 | 0 | 89.7 | 27 | 2.54
7 | .78 | .9 | 0 | 101.2 | 26.9 | 2.42
9 | .76 | .86 | 0 | 107.5 | 27 | 2.3
11 | .75 | .85 | 0 | 106.8 | 27 | 2.27
13 | .75 | .84 | 0 | 104 | 27 | 2.2
15 | .74 | .82 | 0 | 101.9 | 27 | 2.13
Table 37 Tic-Tac-Toe Results Complete Data (No Missing Attributes)

Number Clusters | Unordered Accuracy | Ordered Accuracy | Average Number Not Classified | Average Number of Rules | Average Number Unique Attributes | Average Rule Length
1 | .75 | .84 | 11.5 | 52.1 | 26.6 | 3.33
3 | .75 | .85 | .1 | 80.9 | 26.9 | 2.72
5 | .74 | .83 | 0 | 88.3 | 27 | 2.6
7 | .74 | .81 | 0 | 105 | 27 | 2.42
9 | .74 | .8 | 0 | 106 | 27 | 2.32
11 | .73 | .8 | 0 | 99.9 | 26.9 | 2.23
13 | .71 | .78 | 0 | 98.5 | 26.9 | 2.16
15 | .71 | .78 | 0 | 98.3 | 27 | 2.12
Table 38 Tic-Tac-Toe Results 1 Missing Attribute

Number Clusters | Unordered Accuracy | Ordered Accuracy | Average Number Not Classified | Average Number of Rules | Average Number Unique Attributes | Average Rule Length
1 | .68 | .73 | 21.3 | 49.7 | 26.7 | 3.31
3 | .72 | .79 | .6 | 78.2 | 27 | 2.76
5 | .72 | .78 | 0 | 87.9 | 27 | 2.58
7 | .72 | .77 | 0 | 98.1 | 27 | 2.42
9 | .72 | .75 | 0 | 106.1 | 27 | 2.34
11 | .71 | .75 | 0 | 105.7 | 27 | 2.26
13 | .7 | .74 | 0 | 105.1 | 27 | 2.18
15 | .69 | .73 | 0 | 102.3 | 27 | 2.14
Table 39 Tic-Tac-Toe Results 2 Missing Attributes

Number Clusters | Unordered Accuracy | Ordered Accuracy | Average Number Not Classified | Average Number of Rules | Average Number Unique Attributes | Average Rule Length
1 | .57 | .58 | 46.6 | 49.7 | 26.5 | 3.36
3 | .68 | .72 | 3.7 | 80.9 | 27 | 2.77
5 | .7 | .73 | .5 | 86.8 | 27 | 2.62
7 | .69 | .72 | .2 | 96.6 | 27 | 2.43
9 | .69 | .72 | 0 | 103.9 | 27 | 2.35
11 | .69 | .72 | 0 | 104.2 | 27 | 2.28
13 | .67 | .73 | 0 | 99.7 | 27 | 2.16
15 | .66 | .71 | 0 | 100.8 | 27 | 2.1
Table 40 Tic-Tac-Toe Results 3 Missing Attributes

Number Clusters | Unordered Accuracy | Ordered Accuracy | Average Number Not Classified | Average Number of Rules | Average Number Unique Attributes | Average Rule Length
1 | .48 | .49 | 67.6 | 52.2 | 26.8 | 3.34
3 | .66 | .68 | 8 | 77 | 26.8 | 2.75
5 | .68 | .7 | 2.4 | 88.3 | 27 | 2.57
7 | .68 | .71 | .3 | 102.6 | 27 | 2.42
9 | .69 | .71 | 0 | 109.8 | 27 | 2.32
11 | .68 | .7 | 0 | 109.1 | 27 | 2.28
13 | .67 | .7 | 0 | 102.1 | 26.9 | 2.2
15 | .66 | .7 | 0 | 99.6 | 27 | 2.14
Table 41 Tic-Tac-Toe Results 4 Missing Attributes

Figure 34 Tic-Tac-Toe Accuracies Unordered Models
Figure 35 Tic-Tac-Toe Accuracies Ordered Models
Figure 36 Tic-Tac-Toe Average Rule Lengths

RuleID Cluster Rule Accuracy Coverage
1 0 If bottom_left_square=`o` and top_left_square=`o` and middle_left_square=`o` then class=`negative` 1 32
2 0 If middle_middle_square=`o` and bottom_middle_square=`o` and top_middle_square=`o` then class=`negative` 1 32
3 0 If middle_middle_square=`o` and bottom_left_square=`o` and top_right_square=`o` then class=`negative` 1 38
4 0 If bottom_right_square=`o` and bottom_left_square=`o` and bottom_middle_square=`o` then class=`negative` 1 30
5 0 If top_left_square=`o` and top_right_square=`o` and top_middle_square=`o` then class=`negative` 1 28
6 0 If bottom_right_square=`o` and top_right_square=`o` and middle_right_square=`o` then class=`negative` 1 28
7 0 If bottom_right_square=`o` and bottom_middle_square=`x` and top_right_square=`x` and middle_left_square=`x` and top_middle_square=`o` and bottom_left_square=`o` and top_left_square=`x` then class=`negative` 1 2
8 0 If
bottom_right_square=`o` and bottom_middle_square=`x` and middle_right_square=`x` and top_right_square=`x` and top_middle_square=`o` and top_left_square=`x` then class=`negative` 1 3 9 0 If middle_middle_square=`o` and top_left_square=`o` and bottom_right_square=`o` then class=`negative` 1 39 10 0 If middle_right_square=`o` and bottom_middle_square=`x` and top_right_square=`x` and bottom_right_square=`x` and middle_left_square=`x` then class=`negative` 1 2 11 0 If middle_middle_square=`o` and middle_right_square=`o` and middle_left_square=`o` then class=`negative` 1 27 12 0 If top_right_square=`o` and middle_left_square=`o` and bottom_right_square=`o` and bottom_left_square=`x` then class=`negative` 1 5 13 0 If top_right_square=`o` and bottom_middle_square=`o` and middle_middle_square=`o` and bottom_right_square=`x` then class=`negative` 1 7 14 0 If top_right_square=`o` and top_left_square=`o` and bottom_middle_square=`o` and middle_left_square=`o` then class=`negative` 1 1 15 0 If middle_middle_square=`b` and middle_right_square=`o` then class=`positive` 0.8 50 16 0 If middle_middle_square=`x` and bottom_left_square=`x` and bottom_middle_square=`o` then class=`positive` 0.93103 58 84 17 0 If middle_middle_square=`x` and bottom_left_square=`b` then class=`positive` 0.84146 82 18 0 If bottom_middle_square=`b` and top_middle_square=`o` then class=`positive` 0.82432 74 19 0 If bottom_right_square=`x` and bottom_middle_square=`x` and top_middle_square=`o` and middle_middle_square=`o` then class=`positive` 0.80952 21 20 0 If middle_middle_square=`x` and bottom_left_square=`x` and top_left_square=`o` and bottom_right_square=`o` then class=`positive` 0.95652 23 21 0 If bottom_left_square=`x` and top_right_square=`o` and middle_left_square=`x` and bottom_right_square=`b` and middle_middle_square=`o` then class=`positive` 0.875 8 22 0 If middle_middle_square=`b` and bottom_left_square=`x` and middle_right_square=`b` then class=`positive` 0.89474 19 23 0 If middle_middle_square=`x` and bottom_right_square=`b` and top_right_square=`o` then class=`positive` 0.74074 27 24 0 If bottom_right_square=`x` and middle_middle_square=`b` and bottom_left_square=`b` then class=`positive` 0.76923 13 25 0 If bottom_right_square=`x` and middle_middle_square=`b` and middle_right_square=`x` then class=`positive` 0.84375 32 26 0 If middle_middle_square=`x` and top_left_square=`b` and middle_left_square=`o` then class=`positive` 0.96154 26 27 0 If middle_middle_square=`x` and top_right_square=`b` and middle_left_square=`x` then class=`positive` 0.91667 36 28 0 If top_middle_square=`x` and bottom_middle_square=`o` then class=`positive` 0.6383 94 29 0 If middle_middle_square=`x` and middle_left_square=`b` and top_right_square=`b` then class=`positive` 0.84211 19 30 0 If bottom_right_square=`x` and middle_left_square=`b` and bottom_middle_square=`x` then class=`positive` 0.79487 39 31 0 If bottom_right_square=`x` and top_right_square=`o` and bottom_middle_square=`x` and top_middle_square=`x` and top_left_square=`o` then class=`positive` 0.77778 9 32 0 If top_left_square=`x` and middle_middle_square=`b` then class=`positive` 0.82609 69 33 0 If middle_left_square=`b` and bottom_left_square=`o` then class=`positive` 0.67123 73 85 34 0 If middle_middle_square=`x` and bottom_left_square=`x` then class=`positive` 0.85606 132 35 0 If middle_middle_square=`x` and top_right_square=`o` and middle_left_square=`x` and bottom_right_square=`x` then class=`positive` 0.84211 19 36 0 If bottom_middle_square=`b` then class=`positive` 0.69697 
198 37 0 If top_middle_square=`b` and bottom_right_square=`x` and middle_right_square=`x` and middle_middle_square=`o` and top_left_square=`o` then class=`positive` 0.88889 9 38 0 If top_right_square=`o` and bottom_left_square=`x` then class=`positive` 0.64655 116 39 0 If middle_middle_square=`x` and top_left_square=`x` then class=`positive` 0.82353 136 40 0 If top_left_square=`x` and bottom_right_square=`o` and top_middle_square=`x` then class=`positive` 0.64407 59 41 0 If top_left_square=`b` and bottom_left_square=`o` and bottom_right_square=`x` then class=`positive` 0.8 25 42 0 If middle_left_square=`x` and bottom_middle_square=`x` and top_left_square=`b` then class=`positive` 0.57895 19 43 0 If top_middle_square=`b` and bottom_right_square=`x` and middle_middle_square=`o` and middle_right_square=`x` and middle_left_square=`o` then class=`positive` 0.88889 9 44 0 If top_middle_square=`b` and bottom_middle_square=`o` then class=`positive` 0.85366 82 45 0 If middle_right_square=`o` and middle_left_square=`x` and top_left_square=`x` and bottom_middle_square=`x` then class=`positive` 0.52941 17 46 0 If bottom_middle_square=`x` and top_left_square=`o` and top_right_square=`x` and bottom_left_square=`o` and bottom_right_square=`o` and top_middle_square=`x` then class=`positive` 0.33333 3 47 0 If middle_right_square=`b` and top_right_square=`b` and middle_left_square=`o` then class=`positive` 0.83333 18 48 0 If middle_middle_square=`x` then class=`positive` 0.79109 359 49 0 If bottom_right_square=`x` and top_left_square=`o` and top_right_square=`b` and bottom_middle_square=`x` then class=`positive` 0.81818 22 50 0 If bottom_middle_square=`x` and middle_middle_square=`o` and middle_left_square=`o` and top_left_square=`o` then class=`positive` 0.42857 14 86 51 0 If bottom_middle_square=`x` then class=`positive` 0.58863 299 Table 42 Tic-Tac-Toe Example Traditional Model RuleID Cluster Rules Accuracy Coverage 1 0 If top_right_square=`o` and bottom_right_square=`o` and bottom_left_square=`b` then class=`negative` 0.57895 19 2 0 If bottom_middle_square=`o` then class=`positive` 0.69259 270 3 0 If middle_right_square=`o` then class=`positive` 0.70722 263 5 1 If bottom_right_square=`o` and bottom_middle_square=`o` then class=`negative` 0.45977 87 6 1 If middle_middle_square=`o` and bottom_middle_square=`o` then class=`negative` 0.5974 77 7 1 If bottom_right_square=`o` and top_left_square=`x` and bottom_middle_square=`x` then class=`negative` 0.5 36 8 1 If top_left_square=`o` and top_middle_square=`o` and top_right_square=`o` then class=`negative` 1 28 9 1 If middle_middle_square=`x` then class=`positive` 0.79109 359 10 1 If bottom_right_square=`x` then class=`positive` 0.71554 341 11 2 If middle_left_square=`o` and bottom_right_square=`b` then class=`negative` 0.38182 55 12 2 If top_middle_square=`x` and middle_middle_square=`o` then class=`negative` 0.6036 111 13 2 If bottom_middle_square=`x` and top_middle_square=`b` and bottom_left_square=`o` and bottom_right_square=`b` then class=`negative` 0.85714 7 14 2 If bottom_left_square=`b` and bottom_middle_square=`x` then class=`negative` 0.51724 58 15 2 If middle_middle_square=`b` and bottom_right_square=`o` then class=`negative` 0.44444 45 16 2 If bottom_right_square=`o` and top_right_square=`x` then class=`negative` 0.37879 132 17 2 If top_left_square=`x` and top_right_square=`x` then class=`negative` 0.23256 129 18 2 If top_left_square=`x` and bottom_left_square=`o` and middle_middle_square=`o` then class=`negative` 0.71111 45 19 2 If 
bottom_right_square=`o` and middle_right_square=`o` then class=`negative` 0.45783 83 20 2 If middle_right_square=`b` and top_left_square=`o` then class=`negative` 0.51471 68 21 2 If top_middle_square=`o` and 0.62963 135 87 middle_right_square=`x` then class=`positive` 22 2 If middle_right_square=`x` and top_middle_square=`b` and top_right_square=`x` and bottom_left_square=`o` and middle_middle_square=`o` then class=`positive` 0.83333 6 23 2 If bottom_right_square=`x` and top_left_square=`o` and middle_left_square=`x` then class=`positive` 0.52778 36 24 2 If top_left_square=`b` and middle_left_square=`b` and bottom_right_square=`x` then class=`positive` 0.80952 21 25 2 If middle_middle_square=`x` and bottom_left_square=`o` then class=`positive` 0.70345 145 26 2 If bottom_right_square=`b` and top_right_square=`o` then class=`positive` 0.5614 57 27 2 If bottom_middle_square=`b` and bottom_right_square=`o` then class=`positive` 0.72603 73 28 2 If bottom_left_square=`x` and top_middle_square=`o` then class=`positive` 0.72078 154 29 2 If bottom_left_square=`x` and top_middle_square=`b` and middle_middle_square=`o` then class=`positive` 0.70455 44 30 3 If middle_right_square=`o` and top_right_square=`o` and bottom_right_square=`o` then class=`negative` 1 28 31 3 If middle_left_square=`o` and bottom_left_square=`x` and bottom_middle_square=`x` then class=`negative` 0.24561 57 32 3 If middle_middle_square=`b` then class=`negative` 0.28467 137 33 3 If bottom_middle_square=`o` and bottom_left_square=`o` then class=`negative` 0.45349 86 34 3 If middle_middle_square=`o` and top_left_square=`o` and bottom_right_square=`o` then class=`negative` 1 39 35 3 If bottom_right_square=`x` and middle_left_square=`o` then class=`negative` 0.26923 156 36 3 If bottom_middle_square=`b` then class=`positive` 0.69697 198 37 3 If middle_right_square=`b` and bottom_left_square=`b` then class=`positive` 0.625 32 38 3 If top_right_square=`b` and bottom_middle_square=`x` then class=`positive` 0.73418 79 39 3 If middle_middle_square=`x` and middle_left_square=`b` then class=`positive` 0.80208 96 40 3 If middle_right_square=`b` and top_left_square=`x` then class=`positive` 0.79 100 41 3 If top_right_square=`o` and 0.34568 81 88 middle_middle_square=`o` then class=`positive` 42 3 If bottom_left_square=`x` then class=`positive` 0.727 337 43 4 If middle_left_square=`x` and middle_right_square=`b` then class=`negative` 0.49333 75 44 4 If top_middle_square=`o` and bottom_right_square=`b` then class=`negative` 0.43103 58 45 4 If top_left_square=`o` and middle_left_square=`o` and bottom_left_square=`o` then class=`negative` 1 32 46 4 If top_middle_square=`o` and bottom_left_square=`b` then class=`negative` 0.40984 61 47 4 If middle_left_square=`x` and top_left_square=`b` then class=`negative` 0.44776 67 48 4 If top_left_square=`x` then class=`positive` 0.70088 341 49 4 If bottom_middle_square=`x` then class=`positive` 0.58863 299 50 4 If top_middle_square=`x` then class=`positive` 0.5914 279 51 5 If bottom_left_square=`o` and middle_middle_square=`b` and bottom_right_square=`o` then class=`negative` 0.75 16 52 5 If bottom_left_square=`o` and middle_right_square=`o` and bottom_right_square=`o` then class=`negative` 0.5 14 53 5 If bottom_left_square=`b` and middle_right_square=`o` then class=`negative` 0.41818 55 54 5 If bottom_left_square=`o` and top_right_square=`o` and bottom_right_square=`o` then class=`negative` 0.6 15 55 5 If bottom_left_square=`b` and top_middle_square=`o` and middle_middle_square=`o` then class=`negative` 
0.77778 18 56 5 If bottom_left_square=`o` and middle_right_square=`b` and bottom_right_square=`o` and bottom_middle_square=`o` then class=`negative` 1 11 57 5 If middle_middle_square=`o` then class=`positive` 0.43911 271 58 5 If middle_right_square=`b` then class=`positive` 0.66834 199 59 5 If top_right_square=`b` then class=`positive` 0.69143 175 60 5 If top_middle_square=`o` then class=`positive` 0.68132 273 61 5 If middle_left_square=`x` and middle_middle_square=`b` then class=`positive` 0.61404 57 62 5 If middle_right_square=`x` then class=`positive` 0.59672 305 63 6 If bottom_left_square=`o` and middle_left_square=`o` then class=`negative` 0.45783 83 64 6 If middle_middle_square=`o` and bottom_right_square=`b` and bottom_middle_square=`o` then class=`negative` 0.5 20 89 65 6 If top_right_square=`o` and bottom_middle_square=`b` and bottom_right_square=`x` then class=`negative` 0.45946 37 66 6 If top_right_square=`o` and middle_left_square=`b` and top_middle_square=`o` then class=`negative` 0.58824 17 67 6 If middle_middle_square=`o` and bottom_middle_square=`o` and top_middle_square=`o` then class=`negative` 1 32 68 6 If top_middle_square=`x` and bottom_left_square=`o` then class=`negative` 0.46715 137 69 6 If top_right_square=`o` and middle_right_square=`o` then class=`negative` 0.45122 82 70 6 If middle_right_square=`o` and bottom_left_square=`o` then class=`negative` 0.3 60 71 6 If top_middle_square=`x` and top_right_square=`b` then class=`negative` 0.45161 62 72 6 If top_right_square=`o` and top_left_square=`o` and top_middle_square=`o` then class=`negative` 1 28 73 6 If top_left_square=`x` and top_middle_square=`b` then class=`negative` 0.33333 93 74 6 If bottom_right_square=`o` and middle_middle_square=`o` then class=`negative` 0.66667 81 75 6 If top_middle_square=`x` and middle_right_square=`o` then class=`negative` 0.34454 119 76 6 If bottom_left_square=`o` and middle_middle_square=`o` then class=`negative` 0.7013 77 77 6 If bottom_left_square=`o` and bottom_right_square=`o` then class=`negative` 0.5974 77 78 6 If top_middle_square=`x` and bottom_right_square=`x` and middle_left_square=`x` then class=`negative` 0.66667 12 79 6 If top_right_square=`x` and top_left_square=`o` and bottom_left_square=`x` then class=`positive` 0.79032 62 80 6 If middle_middle_square=`b` and bottom_right_square=`x` then class=`positive` 0.83582 67 81 6 If middle_middle_square=`x` and bottom_left_square=`x` and middle_left_square=`o` then class=`positive` 0.8871 62 82 6 If top_right_square=`b` and top_left_square=`o` then class=`positive` 0.62121 66 83 6 If middle_middle_square=`x` and middle_left_square=`x` then class=`positive` 0.784 125 84 6 If top_left_square=`b` and middle_middle_square=`x` and 0.81818 22 90 middle_left_square=`b` then class=`positive` 85 6 If bottom_right_square=`x` and middle_left_square=`o` and middle_right_square=`x` then class=`positive` 0.77419 62 86 6 If bottom_right_square=`x` and bottom_left_square=`b` and middle_right_square=`x` then class=`positive` 0.79487 39 87 6 If middle_right_square=`b` and bottom_right_square=`x` then class=`positive` 0.65476 84 88 6 If middle_middle_square=`b` and bottom_left_square=`x` and middle_right_square=`x` then class=`positive` 0.76923 13 89 6 If middle_left_square=`b` and middle_right_square=`x` then class=`positive` 0.5 70 90 6 If top_right_square=`o` then class=`positive` 0.5625 256 91 7 If bottom_middle_square=`b` and middle_right_square=`o` and top_middle_square=`x` then class=`negative` 0.42424 33 92 7 If bottom_left_square=`o` 
then class=`negative` 0.44403 268
93 7 If middle_middle_square=`x` and middle_right_square=`o` and bottom_right_square=`o` then class=`negative` 0.42 50
94 7 If middle_middle_square=`x` and bottom_right_square=`x` then class=`negative` 0.15 140
95 7 If middle_middle_square=`o` and bottom_right_square=`o` then class=`negative` 0.66667 81
96 7 If top_left_square=`o` and bottom_left_square=`x` and bottom_right_square=`x` then class=`positive` 0.83636 55
97 7 If bottom_right_square=`o` and middle_left_square=`x` then class=`positive` 0.5473 148
98 8 If bottom_left_square=`o` and top_right_square=`b` and top_left_square=`o` then class=`negative` 0.52381 21
99 8 If top_right_square=`o` and bottom_middle_square=`b` then class=`negative` 0.48611 72
100 8 If bottom_left_square=`o` and top_left_square=`o` and middle_left_square=`o` then class=`negative` 1 32
101 8 If top_left_square=`x` and middle_middle_square=`o` then class=`negative` 0.48529 136
102 8 If bottom_middle_square=`o` and top_right_square=`o` then class=`negative` 0.2963 54
103 8 If bottom_left_square=`b` then class=`positive` 0.66049 162
104 8 If top_right_square=`x` and middle_right_square=`x` then class=`positive` 0.69466 131
105 8 If middle_right_square=`b` and bottom_middle_square=`b` then class=`positive` 0.73913 46
Table 43 Tic-Tac-Toe Example Collective Model

While the collective models generated using the CRA process were not able to achieve higher accuracies than the traditional models in all cases, they were less affected by the introduction of missing attributes. As the number of missing attributes increased, the performance of the traditional method (ordered and unordered models) suffered the most. Take the Anneal data set for example. The highest accuracies achieved by the traditional and collective models (ordered) were 94% and 90% respectively under the best conditions (0 missing attributes). These models achieved 84% and 87% accuracy under the worst conditions (4 missing attributes). This equates to a 10 percentage point drop in accuracy for the traditional model compared to only a 3 point drop for the CRA model. Even though the models produced by the CRA process achieved lower accuracies in some instances, these models were less affected by the missing attribute information. Table 44 provides the best performance of each data set under the best conditions (no missing attributes) and worst conditions (4 missing attributes) and the drop in accuracy for each data set for the ordered models. Table 45 displays this information for the unordered models. Figures 37 and 38 illustrate the drop in accuracy for both types of models.
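The drop values reported in Tables 44 and 45 below are simple differences, in percentage points, between the best-case accuracy (no missing attributes) and the worst-case accuracy (4 missing attributes). A minimal check of the ordered-model figures quoted above:

```python
# Percentage-point drop = best-case accuracy - worst-case accuracy.
# Values are the ordered-model accuracies for the Anneal data set quoted above.
ordered_best_worst = {"Anneal (traditional)": (0.94, 0.84),
                      "Anneal (collective)": (0.90, 0.87)}
for model, (best, worst) in ordered_best_worst.items():
    print(f"{model}: {100 * (best - worst):.0f} point drop")
# Anneal (traditional): 10 point drop; Anneal (collective): 3 point drop
```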
Data Set | Traditional Best | Collective Best | Traditional Worst | Collective Worst | Traditional Drop | Collective Drop
Congressional Voting | 99% | 98% | 79% | 90% | 20% | 8%
Anneal | 94% | 90% | 84% | 87% | 10% | 3%
Thyroid | 98% | 98% | 95% | 97% | 3% | 1%
Mushroom | 100% | 90% | 82% | 87% | 18% | 3%
Tic-Tac-Toe | 99% | 96% | 49% | 71% | 50% | 28%
Table 44 Best Performance for Best and Worst Conditions Ordered Models

Data Set | Traditional Best | Collective Best | Traditional Worst | Collective Worst | Traditional Drop | Collective Drop
Congressional Voting | 97% | 97% | 77% | 89% | 20% | 8%
Anneal | 90% | 82% | 82% | 81% | 8% | 1%
Thyroid | 97% | 96% | 94% | 95% | 3% | 1%
Mushroom | 89% | 89% | 77% | 82% | 12% | 7%
Tic-Tac-Toe | 82% | 79% | 48% | 69% | 34% | 10%
Table 45 Best Performance for Best and Worst Conditions Unordered Models

Figure 37 Percentage Drop in Accuracy Ordered Models
Figure 38 Percentage Drop in Accuracy Unordered Models

The aforementioned results of the drop in accuracy were evaluated using a Welch's Two Sample t-test to determine if they were statistically significant. Tables 46 and 47 (ordered and unordered models) display the results of this analysis for each data set. The null hypothesis for this test was H0: no difference in means for the Traditional vs. CRA drop in classification accuracy. For each data set evaluated, the p-value was less than .05. Therefore we can reject the null hypothesis, meaning that there is evidence supporting the notion that the collective models created using the CRA process are more resilient against the effects of missing attribute information.

Data Set | t | Degrees of Freedom | p-value
Congressional Voting | 5.8755 | 17.889 | 0.00001
Anneal | 2.6111 | 17.731 | 0.01783
Thyroid | 2.5697 | 12.488 | 0.02392
Mushroom | 5.9832 | 16.013 | 0.00001
Tic-Tac-Toe | 12.6789 | 11.794 | 0.00000003
Table 46 Results of Two Sample Welch t-test (Ordered Models)

Data Set | t | Degrees of Freedom | p-value
Congressional Voting | 5.1854 | 17.91 | 0.00006
Anneal | 3.6934 | 15.388 | 0.002087
Thyroid | 3.678 | 10.998 | 0.003640
Mushroom | 2.1514 | 17.916 | 0.04534
Tic-Tac-Toe | 9.268 | 13.954 | 0.0000002
Table 47 Results of Two Sample Welch t-test (Unordered Models)

The hypothesis of this research was that the model created with the CRA process would be more resilient against missing attribute information in the test data. Resiliency refers to minimizing any decreases in classification accuracy as the number of missing attributes increases. For each data set, the collective models created with the CRA process suffered less of a drop in classification accuracy compared to the traditional models (see Figures 37 and 38). These findings are statistically significant (see Tables 46 and 47) and support the hypothesis, as the collective model maintained the ability to make fairly accurate predictions when faced with less than desirable conditions. This resiliency is a result of the collective model being generally larger than the traditional model and containing shorter, more general rules, as described in the analysis for each data set. This directly relates to the manner in which the collective models were created. The CRA process of building individual models on smaller homogeneous subsets of the data produced collective models for all data sets that contained rules that were on average shorter than the rules in the traditional model. Also, the collective models created for 4 out of the 5 data sets contained more rules on average than the traditional model.
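The significance results in Tables 46 and 47 above come from a Welch two-sample t-test comparing the per-run drops in accuracy of the traditional and CRA models. A minimal sketch of such a test using SciPy is shown below; the two arrays are placeholder values standing in for the actual per-run drops, which are not reproduced here.

```python
from scipy import stats

# Hypothetical per-run drops in accuracy (percentage points) for one data set;
# the real inputs would be the drops observed across the repeated experimental runs.
traditional_drops = [20, 18, 22, 19, 21, 17, 23, 20, 19, 21]
cra_drops = [8, 7, 9, 6, 10, 8, 7, 9, 8, 7]

# equal_var=False selects Welch's t-test (unequal variances,
# Welch-Satterthwaite degrees of freedom), as reported in Tables 46 and 47.
t_stat, p_value = stats.ttest_ind(traditional_drops, cra_drops, equal_var=False)
print(f"t = {t_stat:.4f}, p = {p_value:.6f}")
# A p-value below .05 rejects H0: no difference in mean drop between the two methods.
```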
Figure 39 Ordered vs. Unordered Accuracies Overall

Figure 39 displays the overall performance of the unordered and ordered models. As seen in the figure, the ordered models performed better than the unordered models from an overall aspect. However, the difference in performance was negligible in some instances.

Chapter 6
Summary
6.1 Summary and Conclusion
Data mining is a powerful and useful tool that can be used to extract information from large data repositories. These techniques are used in a variety of domains to perform an array of different tasks. Predictive modeling/classification is one of the most thoroughly studied and researched subfields in data mining. The goal of these types of tasks is to predict some future outcome based on historical information. Rule based classifiers are one of the most often used types of predictive model. A rule based classifier builds a model from training data as a set of high quality rules. The models that these algorithms generate are easy to understand and interpret and are widely accepted. There have been many rule based classification systems designed, studied, and implemented that perform adequately in many applications. However, problems arise when the data that is used to assess the predictive ability of the model is not as comprehensive as the data used to create the model. When faced with missing attribute information it is desirable to have a model that maintains the ability to make fairly accurate predictions.
Missing attributes are a well known issue in the field of data mining (Batista & Monard, 2003; Witten & Frank, 2005; Tan, Steinbach, & Kumar, 2006). The majority of the research performed focuses on handling missing attribute information in the training data. There has been little in the way of studying the effects of missing information in the test data. We present a process which results in a model that is more resilient against the effects of missing attribute information in the new/unseen data. The clustering rule-based approach (CRA) seeks to add robustness using data clustering. The training data is clustered and multiple models are built using the cluster-wise data. The models are then combined into a collective model and evaluated against test data modified to contain missing attribute information.
The experimental results support the hypothesis that the collective model is more resilient against the effects of missing attribute information. The collective model incurred a smaller loss in classification accuracy when faced with missing attributes. Also, the collective model displayed the ability to classify instances that could not be classified using traditional methods. These findings are encouraging, as most real world data sets are often not as complete as practitioners would like. The work presented and evaluated here provides a new method of ensuring the model created will have the ability to make fairly accurate predictions when faced with missing/incomplete data.
6.2 Future Work
There are many directions this research can take. We evaluated two methods of combining the rules from the multiple models: sorting in decreasing order of accuracy and a voting scheme. Other approaches for combining the rules should be examined. Also, a wider variety of data sets should be used to evaluate the CRA process. The effects of a different set of k values should also be examined.
Different clustering algorithms may also be studied, as well as combining this technique with other processes proven to improve classification accuracy, such as ensemble methods.

References

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. Proceedings of the Twentieth International Conference on Very Large Databases (pp. 487-499). Santiago, Chile.
Anderberg, M. R. (1973). Cluster Analysis for Applications. Academic Press.
Batista, G. E., & Monard, M. C. (2003). An analysis of four missing data treatment methods for supervised learning. Applied Artificial Intelligence, 17(5-6), 519-533.
Baumann, T., & Germond, A. (1993). Application of the Kohonen network to short-term forecasting. In ANNPS (pp. 407-412).
Berk, R. (2004). An Introduction to Ensemble Methods for Data Analysis. California Center for Population Research, Online Working Paper Series.
Berkhin, P. (2002). Survey of Clustering Data Mining Techniques. San Jose, CA: Accrue Software.
Berson, A., Smith, S., & Thearling, K. (1999). Building Data Mining Applications for CRM. McGraw-Hill.
Blake, E. K., & Merz, C. J. (1998). UCI repository of machine learning databases. Retrieved from http://www.ics.uci.edu/~mlearn/MLRepository.html
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees. Belmont, CA: Wadsworth International Group.
CHAID. (n.d.). Retrieved February 15, 2008, from http://www.wikipedia.org/wiki/CHAID
Cheng, Y., & Church, G. (2000). Biclustering of expression data. ISMB, 93-103.
Chess Endgame. (2008). Retrieved April 3, 2008, from http://www.wikipedia.org/wiki/Endgame
Cho, H., Dhillon, I. S., Guan, Y., & Sra, S. (2004). Minimum sum squared residue co-clustering of gene expression data. SDM.
Clark, P., & Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3(4), 261-283.
Cluster Analysis. (2008). Retrieved March 6, 2008, from Wikipedia: http://en.wikipedia.org/wiki/Data_clustering
Cohen, W. (1995). Fast effective rule induction. Proceedings of the Twelfth International Conference on Machine Learning (pp. 115-123). Tahoe City, CA.
CRISP-DM. (2000). Retrieved February 16, 2008, from CRISP-DM: Cross Industry Standard Process for Data Mining: http://www.crisp-dm.org/Process/index.htm
Deodhar, M., & Ghosh, J. (2007). A framework for simultaneous co-clustering and learning from complex data. KDD '07 (pp. 250-259). San Jose, California.
Djukanovic, B., Babic, B., Sobajic, D., & Pao, Y. (1993). Unsupervised/supervised learning concept for 24-hour load forecasting. IEE Proceedings-Generation, Transmission and Distribution, 140, 311-318.
DMS. (2008). Retrieved March 16, 2008, from Decision Trees: http://dms.irb.hr/tutorial/tut_dtrees.php
Dougherty, J., Kohavi, R., & Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. ICML.
Freund, Y., & Schapire, R. (1999). A short introduction to boosting. Japanese Society for Artificial Intelligence, 771-780.
Friedman, N., & Goldszmidt, M. (1996). Building classifiers using Bayesian networks. National Conference on Artificial Intelligence (pp. 1277-1284). Menlo Park, CA: AAAI Press.
Friedman, N., Geiger, D., & Goldszmidt, M. (1997). Bayesian network classifiers. Machine Learning, 29, 131-163.
Gilbert, J. E. (2006). Applications Quest: Computing Diversity. Communications of the ACM, 49(3), 99-104.
Gilbert, J. E. (2008). Applications Quest: A case study on holistic admissions. Journal of College Admission, 12-18.
Gilbert, J. E. (2008). Patent No. US 2008/10183731 A1. USA.
Han, J., & Kamber, M. (2006). Data Mining: Concepts and Techniques. Morgan Kaufmann.
Hu, H., & Li, J. (2005). Using association rules to make rule-based classifiers robust. Proceedings of the Sixteenth Australian Database Conference (ADC 2005) (pp. 47-54). Newcastle, Australia.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264-323.
Kass, G. V. (1980). An exploratory technique for investigating large quantities of categorical data. Applied Statistics, 29(2), 119-127.
Langley, P., & Sage, S. (1994). Induction of selective Bayesian classifiers. Conference on Uncertainty in Artificial Intelligence (pp. 399-406). Seattle, WA: Morgan Kaufmann.
Larose, D. (2005). Discovering Knowledge in Data: An Introduction to Data Mining. Hoboken, NJ: John Wiley & Sons.
Lewis, R. (2000). Introduction to Classification and Regression Trees (CART). Annual Meeting of the Society for Academic Emergency Medicine. San Francisco, CA.
Li, J., Topor, R., & Shen, H. (2002). Construct robust rule sets for classification. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 564-569). Edmonton, Alberta, Canada.
Li, W., Han, J., & Pei, J. (2001). CMAR: Accurate and efficient classification based on multiple class-association rules. Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM 2001) (pp. 369-376). IEEE Computer Society Press.
Liu, B., Hsu, W., & Ma, Y. (1998). Integrating classification and association rule mining. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98) (pp. 27-31).
Lokmic, L., & Smith, K. (2000). Cash flow forecasting using supervised and unsupervised neural networks. IJCNN, 06:6343.
Michalski, R., Mozetic, I., Hong, J., & Lavrac, N. (1986). The AQ15 inductive learning system: An overview and experiments. Proceedings of IMAL 1986. Universite de Paris-Sud, Orsay.
Microsoft. (2007). Microsoft SQL Server 2005. Retrieved March 15, 2008, from Microsoft Corporation: http://www.microsoft.com/sql/prodinfo/overview/default.mspx
Mingers, J. (1989). An empirical comparison of selection measures for decision-tree induction. Machine Learning, 3, 319-342.
Oh, K., & Han, I. (2001). An intelligent clustering forecasting system based on change-point detection and artificial neural networks: Application to financial economics. HICSS-34, 3, 3011.
Opitz, D., & Maclin, R. (1999). Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 169-198.
ORACLE. (2007). Oracle Data Mining. Retrieved March 15, 2008, from http://www.oracle.com/technology/products/bi/odm/index.html
Quinlan, J. R. (2006). Bagging, boosting, and C4.5. Retrieved January 29, 2008, from RuleQuest: http://www.rulequest.com/Personal/q.aaai96.ps
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81-106.
Quinlan, J. R., & Cameron-Jones, R. M. (1993). FOIL: A midterm report. Proceedings of the 1993 European Conference on Machine Learning (pp. 3-20). Vienna, Austria.
Rapid-i. (2008). RapidMiner. Retrieved March 16, 2008, from http://rapid-i.com
Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining. Boston, MA: Addison Wesley.
TechTarget. (2008). What is data mining? Retrieved January 29, 2008, from http://searchsqlserver.techtarget.com/sDefinition/0,,sid87_gci211901,00.html
Wang, J., & Karypis, G. (2003). HARMONY: Efficiently mining the best rules for classification. Proceedings of the 2003 SIAM International Conference on Data Mining (pp. 205-216). Newport Beach, CA.
Witten, I. H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques. San Francisco, CA: Morgan Kaufmann.
Yin, X., & Han, J. (2003). CPAR: Classification based on predictive association rules. Proceedings of the 2003 SIAM International Conference on Data Mining (pp. 331-335). San Francisco, CA.