A Clustering Rule Based Approach for Classification Problems
by
Philicity Kapryelle Williams
A dissertation submitted to the Graduate Faculty of
Auburn University
in partial fulfillment of the
requirements for the Degree of
Doctor of Philosophy
Auburn, Alabama
August 9, 2010
Keywords: Data Mining, Classification, Rule Based Classifiers, Data Clustering
Copyright 2010 by Philicity Kapryelle Williams
Approved by
Juan E. Gilbert, Chair, Professor of Computer Science and Software Engineering
N. Hari Narayanan, Professor of Computer Science and Software Engineering
Cheryl D. Seals, Associate Professor of Computer Science and Software Engineering
ii
Abstract
Today?s data storage and collection abilities have allowed the accumulation of enormous
amounts of data. Data mining can be a useful tool in transforming these large amounts of raw
data into useful information. Predictive modeling is a very popular area in data mining. The
results of these type tasks can contain helpful information that can be used in decision making.
Problems arise when the data sets that are used to build these models are not as complete (e.g.
erroneous/missing values) as the data used to evaluate the model. Rule based classifiers are
widely used and accepted type of predictive model. We present a method to reduce the severity
of the effects of missing data on the performance of rule base classifiers using divisive data
clustering. The Clustering Rule based Approach (CRA) clusters the original training data and
builds a separate rule based model on the cluster wise data. The individual models are combined
into a larger model and evaluated against test data. We evaluate the effects of the missing
attribute information for ordered and unordered rule sets. We experimentally show that the
collective model is less affected by missing attribute information when the test data has missing
attribute values.
iii
Acknowledgements
First and foremost, I would like to thank God for he is my strength and through him I can
do a1l things. I would like to thank Dr. Juan Gilbert for all his support, encouragement, and
patience. The many times when I did not know how I was going to make it, or didn?t think I
could do it, he was there to guide me. Thank you to my family for their love, support, and
encouragement. They have been very supportive an encouraging throughout my academic tenure
here at Auburn. Last but not least I would like to say thank you to all of the members of the
Human Centered Computing Lab. There support, encouragement, and just being there will
forever be appreciated!!
iv
Table of Contents
Abstract ........................................................................................................................................... ii
Acknowledgements ........................................................................................................................ iii
List of Tables ................................................................................................................................ vii
List of Figures ................................................................................................................................. x
Chapter 1 ......................................................................................................................................... 1
Introduction ..................................................................................................................................... 1
1.1 Motivation ............................................................................................................................. 1
1.2 Problem Description and Background .................................................................................. 2
1.3 Organization .......................................................................................................................... 5
Chapter 2 ......................................................................................................................................... 6
Literature Review............................................................................................................................ 6
2.1 Data Mining Overview ......................................................................................................... 6
2.2 Predictive Modeling .......................................................................................................... 8
2.3 Association Analysis ....................................................................................................... 20
2.4 Cluster Analysis .............................................................................................................. 25
2.5 Anomaly Detection ......................................................................................................... 29
2.6 Ensemble Methods .......................................................................................................... 30
v
2.7 Other Related Information and Work ............................................................................. 34
2.8 Data Mining Tools .......................................................................................................... 37
Chapter 3 ....................................................................................................................................... 39
Research Details............................................................................................................................ 39
3.1 Conceptual Overview.......................................................................................................... 39
Chapter 4 ....................................................................................................................................... 42
Experiment Design........................................................................................................................ 42
4.1 Data ..................................................................................................................................... 42
4.2 Materials and Tools............................................................................................................. 42
4.2.1 PRISM.......................................................................................................................... 43
4.2.2 Applications Quest? ................................................................................................... 43
4.3 Procedure ............................................................................................................................ 47
4.4 Expected Outcomes ............................................................................................................ 48
Chapter 5 ....................................................................................................................................... 49
Research Findings ......................................................................................................................... 49
5.1 Results ................................................................................................................................. 49
Chapter 6 ................................................................................................................................... 96
Summary ....................................................................................................................................... 96
6.1 Summary and Conclusion ................................................................................................... 96
6.2 Future Work ........................................................................................................................ 97
vi
References ..................................................................................................................................... 98
vii
List of Tables
Table 1Tennis Training Data .......................................................................................................... 3
Table 2 Vertebrate Unlabeled/Unseen Data ................................................................................. 11
Table 3 Data Sets .......................................................................................................................... 42
Table 4 Similarity Matrix with Difference Measures Example .................................................... 45
Table 5 Example Nominal Population Matrix .............................................................................. 45
Table 6 Nominal Population Matrix Example .............................................................................. 46
Table 7 Example Nominal Population Matrix .............................................................................. 46
Table 8 Example Nominal Population Matrix Adjusted............................................................... 46
Table 9 Congressional Voting Results for Complete Data (No Missing Attributes) ................... 50
Table 10 Congressional Voting 1 Missing Attribute .................................................................... 51
Table 11 Congressional Data Results 2 Missing Attributes ......................................................... 51
Table 12 Congressional Voting Data Results 3 Missing Attributes ............................................. 51
Table 13 Congressional Voting Results 4 Missing Attributes ...................................................... 52
Table 14 Congressional Voting Example Traditional Rule Set .................................................... 55
Table 15 Congressional Voting Example Collective Rule Set ..................................................... 56
Table 16 Anneal Results Complete Data (No Missing Attributes)............................................... 57
Table 17 Anneal Results 1 Missing Attribute ............................................................................... 57
Table 18 Anneal Results 2 Missing Attributes ............................................................................. 58
Table 19 Anneal Results 3 Missing Attributes ............................................................................. 58
viii
Table 20 Anneal Results 4 Missing Attributes ............................................................................. 58
Table 21 Example Anneal Traditional Rule Set ........................................................................... 61
Table 22 Example Anneal Collective Rule Set ............................................................................. 63
Table 23 Thyroid Results Complete Data (No Missing Attributes) ............................................. 64
Table 24 Thyroid Results 1 Missing Attribute ............................................................................. 64
Table 25 Thyroid Results 2 Missing Attributes ............................................................................ 65
Table 26 Thyroid Results 3 Missing Attributes ............................................................................ 65
Table 27 Thyroid Results 4 Missing Attributes ............................................................................ 65
Table 28 Thyroid Example Traditional Model ............................................................................. 70
Table 29 Thyroid Example Collective Model .............................................................................. 72
Table 30 Mushroom Results Complete Data (No Missing Attributes) ........................................ 73
Table 31 Mushroom Results 1 Missing Attribute ......................................................................... 73
Table 32 Mushroom Results 2 Missing Attributes ....................................................................... 74
Table 33 Mushroom Results 3 Missing Attributes ....................................................................... 74
Table 34 Mushroom Results 4 Missing Attributes ....................................................................... 74
Table 35 Mushroom Example Traditional Model ........................................................................ 77
Table 36 Mushroom Example Collective Model .......................................................................... 78
Table 37 TicTacToe Results Complete Data (No Missing Attributes) ...................................... 79
Table 38 TicTacToe Results 1 Missing Attribute ...................................................................... 80
Table 39 TicTacToe Results 2 Missing Attributes..................................................................... 80
Table 40 TicTacToe Results 3 Missing Attributes..................................................................... 80
Table 41 TicTacToe Results 4 Missing Attributes..................................................................... 81
Table 42 TicTacToe Example Traditional Model ...................................................................... 86
ix
Table 43 TicTacToe Example Collective Model ....................................................................... 91
Table 44 Best Performance for Best and Worst Conditions Ordered Models .............................. 91
Table 45 Best Performance for Best and Worst Conditions Unordered Models .......................... 92
Table 46 Results of Welch Two Sample ttest (Ordered Models) ................................................ 94
Table 47 Results Two Sample Welch ttest (Unordered Models) ................................................ 94
x
List of Figures
Figure 1 ID3 Decision Tree ............................................................................................................ 3
Figure 2 Data mining's relationship to other disciplines ................................................................. 6
Figure 3 CRISPDM Model ............................................................................................................ 7
Figure 4 Four Core Data Mining Tasks .......................................................................................... 8
Figure 5 Vertebrate Training Data (Labeled/Known Outcomes) ................................................. 10
Figure 6 Vertebrate Rule Set/Model ............................................................................................. 10
Figure 7 Simple Decision Tree ..................................................................................................... 12
Figure 8 Basic Sequential Covering Algorithm ............................................................................ 16
Figure 9 Rule Growing Strategies................................................................................................. 17
Figure 10 Simple Neural Network ................................................................................................ 19
Figure 11 Lattice Structure ........................................................................................................... 22
Figure 12 Apriori Principle ........................................................................................................... 23
Figure 13 Support Based Pruning ................................................................................................. 24
Figure 14 Records to be clustered ................................................................................................. 26
Figure 15 Cluster Hierarchy .......................................................................................................... 26
Figure 16 Linkage Examples ........................................................................................................ 28
Figure 17 Kmeans Algorithm ...................................................................................................... 28
Figure 18 Ensemble Method Process ............................................................................................ 31
Figure 19 Bagging and Boosting Example ................................................................................... 33
xi
Figure 20 CRA Flow ..................................................................................................................... 41
Figure 21 PRISM Pseudocode ...................................................................................................... 43
Figure 22 Congressional Voting Accuracies Unordered Model ................................................... 52
Figure 23 Congressional Voting Accuracies Ordered Model ....................................................... 53
Figure 24 Congressional Voting Average Rule Lengths .............................................................. 53
Figure 25 Anneal Accuracies Unordered Model .......................................................................... 59
Figure 26 Anneal Accuracies Ordered Model .............................................................................. 59
Figure 27 Anneal Average Rule Lengths ...................................................................................... 60
Figure 28 Thyroid Accuracies Unordered Models ....................................................................... 66
Figure 29 Thyroid Accuracies Ordered Models ........................................................................... 66
Figure 30 Thyroid Average Rule Lengths .................................................................................... 67
Figure 31Mushroom Accuracies Unordered Models .................................................................... 75
Figure 32 Mushroom Accuracies Ordered Models ....................................................................... 76
Figure 33 Mushroom Average Rule Lengths................................................................................ 76
Figure 34 TicTacToe Accuracies Unordered Models ................................................................ 81
Figure 35 TicTacToe Accuracies Ordered Models .................................................................... 82
Figure 36 TicTacToe Average Rule Lengths ............................................................................. 82
Figure 37 Percentage Drop in Accuracy Ordered Models ............................................................ 92
Figure 38 Percentage Drop in Accuracy Unordered Models ........................................................ 93
Figure 39 Ordered vs. Unordered Accuracies Overall .................................................................. 95
1
Chapter 1
Introduction
1.1 Motivation
Advances in data collection and storage capabilities have enabled organizations,
companies, institutions, etc to collect immense amounts of data. The challenge lies in trying to
turn all of that raw data into useful information. One way to address this problem is with the use
of data mining.
Data Mining can be defined as ?sorting through data to identify patterns and establish
relationships? (TechTarget, 2008). Data mining blends traditional data analysis methods with
sophisticated algorithms for processing large amounts of data. These algorithms search large data
repositories in order to find new and useful patterns that might have otherwise been unidentified.
Data mining methods can and are being used in many domains to address a variety of tasks.
One of the more popular uses of data mining techniques is predictive modeling or
classification. Certain types (i.e. decision trees, neural networks, etc.) of data mining algorithms
have the ability to predict future outcomes using the attributes in the data set. For example, let?s
assume that a long distance company wants to promote its new long distance plan. The company
can apply predictive modeling algorithms to specifically target customers who are more likely to
sign up for the plan based on their past behavior.
2
As one could imagine, the accuracy of predictive modeling algorithms is very important.
Their results contain information that can be very valuable and have the ability to help make
decisions and drive change. Rule based algorithms are widely used and accepted to perform
classification of tasks due to their ease of interpretability and understanding. These classifiers
build a model or rule set from training data as a set of highquality rules, which can be used to
predict the class labels of unlabeled instances (Wang & Karypis, 2003). There have been many
rule based classification systems studied and proposed which yield adequate classification
accuracy in a variety of applications such as RIPPER (Cohen, 1995), FOIL (Quinlan & Gaetz,
1993), and CPAR (Yin & Han, 2003) . However, if the test data used to evaluate the model is not
as complete as the training data used to build the model the classification accuracy suffers as the
model tends to follow the training data too closely. This is problematic as in most real world
applications data is often incorrect and or missing parts. Thus, it is advantageous to have a rule
set or model that is robust and can make reasonably accurate predictions when the test data is
incomplete. This research explores the notion of adding robustness to rule based classifiers via
divisive clustering.
1.2 Problem Description and Background
Rule based classifiers are used widely because of the ease of interpretability of the
models or set of rules they generate. These classifiers perform exceptionally well on complete
data sets. Meaning, the data is clean, correct, and does not have missing attribute values. To
generate the model or train the classifier the training data uses attribute and values relative to
each other to segregate the data to generate a rule relative to a particular class. This poses a
problem as real world data sets are not clean, correct and often have missing attribute values.
The models produced by traditional rule based classifiers are sensitive to missing attribute values
in new/unseen data. For example, Table 1 and Figure 1
decision tree (Quinlan J. , 1993).
3
depict a popular data set
Table 1Tennis Training Data
Figure 1 ID3 Decision Tree
and the resulting
4
The rules derived from this tree are:
If outlook is sunny and humidity is high, then do not play tennis.
If outlook is sunny and humidity is normal, then play tennis.
If outlook is overcast, then play tennis.
If outlook is rain and wind is strong, then do not play tennis.
If outlook is rain and wind is weak, then play tennis.
Notice that all of the rules derived contain the outlook attribute. Suppose there is some test data
in which the outlook information is missing. This model is not robust as none of the rules
generated would be able to make predictions. In fact, as the number of attributes within a data set
that have missing data increases the classifiers performance/accuracy decreases. The model that
the classifier generates is unable to classify all of the instances new/unseen test data. For the
purposes of this research we call this a hard classification problem.
This research proposes a new method for adding robustness to rule based classifiers using
divisive clustering for hard classification problems. With this approach the training data set is
broken into clusters based on holistic similarities such that items within a particular cluster are
more alike than those outside of the cluster. Each cluster generates a model trained with smaller
homogeneous data sets. Multiple models are created using the clustered data and then combined
into a single model (the experiment). The hypothesis for this research is that the resulting
combined rule base will be more resilient against missing attribute information. Resiliency refers
to minimizing any decreases in classification accuracy as the number of missing attributes
increases versus the traditional (the control) model. It is believed that the collective model is
more resilient than the traditional model because it will have an added level of robustness and
ability to classify new/unseen data instances that would not have been classified with a
traditional classifier. The collective model contains rules that otherwise would not have been
5
created because of the alternate views of the data set used to train the multiple individual
classifiers.
1.3 Organization
In the following chapters the research agenda will be examined. Chapter 2 provides an
overview of data mining techniques and practices, more specifically, predictive
modeling/classification and data clustering, and discusses any work pertinent to this research.
Chapter 3 provides conceptual view of the approach studied in this research. Chapters 4 and 5
describe the design of the experiment conducted to study the validity of the approach and the
results of that experiment respectively. Finally, Chapter 6 provides a summary and conclusions
and discusses areas of future work.
6
Chapter 2
Literature Review
2.1 Data Mining Overview
Data Mining is the process of discovering useful information in large amounts of
data. It is fundamentally built on the principles of algorithms used in statistics, artificial
intelligence, pattern recognition, and machine learning. From statistics, data mining incorporates
characteristics such as sampling, estimation, and hypothesis testing. From artificial intelligence,
pattern recognition, and machine learning, data mining incorporates the following: search
algorithms, modeling techniques, and learning theories. Figure 2 conveys data mining?s
relationship to other areas (Tan, Steinbach, & Kumar, 2006)
Figure 2 Data mining's relationship to other disciplines
Data mining is being used in many domains in a variety of ways. Due to this, a CRross
Industry Standard for Data Mining Projects
industry, and application neutral. The life cycle according to the standard has six phases:
Business/Research Understanding, Data Understanding, Data Preparation, Modeling, Evaluation,
and Deployment. Figure 3 depicts CRIS
Most data mining tasks are usually either predictive or descriptive in nature. Predictive
tasks are ones in which the goal is to predi
values of other attributes. The attribute that is being predicted is frequently referred to as the
target or dependent variable. The other attributes that are used for making the prediction are
7
(CRISPDM) was developed. This standard is tool,
PDM (CRISPDM, 2000).
Figure 3 CRISPDM Model
ct the value of a particular attribute based on the
8
referred to as the explanatory or independent variables. Descriptive tasks are ones in which the
goal is to derive patterns (i.e. correlations, clusters, trends) that will provide a summary of the
underlying relationships in the data. These tasks are usually exploratory and often require some
postprocessing to confirm as well as explain the results.
There are four tasks that are at the heart of data mining. They are Predictive Modeling,
Cluster Analysis, Association Analysis, and Anomaly Detection (see Figure 4) (Tan, Steinbach,
& Kumar, 2006)
Figure 4 Four Core Data Mining Tasks
2.2 Predictive Modeling
Predictive modeling is one of the most popular subfields in data mining. It is the process
of using the patterns found in the data set to predict future outcomes. Algorithms used for
9
predictive modeling build a model for the dependent variable as a function of the independent
variables (Berson, Smith, & Thearling, 1999). There are two types of predictive modeling tasks:
classification and regression. Classification tasks are used when the target or dependent variable
is discrete. Regression tasks are used when the target variable is continuous (Tan, Steinbach, &
Kumar, 2006)
Predictive modeling problems are comprised of four things: a dependent variable,
independent variables, a learning/training data set, and a test data set (Lewis, 2000). The
learning/training data set contains values for both the dependent and independent variables, and
is used to build the model. This model is then applied to the test set for evaluation. The test set is
a subset of the training/learning data set. The performance of the model is based on the counts of
the test records that are correctly or incorrectly predicted or classified.
There are several predictive modeling techniques. The next section provides an overview
of some of them.
2.2.1 Rule based Classifiers
A rulebased classifier is a method used for classification using a set of ?if?then? rules.
A rule can be expressed as ri: (Condition) barb2right yi. The left hand side of the rule is the antecedent or
condition. The right hand side is the consequent. A rule covers an instance if the condition
matches the attributes of that instance. An example of classification using a rule based method is
depicted below. Figure 5 shows the training data, Figure 6 displays the rule set (model)
generated and Table 2 contains the unlabeled instances for the vertebrate classification problem.
The goal of this model is to determine what type of vertebrate the instance will be classified as.
There are five possible outcomes: mammal, reptile, fishes, amphibians, and birds. (Tan,
Steinbach, & Kumar, 2006)
R1 covers the hawk instance. A hawk does not give birth and can fly and will be
classified as a bird. R3 covers the grizzly bear instance. A grizzly bear gives birth and is warm
blooded and will be classified as a mammal.
Figure 5 Vertebrate Training Data
Figure
Name Blood Type
human warm
python cold
salmon cold
whale warm
frog cold
komodo cold
bat warm
pigeon warm
cat warm
leopard shark cold
turtle cold
penguin warm
porcupine warm
eel cold
salamander cold
gila monster cold
platypus warm
owl warm
dolphin warm
eagle warm
10
(Labeled/Known Outcomes)
6 Vertebrate Rule Set/Model
Give Birth Can Fly Live in Water
yes no no mammals
no no no reptiles
no no yes fishes
yes no yes mammals
no no sometimes amphibians
no no no reptiles
yes yes no mammals
no yes no birds
yes no no mammals
yes no yes fishes
no no sometimes reptiles
no no sometimes birds
yes no no mammals
no no yes fishes
no no sometimes amphibians
no no no reptiles
no no no mammals
no yes no birds
yes no yes mammals
no yes no birds
Class
11
Table 2 Vertebrate Unlabeled/Unseen Data
Rule based classification algorithms tend to fall into the two categories: decision trees and
covering algorithms (Li, Topor, & Shen, 2002) (Hu & Li, 2005) (Tan, Steinbach, & Kumar,
2006) (Witten & Frank, 2005).
2.2.1.1 Decision Trees
A decision tree is a predictive model in which the results are displayed as a tree type
structure (Berson, Smith, & Thearling, 1999). A decision tree consists of a collection of decision
nodes, which are connected by branches, descending from the root node until coming to an end
at the leaf nodes. Each branch of the tree is a classification question and the leaves are the
partitions or segments of the dataset with the classification or decision (Larose, 2005) (Quinlan J.
R., 1986).
The first step in the decision tree building process is to ?grow? the tree. The goal is to
create a tree that works as close to perfect as possible with the data provided. When growing the
tree, the main task is to find the best question to ask at each branch or split point in the tree. This
task is repeated until there is either only one record in the segment, each of the records in the
segment are the same, or there is not any significant gain in making a split. When one of these
conditions are met the tree stops growing (Berson, Smith, & Thearling, 1999). Figure 7 conveys
a simple decision tree example (DMS, 2008).
Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?
12
Figure 7 Simple Decision Tree
There are certain requirements that must be adhered to in order to apply a decision tree
algorithm. A training data set must be provided. This training data should be varied in nature and
provide the algorithm with a robust sampling of the types of records that will need classification
in the future. This is very important as decision trees learn by example. If the algorithm is
provided bad data then the ability to correctly classify future data will greatly diminish. Also, the
target variable must be discrete (Larose, 2005).
Decision trees are a part of data mining technology that has existed in some form for
decades. Originally, these techniques were created for statisticians to automate the process of
determining which attributes were useful or in correlation with the problem they were attempting
13
to solve or understand. They are also particularly adept at handling raw data with little or no pre
processing (Berson, Smith, & Thearling, 1999).
Perhaps the most appealing aspect of decision trees is how easy they are to interpret. This
is especially evident when deriving decision rules. Rules can be derived by simply traversing the
tree from the root to any leaf node and come in the form of if antecedent, then consequent. The
antecedent contains the attribute values from the branches taken by a particular path throughout
the tree. The consequent contains the classification value for the dependent variable given by that
particular leaf node (Larose, 2005) (Tan, Steinbach, & Kumar, 2006) (Witten & Frank, 2005).
Using the simple decision tree in Figure 7 an example of a decision rule would be if A = red and
B < 4.5 then K = y. There are several algorithms that are used to produce decision trees. These
include: ID3, C4.5, Classification and Regression Trees (CART), and ChiSquare Automatic
Interaction Detector (CHAID). The next sections describe each.
ID3
Developed in the 1970?s, ID3 was one of the first decision tree algorithms. Initially, the
algorithm was used for things like learning good game play strategies for chess end games. In
chess, the end game is the portion of the game where there are only a few pieces left on the board
(Quinlan & Gaetz, 1993) (Chess Endgame, 2008). However, ID3 has been used for a variety of
tasks in various domains and has been modified and improved many times.
ID3 works by choosing the predictors and their corresponding split values based on the
gain in information that the split(s) will make available. Gain is the difference in the amount of
information that will be needed to properly make an accurate prediction before and after a
split(s) is made (Berson, Smith, & Thearling, 1999).
14
C4.5
C4.5 is an enhancement of the ID3 algorithm. It improved ID3 algorithm by allowing predictors
with missing or continuous values to be used, introducing tree pruning, and enabling rule
derivation. The trees produced by this algorithm are also not restricted to binary splits. These
trees are more variable in shape (Berson, Smith, & Thearling, 1999) (Larose, 2005) (Quinlan J. ,
1993).
Classification and Regression Trees
Classification and Regression Trees (CART) was developed in 1984 by Leo Breiman,
Jerome Friedman, Richard Olshen and Charles Stone (Breiman, Freidman, Olshen, & Stone,
1984). Many of the C4.5 techniques appear in CART. The trees that are produced by the CART
algorithm are strictly binary. This means that each node in the tree can only have two branches.
CART analysis is a form of recursive partitioning. The algorithm recursively partitions the data
in the training set into subsets of records with similar values for the target or dependent variable.
To grow the tree, CART performs an exhaustive search of available variables to pick the best or
most optimal split variable. The fully grown tree that is initially produced will yield the lowest
error rate when compared against the training data. However, this model may be too complex
and usually results in overfitting. An overfit model is one that too closely follows all of the traits
of the training data and is not general enough to represent the overall population (Berson, Smith,
& Thearling, 1999) (Larose, 2005) (Lewis, 2000).
CART uses a cross validation approach and pruning to combat overfitting. Cross
validation, also known as leaveoneout training, is a method for confirming a procedure for
building a model that is computationally intensive (Lewis, 2000). In crossvalidation, the training
data is divided randomly into N segments, stratified by the variable of interest. One of the
15
segments is set aside for use as an independent test data set. The other (N1) segments are
combined for use as the training data set. The entire model building process is repeated N times,
using a different segment of data as the test data each time. N different models are produced,
each of which that can be tested against the independent test data. Based on the results of the
crossvalidation the initial complex tree is pruned to produce the most optimal general tree. The
most complex tree rarely performs best on the test data. By using cross validation, the tree that is
most likely to perform well on new data is produced (Lewis, 2000).
ChiSquare Automatic Interaction Detector
ChiSquare Automatic Interaction Detector (CHAID) was published in 1980 by Gordon
V. Kass (Kass, 1980 ). CHAID is very similar to CART but differs in the way splits are
determined. The algorithm uses the chi square test to choose the best split. The chi square test is
used in contingency tables to determine which categorical predictor is the furthest from
independence with the prediction values. All predictors must either be categorical or forced into
a categorical form, because of CHAID?s reliance on contingency tables, to shape its test of
significance for each predictor (Berson, Smith, & Thearling, 1999) (CHAID).
2.2.1.2 Covering Algorithms
Rules can be extracted directly from the training data using a sequential covering
algorithm. These algorithms learn a set of classification rules one rule at a time and use a
sequential covering method to remove instances in the training data that are covered by each new
rule found (Wang & Karypis, 2003) (Han & Kamber, 2006). This process repeats until there are
no instances left in the training data set. There are many types of covering algorithms such as
AQ15 (Michalski, Mozetic, Hong, & Lavrac, 1986), CN2 (Clark & Niblett, 1989), PRISM
(Witten & Frank, 2005) and RIPPER (Cohen, 1995). A basic sequential covering algorithm is
16
depicted in Figure 8. The algorithm begins with an empty rule list. The best rule for class y that
will cover the training records is mined. During this process all the training records that are
covered by class y are deemed positive examples while those belonging to other classes are
negative. A rule is of interest if it covers the majority of the positive and none or a minute few of
the negative examples. The newly found rule is added to the bottom of the rule list and the
training instances it covers are removed. This process repeats until some stopping criteria is met
and proceeds to generate rules for the remaining classes (Tan, Steinbach, & Kumar, 2006).
Let E be the training records and A be the set of attributevalue pairs, {(Aj, vj)}.
Let Yo be an ordered set of classes {y1, y2,?yk}.
Let R ={} be the initial rule list.
for each class y ? Yo ? {yk} do
while stopping condition is not met do
r ? LearnOneRule (E, A, y).
Remove training records from E that are covered by r.
Add r to the bottom of the rule list: R ? R uni02C5 r.
end while
end for
Insert the default rule, {} ? yk, to the bottom of the rule list R.
Figure 8 Basic Sequential Covering Algorithm
There are two commonly used methods to ?grow? a rule: general to specific or specific to
general. The general to specific strategy starts by creating a rule r: {} ? y. The left side of the
rule is an empty set and the right side contains the target classification. This initial rule has poor
quality as it covers all of the instances in the training data for that class. New conditions or
conjuncts are added to improve the quality of the rule. For example, Figure 9 (a) depicts the
general to specific rule growing method. The condition (conjunct), Body Temperature=warm
blooded is added initially to the rule. The algorithm will then explore all possibilities and
greedily chooses the next condition to add. The specific to general method one of the positive
examples is chosen at random as the seed to grow the rule. There is a refinement step in which
17
the rule is generalized by removing a condition (conjunct) to allow it to cover more positive
examples. Figure 9(b) depicts this process (Tan, Steinbach, & Kumar, 2006).
Some evaluation metric (i.e. accuracy, coverage, etc.) is used to determine which conjuncts to
add or remove to improve rule quality during the process of growing a rule.
2.2.1.3 Ordered and Unordered Rule Sets
Rule basedclassifier rule sets are either ordered or unordered. The rules within an ordered
rule set are arranged in an order based on some metric (i.e. accuracy) in decreasing order of
Figure 9 Rule Growing Strategies
18
priority. When there is a new unlabeled instance, the first rule that matches makes the
prediction. An ordered rule set is also known as a decision list. These classifiers usually have a
?default class? which is used when no classification can be made (no rule matches) (Tan,
Steinbach, & Kumar, 2006) (Hu & Li, 2005). C4.5rules (Quinlan J. , 1993) and CBA (Liu, Hsu,
& Ma, 1998) both use this method.
Unordered models do not arrange the rules in any type of sequence or order. All of the
rules that match are used to make the prediction (i.e. voting). The class that receives the most
votes wins and is used as the prediction. The vote for each class may be weighted by the rules
accuracy in some instances. There are both advantages and disadvantages when using an
unordered rule base. Unordered rule sets are not as vulnerable to misclassifications as a result of
choosing the wrong rule. Building the model is also less expensive as the rules do not have to be
kept in any particular order. However, classifying a new test instance can be expensive as the all
of the attributes of that instance must be compared against every rule in the model. (Tan,
Steinbach, & Kumar, 2006) (Witten & Frank, 2005)
2.2.1 Other Classifiers
Na?ve Bayesian Classifiers
Bayesian classifiers are based on the Bayes theorem. This theorem is a simple
mathematical formula that is used for calculating conditional probabilities. The na?ve Bayes
classifier learns the conditional probability for each attribute given a particular class label. For
example, if A and B are two random events then P (AB) is the conditional probability of A
occurring given B. Classification is performed by using the Bayes theorem to calculate the
probability of a class given a particular instance. The class with the highest posterior probability
is then used as the prediction. Using the previous example, P (AB) is the posterior probability.
19
(Friedman & Goldszmidt, Building classifiers using Bayesian networks, 1996) (Langley & Sage,
1994) (Friedman, Geiger, & Goldszmidt, Bayesian Network Classifiers, 1997)
Neural Networks
An example of an authentic neural network would be the human brain. The brain can
recognize patterns, make predictions, and learn. Artificial neural networks seek to emulate these
capabilities. Neural networks in data mining are essentially computer programs applying pattern
recognition and machine learning algorithms to build predictive models (Berson, Smith, &
Thearling, 1999).
There are two main structures in a neural network: nodes and links. Nodes are artificial
neurons and links are the connections between them. To make a prediction, the neural network
accepts values for the independent variables or predictors at the input nodes. The values of these
nodes are then multiplied by values stored in the links. These values are added together at the
output node, after which some threshold function is used and the resulting number is the
prediction. Figure 10 (Berson, Smith, & Thearling, 1999) is an example of a simple neural
network. In this example Age and Income are the input nodes and Default is the output node and
predicts if a person will default on a bank loan.
Figure 10 Simple Neural Network
20
Most neural networks are not as simple as depicted above. There is usually hidden layer
of nodes between the input and output nodes. They are deemed ?hidden? because their contents
are not made known to the end user. It is also possible to have more than one hidden layer, thus
making the network very complex.
2.3 Association Analysis
Association analysis is useful for finding interesting relationships that are hidden in large
data sets. The goal of this type of analysis is to uncover rules (or associations) for quantifying the
relationship between two or more attributes (Larose, 2005). These relationships are displayed in
the form of an association rule. An association rule is an implication expression of the form:
Y
X
Y
X confYX ,sup:? , where X and Y are disjoint itemsets (i.e., ?=?YX ),
Y
Xsup is the
support, and YXconf is the confidence.
The strength or goodness a rule is measured by its support and confidence. Support
defines how many times a rule is pertinent to a particular data set. Confidence defines how often
item in Y appear in instances that contain X. For example, a store may find that out of 100
customers shopping on a Tuesday night, 20 bought diapers, and of the 20 who bought diapers, 5
bought beer. The association rule for this example would be if buy diapers then buy beer. This
rule would have a support of 5/100 = 5% and a confidence of 5/20 = 25% (diapers barb2right beer: 0.05,
0.25) (Tan, Steinbach, & Kumar, 2006)
Mining association rules from large data repositories involves a two step process:
frequent itemset generation and rule generation. First, all of the frequent itemsets must be found.
Frequent itemsets are those that satisfy some minimum support threshold. Then, from the
frequent item sets found in the first step, association rules are generated that satisfy the minimum
21
support and confidence conditions. In association analysis an itemset is a collection of zero or
more items.
Frequent itemset generation is generally more expensive than rule generation. A dataset
that has k items could produce up to 2k 1 itemsets. Because k can be quite large, the search space
of itemsets that needs to be investigated is exponentially large. To enumerate the list of all
possible itemsets a lattice structure can be used (Figure 11). The brute force method of finding
frequent itemsets involves determining the support count for each candidate itemset in the lattice
structure. There are ways of reducing the computational complexity of frequent itemset
generation. One of these is to reduce the number of candidate itemsets using the Apriori
principle. The Apriori principle states that if an itemset is frequent, then all of its subsets must
also be frequent (Figure 12). On the other hand, if an itemset is infrequent then all of its supersets
must be infrequent too. For example in Figure 13, because {a, b} is an infrequent itemset the
entire sub graph containing the supersets {a, b} can be pruned. This method of reducing the
exponential search space based on the support measure is known as supportbased pruning.
Apriori was the first rule mining algorithm which pioneered the use of support based pruning
(Tan, Steinbach, & Kumar, 2006). This algorithm uses support based pruning to control the
exponential growth of candidate itemsets. Apriori uses a bottom up technique in which the
frequent item sets are extended one at a time. This is known as candidate itemset generation.
This process continues until no further successful extensions are found (Larose, 2005).
22
Figure 11 Lattice Structure
23
Figure 12 Apriori Principle
Figure
24
13 Support Based Pruning
25
2.4 Cluster Analysis
Clustering can be defined as ?a division of data into groups of similar objects (Berkhin,
2002)?. Instances within these groups or clusters are more similar to each other than instances
belonging to other clusters. Clustering and classification are different from one another in that
there is no target or dependent variable in clustering. Clustering does not attempt to classify or
predict the value of the target variable. These algorithms attempt to segment the whole data set
into homogenous clusters. The more similarity within a cluster and the bigger the difference
between clusters the better the clustering. For the best possible performance, clustering
algorithms require that the data be normalized so that any one attribute or variable will not
control the analysis. (Jain, Murty, & Flynn) (Anderberg, 1973) There are two main types of
clustering algorithms: hierarchical and partitional. The next section describes each.
2.4.1 Hierarchical Clustering
Hierarchical clustering creates a tree of clusters known as a dendrogram. With this type
of clustering, the smallest clusters in the tree join together to create the next level of clusters. At
this level, the clusters then join together to create the next level of clusters. The top or root of
this tree is the cluster that contains all the records. There are two types of hierarchical clustering:
agglomerative and divisive (Berkhin, 2002) (Berson, Smith, & Thearling, 1999).
Agglomerative clustering algorithms begin with having as many clusters as there are
records. Each cluster will contain one record. Then the clusters that are closest to one another,
based on some distance, are joined together to create the next largest cluster. This process is
continued until the hierarchy is built with a single cluster, which contains all records, at the top.
Figures 14 and 15 describe the agglomerative clustering process (Cluster Analysis, 2008).
26
Figure 14 contains a set of records that will be clustered. Figure 15 shows the cluster hierarchy
after the agglomerative approach has been applied.
Figure 14 Records to be clustered
Figure 15 Cluster Hierarchy
27
Divisive clustering algorithms apply an opposite approach. These types of algorithms
start will all of the records in one cluster and then recursively splits the most suitable cluster.
This process continues until some stopping criteria are met.
How appropriate a cluster or clusters for merging or splitting depends on how similar or
dissimilar items in each cluster are. The distance between individual records has to be
generalized to the distance between clusters in order for splitting or merging to occur. This
proximity measure is called a linkage metric. The major linkage metrics include: Single
Linkage, Complete Linkage, and Average Linkage (see Figure 16) (Tan, Steinbach, & Kumar,
2006).
Single linkage (nearest neighbor) is based on the minimum distance between any record
in one cluster and any record in another cluster. Cluster similarity is based on the similarity of
the most similar members from each cluster.
Complete linkage (farthest neighbor) is based on the maximum distance of any record in
one cluster and any record in another cluster. Cluster similarity is based on the similarity of the
most dissimilar members from each cluster.
Average linkage was designed to decrease the dependence of the cluster linkage criteria
on extreme values. The criteria here is the average distance of all the records in one cluster from
all of the records in another cluster.
2.4.2 Partional Clustering
Partitional clustering is dividing the data set in such a way that each record belongs to
one and only one cluster. These algorithms don?t produce dendrogram as in hierarchical
clustering. A single partition of the data is produ
a randomly picked or user defined number of clusters. The algorithms then optimize each cluster
based on some validity measure. There are several partitioning clustering approaches. The next
section discusses two of these: K
KMeans
Kmeans is one of the oldest and most widely used clustering techniques. The name K
means comes from how each of the
that cluster. This point is called the centroid. The basic K
straight forward. Figure 17 describes this algorithm
28
Figure 16 Linkage Examples
ced. Partitional clustering algorithms begin with
Means and Expectation Maximization (EM) (Berkhin, 2002)
K clusters is represented by the mean of the points within
means algorithm is very simple and
(Tan, Steinbach, & Kumar, 2006)
Figure 17 Kmeans Algorithm
.

.
29
The first step is to pick K initial centroids. K is the number of clusters wanted and is a
user specified parameter. Next each data instance or point is assigned to the centroid it is closest
to. This collection of points is a cluster. The centroid of each of these clusters is updated based
on the points assigned to the cluster. These steps are repeated until there is no change in the
centroids.
Instances are assigned to a centroid based on a proximity measure that quantifies
closeness for a particular data set. There are several types of proximity measures such as
Euclidean distance and Cosine similarity (Berkhin, 2002).
Expectation Maximization (EM)
The EM technique is very similar to the kmeans approach. They are both iterative in
nature and start with a random guess. The difference lies in the idea of hard versus soft cluster
membership. In kmeans a hard membership approach is adopted. This means that an instance
belongs to one and only one cluster. In EM a soft membership approach is used, which means
that membership of an instance can be spread amongst many clusters.
The EM algorithm is a two step process: Expectation and Maximization. The expectation
portion is the first step and involves calculating cluster probabilities (the expected class values).
The maximization step involves calculating the distribution parameters. This step is the
maximization of the possibilities of the distribution given the data. These steps are repeated until
a ?loglikelihood convergence is achieved.? (Berkhin, 2002) (Witten & Frank, 2005).
2.5 Anomaly Detection
The main objective in anomaly detection is to locate instances that are different from
most of the other instances. These abnormal objects are known as outliers and have attribute
values that deviate considerably from the norm or expected values. Some areas where this type
30
of analysis is very important are fraud detection, intrusion detection, public health and medicine.
Some common causes of anomalies are things such as data from different types, natural variation
and data measurement or collection errors (Tan, Steinbach, & Kumar, 2006)
There are three fundamental approaches to anomaly detection: supervised, unsupervised,
and semisupervised. The difference in these techniques is the degree to which known outcomes
or classifications (class labels) are available for at least some portion of the data (Tan, Steinbach,
& Kumar, 2006)
The supervised anomaly detection approach requires a training set that contains both
normal and abnormal instances. Unsupervised anomaly detection techniques seek to assign a
score to each instance that reflects how abnormal that instance is. With semisupervised anomaly
detection techniques the goal is to find an anomaly score (label) for a set or group of instances
using information from labeled normal instances.
2.6 Ensemble Methods
As one could imagine the accuracy of predictive modeling algorithms is quite important.
Ensemble methods seek to improve the accuracy of classifiers (such as decision trees) by
combining the predictions of multiple classifiers (Figure 18). These methods construct a set of
base classifiers using the training data and make classifications for new instances by voting on
the prediction each base classifier makes. (Tan, Steinbach, &Kumar, 2006) Research has shown
that ensembles have a tendency to perform better than single classifiers in the ensemble (Opitz &
Macline, 1999) (Quinlan J. , 2006) (Freund & Schapire, 1999) (Berk, 2004). Bagging, boosting,
and stacking are well known ensemble techniques.
31
Figure 18 Ensemble Method Process
2.6.1 Bagging
Bagging or bootstrap aggregation trains each base classifier with a random redistribution
of the training (Opitz & Macline, 1999). Each of the training sets for the base classifiers are
generated by randomly selecting, with replacement, N examples, where N is the size of the
original training set. Since sampling is done with replacement, some of the instances may
possibly appear many times within the same training set. Likewise, some instances may be
omitted from the training set (see Figure 19).
As previously stated, bagging involves combining multiple models (for example decision
trees). These models can be combined by having each model vote on each new instance. The
class that receives the most votes is deemed to be the correct one. When the target variable is
numeric the average is used. The predictions or classifications made by voting become more
32
dependable as more votes are considered. Researchers have determined that bagging is effective
on the learning algorithms of which small changes in the training data can produce big changes
in predictions (Tan, Steinbach, & Kumar, 2006) (Witten & Frank, 2005) (Opitz & Macline,
1999).
2.6.2 Boosting
This ensemble method technique gets its name from its capacity to take a ?weak learning
algorithm? and ?boosting? it into a ?strong? learning algorithm (Freund & Schapire, 1999). A
weak learning algorithm is one that performs slightly better than random guessing. Like bagging,
boosting uses voting to combine the output of multiple models of the same type (for example
decision trees). Boosting, however, is iterative, unlike bagging in which each base classifier is
built separately; each new model is influenced by the performance of the models built before it.
New models are pushed to become experts for instances mishandled by earlier models. Boosting
weights a model?s contribution by its performance instead of giving equal weight to all as in
bagging (Witten & Frank, 2005). For example in Figure 19 (Opitz & Macline, 1999), assume
instance 1 is an outlier and hard to classify correctly. This instance appears more in later training
sets because boosting will focus (increase its weight) more on correctly classifying it.
33
Figure 19 Bagging and Boosting Example
2.6.3 Stacking
Stacking or Stacked Generalization differs from bagging and boosting in that it involves
combining classifiers of different types (e.g., decision trees and neural networks). It is less
commonly used because it can be difficult to analyze and there is no standard or accepted best
practice. Stacking eliminates voting by introducing the idea of a metalearner. Stacking attempts
to ?learn? which base classifiers are the most reliable by using the metalearner to determine how
best to combine the output of the base classifiers (Witten & Frank, 2005).
The inputs to the metalearner (level 1 model) are the predictions of the base classifiers
(level 0 model). Each level 1 instance will have as many attributes as there are level 0 learners.
34
When classifying, the new instance is given to each of the level 0 learners and those
results are given to the level 1 learner. The level 1 learner then ?learns? the best way to combine
the predictions to make the final prediction (Witten & Frank, 2005).
2.7 Other Related Information and Work
Discretization
Many data of the data mining algorithms used for classification tasks can?t tolerate
numeric data. For these algorithms the data must be transformed into a categorical format. This
transformation is known as discretization. Discretization has two steps:
? Decide how many categories to create
? Determine how to map the values to the categories
There are many ways to discretize continuous attributes (Dougherty, Kohavi, & Sahami,
1995). One of the simplest ways is to make use of the standard deviation to create categories.
For example, suppose attribute A is a nominal attribute. First obtain the minimum, maximum and
standard deviation values for A. Let min=1, max=5 and stddev=0.25. The initial category would
be 1 ?1.25 (min <= A < (min + stddev)). Any values greater than or equal to 1 and less than 1.25
would be updated to reflect this category in the dataset. The next category would be 1.25 ? 1.50
((min + stddev) <= A < (min + stddev + stddev)). Again, any values that fall within this range
would be reflected in the dataset. This process continues until the range is less than or equal the
max value.
Association Based Classification
There have been efforts made to combine association and traditional rule mining
techniques. (Liu, Hsu, & Ma, 1998), showed that this integration could be performed efficiently
without loss in classification accuracy. The combination focuses on the association rules where
35
the right side contains the target class attribute. These rules are identified as class association
rules. Data mining with this associative classification framework has 3 basic steps:
? Discretize any continuous attributes
? Generate all class association rules
? Build classifier bases on the rules generated
All association based algorithms e.g. Apriori (Agrawal & Srikant, 1994), can be adapted to use
for classification purposes. CBA (Liu, Hsu, & Ma, 1998) , CMAR (Li, Han, & Pei, 2001), and
HARMONY (Wang & Karypis, 2003) are examples of applications that utilize some form of this
technique. (Hu & Li, 2005)
Data Clustering and Predictive Modeling
For some classification or regression problems it is often advantageous to divide the data
into relatively homogenous groups or clusters and build a model for each group. The advantages
of building multiple models in this manner are improved accuracy, reliability and interpretability.
(Baumann & Germond, 1993), (Djukanovic, Babic, Sobajic, & Pao, 1993), (Oh & Han, 2001)
and (Lokmic & Smith, 2000) all explored this two step method in their research.
(Deodhar & Ghosh, 2007), investigated notion of infusing data clustering process with
the creation of predictive models. Their work focused on problems in which the independent
variables could be naturally sectioned into two or more groups and makes use of coclustering.
The authors partitioned the independent variables into groups with respect to their modes then
instantaneously clustered along each mode and built a predictive model for each cocluster. Co
clustering is a method that simultaneously clusters data along multiple axes. (Cheng & Church,
2000) (Cho, Dhillon, Guan, & Sra, 2004) This technique is usually applied to a matrix of data
points and takes advantage of the duality between two axes to improve upon traditional single
sided clustering. The rows in the matrix are data points and the columns are features.
36
The aforementioned approaches are all methods that have successfully combined some
aspect of data clustering and predictive modeling/classification. However, none of them address
the issue of attempting to make classifications when the data that is used to evaluate the model is
not as complete as the data used to build the model.
Robust Rule based Classification
(Hu & Li, 2005) and (Li, Topor, & Shen, 2002) focus on building a more robust
classifier. The authors proposed using association rules as a way to make rule base classifiers
more robust. The idea is that a traditional rule based classifier can be improved by adding a
certain level of redundancy to the rule set. This robust classifier contains more rules than the
generalized model and combats possible classification error due to missing and/or incomplete
data in the new/unseen instances. In terms of missing values, robustness involves the following:
? Ability to tolerate missing values in test data
? Ability to tolerate missing values in training data
Missing data is a well known issue in the field of data mining. There has been research
performed in regards to handling missing values in training data such as work presented by
(Clark & Niblett, 1989), (Mingers, 1989) and (Batista & Monard, 2003 ). General methods for
managing missing values involve preprocess substitution via estimations utilizing an approach
such as nearest neighbors. The work presented by (Hu & Li, 2005) and (Li, Topor, & Shen,
2002) differs from these existing methods in that no estimations or substitutions are made for any
missing values. This research employs a larger rule set to make the classifier more resilient
against missing test data. The authors designed, implemented and evaluated a practical robust
rule based classifier for 0, 1, 2, 3 and 4 missing attributes using an ordered rule set using
association classification rules. The rules were ordered by rule accuracy in descending order. A
37
rule set was generated using the training data and evaluated against test data with added missing
values. The missing values were added by randomly omitting some values in the test data. Each
test record in the test data had the same number of missing attributes on average. The results of
the experiment were positive and proved that their proposed method of adding redundancy via
association rules was indeed more resilient against missing attributes in the test data.
The work by (Hu & Li, 2005) is similar to the research presented in this dissertation as
both are focused on creating models that are a resilient against missing attribute information.
However, this research differs in that we present a method to add resiliency via divisive
clustering and rule based models created using a sequential covering algorithm. Multiple rule
sets are generated, combined and used as a collective model. The creation of multiple models on
different segments of the data allows for the creation of rules that may not have been created
using a more traditional approach which aids in adding resiliency against missing data.
2.8 Data Mining Tools
There are many data mining tools available. This section gives a brief description of
some of them.
Microsoft SQL Server
Microsoft SQL Server is a ?comprehensive, integrated data management and analysis
software that enables organizations to reliably manage missioncritical information and
confidently run today?s increasingly complex business applications (Microsoft, 2007)?. SQL
Server 2005 is the platform leader in a variety of areas including business intelligence and
database management systems.
38
Rapid Miner
?Rapid Miner (formerly YALE) is the worldleading opensource system for knowledge
discovery and data mining (Rapidi, 2008)?. This tool is not only available as a standalone
application for data analysis but also as a data mining engine that can be integrated into your
own products.
Weka
Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine
learning software that is written in Java. The software was developed at the University of
Waikato and is free under the GNU General Public License (Witten & Frank, 2005).
ORACLE Data Mining
Oracle Data Mining (ODM) is an option of Oracle Database 10g Enterprise Edition. This
tool provides the ability to ?produce actionable predictive information and build integrated
business intelligence applications (ORACLE, 2007)?.
39
Chapter 3
Research Details
3.1 Conceptual Overview
This section describes the approach taken to test the hypothesis discussed in section 1.2.
The objective of this research is to develop a new method for classification problems in which
the test data may not be as complete as data used to build a model that is still able to achieve
fairly accurate results. The problem is more formally expressed here.
Let A = {a1, a2,?, am} be a set of distinct attributes found in a table, D, where D =
{d1,d2,?,dn} indicates a set of all instances within a table. Attributes capture the basic
characteristics of an instance. Each attribute ai contains a set of distinct values Vi= {v1, v2,?, vn}.
A pattern is a set of one or more attribute value equivalent constraints that are conjunctively
joined (ai=vij and at=vtk...). A rule, r, is an implication in the form P ? c, where P is a pattern
(the condition) and c (the consequence), such that c is a member of A where c = ak and c is not a
member of P. A traditional rule based classifier would build a rule set R={r1, r2,?, rn}.Given a
test instance T, a rule r covers T if condition(r) ?T. A rule can make predictions on the instances
it covers, r(T) ? consequence(r). If the consequence(r) is the actual class of T then the rule has
made the correct prediction. If the class is not the actual class it is an incorrect prediction.
Given a test instance T?, with one or more missing attribute values from A, where V? (unknown
attribute values) is not a subset of V (known attribute values), denoted V?? V; r cannot make
predictions for T?. Therefore T? is unclassifiable by R.
40
Using the proposed approach, multiple rule sets would be generated via divisive data
clustering. A cluster, clustj ( kj ??1 ), where clustj ? D, such that each instance is a member
of one and only one cluster, Clust = {clust1, clust2,?,clustk},where k is the designated number of
initial clusters. An instance is assigned to a cluster, clustj, based on holistic similarities such that
instances within a cluster are more similar than those outside the cluster. A rule set, Rj, is
generated for each cluster. The multiple rule sets are combined into a super rule set, R? = {R1?
R2?? Rk}. R? is more resilient against missing attribute information as there exists a k of which
the attributes in R? are sufficiently diverse. Meaning, that there exists two or more rule sets within
R? that are not proper subsets with respect to their individual As, where ( kj ??1 and ki ??1
and ji ? ) of other rule sets within R?.
Clustering Rule Based Approach
As previously mentioned the premise of this research is to add robustness and resiliency
to rule based classification via divisive clustering. The idea is that there exists a k that will create
a model that will be more resilient against missing attributes. We present the Clustering Rule
based Approach (CRA) for classification. In CRA, the data is first clustered into k clusters using
divisive clustering. The clustered data is then used to train individual rule base classifiers using a
sequential covering algorithm. The rules are then merged into a combined rule set. Finally, each
test instances is evaluated against the combined rule set, predicting its output. Figure 20 depicts
this process. It is important to note that prior to the cluster creation any continuous attributes are
discretized.
41
Figure 20 CRA Flow
42
Chapter 4
Experiment Design
We conducted an experiment to test the validity of the CRA process. This section
describes the design of that experiment.
4.1 Data
The data for this experiment was obtained from the UCI Repository (Blake & Merz,
1998). These data sets varied in size, attributes, and attribute value data types (e.g. continuous,
nominal). Table 3 gives a description of each data set used.
Data Set
Number of
Records
Number of
Attributes
Number of
Classifications
Annealing 898 38 5
Thyroid 3,164 25 2
Mushroom 8,124 22 2
TicTacToe 958 9 2
Congressional
Voting 435 16 2
Table 3 Data Sets
4.2 Materials and Tools
All of the coding used to create and test the CRA process was completed utilizing Java.
All data was stored and manipulated using MYSQL database tables and Microsoft Excel. The
sequential covering algorithm used in this experiment was the PRISM algorithm. The divisive
clustering algorithm used is the approach employed by Applications Quest (AppsQuest).
43
4.2.1 PRISM
PRISM is a simple sequential covering algorithm introduced by (Witten & Frank, 2005).
The algorithm follows the process of the basic covering algorithm presented in chapter 2. The
pseudocode for the PRISM algorithm is shown in figure 21. PRISM generates only correct or
perfect rules. The success of a rule is measured by the accuracy formula p/t, where t is all of the
instances covered by the new rule of which p are positive examples of the class. Rules with less
than 100% accuracy are incorrect, meaning cases are assigned to the class that are not do not
actually belong to the class in question. PRISM continues to add conjuncts (conditions) to the
rule until it is perfect or there are no more attributes available for use.
For each class C
Initialize E to the instance set
While E contains instances in class C
Create a rule R with an empty lefthand side that predicts class C
Until R is perfect (or there are no more attributes to use) do
For each attribute A not mentioned in R, and each value v,
Consider adding the condition A=v to the lefthand side of R
Select A and v to maximize the accuracy p/t
(break ties by choosing the condition with the largest p)
Add A=v to R
Remove the instances covered by R from E.
Figure 21 PRISM Pseudocode
4.2.2 Applications Quest?
Applications Quest? (AppsQuest) is a data mining tool developed by Juan E. Gilbert to
aid institutions, agencies and businesses in the admissions/hiring process using data clustering.
(Gilbert J. E., Applications Quest: A Case Study on Holistic Admissions, 2008) (Gilbert J. E.,
Applications Quest: Computing Diversity, 2006) The goal of the system is to recommend either
44
the most diverse or most similar group of applicants depending upon the needs of the
organization by comparing applicants in a holistic way. There are many elements of an
application that can be easily compared. Take GPA (grade point average), for example. If one
student has a GPA of 3.5 and another has 2.7 the GPA of 3.5 is clearly higher. This comparison
is not as straightforward when there are non numeric elements of the application to consider.
AppsQuest differs from other divisive clustering applications in its treatment of nominal
attributes (nonnumeric). Traditionally, data clustering applications require the nominal attributes
be assigned a value that allows them to be compared (e.g. 0 if the values are different and 1 if the
values are the same). City, state, race, and gender are examples of nominal attributes. To
provide a more accurate measure to compare nominal attributes the Nominal Population Metric
(NPM) was created. The NPM process begins by identifying the nominal attributes. For
clarification purposes, let?s use Race as an example. There are 20 applicants in the pool where
Race is an attribute. This attribute has the following unique values: Bblack, Wwhite, H
hispanic, AAsian, and Oother. Using the AppsQuest approach every application is compared to
every other application in its entirety. For 20 applications compared 2 at a time there would be
190 comparisons. These results were obtained using the formula n C r = n!/[(nr)! r!]. The NPM
processes the nominal attributes in the following manner (Gilbert J. E., 2008):
1. Calculate the total number of combinations for all applications using n C r. Continuing
the example, the total number of 190 combinations and place into matrix of combination
and application pairs. This matrix is called the similarity matrix (see Table 4).
Application 1 Application 2
Difference Measure
1 2 20%
1 3 15%
45
1 4 75%
Table 4 Similarity Matrix with Difference Measures Example
2. Calculate the number of unique attributes values. For this example, the Race attribute has
5.
3. Calculate the number of combinations for the unique nominal attributes using n C r. The
result is 10 combinations for the Race attribute.
4. For the 10 combinations of the nominal attribute value pairs, compute the coverage
percentage within the application similarity matrix. This equates to the number of rows
in the application similarity matrix have a value for Race pairs WB, WH, BA, etc.
These figures are placed into a nominal population matrix. (see Table 5)
Race Coverage
WB 30%
WH 15%
BA 5%
Table 5 Example Nominal Population Matrix
5. The nominal population matrix contains the attribute pair coverage across all
comparisons. This is an accurate measure of the impact of the nominal attribute value
pairs based on their existence within the data set.
6. If necessary the Coverage values are adjusted. This is the objective when the application
is when measuring difference vs. similarity. The WB Race pair had the greatest coverage
in the nominal population matrix and is therefore the most common attribute value pair.
The BA value pair is the least common or most novel. The BA pair should be receive
more ?credit? as it is the most novel. The credit is calculated proportionately in relation to
other attribute value pairs. To accomplish this, the Coverage values are subtracted from
100% or 1.00 (see Table 6). This inverts the Coverage values such that BA is given more
credit compared to WB relative to their actual coverage within application similarity
matrix.
Race Coverage
WB 70%
WH 85%
46
BA 95%
Table 6 Nominal Population Matrix Example
7. The Coverage values in the nominal population matrix are now the Nominal Population
Metrics which can be used in clustering algorithms as a method to accurately compare
nominal attributes. The question, ?What?s the difference between a Black and Asian
along the Race attribute?? can be answered. The NPM value is 95% in this example.
An alternative approach is to use the distribution of the values for each attribute. Using
the previous example, Table 7 depicts the distribution of race within the data set (e.g. 70% of the
instances have a W for race).
Race Distribution
B 10%
W 70%
A 15%
H 5%
Table 7 Example Nominal Population Matrix
Race Distribution
B 90%
W 30%
A 85%
H 95%
Table 8 Example Nominal Population Matrix Adjusted
1. Compute the distribution for each nominal attribute. (see Table 7)
2. Use the distribution value for each attribute at time of comparison. For example if
calculating the difference between B and A along the Race attribute .1 and .15 would be
used, respectively, in the calculation.
3. Compute the Nominal difference by identifying the two smallest distribution values. In
this example this corresponds to H and B with values .05 and .1 respectively. This
identifies the 2 most novel attributes based on their frequency of occurrence in the data
set.
47
4. Calculate the sum of the 2 most novel attributes to determine their combined frequency
(.05 + .1=.15). This frequency becomes the baseline for normalization as these are the
most novel pairwise nominal attribute values (for Race in this example).
5. Calculate the normalization by subtracting the combined frequency from 1. In this
example, 1  .15 = .85. This value becomes the normalization factor. All pairwise
comparisons are divided by this value. The HB comparison yields a difference value of 1
(.85/.85). The WH comparison yields a difference of .294 ([1(.70 + .05)]/.85 =.294).
The difference between WH is .294 vs. 1.00 for HB.
In general, the NPM covers the use of the distribution of nominal attribute values in
efforts to determine the similarity or difference between nominal attribute values.
4.3 Procedure
The data was loaded into MYSQL database tables. The data was discretized using the
standard deviation to create the corresponding categories for each continuous attribute (if any).
The data was then split into training and test data sets. 80% of the data was used for training and
20% of the data was used for testing purposes. Proportionate samples for each class were used.
The training data was clustered using AppsQuest divisive clustering algorithm. PRISM was then
run on the clustered data. The resulting rule sets were then merged into a collective model. It is
important to note that during the merging process any duplicate rules were removed.
Contradictory rules were also resolved. Contradictory rules are rules in which the conditions for
two or more rules are the same but the resulting classes are different. If this occurred the rule
with the highest coverage was kept and the other rules removed. We examined the effects of the
clustering rule based approach using both ordered and unordered rule sets for 0, 1, 2, 3, and 4
missing attributes. The attributes determined to be ?missing? were chosen at random and
modified in test data prior to each iteration of the experiment. (Hu & Li, 2005) The accuracy of
each rule was determined using the calculation p/t. In this instance the t corresponds to the total
number of instances covered in relation to the entire training data set instead of a particular class
48
and p is the number of instances that are positive examples of the target classification. A voting
scheme was used in the unordered rule set using the accuracy of each rule that matched the test
instance. The class with the highest sum of accuracies was used. The rule with the highest
coverage was used to make the prediction if a tie occurred. In the ordered rule set the rules were
arranged in decreasing order of accuracy. The first rule that matched was used to make the
prediction. The results reported are based on an average of ten runs per dataset. Meaning for
each data set the process was executed with 10 training and 10 individual test sets.
4.4 Expected Outcomes
It is expected that for some k, the CRA process will produce a collective model that will
be more tolerant of missing attribute information in the new unseen test instances. The collective
model will be larger and have shorter simpler rules which often achieve high classification
accuracy and aid in adding robustness. (Witten & Frank, 2005) The rule length refers to the
number of conditions (conjuncts) the rule contains. Longer rules are more specific and shorter
rules are more general.
49
Chapter 5
Research Findings
5.1 Results
The goal of this research is to introduce a new method for adding robustness to rule based
classifiers using divisive clustering. The resulting model would be more resilient against missing
attribute information in new/unseen data. This section describes the results of the experiment
conducted to test the validity of this method.
k values
Given the number of different datasets used in the experiment, there are no guarantees
that the best performing k on one dataset will translate to the best performing k on other datasets.
As such, the performance of the CRA process is measured against a set of k values: 1, 3, 5, 7, 9,
11, 13, and 15; however, it is possible to build the model using k values across a larger range, i.e.
1, 2, 3, ? N1, but that approach was not necessary. The results for k value 1 reflect the
performance of the model using the traditional methods (control group).
Results for Congressional Voting Data
Tables 9 through 13 display the results for the tests performed using the Congressional
Voting dataset. The traditional classifier performed very well on the complete data set as shown
in Table 4 achieving an accuracy of 97% and 99% for the unordered and ordered models
respectively. The CRA process performed well and was able to match the performance of the
50
traditional unordered classifier with an accuracy of 97%. For 1, 2, 3 and 4 missing attributes
there existed a k that produced a collective model that obtained higher classification accuracy for
either type of model. These values are highlighted in green. Instances in which the CRA process
?tied? with the traditional model are highlighted in yellow. (Note: Highlighting scheme green
for wins, yellow for ties will be consistent for remaining results.) Figures 22 and 23 depict a
graphical representation of the classification accuracy for each k. As the number of missing
attributes increased, the performance of the traditional method (k value 1 for ordered and
unordered models) suffered while the CRA process performed more consistently. Figure 24
depicts the rule average rule lengths for the models. The rule lengths were smaller for models
built using the CRA method. On average, the rule lengths were 2.81 and 1.84 for the traditional
and CRA methods respectively. Additionally, the size of the CRA generated models were larger
(contained more rules) than the traditional models. The CRA models contained, on average,
approximately 25 rules while the traditional model contained 22. Also, there was a decrease in
the number of instances not classified when the CRA method was used.
Number
Clusters
Unordered
Accuracy
Ordered
Accuracy
Average
Number
Not
Classified
Average
Number
of Rules
Average
Number
Unique
Attributes
Average
Rule
Length
1 .97 .99 1 23.9 33.7 2.83
3 .97 .98 0 24.4 30.2 2.22
5 .95 .95 0 26.8 31.5 2.01
7 .95 .96 0 26.8 30.6 1.94
9 .93 .95 0 26 29 1.77
11 .91 .97 0 24.8 28.3 1.69
13 .92 .95 0 24.8 28.6 1.71
15 .91 .94 0 24.9 28.5 1.68
Table 9 Congressional Voting Results for Complete Data (No Missing Attributes)
51
Number
Clusters
Unordered
Accuracy
Ordered
Accuracy
Average
Number
Not
Classified
Average
Number
of Rules
Average
Number
Unique
Attributes
Average
Rule
Length
1 .92 .96 1.75 21.6 31.5 2.78
3 .94 .97 1 22.8 29 2.17
5 .94 .95 0 24.9 29.3 1.98
7 .93 .95 0 24.9 28.6 1.91
9 .91 .91 0 23.3 25.7 1.69
11 .91 .92 0 22.7 25.6 1.63
13 .9 .93 0 22.6 25.4 1.65
15 .92 .93 0 23.9 25.8 1.59
Table 10 Congressional Voting 1 Missing Attribute
Number
Clusters
Unordered
Accuracy
Ordered
Accuracy
Average
Number
Not
Classified
Average
Number
of Rules
Average
Number
Unique
Attributes
Average
Rule
Length
1 .87 .89 4 21.45 30.18 2.68
3 .89 .91 1.36 23.18 30.45 2.29
5 .9 .92 0 25.18 31.64 2.1
7 .89 .92 0 25.91 20.45 1.96
9 .9 .92 0 25.91 29 1.76
11 .87 .91 0 25.82 28.73 1.73
13 .87 .9 0 24.36 27.36 1.68
15 .89 .91 0 24.91 27.27 1.63
Table 11 Congressional Data Results 2 Missing Attributes
Number
Clusters
Unordered
Accuracy
Ordered
Accuracy
Average
Number
Not
Classified
Average
Number
of Rules
Average
Number
Unique
Attributes
Average
Rule
Length
1 .85 .86 6.64 21.27 32.09 2.86
3 .92 .92 .82 24.09 29.82 2.05
5 .92 .93 0 25.73 30.18 2.06
7 .91 .91 0 26.27 29.45 1.96
9 .88 .91 0 25.73 28.27 1.83
11 .85 .89 0 24.82 26.36 1.67
13 .87 .91 0 24.91 26.64 1.59
15 .87 .9 0 25 25.91 1.57
Table 12 Congressional Voting Data Results 3 Missing Attributes
Number
Clusters
Unordered
Accuracy
Ordered
Accuracy
1 .77
3 .86
5 .89
7 .89
9 .85
11 .86
13 .87
15 .88
Table 13 Congressional Voting Results 4 Missing Attributes
Figure 22 Congressional Voting
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1 3
Ac
cu
rac
y
52
Average
Number
Not
Classified
Average
Number
of Rules
Average
Number
Unique
Attributes
.79 9.18 24.81 34.18
.88 2 25.82 32.27
.9 0 25.55 29.55
.9 0 27.18 30.64
.89 0 26.45 28.55
.89 0 24.73 27.45
.9 0 24.73 27.36
.9 0 25.36 26.73
Accuracies Unordered Model
5 7 9 11 13 15
k value
Average
Rule
Length
2.9
2.3
1.92
1.91
1.76
1.75
1.7
1.56
0 Missing
1 Missing
2 Missing
3 Missing
4 Missing
Figure 23 Congressional Voting
Figure 24 Congressional Voting Average Rule Lengths
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1 3
Ac
cu
rac
y
0
0.5
1
1.5
2
2.5
3
1 3
Av
era
ge
R
ule
Le
ng
th
53
Accuracies Ordered Model
5 7 9 11 13 15
k value
54
Cluster Rules Accuracy Coverage
1 0
If phys_fee_freeze=`n` and adopt_budget_res=`y` then
class=`democrat` 1 173
2 0
If phys_fee_freeze=`n` and crime=`n` then
class=`democrat` 1 126
3 0
If syn_corp_cuts=`y` and immigration=`n` and
water_proj_cost_shar=`n` and handicaped_infants=`n`
and exp_admin_safrica=`n` then class=`democrat` 1 2
4 0
If phys_fee_freeze=`n` and syn_corp_cuts=`y` then
class=`democrat` 1 94
5 0
If el_salvador_aid=`n` and syn_corp_cuts=`y` and
immigration=`n` then class=`democrat` 1 35
6 0
If phys_fee_freeze=`?` and education_spending=`n` then
class=`democrat` 1 6
7 0
If exp_admin_safrica=`?` and adopt_budget_res=`y` and
aid_nic_contras=`n` then class=`democrat` 1 6
8 0
If education_spending=`n` and exp_admin_safrica=`n`
and crime=`y` then class=`democrat` 1 5
9 0
If phys_fee_freeze=`n` and education_spending=`y` then
class=`democrat` 1 23
10 0
If superfund_right_to_sue=`?` and syn_corp_cuts=`y`
then class=`democrat` 1 7
11 0
If superfund_right_to_sue=`?` and
exp_admin_safrica=`y` then class=`democrat` 1 7
12 0
If phys_fee_freeze=`n` and exp_admin_safrica=`?` then
class=`democrat` 1 60
13 0
If adopt_budget_res=`?` and mx_missile=`y` then
class=`democrat` 1 5
14 0
If syn_corp_cuts=`y` and superfund_right_to_sue=`n`
and adopt_budget_res=`n` then class=`democrat` 1 8
15 0
If exp_admin_safrica=`?` and handicaped_infants=`y`
and adopt_budget_res=`n` then class=`democrat` 1 3
16 0
If adopt_budget_res=`n` and education_spending=`y` and
handicaped_infants=`n` then class=`republican` 0.92391 92
17 0
If relig_grp_schools=`y` and syn_corp_cuts=`n` then
class=`republican` 0.75781 128
18 0
If adopt_budget_res=`n` and relig_grp_schools=`n` then
class=`republican` 0.71429 14
19 0
If superfund_right_to_sue=`y` and
exp_admin_safrica=`y` and syn_corp_cuts=`n` and
immigration=`y` and mx_missile=`y` then
class=`republican` 0.46154 13
20 0
If crime=`y` and handicaped_infants=`y` and
syn_corp_cuts=`y` and relig_grp_schools=`n` then
class=`republican` 0.25 4
55
21 0
If el_salvador_aid=`y` and immigration=`n` and
mx_missile=`n` and handicaped_infants=`y` and
duty_free_exports=`n` and relig_grp_schools=`y` then
class=`republican` 0.66667 12
22 0
If el_salvador_aid=`y` and water_proj_cost_shar=`n` and
education_spending=`y` and adopt_budget_res=`y` then
class=`republican` 0.85714 7
23 0
If duty_free_exports=`?` and handicaped_infants=`n` and
exp_admin_safrica=`?` and relig_grp_schools=`y` then
class=`republican` 0.75 4
24 0
If handicaped_infants=`n` and relig_grp_schools=`n` and
immigration=`y` and water_proj_cost_shar=`n` and
duty_free_exports=`y` and exp_admin_safrica=`y` and
education_spending=`n` then class=`republican` 0.2 5
25 0 If water_proj_cost_shar=`y` then class=`republican` 0.39264 163
26 0 If exp_admin_safrica=`?` then class=`republican` 0.17284 81
Table 14 Congressional Voting Example Traditional Rule Set
Table 14 displays one of the rule sets generated in the traditional manner using the
Congressional Voting data. Table 15 displays one of rule sets generated with the CRA process.
The rule set in Table 15 is a collective model derived from 7 individual models (7 clusters).
Cluster Rules Accuracy Coverage
1 0
If water_proj_cost_shar=`n` and duty_free_exports=`n`
and handicaped_infants=`n` and exp_admin_safrica=`n`
then class=`democrat` 0.125 16
2 0
If phys_fee_freeze=`n` and education_spending=`y` then
class=`democrat` 1 23
3 0
If adopt_budget_res=`y` and duty_free_exports=`n` then
class=`democrat` 0.85915 71
4 0
If education_spending=`n` and duty_free_exports=`y`
then class=`democrat` 0.9619 105
5 0
If superfund_right_to_sue=`n` and anti_sat_test_ban=`n`
then class=`democrat` 0.75 20
6 0
If education_spending=`n` and exp_admin_safrica=`?`
then class=`democrat` 0.96364 55
7 0 If el_salvador_aid=`?` then class=`democrat` 0.91667 12
8 0 If mx_missile=`y` then class=`democrat` 0.9125 160
9 0
If water_proj_cost_shar=`y` and exp_admin_safrica=`y`
and duty_free_exports=`n` then class=`republican` 0.54386 57
10 0
If phys_fee_freeze=`y` and adopt_budget_res=`n` and
superfund_right_to_sue=`y` then class=`republican` 0.97143 105
12 0 If phys_fee_freeze=`y` and adopt_budget_res=`n` then 0.95041 121
56
class=`republican`
13 0
If exp_admin_safrica=`y` and anti_sat_test_ban=`n` and
education_spending=`y` then class=`republican` 0.86275 51
15 1 If adopt_budget_res=`y` then class=`republican` 0.07576 198
16 2 If crime=`n` then class=`democrat` 0.98485 132
17 2 If duty_free_exports=`y` then class=`democrat` 0.91729 133
18 2 If handicaped_infants=`n` then class=`democrat` 0.42932 191
19 2 If immigration=`n` then class=`democrat` 0.65143 175
20 2
If phys_fee_freeze=`n` and adopt_budget_res=`y` then
class=`democrat` 1 173
21 2
If immigration=`y` and duty_free_exports=`n` and
superfund_right_to_sue=`n` and mx_missile=`y` and
anti_sat_test_ban=`y` and water_proj_cost_shar=`y` and
syn_corp_cuts=`y` then class=`republican` 0.33333 3
22 2
If crime=`y` and syn_corp_cuts=`n` and
duty_free_exports=`n` and water_proj_cost_shar=`n`
then class=`republican` 0.85417 48
23 3 If phys_fee_freeze=`n` then class=`democrat` 0.9898 196
24 3 If crime=`y` then class=`republican` 0.62871 202
26 4 If aid_nic_contras=`y` then class=`democrat` 0.91935 186
27 4 If adopt_budget_res=`?` then class=`democrat` 0.77778 9
28 5 If superfund_right_to_sue=`y` then class=`democrat` 0.36047 172
29 5 If adopt_budget_res=`n` then class=`democrat` 0.17021 141
30 5
If handicaped_infants=`n` and education_spending=`n`
and el_salvador_aid=`y` and adopt_budget_res=`y` and
mx_missile=`n` and relig_grp_schools=`y` and
water_proj_cost_shar=`y` then class=`republican` 0.2 5
31 6 If aid_nic_contras=`n` then class=`democrat` 0.26351 148
32 6
If aid_nic_contras=`y` and el_salvador_aid=`y` then
class=`republican` 0.47619 21
33 6
If anti_sat_test_ban=`y` and education_spending=`n` and
immigration=`y` then class=`republican` 0.125 72
Table 15 Congressional Voting Example Collective Rule Set
Results for Anneal data
Tables 16 through 20 contain the results for the performed using the Anneal data set. The
traditional classifiers both performed well on the complete data set achieving a 90% and 94% for
the unordered and ordered models respectively. The CRA process achieved a higher accuracy
57
than the traditional model for tests with 2, 3 and 4 missing attributes (highlighted in green).
Figures 25 and 26 depict a graphical representation of the classification accuracy for each k.
There was a decrease in the number of instances not classified when the CRA method was
utilized with the occurrence of missing attributes.
Number
Clusters
Unordered
Accuracy
Ordered
Accuracy
Average
Number
Not
Classified
Average
Number
of Rules
Average
Number
Unique
Attributes
Average
Rule
Length
1 .9 .94 0 28 46.4 3.7
3 .82 .9 0 46.4 52.6 2.75
5 .78 .86 0 52.3 53.7 2.44
7 .76 .86 0 51.4 52.9 2.54
9 .76 .85 0 52.8 51.6 2.21
11 .76 .86 0 53.4 52.6 2.17
13 .76 .85 0 54.4 52.7 2.07
15 .76 .85 0 52.5 49.8 1.93
Table 16 Anneal Results Complete Data (No Missing Attributes)
Number
Clusters
Unordered
Accuracy
Ordered
Accuracy
Average
Number
Not
Classified
Average
Number
of Rules
Average
Number
Unique
Attributes
Average
Rule
Length
1 .87 .91 .6 31 46.3 3.46
3 .83 .9 0 44.6 51.7 2.79
5 .77 .87 0 51.7 55.7 2.58
7 .76 .86 0 52 53 2.38
9 .76 .86 0 50.8 50.2 2.23
11 .76 .84 0 49.4 50.7 2.1
13 .77 .84 0 50.8 49.5 1.9
15 .76 .84 0 51.6 50.2 1.86
Table 17 Anneal Results 1 Missing Attribute
Number
Clusters
Unordered
Accuracy
Ordered
Accuracy
Average
Number
Not
Classified
Average
Number
of Rules
Average
Number
Unique
Attributes
Average
Rule
Length
1 .85 .87 2.1 33.6 50.7 3.88
3 .8 .88 0 49.3 55.4 3
5 .78 .86 0 55.7 55.7 2.64
7 .78 .86 0 54.4 54.9 2.62
9 .77 .87 0 53.8 51.7 2.36
58
11 .76 .87 0 53.5 50 2.13
13 .76 .85 0 52.7 48.2 1.98
15 .76 .85 0 52.3 47.1 1.81
Table 18 Anneal Results 2 Missing Attributes
Number
Clusters
Unordered
Accuracy
Ordered
Accuracy
Average
Number
Not
Classified
Average
Number
of Rules
Average
Number
Unique
Attributes
Average
Rule
Length
1 .85 .88 2.2 30.6 47.8 3.49
3 .82 .9 0 44.2 52.8 2.97
5 .78 .87 0 51.7 52.6 2.48
7 .76 .86 0 53.5 54.6 2.41
9 .76 .86 0 53.8 53.4 2.27
11 .76 .84 0 52.5 51.6 2.16
13 .76 .85 0 54.9 51.3 1.98
15 .76 .84 53 48.8 1.9
Table 19 Anneal Results 3 Missing Attributes
Number
Clusters
Unordered
Accuracy
Ordered
Accuracy
Average
Number
Not
Classified
Average
Number
of Rules
Average
Number
Unique
Attributes
Average
Rule
Length
1 .82 .84 4.8 33.3 48.3 3.61
3 .81 .87 0 47.3 56.7 3.19
5 .78 .84 0 46.6 54.7 2.76
7 .77 .84 0 50.3 51.4 2.41
9 .76 .83 0 52.7 51.5 2.23
11 .76 .83 0 54 51 2.12
13 .76 .83 0 56 51.8 2
15 .76 .83 0 51.6 48.4 1.89
Table 20 Anneal Results 4 Missing Attributes
Figure 25
Figure 26
Figure 27 depicts the average
set. The CRA method produced shorter rules.
methods were 2.32 and 3.63 on average.
0.65
0.7
0.75
0.8
0.85
0.9
0.95
1 3
Ac
cu
rac
y
0.76
0.78
0.8
0.82
0.84
0.86
0.88
0.9
0.92
0.94
0.96
1 3
Ac
cu
rac
y
59
Anneal Accuracies Unordered Model
Anneal Accuracies Ordered Model
rule length for the models created using the Anneal data
The rule lengths for the collective and traditional
Also, the models created via the CRA method were
5 7 9 11 13 15
k value
5 7 9 11 13 15
k value
0 Missing
1 Missing
2 Missing
3 Missing
4 Missing
0 Missing
1 Missing
2 Missing
3 Missing
4 Missing
generally larger than those created with the traditional method.
52 rules, on average, while the traditional model contained 31.
examples of the traditional and collective models created.
Figure
Cluster Rules
1 0 If steel=`S` and carbon=`0
2 0 If squality=`?` and bf=`Y` and steel=`A` then class=`1`
3 0 If exptl=`Y` then class=`1`
4 0
If squality=`?` and family=`?` and carbon=`0
strength=`0114.95` then class=`2`
5 0
If con=`?` and temper_rolling=`?` and carbon=`0
and family=`?` and squality=`?` then class=`2`
6 0
If width=`810.61
temper_rolling=`?` and con=`?` then class=`2`
7 0 If strength=`574.76
8 0
If formal_ability=`2` and len=`0
enamelability=`?` then class=`3`
9 0 If bl=`Y` then class=`3`
10 0 If bf=`Y` then class=`3`
0
0.5
1
1.5
2
2.5
3
3.5
4
1 3
Av
era
ge
R
ule
Le
ng
th
60
The collective model contained
Tables 21 and 22
27 Anneal Average Rule Lengths
Accuracy
13.72` then class=`1`
13.72` and
0.81481
13.72`
0.42857
1215.92` and carbon=`013.72` and
0.71429
689.71` then class=`2`
1870.92` and
0.90476
0.85124
0.95192
5 7 9 11 13 15
k value
display
Coverage
1 4
1 1
1 1
81
14
7
1 8
210
121
104
0 Missing
1 Missing
2 Missing
3 Missing
4 Missing
61
11 0 If con=`S` and thick=`0.241.11` then class=`3` 0.91228 285
12 0
If family=`?` and strength=`0114.95` and
width=`810.611215.92` and hardness=`024.75` then
class=`3` 0.86111 36
13 0
If family=`?` and width=`405.31810.61` and
temper_rolling=`?` and strength=`0114.95` then
class=`3` 0.88333 240
14 0
If family=`?` and width=`1215.921525` and oil=`?` then
class=`3` 0.8253 166
15 0 If steel=`R` and exptl=`?` then class=`3` 0.71362 213
16 0
If squality=`G` and thick=`0.241.11` and len=`0
1870.92` then class=`3` 0.93548 93
17 0
If thick=`1.982.85` and blue_bright_varn_clean=`?` then
class=`3` 0.80488 41
18 0
If formal_ability=`?` and bt=`?` and oil=`?` and
width=`405.31810.61` and bw_me=`?` then class=`3` 0.85507 69
19 0 If width=`0405.31` and family=`?` then class=`3` 0.82143 84
20 0
If ferro=`?` and chrom=`?` and hardness=`024.75` and
phos=`?` and width=`405.31810.61` and bw_me=`?`
then class=`3` 0.84153 183
21 0 If con=`S` then class=`3` 0.84966 439
22 0 If hardness=`49.4974.24` then class=`3` 1 66
23 0 If formal_ability=`2` then class=`3` 0.86139 303
24 0
If con=`?` and steel=`A` and bt=`?` and width=`405.31
810.61` and lustre=`?` and oil=`?` and bf=`?` then
class=`5` 0.51429 35
25 0
If con=`?` and temper_rolling=`?` and carbon=`013.72`
and bt=`?` and sfinish=`?` and bore=`0000` and bf=`?`
then class=`5` 0.57143 70
26 0
If steel=`?` and squality=`?` and
blue_bright_varn_clean=`?` and len=`01870.92` then
class=`5` 0.92308 13
27 0 If thick=`0.241.11` then class=`5` 0.06865 437
28 0
If con=`?` and squality=`G` and bt=`?` and bl=`?` and
lustre=`?` and bore=`0000` and bw_me=`?` and bf=`?`
then class=`U` 0.31373 51
29 0 If bl=`?` then class=`U` 0.05369 596
Table 21 Example Anneal Traditional Rule Set
RuleID Cluster Rules Accuracy Coverage
2 0
If squality=`?` and steel=`R` and con=`S` then
class=`2` 1 61
3 0 If sfinish=`P` then class=`2` 1 6
62
4 0 If enamelability=`2` then class=`2` 1 5
5 0 If shape=`COIL` then class=`3` 0.82121 330
6 0
If thick=`0.241.11` and family=`?` and
width=`405.31810.61` and enamelability=`?` then
class=`3` 0.8984 187
7 0 If family=`?` then class=`3` 0.82736 614
8 0 If con=`S` then class=`3` 0.84966 439
9 0
If thick=`1.111.98` and bl=`?` and
width=`1215.921525` and bf=`?` and
enamelability=`?` and strength=`0114.95` then
class=`5` 0.31818 22
10 0
If non_ageing=`N` and thick=`1.111.98` and
formal_ability=`3` and len=`01870.92` and
shape=`SHEET` then class=`5` 0.83333 12
11 0 If non_ageing=`N` and bw_me=`B` then class=`5` 0.53333 15
12 0 If steel=`A` and bw_me=`?` then class=`5` 0.12075 265
13 0 If thick=`0.241.11` then class=`5` 0.06865 437
14 0 If bc=`?` then class=`U` 0.04469 716
15 1 If steel=`S` and carbon=`013.72` then class=`1` 1 4
16 1 If exptl=`Y` then class=`1` 1 1
17 1
If squality=`?` and carbon=`013.72` and family=`?`
and width=`0405.31` then class=`2` 0.75 12
18 1
If steel=`R` and squality=`?` and bt=`?` and
shape=`COIL` then class=`2` 0.64286 14
19 1
If con=`?` and carbon=`013.72` and bore=`0000`
and family=`?` and squality=`?` and bt=`?` and
width=`405.31810.61` then class=`2` 0.18182 11
20 1 If strength=`574.76689.71` then class=`2` 1 8
21 1
If family=`?` and temper_rolling=`?` and
width=`405.31810.61` and formal_ability=`?` then
class=`3` 0.90625 64
22 1 If con=`S` and thick=`0.241.11` then class=`3` 0.91228 285
23 1
If thick=`1.982.85` and
blue_bright_varn_clean=`?` then class=`3` 0.80488 41
24 1 If formal_ability=`2` then class=`3` 0.86139 303
25 1 If strength=`459.81574.76` then class=`3` 0.61538 13
26 1 If family=`?` and thick=`0.241.11` then class=`3` 0.8973 370
27 1 If bt=`Y` and hardness=`024.75` then class=`3` 0.76923 39
28 1 If thick=`1.111.98` then class=`3` 0.55621 169
29 1 If carbon=`41.1554.86` then class=`3` 1 13
30 1 If con=`A` then class=`5` 0.46667 30
31 1
If thick=`0.241.11` and carbon=`013.72` and
strength=`0114.95` and hardness=`024.75` then 0.09772 307
63
class=`5`
32 1
If squality=`G` and strength=`0114.95` then
class=`U` 0.14483 145
33 1 If width=`1215.921525` then class=`U` 0.03286 213
34 2
If squality=`?` and steel=`A` and bf=`Y` then
class=`1` 1 1
35 2
If squality=`?` and family=`?` and carbon=`013.72`
and bc=`?` and bl=`?` and bore=`0000` and bt=`?`
and bw_me=`?` and cbond=`?` and chrom=`?` and
corr=`?` and enamelability=`?` and exptl=`?` and
ferro=`?` and hardness=`024.75` and jurofm=`?`
and lustre=`?` and marvi=`?` and non_ageing=`?`
and oil=`?` and packing=`?` and phos=`?` and
product_type=`C` and len=`01870.92` and
blue_bright_varn_clean=`?` and sfinish=`?` and
strength=`0114.95` and bf=`Y` and width=`405.31
810.61` then class=`2` 0.5 2
36 2 If steel=`M` then class=`2` 0.5 16
37 2 If blue_bright_varn_clean=`B` then class=`2` 1 3
38 2 If con=`S` and shape=`SHEET` then class=`2` 0.19244 291
39 2
If width=`0405.31` and ferro=`?` and hardness=`0
24.75` then class=`3` 0.75676 74
40 2 If thick=`0.241.11` and family=`?` then class=`3` 0.8973 370
41 2
If bt=`?` and formal_ability=`?` and oil=`?` and
shape=`SHEET` and sfinish=`?` then class=`3` 0.98485 66
42 2 If thick=`1.982.85` and con=`?` then class=`3` 0.88571 35
43 2
If ferro=`?` and hardness=`024.75` and phos=`?`
and thick=`1.111.98` then class=`3` 0.60769 130
44 2 If hardness=`49.4974.24` then class=`3` 1 66
45 2
If shape=`SHEET` and strength=`0114.95` and
bt=`?` and len=`01870.92` and bw_me=`?` and
oil=`?` and carbon=`013.72` then class=`5` 0.176 125
46 2
If shape=`COIL` and bf=`?` and lustre=`?` and
bt=`?` then class=`5` 0.08511 235
47 2
If steel=`A` and bf=`?` and bore=`0000` and
lustre=`?` then class=`U` 0.10425 259
Table 22 Example Anneal Collective Rule Set
64
Results for Thyroid Data
Tables 23 through 28 display the results of the test performed using the Thyroid data set.
The traditional models performed very well achieving accuracies of 97% and 98% for the
unordered and ordered models respectively. A collective model was created that was able to
match the performance of the traditional model in the presence of 0, 1, 2, and 3 missing attributes
(highlighted in yellow). The CRA process created a model that achieved a higher accuracy than
the traditional method in the presence of 4 missing attributes. Figures 28 and 29 provide a
graphical representation of the accuracies achieved for each k.
Number
Clusters
Unordered
Accuracy
Ordered
Accuracy
Average
Number
Not
Classified
Average
Number
of Rules
Average
Number
Unique
Attributes
Average
Rule
Length
1 .97 .98 .1 52.3 59.7 2.22
3 .96 .98 0 59 64.9 2.13
5 .95 .97 0 51.7 51.4 2.01
7 .95 .97 0 50.3 49 1.78
9 .95 .96 0 48.2 46.1 1.7
11 .95 .96 0 47.2 46 1.72
13 .95 .96 0 47.7 45.7 1.7
15 .95 .96 0 47.9 45.8 1.62
Table 23 Thyroid Results Complete Data (No Missing Attributes)
Number
Clusters
Unordered
Accuracy
Ordered
Accuracy
Average
Number
Not
Classified
Average
Number
of Rules
Average
Number
Unique
Attributes
Average
Rule
Length
1 .97 .97 1.7 53.4 60.1 2.29
3 .95 .97 .3 58.4 64.2 2.18
5 .95 .97 .1 50.2 49.5 1.92
7 .95 .96 0 47.1 46.2 1.81
9 .95 .96 0 49 46.8 1.71
11 .95 .96 0 47.4 44.5 1.67
13 .95 .96 0 47.4 43.8 1.66
15 .96 .96 0 46.3 42.2 1.63
Table 24 Thyroid Results 1 Missing Attribute
65
Number
Clusters
Unordered
Accuracy
Ordered
Accuracy
Average
Number
Not
Classified
Average
Number
of Rules
Average
Number
Unique
Attributes
Average
Rule
Length
1 .96 .97 3.4 57.9 59.8 2.3
3 .96 .97 .8 62.7 61 2.17
5 .95 .97 0 49.3 49.2 2.12
7 .95 .96 0 44.7 44.2 1.72
9 .95 .96 0 44.9 42.1 1.64
11 .95 .96 0 43.7 40.4 1.63
13 .95 .96 0 43.1 40 1.61
15 .95 .96 0 41.4 38.1 1.56
Table 25 Thyroid Results 2 Missing Attributes
Number
Clusters
Unordered
Accuracy
Ordered
Accuracy
Average
Number
Not
Classified
Average
Number
of Rules
Average
Number
Unique
Attributes
Average
Rule
Length
1 .95 .97 1.9 45.6 52.8 2.34
3 .95 .97 .8 48 54.6 2.26
5 .95 .96 .4 56.4 55.7 2.07
7 .95 .96 .2 52.7 50.6 1.87
9 .95 .96 0 53.6 50.6 1.81
11 .95 .96 0 54.5 49.9 1.77
13 .95 .96 0 52.9 46.6 1.72
15 .95 .96 0 49 43.8 1.65
Table 26 Thyroid Results 3 Missing Attributes
Number
Clusters
Unordered
Accuracy
Ordered
Accuracy
Average
Number
Not
Classified
Average
Number
of Rules
Average
Number
Unique
Attributes
Average
Rule
Length
1 .94 .95 11.9 55.1 59.3 2.34
3 .95 .97 .5 61.5 63.4 2.39
5 .95 .97 .1 56.1 53.3 1.89
7 .95 .96 0 55 51.2 1.84
9 .95 .96 0 52.2 48.2 1.76
11 .95 .96 0 47.7 46.1 1.74
13 .95 .96 0 46.4 43.2 1.67
15 .95 .96 0 43.3 41.5 1.64
Table 27 Thyroid Results 4 Missing Attributes
Figure 28
Figure 29
Figure 30 displays the average rule lengths of the models created using the
set. The rules that were generated via the CRA method were shorter.
0.925
0.93
0.935
0.94
0.945
0.95
0.955
0.96
0.965
0.97
0.975
1 3
Ac
cu
rac
y
0.935
0.94
0.945
0.95
0.955
0.96
0.965
0.97
0.975
0.98
0.985
1 3
Ac
cu
rac
y
66
Thyroid Accuracies Unordered Models
Thyroid Accuracies Ordered Models
The rule lengths were 2.3
5 7 9 11 13 15
k value
5 7 9 11 13 15
k value
Thyroid data
0 Missing
1 Missing
2 Missing
3 Missing
4 Missing
0 Missing
1 Missing
2 Missing
3 Missing
4 Missing
and 1.8, on average, for the traditional and collective models respectively. The models created by
the traditional method were larger. On average, t
while the collective models contained 50.
this data set via the traditional and CRA methods.
Figure
Cluster Rules
1 0 If TSH=`164` then class=`hypothyroid`
2 0
If FTI=`060.23` and TSH=`23.9
2.99` then class=`hypothyroid`
3 0
If FTI=`060.23` and age=`58.87
1.59` and tumor=`f`
4 0 If TSH=`150` then class=`hypothyroid`
5 0 If TT4=`5.30` then class=`hypothyroid`
6 0
If TSH=`47.7971.69` and T3=`0
query_hypothyroid=`f` then class=`hypothyroid`
7 0
If FTI=`060.23` and TSH=`23.9
then class=`hypothyroid`
8 0
If FTI=`060.23` and age=`78.16
on_thyroxine=`f` then class=`hypothyroid`
9 0
If thyroid_surgery=`t` and age=`1
1.99` then class=`hypothyroid`
0
0.5
1
1.5
2
2.5
1 3
Av
era
ge
Ru
le
Le
ng
th
67
he traditional method models contained 52 rules
Tables 28 and 29 are examples of models generated for
30 Thyroid Average Rule Lengths
Accuracy
47.79` and T3=`1.99
78.16` and T4U=`1.36
then class=`hypothyroid`
1` and
47.79` and T3=`01`
97.45` and
20.29` and T3=`1
5 7 9 11 13 15
k value
Coverage
1 1
1 6
1 4
1 2
1 1
1 13
1 13
1 6
1 3
0 Missing
1 Missing
2 Missing
3 Missing
4 Missing
68
10 0
If query_hypothyroid=`t` and age=`78.1697.45` and
T3=`1.992.99` and FTI=`60.2399` then
class=`hypothyroid` 1 2
11 0 If TSH=`235` then class=`hypothyroid` 1 1
12 0 If TSH=`216` then class=`hypothyroid` 1 1
13 0
If FTI=`060.23` and age=`58.8778.16` and TT4=`10
55.48` and T3=`11.99` then class=`hypothyroid` 1 10
14 0 If TSH=`213` then class=`hypothyroid` 1 1
15 0
If T4U=`0.911.13` and on_thyroxine=`f` and
query_hypothyroid=`t` and TT4=`55.4899` and T3=`1
1.99` and age=`58.8778.16` and sex=`F` then
class=`hypothyroid` 0.5 2
16 0
If T4U=`0.911.13` and age=`39.5858.87` and
query_hypothyroid=`t` and TT4=`55.4899` and
T3=`1.992.99` then class=`hypothyroid` 1 1
17 0
If TSH=`100` and on_thyroxine=`f` then
class=`hypothyroid` 1 3
18 0
If TSH=`47.7971.69` and FTI=`060.23` and
on_antithyroid_medication=`f` and
query_hypothyroid=`f` then class=`hypothyroid` 1 20
19 0
If FTI=`060.23` and TSH=`23.947.79` and age=`20.29
39.58` then class=`hypothyroid` 1 7
20 0 If TSH=`288` then class=`hypothyroid` 1 1
21 0 If TSH=`145` then class=`hypothyroid` 1 3
22 0
If TSH=`71.6995.58` and age=`58.8778.16` then
class=`hypothyroid` 1 1
23 0 If TT4=`6` then class=`hypothyroid` 1 2
24 0
If FTI=`060.23` and TSH_measured=`y` and
age=`39.5858.87` and T4U=`0.911.13` and TT4=`10
55.48` and on_antithyroid_medication=`f` then
class=`hypothyroid` 0.72727 11
25 0
If thyroid_surgery=`t` and age=`120.29` and
FTI=`60.2399` then class=`hypothyroid` 0.66667 3
26 0
If TT4=`230` and age=`20.2939.58` and
query_hypothyroid=`f` then class=`hypothyroid` 1 1
27 0
If TT4=`4` and age=`20.2939.58` then
class=`hypothyroid` 1 2
28 0 If TSH=`95.5896` then class=`hypothyroid` 1 1
29 0
If FTI=`060.23` and TSH_measured=`y` and
on_thyroxine=`f` and sick=`f` and sex=`M` and
age=`39.5858.87` and query_hyperthyroid=`f` then
class=`hypothyroid` 0.75 8
30 0 If TSH=`109` then class=`hypothyroid` 1 1
31 0
If T4U=`1.131.36` and TT4=`55.4899` and age=`58.87
78.16` and T3=`1.992.99` then class=`hypothyroid` 0.5 2
69
32 0
If TT4=`1055.48` and T3=`01` and TSH=`23.947.79`
then class=`hypothyroid` 1 12
33 0
If FTI=`060.23` and T4U=`0.450.68` and sick=`f` then
class=`hypothyroid` 1 2
34 0 If age=`?` then class=`hypothyroid` 0.02924 342
35 0 If TT4=`7.50` then class=`hypothyroid` 1 1
36 0 If TSH=`143` then class=`hypothyroid` 1 1
37 0
If FTI=`060.23` and T4U=`1.131.36` and sex=`M` then
class=`hypothyroid` 1 5
38 0
If TT4=`1055.48` and T3=`01` and sex=`F` and
T4U=`0.680.91` and age=`58.8778.16` then
class=`hypothyroid` 0.66667 3
39 0
If FTI=`060.23` and thyroid_surgery=`t` and
age=`20.2939.58` then class=`hypothyroid` 1 1
40 0
If TT4=`1055.48` and thyroid_surgery=`t` then
class=`hypothyroid` 0.85714 7
41 0
If FTI=`60.2399` and age=`20.2939.58` and sex=`M`
and T3=`11.99` and T4U=`0.911.13` then
class=`hypothyroid` 0.14286 7
42 0
If goitre=`t` and TT4=`55.4899` and T3=`?` and
age=`39.5858.87` and TSH=`023.9` then
class=`hypothyroid` 1 1
43 0 If TSH=`530` then class=`hypothyroid` 1 1
44 0
If FTI=`060.23` and thyroid_surgery=`t` then
class=`hypothyroid` 0.75 8
45 0 If TSH=`165` then class=`hypothyroid` 1 1
46 0
If query_hypothyroid=`t` and T4U=`0.680.91` and
T3=`1.992.99` and age=`58.8778.16` then
class=`hypothyroid` 1 2
47 0
If TT4=`109` and TSH=`23.947.79` then
class=`hypothyroid` 1 1
48 0 If TSH=`117` then class=`hypothyroid` 1 1
49 0
If TSH=`71.6995.58` and age=`120.29` then
class=`hypothyroid` 1 1
50 0 If T3=`3.994.98` then class=`negative` 1 43
51 0
If TSH=`023.9` and age=`?` and sex=`M` then
class=`negative` 1 78
52 0 If FTI=`121` then class=`negative` 1 30
53 0 If TSH=`023.9` then class=`negative` 0.98279 2034
54 0 If TSH=`?` then class=`negative` 1 371
55 0
If T3=`11.99` and TSH=`23.947.79` and
query_hypothyroid=`f` and on_thyroxine=`f` and age=`?`
then class=`negative` 1 2
70
56 0
If T3=`11.99` and TSH=`23.947.79` and
query_hypothyroid=`f` and age=`58.8778.16` and
sex=`F` then class=`negative` 0.8 5
57 0
If TSH=`23.947.79` and T3=`11.99` and
query_hypothyroid=`f` and on_thyroxine=`f` and
thyroid_surgery=`f` then class=`negative` 0.76923 13
58 0 If query_hyperthyroid=`f` then class=`negative` 0.9503 2334
Table 28 Thyroid Example Traditional Model
Cluster Rules Accuracy Coverage
1 0 If TSH=`164` then class=`hypothyroid` 1 1
2 0
If FTI=`060.23` and TSH=`23.947.79` and T3=`1.99
2.99` then class=`hypothyroid` 1 6
3 0
If FTI=`060.23` and age=`58.8778.16` and T4U=`1.36
1.59` and tumor=`f` then class=`hypothyroid` 1 4
4 0 If TSH=`150` then class=`hypothyroid` 1 2
5 0 If TT4=`5.30` then class=`hypothyroid` 1 1
6 0
If TSH=`47.7971.69` and T3=`01` and
query_hypothyroid=`f` then class=`hypothyroid` 1 13
7 0
If FTI=`060.23` and TSH=`23.947.79` and T3=`01`
then class=`hypothyroid` 1 13
8 0
If FTI=`060.23` and age=`78.1697.45` and
on_thyroxine=`f` then class=`hypothyroid` 1 6
9 0
If thyroid_surgery=`t` and age=`120.29` and T3=`1
1.99` then class=`hypothyroid` 1 3
10 0
If query_hypothyroid=`t` and age=`78.1697.45` and
T3=`1.992.99` and FTI=`60.2399` then
class=`hypothyroid` 1 2
11 0 If TSH=`235` then class=`hypothyroid` 1 1
12 0 If TSH=`216` then class=`hypothyroid` 1 1
13 0
If FTI=`060.23` and age=`58.8778.16` and TT4=`10
55.48` and T3=`11.99` then class=`hypothyroid` 1 10
14 0 If TSH=`213` then class=`hypothyroid` 1 1
15 0
If T4U=`0.911.13` and on_thyroxine=`f` and
query_hypothyroid=`t` and TT4=`55.4899` and T3=`1
1.99` and age=`58.8778.16` and sex=`F` then
class=`hypothyroid` 0.5 2
16 0
If T4U=`0.911.13` and age=`39.5858.87` and
query_hypothyroid=`t` and TT4=`55.4899` and
T3=`1.992.99` then class=`hypothyroid` 1 1
17 0
If TSH=`100` and on_thyroxine=`f` then
class=`hypothyroid` 1 3
71
18 0
If TSH=`47.7971.69` and FTI=`060.23` and
on_antithyroid_medication=`f` and
query_hypothyroid=`f` then class=`hypothyroid` 1 20
19 0
If FTI=`060.23` and TSH=`23.947.79` and age=`20.29
39.58` then class=`hypothyroid` 1 7
20 0 If TSH=`288` then class=`hypothyroid` 1 1
21 0 If TSH=`145` then class=`hypothyroid` 1 3
22 0
If TSH=`71.6995.58` and age=`58.8778.16` then
class=`hypothyroid` 1 1
23 0 If TT4=`6` then class=`hypothyroid` 1 2
24 0
If FTI=`060.23` and TSH_measured=`y` and
age=`39.5858.87` and T4U=`0.911.13` and TT4=`10
55.48` and on_antithyroid_medication=`f` then
class=`hypothyroid` 0.72727 11
25 0
If thyroid_surgery=`t` and age=`120.29` then
class=`hypothyroid` 0.66667 6
26 0
If TT4=`230` and age=`20.2939.58` and
query_hypothyroid=`f` then class=`hypothyroid` 1 1
27 0
If TT4=`4` and age=`20.2939.58` then
class=`hypothyroid` 1 2
28 0 If TSH=`95.5896` then class=`hypothyroid` 1 1
29 0
If FTI=`060.23` and TSH_measured=`y` and
on_thyroxine=`f` and sick=`f` and sex=`M` and
age=`39.5858.87` and query_hyperthyroid=`f` then
class=`hypothyroid` 0.75 8
30 0 If TSH=`109` then class=`hypothyroid` 1 1
31 0
If T4U=`1.131.36` and TT4=`55.4899` and age=`58.87
78.16` and T3=`1.992.99` then class=`hypothyroid` 0.5 2
32 0
If TT4=`1055.48` and T3=`01` and TSH=`23.947.79`
then class=`hypothyroid` 1 12
33 0
If FTI=`060.23` and T4U=`0.450.68` and sick=`f` then
class=`hypothyroid` 1 2
34 0 If age=`?` then class=`hypothyroid` 0.02924 342
35 0 If TT4=`7.50` then class=`hypothyroid` 1 1
36 0 If TSH=`143` then class=`hypothyroid` 1 1
37 0
If FTI=`060.23` and T4U=`1.131.36` and sex=`M` then
class=`hypothyroid` 1 5
38 0
If TT4=`1055.48` and T3=`01` and sex=`F` and
T4U=`0.680.91` and age=`58.8778.16` then
class=`hypothyroid` 0.66667 3
39 0
If FTI=`060.23` and thyroid_surgery=`t` and
age=`20.2939.58` then class=`hypothyroid` 1 1
40 0
If TT4=`1055.48` and thyroid_surgery=`t` then
class=`hypothyroid` 0.85714 7
72
41 0
If FTI=`60.2399` and age=`20.2939.58` and sex=`M`
and T3=`11.99` and T4U=`0.911.13` then
class=`hypothyroid` 0.14286 7
42 0
If goitre=`t` and T3=`?` and TT4=`55.4899` and
age=`39.5858.87` then class=`hypothyroid` 0.5 2
43 0 If TSH=`530` then class=`hypothyroid` 1 1
44 0
If FTI=`060.23` and thyroid_surgery=`t` then
class=`hypothyroid` 0.75 8
45 0 If TSH=`165` then class=`hypothyroid` 1 1
46 0
If query_hypothyroid=`t` and T4U=`0.680.91` and
T3=`1.992.99` and age=`58.8778.16` then
class=`hypothyroid` 1 2
47 0
If TT4=`109` and TSH=`23.947.79` then
class=`hypothyroid` 1 1
48 0 If TSH=`117` then class=`hypothyroid` 1 1
49 0
If TSH=`71.6995.58` and age=`120.29` then
class=`hypothyroid` 1 1
50 0 If FTI=`124` then class=`negative` 1 26
51 0 If FTI=`109` then class=`negative` 1 31
52 0
If TSH=`023.9` and T4U=`0.680.91` and sex=`M` then
class=`negative` 1 259
53 0 If T3=`3.994.98` then class=`negative` 1 43
54 0
If TSH=`023.9` and age=`?` and sex=`M` then
class=`negative` 1 78
55 0 If FTI=`121` then class=`negative` 1 30
56 0 If TSH=`023.9` then class=`negative` 0.98279 2034
57 0 If TSH=`?` then class=`negative` 1 371
58 0
If T3=`11.99` and TSH=`23.947.79` and
query_hypothyroid=`f` and on_thyroxine=`f` and age=`?`
then class=`negative` 1 2
59 0
If T3=`11.99` and TSH=`23.947.79` and
query_hypothyroid=`f` and age=`58.8778.16` and
sex=`F` then class=`negative` 0.8 5
60 0
If TSH=`23.947.79` and T3=`11.99` and
query_hypothyroid=`f` and on_thyroxine=`f` and
thyroid_surgery=`f` then class=`negative` 0.76923 13
61 0 If query_hyperthyroid=`f` then class=`negative` 0.9503 2334
62 1 If age=`58.8778.16` then class=`hypothyroid` 0.06417 748
63 1 If FTI_measured=`y` then class=`negative` 0.9482 2336
64 2 If age=`20.2939.58` then class=`hypothyroid` 0.03978 553
65 2 If lithium=`f` then class=`negative` 0.95217 2530
Table 29 Thyroid Example Collective Model
73
Results for Mushroom Data
Tables 30 through 34 display the results of the test performed using the Mushroom data
set. The traditional models achieved accuracies of 89% and 100% for the unordered and ordered
models respectively. The CRA process generated a collective model that achieved a higher
accuracy than the traditional method with the occurrence of 2, 3, and 4 missing attributes.
Figures 31 and 32 provide a graphical representation of the accuracies achieved for each k.
Additionally, the CRA method matched the performance of the traditional method with the
occurrence of 0 and 3 missing attributes.
Number
Clusters
Unordered
Accuracy
Ordered
Accuracy
Average
Number
Not
Classified
Average
Number
of Rules
Average
Number
Unique
Attributes
Average
Rule
Length
1 .89 1 0 23.5 33.4 2.06
3 .89 .9 0 28.4 32.6 1.54
5 .88 .91 0 26.4 29.7 1.41
7 .87 .89 0 25.8 28.2 1.36
9 .85 .9 0 25.2 27.9 1.36
11 .85 .91 0 24.5 27.1 1.31
13 .8 .91 0 25.2 27.7 1.31
15 .76 .9 0 25.8 28.1 1.35
Table 30 Mushroom Results Complete Data (No Missing Attributes)
Number
Clusters
Unordered
Accuracy
Ordered
Accuracy
Average
Number
Not
Classified
Average
Number
of Rules
Average
Number
Unique
Attributes
Average
Rule
Length
1 .89 .96 40.8 23 33.4 2.01
3 .87 .92 .3 26.7 32.1 1.56
5 .86 .89 .1 25.2 29.3 1.45
7 .82 .87 0 25.3 28.7 1.41
9 .77 .87 0 23.8 27.3 1.36
11 .81 .9 0 23.3 25.5 1.31
13 .81 .9 0 23.6 26.3 1.31
15 .8 .9 0 24.4 26.9 1.35
Table 31 Mushroom Results 1 Missing Attribute
74
Number
Clusters
Unordered
Accuracy
Ordered
Accuracy
Average
Number
Not
Classified
Average
Number
of Rules
Average
Number
Unique
Attributes
Average
Rule
Length
1 .86 .92 75.1 22.2 31.3 1.95
3 .87 .88 2.3 28.9 36 1.73
5 .8 .87 .2 25.4 30.2 1.47
7 .78 .87 0 24.1 27 1.37
9 .79 .86 0 22.8 25 1.28
11 .77 .88 0 23.2 25.4 1.32
13 .76 .87 0 24.8 26.7 1.32
15 .76 .87 0 24.4 27.1 1.36
Table 32 Mushroom Results 2 Missing Attributes
Number
Clusters
Unordered
Accuracy
Ordered
Accuracy
Average
Number
Not
Classified
Average
Number
of Rules
Average
Number
Unique
Attributes
Average
Rule
Length
1 .8 .86 157 21.3 30.2 1.92
3 .82 .85 1.7 28 34.1 1.69
5 .8 .88 .7 25.9 29.8 1.49
7 .76 .81 0 25.6 28.7 1.38
9 .76 .86 0 24.6 27.1 1.28
11 .76 .86 .2 25.2 27.4 1.29
13 .76 .86 0 26.5 28.4 1.31
15 .77 .86 .1 25.9 27.9 1.29
Table 33 Mushroom Results 3 Missing Attributes
Number
Clusters
Unordered
Accuracy
Ordered
Accuracy
Average
Number
Not
Classified
Average
Number
of Rules
Average
Number
Unique
Attributes
Average
Rule
Length
1 .77 .82 201.7 24.3 35.2 2.06
3 .82 .87 2.4 25.1 29.9 1.52
5 .8 .86 1 28.4 32.5 1.59
7 .75 .87 .2 25.8 26.9 1.36
9 .74 .84 .3 23.6 26.8 1.36
11 .75 .85 .5 22.6 25.7 1.4
13 .71 .85 .2 23.5 27 1.39
15 .7 .85 .1 24.6 28 1.39
Table 34 Mushroom Results 4 Missing Attributes
Figure 31Mushroom Accuracies Unordered Models
Figure 33 depicts the average rule lengths of the models generated. The collective models
created with the CRA process contained smaller rules.
1.4 for the traditional and collective models respectively
method were generally larger than the traditional models. The collective models contained 25
rules while the traditional model contained 22 on average.
the traditional and collective models created using this data set.
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1 3
Ac
cu
rac
y
75
On average, the rule lengths were
. The models created via the CRA
Tables 35 and 36 display examples of
5 7 9 11 13 15
k value
2 and
0 Missing
1 Missing
2 Missing
3 Missing
4 Missing
Figure 32
Figure
Cluster Rules
1 0 If odor=`a` then class=`e`
2 0 If odor=`n` and stalk_shape=`t` then class=`e`
3 0 If habitat=`w` then class=`e`
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1 3
Ac
cu
rac
y
0
0.5
1
1.5
2
2.5
1 3
Av
era
ge
Ru
le
Le
ng
th
76
Mushroom Accuracies Ordered Models
33 Mushroom Average Rule Lengths
Accuracy
5 7 9 11 13 15
k value
5 7 9 11 13 15
k value
Coverage
1 319
1 2020
1 145
0 Missing
1 Missing
2 Missing
3 Missing
4 Missing
0 Missing
1 Missing
2 Missing
3 Missing
4 Missing
77
4 0 If odor=`l` then class=`e` 1 317
5 0 If stalk_color_above_ring=`o` then class=`e` 1 143
6 0
If ring_number=`t` and spore_print_color=`w` then
class=`e` 1 417
7 0
If odor=`n` and stalk_surface_below_ring=`f` then
class=`e` 1 362
8 0 If cap_shape=`s` then class=`e` 1 25
9 0 If odor=`n` and habitat=`u` then class=`e` 1 72
10 0 If cap_color=`c` and gill_size=`n` then class=`e` 1 21
11 0 If gill_spacing=`w` and cap_color=`n` then class=`e` 1 223
12 0 If cap_color=`y` then class=`p` 0.63392 855
13 0
If cap_surface=`s` and gill_spacing=`c` and
gill_attachment=`f` and ring_number=`o` then class=`p` 0.8406 1261
14 0 If gill_size=`n` and stalk_shape=`t` then class=`p` 0.9455 1468
15 0 If habitat=`p` and ring_number=`o` then class=`p` 0.91551 864
16 0
If stalk_shape=`e` and population=`y` and habitat=`g`
and gill_color=`p` then class=`p` 0.81081 74
17 0
If stalk_shape=`e` and ring_number=`o` and habitat=`g`
then class=`p` 0.6118 729
18 0 If habitat=`m` and gill_color=`w` then class=`p` 0.15789 57
19 0
If population=`v` and stalk_color_above_ring=`w` and
ring_type=`p` and cap_shape=`x` and ring_number=`o`
and gill_attachment=`f` and
stalk_surface_above_ring=`s` and
stalk_surface_below_ring=`s` and veil_color=`w` and
veil_type=`p` and cap_surface=`f` and habitat=`d` and
stalk_root=`b` and gill_color=`u` and
spore_print_color=`k` then class=`p` 0.41667 12
20 0 If stalk_root=`b` and cap_shape=`x` then class=`p` 0.50772 1554
21 0
If cap_color=`g` and bruises=`f` and ring_number=`o`
then class=`p` 0.69004 813
22 0
If population=`v` and stalk_color_below_ring=`w` and
stalk_color_above_ring=`w` and bruises=`t` then
class=`p` 0.60682 440
23 0
If population=`s` and stalk_surface_below_ring=`s` and
stalk_root=`e` then class=`p` 0.40892 269
24 0 If habitat=`d` then class=`p` 0.39825 2521
25 0
If ring_number=`o` and gill_spacing=`w` and
stalk_surface_below_ring=`s` and
stalk_surface_above_ring=`s` then class=`p` 0.23939 330
Table 35 Mushroom Example Traditional Model
78
Cluster Rules Accuracy Coverage
1 0 If odor=`n` and stalk_shape=`t` then class=`e` 1 2020
2 0 If population=`y` then class=`e` 0.62127 1373
3 0 If odor=`n` and cap_color=`g` then class=`e` 1 834
4 0 If habitat=`w` then class=`e` 1 145
5 0
If stalk_color_below_ring=`n` and gill_spacing=`w` then
class=`e` 1 39
6 0
If odor=`n` and stalk_surface_below_ring=`k` then
class=`e` 1 120
7 0 If cap_shape=`s` then class=`e` 1 25
8 0 If odor=`l` then class=`e` 1 317
9 0 If odor=`n` and cap_shape=`x` then class=`e` 0.99442 1255
10 0 If odor=`a` then class=`e` 1 319
11 0 If ring_type=`f` then class=`e` 1 39
12 0
If ring_number=`t` and spore_print_color=`w` then
class=`e` 1 417
13 0 If odor=`n` and habitat=`u` then class=`e` 1 72
14 0 If stalk_surface_below_ring=`k` then class=`p` 0.9346 1835
15 0
If population=`v` and stalk_color_below_ring=`w` and
cap_color=`e` then class=`p` 0.82641 409
16 0
If stalk_color_below_ring=`p` and
stalk_color_above_ring=`p` and cap_color=`e` then
class=`p` 0.78112 233
17 0
If stalk_color_above_ring=`w` and gill_spacing=`c` then
class=`p` 0.50811 2527
18 0
If gill_size=`n` and bruises=`f` and cap_shape=`x` then
class=`p` 0.91828 673
19 0 If ring_type=`e` then class=`p` 0.63653 2234
20 0 If gill_color=`w` then class=`p` 0.20329 974
21 1 If ring_type=`p` then class=`e` 0.79388 3168
22 1 If veil_color=`w` then class=`p` 0.49228 6348
23 1 If bruises=`f` then class=`p` 0.69286 3793
24 2 If ring_number=`o` then class=`e` 0.49175 5997
25 2 If spore_print_color=`w` then class=`e` 0.23862 1911
26 2 If bruises=`t` then class=`p` 0.18632 2705
Table 36 Mushroom Example Collective Model
79
Results for TicTacToe Data
Tables 37 through 41 display the results of the test performed using the TicTacToe data
set. The traditional models achieved accuracies of 82% and 99% for the unordered and ordered
models respectively. The CRA process generated a collective model that achieved a higher
accuracy than the traditional method with the occurrence of 1, 2, 3, and 4 missing attributes.
Additionally, the number of instances not classified decreased when using the collective models
created with the CRA process. Figures 34 and 35 provide a graphical representation of the
accuracies achieved for each k while Figure 36 displays the average rule lengths. The rules
created using the CRA method were shorter. On average the rule lengths were 3.35 and 2.38 for
the traditional and collective models respectively. The CRA models were larger than the
traditional models. The collective model contained 97 rules while the traditional model contained
52 rules on average. Tables 42 and 43 provide examples of the traditional and collective models
generated.
Number
Clusters
Unordered
Accuracy
Ordered
Accuracy
Average
Number
Not
Classified
Average
Number
of Rules
Average
Number
Unique
Attributes
Average
Rule
Length
1 .82 .99 .1 54.6 26.8 3.4
3 .79 .96 0 78.9 27 2.73
5 .76 .94 0 89.7 27 2.54
7 .78 .9 0 101.2 26.9 2.42
9 .76 .86 0 107.5 27 2.3
11 .75 .85 0 106.8 27 2.27
13 .75 .84 0 104 27 2.2
15 .74 .82 0 101.9 27 2.13
Table 37 TicTacToe Results Complete Data (No Missing Attributes)
Number
Clusters
Unordered
Accuracy
Ordered
Accuracy
Average
Number
Not
Classified
Average
Number
of Rules
Average
Number
Unique
Attributes
Average
Rule
Length
1 .75 .84 11.5 52.1 26.6 3.33
80
3 .75 .85 .1 80.9 26.9 2.72
5 .74 .83 0 88.3 27 2.6
7 .74 .81 0 105 27 2.42
9 .74 .8 0 106 27 2.32
11 .73 .8 0 99.9 26.9 2.23
13 .71 .78 0 98.5 26.9 2.16
15 .71 .78 0 98.3 27 2.12
Table 38 TicTacToe Results 1 Missing Attribute
Number
Clusters
Unordered
Accuracy
Ordered
Accuracy
Average
Number
Not
Classified
Average
Number
of Rules
Average
Number
Unique
Attributes
Average
Rule
Length
1 .68 .73 21.3 49.7 26.7 3.31
3 .72 .79 .6 78.2 27 2.76
5 .72 .78 0 87.9 27 2.58
7 .72 .77 0 98.1 27 2.42
9 .72 .75 0 106.1 27 2.34
11 .71 .75 0 105.7 27 2.26
13 .7 .74 0 105.1 27 2.18
15 .69 .73 0 102.3 27 2.14
Table 39 TicTacToe Results 2 Missing Attributes
Number
Clusters
Unordered
Accuracy
Ordered
Accuracy
Average
Number
Not
Classified
Average
Number
of Rules
Average
Number
Unique
Attributes
Average
Rule
Length
1 .57 .58 46.6 49.7 26.5 3.36
3 .68 .72 3.7 80.9 27 2.77
5 .7 .73 .5 86.8 27 2.62
7 .69 .72 .2 96.6 27 2.43
9 .69 .72 0 103.9 27 2.35
11 .69 .72 0 104.2 27 2.28
13 .67 .73 0 99.7 27 2.16
15 .66 .71 0 100.8 27 2.1
Table 40 TicTacToe Results 3 Missing Attributes
Number
Clusters
Unordered
Accuracy
Ordered
Accuracy
Average
Number
Not
Classified
Average
Number
of Rules
Average
Number
Unique
Attributes
Average
Rule
Length
1 .48 .49 67.6 52.2 26.8 3.34
3 .66 .68 8 77 26.8 2.75
5 .68
7 .68
9 .69
11 .68
13 .67
15 .66
Table 41 Tic
Figure 34 Tic
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1 3
Ac
cu
rac
y
81
.7 2.4 88.3 27
.71 .3 102.6 27
.71 0 109.8 27
.7 0 109.1 27
.7 0 102.1 26.9
.7 0 99.6 27
TacToe Results 4 Missing Attributes
TacToe Accuracies Unordered Models
5 7 9 11 13 15
k value
2.57
2.42
2.32
2.28
2.2
2.14
0 Missing
1 Missing
2 Missing
3 Missing
4 Missing
Figure 35 Tic
Figure 36
Cluster Rules
1 0
If bottom_left_square=`o` and top_left_square=`o` and
middle_left_square=`o` then class=`negative`
0
0.2
0.4
0.6
0.8
1
1.2
1 3
Ac
cu
rac
y
0
0.5
1
1.5
2
2.5
3
3.5
1 3
Av
era
ge
Ru
le
Le
ng
th
82
TacToe Accuracies Ordered Models
TicTacToe Average Rule Lengths
Accuracy
5 7 9 11 13 15
k value
5 7 9 11 13 15
k value
Coverage
1 32
0 Missing
1 Missing
2 Missing
3 Missing
4 Missing
0 Missing
1 Missing
2 Missing
3 Missing
4 Missing
83
2 0
If middle_middle_square=`o` and
bottom_middle_square=`o` and top_middle_square=`o`
then class=`negative` 1 32
3 0
If middle_middle_square=`o` and
bottom_left_square=`o` and top_right_square=`o` then
class=`negative` 1 38
4 0
If bottom_right_square=`o` and bottom_left_square=`o`
and bottom_middle_square=`o` then class=`negative` 1 30
5 0
If top_left_square=`o` and top_right_square=`o` and
top_middle_square=`o` then class=`negative` 1 28
6 0
If bottom_right_square=`o` and top_right_square=`o` and
middle_right_square=`o` then class=`negative` 1 28
7 0
If bottom_right_square=`o` and
bottom_middle_square=`x` and top_right_square=`x` and
middle_left_square=`x` and top_middle_square=`o` and
bottom_left_square=`o` and top_left_square=`x` then
class=`negative` 1 2
8 0
If bottom_right_square=`o` and
bottom_middle_square=`x` and middle_right_square=`x`
and top_right_square=`x` and top_middle_square=`o`
and top_left_square=`x` then class=`negative` 1 3
9 0
If middle_middle_square=`o` and top_left_square=`o`
and bottom_right_square=`o` then class=`negative` 1 39
10 0
If middle_right_square=`o` and
bottom_middle_square=`x` and top_right_square=`x` and
bottom_right_square=`x` and middle_left_square=`x`
then class=`negative` 1 2
11 0
If middle_middle_square=`o` and
middle_right_square=`o` and middle_left_square=`o`
then class=`negative` 1 27
12 0
If top_right_square=`o` and middle_left_square=`o` and
bottom_right_square=`o` and bottom_left_square=`x`
then class=`negative` 1 5
13 0
If top_right_square=`o` and bottom_middle_square=`o`
and middle_middle_square=`o` and
bottom_right_square=`x` then class=`negative` 1 7
14 0
If top_right_square=`o` and top_left_square=`o` and
bottom_middle_square=`o` and middle_left_square=`o`
then class=`negative` 1 1
15 0
If middle_middle_square=`b` and
middle_right_square=`o` then class=`positive` 0.8 50
16 0
If middle_middle_square=`x` and
bottom_left_square=`x` and bottom_middle_square=`o`
then class=`positive` 0.93103 58
84
17 0
If middle_middle_square=`x` and
bottom_left_square=`b` then class=`positive` 0.84146 82
18 0
If bottom_middle_square=`b` and
top_middle_square=`o` then class=`positive` 0.82432 74
19 0
If bottom_right_square=`x` and
bottom_middle_square=`x` and top_middle_square=`o`
and middle_middle_square=`o` then class=`positive` 0.80952 21
20 0
If middle_middle_square=`x` and
bottom_left_square=`x` and top_left_square=`o` and
bottom_right_square=`o` then class=`positive` 0.95652 23
21 0
If bottom_left_square=`x` and top_right_square=`o` and
middle_left_square=`x` and bottom_right_square=`b`
and middle_middle_square=`o` then class=`positive` 0.875 8
22 0
If middle_middle_square=`b` and
bottom_left_square=`x` and middle_right_square=`b`
then class=`positive` 0.89474 19
23 0
If middle_middle_square=`x` and
bottom_right_square=`b` and top_right_square=`o` then
class=`positive` 0.74074 27
24 0
If bottom_right_square=`x` and
middle_middle_square=`b` and bottom_left_square=`b`
then class=`positive` 0.76923 13
25 0
If bottom_right_square=`x` and
middle_middle_square=`b` and middle_right_square=`x`
then class=`positive` 0.84375 32
26 0
If middle_middle_square=`x` and top_left_square=`b`
and middle_left_square=`o` then class=`positive` 0.96154 26
27 0
If middle_middle_square=`x` and top_right_square=`b`
and middle_left_square=`x` then class=`positive` 0.91667 36
28 0
If top_middle_square=`x` and
bottom_middle_square=`o` then class=`positive` 0.6383 94
29 0
If middle_middle_square=`x` and
middle_left_square=`b` and top_right_square=`b` then
class=`positive` 0.84211 19
30 0
If bottom_right_square=`x` and middle_left_square=`b`
and bottom_middle_square=`x` then class=`positive` 0.79487 39
31 0
If bottom_right_square=`x` and top_right_square=`o` and
bottom_middle_square=`x` and top_middle_square=`x`
and top_left_square=`o` then class=`positive` 0.77778 9
32 0
If top_left_square=`x` and middle_middle_square=`b`
then class=`positive` 0.82609 69
33 0
If middle_left_square=`b` and bottom_left_square=`o`
then class=`positive` 0.67123 73
85
34 0
If middle_middle_square=`x` and
bottom_left_square=`x` then class=`positive` 0.85606 132
35 0
If middle_middle_square=`x` and top_right_square=`o`
and middle_left_square=`x` and
bottom_right_square=`x` then class=`positive` 0.84211 19
36 0 If bottom_middle_square=`b` then class=`positive` 0.69697 198
37 0
If top_middle_square=`b` and bottom_right_square=`x`
and middle_right_square=`x` and
middle_middle_square=`o` and top_left_square=`o` then
class=`positive` 0.88889 9
38 0
If top_right_square=`o` and bottom_left_square=`x` then
class=`positive` 0.64655 116
39 0
If middle_middle_square=`x` and top_left_square=`x`
then class=`positive` 0.82353 136
40 0
If top_left_square=`x` and bottom_right_square=`o` and
top_middle_square=`x` then class=`positive` 0.64407 59
41 0
If top_left_square=`b` and bottom_left_square=`o` and
bottom_right_square=`x` then class=`positive` 0.8 25
42 0
If middle_left_square=`x` and
bottom_middle_square=`x` and top_left_square=`b` then
class=`positive` 0.57895 19
43 0
If top_middle_square=`b` and bottom_right_square=`x`
and middle_middle_square=`o` and
middle_right_square=`x` and middle_left_square=`o`
then class=`positive` 0.88889 9
44 0
If top_middle_square=`b` and
bottom_middle_square=`o` then class=`positive` 0.85366 82
45 0
If middle_right_square=`o` and middle_left_square=`x`
and top_left_square=`x` and bottom_middle_square=`x`
then class=`positive` 0.52941 17
46 0
If bottom_middle_square=`x` and top_left_square=`o`
and top_right_square=`x` and bottom_left_square=`o`
and bottom_right_square=`o` and top_middle_square=`x`
then class=`positive` 0.33333 3
47 0
If middle_right_square=`b` and top_right_square=`b` and
middle_left_square=`o` then class=`positive` 0.83333 18
48 0 If middle_middle_square=`x` then class=`positive` 0.79109 359
49 0
If bottom_right_square=`x` and top_left_square=`o` and
top_right_square=`b` and bottom_middle_square=`x`
then class=`positive` 0.81818 22
50 0
If bottom_middle_square=`x` and
middle_middle_square=`o` and middle_left_square=`o`
and top_left_square=`o` then class=`positive` 0.42857 14
86
51 0 If bottom_middle_square=`x` then class=`positive` 0.58863 299
Table 42 TicTacToe Example Traditional Model
RuleID Cluster Rules Accuracy Coverage
1 0
If top_right_square=`o` and
bottom_right_square=`o` and
bottom_left_square=`b` then class=`negative` 0.57895 19
2 0 If bottom_middle_square=`o` then class=`positive` 0.69259 270
3 0 If middle_right_square=`o` then class=`positive` 0.70722 263
5 1
If bottom_right_square=`o` and
bottom_middle_square=`o` then class=`negative` 0.45977 87
6 1
If middle_middle_square=`o` and
bottom_middle_square=`o` then class=`negative` 0.5974 77
7 1
If bottom_right_square=`o` and top_left_square=`x`
and bottom_middle_square=`x` then
class=`negative` 0.5 36
8 1
If top_left_square=`o` and top_middle_square=`o`
and top_right_square=`o` then class=`negative` 1 28
9 1 If middle_middle_square=`x` then class=`positive` 0.79109 359
10 1 If bottom_right_square=`x` then class=`positive` 0.71554 341
11 2
If middle_left_square=`o` and
bottom_right_square=`b` then class=`negative` 0.38182 55
12 2
If top_middle_square=`x` and
middle_middle_square=`o` then class=`negative` 0.6036 111
13 2
If bottom_middle_square=`x` and
top_middle_square=`b` and bottom_left_square=`o`
and bottom_right_square=`b` then class=`negative` 0.85714 7
14 2
If bottom_left_square=`b` and
bottom_middle_square=`x` then class=`negative` 0.51724 58
15 2
If middle_middle_square=`b` and
bottom_right_square=`o` then class=`negative` 0.44444 45
16 2
If bottom_right_square=`o` and
top_right_square=`x` then class=`negative` 0.37879 132
17 2
If top_left_square=`x` and top_right_square=`x`
then class=`negative` 0.23256 129
18 2
If top_left_square=`x` and bottom_left_square=`o`
and middle_middle_square=`o` then
class=`negative` 0.71111 45
19 2
If bottom_right_square=`o` and
middle_right_square=`o` then class=`negative` 0.45783 83
20 2
If middle_right_square=`b` and top_left_square=`o`
then class=`negative` 0.51471 68
21 2 If top_middle_square=`o` and 0.62963 135
87
middle_right_square=`x` then class=`positive`
22 2
If middle_right_square=`x` and
top_middle_square=`b` and top_right_square=`x`
and bottom_left_square=`o` and
middle_middle_square=`o` then class=`positive` 0.83333 6
23 2
If bottom_right_square=`x` and top_left_square=`o`
and middle_left_square=`x` then class=`positive` 0.52778 36
24 2
If top_left_square=`b` and middle_left_square=`b`
and bottom_right_square=`x` then class=`positive` 0.80952 21
25 2
If middle_middle_square=`x` and
bottom_left_square=`o` then class=`positive` 0.70345 145
26 2
If bottom_right_square=`b` and
top_right_square=`o` then class=`positive` 0.5614 57
27 2
If bottom_middle_square=`b` and
bottom_right_square=`o` then class=`positive` 0.72603 73
28 2
If bottom_left_square=`x` and
top_middle_square=`o` then class=`positive` 0.72078 154
29 2
If bottom_left_square=`x` and
top_middle_square=`b` and
middle_middle_square=`o` then class=`positive` 0.70455 44
30 3
If middle_right_square=`o` and
top_right_square=`o` and bottom_right_square=`o`
then class=`negative` 1 28
31 3
If middle_left_square=`o` and
bottom_left_square=`x` and
bottom_middle_square=`x` then class=`negative` 0.24561 57
32 3 If middle_middle_square=`b` then class=`negative` 0.28467 137
33 3
If bottom_middle_square=`o` and
bottom_left_square=`o` then class=`negative` 0.45349 86
34 3
If middle_middle_square=`o` and
top_left_square=`o` and bottom_right_square=`o`
then class=`negative` 1 39
35 3
If bottom_right_square=`x` and
middle_left_square=`o` then class=`negative` 0.26923 156
36 3 If bottom_middle_square=`b` then class=`positive` 0.69697 198
37 3
If middle_right_square=`b` and
bottom_left_square=`b` then class=`positive` 0.625 32
38 3
If top_right_square=`b` and
bottom_middle_square=`x` then class=`positive` 0.73418 79
39 3
If middle_middle_square=`x` and
middle_left_square=`b` then class=`positive` 0.80208 96
40 3
If middle_right_square=`b` and top_left_square=`x`
then class=`positive` 0.79 100
41 3 If top_right_square=`o` and 0.34568 81
88
middle_middle_square=`o` then class=`positive`
42 3 If bottom_left_square=`x` then class=`positive` 0.727 337
43 4
If middle_left_square=`x` and
middle_right_square=`b` then class=`negative` 0.49333 75
44 4
If top_middle_square=`o` and
bottom_right_square=`b` then class=`negative` 0.43103 58
45 4
If top_left_square=`o` and middle_left_square=`o`
and bottom_left_square=`o` then class=`negative` 1 32
46 4
If top_middle_square=`o` and
bottom_left_square=`b` then class=`negative` 0.40984 61
47 4
If middle_left_square=`x` and top_left_square=`b`
then class=`negative` 0.44776 67
48 4 If top_left_square=`x` then class=`positive` 0.70088 341
49 4 If bottom_middle_square=`x` then class=`positive` 0.58863 299
50 4 If top_middle_square=`x` then class=`positive` 0.5914 279
51 5
If bottom_left_square=`o` and
middle_middle_square=`b` and
bottom_right_square=`o` then class=`negative` 0.75 16
52 5
If bottom_left_square=`o` and
middle_right_square=`o` and
bottom_right_square=`o` then class=`negative` 0.5 14
53 5
If bottom_left_square=`b` and
middle_right_square=`o` then class=`negative` 0.41818 55
54 5
If bottom_left_square=`o` and top_right_square=`o`
and bottom_right_square=`o` then class=`negative` 0.6 15
55 5
If bottom_left_square=`b` and
top_middle_square=`o` and
middle_middle_square=`o` then class=`negative` 0.77778 18
56 5
If bottom_left_square=`o` and
middle_right_square=`b` and
bottom_right_square=`o` and
bottom_middle_square=`o` then class=`negative` 1 11
57 5 If middle_middle_square=`o` then class=`positive` 0.43911 271
58 5 If middle_right_square=`b` then class=`positive` 0.66834 199
59 5 If top_right_square=`b` then class=`positive` 0.69143 175
60 5 If top_middle_square=`o` then class=`positive` 0.68132 273
61 5
If middle_left_square=`x` and
middle_middle_square=`b` then class=`positive` 0.61404 57
62 5 If middle_right_square=`x` then class=`positive` 0.59672 305
63 6
If bottom_left_square=`o` and
middle_left_square=`o` then class=`negative` 0.45783 83
64 6
If middle_middle_square=`o` and
bottom_right_square=`b` and
bottom_middle_square=`o` then class=`negative` 0.5 20
89
65 6
If top_right_square=`o` and
bottom_middle_square=`b` and
bottom_right_square=`x` then class=`negative` 0.45946 37
66 6
If top_right_square=`o` and middle_left_square=`b`
and top_middle_square=`o` then class=`negative` 0.58824 17
67 6
If middle_middle_square=`o` and
bottom_middle_square=`o` and
top_middle_square=`o` then class=`negative` 1 32
68 6
If top_middle_square=`x` and
bottom_left_square=`o` then class=`negative` 0.46715 137
69 6
If top_right_square=`o` and
middle_right_square=`o` then class=`negative` 0.45122 82
70 6
If middle_right_square=`o` and
bottom_left_square=`o` then class=`negative` 0.3 60
71 6
If top_middle_square=`x` and top_right_square=`b`
then class=`negative` 0.45161 62
72 6
If top_right_square=`o` and top_left_square=`o` and
top_middle_square=`o` then class=`negative` 1 28
73 6
If top_left_square=`x` and top_middle_square=`b`
then class=`negative` 0.33333 93
74 6
If bottom_right_square=`o` and
middle_middle_square=`o` then class=`negative` 0.66667 81
75 6
If top_middle_square=`x` and
middle_right_square=`o` then class=`negative` 0.34454 119
76 6
If bottom_left_square=`o` and
middle_middle_square=`o` then class=`negative` 0.7013 77
77 6
If bottom_left_square=`o` and
bottom_right_square=`o` then class=`negative` 0.5974 77
78 6
If top_middle_square=`x` and
bottom_right_square=`x` and
middle_left_square=`x` then class=`negative` 0.66667 12
79 6
If top_right_square=`x` and top_left_square=`o` and
bottom_left_square=`x` then class=`positive` 0.79032 62
80 6
If middle_middle_square=`b` and
bottom_right_square=`x` then class=`positive` 0.83582 67
81 6
If middle_middle_square=`x` and
bottom_left_square=`x` and middle_left_square=`o`
then class=`positive` 0.8871 62
82 6
If top_right_square=`b` and top_left_square=`o`
then class=`positive` 0.62121 66
83 6
If middle_middle_square=`x` and
middle_left_square=`x` then class=`positive` 0.784 125
84 6
If top_left_square=`b` and
middle_middle_square=`x` and 0.81818 22
90
middle_left_square=`b` then class=`positive`
85 6
If bottom_right_square=`x` and
middle_left_square=`o` and
middle_right_square=`x` then class=`positive` 0.77419 62
86 6
If bottom_right_square=`x` and
bottom_left_square=`b` and
middle_right_square=`x` then class=`positive` 0.79487 39
87 6
If middle_right_square=`b` and
bottom_right_square=`x` then class=`positive` 0.65476 84
88 6
If middle_middle_square=`b` and
bottom_left_square=`x` and
middle_right_square=`x` then class=`positive` 0.76923 13
89 6
If middle_left_square=`b` and
middle_right_square=`x` then class=`positive` 0.5 70
90 6 If top_right_square=`o` then class=`positive` 0.5625 256
91 7
If bottom_middle_square=`b` and
middle_right_square=`o` and
top_middle_square=`x` then class=`negative` 0.42424 33
92 7 If bottom_left_square=`o` then class=`negative` 0.44403 268
93 7
If middle_middle_square=`x` and
middle_right_square=`o` and
bottom_right_square=`o` then class=`negative` 0.42 50
94 7
If middle_middle_square=`x` and
bottom_right_square=`x` then class=`negative` 0.15 140
95 7
If middle_middle_square=`o` and
bottom_right_square=`o` then class=`negative` 0.66667 81
96 7
If top_left_square=`o` and bottom_left_square=`x`
and bottom_right_square=`x` then class=`positive` 0.83636 55
97 7
If bottom_right_square=`o` and
middle_left_square=`x` then class=`positive` 0.5473 148
98 8
If bottom_left_square=`o` and top_right_square=`b`
and top_left_square=`o` then class=`negative` 0.52381 21
99 8
If top_right_square=`o` and
bottom_middle_square=`b` then class=`negative` 0.48611 72
100 8
If bottom_left_square=`o` and top_left_square=`o`
and middle_left_square=`o` then class=`negative` 1 32
101 8
If top_left_square=`x` and
middle_middle_square=`o` then class=`negative` 0.48529 136
102 8
If bottom_middle_square=`o` and
top_right_square=`o` then class=`negative` 0.2963 54
103 8 If bottom_left_square=`b` then class=`positive` 0.66049 162
104 8
If top_right_square=`x` and
middle_right_square=`x` then class=`positive` 0.69466 131
91
105 8
If middle_right_square=`b` and
bottom_middle_square=`b` then class=`positive` 0.73913 46
Table 43 TicTacToe Example Collective Model
While the collective models generated using the CRA process were not able to achieve
higher accuracies than the traditional models in all cases. As the number of missing attributes
increased the performance of the traditional method (ordered and unordered model) suffered the
most. The collective models were less affected by the introduction of missing attributes. Take the
Anneal data set for example. The highest accuracies achieved by the traditional and collective
models (ordered) were 94% and 90% respectively under the best conditions (0 missing
attributes). These models achieved 84% and 87% accuracy under the worst conditions (4 missing
attributes). This equates to a 10% drop in accuracy for the traditional model compared to only a
3% drop in accuracy for the CRA model. Even though the models produced by the CRA process
achieved lower accuracies in some instances these models were less affected by the missing
attribute information. Table 44 provides the best performance of each data set under the best (no
missing attributes) and worst conditions (4 missing attributes) and the drop in accuracy for each
dataset for the ordered models. Table 45 displays this information for the unordered models.
Figures 37 and 38 illustrate the drop in accuracy for both types of models.
Data Set Traditional Best Collective Best Traditional Worst Collective Worst Traditional Drop Collective Drop
Congressional
Voting 99% 98% 79% 90% 20% 8%
Anneal 94% 90% 84% 87% 10% 3%
Thyroid 98% 98% 95% 97% 3% 1%
Mushroom 100% 90% 82% 87% 18% 3%
TicTacToe 99% 96% 49% 71% 50% 28%
Table 44 Best Performance for Best and Worst Conditions Ordered Models
Data Set Traditional Best Collective
Congressional
Voting 97%
Anneal 90%
Thyroid 97%
Mushroom 89%
TicTacToe 82%
Table 45 Best Performance for Best and Worst Conditions Unordered Models
Figure 37 Percentage Drop in Accuracy Ordered Models
60
50
40
30
20
10
0
(%
) P
erc
en
tag
e D
ro
p i
n A
ccu
rac
y
92
Best
Traditional
Worst
Collective
Worst
Traditional
Drop
97% 77% 89% 20%
82% 82% 81% 8%
96% 94% 95% 3%
89% 77% 82% 12%
79% 48% 69% 34%
Data Sets
Collective
Drop
8%
1%
1%
7%
10%
CRA
Traditional
Figure 38 Percentage Drop in Accuracy Unordered Models
The aforementioned results of the drop in accuracy
Sample ttest to determine if they were statistically significant. Tables 46 and 47 (ordered and
unordered models) display the results of this analysis for each data set. The
H0: No difference in means for Traditional vs. CRA drop in classification accuracy for this test.
For each data set evaluated, the p
hypothesis. Meaning, that there is evidence
created using the CRA process are more resilient against the affects of missing attribute
information.
Data Set
Congressional Voting
Anneal
40
35
30
25
20
15
10
5
0
93
were evaluated using a Welch?s Two
null hypothesis was
value was less than .05. Therefore we can reject the null
that supports the notion that the collective models
t Degrees of Freedom pvalue
5.8755 17.889 0.00001
2.6111 17.731 0.01783
Data Sets
CRA
Traditional
94
Thyroid 2.5697 12.488 0.02392
Mushroom 5.9832 16.013 0.00001
TicTacToe 12.6789 11.794 0.00000003
Table 46 Results of Two Sample Welch ttest (Ordered Models)
Data Set t Degrees of Freedom pvalue
Congressional Voting 5.1854 17.91 0.00006
Anneal 3.6934 15.388 0.002087
Thyroid 3.678 10.998 0.003640
Mushroom 2.1514 17.916 0.04534
TicTacToe 9.268 13.954 0.0000002
Table 47 Results Two Sample Welch ttest (Unordered Models)
The hypothesis of this research was that the model created with the CRA process
would be more resilient against missing attribute information in the test data. Resiliency
refers to minimizing any decreases in classification accuracy as the number of missing attributes
increases. For each data set the collective models created with the CRA process suffered less of a
drop in classification accuracy as compared to the traditional models (see Figures 37 and 38).
These findings are statistically significant (see tables 46 and 47) and support the hypothesis as
the collective model maintained the ability to make fairly accurate predictions when faced with
less than desirable conditions. This resiliency is a result of the collective model being generally
larger than the traditional model and containing shorter more general rules as described in the
analysis for each data set. This directly relates to the manner in which the collective models
95
were created. The CRA process of building individual models on smaller homogenous subsets of
the data produced collective models for all data sets that contained rules that were on average
shorter than the rules in the traditional model. Also, the collective models created for 4 out the 5
data sets contained more rules on average than the traditional model.
Figure 39 Ordered vs. Unordered Accuracies Overall
Figure 39 displays the overall performance of the unordered and ordered models.
As seen in the figure, the ordered models performed better than the unordered models from an
overall aspect. However, the difference in performance was negligible in some instances
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Ac
cu
rac
y
Ordered
Unordered
96
Chapter 6
Summary
6.1 Summary and Conclusion
Data mining is a powerful and useful tool that can be used to extract information
from large data repositories. These techniques are used in a variety of domains to perform an
array of different tasks. Predictive modeling/classification is one of the most thoroughly studied
and researched subfield in data mining. The goal of these types of tasks is to predict some future
outcome based on historical information. Rule based classifiers are one of the most often used
types of predictive model. A rule based classifier builds a model from training data as a set of
high quality rules. The models that theses algorithms generate are easy to understand and
interpret and widely accepted. There have been many rule based classification systems designed,
studied and implemented that perform adequately in many applications. However, problems arise
when the data that is used to assess the predictive ability of the model is not as comprehensive as
the data used to create the model. When faced with missing attribute information it is desirable to
have a model that maintains the ability to make fairly accurate predictions. Missing attributes are
a well known issue in the field of data mining. (Batista & Monard, 2003 ) (Witten & Frank,
2005) (Tan, Steinbach, & Kumar, 2006) The majority of the research performed focuses on
handling missing attribute information in the training data. There has been little in the way of
studying the effects of missing information in the test data. We present a process which results
in a model that is more resilient against the effects of missing attribute information in the
97
new/unseen data. The clustering rulebased approach (CRA) seeks to added robustness
using data clustering. The training data is clustered and multiple models are built using the
cluster wise data. The models are then combined into a collective model and evaluated against
test data modified to contain missing attribute information. The experimental results support the
hypothesis that the collective model is more resilient against the effects of missing attribute
information. The collective model incurred a smaller loss in classification accuracy when faced
with missing attributes. Also, the collective model displayed the ability to classify instances
unable to be classified using traditional methods. These findings are encouraging as most real
world data sets are often not as complete as practitioners would like. The work presented and
evaluated here provides a new method of ensuring the model created will have the ability to
make fairly accurate predictions when faced with missing/incomplete data.
6.2 Future Work
There are many paths the direction of this research can take. We evaluated two methods
of combining the rules from the multiple models: sort in decreasing order of accuracy and a
voting scheme. Other approaches for combining the rules should be examined. Also, a wider
variety of data sets should be used to evaluate the CRA process. The effects of a different set of
k values should also be examined. Different clustering algorithms may also be studied, as well
as, combining this technique with other process proven to improve classification accuracy such
as ensemble methods.
98
References
Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large
databases. Proceedings of teh Twentieth International Conference on Very Large Databases, (pp.
487499). Santiago, Chile.
Anderberg, M. R. (1973). Cluters Analysis for Applications. Academic Press.
Batista, G. E., & Monard, M. C. (2003 ). An analysis of four missing data treatment methods for
supervised learning. Applied Artificial Intelligence 17(56) , 519533.
Baumann, T., & Germond, A. (1993). Application of the kohonen network to shortterm
forecasting. In ANNPS, (pp. 407412).
Berk, R. (2004). An Introduction to Ensemble Methods for Data Analysis. California Center for
Population Research. Online Working Paper Series.
Berkhin, P. (2002). Survey of Clustering Data Mining Techniques. San Jose, CA: Acrue
Software.
Berson, A., Smith, S., & Thearling, K. (1999). Building Data Mining Applications for CRM.
McGrawHill Companies.
Blake, E. K., & Merz, C. J. (1998). UCI repository of machine learning databases. Retrieved
from http://www.ics.uci.edu/~mlearn/MLRepository.html
Breiman, L., Freidman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees.
Belmont, CA: Wadsworth International Group.
CHAID. (n.d.). Retrieved February 15, 2008, from http://www.wikipedia.org/wiki/CHAID
Cheng, Y., & Church, G. (2000). Biclustering of experession data. ICMB , 93103.
Chess Endgame. (2008). Retrieved April 3, 2008, from http://www.wikipedia.org/wiki/Endgame
Cho, H., Dhillon, I. S., Guan, Y., & Sra, S. (2004). Minimum sum squared residue coclustering
of gene expression data. SDM .
Clark, P., & Niblett, T. (1989). The CN2 Induction Algorithm. Machine Learning 3(4) , 261 
283.
99
Cluster Analysis. (2008). Retrieved March 6, 2008, from Wikipedia:
http://en.wikipedia.org/wiki/Data_clusterinCohen, W. (1995). Fast Effective Rule Induction.
Procedings of 12th International Conference on Machine Learning, (pp. 115123). Tahoe City,
CA.
CRISPDM. (2000). Retrieved February 16, 2008, from CRISPDM: Cross Industry Standard
Process for Data Mining: http://www.crispdm.org/Process/index.htm
Deodhar, M., & Ghosh, J. (2007). A Framework for Simultaneous Coclustering and Learning
from Complex Data. KDD'07, (pp. 250259). San Jose, California.
Djukanovic, B., Babic, B., Sobajic, D., & Pao, Y. (1993). Unsupervised/supervised learning
concept for 24house load forecasting. IEE ProceedingsGeneration, Transmission and
Distribution, (pp. 140:311318).
DMS. (2008). Retrieved March 16, 2008, from Decision Trees:
http://dms.irb.hr/tutorial/tut_dtrees.php
Dougherty, J., Kohavi, R., & Sahami, M. (1995). Supervised and unsupervised discretization of
continous features. ICML.
Freund, Y., & Schapire, R. (1999). A short introduction to boosting. Japanese Society for
Artificial Intelligence , 771780.
Friedman, N., & Goldszmidt, M. (1996). Building classifiers using Bayesian networks. National
Conference on Artificial Intelligence (pp. 12771284). Menlo Park, CA: AAAI Press.
Friedman, N., Geiger, D., & Goldszmidt, M. (1997). Bayesian Network Classifiers. Machine
Learning 29 , 131163.
Gilbert, J. E. (2006). Applicaitons Quest: Computing Diversity. Commmunications of the ACM,
49, 3 , pp. 99104.
Gilbert, J. E. (2008). Applications Quest: A Case Study on Holistic Admissions. Journal of
College Admission , 12  18.
Gilbert, J. E. (2006). Applications Quest: Computing Diversity. Communications of the ACM,
49, 3, ACM , 99104.
Gilbert, J. E. (2008). Patent No. US 2008/10183731 A1. USA.
Han, J., & Kamber, M. (2006). Data Mining Techniques and Concepts. Morgan Kaufmann.
Hu, H., & Li, J. (2005). Using Association Rules to Make Rulebased Classifiers Robust.
100
Proceedings of the Sixteenth Austrailian Database Conference (ADC 2005), (pp. 4754).
Newcastle, Australia.
Jain, A. K., Murty, M. M., & Flynn, P. J. (n.d.). Data Clustering: A Review. ACM Computing
Surveys 31, 3 , 264323.
Kass, G. V. (1980 ). An Exploratory Technique for Investigating Large Quantities of Categorical
Data. Applied Statitics,Vol 29, No 2 , 119127.
Langley, P., & Sage, S. (1994). Induction of Selective Bayesian Classifiers. Conference on
Uncertainty in Artificial Intelligence (pp. 399406). Seattle, WA: Morgan Kaufmann.
Larose, D. (2005). Discovering Knowledge in Data: An Introduction to Data Mining. Hoboken,
NJ: John Wiley & Sons.
Lewis, R. (2000). Indroduction to Classification and Regression Trees (CART). Annual Meeting
of the Society for Academic Emergency Medicine. San Francisco, CA.
Li, J., Topor, R., & Shen, H. (2002). Construct robust rule sets for classification. Proceedings of
the Eigth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
(pp. 564569). Edmonton, Alberta, Canada.
Li, W., Han, J., & Pei, J. (2001). CMAR: Accurate and efficient classification based on multiple
classassociation rules. Proceedings 2001 IEEE International Conference on Data Mining
(ICDM 2001) (pp. 369376). IEEE Computer Society Press.
Liu, B., Hsu, J., & Ma, Y. (1998). Integrating classification and assoiciation rule mining.
Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining
(KDD98), (pp. 2731).
Lokmic, L., & Smith, K. (2000). Cash flow forecasting using supervised and unsupervised neural
netoworks. IJCNN , 06:6343.
Michalski, R., Mozetic, I., Hong, J., & Lavrac, N. (1986). The AQ15 inductive learning system:
an overview and experiments. Proceedings of IMAL 1986'. Universite de ParisSud, Orsay.
Microsoft. (2007). Microsoft SQL Server 2005. Retrieved March 15, 2008, from Microsoft
Corporation: http://www.microsoft.com/sql/prodinfo/overview/default.mspx
Mingers, J. (1989). An emprical comparison of selection measures for decision tree induction.
Machine Learning vol 3 , 319342.
Oh, K., & Han, I. (2001). An intelligent clustering forecasting system based on changepoint
detection and artificial neural networks: Application to financial economics. HICSS34 volume 3
, 3011.
101
Opitz, D., & Macline, R. (1999). Popular ensemble methods: An empirical study. Artificial
Intelligence Research , 169198.
ORACLE. (2007). ORACLE. Retrieved March 15, 2008, from Oracle Data Mining:
http://www.oracle.com/technology/products/bi/odm/index.html
Quinlan, J. (2006). Bagging,Boosting, and C4.5. Retrieved January 29, 2008, from RuleQuest:
http://www.rulequest.com/Personal/q.aaai96.ps
Quinlan, J. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning vol. 1 , 81106.
Quinlan, J., & Gaetz, M. W. (1993). FOIL: A Midterm Report. Proceedings of the 1993
European Conference on Machine Learning, (pp. 320). Vienna, Austria.
Rapidi. (2008). Rapid Miner. Retrieved March 16, 2008, from http://rapidi.com
Tan, P.N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining. Boston, MA:
Addison Wesley.
TechTarget. (2008). Retrieved January 29, 2008, from What is data mining?:
http://searchsqlserver.techtarget.com/sDefinition/0,,sid87_gci211901,00.html
Wang, J., & Karypis, G. (2003). HARMONY: Efficiently Mining the Best Rules for
Classifcation. Proceedings of the 2003 SIAM International Conference on Data Mining, (pp.
205216). Newport Beach, CA.
Witten, I. H., & Frank, E. (2005). Data Mining. Practical Machine Learning Tools and
Techniques. San Francisco, CA: Morgan Kaufmann.
Yin, X., & Han, J. (2003). CPAR: Classification based on Predictive Associaiton Rules.
Proceedings of the 2003 SIAM International Conference on Data Mining, (pp. 331335). San
Francisco, CA.