CLUSTERING AND PREDICTIVE MODELING: AN ENSEMBLE APPROACH

Except where reference is made to the work of others, the work described in this thesis is my own or was done in collaboration with my advisory committee. This thesis does not include proprietary or classified information.

__________________________
Philicity K. Williams

Certificate of Approval:

Juan Gilbert, Associate Professor, Computer Science and Software Engineering
Cheryl Seals, Assistant Professor, Computer Science and Software Engineering
Hari Narayanan, Professor, Computer Science and Software Engineering
George T. Flowers, Interim Dean, Graduate School

CLUSTERING AND PREDICTIVE MODELING: AN ENSEMBLE APPROACH

Philicity K. Williams

A Thesis Submitted to the Graduate Faculty of Auburn University in Partial Fulfillment of the Requirements for the Degree of Master of Science

Auburn, Alabama
August 9, 2008

CLUSTERING AND PREDICTIVE MODELING: AN ENSEMBLE APPROACH

Philicity K. Williams

Permission is granted to Auburn University to make copies of this thesis at its discretion, upon the request of individuals or institutions and at their expense. The author reserves all publication rights.

_______________________
Signature of Author

_______________________
Date of Graduation

VITA

Philicity K. Williams, daughter of Allen and Ella Williams, was born on January 18, 1977 in Tuskegee, AL. She graduated from Booker T. Washington High School with honors in 1995. She attended Southern University and A & M College in the fall of 1996 and received a Bachelor of Science in Computer Science in May 2000. She entered Auburn University in the fall of 2003, where she is currently working towards her Ph.D.

THESIS ABSTRACT

CLUSTERING AND PREDICTIVE MODELING: AN ENSEMBLE APPROACH

Philicity K.
Williams

Master of Science, August 9, 2008
(B.S., Southern University and A & M College, May 2000)

47 Typed Pages

Directed by Juan E. Gilbert

Today's data storage and collection abilities have allowed the accumulation of enormous amounts of data. Data mining can be a useful tool in transforming these large amounts of raw data into useful information. Predictive modeling is a very popular area in data mining. The results of these types of tasks can contain helpful information that can be used in decision making. Ensemble method techniques involve using the results of multiple models in combination. Research has shown that by applying an ensemble method approach to predictive modeling, one can increase the model's accuracy. However, these techniques focus on classification data mining algorithms. This research investigates the notion of using a data clustering and predictive modeling approach to increase predictive model accuracy.

ACKNOWLEDGEMENTS

First and foremost, I would like to thank God, for He is my strength and through Him I can do all things. I would like to thank Dr. Juan Gilbert for all his support, encouragement, and patience. The many times when I did not know how I was going to make it, or didn't think I could do it, he was there to guide me. Thank you to my family for their love, support, and encouragement. They have been very supportive and encouraging throughout my academic tenure here at Auburn. Last but not least, I would like to say thank you to all of the members of the Human Centered Computing Lab. Their support, encouragement, and just being there will forever be appreciated!

Style manual or journal used: Journal of SAMPE
Computer software used: Microsoft Word 2003

TABLE OF CONTENTS

LIST OF FIGURES
1. INTRODUCTION
2.
LITERATURE REVIEW
   2.1 DATA MINING OVERVIEW
      2.1.1 PREDICTIVE MODELING
         2.1.1.1 DECISION TREES
            2.1.1.1.1 ID3
            2.1.1.1.2 C4.5
            2.1.1.1.3 CLASSIFICATION AND REGRESSION TREES
            2.1.1.1.4 CHI-SQUARE AUTOMATIC INTERACTION DETECTOR
         2.1.1.2 NAIVE BAYESIAN CLASSIFIERS
         2.1.1.3 NEURAL NETWORKS
      2.1.2 CLUSTER ANALYSIS
         2.1.2.1 HIERARCHICAL CLUSTERING
         2.1.2.2 PARTITIONAL CLUSTERING
            2.1.2.2.1 K-MEANS
            2.1.2.2.2 EXPECTATION MAXIMIZATION (EM)
      2.1.3 ASSOCIATION ANALYSIS
      2.1.4 ANOMALY DETECTION
   2.2 ENSEMBLE METHODS
      2.2.1 BAGGING
      2.2.2 BOOSTING
      2.2.3 STACKING
   2.3 DATA MINING TOOLS
      2.3.1 MICROSOFT SQL SERVER
      2.3.2 RAPID MINER
      2.3.3 WEKA
      2.3.4 ORACLE DATA MINING
   2.4 PROBLEM DEFINITION
3. EXPERIMENT
   3.1 DATA
   3.2 MATERIALS/TOOLS
   3.3 PROCEDURE
   3.4 RESULTS
4. CONCLUSIONS
   4.1 CONCLUSIONS
   4.2 FUTURE WORK
REFERENCES

LIST OF FIGURES

Figure 1 Data mining's relationship to other disciplines
Figure 2 CRISP-DM Model
Figure 3 Four of the core data mining tasks
Figure 4 Simple Decision Tree
Figure 5 Example of Simple Neural Network
Figure 6 Records to be clustered
Figure 7 Cluster Hierarchy
Figure 8 Linkage Examples
Figure 9 Basic K-means algorithm
Figure 10 Lattice Structure
Figure 11 Apriori Principle
Figure 12 Support based pruning
Figure 13 View of Ensemble Method
Figure 14 Bagging and Boosting Example
Figure 15 Decision Tree for Training Data
Figure 16 Cluster Map for Training Data

1. INTRODUCTION

Advances in data collection and storage capabilities have enabled organizations, companies, institutions, etc., to collect immense amounts of data. The challenge lies in trying to turn all of that raw data into useful information. One way to address this problem is with the use of data mining.
Data mining can be defined as "sorting through data to identify patterns and establish relationships" (TechTarget, 2006). Data mining blends traditional data analysis methods with sophisticated algorithms for processing large amounts of data. These algorithms search large data repositories in order to find new and useful patterns that might otherwise have gone unidentified. Data mining methods can be, and are being, used in many domains to address a variety of tasks. One of the more popular uses of data mining techniques is predictive modeling. Certain types of data mining algorithms (i.e., decision trees, neural networks, etc.) have the ability to predict future outcomes using the attributes in the data set. For example, let's assume that a long distance company wants to promote its new long distance plan. The company can apply predictive modeling algorithms to specifically target customers who are more likely to sign up for the plan based on their past behavior. As one could imagine, the accuracy of predictive modeling algorithms is very important. Their results contain information that can be very valuable and have the ability to help make decisions and drive change. An ensemble method approach involves using the results of individual classifiers in combination. Research has shown that applying an ensemble method approach will, more often than not, produce a predictive model with higher accuracy than a single learning algorithm. However, these ensemble method techniques only deal with classification algorithms (i.e., decision trees, neural networks, etc.). This research investigates the premise of using a clustering algorithm and then a predictive modeling approach. The rest of this thesis is organized as follows: Chapter 2 contains the literature review, which gives an overview of data mining, including a discussion of the various data mining tasks and techniques, as well as the problem definition. Chapter 3 discusses the experiment procedure, materials/tools, and results.
Chapter 4 presents conclusions and ideas for future work.

2. LITERATURE REVIEW

2.1 DATA MINING OVERVIEW

Data mining is the process of discovering useful information in large amounts of data. It is fundamentally built on the principles of algorithms used in statistics, artificial intelligence, pattern recognition, and machine learning. From statistics, data mining incorporates things such as sampling, estimation, and hypothesis testing. From artificial intelligence, pattern recognition, and machine learning, data mining incorporates the following: search algorithms, modeling techniques, and learning theories. Figure 1 conveys data mining's relationship to other areas (Tan, Steinbach, and Kumar, 2006).

Figure 1 Data mining's relationship to other disciplines

Data mining is being used in many domains in a variety of ways. Due to this, the CRoss-Industry Standard Process for Data Mining (CRISP-DM) was developed. This standard is tool, industry, and application neutral. The life cycle according to the standard has six phases: Business/Research Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. Figure 2 depicts CRISP-DM (CRISP-DM, 2000).

Figure 2 CRISP-DM Model

Most data mining tasks are either predictive or descriptive in nature. Predictive tasks are ones in which the goal is to predict the value of a particular attribute based on the values of other attributes. The attribute that is being predicted is frequently referred to as the target or dependent variable. The other attributes that are used for making the prediction are referred to as the explanatory or independent variables. Descriptive tasks are ones in which the goal is to derive patterns (i.e., correlations, clusters, trends) that will provide a summary of the underlying relationships in the data. These tasks are usually exploratory and often require some post-processing to confirm as well as explain the results.
There are four tasks that are at the heart of data mining: Predictive Modeling, Cluster Analysis, Association Analysis, and Anomaly Detection (see Figure 3) (Tan, Steinbach, and Kumar, 2006).

Figure 3 Four of the core data mining tasks

2.1.1 PREDICTIVE MODELING

Predictive modeling is one of the most popular subfields in data mining. It is the process of using the patterns found in the data set to predict future outcomes. Algorithms used for predictive modeling build a model for the dependent variable as a function of the independent variables (Thearling, 2008). There are two types of predictive modeling tasks: classification and regression. Classification tasks are used when the target or dependent variable is discrete. Regression tasks are used when the target variable is continuous (Tan, Steinbach, and Kumar, 2006). Predictive modeling problems are composed of four main things: a dependent variable, independent variables, a learning/training data set, and a test data set (Lewis, 2000). The learning/training data set contains values for both the dependent and independent variables. This data is used to build the model. The model is then applied to the test set for evaluation. The test set is a subset of the training/learning data set. The performance of the model is based on the counts of the test records that are correctly or incorrectly predicted or classified. There are several predictive modeling techniques. The next section provides an overview of some of them.

2.1.1.1 DECISION TREES

A decision tree is a predictive model in which the results are displayed as a tree type structure (Thearling, 2008). A decision tree consists of a collection of decision nodes, which are connected by branches, descending from the root node until coming to an end at the leaf nodes. Each branch of the tree is a classification question, and the leaves are the partitions or segments of the data set with the classification or decision (Larose, 2005).
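To make the structure concrete, a decision tree can be sketched as nested nodes, where each internal node asks a question and each leaf holds a classification. The sketch below is illustrative only: the attribute names (A, B) and values are invented for the example, mirroring the kind of simple tree discussed in this section.

```python
# A minimal decision tree as nested dicts: internal nodes hold a question,
# leaves hold a classification. Attributes A and B are invented examples.

def make_leaf(label):
    return {"leaf": True, "label": label}

def make_node(question, branches):
    # `question` maps a record to a branch key; `branches` maps keys to subtrees
    return {"leaf": False, "question": question, "branches": branches}

# Encodes: if A = red and B < 4.5 then K = y, otherwise K = n.
tree = make_node(
    lambda r: r["A"],                                    # root asks about A
    {
        "red": make_node(
            lambda r: "lt" if r["B"] < 4.5 else "ge",    # branch asks about B
            {"lt": make_leaf("y"), "ge": make_leaf("n")},
        ),
        "blue": make_leaf("n"),
    },
)

def classify(tree, record):
    """Descend from the root, answering each node's question, until a leaf."""
    node = tree
    while not node["leaf"]:
        node = node["branches"][node["question"](record)]
    return node["label"]

print(classify(tree, {"A": "red", "B": 3.0}))   # y
print(classify(tree, {"A": "blue", "B": 3.0}))  # n
```

Classifying a record is then a single walk from the root to a leaf, which is also why rule derivation (discussed below) is straightforward: each root-to-leaf path is one rule.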
The first step in the decision tree building process is to "grow" the tree. The goal is to create a tree that works as close to perfect as possible with the data provided. When growing the tree, the main task is to find the best question to ask at each branch or split point in the tree. This task is repeated until there is only one record in the segment, each of the records in the segment is the same, or there is no significant gain in making a split. When one of these conditions is met, the tree stops growing (Thearling, 2008). Figure 4 conveys a simple decision tree example (DMS, 2008).

Figure 4 Simple Decision Tree

There are certain requirements that must be adhered to in order to apply a decision tree algorithm. A training data set must be provided. This training data should be varied in nature and provide the algorithm with a robust sampling of the types of records that will need classification in the future. This is very important, as decision trees learn by example. If the algorithm is provided bad data, then its ability to correctly classify future data will greatly diminish. Also, the target variable must be discrete (Larose, 2005). Decision trees are a part of data mining technology that has existed in some form for decades. Originally, these techniques were created for statisticians to automate the process of determining which attributes were useful or correlated with the problem they were attempting to solve or understand. They are also particularly adept at handling raw data with little or no pre-processing (Thearling, 2008). Perhaps the most appealing aspect of decision trees is how easy they are to interpret. This is especially evident when deriving decision rules. Rules can be derived by simply traversing the tree from the root to any leaf node and come in the form of if antecedent, then consequent. The antecedent contains the attribute values from the branches taken by a particular path throughout the tree.
The consequent contains the classification value for the dependent variable given by that particular leaf node (Larose, 2005). Using the simple decision tree in Figure 4, an example of a decision rule would be: if A = red and B < 4.5, then K = y. There are several algorithms that are used to produce decision trees. These include: ID3, C4.5, Classification and Regression Trees (CART), and Chi-Square Automatic Interaction Detector (CHAID). The next sections describe each.

2.1.1.1.1 ID3

Developed in the 1970s, ID3 was one of the first decision tree algorithms. Initially, the algorithm was used for things like learning good game play strategies for chess end games. In chess, the end game is the portion of the game where there are only a few pieces left on the board (Wikipedia, 2008). However, ID3 has been used for a variety of tasks in various domains and has been modified and improved many times. ID3 works by choosing the predictors and their corresponding split values based on the gain in information that the split(s) will make available. Gain is the difference in the amount of information that will be needed to properly make an accurate prediction before and after a split(s) is made (Thearling, 2008).

2.1.1.1.2 C4.5

C4.5 is an enhancement of the ID3 algorithm. It improved upon ID3 by allowing predictors with missing or continuous values to be used, introducing tree pruning, and enabling rule derivation. The trees produced by this algorithm are also not restricted to binary splits, so they are more variable in shape (Thearling, 2008 & Larose, 2005).

2.1.1.1.3 CLASSIFICATION AND REGRESSION TREES

Classification and Regression Trees (CART) was developed in 1984 by Leo Breiman, Jerome Friedman, Richard Olshen and Charles Stone (Breiman, Friedman, & Olshen, 1984). Many of the C4.5 techniques appear in CART. The trees that are produced by the CART algorithm are strictly binary. This means that each node in the tree can only have two branches.
CART analysis is a form of recursive partitioning. The algorithm recursively partitions the data in the training set into subsets of records with similar values for the target or dependent variable. To grow the tree, CART performs an exhaustive search of available variables to pick the best or most optimal split variable. The fully grown tree that is initially produced will yield the lowest error rate when compared against the training data. However, this model may be too complex and usually results in overfitting. An overfit model is one that too closely follows all of the traits of the training data and is not general enough to represent the overall population (Thearling, 2008; Larose, 2005; Lewis, 2000). CART uses a cross-validation approach and pruning to combat overfitting. Cross-validation, also known as leave-one-out training, is a computationally intensive method for confirming a procedure for building a model (Lewis, 2000). In cross-validation, the training data is divided randomly into N segments, stratified by the variable of interest. One of the segments is set aside for use as an independent test data set. The other (N-1) segments are combined for use as the training data set. The entire model building process is repeated N times, using a different segment of data as the test data each time. N different models are produced, each of which can be tested against the independent test data. Based on the results of the cross-validation, the initial complex tree is pruned to produce the most optimal general tree. The most complex tree rarely performs best on the test data. By using cross-validation, the tree that is most likely to perform well on new data is produced (Lewis, 2000).

2.1.1.1.4 CHI-SQUARE AUTOMATIC INTERACTION DETECTOR

Chi-Square Automatic Interaction Detector (CHAID) was published in 1980 by Gordon V. Kass (Kass, 1980). CHAID is very similar to CART but differs in the way splits are determined.
The algorithm uses the chi-square test to choose the best split. The chi-square test is used in contingency tables to determine which categorical predictor is the furthest from independence with the prediction values. Because of CHAID's reliance on contingency tables to shape its test of significance for each predictor, all predictors must either be categorical or forced into a categorical form (Thearling, 2008).

2.1.1.2 NAIVE BAYESIAN CLASSIFIERS

Bayesian classifiers are based on the Bayes theorem. This theorem is a simple mathematical formula that is used for calculating conditional probabilities. The naïve Bayes classifier learns the conditional probability for each attribute given a particular class label. For example, if A and B are two random events, then P(A|B) is the conditional probability of A occurring given B. Classification is performed by using the Bayes theorem to calculate the probability of a class given a particular instance. The class with the highest posterior probability is then used as the prediction (Friedman & Goldszmidt, 1996; Langley & Sage, 1994; Friedman, Geiger, & Goldszmidt, 1997). Using the previous example, P(A|B) is the posterior probability.

2.1.1.3 NEURAL NETWORKS

An example of an authentic neural network would be the human brain. The brain can recognize patterns, make predictions, and learn. Artificial neural networks seek to emulate these capabilities. Neural networks in data mining are essentially computer programs applying pattern recognition and machine learning algorithms to build predictive models (Thearling, 2008). There are two main structures in a neural network: nodes and links. Nodes are artificial neurons, and links are the connections between them. To make a prediction, the neural network accepts values for the independent variables or predictors at the input nodes. The values of these nodes are then multiplied by values stored in the links.
These values are added together at the output node, after which some threshold function is applied, and the resulting number is the prediction. Figure 5 (Thearling, 2008) is an example of a simple neural network. In this example, Age and Income are the input nodes, and Default (whether a person will default on a bank loan) is the output node.

Figure 5 Example of Simple Neural Network

Most neural networks are not as simple as the one depicted above. There is usually a hidden layer of nodes between the input and output nodes. They are deemed "hidden" because their contents are not made known to the end user. It is also possible to have more than one hidden layer, thus making the network very complex.

2.1.2 CLUSTER ANALYSIS

Clustering can be defined as "a division of data into groups of similar objects" (Berkhin, 2002). Instances within these groups or clusters are more similar to each other than to instances belonging to other clusters. Clustering differs from classification in that there is no target or dependent variable; clustering does not attempt to classify or predict the value of a target variable. These algorithms attempt to segment the whole data set into relatively homogeneous clusters. The more similarity within a cluster, and the bigger the difference between clusters, the better the clustering. For the best possible performance, clustering algorithms require that the data be normalized so that no one attribute or variable dominates the analysis. There are two main types of clustering algorithms: hierarchical and partitional. The next section describes each.

2.1.2.1 HIERARCHICAL CLUSTERING

Hierarchical clustering creates a tree of clusters known as a dendrogram. With this type of clustering, the smallest clusters in the tree join together to create the next level of clusters. At that level, the clusters then join together to create the next level of clusters. The top or root of this tree is the cluster that contains all the records.
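The bottom-up tree construction just described can be sketched with a small merge loop: start with one cluster per record, repeatedly merge the two closest clusters, and stop when one root cluster holds every record. This is an illustrative sketch only; the one-dimensional data values are invented, and distance here is the gap between the closest members of two clusters.

```python
# Illustrative bottom-up hierarchy: one cluster per record, then repeatedly
# merge the two closest clusters until a single root cluster remains.
# Data values are invented 1-D points.

def closest_pair(clusters):
    """Return indices of the two clusters whose closest members are nearest."""
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    return best[1], best[2]

def build_hierarchy(points):
    clusters = [[p] for p in points]      # smallest clusters: one record each
    merges = []                           # each merge is one level of the tree
    while len(clusters) > 1:
        i, j = closest_pair(clusters)
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges                         # last merge is the root (all records)

merges = build_hierarchy([1.0, 1.1, 5.0, 5.2, 9.0])
print(merges[-1])  # the root cluster contains every record
```

Recording each merge, rather than only the final result, is what yields the dendrogram: every entry in `merges` is one joining step at some level of the tree.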
There are two types of hierarchical clustering: agglomerative and divisive (Berkhin, 2002 & Thearling, 2008). Agglomerative clustering algorithms begin with as many clusters as there are records, each cluster containing one record. Then the clusters that are closest to one another, based on some distance, are joined together to create the next largest cluster. This process continues until the hierarchy is built, with a single cluster containing all records at the top. Figures 6 and 7 describe the agglomerative clustering process (Wikipedia, 2008). Figure 6 contains a set of records that will be clustered. Figure 7 shows the cluster hierarchy after the agglomerative approach has been applied.

Figure 6 Records to be clustered

Figure 7 Cluster Hierarchy

Divisive clustering algorithms apply the opposite approach. These algorithms start with all of the records in one cluster and then recursively split the most suitable cluster. This process continues until some stopping criterion is met. How appropriate a cluster is for merging or splitting depends on how similar or dissimilar the items in each cluster are. The distance between individual records has to be generalized to the distance between clusters in order for splitting or merging to occur. This proximity measure is called a linkage metric. The major linkage metrics include: Single Linkage, Complete Linkage, and Average Linkage (see Figure 8) (Tan, Steinbach, and Kumar, 2006 & Berkhin, 2002). Single linkage (nearest neighbor) is based on the minimum distance between any record in one cluster and any record in another cluster. Cluster similarity is based on the similarity of the most similar members from each cluster. Complete linkage (farthest neighbor) is based on the maximum distance between any record in one cluster and any record in another cluster. Cluster similarity is based on the similarity of the most dissimilar members from each cluster.
Average linkage was designed to decrease the dependence of the cluster linkage criterion on extreme values. The criterion here is the average distance of all the records in one cluster from all of the records in another cluster.

Figure 8 Linkage Examples

2.1.2.2 PARTITIONAL CLUSTERING

Partitional clustering divides the data set in such a way that each record belongs to one and only one cluster. These algorithms do not produce a dendrogram as in hierarchical clustering; a single partition of the data is produced. Partitional clustering algorithms begin with a randomly picked or user-defined number of clusters and then optimize each cluster based on some validity measure. There are several partitional clustering approaches. The next section discusses two of these: K-Means and Expectation Maximization (EM) (Berkhin, 2002).

2.1.2.2.1 K-MEANS

K-means is one of the oldest and most widely used clustering techniques. The name K-means comes from how each of the K clusters is represented by the mean of the points within that cluster. This point is called the centroid. The basic K-means algorithm is very simple and straightforward. Figure 9 describes this algorithm (Tan, Steinbach, and Kumar, 2006).

Figure 9 Basic K-means algorithm

The first step is to pick K initial centroids. K is the number of clusters wanted and is a user-specified parameter. Next, each data instance or point is assigned to the centroid it is closest to. This collection of points is a cluster. The centroid of each of these clusters is then updated based on the points assigned to the cluster. These steps are repeated until there is no change in the centroids. Instances are assigned to a centroid based on a proximity measure that quantifies closeness for a particular data set. There are several types of proximity measures, such as Euclidean distance and Cosine similarity (Berkhin, 2002).

2.1.2.2.2 EXPECTATION MAXIMIZATION (EM)

The EM technique is very similar to the k-means approach.
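As a reference point for that comparison, the k-means loop described in the previous section can be sketched as follows. This is a minimal one-dimensional version with invented data, K = 2, and fixed (rather than random) initial centroids so the run is deterministic.

```python
# Minimal k-means sketch: assign each point to its nearest centroid, move each
# centroid to the mean of its cluster, repeat until the centroids stop moving.
# 1-D points and fixed initial centroids are used purely for illustration.

def kmeans(points, centroids, max_iters=100):
    clusters = [[] for _ in centroids]
    for _ in range(max_iters):
        # Assignment step: attach each point to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:   # no change in centroids: done
            break
        centroids = new_centroids
    return centroids, clusters

centroids, clusters = kmeans([1.0, 1.2, 0.8, 8.0, 8.2, 7.8], [0.0, 10.0])
print(centroids)  # roughly [1.0, 8.0]
```

Swapping in Euclidean distance over multi-dimensional points changes only the proximity measure inside the assignment step; the loop itself is unchanged.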
They are both iterative in nature and start with a random guess. The difference lies in the idea of hard versus soft cluster membership. In k-means, a hard membership approach is adopted. This means that an instance belongs to one and only one cluster. In EM, a soft membership approach is used, which means that the membership of an instance can be spread amongst many clusters. The EM algorithm is a two-step process: Expectation and Maximization. The expectation portion is the first step and involves calculating cluster probabilities (the expected class values). The maximization step involves calculating the distribution parameters. This step is the maximization of the likelihood of the distribution given the data. These steps are repeated until a "log-likelihood convergence is achieved" (Berkhin, 2002; Witten & Frank, 2005).

2.1.3 ASSOCIATION ANALYSIS

Association analysis is useful for finding interesting relationships that are hidden in large data sets. The goal of this type of analysis is to uncover rules (or associations) for quantifying the relationship between two or more attributes (Larose, 2005). These relationships are displayed in the form of an association rule. An association rule is an implication expression of the form X → Y : sup(X→Y), conf(X→Y), where X and Y are disjoint itemsets (i.e., X ∩ Y = ∅), sup(X→Y) is the support, and conf(X→Y) is the confidence. The strength or goodness of a rule is measured by its support and confidence. Support defines how many times a rule is pertinent to a particular data set. Confidence defines how often the items in Y appear in instances that contain X. For example, a store may find that out of 100 customers shopping on a Tuesday night, 20 bought diapers, and of the 20 who bought diapers, 5 bought beer. The association rule for this example would be: if buy diapers, then buy beer. This rule would have a support of 5/100 = 5% and a confidence of 5/20 = 25% (diapers → beer: 0.05, 0.25) (Tan, Steinbach, and Kumar, 2006).
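The support and confidence arithmetic in that example can be reproduced with a short sketch. The transaction list below is invented to match the counts in the text: 100 transactions, 20 containing diapers, 5 of which also contain beer.

```python
# Compute support and confidence for the rule {diapers} -> {beer} over a toy
# transaction list constructed to match the counts given in the text.

transactions = (
    [{"diapers", "beer"}] * 5 +     # bought both
    [{"diapers"}] * 15 +            # bought diapers only
    [{"milk"}] * 80                 # bought neither
)

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """Of the transactions containing the antecedent, the fraction that also
    contain the consequent."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"diapers", "beer"}, transactions))        # 0.05
print(confidence({"diapers"}, {"beer"}, transactions))   # 0.25
```

The two printed values are exactly the 5% support and 25% confidence computed by hand above.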
Mining association rules from large data repositories involves a two-step process: frequent itemset generation and rule generation. First, all of the frequent itemsets must be found. Frequent itemsets are those that satisfy some minimum support threshold. Then, from the frequent itemsets found in the first step, association rules are generated that satisfy the minimum support and confidence conditions. In association analysis, an itemset is a collection of zero or more items. Frequent itemset generation is generally more expensive than rule generation. A data set that has k items could produce up to 2^k - 1 itemsets. Because k can be quite large, the search space of itemsets that needs to be investigated is exponentially large. To enumerate the list of all possible itemsets, a lattice structure can be used (Figure 10). The brute force method of finding frequent itemsets involves determining the support count for each candidate itemset in the lattice structure. There are ways of reducing the computational complexity of frequent itemset generation. One of these is to reduce the number of candidate itemsets using the Apriori principle. The Apriori principle states that if an itemset is frequent, then all of its subsets must also be frequent (Figure 11). Conversely, if an itemset is infrequent, then all of its supersets must be infrequent too. For example, in Figure 12, because {a,b} is an infrequent itemset, the entire subgraph containing the supersets of {a,b} can be pruned. This method of reducing the exponential search space based on the support measure is known as support-based pruning. Apriori was the first association rule mining algorithm to pioneer the use of support-based pruning (Tan, Steinbach, and Kumar, 2006). This algorithm uses support-based pruning to control the exponential growth of candidate itemsets. Apriori uses a bottom-up technique in which the frequent itemsets are extended one item at a time. This is known as candidate itemset generation.
This process continues until no further successful extensions are found (Wikipedia, 2008).

Figure 10 Lattice Structure
Figure 11 Apriori Principle
Figure 12 Support-based pruning

2.1.4 ANOMALY DETECTION

The main objective in anomaly detection is to locate instances that are different from most of the other instances. These abnormal objects are known as outliers and have attribute values that deviate considerably from the norm or expected values. Some areas where this type of analysis is very important are fraud detection, intrusion detection, public health, and medicine. Some common causes of anomalies are data from different classes, natural variation, and data measurement or collection errors (Tan, Steinbach, and Kumar, 2006). There are three fundamental approaches to anomaly detection: supervised, unsupervised, and semi-supervised. The difference in these techniques is the degree to which known outcomes or classifications (class labels) are available for at least some portion of the data (Tan, Steinbach, and Kumar, 2006). The supervised anomaly detection approach requires a training set that contains both normal and abnormal instances. Unsupervised anomaly detection techniques seek to assign a score to each instance that reflects how abnormal that instance is. With semi-supervised anomaly detection techniques the goal is to find an anomaly score (label) for a set or group of instances using information from labeled normal instances.

2.2 ENSEMBLE METHODS

As one could imagine, the accuracy of predictive modeling algorithms is quite important. Ensemble methods seek to improve the accuracy of classifiers (such as decision trees) by combining the predictions of multiple classifiers (see Figure 13). These methods construct a set of base classifiers using the training data and make classifications for new instances by voting on the prediction each base classifier makes (Tan, Steinbach, and Kumar, 2006).
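The combination-by-voting step can be sketched as follows. The three base classifiers below are hypothetical stand-ins for trained models (their attribute names and thresholds are invented, not taken from the experiment); the point is only how the votes are tallied.

```python
from collections import Counter

def ensemble_predict(base_classifiers, instance):
    """Combine base classifiers by majority vote, as in Figure 13."""
    votes = [clf(instance) for clf in base_classifiers]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical base classifiers (stand-ins for trained decision trees).
clf_a = lambda x: "recurrence" if x["deg_malig"] == 3 else "no-recurrence"
clf_b = lambda x: "recurrence" if x["tumor_size"] >= 30 else "no-recurrence"
clf_c = lambda x: "no-recurrence"

# Two of three classifiers vote "recurrence", so the ensemble predicts it.
print(ensemble_predict([clf_a, clf_b, clf_c],
                       {"deg_malig": 3, "tumor_size": 35}))  # recurrence
```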
Research has shown that ensembles have a tendency to perform better than the single classifiers in the ensemble (Opitz & Maclin, 1999; Quinlan, 2006; Freund & Schapire, 1999; Berk, 2004). Bagging, boosting, and stacking are well known ensemble techniques.

Figure 13 View of Ensemble Method

2.2.1 BAGGING

Bagging, or bootstrap aggregation, trains each base classifier with a random redistribution of the training data (Opitz & Maclin, 1999). Each of the training sets for the base classifiers is generated by randomly selecting, with replacement, N examples, where N is the size of the original training set. Since sampling is done with replacement, some of the instances may appear many times within the same training set. Likewise, some instances may be omitted from the training set (see Figure 14). As previously stated, bagging involves combining multiple models (for example, decision trees). These models can be combined by having each model vote on each new instance. The class that receives the most votes is deemed to be the correct one. When the target variable is numeric, the average is used. The predictions or classifications made by voting become more dependable as more votes are considered. Researchers have determined that bagging is effective on learning algorithms for which small changes in the training data can produce big changes in predictions (Tan, Steinbach, and Kumar, 2006; Witten & Frank, 2005; Opitz & Maclin, 1999).

2.2.2 BOOSTING

This ensemble method technique gets its name from its capacity to take a "weak" learning algorithm and "boost" it into a "strong" learning algorithm (Freund & Schapire, 1999). A weak learning algorithm is one that performs slightly better than random guessing. Like bagging, boosting uses voting to combine the output of multiple models of the same type (for example, decision trees).
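The bootstrap sampling that bagging relies on can be sketched as follows (a minimal illustration; the data is an arbitrary list, and the seed is fixed only so the sketch is repeatable).

```python
import random

def bootstrap_sample(training_set, rng):
    """Draw N examples with replacement from a training set of size N.
    Some instances typically appear several times; others are left out."""
    n = len(training_set)
    return [training_set[rng.randrange(n)] for _ in range(n)]

rng = random.Random(0)
data = list(range(10))
sample = bootstrap_sample(data, rng)
# The sample has the same size as the original, but because sampling is done
# with replacement it usually repeats some items and omits others.
print(len(sample), sorted(sample))
```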
Boosting, however, is iterative; unlike bagging, in which each base classifier is built separately, each new model is influenced by the performance of the models built before it. New models are pushed to become experts for instances mishandled by earlier models. Boosting weights a model's contribution by its performance, instead of giving equal weight to all models as in bagging (Witten & Frank, 2005). For example, in Figure 14 (Opitz & Maclin, 1999), assume instance 1 is an outlier and hard to classify correctly. This instance appears more often in later training sets because boosting will focus more (by increasing its weight) on classifying it correctly.

Figure 14 Bagging and Boosting Example

2.2.3 STACKING

Stacking, or stacked generalization, differs from bagging and boosting in that it involves combining classifiers of different types (e.g., decision trees and neural networks). It is less commonly used because it can be difficult to analyze and there is no standard or accepted best practice. Stacking eliminates voting by introducing the idea of a metalearner. Stacking attempts to "learn" which base classifiers are the most reliable by using the metalearner to determine how best to combine the output of the base classifiers (Witten & Frank, 2005). The inputs to the metalearner (the level 1 model) are the predictions of the base classifiers (the level 0 models). Each level 1 instance will have as many attributes as there are level 0 learners. When classifying, the new instance is given to each of the level 0 learners, and those results are given to the level 1 learner. The level 1 learner then "learns" the best way to combine the predictions to make the final prediction (Witten & Frank, 2005).

2.3 DATA MINING TOOLS

There are many data mining tools available. This section gives a brief description of some of them.
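The level 0 / level 1 flow of stacking can be sketched as follows. Everything here is hypothetical for illustration: the two level 0 learners and the "learned" metalearner are hand-written stand-ins, since in practice all three would be trained models.

```python
def stacked_predict(level0_models, level1_model, instance):
    """Stacking: the level 1 metalearner takes the level 0 predictions as
    its input attributes and produces the final prediction."""
    level0_outputs = [m(instance) for m in level0_models]  # one attribute per learner
    return level1_model(level0_outputs)

# Hypothetical level 0 learners of different types (e.g., a tree and a rule);
# 1 stands for recurrence, 0 for no recurrence.
tree = lambda x: 1 if x["deg_malig"] == 3 else 0
rule = lambda x: 1 if x["inv_nodes"] > 2 else 0
# A trivial stand-in metalearner: predict 1 if either level 0 learner does.
meta = lambda preds: 1 if sum(preds) >= 1 else 0

print(stacked_predict([tree, rule], meta, {"deg_malig": 3, "inv_nodes": 0}))  # 1
```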
2.3.1 MICROSOFT SQL SERVER

Microsoft SQL Server is "comprehensive, integrated data management and analysis software that enables organizations to reliably manage mission-critical information and confidently run today's increasingly complex business applications" (Microsoft, 2007). SQL Server 2005 is the platform leader in a variety of areas, including business intelligence and database management systems.

2.3.2 RAPID MINER

"Rapid Miner (formerly YALE) is the world-leading open-source system for knowledge discovery and data mining" (Rapid-I, 2008). This tool is available not only as a stand-alone application for data analysis but also as a data mining engine that can be integrated into your own products.

2.3.3 WEKA

Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java. The software was developed at the University of Waikato and is free under the GNU General Public License (Wikipedia, 2008).

2.3.4 ORACLE DATA MINING

Oracle Data Mining (ODM) is an option of Oracle Database 10g Enterprise Edition. This tool provides the ability to "produce actionable predictive information and build integrated business intelligence applications" (Oracle, 2007).

2.4 PROBLEM DEFINITION

As stated previously, the accuracy of predictive modeling algorithms is very important. The results of these models contain information that can be very valuable and useful in aiding decision making and driving change. Ensemble methods (e.g., bagging, boosting, and stacking) have been proven to produce a model that will outperform single models. However, these methods only deal with combining classification algorithms. This research investigates the idea of using clustering and predictive modeling as an ensemble.
The hypothesis is that by clustering the data set, then applying a predictive model to each cluster and combining the results of each cluster's predictive model, we will produce a model with a higher accuracy than the predictive model alone.

3. EXPERIMENT

A preliminary experiment was conducted to determine if using a clustering then predictive modeling approach could possibly produce a model, or set of rules, with a higher prediction accuracy rate than a single classifier.

3.1 DATA

The data for this experiment was taken from the UCI data repository (Blake & Merz, 1998). The breast cancer data set was used. This set contained 286 records with two classification possibilities: recurrence (85) and no recurrence (201). Eighty percent of the data was used for the training set and the remaining twenty percent was used for the test data set. Proportionate samples were used for the training and test sets. The training set contained 161 (201 * 80%) no recurrence and 68 recurrence records. Similarly, the test set contained 40 no recurrence and 17 recurrence records.

3.2 MATERIALS/TOOLS

Microsoft SQL Server 2005 Business Intelligence, SQL Server Management Studio, and Microsoft Excel were used for all data storage and manipulation for this experiment.

3.3 PROCEDURE

Both the training and test data sets were imported into MS SQL Server database tables. The training set was then used to build the decision tree predictive model. The default decision tree algorithm used in MS SQL Server is CART. All system defaults were used to build the decision tree model. This model was then evaluated against the test data set using MS SQL Server. The Microsoft Clustering portion of SQL Server was then run with the training data. The default clustering algorithm used in this tool is EM, and the number of clusters created was 10. The data within each of these clusters was saved into Excel files and imported into SQL Server database tables.
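The proportionate (stratified) split used above can be sketched as follows; the function is illustrative only and was not part of the experiment's tooling, which used SQL Server and Excel.

```python
def stratified_counts(class_sizes, train_frac=0.8):
    """Per-class (train, test) record counts for a proportionate split:
    each class contributes train_frac of its records to the training set."""
    split = {}
    for label, n in class_sizes.items():
        n_train = round(n * train_frac)
        split[label] = (n_train, n - n_train)
    return split

# Class sizes from the breast cancer data set used in the experiment.
print(stratified_counts({"no-recurrence": 201, "recurrence": 85}))
# {'no-recurrence': (161, 40), 'recurrence': (68, 17)}
```

This reproduces the counts reported in Section 3.1: 161 + 68 = 229 training records and 40 + 17 = 57 test records.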
The Microsoft Decision Tree algorithm was then run against each of these tables, creating 10 different decision tree models. The minimum coverage for this portion of the experiment was changed to 1 in order for the trees to properly split. All other system defaults remained the same. These models were then combined and evaluated manually against the test data set.

3.4 RESULTS

Figure 15 depicts the decision tree predictive model derived from the training data. Table 1 shows this model's prediction accuracy when validated using the test data. In this chart the columns of the classification matrix correspond to actual values and the rows correspond to predicted values. The predictive model alone had an overall prediction accuracy rate of approximately 65%: 37 of the 57 test instances were correctly classified. 36 of the 40 no-recurrence-events instances were correctly classified, yielding an accuracy of 90%. Only 1 of the 17 recurrence-events instances was correctly classified, yielding an accuracy of approximately 6%.

Figure 15 Decision Tree for Training Data

Predicted               No-recurrence-events (Actual)   Recurrence-events (Actual)
No-recurrence-events    36                              16
Recurrence-events       4                               1
Table 1 Mining Model Accuracy Chart

Figure 16 depicts the cluster map after the EM clustering algorithm was applied to the training data. Because there is currently no tool that does this type of analysis, validation against the test data set had to be done manually. Table 2 contains the combined set of rules generated by each cluster's decision tree. Table 3 contains the model's accuracy chart for the clustering and predictive modeling ensemble method.
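The accuracy figures quoted above can be recomputed directly from Table 1. The sketch below takes the classification matrix with predicted labels as rows and actual labels as columns, and reports overall accuracy along with per-actual-class accuracy.

```python
def accuracies(matrix, labels):
    """Overall and per-actual-class accuracy from a classification matrix
    whose rows are predicted labels and columns are actual labels."""
    n = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    per_class = {labels[j]: matrix[j][j] / sum(matrix[i][j] for i in range(len(labels)))
                 for j in range(len(labels))}
    return correct / n, per_class

# Table 1: rows = predicted, columns = actual.
matrix = [[36, 16],
          [4,  1]]
overall, per_class = accuracies(matrix, ["no-recurrence", "recurrence"])
print(round(overall, 2))                     # 0.65  (37 of 57)
print(round(per_class["no-recurrence"], 2))  # 0.9   (36 of 40)
print(round(per_class["recurrence"], 3))     # 0.059 (1 of 17)
```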
Figure 16 Cluster Map for Training Data

Rule Number   Cluster   Rule                                           Classification
1             1         Breast = left                                  No-recurrence-events
2             1         Breast = right                                 No-recurrence-events
3             2         Menopause = premeno                            No-recurrence-events
4             2         Menopause != premeno                           Recurrence-events
5             3         NodeCaps = no                                  No-recurrence-events
6             3         NodeCaps != no                                 No-recurrence-events
0             4         No tree generated                              No-recurrence-events
7             5         InvNodes = 0-2                                 No-recurrence-events
8             5         InvNodes != 0-2                                No-recurrence-events
9             6         TumorSize = 30-34                              Recurrence-events
10            6         TumorSize != 30-34 and BreastQuad = leftUp     No-recurrence-events
11            6         TumorSize != 30-34 and BreastQuad != leftUp    Recurrence-events
0             7         No tree generated                              No-recurrence-events
12            8         DegMalig = 3                                   Recurrence-events
13            8         DegMalig != 3                                  Recurrence-events
0             9         No tree generated                              No-recurrence-events
0             10        No tree generated                              Recurrence-events
Table 2 Combined rules from the decision trees built for each cluster

Predicted               No-recurrence-events (Actual)   Recurrence-events (Actual)
No-recurrence-events    40                              0
Recurrence-events       0                               17
Table 3 Clustering Predictive Modeling Mining Model Accuracy Chart

As shown in the table above, the clustering predictive modeling ensemble method generated a set of rules that could correctly classify every instance in the test data set (100% accuracy). Table 4 shows the ID of each test instance along with its known classification, the classification produced by the ensemble method, and the rule that correctly classified it. These rules are highlighted in Table 2.
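The manual classification step can be illustrated with a small rule-based classifier. The two rules encoded below correspond to rules 1 and 12 of Table 2, but the encoding, ordering, and default label are illustrative assumptions, not a description of how the manual evaluation was actually carried out.

```python
# Hypothetical encodings of two rules from Table 2.
def rule_1(x):
    """Cluster 1: Breast = left -> No-recurrence-events."""
    return "no-recurrence-events" if x["breast"] == "left" else None

def rule_12(x):
    """Cluster 8: DegMalig = 3 -> Recurrence-events."""
    return "recurrence-events" if x["deg_malig"] == 3 else None

def classify(rules, instance, default="no-recurrence-events"):
    """Apply the combined cluster rules in order; the first match wins."""
    for rule in rules:
        label = rule(instance)
        if label is not None:
            return label
    return default

# Rule 12 fires first for this instance, so it is labeled recurrence-events.
print(classify([rule_12, rule_1], {"breast": "left", "deg_malig": 3}))
# recurrence-events
```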
ID   Classification         Ensemble Method Classification   Rule Number
1    Recurrence-events      Recurrence-events                12
2    Recurrence-events      Recurrence-events                12
3    Recurrence-events      Recurrence-events                12
4    Recurrence-events      Recurrence-events                12
5    Recurrence-events      Recurrence-events                12
6    Recurrence-events      Recurrence-events                12
7    Recurrence-events      Recurrence-events                12
8    Recurrence-events      Recurrence-events                12
9    Recurrence-events      Recurrence-events                12
10   Recurrence-events      Recurrence-events                12
11   Recurrence-events      Recurrence-events                12
12   Recurrence-events      Recurrence-events                12
13   Recurrence-events      Recurrence-events                12
14   Recurrence-events      Recurrence-events                12
15   Recurrence-events      Recurrence-events                12
16   Recurrence-events      Recurrence-events                12
17   Recurrence-events      Recurrence-events                12
18   No-recurrence-events   No-recurrence-events             1
19   No-recurrence-events   No-recurrence-events             2
20   No-recurrence-events   No-recurrence-events             1
21   No-recurrence-events   No-recurrence-events             1
22   No-recurrence-events   No-recurrence-events             1
23   No-recurrence-events   No-recurrence-events             2
24   No-recurrence-events   No-recurrence-events             2
25   No-recurrence-events   No-recurrence-events             1
26   No-recurrence-events   No-recurrence-events             2
27   No-recurrence-events   No-recurrence-events             2
28   No-recurrence-events   No-recurrence-events             2
29   No-recurrence-events   No-recurrence-events             1
30   No-recurrence-events   No-recurrence-events             1
31   No-recurrence-events   No-recurrence-events             2
32   No-recurrence-events   No-recurrence-events             1
33   No-recurrence-events   No-recurrence-events             1
34   No-recurrence-events   No-recurrence-events             1
35   No-recurrence-events   No-recurrence-events             1
36   No-recurrence-events   No-recurrence-events             2
37   No-recurrence-events   No-recurrence-events             2
38   No-recurrence-events   No-recurrence-events             2
39   No-recurrence-events   No-recurrence-events             1
40   No-recurrence-events   No-recurrence-events             1
41   No-recurrence-events   No-recurrence-events             1
42   No-recurrence-events   No-recurrence-events             2
43   No-recurrence-events   No-recurrence-events             1
44   No-recurrence-events   No-recurrence-events             1
45   No-recurrence-events   No-recurrence-events             2
46   No-recurrence-events   No-recurrence-events             2
47   No-recurrence-events   No-recurrence-events             2
48   No-recurrence-events   No-recurrence-events             2
49   No-recurrence-events   No-recurrence-events             1
50   No-recurrence-events   No-recurrence-events             1
51   No-recurrence-events   No-recurrence-events             1
52   No-recurrence-events   No-recurrence-events             1
53   No-recurrence-events   No-recurrence-events             1
54   No-recurrence-events   No-recurrence-events             1
55   No-recurrence-events   No-recurrence-events             1
56   No-recurrence-events   No-recurrence-events             1
57   No-recurrence-events   No-recurrence-events             1
Table 4 Test instances and classifications

4. CONCLUSIONS

4.1 CONCLUSIONS

Data mining is a powerful and useful tool that can be used to extract information from large data repositories. These techniques are used in a variety of domains to perform an array of different tasks. Predictive modeling is one such task. Accuracy in predictive modeling is very important. Researchers have determined that applying an ensemble approach to predictive modeling will produce higher accuracy models. However, these techniques focus on classification algorithms. The goal of this experiment was to investigate the idea of using a clustering and predictive modeling ensemble approach. The results suggest that the clustering and predictive modeling approach can possibly produce a model that has a higher accuracy than a predictive model alone.

4.2 FUTURE WORK

This research focused on the idea of using a clustering and predictive modeling approach. The data set used in this experiment was very small. We would like to do more experiments with data sets varying in size and domain. Also, there is currently no tool available to do this type of analysis; all of the analysis for the clustering and predictive modeling portion of this experiment had to be done manually.

REFERENCES

1. TechTarget. (2008). What is data mining?
Retrieved January 29, 2008, from http://searchsqlserver.techtarget.com/sDefinition/0,,sid87_gci211901,00.html
2. Tan, P., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining. Boston, MA: Addison Wesley.
3. Witten, I. & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques. San Francisco, CA: Morgan Kaufmann.
4. Thearling, K. (2008). Overview of Data Mining Techniques. Retrieved February 13, 2008, from http://www.thearling.com/text/dmtechniques/dmtechniques.htm
5. Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees. Belmont, CA: Wadsworth International Group.
6. Larose, D. (2005). Discovering Knowledge in Data: An Introduction to Data Mining. Hoboken, NJ: John Wiley & Sons.
7. Lewis, R. (2000). Introduction to Classification and Regression Tree (CART) Analysis. San Francisco, CA: Annual Meeting of the Society for Academic Emergency Medicine.
8. Wikipedia. (2007). CHAID. Retrieved February 15, 2008, from http://en.wikipedia.org/wiki/CHAID
9. CRISP-DM. (2000). Cross Industry Standard Process for Data Mining. Retrieved February 16, 2008, from http://www.crisp-dm.org/Process/index.htm
10. Friedman, N. & Goldszmidt, M. (1996). Building classifiers using Bayesian networks. National Conference on Artificial Intelligence, Menlo Park, CA, AAAI Press, 1277-1284.
11. Langley, P. & Sage, S. (1994). Induction of Selective Bayesian Classifiers. 10th Conference on Uncertainty in Artificial Intelligence, Seattle, WA, Morgan Kaufmann, 399-406.
12. Friedman, N., Geiger, D., & Goldszmidt, M. (1997). Bayesian Network Classifiers. Machine Learning, 29, 131-163.
13. Berkhin, P. (2002). Survey of Clustering Data Mining Techniques. Technical Report, Accrue Software, San Jose, CA.
14. Wikipedia. (2008). Cluster Analysis. Retrieved March 6, 2008, from http://en.wikipedia.org/wiki/Data_clustering
15. Thacker, P. (2005). Cluster Analysis. Retrieved March 6, 2008, from http://csurs1.csr.uky.edu/~pthacker/cluster.html
16.
Opitz, D. & Maclin, R. (1999). Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 11, 169-198.
17. Quinlan, J. R. (2006). Bagging, Boosting, and C4.5. Retrieved January 29, 2008, from http://www.rulequest.com/Personal/q.aaai96.ps
18. Freund, Y. & Schapire, R. E. (1999). A short introduction to boosting. Journal of the Japanese Society for Artificial Intelligence, 14(5), 771-780.
19. Berk, R. (2004). An Introduction to Ensemble Methods for Data Analysis. California Center for Population Research, On-Line Working Paper Series.
20. Blake, C. & Merz, C. (1998). UCI Machine Learning Repository [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, School of Information and Computer Science.
21. Microsoft Corporation. (2008). Microsoft SQL Server 2005. Retrieved March 15, 2008, from http://www.microsoft.com/sql/prodinfo/overview/default.mspx
22. Wikipedia. (2008). Weka (machine learning). Retrieved March 15, 2008, from http://en.wikipedia.org/wiki/Weka_(machine_learning)
23. Oracle. (2007). Oracle Data Mining. Retrieved March 15, 2008, from http://www.oracle.com/technology/products/bi/odm/index.html
24. DMS. (2008). Decision Trees. Retrieved March 16, 2008, from http://dms.irb.hr/tutorial/tut_dtrees.php
25. Wikipedia. (2008). Apriori Algorithm. Retrieved March 16, 2008, from http://en.wikipedia.org/wiki/Apriori_algorithm
26. InformIT. (2008). Expectation Maximization Theory. Retrieved March 17, 2008, from http://www.informit.com/articles/article.aspx?p=363730
27. Rapid-I. (2008). Rapid Miner. Retrieved March 16, 2008, from http://rapid-i.com/
28. Wikipedia. (2008). Chess endgame. Retrieved April 3, 2008, from http://en.wikipedia.org/wiki/Endgame