A Robust Version of Hotelling's T2 Control Chart
for Retrospective Location Analysis of Individuals Using BACON Estimators
by
Richard C. Bell, Jr.
A thesis submitted to the Graduate Faculty of
Auburn University
in partial fulfillment of the
requirements for the Degree of
Master of Science
Auburn, Alabama
December 7, 2011
Keywords: multivariate statistical process control, phase I, preliminary,
in-control reference sample, breakdown point, outliers
Copyright 2011 by Richard C. Bell, Jr.
Approved by
Nedret Billor, Chair, Associate Professor of Mathematics and Statistics
L. Allison Jones-Farmer, Co-chair, Associate Professor of Management
Asheber Abebe, Associate Professor of Mathematics and Statistics
ii
Abstract
Hotelling's T2 chart is commonly used for Phase I analysis of individual multivariate
normally distributed data. However, the presence of only a few outliers can significantly distort
classical estimates of location and scale, thus rendering the resulting analysis ineffective. This
poses a significant problem for the Hotelling's T2 chart practitioner because the desired output of
a Phase I analysis is an outlier-free reference sample which can be used to estimate control limits
for prospectively monitoring a process in Phase II. Careful selection of a robust parameter
estimation method is therefore critical when the initial reference sample is suspected to contain
multiple outliers.
The purpose of this research is to propose a version of Hotelling's T2 chart that uses the
blocked adaptive computationally efficient outlier nominators (BACON) algorithm to robustly
estimate location and scale parameters in Phase I. The proposed control chart, which assumes
individual multivariate normally distributed data with constant covariance, is designed to detect
both individual outliers and sustained mean shifts. Using Monte Carlo simulation, the proposed
method is compared to Hotelling's T2 chart using classical estimators as well as robust estimators
such as the minimum volume ellipsoid (MVE), minimum covariance determinant (MCD), and
clustering methods. Although the BACON-based version of Hotelling's T2 chart turned out to be
less powerful than expected, it is significantly better than the classical approach and offers some
improvement over existing robust methods at a fraction of the computational expense.
iii
Table of Contents
Abstract ........................................................................................................................................... ii
List of Tables .................................................................................................................................. v
List of Figures ................................................................................................................................ vi
List of Abbreviations .................................................................................................................... vii
1 Introduction and Literature Review ........................................................................................ 1
1.1 Special Considerations in Phase I Control Charting ........................................................ 1
1.2 Hotelling's T2 Control Chart for Individuals -- the Classical Approach .......................... 4
1.3 Previous Attempts to Improve the Robustness of Hotelling's T2 Chart ........................... 6
1.4 Organization of Thesis ..................................................................................................... 9
2 Hotelling's T2 Chart Using Robust Estimates of Location and Scatter ................................. 10
2.1 Desirable Properties of Robust Estimators..................................................................... 10
2.2 The MVE and MCD Methods of Robust Parameter Estimation .................................... 12
2.3 The BACON Method of Robust Parameter Estimation ................................................. 14
2.4 Empirical Control Limits for Hotelling's T2 Chart with Robust Estimators .................. 18
3 Evaluating Phase I Performance of Hotelling's T2 Chart ...................................................... 22
3.1 Simulating Multivariate Normally Distributed Data ...................................................... 22
iv
3.2 Assessing Out-of-Control Performance ......................................................................... 23
4 Comparing the BACON-Based Hotelling's T2 Chart to Other Robust Versions .................. 26
4.1 Detecting Randomly Occurring Outliers........................................................................ 26
4.2 Detecting a Sustained Shift of the Mean ........................................................................ 37
4.3 Application to an Example Data Set .............................................................................. 39
5 Summary and Conclusions ................................................................................................... 42
References ..................................................................................................................................... 45
Appendices .................................................................................................................................... 48
Appendix A: MATLAB Code for Simulating Hotelling?s T2 Chart Empirical Control Limits .. 49
Appendix B: MATLAB Code for Simulating Hotelling's T2 Chart Performance ....................... 52
v
List of Tables
Table 2.4.1 Empirical and Classical UCLs for Hotelling's T2 Chart ........................................... 19
Table 2.4.2 BACON Input Arguments for Each Dimension Evaluated ...................................... 19
Table 3.1.1 Summary of Planned Experiments ........................................................................... 23
Table 4.1.1 Empirical Alarm Probabilities for p = 3 ................................................................... 29
Table 4.1.2 Empirical Alarm Probabilities for p = 5 ................................................................... 33
Table 4.1.3 Empirical Alarm Probabilities for p = 10 ................................................................. 35
Table 4.2.1 Empirical Alarm Probabilities Under a 50% Sustained Shift of the Mean .............. 38
Table 4.3.1 Example Bivariate Data Set ...................................................................................... 39
Table 4.3.2 T2 Statistics for Original and Altered Samples Using BACON Estimators ............. 40
vi
List of Figures
Figure 2.3.1 IC FAPs for Hotelling's T2 Chart Using Tracy et al.'s (1992) UCL ....................... 17
Figure 2.4.1 Hotelling's T2 Chart IC FAPs Using Robust Estimators & Empirical UCLs .......... 21
Figure 4.1.1 Control Chart Performance on Bivariate Normal Data with k = 1, 3 Outliers ........ 27
Figure 4.1.2 Control Chart Performance on Bivariate Normal Data with k = 5, 7 Outliers ........ 28
Figure 4.1.3 Control Chart Performance with k Outliers When n = 30 and p = 3 ....................... 30
Figure 4.1.4 Effect of Increasing k on Control Chart Performance ............................................. 31
Figure 4.1.5 Control Chart Performance with Extreme Outliers ................................................. 32
Figure 4.1.6 Control Chart Performance with k Outliers When n = 50 and p = 5 ....................... 34
Figure 4.1.7 Control Chart Performance with 20% Outliers in Ten Dimensions ........................ 36
Figure 4.1.8 Effect of Increasing Dimension on Control Chart Performance ............................. 37
Figure 4.2.1 Control Chart Performance Under a 50% Sustained Shift of the Mean .................. 38
Figure 4.3.1 Application of the BACON-Based Hotelling's T2 Chart to Altered Data ............... 41
vii
List of Abbreviations
ARL average run length
BACON blocked adaptive computationally efficient outlier nominators
CL center line
EAP empirical alarm probability
FAP false alarm probability
HT2 Hotelling's T2
IC in control
LCL lower control limit
MCD minimum covariance determinant
MCUSUM multivariate cumulative sum
MEWMA multivariate exponentially weighted moving average
MVE minimum volume ellipsoid
OC out of control
RBP replacement breakdown point
RL run length
RMCD reweighted minimum covariance determinant
RMVE reweighted minimum volume ellipsoid
UCL upper control limit
1
1 Introduction and Literature Review
The first multivariate quality control chart is attributed to Harold Hotelling (1947), who
created the T2 chart to monitor bombsight data during World War II. Since its introduction,
many variations and refinements of Hotelling's T2 chart have been proposed, and it remains the
most familiar multivariate quality control chart in existence today [Montgomery (2005, p. 491)].
This research seeks to further broaden the appeal of Hotelling's T2 chart for individual
multivariate normally distributed data in a Phase I setting by using Billor, Hadi, and Velleman's
(2000) blocked adaptive computationally efficient outlier nominators (BACON) method of
robust parameter estimation to improve the T2 statistic's robustness to outliers.
1.1 Special Considerations in Phase I Control Charting
A control charting application is typically divided into two distinct phases. In Phase I,
when little is known about a process being studied, the objective is to identify an in-control (IC)
reference sample. This involves retrospective analysis of a historical data set in order to
eliminate any data points that do not accurately represent the routine operation of the process.
The resulting data are described as in control because it is believed that all remaining variability
in the process is inherent to the process itself and not due to assignable causes. Upon completion
of Phase I, the in-control reference sample is used to establish control limits for Phase II, the
monitoring stage of a control charting application. In Phase II, newly observed data points are
successively compared to the control limits to identify significant departures from the in-control
2
state. Should an observation fall outside the control limits, a search for an assignable cause is
immediately undertaken. If the change in process behavior can be linked to special causes or
external factors, the process is deemed out of control (OC) and corrective action is implemented
to fix the problem.
Prior to conducting any analysis in a control charting scenario, it is usually assumed that
the unedited reference sample may contain out-of-control points and the control limits are
unknown. The challenging nature of a Phase I analysis under these conditions has been
recognized since the earliest days of statistical process control. Shewhart (1939, p. 76) said, "In
the majority of practical instances, the most difficult job of all is to choose the sample that is to
be used as the basis for establishing the tolerance range. If one chooses such a sample without
respect to the assignable causes present, it is practically impossible to establish a tolerance range
that is not subject to a huge error."
Phase I control charts are designed with the goal of achieving a specified overall in-
control false alarm probability (FAP), defined as the probability of one or more observations
plotting outside the control limits in the absence of assignable causes. Phase I usually involves
iteratively comparing the reference sample to trial control limits (corresponding to the desired
overall in-control FAP) estimated from the sample. At each iteration of a Phase I analysis, an
out-of-control point is eliminated from the reference sample if an assignable cause is identified,
and trial control limits are updated excluding the out-of-control point. This iterative process
continues until all points in the reference sample are in control.
Phase I analysis requires careful consideration when it involves methods such as
Hotelling's T2 chart that compute independent control chart statistics consisting of individual
observations. Provided the observations originate from random sampling, the control chart
3
statistics are independent of one another. However, because the control limits in Phase I are
estimated from the reference sample itself, the control limits are dependent on each sample point
included in their calculation. Thus, simultaneous comparisons of chart statistics to control limits
in Phase I are statistically dependent despite the control chart statistics themselves being
independent. These dependencies often make it difficult to correctly determine the overall in-
control FAP for a Phase I analysis.
Phase II, on the other hand, consists of comparing new observations (in the form of a
control chart statistic) to the control limits previously established in Phase I. Because the control
limits in Phase II are fixed through conditioning, successive comparisons of chart statistics to
control limits are independent provided the chart statistics are independent of one another as in
the case of Hotelling's T2 chart. This is in contrast to multivariate exponentially weighted
moving average (MEWMA) or multivariate cumulative sum (MCUSUM) charts whose chart
statistics include past observations and are therefore naturally dependent.
Chart performance in Phase II is often measured using moments of the run length (RL)
distribution. The RL is the number of observations until an out-of-control signal is observed. If
the comparisons of the chart statistics to the control limits are independent, the RL is a geometric
random variable. The expected value of the RL is equal to 1/?, where ? is equal to the
probability that a single chart statistic plots outside the control limits in the absence of assignable
causes. The expected value of the RL is known as the average run length (ARL) and is
commonly used to describe control chart performance in Phase II.
4
1.2 Hotelling's T2 Control Chart for Individuals -- the Classical Approach
Hotelling's T2 control chart may be used to detect outliers in a p-dimensional multivariate
process, where each observation in the form of a p-vector 1( ,..., )ipXX?X is assumed to come
from a multivariate normal distribution when the process is in-control. More specifically, each
~ ( , ),ipNX ? ? where ? is the p-dimensional mean vector that defines the location of the process
and ? is the positive definite p x p covariance matrix that specifies the dispersion of the process.
Hotelling's T2 statistic is calculated for each iX as
? ? ? ?21 ,i i iT ??? ? ?XX??? (1.2.1)
and subsequently compared to either a Phase I or Phase II upper control limit (UCL).
In a Phase II control charting application when ? and ? are known, the UCL for
Hotelling's T2 chart is given by Montgomery (2005, p. 501) as
2,UCL ,p??? where ? = the desired in-control FAP. (1.2.2)
When ? and ? are unknown, as is usually the case in practice, they are typically replaced by the
classical sample mean vector and sample covariance matrix estimated from an in-control
reference sample consisting of n independent observations. The classical estimators are defined
as follows:
? ? ? ?
11
11a n d .1nni i i
iinn??
?? ? ? ????X X S X X X X (1.2.3)
Using these unbiased estimators, Hotelling's T2 statistic becomes
? ? ? ?21i i iT ??? ? ?X X S X X (1.2.4)
and according to Montgomery (2005, p. 501) has Phase II upper control limit
5
,,2( 1 ) ( 1 )U C L ,p n pp n n Fn n p ? ???? ?
where ? = the desired in-control FAP. (1.2.5)
If an in-control reference sample is not available, a Phase I analysis must first be
conducted using Hotelling's T2 statistic in Equation (1.2.4) and the following UCL specified by
Tracy, Young, and Mason (1992):
2
, / 2 ,( 1 ) / 2( 1 )U C L ,p n pn n ?? ????
where ? = the desired in-control FAP. (1.2.6)
Note that in Phase I, ? represents the desired in-control FAP for each observation. In order to set
??to achieve a desired overall in-control FAP for all n observations in a reference data set, the
following relation must be used:
? ?1/1 1 .noverall??? ? ? (1.2.7)
For example, for a reference sample consisting of n = 50 observations and a desired overall in-
control FAP of 0.05, ? ?1 / 5 01 1 0 .0 5 0 .0 0 1 0 2 5? ? ? ? ? would be used in Equation (1.2.6) to
determine the Phase I UCL. If the purpose of Hotelling's T2 chart is solely to identify location
shifts, the lower control limit (LCL) in both Phase I and Phase II is often defined as zero or not
specified at all. This is because location shifts in all directions result in increasingly positive T2
values, so an LCL is not necessary.
As noted by Vargas (2003), Hotelling's T2 statistic using classical estimators in Phase I is
effective in detecting a single moderately sized outlier. However, its inability to detect multiple
outliers has been well documented by Vargas (2003), Jobe and Pokojovy (2009), and Yanez,
Gonzalez, and Vargas (2010), and its poor performance in detecting sustained shifts in the mean
vector has been demonstrated by Sullivan and Woodall (1996) and Vargas (2003). This is
because the presence of even a single arbitrarily large outlier or a few moderately sized outliers
6
can significantly contaminate the parameter estimates and ,XS thus rendering the T2 statistic
ineffective. This is precisely the problem this research seeks to address.
1.3 Previous Attempts to Improve the Robustness of Hotelling's T2 Chart
In recent years, improving the robustness of Hotelling's T2 chart for individual
multivariate normally distributed data in Phase I has garnered much interest in multivariate
quality control research. Many proposed methods have been successful under certain conditions,
but none have proven to be universally superior. Thus, there is still room for improvement.
Before detailing the BACON-based robust version of Hotelling's T2 chart proposed by this
research, a brief summary of other robust versions of Hotelling's T2 chart for individual
multivariate normally distributed data will be provided.
In one of the earliest attempts to improve the robustness of Hotelling's T2 control chart,
Sullivan and Woodall (1996) proposed using vector differences between successive observations
to estimate the in-control covariance matrix of a process, and showed that this method used in
conjunction with Hotelling's T2 statistic results in enhanced detection ability of step (sudden) and
ramp (gradual) shifts in the mean during retrospective analysis of a data set. Later, Vargas
(2003) evaluated the performance of five different types of robust estimators for use with
Hotelling's T2 chart, including the minimum volume ellipsoid (MVE) estimators of Rousseeuw
(1984) and Rousseeuw and Van Zomeren (1990), the minimum covariance determinant (MCD)
method of Rousseeuw (1984) and Rousseeuw and Van Driessen (1999), a trimming approach
based on Mahalanobis distance, the aforementioned method of vector differences proposed by
Sullivan and Woodall (1996), and an outlier detection algorithm also proposed by Sullivan and
Woodall (1996).
7
Vargas (2003) ultimately recommended the MVE estimator for detecting multiple
outliers and the Sullivan and Woodall (1996) successive differences estimator for identifying
sustained shifts in the mean vector. Jensen, Birch, and Woodall (2007) further detailed the
advantages of the MVE and MCD methods as robust estimators and provided a detailed analysis
of when to use each type of estimator with the T2 statistic in a Phase I control chart setting.
Alfaro and Ortega (2008) proposed trimming each variable to obtain robust estimates for the
location vector and covariance matrix, and then using those estimates in Hotelling's T2 chart with
the Phase I UCL given in Equation (1.2.6) to provide enhanced outlier detection. The method of
Alfaro and Ortega (2008) demonstrated improvement over Hotelling's T2 chart using classical
estimators, but no other performance comparisons were offered.
In one of the most comprehensive studies performed, Jobe and Pokojovy (2009)
developed a computationally intensive two-step method of identifying the largest bulk of similar
multivariate data from a time-ordered sequence of individual points, and used the estimated
mean vector and covariance matrix from this bulk in the T2 statistic with empirical control limits.
The authors compared the performance of Hotelling's T2 chart using their method, the classical
method of parameter estimation, and the robust methods analyzed by Vargas (2003) and Jensen
et al. (2007), showing that their method resulted in improved performance in detecting outliers as
well as sustained shifts in location. Based on these findings, the results of Jobe and Pokojovy
(2009) will be used as the standard of comparison for the BACON-based version of Hotelling's
T2 chart proposed by this research.
The most recent attempts to improve the robustness of Hotelling's T2 chart in Phase I
include Oyeyemi and Ipinyomi's (2010) proposal to robustly estimate the covariance matrix by
identifying a subset of data that meets specified optimality criteria, and then iteratively
8
expanding the subset to a predetermined size. The method was shown to outperform the MVE
and MCD methods in a limited number of cases, but only bivariate samples of size m = 30 were
considered. Yanez, Gonzalez, and Vargas (2010) proposed a T2 chart using biweight S
estimators for location and scatter in conjunction with simulated limits, showing that it
outperforms Hotelling's T2 chart with MVE estimators for small samples.
Other authors of robust Hotelling's T2 charts for individual multivariate normally
distributed data focused their performance comparisons on Phase II rather than Phase I.
Chenouri and Steiner (2009) proposed a robust Phase II Hotelling's T2 chart with simulated limits
based on reweighted MCD (RMCD) estimators, as defined by Willems, Pison, Rousseeuw, and
Van Alest (2002), obtained directly from an unedited reference sample. The authors show that
their version of Hotelling's T2 chart outperforms Hotelling's T2 chart using classical estimators,
MVE estimators, and MCD estimators under certain conditions in Phase II. Chenouri and
Variyath (2011) built upon the work of Chenouri and Steiner (2009) by comparing Hotelling's T2
chart using RMCD estimators to Hotelling's T2 chart using reweighted MVE (RMVE) and S
estimators in Phase II, again favoring the RMCD estimators for location and scatter in most
scenarios evaluated. Mohammadi, Midi, Arasan, and Al-Talib (2011) also explored the merits of
using RMCD and RMVE estimators in a Phase II Hotelling's T2 chart, recommending the RMVE
method for small reference samples and the RMCD method for large reference samples.
Assessment of a robust version of Hotelling's T2 control chart exclusively in terms of Phase II
performance inherently assumes that a practitioner desires to use Hotelling's T2 chart in Phase II,
which may or may not be the case in reality. In order to allow a practitioner the flexibility to
choose another proven Phase II method (e.g. the MEWMA or MCUSUM chart) after outlier
9
removal is completed in Phase I, this research will focus on Hotelling's T2 chart performance in
Phase I only.
1.4 Organization of Thesis
The remainder of this document is dedicated to the detailed development and application
of a BACON-based robust version of Hotelling's T2 control chart for individual multivariate
normally distributed data in Phase I. Chapter 2 explores the properties of various robust
parameter estimation methods including BACON, and addresses the design of a BACON-based
Hotelling's T2 control chart. Chapter 3 provides the simulation plan for assessing the BACON-
based Hotelling's T2 control chart's in- and out-of-control performance. Chapter 4 contains the
results of the simulation study and comparisons to several existing robust Hotelling's T2 control
charts. This thesis concludes in Chapter 5 with a synopsis of research conducted,
recommendations to practitioners, and discussion of areas in need of further investigation.
10
2 Hotelling's T2 Chart Using Robust Estimates of Location and Scatter
Numerous robust parameter estimation methods have been used in control chart research,
including the MVE, MCD, and several other methods mentioned in Chapter 1. The BACON
method was shown by Billor et al. (2000) to perform similarly to well known robust parameter
estimation methods such as the MVE and MCD methods without the extreme computational
burden, making it a seemingly ideal candidate for improving the robustness of Hotelling's T2
chart. The BACON method demonstrates excellent balance between computational complexity
and robustness to outliers while also satisfying several other important statistical properties of
robust parameter estimation methods.
2.1 Desirable Properties of Robust Estimators
The performance of an estimator is commonly described by its finite-sample replacement
breakdown point (RBP). First defined by Donoho and Huber (1983), the RBP is the minimum
fraction of a sample that must be replaced by outliers in order to completely ruin an estimate, so
a low RBP indicates nonrobustness and a high RBP signifies robustness to outliers. Precise
definitions of RBPs for both location and scatter estimators are adapted from Donoho and Huber
(1983) and Lopuhaa and Rousseeuw (1991). Let ? ?1,...,nn?X XX be a random sample of size n
in .pR The RBP of a location estimator T at nX , or the smallest fraction k/n of outliers that can
take the resulting estimate beyond any bound, is defined as
11
, ,
( ; ) m i n : s u p ( ) ( ) ,
nkn n n k
kR B P T T Tn??? ? ? ?????
XX X X
(2.1.1)
where ,nkX is a contaminated sample found by replacing k points of nX with arbitrary values.
The RBP of a scatter estimator C at nX , or the smallest fraction k/n of outliers that can drive
either the largest eigenvalue of the resulting estimate to infinity or the smallest eigenvalue of the
resulting estimate to zero, is defined as
? ?
, ,
( ; ) m i n : s u p ( ) , ( ) ,
nkn n n k
kR B P C M C Cn??? ? ?????
XX X X
(2.1.2)
where ,nkX is defined as before, ? ?11
11( , ) m a x ( ) ( ) , ( ) ( ) ,ppM ? ? ? ???? ? ?A B A B A B
and
1( ) ( )p????AA are the ordered eigenvalues of the matrix A.
To illustrate the idea of an RBP, consider a sample of size n in 1R and two common
location estimators: the sample mean and the sample median. The sample mean has an RBP of
only 1/n because a single outlier could move the sample mean to infinity, so it is considered a
nonrobust location estimator. In contrast, the sample median has the highest possible RBP of 1/2
because 1/2 of the sample would have to be contaminated with outliers in order to effect a
corresponding shift in the sample median. Consequently, the sample median is the preferred
location estimator in 1R from a robustness standpoint.
In addition to having a high RBP, a location or scatter estimator should also be affine
equivariant. From Lopuhaa and Rousseeuw (1991), a location estimator T is affine equivariant
if ( ) ( )TT? ? ?A X b A X b for any p-vector b and any p x p nonsingular matrix A, and a positive
definite scatter estimator C is said to be affine equivariant if () TC C ( )??A X b A X Afor any p-
12
vector b and any p x p nonsingular matrix A. Affine equivariance means that an estimator does
not depend on the location, scale, or orientation of the data.
2.2 The MVE and MCD Methods of Robust Parameter Estimation
Rousseeuw's (1984) MVE method is an affine equivariant, computationally complex,
robust parameter estimation technique. The MVE method finds the ellipsoid of minimum
volume that covers a subset of at least h points, and uses the geometrical center of the ellipsoid
as the location estimator and the matrix defining the ellipsoid itself (multiplied by a constant) as
the covariance matrix estimator. Lopuhaa and Rousseeuw (1991) showed that the integer value
of h = (n+p+1)/2 provides the highest possible RBP of [(n-p+1)/2]/n, which converges to 50% as
.n??
Due to the computational complexity of Rousseeuw's (1984) original MVE method,
Rousseeuw and Leroy (1987) proposed an alternative method that approximates MVE estimates
using a subsampling algorithm. With the subsampling method, a fixed number of subsets are
first drawn from a data set. Rousseeuw and Leroy (1987, p. 199) recommend at least 500 subsets
for small data sets in low dimensions and even more subsets for larger n and p. Next, the sample
mean vector and sample covariance matrix are calculated for each subset. This determines the
shape of an ellipsoid, which is then increased in size through multiplication by a constant until
the ellipsoid covers at least the required h data points. Once this has been completed for each
subset, the ellipsoid having the smallest volume is used to obtain the MVE estimates of location
and scatter.
According to Jensen et al. (2007), the subsampling algorithm is widely used but suffers
from repeatability issues. More specifically, MVE estimates of location and scatter from the
13
same data set can vary widely depending on the number of subsets used in the subsampling
algorithm. The software package R was used in this research to calculate MVE estimates using
Rousseeuw and Leroy's (1987) subsampling algorithm. The MVE function in R was employed
using default input arguments, which are believed to be h = (n+p+1)/2 and 500 subsamples,
although the R documentation is not completely clear about this. The fact that the MVE results
determined by this research are somewhat different than those obtained by Vargas (2003) and
Jensen et al. (2007) is likely due to the authors' use of different MVE input arguments.
Rousseeuw's (1984) MCD method, which is also robust, affine equivariant, and
computationally complex, finds the subset of data that has the smallest covariance matrix
determinant while covering a specified minimum number of points, h. It then uses the sample
mean vector and the sample covariance matrix (as in the MVE approach, also multiplied by a
constant) of the points in the subset as estimators of location and scatter. Rousseeuw and Van
Driessen's (1999) FAST-MCD algorithm is a more computationally efficient version that
approximates MCD estimates based on an iterative scheme involving randomly selected subsets
of data. According to Jensen et al. (2007), the FAST-MCD algorithm does not suffer from the
same repeatability issues as the MVE subsampling method and is therefore a better estimator.
Like the MVE method, the integer value of h = (n+p+1)/2 provides the highest possible RBP for
the MCD method of [(n-p+1)/2]/n, which converges to 50% as .n?? Alternatively, h can be
increased to 0.75n if the percentage of OC points is thought to be low. This increases the
efficiency of the estimator because more IC observations from the reference sample are being
used. All results in this research were derived using h = 0.75n because the vast majority of
scenarios evaluated involve contamination levels less than or equal to 20%, so there is no need to
sacrifice statistical efficiency for a higher RBP. Using h = 0.75n, this research often achieved
14
better performance using the FAST-MCD method than Vargas (2003) and Jensen et al. (2007),
especially for small samples with a low percentage of outliers.
2.3 The BACON Method of Robust Parameter Estimation
The BACON method possesses all the desired properties of robust estimators and is very
computationally efficient, even for extremely large data sets and higher dimensions. It begins
with a small outlier-free subset of the data and then allows this subset to grow rapidly until a
stopping criteria is reached, sometimes taking only two to three iterations to converge to a
satisfactory solution. Two versions of this iterative forward selection method are available --
Version 2 which is nearly affine equivariant and has RBPs exceeding 40% for various
combinations of dimension and sample size, and Version 1 which is completely affine
equivariant with an approximate RBP of 20%. MATLAB code for the BACON method is
available from the authors upon request.
From Billor et al. (2000), the BACON algorithm is as follows:
Step 1: Identify an initial basic subset of m > p observations that can safely be assumed
free of outliers, where p is the dimension of the data and m is an integer chosen by the data
analyst. Billor et al. (2000) suggest m = cp, where c is a small integer (such as 4 or 5) chosen by
the data analyst.
Using Version 1 of the BACON method, Step 1 involves computing the Mahalanobis
distances
? ? ? ? ? ?1, , 1 , . . . , ,i i id i n??? ? ? ?X S X X S X X (2.3.1)
15
where X and S are the classical mean vector and covariance matrix, respectively, of the n
observations. The m = cp observations with the smallest values of ? ?,id XS are then nominated
as a potential basic subset. According to Billor et al. (2000), this start is not robust, but it is
affine equivariant and computationally easy. Furthermore, Billor et al. (2000) show that
subsequent iterations make up for the nonrobust start as long as the fraction of outliers is
relatively small (20% in 5 dimensions, 10% in 20 dimensions).
If Version 2 of the BACON method is used, Step 1 involves computing
, 1,..., ,i in??XM (2.3.2)
where M is the coordinatewise median and ? is the vector norm. The m = cp observations
with the smallest values of i?XM are then nominated as a potential basic subset. According
to Billor et al. (2000), this start is robust but not affine equivariant (because the coordinatewise
median is not affine equivariant) and slightly more computationally intensive because of the
need to find medians in all directions. However, Billor et al. (2000) also state that because
subsequent iterations are affine equivariant, the overall procedure is nearly affine equivariant,
and it achieves a high RBP of approximately 40%.
Step 2: Fit an appropriate model to the basic subset, and from that model compute
discrepancies for each of the observations. This involves computing
? ? ? ? ? ?1, , 1 , . . . , ,i b b i b b i bd i n??? ? ? ?X S X X S X X (2.3.3)
where bX and bS are the mean vector and covariance matrix, respectively, of the observations in
the basic subset.
Step 3: Find a larger basic subset consisting of observations known (by their
discrepancies) to be homogeneous with the basic subset. Generally, these are the observations
16
with the smallest discrepancies. This new basic subset may omit some of the previous basic
subset observations, but it must be as large as the previous basic subset.
In order to accomplish this, the new basic subset will include all points with
discrepancies less than ,/,npr p nc ?? where 2,p?? is the 1 - ? percentile of the chi-square
distribution with p degrees of freedom, cnpr = cnp + chr is a correction factor, chr = max{0, (h -
r)/(h + r)}, h = [(n + p + 1)/2], r is the size of the current basic subset, and
1 1 1 21 1 .
13np ppc n p n h p n p n p??? ? ? ? ? ?? ? ? ? ? ?
(2.3.4)
The parameter ? can be set to any number between 0 and 1, but ? = 0.05 is suggested by Billor
et al. (2000) for most applications.
Step 4: Iterate Steps 2 and 3 until the size of the basic subset no longer changes.
Step 5: Nominate the observations excluded by the final basic subset as outliers.
Once the final basic subset is determined, estimates of location and scatter using the
classical formulas for X and S are computed and used in conjunction with Hotelling's T2 chart to
perform a Phase I analysis of individual multivariate normally distributed data. It is, however,
important to recognize that the distribution of the T2 statistic using BACON estimates from the
outlier-free data set is not the same as the distribution of the T2 statistic using classical estimates
from the original data set. The same is true when MVE, MCD, or other estimators are used in
lieu of classical estimators. In such cases, Tracy et al.'s (1992) Phase I UCL for Hotelling's T2
chart given by Equation (1.2.6) is no longer appropriate.
Figure 2.3.1 is a graphical depiction of the results of incorrectly applying Tracy et al.'s
(1992) Phase I UCL to Hotelling's T2 chart using several non-classical estimators of location and
scatter. In this example, the data are multivariate normally distributed and the desired IC FAP is
17
Figure 2.3.1 IC FAPs for Hotelling's T2 Chart Using Tracy et al.'s (1992) UCL
0.0 0.1
0.2 0.3
0.4 0.5
0.6 0.7
0.8 0.9
1.0
30 50 75 100 125
Em
pir
ica
l IC
FA
P
Sample Size
p = 3
BACON
MVE
MCD
Classical
0.0 0.1
0.2 0.3
0.4 0.5
0.6 0.7
0.8 0.9
1.0
30 50 75 100 125
Em
pir
ica
l IC
FA
P
Sample Size
p = 5
BACON
MVE
MCD
Classical
0.0 0.1
0.2 0.3
0.4 0.5
0.6 0.7
0.8 0.9
1.0
30 50 75 100 125
Em
pirica
l I
C
FAP
Sample Size
p = 10
BACON
MVE
MCD
Classical
18
0.05, but only Hotelling's T2 chart using classical estimators maintains the desired IC FAP
throughout the range of sample sizes and dimensions considered. The empirical IC FAPs for
Hotelling's T2 chart using BACON, MVE, and MCD estimators are in most cases significantly
higher than the target of 0.05. This effect is usually most pronounced with small sample sizes
and worsens with increasing dimension. If the distributions of the T2 statistics using BACON,
MVE, and MCD estimators were known, corresponding Phase I UCLs for Hotelling's T2 chart
could be derived mathematically. Since the distribution of the T2 statistic using BACON
estimators is unknown and the distribution of the T2 statistic using MVE or MCD estimators is
known only asymptotically [Jensen et al. (2009, p. 20)], Phase I UCLs must be empirically
determined through simulation.
2.4 Empirical Control Limits for Hotelling's T2 Chart with Robust Estimators
For each combination of n and p, using a desired IC FAP of 0.05, empirical Phase I UCLs
for Hotelling's T2 chart using BACON, MVE, and MCD estimators were determined using a
methodology similar to the one outlined by Jensen et al. (2007):
1) Simulate 100,000 sets of individual multivariate normally distributed data with zero mean
vector and identity covariance matrix. Due to the affine equivariance of the 2T statistic,
0 and I may be used without loss of generality, so the resulting limits are applicable for
any ? and ?.
2) For each data set, determine robust estimates of location and scatter, compute 2T
statistics for all observations, and record the maximum 2T statistic.
19
3) From the set of 100,000 maximum 2T statistics, which represents the empirical
distribution of the maximum 2T statistic, determine the 95th percentile. This represents
the empirical UCL for a desired IC FAP of 0.05.
A table of BACON, MVE, MCD, and classical control limits for all combinations of n and p
used in this research is provided in Table 2.4.1, a table of input arguments for the BACON
method is illustrated in Table 2.4.2, and MATLAB code for simulating additional control limits
is provided in Appendix A. The input parameters in Table 2.4.2 were established through trial
and error in order to achieve the highest possible alarm probabilities under a variety of OC
conditions.
Table 2.4.1 Empirical and Classical UCLs for Hotelling's T2 Chart
Table 2.4.2 BACON Input Arguments for Each Dimension Evaluated
n p B A C O N U C L M V E U C L M C D U C L C l as s i c al U C L
30 2 21 .0 7 41 .6 5 27 .7 3 10 .5 5
30 3 24 .2 8 58 .6 5 41 .2 6 12 .2 1
50 3 22 .0 5 35 .3 9 28 .5 6 14 .1 4
100 3 21 .6 0 26 .1 5 24 .0 9 16 .4 1
30 5 34 .4 5 84 .2 6 69 .7 8 14 .9 2
50 5 29 .0 9 46 .4 9 43 .6 1 17 .4 1
100 5 27 .1 4 32 .5 6 31 .1 9 20 .2 1
30 10 20 .0 2 21 7. 91 19 2. 24 20 .0 5
50 10 49 .6 8 95 .3 5 10 3. 09 23 .9 8
100 10 40 .8 9 51 .3 2 52 .6 4 28 .0 9
D e s i r e d I C F A P = 0 .0 5
p V e r s ion ? c
2 2 0.10 6
3 2 0.10 6
5 2 0.10 4
10 2 0.10 3
B A C O N I n p u t A r gu m e n t s
20
Repeating the experiment reflected in Figure 2.3.1 using the empirical UCLs for
Hotelling's T2 chart with BACON, MVE, and MCD estimators in Table 2.4.1 yields the graphs
pictured in Figure 2.4.1. In this case, because the correct (empirical) UCLs were used, each
robust version of Hotelling's T2 chart maintains the desired IC FAP of 0.05 throughout the range
of sample sizes and dimensions evaluated. Minor deviations from the target IC FAP of 0.05 are
due to simulation noise. All simulations were conducted using the MATLAB code for
Hotelling's T2 chart provided in Appendix B, with the number of outliers set to zero.
21
Figure 2.4.1 Hotelling's T2 Chart IC FAPs Using Robust Estimators & Empirical UCLs
0.00
0.03
0.05
0.08
0.10
30 50 100
Em
pir
ica
l IC
FA
P
Sample Size
p = 3
BACON
MVE
MCD
0.00
0.03
0.05
0.08
0.10
30 50 100
Em
pirica
l I
C
FAP
Sample Size
p = 5
BACON
MVE
MCD
0.00
0.03
0.05
0.08
0.10
30 50 100
Em
pirica
l I
C
FAP
Sample Size
p = 10
BACON
MVE
MCD
22
3 Evaluating Phase I Performance of Hotelling's T2 Chart
To assess the effectiveness of Hotelling's T2 chart using BACON estimators to establish
an IC reference sample for individual multivariate normally distributed data, its performance was
compared to Jobe and Pokojovy's (2009) cluster-based Hotelling's T2 chart. This is a logical
standard of performance because, as mentioned in Chapter 1, Jobe and Pokojovy's (2009)
method was shown to be superior to previous attempts to improve the robustness of Hotelling's
T2 chart in most cases tested. Because the ultimate goal of this research is to match or exceed the
performance of Jobe and Pokojovy's (2009) chart, the entire set of scenarios evaluated by Jobe
and Pokojovy (2009) was replicated using Hotelling's T2 chart with BACON estimators. Similar
to Jobe and Pokojovy's (2009) design of experiments, Hotelling's T2 charts using MVE, MCD,
and classical estimators were also simulated for comparison purposes.
3.1 Simulating Multivariate Normally Distributed Data
Hotelling's T2 chart was tested on individual multivariate normally distributed data under
a variety of OC conditions, including situations in which simulated data in p = 2, 3, 5, and 10
dimensions contained a specified number of outliers as well as a scenario involving simulated
bivariate data containing a 50% sustained shift of the mean. Due to affine equivariance of the
mean vector and covariance matrix, multivariate normal data were generated without loss of
generality from the standard multivariate normal distribution, Np(0, I), where 0 is a p-
dimensional mean vector of all zeros and I is a p x p identity matrix. The data were simulated
23
using MATLAB code from the MathWorks Statistics Toolbox at
http://www.mathworks.com/help/toolbox/stats/. A summary of all planned simulations is
illustrated in Table 3.1.1, where p = dimension, n = sample size, k = number of outliers, and NCP
= shift size.
Table 3.1.1 Summary of Planned Experiments
3.2 Assessing Out-of-Control Performance
Hotelling's T2 charts using BACON, MVE, MCD, and classical estimators were first
evaluated in terms of their ability to detect k randomly occurring outliers, or mean-shifted data
points. The k outliers were randomly arranged throughout the data in order to achieve
consistency with Jobe and Pokojovy's (2009) simulation methodology. In general, the detection
power of Hotelling's T2 chart in Phase I is unaffected by the placement of outliers. This is
because most parameter estimation methods (e.g. BACON, MVE, MCD, and classical methods)
are unaffected by the time order of data, and Hotelling's T2 chart in Phase I does not consider the
time order of data when making simultaneous comparisons of all control chart statistics to the
UCL. In contrast, the clustering method of parameter estimation used by Jobe and Pokojovy
(2009) in their version of Hotelling's T2 chart does account for the sequencing of data, hence the
necessity of randomizing outliers in their simulation methodology.
p n k N C P
30 1( 2) 7 5( 5) 30
30 15 4, 5( 5) 30
30 2( 2) 6 5( 10) 25
50 2, 5, 10 5( 10) 25
100 5, 10, 20 5( 10) 25
2
3, 5, 10
24
Next, Hotelling's T2 charts using BACON, MVE, MCD, and classical estimators were
evaluated in terms of their ability to detect a 50% sustained shift of the mean occurring in the
latter half of a data set. Sustained shifts can be induced anywhere in the data set without loss of
generality, but were placed at the end of each data set for purposes of consistency with Jobe and
Pokojovy's (2009) methodology. It should be noted that this level of contamination exceeds the
BACON method's maximum RBP of 40%, so a significant degradation in performance of
Hotelling's T2 chart using BACON estimators was expected.
The magnitude of each shift was measured by the noncentrality parameter
1NCP ? ????? (3.2.1)
where the process mean vector shifts from o? to o??? and ? is the process covariance matrix.
Because the direction of a shift does not affect control chart performance with elliptically
symmetric distributions, shifts were fixed in the direction of ? ?1 1,0,...,0?e without loss of
generality [Stoumbos and Sullivan (2002), p. 265].
OC performance of Hotelling's T2 charts using BACON, MVE, MCD, and classical
estimators was quantified in terms of the empirical alarm probability (EAP), where EAP is
defined as the estimated probability of a chart signaling at least once in an OC situation. Ideally,
a control chart's EAP should be 100% for all scenarios involving outliers. The algorithm for
simulating OC performance of Hotelling's T2 chart is as follows:
1) Simulate n observations from a p-dimensional standard normal distribution.
2) Shift a specified number of randomly selected observations by the desired NCP.
25
3) Compute a 2T statistic for each observation and compare it to the empirical UCL if
robust estimators are used or Tracy et al.?s (1992) Phase I UCL if classical estimators
are used. If at least one 2T statistic exceeds the UCL, increment a counter by one.
4) Repeat steps 1 - 3 a total of 10,000 times.
5) Estimate the overall EAP = (final counter value)/10,000.
This process was repeated for all experiments listed in Table 3.1.1. The MATLAB program for
simulating Hotelling's T2 chart performance is provided in Appendix B.
26
4 Comparing the BACON-Based Hotelling's T2 Chart to Other Robust Versions
Through a variety of scenarios involving randomly occurring outliers, a 50% sustained
shift of the mean, and an example application to a bivariate data set, Hotelling's T2 chart with
BACON estimators was compared to Hotelling's T2 chart using Jobe and Pokojovy's (2009)
clustering method as well as the MVE, MCD, and classical methods included in Jobe and
Pokojovy's (2009) research. Due to unavailability of complete computer code for Jobe and
Pokojovy's (2009) algorithm, simulation results for Hotelling's T2 chart using clustering were
taken directly from Jobe and Pokojovy (2009).
For comparison to their cluster-based chart, Jobe and Pokojovy (2009) took tables of
simulation results for Hotelling's T2 chart using the MVE, MCD, and classical methods directly
from Vargas (2003). However, this is problematic because it is unknown whether Vargas (2003)
and Jobe and Pokojovy (2009) employed equivalent simulation methodologies (software, input
arguments, number of iterations, etc.). As discussed in Chapter 2, the choice of input arguments
can have a substantial impact on the performance of the MVE and MCD methods. In contrast,
simulation results for Hotelling's T2 chart using the MVE, MCD, and classical methods were
properly recreated here using the simulation methodology outlined in Chapter 3.
4.1 Detecting Randomly Occurring Outliers
The first set of performance comparisons involved the detection of k randomly occurring
outliers in a bivariate normally distributed data set of size n = 30. As indicated in the top panel
27
of Figure 4.1.1, Hotelling's T2 chart with classical estimators is superior when only one outlier is
present, although Hotelling's T2 chart using BACON estimators and Jobe and Pokojovy's (2009)
cluster-based chart offer only slightly lesser performance. As illustrated in the bottom panel of
Figure 4.1.1, however, the performance of Hotelling's T2 chart with classical estimators quickly
degrades as k is increased from 1 to 3, whereas Hotelling's T2 chart using MCD estimators,
BACON estimators, and Jobe and Pokojovy's (2009) clustering method continue to perform
exceptionally well.
Figure 4.1.1 Control Chart Performance on Bivariate Normal Data with k = 1, 3 Outliers
0.0 0.1
0.2 0.3
0.4 0.5
0.6 0.7
0.8 0.9
1.0
5 10 15 20 25 30
Em
pirica
l Ala
rm
P
ro
ba
bil
ity
Noncentrality Parameter
k = 1, n = 30, p = 2
BACON
J & P
MVE
MCD
Classical
0.0 0.1
0.2 0.3
0.4 0.5
0.6 0.7
0.8 0.9
1.0
5 10 15 20 25 30
Em
pirica
l Ala
rm
P
ro
ba
bil
ity
Noncentrality Parameter
k = 3, n = 30, p = 2
BACON
J & P
MVE
MCD
Classical
28
As depicted in the top panel of Figure 4.1.2, when the number of outliers is increased to
5, Hotelling's T2 chart using MCD estimators becomes the best option, although Jobe and
Pokojovy's (2009) cluster-based chart remains competitive. When the number of outliers is
increased to 7 as shown in the bottom panel of Figure 4.1.2, Jobe and Pokojovy's (2009) cluster-
based chart is slightly better than Hotelling's T2 chart using MVE or MCD estimators. For both k
= 5 and k = 7, the performance of Hotelling's T2 chart using BACON estimators falls behind the
other robust methods, especially for small to moderate NCPs.
Figure 4.1.2 Control Chart Performance on Bivariate Normal Data with k = 5, 7 Outliers
0.0 0.1
0.2 0.3
0.4 0.5
0.6 0.7
0.8 0.9
1.0
5 10 15 20 25 30
Em
pirica
l Ala
rm
P
ro
ba
bil
ity
Noncentrality Parameter
k = 5, n = 30, p = 2
BACON
J & P
MVE
MCD
Classical
0.0 0.1
0.2 0.3
0.4 0.5
0.6 0.7
0.8 0.9
1.0
5 10 15 20 25 30
Em
pirica
l Ala
rm
P
ro
ba
bil
ity
Noncentrality Parameter
k = 7, n = 30, p = 2
BACON
J & P
MVE
MCD
Classical
29
Trends similar to those witnessed with bivariate normally distributed data are also
observed in the three-dimensional scenarios summarized in Table 4.1.1, where the highest EAP
for each scenario is in bold. Hotelling's T2 chart using BACON estimators provides excellent
performance when the number of outliers is very small. However, as the number of outliers is
increased, Jobe and Pokojovy's (2009) cluster-based chart achieves superiority over all other
methods.
Table 4.1.1 Empirical Alarm Probabilities for p = 3
This trend is illustrated in Figure 4.1.3, which represents control chart performance using
a three-dimensional normally distributed sample of size n = 30 with increasing k. In the case of k
= 2 outliers, Hotelling's T2 chart using BACON estimators and Jobe and Pokojovy's (2009)
cluster-based chart provide nearly identical performance. Both charts are substantially better
than Hotelling's T2 chart using MCD, MVE, or classical estimators. As k is increased to 6,
however, the performance of Hotelling's T2 chart using BACON estimators drops below the
performance of Hotelling's T2 chart using MVE or MCD estimators, whereas Jobe and
M e t h od N C P k = 2 k = 4 k = 6 k = 2 k = 5 k = 10 k = 5 k = 10 k = 20
p = 3
B A C O N 5 0.0879 0.0706 0.0572 0.1027 0.0908 0.0476 0.1264 0.0904 0.0517
J & P 5 0.0710 0.0680 0.0730 0.0950 0.0890 0.0850 0.1410 0.1790 0.1760
M V E 5 0.0640 0.0631 0.0595 0.0770 0.0866 0.0659 0.1254 0.1221 0.0715
M C D 5 0.0782 0.0702 0.0660 0.0952 0.0907 0.0638 0.1240 0.1234 0.0588
C l a s s i c a l 5 0.0869 0.0663 0.0498 0.0988 0.0772 0.0462 0.1158 0.0830 0.0475
B A C O N 15 0.3871 0.2103 0.1194 0.5239 0.3056 0.0701 0.6671 0.2995 0.0478
J & P 15 0.3800 0.3770 0.3950 0.4860 0.6190 0.6640 0.7440 0.8490 0.9170
M V E 15 0.2031 0.2185 0.1904 0.3579 0.4550 0.3550 0.7194 0.7674 0.5733
M C D 15 0.3022 0.3162 0.2294 0.4780 0.5708 0.3362 0.7588 0.7864 0.3878
C l a s s i c a l 15 0.2627 0.0830 0.0475 0.4091 0.1391 0.0418 0.4205 0.1387 0.0435
B A C O N 25 0.7476 0.5382 0.3921 0.8846 0.6992 0.3191 0.9712 0.6730 0.1496
J & P 25 0.7470 0.7830 0.8070 0.8560 0.9490 0.9660 0.9870 0.9970 0.9990
M V E 25 0.4145 0.4634 0.4416 0.7044 0.8405 0.7999 0.9798 0.9928 0.9692
M C D 25 0.6084 0.6778 0.6056 0.8336 0.9260 0.8170 0.9874 0.9968 0.9266
C l a s s i c a l 25 0.4460 0.0897 0.0450 0.7469 0.1689 0.0468 0.6963 0.1633 0.0380
n = 30 n = 50 n = 100
30
Pokojovy's (2009) cluster-based chart maintains a significant performance advantage over all
other methods.
Figure 4.1.3 Control Chart Performance with k Outliers When n = 30 and p = 3
0.0 0.1
0.2 0.3
0.4 0.5
0.6 0.7
0.8 0.9
1.0
5 15 25
Em
pirica
l Ala
rm
P
ro
ba
bil
ity
Noncentrality Parameter
k = 2, n = 30, p = 3
BACON
J & P
MVE
MCD
Classical
0.0 0.1
0.2 0.3
0.4 0.5
0.6 0.7
0.8 0.9
1.0
5 15 25
Em
pirica
l Ala
rm
P
ro
ba
bil
ity
Noncentrality Parameter
k = 4, n = 30, p = 3
BACON
J & P
MVE
MCD
Classical
0.0 0.1
0.2 0.3
0.4 0.5
0.6 0.7
0.8 0.9
1.0
5 15 25
Em
pirica
l Ala
rm
P
ro
ba
bil
ity
Noncentrality Parameter
k = 6, n = 30, p = 3
BACON
J & P
MVE
MCD
Classical
31
In all simulations performed, it was noted that for a given (k, n, p) combination, the
performance of Hotelling's T2 chart using BACON estimators improved as the NCP was
increased, which is what one would expect when using a robust parameter estimation method. In
general, OC points with the largest NCPs should be detected more frequently than those with
small to moderate NCPs because they are farthest from the center of the data as defined by the
robust mean estimate. However, as shown in Figure 4.1.4, for a given (n, p, NCP) combination,
the performance of Hotelling's T2 chart using BACON estimators actually worsened as k was
increased, which is somewhat counterintuitive. One would expect the presence of more OC
points to result in a higher probability of an alarm, assuming the OC points are correctly
excluded from computation of the T2 statistic by the BACON method.
Figure 4.1.4 Effect of Increasing k on Control Chart Performance
These trends together suggest that the BACON method is unable to consistently exclude
OC points with small to moderate NCPs from a data set, but rather performs optimally when the
NCP corresponding to an OC point is very large. In order to confirm this, additional simulations
were performed using NCPs beyond the range of those tested by Jobe and Pokojovy (2009).
0.0 0.1
0.2 0.3
0.4 0.5
0.6 0.7
0.8 0.9
1.0
k = 2 k = 4 k = 6
Em
pirica
l Ala
rm
P
ro
ba
bil
ity
Outliers
n = 30, p = 3, NCP = 25
BACON
J & P
MVE
MCD
Classical
32
Expanding the range of NCPs from NCP = 5(10)25 to NCP = 5(10)55, the performance of
Hotelling's T2 charts using BACON, MVE, MCD, and classical estimators are displayed in
Figure 4.1.5. As suspected, even with a large percentage of outliers (6/30 = 20% in this case),
Hotelling's T2 chart with BACON estimators gets progressively better as the NCP is increased,
and is second only to Hotelling's T2 chart with MCD estimators when the NCP > 25. Because of
its comparatively low computational burden, Hotelling's T2 chart with BACON estimators is
therefore an excellent option when it is suspected that extreme outliers are present in the data.
Jobe and Pokojovy's (2009) cluster-based chart was not included in this performance comparison
due to lack of availability of their simulation algorithm.
Figure 4.1.5 Control Chart Performance with Extreme Outliers
Table 4.1.2 shows empirical alarm probabilities for all charts evaluated using normally
distributed data in five dimensions, with the highest EAP for each scenario in bold. As with the
two- and three-dimensional cases, Hotelling's T2 chart using BACON estimators is competitive
as long as k is small relative to n, but Jobe and Pokojovy's (2009) cluster-based chart provides
the best overall performance across the range of scenarios evaluated. This is further illustrated
0.0 0.1
0.2 0.3
0.4 0.5
0.6 0.7
0.8 0.9
1.0
5 15 25 35 45 55
Em
pirica
l Ala
rm
P
ro
ba
bil
ity
Noncentrality Parameter
k = 6, n = 30, p = 3
BACON
MVE
MCD
Classical
33
for sample size n = 50 in Figure 4.1.6. In comparison to the results obtained in two and three
dimensions, there is a loss of detection power observed for all control charts in five dimensions.
This is to be expected because outliers are known to be more difficult to detect in higher
dimensions than in low dimensions.
Table 4.1.2 Empirical Alarm Probabilities for p = 5
M e t h od N C P k = 2 k = 4 k = 6 k = 2 k = 5 k = 10 k = 5 k = 10 k = 20
p = 5
B A C O N 5 0.0696 0.0614 0.0570 0.0759 0.0669 0.0523 0.0938 0.0739 0.0489
J & P 5 0.0830 0.0670 0.0590 0.0840 0.0920 0.0690 0.0900 0.1070 0.0840
M V E 5 0.0568 0.0571 0.0496 0.0666 0.0644 0.0561 0.0945 0.0886 0.0525
M C D 5 0.0684 0.0598 0.0598 0.0700 0.0724 0.0548 0.0924 0.0758 0.0498
C l a s s i c a l 5 0.0704 0.0584 0.0541 0.0792 0.0650 0.0484 0.0886 0.0683 0.0513
B A C O N 15 0.2175 0.1073 0.0652 0.3428 0.1531 0.0548 0.4475 0.1507 0.0504
J & P 15 0.2270 0.2550 0.2200 0.3820 0.3690 0.3450 0.6130 0.7280 0.7610
M V E 15 0.1312 0.1238 0.0915 0.2386 0.2744 0.1333 0.5592 0.5305 0.1877
M C D 15 0.1916 0.1764 0.1002 0.2834 0.3384 0.1616 0.6018 0.5884 0.2012
C l a s s i c a l 15 0.1649 0.0673 0.0503 0.2717 0.1017 0.0501 0.2689 0.0999 0.0476
B A C O N 25 0.4723 0.2611 0.1918 0.7114 0.3937 0.1444 0.8363 0.3271 0.0688
J & P 25 0.5030 0.5450 0.5480 0.7410 0.8110 0.8230 0.9690 0.9930 0.9910
M V E 25 0.2639 0.2667 0.1896 0.5443 0.6441 0.3683 0.9302 0.9370 0.5694
M C D 25 0.3904 0.3874 0.2544 0.6122 0.7586 0.5420 0.9586 0.9740 0.8046
C l a s s i c a l 25 0.2638 0.0701 0.0490 0.5284 0.1190 0.0433 0.4623 0.1135 0.0461
n = 30 n = 50 n = 100
34
Figure 4.1.6 Control Chart Performance with k Outliers When n = 50 and p = 5
The loss of detection power in higher dimensions is most evident in the results for the
ten-dimensional scenarios provided in Table 4.1.3. With most charts depicted in Table 4.1.3, the
detection power is relatively poor. In fact, for NCP = 5, the highest EAPs for all n and k are
barely above the IC FAP of 0.05. Even Jobe and Pokojovy's (2009) cluster-based chart provides
acceptable performance only for the largest sample size and shift evaluated (n = 100, NCP = 25).
0.0 0.1
0.2 0.3
0.4 0.5
0.6 0.7
0.8 0.9
1.0
5 15 25
Em
pirica
l Ala
rm
P
ro
ba
bil
ity
Noncentrality Parameter
k = 2, n = 50, p = 5
BACON
J & P
MVE
MCD
Classical
0.0 0.1
0.2 0.3
0.4 0.5
0.6 0.7
0.8 0.9
1.0
5 15 25
Em
pir
ica
l A
lar
m
Pr
oba
bil
ity
Noncentrality Parameter
k = 5, n = 50, p = 5
BACON
J & P
MVE
MCD
Classical
35
Table 4.1.3 Empirical Alarm Probabilities for p = 10
The difficulty in detecting shifts in higher dimension (especially with very small samples)
is further illustrated in Figure 4.1.7, which represents ten-dimensional normally distributed data
with various sample sizes containing 20% outliers. Again, most charts are unable to detect the
presence of the shifted data, although Jobe and Pokojovy's (2009) cluster-based chart has some
success in doing so as the sample size, number of outliers, and NCP are raised.
M e t h od N C P k = 2 k = 4 k = 6 k = 2 k = 5 k = 10 k = 5 k = 10 k = 20
p = 10
B A C O N 5 0.0544 0.0569 0.0570 0.0584 0.0588 0.0490 0.0671 0.0601 0.0516
J & P 5 0.0610 0.0570 0.0530 0.0690 0.0730 0.0580 0.0680 0.0490 0.0530
M V E 5 0.0555 0.0562 0.0508 0.0509 0.0550 0.0498 0.0604 0.0569 0.0493
M C D 5 0.0528 0.0496 0.0536 0.0490 0.0460 0.0422 0.0612 0.0608 0.0474
C l a s s i c a l 5 0.0585 0.0497 0.0472 0.0611 0.0620 0.0522 0.0648 0.0593 0.0460
B A C O N 15 0.0874 0.0624 0.0503 0.1443 0.0744 0.0519 0.1819 0.0871 0.0551
J & P 15 0.0790 0.0880 0.0960 0.1810 0.2120 0.2020 0.2190 0.2300 0.2100
M V E 15 0.0723 0.0578 0.0541 0.0925 0.0859 0.0542 0.2433 0.1597 0.0557
M C D 15 0.0740 0.0604 0.0550 0.0938 0.0970 0.0576 0.2794 0.2662 0.0690
C l a s s i c a l 15 0.0860 0.0528 0.0530 0.1320 0.0719 0.0525 0.1334 0.0772 0.0443
B A C O N 25 0.1150 0.0576 0.0552 0.3213 0.1021 0.0537 0.4046 0.1089 0.0500
J & P 25 0.1600 0.1760 0.1710 0.4160 0.5010 0.5000 0.7360 0.8060 0.8140
M V E 25 0.0979 0.0670 0.0591 0.1939 0.1546 0.0574 0.5923 0.3966 0.0754
M C D 25 0.1326 0.1022 0.0532 0.2176 0.2594 0.0830 0.7256 0.7650 0.2610
C l a s s i c a l 25 0.1065 0.0610 0.0519 0.2510 0.0810 0.0506 0.2240 0.0849 0.0492
n = 30 n = 50 n = 100
36
Figure 4.1.7 Control Chart Performance with 20% Outliers in Ten Dimensions
In order to visualize the changes in detection power that occur as the dimension is
increased, Figure 4.1.8 shows control chart performance with k = 5 outliers using a sample size
0.0 0.1
0.2 0.3
0.4 0.5
0.6 0.7
0.8 0.9
1.0
5 15 25
Em
pirica
l Ala
rm
P
ro
ba
bil
ity
Noncentrality Parameter
k = 6, n = 30, p = 10
BACON
J & P
MVE
MCD
Classical
0.0 0.1
0.2 0.3
0.4 0.5
0.6 0.7
0.8 0.9
1.0
5 15 25
Em
pirica
l Ala
rm
P
ro
ba
bil
ity
Noncentrality Parameter
k = 10, n = 50, p = 10
BACON
J & P
MVE
MCD
Classical
0.0 0.1
0.2 0.3
0.4 0.5
0.6 0.7
0.8 0.9
1.0
5 15 25
Em
pirica
l Ala
rm
P
ro
ba
bil
ity
Noncentrality Parameter
k = 20, n = 100, p = 10
BACON
J & P
MVE
MCD
Classical
37
of n = 100 in both three and ten dimensions. From the scenario depicted in Figure 4.1.8, it is
evident that the detection power for each variation of Hotelling's T2 chart is much greater when p
= 3 than it is when p = 10.
Figure 4.1.8 Effect of Increasing Dimension on Control Chart Performance
4.2 Detecting a Sustained Shift of the Mean
Following the outline of Jobe and Pokojovy (2009), the next scenario evaluated involved
a sample of 30 bivariate normally distributed observations with a sustained shift of the mean
occurring in the latter half of the data (so k = 15). The numerical results are provided in Table
0.0 0.1
0.2 0.3
0.4 0.5
0.6 0.7
0.8 0.9
1.0
5 15 25
Em
pirica
l Ala
rm
P
ro
ba
bil
ity
Noncentrality Parameter
k = 5, n = 100, p = 3
BACON
J & P
MVE
MCD
Classical
0.0 0.1
0.2 0.3
0.4 0.5
0.6 0.7
0.8 0.9
1.0
5 15 25
Em
pirica
l Ala
rm
P
ro
ba
bil
ity
Noncentrality Parameter
k = 5, n = 100, p = 10
BACON
J & P
MVE
MCD
Classical
38
4.2.1 where the highest EAP for each NCP is in bold, and the graphical results are portrayed in
Figure 4.2.1. Hotelling's T2 chart using BACON estimators was not expected to perform well
since the maximum RBP of the BACON method is approximately 40%. This proved to be the
case, although the BACON method did perform significantly better than the MVE, MCD, and
classical methods for the larger NCPs evaluated. Jobe and Pokojovy's (2009) cluster-based chart
demonstrated the best performance overall. Hotelling's T2 chart using MVE, MCD, and classical
estimators were unable to detect the presence of such a high level of contamination, producing
EAPs that were in most cases even lower than the desired IC FAP of 0.05.
Table 4.2.1 Empirical Alarm Probabilities Under a 50% Sustained Shift of the Mean
Figure 4.2.1 Control Chart Performance Under a 50% Sustained Shift of the Mean
M e t h o d \ N C P 4 5 10 15 20 25 30
B A C O N 0 .0 3 3 6 0 .0 3 1 3 0 .0 2 8 7 0 .0 5 0 0 0 .0 9 0 0 0 .1 6 2 2 0 .2 4 8 3
J & P 0 .2 6 5 0 0 .3 5 6 0 0 .6 9 3 0 0 .8 7 3 0 0 .9 5 2 0 0 .9 8 2 0 0 .9 8 6 0
M V E 0 .0 3 9 9 0 .0 3 9 8 0 .0 4 4 3 0 .0 4 2 5 0 .0 4 3 1 0 .0 5 1 0 0 .0 4 8 3
M C D 0 .0 3 9 8 0 .0 3 0 4 0 .0 3 1 2 0 .0 3 2 0 0 .0 3 2 4 0 .0 3 7 0 0 .0 3 2 0
C l a s s i c a l 0 .0 3 1 6 0 .0 2 8 3 0 .0 2 4 2 0 .0 2 2 8 0 .0 2 1 5 0 .0 2 2 5 0 .0 2 3 0
0.0 0.1
0.2 0.3
0.4 0.5
0.6 0.7
0.8 0.9
1.0
4 5 10 15 20 25 30
Em
pirica
l Ala
rm
P
ro
ba
bil
ity
Noncentrality Parameter
k = 15, n = 30, p = 2
BACON
J & P
MVE
MCD
Classical
39
4.3 Application to an Example Data Set
The final performance evaluation for Hotelling's T2 chart using BACON estimators
involved its application to the bivariate data set depicted in Table 4.3.1. The original data set
was presented by Quesenberry (2001) in 11 variables, but later pared down to two variables by
Vargas (2003) for purposes of comparing the performance of Hotelling's T2 chart using MVE,
MCD, and classical estimators as well as two methods proposed by Sullivan and Woodall
(1996). Jobe and Pokojovy (2009) subsequently used the bivariate data set from Vargas (2003)
to compare the effectiveness of their cluster-based chart to the five aforementioned methods
evaluated by Vargas (2003).
Simulation exercises resulted in observation 2 being identified as an outlier by all
versions of Hotelling's T2 chart. This is contrary to the findings of Vargas (2003), who reported
that Hotelling's T2 chart using MCD estimators failed to identify the lone outlier. As discussed in
Chapter 2, this is probably because the MCD method in this research used a higher percentage of
the sample than Vargas (2003) to compute more accurate estimates of location and scatter.
Table 4.3.1 Example Bivariate Data Set
i 1 2 3 4 5 6 7 8 9 10
x 1 0. 56 7 0. 53 8 0. 53 0 0. 56 2 0. 48 3 0. 52 5 0. 55 6 0. 58 6 0. 54 7 0. 53 1
x 2 60 .5 58 56 .3 03 59 .5 24 61 .1 02 59 .8 34 60 .2 28 60 .7 56 59 .8 23 60 .1 53 60 .6 40
i 11 12 13 14 15 16 17 18 19 20
x 1 0. 58 1 0. 58 3 0. 54 0 0. 45 8 0. 55 4 0. 46 9 0. 47 1 0. 45 7 0. 56 5 0. 66 4
x 2 59 .7 85 59 .6 75 60 .4 89 61 .0 67 59 .7 88 58 .6 40 59 .5 74 59 .7 18 60 .9 01 60 .1 80
i 21 22 23 24 25 26 27 28 29 30
x 1 0. 60 0 0. 58 6 0. 56 7 0. 49 6 0. 48 5 0. 57 3 0. 52 0 0. 55 6 0. 53 9 0. 55 4
x 2 60 .4 93 58 .3 70 60 .2 16 60 .2 14 59 .5 00 60 .0 52 59 .5 01 58 .4 76 58 .6 66 60 .2 39
40
Next, imitating the analysis of Vargas (2003), the bivariate sample was altered by
changing observation 16 to (0.469, 56.23) and observation 24 to (0.496, 56.08) to make them
outliers. Of all the variations of Hotelling's T2 chart evaluated by Vargas (2003) and Jobe and
Pokojovy (2009), as well as the BACON-based alternative presented here, only the clustering
and BACON versions correctly identified all three observations (2, 16, and 24) as outliers. The
classical, MVE, MCD, and both Sullivan and Woodall (1996) versions of Hotelling's T2 chart
failed to detect at least one of the three outlying observations. Again, this is contrary to Vargas
(2003), who determined that Hotelling's T2 chart using MVE estimators correctly identified all
three outliers and that Hotelling's T2 chart using MCD estimators failed to do so. As previously
stated, this is likely due to each author using different MVE and MCD input arguments.
Control chart statistics for Hotelling's T2 chart using BACON estimators for the original
and altered samples are provided in Table 4.3.2. Entries in bold indicate control chart statistics
that exceed the empirical UCL of 21.07 and therefore signal a potential out-of-control condition.
The corresponding control chart for the altered data is presented in Figure 4.3.1.
Table 4.3.2 T2 Statistics for Original and Altered Samples Using BACON Estimators
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
O r i g i n a l 0 .9 2 2 4 .9 6 0 .3 5 2 .6 1 1 .5 1 0 .3 1 1 .2 9 0 .9 3 0 .0 9 1 .0 3 0 .7 7 0 .9 6 0 .5 9 6 .1 1 0 .1 2
A l t e r e d 0 .8 7 2 6 .6 8 0 .5 1 2 .6 2 1 .8 7 0 .3 4 1 .2 5 0 .8 0 0 .0 6 0 .9 9 0 .6 5 0 .8 3 0 .5 4 6 .0 9 0 .1 0
i 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
O r i g i n a l 4 .9 5 2 .3 0 3 .1 5 1 .8 7 6 .5 9 1 .9 0 5 .9 6 0 .3 9 1 .1 5 1 .6 3 0 .4 4 0 .5 1 4 .2 7 3 .0 4 0 .2 2
A l t e r e d 3 0 .1 5 2 .8 9 3 .7 8 1 .8 5 6 .5 5 1 .8 6 5 .9 3 0 .3 2 3 0 .9 4 2 .1 4 0 .3 5 0 .7 4 4 .5 1 3 .4 0 0 .1 7
41
Figure 4.3.1 Application of the BACON-Based Hotelling's T2 Chart to Altered Data
0
5
10
15
20
25
30
35
0 5 10 15 20 25 30
T2
Sta
tis
tic
Observation
Hotelling's T2 Chart Using BACON
UCL = 21.07
42
5 Summary and Conclusions
Hotelling's T2 chart using BACON estimators, although ultimately unsuccessful in
exceeding the standard of performance established by Jobe and Pokojovy's (2009) cluster-based
chart, was shown to be a viable option for Phase I analysis of individual multivariate normally
distributed data under certain conditions. If the level of contamination in a reference data set of
size n = 30 - 100 is thought to be relatively small (less than 5 - 10%, depending on the sample
size and dimension), or if any number of outliers are suspected to be extremely large (as
measured by their NCP, and also dependent on n and p), Hotelling's T2 chart using BACON
estimators offers good performance with a low computational burden. This is in contrast to other
robust parameter estimation methods such as MVE and MCD which, as discussed by Billor et al.
(2000), are significantly more computationally complex.
If, on the other hand, a more universally robust procedure is desired, Jobe and Pokojovy's
(2009) cluster-based chart was shown to be extremely effective across a wide range of scenarios,
including the presence of a 50% sustained shift of the mean -- a level of contamination which for
most methods renders IC data indistinguishable from OC data. However, Jobe and Pokojovy's
(2009) control chart is not without potential drawbacks. The relatively complicated clustering
algorithm may not be easy to understand for practitioners, and the computational complexity of
the procedure is not detailed by the authors. Furthermore, because complete computer code for
Jobe and Pokojovy's (2009) cluster-based chart is not readily available, their simulation results
cannot be validated. It is therefore impossible to give their method an unqualified endorsement
43
at this time. While it is assumed that the general observations regarding the comparative
performance of Jobe and Pokojovy's (2009) chart to other robust Hotelling's T2 charts are valid,
detailed conclusions cannot be drawn without replicating Jobe and Pokojovy's (2009) simulation
results using the process outlined in Chapter 3. This is a topic for future research.
Another area that merits further investigation is the effect of sample size on control chart
performance. In accordance with Jobe and Pokojovy's (2009) experimental design, only sample
sizes of n = 30, 50, and 100 were evaluated in this research. Though they may be commonly
encountered in Phase I scenarios, these are very small sample sizes for all but the smallest
dimensions considered in this research (p = 2, 3, 5, and 10). Billor et al. (2000) originally
designed the BACON method to be a computationally efficient robust parameter estimation
method for extremely large data sets in higher dimensions, and the authors offered a variety of
simulation results for n = 100 - 10,000 and p = 5 - 20 to demonstrate its performance. The
authors noted that the BACON method matched the performance of the MVE, MCD, and other
robust parameter estimation methods on all published test problems, but at a mere fraction of the
computational expense. If used as originally intended (with large n and p), BACON estimators
in conjunction with Hotelling's T2 chart might become a more attractive option than Hotelling's
T2 chart using MVE or MCD estimators and perhaps even Jobe and Pokojovy's (2009) cluster-
based chart, both in terms of speed and accuracy.
In conclusion, improving the robustness of Hotelling's T2 chart applied to individual
multivariate normally distributed data in Phase I has been the subject of much research over the
years, yet this study has shown that many questions remain unanswered. Numerous variations of
Hotelling's T2 chart have been proposed, but none so far appear to have attained a balance of
accuracy, robustness, computational complexity, and ease of implementation for a wide range of
44
n and p. The BACON-based version of Hotelling's T2 chart presented here, despite falling short
of its original performance objectives, provides a valuable contribution by demonstrating its
strengths and weaknesses as a Phase I method when sample sizes are small, and in the process
identifying several areas worthy of additional research.
45
References
Alfaro, J.L., & Ortega, J.F. (2008). A Robust Alternative to Hotelling's T2 Control Chart Using
Trimmed Estimators. Quality and Reliability Engineering International, 24, 601-611.
Bersimis, S., Psarakis, S., & Panaretos, J. (2007). Multivariate Statistical Process Control Charts:
An Overview. Quality and Reliability Engineering International, 23, 517-543.
Billor, N., Hadi, A.S., & Velleman, P.F. (2000). BACON: Blocked Adaptive Computationally
Efficient Outlier Nominators. Computational Statistics & Data Analysis, 34, 279-298.
Chenouri, S., & Steiner, S.H. (2009). A Multivariate Robust Control Chart for Individual
Observations. Journal of Quality Technology, 41(3), 259-271.
Chenouri, S., & Variyath, A.M. (2011). A Comparative Study of Phase II Robust Multivariate
Control Charts for Individual Observations. Quality and Reliability Engineering
International, 27(7), 857-865.
Donoho, D.L. & Huber, P.J. (1983). The Notion of a Breakdown Point. In P.J. Bickel, K.A.
Doksum and J.L. Hodges, Jr. (Eds.), A Festschrift for Eric L. Lehmann (pp. 157-184).
Belmont, CA: Wadsworth.
Hotelling, H. (1947). Multivariate Quality Control ? Illustrated By the Air Testing of Sample
Bombsights. In C. Eisenhart, M.W. Hastay, & W.A. Wallis (Eds.), Techniques of
Statistical Analysis (pp. 111-184). New York, NY: McGraw-Hill.
Jensen, W.A., Jones-Farmer, L.A., Champ, C.W., & Woodall, W.H. (2006). Effects of Parameter
Estimation on Control Chart Properties: A Literature Review. Journal of Quality
Technology, 38(4), 349-364.
Jensen, W.A., Birch, J.B., & Woodall, W.H. (2007). High Breakdown Estimation Methods for
Phase I Multivariate Control Charts. Quality and Reliability Engineering International,
23(5), 615-629.
Jobe, J.M., & Pokojovy, M. (2009). A Multistep, Cluster-Based Multivariate Chart for
Retrospective Monitoring of Individuals. Journal of Quality Technology, 41(4), 323-339.
Lopuhaa, H.P., & Rousseeuw, P.J. (1991). Breakdown Points of Affine Equivariant Estimators
of Location and Covariance Matrices. The Annals of Statistics, 19, 229-248.
46
Lowry, C.A., & Montgomery, D.C. (1995). A Review of Multivariate Control Charts. IIE
Transactions, 27, 800-810.
Mahalanobis, P. C. (1936). On the Generalized Distance in Statistics. Proceedings of the
National Institute of Science of India, 12, 49-55.
Mason, R.L., Champ, C.W., Tracy, N.D., Wierda, S.J., & Young, J.C. (1997). Assessment of
Multivariate Process Control Techniques. Journal of Quality Technology, 29(2), 140-143.
Mason, R.L., & Young, J.C. (2002). Multivariate Statistical Process Control with Industrial
Applications. Alexandria, VA: American Statistical Association; Philadelphia, PA:
Society for Industrial and Applied Mathematics.
Mohammadi, M., Midi, H., Arasan, J., & Al-Talib, B. (2011). High Breakdown Estimators to
Robustify Phase II Multivariate Control Charts. Journal of Applied Sciences, 11(3), 503-
511.
Montgomery, D.C. (2005). Introduction to Statistical Quality Control (5th ed.). Hoboken, NJ:
John Wiley & Sons, Inc.
Mudholkar, G.S., & Srivastava, D.K. (2000). A Class of Robust Stepwise Alternatives to
Hotelling's T2 Tests. Journal of Applied Statistics, 27(5), 599-619.
Nedumaran, G., & Pignatiello, J.J. (2000). On Constructing T2 Control Charts for Retrospective
Examination. Communications in Statistics ? Simulation and Computation, 29(2), 621-
632.
Oyeyemi, G.M., & Ipinyomi, R.A. (2010). A Robust Method of Estimating Covariance Matrix in
Multivariate Data Analysis. African Journal of Mathematics and Computer Science
Research, 3(1), 1-18.
Rousseeuw, P.J. (1984). Least Median of Squares Regression. Journal of the American
Statistical Association, 79, 871-880.
Rousseeuw, P.J., & Leroy, A.M. (1987). Robust Regression and Outlier Detection. New York,
NY: John Wiley & Sons, Inc.
Rousseeuw, P.J., & van Driessen, K. (1999). A Fast Algorithm for the Minimum Covariance
Determinant Estimator. Technometrics, 41, 212-223.
Rousseeuw, P.J., & van Zomeren, B.C. (1990). Unmasking Multivariate Outliers and Leverage
Points. Journal of the American Statistical Association, 85, 633-651.
Shewhart, W.A. (1939). Statistical Method from the Viewpoint of Quality Control. New York:
Dover Publications.
47
Srivastava, D.K., & Mudholkar, G.S. (2001). Trimmed T2: A Robust Analog of Hotelling's T2.
Journal of Statistical Planning and Inference, 97, 343-358.
Sullivan, J.H., & Woodall, W.H. (1996). A Comparison of Multivariate Control Charts for
Individual Observations. Journal of Quality Technology, 28, 398-408.
Sullivan, J.H., & Woodall, W.H. (1998). Adapting Control Charts for the Preliminary Analysis
of Multivariate Observations. Communications in Statistics ? Simulation and
Computation, 27(4), pp. 953-979.
Tiku, M.L., & Balakrishnan, N. (1988). Robust Hotelling-Type T2 Statistics Based on the
Modified Maximum Likelihood Estimators. Communications in Statistics ? Theory and
Methods, 17(6), 1789-1810.
Tiku, M.L., & Singh, M. (1982). Robust Statistics for Testing Mean Vectors of Multivariate
Distributions. Communications in Statistics ? Theory and Methods, 11(9), 985-1001.
Vargas, J.A. (2003). Robust Estimation in Multivariate Control Charts for Individual
Observations. Journal of Quality Technology, 35(4), 367-376.
Wierda, S.J. (1994). Multivariate Statistical Process Control ? Recent Results and Directions for
Future Research. Statistica Neerlandica, 48, 147-168.
Willems, G., Pison, G., Rousseeuw, P.J., & Van Aelst, S. (2002). A Robust Hotelling Test.
Metrika, 55, 125-138.
Woodall, W.H., & Montgomery, D.C. (1999). Research Issues and Ideas in Statistical Process
Control. Journal of Quality Technology, 31(4), 376-386.
Yanez, S., Gonzalez, N., & Vargas, J.A. (2010). Hotelling's T2 Control Charts Based on Robust
Estimators. Dyna, 163, 239-247.
48
Appendices
Appendix A: MATLAB Code for Simulating Hotelling?s T2 Chart Empirical Control Limits
Appendix B: MATLAB Code for Simulating Hotelling's T2 Chart Performance
49
Appendix A: MATLAB Code for Simulating Hotelling?s T2 Chart Empirical Control Limits
%=========================================================================%
% FINDING EMPIRICAL CONTROL LIMITS FOR HOTELLING'S T^2 CONTROL CHART %
%=========================================================================%
% -Created by Richard Bell 8/12/2011; last updated 9/25/2011 %
% -Finds empirical UCLs for Hotelling's T2 control chart with BACON, %
% MVE, or MCD location and scatter estimators %
% -Uses same method as Jensen, Birch, and Woodall (2007) %
%=========================================================================%
clear all % clear all objects in the MATLAB workspace
clc % clear the output screen
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%% INPUT SIMULATION PARAMETERS %%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
iterations=100000; % number of simulation iterations to be performed
input=xlsread('c:\Users\Rich\Documents\InputFile.xlsx','Sheet1','a1:b10');
inputRows=length(input(:,1)); % determine the number of rows of data in the
input file
empUCLtable=zeros(inputRows,1); % initialize the table of empirical UCLs to
all zeros
% NOTE: USE ONLY IF MVE METHOD IS EMPLOYED
[status,msg] = openR;
if status ~= 1
disp(['Problem connecting to R: ' msg]);
end
for row=1:inputRows % perform the simulation below for each scenario in the
input file
n=input(row,1); % read in the sample size
p=input(row,2); % read in the number of variables
alpha=.05; % desired overall false alarm probability (FAP) for the chart
percentile=(1-alpha)*100; % percentile corresponding to the desired
alpha level
maxT2vector=zeros(iterations,1); % initialize the vector of maximum T2
statistics to all zeros
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%% GENERATE DATA AND CONSTRUCT HOTELLING'S T2 CHART %%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
count=0; % initialize the counter for the number of iterations performed
while count < iterations % run the entire loop for a set number of
iterations
50
%=====> SIMULATE MULTIVARIATE NORMAL DATA
mu=zeros(1,p); % set the mean vector to all zeros
sigma=eye(p); % set the covariance matrix equal to the identity
matrix
X=mvnrnd(mu,sigma,n); % generate multivariate normal data
%=====> COMPUTE ROBUST ESTIMATES OF LOCATION AND SCATTER (must code
out 2 of the 3 methods listed using % symbols)
% BACON METHOD
out=baconV(X,2,.10,6); % compute BACON estimate for location using
Mahalanobis distance, alpha=0.05, and c=4; use version 2
(Euclidean distance) if expected contamination exceeds 20 percent
Xbar_robust=out.center3; % BACON estimate for mean vector
S_robust=out.cov3; % BACON estimate for covariance matrix
% MVE METHOD
evalR('library(MASS)');
putRdata('X',X);
putRdata('n',n);
putRdata('p',p);
Xbar_robust=evalR('cov.mve(X, cor=FALSE, quantile.used=floor((n + p +
1)/2), nsamp = "best")$center');
S_robust=evalR('cov.mve(X, cor=FALSE, quantile.used=floor((n + p +
1)/2), nsamp = "best")$cov');
% MCD METHOD
[rew,raw]=mcdcov(X,'plots',0); % compute MCD estimates for location
and scatter using default parameter values; suppress plot output
by adding the arguments ('plots',0)
Xbar_robust=rew.center; % MCD estimate for mean vector
S_robust=rew.cov; % MCD estimate for covariance matrix
%=====> COMPUTE HOTELLING'S T2 STATISTICS AND COMPARE TO UCL
T2vector=zeros(n,1); % initialize the vector of T2 statistics
for i=1:n % compute T2 control statistic for each observation
T2stat=(X(i,:)-Xbar_robust)/S_robust*(X(i,:)-Xbar_robust)';
T2vector(i)=T2stat; % store the T2 statistics in a vector
end
count=count+1; % increment the counter for the total number of
iterations performed
maxT2=max(T2vector); % identify the maximum T2 statistic
maxT2vector(count,1)=maxT2; % store the maximum T2 statistic in a
vector
end
51
empUCL=prctile(maxT2vector,percentile); % compute the empirical UCL for
the current scenario IAW Jensen, Birch, and Woodall (2007)
% store the results in a table and display the table on the output screen
empUCLtable(row,1)=empUCL;
disp(empUCLtable);
% send the results to an Excel file
xlswrite('c:\Users\Rich\Documents\OutFile.xlsx',empUCLtable,'Sheet1','A1');
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%% END OF PROGRAM %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
52
Appendix B: MATLAB Code for Simulating Hotelling's T2 Chart Performance
%=========================================================================%
% HOTELLING'S T^2 CONTROL CHART PROGRAM FILE %
%=========================================================================%
% -Created by Richard Bell on 9/15/2010; last updated on 9/25/2011 %
% -Based on Hotelling's T2 chart with classical or empirical UCLs %
% -File is set up to run multiple scenarios; before using, undesired %
% sections must be commented out using "%" %
%=========================================================================%
clear all % clear all objects in the MATLAB workspace
clc % clear the output screen
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%% INPUT SIMULATION PARAMETERS %%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% read in m, n, UCL, shift size, and p from an Excel file
iterations=10000; % number of simulation iterations to be performed
input=xlsread('c:\Users\Rich\Documents\InputFile.xlsx','Sheet1','a1:e10');
inputRows=length(input(:,1)); % determine the number of rows of data in the
input file
APtable=zeros(inputRows,1); % initialize the table of alarm probabilities to
all zeros
% NOTE: USE ONLY IF MVE METHOD IS EMPLOYED
[status,msg] = openR;
if status ~= 1
disp(['Problem connecting to R: ' msg]);
end
for row=1:inputRows % perform the simulation below for scenario in the input
file
n=input(row,1); % read in the sample size
p=input(row,2); % read in the number of variables
UCL=input(row,3); % read in the empirical upper control limit
shiftSize=input(row,4); % read in the size of the shift (as defined by
the NCP)
numOC=input(row,5); % read in the number of out-of-control points
count=0; % initialize the counter for the number of iterations performed
alarmCount=0; % initialize the alarm counter
while count < iterations % run the entire loop for a set number of
iterations
53
%=====> SIMULATE MULTIVARIATE NORMAL DATA
% NOTE: USE THIS UCL ONLY FOR NORMALLY DISTRIBUTED DATA!
alpha=.05; % desired overall false alarm probability (FAP) for the
chart
alphaAdjusted=1-(1-alpha)^(1/n); % desired FAP for each individual
comparison
UCL=((n-1)^2/n)*betainv(1-alphaAdjusted,p/2,(n-p-1)/2) % Tracy,
Young, and Mason's (1992) Phase I UCL
mu=zeros(1,p); % set the mean vector to all zeros
sigma=eye(p); % set the covariance matrix equal to the identity
matrix
X=mvnrnd(mu,sigma,n); % simulate multivariate normally distributed
data
shift=zeros(1,p); % initialize the shift vector to all zeros
shift(1)=sqrt(shiftSize); % place the desired shift in the first
position of the shift vector
% check the NCP to ensure it equals the desired value
NCP=shift/eye(p)*shift';
if abs(NCP-shiftSize) > 0.0001 % display error message if calculated
NCP is significantly different than shift size (they should be
equal since the theoretical covariance matrix of X is I)
disp('ERROR in NCP!')
end
% add the desired shift to randomly selected points
i=1;
randIndex=randperm(n)';
while i <= numOC
X(randIndex(i),:)=X(randIndex(i),:)+shift;
i = i + 1;
end
%=====> COMPUTE ROBUST ESTIMATES OF LOCATION AND SCATTER (must code
out 2 of the 3 methods listed using % symbols)
% BACON METHOD
out=baconV(X,2,.10,3); % compute BACON estimate for location using
Mahalanobis distance, alpha=0.05, and c=4; use version 2
(Euclidean distance) if expected contamination exceeds 20 percent
Xbar_robust=out.center3; % BACON estimate for mean vector
S_robust=out.cov3; % BACON estimate for covariance matrix
% MVE METHOD (requires code for R interface)
evalR('library(MASS)'); % call the R library named "MASS"
putRdata('X',X); % send sample data to R
Xbar_robust=evalR('cov.mve(X)$center'); % use R to find MVE estimate
for mean vector
S_robust=evalR('cov.mve(X)$cov'); % use R to find MVE estimate for
covariance matrix
54
% MCD METHOD
[rew,raw]=mcdcov(X,'plots',0); % compute MCD estimates for location
and scatter using default parameter values; suppress plot output
by adding the arguments ('plots',0)
Xbar_robust=rew.center; % MCD estimate for mean vector
S_robust=rew.cov; % MCD estimate for covariance matrix
% CLASSICAL METHOD
Xbar_robust=mean(X);
S_robust=cov(X);
%=====> COMPUTE HOTELLING'S T2 STATISTICS AND COMPARE TO UCL
alarm=0; % initialize indicator variable representing an alarm (=1)
or no alarm (=0)
T2vector=zeros(n,1); % initialize vector of T2 statistics
for i=1:n % perform loop for all observations in the sample
if alarm==0 % continue loop as long as no false alarms occur
T2stat=(X(i,:)-Xbar_robust)/S_robust*(X(i,:)-Xbar_robust)';
% compute T2 control statistic for each observation
T2vector(i)=T2stat; % store T2 control statistics in a
vector
if T2stat > UCL
alarm=1; % issue an alarm if the current T2 control
statistic exceeds the UCL
end
end
end
if alarm==1
alarmCount=alarmCount+1; % if a control chart issues an alarm,
increment the counter representing total alarms for all
iterations
end
count=count+1; % increment the counter for the total number of
iterations performed
end
AP=alarmCount/iterations; % estimate the alarm probability (AP) for the
current scenario
APtable(row,1)=AP; % store the current AP in a table
disp(APtable); % display AP table for Hotelling's T2 chart on screen
% send the estimated APs to an Excel file
xlswrite('c:\Users\Rich\Documents\OutFile.xlsx',APtable,'Sheet1','A1');
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%% END OF PROGRAM %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%