Genuine Sequential Estimation Procedures
for Gamma Populations using Exact Evaluation Criteria
Except where reference is made to the work of others, the work described in this
dissertation is my own or was done in collaboration with my advisory committee. This
dissertation does not include proprietary or classified information.
Kevin Tolliver
Certificate of Approval:
Hyejin Shin
Assistant Professor
Mathematics and Statistics
Mark Carpenter, Chair
Associate Professor
Mathematics and Statistics
Peng Zeng
Assistant Professor
Mathematics and Statistics
George T. Flowers
Acting Dean
Graduate School
Genuine Sequential Estimation Procedures
for Gamma Populations using Exact Evaluation Criteria
Kevin Tolliver
A Dissertation
Submitted to
the Graduate Faculty of
Auburn University
in Partial Fulfillment of the
Requirements for the
Degree of
Doctor of Philosophy
Auburn, Alabama
December 18, 2009
Genuine Sequential Estimation Procedures
for Gamma Populations using Exact Evaluation Criteria
Kevin Tolliver
Permission is granted to Auburn University to make copies of this dissertation at its
discretion, upon the request of individuals or institutions and at
their expense. The author reserves all publication rights.
Signature of Author
Date of Graduation
Vita
Kevin Paul Tolliver, son of Kevin Leonadis Tolliver and Cheryl Gatewood Tolliver,
was born in Washington, D.C., USA on April 17, 1985. He graduated from Edgewater
High School in Orlando, FL in 2002. In 2005, he received his Bachelor of Science degree in
Mathematics from Morehouse College in Atlanta, GA. He received his Doctor of Philosophy
degree in Statistics from Auburn University in 2009.
Dissertation Abstract
Genuine Sequential Estimation Procedures
for Gamma Populations using Exact Evaluation Criteria
Kevin Tolliver
Doctor of Philosophy, December 18, 2009
(B.S., Morehouse College, 2005)
80 Typed Pages
Directed by Mark Carpenter
In this dissertation, we develop genuine two-stage sequential procedures for bounded-risk and fixed-width confidence interval estimation for Gamma distributed populations, based on exact evaluation criteria. The term "genuine" refers to the fact that, in contrast to previous methods, the procedures proposed herein are based on the combined samples from both the first and second stages, rather than ignoring the data from the first-stage sample. Accordingly, the terminal sample size and the estimate are no longer independent, which complicates the theory development significantly. The term "exact" refers to the fact that the procedures are not evaluated on asymptotic or large-sample theory, as is common in the literature predating this dissertation; the derivations are based only on the properties of the underlying distribution, i.e., Gamma. The practical application of each procedure was also considered, and examples are given for both problems, i.e., bounded-risk and fixed-width.
Acknowledgments
I would like to thank my advisor, Dr. Carpenter, for his academic guidance. Under his direction I became not merely a mathematician but a mathematical statistician. I am very appreciative of his encouragement and belief in my abilities when the material was not always clear. Without his support, this dissertation would not have been completed. It has been a privilege working with him.
I also would like to extend my gratitude to the members of my committee, Dr. Peng Zeng and Dr. Hyejin Shin, for their time and for the suggestions that helped me improve this work.
Working with them in and out of the classroom has truly been a pleasure. In addition, I
would like to thank Dr. Robert Norton for taking time out of his schedule to assist with
the editing process.
I would also like to extend a special thanks to Kandace Noah for helping me understand how my research is applied in her field, and to Mary Mechler and Alicia Smith, who gave important insight during the writing of this dissertation.
I am extremely grateful to my family: my siblings, loving aunts, uncles, and grandmother. I give special thanks to my father, Kevin L. Tolliver, for being a positive role model in my life, and to my mother, Cheryl G. Tolliver, who always pushed me to better myself.
Style manual or journal used: Journal of Approximation Theory (together with the style known as "aums"). Bibliography follows van Leunen's A Handbook for Scholars.
Computer software used: The document preparation package TeX (specifically LaTeX) together with the departmental style-file aums.sty.
Table of Contents
List of Figures x
List of Tables xi
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Research Question . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Sequential Analysis and Multistage Designs . . . . . . . . . . . . . . . . . . 4
1.4 The Gamma Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.1 Special Cases of Gamma Distribution . . . . . . . . . . . . . . . . . 9
1.4.2 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5.1 Bounded Risk Estimation . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5.2 Fixed Width Interval Estimation . . . . . . . . . . . . . . . . . . . . 14
1.5.3 Modeling Times with Gamma . . . . . . . . . . . . . . . . . . . . . . 16
1.6 Dissertation Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2 Bounded Risk Estimation 18
2.1 Shape Known and Scale Unknown . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Improving the Terminal Sample Size . . . . . . . . . . . . . . . . . . . . . . 22
2.2.1 Performance Properties of New Estimation Procedure . . . . . . . . 26
2.3 Computer Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Shape Unknown and Scale Unknown . . . . . . . . . . . . . . . . . . . . . . 28
2.4.1 Robustness Considerations . . . . . . . . . . . . . . . . . . . . . . . 30
2.5 Determining Initial Sample Size . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6 Example in Understanding Precipitation Rates in Regional Climate Models 32
2.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3 Fixed Width Confidence Interval Estimation 37
3.1 Confidence Intervals for Gamma Distribution . . . . . . . . . . . . . . . 37
3.2 Two-Stage Fixed-Width Confidence Interval for Scale . . . . . . . . . . . . 39
3.3 Computer Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 Asymptotic Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5 Example in Air Force Aeronautical Maintenance . . . . . . . . . . . . . . . 45
3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4 Conclusion 50
Bibliography 52
A Tables 56
B Figures 64
List of Figures
1.1 Gamma CDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2 Gamma PDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
B.1 Scatterplot of Alpha vs. Empirical Ratios . . . . . . . . . . . . . . . . . . . 65
B.2 Average Risk of Two Methods Compared to Risk Bound with Initial Sample
of 20 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
B.3 Average Risk of Two Methods Compared to Risk Bound with Initial Sample
of 30 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
B.4 Average Improved Risk of Two Methods Compared to Risk Bound with Ini-
tial Sample of 20 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
B.5 Average Improved Risk of Two Methods Compared to Risk Bound with Ini-
tial Sample of 30 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
List of Tables
A.1 Shape Known, Scale Unknown . . . . . . . . . . . . . . . . . . . . . . . . . 56
A.2 Shape Known, Scale Unknown (Improved) . . . . . . . . . . . . . . . . . . . 57
A.3 Old Bound B as function of Shape and Initial Sample Size . . . . . . . . . . 58
A.4 New Bound B as function of Shape and Initial Sample Size . . . . . . . . . 58
A.5 Shape Unknown, Scale Unknown . . . . . . . . . . . . . . . . . . . . . . . . 59
A.6 Shape Unknown, Scale Unknown (Improved) . . . . . . . . . . . . . . . . . 60
A.7 Initial Sample Size Considerations Simulations . . . . . . . . . . . . . . . . 61
A.8 Confidence Interval Simulations of Terminal Sample Size . . . . . . . . . . 62
A.9 Coverage Percents as Risk Bound Varies . . . . . . . . . . . . . . . . . . . . 63
A.10 Coverage Percents as Initial Sample Size Varies . . . . . . . . . . . . . . . . 63
Chapter 1
Introduction
1.1 Motivation
Statistical modeling is a technique used in many different scientific fields. By summarizing observed results in a single expression, a statistical model helps researchers explain their current results; more importantly, observed outcomes can be used to make predictions about future ones. In scientific experimentation there are many factors that can contribute to a given outcome. A model simply expresses the outcome as a mathematical function of these factors. In order to make such predictions, the data in these models are assumed to be random with some underlying distribution in which one or more parameters are unknown, and parameter estimation is used to fit the models.
The Gamma distribution is often assumed as the underlying distribution when modeling right-skewed variables with positive support. Because of its flexibility, this distribution has a wealth of applications and is often used to model random times-to-event. Two scientific fields in which the Gamma distribution is most often used to model data are Survival Analysis and Reliability Analysis. In Survival Analysis, variables such as the lifespans of organisms, as well as the time until a treatment takes effect, can be modeled with the Gamma distribution. In Reliability Analysis, the lifespans of a system or of system components, as well as processes such as chemical corrosion, can be modeled with the Gamma distribution. The information gained from statistical models in these two fields is used in developing life insurance plans, pertinent drug information, warranty information, quality control information, etc. A parameter often studied in these fields is the Mean-Time-To-Failure (MTTF), which is very useful for systems used on a regular basis. A general queue also models times with a Gamma distribution; this is seen in various computer systems, call centers, and traffic-flow management systems. Articles that use the Gamma distribution to model times relating to queues include Choe and Shroff (1997), Amero and Bayarri (2007), Chu and Ke (1997), and Clarke (1957). Though modeling times is its most frequent use, the Gamma family of distributions can be assumed in any area where values have positive support. For example, it is used frequently in climatology to model both precipitation rates and precipitation intensity, as seen in Maureil et al. (2007) and Gutowski et al. (2008). In addition, it is seen in sensor imaging, as shown in Chatelain (2007) and Chatelain (2008). These statistical models rely on parameter estimation; it is therefore imperative for model fit that the estimates are accurate under some criterion, i.e., have low bias and low variation.
To obtain a parameter estimate with prescribed accuracy, Sequential Analysis is needed to determine how many observations are required. An accuracy measure, such as standard error, depends on the parameters of the unknown underlying distribution. In sequential analysis, the observations are not all sampled at once. In fact, an estimate with predetermined accuracy cannot be guaranteed with a sample size fixed prior to sampling. It is well known that an unbiased estimator becomes more accurate as the sample size increases. However, the sample size needed to ensure that the accuracy falls within the criterion is impossible to determine without any knowledge of the underlying distribution. Sequential problems such as assigned-accuracy problems deal more with the sample size than with the estimator itself. The final model estimates depend on information gained in prior sampling.
Historically, researchers have calculated measures of accuracy for sample design based on incorrect assumptions about the underlying distribution. For many years, the underlying distribution for the data was assumed to be Normal, even for time measurements, where the support is positive and the data are right-skewed. Sampling that assumes the data are Normal when they are not introduces the risk of not actually meeting the criteria. It could also lead to sampling more observations than are needed to meet the specified criterion. This is a very prevalent problem in statistics, since many experiments and surveys are restricted by budgetary constraints.
1.2 Research Question
Our focus is on developing a sampling procedure that ensures an accurate estimator, meaning one with low bias and low variation. The model assumes that the estimator is unbiased, so the concern is restricting the variation. This dissertation looks at two different problems involving predetermined accuracy: the first is to ensure that the risk falls within a bound, and the second is to ensure that the width of an interval estimator is within a bound. In doing so, we completely avoid large-sample theory: there is no asymptotic approximation of the underlying distribution, and consequently our procedures hold for any number of initial observations.
We develop the mathematical theory that ensures the risk is within a pre-specified bound under a genuine two-stage sampling scheme that assumes the data come from a Gamma population. The term "genuine" refers to the fact that the sequential procedure is based on the combined sample from both stages. It may seem fairly obvious that a genuine two-stage estimation procedure will yield better results than one that disregards one of the two samples, yet this has not always been implemented. Arriving at a two-stage procedure that ensures the risk is within a bound may come with a cost and could result in sampling more observations than previous sampling procedures require. We therefore also consider the more practical problem of sampling the fewest observations that achieve this goal, to avoid oversampling as described by Wald (1947). Using a relationship between risk and interval estimation, a genuine two-stage fixed-width interval estimation sampling scheme is produced that, to our knowledge, is new to this field.
1.3 Sequential Analysis and Multistage Designs
Sequential analysis is a statistical theory for data in which the final sample size is not known prior to the study. A sampling procedure in which the final sample size is known prior to sampling is called a fixed-sample-size procedure. Sequential sampling procedures are used over fixed-sample-size procedures for (1) ethical reasons, (2) conceivability reasons, and (3) economical reasons. For example, in a drug trial for reducing hypertension, if some of the m initial subjects develop side effects, or there is significant evidence that the true mean is low, then medical ethics forbid further sampling. In other instances, arriving at an alternative solution is inconceivable. An example of a conceivability reason is an industrial process: there is no known way of determining, with a fixed sample size, when the process will become out of control. There are also occasions when sequential analysis is economical, for example any sampling scheme in which each observation carries a cost: sequential analysis can reduce the number of observations, which will consequently reduce the cost of the experiment. Finding assigned-accuracy estimators for the parameters of a Gamma process or population involves all three. As noted before, the Gamma is often assumed in modeling times in clinical trials. There is no conceivable solution for determining the final sample size needed to achieve predetermined accuracy with a fixed sample size. And since one objective is to sample the fewest observations that achieve a certain goal, the problem has economical aspects. For all of these reasons, a sequential design needs to be implemented to achieve pre-assigned accuracy.
Sequential analysis consists of two components: (1) the stopping rule and (2) the decision rule. The stopping rule indicates whether sampling should be stopped after m observations or whether additional observations should be drawn. A stopping rule is characterized as a mechanism for deciding whether to continue or stop a process on the basis of the present position and past events, and it will almost always lead to a decision to stop at some time. The final resultant sample size N is called the terminal sample size. The decision rule specifies what actions are to be taken after sampling has stopped.
4
Definition 1.1 If m is a known predetermined sample size, a sample is said to be sequential if the terminal sample size N is not fixed, i.e.,

    P(N = m) < 1  for m, N ∈ ℕ.

The emphasis in this context is on having an estimator that falls below some predetermined accuracy level; the terminal sample size N is the final sample size that ensures this is the case. The terminal sample size depends on the earlier observed information X_1, X_2, ..., X_m, making it a random variable.
The optimal sample size n* is the number of observations that best achieves a researcher's goal. This can mean different things for different problems. In our context, the optimal sample size is the smallest number of observations that ensures our estimator is accurate under the predetermined criterion. The optimal sample size is fixed and depends on the unknown parameters of the underlying distribution. In an ideal situation the terminal sample size equals the optimal sample size. The terminal sample size is assessed by looking at the ratio of the expectations,

    E[N/n*].    (1.3.1)

The performance of the terminal sample size can be evaluated asymptotically:

    lim_{m→∞} E[N/n*] = 1    (1.3.2)

and

    lim_{m→∞} Var[N/n*] = 0.    (1.3.3)
In sequential analysis, there are two subfields: (1) purely sequential designs and (2) multistage designs. Earlier it was mentioned that the final sample size is not known prior to the start of sampling. However, this does not mean that each observation must be observed one at a time. In purely sequential designs, each observation is observed one at a time and an analysis is performed after each observation is drawn. In multistage designs, multiple observations are drawn at a time, in what is called a stage, and there is a cap on the total number of stages. The terminal sample size does not necessarily include the prior m observations; in some instances the m observations are used to determine the terminal sample size and are then disregarded in the analysis.
Definition 1.2 Let X_1 = {X_1, ..., X_{m_1}} be an initial sample. For a given decision rule, subsequent samples are X_i = {X_{m_{i−1}+1}, ..., X_{m_i}} or X_i = ∅, for 1 ≤ i ≤ k. The sample X = ∪_{i=1}^{k} X_i is a genuine k-stage sample.
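As a concrete sketch, a genuine two-stage sample (k = 2) simply pools the first-stage data with whatever the stopping rule dictates for the second stage. The stopping rule below is a deliberately arbitrary placeholder, not one of the procedures developed later in this dissertation; only the pooling structure matters here.

```python
import numpy as np

def genuine_two_stage_sample(rng, m, stopping_rule, draw):
    """Genuine two-stage sample (Definition 1.2 with k = 2): the terminal
    data set is the union of BOTH stages, not just the second one."""
    stage1 = draw(rng, m)                    # initial sample X_1, ..., X_m
    N = max(m, stopping_rule(stage1))        # terminal size from stage-one data
    stage2 = draw(rng, N - m)                # second stage (possibly empty)
    return np.concatenate([stage1, stage2])  # X = X_1 union X_2

# Illustration with a hypothetical stopping rule: continue sampling until
# n exceeds five times the first-stage sample variance.
rng = np.random.default_rng(0)
draw = lambda rng, n: rng.gamma(shape=2.0, scale=3.0, size=n)
rule = lambda x: int(np.ceil(5.0 * np.var(x, ddof=1)))
combined = genuine_two_stage_sample(rng, m=20, stopping_rule=rule, draw=draw)
print(len(combined))   # terminal sample size N >= m = 20
```

A non-genuine procedure would instead discard `stage1` and base the estimate on a fresh sample of size N.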
The advantages of purely sequential designs are that they yield better statistical results and have a reduced chance of oversampling, as described by Wald. However, depending on the design, multistage sampling can be more cost-efficient and more manageable.
For example, consider the problem of determining when an industrial process becomes out of control; data are read after each observation. In such cases, it is practical to use a purely sequential design: as the data are read, the process can immediately be identified as out of control, and there is no need to continue sampling. However, in a clinical trial it is not practical to treat one subject at a time; a multistage design is needed.
Multistage designs are currently used in a wide range of areas. Some multistage sampling schemes use a set of observations from the population as their initial sample, with subsequent samples drawn by analyzing subsets of that initial set. This is seen in the multistage method of the U.S. Census Bureau's Current Population Survey, given in Moore, McCabe, and Craig (2007). It is also seen in crop management, Finney (1984), as well as in multistage cluster analysis, Phillipi (2005). In the context of modeling times, multistage designs are often used in adaptive designs. In a broad overview of adaptive designs, several examples were given where statistical procedures were modified during the conduct of clinical trials; such adaptation serves not only to identify clinical benefits of the test treatment under investigation efficiently, but also to increase the probability of success of clinical development, Chow and Chang (2008). Most adaptive designs in clinical trials can be described as adaptive randomization and
group sequential designs, with the flexibility of stopping a trial early due to safety, futility, and/or efficacy, and of re-estimating the sample size at interim to achieve the desired statistical power. In an article on a unified theory of two-stage adaptive designs, mathematical theory is proposed for the adaptations in the literature. To summarize, the adaptations alter the sampling distribution, which means the assumed results may not hold, Liu, Proschan, and Pledger (2002). For two-stage adaptive tests in particular, changes in the sampling distribution can occur, and only recently has altering the p-value been considered. In their article, they arrive at a number of useful results on two-stage adaptive designs.
Using a large number of stages makes organization more complicated, and both administrative expenses and interest charges on the large investment increase. This is also the case with a number of other sequential sampling procedures. Because of this, the literature contains many two- and three-stage designs, e.g., Mukhopadhyay and Pepe (2006), Mukhopadhyay and Zacks (2006), Yao and Venkatraman (1998), Satagopan et al. (2002), Whittemore (1997), Jinn et al. (1987), Lorden (1983), and Mukhopadhyay (1995), and not as many four-, five-, and six-stage designs.
Squared error loss is a measure of an estimate's distance from the true parameter.

Definition 1.3 If A > 0 is a constant specified by the experimenter that penalizes deviations more or less as needed, the squared error loss of an estimator θ̂ of θ based on n observations is the squared distance between the parameter and its estimator:

    L_n(θ, θ̂) = A(θ̂ − θ)².
In practice, this measure is assessed by its expected value, called the risk. The risk gives an indication of the reliability of an estimator: high risk indicates the estimator is unreliable, while low risk indicates it is reliable. Increasing the sample size is one action taken to lower risk. One accuracy criterion in sequential analysis is the bounded risk estimator.
Definition 1.4 For a predetermined risk bound w, a bounded-risk estimator with terminal sample size N is one whose expected squared error loss satisfies

    R_N(θ, θ̂) = E(L_N) ≤ w.
It is well documented that the risk is a multiple of the variance plus the squared bias of the estimator. Consider the Normal distribution with known variance; in this instance, bounding the risk is an easy calculation. However, if the mean and variance are both unknown, this problem cannot be solved with a fixed sample size. When fitting statistical models via parameter estimation, the goal is to have accurate estimates of the mean parameter.
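For the Normal case with known variance mentioned above, the calculation can be made concrete: the sample mean is unbiased, so its risk is Aσ²/n, and the bound is met by a fixed sample size. A small Monte Carlo sketch (the values of A, σ², and w are arbitrary):

```python
import math
import numpy as np

# For Normal data with known variance sigma2, the sample mean is unbiased and
# its risk is R_n = A * sigma2 / n, so R_n <= w has the fixed-sample-size
# solution n* = ceil(A * sigma2 / w) -- no sequential design is required.
A, sigma2, w, mu = 1.0, 4.0, 0.25, 5.0
n_star = math.ceil(A * sigma2 / w)   # = 16 here

# Monte Carlo check that the risk at n* meets the bound.
rng = np.random.default_rng(0)
xbars = rng.normal(mu, math.sqrt(sigma2), size=(100_000, n_star)).mean(axis=1)
risk = A * np.mean((xbars - mu) ** 2)
print(n_star, round(risk, 3))        # risk is close to w = 0.25
```

When σ² is unknown this calculation is unavailable, which is precisely what motivates the sequential procedures that follow.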
Another accuracy criterion in sequential analysis is the fixed-width interval estimator.

Definition 1.5 For a predetermined width d, a (1 − α) fixed-width interval estimator of a real-valued parameter θ is any pair of functions L(X) and U(X), with L(X) < U(X), for which the inference L(X) < θ < U(X) is made. Writing C_X for the interval [L(X), U(X)], the width of the confidence interval satisfies U(X) − L(X) ≤ d, and P(θ ∈ C_X) ≥ 1 − α.

Fixed-width confidence intervals are a large part of sequential estimation. Interval estimation is regarded as more informative than point estimation, since P(θ̂ = θ) = 0 for any continuous distribution. An interval estimator is characterized by the width of the interval and the coverage probability, and there is a trade-off between them: a small coverage probability implies that the researcher has a larger chance of making an error, whereas a large width is uninformative.
1.4 The Gamma Distribution
The model assumption for the proposed sampling scheme is that the underlying distribution is Gamma; the focus is on estimating the mean parameter of the Gamma distribution. The Gamma distribution is a flexible right-skewed distribution with a variety of applications. It is often used in modeling times-to-event, as seen in the biological sciences, engineering, ecology, and probability. The density is
    f(x) = [1 / (Γ(α) β^α)] x^{α−1} e^{−x/β},  for x > 0,    (1.4.1)

where Γ(α) = ∫₀^∞ t^{α−1} e^{−t} dt and α, β > 0, with Γ(1/2) = √π and Γ(α + 1) = αΓ(α). Note, α and β are referred to as the shape and scale parameters, respectively. One property of the Gamma density is that it is closed under scalar multiplication: if X ~ Gamma(α, β), then

    Y = cX ~ Gamma(α, cβ).    (1.4.2)

The sum of k independent Gamma random variables with shape α and scale β is

    Σ_{i=1}^{k} X_i ~ Gamma(kα, β).    (1.4.3)

The moment generating function of this distribution is

    M_X(t) = (1 − βt)^{−α},  t < 1/β,    (1.4.4)

which makes the mean and variance

    E[X] = αβ  and  Var(X) = αβ².    (1.4.5)
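These moment formulas are easy to spot-check by simulation; the sketch below uses NumPy, whose gamma sampler shares the shape/scale parameterization of (1.4.1):

```python
import numpy as np

# Simulation check of E[X] = alpha*beta and Var(X) = alpha*beta^2
# (numpy's gamma sampler uses the same shape/scale parameterization).
alpha, beta = 3.0, 2.0
rng = np.random.default_rng(1)
x = rng.gamma(shape=alpha, scale=beta, size=1_000_000)
print(x.mean(), x.var())   # approximately 6.0 and 12.0
```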
1.4.1 Special Cases of Gamma Distribution
Some special cases of the Gamma distribution will be noted, as they are referenced throughout this dissertation. Suppose X is Gamma distributed with shape α and scale β. If the shape parameter is one, then X is Exponentially distributed with scale β; it is well documented that adding k independent Exponentially distributed variables yields a Gamma distribution with shape k and scale β, as noted in equation (1.4.3). If the shape parameter is an integer, then the variable is Erlang with shape α and scale β. If the scale is two, then X becomes a Chi-Square distribution with 2α degrees of freedom; it follows by (1.4.2) that 2X/β ~ Chi-Square(2α). Finally, if there are two independent Gamma distributed variables X ~ Gamma(α₁, β) and Y ~ Gamma(α₂, β), then X/(X + Y) ~ Beta(α₁, α₂).

Figure 1.1: Gamma CDF
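Two of these special cases can be verified numerically by comparing simulated moments with the known moments of the target distributions (the parameter values are arbitrary):

```python
import numpy as np

# Moment spot-checks of two of the special cases above.
rng = np.random.default_rng(2)
alpha, beta = 2.5, 4.0
x = rng.gamma(alpha, beta, size=500_000)

# (i) 2X/beta ~ Chi-Square(2*alpha): mean 2*alpha = 5, variance 4*alpha = 10.
z = 2.0 * x / beta
print(z.mean(), z.var())

# (ii) X/(X+Y) ~ Beta(a1, a2) for independent Gammas with a common scale:
a1, a2 = 2.0, 3.0
u = rng.gamma(a1, beta, size=500_000)
v = rng.gamma(a2, beta, size=500_000)
b = u / (u + v)
print(b.mean())   # Beta mean a1/(a1 + a2) = 0.4
```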
1.4.2 Estimation
As previously stated, the Gamma distribution is widely used in engineering, probability, ecology, and the biological sciences. The problem of finding reliable estimators for the mean dates back to the early 1950s. A number of different methods can be used to find estimators for this distribution: the method of moments, maximum likelihood, and least squares. In this dissertation, the maximum likelihood method is used to obtain the estimator for β, and the method of moments is used to obtain the estimator for α.
The maximum likelihood method is the most popular technique for deriving estimators. This technique has many ideal properties, including the fact that it yields best unbiased estimators. Using (1.4.1), the likelihood function for n independently and identically distributed variables becomes

    L(α, β) = [Γ(α) β^α]^{−n} ∏_{i=1}^{n} x_i^{α−1} e^{−(1/β) Σ_{i=1}^{n} x_i},  x_i > 0 for i = 1, ..., n.

To ease computation, the natural logarithm of the likelihood is taken; this can be done because the natural log is a monotone function, so the likelihood maintains its optimum values. If the shape α is known, the maximum likelihood estimator for β is easily obtained and is shown to be

    β̂ = X̄/α.    (1.4.5)

If the shape is unknown, no closed-form maximum likelihood estimator exists, and a numerical solution must be used to arrive at its maximum likelihood approximation. This is the reason the maximum likelihood approach is not used for the shape parameter.

Figure 1.2: Gamma PDF
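The closed form β̂ = X̄/α follows from setting the derivative of the log-likelihood with respect to β to zero: −nα/β + (Σxᵢ)/β² = 0. The sketch below confirms numerically that the formula agrees with a grid maximization of the log-likelihood (true parameter values are arbitrary):

```python
import numpy as np

# With alpha known, d/d(beta) of the log-likelihood gives
# -n*alpha/beta + (sum x_i)/beta^2 = 0, i.e. beta_hat = Xbar/alpha.
alpha_true, beta_true = 2.0, 5.0
rng = np.random.default_rng(3)
x = rng.gamma(alpha_true, beta_true, size=200_000)

beta_hat = x.mean() / alpha_true
print(beta_hat)   # approximately 5.0

# Grid check that beta_hat maximizes the log-likelihood
# (terms constant in beta are dropped).
def loglik(beta):
    return -x.size * alpha_true * np.log(beta) - x.sum() / beta

grid = np.linspace(4.0, 6.0, 401)
best = grid[np.argmax(loglik(grid))]
print(abs(best - beta_hat) < 0.01)   # grid optimum agrees with the formula
```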
The method of moments is another common method for estimating a parameter. It works by setting the kth theoretical moment equal to the average of the kth powers of the observations,

    Ê(X^k) = (1/n) Σ_{i=1}^{n} X_i^k,  k ∈ ℕ.

The shape parameter α is widely considered a nuisance parameter. Because a best unbiased estimator is unattainable, the method of moments estimator is used:

    α̂ = X̄²/S².    (1.4.6)

This is a slightly biased but asymptotically consistent estimate.
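A quick numerical illustration of the moment-matching step behind (1.4.6), with arbitrarily chosen true parameters:

```python
import numpy as np

# Since E[X] = a*b and Var(X) = a*b^2, the shape satisfies a = E[X]^2/Var(X);
# replacing the moments by sample versions gives alpha_hat = Xbar^2 / S^2.
alpha_true, beta_true = 3.0, 2.0
rng = np.random.default_rng(4)
x = rng.gamma(alpha_true, beta_true, size=200_000)
alpha_hat = x.mean() ** 2 / x.var(ddof=1)
print(alpha_hat)   # approximately 3.0 (slightly biased, consistent)
```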
1.5 Literature Review
Early elements of sequential analysis appear in the 17th and 18th centuries, when the mathematicians Huygens, Bernoulli, Montmort, De Moivre, Lagrange, and Laplace worked on the Gambler's Ruin problem (Ghosh and Sen 1991). This famous probability problem tries to determine at what point a gambler will completely deplete his funds. Dodge and Romig, in 1929, were the earliest to apply what is now known as sequential analysis to a statistical problem: they developed a double sampling test in which two samples were taken and the proportion of defective units was observed. Shewhart, in 1931, developed theory on when an industrial process becomes out of control. Wald, in 1947, produced a well-known book on sequential analysis that sparked interest from several authors worldwide.
1.5.1 Bounded Risk Estimation
A common sequential problem dealing with preassigned accuracy is bounded risk estimation. For populations with known variance there is a fixed-sample-size solution, and no sequential methods need to be implemented; the problem arises when nothing is known about the population. Stein (1945) proposed a two-stage bounded risk estimation procedure for Normal populations, which used the standard deviation from the initial sample to obtain the proper terminal sample size. Modeling times with a Normal underlying distribution will not yield ideal results, because times are often skewed to the right; a better distribution to assume when modeling times is the Exponential.
Birnbaum and Healy (1960) developed a two-stage bounded risk estimation procedure for Exponential processes. This sampling scheme assumed that the underlying distribution was Exponential and bounded the risk for the scale parameter using Chi-Square transformations. Their result can be summarized as follows: if there are m ≥ 3 initial observations and

    B_BH = A m² / [(m − 1)(m − 2)],

then

    N_BH = ⌈ B_BH X̄²_m / w ⌉    (1.5.1)

is the sample size required so that the risk of the estimator is within the bound w. However, this procedure is not a genuine two-stage sampling procedure: the solution is based only on the second sample. The initial sample is used to determine the sample size required to achieve bounded risk and is then excluded from the final estimate. This is done because taking observations from two different samples alters the sampling distribution. It is true that bounding the risk cannot be done with a fixed sample size; however, disregarding readily available information is wasteful. The concept was later improved by adding the observations from the initial sample to the second sample, making the procedure a genuine two-stage sampling procedure. Although this proved to perform well asymptotically as a bounded-risk estimator (Kubokawa 1989), no exact proof was provided for the result, and it is uncertain whether it holds for an arbitrary number of initial observations (m ≥ 3).
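The rule (1.5.1) is straightforward to compute from the initial sample. The sketch below is an illustrative implementation (the function name and example values are ours, not from Birnbaum and Healy):

```python
import math
import numpy as np

# Sketch of the Birnbaum-Healy rule (1.5.1) for an Exponential population:
# the m initial observations determine N_BH, and the estimate is then based
# on a FRESH sample of size N_BH -- the initial data are discarded, which is
# why this is not a genuine two-stage procedure.
def n_birnbaum_healy(stage1, A, w):
    m = len(stage1)
    if m < 3:
        raise ValueError("need m >= 3 initial observations")
    B = A * m**2 / ((m - 1) * (m - 2))
    return math.ceil(B * np.mean(stage1) ** 2 / w)

rng = np.random.default_rng(5)
stage1 = rng.exponential(scale=3.0, size=20)   # m = 20 initial observations
N = n_birnbaum_healy(stage1, A=1.0, w=0.5)
print(N)   # grows as the risk bound w shrinks
```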
Various works in sequential estimation of scale parameter of the Exponential distribu-
tion is done by Mukhopadhyay (1995), (2006), (2006a), (2006b), (2007), etc. Mukhopadhyay
and Pepe (2006) record an exact genuine two-stage sampling procedure. Their result which
13
holds for any initial sample size greater than three, can be summarized as follows: if there
are m 3 initial observations and
BMP = 2Am(m+ 1)(m 1)(m 2)
then
NMP =dBBH
X2m
w e (1:5:2)
is the sample size required so that the risk of the estimator is within the bound w. The consequence of this is that the expected value of the terminal sample size is more than twice the optimal sample size; on average, the researcher will sample more than twice the observations needed. This is referred to as a penalty for exact bounded-risk estimation. By exploring the distribution of the terminal sample size, a reduction of the terminal sample size can be found, making this exact procedure more practical, Zacks and Mukhopadhyay (2006).
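The penalty can be seen in a small Monte Carlo sketch of the rule in (1.5.2). This is not the dissertation's simulation code; the constants below are hypothetical, chosen so that the optimal sample size works out to 100.

```python
import math
import random

def mp_sample_size(xbar_m, m, A, w):
    """Mukhopadhyay-Pepe terminal sample size, as summarized in (1.5.2):
    B_MP = 2*A*m*(m+1)/((m-1)*(m-2)),  N_MP = ceil(B_MP * xbar_m^2 / w)."""
    B_MP = 2 * A * m * (m + 1) / ((m - 1) * (m - 2))
    return math.ceil(B_MP * xbar_m**2 / w)

# Monte Carlo look at the oversampling penalty for Exponential data
# (shape one): scale beta = 5, A = 2, and w chosen so that the optimal
# sample size is n* = A*beta^2/w = 100.
random.seed(1)
beta, A, m = 5.0, 2.0, 20
w = A * beta**2 / 100
sizes = [mp_sample_size(
             sum(random.expovariate(1 / beta) for _ in range(m)) / m,
             m, A, w)
         for _ in range(2000)]
avg_N = sum(sizes) / len(sizes)   # more than twice n* = 100
```

The average terminal size exceeds 2n* for this small pilot size m, in line with the penalty described above; as m grows, B_MP approaches 2A and the ratio settles near two.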
1.5.2 Fixed Width Interval Estimation
The next sequential problem is restricting the interval estimator's width. Interval estimation is one of the fundamental aspects of statistics. Presenting interval estimators is often preferred over measures of variation, such as risk, probably because confidence intervals can yield better interpretations. This is particularly important when the estimate does not have a Normal sampling distribution, Ramsey and Shafer (2002). Both measures give flexibility to the estimator, but interval estimators give results in terms of what is probable, i.e. the probability that the true parameter lies within a 1 − a confidence region is 1 − a.
Fixed-width confidence intervals combine relatively high coverage probabilities with relatively narrow interval widths. Traditionally, they are of the form
\[ C_{\bar{X}} = \{\theta : \bar{X}_N - d < \theta < \bar{X}_N + d\}, \qquad (1.5.3) \]
with the terminal sample size N. There is no fixed-sample-size solution to this when the variance of the distribution is unknown, because the width of the interval estimator depends on that variance. Stein (1949) solves the Normal fixed-width problem by proposing a two-stage procedure to bound the confidence interval for the mean when the variance σ² is not known. The terminal sample size of this procedure is given below,
\[ N = \max\left\{ m,\ \left\lceil \frac{b_{m-1,1-a/2}^{2}\,S_m^{2}}{d^{2}} \right\rceil \right\}, \qquad (1.5.4) \]
where b_{m−1,1−a/2} is the 1 − a/2 point of a t distribution with m − 1 degrees of freedom. This uses the fact that b_{m−1,1−a/2} will be larger than z_{1−a/2}. This procedure, however, is not asymptotically efficient: for a fixed initial sample size, the ratio between this procedure's expected sample size and the optimal sample size approaches (b_{m−1,1−a/2}/z_{1−a/2})². Ghurye (1958)
proposes a two-stage fixed-width confidence interval for a location parameter of a general density f(x), along the lines of Stein; it is not used for the mean. Chow and Robbins (1965) record a purely sequential interval estimator for the mean of a general density f(x). Their sequential result takes an initial sample of m observations, then chooses the first n for which the following is achieved:
\[ N = \min\left\{ n \ge m : n \ge d^{-2} z_{a/2}^{2}\,( S_n^{2} + n^{-1} ) \right\}. \qquad (1.5.5) \]
However, their procedure uses asymptotic theory. This is not a practical approach for our model estimates, because it draws observations one at a time and we are avoiding Normal approximation. A general method for determining fixed-width confidence intervals is given by Khan (1969). This method, like the previous methods, uses Normal theory; in it he discusses almost sure convergence, asymptotic consistency, and asymptotic efficiency. Research continues to be developed in this area. For example, Mukhopadhyay, Silva, and Waikar (2006) develop a two-stage sampling procedure which they compare to Stein's fixed-width interval approach (1949) and Chapman's fixed-width interval approach (1950).
As noted before, when estimating mean time it is better to model with the Exponential distribution. Govindarajulu (1995) developed a sequential estimator for the mean of an Exponentially distributed population. This result is more applicable to modeling times; it can be summarized as follows: if
\[ z_n = z\left[ 1 + n^{-1}(1+z^{2})/4 + o(n^{-1}) \right], \]
then
\[ N_G = \min\left\{ n \ge m : n \ge z_n^{2}\,\bar{X}_n^{2}/d^{2} \right\} \qquad (1.5.6) \]
yields an interval of the required width. This procedure again uses Normal approximation.
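The one-at-a-time character of the rule in (1.5.6) can be sketched as follows. This is a sketch under the reconstruction above; the data stream, pilot size, and width below are hypothetical.

```python
import random
from statistics import NormalDist

def govindarajulu_n(data_stream, m, d, a=0.05):
    """Purely sequential stopping rule sketched in (1.5.6): sample one
    observation at a time and stop at the first n >= m satisfying
    n >= z_n^2 * xbar_n^2 / d^2, where z_n = z*(1 + (1 + z^2)/(4n))
    carries the second-order correction to the Normal point z."""
    z = NormalDist().inv_cdf(1 - a / 2)
    xs = []
    for x in data_stream:
        xs.append(x)
        n = len(xs)
        if n < m:
            continue
        xbar = sum(xs) / n
        z_n = z * (1 + (1 + z**2) / (4 * n))
        if n >= z_n**2 * xbar**2 / d**2:
            return n, xbar

# Hypothetical run on Exponential(mean 1) data, pilot m = 10, d = 0.2.
random.seed(2)
stream = (random.expovariate(1.0) for _ in iter(int, 1))
n_stop, xbar_stop = govindarajulu_n(stream, m=10, d=0.2)
```

Since the mean and standard deviation of an Exponential coincide, the rule stops roughly when n exceeds z²X̄²/d², around a hundred observations here; the drawback criticized above is visible in the loop: every observation triggers a fresh decision.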
However, because of the shape of this distribution, no research has been found on restricting the width of the interval without using asymptotic approximation.
1.5.3 Modeling Times with Gamma
It should be noted that both of the prior subsections ended with sequential research in statistical modeling for Exponential populations. This is because there is little research in this area for Gamma populations. However, in the same instances where the Exponential model can be used, so can the Gamma. Specifically, random times, such as mean times to failure, are commonly assumed to be Exponential, where the longer one survives the smaller the probability of continued survival. This is not always the case. For example, infant life-expectancy charts show many deaths during the first few months after birth, so for a short period of time the life expectancy increases the longer the infant lives. It is more appropriate to model infant survival with a Gamma or Weibull distribution; both of these distributions are more general forms of the Exponential distribution. There are several works that discuss modeling MTTF as a
Gamma distributed variable: Coit and Jin (1999), Shapiro and Wardrop (1978), Barber and Jennison (2002), etc. These articles provide examples of when the Gamma should be used over the Exponential distribution, for example, when modeling failure times with a known number of failures in the presence of missing values. The time between one failure and the last record is Gamma with known shape. This happens often when data are recorded periodically and not after each failure, Coit and Jin (1999).
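The Coit and Jin mechanism can be checked with a short simulation: if individual inter-failure times are Exponential but only every k-th failure is recorded, each recorded gap is a sum of k Exponentials, i.e. Gamma with known shape k. The values of theta and k below are hypothetical.

```python
import random

# Inter-failure times Exponential with mean theta; only every k-th
# failure is recorded, so each recorded gap is a sum of k Exponentials,
# i.e. Gamma with known shape k and unknown scale theta.
random.seed(3)
theta, k = 4.0, 5
gaps = [sum(random.expovariate(1 / theta) for _ in range(k))
        for _ in range(20000)]
mean_gap = sum(gaps) / len(gaps)   # close to k * theta = 20
```

The empirical mean of the recorded gaps matches the Gamma mean kθ, which is why the shape can be treated as known while the scale remains the quantity to estimate.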
1.6 Dissertation Layout
In the second chapter, two-stage bounded-risk estimators are developed, and the performances of their sample sizes are evaluated through simulation. Numerical methods are used to reduce the terminal sample size, bringing it closer to the optimal sample size. In the third chapter, a fixed-width interval estimator is created, and an example is given to illustrate how it works and how it relates to queueing theory. The final chapter summarizes the results of this dissertation and discusses future research problems in this area.
Chapter 2
Bounded Risk Estimation
The goal of the bounded-risk problem is to sample the fewest observations such that the risk is just within a predetermined bound. Birnbaum-Healy (1960) developed a two-stage sampling procedure for the Exponential distribution; however, their method does not use the information obtained from the initial sample in the final estimate. Mukhopadhyay and Pepe (2006) develop a two-stage sampling procedure for the Exponential distribution that combines the initial sample with the second sample to derive the final estimate. In this chapter, we generalize Mukhopadhyay and Pepe's result to Gamma distributed populations. It should follow that when the shape is equal to one, our results are exactly those of Mukhopadhyay and Pepe.
Additionally, we introduce some notation and basic concepts of decision theory as they apply to the Gamma distribution. First, risk bounds are found when only the shape parameter is known. Second, risk bounds are found when both parameters are unknown. We also evaluate the performance of our bounds theoretically and through simulations and make possible improvements.
2.1 Shape Known and Scale Unknown
There are a number of reasons the shape-known case is studied: (1) there are particular instances where the shape parameter is either known or can be assumed known; (2) studying the α-known case allows us to see how robust the Exponential assumption is; (3) there are times when the shape parameter is not known but mathematical theory allows us to assume it is known; and (4) it lays the groundwork for the case where the shape parameter is unknown.
MTTF is often estimated as an Exponential random variable. This is merely a special case of the Gamma distribution in which the shape is known and equal to one; Mukhopadhyay discusses this in a number of articles (1995), (2006), (2006a), (2006b), (2007), etc. In many cases, however, MTTF is modeled with a Gamma distribution whose shape is known and not equal to one. Dopke (1992) and Coit and Jin (1999) discuss estimation of the MTTF as a Gamma random variable. The example given by Coit and Jin arises when the time between each failure is not recorded: if there are k failures in a span t, then the MTTF is Gamma distributed with known shape k and unknown scale. They elaborate on why each failure time is not always recorded, saying, "this is understandable because the elapsed time meter records time for the entire assembled item and not the individual components." This is similar to the idea of summing Poisson-process interarrival times. Other examples with known shape parallel the Normal distribution: just as there are instances where the Normal variance is assumed known, there are modeling situations where the shape parameter is assumed known. This can happen for a number of reasons: there is so much historical evidence that the shape is consistent, there exists some mathematical theory for the shape's value, or the actual shape is of little concern as long as it is within reason. For example, in Maurellies' (1999) precipitation models, the exact shape is of little importance; since the data are right skewed, it is important only to model the data with a low shape value, and precipitation intensity is simply modeled with α = 2. In each of these examples it is important to have reliable estimates.
This is not the only reason for exploring the shape-known, scale-unknown case. Studying the shape-known case also gives an idea of how robust the assumption of an Exponential distribution is. Mukhopadhyay and Pepe's (2006) result holds only for the Exponential distribution; if there is some uncertainty about whether the shape is one, their procedure is no longer valid. These aforementioned reasons provide justification for studying the shape-known case.
Our goal is to develop a reliable estimation sampling scheme for the case where only the scale is unknown. If the shape α is known and it is only desired to estimate the scale β, our goal is to find the fewest observations n that make the associated risk function less than or equal to a predetermined risk w > 0. Recall the mean of a Gamma distribution with parameters α and β is αβ. Hence the risk requirement is R_n(β, β̂) ≤ w. If the shape is known, the problem reduces to
\[ R_n = \frac{A\beta^{2}}{\alpha n} \le w. \qquad (2.1.2) \]
This implies n ≥ Aβ²/(αw). Let the optimal sample size be
\[ n^{*} = \left\lceil \frac{A\beta^{2}}{\alpha w} \right\rceil. \qquad (2.1.3) \]
The ceiling guarantees an integer value, which ensures that the risk is within our bound. Sampling more observations than n* is considered oversampling; sampling fewer observations than n* yields a risk above the bound and thus an unreliable estimate.
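Under the reconstruction of (2.1.3) used here, the optimal sample size is a one-line computation. The parameter values in the comment are hypothetical.

```python
import math

def optimal_n(A, alpha, beta, w):
    """Optimal sample size (2.1.3): n* = ceil(A * beta**2 / (alpha * w)).
    Attainable directly only if the scale beta were known."""
    return math.ceil(A * beta**2 / (alpha * w))

# With A = 2, shape alpha = 2, scale beta = 5, and bound w = 1,
# n* = ceil(2 * 25 / (2 * 1)) = 25.
```

The dependence on the unknown β in this formula is exactly what forces the two-stage procedure developed next.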
Notice n* depends on the unknown parameter β, so a sequential sampling procedure must be implemented to gain knowledge of this parameter. A pilot sample of m observations X₁, ..., X_m, i.i.d. Gamma(α, β) with m ≥ 3, will be taken. From this sample the maximum likelihood estimator of β, X̄_m/α, can be found; see (1.4.5) for its derivation. That estimate is used to determine the terminal sample size N. This quantity guarantees that we do not exceed the necessary number of observations for the statistical procedure by too much, as that might be costly or impractical, yet do not fall short of an appropriate sample size either. After observing the first m observations, our first stage, a decision is made whether the procedure can stop with the m observations or whether more need to be added, our second stage. This yields our two-stage procedure.
Theorem 2.1 If X₁, ..., X_m i.i.d. Gamma(α, β) initial observations are drawn (m ≥ 3) and B is chosen to be
\[ B = A\alpha^{2}\left( \frac{m}{\alpha} - \frac{2m^{2}\,\Gamma(m\alpha-1)}{\Gamma(m\alpha)} + \frac{(m^{3}\alpha + m^{2})\,\Gamma(m\alpha-2)}{\Gamma(m\alpha)} \right), \qquad (2.1.5) \]
and the terminal sample size is chosen to be
\[ N = \max\left\{ m,\ \left\lceil \frac{B\,\bar{X}_m^{2}}{\alpha^{3} w} \right\rceil \right\}, \qquad (2.1.4) \]
then, if N − m observations are drawn in the second stage, the risk over all N observations will be less than the predetermined risk bound w: R_N ≤ w.
Proof.
Conditioning on the first stage, we can re-express the risk over all N observations, R_N = A E(X̄_N/α − β)², as
\[ A\,E\!\left[ \frac{m^{2}}{N^{2}}\left( \frac{\bar{X}_m}{\alpha} - \beta \right)^{\!2} + \frac{\beta^{2}}{\alpha}\cdot\frac{N-m}{N^{2}} \right]. \]
Recall m ≤ N, so the ratio m/N ≤ 1, and N ≥ B X̄_m²/(α³w).
Now,
\[ A\,E\!\left[ \frac{m^{2}}{N^{2}}\left(\frac{\bar{X}_m}{\alpha}-\beta\right)^{\!2} \right] \le A\,E\!\left[ \frac{m}{N}\left(\frac{\bar{X}_m}{\alpha}-\beta\right)^{\!2} \right] \le \frac{m\alpha^{3} w}{B}\, A\,E\!\left[ \frac{1}{\alpha^{2}} - \frac{2\beta}{\alpha \bar{X}_m} + \frac{\beta^{2}}{\bar{X}_m^{2}} \right]. \]
Also,
\[ \frac{\beta^{2}}{\alpha}\, A\,E\!\left[ \frac{N-m}{N^{2}} \right] = \frac{\beta^{2}}{\alpha}\, A\,E\!\left[ \frac{1}{N}\left(1-\frac{m}{N}\right) \right] \le \frac{\beta^{2}}{\alpha}\, A\,E\!\left[ \frac{1}{N} \right] \le \frac{\alpha^{2} w}{B}\, A\,E\!\left[ \frac{\beta^{2}}{\bar{X}_m^{2}} \right]. \]
Thus, using the two inequalities above with the re-expression fact, we have
\[ R_N \le \frac{\alpha w}{B}\, A\,E\!\left[ m - \frac{2m\alpha\beta}{\bar{X}_m} + \frac{m\alpha^{2}\beta^{2}}{\bar{X}_m^{2}} + \frac{\alpha\beta^{2}}{\bar{X}_m^{2}} \right]. \]
Using the fact that \( \sum_{i=1}^{m} X_i \sim \text{Gamma}(m\alpha, \beta) \), it is easily seen that \( 2\beta^{-1}\sum_{i=1}^{m} X_i \) is distributed χ² with 2mα degrees of freedom. It can be verified that
\[ E\!\left[ \Big( 2\beta^{-1}\textstyle\sum_{i=1}^{m} X_i \Big)^{-k} \right] = \frac{\Gamma(m\alpha-k)}{2^{k}\,\Gamma(m\alpha)}, \qquad \text{so that} \qquad E\!\left[ \bar{X}_m^{-k} \right] = \frac{m^{k}\,\Gamma(m\alpha-k)}{\beta^{k}\,\Gamma(m\alpha)}. \]
We obtain the equation:
\[ R_N = A\,E\!\left( \frac{\bar{X}_N}{\alpha} - \beta \right)^{\!2} \le \frac{A\alpha^{2} w}{B}\left( \frac{m}{\alpha} - \frac{2m^{2}\,\Gamma(m\alpha-1)}{\Gamma(m\alpha)} + \frac{(m^{3}\alpha+m^{2})\,\Gamma(m\alpha-2)}{\Gamma(m\alpha)} \right). \]
So, to ensure the expected loss is less than our risk bound w, we set the right-hand side of the inequality equal to w, solve for B accordingly, and obtain equation (2.1.5). ∎
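The two-stage rule of Theorem 2.1 can be sketched in Python. This is a Monte Carlo sketch under the reconstruction of (2.1.4) and (2.1.5) used above, not the dissertation's R code; the parameter values are illustrative, echoing the settings of the simulation study later in the chapter.

```python
import math
import random
from math import exp, lgamma

def bound_coeff(A, alpha, m):
    """Bound coefficient B of (2.1.5), as reconstructed here:
    B = A*a^2 * ( m/a - 2m^2 G(ma-1)/G(ma) + (m^3*a + m^2) G(ma-2)/G(ma) ),
    with G the gamma function and a the known shape.
    (Requires m*alpha > 2 so the inverse second moment exists.)"""
    g1 = exp(lgamma(m * alpha - 1) - lgamma(m * alpha))
    g2 = exp(lgamma(m * alpha - 2) - lgamma(m * alpha))
    return A * alpha**2 * (m / alpha - 2 * m**2 * g1
                           + (m**3 * alpha + m**2) * g2)

def two_stage(A, alpha, beta, m, w, rng):
    """One run of the two-stage procedure of Theorem 2.1: pilot of size m,
    terminal size N from (2.1.4), realized loss A*(Xbar_N/alpha - beta)^2."""
    first = [rng.gammavariate(alpha, beta) for _ in range(m)]
    xbar_m = sum(first) / m
    N = max(m, math.ceil(bound_coeff(A, alpha, m) * xbar_m**2
                         / (alpha**3 * w)))
    total = sum(first) + sum(rng.gammavariate(alpha, beta)
                             for _ in range(N - m))
    return N, A * (total / N / alpha - beta) ** 2

rng = random.Random(4)
A, alpha, beta, m, w = 2.0, 2.0, 5.0, 20, 1.0
runs = [two_stage(A, alpha, beta, m, w, rng) for _ in range(4000)]
risk_hat = sum(loss for _, loss in runs) / len(runs)   # stays below w
```

As a sanity check on the reconstruction, at α = 1 the coefficient reduces numerically to Mukhopadhyay and Pepe's B_MP = 2Am(m+1)/((m−1)(m−2)); the estimated risk lands near w/2 and the average N near 2n*, matching the behavior reported in the simulation section.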
2.2 Improving the Terminal Sample Size
In the prior section, we found results that certainly achieve the goal of having the risk within the risk bound: an alternative to asymptotic sampling that ensures the risk stays within a bound. Remember that is only part of the goal; the goal is to sample the fewest observations that achieve bounded risk. It is therefore important to investigate the relationship between the terminal sample size and the optimal sample size. Exploring the relationship of N and n* is the first step in seeing whether N needs to be reduced. If the ratio E[N/n*] were near 1, no further work would be needed. However, given the fact that B decreases as α increases, we chose to further our research and develop a new bound coefficient as a function of the old B and α.
The risk function of our two-stage sampling procedure was investigated as a function of the scale, in order to approximate where the maximal risk occurs. Clearly, larger β values result in larger risks if N were to remain constant; however, larger β values also tend to result in larger values of N, which reduces the risk. So the maximal risk under this two-stage sampling procedure is not necessarily unbounded. In fact, in the simulations the maximal risks most commonly occurred for β between five and six. Once this was done, B was identified for each α, then empirically reduced so that the risk is just within the bound w. The new B is the ratio of the empirical B to the theoretical B given in (2.1.5). Figure B.1 shows a scatter plot of these ratios against α. For each value of α there is a corresponding ratio; for example, α = 1 corresponds to 0.560. Looking at the scatter plot, there appears to be a decaying exponential relationship between α and the appropriate ratio, leveling off around α = 20.
A regression is performed to exploit this relationship. This is done only to find the coefficient of log(α): for this problem, we are not trying to fit the curve, but to obtain a curve that sits gently above each of the points. In order to do this, the same regression is used but the slope is altered; as stated earlier, 0.565 suffices for all α > 1, and this is used to find the intercept. With this we find:
\[ B_{new} = \left[ -0.031\,\log(\alpha) + 0.597 \right] B, \quad \alpha < 20; \qquad B_{new} = 0.505\,B, \quad \alpha \ge 20. \qquad (2.3.4) \]
This gives smaller values for B; see Table A.4.
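The adjustment (2.3.4) is a one-line computation; the sketch below simply encodes the empirical rule as stated above.

```python
import math

def b_new(B, alpha):
    """Empirically adjusted bound coefficient of (2.3.4): shrink the
    theoretical B by a log-linear factor in alpha below 20, and by the
    constant factor 0.505 at alpha >= 20."""
    if alpha < 20:
        return (-0.031 * math.log(alpha) + 0.597) * B
    return 0.505 * B
```

At α = 1 the shrink factor is 0.597, sitting just above the observed ratio 0.560 as intended, and near α = 20 the log-linear factor is about −0.031 log 20 + 0.597 ≈ 0.504, so the rule is nearly continuous at the cutover.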
2.2.1 Performance Properties of New Estimation Procedure
A genuine two-stage procedure for sampling data to an assigned accuracy, assuming only a Gamma distribution, was found in Theorem 2.1. This section showed that that procedure would ultimately sample nearly twice as many observations as needed. A more practical way to select a smaller number of observations was also considered in this section. Unlike Section 2.1, a mathematically rigorous proof was not provided; however, sufficient analysis was performed to substantiate the belief that the risk will always be within the risk bound w and that the terminal sample size is reduced by nearly half. In fact, if α = 1, w → 0 as m → ∞, and E[N/n*] < ∞, then
\[ \lim E[N_{new}/n^{*}] = 1.13, \]
and if α ≥ 20, then
\[ \lim E[N_{new}/n^{*}] = 1.01. \]
This means that instead of sampling nearly twice as many observations, the improved two-stage sampling procedure samples between approximately 1.01 and 1.13 times the optimal number, depending on the value of α and the initial sample size. This is seen in Table A.4. Not only does this improved result give reliable estimators for β, but the terminal sample size nears the optimal sample size.
2.3 Computer Simulations
In the previous sections, the mathematical theory was provided to ensure the risk stays within the predetermined bound. To verify these results, a simulation study was conducted using R software. In the simulation, differing values for the optimal sample size n* were chosen: 25, 50, 100, 500. We fix β = 5, since the result does not depend on β, and vary α ∈ {0.5, 1, 2, 5, 10}. A is a constant; we choose A = 2, and 10,000 replications were used for each case. The quantity N̄ is an estimate of the expected value of N, and r̄ is an estimate of the risk with the original terminal sample size. This simulation was repeated with the improved terminal sample size N_new. Our desired result is to see r̄ fall beneath w, to see r̄_new fall just below w, and to see N̄_new stay above n*. Also, since this is a generalization of Mukhopadhyay's research, we would like to see the same results with α = 1; as such, our results for α = 1 and β = 5 should resemble Mukhopadhyay and Pepe's (2006) results. Figures B.2, B.3, B.4, and B.5 give visual representations of the estimated risks compared to their risk bounds as the shape varies over 0.5, 1.0, 2.0, 5.0, 10.0 with n* fixed at 25; Figures B.4 and B.5 display the improved results. Notice the estimated risks fall within the risk bound, and with the improved results the estimated risks are closer to the risk bound. Further detail is given in Table A.1 and Table A.2.
From Table A.1 and Table A.2, we note the following:
1) In Table A.1, for the case α = 1 and β = 5, our mean values for N and r are nearly identical to those given by Mukhopadhyay and Pepe, and the B given here is exactly their B. This is to be expected, because our result is a generalization of theirs.
2) In Table A.1, the mean value of N nears twice that of n*. Note N is a function of α and m, so as both variables get larger the mean of N gets closer to 2n*; this can be seen by looking across rows and down columns.
3) In Table A.1, the mean value of r is always nearly half of our predetermined risk w. This follows since the expected risk is inversely proportional to the number of observations drawn.
4) Naturally, as the initial sample size m increases, we obtain more information about the population with which to make our decision, and consequently we obtain better values for r and N.
5) From Tables A.1 and A.2, we observe that at no point in this simulation do the estimates of the risk or the improved risk exceed the risk bound w.
6) In Table A.2, we observe that the newer risks are much closer to the bound, and the newer sample sizes have been reduced on average by 43%.
7) In Table A.2, we observe that the average terminal sample size N_new is still larger than n*. It is my conjecture that N_new cannot be improved further: our average risks are just within the bounds, which is our goal, and reducing the sample size any further might cause the average risk to eclipse the risk bound.
2.4 Shape Unknown and Scale Unknown
In this section, we provide a solution to the question of how many observations should be selected when both parameters are unknown. When both parameters are unknown, finding bounds becomes more difficult. The goal remains the same: to find an appropriate sample size N that will make the associated risk function less than or equal to a predetermined risk w > 0. Recall if X ~ Gamma(α, β) then the mean is αβ and the estimator for the mean is X̄. The goal is to find N such that A E(X̄_N − αβ)² ≤ w.
For the fixed-width problem, the following properties are desired of the procedure:
(1) P(θ ∈ C_X̄) ≥ 1 − a for every d > 0.
(2) N is finite with probability 1 for every d > 0.
(3) N/n* → 1 as d → 0, in probability or a.s.
(4) E(N)/n* → 1 as d → 0.
(5) lim_{d→0} P(θ ∈ C_X̄) = 1 − a.
In the following section, we will show that under certain conditions these properties hold.
Theorem 3.1 For significance level a and predetermined width d, let X₁, ..., X_m i.i.d. Gamma(α, β) initial observations be drawn (m ≥ 3), let N be as defined in (2.1.4), and let g_q be the q-th quantile of the Gamma(nα, 1/(nα)) distribution. Define
\[ M = \min\left\{ n \ge N,\ n \in \mathbb{N} : \sqrt{\frac{Nw}{A}}\;\frac{1}{\alpha}\left[ g_{1-a/2} - g_{a/2} \right] \le d \right\}. \qquad (3.1.6) \]
Then, with
\[ C_{\bar{X}} = \left\{ \beta : \frac{\bar{X}_M}{\alpha} - \sqrt{\frac{Nw}{A}}\left( \frac{g_{1-a/2}}{\alpha} - 1 \right) < \beta < \frac{\bar{X}_M}{\alpha} - \sqrt{\frac{Nw}{A}}\left( \frac{g_{a/2}}{\alpha} - 1 \right) \right\}, \]
(1) P(β ∈ C_X̄) ≥ 1 − a, and
(2) the width of C_X̄ is at most d.
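Since g_q is a Gamma quantile, the stopping rule (3.1.6) can be sketched numerically. The sketch below is one possible reading of the reconstructed rule, with the Gamma(nα, 1/(nα)) quantile replaced by the Wilson-Hilferty approximation rather than an exact quantile routine; all inputs in the example call are hypothetical.

```python
import math
from statistics import NormalDist

def gamma_quantile(q, k):
    """Wilson-Hilferty approximation to the q-quantile of the unit-mean
    Gamma(k, 1/k) distribution; adequate for moderately large k."""
    z = NormalDist().inv_cdf(q)
    c = 1.0 / (9.0 * k)
    return (1.0 - c + z * math.sqrt(c)) ** 3

def terminal_width_size(N, alpha, w, A, d, a=0.05):
    """Smallest n >= N with sqrt(N*w/A) * (g_{1-a/2} - g_{a/2}) / alpha
    <= d, one reading of the reconstructed rule (3.1.6), where g_q is
    taken from the Gamma(n*alpha, 1/(n*alpha)) distribution."""
    n = N
    while True:
        spread = (gamma_quantile(1 - a / 2, n * alpha)
                  - gamma_quantile(a / 2, n * alpha))
        if math.sqrt(N * w / A) * spread / alpha <= d:
            return n
        n += 1

M_stop = terminal_width_size(N=50, alpha=2.0, w=1.0, A=2.0, d=0.5)
```

The quantile spread shrinks roughly like 1/√(nα), so the loop terminates; as the degrees of freedom grow, the Gamma quantiles approach the Normal ones, which is the asymptotic comparison made at the end of this chapter.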
Proof.
Theorem 2.1 ensures that Aβ²/(αN) ≤ w. The optimal sample size for the width problem is
\[ n^{*} = \min\left\{ n > 0 : \beta\,\alpha^{-1}\left( g_{1-a/2}(n) - g_{a/2}(n) \right) \le d \right\}. \]
The ratio of expectations E[M/n*] becomes
\[ E\left[ \frac{ \min\left\{ n > N : \sqrt{Nw/A}\;\alpha^{-1}\left( g_{1-a/2}(n) - g_{a/2}(n) \right) \le d \right\} }{ \min\left\{ n > 0 : \beta\,\alpha^{-1}\left( g_{1-a/2}(n) - g_{a/2}(n) \right) \le d \right\} } \right]. \]
Clearly, when n > N this reduces to
\[ E\left[ \frac{1}{\beta}\sqrt{\frac{Nw}{A}}\, \right]. \]
Thus,
\[ E[M/n^{*}] \le \sqrt{1 + (m\alpha)^{-1}}. \]
It is obvious that as d → 0, M → ∞. Since the terminal sample size increases, the random variable (G − 1) should be examined asymptotically. It was mentioned earlier that G ~ Gamma(Mα, 1/(Mα)), so the mean and variance of G are one and (Mα)⁻¹, respectively. According to the Central Limit Theorem,
\[ \sqrt{M\alpha}\,( G - 1 ) \Rightarrow Z \sim N(0,1). \]
Also, the random variable X̄_M/α ~ Gamma(Mα, β/(Mα)), with standard deviation σ(β̂) equal to β/√(Mα). Let n*₀ be the optimal sample size of the risk estimator. If m → ∞ and w → 0 such that E[N/n*₀] < ∞ and d → 0, then
\[ 1.01 < \lim E\!\left[ \sqrt{\frac{Nw}{A}}\, \right] < 1.13. \]
Under those conditions, the upper bound of the proposed interval estimator,
\[ \frac{\bar{X}_m}{\alpha} - \sqrt{\frac{Nw}{A}}\left( \frac{g_{a/2}}{\alpha} - 1 \right), \]
approximately becomes
\[ \frac{\bar{X}_m}{\alpha} - \sigma(\hat{\beta})\, z_{a/2}. \]
Because the Normal distribution is symmetric, this is the same as
\[ \frac{\bar{X}_m}{\alpha} + \sigma(\hat{\beta})\, z_{1-a/2}. \]
Similarly, the lower bound will be the same as the corresponding Normal lower bound. The Normal interval has all the optimal properties; thus, under those conditions,
asymptotically the proposed procedure will have all of the optimal properties. To summarize, the procedure was developed using exact methodology and will hold for any number of initial observations (m ≥ 3). A small initial sample size may come with a price of sampling more observations than needed. For large m such that m