Rank-Based Regression for Nonlinear and Missing Response Models by Huybrechts Frazier Achard Bindele. A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy. Auburn, Alabama, August 4, 2012. Keywords: asymptotic normality, bounded influence, signed-rank norm, Sobolev space, strong consistency, missing at random. Copyright 2012 by Huybrechts Frazier Achard Bindele. Approved by Asheber Abebe, Chair, Associate Professor of Mathematics & Statistics; Geraldo S. de Souza, Professor of Mathematics & Statistics; Peng Zeng, Associate Professor of Mathematics & Statistics; Mark D. Carpenter, Professor of Mathematics & Statistics.

Abstract

This dissertation is mainly concerned with the rank-based estimation of model parameters in complex regression models: a general nonlinear regression model and a semi-parametric regression model with missing responses. For the estimation of nonlinear regression parameters, we consider weighted generalized signed-rank estimators. The generalization allows us to study rank estimators as well as popular estimators such as the least squares and least absolute deviations estimators. However, the generalization by itself does not give bounded influence estimators; added weights provide estimators with a bounded influence function. We establish conditions needed for the consistency and asymptotic normality of the proposed estimator and discuss how weight functions can be chosen to achieve a bounded influence function. Real-life examples and Monte Carlo simulation experiments demonstrate that the proposed estimator is robust, efficient, and useful for detecting outliers in nonlinear regression. For the estimation of the linear regression parameter of a semi-parametric model with missing responses, we propose imputed rank estimators under simple imputation and inverse probability imputation. It is shown that these rank estimators have favorable asymptotic properties. Moreover, it is demonstrated that the rank estimators perform better than the classical least squares estimator under heavy-tailed error distributions and under contamination, while they are generally comparable to the least squares estimator under normal errors. Furthermore, rank estimators with inverse probability imputation are superior to their least squares counterparts when the proportion of missing data is large. This makes rank estimation extremely appealing for situations where we encounter high rates of missing information.

Acknowledgments

Let me start by thanking God for the strength, the health, and the knowledge given to me throughout all these years and simply throughout my life. Without all these blessings, I could never have accomplished this long journey. This research project would not have been possible without the support of many people. I wish to express my sincere gratitude to my supervisor, Dr. Ash Abebe, who was abundantly helpful and offered invaluable assistance, support, and guidance. Deepest gratitude is also due to the members of the supervisory committee, Drs. Geraldo de Souza, Peng Zeng, and Mark Carpenter, without whose knowledge and assistance this study would not have been successful. This dissertation is dedicated to my parents, Philippe Bindele and Suzanne Mbeke, who have always been supportive and have made many sacrifices for the great success of my life.
Special thanks to my grandmother Therese Mayena and her husband Andre Moumpangou who, even though they did not live long enough to see this achievement, were key to my success by providing all the necessary advice and encouragement throughout my education. Another special thanks to my uncle Cyrille Ngayi Mvouembe, who has always stood up for me in all educational matters. I would also like to express my profound gratitude to the department chairs and graduate chairs of the Department of Mathematics and Statistics, Drs. Michel Smith, Chris Rodger, and Paul Schmidt, for all their support and encouragement during my graduate student life at Auburn University. At the same time, I would like to thank our department staff, Lori Bell, Gwen Kirk, and Carolyn Donegan, for their tremendous help whenever it was needed. I wish to express my gratitude to the following: Prof. Charles Chidume, whose support, advice, and encouragement made my journey to the US possible. All professors who taught me during my course work, for the marvelous job done, particularly Drs. Wanxian Shen, Jersy Szulga, Erkan Nane, Geraldo de Souza, Asheber Abebe, Mark Carpenter, Trevor Park, and Peng Zeng. Also all beloved faculty members such as Drs. Greg Harris, Nedret Billor, Overtoun Jenda, Peter Johnson, Olav Kallenberg, Amnon Meir, Andreas Bezdek, Tin-Yau Tam, Bertram Zinner, Frank Uhlig, and Narendra Govil. My course mates and friends, Gerald Chidume, Julian Allagan, Fidele Ngwane, Nar Rawal, Bertrand Sedar Ngoma, Guy-Vanie Miankokana, Brice Merlin Nguelifack, Sean O'Neil, Frank Sturn, Serge Phanzu, Jebessa Mijena, Melody Donhere, Dawit Tadesse, Omer Tadjion, Al-Hassem Nayam, Daniel Bongue, Guy Richard Bossoto, Justin Mabiala, Simplice Mananga, Brice Malonda, Justin Moutsouka, Seth Kwame Kermaussuor, Kuena, Gouala Medea, Telemina Gaston, Charles Lebon Mberi Kimpolo, Eddy Kwessi, Rolland Kendou, Merlin Ngachi, Gladisse Ngodo, Moses Tam, Murielle Mahoungou, Aude Didas Massika, Roland Loufouma, Darius Kissala, Fortuné Massamba, Amin Bahmanian, and Patrice Bassissa, for the good time and fun we have had. My brothers, sisters, and family members, Durand Olivier Ngoyi, Elvis Darel Ngouala Nzaou, Tancraide Malanda Ngouala, Mabelle Ngouala Mboussi, Aude Durgie Nsimba, Nadine Dihoulou, Emilie Kiloula, Michel Ngoyi, Joseph Nzoussi, Joseph Kimbassa, Antoine Yila, Guyen Kitanda Ndolo, Martin Ndolo, Serge Mawa, Daniel Ndamba, and Marie Laure Gayi, for their support.

Table of Contents

Abstract
Acknowledgments
List of Figures
List of Tables
1 Introduction
1.1 Background
1.2 Contribution
2 On the Consistency of a Class of Nonlinear Regression Estimators
2.1 Introduction
2.2 Definition and Existence
2.3 Consistency
2.4 Breakdown Point
3 Bounded Influence Nonlinear Signed-Rank Regression
3.1 Introduction
3.2 Weighted SR Estimator
3.2.1 Preliminaries
3.2.2 Consistency
3.2.3 Asymptotic Normality
3.2.4 Robustness
3.3 Weight Specification
3.3.1 Plug-in Estimator of the Weight
3.4 Examples
3.4.1 Monte Carlo Simulation
3.4.2 Real Data
3.5 Discussion
4 Rank Regression with Missing Response
4.1 Introduction
4.2 Model Definition
4.3 Estimation
4.4 Assumptions
4.5 Asymptotic Normality
4.6 Estimation of the function g
4.7 Bandwidth Selection
4.8 Simulation
4.9 Results and Discussion
4.10 Empirical log-likelihood approach

List of Figures

3.1 Plot of fitted exponential curves including WSR
3.2 Relative Efficiency
3.3 Residual plot versus x1 and fitted values for Lakes data
4.1 Scenario 1, Case 1: MSE vs Proportion of Contamination and t-df
4.2 Scenario 1, Case 2: MSE vs Proportion of Contamination and t-df
4.3 Scenario 1, Case 3: MSE vs Proportion of Missing Data for SI under contaminated normal error distribution
4.4 Scenario 1, Case 3: MSE vs Proportion of Missing Data for SI under t error distribution
4.5 Scenario 1, Case 3: MSE vs Proportion of Missing Data for inverse probability (Ker) under contaminated normal error distribution
4.6 Scenario 1, Case 3: MSE vs Proportion of Missing Data for inverse probability (Ker) under t error distribution
4.7 Scenario 1, Case 3: MSE vs Proportion of Missing Data for inverse probability (Log) under contaminated normal error distribution
4.8 Scenario 1, Case 3: MSE vs Proportion of Missing Data for inverse probability (Log) under t error distribution
4.9 Scenario 2, Case 1: MSE vs Proportion of Contamination and t-df
4.10 Scenario 2, Case 2: MSE vs Proportion of Contamination and t-df
4.11 Scenario 2, Case 3: MSE vs Proportion of Missing Data for SI under contaminated normal error distribution
4.12 Scenario 2, Case 3: MSE vs Proportion of Missing Data for SI under t error distribution
4.13 Scenario 2, Case 3: MSE vs Proportion of Missing Data for inverse probability (Ker) under contaminated normal error distribution
4.14 Scenario 2, Case 3: MSE vs Proportion of Missing Data for inverse probability (Ker) under t error distribution
4.15 Scenario 2, Case 3: MSE vs Proportion of Missing Data for inverse probability (Log) under contaminated normal error distribution
4.16 Scenario 2, Case 3: MSE vs Proportion of Missing Data for inverse probability (Log) under t error distribution
4.17 Scenario 2, Case 4: MSE vs Proportion of Contamination and t-df

List of Tables

3.1 Average estimates (MSEs) of $\theta_0 = 1$ in 1000 simulated samples
3.2 Parameter estimates for Lakes Data

Chapter 1
Introduction

1.1 Background

The historical approach to fitting linear and nonlinear models of the form
$$y_i = f(x_i; \theta_0) + \varepsilon_i, \qquad i = 1, \ldots, n,$$
for some generic function $f$ (linear or nonlinear), proceeds by finding the coefficient estimate $\hat\theta$ of $\theta_0$ that minimizes the sum of squared errors $\sum_{i=1}^n \{y_i - f(x_i; \theta)\}^2$. Such estimators, known as least squares estimators, are computationally simple and possess general optimality properties. However, the optimality can be lost due to the presence of even a single extreme outlier. This problem is already seen with the sample mean, $\bar{y}$, which is itself the least squares solution to the model $y_i = \theta_0 + \varepsilon_i$. To overcome this problem, we briefly survey a few approaches that have been taken to develop estimators of the coefficients that are not as easily affected as the least squares estimators. Two methods have been investigated: one is a generalization of least squares estimation called M-estimation, and the other is the class of rank-based estimation methods referred to as rank-based regression methods. For each of these two approaches, a first method was developed to downweight outlying data points but was later shown to be susceptible to high leverage points (outliers in the $x$ space) in regression problems, and newer methods have emerged that address both outlier and leverage problems. One approach that has been used to lessen the impact of outliers in linear and nonlinear models is least absolute deviation, also known as $L_1$ regression; that is, finding coefficient estimates $\hat\theta$ that minimize $\sum_{i=1}^n |y_i - f(x_i; \theta)|$. A further generalization of this was made by Huber in the 1960s. He obtained the so-called M-estimators by minimizing $\sum_{i=1}^n \rho\big((y_i - f(x_i; \theta))/\hat\sigma_i\big)$, where $\rho(\cdot)$ is a symmetric function and $\hat\sigma_i$ is an estimate of the standard deviation of the errors $\varepsilon_i$.
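To make the M-estimation recipe concrete, here is a minimal R sketch (R is the software environment used for the analyses reported later in this dissertation). It fits a hypothetical one-parameter exponential model by minimizing a Huber-type objective; the function names, the tuning constant 1.345, and the MAD-based scale estimate are illustrative assumptions, not the specific scheme studied in this work.

```r
# Minimal sketch: Huber-type M-estimation for a nonlinear model (illustrative only)
huber_rho <- function(u, k = 1.345) ifelse(abs(u) <= k, u^2 / 2, k * abs(u) - k^2 / 2)

m_objective <- function(theta, y, x) {
  r <- y - exp(theta * x)   # residuals from the hypothetical model f(x; theta) = exp(theta * x)
  s <- mad(r)               # robust estimate of the error scale
  sum(huber_rho(r / s))
}

set.seed(1)
x <- rnorm(25); y <- exp(1 * x) + rnorm(25, sd = 0.1)
fit <- nlminb(start = 0.5, objective = m_objective, y = y, x = x)
fit$par                     # M-estimate of theta
```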
It was shown that these M-estimators have the advantage of downweighting outliers while retaining efficiency relative to least squares estimators. However, the original M-estimators can be affected by leverage points (outliers in the $x$ space) in regression problems. A type of estimator developed to protect against outliers and leverage points is the least trimmed squares estimator, which minimizes $\sum_{i=1}^{k}\{y_i - f(x_i;\theta)\}^2_{(i)}$, where $k \le n$ and the squared residuals are ordered from smallest to largest. This estimator ignores the largest $n - k$ residuals. However, the fact that it does not use the entire data set results in a loss of efficiency. More recently, Yohai and others have developed extensions of these methods, called MM-estimators, that protect against both outliers and leverage points while retaining efficiency. At the time when Huber and others were developing the theory of M-estimators, methods based on ranking were known as R-estimation and were used for simple problems such as estimating location and scale or making location comparisons in two-sample problems. They were not considered to be generalizable to linear and nonlinear models. Later, Jaeckel, Hettmansperger, McKean, and others showed that rank-based estimators could also be cast as estimators obtained by minimizing $\sum_{i=1}^n a\big(R(z_i(\theta))\big)\,z_i(\theta)$, where $R(z_i(\theta))$ is the rank of $z_i(\theta) = y_i - f(x_i;\theta)$ and $a(\cdot)$ is some score function. The rank-based estimators, sometimes called Wilcoxon estimators, can be used for any general linear model and have been shown to have high efficiency compared to least squares estimators. However, these rank-based estimators can also be affected by leverage points in regression problems. A weighted Wilcoxon (WW) method was later developed to handle leverage points and was shown to be highly efficient.

1.2 Contribution

The first part of this dissertation is concerned with the study of conditions sufficient for strong consistency of a class of estimators of the parameters of nonlinear regression models. The study considers continuous functions depending on a vector of parameters and a set of random regressors. The estimators chosen are minimizers of a generalized form of the signed-rank norm; that is, they minimize $\frac{1}{n}\sum_{i=1}^n a_n(i)\,\rho\big(|z(\theta)|_{(i)}\big)$, where $z_i(\theta) = y_i - f(x_i;\theta)$ and $|z(\theta)|_{(i)}$ is the $i$th ordered value among $|z_1(\theta)|, \ldots, |z_n(\theta)|$. The function $\rho : \mathbb{R}^+ \to \mathbb{R}^+$ is continuous and strictly increasing. The numbers $a_n(i)$ are scores generated as $a_n(i) = \varphi^+(i/(n+1))$ for some bounded score function $\varphi^+ : (0,1) \to \mathbb{R}^+$ that has at most a finite number of discontinuities. The generalization allows us to make consistency statements about minimizers of a wide variety of norms, including the $L_1$ and $L_2$ norms. By implementing trimming, it is shown that high breakdown estimates can be obtained based on the proposed dispersion function. The second part of this dissertation is motivated by the fact that almost all the methods discussed above fail to guard against outliers and high leverage points simultaneously. The generalization of the signed-rank norm above allows us to include popular estimators such as the least squares and least absolute deviations estimators, but by itself it does not give bounded influence estimators. To address this problem, we consider weighted forms of the generalized signed-rank estimators and measure the robustness of these estimators using their influence functions. Carefully chosen weights result in estimators with a bounded influence function.
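To illustrate the dispersion function just described, the following minimal R sketch (with hypothetical function and variable names) evaluates $\frac{1}{n}\sum_i a_n(i)\,\rho(|z(\theta)|_{(i)})$ with Wilcoxon scores $\varphi^+(u) = u$ and $\rho(t) = t$, and minimizes it with nlminb for a one-parameter exponential model. It only shows how the pieces fit together under these assumed choices and is not the code used for the results reported later.

```r
# Minimal sketch of the generalized signed-rank dispersion (illustrative only)
sr_dispersion <- function(theta, y, x,
                          scores = function(u) u,   # phi+(u) = u (Wilcoxon scores)
                          rho    = function(t) t) { # rho(t) = t
  z   <- y - exp(theta * x)        # residuals from f(x; theta) = exp(theta * x)
  az  <- sort(abs(z))              # |z(theta)|_(1) <= ... <= |z(theta)|_(n)
  n   <- length(z)
  a_n <- scores((1:n) / (n + 1))   # a_n(i) = phi+(i / (n + 1))
  mean(a_n * rho(az))              # (1/n) sum_i a_n(i) rho(|z|_(i))
}

set.seed(2)
x <- rnorm(25); y <- exp(1 * x) + rnorm(25, sd = 0.1)
sr_fit <- nlminb(start = 0.5, objective = sr_dispersion, y = y, x = x)
sr_fit$par                         # generalized signed-rank estimate of theta
```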
We also establish conditions needed for the consistency and asymptotic normality of the proposed estimator and discuss how weight functions can be chosen to achieve a bounded influence function. Real-life examples and Monte Carlo simulation experiments demonstrate the robustness and efficiency of the proposed estimator. Finally, an example is provided showing that the weighted signed-rank estimators can be useful for detecting outliers in nonlinear regression. The last part of this dissertation considers robust estimation for the missing data problem in regression analysis as an application of the weighted signed-rank estimation method introduced above. For this problem, important work on the estimation of the regression parameters has been done in the literature using the least squares method and, eventually, the M-estimation method. As pointed out by many authors, for non-normal error distributions or in the case of uncontrolled designs, the least squares method may not provide suitable estimators. To this end, we consider the linear semi-parametric regression model with responses missing at random, that is, $y_i = x_i'\beta + g(T_i) + \varepsilon_i$, $i = 1, \ldots, n$, where the $y_i$'s may be missing at random while the $x_i$'s and $T_i$'s are fully observed. Based on the rank objective function, we provide estimators of $\beta$ and $g$ simultaneously, using the usual nonparametric kernel estimation. Also, under some suitable conditions, we show that the resulting estimator of $\beta$ is $\sqrt{n}$-consistent and satisfies the normal approximation property. To illustrate this case, a simulation study is conducted; it shows that the rank estimator behaves better than the least squares estimator when dealing with the missing response problem.

Chapter 2
On the Consistency of a Class of Nonlinear Regression Estimators

2.1 Introduction

Over the last twenty-five years considerable work has been done on robust procedures for linear models. Several classes of robust estimates have been proposed for these models. One such class is the generalized signed-rank class of estimates. This class uses an objective function which depends on the choice of a score function, $\varphi^+$. If $\varphi^+$ is monotone then the objective function is a norm, and the geometry of the resulting robust analysis (estimation, testing, and confidence procedures) is similar to the geometry of the traditional least squares (LS) analysis; see McKean and Schrader (1980). Generally this robust analysis is highly efficient relative to the LS analysis; see the monograph by Hettmansperger and McKean (1998) for a discussion. For the simple location model, if Wilcoxon scores, $\varphi^+(u) = u$, are used then this estimate is the famous Hodges-Lehmann estimate, while if sign scores are used, $\varphi^+(u) \equiv 1$, it is the sample median. If the monotonicity of $\varphi^+$ is relaxed then high breakdown estimates can be obtained; see Hössjer (1994). Thus the signed-rank family of robust estimates for the linear model contains estimates that range from highly efficient to high breakdown, and they generalize traditional nonparametric procedures for the simple location problem. Many interesting problems, though, are nonlinear in nature. Traditional procedures based on LS estimation have been used for years. Since these LS procedures for nonlinear models use the Euclidean norm, they are as easily interpreted as their linear model counterparts. The asymptotic theory for nonlinear LS has been developed by Jennrich (1969) and Wu (1981), among others.
In this chapter, we propose a nonlinear analysis based on the signed-rank objective function. The objective function is a norm if $\varphi^+$ is monotone; hence the estimates are easily interpretable. We keep our development quite general, though, so as to include nonlinear estimates based on Hössjer-type objective functions as well. Hence our estimates include the nonlinear extensions of the signed-rank Wilcoxon estimate and the $L_1$ estimate, as well as extensions of high breakdown linear model estimates. Thus we offer a rich family of estimates from which to select for nonlinear models. In Section 2.2 we present our family of estimates for nonlinear models. In Section 2.3, we show that these estimates are strongly consistent under certain assumptions. We discuss these assumptions, contrasting them with assumptions for currently existing estimates. The same section contains a general discussion of interesting special cases such as the $L_1$ and the Wilcoxon. Section 2.4 discusses the conditions needed to achieve a positive breakdown point for our estimator.

2.2 Definition and Existence

Consider the following general regression model
$$y_i = f(x_i; \theta_0) + e_i, \qquad 1 \le i \le n, \qquad (2.1)$$
where $\theta_0 \in \Theta$ is a vector of parameters, $x_i \in \mathcal{X}$ is a vector of independent variables, and $f$ is a real-valued function defined on $\mathcal{X} \times \Theta$. Let $V = \{(y_1, x_1), \ldots, (y_n, x_n)\}$ be the set of sample data points. Note that $V \subset \mathcal{V} \equiv \mathbb{R} \times \mathcal{X}$. We shall assume that $\Theta$ is compact, $\theta_0$ is an interior point of $\Theta$, and $f(x, \theta)$ is a continuous function of $\theta$ for each $x \in \mathcal{X}$ and a measurable function of $x$ for each $\theta \in \Theta$. We define the estimator of $\theta_0$ to be any vector minimizing
$$D_n(V, \theta) = \frac{1}{n}\sum_{i=1}^n a_n(i)\,\rho\big(|z(\theta)|_{(i)}\big), \qquad (2.2)$$
where $z_i(\theta) = y_i - f(x_i; \theta)$ and $|z(\theta)|_{(i)}$ is the $i$th ordered value among $|z_1(\theta)|, \ldots, |z_n(\theta)|$. The function $\rho : \mathbb{R}^+ \to \mathbb{R}^+$ is continuous and strictly increasing. The numbers $a_n(i)$ are scores generated as $a_n(i) = \varphi^+(i/(n+1))$, for some bounded score function $\varphi^+ : (0,1) \to \mathbb{R}^+$ that has at most a finite number of discontinuities. This estimator will be denoted by $\hat\theta_n$. Because $D_n(V, \theta)$ is continuous in $\theta$, Lemma 2 of Jennrich (1969) implies the existence of a minimizer of $D_n(V, \theta)$. We adopt Doob's (1994) convention and denote by $L^p$, $1 \le p \le \infty$, the space of measurable functions $h : (0,1) \to \mathbb{R}$ for which $|h|^p$ is integrable when $1 \le p < \infty$, and the space of essentially bounded measurable functions when $p = \infty$. The $L^p$ norm of $h$ is $\|h\|_p \equiv \{\int |h|^p\}^{1/p}$ for $1 \le p < \infty$ and $\|h\|_\infty \equiv \operatorname{ess\,sup}|h|$ for $p = \infty$. All integrals are with respect to Lebesgue measure on $(0,1)$. The range of integration will be assumed to be $(0,1)$ unless specified otherwise.

2.3 Consistency

Let $(\Omega, \mathcal{F}, P)$ be a probability space. For $i = 1, \ldots, n$, assume that $x_i$ and $e_i = y_i - f(x_i; \theta_0)$ are independent random variables (carried by $(\Omega, \mathcal{F}, P)$) with distributions $H$ and $G$, respectively. We shall write $x$, $e$, and $|z(\theta)|$ for $x_1$, $e_1$, and $|z_1(\theta)|$, respectively. Let $\tilde{G}_\theta$ denote the distribution of $|z(\theta)|$, and assume:

A1: $P\big(f(x, \theta) = f(x, \theta_0)\big) < 1$ for any $\theta \ne \theta_0$;

A2: for $1 \le q \le \infty$, there exists a function $h$ such that $|\rho(\tilde{G}_\theta^{-1}(y))| \le h(y)$ for all $\theta \in \Theta$, with $E[h^q(Y)] < \infty$; and

A3: $G$ has a density $g$ that is symmetric about $0$ and strictly decreasing on $\mathbb{R}^+$.

As usual, we let $\xrightarrow{a.s.}$ denote almost sure convergence, i.e., pointwise convergence everywhere except possibly on an event in $\mathcal{F}$ of probability $0$.

Theorem 1. Under A1-A3, $\hat\theta_n \xrightarrow{a.s.} \theta_0$.

Before giving the proof of this theorem, let us discuss some related lemmas and corollaries that will be used in the proof. Let $\xi_{(1)}, \ldots, \xi_{(n)}$ be order statistics from a sample of $n$ i.i.d. uniform$(0,1)$ random variables.
Let $J_n : (0,1) \to \mathbb{R}$, $n = 1, 2, \ldots$, be Lebesgue measurable functions and let $g : (0,1) \to \mathbb{R}$ be a Borel measurable function. Define $g_n(t) \equiv g(\xi_{([nt]+1)})$. In the defining expression (2.2) for the function $D_n(V,\theta)$, let $\tilde G_\theta$ denote the cdf of $|z(\theta)|$. Then we can express $D_n(V,\theta)$ as
$$D_n(V,\theta) = \frac{1}{n}\sum_{i=1}^n a_n(i)\,(\rho \circ \tilde G_\theta^{-1})(\xi_{(i)}). \qquad (2.3)$$
The following is Corollary 2.1 of van Zwet (1980) in the notation of this paper and is given for completeness.

Lemma 1 (van Zwet). Let $1 \le p \le \infty$, $1/p + 1/q = 1$, and suppose that $J_n \in L^p$ for $n = 1, 2, \ldots$, $g \in L^q$, and there exists a function $J \in L^p$ such that $\lim_{n\to\infty}\int_0^t J_n = \int_0^t J$ for all $t \in (0,1)$. If either (i) $1 < p \le \infty$, or (ii) $p = 1$ and the sequence $\{J_n\}$ is uniformly integrable, then
$$\frac{1}{n}\sum_{i=1}^n J_n\Big(\frac{i}{n+1}\Big)\, g(\xi_{(i)}) \xrightarrow{a.s.} \int J\, g.$$
This condition is satisfied if we have convergence in $L^1$ of $J_n$ [cf. also Doob (1994), Theorem VI.18]. To this end, we will marginally violate assumption (ii) of Lemma 1 and assume that
$$\sup_n \|J_n\|_p \equiv \sup_n \Big\{\frac{1}{n}\sum_{i=1}^n |\varphi^+(i/(n+1))|^p\Big\}^{1/p} < \infty \qquad (2.4)$$
for $1 \le p \le \infty$. Notice also that
$$\frac{1}{n}\sum_{i=1}^{[nt]} \varphi^+(i/(n+1)) \;\le\; \int_0^t J_n \;\le\; \frac{1}{n}\sum_{i=1}^{[nt]+1} \varphi^+(i/(n+1)).$$
Taking the limit as $n \to \infty$, we obtain that $\lim_{n\to\infty}\int_0^t J_n = \int_0^t \varphi^+$ for all $t \in (0,1)$ provided that $\varphi^+$ has at most a finite number of discontinuities. Thus if $\varphi^+$ satisfies (2.4) and $g \in L^q$, all the conditions of Lemma 1 hold. The following corollary is a special case of this result.

Corollary 1. Let $W_1, \ldots, W_n$ be a random sample from a distribution $F$ with support on $\mathbb{R}^+$. Let $\rho : \mathbb{R}^+ \to \mathbb{R}^+$ be a continuous Borel measurable function. Suppose, for $1 \le p, q \le \infty$ with $1/p + 1/q = 1$, $E[\rho(W)]^q < \infty$ and $\|\varphi^+\|_p < \infty$. Then
$$T_n \equiv n^{-1}\sum_{i=1}^n \varphi^+(i/(n+1))\,\rho(W_{(i)}) \xrightarrow{a.s.} \int (\varphi^+)\,\rho(F^{-1}) < \infty.$$
A formal proof of Corollary 1 may be constructed along the lines described in the paragraph preceding it with the function $g$ defined as $\rho \circ F^{-1}$. It will not be included here for the sake of brevity.

Lemma 2. Under assumptions A1-A3,
$$D_n(V,\theta) \xrightarrow{a.s.} \mu(\theta) \quad \text{a.e. } V, \text{ uniformly for all } \theta \in \Theta, \qquad (2.5)$$
where $\mu : \Theta \to \mathbb{R}$ is a function satisfying
$$\inf_{\theta \in \Theta^*} \mu(\theta) > \mu(\theta_0), \qquad (2.6)$$
for any $\Theta^*$ a closed subset of $\Theta$ not containing $\theta_0$.

Proof. The a.s. pointwise convergence of $D_n(V,\theta)$ follows from expression (2.3) and Corollary 1, which also furnishes the function
$$\mu(\theta) \equiv \int (\varphi^+)\,\rho(\tilde G_\theta^{-1}) < \infty. \qquad (2.7)$$
Then, under A1-A3, Theorem 2 of Jennrich (1969) gives (2.5). To establish (2.6) we follow a strategy similar to that of Hössjer (1994). Under A1 and A3, for any $s > 0$ and $\theta \ne \theta_0$,
$$\tilde G_\theta(s) = P\big(|e - \{f(x,\theta) - f(x,\theta_0)\}| \le s\big) = E_x\big\{P_e\big(|e - \{f(x,\theta) - f(x,\theta_0)\}| \le s \mid x\big)\big\} \le \tilde G_{\theta_0}(s),$$
with strict inequality on a set of positive measure, so that $\mu(\theta) > \mu(\theta_0)$ whenever $\theta \ne \theta_0$. Thus for any $\theta \in \Theta^*$, there is a $\gamma \in \mathbb{R}$ such that $\mu(\theta) > \gamma > \mu(\theta_0)$. It then follows from the compactness of $\Theta^*$ that $\inf_{\theta\in\Theta^*}\mu(\theta) > \mu(\theta_0)$.

Lemma 3. Let $\{h_n\}$ be a sequence of continuous functions defined on a compact set $\Theta \subset \mathbb{R}^p$ that converges uniformly to $h$. Then $\{h_n\}$ is equicontinuous on $\Theta$.

Proof. Since $\{h_n\}$ converges uniformly to $h$, for any $\varepsilon > 0$ there exists an $N \in \mathbb{N}$ such that $|h_n(\theta) - h(\theta)| < \varepsilon/3$ for all $n \ge N$. The function $h$, being continuous on a compact set, is uniformly continuous. Thus there exists some $\delta > 0$ such that for all $\theta, \theta' \in \Theta$ with $\|\theta - \theta'\| < \delta$, we have $|h(\theta) - h(\theta')| < \varepsilon/3$. Then for all $n \ge N$ and all $\theta, \theta' \in \Theta$ with $\|\theta - \theta'\| < \delta$, we have
$$|h_n(\theta) - h_n(\theta')| \le |h_n(\theta) - h(\theta)| + |h_n(\theta') - h(\theta')| + |h(\theta) - h(\theta')| < \varepsilon.$$
Also, by the uniform continuity of each $h_n$, for any fixed $n \in \{1, \ldots, N-1\}$ there exists a $\delta_n > 0$ such that for all $\theta, \theta' \in \Theta$ with $\|\theta - \theta'\| < \delta_n$, we have $|h_n(\theta) - h_n(\theta')| < \varepsilon$. Now set $\delta_0 = \min\{\delta_1, \ldots, \delta_{N-1}\}$. Then for all $n \in \{1, \ldots, N-1\}$ and all $\theta, \theta' \in \Theta$ with $\|\theta - \theta'\| < \delta_0$, we have $|h_n(\theta) - h_n(\theta')| < \varepsilon$. Therefore, setting $\delta^* = \min\{\delta, \delta_0\}$, for all $n \in \mathbb{N}$ and all $\theta, \theta' \in \Theta$ with $\|\theta - \theta'\| < \delta^*$, we have $|h_n(\theta) - h_n(\theta')| < \varepsilon$.

Proof of Theorem 1. By Lemma 1 of Wu (1981), to establish the consistency of $\hat\theta_n$, it is sufficient to show that
$$\liminf_{n\to\infty} \inf_{\theta\in\Theta^*} \big[D_n(V,\theta) - D_n(V,\theta_0)\big] > 0 \quad \text{a.s.} \qquad (2.8)$$
for any $\Theta^*$ a closed subset of $\Theta$ not containing $\theta_0$. But
$$\liminf_{n\to\infty} \inf_{\theta\in\Theta^*} \big[D_n(V,\theta) - D_n(V,\theta_0)\big] \ge \liminf_{n\to\infty} \inf_{\theta\in\Theta^*} A_n(V,\theta) + \inf_{\theta\in\Theta^*} B(\theta,\theta_0) + \liminf_{n\to\infty} C_n(V,\theta_0), \qquad (2.9)$$
where $A_n(V,\theta) = D_n(V,\theta) - \mu(\theta)$, $B(\theta,\theta_0) = \mu(\theta) - \mu(\theta_0)$, and $C_n(V,\theta_0) = \mu(\theta_0) - D_n(V,\theta_0)$. As a result of Corollary 1, $\liminf_{n\to\infty} C_n(V,\theta_0) = 0$ a.s. Due to Lemma 2 we have $\inf_{\theta\in\Theta^*} B(\theta,\theta_0) > 0$. For the statement given in (2.8) to hold, it suffices to show that $\liminf_{n\to\infty}\inf_{\theta\in\Theta^*} A_n(V,\theta) = 0$ a.s. Again by Lemma 2, $A_n(V,\theta) \xrightarrow{a.s.} 0$ uniformly for all $\theta \in \Theta^*$. Also $A_n(V,\theta)$, being continuous on a compact set $\Theta^*$, is uniformly continuous on $\Theta^*$. Then $A_n(V,\theta)$ is equicontinuous on $\Theta^*$ a.e. $V$ by Lemma 3.
Thus for every $\varepsilon > 0$ there exists a $\delta > 0$ such that for all $\theta, \theta' \in \Theta^*$, if $\|\theta - \theta'\| < \delta$ then
$$|A_n(V,\theta) - A_n(V,\theta')| < \varepsilon, \quad \text{a.e. } V,\ \forall n \in \mathbb{N}. \qquad (2.10)$$
Let $D_{\theta'} = \{\theta : \|\theta - \theta'\| < \delta\}$ for $\theta' \in \Theta^*$. Then $D_{\theta'}$, $\theta' \in \Theta^*$, forms an open covering of $\Theta^*$. But $\Theta^*$ is compact, hence there is a finite subcovering $D_{\theta'_j}$, $j = 1, \ldots, m$, which covers $\Theta^*$. Let $\theta$ be an arbitrary point in $\Theta^*$. Then $\theta \in D_{\theta'_j}$ for some $j = 1, \ldots, m$; hence $\|\theta - \theta'_j\| < \delta$. Thus by (2.10),
$$|A_n(V,\theta) - A_n(V,\theta'_j)| < \varepsilon, \quad \text{a.e. } V,\ \forall n \in \mathbb{N}.$$
That is, $A_n(V,\theta'_j) - \varepsilon < A_n(V,\theta) < A_n(V,\theta'_j) + \varepsilon$. Since $A_n(V,\theta'_j) \xrightarrow{a.s.} 0$ for each $j = 1, \ldots, m$, it follows that $\liminf_{n\to\infty}\inf_{\theta\in\Theta^*} A_n(V,\theta) = 0$ a.s., which completes the proof.

Least Absolute Deviations. For the $L_1$ special case, obtained by taking $\varphi^+(u) \equiv 1$ and $\rho(t) = t$, Oberhofer (1982) establishes weak consistency under the following conditions:

O1: for every closed subset $\Theta^*$ of $\Theta$ not containing $\theta_0$, there exist $\eta > 0$ and $n_0$ such that for all $n \ge n_0$,
$$\inf_{\theta\in\Theta^*} \frac{1}{n}\sum_{i=1}^n |l_i(\theta)|\,\min\big\{G(|l_i(\theta)|/2) - 1/2,\; 1/2 - G(-|l_i(\theta)|/2)\big\} \ge \eta$$
for all such $\Theta^*$, where $l_i(\theta) = f(x_i;\theta) - f(x_i;\theta_0)$;

O2: $E(|e|) < \infty$ and $E\big([f(x,\theta) - f(x,\theta_0)]^2\big) < \infty$ for all $\theta \in \Theta$; and

O3: $G(0) = 1/2$.

Here O3 is weaker than C3. However, O2 is stronger than C2. Following contrapositive arguments similar to those used in the least squares case, we can easily show that O1 is also stronger than C1 (see also Oberhofer (1982), p. 318). For a detailed discussion of this and of sufficient conditions for O1, the reader is referred to Oberhofer (1982).

Signed-Rank Wilcoxon. Set $\varphi^+(u) = u$ for $0 < u < 1$. A bounded, truncated version of the scores, based on the standard normal quantile transform $\Phi^{-1}\big(\frac{u+1}{2}\big)$ and a truncation constant $k$, may also be used; usually we take $k = 4$.

2.4 Breakdown Point

One of the virtues of the estimators discussed in this chapter is that they allow for trimming. This in turn provides us with estimates that are robust when one or more of the model assumptions are violated. In this section we consider the breakdown point of our estimator as a measure of its robustness. Assuming that the true value of the parameter to be estimated is in the interior of the parameter space $\Theta$, breakdown represents a severe form of inconsistency in that the estimator converges to a point on the boundary of $\Theta$ instead of $\theta_0$. Recall that $V = \{(x_1,y_1), \ldots, (x_n,y_n)\} \subset \mathcal{V}$ denotes the sample data points. Let $\mathcal{V}_m$ be the set of all data sets obtained by replacing any $m$ points in $V$ by arbitrary points. The finite sample breakdown point of an estimator $\hat\theta$ is defined as [see Donoho and Huber (1983)]
$$\varepsilon^*_n(\hat\theta, V) = \min_{1\le m\le n}\Big\{\frac{m}{n} : \sup_{Z\in\mathcal{V}_m}\big|\hat\theta(Z) - \hat\theta(V)\big| = \infty\Big\}, \qquad (2.11)$$
where $\hat\theta(V)$ is the estimate obtained based on the sample $V$. In nonlinear regression, however, this definition of the breakdown point fails since $\varepsilon^*$ is not invariant to nonlinear reparameterizations. For a discussion of this see Stromberg and Ruppert (1992). We will adopt the definition of breakdown point for nonlinear models given by Stromberg and Ruppert (1992). The definition proceeds by defining finite sample upper and lower breakdown points, $\varepsilon^+$ and $\varepsilon^-$, which depend on the regression model $f$. For any $x_0 \in \mathcal{X}$, the upper and lower breakdown points are defined as
$$\varepsilon^+(f,\hat\theta,V,x_0) = \begin{cases} \min_{0\le m\le n}\Big\{\frac{m}{n} : \sup_{Z\in\mathcal{V}_m} f\big(x_0;\hat\theta(Z)\big) = \sup_{\theta} f(x_0;\theta)\Big\}, & \text{if } \sup_\theta f(x_0;\theta) > f(x_0;\hat\theta),\\ 1, & \text{otherwise}, \end{cases} \qquad (2.12)$$
and
$$\varepsilon^-(f,\hat\theta,V,x_0) = \begin{cases} \min_{0\le m\le n}\Big\{\frac{m}{n} : \inf_{Z\in\mathcal{V}_m} f\big(x_0;\hat\theta(Z)\big) = \inf_{\theta} f(x_0;\theta)\Big\}, & \text{if } \inf_\theta f(x_0;\theta) < f(x_0;\hat\theta),\\ 1, & \text{otherwise.} \end{cases} \qquad (2.13)$$
To achieve a high breakdown point we allow trimming through the scores, taking $k = \max\{i : a_n(i) > 0\}$ with $k \ge [n/2] + 1$. Here $[b]$ stands for the greatest integer less than or equal to $b$. This forces at least the first half of the ordered absolute residuals to contribute to the dispersion function. In light of this, the dispersion function may be written as
$$D_n(V,\theta) = \frac{1}{n}\sum_{i=1}^{k} a_n(i)\,\rho\big(|z(\theta)|_{(i)}\big).$$
The following theorem is a version of Theorem 3 of Stromberg and Ruppert (1992). We impose the same conditions but give the result in terms of $k$. The results given are for upper breakdown; analogues for lower breakdown are straightforward. The proof is obtained by replacing $\operatorname{med}_{1\le i\le n}$ with $n^{-1}\sum_{i=1}^k$ and $m$ with $n - k$ in Stromberg and Ruppert's (1992) proof of Theorem 3. In the following, $\#(A)$ denotes the cardinality of the set $A$.

Theorem 2.
Assume for some fixed $x_0$ there exists $\Lambda_k \subset \{i : 1 \le i \le n\}$ with $\#(\Lambda_k) = 2n - [n/2] - k$ such that
$$\lim_{M\uparrow\infty} \inf_{\{\theta : f(x_0;\theta) > M\}} \Big\{\inf_{i\in\Lambda_k} f(x_i;\theta)\Big\} = \sup_\theta f(x_0;\theta).$$
Then
$$\varepsilon^+(f,\hat\theta,V,x_0) \ge \frac{n - k + 1}{n}.$$
Theorem 2 establishes that even when the regression function $f$ lies on the boundary for a portion of the data, the bias of the estimator of $\theta_0$ remains within reasonable bounds if trimming is implemented. The following corollary gives the asymptotic (as $n \to \infty$) breakdown point of $\hat\theta_n$.

Corollary 5. Let $\delta^* = \sup\{u : \varphi^+(u) > 0\}$. The asymptotic breakdown point of $\hat\theta_n$ is at least $1 - \delta^*$.

This is reminiscent of the breakdown point of a linear function of order statistics, which is equal to the smaller of the two fractions of mass at either end of the distribution that receive weights equal to zero (Hampel, 1971). The result of Corollary 5 was also given by Hampel (1971) for one-sample location estimators based on linear functions of order statistics (see Section 7(i) of Hampel (1971)). Consider the class of models of the form $f(x;\theta) = g(\theta_0 + \theta_1 x)$, where $(\theta_0, \theta_1) \in \mathbb{R}^2$ and $g(t)$ is monotone increasing in $t$. This class of models is considered by Stromberg and Ruppert (1992) and contains popular models such as the logistic regression model $g(\theta_0 + \theta_1 x) = \{1 + \exp(-(\theta_0 + \theta_1 x))\}^{-1}$. A breakdown point of $1 - \delta^*$ can be achieved if $\hat\theta_n$ is obtained via a minimization of (2.2) with $a_n(i) = \varphi^+(i/(n+1))$ such that $\delta^* = \sup\{u : \varphi^+(u) > 0\}$.

Remark 5. A definition of breakdown based on "badness measures," which includes the definition given by Stromberg and Ruppert (1992), was given by Sakata and White (1995). Under our assumptions this definition reduces to the one used here, as shown in Theorem 2.3 of Sakata and White (1995).

Chapter 3
Bounded Influence Nonlinear Signed-Rank Regression

3.1 Introduction

As in the previous chapter, let us consider the following nonlinear regression model
$$y_i = f(x_i;\theta_0) + e_i, \qquad 1 \le i \le n, \qquad (3.1)$$
where $\theta_0 \in \Theta$ is a vector of parameters, $x_i$ is a vector of independent variables in a vector space $\mathcal{X}$, and $f$ is a real-valued function defined on $\mathcal{X} \times \Theta$. Let $V = \{(y_1,x_1), \ldots, (y_n,x_n)\}$ be the set of sample data points. Note that $V \subset \mathcal{V} \equiv \mathbb{R} \times \mathcal{X}$. We shall assume that $\Theta$ is compact, $\theta_0$ is an interior point of $\Theta$, and $f(x,\theta)$ is a twice continuously differentiable function of $\theta$ for each $x \in \mathcal{X}$ and a measurable function of $x$ for each $\theta \in \Theta$. The errors $e_i$ are assumed to be iid with a distribution function $G$. The asymptotic normality of the least squares (LS) estimator of $\theta_0$ has been discussed by Jennrich (1969), Wu (1981), and Wang (1996), among others. The asymptotic normality of the least absolute deviations (LAD) estimator of $\theta_0$ is discussed in Wang (1995). However, as pointed out in Haupt and Oberhofer (2009), the treatments of Wang (1995) and Wang (1996) were missing some necessary global conditions. The estimator that will be introduced in this chapter is based on a generalized form of the signed-rank objective function. It provides a unified treatment of a class of estimators including those considered in Wang (1995) and Wang (1996). Moreover, we show how weight functions can be incorporated to obtain estimators with a bounded influence function (Hampel, 1974). Simply stated, the influence function represents the amount of change in the estimator caused by infinitesimal contamination in the data. Thus it is a measure of the sensitivity of an estimator to outliers, and it is desirable that this function be bounded. Rank-based estimators of linear models (where $f(x;\theta_0) = x'\theta_0$ in (3.1)) have been studied extensively.
Jaeckel (1972) gave a general class of rank estimators for linear regression parameters that are efficient and robust to outliers in the response space. These include the Wilcoxon estimator, which is equal to the median of pairwise slopes $(Y_j - Y_i)/(x_j - x_i)$ in the case of simple linear regression. These estimators, however, were found to be sensitive to outliers in the $x$ direction (Hettmansperger et al., 2000; Hettmansperger and McKean, 1998), thus having an unbounded influence function. Sievers (1983) introduced weighted Wilcoxon estimators that were later shown to possess a bounded influence function by Naranjo and Hettmansperger (1994). Chang et al. (1999) provided one-step estimators with a high breakdown point based on the weighted Wilcoxon pseudonorm, where the weights depend on a robust and consistent estimator of $\theta_0$. The signed-rank (SR) estimator of the slope parameter in the linear model is also efficient and robust to outliers in the $y$ direction but sensitive to outliers in the $x$ direction (Hettmansperger et al., 2000). Like the Wilcoxon estimator, the SR estimator is suitable when dealing with datasets from studies with controlled designs. However, it may be adversely affected when exploring datasets from uncontrolled studies. To address the lack of robustness in the $x$ direction, Tableman (1990) provided a one-step signed-rank estimator for the linear model that has a bounded influence function. The results of Tableman (1990) were motivated by the work of Krasker and Welsch (1982), who gave a class of M-estimators with bounded influence function for linear regression estimation. A framework similar to Tableman (1990) was investigated by Wiens and Zhou (1994), who provided bounded-influence rank estimators in the linear model using a general form of the SR objective function. They also show how the efficiency can be optimized by appropriate choices of scores and weights under a boundedness constraint on the influence function. For the general nonlinear model given in (3.1), Abebe and McKean (2007) studied the asymptotic properties of the Wilcoxon estimator of $\theta_0$. Just as in linear models, this estimator was shown to be efficient but sensitive to local changes in the direction of $x$. Jurečková (2008) also studied the asymptotic properties of the rank estimator of $\theta_0$ in (3.1). Her approach takes advantage of the asymptotic equivalence of regression quantiles and regression rank scores to provide rank scores based on the regression function. The approach results in a restricted set of scores. Also, the resulting estimator does not possess a bounded influence function. In this chapter, we propose a class of rank-based estimators of $\theta_0$ in (3.1) based on the minimization of a weighted signed-rank objective function. In contrast with the approaches of Abebe and McKean (2007) and Jurečková (2008), this approach allows for a set of scores generated by any nondecreasing bounded score function that has at most a finite number of discontinuities. Also, by utilizing the theory of Sobolev spaces, this approach removes certain restrictive assumptions, such as compactness of $\mathcal{X}$, Lipschitz continuity of the regression function, and boundedness of the first derivative of the density of the error distribution, that were needed in the work of Jurečková (2008). Our objective function is very general. For instance, the LS objective function is a special case of our objective function. However, the objective function of Jurečková (2008) does not include the LS objective function.
We also show how Krasker-Welsch type weights (Krasker and Welsch, 1982) can be defined based on the regression function $f$ to yield a bounded influence function. Moreover, simulation studies show that the proposed weighted estimators are also efficient, attaining a relative efficiency of 0.955 versus least squares when $G$ is Gaussian. Other robust approaches to nonlinear regression include Stromberg (1993), who provided computational algorithms for computing high breakdown nonlinear regression parameters using the least median of squares (Rousseeuw, 1984) and MM (Yohai, 1987) estimators. Stromberg (1995) establishes the consistency of the least trimmed squares (LTS) estimator for the nonlinear model in (3.1). The LTS was shown to have a high breakdown point by Stromberg and Ruppert (1992). For linear models, the estimator proposed in this chapter can be regarded as a generalization of the objective function of Tableman (1990) to include other norms such as weighted LAD and LS. Moreover, since we do not restrict ourselves to the linear model, not only is it an extension of signed-rank estimators for the linear model to the nonlinear regression case, but it is also a generalization of LAD and LS type estimators for the nonlinear regression model. The remainder of the chapter is organized as follows. Our proposed estimator is given in Section 3.2. Section 3.2 also contains asymptotic and robustness results concerning the proposed weighted estimator. Section 3.3 gives the results using a plug-in estimator of the weights based on a consistent estimator of the regression parameter. Real data and simulation examples are given in Section 3.4. Section 3.5 provides a discussion. Proofs and technical results are given in the appendix.

3.2 Weighted SR Estimator

Consider the signed-rank (SR) estimator, $\hat\theta_S$, of $\theta_0$ in equation (3.1) that minimizes $T_n^+(\theta) = \sum_{i=1}^n R_i |z_i(\theta)|$, where $z_i(\theta) = y_i - f(x_i;\theta)$ and $R_i = \#\{j : |z_j(\theta)| \le |z_i(\theta)|\}$ is the rank of $|z_i(\theta)|$, $i = 1, \ldots, n$. The least squares (LS) and least absolute deviation (LAD) estimators of $\theta_0$ minimize $\sum_{i=1}^n z_i^2(\theta)$ and $\sum_{i=1}^n |z_i(\theta)|$, respectively. It is well known that the LS estimator is sensitive to outliers in both the $x$ and $y$ directions, while the SR and LAD estimators are sensitive to outliers in the $x$ direction. There is clearly a need for a method that is not sensitive to outliers in either the $x$ or the $y$ direction. We obtain this by considering a weighted form of the SR estimator. We define the weighted SR (WSR) estimator $\hat\theta_n$ of $\theta_0$ to be any vector minimizing
$$D_n(V,w,\theta) = \frac{1}{n}\sum_{i=1}^n w(x_i,\theta_0)\,a_n(i)\,\rho\big(|z(\theta)|_{(i)}\big), \qquad (3.2)$$
where $z_i(\theta) = y_i - f(x_i;\theta)$ and $|z(\theta)|_{(i)}$ is the $i$th ordered value among $|z_1(\theta)|, \ldots, |z_n(\theta)|$. The function $\rho : \mathbb{R}^+ \to \mathbb{R}^+$ is continuous and strictly increasing. The numbers $a_n(i)$ are scores generated as $a_n(i) = \varphi^+(i/(n+1))$, for some bounded and non-decreasing score function $\varphi^+ : (0,1) \to \mathbb{R}^+$ that has at most a finite number of discontinuities. The function $w : \mathcal{X} \to \mathbb{R}^+$ is a continuous weight function. Because $D_n(V,w,\theta)$ is continuous in $\theta$, Lemma 2 of Jennrich (1969) implies the existence of a minimizer of $D_n(V,w,\theta)$. It is clear that weighted LS and LAD are special cases of WSR. Weighted LS is obtained by taking $\varphi^+ \equiv 1$ and $\rho(t) = t^2$, $t \ge 0$, while weighted LAD is obtained by taking $\varphi^+ \equiv 1$ and $\rho(t) = t$. In our analyses, however, LS and LAD refer to the unweighted versions obtained by taking $w \equiv 1$. In the following, we will establish the asymptotic properties of $\hat\theta_n$ and discuss how weights can be used to obtain a bounded influence function.
As given in (3.2), the weights depend on the unknown true parameter $\theta_0$. This will make our derivations cleaner. However, to be of practical use, the weights have to be estimated. In Section 3.3, we discuss a plug-in estimator of the weights based on a consistent estimator of $\theta_0$ and show that estimators based on these estimated weights have the same asymptotic properties as their counterparts based on the "true" weights.

3.2.1 Preliminaries

The following definitions and notation will be used throughout this chapter. Let $\Omega$ be a domain. We denote by $L^p(\Omega; P)$, $1 \le p \le \infty$, the space of $P$-measurable functions on $\Omega$ for which $\int_\Omega |h|^p\,dP < \infty$, with the usual modification for $p = \infty$. $C^\infty(\Omega)$ is the space of smooth (infinitely differentiable) functions defined on $\Omega$, $\mathcal{D}(\Omega)$ is the space of smooth functions with compact support in $\Omega$, and $L^1_{loc}(\Omega)$ is the space of locally integrable functions on $\Omega$, that is, functions that are integrable on any compact subset of $\Omega$. Let $\alpha = (\alpha_1, \ldots, \alpha_n) \in \mathbb{N}_0^n$, $\mathbb{N}_0 = \mathbb{N} \cup \{0\}$, be a multi-index. The differential operator is defined as
$$D^\alpha = \frac{\partial^{|\alpha|}}{\partial\theta_1^{\alpha_1}\cdots\partial\theta_n^{\alpha_n}},$$
where $|\alpha| = \sum_{i=1}^n \alpha_i$ and $\theta = (\theta_1, \ldots, \theta_n)$. Let $\psi \in L^1_{loc}(\Omega)$. Given $\alpha \in \mathbb{N}_0^n$, a function $v \in L^1_{loc}(\Omega)$ is called the $\alpha$th weak derivative of $\psi$ if for all $\phi \in \mathcal{D}(\Omega)$
$$\int_\Omega \psi\, D^\alpha\phi\, du = (-1)^{|\alpha|}\int_\Omega v\,\phi\, du,$$
and we put $v = D^\alpha\psi$. As an example, consider $\psi(u) = |u|$. Clearly, $\psi$ is not differentiable in the usual sense at $0$. But $\psi \in L^1_{loc}(\mathbb{R})$, is weakly differentiable, and $\psi'(u) = \operatorname{sgn}(u)$. Let $m \in \mathbb{N}_0$ and $1 \le p \le \infty$. The Sobolev space $W^{m,p}(\Omega)$ is defined as
$$W^{m,p}(\Omega) = \{\psi \in L^p(\Omega) : D^\alpha\psi \in L^p(\Omega) \text{ with } |\alpha| \le m\}.$$
Given a function $K \in L^1$ such that $\int_{\mathbb{R}^n} K(x)\,dx = 1$, let $K_\epsilon(x) = \epsilon^{-n}K(x/\epsilon)$. The family of functions $\{K_\epsilon, \epsilon > 0\}$ is called a mollifier with kernel $K$, and $K_\epsilon$ is known as the Friedrichs mollifier. Some important facts related to Sobolev spaces that will be useful in our discussion are listed below without proofs. A detailed discussion of these can be found in Brezis (1983) and Adams (1975).

(S1) $K_\epsilon \in C^\infty(\mathbb{R}^n)$, $\operatorname{supp}(K_\epsilon) = \{x \in \mathbb{R}^n : \|x\| \le \epsilon\}$, $K_\epsilon \ge 0$, and $\int_{\mathbb{R}^n} K_\epsilon(x)\,dx = 1$. Here $\operatorname{supp}(K_\epsilon)$ denotes the support of $K_\epsilon$.

(S2) (Regularization Theorem) Let $K_\epsilon$ be a Friedrichs mollifier. If $\psi \in L^1_{loc}(\mathbb{R}^n)$, then the convolution product $\psi * K_\epsilon(x) = \int_{\mathbb{R}^n}\psi(x-y)K_\epsilon(y)\,dy$ exists for all $x \in \mathbb{R}^n$. Moreover, $\psi * K_\epsilon \in C^\infty(\mathbb{R}^n)$, $\operatorname{supp}(\psi * K_\epsilon) \subset \operatorname{supp}(\psi) + B'(0,\epsilon)$ where $B'(0,\epsilon) = \{x \in \mathbb{R}^n : \|x\| \le \epsilon\}$, $D^\alpha(\psi * K_\epsilon) = \psi * D^\alpha K_\epsilon$, and $\operatorname{supp}(D^\alpha K_\epsilon) \subset \operatorname{supp}(K_\epsilon)$. Also, if $\mathcal{M}$ is a compact set of points of continuity of $\psi$, then $\psi * K_\epsilon \to \psi$ uniformly on $\mathcal{M}$ as $\epsilon \to 0$.

(S3) Let $K_\epsilon$ be a Friedrichs mollifier and let $\psi \in W^{m,p}(\Omega)$ for $1 \le p \le \infty$. Then $\psi * K_\epsilon \to \psi$ in $L^p(\Omega)$ and $\psi * K_\epsilon \to \psi$ in $W^{m,p}(\omega)$ as $\epsilon \to 0$ for all $\omega \subset\subset \Omega$. Here $\omega \subset\subset \Omega$ means that $\omega$ is open, the closure $\bar\omega$ is compact, and $\bar\omega \subset \Omega$.

3.2.2 Consistency

Let $(\Omega_0, \mathcal{F}, P)$ be a probability space. For $i = 1, \ldots, n$, assume that $x_i$ and $e_i = y_i - f(x_i;\theta_0)$ are independent random variables (carried by $(\Omega_0, \mathcal{F}, P)$) with distributions $H$ and $G$, respectively. Setting $\tilde G_\theta$ to denote the distribution of $|z(\theta)|$, we can rewrite $D_n(V,w,\theta)$ as
$$D_n(V,w,\theta) = \frac{1}{n}\sum_{i=1}^n w(x_i,\theta_0)\,a_n(i)\,(\rho\circ\tilde G_\theta^{-1})(\xi_{(i)}),$$
where the $\xi_{(i)}$ are order statistics from the uniform $U(0,1)$ distribution.

Theorem 3. Let

(I1) $P\big(f(x,\theta) = f(x,\theta_0)\big) < 1$ for any $\theta \ne \theta_0$;

(I2) $w \in L^p(\mathcal{X})$ and there exists a function $h \in L^q(\mathcal{V})$ such that $|\rho(\tilde G_\theta^{-1}(v))| \le h(v)$ for all $\theta \in \Theta$ and all $1 \le p, q \le \infty$ such that $1/p + 1/q = 1$; and

(I3) $G$ has a density $g$ that is symmetric about $0$ and strictly decreasing on $\mathbb{R}^+$.

Then $\hat\theta_n \xrightarrow{a.s.} \theta_0$.

Before giving the proof, we state the following lemma without proof; its proof may be constructed following Lemma 2.

Lemma 4. Under assumptions (I1)-(I3), $D_n(V,w,\theta) \xrightarrow{a.s.} \mu(\theta)$ a.e. $V$, uniformly for all $\theta \in \Theta$, where $\mu : \Theta \to \mathbb{R}$ is a function satisfying $\inf_{\theta\in\Theta^*}\mu(\theta) > \mu(\theta_0)$ for any $\Theta^*$ a closed subset of $\Theta$ not containing $\theta_0$.

Proof.
By Lemma 1 of Wu (1981), to establish the consistency of $\hat\theta_n$, it is sufficient to show that
$$\liminf_{n\to\infty}\inf_{\theta\in\Theta^*}\big[D_n(V,w,\theta) - D_n(V,w,\theta_0)\big] > 0 \quad \text{a.s.} \qquad (3.3)$$
for any $\Theta^*$ a closed subset of $\Theta$ not containing $\theta_0$. To that end, let $A_n(V,w,\theta) = D_n(V,w,\theta) - \mu(\theta)$, $B(\theta,\theta_0) = \mu(\theta) - \mu(\theta_0)$, and $C_n(V,w,\theta_0) = \mu(\theta_0) - D_n(V,w,\theta_0)$. By Lemma 2, we have $A_n(V,w,\theta) \xrightarrow{a.s.} 0$ uniformly for all $\theta \in \Theta^*$, $\inf_{\theta\in\Theta^*}B(\theta,\theta_0) > 0$, and $\liminf_{n\to\infty}C_n(V,w,\theta_0) = 0$ a.s. For the statement given in (3.3) to hold, it suffices to show that $\liminf_{n\to\infty}\inf_{\theta\in\Theta^*}A_n(V,w,\theta) = 0$ a.s. $A_n(V,w,\theta)$, being uniformly convergent and continuous on a compact set $\Theta^*$, is equicontinuous on $\Theta^*$ a.e. $V$. This gives $\liminf_{n\to\infty}\inf_{\theta\in\Theta^*}\{A_n(V,w,\theta)\} = 0$ a.s., and the proof is complete.

Assumption (I1) is a very weak condition needed for $\theta_0$ to be identified. The linear version of (I1) was given by Hössjer (1994) as $P(|\theta'x| = 0) < 1$ for any $\theta \ne 0$ under the assumption that $\theta_0 = 0$. Since $\varphi^+$ is bounded, by (I2) we have $\|w\varphi^+\|_p < \infty$. Moreover, (I2) and Hölder's inequality ensure that the product $(w\varphi^+)\,\rho(\tilde G_\theta^{-1})$ is integrable. (I3) admits a wide variety of error distributions, examples of which are the normal, double exponential, and Cauchy distributions with location parameter equal to $0$.

3.2.3 Asymptotic Normality

Write $\rho_\theta(t) = \rho[\tilde G_\theta^{-1}(t)]$ and $\eta_i = w(x_i,\theta_0)\,a_n(R_i)$, where $R_i$, $i = 1, \ldots, n$, are the ranks of $\xi_1, \ldots, \xi_n$. Then (3.2) can be written as
$$D_n(V,w,\theta) = \frac{1}{n}\sum_{i=1}^n w(x_i,\theta_0)\,a_n(i)\,(\rho\circ\tilde G_\theta^{-1})(\xi_{(i)}) = \frac{1}{n}\sum_{i=1}^n \eta_i\,\rho_\theta(\xi_i).$$
By (I2), $\|\eta_i\|_p < \infty$ for $1 \le p \le \infty$. Now set $\lambda_n(\theta) = D^\alpha D_n(V,w,\theta)$ and $\psi_\theta(t) = D^\alpha\rho_\theta(t)$ for $|\alpha| = 1$. Since the dependence of $\psi_\theta$ on $y$ is only through $z(\theta)$, we will suppress $y$ in the notation and write $\psi_\theta(x)$. Denote the $n \times p$ matrix $X_\psi$ by $X_\psi = [\psi'_\theta(x_1), \ldots, \psi'_\theta(x_n)]^T$ and define $h_{nii}$ to be the $i$th diagonal component of $X_\psi(X_\psi^T X_\psi)^{-1}X_\psi^T$. Now $\hat\theta_n$ is a zero of
$$\lambda_n(\theta) = \frac{1}{n}\sum_{i=1}^n \eta_i\,\psi_\theta(\xi_i). \qquad (3.4)$$
Thus $\hat\theta_n$ can be seen as a weighted M-estimator with weights $\eta_1, \ldots, \eta_n$. So, under some conditions, the asymptotic theory of weighted M-estimation can be applied. In addition to (I1)-(I3), consider the following conditions:

(I4) $\theta \to \rho_\theta(t)$ is a map in $W^{3,p}(B)$, where $B$ is a neighborhood of $\theta_0$, for every fixed $t$.

(I5) There exist functions $\Psi \in W^{2,p}(\mathcal{V})$ such that $|D^\alpha\rho_\theta(t)| \le \Psi(t)$ for every $\theta \in B$ and $|\alpha| \le 2$.

(I6) $A_{\theta_0} = E\big[w(x,\theta_0)\,\varphi^+(\xi)\,[D^\alpha\psi_\theta(\xi)]_{\theta=\theta_0}\big]$, where $\xi \sim U(0,1)$, is a positive definite matrix for $|\alpha| = 1$.

(I7) $\lim_{n\to\infty}\max_{1\le i\le n} h_{nii} \to 0$.

Example 1. The assumptions above allow us to define certain types of hybrid estimators which may be constructed in the interest of efficiency and robustness. One such estimator behaves like an LS estimator for small absolute residuals and like an LAD estimator for large absolute residuals. As an illustration, let us consider the one-dimensional case with $\Theta = [a,b]\cup[b,c]$, where $a < b < c$, and take
$$\rho(|z(\theta)|) = \begin{cases} z^2(\theta), & \theta\in[a,b],\\ |z(\theta)| + z^2(b) - |z(b)|, & \theta\in[b,c]. \end{cases}$$
This function is strictly increasing and continuous as a function of $\theta$ but not differentiable in the usual sense at $\theta = b$. The test function $\phi \in \mathcal{D}(\Theta)$ has to satisfy $\phi(a) = \phi(b) = \phi(c) = 0$. Using the definition of the weak derivative, $\psi$ satisfies
$$\int_\Theta \rho(|z(\theta)|)\,\phi'(\theta)\,d\theta = -\int_\Theta \psi(\theta)\,\phi(\theta)\,d\theta.$$
But
$$\int_\Theta \rho(|z(\theta)|)\,\phi'(\theta)\,d\theta = \int_a^b z^2(\theta)\,\phi'(\theta)\,d\theta + \int_b^c |z(\theta)|\,\phi'(\theta)\,d\theta = \int_a^b 2z(\theta)\dot f(\theta;x)\,\phi(\theta)\,d\theta + \int_b^c \operatorname{sgn}(z(\theta))\dot f(\theta;x)\,\phi(\theta)\,d\theta = \int_\Theta\big[2z(\theta)\dot f(\theta;x)I_{[a,b]}(\theta) + \operatorname{sgn}(z(\theta))\dot f(\theta;x)I_{[b,c]}(\theta)\big]\phi(\theta)\,d\theta,$$
where the second equality follows from integration by parts and $\dot f(\theta;x) = \partial f(\theta;x)/\partial\theta$. Thus
$$\psi(\theta) = -\big[2z(\theta)\dot f(\theta;x)I_{[a,b]}(\theta) + \operatorname{sgn}(z(\theta))\dot f(\theta;x)I_{[b,c]}(\theta)\big].$$
Higher order derivatives can be computed similarly. In this case, (I4) and (I5) are satisfied as long as $f$ and its higher order derivatives (up to order 3) are bounded by integrable functions.
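For intuition, the verbal description at the start of Example 1 (LS-like for small absolute residuals, LAD-like for large ones) can be sketched in R as a hybrid $\rho$ applied to the absolute residual, joined continuously at a cutoff. The cutoff value and the function name below are illustrative assumptions, and this adapts the idea of the example rather than reproducing its parameter-space construction.

```r
# Illustrative hybrid rho: quadratic (LS-like) for small |z|, linear (LAD-like)
# for large |z|, joined continuously at a hypothetical cutoff b = 1.
hybrid_rho <- function(z, b = 1) {
  az <- abs(z)
  ifelse(az <= b, az^2, az + b^2 - b)      # both branches equal b^2 at az = b
}
curve(hybrid_rho(x), from = -3, to = 3)    # visualize the LS-to-LAD transition
```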
In particular, assumption (I4) allows us to include a large variety of generalized functions that are not differentiable in the usual sense. The typical example is the absolute value function $x \to |x|$, which underlies the LAD and SR estimators; it is not differentiable in the usual sense but is locally integrable and therefore weakly differentiable. It can be seen that none of the computations above can be carried out without assumption (I4). Assumption (I5) ensures the integrability of $(w\varphi^+)\,(D^\alpha\rho_\theta)$ for any $\alpha$ such that $|\alpha| \le 2$, so that the SLLN can be applied (see the remark below). $A_{\theta_0}$ in assumption (I6) denotes the limiting Hessian matrix. (I7) is a common assumption, known as Noether's condition, that ensures the asymptotic normality of $\lambda_n(\theta_0)$.

Remark 6. Under (I1), (I2), and (I3), Lemma 2 gives the pointwise almost sure convergence of $D_n(V,w,\theta)$ for any $\theta \in \Theta$. If in addition (I5) holds, then we have $[D^\alpha D_n(V,w,\theta)]_{\theta=\theta_0} \xrightarrow{a.s.} \mu^{(\alpha)}(\theta_0)$ a.e. $V$, where $\mu^{(\alpha)}(\theta_0) \equiv E\big[w(x,\theta_0)\,\varphi^+(\xi)\,[D^\alpha\rho_\theta(\xi)]_{\theta=\theta_0}\big]$ for any $\alpha$ such that $|\alpha| \le 3$.

The following theorem gives the asymptotic normality of $\hat\theta_n$. The approach of the proof is similar to that given in van der Vaart (1998) for M-estimators.

Theorem 4. Under assumptions (I1)-(I7),
$$\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{D} N\big(0,\, A_{\theta_0}^{-1}\Sigma_{\theta_0}A_{\theta_0}^{-1}\big),$$
where $\Sigma_{\theta_0} = E\big[w(x,\theta_0)\,\varphi^+(\xi)\,\psi_{\theta_0}(\xi)\,\psi_{\theta_0}(\xi)^T\big]$.

Proof. By (I4), $\theta \to \rho_\theta$ is a map in $W^{3,p}(B)$, and properties (S1)-(S3) imply that $\mathcal{D}(\bar B)$ is dense in $W^{m,p}(B)$, where $\bar B$ is the closure of $B$. Thus, in the following, we may assume without loss of generality that $\rho_\theta \in \mathcal{D}(\bar B)$. Implementing the Taylor expansion of $\lambda_n(\theta)$ at $\theta_0$, we get
$$0 = \lambda_n(\hat\theta_n) = \lambda_n(\theta_0) + \dot\lambda_n(\theta_0)(\hat\theta_n - \theta_0) + \tfrac{1}{2}(\hat\theta_n - \theta_0)'\ddot\lambda_n(\theta_n^*)(\hat\theta_n - \theta_0),$$
where $\theta_n^*$ is a point between $\theta_0$ and $\hat\theta_n$, $\dot\lambda_n(\theta_0) = [D^\alpha\lambda_n(\theta)]_{\theta=\theta_0}$ for $|\alpha| = 1$, and $\ddot\lambda_n(\theta_n^*) = [D^\alpha\lambda_n(\theta)]_{\theta=\theta_n^*}$ for $|\alpha| = 2$. Now, using (I7), an application of the Cramér-Wold device (Serfling, 1980) and the central limit theorem for linear combinations of functions of order statistics (Hájek and Šidák, 1967) show that $\sqrt{n}\,\lambda_n(\theta_0)$ converges to a multivariate normal distribution with mean $0$ and covariance matrix $\Sigma_{\theta_0}$. Also $\dot\lambda_n(\theta_0)$ converges almost surely to $A_{\theta_0}$ (see Remark 6) and hence in probability. By (I4) and Theorem 3, $\lim_{n\to\infty}P(\{\theta_n^* \in B\}) = 1$. So under the event $\{\theta_n^* \in B\}$,
$$\|\ddot\lambda_n(\theta_n^*)\| \le C\,\frac{1}{n}\sum_{i=1}^n w(x_i,\theta_0)\,\Psi(\xi_i),$$
where $C$ stands for the bound of the score function. The right-hand side of the above inequality is bounded in probability by the law of large numbers for $n$ sufficiently large. These facts and the consistency of $\hat\theta_n$ for $\theta_0$ give
$$\lambda_n(\theta_0) = -\big[A_{\theta_0} + o_p(1) + \tfrac{1}{2}(\hat\theta_n - \theta_0)'O_p(1)\big](\hat\theta_n - \theta_0) = -\big[A_{\theta_0} + o_p(1)\big](\hat\theta_n - \theta_0),$$
since $(\hat\theta_n - \theta_0)'O_p(1) = o_p(1)O_p(1) \to 0$ in probability. Also, with probability tending to $1$, the matrix $A_{\theta_0} + o_p(1)$ is invertible. Thus,
$$\sqrt{n}(\hat\theta_n - \theta_0) = -\sqrt{n}\big[A_{\theta_0} + o_p(1)\big]^{-1}\lambda_n(\theta_0) = -\sqrt{n}\,A_{\theta_0}^{-1}\lambda_n(\theta_0) + o_p(1). \qquad (3.5)$$
The proof is complete by an application of Slutsky's lemma and noting that $\sqrt{n}\,\lambda_n(\theta_0)$ is asymptotically normal.

In general, for estimators with $w \equiv 1$ and $\rho(t) = t$, we can simplify the asymptotic covariance matrix of $\sqrt{n}\,\hat\theta_n$ to obtain $\tau^2_{\varphi^+}\big\{E[\nabla f(x,\theta_0)\nabla f(x,\theta_0)^T]\big\}^{-1}$, where
$$\tau^2_{\varphi^+} = \frac{\int_0^1\{\varphi^+(u)\}^2\,du}{\Big\{\int_0^1 \varphi^+(u)\,\varphi^+_g(u)\,du\Big\}^2}, \qquad \varphi^+_g(u) = -\frac{g'\big(G^{-1}(\tfrac{u+1}{2})\big)}{g\big(G^{-1}(\tfrac{u+1}{2})\big)}.$$
It is easy to show that $\tau^2_{\varphi^+} = \{2g(0)\}^{-2}$ for the LAD estimator and $\tau^2_{\varphi^+} = \big\{\sqrt{12}\int_{-\infty}^{\infty}\{g(t)\}^2\,dt\big\}^{-2}$ for the SR estimator. The asymptotic covariance matrix of the LS estimator is $\sigma^2\big\{E[\nabla f(x,\theta_0)\nabla f(x,\theta_0)^T]\big\}^{-1}$, where $\sigma^2 = \int u^2\,dG$ is the error variance. For $\theta \in \mathbb{R}^p$, two estimators can be compared using the asymptotic relative efficiency (ARE), which is the reciprocal of the ratio of the determinants of their asymptotic covariance matrices raised to the power $1/p$ (Serfling, 1980). The evaluation of the ARE is very complicated in general.
We consider a simulation study with $G$ taken to be the $t$ distribution in Section 3.4. The ARE may be determined in closed form when comparing the simpler estimators, such as SR versus LS or LAD, for some error distributions such as the Cauchy, logistic, and normal. These are the same as those for comparing the signed-rank test, the sign test, and the $t$ test in the location problem and are studied extensively in Hettmansperger and McKean (1998).

3.2.4 Robustness

In this section, we discuss conditions needed for the boundedness of the influence function. If we can write
$$\sqrt{n}(\hat\theta_n - \theta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \Omega(\theta_0; y_i, x_i) + o_p(1),$$
then the influence function of $\hat\theta_n$ at a given point $(y_0, x_0)$ is $\Omega(\theta_0; y_0, x_0)$ (e.g., Corollary 3.5.7 of Hettmansperger and McKean (1998)). However, by equation (3.5) we have
$$\sqrt{n}(\hat\theta_n - \theta_0) = -\frac{1}{\sqrt{n}}\sum_{i=1}^n \eta(\xi_i)\,A_{\theta_0}^{-1}\psi_{\theta_0}(x_i) + o_p(1).$$
Thus $\Omega(\theta_0; y_0, x_0) = -\eta(x_0,y_0)\,A_{\theta_0}^{-1}\psi_{\theta_0}(x_0)$, where $\eta(x_0,y_0) = w(x_0)\,\varphi^+\big(G(z(\theta_0))\big)$. Assume that

(I8): there exists a constant $M > 0$ such that $\|w(x)\,\psi_{\theta_0}(x)\| \le M$.

Note that, from the boundedness of the score function $\varphi^+$, for (I8) to hold it suffices that $\|w(x)\,\psi_{\theta_0}(x)\|$ be bounded as a function of $x$. This is the same condition as Assumption 8 of Coakley and Hettmansperger (1993) for $\psi_{\theta_0}(x) = x$, corresponding to $f(x;\theta_0) = x'\theta_0$. Thus, the choice of weights discussed in Section 6.1 of Coakley and Hettmansperger (1993) may be extended to our case.

Theorem 5. Under (I1)-(I8), $\hat\theta_n$ has a bounded influence function.

3.3 Weight Specification

Based on the work of Giltinan et al. (1986) and Simpson et al. (1992) for the linear model, one may set
$$w(x;\theta_0) = \min\big[1,\; c/d(x;\theta_0)\big], \qquad (3.6)$$
where $d(x;\theta_0) = (x_\psi - m_{x_\psi})^T C_{x_\psi}^{-1}(x_\psi - m_{x_\psi})$ is a robust Mahalanobis distance, with $m_{x_\psi}$ and $C_{x_\psi}$ being robust estimates of the location and covariance of $x_\psi = \psi_{\theta_0}(x)$, respectively, and $c$ some positive constant. As can be seen from its definition, the weight function depends on the true parameter $\theta_0$. So, in our study, we consider plug-in estimators of $d(x;\theta_0)$ based on strongly consistent estimators of $\theta_0$. As in Coakley and Hettmansperger (1993), we take resistant distances for $d(x;\theta_0)$ that use location and scale estimates based on the minimum covariance determinant (MCD) and minimum volume ellipsoid (MVE) of Rousseeuw (1984) and Rousseeuw (1985), respectively. A discussion of these and other high breakdown estimators in the multivariate case is given in Hubert et al. (2008). Analogous to Tableman (1990) and Krasker and Welsch (1982), one may use the weight function $w(x;\theta_0)$ in (3.6) by setting $d(x;\theta_0) = x_\psi^T B_{\theta_0}^{-1} x_\psi$, with
$$B_{\theta_0} = E_{x,y}\Big[w^2(x;\theta_0)\,\big\{\varphi^+[2G(z(\theta_0)) - 1]\big\}^2\, x_\psi x_\psi^T\Big],$$
$x_\psi = \psi_{\theta_0}(x)$, and $c = M/\kappa$ for $\kappa$ such that $\sup_t\,\varphi^+(t) = \kappa$. The difficulty in using this weight function for the nonlinear regression model lies in estimating $d(x;\theta_0)$, since the expression for $B_{\theta_0}$ depends on the weight function $w(x;\theta_0)$. So we face an implicit-value problem. For linear models, however, $d(x;\theta_0)$ can be specified free of model parameters (see, for example, Section 6 of Coakley and Hettmansperger, 1993). An iterative scheme for computing $d(x;\theta_0)$ can be found in Krasker and Welsch (1982) and in Section 2 of Tableman (1990), and, under some suitable extra conditions (such as the invertibility of the score function $\varphi$), one may extend the iterative computation of $d(x;\theta_0)$ to the nonlinear case. Another weight function is that considered by Wiens and Zhou (1994), which uses $d(x_i;\theta_0)$ equivalent to $\|A_{\theta_0}^{-1}x_{\psi_i}\|^{-1}$, where $\|\cdot\|$ is the Euclidean norm.
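As a concrete illustration of the weight construction in (3.6), the following R sketch builds robust Mahalanobis distances from MCD estimates via cov.rob in the MASS package (the routine Section 3.4 reports using) and caps the resulting weights at one. The function name, the $\chi^2_{.95}$ cutoff, and the assumption that the gradient matrix of $f$ is available with at least two columns are illustrative, not prescriptions from the text.

```r
# Minimal sketch of the weight construction in (3.6) (illustrative only)
library(MASS)

# x_psi: n x p matrix whose rows are gradient vectors of f at a preliminary estimate
make_weights <- function(x_psi) {
  rob    <- cov.rob(x_psi, method = "mcd")                       # robust location/covariance (MCD)
  d      <- mahalanobis(x_psi, center = rob$center, cov = rob$cov)  # robust squared distances
  cutoff <- qchisq(0.95, df = ncol(x_psi))                       # chi-square cutoff, as in Section 3.4
  pmin(1, cutoff / d)                                            # w_i = min(1, c / d_i)
}
```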
3.3.1 Plug-in Estimator of the Weight

Let $\tilde\theta_n$ be the minimizer of $D_n(V,w,\theta)$ with $w \equiv 1$, that is, the unweighted estimator of $\theta_0$. The estimator $\hat\theta_S$ given at the beginning of Section 3.2 is one such estimator. Under (I1)-(I7), $\tilde\theta_n \xrightarrow{a.s.} \theta_0$ as $n \to \infty$; hence $\tilde\theta_n = \theta_0 + o(1)$ w.p. 1. Denote $\tilde w_i = w(x_i;\tilde\theta_n)$, $\tilde\eta_i = w(x_i;\tilde\theta_n)\,\varphi^+\big(\frac{R_i}{n+1}\big)$, and
$$\tilde\lambda_n(\theta) = \frac{1}{n}\sum_{i=1}^n \tilde\eta_i\,\psi_\theta(\xi_i),$$
where $\tilde\eta(x_0,y_0) = w(x_0;\tilde\theta)\,\varphi^+\big(G(z(\theta_0))\big)$. Set $\hat{\hat\theta}_n = \operatorname*{Argmin}_{\theta\in\Theta}\tilde D_n(V,\tilde w,\theta)$, where $\tilde D_n(V,\tilde w,\theta) = \frac{1}{n}\sum_{i=1}^n w(x_i;\tilde\theta)\,a_n(i)\,(\rho\circ\tilde G_\theta^{-1})(\xi_{(i)})$. The following theorems show that $\hat{\hat\theta}_n$ and $\hat\theta_n$ have the same asymptotic properties.

Theorem 6. Under (I2), $\tilde\lambda_n(\theta_0) = \lambda_n(\theta_0) + o(1)$ w.p. 1.

Proof. From the fact that $\tilde\theta = \theta_0 + o(1)$ w.p. 1 and the continuity of the weight function $w$, we have $w(x_i;\tilde\theta) = w(x_i;\theta_0) + o(1)$ w.p. 1. Using this fact in the expression for $\tilde\lambda_n(\theta_0)$, we get
$$\tilde\lambda_n(\theta_0) = \lambda_n(\theta_0) + o(1)\,\frac{1}{n}\sum_{i=1}^n \varphi^+\Big(\frac{R_i}{n+1}\Big)\,\psi_{\theta_0}(\xi_i).$$
Set $\Gamma_n(\theta_0) = \frac{1}{n}\sum_{i=1}^n \varphi^+\big(\frac{R_i}{n+1}\big)\,\psi_{\theta_0}(\xi_i)$. Then, under (I2), by the SLLN for functions of order statistics (van Zwet, 1980), we have $\Gamma_n(\theta_0) \xrightarrow{a.s.} E[\varphi^+(\xi)\,\psi_{\theta_0}(\xi)] < \infty$, where $\xi \sim U(0,1)$. Thus $\Gamma_n(\theta_0)$ is bounded in probability and hence $o(1)\,\Gamma_n(\theta_0) \to 0$ w.p. 1 as $n \to \infty$. Therefore $\tilde\lambda_n(\theta_0) = \lambda_n(\theta_0) + o(1)$ w.p. 1.

Theorem 7. Under (I1)-(I7), $\sqrt{n}(\hat{\hat\theta}_n - \hat\theta_n) = o_p(1)$.

Proof. As in the proof of Theorem 4, the Taylor expansion of $\tilde\lambda_n(\hat{\hat\theta}_n)$ about $\theta_0$ gives
$$0 = \tilde\lambda_n(\hat{\hat\theta}_n) = \tilde\lambda_n(\theta_0) + \dot{\tilde\lambda}_n(\theta_0)(\hat{\hat\theta}_n - \theta_0) + \tfrac{1}{2}(\hat{\hat\theta}_n - \theta_0)'\ddot{\tilde\lambda}_n(\theta_n^*)(\hat{\hat\theta}_n - \theta_0),$$
where $\theta_n^*$ lies on the line segment joining $\hat{\hat\theta}_n$ and $\theta_0$. Also, under the assumptions of Theorem 6, one can show that $\dot{\tilde\lambda}_n(\theta_0) = \dot\lambda_n(\theta_0) + o(1)$ w.p. 1. Using the fact that $w(x_i;\tilde\theta) = w(x_i;\theta_0) + o(1)$ w.p. 1 and the SLLN for functions of order statistics, we have $\dot{\tilde\lambda}_n(\theta_0) \xrightarrow{a.s.} A_{\theta_0}$. Note that under (I1)-(I3), $\hat{\hat\theta}_n \xrightarrow{a.s.} \theta_0$. From this and assumption (I4), $\lim_{n\to\infty}P(\{\theta_n^*\in B\}) = 1$. So under the event $\{\theta_n^*\in B\}$, $\ddot{\tilde\lambda}_n(\theta_n^*)$ is bounded in probability. Hence
$$\sqrt{n}(\hat{\hat\theta}_n - \theta_0) = -\sqrt{n}\,A_{\theta_0}^{-1}\tilde\lambda_n(\theta_0) + o_p(1) = -\frac{1}{\sqrt n}\sum_{i=1}^n \tilde\eta(\xi_i)\,A_{\theta_0}^{-1}\psi_{\theta_0}(x_i) + o_p(1) = -\frac{1}{\sqrt n}\sum_{i=1}^n \eta(\xi_i)\,A_{\theta_0}^{-1}\psi_{\theta_0}(x_i) - o(1)\,A_{\theta_0}^{-1}\frac{1}{\sqrt n}\sum_{i=1}^n \psi_{\theta_0}(x_i) + o_p(1),$$
where the last equality follows from $\tilde\eta_i = \eta_i + o(1)$ w.p. 1 by the continuity of the weight function and the boundedness of $\varphi^+$. By the central limit theorem, $\frac{1}{\sqrt n}\sum_{i=1}^n \psi_{\theta_0}(x_i)$ converges to a normal distribution and is thus bounded in probability. Therefore
$$\sqrt{n}(\hat{\hat\theta}_n - \theta_0) = -\frac{1}{\sqrt n}\sum_{i=1}^n \eta(\xi_i)\,A_{\theta_0}^{-1}\psi_{\theta_0}(x_i) + o_p(1). \qquad (3.7)$$
Combining (3.5) and (3.7) yields $\sqrt{n}(\hat{\hat\theta}_n - \hat\theta_n) = o_p(1)$.

Remark 7. Theorems 4 and 7 show that $\hat{\hat\theta}_n$ has the asymptotic Gaussian distribution given in Theorem 4. Moreover, equation (3.7) gives
$$\sqrt{n}(\hat{\hat\theta}_n - \theta_0) = -\frac{1}{\sqrt n}\sum_{i=1}^n \eta(\xi_i)\,A_{\theta_0}^{-1}\psi_{\theta_0}(x_i) + o_p(1).$$
Thus $\hat{\hat\theta}_n$ has the same influence function as $\hat\theta_n$.

3.4 Examples

All analyses in this section were performed using the R software environment (R Development Core Team, 2009). Estimates were found by means of the "nonlinear minimization subject to box constraints" algorithm (Gay, 1983, 1984) implemented in the function nlminb in the 'stats' package of R, with a range of starting values. We use weight functions based on robust Mahalanobis distances. MCD estimates of location and covariance were found using the fast MCD algorithm of Rousseeuw and Van Driessen (1999). Both MCD and MVE are implemented in the R function cov.rob in the package 'MASS'. For simplicity, all WSR estimators considered in the following use $\varphi^+(u) = u$ and $\rho(t) = t$. In this case $\tilde\theta = \hat\theta_S$.

3.4.1 Monte Carlo Simulation

Exponential Regression. Consider the one-parameter exponential regression model
$$y_i = e^{\theta_0 x_i} + \varepsilon_i, \qquad i = 1, \ldots, 25. \qquad (3.8)$$
We took $x_1, \ldots, x_{25}$ to be iid $N(0,1)$.
We then performed $B = 1000$ iterations where in each iteration $\varepsilon_1,\ldots,\varepsilon_{25}$ were randomly generated from the $N(0, .01)$ distribution and $y_1,\ldots,y_{25}$ were computed as $y_i = e^{x_i} + \varepsilon_i$. That is, the true value of $\theta_0$ was taken to be 1. We considered two cases: an outlier in the $y$-direction, obtained by adding 50 to $y_{20}$, and an outlier in the $x$-direction, obtained by adding 5 to $x_{20}$. In each of the $B$ repetitions we estimated $\theta_0$ (giving $\hat\theta^{(1)},\ldots,\hat\theta^{(B)}$) using LS, LAD, SR, and WSR. For WSR estimation, we used the weight function
$$w(x,\tilde\theta) = \min\Big\{1,\ \frac{\chi^2_{.95}(1)}{d(x,\tilde\theta)}\Big\}$$
where $\chi^2_{.95}(1)$ is the 95% percentile point of the $\chi^2(1)$ distribution and $d(x_i,\tilde\theta) = (x_i e^{\tilde\theta x_i} - \mu)^2/\sigma^2$. Here $\tilde\theta$ is the SR estimate of $\theta_0$, whereas $\mu$ and $\sigma^2$ are the MCD center and variance, respectively, of $x e^{\tilde\theta x}$. As the estimated value of $\theta_0$, we took $B^{-1}\sum\hat\theta^{(i)}$ and as the estimate of the MSE, we took $B^{-1}\sum(\hat\theta^{(i)} - \theta_0)^2$. The results are given in Table 3.1.

Table 3.1: Average estimates (MSEs) of $\theta_0 = 1$ in 1000 simulated samples

            LS                LAD            SR             WSR
y-outlier   -3.766 (22.7108)  .999 (.0001)   .999 (.0001)   .973 (.0031)
x-outlier    .233 (.5878)     .342 (.4327)   .264 (.5413)   .993 (.0001)

From Table 3.1, it is clear that the LS estimate is affected rather adversely by the presence of the $y$-outlier. We can also observe that the LS, LAD, and SR estimates are affected by the $x$-outlier, while the WSR estimate remained close to $\theta_0 = 1$ in both cases. Figure 3.1 gives the plots of the fitted curves obtained using LS, LAD, SR, and WSR. It is clear from the plot that the WSR fit is not affected by either the outlying $x$ or the outlying $y$ values.

[Figure 3.1: Plots of the fitted exponential curves (True, LS, LAD, SR, WSR); left panel: outlier in $y$, right panel: outlier in $x$.]

A Study of Relative Efficiency

It is well known that in linear models the relative efficiency (RE) of the signed-rank estimator to the least squares estimator is $3/\pi \approx .955$ when $G$ is the normal distribution. This RE increases with the heaviness of the tail of the error distribution. To study this for nonlinear models, we considered a simulation experiment involving the Michaelis-Menten and Gompertz functions. The Michaelis-Menten function describes the relation between the velocity of reaction $f(x)$ of an enzyme and the substrate concentration $x$. It is given by
$$f(x;\theta) = \frac{\theta_1 x}{\theta_2 + x},$$
where $\theta = (\theta_1, \theta_2)$, $\theta_1$ is the maximum velocity of the reaction and $\theta_2$ is the half-saturation constant. On the other hand, the Gompertz model is defined by
$$f(x;\theta) = \theta_1\exp\{-\theta_2 e^{-\theta_3 x}\},$$
where $\theta = (\theta_1, \theta_2, \theta_3)$ is the vector of parameters of interest. One of the applications of this function is for modeling the growth of tumors. In fact, tumors are cellular populations growing in a confined space where the availability of nutrients is limited. In this situation, $f(x)$ denotes the tumor size, $\theta_1$ the carrying capacity, i.e., the maximum size that can be reached with the available nutrients, $\theta_2 = \ln(\theta_1/f(0))$, where $f(0)$ is the tumor size at the starting observation time, and $\theta_3$ is the growth rate of the tumor. For both models, $n = 100$ values of $x$ were generated at random from the standard exponential distribution. Then for each of $B = 2000$ repetitions, $n$ random errors $\varepsilon$ were generated from the $t$ distribution with $d$ degrees of freedom and $n$ values of $y = f(x;\theta) + \varepsilon$ were computed. This was done for degrees of freedom $d$ ranging from 3 to 102 in steps of 3. The $t$ distribution is ideal for simulating varying tail thicknesses as it provides distributions ranging from the Cauchy ($d = 1$) to the normal ($d = \infty$). We are interested in estimating the model parameters in both cases.
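Returning to the exponential-regression experiment described above, a minimal R sketch of one WSR replication, with $\varphi^{+}(u) = u$ and $\zeta(t) = t$ and using nlminb, might look as follows; it is not the dissertation's code, and it reuses the wsr_weights helper sketched in Section 3.3.

## Sketch of one WSR fit for y_i = exp(theta x_i) + e_i with a y-outlier at i = 20.
wsr_dispersion <- function(theta, y, x, w) {
  e <- y - exp(theta * x)                                # residuals from f(x; theta) = exp(theta x)
  mean(w * (rank(abs(e)) / (length(e) + 1)) * abs(e))    # weighted signed-rank dispersion
}

set.seed(1)
n <- 25
x <- rnorm(n)
y <- exp(x) + rnorm(n, sd = 0.1)                         # theta_0 = 1, errors N(0, .01)
y[20] <- y[20] + 50                                      # outlier in the y-direction

sr  <- nlminb(0.5, wsr_dispersion, y = y, x = x, w = rep(1, n))   # unweighted SR step
v   <- matrix(x * exp(sr$par * x), ncol = 1)             # gradient of f at the SR estimate
wsr <- nlminb(sr$par, wsr_dispersion, y = y, x = x,
              w = wsr_weights(v))                        # plug-in weights, as in Section 3.3.1
c(SR = sr$par, WSR = wsr$par)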
The weighting scheme used for WSR estimation is the same as the one in the previous Monte Carlo experiment (MCD along with a 2:95 cutoff). Figure 3.2 contains the plot of degrees of freedom versus the estimated relative efficiency given by ratios of estimated MSEs. The top dashed horizontal line is at 3= and the bottom dashed horizontal line is at 2= . These values are the theoretical asymptotic relative efficien- cies of SR versus LS and LAD versus LS, respectively, for the linear model. Estimators with relative efficiencies above the solid line are more efficient than the LS estimator. The plot shows that the SR and WSR estimators are more efficient than the LAD estimator. More- over, we observe that the relative efficiencies approach the theoretical values under normality as the degree of freedom increases. We can also see that LAD, SR, and WSR estimators are more efficient than the LS estimator for heavy tailed error distributions. SR and WSR provide similar estimates of relative efficiency. Figure 3.2: Relative Efficiency 0 20 40 60 80 100 1.0 1.5 2.0 Michaelis?Menten t df relative efficiency vs LS SR LAD WSR 0 20 40 60 80 100 0.6 0.8 1.0 1.2 1.4 1.6 1.8 Gompertz t df relative efficiency vs LS SR LAD WSR 40 3.4.2 Real Data We now consider the Lakes Data given in Stromberg (1993). The data were collected from 29 lakes in Florida. Three variables were collected on each lake. The response is the mean annual nitrogen concentration (y). The predictors are the average influent nitrogen concentration (x1) and the water retention time (x2). The model recommended by the investigator is yi = x1i1 + x 2i +"i ; i = 1;:::;29 : We fit the model using LS, LAD, SR, and WSR. For the weighted SR, we used the weight function w(x) = min 1; 2 :95(2) d(x;~ ) : We computed two versions of WSR. WSR1 uses weights where d(x;~ ) is taken to be the robust Mahalanobis distance constructed using the MVE estimates of the center and covari- ance ofr f(x;~ ) with f(x; ) = x1=(1 + x 2 ), = ( ; ), and x = (x1;x2). WSR2 has the same setup as WSR1 except that it uses weights based on classical Mahalanobis distances. We take ~ to be the SR estimate of . The results obtained taking ~ as LS or LAD estimates are similar and hence not reported. Table 3.2 gives the estimates of the parameters, standard errors, and scale. Table 3.2: Parameter estimates for Lakes Data Method Parameter Estimate SE Z p-value Scale LS 5.08423 1.93891 2.62220 0.00874 1.2637 1.27852 0.35334 3.61835 0.00030 LAD 3.88311 0.79337 4.89443 0.00000 1.0046 1.29880 0.23082 5.62686 0.00000 SR 5.09126 1.66082 3.06551 0.00217 1.3877 1.22612 0.29373 4.17432 0.00003 WSR 0.88588 0.58543 1.51320 0.13023 1.2904 0.39018 0.17357 2.24799 0.02458 41 For the original data, it is apparent from Table 3.2 that WSR1 gives estimates that are quite different from the other methods. The residual diagnostic plots given in Figure 3.3 can help us determine which of the results is the most appropriate. The figure contains plot of residuals from the fitted models versus x1 (left panel) and residuals versus ^y (right panel). The plots for LAD and SR (not shown) are very similar to those of LS. We observe that all plots of residuals versus x1 identify two potential outliers (denoted by solid squares) in the x1 direction. However, these observations do not stand out as aberrant in the residual versus fits plot for LS whereas WSR2 identifies only one of them as a potential outlier. The WSR1 fit clearly identifies the two points as outliers. 
Stromberg (1993) also identifies the same points using the high breakdown least median of squares and MM estimators. The residual plot of WSR1 given in Figure 3.3 and that of MM given in Stromberg (1993) are quite similar. The "Outliers Removed" part of Table 3.2 gives the estimates and standard errors after removing the two outliers identified by WSR1. The estimates due to the different methods are quite similar to each other. We removed the one outlier identified by WSR2 and refit all the models. LS, SR, and LAD fail to identify the remaining outlier. However, WSR2 identifies it. This indicates a potential masking problem associated with using the classical Mahalanobis distance in the weights. This is very much the same masking problem one faces when using hat matrix weights in linear regression. This issue is discussed in Wiens and Du (2000). In this case, the use of robust Mahalanobis distances appears to be a good solution to the masking problem.

3.5 Discussion

In this article, we proposed a rank-based analysis of nonlinear regression models. Our study uses a weighted generalized signed-rank dispersion function. The generalized signed-rank dispersion function results in a class of estimators including the signed-rank, least squares, and least absolute deviations estimators. These estimators do not have bounded influence functions. The influence functions of LAD and LS are unbounded in both the response and design spaces, while that of the signed-rank is unbounded in the design space. Thus, the signed-rank estimation procedure may not be suitable for studies with uncontrolled designs. Added weights allow us to construct estimators with bounded influence. However, it is rather complicated to use the proposed weights directly since they cannot be expressed free of the regression parameter for nonlinear models. Our solution is to replace the regression parameter by its consistent estimator in the weight function. This results in estimators that are asymptotically equivalent to those in which the weights are based on the unknown true value of the parameter. The unweighted versions of our estimators are consistent; hence, they can be used to estimate the weights. We recommend weight functions that are based on robust Mahalanobis distances using robust estimates of location and covariance of the Jacobian matrix of the regression function. They are explicitly defined and are simple to construct. Moreover, as our simulation studies demonstrate, the weighted signed-rank estimators with weights based on robust Mahalanobis distances are efficient as well as robust. It is also shown that these weighted estimators can be useful in detecting outliers in nonlinear regression.

[Figure 3.3: Residual plots for the Lakes data; residuals versus $x_1$ (left) and residuals versus fitted values (right) for the LS, WSR1, and WSR2 fits.]

Chapter 4
Rank Regression with Missing Response

4.1 Introduction

The problem of missing data is nowadays at the center of almost all statistical studies. It occurs in a wide array of application areas for several reasons.
Among the reasons are the following: a sensor in a remote sensor network may be damaged and cease to transmit data; certain regions of a gene microarray may fail to yield measurements of the underlying gene expressions due to scratches, fingerprints, or manufacturing defects; in a clinical trial, participants may drop out during the course of the study, leading to missing observations at subsequent time points; all applicable tests may not be ordered by a doctor while diagnosing a patient; and users of a recommender system rate only an extremely small fraction of the available books, movies, or songs, leading to a massive amount of missing data. Also, data may be missing because equipment malfunctioned, the weather was terrible, people got sick, or the data were not entered correctly.

The type of missingness to be considered in this chapter is the missing response in the context of regression analysis, which often arises in various experimental settings such as market research surveys, medical studies, opinion polls, and socioeconomic investigations. The statistical investigation of such a problem is a very difficult task since, in most cases, the missing data themselves contain either little or no information about the missing data mechanism (MDM). The fundamental and most commonly used assumption about the MDM is the MAR assumption discussed in Rubin (1976). The idea behind MAR is that the probability that a response variable is observed can depend only on the values of those other variables that have been observed. The scientific literature provides extensive studies and effective computational methods for handling missing data under the MAR assumption.

4.2 Model Definition

Consider the linear semi-parametric regression model
$$Y_i = X_i'\beta + g(T_i) + \varepsilon_i, \qquad 1 \le i \le n, \qquad (4.1)$$
where $\beta \in B$ is a vector of parameters, the $X_i$'s are i.i.d. $p$-variate random covariate vectors, the $T_i$'s are i.i.d. univariate random covariates defined on $[0,1]$, the function $g : [0,1] \to \mathbb{R}$ is unknown, and the model errors $\varepsilon_i$ are independent with conditional mean zero given the covariates. Also, $0 < E(\varepsilon_i^2\,|\,Z_i) < \infty$ with $Z_i = (X_i, T_i)$. In this chapter, we are interested in inference on the true value $\beta_0$ of the parameter $\beta$ when there are missing responses in the linear semi-parametric model (4.1). This model has captured a lot of attention in recent years. An application of (4.1) to a mouthwash experiment was given by Speckman (1988). An example using (4.1) is provided in Green and Silverman (1994). The least squares estimation approach for the setting discussed above was studied by Wang and Sun (2007). A semi-parametric mixed model for analyzing the CD4 cell count in HIV seroconverters was studied by Zeger and Diggle (1994). Model (4.1) has also been applied in several fields such as biometrics (see Gray, 1994), econometrics, and others. For the complete data setting, this model has been extensively studied by many authors such as Heckman (1986), Speckman (1988), Robinson (1988), and Rice (1986), among others. In the framework of model (4.1), Wang et al. (2004) developed inference tools in the missing response case for the mean of $Y$ based on the least squares estimation approach and under the MAR assumption. A classical method for constructing a confidence interval for the true mean of $Y$ is the empirical likelihood method introduced by Owen (1990). As pointed out by many authors, this method has many advantages over other methods such as those based on normal approximations or the bootstrap (Hall and La Scala, 1990). Many authors have studied this framework including Kitamura (1997), Peng (2004), Wang et al.
(2004), Xue and Zhu (2007b), Xue and Zhu (2007a), Xue and Zhu (2006), Wang and Rao (2002a), Sun et al. (2009), Chen and Hall (1993) among others. When dealing with missing data, the main approach is to impute a plausible value for each missing datum and then analyze the results as if they were complete. In most of the regression problems, the commonly used approaches include linear regression imputation (Healy and Westmacott, 1956), nonparametric kernel regression imputation (Cheng, 1994; WangandRao,2002b),semi-parametricregressionimputation(WangandSun,2007),among others. Wang and Sun (2007) also considered the semi-parametric regression imputation approach to estimate the true mean of Y. The other well known approach for handling missing data is the inverse probability weighting. This approach has gained considerable attention as a way to deal with missing data problems. For a discussion of this approach, see Wang et al. (1997), Robins et al. (1994), Wang et al. (2004), Zhao et al. (1996) and references therein. As pointed out by Wang and Sun (2007), for missing problems, the inverse probability weighting approach usually depends on high dimensional smoothing for estimating the completely unknown propensity score function, leading to the well known problem of "curse of dimensionality" that may restrict to use of the resulting estimator. One way to avoid such a problem is to use the inverse marginal probability weighted method suggested by Wang et al. (2004). 4.3 Estimation In model (4.1), consider the case where some values of Y in the sample of size n may be missing, but X and T are fully observed. That is, we obtain the following incomplete 47 observations (Yi; i;Xi;Ti); i = 1;2;:::;n from (4.1), where Xi?s and Ti?s are observed, and, i = 8 >< >: 0; Yi is missing; 1; otherwise. We assume that Y is missing at random (MAR). The MAR assumption implies that and Y are conditionally independent given X and T. That is P( = 1jY;X;T) = P( = 1jX;T). As discussed above, this is a common assumption for statistical analysis with missing data and is reasonable in many practical situations, see Little and Rubin (1987). Let us introduce the following notations: Z = (X;T), 2(Z) = E("2jZ), (z) = P( = 1jZ = z) and (t) = P( = 1jT = t). Consider the following rank objective function, DCn ( ) = 1n nX i=1 ? R(ei( )) n+ 1 ei( ); (4.2) whereei( ) = i"i andR(ei( )) is theith rank ofei( ). Note that in the expression ofDCn ( ) in (4.2), and g are unknown. Also, E[ei( )jZi] = 0 by the MAR assumption. So, before dealing with the estimation of 0, let us consider first step of the estimation of g based on the complete data, that is, estimating g as a known function of t but unknown with respect to . As discussed in Wang et al., pre-multiplying (4.1) by the observation indicator, we have iYi = iX i + ig(Ti) + i"i; and taking conditional expectations given T, we have E[ iYijTi = t] = E[ iXijTi = t] +E[ ijTi = t]g(t); (4.3) 48 from which, it follows that gC1 (t) = E[ XjT = t]E[ jT = t] and gC2 (t) = E[ YjT = t]E[ jT = t] : and so g(t) = gC2 (t) gC1 (t) : (4.4) LetK( ) be a kernel function andbn be a bandwidth sequence such thatbn!0 as n!1. Define weights as WCnj(t) = K( t Tj bn )P n j=1 jK( t Tj bn ) : Then ~gC1n(t) = nX j=1 jWCnj(t)Xj and ~gC2n(t) = nX j=1 jWCnj(t)Yj are strongly consistent estimators of gC1 (t) and gC2 (t), respectively. Now, define eDCn by eDCn ( ) = 1 n nX i=1 ? R( ni( )) n+ 1 ni( ) (4.5) where in( ) = i[(Yi ~gC2n(Ti)) (Xi ~gC1n(Ti)) ]. 
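As an illustration of the complete-case kernel weights and smoothers just defined, the following R sketch (not from the dissertation; the Epanechnikov kernel, which is the choice made later in Section 4.8, stands in for the generic kernel) evaluates the Nadaraya-Watson estimate at a point t0.

## Sketch of the complete-case kernel weights W^C_nj(t0) and the resulting smoother;
## V is either the matrix of X's (giving g~_1n) or the vector of Y's (giving g~_2n),
## and delta is the vector of observation indicators.
epan <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)   # Epanechnikov kernel

g_tilde_C <- function(t0, Tvec, delta, V, bn) {
  k <- delta * epan((t0 - Tvec) / bn)      # only complete cases (delta_j = 1) get weight
  w <- k / sum(k)                          # W^C_nj(t0)
  drop(crossprod(as.matrix(V), w))         # weighted average of the columns of V at t0
}
## e.g. g~_1n at the design points:
## t(sapply(Tvec, g_tilde_C, Tvec = Tvec, delta = delta, V = X, bn = bn))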
Clearly, from the fact that ~gC1n(t) = nX j=1 jWCnj(t)Xj and ~gC2n(t) = nX j=1 jWCnj(t)Yj are strongly consistent estimators of gC1 (t) and gC2 (t) respectively, it can be shown that in( ) !ei( ) in distribution as n!1. Define the rank estimator based on complete data as e C? = Argmin 2B eDCn ( ) : Below, we will study the asymptotic properties of rank estimators of 0, some of which are defined later. The required assumptions are given and discussed in Section 4.4. Theorem 8. Under assumptions (J1) (J7) except (J4), limn!1sup 2B jeDCn ( ) DCn ( )j= 0 49 Proof. In this proof,Lis taken to be an arbitrary positive constant not necessarily the same. By the mean value Theorem, there exists n(i) such that ? R( ni( )) n+ 1 ? R(ei( )) n+ 1 = ?0( n(i)) R( ni( )) n+ 1 R(ei( )) n+ 1 eDCn ( ) DCn ( ) = 1n nX i=1 h ? R( ni( )) n+ 1 ni( ) ? R(ei( )) n+ 1 ei( ) i = 1n nX i=1 h ? R( ni( )) n+ 1 ni( ) + f?0( n(i)) R( ni( )) n+ 1 R(ei( )) n+ 1 ? R( ni( )) n+ 1 gei( ) i Also, by uniform continuity of ? (since ?0 is bounded with bound L> 0) and for fix n, one can choose (n) = (n)L+1 > 0 , such that for (n)!0 as n!1and R( ni( ))n+ 1 R(ei( ))n+ 1 (n) =) ? R( ni( )) n+ 1 ? R(ei( )) n+ 1 < (n): From this, we have, eDCn ( ) DCn ( ) 1n nX i=1 ? R( ni( )) n+ 1 j ni( ) ei( )j + 1n nX i= j?0( n(i))j R( ni( ))n+ 1 R(ei( ))n+ 1 jei( )j From the boundedness of ?, there exists a constant L > 0 such that ? R( ni( )) n+ 1 L. Note that ni( ) ei( ) = i[gn(Ti) g(Ti)] where gn(t) = ~gC2n(t) (~gC1n(t)) . Also, note that sup 2Bjgn(t) g(t)j!0 as n!0. This can be seen from the fact that gn(t) g(t) = (~gC2n(t) gC2 (t)) (~gC1n(t) gC1 (t)) and then, sup 2B jgn(t) g(t)j j~gC2n(t) gC2 (t))j+ sup 2B k kk~gC1n(t) gC1 (t))k!0 (4.6) 50 since ~gC2n(t) gC2 (t)!0 and ~gC1n(t) gC1 (t)!0 w.p. 1 as n!1. This implies that sup 2B eDCn ( ) DCn ( ) Ln nX i=1 sup 2B j ni( ) ei( )j+ (n)Ln nX i= sup 2B jei( )j Now, 1 n nX i=1 sup 2B j ni( ) ei( )j 1n nX i=1 sup 2B jgn(Ti) g(Ti)j = Z 1 0 sup 2B jgn(t) g(t)jdmn(t) where mn(t) = nX i=1 I(Ti t). Applying the Dominated Convergence Theorem, it is easily seen that Z 1 0 sup 2B jgn(t) g(t)jdmn(t)!0 as n!1: On the other hand, by the strong law of large numbers, 1 n nX i= sup 2B jei( )j!E h sup 2B jej i <1: Therefore, lim n!1 sup 2B jeDCn ( ) DCn ( )j= 0 Remark 8. This theorem implies that in the neighborhood of the true parameter 0, e C? and the minimizer of DCn ( ) are equivalent. Under some mild-conditions, ~ C? is a strongly consistent estimator of 0; that is, ~ C? ! 0 w.p.1. Also, ~ C? is a pn-consistent estimator of 0, that is,pn( ~ C? 0) = Op(1). For reference to these facts, please see Hettmansperger and McKean (1998). 51 For i = 1;:::;n, by imputation (j = 1) and inverse probability (j = 2), define eYijn by eYijn = 8 >< >: iYi + 1 i X i e C? +bgCn (Ti) ; if j = 1; i b (Ti)Yi + 1 i b (Ti) X i e C ? +bg C n (Ti) ; if j = 2, where, bgCn (t) = ~gC2n(t) (~gC1n(t)) e C?, b (t) = nX j=1 !nj(t) j, with !nj(t) = ( t Tj hn )P n j=1 ( t Tj hn ) : is a kernel function and hn a bandwidth sequence satisfying hn!0 as n!1. If we set, Yij = 8 >< >: iYi + 1 i X i +g(Ti) ; if j = 1; i (Zi)Yi + 1 i (Zi) X i +g(Ti) ; if j = 2, (4.7) under the MAR assumption, E(Yi1jZi) = E[ iYi + 1 i X i +g(Ti) jZi] = X i +g(Ti): Thus Yi1 = X i + g(Ti) + ei, with E(eijZi) = 0. Also, for the inverse probability case and under the MAR condition, we have E h i (Zi)Yi+ 1 i (Z i) X i +g(Ti) Zi i = E h i (Ti)Yi+ 1 i (T i) X i +g(Ti) Zi i and E h i (Ti)Yi + 1 i (T i) X i +g(Ti) Zi i = X i +g(Ti) : Thus Yi2 = X i +g(Ti) + i, with E( ijZi) = 0. 
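In code, the two imputed responses defined above take the following form; this is only an illustrative R sketch in which the hypothetical arguments `fitted` and `Delta_hat` stand for the complete-case rank fit of the regression function at each observation and for the kernel estimate of the selection probability given T, respectively.

## Sketch of simple imputation (j = 1) and marginal inverse-probability imputation (j = 2).
impute_y <- function(y, delta, fitted, Delta_hat = NULL,
                     type = c("simple", "ipw")) {
  type <- match.arg(type)
  y0 <- ifelse(delta == 1, y, 0)               # missing responses never enter the formula
  if (type == "simple") {
    delta * y0 + (1 - delta) * fitted          # delta_i Y_i + (1 - delta_i) * fitted_i
  } else {
    r <- delta / Delta_hat
    r * y0 + (1 - r) * fitted                  # inverse-probability version
  }
}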
52 Also, taking the conditional expectation of (4.1) with respect to T, we have E[YijTi = t] = E[X ijTi = t] +g(t); (4.8) Let g1(t) = E[XjT = t] and gj2(t) = E[YjT = t] = E[YijjT = t], j = 1;2. This suggests that g(t) = g2(t) g1(t) : (4.9) Let Wnj(t) = M( t Tj kn )P n j=1M( t Tj kn ) : where M is a kernel function and kn is a bandwidth sequence (kn! 0 as n!1). Define ^g1n(t) = nX k=1 Wnk(t)Xk and ^gj2n(t) = nX k=1 Wnk(t)eYkjn. Under some mild conditions, it can be shown that ^g1n(t)!g1(t) and ^gj2n(t)!gj2(t) with probability 1. Theorem 9. Under assumptions (J5) (J9) and Remark 8 jgn(t) ^gCn (t)j!0 as n!1 w:p 1: Proof. jgn(t) ^gCn (t)j = j~gn(t) g(t) +g(t) ^gCn (t)j jgn(t) g(t)j+jg(t) ^gCn (t)j Clearly, from (4.6),jgn(t) g(t)j!0 w.p.1 as n!1. Then, to complete the proof, we just need to show thatjg(t) ^gCn (t)j!0 w.p.1 as n!1. Indeed, g(t) ^gCn (t) = gC2 (t) (gC1 (t)) ~gC2n(t) (~gC1n(t)) 0 = (gC2 (t) ~gC2n(t)) + (~gC1n(t) ~gC1 (t)) 0 + (~gC1 (t)) ( ~ C? 0) 53 Then,jg(t) ^gCn (t)j jgC2 (t) ~gC2n(t)j+k~gC1n(t) ~gC1 (t)kk 0k+k~gC1n(t)kk~ C? 0k. Clearly, jgC2 (t) ~gC2n(t)j! 0 andjgC1 (t) ~gC1n(t)j! 0 w.p.1 by consistency. ~gC1n(t) is bounded since it converges. Then, k~gC1n(t)kk~ C? 0k! 0 w.p.1 since ~ C? 0 ! 0 w.p.1. as n !1. Therefore, jgn(t) ^gCn (t)j!0 as n!1 w:p 1: Set vijn ( ) = eYijn ^gj2n(Ti) Xi ^g1n(Ti) : Now,basedonsimpleimputation,therankestimator b I? of 0 isdefinedby b I? = Argmin 2B DIn( ) with DIn( ) defined by DIn( ) = 1n nX i=1 ? RI(vi1 n ( )) n+ 1 vi1n ( ) ; where RI(vi1n ( )) is the ith rank of vi1n ( ). Now letting SIn( ) = rDIn( ), we have SIn( ) = 1n nX i=1 ? RI(vi1 n ( )) n+ 1 Xi ~g1n(Ti) : (4.10) On the other hand, based on the inverse probability, the rank estimator b IP? of 0, is defined by b IP? = Argmin 2B DIPn ( ) with DIPn ( ) = 1n nX i=1 ? RIP(vi2 n ( )) n+ 1 vi2n ( ) ; where RIP(v2in( )) is the ith rank of vi2n ( ). Letting SIPn ( ) = rDIPn ( ), we have SIPn ( ) = 1n nX i=1 ? RIP(vi2 n ) n+ 1 Xi ~g1n(Ti) (4.11) 54 Note that ^g1n(Ti) is a p-vector and the estimation of gj2, involves all the Yi?s, making fvijn ( 0)gni=1 to be a set of dependent random variables. Also if we set G1in and G2in to denote distribution functions of vi1n ( 0) and vi2n ( 0) respectively, it can be easily shown that G1i(s) = limn!1G1in(s) = limn!1P(vi1n ( 0) s) and G2i(s) = limn!1G2in(s) = limn!1P(vi2n ( 0) s); where G1i and G2i are conditional distribution functions of ei and i given Zi, respectively. 4.4 Assumptions The following assumptions were used above to motivate the theory leading to the def- inition of the rank estimators b I? and b IP? . They will be useful in studying the asymptotic properties of the rank estimators b I? and b IP? . (J1) P(X = X 0) < 1 for any 6= 0, (J2) ? is a bounded, twice continuously differentiable score function with bounded deriva- tives, defined on (0;1), and, satisfying: Z 1 0 ?(u)du = 0 and Z 1 0 ?2(u)du = 1 (J3) The cumulative distribution H of Y given Z is symmetric about 0 and has a corre- sponding density h that is absolutely continuous with finite Fisher information, i.e., I(h) = Z 1 1 hh0(") h(") i2 h(")d"<1. Define ? = Z 1 0 ?(u)?H(u)du,where?H(u) = h 0(H 1(u)) h(H 1(u)) . 55 (J4) DefineX = X ~g1n(T). Assumethat 1nX X! , 1n ~G n ~Gn!G and 1n ~G nX +X ~Gn !B, for some positive definite matrices , G and B. Also, assume that P(kX k cn) = oP(dn) for dn!0 as cn!1: (J5) inft (t) > 0, infz (z) > 0 and the density of T, say m(t), exits and satisfies 0 < inft m(t) sup t m(t) <1: Also ( ) has bounded derivatives. 
(J6) gC1 ( ); gC2 ( ); g1( ), and gj2( ) have continuous derivatives at t and bounded derivatives up to order 2. (J7) The kernels K( ), ( ) and M( ) are regular kernels of order r(> 2). (J8) The bandwidth bn, hn and kn satisfy the following conditions: i:) nknbn!1, nb4rn !0, nk4rn !0 and b2n=kn!0 ii:) nhn!1and nh4rn !0 iii:) C(logn=n) < hn; bn; kn < n, for any C > 0, = 1 2=p, p > 2 and n not necessarily the same for the three bandwidths such that C(logn=n) < n < 1 satisfying n!0 as n!1. (J9) sup t E[kXkpjT = t] <1and sup t E[jYjpjT = t] <1for p> 2. Remark 9. Assumption (J1) stands for the identifiability condition and together with (J2), ensures the consistency of the resulting estimator. (J5), (J6), (J7), (J8) and (J9) are stan- dard assumptions for nonparametric regression problems. Specifically (J5) ensures the non missingnesss with probability 1 anywhere in the domain of Z. Assumption iii:) in (J8) and 56 (J9) ensure the uniform consistency of the Nadaraya Watson estimator used to the true un- known function g, for more discussion see Einmahl and Mason (2005). As pointed out by Xue (2009), the assumption P(kX k cn) = op(dn) for dn ! 0 as cn !1 in (J4) is commonly used for avoiding the boundary problem. This assumption is also used by Zhu and Fang (1996) and Wang and Rao (2002a). For a class of distributions that satisfy this assumption, see discussion in Xue (2009). (J1), (J2), (J3) and (J4) together with others assumptions listed above, ensure the asymptotic normality of the resulting estimator. For practical issue, the optimal bandwidth can be chosen to lie in the interval [a:n 1=5;b:n 1=5] for 0 < a < b <1. See Einmahl and Mason (2005) for more discussion. Most of regular kernels in (J7) satisfy assumption (C6) of Xue (2009) or Wang and Sun (2007). 4.5 Asymptotic Normality Put X = (X1; ;Xn) and ~Gn = (~g1n(T1); ;~g1n(Tn)) two matrices given by X = 0 BB BB BB B@ X11 X1n X21 X2n Xp1 Xpn 1 CC CC CC CA ~Gn = 0 BB BB BB B@ ~g11n(T1) ~g1pn(T1) ~g11n(T2) ~g1pn(T2) ~g11n(Tn) ~g1pn(Tn) 1 CC CC CC CA : For simplicity, set Dln = DIn, Sln = SIn for l = I and Dln = DIPn , Sln = SIPn for l = IP. Now, as discussed in Hettmansperger and McKean (1998), if we set Mln( ) = (2 ?) 1X X ( 0) ( 0) Sln( 0) +Dln( 0); we obtain the following result called asymptotic quadraticity which was proved by Jaeckel (1972). 57 Theorem 10. Under assumptions (J1) (J6),8 > 0 and C > 0, limn!1P 0 h max k 0k Cpn jDln( ) Mln( )j i = 0 This result provides a quadratic approximation ofDln byMln and leads to the asymptotic linearity derived by Jureckova (1971) and is displayed as follows. Theorem 11. Under (J1) (J9), for any C > 0 and > 0, limn!1P 0 h sup k 0k n 1=2C n 1=2[Sln( ) Sln( 0)] +n 1=2 1? X X ( 0) i = 0 Proofs of Theorem 10 and Theorem 11 can be constructed in a straightforward manner along the lines discussed in Hettmansperger and McKean (1998) for the linear model setting. Therefore, for the sake of brevity, they will not be included here. Theorem 12. Under assumptions (J1) (J9),pnSln( 0) Np(0; jn) D !Np(0;Vj) where j = 1 for l = I and j = 2 for l = IP. Vj = lim n!1 jn. Before giving the proof of Theorem 12, consider the following notation from Brunner and Denker (1994). Set in = Xi ^g1n(Ti); Jjn(s) = 1n nX i=1 Gjin(s); ^Jjn(s) = 1n nX i=1 I(vi1n ( 0) s); Fjn(s) = 1n nX i=1 inGjin(s); ^Fjn(s) = 1n nX i=1 inI(vijn ( 0) s) and Tln( 0) = S l n( 0) (n) E hSl n( 0) (n) i : Lemma 5 (Brunner and Denker (1994)). 
Suppose that M0n m1(n) M1n for some constants 0 0, and that jn Cna for some constant a; C2R, 58 where Ujn = Z ?(Jjn(s))( ^Fjn Fjn)(ds) + Z ?0(Jjn(s))( ^Jjn(s) Jjn(s))Fjn(ds): and jn is the minimum eigenvalue of Var(Ujn). Then nW 1jnTln( 0) is asymptotically stan- dard multivariate normal, provided?is twice continuously differentiable, with bounded second derivative and < (a+ 1)=2. Proof of Theorem 12. Assumethat max 1 i n k ink= (n) <1. Fromthesettingdefinedabove, Sln( 0) = 1n nX i=1 ? Rl(vij n ( 0)) n+ 1 Xi ^g1n(Ti) = Z ? n n+ 1 ^Jjn dFjn Now define n = n2E[(Tln( 0))(Tln( 0)) ]; Bjn = Z ( ^Fjn Fjn)d?(Jjn) + Z ( ^Jjn Jjn)dFjndJ jn d?(Jjn) and Wjn = n2Var(Bjn). From the fact that 0 = Argmin 2B E(Dln( )), it can be seen that E[Sln( 0)] = 0. This implies that E[Tln( 0)] = 0 and Tln( 0) = S l n( 0) (n) . Now under assump- tion (J2) and by Brunner and Denker (1994), for Theorem 12 to hold, it suffices to check if conditions of Lemma 5 are satisfied. To that end, the fact that Var("ijZi) > 0, there exists > 0 such that the minimum eigenvalue of Var(Bjn), say jn satisfies jn > nb, for 0 < b < 1=2. This is obtained under the assumption that &jn=n!1 putting jn !1 as n !1, see Brunner and Denker (1994) for more discussion. Again by Brunner and Denker (1994), Ujn = nBjn and then, Var(Ujn) = n2Var(Bjn). Thus, jn = n2 jn n2+b. Hence, putting a = 2 + b, = 1, M0 = M1 and C = , conditions of Lemma 5 are satis- fied. This, shows that nW 1jnTln( 0) is asymptotically multivariate standard normal. There- fore pnSln( 0) is asymptotically multivariate normal with mean 0 and covariance matrix jn = (n)pn Wjn. 59 Remark 10. For l = I or l = IP, note that by definition of b l?, Sln(b l?) = 0. Also, from the fact that Mln is a quadratic function of , it is uniquely minimized by ~ l? = ? X X 1Sln( 0). Assumption (J4) ensures that 1n X X ! as n!1 for some positive definite matrix . Then, under the assumptions of Theorem 11, pn(b l? ~ l?) = op(1). The proof of this claim can also be constructed along the lines given in Hettmansperger and McKean (1998). Moreover, conditioning on Z and using the SLLN, SIn( 0)!0 w:p:1. Thefollowingtheoremgivesapracticalwayofestimatingthecovariancematrix jn = (n)pn Wjn. Theorem 13. Suppose that assumptions (J1) (J9) hold and let j = 1 for l = I and j = 2 for l = IP. Define bZlj = nX i=1 in? R(vijn (b l?)) n+ 1 + 1n nX i=1 in?0 R(vijn (b l?)) n+ 1 R(vijn (b l?)) = nSln(b l?) + 1n nX i=1 in?0 R(vijn (b l?)) n+ 1 R(vijn (b l?)) = 1n nX i=1 in?0 R(vijn (b l?)) n+ 1 R(vijn (b l?)); since Sln(b l?) = 0. Then, cWjn = [bZlj E(Zlj)][bZlj E(Zlj)] !Wjn (positive definite) in the L2-norm as n!1, where Zlj = n Z ?(Jjn(t)) ^Fjn(dt) + Z ?0(Jjn(t)) ^Jjn(t)Fjn(dt) = nSln( 0) + 1n nX i=1 in?0 R(vij n ( 0)) n+ 1 R(vijn ( 0)) Proof. By Theorem 4.1 of Brunner and Denker (1994), to prove this theorem it is sufficient to prove that limn!1 n& jn = 0, where &jn represents the minimum eigenvalue of Wjn. This is obtained from the fact that jn nb for 0 < >: (z)H(u); if j = 1; (z) (t) H(u); if j = 2. For either j = 1 or j = 2, considering the change of variables, say, s = (z)H(u) or s = (z) (t) H(u), we get Z ?0(Gj(u))dGj(u) = Z ?0(s)ds Hence, E[Zlj] nX i=1 in Z ?0(s)ds Theorem 14. Under assumptions (J1) (J6), we have pn(b l ? 0) D !N(0; 2 ? 1 j ): where j = 1Vj 1 : The proof of this Theorem is obtained by combining results of Theorem 11, Lemma 12 and the discussion in Remark 10. 61 4.6 Estimation of the function g As it can seen from the discussion in section 2, function g is not fully estimated. 
In estimating 0, the first step of the estimation of g was used by setting it as a known function of t but unknown as a function of throughout the estimates of g1 and gj2. As soon as the estimated values b ? of 0 are obtained, one can now obtain the estimated function ^gn of g accordingly to the estimated value of 0 by putting: ^gn(t) = ^gj2n(t) ^g1n(t)b l? where j = 1 for l = I and j = 2 for l = IP. Theorem 15. Under assumptions (J1) (J9), we have ^gn(t) g(t) = o(1) with probability 1. Proof. By definition of ^gn(t), we have ^gn(t) g(t) = ^gj2n(t) g2(t) ^g1n(t) g1(t) (b ? 0) (g1(t)) (b ? 0) ^g1n(t) g1(t) 0 This implies that, j^gn(t) g(t)j j^gj2n(t) g2(t)j+k^g1n(t) g1(t)kkb ? 0k+k(g1(t))kkb ? 0k+k^g1n(t) g1(t)kk 0k 62 We continue the proof with j = 1 and similar argument can be used to derive that of j = 2. By definition of ^g12n(t), ^g12n(t) g2(t) = nX i=1 Wni(t)eYi1n g2(t) = nX i=1 Wni(t)[ iYi + 1 i X i e C? +bgCn (Ti) g2(t)] = nX i=1 Wni(t)(Yi1 g2(t)) + nX i=1 Wni(t)(1 i)X i ( ~ C? 0) + nX i=1 Wni(t)(1 i)(^gCn g(t)): Note that E[Yi1jTi = t] = g2(t) and E[k(1 i)XikjTi] <1. Then, under mild-conditions, we have,j nX i=1 Wni(t)Yi1 g2(t)j!0 w:p:1 and by Theorem 9, j^gCn g(t)j!0 w:p:1: Also, conditioning on T and using the SLLN, we have nX i=1 Wni(t)(1 i)Xi!E[k(1 )XkjT] w:p:1: Hence, nX i=1 Wni(t)(1 i)Xi = O(1). From the fact that ~ C? 0 = o(1), we have ^g12n(t) g2(t)!0 w:p:1: Using the same argument, it can be shown that ^g1n(t) g1(t)!0 w:p:1. Combining these facts with Remark 10 completes the proof. 63 4.7 Bandwidth Selection An important issue when dealing with kernel estimation, is the selection of an appro- priate bandwidth sequence. In the nonparametric regression literature, this problem has been studied extensively. The common approach used in selecting the bandwidth, is the delete-one cross-validation rule. This approach consists in minimizing the "leave-one-out" version of objective function evaluated at the estimator of 0 as a function the bandwidth h. That is, bn, hn and kn may be obtained by minimizing nX i=1 ? R(^u i(h)) n+ 1 ^u i(h); where ^u i(h) = 8 >>> >< >>> >: i;n( ~ C?;h); for bn; v i;jn (b l?;h); for kn . i b i(Ti;h); for for hn. with i;n( ~ C?;h), v i;jn (b l?;h) and i b i(Ti;h) beingthe"leave-one-out"versionsof i;n( ~ C? ), vijn (b l?), i b (Ti), respectively, in which bn; kn; hn are replaced by h. 4.8 Simulation To fully understand the optimality properties of the proposed approach, we will consider the finite sample behavior of the estimator. To do so, a simulation study is conducted. The model used in the simulation is given by Y = 0x+g(T) +" withxgenerated from a normal distribution with mean 1 and variance 1. The random errors " are generated from the contaminated normal distribution CN( ; ) = (1 )N(0;1) + 64 N(0; 2) with different degrees of contamination and the t-distribution with various degrees of freedom. The kernels K( ) and M( ) were taken to be the Epanechnikov Kernel K(t) = M(t) = 0:75(1 t2)I(jtj 1): Based on assumption (J8), all the three bandwidths bn, kn and hn were taken to be propor- tional to n 1=5. Two simulation scenarios were considered. In Scenario 1, the true g was taken to be 0 and T = in2, for i = 1; ;n and for z = (x;t), three cases were considered: Case 1: (z) = 0:8 + 0:2jx 1jifjx 1j 1, and 0:95 elsewhere. Case 2: (z) = 0:9 0:2jx 1jifjx 1j 4:5, and 0:1 elsewhere. Case 3: (z) is taken to be a sequence of proportions of missingness starting from 0% to 50% with steps of 10%. 
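Before describing Scenario 2, here is a minimal R sketch of how a single data set under Scenario 1, Case 1 might be generated; the sample size and the true value of the regression coefficient are illustrative assumptions, since they are not fixed in the description above.

## Sketch: Scenario 1, Case 1. g is identically 0, T_i = i/n^2, contaminated normal
## errors CN(rho, 3) = (1 - rho) N(0,1) + rho N(0, 9), and responses missing at random
## with selection probability Delta(z) as in Case 1.
sim_scenario1_case1 <- function(n = 100, beta0 = 1, rho = 0.1) {
  x <- rnorm(n, mean = 1, sd = 1)
  tt <- (1:n) / n^2
  eps <- ifelse(rbinom(n, 1, rho) == 1, rnorm(n, sd = 3), rnorm(n))
  y <- beta0 * x + eps                                    # true g is 0 in this scenario
  Delta <- ifelse(abs(x - 1) <= 1, 0.8 + 0.2 * abs(x - 1), 0.95)
  delta <- rbinom(n, 1, Delta)                            # MAR indicator
  y[delta == 0] <- NA                                     # responses missing at random
  data.frame(y, x, t = tt, delta)
}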
In Scenario 2, the true $g$ was taken to be $g(t) = (\sin(2\pi t^2))^{1/3}$ and $T$ is generated from the uniform distribution $U[0, 1/4]$. One more case is added to the three previous ones.

Case 4: $\Delta(z) = 0.9 - 0.2(|x-1| + |t-0.5|)$ if $|x-1| + |t-0.5| \le 1.5$, and $0.8$ elsewhere.

In all the cases, $\delta$ was generated from the Bernoulli($\Delta(z)$) distribution. Under both scenarios, $\beta$ is estimated using simple imputation (SI) and based on the inverse probability procedure, where the estimate of $\Delta(t)$ is constructed in two ways: first, as defined above, using a kernel (Ker), with the kernel taken to be $M(t)$; and second, by taking it to be the logistic function (Log) given by $1/(1 + e^{-t})$, as in Müller (2009). From 5000 simulations, the MSEs of the resulting rank (R) estimator of $\beta_0$ are reported and compared to those of the least squares (LS) estimator of $\beta_0$. Figures 4.1-4.17 contain the results of the simulation experiments.

4.9 Results and Discussion

The results of the simulation experiment are described below.

[Figure 4.1: Scenario 1, Case 1: MSE vs proportion of contamination and t degrees of freedom (Ker, Log, and SI panels; LS and R curves).]
[Figure 4.2: Scenario 1, Case 2: MSE vs proportion of contamination and t degrees of freedom.]
[Figure 4.3: Scenario 1, Case 3: MSE vs proportion of missing data for SI under the contaminated normal error distribution.]
[Figure 4.4: Scenario 1, Case 3: MSE vs proportion of missing data for SI under the t error distribution.]
[Figure 4.5: Scenario 1, Case 3: MSE vs proportion of missing data for inverse probability (Ker) under the contaminated normal error distribution.]
[Figure 4.6: Scenario 1, Case 3: MSE vs proportion of missing data for inverse probability (Ker) under the t error distribution.]
[Figure 4.7: Scenario 1, Case 3: MSE vs proportion of missing data for inverse probability (Log) under the contaminated normal error distribution.]
[Figure 4.8: Scenario 1, Case 3: MSE vs proportion of missing data for inverse probability (Log) under the t error distribution.]
[Figure 4.9: Scenario 2, Case 1: MSE vs proportion of contamination and t degrees of freedom.]
[Figure 4.10: Scenario 2, Case 2: MSE vs proportion of contamination and t degrees of freedom.]
[Figure 4.11: Scenario 2, Case 3: MSE vs proportion of missing data for SI under the contaminated normal error distribution.]
[Figure 4.12: Scenario 2, Case 3: MSE vs proportion of missing data for SI under the t error distribution.]
[Figure 4.13: Scenario 2, Case 3: MSE vs proportion of missing data for inverse probability (Ker) under the contaminated normal error distribution.]
[Figure 4.14: Scenario 2, Case 3: MSE vs proportion of missing data for inverse probability (Ker) under the t error distribution.]
[Figure 4.15: Scenario 2, Case 3: MSE vs proportion of missing data for inverse probability (Log) under the contaminated normal error distribution.]
[Figure 4.16: Scenario 2, Case 3: MSE vs proportion of missing data for inverse probability (Log) under the t error distribution.]

Scenario 1:

Case 1: For the case of the CN distribution, the R estimator becomes more and more superior to LS as the proportion of contamination increases. The same is true under the t distribution, with R becoming superior to LS as the degrees of freedom decrease (heavier tails). The MSEs are generally larger for the logistic case under the contaminated normal but smaller under the t distribution.

Case 2: Once again, R is increasingly superior to LS as the contamination increases or as the tail thickness of the distribution increases. For the CN distribution, (SI) generally does worse than (Log) and (Ker), and for the t distribution (Log) does worse than (Ker) and (SI).

Case 3: Under (SI), R performs comparably to LS under normality (CN(0, 3)) or tails close to those of the normal distribution (CN(.01, 3) and $t_{30}$) but, as in the previous cases, does much better under heavy tails or larger contamination. What is additionally notable under (Ker) and (Log) is that R consistently outperforms LS when the proportion of missing data is large, even under normality.

Scenario 2:

Case 1: Similar to Scenario 1, Case 1.

Case 2: Similar to Scenario 1, Case 2.

Case 3: Similar patterns are observed as in Case 3 of Scenario 1. There are, however, some notable differences. Under Scenario 1, for the (SI) case, R and LS were comparable under normality; under Scenario 2, LS is clearly superior to R. Under Scenario 1, for the (Ker) case, R does better than LS as the proportion of missing data increases. This is no longer the case as the t degrees of freedom increase, where R and LS are comparable. A similar observation can be made for the (Log) case.

Case 4: Similar to Case 2.

The first take-home message is that R performs better under heavy-tailed error distributions and cases containing contamination. It is generally comparable to LS under normal error. The second take-home message is that R with inverse probability imputation does better than its LS counterpart when the proportion of missing data is large. This makes R estimation extremely appealing for situations where we encounter high rates of missing information.

We finish this section by giving an alternative approach to defining rank estimators under responses missing at random. The theory follows from the above theorems in a (mostly) straightforward manner.

4.10 Empirical log-likelihood approach

Setting
$$\eta_{ij}(\beta) = \varphi\Big(\frac{R^l(v^{ij}_n(\beta))}{n+1}\Big)\big(X_i - \hat g_{1n}(T_i)\big)$$
for $j = 1, 2$, one can define the empirical log-likelihood function of $\beta$ as
$$L_{ijn}(\beta) = -2\,\sup\Big\{\sum_{i=1}^{n}\log(np_i)\ :\ (p_1,\ldots,p_n)\in(0,1)^n,\ \sum_{i=1}^{n}p_i = 1,\ \sum_{i=1}^{n}p_i\,\eta_{ij}(\beta) = 0\Big\}, \qquad j = 1, 2.$$
Now take $\lambda \in \mathbb{R}^p$ for which the following equation holds:
$$\frac{1}{n}\sum_{i=1}^{n}\frac{\eta_{ij}(\beta)}{1 + \lambda'\eta_{ij}(\beta)} = 0. \qquad (4.12)$$
From this, $L_{ijn}(\beta)$ can be rewritten as
$$L_{ijn}(\beta) = 2\sum_{i=1}^{n}\log\big(1 + \lambda'\eta_{ij}(\beta)\big). \qquad (4.13)$$
Consider the matrix $A_{nj}$ defined by $A_{nj} = \frac{1}{n}\sum_{i=1}^{n}\eta_{ij}(\beta_0)\,\eta_{ij}(\beta_0)'$.
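A simplified R sketch of how one might profile (4.13) at a trial value of $\beta$ is given below; `eta` denotes the $n \times p$ matrix whose rows are the $\eta_{ij}(\beta)$, and the few Newton steps used to solve (4.12) are only a stand-in for a careful solver (a real implementation would also keep $1 + \lambda'\eta_i$ positive, e.g. via Owen's modified logarithm).

## Sketch: solve (4.12) for lambda by Newton's method and return L_ijn(beta) from (4.13).
## No safeguards (step halving, positivity of 1 + lambda'eta) are included.
el_loglik <- function(eta, iters = 25) {
  n <- nrow(eta)
  lambda <- rep(0, ncol(eta))
  for (k in seq_len(iters)) {
    denom <- drop(1 + eta %*% lambda)
    score <- colMeans(eta / denom)              # left-hand side of (4.12)
    jac   <- -crossprod(eta / denom) / n        # derivative of the score in lambda
    lambda <- lambda - solve(jac, score)        # Newton update
  }
  2 * sum(log(drop(1 + eta %*% lambda)))        # L_ijn(beta), equation (4.13)
}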
Clearly, from the fact that $E(\varepsilon_i\,|\,Z_i) = 0$ and $E(\varepsilon_i^2\,|\,Z_i) < \infty$, $A_{nj} \to V_j$ as $n \to \infty$. Based on Theorem 12, we have $\max_{1\le i\le n}\|\eta_{ij}(\beta_0)\| = o_P(c_n)$ for some $c_n$. Also, $\varphi$ being bounded implies that there exists a positive constant $C$ such that one may take $c_n = C\gamma(n)$, with $\gamma(n) = \max_{1\le i\le n}\|X_i - \hat g_{1n}(T_i)\|$ as in the proof of Theorem 12. Hence, one can obtain the asymptotic distribution of $L_{ijn}(\beta)$ at the true parameter. This is given in the following theorem.

Theorem 16. Under the assumptions (J1)-(J9), we have $L_{ijn}(\beta_0) \stackrel{D}{\to} \chi^2_p$.

The proof of this theorem is obtained by putting together Theorem 12 and Slutsky's Lemma.

Remark 11. Note that, considering the right-hand side of (4.13) as a function of $\lambda$, performing a first-order Taylor expansion at $\eta_{ij}(\beta_0)$, and using (4.12), we obtain
$$L_{ijn}(\beta_0) = \frac{1}{n}\Big(\sum_{i=1}^{n}\eta_{ij}(\beta_0)\Big)'A_{nj}^{-1}\Big(\sum_{i=1}^{n}\eta_{ij}(\beta_0)\Big)\,(1 + o_p(1)).$$
Setting
$$\tilde L_{ijn}(\beta_0) = \frac{1}{n}\Big(\sum_{i=1}^{n}\eta_{ij}(\beta_0)\Big)'A_{nj}^{-1}\Big(\sum_{i=1}^{n}\eta_{ij}(\beta_0)\Big) = n\,S^l_n(\beta_0)'A_{nj}^{-1}S^l_n(\beta_0),$$
we see that for $n$ large enough, $L_{ijn}(\beta_0) \approx \tilde L_{ijn}(\beta_0)$. Hence, $\tilde L_{ijn}(\beta_0) \stackrel{D}{\to} \chi^2_p$. Now putting $\beta_* = \mathrm{Argmax}_{\beta\in B}\{-L_{ijn}(\beta)\}$, the maximum empirical likelihood estimator of $\beta_0$, we have the following theorem.

Theorem 17. Under the assumptions (J1)-(J9),
$$\sqrt{n}(\beta_* - \beta_0) \stackrel{D}{\to} N\big(0,\ \tau_\varphi^2\,\Sigma_j\big),$$
with $\Sigma_j$ as in Theorem 14.

[Figure 4.17: Scenario 2, Case 4: MSE vs proportion of contamination and t degrees of freedom.]

Bibliography

Abebe, A. and McKean, J. W. (2007). Highly efficient nonlinear regression based on the Wilcoxon norm. In Umbach, D., editor, Festschrift in Honor of Mir Masoom Ali, pages 340-357.
Adams, R. A. (1975). Sobolev Spaces. Academic Press, New York-London.
Brezis, H. (1983). Analyse fonctionnelle: Théorie et applications. Masson, Paris.
Brunner, E. and Denker, M. (1994). Rank statistics under dependent observations and applications to factorial designs. Journal of Statistical Planning and Inference, 42(3):353-378.
Chang, W. H., McKean, J. W., Naranjo, J. D., and Sheather, S. J. (1999). High-breakdown rank regression. J. Amer. Statist. Assoc., 94(445):205-219.
Chen, S. X. and Hall, P. (1993). Smoothed empirical likelihood confidence intervals for quantiles. The Annals of Statistics, 21(3):1166-1181.
Cheng, P. E. (1994). Nonparametric estimation of mean functionals with data missing at random. Journal of the American Statistical Association, 89(425):81-87.
Coakley, C. W. and Hettmansperger, T. P. (1993). A bounded influence, high breakdown, efficient regression estimator. J. Amer. Statist. Assoc., 88(423):872-880.
Einmahl, U. and Mason, D. M. (2005). Uniform in bandwidth consistency of kernel-type function estimators. The Annals of Statistics, 33(3):1380-1403.
Gay, D. M. (1983). ALGORITHM 611: subroutines for unconstrained minimization using a model/trust-region approach. ACM Trans. Math. Software, 9(4):503-524.
Gay, D. M. (1984). A trust-region approach to linearly constrained optimization. In Griffiths, D. F., editor, Numerical Analysis. Proceedings, Dundee 1983, pages 72-105. Springer-Verlag. Lecture Notes in Mathematics 1066.
Giltinan, D. M., Carroll, R. J., and Ruppert, D. (1986). Some new estimation methods for weighted regression when there are possible outliers. Technometrics, 28(3):219-230.
Gray, R. J. (1994).
Spline-based tests in survival analysis. Biometrics, 50(3):pp. 640?652. Green, P. J. and Silverman, B. W. (1994). Nonparametric regression and generalized linear models, volume 58 of Monographs on Statistics and Applied Probability. Chapman & Hall, London. A roughness penalty approach. H?jek, J. and ?id?k, Z. (1967). Theory of rank tests. Academic Press, New York. Hall, P. and La Scala, B. (1990). Methodology and algorithms of empirical likelihood. International Statistical Review, 58(2):109?127. Hampel, F. R. (1974). The influence curve and its role in robust estimation. J. Amer. Statist. Assoc., 69:383?393. Haupt, H. and Oberhofer, W. (2009). On asymptotic normality in nonlinear regression. Statist. Probab. Lett., 79(6):848?849. Healy, M. and Westmacott, M. (1956). Missing values in experiments analysed on automatic computers. Journal of the Royal Statistical Society. Series C (Applied Statistics), 5(3):pp. 203?206. Heckman, N. E. (1986). Spline smoothing in a partly linear model. Journal of the Royal Statistical Society. Series B (Methodological), 48(2):pp. 244?248. Hettmansperger, T. P. and McKean, J. W. (1998). Robust nonparametric statistical methods, volume 5 of Kendall?s Library of Statistics. Edward Arnold, London. 79 Hettmansperger, T. P., McKean, J. W., and Sheather, S. J. (2000). Robust nonparametric methods. J. Amer. Statist. Assoc., 95(452):1308?1312. Hubert, M., Rousseeuw, P. J., and Van Aelst, S. (2008). High-breakdown robust multivariate methods. Statist. Sci., 23(1):92?119. Jaeckel, L. A. (1972). Estimating regression coefficients by minimizing the dispersion of the residuals. Ann. Math. Statist., 43:1449?1458. Jennrich, R. I. (1969). Asymptotic properties of non-linear least squares estimators. Ann. Math. Statist., 40:633?643. Jureckova, J. (1971). Nonparametric estimate of regression coefficients. The Annals of Mathematical Statistics, 42(4):pp. 1328?1338. Jure?kov?, J. (2008). Regression rank scores in nonlinear models. In Beyond parametrics in interdisciplinary research: Festschrift in honor of Professor Pranab K. Sen, volume 1 of Inst. Math. Stat. Collect., pages 173?183. Inst. Math. Statist., Beachwood, OH. Kitamura, Y. (1997). Empirical likelihood methods with weakly dependent processes. The Annals of Statistics, 25(5):pp. 2084?2102. Krasker, W. S. and Welsch, R. E. (1982). Efficient bounded-influence regression estimation. J. Amer. Statist. Assoc., 77(379):595?604. Little, R.J.A.andRubin, D.B.(1987). Statistical analysis with missing data. WileySeriesin Probability and Mathematical Statistics: Applied Probability and Statistics. John Wiley & Sons Inc., New York. M?ller, U. U. (2009). Estimating linear functionals in nonlinear regression with responses missing at random. Ann. Statist., 37(5A):2245?2277. Naranjo, J. D. and Hettmansperger, T. P. (1994). Bounded influence rank regression. J. Roy. Statist. Soc. Ser. B, 56(1):209?220. 80 Owen, A. (1990). Empirical likelihood ratio confidence regions. The Annals of Statistics, 18(1):pp. 90?120. Peng, L. (2004). Empirical-likelihood-based confidence interval for the mean with a heavy- tailed distribution. The Annals of Statistics, 32(3):pp. 1192?1214. R Development Core Team (2009). R: A Language and Environment for Statistical Com- puting. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. Rice, J. (1986). Convergence rates for partially splined models. Statistics & Probability Letters, 4(4):203 ? 208. Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). 
Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Asso- ciation, 89(427):pp. 846?866. Robinson, P. M. (1988). Root-n-consistent semiparametric regression. Econometrica, 56(4):pp. 931?954. Rousseeuw, P. J. (1984). Least median of squares regression. J. Amer. Statist. Assoc., 79(388):871?880. Rousseeuw, P. J. (1985). Multivariate estimation with high breakdown point. In Grossmann, W., Pflug, G., Vincze, I., andWertz, W., editors,Mathematical Statistics and Applications, volume 8, pages 283?297. Reidel, Dordrechet. Rousseeuw, P. J. and Van Driessen, K. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41(3):212?223. Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3):581?592. Serfling, R. J. (1980). Approximation theorems of mathematical statistics. John Wiley & Sons Inc., New York. Wiley Series in Probability and Mathematical Statistics. 81 Sievers, G. L. (1983). A weighted dispersion function for estimation in linear models. Comm. Statist. A?Theory Methods, 12(10):1161?1179. Simpson, D. G., Ruppert, D., and Carroll, R. J. (1992). On one-step gm estimates and stability of inferences in linear regression. Journal of the American Statistical Association, 87(418):pp. 439?450. Speckman, P. (1988). Kernel smoothing in partial linear models. Journal of the Royal Statistical Society. Series B (Methodological), 50(3):pp. 413?436. Stromberg, A. J. (1993). Computation of high breakdown nonlinear regression parameters. Journal of the American Statistical Association, 88(421):pp. 237?244. Stromberg, A. J. (1995). Consistency of the least median of squares estimator in nonlinear regression. Comm. Statist. Theory Methods, 24(8):1971?1984. Stromberg, A. J. and Ruppert, D. (1992). Breakdown in nonlinear regression. J. Amer. Statist. Assoc., 87(420):991?997. Sun, Z., Wang, Q., and Dai, P. (2009). Model checking for partially linear models with missing responses at random. J. Multivar. Anal., 100(4):636?651. Tableman, M. (1990). Bounded-influence rank regression: a one-step estimator based on Wilcoxon scores. J. Amer. Statist. Assoc., 85(410):508?513. van der Vaart, A. W. (1998). Asymptotic statistics, volume 3 of Cambridge Series in Statis- tical and Probabilistic Mathematics. Cambridge University Press, Cambridge. Van Zwet, W. R. (1980). A strong law for linear functions of order statistics. The Annals of Probability, 8(5):pp. 986?990. Wang, C. Y., Wang, S., Zhao, L.-P., and Ou, S.-T. (1997). Weighted semiparametric estima- tion in regression analysis with missing covariate data. Journal of the American Statistical Association, 92(438):pp. 512?525. 82 Wang, J. (1996). Asymptotics of least-squares estimators for constrained nonlinear regres- sion. Ann. Statist., 24(3):1316?1326. Wang, J. D. (1995). Asymptotic normality of L1-estimators in nonlinear regression. J. Multivariate Anal., 54(2):227?238. Wang, Q., Linton, O., and H?rdle, W. (2004). Semiparametric regression analysis with missing response at random. Journal of the American Statistical Association, 99(466):pp. 334?345. Wang, Q. and Rao, J. N. K. (2002a). Empirical likelihood-based inference under imputation for missing response data. The Annals of Statistics, 30(3):pp. 896?924. Wang, Q. and Rao, J. N. K. (2002b). Empirical likelihood-based inference under imputation for missing response data. Ann. Statist., 30(3):896?924. Dedicated to the memory of Lucien Le Cam. Wang, Q. and Sun, Z. (2007). 
Estimation in partially linear models with missing responses at random. Journal of Multivariate Analysis, 98(7):1470 ? 1493. Wiens, D. and Zhou, J. (1994). Bounded-influence rank estimation in the linear model. Canad. J. Statist., 22(2):233?245. Wu, C.-F. (1981). Asymptotic theory of nonlinear least squares estimation. Ann. Statist., 9(3):501?513. Xue, L.(2009). Empiricallikelihoodconfidenceintervalsforresponsemeanwithdatamissing at random. Scandinavian Journal of Statistics, 36(4):671?685. Xue, L. and Zhu, L. (2007a). Empirical likelihood for a varying coefficient model with longitudinal data. Journal of the American Statistical Association, 102(478):642?654. Xue, L. and Zhu, L. (2007b). Empirical likelihood semiparametric regression analysis for longitudinal data. Biometrika, 94(4):921?937. 83 Xue, L.-G. and Zhu, L. (2006). Empirical likelihood for single-index models. Journal of Multivariate Analysis, 97(6):1295 ? 1312. Yohai, V.J.(1987). Highbreakdown-pointandhighefficiencyrobustestimatesforregression. Ann. Statist., 15(2):642?656. Zeger, S. L. and Diggle, P. J. (1994). Semiparametric models for longitudinal data with application to cd4 cell numbers in hiv seroconverters. Biometrics, 50(3):pp. 689?699. Zhao, L. P., Lipsitz, S., and Lew, D. (1996). Regression analysis with missing covariate data using estimating equations. Biometrics, 52(4):pp. 1165?1182. Zhu, L.-X. and Fang, K.-T. (1996). Asymptotics for kernel estimate of sliced inverse regres- sion. The Annals of Statistics, 24(3):pp. 1053?1068. 84