Rank-Based Regression for Nonlinear and Missing Response Models
by
Huybrechts Frazier Achard Bindele
A dissertation submitted to the Graduate Faculty of
Auburn University
in partial fulfillment of the
requirements for the Degree of
Doctor of Philosophy
Auburn, Alabama
August 4, 2012
Keywords: Asymptotic normality, bounded influence, signed-rank norm, Sobolev space,
strong consistency, missing at random.
Copyright 2012 by Huybrechts Frazier Achard Bindele
Approved by
Asheber Abebe, Chair, Associate Professor of Mathematics & Statistics
Geraldo S. de Souza, Professor of Mathematics & Statistics
Peng Zeng, Associate Professor of Mathematics & Statistics
Mark D. Carpenter, Professor of Mathematics & Statistics
Abstract
This dissertation is mainly concerned with the rank-based estimation of model pa-
rameters in complex regression models: a general nonlinear regression model and a semi-
parametric regression model with missing responses. For the estimation of nonlinear regres-
sion parameters, we consider weighted generalized-signed-rank estimators. The generaliza-
tion allows us to study rank estimators as well as popular estimators such as the least squares
and least absolute deviations estimators. However, the generalization by itself does not give
bounded influence estimators. Added weights provide estimators with bounded influence
function. We establish conditions needed for the consistency and asymptotic normality of
the proposed estimator and discuss how weight functions can be chosen to achieve bounded
influence function of the estimator. Real life examples and Monte Carlo simulation experi-
ments demonstrate that the proposed estimator is robust, efficient, and useful in detecting
outliers in nonlinear regression. For the estimation of the linear regression parameter of a
semi-parametric model with missing response, we propose imputed rank estimators under
simple imputation and imputation by inverse probability. It is shown that these rank estimators have favorable asymptotic properties. Moreover, it is demonstrated that the rank
estimators perform better than the classical least squares estimator under heavy-tailed error
distributions and in the presence of contamination, while remaining generally comparable to
the least squares estimator under normal errors. Rank estimators with inverse probability
imputation are also superior to their least squares counterparts when the proportion of
missing data is large. This makes rank estimation especially appealing in situations with
high rates of missing information.
Acknowledgments
Let me start by thanking God for the strength, the health and the knowledge given to
me throughout all these years and simply throughout my life. Without all your blessings, I
could never have accomplished this long journey.
This research project would not have been possible without the support of many people.
I wish to express my sincere gratitude to my supervisor, Dr. Ash Abebe, who was abundantly
helpful and offered invaluable assistance, support, and guidance. Deepest gratitude is also
due to the members of the supervisory committee, Drs. Geraldo de Souza, Peng Zeng, and
Mark Carpenter, without whose knowledge and assistance this study would not have been
successful.
I dedicate this dissertation to my parents, Philippe Bindele and Suzanne Mbeke, who
have always supported me and made many sacrifices for my success.
Special thanks to my grandmother Therese Mayena and her husband Andre Moumpangou
who, though they did not live long enough to see this achievement, were key to my success,
providing all the advice and encouragement my education required. Another special thanks
to my uncle Cyrille Ngayi Mvouembe, who has always stood up for me in all educational
matters.
I would also like to express my profound gratitude to the department chair and graduate
chairs of the Department of Mathematics and Statistics, Drs. Michel Smith, Chris Rodger,
and Paul Schmidt, for all their support and encouragement during my graduate student life
at Auburn University. At the same time, I would like to thank our department staff, Lori
Bell, Gwen Kirk, and Carolyn Donegan, for the tremendous help they provided when needed.
I wish to express my gratitude to the following:
Prof. Charles Chidume, whose support, advice, and encouragement made my journey to
the US possible.
All professors who taught me during my course work, for the marvelous job they did, particularly Drs. Wanxian Shen, Jersy Szulga, Erkan Nane, Geraldo de Souza, Asheber Abebe,
Mark Carpenter, Trevor Park, and Peng Zeng. Also all beloved faculty members such as Drs.
Greg Harris, Nedret Billor, Overtoun Jenda, Peter Johnson, Olav Kallenberg, Amnon Meir,
Andreas Bezdek, Tin-Yao Tam, Bertram Zinner, Frank Uhlig, and Narendra Govil.
My course mates and friends, Gerald Chidume, Julian Allagan, Fidele Ngwane, Nar
Rawal, Bertrand Sedar Ngoma, Guy-Vanie Miankokana, Brice Merlin Nguelifack, Sean
O'Neil, Frank Sturn, Serge Phanzu, Jebessa Mijena, Melody Donhere, Dawit Tadesse, Omer
Tadjion, Al-Hassem Nayam, Daniel Bongue, Guy Richard Bossoto, Justin Mabiala, Simplice
Mananga, Brice Malonda, Justin Moutsouka, Seth Kwame Kermaussuor, Kuena, Gouala
Medea, Telemina Gaston, Charles Lebon Mberi Kimpolo, Eddy Kwessi, Rolland Kendou,
Merlin Ngachi, Gladisse Ngodo, Moses Tam, Murielle Mahoungou, Aude Didas Massika,
Roland Loufouma, Darius Kissala, Fortuné Massamba, Amin Bahmanian, Patrice Bassissa
for the good time and fun we have had.
My brothers, sisters and family members, Durand Olivier Ngoyi, Elvis Darel Ngouala
Nzaou, Tancraide Malanda Ngouala, Mabelle Ngouala Mboussi, Aude Durgie Nsimba, Na-
dine Dihoulou, Emilie Kiloula, Michel Ngoyi, Joseph Nzoussi, Joseph Kimbassa, Antoine
Yila, Guyen Kitanda Ndolo, Martin Ndolo, Serge Mawa, Daniel Ndamba, Marie Laure Gayi
for their support.
Table of Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 On the Consistency of a Class of Nonlinear Regression Estimators . . . . . . . . 5
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Definition and Existence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 Breakdown Point . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3 Bounded Influence Nonlinear Signed-Rank Regression . . . . . . . . . . . . . . . 21
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Weighted SR Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.2 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.3 Asymptotic Normality . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.4 Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Weight Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.1 Plug-in Estimator of the Weight . . . . . . . . . . . . . . . . . . . . . 35
3.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.1 Monte Carlo Simulation . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.2 Real Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4 Rank Regression with Missing Response . . . . . . . . . . . . . . . . . . . . . . 45
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Model Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.4 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.5 Asymptotic Normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.6 Estimation of the function g . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.7 Bandwidth Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.8 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.9 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.10 Empirical log-likelihood approach . . . . . . . . . . . . . . . . . . . . . . . . 75
List of Figures
3.1 Plot of fitted exponential curves including WSR . . . . . . . . . . . . . . . . . . 39
3.2 Relative Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3 Residual plot versus x1 and fitted values for Lakes data . . . . . . . . . . . . . . 44
4.1 Scenario 1, Case 1: MSE vs Proportion of Contamination and t-df . . . . . . . . 66
4.2 Scenario 1, Case 2: MSE vs Proportion of Contamination and t-df . . . . . . . . 66
4.3 Scenario 1, Case 3: MSE vs Proportion of Missing Data for SI under contaminated
normal error distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4 Scenario 1, Case 3: MSE vs Proportion of Missing Data for SI under t error
distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.5 Scenario 1, Case 3: MSE vs Proportion of Missing Data for inverse probability
(Ker) under contaminated normal error distribution . . . . . . . . . . . . . . . . 68
4.6 Scenario 1, Case 3: MSE vs Proportion of Missing Data for inverse probability
(Ker) under t error distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.7 Scenario 1, Case 3: MSE vs Proportion of Missing Data for inverse probability
(Log) under contaminated normal error distribution . . . . . . . . . . . . . . . . 69
4.8 Scenario 1, Case 3: MSE vs Proportion of Missing Data for inverse probability
(Log) under t error distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.9 Scenario 2, Case 1: MSE vs Proportion of Contamination and t-df . . . . . . . . 70
4.10 Scenario 2, Case 2: MSE vs Proportion of Contamination and t-df . . . . . . . . 70
4.11 Scenario 2, Case 3: MSE vs Proportion of Missing Data for SI under contaminated
normal error distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.12 Scenario 2, Case 3: MSE vs Proportion of Missing Data for SI under t error
distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.13 Scenario 2, Case 3: MSE vs Proportion of Missing Data for inverse probability
(Ker) under contaminated normal error distribution . . . . . . . . . . . . . . . . 72
4.14 Scenario 2, Case 3: MSE vs Proportion of Missing Data for inverse probability
(Ker) under t error distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.15 Scenario 2, Case 3: MSE vs Proportion of Missing Data for inverse probability
(Log) under contaminated normal error distribution . . . . . . . . . . . . . . . . 73
4.16 Scenario 2, Case 3: MSE vs Proportion of Missing Data for inverse probability
(Log) under t error distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.17 Scenario 2, Case 4: MSE vs Proportion of Contamination and t-df . . . . . . . . 77
List of Tables
3.1 Average estimates (MSEs) of $\theta_0 = 1$ in 1000 simulated samples . . . . . . . 38
3.2 Parameter estimates for Lakes Data . . . . . . . . . . . . . . . . . . . . . . . . 41
Chapter 1
Introduction
1.1 Background
The historical approach to fitting linear and nonlinear models of the form
$$y_i = f(\mathbf{x}_i;\theta_0) + \varepsilon_i, \qquad i = 1,\ldots,n,$$
for some generic function $f$ (linear or nonlinear), proceeds by finding the coefficient estimate
$\hat\theta$ of $\theta_0$ that minimizes the sum of squared errors $\sum_{i=1}^n \big(y_i - f(\mathbf{x}_i;\theta)\big)^2$. Such estimators,
known as least squares estimators, are computationally simple and possess general optimality
properties. However, this optimality can be lost in the presence of even a single extreme
outlier. The problem is already visible with the sample mean, $\bar y$, which is itself the least
squares solution to the model $y_i = \theta_0 + \varepsilon_i$. To overcome this problem, we briefly survey
a few approaches that have been taken to develop coefficient estimators that are not as
easily affected as the least squares estimators. Two broad methods have been investigated:
one is a generalization of least squares estimation called M-estimation, and the other is the
class of rank-based estimation methods referred to as rank-based regression. For each of
these two approaches, a first method was developed to downweight outlying data points
but was later shown to be susceptible to high leverage points (outliers in the $x$ space) in
regression problems, and newer methods have emerged to address both outlier and leverage
problems.
One approach that has been used to lessen the impact of outliers in linear and nonlinear
models is least absolute deviation, also known as $L_1$ regression: finding coefficient estimates
$\hat\theta$ that minimize $\sum_{i=1}^n |y_i - f(\mathbf{x}_i;\theta)|$. A further generalization of this
was made by Huber in the 1960s. He obtained the so-called M-estimators by minimizing
$$\sum_{i=1}^n \rho\!\left(\frac{y_i - f(\mathbf{x}_i;\theta)}{\hat\sigma_i}\right),$$
where $\rho(\cdot)$ is a symmetric function and $\hat\sigma_i$ is an estimate of the standard
deviation of the errors $\varepsilon_i$. It was shown that these M-estimators have the advantage of
downweighting outliers while retaining efficiency when compared to least squares estimators.
However, the original M-estimators can be affected by leverage points (outliers in the $x$ space)
in regression problems. A type of M-estimator developed to protect against outliers and
leverage points is the least trimmed squares estimator, which minimizes $\sum_{i=1}^k \big(y_i - f(\mathbf{x}_i;\theta)\big)^2_{(i)}$,
where $k \le n$. This estimator ignores the largest $n - k$ residuals; however, the fact that it
does not use the entire data set results in a loss of efficiency. More recently, Yohai and others
have developed extensions of these methods, called MM-estimators, that protect against
both outliers and leverage points while retaining efficiency.
At the time when Huber and others were developing the theory of M-estimators, methods
based on ranking were known as R-estimation and were used for simple problems such
as estimating location and scale or making location comparisons in two-sample problems.
They were not considered to be generalizable to linear and nonlinear models. Later, Jaeckel,
Hettmansperger, McKean, and others showed that rank-based estimators could also be cast
as estimators obtained by minimizing $\sum_{i=1}^n a\big(R(z_i(\theta))\big)\, z_i(\theta)$, where $R(z_i(\theta))$ is the rank of
$z_i(\theta) = y_i - f(\mathbf{x}_i;\theta)$ and $a(\cdot)$ is some score function. These rank-based estimators, sometimes
called Wilcoxon estimators, can be used for any general linear model and have been shown
to have high efficiency compared to least squares estimators. However, these rank-based
estimators can also be affected by leverage points in regression problems. Weighted
Wilcoxon (WW) methods were later developed to handle leverage points and shown to
be highly efficient.
1.2 Contribution
The first part of this Ph.D. dissertation is concerned with conditions sufficient for strong
consistency of a class of estimators of the parameters of nonlinear regression models. The
study considers continuous functions depending on a vector of parameters and a set of
random regressors. The estimators chosen are minimizers of a generalized form of the
signed-rank norm, that is, they minimize
$$\frac{1}{n}\sum_{i=1}^n a_n(i)\,\rho\big(|z(\theta)|_{(i)}\big),$$
where $z_i(\theta) = y_i - f(\mathbf{x}_i;\theta)$ and $|z(\theta)|_{(i)}$ is the $i$th ordered value among $|z_1(\theta)|,\ldots,|z_n(\theta)|$. The
function $\rho : \mathbb{R}^+ \to \mathbb{R}^+$ is continuous and strictly increasing. The numbers $a_n(i)$ are scores
generated as $a_n(i) = \varphi^+(i/(n+1))$ for some bounded score function $\varphi^+ : (0,1) \to \mathbb{R}^+$ that
has at most a finite number of discontinuities. The generalization allows us to make consistency
statements about minimizers of a wide variety of norms, including the $L_1$ and $L_2$ norms. By
implementing trimming, it is shown that high breakdown estimates can be obtained based
on the proposed dispersion function.
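The special cases covered by the generalization can be checked directly. The sketch below (illustrative only, with a fixed residual vector) evaluates the generalized dispersion, recovering the least squares criterion with $\rho(t) = t^2$ and $\varphi^+ \equiv 1$, the $L_1$ criterion with $\rho(t) = t$ and $\varphi^+ \equiv 1$, and the signed-rank Wilcoxon dispersion with $\rho(t) = t$ and $\varphi^+(u) = u$.

```python
import numpy as np

def gsr_dispersion(z, rho, score):
    """Generalized signed-rank dispersion (1/n) sum_i a_n(i) rho(|z|_(i)),
    with scores a_n(i) = phi+(i/(n+1)) applied to the ordered |z|."""
    n = len(z)
    absz = np.sort(np.abs(z))                    # |z|_(1) <= ... <= |z|_(n)
    a = score(np.arange(1, n + 1) / (n + 1.0))   # a_n(i) = phi+(i/(n+1))
    return np.mean(a * rho(absz))

z = np.array([0.5, -1.2, 3.0, -0.1, 2.2])

# Special cases recovered by the generalization:
ls  = gsr_dispersion(z, rho=lambda t: t**2, score=np.ones_like)  # least squares
l1  = gsr_dispersion(z, rho=lambda t: t,    score=np.ones_like)  # least absolute deviations
wil = gsr_dispersion(z, rho=lambda t: t,    score=lambda u: u)   # signed-rank Wilcoxon

print(ls, l1, wil)   # mean squared residual, mean absolute residual, 0.95
```

With constant scores the ordering of the residuals is irrelevant and the dispersion reduces to the familiar mean squared or mean absolute residual; with Wilcoxon scores, larger absolute residuals receive larger weights $i/(n+1)$.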
The second part of this dissertation is motivated by the fact that almost all the methods
discussed above fail to take care of outliers and high leverage points in regression problems.
The generalization of the signed-rank norm (above) allows us to include popular estimators
such as the least squares and least absolute deviations estimators but by itself does not give
bounded influence estimators. To address this problem, we considered weighted forms of the
generalized signed-rank estimators and measured the robustness of these estimators using
their influence functions. Carefully chosen weights result in estimators with bounded influ-
ence function. Also conditions needed for the consistency and asymptotic normality of the
proposed estimator are established and discussions about how weight functions can be cho-
sen to achieve bounded influence function of the estimator are provided. Real life examples
and Monte Carlo simulation experiments demonstrate the robustness and efficiency of the
proposed estimator. Finally, an example showing that the weighted signed-rank estimators
can be useful to detect outliers in nonlinear regression is provided.
The last part of this dissertation considers robust estimation for the missing data problem
in regression analysis as an application of the weighted signed-rank estimation method introduced
above. For this problem, important work in the literature has addressed the estimation of the
regression parameters via the least squares method and, eventually, the M-estimation
method. As pointed out by many authors, for non-normal error distributions
or in the case of uncontrolled designs, the least squares method may not provide suitable
estimators. To this end, we consider the linear semi-parametric regression model with
responses missing at random,
$$y_i = \mathbf{x}_i'\beta + g(T_i) + \varepsilon_i, \qquad i = 1,\ldots,n,$$
where the $y_i$'s are missing at random and the $\mathbf{x}_i$'s and $T_i$'s are fully observed. Based on the
rank objective function, we provide estimators of $\beta$ and $g$ simultaneously, using the usual
nonparametric kernel estimation. Under suitable conditions, we show that the resulting
estimator of $\beta$ is $\sqrt{n}$-consistent and satisfies the normal approximation property. To
illustrate, a simulation study is conducted, showing that the rank estimator behaves better
than the least squares estimator when dealing with missing responses.
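A rough illustration of this setting follows. It is not the dissertation's estimator: the logistic missingness mechanism, the Gaussian-kernel Nadaraya-Watson smoother with bandwidth $h = 0.1$, and the complete-case analysis are all simplifying assumptions standing in for the kernel machinery developed in Chapter 4. The nuisance $g$ is partialled out by centering $y$ and $x$ on $T$, and $\beta$ is then estimated by minimizing a Wilcoxon rank dispersion.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
n = 400
x = rng.normal(size=n)
t = rng.uniform(size=n)
g = np.sin(2 * np.pi * t)                        # unknown smooth component g(T)
y = 1.5 * x + g + rng.standard_t(df=3, size=n)   # true beta = 1.5, heavy-tailed errors

# Missing-at-random responses: the chance of observing y depends only on x
p_obs = 1.0 / (1.0 + np.exp(-(0.5 + x)))
delta = rng.uniform(size=n) < p_obs              # delta_i = 1 iff y_i is observed

def nw_smooth(t_eval, t_obs, v_obs, h=0.1):
    """Nadaraya-Watson smoother with a Gaussian kernel (bandwidth h assumed)."""
    w = np.exp(-0.5 * ((t_eval[:, None] - t_obs[None, :]) / h) ** 2)
    return (w @ v_obs) / w.sum(axis=1)

# Work with complete cases and partial g out by centering on T
xo, to, yo = x[delta], t[delta], y[delta]
y_c = yo - nw_smooth(to, to, yo)
x_c = xo - nw_smooth(to, to, xo)

def wilcoxon_dispersion(beta):
    z = y_c - beta * x_c
    r = np.argsort(np.argsort(z)) + 1            # ranks R(z_i)
    a = np.sqrt(12.0) * (r / (len(z) + 1.0) - 0.5)
    return np.sum(a * z)

beta_hat = minimize_scalar(wilcoxon_dispersion, bounds=(-5.0, 5.0),
                           method="bounded").x
print(beta_hat)                                  # close to the true slope 1.5
```

Because the missingness probability depends only on the observed covariate, the complete-case rank fit remains consistent here; the dissertation's simple and inverse-probability imputation schemes refine this by reusing the incomplete cases.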
Chapter 2
On the Consistency of a Class of Nonlinear Regression Estimators
2.1 Introduction
Over the last twenty-five years, considerable work has been done on robust procedures
for linear models. Several classes of robust estimates have been proposed for these models.
One such class is the generalized signed-rank class of estimates. This class uses an objective
function which depends on the choice of a score function $\varphi^+$. If $\varphi^+$ is monotone, then the
objective function is a norm and the geometry of the resulting robust analysis (estimation,
testing, and confidence procedures) is similar to the geometry of the traditional
least squares (LS) analysis; see McKean and Schrader (1980). Generally this robust analysis
is highly efficient relative to the LS analysis; see the monograph by Hettmansperger and
McKean (1998) for a discussion of this analysis. For the simple location model, if Wilcoxon
scores, $\varphi^+(u) = u$, are used then this estimate is the famous Hodges-Lehmann estimate,
while if sign scores, $\varphi^+(u) \equiv 1$, are used it is the sample median. If the monotonicity of
$\varphi^+$ is relaxed, then high breakdown estimates can be obtained; see Hössjer (1994). Thus
the signed-rank family of robust estimates for the linear model contains estimates which
range from highly efficient to high breakdown, and these estimates generalize traditional
nonparametric procedures in the simple location problem.
Many interesting problems, though, are nonlinear in nature. Traditional procedures
based on LS estimation have been used for years. Since these LS procedures for nonlinear
models use the Euclidean norm, they are as easily interpreted as their linear model
counterparts. The asymptotic theory for nonlinear LS has been developed by Jennrich (1969) and
Wu (1981), among others. In this chapter, we propose a nonlinear analysis based on the
signed-rank objective function. The objective function is a norm if $\varphi^+$ is monotone; hence,
the estimates are easily interpretable. We keep our development quite general, though, to
include nonlinear estimates based on Hössjer-type estimates as well. Hence our estimates
include the nonlinear extensions of the signed-rank Wilcoxon estimate and the $L_1$ estimate,
as well as the extensions of high breakdown linear model estimates. Thus we offer a rich
family of estimates from which to select for nonlinear models.
In Section 2.2 we present our family of estimates for nonlinear models. In Section 2.3,
we show that these estimates are strongly consistent under certain assumptions. We discuss
these assumptions, contrasting them with assumptions for current existing estimates. The
same section contains a general discussion of interesting special cases such as the L1 and the
Wilcoxon. Section 2.4 discusses the conditions needed to achieve positive breakdown of our
estimator.
2.2 Definition and Existence
Consider the following general regression model
$$y_i = f(\mathbf{x}_i;\theta_0) + e_i, \qquad 1 \le i \le n, \tag{2.1}$$
where $\theta_0 \in \Theta$ is a vector of parameters, $\mathbf{x}_i \in \mathcal{X}$ is a vector of independent variables, and
$f$ is a real-valued function defined on $\mathcal{X} \times \Theta$. Let $V = \{(y_1,\mathbf{x}_1),\ldots,(y_n,\mathbf{x}_n)\}$ be the set of
sample data points. Note that $V \subset \mathcal{V} \equiv \mathbb{R} \times \mathcal{X}$.

We shall assume that $\Theta$ is compact, $\theta_0$ is an interior point of $\Theta$, and $f(\mathbf{x};\theta)$ is a
continuous function of $\theta$ for each $\mathbf{x} \in \mathcal{X}$ and a measurable function of $\mathbf{x}$ for each $\theta \in \Theta$.

We define the estimator of $\theta_0$ to be any vector minimizing
$$D_n(V;\theta) = \frac{1}{n}\sum_{i=1}^n a_n(i)\,\rho\big(|z(\theta)|_{(i)}\big), \tag{2.2}$$
where $z_i(\theta) = y_i - f(\mathbf{x}_i;\theta)$ and $|z(\theta)|_{(i)}$ is the $i$th ordered value among $|z_1(\theta)|,\ldots,|z_n(\theta)|$.
The function $\rho : \mathbb{R}^+ \to \mathbb{R}^+$ is continuous and strictly increasing. The numbers $a_n(i)$ are
scores generated as $a_n(i) = \varphi^+(i/(n+1))$ for some bounded score function $\varphi^+ : (0,1) \to \mathbb{R}^+$
that has at most a finite number of discontinuities.
This estimator will be denoted by $\hat\theta_n$.

Because $D_n(V;\theta)$ is continuous in $\theta$, Lemma 2 of Jennrich (1969) implies the existence
of a minimizer of $D_n(V;\theta)$.
We adopt Doob's (1994) convention and denote by $L^p$, $1 \le p \le \infty$, the space of
measurable functions $h : (0,1) \to \mathbb{R}$ for which $|h|^p$ is integrable for $1 \le p < \infty$, and the
space of essentially bounded measurable functions for $p = \infty$. The $L^p$ norm of $h$ is $\|h\|_p \equiv
\{\int |h|^p\}^{1/p}$ for $1 \le p < \infty$ and $\|h\|_\infty \equiv \operatorname{ess\,sup} |h|$ for $p = \infty$. All integrals are with respect
to Lebesgue measure on $(0,1)$. The range of integration will be assumed to be $(0,1)$ unless
specified otherwise.
2.3 Consistency
Let $(\Omega,\mathcal{F},P)$ be a probability space. For $i = 1,\ldots,n$, assume that $\mathbf{x}_i$ and $e_i = y_i -
f(\mathbf{x}_i;\theta_0)$ are independent random variables (carried by $(\Omega,\mathcal{F},P)$) with distributions $H$ and
$G$, respectively. We shall write $\mathbf{x}$, $e$, and $|z(\theta)|$ for $\mathbf{x}_1$, $e_1$, and $|z_1(\theta)|$, respectively. Let $\tilde G_\theta$
denote the distribution of $|z(\theta)|$, and assume

A1: $P\big(f(\mathbf{x};\theta) = f(\mathbf{x};\theta_0)\big) < 1$ for any $\theta \ne \theta_0$;

A2: for $1 \le q \le \infty$, there exists a function $h$ such that $\big|\rho\big(\tilde G_\theta^{-1}(y)\big)\big| \le h(y)$ for all $\theta \in \Theta$,
with $E[h^q(Y)] < \infty$; and

A3: $G$ has a density $g$ that is symmetric about $0$ and strictly decreasing on $\mathbb{R}^+$.

As usual, we let $\xrightarrow{a.s.}$ denote almost sure convergence, i.e., pointwise convergence
everywhere except possibly on an event in $\mathcal{F}$ of probability $0$.
Theorem 1. Under A1-A3, $\hat\theta_n \xrightarrow{a.s.} \theta_0$.
Before giving the proof of this theorem, let us discuss some related lemmas and corollaries
that will be used in the proof.

Let $U_{(1)},\ldots,U_{(n)}$ be order statistics from a sample of $n$ i.i.d. uniform$(0,1)$ random
variables. Let $J_n : (0,1) \to \mathbb{R}$, $n = 1,2,\ldots$ be Lebesgue measurable functions and let
$g : (0,1) \to \mathbb{R}$ be a Borel measurable function. Define $g_n(t) \equiv g(U_{([nt]+1)})$. In the defining
expression (2.2) for the function $D_n(V;\theta)$, let $\tilde G_\theta$ denote the cdf of $|z(\theta)|$. Then we can
express $D_n(V;\theta)$ as
$$D_n(V;\theta) = \frac{1}{n}\sum_{i=1}^n a_n(i)\,\big(\rho \circ \tilde G_\theta^{-1}\big)\big(U_{(i)}\big). \tag{2.3}$$
The following is Corollary 2.1 of van Zwet (1980) in the notation of this paper and is
given for completeness.
Lemma 1 (van Zwet). Let $1 \le p \le \infty$, $1/p + 1/q = 1$, and suppose that $J_n \in L^p$ for
$n = 1,2,\ldots$, $g \in L^q$, and there exists a function $J \in L^p$ such that $\lim_{n\to\infty}\int_0^t J_n = \int_0^t J$ for
all $t \in (0,1)$. If either (i) $1 < p \le \infty$ and $\sup_n \|J_n\|_p < \infty$, or (ii) $p = 1$ and $\{J_n\}$ is
uniformly integrable, then
$$\frac{1}{n}\sum_{i=1}^n J_n\big(i/(n+1)\big)\, g\big(U_{(i)}\big) \xrightarrow{a.s.} \int J\, g.$$
Condition (ii) is satisfied if we have convergence in $L^1$ of $J_n$ [cf. also Doob (1994),
Theorem VI.18]. To this end, we will marginally violate assumption (ii) of Lemma 1 and
assume that
$$\sup_n \|J_n\|_p \equiv \sup_n \Big\{\frac{1}{n}\sum_{i=1}^n \big|\varphi^+\big(i/(n+1)\big)\big|^p\Big\}^{1/p} < \infty \tag{2.4}$$
for $1 \le p \le \infty$. Notice also that
$$\frac{1}{n}\sum_{i=1}^{[nt]}\varphi^+\big(i/(n+1)\big) \;\le\; \int_0^t J_n \;\le\; \frac{1}{n}\sum_{i=1}^{[nt]+1}\varphi^+\big(i/(n+1)\big).$$
Taking the limit as $n\to\infty$ we obtain that $\lim_{n\to\infty}\int_0^t J_n = \int_0^t \varphi^+$ for all $t \in (0,1)$ provided
that $\varphi^+$ has at most a finite number of discontinuities. Thus if $\varphi^+$ satisfies (2.4) and $g \in L^q$,
all the conditions of Lemma 1 hold. The following corollary is a special case of this result.
Corollary 1. Let $W_1,\ldots,W_n$ be a random sample from a distribution $F$ with support on
$\mathbb{R}^+$. Let $\rho : \mathbb{R}^+ \to \mathbb{R}^+$ be a continuous Borel measurable function. Suppose, for $1 \le p, q \le \infty$
with $1/p + 1/q = 1$, that $E[\rho(W)]^q < \infty$ and $\|\varphi^+\|_p < \infty$. Then
$$T_n \equiv n^{-1}\sum_{i=1}^n \varphi^+\big(i/(n+1)\big)\,\rho\big(W_{(i)}\big) \xrightarrow{a.s.} \int (\varphi^+)\,\big(\rho \circ F^{-1}\big) < \infty.$$
A formal proof of Corollary 1 may be constructed along the lines described in the
paragraph preceding it, with the function $g$ defined as $\rho \circ F^{-1}$. It will not be included here
for the sake of brevity.
Lemma 2. Under assumptions A1-A3,
$$D_n(V;\theta) \xrightarrow{a.s.} \lambda(\theta) \quad \text{a.e. } V, \text{ uniformly for all } \theta \in \Theta, \tag{2.5}$$
where $\lambda : \Theta \to \mathbb{R}$ is a function satisfying
$$\inf_{\theta \in \Gamma} \lambda(\theta) > \lambda(\theta_0) \tag{2.6}$$
for any closed subset $\Gamma$ of $\Theta$ not containing $\theta_0$.

Proof. The a.s. pointwise convergence of $D_n(V;\theta)$ follows from expression (2.3) and Corollary
1, which also furnishes the function
$$\lambda(\theta) \equiv \int (\varphi^+)\,\big(\rho \circ \tilde G_\theta^{-1}\big) < \infty. \tag{2.7}$$
Then under A1-A3, Theorem 2 of Jennrich (1969) gives (2.5).

To establish (2.6) we follow a strategy similar to that of Hössjer (1994). Under A1 and A3,
for any $s > 0$ and $\theta \ne \theta_0$,
$$\tilde G_\theta(s) = P\big(|e - \{f(\mathbf{x};\theta) - f(\mathbf{x};\theta_0)\}| \le s\big)
= E_{\mathbf{x}}\big\{P_e\big(|e - \{f(\mathbf{x};\theta) - f(\mathbf{x};\theta_0)\}| \le s \,\big|\, \mathbf{x}\big)\big\} < \tilde G_{\theta_0}(s),$$
so that, $\rho$ being strictly increasing, $\lambda(\theta) > \lambda(\theta_0)$ whenever $\theta \ne \theta_0$. Thus for any $\theta \in \Gamma$, we
have a $\delta_\theta \in \mathbb{R}$ such that $\lambda(\theta) > \delta_\theta > \lambda(\theta_0)$. Then it follows from the compactness of $\Gamma$
that $\inf_{\theta \in \Gamma} \lambda(\theta) > \lambda(\theta_0)$.
Lemma 3. Let $\{h_n\}$ be a sequence of continuous functions defined on a compact set $\Theta \subset \mathbb{R}^p$
that converges uniformly to $h$. Then $\{h_n\}$ is equicontinuous on $\Theta$.

Proof. Since $\{h_n\}$ converges uniformly to $h$, for any $\epsilon > 0$ there exists an $N \in \mathbb{N}$ such that
$|h_n(\theta) - h(\theta)| < \epsilon/3$ for all $n \ge N$. The function $h$, being continuous on a compact set,
is uniformly continuous. Thus there exists some $\delta^* > 0$ such that for all $\theta, \theta' \in \Theta$ with
$\|\theta - \theta'\| < \delta^*$, we have $|h(\theta) - h(\theta')| < \epsilon/3$. Then for all $n \ge N$ and all $\theta, \theta' \in \Theta$ such
that $\|\theta - \theta'\| < \delta^*$, we have
$$|h_n(\theta) - h_n(\theta')| \le |h_n(\theta) - h(\theta)| + |h_n(\theta') - h(\theta')| + |h(\theta) - h(\theta')| < \epsilon.$$
Also, by uniform continuity of each $h_n$, for any fixed $n \in \{1,\ldots,N-1\}$ there exists a
$\delta_n > 0$ such that for all $\theta, \theta' \in \Theta$ with $\|\theta - \theta'\| < \delta_n$, we have $|h_n(\theta) - h_n(\theta')| < \epsilon$. Now set
$\delta_0 = \min\{\delta_1,\ldots,\delta_{N-1}\}$. Then for all $n \in \{1,\ldots,N-1\}$ and all $\theta, \theta' \in \Theta$ with $\|\theta - \theta'\| < \delta_0$,
we have $|h_n(\theta) - h_n(\theta')| < \epsilon$.

Therefore, setting $\delta = \min\{\delta^*, \delta_0\}$, for all $n \in \mathbb{N}$ and all $\theta, \theta' \in \Theta$ with $\|\theta - \theta'\| < \delta$,
we have $|h_n(\theta) - h_n(\theta')| < \epsilon$.
Proof of Theorem 1. By Lemma 1 of Wu (1981), to establish the consistency of $\hat\theta_n$, it is
sufficient to show that
$$\liminf_{n\to\infty}\, \inf_{\theta\in\Gamma}\, \big[D_n(V;\theta) - D_n(V;\theta_0)\big] > 0 \quad \text{a.s.} \tag{2.8}$$
for any closed subset $\Gamma$ of $\Theta$ not containing $\theta_0$. But
$$\liminf_{n\to\infty}\, \inf_{\theta\in\Gamma}\, \big[D_n(V;\theta) - D_n(V;\theta_0)\big] \;\ge\;
\liminf_{n\to\infty}\, \inf_{\theta\in\Gamma} A_n(V;\theta) + \inf_{\theta\in\Gamma} B(\theta;\theta_0) + \liminf_{n\to\infty} C_n(V;\theta_0), \tag{2.9}$$
where $A_n(V;\theta) = D_n(V;\theta) - \lambda(\theta)$, $B(\theta;\theta_0) = \lambda(\theta) - \lambda(\theta_0)$, and $C_n(V;\theta_0) = \lambda(\theta_0) -
D_n(V;\theta_0)$.

As a result of Corollary 1, $\liminf_{n\to\infty} C_n(V;\theta_0) = 0$ a.s. Due to Lemma 2 we have
$\inf_{\theta\in\Gamma} B(\theta;\theta_0) > 0$. For the statement given in (2.8) to hold, it suffices to show that
$\liminf_{n\to\infty}\inf_{\theta\in\Gamma} A_n(V;\theta) = 0$ a.s. Again by Lemma 2, $A_n(V;\theta) \xrightarrow{a.s.} 0$ uniformly for all
$\theta \in \Gamma$. Also $A_n(V;\theta)$, being continuous on the compact set $\Gamma$, is uniformly continuous on
$\Gamma$. Then $A_n(V;\theta)$ is equicontinuous on $\Gamma$ a.e. $V$ by Lemma 3. Thus for all $\epsilon > 0$ there
exists a $\delta > 0$ such that for all $\theta, \theta' \in \Gamma$,
$$\text{if } \|\theta - \theta'\| < \delta \text{ then } |A_n(V;\theta) - A_n(V;\theta')| < \epsilon, \quad \text{a.e. } V,\ \forall n \in \mathbb{N}. \tag{2.10}$$
Let $D_{\theta'} = \{\theta : \|\theta - \theta'\| < \delta\}$ for $\theta' \in \Gamma$. Then the $D_{\theta'}$, $\theta' \in \Gamma$, form an open covering of
$\Gamma$. But $\Gamma$ is compact, hence there is a finite subcovering $D_{\theta_{0j}}$, $j = 1,\ldots,m$, which covers
$\Gamma$. Let $\theta$ be an arbitrary point in $\Gamma$. Then $\theta \in D_{\theta_{0j}}$ for some $j = 1,\ldots,m$; hence
$\|\theta - \theta_{0j}\| < \delta$. Thus by (2.10),
$$|A_n(V;\theta) - A_n(V;\theta_{0j})| < \epsilon, \quad \text{a.e. } V,\ \forall n \in \mathbb{N}.$$
That is, $A_n(V;\theta) < A_n(V;\theta_{0j}) + \epsilon$.

O1: there exist $\epsilon > 0$ and $n_0$ such that for all $n \ge n_0$
$$\inf_{\theta\in\Gamma}\; n^{-1}\sum_{i=1}^n |l_i(\theta)|\, \min\big\{G\big(|l_i(\theta)|/2\big) - 1/2,\ 1/2 - G\big(-|l_i(\theta)|/2\big)\big\} \ge \epsilon$$
for all such $\theta$, where $l_i(\theta) = f(\mathbf{x}_i;\theta) - f(\mathbf{x}_i;\theta_0)$;

O2: $E(|e|) < \infty$ and $E\big([f(\mathbf{x};\theta) - f(\mathbf{x};\theta_0)]^2\big) < \infty$ for all $\theta \in \Theta$; and

O3: $G(0) = 1/2$.
Here O3 is weaker than C3. However, O2 is stronger than C2. Following contrapositive
arguments similar to those in the least squares case, one can show that O1 is also stronger
than C1 (see also Oberhofer (1982), p. 318). For a detailed discussion of this and sufficient
conditions for O1, the reader is referred to Oberhofer (1982).
Signed-Rank Wilcoxon

Set $\varphi^+(u) = u$ for $0 < u < 1$. Scores truncated at $\pm k$ may be generated via
$$\varphi^+(u) = \begin{cases}
\Phi^{-1}\big(\Phi(-k)\big) = -k, & \text{if } u < 2\Phi(-k) - 1,\\[2pt]
\Phi^{-1}\big(\tfrac{u+1}{2}\big), & \text{if } 2\Phi(-k) - 1 \le u < 2\Phi(k) - 1,\\[2pt]
\Phi^{-1}\big(\Phi(k)\big) = k, & \text{if } u \ge 2\Phi(k) - 1.
\end{cases}$$
Usually we take $k = 4$.
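For the simple location model, the claims of Section 2.1 — that Wilcoxon scores yield the Hodges-Lehmann estimate and sign scores the sample median — can be verified numerically. The sketch below (illustrative only; the simulated sample and seed are assumptions) minimizes the signed-rank dispersion with $\rho(t) = t$ under each choice of score.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def signed_rank_dispersion(theta, y, score):
    """D_n(theta) = (1/n) sum_i a_n(i) |y - theta|_(i) with a_n(i) = phi+(i/(n+1))
    and rho(t) = t (the identity)."""
    absz = np.sort(np.abs(y - theta))
    n = len(y)
    a = score(np.arange(1, n + 1) / (n + 1.0))
    return np.mean(a * absz)

rng = np.random.default_rng(2)
y = rng.normal(loc=3.0, size=201)

# Wilcoxon scores phi+(u) = u and sign scores phi+(u) = 1:
wilcoxon = minimize_scalar(lambda th: signed_rank_dispersion(th, y, lambda u: u),
                           bounds=(0.0, 6.0), method="bounded").x
sign_est = minimize_scalar(lambda th: signed_rank_dispersion(th, y, np.ones_like),
                           bounds=(0.0, 6.0), method="bounded").x

# Hodges-Lehmann estimate: median of the Walsh averages (y_i + y_j)/2, i <= j
pairs = (y[:, None] + y[None, :]) / 2.0
hl = np.median(pairs[np.triu_indices_from(pairs)])
print(wilcoxon, hl, sign_est, np.median(y))
```

Both dispersions are convex in $\theta$, so a bounded one-dimensional search locates the minimizer; the Wilcoxon fit matches the median of the Walsh averages and the sign-score fit matches the sample median.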
2.4 Breakdown Point

One of the virtues of the estimators discussed in this paper is that they allow for trimming.
This in turn provides us with estimates that are robust when one or more of the
model assumptions are violated. In this section we consider the breakdown point of our
estimator as a measure of its robustness. Assuming that the true value of the parameter to
be estimated lies in the interior of the parameter space $\Theta$, breakdown represents a severe form
of inconsistency in that the estimator converges to a point on the boundary of $\Theta$ instead of
to $\theta_0$.
Recall that $V = \{(\mathbf{x}_1,y_1),\ldots,(\mathbf{x}_n,y_n)\} \subset \mathcal{V}$ denotes the sample data points. Let $\mathcal{V}_m$ be
the set of all data sets obtained by replacing any $m$ points in $V$ by arbitrary points. The
finite sample breakdown point of an estimator $\hat\theta$ is defined as [see Donoho and Huber (1983)]
$$\varepsilon^*_n(\hat\theta;V) = \min_{1\le m\le n}\Big\{\frac{m}{n} : \sup_{Z\in\mathcal{V}_m} \big|\hat\theta(Z) - \hat\theta(V)\big| = \infty\Big\}, \tag{2.11}$$
where $\hat\theta(V)$ is the estimate obtained based on the sample $V$. In nonlinear regression, however,
this definition of the breakdown point fails since $\varepsilon^*_n$ is not invariant to nonlinear
reparameterizations. For a discussion of this see Stromberg and Ruppert (1992). We will adopt
the definition of breakdown point for nonlinear models given by Stromberg and Ruppert
(1992). The definition proceeds by defining finite sample upper and lower breakdown points,
$\varepsilon^+$ and $\varepsilon^-$, which depend on the regression model $f$. For any $\mathbf{x}_0 \in \mathcal{X}$, the upper and lower
breakdown points are defined as
$$\varepsilon^+(f,\hat\theta,V,\mathbf{x}_0) = \begin{cases}
\min_{0\le m\le n}\Big\{\frac{m}{n} : \sup_{Z\in\mathcal{V}_m} f\big(\mathbf{x}_0;\hat\theta(Z)\big) = \sup_{\Theta} f(\mathbf{x}_0;\theta)\Big\}, & \text{if } \sup_{\Theta} f(\mathbf{x}_0;\theta) > f(\mathbf{x}_0;\hat\theta),\\[4pt]
1, & \text{otherwise},
\end{cases} \tag{2.12}$$
and
$$\varepsilon^-(f,\hat\theta,V,\mathbf{x}_0) = \begin{cases}
\min_{0\le m\le n}\Big\{\frac{m}{n} : \inf_{Z\in\mathcal{V}_m} f\big(\mathbf{x}_0;\hat\theta(Z)\big) = \inf_{\Theta} f(\mathbf{x}_0;\theta)\Big\}, & \text{if } \inf_{\Theta} f(\mathbf{x}_0;\theta) < f(\mathbf{x}_0;\hat\theta),\\[4pt]
1, & \text{otherwise}.
\end{cases} \tag{2.13}$$
We implement trimming through the score function by taking $a_n(i) = 0$ for $i > k$,
where $k \ge [n/2] + 1$. Here $[b]$ stands for the greatest integer less than or equal to $b$. This
forces at least the first half of the ordered absolute residuals to contribute to the dispersion
function. In light of this, the dispersion function may be written as
$$D_n(V;\theta) = \frac{1}{n}\sum_{i=1}^k a_n(i)\,\rho\big(|z(\theta)|_{(i)}\big).$$
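The effect of trimming on breakdown can be seen in a small location example. The sketch below is illustrative only (indicator scores $a_n(i) = 1$ for $i \le k$ and $0$ otherwise, $\rho(t) = t^2$, and a grid search over the location; the simulated contamination is an assumption): with $k = [n/2] + 1$, the trimmed fit ignores the largest absolute residuals and resists a large fraction of gross outliers that destroys the mean.

```python
import numpy as np

def trimmed_dispersion(theta, y, k):
    """Only the k smallest ordered absolute residuals contribute:
    a_n(i) = 1 for i <= k and 0 otherwise, with rho(t) = t^2."""
    absz = np.sort(np.abs(y - theta))
    return np.sum(absz[:k] ** 2) / len(y)

rng = np.random.default_rng(3)
y = rng.normal(loc=5.0, scale=1.0, size=80)
y[:30] = 1000.0 + rng.normal(size=30)      # 30 of 80 observations are gross outliers

k = len(y) // 2 + 1                        # k = [n/2] + 1
grid = np.linspace(-100.0, 100.0, 4001)    # simple grid search for the minimizer
vals = np.array([trimmed_dispersion(th, y, k) for th in grid])
trimmed_fit = grid[np.argmin(vals)]
print(trimmed_fit, y.mean())               # trimmed fit stays near 5; mean exceeds 300
```

With 41 of the 80 ordered absolute residuals contributing and 50 uncontaminated points available, the minimizer stays near the center of the clean data, illustrating the positive breakdown the trimmed dispersion is designed to deliver.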
The following theorem is a version of Theorem 3 of Stromberg and Ruppert (1992). We
impose the same conditions but give the result in terms of $k$. The results given are for upper
breakdown; analogues for lower breakdown are straightforward. The proof is obtained by
replacing $\operatorname{med}_{1\le i\le n}$ with $n^{-1}\sum_{i=1}^k$ and $m$ with $n - k$ in Stromberg and Ruppert's (1992)
proof of Theorem 3. In the following, $\#(A)$ denotes the cardinality of the set $A$.
Theorem 2. Assume that for some fixed $x_0$ there exists $\Lambda_k \subset \{i : 1 \le i \le n\}$ with $\#(\Lambda_k) = 2n - [n/2] - k$ such that
$$\lim_{M\uparrow\infty}\ \inf_{\{\theta\,:\,f(x_0,\theta) > M\}}\ \Big\{\inf_{i\in\Lambda_k} f(x_i,\theta)\Big\} = \sup_{\theta\in\Theta} f(x_0,\theta).$$
Then
$$\varepsilon^+(f,\hat\theta,V,x_0) \ge \frac{n-k+1}{n}.$$
Theorem 2 establishes that even when the regression function $f$ lies on the boundary for a portion of the data, the bias of the estimator of $\theta_0$ remains within reasonable bounds if trimming is implemented. The following corollary gives the asymptotic (as $n\to\infty$) breakdown point of $\hat\theta_n$.

Corollary 5. Let $\gamma = \sup\{u : \varphi^+(u) > 0\}$. The asymptotic breakdown point of $\hat\theta_n$ is at least $1-\gamma$.
This is reminiscent of the breakdown point of a linear function of order statistics, which equals the smaller of the two fractions of mass at either end of the distribution that receive zero weight (Hampel, 1971). The same result obtained in Corollary 5 was given by Hampel (1971) for one-sample location estimators based on linear functions of order statistics (see Sec. 7(i) of Hampel (1971)).
Consider the class of models of the form $f(x,\theta) = g(\theta_0 + \theta_1 x)$, where $(\theta_0,\theta_1)\in\mathbb{R}^2$ and $g(t)$ is monotone increasing in $t$. This class of models is considered by Stromberg and Ruppert (1992) and contains popular models like the logistic regression model $g(\theta_0+\theta_1 x) = \{1+\exp(-(\theta_0+\theta_1 x))\}^{-1}$. A breakdown point of $1-\gamma$ can be achieved if $\hat\theta_n$ is obtained via a minimization of (2.2) with $a_n(i) = \varphi^+(i/(n+1))$ such that $\gamma = \sup\{u : \varphi^+(u) > 0\}$.
Remark 5. A definition of breakdown based on "badness measures," which includes the definition given by Stromberg and Ruppert (1992), was given by Sakata and White (1995). Under our assumptions this definition reduces to the one used here, as shown in Theorem 2.3 of Sakata and White (1995).
Chapter 3
Bounded Influence Nonlinear Signed-Rank Regression
3.1 Introduction
As in the previous chapter, let us consider the following nonlinear regression model
$$y_i = f(x_i,\theta_0) + e_i, \quad 1 \le i \le n, \tag{3.1}$$
where $\theta_0 \in \Theta$ is a vector of parameters, $x_i$ is a vector of independent variables in a vector space $\mathcal{X}$, and $f$ is a real-valued function defined on $\mathcal{X}\times\Theta$. Let $V = \{(y_1,x_1),\ldots,(y_n,x_n)\}$ be the set of sample data points; note that $V \subset \mathcal{V} \equiv \mathbb{R}\times\mathcal{X}$. We shall assume that $\Theta$ is compact, that $\theta_0$ is an interior point of $\Theta$, and that $f(x,\theta)$ is a twice continuously differentiable function of $\theta$ for each $x\in\mathcal{X}$ and a measurable function of $x$ for each $\theta\in\Theta$. The errors $e_i$ are assumed to be iid with a distribution function $G$.
The asymptotic normality of the least squares (LS) estimator of $\theta_0$ has been discussed in Jennrich (1969), Wu (1981), and Wang (1996), among others. The asymptotic normality of the least absolute deviations (LAD) estimator of $\theta_0$ is discussed in Wang (1995). However, as pointed out in Haupt and Oberhofer (2009), the treatments of Wang (1995) and Wang (1996) were missing some necessary global conditions. The estimator introduced in this chapter is based on a generalized form of the signed-rank objective function. It provides a unified treatment of a class of estimators, including those considered in Wang (1995) and Wang (1996). Moreover, we show how a weight function can be incorporated to obtain estimators with a bounded influence function (Hampel, 1974). Simply stated, the influence function represents the amount of change in the estimator caused by infinitesimal contamination in the data. Thus it is a measure of the sensitivity of an estimator to outliers, and it is desirable that this function be bounded.
Rank-based estimators of linear models (where $f(x,\theta_0) = x'\theta_0$ in (3.1)) have been studied extensively. Jaeckel (1972) gave a general class of rank estimators for linear regression parameters that are efficient and robust to outliers in the response space. These include the Wilcoxon estimator, which is equal to the median of the pairwise slopes $(Y_j - Y_i)/(x_j - x_i)$ in the case of simple linear regression. These estimators, however, were found to be sensitive to outliers in the $x$ direction (Hettmansperger et al., 2000; Hettmansperger and McKean, 1998), and thus have an unbounded influence function. Sievers (1983) introduced weighted Wilcoxon estimators that were later shown to possess a bounded influence function by Naranjo and Hettmansperger (1994). Chang et al. (1999) provided one-step estimators with a high breakdown point based on the weighted Wilcoxon pseudonorm, where the weights depend on a robust and consistent estimator of $\theta_0$.
The signed-rank (SR) estimator of the slope parameter in the linear model is also efficient and robust to outliers in the $y$ direction but sensitive to outliers in the $x$ direction (Hettmansperger et al., 2000). Like the Wilcoxon estimator, the SR estimator is suitable when dealing with datasets from studies with controlled designs. However, it may be adversely affected when exploring datasets based on uncontrolled studies. To address the lack of robustness in the $x$ direction, Tableman (1990) provided a one-step signed-rank estimator for the linear model that has a bounded influence function. The results of Tableman (1990) were motivated by the work of Krasker and Welsch (1982), who gave a class of M-estimators with bounded influence function for linear regression estimation. A framework similar to Tableman (1990) has been investigated by Wiens and Zhou (1994), who provided bounded-influence rank estimators in the linear model using a general form of the SR objective function. They also show how the efficiency can be optimized by appropriate choices of scores and weights under a boundedness constraint on the influence function.
For the general nonlinear model given in (3.1), Abebe and McKean (2007) studied the asymptotic properties of the Wilcoxon estimator of $\theta_0$. Just as in linear models, this estimator was shown to be efficient but sensitive to local changes in the direction of $x$. Jurečková (2008) also studied the asymptotic properties of the rank estimator of $\theta_0$ in (3.1). Her approach takes advantage of the asymptotic equivalence of regression quantiles and regression rank scores to provide rank scores based on the regression function. The approach results in a restricted set of scores. Also, the resulting estimator does not possess a bounded influence function.
In this chapter, we propose a class of rank-based estimators of $\theta_0$ in (3.1) based on the minimization of a weighted signed-rank objective function. In contrast with the approaches of Abebe and McKean (2007) and Jurečková (2008), this approach allows for a set of scores generated by any nondecreasing bounded score function that has at most a finite number of discontinuities. Also, by utilizing the theory of Sobolev spaces, this approach removes certain restrictive assumptions, such as compactness of $\mathcal{X}$, Lipschitz continuity of the regression function, and boundedness of the first derivative of the density of the error distribution, that were needed in the work of Jurečková (2008). Our objective function is very general. For instance, the LS objective function is a special case of our objective function, whereas the objective function of Jurečková (2008) does not include the LS objective function. We also show how Krasker-Welsch type weights (Krasker and Welsch, 1982) can be defined based on the regression function $f$ to yield a bounded influence function. Moreover, simulation studies show that the proposed weighted estimators are also efficient, attaining a relative efficiency of .955 versus least squares when $G$ is Gaussian.
Other robust approaches to nonlinear regression include Stromberg (1993), who provided computational algorithms for computing high breakdown nonlinear regression estimates using the least median of squares (Rousseeuw, 1984) and MM (Yohai, 1987) estimators. Stromberg (1995) establishes the consistency of the least trimmed squares (LTS) estimator for the nonlinear model in (3.1). The LTS estimator was shown to have a high breakdown point by Stromberg and Ruppert (1992).
For linear models, the estimator proposed in this chapter can be regarded as a generalization of the objective function of Tableman (1990) that includes other norms, such as weighted LAD and LS. Moreover, since we do not restrict ourselves to the linear model, it is not only an extension of signed-rank estimators for the linear model to the nonlinear regression case, but also a generalization of LAD- and LS-type estimators for the nonlinear regression model.
The remainder of the chapter is organized as follows. Our proposed estimator is given in Section 3.2, which also contains asymptotic and robustness results concerning the proposed weighted estimator. Section 3.3 gives the results using a plug-in estimator of the weights based on a consistent estimator of the regression parameter. Real data and simulation examples are given in Section 3.4. Section 3.5 provides a discussion. Proofs and technical results are given in the appendix.
3.2 Weighted SR Estimator
Consider the signed-rank (SR) estimator, $\hat\theta_S$, of $\theta_0$ in equation (3.1) that minimizes $T_n^+(\theta) = \sum_{i=1}^n R_i\,|z_i(\theta)|$, where $z_i(\theta) = y_i - f(x_i,\theta)$ and $R_i = \#\{j : |z_j(\theta)| \le |z_i(\theta)|\}$ is the rank of $|z_i(\theta)|$, $i = 1,\ldots,n$. The least squares (LS) and least absolute deviation (LAD) estimators of $\theta_0$ minimize $\sum_{i=1}^n z_i^2(\theta)$ and $\sum_{i=1}^n |z_i(\theta)|$, respectively. It is well known that the LS estimator is sensitive to outliers in both the $x$ and $y$ directions, while the SR and LAD estimators are sensitive to outliers in the $x$ direction. There is clearly a need for a method that is not sensitive to outliers in either direction. We obtain this by considering a weighted form of the SR estimator.
We define the weighted SR (WSR) estimator $\hat\theta_n$ of $\theta_0$ to be any vector minimizing
$$D_n(V,w,\theta) = \frac{1}{n}\sum_{i=1}^{n} w(x_i,\theta_0)\,a_n(i)\,\rho\big(|z(\theta)|_{(i)}\big), \tag{3.2}$$
where $z_i(\theta) = y_i - f(x_i,\theta)$ and $|z(\theta)|_{(i)}$ is the $i$th ordered value among $|z_1(\theta)|,\ldots,|z_n(\theta)|$. The function $\rho : \mathbb{R}^+ \to \mathbb{R}^+$ is continuous and strictly increasing. The numbers $a_n(i)$ are scores generated as $a_n(i) = \varphi^+(i/(n+1))$ for some bounded and nondecreasing score function $\varphi^+ : (0,1)\to\mathbb{R}^+$ that has at most a finite number of discontinuities. The function $w : \mathcal{X}\times\Theta \to \mathbb{R}^+$ is a continuous weight function. Because $D_n(V,w,\theta)$ is continuous in $\theta$, Lemma 2 of Jennrich (1969) implies the existence of a minimizer of $D_n(V,w,\theta)$.

It is clear that weighted LS and LAD are special cases of WSR. Weighted LS is obtained by taking $\varphi^+ \equiv 1$ and $\rho(t) = t^2$, $t \ge 0$, while weighted LAD is obtained by taking $\varphi^+ \equiv 1$ and $\rho(t) = t$. In our analyses, however, LS and LAD refer to the unweighted versions obtained by taking $w \equiv 1$.
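For concreteness, the objective (3.2) can be coded directly. The sketch below is an illustration under assumed choices (an exponential model, random weights, Wilcoxon-type scores), and it numerically checks the claim that taking $\varphi^+\equiv 1$ and $\rho(t)=t^2$ reduces $D_n$ to a weighted LS objective.

```python
# Illustrative sketch of D_n(V, w, theta) in (3.2); the model f, the data, and
# the weights below are hypothetical choices, not from the text.
import numpy as np

def D_n(theta, x, y, f, w, rho, phi_plus):
    z = np.abs(y - f(x, theta))                     # |z_i(theta)|
    order = np.argsort(z)                           # anti-ranks of |z_i(theta)|
    n = len(z)
    a = phi_plus(np.arange(1, n + 1) / (n + 1.0))   # scores a_n(i)
    # pair a_n(i) with the i-th ordered residual and that observation's weight
    return np.mean(w[order] * a * rho(z[order]))

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = np.exp(x) + rng.normal(scale=0.1, size=30)
w = rng.uniform(0.5, 1.0, size=30)
f = lambda xx, th: np.exp(th * xx)

# phi+ == 1 and rho(t) = t^2 recover the weighted LS objective
lhs = D_n(1.0, x, y, f, w, rho=lambda t: t ** 2, phi_plus=np.ones_like)
rhs = np.mean(w * (y - f(x, 1.0)) ** 2)
```

With $\varphi^+\equiv 1$ the ordering of the residuals becomes irrelevant, so the two quantities agree up to floating point.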
In the following, we will establish the asymptotic properties of $\hat\theta_n$ and discuss how weights can be used to obtain a bounded influence function. As given in (3.2), the weights depend on the unknown true parameter $\theta_0$. This will make our derivations cleaner. However, to be of practical use, the weights have to be estimated. In Section 3.3, we will discuss a plug-in estimator of the weights based on a consistent estimator of $\theta_0$, and show that estimators based on these estimated weights have the same asymptotic properties as their counterparts based on the "true" weights.
3.2.1 Preliminaries
The following definitions and notations will be used throughout this chapter. Let $\Omega$ be a domain. We denote by $L^p(\Omega,P)$, $1 \le p \le \infty$, the space of $P$-measurable functions on $\Omega$ for which $\int_\Omega |h|^p\,dP < \infty$, with the usual modification for $p = \infty$. $C^\infty(\Omega)$ is the space of smooth (infinitely differentiable) functions defined on $\Omega$, $\mathcal{D}(\Omega)$ is the space of smooth functions with compact support in $\Omega$, and $L^1_{loc}(\Omega)$ is the space of locally integrable functions on $\Omega$; that is, functions that are integrable on any compact subset of $\Omega$. Let $\alpha = (\alpha_1,\ldots,\alpha_n) \in \mathbb{N}_0^n$, $\mathbb{N}_0 = \mathbb{N}\cup\{0\}$, be a multi-index. The differential operator $D^\alpha$ is defined as
$$D^\alpha = \frac{\partial^{|\alpha|}}{\partial\theta_1^{\alpha_1}\cdots\partial\theta_n^{\alpha_n}},$$
where $|\alpha| = \sum_{i=1}^n \alpha_i$ and $\theta = (\theta_1,\ldots,\theta_n)$. Let $\rho \in L^1_{loc}(\Omega)$. Given $\alpha \in \mathbb{N}_0^n$, a function $\psi \in L^1_{loc}(\Omega)$ is called the $\alpha$th weak derivative of $\rho$ if for all $\phi \in \mathcal{D}(\Omega)$
$$\int_\Omega \rho\, D^\alpha\phi\, du = (-1)^{|\alpha|}\int_\Omega \psi\,\phi\, du,$$
and we put $\psi = D^\alpha\rho$.

As an example, consider $\rho(u) = |u|$. Clearly, $\rho$ is not differentiable in the usual sense at 0. But $\rho \in L^1_{loc}(\mathbb{R})$, $\rho$ is weakly differentiable, and $\rho'(u) = \operatorname{sgn}(u)$.

Let $m \in \mathbb{N}_0$ and $1 \le p \le \infty$. The Sobolev space $W^{m,p}(\Omega)$ is defined as
$$W^{m,p}(\Omega) = \big\{\rho \in L^p(\Omega) : D^\alpha\rho \in L^p(\Omega) \text{ for } |\alpha| \le m\big\}.$$
Given a function $K \in L^1$ such that $\int_{\mathbb{R}^n} K(x)\,dx = 1$, let $K_\delta(x) = \delta^{-n}K(x/\delta)$. The family of functions $\{K_\delta,\ \delta > 0\}$ is called a mollifier with kernel $K$, and $K_\delta$ is known as the Friedrichs mollifier. Some important facts related to Sobolev spaces that may be useful in our discussion are listed below without proofs. A detailed discussion of these can be found in Brezis (1983) and Adams (1975).

(S1) $K_\delta \in C^\infty(\mathbb{R}^n)$, $\operatorname{supp}(K_\delta) = \{x\in\mathbb{R}^n : \|x\| \le \delta\}$, $K_\delta \ge 0$, and $\int_{\mathbb{R}^n} K_\delta(x)\,dx = 1$. Here $\operatorname{supp}(K_\delta)$ denotes the support of $K_\delta$.
(S2) (Regularization Theorem) Let $K_\delta$ be a Friedrichs mollifier. If $\rho \in L^1_{loc}(\mathbb{R}^n)$, then the convolution product
$$\rho * K_\delta(x) = \int_{\mathbb{R}^n} \rho(x-y)\,K_\delta(y)\,dy$$
exists for all $x\in\mathbb{R}^n$. Moreover, $\rho * K_\delta \in C^\infty(\mathbb{R}^n)$, $\operatorname{supp}(\rho * K_\delta) \subset \operatorname{supp}(\rho) + B'(0,\delta)$, where $B'(0,\delta) = \{x\in\mathbb{R}^n : \|x\| \le \delta\}$, $D^\alpha(\rho * K_\delta) = \rho * D^\alpha K_\delta$, and $\operatorname{supp}(D^\alpha K_\delta) \subset \operatorname{supp}(K_\delta)$. Also, if $\mathcal{M}$ is a compact set of points of continuity of $\rho$, then $\rho * K_\delta \to \rho$ uniformly on $\mathcal{M}$ as $\delta \to 0$.

(S3) Let $K_\delta$ be a Friedrichs mollifier, and let $\rho \in W^{m,p}(\Omega)$ for $1 \le p < \infty$. Then $\rho * K_\delta \to \rho$ in $L^p(\Omega)$, and $\rho * K_\delta \to \rho$ in $W^{m,p}(\omega)$ as $\delta \to 0$ for all $\omega \subset\subset \Omega$. Here $\omega \subset\subset \Omega$ means that $\omega$ is open, the closure $\bar\omega$ is compact, and $\bar\omega \subset \Omega$.
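The regularization property (S2) is easy to visualize numerically. The following sketch, an illustration under assumed grid sizes rather than part of the text, mollifies $\rho(u) = |u|$ with a compactly supported bump kernel and shows the uniform error on a compact set shrinking with $\delta$.

```python
# Illustrative check of (S2): rho * K_delta is smooth and converges to rho
# uniformly on compacts as delta -> 0. rho(u) = |u| is continuous, so the
# sup-norm error on [-2, 2] should shrink roughly linearly in delta.
import numpy as np

def bump(u):
    """Standard C-infinity bump supported on [-1, 1] (unnormalized)."""
    out = np.zeros_like(u)
    inside = np.abs(u) < 1
    out[inside] = np.exp(-1.0 / (1.0 - u[inside] ** 2))
    return out

def mollify(rho, xs, delta):
    """Numerical convolution (rho * K_delta)(x) over the grid xs."""
    ys = np.linspace(-delta, delta, 2001)
    dy = ys[1] - ys[0]
    k = bump(ys / delta)
    k /= k.sum() * dy                          # normalize so K_delta integrates to 1
    return np.array([np.sum(rho(x - ys) * k) * dy for x in xs])

xs = np.linspace(-2.0, 2.0, 401)
errs = []
for delta in (0.5, 0.1, 0.02):
    errs.append(np.max(np.abs(mollify(np.abs, xs, delta) - np.abs(xs))))
# errs decreases toward 0, illustrating uniform convergence on the compact set
```

Away from the kink at 0 the mollified function coincides with $|u|$; the maximal error sits at 0 and is proportional to $\delta$.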
3.2.2 Consistency
Let $(\Omega_0,\mathcal{F},P)$ be a probability space. For $i = 1,\ldots,n$, assume that $x_i$ and $e_i = y_i - f(x_i,\theta_0)$ are independent random variables (carried by $(\Omega_0,\mathcal{F},P)$) with distributions $H$ and $G$, respectively. Letting $\tilde G_\theta$ denote the distribution of $|z(\theta)|$, we can rewrite $D_n(V,w,\theta)$ as
$$D_n(V,w,\theta) = \frac{1}{n}\sum_{i=1}^{n} w(x_i,\theta_0)\,a_n(i)\,(\rho\circ\tilde G_\theta^{-1})(\xi_{(i)}),$$
where the $\xi_{(i)}$ are order statistics from the uniform $U(0,1)$ distribution.

Theorem 3. Let

(I1) $P\big(f(x,\theta) = f(x,\theta_0)\big) < 1$ for any $\theta \ne \theta_0$,

(I2) $w \in L^p(\mathcal{X}\times\Theta)$, and there exists a function $h \in L^q(\mathcal{V})$ such that $\big|\rho\big(\tilde G_\theta^{-1}(v)\big)\big| \le h(v)$ for all $\theta\in\Theta$ and all $1 \le p,q \le \infty$ such that $1/p + 1/q = 1$, and

(I3) $G$ has a density $g$ that is symmetric about 0 and strictly decreasing on $\mathbb{R}^+$.

Then $\hat\theta_n \xrightarrow{a.s.} \theta_0$.
Before giving the proof, we state the following lemma without proof. The proof of this lemma may be constructed following Lemma 2.

Lemma 4. Under assumptions (A1)-(A3), $D_n(V,w,\theta) \xrightarrow{a.s.} \lambda(\theta)$ a.e. $V$, uniformly for all $\theta\in\Theta$, where $\lambda:\Theta\to\mathbb{R}$ is a function satisfying $\inf_{\theta\in\Theta^*} \lambda(\theta) > \lambda(\theta_0)$ for any closed subset $\Theta^*$ of $\Theta$ not containing $\theta_0$.

Proof. By Lemma 1 of Wu (1981), to establish the consistency of $\hat\theta_n$, it is sufficient to show that
$$\liminf_{n\to\infty}\ \inf_{\theta\in\Theta^*}\big(D_n(V,w,\theta) - D_n(V,w,\theta_0)\big) > 0 \quad \text{a.s.} \tag{3.3}$$
for any closed subset $\Theta^*$ of $\Theta$ not containing $\theta_0$. To that end, let $A_n(V,w,\theta) = D_n(V,w,\theta) - \lambda(\theta)$, $B(\theta,\theta_0) = \lambda(\theta) - \lambda(\theta_0)$, and $C_n(V,w,\theta_0) = \lambda(\theta_0) - D_n(V,w,\theta_0)$. By Lemma 2, we have $A_n(V,w,\theta) \xrightarrow{a.s.} 0$ uniformly for all $\theta\in\Theta$, $\inf_{\theta\in\Theta^*} B(\theta,\theta_0) > 0$, and $\liminf_{n\to\infty} C_n(V,w,\theta_0) = 0$ a.s. For the statement given in (3.3) to hold, it suffices to show that $\liminf_{n\to\infty}\inf_{\theta\in\Theta^*} A_n(V,w,\theta) = 0$ a.s. $A_n(V,w,\theta)$, being uniformly convergent and continuous on the compact set $\Theta$, is equicontinuous on $\Theta$ a.e. $V$. This gives $\liminf_{n\to\infty}\inf_{\theta\in\Theta^*}\{A_n(V,w,\theta)\} = 0$ a.s., and the proof is complete.
Assumption (I1) is a very weak condition needed for $\theta_0$ to be identified. The linear version of (I1) was given by Hössjer (1994) as $P(|\theta' x| = 0) < 1$ for any $\theta \ne 0$, under the assumption that $\theta_0 = 0$. Since $\varphi^+$ is bounded, by (I2) we have $\|w\varphi^+\|_p < \infty$. Moreover, (I2) and Hölder's inequality ensure that the product $(w\varphi^+)\,\rho(\tilde G_\theta^{-1})$ is integrable. (I3) admits a wide variety of error distributions, examples of which are the normal, double exponential, and Cauchy distributions with location parameter equal to 0.
3.2.3 Asymptotic Normality
Write $\eta(t) = \rho\big[\tilde G_\theta^{-1}(t)\big]$ and $\nu_i = w(x_i,\theta_0)\,a_n(R_i)$, where $R_i$, $i = 1,\ldots,n$, are the ranks of $\xi_1,\ldots,\xi_n$. Then (3.2) can be written as
$$D_n(V,w,\theta) = \frac{1}{n}\sum_{i=1}^{n} w(x_i,\theta_0)\,a_n(i)\,(\rho\circ\tilde G_\theta^{-1})(\xi_{(i)}) = \frac{1}{n}\sum_{i=1}^{n} \nu_i\,\eta(\xi_i).$$
By (I2), $\|\nu_i\|_p < \infty$ for $1 \le p \le \infty$. Now set $\Delta_n(\theta) = D^\alpha D_n(V,w,\theta)$ and $\eta_\alpha(t) = D^\alpha\eta(t)$ for $|\alpha| = 1$. Since the dependence of $\eta_\alpha$ on $y$ is only through $z(\theta)$, we will suppress $y$ in the notation and write $\eta_\alpha(x)$. Now denote by $X_\eta$ the $n\times p$ matrix $X_\eta = \big(\eta_\alpha(x_1),\ldots,\eta_\alpha(x_n)\big)'$, and define $h_{nii}$ to be the $i$th diagonal component of $X_\eta(X_\eta^T X_\eta)^{-1}X_\eta^T$. Now $\hat\theta_n$ is a zero of
$$\Delta_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \nu_i\,\eta_\alpha(\xi_i). \tag{3.4}$$
Thus $\hat\theta_n$ can be seen as a weighted M-estimator with weights $\nu_1,\ldots,\nu_n$. So, under some conditions, the asymptotic theory of weighted M-estimation can be applied.
In addition to (I1) - (I3), consider the following conditions:
(I4) $\theta \mapsto \eta(t)$ is a map in $W^{3,p}(B)$, where $B$ is a neighborhood of $\theta_0$, for every fixed $t$.
(I5) There exist functions $M_\alpha \in W^{2,p}(\mathcal{V})$ such that $|D^\alpha\eta(t)| \le M_\alpha(t)$ for every $\theta\in B$ and $|\alpha| \le 2$.
(I6) $A_{\theta_0} = E\big(w(x,\theta_0)\,\varphi^+(\xi)\,[D^\alpha\eta(\xi)]_{\theta=\theta_0}\big)$, where $\xi \sim U(0,1)$, is a positive definite matrix for $|\alpha| = 2$.
(I7) $\lim_{n\to\infty}\ \max_{1\le i\le n} h_{nii} = 0$.
Example 1. The assumptions above allow us to define certain types of hybrid estimators, which may be constructed in the interest of efficiency and robustness. One such estimator behaves like an LS estimator for small absolute residuals and like an LAD estimator for large absolute residuals. As an illustration, let us consider the one-dimensional case with $\Theta = [a,b]\cup[b,c]$, where $a < b < c$, and
$$\rho\big(|z(\theta)|\big) = \begin{cases} z^2(\theta), & \theta\in[a,b], \\ |z(\theta)| + z^2(b) - |z(b)|, & \theta\in[b,c]. \end{cases}$$
This function is strictly increasing and continuous as a function of $\theta$ but not differentiable in the usual sense at $\theta = b$. The test function $\phi\in\mathcal{D}(\Theta)$ has to satisfy $\phi(a) = \phi(b) = \phi(c) = 0$. By the definition of the weak derivative, $\psi$ satisfies
$$\int_\Theta \rho\big(|z(\theta)|\big)\,\phi'(\theta)\,d\theta = -\int_\Theta \psi(\theta)\,\phi(\theta)\,d\theta.$$
But
$$\begin{aligned}
\int_\Theta \rho\big(|z(\theta)|\big)\,\phi'(\theta)\,d\theta
&= \int_a^b z^2(\theta)\,\phi'(\theta)\,d\theta + \int_b^c |z(\theta)|\,\phi'(\theta)\,d\theta \\
&= \int_a^b 2z(\theta)\,\dot f(\theta;x)\,\phi(\theta)\,d\theta + \int_b^c \operatorname{sgn}\big(z(\theta)\big)\,\dot f(\theta;x)\,\phi(\theta)\,d\theta \\
&= \int_\Theta \Big[2z(\theta)\,\dot f(\theta;x)\,I_{[a,b]}(\theta) + \operatorname{sgn}\big(z(\theta)\big)\,\dot f(\theta;x)\,I_{[b,c]}(\theta)\Big]\phi(\theta)\,d\theta,
\end{aligned}$$
where the second equality follows from integration by parts, using $\partial z(\theta)/\partial\theta = -\dot f(\theta;x)$ with $\dot f(\theta;x) = \partial f(\theta;x)/\partial\theta$. Thus
$$\psi(\theta) = -2z(\theta)\,\dot f(\theta;x)\,I_{[a,b]}(\theta) - \operatorname{sgn}\big(z(\theta)\big)\,\dot f(\theta;x)\,I_{[b,c]}(\theta).$$
Higher order derivatives can be computed similarly. In this case, (I4) and (I5) are satisfied as long as $f$ and its higher order derivatives (up to order 3) are bounded by integrable functions.
In particular, assumption (I4) allows us to include a large variety of generalized functions that are not differentiable in the usual sense. The typical example is the absolute value function $x \mapsto |x|$, which underlies the LAD and SR estimators; it is not differentiable in the usual sense but is locally integrable, and therefore weakly differentiable. None of the computations above can be carried out without assumption (I4). Assumption (I5) ensures the integrability of $(w\varphi^+)(D^\alpha\eta)$ for any $\alpha$ such that $|\alpha| \le 2$, so that the SLLN can be applied (see the remark below). $A_{\theta_0}$ in assumption (I6) denotes the limiting Hessian matrix. (I7) is a common assumption known as Noether's condition; it ensures the asymptotic normality of $\Delta_n(\theta_0)$.
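The hybrid idea of Example 1, quadratic for small absolute residuals and linear beyond a cutoff, can be sketched as a dispersion kernel; the cutoff $b$ below is an illustrative choice, and the continuity correction matches the example.

```python
# Illustrative sketch of Example 1's hybrid rho: LS-like (t^2) for t <= b,
# LAD-like (linear) for t > b, glued continuously at t = b.
import numpy as np

def hybrid_rho(t, b=1.0):
    """rho(t) = t^2 for t <= b, and t + b^2 - b for t > b (continuous at b)."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= b, t ** 2, t + b ** 2 - b)
```

This function is strictly increasing and continuous but not differentiable in the usual sense at $t = b$, which is exactly the situation that the weak-derivative machinery behind (I4) accommodates.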
Remark 6. Under (I1), (I2), and (I3), Lemma 2 in the appendix gives the pointwise almost sure convergence of $D_n(V,w,\theta)$ for any $\theta\in\Theta$. If in addition (I5) holds, then we have $\big[D^\alpha D_n(V,w,\theta)\big]_{\theta=\theta_0} \xrightarrow{a.s.} \Lambda_\alpha(\theta_0)$ a.e. $V$, where $\Lambda_\alpha(\theta_0) \equiv E\big(w(x,\theta_0)\,\varphi^+(\xi)\,[D^\alpha\eta(\xi)]_{\theta=\theta_0}\big)$ for any $\alpha$ such that $|\alpha| \le 3$.
The following theorem gives the asymptotic normality of $\hat\theta_n$. The approach of the proof is similar to that given in van der Vaart (1998) for M-estimators.

Theorem 4. Under assumptions (I1)-(I7),
$$\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{D} N\big(0,\ A_{\theta_0}^{-1}\,\Sigma_{\theta_0}\,A_{\theta_0}^{-1}\big),$$
where $\Sigma_{\theta_0} = E\big(w^2(x,\theta_0)\,\{\varphi^+(\xi)\}^2\,\eta_\alpha(\xi)\,\eta_\alpha(\xi)^T\big)$.
Proof. By (I4), $\theta \mapsto \eta$ is a map in $W^{3,p}(B)$, and properties (S1)-(S3) imply that $\mathcal{D}(\bar B)$ is dense in $W^{m,p}(\bar B)$, where $\bar B$ is the closure of $B$. Thus, in the following, we may assume without loss of generality that $\eta \in \mathcal{D}(\bar B)$.

Implementing the Taylor expansion of $\Delta_n(\theta)$ at $\theta_0$, we get
$$0 = \Delta_n(\hat\theta_n) = \Delta_n(\theta_0) + \dot\Delta_n(\theta_0)(\hat\theta_n - \theta_0) + \tfrac{1}{2}(\hat\theta_n - \theta_0)'\,\ddot\Delta_n(\theta_n^*)\,(\hat\theta_n - \theta_0),$$
where $\theta_n^*$ is a point between $\theta_0$ and $\hat\theta_n$, $\dot\Delta_n(\theta_0) = \big[D^\alpha\Delta_n(\theta)\big]_{\theta=\theta_0}$ for $|\alpha| = 1$, and $\ddot\Delta_n(\theta_0) = \big[D^\alpha\Delta_n(\theta)\big]_{\theta=\theta_0}$ for $|\alpha| = 2$.

Now, using (I7), an application of the Cramér-Wold device (Serfling, 1980) together with the central limit theorem for linear combinations of functions of order statistics (Hájek and Šidák, 1967) shows that $\sqrt{n}\,\Delta_n(\theta_0)$ converges to a multivariate normal distribution with mean $\mathbf{0}$ and covariance matrix $\Sigma_{\theta_0}$. Also, $\dot\Delta_n(\theta_0)$ converges almost surely to $A_{\theta_0}$ (see Remark 6), and hence in probability. By (I4) and Theorem 3, $\lim_{n\to\infty} P(\{\theta_n^* \in B\}) = 1$. So under the event $\{\theta_n^* \in B\}$,
$$\big\|\ddot\Delta_n(\theta_n^*)\big\| \le C\,\frac{1}{n}\sum_{i=1}^{n} w(x_i,\theta_0)\,M_\alpha(\xi_i),$$
where $C$ stands for the bound of the score function and $M_\alpha$ is a dominating function from (I5). The right-hand side of the above inequality is bounded in probability by the law of large numbers for $n$ sufficiently large. These facts and the consistency of $\hat\theta_n$ for $\theta_0$ give
$$-\Delta_n(\theta_0) = \big[A_{\theta_0} + o_p(1) + \tfrac{1}{2}(\hat\theta_n - \theta_0)'O_p(1)\big](\hat\theta_n - \theta_0) = \big[A_{\theta_0} + o_p(1)\big](\hat\theta_n - \theta_0),$$
since $(\hat\theta_n - \theta_0)'O_p(1) = o_p(1)O_p(1) \to 0$ in probability. Also, with probability tending to one, the matrix $A_{\theta_0} + o_p(1)$ is invertible. Thus,
$$\sqrt{n}\,(\hat\theta_n - \theta_0) = -\sqrt{n}\,\big[A_{\theta_0} + o_p(1)\big]^{-1}\Delta_n(\theta_0) = -\sqrt{n}\,A_{\theta_0}^{-1}\Delta_n(\theta_0) + o_p(1). \tag{3.5}$$
The proof is complete by an application of Slutsky's lemma and by noting that $\sqrt{n}\,\Delta_n(\theta_0)$ is asymptotically normal.
In general, for estimators with $w \equiv 1$ and $\rho(t) = t$, we can simplify the asymptotic covariance matrix of $\sqrt{n}\,\hat\theta_n$ to obtain $\tau_{\varphi^+}^2\big(E[\nabla f(x,\theta_0)\,\nabla f(x,\theta_0)^T]\big)^{-1}$, where
$$\tau_{\varphi^+}^2 = \frac{\int_0^1\{\varphi^+(u)\}^2\,du}{\left(\int_0^1 \varphi^+(u)\,\varphi_g^+(u)\,du\right)^2},$$
with $\varphi_g^+(u) = -g'\big(G^{-1}\big(\tfrac{u+1}{2}\big)\big)\big/g\big(G^{-1}\big(\tfrac{u+1}{2}\big)\big)$. It is easy to show that $\tau_{\varphi^+}^2 = \{2g(0)\}^{-2}$ for the LAD estimator and $\tau_{\varphi^+}^2 = \big(\sqrt{12}\int_{-\infty}^{\infty}\{g(t)\}^2\,dt\big)^{-2}$ for the SR estimator. The asymptotic covariance matrix of the LS estimator is $\sigma^2\big(E[\nabla f(x,\theta_0)\,\nabla f(x,\theta_0)^T]\big)^{-1}$, where $\sigma^2 = \int u^2\,dG$ is the error variance.
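The $\tau_{\varphi^+}^2$ expressions above can be checked numerically for standard normal errors; the quadrature below is an illustration (the closed forms are $\int g^2 = 1/(2\sqrt{\pi})$, SR efficiency $3/\pi \approx .955$ versus LS, and LAD efficiency $2/\pi$).

```python
# Numeric check of the tau^2 formulas for G standard normal: the ARE of SR
# versus LS is sigma^2 / tau^2 = 3/pi ~ 0.955, and that of LAD versus LS is 2/pi.
import math

def g(t):
    """Standard normal density."""
    return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)

# Riemann sum for Int g(t)^2 dt; the closed form is 1/(2*sqrt(pi))
h = 0.001
int_g_sq = sum(g(k * h) ** 2 for k in range(-8000, 8001)) * h

tau_sq_sr = (math.sqrt(12.0) * int_g_sq) ** (-2)   # ~ pi/3 for the SR estimator
tau_sq_lad = (2.0 * g(0.0)) ** (-2)                # = pi/2 for the LAD estimator
are_sr = 1.0 / tau_sq_sr    # ~ 3/pi (sigma^2 = 1 for N(0, 1) errors)
are_lad = 1.0 / tau_sq_lad  # = 2/pi
```

These are the .955 and .637 efficiencies quoted in the text for Gaussian errors.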
For $\theta\in\mathbb{R}^p$, two estimators can be compared using the asymptotic relative efficiency (ARE), which is the reciprocal of the ratio of the determinants of their asymptotic covariance matrices raised to the power $1/p$ (Serfling, 1980). The evaluation of the ARE is very complicated in general. We consider a simulation study with $G$ taken to be the $t$ distribution in Section 3.4. The ARE may be determined in closed form when comparing the simpler estimators, such as SR versus LS or LAD, for some error distributions such as the Cauchy, logistic, and normal. These are the same as those for comparing the signed-rank, the sign, and the $t$ tests in the location problem, and are studied extensively in Hettmansperger and McKean (1998).
3.2.4 Robustness
In this section, we discuss conditions needed for the boundedness of the influence function. If we can write
$$\sqrt{n}\,(\hat\theta_n - \theta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \psi(\theta_0; y_i, x_i) + o_p(1),$$
then the influence function of $\hat\theta_n$ at a given point $(y_0,x_0)$ is $\psi(\theta_0; y_0, x_0)$ (e.g., Corollary 3.5.7 of Hettmansperger and McKean (1998)). However, by equation (3.5) we have
$$\sqrt{n}\,(\hat\theta_n - \theta_0) = -\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \Omega(x_i,y_i)\,A_{\theta_0}^{-1}\,\eta_\alpha(x_i) + o_p(1).$$
Thus $\psi(\theta_0; y_0, x_0) = -\Omega(x_0,y_0)\,A_{\theta_0}^{-1}\,\eta_\alpha(x_0)$, where $\Omega(x_0,y_0) = w(x_0,\theta_0)\,\varphi^+\big(2G(|z(\theta_0)|)-1\big)$. Assume that

(I8) there exists a constant $M > 0$ such that $\|w(x,\theta_0)\,\eta_\alpha(x)\| \le M$.

Note that, from the boundedness of the score function $\varphi^+$, for (I8) to hold it suffices that $\|w(x,\theta_0)\,\eta_\alpha(x)\|$ be bounded as a function of $x$. This is the same condition as Assumption 8 of Coakley and Hettmansperger (1993) for $\eta_\alpha(x) = x$, corresponding to $f(x,\theta_0) = x'\theta_0$. Thus, the choice of weights discussed in Section 6.1 of Coakley and Hettmansperger (1993) may be extended to our case.
Theorem 5. Under (I1)-(I8), $\hat\theta_n$ has a bounded influence function.
3.3 Weight Specification
Based on the work of Giltinan et al. (1986) and Simpson et al. (1992) for the linear model, one may set
$$w(x,\theta_0) = \min\left[1,\ \frac{c}{d(x,\theta_0)}\right], \tag{3.6}$$
where $d(x,\theta_0) = (x_\eta - m_{x_\eta})^T\,C_{x_\eta}^{-1}\,(x_\eta - m_{x_\eta})$ is a robust Mahalanobis distance, with $m_{x_\eta}$ and $C_{x_\eta}$ being robust estimates of the location and covariance of $x_\eta = \eta_\alpha(x)$, respectively, and $c$ being some positive constant. As can be seen from its definition, the weight function depends on the true parameter $\theta_0$. So, in our study, we will consider plug-in estimators of $d(x,\theta_0)$ based on strongly consistent estimators of $\theta_0$. As in Coakley and Hettmansperger (1993), we will take resistant distances for $d(x,\theta_0)$ that use location and scale estimates based on the minimum covariance determinant (MCD) and minimum volume ellipsoid (MVE) of Rousseeuw (1984) and Rousseeuw (1985), respectively. A discussion of these and other high breakdown estimators in the multivariate case is given in Hubert et al. (2008).

Analogous to Tableman (1990) and Krasker and Welsch (1982), one may use the weight function $w(x,\theta_0)$ in (3.6) by setting
$$d(x,\theta_0) = x_\eta^T\,B_{\theta_0}^{-1}\,x_\eta, \quad \text{with} \quad B_{\theta_0} = E_{x,y}\Big[w^2(x,\theta_0)\,\big\{\varphi^+[2G(z(\theta_0)) - 1]\big\}^2\,x_\eta\,x_\eta^T\Big],$$
$x_\eta = \eta_\alpha(x)$, and $c = M/a$ for $a$ such that $\sup_t\,\varphi^+(t) = a$. The difficulty in using this weight function for the nonlinear regression model lies in estimating $d(x,\theta_0)$, since the expression of $B_{\theta_0}$ depends on the weight function $w(x,\theta_0)$. So we face a situation where we have to solve an implicit value problem. For linear models, however, $d(x,\theta_0)$ can be specified free of model parameters (see, for example, Section 6 of Coakley and Hettmansperger, 1993). An iterative scheme for computing $d(x,\theta_0)$ can be found in Krasker and Welsch (1982) and in Section 2 of Tableman (1990), and, under some suitable extra conditions (like the invertibility of the score function $\varphi^+$), one may extend the iterative computation of $d(x,\theta_0)$ to the nonlinear case. Another weight function is that considered by Wiens and Zhou (1994); it uses $d(x_i,\theta_0)$ equivalent to $\|A_{\theta_0}^{-1}x_{\eta_i}\|^{-1}$, where $\|\cdot\|$ is the Euclidean norm.
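A minimal version of the weighting scheme (3.6) can be sketched as follows. For brevity the sketch uses the classical sample mean and covariance where the text prescribes robust MCD/MVE estimates; the cutoff and the data are illustrative assumptions.

```python
# Illustrative sketch of w(x) = min(1, cutoff / d(x)) with a Mahalanobis-type
# distance on the carriers x_eta. The text uses robust (MCD/MVE) location and
# scatter; the classical estimates below are stand-ins for brevity.
import numpy as np

def mahalanobis_weights(x_eta, cutoff):
    """x_eta: (n, p) array of carriers; returns weights in (0, 1]."""
    m = x_eta.mean(axis=0)                      # robust center in the text
    C = np.cov(x_eta, rowvar=False)             # robust covariance in the text
    C_inv = np.linalg.inv(np.atleast_2d(C))
    diff = x_eta - m
    d = np.einsum("ij,jk,ik->i", diff, C_inv, diff)   # squared distances
    return np.minimum(1.0, cutoff / np.maximum(d, 1e-12))

rng = np.random.default_rng(1)
x_eta = np.vstack([rng.normal(size=(50, 2)), [[10.0, 10.0]]])  # one gross outlier
w = mahalanobis_weights(x_eta, cutoff=5.99)     # 5.99 ~ chi^2_{.95}(2)
# the outlying row is downweighted; typical rows keep weight 1
```

Points inside the cutoff ellipsoid get weight 1; far points are downweighted in proportion to their squared distance, which is what bounds the influence in the $x$ direction.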
3.3.1 Plug-in Estimator of the Weight
Let $\tilde\theta_n$ be the minimizer of $D_n(V,w,\theta)$ with $w \equiv 1$; that is, the unweighted estimator of $\theta_0$. The estimator $\hat\theta_S$ given at the beginning of Section 3.2 is one such estimator. Under (A1)-(A7), $\tilde\theta_n \xrightarrow{a.s.} \theta_0$ as $n\to\infty$; hence $\tilde\theta_n = \theta_0 + o(1)$ w.p. 1. Denote $\tilde w_i = w(x_i,\tilde\theta_n)$, $\tilde\nu_i = w(x_i,\tilde\theta_n)\,\varphi^+\big(\tfrac{R_i}{n+1}\big)$, and $\tilde\Delta_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \tilde\nu_i\,\eta_\alpha(\xi_i)$, where $\tilde\Omega(x_0,y_0) = w(x_0,\tilde\theta_n)\,\varphi^+\big(2G(|z(\theta_0)|)-1\big)$. Set $\hat{\hat\theta}_n = \operatorname{Argmin}_{\theta\in\Theta}\ \tilde D_n(V,\tilde w,\theta)$, where $\tilde D_n(V,\tilde w,\theta) = \frac{1}{n}\sum_{i=1}^{n} w(x_i,\tilde\theta_n)\,a_n(i)\,(\rho\circ\tilde G_\theta^{-1})(\xi_{(i)})$. The following theorem shows that $\hat{\hat\theta}_n$ and $\hat\theta_n$ have the same asymptotic properties.

Theorem 6. Under (I2), $\tilde\Delta_n(\theta_0) = \Delta_n(\theta_0) + o(1)$ w.p. 1.

Proof. From the fact that $\tilde\theta_n = \theta_0 + o(1)$ w.p. 1 and the continuity of the weight function $w$, we have $w(x_i,\tilde\theta_n) = w(x_i,\theta_0) + o(1)$ w.p. 1. Now, using this fact in the expression of $\tilde\Delta_n(\theta_0)$, we get
$$\tilde\Delta_n(\theta_0) = \Delta_n(\theta_0) + o(1)\,\frac{1}{n}\sum_{i=1}^{n} \varphi^+\Big(\frac{R_i}{n+1}\Big)\,\eta_\alpha(\xi_i).$$
Set $\Gamma_n(\theta_0) = \frac{1}{n}\sum_{i=1}^{n} \varphi^+\big(\frac{R_i}{n+1}\big)\,\eta_\alpha(\xi_i)$. Then, under (A2), by the SLLN for functions of order statistics (van Zwet, 1980), we have
$$\Gamma_n(\theta_0) \xrightarrow{a.s.} E\big[\varphi^+(\xi)\,\eta_\alpha(\xi)\big] < \infty,$$
where $\xi \sim U(0,1)$. Thus $\Gamma_n(\theta_0)$ is bounded in probability, and hence $o(1)\,\Gamma_n(\theta_0) \to 0$ w.p. 1 as $n\to\infty$. Therefore $\tilde\Delta_n(\theta_0) = \Delta_n(\theta_0) + o(1)$ w.p. 1.
Theorem 7. Under (I1)-(I7), $\sqrt{n}\,(\hat{\hat\theta}_n - \hat\theta_n) = o_p(1)$.

Proof. As in the proof of Theorem 4, the Taylor expansion of $\tilde\Delta_n$ about $\theta_0$ gives
$$0 = \tilde\Delta_n(\hat{\hat\theta}_n) = \tilde\Delta_n(\theta_0) + \dot{\tilde\Delta}_n(\theta_0)(\hat{\hat\theta}_n - \theta_0) + \tfrac{1}{2}(\hat{\hat\theta}_n - \theta_0)'\,\ddot{\tilde\Delta}_n(\theta_n^*)\,(\hat{\hat\theta}_n - \theta_0),$$
where $\theta_n^*$ lies on the line segment joining $\hat{\hat\theta}_n$ and $\theta_0$. Under the assumptions of Theorem 6, one can show that $\dot{\tilde\Delta}_n(\theta_0) = \dot\Delta_n(\theta_0) + o(1)$ w.p. 1. From the fact that $w(x_i,\tilde\theta_n) = w(x_i,\theta_0) + o(1)$ w.p. 1 and the SLLN for functions of order statistics, we have $\dot{\tilde\Delta}_n(\theta_0) \xrightarrow{a.s.} A_{\theta_0}$. Note that under (A1)-(A3), $\hat{\hat\theta}_n \xrightarrow{a.s.} \theta_0$. From this and assumption (I4), $\lim_{n\to\infty} P(\{\theta_n^* \in B\}) = 1$. So under the event $\{\theta_n^* \in B\}$, $\ddot{\tilde\Delta}_n(\theta_n^*)$ is bounded in probability. Hence
$$\begin{aligned}
\sqrt{n}\,(\hat{\hat\theta}_n - \theta_0) &= -\sqrt{n}\,A_{\theta_0}^{-1}\,\tilde\Delta_n(\theta_0) + o_p(1)\\
&= -\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \tilde\Omega(x_i,y_i)\,A_{\theta_0}^{-1}\,\eta_\alpha(x_i) + o_p(1)\\
&= -\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \Omega(x_i,y_i)\,A_{\theta_0}^{-1}\,\eta_\alpha(x_i) - o(1)\,A_{\theta_0}^{-1}\,\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \eta_\alpha(x_i) + o_p(1),
\end{aligned}$$
where the last equality follows from $\tilde\Omega(x_i,y_i) = \Omega(x_i,y_i) + o(1)$ w.p. 1, by the continuity of the weight function and the boundedness of $\varphi^+$. By the central limit theorem, $\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \eta_\alpha(x_i)$ converges to a normal distribution and is thus bounded in probability. Therefore
$$\sqrt{n}\,(\hat{\hat\theta}_n - \theta_0) = -\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \Omega(x_i,y_i)\,A_{\theta_0}^{-1}\,\eta_\alpha(x_i) + o_p(1). \tag{3.7}$$
Combining (3.5) and (3.7) yields $\sqrt{n}\,(\hat{\hat\theta}_n - \hat\theta_n) = o_p(1)$.
Remark 7. Theorems 4 and 7 show that $\hat{\hat\theta}_n$ has the asymptotic Gaussian distribution given in Theorem 4. Moreover, equation (3.7) gives
$$\sqrt{n}\,(\hat{\hat\theta}_n - \theta_0) = -\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \Omega(x_i,y_i)\,A_{\theta_0}^{-1}\,\eta_\alpha(x_i) + o_p(1).$$
Thus $\hat{\hat\theta}_n$ has the same influence function as $\hat\theta_n$.
3.4 Examples
All analyses in this section were performed using the R software environment (R Development Core Team, 2009). Estimates were found by means of the "nonlinear minimization subject to box constraints" algorithm (Gay, 1983, 1984) implemented in the function nlminb in the "stats" package of R, with a range of starting values. We use weight functions based on robust Mahalanobis distances. MCD estimates of location and covariance were found using the fast MCD algorithm of Rousseeuw and Van Driessen (1999). Both MCD and MVE are implemented in the R function cov.rob in the package "MASS". For simplicity, all WSR estimators considered in the following use $\varphi^+(u) = u$ and $\rho(t) = t$. In this case $\tilde\theta = \hat\theta_S$.
3.4.1 Monte Carlo Simulation
Exponential Regression
Consider the one-parameter exponential regression model
$$y_i = e^{\theta_0 x_i} + \varepsilon_i, \quad i = 1,\ldots,25. \tag{3.8}$$
We took $x_1,\ldots,x_{25}$ to be iid $N(0,1)$. We then performed $B = 1000$ iterations where, in each iteration, $\varepsilon_1,\ldots,\varepsilon_{25}$ were randomly generated from the $N(0,.01)$ distribution and $y_1,\ldots,y_{25}$ were computed as $y_i = e^{x_i} + \varepsilon_i$. That is, the true value of $\theta_0$ was taken to be 1. We considered two cases: an outlier in the $y$-direction, obtained by adding 50 to $y_{20}$, and an outlier in the $x$-direction, obtained by adding 5 to $x_{20}$. In each of the $B$ repetitions we estimated $\theta_0$ (giving $\hat\theta^{(1)},\ldots,\hat\theta^{(B)}$) using LS, LAD, SR, and WSR. For WSR estimation, we used the weight function
$$w(x,\tilde\theta) = \min\left[1,\ \frac{\chi^2_{.95}(1)}{d(x,\tilde\theta)}\right],$$
where $\chi^2_{.95}(1)$ is the 95th percentile of the $\chi^2(1)$ distribution and $d(x,\tilde\theta) = \sigma^{-2}\big(x e^{\tilde\theta x} - \mu\big)^2$. Here $\tilde\theta$ is the SR estimate of $\theta_0$, whereas $\mu$ and $\sigma^2$ are the MCD center and variance, respectively, of $x e^{\tilde\theta x}$. As the estimated value of $\theta_0$ we took $B^{-1}\sum\hat\theta^{(i)}$, and as the estimate of the MSE we took $B^{-1}\sum(\hat\theta^{(i)} - \theta_0)^2$. The results are given in Table 3.1.
Table 3.1: Average estimates (MSEs) of $\theta_0 = 1$ in 1000 simulated samples

                LS                 LAD            SR             WSR
   y-outlier    -3.766 (22.7108)   .999 (.0001)   .999 (.0001)   .973 (.0031)
   x-outlier     .233 (.5878)      .342 (.4327)   .264 (.5413)   .993 (.0001)
From Table 3.1, it is clear that the LS estimate is affected rather adversely by the presence of the $y$-outlier. We can also observe that the LS, LAD, and SR estimates are affected by the $x$-outlier, while the WSR estimate remained close to $\theta_0 = 1$ in both cases. Figure 3.1 gives the plots of the fitted curves obtained using LS, LAD, SR, and WSR. It is clear from the plot that the WSR fit is not affected by either the outlying $x$ or the outlying $y$ values.
A Study of Relative Efficiency
It is well known that in linear models the relative efficiency (RE) of the signed-rank estimator with respect to the least squares estimator is $3/\pi \approx .955$ when $G$ is the normal distribution. This RE increases with the heaviness of the tails of the error distribution. To study this for nonlinear models, we considered a simulation experiment involving the Michaelis-Menten and Gompertz functions.

[Figure 3.1: Plot of fitted exponential curves including WSR. Left panel: outlier in $y$; right panel: outlier in $x$; curves shown: True, LS, LAD, SR, WSR.]

The Michaelis-Menten function describes the relation between the reaction velocity $f(x)$ of an enzyme and the substrate concentration $x$. It is given by
$$f(x,\theta) = \frac{\alpha x}{\beta + x},$$
where $\theta = (\alpha,\beta)$, $\alpha$ is the maximum velocity of the reaction, and $\beta$ is the half-saturation constant. The Gompertz model, on the other hand, is defined by
$$f(x,\theta) = \alpha\exp\{-\beta e^{-\kappa x}\},$$
where $\theta = (\alpha,\beta,\kappa)$ is the vector of parameters of interest. One application of this function is the modeling of tumor growth. In fact, tumors are cellular populations growing in a confined space where the availability of nutrients is limited. In this situation, $f(x)$ denotes the tumor size, $\alpha$ the carrying capacity (i.e., the maximum size that can be reached with the available nutrients), $\beta = \ln\big(\alpha/f(0)\big)$, where $f(0)$ is the tumor size at the starting observation time, and $\kappa$ the growth rate of the tumor.
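For reference, the two simulation models can be written down directly; the parameter values in the usage line below are placeholders, not values from the experiment.

```python
# Illustrative definitions of the two regression functions used in the
# efficiency study; parameter names follow their roles in the text.
import numpy as np

def michaelis_menten(x, alpha, beta):
    """f(x) = alpha * x / (beta + x): reaction velocity vs. substrate level."""
    return alpha * x / (beta + x)

def gompertz(x, alpha, beta, kappa):
    """f(x) = alpha * exp(-beta * exp(-kappa * x)): sigmoidal growth curve."""
    return alpha * np.exp(-beta * np.exp(-kappa * x))

# sanity property: at x = beta the Michaelis-Menten curve reaches alpha / 2
v_half = michaelis_menten(2.0, 10.0, 2.0)
```

The Gompertz sketch also reproduces the relation $\beta = \ln(\alpha/f(0))$ noted above, since $f(0) = \alpha e^{-\beta}$.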
For both models, $n = 100$ values of $x$ were generated at random from the standard exponential distribution. Then, for each of $B = 2000$ repetitions, $n$ random errors $\varepsilon$ were generated from the $t$ distribution with $d$ degrees of freedom, and $n$ values of $y = f(x,\theta) + \varepsilon$ were computed. This was done for degrees of freedom $d$ ranging from 3 to 102 in steps of 3. The $t$ distribution is ideal for simulating varying tail thicknesses, as it provides distributions ranging from the Cauchy ($d = 1$) to the normal ($d = \infty$). We are interested in estimating the parameter $\beta$ in both models. The weighting scheme used for WSR estimation is the same as the one in the previous Monte Carlo experiment (MCD along with a $\chi^2_{.95}$ cutoff).
Figure 3.2 contains the plot of degrees of freedom versus the estimated relative efficiency given by ratios of estimated MSEs. The top dashed horizontal line is at $3/\pi$ and the bottom dashed horizontal line is at $2/\pi$. These values are the theoretical asymptotic relative efficiencies of SR versus LS and of LAD versus LS, respectively, for the linear model. Estimators with relative efficiencies above the solid line are more efficient than the LS estimator. The plot shows that the SR and WSR estimators are more efficient than the LAD estimator. Moreover, we observe that the relative efficiencies approach the theoretical values under normality as the degrees of freedom increase. We can also see that the LAD, SR, and WSR estimators are more efficient than the LS estimator for heavy-tailed error distributions. SR and WSR provide similar estimates of relative efficiency.
Figure 3.2: Relative Efficiency. [Two panels, "Michaelis-Menten" and "Gompertz": relative efficiency versus LS plotted against the t degrees of freedom for the SR, LAD, and WSR estimators.]
3.4.2 Real Data
We now consider the Lakes Data given in Stromberg (1993). The data were collected
from 29 lakes in Florida. Three variables were collected on each lake. The response is the
mean annual nitrogen concentration (y). The predictors are the average influent nitrogen
concentration (x1) and the water retention time (x2). The model recommended by the
investigator is
\[ y_i = \frac{x_{1i}}{1 + \beta_1 x_{2i}^{\beta_2}} + \varepsilon_i\,, \quad i = 1,\ldots,29\,. \]
We fit the model using LS, LAD, SR, and WSR. For the weighted SR, we used the weight function
\[ w(x) = \min\left\{1, \frac{\chi^2_{0.95}(2)}{d(x;\tilde\theta)}\right\}\,. \]
We computed two versions of WSR. WSR1 uses weights where $d(x;\tilde\theta)$ is taken to be the robust Mahalanobis distance constructed using the MVE estimates of the center and covariance of $\nabla f(x;\tilde\theta)$ with $f(x;\theta) = x_1/(1 + \beta_1 x_2^{\beta_2})$, $\theta = (\beta_1,\beta_2)$, and $x = (x_1,x_2)$. WSR2 has the same setup as WSR1 except that it uses weights based on classical Mahalanobis distances. We take $\tilde\theta$ to be the SR estimate of $\theta$. The results obtained taking $\tilde\theta$ as the LS or LAD estimates are similar and hence not reported. Table 3.2 gives the estimates of the parameters, standard errors, and scale.
Table 3.2: Parameter estimates for Lakes Data

Method  Parameter   Estimate   SE        Z        p-value   Scale
LS      $\beta_1$   5.08423    1.93891   2.62220  0.00874   1.2637
        $\beta_2$   1.27852    0.35334   3.61835  0.00030
LAD     $\beta_1$   3.88311    0.79337   4.89443  0.00000   1.0046
        $\beta_2$   1.29880    0.23082   5.62686  0.00000
SR      $\beta_1$   5.09126    1.66082   3.06551  0.00217   1.3877
        $\beta_2$   1.22612    0.29373   4.17432  0.00003
WSR     $\beta_1$   0.88588    0.58543   1.51320  0.13023   1.2904
        $\beta_2$   0.39018    0.17357   2.24799  0.02458
For the original data, it is apparent from Table 3.2 that WSR1 gives estimates that are quite different from those of the other methods. The residual diagnostic plots given in Figure 3.3 can help us determine which of the results is the most appropriate. The figure contains plots of residuals from the fitted models versus $x_1$ (left panel) and residuals versus $\hat y$ (right panel). The plots for LAD and SR (not shown) are very similar to those of LS. We observe that all plots of residuals versus $x_1$ identify two potential outliers (denoted by solid squares) in the $x_1$ direction. However, these observations do not stand out as aberrant in the residual versus fits plot for LS, whereas WSR2 identifies only one of them as a potential outlier. The WSR1 fit clearly identifies the two points as outliers. Stromberg (1993) also identifies the same points using the high breakdown least median of squares and MM estimators. The residual plot of WSR1 given in Figure 3.3 and that of MM given in Stromberg (1993) are quite similar. The "Outliers Removed" part of Table 3.2 gives the estimates and standard errors after removing the two outliers identified by WSR1. The estimates due to the different methods are then quite similar to each other.

We removed the one outlier identified by WSR2 and refit all the models. LS, SR, and LAD fail to identify the remaining outlier, whereas WSR2 identifies it. This indicates a potential masking problem associated with using the classical Mahalanobis distance in the weights. It is very much the same masking problem one faces when using hat matrix weights in linear regression; this issue is discussed in Wiens and Du (2000). In this case, the use of robust Mahalanobis distances appears to be a good solution to the masking problem.
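The weighting idea behind WSR1 and WSR2 can be sketched in a few lines. This is our own illustration (function names are ours; the cutoff $\chi^2_{0.95}(2) \approx 5.991$ is the standard chi-square quantile): classical Mahalanobis distances use the sample mean and covariance, which an outlier itself inflates and can thereby mask; a robust fit (MVE/MCD, e.g., from a robust-statistics library) would replace `center` and `cov` below.

```python
import numpy as np

CHI2_95_DF2 = 5.991  # 0.95 quantile of the chi-square distribution, 2 df

def mahalanobis_weights(X, center=None, cov=None):
    """w_i = min(1, cutoff / d_i), with d_i the squared Mahalanobis distance
    of row i; pass a robust center/cov (MVE, MCD) to avoid masking."""
    if center is None:
        center = X.mean(axis=0)          # classical (non-robust) choices
    if cov is None:
        cov = np.cov(X, rowvar=False)
    diff = X - center
    d = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    return np.minimum(1.0, CHI2_95_DF2 / d)
```

Points in the bulk of the data keep weight 1 while distant points are smoothly downweighted, which is what bounds the influence function in the design space.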
3.5 Discussion
In this chapter, we proposed a rank-based analysis of nonlinear regression models. Our
study uses a weighted generalized signed-rank dispersion function. The generalized signed-
rank dispersion function results in a class of estimators including the signed-rank, least
squares, and least absolute deviations estimators. These estimators do not have bounded influence functions. The influence functions of the LAD and LS estimators are unbounded in both the response and design spaces, while that of the signed-rank estimator is unbounded in the design space. Thus, the signed-rank estimation procedure may not be suitable for studies with uncontrolled designs.
Added weights allow us to construct estimators with bounded influence. However, it is
rather complicated to use the proposed weights directly since they cannot be expressed free
of the regression parameter for nonlinear models. Our solution is to replace the regression
parameter by its consistent estimator in the weight function. This results in estimators
that are asymptotically equivalent to those in which the weights are based on the unknown
true value of the parameter. The unweighted versions of our estimators are consistent;
hence, they can be used to estimate the weights. We recommend weight functions that are
based on robust Mahalanobis distances using robust estimates of location and covariance of
the Jacobian matrix of the regression function. They are explicitly defined and are simple
to construct. Moreover, as our simulation studies demonstrate, the weighted signed-rank
estimators with weights based on robust Mahalanobis distances are efficient as well as robust.
It is also shown that these weighted estimators can be useful in detecting outliers in nonlinear
regression.
Figure 3.3: Residual plot versus x1 and fitted values for Lakes data. [Three rows of panels: LS, WSR1, and WSR2 residuals plotted against x1 (left) and against the corresponding fitted values (right).]
Chapter 4
Rank Regression with Missing Response
4.1 Introduction
The problem of missing data is nowadays at the center of almost all statistical studies. It occurs in a wide array of application areas for several reasons. Among the reasons are:

- a sensor in a remote sensor network may be damaged and cease to transmit data;
- certain regions of a gene microarray may fail to yield measurements of the underlying gene expressions due to scratches, fingerprints, or manufacturing defects;
- in a clinical trial, participants may drop out during the course of the study, leading to missing observations at subsequent time points;
- all applicable tests may not be ordered by a doctor while diagnosing a patient;
- users of a recommender system rate an extremely small fraction of the available books, movies, or songs, leading to a massive amount of missing data.

Also, data may be missing because equipment malfunctioned, the weather was terrible, people got sick, or the data were not entered correctly. The type of missingness considered in this chapter is the missing response in the context of regression analysis, which often arises in various experimental settings such as market research surveys, medical studies, opinion polls, and socioeconomic investigations.
The statistical investigation of such a problem is a very difficult task since, in most cases, missing data themselves contain either little or no information about the missing data mechanism (MDM). The fundamental and most commonly used assumption about the MDM is the MAR assumption discussed in Rubin (1976). The idea behind MAR is that the probability that a response variable is observed can depend only on the values of those other variables that have been observed. The scientific literature provides extensive studies and effective computational methods for handling missing data under the MAR assumption.
4.2 Model Definition
Consider the linear semi-parametric regression model
\[ Y_i = X_i^\top\beta + g(T_i) + \varepsilon_i\,, \quad 1 \le i \le n\,, \tag{4.1} \]
where $\beta \in \mathcal{B}$ is a vector of parameters, the $X_i$'s are i.i.d. $p$-variate random covariate vectors, the $T_i$'s are i.i.d. univariate random covariates defined on $[0,1]$, the function $g : [0,1] \to \mathbb{R}$ is unknown, and the model errors $\varepsilon_i$ are independent with conditional mean zero given the covariates. Also, $0 < E(\varepsilon_i^2 \mid Z_i) < \infty$ with $Z_i = (X_i, T_i)$. In this chapter, we are interested in inference on the true value $\beta_0$ of the parameter $\beta$ when there are missing responses in the linear semi-parametric model (4.1).
This model has captured a lot of attention in recent years. An application of (4.1) to a mouthwash experiment was given by Speckman (1988). An example using (4.1) is provided in Green and Silverman (1994). The least squares estimation approach for the setting discussed above was studied by Wang and Sun (2007). A semi-parametric mixed model for analyzing the CD4 cell count in HIV seroconverters was studied by Zeger and Diggle (1994). Model (4.1) has also been applied in several fields such as biometrics, see Gray (1994), and econometrics, among others. In the complete data setting, this model has been extensively studied by many authors such as Heckman (1986), Speckman (1988), Robinson (1988), and Rice (1986), among others. In the framework of model (4.1), Wang et al. (2004) developed inference tools for the mean of Y in the missing response case, based on the least squares estimation approach and under the MAR assumption. The historical method for constructing a confidence interval for the true mean of Y is the empirical likelihood method introduced by Owen (1990). As pointed out by many authors, this method has many advantages over other methods such as those based on normal approximations or the bootstrap (Hall and La Scala, 1990). Many authors have studied this framework, including Kitamura (1997), Peng (2004), Wang et al. (2004), Xue and Zhu (2007b), Xue and Zhu (2007a), Xue and Zhu (2006), Wang and Rao (2002a), Sun et al. (2009), and Chen and Hall (1993), among others.
When dealing with missing data, the main approach is to impute a plausible value for each missing datum and then analyze the results as if they were complete. In most regression problems, the commonly used approaches include linear regression imputation (Healy and Westmacott, 1956), nonparametric kernel regression imputation (Cheng, 1994; Wang and Rao, 2002b), and semi-parametric regression imputation (Wang and Sun, 2007), among others. Wang and Sun (2007) also considered the semi-parametric regression imputation approach to estimate the true mean of Y.

The other well known approach for handling missing data is inverse probability weighting. This approach has gained considerable attention as a way to deal with missing data problems. For a discussion of this approach, see Wang et al. (1997), Robins et al. (1994), Wang et al. (2004), Zhao et al. (1996), and references therein. As pointed out by Wang and Sun (2007), for missing data problems, the inverse probability weighting approach usually depends on high dimensional smoothing for estimating the completely unknown propensity score function, leading to the well known "curse of dimensionality" that may restrict the use of the resulting estimator. One way to avoid such a problem is to use the inverse marginal probability weighted method suggested by Wang et al. (2004).
4.3 Estimation
In model (4.1), consider the case where some values of Y in the sample of size n may be missing, but X and T are fully observed. That is, we obtain the incomplete observations
\[ (Y_i, \delta_i, X_i, T_i)\,, \quad i = 1,2,\ldots,n\,, \]
from (4.1), where the $X_i$'s and $T_i$'s are observed, and
\[ \delta_i = \begin{cases} 0, & Y_i \text{ is missing}, \\ 1, & \text{otherwise}. \end{cases} \]
We assume that Y is missing at random (MAR). The MAR assumption implies that $\delta$ and Y are conditionally independent given X and T; that is, $P(\delta = 1 \mid Y, X, T) = P(\delta = 1 \mid X, T)$. As discussed above, this is a common assumption for statistical analysis with missing data and is reasonable in many practical situations; see Little and Rubin (1987).

Let us introduce the following notation: $Z = (X,T)$, $\sigma^2(Z) = E(\varepsilon^2 \mid Z)$, $\Delta(z) = P(\delta = 1 \mid Z = z)$ and $\Delta(t) = P(\delta = 1 \mid T = t)$. Consider the following rank objective function,
\[ D_n^C(\beta) = \frac{1}{n}\sum_{i=1}^n \varphi\left(\frac{R(e_i(\beta))}{n+1}\right) e_i(\beta)\,, \tag{4.2} \]
where $e_i(\beta) = \delta_i\varepsilon_i(\beta)$, with $\varepsilon_i(\beta) = Y_i - X_i^\top\beta - g(T_i)$, and $R(e_i(\beta))$ is the rank of $e_i(\beta)$ among $e_1(\beta),\ldots,e_n(\beta)$. Note that in the expression of $D_n^C(\beta)$ in (4.2), $\beta$ and $g$ are unknown. Also, $E[e_i(\beta_0) \mid Z_i] = 0$ by the MAR assumption.
So, before dealing with the estimation of $\beta_0$, let us first consider a preliminary estimation of $g$ based on the complete data; that is, estimating $g$ as a known function of $t$ but unknown with respect to $\beta$. As discussed in Wang et al., pre-multiplying (4.1) by the observation indicator, we have
\[ \delta_i Y_i = \delta_i X_i^\top\beta + \delta_i g(T_i) + \delta_i\varepsilon_i\,, \]
and taking conditional expectations given T, we have
\[ E[\delta_i Y_i \mid T_i = t] = E[\delta_i X_i \mid T_i = t]^\top\beta + E[\delta_i \mid T_i = t]\, g(t)\,, \tag{4.3} \]
from which it follows that, with
\[ g_1^C(t) = \frac{E[\delta X \mid T = t]}{E[\delta \mid T = t]} \quad\text{and}\quad g_2^C(t) = \frac{E[\delta Y \mid T = t]}{E[\delta \mid T = t]}\,, \]
we have
\[ g(t) = g_2^C(t) - (g_1^C(t))^\top\beta\,. \tag{4.4} \]
Let $K(\cdot)$ be a kernel function and $b_n$ a bandwidth sequence such that $b_n \to 0$ as $n\to\infty$. Define the weights
\[ W_{nj}^C(t) = \frac{K\left(\frac{t - T_j}{b_n}\right)}{\sum_{j=1}^n \delta_j K\left(\frac{t - T_j}{b_n}\right)}\,. \]
Then $\tilde g_{1n}^C(t) = \sum_{j=1}^n \delta_j W_{nj}^C(t) X_j$ and $\tilde g_{2n}^C(t) = \sum_{j=1}^n \delta_j W_{nj}^C(t) Y_j$ are strongly consistent estimators of $g_1^C(t)$ and $g_2^C(t)$, respectively. Now, define $\widetilde D_n^C$ by
\[ \widetilde D_n^C(\beta) = \frac{1}{n}\sum_{i=1}^n \varphi\left(\frac{R(\zeta_{ni}(\beta))}{n+1}\right)\zeta_{ni}(\beta)\,, \tag{4.5} \]
where $\zeta_{ni}(\beta) = \delta_i\big[(Y_i - \tilde g_{2n}^C(T_i)) - (X_i - \tilde g_{1n}^C(T_i))^\top\beta\big]$. From the strong consistency of $\tilde g_{1n}^C(t)$ and $\tilde g_{2n}^C(t)$, it can be shown that $\zeta_{ni}(\beta) \to e_i(\beta)$ in distribution as $n\to\infty$. Define the rank estimator based on complete data as
\[ \tilde\beta_\varphi^C = \operatorname*{Argmin}_{\beta\in\mathcal{B}} \widetilde D_n^C(\beta)\,. \]
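The two-stage scheme above — kernel-smooth out $g$ using the observed cases, then minimize the rank dispersion of the adjusted residuals — can be sketched for scalar X. This is our own illustration (Wilcoxon score $\varphi(u) = \sqrt{12}(u - 1/2)$, Epanechnikov kernel, grid minimization), not the implementation used in the study:

```python
import numpy as np

def phi(u):
    # Wilcoxon score: integrates to 0, with integral of phi^2 equal to 1
    return np.sqrt(12.0) * (u - 0.5)

def kernel_components(X, Y, T, delta, bn):
    """delta-weighted Nadaraya-Watson estimates of g1C(T_i) and g2C(T_i)."""
    K = 0.75 * np.maximum(1 - ((T[:, None] - T[None, :]) / bn) ** 2, 0.0)
    denom = (K * delta).sum(axis=1)
    denom = np.where(denom > 0, denom, 1.0)  # guard: no observed neighbors
    return (K * delta * X).sum(axis=1) / denom, (K * delta * Y).sum(axis=1) / denom

def rank_dispersion(beta, X, Y, T, delta, bn):
    g1, g2 = kernel_components(X, Y, T, delta, bn)
    z = delta * ((Y - g2) - (X - g1) * beta)   # zeta_{ni}(beta)
    ranks = np.argsort(np.argsort(z)) + 1      # ranks of the z_i
    return np.mean(phi(ranks / (len(z) + 1)) * z)

def rank_fit(X, Y, T, delta, bn, grid):
    """Grid minimizer of the rank dispersion over candidate betas."""
    return grid[np.argmin([rank_dispersion(b, X, Y, T, delta, bn) for b in grid])]
```

Since $\delta_i$ multiplies every residual and every kernel weight, missing responses are never read; the dispersion is convex in $\beta$, so the grid search recovers the slope well.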
Below, we will study the asymptotic properties of rank estimators of $\beta_0$, some of which are defined later. The required assumptions are given and discussed in Section 4.4.

Theorem 8. Under assumptions (J1)-(J7) except (J4),
\[ \lim_{n\to\infty}\sup_{\beta\in\mathcal{B}} \big|\widetilde D_n^C(\beta) - D_n^C(\beta)\big| = 0\,. \]
Proof. In this proof, $L$ is taken to be an arbitrary positive constant, not necessarily the same at each occurrence. By the mean value theorem, there exists $\lambda_n(i)$ between $R(\zeta_{ni}(\beta))/(n+1)$ and $R(e_i(\beta))/(n+1)$ such that
\[ \varphi\Big(\frac{R(\zeta_{ni}(\beta))}{n+1}\Big) - \varphi\Big(\frac{R(e_i(\beta))}{n+1}\Big) = \varphi'(\lambda_n(i))\Big[\frac{R(\zeta_{ni}(\beta))}{n+1} - \frac{R(e_i(\beta))}{n+1}\Big]\,. \]
Hence
\begin{align*}
\widetilde D_n^C(\beta) - D_n^C(\beta) &= \frac1n\sum_{i=1}^n\Big[\varphi\Big(\frac{R(\zeta_{ni}(\beta))}{n+1}\Big)\zeta_{ni}(\beta) - \varphi\Big(\frac{R(e_i(\beta))}{n+1}\Big)e_i(\beta)\Big] \\
&= \frac1n\sum_{i=1}^n\Big[\varphi\Big(\frac{R(\zeta_{ni}(\beta))}{n+1}\Big)\big(\zeta_{ni}(\beta) - e_i(\beta)\big) + \varphi'(\lambda_n(i))\Big(\frac{R(\zeta_{ni}(\beta))}{n+1} - \frac{R(e_i(\beta))}{n+1}\Big)e_i(\beta)\Big]\,.
\end{align*}
Also, by the uniform continuity of $\varphi$ (since $\varphi'$ is bounded with bound $L > 0$), for fixed $n$ one can choose $\epsilon(n) = (L+1)\delta(n) > 0$, with $\epsilon(n)\to 0$ as $n\to\infty$, such that
\[ \Big|\frac{R(\zeta_{ni}(\beta))}{n+1} - \frac{R(e_i(\beta))}{n+1}\Big| \le \delta(n) \implies \Big|\varphi\Big(\frac{R(\zeta_{ni}(\beta))}{n+1}\Big) - \varphi\Big(\frac{R(e_i(\beta))}{n+1}\Big)\Big| < \epsilon(n)\,. \]
From this, we have
\[ \big|\widetilde D_n^C(\beta) - D_n^C(\beta)\big| \le \frac1n\sum_{i=1}^n\Big|\varphi\Big(\frac{R(\zeta_{ni}(\beta))}{n+1}\Big)\Big|\,\big|\zeta_{ni}(\beta) - e_i(\beta)\big| + \frac1n\sum_{i=1}^n\big|\varphi'(\lambda_n(i))\big|\,\Big|\frac{R(\zeta_{ni}(\beta))}{n+1} - \frac{R(e_i(\beta))}{n+1}\Big|\,|e_i(\beta)|\,. \]
From the boundedness of $\varphi$, there exists a constant $L > 0$ such that $\big|\varphi\big(R(\zeta_{ni}(\beta))/(n+1)\big)\big| \le L$. Note that $\zeta_{ni}(\beta) - e_i(\beta) = \delta_i[g(T_i) - g_n(T_i)]$, where $g_n(t) = \tilde g_{2n}^C(t) - (\tilde g_{1n}^C(t))^\top\beta$. Also note that $\sup_{\beta\in\mathcal{B}}|g_n(t) - g(t)| \to 0$ as $n\to\infty$. This can be seen from the fact that $g_n(t) - g(t) = (\tilde g_{2n}^C(t) - g_2^C(t)) - (\tilde g_{1n}^C(t) - g_1^C(t))^\top\beta$, and then
\[ \sup_{\beta\in\mathcal{B}}|g_n(t) - g(t)| \le |\tilde g_{2n}^C(t) - g_2^C(t)| + \sup_{\beta\in\mathcal{B}}\|\beta\|\,\|\tilde g_{1n}^C(t) - g_1^C(t)\| \to 0\,, \tag{4.6} \]
since $\tilde g_{2n}^C(t) - g_2^C(t) \to 0$ and $\tilde g_{1n}^C(t) - g_1^C(t) \to 0$ w.p. 1 as $n\to\infty$. This implies that
\[ \sup_{\beta\in\mathcal{B}}\big|\widetilde D_n^C(\beta) - D_n^C(\beta)\big| \le \frac{L}{n}\sum_{i=1}^n\sup_{\beta\in\mathcal{B}}\big|\zeta_{ni}(\beta) - e_i(\beta)\big| + \frac{\delta(n)L}{n}\sum_{i=1}^n\sup_{\beta\in\mathcal{B}}|e_i(\beta)|\,. \]
Now,
\[ \frac1n\sum_{i=1}^n\sup_{\beta\in\mathcal{B}}\big|\zeta_{ni}(\beta) - e_i(\beta)\big| \le \frac1n\sum_{i=1}^n\sup_{\beta\in\mathcal{B}}|g_n(T_i) - g(T_i)| = \int_0^1\sup_{\beta\in\mathcal{B}}|g_n(t) - g(t)|\,dm_n(t)\,, \]
where $m_n(t) = \frac1n\sum_{i=1}^n I(T_i \le t)$. Applying the dominated convergence theorem, it is easily seen that
\[ \int_0^1\sup_{\beta\in\mathcal{B}}|g_n(t) - g(t)|\,dm_n(t) \to 0 \quad\text{as } n\to\infty\,. \]
On the other hand, by the strong law of large numbers,
\[ \frac1n\sum_{i=1}^n\sup_{\beta\in\mathcal{B}}|e_i(\beta)| \to E\Big[\sup_{\beta\in\mathcal{B}}|e_1(\beta)|\Big] < \infty\,. \]
Therefore,
\[ \lim_{n\to\infty}\sup_{\beta\in\mathcal{B}}\big|\widetilde D_n^C(\beta) - D_n^C(\beta)\big| = 0\,. \qquad\square \]
Remark 8. This theorem implies that in a neighborhood of the true parameter $\beta_0$, $\tilde\beta_\varphi^C$ and the minimizer of $D_n^C(\beta)$ are equivalent. Under some mild conditions, $\tilde\beta_\varphi^C$ is a strongly consistent estimator of $\beta_0$; that is, $\tilde\beta_\varphi^C \to \beta_0$ w.p. 1. Also, $\tilde\beta_\varphi^C$ is a $\sqrt{n}$-consistent estimator of $\beta_0$; that is, $\sqrt{n}(\tilde\beta_\varphi^C - \beta_0) = O_p(1)$. For reference to these facts, see Hettmansperger and McKean (1998).
For $i = 1,\ldots,n$, by imputation ($j = 1$) and inverse probability ($j = 2$), define $\widetilde Y_{ijn}$ by
\[ \widetilde Y_{ijn} = \begin{cases} \delta_i Y_i + (1 - \delta_i)\big(X_i^\top\tilde\beta_\varphi^C + \hat g_n^C(T_i)\big)\,, & \text{if } j = 1, \\[6pt] \dfrac{\delta_i}{\widehat\Delta(T_i)}\, Y_i + \Big(1 - \dfrac{\delta_i}{\widehat\Delta(T_i)}\Big)\big(X_i^\top\tilde\beta_\varphi^C + \hat g_n^C(T_i)\big)\,, & \text{if } j = 2, \end{cases} \]
where $\hat g_n^C(t) = \tilde g_{2n}^C(t) - (\tilde g_{1n}^C(t))^\top\tilde\beta_\varphi^C$ and $\widehat\Delta(t) = \sum_{j=1}^n \omega_{nj}(t)\,\delta_j$, with
\[ \omega_{nj}(t) = \frac{\Omega\left(\frac{t - T_j}{h_n}\right)}{\sum_{j=1}^n \Omega\left(\frac{t - T_j}{h_n}\right)}\,. \]
Here $\Omega$ is a kernel function and $h_n$ a bandwidth sequence satisfying $h_n \to 0$ as $n\to\infty$. If we set,
\[ Y_{ij} = \begin{cases} \delta_i Y_i + (1 - \delta_i)\big(X_i^\top\beta + g(T_i)\big)\,, & \text{if } j = 1, \\[6pt] \dfrac{\delta_i}{\Delta(Z_i)}\, Y_i + \Big(1 - \dfrac{\delta_i}{\Delta(Z_i)}\Big)\big(X_i^\top\beta + g(T_i)\big)\,, & \text{if } j = 2, \end{cases} \tag{4.7} \]
then, under the MAR assumption,
\[ E(Y_{i1} \mid Z_i) = E\big[\delta_i Y_i + (1 - \delta_i)\big(X_i^\top\beta + g(T_i)\big) \mid Z_i\big] = X_i^\top\beta + g(T_i)\,. \]
Thus $Y_{i1} = X_i^\top\beta + g(T_i) + e_i$, with $E(e_i \mid Z_i) = 0$. Also, for the inverse probability case and under the MAR condition, we have
\[ E\Big[\frac{\delta_i}{\Delta(Z_i)}Y_i + \Big(1 - \frac{\delta_i}{\Delta(Z_i)}\Big)\big(X_i^\top\beta + g(T_i)\big) \,\Big|\, Z_i\Big] = E\Big[\frac{\delta_i}{\Delta(T_i)}Y_i + \Big(1 - \frac{\delta_i}{\Delta(T_i)}\Big)\big(X_i^\top\beta + g(T_i)\big) \,\Big|\, Z_i\Big] \]
and
\[ E\Big[\frac{\delta_i}{\Delta(T_i)}Y_i + \Big(1 - \frac{\delta_i}{\Delta(T_i)}\Big)\big(X_i^\top\beta + g(T_i)\big) \,\Big|\, Z_i\Big] = X_i^\top\beta + g(T_i)\,. \]
Thus $Y_{i2} = X_i^\top\beta + g(T_i) + \eta_i$, with $E(\eta_i \mid Z_i) = 0$.
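For scalar X, the two imputed-response constructions can be written directly. A minimal sketch under our naming (`g_hat` and `Delta_hat` stand for the kernel estimates $\hat g_n^C$ and $\widehat\Delta$ obtained above):

```python
import numpy as np

def imputed_responses(Y, X, T, delta, beta_c, g_hat, Delta_hat):
    """Return (Y_tilde_SI, Y_tilde_IP): simple imputation (j = 1) and
    inverse marginal probability weighting (j = 2). Missing Y entries
    (delta == 0) are never read."""
    fitted = X * beta_c + g_hat(T)        # X_i' beta_C + g_hat(T_i)
    Yobs = np.where(delta == 1, Y, 0.0)   # zero-fill so NaNs cannot leak
    y_si = delta * Yobs + (1 - delta) * fitted
    w = delta / Delta_hat(T)              # delta_i / Delta_hat(T_i)
    y_ip = w * Yobs + (1 - w) * fitted
    return y_si, y_ip
```

Where $\delta_i = 1$ and $\widehat\Delta(T_i) = 1$ both versions reduce to $Y_i$; where $\delta_i = 0$ both reduce to the fitted value.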
Also, taking the conditional expectation of (4.1) with respect to T, we have
\[ E[Y_i \mid T_i = t] = E[X_i \mid T_i = t]^\top\beta + g(t)\,. \tag{4.8} \]
Let $g_1(t) = E[X \mid T = t]$ and $g_2^j(t) = E[Y \mid T = t] = E[Y_{ij} \mid T = t]$, $j = 1,2$. This suggests that
\[ g(t) = g_2^j(t) - (g_1(t))^\top\beta\,. \tag{4.9} \]
Let
\[ W_{nj}(t) = \frac{M\left(\frac{t - T_j}{k_n}\right)}{\sum_{j=1}^n M\left(\frac{t - T_j}{k_n}\right)}\,, \]
where $M$ is a kernel function and $k_n$ is a bandwidth sequence ($k_n \to 0$ as $n\to\infty$). Define $\hat g_{1n}(t) = \sum_{k=1}^n W_{nk}(t) X_k$ and $\hat g_{2n}^j(t) = \sum_{k=1}^n W_{nk}(t)\widetilde Y_{kjn}$. Under some mild conditions, it can be shown that $\hat g_{1n}(t) \to g_1(t)$ and $\hat g_{2n}^j(t) \to g_2^j(t)$ with probability 1.
Theorem 9. Under assumptions (J5)-(J9) and Remark 8,
\[ |g_n(t) - \hat g_n^C(t)| \to 0 \ \text{ as } n\to\infty\,, \text{ w.p. 1.} \]
Proof.
\[ |g_n(t) - \hat g_n^C(t)| = |g_n(t) - g(t) + g(t) - \hat g_n^C(t)| \le |g_n(t) - g(t)| + |g(t) - \hat g_n^C(t)|\,. \]
Clearly, from (4.6), $|g_n(t) - g(t)| \to 0$ w.p. 1 as $n\to\infty$. Then, to complete the proof, we just need to show that $|g(t) - \hat g_n^C(t)| \to 0$ w.p. 1 as $n\to\infty$. Indeed,
\begin{align*}
g(t) - \hat g_n^C(t) &= g_2^C(t) - (g_1^C(t))^\top\beta_0 - \big[\tilde g_{2n}^C(t) - (\tilde g_{1n}^C(t))^\top\tilde\beta_\varphi^C\big] \\
&= \big(g_2^C(t) - \tilde g_{2n}^C(t)\big) + \big(\tilde g_{1n}^C(t) - g_1^C(t)\big)^\top\beta_0 + (\tilde g_{1n}^C(t))^\top\big(\tilde\beta_\varphi^C - \beta_0\big)\,.
\end{align*}
Then $|g(t) - \hat g_n^C(t)| \le |g_2^C(t) - \tilde g_{2n}^C(t)| + \|\tilde g_{1n}^C(t) - g_1^C(t)\|\,\|\beta_0\| + \|\tilde g_{1n}^C(t)\|\,\|\tilde\beta_\varphi^C - \beta_0\|$. Clearly, $|g_2^C(t) - \tilde g_{2n}^C(t)| \to 0$ and $\|g_1^C(t) - \tilde g_{1n}^C(t)\| \to 0$ w.p. 1 by consistency, and $\tilde g_{1n}^C(t)$ is bounded since it converges. Then $\|\tilde g_{1n}^C(t)\|\,\|\tilde\beta_\varphi^C - \beta_0\| \to 0$ w.p. 1 since $\tilde\beta_\varphi^C - \beta_0 \to 0$ w.p. 1 as $n\to\infty$. Therefore,
\[ |g_n(t) - \hat g_n^C(t)| \to 0 \ \text{ as } n\to\infty\,, \text{ w.p. 1.} \qquad\square \]
Set
\[ v_n^{ij}(\beta) = \widetilde Y_{ijn} - \hat g_{2n}^j(T_i) - \big(X_i - \hat g_{1n}(T_i)\big)^\top\beta\,. \]
Now, based on simple imputation, the rank estimator $\hat\beta_\varphi^I$ of $\beta_0$ is defined by $\hat\beta_\varphi^I = \operatorname*{Argmin}_{\beta\in\mathcal{B}} D_n^I(\beta)$ with $D_n^I(\beta)$ defined by
\[ D_n^I(\beta) = \frac1n\sum_{i=1}^n \varphi\left(\frac{R^I(v_n^{i1}(\beta))}{n+1}\right) v_n^{i1}(\beta)\,, \]
where $R^I(v_n^{i1}(\beta))$ is the rank of $v_n^{i1}(\beta)$. Now, letting $S_n^I(\beta) = -\nabla D_n^I(\beta)$, we have
\[ S_n^I(\beta) = \frac1n\sum_{i=1}^n \varphi\left(\frac{R^I(v_n^{i1}(\beta))}{n+1}\right)\big(X_i - \hat g_{1n}(T_i)\big)\,. \tag{4.10} \]
On the other hand, based on the inverse probability, the rank estimator $\hat\beta_\varphi^{IP}$ of $\beta_0$ is defined by $\hat\beta_\varphi^{IP} = \operatorname*{Argmin}_{\beta\in\mathcal{B}} D_n^{IP}(\beta)$ with
\[ D_n^{IP}(\beta) = \frac1n\sum_{i=1}^n \varphi\left(\frac{R^{IP}(v_n^{i2}(\beta))}{n+1}\right) v_n^{i2}(\beta)\,, \]
where $R^{IP}(v_n^{i2}(\beta))$ is the rank of $v_n^{i2}(\beta)$. Letting $S_n^{IP}(\beta) = -\nabla D_n^{IP}(\beta)$, we have
\[ S_n^{IP}(\beta) = \frac1n\sum_{i=1}^n \varphi\left(\frac{R^{IP}(v_n^{i2}(\beta))}{n+1}\right)\big(X_i - \hat g_{1n}(T_i)\big)\,. \tag{4.11} \]
Note that $\hat g_{1n}(T_i)$ is a $p$-vector and the estimation of $g_2^j$ involves all the $Y_i$'s, making $\{v_n^{ij}(\beta_0)\}_{i=1}^n$ a set of dependent random variables. Also, if we let $G_{1in}$ and $G_{2in}$ denote the distribution functions of $v_n^{i1}(\beta_0)$ and $v_n^{i2}(\beta_0)$ respectively, it can be easily shown that
\[ G_{1i}(s) = \lim_{n\to\infty} G_{1in}(s) = \lim_{n\to\infty} P\big(v_n^{i1}(\beta_0) \le s\big) \]
and
\[ G_{2i}(s) = \lim_{n\to\infty} G_{2in}(s) = \lim_{n\to\infty} P\big(v_n^{i2}(\beta_0) \le s\big)\,, \]
where $G_{1i}$ and $G_{2i}$ are the conditional distribution functions of $e_i$ and $\eta_i$ given $Z_i$, respectively.
4.4 Assumptions
The following assumptions were used above to motivate the theory leading to the definition of the rank estimators $\hat\beta_\varphi^I$ and $\hat\beta_\varphi^{IP}$. They will also be useful in studying the asymptotic properties of these rank estimators.

(J1) $P(X^\top\beta = X^\top\beta') < 1$ for any $\beta \ne \beta'$.

(J2) $\varphi$ is a bounded, twice continuously differentiable score function with bounded derivatives, defined on $(0,1)$, and satisfying
\[ \int_0^1 \varphi(u)\,du = 0 \quad\text{and}\quad \int_0^1 \varphi^2(u)\,du = 1\,. \]

(J3) The cumulative distribution $H$ of the model error $\varepsilon$ given Z is symmetric about 0 and has a corresponding density $h$ that is absolutely continuous with finite Fisher information, i.e.,
\[ I(h) = \int_{-\infty}^{\infty}\Big[\frac{h'(\varepsilon)}{h(\varepsilon)}\Big]^2 h(\varepsilon)\,d\varepsilon < \infty\,. \]
Define $\tau_\varphi^{-1} = \int_0^1 \varphi(u)\varphi_H(u)\,du$, where $\varphi_H(u) = -\dfrac{h'(H^{-1}(u))}{h(H^{-1}(u))}$.

(J4) Define $\mathcal{X} = X - \tilde G_n^\top$. Assume that $\frac1n\mathcal{X}\mathcal{X}^\top \to \Sigma$, $\frac1n\tilde G_n^\top\tilde G_n \to \Sigma_G$ and $\frac1n\big(\tilde G_n^\top X^\top + X\tilde G_n\big) \to \Sigma_B$, for some positive definite matrices $\Sigma$, $\Sigma_G$ and $\Sigma_B$. Also, assume that
\[ P(\|\mathcal{X}\| \ge c_n) = o_P(d_n) \quad\text{for } d_n \to 0 \text{ as } c_n \to \infty\,. \]

(J5) $\inf_t \Delta(t) > 0$, $\inf_z \Delta(z) > 0$, and the density of T, say $m(t)$, exists and satisfies
\[ 0 < \inf_t m(t) \le \sup_t m(t) < \infty\,. \]
Also, $\Delta(\cdot)$ has bounded derivatives.

(J6) $g_1^C(\cdot)$, $g_2^C(\cdot)$, $g_1(\cdot)$, and $g_2^j(\cdot)$ have continuous derivatives at $t$ and bounded derivatives up to order 2.

(J7) The kernels $K(\cdot)$, $\Omega(\cdot)$ and $M(\cdot)$ are regular kernels of order $r\,(> 2)$.

(J8) The bandwidths $b_n$, $h_n$ and $k_n$ satisfy the following conditions:
i.) $nk_nb_n \to \infty$, $nb_n^{4r} \to 0$, $nk_n^{4r} \to 0$ and $b_n^2/k_n \to 0$;
ii.) $nh_n \to \infty$ and $nh_n^{4r} \to 0$;
iii.) $C(\log n/n)^{\lambda} < h_n, b_n, k_n < \gamma_n$, for any $C > 0$, with $\lambda = 1 - 2/p$, $p > 2$, and $\gamma_n$ (not necessarily the same for the three bandwidths) such that $C(\log n/n)^{\lambda} < \gamma_n < 1$ and $\gamma_n \to 0$ as $n\to\infty$.

(J9) $\sup_t E[\|X\|^p \mid T = t] < \infty$ and $\sup_t E[|Y|^p \mid T = t] < \infty$ for $p > 2$.

Remark 9. Assumption (J1) is the identifiability condition and, together with (J2), ensures the consistency of the resulting estimator. (J5), (J6), (J7), (J8) and (J9) are standard assumptions for nonparametric regression problems. Specifically, (J5) ensures non-missingness with positive probability anywhere in the domain of Z. Assumption iii.) in (J8) and (J9) ensure the uniform consistency of the Nadaraya-Watson estimator used to estimate the true unknown function g; for more discussion, see Einmahl and Mason (2005). As pointed out by Xue (2009), the assumption $P(\|\mathcal{X}\| \ge c_n) = o_P(d_n)$ for $d_n \to 0$ as $c_n \to \infty$ in (J4) is commonly used for avoiding the boundary problem. This assumption is also used by Zhu and Fang (1996) and Wang and Rao (2002a). For a class of distributions that satisfy this assumption, see the discussion in Xue (2009). (J1), (J2), (J3) and (J4), together with the other assumptions listed above, ensure the asymptotic normality of the resulting estimator. In practice, the optimal bandwidth can be chosen to lie in the interval $[a\,n^{-1/5},\, b\,n^{-1/5}]$ for $0 < a < b < \infty$; see Einmahl and Mason (2005) for more discussion. Most regular kernels in (J7) satisfy assumption (C6) of Xue (2009) or Wang and Sun (2007).
4.5 Asymptotic Normality
Put $X = (X_1, \ldots, X_n)$ and $\tilde G_n = (\tilde g_{1n}(T_1), \ldots, \tilde g_{1n}(T_n))^\top$, the two matrices given by
\[ X = \begin{pmatrix} X_{11} & \cdots & X_{1n} \\ X_{21} & \cdots & X_{2n} \\ \vdots & & \vdots \\ X_{p1} & \cdots & X_{pn} \end{pmatrix}\,, \qquad \tilde G_n = \begin{pmatrix} \tilde g_{11n}(T_1) & \cdots & \tilde g_{1pn}(T_1) \\ \tilde g_{11n}(T_2) & \cdots & \tilde g_{1pn}(T_2) \\ \vdots & & \vdots \\ \tilde g_{11n}(T_n) & \cdots & \tilde g_{1pn}(T_n) \end{pmatrix}\,. \]
For simplicity, set $D_n^l = D_n^I$, $S_n^l = S_n^I$ for $l = I$ and $D_n^l = D_n^{IP}$, $S_n^l = S_n^{IP}$ for $l = IP$. Now, as discussed in Hettmansperger and McKean (1998), if we set
\[ M_n^l(\beta) = (2\tau_\varphi)^{-1}(\beta - \beta_0)^\top\mathcal{X}\mathcal{X}^\top(\beta - \beta_0) - (\beta - \beta_0)^\top S_n^l(\beta_0) + D_n^l(\beta_0)\,, \]
we obtain the following result, called asymptotic quadraticity, which was proved by Jaeckel (1972).
Theorem 10. Under assumptions (J1)-(J6), for all $\epsilon > 0$ and $C > 0$,
\[ \lim_{n\to\infty} P_{\beta_0}\Big[\max_{\|\beta - \beta_0\| \le C/\sqrt{n}} \big|D_n^l(\beta) - M_n^l(\beta)\big| \ge \epsilon\Big] = 0\,. \]
This result provides a quadratic approximation of $D_n^l$ by $M_n^l$ and leads to the asymptotic linearity derived by Jureckova (1971), displayed as follows.

Theorem 11. Under (J1)-(J9), for any $C > 0$ and $\epsilon > 0$,
\[ \lim_{n\to\infty} P_{\beta_0}\Big[\sup_{\|\beta - \beta_0\| \le n^{-1/2}C} \big\|n^{-1/2}\big[S_n^l(\beta) - S_n^l(\beta_0)\big] + n^{-1/2}\tau_\varphi^{-1}\mathcal{X}\mathcal{X}^\top(\beta - \beta_0)\big\| \ge \epsilon\Big] = 0\,. \]
Proofs of Theorem 10 and Theorem 11 can be constructed in a straightforward manner along the lines discussed in Hettmansperger and McKean (1998) for the linear model setting. Therefore, for the sake of brevity, they are not included here.
Theorem 12. Under assumptions (J1)-(J9), $\sqrt{n}\,S_n^l(\beta_0)$ is asymptotically $N_p(0, \Sigma_{jn})$ with $\Sigma_{jn} \to V_j$; that is, $\sqrt{n}\,S_n^l(\beta_0) \xrightarrow{D} N_p(0, V_j)$, where $j = 1$ for $l = I$, $j = 2$ for $l = IP$, and $V_j = \lim_{n\to\infty}\Sigma_{jn}$.

Before giving the proof of Theorem 12, consider the following notation from Brunner and Denker (1994). Set
\[ \xi_{in} = X_i - \hat g_{1n}(T_i)\,, \quad J_{jn}(s) = \frac1n\sum_{i=1}^n G_{jin}(s)\,, \quad \hat J_{jn}(s) = \frac1n\sum_{i=1}^n I\big(v_n^{ij}(\beta_0) \le s\big)\,, \]
\[ F_{jn}(s) = \frac1n\sum_{i=1}^n \xi_{in} G_{jin}(s)\,, \quad \hat F_{jn}(s) = \frac1n\sum_{i=1}^n \xi_{in} I\big(v_n^{ij}(\beta_0) \le s\big)\,, \]
and
\[ T_n^l(\beta_0) = \frac{S_n^l(\beta_0)}{\sigma(n)} - E\Big[\frac{S_n^l(\beta_0)}{\sigma(n)}\Big]\,. \]

Lemma 5 (Brunner and Denker (1994)). Suppose that $M_0\, n^{\gamma} \le \sigma(n) \le M_1\, n^{\gamma}$ for some constants $0 \le M_0 \le M_1$ and $\gamma \ge 0$, and that $\lambda_{jn} \ge C n^{a}$ for some constants $a, C \in \mathbb{R}$, where
\[ U_{jn} = \int \varphi(J_{jn}(s))\,(\hat F_{jn} - F_{jn})(ds) + \int \varphi'(J_{jn}(s))\big(\hat J_{jn}(s) - J_{jn}(s)\big)\,F_{jn}(ds) \]
and $\lambda_{jn}$ is the minimum eigenvalue of $\mathrm{Var}(U_{jn})$. Then $n W_{jn}^{-1/2}\, T_n^l(\beta_0)$ is asymptotically standard multivariate normal, provided $\varphi$ is twice continuously differentiable with bounded second derivative and $\gamma < (a+1)/2$.
Proof of Theorem 12. Assume that $\max_{1\le i\le n}\|\xi_{in}\| = \sigma(n) < \infty$. From the setting defined above,
\[ S_n^l(\beta_0) = \frac1n\sum_{i=1}^n \varphi\left(\frac{R^l(v_n^{ij}(\beta_0))}{n+1}\right)\big(X_i - \hat g_{1n}(T_i)\big) = \int \varphi\Big(\frac{n}{n+1}\hat J_{jn}\Big)\,d\hat F_{jn}\,. \]
Now define
\[ \Sigma_n = n^2 E\big[(T_n^l(\beta_0))(T_n^l(\beta_0))^\top\big]\,, \quad B_{jn} = \int (\hat F_{jn} - F_{jn})\,d\varphi(J_{jn}) + \int (\hat J_{jn} - J_{jn})\,\frac{dF_{jn}}{dJ_{jn}}\,d\varphi(J_{jn})\,, \]
and $W_{jn} = n^2\,\mathrm{Var}(B_{jn})$. From the fact that $\beta_0 = \operatorname*{Argmin}_{\beta\in\mathcal{B}} E(D_n^l(\beta))$, it can be seen that $E[S_n^l(\beta_0)] = 0$. This implies that $E[T_n^l(\beta_0)] = 0$ and $T_n^l(\beta_0) = S_n^l(\beta_0)/\sigma(n)$. Now, under assumption (J2) and by Brunner and Denker (1994), for Theorem 12 to hold it suffices to check that the conditions of Lemma 5 are satisfied. To that end, from the fact that $\mathrm{Var}(\varepsilon_i \mid Z_i) > 0$, there exists $b > 0$ such that the minimum eigenvalue of $\mathrm{Var}(B_{jn})$, say $\lambda_{jn}^{*}$, satisfies $\lambda_{jn}^{*} \ge n^{b}$ for $0 < b < 1/2$. This is obtained under the assumption that $\varsigma_{jn}/n \to \infty$, which implies $\lambda_{jn}^{*} \to \infty$ as $n\to\infty$; see Brunner and Denker (1994) for more discussion. Again by Brunner and Denker (1994), $U_{jn} = nB_{jn}$, and then $\mathrm{Var}(U_{jn}) = n^2\,\mathrm{Var}(B_{jn})$. Thus $\lambda_{jn} = n^2\lambda_{jn}^{*} \ge n^{2+b}$. Hence, putting $a = 2 + b$, $\gamma = 1$, $M_0 = M_1$, and choosing $C$ accordingly, the conditions of Lemma 5 are satisfied. This shows that $n W_{jn}^{-1/2}\, T_n^l(\beta_0)$ is asymptotically multivariate standard normal. Therefore $\sqrt{n}\,S_n^l(\beta_0)$ is asymptotically multivariate normal with mean 0 and covariance matrix $\Sigma_{jn} = \frac{\sigma(n)}{\sqrt{n}}\, W_{jn}$. $\square$
Remark 10. For $l = I$ or $l = IP$, note that by the definition of $\hat\beta_\varphi^l$, $S_n^l(\hat\beta_\varphi^l) = 0$. Also, from the fact that $M_n^l$ is a quadratic function of $\beta$, it is uniquely minimized by $\tilde\beta_\varphi^l = \beta_0 + \tau_\varphi\big(\mathcal{X}\mathcal{X}^\top\big)^{-1} S_n^l(\beta_0)$. Assumption (J4) ensures that $\frac1n\mathcal{X}\mathcal{X}^\top \to \Sigma$ as $n\to\infty$ for some positive definite matrix $\Sigma$. Then, under the assumptions of Theorem 11, $\sqrt{n}\big(\hat\beta_\varphi^l - \tilde\beta_\varphi^l\big) = o_p(1)$. The proof of this claim can also be constructed along the lines given in Hettmansperger and McKean (1998). Moreover, conditioning on Z and using the SLLN, $S_n^I(\beta_0) \to 0$ w.p. 1. The following theorem gives a practical way of estimating the covariance matrix $\Sigma_{jn} = \frac{\sigma(n)}{\sqrt{n}}\, W_{jn}$.
Theorem 13. Suppose that assumptions (J1)-(J9) hold, and let $j = 1$ for $l = I$ and $j = 2$ for $l = IP$. Define
\begin{align*}
\hat Z_{lj} &= \sum_{i=1}^n \xi_{in}\,\varphi\left(\frac{R(v_n^{ij}(\hat\beta_\varphi^l))}{n+1}\right) + \frac1n\sum_{i=1}^n \xi_{in}\,\varphi'\left(\frac{R(v_n^{ij}(\hat\beta_\varphi^l))}{n+1}\right) R\big(v_n^{ij}(\hat\beta_\varphi^l)\big) \\
&= n S_n^l(\hat\beta_\varphi^l) + \frac1n\sum_{i=1}^n \xi_{in}\,\varphi'\left(\frac{R(v_n^{ij}(\hat\beta_\varphi^l))}{n+1}\right) R\big(v_n^{ij}(\hat\beta_\varphi^l)\big) \\
&= \frac1n\sum_{i=1}^n \xi_{in}\,\varphi'\left(\frac{R(v_n^{ij}(\hat\beta_\varphi^l))}{n+1}\right) R\big(v_n^{ij}(\hat\beta_\varphi^l)\big)\,,
\end{align*}
since $S_n^l(\hat\beta_\varphi^l) = 0$. Then $\widehat W_{jn} = \big[\hat Z_{lj} - E(Z_{lj})\big]\big[\hat Z_{lj} - E(Z_{lj})\big]^\top \to W_{jn}$ (positive definite) in the $L_2$-norm as $n\to\infty$, where
\begin{align*}
Z_{lj} &= n\int \varphi(J_{jn}(t))\,\hat F_{jn}(dt) + \int \varphi'(J_{jn}(t))\,\hat J_{jn}(t)\,F_{jn}(dt) \\
&= n S_n^l(\beta_0) + \frac1n\sum_{i=1}^n \xi_{in}\,\varphi'\left(\frac{R(v_n^{ij}(\beta_0))}{n+1}\right) R\big(v_n^{ij}(\beta_0)\big)\,.
\end{align*}
Proof. By Theorem 4.1 of Brunner and Denker (1994), to prove this theorem it is sufficient to prove that $\lim_{n\to\infty} n/\varsigma_{jn} = 0$, where $\varsigma_{jn}$ represents the minimum eigenvalue of $W_{jn}$. This is obtained from the fact that the minimum eigenvalue of $\mathrm{Var}(B_{jn})$ is at least $n^{b}$ for $0 < b < 1/2$. Moreover,
\[ G_j(u) = \begin{cases} \Delta(z)H(u)\,, & \text{if } j = 1, \\[4pt] \dfrac{\Delta(z)}{\Delta(t)}H(u)\,, & \text{if } j = 2. \end{cases} \]
For either $j = 1$ or $j = 2$, considering the change of variables $s = \Delta(z)H(u)$ or $s = \frac{\Delta(z)}{\Delta(t)}H(u)$, we get
\[ \int \varphi'(G_j(u))\,dG_j(u) = \int \varphi'(s)\,ds\,. \]
Hence,
\[ E[Z_{lj}] = \sum_{i=1}^n \xi_{in}\int \varphi'(s)\,ds\,. \qquad\square \]
Theorem 14. Under assumptions (J1)-(J9), we have
\[ \sqrt{n}\big(\hat\beta_\varphi^l - \beta_0\big) \xrightarrow{D} N\big(0,\ \tau_\varphi^2\,\Sigma_j\big)\,, \quad\text{where } \Sigma_j = \Sigma^{-1} V_j\,\Sigma^{-1}\,. \]
The proof of this theorem is obtained by combining the results of Theorem 11 and Theorem 12 with the discussion in Remark 10.
4.6 Estimation of the function g
As can be seen from the discussion in Section 4.3, the function g is not fully estimated there. In estimating $\beta_0$, a preliminary estimate of g was used by treating it as a known function of t but unknown as a function of $\beta$, through the estimates of $g_1$ and $g_2^j$. Once the estimate $\hat\beta_\varphi^l$ of $\beta_0$ is obtained, one can obtain the estimator $\hat g_n$ of g corresponding to the estimated value of $\beta_0$ by putting
\[ \hat g_n(t) = \hat g_{2n}^j(t) - (\hat g_{1n}(t))^\top\hat\beta_\varphi^l\,, \]
where $j = 1$ for $l = I$ and $j = 2$ for $l = IP$.
Theorem 15. Under assumptions (J1)-(J9), we have $\hat g_n(t) - g(t) = o(1)$ with probability 1.

Proof. By the definition of $\hat g_n(t)$, we have
\[ \hat g_n(t) - g(t) = \big(\hat g_{2n}^j(t) - g_2(t)\big) - \big(\hat g_{1n}(t) - g_1(t)\big)^\top\big(\hat\beta_\varphi^l - \beta_0\big) - (g_1(t))^\top\big(\hat\beta_\varphi^l - \beta_0\big) - \big(\hat g_{1n}(t) - g_1(t)\big)^\top\beta_0\,. \]
This implies that
\[ |\hat g_n(t) - g(t)| \le \big|\hat g_{2n}^j(t) - g_2(t)\big| + \big\|\hat g_{1n}(t) - g_1(t)\big\|\,\big\|\hat\beta_\varphi^l - \beta_0\big\| + \|g_1(t)\|\,\big\|\hat\beta_\varphi^l - \beta_0\big\| + \big\|\hat g_{1n}(t) - g_1(t)\big\|\,\|\beta_0\|\,. \]
We continue the proof with $j = 1$; a similar argument can be used for $j = 2$. By the definition of $\hat g_{2n}^1(t)$,
\begin{align*}
\hat g_{2n}^1(t) - g_2(t) &= \sum_{i=1}^n W_{ni}(t)\,\widetilde Y_{i1n} - g_2(t) \\
&= \sum_{i=1}^n W_{ni}(t)\big[\delta_i Y_i + (1 - \delta_i)\big(X_i^\top\tilde\beta_\varphi^C + \hat g_n^C(T_i)\big)\big] - g_2(t) \\
&= \sum_{i=1}^n W_{ni}(t)\big(Y_{i1} - g_2(t)\big) + \sum_{i=1}^n W_{ni}(t)(1 - \delta_i)X_i^\top\big(\tilde\beta_\varphi^C - \beta_0\big) + \sum_{i=1}^n W_{ni}(t)(1 - \delta_i)\big(\hat g_n^C(T_i) - g(T_i)\big)\,.
\end{align*}
Note that $E[Y_{i1} \mid T_i = t] = g_2(t)$ and $E\big[\|(1 - \delta_i)X_i\| \mid T_i\big] < \infty$. Then, under mild conditions, we have $\big|\sum_{i=1}^n W_{ni}(t)Y_{i1} - g_2(t)\big| \to 0$ w.p. 1, and by Theorem 9,
\[ |\hat g_n^C(t) - g(t)| \to 0 \quad\text{w.p. 1.} \]
Also, conditioning on T and using the SLLN, we have
\[ \sum_{i=1}^n W_{ni}(t)(1 - \delta_i)X_i \to E\big[(1 - \delta)X \mid T\big] \quad\text{w.p. 1.} \]
Hence $\sum_{i=1}^n W_{ni}(t)(1 - \delta_i)X_i = O(1)$. From the fact that $\tilde\beta_\varphi^C - \beta_0 = o(1)$, we have
\[ \hat g_{2n}^1(t) - g_2(t) \to 0 \quad\text{w.p. 1.} \]
Using the same argument, it can be shown that $\hat g_{1n}(t) - g_1(t) \to 0$ w.p. 1. Combining these facts with Remark 10 completes the proof. $\square$
4.7 Bandwidth Selection
An important issue when dealing with kernel estimation is the selection of an appropriate bandwidth sequence. In the nonparametric regression literature, this problem has been studied extensively. The common approach used in selecting the bandwidth is the delete-one cross-validation rule. This approach consists in minimizing the "leave-one-out" version of the objective function, evaluated at the estimator of $\beta_0$, as a function of the bandwidth h. That is, $b_n$, $h_n$ and $k_n$ may be obtained by minimizing
\[ \sum_{i=1}^n \varphi\left(\frac{R(\hat u_{-i}(h))}{n+1}\right)\hat u_{-i}(h)\,, \]
where
\[ \hat u_{-i}(h) = \begin{cases} \zeta_{-i,n}(\tilde\beta_\varphi^C; h)\,, & \text{for } b_n, \\[2pt] v_n^{-i,j}(\hat\beta_\varphi^l; h)\,, & \text{for } k_n, \\[2pt] \delta_i - \widehat\Delta_{-i}(T_i; h)\,, & \text{for } h_n, \end{cases} \]
with $\zeta_{-i,n}(\tilde\beta_\varphi^C; h)$, $v_n^{-i,j}(\hat\beta_\varphi^l; h)$ and $\delta_i - \widehat\Delta_{-i}(T_i; h)$ being the "leave-one-out" versions of $\zeta_{in}(\tilde\beta_\varphi^C)$, $v_n^{ij}(\hat\beta_\varphi^l)$ and $\delta_i - \widehat\Delta(T_i)$, respectively, in which $b_n$, $k_n$, $h_n$ are replaced by h.
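The delete-one rule is easiest to see on the smoother itself. The sketch below is our own simplification: it cross-validates the squared prediction error of the Nadaraya-Watson smoother rather than the rank objective above, but the leave-one-out mechanics are the same:

```python
import numpy as np

def epanechnikov(t):
    return 0.75 * np.maximum(1 - t ** 2, 0.0)

def loo_cv_bandwidth(T, V, delta, grid):
    """Leave-one-out cross-validation: for each h, predict each observed
    V_i from the other observed points and keep the h with smallest MSE."""
    best_h, best_err = None, np.inf
    for h in grid:
        K = epanechnikov((T[:, None] - T[None, :]) / h)
        np.fill_diagonal(K, 0.0)              # delete-one: drop own point
        denom = (K * delta).sum(axis=1)
        use = (delta == 1) & (denom > 0)
        pred = (K * delta * V).sum(axis=1)[use] / denom[use]
        err = np.mean((V[use] - pred) ** 2)
        if err < best_err:
            best_h, best_err = h, err
    return best_h
```

Very small h leaves too few neighbors (high variance) and very large h oversmooths (high bias); the cross-validated error balances the two.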
4.8 Simulation
To fully understand the finite-sample properties of the proposed approach, a simulation study is conducted. The model used in the simulation is given by
\[ Y = \beta_0 x + g(T) + \varepsilon\,, \]
with x generated from a normal distribution with mean 1 and variance 1. The random errors $\varepsilon$ are generated from the contaminated normal distribution $CN(\rho,\sigma) = (1 - \rho)N(0,1) + \rho N(0,\sigma^2)$, with different degrees of contamination, and from the t-distribution with various degrees of freedom. The kernels $K(\cdot)$ and $M(\cdot)$ were taken to be the Epanechnikov kernel
\[ K(t) = M(t) = 0.75\,(1 - t^2)\,I(|t| \le 1)\,. \]
Based on assumption (J8), all three bandwidths $b_n$, $k_n$ and $h_n$ were taken to be proportional to $n^{-1/5}$. Two simulation scenarios were considered. In Scenario 1, the true g was taken to be 0 and $T_i = i/n^2$, for $i = 1,\ldots,n$, and for $z = (x,t)$, three cases were considered:

Case 1: $\Delta(z) = 0.8 + 0.2|x - 1|$ if $|x - 1| \le 1$, and 0.95 elsewhere.

Case 2: $\Delta(z) = 0.9 - 0.2|x - 1|$ if $|x - 1| \le 4.5$, and 0.1 elsewhere.

Case 3: $\Delta(z)$ is taken to be a sequence of proportions of missingness, starting from 0% to 50% in steps of 10%.

In Scenario 2, the true g was taken to be $g(t) = (\sin(2\pi t^2))^{1/3}$ and T is generated from the uniform distribution U[0, 1/4]. One more case is added to the previous three:

Case 4: $\Delta(z) = 0.9 - 0.2(|x - 1| + |t - 0.5|)$ if $|x - 1| + |t - 0.5| \le 1.5$, and 0.8 elsewhere.

In all cases, $\delta$ was generated from the Bernoulli($\Delta(z)$) distribution. Under both scenarios, $\beta$ is estimated using simple imputation (SI) and using the inverse probability procedure, where $\Delta(t)$ is estimated in two ways: first, by the kernel method (Ker) as defined above, with $\Omega(t) = M(t)$; and second, by taking it to be the logistic function (Log) given by $\Delta(t) = \frac{1}{1 + e^{-t}}$, as in Müller (2009). From 5000 simulations, the MSEs of the resulting rank (R) estimator of $\beta_0$ are reported and compared to those of the least squares (LS) estimator of $\beta_0$. Figures 4.1-4.17 contain the results of the simulation experiments.
4.9 Results and Discussion
The results of the simulation experiment are described below:
Figure 4.1: Scenario 1, Case 1: MSE vs Proportion of Contamination and t-df. [Six panels comparing the MSE of the LS and R estimators: against the proportion of contamination (Contaminated Normal; Ker, Log, SI) and against the t degrees of freedom (t Distribution; Ker, Log, SI).]
Figure 4.2: Scenario 1, Case 2: MSE vs Proportion of Contamination and t-df
[Six panels plotting MSE of the LS and R estimators: Contaminated Normal (Ker), (Log), and (SI) against the proportion of contamination; t Distribution (Ker), (Log), and (SI) against the t degrees of freedom.]
Figure 4.3: Scenario 1, Case 3: MSE vs Proportion of Missing Data for SI under contaminated normal error distribution
[Six panels plotting MSE of the LS and R estimators against the proportion of missing data under simple imputation, one panel per contamination level: CN(0, 3), CN(0.01, 3), CN(0.05, 3), CN(0.1, 3), CN(0.15, 3), CN(0.25, 3).]
Figure 4.4: Scenario 1, Case 3: MSE vs Proportion of Missing Data for SI under t error distribution
[Six panels plotting MSE of the LS and R estimators against the proportion of missing data under simple imputation, one panel per t distribution: df = 5, 10, 15, 20, 25, 30.]
Figure 4.5: Scenario 1, Case 3: MSE vs Proportion of Missing Data for inverse probability (Ker) under contaminated normal error distribution
[Six panels plotting MSE of the LS and R estimators against the proportion of missing data under inverse probability (Kernel), one panel per contamination level: CN(0, 3), CN(0.01, 3), CN(0.05, 3), CN(0.1, 3), CN(0.15, 3), CN(0.25, 3).]
Figure 4.6: Scenario 1, Case 3: MSE vs Proportion of Missing Data for inverse probability (Ker) under t error distribution
[Six panels plotting MSE of the LS and R estimators against the proportion of missing data under inverse probability (Kernel), one panel per t distribution: df = 5, 10, 15, 20, 25, 30.]
Figure 4.7: Scenario 1, Case 3: MSE vs Proportion of Missing Data for inverse probability (Log) under contaminated normal error distribution
[Six panels plotting MSE of the LS and R estimators against the proportion of missing data under inverse probability (Logistic), one panel per contamination level: CN(0, 3), CN(0.01, 3), CN(0.05, 3), CN(0.1, 3), CN(0.15, 3), CN(0.25, 3).]
Figure 4.8: Scenario 1, Case 3: MSE vs Proportion of Missing Data for inverse probability (Log) under t error distribution
[Six panels plotting MSE of the LS and R estimators against the proportion of missing data under inverse probability (Logistic), one panel per t distribution: df = 5, 10, 15, 20, 25, 30.]
Figure 4.9: Scenario 2, Case 1: MSE vs Proportion of Contamination and t-df
[Six panels plotting MSE of the LS and R estimators: Contaminated Normal (Ker), (Log), and (SI) against the proportion of contamination; t Distribution (Ker), (Log), and (SI) against the t degrees of freedom.]
Figure 4.10: Scenario 2, Case 2: MSE vs Proportion of Contamination and t-df
[Six panels plotting MSE of the LS and R estimators: Contaminated Normal (Ker), (Log), and (SI) against the proportion of contamination; t Distribution (Ker), (Log), and (SI) against the t degrees of freedom.]
Figure 4.11: Scenario 2, Case 3: MSE vs Proportion of Missing Data for SI under contaminated normal error distribution
[Six panels plotting MSE of the LS and R estimators against the proportion of missing data under simple imputation, one panel per contamination level: CN(0, 3), CN(0.01, 3), CN(0.05, 3), CN(0.1, 3), CN(0.15, 3), CN(0.25, 3).]
Figure 4.12: Scenario 2, Case 3: MSE vs Proportion of Missing Data for SI under t error distribution
[Six panels plotting MSE of the LS and R estimators against the proportion of missing data under simple imputation, one panel per t distribution: df = 5, 10, 15, 20, 25, 30.]
Figure 4.13: Scenario 2, Case 3: MSE vs Proportion of Missing Data for inverse probability (Ker) under contaminated normal error distribution
[Six panels plotting MSE of the LS and R estimators against the proportion of missing data under inverse probability (Kernel), one panel per contamination level: CN(0, 3), CN(0.01, 3), CN(0.05, 3), CN(0.1, 3), CN(0.15, 3), CN(0.25, 3).]
Figure 4.14: Scenario 2, Case 3: MSE vs Proportion of Missing Data for inverse probability (Ker) under t error distribution
[Six panels plotting MSE of the LS and R estimators against the proportion of missing data under inverse probability (Kernel), one panel per t distribution: df = 5, 10, 15, 20, 25, 30.]
Figure 4.15: Scenario 2, Case 3: MSE vs Proportion of Missing Data for inverse probability (Log) under contaminated normal error distribution
[Six panels plotting MSE of the LS and R estimators against the proportion of missing data under inverse probability (Logistic), one panel per contamination level: CN(0, 3), CN(0.01, 3), CN(0.05, 3), CN(0.1, 3), CN(0.15, 3), CN(0.25, 3).]
Figure 4.16: Scenario 2, Case 3: MSE vs Proportion of Missing Data for inverse probability (Log) under t error distribution
[Six panels plotting MSE of the LS and R estimators against the proportion of missing data under inverse probability (Logistic), one panel per t distribution: df = 5, 10, 15, 20, 25, 30.]
Scenario 1:
Case 1: For the case of the CN distribution, the R estimator becomes increasingly superior to LS as the proportion of contamination increases. The same is true under the t distribution, with R becoming superior to LS as the degrees of freedom decrease (heavier tails). The MSEs are generally larger for the logistic case under the contaminated normal but smaller under the t distribution.
Case 2: Once again, R is increasingly superior to LS as the contamination increases or as the tail thickness of the distribution increases. For the CN distribution, (SI) generally does worse than (Log) and (Ker), and for the t distribution, (Log) does worse than (Ker) and (SI).
Case 3: Under (SI), R performs comparably to LS under normality (CN(0, 3)) or tails close to those of the normal distribution (CN(0.01, 3) and t with 30 df) but, as in the previous cases, does much better under heavy tails or larger contamination. What is additionally notable under (Ker) and (Log) is that R consistently outperforms LS when the proportion of missing data is large, even under normality.
Scenario 2:
Case 1: Similar to Scenario 1, Case 1.
Case 2: Similar to Scenario 1, Case 2.
Case 3: Similar patterns are observed as in Case 3 of Scenario 1. There are, however, some notable differences. Under Scenario 1, for the (SI) case, R and LS were comparable under normality; under Scenario 2, LS is clearly superior to R. Under Scenario 1, for the (Ker) case, R does better than LS as the proportion of missing data increases. This is no longer the case as the t degrees of freedom increase, where R and LS are comparable. A similar observation can be made for the (Log) case.
Case 4: Similar to Case 2.
The first take-home message is that R performs better under heavy-tailed error distributions and in the presence of contamination; it is generally comparable to LS under normal errors. The second take-home message is that R with inverse probability imputation does better than its LS counterpart when the proportion of missing data is large. This makes R estimation extremely appealing in situations where we encounter high rates of missing information.
We finish this section by giving an alternative approach to defining rank estimators under responses missing at random. The theory follows from the above theorems in a (mostly) straightforward manner.
4.10 Empirical log-likelihood approach
Setting
\[
\eta_{ij}(\beta) = \varphi\!\left(\frac{R_l\big(v^{n}_{ij}(\beta)\big)}{n+1}\right)\big(X_i - \tilde{g}_{1n}(T_i)\big), \qquad j = 1, 2,
\]
one can define the empirical log-likelihood function of \(\beta\) as
\[
\mathcal{L}_{ijn}(\beta) = -2\sup_{(p_1,\ldots,p_n)\in(0,1)^n}\left\{\sum_{i=1}^{n}\log(np_i)\;:\;\sum_{i=1}^{n}p_i = 1,\;\sum_{i=1}^{n}p_i\,\eta_{ij} = 0\right\}, \qquad j = 1, 2.
\]
Now take \(\lambda\in\mathbb{R}^d\) for which the following equation holds:
\[
\frac{1}{n}\sum_{i=1}^{n}\frac{\eta_{ij}(\beta)}{1+\lambda^\top\eta_{ij}(\beta)} = 0. \tag{4.12}
\]
From this, \(\mathcal{L}_{ijn}(\beta)\) can be rewritten as
\[
\mathcal{L}_{ijn}(\beta) = 2\sum_{i=1}^{n}\log\big(1+\lambda^\top\eta_{ij}(\beta)\big). \tag{4.13}
\]
Consider the matrix \(A_{nj}\) defined by \(A_{nj} = \frac{1}{n}\sum_{i=1}^{n}\eta_{ij}(\beta_0)\,\eta_{ij}(\beta_0)^\top\). Clearly, from the fact that \(E(\varepsilon_i\mid Z_i) = 0\) and \(E(\varepsilon_i^2\mid Z_i) < \infty\), we have \(A_{nj}\to V_j\) as \(n\to\infty\). Based on Theorem 12, \(\max_{1\le i\le n}\|\eta_{ij}(\beta_0)\| = o_P(c_n)\) for some \(c_n\); moreover, \(\varphi\) being bounded implies that there exists a positive constant \(C\) such that \(c_n = C\,n^{1/2}\). Hence, one can obtain the asymptotic distribution of \(\mathcal{L}_{ijn}(\beta)\) at the true parameter. This is given in the following theorem.
Theorem 16. Under the assumptions (J1)–(J9), we have
\[
\mathcal{L}_{ijn}(\beta_0) \;\xrightarrow{\;D\;}\; \chi^2_p.
\]
The proof of this theorem is obtained by putting together Theorem 12 and Slutsky's Lemma.
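To make the construction concrete, the Lagrange-multiplier equation (4.12) can be solved numerically and the log-likelihood ratio (4.13) evaluated at the solution. The sketch below is hypothetical: it uses synthetic mean-zero vectors in place of the actual rank-score quantities η_ij(β₀), solves (4.12) by Newton's method, and omits safeguards (e.g., keeping 1 + λᵀη positive) that production code would need.

```python
import numpy as np

def el_log_ratio(eta, iters=50):
    """Empirical log-likelihood ratio at a candidate parameter value.

    Solves (1/n) sum_i eta_i / (1 + lam' eta_i) = 0 (equation (4.12)) for lam
    by Newton's method, then returns 2 * sum_i log(1 + lam' eta_i), i.e. (4.13).
    """
    n, d = eta.shape
    lam = np.zeros(d)
    for _ in range(iters):
        w = 1.0 + eta @ lam                                   # 1 + lam' eta_i
        grad = (eta / w[:, None]).mean(axis=0)                # left-hand side of (4.12)
        # Jacobian of (4.12) with respect to lam: -(1/n) sum eta_i eta_i' / w_i^2
        hess = -(eta[:, :, None] * eta[:, None, :] / (w**2)[:, None, None]).mean(axis=0)
        lam = lam + np.linalg.solve(hess, -grad)              # Newton step
        if np.linalg.norm(grad) < 1e-10:
            break
    return 2.0 * np.sum(np.log1p(eta @ lam))

rng = np.random.default_rng(1)
eta = rng.normal(size=(500, 2))     # stand-in for eta_ij(beta_0); mean zero under H0
stat = el_log_ratio(eta)            # approximately chi-square with p = 2 df, per Theorem 16
```

Comparing `stat` against a χ²_p quantile then yields the empirical-likelihood confidence region for β₀ described by Theorem 16.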
Remark 11. Note that, considering the right-hand side of (4.13) as a function of \(\lambda\), performing a first-order Taylor expansion at \(\lambda^\top\eta_{ij}(\beta_0)\), and using (4.12), we obtain
\[
\mathcal{L}_{ijn}(\beta_0) = \frac{1}{n}\left(\sum_{i=1}^{n}\eta_{ij}(\beta_0)\right)^{\!\top} A_{nj}^{-1}\left(\sum_{i=1}^{n}\eta_{ij}(\beta_0)\right)(1+o_p(1)).
\]
Setting
\[
\tilde{\mathcal{L}}_{ijn}(\beta_0) = \frac{1}{n}\left(\sum_{i=1}^{n}\eta_{ij}(\beta_0)\right)^{\!\top} A_{nj}^{-1}\left(\sum_{i=1}^{n}\eta_{ij}(\beta_0)\right) = n\,S_{ln}(\beta_0)^\top A_{nj}^{-1}\,S_{ln}(\beta_0),
\]
we see that for \(n\) large enough, \(\mathcal{L}_{ijn}(\beta_0)\approx\tilde{\mathcal{L}}_{ijn}(\beta_0)\). Hence, \(\tilde{\mathcal{L}}_{ijn}(\beta_0)\xrightarrow{D}\chi^2_p\). Now, putting \(\beta^* = \operatorname{Argmax}_{\beta\in B}\{-\tilde{\mathcal{L}}_{ijn}(\beta)\}\), the maximum empirical likelihood estimator of \(\beta_0\), we have the following theorem.
Theorem 17. Under the assumptions (J1)–(J9),
\[
\sqrt{n}\,\big(\beta^* - \beta_0\big) \;\xrightarrow{\;D\;}\; N\big(0,\;\sigma_\varphi^2\,V_j^{-1}\big).
\]
Figure 4.17: Scenario 2, Case 4: MSE vs Proportion of Contamination and t-df
[Six panels plotting MSE of the LS and R estimators: Contaminated Normal (Ker), (Log), and (SI) against the proportion of contamination; t Distribution (Ker), (Log), and (SI) against the t degrees of freedom.]
Bibliography
Abebe, A. and McKean, J. W. (2007). Highly efficient nonlinear regression based on the Wilcoxon norm. In Umbach, D., editor, Festschrift in Honor of Mir Masoom Ali, pages 340–357.
Adams, R. A. (1975). Sobolev spaces. Academic Press, New York-London.
Brezis, H. (1983). Analyse fonctionnelle: Théorie et applications. Masson, Paris.
Brunner, E. and Denker, M. (1994). Rank statistics under dependent observations and applications to factorial designs. Journal of Statistical Planning and Inference, 42(3):353–378.
Chang, W. H., McKean, J. W., Naranjo, J. D., and Sheather, S. J. (1999). High-breakdown rank regression. J. Amer. Statist. Assoc., 94(445):205–219.
Chen, S. X. and Hall, P. (1993). Smoothed empirical likelihood confidence intervals for quantiles. The Annals of Statistics, 21(3):1166–1181.
Cheng, P. E. (1994). Nonparametric estimation of mean functionals with data missing at random. Journal of the American Statistical Association, 89(425):81–87.
Coakley, C. W. and Hettmansperger, T. P. (1993). A bounded influence, high breakdown, efficient regression estimator. J. Amer. Statist. Assoc., 88(423):872–880.
Einmahl, U. and Mason, D. M. (2005). Uniform in bandwidth consistency of kernel-type function estimators. The Annals of Statistics, 33(3):1380–1403.
Gay, D. M. (1983). ALGORITHM 611: subroutines for unconstrained minimization using a model/trust-region approach. ACM Trans. Math. Software, 9(4):503–524.
Gay, D. M. (1984). A trust-region approach to linearly constrained optimization. In Griffiths, D. F., editor, Numerical Analysis. Proceedings, Dundee 1983, pages 72–105. Springer-Verlag. Lecture Notes in Mathematics 1066.
Giltinan, D. M., Carroll, R. J., and Ruppert, D. (1986). Some new estimation methods for weighted regression when there are possible outliers. Technometrics, 28(3):219–230.
Gray, R. J. (1994). Spline-based tests in survival analysis. Biometrics, 50(3):640–652.
Green, P. J. and Silverman, B. W. (1994). Nonparametric regression and generalized linear models, volume 58 of Monographs on Statistics and Applied Probability. Chapman & Hall, London. A roughness penalty approach.
Hájek, J. and Šidák, Z. (1967). Theory of rank tests. Academic Press, New York.
Hall, P. and La Scala, B. (1990). Methodology and algorithms of empirical likelihood. International Statistical Review, 58(2):109–127.
Hampel, F. R. (1974). The influence curve and its role in robust estimation. J. Amer. Statist. Assoc., 69:383–393.
Haupt, H. and Oberhofer, W. (2009). On asymptotic normality in nonlinear regression. Statist. Probab. Lett., 79(6):848–849.
Healy, M. and Westmacott, M. (1956). Missing values in experiments analysed on automatic computers. Journal of the Royal Statistical Society. Series C (Applied Statistics), 5(3):203–206.
Heckman, N. E. (1986). Spline smoothing in a partly linear model. Journal of the Royal Statistical Society. Series B (Methodological), 48(2):244–248.
Hettmansperger, T. P. and McKean, J. W. (1998). Robust nonparametric statistical methods, volume 5 of Kendall's Library of Statistics. Edward Arnold, London.
Hettmansperger, T. P., McKean, J. W., and Sheather, S. J. (2000). Robust nonparametric methods. J. Amer. Statist. Assoc., 95(452):1308–1312.
Hubert, M., Rousseeuw, P. J., and Van Aelst, S. (2008). High-breakdown robust multivariate methods. Statist. Sci., 23(1):92–119.
Jaeckel, L. A. (1972). Estimating regression coefficients by minimizing the dispersion of the residuals. Ann. Math. Statist., 43:1449–1458.
Jennrich, R. I. (1969). Asymptotic properties of non-linear least squares estimators. Ann. Math. Statist., 40:633–643.
Jurečková, J. (1971). Nonparametric estimate of regression coefficients. The Annals of Mathematical Statistics, 42(4):1328–1338.
Jurečková, J. (2008). Regression rank scores in nonlinear models. In Beyond parametrics in interdisciplinary research: Festschrift in honor of Professor Pranab K. Sen, volume 1 of Inst. Math. Stat. Collect., pages 173–183. Inst. Math. Statist., Beachwood, OH.
Kitamura, Y. (1997). Empirical likelihood methods with weakly dependent processes. The Annals of Statistics, 25(5):2084–2102.
Krasker, W. S. and Welsch, R. E. (1982). Efficient bounded-influence regression estimation. J. Amer. Statist. Assoc., 77(379):595–604.
Little, R. J. A. and Rubin, D. B. (1987). Statistical analysis with missing data. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. John Wiley & Sons Inc., New York.
Müller, U. U. (2009). Estimating linear functionals in nonlinear regression with responses missing at random. Ann. Statist., 37(5A):2245–2277.
Naranjo, J. D. and Hettmansperger, T. P. (1994). Bounded influence rank regression. J. Roy. Statist. Soc. Ser. B, 56(1):209–220.
Owen, A. (1990). Empirical likelihood ratio confidence regions. The Annals of Statistics, 18(1):90–120.
Peng, L. (2004). Empirical-likelihood-based confidence interval for the mean with a heavy-tailed distribution. The Annals of Statistics, 32(3):1192–1214.
R Development Core Team (2009). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
Rice, J. (1986). Convergence rates for partially splined models. Statistics & Probability Letters, 4(4):203–208.
Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866.
Robinson, P. M. (1988). Root-n-consistent semiparametric regression. Econometrica, 56(4):931–954.
Rousseeuw, P. J. (1984). Least median of squares regression. J. Amer. Statist. Assoc., 79(388):871–880.
Rousseeuw, P. J. (1985). Multivariate estimation with high breakdown point. In Grossmann, W., Pflug, G., Vincze, I., and Wertz, W., editors, Mathematical Statistics and Applications, volume 8, pages 283–297. Reidel, Dordrecht.
Rousseeuw, P. J. and Van Driessen, K. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41(3):212–223.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3):581–592.
Serfling, R. J. (1980). Approximation theorems of mathematical statistics. John Wiley & Sons Inc., New York. Wiley Series in Probability and Mathematical Statistics.
Sievers, G. L. (1983). A weighted dispersion function for estimation in linear models. Comm. Statist. A–Theory Methods, 12(10):1161–1179.
Simpson, D. G., Ruppert, D., and Carroll, R. J. (1992). On one-step GM estimates and stability of inferences in linear regression. Journal of the American Statistical Association, 87(418):439–450.
Speckman, P. (1988). Kernel smoothing in partial linear models. Journal of the Royal Statistical Society. Series B (Methodological), 50(3):413–436.
Stromberg, A. J. (1993). Computation of high breakdown nonlinear regression parameters. Journal of the American Statistical Association, 88(421):237–244.
Stromberg, A. J. (1995). Consistency of the least median of squares estimator in nonlinear regression. Comm. Statist. Theory Methods, 24(8):1971–1984.
Stromberg, A. J. and Ruppert, D. (1992). Breakdown in nonlinear regression. J. Amer. Statist. Assoc., 87(420):991–997.
Sun, Z., Wang, Q., and Dai, P. (2009). Model checking for partially linear models with missing responses at random. J. Multivar. Anal., 100(4):636–651.
Tableman, M. (1990). Bounded-influence rank regression: a one-step estimator based on Wilcoxon scores. J. Amer. Statist. Assoc., 85(410):508–513.
van der Vaart, A. W. (1998). Asymptotic statistics, volume 3 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge.
Van Zwet, W. R. (1980). A strong law for linear functions of order statistics. The Annals of Probability, 8(5):986–990.
Wang, C. Y., Wang, S., Zhao, L.-P., and Ou, S.-T. (1997). Weighted semiparametric estimation in regression analysis with missing covariate data. Journal of the American Statistical Association, 92(438):512–525.
Wang, J. (1996). Asymptotics of least-squares estimators for constrained nonlinear regression. Ann. Statist., 24(3):1316–1326.
Wang, J. D. (1995). Asymptotic normality of L1-estimators in nonlinear regression. J. Multivariate Anal., 54(2):227–238.
Wang, Q., Linton, O., and Härdle, W. (2004). Semiparametric regression analysis with missing response at random. Journal of the American Statistical Association, 99(466):334–345.
Wang, Q. and Rao, J. N. K. (2002). Empirical likelihood-based inference under imputation for missing response data. Ann. Statist., 30(3):896–924. Dedicated to the memory of Lucien Le Cam.
Wang, Q. and Sun, Z. (2007). Estimation in partially linear models with missing responses at random. Journal of Multivariate Analysis, 98(7):1470–1493.
Wiens, D. and Zhou, J. (1994). Bounded-influence rank estimation in the linear model. Canad. J. Statist., 22(2):233–245.
Wu, C.-F. (1981). Asymptotic theory of nonlinear least squares estimation. Ann. Statist., 9(3):501–513.
Xue, L. (2009). Empirical likelihood confidence intervals for response mean with data missing at random. Scandinavian Journal of Statistics, 36(4):671–685.
Xue, L. and Zhu, L. (2007a). Empirical likelihood for a varying coefficient model with longitudinal data. Journal of the American Statistical Association, 102(478):642–654.
Xue, L. and Zhu, L. (2007b). Empirical likelihood semiparametric regression analysis for longitudinal data. Biometrika, 94(4):921–937.
Xue, L.-G. and Zhu, L. (2006). Empirical likelihood for single-index models. Journal of Multivariate Analysis, 97(6):1295–1312.
Yohai, V. J. (1987). High breakdown-point and high efficiency robust estimates for regression. Ann. Statist., 15(2):642–656.
Zeger, S. L. and Diggle, P. J. (1994). Semiparametric models for longitudinal data with application to CD4 cell numbers in HIV seroconverters. Biometrics, 50(3):689–699.
Zhao, L. P., Lipsitz, S., and Lew, D. (1996). Regression analysis with missing covariate data using estimating equations. Biometrics, 52(4):1165–1182.
Zhu, L.-X. and Fang, K.-T. (1996). Asymptotics for kernel estimate of sliced inverse regression. The Annals of Statistics, 24(3):1053–1068.