Robust estimation and outlier detection in linear models for grouped data

Author:
  1. Pérez Garrido, Betsabé
Supervised by:
  1. Daniel Peña Sánchez de Rivera (Supervisor)
  2. Isabel Molina Peralta (Supervisor)

Defending university: Universidad Carlos III de Madrid

Date of defense: 3 February 2012

Examination committee:
  1. Juan José Romo Urroz (Chair)
  2. María Luz Durbán Reguera (Secretary)
  3. María Dolores Ugarte Martínez (Member)
  4. Domingo Morales González (Member)
  5. Ralf Münnich (Member)

Type: Doctoral thesis

Abstract

Statistical models are, implicitly or explicitly, based on a certain number of assumptions. Any of these assumptions can fail because of atypical observations in the data that do not follow the model under consideration. In practice, outlying observations are quite common, so it is important to use estimation methods that treat them appropriately. The literature offers two main approaches to this problem. The first applies robust methods that reduce the impact of outlying observations on the estimation of the model parameters. The second uses diagnostic methods that identify outlying observations before fitting the model, eliminates them, and then applies a non-robust estimation method to the remaining clean data. This dissertation treats the problems of robust estimation and outlier detection when the data have a grouped structure and most of the data satisfy one of the following models: a linear regression model with fixed group effects, or a linear regression model with random group effects.

Chapter 1 provides an introduction to the topics addressed in the dissertation, including background and motivation. Chapter 2 describes basic robust methods and diagnostic measures for linear regression models. Chapter 3 introduces the linear model with fixed group effects. To reduce the impact of outlying observations, we develop an extension of the method of Peña and Yohai (1999), which is based on projecting the observations onto several directions called principal sensitivity components. Outlying observations appear with extreme coordinates in these directions. Based on these coordinates, a subset of observations is chosen and an estimator is obtained by minimizing a robust scale of the residuals (similarly to S-estimators). The new extension is called groupwise principal sensitivity components (GPSC).
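The principal-sensitivity-components construction underlying GPSC can be sketched for an ordinary (ungrouped) regression. The sketch below is a minimal illustration of the Peña and Yohai (1999) idea only, not the groupwise extension developed in the dissertation; the function name and the final robust-cutoff rule are our own illustrative choices.

```python
import numpy as np

def principal_sensitivity_components(X, y, n_comp=2):
    """Illustrative sketch of principal sensitivity components
    (Peña & Yohai, 1999). Observations with extreme coordinates
    in the leading components are candidate outliers."""
    n = X.shape[0]
    Xc = np.column_stack([np.ones(n), X])        # add intercept
    H = Xc @ np.linalg.solve(Xc.T @ Xc, Xc.T)    # hat matrix
    e = y - H @ y                                # OLS residuals
    h = np.diag(H)
    t = e / (1.0 - h)                            # deletion residuals
    # column i = change in the fitted values when observation i is deleted
    R = H * t
    M = R @ R.T                                  # sensitivity matrix
    w, V = np.linalg.eigh(M)                     # eigenvalues ascending
    return V[:, ::-1][:, :n_comp]                # leading components

# usage: flag observations with extreme coordinates (MAD-based cutoff)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.5, size=50)
y[:3] += 10.0                                    # three planted outliers
Z = principal_sensitivity_components(X, y)
dev = np.abs(Z - np.median(Z, axis=0))
scores = dev / (1.4826 * np.median(dev, axis=0))
flagged = np.where((scores > 3).any(axis=1))[0]
```

In the dissertation's setting the construction is adapted to grouped data, and the selected clean subset feeds an estimator minimizing a robust scale of the residuals.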
Our extension is compared with other proposals in the literature, namely the RDL1 method of Hubert and Rousseeuw [19] and the M-S estimators of Maronna and Yohai (2001). We compare these methods in different simulation scenarios and under different types of contamination. Our simulation results show that the GPSC method detects a high percentage of the outlying observations while flagging few false outliers (swamping effect). It is also able to detect outlying observations in the space of the explanatory variables (high leverage points), including masked outlying observations (masking effect).

Chapter 4 introduces the linear model with random group effects, together with some diagnostic measures proposed in the literature, which are based on the assumption that the variance components are known (that is, not estimated). In practice, the variance components are unknown and must be estimated from the data. Through several examples we show that using non-robust methods to estimate the variance components can give a misleading picture of the validity of the model assumptions.

Chapter 5 considers a linear model with random effects for the groups. Under this model, a robust procedure is proposed for the estimation of the model parameters (variance components and regression coefficients) and for the prediction of the random effects. The variance components are estimated by a robustification of Henderson method III (Searle et al., 1992).
The procedure has several advantages: it gives explicit expressions for the robust estimators, avoiding iterative methods and the need for good starting values; it requires no assumption about the shape of the distribution of the response variable beyond the existence of first- and second-order moments; it is computationally undemanding; and it is based simply on fitting two simpler linear regression models. The result is a two-step procedure. In the first step, the variance components are estimated by the robustified Henderson method III. In the second step, the fixed regression parameters are estimated and the random effects are predicted similarly to Sinha and Rao (2009). This robust procedure is applied to small area estimation, where the target is to estimate the population means of the areas. Alternative robust small area estimators of these means are given, based on the robust fitting procedure described above. Chapter 6 extends the robustified Henderson method III to general linear mixed models.
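For context, the classical (non-robust) Henderson method III that the dissertation robustifies can be sketched for the one-way random-effects model y = Xβ + Zu + e, with Var(u_i) = σ²_u and Var(e_ij) = σ²_e. The sketch below is the standard closed-form skeleton (quadratic forms in projection matrices); the dissertation's contribution, replacing the sums of squares with robust scales, is not shown, and the helper name is our own.

```python
import numpy as np

def henderson3_oneway(X, y, groups):
    """Classical Henderson method III for the one-way
    random-effects model (illustrative, non-robust version)."""
    n = len(y)
    g = np.unique(groups)
    Z = (groups[:, None] == g[None, :]).astype(float)   # group dummies
    Xi = np.column_stack([np.ones(n), X])               # fixed part
    W = np.column_stack([Xi, Z])                        # full design

    def proj(A):                                        # projection onto col(A)
        return A @ np.linalg.pinv(A.T @ A) @ A.T

    PX, PW = proj(Xi), proj(W)
    rX = np.linalg.matrix_rank(Xi)
    rW = np.linalg.matrix_rank(W)
    sse_full = y @ (np.eye(n) - PW) @ y                 # residual SS, full model
    sig_e = sse_full / (n - rW)
    red = y @ (PW - PX) @ y                             # reduction due to groups
    tr = np.trace(Z.T @ (np.eye(n) - PX) @ Z)
    sig_u = max((red - (rW - rX) * sig_e) / tr, 0.0)    # truncate at zero
    return sig_u, sig_e

# usage: simulated grouped data with true variances sigma_u^2 = sigma_e^2 = 1
rng = np.random.default_rng(1)
m, k = 40, 10
groups = np.repeat(np.arange(m), k)
X = rng.normal(size=(m * k, 2))
u = rng.normal(scale=1.0, size=m)
y = X @ np.array([2.0, -1.0]) + u[groups] + rng.normal(size=m * k)
sig_u, sig_e = henderson3_oneway(X, y, groups)
```

Because both variance-component estimators are explicit functions of two regression fits, no iteration or starting values are needed, which is the property the robustified version preserves.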