Regression Diagnostics for Complex Survey Data: Identification of Influential Observations

Discussions of diagnostics for linear regression models have become indispensable chapters or sections in most statistical textbooks. The survey literature, however, has given little attention to this problem. Examples from real surveys show that including or excluding a small number of sampled units can greatly change the regression parameter estimates, which indicates that techniques for identifying influential units are needed. The goal of this research is to extend and adapt the conventional ordinary least squares influence diagnostics to complex survey data, and to determine how they should be justified.

We assume that an analyst is looking for a linear regression model that fits reasonably well for the bulk of the finite population and chooses to use the survey-weighted regression estimator. Diagnostic statistics such as DFBETAS, DFFITS, and a modified Cook's Distance are constructed to evaluate the effect on the regression coefficients of deleting a single observation. As components of the diagnostic statistics, the estimated variances of the coefficients are obtained from design-consistent estimators that account for complex design features, e.g., clustering and stratification. For survey data, the sample weights, which are computed with the primary goal of estimating finite population statistics, are a source of influence in addition to the response variable and the predictors, and therefore need to be incorporated into the influence measures. The forward search method is also adapted to identify influential observations as a group when there are possible masking effects among the outlying observations.
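The single-deletion mechanics behind statistics such as DFBETAS can be sketched as follows. This is only an illustration under simplifying assumptions: the function names and the synthetic data are hypothetical, and the standard error used here is a naive weighted-OLS estimate rather than the design-consistent (clustering- and stratification-aware) variance estimators developed in the dissertation.

```python
import numpy as np

def survey_wls(X, y, w):
    """Survey-weighted least squares: solves (X'WX) beta = X'Wy."""
    XtW = X.T * w                       # apply the sample weights
    return np.linalg.solve(XtW @ X, XtW @ y)

def dfbetas(X, y, w):
    """Delete-one DFBETAS for a survey-weighted regression (sketch).

    NOTE: the variance below is a naive weighted-OLS estimate; the
    dissertation instead uses design-consistent estimators.  Only the
    deletion mechanics are illustrated here.
    """
    n, p = X.shape
    beta_full = survey_wls(X, y, w)
    out = np.empty((n, p))
    for i in range(n):
        keep = np.arange(n) != i        # drop unit i and its weight
        Xi, yi, wi = X[keep], y[keep], w[keep]
        beta_i = survey_wls(Xi, yi, wi)
        resid = yi - Xi @ beta_i
        sigma2 = np.sum(wi * resid**2) / (np.sum(wi) - p)
        cov = sigma2 * np.linalg.inv((Xi.T * wi) @ Xi)
        out[i] = (beta_full - beta_i) / np.sqrt(np.diag(cov))
    return out

# Hypothetical example: plant one high-leverage, heavily weighted unit
rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=30)
x[0], y[0] = 5.0, -10.0                 # make unit 0 influential
w = np.ones(30)
w[0] = 5.0                              # give it a large survey weight
X = np.column_stack([np.ones(30), x])
d = dfbetas(X, y, w)
```

In this sketch the planted unit stands out with by far the largest absolute DFBETAS values, which is the behavior the adapted diagnostics are designed to detect.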

Two case studies and simulations are conducted in this dissertation to test the performance of the adapted diagnostic statistics. We conclude that removing the identified influential observations before fitting the model yields less biased coefficient estimates. The standard errors of the coefficients may be underestimated, however, since the variation in the number of observations used in the regressions is not accounted for.