Robust and Flexible Methods for Small Area Estimation
Abstract
Sample surveys are widely used to provide estimates for both the overall population and various subpopulations, known as domains, which can be defined by geographic or socio-demographic characteristics. Direct estimators rely solely on domain-specific sample data and are typically design-based, incorporating survey weights and relying on the probability distribution induced by the sampling design for making inferences. Although the total sample size in a survey is typically large, the sample size for specific domains may be small or even zero. When a domain-specific sample is too small to produce direct estimates with adequate precision, the domain is classified as a small domain or small area.
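As a minimal illustration of a direct estimator, the sketch below computes a weighted (Hájek-type) domain mean from survey weights. The function name `direct_domain_means` and its arguments are hypothetical and chosen for illustration only; they are not taken from the dissertation.

```python
import numpy as np

def direct_domain_means(y, weights, domain):
    """Weighted (Hajek-type) direct estimate of the mean in each sampled domain.

    Only units sampled in a domain contribute to that domain's estimate,
    so a domain with no sampled units gets no estimate at all.
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(weights, dtype=float)
    domain = np.asarray(domain)
    estimates = {}
    for d in np.unique(domain):
        mask = domain == d
        # Ratio of the weighted total to the sum of weights within the domain.
        estimates[d.item()] = float(np.sum(w[mask] * y[mask]) / np.sum(w[mask]))
    return estimates
```

A domain with only a handful of sampled units still receives an estimate here, but its variance can be large, which is the situation that motivates small area methods.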
The increasing demand for small area statistics has driven the development of small area estimation (SAE) techniques, which produce reliable estimates for domains with limited or no sample data. This dissertation focuses on enhancing the robustness and flexibility of model-based SAE methodologies by addressing three key challenges: model misspecification, flexible modeling, and uncertainty quantification.
The first study examines the effects of model misspecification on several commonly used small area estimators. The results show that when the underlying model is misspecified, the observed best prediction (OBP) method does not consistently outperform the empirical best linear unbiased predictor (EBLUP) in terms of the design-based mean squared prediction error (MSPE), even though OBP is designed to improve on the design-based MSPE of EBLUP under model misspecification. Both analytical and numerical evidence show that OBP performs better with aggregated auxiliary variables than with individual ones. This study offers practical insights for handling model misspecification in small area estimation.
The second study develops a framework for predicting complex small area characteristics, which are often nonlinear functions of the study variable for population units, using a nested error regression model with high-dimensional parameters. This study addresses multiple challenges simultaneously. First, it allows both regression coefficients and sampling variances to vary across areas, accommodating heterogeneity and enhancing modeling robustness. Second, we propose a novel algorithm for estimating area-specific model parameters, improving computational efficiency compared to existing algorithms. Third, we introduce a new approach for producing area-specific poverty estimates for out-of-sample areas, yielding less synthetic estimates than existing methods. Design-based simulation studies demonstrate that the proposed method outperforms existing approaches in terms of relative bias and relative root mean squared prediction error. Additionally, the method is applied to household survey data from the 2002 Albania Living Standards Measurement Survey to estimate poverty indicators for Albanian municipalities.
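One widely used example of such a nonlinear characteristic is the Foster-Greer-Thorbecke (FGT) family of poverty indicators. The abstract does not state which indicators are estimated, so the display below is an illustrative assumption, with notation ($N_i$, $E_{ij}$, $z$, $\alpha$) introduced here rather than taken from the dissertation. For area $i$ with $N_i$ population units, welfare variable $E_{ij}$ for unit $j$, and poverty line $z$:

```latex
$$
F_{\alpha i} \;=\; \frac{1}{N_i} \sum_{j=1}^{N_i}
\left( \frac{z - E_{ij}}{z} \right)^{\alpha} \mathbb{1}\{E_{ij} < z\},
\qquad \alpha \in \{0, 1, 2\}.
$$
```

Setting $\alpha = 0$ gives the poverty incidence (head-count ratio), $\alpha = 1$ the poverty gap, and $\alpha = 2$ the poverty severity. Each is a nonlinear function of the unit-level values, so it cannot be obtained by plugging a predicted area mean into a formula.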
A measurement of any quantity of interest is complete only when accompanied by an evaluation of its uncertainty. The third study advances the theory of parametric bootstrap methods for constructing highly efficient empirical best linear (EBL) prediction intervals for small area means, incorporating both fixed and random effects. We analytically demonstrate that even when the normality assumption for random effects is relaxed, the proposed EBL prediction interval maintains a second-order coverage error, provided a pivot exists for a suitably standardized random effect when hyperparameters are known. In the absence of a pivot, we find that the order of coverage error of the parametric bootstrap EBL prediction interval is $O(m^{-1})$, and the first-order term is theoretically positive under certain conditions, indicating possible overcoverage of the EBL prediction interval. This characteristic may be advantageous for practitioners who prioritize coverage over other properties of prediction intervals. Furthermore, we propose a novel double bootstrap method, which can correct coverage issues in general. Monte Carlo simulations indicate that the proposed single bootstrap method performs well compared to alternative approaches.
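To make the idea of a parametric bootstrap prediction interval concrete, the following is a rough sketch under strong simplifying assumptions that are not taken from the dissertation: an area-level model with normal random effects and known sampling variances `D`, a simple moment estimator for the random-effect variance `A` with an arbitrary floor of 0.01, and a single bootstrap. All function names are hypothetical, and `m` is used here for the number of areas. The dissertation's own method covers non-normal random effects and a double bootstrap, neither of which is attempted here.

```python
import numpy as np

def fit_area_level_model(y, X, D):
    """Moment estimate of the random-effect variance A, then weighted least squares for beta."""
    m, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    resid = y - X @ (XtX_inv @ X.T @ y)           # ordinary least squares residuals
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # leverages x_i' (X'X)^{-1} x_i
    A = (resid @ resid - np.sum(D * (1.0 - h))) / (m - p)
    A = max(A, 0.01)                              # arbitrary positive floor (assumption)
    w = 1.0 / (A + D)
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return beta, A

def eb_predict(y, X, D, beta, A):
    """Empirical best predictor of each area mean and its leading variance term g1."""
    B = D / (A + D)                               # shrinkage factor toward the regression fit
    theta_hat = X @ beta + (1.0 - B) * (y - X @ beta)
    g1 = A * D / (A + D)
    return theta_hat, g1

def bootstrap_prediction_interval(y, X, D, alpha=0.05, n_boot=500, seed=0):
    """Single parametric bootstrap interval built from a studentized pivot."""
    rng = np.random.default_rng(seed)
    m = len(y)
    beta, A = fit_area_level_model(y, X, D)
    theta_hat, g1 = eb_predict(y, X, D, beta, A)
    pivots = np.empty((n_boot, m))
    for b in range(n_boot):
        # Simulate true area means and direct estimates from the fitted model.
        theta_star = X @ beta + rng.normal(0.0, np.sqrt(A), m)
        y_star = theta_star + rng.normal(0.0, np.sqrt(D))
        # Re-estimate the parameters on the bootstrap sample and form the pivot.
        beta_b, A_b = fit_area_level_model(y_star, X, D)
        theta_b, g1_b = eb_predict(y_star, X, D, beta_b, A_b)
        pivots[b] = (theta_star - theta_b) / np.sqrt(g1_b)
    q_lo, q_hi = np.quantile(pivots, [alpha / 2, 1 - alpha / 2], axis=0)
    return theta_hat + q_lo * np.sqrt(g1), theta_hat + q_hi * np.sqrt(g1)
```

The interval replaces normal quantiles with bootstrap quantiles of the pivot, so the extra variability from estimating the model parameters is reflected in the interval endpoints.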
Overall, this dissertation provides valuable insights into critical challenges in small area estimation, specifically in model misspecification, flexible modeling, and uncertainty quantification. Future research should explore semi-parametric and nonparametric methods to further enhance the robustness of inference for small areas.