Effects of Model Selection on the Coverage Probability of Confidence Intervals in Binary-Response Logistic Regression

Thumbnail Image


umi-umd-5618.pdf (888.49 KB)
No. of downloads: 2478

Publication or External Link






While model selection is viewed as a fundamental task in data analysis, it imposes considerable effects on the subsequent inference. In applied statistics, it is common to carry out a data-driven approach in model selection and draw inference conditional on the selected model, as if it is given a priori. Parameter estimates following this procedure, however, generally do not reflect uncertainty about the model structure. As far as confidence intervals are concerned, it is often misleading to report estimates based upon conventional 1−α without considering possible post-model-selection impact. This paper addresses the coverage probability of confidence intervals of logit coefficients in binary-response logistic regression. We conduct simulation studies to examine the performance of automatic model selectors AIC and BIC, and their subsequent effects on actual coverage probability of interval estimates. Important considerations (e.g. model structure, covariate correlation, etc.) that may have key influence are investigated. This study contributes in terms of understanding quantitatively how the post-model-selection confidence intervals perform in terms of coverage in binary-response logistic regression models.

A major conclusion was that while it is usually below the nominal level, there is no simple predictable pattern with regard to how and how far the actual coverage probability of confidence intervals may fall. The coverage probability varies given the effects of multiple factors:

(1) While the model structure always plays a role of paramount importance, the covariate correlation significantly affects the interval's coverage, with the tendency that a higher correlation indicates a lower coverage probability.

(2) No evidence shows that AIC inevitably outperforms BIC in terms of achieving higher coverage probability, or vice versa. The model selector's performance is dependent upon the uncertain model structure and/or the unknown parameter vector θ .

(3) While the effect of sample size is intriguing, a larger sample size does not necessarily achieve asymptotically more accurate inference on interval estimates.

(4) Although the binary threshold of the logistic model may affect the coverage probability, such effect is less important. It is more likely to become substantial with an unrestricted model when extreme values along the dimensions of other factors (e.g. small sample size, high covariate correlation) are observed.