Effects of Model Selection on the Coverage Probability of Confidence Intervals in Binary-Response Logistic Regression

Date
2008-07-24
Author
Zhang, Dongquan
Advisor
Dayton, C. Mitchell
Abstract
While model selection is viewed as a fundamental task in data analysis, it has
considerable effects on subsequent inference. In applied statistics, it is common to
take a data-driven approach to model selection and to draw inference conditional on the
selected model, as if it had been given a priori. Parameter estimates following this procedure,
however, generally do not reflect uncertainty about the model structure. As far as
confidence intervals are concerned, it is often misleading to report estimates at the
conventional 1−α level without considering possible post-model-selection impact. This paper
addresses the coverage probability of confidence intervals of logit coefficients in binary-response
logistic regression. We conduct simulation studies to examine the performance
of the automatic model selectors AIC and BIC, and their subsequent effects on the actual
coverage probability of interval estimates. Important considerations that may have key
influence (e.g. model structure and covariate correlation) are investigated. This study
contributes a quantitative understanding of how post-model-selection
confidence intervals perform in terms of coverage in binary-response logistic regression
models.
A major conclusion is that, while the actual coverage probability of confidence intervals
is usually below the nominal level, there is no simple predictable pattern in how, or how
far, it may fall. The coverage probability varies with the effects of
multiple factors:
(1) While the model structure always plays a role of paramount importance, the
covariate correlation significantly affects the interval's coverage, with the tendency that a
higher correlation indicates a lower coverage probability.
(2) No evidence shows that AIC inevitably outperforms BIC in terms of achieving
higher coverage probability, or vice versa. The model selector's performance is
dependent upon the uncertain model structure and/or the unknown parameter vector θ.
(3) While the effect of sample size is intriguing, a larger sample size does not
necessarily achieve asymptotically more accurate inference on interval estimates.
(4) Although the binary threshold of the logistic model may affect the coverage
probability, this effect is less important. It is more likely to become substantial with an
unrestricted model when other factors take extreme values (e.g. small
sample size, high covariate correlation).
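
The kind of simulation described in the abstract can be illustrated with a minimal sketch. It is not the author's design: the two-model candidate set, the Wald-type intervals, the bivariate normal covariates, and every parameter value and function name (fit_logit, coverage_after_selection) are assumptions chosen for illustration. AIC and BIC are computed in their standard forms, −2 log L + 2k and −2 log L + k log n. The sketch estimates how often a nominal 95% interval for one logit coefficient covers its true value when the interval is taken from whichever model the criterion selects.

```python
import numpy as np


def fit_logit(X, y, n_iter=25, tol=1e-8):
    """Fit a logistic regression by Newton-Raphson.

    Returns (coefficients, estimated covariance matrix, log-likelihood).
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        info = X.T @ (X * (p * (1.0 - p))[:, None])   # Fisher information
        step = np.linalg.solve(info, X.T @ (y - p))   # Newton step
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    p = np.clip(1.0 / (1.0 + np.exp(-X @ beta)), 1e-12, 1.0 - 1e-12)
    loglik = float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
    cov = np.linalg.inv(X.T @ (X * (p * (1.0 - p))[:, None]))
    return beta, cov, loglik


def coverage_after_selection(n=200, rho=0.5, beta=(0.0, 1.0, 0.3),
                             criterion="aic", reps=500, seed=0):
    """Estimate coverage of a nominal 95% Wald interval for the x1 coefficient
    when the interval comes from the AIC- or BIC-selected model.

    Assumed (hypothetical) setup: true model has intercept, x1 and x2;
    candidates are {intercept, x1} and {intercept, x1, x2}.
    """
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta, dtype=float)
    cov_x = np.array([[1.0, rho], [rho, 1.0]])
    z = 1.959963984540054  # 97.5% quantile of the standard normal
    hits = 0
    for _ in range(reps):
        x = rng.multivariate_normal(np.zeros(2), cov_x, size=n)
        X_full = np.column_stack([np.ones(n), x])
        prob = 1.0 / (1.0 + np.exp(-X_full @ beta))
        y = rng.binomial(1, prob).astype(float)
        best = None
        for cols in ([0, 1], [0, 1, 2]):  # reduced model, then full model
            b, cov, ll = fit_logit(X_full[:, cols], y)
            k = len(cols)
            penalty = 2.0 * k if criterion == "aic" else k * np.log(n)
            score = -2.0 * ll + penalty
            if best is None or score < best[0]:
                # x1 is column 1 in both candidate models
                best = (score, b[1], np.sqrt(cov[1, 1]))
        _, est, se = best
        hits += int(abs(est - beta[1]) <= z * se)
    return hits / reps


if __name__ == "__main__":
    for crit in ("aic", "bic"):
        for rho in (0.0, 0.5, 0.9):
            cp = coverage_after_selection(rho=rho, criterion=crit)
            print(f"{crit.upper()}  rho={rho:.1f}  estimated coverage={cp:.3f}")
```

Varying n, rho, beta and the criterion in this sketch mirrors the factors listed above. The estimates carry Monte Carlo error, so a larger reps is needed before comparing settings.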