Regularization Methods for High-Dimensional Inference

High dimensionality is a common problem in statistical inference and is becoming more prevalent in modern data analysis. Although the data of interest may have a large, often unmanageable, dimension, various well-known techniques can be modified to improve performance and aid interpretation. We typically assume that although the predictors lie in a high-dimensional ambient space, they have a lower-dimensional structure that can be exploited through either prior knowledge or estimation.

In regression, the structure in the predictors can be taken into account implicitly through regularization. When the underlying structure in the predictors is known, using knowledge of that structure can improve prediction. We approach this problem through regularization with a known projection, based on knowledge of the structure of the Grassmannian. Using this projection, we obtain improvements over many classical and recent techniques in both regression and classification problems, with only a minor modification to a typical least squares problem.
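
As a rough illustration of what "a minor modification to a typical least squares problem" could look like, the sketch below penalizes only the part of the coefficient vector that falls outside a known subspace. This is an assumed formulation chosen for illustration, not necessarily the estimator developed in this work: the objective ||y - Xb||^2 + lam * ||(I - P)b||^2, the function name `projection_penalized_ls`, the basis matrix `U`, and the tuning parameter `lam` are all hypothetical.

```python
import numpy as np

def projection_penalized_ls(X, y, U, lam):
    """Illustrative (assumed) estimator: minimize
        ||y - X b||^2 + lam * ||(I - P) b||^2,
    where P is the orthogonal projection onto the column span of U.
    U is a hypothetical p x d matrix spanning the known predictor subspace.
    """
    p = X.shape[1]
    Q, _ = np.linalg.qr(U)             # orthonormal basis for span(U)
    P = Q @ Q.T                        # orthogonal projection onto span(U)
    penalty = lam * (np.eye(p) - P)    # penalizes only the part of b outside span(U)
    # Normal equations of the penalized problem. The system is nonsingular
    # when no nonzero vector in span(U) lies in the null space of X.
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)
```

With `lam = 0` this reduces to ordinary least squares (when that problem has a unique solution); as `lam` grows, the estimate is pushed toward the known subspace, which is what allows the system to be solved even when the number of predictors exceeds the number of observations.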

The structure of the predictors can also be taken into account explicitly through dimension reduction. We often want a lower-dimensional representation of the data in order to build potentially more interpretable models or to explore possible connections between predictors. In many problems, the distribution of the data differs between the stage at which model parameters are estimated and the stage at which predictions are made. This complicates estimation of a lower-dimensional predictor structure, because that structure may itself change between the two stages. We propose methods for estimating a linear dimension reduction that account for these discrepancies between data distributions while incorporating as much of the information in the data as possible into the construction of the predictor structure. These methods are built on regularized maximum likelihood and yield improvements in many regression and classification settings, including those in which the predictor dimension changes between training and testing.