NEW STATISTICAL METHODS FOR HIGH-DIMENSIONAL INTERCONNECTED DATA WITH UNIFORM BLOCKS

Date

2023

Abstract

Empirical analyses of high-dimensional biomedical data, including genomics, proteomics, microbiome, and neuroimaging data, consistently reveal the presence of strong modularity in the dependence patterns. In these analyses, highly correlated features often form a few distinct communities or modules, which can be interconnected with each other. While the interconnected community structure has been extensively studied in biomedical research (e.g., gene co-expression networks), its potential to assist in statistical modeling and inference remains largely unexplored. To address this research gap, we propose novel statistical models and methods that capitalize on the prevalent community structures observed in large covariance and precision matrices derived from high-dimensional biomedical interconnected data.

The first objective of this dissertation is to study the algebraic properties of the proposed interconnected community structure at the population level. Specifically, this pattern partitions the population covariance matrix into uniform blocks (i.e., blocks with equal variances and equal covariances). To accomplish this objective, we introduce a block Hadamard product representation in Chapter 2, which relies on two lower-dimensional "coordinate" matrices and a pre-specified vector. This representation yields explicit expressions for the square or power, determinant, inverse, eigendecomposition, canonical form, and other matrix functions of the original larger-dimensional matrix in terms of these lower-dimensional "coordinate" matrices.
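To make the representation above concrete, the following is a minimal sketch rather than the dissertation's own code. It assumes the uniform-block matrix takes the form N = A∘I(p) + B∘J(p), where A is a K×K diagonal matrix (stored here as the vector `a`), B is a K×K symmetric matrix, and p is the vector of block sizes; the abstract does not spell out this form, so it is an assumption. The function names `uniform_block_matrix` and `small_matrix_summaries` are hypothetical. Under that assumption, the eigenvalues, log-determinant, and inverse of the large matrix can be obtained from K×K computations only.

```python
import numpy as np

def uniform_block_matrix(a, B, p):
    """Assumed form: block (k, l) equals a[k] * I (only when k == l) plus B[k, l] * J."""
    a, B, p = np.asarray(a, float), np.asarray(B, float), np.asarray(p, int)
    # Z is the (sum(p) x K) membership matrix: row i is the indicator of i's block.
    Z = np.repeat(np.eye(len(p)), p, axis=0)
    return np.diag(np.repeat(a, p)) + Z @ B @ Z.T

def small_matrix_summaries(a, B, p):
    """Eigenvalues, log-determinant, and inverse coordinates from K x K algebra.

    Assumes every a[k] > 0 and that the large matrix is positive definite.
    """
    a, B, p = np.asarray(a, float), np.asarray(B, float), np.asarray(p, int)
    M = np.diag(a) + B @ np.diag(p.astype(float))  # K x K matrix A + B P
    # a[k] is an eigenvalue with multiplicity p[k] - 1; the rest come from M.
    eigenvalues = np.concatenate([np.repeat(a, p - 1), np.linalg.eigvals(M).real])
    log_det = np.sum((p - 1) * np.log(a)) + np.linalg.slogdet(M)[1]
    # The inverse keeps the same uniform-block form with these coordinates.
    a_inv = 1.0 / a
    B_inv = -np.linalg.solve(M, B) @ np.diag(a_inv)
    return eigenvalues, log_det, (a_inv, B_inv)

# Illustrative values (not from the dissertation): three blocks of sizes 3, 2, 4.
a = [1.0, 2.0, 0.5]
B = [[0.5, 0.2, 0.1], [0.2, 0.8, -0.1], [0.1, -0.1, 0.3]]
p = [3, 2, 4]
Sigma = uniform_block_matrix(a, B, p)
eigenvalues, log_det, (a_inv, B_inv) = small_matrix_summaries(a, B, p)
Sigma_inv = uniform_block_matrix(a_inv, B_inv, p)  # inverse rebuilt from coordinates
```

The practical point of the sketch is that all expensive operations act on K×K matrices, where K is the number of communities, rather than on the full matrix.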

Estimating a covariance matrix is central to high-dimensional data analysis. Our second objective is to consistently estimate a large covariance or precision matrix that has an interconnected community structure with uniform blocks. In Chapter 3, we derive the best unbiased estimators of the covariance and precision matrices in closed form and provide theoretical results on their asymptotic properties. Our proposed method improves the accuracy of covariance and precision matrix estimation and outperforms competing methods in both simulations and real data analyses.
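The abstract does not state the estimator's formula. As a hedged illustration of what a closed-form estimator can look like under a uniform-block structure, the sketch below assumes a simple block-averaging rule applied to a sample covariance matrix S with a known partition p: average the off-diagonal entries within each diagonal block, average all entries of each off-diagonal block, and take the mean diagonal minus the mean off-diagonal as the block's diagonal coordinate. This rule and the name `block_average_covariance` are assumptions for illustration, not the dissertation's stated method.

```python
import numpy as np

def block_average_covariance(S, p):
    """Project a sample covariance S onto the uniform-block form by averaging.

    Returns (a_hat, B_hat, Sigma_hat), where Sigma_hat has block (k, l) equal to
    a_hat[k] * I (only when k == l) plus B_hat[k, l] * J.
    """
    S, p = np.asarray(S, float), np.asarray(p, int)
    K = len(p)
    edges = np.concatenate([[0], np.cumsum(p)])  # block boundaries
    a_hat, B_hat = np.zeros(K), np.zeros((K, K))
    for k in range(K):
        rows = slice(edges[k], edges[k + 1])
        for l in range(K):
            block = S[rows, edges[l]:edges[l + 1]]
            if k != l:
                B_hat[k, l] = block.mean()  # common between-block covariance
                continue
            diag_mean = np.trace(block) / p[k]
            # A block of size 1 has no off-diagonal entries; use 0 by convention.
            off_mean = ((block.sum() - np.trace(block)) / (p[k] * (p[k] - 1))
                        if p[k] > 1 else 0.0)
            B_hat[k, k] = off_mean
            a_hat[k] = diag_mean - off_mean
    Z = np.repeat(np.eye(K), p, axis=0)  # membership matrix
    Sigma_hat = np.diag(np.repeat(a_hat, p)) + Z @ B_hat @ Z.T
    return a_hat, B_hat, Sigma_hat
```

When S already has exactly uniform blocks, this averaging returns S unchanged; with noisy data it pools information across all entries that share a common parameter.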

In Chapter 4, our goal is to investigate the effects of alcohol intake (as an exposure) on metabolomics outcome features. Like other omics data, metabolomic outcomes often consist of numerous features that exhibit a structured dependence pattern, such as a co-expression network with interconnected modules. Properly accounting for this dependence structure is crucial for accurate statistical inference and for identifying metabolomic outcomes related to alcohol intake. However, incorporating structured dependence patterns into multivariate outcome regression models still poses difficulties for accurate estimation and inference. To bridge this gap, we propose a novel multivariate regression model that accounts for the correlations among outcome features using a network structure composed of interconnected modules. Additionally, we derive closed-form estimators of the regression parameters and provide inference tools. Extensive simulation analysis demonstrates that our approach yields much-improved sensitivity with a well-controlled discovery rate when benchmarked against existing multivariate regression models.

Confirmatory factor analysis (CFA) models play a crucial role in revealing underlying latent common factors within sets of correlated variables. However, their implementation often relies on a strong prior theory to categorize variables into distinct classes, and such a theory is frequently unavailable (e.g., in omics data analysis). To address this limitation, in Chapter 5, we propose a novel strategy based on network analysis that lets data-driven discovery substitute for the missing prior theory. By leveraging the detected interconnected community structure, our approach offers an elegant statistical interpretation and yields closed-form uniformly minimum variance unbiased estimators for all unknown matrices. To evaluate the effectiveness of our proposed estimation procedure, we compare it to conventional numerical methods and validate it through extensive Monte Carlo simulations and real-world applications.
