PENALIZED STATISTICAL MODELS FOR PATHWAY-BASED TWAS AND HIGH-DIMENSIONAL MEDIATION ANALYSIS
Advisor
Slud, Eric V.
Abstract
High-throughput genomics and neuroimaging now generate thousands of correlated molecular and imaging features for each participant, presenting unprecedented opportunities, and methodological challenges, for causal and genetic discovery. This dissertation develops two complementary statistical frameworks that address key limitations of classical tools when confronted with such high-dimensional data. Chapter 1 provides the background. Genome-wide association studies (GWAS) have pinpointed numerous SNPs linked to human diseases and traits, yet many of these SNPs lie in non-coding regions and are hard to interpret. Transcriptome-wide association studies (TWAS) integrate GWAS and expression reference panels to identify associations at the gene level with tissue specificity, potentially improving interpretability. However, the list of individual genes identified by univariate TWAS has little unifying biological theme, so the underlying mechanisms remain largely elusive. These limitations motivate a unified framework that not only identifies trait-associated genes through multivariate TWAS, but also traces how their effects propagate through intermediate brain features to influence clinical outcomes, thus naturally extending gene-level association analysis into high-dimensional mediation analysis. Mediation analysis is a fundamental tool for elucidating causal mechanisms in complex systems. However, the emergence of high-throughput biological and neuroimaging technologies has introduced multivariate exposures and mediators of increasing dimensionality, rendering classical approaches inadequate.
In Chapter 2, we propose a novel multivariate TWAS method that Incorporates Pathway or gene Set information, named TIPS, to identify the genes and pathways most associated with complex polygenic traits. We jointly modeled the imputation and association steps in TWAS, incorporated a sparse group lasso penalty to induce selection at both the gene and pathway levels, and developed an expectation-maximization algorithm to estimate the parameters of the penalized likelihood. We applied our method to three complex traits in UK Biobank: systolic blood pressure, diastolic blood pressure, and white matter brain age gap, a brain aging biomarker. The analysis identified biologically relevant pathways and genes associated with these traits. These pathways cannot be detected by the traditional approach of univariate TWAS followed by pathway enrichment analysis, which demonstrates the power of our model. We also conducted comprehensive simulations with varying heritability levels and genetic architectures, and showed that our method outperformed other established TWAS methods in feature selection, statistical power, and prediction. The R package that implements TIPS is available at https://github.com/nwang123/TIPS.
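To make the two-level selection concrete, the following is a minimal Python sketch of a sparse group lasso penalty and its proximal operator, in which each group plays the role of a pathway and each coefficient the role of a gene effect. It assumes the commonly used parametrization with non-overlapping groups, an overall strength `lam`, and a mixing weight `alpha`. The abstract does not state the exact form used in TIPS, so the function names, the `sqrt(group size)` weights, and the parameters are illustrative assumptions, and the sketch does not reproduce the expectation-maximization algorithm or the TIPS R package.

```python
import math


def soft_threshold(x, t):
    """Shrink a scalar toward zero by t (the lasso proximal step)."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def sgl_penalty(beta, groups, lam, alpha):
    """Sparse group lasso penalty (assumed standard form, not taken from TIPS).

    beta   : list of coefficients (e.g. gene effects)
    groups : list of index lists, one per non-overlapping group (e.g. pathway)
    lam    : overall penalty strength
    alpha  : mixing weight in [0, 1]; 1 is pure lasso, 0 is pure group lasso
    """
    l1 = sum(abs(b) for b in beta)
    group_l2 = sum(
        math.sqrt(len(g)) * math.sqrt(sum(beta[j] ** 2 for j in g))
        for g in groups
    )
    return lam * (alpha * l1 + (1.0 - alpha) * group_l2)


def sgl_prox(beta, groups, lam, alpha, step=1.0):
    """Proximal operator of step * sgl_penalty for non-overlapping groups.

    First soft-threshold each coefficient (within-group sparsity), then
    shrink each group's norm, which can zero out a whole group.
    """
    out = [0.0] * len(beta)
    for g in groups:
        shrunk = [soft_threshold(beta[j], step * lam * alpha) for j in g]
        norm = math.sqrt(sum(s * s for s in shrunk))
        thresh = step * lam * (1.0 - alpha) * math.sqrt(len(g))
        scale = max(1.0 - thresh / norm, 0.0) if norm > 0 else 0.0
        for j, s in zip(g, shrunk):
            out[j] = scale * s
    return out


if __name__ == "__main__":
    beta = [3.0, 0.1, 0.2, 0.05]
    groups = [[0, 1], [2, 3]]  # two hypothetical pathways of two genes each
    print(sgl_prox(beta, groups, lam=1.0, alpha=0.5))
```

In this toy input, the second group is removed entirely while the first keeps only its strongest coefficient, which is the kind of simultaneous pathway-level and gene-level selection the penalty is meant to induce.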
In Chapter 3, we propose a novel aggregation-based mediation framework that simultaneously models and selects high-dimensional multivariate exposures and mediators. Our method identifies sparse linear combinations of variables in each domain that jointly maximize the mediated effect, defined as the product of exposure–mediator and mediator–outcome effects. To estimate these low-dimensional aggregators, we formulate a bi-convex objective function integrating residual sum-of-squares penalties from standard mediation submodels, a structured mediation-enhancing term, and $\ell_1$-penalties that induce sparsity. The resulting optimization problem is solved efficiently using an alternating direction method of multipliers (ADMM) algorithm with block coordinate updates.
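As a small illustration of the quantity being maximized, the Python sketch below computes a product-of-coefficients mediated effect for fixed aggregation weights. It is not the dissertation's objective function or its ADMM solver: the weight vectors `u` and `v`, the function names, and the use of ordinary least squares on the aggregated scalars are assumptions made only to show how the exposure-to-mediator and mediator-to-outcome effects multiply.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def center(v):
    """Subtract the mean so no intercept term is needed."""
    mean = sum(v) / len(v)
    return [x - mean for x in v]


def aggregate(rows, weights):
    """Collapse each row of features into one score via a linear combination."""
    return [dot(row, weights) for row in rows]


def mediated_effect(X, M, y, u, v):
    """Product-of-coefficients mediated effect for fixed weights (illustrative).

    X, M : lists of rows (exposure and mediator features per participant)
    y    : outcome values
    u, v : hypothetical aggregation weights; zeros drop features (sparsity)
    Returns (a * b, direct effect), where a is the slope of the aggregated
    mediator on the aggregated exposure, and b is the mediator slope in the
    outcome model that also adjusts for the aggregated exposure.
    """
    x = center(aggregate(X, u))
    m = center(aggregate(M, v))
    yc = center(y)

    a = dot(x, m) / dot(x, x)  # exposure -> mediator

    # Outcome model yc ~ b * m + c * x, solved via 2x2 normal equations.
    smm, sxx, smx = dot(m, m), dot(x, x), dot(m, x)
    smy, sxy = dot(m, yc), dot(x, yc)
    det = smm * sxx - smx ** 2
    b = (smy * sxx - smx * sxy) / det  # mediator -> outcome
    direct = (sxy * smm - smx * smy) / det  # exposure -> outcome, direct
    return a * b, direct


if __name__ == "__main__":
    # Toy data: only the first feature in each domain carries signal.
    X = [[0, 9], [1, 9], [2, 9], [3, 9]]
    M = [[1, 5], [1, 2], [3, 7], [7, 1]]
    y = [3, 4, 11, 24]  # constructed as 3 * mediator + 1 * exposure
    print(mediated_effect(X, M, y, u=[1, 0], v=[1, 0]))
```

A full method along the lines described above would search over sparse `u` and `v` rather than fixing them; the sketch only evaluates the mediated effect for one candidate pair.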
Through extensive simulations, we demonstrate that our method achieves superior performance in recovering true mediators and estimating mediation proportions across a wide range of signal strengths, noise levels, and correlation structures. Compared to existing methods (including MMP, Pathway Lasso, sparse PCA mediation, and the Directions of Mediation framework), our approach exhibits higher selection accuracy and reduced bias, particularly in challenging high-correlation regimes. We further illustrate the utility of the method in a real-data application involving the mediation of smoking behavior through neuroimaging features, revealing biologically meaningful pathways linking gene expression in the nucleus accumbens to structural and functional brain indices implicated in addiction. These findings highlight the potential of our framework for integrative mediation analysis in high-dimensional biomedical studies.