Mathematics

Permanent URI for this community: http://hdl.handle.net/1903/2261

Search Results

Now showing 1 - 10 of 12
  • Item
    Application of Causal Inference in Large-Scale Biomedical Data
    (2024) Zhao, Zhiwei; Chen, Shuo; Mathematical Statistics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
This dissertation contains three projects that tackle challenges in the application of causal inference to large-scale biomedical data. Project 1 proposes a novel mediation analysis framework for settings with multiple mediators and outcomes. It can extract mediation pathways efficiently and estimate the mediation effect from multiple mediators simultaneously. The effectiveness of the proposed method is validated through extensive simulation and a real data application to a human connectome study. Project 2 introduces a double machine learning based method, assisted by an algorithm ensemble, for estimating longitudinal causal effects. This approach reduces estimation bias and accommodates high-dimensional covariates. The validity of the proposed method is justified by simulation studies and an application to adolescent brain cognitive development data, specifically evaluating the impact of sleep insufficiency on youth cognitive development. Project 3 develops a new bias-reduction estimation method that addresses unmeasured confounding by leveraging proximal learning and negative control outcome techniques. This method can handle a moderate number of exposures and multivariate outcomes in the presence of unmeasured confounders. Both numerical experiments and a data application using the UK Biobank demonstrate that the proposed method effectively reduces estimation bias caused by unmeasured confounding. Collectively, these three projects introduce innovative methodologies for causal inference in neuroimaging: advancing mediation analysis, improving longitudinal causal effect estimation, and reducing estimation bias in the presence of unmeasured confounding.
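    A minimal sketch of the double machine learning idea behind Project 2 (illustrative only: the partially linear setup, the random-forest nuisance learners, and all variable names are assumptions, not the dissertation's longitudinal, ensemble-assisted estimator):
    ```python
    # Minimal double machine learning sketch (partially linear model):
    # Y = theta*A + g(X) + eps,  A = m(X) + eta.  Cross-fitting removes
    # regularization bias from the ML nuisance fits.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import KFold

    def dml_ate(X, A, Y, n_folds=5, seed=0):
        """Cross-fitted residual-on-residual estimate of theta."""
        resid_A = np.zeros_like(A, dtype=float)
        resid_Y = np.zeros_like(Y, dtype=float)
        for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X):
            m_hat = RandomForestRegressor(random_state=seed).fit(X[train], A[train])
            g_hat = RandomForestRegressor(random_state=seed).fit(X[train], Y[train])
            resid_A[test] = A[test] - m_hat.predict(X[test])
            resid_Y[test] = Y[test] - g_hat.predict(X[test])
        return np.sum(resid_A * resid_Y) / np.sum(resid_A ** 2)

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 10))
    A = X[:, 0] + rng.normal(size=500)                 # exposure depends on covariates
    Y = 2.0 * A + X[:, 0] ** 2 + rng.normal(size=500)  # true effect = 2
    print(dml_ate(X, A, Y))                            # should land near 2
    ```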
  • Item
    Variable selection and causal discovery methods with application in noncoding RNA regulation of gene expression
    (2024) Ke, Hongjie; Ma, Tianzhou; Mathematics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Noncoding RNAs (ncRNAs), including long noncoding RNAs (lncRNAs) and microRNAs (miRNAs), are critical regulators that control gene expression at multiple levels. Revealing how ncRNAs regulate their target genes in disease-associated pathways will provide mechanistic insights into disease and has potential clinical utility. In this dissertation, we developed novel variable selection and causal discovery methods to study the regulatory relationships between ncRNAs and genes. In Chapter 2, we proposed a novel screening method based on robust partial correlation to identify noncoding RNA regulators of gene expression over the whole genome. In Chapter 3, we developed a computationally efficient two-stage Bayesian Network (BN) learning method to construct ncRNA-gene regulatory networks from transcriptomic data on both coding genes and noncoding RNAs. To accompany the developed BN learning method, we provided a novel analytical platform with a graphical user interface (GUI) covering the entire pipeline of data preprocessing, network construction, module detection, visualization, and downstream analyses. In Chapter 4, we proposed a Bayesian indicator variable selection model with a hierarchical structure to uncover how the regulatory mechanism between noncoding RNAs and genes changes across biological conditions (e.g., cancer stages). In Chapter 5, we discussed potential extensions and future work. This dissertation presents computationally efficient and statistically rigorous methods that jointly analyze high-dimensional noncoding RNA and gene expression data to investigate their regulatory relationships, deepening our understanding of the molecular mechanisms of diseases.
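    A minimal sketch of screening by partial correlation, the idea behind Chapter 2 (the dissertation uses a robust variant; this ordinary least-squares version and all names are illustrative):
    ```python
    # Rank candidate ncRNA regulators of a target gene after regressing out
    # shared covariates Z, then keep the top-scoring candidates.
    import numpy as np

    def partial_corr(x, y, Z):
        """Correlation of x and y after projecting out the columns of Z."""
        Z1 = np.column_stack([np.ones(len(x)), Z])
        rx = x - Z1 @ np.linalg.lstsq(Z1, x, rcond=None)[0]
        ry = y - Z1 @ np.linalg.lstsq(Z1, y, rcond=None)[0]
        return float(np.corrcoef(rx, ry)[0, 1])

    def screen(ncrna_mat, gene, Z, top_k=10):
        """Indices of the top_k ncRNAs by absolute partial correlation."""
        scores = [abs(partial_corr(ncrna_mat[:, j], gene, Z))
                  for j in range(ncrna_mat.shape[1])]
        return np.argsort(scores)[::-1][:top_k]

    rng = np.random.default_rng(0)
    Z = rng.normal(size=(200, 3))               # shared covariates
    ncrna = rng.normal(size=(200, 1000))        # 1000 candidate ncRNAs
    gene = 1.5 * ncrna[:, 42] + Z @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=200)
    print(screen(ncrna, gene, Z)[:3])           # index 42 should rank first
    ```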
  • Item
    DISSECTING TUMOR CLONALITY IN LIVER CANCER: A PHYLOGENY ANALYSIS USING COMPUTATIONAL AND STATISTICAL TOOLS
(2023) Kacar, Zeynep; Slud, Eric; Levy, Doron; Mathematical Statistics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Liver cancer is a heterogeneous disease characterized by extensive genetic and clonal diversity. Understanding the clonal evolution of liver tumors is crucial for developing effective treatment strategies. This dissertation aims to dissect tumor clonality in liver cancer using computational and statistical tools, with a focus on phylogenetic analysis. Through advancements in defining and assessing phylogenetic clusters, we gain a deeper understanding of the survival disparities and clonal evolution within liver tumors, which can inform the development of tailored treatment strategies and improve patient outcomes. The thesis begins by providing an overview of the sources of heterogeneity in liver cancer and of the data types, from Whole-Exome (WEX) and RNA sequencing (RNA-seq) read-counts by gene to derived quantities such as Copy Number Alterations (CNAs) and Single Nucleotide Variants (SNVs). Various tools for deriving copy numbers are discussed and compared, and the comparison of survival distributions is introduced. The central data analyses of the thesis concern the derivation of distinct clones and clustered phylogeny types from the basic genomic data in three independent cancer cohorts: TCGA-LIHC, TIGER-LC, and NCI-MONGOLIA. The SMASH (Subclone multiplicity allocation and somatic heterogeneity) algorithm is introduced for clonality analysis, followed by a discussion of clustering analysis of nonlinear tumor evolution trees and the construction of phylogenetic trees for the liver cancer cohorts. Drivers of tumor evolution and the immune-cell micro-environment of tumors are also explored. In this research, we employ survival analysis tools to investigate and document survival differences between groups of subjects defined from phylogenetic clusters. Specifically, we introduce the log-rank test and its modifications for generic right-censored survival data, which we then apply to survival follow-up data for the subjects in the studied cohorts, clustered based on their genomic data. The final chapter of this thesis takes a significant step forward by extending an existing methodology for covariate adjustment in the two-sample log-rank test to the K-sample setting, with a specific focus on the already defined phylogeny cluster groups. This extension is not straightforward because the computation of the K-sample test statistic and its asymptotic null distribution do not follow directly from the two-sample case. Using these extended tools, we conduct an illustrative data analysis with real data from the TIGER-LC cohort, which comprises subjects with analyzed and clustered genomic data, leading to defined phylogenetic clusters associated with two different types of liver cancer. By applying the extended methodology to this dataset, we assess and compare the survival curves of the defined clusters.
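    As a reference point for the covariate-adjusted K-sample extension developed in the final chapter, here is a minimal two-sample log-rank statistic written from the standard observed-minus-expected formula (a generic textbook computation, not the thesis's adjusted test; the toy data are made up):
    ```python
    # Two-sample log-rank test for right-censored survival data, from the
    # standard observed-minus-expected-events formula at each event time.
    import numpy as np
    from scipy.stats import chi2

    def logrank(time, event, group):
        """time: follow-up times; event: 1=event, 0=censored; group: 0/1."""
        time, event, group = map(np.asarray, (time, event, group))
        o_minus_e, var = 0.0, 0.0
        for t in np.unique(time[event == 1]):            # distinct event times
            at_risk = time >= t
            n, n1 = at_risk.sum(), (at_risk & (group == 1)).sum()
            d = ((time == t) & (event == 1)).sum()       # events at t, all groups
            d1 = ((time == t) & (event == 1) & (group == 1)).sum()
            o_minus_e += d1 - d * n1 / n                 # observed minus expected
            if n > 1:
                var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        stat = o_minus_e ** 2 / var
        return stat, chi2.sf(stat, df=1)                 # chi-square with 1 df

    t = [5, 8, 12, 14, 20, 21, 25, 30, 33, 40]
    e = [1, 1, 1, 0, 1, 1, 0, 1, 1, 0]
    g = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    print(logrank(t, e, g))   # a large p-value is expected for this tiny example
    ```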
  • Item
    Statistical Network Analysis of High-Dimensional Neuroimaging Data With Complex Topological Structures
(2023) Lu, Tong; Chen, Shuo; Mathematical Statistics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
This dissertation contains three projects that collectively tackle statistical challenges in the field of high-dimensional brain connectome data analysis and enhance our understanding of the intricate workings of the human brain. Project 1 proposes a novel network method for detecting brain-disease-related alterations in voxel-pair-level brain functional connectivity with spatial constraints, thus improving spatial specificity and sensitivity. Its effectiveness is validated through extensive simulations and real data applications in nicotine addiction and schizophrenia studies. Project 2 introduces a multivariate multiple imputation method specifically designed for high-dimensional voxel-level neuroimaging data, based on Bayesian models and Markov chain Monte Carlo processes. On both synthetic data and real neurovascular water exchange data extracted from a neuroimaging dataset in a schizophrenia study, our method shows high imputation accuracy and computational efficiency. Project 3 develops a multi-level network model based on graph combinatorics that captures vector-to-matrix associations between brain structural imaging measures and functional connectomic networks. The validity of the proposed model is justified through extensive simulations and a real structure-function imaging dataset from the UK Biobank. These three projects contribute innovative methodologies and insights that advance neuroimaging data analysis, including improvements in spatial specificity, statistical power, imputation accuracy, and computational efficiency in revealing the brain's complex neurological patterns.
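    A toy Gibbs-style sketch of multiple imputation under a multivariate normal model, illustrating the MCMC idea behind Project 2 (a deliberate simplification under assumed low dimension; the dissertation's high-dimensional Bayesian method is far richer):
    ```python
    # Alternate between (i) drawing missing entries from their conditional
    # normal given the observed entries and current (mu, Sigma), and
    # (ii) re-estimating (mu, Sigma) from the completed data.
    import numpy as np

    def impute_mvn(X, n_iter=50, seed=0):
        rng = np.random.default_rng(seed)
        X = X.copy()
        miss = np.isnan(X)
        X[miss] = np.nanmean(X, axis=0)[np.where(miss)[1]]   # crude column-mean start
        for _ in range(n_iter):
            mu, S = X.mean(axis=0), np.cov(X, rowvar=False)
            for i in range(X.shape[0]):
                m = miss[i]
                if not m.any():
                    continue
                o = ~m
                S_oo_inv = np.linalg.inv(S[np.ix_(o, o)])
                cond_mu = mu[m] + S[np.ix_(m, o)] @ S_oo_inv @ (X[i, o] - mu[o])
                cond_S = S[np.ix_(m, m)] - S[np.ix_(m, o)] @ S_oo_inv @ S[np.ix_(o, m)]
                X[i, m] = rng.multivariate_normal(cond_mu, cond_S)
        return X

    rng = np.random.default_rng(1)
    full = rng.multivariate_normal([0, 0, 0], [[1, .8, .5], [.8, 1, .6], [.5, .6, 1]], 300)
    obs = full.copy()
    obs[rng.random(obs.shape) < 0.1] = np.nan    # 10% missing at random
    print(np.abs(impute_mvn(obs) - full)[np.isnan(obs)].mean())  # mean imputation error
    ```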
  • Item
    Semiparametric Analysis of Multivariate Panel Count Data with an Informative Observation Process
(2023) Chen, Chang; He, Xin; Mathematics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Panel count data and recurrent event data often arise in event history studies. Unlike recurrent event data, which are collected from studies that monitor subjects continuously, panel count data are encountered when subjects are observed only at discrete time points. In such cases, the exact occurrence times of the events are unknown; only the numbers of occurrences between subsequent observation time points are recorded. Statistical analysis of panel count data has been studied based on two stochastic processes: an observation process and a response process that characterizes the occurrences of the events of interest. The first part of the dissertation presents a likelihood-based joint modeling procedure for the regression analysis of univariate panel count data with a dependent observation process. The inference procedure involves estimating equations and an EM algorithm for the estimation of all involved parameters. In the second part, we extend the proposed methods to multivariate panel count data, which arise when a recurrent event study involves several related types of recurrent events. In particular, we present three types of multivariate modeling scenarios and the corresponding inference procedures. A model checking procedure is developed for the proposed univariate model and all three types of multivariate models. Simulation studies indicate that the proposed inference procedures perform well and consistently across various situations. The proposed methods are applied to a skin cancer study with bivariate panel count data on the occurrences of two types of related non-melanoma skin cancers.
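    A toy parametric illustration of the panel count data structure: interval counts from a Poisson process with a log-linear covariate effect, fit by maximum likelihood (the dissertation's semiparametric joint models are far more general; this only fixes the data layout and a simple likelihood, with made-up names and values):
    ```python
    # Counts of events between consecutive observation times, modeled as
    # Poisson with mean = exp(log_lam + beta * x) * interval length.
    import numpy as np
    from scipy.optimize import minimize

    def neg_loglik(params, gaps, counts, x):
        """gaps[i,j]: length of subject i's j-th observation interval;
        counts[i,j]: events recorded in that interval; x[i]: covariate."""
        log_lam, beta = params
        mean = np.exp(log_lam + beta * x)[:, None] * gaps   # Poisson means
        return -(counts * np.log(mean) - mean).sum()        # up to a constant

    rng = np.random.default_rng(0)
    n, k = 300, 4
    gaps = rng.uniform(0.5, 2.0, size=(n, k))    # irregular observation gaps
    x = rng.normal(size=n)
    counts = rng.poisson(np.exp(0.2 + 0.7 * x)[:, None] * gaps)
    fit = minimize(neg_loglik, x0=[0.0, 0.0], args=(gaps, counts, x))
    print(fit.x)                                 # expect roughly (0.2, 0.7)
    ```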
  • Item
    Causal Survival Analysis – Machine Learning Assisted Models: Structural Nested Accelerated Failure Time Model and Threshold Regression
(2022) Chen, Yiming; Lee, Mei-Ling; Mathematical Statistics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Time-varying confounding complicates causal survival analysis when data are collected longitudinally. Traditional survival models that merely adjust for time-dependent covariates yield biased causal conclusions about the intervention effect. Several techniques have been developed to address this challenge; nevertheless, existing methods may lack power and suffer a heavy computational burden on high-dimensional data with a temporally connected nature. The first part of this dissertation focuses on one method that handles time-varying confounding: the Structural Nested Model and its associated G-estimation. Two neural networks (GE-SCORE and GE-MIMIC) are proposed to estimate the Structural Nested Accelerated Failure Time Model. The proposed algorithms provide less biased and individualized estimates of the causal effect of an intervention. The second part explores the causal interpretations and applications of the First-Hitting-Time based Threshold Regression Model using a Wiener process. Moreover, a neural network expansion of this specific type of Threshold Regression (TRNN) is explored for the first time.
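    A minimal sketch of first-hitting-time threshold regression of the general kind the second part studies: a Wiener process starting at a latent level y0 > 0 with negative drift first hits zero at an inverse-Gaussian-distributed time, and ln(y0) and the drift are regressed on a covariate (a standard toy version with the variance parameter fixed at 1; all parameter values and names are illustrative):
    ```python
    import numpy as np
    from scipy.optimize import minimize

    def fht_neg_loglik(params, t, x):
        a0, a1, b0, b1 = params
        y0 = np.exp(a0 + a1 * x)              # latent starting level, kept positive
        mu = b0 + b1 * x                      # drift toward the zero boundary
        # First-hitting-time density of a unit-variance Wiener process:
        logf = (np.log(y0) - 0.5 * np.log(2 * np.pi * t ** 3)
                - (y0 + mu * t) ** 2 / (2 * t))
        return -logf.sum()

    rng = np.random.default_rng(0)
    n = 2000
    x = rng.uniform(-2, 2, size=n)            # keeps the drift strictly negative
    y0 = np.exp(1.0 + 0.3 * x)
    mu = -0.5 + 0.2 * x
    # With mu < 0 the hitting time is inverse Gaussian with mean y0/|mu| and
    # shape y0**2; numpy's wald() draws exactly that distribution.
    t = rng.wald(y0 / np.abs(mu), y0 ** 2)
    fit = minimize(fht_neg_loglik, x0=[0.5, 0.0, -0.3, 0.0], args=(t, x))
    print(fit.x)                              # expect roughly (1.0, 0.3, -0.5, 0.2)
    ```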
  • Item
    Bayesian Methods and Their Application in Neuroimaging Data
    (2022) Ge, Yunjiang; Kedem, Benjamin; Chen, Shuo; Mathematics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
The functional magnetic resonance imaging (fMRI) technique is widely used in the medical field because it allows in vivo investigation of human cognition, emotions, and behaviors at the neural level. One primary objective is to study brain activation, which can be achieved through a conventional two-stage approach: individualized voxel-specific modeling in the first stage and group-level inference in the second. Existing methods generally rely on pre-determined parameters or domain knowledge, which may not properly incorporate the unique features of different studies or cohorts and thus leaves gaps in the inference for activated regions. This dissertation focuses on Bayesian approaches that fill these inferential gaps at all levels while accounting for the varied information carried by the data. Cluster-wise statistical inference is the most widely used technique for fMRI data analyses. It consists of two steps: i) primary thresholding, which excludes less significant voxels by a pre-specified cut-off (e.g., p<0.001); and ii) cluster-wise thresholding, often based on counting the number of intra-cluster voxels that surpass a voxel-level statistical significance threshold. The selection of the primary threshold is critical because it determines both statistical power and false discovery rate. However, in most existing statistical packages, the primary threshold is selected based on prior knowledge (e.g., p<0.001) without considering the information in the data. Thus, in the first project, we propose a data-driven approach to algorithmically select the optimal primary threshold based on an empirical Bayes framework. We evaluate the proposed model using extensive simulation studies and real fMRI data. In the simulation, we show that our method can effectively increase statistical power while controlling the false discovery rate. We then investigate the brain response to the dose effect of chlorpromazine in patients with schizophrenia by analyzing fMRI scans, generating consistent results. In Chapter 3, we focus on controlling the family-wise error rate (FWER) through cluster-level inference. The cluster-extent measure can be sub-optimal in power and false positive error rate because the supra-threshold voxel count neglects voxel-wise significance levels and ignores the dependence between voxels. Based on the information that a cluster carries, we provide a new Integrated Cluster-wise significance Measure (ICM) for cluster-level significance determination in cluster-wise fMRI analysis, integrating cluster extent, voxel-level significance (e.g., p-values), and activation dependence between within-cluster voxels. We develop a computationally efficient strategy for ICM based on probabilistic approximation theories, so the computational load of ICM-based cluster-wise inference (e.g., permutation tests) remains affordable. We validate the proposed method via extensive simulations and then apply it to two fMRI data sets. The results demonstrate that ICM can improve power with well-controlled FWER. The above chapters focus on cluster-extent thresholding, while Bayesian hierarchical models can also efficiently handle high-dimensional neuroimaging data. Existing methods provide voxel-specific and pre-determined region-of-interest (ROI) level inference; however, activation clusters may span multiple ROIs or vary across studies and study cohorts. To provide inference that bridges voxels, unknown activation clusters, targeted regions, and the whole brain, we propose the Dirichlet Process Mixture model with Spatial Constraint (DPMSC) in Chapter 4. The spatial constraint is based on the Euclidean distance between two voxels in brain space. With this constraint added at each iteration of the Markov chain Monte Carlo (MCMC) sampler, DPMSC efficiently removes single-voxel and small noise clusters and yields contiguous clusters whose voxels belong to the same mixture component. Finally, we provide a real data example and simulation studies based on various dataset features.
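    A minimal sketch of the cluster-extent pipeline that this dissertation's first two projects refine, run on a synthetic 2-D slice: primary-threshold a z-map (one-sided p<0.001 is roughly z>3.09), take the largest supra-threshold cluster, and compare it to a permutation-style null of maximum cluster sizes from pure-noise maps (grid size, thresholds, and signal strength are arbitrary choices):
    ```python
    import numpy as np
    from scipy import ndimage

    def max_cluster_size(zmap, z_thresh=3.09):
        labels, n = ndimage.label(zmap > z_thresh)   # connected supra-threshold blobs
        return int(np.bincount(labels.ravel())[1:].max()) if n else 0

    rng = np.random.default_rng(0)
    signal = rng.normal(size=(64, 64))
    signal[20:28, 20:28] += 3.5                      # an 8x8 "activated" patch
    null_max = [max_cluster_size(rng.normal(size=(64, 64))) for _ in range(500)]
    observed = max_cluster_size(signal)
    p = (1 + sum(m >= observed for m in null_max)) / (1 + len(null_max))
    print(observed, p)    # the patch should dwarf the null max-cluster sizes
    ```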
  • Item
    NEW STATISTICAL METHODS FOR HIGH-DIMENSIONAL DATA WITH COMPLEX STRUCTURES
    (2021) Wu, Qiong; Chen, Shuo; Mathematical Statistics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Rapid advances in biomedical technology have made high-dimensional biomedical data with complex and organized structures widely available. However, because true signals are obscured by substantial false-positive noise, and because of the high dimensionality, statistical inference is challenging and raises critical issues of research reproducibility and replicability. Motivated by these needs, this dissertation is devoted to statistical approaches for understanding the latent structures among biomedical objects, as well as improving statistical power and reducing false-positive errors in statistical inference. The first objective of this dissertation is motivated by group-level brain connectome analysis in neuropsychiatric research, with the goal of exhibiting connectivity abnormalities between clinical groups. In Chapter 2, we develop a likelihood-based adaptive dense subgraph discovery (ADSD) procedure to identify connectomic subnetworks (subgraphs) that are systematically associated with brain disorders. We propose a statistical inference procedure leveraging graph properties and combinatorics, and validate the method on a brain fMRI study for schizophrenia research and on synthetic data under various settings. In Chapter 3, we assess genetic effects on brain structural imaging with spatial specificity. In contrast to inference on individual SNP-voxel pairs, we focus on systematic associations between genetic and imaging measurements, which assists the understanding of a polygenic and pleiotropic association structure. Based on voxel-wise genome-wide association analysis (vGWAS), we characterize the polygenic and pleiotropic SNP-voxel association structure using imaging-genetics dense bi-cliques (IGDBs). We develop the estimation procedure and statistical inference framework for the IGDBs with computationally efficient algorithms, and demonstrate the performance of the proposed approach using imaging-genetics data from the Human Connectome Project (HCP). Chapter 4 concerns the analysis of gene co-expression networks (GCNs) for examining gene-gene interactions and learning the underlying complex yet highly organized gene regulatory mechanisms. We propose the interconnected community network (ICN) structure, which allows interactions between genes from different communities and relaxes the constraint imposed by most existing GCN analysis approaches. We develop a computational package to detect the ICN structure based on graph norm shrinkage. ICN detection is illustrated using RNA-seq data from The Cancer Genome Atlas (TCGA) Acute Myeloid Leukemia (AML) study.
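    A minimal greedy peeling sketch for dense subgraph discovery, the combinatorial core that ADSD builds on (this is Charikar's classical 1/2-approximation on a plain adjacency matrix; ADSD's likelihood-based, statistically calibrated formulation is not shown, and the planted-block example is made up):
    ```python
    # Repeatedly remove the minimum-degree node and keep the densest
    # intermediate subgraph, where density = edges / nodes.
    import numpy as np

    def densest_subgraph(A):
        """A: symmetric 0/1 adjacency matrix.  Returns node indices."""
        alive = list(range(len(A)))
        best, best_density = list(alive), A.sum() / (2 * len(A))
        deg = A.sum(axis=1).astype(float)
        while len(alive) > 1:
            i = min(alive, key=lambda v: deg[v])     # peel the min-degree node
            alive.remove(i)
            for j in alive:
                deg[j] -= A[i, j]
            density = sum(deg[j] for j in alive) / (2 * len(alive))
            if density > best_density:
                best, best_density = list(alive), density
        return best

    rng = np.random.default_rng(0)
    n = 60
    A = (rng.random((n, n)) < 0.05).astype(int)           # sparse background
    A[:12, :12] = (rng.random((12, 12)) < 0.7).astype(int)  # planted dense block
    A = np.triu(A, 1); A = A + A.T                        # symmetric, no self-loops
    print(sorted(densest_subgraph(A)))                    # should recover ~nodes 0..11
    ```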
  • Item
    CAUSAL INFERENCE WITH A CONTINUOUS TREATMENT AND OUTCOME: ALTERNATIVE ESTIMATORS FOR PARAMETRIC DOSE-RESPONSE FUNCTIONS WITH APPLICATIONS.
    (2016) Galagate, Douglas; Schafer, Joseph L.; Smith, Paul J.; Mathematical Statistics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Causal inference with a continuous treatment is a relatively under-explored problem. In this dissertation, we adopt the potential outcomes framework. Potential outcomes are responses that would be seen for a unit under all possible treatments. In an observational study where the treatment is continuous, the potential outcomes are an uncountably infinite set indexed by treatment dose. We parameterize this unobservable set as a linear combination of a finite number of basis functions whose coefficients vary across units. This leads to new techniques for estimating the population average dose-response function (ADRF). Some techniques require a model for the treatment assignment given covariates, some require a model for predicting the potential outcomes from covariates, and some require both. We develop these techniques using a framework of estimating functions, compare them to existing methods for continuous treatments, and simulate their performance in a population where the ADRF is linear and the models for the treatment and/or outcomes may be misspecified. We also extend the comparisons to a data set of lottery winners in Massachusetts. Next, we describe the methods and functions in the R package causaldrf using data from the National Medical Expenditure Survey (NMES) and Infant Health and Development Program (IHDP) as examples. Additionally, we analyze the National Growth and Health Study (NGHS) data set and deal with the issue of missing data. Lastly, we discuss future research goals and possible extensions.
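    In symbols (an illustrative rendering of the setup described above, with hypothetical basis functions phi_k), each unit's potential-outcome curve is expanded in a finite basis with unit-specific coefficients, and the ADRF is the population average of those curves:
    ```latex
    % Unit-level dose-response curve and the population ADRF \mu(t):
    Y_i(t) \;=\; \sum_{k=1}^{K} \xi_{ik}\,\phi_k(t),
    \qquad
    \mu(t) \;=\; \mathbb{E}\bigl[Y_i(t)\bigr]
           \;=\; \sum_{k=1}^{K} \mathbb{E}[\xi_{ik}]\,\phi_k(t).
    ```
    With a polynomial basis (phi_1(t) = 1, phi_2(t) = t), a linear ADRF corresponds to K = 2, matching the linear-ADRF population used in the simulations described above.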
  • Item
    MULTIVARIATE METHODS FOR HIGH-THROUGHPUT BIOLOGICAL DATA WITH APPLICATION TO COMPARATIVE GENOMICS
    (2015) Hsiao, Chiao-wen; Corrada Bravo, Héctor; Mathematics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Phenotypic variation in multi-cellular organisms arises as a result of complex gene regulation mechanisms. The modern development of high-throughput technology opens up the possibility of genome-wide interrogation of aspects of these mechanisms across molecular phenotypes. Multivariate statistical methods provide convenient frameworks for modeling and analyzing data obtained from high-throughput experiments probing these complex aspects. This dissertation presents multivariate statistical methods for analyzing data from two specific high-throughput molecular assays: (1) ribosome footprint profiling experiments, and (2) flow cytometry (FCM) data. Ribosome footprint profiling describes the in vivo translation profile of a living cell and offers insights into the process of post-transcriptional gene regulation. Translation efficiency (TE), defined as the ratio of ribosome-protected fragment count to mRNA fragment count, quantifies the rate at which active translation occurs for each gene. We introduce pairedSeq, an empirical covariance shrinkage method for differential testing of translation efficiency from sequencing data. The method draws on variance decomposition techniques from mixed-effect modeling and analysis of variance. Benchmark tests against existing methods reveal that pairedSeq effectively detects signals in genes with high variation in expression measurements across samples due to high co-variability between ribosome occupancy and transcript abundance. In contrast, existing methods tend to mistake genes with negative co-variability for signals, as a result of variance underestimation when negative co-variability is not accounted for. We then present a genome-wide survey of primate species divergence at the translational and post-translational layers of gene regulation. FCM is routinely employed to characterize cellular characteristics such as mRNA and protein expression at the single-cell level. While many computational methods have been developed to identify cell populations in individual FCM samples, very few have addressed how the identified cell populations can be matched across samples for comparative analysis. FlowMap-FR can be used to quantify the similarity between cell populations under scenarios of proportion differences and modest position shifts, and to identify situations in which inappropriate splitting or merging of cell populations has occurred during gating procedures. It has been implemented as a stand-alone R/Bioconductor package that is easily incorporated into current FCM data analysis workflows.
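    A toy computation of TE and a naive paired test on log TE between two conditions (pairedSeq replaces the naive t-test with empirical covariance shrinkage; counts, sample sizes, and the effect placement here are purely illustrative):
    ```python
    # TE = ribosome-protected fragment count / mRNA fragment count, per gene
    # and sample; differential TE is then tested on the paired log ratios.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    genes, n = 1000, 8                    # genes x paired samples per condition
    mrna_a = rng.poisson(200, (genes, n)) + 1
    mrna_b = rng.poisson(200, (genes, n)) + 1
    ribo_a = rng.poisson(100, (genes, n)) + 1
    ribo_b = rng.poisson(100, (genes, n)) + 1
    ribo_b[:50] = rng.poisson(180, (50, n)) + 1   # 50 genes with shifted TE

    log_te_a = np.log(ribo_a / mrna_a)    # log TE per gene and sample
    log_te_b = np.log(ribo_b / mrna_b)
    tstat, pval = stats.ttest_rel(log_te_b, log_te_a, axis=1)
    print((pval[:50] < 0.01).mean(), (pval[50:] < 0.01).mean())  # power vs. FPR
    ```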