Mathematics

Permanent URI for this community: http://hdl.handle.net/1903/2261

Search Results

Now showing 1 - 10 of 126
  • Variable selection and causal discovery methods with application in noncoding RNA regulation of gene expression
    (2024) Ke, Hongjie; Ma, Tianzhou; Mathematics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Noncoding RNAs (ncRNAs), including long noncoding RNAs (lncRNAs) and microRNAs (miRNAs), are critical regulators that control gene expression at multiple levels. Revealing how ncRNAs regulate their target genes in disease-associated pathways will provide mechanistic insights into disease and has potential clinical utility. In this dissertation, we developed novel variable selection and causal discovery methods to study the regulatory relationship between ncRNAs and genes. In Chapter 2, we proposed a novel screening method based on robust partial correlation to identify noncoding RNA regulators of gene expression over the whole genome. In Chapter 3, we developed a computationally efficient two-stage Bayesian Network (BN) learning method to construct ncRNA-gene regulatory networks from transcriptomic data of both coding genes and noncoding RNAs. To accompany the developed BN learning method, we provided a novel analytical platform with a graphical user interface (GUI) covering the entire pipeline of data preprocessing, network construction, module detection, visualization, and downstream analyses. In Chapter 4, we proposed a Bayesian indicator variable selection model with hierarchical structure to uncover how the regulatory mechanism between noncoding RNAs and genes changes across different biological conditions (e.g., cancer stages). In Chapter 5, we discussed potential extensions and future work. This dissertation presents computationally efficient and statistically rigorous methods that can jointly analyze high-dimensional noncoding RNA and gene expression data to investigate their regulatory relationships, which will deepen our understanding of the molecular mechanisms of disease.
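    To make the screening idea of Chapter 2 concrete, below is a minimal sketch of partial-correlation screening; it uses the Spearman correlation of residuals as a simple robust proxy, and the function names and ranking rule are illustrative assumptions rather than the dissertation's actual estimator.

    ```python
    # Illustrative partial-correlation screening sketch (not the dissertation's method).
    import numpy as np
    from scipy import stats

    def partial_spearman(x, y, z):
        """Spearman correlation of x and y after regressing out covariates z."""
        Z = np.column_stack([np.ones(len(x)), z])
        rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]  # residualize x on z
        ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]  # residualize y on z
        return stats.spearmanr(rx, ry)[0]

    def screen_regulators(ncrna, gene, covariates, top_k=100):
        """Rank candidate ncRNA regulators of one gene by |partial correlation|."""
        scores = np.array([abs(partial_spearman(ncrna[:, j], gene, covariates))
                           for j in range(ncrna.shape[1])])
        return np.argsort(scores)[::-1][:top_k]  # indices of the top-ranked ncRNAs
    ```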
  • Advancements in Small Area Estimation Using Hierarchical Bayesian Methods and Complex Survey Data
    (2024) Das, Soumojit; Lahiri, Partha; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    This dissertation addresses critical gaps in the estimation of multidimensional poverty measures for small areas and proposes innovative hierarchical Bayesian estimation techniques for finite population means in small areas. It also explores specialized applications of these methods for survey response variables with multiple categories. The dissertation presents a comprehensive review of relevant literature and methodologies, highlighting the importance of accurate estimation for evidence-based policymaking. In Chapter 2, the focus is on the estimation of multidimensional poverty measures for small areas, filling an essential research gap. Using Bayesian methods, the dissertation demonstrates how multidimensional poverty rates and the relative contributions of different dimensions can be estimated for small areas. The proposed approach can be extended to various definitions of multidimensional poverty, including counting or fuzzy set methods. Chapter 3 introduces a novel hierarchical Bayesian estimation procedure for finite population means in small areas, integrating primary survey data with diverse sources, including social media data. The approach incorporates sample weights and factors influencing the outcome variable to reduce sampling informativeness. It demonstrates reduced sensitivity to model misspecifications and diminishes reliance on assumed models, making it versatile for various estimation challenges. In Chapter 4, the dissertation explores specialized applications for survey response variables with multiple categories, addressing the impact of biased or informative sampling on assumed models. It proposes methods for accommodating survey weights seamlessly within the modeling and estimation processes, conducting a comparative analysis with Multilevel Regression with Poststratification (MRP). The dissertation concludes by summarizing key findings and contributions from each chapter, emphasizing implications for evidence-based policymaking and outlining future research directions.
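    For orientation, the area-level hierarchical model below is the standard starting point for this class of small area methods; it is shown as background only, since the abstract does not give the dissertation's exact specification.

    ```latex
    % A standard area-level hierarchical model (Fay-Herriot type), background only;
    % the dissertation's own specification is not given in this abstract.
    \begin{align*}
      \hat{\theta}_i \mid \theta_i &\sim N(\theta_i,\ \psi_i) && \text{(sampling model)} \\
      \theta_i \mid \beta, \sigma_v^2 &\sim N(x_i^{\top}\beta,\ \sigma_v^2) && \text{(linking model)}, \qquad i = 1, \dots, m,
    \end{align*}
    ```

    where $\hat{\theta}_i$ is the direct survey estimate for area $i$ with known sampling variance $\psi_i$, and priors on $(\beta, \sigma_v^2)$ complete the hierarchical Bayesian specification.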
  • Structured discovery in graphs: Recommender systems and temporal graph analysis
    (2024) Peyman, Sheyda Do'a; Lyzinski, Vince V.; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Graph-valued data arises in numerous diverse scientific fields ranging from sociology, epidemiology, and genomics to neuroscience and economics. For example, sociologists have used graphs to examine the roles of user attributes (gender, class, year) at American colleges and universities through the study of Facebook friendship networks and have studied segregation and homophily in social networks; epidemiologists have recently modeled Human-nCov protein-protein interactions via graphs; and neuroscientists have used graphs to model neuronal connectomes. The structure of graphs, including latent features, the relationships between vertices, and the importance of each vertex, comprises key properties that are central to graph analysis and inference. While it is common to imbue nodes and/or edges with implicitly observed numeric or qualitative features, in this work we consider latent network features that must be estimated from the network topology. The main focus of this text is to find ways of extracting the latent structure in the presence of network anomalies. These anomalies occur in different scenarios, including cases where the graph is subject to an adversarial attack and the anomaly inhibits inference, and cases where detecting the anomaly is itself the key inference task. The former case is explored in the context of vertex nomination information retrieval, where we consider both analytic methods for countering the adversarial noise and the addition of a user-in-the-loop in the retrieval algorithm to counter potential adversarial noise. In the latter case we use graph embedding methods to discover sequential anomalies in network time series.
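    As a sketch of embedding-based anomaly detection in a graph time series, the snippet below embeds each graph in a sequence and flags unusually large consecutive shifts; the embedding choice, alignment step, and threshold are all illustrative assumptions, not the dissertation's actual procedure.

    ```python
    # Illustrative sequential anomaly detection in a time series of graphs.
    import numpy as np

    def spectral_embed(A, d=2):
        """Adjacency spectral embedding: top-d scaled eigenvectors of symmetric A."""
        vals, vecs = np.linalg.eigh(A)
        idx = np.argsort(np.abs(vals))[::-1][:d]       # d largest-magnitude eigenvalues
        return vecs[:, idx] * np.sqrt(np.abs(vals[idx]))

    def flag_anomalies(graphs, d=2, z_thresh=3.0):
        """Flag time points whose embedding moved unusually far from the previous one.

        Note: embeddings are identifiable only up to orthogonal transformation, so a
        real pipeline would Procrustes-align consecutive embeddings first.
        """
        emb = [spectral_embed(A, d) for A in graphs]
        dists = np.array([np.linalg.norm(emb[t + 1] - emb[t])
                          for t in range(len(emb) - 1)])
        z = (dists - dists.mean()) / dists.std()       # crude standardization
        return np.where(z > z_thresh)[0] + 1           # indices of flagged graphs
    ```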
  • STATISTICAL DATA FUSION WITH DENSITY RATIO MODEL AND EXTENSION TO RESIDUAL COHERENCE
    (2024) Zhang, Xuze; Kedem, Benjamin; Mathematical Statistics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Nowadays, the statistical analysis of data from diverse sources has become more prevalent. The Density Ratio Model (DRM) is one of the methods for fusing and analyzing such data. The population distributions of different samples can be estimated based on fused data, which leads to more precise estimates of the probability distributions. These probability distributions are related by assuming that the ratios of their probability density functions (PDFs) follow a parametric form. In previous work, this parametric form was assumed to be the same for all ratios. In Chapter 1, an extension is made to allow this parametric form to vary across ratios. Two methods of determining the parametric form for each ratio are developed, based on an asymptotic test and on the Akaike Information Criterion (AIC). This extended DRM is applied to radon concentrations and pertussis rates to demonstrate the use of the extension in the univariate and multivariate cases, respectively. The above analysis is possible when the data in each sample are independent and identically distributed (IID). However, in many cases statistical analysis is required for time series, in which the data are sequentially dependent. In Chapter 2, an extension is made for the DRM to account for weakly dependent data, which allows us to investigate the structure of multiple time series on the strength of each other. It is shown that the IID assumption can be replaced by proper stationarity, mixing, and moment conditions. This extended DRM is applied to the analysis of air quality data recorded in chronological order. As mentioned above, the DRM is suitable for situations where we investigate a single time series based on multiple alternative ones. These time series are assumed to be mutually independent. However, in time series analysis it is often of interest to detect linear and nonlinear dependence between different time series. In such dependent scenarios, coherence is a common tool to measure the linear dependence between two time series, and residual coherence is used to detect a possible quadratic relationship. In Chapter 3, we extend the notion of residual coherence and develop statistical tests for detecting linear and nonlinear associations between time series. These tests are applied to the analysis of brain functional connectivity data.
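    For reference, the DRM links a reference density $g_0$ to the other samples' densities through parametric tilts. In the form below (written in commonly used notation, which may differ from the dissertation's), the classical model takes a single tilt function $h$ for all ratios; Chapter 1's extension allows $h_i$ to differ across ratios, chosen by an asymptotic test or AIC.

    ```latex
    % The density ratio model in commonly used notation: m samples with densities
    % g_1, ..., g_m are parametrically tilted versions of a reference density g_0.
    \[
      \frac{g_i(x)}{g_0(x)} = \exp\{\alpha_i + \beta_i^{\top} h_i(x)\}, \qquad i = 1, \dots, m.
    \]
    ```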
  • The Shuffling Effect: Vertex Label Error’s Impact on Hypothesis Testing, Classification, and Clustering in Graph Data
    (2024) Saxena, Ayushi; Lyzinski, Vince; Mathematical Statistics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    The increasing prevalence of graph- and network-valued data across various disciplines has prompted significant interest and research in recent years. This dissertation explores the impact of vertex shuffling, or vertex misalignment, on the statistical network inference tasks of hypothesis testing, classification, and clustering. Our focus is within the framework of multiple network inference, where existing methodologies often assume known vertex correspondence across networks. This assumption frequently does not hold in practice. Through theoretical analyses, simulations, and experiments, we aim to reveal the effects of vertex shuffling on different types of performance. Our investigation begins with an examination of two-sample network hypothesis testing, focusing on the decrease in statistical power resulting from vertex shuffling. In this work, our analysis focuses on the random dot product graph and stochastic block model network settings. Subsequent chapters delve into the effects of shuffling on graph classification and clustering, showcasing how misalignment negatively impacts accuracy in categorizing and clustering graphs (and vertices) based on their structural characteristics. Various machine learning algorithms and clustering methodologies are explored, revealing a theme of consistent performance degradation in the presence of vertex shuffling. We also explore how graph matching algorithms can potentially mitigate the effects of vertex misalignment and recover the lost performance. Our findings also highlight the risk of graph matching as a pre-processing tool, as it can induce artificial signal. These findings highlight the difficulties and subtleties of addressing vertex shuffling across multiple network inference tasks and suggest avenues for future research to enhance the robustness of statistical inference methodologies in complex network environments.
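    Concretely, vertex shuffling relabels a graph's vertices by a permutation, replacing the adjacency matrix A with P A P^T for a permutation matrix P. The helper below is a minimal illustrative sketch of this operation (the name `shuffle_vertices` is ours, not the dissertation's).

    ```python
    # A minimal sketch of vertex shuffling: relabel vertices by a random permutation.
    import numpy as np

    def shuffle_vertices(A, rng=None):
        """Return (P A P^T, perm): adjacency matrix A with vertex labels permuted."""
        rng = np.random.default_rng() if rng is None else rng
        perm = rng.permutation(A.shape[0])    # random relabeling of the n vertices
        return A[np.ix_(perm, perm)], perm    # equivalent to P @ A @ P.T
    ```

    Under such shuffling, any multiple-network method that matches vertices across graphs by their labels compares the wrong vertex pairs, which is the source of the power and accuracy loss studied here.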
  • A VARIATIONAL APPROACH TO CLUSTERING WITH LIPSCHITZ DECISION FUNCTIONS
    (2023) Zhou, Xiaoyu; Slud, Eric; Mathematical Statistics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    This dissertation proposes an objective-function-based clustering approach that uses Lipschitz functions to represent the clustering function. We establish mathematical properties, including two optimality conditions and a uniqueness result; statistical properties, including two consistency results; and computational developments. This work is a step forward that builds upon existing work on Lipschitz classifiers, proceeding from classification to clustering and covering additional theoretical and computational aspects. The mathematical content strongly suggests further analysis of the method, and the general objective function might be of independent interest.
  • DISSECTING TUMOR CLONALITY IN LIVER CANCER: A PHYLOGENY ANALYSIS USING COMPUTATIONAL AND STATISTICAL TOOLS
    (2023) Kacar, Zeynep; Slud, Eric; Levy, Doron; Mathematical Statistics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Liver cancer is a heterogeneous disease characterized by extensive genetic and clonal diversity. Understanding the clonal evolution of liver tumors is crucial for developing effective treatment strategies. This dissertation aims to dissect the tumor clonality in liver cancer using computational and statistical tools, with a focus on phylogenetic analysis. Through advancements in defining and assessing phylogenetic clusters, we gain a deeper understanding of the survival disparities and clonal evolution within liver tumors, which can inform the development of tailored treatment strategies and improve patient outcomes. The thesis begins by providing an overview of sources of heterogeneity in liver cancer and of data types, from Whole-Exome (WEX) and RNA sequencing (RNA-seq) read-counts by gene to derived quantities such as Copy Number Alterations (CNAs) and Single Nucleotide Variants (SNVs). Various tools for deriving copy numbers are discussed and compared, and the comparison of survival distributions is discussed. The central data analyses of the thesis concern the derivation of distinct clones and clustered phylogeny types from the basic genomic data in three independent cancer cohorts, TCGA-LIHC, TIGER-LC, and NCI-MONGOLIA. The SMASH (Subclone multiplicity allocation and somatic heterogeneity) algorithm is introduced for clonality analysis, followed by a discussion of clustering analysis of nonlinear tumor evolution trees and the construction of phylogenetic trees for liver cancer cohorts. Identification of drivers of tumor evolution and the immune-cell micro-environment of tumors are also explored. In this research, we employ survival analysis tools to investigate and document survival differences between groups of subjects defined from phylogenetic clusters. Specifically, we introduce the log-rank test and its modifications for generic right-censored survival data, which we then apply to survival follow-up data for the subjects in the studied cohorts, clustered based on their genomic data. The final chapter of this thesis takes a significant step forward by extending an existing methodology for covariate adjustment in the two-sample log-rank test to a K-sample scenario, with a specific focus on the already defined phylogeny cluster groups. This extension is not straightforward because the K-sample test statistic and its asymptotic null distribution do not follow directly from the two-sample case. Using these extended tools, we conduct an illustrative data analysis with real data from the TIGER-LC cohort, which comprises subjects with analyzed and clustered genomic data, leading to defined phylogenetic clusters associated with two different types of liver cancer. By applying the extended methodology to this dataset, we aim to effectively assess and validate the survival curves of the defined clusters.
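    For reference, the two-sample log-rank statistic that the final chapter generalizes has the standard form below, written in the usual counting notation (which may differ from the dissertation's).

    ```latex
    % Standard two-sample log-rank statistic for right-censored data. At the j-th
    % ordered event time, d_{kj} and n_{kj} are the events and numbers at risk in
    % group k, with totals d_j and n_j over both groups.
    \[
      Z = \frac{\sum_{j=1}^{D}\left(d_{1j} - d_j\,\frac{n_{1j}}{n_j}\right)}
               {\sqrt{\sum_{j=1}^{D}\frac{n_{1j}\,n_{2j}\,d_j\,(n_j - d_j)}{n_j^{2}\,(n_j - 1)}}}
          \;\overset{d}{\longrightarrow}\; N(0,1) \quad \text{under } H_0 .
    \]
    ```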
  • NEW STATISTICAL METHODS FOR HIGH-DIMENSIONAL INTERCONNECTED DATA WITH UNIFORM BLOCKS
    (2023) Yang, Yifan; Chen, Shuo; Mathematics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Empirical analyses of high-dimensional biomedical data, including genomics, proteomics, microbiome, and neuroimaging data, consistently reveal the presence of strong modularity in the dependence patterns. In these analyses, highly correlated features often form a few distinct communities or modules, which can be interconnected with each other. While the interconnected community structure has been extensively studied in biomedical research (e.g., gene co-expression networks), its potential to assist in statistical modeling and inference remains largely unexplored. To address this research gap, we propose novel statistical models and methods that capitalize on the prevalent community structures observed in large covariance and precision matrices derived from high-dimensional, interconnected biomedical data. The first objective of this dissertation is to delve into the algebraic properties of the proposed interconnected community structures at the population level. Specifically, this pattern partitions the population covariance matrix into uniform (i.e., equal variances and covariances) blocks; a sketch of this pattern is given after this abstract. To accomplish this objective, we introduce a block Hadamard product representation in Chapter 2, which relies on two lower-dimensional "coordinate" matrices and a pre-specified vector. This representation enables explicit expressions for the square (or any power), determinant, inverse, eigendecomposition, canonical form, and other matrix functions of the original larger-dimensional matrix in terms of these lower-dimensional "coordinate" matrices. Estimating a covariance matrix is central to high-dimensional data analysis. Our second objective is to consistently estimate a large covariance or precision matrix having an interconnected community structure with uniform blocks. In Chapter 3, we derive best unbiased estimators for covariance and precision matrices in closed form and provide theoretical results on their asymptotic properties. Our proposed method improves the accuracy of covariance and precision matrix estimation and demonstrates superior performance compared to competing methods in both simulations and real data analyses. In Chapter 4, our goal is to investigate the effects of alcohol intake (as an exposure) on metabolomic outcome features. However, similar to other omics data, metabolomic outcomes often consist of numerous features that exhibit a structured dependence pattern, such as a co-expression network with interconnected modules. Effectively addressing this dependence structure is crucial for accurate statistical inference and the identification of alcohol-intake-related metabolomic outcomes. Nevertheless, incorporating the structured dependence patterns into multivariate outcome regression models remains difficult for accurate estimation and inference. To bridge this gap, we propose a novel multivariate regression model that accounts for the correlations among outcome features using a network structure composed of interconnected modules. Additionally, we derive closed-form estimators of the regression parameters and provide inference tools. Extensive simulation analysis demonstrates that our approach yields much-improved sensitivity with a well-controlled discovery rate when benchmarked against existing multivariate regression models. Confirmatory factor analysis (CFA) models play a crucial role in revealing underlying latent common factors within sets of correlated variables. However, their implementation often relies on a strong prior theory to categorize variables into distinct classes, which is frequently unavailable (e.g., in omics data analysis scenarios). To address this limitation, in Chapter 5 we propose a novel strategy based on network analysis that allows data-driven discovery to substitute for the missing prior theory. By leveraging the detected interconnected community structure, our approach offers an elegant statistical interpretation and yields closed-form uniformly minimum variance unbiased estimators for all unknown matrices. To evaluate the effectiveness of our proposed estimation procedure, we compare it to conventional numerical methods and thoroughly validate it through extensive Monte Carlo simulations and real-world applications.
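    The sketch below constructs a covariance matrix with the uniform-block pattern described in Chapter 2: within block (k, l), all entries equal a single constant B[k, l], plus a common extra diagonal term a[k] inside each diagonal block. The parameterization and names are illustrative assumptions based on the abstract, not the dissertation's exact representation.

    ```python
    # Illustrative construction of a covariance matrix with uniform blocks.
    import numpy as np

    def uniform_block_cov(B, a, sizes):
        """Assemble a uniform-block covariance matrix.

        B     : (K, K) symmetric array of within/between-block constants
        a     : length-K array of extra diagonal terms, one per block
        sizes : length-K array of block sizes, summing to the full dimension
        """
        K = len(sizes)
        blocks = [[B[k, l] * np.ones((sizes[k], sizes[l])) for l in range(K)]
                  for k in range(K)]
        Sigma = np.block(blocks)                 # constant within every block
        Sigma += np.diag(np.repeat(a, sizes))    # per-block diagonal adjustment
        return Sigma

    # Example: two communities of sizes 3 and 2.
    Sigma = uniform_block_cov(B=np.array([[0.5, 0.1], [0.1, 0.4]]),
                              a=np.array([1.0, 1.5]), sizes=np.array([3, 2]))
    ```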
  • Statistical Network Analysis of High-Dimensional Neuroimaging Data With Complex Topological Structures
    (2023) Lu, Tong; Chen, Shuo; Mathematical Statistics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    This dissertation contains three projects that collectively tackle statistical challenges in the field of high-dimensional brain connectome data analysis and enhance our understanding of the intricate workings of the human brain. Project 1 proposes a novel network method for detecting brain-disease-related alterations in voxel-pair-level brain functional connectivity with spatial constraints, thus improving spatial specificity and sensitivity. Its effectiveness is validated through extensive simulations and real data applications in nicotine addiction and schizophrenia studies. Project 2 introduces a multivariate multiple imputation method specifically designed for high-dimensional voxel-level neuroimaging data, based on Bayesian models and Markov chain Monte Carlo processes. On both synthetic data and real neurovascular water exchange data extracted from a neuroimaging dataset in a schizophrenia study, our method achieves high imputation accuracy and computational efficiency. Project 3 develops a multi-level network model based on graph combinatorics that captures vector-to-matrix associations between brain structural imaging measures and functional connectomic networks. The validity of the proposed model is justified through extensive simulations and a real structure-function imaging dataset from the UK Biobank. These three projects contribute innovative methodologies and insights that advance neuroimaging data analysis, including improvements in spatial specificity, statistical power, imputation accuracy, and computational efficiency when revealing the brain’s complex neurological patterns.
  • Proportional Hazards Model for Right Censored Survival Data with Longitudinal Covariates
    (2023) Shi, Yuyin; Ren, Joan Jian-Jian; Mathematical Statistics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    The proportional hazards model is one of the most widely used tools in analyzing survival data. In medical and epidemiological studies, the interrelationship between the time-to-event variable and longitudinal covariates is often the primary research interest. Thus, joint modeling of survival data and longitudinal data has received much attention in the statistical literature, but it is a considerably difficult problem because the survival time is subject to censoring and the longitudinal covariate process is an unknown stochastic process that is not completely observed. Up to now, all existing works have made parametric or semi-parametric assumptions on the longitudinal covariate process, and the resulting inferences depend critically on the validity of these unverifiable assumptions. This dissertation does not make any parametric or semi-parametric assumptions on the longitudinal covariate process. We use the empirical likelihood method to derive the maximum likelihood estimator (MLE) for the proportional hazards model based on right-censored survival data with longitudinal covariates. A computational algorithm is developed, and our simulation studies show that our MLE performs very well.
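    For reference, the model in question, written in standard notation with a time-dependent covariate, is:

    ```latex
    % Proportional hazards model with a longitudinal (time-dependent) covariate:
    % lambda_0 is an unspecified baseline hazard and Z(t) is the covariate path.
    \[
      \lambda\bigl(t \mid Z(\cdot)\bigr) = \lambda_0(t)\,\exp\bigl\{\beta^{\top} Z(t)\bigr\}.
    \]
    ```

    The difficulty addressed here is that Z(·) is observed only at intermittent time points while the event time is right censored, so the likelihood involves an incompletely observed covariate path.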