Computer Science Theses and Dissertations

Permanent URI for this collection: http://hdl.handle.net/1903/2756

Search Results

Now showing 1 - 3 of 3
  • Some Statistical and Dynamical Models for the Analysis of Microbial Ecosystems and their Genomic Data
    (2019) Muthiah, Senthilkumar; Corrada Bravo, Héctor; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Embedded within their genetic makeup and ecology, microbes harbor unparalleled stories on natural selection, evolution and biomedicine. In modern biology, such stories are elucidated through rigorous interrogation of microbial ecosystems with a variety of theoretical and experimental techniques. These range from abstract, isolated mathematical models to high-resolution sequencing technologies that probe every single nucleotide of a cell's DNA. Inferences thus obtained are markedly sensitive to the unforeseen technical variability introduced during an experiment, and are limited by the tractability and robustness of the models in generating sound hypotheses. We have developed statistical and computational tools to advance statistical inference for microbial genomics by overcoming a subset of technical biases, and have explored certain interesting cases of microbial interactions and their evolution by developing tractable mathematical models. Compositional bias induced by the sequencing machine: a DNA sequencing machine produces only percentage measurements (the fraction of molecules of a given type) of the DNA molecules in its input. When contrasting measurements from different inputs, one therefore obtains confounded inferences on absolute concentrations (molecules per unit volume). We theoretically analyze this compositional bias problem with significant generality, and use the analysis to develop an empirical Bayes approach that solves it under certain assumptions, with particular emphasis on microbial sequencing technologies. Suicidal attributes of prokaryotic adaptive immunity: the recently discovered CRISPR systems provide the first examples of bacterial and archaeal adaptive immune systems operating against invading viruses over ecological time scales. As surprising as their adaptive nature is their ability to induce high rates of host autoimmunity. We theoretically analyze the ecological and evolutionary dynamics of such a costly defense mechanism in simplified models of prokaryote-phage coevolution. We show that by allowing for regulated post-infection activation, CRISPRs can function by exploiting a dual defense strategy of abortive infection and anti-viral resistance. Additional statistical and analytic extensions for some related questions on clustering and multi-resolution analysis also appear.
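The compositional bias described in this abstract can be illustrated with a small numerical sketch. The taxon names and concentrations below are invented for illustration, and the code only demonstrates the confounding itself; it is not the empirical Bayes correction developed in the thesis.

```python
# Toy illustration of compositional bias (all numbers are hypothetical).
# A sequencer reports proportions, so a change in ONE taxon's absolute
# concentration shifts the observed proportions of EVERY taxon.

def to_proportions(abs_conc):
    """Convert absolute concentrations to the fractions a sequencer would report."""
    total = sum(abs_conc.values())
    return {taxon: value / total for taxon, value in abs_conc.items()}

def fold_changes(before, after):
    """Per-taxon ratio after/before."""
    return {taxon: after[taxon] / before[taxon] for taxon in before}

# Only taxon_1 changes in absolute terms (100 -> 400 molecules per unit volume).
sample_a = {"taxon_1": 100.0, "taxon_2": 100.0, "taxon_3": 100.0}
sample_b = {"taxon_1": 400.0, "taxon_2": 100.0, "taxon_3": 100.0}

abs_fc = fold_changes(sample_a, sample_b)
rel_fc = fold_changes(to_proportions(sample_a), to_proportions(sample_b))

for taxon in sample_a:
    print(f"{taxon}: absolute fold change {abs_fc[taxon]:.2f}, "
          f"proportion fold change {rel_fc[taxon]:.2f}")
```

In this toy case the proportion-based fold changes differ from the absolute ones by a single factor shared by all taxa (the ratio of the two samples' total concentrations). Taxa whose absolute concentration did not change therefore appear to decrease.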
  • Anti-Profiles for Anomaly Classification and Regression
    (2015) Dinalankara, Wikum; Bravo, Héctor C; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Anomaly detection is a classical problem in statistical learning with wide-reaching applications in security, networks, genomics and other areas. In this work, we formulate the anomaly classification problem as an extension to the detection problem: how to distinguish between samples from multiple heterogeneous classes that are anomalies relative to a well-defined, homogeneous, normal class. Our formulation of this learning setting arises from studies in cancer genomics, where this problem follows from prognosis and diagnosis applications. Standard binary and multi-class classification schemes are not well suited to the anomaly classification task since they attempt to directly model these highly unstable, heterogeneous classes. In this work, we show that robust classifiers can be obtained by modeling the degree of deviation from the normal class as a stable characteristic of each anomaly class. To do so, we formalize the anomaly classification problem, characterize it statistically and computationally via kernel methods, and propose a class of robust learning methods, anti-profiles, specifically designed for this task. We focus on an open area of research in cancer genomics which motivates this project: the classification of tumors for prognosis and diagnosis. We provide experimental results obtained by applying the anti-profile method to gene expression data. In addition, we extend the anti-profile approach to use kernel functions, and develop a support-vector machine (SVM) based method for classification of anomalies based on their deviation from a stable normal class. We provide experimental results obtained by applying this method to genetic data to classify different stages of tumor progression, and show that this method provides much more stable classifiers than the application of regular classifiers. In addition, we show that this approach can be applied to anomaly classification problems in other application domains. We conclude by developing an SVM for censored survival information and demonstrate that the anti-profile method can produce stable classifiers for modeling clinical outcomes in cancer studies.
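The core idea stated in this abstract, classifying anomalies by their degree of deviation from a stable normal class, can be sketched as follows. This is a minimal illustration under assumptions that are not taken from the thesis: the normal range is a per-feature median ± k·MAD interval, the deviation score is a simple count of out-of-range features, and the data and class names (`mild`, `severe`) are hypothetical. The kernel and SVM extensions mentioned in the abstract are not shown.

```python
from statistics import mean, median

def normal_ranges(normal_samples, k=3.0):
    """Per-feature interval median +/- k*MAD, estimated from normal samples only.
    (The interval rule is an assumption made for this sketch.)"""
    ranges = []
    for values in zip(*normal_samples):
        med = median(values)
        mad = median([abs(v - med) for v in values])
        ranges.append((med - k * mad, med + k * mad))
    return ranges

def deviation_score(sample, ranges):
    """Number of features falling outside the normal range."""
    return sum(1 for v, (lo, hi) in zip(sample, ranges) if v < lo or v > hi)

def classify(sample, ranges, class_means):
    """Assign the anomaly class whose mean deviation score is closest."""
    score = deviation_score(sample, ranges)
    return min(class_means, key=lambda c: abs(score - class_means[c]))

# Hypothetical data: a homogeneous normal class and two anomaly classes.
normal = [[1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 2.9, 4.2], [0.9, 1.9, 3.1, 3.8],
          [1.0, 2.2, 3.0, 4.1], [1.2, 1.8, 3.2, 3.9]]
anomalies = {
    "mild": [[1.0, 2.0, 3.0, 6.0], [1.1, 5.0, 3.1, 4.0]],
    "severe": [[3.0, 0.1, 3.0, 9.0], [0.0, 2.0, 7.0, 8.0]],
}

ranges = normal_ranges(normal)
class_means = {name: mean(deviation_score(s, ranges) for s in samples)
               for name, samples in anomalies.items()}

new_sample = [2.5, 2.0, 5.0, 7.0]
print(classify(new_sample, ranges, class_means))
```

Note that the two `severe` samples deviate in different features, yet share the same deviation score. The sketch models how far each class departs from normal rather than modeling the heterogeneous anomaly classes directly.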
  • NORMALIZATION AND DIFFERENTIAL ABUNDANCE ANALYSIS OF METAGENOMIC BIOMARKER-GENE SURVEYS
    (2015) Paulson, Joseph Nathaniel; Pop, Mihai; Corrada Bravo, Héctor; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    High-throughput technologies such as targeted sequencing of marker genes and whole metagenomic shotgun (WMS) sequencing have provided unprecedented insight into microbial communities and the interactions between their members. Statistical inference on these communities is challenging because it must account for an all-too-common limitation of metagenomic datasets: under-sampling. In this dissertation I present novel and robust methods for normalization and differential abundance testing of marker-gene surveys and whole metagenomic shotgun sequencing experiments. Using these methods I analyze one particular microbial community of interest, gut microbiota associated with diarrhea. One central problem in almost any metagenomic analysis is under-sampling of the microbial community. In both marker-gene surveys and WMS sequencing data, misinterpreting zero-valued counts can bias mean and variance estimates. Even in very deep sequencing surveys, the nature of the “counting experiment” that is a metagenomic analysis can skew representative population estimates for community members. To address this issue, I characterize the biases that sparsity introduces into association testing of various metagenomic experiments. I developed sparsity-aware methods to 1) control for the variability in sequencing depth with a novel normalization algorithm and 2) associate gene abundance with host phenotypes. The central idea in testing associations is to weight zero values of a gene or taxon according to the posterior probability of not being observed due to under-sampling. These methods have broad general applicability in the analysis of large, relatively sparse data sets; they will provide better insight into the biological properties of complex microbial communities and their potential roles in various environmental niches. In applying these methods to previously unexplored ecosystems, I was able to obtain novel insights into the microbial communities of healthy and diseased children from low-income countries. I analyzed 992 children under five years of age from low-income countries, including The Gambia, Mali, Bangladesh, and Kenya. Approximately half of the samples were from children diagnosed with moderate-to-severe diarrhea. In applying the methods developed, we recovered known diarrhea-causing pathogens, including Escherichia/Shigella and Campylobacter species. We also detected previously unknown associations with disease for several bacteria, including Granulicatella species and Streptococcus mitis/pneumoniae groups.
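The idea of weighting a zero count by the posterior probability that it reflects under-sampling can be sketched with Bayes' rule. The model below is an assumption made for illustration, not the dissertation's actual model: a feature is present with prior probability `p_present`; if present, its count is Poisson with mean `depth * rate`; if absent, its count is always zero. The function name and all parameter values are hypothetical.

```python
import math

def posterior_undersampled(depth, rate, p_present):
    """P(feature is present but was missed | observed count is zero).

    Assumed toy model: present features yield Poisson(depth * rate) counts,
    absent features always yield zero.
    """
    p_zero_if_present = math.exp(-depth * rate)  # Poisson probability of a zero
    numerator = p_present * p_zero_if_present
    return numerator / (numerator + (1.0 - p_present))

# Hypothetical values: the same feature observed as zero at three depths.
for depth in (1_000, 5_000, 20_000):
    w = posterior_undersampled(depth, rate=2e-4, p_present=0.5)
    print(f"depth={depth:>6}: weight for 'zero due to under-sampling' = {w:.3f}")
```

Under these assumptions the weight shrinks as sequencing depth grows: a zero in a shallow sample is plausibly a missed observation, while a zero in a deep sample is more likely a true absence.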