Mathematics

Permanent URI for this community: http://hdl.handle.net/1903/2261

Search Results

Now showing 1 - 7 of 7
  • Item
    ALGORITHMS FOR THE ALIGNMENT AND VISUALIZATION OF GENOME MAPPING DATA WITH APPLICATIONS TO STRUCTURAL VARIANT DETECTION
    (2015) Mendelowitz, Lee M.; Pop, Mihai; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Optical mapping and nanocoding are single-molecule restriction mapping systems for interrogating genomic structure at a scale that cannot currently be achieved using DNA sequencing methods. In these mapping experiments, large DNA molecules (approximately 500 kb) are stretched, immobilized or confined, and then digested with a restriction endonuclease that cuts or nicks the DNA at its cognate sequence. The cut/nick sites are then observed through fluorescent microscopy, and machine vision is used to estimate the length of the DNA fragments between consecutive sites. This produces, for each molecule, a barcode-like pattern comprising the ordered list of restriction fragment lengths. Despite the promise of the optical mapping and nanocoding systems, there are few open source tools for working with the data generated by these platforms; most analyses rely on custom in-house pipelines built on proprietary software. In this dissertation we present open source software tools for the alignment and visualization of restriction mapping data. We first present a review of the optical mapping and nanocoding systems and provide an overview of current methods for aligning and assembling consensus restriction maps and their related applications. Next, we present the Maligner software for the alignment of a query restriction pattern to a reference pattern. Alignment is a fundamental problem and the first step in many downstream analyses, such as consensus map assembly or structural variant calling. The Maligner software features both a sensitive dynamic programming implementation and a faster but less sensitive index-based mode of alignment. We compare Maligner to other available tools for the tasks of aligning a sequence contig assembly to a reference optical map and aligning single-molecule maps to a reference. Next, we present a portable data visualization web application for visualizing pairwise alignments of restriction maps. Finally, we present updates to the Maligner software to support partial alignments of single-molecule maps, allowing for the clustering of compatible split-map alignments to identify structural variants.
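The abstract describes aligning an ordered list of restriction fragment lengths from a single molecule to a reference pattern by dynamic programming. The sketch below is a rough illustration of that task, not the Maligner implementation: one query fragment may span several consecutive reference fragments to model missed restriction sites, and the penalty values, function names, and example fragment sizes are illustrative assumptions.

```python
# Minimal dynamic-programming sketch of restriction-map alignment.
# Fragment lengths are in kb; penalties are illustrative, not Maligner's scoring.

def align_restriction_maps(query, reference, max_merge=3, missed_site_penalty=5.0):
    """Best cost of fitting all query fragments inside the reference, where one
    query fragment may absorb up to max_merge consecutive reference fragments
    (each extra fragment absorbed models a missed restriction site)."""
    INF = float("inf")
    n, m = len(query), len(reference)
    # best[i][j]: cost of aligning the first i query fragments so that the
    # alignment ends exactly after reference fragment j
    best = [[INF] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        best[0][j] = 0.0                      # alignment may start anywhere
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for k in range(1, min(max_merge, j) + 1):
                if best[i - 1][j - k] == INF:
                    continue
                ref_span = sum(reference[j - k:j])
                cost = (best[i - 1][j - k]
                        + abs(query[i - 1] - ref_span)
                        + (k - 1) * missed_site_penalty)
                best[i][j] = min(best[i][j], cost)
    return min(best[n][j] for j in range(n, m + 1))

if __name__ == "__main__":
    query = [23.0, 41.5, 15.2]                          # observed molecule
    reference = [10.0, 23.5, 20.0, 21.0, 15.0, 30.0]    # in-silico digest
    print(align_restriction_maps(query, reference))
```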
  • Item
    NORMALIZATION AND DIFFERENTIAL ABUNDANCE ANALYSIS OF METAGENOMIC BIOMARKER-GENE SURVEYS
    (2015) Paulson, Joseph Nathaniel; Pop, Mihai; Corrada Bravo, Héctor; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    High-throughput technologies such as targeted sequencing of marker genes and whole metagenomic shotgun (WMS) sequencing have provided unprecedented insight into microbial communities and the interactions between their members. Statistical inference for these communities is challenging because it must account for an all-too-common limitation of metagenomic datasets: under-sampling. In this dissertation I present novel and robust methods for normalization and differential abundance testing of marker-gene surveys and whole metagenomic shotgun sequencing experiments. Using these methods I analyze one particular microbial community of interest, the gut microbiota associated with diarrhea. One central problem in almost any metagenomic analysis is under-sampling of the microbial community. In both marker-gene surveys and WMS sequencing data, misinterpretation of zero-valued counts can bias mean and variance estimates. Even in very deep sequencing surveys, the nature of the “counting experiment” that is a metagenomic analysis can skew representative population estimates for community members. To address this issue, I characterize the biases that sparsity introduces in association testing across various metagenomic experiments. I developed sparsity-aware methods to 1) control for the variability in sequencing depth with a novel normalization algorithm and 2) associate gene abundance with host phenotypes. The central idea in testing associations is to weight zero values of a gene or taxon according to the posterior probability that the feature was not observed due to under-sampling. These methods have broad applicability in the analysis of large, relatively sparse data sets, and they provide better insight into the biological properties of complex microbial communities and their potential roles in various environmental niches. In applying these methods to previously unexplored ecosystems, I obtained novel insights into the microbial communities of healthy and diseased children from low-income countries. I analyzed 992 children under five years of age from low-income countries, including The Gambia, Mali, Bangladesh, and Kenya. Approximately half of the samples were from children diagnosed with moderate-to-severe diarrhea. Applying the methods developed here, we recovered known diarrhea-causing pathogens, including Escherichia/Shigella and Campylobacter species. We also detected previously unknown associations with disease for several bacteria, including Granulicatella species and Streptococcus mitis/pneumoniae groups.
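The abstract mentions a normalization algorithm that controls for variability in sequencing depth. The sketch below illustrates the general idea of depth normalization by scaling each sample with a quantile-limited count sum rather than the total library size; it is not the dissertation's algorithm, and the 75th-percentile cutoff, target scale, and function name are assumptions made for illustration.

```python
# Illustrative quantile-limited scaling of a taxa-by-sample count matrix.
import numpy as np

def scale_counts(counts, quantile=0.75, target=1000.0):
    """counts: features x samples matrix of raw counts.
    Each sample is divided by the sum of its counts at or below the chosen
    quantile of its non-zero counts, which is less sensitive to a few dominant
    taxa than dividing by the total count."""
    counts = np.asarray(counts, dtype=float)
    scaled = np.empty_like(counts)
    for j in range(counts.shape[1]):
        col = counts[:, j]
        nonzero = col[col > 0]
        cutoff = np.quantile(nonzero, quantile)     # per-sample count cutoff
        factor = col[col <= cutoff].sum()           # depth estimate for sample j
        scaled[:, j] = col / factor * target
    return scaled

if __name__ == "__main__":
    raw = np.array([[10, 100, 0],
                    [0,  40,  5],
                    [3,  900, 2],
                    [7,  60,  1]])
    print(scale_counts(raw))
```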
  • Item
    Applications of parametric and semi-parametric models for longitudinal data analysis
    (2014) Talukder, Hisham; Corrada Bravo, Héctor; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    A wide range of scientific applications involve analyses of longitudinal data. Whether the ordering variable is time or location, careful consideration is needed when applying different statistical tools. One such challenge is to correctly estimate variance components in observed data. In this dissertation, I apply statistical tools to problems involving longitudinal data in the fields of biology, healthcare, and networks. In the second chapter, I apply SSANOVA models to find regions in the genome that have a specific biological trait. We introduce a direct approach to estimating genomic longitudinal data from two different biological groups. Using SSANOVA, we then develop a novel method to estimate the difference between the two groups and find regions (in location or time) where this difference is biologically significant. In the third chapter, we analyze longitudinal network data using an overdispersed Poisson model. We build a yearly network of musical writers over a 42-year period. Using statistical models, we predict network-level topology changes and find covariates that explain these changes. Network-level characteristics used in this chapter include average node degree, clustering coefficient, and network density. We also build a visualization tool using R-Shiny. The fourth chapter uses data partitioning to study the difference in health outcomes between insured and uninsured patients. There is a disparity in health outcomes depending on an individual's type of insurance. The level of risk for an injury is the longitudinal aspect of this dataset. We partition the data into four pre-defined risk categories and evaluate the disparity between insured and uninsured patients using logistic regression models.
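The second chapter compares two biological groups along a genomic coordinate and looks for regions where their difference is significant. The sketch below is only an illustration of that comparison using ordinary smoothing splines, not the SSANOVA model of the dissertation; the group names, smoothing parameter, and simulated signals are assumptions.

```python
# Illustrative comparison of two smoothed longitudinal signals (not SSANOVA).
import numpy as np
from scipy.interpolate import UnivariateSpline

def group_difference(positions, group_a, group_b, smoothing=None):
    """Fit one smoothing spline per group and return the fitted difference
    (group_a - group_b) evaluated at the observed positions."""
    fit_a = UnivariateSpline(positions, group_a, s=smoothing)
    fit_b = UnivariateSpline(positions, group_b, s=smoothing)
    return fit_a(positions) - fit_b(positions)

if __name__ == "__main__":
    x = np.linspace(0, 10, 200)
    a = np.sin(x) + 0.1 * np.random.randn(200)                   # hypothetical group A
    b = np.sin(x) + 0.5 * (x > 5) + 0.1 * np.random.randn(200)   # group B, shifted on x > 5
    diff = group_difference(x, a, b)
    print("largest absolute fitted difference:", np.abs(diff).max())
```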
  • Item
    Data Representation for Learning and Information Fusion in Bioinformatics
    (2013) Rajapakse, Vinodh Nalin; Czaja, Wojciech; Mathematics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    This thesis deals with the rigorous application of nonlinear dimension reduction and data organization techniques to biomedical data analysis. The Laplacian Eigenmaps algorithm is representative of these methods and has been widely applied in manifold learning and related areas. While the asymptotic manifold recovery behavior of such methods has been well characterized, the clustering properties of Laplacian embeddings with finite data have largely been supported by heuristic arguments. We develop a precise bound characterizing cluster structure preservation under Laplacian embeddings. From this foundation, we introduce flexible and mathematically well-founded approaches for information fusion and feature representation. These methods are applied to three substantial case studies in bioinformatics, illustrating their capacity to extract scientifically valuable information from complex data.
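For readers unfamiliar with the Laplacian Eigenmaps algorithm named in the abstract, the sketch below shows the standard construction: build a k-nearest-neighbor graph with heat-kernel weights, form the graph Laplacian, and embed the data with its low-frequency eigenvectors. The neighborhood size, kernel width, and use of the unnormalized Laplacian are illustrative choices, not those made in the thesis.

```python
# Minimal Laplacian Eigenmaps sketch with a heat-kernel kNN graph.
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmaps(X, n_components=2, k=10, sigma=1.0):
    """X: samples x features. Returns the embedding given by the eigenvectors of
    the unnormalized graph Laplacian with the smallest non-zero eigenvalues."""
    n = X.shape[0]
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        neighbors = np.argsort(d2[i])[1:k + 1]                   # skip the point itself
        W[i, neighbors] = np.exp(-d2[i, neighbors] / (2 * sigma ** 2))
    W = np.maximum(W, W.T)                                       # symmetrize the kNN graph
    L = np.diag(W.sum(axis=1)) - W                               # unnormalized Laplacian
    vals, vecs = eigh(L)                                         # eigenvalues in ascending order
    return vecs[:, 1:n_components + 1]                           # drop the constant eigenvector

if __name__ == "__main__":
    X = np.random.rand(100, 5)
    Y = laplacian_eigenmaps(X, n_components=2)
    print(Y.shape)   # (100, 2)
```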
  • Item
    Network Algorithms for Complex Systems with Applications to Non-linear Oscillators and Genome Assembly
    (2013) Schmitt, Karl Robert Bruce; Girvan, Michelle; Zimin, Aleksey; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Network and complex system models are useful for studying a wide range of phenomena, from disease spread to traffic flow. Because of the broad applicability of the framework, it is important to develop effective simulations and algorithms for complex networks. This dissertation presents contributions to two applied problems in this area. First, we study an electro-optical, nonlinear, time-delayed feedback loop commonly used in applications that require a broad range of chaotic behavior. For this system we detail a discrete-time simulation model and explore the model's synchronization behavior under specific coupling conditions. Expanding upon previously published results that investigated changes in feedback strength, we explore how both time delay and nonlinear sensitivity impact synchronization. We also relax the requirement of strictly identical system components to study how synchronization regions are affected when coupled systems have non-identical components (parameters). Last, we allow wider variance in coupling strengths, including strengths unique to each system, to identify a rich synchronization region not previously seen. In our second application, we take a complex-networks approach to improving genome assembly algorithms. One key part of sequencing a genome is solving the orientation problem: finding the relative orientation of each data fragment generated during sequencing. By viewing the genomic data as a network, we can apply standard community-finding techniques and exploit the significantly modular structure of the data. This structure informs the development and application of two new heuristics for solving the orientation problem, based on (A) genetic algorithms and (B) hierarchical clustering. Genetic algorithms allow us to preserve some internal structure while quickly exploring a large solution space. We present studies using a multi-scale genetic algorithm to solve the orientation problem and show that this approach can be used in conjunction with current methods to identify a better solution. Our hierarchical algorithm further exploits the modular structure of the data: by progressively solving and merging sub-problems, we pick optimal 'local' solutions while allowing more global corrections to occur later. Our results show significant improvements over current techniques for both generated and real assembly data.
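To make the orientation problem concrete: each fragment receives an orientation of +1 or -1, and linking evidence between fragments suggests either the same or opposite orientation. The toy genetic algorithm below maximizes the number of satisfied constraints; it is only a sketch of the general technique named in the abstract, not the multi-scale algorithm of the dissertation, and the population size, mutation rate, and example edges are assumptions.

```python
# Toy genetic algorithm for the fragment orientation problem.
import random

def fitness(orientation, edges):
    """Count edges (i, j, sign) satisfied by the orientation assignment."""
    return sum(1 for i, j, sign in edges if orientation[i] * orientation[j] == sign)

def solve_orientation(n_fragments, edges, pop_size=60, generations=200, mut_rate=0.02):
    population = [[random.choice((-1, 1)) for _ in range(n_fragments)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda ind: fitness(ind, edges), reverse=True)
        survivors = population[:pop_size // 2]          # keep the fitter half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, n_fragments)      # one-point crossover
            child = a[:cut] + b[cut:]
            child = [(-g if random.random() < mut_rate else g) for g in child]
            children.append(child)
        population = survivors + children
    return max(population, key=lambda ind: fitness(ind, edges))

if __name__ == "__main__":
    # sign +1: evidence for the same orientation, -1: for opposite orientations
    edges = [(0, 1, 1), (1, 2, -1), (2, 3, 1), (0, 3, -1), (1, 3, -1)]
    best = solve_orientation(4, edges)
    print(best, fitness(best, edges))
```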
  • Item
    Genome Assembly Techniques
    (2011) Marcais, Guillaume; Yorke, James; Kingsford, Carl; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Since the publication of the human genome in 2001, the price and time of DNA sequencing have dropped dramatically. The genomes of many more species have since been sequenced, and genome sequencing is an ever more important tool for biologists. This trend will likely revolutionize biology and medicine in the near future, when the genome sequence of each individual person, instead of a single model human genome, becomes readily accessible. Nevertheless, genome assembly remains a challenging computational problem, even more so with second-generation sequencing technologies, which generate a greater amount of data and make the assembly process more complex. Research into quickly, cheaply, and accurately assembling the increasing amount of DNA sequenced is of great practical importance. In the first part of this thesis, we present two software tools developed to improve genome assemblies. First, Jellyfish is a fast k-mer counter capable of handling large data sets. k-mer frequencies are central to many tasks in genome assembly (e.g., error correction and finding read overlaps) and other studies of the genome (e.g., finding highly repeated sequences such as transposons). Second, Chromosome Builder is a scaffolder and contig placement tool that aims to improve the accuracy of genome assembly. In the second part of this thesis we explore several problems dealing with graphs. Graph theory can be used to solve many computational problems; for example, the genome assembly problem can be represented as finding an Eulerian path in a de Bruijn graph, and the physical interactions between proteins (PPI networks) or between transcription factors and genes (regulatory networks) are naturally expressed as graphs. First, we introduce the concept of "exactly 3-edge-connected" graphs. These graphs have only a remote biological motivation but are interesting in their own right. Second, we study ancestral network reconstruction, which aims to infer the state of ancestral species' biological networks based on the networks of current species.
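The sketch below shows the task Jellyfish performs, counting the occurrences of every length-k substring in a set of reads; it is only a toy illustration, not the Jellyfish implementation, which relies on a memory-efficient, lock-free hash table. Counting canonical k-mers (the lexicographic minimum of a k-mer and its reverse complement) is an assumption made here for the example.

```python
# Toy canonical k-mer counter.
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)

def count_kmers(reads, k):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):       # skip k-mers with ambiguous bases
                counts[canonical(kmer)] += 1
    return counts

if __name__ == "__main__":
    reads = ["ACGTACGTGACG", "TTGACGTACGT"]
    for kmer, count in count_kmers(reads, k=5).most_common(3):
        print(kmer, count)
```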
  • Item
    Detecting and Correcting Errors in Genome Assemblies
    (2010) Subramanian, Poorani; Yorke, James A; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Genome assemblies have various types of deficiencies or misassemblies. This work is aimed at detecting and correcting a type of misassembly that we call a Compression/Expansion (CE) misassembly, in which a section of sequence has been erroneously omitted from or inserted into the assembly. Other types of deficiencies include gaps in the genome sequence. We developed a statistic, called the CE statistic, for identifying Compression/Expansion misassemblies. It is based on examining the placement of mate pairs of reads in the assembly. In addition, we developed an algorithm aimed at closing gaps and at validating and/or correcting CE misassemblies detected by the CE statistic. This algorithm is similar to a shooting algorithm used to solve two-point boundary value problems in differential equations, so we call it the Shooting Method. The Shooting Method finds all possible ways to assemble a local region of the genome contained between two target reads. We use a combination of the CE statistic and the Shooting Method to detect and correct some CE misassemblies and to close gaps in genome assemblies. We tested our techniques on both simulated and real data. Applying them to 22 bacterial draft assemblies for which the finished genome sequence is known, we were able to identify 5 out of 8 real CE misassemblies. We also applied the Shooting Method to a de novo assembly of the Bos taurus genome built from Sanger data, closing 9,863 gaps out of 58,386. This added 8.34 Mbp of sequence to the assembly and resulted in a 7% increase in N50 contig size.
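A sketch in the spirit of the CE statistic described above: compare the mean separation of mate pairs spanning a position against the known library mean, scaled by the standard error. A strongly negative value suggests compression (omitted sequence) and a strongly positive value suggests expansion (inserted sequence). The function name, thresholds, and example numbers are illustrative assumptions, not the dissertation's exact formulation.

```python
# Z-like compression/expansion score from mate-pair separations at one position.
import math

def ce_statistic(observed_spans, library_mean, library_std):
    """observed_spans: separations of mate pairs that span the position of interest."""
    n = len(observed_spans)
    if n == 0:
        return 0.0
    local_mean = sum(observed_spans) / n
    return (local_mean - library_mean) / (library_std / math.sqrt(n))

if __name__ == "__main__":
    spans = [2600, 2550, 2700, 2580, 2620]          # hypothetical spans at one position
    z = ce_statistic(spans, library_mean=3000, library_std=300)
    print(round(z, 2))   # strongly negative: candidate compression misassembly
```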