Computer Science Theses and Dissertations

Permanent URI for this collection: http://hdl.handle.net/1903/2756

Search Results

Now showing 1 - 5 of 5
  • Item
    GENOME ASSEMBLY AND VARIANT DETECTION USING EMERGING SEQUENCING TECHNOLOGIES AND GRAPH BASED METHODS
    (2018) Ghurye, Jay; Pop, Mihai; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    The increased availability of genomic data and the increased ease and lower costs of DNA sequencing have revolutionized biomedical research. One of the critical steps in most bioinformatics analyses is the assembly of the genome sequence of an organism from the data generated by sequencing machines. Despite the long sequences generated by third-generation sequencing technologies (tens of thousands of base pairs), the automated reconstruction of entire genomes remains a formidable computational task. Although long-read technologies help resolve highly repetitive regions, the contigs generated from long-read assembly do not always span a complete chromosome, or even a chromosome arm. Recently, new genomic technologies have been developed that can "bridge" across repeats or other genomic regions that are difficult to sequence or assemble, improving genome assemblies by "scaffolding" together large segments of the genome. The scaffolding problem is vital both in single-genome assembly of large eukaryotic genomes and in metagenomics, where the goal is to simultaneously assemble multiple bacterial genomes present in a sample. First, we describe SALSA2, a method we developed that uses the interaction frequency between any two loci in the genome, obtained with Hi-C technology, to scaffold fragmented eukaryotic genome assemblies into chromosomes. SALSA2 can be applied to either short-read or long-read assemblies to generate highly contiguous and accurate chromosome-level assemblies. Hi-C data are known to introduce small inversion errors into the assembly, so we incorporated the assembly graph into the scaffolding process and used sequence overlap information to correct orientation errors. Next, we present our contributions to metagenomics. We developed MetaCarvel, a scaffolding and variant detection method for metagenomic datasets. Several factors, such as the presence of inter-genomic repeats, coverage ambiguities, and polymorphic regions in the genomes, complicate the task of scaffolding metagenomes. Variant detection is also challenging in metagenomes because the different genomes within these complex samples are not known beforehand. We showed that MetaCarvel generates accurate scaffolds and finds genome-wide variation de novo in metagenomic datasets. Finally, we present EDIT, a tool for clustering millions of DNA sequence fragments originating from the highly conserved 16S rRNA gene in bacteria. We extended the classical Four Russians speedup to banded sequence alignment and showed that our method clusters highly similar sequences efficiently. This method can also be used to remove duplicate or near-duplicate sequences from a dataset. With the increasing amount of data generated by genomic and metagenomic studies using emerging sequencing technologies, our software tools and algorithms address a timely need of the community.
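    The banded alignment at the core of EDIT can be illustrated with a short sketch. Below is a minimal, hedged example of plain banded edit distance in Python: the dynamic program is evaluated only within a fixed band around the diagonal, which is the computation that the Four Russians speedup mentioned above accelerates (the speedup itself is not shown). The function name and parameters are illustrative, not part of the EDIT software.

    ```python
    # Minimal banded edit distance: only cells within `band` of the main
    # diagonal are filled; everything outside the band is treated as
    # unreachable. This is the baseline dynamic program, not the Four
    # Russians-accelerated version used by EDIT.
    def banded_edit_distance(a, b, band):
        INF = float("inf")
        n, m = len(a), len(b)
        if abs(n - m) > band:
            return None  # the distance necessarily exceeds the band
        prev = [j if j <= band else INF for j in range(m + 1)]
        for i in range(1, n + 1):
            curr = [INF] * (m + 1)
            if i <= band:
                curr[0] = i
            for j in range(max(1, i - band), min(m, i + band) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                curr[j] = min(prev[j - 1] + cost,  # match / mismatch
                              prev[j] + 1,         # deletion
                              curr[j - 1] + 1)     # insertion
            prev = curr
        return prev[m] if prev[m] < INF else None

    if __name__ == "__main__":
        # two short fragments differing by one substitution
        print(banded_edit_distance("ACGTACGT", "ACGAACGT", band=2))  # -> 1
    ```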
  • Item
    DATA DRIVEN APPROACHES TO IDENTIFY DETERMINANTS OF HEART DISEASES AND CANCER RESISTANCE
    (2016) Sahu, Avinash Das; Hannenhalli, Sridhar; Ruppin, Eytan; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Cancer and cardiovascular diseases are the leading causes of death worldwide. Caused by systemic genetic and molecular disruptions in cells, these disorders are the manifestation of profound disturbances of normal cellular homeostasis. People suffering from, or at high risk for, these disorders need early diagnosis and personalized therapeutic intervention. Successful implementation of such clinical measures can significantly improve global health. However, development of effective therapies is hindered by the challenges of identifying the genetic and molecular determinants of disease onset; and in cases where therapies already exist, the main challenge is to identify the molecular determinants that drive resistance to those therapies. Due to progress in sequencing technologies, access to large genome-wide biological datasets now extends far beyond a few experimental labs to the global research community. This unprecedented availability of data has revolutionized the capabilities of computational researchers, enabling them to collaboratively address these long-standing problems from many different perspectives. This thesis tackles these two major public health challenges using data-driven approaches. Numerous association studies have been proposed to identify genomic variants that determine disease. However, their clinical utility remains limited by their inability to distinguish causal variants from merely associated variants. In this thesis, we first propose a simple scheme that improves association studies in a supervised fashion and demonstrate its applicability to identifying genomic regulatory variants associated with hypertension. Next, we propose a coupled Bayesian regression approach, eQTeL, which leverages epigenetic data to estimate regulatory and gene-interaction potential and identifies combinations of regulatory genomic variants that explain the variance in gene expression. On human heart data, eQTeL not only explains a significantly greater proportion of expression variance in samples but also predicts gene expression more accurately than other methods. Using simulations, we demonstrate that eQTeL accurately detects causal regulatory SNPs, particularly those with small effect sizes. Using various functional data, we show that SNPs detected by eQTeL are enriched for allele-specific protein binding and histone modifications, potentially disrupt the binding of core cardiac transcription factors, and are spatially proximal to their targets. eQTeL SNPs capture a substantial proportion of the genetic determinants of expression variance, and we estimate that 58% of these SNPs are putatively causal. The challenge of identifying molecular determinants of cancer resistance has so far been addressed only through labor-intensive and costly experimental studies, and in the case of experimental drugs such studies are infeasible. Here we take a fundamentally different, data-driven approach to understanding the evolving landscape of emerging resistance. We introduce a novel class of genetic interactions in cancer termed synthetic rescues (SR): a functional interaction between two genes in which a change in the activity of one vulnerable gene (which may be the target of a cancer drug) is lethal, but subsequently altered activity of its partner rescuer gene restores cell viability. Next we describe a comprehensive computational framework, termed INCISOR, for identifying the SRs underlying cancer resistance. Applying INCISOR to mine The Cancer Genome Atlas (TCGA), a large collection of cancer patient data, we identified the first pan-cancer SR networks, composed of interactions common to many cancer types. We experimentally test and validate a subset of these interactions involving the master regulator gene mTOR. We find that rescuer genes become increasingly activated as breast cancer progresses, testifying to pervasive ongoing rescue processes. We show that SRs can be used to successfully predict patients' survival and response to the majority of current cancer drugs and, importantly, to predict the emergence of drug resistance from the initial tumor biopsy. Our analysis suggests a potential new strategy for enhancing the effectiveness of existing cancer therapies by targeting their rescuer genes to counteract resistance. The thesis provides statistical frameworks that can harness ever-increasing high-throughput genomic data to address challenges in determining the molecular underpinnings of hypertension, cardiovascular disease, and cancer resistance. We discover novel mechanistic molecular insights that will advance progress in early disease prevention and personalized therapeutics. Our analyses shed light on the fundamental biological understanding of gene regulation and interaction, and open up exciting avenues for translational applications in risk prediction and therapeutics.
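    The synthetic rescue (SR) concept above can be made concrete with a toy sketch. Under illustrative assumptions about the data layout (gene activity binarized by quantiles, per-patient survival times), the snippet below asks whether tumors with a lowly active vulnerable gene survive longer when a candidate rescuer gene is highly active. It is not the INCISOR pipeline, which applies a series of molecular and clinical screens; all names and thresholds here are hypothetical.

    ```python
    # Toy screen for a candidate synthetic rescue pair: compare survival of
    # (vulnerable low, rescuer high) tumors against (vulnerable low, rescuer low)
    # tumors. Purely illustrative; not INCISOR's statistical procedure.
    import numpy as np

    def sr_survival_gap(activity, survival, vulnerable, rescuer,
                        low_q=1/3, high_q=2/3):
        """activity: dict gene -> per-tumor activity array;
        survival: per-patient survival times, in the same sample order."""
        v = np.asarray(activity[vulnerable])
        r = np.asarray(activity[rescuer])
        survival = np.asarray(survival)
        v_low = v <= np.quantile(v, low_q)
        rescued = v_low & (r >= np.quantile(r, high_q))
        unrescued = v_low & (r <= np.quantile(r, low_q))
        if rescued.sum() == 0 or unrescued.sum() == 0:
            return None  # not enough tumors in one of the groups
        # positive gap = better survival when the putative rescuer is active
        return survival[rescued].mean() - survival[unrescued].mean()
    ```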
  • Item
    Epiviz: Integrative Visual Analysis Software for Genomics
    (2015) Chelaru, Florin; Corrada Bravo, Hector; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Computational and visual data analysis for genomics has traditionally involved a combination of tools and resources, the most ubiquitous of which are genome browsers, focused mainly on integrative visualization of large numbers of big datasets, and computational environments, focused on data modeling of a small number of moderately sized datasets. Workflows that involve the integration and exploration of multiple heterogeneous data sources, small and large, public and user-specific, have been poorly addressed by these tools. Commonly, the data visualized in these tools are the output of analyses performed in powerful computing environments like R/Bioconductor or Python. Two essential aspects of data analysis, algorithmic analysis and interactive visualization, are usually treated as distinct despite being part of the same exploratory process: in current technologies they are not integrated within one tool; rather, one precedes the other. Recent technological advances in web-based data visualization have made it possible for interactive visualization tools to integrate tightly with powerful algorithmic tools, without being restricted to any one such tool in particular. We introduce Epiviz (http://epiviz.cbcb.umd.edu), an integrative visualization tool that bridges the gap between the two types of tools, simplifying genomic data analysis workflows. Epiviz is the first interactive genomics visualization tool to provide tight-knit integration with computational and statistical modeling and data analysis. We discuss three ways in which Epiviz advances the field of genomic data analysis: 1) it brings code to interactive visualizations at various levels; 2) it takes the first steps toward collaborative data analysis by incorporating user plugins from source control providers and by allowing analysis states to be shared among the scientific community; 3) it combines established analysis features that have never before been available simultaneously in a visualization tool for genomics. Epiviz can be used in multiple branches of genomic data analysis for various types of datasets, of which we detail two: functional genomics data, aligned to a continuous coordinate such as the genome, and metagenomics data, organized according to volatile hierarchical coordinate spaces. We also discuss the security implications of the current design, present performance benchmarks, and outline limitations and future research directions.
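    As a concrete illustration of the first data type mentioned above (functional genomics data aligned to a continuous genomic coordinate), the sketch below shows the kind of region query an interactive browser view could send to a computational backend. This is a generic, hypothetical example, not Epiviz's actual data protocol or the epivizr API.

    ```python
    # Hypothetical sketch: fetch measurements overlapping a queried genomic
    # region from an interval-indexed track (not Epiviz's real interface).
    from dataclasses import dataclass
    import pandas as pd

    @dataclass
    class RegionQuery:
        chromosome: str
        start: int  # 0-based, inclusive
        end: int    # exclusive

    def fetch_measurements(track: pd.DataFrame, query: RegionQuery) -> pd.DataFrame:
        """track columns: chromosome, start, end, value (one row per interval)."""
        overlaps = (
            (track["chromosome"] == query.chromosome)
            & (track["start"] < query.end)
            & (track["end"] > query.start)
        )
        return track.loc[overlaps]

    track = pd.DataFrame({
        "chromosome": ["chr11", "chr11", "chr12"],
        "start": [100, 500, 100],
        "end": [200, 900, 300],
        "value": [0.2, 0.8, 0.5],
    })
    print(fetch_measurements(track, RegionQuery("chr11", 150, 600)))
    ```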
  • Item
    Genome Assembly Techniques
    (2011) Marcais, Guillaume; Yorke, James; Kingsford, Carl; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Since the publication of the human genome in 2001, the cost and time of DNA sequencing have dropped dramatically. The genomes of many more species have since been sequenced, and genome sequencing is an ever more important tool for biologists. This trend will likely revolutionize biology and medicine in the near future, when the genome sequence of each individual person, rather than a single model human genome, becomes readily accessible. Nevertheless, genome assembly remains a challenging computational problem, even more so with second-generation sequencing technologies, which generate a greater amount of data and make the assembly process more complex. Research into assembling the increasing amount of sequenced DNA quickly, cheaply, and accurately is of great practical importance. In the first part of this thesis, we present two software tools developed to improve genome assemblies. First, Jellyfish is a fast k-mer counter capable of handling large data sets. k-mer frequencies are central to many tasks in genome assembly (e.g., error correction and finding read overlaps) and other genomic analyses (e.g., finding highly repeated sequences such as transposons). Second, Chromosome Builder is a scaffolding and contig placement tool that aims to improve the accuracy of genome assembly. In the second part of this thesis we explore several problems dealing with graphs. Graph theory can be used to solve many computational problems. For example, the genome assembly problem can be represented as finding an Eulerian path in a de Bruijn graph. The physical interactions between proteins (PPI networks), or between transcription factors and genes (regulatory networks), are naturally expressed as graphs. First, we introduce the concept of "exactly 3-edge-connected" graphs. These graphs have only a remote biological motivation but are interesting in their own right. Second, we study the reconstruction of ancestral networks, which aims to infer the state of ancestral species' biological networks based on the networks of extant species.
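    The k-mer counting task that Jellyfish addresses can be sketched in a few lines. The version below only illustrates what is being computed: canonical k-mer frequencies across a set of reads. Jellyfish itself achieves this at scale with a parallel, lock-free hash table implemented in C++, which this plain-dictionary sketch does not attempt to reproduce.

    ```python
    # Count canonical k-mers (a k-mer and its reverse complement are counted
    # together) across a collection of reads. Illustrative only; Jellyfish's
    # data structures are far more memory- and cache-efficient.
    from collections import Counter

    COMPLEMENT = str.maketrans("ACGT", "TGCA")

    def canonical(kmer):
        rc = kmer.translate(COMPLEMENT)[::-1]  # reverse complement
        return min(kmer, rc)

    def count_kmers(reads, k):
        counts = Counter()
        for read in reads:
            for i in range(len(read) - k + 1):
                counts[canonical(read[i:i + k])] += 1
        return counts

    print(count_kmers(["ACGTACGT", "CGTACGTA"], k=4).most_common(3))
    ```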
  • Item
    Mathematical modeling of drug resistance and cancer stem cells dynamics
    (2010) Tomasetti, Cristian; Levy, Doron; Dolgopyat, Dmitry; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    In this dissertation we consider the dynamics of drug resistance in cancer and the related issue of the dynamics of cancer stem cells. Our focus is restricted to resistance caused by random genetic point mutations. A very simple system of ordinary differential equations allows us to obtain results comparable to those found in the literature, with one important difference: we show that the amount of resistance generated before the beginning of treatment, and still present at some given time afterward, always depends on the turnover rate, no matter how many drugs are used. Previous work in the literature indicated no dependence on the turnover rate in the single-drug case but a strong dependence in the multi-drug case. We develop a new methodology to estimate the probability of developing drug resistance by the time a tumor is diagnosed and the expected number of drug-resistant cells found at detection, if resistance is present. Our modeling methodology may be seen as more general than previous approaches in the sense that, at least for the wild-type population, we make assumptions only on its averaged behavior (no Markov property, for example). Importantly, the heterogeneity of the cancer population is taken into account. Moreover, in the case of chronic myeloid leukemia (CML), a cancer of the white blood cells, we are able to infer the preferred mode of division of the hematopoietic cancer stem cells, predicting a large shift from asymmetric division to symmetric renewal. We extend our results by relaxing the assumption on the average growth of the tumor, thus going beyond the standard exponential case, and show that our results may also be a good approximation for much more general tumor growth models. Finally, after reviewing the basic modeling assumptions and main results found in the mathematical modeling literature on CML, we formulate a new hypothesis on the effects of the drug Imatinib on leukemic stem cells. Based on this hypothesis, we obtain new insights into the dynamics of the development of drug resistance in CML.
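    A minimal sketch of the kind of ordinary-differential-equation model discussed above is given below: sensitive cells acquire resistance through rare point mutations at division, and both compartments share a birth rate and a death (turnover) rate. The equations and parameter values are generic illustrations, not the dissertation's exact system.

    ```python
    # Two-compartment sketch: sensitive cells n mutate into resistant cells r
    # at rate u per division; b and d are birth and death (turnover) rates.
    # Illustrative parameters only.
    from scipy.integrate import solve_ivp

    b, d, u = 0.15, 0.10, 1e-6  # birth, death, point-mutation rates

    def rhs(t, y):
        n, r = y
        dn = (b * (1 - u) - d) * n
        dr = (b - d) * r + b * u * n
        return [dn, dr]

    sol = solve_ivp(rhs, t_span=(0, 200), y0=[1.0, 0.0])
    n_end, r_end = sol.y[:, -1]
    print(f"sensitive: {n_end:.3g}  resistant: {r_end:.3g}")
    ```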