Computer Science Theses and Dissertations

Permanent URI for this collection: http://hdl.handle.net/1903/2756

Search Results

Now showing 1 - 10 of 37
  • Item
    DATA-DRIVEN ALGORITHMS FOR CHARACTERIZING STRUCTURAL VARIATION IN METAGENOMIC DATA
    (2024) Muralidharan, Harihara Subrahmaniam; Pop, Mihai; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Sequence differences between the strains of bacteria comprising host-associated and environmental microbiota may play a role in community assembly and influence the resilience of microbial communities to disturbances. Tools for characterizing strain-level variation within microbial communities, however, are limited in scope, focusing only on single nucleotide polymorphisms or relying on reference-based analyses that miss complex structural variants. In this thesis, we describe data-driven methods to characterize variation in metagenomic data. In the first part of the thesis, I present our analysis of the structural variants identified from metagenomic whole-genome shotgun sequencing data. I begin by describing the power of assembly graph analysis to detect over 9 million structural variants, such as insertions/deletions, repeat copy-number changes, and mobile elements, in nearly 1,000 metagenomes generated as part of the Human Microbiome Project. Next, I describe Binnacle, a structural-variant-aware binning algorithm. To improve upon the fragmented nature of assemblies, metagenomic binning is performed to cluster contigs that are likely to have originated from the same genome. We show that binning “graph-based” scaffolds, rather than contigs, improves the quality of the bins and captures a broader set of the genes of the genomes being reconstructed. Finally, we present a case study of the microbial mats of Yellowstone National Park. The cyanobacterium Synechococcus is abundant in these mats along a stable temperature gradient from ∼50°C to ∼65°C and plays a key role in fixing carbon and nitrogen. Previous studies have isolated and generated good-quality reference sequences for two major Synechococcus spp., OS-A and OS-B’, which share very high genomic content. Despite the high abundance of the Synechococcus spp., metagenomic assembly of these organisms is challenging due to the large number of rearrangements between them. We explore the genomic diversity of the Synechococcus spp. using reference-genome-reliant assembly and scaffolding. We also highlight that the variants we detect can be used to fingerprint the local biogeography of the hot spring. In the second part of the thesis, I present our analysis of amplicon sequencing data, specifically 16S rRNA gene sequences. I begin by describing SCRAPT (Sample, Cluster, Recruit, AdaPt and iTerate), a fast iterative algorithm for clustering large 16S rRNA gene datasets. We show that SCRAPT produces operational taxonomic units that are less fragmented than those of popular tools such as UCLUST, CD-HIT, and DNACLUST, while running orders of magnitude faster than existing methods. Finally, we study the impact of transitive annotation on taxonomic classifiers. Taxonomic labels are assigned using machine learning algorithms trained to recognize individual taxonomic groups based on training sequences with known taxonomic labels. Ideally, the training data should rely on experimentally verified, formal taxonomic labels; however, the labels associated with sequences in biological databases are most commonly the result of computational predictions, a practice known as “transitive annotation.” We demonstrate that even a few computationally generated training data points can significantly skew the output of a classifier, to the point where entire regions of the taxonomic space can be disturbed. We also discuss key factors that affect the resilience of classifiers to transitively annotated training data and propose best practices to avoid the artifacts described in this thesis.
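    To make the iterative clustering idea concrete, here is a minimal sketch of the sample-cluster-recruit loop that SCRAPT's name describes. The greedy centroid clustering and the k-mer Jaccard similarity below are illustrative stand-ins for SCRAPT's actual sequence comparison; all names, thresholds, and the adaptation step are simplified or omitted.

```python
# Illustrative sketch of an iterative sample-cluster-recruit loop in the
# spirit of SCRAPT (Sample, Cluster, Recruit, AdaPt and iTerate).
# The similarity measure and thresholds are placeholders, not SCRAPT's own.
import random

def kmer_set(seq, k=8):
    """k-mer set of a sequence, used as a cheap similarity proxy."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similar(a, b, threshold=0.8, k=8):
    """Jaccard similarity of k-mer sets as a stand-in for alignment identity."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    return len(ka & kb) / max(1, len(ka | kb)) >= threshold

def iterative_cluster(reads, sample_frac=0.01, max_iters=10):
    remaining = list(reads)
    clusters = []  # list of (centroid, members)
    for _ in range(max_iters):
        if not remaining:
            break
        # 1. Sample a small random subset of the still-unclustered reads.
        sample = random.sample(remaining,
                               max(1, int(len(remaining) * sample_frac)))
        # 2. Cluster the sample greedily around centroids.
        centroids = []
        for s in sample:
            if not any(similar(s, c) for c in centroids):
                centroids.append(s)
        # 3. Recruit all remaining reads to the new centroids and remove them.
        for c in centroids:
            members = [r for r in remaining if similar(r, c)]
            if members:
                clusters.append((c, members))
                recruited = set(map(id, members))
                remaining = [r for r in remaining if id(r) not in recruited]
    return clusters, remaining
```

    Because each iteration only clusters a small sample and recruits against its centroids, the expensive all-versus-all comparison is never performed on the full dataset, which is the source of the speedup the abstract describes.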
  • Item
    Analyzing and indexing huge reference sequence collections
    (2023) Fan, Jason; Patro, Rob; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Recent advancements in genome-scale assays and high-throughput sequencing have made systematic measurement of model organisms both accessible and abundant. As a result, novel algorithms that exploit similarities across multiple samples or compare measurements against multiple reference organisms have been designed to improve analyses and gain new insights. However, such models and algorithms can be difficult to apply in practice. Furthermore, analysis of high-throughput sequencing data across multiple samples and multiple reference genomic sequences can be prohibitively costly in terms of space and time. In three parts, this dissertation investigates novel computational techniques that improve analyses at various scales. In Part I, I present two general matrix-factorization algorithms designed to analyze and compare biological measurements of related species that can be summarized as networks. In Part II, I present methods that improve analyses of high-throughput sequencing data. The first method, SCALPELSIG, reduces the computational burden of applying mutational signature analysis in resource-limited settings; the second, a derivation of perplexity for gene and transcript expression estimation models, enables effective model selection in experimental RNA-seq data where ground truth is absent. In Part III, I tackle the difficulties of indexing and analyzing huge collections of reference sequences. I introduce the spectrum preserving tiling (SPT), a new computational and mathematical abstraction. Mathematically, the SPT explicitly relates past work on compactly representing k-mer sets (namely, the compacted de Bruijn graph and recent derivations of spectrum preserving string sets) to the indexing of k-mer positions and metadata in reference sequences. Computationally, the SPT makes possible an entire class of efficient and modular k-mer indexes. I introduce a pair of indexing schemes designed, respectively, to support rapid locate queries and k-mer "color" queries in small space. In the final chapter of this dissertation, I show how these modular indexes can be effectively and generically implemented.
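    As a rough illustration of the SPT abstraction, the toy sketch below builds a two-level locate index: each distinct k-mer is stored once, at an offset within a "tile", and each tile records where it occurs in the reference collection. It assumes every k-mer occurs in exactly one tile and that each reference is an exact tiling with (k-1)-character overlaps between consecutive tiles; the succinct data structures, reverse complements, and color queries of the actual indexes are omitted.

```python
# Toy two-level k-mer locate index in the spirit of a spectrum preserving
# tiling. Data layout and names are simplifications for illustration.

def build_index(references, tiles, k):
    """references: {name: seq}; tiles: strings that together contain every
    k-mer of the references exactly once (the SPT assumption)."""
    kmer_to_tile = {}  # k-mer -> (tile_id, offset within tile)
    for t_id, tile in enumerate(tiles):
        for off in range(len(tile) - k + 1):
            kmer_to_tile[tile[off:off + k]] = (t_id, off)
    tile_occs = {t_id: [] for t_id in range(len(tiles))}  # tile -> [(ref, start)]
    for name, seq in references.items():
        pos = 0
        while pos <= len(seq) - k:
            t_id, off = kmer_to_tile[seq[pos:pos + k]]
            tile_occs[t_id].append((name, pos - off))
            # Jump to the first k-mer of the next tile occurrence
            # (consecutive tiles overlap by k-1 characters).
            pos += len(tiles[t_id]) - off - k + 1
    return kmer_to_tile, tile_occs

def locate(kmer, kmer_to_tile, tile_occs):
    """All (reference, position) occurrences of a k-mer: one dictionary
    lookup, then an offset shift per tile occurrence."""
    if kmer not in kmer_to_tile:
        return []
    t_id, off = kmer_to_tile[kmer]
    return [(ref, start + off) for ref, start in tile_occs[t_id]]
```

    The point of the abstraction is visible even in this sketch: k-mer-level queries reduce to tile-level bookkeeping, so any compact k-mer dictionary can be swapped in for `kmer_to_tile` without touching the occurrence machinery.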
  • Item
    Algorithms for scalable and efficient population genomics and metagenomics
    (2022) Javkar, Kiran Gajanan; Pop, Mihai; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Microbes strongly impact human health and the ecosystems of which they are a part. Rapid improvements and decreasing costs in sequencing technologies have revolutionized the field of genomics and enabled important insights into microbial genome biology and microbiomes. However, new tools and approaches are needed to facilitate the efficient analysis of large sets of genomes and to better associate genomic features with phenotypic characteristics. Here, we built and utilized several tools for large-scale whole-genome analysis of microbial characteristics, such as antimicrobial resistance and pathogenicity, that are important for human health. Chapters 2 and 3 demonstrate the needs and challenges of population genomics in associating antimicrobial resistance with genomic features. Our results highlight important limitations of reference-database-driven analysis for genotype-phenotype association studies and demonstrate the utility of whole-genome population genomics in uncovering novel genomic factors associated with antimicrobial resistance. Chapter 4 describes PRAWNS, a fast and scalable bioinformatics tool that generates compact pan-genomic features. Existing approaches are unable to meet the needs of large-scale whole-genome analyses, either because of scalability limitations or because the genomic features they generate cannot support a thorough whole-genome assessment. We demonstrate that PRAWNS scales to thousands of genomes and provides a concise collection of genomic features that support downstream analyses. In Chapter 5, we assess whether the combination of long- and short-read sequencing can expedite the accurate reconstruction of a pathogen genome from a microbial community. We describe the challenges of pathogen detection in current foodborne-illness outbreak monitoring. Our results show that the recovery of a pathogen genome can be accelerated by combining long- and short-read sequencing after limited culturing of the microbial community. We evaluated several popular genome assembly approaches and identified areas for improvement. In Chapter 6, we describe SIMILE, a fast and scalable bioinformatics tool that enables the detection of genomic regions shared among several assembled metagenomes. In metagenomics, microbial communities are sequenced directly, without culturing. Although metagenomics has furthered our understanding of the microbiome, comparing metagenomic samples is extremely difficult. We describe the need for, and challenges of, comparing many metagenomic samples and present an approach that facilitates large-scale metagenomic comparisons.
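    One way to picture the downstream use of compact pan-genomic features is a presence/absence matrix scored against a binary phenotype such as antimicrobial resistance. The sketch below is a generic association test under that framing; feature extraction (what PRAWNS provides) is assumed already done, and this is not the statistical analysis used in the thesis.

```python
# Generic genotype-phenotype association over pan-genomic features:
# score each feature's presence/absence against a binary resistance
# phenotype with Fisher's exact test. Illustrative, not PRAWNS itself.
from scipy.stats import fisher_exact

def associate(genomes, feature_presence, resistant):
    """genomes: set of all genome ids; feature_presence: {feature: set of
    genome ids carrying it}; resistant: set of genome ids with the phenotype."""
    results = {}
    for feat, carriers in feature_presence.items():
        a = len(carriers & resistant)              # present, resistant
        b = len(carriers - resistant)              # present, susceptible
        c = len(resistant - carriers)              # absent, resistant
        d = len(genomes - carriers - resistant)    # absent, susceptible
        _, p = fisher_exact([[a, b], [c, d]])
        results[feat] = p
    return sorted(results.items(), key=lambda kv: kv[1])  # smallest p first
```

    The scalability pressure the abstract describes comes from the size of this matrix: with thousands of genomes, the feature set must stay compact for such tests (and any multiple-testing correction over them) to remain tractable.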
  • Item
    Predicting Cancer Prognosis and Drug Response from the Tumor Microbiome
    (2021) Hermida, Leandro Cruz; Ruppin, Eytan; Patro, Robert; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Tumor gene expression is predictive of patient prognosis in some cancers. However, RNA-seq and whole-genome sequencing data contain not only reads from host tumor and normal tissue, but also reads from the tumor microbiome, which can be used to infer the microbial abundances in each tumor. Here, we show that tumor microbial abundances, alone or in combination with tumor gene expression data, can predict cancer prognosis and drug response to some extent: microbial abundances are significantly less predictive of prognosis than gene expression although, remarkably, they are similarly predictive of drug response, albeit mostly in different cancer-drug combinations. Thus, it appears possible to leverage existing sequencing technology, or develop new protocols, to obtain additional, non-redundant information about prognosis and drug response from RNA-seq and whole-genome sequencing experiments beyond what gene expression or mutation data alone provide.
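    A schematic of this comparison, assuming synthetic placeholder data and generic models rather than anything from the study: microbial abundances and gene expression are scored alone and concatenated, with cross-validated AUC as the common yardstick.

```python
# Schematic of scoring feature sets alone vs. combined for a binary
# drug-response label. Data are random placeholders; the model and
# preprocessing are generic stand-ins, not the study's pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def cv_auc(X, y):
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
    return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

rng = np.random.default_rng(0)
n = 120
microbes = np.log1p(rng.poisson(5.0, size=(n, 50)))   # microbial abundances
expression = rng.normal(size=(n, 200))                # tumor gene expression
response = rng.integers(0, 2, size=n)                 # drug response labels

for name, X in [("microbiome", microbes),
                ("expression", expression),
                ("combined", np.hstack([microbes, expression]))]:
    print(name, round(cv_auc(X, response), 3))
```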
  • Item
    OPTIMIZING THE ACCURACY OF LIGHTWEIGHT METHODS FOR SHORT READ ALIGNMENT AND QUANTIFICATION
    (2021) Zakeri, Mohsen; Patro, Rob; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    The analysis of high-throughput sequencing (HTS) data involves a number of computational steps, ranging from transcriptome assembly and the mapping or alignment of reads to existing or assembled sequences, to estimating the abundance of sequenced molecules, performing differential or comparative analyses between samples, and even inferring dynamics of interest from snapshot data. Many methods have been developed for these tasks, offering various trade-offs between accuracy and speed, since accuracy and robustness typically come at the expense of speed and vice versa. In this work, I focus on the problems of alignment and quantification of RNA-seq data and review different aspects of the available methods for these problems. I explore finding a reasonable balance between these competing goals and introduce methods that provide accurate results without sacrificing speed. Alignment of sequencing reads to known reference sequences is a challenging computational step in the RNA-seq pipeline, mainly because of the large size of the sample data and reference sequences and the highly repetitive nature of those sequences. Recently, the concept of lightweight alignment was introduced to accelerate the mapping step of abundance estimation. I collaborated with my colleagues to explore some of the shortcomings of lightweight alignment methods and to address them with a new approach called selective alignment. Moreover, we introduce an aligner, Puffaligner, which benefits from both the indexing approach of the Pufferfish index and selective alignment to produce accurate alignments in a short amount of time compared to other popular aligners. To improve the speed of RNA-seq quantification given a collection of alignments, some tools group fragments (reads) into equivalence classes, which are sets of fragments compatible with the same subset of reference sequences. Summarizing the fragments into equivalence classes factorizes the likelihood function being optimized and speeds up the typical optimization algorithms deployed. I explore how this factorization affects the accuracy of abundance estimates and propose a new factorization approach that demonstrates higher fidelity to the non-approximate model. Finally, estimating the posterior distribution of transcript expression is a crucial step in finding robust and reliable estimates of transcript abundance in the presence of high levels of multi-mapping. To assess the accuracy of their point estimates, quantification tools generate inferential replicates using techniques such as bootstrap sampling and Gibbs sampling. The utility of inferential replicates has been demonstrated in various downstream RNA-seq applications, e.g., differential expression analysis. I explore how sampling from both observed and unobserved data points (reads) improves the accuracy of bootstrap sampling. I demonstrate the utility of this approach in estimating allelic expression with RNA-seq reads, where the absence of uniquely mapping reads to reference transcripts is a major obstacle to calculating robust estimates.
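    To illustrate the factorization, here is a minimal textbook-style sketch of equivalence-class-based EM for abundance estimation: reads compatible with the same set of transcripts collapse into one weighted class, and the E-step iterates over classes rather than individual reads. The richer factorizations and inferential-replicate machinery studied in the thesis are omitted.

```python
# Minimal equivalence-class EM for transcript abundance estimation:
# collapse reads into weighted classes, then iterate E/M steps over the
# classes. A textbook sketch, not the thesis's refined factorization.
from collections import Counter

def build_classes(read_compat):
    """read_compat: iterable of frozensets of compatible transcript ids.
    Returns {transcript set: number of reads in that class}."""
    return Counter(read_compat)

def em(classes, transcripts, iters=100):
    eta = {t: 1.0 / len(transcripts) for t in transcripts}  # abundances
    for _ in range(iters):
        counts = {t: 0.0 for t in transcripts}
        for cls, n in classes.items():
            z = sum(eta[t] for t in cls)
            for t in cls:
                counts[t] += n * eta[t] / z   # E-step: fractional assignment
        total = sum(counts.values())
        eta = {t: c / total for t, c in counts.items()}  # M-step: renormalize
    return eta

reads = [frozenset({"t1"}), frozenset({"t1", "t2"}), frozenset({"t2", "t3"}),
         frozenset({"t1", "t2"}), frozenset({"t3"})]
abundances = em(build_classes(reads), {"t1", "t2", "t3"})
```

    The speedup comes from the inner loop running over distinct classes (often thousands) instead of individual reads (often tens of millions); the accuracy question the thesis studies is what information is lost in that collapse.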
  • Item
    Developing computational tools for studying cancer metabolism and genomics
    (2021) Katzir-Sheratzki, Rotem; Ruppin, Eytan; Reggia, James; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    The interplay between different genomic and epigenomic alterations leads to different prognoses in cancer patients. Advances in high-throughput technologies, like gene expression profiling, next-generation sequencing, proteomics, and fluxomics, have enabled detailed molecular characterization of various tumors, yet studying this interplay is a complex computational problem. Here we set out to develop computational approaches to identify and study emerging challenges in cancer metabolism and genomics. We focus on three research questions, addressed by different computational approaches: (1) What is the set of metabolic interactions in cancer metabolism? To this end, we generated a computational framework that quantitatively predicts synthetic dosage lethal (SDL) interactions in human metabolism via a new algorithmic modeling approach. SDLs offer a promising way to selectively kill cancer cells by targeting the SDL partners of activated oncogenes in tumors, which are often difficult to target directly. (2) What is the landscape of metabolic regulation in breast cancer? To this end, we established a new framework that utilizes different data types to perform multi-omics data integration and flux prediction, by incorporating machine learning techniques with Genome-Scale Metabolic Modeling (GSMM). This enabled us to study the regulation of a breast cancer cell line under different growth conditions from multiple omics data. (3) What is the power of somatic mutations derived from RNA in estimating the tumor mutational burden? Here we developed a new tool to detect somatic mutations from RNA sequencing data without a matched normal sample. To this end, we developed a machine learning pipeline that takes as input a list of single nucleotide variants and classifies them as either somatic or germline, based on read-level features as well as position-specific variant statistics and common germline databases. We showed that detecting somatic mutations directly from RNA enables the identification of expressed mutations and therefore represents a more relevant metric for estimating the tumor mutational burden, which is significantly associated with patient survival. In sum, my work has focused on developing computational methods to tackle different research questions in cancer metabolism and genomics, utilizing various types of omics data and a variety of computational approaches. These methods provide new solutions to important computational challenges, and their applications help generate promising leads for cancer research and can be utilized in many future analyses of novel and existing datasets.
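    A schematic of the somatic-versus-germline classification step in question (3), assuming illustrative features (variant allele fraction, read depth, base quality, membership in a common germline database such as gnomAD) and a generic model rather than the pipeline's actual features and classifier:

```python
# Schematic somatic-vs-germline variant classifier: per-variant read-level
# features plus germline-database membership, fed to a supervised model.
# Features, labels, and model choice are illustrative placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# toy feature rows: [variant allele fraction, read depth,
#                    mean base quality, in_germline_db (0/1)]
X_train = np.array([
    [0.12, 80, 34.0, 0],   # subclonal VAF, not in germline DB -> somatic-like
    [0.49, 60, 36.0, 1],   # ~50% VAF, known germline variant
    [0.08, 120, 30.0, 0],
    [0.51, 55, 35.0, 1],
])
y_train = np.array([1, 0, 1, 0])  # 1 = somatic, 0 = germline

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
# probability that a new RNA-derived variant is somatic
p_somatic = clf.predict_proba([[0.10, 90, 33.0, 0]])[0, 1]
```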
  • Item
    Data-driven algorithms for characterizing microbial communities
    (2021) Shah, Nidhi; Pop, Mihai; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Complex microbial communities play a crucial role in environmental and human health. Traditionally, microbes have been studied by isolating and culturing them, which misses organisms that cannot grow under standard laboratory conditions and loses information about microbe-microbe interactions. With affordable high-throughput sequencing, a new field called metagenomics has emerged that studies the genomic content of a microbial community as a whole. Metagenomics enables researchers to characterize complex microbial communities; however, many computational challenges remain in the downstream analyses of large sequencing datasets. Here, we explore some fundamental problems in metagenomics and present simple algorithms, together with the open-source software tools that implement them. In the first section, we focus on using a reference database of known organisms (and genomic segments within them) to taxonomically classify sequences and estimate the abundances of taxa in a metagenomic sample. We developed a “BLAST outlier detection” algorithm that identifies significant outliers within database search results. We extended this method and developed ATLAS, which uses significant database hits to group sequences in the database into partitions. These partitions capture the extent of ambiguity within the classification of a sample. Besides taxonomically classifying sequences, we also explored the problem of taxonomic abundance profiling, i.e., estimating the abundance of different species in the community. We describe TIPP2, a marker-gene-based abundance profiling method that combines phylogenetic placement with statistical techniques to control classification accuracy. TIPP2 includes an updated set of reference packages and several algorithmic improvements over the original TIPP method. Next, we explore how to reconstruct genomes from metagenomic samples. Despite advances in metagenome assembly algorithms, assembling reads into complete genomes remains computationally challenging because of repeated sequences within and among genomes, uneven abundances of organisms, sequencing errors, and strain-level variation. To improve upon fragmented assemblies, researchers use a strategy called binning, which involves clustering together genomic fragments that likely originate from an individual organism. We describe Binnacle, a tool that explicitly accounts for scaffold information in binning. We describe novel algorithms for estimating scaffold-level depth of coverage and show that variation-aware scaffolders further improve the completeness and quality of the resulting metagenomic bins. Finally, we explore how to organize the enormous sets of sequence data generated through the surge of metagenomic studies. There have been recent efforts to organize and document the genes found in microbial communities in “microbial gene catalogs”. Although gene catalogs are commonly used, they have not been critically evaluated for their effectiveness as a basis for metagenomic analyses. We investigated one such catalog, focusing on both the approach used to construct it and its effectiveness when used as a reference for microbiome studies. Our results highlight important limitations of the approach used to construct the catalog and call into question the broad usefulness of gene catalogs. We also recommend best practices for the construction and use of gene catalogs in microbiome studies and highlight opportunities for future research. With the increasing amount of data being generated by metagenomic studies, we believe our ideas, algorithms, and software tools are well timed to the needs of the community.
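    One small ingredient of scaffold-aware binning can be sketched directly: collapsing per-contig read depths into a single scaffold-level depth for use as a binning feature. The length-weighted average below is a deliberate simplification of Binnacle's actual variation-aware estimator.

```python
# Length-weighted scaffold depth from per-contig depths: a simplified
# stand-in for Binnacle's scaffold-level coverage estimation.
def scaffold_depth(contigs):
    """contigs: list of (length, mean_depth) for the contigs on one scaffold.
    Returns the length-weighted mean depth of the scaffold."""
    total_len = sum(length for length, _ in contigs)
    return sum(length * depth for length, depth in contigs) / total_len

# e.g. three contigs joined into one scaffold
print(scaffold_depth([(12000, 14.2), (3500, 15.1), (8000, 13.8)]))  # ~14.2
```

    Binning then clusters scaffolds by such depth profiles (across samples) together with sequence composition, which is why a robust scaffold-level estimate matters more than any individual contig's depth.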
  • Item
    Computational methods for the identification of mutation signatures and intracellular microbes in cancer
    (2021) Robinson, Wells; Leiserson, Mark D.M.; Ruppin, Eytan; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Cancer is the second leading cause of death in the United States behind heart disease, killing ~600,000 Americans per year. Technological advances have lowered the cost of sequencing a tumor genome even faster than Moore’s law would have predicted. However, specialized computational techniques are required to effectively analyze this genomic data. In this dissertation, we present two such computational approaches that address key challenges in computational cancer biology. Given the importance of reproducibility in biomedical research, we provide publicly available workflows for reproducing the results of our computational approaches. In the first part of this thesis, we present a novel approach for extracting mutation signatures from matrices of somatic mutations. One computational challenge in extracting mutation signatures is the relatively small number of mutations in each tumor compared with the relatively large number of distinct signatures, which can be mathematically similar to one another. To help address this challenge, we apply ideas from the field of topic modeling to develop the first mutation signature model that can incorporate known tumor covariates, the Tumor Covariate Signature Model (TCSM). We focus on two mathematically similar signatures associated with distinct covariates to evaluate TCSM and show that, by leveraging these covariates, TCSM can more accurately distinguish between mutations attributed to these two signatures than existing NMF-based approaches. The second part focuses on the microbes in the tumor microenvironment. It is not currently known whether microbial reads identified in tumor sequencing datasets result from contamination or represent extracellular or intracellular microbial residents of the tumor microenvironment. We develop a computational approach named CSI-Microbes (computational identification of Cell type Specific Intracellular Microbes) that mines single-cell RNA sequencing (scRNA-seq) datasets to distinguish cell-type-specific intracellular microbes from other microbes. We show that CSI-Microbes can identify previously reported intracellular microbes from both human-designed and cancer scRNA-seq datasets. Finally, we apply CSI-Microbes to a large scRNA-seq lung cancer dataset and identify microbial taxa in tumor cells with a transcriptomic signature of decreased immune activity.
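    For context, the standard baseline that covariate-aware models like TCSM are compared against is plain non-negative matrix factorization of a tumors-by-categories mutation count matrix. A minimal sketch with synthetic counts (the 96 categories are the usual trinucleotide mutation types); this is the baseline, not TCSM itself:

```python
# Baseline mutation-signature extraction via NMF: factor a tumors x 96
# mutation count matrix into per-tumor exposures and signatures.
# Counts here are synthetic placeholders.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
counts = rng.poisson(3.0, size=(50, 96))  # 50 tumors x 96 mutation types

model = NMF(n_components=4, init="nndsvda", max_iter=500)
exposures = model.fit_transform(counts)   # tumors x signatures
signatures = model.components_            # signatures x 96 categories
signatures /= signatures.sum(axis=1, keepdims=True)  # rows sum to 1
```

    The challenge the abstract names is visible in this formulation: with few mutations per tumor and near-collinear signature rows, the factorization is poorly constrained, which is exactly where known covariates can help disambiguate.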
  • Item
    Fantastic Sources Of Tumor Heterogeneity And How To Characterize Them
    (2021) Patkar, Sushant A; Ruppin, Eytan; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Cancer constantly evolves to evade the host immune system and to resist different treatments. As a consequence, we see a wide range of inter- and intra-tumor heterogeneity. In this PhD thesis, we present a collection of computational methods that characterize this heterogeneity from diverse perspectives. First, we developed computational frameworks for predicting functional re-wiring events in cancer and for imputing the functional effects of protein-protein interactions given genome-wide transcriptomics and genetic perturbation data. Second, we developed a computational framework to characterize intra-tumor genetic heterogeneity in melanoma from bulk sequencing data and to study its effects on the host immune response and patient survival, independently of the overall mutation burden. Third, we analyzed publicly available genome-wide copy number, expression, and methylation data of distinct cancer types and their normal tissues of origin to systematically uncover factors driving the acquisition of cancer-type-specific chromosomal aneuploidies. Lastly, we developed a new computational tool, CODEFACS (COnfident Deconvolution For All Cell Subsets), to dissect the cellular heterogeneity of each patient’s tumor microenvironment (TME) from bulk RNA sequencing data, and LIRICS (LIgand Receptor Interactions between Cell Subsets), a supporting statistical framework to discover clinically relevant cellular immune crosstalk. Taken together, the methods presented in this thesis offer a way to study tumor heterogeneity in large patient cohorts using widely available bulk sequencing data and to obtain new insights into tumor progression.
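    The core deconvolution problem that CODEFACS addresses can be sketched in a few lines: given a genes-by-cell-types signature matrix and a bulk expression vector, solve for non-negative mixture fractions. The non-negative least squares below is a bare-bones stand-in; CODEFACS goes considerably further (per-gene confidence estimation and deconvolution of full expression profiles, not just fractions).

```python
# Bare-bones bulk deconvolution: recover cell-type fractions from a bulk
# expression vector given a signature matrix, via non-negative least
# squares. Synthetic data; a stand-in for, not a sketch of, CODEFACS.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)
S = rng.gamma(2.0, size=(500, 6))               # 500 genes x 6 cell types
true_frac = np.array([.3, .25, .2, .1, .1, .05])
bulk = S @ true_frac + rng.normal(0, 0.05, 500)  # noisy bulk mixture

frac, _ = nnls(S, bulk)       # non-negative fraction estimates
frac /= frac.sum()            # normalize to proportions
```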
  • Item
    Reference-guided assembly of metagenomes
    (2020) Cepeda, Victoria P; Pop, Mihai; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Microorganisms play an important role in all of the Earth's ecosystems, and are critical for the health of humans [1], plants, and animals. Most microbes are not easily cultured [2]; yet, Metagenomics, the analysis of organismal DNA sequences obtained directly from an environmental sample, enables the study of these microorganisms. Metagenomic assembly is a computational process aimed at reconstructing genes and genomes from metagenomic mixtures. The two main paradigms for this method are de novo assembly (i.e., reconstructing genomes directly from the read data), and reference-guided assembly (i.e., reconstructing genomes using closely related organisms). Because the latter paradigm has a high computational cost—due to the mapping of tens of millions of reads to thousands of full genome sequences—Metagenomic studies have primarily relied on the former paradigm. However, the increased availability of high-throughput sequencing technologies has generated thousands of bacterial genomes, making reference-guided assembly a valuable resource regardless of its computational cost. Thus, this study describes a novel metagenome assembly approach, called MetaCompass, that combines reference-guided assembly and de novo assembly, and it is organized in the following stages: (i) selecting reference genomes from a database using a metagenomic taxonomy classification software that combines gene and genome comparison methods, achieving species and strain level resolution; (ii) performing reference-guided assembly in a new manner, which uses the minimum set cover principle to remove redundancy in a metagenome read mapping while performing consensus calling; and (iii) performing de novo assembly using the reads that have not been mapped to any reference genomes. We show that MetaCompass improves the most common metrics used to evaluate assembly quality—contiguity, consistency, and reference-bases metrics—for both synthetic and real datasets such as the ones gathered in the Human Microbiome Project (HMP) [3], and it also facilitates the assembly of low abundance microorganisms retrieved with the reference-guided approach. Lastly, we used our HMP assembly results to characterize the relative advantages and limitations of de novo and reference-guided assembly approaches, thereby providing guidance on analytical strategies for characterizing the human-associated microbiota.