Biology
Permanent URI for this community: http://hdl.handle.net/1903/11810
12 results
Search Results
Genome assembly forensics: finding the elusive mis-assembly
(Springer Nature, 2008-03-14) Phillippy, Adam M; Schatz, Michael C; Pop, Mihai
We present the first collection of tools aimed at automated genome assembly validation. This work formalizes several mechanisms for detecting mis-assemblies, and describes their implementation in our automated validation pipeline, called amosvalidate. We demonstrate the application of our pipeline in both bacterial and eukaryotic genome assemblies, and highlight several assembly errors in both draft and finished genomes. The software described is compatible with common assembly formats and is released, open-source, at http://amos.sourceforge.net .

Searching for SNPs with cloud computing
(Springer Nature, 2009-11-20) Langmead, Ben; Schatz, Michael C; Lin, Jimmy; Pop, Mihai; Salzberg, Steven L
As DNA sequencing outpaces improvements in computer speed, there is a critical need to accelerate tasks like alignment and SNP calling. Crossbow is a cloud-computing software tool that combines the aligner Bowtie and the SNP caller SOAPsnp. Executing in parallel using Hadoop, Crossbow analyzes data comprising 38-fold coverage of the human genome in three hours using a 320-CPU cluster rented from a cloud computing service for about $85. Crossbow is available from http://bowtie-bio.sourceforge.net/crossbow/ .

Genomic characterization of the Yersinia genus
(Springer Nature, 2010-01-04) Chen, Peter E; Cook, Christopher; Stewart, Andrew C; Nagarajan, Niranjan; Sommer, Dan D; Pop, Mihai; Thomason, Brendan; Thomason, Maureen P Kiley; Lentz, Shannon; Nolan, Nichole; Sozhamannan, Shanmuga; Sulakvelidze, Alexander; Mateczun, Alfred; Du, Lei; Zwick, Michael E; Read, Timothy D
New DNA sequencing technologies have enabled detailed comparative genomic analyses of entire genera of bacterial pathogens.
Prior to this study, three species of the enterobacterial genus Yersinia that cause invasive human diseases (Yersinia pestis, Yersinia pseudotuberculosis, and Yersinia enterocolitica) had been sequenced. However, there were no genomic data on the Yersinia species with more limited virulence potential, frequently found in soil and water environments. We used high-throughput sequencing-by-synthesis instruments to obtain 25- to 42-fold average redundancy, whole-genome shotgun data from the type strains of eight species: Y. aldovae, Y. bercovieri, Y. frederiksenii, Y. kristensenii, Y. intermedia, Y. mollaretii, Y. rohdei, and Y. ruckeri. The deepest branching species in the genus, Y. ruckeri, causative agent of red mouth disease in fish, has the smallest genome (3.7 Mb), although it shares the same core set of approximately 2,500 genes as the other members of the genus, whose genomes range in size from 4.3 to 4.8 Mb. Yersinia genomes had a similar global partition of protein functions, as measured by the distribution of Clusters of Orthologous Groups families. Genome-to-genome variation in islands with genes encoding functions such as ureases, hydrogenases and B-12 cofactor metabolite reactions may reflect adaptations to colonizing specific host habitats. Rapid high-quality draft sequencing was used successfully to compare pathogenic and non-pathogenic members of the Yersinia genus. This work underscores the importance of the acquisition of horizontally transferred genes in the evolution of Y. pestis and points to virulence determinants that have been gained and lost on multiple occasions in the history of the genus.

Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences
(Springer Nature, 2011-07-27) Liu, Bo; Gibbons, Theodore; Ghodsi, Mohammad; Treangen, Todd; Pop, Mihai
A major goal of metagenomics is to characterize the microbial composition of an environment.
The most popular approach relies on 16S rRNA sequencing; however, this approach can generate biased estimates due to differences in the copy number of the gene between even closely related organisms, and due to PCR artifacts. The taxonomic composition can also be determined from metagenomic shotgun sequencing data by matching individual reads against a database of reference sequences. One major limitation of prior computational methods used for this purpose is the use of a universal classification threshold for all genes at all taxonomic levels. We propose that better classification results can be obtained by tuning the taxonomic classifier to each matching length, reference gene, and taxonomic level. We present a novel taxonomic classifier MetaPhyler (http://metaphyler.cbcb.umd.edu), which uses phylogenetic marker genes as a taxonomic reference. Results on simulated datasets demonstrate that MetaPhyler outperforms other tools commonly used in this context (CARMA, Megan and PhymmBL). We also present interesting results by analyzing a real metagenomic dataset. We have introduced a novel taxonomic classification method for analyzing the microbial diversity from whole-metagenome shotgun sequences. Compared with previous approaches, MetaPhyler is much more accurate in estimating the phylogenetic composition. In addition, we have shown that MetaPhyler can be used to guide the discovery of novel organisms from metagenomic samples.

Exploiting sparseness in de novo genome assembly
(Springer Nature, 2012-04-19) Ye, Chengxi; Sam Ma, Zhanshan; Cannon, Charles H; Pop, Mihai; Yu, Douglas W
The very large memory requirements for the construction of assembly graphs for de novo genome assembly limit current algorithms to super-computing environments.
In this paper, we demonstrate that constructing a sparse assembly graph which stores only a small fraction of the observed k-mers as nodes and the links between these nodes allows the de novo assembly of even moderately-sized genomes (~500 M) on a typical laptop computer. We implement this sparse graph concept in a proof-of-principle software package, SparseAssembler, utilizing a new sparse k-mer graph structure evolved from the de Bruijn graph. We test our SparseAssembler with both simulated and real data, achieving ~90% memory savings and retaining high assembly accuracy, without sacrificing speed in comparison to existing de novo assemblers.

We are what we eat: how the diet of infants affects their gut microbiome
(Springer Nature, 2012-04-30) Pop, Mihai
Simultaneous analysis of the gut microbiome and host gene expression in infants reveals the impact of diet (breastfeeding versus formula) on host-microbiome interactions.

De novo likelihood-based measures for comparing genome assemblies
(Springer Nature, 2013-08-22) Ghodsi, Mohammadreza; Hill, Christopher M; Astrovskaya, Irina; Lin, Henry; Sommer, Dan D; Koren, Sergey; Pop, Mihai
The current revolution in genomics has been made possible by software tools called genome assemblers, which stitch together DNA fragments “read” by sequencing machines into complete or nearly complete genome sequences. Despite decades of research in this field and the development of dozens of genome assemblers, assessing and comparing the quality of assembled genome sequences still relies on the availability of independently determined standards, such as manually curated genome sequences, or independently produced mapping data. These “gold standards” can be expensive to produce and may only cover a small fraction of the genome, which limits their applicability to newly generated genome sequences.
Here we introduce a de novo probabilistic measure of assembly quality which allows for an objective comparison of multiple assemblies generated from the same set of reads. We define the quality of a sequence produced by an assembler as the conditional probability of observing the sequenced reads from the assembled sequence. A key property of our metric is that the true genome sequence maximizes the score, unlike other commonly used metrics. We demonstrate that our de novo score can be computed quickly and accurately in a practical setting even for large datasets, by estimating the score from a relatively small sample of the reads. To demonstrate the benefits of our score, we measure the quality of the assemblies generated in the GAGE and Assemblathon 1 assembly “bake-offs” with our metric. Even without knowledge of the true reference sequence, our de novo metric closely matches the reference-based evaluation metrics used in the studies and outperforms other de novo metrics traditionally used to measure assembly quality (such as N50). Finally, we highlight the application of our score to optimize assembly parameters used in genome assemblers, which enables better assemblies to be produced, even without prior knowledge of the genome being assembled. Likelihood-based measures, such as the one proposed here, will become the new standard for de novo assembly evaluation.

Automated ensemble assembly and validation of microbial genomes
(Springer Nature, 2014-05-03) Koren, Sergey; Treangen, Todd J; Hill, Christopher M; Pop, Mihai; Phillippy, Adam M
The continued democratization of DNA sequencing has sparked a new wave of development of genome assembly and assembly validation methods. As individual research labs, rather than centralized centers, begin to sequence the majority of new genomes, it is important to establish best practices for genome assembly.
However, recent evaluations such as GAGE and the Assemblathon have concluded that there is no single best approach to genome assembly. Instead, it is preferable to generate multiple assemblies and validate them to determine which is most useful for the desired analysis; this is a labor-intensive process that is often impossible or unfeasible. To encourage best practices supported by the community, we present iMetAMOS, an automated ensemble assembly pipeline; iMetAMOS encapsulates the process of running, validating, and selecting a single assembly from multiple assemblies. iMetAMOS packages several leading open-source tools into a single binary that automates parameter selection and execution of multiple assemblers, scores the resulting assemblies based on multiple validation metrics, and annotates the assemblies for genes and contaminants. We demonstrate the utility of the ensemble process on 225 previously unassembled Mycobacterium tuberculosis genomes as well as a Rhodobacter sphaeroides benchmark dataset. On these real data, iMetAMOS reliably produces validated assemblies and identifies potential contamination without user intervention. In addition, intelligent parameter selection produces assemblies of R. sphaeroides comparable to or exceeding the quality of those from the GAGE-B evaluation, affecting the relative ranking of some assemblers. Ensemble assembly with iMetAMOS provides users with multiple, validated assemblies for each genome. Although computationally limited to small or mid-sized genomes, this approach is the most effective and reproducible means for generating high-quality assemblies and enables users to select an assembly best tailored to their specific needs.

Computational methods for optical mapping
(Springer Nature, 2014-12-30) Mendelowitz, Lee; Pop, Mihai
Optical mapping and newer genome mapping technologies based on nicking enzymes provide low resolution but long-range genomic information.
The optical mapping technique has been successfully used for assessing the quality of genome assemblies and for detecting large-scale structural variants and rearrangements that cannot be detected using current paired-end sequencing protocols. Here, we review several algorithms and methods for building consensus optical maps and aligning restriction patterns to a reference map, as well as methods for using optical maps with sequence assemblies.

Longitudinal analysis of the lung microbiota of cynomolgous macaques during long-term SHIV infection
(Springer Nature, 2016-07-08) Morris, Alison; Paulson, Joseph N.; Talukder, Hisham; Tipton, Laura; Kling, Heather; Cui, Lijia; Fitch, Adam; Pop, Mihai; Norris, Karen A.; Ghedin, Elodie
Longitudinal studies of the lung microbiome are challenging due to the invasive nature of sample collection. In addition, studies of the lung microbiome in human disease are usually performed after disease onset, limiting the ability to determine early events in the lung. We used a non-human primate model to assess lung microbiome alterations over time in response to an HIV-like immunosuppression and determined the impact of the lung microbiome on development of obstructive lung disease. Cynomolgous macaques were infected with the SIV-HIV chimeric virus SHIV89.6P. Bronchoalveolar lavage fluid samples were collected pre-infection and every 4 weeks for 53 weeks post-infection. The microbiota was characterized at each time point by 16S ribosomal RNA (rRNA) sequencing. We observed individual variation in the composition of the lung microbiota with a proportion of the macaques having Tropheryma whipplei as the dominant organism in their lungs. Bacterial communities varied over time both within and between animals, but there did not appear to be a systematic alteration due to SHIV infection. Development of obstructive lung disease in the SHIV-infected animals was characterized by a relative increase in abundance of oral anaerobes.
Network analysis further identified a difference in community composition that accompanied the development of obstructive disease with negative correlations between members of the obstructed and non-obstructed groups. This emphasizes how species shifts can impact multiple other species, potentially resulting in disease. This study is the first to investigate the dynamics of the lung microbiota over time and in response to immunosuppression in a non-human primate model. The persistence of oral bacteria in the lung and their association with obstruction suggest a potential role in pathogenesis. The lung microbiome in the non-human primate is a valuable tool for examining the impact of the lung microbiome in human health and disease.