Browsing by Author "Salzberg, Steven L"

Between a chicken and a grape: estimating the number of human genes
(Springer Nature, 2010-05-05) Pertea, Mihaela; Salzberg, Steven L
Many people expected the question 'How many genes in the human genome?' to be resolved with the publication of the genome sequence in 2001, but estimates continue to fluctuate.

Clustering metagenomic sequences with interpolated Markov models
(2010-11-02) Kelley, David R; Salzberg, Steven L
Background: Sequencing of environmental DNA (often called metagenomics) has shown tremendous potential to uncover the vast number of unknown microbes that cannot be cultured and sequenced by traditional methods. Because the output from metagenomic sequencing is a large set of reads of unknown origin, clustering reads together that were sequenced from the same species is a crucial analysis step. Many effective approaches to this task rely on sequenced genomes in public databases, but these genomes are a highly biased sample that is not necessarily representative of environments interesting to many metagenomics projects. Results: We present SCIMM (Sequence Clustering with Interpolated Markov Models), an unsupervised sequence clustering method. SCIMM achieves greater clustering accuracy than previous unsupervised approaches. We examine the limitations of unsupervised learning on complex datasets, and suggest a hybrid of SCIMM and supervised learning method Phymm called PHYSCIMM that performs better when evolutionarily close training genomes are available. Conclusions: SCIMM and PHYSCIMM are highly accurate methods to cluster metagenomic sequences. SCIMM operates entirely unsupervised, making it ideal for environments containing mostly novel microbes. PHYSCIMM uses supervised learning to improve clustering in environments containing microbial strains from well-characterized genera.
SCIMM and PHYSCIMM are available open source from http://www.cbcb.umd.edu/software/scimm.

Complete Columbian mammoth mitogenome suggests interbreeding with woolly mammoths
(Springer Nature, 2011-05-31) Enk, Jacob; Devault, Alison; Debruyne, Regis; King, Christine E; Treangen, Todd; O'Rourke, Dennis; Salzberg, Steven L; Fisher, Daniel; MacPhee, Ross; Poinar, Hendrik
Late Pleistocene North America hosted at least two divergent and ecologically distinct species of mammoth: the periglacial woolly mammoth (Mammuthus primigenius) and the subglacial Columbian mammoth (Mammuthus columbi). To date, mammoth genetic research has been entirely restricted to woolly mammoths, rendering their genetic evolution difficult to contextualize within broader Pleistocene paleoecology and biogeography. Here, we take an interspecific approach to clarifying mammoth phylogeny by targeting Columbian mammoth remains for mitogenomic sequencing. We sequenced the first complete mitochondrial genome of a classic Columbian mammoth, as well as the first complete mitochondrial genome of a North American woolly mammoth. Somewhat contrary to conventional paleontological models, which posit that the two species were highly divergent, the M. columbi mitogenome we obtained falls securely within a subclade of endemic North American M. primigenius. Though limited, our data suggest that the two species interbred at some point in their evolutionary histories. One potential explanation is that woolly mammoth haplotypes entered Columbian mammoth populations via introgression at subglacial ecotones, a scenario with compelling parallels in extant elephants and consistent with certain regional paleontological observations. This highlights the need for multi-genomic data to sufficiently characterize mammoth evolutionary history.
Our results demonstrate that the use of next-generation sequencing technologies holds promise in obtaining such data, even from non-cave, non-permafrost Pleistocene depositional contexts.

A computational survey of candidate exonic splicing enhancer motifs in the model plant Arabidopsis thaliana
(Springer Nature, 2007-05-21) Pertea, Mihaela; Mount, Stephen M; Salzberg, Steven L
Algorithmic approaches to splice site prediction have relied mainly on the consensus patterns found at the boundaries between protein coding and non-coding regions. However, exonic splicing enhancers have been shown to enhance the utilization of nearby splice sites. We have developed a new computational technique to identify significantly conserved motifs involved in splice site regulation. First, 84 putative exonic splicing enhancer hexamers are identified in Arabidopsis thaliana. Then a Gibbs sampling program called ELPH was used to locate conserved motifs represented by these hexamers in exonic regions near splice sites in confirmed genes. Oligomers containing 35 of these motifs have been shown experimentally to induce significant inclusion of A. thaliana exons. Second, integration of our regulatory motifs into two different splice site recognition programs significantly improved the ability of the software to correctly predict splice sites in a large database of confirmed genes. We have released GeneSplicerESE, the improved splice site recognition code, as open source software.
Our results show that the use of the ESE motifs consistently improves splice site prediction accuracy.

Decoding the massive genome of loblolly pine using haploid DNA and novel assembly strategies
(Springer Nature, 2014-03-04) Neale, David B; Wegrzyn, Jill L; Stevens, Kristian A; Zimin, Aleksey V; Puiu, Daniela; Crepeau, Marc W; Cardeno, Charis; Koriabine, Maxim; Holtz-Morris, Ann E; Liechty, John D; Martínez-García, Pedro J; Vasquez-Gross, Hans A; Lin, Brian Y; Zieve, Jacob J; Dougherty, William M; Fuentes-Soriano, Sara; Wu, Le-Shin; Gilbert, Don; Marçais, Guillaume; Roberts, Michael; Holt, Carson; Yandell, Mark; Davis, John M; Smith, Katherine E; Dean, Jeffrey FD; Lorenz, W Walter; Whetten, Ross W; Sederoff, Ronald; Wheeler, Nicholas; McGuire, Patrick E; Main, Doreen; Loopstra, Carol A; Mockaitis, Keithanne; deJong, Pieter J; Yorke, James A; Salzberg, Steven L; Langley, Charles H
The size and complexity of conifer genomes has, until now, prevented full genome sequencing and assembly. The large research community and economic importance of loblolly pine, Pinus taeda L., made it an early candidate for reference sequence determination. We develop a novel strategy to sequence the genome of loblolly pine that combines unique aspects of pine reproductive biology and genome assembly methodology. We use a whole genome shotgun approach relying primarily on next generation sequence generated from a single haploid seed megagametophyte from a loblolly pine tree, 20-1010, that has been used in industrial forest tree breeding. The resulting sequence and assembly was used to generate a draft genome spanning 23.2 Gbp and containing 20.1 Gbp with an N50 scaffold size of 66.9 kbp, making it a significant improvement over available conifer genomes. The long scaffold lengths allow the annotation of 50,172 gene models with intron lengths averaging over 2.7 kbp and sometimes exceeding 100 kbp in length.
Analysis of orthologous gene sets identifies gene families that may be unique to conifers. We further characterize and expand the existing repeat library based on the de novo analysis of the repetitive content, estimated to encompass 82% of the genome. In addition to its value as a resource for researchers and breeders, the loblolly pine genome sequence and assembly reported here demonstrates a novel approach to sequencing the large and complex genomes of this important group of plants that can now be widely applied.

Detection and correction of false segmental duplications caused by genome mis-assembly
(2010-03-10) Kelley, David R; Salzberg, Steven L
Diploid genomes with divergent chromosomes present special problems for assembly software as two copies of especially polymorphic regions may be mistakenly constructed, creating the appearance of a recent segmental duplication. We developed a method for identifying such false duplications and applied it to four vertebrate genomes. For each genome, we corrected mis-assemblies, improved estimates of the amount of duplicated sequence, and recovered polymorphisms between the sequenced chromosomes.

Detection of lineage-specific evolutionary changes among primate species
(2011-07-04) Pertea, Mihaela; Pertea, Geo M; Salzberg, Steven L
Background: Comparison of the human genome with other primates offers the opportunity to detect evolutionary events that created the diverse phenotypes among the primate species. Because the primate genomes are highly similar to one another, methods developed for analysis of more divergent species do not always detect signs of evolutionary selection. Results: We have developed a new method, called DivE, specifically designed to find regions that have evolved either more or less rapidly than expected, for any clade within a set of very closely related species.
Unlike some previous methods, DivE does not rely on rates of synonymous and nonsynonymous substitution, which enables it to detect evolutionary events in noncoding regions. We demonstrate using simulated data that DivE compares favorably to alternative methods, and we then apply DivE to the ENCODE regions in 14 primate species. We identify thousands of regions in these primates, ranging from 50 to >10000 bp in length, that appear to have experienced either constrained or accelerated rates of evolution. In particular, we detected 4942 regions that have potentially undergone positive selection in one or more primate species. Most of these regions occur outside of protein-coding genes, although we identified 20 proteins that have experienced positive selection. Conclusions: DivE provides an easy-to-use method to predict both positive and negative selection in noncoding DNA, that is particularly well-suited to detecting lineage-specific selection in large genomes.

Efficient oligonucleotide probe selection for pan-genomic tiling arrays
(2009-09-16) Phillippy, Adam M; Deng, Xiangyu; Zhang, Wei; Salzberg, Steven L
Background: Array comparative genomic hybridization is a fast and cost-effective method for detecting, genotyping, and comparing the genomic sequence of unknown bacterial isolates. This method, as with all microarray applications, requires adequate coverage of probes targeting the regions of interest. An unbiased tiling of probes across the entire length of the genome is the most flexible design approach. However, such a whole-genome tiling requires that the genome sequence is known in advance. For the accurate analysis of uncharacterized bacteria, an array must query a fully representative set of sequences from the species' pan-genome. Prior microarrays have included only a single strain per array or the conserved sequences of gene families. These arrays omit potentially important genes and sequence variants from the pan-genome.
Results: This paper presents a new probe selection algorithm (PanArray) that can tile multiple whole genomes using a minimal number of probes. Unlike arrays built on clustered gene families, PanArray uses an unbiased, probe-centric approach that does not rely on annotations, gene clustering, or multi-alignments. Instead, probes are evenly tiled across all sequences of the pangenome at a consistent level of coverage. To minimize the required number of probes, probes conserved across multiple strains in the pan-genome are selected first, and additional probes are used only where necessary to span polymorphic regions of the genome. The viability of the algorithm is demonstrated by array designs for seven different bacterial pan-genomes and, in particular, the design of a 385,000 probe array that fully tiles the genomes of 20 different Listeria monocytogenes strains with overlapping probes at greater than twofold coverage. Conclusion: PanArray is an oligonucleotide probe selection algorithm for tiling multiple genome sequences using a minimal number of probes. It is capable of fully tiling all genomes of a species on a single microarray chip. These unique pan-genome tiling arrays provide maximum flexibility for the analysis of both known and uncharacterized strains.

Genome re-annotation: a wiki solution?
(Springer Nature, 2007-02-01) Salzberg, Steven L
The annotation of most genomes becomes outdated over time, owing in part to our ever-improving knowledge of genomes and in part to improvements in bioinformatics software. Unfortunately, annotation is rarely if ever updated and resources to support routine reannotation are scarce. Wiki software, which would allow many scientists to edit each genome's annotation, offers one possible solution.

Genome sequence and rapid evolution of the rice pathogen Xanthomonas oryzae pv. oryzae PXO99A
(Springer Nature, 2008-05-01) Salzberg, Steven L; Sommer, Daniel D; Schatz, Michael C; Phillippy, Adam M; Rabinowicz, Pablo D; Tsuge, Seiji; Furutani, Ayako; Ochiai, Hirokazu; Delcher, Arthur L; Kelley, David; Madupu, Ramana; Puiu, Daniela; Radune, Diana; Shumway, Martin; Trapnell, Cole; Aparna, Gudlur; Jha, Gopaljee; Pandey, Alok; Patil, Prabhu B; Ishihara, Hiromichi; Meyer, Damien F; Szurek, Boris; Verdier, Valerie; Koebnik, Ralf; Dow, J Maxwell; Ryan, Robert P; Hirata, Hisae; Tsuyumu, Shinji; Lee, Sang Won; Ronald, Pamela C; Sonti, Ramesh V; Van Sluys, Marie-Anne; Leach, Jan E; White, Frank F; Bogdanove, Adam J
Xanthomonas oryzae pv. oryzae causes bacterial blight of rice (Oryza sativa L.), a major disease that constrains production of this staple crop in many parts of the world. We report here on the complete genome sequence of strain PXO99A and its comparison to two previously sequenced strains, KACC10331 and MAFF311018, which are highly similar to one another. The PXO99A genome is a single circular chromosome of 5,240,075 bp, considerably longer than the genomes of the other strains (4,941,439 bp and 4,940,217 bp, respectively), and it contains 5083 protein-coding genes, including 87 not found in KACC10331 or MAFF311018. PXO99A contains a greater number of virulence-associated transcription activator-like effector genes and has at least ten major chromosomal rearrangements relative to KACC10331 and MAFF311018. PXO99A contains numerous copies of diverse insertion sequence elements, members of which are associated with 7 out of 10 of the major rearrangements. A rapidly-evolving CRISPR (clustered regularly interspersed short palindromic repeats) region contains evidence of dozens of phage infections unique to the PXO99A lineage. PXO99A also contains a unique, near-perfect tandem repeat of 212 kilobases close to the replication terminus. Our results provide striking evidence of genome plasticity and rapid evolution within Xanthomonas oryzae pv. oryzae. The comparisons point to sources of genomic variation and candidates for strain-specific adaptations of this pathogen that help to explain the extraordinary diversity of Xanthomonas oryzae pv. oryzae genotypes and races that have been isolated from around the world.

Improving pan-genome annotation using whole genome multiple alignment
(2011-06-30) Angiuoli, Samuel V; Dunning Hotopp, Julie C; Salzberg, Steven L; Tettelin, Herve
Background: Rapid annotation and comparisons of genomes from multiple isolates (pan-genomes) is becoming commonplace due to advances in sequencing technology. Genome annotations can contain inconsistencies and errors that hinder comparative analysis even within a single species. Tools are needed to compare and improve annotation quality across sets of closely related genomes. Results: We introduce a new tool, Mugsy-Annotator, that identifies orthologs and evaluates annotation quality in prokaryotic genomes using whole genome multiple alignment. Mugsy-Annotator identifies anomalies in annotated gene structures, including inconsistently located translation initiation sites and disrupted genes due to draft genome sequencing or pseudogenes. An evaluation of species pan-genomes using the tool indicates that such anomalies are common, especially at translation initiation sites. Mugsy-Annotator reports alternate annotations that improve consistency and are candidates for further review. Conclusions: Whole genome multiple alignment can be used to efficiently identify orthologs and annotation problem areas in a bacterial pan-genome. Comparisons of annotated gene structures within a species may show more variation than is actually present in the genome, indicating errors in genome annotation.
Our new tool Mugsy-Annotator assists re-annotation efforts by highlighting edits that improve annotation consistency.

A new rhesus macaque assembly and annotation for next-generation sequencing analyses
(Springer Nature, 2014-10-14) Zimin, Aleksey V; Cornish, Adam S; Maudhoo, Mnirnal D; Gibbs, Robert M; Zhang, Xiongfei; Pandey, Sanjit; Meehan, Daniel T; Wipfler, Kristin; Bosinger, Steven E; Johnson, Zachary P; Tharp, Gregory K; Marçais, Guillaume; Roberts, Michael; Ferguson, Betsy; Fox, Howard S; Treangen, Todd; Salzberg, Steven L; Yorke, James A; Norgren, Robert B Jr
The rhesus macaque (Macaca mulatta) is a key species for advancing biomedical research. Like all draft mammalian genomes, the draft rhesus assembly (rheMac2) has gaps, sequencing errors and misassemblies that have prevented automated annotation pipelines from functioning correctly. Another rhesus macaque assembly, CR_1.0, is also available but is substantially more fragmented than rheMac2 with smaller contigs and scaffolds. Annotations for these two assemblies are limited in completeness and accuracy. High quality assembly and annotation files are required for a wide range of studies including expression, genetic and evolutionary analyses. We report a new de novo assembly of the rhesus macaque genome (MacaM) that incorporates both the original Sanger sequences used to assemble rheMac2 and new Illumina sequences from the same animal. MacaM has a weighted average (N50) contig size of 64 kilobases, more than twice the size of the rheMac2 assembly and almost five times the size of the CR_1.0 assembly. The MacaM chromosome assembly incorporates information from previously unutilized mapping data and preliminary annotation of scaffolds. Independent assessment of the assemblies using Ion Torrent read alignments indicates that MacaM is more complete and accurate than rheMac2 and CR_1.0.
We assembled messenger RNA sequences from several rhesus tissues into transcripts which allowed us to identify a total of 11,712 complete proteins representing 9,524 distinct genes. Using a combination of our assembled rhesus macaque transcripts and human transcripts, we annotated 18,757 transcripts and 16,050 genes with complete coding sequences in the MacaM assembly. Further, we demonstrate that the new annotations provide greatly improved accuracy as compared to the current annotations of rheMac2. Finally, we show that the MacaM genome provides an accurate resource for alignment of reads produced by RNA sequence expression studies.

Probing the pan-genome of Listeria monocytogenes: new insights into intraspecific niche expansion and genomic diversification
(Springer Nature, 2010-09-16) Deng, Xiangyu; Phillippy, Adam M; Li, Zengxin; Salzberg, Steven L; Zhang, Wei
Bacterial pathogens often show significant intraspecific variations in ecological fitness, host preference and pathogenic potential to cause infectious disease. The species of Listeria monocytogenes, a facultative intracellular pathogen and the causative agent of human listeriosis, consists of at least three distinct genetic lineages. Two of these lineages predominantly cause human sporadic and epidemic infections, whereas the third lineage has never been implicated in human disease outbreaks despite its overall conservation of many known virulence factors. Here we compare the genomes of 26 L. monocytogenes strains representing the three lineages based on both in silico comparative genomic analysis and high-density, pan-genomic DNA array hybridizations. We uncover 86 genes and 8 small regulatory RNAs that likely make L. monocytogenes lineages differ in carbohydrate utilization and stress resistance during their residence in natural habitats and passage through the host gastrointestinal tract.
We also identify 2,330 to 2,456 core genes that define this species along with an open pan-genome pool that contains more than 4,052 genes. Phylogenomic reconstructions based on 3,560 homologous groups allowed robust estimation of phylogenetic relatedness among L. monocytogenes strains. Our pan-genome approach enables accurate co-analysis of DNA sequence and hybridization array data for both core gene estimation and phylogenomics. Application of our method to the pan-genome of L. monocytogenes sheds new insights into the intraspecific niche expansion and evolution of this important foodborne pathogen.

Quake: quality-aware detection and correction of sequencing errors
(2010-11-29) Kelley, David R; Schatz, Michael C; Salzberg, Steven L
We introduce Quake, a program to detect and correct errors in DNA sequencing reads. Using a maximum likelihood approach incorporating quality values and nucleotide specific miscall rates, Quake achieves the highest accuracy on realistically simulated reads. We further demonstrate substantial improvements in de novo assembly and SNP detection after using Quake. Quake can be used for any size project, including more than one billion human reads, and is freely available as open source software from http://www.cbcb.umd.edu/software/quake.

Searching for SNPs with cloud computing
(Springer Nature, 2009-11-20) Langmead, Ben; Schatz, Michael C; Lin, Jimmy; Pop, Mihai; Salzberg, Steven L
As DNA sequencing outpaces improvements in computer speed, there is a critical need to accelerate tasks like alignment and SNP calling. Crossbow is a cloud-computing software tool that combines the aligner Bowtie and the SNP caller SOAPsnp. Executing in parallel using Hadoop, Crossbow analyzes data comprising 38-fold coverage of the human genome in three hours using a 320-CPU cluster rented from a cloud computing service for about $85.
Crossbow is available from http://bowtie-bio.sourceforge.net/crossbow/ .

Thousands of missed genes found in bacterial genomes and their analysis with COMBREX
(Springer Nature, 2012-10-30) Wood, Derrick E; Lin, Henry; Levy-Moonshine, Ami; Swaminathan, Rajiswari; Chang, Yi-Chien; Anton, Brian P; Osmani, Lais; Steffen, Martin; Kasif, Simon; Salzberg, Steven L
The dramatic reduction in the cost of sequencing has allowed many researchers to join in the effort of sequencing and annotating prokaryotic genomes. Annotation methods vary considerably and may fail to identify some genes. Here we draw attention to a large number of likely genes missing from annotations using common tools such as Glimmer and BLAST. By analyzing 1,474 prokaryotic genome annotations in GenBank, we identify 13,602 likely missed genes that are homologs to non-hypothetical proteins, and 11,792 likely missed genes that are homologs only to hypothetical proteins, yet have supporting evidence of their protein-coding nature from COMBREX, a newly created gene function database. We also estimate the likelihood that each potential missing gene found is a genuine protein-coding gene using COMBREX. Our analysis of the causes of missed genes suggests that larger annotation centers tend to produce annotations with fewer missed genes than smaller centers, and many of the missed genes are short genes <300 bp. Over 1,000 of the likely missed genes could be associated with phenotype information available in COMBREX. 359 of these genes, found in pathogenic organisms, may be potential targets for pharmaceutical research. The newly identified genes are available on COMBREX’s website.

TopHat-Fusion: an algorithm for discovery of novel fusion transcripts
(2011-08-11) Kim, Daehwan; Salzberg, Steven L
TopHat-Fusion is an algorithm designed to discover transcripts representing fusion gene products, which result from the breakage and re-joining of two different chromosomes, or from rearrangements within a chromosome.
TopHat-Fusion is an enhanced version of TopHat, an efficient program that aligns RNA-seq reads without relying on existing annotation. Because it is independent of gene annotation, TopHat-Fusion can discover fusion products deriving from known genes, unknown genes and unannotated splice variants of known genes. Using RNA-seq data from breast and prostate cancer cell lines, we detected both previously reported and novel fusions with solid supporting evidence. TopHat-Fusion is available at http://tophat-fusion.sourceforge.net/.

TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions
(Springer Nature, 2013-04-25) Kim, Daehwan; Pertea, Geo; Trapnell, Cole; Pimentel, Harold; Kelley, Ryan; Salzberg, Steven L
TopHat is a popular spliced aligner for RNA-sequence (RNA-seq) experiments. In this paper, we describe TopHat2, which incorporates many significant enhancements to TopHat. TopHat2 can align reads of various lengths produced by the latest sequencing technologies, while allowing for variable-length indels with respect to the reference genome. In addition to de novo spliced alignment, TopHat2 can align reads across fusion breaks, which can occur after genomic translocations. TopHat2 combines the ability to identify novel splice sites with direct mapping to known transcripts, producing sensitive and accurate alignments, even for highly repetitive genomes or in the presence of pseudogenes. TopHat2 is available at http://ccb.jhu.edu/software/tophat .

Ultrafast and memory-efficient alignment of short DNA sequences to the human genome
(Springer Nature, 2009-03-04) Langmead, Ben; Trapnell, Cole; Pop, Mihai; Salzberg, Steven L
Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes.
Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds. Bowtie is open source and available at http://bowtie.cbcb.umd.edu .

A whole-genome assembly of the domestic cow, Bos taurus
(2009-04-29) Zimin, Aleksey V; Delcher, Arthur L; Florea, Liliana; Kelley, David R; Schatz, Michael C; Puiu, Daniela; Hanrahan, Finnian; Pertea, Geo; Van Tassell, Curtis P; Sonstegard, Tad S; Marcais, Guillaume; Roberts, Michael; Subramanian, Poorani; Yorke, James A; Salzberg, Steven L
Background: The genome of the domestic cow, Bos taurus, was sequenced using a mixture of hierarchical and whole-genome shotgun sequencing methods. Results: We have assembled the 35 million sequence reads and applied a variety of assembly improvement techniques, creating an assembly of 2.86 billion base pairs that has multiple improvements over previous assemblies: it is more complete, covering more of the genome; thousands of gaps have been closed; many erroneous inversions, deletions, and translocations have been corrected; and thousands of single-nucleotide errors have been corrected. Our evaluation using independent metrics demonstrates that the resulting assembly is substantially more accurate and complete than alternative versions. Conclusions: By using independent mapping data and conserved synteny between the cow and human genomes, we were able to construct an assembly with excellent large-scale contiguity in which a large majority (approximately 91%) of the genome has been placed onto the 30 B. taurus chromosomes. We constructed a new cow-human synteny map that expands upon previous maps. We also identified for the first time a portion of the B. taurus Y chromosome.