Biology
Permanent URI for this communityhttp://hdl.handle.net/1903/11810
Browse
13 results
Search Results
Item Genome re-annotation: a wiki solution?(Springer Nature, 2007-02-01) Salzberg, Steven LThe annotation of most genomes becomes outdated over time, owing in part to our ever-improving knowledge of genomes and in part to improvements in bioinformatics software. Unfortunately, annotation is rarely if ever updated and resources to support routine reannotation are scarce. Wiki software, which would allow many scientists to edit each genome's annotation, offers one possible solution.Item A computational survey of candidate exonic splicing enhancer motifs in the model plant Arabidopsis thaliana(Springer Nature, 2007-05-21) Pertea, Mihaela; Mount, Stephen M; Salzberg, Steven LAlgorithmic approaches to splice site prediction have relied mainly on the consensus patterns found at the boundaries between protein coding and non-coding regions. However exonic splicing enhancers have been shown to enhance the utilization of nearby splice sites. We have developed a new computational technique to identify significantly conserved motifs involved in splice site regulation. First, 84 putative exonic splicing enhancer hexamers are identified in Arabidopsis thaliana. Then a Gibbs sampling program called ELPH was used to locate conserved motifs represented by these hexamers in exonic regions near splice sites in confirmed genes. Oligomers containing 35 of these motifs have been shown experimentally to induce significant inclusion of A. thaliana exons. Second, integration of our regulatory motifs into two different splice site recognition programs significantly improved the ability of the software to correctly predict splice sites in a large database of confirmed genes. We have released GeneSplicerESE, the improved splice site recognition code, as open source software. Our results show that the use of the ESE motifs consistently improves splice site prediction accuracy.Item Genome sequence and rapid evolution of the rice pathogen Xanthomonas oryzae pv. oryzae PXO99A(Springer Nature, 2008-05-01) Salzberg, Steven L; Sommer, Daniel D; Schatz, Michael C; Phillippy, Adam M; Rabinowicz, Pablo D; Tsuge, Seiji; Furutani, Ayako; Ochiai, Hirokazu; Delcher, Arthur L; Kelley, David; Madupu, Ramana; Puiu, Daniela; Radune, Diana; Shumway, Martin; Trapnell, Cole; Aparna, Gudlur; Jha, Gopaljee; Pandey, Alok; Patil, Prabhu B; Ishihara, Hiromichi; Meyer, Damien F; Szurek, Boris; Verdier, Valerie; Koebnik, Ralf; Dow, J Maxwell; Ryan, Robert P; Hirata, Hisae; Tsuyumu, Shinji; Lee, Sang Won; Ronald, Pamela C; Sonti, Ramesh V; Van Sluys, Marie-Anne; Leach, Jan E; White, Frank F; Bogdanove, Adam JXanthomonas oryzae pv. oryzae causes bacterial blight of rice (Oryza sativa L.), a major disease that constrains production of this staple crop in many parts of the world. We report here on the complete genome sequence of strain PXO99A and its comparison to two previously sequenced strains, KACC10331 and MAFF311018, which are highly similar to one another. The PXO99A genome is a single circular chromosome of 5,240,075 bp, considerably longer than the genomes of the other strains (4,941,439 bp and 4,940,217 bp, respectively), and it contains 5083 protein-coding genes, including 87 not found in KACC10331 or MAFF311018. PXO99A contains a greater number of virulence-associated transcription activator-like effector genes and has at least ten major chromosomal rearrangements relative to KACC10331 and MAFF311018. PXO99A contains numerous copies of diverse insertion sequence elements, members of which are associated with 7 out of 10 of the major rearrangements. A rapidly-evolving CRISPR (clustered regularly interspersed short palindromic repeats) region contains evidence of dozens of phage infections unique to the PXO99A lineage. PXO99A also contains a unique, near-perfect tandem repeat of 212 kilobases close to the replication terminus. Our results provide striking evidence of genome plasticity and rapid evolution within Xanthomonas oryzae pv. oryzae. The comparisons point to sources of genomic variation and candidates for strain-specific adaptations of this pathogen that help to explain the extraordinary diversity of Xanthomonas oryzae pv. oryzae genotypes and races that have been isolated from around the world.Item Searching for SNPs with cloud computing(Springer Nature, 2009-11-20) Langmead, Ben; Schatz, Michael C; Lin, Jimmy; Pop, Mihai; Salzberg, Steven LAs DNA sequencing outpaces improvements in computer speed, there is a critical need to accelerate tasks like alignment and SNP calling. Crossbow is a cloud-computing software tool that combines the aligner Bowtie and the SNP caller SOAPsnp. Executing in parallel using Hadoop, Crossbow analyzes data comprising 38-fold coverage of the human genome in three hours using a 320-CPU cluster rented from a cloud computing service for about $85. Crossbow is available from http://bowtie-bio.sourceforge.net/crossbow/ .Item Between a chicken and a grape: estimating the number of human genes(Springer Nature, 2010-05-05) Pertea, Mihaela; Salzberg, Steven LMany people expected the question 'How many genes in the human genome?' to be resolved with the publication of the genome sequence in 2001, but estimates continue to fluctuate.Item Probing the pan-genome of Listeria monocytogenes: new insights into intraspecific niche expansion and genomic diversification(Springer Nature, 2010-09-16) Deng, Xiangyu; Phillippy, Adam M; Li, Zengxin; Salzberg, Steven L; Zhang, WeiBacterial pathogens often show significant intraspecific variations in ecological fitness, host preference and pathogenic potential to cause infectious disease. The species of Listeria monocytogenes, a facultative intracellular pathogen and the causative agent of human listeriosis, consists of at least three distinct genetic lineages. Two of these lineages predominantly cause human sporadic and epidemic infections, whereas the third lineage has never been implicated in human disease outbreaks despite its overall conservation of many known virulence factors. Here we compare the genomes of 26 L. monocytogenes strains representing the three lineages based on both in silico comparative genomic analysis and high-density, pan-genomic DNA array hybridizations. We uncover 86 genes and 8 small regulatory RNAs that likely make L. monocytogenes lineages differ in carbohydrate utilization and stress resistance during their residence in natural habitats and passage through the host gastrointestinal tract. We also identify 2,330 to 2,456 core genes that define this species along with an open pan-genome pool that contains more than 4,052 genes. Phylogenomic reconstructions based on 3,560 homologous groups allowed robust estimation of phylogenetic relatedness among L. monocytogenes strains. Our pan-genome approach enables accurate co-analysis of DNA sequence and hybridization array data for both core gene estimation and phylogenomics. Application of our method to the pan-genome of L. monocytogenes sheds new insights into the intraspecific niche expansion and evolution of this important foodborne pathogen.Item Complete Columbian mammoth mitogenome suggests interbreeding with woolly mammoths(Springer Nature, 2011-05-31) Enk, Jacob; Devault, Alison; Debruyne, Regis; King, Christine E; Treangen, Todd; O'Rourke, Dennis; Salzberg, Steven L; Fisher, Daniel; MacPhee, Ross; Poinar, HendrikLate Pleistocene North America hosted at least two divergent and ecologically distinct species of mammoth: the periglacial woolly mammoth (Mammuthus primigenius) and the subglacial Columbian mammoth (Mammuthus columbi). To date, mammoth genetic research has been entirely restricted to woolly mammoths, rendering their genetic evolution difficult to contextualize within broader Pleistocene paleoecology and biogeography. Here, we take an interspecific approach to clarifying mammoth phylogeny by targeting Columbian mammoth remains for mitogenomic sequencing. We sequenced the first complete mitochondrial genome of a classic Columbian mammoth, as well as the first complete mitochondrial genome of a North American woolly mammoth. Somewhat contrary to conventional paleontological models, which posit that the two species were highly divergent, the M. columbi mitogenome we obtained falls securely within a subclade of endemic North American M. primigenius. Though limited, our data suggest that the two species interbred at some point in their evolutionary histories. One potential explanation is that woolly mammoth haplotypes entered Columbian mammoth populations via introgression at subglacial ecotones, a scenario with compelling parallels in extant elephants and consistent with certain regional paleontological observations. This highlights the need for multi-genomic data to sufficiently characterize mammoth evolutionary history. Our results demonstrate that the use of next-generation sequencing technologies holds promise in obtaining such data, even from non-cave, non-permafrost Pleistocene depositional contexts.Item TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions(Springer Nature, 2013-04-25) Kim, Daehwan; Pertea, Geo; Trapnell, Cole; Pimentel, Harold; Kelley, Ryan; Salzberg, Steven LTopHat is a popular spliced aligner for RNA-sequence (RNA-seq) experiments. In this paper, we describe TopHat2, which incorporates many significant enhancements to TopHat. TopHat2 can align reads of various lengths produced by the latest sequencing technologies, while allowing for variable-length indels with respect to the reference genome. In addition to de novo spliced alignment, TopHat2 can align reads across fusion breaks, which can occur after genomic translocations. TopHat2 combines the ability to identify novel splice sites with direct mapping to known transcripts, producing sensitive and accurate alignments, even for highly repetitive genomes or in the presence of pseudogenes. TopHat2 is available at http://ccb.jhu.edu/software/tophat .Item A whole-genome assembly of the domestic cow, Bos taurus(2009-04-29) Zimin, Aleksey V; Delcher, Arthur L; Florea, Liliana; Kelley, David R; Schatz, Michael C; Puiu, Daniela; Hanrahan, Finnian; Pertea, Geo; Van Tassell, Curtis P; Sonstegard, Tad S; Marcais, Guillaume; Roberts, Michael; Subramanian, Poorani; Yorke, James A; Salzberg, Steven LBackground: The genome of the domestic cow, Bos taurus, was sequenced using a mixture of hierarchical and whole-genome shotgun sequencing methods. Results: We have assembled the 35 million sequence reads and applied a variety of assembly improvement techniques, creating an assembly of 2.86 billion base pairs that has multiple improvements over previous assemblies: it is more complete, covering more of the genome; thousands of gaps have been closed; many erroneous inversions, deletions, and translocations have been corrected; and thousands of single-nucleotide errors have been corrected. Our evaluation using independent metrics demonstrates that the resulting assembly is substantially more accurate and complete than alternative versions. Conclusions: By using independent mapping data and conserved synteny between the cow and human genomes, we were able to construct an assembly with excellent large-scale contiguity in which a large majority (approximately 91%) of the genome has been placed onto the 30 B. taurus chromosomes. We constructed a new cow-human synteny map that expands upon previous maps. We also identified for the first time a portion of the B. taurus Y chromosome.Item Efficient oligonucleotide probe selection for pan-genomic tiling arrays(2009-09-16) Phillippy, Adam M; Deng, Xiangyu; Zhang, Wei; Salzberg, Steven LBackground: Array comparative genomic hybridization is a fast and cost-effective method for detecting, genotyping, and comparing the genomic sequence of unknown bacterial isolates. This method, as with all microarray applications, requires adequate coverage of probes targeting the regions of interest. An unbiased tiling of probes across the entire length of the genome is the most flexible design approach. However, such a whole-genome tiling requires that the genome sequence is known in advance. For the accurate analysis of uncharacterized bacteria, an array must query a fully representative set of sequences from the species' pan-genome. Prior microarrays have included only a single strain per array or the conserved sequences of gene families. These arrays omit potentially important genes and sequence variants from the pan-genome. Results: This paper presents a new probe selection algorithm (PanArray) that can tile multiple whole genomes using a minimal number of probes. Unlike arrays built on clustered gene families, PanArray uses an unbiased, probe-centric approach that does not rely on annotations, gene clustering, or multi-alignments. Instead, probes are evenly tiled across all sequences of the pangenome at a consistent level of coverage. To minimize the required number of probes, probes conserved across multiple strains in the pan-genome are selected first, and additional probes are used only where necessary to span polymorphic regions of the genome. The viability of the algorithm is demonstrated by array designs for seven different bacterial pan-genomes and, in particular, the design of a 385,000 probe array that fully tiles the genomes of 20 different Listeria monocytogenes strains with overlapping probes at greater than twofold coverage. Conclusion: PanArray is an oligonucleotide probe selection algorithm for tiling multiple genome sequences using a minimal number of probes. It is capable of fully tiling all genomes of a species on a single microarray chip. These unique pan-genome tiling arrays provide maximum flexibility for the analysis of both known and uncharacterized strains.