Computer Science Research Works

Permanent URI for this collectionhttp://hdl.handle.net/1903/1593

Browse

Search Results

Now showing 1 - 10 of 20
  • Item
    Use and mis-use of supplementary material in science publications
    (Springer Nature, 2015-11-03) Pop, Mihai; Salzberg, Steven L.
    Supplementary material is a ubiquitous feature of scientific articles, particularly in journals that limit the length of the articles. While the judicious use of supplementary material can improve the readability of scientific articles, its excessive use threatens the scientific review process and by extension the integrity of the scientific literature. In many cases supplementary material today is so extensive that it is reviewed superficially or not at all. Furthermore, citations buried within supplementary files rob other scientists of recognition of their contribution to the scientific record. These issues are exacerbated by the lack of guidance on the use of supplementary information from the journals to authors and reviewers. We propose that the removal of artificial length restrictions plus the use of interactive features made possible by modern electronic media can help to alleviate these problems. Many journals, in fact, have already removed article length limitations (as is the case for BMC Bioinformatics and other BioMed Central journals). We hope that the issues raised in our article will encourage publishers and scientists to work together towards a better use of supplementary information in scientific publishing.
  • Item
    A clustering method for repeat analysis in DNA sequences.
    (Genome Biology, 2001-08-01) Volfovsky, Natalia; Haas, Brian J.; Salzberg, Steven L.
    Background: A computational system for analysis of the repetitive structure of genomic sequences is described. The method uses suffix trees to organize and search the input sequences; this data structure has been used previously for efficient computation of exact and degenerate repeats. Results: The resulting software tool collects all repeat classes and outputs summary statistics as well as a file containing multiple sequences (multi fasta), that can be used as the target of searches. Its use is demonstrated here on several complete microbial genomes, the entire Arabidopsis thaliana genome, and a large collection of rice bacterial artificial chromosome end sequences. Conclusions: We propose a new clustering method for analysis of the repeat data captured in suffix trees. This method has been incorporated into a system that can find repeats in individual genome sequences or sets of sequences, and that can organize those repeats into classes. It quickly and accurately creates repeat databases from small and large genomes. The associated software (RepeatFinder), should prove helpful in the analysis of repeat structure for both complete and partial genome sequences.
  • Item
    Full-length messenger RNA sequences greatly improve genome annotation
    (Genome Biology, 2002-05-30) Haas, Brian J; Volfovsky, Natalia; Town, Christopher D; Troukhan, Maxim; Alexandrov, Nickolai; Feldman, Kenneth A; Flavell, Richard B; White, Owen; Salzberg, Steven L.
    Background: Annotation of eukaryotic genomes is a complex endeavor that requires the integration of evidence from multiple, often contradictory, sources. With the ever-increasing amount of genome sequence data now available, methods for accurate identification of large numbers of genes have become urgently needed. In an effort to create a set of very high-quality gene models, we used the sequence of 5,000 full-length gene transcripts from Arabidopsis to re-annotate its genome. We have mapped these transcripts to their exact chromosomal locations and, using alignment programs, have created gene models that provide a reference set for this organism. Results: Approximately 35% of the transcripts indicated that previously annotated genes needed modification, and 5% of the transcripts represented newly discovered genes. We also discovered that multiple transcription initiation sites appear to be much more common than previously known, and we report numerous cases of alternative mRNA splicing. We include a comparison of different alignment software and an analysis of how the transcript data improved the previously published annotation. Conclusions: Our results demonstrate that sequencing of large numbers of full-length transcripts followed by computational mapping greatly improves identification of the complete exon structures of eukaryotic genes. In addition, we are able to find numerous introns in the untranslated regions of the genes.
  • Item
    Evidence for symmetric chromosomal inversions around the replication origin in bacteria
    (Genome Biology, 2000-12-04) Eisen, Jonathan A.; Heidelberg, John F.; White, Owen; Salzberg, Steven L.
    Background: Whole-genome comparisons can provide great insight into many aspects of biology. Until recently, however, comparisons were mainly possible only between distantly related species. Complete genome sequences are now becoming available from multiple sets of closely related strains or species. Results: By comparing the recently completed genome sequences of Vibrio cholerae, Streptococcus pneumoniae and Mycobacterium tuberculosis to those of closely related species - Escherichia coli, Streptococcus pyogenes and Mycobacterium leprae, respectively - we have identified an unusual and previously unobserved feature of bacterial genome structure. Scatterplots of the conserved sequences (both DNA and protein) between each pair of species produce a distinct X-shaped pattern, which we call an X-alignment. The key feature of these alignments is that they have symmetry around the replication origin and terminus; that is, the distance of a particular conserved feature (DNA or protein) from the replication origin (or terminus) is conserved between closely related pairs of species. Statistically significant X-alignments are also found within some genomes, indicating that there is symmetry about the replication origin for paralogous features as well. Conclusions: The most likely mechanism of generation of X-alignments involves large chromosomal inversions that reverse the genomic sequence symmetrically around the origin of replication. The finding of these X-alignments between many pairs of species suggests that chromosomal inversions around the origin are a common feature of bacterial genome evolution.
  • Item
    Method for Identifying Splice Sites and Translational Start Sites in Eukaryotic mRNA
    (Computer Applications in the Biosciences (CABIOS), 1997) Salzberg, Steven L.
    This paper describes a new method for determining the consensus sequences that signal the start of donor translation and the boundaries between exons and introns (donor and acceptor sites) in eukaryotic mRNA. The method takes into account the dependencies between adjacent bases, in contrast to the usual technique of considering each position independently. When coupled with a dynamic program to compute the most likely sequence, new consensus sequences emerge. The consensus sequence information is summarized in conditional probability matrices which, when used to locate signals in uncharacterized genomic DNA, have greater sensitivity and specificity than conventional matrices. Species-specific versions of these matrices are especially effective at distinguishing true and false sites.
  • Item
    Versatile and open software for comparing large genomes
    (Genome Biology, 2004-01-30) Kurtz, Stefan; Phillippy, Adam; Delcher, Arthur L.; Smoot, Michael; Shumway, Martin; Antonescu, Corina; Salzberg, Steven L.
    The newest version of MUMmer easily handles comparisons of large eukaryotic genomes at varying evolutionary distances, as demonstrated by applications to multiple genomes. Two new graphical viewing tools provide alternative ways to analyze genome alignments. The new system is the first version of MUMmer to be released as open-source software. This allows other developers to contribute to the code base and freely redistribute the code. The MUMmer sources are available at http://www.tigr.org/software/mummer.
  • Item
    An empirical analysis of training protocols for probabilistic gene finders
    (BMC Bioinformatics, 2004-12-21) Majoros, William H.; Salzberg, Steven L.
    Background: Generalized hidden Markov models (GHMMs) appear to be approaching acceptance as a de facto standard for state-of-the-art ab initio gene finding, as evidenced by the recent proliferation of GHMM implementations. While prevailing methods for modeling and parsing genes using GHMMs have been described in the literature, little attention has been paid as of yet to their proper training. The few hints available in the literature together with anecdotal observations suggest that most practitioners perform maximum likelihood parameter estimation only at the local submodel level, and then attend to the optimization of global parameter structure using some form of ad hoc manual tuning of individual parameters. Results: We decided to investigate the utility of applying a more systematic optimization approach to the tuning of global parameter structure by implementing a global discriminative training procedure for our GHMM-based gene finder. Our results show that significant improvement in prediction accuracy can be achieved by this method. Conclusions: We conclude that training of GHMM-based gene finders is best performed using some form of discriminative training rather than simple maximum likelihood estimation at the submodel level, and that generalized gradient ascent methods are suitable for this task. We also conclude that partitioning of training data for the twin purposes of maximum likelihood initialization and gradient ascent optimization appears to be unnecessary, but that strict segregation of test data must be enforced during final gene finder evaluation to avoid artificially inflated accuracy measurements.
  • Item
    Efficient decoding algorithms for generalized hidden Markov model gene finders
    (BMC Bioinformatics, 2005-01-24) Majoros, William H.; Pertea, Mihaela; Delcher, Arthur L.; Salzberg, Steven L.
    Background: The Generalized Hidden Markov Model (GHMM) has proven a useful framework for the task of computational gene prediction in eukaryotic genomes, due to its flexibility and probabilistic underpinnings. As the focus of the gene finding community shifts toward the use of homology information to improve prediction accuracy, extensions to the basic GHMM model are being explored as possible ways to integrate this homology information into the prediction process. Particularly prominent among these extensions are those techniques which call for the simultaneous prediction of genes in two or more genomes at once, thereby increasing significantly the computational cost of prediction and highlighting the importance of speed and memory efficiency in the implementation of the underlying GHMM algorithms. Unfortunately, the task of implementing an efficient GHMM-based gene finder is already a nontrivial one, and it can be expected that this task will only grow more onerous as our models increase in complexity. Results: As a first step toward addressing the implementation challenges of these next-generation systems, we describe in detail two software architectures for GHMM-based gene finders, one comprising the common array-based approach, and the other a highly optimized algorithm which requires significantly less memory while achieving virtually identical speed. We then show how both of these architectures can be accelerated by a factor of two by optimizing their content sensors. We finish with a brief illustration of the impact these optimizations have had on the feasibility of our new homology-based gene finder, TWAIN. Conclusions: In describing a number of optimizations for GHMM-based gene finders and making available two complete open-source software systems embodying these methods, it is our hope that others will be more enabled to explore promising extensions to the GHMM framework, thereby improving the state-of-the-art in gene prediction techniques.
  • Item
    JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions
    (Genome Biology, 2006-08-07) Allen, Jonathan E.; Majoros, William H.; Pertea, Mihaela; Salzberg, Steven L.
    Background: Predicting complete protein-coding genes in human DNA remains a significant challenge. Though a number of promising approaches have been investigated, an ideal suite of tools has yet to emerge that can provide near perfect levels of sensitivity and specificity at the level of whole genes. As an incremental step in this direction, it is hoped that controlled gene finding experiments in the ENCODE regions will provide a more accurate view of the relative benefits of different strategies for modeling and predicting gene structures. Results: Here we describe our general-purpose eukaryotic gene finding pipeline and its major components, as well as the methodological adaptations that we found necessary in accommodating human DNA in our pipeline, noting that a similar level of effort may be necessary by ourselves and others with similar pipelines whenever a new class of genomes is presented to the community for analysis. We also describe a number of controlled experiments involving the differential inclusion of various types of evidence and feature states into our models and the resulting impact these variations have had on predictive accuracy. Conclusions: While in the case of the non-comparative gene finders we found that adding model states to represent specific biological features did little to enhance predictive accuracy, for our evidence-based ‘combiner’ program the incorporation of additional evidence tracks tended to produce significant gains in accuracy for most evidence types, suggesting that improved modeling efforts at the hidden Markov model level are of relatively little value. We relate these findings to our current plans for future research.
  • Item
    Minimus: a fast, lightweight genome assembler
    (BMC Bioinformatics, 2007-02-26) Sommer, Daniel D.; Delcher, Arthur L.; Salzberg, Steven L.; Pop, Mihai
    Background: Genome assemblers have grown very large and complex in response to the need for algorithms to handle the challenges of large whole-genome sequencing projects. Many of the most common uses of assemblers, however, are best served by a simpler type of assembler that requires fewer software components, uses less memory, and is far easier to install and run. Results: We have developed the Minimus assembler to address these issues, and tested it on a range of assembly problems. We show that Minimus performs well on several small assembly tasks, including the assembly of viral genomes, individual genes, and BAC clones. In addition, we evaluate Minimus' performance in assembling bacterial genomes in order to assess its suitability as a component of a larger assembly pipeline. We show that, unlike other software currently used for these tasks, Minimus produces significantly fewer assembly errors, at the cost of generating a more fragmented assembly. Conclusion: We find that for small genomes and other small assembly tasks, Minimus is faster and far more flexible than existing tools. Due to its small size and modular design Minimus is perfectly suited to be a component of complex assembly pipelines. Minimus is released as an open-source software project and the code is available as part of the AMOS project at Sourceforge.