Computer Science Research Works

Permanent URI for this collection: http://hdl.handle.net/1903/1593

  • Item
    Efficient decoding algorithms for generalized hidden Markov model gene finders
    (BMC Bioinformatics, 2005-01-24) Majoros, William H.; Pertea, Mihaela; Delcher, Arthur L.; Salzberg, Steven L.
    Background: The Generalized Hidden Markov Model (GHMM) has proven a useful framework for the task of computational gene prediction in eukaryotic genomes, due to its flexibility and probabilistic underpinnings. As the focus of the gene finding community shifts toward the use of homology information to improve prediction accuracy, extensions to the basic GHMM model are being explored as possible ways to integrate this homology information into the prediction process. Particularly prominent among these extensions are those techniques which call for the simultaneous prediction of genes in two or more genomes at once, thereby increasing significantly the computational cost of prediction and highlighting the importance of speed and memory efficiency in the implementation of the underlying GHMM algorithms. Unfortunately, the task of implementing an efficient GHMM-based gene finder is already a nontrivial one, and it can be expected that this task will only grow more onerous as our models increase in complexity. Results: As a first step toward addressing the implementation challenges of these next-generation systems, we describe in detail two software architectures for GHMM-based gene finders, one comprising the common array-based approach, and the other a highly optimized algorithm which requires significantly less memory while achieving virtually identical speed. We then show how both of these architectures can be accelerated by a factor of two by optimizing their content sensors. We finish with a brief illustration of the impact these optimizations have had on the feasibility of our new homology-based gene finder, TWAIN. Conclusions: In describing a number of optimizations for GHMM-based gene finders and making available two complete open-source software systems embodying these methods, it is our hope that others will be more enabled to explore promising extensions to the GHMM framework, thereby improving the state-of-the-art in gene prediction techniques.
  • Item
    Minimus: a fast, lightweight genome assembler
    (BMC Bioinformatics, 2007-02-26) Sommer, Daniel D.; Delcher, Arthur L.; Salzberg, Steven L.; Pop, Mihai
    Background: Genome assemblers have grown very large and complex in response to the need for algorithms to handle the challenges of large whole-genome sequencing projects. Many of the most common uses of assemblers, however, are best served by a simpler type of assembler that requires fewer software components, uses less memory, and is far easier to install and run. Results: We have developed the Minimus assembler to address these issues, and tested it on a range of assembly problems. We show that Minimus performs well on several small assembly tasks, including the assembly of viral genomes, individual genes, and BAC clones. In addition, we evaluate Minimus' performance in assembling bacterial genomes in order to assess its suitability as a component of a larger assembly pipeline. We show that, compared with other software currently used for these tasks, Minimus produces significantly fewer assembly errors, at the cost of generating a more fragmented assembly. Conclusion: We find that for small genomes and other small assembly tasks, Minimus is faster and far more flexible than existing tools. Due to its small size and modular design, Minimus is perfectly suited to be a component of complex assembly pipelines. Minimus is released as an open-source software project and the code is available as part of the AMOS project at SourceForge.
  • Item
    Macronuclear Genome Sequence of the Ciliate Tetrahymena thermophila, a Model Eukaryote
    (PLoS Biology, 2006) Eisen, Jonathan A.; Coyne, Robert S.; Wu, Martin; Wu, Dongying; Thiagarajan, Mathangi; Wortman, Jennifer R.; Badger, Jonathan H.; Ren, Qinghu; Amedeo, Paolo; Jones, Kristie M.; Tallon, Luke J.; Delcher, Arthur L.; Salzberg, Steven L.; Silva, Joana C.; Haas, Brian J.; Majoros, William H.; Farzad, Maryam; Carlton, Jane M.; Smith, Robert K. Jr.; Garg, Jyoti; Pearlman, Ronald E.; Karrer, Kathleen M.; Sun, Lei; Manning, Gerard; Elde, Nels C.; Turkewitz, Aaron P.; Asai, David J.; Wilkes, David E.; Wang, Yufeng; Cai, Hong; Collins, Kathleen; Stewart, B. Andrew; Lee, Suzanne R.; Wilamowska, Katarzyna; Weinberg, Zasha; Ruzzo, Walter L.; Wloga, Dorota; Gaertig, Jacek; Frankel, Joseph; Tsao, Che-Chia; Gorovsky, Martin A.; Keeling, Patrick J.; Waller, Ross F.; Patron, Nicola J.; Cherry, J. Michael; Stover, Nicholas A.; Krieger, Cynthia J.; del Toro, Christina; Ryder, Hilary F.; Williamson, Sondra C.; Barbeau, Rebecca A.; Hamilton, Eileen P.; Orias, Eduardo
    The ciliate Tetrahymena thermophila is a model organism for molecular and cellular biology. Like other ciliates, this species has separate germline and soma functions that are embodied by distinct nuclei within a single cell. The germline-like micronucleus (MIC) has its genome held in reserve for sexual reproduction. The soma-like macronucleus (MAC), which possesses a genome processed from that of the MIC, is the center of gene expression and does not directly contribute DNA to sexual progeny. We report here the shotgun sequencing, assembly, and analysis of the MAC genome of T. thermophila, which is approximately 104 Mb in length and composed of approximately 225 chromosomes. Overall, the gene set is robust, with more than 27,000 predicted protein-coding genes, 15,000 of which have strong matches to genes in other organisms. The functional diversity encoded by these genes is substantial and reflects the complexity of processes required for a free-living, predatory, single-celled organism. This is highlighted by the abundance of lineage-specific duplications of genes with predicted roles in sensing and responding to environmental conditions (e.g., kinases), using diverse resources (e.g., proteases and transporters), and generating structural complexity (e.g., kinesins and dyneins). In contrast to the other lineages of alveolates (apicomplexans and dinoflagellates), no compelling evidence could be found for plastid-derived genes in the genome. UGA, the only T. thermophila stop codon, is used in some genes to encode selenocysteine, thus making this organism the first known with the potential to translate all 64 codons in nuclear genes into amino acids. We present genomic evidence supporting the hypothesis that the excision of DNA from the MIC to generate the MAC specifically targets foreign DNA as a form of genome self-defense. 
    The combination of the genome sequence, the functional diversity encoded therein, and the presence of some pathways missing from other model organisms makes T. thermophila an ideal model for functional genomic studies to address biological, biomedical, and biotechnological questions of fundamental importance.