Computer Science Research Works

Permanent URI for this collection: http://hdl.handle.net/1903/1593

  • Item
    JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions
    (Genome Biology, 2006-08-07) Allen, Jonathan E.; Majoros, William H.; Pertea, Mihaela; Salzberg, Steven L.
    Background: Predicting complete protein-coding genes in human DNA remains a significant challenge. Though a number of promising approaches have been investigated, an ideal suite of tools has yet to emerge that can provide near perfect levels of sensitivity and specificity at the level of whole genes. As an incremental step in this direction, it is hoped that controlled gene finding experiments in the ENCODE regions will provide a more accurate view of the relative benefits of different strategies for modeling and predicting gene structures. Results: Here we describe our general-purpose eukaryotic gene finding pipeline and its major components, as well as the methodological adaptations that we found necessary in accommodating human DNA in our pipeline, noting that a similar level of effort may be necessary by ourselves and others with similar pipelines whenever a new class of genomes is presented to the community for analysis. We also describe a number of controlled experiments involving the differential inclusion of various types of evidence and feature states into our models and the resulting impact these variations have had on predictive accuracy. Conclusions: While in the case of the non-comparative gene finders we found that adding model states to represent specific biological features did little to enhance predictive accuracy, for our evidence-based ‘combiner’ program the incorporation of additional evidence tracks tended to produce significant gains in accuracy for most evidence types, suggesting that improved modeling efforts at the hidden Markov model level are of relatively little value. We relate these findings to our current plans for future research.
  • Item
    Macronuclear Genome Sequence of the Ciliate Tetrahymena thermophila, a Model Eukaryote
    (PLoS Biology, 2006) Eisen, Jonathan A.; Coyne, Robert S.; Wu, Martin; Wu, Dongying; Thiagarajan, Mathangi; Wortman, Jennifer R.; Badger, Jonathan H.; Ren, Qinghu; Amedeo, Paolo; Jones, Kristie M.; Tallon, Luke J.; Delcher, Arthur L.; Salzberg, Steven L.; Silva, Joana C.; Haas, Brian J.; Majoros, William H.; Farzad, Maryam; Carlton, Jane M.; Smith, Robert K. Jr.; Garg, Jyoti; Pearlman, Ronald E.; Karrer, Kathleen M.; Sun, Lei; Manning, Gerard; Elde, Nels C.; Turkewitz, Aaron P.; Asai, David J.; Wilkes, David E.; Wang, Yufeng; Cai, Hong; Collins, Kathleen; Stewart, B. Andrew; Lee, Suzanne R.; Wilamowska, Katarzyna; Weinberg, Zasha; Ruzzo, Walter L.; Wloga, Dorota; Gaertig, Jacek; Frankel, Joseph; Tsao, Che-Chia; Gorovsky, Martin A.; Keeling, Patrick J.; Waller, Ross F.; Patron, Nicola J.; Cherry, J. Michael; Stover, Nicholas A.; Krieger, Cynthia J.; del Toro, Christina; Ryder, Hilary F.; Williamson, Sondra C.; Barbeau, Rebecca A.; Hamilton, Eileen P.; Orias, Eduardo
    The ciliate Tetrahymena thermophila is a model organism for molecular and cellular biology. Like other ciliates, this species has separate germline and soma functions that are embodied by distinct nuclei within a single cell. The germline-like micronucleus (MIC) has its genome held in reserve for sexual reproduction. The soma-like macronucleus (MAC), which possesses a genome processed from that of the MIC, is the center of gene expression and does not directly contribute DNA to sexual progeny. We report here the shotgun sequencing, assembly, and analysis of the MAC genome of T. thermophila, which is approximately 104 Mb in length and composed of approximately 225 chromosomes. Overall, the gene set is robust, with more than 27,000 predicted protein-coding genes, 15,000 of which have strong matches to genes in other organisms. The functional diversity encoded by these genes is substantial and reflects the complexity of processes required for a free-living, predatory, single-celled organism. This is highlighted by the abundance of lineage-specific duplications of genes with predicted roles in sensing and responding to environmental conditions (e.g., kinases), using diverse resources (e.g., proteases and transporters), and generating structural complexity (e.g., kinesins and dyneins). In contrast to the other lineages of alveolates (apicomplexans and dinoflagellates), no compelling evidence could be found for plastid-derived genes in the genome. UGA, the only T. thermophila stop codon, is used in some genes to encode selenocysteine, thus making this organism the first known with the potential to translate all 64 codons in nuclear genes into amino acids. We present genomic evidence supporting the hypothesis that the excision of DNA from the MIC to generate the MAC specifically targets foreign DNA as a form of genome self-defense. 
    The combination of the genome sequence, the functional diversity encoded therein, and the presence of some pathways missing from other model organisms makes T. thermophila an ideal model for functional genomic studies to address biological, biomedical, and biotechnological questions of fundamental importance.