Comparative and Computational Methods for Microbial Genomics

dc.contributor.advisor: Salzberg, Steven L (en_US)
dc.contributor.advisor: Pop, Mihai (en_US)
dc.contributor.author: Wood, Derrick (en_US)
dc.contributor.department: Computer Science (en_US)
dc.contributor.publisher: Digital Repository at the University of Maryland (en_US)
dc.contributor.publisher: University of Maryland (College Park, Md.) (en_US)
dc.date.accessioned: 2014-06-24T05:58:48Z
dc.date.available: 2014-06-24T05:58:48Z
dc.date.issued: 2014 (en_US)
dc.description.abstract: Through the study of genomic sequences, researchers are able to learn much about the workings of life. As sequencing technology has improved over the past decade, the number of genomes that have been assembled has grown exponentially, and the amount of sequence generated by sequencing machines can easily number in the billions or even trillions of nucleotides for a single project. This rise in the amount of information present requires informatics approaches to correctly and efficiently analyze the data. One common approach has been to use comparative methods, which use sequence similarity to infer functional or evolutionary relationships between sequences. This dissertation uses comparative methods to improve existing records of genomic data, and introduces a novel computational approach to the problem of taxonomic sequence classification. The first part of this dissertation uses two approaches involving pairwise and multiple sequence alignments to find and correct errors in the public records of microbial genomes. Through alignment to sets of genes with known function, we show that thousands of genes have been mistakenly omitted from our public records. Our analysis of these genes shows a tendency for short genes to be omitted, and reveals that genes are more frequently omitted by organizations with less experience in annotating genomes. We also use multiple alignments of protein sequences to improve the annotation of start positions of genes, in some cases restoring hundreds of nucleotides to the genes' records. Through analysis of our results, we also found a link between a high use of rare start codons and a high rate of erroneously annotated start sites. The final part of this dissertation presents a method involving exact alignment of short sequences to perform rapid taxonomic sequence classification. By using the existing concept of minimizers to increase CPU cache utilization, we have created a tool capable of performing taxonomic classification with a sensitivity that is comparable to existing methods, a precision that surpasses all existing methods, and a speed that is over 900 times faster than the fastest existing classification approach. (en_US)
dc.identifier.uri: http://hdl.handle.net/1903/15260
dc.language.iso: en (en_US)
dc.subject.pqcontrolled: Computer science (en_US)
dc.subject.pqcontrolled: Bioinformatics (en_US)
dc.subject.pquncontrolled: genome annotation (en_US)
dc.subject.pquncontrolled: metagenomics (en_US)
dc.subject.pquncontrolled: microbial genomics (en_US)
dc.subject.pquncontrolled: sequence alignment (en_US)
dc.subject.pquncontrolled: taxonomic classification (en_US)
dc.title: Comparative and Computational Methods for Microbial Genomics (en_US)
dc.type: Dissertation (en_US)

Files

Original bundle
Name: Wood_umd_0117E_15060.pdf
Size: 1.33 MB
Format: Adobe Portable Document Format