Comparative and Computational Methods for Microbial Genomics

Wood, Derrick

dc.contributor.advisor	Salzberg, Steven L	en_US
dc.contributor.advisor	Pop, Mihai	en_US
dc.contributor.author	Wood, Derrick	en_US
dc.contributor.department	Computer Science	en_US
dc.contributor.publisher	Digital Repository at the University of Maryland	en_US
dc.contributor.publisher	University of Maryland (College Park, Md.)	en_US
dc.date.accessioned	2014-06-24T05:58:48Z
dc.date.available	2014-06-24T05:58:48Z
dc.date.issued	2014	en_US
dc.description.abstract	Through the study of genomic sequences, researchers are able to learn much about the workings of life. As sequencing technology has improved over the past decade, the number of genomes that have been assembled has grown exponentially, and the amount of sequence generated by sequencing machines can easily number in the billions or even trillions of nucleotides for a single project. This rise in the amount of information present requires informatics approaches to correctly and efficiently analyze the data. One common approach has been to use comparative methods, which use sequence similarity to infer functional or evolutionary relationships between sequences. This dissertation uses comparative methods to improve existing records of genomic data, and introduces a novel computational approach to the problem of taxonomic sequence classification. The first part of this dissertation uses two approaches involving pairwise and multiple sequence alignments to find and correct errors in the public records of microbial genomes. Through alignment to sets of genes with known function, we show that thousands of genes have been mistakenly omitted from our public records. Our analysis of these genes shows a tendency for short genes to be omitted, and reveals that genes are more frequently omitted by organizations with less experience in annotating genomes. We also use multiple alignments of protein sequences to improve the annotation of start positions of genes, in some cases restoring hundreds of nucleotides to the genes' records. Through analysis of our results, we also found a link between a high use of rare start codons and a high rate of erroneously annotated start sites. The final part of this dissertation presents a method involving exact alignment of short sequences to perform rapid taxonomic sequence classification. By using the existing concept of minimizers to increase CPU cache utilization, we have created a tool capable of performing taxonomic classification with a sensitivity that is comparable to existing methods, a precision that surpasses all existing methods, and a speed that is over 900 times faster than the fastest existing classification approach.	en_US
dc.identifier.uri	http://hdl.handle.net/1903/15260
dc.language.iso	en	en_US
dc.subject.pqcontrolled	Computer science	en_US
dc.subject.pqcontrolled	Bioinformatics	en_US
dc.subject.pquncontrolled	genome annotation	en_US
dc.subject.pquncontrolled	metagenomics	en_US
dc.subject.pquncontrolled	microbial genomics	en_US
dc.subject.pquncontrolled	sequence alignment	en_US
dc.subject.pquncontrolled	taxonomic classification	en_US
dc.title	Comparative and Computational Methods for Microbial Genomics	en_US
dc.type	Dissertation	en_US

Files

Original bundle

Name: Wood_umd_0117E_15060.pdf
Size: 1.33 MB
Format: Adobe Portable Document Format

Collections

UMD Theses and Dissertations
Computer Science Theses and Dissertations