METHODS FOR HIGH-THROUGHPUT COMPARATIVE GENOMICS AND DISTRIBUTED SEQUENCE ANALYSIS

Angiuoli, Samuel Vincent

METHODS FOR HIGH-THROUGHPUT COMPARATIVE GENOMICS AND DISTRIBUTED SEQUENCE ANALYSIS

dc.contributor.advisor	Salzberg, Steven L	en_US
dc.contributor.author	Angiuoli, Samuel Vincent	en_US
dc.contributor.department	Computer Science	en_US
dc.contributor.publisher	Digital Repository at the University of Maryland	en_US
dc.contributor.publisher	University of Maryland (College Park, Md.)	en_US
dc.date.accessioned	2011-07-07T05:39:32Z
dc.date.available	2011-07-07T05:39:32Z
dc.date.issued	2011	en_US
dc.description.abstract	High-throughput sequencing has accelerated applications of genomics throughout the world. The increased production and decentralization of sequencing has also created bottlenecks in computational analysis. In this dissertation, I provide novel computational methods to improve analysis throughput in three areas: whole genome multiple alignment, pan-genome annotation, and bioinformatics workflows. To aid in the study of populations, tools are needed that can quickly compare multiple genome sequences, millions of nucleotides in length. I present a new multiple alignment tool for whole genomes, named Mugsy, that implements a novel method for identifying syntenic regions. Mugsy is computationally efficient, does not require a reference genome, and is robust in identifying a rich complement of genetic variation including duplications, rearrangements, and large-scale gain and loss of sequence in mixtures of draft and completed genome data. Mugsy is evaluated on the alignment of several dozen bacterial chromosomes on a single computer and was the fastest program evaluated for the alignment of assembled human chromosome sequences from four individuals. A distributed version of the algorithm is also described and provides increased processing throughput using multiple CPUs. Numerous individual genomes are sequenced to study diversity, evolution and classify pan-genomes. Pan-genome annotations contain inconsistencies and errors that hinder comparative analysis, even within a single species. I introduce a new tool, Mugsy-Annotator, that identifies orthologs and anomalous gene structure across a pan-genome using whole genome multiple alignments. Identified anomalies include inconsistently located translation initiation sites and disrupted genes due to draft genome sequencing or pseudogenes. An evaluation of pan-genomes indicates that such anomalies are common and alternative annotations suggested by the tool can improve annotation consistency and quality. Finally, I describe the Cloud Virtual Resource, CloVR, a desktop application for automated sequence analysis that improves usability and accessibility of bioinformatics software and cloud computing resources. CloVR is installed on a personal computer as a virtual machine and requires minimal installation, addressing challenges in deploying bioinformatics workflows. CloVR also seamlessly accesses remote cloud computing resources for improved processing throughput. In a case study, I demonstrate the portability and scalability of CloVR and evaluate the costs and resources for microbial sequence analysis.	en_US
dc.identifier.uri	http://hdl.handle.net/1903/11669
dc.subject.pqcontrolled	Bioinformatics	en_US
dc.subject.pquncontrolled	Cloud computing	en_US
dc.subject.pquncontrolled	Comparative genomics	en_US
dc.subject.pquncontrolled	Multiple sequence alignment	en_US
dc.title	METHODS FOR HIGH-THROUGHPUT COMPARATIVE GENOMICS AND DISTRIBUTED SEQUENCE ANALYSIS	en_US
dc.type	Dissertation	en_US

Files

Original bundle

Now showing 1 - 1 of 1

Name:: Angiuoli_umd_0117E_12062.pdf
Size:: 3.07 MB
Format:: Adobe Portable Document Format

Download

Collections

UMD Theses and Dissertations
Computer Science Theses and Dissertations