Searching, clustering and evaluating biological sequences

Ghodsi, Mohammadreza

Searching, clustering and evaluating biological sequences

dc.contributor.advisor	Pop, Mihai	en_US
dc.contributor.author	Ghodsi, Mohammadreza	en_US
dc.contributor.department	Computer Science	en_US
dc.contributor.publisher	Digital Repository at the University of Maryland	en_US
dc.contributor.publisher	University of Maryland (College Park, Md.)	en_US
dc.date.accessioned	2012-10-11T05:54:45Z
dc.date.available	2012-10-11T05:54:45Z
dc.date.issued	2012	en_US
dc.description.abstract	The latest generation of biological sequencing technologies have made it possible to generate sequence data faster and cheaper than ever before. The growth of sequence data has been exponential, and so far, has outpaced the rate of improvement of computer speed and capacity. This rate of growth, however, makes analysis of new datasets increasingly difficult, and highlights the need for efficient, scalable and modular software tools. Fortunately most types of analysis of sequence data involve a few fundamental operations. Here we study three such problems, namely searching for local alignments between two sets of sequences, clustering sequences, and evaluating the assemblies made from sequence fragments. We present simple and efficient heuristic algorithms for these problems, as well as open source software tools which implement these algorithms. First, we present approximate seeds; a new type of seed for local alignment search. Approximate seeds are a generalization of exact seeds and spaced seeds, in that they allow for insertions and deletions within the seed. We prove that approximate seeds are completely sensitive. We also show how to efficiently find approximate seeds using a suffix array index of the sequences. Next, we present DNACLUST; a tool for clustering millions of DNA sequence fragments. Although DNACLUST has been primarily made for clustering 16S ribosomal RNA sequences, it can be used for other tasks, such as removing duplicate or near duplicate sequences from a dataset. Finally, we present a framework for comparing (two or more) assemblies built from the same set of reads. Our evaluation requires the set of reads and the assemblies only, and does not require the true genome sequence. Therefore our method can be used in de novo assembly projects, where the true genome is not known. Our score is based on probability theory, and the true genome is expected to obtain the maximum score.	en_US
dc.identifier.uri	http://hdl.handle.net/1903/13186
dc.subject.pqcontrolled	Computer science	en_US
dc.subject.pqcontrolled	Bioinformatics	en_US
dc.subject.pquncontrolled	clustering	en_US
dc.subject.pquncontrolled	genome assembly	en_US
dc.subject.pquncontrolled	string matching	en_US
dc.title	Searching, clustering and evaluating biological sequences	en_US
dc.type	Dissertation	en_US

Files

Original bundle

Now showing 1 - 1 of 1

Name:: Ghodsi_umd_0117E_13470.pdf
Size:: 686.45 KB
Format:: Adobe Portable Document Format

Download

Collections

UMD Theses and Dissertations
Computer Science Theses and Dissertations