Searching, clustering and evaluating biological sequences
Files
Publication or External Link
Date
Authors
Advisor
Citation
DRUM DOI
Abstract
The latest generation of biological sequencing technologies have made
it possible to generate sequence data faster and cheaper than ever
before. The growth of sequence data has been exponential, and so far,
has outpaced the rate of improvement of computer speed and capacity.
This rate of growth, however, makes analysis of new datasets
increasingly difficult, and highlights the need for efficient,
scalable and modular software tools.
Fortunately most types of analysis of sequence data involve a few
fundamental operations. Here we study three such problems, namely
searching for local alignments between two sets of sequences,
clustering sequences, and evaluating the assemblies made from sequence
fragments. We present simple and efficient heuristic algorithms for
these problems, as well as open source software tools which implement
these algorithms.
First, we present approximate seeds; a new type of seed for local
alignment search. Approximate seeds are a generalization of exact
seeds and spaced seeds, in that they allow for insertions and
deletions within the seed. We prove that approximate seeds are
completely sensitive. We also show how to efficiently find approximate
seeds using a suffix array index of the sequences.
Next, we present DNACLUST; a tool for clustering millions of DNA
sequence fragments. Although DNACLUST has been primarily made for
clustering 16S ribosomal RNA sequences, it can be used for other
tasks, such as removing duplicate or near duplicate sequences from a
dataset.
Finally, we present a framework for comparing (two or more) assemblies
built from the same set of reads. Our evaluation requires the set of
reads and the assemblies only, and does not require the true genome
sequence. Therefore our method can be used in de novo assembly
projects, where the true genome is not known. Our score is based on
probability theory, and the true genome is expected to obtain the
maximum score.