ALGORITHMS AND DATA STRUCTURES FOR INDEXING, QUERYING, AND ANALYZING LARGE COLLECTIONS OF SEQUENCING DATA IN THE PRESENCE OR ABSENCE OF A REFERENCE

dc.contributor.advisorPatro, Roberten_US
dc.contributor.authorAlmodaresi, Fatemehen_US
dc.contributor.departmentComputer Scienceen_US
dc.contributor.publisherDigital Repository at the University of Marylanden_US
dc.contributor.publisherUniversity of Maryland (College Park, Md.)en_US
dc.date.accessioned2020-09-24T05:31:21Z
dc.date.available2020-09-24T05:31:21Z
dc.date.issued2020en_US
dc.description.abstractHigh-throughput sequencing has helped to transform our study of biological organisms and processes. For example, RNA-seq is one popular sequencing assay that allows measuring dynamic transcriptomes and enables the discovery (via assem- bly) of novel transcripts. Likewise, metagenomic sequencing lets us probe natural environments to profile organismal diversity and to discover new strains and species that may be integral to the environment or process being studied. The vast amount of available sequencing data, and its growth rate over the past decade also brings with it some immense computational challenges. One of these is how do we design memory-efficient structures for indexing and querying this data. This challenge is not limited to only raw sequencing data (i.e. reads) but also to the growing collection of reference sequences (genomes, and genes) that are assembled from this raw data. We have developed new data structures (both reference-based and reference-free) to index raw sequencing data and assembled reference sequences. Specifically, we describe three separate indices, “Pufferfish”, an index over a set of genomes or tran- scriptomes, and “Rainbowfish” and “Mantis” which are both indices for indexing a set of raw sequencing data sets. All of these indices are designed with consideration of support for high query performance and memory efficient construction and query. The Pufferfish data structure is based on constructing a compacted, colored, reference de Bruijn graph (ccdbg), and then indexing this structure in an efficient manner. We have developed both sparse and dense indexing schemes which allow trading index space for query speed (though queries always remain asymptotically optimal). Pufferfish provides a full reference index that can return the set of refer- ences, positions and orientations of any k-mer (substring of fixed length “k” ) in the input genomes. We have built an alignment tool, Puffaligner, around this index for aligning sequencing reads to reference sequences. We demonstrate that Puffaligner is able to produce highly-sensitive alignments, similar to those of Bowtie2, but much more quickly and exhibits speed similar to the ultrafast STAR aligner while requiring considerably less memory to construct its index and align reads. The Rainbowfish and Mantis data structures, on the other hand, are based on reference-free colored de Bruijn graphs (cdbg) constructed over raw sequencing data. Rainbowfish introduces a new efficient representation of the color information which is then adopted and refined by Mantis. Mantis supports graph traversal and other topological analyses, but is also particularly well-suited for large-scale sequence-level search over thousands of samples. We develop multiple and successively-refined versions of the Mantis index, culminating in an index that adopts a minimizer- partitioned representation of the underlying k-mer set and a referential encoding of the color information that exploits fast near-neighbor search and efficient encoding via a minimum spanning tree. We describe, further, how this index can be made incrementally updatable by developing an efficient merge algorithm and storing the overall index in a multi-level log-structured merge (LSM) tree. We demonstrate the utility of this index by building a searchable Mantis, via recursive merging, over 10,000 raw sequencing samples, which we then scale to over 15,000 samples via incremental update. This index can be queried, on a commodity server, to discover the samples likely containing thousands of reference sequences in only a few minutes.en_US
dc.identifierhttps://doi.org/10.13016/et1e-wjqp
dc.identifier.urihttp://hdl.handle.net/1903/26394
dc.language.isoenen_US
dc.subject.pqcontrolledComputer scienceen_US
dc.subject.pqcontrolledBioinformaticsen_US
dc.subject.pquncontrolledde bruijn graphen_US
dc.subject.pquncontrolledhigh performance and efficient indexingen_US
dc.subject.pquncontrolledmantisen_US
dc.subject.pquncontrolledpufferfishen_US
dc.subject.pquncontrolledsequence alignmenten_US
dc.subject.pquncontrolledsequence indexingen_US
dc.titleALGORITHMS AND DATA STRUCTURES FOR INDEXING, QUERYING, AND ANALYZING LARGE COLLECTIONS OF SEQUENCING DATA IN THE PRESENCE OR ABSENCE OF A REFERENCEen_US
dc.typeDissertationen_US

Files

Original bundle
Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
Almodaresi_umd_0117E_21117.pdf
Size:
2.65 MB
Format:
Adobe Portable Document Format