Analyzing and indexing huge reference sequence collections

Fan, Jason

Analyzing and indexing huge reference sequence collections

Files

Fan_umd_0117E_23880.pdf (19.49 MB)

No. of downloads: 3

Date

2023

Authors

Fan, Jason

Advisor

Patro, Rob

DRUM DOI

https://doi.org/10.13016/vftr-xiun

Abstract

Recent advancements in genome-scale assays and high throughput sequencing have made systematic measurement of model-organisms both accessible and abundant. As a result, novel algorithms that exploit similarities across multiple samples or compare measurements against multiple reference organisms have been designed to improve analyses and gain new insights. However, such models and algorithms can be difficult to apply in practice. Furthermore, analysis of high-throughput sequencing data across multiple samples and multiple reference genomic sequences can be prohibitively costly in terms of space and time. In three parts, this dissertation investigates novel computational techniques that improve analyses at various scales.

In Part I, I present two general matrix-factorization algorithms designed to analyze and compare biological measurements of related species that can be summarized as networks. In Part II, I present methods that improve analyses of high-throughput sequencing data. The first method, SCALPELSIG, reduces the computation burden of applying mutational signature analysis in resource limited settings; and the second method, a derivation of perplexity for gene and transcript expression estimation models, enables effective model se- lection in experimental RNA-seq data where ground-truth is absent.

In Part III, I tackle the difficulties of indexing and analyzing huge collections reference sequences. I introduce the spectrum preserving tiling (SPT), a new computational and mathematical abstraction. Mathematically, the SPT explicitly relates past work on compactly representing k-mer sets --- namely the compacted de Bruijn graph and recent derivations of spectrum preserving string sets --- to the indexing of k-mer positions and metadata in reference sequences. Computationally, the SPT makes possible an entire class of eﬀicient and modular k-mer indexes. I introduce a pair of indexing schemes respectively designed to eﬀiciently support rapid locate and k-mer "color" queries in small space. In the final Chapter of this dissertation, I show how these modular indexes can be effectively and generically implemented.

URI (handle)

http://hdl.handle.net/1903/31741

Collections

UMD Theses and Dissertations
Computer Science Theses and Dissertations

Full item page