Analyzing and indexing huge reference sequence collections

Fan, Jason

Analyzing and indexing huge reference sequence collections

dc.contributor.advisor	Patro, Rob	en_US
dc.contributor.author	Fan, Jason	en_US
dc.contributor.department	Computer Science	en_US
dc.contributor.publisher	Digital Repository at the University of Maryland	en_US
dc.contributor.publisher	University of Maryland (College Park, Md.)	en_US
dc.date.accessioned	2024-02-14T06:41:56Z
dc.date.available	2024-02-14T06:41:56Z
dc.date.issued	2023	en_US
dc.description.abstract	Recent advancements in genome-scale assays and high throughput sequencing have made systematic measurement of model-organisms both accessible and abundant. As a result, novel algorithms that exploit similarities across multiple samples or compare measurements against multiple reference organisms have been designed to improve analyses and gain new insights. However, such models and algorithms can be difficult to apply in practice. Furthermore, analysis of high-throughput sequencing data across multiple samples and multiple reference genomic sequences can be prohibitively costly in terms of space and time. In three parts, this dissertation investigates novel computational techniques that improve analyses at various scales. In Part I, I present two general matrix-factorization algorithms designed to analyze and compare biological measurements of related species that can be summarized as networks. In Part II, I present methods that improve analyses of high-throughput sequencing data. The first method, SCALPELSIG, reduces the computation burden of applying mutational signature analysis in resource limited settings; and the second method, a derivation of perplexity for gene and transcript expression estimation models, enables effective model se- lection in experimental RNA-seq data where ground-truth is absent. In Part III, I tackle the difficulties of indexing and analyzing huge collections reference sequences. I introduce the spectrum preserving tiling (SPT), a new computational and mathematical abstraction. Mathematically, the SPT explicitly relates past work on compactly representing k-mer sets --- namely the compacted de Bruijn graph and recent derivations of spectrum preserving string sets --- to the indexing of k-mer positions and metadata in reference sequences. Computationally, the SPT makes possible an entire class of eﬀicient and modular k-mer indexes. I introduce a pair of indexing schemes respectively designed to eﬀiciently support rapid locate and k-mer "color" queries in small space. In the final Chapter of this dissertation, I show how these modular indexes can be effectively and generically implemented.	en_US
dc.identifier	https://doi.org/10.13016/vftr-xiun
dc.identifier.uri	http://hdl.handle.net/1903/31741
dc.language.iso	en	en_US
dc.subject.pqcontrolled	Bioinformatics	en_US
dc.subject.pqcontrolled	Computer science	en_US
dc.title	Analyzing and indexing huge reference sequence collections	en_US
dc.type	Dissertation	en_US

Files

Original bundle

Now showing 1 - 1 of 1

Name:: Fan_umd_0117E_23880.pdf
Size:: 19.49 MB
Format:: Adobe Portable Document Format

Download

Collections

UMD Theses and Dissertations
Computer Science Theses and Dissertations