Theses and Dissertations from UMD
Permanent URI for this community: http://hdl.handle.net/1903/2
New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be up to a 4-month delay in the appearance of a given thesis/dissertation in DRUM.
More information is available at Theses and Dissertations at University of Maryland Libraries.
117 results
Search Results
Item Algorithmic approaches for investigating DNA Methylation in tumor evolution and heterogeneity (2024) Li, Xuan; Sahinalp, S. Cenk; Mount, Stephen M.; Biology; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Intratumor heterogeneity and tumor diversity of cancer impose significant challenges on the prospect of personalized cancer diagnosis, treatment, and prognostics. While many studies seek to understand the complex dynamics of cancer with theoretically well-suited biomarkers like DNA mutations, the relative molecular rigidity and sparsity of mutations often make it challenging to reconstruct reliable tumor lineages using mutation profiles in practice. Epigenetic markers like DNA methylation, on the other hand, serve as a promising alternative to elucidate intratumor heterogeneity and tumor diversity. However, systematic research leveraging algorithmic approaches to investigate DNA methylation in the context of tumor evolution and heterogeneity remains limited. Aiming to address critical gaps in computational cancer research, this dissertation presents novel computational frameworks for analyzing DNA methylation at both single-cell and bulk levels and offers insights into methylation-based tumor heterogeneity, tumor evolutionary dynamics, and cellular composition in tumor samples for characterization of the complex epigenetic landscape of tumors. Chapter 2 and Chapter 3 introduce Sgootr (Single-cell Genomic methylatiOn tumOr Tree Reconstruction), the first distance-based computational method to jointly select tumor lineage-informative CpG sites and reconstruct tumor lineages from single-cell methylation data. Sgootr lays the groundwork for understanding tumor evolution through the lens of single-cell methylation profiles.
Motivated by the need highlighted in Chapter 2 to overcome imbalances in single-cell methylation data across patient samples for interpretable comparative patient analysis, Chapter 4 presents FALAFL (FAir muLti-sAmple Feature seLection). With integer linear programming (ILP) serving as its algorithmic backbone, FALAFL provides a fast and reliable solution to fairly select CpG sites across different single-cell methylation patient samples to optimally represent the entire patient cohort and identify reliable tumor lineage-informative CpG sites. Finally, Chapter 5 shifts the scope from single-cell to bulk tissue contexts and introduces Qombucha (Quadratic prOgraMming Based tUmor deConvolution with cell HierArchy), which is designed to tackle the challenges of bulk tissue analysis by inferring the methylation profiles of progenitor brain cells and determining cell type composition in bulk glioblastoma (GBM) samples. The work presented in this dissertation demonstrates the power of algorithmic and data science approaches to tackle some of the most pressing challenges in understanding the complexity of cancer epigenomics. With novel computational tools addressing current limitations in methylation data analysis, this work paves the way for further research in tumor evolution, personalized cancer treatment, and biomarker discovery. Overall, the computational frameworks and findings presented here bridge the gap between complex molecular data and clinically meaningful insights in the battle against cancer.
Item THE EVOLUTIONARY TRAJECTORY OF METARHIZIUM ROBERTSII ENDOPHYTIC CAPABILITY AND ENTOMOPATHOGENICITY (2024) Sheng, Huiyu; St. Leger, Raymond; Entomology; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Metarhizium fungi are a keystone genus of soil-inhabiting ascomycetes providing essential ecosystem services as saprotrophs, plant symbionts and insect pathogens, among other roles.
Recent studies have looked at how Metarhizium niches have evolved and shaped genome evolution over large time scales within the Metarhizium genus. This dissertation uses Metarhizium robertsii (M. robertsii) as a model to explore the evolution of its dual roles as an entomopathogen and endophyte by examining phenotypic and genomic differences among eight closely related strains. The study found that early diverged strains, characterized by slow germination on insect cuticles, low virulence, and extensive sporulation, exhibit a biotrophic lifestyle, systemically colonizing living hosts. In contrast, recently diverged strains exhibited rapid germination, high virulence, and reduced sporulation, indicating a shift towards a necrotrophic lifestyle. The study highlighted the influence of host immune responses in shaping M. robertsii-insect interactions, and showed that strong insect virulence correlated with better colonization of plant roots. Comparative genomics revealed that recently diverged strains expanded a small number of gene families related to gene expression as well as carbohydrate-degrading enzymes and proteases enhancing metabolic capabilities, insect virulence, and endophytic potential. Some early diverged strains exhibited high Repeat-Induced Point mutation activity, suggesting cryptic sexual reproduction in their evolutionary past. Overall, M. robertsii strains maintained a conserved genome with similar protein family sizes, with differences in gene expression patterns driving their varied lifestyles. This research provides new insights into M. 
robertsii’s recent co-evolution with plants and insects, highlighting the importance of understanding the ecological and evolutionary dynamics of these interactions for optimizing its use in sustainable agriculture.
Item Cell Population Shifts and Clinical Heterogeneity in Sjögren's Disease (2024) Pranzatelli, Thomas J; Johnson, Philip L.F.; Cell Biology & Molecular Genetics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Sjögren's disease (SjD) is a systemic autoimmune disease that causes loss of function of the salivary and lacrimal glands. Those with the disease, overwhelmingly female with an onset of disease in the fourth or fifth decade of life, commonly suffer from dry mouth, cavities and damage to the eyes. Patients present with a wide variety of clinical phenotypes, with variation in degree of immune infiltration and glandular damage as well as positivity for autoantibodies. This thesis uncovers the changes in cell population and gene expression in the gland that underpin diversity in disease severity. SjD patients lose the majority of a specific epithelial population in their labial salivary glands and, as the number of immune infiltrates grows, the surviving members of this population can be found colocalizing with invading GZMK+ T cells and expressing markers of increased proliferation. Standard differential gene expression analysis highlighted gene markers of cell types changing in proportion with disease, an unenlightening result when the cell population changes are well-characterized. To avoid this pitfall, an ensemble of random forests was trained to find genes predictive of patient subtypes without being correlated with diagnosis. Genes with high importance for autoantibody positivity were enriched for GO terms related to antigen processing and presentation. A master regulator of salivary gland identity, ZBTB7B, was identified from chromatin accessibility data.
Mice with this transcription factor knocked out lose salivary flow and develop pockets of tissue in their glands that resemble other glands, e.g., labial gland epithelium inside of parotid glands. This work supports a clinical presentation-specific approach to therapy and paves the path for reengineering the glands to correct the effects of disease.
Item TRANSLATION, REPLICATION AND TRANSCRIPTOMICS OF THE SIMPLEST PLUS-STRAND RNA PLANT VIRUSES (2024) Johnson, Philip Zhao; Simon, Anne E; Cell Biology & Molecular Genetics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Plus (+)-strand RNA viruses are among the most common pathogens of plants and animals. Furthermore, they present model systems for the study of basic biological processes, including protein translation and RNA replication, and shed light on the versatile roles that RNA structures play in these processes. After cell entry, the next step in the (+)-strand RNA viral life cycle is translation of the viral genome to produce the viral RNA-dependent RNA polymerase (RdRp) and associated replication proteins necessary for viral replication to occur. For many (+)-strand RNA viruses lacking a 5´ cap and 3´ poly(A) tail, translation depends upon RNA structural elements within their genomes capable of hijacking the host translation machinery, which for plant viruses are commonly located in their 3´ proximal regions and are termed 3´ cap-independent translation enhancer (CITE) elements. In Chapter 2, I report upon my work characterizing a new subclass of panicum mosaic virus-like translation enhancer (PTE) elements, which bind and co-opt for viral use the host translation initiation factor 4E (eIF4E) – the translation initiation factor normally responsible for binding and recognition of mRNA 5´ caps during canonical eukaryotic translation initiation. Thus, PTE 3´CITEs present a novel mechanism for co-opting the critical host factor eIF4E.
My work characterizing a new subclass of PTE 3´CITEs further revealed characteristics common among all PTE 3´CITEs pertaining to their mechanism of binding eIF4E. After translation of the necessary viral replication proteins, replication of the viral RNA occurs, which again is in large part mediated by RNA structural elements within the viral genome that can bind to the viral RdRp and/or host factors involved in viral replication. Indeed, RNA structural elements often serve dual roles in viral translation and replication and/or are located proximal to RNA structural elements involved in the alternate function. In Chapter 3, I discuss my work characterizing novel replication elements in the 3´ terminal regions of umbraviruses (family Tombusviridae). The uncovered replication elements appear to be specific to umbraviruses and are located immediately upstream of replication/translation elements that are common throughout the Tombusviridae, lending greater complexity to the already complex 3´ proximal structures of umbraviruses. While the study of (+)-strand RNA viruses has historically focused on their protein-coding transcripts, (+)-strand RNA viruses also commonly produce additional non-coding transcripts, including recombinant defective RNAs, typically containing 5´ and 3´ co-terminal viral genome segments, and (+/-)-foldback RNAs, composed of complementary (+)- and (-)-strand viral sequences joined together. Long non-coding RNAs that accumulate to high levels have also been reported for plant and animal (+)-strand RNA viruses in recent years, and truncations of viral transcripts also commonly arise due to host nuclease activity and/or premature termination of replication elongation by the viral RdRp. The rise of long-read high-throughput sequencing technologies such as nanopore sequencing presents an opportunity to fully map the complexity of (+)-strand RNA viral transcriptomes.
In Chapter 4, I present my work performing this analysis, employing direct RNA nanopore sequencing, in which the transcripts present in an RNA sample of interest are directly sequenced. This analysis revealed for the umbra-like virus citrus yellow vein-associated virus (CY1): (i) three novel 5´ co-terminal long non-coding RNAs; (ii) D-RNA population dynamics; (iii) a common 3´ terminal truncation of 61 nt among (+)-strand viral transcripts; (iv) missing 3´ terminal CCC-OH motif in virtually all (-)-strand reads; (v) major timepoint- and tissue-specific differences; and (vi) an abundance of (+/-)-foldback RNAs at later infection timepoints in leaf tissues. This work also sheds light on the current shortcomings of direct RNA nanopore sequencing as a technique. Finally, the importance of RNA structural biology in the study of (+)-strand RNA viruses presents the need for specialized RNA structure drawing software with functionality to easily control the layout of nucleobases, edit base-pairs, and annotate/color the nucleobases and bonds in a drawing. It is through the visual exploration of RNA structures that RNA biologists routinely improve upon the outputs of RNA structure prediction programs and perform crucial phylogenetic analyses among related RNA structures. Large RNA structures, such as whole viral genomes thousands of nucleotides long, can only be studied in their entirety with the aid of RNA structure visualization tools. To this end, I have developed over the course of my doctoral education the 2D RNA structure drawing application RNAcanvas, which is available as a web app and has grown popular among the RNA biology community. RNAcanvas emphasizes graphical mouse-based interaction with RNA structure drawings and has special functionality well suited for the drawing and exploration of large RNA structures, such as automatic layout adjustment and maintenance, complementary sequence highlighting, motif finding, and performance optimizations. 
Large viral structures such as that of the 2.7 kb CY1 genomic RNA could not have been characterized without the aid of RNAcanvas. In Chapter 5, I present my work developing RNAcanvas.
Item GENOMICS ENABLED GENE DISCOVERY IN DIPLOID AND POLYPLOID WHEAT (2024) Yadav, Inderjit Singh; Tiwari, Vijay; Plant Science and Landscape Architecture (PSLA); Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Hexaploid bread wheat (Triticum aestivum) is one of the most important staple food crops for humans. Sustainable genetic improvement in wheat is critical for ensuring global food security and requires the introduction of new genes and alleles into elite wheat cultivars. The progenitor species and wild wheat relatives are a reservoir of genetic diversity for wheat improvement. This doctoral thesis demonstrates the application of genomic resources and bioinformatics pipelines to characterize the wild germplasm and to streamline the gene discovery pipeline using five diverse species involving wheat progenitor, wild, and related species. Genomics-assisted characterization of the genetic diversity present in gene banks is a major step towards the systematic utilization of unexploited germplasm to ensure the sustainable development of new varieties. Toward this end, we used genomics datasets to curate wild and related accessions of tetraploid wheat from two distinct species, Triticum turgidum and Triticum timopheevii. Using genotyping-by-sequencing (GBS) data, a unique similarity matrix, and powercore analysis, a set of 102 accessions was identified as the core set; these accessions represent 20 and 35 percent of the total accessions of the WGRC tetraploid wheat collection of T. turgidum and T. timopheevii, respectively. Further, three distinct centers of rich genetic diversity were identified for wild and domesticated emmer and T. timopheevii in the Fertile Crescent.
GWAS analysis of the genotypic and phenotypic dataset identified a novel QTL for leaf rust resistance on chromosome 2B in T. timopheevii. Triticale is a man-made cereal derived from a cross between tetraploid and hexaploid wheat with diploid rye. There are large numbers of triticale germplasm available in different gene banks; however, in many cases, the ploidy information is not accurate, which affects the quality of work with large triticale germplasm collections. In this work, using low-cost GBS datasets, a pipeline was developed to detect contamination in the UMD triticale collection and to facilitate the accurate classification of ploidy, ensuring the purity of the triticale germplasm. This approach identified contamination by 11 wheat accessions and enabled the correct classification of 236 hexaploid and 12 octoploid triticale; these results were further confirmed through GISH experiments. Wild and related germplasms are considered a goldmine of genetic diversity for wheat improvement. Modern wheat cultivars have gone through several rounds of heavy selection for yield-related traits and have lost genetic diversity against several abiotic and biotic stresses. On the other hand, wild relatives of wheat have been growing naturally without any substantial artificial selection pressure, which has allowed them to preserve their genetic diversity. This study investigates the genetic diversity of a selected set of genes to visualize the differences between wild wheat relatives and polyploid wheat cultivars. To study these differences, the group 5 chromosomes of Aegilops geniculata and Aegilops umbellulata, belonging to the tertiary gene pool, were assembled. Comparative analysis revealed a higher rate of pseudogenization in bread wheat compared to these two wild relatives, primarily due to the difference in exon/intron length between the genes, rendering these genes non-functional.
Diploid einkorn wheat (Triticum monococcum), with inherent disease resistance, offers a valuable resource for wheat improvement. To facilitate its proper utilization, two reference genomes, one wild (T. monococcum ssp. aegilopoides) and one domesticated (T. monococcum ssp. monococcum), were assembled in this study. Kmer-GWAS identified seven novel QTLs associated with powdery mildew resistance, three for leaf rust resistance, and two for stem rust resistance. These QTLs harbor diverse gene classes encoding resistance gene analogs, cysteine-rich receptor kinases, transcription factors, and G-type lectins. Overall, the knowledge and resources developed in this research would contribute to the characterization of vast germplasm and the development of climate-resilient wheat.
Item DATA-DRIVEN ALGORITHMS FOR CHARACTERIZING STRUCTURAL VARIATION IN METAGENOMIC DATA (2024) Muralidharan, Harihara Subrahmaniam; Pop, Mihai; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Sequence differences between the strains of bacteria comprising host-associated and environmental microbiota may play a role in community assembly and influence the resilience of microbial communities to disturbances. Tools for characterizing strain-level variation within microbial communities, however, are limited in scope, focusing on just single nucleotide polymorphisms, or relying on reference-based analyses that miss complex structural variants. In this thesis, we describe data-driven methods to characterize variation in metagenomic data. In the first part of the thesis, I present our analysis of the structural variants identified from metagenomic whole genome shotgun sequencing data.
I begin by describing the power of assembly graph analysis to detect over 9 million structural variants, such as insertions/deletions, repeat copy-number changes, and mobile elements, in nearly 1,000 metagenomes generated as a part of the Human Microbiome Project. Next, I describe Binnacle, which is a structural variant-aware binning algorithm. To improve upon the fragmented nature of assemblies, metagenomic binning is performed to cluster contigs that are likely to have originated from the same genome. We show that binning “graph-based” scaffolds, rather than contigs, improves the quality of the bins, and captures a broader set of the genes of the genomes being reconstructed. Finally, we present a case study of the microbial mats from the Yellowstone National Park. The cyanobacterium Synechococcus is abundant in these mats along a stable temperature gradient from ∼50 °C to ∼65 °C and plays a key role in fixing carbon and nitrogen. Previous studies have isolated and generated good quality reference sequences of two major Synechococcus spp. that share a very high genomic content: OS-A and OS-B’. Despite the high abundance of the Synechococcus spp., metagenomic assembly of these organisms is challenging due to the large number of rearrangements between them. We explore the genomic diversity of the Synechococcus spp. using reference-genome-reliant assembly and scaffolding. We also highlight that the variants we detect can be used to fingerprint the local biogeography of the hot spring. In the second part of the thesis, I present our analysis of amplicon sequencing data, specifically the 16S rRNA gene sequences. I begin by describing SCRAPT (Sample, Cluster, Recruit, AdaPt and iTerate), which is a fast iterative algorithm for clustering large 16S rRNA gene datasets. We also show that SCRAPT is able to produce operational taxonomic units that are less fragmented than those of the popular tools UCLUST, CD-HIT and DNACLUST, and that it runs orders of magnitude faster than existing methods.
Finally, we study the impact of transitive annotation on taxonomic classifiers. Taxonomic labels are assigned using machine learning algorithms trained to recognize individual taxonomic groups based on training sequences with known taxonomic labels. Ideally, the training data should rely on experimentally verified, formal taxonomic labels; however, the labels associated with sequences in biological databases are most commonly the result of computational predictions (“transitive annotation”). We demonstrate that even a few computationally generated training data points can significantly skew the output of the classifier to the point where entire regions of the taxonomic space can be disturbed. We also discuss key factors that affect the resilience of classifiers to transitively annotated training data and propose best practices to avoid the artifacts described in this thesis.
Item Mixture Models for Nucleic Acid Sequence Feature Analysis (2023) Wang, Bixuan; Mount, Stephen M; Cell Biology & Molecular Genetics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Signals in nucleotide sequences play a crucial role in interactions among macromolecules and the regulation of biological functional processes such as transcription, the splicing of messenger RNA precursors and translation. Recognition of signals in nucleotide sequences is the first step in functional annotation, which is critical for the identification of deleterious mutations and the identification of targets for disease treatment. One of the essential steps in gene expression, RNA splicing removes introns from newly transcribed RNA, ligating exons to generate mature RNA. Splicing involves the formation and recycling of the spliceosome, a large macromolecular complex whose assembly requires complex coordination by splicing factors through the recognition of RNA-protein binding sites.
One potential method to reveal unknown subtypes of samples and identify distinctively distributed features is by applying a mixture model called the admixture model or Latent Dirichlet Allocation (LDA), which allows samples to have partial memberships of different clusters that can be interpreted for functional motif identification. By applying mixture models to RNA sequences, I found splicing signals such as the polypyrimidine tract and the branch point in intron sequences. Mixture models also showed motifs associated with reading frames from coding sequences, which further revealed potential coding regions from 5’ untranslated regions and long non-coding RNAs. Dynamic single-molecule imaging of nascent RNAs coupled with multiple genome-wide assays reveals that splicing happens far more often than expected, and partial intron removal can be captured prior to completion of the entire transcript. I hypothesize that the spliceosome progressively removes large introns in small pieces through 'recursive splicing' instead of removing the whole intron at once. However, the sequence features that distinguish sites of recursive splicing from canonical splice sites remain to be discovered. Here, I applied mixture models to sequences from human introns to identify sequence features associated with recursive splicing. This method helped me to recognize and visualize splicing signals from annotated intron sequences and identify potential coding sequences from human 5' untranslated regions and long non-coding RNA. 
After applying mixture models to the sequences surrounding recursive and canonical splicing sites, I found that transcripts where large introns can be recursively spliced can be distinguished from those without recursive splicing by the presence of CG-rich motifs flanking 5' splice sites upstream of first introns, and the absence of DNA methylation at these sites. In addition to applications of mixture models, I also explored RNA Bind-N-Seq data reflecting the binding activities of the splicing factor U2AF and found that the recursive 3' splice sites have higher U2AF binding affinities than the downstream canonical 3'SS. The observations suggest that, first, mixture models have the potential to identify functional motifs, including subtle signals in sequences such as the branch sites that only occur in a subgroup of introns. Second, the usage of recursive splicing sites is associated with sequence features in the first exons of the transcripts, suggesting a testable model for the regulation of recursive splicing in human introns.
Item EVOLUTION OF THE CRISPR IMMUNE SYSTEM FROM ECOLOGICAL TO MOLECULAR SCALES (2024) Xiao, Wei; Johnson, Philip LF; Biology; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Bacteria and archaea inhabit environments that constantly face viral infections and other external genetic threats. They have evolved an arsenal of defense strategies to protect themselves. My research delves into the CRISPR immune system, the only known adaptive immune system of prokaryotes. My work explores three different dimensions of the CRISPR immune system, ranging from ecological to molecular scales. From an evolutionary perspective, CRISPR is widely distributed across the prokaryotic tree, underscoring its immune effectiveness. However, the CRISPR distribution is uneven and some lineages are devoid of CRISPR. Here, I identify two ecological drivers of the CRISPR immune system.
By analyzing both 16S rRNA data and metagenomic data, I find the CRISPR system is favored in less abundant prokaryotes in the saltwater environment and in more diverse prokaryotic communities in the human oral environment. On the molecular level, the CRISPR system selects and cleaves its “favorite” DNA segments (also known as “spacers”) from invading viral genomes to form immune memories. I explore how the spacer sequence composition affects its acquisition rate by the CRISPR system. I develop a convolutional neural network model to predict the spacer acquisition rate based on the spacer sequence composition in natural microbial communities. The model interpretation reveals that the PAM-proximal end of the spacer is more important in predicting the spacer abundance, which is consistent with previous findings from controlled experimental studies. Combining these scales, CRISPR repeat sequences coevolve with the rest of the genome. Thus, I explore the potential of utilizing CRISPR repeat sequences for taxonomy profiling. I find a strong relationship between unique repeat sequences and taxonomy in both the RefSeq database and a human metagenomic dataset. Then I show high accuracy when utilizing repeat sequences in taxonomy annotation of human metagenomic contigs. This novel method not only aids in annotating CRISPR arrays but also introduces a novel tool for metagenomic sequence annotation.
Item Methods for Efficient Processing and Comprehensive Analysis of Single Cell Sequencing Data (2024) He, Dongze; Patro, Rob R.P.; Cell Biology & Molecular Genetics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Over the past decade, the rapid development of single-cell RNA-sequencing (scRNA-seq) technology has revolutionized the understanding of cellular differentiation, heterogeneity, transcriptional dynamics, and many other biological processes.
Despite the explosive growth of data analysis methods that aid in biological discovery, there are still many unsolved questions in raw data processing (also known as preprocessing) of scRNA-seq data --- the procedure for analyzing the raw sequenced fragments to generate the quantitative measurements of gene expression. In this dissertation, we first describe a computational ecosystem we developed that provides an end-to-end pipeline for accurately and efficiently processing single-cell sequencing data. Then, we will discuss the computational and analytical challenges we found during the development of alevin-fry and the solutions we provided for tackling these challenges. Chapters 2 and 3 demonstrate the computational successes we achieved for single-cell data processing. In Chapter 2, we present a novel computational framework, alevin-fry, for rapid, accurate, and memory-frugal quantification of single-cell sequencing data. In Chapter 3, we discuss an augmented execution context, simpleaf, of alevin-fry that not only provides a simplified user interface to the alevin-fry framework, but also offers many high-level simplifications for single-cell data processing, and for assisting with data provenance propagation and reproducible analyses. Our results demonstrate that, with the help of alevin-fry and simpleaf, we are able to process single-cell data from both "standard" chemistries, as well as from more advanced and complex data types, and achieve the same level of accuracy as existing best-in-class methods, while being substantially faster and more memory efficient. Chapter 4 introduces Forseti, a mechanistic model to probabilistically assign a splicing status to scRNA-seq reads.
As the first probabilistic and mechanistic model for solving the ambiguity of splicing status in tagged-end, short-read scRNA-seq data, we show that Forseti can be used to accurately and efficiently infer the splicing status of scRNA-seq reads, and to help identify the correct gene origin for multigene-mapped reads. In Chapter 5, we describe the results of a comprehensive analysis of "off-target" reads (reads whose mappings cannot be accounted for under the presumed and intended components of the underlying protocol) in scRNA-seq. Overall, our results suggest that off-target scRNA-seq reads contain underappreciated information about various transcriptional activities. These observations about yet-unexploited information in existing scRNA-seq data will help guide and motivate the community to improve current algorithms and analysis methods, and to develop novel approaches that utilize off-target reads to extend the reach and accuracy of single-cell data analysis pipelines.
Item Analyzing and indexing huge reference sequence collections (2023) Fan, Jason; Patro, Rob; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Recent advancements in genome-scale assays and high throughput sequencing have made systematic measurement of model-organisms both accessible and abundant. As a result, novel algorithms that exploit similarities across multiple samples or compare measurements against multiple reference organisms have been designed to improve analyses and gain new insights. However, such models and algorithms can be difficult to apply in practice. Furthermore, analysis of high-throughput sequencing data across multiple samples and multiple reference genomic sequences can be prohibitively costly in terms of space and time. In three parts, this dissertation investigates novel computational techniques that improve analyses at various scales.
In Part I, I present two general matrix-factorization algorithms designed to analyze and compare biological measurements of related species that can be summarized as networks. In Part II, I present methods that improve analyses of high-throughput sequencing data. The first method, SCALPELSIG, reduces the computational burden of applying mutational signature analysis in resource-limited settings; and the second method, a derivation of perplexity for gene and transcript expression estimation models, enables effective model selection in experimental RNA-seq data where ground-truth is absent. In Part III, I tackle the difficulties of indexing and analyzing huge collections of reference sequences. I introduce the spectrum preserving tiling (SPT), a new computational and mathematical abstraction. Mathematically, the SPT explicitly relates past work on compactly representing k-mer sets --- namely the compacted de Bruijn graph and recent derivations of spectrum preserving string sets --- to the indexing of k-mer positions and metadata in reference sequences. Computationally, the SPT makes possible an entire class of efficient and modular k-mer indexes. I introduce a pair of indexing schemes respectively designed to efficiently support rapid locate and k-mer "color" queries in small space. In the final chapter of this dissertation, I show how these modular indexes can be effectively and generically implemented.