UMD Theses and Dissertations
Permanent URI for this collection: http://hdl.handle.net/1903/3
New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be up to a 4-month delay in the appearance of a given thesis/dissertation in DRUM.
More information is available at Theses and Dissertations at University of Maryland Libraries.
3 results
Search Results
Item OPTIMIZING THE ACCURACY OF LIGHTWEIGHT METHODS FOR SHORT READ ALIGNMENT AND QUANTIFICATION (2021) Zakeri, Mohsen; Patro, Rob; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
The analysis of high-throughput sequencing (HTS) data involves a number of computational steps, ranging from transcriptome assembly and the mapping or alignment of reads to existing or assembled sequences, to estimating the abundance of sequenced molecules, performing differential or comparative analysis between samples, and even inferring dynamics of interest from snapshot data. Many methods have been developed for these tasks, offering various trade-offs between accuracy and speed, since accuracy and robustness typically come at the expense of speed, and vice versa. In this work, I focus on the problems of alignment and quantification of RNA-seq data, and review different aspects of the available methods for these problems. I explore how to strike a reasonable balance between these competing goals, and introduce methods that provide accurate results without sacrificing speed. Alignment of sequencing reads to known reference sequences is a challenging computational step in the RNA-seq pipeline, mainly because of the large size of the sample data and reference sequences, and because of highly repetitive sequence content. Recently, the concept of lightweight alignment was introduced to accelerate the mapping step of abundance estimation. I collaborated with my colleagues to explore some of the shortcomings of lightweight alignment methods, and to address them with a new approach called selective alignment. Moreover, we introduce an aligner, Puffaligner, which combines the indexing approach of the Pufferfish index with selective alignment to produce accurate alignments in a short amount of time compared to other popular aligners.
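As a rough illustration of the lightweight-mapping idea described in this abstract, the sketch below indexes reference transcripts by k-mer and locates candidate transcripts for a read via shared k-mers. All names and parameters here are hypothetical: real tools use compacted de Bruijn graph indexes such as Pufferfish, much larger k, and validate candidates with alignment scoring, which this toy omits.

```python
# Toy sketch of lightweight mapping: index transcripts by k-mer, then find
# candidate transcripts that share enough k-mers with a read. Illustrative
# only; not Puffaligner's actual algorithm or data structures.
from collections import defaultdict

K = 5  # toy k-mer size; practical aligners use k around 25-31

def build_kmer_index(transcripts):
    """Map each k-mer to the set of transcript names containing it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - K + 1):
            index[seq[i:i + K]].add(name)
    return index

def map_read(read, index, min_hits=2):
    """Return candidate transcripts sharing at least min_hits k-mers with the read."""
    hits = defaultdict(int)
    for i in range(len(read) - K + 1):
        for t in index.get(read[i:i + K], ()):
            hits[t] += 1
    return {t for t, n in hits.items() if n >= min_hits}

transcripts = {"t1": "ACGTACGTGGAA", "t2": "TTTTACGTACGT", "t3": "CCCCCCCCCCCC"}
index = build_kmer_index(transcripts)
candidates = map_read("ACGTACGT", index)  # t1 and t2 both contain this substring
```

A real selective-alignment step would then score each candidate locus with an alignment algorithm and keep only candidates scoring above a threshold, rather than trusting the k-mer matches alone.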
To improve the speed of RNA-seq quantification given a collection of alignments, some tools group fragments (reads) into equivalence classes: sets of fragments that are compatible with the same subset of reference sequences. Summarizing fragments into equivalence classes factorizes the likelihood function being optimized and speeds up the typical optimization algorithms deployed. I explore how this factorization affects the accuracy of abundance estimates, and propose a new factorization approach that demonstrates higher fidelity to the non-approximate model. Finally, estimating the posterior distribution of transcript expression is a crucial step in obtaining robust and reliable estimates of transcript abundance in the presence of high levels of multi-mapping. To assess the accuracy of their point estimates, quantification tools generate inferential replicates using techniques such as bootstrap sampling and Gibbs sampling. The utility of inferential replicates has been demonstrated in various downstream RNA-seq applications, e.g., differential expression analysis. I explore how sampling from both observed and unobserved data points (reads) improves the accuracy of bootstrap sampling. I demonstrate the utility of this approach in estimating allelic expression from RNA-seq reads, where the absence of uniquely mapping reads to reference transcripts is a major obstacle to calculating robust estimates.
Item The Psycho-logic of Universal Quantifiers (2021) Knowlton, Tyler Zarus; Lidz, Jeffrey; Linguistics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
A universally quantified sentence like every frog is green is standardly thought to express a two-place second-order relation (e.g., the set of frogs is a subset of the set of green things).
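The two views of universal quantification at issue can be stated concretely. The sketch below (hypothetical function names, not material from the dissertation) contrasts the standard two-place relational semantics with the restricted one-place alternative; the two are truth-conditionally equivalent, which is exactly why distinguishing them requires psychological rather than purely logical evidence.

```python
# Two truth-conditionally equivalent renderings of "every frog is green".
# Names and examples are illustrative, not from the dissertation.

frogs = {"kermit", "hopper"}
green_things = {"kermit", "hopper", "grass"}

def every_relational(restrictor, scope):
    """Two-place relational view: EVERY(F, G) iff F is a subset of G."""
    return restrictor <= scope

def every_restricted(restrictor, scope):
    """One-place restricted view: relative to the restrictor, everything
    satisfies the predicate."""
    return all(x in scope for x in restrictor)

assert every_relational(frogs, green_things) == every_restricted(frogs, green_things)
```

Because both functions compute the same truth value on every input, choosing between them as hypotheses about mental representation turns on format, such as which argument's extension speakers actually represent, as the studies summarized below probe.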
This dissertation argues that, as a psychological hypothesis about how speakers mentally represent universal quantifiers, this view is wrong in two respects. First, each, every, and all are not represented as two-place relations, but as one-place descriptions of how a predicate applies to a restricted domain (e.g., relative to the frogs, everything is green). Second, while every and all are represented in a second-order way that implicates a group, each is represented in a completely first-order way that does not involve grouping the satisfiers of a predicate together (e.g., relative to individual frogs, each one is green). These “psycho-logical” distinctions have consequences for how participants evaluate sentences like every circle is green in controlled settings. In particular, participants represent the extension of the determiner’s internal argument (the circles), but not the extension of its external argument (the green things). Moreover, the cognitive system they use to represent the internal argument differs depending on the determiner: Given every or all, participants show signatures of forming ensemble representations, but given each, they represent individual object-files. In addition to psychosemantic evidence, the proposed representations provide explanations for at least two semantic phenomena. The first is the “conservativity” universal: All determiners allow for duplicating their first argument in their second argument without a change in informational significance (e.g., every fish swims has the same truth-conditions as every fish is a fish that swims). This is a puzzling generalization if determiners express two-place relations, but it is a logical consequence if they are devices for forming one-place restricted quantifiers. The second is that every, but not each, naturally invites certain kinds of generic interpretations (e.g., gravity acts on every/#each object).
This asymmetry can potentially be explained by details of the interfacing cognitive systems (ensemble and object-file representations). And given that the difference leads to lower-level concomitants in child-ambient speech (as revealed by a corpus investigation), children may be able to leverage it to acquire every’s second-order meaning. This case study on the universal quantifiers suggests that knowing the meaning of a word like every consists not just in understanding the informational contribution that it makes, but in representing that contribution in a particular format. And much like phonological representations provide instructions to the motor planning system, it supports the idea that meaning representations provide (sometimes surprisingly precise) instructions to conceptual systems.
Item Computational approaches for improving the accuracy and efficiency of RNA-seq analysis (2020) Sarkar, Hirak; Patro, Robert; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
The past decade has seen tremendous growth in high-throughput sequencing technology, which has simultaneously improved the biological resolution of experiments and the scale of publicly available sequencing datasets. This enormous amount of data also calls for better algorithms to process it and to extract and filter useful knowledge from it. In this thesis, I concentrate on the challenges and solutions related to the processing of bulk RNA-seq data. An RNA-seq dataset consists of raw nucleotide sequences, drawn from the expressed mixture of transcripts in one or more samples. One of the most common uses of RNA-seq is obtaining transcript- or gene-level abundance information from the raw nucleotide read sequences and then using these abundances for downstream analyses such as differential expression.
A typical computational pipeline for such processing broadly involves two steps: assigning reads to reference sequences through alignment or mapping algorithms, and subsequently quantifying those assignments to obtain the expression of the reference transcripts or genes. In practice, this two-step process poses a multitude of challenges, from the presence of noise and experimental artifacts in the raw sequences to the disambiguation of multi-mapped read sequences. In this thesis, I describe these problems and demonstrate efficient, state-of-the-art solutions to a number of them. The thesis also explores multiple uses for an alternative representation of an RNA-seq experiment, encoded as equivalence classes and their associated counts. In this representation, instead of treating each read fragment individually, multiple fragments are simultaneously assigned to a set of transcripts depending on the underlying characteristics of the read-to-transcript mapping. I use equivalence classes in a number of applications across both single-cell and bulk RNA-seq technologies. By employing equivalence classes at cellular resolution, I have developed a droplet-based single-cell RNA-seq sequence simulator capable of generating tagged-end short-read sequences that resemble the properties of real datasets. In bulk RNA-seq, I have applied equivalence classes to applications ranging from data-driven compression methodologies to clustering de novo transcriptome assemblies. Specifically, I introduce a new data-driven approach for grouping together the transcripts in an experiment based on their inferential uncertainty. Transcripts that share large numbers of ambiguously mapping fragments with other transcripts, in complex patterns, often cannot have their abundances confidently estimated. Yet the total transcriptional output of such a group of transcripts can have greatly reduced inferential uncertainty, thus allowing more robust and confident downstream analysis.
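The equivalence-class representation and the grouping of transcripts that share ambiguous reads can be sketched as follows. This is a simplified stand-in with hypothetical names, not code from any of the tools mentioned: reads are collapsed into (transcript-set, count) classes, and transcripts that co-occur in a sufficiently large ambiguous class are merged with a union-find structure.

```python
# Sketch: equivalence classes from read-to-transcript mappings, plus a
# simplified grouping of transcripts that share ambiguous reads.
# Hypothetical names; real grouping also weighs inferential uncertainty.
from collections import Counter

def equivalence_classes(read_mappings):
    """Collapse reads into (transcript-set, count) equivalence classes."""
    return Counter(frozenset(ts) for ts in read_mappings)

class UnionFind:
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def group_ambiguous(eq_classes, min_shared=2):
    """Group transcripts that co-occur in an ambiguous class with enough reads."""
    uf = UnionFind()
    for ts, count in eq_classes.items():
        if len(ts) > 1 and count >= min_shared:
            first, *rest = sorted(ts)
            for t in rest:
                uf.union(first, t)
    groups = {}
    for t in uf.parent:
        groups.setdefault(uf.find(t), set()).add(t)
    return [g for g in groups.values() if len(g) > 1]

# Each inner list is the set of transcripts one read maps to.
reads = [["t1", "t2"], ["t1", "t2"], ["t1", "t2"], ["t3"], ["t3"], ["t4", "t5"]]
ecs = equivalence_classes(reads)
groups = group_ambiguous(ecs)  # t1/t2 share 3 ambiguous reads; t4/t5 only 1
```

Note that the real criterion described in the abstract is inferential uncertainty measured from posterior samples, not a raw shared-read threshold; the threshold here is only a placeholder for that decision.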
This approach, implemented in the tool terminus, groups together transcripts in a data-driven manner. It leverages the equivalence-class factorization to quickly identify transcripts that share reads, and uses posterior samples to measure the confidence of the point estimates. As a result, terminus allows transcript-level analysis where it can be confidently supported, and derives transcriptional groups where the inferential uncertainty is too high to support a transcript-level result.
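Finally, the inferential-replicate idea that recurs in these abstracts can be illustrated with a small bootstrap sketch. The names and the naive proportional estimator below are hypothetical, and real quantifiers resample and re-run their full inference procedure; the point is only that the spread of estimates across resampled replicates reflects inferential uncertainty.

```python
# Sketch of bootstrap inferential replicates over equivalence-class counts.
# Illustrative only: the estimator and resampling scheme are simplified.
import random

def proportional_estimate(eq_classes, transcripts):
    """Naive abundance estimate: split each class's count evenly among members."""
    est = {t: 0.0 for t in transcripts}
    for ts, count in eq_classes.items():
        for t in ts:
            est[t] += count / len(ts)
    return est

def bootstrap_replicates(eq_classes, transcripts, n_reps=100, seed=0):
    """Resample the observed reads (by class, with replacement) and re-estimate."""
    rng = random.Random(seed)
    classes = list(eq_classes)
    weights = [eq_classes[c] for c in classes]
    total = sum(weights)
    reps = []
    for _ in range(n_reps):
        resampled = {c: 0 for c in classes}
        for c in rng.choices(classes, weights=weights, k=total):
            resampled[c] += 1
        reps.append(proportional_estimate(resampled, transcripts))
    return reps

# 100 reads total: 30 unique to t1, 50 ambiguous between t1/t2, 20 unique to t2.
eq = {frozenset({"t1"}): 30, frozenset({"t1", "t2"}): 50, frozenset({"t2"}): 20}
reps = bootstrap_replicates(eq, ["t1", "t2"])
t1_vals = [r["t1"] for r in reps]  # variation across replicates ~ uncertainty
```

A transcript whose reads are mostly ambiguous will show a wide spread across replicates even when its point estimate looks precise, which is exactly the signal that grouping approaches such as the one above exploit.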