Computational approaches for improving the accuracy and efficiency of RNA-seq analysis
dc.contributor.advisor | Patro, Robert | en_US |
dc.contributor.author | Sarkar, Hirak | en_US |
dc.contributor.department | Computer Science | en_US |
dc.contributor.publisher | Digital Repository at the University of Maryland | en_US |
dc.contributor.publisher | University of Maryland (College Park, Md.) | en_US |
dc.date.accessioned | 2020-09-25T05:37:08Z | |
dc.date.available | 2020-09-25T05:37:08Z | |
dc.date.issued | 2020 | en_US |
dc.description.abstract | The past decade has seen tremendous growth in high-throughput sequencing technology, which has simultaneously improved the biological resolution and subsequent processing of publicly available sequencing datasets. This enormous amount of data also calls for better algorithms to process, extract, and filter useful knowledge from the data. In this thesis, I concentrate on the challenges and solutions related to the processing of bulk RNA-seq data. An RNA-seq dataset consists of raw nucleotide sequences drawn from the expressed mixture of transcripts in one or more samples. One of the most common uses of RNA-seq is obtaining transcript- or gene-level abundance information from the raw nucleotide read sequences and then using these abundances for downstream analyses such as differential expression. A typical computational pipeline for such processing broadly involves two steps: assigning reads to the reference sequence through alignment or mapping algorithms, and subsequently quantifying these assignments to obtain the expression of the reference transcripts or genes. In practice, this two-step process poses many challenges, ranging from noise and experimental artifacts in the raw sequences to the disambiguation of multi-mapped read sequences. In this thesis, I describe these problems and demonstrate efficient, state-of-the-art solutions to a number of them. This thesis explores multiple uses for an alternative representation of an RNA-seq experiment, encoded as equivalence classes and their associated counts. In this representation, instead of treating each read fragment individually, multiple fragments are assigned together to a set of transcripts, depending on the underlying characteristics of the read-to-transcript mapping. I used equivalence classes for a number of applications in both single-cell and bulk RNA-seq technologies.
By employing equivalence classes at cellular resolution, I have developed a droplet-based single-cell RNA-seq simulator capable of generating tagged-end short read sequences that resemble the properties of real datasets. In bulk RNA-seq, I have applied equivalence classes to problems ranging from data-driven compression methodologies to clustering de novo transcriptome assemblies. Specifically, I introduce a new data-driven approach for grouping together transcripts in an experiment based on their inferential uncertainty. Transcripts that share large numbers of ambiguously mapping fragments with other transcripts, in complex patterns, often cannot have their abundances confidently estimated. Yet the total transcriptional output of such a group of transcripts will have greatly reduced inferential uncertainty, allowing more robust and confident downstream analysis. This approach, implemented in the tool terminus, groups together transcripts in a data-driven manner. It leverages the equivalence class factorization to quickly identify transcripts that share reads, and posterior samples to measure the confidence of the point estimates. As a result, terminus allows transcript-level analysis where it can be confidently supported, and derives transcriptional groups where the inferential uncertainty is too high to support a transcript-level result. | en_US |
dc.identifier | https://doi.org/10.13016/tuy4-kjcc | |
dc.identifier.uri | http://hdl.handle.net/1903/26454 | |
dc.language.iso | en | en_US |
dc.subject.pqcontrolled | Computer science | en_US |
dc.subject.pqcontrolled | Bioinformatics | en_US |
dc.subject.pqcontrolled | Conservation biology | en_US |
dc.subject.pquncontrolled | Assembly | en_US |
dc.subject.pquncontrolled | Clustering | en_US |
dc.subject.pquncontrolled | Equivalence Classes | en_US |
dc.subject.pquncontrolled | Quantification | en_US |
dc.subject.pquncontrolled | RNA-seq | en_US |
dc.subject.pquncontrolled | Transcription | en_US |
dc.title | Computational approaches for improving the accuracy and efficiency of RNA-seq analysis | en_US |
dc.type | Dissertation | en_US |
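
The equivalence-class representation described in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation and not the code of terminus: the function names `build_equivalence_classes` and `shared_fragment_counts`, the transcript IDs, and the toy mappings are all hypothetical, and the pairwise count is only a simplified stand-in for "identifying transcripts that share reads". The sketch assumes each fragment's mapping is already available as a collection of transcript IDs.

```python
from collections import Counter
from itertools import combinations


def build_equivalence_classes(read_mappings):
    """Collapse per-fragment mappings into equivalence classes with counts.

    read_mappings: iterable of collections of transcript IDs, one per fragment.
    Returns a Counter keyed by the sorted tuple of transcript IDs a fragment
    maps to; the value is the number of fragments with exactly that set.
    """
    classes = Counter()
    for transcripts in read_mappings:
        label = tuple(sorted(set(transcripts)))
        if label:  # ignore fragments that map to no transcript
            classes[label] += 1
    return classes


def shared_fragment_counts(classes):
    """Count, for each transcript pair, the fragments in classes containing both.

    This is an illustrative simplification, not the grouping criterion used
    by any particular tool.
    """
    shared = Counter()
    for label, count in classes.items():
        for pair in combinations(label, 2):
            shared[pair] += count
    return shared


# Hypothetical toy input: six fragments, one of them unmapped.
mappings = [["t1"], ["t1", "t2"], ["t2", "t1"], ["t3"], ["t1", "t2"], []]
eq_classes = build_equivalence_classes(mappings)
pairs = shared_fragment_counts(eq_classes)
```

In this toy input, the three fragments that map ambiguously to both `t1` and `t2` collapse into a single class with count 3, so downstream steps operate on three classes rather than five mapped fragments.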