Computational approaches for improving the accuracy and efficiency of RNA-seq analysis

Thumbnail Image


Publication or External Link





The past decade has seen tremendous growth in the area of high throughput sequencing technology, which simultaneously improved the biological resolution and subsequent processing of publicly-available sequencing datasets. This enormous amount of data also calls for better algorithms to process, extract and filter useful knowledge from the data. In this thesis, I concentrate on the challenges and solutions related to the processing of bulk RNA-seq data. An RNA-seq dataset consists of raw nucleotide sequences, drawn from the expressed mixture of transcripts in one or more samples. One of the most common uses of RNA-seq is obtaining transcript or gene level abundance information from the raw nucleotide read sequences and then using these abundances for downstream analyses such as differential expression. A typical computational pipeline for such processing broadly involves two steps: assigning reads to the reference sequence through alignment or mapping algorithms, and subsequently quantifying such assignments to obtain the expression of the reference transcripts or genes. In practice, this two-step process poses multitudes of challenges, starting from the presence of noise and experimental artifacts in the raw sequences to the disambiguation of multi-mapped read sequences. In this thesis, I have described these problems and demonstrated efficient state-of-the-art solutions to a number of them.

The current thesis will explore multiple uses for an alternate representation of an RNA-seq experiment encoded in equivalence classes and their associated counts. In this representation, instead of treating a read fragment individually, multiple fragments are simultaneously assigned to a set of transcripts depending on the underlying characteristics of the read-to-transcript mapping. I used the equivalence classes for a number of applications in both single-cell and bulk RNA-seq technologies. By employing equivalence classes at cellular resolution, I have developed a droplet-based single-cell RNA-seq sequence simulator capable of generating tagged end short read sequences resembling the properties of real datasets. In bulk RNA-seq, I have utilized equivalence classes to applications ranging from data-driven compression methodologies to clustering de-novo transcriptome assemblies. Specifically, I introduce a new data-driven approach for grouping together transcripts in an experiment based on their inferential uncertainty. Transcripts that share large numbers of ambiguously-mapping fragments with other transcripts, in complex patterns, often cannot have their abundances confidently estimated. Yet, the total transcriptional output of that group of transcripts will have greatly-reduced inferential uncertainty, thus allowing more robust and confident downstream analysis. This approach, implemented in the tool terminus, groups together transcripts in a data-driven manner. It leverages the equivalence class factorization to quickly identify transcripts that share reads and posterior samples to measure the confidence of the point estimates. As a result, terminus allows transcript-level analysis where it can be confidently supported, and derives transcriptional groups where the inferential uncertainty is too high to support a transcript-level result.