Computational approaches for improving the accuracy and efficiency of RNA-seq analysis

Sarkar, Hirak
Patro, Robert
The past decade has seen tremendous growth in high-throughput sequencing technology, which has simultaneously improved the biological resolution and the subsequent processing of publicly available sequencing datasets. This enormous amount of data also calls for better algorithms to process, extract, and filter useful knowledge from it. In this thesis, I concentrate on the challenges and solutions related to the processing of bulk RNA-seq data. An RNA-seq dataset consists of raw nucleotide sequences drawn from the expressed mixture of transcripts in one or more samples. One of the most common uses of RNA-seq is to obtain transcript-level or gene-level abundance information from the raw read sequences and then to use these abundances in downstream analyses such as differential expression. A typical computational pipeline for such processing broadly involves two steps: assigning reads to the reference sequence through alignment or mapping algorithms, and then quantifying those assignments to obtain the expression of the reference transcripts or genes. In practice, this two-step process poses a multitude of challenges, ranging from noise and experimental artifacts in the raw sequences to the disambiguation of multi-mapped reads. In this thesis, I describe these problems and demonstrate efficient, state-of-the-art solutions to a number of them. I also explore multiple uses for an alternative representation of an RNA-seq experiment, encoded as equivalence classes and their associated counts. In this representation, instead of treating each read fragment individually, multiple fragments are simultaneously assigned to a set of transcripts depending on the underlying characteristics of the read-to-transcript mapping. I use the equivalence classes for a number of applications in both single-cell and bulk RNA-seq technologies.
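The equivalence-class representation described above can be illustrated with a minimal sketch. It assumes the simplest possible definition, in which two fragments fall into the same class when they map to exactly the same set of transcripts; real tools may also take other characteristics of the mapping into account. The function name `build_equivalence_classes`, the transcript names, and the toy mappings are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def build_equivalence_classes(read_mappings):
    """Collapse per-fragment mappings into equivalence classes with counts.

    read_mappings: iterable of lists, one per fragment, naming the
    transcripts that fragment maps to (hypothetical input format).
    Returns a dict from frozenset of transcript names to fragment count.
    """
    counts = Counter()
    for transcripts in read_mappings:
        if transcripts:  # unmapped fragments carry no class label
            counts[frozenset(transcripts)] += 1
    return dict(counts)


# Toy mappings for six fragments (made-up data for illustration only).
mappings = [
    ["t1", "t2"],
    ["t2", "t1"],  # same set as above, so same class
    ["t1"],
    ["t2", "t3"],
    ["t1", "t2"],
    [],            # unmapped fragment
]

eq_classes = build_equivalence_classes(mappings)
for label, count in sorted(eq_classes.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(label), count)
```

Five mapped fragments collapse into three classes here, which is why the representation is compact: downstream steps can iterate over classes and counts rather than over individual fragments.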
By employing equivalence classes at cellular resolution, I have developed a droplet-based single-cell RNA-seq sequence simulator capable of generating tagged-end short read sequences that resemble the properties of real datasets. In bulk RNA-seq, I have applied equivalence classes to problems ranging from data-driven compression to the clustering of de novo transcriptome assemblies. Specifically, I introduce a new data-driven approach for grouping together transcripts in an experiment based on their inferential uncertainty. Transcripts that share large numbers of ambiguously mapping fragments with other transcripts, in complex patterns, often cannot have their abundances confidently estimated. Yet the total transcriptional output of such a group of transcripts will have greatly reduced inferential uncertainty, allowing more robust and confident downstream analysis. This approach, implemented in the tool terminus, leverages the equivalence class factorization to quickly identify transcripts that share reads, and uses posterior samples to measure the confidence of the point estimates. As a result, terminus allows transcript-level analysis where it can be confidently supported, and derives transcriptional groups where the inferential uncertainty is too high to support a transcript-level result.
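The intuition behind grouping can be shown with a toy sketch. This is not the terminus algorithm: it only checks whether two transcripts share an equivalence class and compares the spread of made-up posterior samples for each transcript against the spread of their sum. The helper names, the equivalence classes, the sample values, and the use of the coefficient of variation as the uncertainty measure are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def coefficient_of_variation(samples):
    """Relative spread of a list of abundance samples (0.0 if the mean is 0)."""
    m = mean(samples)
    return pstdev(samples) / m if m > 0 else 0.0


def shares_fragments(eq_classes, a, b):
    """True if transcripts a and b appear together in any equivalence class."""
    return any(a in label and b in label for label in eq_classes)


# Hypothetical equivalence classes: t1 and t2 share most of their fragments.
eq_classes = {
    frozenset({"t1", "t2"}): 90,
    frozenset({"t1"}): 5,
    frozenset({"t2"}): 5,
    frozenset({"t3"}): 50,
}

# Made-up posterior samples of abundance, one list entry per sample.
posterior = {
    "t1": [80, 20, 60, 35, 55],
    "t2": [20, 80, 40, 65, 45],
    "t3": [50, 51, 49, 50, 50],
}

# Summing the samples draw by draw gives the posterior of the group total.
group_samples = [x + y for x, y in zip(posterior["t1"], posterior["t2"])]

if shares_fragments(eq_classes, "t1", "t2"):
    print("t1 CV:", round(coefficient_of_variation(posterior["t1"]), 3))
    print("t2 CV:", round(coefficient_of_variation(posterior["t2"]), 3))
    print("t1+t2 CV:", round(coefficient_of_variation(group_samples), 3))
```

In this contrived case the samples for t1 and t2 are anticorrelated, so each transcript looks uncertain on its own while their total is stable across samples. That is the situation in which reporting a group rather than individual transcripts is the more defensible choice.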