Methods for Efficient Processing and Comprehensive Analysis of Single Cell Sequencing Data


Date

2024

Abstract

Over the past decade, the rapid development of single-cell RNA-sequencing (scRNA-seq) technology has revolutionized the understanding of cellular differentiation, heterogeneity, transcriptional dynamics, and many other biological processes. Despite the explosive growth of data analysis methods that aid in biological discovery, many questions remain unsolved in the raw data processing (also known as preprocessing) of scRNA-seq data, that is, the procedure for analyzing the raw sequenced fragments to generate quantitative measurements of gene expression. In this dissertation, we first describe a computational ecosystem we developed that provides an end-to-end pipeline for accurately and efficiently processing single-cell sequencing data. We then discuss the computational and analytical challenges we encountered during the development of alevin-fry and the solutions we devised to address them.

Chapters 2 and 3 demonstrate the computational successes we achieved for single-cell data processing. In Chapter 2, we present a novel computational framework, alevin-fry, for rapid, accurate, and memory-frugal quantification of single-cell sequencing data. In Chapter 3, we discuss simpleaf, an augmented execution context for alevin-fry that not only provides a simplified user interface to the alevin-fry framework, but also offers many high-level simplifications for single-cell data processing and assists with data provenance propagation and reproducible analyses. Our results demonstrate that, with alevin-fry and simpleaf, we can process single-cell data from both "standard" chemistries and more advanced and complex data types, achieving the same level of accuracy as existing best-in-class methods while being substantially faster and more memory efficient.

Chapter 4 introduces Forseti, a mechanistic model that probabilistically assigns a splicing status to scRNA-seq reads. Forseti is the first probabilistic and mechanistic model for resolving the ambiguity of splicing status in tagged-end, short-read scRNA-seq data. We show that it can accurately and efficiently infer the splicing status of scRNA-seq reads and help identify the correct gene of origin for reads that map to multiple genes.

In Chapter 5, we describe the results of a comprehensive analysis of "off-target" reads (reads whose mappings cannot be accounted for under the presumed and intended components of the underlying protocol) in scRNA-seq. Overall, our results suggest that off-target scRNA-seq reads contain underappreciated information about various transcriptional activities. These observations about yet-unexploited information in existing scRNA-seq data will help guide and motivate the community to improve current algorithms and analysis methods, and to develop novel approaches that utilize off-target reads to extend the reach and accuracy of single-cell data analysis pipelines.
