Applications of Graph Segmentation Algorithms For Quantitative Genomic Analyses

Thumbnail Image
Publication or External Link
Gunady, Mohamed
Bravo, Hector C
There is a growing interest in utilizing graph formulations and graph-based algorithms in different subproblems of genomic analysis. Since graphs provide a natural and efficient representation of sequences of data where some structural relationships are observed within the data, we study some graph applications in quantitative analysis of typical RNA-seq and Whole Genome Sequencing pipelines. Analysis of differential alternative splicing from RNA-seq data is complicated by the fact that many RNA-seq reads map to multiple transcripts, besides, the annotated transcripts are often a small subset of the possible transcripts of a gene. This work describes Yanagi, a tool for segmenting transcriptomes to create a library of maximal L-disjoint segments from a complete transcriptome annotation. That segment library preserves transcriptome substrings and structural relationships between transcripts while eliminating unnecessary sequence duplications. First, we formalize the concept of transcriptome segmentation and propose an efficient algorithm for generating segment libraries. The resulting segment sequences can be used with pseudo-alignment tools to quantify gene expression and alternative splicing at the segment level and provide gene-level visualization of the segments for more interpretability. The notion of transcript segmentation as introduced here and implemented in Yanagi opens the door for the application of lightweight, ultra-fast pseudo-alignment algorithms in a wide variety of RNA-seq analyses. Furthermore, we show how transcriptome quantification can be performed from segment-level statistics. We present an EM algorithm that uses segment counts as features to estimate transcripts relative abundances in a way that maximizes the likelihood of the observed sequenced data. Then we tackle the problem of quantification in an incomplete annotation setting. We propose an assembly-free correction procedure that reduces bias in the estimated abundances of the annotated transcripts caused by the presence of unannotated transcripts in an RNA-seq sample, while avoiding the need to assemble the missing transcripts first. Another use case of our graph segmentation approach is representing population reference genome graphs used in Whole Genome Sequencing (WGS), which can be crucial for some genomic analysis studying highly polymorphic genes like HLA. Usually graph-based aligners are slow and computationally demanding. Using segments empowers any linear aligner with the efficient graph representation of population variations, while avoiding the expensive computational overhead of aligning over graphs. Lastly, we explore the use of Generative Adversarial Networks (GANs) for imputing the sparse and noisy expression data obtained in single cell RNA sequencing (scRNA-seq) experiments. scRNA-seq provides a rich view into the heterogeneity underlying a cell population which is usually lost when performing bulk RNA-seq. However, these datasets are usually noisy and very sparse, and a number of methods have been proposed to impute zeros in these datasets with the goal of improving downstream analysis. In this work, we propose an approach, scGAIN, to impute zero counts of dropout genes in single cell data using Generative Adversarial Networks (GANs) by learning an approximation of the data distribution. The work presented here discusses an approach to adopt GAIN, a GAN model developed to impute data in image data, into the domain of imputing single cell data. Experiments show that scGAIN gives competitive results compared to the state-of-the-art imputation approaches while showing superiority in various aspects in simulation and real data. Imputation by scGAIN successfully recovers the underlying clustering of cell sub-populations, provides sharp estimates around true mean expression, reducing variability in the data, and increases the correspondence with matched bulk RNA-seq experiments.