Shape Analysis of High-throughput Genomics Data

Okrah, Kwame
Corrada Bravo, Hector
RNA sequencing refers to the use of next-generation sequencing technologies to characterize the identity and abundance of target RNA species in a biological sample of interest. The recent improvement and reduction in the cost of next-generation sequencing technologies have been paralleled by the development of statistical methodologies to analyze the data they produce. The reduction in cost has been accompanied by an increase in the complexity of experiments, yet some of the old challenges remain. For example, normalization is now more important than ever. Some of the crude assumptions made in the early stages of RNA sequencing data analysis were necessary because the technology was new and untested, the number of replicates was small, and the experiments were relatively simple. One of the many uses of RNA sequencing experiments is the identification of genes whose abundance levels differ significantly across biological conditions of interest, and several methods have been developed to answer this question. Some of these newly developed methods assume that the observed data, or a transformation of the data, are relatively symmetric with light tails, usually summarized by assuming a Gaussian random component. This assumption is very difficult to assess for small sample sizes (e.g., sample sizes in the range of 4 to 30). In this dissertation, we use L-moments statistics as the basis for normalization, exploratory data analysis, the assessment of distributional assumptions, and the hypothesis testing of high-throughput transcriptomic data. In particular, we introduce a new normalization method for high-throughput transcriptomic data that is a modification of quantile normalization. We use L-moments ratios to assess the shape (skewness and kurtosis statistics) of high-throughput transcriptome data.
Based on these statistics, we propose a test for assessing whether the shapes of the observed samples differ across biological conditions. We also illustrate how this framework can be used to characterize the robustness of distributional assumptions made by statistical methods for differential expression. Applying it to RNA-seq data, we find that methods for differential expression analysis based on the simple t-test, using L-moments statistics as weights, are robust. Finally, we provide an algorithm based on L-moments ratios for identifying genes whose distributions are markedly different from the majority in the data.
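
For readers unfamiliar with the shape statistics mentioned above, the following is a minimal sketch of how sample L-moment ratios (L-skewness and L-kurtosis) can be computed for one gene's expression values. It uses the standard unbiased estimators built from probability-weighted moments; it is not taken from the dissertation, and the function name `sample_l_moment_ratios` is hypothetical.

```python
def sample_l_moment_ratios(values):
    """Return (l1, l2, t3, t4) for a sample: L-location, L-scale,
    L-skewness (l3 / l2) and L-kurtosis (l4 / l2).

    Illustrative sketch using the standard unbiased estimators based on
    probability-weighted moments; not the dissertation's own code.
    """
    x = sorted(float(v) for v in values)
    n = len(x)
    if n < 4:
        raise ValueError("at least 4 observations are needed for l4")

    # b[r] = (1/n) * sum_i w_r(i) * x_(i), where for the 0-based rank i
    # w_r(i) = [i (i-1) ... (i-r+1)] / [(n-1)(n-2) ... (n-r)]
    b = [0.0, 0.0, 0.0, 0.0]
    for i, xi in enumerate(x):
        w = 1.0
        for r in range(4):
            if r > 0:
                w *= (i - r + 1) / (n - r)
            b[r] += w * xi
    b = [total / n for total in b]

    # First four sample L-moments as linear combinations of b0..b3
    l1 = b[0]
    l2 = 2 * b[1] - b[0]
    l3 = 6 * b[2] - 6 * b[1] + b[0]
    l4 = 20 * b[3] - 30 * b[2] + 12 * b[1] - b[0]

    if l2 == 0:
        raise ValueError("L-scale is zero (constant sample); ratios undefined")
    return l1, l2, l3 / l2, l4 / l2


# Hypothetical example: an evenly spaced (symmetric) sample
l1, l2, t3, t4 = sample_l_moment_ratios([1, 2, 3, 4, 5])
```

A symmetric sample gives an L-skewness near zero, while a sample with a long right tail gives a positive value. Because both ratios are bounded and are linear functions of the order statistics, they tend to be less sensitive to outliers than conventional moment-based skewness and kurtosis, which matters at the small sample sizes discussed above.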