Shape Analysis of High-throughput Genomics Data

Date

2015

Abstract

RNA sequencing refers to the use of

next-generation sequencing technologies to characterize

the identity and abundance of target RNA species in a biological sample

of interest.

Recent improvements in next-generation sequencing technologies, and

reductions in their cost, have been

paralleled by the development of statistical methodologies to analyze the

data they produce.

Coupled with the reduction in cost is the increase in the complexity

of experiments.

Some of the old challenges still remain.

For example, normalization is more important now than ever.

Some of the crude assumptions made in the early stages of RNA sequencing

data analysis were necessary since the technology was new and untested,

the number of replicates was small, and the experiments were relatively

simple.

One of the many uses of RNA sequencing experiments is the

identification of genes whose abundance levels are significantly different

across various biological conditions of interest.

Several methods have been developed to answer this question.

Some of these newly developed methods are based on the assumption

that the data observed or a transformation of the data are relatively symmetric

with light tails, usually summarized by assuming a Gaussian random component.

This assumption is very difficult to assess for small sample sizes

(e.g., 4 to 30 samples).

In this dissertation, we utilize L-moments statistics as the basis for

normalization, exploratory data analysis, the assessment of distributional assumptions,

and the hypothesis testing of high-throughput transcriptomic data.

In particular, we introduce a new normalization method for high-throughput

transcriptomic data that is a modification of quantile normalization.
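For context, the baseline procedure can be sketched as follows. This is standard quantile normalization only, not the modification introduced in the dissertation, whose details the abstract does not give. The function name `quantile_normalize` and the genes-by-samples layout are assumptions made for illustration.

```python
import numpy as np

def quantile_normalize(values):
    """Standard quantile normalization (illustrative sketch).

    Assumes rows are genes and columns are samples. After normalization,
    every column has the same set of sorted values: the mean of the
    sorted columns.
    """
    x = np.asarray(values, dtype=float)
    # Row indices that would sort each column independently.
    order = np.argsort(x, axis=0, kind="stable")
    sorted_x = np.take_along_axis(x, order, axis=0)
    # Reference distribution: mean across samples at each rank.
    reference = sorted_x.mean(axis=1)
    # Write the reference value for each rank back to the original positions.
    # Ties are broken by position rather than averaged (a simplification).
    normalized = np.empty_like(x)
    np.put_along_axis(normalized, order, reference[:, None], axis=0)
    return normalized
```

Each column keeps the rank order of its genes but takes its values from the shared reference distribution, which is why the procedure forces all samples to have identical empirical distributions.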

We use L-moments ratios for assessing the shape

(skewness and kurtosis statistics) of high-throughput transcriptome data.
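As a rough illustration of these shape statistics, the sketch below computes the first four sample L-moments from the standard unbiased probability-weighted-moment estimators, then forms the L-skewness and L-kurtosis ratios. The function names are hypothetical, and this is a generic textbook calculation, not the dissertation's own implementation.

```python
import numpy as np

def sample_l_moments(x):
    """Return the first four sample L-moments (l1, l2, l3, l4)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    if n < 4:
        raise ValueError("at least 4 observations are needed for l4")
    i = np.arange(n)  # zero-based rank of each order statistic
    # Unbiased probability-weighted moments b0..b3.
    b0 = x.mean()
    b1 = np.sum(i * x) / (n * (n - 1))
    b2 = np.sum(i * (i - 1) * x) / (n * (n - 1) * (n - 2))
    b3 = np.sum(i * (i - 1) * (i - 2) * x) / (n * (n - 1) * (n - 2) * (n - 3))
    # L-moments are fixed linear combinations of the b's.
    l1 = b0
    l2 = 2 * b1 - b0
    l3 = 6 * b2 - 6 * b1 + b0
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0
    return l1, l2, l3, l4

def l_moment_ratios(x):
    """Return (L-skewness, L-kurtosis) = (l3 / l2, l4 / l2)."""
    _, l2, l3, l4 = sample_l_moments(x)
    if l2 == 0:
        raise ValueError("L-scale is zero; ratios are undefined")
    return l3 / l2, l4 / l2
```

Because the L-moments are linear in the ordered observations, these ratios tend to be less sensitive to extreme values than the conventional moment-based skewness and kurtosis, which matters at the small sample sizes mentioned above.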

Based on these statistics, we propose a test for assessing whether

the shapes of the observed samples differ across biological conditions.

We also illustrate the utility of this framework to characterize

the robustness of distributional assumptions made by statistical methods

for differential expression.

Applying it to RNA-seq data, we find that differential expression methods based on

the simple t-test, using L-moments statistics as weights, are robust.

Finally, we provide an algorithm based on L-moments ratios for identifying genes with

distributions that are markedly different from the majority in the data.
