Shape Analysis of High-throughput Genomics Data
Abstract
RNA sequencing refers to the use of
next-generation sequencing technologies to characterize
the identity and abundance of target RNA species in a biological sample
of interest.
Recent improvements in next-generation sequencing technologies,
and reductions in their cost, have been
paralleled by the development of statistical methodologies to analyze the
data they produce.
The reduction in cost has been accompanied by an increase in the complexity
of experiments.
Some of the old challenges still remain.
For example, normalization is now more important than ever.
Some of the crude assumptions made in the early stages of RNA sequencing
data analysis were necessary since the technology was new and untested,
the number of replicates was small, and the experiments were relatively
simple.
One of the many uses of RNA sequencing experiments is the
identification of genes whose abundance levels are significantly different
across various biological conditions of interest.
Several methods have been developed to answer this question.
Some of these newly developed methods are based on the assumption
that the data observed or a transformation of the data are relatively symmetric
with light tails, usually summarized by assuming a Gaussian random component.
This assumption is very difficult to assess for small sample sizes
(e.g., sample sizes in the range of 4 to 30).
In this dissertation, we utilize L-moments statistics as the basis for
normalization, exploratory data analysis, the assessment of distributional assumptions,
and the hypothesis testing of high-throughput transcriptomic data.
In particular, we introduce a new normalization method for high-throughput
transcriptomic data that is a modification of quantile normalization.
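For orientation, the sketch below shows standard quantile normalization only. It is not the modified method introduced in the dissertation, whose details the abstract does not give. The function name `quantile_normalize` and the genes-by-samples layout of the input matrix are assumptions made for illustration, and ties are broken by row order rather than averaged.

```python
import numpy as np


def quantile_normalize(counts):
    """Standard quantile normalization of a genes-by-samples matrix.

    Hypothetical helper for illustration; NOT the dissertation's
    modified method. Ties within a sample are broken by row order.
    """
    m = np.asarray(counts, dtype=float)
    # Rank of every value within its own sample (column), 0-based.
    order = np.argsort(m, axis=0, kind="stable")
    ranks = np.argsort(order, axis=0, kind="stable")
    # Reference distribution: mean of the sorted columns, one value per rank.
    reference = np.sort(m, axis=0).mean(axis=1)
    # Replace each value by the reference value at its rank.
    return reference[ranks]


if __name__ == "__main__":
    toy = [[5, 4, 3], [2, 1, 4], [3, 4, 6], [4, 2, 8]]
    print(quantile_normalize(toy))
```

After this step every sample shares the same empirical distribution, which is the property any modification of the method would start from.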
We use L-moments ratios for assessing the shape
(skewness and kurtosis statistics) of high-throughput transcriptome data.
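As a minimal sketch of what these shape statistics are, the code below computes the first two sample L-moments and the L-skewness and L-kurtosis ratios for one gene, using the usual unbiased probability-weighted-moment estimators. The function name `sample_l_moment_ratios` is hypothetical, and the dissertation may use a different estimator or implementation.

```python
import numpy as np


def sample_l_moment_ratios(x):
    """Return (l1, l2, t3, t4) for a 1-D sample.

    l1 is the L-location (the mean), l2 the L-scale, t3 = l3 / l2 the
    L-skewness and t4 = l4 / l2 the L-kurtosis. Hypothetical helper
    based on unbiased probability-weighted moments b0..b3.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    if n < 4:
        raise ValueError("need at least 4 observations for L-kurtosis")
    j = np.arange(1, n + 1)  # 1-based ranks of the sorted values
    b0 = x.mean()
    b1 = np.sum((j - 1) / (n - 1) * x) / n
    b2 = np.sum((j - 1) * (j - 2) / ((n - 1) * (n - 2)) * x) / n
    b3 = np.sum((j - 1) * (j - 2) * (j - 3)
                / ((n - 1) * (n - 2) * (n - 3)) * x) / n
    l1 = b0
    l2 = 2 * b1 - b0
    l3 = 6 * b2 - 6 * b1 + b0
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0
    if l2 == 0:
        raise ValueError("L-scale is zero; the ratios are undefined")
    return l1, l2, l3 / l2, l4 / l2


if __name__ == "__main__":
    # One gene measured in eight samples (made-up values).
    print(sample_l_moment_ratios([3.1, 2.7, 4.0, 3.3, 9.8, 2.9, 3.6, 3.0]))
```

Both ratios are bounded in absolute value by 1, which makes them easier to compare across genes at small sample sizes than conventional moment-based skewness and kurtosis.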
Based on these statistics, we propose a test for assessing whether
the shapes of the observed samples differ across biological conditions.
We also illustrate the utility of this framework to characterize
the robustness of distributional assumptions made by statistical methods
for differential expression.
Applying it to RNA-seq data, we find that differential expression methods based on
the simple t-test, with L-moments statistics used as weights, are robust.
Finally, we provide an algorithm based on L-moments ratios for identifying genes with
distributions that are markedly different from the majority in the data.