Transcript assembly and abundance estimation with high-throughput RNA sequencing

Trapnell, Bruce Colston

dc.contributor.advisor	Salzberg, Steven L	en_US
dc.contributor.author	Trapnell, Bruce Colston	en_US
dc.contributor.department	Computer Science	en_US
dc.contributor.publisher	Digital Repository at the University of Maryland	en_US
dc.contributor.publisher	University of Maryland (College Park, Md.)	en_US
dc.date.accessioned	2010-07-02T06:01:11Z
dc.date.available	2010-07-02T06:01:11Z
dc.date.issued	2010	en_US
dc.description.abstract	We present algorithms and statistical methods for the reconstruction and abundance estimation of transcript sequences from high-throughput RNA sequencing ("RNA-Seq"). We evaluate these approaches through large-scale experiments on a well-studied model of muscle development. We begin with an overview of sequencing assays and outline why the short read alignment problem is fundamental to the analysis of these assays. We then describe two approaches to the contiguous alignment problem, one of which uses massively parallel graphics hardware to accelerate alignment, and one of which exploits an indexing scheme based on the Burrows-Wheeler transform. We then turn to the spliced alignment problem, which is fundamental to RNA-Seq, and present an algorithm, TopHat. TopHat is the first algorithm that can align the reads from an entire RNA-Seq experiment to a large genome without the aid of reference gene models. In the second part of the thesis, we present the first comparative RNA-Seq assembly algorithm, Cufflinks, which is adapted from a constructive proof of Dilworth's Theorem, a classic result in combinatorics. We evaluate Cufflinks by assembling the transcriptome from a time course RNA-Seq experiment of developing skeletal muscle cells. The assembly contains 13,689 known transcripts and 3,724 novel ones. Of the novel transcripts, 62% were strongly supported by earlier sequencing experiments or by homologous transcripts in other organisms. We further validated interesting genes with isoform-specific RT-PCR. We then present a statistical model for RNA-Seq, included in Cufflinks, with which we estimate abundances of transcripts from RNA-Seq data. Simulation studies demonstrate that the model is highly accurate. We apply this model to the muscle data, and track the abundances of individual isoforms over development. 
Finally, we present significance tests for changes in relative and absolute abundances between time points, which we employ to uncover differential expression and differential regulation. By testing for relative abundance changes within and between transcripts sharing a transcription start site, we find significant shifts in the rates of alternative splicing and promoter preference in hundreds of genes, including those believed to regulate muscle development.	en_US
dc.identifier.uri	http://hdl.handle.net/1903/10364
dc.subject.pqcontrolled	Computer Science	en_US
dc.subject.pqcontrolled	Biology, Molecular	en_US
dc.subject.pquncontrolled	alternative splicing	en_US
dc.subject.pquncontrolled	differential expression	en_US
dc.subject.pquncontrolled	RNA-Seq	en_US
dc.subject.pquncontrolled	short read sequencing	en_US
dc.subject.pquncontrolled	transcriptomics	en_US
dc.title	Transcript assembly and abundance estimation with high-throughput RNA sequencing	en_US
dc.type	Dissertation	en_US

Files

Original bundle

Name: Trapnell_umd_0117E_11206.pdf
Size: 3.47 MB
Format: Adobe Portable Document Format

Collections

UMD Theses and Dissertations
Computer Science Theses and Dissertations