Transcript assembly and abundance estimation with high-throughput RNA sequencing

dc.contributor.advisor: Salzberg, Steven L [en_US]
dc.contributor.author: Trapnell, Bruce Colston [en_US]
dc.contributor.department: Computer Science [en_US]
dc.contributor.publisher: Digital Repository at the University of Maryland [en_US]
dc.contributor.publisher: University of Maryland (College Park, Md.) [en_US]
dc.date.accessioned: 2010-07-02T06:01:11Z
dc.date.available: 2010-07-02T06:01:11Z
dc.date.issued: 2010 [en_US]
dc.description.abstract: We present algorithms and statistical methods for the reconstruction and abundance estimation of transcript sequences from high-throughput RNA sequencing ("RNA-Seq"). We evaluate these approaches through large-scale experiments of a well-studied model of muscle development. We begin with an overview of sequencing assays and outline why the short read alignment problem is fundamental to the analysis of these assays. We then describe two approaches to the contiguous alignment problem, one of which uses massively parallel graphics hardware to accelerate alignment, and one of which exploits an indexing scheme based on the Burrows-Wheeler transform. We then turn to the spliced alignment problem, which is fundamental to RNA-Seq, and present an algorithm, TopHat. TopHat is the first algorithm that can align the reads from an entire RNA-Seq experiment to a large genome without the aid of reference gene models. In the second part of the thesis, we present the first comparative RNA-Seq assembly algorithm, Cufflinks, which is adapted from a constructive proof of Dilworth's Theorem, a classic result in combinatorics. We evaluate Cufflinks by assembling the transcriptome from a time-course RNA-Seq experiment of developing skeletal muscle cells. The assembly contains 13,689 known transcripts and 3,724 novel ones. Of the novel transcripts, 62% were strongly supported by earlier sequencing experiments or by homologous transcripts in other organisms. We further validated interesting genes with isoform-specific RT-PCR. We then present a statistical model for RNA-Seq, included in Cufflinks, with which we estimate abundances of transcripts from RNA-Seq data. Simulation studies demonstrate that the model is highly accurate. We apply this model to the muscle data and track the abundances of individual isoforms over development.
Finally, we present significance tests for changes in relative and absolute abundances between time points, which we employ to uncover differential expression and differential regulation. By testing for relative abundance changes within and between transcripts sharing a transcription start site, we find significant shifts in the rates of alternative splicing and promoter preference in hundreds of genes, including those believed to regulate muscle development. [en_US]
dc.identifier.uri: http://hdl.handle.net/1903/10364
dc.subject.pqcontrolled: Computer Science [en_US]
dc.subject.pqcontrolled: Biology, Molecular [en_US]
dc.subject.pquncontrolled: alternative splicing [en_US]
dc.subject.pquncontrolled: differential expression [en_US]
dc.subject.pquncontrolled: RNA-Seq [en_US]
dc.subject.pquncontrolled: short read sequencing [en_US]
dc.subject.pquncontrolled: transcriptomics [en_US]
dc.title: Transcript assembly and abundance estimation with high-throughput RNA sequencing [en_US]
dc.type: Dissertation [en_US]

Files

Original bundle
Name: Trapnell_umd_0117E_11206.pdf
Size: 3.47 MB
Format: Adobe Portable Document Format