RNA-SEQUENCING ANALYSIS: READ ALIGNMENT AND DISCOVERY AND RECONSTRUCTION OF FUSION TRANSCRIPTS
Salzberg, Steven L
MetadataShow full item record
RNA-sequencing technologies, which sequence the RNA molecules being transcribed in cells, allow us to explore the process of transcription in exquisite detail. One of the primary goals of RNA sequencing analysis is to reconstruct the full set of transcripts (isoforms) of genes that were present in the original cells. In addition to the transcript structures, experimenters need to estimate the expression levels for all transcripts. The first step in the analysis process is to map the RNA-seq reads against the reference genome, which provides the location from which the reads originated. In contrast to DNA sequence alignment, RNA-seq mapping algorithms have two additional challenges. First, any RNA-seq alignment program must be able to handle gapped alignment (or spliced alignment) with very large gaps due to introns, typically from 50-100,000 bases in mammalian genomes. Second, the presence of processed pseudogenes from which introns have been removed may cause many exon-spanning reads to map incorrectly. In order to cope with these problems effectively, I have developed new alignment algorithms and implemented them in TopHat2, a second version of TopHat (one of the first spliced aligners for RNA-seq reads). The new TopHat2 program can align reads of various lengths produced by the latest sequencing technologies, while allowing for variable-length insertions and deletions with respect to the reference genome. TopHat2 combines the ability to discover novel splice sites with direct mapping to known transcripts, producing more sensitive and accurate alignments, even for highly repetitive genomes or in the presence of processed pseudogenes. These new capabilities will contribute to improvements in the quality of downstream analysis. In addition to its splice junction mapping algorithm, I have developed novel algorithms to align reads across fusion break points, which result from the breakage and re-joining of two different chromosomes, or from rearrangements within a chromosome. Based on this new fusion alignment algorithm, I have developed TransFUSE, one of the first systems for reconstruction and quantification of full- length fusion gene transcripts. TransFUSE can be run with or without known gene annotations, and it can discover novel fusion transcripts that are transcribed from known or unknown genes.