ALGORITHMS AND HIGH PERFORMANCE COMPUTING APPROACHES FOR SEQUENCING-BASED COMPARATIVE GENOMICS
Langmead, Benjamin Thomas
Salzberg, Steven L
As the cost and throughput of second-generation sequencers continue to improve, even modestly resourced research laboratories can now perform DNA sequencing experiments that generate hundreds of billions of nucleotides of data, enough to cover the human genome dozens of times over, in about a week and for a few thousand dollars. Such data are now being generated rapidly by research groups across the world, and large-scale analyses of these data appear often in high-profile publications such as Nature, Science, and The New England Journal of Medicine. But with these advances comes a serious problem: growth in per-sequencer throughput (currently about 4x per year) is drastically outpacing growth in computer speed (about 2x every 2 years). As the throughput gap widens over time, sequence analysis software is becoming a performance bottleneck, and the costs of building and maintaining the needed computing resources are burdensome for research laboratories. This thesis proposes two methods and describes four open source software tools that help to address these issues using novel algorithms and high-performance computing techniques. The proposed approaches build primarily on two insights. First, the Burrows-Wheeler Transform and the FM Index, previously used for data compression and exact string matching, can be extended to facilitate fast and memory-efficient alignment of DNA sequences to long reference genomes such as the human genome. Second, these algorithmic advances can be combined with MapReduce and cloud computing to solve comparative genomics problems in a manner that is scalable, fault tolerant, and usable even by small research groups.
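
To make the first insight more concrete, the following is a minimal Python sketch of exact string matching with the Burrows-Wheeler Transform and an FM Index, the starting point that the thesis extends to alignment. It is an illustration under stated assumptions, not the thesis's implementation: the function names (`bwt_via_suffix_array`, `build_fm_index`, `find_exact`) are hypothetical, the index is built with a naive suffix sort that would not scale to a human-sized genome, the occurrence table is stored uncompressed, and mismatches or gaps are not handled.

```python
from collections import Counter


def bwt_via_suffix_array(text):
    """Return (bwt, suffix_array) for text + '$'.

    Assumption: a naive sort of all suffixes, fine for a toy example only.
    '$' is a terminator that sorts before A, C, G, and T.
    """
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT character for each suffix is the character preceding it
    # (text[-1] == '$' wraps around for the suffix starting at 0).
    bwt = "".join(text[i - 1] for i in sa)
    return bwt, sa


def build_fm_index(bwt):
    """Return (first_row, occ) tables used by backward search."""
    counts = Counter(bwt)
    # first_row[c]: number of characters in the text strictly smaller than c.
    first_row, total = {}, 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]
    # occ[c][i]: number of occurrences of c in bwt[:i] (uncompressed here).
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in counts:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first_row, occ


def find_exact(pattern, first_row, occ, sa):
    """Return sorted start offsets of exact occurrences of pattern."""
    lo, hi = 0, len(sa)  # half-open range of matching suffix-array rows
    for ch in reversed(pattern):  # backward search: last character first
        if ch not in first_row:
            return []
        lo = first_row[ch] + occ[ch][lo]
        hi = first_row[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


reference = "GATTACAGATTACA"  # toy reference, not real genomic data
bwt, sa = bwt_via_suffix_array(reference)
first_row, occ = build_fm_index(bwt)
hits = find_exact("ATTA", first_row, occ, sa)
```

Each step of the backward search costs two table lookups, so the search time in this sketch grows with the length of the query rather than the length of the reference, which is the property that makes the approach attractive for aligning many short reads to one long genome.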