Improving Genome Assembly

Thumbnail Image
umi-umd-2750.pdf(3.32 MB)
No. of downloads: 672
Publication or External Link
Ustun, Cevat
Yorke, James A
We present a reliable, easy to implement algorithm to generate a set of highly reliable overlaps based on identifying repeat k-mers. Our method is coverage independent. Whereas traditionally reads have been trimmed to have expected error rates of 2%, we find our error correction allows extending usable sequence in reads to 16% trimming. We use a version of the Phrap assembly program that uses only overlaps computed by the UMD overlapper, called PhrapUMD. We integrate the UMD algorithms with Baylor's ATLAS assembler applied to Rattus norvegicus. Starting with the same data as the Nov. 2002 ATLAS assembly, we compare our results to 4.5 Mbp of rat sequence in 21 BACs that have been finished. We find that after extension and error correction, (i) the reads are 30% longer than reads trimmed to 2%; (ii) the average error rate across the extended reads is about 3 in 10,000 bases, with 88% of the extended reads matching finished sequence exactly across their entire length; and (iii) PhrapUMD with these reads and our reliable overlaps produces a draft assembly of the rat which has no misassemblies and increases the coverage of finished sequence from 92.2% to 95.7%, while simultaneously reducing the base error rate for quality 20 or higher bases from 1.50 to 0.87 errors per 10,000.