Detecting and Correcting Errors in Genome Assemblies

dc.contributor.advisorYorke, James Aen_US
dc.contributor.authorSubramanian, Pooranien_US
dc.contributor.departmentApplied Mathematics and Scientific Computationen_US
dc.contributor.publisherDigital Repository at the University of Marylanden_US
dc.contributor.publisherUniversity of Maryland (College Park, Md.)en_US
dc.date.accessioned2011-02-19T07:07:48Z
dc.date.available2011-02-19T07:07:48Z
dc.date.issued2010en_US
dc.description.abstractGenome assemblies have various types of deficiencies or misassemblies. This work is aimed at detecting and correcting a type of misassembly that we call Compression/Expansion or CE misassemblies whereby a section of sequence has been erroneously omitted or inserted in the assembly. Other types of deficiencies include gaps in the genome sequence. We developed a statistic for identifying Compression/Expansion misassemblies called the CE statistic. It is based on examining the placement of mate pairs of reads in the assembly. In addition to this, we developed an algorithm that is aimed at closing gaps and validating and/or correcting CE misassemblies detected by the CE statistic. This algorithm is similar to a shooting algorithm used in solving two-point boundary value problems in partial differential equations. We call this algorithm the Shooting Method. The Shooting Method finds all possible ways to assemble a local region of the genome contained between two target reads. We use a combination of the CE statistic and Shooting Method to detect and correct some CE misassemblies and close gaps in genome assemblies. We tested our techniques both on faux and real data. Applying this technique to 22 bacterial draft assemblies for which the finished genome sequence is known, we were able to identify 5 out of 8 real CE misassemblies. We applied the Shooting Method to a de novo assembly of the Bos taurus genome made from Sanger data. We were able to close 9,863 gaps out of 58,386. This added 8.34 Mbp of sequence to the assembly, and resulted in a 7 % increase of N50 contig size.en_US
dc.identifier.urihttp://hdl.handle.net/1903/11228
dc.subject.pqcontrolledApplied Mathematicsen_US
dc.subject.pqcontrolledBioinformaticsen_US
dc.titleDetecting and Correcting Errors in Genome Assembliesen_US
dc.typeDissertationen_US

Files

Original bundle
Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
Subramanian_umd_0117E_11793.pdf
Size:
1.43 MB
Format:
Adobe Portable Document Format