Clustering metagenomic sequences with interpolated Markov models

dc.contributor.author: Kelley, David R
dc.contributor.author: Salzberg, Steven L
dc.date.accessioned: 2013-01-10T21:17:48Z
dc.date.available: 2013-01-10T21:17:48Z
dc.date.issued: 2010-11-02
dc.description.abstract: Background: Sequencing of environmental DNA (often called metagenomics) has shown tremendous potential to uncover the vast number of unknown microbes that cannot be cultured and sequenced by traditional methods. Because the output from metagenomic sequencing is a large set of reads of unknown origin, clustering together the reads that were sequenced from the same species is a crucial analysis step. Many effective approaches to this task rely on sequenced genomes in public databases, but these genomes are a highly biased sample that is not necessarily representative of the environments of interest to many metagenomics projects. Results: We present SCIMM (Sequence Clustering with Interpolated Markov Models), an unsupervised sequence clustering method. SCIMM achieves greater clustering accuracy than previous unsupervised approaches. We examine the limitations of unsupervised learning on complex datasets and suggest PHYSCIMM, a hybrid of SCIMM and the supervised learning method Phymm, which performs better when evolutionarily close training genomes are available. Conclusions: SCIMM and PHYSCIMM are highly accurate methods for clustering metagenomic sequences. SCIMM operates entirely unsupervised, making it ideal for environments containing mostly novel microbes. PHYSCIMM uses supervised learning to improve clustering in environments containing microbial strains from well-characterized genera. SCIMM and PHYSCIMM are available open source from http://www.cbcb.umd.edu/software/scimm. [en_US]
dc.description.uri: https://doi.org/10.1186/1471-2105-11-544
dc.identifier.citation: Kelley, D.R., Salzberg, S.L. Clustering metagenomic sequences with interpolated Markov models. BMC Bioinformatics 11, 544 (2010). [en_US]
dc.identifier.uri: http://hdl.handle.net/1903/13365
dc.language.iso: en_US [en_US]
dc.relation.isAvailableAt: College of Computer, Mathematical & Physical Sciences [en_us]
dc.relation.isAvailableAt: Digital Repository at the University of Maryland [en_us]
dc.relation.isAvailableAt: Biology [en_us]
dc.relation.isAvailableAt: University of Maryland (College Park, MD) [en_us]
dc.subject: metagenomics [en_US]
dc.subject: environmental DNA [en_US]
dc.title: Clustering metagenomic sequences with interpolated Markov models [en_US]
dc.type: Article [en_US]

Files

Original bundle
Name: Kelley and Salzberg.pdf
Size: 753.76 KB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 1.57 KB
Format: Item-specific license agreed upon to submission