Data-driven algorithms for characterizing microbial communities

Complex microbial communities play a crucial role in environmental and human health. Traditionally, microbes have been studied by isolating and culturing them, an approach that misses organisms that cannot grow in standard laboratory conditions and loses information about microbe-microbe interactions. With affordable high-throughput sequencing, a new field called metagenomics has emerged that studies the genomic content of the microbial community as a whole. Metagenomics enables researchers to characterize complex microbial communities; however, many computational challenges remain in downstream analyses of large sequencing datasets. Here, we explore some fundamental problems in metagenomics and present simple algorithms, along with open-source software tools that implement them.

In the first section, we focus on using a reference database of known organisms (and genomic segments within) to taxonomically classify sequences and estimate abundances of taxa in a metagenomic sample. We developed a “BLAST outlier detection” algorithm that identifies significant outliers within database search results. We extended this method and developed ATLAS, which uses significant database hits to group sequences in the database into partitions. These partitions capture the extent of ambiguity within the classification of a sample. Besides taxonomically classifying sequences, we also explored the problem of taxonomic abundance profiling, i.e., estimating the abundance of different species in the community. We describe TIPP2, a marker gene-based abundance profiling method, which combines phylogenetic placement with statistical techniques to control classification accuracy. TIPP2 includes an updated set of reference packages and several algorithmic improvements over the original TIPP method.

Next, we explore how to reconstruct genomes from metagenomic samples. Despite advances in metagenome assembly algorithms, assembling reads into complete genomes is still a computationally challenging problem because of repeated sequences within and among genomes, uneven abundances of organisms, sequencing errors, and strain-level variation. To improve upon the fragmented assemblies, researchers use a strategy called binning, which involves clustering together genomic fragments that likely originate from an individual organism. We describe Binnacle, a tool that explicitly accounts for scaffold information in binning. We describe novel algorithms for estimating the scaffold-level depth of coverage information and show that variation-aware scaffolders help further improve the completeness and quality of the resulting metagenomic bins.

Finally, we explore how to organize the enormous sets of sequence data generated through the surge of metagenomic studies. There have been recent efforts to organize and document genes found in microbial communities in “microbial gene catalogs”. Although gene catalogs are commonly used, they have not been critically evaluated for their effectiveness as a basis for metagenomic analyses. We investigated one such catalog, focusing on both the approach used to construct it and its effectiveness when used as a reference for microbiome studies. Our results highlight important limitations of the approach used to construct the catalog and call into question the broad usefulness of gene catalogs. We also recommend best practices for the construction and use of gene catalogs in microbiome studies and highlight opportunities for future research.

With the growing volume of data generated by metagenomic studies, we believe our ideas, algorithms, and software tools are well timed to meet the needs of the community.