Analysis and correction of compositional bias in sparse sequencing count data

s12864-018-5160-5.pdf(1.86 MB)
Kumar, M. Senthil
Slud, Eric V.
Okrah, Kwame
Hicks, Stephanie C.
Hannenhalli, Sridhar
Bravo, Héctor Corrada
Kumar, M., Slud, E., Okrah, K. et al. Analysis and correction of compositional bias in sparse sequencing count data. BMC Genomics 19, 799 (2018).
Count data derived from high-throughput deoxyribonucleic acid (DNA) sequencing is frequently used in quantitative molecular assays. Due to properties inherent to the sequencing process, unnormalized count data is compositional: it measures relative, not absolute, abundances of the assayed features. This compositional bias confounds inference of absolute abundances. Commonly used count data normalization approaches such as library size scaling, rarefaction, and subsampling cannot correct for compositional bias, or for any other relevant technical bias that is uncorrelated with library size. We demonstrate that existing techniques for estimating compositional bias fail with sparse metagenomic 16S count data and propose an empirical Bayes normalization approach to overcome this problem. In addition, we clarify the assumptions underlying frequently used scaling normalization methods in light of compositional bias, including scaling methods that were not designed directly to address it.
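
The point that library size scaling cannot remove compositional bias can be illustrated with a small numerical sketch. The abundances, the sequencing depth, and the function names below are hypothetical and chosen only for illustration; this is not the authors' empirical Bayes method, and it uses expected counts rather than sampled ones. Both samples are sequenced to the same depth, so the distortion shown is unrelated to library size.

```python
# Hypothetical absolute abundances for four features in two samples.
# Only feature 0 truly changes (a 10-fold increase in sample B).
absolute_a = [100.0, 200.0, 300.0, 400.0]
absolute_b = [1000.0, 200.0, 300.0, 400.0]


def expected_counts(absolute, depth):
    """Expected counts when reads are drawn in proportion to relative abundance."""
    total = sum(absolute)
    return [depth * x / total for x in absolute]


def library_size_scale(counts):
    """Divide each count by the sample's total count (library size)."""
    total = sum(counts)
    return [c / total for c in counts]


depth = 10000  # assumed identical sequencing depth for both samples
prop_a = library_size_scale(expected_counts(absolute_a, depth))
prop_b = library_size_scale(expected_counts(absolute_b, depth))

# Fold changes seen after library size scaling versus the true fold changes.
observed_fc = [b / a for a, b in zip(prop_a, prop_b)]
true_fc = [b / a for a, b in zip(absolute_a, absolute_b)]

# Every feature is off by the same factor: the ratio of total absolute
# abundances, sum(absolute_a) / sum(absolute_b).
bias = [o / t for o, t in zip(observed_fc, true_fc)]

print("observed fold changes:", [round(x, 3) for x in observed_fc])
print("true fold changes:    ", true_fc)
print("per-feature bias:     ", [round(x, 3) for x in bias])
```

In this sketch the three unchanged features appear to drop to roughly half their abundance, and the 10-fold increase in feature 0 appears as roughly a 5-fold increase. The shared per-feature factor is the quantity that a compositional correction would need to estimate.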