Development of machine learning and advanced data analytical techniques to incorporate genomic data in predictive modeling for Salmonella enterica

The past few decades have seen a renaissance in the field of food safety, with the increasing use of genomic data (e.g., whole genome sequencing (WGS)) in determining the cause of microbial foodborne illness, particularly for multi-serovar agents such as Salmonella enterica. However, the use of such data in a preventative framework, specifically in quantitative microbial risk assessment (QMRA), remains in its infancy, because incorporating such large-scale datasets into statistical models is hindered by the sheer number of variables/features they introduce. The goal of this research is therefore to introduce machine learning (ML)-based approaches for incorporating WGS data into various stages of a risk assessment for Salmonella enterica. Specifically, we developed an ML-based workflow to associate gene presence/absence data from microbial whole genome sequences with the severity of Salmonella-related health outcomes in host systems. A key contribution of this dissertation is an assessment of the applicability of the Elastic Net model, a regularized regression technique with built-in feature selection, which addresses a well-known issue in WGS-based data analysis: the variables/features outnumber the observations. Building on this finding, we developed a gene-weighted Poisson regression method to incorporate genes into a dose-response framework for Salmonella enterica, thereby bringing genetic variability directly into a risk assessment framework. Finally, we combined machine learning with count-based models to determine how significant genes interact with meteorological factors to affect the severity of salmonellosis outbreaks. This dissertation yields two main findings. First, although commonly used classifiers (such as random forest) performed well in predicting disease severity, logistic regression in conjunction with Elastic Net performed significantly better.
This finding is important because the result of a logistic regression is generally more interpretable than that of other classifiers, which eases its incorporation into predictive microbial modeling. Second, machine learning-supported count-based models, such as Poisson regression, also proved to be a good fit for gene-informed dose-response modeling and for determining outbreak severity when combined with extrinsic factors such as atmospheric temperature and precipitation. Overall, this dissertation identified areas within a QMRA framework that could benefit from incorporating genetic information, and introduced ML models to incorporate such information.
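
To make the "more features than observations" setting concrete, the following is a minimal sketch of Elastic Net-penalized logistic regression fitted to a gene presence/absence matrix by proximal gradient descent. It is not the dissertation's workflow: the data are synthetic, the gene at index 0 is a hypothetical severity-linked gene, the function names and the values of `alpha` and `l1_ratio` are illustrative assumptions, and in practice an established library implementation would normally be used instead of this hand-written solver.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net_logistic(X, y, alpha=0.2, l1_ratio=0.5, n_iter=200):
    """Minimise mean log-loss + alpha * (l1_ratio * |w|_1 + 0.5 * (1 - l1_ratio) * |w|^2)."""
    n, p = len(X), len(X[0])
    ridge = alpha * (1.0 - l1_ratio)
    lasso = alpha * l1_ratio
    # Conservative step size from a Lipschitz bound on the smooth part
    # (the +1 accounts for the unpenalised intercept).
    max_row_sq = max(sum(v * v for v in row) for row in X) + 1.0
    step = 1.0 / (0.25 * max_row_sq + ridge)
    # Start the intercept at the log-odds of the observed severe-case rate.
    prev = min(max(sum(y) / n, 1e-6), 1.0 - 1e-6)
    w, b = [0.0] * p, math.log(prev / (1.0 - prev))
    for _ in range(n_iter):
        resid = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - yi
                 for row, yi in zip(X, y)]
        b -= step * sum(resid) / n
        for j in range(p):
            grad = sum(r * row[j] for r, row in zip(resid, X)) / n + ridge * w[j]
            # Gradient step on the smooth part, then L1 shrinkage.
            w[j] = soft_threshold(w[j] - step * grad, step * lasso)
    return w, b


# Synthetic example (assumption): 40 isolates, 120 genes, so features > observations.
random.seed(0)
n_isolates, n_genes = 40, 120
X = [[1 if random.random() < 0.3 else 0 for _ in range(n_genes)]
     for _ in range(n_isolates)]
for i, row in enumerate(X):
    row[0] = 1 if i % 2 == 0 else 0   # hypothetical gene carried by every second isolate
y = [row[0] for row in X]             # severe outcome driven entirely by gene 0

weights, intercept = fit_elastic_net_logistic(X, y)
selected = [j for j, wj in enumerate(weights) if wj != 0.0]
print(f"{len(selected)} of {n_genes} genes retained; gene 0 weight = {weights[0]:.3f}")
```

The L1 part of the penalty sets the coefficients of uninformative genes exactly to zero, and the retained genes are read directly from the nonzero entries of `weights`. This is what makes the fitted model comparatively easy to interpret.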