Development of machine learning and advanced data analytical techniques to incorporate genomic data in predictive modeling for Salmonella enterica

dc.contributor.advisor Pradhan, Abani K en_US
dc.contributor.author Karanth, Shraddha en_US
dc.contributor.department Food Science en_US
dc.contributor.publisher Digital Repository at the University of Maryland en_US
dc.contributor.publisher University of Maryland (College Park, Md.) en_US
dc.date.accessioned 2022-02-04T06:36:25Z
dc.date.available 2022-02-04T06:36:25Z
dc.date.issued 2021 en_US
dc.description.abstract The past few decades have seen a renaissance in the field of food safety, with the increasing use of genomic data (e.g., from whole genome sequencing (WGS)) in determining the cause of microbial foodborne illness, particularly for multi-serovar agents such as Salmonella enterica. However, the use of such data in a preventative framework, specifically in quantitative microbial risk assessment (QMRA), remains in its infancy, because incorporating such large-scale datasets into statistical models is hindered by the sheer number of variables/features they introduce. Thus, the goal of this research is to introduce machine learning (ML)-based approaches for incorporating WGS data into various stages of a risk assessment for Salmonella enterica. Specifically, we developed an ML-based workflow to associate gene presence/absence data from microbial whole genome sequences with the severity of Salmonella-related health outcomes in host systems. A key contribution of this dissertation is assessing the applicability of the Elastic Net model, a recursive feature selection technique, which resolves a well-known issue in WGS-based data analysis: the variables/features outnumber the observations. Building on this finding, we developed a gene-weighted Poisson regression method to incorporate genes into a dose-response framework for Salmonella enterica, thereby incorporating genetic variability directly into a risk assessment framework. Finally, we combined machine learning with count-based models to determine how significant genes interact with meteorological factors to affect the severity of salmonellosis outbreaks. This dissertation yields two main findings. First, although commonly used classifiers (such as random forest) performed well in predicting disease severity, logistic regression in conjunction with Elastic Net performed significantly better.
This finding is important because the result of a logistic regression is generally more interpretable than that of other classifiers, which eases its incorporation into predictive microbial modeling. Second, machine learning-supported count-based models, such as Poisson regression, also proved to be a good fit for gene-informed dose-response modeling and for determining outbreak severity when combined with extrinsic factors such as atmospheric temperature and precipitation. Overall, this dissertation identified areas within a QMRA framework that could benefit from incorporating genetic information, and introduced ML models to incorporate such information. en_US
dc.identifier https://doi.org/10.13016/zsof-n669
dc.identifier.uri http://hdl.handle.net/1903/28441
dc.language.iso en en_US
dc.subject.pqcontrolled Food science en_US
dc.subject.pqcontrolled Microbiology en_US
dc.subject.pqcontrolled Public health en_US
dc.subject.pquncontrolled Count-based models en_US
dc.subject.pquncontrolled Elastic Net en_US
dc.subject.pquncontrolled Machine learning en_US
dc.subject.pquncontrolled Predictive modeling en_US
dc.subject.pquncontrolled Salmonella enterica en_US
dc.subject.pquncontrolled Whole genome sequencing en_US
dc.title Development of machine learning and advanced data analytical techniques to incorporate genomic data in predictive modeling for Salmonella enterica en_US
dc.type Dissertation en_US

Files

Original bundle
Name: Karanth_umd_0117E_22090.pdf
Size: 2.59 MB
Format: Adobe Portable Document Format