College of Agriculture & Natural Resources
Permanent URI for this community: http://hdl.handle.net/1903/1598
The collections in this community comprise faculty research works, as well as graduate theses and dissertations.
2 results
Search Results
Item: Development of machine learning and advanced data analytical techniques to incorporate genomic data in predictive modeling for Salmonella enterica (2021)
Karanth, Shraddha; Pradhan, Abani K; Food Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
The past few decades have seen a renaissance in the field of food safety, with the increasing use of genomic data (e.g., whole genome sequencing (WGS)) in determining the cause of microbial foodborne illness, particularly for multi-serovar agents such as Salmonella enterica. However, the use of such data in a preventative framework, specifically in the field of quantitative microbial risk assessment (QMRA), remains in its infancy, because incorporating such large-scale datasets in statistical models is hindered by the sheer number of variables/features introduced. Thus, the goal of this research is to introduce machine learning (ML)-based approaches to potentially incorporate WGS data in various stages of a risk assessment for Salmonella enterica. Specifically, we developed a machine learning-based workflow to obtain an association between gene presence/absence data from microbial whole genome sequences and the severity of Salmonella-related health outcomes in host systems. A key contribution of this dissertation is assessing the applicability of the Elastic Net model, a recursive feature selection technique, which resolves a well-known issue concerning WGS-based data analysis: variables/features outnumber observations. Building on this finding, we developed a gene-weighted Poisson regression method to incorporate genes into a dose-response framework for Salmonella enterica, thereby incorporating genetic variability directly into a risk assessment framework. Finally, we combined machine learning with count-based models to determine how significant genes interact with meteorological factors in impacting the severity of salmonellosis outbreaks.
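The "more features than observations" problem mentioned above can be illustrated with a minimal sketch, which is not the dissertation's workflow. It fits a logistic regression with an elastic-net penalty (combined L1 and L2) by proximal gradient descent, using only the Python standard library. The data, function names, penalty weights, step size, and iteration count are all assumptions made for illustration: 8 hypothetical isolates, 17 presence/absence genes, with gene 0 tied to the severe outcome and the other genes balanced across outcome classes by construction.

```python
import math


def toy_gene_matrix():
    """Hypothetical presence/absence data: 8 isolates x 17 genes (features > observations)."""
    y = [1, 1, 1, 1, 0, 0, 0, 0]      # 1 = severe outcome, 0 = mild (made-up labels)
    columns = [y[:]]                   # gene 0 is present exactly in the severe isolates
    # Genes 1..16 are uninformative by construction: each one occurs in
    # exactly one severe isolate and one mild isolate.
    for pos in range(4):
        for neg in range(4, 8):
            columns.append([1 if i in (pos, neg) else 0 for i in range(8)])
    X = [[col[i] for col in columns] for i in range(8)]   # rows = isolates
    return X, y


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(v, t):
    """Proximal operator of the L1 penalty: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_elastic_net_logistic(X, y, l1=0.05, l2=0.05, lr=0.3, n_iter=3000):
    """Minimise mean log-loss + l1*|w|_1 + 0.5*l2*|w|^2 by proximal gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0              # the intercept b is left unpenalised
    for _ in range(n_iter):
        # Residuals: predicted probability minus observed label.
        resid = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - yi
                 for row, yi in zip(X, y)]
        grad_b = sum(resid) / n
        new_w = []
        for j in range(p):
            # Gradient of the smooth part (log-loss plus L2 term) for gene j.
            grad_j = sum(r * row[j] for r, row in zip(resid, X)) / n + l2 * w[j]
            # Gradient step followed by the L1 shrinkage step.
            new_w.append(soft_threshold(w[j] - lr * grad_j, lr * l1))
        w, b = new_w, b - lr * grad_b
    return w, b


X, y = toy_gene_matrix()
weights, intercept = fit_elastic_net_logistic(X, y)
selected = [j for j, wj in enumerate(weights) if wj != 0.0]
print("genes kept by the penalty:", selected)
```

On this toy data the L1 part of the penalty sets the weights of the balanced genes to exactly zero, so only gene 0 is retained even though features outnumber observations. Real WGS data would need proper cross-validation to choose the penalty weights, and a tested library implementation would be preferable to this hand-rolled loop.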
This dissertation yields several findings. First, although commonly used classifiers (such as random forest) performed well in predicting disease severity, logistic regression, in conjunction with Elastic Net, performed significantly better. This finding is important, as the result of a logistic regression is generally more interpretable than that of other classifiers, easing its incorporation into predictive microbial modeling. Next, machine learning-supported count-based models, such as Poisson regression, also proved to be a good fit for gene-informed dose-response modeling and for determining outbreak severity when combined with extrinsic factors such as atmospheric temperature and precipitation. Overall, this dissertation identified areas within a QMRA framework that could benefit from incorporating genetic information, and introduced ML models to incorporate such information.

Item: A COMPARATIVE ANALYSIS OF RANDOM FOREST AND LOGISTIC REGRESSION FOR WEED RISK ASSESSMENT (2018)
Harris, Chinchu; Peer, Wendy; Plant Science and Landscape Architecture (PSLA); Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Invasive species have largely negative impacts on the environment and the economy. The management and regulation of invasive plants are facilitated by screening tools, such as weed risk assessments (WRAs), which predict the invasive potential of non-native plants. Identifying these species and regulating their importation helps to reduce the risk of future ecosystem and economic costs. Globally, many different types of highly useful WRAs are already available. However, in this era of big data and powerful predictive analytics, there is an increasing demand for new and more robust screening tools. In this thesis, I use the machine learning algorithm random forests to develop a new WRA.
I show that the random forest model has greater predictive accuracy than an existing logistic regression model and that random forest is a better learner. In addition, variable importance analysis was performed to identify factors associated with the invasive status classification of non-native plants. The study suggests that random forests make powerful weed risk screening tools and should be used alongside other WRAs to assess invasive risk potential. An integrative approach to evaluating weed risk can greatly facilitate the WRA process.