Computer Science Theses and Dissertations
Permanent URI for this collection: http://hdl.handle.net/1903/2756
Item OUT OF DISTRIBUTION EVALUATION OF NATURAL LANGUAGE PROCESSING SYSTEMS: GENERALIZATION TO LOW-RESOURCE AND DISTANT LANGUAGES AND HUMAN-AI COLLABORATIVE WRITING (2024) Richburg, Aquia; Carpuat, Marine; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Large language models have revolutionized natural language processing with their capabilities in text generation and understanding. Their rich contextual representations, learned from training on diverse text datasets, have led LLMs to be used across a variety of settings. However, this increases the chance of models being used in unintended use cases and causing harm to users. This dissertation delves into empirical studies of out-of-distribution issues in text generation (machine translation) and text classification (authorship analysis) tasks, examining how LLMs perform in settings distant from their training distributions.
In our first work, the goal is to understand the characteristics of the training distribution of LLMs by visualizing the roles of samples during the training of a machine translation model. Our results indicate that sample contributions are not uniform and play complex roles throughout the training process. This highlights the difficulty of describing samples that are representative of the training distribution and motivates thorough evaluation of models in diverse settings. Our second and third works turn to the evaluation of LLMs in out-of-distribution settings to better understand their strengths and limitations for generalization on unseen tasks. We evaluate LLMs on machine translation tasks, focusing on how translation quality is affected by the presence or absence of specific language pairs in the training data. Our findings show that while finetuning improves translation for unseen languages, the impact varies across different language pairs. This emphasizes the need for further research to enable effective massively multilingual translation with LLMs. In text classification, we explore out-of-distribution generalization for authorship analysis in the context of human-AI collaborative writing. Our studies reveal that traditional AI detection models underperform when distinguishing between human and AI cowritten text. Simpler n-gram techniques are more robust than LLMs for authorship identification, suggesting the need for adapted authorship analysis tools. In summary, this dissertation advances our understanding of LLM generalization and provides insights for improving the robustness and adaptability of NLP systems.
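An illustrative aside on the robustness finding above: a minimal character n-gram authorship baseline of the kind the abstract refers to can be sketched with scikit-learn as below, assuming placeholder texts and labels; this is not the dissertation's pipeline or data.

    # Hypothetical character n-gram baseline for distinguishing human from
    # AI-cowritten text, illustrating the kind of "simpler n-gram technique"
    # referenced above; texts and labels are placeholders.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = ["a fully human-written sample ...", "a human-AI cowritten sample ..."]
    labels = [0, 1]  # 0 = human, 1 = cowritten (placeholder labels)

    clf = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),  # character 2- to 4-grams
        LogisticRegression(max_iter=1000),
    )
    clf.fit(texts, labels)
    print(clf.predict(["another document to classify ..."]))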
Item MATHEMATICS OF THE DYNAMICS AND CONTROL OF THE SARS-COV-2 PANDEMIC (2024) Pant, Binod; Gumel, Abba B.; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
The pneumonia-like illness that emerged late in 2019, caused by SARS-CoV-2 (and coined COVID-19), became the greatest public health challenge humans have faced since the 1918/1919 influenza pandemic, causing over 670 million confirmed cases and 7 million fatalities globally. This dissertation contributes deep qualitative insights into the transmission dynamics and control of the pandemic, using mathematical modeling approaches together with data analytics and computation.
Specifically, it addresses some of the pertinent challenges associated with modeling the dynamics of the disease, notably the disproportionate effect of the disease on certain (risk and demographic) populations (inducing various heterogeneities) and behavior changes with respect to adherence, or lack thereof, to interventions. An $m$-group model, which monitors the temporal dynamics of the disease in $m$ heterogeneous populations, was designed and used to study the impact of age heterogeneity and vaccination on the spread of the disease in the United States. For instance, the disease-free equilibrium for the case of the model with $m=1$ (i.e., the model with a homogeneous population) was shown to be globally-asymptotically stable for two special cases (when the vaccine is perfect or when disease-induced mortality is negligible) whenever the associated reproduction number of the homogeneous model is less than one. The homogeneous model has a unique endemic equilibrium whenever the reproduction threshold exceeds unity (this equilibrium was shown to be globally-asymptotically stable for a special case, using a nonlinear Lyapunov function of Goh-Volterra type). The homogeneous model was fitted to the observed cumulative mortality data for the SARS-CoV-2 pandemic in the United States during the period from January to May of 2022 (when Omicron was the predominant variant). It was shown that vaccine-derived herd immunity (needed to eliminate the disease) cannot be attained using the homogeneous model, regardless of the proportion of individuals fully vaccinated. Such vaccine-derived immunity can, however, be achieved using the $m$-group heterogeneous model with $m=2$ (where the total population is split into two groups: those under 65 years of age, and those 65 years and older), if at least 61% of the susceptible population is fully vaccinated. Thus, this dissertation shows that heterogeneity reduces the level of vaccine coverage needed to eliminate the pandemic (and models that do not account for heterogeneity may be over-estimating the vaccination coverage needed to achieve herd immunity in the community). To quantify the impact of human behavior changes on the spread and control of the pandemic, we designed a novel behavior-epidemiology model which considers numerous metrics for inducing human behavior changes (such as the current level of disease burden and intervention adherence fatigue). Unlike the equivalent model without human behavior explicitly incorporated, the behavior-epidemiology model fits the observed cumulative mortality data and predicts the observed daily mortality data very well. It was also shown that the behavior metrics related to the level of SARS-CoV-2 mortality and symptomatic transmission were more influential in inducing positive behavior changes than all other behavior metrics considered. Finally, a model was developed to assess the utility of wastewater surveillance for studying the transmission dynamics and control of SARS-CoV-2 in a community. Specifically, we developed and calibrated a wastewater-based epidemiology model using wastewater data from Miami-Dade County, Florida, during the third wave of the SARS-CoV-2 pandemic. The model showed a strong correlation between the observed (detected) weekly case data and the corresponding weekly data predicted by the calibrated model.
The model's prediction of the week when the maximum number of SARS-CoV-2 cases would be recorded in the county during the simulation period precisely matched the week when the maximum observed/reported cases were recorded (August 14, 2021). Furthermore, the model's projection of the maximum number of cases for the week of August 14, 2021 was about 15 times higher than the maximum observed weekly case count for the county that week (i.e., the maximum case count estimated by the model was 15 times higher than the actual/observed count of confirmed cases). In addition to being in line with other modeling studies, this result is consistent with the CDC estimate that the reported confirmed case data may be 10 times lower than the actual number of cases (since the confirmed data did not account for asymptomatic and presymptomatic transmission). Furthermore, the model accurately predicted a one-week lag between the peaks in weekly COVID-19 case and hospitalization data during the time period of the study in Miami-Dade, with the model-predicted hospitalizations peaking on August 21, 2021. A detailed time-varying global sensitivity analysis was carried out to determine the parameters (wastewater-based, epidemiological, and biological) that have the most influence on the chosen response function (namely, the cumulative viral load in the wastewater). This analysis identified key parameters that significantly affect the value of the response function (hence, they should be targeted for intervention). This dissertation conclusively showed that wastewater surveillance data can be a very powerful indicator for measuring (i.e., providing an early-warning signal and the current burden) and predicting the future trajectory and burden (e.g., number of cases and hospitalizations) of emerging and re-emerging infectious diseases, such as SARS-CoV-2, in a community.
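For orientation, the familiar single-group form of the vaccine-derived herd immunity threshold mentioned above, p = (1 - 1/R0) / epsilon for basic reproduction number R0 and vaccine efficacy epsilon, can be sketched as follows; the dissertation's heterogeneous $m$-group computation is more involved, and the numbers below are placeholders rather than estimates from the work.

    # Standard single-group vaccine-derived herd immunity threshold: the fraction
    # of the population that must be fully vaccinated to push the control
    # reproduction number below one, given R0 and vaccine efficacy.
    def herd_immunity_threshold(r0: float, vaccine_efficacy: float) -> float:
        p = (1.0 - 1.0 / r0) / vaccine_efficacy
        return p  # a value above 1 means elimination is unattainable by vaccination alone

    # Placeholder numbers for illustration only (not estimates from the dissertation):
    print(herd_immunity_threshold(r0=3.0, vaccine_efficacy=0.8))  # ~0.83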
Item Quantum Algorithms for Nonconvex Optimization: Theory and Implementation (2024) Leng, Jiaqi; Wu, Xiaodi XW; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Continuous optimization problems arise in virtually all disciplines of quantitative research. While convex optimization has been well-studied in recent decades, large-scale nonconvex optimization problems remain intractable in both theory and practice. Quantum computers are expected to outperform classical computers in certain challenging computational problems. Some quantum algorithms for convex optimization have shown asymptotic speedups, while the quantum advantage for nonconvex optimization is yet to be fully understood. This thesis focuses on Quantum Hamiltonian Descent (QHD), a quantum algorithm for continuous optimization problems. A systematic study of Quantum Hamiltonian Descent is presented, including theoretical results concerning nonconvex optimization and efficient implementation techniques for quantum computers. Quantum Hamiltonian Descent is derived as the path integral of classical gradient descent algorithms. Due to the quantum interference of classical descent trajectories, Quantum Hamiltonian Descent exhibits drastically different behavior from classical gradient descent, especially for nonconvex problems. Under mild assumptions, we prove that Quantum Hamiltonian Descent can always find the global minimum of an unconstrained optimization problem given a sufficiently long runtime.
Moreover, we demonstrate that Quantum Hamiltonian Descent can efficiently solve a family of nonconvex optimization problems with exponentially many local minima, for which most commonly used classical optimizers require super-polynomial time. Additionally, by using Quantum Hamiltonian Descent as an algorithmic primitive, we show a quadratic oracular separation between quantum and classical computing. We consider the implementation of Quantum Hamiltonian Descent for two important paradigms of quantum computing, namely digital (fault-tolerant) and analog quantum computers. Exploiting the product formula for quantum Hamiltonian simulation, we demonstrate that a digital quantum computer can implement Quantum Hamiltonian Descent with gate complexity nearly linear in the problem dimension and evolution time. With a hardware-efficient sparse Hamiltonian simulation technique known as Hamiltonian embedding, we develop an analog implementation recipe for Quantum Hamiltonian Descent that addresses a broad class of nonlinear optimization problems, including nonconvex quadratic programming. This analog implementation approach is deployed on large-scale quantum spin-glass simulators, and the empirical results strongly suggest that Quantum Hamiltonian Descent has great potential for highly nonconvex and nonlinear optimization tasks.
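To make concrete the classical failure mode that the abstract contrasts QHD against, the following toy sketch (purely classical, with an arbitrary double-well objective chosen for illustration; it does not implement QHD) shows gradient descent stalling in a local minimum.

    # Toy illustration (not QHD itself): plain gradient descent on a nonconvex
    # double-well objective stalls in the higher local minimum when started in
    # the wrong basin, the failure mode that quantum interference of descent
    # trajectories is meant to overcome.
    def f(x):        # double well; the global minimum sits near x ~ -1.04
        return (x**2 - 1.0)**2 + 0.3 * x

    def grad_f(x):
        return 4.0 * x * (x**2 - 1.0) + 0.3

    x = 0.8                      # start in the basin of the spurious local minimum
    for _ in range(2000):
        x -= 0.01 * grad_f(x)    # fixed step size, chosen only for illustration
    print(x, f(x))               # ends near x ~ 0.96, f ~ 0.29, not the global minimum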
Item CUR Matrix Approximation Through Convex Optimization (2024) Linehan, Kathryn; Balan, Radu V; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
In this dissertation we present work on the CUR matrix approximation. Specifically, we present 1) an approximation of the proximal operator of the L-infinity norm using a neural network, 2) a novel deterministic CUR formulation and algorithm, and 3) a novel application of CUR as a feature selection method to determine discriminant proteins when clustering protein expression data in a self-organizing map (SOM). The proximal operator of the L-infinity norm arises in our CUR algorithm. Since the computation of the proximal operator of the L-infinity norm requires a sort of the input data (or at least a partial sort similar to quicksort), we present a neural network to approximate the proximal operator. A novel aspect of the network is that it is able to accept vectors of varying lengths due to a feature selection process that uses moments of the input data. We present results on the accuracy of the approximation, feature importance, and computational efficiency of the approach. We also present an algorithm to calculate the proximal operator of the L-infinity norm exactly, relate it to the Moreau decomposition, and compare its computational efficiency to that of the approximation. Next, we present a novel deterministic CUR formulation that uses convex optimization to form the matrices C and R, and a corresponding algorithm that uses bisection to ensure that the user-selected number of columns appears in C and the user-selected number of rows appears in R. We implement the algorithm using the surrogate functional technique of Daubechies et al. [Communications on Pure and Applied Mathematics, 57.11 (2004)] and extend the theory of this approach to apply to our CUR formulation. Numerical results are presented that demonstrate the effectiveness of our CUR algorithm as compared to the singular value decomposition (SVD) and other CUR algorithms. Last, we use our CUR approximation as a feature selection method in the application by Higuera et al. [PLOS ONE, 10(6) (2015)] to determine discriminant proteins when clustering protein expression data in an SOM. This is a novel application of CUR and, to the best of our knowledge, the first use of CUR on protein expression data. We compare the performance of our CUR algorithm to that of other CUR algorithms and the Wilcoxon rank-sum test (the original feature selection method in that work).
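Since the abstract ties the exact proximal operator of the L-infinity norm to sorting and to the Moreau decomposition, a generic sort-based evaluation can be sketched as below; this is a standard construction, not the dissertation's algorithm or code.

    import numpy as np

    # Generic sort-based evaluation of prox_{lam * ||.||_inf}(v) via the Moreau
    # decomposition prox(v) = v - P(v), where P is projection onto the L1 ball
    # of radius lam (computed with the usual sort-and-threshold routine). This
    # mirrors the relationship described above but is not the dissertation's code.
    def project_l1_ball(v, radius):
        if np.abs(v).sum() <= radius:
            return v.copy()
        u = np.sort(np.abs(v))[::-1]                  # sorted magnitudes, descending
        css = np.cumsum(u)
        rho = np.nonzero(u - (css - radius) / np.arange(1, len(u) + 1) > 0)[0][-1]
        theta = (css[rho] - radius) / (rho + 1.0)
        return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

    def prox_linf(v, lam):
        return v - project_l1_ball(v, lam)

    print(prox_linf(np.array([3.0, -1.0, 0.5]), lam=1.0))  # -> [ 2.  -1.   0.5]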
Item EFFICIENT COMPUTATIONAL ALGORITHMS FOR MAGNETIC EQUILIBRIUM IN A FUSION REACTOR (2024) Liang, Jiaxing; Elman, Howard C.; Sanchez-Vizuet, Tonatiuh; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
In a magnetic confinement fusion reactor, such as a Tokamak, hydrogen isotopes are injected into the chamber and heated to form a plasma. The hot plasma tends to expand, so it is crucial to confine it within the center of the reactor to prevent contact with the reactor walls. This confinement is achieved through magnetic fields generated by currents running through external coils surrounding the reactor. However, these currents may suffer from uncertainties, which could arise from factors like temperature fluctuations and material impurities. These variables introduce a level of unpredictability in the plasma's behavior and the overall stability of the fusion process. This thesis aims to investigate the impact that stochasticity in the current intensities has on the confinement properties of the plasma and to estimate the expected behavior of the magnetic field. While the focus is on the variability in current intensities, the tools developed can be applied to other sources of uncertainty, such as the positioning of coils and the source term parameters. To quantify the variability in model predictions and to evaluate the statistical properties of solutions over a range of parameter values, traditional sampling methods like Monte Carlo often require intensive and expensive nonlinear computations. To tackle this challenge, we propose three approaches. First, we focus on the development and application of a surrogate function, constructed via a stochastic collocation approach on a sparse grid in the parameter space. This surrogate function is employed to replace the nonlinear solution in Monte Carlo sampling processes. In our numerical experiments, we evaluate the efficiency and accuracy of the outcomes produced by the surrogate, in comparison with those obtained through direct nonlinear solutions. Our findings indicate that a carefully selected surrogate function reduces the sampling cost -- achieving acceleration factors ranging from 7 to over 30 -- while preserving the accuracy of the results. The second part of the thesis explores the multilevel Monte Carlo approach, investigating its potential for cost savings compared to simple Monte Carlo. This method involves conducting the majority of computations on a sequence of coarser spatial grids than a simple Monte Carlo simulation would typically use. We examine this approach with nonlinear computation, using both uniformly refined meshes and adaptively refined grids guided by a discrete error estimator. Numerical experiments reveal substantial cost reductions achieved through multilevel methods, typically ranging from a factor of 60 to more than 200. Adaptive gridding results in more accurate computation of relevant geometric parameters.
In the last part of this work, we explore hybrid methods that integrate surrogates with multilevel Monte Carlo to further reduce the sampling cost. We establish the optimal construction and sampling costs for the surrogate-based multilevel Monte Carlo. Numerical results demonstrate that surrogate-based multilevel Monte Carlo remarkably reduces the computational burden, requiring only 0.1 to 14 seconds for a target relative mean square error ranging from $8\times 10^{-3}$ to $2\times10^{-4}$ and reducing the cost of direct computation by factors of 50 to 300. In terms of accuracy, the surrogate-based sampling results exhibit close agreement with those obtained via direct computation, in both the plasma boundary and the geometric descriptors. The primary contributions of our work entail the application of stochastic collocation techniques and multilevel Monte Carlo methods to analyze plasma behavior under uncertainties in the currents within fusion reactors. Furthermore, we establish the universal sampling cost for the surrogate-enhanced multilevel Monte Carlo approach. Our methodology offers a new paradigm for approaching and managing computational challenges in this field.
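As a generic illustration of the multilevel Monte Carlo idea described above, the sketch below combines many cheap coarse-level samples with a few expensive correction samples via the telescoping identity; the "models" are simple stand-ins, not the nonlinear equilibrium solves from the thesis.

    import numpy as np

    # Toy two-level Monte Carlo estimator of E[Q], illustrating the telescoping
    # identity E[Q_fine] = E[Q_coarse] + E[Q_fine - Q_coarse]: many cheap coarse
    # samples plus a few expensive correction samples. The "models" below are
    # stand-ins, not the equilibrium computations from the thesis.
    rng = np.random.default_rng(0)

    def q_coarse(xi):   # cheap low-fidelity quantity of interest
        return np.sin(xi) + 0.05 * xi**2

    def q_fine(xi):     # expensive high-fidelity quantity of interest
        return np.sin(xi) + 0.05 * xi**2 + 0.01 * np.cos(3.0 * xi)

    xi_coarse = rng.normal(size=100_000)   # many samples on the coarse level
    xi_corr = rng.normal(size=1_000)       # few samples for the correction term
    estimate = q_coarse(xi_coarse).mean() + (q_fine(xi_corr) - q_coarse(xi_corr)).mean()
    print(estimate)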
Item Structured discovery in graphs: Recommender systems and temporal graph analysis (2024) Peyman, Sheyda Do'a; Lyzinski, Vince V.; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Graph-valued data arises in numerous diverse scientific fields, ranging from sociology, epidemiology, and genomics to neuroscience and economics. For example, sociologists have used graphs to examine the roles of user attributes (gender, class, year) at American colleges and universities through the study of Facebook friendship networks and have studied segregation and homophily in social networks; epidemiologists have recently modeled Human-nCov protein-protein interactions via graphs; and neuroscientists have used graphs to model neuronal connectomes. The structure of a graph, including its latent features, the relationships between vertices, and the importance of each vertex, comprises highly important properties that are central to graph analysis and inference. While it is common to imbue nodes and/or edges with implicitly observed numeric or qualitative features, in this work we consider latent network features that must be estimated from the network topology. The main focus of this text is to find ways of extracting the latent structure in the presence of network anomalies. These anomalies occur in different scenarios, including cases in which the graph is subject to an adversarial attack and the anomaly is inhibiting inference, and cases in which detecting the anomaly is the key inference task. The former case is explored in the context of vertex nomination information retrieval, where we consider both analytic methods for countering the adversarial noise and the addition of a user-in-the-loop in the retrieval algorithm to counter potential adversarial noise. In the latter case, we use graph embedding methods to discover sequential anomalies in network time series.
Item A Data Assimilation System for Lake Erie Based on the Local Ensemble Transform Kalman Filter (2024) Russell, David Scott; Ide, Kayo; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Data assimilation (DA) is the process by which a model forecast is adjusted to account for recent observations, taking into account both forecast and observation uncertainties. Although DA is common in numerical weather prediction (NWP) and other applications at global and regional scales, DA for large lakes such as North America's Great Lakes is still at an early stage of research and is not yet used operationally. In particular, the use of an ensemble-based approach to DA has scarcely been explored for large lakes, despite its growing popularity in operational NWP centers worldwide due to its dynamic estimation of the forecast covariance. Using Lake Erie as a test case, this study investigates the potential of ensemble DA to i) propagate forecast improvements throughout the lake and across forecast variables, and ii) inform the design of in-situ observing systems. The local ensemble transform Kalman filter (LETKF) is an efficient, localized, flexible variant of the ensemble Kalman filter (EnKF) that is used in multiple operational NWP centers. This work presents the development of a DA system for Lake Erie, which uses the LETKF to adjust forecasts of temperatures, currents, and water levels throughout the lake, using only lake surface temperature (LST) and temperature profile (TP) observations. The impact of both types of observations on all three forecast variables is evaluated within the framework of observing system simulation experiments (OSSEs), in which a DA system attempts to reconstruct a nature run (NR) by assimilating simulated observations of the NR. Observing system design questions are explored by comparing three different TP configurations. Assimilation of LST observations alone produces strong improvements to temperatures throughout the epilimnion (upper layer), while assimilation of TP observations extends these improvements to the hypolimnion (lower layer) near each profile. TP assimilation also shows improved representation of strong gyre currents and associated changes to thermocline depth and surface height, particularly when profiles sample from locations and depths where the thermal stratification in the forecast has been strongly affected by erroneous gyre currents. This work shows that the LETKF can be an efficient and effective tool for improving both forecasts and observing systems for large lakes, two essential ingredients in predicting the onset and development of economically and ecologically important phenomena such as harmful algal blooms (HABs) and hypoxia.
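To give a flavor of the ensemble transform update underlying the LETKF, the sketch below shows a minimal, un-localized analysis step in the spirit of Hunt et al. (2007) with toy dimensions and synthetic values; the thesis applies the localized version within a full lake model, which this sketch does not attempt.

    import numpy as np

    # Minimal, un-localized ensemble transform Kalman filter analysis step in the
    # spirit of Hunt et al. (2007); the LETKF applies an update of this form
    # locally at each grid point using nearby observations only. Sizes and
    # values below are toy placeholders.
    def etkf_analysis(Xb, yo, H, R):
        # Xb: n x k ensemble, yo: m obs, H: m x n operator, R: m x m obs covariance
        n, k = Xb.shape
        xb_mean = Xb.mean(axis=1, keepdims=True)
        Xp = Xb - xb_mean                              # background perturbations
        Yb = H @ Xb
        yb_mean = Yb.mean(axis=1, keepdims=True)
        Yp = Yb - yb_mean                              # perturbations in observation space
        Rinv = np.linalg.inv(R)
        Pa = np.linalg.inv((k - 1) * np.eye(k) + Yp.T @ Rinv @ Yp)
        w_mean = Pa @ Yp.T @ Rinv @ (yo.reshape(-1, 1) - yb_mean)
        evals, evecs = np.linalg.eigh((k - 1) * Pa)    # symmetric square root
        W = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
        return xb_mean + Xp @ (w_mean + W)             # analysis ensemble, n x k

    rng = np.random.default_rng(1)
    Xb = 10.0 + rng.normal(size=(5, 4))                # 5 state variables, 4 members
    H = np.eye(2, 5)                                   # observe the first two variables
    R = 0.5 * np.eye(2)
    yo = np.array([10.5, 9.8])
    print(etkf_analysis(Xb, yo, H, R).mean(axis=1))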
Item Ensemble Kalman Inverse Parameter Estimation for Human and Nature Dynamics Two (2023) Karpovich, Maia; Kalnay, Eugenia; Mote, Safa; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Since the widespread development of agriculture 10,000 years ago and particularly since the Industrial Revolution beginning in the 18th century, the coupled Earth and Human systems have experienced transformative change. The world's population and Gross Domestic Product have each increased by factors of at least eight in the last two centuries, powered by the intensive use of fossil energy and fossil water. This has had dramatic repercussions on the Earth system's stability, threatened by Human system activities such as habitat destruction, global warming, and depletion of Regenerating and Nonrenewable energy resources that increasingly alter environmental feedbacks. To analyze these changes, we have developed the second generation of the Human and Nature Dynamics model, HANDY2. HANDY2 is designed to simulate the dynamics of energy resources and population over the Industrial era from 1700 to 2200, flexibly incorporating real-world observations of population and energy consumption in an expanded suite of mechanisms that track capital investment, labor force allocation, class mobility, and extraction and production technologies. The use of automated Ensemble Kalman Inversion (EnKI) estimation for HANDY2's parameters allows us to accurately capture the rapid 20th-century rise in the use of phytomass and fossil fuels, as well as the global enrichment of Elites that puts pressure on natural resources and Commoners. EnKI-derived HANDY2 ensembles project that current world policies may lead to a collapse in the world's population by 2200 caused by rapid depletion of resources. However, this collapse can be prevented by a combination of actions taken to support voluntary family planning, lower economic inequality, and most importantly, invest in the rapid expansion of Renewable energy extraction.
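The parameter estimation machinery mentioned above can be illustrated generically: the sketch below runs a toy ensemble Kalman inversion loop in the style of Iglesias, Law and Stuart (2013) on a made-up two-parameter forward model, not on HANDY2 or its data.

    import numpy as np

    # One generic ensemble Kalman inversion (EnKI) loop in the style of
    # Iglesias, Law and Stuart (2013), applied to a made-up two-parameter
    # forward model G; all numbers are placeholders, not HANDY2 quantities.
    rng = np.random.default_rng(2)

    def G(theta):                          # toy forward model: parameters -> data
        return np.array([theta[0] + theta[1], theta[0] * theta[1]])

    y_obs = np.array([3.0, 2.0])           # synthetic data (consistent with theta = [1, 2])
    Gamma = 0.01 * np.eye(2)               # observation noise covariance

    ensemble = rng.normal(1.5, 0.5, size=(50, 2))      # 50 parameter members
    for _ in range(20):                                # a few EnKI iterations
        outputs = np.array([G(th) for th in ensemble])
        th_anom = ensemble - ensemble.mean(axis=0)
        out_anom = outputs - outputs.mean(axis=0)
        C_tg = th_anom.T @ out_anom / len(ensemble)    # parameter-output cross-covariance
        C_gg = out_anom.T @ out_anom / len(ensemble)   # output covariance
        y_pert = y_obs + rng.multivariate_normal(np.zeros(2), Gamma, size=len(ensemble))
        K = C_tg @ np.linalg.inv(C_gg + Gamma)
        ensemble = ensemble + (y_pert - outputs) @ K.T
    print(ensemble.mean(axis=0))           # drifts toward a parameter set fitting y_obs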
Item TWO-PHASE FLOW OF COMPRESSIBLE VISCOUS DIBLOCK COPOLYMER FLUID (2023) Ye, Anqi; Trivisa, Konstantina; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
A diblock copolymer is a linear-chain molecule consisting of two types of monomer. Mathematical models for diblock copolymers can aid researchers in studying the material properties of products such as upholstery foam, adhesive tape, and asphalt additive. Such models incorporate a variety of factors, including concentration difference, connectivity of the subchains, and chemical potential. We consider a flow of two macroscopically immiscible, viscous compressible diblock copolymer fluids. We first give the derivation of this model on the basis of a local dissipation inequality. Second, we prove that there exist weak solutions to this model. The proof of existence relies on constructing an approximating system by means of time-discretization and vanishing dissipation. We then prove that the solutions to these approximating schemes converge to a solution to the original problem. We also discuss the large-time behavior under a regularity assumption on the limit.
Item A Multifaceted Quantification of Bias in Large Language Models (2023) Sotnikova, Anna; Daumé III, Hal; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Language models are rapidly developing, demonstrating impressive capabilities in comprehending, generating, and manipulating text. As they advance, they unlock diverse applications across various domains and become increasingly integrated into our daily lives. Nevertheless, these models, trained on vast and unfiltered datasets, come with a range of potential drawbacks and ethical issues. One significant concern is the potential amplification of biases present in the training data, which can generate stereotypes and reinforce societal injustices when language models are deployed. In this work, we propose methods to quantify biases in large language models. We examine stereotypical associations for a wide variety of social groups characterized by both single and intersectional identities. Additionally, we propose a framework for measuring stereotype leakage across different languages within multilingual large language models. Finally, we introduce an algorithm that allows us to optimize human data collection under conditions of high human disagreement.
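As one generic, deliberately simplified way to quantify stereotypical associations of the kind described above, the sketch below computes a word-embedding association score in the spirit of WEAT (Caliskan et al., 2017); the vectors are random placeholders, and this is not the measurement framework proposed in the dissertation.

    import numpy as np

    # Generic word-embedding association score in the spirit of WEAT
    # (Caliskan et al., 2017), one common way to quantify stereotypical
    # associations; the vectors here are random placeholders rather than
    # embeddings from the models studied in the dissertation.
    rng = np.random.default_rng(3)

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    def association(w, A, B):
        # how much more strongly vector w associates with attribute set A than with B
        return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

    dim = 50
    group_terms = rng.normal(size=(3, dim))   # vectors for terms naming a social group (placeholder)
    attr_A = rng.normal(size=(4, dim))        # e.g., "pleasant" attribute vectors (placeholder)
    attr_B = rng.normal(size=(4, dim))        # e.g., "unpleasant" attribute vectors (placeholder)
    print(np.mean([association(w, attr_A, attr_B) for w in group_terms]))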