Computer Science Theses and Dissertations
Permanent URI for this collectionhttp://hdl.handle.net/1903/2756
Browse
37 results
Search Results
Item Advancements in Small Area Estimation Using Hierarchical Bayesian Methods and Complex Survey Data(2024) Das, Soumojit; Lahiri, Partha; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)This dissertation addresses critical gaps in the estimation of multidimensional poverty measures for small areas and proposes innovative hierarchical Bayesian estimation techniques for finite population means in small areas. It also explores specialized applications of these methods for survey response variables with multiple categories. The dissertation presents a comprehensive review of relevant literature and methodologies, highlighting the importance of accurate estimation for evidence-based policymaking. In Chapter \ref{chap:2}, the focus is on the estimation of multidimensional poverty measures for small areas, filling an essential research gap. Using Bayesian methods, the dissertation demonstrates how multidimensional poverty rates and the relative contributions of different dimensions can be estimated for small areas. The proposed approach can be extended to various definitions of multidimensional poverty, including counting or fuzzy set methods. Chapter \ref{chap:3} introduces a novel hierarchical Bayesian estimation procedure for finite population means in small areas, integrating primary survey data with diverse sources, including social media data. The approach incorporates sample weights and factors influencing the outcome variable to reduce sampling informativeness. It demonstrates reduced sensitivity to model misspecifications and diminishes reliance on assumed models, making it versatile for various estimation challenges. In Chapter \ref{chap: 4}, the dissertation explores specialized applications for survey response variables with multiple categories, addressing the impact of biased or informative sampling on assumed models. It proposes methods for accommodating survey weights seamlessly within the modeling and estimation processes, conducting a comparative analysis with Multilevel Regression with Poststratification (MRP). The dissertation concludes by summarizing key findings and contributions from each chapter, emphasizing implications for evidence-based policymaking and outlining future research directions.Item Structured discovery in graphs: Recommender systems and temporal graph analysis(2024) Peyman, Sheyda Do'a; Lyzinski, Vince V.; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)Graph-valued data arises in numerous diverse scientific fields ranging from sociology, epidemiology and genomics to neuroscience and economics.For example, sociologists have used graphs to examine the roles of user attributes (gender, class, year) at American colleges and universities through the study of Facebook friendship networks and have studied segregation and homophily in social networks; epidemiologists have recently modeled Human-nCov protein-protein interactions via graphs, and neuroscientists have used graphs to model neuronal connectomes. The structure of graphs, including latent features, relationships between the vertex and importance of each vertex are all highly important graph properties that are main aspects of graph analysis/inference. While it is common to imbue nodes and/or edges with implicitly observed numeric or qualitative features, in this work we will consider latent network features that must be estimated from the network topology.The main focus of this text is to find ways of extracting the latent structure in the presence of network anomalies. These anomalies occur in different scenarios: including cases when the graph is subject to an adversarial attack and the anomaly is inhibiting inference, and in the scenario when detecting the anomaly is the key inference task. The former case is explored in the context of vertex nomination information retrieval, where we consider both analytic methods for countering the adversarial noise and also the addition of a user-in-the-loop in the retrieval algorithm to counter potential adversarial noise. In the latter case we use graph embedding methods to discover sequential anomalies in network time series.Item MAHALANOBIS DIFFUSION MAPS FOR QUANTIFYING RARE EVENTS: THEORY AND APPLICATION TO MOLECULAR DYNAMICS(2023) Evans, Luke; Cameron, Maria K; Tiwary, Pratyush; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)The study of rare events in molecular and atomic systems such as conformal changes and cluster rearrangements has been one of the most important research themes in chemical physics. Key challenges are associated with long waiting times rendering molecular simulations inefficient, high dimensionality impeding the use of PDE-based approaches, and the complexity or breadth of transition processes limiting the predictive power of asymptotic methods. Diffusion maps are promising algorithms to mitigate these issues. We adapt the diffusion map with Mahalanobis kernel proposed by Singer and Coifman (2008) for the SDE describing molecular dynamics in collective variables in which the diffusion matrix is position-dependent and, unlike the case considered by Singer and Coifman, is not associated with a diffeomorphism. We offer an elementary proof showing that one can approximate the generator for this SDE discretized to a point cloud via the Mahalanobis diffusion map. We then upgrade to incorporate standard enhanced sampling techniques such as metadynamics. The resulting algorithm, which we call the target measure Mahalanobis diffusion map (tm-mmap), is suitable for a moderate number of collective variables in which one can approximate the diffusion tensor and free energy. The tm-mmap algorithm allows us to approximate the backward Kolmogorov operator and compute the committor function, the key function for describing transition events in the framework of transition path theory. Simple post-processing steps delineate the transition channels and estimate the transition rates. We apply this methodology to a number of test problems including benchmark systems in chemical physics such as alanine dipeptide with four dihedral angles and Lennard-Jones 7, validate the results, and demonstrate the efficacy of the proposed approach. In particular, we show that use of (i) the Mahalanobis kernel, (ii) enhanced sampling data, and (iii) phase space dimensions beyond the scope of standard PDE solvers (such as finite difference and finite element methods) is essential for capturing the underlying dynamics and accurately estimating transition rates.Item Simulation Optimization: Efficient Selection and Dynamic Programming(2023) Zhou, Yi; Fu, Michael C; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)Many real-world problems can be modeled as stochastic systems, whose performances can be estimated by simulations. Important topics include statistical ranking and selection (R&S) problems, response surface problems, and stochastic estimation problems. The first part of the dissertation focuses on the R&S problems, where there is a finite set of feasible solutions ("systems" or "alternatives") with unknown values of performances that must be estimated from noisy observations, e.g., from expensive stochastic simulation experiments. The goal in R&S problems is to identify the highest-valued alternative as quickly as possible. In many practical settings, alternatives with similar attributes could have similar performances; we thus focus on such settings where prior similarity information between the performance outputs of different alternatives can be learned from data. Incorporating similarity information enables efficient budget allocation for faster identification of the best alternative in sequential selection. Using a new selection criterion, the similarity selection index, we develop two new allocation methods: one based on a mathematical programming characterization of the asymptotically optimal budget allocation, and the other based on a myopic expected improvement measure. For the former, we present a novel sequential implementation that provably learns the optimal allocation without tuning. For the latter, we also derive its asymptotic sampling ratios. We also propose a practical way to update the prior similarity information as new samples are collected. The second part of the dissertation considers the stochastic estimation problems of estimating a conditional expectation. We first formulate the conditional expectation as a ratio of two stochastic derivatives. Applying stochastic gradient estimation techniques, we express the two derivatives using ordinary expectations, which can be then estimated by Monte Carlo methods. Based on an empirical distribution estimated from simulation, we provide guidance on selecting the appropriate formulation of the derivatives to reduce the variance of the ratio estimator. The third part of this dissertation further explores the application of estimators for conditional expectations in policy evaluation, an important topic in stochastic dynamic programming. In particular, we focus on a finite-horizon setting with continuous state variables. The Bellman equation represents the value function as a conditional expectation. By using the finite difference method and the generalized likelihood ratio method, we propose new estimators for policy evaluation and show how the value of any given state can be estimated using sample paths starting from various other states.Item SENSITIVITY ANALYSIS AND STOCHASTIC OPTIMIZATIONS IN STOCHASTIC ACTIVITY NETWORKS(2022) Wan, Peng; Fu, Michael C; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)Activity networks are a powerful tool for modeling and analysis in project management, and in many other applications, such as circuit design and parallel computing. An activity network can be represented by a directed acyclic graph with one source node and one sink node. The directed arcs between nodes in an activity network represent the precedence relationships between different activities in the project. In a stochastic activity network (SAN), the arc lengths are random variables. This dissertation studies stochastic gradient estimators for SANs using Monte Carlo simulation, and the application of stochastic gradient estimators to network optimization problems. A new algorithm called Threshold Arc Criticality (TAC) for estimating the arc criticalities of stochastic activity networks is proposed. TAC is based on the following result: given the length of all arcs in a SAN except for the one arc of interest, that arc is on the critical path (longest path) if and only if its length is greater than a threshold. By applying Infinitesimal Perturbation Analysis (IPA) to TAC, an unbiased estimator of the derivative of the arc criticalities with respect to parameters of arc length distributions can be derived. The stochastic derivative estimator can be used for sensitivity analysis of arc criticalities via simulation. Using TAC, a new IPA gradient estimator of the first and second moments of project completion time (PCT) is proposed. Combining the new PCT stochastic gradient estimator with a Taylor series approximation, a functional estimation procedure for estimating the change in PCT moments caused by a large perturbation in an activity duration's distribution parameter is proposed and applied to optimization problems involving time-cost tradeoffs. In activity networks, crashing an activity means reducing the activity's duration (deterministic or stochastic) by a given percentage with an associated cost. A crashing plan of a project aims to shorten the PCT by reducing the duration of a set of activities under a limited budget. A disruption is an event that occurs at an uncertain time. Examples of disruptions are natural disasters, electrical outages, labor strikes, etc. For an activity network, a disruption may cause delays in unfinished activities. Previous work formulates finding the optimal crashing plan of an activity network under a single disruption as a two-stage stochastic mixed integer programming problem and applies a sample average approximation technique for finding the optimal solution. In this thesis, a new stochastic gradient estimator is derived and a gradient-based simulation optimization algorithm is applied to the problem of optimizing crashing under disruption.Item Multivariate Bilateral Gamma Process with Financial Application and Machine Learning in Corporate Bond Market(2022) Zhang, Yiran; Madan, Dilip; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)This dissertation consists of three essays. Chapter 1 is titled “Calibration for multivariate bilateral gamma model”. In this chapter, we discuss the multivariate bilateral gamma (MBG) model, which is an extension of the multivariate variance gamma (MVG) model that consists of predetermined Bilateral Gamma (BG) marginals. This model is used to model financial asset returns that have excess kurtosis, negative skewness, and asymmetry in the upward and downward motions. An efficient Monte Carlo simulation schema of this process is devised. Moreover, we propose an estimation procedure for the MBG modelbased on the Continuum Generalized Method of Moments (CGMM) and reparameterization of its correlation matrix. We compare this model with the full Gaussian copula (FGC) model by fitting it to the US equity data in the Dow Jones index, and the result indicates that the MBG model outperforms significantly. Chapter 2 is titled “Pairs trading strategies: Distance, cointegration, copula, and MBG method”. In this chapter, we propose a new method of building pairs trading strategy that uses MBG to model the dependency structure of stock pairs. The trading signal is then built based on the cumulative mispricing indexes, which are calculated using the conditional probability of the MBG distribution. We also conduct a comprehensive study on the performance of four different pairs trading strategies—the distance, cointegration, copula, and MBG methods—on the US equity market from 2016 to 2019. The result shows that the MBG method has the highest excess return and records the best risk-adjusted performance. Chapter 3 is titled “Predictability of corporate bond returns with machine learning.” This chapter performs a comparative analysis of machine learning methods for the predictability of corporate bonds returns. We found that machine learning models, especially ensemble learning methods like boosting, substantially improve the out-of-sample performance of stock and bond characteristics in predicting future bond returns. We also show that portfolios constructed using the prediction of machine learning models contribute to economic gains. To investigate which features the models rely on when making a prediction, we discuss three methods for calculating feature importance, including SHapley Additive exPlanations(SHAP) based on Shapley value. This method is model agnostic, consistent, and can provide an explanation for each observation. We also discuss the performance of the parsimonious model with the 15 most important features selected by SHAP.Item Topics in Stochastic Optimization(2019) Sun, Guowei N/A; Fu, Michael C; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)In this thesis, we work with three topics in stochastic optimization: ranking and selection (R&S), multi-armed bandits (MAB) and stochastic kriging (SK). For R&S, we first consider the problem of making inferences about all candidates based on samples drawn from one. Then we study the problem of designing efficient allocation algorithms for problems where the selection objective is more complex than the simple expectation of a random output. In MAB, we use the autoregressive process to capture possible temporal correlations in the unknown reward processes and study the effect of such correlations on the regret bounds of various bandit algorithms. Lastly, for SK, we design a procedure for dynamic experimental design for establishing a good global fit by efficiently allocating simulation budgets in the design space. The first two Chapters of the thesis work with variations of the R&S problem. In Chapter 1, we consider the problem of choosing the best design alternative under a small simulation budget, where making inferences about all alternatives from a single observation could enhance the probability of correct selection. We propose a new selection rule exploiting the relative similarity between pairs of alternatives and show its improvement on selection performance, evaluated by the Probability of Correct Selection, compared to selection based on collected sample averages. We illustrate the effectiveness by applying our selection index on simulated R\&S problems using two well-known budget allocation policies. In Chapter 2, we present two sequential allocation frameworks for selecting from a set of competing alternatives when the decision maker cares about more than just the simple expected rewards. The frameworks are built on general parametric reward distributions and assume the objective of selection, which we refer to as utility, can be expressed as a function of the governing reward distributional parameters. The first algorithm, which we call utility-based OCBA (UOCBA), uses the Delta-technique to find the asymptotic distribution of a utility estimator to establish the asymptotically optimal allocation by solving the corresponding constrained optimization problem. The second, which we refer to as utility-based value of information (UVoI) approach, is a variation of the Bayesian value of information (VoI) techniques for efficient learning of the utility. We establish the asymptotic optimality of both allocation policies and illustrate the performance of the two algorithms through numerical experiments. Chapter 3 considers the restless bandit problem where the rewards on the arms are stochastic processes with strong temporal correlations that can be characterized by the well-known stationary autoregressive-moving-average time series models. We argue that despite the statistical stationarity of the reward processes, a linear improvement in cumulative reward can be obtained by exploiting the temporal correlation, compared to policies that work under the independent reward assumption. We introduce the notion of temporal exploration-exploitation trade-off, where a policy has to balance between learning more recent information to track the evolution of all reward processes and utilizing currently available predictions to gain better immediate reward. We prove a regret lower bound characterized by the bandit problem complexity and correlation strength along the time index and propose policies that achieve a matching upper bound. Lastly, Chapter 4 proposes a fully sequential experimental design procedure for the stochastic kriging (SK) methodology of fitting unknown response surfaces from simulation experiments. The procedure first estimates the current SK model performance by jackknifing the existing data points. Then, an additional SK model is fitted on the jackknife error estimates to capture the landscape of the current SK model performance. Methodologies for balancing exploration and exploitation trade-off in Bayesian optimization are employed to select the next simulation point. Compared to existing experimental design procedures relying on the posterior uncertainty estimates from the fitted SK model for evaluating model performance, our method is robust to the SK model specifications. We design a dynamic allocation algorithm, which we call kriging-based dynamic stochastic kriging (KDSK), and illustrate its performance through two numerical experiments.Item COMPUTATIONAL METHODS IN MACHINE LEARNING: TRANSPORT MODEL, HAAR WAVELET, DNA CLASSIFICATION, AND MRI(2018) Njeunje, Franck Olivier Ndjakou; Czaja, Wojciech K; Benedetto, John J; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)With the increasing amount of raw data generation produced every day, it has become pertinent to develop new techniques for data representation, analyses, and interpretation. Motivated by real-world applications, there is a trending interest in techniques such as dimensionality reduction, wavelet decomposition, and classication methods that allow for better understanding of data. This thesis details the development of a new non-linear dimension reduction technique based on transport model by advection. We provide a series of computational experiments, and practical applications in hyperspectral images to illustrate the strength of our algorithm. In wavelet decomposition, we construct a novel Haar approximation technique for functions f in the Lp-space, 0 < p < 1, such that the approximants have support contained in the support of f. Furthermore, a classification algorithm to study tissue-specific deoxyribonucleic acids (DNA) is constructed using the support vector machine. In magnetic resonance imaging, we provide an extension of the T2-store-T2 magnetic resonance relaxometry experiment used in the analysis of magnetization signal from 2 to N exchanging sites, where N >= 2.Item Route Planning with Statistical Models(2018) Huang, Yufei; Ryzhov, Ilya; Systems Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)One difficulty to find the fastest route in route planning is how to determine the precise travel time on each road. In the real world, the travel time of each road varies with time, weather condition and many other factors. The thesis aims at studying route planning algorithms that use statistical models to predict the changes of travel time for each road and calculate the fastest route. Using the historical data of main roads in Washington D.C. area, the thesis studied major factors that would affect the travel time. Different statistical models are presented and compared to fit the travel time of each road. Then the LASSO regression model is chosen, and different predictive route planning algorithms are introduced to fulfill our goal. Finally, deterministic approximate dynamic programming is recommended to solve our problem.Item MISSPECIFIED WEIGHTS IN WEIGHT-SMOOTHING METHODS(2018) Li, Xia; Slud, Eric V; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)Misspecification happens for various reasons in weight adjustment procedures in survey data analysis. To study the consequences of weight misspecifications, we study the effects of using a multiplicative biasing factor to describe the weight adjustments and reflect the distributional change from design/initial weights to final weights. The necessary and sufficient condition of the Horvitz-Thompson (HT) estimator of a population total being consistent is then given in a superpopulation setting. When HT is consistent, we first investigate the bias in other estimators for population totals. We show the necessary condition for bias in Generalized Regression (GREG) estimator and the resulting bias formula in the superpopulation limiting sense. We also link the bias in a model-based estimator of Zheng and Little to the failure of extrapolated model-fitting outside the sample. Both findings are validated in simulation studies. Next we find that the biasing factor affects estimators so that one particular estimator may have the smallest variance under design weights but not under misspecified weights due to variance inflation. A preliminary analysis on simulated samples drawn from a population of real American Community Survey (ACS) data illustrates the quality of fit of the biasing factor model we proposed to the ACS data with weights modified by a few calibration/raking steps.