Mathematics

Permanent URI for this community: http://hdl.handle.net/1903/2261

Search Results

Now showing 1 - 10 of 97
  • Item
    Markov multi-state models for survival analysis with recurrent events
    (2019) Zhang, Tianhui; Yang, Grace; Mathematical Statistics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Markov models are a major class within the system of multi-state models for the analysis of lifetime or event-time data. Applications abound, including the estimation of the lifetime of ultra-cold neutrons, the bias correction of the apparent magnitude distribution of the stars in a certain area of the sky, and the survival analysis of clinical trials. This thesis addresses some of the problems arising in the analysis of right-censored lifetime data. Clinical trials are used as examples to investigate these problems. A Markov model that takes a patient's disease development into account for the analysis of right-censored data was first constructed by Fix and Neyman (1951). The Fix-Neyman (F-N) model is a homogeneous Markov process with two transient and two absorbing states that describes a patient's status over a period of time during a cancer clinical trial. This thesis extends the F-N model by allowing the transition rates (hazard rates) to be both state and time dependent. Recurrent transitions between relapse and recovery are allowed in the extended model. By relaxing the condition of time-independent hazard rates, the extension increases the applicability of Markov models. The extended models are used to compute model survival functions and cumulative hazard functions that take right-censored observations into account, as is done in the celebrated Kaplan-Meier estimator. Using the Fix-Neyman procedure and the Kolmogorov forward equations, closed-form solutions are obtained for certain irreversible 4-state extended models, while numerical solutions are obtained for the model with recurrent events. The 4-state model is motivated by an aplastic anemia data set. The computational method works for general irreversible and reversible models with no restriction on the number of states. Simulations of right-censored Markov processes are performed by using a sequence of competing risks models. Simulated data are used to check the performance of nonparametric estimators for various sample sizes. In addition, applying Aalen's (1978) results, the estimators are shown to have asymptotic normal distributions. A brief review of some of the literature relevant to this thesis is provided. References are readily available from a vast literature on survival analysis, including many textbooks. General Markov process models for survival analysis are described, e.g., in Andersen, Borgan, Gill and Keiding (1993).
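    For reference, the Kolmogorov forward equations that drive the computation above take the following form for a finite-state Markov process with state- and time-dependent rates (the notation here is generic and chosen for illustration, not the thesis's own):

        \[
        \frac{\partial}{\partial t} P(s,t) = P(s,t)\, Q(t), \qquad P(s,s) = I,
        \]

    where $P(s,t) = \big(p_{ij}(s,t)\big)$ is the matrix of transition probabilities and $Q(t)$ is the generator whose off-diagonal entries $q_{ij}(t)$ are the state- and time-dependent hazard rates, with $q_{ii}(t) = -\sum_{j \neq i} q_{ij}(t)$. Model quantities such as a survival function $S(t)$ (the probability of remaining in a transient state at time $t$) and the corresponding cumulative hazard $\Lambda(t) = -\log S(t)$ can then be read off from the solution $P(0,t)$.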
  • Item
    Regression Analysis of Recurrent Events with Measurement Errors
    (2019) Ren, Yixin; Smith, Paul J; He, Xin; Mathematical Statistics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Recurrent event data and panel count data are often encountered in longitudinal follow-up studies. The main difference between the two types of data is the observation process: continuous observation results in recurrent event data, while discrete observation leads to panel count data. In the statistical literature, regression analysis of both types of data has been well studied, and a typical assumption of those studies is that all covariates are accurately recorded. However, in many applications it is common to have measurement errors in some of the covariates. For example, in a clinical trial a medical index might be measured multiple times, and dealing with the differences among those measurements is then an essential task for statisticians. For recurrent event data, we present a class of semiparametric regression models that allow correlations between the censoring time and the recurrent event process via a frailty. An estimating-equation-based approach is developed to account for the presence of measurement errors in some of the covariates. Both large- and finite-sample properties of the proposed estimators are established. An example from the study of gamma interferon in chronic granulomatous disease is provided. For panel count data, we consider two situations in which the observation process is independent of or dependent on covariates. Estimating equations are developed for the estimation of the regression parameters in both cases. Simulation studies indicate that the proposed inference procedures perform well in practical situations. An example from a bladder cancer study is used to demonstrate the value of the proposed method.
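    As a small illustration of the kind of data involved (not of the estimating-equation methodology itself), the sketch below simulates one subject's recurrent events under an assumed proportional rates model with a gamma frailty shared by the event and censoring processes, and a covariate observed with classical additive error; the model form, names, and parameter values are invented for the example.

        import numpy as np

        rng = np.random.default_rng(0)

        def simulate_subject(beta=0.5, base_rate=1.0, frailty_shape=2.0, err_sd=0.3, tau=5.0):
            """Simulate one subject's recurrent events on [0, tau] (hypothetical model)."""
            x = rng.normal()                                    # true covariate
            w = x + rng.normal(scale=err_sd)                    # covariate observed with classical error
            nu = rng.gamma(frailty_shape, 1.0 / frailty_shape)  # frailty with mean 1
            c = min(rng.exponential(scale=2.0 / nu), tau)       # censoring time, correlated with events via the frailty
            n_events = rng.poisson(nu * base_rate * np.exp(beta * x) * c)  # event count on [0, c]
            times = np.sort(rng.uniform(0.0, c, size=n_events)) # event times before censoring
            return times, c, w

        times, censor_time, w_obs = simulate_subject()
        print(len(times), "recurrent events observed before censoring at", round(censor_time, 2))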
  • Item
    Topics in Stochastic Optimization
    (2019) Sun, Guowei; Fu, Michael C; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    In this thesis, we work with three topics in stochastic optimization: ranking and selection (R&S), multi-armed bandits (MAB), and stochastic kriging (SK). For R&S, we first consider the problem of making inferences about all candidates based on samples drawn from one. Then we study the problem of designing efficient allocation algorithms for problems where the selection objective is more complex than the simple expectation of a random output. In MAB, we use the autoregressive process to capture possible temporal correlations in the unknown reward processes and study the effect of such correlations on the regret bounds of various bandit algorithms. Lastly, for SK, we design a procedure for dynamic experimental design for establishing a good global fit by efficiently allocating simulation budgets in the design space. The first two chapters of the thesis work with variations of the R&S problem.
    In Chapter 1, we consider the problem of choosing the best design alternative under a small simulation budget, where making inferences about all alternatives from a single observation could enhance the probability of correct selection. We propose a new selection rule exploiting the relative similarity between pairs of alternatives and show that it improves selection performance, evaluated by the Probability of Correct Selection, compared to selection based on collected sample averages. We illustrate its effectiveness by applying our selection index to simulated R&S problems using two well-known budget allocation policies.
    In Chapter 2, we present two sequential allocation frameworks for selecting from a set of competing alternatives when the decision maker cares about more than just the simple expected rewards. The frameworks are built on general parametric reward distributions and assume that the objective of selection, which we refer to as utility, can be expressed as a function of the governing reward distributional parameters. The first algorithm, which we call utility-based OCBA (UOCBA), uses the delta technique to find the asymptotic distribution of a utility estimator and establish the asymptotically optimal allocation by solving the corresponding constrained optimization problem. The second, which we refer to as the utility-based value of information (UVoI) approach, is a variation of the Bayesian value of information (VoI) techniques for efficient learning of the utility. We establish the asymptotic optimality of both allocation policies and illustrate the performance of the two algorithms through numerical experiments.
    Chapter 3 considers the restless bandit problem, where the rewards on the arms are stochastic processes with strong temporal correlations that can be characterized by the well-known stationary autoregressive-moving-average time series models. We argue that despite the statistical stationarity of the reward processes, a linear improvement in cumulative reward can be obtained by exploiting the temporal correlation, compared to policies that work under the independent-reward assumption. We introduce the notion of a temporal exploration-exploitation trade-off, where a policy has to balance learning more recent information to track the evolution of all reward processes against utilizing currently available predictions to gain better immediate reward. We prove a regret lower bound characterized by the bandit problem complexity and the correlation strength along the time index, and propose policies that achieve a matching upper bound.
    Lastly, Chapter 4 proposes a fully sequential experimental design procedure for the stochastic kriging (SK) methodology of fitting unknown response surfaces from simulation experiments. The procedure first estimates the current SK model performance by jackknifing the existing data points. Then, an additional SK model is fitted on the jackknife error estimates to capture the landscape of the current SK model performance. Methodologies for balancing the exploration-exploitation trade-off in Bayesian optimization are employed to select the next simulation point. Compared to existing experimental design procedures, which rely on the posterior uncertainty estimates from the fitted SK model to evaluate model performance, our method is robust to the SK model specification. We design a dynamic allocation algorithm, which we call kriging-based dynamic stochastic kriging (KDSK), and illustrate its performance through two numerical experiments.
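    A minimal sketch of the jackknife-then-refit idea described above is given below, under simplifying assumptions: a one-dimensional toy response, an off-the-shelf Gaussian process surrogate standing in for a stochastic kriging model, and an ad hoc "mean plus standard deviation" rule for picking the next point. None of the names or choices here are the KDSK algorithm itself.

        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor

        rng = np.random.default_rng(1)

        def f(x):
            """Toy noisy response surface standing in for a simulation experiment."""
            return np.sin(3 * x) + 0.1 * rng.standard_normal(x.shape)

        X = rng.uniform(0, 2, size=(8, 1))            # initial design points
        y = f(X).ravel()

        def jackknife_errors(X, y):
            """Leave-one-out absolute prediction errors of the current surrogate."""
            errs = np.empty(len(X))
            for i in range(len(X)):
                mask = np.arange(len(X)) != i
                gp = GaussianProcessRegressor(normalize_y=True).fit(X[mask], y[mask])
                errs[i] = abs(gp.predict(X[i:i + 1])[0] - y[i])
            return errs

        for _ in range(5):                            # sequential design loop
            err_model = GaussianProcessRegressor(normalize_y=True)
            err_model.fit(X, jackknife_errors(X, y))  # second surrogate fitted to the jackknife errors
            cand = np.linspace(0, 2, 201).reshape(-1, 1)
            mu, sd = err_model.predict(cand, return_std=True)
            x_next = cand[np.argmax(mu + sd)]         # sample where estimated error is large or uncertain
            X = np.vstack([X, x_next])
            y = np.append(y, f(x_next.reshape(1, -1)).ravel())

        print("final design size:", len(X))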
  • Item
    Branching diffusion processes in periodic media
    (2019) Hebbar, Pratima; Koralov, Leonid; Mathematics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    In the first part of this manuscript, we investigate the asymptotic behavior of solutions to parabolic partial differential equations (PDEs) in $\mathbb{R}^d$ with space-periodic diffusion matrix, drift, and potential. The asymptotics are obtained up to distances from the support of the initial function that grow linearly in time. Using these asymptotics, we describe the behavior of branching diffusion processes in periodic media. For a super-critical branching process, we distinguish two types of behavior for the normalized number of particles in a bounded domain, depending on the distance of the domain from the region where the bulk of the particles is located. At distances that grow linearly in time, we observe intermittency (i.e., the $k$-th moment dominates the $k$-th power of the first moment for some $k$), while, at distances that grow sub-linearly in time, we show that all the moments converge. In the second part of the manuscript, we obtain asymptotic expansions for the distribution functions of continuous-time stochastic processes with weakly dependent increments in the domain of large deviations. As a key example, we show that additive functionals of solutions of stochastic differential equations (SDEs) satisfying the Hörmander condition on a $d$-dimensional compact manifold admit asymptotic expansions of all orders in the domain of large deviations.
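    The intermittency dichotomy described above can be phrased in terms of moment growth rates; in generic notation (writing $n_t(D)$ for the number of particles in a domain $D$ at time $t$, and not necessarily matching the normalization used in the thesis), intermittency at $D$ means

        \[
        \limsup_{t \to \infty} \frac{1}{t} \log \frac{\mathbb{E}\,[n_t(D)^k]}{\big(\mathbb{E}\,[n_t(D)]\big)^k} > 0 \quad \text{for some } k \ge 2,
        \]

    i.e., the $k$-th moment grows exponentially faster than the $k$-th power of the first moment, whereas convergence of all the normalized moments rules such behavior out.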
  • Item
    Data Fusion based on the Density Ratio Model
    (2018) Wang, Chen; Kedem, Benjamin; Mathematical Statistics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    A vast amount of the statistical literature deals with a single sample coming from a distribution, where the problem is to make inferences about that distribution by estimation and testing procedures. Data fusion is the process of integrating multiple data sources in the hope of obtaining more accurate inference than that provided by any single data source, the expectation being that fused data are more informative than the individual original inputs. This requires appropriate statistical methods that can provide inference using multiple data sources as input. The Density Ratio Model is a model which allows semiparametric inference about probability distributions from fused data. In this dissertation, we will discuss three different types of problems based on the Density Ratio Model. We will discuss the situation where there is a system of sensors, each producing data according to some probability distribution. The parametric connection between the distributions allows various hypothesis tests, including that of equidistribution, which are very helpful in detecting abnormalities in mechanical systems. Another example of a data fusion problem is small area estimation, where strength is borrowed by using data from all areas for which information is available. Real data can be fused with other real data, or even with artificial data. Thus, a given sample can be fused with computer-generated data, giving rise to the concept of out-of-sample fusion (OSF). We will see that this approach is very helpful when estimating a small threshold exceedance probability when the sample is not large enough and consists only of values below the threshold.
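    For orientation, the density ratio model referred to above is commonly written in the following semiparametric form (this is the textbook formulation; the tilt function $h$ and parametrization may differ from the exact specification used in the dissertation). Given a reference sample with density $g_0$ and $q$ additional samples with densities $g_1, \dots, g_q$,

        \[
        \frac{g_i(x)}{g_0(x)} = \exp\!\big\{\alpha_i + \boldsymbol{\beta}_i^{\top} h(x)\big\}, \qquad i = 1, \dots, q,
        \]

    so every distribution is tied to the unknown baseline $g_0$ through a finite-dimensional tilt $(\alpha_i, \boldsymbol{\beta}_i)$, and the hypothesis that sample $i$ is equidistributed with the reference corresponds to $\alpha_i = 0$, $\boldsymbol{\beta}_i = \mathbf{0}$.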
  • Item
    ESSAYS IN STATISTICAL ANALYSIS: ISOTONIC REGRESSION AND FILTERING
    (2018) Xue, Jinhang; Ryzhov, Ilya O; Smith, Paul J; Mathematical Statistics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    In many real-world applications in optimal information collection and stochastic approximation, statistical estimators are constructed to learn the true parameter value of some utility function or underlying signal. Many of these estimators exhibit excellent empirical performance, but full analyses of their consistency have not previously been available, putting decision-makers in somewhat of a predicament regarding implementation. The goal of this dissertation is to supply these missing consistency proofs. The first part of this thesis considers the consistency of estimating a monotonic cost function which appears in an optimal learning algorithm that incorporates isotonic regression with a Bayesian policy known as Knowledge Gradient with Discrete Priors (KGDP). Isotonic regression deals with regression problems under order constraints. Previous literature proposed estimating the cost function by a weighted sum over a pool of candidate curves, each generated by the isotonic regression estimator from all previously collected observations, with the weights calculated by KGDP. Our primary objective is to establish the consistency of the suggested estimator. Some minor results regarding the knowledge gradient algorithm and the isotonic regression estimator under insufficient observations are also discussed. The second part of this thesis focuses on the convergence of the bias-adjusted Kalman filter (BAKF). The BAKF algorithm is designed to optimize the statistical estimation of a non-stationary signal that can only be observed with stochastic noise. The algorithm has numerous applications in dynamic programming and signal processing. However, a consistency analysis of the process that approximates the underlying signal has heretofore not been available. We resolve this open issue by showing that the BAKF stepsize satisfies the well-known conditions for almost sure convergence of a stochastic approximation sequence, with only one additional assumption on the convergence rate of the signal compared to those used in the derivation of the original problem.
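    As a concrete illustration of regression under order constraints (the building block discussed above, not the KGDP-weighted estimator itself), the sketch below fits a monotone curve to noisy observations with an off-the-shelf isotonic regression routine; the data and names are invented for the example.

        import numpy as np
        from sklearn.isotonic import IsotonicRegression

        rng = np.random.default_rng(2)
        x = np.linspace(0, 1, 50)
        y = np.exp(2 * x) + rng.normal(scale=0.5, size=x.size)  # noisy, increasing cost curve

        iso = IsotonicRegression(increasing=True)                # enforce the monotone order constraint
        y_fit = iso.fit_transform(x, y)                          # pool-adjacent-violators solution

        print(np.all(np.diff(y_fit) >= 0))                       # fitted values are nondecreasing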
  • Item
    Statistical Inference Using Data From Multiple Files Combined Through Record Linkage
    (2018) HAN, YING; Lahiri, Partha; Mathematical Statistics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Record linkage methods help us combine multiple data sets from different sources when a single data set with all necessary information is unavailable or when data collection on additional variables is time-consuming and extremely costly. Linkage errors are inevitable in the linked data set because an error-free, unique identifier is unavailable and because of possible errors in measuring or recording. It has been realized that even a small amount of linkage error can lead to substantial bias and increased variability in estimating the parameters of a statistical model. The importance of incorporating the uncertainty of the record linkage process into the statistical analysis step cannot be overemphasized. The current research is mainly focused on regression analysis of linked data. The record linkage and statistical analysis processes are treated as two separate steps, and because information about the record linkage process is limited, simplifying assumptions on the linkage mechanism have to be made. In reality, however, these assumptions may be violated. Also, most of the existing linkage error models are built on the linked data set, which only contains records for the designated links; information about linkage errors carried by the designated non-links is missing. In this dissertation, we provide general methodologies for both regression analysis and small area estimation using data from multiple files. A general integrated model is proposed to combine the record linkage and statistical analysis processes. The proposed linkage error models are built directly on the data values from the original sources and are based on the actual record linkage method that is used. We adapt jackknife methods to estimate the bias, variance, and mean squared error of our proposed estimators. To illustrate the general methodology, we give one example of estimating the regression coefficients in linear and logistic regression models, and another of estimating a small area mean under the nested-error linear regression model. In order to reduce the computational burden, simplified versions of the proposed estimators, jackknife methods, and numerical algorithms are given. A Monte Carlo simulation study is devised to evaluate the performance of the proposed estimators and to investigate the difference between the standard and simplified jackknife methods.
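    The jackknife machinery adapted above builds on the standard delete-one estimators, which in generic form read as follows (the dissertation's adaptation changes what is deleted and recomputed, so this is only the textbook starting point). For an estimator $\hat\theta$ computed from $n$ units, with $\hat\theta_{(i)}$ its recomputation after deleting unit $i$ and $\bar\theta_{(\cdot)} = n^{-1}\sum_{i}\hat\theta_{(i)}$,

        \[
        \widehat{\mathrm{bias}}_{\mathrm{J}} = (n-1)\big(\bar\theta_{(\cdot)} - \hat\theta\big),
        \qquad
        \widehat{\mathrm{var}}_{\mathrm{J}} = \frac{n-1}{n}\sum_{i=1}^{n}\big(\hat\theta_{(i)} - \bar\theta_{(\cdot)}\big)^{2},
        \]

    with a mean squared error estimate obtained by combining the squared bias and variance estimates.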
  • Item
    STOCHASTIC OPTIMIZATION: APPROXIMATE BAYESIAN INFERENCE AND COMPLETE EXPECTED IMPROVEMENT
    (2018) Chen, Ye; Ryzhov, Ilya; Smith, Paul; Mathematics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Stochastic optimization includes modeling, computing, and decision making. In practice, due to limitations of mathematical tools or of real budgets, many practical solution methods are designed using approximation techniques or take forms that are efficient to compute and update. These models have shown their practical benefits in different settings, but many of them also lack rigorous theoretical support. Through interfacing with statistical tools, we analyze the asymptotic properties of two important Bayesian models and show their validity by proving consistency or other limiting results, which may be useful to algorithmic scientists seeking to leverage these computational techniques for their practical performance. The first part of the thesis is a consistency analysis of sequential learning algorithms under approximate Bayesian inference. Approximate Bayesian inference is a powerful methodology for constructing computationally efficient statistical mechanisms for sequential learning from incomplete or censored information. Approximate Bayesian learning models have proven successful in a variety of operations research and business problems; however, prior work in this area has been primarily computational, and the consistency of approximate Bayesian estimators has been a largely open problem. We develop a new consistency theory by interpreting approximate Bayesian inference as a form of stochastic approximation (SA) with an additional “bias” term. We prove the convergence of a general SA algorithm of this form, and leverage this analysis to derive the first consistency proofs for a suite of approximate Bayesian models from the recent literature. The second part of the thesis proposes a budget allocation algorithm for the ranking and selection problem. The ranking and selection problem is a well-known mathematical framework for the formal study of optimal information collection. Expected improvement (EI) is a leading algorithmic approach to this problem; the practical benefits of EI have repeatedly been demonstrated in the literature, especially in the widely studied setting of Gaussian sampling distributions. However, it was recently proved that some of the most well-known EI-type methods achieve suboptimal convergence rates. We investigate a recently proposed variant of EI (known as “complete EI”) and prove that, with some minor modifications, it can be made to converge to the rate-optimal static budget allocation without requiring any tuning.
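    For reference, the classical expected improvement score in the Gaussian case, which complete EI modifies, can be computed as below; this is the textbook formula for maximization with a normal posterior, not the complete-EI variant analyzed in the thesis, and the function name is ours.

        import numpy as np
        from scipy.stats import norm

        def expected_improvement(mu, sigma, mu_best):
            """Classical EI for maximization under a Gaussian posterior N(mu, sigma^2)."""
            if sigma <= 0:
                return max(mu - mu_best, 0.0)
            z = (mu - mu_best) / sigma
            return (mu - mu_best) * norm.cdf(z) + sigma * norm.pdf(z)

        print(expected_improvement(mu=1.2, sigma=0.5, mu_best=1.0))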
  • Item
    ADJUSTMENT FOR DENSITY METHOD TO ESTIMATE RANDOM EFFECTS IN HIERARCHICAL BAYES MODELS
    (2018) Cao, Lijuan; Lahiri, Partha; Mathematics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    The Adjustment for Density Method (ADM) has received considerable attention in recent years. The method was proposed about thirty years ago for approximating a complex univariate density by a density from the Pearson family of distributions. The ADM has been developed to approximate posterior distributions of hyper-parameters, shrinkage parameters, and random effects of a few well-known univariate hierarchical Bayesian models. This dissertation advances the ADM to approximate posterior distributions of hyper-parameters, shrinkage parameters, synthetic probabilities, and multinomial probabilities associated with a multinomial-Dirichlet-logit Bayesian hierarchical model. The method is adapted so that it can be applied to weighted counts. We carefully propose priors for the hyper-parameters of the multinomial-Dirichlet-logit model so as to ensure propriety of the posteriors of the relevant model parameters and to achieve good small-sample properties. Following the general guidelines of the ADM for univariate distributions, we devise suitable adjustments to the posterior density of the hyper-parameters so that the adjusted posterior modes lie in the interior of the parameter space and the bias of the point estimates is reduced. Beta distribution approximations are employed when approximating the posterior distributions of the individual shrinkage factors, and Dirichlet distribution approximations are used when approximating the posterior distributions of the synthetic probabilities. The parameters of the beta or Dirichlet posterior densities are approximated carefully so that the method approximates the exact posterior densities accurately. Simulation studies demonstrate that our proposed approach to estimating the multinomial probabilities in the multinomial-Dirichlet-logit model is accurate, fast, and has better operating characteristics than other existing procedures. We consider two applications of our proposed hierarchical Bayes model using complex survey data and Big Data. In the first example, we consider small area gender proportions using a binomial-beta-logit model; the proposed method improves on a rival method in terms of smaller margins of error. In the second application, we demonstrate how small area multi-category race proportion estimates, obtained by a direct method applied to Twitter data, can be improved by the proposed method. This dissertation ends with a discussion of future research in the area of ADM.
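    To fix ideas about where the shrinkage factors above come from, consider the conjugate multinomial-Dirichlet building block, a simplification of the full multinomial-Dirichlet-logit model treated in the dissertation (the notation here is ours): with counts $y_i = (y_{i1}, \dots, y_{iK})$ totalling $n_i$, synthetic probabilities $\pi_i$, and concentration parameter $\theta$,

        \[
        y_i \mid p_i \sim \mathrm{Multinomial}(n_i, p_i), \qquad p_i \sim \mathrm{Dirichlet}(\theta \pi_i),
        \]
        \[
        \mathbb{E}\,[p_{ik} \mid y_i] = (1 - B_i)\,\frac{y_{ik}}{n_i} + B_i\, \pi_{ik},
        \qquad B_i = \frac{\theta}{\theta + n_i},
        \]

    so the posterior mean shrinks the direct proportion $y_{ik}/n_i$ toward the synthetic probability $\pi_{ik}$ by the factor $B_i$; approximating the posteriors of such shrinkage factors and probabilities by beta and Dirichlet densities is the role of the ADM described above.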
  • Item
    COMPUTATIONAL METHODS IN MACHINE LEARNING: TRANSPORT MODEL, HAAR WAVELET, DNA CLASSIFICATION, AND MRI
    (2018) Njeunje, Franck Olivier Ndjakou; Czaja, Wojciech K; Benedetto, John J; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    With the increasing amount of raw data generated every day, it has become pertinent to develop new techniques for data representation, analysis, and interpretation. Motivated by real-world applications, there is growing interest in techniques such as dimensionality reduction, wavelet decomposition, and classification methods that allow for a better understanding of data. This thesis details the development of a new non-linear dimension reduction technique based on a transport model by advection. We provide a series of computational experiments and practical applications to hyperspectral images to illustrate the strength of our algorithm. In wavelet decomposition, we construct a novel Haar approximation technique for functions f in the L^p space, 0 < p < 1, such that the approximants have support contained in the support of f. Furthermore, a classification algorithm to study tissue-specific deoxyribonucleic acids (DNA) is constructed using the support vector machine. In magnetic resonance imaging, we provide an extension of the T2-store-T2 magnetic resonance relaxometry experiment, used in the analysis of magnetization signals, from 2 to N exchanging sites, where N >= 2.
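    As a small illustration of the support-vector-machine classification step, the sketch below classifies toy DNA sequences from 3-mer count features; the featurization, sequences, and labels are invented for the example and are not the tissue-specific DNA pipeline developed in the thesis.

        from itertools import product
        import numpy as np
        from sklearn.svm import SVC

        KMERS = ["".join(p) for p in product("ACGT", repeat=3)]

        def kmer_counts(seq):
            """Count occurrences of each 3-mer in a DNA sequence."""
            return np.array([sum(seq[i:i + 3] == k for i in range(len(seq) - 2)) for k in KMERS])

        seqs = ["ACGTACGTGGCC", "TTTTAACCGGAA", "ACGTACGAGGCC", "TTTAAACCGGTA"]
        labels = [0, 1, 0, 1]                              # hypothetical tissue classes
        X = np.vstack([kmer_counts(s) for s in seqs])

        clf = SVC(kernel="linear").fit(X, labels)          # linear support vector classifier
        print(clf.predict(kmer_counts("ACGTACGTGGCA").reshape(1, -1)))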