Mathematics

Permanent URI for this community: http://hdl.handle.net/1903/2261


Search Results

Now showing 1 - 10 of 45
  • Item
    MEANS AND AVERAGING ON RIEMANNIAN MANIFOLDS
    (2009) Afsari, Bijan; Krishnaprasad, P.S.; Grove, Karsten; Mathematics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Processing of manifold-valued data has received considerable attention in recent years. Standard data processing methods are not adequate for such data. Among the many related data processing tasks, finding means or averages of manifold-valued data is a basic and important one. Although means on Riemannian manifolds have a long history, there are still many unanswered theoretical questions about them, some of which we try to answer. We focus on two classes of means: the Riemannian $L^{p}$ mean and the recursive-iterative means. The Riemannian $L^{p}$ mean is defined as the solution(s) of a minimization problem, while the recursive-iterative means are defined based on the notion of Mean-Invariance (MI) in a recursive and iterative process. We give a new existence and uniqueness result for the Riemannian $L^{p}$ mean. The significant consequence is that it shows the local and global definitions of the Riemannian $L^{p}$ mean coincide under an uncompromised condition which guarantees the uniqueness of the local mean. We also study smoothness, isometry compatibility, convexity and noise sensitivity properties of the $L^{p}$ mean. In particular, we argue that positive sectional curvature of a manifold can cause high sensitivity to noise for the $L^{2}$ mean, which might lead to a non-averaging behavior of that mean. We show that the $L^{2}$ mean on a manifold of positive curvature can have an averaging property in a weak sense. We introduce the notion of MI and study a large class of recursive-iterative means. MI means are related to an interesting class of dynamical systems that can find Riemannian convex combinations. A special class of MI means, the pairwise mean, which is related to cyclic pursuit on manifolds through an iterative scheme called Perimeter Shrinkage, is also studied. Finally, we derive results specific to the special orthogonal group and the Grassmannian manifold, as these manifolds appear naturally in many applications. We distinguish the $2$-norm Finsler balls of appropriate radius in these manifolds as domains for existence and uniqueness of the studied means. We also introduce some efficient numerical methods to perform the related calculations in the specified manifolds.
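For reference, a standard way to write the weighted Riemannian $L^{p}$ mean discussed in this abstract is sketched below; the weights $w_i$ and distance $d$ are the usual ingredients of the general formulation, not notation taken from the dissertation.

```latex
% Weighted Riemannian L^p mean (Frechet/Karcher-type formulation); illustrative
% notation.  Given points x_1,...,x_N on a Riemannian manifold M with geodesic
% distance d and weights w_i >= 0 summing to 1, the L^p mean is
\[
  \bar{x}_p \;\in\; \arg\min_{y \in M} \; \sum_{i=1}^{N} w_i \, d(y, x_i)^{p},
  \qquad 1 \le p < \infty ,
\]
% with p = 2 giving the Riemannian center of mass (Karcher mean) and p = 1 the
% Riemannian median; in general, existence and uniqueness hold only when the
% data lie in a sufficiently small geodesic ball, which is the setting of the
% existence and uniqueness results described above.
```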
  • Item
    Regularized Variable Selection in Proportional Hazards Model Using Area under Receiver Operating Characteristic Curve Criterion
    (2009) Wang, Wen-Chyi; Yang, Grace L; Mathematics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    The goal of this thesis is to develop a statistical procedure for selecting pertinent predictors among a number of covariates to accurately predict the survival time of a patient. Many variable selection procedures are available in the literature. This thesis focuses on a more recently developed “regularized variable selection procedure”. This procedure, based on a penalized likelihood, can simultaneously address the problems of variable selection and parameter estimation, which previous procedures could not. Specifically, this thesis studies the regularized variable selection procedure in the proportional hazards model for censored survival data. Implementation of the procedure requires judicious determination of the amount of penalty, a regularization parameter λ, on the likelihood and the development of computationally intensive algorithms. In this thesis, a new criterion for determining λ using the notion of “the area under the receiver operating characteristic curve (AUC)” is proposed. The conventional generalized cross-validation criterion (GCV) is based on the likelihood and its second derivative. Unlike GCV, the AUC criterion is based on the performance of disease classification in terms of patients' survival times. Simulations show that the performance of the AUC and GCV criteria is similar, but the AUC criterion gives a better interpretation of the survival data. We also establish the consistency and asymptotic normality of the regularized estimators of parameters in the partial likelihood of the proportional hazards model. Some oracle properties of the regularized estimators are discussed under certain sparsity conditions. An algorithm for selecting λ and computing the regularized estimates is developed. The developed procedure is then illustrated with an application to survival data of patients with head and neck cancer. The results show that the proposed method is comparable with the conventional one.
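The penalized-likelihood setup referred to above can be illustrated with the usual penalized Cox partial likelihood; the specific penalty function $p_{\lambda}$ below (e.g. LASSO or SCAD) is an illustrative assumption, not necessarily the penalty used in the thesis.

```latex
% Penalized Cox partial likelihood (standard formulation, illustrative notation).
% For covariates Z_i, coefficients beta, ordered failure times indexed by j,
% and risk sets R_j, the regularized estimator maximizes
\[
  \ell_{\text{pen}}(\beta)
  \;=\;
  \sum_{j=1}^{k}
  \Bigl[ \beta^{\top} Z_{(j)}
        - \log \sum_{i \in R_j} \exp\bigl(\beta^{\top} Z_i\bigr) \Bigr]
  \;-\; n \sum_{m=1}^{d} p_{\lambda}\bigl(|\beta_m|\bigr),
\]
% where the regularization parameter lambda is chosen here by the AUC criterion
% rather than by generalized cross-validation.
```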
  • Item
    Anomaly Detection in Time Series: Theoretical and Practical Improvements for Disease Outbreak Detection
    (2009) Lotze, Thomas Harvey; Shmueli, Galit; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    The automatic collection and increasing availability of health data provide a new opportunity for techniques to monitor this information. By monitoring pre-diagnostic data sources, such as over-the-counter cough medicine sales or emergency room chief complaints of cough, there exists the potential to detect disease outbreaks earlier than is possible with traditional laboratory confirmation results. This research is particularly important for a modern, highly connected society, where the onset of a disease outbreak can be swift and deadly, whether caused by a naturally occurring global pandemic such as swine flu or a targeted act of bioterrorism. In this dissertation, we first describe the problem and the current state of research in disease outbreak detection, then provide four main additions to the field. First, we formalize a framework for analyzing health series data and detecting anomalies: using forecasting methods to predict the next day's value, subtracting the forecast to create residuals, and finally using detection algorithms on the residuals. The formalized framework makes explicit the link between the accuracy of the forecasting method and the performance of the detector, and can be used to quantify and analyze the performance of a variety of heuristic methods. Second, we describe improvements for the forecasting of health data series. The use of weather as a predictor, cross-series covariates, and ensemble forecasting each provide improvements to forecasting health data. Third, we describe improvements for detection. These include the use of multivariate statistics for anomaly detection and additional day-of-week preprocessing to aid detection. Most significantly, we also provide a new method, based on the CuScore, for optimizing detection when the impact of the disease outbreak is known. This method can provide a detector that is optimal for rapid detection or for the probability of detection within a certain timeframe. Finally, we describe a method for improved comparison of detection methods. We provide tools to evaluate how well a simulated data set captures the characteristics of the authentic series, as well as time-lag heatmaps, a new way of visualizing daily detection rates and of comparing two methods more informatively.
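The forecast / residual / detection framework described above can be sketched in a few lines of Python; the seasonal-mean forecaster and the CUSUM-style detector below are generic stand-ins under simple assumptions, not the dissertation's specific methods.

```python
# Minimal sketch of the forecast -> residual -> detect pipeline for a
# univariate daily count series (illustrative forecaster and detector).
import numpy as np

def forecast_seasonal_mean(series, day, window_weeks=4, season=7):
    """Predict day's value as the mean of the same weekday over prior weeks."""
    past = [series[day - k * season] for k in range(1, window_weeks + 1)
            if day - k * season >= 0]
    return np.mean(past) if past else series[day - 1]

def detect_cusum(residuals, k=0.5, h=4.0):
    """One-sided CUSUM on standardized residuals; returns alarm indices."""
    scale = np.std(residuals) + 1e-9
    s, alarms = 0.0, []
    for t, r in enumerate(residuals):
        s = max(0.0, s + r / scale - k)
        if s > h:
            alarms.append(t)
            s = 0.0                      # reset after an alarm
    return alarms

def monitor(series, burn_in=28):
    """Forecast each day, form residuals, and run the detector on them."""
    residuals = np.array([series[d] - forecast_seasonal_mean(series, d)
                          for d in range(burn_in, len(series))])
    return [a + burn_in for a in detect_cusum(residuals)]

# Toy usage: baseline Poisson noise plus an injected outbreak from day 80.
rng = np.random.default_rng(0)
counts = rng.poisson(20, 100).astype(float)
counts[80:90] += np.linspace(5, 30, 10)   # gradually growing outbreak signal
print("alarm days:", monitor(counts))
```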
  • Item
    Statistical Inference Based On Estimating Functions in Exact and Misspecified Models
    (2009) Janicki, Ryan Louis; Kagan, Abram M; Mathematical Statistics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Estimating functions, introduced by Godambe, are a useful tool for constructing estimators. The classical maximum likelihood estimator and the method of moments estimator are special cases of estimators generated as solutions to certain estimating equations. The main advantage of this method is that it does not require knowledge of the full model, but rather of some functionals, such as a number of moments. We define an estimating function Ψ to be a Fisher estimating function if it satisfies $E_{\theta}(\Psi\Psi^{T}) = -E_{\theta}(d\Psi/d\theta)$. The motivation for considering this class of estimating functions is that a Fisher estimating function behaves much like the Fisher score, and the estimators generated as solutions to these estimating equations behave much like maximum likelihood estimators. The estimating functions in this class share some of the same optimality properties as the Fisher score function, and they have applications to estimation in submodels, elimination of nuisance parameters, and combinations of independent samples. We give some applications of estimating functions to estimation of a location parameter in the presence of a nuisance scale parameter. We also consider the behavior of estimators generated as solutions to estimating equations under model misspecification when the misspecification is small and can be parameterized. A problem related to model misspecification is distinguishing between a finite number of competing parametric families. We construct an estimator that is consistent and efficient, regardless of which family contains the true distribution.
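For context, the standard estimating-equation setup and the information-unbiasedness identity quoted in the abstract can be written as follows; the notation is illustrative.

```latex
% Estimating-equation setup (standard formulation).  Given i.i.d. observations
% X_1,...,X_n and an estimating function Psi(x; theta) with
% E_theta[Psi(X; theta)] = 0, the estimator solves
\[
  \sum_{i=1}^{n} \Psi(X_i;\, \hat{\theta}) \;=\; 0 .
\]
% The Fisher score Psi = d/dtheta log f(x; theta) recovers maximum likelihood,
% and Psi(x; theta) = x - theta recovers the method of moments for a mean.
% The Fisher estimating function condition above is the information-
% unbiasedness identity
\[
  E_{\theta}\bigl(\Psi \Psi^{T}\bigr)
  \;=\;
  -\,E_{\theta}\!\left(\frac{\partial \Psi}{\partial \theta}\right),
\]
% which the Fisher score itself satisfies (a Bartlett-type identity).
```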
  • Item
    Investigating center effects in a multi-center clinical trial study using a parametric proportional hazards meta-analysis model
    (2009) Demissie, Mathewos Solomon; Slud, Eric V; Mathematical Statistics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    In this paper, we investigate meta-analysis of the overall treatment effect in the setting of a multi-center clinical trial in which patient-level data are available. We estimate the overall treatment effect using two methods: a meta-analysis, which uses the summary statistics from each center, and a unified combined analysis of the patient-level data. In the meta-analysis we use a random-effects meta-analysis model, and in both analyses we use a parametric proportional hazards model. In a randomized clinical trial, subjects are recruited at multiple centers to accrue large enough samples within an acceptable period of time and to enhance the generalizability of study results. Heterogeneity across centers may arise from center effects or from the treatment effect itself. To take these heterogeneities into account, random-effects models are used. We performed a data analysis based on a multi-center clinical trial in small-cell lung cancer conducted by the Eastern Cooperative Oncology Group, and then a parallel data analysis within a simulation study. In the simulation study we varied the magnitude of the center and treatment-by-center heterogeneity in the data generation and estimated the overall treatment effect using the two methods. We compared the two methods in terms of bias, mean squared error, and the percentage of significant treatment effects. The simulation study shows that the meta-analysis treatment effect estimates are slightly biased when covariates are included in the analysis.
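A common way to write the random-effects combination of center-level estimates is sketched below; this is a DerSimonian-Laird-style formulation with illustrative notation, and the thesis's exact parametric proportional hazards model may differ.

```latex
% Random-effects meta-analysis of center-level summaries (standard form).
% With estimated log hazard ratios beta-hat_k and standard errors s_k from
% K centers,
\[
  \hat{\beta}_k \;\sim\; N\!\bigl(\beta + b_k,\; s_k^2\bigr),
  \qquad b_k \;\sim\; N\bigl(0,\; \tau^2\bigr),
\]
% the overall treatment effect is estimated by the inverse-variance weighted mean
\[
  \hat{\beta}
  \;=\;
  \frac{\sum_{k=1}^{K} w_k\, \hat{\beta}_k}{\sum_{k=1}^{K} w_k},
  \qquad
  w_k \;=\; \frac{1}{s_k^2 + \hat{\tau}^2},
\]
% where tau^2 captures between-center heterogeneity in the treatment effect.
```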
  • Item
    Mining of Business Data
    (2009) Zhang, Shu; Jank, Wolfgang; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Applying statistical tools to help understand business processes and make informed business decisions has attracted an enormous amount of research interest in recent years. In this dissertation, we develop and apply data mining techniques to two sources of data: online bidding data from eBay and offline sales transaction data from a grocery product distributor. We mine the online auction data to develop forecasting models and bidding strategies, and mine the offline sales transaction data to investigate salespeople's price formation process. We start by discussing bidders' bidding strategies in online auctions. Conventional bidding strategies do not help bidders select an auction to bid on. We propose an automated and data-driven strategy which consists of a dynamic forecasting model for the auction closing price and a bidding framework built around this model to determine the best auction to bid on and the best bid amount. One important component of our bidding strategy is a good forecasting model. We investigate forecasting alternatives in three ways. First, we develop model selection strategies for online auctions (Chapter 3). Second, we propose a novel functional K-nearest neighbor (KNN) forecaster for real-time forecasting of online auctions (Chapter 4). The forecaster uses information from other auctions and weighs their contribution by their relevance in terms of auction features. It improves predictive performance compared to several competing models across various levels of data heterogeneity. Third, we develop a Beta model (Chapter 5) for capturing auction price paths and find that this model has advantageous forecasting capability. Apart from online transactions, we also employ data mining techniques to understand offline transactions, where sales representatives (salesreps) serve as intermediaries who interact with customers and quote prices. We investigate the mental models underlying salesreps' decision making, and find that price recommendation makes salesreps concentrate on cost-related information. In summary, the dissertation develops various data mining techniques for business data. Our study is of great importance for understanding auction price formation processes, forecasting auction outcomes, optimizing bidding strategies, and identifying key factors in salespeople's decision making. These techniques not only advance our understanding of business processes, but also help design business infrastructure.
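A feature-weighted nearest-neighbor price forecast of the general kind described above can be sketched as follows; the feature set, distance weights, and kernel are illustrative assumptions rather than the dissertation's exact functional KNN specification.

```python
# Minimal sketch of a feature-weighted KNN forecast of an auction's closing
# price: similar past auctions contribute more to the prediction.
import numpy as np

def knn_forecast_price(live_features, past_features, past_prices,
                       k=5, feature_weights=None):
    """Forecast a live auction's closing price as a distance-weighted average
    of the closing prices of its k most similar historical auctions."""
    past_features = np.asarray(past_features, dtype=float)
    live = np.asarray(live_features, dtype=float)
    w = np.ones(live.size) if feature_weights is None else np.asarray(feature_weights)
    # Weighted Euclidean distance in auction-feature space
    # (e.g. current price, bid count, seller rating).
    dists = np.sqrt(((past_features - live) ** 2 * w).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    kernel = 1.0 / (dists[nearest] + 1e-9)        # closer auctions count more
    return float(np.average(np.asarray(past_prices, dtype=float)[nearest],
                            weights=kernel))

# Toy usage: features = [current price, number of bids, seller rating].
history_X = [[50, 10, 4.8], [45, 7, 4.2], [80, 15, 4.9], [30, 3, 3.9]]
history_y = [72, 60, 110, 41]
print(knn_forecast_price([55, 9, 4.7], history_X, history_y, k=2))
```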
  • Item
    Diagnostics for Nonlinear Mixed-Effects Models
    (2009) Nagem, Mohamed Ould; Kedem, Benjamin; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    The estimation methods in Nonlinear Mixed-Effects Models (NLMM) still largely rely on numerical approximation of the likelihood function, and the properties of these methods are yet to be characterized. These methods are available in most statistical software packages, such as S-PLUS and SAS; however, approaches for assessing the reliability of these estimation methods are still open to debate. Moreover, the lack of a common measure to capture the best-fitting model is still an open area of research. Common software packages such as SAS and S-PLUS do not provide a specific method for computing such a measure other than the traditional Akaike's Information Criterion (AIC) Akaike [2], Bayesian Information Criterion (BIC) Schwarz [38], or the likelihood ratio. These methods are comparative in nature and are very hard to interpret in this context due to the complex structure and dependent nature of the populations that they were intended to analyze. This dissertation focuses on approximate methods of estimating parameters of NLMM. In chapter 1, the general form of a NLMM is introduced and real data examples are presented to illustrate the usefulness of NLMM where a standard regression model is not appropriate. A general review of the approximation methods for the log-likelihood function is given. In chapter 2, we compared three approximation techniques that are widely used in the estimation of NLMM through extensive simulation studies motivated by two widely used data sets. We compared the empirical estimates from the three different approximations of the log-likelihood function and their bias, precision, convergence rate, and 95% confidence interval coverage probability. We compared the First Order approximation (FO) of Beal and Sheiner [5], the Laplace approximation (LP) of Wolfinger [49], and the Gaussian Quadrature (GQ) of Davidian and Gallant [10]. We also compared these approaches under different sample size configurations and analyzed their effects on both the fixed effects estimates and the precision measures. The question of which approximation yields the best estimates and the degree of precision associated with it seems to depend greatly on many aspects. We explored some of these aspects, such as the magnitude of variability among the random effects, the covariance structure of the random parameters, and the way in which such random parameters enter the model, as well as the "linearity" or "close to linearity" of the model as a function of these random parameters. We concluded that, while no method outperformed the others on a consistent basis, both the GQ and LP methods provided the most accurate estimates. The FO method has the advantage that it is exact when the model is linear in the random effects. It also has the advantage of being computationally simple and provides reasonable convergence rates. In chapter 3 we investigated the robustness and sensitivity of the three approximation techniques to the structure of the random effect parameters, the dimension of these parameters, and the correlation structure of the covariance matrix. We expanded the work of Hartford and Davidian [18] to assess the robustness of these approximation methods under different scenarios (models) of random effect covariance structures: (1) single random effect models; (2) correlated random effect models; (3) non-correlated random effect models. We showed that the LP and GQ methods are very similar and provided the most accurate estimates. Even though the LP method is fairly robust to mild deviations, the LP estimates can be extremely biased due to the difficulty of achieving convergence. The LP method is sensitive to misspecification of the inter-individual model. In chapter 4 we evaluated the goodness-of-fit (GOF) measure of Hosmer et al. [20] and Sturdivant and Hosmer [43] for a class of NLMM and evaluated the asymptotic sum of squared residuals statistic as a measure of goodness of fit by conditioning the response on the random effect parameter and using Taylor series approximations in the estimation technique. Simulations of different mixed logistic regression models were evaluated, as well as the effect of sample size on such statistics. We showed that the proposed sum of squared residuals statistic works well for a class of mixed logistic regression models with continuous covariates and a modest sample size. However, the same statistic failed to provide adequate power to detect the correct model in the presence of binary covariates.
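For reference, a generic two-stage NLMM and the marginal likelihood that the FO, Laplace, and Gaussian quadrature methods approximate can be written as follows; this is a standard formulation with illustrative notation, not the dissertation's exact models.

```latex
% Generic two-stage NLMM.  For subject i and observation j,
\[
  y_{ij} \;=\; f(\phi_i,\, x_{ij}) + \varepsilon_{ij},
  \qquad
  \phi_i \;=\; A_i \beta + B_i b_i,
  \qquad
  b_i \sim N(0, \Psi), \quad \varepsilon_{ij} \sim N(0, \sigma^2),
\]
% with nonlinear mean function f, fixed effects beta, and random effects b_i.
% The marginal likelihood requires integrating out the unobserved b_i,
\[
  L(\beta, \Psi, \sigma^2)
  \;=\;
  \prod_{i=1}^{M} \int p\bigl(y_i \mid b_i;\, \beta, \sigma^2\bigr)\,
                       p\bigl(b_i;\, \Psi\bigr)\, db_i ,
\]
% an integral with no closed form in general; FO, Laplace, and Gaussian
% quadrature are three ways of approximating it.
```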
  • Item
    Discrete Time Stochastic Volatility Model
    (2009) Tang, Guojing; Madan, Dilip; Kedem, Benjamin; Mathematical Statistics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    In this dissertation we propose a new model which captures observed features of asset prices. The model reproduces the skewness and fat tails of asset returns by introducing a discretized variance gamma process as the driving innovation process, in addition to a double gamma process to reflect the stochastic nature of the volatility coefficients. The leverage effect between returns and volatilities is built in by a polynomial function describing the relationship between these two variables. One application of this model is to price volatility contracts whose payoffs depend on realized variance or volatility. Because of the scarcity of market quotes and the consequent unavailability of risk-neutral calibration, we propose a new pricing scheme based on the model estimated from historical data. The estimation of the model parameters is carried out by maximizing the likelihood function, which is calculated through a combination of the Expectation-Maximization and particle filter algorithms. The resulting distribution is transformed by concave distortions, the extent of which reflects the risk aversion level of the market.
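The likelihood evaluation inside such an EM / particle filter scheme can be sketched with a bootstrap particle filter; for simplicity, the sketch below assumes a Gaussian log-volatility AR(1) model as a stand-in for the variance gamma / double gamma specification, so it illustrates the filtering step rather than the dissertation's exact model.

```python
# Bootstrap particle filter estimate of the log-likelihood of a simple
# discrete-time stochastic volatility model (illustrative stand-in model).
import numpy as np

def sv_particle_loglik(returns, mu, phi, sigma_eta, n_particles=2000, seed=0):
    """Log-likelihood of r_t = exp(h_t/2)*eps_t,
    h_t = mu + phi*(h_{t-1}-mu) + sigma_eta*eta_t, with eps, eta ~ N(0,1)."""
    rng = np.random.default_rng(seed)
    # Initialize particles from the stationary distribution of h_t.
    h = rng.normal(mu, sigma_eta / np.sqrt(1.0 - phi**2), n_particles)
    loglik = 0.0
    for r in returns:
        # Propagate the latent log-volatility particles one step.
        h = mu + phi * (h - mu) + sigma_eta * rng.standard_normal(n_particles)
        # Weight each particle by the observation density N(0, exp(h)).
        logw = -0.5 * (np.log(2 * np.pi) + h + r**2 * np.exp(-h))
        m = logw.max()
        w = np.exp(logw - m)
        loglik += m + np.log(w.mean())        # running log-likelihood estimate
        # Multinomial resampling to avoid weight degeneracy.
        idx = rng.choice(n_particles, n_particles, p=w / w.sum())
        h = h[idx]
    return loglik

# Toy usage: simulate returns from the same model, then evaluate the likelihood.
rng = np.random.default_rng(1)
h_true, rets = -1.0, []
for _ in range(200):
    h_true = -1.0 + 0.95 * (h_true + 1.0) + 0.2 * rng.standard_normal()
    rets.append(np.exp(h_true / 2) * rng.standard_normal())
print(sv_particle_loglik(np.array(rets), mu=-1.0, phi=0.95, sigma_eta=0.2))
```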
  • Item
    MODELING MEDIAN HOUSEHOLD INCOME DISTRIBUTION
    (2008) FOSTER, KEVIN MATTHEW; KEDEM, BENJAMIN; Mathematical Statistics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    In this thesis we use U.S. Census data to study the median household income distribution for 13 U.S. counties and seven U.S. states. Over the years, researchers have fitted income data with various probability distributions. During our review of the literature, we saw that researchers do not agree on any one best distribution. We consider the lognormal, gamma, and Weibull distributions, each of which has two parameters. We also investigate the Singh-Maddala, which has three parameters. Finally, we introduce the Generalized Beta II, which has four parameters. These distributions are tested using mean squared error, mean absolute error, the chi-square goodness-of-fit test, Akaike's Information Criterion, and the Bayesian Information Criterion. We also use the graphical technique of Q-Q plots. We discover that the Singh-Maddala most often provides the best-fitting model for our income data, and we recommend that users choose the Singh-Maddala distribution as their model when studying median household income distribution.
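For reference, the Singh-Maddala and Generalized Beta II distributions mentioned above are commonly parameterized as follows (the parameter symbols here are the usual ones, given only for orientation).

```latex
% Singh-Maddala distribution (shape a, scale b, shape q), for x > 0:
\[
  F(x;\, a, b, q) \;=\; 1 - \Bigl[\, 1 + (x/b)^{a} \,\Bigr]^{-q} .
\]
% Generalized Beta of the second kind (GB2) density, with additional shape
% parameter p and beta function B(p, q); Singh-Maddala is the case p = 1:
\[
  f(x;\, a, b, p, q)
  \;=\;
  \frac{a\, x^{ap-1}}
       {b^{ap}\, B(p, q)\, \bigl[\, 1 + (x/b)^{a} \,\bigr]^{p+q}},
  \qquad x > 0 .
\]
```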
  • Item
    A New Scheme for Monitoring Multivariate Process Dispersion
    (2009) Song, Xin; Smith, Paul; Mathematics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Construction of control charts for multivariate process dispersion is not as straightforward as for the process mean. Because of the complexity of out of control scenarios, a general method is not available. In this dissertation, we consider the problem of monitoring multivariate dispersion from two perspectives. First, we derive asymptotic approximations to the power of Nagao's test for the equality of a normal dispersion matrix to a given constant matrix under local and fixed alternatives. Second, we propose various unequally weighted sum of squares estimators for the dispersion matrix, particularly with exponential weights. The new estimators give more weights to more recent observations and are not exactly Wishart distributed. Satterthwaite's method is used to approximate the distribution of the new estimators. By combining these two techniques based on exponentially weighted sums of squares and Nagao's test, we are able to propose a new control scheme MTNT, which is easy to implement. The control limits are easily calculated since they only depend on the dimension of the process and the desired in control average run length. Our simulations show that compared with schemes based on the likelihood ratio test and the sample generalized variance, MTNT has the shortest out of control average run length for a variety of out of control scenarios, particularly when process variances increase.