Computer Science Theses and Dissertations
Permanent URI for this collection: http://hdl.handle.net/1903/2756
Item: Anomaly Detection in Time Series: Theoretical and Practical Improvements for Disease Outbreak Detection (2009)
Lotze, Thomas Harvey; Shmueli, Galit; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
The automatic collection and increasing availability of health data provide a new opportunity for techniques to monitor this information. By monitoring pre-diagnostic data sources, such as over-the-counter cough medicine sales or emergency room chief complaints of cough, there exists the potential to detect disease outbreaks earlier than traditional laboratory disease confirmation results. This research is particularly important for a modern, highly connected society, where the onset of a disease outbreak can be swift and deadly, whether caused by a naturally occurring global pandemic such as swine flu or a targeted act of bioterrorism. In this dissertation, we first describe the problem and the current state of research in disease outbreak detection, then provide four main additions to the field. First, we formalize a framework for analyzing health series data and detecting anomalies: using forecasting methods to predict the next day's value, subtracting the forecast to create residuals, and finally applying detection algorithms to the residuals. The formalized framework makes explicit the link between the accuracy of the forecasting method and the performance of the detector, and it can be used to quantify and analyze the performance of a variety of heuristic methods. Second, we describe improvements to the forecasting of health data series: the use of weather as a predictor, cross-series covariates, and ensemble forecasting each improves forecasts of health data. Third, we describe improvements to detection, including the use of multivariate statistics for anomaly detection and additional day-of-week preprocessing. Most significantly, we also provide a new method, based on the CuScore, for optimizing detection when the impact of the disease outbreak is known. This method can provide an optimal detector for rapid detection, or for maximizing the probability of detection within a given timeframe. Finally, we describe a method for improved comparison of detection methods. We provide tools to evaluate how well a simulated data set captures the characteristics of an authentic series, and we introduce time-lag heatmaps, a new and more informative way of visualizing daily detection rates and of comparing two methods.
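A minimal sketch of the forecast-residual-detect framework described in the item above. The seven-day moving-average forecaster and the one-sided CUSUM detector are illustrative stand-ins rather than the methods developed in the dissertation, and the data are simulated.

```python
import numpy as np

def moving_average_forecast(series, window=7):
    """Forecast each day as the mean of the previous `window` days (illustrative stand-in)."""
    forecasts = np.full(len(series), np.nan)
    for t in range(window, len(series)):
        forecasts[t] = series[t - window:t].mean()
    return forecasts

def cusum_detect(residuals, k=0.5, h=4.0):
    """One-sided CUSUM on standardized residuals; returns indices where an alarm is raised."""
    valid = residuals[~np.isnan(residuals)]
    z = (residuals - valid.mean()) / valid.std()
    s, alarms = 0.0, []
    for t, zt in enumerate(z):
        if np.isnan(zt):
            continue
        s = max(0.0, s + zt - k)
        if s > h:
            alarms.append(t)
            s = 0.0  # reset after an alarm
    return alarms

# Hypothetical daily counts of a pre-diagnostic series (e.g., cough-medicine sales).
counts = np.random.poisson(100, 365).astype(float)
counts[300:310] += 40  # injected outbreak signal, for illustration only
forecasts = moving_average_forecast(counts)
residuals = counts - forecasts
print(cusum_detect(residuals))
```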
Item: Tractable Learning and Inference in High-Treewidth Graphical Models (2009)
Domke, Justin; Aloimonos, Yiannis; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Probabilistic graphical models, by making conditional independence assumptions, can represent complex joint distributions in a factorized form. However, in large problems graphical models often run into two issues. First, in non-treelike graphs, computational issues frustrate exact inference; there are several approximate inference algorithms that, while often working well, do not obey approximation bounds. Second, traditional learning methods are not robust to model errors: if the conditional independence assumptions of the model are violated, poor predictions can result. This thesis proposes two new methods for learning the parameters of graphical models: implicit and procedural fitting. The goal of these methods is to improve the results of running a particular inference algorithm. Implicit fitting views inference as a large nonlinear energy function over predicted marginals; during learning, the parameters are adjusted to place the minima of this function close to the true marginals. Inspired by algorithms like loopy belief propagation, procedural fitting considers inference as a message-passing procedure; parameters are adjusted during learning so that this message-passing process gives the best results. Both methods are robust to model errors and to approximate inference because learning is done directly in terms of predictive accuracy.

Item: Mining of Business Data (2009)
Zhang, Shu; Jank, Wolfgang; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Applying statistical tools to help understand business processes and make informed business decisions has attracted an enormous amount of research interest in recent years. In this dissertation, we develop and apply data mining techniques to two sources of data: online bidding data from eBay and offline sales transaction data from a grocery product distributor. We mine online auction data to develop forecasting models and bidding strategies, and we mine offline sales transaction data to investigate sales people's price formation process. We start by discussing bidders' bidding strategies in online auctions. Conventional bidding strategies do not help bidders select an auction to bid on. We propose an automated and data-driven strategy which consists of a dynamic forecasting model for the auction closing price and a bidding framework built around this model to determine the best auction to bid on and the best bid amount. One important component of our bidding strategy is a good forecasting model, and we investigate forecasting alternatives in three ways. First, we develop model selection strategies for online auctions (Chapter 3). Second, we propose a novel functional K-nearest neighbor (KNN) forecaster for real-time forecasting of online auctions (Chapter 4); the forecaster uses information from other auctions, weighing their contributions by their relevance in terms of auction features, and it improves predictive performance compared to several competing models across various levels of data heterogeneity. Third, we develop a Beta model (Chapter 5) for capturing auction price paths and find that this model has advantageous forecasting capability. Apart from online transactions, we also employ data mining techniques to understand offline transactions, where sales representatives (salesreps) serve as the medium for interacting with customers and quoting prices. We investigate the mental models underlying salesreps' decision making and find that price recommendation makes salesreps concentrate on cost-related information. In summary, the dissertation develops various data mining techniques for business data. Our study is of great importance for understanding auction price formation processes, forecasting auction outcomes, optimizing bidding strategies, and identifying key factors in sales people's decision making. These techniques not only advance our understanding of business processes, but also help design business infrastructure.
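As a rough illustration of the functional KNN idea in the Mining of Business Data abstract above, the sketch below predicts an auction's closing price as a distance-weighted average over the k most similar past auctions. The feature set, standardization, and inverse-distance weighting are assumptions chosen for illustration, not the estimator developed in Chapter 4.

```python
import numpy as np

def knn_forecast_closing_price(query_features, past_features, past_closing_prices, k=5):
    """Weighted KNN forecast: average the closing prices of the k nearest past auctions,
    weighting each neighbor by inverse distance in standardized feature space."""
    X = np.asarray(past_features, dtype=float)
    q = np.asarray(query_features, dtype=float)
    # Standardize features so that, e.g., opening bid and number of bids are comparable.
    mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-9
    Xs, qs = (X - mu) / sigma, (q - mu) / sigma
    dist = np.linalg.norm(Xs - qs, axis=1)
    nearest = np.argsort(dist)[:k]
    weights = 1.0 / (dist[nearest] + 1e-9)
    return float(np.average(np.asarray(past_closing_prices)[nearest], weights=weights))

# Hypothetical auction features: [opening bid, seller rating, number of bids so far].
past = np.array([[1.0, 250, 12], [5.0, 980, 30], [0.5, 40, 3], [2.0, 610, 22]])
prices = np.array([35.0, 61.0, 18.0, 48.0])
print(knn_forecast_closing_price([1.5, 500, 15], past, prices, k=2))
```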
Item: Diagnostics for Nonlinear Mixed-Effects Models (2009)
Nagem, Mohamed Ould; Kedem, Benjamin; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
The estimation methods for Nonlinear Mixed-Effects Models (NLMM) still largely rely on numerical approximation of the likelihood function, and the properties of these methods are yet to be characterized. These methods are available in most statistical software packages, such as S-Plus and SAS; however, approaches for assessing the reliability of these estimation methods are still open to debate. Moreover, the lack of a common measure to identify the best-fitting model is still an open area of research. Common software packages such as SAS and S-Plus do not provide a specific method for computing such a measure other than the traditional Akaike Information Criterion (AIC), Akaike [2], the Bayesian Information Criterion (BIC), Schwarz [38], or the likelihood ratio. These methods are comparative in nature and are very hard to interpret in this context because of the complex structure and dependent nature of the populations they were intended to analyze. This dissertation focuses on approximate methods of estimating the parameters of NLMM. In Chapter 1, the general form of an NLMM is introduced and real data examples are presented to illustrate the usefulness of NLMM where a standard regression model is not appropriate; a general review of approximation methods for the log-likelihood function is also given. In Chapter 2, we compare three approximation techniques that are widely used in the estimation of NLMM through extensive simulation studies motivated by two widely used data sets, comparing the empirical estimates from three different approximations of the log-likelihood function with respect to bias, precision, convergence rate, and 95% confidence interval coverage probability. We compare the First Order approximation (FO) of Beal and Sheiner [5], the Laplace approximation (LP) of Wolfinger [49], and the Gaussian Quadrature (GQ) of Davidian and Gallant [10]. We also compare these approaches under different sample size configurations and analyze their effects on both the fixed effects estimates and the precision measures. Which approximation yields the best estimates, and with what degree of precision, appears to depend on many aspects; we explore some of these, such as the magnitude of variability among the random effects, the covariance structure of the random parameters, the way in which the random parameters enter the model, and the "linearity" or "closeness to linearity" of the model as a function of these random parameters. We conclude that, while no method outperforms the others on a consistent basis, both the GQ and LP methods provide the most accurate estimates. The FO method has the advantage that it is exact when the model is linear in the random effects; it is also computationally simple and provides reasonable convergence rates. In Chapter 3 we investigate the robustness and sensitivity of the three approximation techniques to the structure of the random effect parameters, the dimension of these parameters, and the correlation structure of the covariance matrix. We expand the work of Hartford and Davidian [18] to assess the robustness of these approximation methods under different scenarios of random effect covariance structures: (1) single random effect models; (2) correlated random effect models; (3) non-correlated random effect models. We show that the LP and GQ methods are very similar and provide the most accurate estimates. Even though the LP is fairly robust to mild deviations, the LP estimates can be extremely biased because of the difficulty of achieving convergence, and the LP method is sensitive to misspecification of the inter-individual model. In Chapter 4 we apply the goodness-of-fit measure (GOF) of Hosmer et al. [20] and Sturdivant and Hosmer [43] to a class of NLMM and evaluate the asymptotic sum of squared residuals statistic as a measure of goodness of fit by conditioning the response on the random effect parameter and using Taylor series approximations in the estimation technique. Simulations of different mixed logistic regression models are evaluated, as well as the effect of sample size on such statistics. We show that the proposed sum of squared residuals statistic works well for a class of mixed logistic regression models with continuous covariates and a modest sample size; however, the same statistic fails to provide adequate power to detect the correct model in the presence of binary covariates.
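To make the Gaussian Quadrature (GQ) approximation concrete, here is a minimal sketch that approximates the marginal log-likelihood of a logistic model with a single Gaussian random intercept using Gauss-Hermite quadrature. The model, data, and number of quadrature nodes are illustrative assumptions, not the simulation settings used in the dissertation.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def marginal_loglik_gq(y, x, beta0, beta1, sigma_b, n_nodes=20):
    """Gauss-Hermite approximation of the marginal log-likelihood for one subject
    under logit P(y=1 | b) = beta0 + beta1*x + b, with b ~ N(0, sigma_b^2)."""
    nodes, weights = hermgauss(n_nodes)            # nodes/weights for weight exp(-z^2)
    b = np.sqrt(2.0) * sigma_b * nodes             # change of variables to N(0, sigma_b^2)
    eta = beta0 + beta1 * x[:, None] + b[None, :]  # shape (n_obs, n_nodes)
    p = 1.0 / (1.0 + np.exp(-eta))
    lik_given_b = np.prod(np.where(y[:, None] == 1, p, 1.0 - p), axis=0)
    return np.log(np.sum(weights / np.sqrt(np.pi) * lik_given_b))

# Hypothetical data for one subject: five binary responses and a covariate.
y = np.array([1, 0, 1, 1, 0])
x = np.array([0.2, 1.5, -0.3, 0.8, 2.1])
print(marginal_loglik_gq(y, x, beta0=-0.5, beta1=0.7, sigma_b=1.0))
```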
Item: Dependence Structure for Levy Processes and Its Application in Finance (2008-06-06)
Chen, Qiwen; Madan, Dilip B.; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
In this paper, we introduce DSPMD, a discretely sampled process with pre-specified marginals and pre-specified dependence, and SRLMD, a series representation for Levy processes with pre-specified marginals and pre-specified dependence. In the DSPMD for Levy processes, some regular copula can be extracted from the discrete samples of a joint process so as to correlate discrete samples on the pre-specified marginal processes. We prove that if the pre-specified marginals and pre-specified joint processes are Levy processes, the DSPMD converges to a Levy process. Compared with the Levy copula proposed by Tankov, DSPMD offers easy access to the statistical properties of the dependence structure through the copula at the random-variable level, which is difficult with the Levy copula. It also comes with a simulation algorithm that overcomes the first-component bias effect of the series representation algorithm proposed by Tankov. As an application and example of DSPMD for Levy processes, we examine the statistical explanatory power of the VG copula implied by multidimensional VG processes. Several baskets of equities and indices are considered, and some basket options are priced using risk-neutral marginals and statistical dependence. SRLMD is based on Rosinski's series representation and Sklar's theorem for Levy copulas. Starting with a series representation of a multi-dimensional Levy process, we transform each term in the series component-wise to new jumps satisfying a pre-specified jump measure. The resulting series is the SRLMD, which is an exact Levy process, not an approximation. We give an example of an alpha-stable Levy copula which has advantages over what Tankov proposed in the following respects. First, it is naturally high dimensional. Second, the structure is general enough to allow anything from complete dependence to complete independence, and it can have any regular copula behavior built in. Third, and most importantly, in simulation the truncation error can be well controlled, and simulation efficiency does not deteriorate in the nearly independent case. For compound Poisson processes as pre-specified marginals, zero truncation error can be attained.
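The core DSPMD idea, correlating discretely sampled pre-specified marginals through a copula, can be sketched as follows. This is a simplified illustration only: a Gaussian copula stands in for whatever regular copula is extracted from the joint process, and gamma-process increments stand in for the pre-specified marginal Levy processes.

```python
import numpy as np
from scipy.stats import norm, gamma

def correlated_gamma_increments(n_steps, dt, rho, shape_rate=5.0, scale=0.01, seed=0):
    """Correlate discretely sampled marginals through a copula: draw dependent uniforms
    from a Gaussian copula (a stand-in for a copula extracted from a joint process) and
    push them through the inverse CDFs of the pre-specified marginal increments
    (here, gamma-process increments over a step of size dt)."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    z = rng.multivariate_normal(np.zeros(2), cov, size=n_steps)
    u = norm.cdf(z)                                            # dependent uniforms
    increments = gamma.ppf(u, a=shape_rate * dt, scale=scale)  # pre-specified marginals
    return increments.cumsum(axis=0)                           # two dependent gamma processes

paths = correlated_gamma_increments(n_steps=250, dt=1 / 250, rho=0.7)
print(np.corrcoef(np.diff(paths, axis=0, prepend=0.0).T))
```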
Item: An Investigation of the Relationship Between Automated Machine Translation Evaluation Metrics and User Performance on an Information Extraction Task (2007-12-04)
Tate, Calandra Rilette; Slud, Eric V.; Dorr, Bonnie J.; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
This dissertation applies nonparametric statistical techniques to Machine Translation (MT) evaluation using data from an MT evaluation experiment conducted through a joint Army Research Laboratory (ARL) and Center for the Advanced Study of Language (CASL) project. In particular, it studies the relationship between human performance on an information extraction task with translated documents and well-known automated translation evaluation metric scores for those documents. Findings from a correlation analysis of the connection between autometrics and task-based metrics are presented and contrasted with current strategies for evaluating translations. A novel idea for assessing partial rank correlation in the presence of grouping factors is also introduced. Lastly, this dissertation presents a framework for task-based MT evaluation and predictive modeling of task responses that gives new information about the relative predictive strengths of the different autometrics (and recoded variants of them) within the statistical Generalized Linear Models developed in analyses of the information extraction task data. This work shows that current autometrics are inadequate for predicting task performance, but that near adequacy can be achieved through the use of recoded autometrics in a logistic regression setting. As a result, a class of automated metrics best suited to predicting performance is established, and suggestions are offered about how to use metrics to supplement expensive and time-consuming experiments with human participants. Users can now begin to tie the intrinsic automated metrics to the extrinsic metrics for the tasks they perform. The bottom line is that there is a need to average away MT dependence (averaged metrics perform better in overall predictions than the original autometrics), and combinations of recoded metrics perform better than any individual metric. Ultimately, MT evaluation methodology is extended to create new metrics especially relevant to task-based comparisons. A formal method to establish that differences among metrics as predictors are strong enough not to be due to chance remains future work. Given the lack of connection in the field of MT evaluation between task utility and the interpretation of automated evaluation metrics, as well as the absence of solid statistical reasoning in evaluating MT, there is a need to bring innovative and interdisciplinary analytical techniques to this problem. Because no previous work in the MT evaluation literature has done this kind of statistical modeling or linked automated metrics with how well MT supports human tasks, this work is unique and has high potential for benefiting the Machine Translation research community.
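The predictive-modeling step can be illustrated with a small logistic regression relating a recoded automated metric score for a document to whether a user answered an extraction question correctly. The metric, the tercile recoding, and the data below are hypothetical; the dissertation's actual models are Generalized Linear Models over several autometrics and their recoded variants.

```python
import numpy as np
from scipy.optimize import minimize

def fit_logistic(X, y):
    """Maximum-likelihood logistic regression (intercept added internally)."""
    Xd = np.column_stack([np.ones(len(X)), X])

    def neg_loglik(beta):
        eta = Xd @ beta
        return np.sum(np.log1p(np.exp(eta)) - y * eta)

    return minimize(neg_loglik, np.zeros(Xd.shape[1]), method="BFGS").x

# Hypothetical data: an automated metric score per document, recoded into terciles,
# and a binary indicator of whether the extraction question was answered correctly.
rng = np.random.default_rng(1)
scores = rng.uniform(0, 1, 200)
recoded = np.digitize(scores, np.quantile(scores, [1 / 3, 2 / 3]))  # terciles 0, 1, 2
correct = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + 1.2 * recoded))))

beta = fit_logistic(recoded.reshape(-1, 1), correct)
print("intercept, slope:", beta)
```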
Item: Global Optimization of Finite Mixture Models (2007-05-31)
Heath, Jeffrey Wells; Fu, Michael; Jank, Wolfgang; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
The Expectation-Maximization (EM) algorithm is a popular and convenient tool for the estimation of Gaussian mixture models and its natural extension, model-based clustering. However, while the algorithm is convenient to implement and numerically very stable, it only produces solutions that are locally optimal. Thus, EM may not achieve the globally optimal solution in Gaussian mixture analysis problems, which can have a large number of local optima. This dissertation introduces several new algorithms designed to produce globally optimal solutions for Gaussian mixture models. The building blocks for these algorithms are methods from the operations research literature, namely the Cross-Entropy (CE) method and Model Reference Adaptive Search (MRAS). The new algorithms must efficiently simulate positive definite covariance matrices for the Gaussian mixture components, and we propose several solutions to this problem. One solution is to blend the updating procedures of CE and MRAS with the principles of Expectation-Maximization updating for the covariance matrices, leading to two new algorithms, CE-EM and MRAS-EM. We also propose two additional algorithms, CE-CD and MRAS-CD, which rely on the Cholesky decomposition to construct the random covariance matrices. Numerical experiments illustrate the effectiveness of the proposed algorithms in finding global optima where the classical EM fails to do so. We find that although a single run of the new algorithms may be slower than EM, they have the potential to produce significantly better global solutions to the model-based clustering problem. We also show that the global optimum matters in the sense that it significantly improves the clustering task. Furthermore, we provide a theoretical proof of global convergence to the optimal solution of the likelihood function of Gaussian mixtures for one of the algorithms, namely MRAS-CD. This offers support that the algorithm is not merely an ad hoc heuristic, but is systematically designed to produce global solutions to Gaussian mixture models. Finally, we investigate the fitness landscape of Gaussian mixture models and give evidence for why this is a difficult global optimization problem. We discuss different metrics that can be used to evaluate the difficulty of global optimization problems and then apply them in the context of Gaussian mixture models.

Item: Comparing Regime-Switching Models In Time Series: Logistic Mixtures vs. Markov Switching (2007-05-16)
Paliouras, Dimitrios V.; Kedem, Benjamin; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
The purpose of this thesis is to review several related regime-switching time series models. Specifically, we use simulated data to compare models in which the unobserved state vector follows a Markov process against models with an independent logistic mixture process. We apply these techniques to crude oil and heating oil futures prices, using several explanatory variables to estimate the unobserved regimes. We find that crude oil is characterized by regime switching, where prices alternate between a high-volatility state with low returns and significant mean reversion and a low-volatility state with positive returns and some trending. The spread between one-month and three-month futures prices is an important determinant in the dynamics of crude oil prices.
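As a toy illustration of the two regime mechanisms compared in the item directly above, the sketch below simulates a return series whose unobserved state follows either a two-state Markov chain or an independent logistic mixture driven by a covariate (standing in for something like the futures spread). All parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
# Regime-specific return parameters (illustrative): high-volatility vs. low-volatility state.
mu = np.array([-0.001, 0.0005])
sigma = np.array([0.03, 0.01])

# (a) Markov-switching states: persistence via a 2x2 transition matrix.
P = np.array([[0.95, 0.05], [0.02, 0.98]])
states_markov = np.zeros(n, dtype=int)
for t in range(1, n):
    states_markov[t] = rng.choice(2, p=P[states_markov[t - 1]])

# (b) Independent logistic-mixture states: P(state=1 | x_t) through a logistic link.
x = rng.normal(size=n)
p_state1 = 1.0 / (1.0 + np.exp(-(0.3 + 1.5 * x)))
states_logistic = rng.binomial(1, p_state1)

returns_markov = rng.normal(mu[states_markov], sigma[states_markov])
returns_logistic = rng.normal(mu[states_logistic], sigma[states_logistic])
print(returns_markov.std(), returns_logistic.std())
```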
Item: Representing, Visualizing, and Modeling Online Auction Data (2007-05-03)
Hyde, Valerie; Shmueli, Galit; Jank, Wolfgang; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
The wide and growing popularity of online auctions creates enormous amounts of publicly available bid data, providing an important topic for research. These data pose unique statistical challenges because of their special structures. This research focuses on methods for representing, visualizing, and modeling such data. We find semi-continuous data manifested in auction consumer surplus data. Semi-continuous data arise in many applications where naturally continuous data become contaminated by the data-generating mechanism; the resulting data contain several values that are "too frequent", a hybrid between discrete and continuous data. The main problem is that standard statistical methods, which are geared towards continuous or discrete data, cannot be applied adequately to semi-continuous data. We propose a new set of two transformations for semi-continuous data that "iron out" the too-frequent values into completely continuous data, and we show that the transformed data maintain the properties of the original data but are suitable for standard analysis. We are also interested in the effect of concurrency, not only on the final price of an auction but also on the relationship between the current bid levels and the high bids in simultaneous ongoing auctions. We suggest several ways to visually represent the concurrent nature of online auction prices, including "rug plots" for displaying the price evolution and price dynamics in concurrent auctions, time-grouped box plots, and moving statistics plots. We find price trends and relationships between prices in concurrent auctions and raise new research questions. Finally, bids during an online auction arrive at unequally spaced discrete time points. Our goal is to capture the entire continuous price-evolution function by representing it as a functional object. Various nonparametric smoothing methods exist to estimate the functional object from the observed discrete bid data; previous studies use penalized polynomial and monotone smoothing splines, but these require the determination of a large number of coefficients and often lengthy computational time. We present a family of parametric growth curves that describe the price evolution during online auctions. Our approach is parsimonious and has an appealing interpretation in the online auction context. We also provide an automated fitting algorithm that is computationally fast.
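The growth-curve idea at the end of the item above can be illustrated by fitting a simple parametric curve to discrete bid data. The three-parameter logistic curve, the least-squares fit, and the bid data below are assumptions chosen for illustration; they are not the specific curve family or fitting algorithm proposed in the dissertation.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic_growth(t, L, k, t0):
    """Three-parameter logistic curve: price approaches ceiling L at rate k around midpoint t0."""
    return L / (1.0 + np.exp(-k * (t - t0)))

# Hypothetical bids from a 7-day auction: (time in days, current price).
bid_times = np.array([0.2, 0.9, 2.1, 3.5, 4.8, 5.9, 6.5, 6.9])
bid_prices = np.array([1.0, 2.5, 4.0, 8.0, 15.0, 27.0, 41.0, 52.0])

params, _ = curve_fit(logistic_growth, bid_times, bid_prices, p0=[60.0, 1.0, 5.0], maxfev=10000)
L, k, t0 = params
print(f"ceiling={L:.1f}, rate={k:.2f}, midpoint={t0:.2f} days")
# Evaluating the fitted curve on a fine grid recovers a smooth price-evolution path.
grid = np.linspace(0, 7, 50)
smooth_path = logistic_growth(grid, *params)
```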
Item: Generalized Volatility Model And Calculating VaR Using A New Semiparametric Model (2005-12-05)
Guo, Haiming; Kedem, Benjamin; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
The first part of the dissertation concerns financial volatility models. Financial volatility exhibits several stylized facts, such as excess kurtosis, volatility clustering, and leverage effects, and a good volatility model should be able to capture all of them. Among volatility models, ARCH, GARCH, EGARCH, and stochastic volatility models are the most important. In this part we propose a generalized volatility model, or GVM, which generalizes the entire ARCH family as well as stochastic volatility models. The GVM adopts the structure of the generalized linear model (GLM). GLM was originally intended for independent data; however, using partial likelihood, GLM can be extended to time series and can then be applied to predict financial volatility. Interestingly, the ARCH family of models consists of special cases of GVM, and any covariates can easily be added to a GVM. As an example, we use GVM to predict realized volatility. Because of the availability of high-frequency data in today's market, we can calculate realized volatility directly. We compare the prediction results of GVM with those of other classical models; by the measure of mean squared error, GVM is the best among these models. The second part of this dissertation is about value at risk (VaR). The most common methods for computing VaR are GARCH, historical simulation, and extreme value theory. A new semiparametric model based on a density ratio is developed in Chapter 3. By assuming that the density of the return series is an exponential function times the density of another reference return series, we can derive the density function of the portfolio's distribution and then compute the corresponding quantile, or VaR. We ran a Monte Carlo simulation to compare the semiparametric model with the traditional VaR models under many different scenarios, and in several cases the semiparametric model performs quite satisfactorily. Furthermore, when applied to real data, the semiparametric model performs best among all the considered models by the metric of failure rate.
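A heavily simplified sketch of the density-ratio idea from the second part of the last item: assume the portfolio return density equals an exponential tilt, exp(a + b1*x + b2*x^2), times a reference density, estimate the tilt by logistic regression on sample membership, and read the VaR off the tilt-weighted empirical distribution of the reference sample. The tilt basis, the estimation shortcut, and the simulated data are all illustrative assumptions, not the dissertation's estimator.

```python
import numpy as np
from scipy.optimize import minimize

def tilted_var(reference, portfolio, alpha_level=0.05):
    """Simplified density-ratio VaR: portfolio density assumed equal to
    exp(a + b1*x + b2*x^2) times the reference density."""
    x = np.concatenate([reference, portfolio])
    label = np.concatenate([np.zeros(len(reference)), np.ones(len(portfolio))])
    H = np.column_stack([np.ones_like(x), x, x ** 2])

    def neg_loglik(theta):
        eta = H @ theta
        return np.sum(np.log1p(np.exp(eta)) - label * eta)

    theta = minimize(neg_loglik, np.zeros(3), method="BFGS").x
    # Tilt weights on the reference sample approximate the portfolio distribution.
    w = np.exp(np.column_stack([np.ones_like(reference), reference, reference ** 2]) @ theta)
    w /= w.sum()
    order = np.argsort(reference)
    cdf = np.cumsum(w[order])
    q = reference[order][np.searchsorted(cdf, alpha_level)]
    return -q  # report VaR as a positive loss

rng = np.random.default_rng(7)
reference_returns = rng.normal(0.0, 0.01, 2000)
portfolio_returns = rng.normal(-0.001, 0.015, 500)  # shifted and more volatile, for illustration
print(tilted_var(reference_returns, portfolio_returns))
```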