Mathematics

Permanent URI for this community: http://hdl.handle.net/1903/2261

Search Results

Now showing 1 - 4 of 4
  • Item
    Complexity-Regularized Regression for Serially-Correlated Residuals with Applications to Stock Market Data
    (MDPI, 2014-12-23) Darmon, David; Girvan, Michelle
    A popular approach in the investigation of the short-term behavior of a non-stationary time series is to assume that the time series decomposes additively into a long-term trend and short-term fluctuations. A first step towards investigating the short-term behavior requires estimation of the trend, typically via smoothing in the time domain. We propose a method for time-domain smoothing, called complexity-regularized regression (CRR). This method extends recent work, which infers a regression function that makes residuals from a model “look random”. Our approach operationalizes non-randomness in the residuals by applying ideas from computational mechanics, in particular the statistical complexity of the residual process. The method is compared to generalized cross-validation (GCV), a standard approach for inferring regression functions, and shown to outperform GCV when the error terms are serially correlated. Regression under serially-correlated residuals has applications to time series analysis, where the residuals may represent short timescale activity. We apply CRR to a time series drawn from the Dow Jones Industrial Average and examine how both the long-term and short-term behavior of the market have changed over time.
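    The CRR procedure itself is not reproduced here; the sketch below (Python, with an invented series and a simple Gaussian kernel smoother) only illustrates the setup the abstract describes: an additive trend-plus-fluctuations decomposition in which the trend is estimated by time-domain smoothing and the smoothing parameter is chosen by GCV, the baseline that CRR is compared against.

```python
# Sketch only: estimate a long-term trend by time-domain smoothing with the
# bandwidth chosen by generalized cross-validation (GCV), then inspect residuals.
# The series and the Gaussian kernel smoother are invented for illustration.
import numpy as np

def kernel_smoother_matrix(t, bandwidth):
    """Row-normalized Gaussian kernel weights, so fitted values are S @ y."""
    d = (t[:, None] - t[None, :]) / bandwidth
    w = np.exp(-0.5 * d ** 2)
    return w / w.sum(axis=1, keepdims=True)

def gcv_score(y, S):
    """GCV(h) = n * ||(I - S) y||^2 / (n - trace(S))^2."""
    n = len(y)
    resid = y - S @ y
    return n * np.sum(resid ** 2) / (n - np.trace(S)) ** 2

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 300)
trend = np.sin(2 * np.pi * t)                    # long-term component
noise = rng.normal(scale=0.3, size=t.size)       # i.i.d. here; the paper's focus is serially-correlated errors
y = trend + noise

bandwidths = np.linspace(0.01, 0.2, 20)
scores = [gcv_score(y, kernel_smoother_matrix(t, h)) for h in bandwidths]
best_h = bandwidths[int(np.argmin(scores))]

trend_hat = kernel_smoother_matrix(t, best_h) @ y
residuals = y - trend_hat                        # the "short-term fluctuations"
lag1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
print(f"GCV bandwidth: {best_h:.3f}, lag-1 residual correlation: {lag1:.3f}")
```

    The lag-1 correlation printed at the end is only a crude stand-in for the statistical-complexity diagnostic that CRR uses; when the errors are positively serially correlated, GCV is known to undersmooth, which is the failure mode the paper addresses.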
  • Item
    Innovations In Time Series Forecasting: New Validation Procedures to Improve Forecasting Accuracy and A Novel Machine Learning Strategy for Model Selection
    (2021) Varela Alvarenga, Gustavo; Kedem, Benjamin; Mathematics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    This dissertation is divided into two parts. The first part introduces the p-Holdout family of validation schemes for minimizing the generalization error rate and improving forecasting accuracy. More specifically, if one wants to compare different forecasting methods, or models, based on their performance, one may choose to use “out-of-sample tests” based on formal hypothesis tests, or “out-of-sample tests” based on data-driven procedures that directly compare the models using an error measure (e.g., MSE, MASE). To distinguish between the two “out-of-sample tests” terminologies seen in the literature, we will use the term “out-of-sample tests” for the former and “out-of-sample validation” for the latter. Both methods rely on some form of data split; we call these data-partition methods “validation schemes.” We provide a history of their use with time series data, along with their formulas and the formulas for the associated out-of-sample generalization errors, and we attempt to organize the different terminologies used in the statistics, econometrics, and machine learning literature into one set of terms. Moreover, we noticed that the schemes used in a time series context overlook one crucial characteristic of this type of data: its seasonality. We also observed that deseasonalizing is not often done in the machine learning literature. With this in mind, we introduce the p-Holdout family of validation schemes, which includes three new procedures that we developed specifically to account for a series’ periodicity. Our results show that, when applied to benchmark data and compared to state-of-the-art schemes, the new procedures are computationally inexpensive, improve forecast accuracy, and greatly reduce, on average, the forecast error bias, especially when applied to non-stationary time series.
    In the second part of this dissertation, we introduce a new machine learning strategy for selecting forecasting models, which we call the GEARS (generalized and rolling sample) strategy. The “generalized” part of the name reflects our use of generalized linear models combined with partial likelihood inference to estimate the parameters; it has been shown that partial likelihood inference permits very flexible conditions under which time series can be correctly analyzed with GLMs. This makes it easy for users to estimate multivariate (or univariate) time series models: all they have to do is provide the left-hand-side variable, the variables that should enter the right-hand side of the model, and their lags. GLMs also allow for interactions and all sorts of non-linear links. This simple setup is an advantage over more complicated models such as state-space and GARCH models, and the ability to include covariates and interactions is an advantage over ARIMA, the Theta method, and other univariate methods. The “rolling sample” part refers to estimating the parameters over a sample of fixed size that moves forward across successive rounds of estimation (also known as “folds”); this resembles the “rolling window” validation scheme, except that ours does not start at T = 1. The “best” model is chosen from the set of all possible combinations of covariates, and their respective lags, entering the right-hand side of the forecasting model; its selection is based on minimizing the average error measure over all folds. Once this is done, the best model’s estimated coefficients are used to produce the out-of-sample forecasts.
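    A minimal sketch of a rolling-sample selection loop in the spirit of GEARS, not the dissertation's implementation (which uses GLMs with partial likelihood): candidate lag sets for a simple least-squares autoregression are scored by their average one-step-ahead squared error over fixed-size windows that move forward, and the set with the smallest average error is selected. The toy series, window length, fold spacing, and candidate lags are all invented for illustration.

```python
# Sketch only (not the dissertation's GEARS implementation, which uses GLMs with
# partial likelihood): select a lag set by rolling-sample validation, i.e. a
# fixed-size window that moves forward, scoring one-step-ahead squared errors.
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 400
y = np.zeros(n)
for t in range(2, n):                            # toy AR(2) data-generating process
    y[t] = 0.6 * y[t - 1] - 0.2 * y[t - 2] + rng.normal()

def one_step_forecast(train, lags):
    """Least-squares fit of y_t on an intercept and y_{t-l}, l in lags; forecast the next value."""
    m = max(lags)
    X = np.column_stack([np.ones(len(train) - m)] +
                        [train[m - l:len(train) - l] for l in lags])
    beta, *_ = np.linalg.lstsq(X, train[m:], rcond=None)
    x_new = np.concatenate(([1.0], [train[len(train) - l] for l in lags]))
    return x_new @ beta

window = 100                                     # fixed rolling-sample size
candidate_lag_sets = [c for r in (1, 2, 3) for c in itertools.combinations((1, 2, 3), r)]

avg_mse = {}
for lags in candidate_lag_sets:
    errors = []
    for start in range(0, n - window - 1, 20):   # "folds": the window slides forward
        train = y[start:start + window]
        errors.append((y[start + window] - one_step_forecast(train, lags)) ** 2)
    avg_mse[lags] = np.mean(errors)

best = min(avg_mse, key=avg_mse.get)             # model minimizing the average fold error
print("selected lag set:", best, "average fold MSE:", round(avg_mse[best], 3))
```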
    We applied the GEARS method to all 100,000 time series used in the 2018 M-Competition, the M4 Forecasting Competition. We produced one-step-ahead forecasts for each series and compared our results with the submitted approaches and the benchmark methods. The GEARS strategy yielded the best results, in terms of the smallest overall weighted average of the forecast errors, more often than any of the twenty-five top methods in that competition. We had the best results in 8,750 cases out of the 100,000, while the procedure that won the competition had better results in fewer than 7,300 series. Moreover, the GEARS strategy shows promise when dealing with multivariate time series. Here, we estimated several forecasting models based on a complex formulation that includes covariates with variable and fixed lags, quadratic terms, and interaction terms. The accuracy of the forecasts obtained with GEARS was far superior to that of the predictions from an ARIMA model. This result, together with the fact that our strategy for dealing with multivariate series is far simpler than VAR, state-space, or cointegration approaches, bodes well for the future of our procedure. An R package was written for the GEARS strategy, and a prototype web application, built with the R package “Shiny”, was also developed to disseminate this method.
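    As a rough illustration of the multivariate formulation mentioned above (lagged covariates, a quadratic term, and an interaction), the following sketch fits a Gaussian GLM with the statsmodels formula interface. It is not the GEARS R package, and the variables, lags, and data are invented.

```python
# Sketch only: a Gaussian GLM with a lagged covariate, a quadratic term, and an
# interaction, fitted via the statsmodels formula interface. Variables, lags, and
# data are invented; this is not the GEARS R package.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 0.5 * np.roll(x1, 1) + 0.3 * x2 ** 2 + 0.2 * np.roll(x1, 1) * x2 + rng.normal(scale=0.5, size=n)

df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})
df["x1_lag1"] = df["x1"].shift(1)                # fixed one-period lag of a covariate
df = df.dropna()

model = smf.glm("y ~ x1_lag1 + I(x2 ** 2) + x1_lag1:x2",
                data=df, family=sm.families.Gaussian()).fit()
print(model.params.round(3))

# In-sample prediction for the last row; a real one-step-ahead forecast would use
# next-period covariate values instead.
print(float(model.predict(df.iloc[[-1]]).iloc[0]))
```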
  • Item
    Estimation of a Function of a Large Covariance Matrix Using Classical and Bayesian Methods
    (2018) Law, Judith N.; Lahiri, Partha; Mathematics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    In this dissertation, we consider the problem of estimating a high-dimensional covariance matrix in the presence of a small sample size. The proposed Bayesian solution is general and can be applied to different functions of the covariance matrix in a wide range of scientific applications, though we focus narrowly on a specific application: the allocation of assets in a portfolio, where the function is vector-valued with components that sum to unity. While a long history of time series data often exists, in practice only a shorter length is tenable, to avoid violating the critical assumption of an equal covariance matrix of investment returns over the period. Using Monte Carlo simulations and real data analysis, we show that for small sample sizes, allocation estimates based on the sample covariance matrix can perform poorly in terms of the traditional measures used to evaluate an allocation for portfolio analysis. When the sample size is less than the dimension of the covariance matrix, we encounter difficulty computing the allocation estimates because of the singularity of the sample covariance matrix. We evaluate a few classical estimators. Among them, the allocation estimator based on the well-known POET estimator is developed using a factor model. While our simulation and data analysis illustrate the good behavior of POET for large sample sizes (consistent with the asymptotic theory), our study indicates that it does not perform well in small samples when compared to our proposed Bayesian estimator. A constrained Bayes estimator of the allocation vector is proposed that is the best, in terms of the posterior risk under a given prior, among all estimators that satisfy the constraint. In this sense, it is better than all classical plug-in estimators, including POET and the proposed Bayesian estimator. We compare the proposed Bayesian method with the constrained Bayes estimator using the traditional evaluation measures used in portfolio analysis and find that they show similar behavior. In addition to point estimation, the proposed Bayesian approach yields a straightforward measure of uncertainty of the estimate and allows construction of credible intervals for a wide range of parameters.
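    The abstract does not spell out the allocation function, so the sketch below assumes one standard choice whose components sum to unity, the global minimum-variance weights, to illustrate both the plug-in estimate from a sample covariance matrix and the singularity problem that arises when the sample size is smaller than the dimension. The dimensions, returns, and the shrinkage workaround are invented and are not the dissertation's Bayesian, constrained Bayes, or POET estimators.

```python
# Sketch only: plug-in "allocation" from a sample covariance matrix, assuming the
# global minimum-variance weights w = S^{-1} 1 / (1' S^{-1} 1) as the vector-valued
# function whose components sum to one. Dimensions and returns are invented.
import numpy as np

def min_variance_weights(cov):
    """Weights proportional to cov^{-1} @ 1, normalized to sum to unity."""
    ones = np.ones(cov.shape[0])
    w = np.linalg.solve(cov, ones)
    return w / w.sum()

rng = np.random.default_rng(3)
p = 50                                            # number of assets
returns_large = rng.normal(size=(500, p))         # n = 500 > p: sample covariance invertible
returns_small = rng.normal(size=(30, p))          # n = 30 < p: sample covariance singular

w_large = min_variance_weights(np.cov(returns_large, rowvar=False))
print("large-sample weights sum to:", round(w_large.sum(), 6))

cov_small = np.cov(returns_small, rowvar=False)
print("rank of small-sample covariance:", np.linalg.matrix_rank(cov_small), "out of", p)
try:
    w_small = min_variance_weights(cov_small)     # singular: fails or is numerically meaningless
    print("plug-in weights are unreliable; max |weight| =", round(np.abs(w_small).max(), 2))
except np.linalg.LinAlgError:
    print("sample covariance is singular; the plug-in allocation cannot be computed")

# A simple (non-Bayesian) workaround: shrink toward a scaled identity before inverting.
shrunk = 0.9 * cov_small + 0.1 * (np.trace(cov_small) / p) * np.eye(p)
w_shrunk = min_variance_weights(shrunk)
print("shrinkage weights sum to:", round(w_shrunk.sum(), 6))
```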
  • Item
    Anomaly Detection in Time Series: Theoretical and Practical Improvements for Disease Outbreak Detection
    (2009) Lotze, Thomas Harvey; Shmueli, Galit; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    The automatic collection and increasing availability of health data provide a new opportunity for techniques to monitor this information. By monitoring pre-diagnostic data sources, such as over-the-counter cough medicine sales or emergency room chief complaints of cough, there exists the potential to detect disease outbreaks earlier than traditional laboratory disease confirmation results. This research is particularly important for a modern, highly connected society, where the onset of a disease outbreak can be swift and deadly, whether caused by a naturally occurring global pandemic such as swine flu or a targeted act of bioterrorism. In this dissertation, we first describe the problem and the current state of research in disease outbreak detection, then provide four main additions to the field. First, we formalize a framework for analyzing health series data and detecting anomalies: using forecasting methods to predict the next day's value, subtracting the forecast to create residuals, and finally applying detection algorithms to the residuals. The formalized framework makes explicit the link between the accuracy of the forecasting method and the performance of the detector, and can be used to quantify and analyze the performance of a variety of heuristic methods. Second, we describe improvements for the forecasting of health data series: the use of weather as a predictor, cross-series covariates, and ensemble forecasting each improve forecasts of health data. Third, we describe improvements for detection, including the use of multivariate statistics for anomaly detection and additional day-of-week preprocessing to aid detection. Most significantly, we also provide a new method, based on the CuScore, for optimizing detection when the impact of the disease outbreak is known. This method can provide a detector that is optimal for rapid detection or for the probability of detection within a certain timeframe. Finally, we describe a method for improved comparison of detection methods: we provide tools to evaluate how well a simulated data set captures the characteristics of the authentic series, as well as time-lag heatmaps, a new way of visualizing daily detection rates and of comparing two methods more informatively.
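    A minimal sketch of the forecast-residual-detect framework the dissertation formalizes, with every ingredient simplified: the daily counts are simulated, the forecaster is a trailing same-weekday mean, and the detector is a plain one-sided CUSUM rather than the CuScore-based method described above. The injected outbreak and all thresholds are arbitrary illustration choices.

```python
# Sketch only: the forecast / residual / detect pipeline with every piece simplified.
# Forecaster: trailing same-weekday mean; detector: one-sided CUSUM (not the CuScore
# method described above). Counts, outbreak, and thresholds are invented.
import numpy as np

rng = np.random.default_rng(4)
n_days = 365
dow = np.arange(n_days) % 7
baseline = 100 + 15 * np.isin(dow, (0, 1))        # higher counts early in the week
counts = rng.poisson(baseline).astype(float)
counts[300:310] += np.linspace(5, 40, 10)         # injected "outbreak" ramp

# Step 1: forecast each day with the mean of the same weekday over the last 4 weeks.
forecasts = np.full(n_days, np.nan)
for t in range(28, n_days):
    same_dow = np.arange(t - 28, t)[dow[t - 28:t] == dow[t]]
    forecasts[t] = counts[same_dow].mean()

# Step 2: residuals, standardized by their spread over an assumed in-control period.
residuals = counts - forecasts
z = residuals / np.nanstd(residuals[28:200])

# Step 3: one-sided CUSUM on the residuals; alarm when the statistic crosses h.
k, h = 0.5, 4.0                                   # reference value and alarm threshold
cusum, alarms = 0.0, []
for t in range(28, n_days):
    cusum = max(0.0, cusum + z[t] - k)
    if cusum > h:
        alarms.append(t)
        cusum = 0.0                               # reset after an alarm
print("alarm days:", alarms)
```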