Theses and Dissertations from UMD
Permanent URI for this community: http://hdl.handle.net/1903/2
New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be up to a four-month delay before a given thesis/dissertation appears in DRUM.
More information is available at Theses and Dissertations at University of Maryland Libraries.
Search Results
5 results
Item Innovations In Time Series Forecasting: New Validation Procedures to Improve Forecasting Accuracy and A Novel Machine Learning Strategy for Model Selection (2021)
Varela Alvarenga, Gustavo; Kedem, Benjamin; Mathematics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

This dissertation is divided into two parts. The first part introduces the p-Holdout family of validation schemes for minimizing the generalization error rate and improving forecasting accuracy. More specifically, if one wants to compare different forecasting methods, or models, based on their performance, one may choose to use “out-of-sample tests” based on formal hypothesis tests, or “out-of-sample tests” based on data-driven procedures that directly compare the models using an error measure (e.g., MSE, MASE). To distinguish between the two “out-of-sample tests” terminologies seen in the literature, we use the term “out-of-sample tests” for the former and “out-of-sample validation” for the latter. Both methods rely on some form of data split. We call these data partition methods “validation schemes.” We provide a history of their use with time series data, along with their formulas and the formulas for the associated out-of-sample generalization errors, and we attempt to organize the different terminologies used in the statistics, econometrics, and machine learning literature into one set of terms. Moreover, we noticed that the schemes used in a time series context overlook one crucial characteristic of this type of data: its seasonality. We also observed that deseasonalizing is not often done in the machine learning literature. With this in mind, we introduce the p-Holdout family of validation schemes. It contains three new procedures that we developed specifically to account for a series’ periodicity. Our results show that, when applied to benchmark data and compared to state-of-the-art schemes, the new procedures are computationally inexpensive, improve the forecast accuracy, and greatly reduce, on average, the forecast error bias, especially when applied to non-stationary time series.

In the second part of this dissertation, we introduce a new machine learning strategy to select forecasting models, which we call the GEARS (generalized and rolling sample) strategy. The “generalized” part of the name refers to our use of generalized linear models combined with partial likelihood inference to estimate the parameters. It has been shown that partial likelihood inference enables very flexible conditions that allow for correct time series analysis using GLMs. With this, it becomes easy for users to estimate multivariate (or univariate) time series models: all they have to do is provide the response (left-hand side) variable, the covariates that should enter the right-hand side of the model, and their lags. GLMs also allow for the inclusion of interactions and all sorts of non-linear links. This easy setup is an advantage over more complicated models such as state-space and GARCH models, and the fact that we can include covariates and interactions is an advantage over ARIMA, the Theta method, and other univariate methods. The “rolling sample” part refers to estimating the parameters over a sample of a fixed size that “moves forward” at different “rounds” of estimation (also known as “folds”). This part resembles the “rolling window” validation scheme, but ours does not start at T = 1.
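To make the rolling-sample idea concrete, the following is a minimal, self-contained Python sketch (not the GEARS implementation or its R package; the window size, step, and toy candidate forecasters are assumptions chosen for illustration). It splits a series into fixed-size training samples that move forward across folds and selects the candidate with the smallest average one-step-ahead squared error, mirroring the selection rule described next.

    import numpy as np

    def rolling_sample_folds(n_obs, window, horizon=1, step=1):
        """Yield (train, test) index arrays for a fixed-size sample that
        moves forward one 'round' (fold) at a time; it need not start at t = 1."""
        start = 0
        while start + window + horizon <= n_obs:
            yield (np.arange(start, start + window),
                   np.arange(start + window, start + window + horizon))
            start += step

    # Toy candidate forecasters standing in for competing models.
    candidates = {
        "naive": lambda train: train[-1],                # last observed value
        "window_mean": lambda train: train.mean(),       # mean of the window
        "drift": lambda train: train[-1] + (train[-1] - train[0]) / (len(train) - 1),
    }

    rng = np.random.default_rng(0)
    y = np.cumsum(rng.normal(size=300))                  # toy non-stationary series

    errors = {name: [] for name in candidates}
    for train_idx, test_idx in rolling_sample_folds(len(y), window=60, horizon=1, step=5):
        train = y[train_idx]
        actual = y[test_idx][0]
        for name, forecast in candidates.items():
            errors[name].append((actual - forecast(train)) ** 2)

    # Select the candidate with the smallest average error over all folds.
    avg_error = {name: float(np.mean(e)) for name, e in errors.items()}
    best = min(avg_error, key=avg_error.get)
    print(avg_error, "-> selected:", best)

In GEARS the candidates would instead be GLM specifications built from combinations of covariates and their lags, but the fold structure and the averaging of the error measure over folds work in the same way.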
The “best” model is taken from the set of all possible combinations of covariates, and their respective lags, included in the right-hand side of the forecasting model. Its selection is based on the minimization of the average error measure over all folds. Once this is done, the best model’s estimated coefficients are used to obtain the out-of-sample forecasts. We applied the GEARS method to all 100,000 time series used in the 2018 M-Competition, the M4 Forecasting Competition. We produced one-step-ahead forecasts for each series and compared our results with the submitted approaches and the benchmark methods. The GEARS strategy yielded the best results, in terms of the smallest overall weighted average of the forecast errors, more often than any of the twenty-five top methods in that competition: we had the best results in 8,750 of the 100,000 series, while the procedure that won the competition had better results in fewer than 7,300 series. Moreover, the GEARS strategy shows promise when dealing with multivariate time series. Here, we estimated several forecasting models based on a complex formulation that includes covariates with variable and fixed lags, quadratic terms, and interaction terms. The accuracy of the forecasts obtained with GEARS was far superior to that of the predictions from an ARIMA model. This result, together with the fact that our strategy for dealing with multivariate series is far simpler than VAR, state-space, or cointegration approaches, points to a promising future for our procedure. An R package was written for the GEARS strategy, and a prototype web application, built with the R package “Shiny”, was also developed to disseminate this method.

Item When Threats Strike: Establishment of a Linguistic Tool for Tracking Threats Over Time (2021)
Choi, Jinny; Gelfand, Michele J.; Psychology; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

The ability to detect fluctuations and upticks in levels of societal threats has important implications for understanding a variety of social and psychological group processes. In this study, I develop and validate a comprehensive linguistic dictionary, which identifies the common terminology used to describe collective threats in the English lexicon. These threat-relevant terms are tracked across a large corpus of newspaper articles and social media postings over time, generating indices that enable real-time and historical assessments. As a comprehensive measure of collective threats over time, this study tests how threats correspond to key cultural, political, and economic shifts in society. Additionally, this project seeks to capture whether content that deploys more threat terms captures more public attention.

Item Estimation of a Function of a Large Covariance Matrix Using Classical and Bayesian Methods (2018)
Law, Judith N.; Lahiri, Partha; Mathematics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

In this dissertation, we consider the problem of estimating a high-dimensional covariance matrix in the presence of a small sample size. The proposed Bayesian solution is general and can be applied to different functions of the covariance matrix in a wide range of scientific applications, though we focus narrowly on the specific application of allocation of assets in a portfolio, where the function is vector-valued with components that sum to unity.
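As one concrete instance of such a vector-valued function whose components sum to one, here is a minimal Python sketch (an illustrative assumption, not necessarily the allocation function or the estimators studied in this dissertation) of the global minimum-variance allocation w = C^{-1}1 / (1'C^{-1}1) computed from a plug-in covariance estimate, together with the singularity problem that arises when the sample size is smaller than the dimension, which is discussed below.

    import numpy as np

    def min_variance_weights(cov):
        """Global minimum-variance allocation w = C^{-1} 1 / (1' C^{-1} 1).
        The components sum to one by construction (they may be negative,
        i.e., short positions, unless further constraints are imposed)."""
        ones = np.ones(cov.shape[0])
        w = np.linalg.solve(cov, ones)
        return w / w.sum()

    # Illustration of the small-sample problem: with n < p the sample
    # covariance matrix is singular, so the plug-in allocation cannot be
    # computed from it directly.
    rng = np.random.default_rng(1)
    p, n = 50, 30                                  # dimension exceeds sample size
    returns = rng.normal(size=(n, p))              # toy return data
    sample_cov = np.cov(returns, rowvar=False)
    print(np.linalg.matrix_rank(sample_cov))       # at most n - 1 = 29 < 50

    # One ad hoc fix: shrink toward a scaled identity to restore invertibility.
    shrunk_cov = 0.9 * sample_cov + 0.1 * (np.trace(sample_cov) / p) * np.eye(p)
    weights = min_variance_weights(shrunk_cov)
    print(round(weights.sum(), 6))                 # 1.0

Shrinking toward a scaled identity is only one ad hoc way to restore invertibility; the dissertation instead compares classical plug-in estimators such as POET with the proposed Bayesian and constrained Bayes estimators.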
While high-dimensional time series data are often available, in practice only a shorter sample length is tenable, to avoid violating the critical assumption that the covariance matrix of investment returns is constant over the period. Using Monte Carlo simulations and real data analysis, we show that for small sample sizes, allocation estimates based on the sample covariance matrix can perform poorly in terms of the traditional measures used to evaluate an allocation for portfolio analysis. When the sample size is less than the dimension of the covariance matrix, we encounter difficulty computing the allocation estimates because of singularity of the sample covariance matrix. We evaluate a few classical estimators. Among them, the allocation estimator based on the well-known POET estimator is developed using a factor model. While our simulation and data analysis illustrate the good behavior of POET for large sample sizes (consistent with the asymptotic theory), our study indicates that it does not perform well in small samples when compared to our proposed Bayesian estimator. A constrained Bayes estimator of the allocation vector is proposed that is best, in terms of the posterior risk under a given prior, among all estimators that satisfy the constraint. In this sense, it is better than all classical plug-in estimators, including POET and the proposed Bayesian estimator. We compare the proposed Bayesian method with the constrained Bayes using the traditional evaluation measures used in portfolio analysis and find that they show similar behavior. In addition to point estimation, the proposed Bayesian approach yields a straightforward measure of uncertainty of the estimate and allows construction of credible intervals for a wide range of parameters.

Item Anomaly Detection in Time Series: Theoretical and Practical Improvements for Disease Outbreak Detection (2009)
Lotze, Thomas Harvey; Shmueli, Galit; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

The automatic collection and increasing availability of health data provide a new opportunity for techniques to monitor this information. By monitoring pre-diagnostic data sources, such as over-the-counter cough medicine sales or emergency room chief complaints of cough, there exists the potential to detect disease outbreaks earlier than traditional laboratory confirmation results allow. This research is particularly important for a modern, highly connected society, where the onset of a disease outbreak can be swift and deadly, whether caused by a naturally occurring global pandemic such as swine flu or a targeted act of bioterrorism. In this dissertation, we first describe the problem and current state of research in disease outbreak detection, then provide four main additions to the field. First, we formalize a framework for analyzing health series data and detecting anomalies: using forecasting methods to predict the next day's value, subtracting the forecast to create residuals, and finally using detection algorithms on the residuals. The formalized framework indicates the link between the forecast accuracy of the forecasting method and the performance of the detector, and can be used to quantify and analyze the performance of a variety of heuristic methods.
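A minimal Python sketch of this forecast-residual-detect pipeline (the trailing moving-average forecaster, the z-score rule, and the toy data are assumptions for illustration, not the forecasting or detection methods evaluated in the dissertation):

    import numpy as np

    def detect_anomalies(series, window=7, threshold=3.0):
        """Forecast -> residual -> detect pipeline on a daily health series:
        predict each day with a trailing moving average, subtract the forecast
        to form a residual, and flag days whose standardized residual exceeds
        a one-sided cutoff (only upticks matter for outbreak detection)."""
        series = np.asarray(series, dtype=float)
        flags = np.zeros(len(series), dtype=bool)
        for t in range(window, len(series)):
            history = series[t - window:t]
            forecast = history.mean()                  # next-day forecast
            residual = series[t] - forecast
            scale = history.std(ddof=1) or 1.0         # guard against zero spread
            flags[t] = residual / scale > threshold
        return flags

    # Toy daily counts with an injected outbreak-like ramp around day 80.
    rng = np.random.default_rng(2)
    counts = rng.poisson(20, size=120).astype(float)
    counts[80:90] += np.linspace(5.0, 30.0, 10)
    print(np.where(detect_anomalies(counts))[0])       # days flagged as anomalous

Any forecaster (including the weather-based, cross-series, and ensemble forecasts discussed next) and any detector on the residuals can be slotted into this structure, which is what allows the framework to tie forecast accuracy to detection performance.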
Second, we describe improvements for the forecasting of health data series: the application of weather as a predictor, cross-series covariates, and ensemble forecasting each improve forecasts of health data. Third, we describe improvements for detection, including the use of multivariate statistics for anomaly detection and additional day-of-week preprocessing to aid detection. Most significantly, we also provide a new method, based on the CuScore, for optimizing detection when the impact of the disease outbreak is known. This method can provide a detector that is optimal for rapid detection, or for the probability of detection within a certain timeframe. Finally, we describe a method for improved comparison of detection methods. We provide tools to evaluate how well a simulated data set captures the characteristics of the authentic series, as well as time-lag heatmaps, a new way of visualizing daily detection rates and of displaying the comparison between two methods in a more informative way.

Item Temporal Treemaps for Visualizing Time Series Data (2004-05-12)
Chintalapani, Gouthami; Shneiderman, Ben; Plaisant, Catherine; Systems Engineering

The treemap is an interactive graphical technique for visualizing large hierarchical information spaces using nested rectangles in a space-filling manner. The size and color of the rectangles show data attributes and enable users to spot trends, patterns, or exceptions. Current implementations of treemaps help explore time-invariant data; however, many real-world applications require monitoring hierarchical, time-variant data. This thesis extends treemaps to the interactive exploration of time series data by mapping temporal changes to the color attribute of the treemap. Specific contributions of this thesis include:
· Temporal treemaps for exploring time series data through visualizing absolute or relative changes, animating them over time, filtering data items, and discovering trends using time series graphs.
· The design and implementation of extensible software modules based on systems engineering methodologies and an object-oriented approach.
· Validation through five case studies: health statistics, web logs, production data, birth statistics, and help-desk tickets, with future improvements identified from user feedback.
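As an illustration of mapping temporal change to a treemap's color attribute, here is a minimal Python sketch (the relative-change formula, the color endpoints, and the cap are assumptions for illustration, not the thesis implementation): each node's change between two time points is blended from neutral gray toward red for decreases or green for increases, and the resulting colors would be used to fill the treemap rectangles.

    import numpy as np

    def change_to_color(previous, current, relative=True, cap=0.5):
        """Map a node's temporal change to an RGB fill color for its rectangle:
        decreases blend from neutral gray toward red, increases toward green,
        and changes at or beyond the cap receive the fully saturated color."""
        change = (current - previous) / previous if relative else current - previous
        x = float(np.clip(change / cap, -1.0, 1.0))
        gray = np.array([0.85, 0.85, 0.85])
        red = np.array([0.80, 0.10, 0.10])
        green = np.array([0.10, 0.60, 0.10])
        target = green if x >= 0 else red
        return tuple(gray + abs(x) * (target - gray))   # linear blend toward the extreme

    # Example: web-log hit counts for a few pages at two consecutive time points.
    previous = {"home": 1200, "docs": 300, "blog": 450}
    current = {"home": 1100, "docs": 390, "blog": 450}
    colors = {page: change_to_color(previous[page], current[page]) for page in previous}
    print(colors)   # these RGB triples would color the corresponding rectangles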