Theses and Dissertations from UMD
Permanent URI for this community: http://hdl.handle.net/1903/2
New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be up to a four-month delay in the appearance of a given thesis/dissertation in DRUM.
More information is available at Theses and Dissertations at University of Maryland Libraries.
Search Results
6 results
Item: Planetary Scale Data Storage (2018)
Bengfort, Benjamin; Keleher, Peter J; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

The success of virtualization and container-based application deployment has fundamentally changed computing infrastructure from dedicated hardware provisioning to on-demand, shared clouds of computational resources. One of the most interesting effects of this shift is the opportunity to localize applications in multiple geographies and support mobile users around the globe. With relatively few steps, an application and its data systems can be deployed and scaled across continents and oceans, leveraging the existing data centers of much larger cloud providers. The novelty and ease of a global computing context means that we are closer to the advent of an Oceanstore, an Internet-like revolution in personalized, persistent data that securely travels with its users.

At a global scale, however, data systems suffer from physical limitations that significantly impact their consistency and performance. Even with modern telecommunications technology, the latency in communication from Brazil to Japan results in noticeable synchronization delays that violate user expectations. Moreover, the required scale of such systems means that failure is routine. To address these issues, we explore consistency in the implementation of distributed logs, key/value databases, and file systems that are replicated across wide areas. At the core of our system is hierarchical consensus, a geographically distributed consensus algorithm that provides strong consistency, fault tolerance, durability, and adaptability to varying user access patterns. Using hierarchical consensus as a backbone, we further extend our system from data centers to edge regions using federated consistency, an adaptive consistency model that gives satellite replicas high availability at a stronger global consistency than existing weak consistency models. In a deployment of 105 replicas in 15 geographic regions across 5 continents, we show that our implementation provides high throughput, strong consistency, and resiliency in the face of failure. From our experimental validation, we conclude that planetary-scale data storage systems can be implemented algorithmically without sacrificing consistency or performance.
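The abstract describes what hierarchical consensus achieves but not its moving parts. As a minimal sketch of the two-tier idea, assume a root group that maps keys onto regional subquorums, each of which commits writes by local majority vote; the class names, hash-based routing, and vote counting below are illustrative assumptions, not the dissertation's protocol.

```python
# Toy two-level (hierarchical) consensus layout: a root group routes keys
# to regional subquorums; each subquorum commits by simple local majority.
# All names and rules here are illustrative, not the actual implementation.
from dataclasses import dataclass, field

@dataclass
class Subquorum:
    region: str
    replicas: list                       # replica node ids in this region
    log: list = field(default_factory=list)

    def commit(self, entry, votes):
        # A majority of local replicas must acknowledge before commit.
        if votes > len(self.replicas) // 2:
            self.log.append(entry)
            return True
        return False

class RootQuorum:
    """Root tier: maps keys to regional subquorums."""
    def __init__(self, subquorums):
        self.subquorums = subquorums

    def route(self, key):
        # Hash-partition keys across regions; the real system instead
        # adapts this mapping to observed user access patterns.
        return self.subquorums[hash(key) % len(self.subquorums)]

regions = [Subquorum("us-east", ["r1", "r2", "r3"]),
           Subquorum("eu-west", ["r4", "r5", "r6"])]
root = RootQuorum(regions)
sq = root.route("user:42")
print(sq.region, sq.commit(("user:42", "profile-v1"), votes=2))  # commits
```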
Item: MINIMIZATION OF RESOURCE CONSUMPTION THROUGH WORKLOAD CONSOLIDATION IN LARGE-SCALE DISTRIBUTED DATA PLATFORMS (2014)
Kayyoor, Ashwin Kumar; Deshpande, Amol; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

The rapid increase in the data volumes encountered in many application domains has led to widespread adoption of parallel and distributed data management systems like parallel databases and MapReduce-based frameworks (e.g., Hadoop) in recent years. Use of such parallel and distributed frameworks is expected to accelerate in the coming years, putting further strain on already-scarce resources like compute power, network bandwidth, and energy. To reduce total execution times, there is a trend towards increasing execution parallelism by spreading out data across a large number of machines. However, this often significantly increases total resource consumption, and especially energy consumption, because of process startup costs and other overheads (e.g., communication overheads).

In this dissertation, we develop several data management techniques to minimize resource consumption through workload consolidation. We introduce a key metric called query span, i.e., the number of machines involved in the execution of a query or a job, and we propose minimizing query span in order to minimize per-query resource consumption. To that end, we develop several workload-driven data partitioning and replica selection algorithms that attempt to minimize the average query span by exploiting the fact that most distributed environments need to use replication for fault tolerance. Extensive experiments on various datasets show that judicious data placement and replication can dramatically reduce average query spans, resulting in significant reductions in resource consumption. We show our results primarily on two applications: a distributed data warehouse system and distributed information retrieval. In the first case, we show that minimizing average query spans can minimize overall resource consumption for a given workload and can also improve the performance of complex analytical queries. In the second case, our approach minimizes overall search cost and also effectively trades off search cost with load imbalance.

The best case of resource efficiency for any underlying data processing system is achieved when the job or query can be run efficiently on a single machine (i.e., query span = 1). In the final part of the dissertation, we discuss an in-memory MapReduce system optimized for performing complex analytics tasks on input data sizes that fit in a single machine's memory. We argue that systems like Hadoop, which are designed to operate across a large number of machines, are not optimal for small and medium-sized complex analytics tasks because of high startup costs, heavy disk activity, and wasteful checkpointing. We have developed a prototype runtime called HONE that is API compatible with standard (distributed) Hadoop: existing Hadoop code can be run, without modification, on a multi-core shared-memory machine. This allows us to take existing Hadoop algorithms and find the most suitable runtime environment for execution on datasets of varying sizes.

Overall, our key contributions include identifying the query span metric and its relationship with overall resource consumption in scale-out architectures, introducing several workload-aware techniques to optimize this metric, demonstrating the effectiveness of query span minimization in different application scenarios, and developing HONE, a novel in-memory MapReduce system for a single machine that takes advantage of scale-up architectures. Our thorough experiments on real and synthetic datasets demonstrate the efficacy of our proposed approaches.
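Query span has a direct operational reading that a small sketch can make concrete: given a placement mapping each data item to the machines holding its replicas, the span of a query is the number of distinct machines it touches, and a greedy replica choice that prefers machines already in use shrinks it. The placement data and the greedy rule below are illustrative assumptions, not the dissertation's algorithms.

```python
# Hypothetical sketch of the "query span" metric and a greedy replica
# choice that reduces it. "placement" maps each data item to the set of
# machines holding a replica of that item.
def greedy_replica_selection(query_items, placement):
    used, chosen = set(), {}
    for item in query_items:
        replicas = placement[item]
        overlap = replicas & used
        # Prefer a machine this query already touches; otherwise any replica.
        pick = min(overlap) if overlap else min(replicas)
        chosen[item] = pick
        used.add(pick)
    return chosen

placement = {"a": {"m1", "m2"}, "b": {"m2", "m3"}, "c": {"m1", "m3"}}
chosen = greedy_replica_selection(["a", "b", "c"], placement)
span = len(set(chosen.values()))   # query span = distinct machines touched
print(chosen, "span =", span)      # greedy yields span 2 here, not 3
```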
Item: Sharing Private Data Over Public Networks (2012)
Baden, Randy; Bhattacharjee, Bobby; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Users share their sensitive personal data with each other through public services and applications provided by third parties. Users trust application providers with their private data because they want access to the provided services. However, trusting third parties with private data can be risky: providers may profit by sharing that data with others regardless of the user's desires, and they may fail to provide the security necessary to prevent data leaks.

Though users may choose between service providers, in many cases no service provider offers the desired service without being granted access to user data. Users must make a choice: forgo privacy or be denied service. I demonstrate that fine-grained user privacy policies and rich services and applications are not irreconcilable. I provide technical solutions to privacy problems that protect user data using cryptography while still allowing services to operate on that data, primarily through content-agnostic references to data items and user-controlled pseudonymity. I support two classes of social networking applications without trusting third parties with private data: applications that do not require data contents to provide a service, and applications that deal with data where the only private information is the binding of the data to an identity. Together, these classes encompass a broad range of social networking applications.
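As a rough sketch of the content-agnostic-reference idea, the provider can be handed only ciphertext stored under an opaque identifier, while the reference and the decryption key travel out of band to authorized friends. This is a hypothetical minimal construction using the Python cryptography package; the store, function names, and sharing flow are assumptions, and the dissertation's actual protocols are richer.

```python
# Minimal sketch: an untrusted provider stores only ciphertext under an
# opaque, content-agnostic id; (ref, key) is shared privately with friends.
import secrets
from cryptography.fernet import Fernet

provider_store = {}                          # stands in for the third party

def publish(plaintext: bytes):
    key = Fernet.generate_key()
    ref = secrets.token_hex(16)              # opaque reference, reveals nothing
    provider_store[ref] = Fernet(key).encrypt(plaintext)
    return ref, key                          # shared out of band, not via provider

def fetch(ref: str, key: bytes) -> bytes:
    return Fernet(key).decrypt(provider_store[ref])

ref, key = publish(b"my private status update")
print(fetch(ref, key))                       # only (ref, key) holders recover it
```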
Item: Large Scale Distributed Testing for Fault Classification and Isolation (2010)
Fouche, Sandro Maleewatana; Porter, Adam A; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Developing confidence in the quality of software is an increasingly difficult problem. As the complexity and integration of software systems increase, the tools and techniques used to perform quality assurance (QA) tasks must evolve with them. To date, several quality assurance tools have been developed to help ensure the quality of modern software, but several limitations remain to be overcome. Among the challenges faced by current QA tools are (1) increased use of distributed software solutions, (2) limited test resources and constrained time schedules, and (3) failures that are difficult to replicate and may occur only rarely. While existing distributed continuous quality assurance (DCQA) tools and techniques, including our own Skoll project, begin to address these issues, novel approaches are still needed. This dissertation explores three strategies for doing so.

First, I present an improved version of our Skoll distributed quality assurance system. Skoll provides a platform for executing sophisticated, long-running QA processes across a large number of distributed, heterogeneous computing nodes. This dissertation details changes to Skoll resulting in a more robust, configurable, and user-friendly implementation for both the client and server components. Additionally, it details infrastructure developed to support the evaluation of DCQA processes using Skoll, specifically the design and deployment of a dedicated 120-node computing cluster for evaluating DCQA practices. The techniques and case studies presented in the latter parts of this work used the improved Skoll as their testbed.

Second, I present techniques for automatically classifying test execution outcomes based on an adaptive-sampling classification technique, along with a case study on the Java Architecture for Bytecode Analysis (JABA) system. One common need for these techniques is the ability to distinguish test execution outcomes (e.g., to collect only data corresponding to some behavior, or to determine how often and under which conditions a specific behavior occurs). Most current approaches, however, do not perform any kind of classification of remote executions: they either focus on easily observable behaviors (e.g., crashes) or assume that outcome classifications are externally provided (e.g., by the users). In this work, I present an empirical study on JABA in which we automatically classified execution data into passing and failing behaviors using adaptive association trees.

Finally, I present a long-term case study of the highly configurable MySQL open-source project. Real-world software systems can have configuration spaces that are too large to test exhaustively, yet nonetheless contain subtle interactions that lead to failure-inducing system faults. In the literature, covering arrays, in combination with classification techniques, have been used to sample these large configuration spaces effectively and to detect problematic configuration dependencies. Applying this approach in practice, however, is tricky because testing time and resource availability are unpredictable. We therefore developed and evaluated an alternative approach that incrementally builds covering array schedules: it begins at a low strength and then iteratively increases strength as resources allow, reusing previous test results to avoid duplicated effort. The result is test schedules that allow for successful classification with fewer test executions and that require less test-subject-specific information to develop, as in the sketch below.
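A toy version of such an incremental schedule might look like the following: cover every single-option setting first, then strengthen the schedule to cover two-way interactions, reusing configurations that have already run. The option space, the default-value completion, and the greedy fill rule are all illustrative assumptions, not the dissertation's schedule builder.

```python
# Greedy, incrementally strengthened covering-array schedule (illustrative).
from itertools import combinations, product

options = {"engine": ["innodb", "myisam"], "ssl": ["on", "off"],
           "cache": ["on", "off"]}            # hypothetical option space

def uncovered(t, executed):
    """Yield t-way (option, value) interactions not covered by any run."""
    seen = {frozenset(zip(keys, (cfg[k] for k in keys)))
            for cfg in executed for keys in combinations(sorted(options), t)}
    for keys in combinations(sorted(options), t):
        for values in product(*(options[k] for k in keys)):
            if frozenset(zip(keys, values)) not in seen:
                yield dict(zip(keys, values))

def extend_schedule(t, executed):
    for partial in uncovered(t, executed):
        # Complete the partial assignment with default (first) values.
        cfg = {k: partial.get(k, options[k][0]) for k in options}
        if cfg not in executed:
            executed.append(cfg)              # run cfg; reuse its result later
    return executed

runs = extend_schedule(1, [])                 # strength 1 first
runs = extend_schedule(2, runs)               # strengthen as resources allow
print(len(runs), "runs vs", len(list(product(*options.values()))),
      "exhaustive; the gap widens rapidly with more options")
```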
Item: DISTRIBUTED MULTIDIMENSIONAL INDEXING FOR SCIENTIFIC DATA ANALYSIS APPLICATIONS (2007-04-25)
Nam, Beomseok; Sussman, Alan; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Scientific data analysis applications require large-scale computing power to effectively service client queries, as well as large storage repositories for datasets that are generated continually from sensors and simulations. These scientific datasets grow in size every day and are becoming truly enormous. The goal of this dissertation is to provide efficient multidimensional indexing techniques that aid in navigating distributed scientific datasets, and we show significant improvements in accessing such datasets.

The first approach we took to improve access to subsets of large multidimensional scientific datasets was data chunking. The contents of scientific data files are typically a collection of multidimensional arrays, along with the corresponding metadata. Because it is not efficient to index individual data elements of large scientific datasets, data chunking groups data elements into small chunks of a fixed, but data-specific, size to take advantage of spatio-temporal locality.

The second approach was the design of an efficient multidimensional index for scientific datasets. This work investigates how existing multidimensional indexing structures perform on chunked scientific datasets, and compares their performance with that of our own indexing structure, SH-trees. Since R-trees were introduced, various multidimensional indexing structures have been proposed; however, relatively few studies have focused on improving the performance of indexing geographically distributed datasets, especially across heterogeneous machines.

As a third approach, to accelerate indexing performance for distributed datasets, we proposed several distributed multidimensional indexing schemes: replicated centralized indexing, hierarchical two-level indexing, and decentralized two-level indexing. Our experimental results show that great performance improvements are gained from distributing the multidimensional index. However, the design choices for distributed indexing, such as replication, partitioning, and decentralization, must be considered carefully, since they may decrease overall performance in certain situations. This work therefore provides performance guidelines to aid in selecting the best distributed multidimensional indexing scheme for various systems and applications. Finally, we describe how a distributed multidimensional indexing scheme can be used by a distributed multiple-query optimization middleware, as a case-study application, to generate better query plans by leveraging information about the contents of remote caches.

Item: SEARCH, REPLICATION AND GROUPING FOR UNSTRUCTURED P2P NETWORKS (2006-08-31)
Tsoumakos, Dimitrios; Roussopoulos, Nicholas; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

In my dissertation, I present a suite of protocols that assist in efficient content location and distribution in unstructured Peer-to-Peer overlays. The basis of these schemes is their ability to learn from past interactions, increasing their performance with time.

Peer-to-Peer (P2P) networks are gaining increasing attention from both the scientific community and the large Internet user community, and popular applications utilizing this technology offer many attractive features to a growing number of users. P2P systems have two basic functions: content search and dissemination. Search (or lookup) protocols define how participants locate remotely maintained resources. In data dissemination, users transmit or receive content from single or multiple sites in the network. P2P applications traditionally operate in purely decentralized and highly dynamic environments. Unstructured systems represent a particularly interesting class of P2P networks: peers form an overlay in an ad-hoc manner, without any guarantees relative to lookup performance or content availability; resources are maintained locally, while participants have limited knowledge, usually confined to their immediate neighborhood in the overlay.

My work aims at providing effective and bandwidth-efficient searching and data sharing. I present a suite of algorithms that provide peers in unstructured P2P overlays with the state necessary to efficiently locate, disseminate, and replicate objects. The Adaptive Probabilistic Search (APS) scheme utilizes directed walkers to forward queries on a hop-by-hop basis; peers store success probabilities for each of their neighbors in order to route efficiently towards object holders. AGNO performs implicit grouping of peers according to the demand incentive and utilizes state maintained by APS to route messages from content holders towards interested peers, without requiring any subscription process. Finally, the Adaptive Probabilistic REplication (APRE) scheme expands on the state that AGNO builds in order to replicate content inside query-intensive areas according to demand.
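To make the APS mechanism concrete, here is a hypothetical miniature of probability-guided walking: each peer scores its neighbors, a walker follows high scores, and scores are reinforced or decayed depending on whether the walk found the object. The topology, constants, and update rule are illustrative assumptions, not APS as specified in the dissertation.

```python
# Toy APS-style search: per-neighbor success scores guide a directed
# walker; scores are updated along the walk's path based on its outcome.
import random

class Peer:
    def __init__(self, name, objects=()):
        self.name, self.objects = name, set(objects)
        self.neighbors = {}                    # neighbor -> success score

    def link(self, other, score=0.5):
        self.neighbors[other] = score

def walk(start, obj, ttl=4):
    path, peer = [], start
    while ttl > 0 and obj not in peer.objects:
        if not peer.neighbors:
            break
        # Favor neighbors that succeeded before (small noise breaks ties).
        nxt = max(peer.neighbors,
                  key=lambda n: peer.neighbors[n] + 0.1 * random.random())
        path.append((peer, nxt))
        peer, ttl = nxt, ttl - 1
    hit = obj in peer.objects
    for p, n in path:                          # learn from this walk's outcome
        if hit:
            p.neighbors[n] = min(p.neighbors[n] + 0.2, 1.0)
        else:
            p.neighbors[n] = max(p.neighbors[n] - 0.1, 0.0)
    return hit

a, b, c = Peer("a"), Peer("b"), Peer("c", objects={"song.mp3"})
a.link(b); a.link(c); b.link(c)
for _ in range(5):                             # scores adapt over repeated queries
    walk(a, "song.mp3")
print(sorted((n.name, round(s, 2)) for n, s in a.neighbors.items()))
```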