Computer Science Theses and Dissertations
Permanent URI for this collection: http://hdl.handle.net/1903/2756
Search Results: 3 items
Item: Pig Squeal: Bridging Batch and Stream Processing Using Incremental Updates (2015)
Lampton, James Holmes; Agrawala, Ashok; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
As developers shift from batch MapReduce to stream processing for better latency, they face the dilemma of changing tools and maintaining multiple code bases. In this work we present a method for converting arbitrary chains of MapReduce jobs into pipelined, incremental processes to be executed in a stream processing framework. Pig Squeal is an enhancement of the Pig execution framework that runs lightly modified user scripts on Storm. The contributions of this work include: an analysis that tracks how information flows through MapReduce computations, along with the influence of adding and deleting data from the input; a structure to generically handle these changes, along with a description of the criteria for re-enabling combiner-based efficiencies; case studies running word count and the more complex NationMind algorithms within Squeal; and a performance model that examines the execution times of MapReduce algorithms after conversion. A general solution to converting analytics from batch to streaming lets developers with expertise in batch systems apply that expertise in a new environment. Imagine a medical researcher who develops a model for predicting emergency situations in a hospital from historical data (in a batch system); these techniques would allow such detectors to be quickly deployed on live patient feeds.
It also significantly benefits organizations with large investments in batch code by providing a tool for rapid prototyping and by lowering the cost of experimenting in these new environments.

Item: Automating Performance Diagnosis in Networked Systems (2012)
McCann, Justin N.; Hicks, Michael W.; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Diagnosing performance degradation in distributed systems is a complex and difficult task. Software that performs well in one environment may be unusably slow in another, and determining the root cause is time-consuming and error-prone, even in environments where all the data may be available. End users have an even harder time diagnosing system performance, since software and network problems present the same symptom: a stalled application. The central thesis of this dissertation is that the source of performance stalls in a distributed system can be automatically detected and diagnosed with very limited information: the dependency graph of data flows through the system, and a few counters common to almost all data processing systems. This dissertation presents FlowDiagnoser, an automated approach for diagnosing performance stalls in networked systems. FlowDiagnoser requires as little as two bits of information per module to make a diagnosis: one to indicate whether the module is actively processing data, and one to indicate whether the module is waiting on its dependents. To support this thesis, FlowDiagnoser is implemented in two distinct environments: an individual host's networking stack, and a distributed streams processing system. In controlled experiments using real applications, FlowDiagnoser correctly diagnoses 99% of networking-related stalls due to application-level, connection-specific, or network-wide performance problems, with a false positive rate under 3%.
The prototype system for diagnosing messaging stalls in a commercial streams processing system correctly finds 93% of message-processing stalls, with a false positive rate of 2%.

Item: Investigating the Effects of HPC Novice Programmer Variations on Code Performance (2007-12-07)
Alameh, Rola; Basili, Victor R.; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
In this thesis, we quantitatively study the effect of High Performance Computing (HPC) novice programmers' variations in effort on the performance of the code they produce. We examine effort variation from three perspectives: total effort spent, the daily distribution of effort, and the distribution of effort across coding and debugging activities. The relationships are studied in the context of classroom studies. A qualitative study of both student effort and performance was necessary to distinguish regular patterns and to define metrics suited to the student environment and goals. Our results suggest that total effort does not correlate with performance, and that effort spent coding contributes no more to performance than effort spent debugging. In addition, we identified a daily distribution pattern of effort that correlates with performance, suggesting that subjects who distribute their workload uniformly across days, pace themselves, and minimize interruptions achieve better performance.
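The incremental-update idea described in the Pig Squeal abstract above can be illustrated with a minimal sketch: instead of recomputing a batch word count from scratch, a streaming version consumes a feed of (word, delta) updates, where a positive delta represents inserted input and a negative delta represents deleted input. The function name and stream format here are illustrative assumptions, not the thesis's actual API.

```python
from collections import defaultdict

def incremental_word_count(updates):
    """Maintain word counts over a stream of (word, delta) updates.

    delta is +1 for a record added to the input and -1 for a record
    deleted from it, mirroring the add/delete information-flow analysis
    described in the abstract (names here are hypothetical).
    """
    counts = defaultdict(int)
    for word, delta in updates:
        counts[word] += delta
        if counts[word] == 0:
            del counts[word]  # drop words whose input was fully retracted
    return dict(counts)

stream = [("pig", +1), ("storm", +1), ("pig", +1), ("storm", -1)]
print(incremental_word_count(stream))  # {'pig': 2}
```

Because each update touches only the affected key, the running result stays current without re-reading the whole input, which is the latency advantage the batch-to-stream conversion targets.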
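The two-bits-per-module scheme in the FlowDiagnoser abstract above can be sketched in a few lines: if a module is not actively processing data and is also not waiting on its dependents, it becomes a suspect for the stall. This is a simplified illustration under assumed data structures (a flat dict of status bits), not the dissertation's actual algorithm, which operates over a dependency graph of data flows.

```python
def diagnose(modules):
    """Return modules suspected of causing a stall.

    modules maps a module name to two status bits, hypothetical encoding:
      (processing, waiting) where
      processing = the module is actively processing data
      waiting    = the module is waiting on its dependents
    A module that is neither processing nor waiting is a suspect.
    """
    return [name for name, (processing, waiting) in modules.items()
            if not processing and not waiting]

status = {
    "app":    (False, True),   # stalled, but waiting on the socket layer
    "socket": (False, True),   # stalled, waiting on the network path
    "net":    (False, False),  # stalled and not waiting: likely source
}
print(diagnose(status))  # ['net']
```

The appeal of the approach is exactly this economy: distinguishing "stalled because a dependency is stalled" from "stalled on its own" requires only those two bits per module.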