Tech Reports in Computer Science and Engineering
Permanent URI for this communityhttp://hdl.handle.net/1903/5
The technical reports collections in this community are deposited by the Library of the Computer Science department. If you have questions about these collections, please contact library staff at library@cs.umd.edu
Browse
37 results
Search Results
Item Using Content-Addressable Networks for Load Balancing in Desktop Grids(2007-03-29) Kim, Jik-Soo; Keleher, Peter; Marsh, Michael; Bhattacharjee, Bobby; Sussman, AlanDesktop grids combine Peer-to-Peer and Grid computing techniques to improve the robustness, reliability and scalability of job execution infrastructures. However, efficiently matching incoming jobs to available system resources and achieving good load balance in a fully decentralized and heterogeneous computing environment is a challenging problem. In this paper, we extend our prior work with a new decentralized algorithm for maintaining approximate global load information, and a job pushing mechanism that uses the global information to push jobs towards underutilized portions of the system. The resulting system more effectively balances load and improves overall system throughput. Through a comparative analysis of experimental results across different system configurations and job profiles, performed via simulation, we show that our system can reliably execute Grid applications on a distributed set of resources both with low cost and with good load balance.Item Matching Jobs to Resources in Distributed Desktop Grid Environments(2006-04) Kim, Jik-Soo; Bhattacharjee, Bobby; Keleher, Peter J.; Sussman, AlanDesktop grids use opportunistic sharing to exploit large collections of personal computers and workstations across the Internet and can achieve tremendous computing power with low cost. However, current systems are typically based on a traditional client-server architecture, which has inherent shortcomings with respect to robustness, reliability and scalability. In this paper, we propose a decentralized, robust, highly available, and scalable infrastructure to match incoming jobs to available resources. The key idea behind our proposed system is to leverage information provided by an underlying peer-to-peer system to create a hierarchical Rendezvous Node Tree, which performs the matching efficiently. Our experimental results obtained via simulation show that we can effectively match jobs with varying levels of resource constraints to available nodes and maintain good load balance in a fully decentralized heterogeneous computational environment.Item Multiple Range Query Optimization with Distributed Cache Indexing(2006-04-17) Nam, Beomseok; Sussman, AlanMQO is a distributed multiple query processing middleware that can optimize query processing for data analysis applications on the Grid. It has one or more proxies that act as front-end to a collection of backend servers. The basic idea behind this architecture is semantic caching, whereby queries can leverage available cached results in the proxy either directly or through transformations. While this approach has been shown to speed up query evaluation under multi-client workloads, the caching infrastructure in the backend servers is not well used for query planning. In this paper, we describe a distributed multidimensional indexing scheme that enables the proxy to directly consider the cache contents available at the backend servers for planning and scheduling. This approach is shown to produce better query plans and faster query response times. We experimentally demonstrate that system throughput can be improved up to 66%, compared to either load-based or round-robin scheduling.Item Indexing Cached Multidimensional Objects in Large Main Memory Systems(2006-06-05) Nam, Beomseok; Sussman, AlanSemantic caches allow queries into large datasets to leverage cached results either directly or through transformations, using semantic information about the data objects in the cache. As the price of main memory continues to drop and its size increases, the size of semantic caches grows proportionately, and it is becoming expensive to compare the semantic information for each data object in the cache against a query predicate. Instead, we propose to create an index for cached objects. Unlike straightforward linear scanning, indexing cached objects creates additional overhead for cache replacement. Since the contents of a semantic cache may change dynamically at a high rate, the cache index must support fast inserts and deletes as well as fast search. In this paper, we show that multidimensional indexing helps navigate efficiently through a large semantic cache in spite of the additional overhead and overall is considerably less expensive than linear scanning. Little emphasis has been laid upon the performance of multidimensional index inserts and deletes, as opposed to search performance. We compare the performance of a few widely used multidimensional indexing structures with our SH-tree, looking at insert, delete, and search operations, and show that SH-trees overall perform better for large semantic caches than the widely used indexing techniques.Item Time and Space Optimization for Processing Groups of Multi-Dimensional Scientific Queries(2004-04-19) Aryangat, Suresh; Andrade, Henrique; Sussman, AlanData analysis applications in areas as diverse as remote sensing and telepathology require operating on and processing very large datasets. For such applications to execute efficiently, careful attention must be paid to the storage, retrieval, and manipulation of the datasets. This paper addresses the optimizations performed by a high performance database system that processes groups of data analysis requests for these applications, which we call queries. The system performs end-to-end processing of the requests, formulated as PostgreSQL declarative queries. The queries are converted into imperative descriptions, multiple imperative descriptions are merged into a single execution plan, the plan is optimized to decrease execution time via common compiler optimization techniques, and, finally, the plan is optimized to decrease memory consumption. The last two steps effectively reduce both the time and space to execute query groups, as shown in the experimental results. (UMIACS-TR-2004-14)Item Efficient Communication Between Parallel Programs with InterComm(2004-02-25) Lee, Jae-Yong; Sussman, AlanWe present the design and implementation of InterComm, a framework to couple parallel components that enables efficient communication in the presence of complex data distributions within a coupled application. Multiple parallel libraries and languages may be used in different modules of a single application. The ability to couple such modules is required in many emerging application areas, such as complex simulations that model physical phenomena at multiple scales and resolutions, and remote sensing image data analysis applications. The complexity of the communication algorithms is highly dependent on the distribution of data across the processes in a distributed memory parallel program. We classify the distributions into two types - one that represents a data distribution in a compact way so that the distribution information can be replicated, and one that explicitly describes the location of each data element, so can be very large, requiring that the distribution information be distributed across processes as is the data. InterComm builds on our previous work on the Meta-Chaos program coupling framework. In that work, we showed that the steps required to perform the data exchanges include locating the data to be transferred within the local memories of each program, generating communication schedules (the patterns of interprocessor communication) for all processes, and transferring the data using the schedules. In this paper we describe the new algorithms we have developed, and show how those algorithms greatly improve on the Meta-Chaos algorithms by reducing the high cost of building the communication schedules. We present experimental results showing the performance of various algorithmic tradeoffs, and also compare against the original Meta-Chaos implementation. (UMIACS-TR-2004-04)Item A Comparative Study of Spatial Indexing Techniques for Multidimensional Scientific Datasets(2004-01-29) Nam, Beomseok; Sussman, AlanScientific applications that query into very large multidimensional datasets are becoming more common. These datasets are growing in size every day, and are becoming truly enormous, making it infeasible to index individual data elements. We have instead been experimenting with {\em chunking} the datasets to index them, grouping data elements into small chunks of a fixed, but dataset-specific, size to take advantage of spatial locality. While spatial indexing structures based on R-trees perform reasonably well for the rectangular bounding boxes of such chunked datasets, other indexing structures based on KDB-trees, such as Hybrid trees, have been shown to perform very well for point data. In this paper, we investigate how all these indexing structures perform for multidimensional scientific datasets, and compare their features and performance with that of {\bf SH-trees}, an extension of Hybrid trees, for indexing multidimensional rectangles. Our experimental results show that the algorithms for building and searching SH-trees outperform those for R-trees, R*-trees, and X-trees for both real application and synthetic datasets and queries. We show that the SH-tree algorithms perform well for both low and high dimensional data, and that they scale well to high dimensions both for building and searching the trees. (UMIACS-TR-2004-03)Item Efficient Execution of Multi-Query Data Analysis Batches Using Compiler Optimization Strategies(2003-08-01) Andrade, Henrique; Aryangat, Suresh; Kurc, Tahsin; Saltz, Joel; Sussman, AlanThis work investigates the leverage that can be obtained from compiler optimization techniques for efficient execution of multi-query workloads in data analysis applications. Our approach is to address multi-query optimization at the algorithmic level by transforming a declarative specification of scientific data analysis queries into a high-level imperative program that can be made more efficient by applying compiler optimization techniques. These techniques -- including loop fusion, common subexpression elimination and dead code elimination -- are employed to allow data and computation reuse across queries. We describe a preliminary experimental analysis on a real remote sensing application that is used to analyze very large quantities of satellite data. The results show our techniques achieve sizable reduction in the amount of computation and I/O necessary for executing query batches and in average executing times for the individual queries in a given batch. (UMIACS-TR-2003-76)Item Improving Access to Multi-dimensional Self-describing Scientific Datasets(2002-11-08) Nam, Beomseok; Sussman, AlanApplications that query into very large multi-dimensional datasets are becoming more common. Many self-describing scientific data file formats have also emerged, which have structural metadata to help navigate the multi-dimensional arrays that are stored in the files. The files may also contain application-specific semantic metadata. In this paper, we discuss efficient methods for performing searches for subsets of multi-dimensional data objects, using semantic information to build multi-dimensional indexes, and group data items into properly sized chunks to maximize disk I/O bandwidth. This work is the first step in the design and implementation of a generic indexing library that will work with various high-dimension scientific data file formats containing semantic information about the stored data. To validate the approach, we have implemented indexing structures for NASA remote sensing data stored in the HDF format with a specific schema (HDF-EOS), and show the performance improvements that are gained from indexing the datasets, compared to using the existing HDF library for accessing the data. (UMIACS-TR-2002-99)Item The Virtual Microscope(2002-10-07) Catalyurek, Umit; Beynon, Michael D.; Chang, Chialin; Kurc, Tahsin; Sussman, Alan; Saltz, JoelWe present the design and implementation of the Virtual Microscope, a software system employing a client/server architecture to provide a realistic emulation of a high power light microscope. The system provides a form of completely digital telepathology, allowing simultaneous access to archived digital slide images by multiple clients. The main problem the system targets is storing and processing the extremely large quantities of data required to represent a collection of slides. The Virtual Microscope client software runs on the end user's PC or workstation, while database software for storing, retrieving and processing the microscope image data runs on a parallel computer or on a set of workstations at one or more potentially remote sites. We have designed and implemented two versions of the data server software. One implementation is a customization of a database system framework that is optimized for a tightly coupled parallel machine with attached local disks. The second implementation is component-based, and has been designed to accommodate access to and processing of data in a distributed, heterogeneous environment. We also have developed caching client software, implemented in Java, to achieve good response time and portability across different computer platforms. The performance results presented show that the Virtual Microscope systems scales well, so that many clients can be adequately serviced by an appropriately configured data server. (Also UMIACS-TR-2002-85)