Technical Reports from UMIACS

Permanent URI for this collection: http://hdl.handle.net/1903/7

Search Results

Now showing 1 - 10 of 36
  • Item
    Querying Very Large Multi-dimensional Datasets in ADR - Extended Abstract
    (1999-05-26) Kurc, Tahsin; Chang, Chialin; Ferreira, Renato; Sussman, Alan; Saltz, Joel
    This paper addresses optimizing the execution of range queries into multi-dimensional datasets on distributed memory parallel machines within the Active Data Repository framework. ADR is an infrastructure that integrates storage, retrieval and processing of large multi-dimensional datasets on distributed memory parallel architectures with multiple disks attached to each node. We describe three potential strategies for efficient execution of such queries that employ different tiling and workload partitioning approaches. We evaluate scalability of these strategies for different application scenarios, varying both the number of processors and the input dataset size on a 128 processor IBM SP multicomputer. (Also cross-referenced as UMIACS-TR-99-29)
  • Item
    Query Planning for Range Queries with User-defined Aggregation on Multi-dimensional Scientific Datasets
    (1999-02-23) Chang, Chialin; Kurc, Tahsin; Sussman, Alan; Saltz, Joel
    Applications that make use of very large scientific datasets have become an increasingly important subset of scientific applications. In these applications, the datasets are often multi-dimensional, i.e., data items are associated with points in a multi-dimensional attribute space. The processing is usually highly stylized, with the basic processing steps consisting of (1) retrieval of a subset of all available data in the input dataset via a range query, (2) projection of each input data item to one or more output data items, and (3) some form of aggregation of all the input data items that project to each output data item. We have developed an infrastructure, called the Active Data Repository (ADR), that integrates storage, retrieval and processing of multi-dimensional datasets on shared-nothing architectures. In this paper we address query planning and execution strategies for range queries with user-defined processing. We evaluate three potential query planning strategies within the ADR framework under several application scenarios, and present experimental results on the performance of the strategies on a multiprocessor IBM SP2. (Also cross-referenced as UMIACS-TR-99-15)
  • Item
    An Evaluation of Architectural Alternatives for Rapidly Growing Datasets, Active Disks, Clusters, SMPs
    (1998-12-08) Uysal, Mustafa; Acharya, Anurag; Saltz, Joel
    Growth and usage trends for several large datasets indicate that there is a need for architectures that scale the processing power as the dataset increases. In this paper, we evaluate three architectural alternatives for rapidly growing and frequently reprocessed datasets: active disks, clusters, and shared memory multiprocessors (SMPs). The focus of this evaluation is to identify potential bottlenecks in each of the alternative architectures and to determine the performance of these architectures for the applications of interest. We evaluate these architectural alternatives using a detailed simulator and a suite of nine applications. Our results indicate that for most of these applications Active Disk and cluster configurations were able to achieve significantly better performance than SMP configurations. Active Disk configurations were able to match (and in some cases improve upon) the performance of commodity cluster configurations. (Also cross-referenced as UMIACS-TR-98-68)
  • Item
    Performance Impact of Proxies in Data Intensive Client-Server Parallel Applications
    (1998-11-20) Beynon, Michael D.; Sussman, Alan; Saltz, Joel
    Large client-server data intensive applications can place high demands on system and network resources. This is especially true when the connection between the client and server spans a wide-area internet link. In this paper, we consider changing the typical client-server architecture of a class of data intensive applications. We show that given sufficient common interest among multiple clients, our enhancements reduce the response time per-client and reduce the amount of data sent across the wide-area link. In addition, we also see a reduction in server utilization which helps to improve server scalability as more clients are added to the system. (Also cross-referenced as UMIACS-TR-98-70)
  • Item
    Deferred Data-Flow Analysis: Algorithms, Proofs and Applications
    (1998-11-03) Sharma, Shamik D.; Acharya, Anurag; Saltz, Joel
    Loss of precision due to the conservative nature of compile-time dataflow analysis is a general problem and impacts a wide variety of optimizations. We propose a limited form of runtime dataflow analysis, called deferred dataflow analysis (DDFA), which attempts to sharpen dataflow results by using control-flow information that is available at runtime. The overheads of runtime analysis are minimized by performing the bulk of the analysis at compile-time and deferring only a summarized version of the dataflow problem to runtime. Caching and reusing of dataflow results reduces these overheads further. DDFA is an interprocedural framework and can handle arbitrary control structures including multi-way forks, recursion, separately compiled functions and higher-order functions. It is primarily targeted towards optimization of heavy-weight operations such as communication calls, where one can expect significant benefits from sharper dataflow analysis. We outline how DDFA can be used to optimize different kinds of heavy-weight operations such as bulk-prefetching on distributed systems and dynamic linking in mobile programs. We prove that DDFA is safe and that it yields better dataflow information than strictly compile-time dataflow analysis. (Also cross-referenced as UMIACS-TR-98-46)
  • Item
    Mobile Streams
    (1998-10-15) Ranganathan, M.; Acharya, Anurag; Andrey, Laurent; Schaal, Virginie; Saltz, Joel
    A large class of distributed testing, control and collaborative applications are reactive or event driven in nature. Such applications can be structured as a set of handlers that react to events and that in turn can trigger other events. We have developed an application building toolkit that facilitates development of such applications. Our system is based on the concept of Mobile Streams. Applications developed in our system are dynamically extensible and re-configurable and our system provides the application designer a means to control how the system can be extended and reconfigured. We describe our system model and implementation and compare our design to the design of other systems. (Also cross-referenced as UMIACS-TR-98-36)
  • Item
    Infrastructure for Building Parallel Database Systems for Multi-dimensional Data
    (1998-10-15) Chang, Chialin; Sussman, Alan; Saltz, Joel
    As computational power and storage capacity increase, processing and analyzing large volumes of multi-dimensional datasets play an increasingly important part in many domains of scientific research. Our study of a large set of scientific applications over the past three years indicates that the processing for such datasets is often highly stylized and shares several important characteristics. Usually, both the input dataset as well as the result being computed have underlying multi-dimensional grids. The basic processing step usually consists of transforming individual input items, mapping the transformed items to the output grid and computing output items by aggregating, in some way, all the transformed input items mapped to the corresponding grid point. In this paper, we present the design of T2, a customizable parallel database that integrates storage, retrieval and processing of multi-dimensional datasets. T2 provides support for common operations including index generation, data retrieval, memory management, scheduling of processing across a parallel machine and user interaction. It achieves its primary advantage from the ability to seamlessly integrate data retrieval and processing for a wide variety of applications and from the ability to maintain and jointly process multiple datasets with different underlying grids. We also present some preliminary performance results comparing the implementation of a remote-sensing image database using the T2 services with a custom-built integrated implementation. (Also cross-referenced as UMIACS-TR-98-24)
  • Item
    T2: A Customizable Parallel Database For Multi-dimensional Data
    (1998-10-15) Chang, Chialin; Acharya, Anurag; Sussman, Alan; Saltz, Joel
    As computational power and storage capacity increase, processing and analyzing large volumes of multi-dimensional datasets play an increasingly important part in many domains of scientific research. Several database research groups and vendors have developed object-relational database systems to provide some support for managing and/or visualizing multi-dimensional datasets. These systems, however, provide little or no support for analyzing or processing these datasets -- the assumption is that this is too application-specific to warrant common support. As a result, applications that process these datasets are usually decoupled from data storage and management, resulting in inefficiency due to copying and loss of locality. Furthermore, every application developer has to implement complex support for managing and scheduling the processing. Our study of a large set of scientific applications over the past three years indicates that the processing for such datasets is often highly stylized and shares several important characteristics. Usually, both the input dataset as well as the result being computed have underlying multi-dimensional grids. The basic processing step usually consists of transforming individual input items, mapping the transformed items to the output grid and computing output items by aggregating, in some way, all the transformed input items mapped to the corresponding grid point.
    In this paper, we present the design of T2, a customizable parallel database that integrates storage, retrieval and processing of multi-dimensional datasets. T2 provides support for common operations including index generation, data retrieval, memory management, scheduling of processing across a parallel machine and user interaction. It achieves its primary advantage from the ability to seamlessly integrate data retrieval and processing for a wide variety of applications and from the ability to maintain and jointly process multiple datasets with different underlying grids. (Also cross-referenced as UMIACS-TR-98-04)
  • Item
    Applying DEF/USE Information of Pointer Statements to Traversal-Pattern-Aware Pointer Analysis
    (1998-10-15) Hwang, Yuan-Shin; Saltz, Joel
    Pointer analysis is essential for optimizing and parallelizing compilers. It examines pointer assignment statements and estimates pointer-induced aliases among pointer variables or possible shapes of dynamic recursive data structures. However, previously proposed techniques are not able to gather useful information or have to give up further optimizations when overall recursive data structures appear to be cyclic even though patterns of traversal are linear. The reason is that these proposed techniques perform pointer analysis without the knowledge of traversal patterns of dynamic recursive data structures to be constructed. This paper proposes an approach, traversal-pattern-aware pointer analysis, that has the ability to first identify the structures specified by traversal patterns of programs from cyclic data structures and then perform analysis on the specified structures. This paper presents an algorithm to perform shape analysis on the structures specified by traversal patterns. The advantage of this approach is that if the specified structures are recognized to be acyclic, parallelization or optimizations can be applied even when overall data structures might be cyclic. The DEF/USE information of pointer statements is used to relate the identified traversal patterns to the pointer statements which build recursive data structures. (Also cross-referenced as UMIACS-TR-97-66)
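
Illustrative note: several of the ADR and T2 abstracts listed above describe the same stylized processing loop: (1) retrieve input items with a range query, (2) project each item to one or more output items, and (3) aggregate all inputs that project to the same output item. The sketch below is a minimal, sequential illustration of that loop only. It is not the ADR or T2 API; the function name `run_range_query`, its parameters, and the sample data are all hypothetical, and the tiling, partitioning, and parallel execution discussed in the reports are left out.

```python
from collections import defaultdict


def run_range_query(items, lo, hi, project, aggregate, initial):
    """Hypothetical sketch of the retrieve / project / aggregate loop.

    items     -- iterable of (point, value), point being a coordinate tuple
    lo, hi    -- inclusive corners of the multi-dimensional query box
    project   -- maps an input point to a list of output cells
    aggregate -- combines the running cell value with a new input value
    initial   -- starting value for every output cell
    """
    output = defaultdict(lambda: initial)
    for point, value in items:
        # (1) Range query: keep only items inside the query box.
        if all(l <= p <= h for l, p, h in zip(lo, point, hi)):
            # (2) Projection: one input item may map to several output cells.
            for cell in project(point):
                # (3) Aggregation: fold the value into the cell's running result.
                output[cell] = aggregate(output[cell], value)
    return dict(output)


# Made-up 2-D input: (point, value) pairs.
items = [((0, 0), 5), ((1, 1), 8), ((1, 3), 7), ((2, 2), 1), ((9, 9), 100)]

# Project onto a coarser grid (2x2 blocks) and keep the maximum per cell.
result = run_range_query(
    items,
    lo=(0, 0),
    hi=(3, 3),
    project=lambda p: [(p[0] // 2, p[1] // 2)],
    aggregate=max,
    initial=float("-inf"),
)
print(result)
```

Here the item at (9, 9) falls outside the query box and is skipped, while (0, 0) and (1, 1) land in the same output cell and are combined by the user-supplied `max`. Passing the projection and aggregation as functions mirrors the "user-defined processing" the abstracts mention, under the assumptions stated above.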