A. James Clark School of Engineering

Permanent URI for this communityhttp://hdl.handle.net/1903/1654

The collections in this community comprise faculty research works, as well as graduate theses and dissertations.

Browse

Search Results

Now showing 1 - 7 of 7

Study of Fine-Grained, Irregular Parallel Applications on a Many-Core Processor
(2020) Edwards, James Alexander; Vishkin, Uzi; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
This dissertation demonstrates the possibility of obtaining strong speedups for a variety of parallel applications versus the best serial and parallel implementations on commodity platforms. These results were obtained using the PRAM-inspired Explicit Multi-Threading (XMT) many-core computing platform, which is designed to efficiently support execution of both serial and parallel code and switching between the two. Biconnectivity: For finding the biconnected components of a graph, we demonstrate speedups of 9x to 33x on XMT relative to the best serial algorithm using a relatively modest silicon budget. Further evidence suggests that speedups of 21x to 48x are possible. For graph connectivity, we demonstrate that XMT outperforms two contemporary NVIDIA GPUs of similar or greater silicon area. Prior studies of parallel biconnectivity algorithms achieved at most a 4x speedup, but we could not find biconnectivity code for GPUs to compare biconnectivity against them. Triconnectivity: We present a parallel solution to the problem of determining the triconnected components of an undirected graph. We obtain significant speedups on XMT over the only published optimal (linear-time) serial implementation of a triconnected components algorithm running on a modern CPU. To our knowledge, no other parallel implementation of a triconnected components algorithm has been published for any platform. Burrows-Wheeler compression: We present novel work-optimal parallel algorithms for Burrows-Wheeler compression and decompression of strings over a constant alphabet and their empirical evaluation. To validate these theoretical algorithms, we implement them on XMT and show speedups of up to 25x for compression, and 13x for decompression, versus bzip2, the de facto standard implementation of Burrows-Wheeler compression. Fast Fourier transform (FFT): Using FFT as an example, we examine the impact that adoption of some enabling technologies, including silicon photonics, would have on the performance of a many-core architecture. The results show that a single-chip many-core processor could potentially outperform a large high-performance computing cluster. Boosted decision trees: This chapter focuses on the hybrid memory architecture of the XMT computer platform, a key part of which is a flexible all-to-all interconnection network that connects processors to shared memory modules. First, to understand some recent advances in GPU memory architecture and how they relate to this hybrid memory architecture, we use microbenchmarks including list ranking. Then, we contrast the scalability of applications with that of routines. In particular, regardless of the scalability needs of full applications, some routines may involve smaller problem sizes, and in particular smaller levels of parallelism, perhaps even serial. To see how a hybrid memory architecture can benefit such applications, we simulate a computer with such an architecture and demonstrate the potential for a speedup of 3.3X over NVIDIA's most powerful GPU to date for XGBoost, an implementation of boosted decision trees, a timely machine learning approach. Boolean satisfiability (SAT): SAT is an important performance-hungry problem with applications in many problem domains. However, most work on parallelizing SAT solvers has focused on coarse-grained, mostly embarrassing parallelism. Here, we study fine-grained parallelism that can speed up existing sequential SAT solvers. We show the potential for speedups of up to 382X across a variety of problem instances. We hope that these results will stimulate future research.
Easy PRAM-based High-performance parallel Programming
(2016) Ghanim, Fady; Barua, Rajeev; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Parallel machines have become more widely used. Unfortunately parallel programming technologies have advanced at a much slower pace except for regular programs. For irregular programs, this advancement is inhibited by high synchronization costs, non-loop parallelism, non-array data structures, recursively expressed parallelism and parallelism that is too fine-grained to be exploitable. This work introduced ICE, a new parallel programming language that is easy-to-program, since: (i) ICE is a synchronous, lock-step language; (ii) for a PRAM algorithm its ICE program amounts to directly transcribing it; and (iii) the PRAM algorithmic theory offers unique wealth of parallel algorithms and techniques. This work suggests that ICE be a part of an ecosystem consisting of the XMT architecture, the PRAM algorithmic model, and ICE itself, that together deliver on the twin goal of easy programming and efficient parallelization of irregular programs. The XMT architecture, developed at UMD, can exploit fine-grained parallelism in irregular programs. This work also presents the ICE compiler which translates the ICE language into the multithreaded XMTC language; the significance of this is that multi-threading is a feature shared by practically all current scalable parallel programming languages. As one indication of ease of programming, it was observed a reduction in code size in 11 out of 16 benchmarks vs. XMTC. For these programs, the average reduction in number of lines of code was 35.53% when compared to hand optimized XMTC The remaining 4 benchmarks had the same code size. The ICE compiler achieved comparable run-time to XMTC with a 0.48% average gain for ICE across all benchmarks.
Easy PRAM-based High-performance Parallel Programming with ICE
(2016-08-31) Ghanim, Fady; Barua, Rajeev; Vishkin, Uzi
Parallel machines have become more widely used. Unfortunately parallel programming technologies have advanced at a much slower pace except for regular programs. For irregular programs, this advancement is inhibited by high synchronization costs, non-loop parallelism, non-array data structures, recursively expressed parallelism and parallelism that is too fine-grained to be exploitable. We present ICE, a new parallel programming language that is easy-to-program, since: (i) ICE is a synchronous, lock-step language; (ii) for a PRAM algorithm its ICE program amounts to directly transcribing it; and (iii) the PRAM algorithmic theory offers unique wealth of parallel algorithms and techniques. We propose ICE to be a part of an ecosystem consisting of the XMT architecture, the PRAM algorithmic model, and ICE itself, that together deliver on the twin goal of easy programming and efficient parallelization of irregular programs. The XMT architecture, developed at UMD, can exploit fine-grained parallelism in irregular programs. We built the ICE compiler which translates the ICE language into the multithreaded XMTC language; the significance of this is that multi-threading is a feature shared by practically all current scalable parallel programming languages. As one indication of ease of programming, we observed a reduction in code size in 7 out of 11 benchmarks vs. XMTC. For these programs, the average reduction in number of lines of code was when compared to hand optimized XMTC The remaining 4 benchmarks had the same code size. Our main result is perhaps surprising: The run-time was comparable to XMTC with a 0.76% average gain for ICE across all benchmarks.
Feasibility Study of Scaling an XMT Many-Core
(2015-01-19) O'Brien, Sean; Vishkin, Uzi; Edwards, James; Waks, Edo; Yang, Bao
The reason for recent focus on communication avoidance is that high rates of data movement become infeasible due to excessive power dissipation. However, shifting the responsibility of minimizing data movement to the parallel algorithm designer comes at significant costs to programmer’s productivity, as well as: (i) reduced speedups and (ii) the risk of repelling application developers from adopting parallelism. The UMD Explicit Multi-Threading (XMT) framework has demonstrated advantages on ease of parallel programming through its support of PRAM-like programming, combined with strong, often unprecedented speedups. Such programming and speedups involve considerable data movement between processors and shared memory. Another reason that XMT is a good test case for a study of data movement is that XMT permits isolation and direct study of most of its data movement (and its power dissipation). Our new results demonstrate that an XMT single-chip many-core processor with tens of thousands of cores and a high throughput network on chip is thermally feasible, though at some cost. This leads to a perhaps game-changing outcome: instead of imposing upfront strict restrictions on data movement, as advocated in a recent report from the National Academies, opt for due diligence that accounts for the full impact on cost. For example, does the increased cost due to communication avoidance (including programmer’s productivity, reduced speedups and desertion risk) indeed offset the cost of the solution we present? More specifically, we investigate in this paper the design of an XMT many-core for 3D VLSI with microfluidic cooling. We used state-of-the-art simulation tools to model the power and thermal properties of such an architecture with 8k to 64k lightweight cores, requiring between 2 and 8 silicon layers. Inter-chip communication using silicon compatible photonics is also considered. We found that, with the use of microfluidic cooling, power dissipation becomes a cost issue rather than a feasibility constraint. Robustness of the results is also discussed.
Empirical Speedup Study of Truly Parallel Data Compression
(2013-04-20) Edwards, James A.; Vishkin, Uzi
We present an empirical study of novel work-optimal parallel algorithms for Burrows-Wheeler compression and decompression of strings over a constant alphabet. To validate these theoretical algorithms, we implement them on the experimental XMT computing platform developed especially for supporting parallel algorithms at the University of Maryland. We show speedups of up to 25x for compression, and 13x for decompression, versus bzip2, the de facto standard implementation of Burrows-Wheeler compression. Unlike existing approaches, which assign an entire (e.g., 900KB) block to a processor that processes the block serially, our approach is “truly parallel” as it processes in parallel the entire input. Besides the theoretical interest in solving the “right” problem, the importance of data compression speed for small inputs even at great expense of quality (compressed size of data) is demonstrated by the introduction of Google’s Snappy for MapReduce. Perhaps surprisingly, we show feasibility of holding on to quality, while even beating Snappy on speed. In turn, this work adds new evidence in support of the XMT/PRAM thesis: that an XMT-like many-core hardware/ software platform may be necessary for enabling general-purpose parallel computing. Comparison of our results to recently published work suggests 70x improvement over what current commercial parallel hardware can achieve.
Parallel Algorithms for Burrows-Wheeler Compression and Decompression
(2012-11-12) Edwards, James A.; Vishkin, Uzi
We present work-optimal PRAM algorithms for Burrows-Wheeler compression and decompression of strings over a constant alphabet. For a string of length n, the depth of the compression algorithm is O(log2 n), and the depth of the the corresponding decompression algorithm is O(log n). These appear to be the first polylogarithmic-time work-optimal parallel algorithms for any standard lossless compression scheme. The algorithms for the individual stages of compression and decompression may also be of independent interest: 1. a novel O(log n)-time, O(n)-work PRAM algorithm for Huffman decoding; 2. original insights into the stages of the BW compression and decompression problems, bringing out parallelism that was not readily apparent, allowing them to be mapped to elementary parallel routines that have O(log n)-time, O(n)-work solutions, such as: (i) prefix-sums problems with an appropriately-defined associative binary operator for several stages, and (ii) list ranking for the final stage of decompression.
Hardware Design, Prototyping and Studies of the Explicit Multi-Threading (XMT) Paradigm
(2008-08-06) Wen, Xingzhi; Vishkin, Uzi; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
With the end of exponential performance improvements in sequential computers, parallel computers, dubbed "chip multiprocessor", "multicore", or "manycore", has been introduced. Unfortunately, programming current parallel computers tends to be far more difficult than programming sequential computers. The Parallel Random Access Model (PRAM) is known to be an easy-to-program parallel computer model and has been widely used by theorists to develop parallel algorithms because it abstracts away architecture details and allows algorithm designers to focus on critical issues. The eXplicit Multi-Threading (XMT) PRAM-On-Chip project seeks to build an easy-to-program on-chip parallel processor by supporting a PRAM-like programming (performance) model. This dissertation focuses on the design, study of the micro-architecture of the XMT processor as well as performance optimization. The main contributions are:(1) Presented a scalable micro-architecture of the XMT based on high level description of the architecture. (2) Designed a synthesizable Verilog HDL (hardware design language) description of XMT, which lead to the first commitment to the silicon of the XMT processor, a 75 MHz XMT FPGA computer. With the same design, we expect to see the first XMT ASIC processor using IBM 90nm technology. (3) Proposed and implemented some architecture upgrades to the XMT: (i)value broadcasting, (ii)hardware/software co-managed prefetch buffers and (iii) hardware/software co-managed read-only buffers. (4) Quantitatively studied the performance of XMT using non-trivial application kernels with the 75 MHz XMT FPGA computer, in addition, the performance of a 800MHz XMT processor is projected. (5) The choice of not having local private caches in the XMT architecture is studied by comparing current architecture with an alternative one that includes conventional coherent private caches.

A. James Clark School of Engineering

Browse

Filters

Settings

Sort By

Results per page

Search Results