A. James Clark School of Engineering

Permanent URI for this communityhttp://hdl.handle.net/1903/1654

The collections in this community comprise faculty research works, as well as graduate theses and dissertations.

Browse

Search Results

Now showing 1 - 4 of 4

Study of Fine-Grained, Irregular Parallel Applications on a Many-Core Processor
(2020) Edwards, James Alexander; Vishkin, Uzi; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
This dissertation demonstrates the possibility of obtaining strong speedups for a variety of parallel applications versus the best serial and parallel implementations on commodity platforms. These results were obtained using the PRAM-inspired Explicit Multi-Threading (XMT) many-core computing platform, which is designed to efficiently support execution of both serial and parallel code and switching between the two. Biconnectivity: For finding the biconnected components of a graph, we demonstrate speedups of 9x to 33x on XMT relative to the best serial algorithm using a relatively modest silicon budget. Further evidence suggests that speedups of 21x to 48x are possible. For graph connectivity, we demonstrate that XMT outperforms two contemporary NVIDIA GPUs of similar or greater silicon area. Prior studies of parallel biconnectivity algorithms achieved at most a 4x speedup, but we could not find biconnectivity code for GPUs to compare biconnectivity against them. Triconnectivity: We present a parallel solution to the problem of determining the triconnected components of an undirected graph. We obtain significant speedups on XMT over the only published optimal (linear-time) serial implementation of a triconnected components algorithm running on a modern CPU. To our knowledge, no other parallel implementation of a triconnected components algorithm has been published for any platform. Burrows-Wheeler compression: We present novel work-optimal parallel algorithms for Burrows-Wheeler compression and decompression of strings over a constant alphabet and their empirical evaluation. To validate these theoretical algorithms, we implement them on XMT and show speedups of up to 25x for compression, and 13x for decompression, versus bzip2, the de facto standard implementation of Burrows-Wheeler compression. Fast Fourier transform (FFT): Using FFT as an example, we examine the impact that adoption of some enabling technologies, including silicon photonics, would have on the performance of a many-core architecture. The results show that a single-chip many-core processor could potentially outperform a large high-performance computing cluster. Boosted decision trees: This chapter focuses on the hybrid memory architecture of the XMT computer platform, a key part of which is a flexible all-to-all interconnection network that connects processors to shared memory modules. First, to understand some recent advances in GPU memory architecture and how they relate to this hybrid memory architecture, we use microbenchmarks including list ranking. Then, we contrast the scalability of applications with that of routines. In particular, regardless of the scalability needs of full applications, some routines may involve smaller problem sizes, and in particular smaller levels of parallelism, perhaps even serial. To see how a hybrid memory architecture can benefit such applications, we simulate a computer with such an architecture and demonstrate the potential for a speedup of 3.3X over NVIDIA's most powerful GPU to date for XGBoost, an implementation of boosted decision trees, a timely machine learning approach. Boolean satisfiability (SAT): SAT is an important performance-hungry problem with applications in many problem domains. However, most work on parallelizing SAT solvers has focused on coarse-grained, mostly embarrassing parallelism. Here, we study fine-grained parallelism that can speed up existing sequential SAT solvers. We show the potential for speedups of up to 382X across a variety of problem instances. We hope that these results will stimulate future research.
Can Cooling Technology Save Many-Core Parallel Programming from Its Programming Woes?
(2015-10-16) O'Brien, Sean; Vishkin, Uzi; Edwards, James; Waks, Edo; Yang, Bao
This paper is advancing the following premise (henceforth, "vision"): that it is feasible to greatly enhance data movement in the short term, and do it in ways that would be both power efficient and pragmatic in the long term. The paper spells this premise out in greater detail: 1. it is feasible to build first generations of a variety of (power-inefficient) designs for which data movement will not be a restriction and begin application software development for them; 2. growing reliance on silicon compatible photonic technologies, and feasible advances in them with proper investment, will allow reduction of power consumption in these design by several orders of magnitude; 3. successful high performance application software, the ease of programming demonstrated and growing adoption by customers, software vendors and programmers will incentivize (hardware vendor) investment in new application-software-compatible generations of these designs (a new "software spiral" a la former Intel CEO, Andy Grove) with further reduction of power consumption in each generation; 4. microfluidic cooling is instrumental for enabling item 1, as well as for midwifing this overall vision. The opening paragraph of the paper provides a preamble to that vision, the body of the paper supports it and the paragraph "Moore's-Law-type vision" summarizes it. The scope of the paper is a bit forward looking and it may not exactly fit any particular community. However, its new directions for interaction among architecture and programming may suggest new horizons for representing and exposing a greater variety of data and task parallelism.
Feasibility Study of Scaling an XMT Many-Core
(2015-01-19) O'Brien, Sean; Vishkin, Uzi; Edwards, James; Waks, Edo; Yang, Bao
The reason for recent focus on communication avoidance is that high rates of data movement become infeasible due to excessive power dissipation. However, shifting the responsibility of minimizing data movement to the parallel algorithm designer comes at significant costs to programmer’s productivity, as well as: (i) reduced speedups and (ii) the risk of repelling application developers from adopting parallelism. The UMD Explicit Multi-Threading (XMT) framework has demonstrated advantages on ease of parallel programming through its support of PRAM-like programming, combined with strong, often unprecedented speedups. Such programming and speedups involve considerable data movement between processors and shared memory. Another reason that XMT is a good test case for a study of data movement is that XMT permits isolation and direct study of most of its data movement (and its power dissipation). Our new results demonstrate that an XMT single-chip many-core processor with tens of thousands of cores and a high throughput network on chip is thermally feasible, though at some cost. This leads to a perhaps game-changing outcome: instead of imposing upfront strict restrictions on data movement, as advocated in a recent report from the National Academies, opt for due diligence that accounts for the full impact on cost. For example, does the increased cost due to communication avoidance (including programmer’s productivity, reduced speedups and desertion risk) indeed offset the cost of the solution we present? More specifically, we investigate in this paper the design of an XMT many-core for 3D VLSI with microfluidic cooling. We used state-of-the-art simulation tools to model the power and thermal properties of such an architecture with 8k to 64k lightweight cores, requiring between 2 and 8 silicon layers. Inter-chip communication using silicon compatible photonics is also considered. We found that, with the use of microfluidic cooling, power dissipation becomes a cost issue rather than a feasibility constraint. Robustness of the results is also discussed.
Using Simple Abstraction to Guide the Reinvention of Computing for Parallelism
(2009-02-06) Vishkin, Uzi
The sudden shift from single-processor computer systems to many-processor parallel computing systems requires reinventing much of Computer Science (CS): how to actually build and program the new parallel systems. CS urgently requires convergence to a robust parallel general-purpose platform that provides good performance and is easy to program. Unfortunately, this same objective has eluded decades of parallel computing research. Now, continued delays and uncertainty could start affecting important sectors of the economy. This paper advocates a minimalist stepping-stone: settle first on a simple abstraction that encapsulates the new interface between programmers, on one hand, and system builders, on the other hand. This paper also makes several concrete suggestions: (i) the Immediate Concurrent Execution (ICE) abstraction as a candidate for the new abstraction, and (ii) the Explicit Multi-Threaded (XMT) general-purpose parallel platform, under development at the University of Maryland, as a possible embodiment of ICE. ICE and XMT build on a formidable body of knowledge, known as PRAM (for parallel random-access machine, or model) algorithmics, and a latent, though not widespread, familiarity with it. Ease-of-programming, strong speedups and other attractive properties of the approach suggest that we may be much better prepared for the challenges ahead than many realize.

A. James Clark School of Engineering

Browse

Filters

Settings

Sort By

Results per page

Search Results