Computer Science Research Works

Permanent URI for this collection: http://hdl.handle.net/1903/1593

Parallel unit propagation: Optimal speedup 3CNF Horn SAT
(2021-05-01) Vishkin, Uzi
A linear work parallel algorithm for 3CNF Horn SAT is presented, which is interesting since the problem is P-complete.
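For orientation, unit propagation on Horn clauses can be sketched serially as below. This is not the parallel algorithm of the paper, only a minimal linear-time sequential baseline. The clause encoding (a `(body, head)` pair, with `head = None` for a purely negative clause) and the name `horn_sat` are assumptions made for this illustration.

```python
from collections import deque

def horn_sat(num_vars, clauses):
    """Serial unit propagation for Horn formulas (illustrative sketch).

    Each clause is (body, head): every variable in body must be true for the
    clause to fire; head is the implied variable, or None for a purely
    negative clause. In 3CNF Horn form, len(body) + (head is not None) <= 3.
    Returns the minimal model as a list of booleans, or None if unsatisfiable.
    """
    # Number of body variables of each clause not yet known to be true.
    remaining = [len(set(body)) for body, _ in clauses]
    # watch[v] lists the clauses whose body contains variable v.
    watch = [[] for _ in range(num_vars)]
    for ci, (body, _) in enumerate(clauses):
        for v in set(body):
            watch[v].append(ci)

    value = [False] * num_vars
    queue = deque()

    def fire(ci):
        """Clause ci has a fully satisfied body; return False on a conflict."""
        head = clauses[ci][1]
        if head is None:
            return False
        if not value[head]:
            value[head] = True
            queue.append(head)
        return True

    # Clauses with an empty body are facts and fire immediately.
    for ci in range(len(clauses)):
        if remaining[ci] == 0 and not fire(ci):
            return None

    # Propagate: each newly true variable shortens the bodies that watch it.
    while queue:
        v = queue.popleft()
        for ci in watch[v]:
            remaining[ci] -= 1
            if remaining[ci] == 0 and not fire(ci):
                return None
    return value
```

Each clause body is visited once per variable it contains, so the total work is linear in the formula size; the paper's contribution concerns achieving that work bound in parallel.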
Feasibility Study of Scaling an XMT Many-Core
(2015-01-19) O'Brien, Sean; Vishkin, Uzi; Edwards, James; Waks, Edo; Yang, Bao
The reason for recent focus on communication avoidance is that high rates of data movement become infeasible due to excessive power dissipation. However, shifting the responsibility of minimizing data movement to the parallel algorithm designer comes at significant costs to programmer productivity, as well as: (i) reduced speedups and (ii) the risk of repelling application developers from adopting parallelism. The UMD Explicit Multi-Threading (XMT) framework has demonstrated advantages in ease of parallel programming through its support of PRAM-like programming, combined with strong, often unprecedented speedups. Such programming and speedups involve considerable data movement between processors and shared memory. Another reason that XMT is a good test case for a study of data movement is that XMT permits isolation and direct study of most of its data movement (and its power dissipation). Our new results demonstrate that an XMT single-chip many-core processor with tens of thousands of cores and a high-throughput network on chip is thermally feasible, though at some cost. This leads to a perhaps game-changing outcome: instead of imposing upfront strict restrictions on data movement, as advocated in a recent report from the National Academies, opt for due diligence that accounts for the full impact on cost. For example, does the increased cost due to communication avoidance (including programmer productivity, reduced speedups and desertion risk) indeed offset the cost of the solution we present? More specifically, we investigate in this paper the design of an XMT many-core for 3D VLSI with microfluidic cooling. We used state-of-the-art simulation tools to model the power and thermal properties of such an architecture with 8k to 64k lightweight cores, requiring between 2 and 8 silicon layers. Inter-chip communication using silicon-compatible photonics is also considered.
We found that, with the use of microfluidic cooling, power dissipation becomes a cost issue rather than a feasibility constraint. Robustness of the results is also discussed.
White paper: Towards a Second Line of Defense for Computer Security
(2013-04-30) Vishkin, Uzi
Much academic research on computer security has followed a perfectionist approach, seeking a system so secure that it cannot be penetrated. However, in reality security is breached and systems are penetrated. This working paper outlines some preliminary concepts for thinking about such less fortunate circumstances. It also reviews a hardware mechanism called security master for addressing them.
XMTSim: A Simulator of the XMT Many-core Architecture
(2011) Keceli, Fuat; Vishkin, Uzi
This paper documents the features and the design of XMTSim, the cycle-accurate simulator of the Explicit Multi-Threading (XMT) computer architecture. XMT is a general-purpose many-core computing platform, with the vision of a 1000-core chip that is easy to program but does not compromise on performance. XMTSim is a primary component in its publicly available toolchain along with an optimizing compiler. Research and experimentation enabled by the toolchain played a central role in supporting the ease-of-programming and performance aspects of the XMT architecture. The compiler and the simulator are also important milestones for an efficient programmer's workflow from PRAM algorithms to programs that run on the shared memory XMT hardware. This workflow is a key component in accomplishing the goal of ease-of-programming and performance. The applicability of the XMT simulator extends beyond specific XMT choices. It can be used to explore the much greater design space of shared memory many-cores by system researchers or by programmers. As the toolchain can run on practically any computer, it provides a supportive environment for teaching parallel algorithmic thinking with a programming component.
Empirical Speedup Study of Truly Parallel Data Compression
(2013-04-20) Edwards, James A.; Vishkin, Uzi
We present an empirical study of novel work-optimal parallel algorithms for Burrows-Wheeler compression and decompression of strings over a constant alphabet. To validate these theoretical algorithms, we implement them on the experimental XMT computing platform developed especially for supporting parallel algorithms at the University of Maryland. We show speedups of up to 25x for compression, and 13x for decompression, versus bzip2, the de facto standard implementation of Burrows-Wheeler compression. Unlike existing approaches, which assign an entire (e.g., 900KB) block to a processor that processes the block serially, our approach is “truly parallel” as it processes the entire input in parallel. Besides the theoretical interest in solving the “right” problem, the importance of data compression speed for small inputs even at great expense of quality (compressed size of data) is demonstrated by the introduction of Google’s Snappy for MapReduce. Perhaps surprisingly, we show feasibility of holding on to quality, while even beating Snappy on speed. In turn, this work adds new evidence in support of the XMT/PRAM thesis: that an XMT-like many-core hardware/software platform may be necessary for enabling general-purpose parallel computing. Comparison of our results to recently published work suggests 70x improvement over what current commercial parallel hardware can achieve.
Parallel Algorithms for Burrows-Wheeler Compression and Decompression
(2012-11-12) Edwards, James A.; Vishkin, Uzi
We present work-optimal PRAM algorithms for Burrows-Wheeler compression and decompression of strings over a constant alphabet. For a string of length n, the depth of the compression algorithm is O(log^2 n), and the depth of the corresponding decompression algorithm is O(log n). These appear to be the first polylogarithmic-time work-optimal parallel algorithms for any standard lossless compression scheme. The algorithms for the individual stages of compression and decompression may also be of independent interest: 1. a novel O(log n)-time, O(n)-work PRAM algorithm for Huffman decoding; 2. original insights into the stages of the BW compression and decompression problems, bringing out parallelism that was not readily apparent, allowing them to be mapped to elementary parallel routines that have O(log n)-time, O(n)-work solutions, such as: (i) prefix-sums problems with an appropriately-defined associative binary operator for several stages, and (ii) list ranking for the final stage of decompression.
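As a rough illustration of the prefix-sums routine mentioned in this abstract, the sketch below computes inclusive prefix sums under an arbitrary associative operator using the standard pairwise-recursion scheme. It is a serial simulation: each list comprehension or loop stands in for one parallel round, giving O(n) operator applications over O(log n) rounds. The function name `inclusive_scan` is an assumption for this example, and the code is not taken from the paper.

```python
def inclusive_scan(xs, op):
    """Inclusive prefix sums of xs under an associative binary operator op.

    Serial simulation of the pairwise-recursion scheme: combine adjacent
    pairs, recurse on the half-length list, then fill in the remaining
    positions. op need not be commutative; operand order is preserved.
    """
    n = len(xs)
    if n == 0:
        return []
    if n == 1:
        return [xs[0]]

    # Round 1 (parallel on a PRAM): combine adjacent pairs.
    pairs = [op(xs[2 * i], xs[2 * i + 1]) for i in range(n // 2)]

    # Recurse on the half-length problem.
    sub = inclusive_scan(pairs, op)

    # Round 2 (parallel on a PRAM): expand back to full length.
    out = [None] * n
    out[0] = xs[0]
    for i in range(1, n):
        if i % 2 == 1:
            # Odd positions are exactly the prefix over whole pairs.
            out[i] = sub[i // 2]
        else:
            # Even positions extend the previous pair prefix by one element.
            out[i] = op(sub[i // 2 - 1], xs[i])
    return out
```

For example, `inclusive_scan([1, 2, 3, 4], lambda a, b: a + b)` yields the running totals, and passing string concatenation as `op` shows that a non-commutative operator is handled in left-to-right order.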
Using Simple Abstraction to Guide the Reinvention of Computing for Parallelism
(2009-02-06) Vishkin, Uzi
The sudden shift from single-processor computer systems to many-processor parallel computing systems requires reinventing much of Computer Science (CS): how to actually build and program the new parallel systems. CS urgently requires convergence to a robust parallel general-purpose platform that provides good performance and is easy to program. Unfortunately, this same objective has eluded decades of parallel computing research. Now, continued delays and uncertainty could start affecting important sectors of the economy. This paper advocates a minimalist stepping-stone: settle first on a simple abstraction that encapsulates the new interface between programmers, on one hand, and system builders, on the other hand. This paper also makes several concrete suggestions: (i) the Immediate Concurrent Execution (ICE) abstraction as a candidate for the new abstraction, and (ii) the Explicit Multi-Threaded (XMT) general-purpose parallel platform, under development at the University of Maryland, as a possible embodiment of ICE. ICE and XMT build on a formidable body of knowledge, known as PRAM (for parallel random-access machine, or model) algorithmics, and a latent, though not widespread, familiarity with it. Ease-of-programming, strong speedups and other attractive properties of the approach suggest that we may be much better prepared for the challenges ahead than many realize.
An Immediate Concurrent Execution (ICE) Abstraction Proposal for Many-Cores
(2008-12) Vishkin, Uzi
Settling on a simple abstraction that programmers aim at, and hardware and software systems people enable and support, is an important step towards convergence to a robust many-core platform. The current paper: (i) advocates incorporating a quest for the simplest possible abstraction in the debate on the future of many-core computers, (ii) suggests “immediate concurrent execution (ICE)” as a new abstraction, and (iii) argues that an XMT architecture is one possible demonstration of ICE providing an easy-to-program general-purpose many-core platform.
