Minimizing Communication while Preserving Parallelism
Date
1998-10-15
Authors
Kelly, Wayne
Pugh, William
Abstract
To compile programs for message passing architectures and to
obtain good performance on NUMA architectures it is necessary to control
how computations and data are mapped to processors. Languages such as
High-Performance Fortran use data distributions supplied by the
programmer and the owner computes rule to specify this. However, the best
data and computation decomposition may differ from machine to machine and
require substantial expertise to determine. Therefore, automated
decomposition is desirable.
All existing methods for automated data/computation decomposition
share a common failing: they are very sensitive to the original loop
structure of the program. While they find a good decomposition for that
loop structure, it may be possible to apply transformations (such as loop
interchange and distribution) so that a different decomposition gives
even better results. We have developed automatic data/computation
decomposition methods that are not sensitive to the original program
structure. We can model static and dynamic data decompositions as well as
computation decompositions that cannot be represented by data
decompositions and the owner computes rule. We make use of both parallel
loops and doacross/pipelined loops to exploit parallelism.
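The owner-computes rule mentioned above can be illustrated with a small, hypothetical sketch: under a 1-D block distribution, the iteration that assigns an array element runs on the processor that owns that element. The array size, processor count, and function names below are illustrative, not taken from the report.

```python
# Hypothetical sketch of the owner-computes rule for a 1-D block
# distribution of an array A of size N over P processors.
N, P = 16, 4
block = N // P

def owner(i):
    # Processor that owns element A[i] under a block distribution.
    return i // block

def iterations_for(p):
    # Under owner-computes, processor p executes exactly the iterations
    # whose left-hand-side element A[i] it owns.
    return [i for i in range(N) if owner(i) == p]
```

A decomposition chosen this way is fixed by the data distribution; the report's point is that the best computation decomposition may not be expressible through any data distribution plus this rule.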
We describe an automated translation of the decomposition problem
into a weighted graph that incorporates estimates of both parallelism and
communication for various candidate computation decompositions. We solve
the resulting search problem exactly in a very short time using a new
algorithm that has been shown to prune away a majority of the vast
search space. We assume that the selection of the computation
decomposition is followed by a transformation phase that reorders the
iterations to best match the selected computation decomposition. Our
graph includes constraints to ensure that a reordering transformation
giving the predicted parallelism exists.
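The exact search with pruning described above can be sketched as a small branch-and-bound over candidate decompositions: each statement is assigned one candidate, edge weights estimate the communication cost between the choices for pairs of statements, and a partial assignment is abandoned as soon as its cost reaches the best complete cost found so far. The function names, cost model, and candidates here are illustrative assumptions, not the report's actual graph formulation.

```python
def solve(n_stmts, candidates, edge_cost):
    """Exact branch-and-bound sketch: assign each of n_stmts statements one
    candidate decomposition, minimizing the total estimated communication
    cost edge_cost(t, s, choice_t, choice_s) summed over statement pairs."""
    best = [float("inf"), None]  # [best cost, best assignment]

    def extend(assign, cost):
        if cost >= best[0]:
            return  # prune: this partial assignment cannot beat the best
        s = len(assign)
        if s == n_stmts:
            best[0], best[1] = cost, list(assign)
            return
        for c in candidates:
            # cost of edges between statement s and all earlier statements
            delta = sum(edge_cost(t, s, assign[t], c) for t in range(s))
            extend(assign + [c], cost + delta)

    extend([], 0)
    return best[1], best[0]
```

With a toy cost model that charges 1 unit whenever two statements pick different decompositions, the search immediately finds a zero-cost uniform assignment and prunes the rest of the space.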
(Also cross-referenced as UMIACS-TR-95-118)