Scalable Fast Multipole Methods on Heterogeneous Architecture
Date
2013
Authors
Hu, Qi
Advisor
Duraiswami, Ramani
Gumerov, Nail A.
Abstract
The N-body problem appears in many computational physics simulations. At each time step, the computation involves an all-pairs sum whose complexity is quadratic in the number of particles, followed by an update of particle positions. This cost makes it impractical to solve such dynamic N-body problems at large scale. To improve this situation, we use both algorithmic and hardware approaches. Our algorithmic approach is the Fast Multipole Method (FMM), a divide-and-conquer algorithm that performs a fast N-body sum using a spatial decomposition and is often used inside a time-stepping or iterative loop; it reduces the quadratic complexity to linear with guaranteed accuracy. Our hardware approach is to use heterogeneous clusters, composed of nodes that contain multi-core CPUs tightly coupled with accelerators such as graphics processing units (GPUs), as the underlying parallel processing hardware, on which efficient implementations require highly non-trivial, redesigned algorithms.
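To make the quadratic cost concrete, the following is a minimal CUDA sketch (not taken from the dissertation; the kernel name, problem size, and Laplace-type interaction are illustrative assumptions) of the direct all-pairs sum that the FMM replaces: each of the N targets loops over all N sources, giving O(N^2) work per time step.

// Minimal sketch of the direct O(N^2) all-pairs sum for a Laplace-type kernel.
// Names (directSum) and the problem size n are illustrative assumptions only.
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

__global__ void directSum(const float4* pos, float* phi, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 pi = pos[i];
    float sum = 0.0f;
    for (int j = 0; j < n; ++j) {            // O(N) work per target -> O(N^2) total
        if (j == i) continue;
        float4 pj = pos[j];
        float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
        sum += pj.w / sqrtf(dx * dx + dy * dy + dz * dz);  // pj.w = source strength
    }
    phi[i] = sum;
}

int main() {
    const int n = 1 << 14;                   // illustrative problem size
    std::vector<float4> h_pos(n);
    for (int i = 0; i < n; ++i)
        h_pos[i] = make_float4(rand() / (float)RAND_MAX,
                               rand() / (float)RAND_MAX,
                               rand() / (float)RAND_MAX, 1.0f);

    float4* d_pos; float* d_phi;
    cudaMalloc(&d_pos, n * sizeof(float4));
    cudaMalloc(&d_phi, n * sizeof(float));
    cudaMemcpy(d_pos, h_pos.data(), n * sizeof(float4), cudaMemcpyHostToDevice);

    directSum<<<(n + 255) / 256, 256>>>(d_pos, d_phi, n);
    cudaDeviceSynchronize();

    std::vector<float> h_phi(n);
    cudaMemcpy(h_phi.data(), d_phi, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("phi[0] = %f\n", h_phi[0]);

    cudaFree(d_pos);
    cudaFree(d_phi);
    return 0;
}

The FMM avoids this quadratic loop by grouping well-separated sources into multipole expansions over an octree, so each target interacts with O(1) expansions per level rather than with every source.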
In this dissertation, we fundamentally reconsider the FMM algorithms on heterogeneous architectures to achieve a significant improvement over previous implementations in the literature and to make the algorithm ready for use as a workhorse simulation tool both for time-dependent vortex flow problems and for boundary element methods. Our major contributions include:
1. Novel FMM data structures using parallel construction algorithms for dynamic problems.
2. A fast heterogeneous FMM algorithm for both single and multiple computing nodes.
3. Efficient inter-node communication management using fast parallel data structures.
4. A scalable FMM algorithm using a novel Helmholtz decomposition for Vortex Methods (VM), sketched below.
The proposed algorithms can handle non-uniform distributions with irregular partition shapes to achieve workload balance, and their MPI-CUDA implementations are highly tuned and demonstrate state-of-the-art performance.
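For reference, the standard Helmholtz decomposition underlying vortex methods splits the velocity field into irrotational and solenoidal parts; the dissertation's specific formulation and notation may differ, so the LaTeX below is a hedged illustration only.

\[
\mathbf{u} \;=\; \nabla \varphi \;+\; \nabla \times \boldsymbol{\psi},
\qquad
\nabla \cdot \boldsymbol{\psi} = 0,
\qquad
\boldsymbol{\omega} \;=\; \nabla \times \mathbf{u} \;=\; -\nabla^{2} \boldsymbol{\psi}.
\]

In a vortex method, the solenoidal part is recovered from the vorticity \(\boldsymbol{\omega}\) through a Biot-Savart-type N-body sum, which is exactly the sum the FMM accelerates.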