Scalable Fast Multipole Methods on Heterogeneous Architecture
Date
2013
Authors
Hu, Qi
Advisor
Duraiswami, Ramani
Gumerov, Nail A.
Abstract
The N-body problem appears in many computational physics simulations. At each time step, the computation involves an all-pairs sum whose complexity is quadratic in the number of particles, followed by an update of particle positions. This cost makes it impractical to solve such dynamic N-body problems at large scale. To improve this situation, we use both algorithmic and hardware approaches. Our algorithmic approach is the Fast Multipole Method (FMM), a divide-and-conquer algorithm that performs a fast N-body sum using a spatial decomposition and is often used inside a time-stepping or iterative loop; it reduces the quadratic complexity to linear with guaranteed accuracy. Our hardware approach is to use heterogeneous clusters, composed of nodes that contain multi-core CPUs tightly coupled with accelerators such as graphics processing units (GPUs), as the underlying parallel processing hardware, on which efficient implementations require highly non-trivial, redesigned algorithms.
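To make the quadratic cost concrete, the following is a minimal CUDA sketch (not taken from the dissertation; the kernel name, problem size, and Laplace-type interaction are illustrative assumptions) of the direct all-pairs sum that the FMM replaces: each of the N targets loops over all N sources, giving O(N^2) work per time step.

// Minimal sketch of the direct O(N^2) all-pairs sum for a Laplace-type kernel.
// Names (directSum) and the problem size n are illustrative assumptions only.
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

__global__ void directSum(const float4* pos, float* phi, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 pi = pos[i];
    float sum = 0.0f;
    for (int j = 0; j < n; ++j) {            // O(N) work per target -> O(N^2) total
        if (j == i) continue;
        float4 pj = pos[j];
        float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
        sum += pj.w / sqrtf(dx * dx + dy * dy + dz * dz);  // pj.w = source strength
    }
    phi[i] = sum;
}

int main() {
    const int n = 1 << 14;                   // illustrative problem size
    std::vector<float4> h_pos(n);
    for (int i = 0; i < n; ++i)
        h_pos[i] = make_float4(rand() / (float)RAND_MAX,
                               rand() / (float)RAND_MAX,
                               rand() / (float)RAND_MAX, 1.0f);

    float4* d_pos; float* d_phi;
    cudaMalloc(&d_pos, n * sizeof(float4));
    cudaMalloc(&d_phi, n * sizeof(float));
    cudaMemcpy(d_pos, h_pos.data(), n * sizeof(float4), cudaMemcpyHostToDevice);

    directSum<<<(n + 255) / 256, 256>>>(d_pos, d_phi, n);
    cudaDeviceSynchronize();

    std::vector<float> h_phi(n);
    cudaMemcpy(h_phi.data(), d_phi, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("phi[0] = %f\n", h_phi[0]);

    cudaFree(d_pos);
    cudaFree(d_phi);
    return 0;
}

The FMM avoids this quadratic loop by grouping well-separated sources into multipole expansions over an octree, so each target interacts with O(1) expansions per level rather than with every source.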
In this dissertation, we fundamentally reconsider the FMM algorithms on heterogeneous architectures to achieve a significant improvement over previous implementations in the literature and to make the algorithm ready for use as a workhorse simulation tool both for time-dependent vortex flow problems and for boundary element methods. Our major contributions include:
1. Novel FMM data structures using parallel construction algorithms for dynamic problems.
2. A fast heterogeneous FMM algorithm for both single and multiple computing nodes.
3. Efficient inter-node communication management using fast parallel data structures.
4. A scalable FMM algorithm using a novel Helmholtz decomposition for Vortex Methods (VM), sketched below.
The proposed algorithms can handle non-uniform distributions with irregular partition shapes to achieve workload balance, and their MPI-CUDA implementations are highly tuned and demonstrate state-of-the-art performance.
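For reference, the standard Helmholtz decomposition underlying vortex methods splits the velocity field into irrotational and solenoidal parts; the dissertation's specific formulation and notation may differ, so the LaTeX below is a hedged illustration only.

\[
\mathbf{u} \;=\; \nabla \varphi \;+\; \nabla \times \boldsymbol{\psi},
\qquad
\nabla \cdot \boldsymbol{\psi} = 0,
\qquad
\boldsymbol{\omega} \;=\; \nabla \times \mathbf{u} \;=\; -\nabla^{2} \boldsymbol{\psi}.
\]

In a vortex method, the solenoidal part is recovered from the vorticity \(\boldsymbol{\omega}\) through a Biot-Savart-type N-body sum, which is exactly the sum the FMM accelerates.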