Data-centric Performance Measurement and Mapping for Highly Parallel Programming Models
Hollingsworth, Jeffrey K.
MetadataShow full item record
Modern supercomputers have complex features: many hardware threads, deep memory hierarchies, and many co-processors/accelerators. Productively and effectively designing programs to utilize those hardware features is crucial in gaining the best performance. There are several highly parallel programming models in active development that allow programmers to write efficient code on those architectures. Performance profiling is a very important technique in the development to achieve the best performance. In this dissertation, I proposed a new performance measurement and mapping technique that can associate performance data with program variables instead of code blocks. To validate the applicability of my data-centric profiling idea, I designed and implemented a profiler for PGAS and CUDA. For PGAS, I developed ChplBlamer, for both single-node and multi-node Chapel programs. My tool also provides new features such as data-centric inter-node load imbalance identification. For CUDA, I developed CUDABlamer for GPU-accelerated applications. CUDABlamer also attributes performance data to program variables, which is a feature that was not found in any previous CUDA profilers. Directed by the insights from the tools, I optimized several widely-studied benchmarks and significantly improved program performance by a factor of up to 4x for Chapel and 47x for CUDA kernels.