Memory Performance Analysis for Parallel Programs Using Concurrent Reuse Distance
Date
2010-10-05
Authors
Wu, Meng-Ju
Yeung, Donald
Abstract
Performance on multicore processors is determined largely by on-chip
cache. Computer architects have conducted numerous studies in the past
that vary core count and cache capacity as well as problem size to
understand their impact on cache behavior. These studies are very costly due
to the combinatorial design spaces they must explore.
Reuse distance (RD) analysis can help architects explore multicore cache
performance more efficiently. One problem, however, is that multicore RD
analysis requires measuring concurrent reuse distance (CRD) profiles
across thread-interleaved memory reference streams. Sensitivity to
memory interleaving makes CRD profiles architecture dependent,
undermining RD analysis benefits. For parallel programs with symmetric
threads, however, CRD profiles vary tractably with architecture: they
change only slightly with cache capacity scaling, and shift predictably
to larger CRD values with core count scaling. This enables analysis of a
large number of multicore configurations from a small set of measured
CRD profiles.
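
As an illustration of the distinction drawn above, the following minimal sketch computes reuse distances for a single reference stream and for an interleaved stream. It assumes the standard LRU stack-distance definition (the number of distinct blocks referenced between two accesses to the same block) and a simple round-robin interleaving; the function names `reuse_distances` and `interleave_round_robin` are hypothetical, and real interleavings depend on thread timing rather than a fixed order.

```python
from collections import OrderedDict
from itertools import zip_longest


def reuse_distances(stream):
    """Return the LRU stack distance of each reference (None on first touch)."""
    stack = OrderedDict()  # keys ordered from least to most recently used
    distances = []
    for block in stream:
        if block in stack:
            keys = list(stack)
            # Count the distinct blocks touched since the last access to `block`.
            distances.append(len(keys) - 1 - keys.index(block))
            stack.move_to_end(block)
        else:
            distances.append(None)
            stack[block] = True
    return distances


def interleave_round_robin(threads):
    """Assumed interleaving: take one reference from each thread in turn."""
    merged = []
    for group in zip_longest(*threads):
        merged.extend(ref for ref in group if ref is not None)
    return merged


# Two symmetric threads touching private data (illustrative values only).
thread0 = ["a", "b", "a"]
thread1 = ["x", "y", "x"]

per_thread_rd = reuse_distances(thread0)
concurrent_rd = reuse_distances(interleave_round_robin([thread0, thread1]))
```

In this toy case the reuse of `a` has distance 1 within its own thread but distance 3 in the interleaved stream, which shows how interleaving with another thread's private data can push reuses to larger CRD values.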
This paper investigates using RD analysis to efficiently analyze
multicore cache performance for parallel programs, making several
contributions. First, we characterize how CRD profiles change with core
count and cache capacity. One of our findings is that core count scaling
degrades locality, but the degradation only impacts last-level caches
(LLCs) below 16MB for our benchmarks and problem sizes, increasing to
128MB if problem size scales by 64x. Second, we apply reference groups
to predict CRD profiles across core count scaling, and evaluate
prediction accuracy. Finally, we use CRD profiles to analyze multicore
cache performance. We find that predicted CRD profiles can estimate LLC
MPKI to within 76% of simulation for configurations without pathological
cache conflicts, in 1/1200th of the time needed to simulate the full
design space.
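
The step from a CRD profile to an MPKI estimate can be sketched as follows. This is a minimal illustration and not the report's implementation: it assumes a fully associative LRU cache, so a reference misses when its reuse distance is at least the cache capacity in blocks, and it therefore ignores conflict misses. The function `estimate_mpki`, its parameters, and the histogram values are hypothetical.

```python
def estimate_mpki(crd_histogram, cold_misses, capacity_blocks, instructions):
    """Estimate misses per thousand instructions from a CRD histogram.

    crd_histogram maps a reuse distance (in blocks) to a reference count.
    Assumption: fully associative LRU cache, so distance >= capacity misses.
    """
    capacity_misses = sum(
        count
        for distance, count in crd_histogram.items()
        if distance >= capacity_blocks
    )
    return 1000.0 * (cold_misses + capacity_misses) / instructions


# Illustrative profile only; these are not measured values.
profile = {0: 500, 3: 300, 10: 200}

small_cache_mpki = estimate_mpki(profile, cold_misses=50,
                                 capacity_blocks=4, instructions=10_000)
large_cache_mpki = estimate_mpki(profile, cold_misses=50,
                                 capacity_blocks=16, instructions=10_000)
```

Because one profile can be evaluated against any capacity this way, a single measured or predicted profile covers many cache sizes without rerunning a simulation.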