Understanding Multicore Cache Behavior of Loop-based Parallel Programs via Reuse Distance Analysis
Date
2012-01-17
Authors
Wu, Meng-Ju
Yeung, Donald
Abstract
Understanding multicore memory behavior is crucial, but can be
challenging due to the cache hierarchies employed in modern CPUs. In
today's hierarchies, performance is determined by complex thread
interactions, such as interference in shared caches and replication and
communication in private caches. Researchers normally perform simulation
to sort out these interactions, but this can be costly and not very
insightful. An alternative is reuse distance (RD) analysis. RD analysis
for multicore processors is becoming feasible because recent research
has developed new notions of reuse distance that can analyze thread
interactions. In particular, concurrent reuse distance (CRD) models
shared cache interference, while private-stack reuse distance (PRD)
models private cache replication and communication. Previous multicore
RD research has centered around developing techniques and verifying
accuracy. In this paper, we apply multicore RD analysis to better
understand memory behavior. We focus on loop-based parallel programs, an
important class of programs for which RD analysis provides high
accuracy. First, we develop techniques to isolate thread interactions,
permitting analysis of their relative contributions. Then, we use our
techniques to extract several new insights that can help architects
optimize multicore cache hierarchies. One of our findings is that data
sharing in parallel loops varies with reuse distance, becoming
significant only at larger RD values. This implies that capacity sharing
in shared caches, and replication/communication in private caches, occur
only beyond some cache capacity. We define Cshare to be the turn-on capacity for
data sharing, and study its impact on private vs. shared cache
performance. In addition, we find machine scaling degrades locality at
smaller RD values and increases sharing frequency (i.e., reduces
Cshare). We characterize how these effects vary with core count, and
study their impact on the preference for private vs. shared caches.