Please use this identifier to cite or link to this item:
http://hdl.handle.net/1903/12438
| Title: | Understanding Multicore Cache Behavior of Loop-based Parallel Programs via Reuse Distance Analysis |
| Authors: | Wu, Meng-Ju; Yeung, Donald |
| Type: | Technical Report |
| Issue Date: | 17-Jan-2012 |
| Series/Report no.: | UMIACS;UMIACS-TR-2012-01 |
| Abstract: | Understanding multicore memory behavior is crucial, but can be
challenging due to the cache hierarchies employed in modern CPUs. In
today's hierarchies, performance is determined by complex thread
interactions, such as interference in shared caches and replication and
communication in private caches. Researchers normally perform simulation
to sort out these interactions, but this can be costly and not very
insightful. An alternative is reuse distance (RD) analysis. RD analysis
for multicore processors is becoming feasible because recent research
has developed new notions of reuse distance that can analyze thread
interactions. In particular, concurrent reuse distance (CRD) models
shared cache interference, while private-stack reuse distance (PRD)
models private cache replication and communication. Previous multicore
RD research has centered on developing techniques and verifying their
accuracy. In this paper, we apply multicore RD analysis to better
understand memory behavior. We focus on loop-based parallel programs, an
important class of programs for which RD analysis provides high
accuracy. First, we develop techniques to isolate thread interactions,
permitting analysis of their relative contributions. Then, we use our
techniques to extract several new insights that can help architects
optimize multicore cache hierarchies. One of our findings is that data
sharing in parallel loops varies with reuse distance, becoming
significant only at larger RD values. This implies capacity sharing in
shared caches and replication/communication in private caches occur only
beyond some capacity. We define Cshare to be the turn-on capacity for
data sharing, and study its impact on private vs. shared cache
performance. In addition, we find machine scaling degrades locality at
smaller RD values and increases sharing frequency (i.e., reduces
Cshare). We characterize how these effects vary with core count, and
study their impact on the preference for private vs. shared caches. |
| URI: | http://hdl.handle.net/1903/12438 |
| Appears in Collections: | Technical Reports from UMIACS |
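For readers unfamiliar with the terminology in the abstract, the following is a minimal sketch of how a reuse distance can be computed from a memory trace. It is not the authors' tooling. It assumes the standard LRU stack definition (the number of distinct addresses referenced since the previous reference to the same address) and approximates concurrent reuse distance (CRD) by measuring reuse distance on an interleaved trace. The helper names `reuse_distances` and `interleave`, the round-robin schedule, and the toy traces are all illustrative assumptions. PRD is not modeled here.

```python
from collections import OrderedDict


def reuse_distances(trace):
    """Return the LRU stack (reuse) distance of each reference in `trace`.

    The distance is the number of distinct addresses touched since the
    previous reference to the same address. First-time references get None.
    """
    stack = OrderedDict()  # keys ordered from least to most recently used
    distances = []
    for addr in trace:
        if addr in stack:
            keys = list(stack)
            # Count the addresses referenced more recently than `addr`.
            distances.append(len(keys) - 1 - keys.index(addr))
            stack.move_to_end(addr)
        else:
            distances.append(None)
            stack[addr] = True
    return distances


def interleave(*threads):
    """Round-robin interleaving of per-thread traces (an assumed schedule)."""
    merged = []
    for i in range(max(len(t) for t in threads)):
        for t in threads:
            if i < len(t):
                merged.append(t[i])
    return merged


# Hypothetical toy traces: two threads touching disjoint data.
thread0 = ["a", "b", "a", "b"]
thread1 = ["x", "y", "x", "y"]

per_thread = reuse_distances(thread0)
shared = reuse_distances(interleave(thread0, thread1))

print(per_thread)  # [None, None, 1, 1]
print(shared)      # [None, None, None, None, 3, 3, 3, 3]
```

In this toy case each reuse has distance 1 when a thread runs alone and distance 3 on the interleaved trace, which illustrates how interference in a shared cache can lengthen reuse distances. The linear scan per reference keeps the sketch short; practical profilers typically use tree-based structures instead.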
All items in DRUM are protected by copyright, with all rights reserved.