Understanding Multicore Cache Behavior of Loop-based Parallel Programs via Reuse Distance Analysis

dc.contributor.authorWu, Meng-Ju
dc.contributor.authorYeung, Donald
dc.date.accessioned2012-03-09T21:42:16Z
dc.date.available2012-03-09T21:42:16Z
dc.date.issued2012-01-17
dc.description.abstractUnderstanding multicore memory behavior is crucial, but can be challenging due to the cache hierarchies employed in modern CPUs. In today's hierarchies, performance is determined by complex thread interactions, such as interference in shared caches and replication and communication in private caches. Researchers normally perform simulation to sort out these interactions, but this can be costly and not very insightful. An alternative is reuse distance (RD) analysis. RD analysis for multicore processors is becoming feasible because recent research has developed new notions of reuse distance that can analyze thread interactions. In particular, concurrent reuse distance (CRD) models shared cache interference, while private-stack reuse distance (PRD) models private cache replication and communication. Previous multicore RD research has centered around developing techniques and verifying accuracy. In this paper, we apply multicore RD analysis to better understand memory behavior. We focus on loop-based parallel programs, an important class of programs for which RD analysis provides high accuracy. First, we develop techniques to isolate thread interactions, permitting analysis of their relative contributions. Then, we use our techniques to extract several new insights that can help architects optimize multicore cache hierarchies. One of our findings is that data sharing in parallel loops varies with reuse distance, becoming significant only at larger RD values. This implies capacity sharing in shared caches and replication/communication in private caches occur only beyond some capacity. We define Cshare to be the turn-on capacity for data sharing, and study its impact on private vs. shared cache performance. In addition, we find machine scaling degrades locality at smaller RD values and increases sharing frequency (i.e., reduces Cshare). We characterize how these effects vary with core count, and study their impact on the preference for private vs. shared caches.en_US
dc.identifier.urihttp://hdl.handle.net/1903/12438
dc.language.isoen_USen_US
dc.relation.ispartofseriesUMIACS;UMIACS-TR-2012-01
dc.titleUnderstanding Multicore Cache Behavior of Loop-based Parallel Programs via Reuse Distance Analysisen_US
dc.typeTechnical Reporten_US

Files

Original bundle
Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
UMIACS-TR-2012-01.pdf
Size:
752.55 KB
Format:
Adobe Portable Document Format
License bundle
Now showing 1 - 1 of 1
No Thumbnail Available
Name:
license.txt
Size:
1.57 KB
Format:
Item-specific license agreed upon to submission
Description: