Probabilistic Replacement: Enabling Flexible Use of Shared Caches for CMPs
Abstract
CMPs allow threads to share portions of the on-chip cache. Critical to
successful sharing are the policies for allocating the shared cache to
threads. Researchers have observed the best allocation policy often
depends on the degree of cache interference. Typically, workloads with
high cache interference require explicit working set isolation via cache
partitioning, while workloads with low cache interference perform well
without explicit allocation-i.e., using LRU.
While cache interference impacts cache allocation policies, relatively
little is known about its root causes.
This paper investigates how different sharing patterns in multiprogrammed
workloads give rise to varying degrees of cache interference. We show
cache interference is tied to the granularity of interleaving amongst
inter-thread memory references: fine-grain interleaving yields high cache
interference while coarse-grain interleaving yields low cache
interference. The granularity of interleaving is determined by
differences in the mapping and timing of per-thread references in
the cache. For example, coarse-grain interleaving occurs
anytime per-thread references map to disjoint cache sets, or
are performed during distinct time intervals.
We quantify such spatial and temporal isolation of per-thread
memory references, and correlate its degree with LRU and
partitioning performance.
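To make interleaving granularity concrete, the sketch below computes one
simple proxy from a reference trace: the mean run length of consecutive
same-thread references within each cache set. This metric and the Ref
trace format are illustrative assumptions on our part, not necessarily
the measure used in the paper; longer runs indicate coarser interleaving.

    // Sketch: one illustrative way to gauge interleaving granularity from
    // a memory reference trace. The metric (mean run length of consecutive
    // same-thread references within each cache set) is a simplification
    // for illustration, not necessarily the paper's exact measure.
    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <vector>

    struct Ref { int thread; uint32_t set; };  // one cache reference

    // Longer average runs => coarser interleaving => lower interference.
    double meanRunLength(const std::vector<Ref>& trace) {
        std::map<uint32_t, int> lastThread;  // last thread to touch each set
        long runs = 0, refs = 0;
        for (const Ref& r : trace) {
            auto it = lastThread.find(r.set);
            if (it == lastThread.end() || it->second != r.thread)
                ++runs;                      // a new same-thread run begins
            lastThread[r.set] = r.thread;
            ++refs;
        }
        return runs ? static_cast<double>(refs) / runs : 0.0;
    }

    int main() {
        // Coarse interleaving: threads touch disjoint sets.
        std::vector<Ref> coarse = {{0,0},{0,0},{0,0},{1,1},{1,1},{1,1}};
        // Fine interleaving: threads alternate within the same set.
        std::vector<Ref> fine   = {{0,0},{1,0},{0,0},{1,0},{0,0},{1,0}};
        std::printf("coarse: %.2f  fine: %.2f\n",
                    meanRunLength(coarse), meanRunLength(fine));
        return 0;
    }

Under this proxy, threads that map to disjoint sets or that access the
cache during distinct time intervals produce long runs, matching the
coarse-grain cases described above.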
This paper also proposes probabilistic replacement (PR), a new
cache allocation policy motivated by our reference
interleaving insights. PR controls the rate at which
inter-thread replacements transfer cache resources between
threads, and hence, the per-thread cache allocation boundary.
Because PR operates on inter-thread replacements, it adapts to
cache interference. When interleaving is coarse-grained (low
interference), inter-thread replacements are rare, so PR
reverts to LRU. As interleaving becomes more fine-grained
(higher interference), inter-thread replacements increase, and
PR in turn applies greater resistance to impede
shifts in cache allocation. Our results show PR outperforms LRU, UCP,
and an ideal cache partitioning technique by 4.86%, 3.15%, and
1.09%, respectively.
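The abstract does not spell out PR's mechanism, but one plausible reading
is sketched below: on a miss, an inter-thread replacement (evicting
another thread's LRU line) is permitted only with probability p;
otherwise the requester victimizes its own LRU line. The parameter p,
the pickVictim interface, and the per-line owner field are all our
assumptions for illustration, not the paper's definitions.

    // Sketch of a probabilistic replacement (PR) decision for one cache
    // set. The knob `p` (probability of allowing an inter-thread
    // replacement) and the fallback to the requester's own LRU line are
    // assumptions for illustration.
    #include <cstdio>
    #include <random>
    #include <vector>

    struct Line { int owner; unsigned lruAge; };  // owner = allocating thread

    // Returns the index of the victim line for a miss by `reqThread`.
    int pickVictim(std::vector<Line>& set, int reqThread,
                   double p, std::mt19937& rng) {
        // Find the global LRU line (largest age).
        int lru = 0;
        for (size_t i = 1; i < set.size(); ++i)
            if (set[i].lruAge > set[lru].lruAge) lru = static_cast<int>(i);

        // Intra-thread replacement: no resource transfer, behave like LRU.
        if (set[lru].owner == reqThread) return lru;

        // Inter-thread replacement: allow it only with probability p.
        std::bernoulli_distribution allow(p);
        if (allow(rng)) return lru;

        // Otherwise resist the transfer: victimize the requester's own LRU
        // line if it has any lines in this set; else fall back to global LRU.
        int own = -1;
        for (size_t i = 0; i < set.size(); ++i)
            if (set[i].owner == reqThread &&
                (own < 0 || set[i].lruAge > set[own].lruAge))
                own = static_cast<int>(i);
        return own >= 0 ? own : lru;
    }

    int main() {
        std::mt19937 rng(42);
        std::vector<Line> set = {{0, 3}, {0, 1}, {1, 4}, {1, 2}};  // 4-way set
        // Thread 0 misses; with p = 0.25 most inter-thread evictions blocked.
        std::printf("victim way: %d\n", pickVictim(set, 0, 0.25, rng));
        return 0;
    }

With p = 1 this sketch degenerates to plain LRU; shrinking p throttles
inter-thread transfers and hardens the allocation boundary toward a
partition-like division, which matches the adaptive behavior the abstract
describes.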