Exploiting Application-Level Information to Reduce Memory Bandwidth Consumption
Date
2002-08-01
Authors
Deepak Agarwal
Donald Yeung
Abstract
As processors continue to deliver higher levels of performance and as memory
latency tolerance techniques become widespread to address the increasing cost
of accessing memory, memory bandwidth will emerge as a major performance
bottleneck. Rather than rely solely on wider and faster memories to address
memory bandwidth shortages, an alternative is to use existing memory bandwidth
more efficiently. A promising approach is hardware-based selective
sub-blocking. In this technique, hardware predictors track the portions of
cache blocks that are referenced by the processor. On a cache miss, the
predictors are consulted and only previously referenced portions are fetched
into the cache, thus conserving memory bandwidth.
This paper proposes a software-centric approach to selective sub-blocking. We
make the key observation that wasteful data fetching inside long cache blocks
arises due to certain sparse memory references, and that such memory
references can be identified in the application source code. Rather than use
hardware predictors to discover sparse memory reference patterns from the
dynamic memory reference stream, our approach relies on the programmer or
compiler to identify the sparse memory references statically, and to use
special annotated memory instructions to specify the amount of spatial reuse
associated with such memory references. At runtime, the size annotations
select the amount of data to fetch on each cache miss, thus fetching only
data that will likely be accessed by the processor. Our results show annotated
memory instructions remove between 54% and 71% of cache traffic for 7
applications, reducing more traffic than hardware selective sub-blocking using
a 32 Kbyte predictor on all applications, and reducing as much traffic as
hardware selective sub-blocking using an 8 Mbyte predictor on 5 out of 7
applications. Overall, annotated memory instructions achieve a 17% performance
gain when used alone, and a 22.3% performance gain when combined with software
prefetching, compared to a 7.2% performance degradation when prefetching
without annotated memory instructions.
This is an updated version of CS-TR-4304.
Also UMIACS-TR-2002-64