Concurrency, Latency, or System Overhead: Which Has the Largest Impact on Uniprocessor DRAM-System Performance?

Cuppu, Vinodh; Jacob, Bruce

Concurrency, Latency, or System Overhead: Which Has the Largest Impact on Uniprocessor DRAM-System Performance?

Files

Concurrency_rights.pdf (114.82 KB)

No. of downloads: 1174

Date

2001-06

Authors

Cuppu, Vinodh

Jacob, Bruce

Citation

"Concurrency, latency, or system overhead: Which has the largest impact on uniprocessor DRAM-system performance?" Vinodh Cuppu and Bruce Jacob. Proc. 28th International Symposium on Computer Architecture (ISCA'01), pp. 62-71. Goteborg Sweden, June 2001.

Abstract

Given a fixed CPU architecture and a fixed DRAM timing specification, there is still a large design space for a DRAM system organization. Parameters include the number of memory channels, the bandwidth of each channel, burst sizes, queue sizes and organizations, turnaround overhead, memory-controller page protocol, algorithms for assigning request priorities and scheduling requests dynamically, etc. In this design space, we see a wide variation in application execution times; for example, execution times for SPEC CPU 2000 integer suite on a 2-way ganged Direct Rambus organization (32 data bits) with 64-byte bursts are 10–20% lower than execution times on an otherwise identical configuration that uses 32-byte bursts. This represents two system configurations that are relatively close to each other in the design space; performance differences become even more pronounced for designs further apart. This paper characterizes the sources of overhead in high-performance DRAM systems and investigates the most effective ways to reduce a system’s exposure to performance loss. In particular, we look at mechanisms to increase a system’s support for concurrent transactions, mechanisms to reduce request latency, and mechanisms to reduce the “system overhead”—the portion of the primary memory system’s overhead that is not due to DRAM latency but rather to things like turnaround time, request queueing, inefficiencies due to read/write request interleaving, etc. Our simulator models a 2GHz, highly aggressive out-of-order uniprocessor. The interface to the memory system is fully non-blocking, supporting up to 32 outstanding misses at both the level-1 and level-2 caches and split-transaction busses to all DRAM banks.

URI (handle)

http://hdl.handle.net/1903/7458

Collections

Electrical & Computer Engineering Research Works

Full item page