Organizational Design Trade-Offs at the DRAM, Memory Bus, and Memory Controller Level: Initial Results
Date
1999-11
Authors
Cuppu, Vinodh
Jacob, Bruce
Citation
"Organizational design trade-offs at the DRAM, memory bus, and memory controller level: Initial results." Vinodh Cuppu and Bruce Jacob. University of Maryland Systems and Computer Architecture Group Technical Report UMD-SCA-TR-1999-2. November 1999.
Abstract
This paper presents initial results in a study of organization-level parameters associated with the design of the primary memory system: the DRAM system beneath the lowest level of the cache hierarchy. These parameters are orthogonal to architecture-level parameters such as DRAM core speed, bus arbitration protocol, etc., and include bus width, bus speed, number of independent channels, degree of banking, read burst width, write burst width, etc.; this study presents the effective cross-product of varying each of these parameters independently. The simulator is based on SimpleScalar 3.0a and models a fast (simulated as 2GHz), highly aggressive out-of-order uniprocessor. The interface to the primary memory system is fully non-blocking, supporting up to 32 outstanding misses at both the level-1 and level-2 caches. Our simulations show the following: (a) the choice of primary memory-system organization is critical, as it can affect total execution time by a factor of 3x for a constant CPU organization and DRAM speed; (b) the most important factors in the performance of the primary memory system are the channel speed (bus cycle time) and the granularity of data access (the burst width), each of which can independently affect total execution time by a factor of 2x; (c) for small bursts, multiple narrow independent channels to the memory system exhibit better performance than a single wide channel; for large bursts, channel cycle time is the most important factor; (d) the degree of DRAM multi-banking plays a secondary role in its impact on total execution time; (e) the optimal burst width tends to be high (large enough to fetch an L2 cache block in 2 bursts) and scales with the block size of the level-2 cache; and (f) the memory queue sizes can be extremely large, due to the bursty nature of references to the primary memory system and the promotion of reads ahead of writes. Among other things, we conclude that the scheduling of the memory bus is the primary bottleneck and that it should be the focus of further study.
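
As an illustration of what "the effective cross-product of varying each of these parameters independently" could look like in practice, the following Python sketch enumerates a design space over the organization-level parameters named in the abstract and works through the burst-size arithmetic behind finding (e). The parameter values, the dictionary keys, and the helper names (`configurations`, `bursts_per_block`, `burst_for_two_fetches`) are all assumptions made for this example; the abstract does not state the values actually swept, and this is not the report's simulator.

```python
from itertools import product

# Hypothetical parameter values, chosen only for illustration.
# The abstract names these parameters but does not list the values swept.
PARAMS = {
    "bus_width_bits": [8, 16, 32, 64],
    "bus_speed_mhz": [100, 200, 400, 800],
    "channels": [1, 2, 4],
    "banks_per_channel": [1, 2, 4, 8],
    "burst_bytes": [16, 32, 64, 128],
}


def configurations(params):
    """Yield one dict per point in the cross-product of parameter values."""
    names = list(params)
    for values in product(*(params[name] for name in names)):
        yield dict(zip(names, values))


def bursts_per_block(l2_block_bytes, burst_bytes):
    """Number of bursts needed to fetch one L2 block (ceiling division)."""
    return -(-l2_block_bytes // burst_bytes)


def burst_for_two_fetches(l2_block_bytes):
    """Burst size that fetches an L2 block in exactly 2 bursts, as in (e)."""
    return l2_block_bytes // 2


configs = list(configurations(PARAMS))
print(len(configs))                  # 4 * 4 * 3 * 4 * 4 = 768 configurations
print(burst_for_two_fetches(128))    # 64 bytes for an assumed 128-byte block
```

Under these assumed values, each of the 768 configurations would correspond to one simulation run. The second helper shows why the optimal burst width in finding (e) scales with the L2 block size: doubling the block size doubles the burst needed to fetch it in two transfers.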