Organizational Design Trade-Offs at the DRAM, Memory Bus, and Memory Controller Level: Initial Results


Publication or External Link





"Organizational design trade-offs at the DRAM, memory bus, and memory controller level: Initial results." Vinodh Cuppu and Bruce Jacob. University of Maryland Systems and Computer Architecture Group Technical Report UMD-SCA-TR-1999-2. November 1999.



This paper presents initial results in a study of organization-level parameters associated with the design of the primary memory system, that is, the DRAM system beneath the lowest level of the cache hierarchy. These parameters are orthogonal to architecture-level parameters such as DRAM core speed and bus arbitration protocol; they include bus width, bus speed, number of independent channels, degree of banking, read burst width, and write burst width. This study presents the effective cross-product of varying each of these parameters independently. The simulator is based on SimpleScalar 3.0a and models a fast (simulated as 2GHz), highly aggressive out-of-order uniprocessor. The interface to the primary memory system is fully non-blocking, supporting up to 32 outstanding misses at both the level-1 and level-2 caches. Our simulations show the following: (a) the choice of primary memory-system organization is critical, as it can affect total execution time by a factor of 3x for a constant CPU organization and DRAM speed; (b) the most important factors in the performance of the primary memory system are the channel speed (bus cycle time) and the granularity of data access (the burst width), each of which can independently affect total execution time by a factor of 2x; (c) for small bursts, multiple narrow independent channels to the memory system exhibit better performance than a single wide channel, whereas for large bursts, channel cycle time is the most important factor; (d) the degree of DRAM multi-banking plays a secondary role in its impact on total execution time; (e) the optimal burst width tends to be high (large enough to fetch an L2 cache block in 2 bursts) and scales with the block size of the level-2 cache; and (f) the memory queue sizes can be extremely large, due to the bursty nature of references to the primary memory system and the promotion of reads ahead of writes.
Among other things, we conclude that the scheduling of the memory bus is the primary bottleneck and that it should be the focus of further study.
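
As a rough illustration of what "the effective cross-product" of organization-level parameters means, the following minimal Python sketch enumerates every combination of a few parameter lists and computes two simple quantities mentioned above: the number of bursts needed to fetch one L2 block, and an idealized bus transfer time for one burst. All parameter values, the 128-byte L2 block size, and the function names are assumptions chosen for illustration; they are not the ranges or the simulator code used in the report, and the timing formula ignores DRAM core latency, arbitration, and bus turnaround.

```python
from itertools import product
from math import ceil

# Hypothetical parameter values, for illustration only; the report's
# actual ranges are not given in the abstract.
PARAMS = {
    "bus_width_bytes": [1, 2, 4, 8],
    "bus_mhz": [100, 200, 400, 800],
    "channels": [1, 2, 4],
    "banks_per_channel": [1, 2, 4, 8],
    "burst_bytes": [32, 64, 128],
}

L2_BLOCK_BYTES = 128  # assumed L2 block size


def enumerate_configs(params=PARAMS):
    """Yield one dict per point in the cross-product of all parameters."""
    names = list(params)
    for values in product(*(params[n] for n in names)):
        yield dict(zip(names, values))


def bursts_per_block(block_bytes, burst_bytes):
    """Number of bursts needed to transfer one cache block."""
    return ceil(block_bytes / burst_bytes)


def burst_transfer_ns(burst_bytes, bus_width_bytes, bus_mhz):
    """Idealized data-transfer time of one burst on one channel.

    Counts only bus cycles spent moving data; DRAM access latency and
    bus scheduling overheads are deliberately left out.
    """
    cycles = ceil(burst_bytes / bus_width_bytes)
    return cycles * 1000.0 / bus_mhz


if __name__ == "__main__":
    configs = list(enumerate_configs())
    print(len(configs), "configurations")
    # Configurations in which an L2 block is fetched in exactly 2 bursts.
    two_burst = [
        c for c in configs
        if bursts_per_block(L2_BLOCK_BYTES, c["burst_bytes"]) == 2
    ]
    print(len(two_burst), "configurations fetch a block in 2 bursts")
```

With these assumed lists, each configuration would correspond to one simulation run; the sketch only shows how quickly the design space grows as parameters are varied independently.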