Electrical & Computer Engineering Research Works

Permanent URI for this collection: http://hdl.handle.net/1903/1658

  • A Performance Comparison of Contemporary DRAM Architectures
    (1999-05) Cuppu, Vinodh; Jacob, Bruce; Davis, Brian; Mudge, Trevor
    In response to the growing gap between memory access time and processor speed, DRAM manufacturers have created several new DRAM architectures. This paper presents a simulation-based performance study of a representative group, each evaluated in a small system organization. These small-system organizations correspond to workstation-class computers and use on the order of 10 DRAM chips. The study covers Fast Page Mode, Extended Data Out, Synchronous, Enhanced Synchronous, Synchronous Link, Rambus, and Direct Rambus designs. Our simulations reveal several things: (a) current advanced DRAM technologies are attacking the memory bandwidth problem but not the latency problem; (b) bus transmission speed will soon become a primary factor limiting memory-system performance; (c) the post-L2 address stream still contains significant locality, though it varies from application to application; and (d) as we move to wider buses, row access time becomes more prominent, making it important to investigate techniques to exploit the available locality to decrease access time.
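
    Finding (a), bandwidth improving while latency does not, can be made concrete with a toy timing model. The sketch below (plain C, with all timing constants assumed for illustration rather than taken from the paper) models one cache-line fill as a fixed access latency plus a burst transfer whose duration shrinks as channel bandwidth grows, so the fixed latency dominates at high bandwidth.

        #include <stdio.h>

        /* Toy model: time to fill one cache line = fixed DRAM access
         * latency (row activate + column access) + burst transfer time.
         * Raising channel bandwidth shrinks only the transfer term. */
        int main(void) {
            const double latency_ns = 60.0;   /* assumed row+column latency */
            const int line_bytes = 64;        /* assumed cache-line size */
            const double gbps[] = { 0.8, 1.6, 3.2, 6.4 };  /* assumed GB/s */

            for (int i = 0; i < 4; i++) {
                double transfer_ns = line_bytes / gbps[i];  /* 1 GB/s ~ 1 B/ns */
                printf("%.1f GB/s: %6.1f ns per line (%4.1f ns fixed latency)\n",
                       gbps[i], latency_ns + transfer_ns, latency_ns);
            }
            return 0;
        }
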
  • DDR2 and Low Latency Variants
    (2000-07) Davis, Brian; Mudge, Trevor; Jacob, Bruce; Cuppu, Vinodh
    This paper describes a performance examination of the DDR2 DRAM architecture and its proposed cache-enhanced variants. These preliminary studies are based upon ongoing collaboration between the authors and the Joint Electron Device Engineering Council (JEDEC) Low Latency DRAM Working Group, a working group within the JEDEC 42.3 Future DRAM Task Group. This Task Group is responsible for developing the DDR2 standard. The goal of the Low Latency DRAM Working Group is the creation of a single cache-enhanced (i.e., low-latency) architecture based upon this same interface. There are a number of proposals for reducing the average access time of DRAM devices, most of which involve adding SRAM to the DRAM device. Because DDR2 is viewed as a future standard, these proposals are frequently applied to a DDR2 interface device. For the same reasons that it is advantageous to have a single DDR2 specification, it is similarly beneficial to have a single low-latency specification. The authors are involved in ongoing research to evaluate which enhancements to the baseline DDR2 devices will yield lower average latency, and for what types of applications. To provide context, experimental results will be compared against those for systems utilizing PC100 SDRAM, DDR133 SDRAM, and Direct Rambus (DRDRAM). This work is just beginning to produce performance data. Initial results show that the performance improvements from low-latency devices are significant, but smaller than those from a generational change in DRAM interface. It is also apparent that there are at least two classes of applications: 1) those that saturate the memory bus, for which performance depends upon the potential bandwidth and bus utilization of the system; and 2) those that lack the access parallelism to fully utilize the memory bus, for which performance depends upon the latency of the average primary-memory access.
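
    As a rough illustration of why a cache-enhanced (SRAM-on-DRAM) device helps latency-bound but not bandwidth-bound applications, the following sketch computes average access latency under a simple hit/miss model; the hit rates and latencies are assumed values, not measurements from this study.

        #include <stdio.h>

        /* Hit/miss model of a cache-enhanced DRAM device:
         * avg = hit_rate * sram_latency + (1 - hit_rate) * dram_latency.
         * An application that saturates the memory bus sees little benefit
         * from a lower average latency: its runtime is set by bandwidth. */
        int main(void) {
            const double sram_ns = 15.0, dram_ns = 45.0;  /* assumed latencies */
            for (double hit = 0.0; hit <= 1.0001; hit += 0.25) {
                double avg = hit * sram_ns + (1.0 - hit) * dram_ns;
                printf("hit rate %.2f -> average latency %.1f ns\n", hit, avg);
            }
            return 0;
        }
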
  • Concurrency, Latency, or System Overhead: Which Has the Largest Impact on Uniprocessor DRAM-System Performance?
    (2001-06) Cuppu, Vinodh; Jacob, Bruce
    Given a fixed CPU architecture and a fixed DRAM timing specification, there is still a large design space for a DRAM system organization. Parameters include the number of memory channels, the bandwidth of each channel, burst sizes, queue sizes and organizations, turnaround overhead, memory-controller page protocol, algorithms for assigning request priorities and scheduling requests dynamically, etc. In this design space, we see a wide variation in application execution times; for example, execution times for the SPEC CPU2000 integer suite on a 2-way ganged Direct Rambus organization (32 data bits) with 64-byte bursts are 10–20% lower than execution times on an otherwise identical configuration that uses 32-byte bursts. These two system configurations are relatively close to each other in the design space; performance differences become even more pronounced for designs further apart. This paper characterizes the sources of overhead in high-performance DRAM systems and investigates the most effective ways to reduce a system’s exposure to performance loss. In particular, we look at mechanisms to increase a system’s support for concurrent transactions, mechanisms to reduce request latency, and mechanisms to reduce the “system overhead”—the portion of the primary memory system’s overhead that is due not to DRAM latency but to things like turnaround time, request queueing, inefficiencies due to read/write request interleaving, etc. Our simulator models a 2GHz, highly aggressive out-of-order uniprocessor. The interface to the memory system is fully non-blocking, supporting up to 32 outstanding misses at both the level-1 and level-2 caches, with split-transaction buses to all DRAM banks.
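
    The "system overhead" decomposition described above is essentially an accounting identity over a request's lifetime. The sketch below makes it explicit, splitting a single request's total time into DRAM latency and the non-DRAM terms (queueing, bus turnaround, transfer); all values are assumed for illustration, not taken from the paper's simulations.

        #include <stdio.h>

        /* Split one request's total time into DRAM latency and "system
         * overhead": everything that is not the DRAM core access itself. */
        struct request_time {
            double dram_ns;        /* row + column access in the DRAM core */
            double queue_ns;       /* waiting in the memory-controller queue */
            double turnaround_ns;  /* read/write bus turnaround share */
            double transfer_ns;    /* burst transfer on the channel */
        };

        int main(void) {
            struct request_time r = { 45.0, 30.0, 10.0, 20.0 };  /* assumed */
            double total = r.dram_ns + r.queue_ns + r.turnaround_ns + r.transfer_ns;
            double overhead = total - r.dram_ns;
            printf("total %.0f ns = %.0f ns DRAM + %.0f ns overhead (%.0f%%)\n",
                   total, r.dram_ns, overhead, 100.0 * overhead / total);
            return 0;
        }
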
  • A Case for Studying DRAM Issues at the System Level
    (IEEE Micro, 2003-08) Jacob, Bruce
    The widening gap between today’s processor and memory speeds makes DRAM subsystem design an increasingly important part of computer system design. If the DRAM research community would follow the microprocessor community’s lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close.
  • DRAMsim: A Memory System Simulator
    (ACM Publications, 2005-09) Wang, David; Ganesh, Brinda; Tuaycharoen, Nuengwong; Baynes, Kathleen; Jaleel, Aamer; Jacob, Bruce
    As memory accesses become slower with respect to the processor and consume more power with increasing memory size, memory performance and power consumption have become increasingly important concerns. With the trend toward multi-threaded, multi-core processors, the demands on the memory system will continue to scale. However, determining the optimal memory system configuration is non-trivial. Memory system performance is sensitive to a large number of parameters. Each of these parameters takes on a number of values, and they interact in ways that make overall trends difficult to discern. A comparison of memory system architectures becomes even harder when we add the dimensions of power consumption and manufacturing cost. Unfortunately, there is a lack of tools in the public domain that support such studies. Therefore, we introduce DRAMsim, a detailed and highly configurable C-based memory system simulator, to fill this gap. DRAMsim implements detailed timing models for a variety of existing memories, including SDRAM, DDR, DDR2, DRDRAM and FB-DIMM, with the capability to easily vary their parameters. It also models the power consumption of SDRAM and its derivatives. It can be used as a standalone simulator or as part of a more comprehensive system-level model. We have successfully integrated DRAMsim into a variety of simulators including MASE [15], Sim-alpha [14], BOCHS [2] and GEMS [13]. The simulator can be downloaded from www.ece.umd.edu/dramsim.
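
    DRAMsim's own interfaces are documented with the simulator itself; purely as a hypothetical sketch of what standalone, trace-driven use of a memory-system simulator looks like, the loop below reads (cycle, read/write, address) records from stdin and charges each access a fixed latency. DRAMsim's real timing models track banks, timing constraints, and power in far more detail.

        #include <stdio.h>

        /* Hypothetical trace-driven skeleton (not DRAMsim's actual API).
         * Assumed trace format: "<cycle> <R|W> <hex address>" per line. */
        int main(void) {
            unsigned long cycle, addr;
            char rw;
            double total_ns = 0.0;
            long n = 0;
            while (scanf("%lu %c %lx", &cycle, &rw, &addr) == 3) {
                total_ns += (rw == 'W') ? 40.0 : 55.0;  /* assumed latencies */
                n++;
            }
            if (n) printf("%ld accesses, average latency %.1f ns\n", n, total_ns / n);
            return 0;
        }
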
  • Organizational Design Trade-Offs at the DRAM, Memory Bus, and Memory Controller Level: Initial Results
    (1999-11) Cuppu, Vinodh; Jacob, Bruce
    This paper presents initial results from a study of organization-level parameters associated with the design of the primary memory system—the DRAM system beneath the lowest level of the cache hierarchy. These parameters are orthogonal to architecture-level parameters such as DRAM core speed, bus arbitration protocol, etc., and include bus width, bus speed, number of independent channels, degree of banking, read burst width, write burst width, etc.; this study presents the effective cross-product of varying each of these parameters independently. The simulator is based on SimpleScalar 3.0a and models a fast (simulated at 2GHz), highly aggressive out-of-order uniprocessor. The interface to the primary memory system is fully non-blocking, supporting up to 32 outstanding misses at both the level-1 and level-2 caches. Our simulations show the following: (a) the choice of primary memory-system organization is critical, as it can affect total execution time by a factor of 3x for a constant CPU organization and DRAM speed; (b) the most important factors in the performance of the primary memory system are the channel speed (bus cycle time) and the granularity of data access, the burst width—each of these can independently affect total execution time by a factor of 2x; (c) for small bursts, multiple narrow independent channels to the memory system exhibit better performance than a single wide channel; for large bursts, channel cycle time is the most important factor; (d) the degree of DRAM multi-banking plays a secondary role in its impact on total execution time; (e) the optimal burst width tends to be high (large enough to fetch an L2 cache block in two bursts) and scales with the block size of the level-2 cache; and (f) the memory queues can grow extremely large, due to the bursty nature of references to the primary memory system and the promotion of reads ahead of writes. Among other things, we conclude that the scheduling of the memory bus is the primary bottleneck and should be the focus of further study.
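
    Finding (e) reduces to a one-line formula: if the optimal burst fetches an L2 block in two bursts, then burst_bytes = block_bytes / 2, and the burst length in bus transfers is burst_bytes / bus_width_bytes. The sketch below evaluates it for a few assumed L2 block sizes and an assumed 64-bit channel.

        #include <stdio.h>

        /* Finding (e) as arithmetic: burst = half an L2 block, expressed
         * in bus transfers for an assumed 8-byte (64-bit) channel. */
        int main(void) {
            const int bus_bytes = 8;               /* assumed channel width */
            const int blocks[] = { 32, 64, 128 };  /* L2 block sizes, bytes */
            for (int i = 0; i < 3; i++) {
                int burst_bytes = blocks[i] / 2;
                printf("L2 block %3d B -> burst %2d B = %d transfers\n",
                       blocks[i], burst_bytes, burst_bytes / bus_bytes);
            }
            return 0;
        }
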