A. James Clark School of Engineering
Permanent URI for this community: http://hdl.handle.net/1903/1654
The collections in this community comprise faculty research works, as well as graduate theses and dissertations.
12 results
Search Results
Item Performance study of various modern DRAM Architectures (2018) Nallapa Yoge, Dhiraj Reddy; Jacob, Bruce; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Several DRAM architectures exist, each differing in its performance, power, and cost metrics. This thesis compares the performance and power characteristics of several such architectures that comply with JEDEC-standard DDR protocols: DDR3, DDR4, LPDDR3, LPDDR4, GDDR5, and HBM. To accurately model the differences in performance and power characteristics of these architectures, a new cycle-level DRAM memory simulator has been designed and implemented from scratch. Several distinguishing features of these protocols, such as bank groups in DDR4 and beyond, the 32-activation-window constraint in GDDR5, per-rank versus per-bank refresh granularity, and the dual-command-issue mode in HBM, are modeled and studied for their impact on workload performance and power consumption. The internal structure of DRAM exhibits several kinds of parallelism: channel-level, rank-level, and bank-level. The type and degree of parallelism, together with the associated DRAM command timing constraints, determine the latency and bandwidth characteristics of any DRAM architecture. Abstract studies are performed to determine the potential of each of these forms of parallelism for attaining the maximum supported pin bandwidth on a set of SPEC CPU 2006 workloads. Finally, several real DRAM architecture designs belonging to each of the above-mentioned protocols are studied to quantify their relative performance and power trade-offs.

Item Performance Exploration of the Hybrid Memory Cube (2014) Rosenfeld, Paul; Jacob, Bruce; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
The Hybrid Memory Cube (HMC) is an emerging main memory technology that leverages advances in 3D fabrication techniques to create a memory device with several DRAM dies stacked on top of a CMOS logic layer. The logic layer at the base of each stack contains several DRAM memory controllers that communicate with the host processor over high-speed serial links using an abstracted packet interface. Each memory controller is connected to several memory banks in the DRAM stack with Through-Silicon Vias (TSVs), metal connections that extend vertically through each chip in the die stack. Since the TSVs form a dense interconnect with short path lengths, the data bus between the controller and memory banks can be operated at higher throughput and lower energy per bit than in traditional Double Data Rate (DDRx) memories, which use many long, parallel wires on the motherboard to communicate with the memory controller located on the CPU die. The TSV connections, combined with the presence of multiple memory controllers near the memory arrays, form a device that exposes significant memory-level parallelism and is capable of delivering an order of magnitude more bandwidth than current DDRx solutions. While the architecture of this type of device is still nascent, we present several parameter sweeps to highlight the performance characteristics and trade-offs in the HMC architecture. In the first part of this dissertation, we attempt to understand and optimize the architecture of a single HMC device that is not connected to any other HMCs.
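To put the order-of-magnitude bandwidth claim above in perspective, here is a back-of-the-envelope comparison using representative figures (illustrative assumptions, not numbers taken from the dissertation): a single 64-bit DDR3-1600 channel peaks at 12.8 GB/s, while an HMC with four 16-lane links running at 10 Gb/s per lane provides roughly 80 GB/s in each direction.

```cpp
#include <cstdio>

int main() {
    // Representative figures only; actual devices vary by generation and speed grade.
    const double ddr3_gbytes_per_s  = 1600e6 * 8 / 1e9;   // 1600 MT/s x 8 bytes = 12.8 GB/s per channel
    const double hmc_lane_gbps      = 10.0;               // Gb/s per SerDes lane (assumed)
    const int    lanes_per_link     = 16;                 // full-width link (assumed)
    const int    links_per_cube     = 4;                  // links per cube (assumed)
    const double hmc_gbytes_per_dir = hmc_lane_gbps * lanes_per_link * links_per_cube / 8.0;
    std::printf("DDR3-1600 channel: %.1f GB/s; HMC, per direction: %.1f GB/s\n",
                ddr3_gbytes_per_s, hmc_gbytes_per_dir);
}
```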
We begin by quantifying the impact of a packetized high-speed serial interface on the performance of the memory system and how it differs from current-generation DDRx memories. Next, we perform a sensitivity analysis to gain insight into how various queue sizes, interconnect parameters, and DRAM timings affect the overall performance of the memory system. Then, we analyze several resource-constrained cube configurations to illustrate the trade-offs in choosing the number of memory controllers, DRAM dies, and memory banks in the system. Finally, we use a full-system simulation environment running multi-threaded workloads on top of an unmodified Linux kernel to compare the performance of the HMC against DDRx and "ideal" memory systems. We conclude that today's CPU protocols, such as coherent caches, pose a problem for a high-throughput memory system such as the HMC. After removing that bottleneck, however, we see that memory-intensive workloads can benefit significantly from the HMC's high bandwidth. In addition to being used as a single device attached to a CPU socket, the HMC allows two or more devices to be "chained" together to form a diverse set of topologies with unique performance characteristics. Since each HMC regenerates the high-speed signal on its links, in theory any number of cubes can be connected together to extend the capacity of the memory system. There are, however, practical limits on the number of cubes and the types of topologies that can be implemented. In the second part of this work, we describe the challenges and performance impacts of chaining multiple HMC cubes together. We implement several cube topologies of two, four, and eight cubes and apply a number of routing heuristics of varying complexity. We discuss the effects of topology on the overall performance of the memory system and the practical limits of chaining. Finally, we quantify the impact of chaining on workload execution using full-system simulation and show that chaining overheads are low enough for it to be a viable avenue for extending memory capacity.

Item SCALABLE AND ENERGY EFFICIENT DRAM REFRESH TECHNIQUES (2014) Bhati, Ishwar Singh; Jacob, Bruce; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
A DRAM cell requires periodic refresh operations to preserve the data in its leaky capacitor. Previously, the overheads of refresh operations were insignificant, but as both the size and speed of DRAM chips have increased significantly in the past decade, refresh has become a dominating factor in DRAM performance and power dissipation. The objective of this dissertation is to conduct a comprehensive study of the issues related to refresh operations in modern DRAM devices and, thereafter, to propose techniques that mitigate refresh penalties. To understand the growing consequences of refresh operations, we first describe various refresh command scheduling schemes, analyze the refresh modes and timings in modern commodity DRAM devices, and characterize the variation in DRAM cells' retention time. We then quantify refresh penalties by varying device speed, size, timings, and total memory capacity. Furthermore, we summarize prior refresh mechanisms and their applicability to future computing systems. Finally, based on our experiments and observations, we propose techniques to improve refresh energy efficiency and mitigate refresh scalability problems.
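To make the refresh-scaling concern above concrete, a back-of-the-envelope estimate using representative JEDEC-style timing values (illustrative figures, not taken from the dissertation): a device must receive an auto-refresh roughly every tREFI = 7.8 µs, and each refresh busies the rank for tRFC, which grows with density, so the fraction of time lost to refresh rises from roughly 1.4% for a 1 Gb die to roughly 4.5% for an 8 Gb die.

```cpp
#include <cstdio>

int main() {
    // Representative values for illustration; consult the datasheet for a given part.
    const double tREFI_ns   = 7800.0;                        // average refresh interval
    const double tRFC_ns[]  = {110.0, 160.0, 260.0, 350.0};  // ~1Gb, 2Gb, 4Gb, 8Gb dies
    const char*  density[]  = {"1Gb", "2Gb", "4Gb", "8Gb"};
    for (int i = 0; i < 4; ++i)
        std::printf("%s die: %.1f%% of time spent refreshing\n",
                    density[i], 100.0 * tRFC_ns[i] / tREFI_ns);
}
```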
Refresh operations not only introduce a performance penalty but also impose energy overheads. In addition to the energy required for refreshing itself, the background energy dissipated by the DRAM peripheral circuitry and the on-die DLL during refresh commands will become significant in future devices. We propose a set of techniques, referred to collectively as "coordinated refresh", in which the scheduling of low-power modes and refresh commands is coordinated so that most of the required refreshes are issued while the DRAM device is in the deepest low-power "self refresh" (SR) mode. Our approach saves background power because the peripheral circuitry and clocks are turned off in the SR mode. Moreover, we observe that as the number of rows in DRAM scales, a large body of research on refresh reduction using retention-time and access awareness will be rendered ineffective, because these mechanisms require the memory controller to have fine-grained control over which regions of memory are refreshed, whereas in JEDEC DDRx devices a refresh operation is carried out via an "auto-refresh" command that refreshes multiple rows from multiple banks simultaneously. The internal implementation of "auto-refresh" is completely opaque outside the DRAM: all the memory controller can do is tell the DRAM to refresh itself, and the DRAM handles everything else, in particular determining which rows in which banks are to be refreshed. We propose a modification to the DRAM that extends its existing control-register access protocol to expose the DRAM's internal refresh counter, and we also introduce a new "dummy refresh" command that skips refresh operations and simply increments the internal counter. We show that these modifications allow a memory controller to eliminate as many refreshes as prior work while achieving significant energy and performance advantages by using auto-refresh most of the time.

Item Buffer-On-Board Memory System (2012) Cooper-Balis, Elliott; Jacob, Bruce; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
The design and implementation of the commodity memory architecture has resulted in significant limitations on a system's speed and capacity. To circumvent these limitations, designers and vendors have begun to place intermediate logic between the CPU and DRAM. This additional logic has two functions: to control the DRAM and to communicate with the CPU over a fast, narrow bus. The benefit provided by this logic is a reduction in pin-out from the CPU to the memory system and improved signal integrity seen by the DRAM, permitting faster clock rates while increasing capacity. This design is reminiscent of the FB-DIMM memory system, yet it makes key changes to that architecture, including the use of existing DIMMs to reduce cost, a reduction in power (relative to FB-DIMM), and a more stable request latency. The problem is that the few vendors using this design take the same general approach, yet their implementations vary greatly in non-trivial details. A hardware-verified simulation suite is developed to accurately model and evaluate the behavior of this buffer-on-board memory system. A study of the design space is performed to determine the optimal use of the resources involved, including DRAM and bus organization, queue storage, and mapping schemes. Various constraints based on implementation cost are placed on the simulated configurations to confirm that these optimizations apply to viable systems.
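The mapping schemes referred to above (and the address and channel mapping effects discussed in the full-system results that follow) slice a physical address into channel, rank, bank, row, and column fields. A minimal sketch of one hypothetical mapping, assuming power-of-two resource counts; the field widths and ordering are illustrative, not the thesis's actual scheme:

```cpp
#include <cstdint>

// Hypothetical DRAM address mapping: low-order bits select the channel so that
// consecutive cache lines are striped across channels, then across banks.
struct DramAddress {
    uint32_t channel, rank, bank, row, column;
};

DramAddress mapAddress(uint64_t physAddr) {
    const uint64_t lineOffsetBits = 6;   // 64-byte cache line (assumed)
    const uint64_t channelBits = 2, bankBits = 3, rankBits = 1, columnBits = 7, rowBits = 15;

    uint64_t a = physAddr >> lineOffsetBits;
    DramAddress d{};
    d.channel = a & ((1u << channelBits) - 1);  a >>= channelBits;
    d.bank    = a & ((1u << bankBits)    - 1);  a >>= bankBits;
    d.rank    = a & ((1u << rankBits)    - 1);  a >>= rankBits;
    d.column  = a & ((1u << columnBits)  - 1);  a >>= columnBits;
    d.row     = a & ((1u << rowBits)     - 1);
    return d;
}
```

Different orderings of these fields trade row-buffer locality against channel- and bank-level parallelism, which is why the choice of mapping shows up so strongly in the runtime and energy numbers reported in these studies.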
Finally, full-system simulations are performed with MARSSx86 to better understand how this memory system interacts with a CPU, cache, and operating system executing an application. Full-system simulations uncover behaviors not present in simple limit-case simulations, such as the impact of address and channel mapping schemes or the organization of ports and their associated buffers. When the insights gleaned from these simulations are applied, optimal performance can be achieved while still respecting outside constraints (i.e., pin-out, power, and fabrication costs).

Item High-Performance DRAM System Design Constraints and Considerations (2010) Gross, Joseph; Jacob, Bruce L.; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
The effects of a realistic memory system have not received much attention in recent decades. Often, the memory controller and DRAMs are modeled as a fixed-latency or random-latency system, which leads to less accurate simulations. As more cores are added to each die and CPU clock rates continue to outpace memory access times, the gap will only grow wider and simulation results will become less accurate. This thesis examines the way a memory controller and DRAM system work and models them accurately in a simulator, using a simulated Alpha 21264 processor in conjunction with a full-system simulator and a memory-system simulator. Various SPEC CPU2006 benchmarks are used to examine runtimes. The process of mapping a memory location to a physical location, the algorithm for ordering the commands sent to the DRAMs, and the method of managing the row buffers are examined in detail. We find that the choice of these algorithms and policies can affect application runtime by 200% or more. It is also shown that energy use can vary by up to 300% when the address mapping policy is changed. These results show that it is important to consider all the available policies in order to optimize the memory system for the type of workload a machine will be running. No single policy is best for every application, so it is important to understand the interaction of the application and the memory system to improve performance and reduce energy consumption.

Item A Performance Comparison of Contemporary DRAM Architectures (1999-05) Cuppu, Vinodh; Jacob, Bruce; Davis, Brian; Mudge, Trevor
In response to the growing gap between memory access time and processor speed, DRAM manufacturers have created several new DRAM architectures. This paper presents a simulation-based performance study of a representative group, each evaluated in a small system organization. These small-system organizations correspond to workstation-class computers and use on the order of 10 DRAM chips. The study covers Fast Page Mode, Extended Data Out, Synchronous, Enhanced Synchronous, Synchronous Link, Rambus, and Direct Rambus designs.
Our simulations reveal several things: (a) current advanced DRAM technologies are attacking the memory-bandwidth problem but not the latency problem; (b) bus transmission speed will soon become a primary factor limiting memory-system performance; (c) the post-L2 address stream still contains significant locality, though it varies from application to application; and (d) as we move to wider buses, row access time becomes more prominent, making it important to investigate techniques that exploit the available locality to decrease access time.

Item DDR2 and Low Latency Variants (2000-07) Davis, Brian; Mudge, Trevor; Jacob, Bruce; Cuppu, Vinodh
This paper describes a performance examination of the DDR2 DRAM architecture and its proposed cache-enhanced variants. These preliminary studies are based upon ongoing collaboration between the authors and the Joint Electron Device Engineering Council (JEDEC) Low Latency DRAM Working Group, a working group within the JEDEC 42.3 Future DRAM Task Group. This Task Group is responsible for developing the DDR2 standard. The goal of the Low Latency DRAM Working Group is the creation of a single cache-enhanced (i.e., low-latency) architecture based upon this same interface. There are a number of proposals for reducing the average access time of DRAM devices, most of which involve the addition of SRAM to the DRAM device. As DDR2 is viewed as a future standard, these proposals are frequently applied to a DDR2-interface device. For the same reasons it is advantageous to have a single DDR2 specification, it is similarly beneficial to have a single low-latency specification. The authors are involved in ongoing research to evaluate which enhancements to the baseline DDR2 devices will yield lower average latency, and for what types of applications. To provide context, experimental results are compared against those for systems using PC100 SDRAM, DDR133 SDRAM, and Direct Rambus (DRDRAM). This work is just starting to produce performance data. Initial results show performance improvements for low-latency devices that are significant, but smaller than a generational change in DRAM interface. It is also apparent that there are at least two classes of applications: (1) those that saturate the memory bus, for which performance depends on the potential bandwidth and bus utilization of the system; and (2) those that do not contain enough access parallelism to fully utilize the memory bus, for which performance depends on the latency of the average primary memory access.

Item Concurrency, Latency, or System Overhead: Which Has the Largest Impact on Uniprocessor DRAM-System Performance? (2001-06) Cuppu, Vinodh; Jacob, Bruce
Given a fixed CPU architecture and a fixed DRAM timing specification, there is still a large design space for a DRAM system organization. Parameters include the number of memory channels, the bandwidth of each channel, burst sizes, queue sizes and organizations, turnaround overhead, the memory-controller page protocol, and the algorithms for assigning request priorities and scheduling requests dynamically. In this design space we see a wide variation in application execution times; for example, execution times for the SPEC CPU 2000 integer suite on a 2-way ganged Direct Rambus organization (32 data bits) with 64-byte bursts are 10–20% lower than execution times on an otherwise identical configuration that uses 32-byte bursts.
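As a rough illustration of the design space being enumerated above, one point in that space might be captured as a configuration record such as the sketch below; the parameter names and defaults are hypothetical, not the paper's actual configuration format, and the burst-size field corresponds to the 64-byte versus 32-byte comparison just mentioned.

```cpp
#include <cstdint>

// Hypothetical description of one point in the DRAM-system design space.
enum class PagePolicy { OpenPage, ClosePage };
enum class Scheduler  { InOrder, BankRoundRobin, ReadFirst };

struct MemorySystemConfig {
    uint32_t   channels         = 2;    // independent memory channels
    uint32_t   channelWidthBits = 32;   // data bits per channel
    uint32_t   burstBytes       = 64;   // 64- vs. 32-byte bursts, as compared above
    uint32_t   readQueueDepth   = 16;   // per-channel request-queue entries
    uint32_t   turnaroundCycles = 2;    // bus turnaround overhead
    PagePolicy pagePolicy       = PagePolicy::OpenPage;
    Scheduler  scheduler        = Scheduler::BankRoundRobin;
};
```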
This represents two system configurations that are relatively close to each other in the design space; performance differences become even more pronounced for designs further apart. This paper characterizes the sources of overhead in high-performance DRAM systems and investigates the most effective ways to reduce a system's exposure to performance loss. In particular, we look at mechanisms to increase a system's support for concurrent transactions, mechanisms to reduce request latency, and mechanisms to reduce the "system overhead": the portion of the primary memory system's overhead that is not due to DRAM latency but rather to things like turnaround time, request queueing, and inefficiencies due to read/write request interleaving. Our simulator models a 2 GHz, highly aggressive out-of-order uniprocessor. The interface to the memory system is fully non-blocking, supporting up to 32 outstanding misses at both the level-1 and level-2 caches, with split-transaction busses to all DRAM banks.

Item A Case for Studying DRAM Issues at the System Level (IEEE Micro, 2003-08) Jacob, Bruce
The widening gap between today's processor and memory speeds makes DRAM subsystem design an increasingly important part of computer system design. If the DRAM research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close.

Item DRAMsim: A Memory System Simulator (ACM (Association for Computing Machinery) Publications, 2005-09) Wang, David; Ganesh, Brinda; Tuaycharoen, Nuengwong; Baynes, Kathleen; Jaleel, Aamer; Jacob, Bruce
As memory accesses become slower with respect to the processor and consume more power with increasing memory size, memory performance and power consumption have become increasingly important concerns. With the trend toward multi-threaded, multi-core processors, the demands on the memory system will continue to scale. However, determining the optimal memory-system configuration is non-trivial. Memory-system performance is sensitive to a large number of parameters; each of these parameters takes on a number of values, and they interact in ways that make overall trends difficult to discern. A comparison of memory-system architectures becomes even harder when we add the dimensions of power consumption and manufacturing cost. Unfortunately, there is a lack of tools in the public domain that support such studies. Therefore, we introduce DRAMsim, a detailed and highly configurable C-based memory-system simulator that fills this gap. DRAMsim implements detailed timing models for a variety of existing memories, including SDRAM, DDR, DDR2, DRDRAM, and FB-DIMM, with the capability to easily vary their parameters. It also models the power consumption of SDRAM and its derivatives. It can be used as a standalone simulator or as part of a more comprehensive system-level model. We have successfully integrated DRAMsim into a variety of simulators, including MASE[15], Sim-alpha[14], BOCHS[2], and GEMS[13]. The simulator can be downloaded from www.ece.umd.edu/dramsim.
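For readers unfamiliar with how a standalone cycle-level memory simulator of this kind is typically driven, the sketch below shows the general shape of such a driver loop: requests are injected, the simulator is ticked one memory clock at a time, and completions are reported with their latencies. All type and function names here are hypothetical and do not reflect DRAMsim's actual interface; see the download link above for the real API.

```cpp
#include <cstdint>
#include <cstdio>
#include <queue>

// Hypothetical front end for a cycle-level memory simulator.
struct Request { uint64_t addr; bool isWrite; uint64_t issueCycle; };

class ToyMemorySimulator {                 // stand-in for the real simulator core
public:
    void addRequest(const Request& r) {
        // Fixed latency only to keep the sketch short; a real simulator instead
        // models the DRAM command protocol and its timing constraints.
        pending_.push({r, cycle_ + 40});
    }
    void tick() {
        ++cycle_;
        while (!pending_.empty() && pending_.front().doneCycle <= cycle_) {
            const auto& e = pending_.front();
            std::printf("addr %#llx finished after %llu cycles\n",
                        (unsigned long long)e.req.addr,
                        (unsigned long long)(cycle_ - e.req.issueCycle));
            pending_.pop();
        }
    }
private:
    struct Entry { Request req; uint64_t doneCycle; };
    std::queue<Entry> pending_;
    uint64_t cycle_ = 0;
};

int main() {
    ToyMemorySimulator sim;
    sim.addRequest({0x1000, false, 0});
    sim.addRequest({0x2040, true, 0});
    for (uint64_t c = 0; c < 100; ++c) sim.tick();   // drive the memory clock
}
```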