Theses and Dissertations from UMD

Permanent URI for this community: http://hdl.handle.net/1903/2

New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be a delay of up to four months before a given thesis/dissertation appears in DRUM.

More information is available at Theses and Dissertations at University of Maryland Libraries.


Scalable and Accurate Memory System Simulation
(2019) Li, Shang; Jacob, Bruce; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Memory systems today possess more complexity than ever. On one hand, main memory technology has a much more diverse portfolio. Other than the mainstream DDR DRAMs, a variety of DRAM protocols have been proliferating in certain domains. Non-Volatile Memory (NVM) also finally has commodity main memory products, introducing more heterogeneity to the main memory media. On the other hand, the scale of computer systems, from personal computers and servers to high performance computing systems, has been growing in response to increasing computing demand. Memory systems have to keep scaling to avoid bottlenecking the whole system. However, current memory simulators cannot accurately or efficiently model these developments, making it hard for researchers and developers to evaluate or optimize memory system designs. In this study, we attack these issues from multiple angles. First, we develop a fast and validated cycle-accurate main memory simulator that can accurately model almost all existing DRAM protocols and some NVM protocols, and that can be easily extended to support upcoming protocols. We showcase this simulator by conducting a thorough characterization of existing DRAM protocols and provide insights on memory system designs. Second, to efficiently simulate increasingly parallel memory systems, we propose a lax synchronization model that allows efficient parallel DRAM simulation. We build the first practical parallel DRAM simulator, which can speed up simulation by up to a factor of three with a single-digit percentage loss in accuracy compared to cycle-accurate simulation. We also develop mitigation schemes that further improve the accuracy at no additional performance cost. Moreover, we discuss the limitations of cycle-accurate models and explore alternative ways of modeling DRAM. We propose a novel approach that converts DRAM timing simulation into a classification problem.
By doing so, we can predict the DRAM latency of each memory request upon first sight, which makes the approach compatible with scalable architecture simulation frameworks. We develop prototypes based on various machine learning models, and they demonstrate excellent performance and accuracy, making them a promising alternative to cycle-accurate models. Finally, for large-scale memory systems where data movement is often the performance-limiting factor, we propose a set of interconnect topologies and implement them in a parallel discrete event simulation framework. We evaluate the proposed topologies through simulation and show that their scalability and performance exceed those of existing topologies as system size or workload increases.
Performance study of various modern DRAM Architectures
(2018) Nallapa Yoge, Dhiraj Reddy; Jacob, Bruce; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Several DRAM architectures exist, each differing in performance, power and cost. This thesis compares the performance and power characteristics of several DRAM architectures that comply with JEDEC-standard DDR protocols such as DDR3, DDR4, LPDDR3, LPDDR4, GDDR5 and HBM. To accurately model the differences in performance and power characteristics of these architectures, a new cycle-level DRAM memory simulator has been designed and implemented from scratch. Several distinguishing features of these protocols, such as bank groups in DDR4 and beyond, the 32-activation window constraint in GDDR5, refresh granularity at the per-rank versus per-bank level, and the dual command issue mode in HBM, are modeled and studied for their impact on workload performance and power consumption. The internal structure of DRAM exhibits different kinds of parallelism, such as channel-level, rank-level and bank-level parallelism. The type and degree of parallelism, together with the associated DRAM command timing constraints, determine the latency and bandwidth characteristics of any DRAM architecture. Abstract studies are performed to determine the potential of each kind of parallelism in attaining the maximum supported pin bandwidth for a set of SPEC CPU 2006 workloads. Finally, several real DRAM architecture designs belonging to each of the above-mentioned protocols are studied to quantify their relative performance and power trade-offs.
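The "maximum supported pin bandwidth" mentioned above can be illustrated with a small calculation. This is only a sketch and is not taken from the thesis. It assumes the usual definition of peak bandwidth as transfer rate times data bus width, and the function names and the example configuration (DDR4-3200 on a 64-bit channel) are chosen here purely for illustration.

```python
def peak_bandwidth_gb_per_s(mega_transfers_per_s, bus_width_bits):
    """Peak pin bandwidth in GB/s (1 GB = 10**9 bytes).

    Assumes one bus-width worth of data moves on every transfer.
    """
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_s * bytes_per_transfer / 1000


def bandwidth_utilization(achieved_gb_per_s, peak_gb_per_s):
    """Fraction of the peak pin bandwidth that a workload actually sustains."""
    return achieved_gb_per_s / peak_gb_per_s


# Illustrative configuration only: DDR4-3200 on a 64-bit channel.
peak = peak_bandwidth_gb_per_s(3200, 64)
print(f"peak = {peak:.1f} GB/s")
# Hypothetical measured value, used only to show the ratio.
print(f"utilization = {bandwidth_utilization(12.8, peak):.0%}")
```

How close a workload gets to this peak depends on how much channel-, rank- and bank-level parallelism it exposes and on the command timing constraints, which is what the studies in this thesis measure.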
Architectural-Physical Co-Design of 3D CPUs with Micro-Fluidic Cooling
(2016) Serafy, Caleb M.; Srivastava, Ankur; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
The performance, energy efficiency and cost improvements due to traditional technology scaling have begun to slow down and present diminishing returns. Underlying reasons for this trend include fundamental physical limits of transistor scaling, the growing significance of quantum effects as transistors shrink, and a growing mismatch between transistors and interconnects regarding size, speed and power. Continued Moore's Law scaling will not come from technology scaling alone, and must involve improvements to design tools and development of new disruptive technologies such as 3D integration. 3D integration presents potential improvements to interconnect power and delay by translating the routing problem into a third dimension, and facilitates transistor density scaling independent of technology node. Furthermore, 3D IC technology opens up a new architectural design space of heterogeneously-integrated high-bandwidth CPUs. Vertical integration promises to provide the CPU architectures of the future by integrating high performance processors with on-chip high-bandwidth memory systems and highly connected network-on-chip structures. Such techniques can overcome the well-known CPU performance bottlenecks referred to as the memory and communication walls. However, the promising improvements to performance and energy efficiency offered by 3D CPUs do not come without cost, both in the financial investment needed to develop the technology and in the increased complexity of design. Two main limitations of 3D IC technology have been heat removal and TSV reliability. Transistor stacking increases power density, current density and thermal resistance in air-cooled packages. Furthermore, the technology introduces vertical through-silicon vias (TSVs) that create new points of failure in the chip and require development of new BEOL technologies.
Although these issues can be controlled to some extent using thermal-reliability aware physical and architectural 3D design techniques, high performance embedded cooling schemes, such as micro-fluidic (MF) cooling, are fundamentally necessary to unlock the true potential of 3D ICs. A new paradigm is being put forth which integrates the computational, electrical, physical, thermal and reliability views of a system. The unification of these diverse aspects of integrated circuits is called Co-Design. Independent design and optimization of each aspect leads to sub-optimal designs due to a lack of understanding of cross-domain interactions and their impacts on the feasibility region of the architectural design space. Co-Design enables optimization across layers with a multi-domain view and thus unlocks new high-performance and energy-efficient configurations. Although the Co-Design paradigm is becoming increasingly necessary in all fields of IC design, it is even more critical in 3D ICs where, as we show, the inter-layer coupling and higher degree of connectivity between components exacerbate the interdependence between architectural parameters, physical design parameters and the multitude of metrics of interest to the designer (i.e., power, performance, temperature and reliability). In this dissertation we present a framework for multi-domain co-simulation and co-optimization of 3D CPU architectures with both air and MF cooling solutions. Finally, we propose an approach for design space exploration and modeling within the new Co-Design paradigm, and discuss possible avenues for improving this work in the future.
Multi-Level Main Memory Systems: Technology Choices, Design Considerations, and Trade-off Analysis
(2015) Tschirhart, Paul Kenton; Jacob, Bruce; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Multi-level main memory systems provide a way to leverage the advantages of different memory technologies to build a main memory that overcomes the limitations of the current flat DRAM-based architecture. The slowdown of DRAM scaling has resulted in the development of new memory technologies that potentially enable the continued improvement of the main memory system in terms of performance, capacity, and energy efficiency. However, all of these novel technologies have weaknesses that necessitate the utilization of a multi-level main memory hierarchy in order to build a main memory system with acceptable characteristics. This dissertation investigates the implications of these new multi-level main memory architectures and provides key insights into the trade-offs associated with the technology and organization choices that are integral to their design. The design space of multi-level main memory systems is much larger than the traditional main memory system's because it also includes additional cache design and technology choices. This dissertation divides the analysis of that space into three more manageable components. First, we begin by exploring the ways in which high level design choices affect this new type of system differently than current state of the art systems. Second, we focus on the details of the DRAM cache and propose a novel design that efficiently enables associativity. Finally, we turn our attention to the backing store and evaluate the performance effects of different organizations and optimizations for that system. From these studies we are able to identify the critical aspects of the system that contribute significantly to its overall performance. In particular, we note that in most potential systems the ratio of hit latency to miss latency is the dominant factor that determines performance. 
This motivated the development of our novel associative DRAM cache design in order to minimize the miss rate and reduce the impact of the miss latency while maintaining an acceptable hit latency. In addition, we also observe that selecting the page size, organization, and prefetching degree that best suits each particular backing store technology can help to reduce the miss penalty thereby improving the performance of the overall system.
SCALABLE AND ENERGY EFFICIENT DRAM REFRESH TECHNIQUES
(2014) Bhati, Ishwar Singh; Jacob, Bruce; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
A DRAM cell requires periodic refresh operations to preserve data in its leaky capacitor. Previously, the overheads of refresh operations were insignificant. However, as both the size and speed of DRAM chips have increased significantly in the past decade, refresh has become a dominating factor in DRAM performance and power dissipation. The objective of this dissertation is to conduct a comprehensive study of the issues related to refresh operations in modern DRAM devices and, thereafter, to propose techniques to mitigate refresh penalties. To understand the growing consequences of refresh operations, we first describe various refresh command scheduling schemes, analyze the refresh modes and timings in modern commodity DRAM devices, and characterize the variations in DRAM cells' retention time. Then, we quantify refresh penalties by varying device speed, size, timings, and total memory capacity. Furthermore, we summarize prior refresh mechanisms and their applicability in future computing systems. Finally, based on our experiments and observations, we propose techniques to improve refresh energy efficiency and mitigate refresh scalability problems. Refresh operations not only introduce a performance penalty but also impose energy overheads. In addition to the energy required for refreshing, the background energy component, dissipated by DRAM peripheral circuitry and the on-die DLL during a refresh command, will become significant in future devices. We propose a set of techniques, referred to collectively as "coordinated refresh", in which the scheduling of low power modes and refresh commands is coordinated so that most of the required refreshes are issued when the DRAM device is in the deepest low power "self refresh" (SR) mode. Our approach saves background power because the peripheral circuitry and clocks are turned off in the SR mode.
Moreover, we observe that as the number of rows in DRAM scales, a large body of research on refresh reduction using retention time and access awareness will be rendered ineffective. This is because these mechanisms require the memory controller to have fine-grained control over which regions of the memory are refreshed, while in JEDEC DDRx devices a refresh operation is carried out via an "auto-refresh" command, which refreshes multiple rows from multiple banks simultaneously. The internal implementation of "auto-refresh" is completely opaque outside the DRAM -- all the memory controller can do is tell the DRAM to refresh itself -- and the DRAM handles everything else, in particular determining which rows in which banks are to be refreshed. We propose a modification to the DRAM that extends its existing control-register access protocol to include the DRAM's internal refresh counter, and we also introduce a new "dummy refresh" command that skips refresh operations and simply increments the internal counter. We show that these modifications allow a memory controller to eliminate as many refreshes as in prior work, while achieving significant energy and performance advantages by using auto-refresh most of the time.
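The interaction between a readable refresh counter and a "dummy refresh" command can be sketched with a toy model. This is not the dissertation's implementation. The class `ToyDram`, the notion of a "row group" refreshed per command, and the `skippable` set (groups the controller has somehow decided need no refresh) are all assumptions made here for illustration; the abstract does not say how the controller makes that decision.

```python
class ToyDram:
    """Hypothetical DRAM model: one internal counter selects the next row group."""

    def __init__(self, num_groups):
        self.num_groups = num_groups
        self.refresh_counter = 0
        self.refreshed = []  # log of row groups that were actually refreshed

    def read_refresh_counter(self):
        # Stands in for the proposed control-register access to the counter.
        return self.refresh_counter

    def auto_refresh(self):
        # Refresh the group the counter points at, then advance the counter.
        self.refreshed.append(self.refresh_counter)
        self._advance()

    def dummy_refresh(self):
        # Skip the refresh work and only advance the counter.
        self._advance()

    def _advance(self):
        self.refresh_counter = (self.refresh_counter + 1) % self.num_groups


def refresh_pass(dram, skippable):
    """Walk once over all row groups, skipping those listed in `skippable`."""
    issued = {"auto": 0, "dummy": 0}
    for _ in range(dram.num_groups):
        if dram.read_refresh_counter() in skippable:
            dram.dummy_refresh()
            issued["dummy"] += 1
        else:
            dram.auto_refresh()
            issued["auto"] += 1
    return issued


dram = ToyDram(num_groups=8)
print(refresh_pass(dram, skippable={1, 4, 5}))
print(dram.refreshed)
```

In this sketch the controller never needs to name a row explicitly: it only reads the counter and chooses between the two commands, which is why the ordinary auto-refresh path can still be used most of the time.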
Greedy Coordinate Descent CMP Multi-Level Cache Resizing
(2014) Choi, Inseok Stephen; Yeung, Donald; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Hardware designers are constantly looking for ways to squeeze waste out of architectures to achieve better power efficiency. Cache resizing is a technique that can remove wasteful power consumption in caches. The idea is to determine the minimum cache a program needs to run at near-peak performance, and then reconfigure the cache to implement this efficient capacity. While there has been significant previous work on cache resizing, existing techniques have focused on controlling resizing for a single level of cache only. This sacrifices significant opportunities for power savings in modern CPU hierarchies, which routinely employ three levels of cache. Moreover, as CMP scaling will likely continue for the foreseeable future, eliminating wasteful power consumption from a CMP multi-level cache hierarchy is crucial to achieving better power efficiency. In this dissertation, we propose a novel technique, greedy coordinate descent CMP multi-level cache resizing, that minimizes power consumption while maintaining high performance. We simultaneously resize all caches in a modern CMP cache hierarchy to minimize power consumption. Specifically, our approach predicts the power consumption and the performance level without direct evaluation. We also develop a greedy coordinate descent method that searches for an optimal cache configuration using the power efficiency gain (PEG) metric that we propose in this dissertation. This dissertation makes three contributions to CMP multi-level cache resizing. First, we discover the limits of power savings and performance. This limit study identifies the potential power savings in a CMP multi-level cache hierarchy when wasteful power consumption is eliminated. Second, we propose a prediction-based greedy coordinate descent (GCD) method to find an optimal cache configuration and to orchestrate the caches. Third, we implement online GCD techniques for CMP multi-level cache resizing.
Our approach exhibits 13.9% power savings and achieves 91% of the power savings of the static oracle cache hierarchy configuration.
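The general shape of a greedy coordinate descent search over cache sizes can be sketched as follows. This is a toy illustration, not the dissertation's method. Every number is invented, the additive slowdown model is an assumption, and the gain ratio used here (power saved per unit of slowdown added) is a stand-in rather than the PEG metric, whose definition is not given in the abstract.

```python
# Toy model: every number below is a made-up assumption for illustration.
SIZES_KB = {"L1": [64, 32, 16], "L2": [512, 256, 128], "L3": [8192, 4096, 2048]}
MW_PER_KB = {"L1": 4, "L2": 2, "L3": 1}
# Cumulative slowdown, in tenths of a percent, at each size index.
SLOWDOWN = {"L1": [0, 20, 100], "L2": [0, 10, 30], "L3": [0, 5, 40]}


def power_mw(cfg):
    return sum(SIZES_KB[lvl][i] * MW_PER_KB[lvl] for lvl, i in cfg.items())


def slowdown(cfg):
    return sum(SLOWDOWN[lvl][i] for lvl, i in cfg.items())


def greedy_resize(max_slowdown):
    """Shrink one cache level (one coordinate) per step, best gain first."""
    cfg = {lvl: 0 for lvl in SIZES_KB}  # start with every cache at full size
    while True:
        best = None
        for lvl in cfg:
            if cfg[lvl] + 1 >= len(SIZES_KB[lvl]):
                continue  # this level is already at its smallest size
            trial = dict(cfg, **{lvl: cfg[lvl] + 1})
            if slowdown(trial) > max_slowdown:
                continue  # would violate the performance constraint
            saved = power_mw(cfg) - power_mw(trial)
            lost = slowdown(trial) - slowdown(cfg)
            gain = saved / lost if lost > 0 else float("inf")
            if best is None or gain > best[0]:
                best = (gain, trial)
        if best is None:
            return cfg  # no feasible move remains
        cfg = best[1]


def describe(cfg):
    return {lvl: SIZES_KB[lvl][i] for lvl, i in cfg.items()}


final = greedy_resize(max_slowdown=50)
print(describe(final), power_mw(final), "mW")
```

Like any greedy search, this one can stop at a configuration that is not globally optimal; it trades that guarantee for evaluating only one move per cache level at each step.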
DISK DESIGN-SPACE EXPLORATION IN TERMS OF SYSTEM-LEVEL PERFORMANCE, POWER, AND ENERGY CONSUMPTION
(2007-01-16) Tuaycharoen, Nuengwong; Jacob, Bruce L.; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
To make the common case fast, most studies focus on the computation phase of applications, in which most instructions are executed. However, many programs spend significant time in the I/O-intensive phase due to I/O latency. To obtain a system with more balanced phases, we require greater insight into the effects of I/O configurations on the entire system in both the performance and power dissipation domains. Due to the lack of public tools that give a complete picture of the entire memory hierarchy, we developed SYSim. SYSim is a complete-system simulator aimed at studies of the complete memory hierarchy in both the performance and power consumption domains. In this dissertation, we used SYSim to investigate the system-level impacts of several disk enhancements and technology improvements on the detailed interactions in the memory hierarchy during the I/O-intensive phase. The experimental results are reported in terms of both total system performance and power/energy consumption. With SYSim, we conducted complete-system experiments and revealed intriguing behaviors including, but not limited to, the following: During the I/O-intensive phase, which consists of both disk reads and writes, the average system CPI tracks only the average disk read response time, and not the overall average disk response time, which is the widely accepted metric in disk drive research. In disk read-dominated applications, disk prefetching is more important than increasing the disk RPM. On the other hand, in applications with both disk reads and writes, the disk RPM matters. The execution time can be improved by an order of magnitude by applying some disk enhancements. Using disk caching and prefetching can improve performance by a factor of 2, and write-buffering can improve performance by a factor of 10. Moreover, using disk caching/prefetching and write-buffering in conjunction can improve the total system performance by at least an order of magnitude.
Increasing the disk RPM and the number of disks in a RAID disk system also yields an impressive improvement in total system performance. However, employing such techniques requires careful consideration of the trade-offs in power/energy consumption.
