Electrical & Computer Engineering Research Works

Permanent URI for this collection: http://hdl.handle.net/1903/1658

Search Results

Now showing 1 - 10 of 19
  • Item
    DDR2 and Low Latency Variants
    (2000-07) Davis, Brian; Mudge, Trevor; Jacob, Bruce; Cuppu, Vinodh
    This paper describes a performance examination of the DDR2 DRAM architecture and the proposed cache-enhanced variants. These preliminary studies are based upon ongoing collaboration between the authors and the Joint Electronic Device Engineering Council (JEDEC) Low Latency DRAM Working Group, a working group within the JEDEC 42.3 Future DRAM Task Group. This Task Group is responsible for developing the DDR2 standard. The goal of the Low Latency DRAM Working Group is the creation of a single cache-enhanced (i.e., low-latency) architecture based upon this same interface. There are a number of proposals for reducing the average access time of DRAM devices, most of which involve the addition of SRAM to the DRAM device. As DDR2 is viewed as a future standard, these proposals are frequently applied to a DDR2 interface device. For the same reasons it is advantageous to have a single DDR2 specification, it is similarly beneficial to have a single low-latency specification. The authors are involved in ongoing research to evaluate which enhancements to the baseline DDR2 devices will yield lower average latency, and for what types of applications. To provide context, experimental results are compared against those for systems utilizing PC100 SDRAM, DDR133 SDRAM, and Direct Rambus (DRDRAM). This work is just beginning to produce performance data. Initial results show that low-latency devices deliver significant performance improvements, though smaller than those of a generational change in DRAM interface. It is also apparent that there are at least two classes of applications: (1) those that saturate the memory bus, for which performance depends upon the potential bandwidth and bus utilization of the system; and (2) those that lack the access parallelism to fully utilize the memory bus, and for which performance depends upon the latency of the average primary-memory access.
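
    A back-of-the-envelope model makes the two application classes concrete. In this C sketch (every parameter value is invented for illustration; none comes from the paper), a bandwidth-bound workload's memory stall time is governed by traffic divided by achievable bandwidth, while a latency-bound workload's stall time is governed by access count times average latency:

        #include <stdio.h>

        int main(void) {
            /* Hypothetical system and workload parameters. */
            double peak_bw  = 1.6e9;  /* bytes/s on the memory channel     */
            double util     = 0.7;    /* achievable bus utilization        */
            double bytes    = 64e6;   /* total DRAM traffic of the run     */
            double accesses = 1e6;    /* primary-memory accesses           */
            double avg_lat  = 60e-9;  /* average access latency in seconds */

            /* Class 1: bus-saturating, limited by bandwidth * utilization. */
            double bw_bound  = bytes / (peak_bw * util);
            /* Class 2: little access parallelism, limited by latency.      */
            double lat_bound = accesses * avg_lat;

            printf("bandwidth-bound stall time: %.2f ms\n", bw_bound * 1e3);
            printf("latency-bound stall time:   %.2f ms\n", lat_bound * 1e3);
            return 0;
        }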
  • Item
    Extended Split-Issue: Enabling Flexibility in the Hardware Implementation of NUAL VLIW DSPs
    (2004-06) Iyer, Bharath; Srinivasan, Sadagopan; Jacob, Bruce
    DSPs based on VLIW architectures have become widespread due to the combined benefits of simple hardware and compiler-extracted instruction-level parallelism. However, the VLIW instruction set architecture and its hardware implementation are tightly coupled, especially so for Non-Unit Assumed Latency (NUAL) VLIWs. The problem of object-code compatibility across processors having different numbers of functional units or hardware latencies has been the Achilles' heel of this otherwise powerful architecture. In this paper, we propose eXtended Split-Issue (XSI), a novel mechanism that breaks the instruction packet syntax of an NUAL VLIW compiler without violating the dataflow dependences. XSI gives a designer the freedom to disassociate the hardware implementation of the NUAL VLIW processor from the instruction set architecture. Further, we investigate fairly radical (in the context of VLIW) changes to the hardware, such as removing an adder, adding a multiplier, and incorporating simultaneous multithreading (SMT), to show that our technique works for a variety of hardware configurations without compromising performance. The technique can be used in both single-threaded and multi-threaded architectures to achieve a level of flexibility heretofore unavailable in the VLIW arena.
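
    The toy model below (ours, not the paper's XSI hardware) illustrates the split-issue idea: a packet compiled for a wide machine is issued over several cycles on a narrower one, while per-register ready times preserve the dataflow dependences that NUAL semantics expose. The register count, latencies, and two-slot machine are all simplifying assumptions:

        #include <stdio.h>

        typedef struct { int dst, src1, src2, latency; } Insn;

        #define NREG 16
        static int ready[NREG];  /* cycle at which each register is available */

        /* Issue one compiler packet on a machine with 'slots' functional
         * units per cycle, spilling the remainder into later cycles while
         * tracking when each source operand becomes ready. Returns the
         * first cycle after the packet has fully issued. */
        int split_issue(const Insn *pkt, int n, int slots, int cycle) {
            int i = 0;
            while (i < n) {
                for (int issued = 0; i < n && issued < slots; i++, issued++) {
                    int start = cycle;
                    if (ready[pkt[i].src1] > start) start = ready[pkt[i].src1];
                    if (ready[pkt[i].src2] > start) start = ready[pkt[i].src2];
                    ready[pkt[i].dst] = start + pkt[i].latency;
                }
                cycle++;  /* leftover packet members split into the next cycle */
            }
            return cycle;
        }

        int main(void) {
            /* A 4-wide packet run on a 2-slot machine: issued over 2 cycles. */
            Insn packet[] = { {1,0,0,1}, {2,0,0,3}, {3,1,2,1}, {4,3,3,1} };
            int end = split_issue(packet, 4, 2, 0);
            printf("packet issued over cycles 0..%d\n", end - 1);
            return 0;
        }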
  • Item
    Uniprocessor Virtual Memory Without TLBs
    (IEEE, 2001-05) Jacob, Bruce; Mudge, Trevor
    We present a feasibility study for performing virtual address translation without specialized translation hardware. Removing address translation hardware and instead managing address translation in software has the potential to make the processor design simpler, smaller, and more energy-efficient at little or no cost in performance. The purpose of this study is to describe the design and quantify its performance impact. Trace-driven simulations show that software-managed address translation is just as efficient as hardware-managed address translation. Moreover, mechanisms to support features such as shared memory, superpages, fine-grained protection, and sparse address spaces can be defined completely in software, allowing for more flexibility than in hardware-defined mechanisms.
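
    As a sketch of what managing translation in software can look like, the C fragment below performs the work a software miss handler would do: walking a two-level page table in ordinary code, with no hardware walker. The table layout, 4 KB pages, and 32-bit addresses are our illustrative assumptions, not the design evaluated in the paper:

        #include <stdint.h>
        #include <stdio.h>

        #define PAGE_BITS 12  /* 4 KB pages */
        #define L1_BITS   10
        #define L2_BITS   10

        typedef struct { uint32_t *l2[1u << L1_BITS]; } PageTable;

        /* Software page-table walk: returns the physical address, or 0 to
         * signal a page fault to a (not shown) fault handler. */
        uint32_t translate(const PageTable *pt, uint32_t vaddr) {
            uint32_t l1  = vaddr >> (PAGE_BITS + L2_BITS);
            uint32_t l2  = (vaddr >> PAGE_BITS) & ((1u << L2_BITS) - 1);
            uint32_t off = vaddr & ((1u << PAGE_BITS) - 1);
            uint32_t *l2tab = pt->l2[l1];
            if (!l2tab || !l2tab[l2]) return 0;          /* page fault */
            return (l2tab[l2] & ~((1u << PAGE_BITS) - 1)) | off;
        }

        int main(void) {
            static PageTable pt;
            static uint32_t l2tab[1u << L2_BITS];
            l2tab[5] = 0x00080000;  /* map virtual page 5 to frame 0x80 */
            pt.l2[0] = l2tab;
            printf("0x%08x\n", (unsigned)translate(&pt, (5u << PAGE_BITS) | 0x123));
            return 0;
        }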
  • Item
    Instruction-Level Power Dissipation in the Intel XScale Embedded Microprocessor
    (2005-01) Varma, Ankush; Debes, Eric; Kozintsev, Igor; Jacob, Bruce
    We present an instruction-level power dissipation model of the Intel XScale® microprocessor. The XScale implements the ARM™ ISA, but uses an aggressive microarchitecture and a SIMD Wireless MMX™ co-processor to speed up execution of multimedia workloads in the embedded domain. Instruction-level power modeling was first proposed by Tiwari et al. in 1994. Adaptations of this model have been found to be applicable to simple ARM processors. Research also shows that instructions can be clustered into groups with similar energy characteristics. We adapt these methodologies to the significantly more complex XScale processor. We characterize the processor in terms of the energy costs of opcode execution, operand values, pipeline stalls, etc., through accurate measurements on hardware. This instruction-based (rather than microarchitectural) approach allows us to build a high-speed power-accurate simulator that runs at MIPS-range speeds, while achieving accuracy better than 5%. The processor core accounts for only a portion of overall power consumption, and we move beyond the core to explore the issues involved in building a SystemC simulation framework that models power dissipation of complete systems quickly, flexibly, and accurately.
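
    A minimal sketch of a Tiwari-style instruction-level energy model follows. The opcode classes and every nJ value are invented for illustration; the paper derives such tables from hardware measurements of the XScale:

        #include <stdio.h>

        enum { OP_ALU, OP_MUL, OP_LOAD, OP_STORE, N_OPS };

        static const double base_nj[N_OPS] = { 0.8, 1.6, 2.1, 1.9 };
        static const double inter_nj = 0.3;  /* circuit-state (opcode-change) cost */
        static const double stall_nj = 0.5;  /* energy per pipeline-stall cycle    */

        /* Energy of an instruction trace: per-opcode base cost, plus an
         * inter-instruction overhead when the opcode changes, plus a cost
         * for each stall cycle attributed to the instruction. */
        double trace_energy(const int *ops, const int *stalls, int n) {
            double e = 0.0;
            for (int i = 0; i < n; i++) {
                e += base_nj[ops[i]] + stall_nj * stalls[i];
                if (i > 0 && ops[i] != ops[i - 1]) e += inter_nj;
            }
            return e;
        }

        int main(void) {
            int ops[]    = { OP_LOAD, OP_ALU, OP_ALU, OP_MUL, OP_STORE };
            int stalls[] = { 2, 0, 0, 1, 0 };
            printf("trace energy: %.2f nJ\n", trace_energy(ops, stalls, 5));
            return 0;
        }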
  • Item
    TERPS: The Embedded Reliable Processing System
    (2005-01) Wang, Hongxia; Rodriguez, Samuel; Dirik, Cagdas; Gole, Amol; Chan, Vincent; Jacob, Bruce
    TERPS is a fault-tolerant computer design that significantly reduces the threat of electromagnetic interference (EMI), using hardware checkpoint/rollback-recovery. TERPS tolerates EMI by periodically checkpointing processor state into a special safe-storage device. The detection of EMI invokes rollback, which recovers processor state from a previously checkpointed state and resumes normal execution. Rollback results in a loss of performance dictated by the EMI duration; TERPS ensures forward progress of the system provided EMI events are separated by some minimum time interval (e.g., at least 5.12 μs for our prototype processor running at 100 MHz). The performance overhead of our mechanism is reasonable: 5–6% overhead when checkpointing every 128 processor cycles.
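
    The checkpoint/rollback control flow fits in a few lines of C. The 128-cycle interval comes from the abstract; the state contents, the EMI-detection hook, and the single-counter stand-in for useful work are our illustrative assumptions:

        #include <stdio.h>
        #include <string.h>

        #define INTERVAL 128  /* checkpoint every 128 processor cycles */

        typedef struct { unsigned pc; unsigned regs[16]; } State;

        static State live, safe;  /* 'safe' models the safe-storage device */

        static int emi_at_1000(unsigned c) { return c == 1000; }

        void run(unsigned cycles, int (*emi_detected)(unsigned)) {
            for (unsigned c = 0; c < cycles; c++) {
                if (c % INTERVAL == 0)
                    memcpy(&safe, &live, sizeof live);  /* checkpoint */
                if (emi_detected(c)) {
                    memcpy(&live, &safe, sizeof live);  /* rollback   */
                    continue;  /* work since the checkpoint is redone */
                }
                live.pc++;  /* stand-in for one cycle of useful work  */
            }
        }

        int main(void) {
            run(2000, emi_at_1000);
            /* pc < 2000: the gap is the performance cost of one rollback. */
            printf("work completed: %u of 2000 cycles\n", live.pc);
            return 0;
        }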
  • Item
    Concurrency, Latency, or System Overhead: Which Has the Largest Impact on Uniprocessor DRAM-System Performance?
    (2001-06) Cuppu, Vinodh; Jacob, Bruce
    Given a fixed CPU architecture and a fixed DRAM timing specification, there is still a large design space for a DRAM system organization. Parameters include the number of memory channels, the bandwidth of each channel, burst sizes, queue sizes and organizations, turnaround overhead, memory-controller page protocol, algorithms for assigning request priorities and scheduling requests dynamically, etc. In this design space, we see a wide variation in application execution times; for example, execution times for the SPEC CPU 2000 integer suite on a 2-way ganged Direct Rambus organization (32 data bits) with 64-byte bursts are 10–20% lower than execution times on an otherwise identical configuration that uses 32-byte bursts. These two system configurations are relatively close to each other in the design space; performance differences become even more pronounced for designs further apart. This paper characterizes the sources of overhead in high-performance DRAM systems and investigates the most effective ways to reduce a system’s exposure to performance loss. In particular, we look at mechanisms to increase a system’s support for concurrent transactions, mechanisms to reduce request latency, and mechanisms to reduce the “system overhead”: the portion of the primary memory system’s overhead that is not due to DRAM latency but rather to factors such as turnaround time, request queueing, and inefficiencies due to read/write request interleaving. Our simulator models a 2 GHz, highly aggressive out-of-order uniprocessor. The interface to the memory system is fully non-blocking, supporting up to 32 outstanding misses at both the level-1 and level-2 caches, with split-transaction buses to all DRAM banks.
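
    As one concrete slice of this design space, the burst-size effect can be viewed as simple amortization: a fixed per-request overhead (row access, turnaround, queueing) is spread over more data cycles when bursts are longer. The numbers below are invented for illustration and are not taken from the paper's simulations:

        #include <stdio.h>

        /* Time per byte delivered = (fixed overhead + transfer time) / bytes. */
        double ns_per_byte(double overhead_ns, int burst_bytes, double bytes_per_ns) {
            return (overhead_ns + burst_bytes / bytes_per_ns) / burst_bytes;
        }

        int main(void) {
            double bw = 3.2;  /* bytes/ns, e.g. a 32-bit channel at 800 MT/s */
            printf("32-byte bursts: %.3f ns/byte\n", ns_per_byte(40.0, 32, bw));
            printf("64-byte bursts: %.3f ns/byte\n", ns_per_byte(40.0, 64, bw));
            return 0;
        }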
  • Item
    Using Virtual Load/Store Queues (VLSQs) to Reduce the Negative Effects of Reordered Memory Instructions
    (2005-02) Jaleel, Aamer; Jacob, Bruce
    The use of large instruction windows coupled with aggressive out-of-order execution and prefetching capabilities has provided significant improvements in processor performance. In this paper, we quantify the effects of increased out-of-order aggressiveness on a processor’s memory ordering/consistency model as well as an application’s cache behavior. We observe that increasing the reorder buffer size causes less than one third of issued memory instructions to be executed in actual program order. We show that increasing the reorder buffer size from 80 to 512 entries results in an increase in the frequency of memory traps by a factor of six and an increase in total execution overhead by 10–40%. Additionally, we observe that the reordering of memory instructions increases L1 data cache accesses by 10–60% and L1 data cache misses by 10–20%. These findings reveal that increased out-of-order capability can waste energy in two ways. First, re-fetching and re-executing instructions flushed due to traps require the fetch, map, and execution units to dissipate energy on work that has already been done before. Second, the increase in the number of cache accesses and cache misses needlessly dissipates energy. Both side effects can be traced to the reordering of memory instructions. Thus, to avoid wasting both energy and performance, we propose a virtual load/store queue (VLSQ) within the existing physical load/store queue. The VLSQ reduces the reordering of memory instructions by limiting the number of memory instructions visible to the select and issue logic. We show that VLSQs can reduce trap overhead, cache accesses, and cache misses by as much as 45%, 50%, and 15% respectively when compared to traditional load/store queues. We observe that these reductions yield net power savings of 10–50% with only a 1–5% degradation in performance.
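
    A minimal sketch of the selection rule implied by the abstract: the physical load/store queue stays large, but only the oldest few entries are visible to the select and issue logic, so a younger ready memory operation must wait its turn. The queue sizes and data layout are our own illustrative choices:

        #include <stdio.h>

        #define PHYS 32  /* physical LSQ entries                  */
        #define VIRT 8   /* entries visible to select/issue logic */

        typedef struct { int valid, ready; } LSQEntry;

        /* Scan only the VIRT oldest valid entries (the virtual window) and
         * return the index of the first ready one, or -1. Entries are kept
         * in age order in a circular buffer starting at 'head'. */
        int select_memop(const LSQEntry *q, int head, int count) {
            int window = count < VIRT ? count : VIRT;
            for (int i = 0; i < window; i++) {
                int idx = (head + i) % PHYS;
                if (q[idx].valid && q[idx].ready) return idx;
            }
            return -1;  /* ready entries beyond the window must wait */
        }

        int main(void) {
            LSQEntry q[PHYS] = { {0, 0} };
            for (int i = 0; i < 12; i++) q[i].valid = 1;
            q[10].ready = 1;  /* ready, but outside the 8-entry window */
            q[3].ready  = 1;  /* ready and inside the window           */
            printf("selected entry: %d\n", select_memop(q, 0, 12));
            return 0;
        }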
  • Item
    High-Performance DRAMs in Workstation Environments
    (2001-10) Cuppu, Vinodh; Jacob, Bruce; Davis, Brian; Mudge, Trevor
    This paper presents a simulation-based performance study of several of the new high-performance DRAM architectures, each evaluated in a small system organization. These small-system organizations correspond to workstation-class computers and use only a handful of DRAM chips (~10, as opposed to ~1 or ~100). The study covers Fast Page Mode, Extended Data Out, Synchronous, Enhanced Synchronous, Double Data Rate, Synchronous Link, Rambus, and Direct Rambus designs. Our simulations reveal several things: (a) current advanced DRAM technologies are attacking the memory bandwidth problem but not the latency problem; (b) bus transmission speed will soon become a primary factor limiting memory-system performance; (c) the post-L2 address stream still contains significant locality, though it varies from application to application; (d) systems without L2 caches are feasible for low- and medium-speed CPUs (1GHz and below); and (e) as we move to wider buses, row access time becomes more prominent, making it important to investigate techniques to exploit the available locality to decrease access time.
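
    Findings (c) and (e) both concern exploiting locality to hide row-access time. The toy model below (generic timing values, not the paper's) shows how average access latency falls as the row-buffer hit rate of the post-L2 address stream rises:

        #include <stdio.h>

        int main(void) {
            double t_row = 30.0;  /* ns to open a row (row access)   */
            double t_col = 20.0;  /* ns for the column access itself */
            /* A row-buffer hit skips the row-access phase entirely. */
            for (double hit = 0.0; hit <= 1.0; hit += 0.25) {
                double avg = t_col + (1.0 - hit) * t_row;
                printf("hit rate %.2f -> average latency %.1f ns\n", hit, avg);
            }
            return 0;
        }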
  • Item
    BioBench: A Benchmark Suite of Bioinformatics Applications
    (2005-03) Albayraktaroglu, Kursad; Jaleel, Aamer; Wu, Xue; Franklin, Manoj; Jacob, Bruce; Tseng, Chau-Wen; Yeung, Donald
    Recent advances in bioinformatics and the significant increase in computational power available to researchers have made it possible to make better use of the vast amounts of genetic data that have been collected over the last two decades. As the uses of genetic data expand to include drug discovery and the development of gene-based therapies, bioinformatics is destined to take its place in the forefront of scientific computing application domains. Despite the clear importance of this field, common bioinformatics applications and their implications for microarchitectural design have received scant attention from the computer architecture community so far. The availability of a common set of bioinformatics benchmarks could be the first step toward motivating further research in this crucial area. To this end, this paper presents BioBench, a benchmark suite that represents a diverse set of bioinformatics applications. The first version of BioBench includes applications from different application domains, with a particular emphasis on mature genomics applications. The applications in the benchmark are described briefly, and basic execution characteristics obtained on a real processor are presented. Compared to SPEC INT and SPEC FP benchmarks, the applications in BioBench display a higher percentage of load/store instructions, almost negligible floating-point operation content, and higher IPC than either SPEC INT or SPEC FP applications. Our evaluation suggests that bioinformatics applications have distinctly different characteristics from the applications in both of the mentioned SPEC suites, and our findings indicate that bioinformatics workloads can benefit from architectural improvements to memory bandwidth and techniques that exploit their high levels of ILP. The entire BioBench suite and accompanying reference data will be made freely available to researchers.
  • Item
    Radio Frequency Effects on the Clock Networks of Digital Circuits
    (2004-08) Wang, Hongxia; Dirik, Cagdas; Rodriguez, Samuel V.; Gole, Amol V.; Jacob, Bruce
    Radio frequency interference (RFI) can have adverse effects on commercial electronics. Properties of current high-performance integrated circuits (ICs), such as very small feature sizes, high clock frequencies, and reduced voltage levels, increase the susceptibility of these circuits to RFI, making them vulnerable at lower interference levels. Moreover, recent developments in mobile devices and wireless networks create a hostile electromagnetic environment for ICs. It is therefore important to measure the susceptibility of ICs to RFI. In this study, we investigate the RFI susceptibility of the clock network of a basic digital building block. Our experimental setup couples a pulse-modulated RF signal into the device using the pin direct-injection method. The device under test is an 8-bit ripple counter, designed and fabricated in an AMI 0.5 μm process technology. Our experiments showed that relatively low levels of RFI (e.g., 16.8 dBm at a carrier frequency of 1 GHz) can adversely affect the normal functioning of the device under test.
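
    For a sense of scale, the 16.8 dBm injection level converts to power and voltage as below; the 50-ohm load is our assumption (a common RF convention), not a figure from the paper:

        #include <math.h>
        #include <stdio.h>

        int main(void) {
            double dbm   = 16.8;
            double watts = 1e-3 * pow(10.0, dbm / 10.0);  /* dBm -> W        */
            double vrms  = sqrt(watts * 50.0);            /* assumed 50 ohms */
            printf("%.1f dBm = %.1f mW = %.2f V rms into 50 ohms\n",
                   dbm, watts * 1e3, vrms);
            return 0;
        }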