Electrical & Computer Engineering Research Works

Permanent URI for this collection: http://hdl.handle.net/1903/1658

Search Results

Now showing 1 - 6 of 6
  • Item
    Using Virtual Load/Store Queues (VLSQs) to Reduce the Negative Effects of Reordered Memory Instructions
    (2005-02) Jaleel, Aamer; Jacob, Bruce
    The use of large instruction windows coupled with aggressive out-of-order and prefetching capabilities has provided significant improvements in processor performance. In this paper, we quantify the effects of increased out-of-order aggressiveness on a processor's memory ordering/consistency model as well as an application's cache behavior. We observe that increasing the reorder buffer size causes less than one third of issued memory instructions to be executed in actual program order. We show that increasing the reorder buffer size from 80 to 512 entries increases the frequency of memory traps by a factor of six and the total execution overhead by 10–40%. Additionally, we observe that the reordering of memory instructions increases L1 data cache accesses by 10–60% and L1 data cache misses by 10–20%. These findings reveal that increased out-of-order capability can waste energy in two ways. First, re-fetching and re-executing instructions flushed due to traps require the fetch, map, and execution units to dissipate energy on work that has already been done. Second, the increase in the number of cache accesses and cache misses needlessly dissipates energy. Both side effects can be traced to the reordering of memory instructions. Thus, to avoid wasting both energy and performance, we propose a virtual load/store queue (VLSQ) within the existing physical load/store queue. The VLSQ reduces the reordering of memory instructions by limiting the number of memory instructions visible to the select and issue logic. We show that VLSQs can reduce trap overhead, cache accesses, and cache misses by as much as 45%, 50%, and 15% respectively when compared to traditional load/store queues. We observe that these reductions yield net power savings of 10–50% with a performance degradation of 1–5%.
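    To make the mechanism concrete, the following is a minimal sketch of the idea of a virtual window inside a physical load/store queue: the select logic only scans the oldest few entries, so younger memory operations cannot issue arbitrarily far ahead of older ones. The structure names, the window size, and the simplified ready/issued flags are illustrative assumptions, not the paper's implementation.

```c
/* Minimal sketch of a virtual load/store queue (VLSQ), assuming a
 * simplified in-order-allocate LSQ model.  lsq_entry_t, VLSQ_WINDOW,
 * and the flags below are illustrative, not from the paper. */
#include <stdbool.h>
#include <stdio.h>

#define LSQ_SIZE    64   /* physical load/store queue entries      */
#define VLSQ_WINDOW 8    /* entries visible to select/issue logic  */

typedef struct {
    bool valid;   /* entry holds a memory instruction        */
    bool ready;   /* address/data operands are available     */
    bool issued;  /* already sent to the cache               */
} lsq_entry_t;

static lsq_entry_t lsq[LSQ_SIZE];
static int lsq_head = 0;   /* oldest entry, in program order */
static int lsq_count = 0;

/* Select the next memory instruction to issue.  A conventional LSQ
 * scans all valid entries; the VLSQ scans only the VLSQ_WINDOW oldest
 * ones, so memory instructions cannot be reordered by more than the
 * window size. */
static int vlsq_select(void)
{
    int visible = lsq_count < VLSQ_WINDOW ? lsq_count : VLSQ_WINDOW;
    for (int i = 0; i < visible; i++) {
        int idx = (lsq_head + i) % LSQ_SIZE;
        if (lsq[idx].valid && lsq[idx].ready && !lsq[idx].issued)
            return idx;   /* oldest ready, unissued memory op */
    }
    return -1;            /* nothing issuable this cycle */
}

int main(void)
{
    /* Toy occupancy: three entries, only the youngest is ready. */
    for (int i = 0; i < 3; i++) {
        int idx = (lsq_head + i) % LSQ_SIZE;
        lsq[idx].valid = true;
        lsq[idx].ready = (i == 2);
        lsq_count++;
    }
    printf("selected LSQ slot: %d\n", vlsq_select());
    return 0;
}
```

    Shrinking VLSQ_WINDOW toward 1 approaches in-order memory issue, while widening it toward LSQ_SIZE recovers the behavior of a conventional load/store queue.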
  • Item
    BioBench: A Benchmark Suite of Bioinformatics Applications
    (2005-03) Albayraktaroglu, Kursad; Jaleel, Aamer; Wu, Xue; Franklin, Manoj; Jacob, Bruce; Tseng, Chau-Wen; Yeung, Donald
    Recent advances in bioinformatics and the significant increase in computational power available to researchers have made it possible to make better use of the vast amounts of genetic data collected over the last two decades. As the uses of genetic data expand to include drug discovery and the development of gene-based therapies, bioinformatics is destined to take its place at the forefront of scientific computing application domains. Despite the clear importance of this field, common bioinformatics applications and their implications for microarchitectural design have so far received scant attention from the computer architecture community. The availability of a common set of bioinformatics benchmarks could be the first step toward motivating further research in this crucial area. To this end, this paper presents BioBench, a benchmark suite that represents a diverse set of bioinformatics applications. The first version of BioBench includes applications from different application domains, with a particular emphasis on mature genomics applications. The applications in the benchmark are described briefly, and basic execution characteristics obtained on a real processor are presented. Compared to SPEC INT and SPEC FP benchmarks, the applications in BioBench display a higher percentage of load/store instructions, almost negligible floating-point operation content, and higher IPC than either SPEC INT or SPEC FP applications. Our evaluation suggests that bioinformatics applications have distinctly different characteristics from the applications in both of the mentioned SPEC suites, and our findings indicate that bioinformatics workloads can benefit from architectural improvements to memory bandwidth and from techniques that exploit their high levels of ILP. The entire BioBench suite and accompanying reference data will be made freely available to researchers.
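    As an illustration of how the basic execution characteristics mentioned above are typically derived, the sketch below computes IPC and the load/store and floating-point instruction mix from raw event counts, such as those read from hardware performance counters. The counter values are invented for the example and are not BioBench measurements.

```c
/* Hedged sketch: deriving IPC and instruction-mix percentages from
 * raw event counts.  The numbers below are made up for illustration. */
#include <stdio.h>

int main(void)
{
    unsigned long long insts_retired = 1200000000ULL; /* illustrative */
    unsigned long long cycles        =  800000000ULL;
    unsigned long long loads         =  300000000ULL;
    unsigned long long stores        =  120000000ULL;
    unsigned long long fp_ops        =     500000ULL;

    double ipc     = (double)insts_retired / (double)cycles;
    double mem_pct = 100.0 * (double)(loads + stores) / (double)insts_retired;
    double fp_pct  = 100.0 * (double)fp_ops / (double)insts_retired;

    printf("IPC              : %.2f\n", ipc);
    printf("load/store share : %.1f%%\n", mem_pct);
    printf("FP share         : %.2f%%\n", fp_pct);
    return 0;
}
```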
  • Item
    Last Level Cache (LLC) Performance of Data Mining Workloads On a CMP — Case Study of Parallel Bioinformatics Workloads
    (2006-02) Jaleel, Aamer; Mattina, Matthew; Jacob, Bruce
    With the continuing growth in the amount of genetic data, members of the bioinformatics community are developing a variety of data-mining applications to understand the data and discover meaningful information. These applications are important in defining the design and performance decisions of future high performance microprocessors. This paper presents a detailed data-sharing analysis and chip-multiprocessor (CMP) cache study of several multithreaded data-mining bioinformatics workloads. For a CMP with a three-level cache hierarchy, we model the last level of the cache hierarchy as either multiple private caches or a single cache shared amongst the different cores of the CMP. Our experiments show that the bioinformatics workloads exhibit significant data-sharing—50–95% of the data cache is shared by the different threads of the workload. Furthermore, regardless of the amount of data cache shared, for some workloads as many as 98% of the accesses to the last-level cache are to shared data cache lines. Additionally, the amount of data-sharing exhibited by the workloads is a function of the total cache size available—the larger the data cache, the better the sharing behavior. Thus, partitioning the available last-level cache silicon area into multiple private caches can cause applications to lose their inherent data-sharing behavior. For the workloads in this study, a shared 32MB last-level cache is able to capture a tremendous amount of data-sharing and outperform a 32MB private cache configuration by several orders of magnitude. Specifically, with shared last-level caches, the bandwidth demands beyond the last-level cache can be reduced by factors of 3–625 when compared to private last-level caches.
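    The sharing measurement itself reduces to bookkeeping over a memory-access trace: for each cache line, record which cores touch it, then report the fraction of lines and of accesses that involve more than one core. The sketch below shows this bookkeeping on a toy trace; the line size, the trace, and the fixed-size table are assumptions for illustration, not the paper's simulation infrastructure.

```c
/* Illustrative data-sharing measurement over a (core, address) trace. */
#include <stdint.h>
#include <stdio.h>

#define LINE_SHIFT 6          /* assumed 64-byte cache lines */
#define MAX_LINES  1024       /* enough for this toy trace   */

typedef struct { uint64_t line; unsigned core_mask; unsigned accesses; } line_stat_t;

static line_stat_t stats[MAX_LINES];
static int nlines = 0;

static line_stat_t *lookup(uint64_t line)
{
    for (int i = 0; i < nlines; i++)
        if (stats[i].line == line) return &stats[i];
    stats[nlines].line = line;           /* assumes nlines < MAX_LINES */
    return &stats[nlines++];
}

static void record_access(int core, uint64_t addr)
{
    line_stat_t *s = lookup(addr >> LINE_SHIFT);
    s->core_mask |= 1u << core;          /* remember which cores touched the line */
    s->accesses++;
}

int main(void)
{
    /* Toy trace: cores 0 and 1 share line 0x1000; core 1 owns 0x2000. */
    record_access(0, 0x1000); record_access(1, 0x1010);
    record_access(1, 0x2000); record_access(0, 0x1020);

    int shared_lines = 0; unsigned shared_acc = 0, total_acc = 0;
    for (int i = 0; i < nlines; i++) {
        total_acc += stats[i].accesses;
        if (stats[i].core_mask & (stats[i].core_mask - 1)) { /* >1 core */
            shared_lines++;
            shared_acc += stats[i].accesses;
        }
    }
    printf("shared lines    : %d of %d\n", shared_lines, nlines);
    printf("shared accesses : %u of %u\n", shared_acc, total_acc);
    return 0;
}
```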
  • Item
    In-Line Interrupt Handling and Lock-Up Free Translation Lookaside Buffers (TLBs)
    (2006-05) Jaleel, Aamer; Jacob, Bruce
    The effects of the general-purpose precise interrupt mechanisms in use for the past few decades have received very little attention. When modern out-of-order processors handle interrupts precisely, they typically begin by flushing the pipeline to make the CPU available to execute handler instructions. In doing so, the CPU ends up flushing many instructions that have been brought into the reorder buffer. In particular, these instructions may have reached a very deep stage in the pipeline—representing significant work that is wasted. In addition, an overhead of several cycles and wasted energy (per exception detected) can be expected in refetching and reexecuting the flushed instructions. This paper concentrates on improving the performance of precisely handling software-managed translation lookaside buffer (TLB) interrupts, one of the most frequently occurring interrupts. The paper presents a novel method of in-lining the interrupt handler within the reorder buffer. Since the first-level TLB interrupt handlers are usually small, they can potentially fit in the reorder buffer along with the user-level code already there. In doing so, the instructions that would otherwise be flushed from the pipe need not be refetched and reexecuted. Additionally, it allows instructions independent of the exceptional instruction to continue executing in parallel with the handler code. In-lining the TLB interrupt handler in this way provides lock-up free TLBs. This paper proposes the prepend and append schemes for in-lining the interrupt handler into the available reorder buffer space. The two schemes are implemented on a performance model of the Alpha 21264 processor built by Alpha designers at the Palo Alto Design Center (PADC), California. We compare the overhead and performance impact of handling TLB interrupts with the traditional scheme, the append in-lined scheme, and the prepend in-lined scheme. For small, medium, and large memory footprints, the overhead is quantified by comparing the number and pipeline state of instructions flushed, the energy savings, and the performance improvements. We find that lock-up free TLBs reduce the overhead of refetching and reexecuting the flushed instructions by 30-95 percent, reduce the execution time by 5-25 percent, and reduce the energy wasted by 30-90 percent.
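    The sketch below illustrates only the core idea of injecting a short handler into free reorder-buffer space rather than flushing. The ROB size, the handler length, and the simple head/tail placement used here for the prepend and append cases are simplifying assumptions; the exact placement and renaming details are as described in the paper.

```c
/* Minimal sketch of in-lining a TLB-miss handler into reorder-buffer
 * (ROB) free space instead of flushing the pipeline.  All sizes and
 * the head/tail placement policy are illustrative assumptions. */
#include <stdbool.h>
#include <stdio.h>

#define ROB_SIZE     32
#define HANDLER_LEN  8    /* assumed handler length; first-level TLB handlers are short */

typedef enum { SLOT_FREE, SLOT_USER, SLOT_HANDLER } slot_t;

static slot_t rob[ROB_SIZE];
static int head = 0, tail = 0, count = 0;   /* tail = next free slot */

static bool inline_handler(bool prepend)
{
    if (count + HANDLER_LEN > ROB_SIZE)
        return false;                 /* not enough free space: fall back to a flush */
    if (prepend) {
        /* place handler entries just ahead of the current head */
        for (int i = 0; i < HANDLER_LEN; i++) {
            head = (head - 1 + ROB_SIZE) % ROB_SIZE;
            rob[head] = SLOT_HANDLER;
        }
    } else {
        /* append handler entries at the tail, after in-flight user code */
        for (int i = 0; i < HANDLER_LEN; i++) {
            rob[tail] = SLOT_HANDLER;
            tail = (tail + 1) % ROB_SIZE;
        }
    }
    count += HANDLER_LEN;
    return true;
}

int main(void)
{
    /* Ten user instructions already in flight. */
    for (int i = 0; i < 10; i++) { rob[tail] = SLOT_USER; tail = (tail + 1) % ROB_SIZE; count++; }
    printf("append fits:  %s\n", inline_handler(false) ? "yes" : "no (flush)");
    printf("prepend fits: %s\n", inline_handler(true)  ? "yes" : "no (flush)");
    printf("ROB occupancy: %d/%d\n", count, ROB_SIZE);
    return 0;
}
```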
  • Item
    Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling
    (2007-02) Ganesh, Brinda; Jaleel, Aamer; Wang, David; Jacob, Bruce
    Performance gains in memory have traditionally been obtained by increasing memory bus widths and speeds. The diminishing returns of such techniques have led to the proposal of an alternate architecture, the Fully-Buffered DIMM. This new standard replaces the conventional memory bus with a narrow, high-speed interface between the memory controller and the DIMMs. This paper examines how traditional DDRx-based memory controller policies for scheduling and row buffer management perform on a Fully-Buffered DIMM memory architecture. The split-bus architecture used by FBDIMM systems results in an average improvement of 7% in latency and 10% in bandwidth at higher utilizations. On the other hand, at lower utilizations, the increased cost of serialization results in a degradation in latency and bandwidth of 25% and 10% respectively. The split-bus architecture also makes system performance sensitive to the ratio of read and write traffic in the workload. In larger configurations, we found that FBDIMM system performance was more sensitive to the usage of the FBDIMM links than to DRAM bank availability. In general, FBDIMM performance is similar to that of DDRx systems, and it provides better performance characteristics at higher utilization, making it a relatively inexpensive mechanism for scaling capacity at higher bandwidth requirements. The mechanism is also largely insensitive to scheduling policies, provided certain ground rules are obeyed.
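    A back-of-the-envelope sketch of the serialization cost that dominates at low utilization: a narrow link needs many more transfers per cache line and adds a fixed pass-through delay at each buffered DIMM between the controller and the target. All widths, signaling rates, and delays below are assumed round numbers for illustration, not measured FB-DIMM or DDRx parameters.

```c
/* Toy latency comparison: wide parallel bus vs. narrow serialized link. */
#include <stdio.h>

int main(void)
{
    const double line_bytes = 64.0;   /* one cache-line transfer */

    /* Wide, slower parallel bus (assumed 8 bytes/transfer at 800 MT/s). */
    double ddr_mts        = 800e6;
    double ddr_bytes_xfer = 8.0;
    double ddr_ns = line_bytes / ddr_bytes_xfer / ddr_mts * 1e9;

    /* Narrow, faster serial link plus an assumed fixed pass-through
     * delay per buffered DIMM between the controller and the target. */
    double fbd_mts        = 4800e6;
    double fbd_bytes_xfer = 1.0;
    double fbd_hop_ns     = 5.0;
    int    hops           = 2;
    double fbd_ns = line_bytes / fbd_bytes_xfer / fbd_mts * 1e9 + hops * fbd_hop_ns;

    printf("wide parallel bus transfer : %.1f ns\n", ddr_ns);
    printf("narrow serialized link     : %.1f ns\n", fbd_ns);
    return 0;
}
```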
  • Item
    DRAMsim: A Memory System Simulator
    (ACM (Association for Computing Machinery) Publications, 2005-09) Wang, David; Ganesh, Brinda; Tuaycharoen, Nuengwong; Baynes, Kathleen; Jaleel, Aamer; Jacob, Bruce
    As memory accesses become slower with respect to the processor and consume more power with increasing memory size, memory system performance and power consumption have become increasingly important concerns. With the trend toward multi-threaded, multi-core processors, the demands on the memory system will continue to scale. However, determining the optimal memory system configuration is non-trivial. Memory system performance is sensitive to a large number of parameters. Each of these parameters takes on a number of values and interacts with the others in ways that make overall trends difficult to discern. A comparison of memory system architectures becomes even harder when we add the dimensions of power consumption and manufacturing cost. Unfortunately, there is a lack of tools in the public domain that support such studies. Therefore, we introduce DRAMsim, a detailed and highly configurable C-based memory system simulator, to fill this gap. DRAMsim implements detailed timing models for a variety of existing memories, including SDRAM, DDR, DDR2, DRDRAM and FB-DIMM, with the capability to easily vary their parameters. It also models the power consumption of SDRAM and its derivatives. It can be used as a standalone simulator or as part of a more comprehensive system-level model. We have successfully integrated DRAMsim into a variety of simulators including MASE[15], Sim-alpha[14], BOCHS[2] and GEMS[13]. The simulator can be downloaded from www.ece.umd.edu/dramsim.
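    To give a flavor of the timing detail such a simulator has to track, the sketch below models a single DRAM bank with an open-row policy: a read to the currently open row pays only the CAS latency, while a bank conflict pays precharge and activate as well. The address split and the tRP/tRCD/tCL values are generic illustrative numbers and are not taken from DRAMsim's timing tables.

```c
/* Toy single-bank, open-row DRAM timing model (not DRAMsim code). */
#include <stdint.h>
#include <stdio.h>

#define ROW_SHIFT 13        /* assumed 8 KB rows: low 13 bits select the column/offset */
#define tRP   15            /* precharge,        in memory cycles (illustrative) */
#define tRCD  15            /* activate-to-read, in memory cycles (illustrative) */
#define tCL   15            /* CAS latency,      in memory cycles (illustrative) */

static int64_t open_row = -1;   /* row currently held in the bank's row buffer */

static int read_latency(uint64_t addr)
{
    int64_t row = (int64_t)(addr >> ROW_SHIFT);
    if (row == open_row)
        return tCL;                           /* row-buffer hit               */
    int lat = (open_row < 0) ? tRCD + tCL     /* bank idle: activate + read   */
                             : tRP + tRCD + tCL; /* conflict: precharge first */
    open_row = row;
    return lat;
}

int main(void)
{
    uint64_t trace[] = { 0x0000, 0x0040, 0x4000, 0x4080 };
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
        printf("read 0x%04llx -> %d cycles\n",
               (unsigned long long)trace[i], read_latency(trace[i]));
    return 0;
}
```

    A full simulator extends this kind of bookkeeping across many banks, ranks, and channels, layers scheduling and row-buffer-management policies on top, and accounts for bus and power constraints, which is what makes the configuration space described above so hard to explore by hand.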