A. James Clark School of Engineering
Permanent URI for this community: http://hdl.handle.net/1903/1654
The collections in this community comprise faculty research works, as well as graduate theses and dissertations.
Search Results
4 results
Item: Using Locality and Interleaving Information to Improve Shared Cache Performance (2009)
Liu, Wanli; Yeung, Donald; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Cache interference is found to play a critical role in optimizing cache allocation among concurrent threads for a shared cache. The conventional LRU policy usually works well for low-interference workloads, while high cache interference among threads demands explicit allocation regulation, such as cache partitioning. Cache interference is shown to be tied to inter-thread memory reference interleaving granularity: high interference is caused by fine-grain interleaving, while low interference is caused by coarse-grain interleaving. Profiling of real multi-program workloads shows that cache set mapping and temporal phasing result in the variation of interleaving granularity. When memory references from different threads map to disjoint cache sets, or occur in distinct time windows, they tend to cause little interference due to coarse-grain interleaving. The interleaving granularity, measured by run length in workloads, is found to correlate with the preferred cache management policy: fine-grain interleaving workloads perform better with cache partitioning, and coarse-grain interleaving workloads perform better with LRU. Most existing shared cache management techniques are based on working set locality analysis. This dissertation studies shared cache performance by taking both locality and interleaving information into consideration. An oracle algorithm, which provides the theoretical best performance, is investigated to provide insight into how to design a better practical policy. Profiling and analysis of the oracle algorithm lead to the proposal of probabilistic replacement (PR), a novel cache allocation policy. With aggressor-thread information learned on-line, PR probabilistically evicts the bad-locality blocks of aggressor threads while preserving the good-locality blocks of non-aggressor threads. PR is shown to be able to adapt to the different interleaving granularities in different sets over time. Its flexibility in tuning the eviction probability also improves fairness among threads. Evaluation indicates that PR outperforms LRU, UCP, and ideal cache partitioning at moderate hardware cost. For single-program cache management, this dissertation also proposes a novel technique: the reuse distance last touch predictor (RD-LTP). RD-LTP is able to capture reuse distance information, which represents the intrinsic memory reference pattern. Based on this improved LT predictor, an MRU LT eviction policy is developed to select the right victim in the presence of incorrect LT predictions. In addition to the LT predictor, another predictor, the reuse distance predictor (RDP), is proposed, which is able to predict actual reuse distance values. Compared to various existing cache management techniques, these two novel predictors deliver higher cache performance with higher prediction coverage and accuracy at moderate hardware cost.
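A minimal sketch of how a probabilistic-replacement-style victim selection could be layered on LRU, in the spirit of the PR policy described in the abstract above. The function name, data structures, and the fixed eviction probability are illustrative assumptions; the dissertation's actual policy learns aggressor threads on-line and adapts per set over time.

/*
 * Sketch of PR-style victim selection: with some probability, evict the
 * coldest block owned by an aggressor thread; otherwise fall back to LRU.
 * All names and parameters here are assumptions for illustration.
 */
#include <stdbool.h>
#include <stdlib.h>

#define WAYS 8

typedef struct {
    int      owner_tid;   /* thread that brought the block in      */
    unsigned lru_rank;    /* 0 = MRU ... WAYS-1 = LRU               */
    bool     valid;
} cache_block_t;

/* Returns the way index to evict from one set. */
int pr_select_victim(cache_block_t set[WAYS],
                     const bool is_aggressor[],   /* indexed by tid */
                     double evict_prob)           /* tuning knob    */
{
    int lru_any = -1, lru_aggr = -1;

    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid)
            return w;                              /* free way first */
        if (lru_any < 0 || set[w].lru_rank > set[lru_any].lru_rank)
            lru_any = w;
        if (is_aggressor[set[w].owner_tid] &&
            (lru_aggr < 0 || set[w].lru_rank > set[lru_aggr].lru_rank))
            lru_aggr = w;
    }

    /* With probability evict_prob, sacrifice an aggressor's coldest
     * block; otherwise behave like plain LRU. */
    if (lru_aggr >= 0 && (double)rand() / RAND_MAX < evict_prob)
        return lru_aggr;
    return lru_any;
}

Tuning evict_prob per set and over time is one plausible way to realize the fairness and adaptivity the abstract attributes to PR; the real mechanism for learning aggressor threads is described in the dissertation itself.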
Item: An Analytical Model for Designing Memory Hierarchies (1996-10)
Jacob, Bruce; Chen, Peter M.; Silverman, Seth R.; Mudge, Trevor N.
Memory hierarchies have long been studied by many means: system building, trace-driven simulation, and mathematical analysis. Yet little help is available for the system designer wishing to quickly size the different levels in a memory hierarchy to a first-order approximation. In this paper, we present a simple analysis for providing this practical help and some unexpected results and intuition that come out of the analysis. By applying a specific, parameterized model of workload locality, we are able to derive a closed-form solution for the optimal size of each hierarchy level. We verify the accuracy of this solution against exhaustive simulation with two case studies: a three-level I/O storage hierarchy and a three-level processor-cache hierarchy. In all but one case, the configuration recommended by the model performs within 5% of optimal. One result of our analysis is that the first place to spend money is the cheapest (rather than the fastest) cache level, particularly with small system budgets. Another is that money spent on an n-level hierarchy is spent in a fixed proportion until another level is added.

Item: Cache Design for Embedded Real-Time Systems (1999-06-30)
Jacob, Bruce
Caches have long been a mechanism for speeding memory access and are popular in embedded hardware architectures from microcontrollers to core-based ASIC designs. However, caches are considered ill-suited for embedded real-time systems because they provide a probabilistic performance boost: a cache may or may not contain the desired data at any given moment. Analysis that guarantees when an item will or will not be in the cache has proven difficult, so many real-time systems simply disable caching and schedule tasks based on worst-case memory access time. Yet there are several cache organizations that provide the benefit of caching without the real-time drawbacks of hardware-managed caches. These are software-managed caches, and several different examples can be found, from DSP-style on-chip RAM to academic designs. This paper compares the operation and organization of caches as found in general-purpose processors, microcontrollers, and DSPs; it also discusses designs for embedded real-time systems.
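An illustrative sketch of the software-managed-cache idea from the preceding abstract: the program, not the hardware, decides what is resident in fast on-chip RAM, so worst-case access time is known. The section name, the memcpy-based fill, and the table sizes are assumptions for illustration, not details taken from the paper.

/*
 * Software-managed on-chip RAM: the data needed by a time-critical
 * routine is copied in explicitly, so every later access has a fixed,
 * analyzable latency instead of a probabilistic cache hit/miss.
 */
#include <string.h>
#include <stdint.h>

#define TABLE_LEN 256

/* Slow off-chip copy of the data (e.g., in external DRAM or flash). */
static const uint16_t coeff_rom[TABLE_LEN] = { 0 /* filter coefficients */ };

/* Fast on-chip RAM region; the linker-section attribute is hypothetical
 * and would be target-specific. */
static uint16_t coeff_fast[TABLE_LEN] /* __attribute__((section(".onchip"))) */;

void prepare_critical_task(void)
{
    /* Explicit fill replaces the hardware cache fill: after this call,
     * accesses to coeff_fast hit on-chip RAM with known latency. */
    memcpy(coeff_fast, coeff_rom, sizeof coeff_fast);
}

uint32_t critical_filter(const uint16_t *samples, int n)
{
    uint32_t acc = 0;
    for (int i = 0; i < n && i < TABLE_LEN; i++)
        acc += (uint32_t)samples[i] * coeff_fast[i];  /* deterministic latency */
    return acc;
}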
Item: Compiler-Decided Dynamic Memory Allocation for Scratch-Pad Based Embedded Systems (2006-07-27)
Udayakumaran, Sumesh; Barua, Rajeev; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
In this research we propose a highly predictable, low-overhead, and yet dynamic memory allocation strategy for embedded systems with scratch-pad memory. A scratch-pad is a fast compiler-managed SRAM memory that replaces the hardware-managed cache. It is motivated by its better real-time guarantees versus a cache and by its significantly lower overheads in energy consumption, area, and overall runtime, even with a simple allocation scheme. Scratch-pad allocation methods are primarily of two types. First, software-caching schemes emulate the workings of a hardware cache in software. Instructions are inserted before each load/store to check the software-maintained cache tags. Such methods incur large overheads in runtime, code size, energy consumption, and SRAM space for tags, and deliver poor real-time guarantees, just like hardware caches. A second category of algorithms partitions variables at compile-time into the two banks. However, a drawback of such static allocation schemes is that they do not account for dynamic program behavior. We propose a dynamic allocation methodology for global and stack data and program code that (i) accounts for changing program requirements at runtime, (ii) has no software-caching tags, (iii) requires no run-time checks, (iv) has extremely low overheads, and (v) yields 100% predictable memory access times. In this method, data that is about to be accessed frequently is copied into the scratch-pad using compiler-inserted code at fixed and infrequent points in the program. Earlier data is evicted if necessary. When compared to an existing static allocation scheme, results show that our scheme reduces runtime by up to 39.8% and energy by up to 31.3% on average for our benchmarks, depending on the SRAM size used. The actual gain depends on the SRAM size, but our results show that close to the maximum benefit in run-time and energy is achieved for a substantial range of small SRAM sizes commonly found in embedded systems. Our comparison with a direct mapped cache shows that our method performs roughly as well as a cached architecture in runtime and energy while delivering better real-time benefits.
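A sketch of the kind of code a compiler might insert under the scratch-pad allocation scheme described above: at a fixed program point just before a frequently executed region, the chosen data is copied into scratch-pad SRAM and accesses inside the region are retargeted to the copy. The array names, the scratch-pad stand-in, and the write-back step are illustrative assumptions, not the paper's generated code.

/*
 * Compiler-inserted scratch-pad management around a hot region:
 * prologue copies the data in, the loop accesses the fast copy,
 * epilogue writes back / evicts when the region is done.
 */
#include <string.h>

#define N 1024

int samples[N];                        /* lives in slow DRAM             */
static int spm_samples[N];             /* stands in for scratch-pad SRAM */

long hot_region(void)
{
    /* --- compiler-inserted prologue: bring data into scratch-pad --- */
    memcpy(spm_samples, samples, sizeof spm_samples);

    long sum = 0;
    for (int pass = 0; pass < 100; pass++)         /* frequent accesses  */
        for (int i = 0; i < N; i++) {
            spm_samples[i] += pass;                /* retargeted access  */
            sum += spm_samples[i];
        }

    /* --- compiler-inserted epilogue: write back / evict if needed --- */
    memcpy(samples, spm_samples, sizeof samples);
    return sum;
}

Because the copy points are fixed and infrequent, the cost of the memcpy calls is known at compile time and every access inside the loop hits SRAM, which is what gives the method its 100% predictable access times.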