Technical Reports from UMIACS
Permanent URI for this collection: http://hdl.handle.net/1903/7
Search Results (7 items)
Item: Pipelined CPU-GPU Scheduling for Caches (2021-03-23)
Gerzhoy, Daniel; Yeung, Donald

Heterogeneous microprocessors integrate a CPU and GPU with a shared cache hierarchy on the same chip, affording low-overhead communication between the CPU and GPU cores. Oftentimes, large array data structures are communicated from the CPU to the GPU and back. While the on-chip cache hierarchy can support such CPU-GPU producer-consumer sharing, this almost never happens due to poor temporal reuse. Because the data structures can be quite large, by the time the consumer reads the data, it has been evicted from the cache even though the producer had brought it on-chip when it originally wrote the data. As a result, the CPU-GPU communication happens through main memory instead of the cache, hurting performance and energy efficiency. This paper exploits the on-chip caches in a heterogeneous microprocessor to improve CPU-GPU communication efficiency. We divide streaming computations executed by the CPU and GPU that exhibit producer-consumer sharing into chunks, and overlap the execution of CPU chunks with GPU chunks in a software pipeline. To enforce data dependences, the producer executes one chunk ahead of the consumer at all times. We also propose a low-overhead synchronization mechanism in which the CPU directly controls thread-block scheduling in the GPU to maintain the producer's "run-ahead distance" relative to the consumer. By adjusting the chunk size or run-ahead distance, we can make the CPU-GPU working set fit in the last-level cache, thus permitting the producer-consumer sharing to occur through the LLC. We show through simulation that our technique reduces the number of DRAM accesses by 30.4%, improves performance by 26.8%, and lowers memory system energy by 27.4%, averaged across 7 benchmarks.
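The chunked, software-pipelined producer-consumer pattern described above can be illustrated with a small host-only sketch. The version below uses two CPU threads and C++20 counting semaphores to keep the producer a bounded number of chunks ahead of the consumer; the chunk size, run-ahead distance, and loop bodies are illustrative placeholders, and a real implementation would launch GPU thread blocks for the consumer side as the paper describes.

// Sketch: chunked producer-consumer pipelining with a bounded run-ahead distance.
// Requires C++20 (std::counting_semaphore). The GPU kernel is stood in for by a
// host-side thread; chunk size and run-ahead distance are illustrative values.
#include <semaphore>
#include <thread>
#include <vector>
#include <cstdio>

int main() {
    constexpr int kChunks = 64;          // number of pipeline chunks
    constexpr int kChunkElems = 1 << 16; // elements per chunk (tuned so the
                                         // pipeline's working set fits in the LLC)
    constexpr int kRunAhead = 1;         // producer works this many chunks ahead

    std::vector<float> buf(static_cast<size_t>(kChunks) * kChunkElems);

    // 'credits' bounds how far the producer may run ahead of the consumer's
    // current chunk; 'ready' counts chunks the producer has finished writing.
    std::counting_semaphore<kChunks> credits(kRunAhead + 1);
    std::counting_semaphore<kChunks> ready(0);

    std::thread producer([&] {               // e.g., the CPU side of the computation
        for (int c = 0; c < kChunks; ++c) {
            credits.acquire();               // stay within the run-ahead window
            for (int i = 0; i < kChunkElems; ++i)
                buf[(size_t)c * kChunkElems + i] = float(c + i);  // "produce" chunk c
            ready.release();                 // chunk c is now visible to the consumer
        }
    });

    std::thread consumer([&] {               // e.g., the GPU kernel, chunk by chunk
        double sum = 0.0;
        for (int c = 0; c < kChunks; ++c) {
            ready.acquire();                 // wait for chunk c to be produced
            for (int i = 0; i < kChunkElems; ++i)
                sum += buf[(size_t)c * kChunkElems + i];          // "consume" chunk c
            credits.release();               // free a slot so the producer may advance
        }
        std::printf("checksum = %f\n", sum);
    });

    producer.join();
    consumer.join();
    return 0;
}

Because at most kRunAhead + 1 chunks are in flight at any time, the data handed from producer to consumer can remain resident in the shared LLC rather than round-tripping through DRAM, which is the effect the paper targets.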
Item: Design and Evaluation of Monolithic Computers Implemented Using Crossbar ReRAM (2019-07-16)
Jagasivamani, Meenatchi; Walden, Candace; Singh, Devesh; Li, Shang; Kang, Luyi; Asnaashari, Mehdi; Dubois, Sylvain; Jacob, Bruce; Yeung, Donald

A monolithic computer is an emerging architecture in which a multicore CPU and a high-capacity main memory system are all integrated in a single die. We believe such architectures will be possible in the near future due to nonvolatile memory technology, such as the resistive random access memory, or ReRAM, from Crossbar Incorporated. Crossbar's ReRAM can be fabricated in a standard CMOS logic process, allowing it to be integrated into a CPU's die. The ReRAM cells are manufactured in between metal wires and do not employ per-cell access transistors, leaving the bulk of the base silicon area vacant. This means that a CPU can be monolithically integrated directly underneath the ReRAM memory, allowing the cores to have massively parallel access to the main memory. This paper presents the characteristics of Crossbar's ReRAM technology, informing architects on how ReRAM can enable monolithic computers. Then, it develops a CPU and memory system architecture around those characteristics, especially to exploit the unprecedented memory-level parallelism. The architecture employs a tiled CPU, and incorporates memory controllers into every compute tile that support a variable access granularity to enable high scalability. Lastly, the paper conducts an experimental evaluation of monolithic computers on graph kernels and streaming computations. Our results show that, compared to a DRAM-based tiled CPU, a monolithic computer achieves 4.7x higher performance on the graph kernels and roughly matches the conventional system on the streaming computations. Given a future 7nm technology node, a monolithic computer could outperform the conventional system by 66% for the streaming computations.

Item: Exploiting Multi-Loop Parallelism on Heterogeneous Microprocessors (2016-11-10)
Zuzak, Michael; Yeung, Donald

Heterogeneous microprocessors integrate CPUs and GPUs on the same chip, providing fast CPU-GPU communication and enabling cores to compute on data "in place." These advantages will permit integrated GPUs to exploit a smaller unit of parallelism. But one challenge will be exposing sufficient parallelism to keep all of the on-chip compute resources fully utilized. In this paper, we argue that integrated CPU-GPU chips should exploit parallelism from multiple loops simultaneously. One example of this is nested parallelism, in which one or more inner SIMD loops are nested underneath a parallel outer (non-SIMD) loop. By scheduling the parallel outer loop on multiple CPU cores, multiple dynamic instances of the inner SIMD loops can be scheduled on the GPU cores. This boosts GPU utilization and parallelizes the non-SIMD code. Our preliminary results show that exploiting such multi-loop parallelism provides a 3.12x performance gain over exploiting parallelism from individual loops one at a time.
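The nested-parallelism pattern described above maps naturally onto OpenMP. The sketch below assumes a compiler with OpenMP 4.5+ target offload support and uses placeholder loop bodies and sizes; it is meant only to show the shape of the scheme, with the outer non-SIMD loop spread across CPU threads and each thread launching its own inner SIMD kernel on the GPU.

// Sketch: a parallel outer (non-SIMD) loop scheduled across CPU cores, with each
// CPU thread launching a dynamic instance of an inner SIMD loop on the GPU.
// Assumes OpenMP target offload support; falls back to host execution otherwise.
#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
    const int kOuter = 64;        // outer, non-SIMD parallel loop (runs on CPU cores)
    const int kInner = 1 << 14;   // inner SIMD loop offloaded to GPU cores
    std::vector<double> result(kOuter, 0.0);
    double* res = result.data();

    // Outer loop: per-iteration work is scheduled across the CPU threads.
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < kOuter; ++i) {
        double partial = 0.0;

        // Inner loop: a data-parallel (SIMD-style) kernel offloaded to the GPU.
        // Multiple CPU threads keep multiple such kernels in flight at once,
        // which is what boosts GPU utilization in the multi-loop scheme.
        #pragma omp target teams distribute parallel for \
                reduction(+:partial) map(tofrom:partial)
        for (int j = 0; j < kInner; ++j) {
            partial += (double)(i + 1) / (double)(j + 1);
        }

        res[i] = partial;   // non-SIMD post-processing stays on the CPU
    }

    std::printf("result[0] = %f\n", result[0]);
    return 0;
}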
Item: Studying Directory Access Patterns via Reuse Distance Analysis and Evaluating Their Impact on Multi-Level Directory Caches (2014-01-13)
Zhao, Minshu; Yeung, Donald

The trend for multicore CPUs is towards increasing core count. One of the key limiters to scaling will be the on-chip directory cache. Our work investigates moving portions of the directory away from the cores, perhaps to off-chip DRAM, where ample capacity exists. While such multi-level directory caches exhibit increased latency, several aspects of directory accesses will shield CPU performance from the slower directory, including low access frequency and latency hiding underneath data accesses to main memory. While multi-level directory caches have been studied previously, no work has yet comprehensively quantified the directory access patterns themselves, making it difficult to understand multi-level behavior in depth. This paper presents a framework based on multicore reuse distance for studying directory cache access patterns. Using our analysis framework, we show that between 69% and 93% of directory entries are looked up only once or twice during their lifetimes in the directory cache, and that between 51% and 71% of dynamic directory accesses are latency tolerant. Using cache simulations, we show that a very small L1 directory cache can service 80% of latency-critical directory lookups. Although a significant number of directory lookups and eviction notifications must access the slower L2 directory cache, virtually all of these are latency tolerant.

Item: Symbiotic Cache Resizing for CMPs with Shared LLC (2013-09-11)
Choi, Inseok; Yeung, Donald

This paper investigates the problem of finding the optimal sizes of the private caches and the shared LLC in CMPs. Resizing private and shared caches in modern CMPs is one way to squeeze wasteful power consumption out of architectures to improve power efficiency. However, shrinking each private/shared cache has a different impact on the CMP's performance loss and power savings, because each cache contributes differently to performance and power. For both performance and power, it is best to shrink the LRU way of the private/shared cache that saves the most power and increases data traffic the least. This paper presents Symbiotic Cache Resizing (SCR), a runtime technique that reduces the total power consumption of the on-chip cache hierarchy in CMPs with a shared LLC. SCR turns off private/shared-cache ways in an inter-core and inter-level manner so that each disabled way achieves the best power savings while maintaining high performance. SCR finds such optimal cache sizes by utilizing greedy algorithms that we develop in this study. In particular, Prioritized Way Selection picks the most power-inefficient way. LLC-Partitioning-aware Prioritized Way Selection finds optimal cache sizes from a multi-level perspective. Lastly, Weighted Threshold Throttling finds the optimal threshold per cache level. We evaluate SCR in two-core, four-core, and eight-core systems. Results show that SCR saves 13% power in the on-chip cache hierarchy and 4.2% power in the system compared to an even LLC partitioning technique. SCR saves 2.7x more power in the cache hierarchy than the state-of-the-art LLC resizing technique while achieving better performance.
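The greedy selection at the heart of SCR can be sketched in a few lines. The version below is only a stand-in: it assumes per-way estimates of power saved and added miss traffic are already available (in SCR these would come from runtime monitoring), and the priority metric, threshold, and sample data are illustrative rather than the paper's.

// Sketch: greedy, prioritized selection of cache ways to disable across cores
// and cache levels. Each candidate way carries an estimated power saving and an
// estimated increase in miss traffic (stand-ins for runtime measurements); the
// loop disables the most power-inefficient ways first until a traffic budget
// (a stand-in for SCR's performance threshold) is exhausted.
#include <algorithm>
#include <string>
#include <vector>
#include <cstdio>

struct WayCandidate {
    std::string cache;     // e.g., "core0-L1D", "shared-LLC" (hypothetical names)
    int way;               // way index within that cache
    double power_saved;    // estimated power saved if disabled (mW)
    double extra_traffic;  // estimated additional misses per kilo-instruction
};

int main() {
    // Hypothetical per-way estimates.
    std::vector<WayCandidate> candidates = {
        {"core0-L1D", 3, 2.0, 0.05}, {"core1-L1D", 3, 2.0, 0.30},
        {"shared-LLC", 15, 9.0, 0.02}, {"shared-LLC", 14, 9.0, 0.40},
    };

    const double kTrafficBudget = 0.50;  // max tolerated extra MPKI (illustrative)
    double traffic_used = 0.0;

    // Prioritize the most power-inefficient ways: highest saving per unit of
    // added traffic first (a simple stand-in for SCR's priority metric).
    std::sort(candidates.begin(), candidates.end(),
              [](const WayCandidate& a, const WayCandidate& b) {
                  return a.power_saved / (a.extra_traffic + 1e-9) >
                         b.power_saved / (b.extra_traffic + 1e-9);
              });

    for (const WayCandidate& c : candidates) {
        if (traffic_used + c.extra_traffic > kTrafficBudget) continue;
        traffic_used += c.extra_traffic;
        std::printf("disable %s way %d (saves %.1f mW, +%.2f MPKI)\n",
                    c.cache.c_str(), c.way, c.power_saved, c.extra_traffic);
    }
    return 0;
}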
Item: Understanding Multicore Cache Behavior of Loop-based Parallel Programs via Reuse Distance Analysis (2012-01-17)
Wu, Meng-Ju; Yeung, Donald

Understanding multicore memory behavior is crucial, but can be challenging due to the cache hierarchies employed in modern CPUs. In today's hierarchies, performance is determined by complex thread interactions, such as interference in shared caches and replication and communication in private caches. Researchers normally perform simulation to sort out these interactions, but this can be costly and not very insightful. An alternative is reuse distance (RD) analysis. RD analysis for multicore processors is becoming feasible because recent research has developed new notions of reuse distance that can analyze thread interactions. In particular, concurrent reuse distance (CRD) models shared cache interference, while private-stack reuse distance (PRD) models private cache replication and communication. Previous multicore RD research has centered on developing techniques and verifying accuracy. In this paper, we apply multicore RD analysis to better understand memory behavior. We focus on loop-based parallel programs, an important class of programs for which RD analysis provides high accuracy. First, we develop techniques to isolate thread interactions, permitting analysis of their relative contributions. Then, we use our techniques to extract several new insights that can help architects optimize multicore cache hierarchies. One of our findings is that data sharing in parallel loops varies with reuse distance, becoming significant only at larger RD values. This implies that capacity sharing in shared caches and replication/communication in private caches occur only beyond some capacity. We define Cshare to be the turn-on capacity for data sharing, and study its impact on private vs. shared cache performance. In addition, we find that machine scaling degrades locality at smaller RD values and increases sharing frequency (i.e., reduces Cshare). We characterize how these effects vary with core count, and study their impact on the preference for private vs. shared caches.

Item: Memory Performance Analysis for Parallel Programs Using Concurrent Reuse Distance (2010-10-05)
Wu, Meng-Ju; Yeung, Donald

Performance on multicore processors is determined largely by the on-chip cache. Computer architects have conducted numerous studies in the past that vary core count, cache capacity, and problem size to understand their impact on cache behavior. These studies are very costly due to the combinatorial design spaces they must explore. Reuse distance (RD) analysis can help architects explore multicore cache performance more efficiently. One problem, however, is that multicore RD analysis requires measuring concurrent reuse distance (CRD) profiles across thread-interleaved memory reference streams. Sensitivity to memory interleaving makes CRD profiles architecture dependent, undermining the benefits of RD analysis. But for parallel programs with symmetric threads, CRD profiles vary with architecture tractably: they change only slightly with cache capacity scaling, and shift predictably to larger CRD values with core count scaling. This enables analysis of a large number of multicore configurations from a small set of measured CRD profiles. This paper investigates using RD analysis to efficiently analyze multicore cache performance for parallel programs, making several contributions. First, we characterize how CRD profiles change with core count and cache capacity. One of our findings is that core count scaling degrades locality, but the degradation only impacts last-level caches (LLCs) below 16MB for our benchmarks and problem sizes, increasing to 128MB if problem size scales by 64x. Second, we apply reference groups to predict CRD profiles across core count scaling, and evaluate prediction accuracy. Finally, we use CRD profiles to analyze multicore cache performance. We find that predicted CRD profiles can estimate LLC MPKI within 76% of simulation for configurations without pathological cache conflicts, in 1/1200th of the time needed to simulate the full design space.
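The distinction between CRD and PRD used in the two reports above can be sketched with plain LRU stacks. The code below computes CRD over a round-robin interleaving of per-thread address traces and PRD with one stack per thread; the tiny traces and the fixed interleaving are placeholders for what the papers' profilers measure on real parallel programs.

// Sketch: LRU stack-distance measurement for concurrent reuse distance (CRD),
// computed over an interleaving of per-thread traces, versus private-stack
// reuse distance (PRD), computed with one stack per thread.
#include <cstdint>
#include <cstdio>
#include <list>
#include <vector>

// Returns the stack distance of 'addr' (number of distinct addresses touched
// since its last use), or -1 for a cold access, and moves 'addr' to the top.
static long access(std::list<uint64_t>& stack, uint64_t addr) {
    long depth = 0;
    for (auto it = stack.begin(); it != stack.end(); ++it, ++depth) {
        if (*it == addr) {
            stack.erase(it);
            stack.push_front(addr);
            return depth;
        }
    }
    stack.push_front(addr);
    return -1;  // first touch: infinite reuse distance
}

int main() {
    // Two symmetric threads touching overlapping block addresses (illustrative).
    std::vector<std::vector<uint64_t>> trace = {
        {0, 1, 2, 0, 1, 2},   // thread 0
        {2, 3, 4, 2, 3, 4},   // thread 1
    };

    std::list<uint64_t> crd_stack;                             // one shared stack (CRD)
    std::vector<std::list<uint64_t>> prd_stack(trace.size());  // one stack per thread (PRD)

    // Round-robin interleaving stands in for a real thread-interleaved stream.
    for (size_t i = 0; i < trace[0].size(); ++i) {
        for (size_t t = 0; t < trace.size(); ++t) {
            uint64_t addr = trace[t][i];
            long crd = access(crd_stack, addr);
            long prd = access(prd_stack[t], addr);
            std::printf("thread %zu addr %llu: CRD=%ld PRD=%ld\n",
                        t, (unsigned long long)addr, crd, prd);
        }
    }
    return 0;
}

In the output, CRD exceeds PRD whenever the other thread's references dilate a reuse, which is exactly the shared-cache interference the CRD profile captures and the core-count-scaling studies quantify.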