Theses and Dissertations from UMD

Permanent URI for this communityhttp://hdl.handle.net/1903/2

New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be up to a 4 month delay in the appearance of a give thesis/dissertation in DRUM

More information is available at Theses and Dissertations at University of Maryland Libraries.

Browse

Search Results

Now showing 1 - 2 of 2
  • Thumbnail Image
    Item
    On Efficient GPGPU Computing for Integrated Heterogeneous CPU-GPU Microprocessors
    (2021) Gerzhoy, Daniel; Yeung, Donald; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Heterogeneous microprocessors which integrate a CPU and GPU on a single chip provide low-overhead CPU-GPU communication and permit sharing of on-chip resources that a traditional discrete GPU would not have direct access to. These features allow for the optimization of codes that heretofore would be suitable only for multi-core CPUs or discrete GPUs to be run on a heterogeneous CPU-GPU microprocessor efficiently and in some cases- with increased performance. This thesis discusses previously published work on exploiting nested MIMD-SIMD Parallelization for Heterogeneous microprocessors. We examined loop structures in which one or more regular data parallel loops are nested within a parallel outer loop that can contain irregular code (e.g., with control divergence). By scheduling outer loops on the multicore CPU part of the microprocessor, each thread launches dynamic, independent instances of the inner loop onto the GPU, boosting GPU utilization while simultaneously parallelizing the outer loop. The second portion of the thesis proposal explores heterogeneous producer-consumer data-sharing between the CPU and GPU on the microprocessor. One advantage of tight integration -- the sharing of the on-chip cache system -- could improve the impact that memory accesses have on performance and power. Producer-consumer data sharing commonly occurs between the CPU and GPU portions of programs, but large kernel sizes whose data footprint far exceeds that of a typical CPU cache, cause shared data to be evicted before it is reused. We propose Pipelined CPU-GPU Scheduling for Caches, a locality transformation for producer-consumer relationships between CPUs and GPUs. By intelligently scheduling the execution of the producer and consumer in a software pipeline, evictions can be avoided, saving DRAM accesses, power, and performance. To keep the cached data on chip, we allow the producer to run ahead of the consumer by a certain amount of loop iterations or threads. Choosing this "run-ahead distance" becomes the main constraint in the scheduling of work in this software pipeline, and we provide a method of statically predicting it. We assert that with intelligent scheduling and the hardware and software mechanisms to support it, more workloads can be gainfully executed on integrated heterogeneous CPU-GPU microprocessors than previously assumed.
  • Thumbnail Image
    Item
    myCACTI: A new cache design tool for pipelined nanometer caches
    (2006-11-28) Rodriguez, Samuel; Jacob, Bruce; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    TThe presence of caches in microprocessors has always been one of the most important techniques in bridging the memory wall, or the speed gap between the microprocessor and main memory. This importance is continuously increasing especially as we enter the regime of nanometer process technologies (i.e. 90nm and below), as industry has favored investing a larger and larger fraction of a chip.s transistor budget to improving the on-chip cache. This is the case in practice, as it has proven to be an efficient way to utilize the increasing number of transistors available with each succeeding technology. Consequently, it becomes even more important to have cache design tools that give accurate representations of designs that exist in actual microprocessors. The prevalent cache design tools that are the most widely used in academe are CACTI [Wilton1996] and eCACTI [Mamidipaka2004], and these have proven to be very useful tools not just for cache designers, but also for computer architects. This dissertation will show that both CACTI and eCACTI still contain major limitations and even flaws in their design, making them unsuitable for use in very-deep submicron and nanometer caches, especially pipelined designs. These limitations and flaws will be discussed in detail. This dissertation then introduces a new tool, called myCACTI, that addresses all these limitations and, in addition, introduces major enhancements to the simulation framework. This dissertation then demonstrates the use of myCACTI in the cache design process. Detailed design space explorations are done on multiple cache configurations to produce pareto optimal curves of the caches to show optimal implementations. Detailed studies are also performed to characterize the delay and power dissipation of different cache configurations and implementations. Finally, future directions to the development of myCACTI are identified to show possible ways that the tool can be improved in such a way as to allow even more different kinds of studies to be performed.