A. James Clark School of Engineering
Permanent URI for this community: http://hdl.handle.net/1903/1654
The collections in this community comprise faculty research works, as well as graduate theses and dissertations.
Search Results (8 items)
Item: On Efficient GPGPU Computing for Integrated Heterogeneous CPU-GPU Microprocessors (2021)
Gerzhoy, Daniel; Yeung, Donald; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Heterogeneous microprocessors that integrate a CPU and GPU on a single chip provide low-overhead CPU-GPU communication and permit sharing of on-chip resources that a traditional discrete GPU would not have direct access to. These features allow codes that heretofore were suitable only for multi-core CPUs or discrete GPUs to run efficiently on a heterogeneous CPU-GPU microprocessor, in some cases with increased performance. This thesis first discusses previously published work on exploiting nested MIMD-SIMD parallelization for heterogeneous microprocessors. We examined loop structures in which one or more regular data-parallel loops are nested within a parallel outer loop that can contain irregular code (e.g., with control divergence). By scheduling outer loops on the multicore CPU part of the microprocessor, each thread launches dynamic, independent instances of the inner loop onto the GPU, boosting GPU utilization while simultaneously parallelizing the outer loop. The second portion of the thesis explores heterogeneous producer-consumer data sharing between the CPU and GPU on the microprocessor. One advantage of tight integration, the sharing of the on-chip cache system, can reduce the impact that memory accesses have on performance and power. Producer-consumer data sharing commonly occurs between the CPU and GPU portions of programs, but large kernels whose data footprint far exceeds that of a typical CPU cache cause shared data to be evicted before it is reused. We propose Pipelined CPU-GPU Scheduling for Caches, a locality transformation for producer-consumer relationships between CPUs and GPUs. By intelligently scheduling the execution of the producer and consumer in a software pipeline, evictions can be avoided, saving DRAM accesses and power and improving performance. To keep the cached data on chip, we allow the producer to run ahead of the consumer by a certain number of loop iterations or threads. Choosing this "run-ahead distance" becomes the main constraint in scheduling work in this software pipeline, and we provide a method of statically predicting it. We assert that with intelligent scheduling and the hardware and software mechanisms to support it, more workloads can be gainfully executed on integrated heterogeneous CPU-GPU microprocessors than previously assumed.
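As a rough illustration of the nested MIMD-SIMD pattern described above, the sketch below (generic CUDA C++, not the dissertation's code; the kernel body, data layout, and launch geometry are placeholders) runs an irregular outer loop across CPU threads with OpenMP and lets each CPU thread launch independent instances of a regular inner-loop kernel on its own stream.

// Hypothetical sketch: nested MIMD-SIMD offload, assuming an integrated
// CPU-GPU part where kernel launches are cheap.
#include <cuda_runtime.h>
#include <omp.h>
#include <vector>

// Regular, data-parallel inner loop expressed as a CUDA kernel.
__global__ void inner_loop(float* data, int n, float scale) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= scale;            // placeholder inner-loop body
}

void nested_mimd_simd(std::vector<float*>& chunks, std::vector<int>& sizes) {
    #pragma omp parallel                     // MIMD outer loop on the CPU cores
    {
        cudaStream_t s;
        cudaStreamCreate(&s);                // one stream per CPU thread
        #pragma omp for schedule(dynamic)    // irregular outer iterations
        for (int j = 0; j < (int)chunks.size(); ++j) {
            int n = sizes[j];                // inner trip count varies per iteration
            int threads = 256, blocks = (n + threads - 1) / threads;
            // SIMD inner loop: each CPU thread launches its own kernel instance,
            // keeping the GPU busy with many small, independent launches.
            inner_loop<<<blocks, threads, 0, s>>>(chunks[j], n, 2.0f);
        }
        cudaStreamSynchronize(s);
        cudaStreamDestroy(s);
    }
}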
Item: Advancing the Multi-Solver Paradigm for Overset CFD Toward Heterogeneous Architectures (2019)
Jude, Dylan P; Baeder, James; Aerospace Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
A multi-solver, overset, computational fluid dynamics framework is developed for efficient, large-scale simulation of rotorcraft problems. Two primary features distinguish the developed framework from the current state of the art. First, the framework is designed for heterogeneous compute architectures, making use of both traditional codes run on the Central Processing Unit (CPU) and codes run on the Graphics Processing Unit (GPU). Second, a framework-level implementation of the Generalized Minimal Residual (GMRES) linear solver is used to consider all meshes from all solvers in a single linear system. The developed GPU flow solver and framework are validated against conventional implementations, achieving a 5.35× speedup for a single GPU compared to 24 CPU cores. Similarly, the overset linear solver is compared to traditional techniques, demonstrating that the same order of convergence can be achieved using as few as half the number of iterations. Applications of the developed methods are organized into two chapters. First, the heterogeneous, overset framework is applied to a notional helicopter configuration based on the ROBIN wind tunnel experiments. A tail rotor and hub are added to create a challenging case representative of a realistic, full-rotorcraft simulation. Interactional aerodynamics between the different components is reviewed in detail. The second application chapter focuses on the performance of the overset linear solver for unsteady applications. The GPU solver is used along with an unstructured code to simulate laminar flow over a sphere as well as laminar coaxial rotors designed for a Mars helicopter. In all results, the overset linear solver outperforms the traditional, decoupled approach. Conclusions drawn from both the full-rotorcraft and overset linear solver simulations can significantly improve the modeling of complex rotorcraft aerodynamics.
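The framework-level GMRES idea, treating every mesh from every solver as one linear system, can be pictured with a sketch like the one below (hypothetical interfaces, not the framework's actual API): each solver contributes its block of the global unknown vector plus overset coupling terms, and the resulting matrix-free operator is what a single GMRES solve would iterate on, rather than solving each mesh in a decoupled fashion.

// Illustrative sketch only: composing one framework-level linear operator
// from several overset solvers for a single GMRES solve.
#include <cstddef>
#include <vector>

struct SolverBlock {                         // one mesh/solver in the overset system
    std::size_t offset, size;                // location of its unknowns in the global vector
    // y[offset..] += J_local * x[offset..]  (local Jacobian, applied on CPU or GPU)
    virtual void apply_local(const double* x, double* y) const = 0;
    // y[offset..] += coupling terms interpolated from fringe points on other meshes
    virtual void apply_overset(const double* x, double* y) const = 0;
    virtual ~SolverBlock() = default;
};

// Global matvec handed to a framework-level GMRES routine: every mesh from
// every solver contributes to one linear system.
void framework_matvec(const std::vector<SolverBlock*>& blocks,
                      const double* x, double* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) y[i] = 0.0;
    for (const SolverBlock* b : blocks) b->apply_local(x, y);
    for (const SolverBlock* b : blocks) b->apply_overset(x, y);
}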
Item: Contributions Toward Understanding the Effects of Rotor and Airframe Configurations On Brownout Dust Clouds (2014)
Govindarajan, Bharath Madapusi; Leishman, J. Gordon; Aerospace Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Brownout dust cloud simulations were conducted for rotorcraft undergoing representative landing maneuvers, primarily to examine the effects of different rotor placements and rotor/airframe configurations. The flow field generated by a helicopter rotor operating in ground effect was modeled using an inviscid, incompressible, time-accurate Lagrangian free-vortex method, coupled to a semi-empirical approximation for the boundary layer flow near the ground. A surface singularity method was employed to represent the aerodynamic influence of a fuselage. A rigorous coupling strategy for the free-vortex method was developed to include the effects of rotors operating at different rotational speeds, such as a tail rotor. For the dispersed phase of the flow, particle tracking was used to model the dust cloud based on solutions to a decoupled form of the Basset-Boussinesq-Oseen equations appropriate to dilute gas-particle suspensions in low Reynolds number Stokes flow. Important aspects of particle mobility and uplift in such vortically driven dust flows were modeled, including a threshold-based model for sediment mobility and bombardment effects when previously suspended particles impact the bed and eject new particles. Various techniques were employed to reduce the computational cost of the dust cloud simulations, such as particle clustering and parallel programming using graphics processing units. The predicted flow fields near the ground and the resulting dust clouds during the landing maneuvers were analyzed to better understand the physics behind their development and to examine the differences produced by various rotor and airframe configurations. Metrics based on particle counts and particle velocities in the field of view were developed to help quantify the severity of the computed brownout dust clouds. The presence of both a tail rotor and the fuselage was shown to cause both local and global changes to the aerodynamic environment near the ground and to influence the development of the resulting dust clouds. Studies were also performed to examine the accuracy of the self-induced velocities of vortex filaments by augmenting the straight-line vortex segments with a curved filament correction term. It was found that while curved elements can accurately recover the self-induced velocity in the case of a vortex ring, there are bounds of applicability when they are extended to three-dimensional rotor wakes. Finally, exploratory two-dimensional and three-dimensional studies were performed to examine the effects of blade/particle collisions. The loss in particle kinetic energy during a collision was adopted as a surrogate metric to quantify the extent of potential blade erosion.
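For the dispersed phase, a decoupled BBO/Stokes-drag particle update maps naturally to GPU threading, one particle per thread. The kernel below is a minimal sketch under that assumption (the carrier-velocity sampler, relaxation time tau_p, and forward-Euler integration are simplified placeholders, not the dissertation's implementation).

// Minimal sketch: advancing dust particles with a Stokes-drag-plus-gravity
// form of the decoupled BBO equations, one GPU thread per particle.
#include <cuda_runtime.h>

struct float3v { float x, y, z; };

// Placeholder carrier-phase velocity: a uniform downwash stands in for the
// velocity induced by the free-vortex rotor wake.
__device__ float3v sample_u(const float3v& p) { return {0.f, 0.f, -1.f}; }

__global__ void advance_particles(float3v* pos, float3v* vel, int n,
                                  float tau_p, float dt, float g) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float3v u = sample_u(pos[i]);                // local air velocity at the particle
    // dv/dt = (u - v)/tau_p + gravity  (low-Reynolds-number Stokes drag), forward Euler
    vel[i].x += dt * (u.x - vel[i].x) / tau_p;
    vel[i].y += dt * (u.y - vel[i].y) / tau_p;
    vel[i].z += dt * ((u.z - vel[i].z) / tau_p - g);
    pos[i].x += dt * vel[i].x;                   // advect the particle
    pos[i].y += dt * vel[i].y;
    pos[i].z += dt * vel[i].z;
}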
Item: Optimization Techniques for Mapping Algorithms and Applications onto CUDA GPU Platforms and CPU-GPU Heterogeneous Platforms (2014)
Wu, Jing; JaJa, Joseph F; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
An emerging trend in processor architecture is the doubling of the number of cores per chip every two years with the same or decreased clock speed. Of particular interest to this thesis is the class of many-core processors, which are becoming more attractive due to their high performance, low cost, and low power consumption. The main goal of this dissertation is to develop optimization techniques for mapping algorithms and applications onto CUDA GPUs and CPU-GPU heterogeneous platforms. The Fast Fourier Transform (FFT) constitutes a fundamental tool in computational science and engineering, and hence a GPU-optimized implementation is of paramount importance. We first study the mapping of the 3D FFT onto recent CUDA GPUs and develop a new approach that minimizes the number of global memory accesses and overlaps the computations along the different dimensions. We obtain some of the fastest known implementations of the multi-dimensional FFT. We then present a highly multithreaded FFT-based direct Poisson solver that is optimized for recent NVIDIA GPUs. In addition to the massive multithreading, our algorithm carefully manages the multiple layers of the memory hierarchy so that all global memory accesses are coalesced into 128-byte device memory transactions. As a result, we achieve up to 375 GFLOPS with a bandwidth of 120 GB/s on the GTX 480. We further extend our methodology to CPU-GPU heterogeneous platforms for the case when the input is too large to fit in the GPU global memory, developing optimization techniques for both memory-bound and computation-bound applications. The main challenge here is to minimize data transfer between the CPU memory and the device memory and to overlap these transfers with kernel execution as much as possible. For memory-bound applications, we achieve a near-peak effective PCIe bus bandwidth of 9-10 GB/s and performance as high as 145 GFLOPS for multi-dimensional FFT computations and for solving the Poisson equation. We extend our CPU-GPU software pipeline to a computation-bound application, DGEMM, achieving the illusion of a memory as large as the CPU memory with a computational throughput similar to that of a pure GPU implementation.
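A minimal sketch of the transfer/compute overlap used when the data exceeds GPU memory is shown below (double buffering with two CUDA streams; process_chunk and the chunk size are placeholders rather than the dissertation's pipeline, and the host buffer is assumed to be pinned so the asynchronous copies can actually overlap with kernel execution).

// Hedged sketch: out-of-core processing with two streams and two device
// buffers so host-to-device copies, kernels, and device-to-host copies overlap.
#include <cuda_runtime.h>
#include <cstddef>

__global__ void process_chunk(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i];           // stand-in for the real per-chunk kernel
}

void out_of_core(float* host, size_t total, size_t chunk) {
    // NOTE: 'host' should be pinned (cudaMallocHost/cudaHostRegister) for true overlap.
    cudaStream_t s[2];
    float* dev[2];
    for (int k = 0; k < 2; ++k) {            // double buffering
        cudaStreamCreate(&s[k]);
        cudaMalloc(&dev[k], chunk * sizeof(float));
    }
    for (size_t off = 0, k = 0; off < total; off += chunk, k ^= 1) {
        size_t n = (total - off < chunk) ? total - off : chunk;
        // copy-in, kernel, copy-out queued on one stream; the other stream's
        // work overlaps with these PCIe transfers
        cudaMemcpyAsync(dev[k], host + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        process_chunk<<<(int)((n + 255) / 256), 256, 0, s[k]>>>(dev[k], (int)n);
        cudaMemcpyAsync(host + off, dev[k], n * sizeof(float),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    for (int k = 0; k < 2; ++k) {
        cudaStreamSynchronize(s[k]);
        cudaStreamDestroy(s[k]);
        cudaFree(dev[k]);
    }
}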
Item: HIERARCHICAL MAPPING TECHNIQUES FOR SIGNAL PROCESSING SYSTEMS ON PARALLEL PLATFORMS (2014)
Wang, Lai-Huei; Bhattacharyya, Shuvra S.; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Dataflow models are widely used for expressing the functionality of digital signal processing (DSP) applications due to their useful features, such as providing formal mechanisms for describing application functionality, imposing minimal data-dependency constraints in specifications, and exposing task- and data-level parallelism effectively. Due to the increased complexity of dynamics in modern DSP applications, dataflow-based design methodologies require significant enhancements in modeling and scheduling techniques to provide efficient and flexible handling of dynamic behavior. To address this problem, this thesis proposes an innovative framework for mode- and dynamic-parameter-based modeling and scheduling. We apply, in a systematically integrated way, structured mode-based dataflow modeling of dynamic behavior together with dynamic parameter reconfiguration and quasi-static scheduling. Moreover, within the proposed framework, we present a new design method called parameterized multidimensional design hierarchy mapping (PMDHM), which is targeted at the flexible, multi-level reconfigurability and intensive real-time processing requirements of emerging dynamic DSP systems. The proposed approach allows designers to systematically represent and transform multi-level specifications of signal processing applications from a common, dataflow-based application-level model. In addition, we propose a new technique for mapping optimization that helps designers derive efficient, platform-specific parameters for application-to-architecture mapping. These parameters help to maximize system performance on state-of-the-art parallel platforms for embedded signal processing. To further enhance the scalability of our design representations and implementation techniques, we present a formal method for the analysis and mapping of parameterized DSP flowgraph structures, called topological patterns, into efficient implementations. The approach handles an important class of parameterized schedule structures in a form that is intuitive to represent and efficient to implement. We demonstrate our methods with case studies in the fields of wireless communication and computer vision. Experimental results from these case studies show that our approaches can be used to derive optimized implementations on parallel platforms and to enhance trade-off analysis during design space exploration. Furthermore, their basis in formal modeling and analysis techniques promotes the applicability of our proposed approaches to diverse signal processing applications and architectures.
Item: A GPU-ACCELERATED, HYBRID FVM-RANS METHODOLOGY FOR MODELING ROTORCRAFT BROWNOUT (2013)
Thomas, Sebastian; Baeder, James D; Aerospace Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
A numerically efficient, hybrid Eulerian-Lagrangian methodology has been developed to help better understand the complicated two-phase flowfield encountered in rotorcraft brownout environments. The problem of brownout occurs when rotorcraft operate close to surfaces covered with loose particles such as sand, dust, or snow. These particles can become entrained, in large quantities, into the rotor wake, leading to a potentially hazardous degradation of the pilot's visibility. It is believed that a computationally efficient model of this phenomenon, validated against available experimental measurements, can be used as a valuable tool to reveal the underlying physics of rotorcraft brownout. The present work involved the design, development, and validation of a hybrid solver for modeling brownout-like environments. The proposed methodology combines the numerical efficiency of a free-vortex method with the relatively high fidelity of a 3D, time-accurate, Reynolds-averaged Navier-Stokes (RANS) solver. For dual-phase simulations, this hybrid method can be unidirectionally coupled with a sediment tracking algorithm to study cloud development. In the past, large clusters of CPUs have been the standard approach for large simulations involving the numerical solution of PDEs. In recent years, however, an emerging trend is the use of Graphics Processing Units (GPUs), once used only for graphics rendering, to perform scientific computing. These platforms deliver superior computing power and memory bandwidth compared to traditional CPUs, and their prowess continues to grow rapidly with each passing generation. CFD simulations have been ported successfully onto GPU platforms in the past. However, the nature of GPU architecture has restricted the set of algorithms that exhibit significant speedups on these platforms: GPUs are optimized for operations in which a large number of threads, relative to the problem size, work in parallel, executing identical instructions on disparate datasets. For this reason, most implementations in the scientific literature involve the use of explicit algorithms for time-stepping, reconstruction, etc. To overcome the difficulty associated with implicit methods, the current work proposes a multi-granular approach to reduce the performance penalties typically encountered with such schemes. To explore the use of GPUs for RANS simulations, a 3D, time-accurate, implicit, structured, compressible, viscous, turbulent, finite-volume RANS solver was designed and developed in CUDA-C. During the development phase, various strategies for performance optimization were used to make the implementation better suited to the GPU architecture. Validation and verification of the GPU-based solver were performed for both canonical and realistic benchmark problems on a variety of GPU platforms. In these test cases, a performance assessment of the GPU-RANS solver indicated that it was between one and two orders of magnitude faster than equivalent single-CPU-core computations (as high as 50X for fine-grain computations on the latest platforms). For simulations involving implicit methods, a multi-granular technique was used that sought to exploit the intermediate coarse-grain parallelism inherent in families of line-parallel methods like Alternating Direction Implicit (ADI) schemes, coupled with conservative-variable parallelism. This approach had the dual effect of reducing memory bandwidth usage and increasing GPU occupancy, leading to significant performance gains. The multi-granular approach for implicit methods used in this work has demonstrated speedups that are close to 50% of those expected with purely explicit methods. The validated GPU-RANS solver was then coupled with GPU-based free-vortex and sediment tracking methods to model single- and dual-phase, model-scale brownout environments. A qualitative and quantitative validation of the methodology was performed by comparing predictions with available measurements, including flowfield measurements and observations of particle transport mechanisms made with laboratory-scale rotor/jet configurations in ground effect. In particular, dual-phase simulations were able to resolve key transport phenomena in the dispersed phase such as creep, vortex trapping, and sediment wave formation. Furthermore, these simulations were demonstrated to be computationally more efficient than equivalent computations on a cluster of traditional CPUs: a model-scale brownout simulation using the hybrid approach on a single GTX Titan now takes 1.25 hours per revolution, compared to 6 hours per revolution on 32 Intel Xeon cores.
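The coarse-grain parallelism that line-parallel implicit schemes such as ADI expose can be sketched as follows (a generic illustration, not the dissertation's multi-granular solver): each GPU thread owns one grid line and runs the sequential Thomas algorithm along it, so the parallelism comes from the many independent lines (and, in a multi-granular scheme, additionally from the conservative variables) rather than from inside each recurrence.

// Hedged sketch: batched tridiagonal (Thomas) solves, one grid line per thread,
// as used in line-parallel ADI-style implicit schemes.
#include <cuda_runtime.h>

// a: sub-diagonal, b: diagonal, c: super-diagonal, d: right-hand side.
// Each array holds n_lines systems of size n, stored line after line.
// The solution overwrites d; a[line*n + 0] is unused.
__global__ void thomas_per_line(float* a, float* b, float* c, float* d,
                                int n_lines, int n) {
    int line = blockIdx.x * blockDim.x + threadIdx.x;
    if (line >= n_lines) return;
    float *la = a + line * n, *lb = b + line * n, *lc = c + line * n, *ld = d + line * n;
    // forward elimination: sequential along the line, parallel across lines
    for (int i = 1; i < n; ++i) {
        float m = la[i] / lb[i - 1];
        lb[i] -= m * lc[i - 1];
        ld[i] -= m * ld[i - 1];
    }
    // back substitution
    ld[n - 1] /= lb[n - 1];
    for (int i = n - 2; i >= 0; --i)
        ld[i] = (ld[i] - lc[i] * ld[i + 1]) / lb[i];
}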
Item: Parallelization of Non-Rigid Image Registration (2008)
Philip, Mathew; Shekhar, Raj; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Non-rigid image registration finds use in medical applications ranging from diagnostics to minimally invasive image-guided interventions. Automatic non-rigid image registration algorithms are computationally intensive in that they can take hours to register two images. Although hierarchical volume subdivision-based algorithms are inherently faster than other non-rigid registration algorithms, they can still take a long time to register two images. We show a parallel implementation of one such previously reported and well-tested algorithm on a cluster of thirty-two processors, which reduces the registration time from hours to a few minutes. Mutual information (MI) is one of the most commonly used image similarity measures in medical image registration and is used in the aforementioned algorithm. In addition to the parallel implementation, we propose a new concept based on bit-slicing to accelerate the computation of MI on the cluster and, more generally, on any parallel computing platform such as graphics processing units (GPUs). GPUs are becoming increasingly common for general-purpose computing in the area of medical imaging, as they can execute algorithms faster by leveraging the parallel processing power they offer. However, the standard implementation of MI does not map well to the GPU architecture, leading earlier investigators to compute only an inexact version of MI on the GPU to achieve speedup. The proposed bit-slicing technique enables us to demonstrate an exact implementation of MI on the GPU without adversely affecting the speedup.
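The bit-slicing formulation itself is not reproduced here; as a point of contrast, the sketch below shows the straightforward joint-histogram route to MI, whose scattered, data-dependent updates are one reason an exact MI computation maps poorly onto GPU hardware (the bin count, image types, and kernel structure are illustrative assumptions).

// Illustrative sketch only: exact joint histogram for mutual information via
// per-block shared-memory atomics, merged into a global histogram.
#include <cuda_runtime.h>

#define BINS 64                               // assumed number of intensity bins

__global__ void joint_histogram(const unsigned char* a, const unsigned char* b,
                                int n, unsigned int* hist /* BINS*BINS, zeroed by host */) {
    __shared__ unsigned int local[BINS * BINS];
    for (int i = threadIdx.x; i < BINS * BINS; i += blockDim.x) local[i] = 0;
    __syncthreads();
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        int ia = a[i] * BINS / 256, ib = b[i] * BINS / 256;
        atomicAdd(&local[ia * BINS + ib], 1u); // scattered, data-dependent update
    }
    __syncthreads();
    for (int i = threadIdx.x; i < BINS * BINS; i += blockDim.x)
        atomicAdd(&hist[i], local[i]);         // merge block-local histograms
}

// Host side: MI = sum_ij p_ij * log(p_ij / (p_i * p_j)), computed from hist.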
Item: High-Speed Reconstruction of Low-Dose CT Using Iterative Techniques for Image-Guided Interventions (2008-07-18)
Bhat, Venkatesh Bantwal; Shekhar, Raj; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Minimally invasive image-guided interventions (IGIs) lead to improved treatment outcomes while significantly reducing patient trauma. Because of features such as fast scanning, high resolution, three-dimensional views, and ease of operation, computed tomography (CT) is increasingly the modality of choice for IGIs. The risk of radiation exposure, however, limits its current and future use. We perform ultra-low-dose scanning to overcome this limitation. To address the image quality problem at low doses, we reconstruct images using the iterative Paraboloidal Surrogate (PS) algorithm. Using actual scanner data, we demonstrate an improvement in the quality of images reconstructed with the iterative algorithm at low doses compared to the standard Filtered Back Projection (FBP) technique. We also accelerate the PS algorithm on a cluster of 32 processors and on a GPU. We demonstrate a speedup of approximately 20 times for the cluster and a two-orders-of-magnitude improvement in speed for the GPU, while maintaining image quality comparable to the traditional uniprocessor implementation.
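The Paraboloidal Surrogate update itself is not reproduced here; as a generic illustration of why iterative reconstruction parallelizes well on a GPU, the sketch below applies a simple Landweber/SIRT-style iteration to a toy dense system, with one thread per measurement for the forward projection and one thread per voxel for the back-projected update (the matrix layout, step size, and names are assumptions, not the dissertation's method).

// Generic illustration: iterative reconstruction x += lambda * A^T (y - A x)
// on a toy dense system matrix A (n_meas x n_vox), all data already on device.
#include <cuda_runtime.h>

// r = y - A x   (one thread per measurement)
__global__ void residual(const float* A, const float* x, const float* y,
                         float* r, int n_meas, int n_vox) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_meas) return;
    float s = 0.f;
    for (int j = 0; j < n_vox; ++j) s += A[i * n_vox + j] * x[j];
    r[i] = y[i] - s;
}

// x += lambda * A^T r   (one thread per voxel)
__global__ void update(const float* A, const float* r, float* x,
                       int n_meas, int n_vox, float lambda) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n_vox) return;
    float s = 0.f;
    for (int i = 0; i < n_meas; ++i) s += A[i * n_vox + j] * r[i];
    x[j] += lambda * s;
}

void iterate(const float* dA, const float* dy, float* dx, float* dr,
             int n_meas, int n_vox, int iters, float lambda) {
    int tb = 256;
    for (int k = 0; k < iters; ++k) {
        residual<<<(n_meas + tb - 1) / tb, tb>>>(dA, dx, dy, dr, n_meas, n_vox);
        update<<<(n_vox + tb - 1) / tb, tb>>>(dA, dr, dx, n_meas, n_vox, lambda);
    }
    cudaDeviceSynchronize();
}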