Theses and Dissertations from UMD

Permanent URI for this community: http://hdl.handle.net/1903/2

New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date, so there may be up to a four-month delay before a given thesis/dissertation appears in DRUM.

More information is available at Theses and Dissertations at University of Maryland Libraries.

Search Results

Now showing 1 - 10 of 13
  • Item
    Simulating Bursty and Continuous Reionization Using GPU Computing
    (2023) Hartley, Blake Teixeira; Ricotti, Massimo; Astronomy; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Reionization is the process by which the neutral intergalactic medium of the early universe was ionized by the first galaxies; it took place between roughly redshift 30 and redshift 6, or from about 100 Myr to 1 Gyr after the Big Bang. The details of this transition are still not well understood, but observational constraints suggest that reionization happened faster than naive estimates would suggest. In this thesis, we investigate the theory that galaxies which form their stars in short bursts could complete reionization faster than galaxies which emit their photons continuously over their lifespans. We began investigating this theory with a semi-analytic model of the early universe. We used analytic methods to model the expansion of H II (ionized hydrogen) regions around isolated galaxies, as well as the behavior of the remnant H II regions after star formation ceases. We then compiled assortments of galaxies matching dark matter simulation profiles and associated each with an H II region that could either grow continuously or grow quickly before entering a dormant period of recombination. These tests indicated that the remnants of bursty star formation had lower overall recombination rates than those of continuously expanding H II regions, and that these remnants could allow ionizing radiation from more distant sources to influence ionization earlier. We decided that the next step toward demonstrating the differences between continuous and bursty star formation would require a more accurate model of the early universe. We chose a photon-conserving ray-tracing algorithm which follows the path of millions of rays from each galaxy and calculates the ionization rate at every point in a uniform 3D grid. The massive amount of computation required for such an algorithm led us to choose MPI as the framework for building our simulation.
MPI allowed us to break the grid into 8 sub-volumes, each of which could be assigned to a node on a supercomputer. We then used CUDA to track the millions of rays, with each of the thousands of CUDA cores handling a single ray. Creating our own simulation library afforded us complete control over the distribution and time dependence of ionizing radiation emission, which is critical to isolating the effect of bursty star formation on reionization. Once the library was complete, we conducted a suite of simulations across a selection of model parameters. Each set of model parameters corresponds to two models, one continuous and one bursty, which allowed us to isolate the effect of bursty star formation on the simulation results. We found that the effects we hoped to see were present in our simulations, and we obtained simple estimates of their size.
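The 8-way decomposition described above can be sketched in a few lines. This is a hedged illustration rather than the thesis code: `owner_rank` is a hypothetical helper that maps a cell of a uniform n×n×n grid to one of eight octant sub-volumes, each of which would be assigned to an MPI rank.

```python
def owner_rank(i, j, k, n):
    """Map cell (i, j, k) of an n x n x n grid to one of 8 octant
    sub-volumes: one bit per axis records whether the cell lies in
    the upper half along that axis."""
    half = n // 2
    return ((i >= half) << 2) | ((j >= half) << 1) | int(k >= half)

# The decomposition is uniform: every rank owns an (n/2)^3 block, so the
# grid storage per node is balanced before the rays themselves are counted.
```

With n = 4, for example, each of the 8 ranks owns exactly 8 of the 64 cells.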
  • Item
    On Efficient GPGPU Computing for Integrated Heterogeneous CPU-GPU Microprocessors
    (2021) Gerzhoy, Daniel; Yeung, Donald; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Heterogeneous microprocessors which integrate a CPU and GPU on a single chip provide low-overhead CPU-GPU communication and permit sharing of on-chip resources that a traditional discrete GPU would not have direct access to. These features allow codes that heretofore were suitable only for multi-core CPUs or discrete GPUs to run efficiently on a heterogeneous CPU-GPU microprocessor, in some cases with increased performance. This thesis discusses previously published work on exploiting nested MIMD-SIMD parallelization for heterogeneous microprocessors. We examined loop structures in which one or more regular data-parallel loops are nested within a parallel outer loop that can contain irregular code (e.g., with control divergence). By scheduling outer loops on the multicore CPU part of the microprocessor, each thread launches dynamic, independent instances of the inner loop onto the GPU, boosting GPU utilization while simultaneously parallelizing the outer loop. The second portion of the thesis explores heterogeneous producer-consumer data sharing between the CPU and GPU on the microprocessor. One advantage of tight integration, the sharing of the on-chip cache system, could reduce the cost that memory accesses impose on performance and power. Producer-consumer data sharing commonly occurs between the CPU and GPU portions of programs, but kernels whose data footprint far exceeds a typical CPU cache cause shared data to be evicted before it is reused. We propose Pipelined CPU-GPU Scheduling for Caches, a locality transformation for producer-consumer relationships between CPUs and GPUs. By intelligently scheduling the execution of the producer and consumer in a software pipeline, evictions can be avoided, saving DRAM accesses, power, and execution time. To keep the cached data on chip, we allow the producer to run ahead of the consumer by a certain number of loop iterations or threads.
Choosing this "run-ahead distance" becomes the main constraint in the scheduling of work in this software pipeline, and we provide a method of statically predicting it. We assert that with intelligent scheduling and the hardware and software mechanisms to support it, more workloads can be gainfully executed on integrated heterogeneous CPU-GPU microprocessors than previously assumed.
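As a sketch of the scheduling idea (our own hedged illustration with hypothetical names, not the thesis mechanism), the following interleaves producer and consumer work so that at most `run_ahead` produced chunks are outstanding at any time; bounding that distance is what keeps shared data resident in the cache before the consumer reuses it.

```python
from collections import deque

def pipelined_schedule(n_chunks, run_ahead):
    """Interleave producer and consumer steps so that at most `run_ahead`
    produced-but-unconsumed chunks exist at once (run_ahead must be >= 1).
    Returns the schedule as a list of ("produce"|"consume", chunk) pairs."""
    schedule, outstanding = [], deque()
    produced = consumed = 0
    while consumed < n_chunks:
        if produced < n_chunks and len(outstanding) < run_ahead:
            outstanding.append(produced)        # chunk now resident in cache
            schedule.append(("produce", produced))
            produced += 1
        else:
            schedule.append(("consume", outstanding.popleft()))
            consumed += 1                        # chunk reused before eviction
    return schedule
```

With `run_ahead=2` the producer fills two chunks, then the pipeline alternates, so the working set of shared data never exceeds two chunks.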
  • Item
    Advancing the Multi-Solver Paradigm for Overset CFD Toward Heterogeneous Architectures
    (2019) Jude, Dylan P; Baeder, James; Aerospace Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    A multi-solver, overset, computational fluid dynamics framework is developed for efficient, large-scale simulation of rotorcraft problems. Two primary features distinguish the developed framework from the current state of the art. First, the framework is designed for heterogeneous compute architectures, making use of both traditional codes run on the Central Processing Unit (CPU) as well as codes run on the Graphics Processing Unit (GPU). Second, a framework-level implementation of the Generalized Minimal Residual (GMRES) linear solver is used to consider all meshes from all solvers in a single linear system. The developed GPU flow solver and framework are validated against conventional implementations, achieving a 5.35× speedup for a single GPU compared to 24 CPU cores. Similarly, the overset linear solver is compared to traditional techniques, demonstrating that the same convergence order can be achieved using as few as half the number of iterations. Applications of the developed methods are organized into two chapters. First, the heterogeneous, overset framework is applied to a notional helicopter configuration based on the ROBIN wind tunnel experiments. A tail rotor and hub are added to create a challenging case representative of a realistic, full-rotorcraft simulation. Interactional aerodynamics between the different components are reviewed in detail. The second application chapter focuses on performance of the overset linear solver for unsteady applications. The GPU solver is used along with an unstructured code to simulate laminar flow over a sphere as well as laminar coaxial rotors designed for a Mars helicopter. In all results, the overset linear solver outperforms the traditional, decoupled approach. Conclusions drawn from both the full-rotorcraft and overset linear solver simulations can have a significant impact on improving modeling of complex rotorcraft aerodynamics.
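A toy illustration of why a single coupled system can converge faster than a decoupled exchange (this is our own sketch with a direct dense solve, not the framework's GMRES implementation): two 1D "meshes", coupled at an interface, are solved once monolithically and then by repeated per-mesh solves that exchange only interface values.

```python
def solve_dense(A, b):
    """Gaussian elimination with partial pivoting; enough for tiny systems."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

# 1D Poisson-like problem on 4 unknowns, split into two 2-unknown "meshes".
A = [[2.0, -1.0, 0.0, 0.0], [-1.0, 2.0, -1.0, 0.0],
     [0.0, -1.0, 2.0, -1.0], [0.0, 0.0, -1.0, 2.0]]
b = [1.0, 1.0, 1.0, 1.0]
monolithic = solve_dense(A, b)  # one coupled solve over both meshes at once

def decoupled(iters):
    """Alternate per-mesh solves, exchanging only the interface values."""
    x = [0.0] * 4
    for _ in range(iters):
        # mesh 1 (unknowns 0,1) sees mesh 2 only through x[2]
        x[0], x[1] = solve_dense([[2.0, -1.0], [-1.0, 2.0]], [1.0, 1.0 + x[2]])
        # mesh 2 (unknowns 2,3) sees mesh 1 only through x[1]
        x[2], x[3] = solve_dense([[2.0, -1.0], [-1.0, 2.0]], [1.0 + x[1], 1.0])
    return x
```

The monolithic solve recovers the coupled solution in one pass, while the decoupled exchange needs many outer iterations to reach the same answer; the framework-level GMRES in the thesis exploits the same effect at scale.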
  • Item
    ARCHITECTURE, MODELS, AND ALGORITHMS FOR TEXTUAL SIMILARITY
    (2018) He, Hua; Lin, Jimmy; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Identifying similar pieces of text remains one of the fundamental problems in computational linguistics. This dissertation focuses on the problem of measuring and identifying textual similarity by studying a variety of major tasks that share common properties, and presents our efforts to address seven closely related similarity tasks over more than 20 public benchmarks, including paraphrase identification, answer selection for question answering, pairwise learning to rank, monolingual/cross-lingual semantic textual similarity measurement, insight extraction on biomedical literature, and high-performance cross-lingual pattern matching for machine translation on GPUs. We investigate how to make textual similarity measurement more accurate with deep neural networks. Traditional approaches are based either on feature engineering, which leads to disconnected solutions, or on the Siamese architecture, which treats inputs independently, uses a single representation view, and performs a straightforward similarity comparison. In contrast, we focus on modeling stronger interactions between inputs and develop interaction-based neural modeling that explicitly encodes the alignments of input words or aggregated sentence representations into our models. As a result, our deep neural networks show highly competitive performance on many of the public textual similarity benchmarks we evaluated. Our multi-perspective convolutional neural network (MPCNN) processes input sentences with multiple parallel convolutional networks from a multiplicity of perspectives, and automatically extracts salient sentence-level features at multiple granularities with different types of pooling. Our novel structured similarity layer encourages stronger input interactions by comparing local regions of both sentence representations. This model is the first example of our interaction-based neural modeling. We also provide an attention-based input interaction layer on top of the MPCNN model.
The input interaction layer models a closer relationship between input words by converting two separate sentences into an inter-related sentence pair. This layer utilizes the attention mechanism in a straightforward way and is another example of our interaction-based neural modeling. We then present our pairwise word interaction (PWI) model with very deep neural networks. This model directly encodes input word interactions with novel pairwise word interaction modeling and a novel similarity focus layer, and it is the first example in the NLP domain of using a very deep architecture for textual similarity modeling. Our PWI model outperforms the Siamese architecture and feature engineering approaches on multiple tasks, and is another example of our interaction-based neural modeling. We also address the question answering task with a pairwise ranking approach. Unlike the traditional pointwise approach to the task, our pairwise ranking approach, with the use of negative sampling, models interactions between two question-answer pairs and learns a relative order of the pairs to predict which answer is more relevant to the question. We demonstrate its high effectiveness against competitive pointwise baselines. For insight extraction on biomedical literature, we develop neural networks with similarity modeling for better causality/correlation relation extraction by converting the extraction task into a similarity measurement task. Our approach innovates in that it explicitly models the interactions among a trio of named entities, entity relations, and contexts; measures both relational and contextual similarity among them; and finally integrates both similarity evaluations into the insight extraction decision. We also build an end-to-end system to extract insights; human evaluations show that our system extracts insights with high acceptance accuracy.
Lastly, we explore how to exploit the massive parallelism offered by modern GPUs for high-efficiency pattern matching. We take advantage of GPU hardware advances and develop a massively parallel approach. We first work on phrase-based SMT, where we enable phrase lookup and extraction on suffix arrays to be massively parallelized so that very large batches of queries can be carried out in parallel. We then work on the computationally expensive hierarchical SMT model, which requires matching grammar patterns that contain "gaps". To achieve high efficiency for the similarity identification task on GPUs, we show that developing massively parallel algorithms is the most important way to fully utilize the GPU's raw processing power, and that developing compact data structures helps to reduce the effect of the GPU's memory latency. Compared to a highly optimized, state-of-the-art multi-threaded CPU implementation, our techniques achieve orders-of-magnitude improvement in throughput.
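The suffix-array phrase lookup that the batched GPU queries build on can be sketched sequentially (a hedged illustration of the data structure, not the thesis implementation): each query is an independent pair of binary searches, which is exactly what makes very large batches easy to run in parallel.

```python
def build_suffix_array(tokens):
    """Suffix array over a token sequence: suffix start offsets, sorted
    by the lexicographic order of the suffixes (O(n^2 log n) toy build)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def occurrences(tokens, sa, phrase):
    """All start positions of `phrase`, found with two binary searches
    over the suffix array; each query is independent of every other."""
    m = len(phrase)
    lo, hi = 0, len(sa)
    while lo < hi:                      # lower bound
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] < phrase:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                      # upper bound
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] <= phrase:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[i] for i in range(start, lo))
```

On a GPU, thousands of such queries would be dispatched at once, one per thread, against the same read-only array.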
  • Item
    Contributions Toward Understanding the Effects of Rotor and Airframe Configurations On Brownout Dust Clouds
    (2014) Govindarajan, Bharath Madapusi; Leishman, J. Gordon; Aerospace Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Brownout dust cloud simulations were conducted for rotorcraft undergoing representative landing maneuvers, primarily to examine the effects of different rotor placement and rotor/airframe configurations. The flow field generated by a helicopter rotor in ground effect operations was modeled by using an inviscid, incompressible, time-accurate Lagrangian free-vortex method, coupled to a semi-empirical approximation for the boundary layer flow near the ground. A surface singularity method was employed to represent the aerodynamic influence of a fuselage. A rigorous coupling strategy for the free-vortex method was developed to include the effects of rotors operating at different rotational speeds, such as a tail rotor. For the dispersed phase of the flow, particle tracking was used to model the dust cloud based on solutions to a decoupled form of the Basset-Boussinesq-Oseen equations appropriate to dilute gas particle suspensions of low Reynolds number Stokes flow. Important aspects of particle mobility and uplift in such vortically driven dust flows were modeled, which included a threshold-based model for sediment mobility and bombardment effects when previously suspended particles impact the bed and eject new particles. Various techniques were employed to reduce the computational cost of the dust cloud simulations, such as particle clustering and parallel programming using graphics processing units. The predicted flow fields near the ground and resulting dust clouds during the landing maneuvers were analyzed to better understand the physics behind their development, and to examine differences produced by various rotor and airframe configurations. Metrics based on particle counts and particle velocities in the field of view were developed to help quantify the severity of the computed brownout dust clouds. 
The presence of both a tail rotor and the fuselage was shown to cause both local and global changes to the aerodynamic environment near the ground and also influenced the development of the resulting dust clouds. Studies were also performed to examine the accuracy of self-induced velocities of vortex filaments by augmenting the straight-line vortex segments with a curved filament correction term. It was found that while curved elements can accurately recover the self-induced velocity in the case of a vortex ring, there existed bounds of applicability when extended to three-dimensional rotor wakes. Finally, exploratory two-dimensional and three-dimensional studies were performed to examine the effects of blade/particle collisions. The loss in particle kinetic energy during the collision was adopted as a surrogate metric to quantify the extent of potential blade erosion.
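Two of the particle-phase ingredients above reduce to very small kernels. The sketch below is our own hedged illustration (hypothetical names, and an explicit Euler step rather than whatever integrator the thesis uses): a threshold test for sediment mobility, and the dominant Stokes-drag term of the decoupled BBO equations for a dilute suspension.

```python
def entrained(u_friction, u_threshold):
    """Threshold-based mobility: a bed particle mobilizes only when the
    local friction velocity exceeds its threshold value."""
    return u_friction > u_threshold

def drag_step(v, u_local, tau, dt):
    """One explicit Euler step of dv/dt = (u - v) / tau: Stokes-drag
    relaxation of particle velocity v toward the local flow velocity u,
    with particle response time tau."""
    return v + dt * (u_local - v) / tau
```

With dt much smaller than tau, the particle velocity relaxes exponentially toward the carrier flow, which is the low-Reynolds-number behavior the dust-cloud tracking relies on.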
  • Item
    Optimization Techniques for Mapping Algorithms and Applications onto CUDA GPU Platforms and CPU-GPU Heterogeneous Platforms
    (2014) Wu, Jing; JaJa, Joseph F; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    An emerging trend in processor architecture is the doubling of the number of cores per chip every two years at the same or lower clock speed. Of particular interest to this thesis is the class of many-core processors, which are becoming more attractive due to their high performance, low cost, and low power consumption. The main goal of this dissertation is to develop optimization techniques for mapping algorithms and applications onto CUDA GPUs and CPU-GPU heterogeneous platforms. The Fast Fourier Transform (FFT) constitutes a fundamental tool in computational science and engineering, and hence a GPU-optimized implementation is of paramount importance. We first study the mapping of the 3D FFT onto recent CUDA GPUs and develop a new approach that minimizes the number of global memory accesses and overlaps the computations along the different dimensions. We obtain some of the fastest known implementations of the multi-dimensional FFT. We then present a highly multithreaded FFT-based direct Poisson solver that is optimized for recent NVIDIA GPUs. In addition to massive multithreading, our algorithm carefully manages the multiple layers of the memory hierarchy so that all global memory accesses are coalesced into 128-byte device memory transactions. As a result, we have achieved up to 375 GFLOPS with a bandwidth of 120 GB/s on the GTX 480. We further extend our methodology to CPU-GPU heterogeneous platforms for the case when the input is too large to fit in the GPU's global memory. We develop optimization techniques for memory-bound and computation-bound applications. The main challenge here is to minimize data transfer between the CPU memory and the device memory and to overlap these transfers with kernel execution as much as possible.
For memory-bound applications, we achieve a near-peak effective PCIe bus bandwidth of 9-10 GB/s and performance as high as 145 GFLOPS for multi-dimensional FFT computations and for solving the Poisson equation. We extend our CPU-GPU software pipeline to a computation-bound application, DGEMM, and achieve the illusion of a memory as large as the CPU memory with a computation throughput similar to that of a pure GPU execution.
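The dimension-by-dimension structure that the 3D FFT mapping exploits can be shown in miniature (our own sketch: a naive O(n²) DFT stands in for the tuned GPU kernels): a multi-dimensional transform is just 1D passes along each axis in turn, and overlapping those passes is what hides global-memory traffic on the device.

```python
import cmath

def dft(x):
    """Naive 1D DFT, O(n^2); a stand-in for a tuned FFT kernel."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def dft2(grid):
    """2D transform as 1D passes: rows first, then columns. The 3D case
    simply adds one more pass along the third axis."""
    rows = [dft(r) for r in grid]                  # pass 1: along rows
    cols = [dft(list(c)) for c in zip(*rows)]      # pass 2: along columns
    return [list(r) for r in zip(*cols)]
```

The separability shown here is what lets each pass be organized into independent, coalesced batches of 1D transforms on the GPU.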
  • Item
    HIERARCHICAL MAPPING TECHNIQUES FOR SIGNAL PROCESSING SYSTEMS ON PARALLEL PLATFORMS
    (2014) Wang, Lai-Huei; Bhattacharyya, Shuvra S.; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Dataflow models are widely used for expressing the functionality of digital signal processing (DSP) applications due to their useful features, such as providing formal mechanisms for description of application functionality, imposing minimal data-dependency constraints in specifications, and exposing task and data level parallelism effectively. Due to the increased complexity of dynamics in modern DSP applications, dataflow-based design methodologies require significant enhancements in modeling and scheduling techniques to provide for efficient and flexible handling of dynamic behavior. To address this problem, in this thesis, we propose an innovative framework for mode- and dynamic-parameter-based modeling and scheduling. We apply, in a systematically integrated way, the structured mode-based dataflow modeling capability of dynamic behavior together with the features of dynamic parameter reconfiguration and quasi-static scheduling. Moreover, in our proposed framework, we present a new design method called parameterized multidimensional design hierarchy mapping (PMDHM), which is targeted to the flexible multi-level reconfigurability and intensive real-time processing requirements of emerging dynamic DSP systems. The proposed approach allows designers to systematically represent and transform multi-level specifications of signal processing applications from a common, dataflow-based application-level model. In addition, we propose a new technique for mapping optimization that helps designers derive efficient, platform-specific parameters for application-to-architecture mapping. These parameters help to maximize system performance on state-of-the-art parallel platforms for embedded signal processing. To further enhance the scalability of our design representations and implementation techniques, we present a formal method for analysis and mapping of parameterized DSP flowgraph structures, called topological patterns, into efficient implementations.
The approach handles an important class of parameterized schedule structures in a form that is intuitive for representation and efficient for implementation. We demonstrate our methods with case studies in the fields of wireless communication and computer vision. Experimental results from these case studies show that our approaches can be used to derive optimized implementations on parallel platforms, and enhance trade-off analysis during design space exploration. Furthermore, their basis in formal modeling and analysis techniques promotes the applicability of our proposed approaches to diverse signal processing applications and architectures.
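At its simplest, a quasi-static schedule for the static portion of such a dataflow graph is a topological ordering of actor firings. The sketch below is a generic illustration (not PMDHM itself, and the actor names are hypothetical): it derives a valid single-processor firing order in which every actor fires after its producers.

```python
from collections import deque

def static_schedule(actors, edges):
    """Topological sort of a dataflow graph given as (producer, consumer)
    edges; returns a firing order respecting all data dependencies."""
    indeg = {a: 0 for a in actors}
    succ = {a: [] for a in actors}
    for src, dst in edges:
        succ[src].append(dst)
        indeg[dst] += 1
    ready = deque(a for a in actors if indeg[a] == 0)
    order = []
    while ready:
        a = ready.popleft()
        order.append(a)
        for s in succ[a]:
            indeg[s] -= 1
            if indeg[s] == 0:           # all producers have fired
                ready.append(s)
    return order
```

A quasi-static scheduler would compute orderings like this at compile time for each mode, leaving only the mode selection to run time.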
  • Item
    A GPU-ACCELERATED, HYBRID FVM-RANS METHODOLOGY FOR MODELING ROTORCRAFT BROWNOUT
    (2013) Thomas, Sebastian; Baeder, James D; Aerospace Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    A numerically efficient, hybrid Eulerian-Lagrangian methodology has been developed to help better understand the complicated two-phase flowfield encountered in rotorcraft brownout environments. The problem of brownout occurs when rotorcraft operate close to surfaces covered with loose particles such as sand, dust, or snow. These particles can become entrained, in large quantities, into the rotor wake, leading to a potentially hazardous degradation of the pilot's visibility. It is believed that a computationally efficient model of this phenomenon, validated against available experimental measurements, can be used as a valuable tool to reveal the underlying physics of rotorcraft brownout. The present work involved the design, development, and validation of a hybrid solver for the purpose of modeling brownout-like environments. The proposed methodology combines the numerical efficiency of a free-vortex method with the relatively high fidelity of a 3D, time-accurate, Reynolds-averaged Navier-Stokes (RANS) solver. For dual-phase simulations, this hybrid method can be unidirectionally coupled with a sediment tracking algorithm to study cloud development. In the past, large clusters of CPUs have been the standard approach for large simulations involving the numerical solution of PDEs. In recent years, however, an emerging trend is the use of Graphics Processing Units (GPUs), once used only for graphics rendering, to perform scientific computing. These platforms deliver superior computing power and memory bandwidth compared to traditional CPUs, and their prowess continues to grow rapidly with each passing generation. CFD simulations have been ported successfully onto GPU platforms in the past.
However, the nature of GPU architecture has restricted the set of algorithms that exhibit significant speedups on these platforms: GPUs are optimized for operations where a very large number of threads, relative to the problem size, work in parallel, executing identical instructions on disparate datasets. For this reason, most implementations in the scientific literature involve the use of explicit algorithms for time-stepping, reconstruction, etc. To overcome the difficulty associated with implicit methods, the current work proposes a multi-granular approach to reduce the performance penalties typically encountered with such schemes. To explore the use of GPUs for RANS simulations, a 3D, time-accurate, implicit, structured, compressible, viscous, turbulent, finite-volume RANS solver was designed and developed in CUDA-C. During the development phase, various strategies for performance optimization were used to make the implementation better suited to the GPU architecture. Validation and verification of the GPU-based solver were performed for both canonical and realistic benchmark problems on a variety of GPU platforms. In these test cases, a performance assessment of the GPU-RANS solver indicated that it was between one and two orders of magnitude faster than equivalent single-CPU-core computations (as high as 50X for fine-grain computations on the latest platforms). For simulations involving implicit methods, a multi-granular technique was used that sought to exploit the intermediate coarse-grain parallelism inherent in families of line-parallel methods like Alternating Direction Implicit (ADI) schemes, coupled with conservative-variable parallelism. This approach had the dual effect of reducing memory bandwidth usage and increasing GPU occupancy, leading to significant performance gains.
The multi-granular approach for implicit methods used in this work has demonstrated speedups that are close to 50% of those expected with purely explicit methods. The validated GPU-RANS solver was then coupled with GPU-based free-vortex and sediment tracking methods to model single- and dual-phase, model-scale brownout environments. A qualitative and quantitative validation of the methodology was performed by comparing predictions with available measurements, including flowfield measurements and observations of particle transport mechanisms that have been made with laboratory-scale rotor/jet configurations in ground effect. In particular, dual-phase simulations were able to resolve key transport phenomena in the dispersed phase such as creep, vortex trapping, and sediment wave formation. Furthermore, these simulations were demonstrated to be computationally more efficient than equivalent computations on a cluster of traditional CPUs: a model-scale brownout simulation using the hybrid approach on a single GTX Titan now takes 1.25 hours per revolution, compared to 6 hours per revolution on 32 Intel Xeon cores.
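The per-line kernel of an ADI sweep is a tridiagonal solve, and every grid line in a sweep direction is independent, which is the coarse-grain parallelism the multi-granular mapping exploits. A hedged sequential sketch (the standard Thomas algorithm; the names are ours, and on the GPU each line would map to its own block of threads):

```python
def thomas(a, b, c, d):
    """Solve one tridiagonal system: a = sub-, b = main-, c = super-
    diagonal, d = right-hand side (a[0] and c[-1] are unused)."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                     # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):            # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def adi_sweep(lines):
    """One ADI half-step direction: the line solves are independent, so
    they can all run concurrently."""
    return [thomas(*line) for line in lines]
```

Each Thomas solve is inherently sequential along its line, which is why the coarse-grain line-level parallelism (plus parallelism across conserved variables) is what the GPU mapping has to exploit.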
  • Item
    Highly Parallel Geometric Characterization and Visualization of Volumetric Data Sets
    (2012) Juba, Derek Christopher; Varshney, Amitabh; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Volumetric 3D data sets are being generated in many different application areas. Some examples are CAT scans and MRI data, 3D models of protein molecules represented by implicit surfaces, multi-dimensional numeric simulations of plasma turbulence, and stacks of confocal microscopy images of cells. The size of these data sets has been increasing, requiring the speed of analysis and visualization techniques to increase as well to keep up. Recent processors have stopped increasing in clock speed and instead increased parallelism, resulting in multi-core CPUs and many-core GPUs. To take advantage of these new parallel architectures, algorithms must be explicitly written to exploit parallelism. In this thesis we describe several algorithms and techniques for volumetric data set analysis and visualization that are amenable to these modern parallel architectures. We first discuss modeling volumetric data with Gaussian Radial Basis Functions (RBFs). The RBF representation of a data set has several advantages, including lossy compression, analytic differentiability, and analytic application of Gaussian blur. We also describe a parallel volume rendering algorithm that can create images of the data directly from the RBF representation. Next we discuss a parallel, stochastic algorithm for measuring the surface area of volumetric representations of molecules. The algorithm is suitable for implementation on a GPU and is also progressive, allowing it to return a rough answer almost immediately and refine the answer over time to the desired level of accuracy. After this we discuss the concept of Confluent Visualization, which allows the visualization of the interaction between a pair of volumetric data sets. The interaction is visualized through volume rendering, which is well suited to implementation on parallel architectures.
Finally we discuss a parallel, stochastic algorithm for classifying stem cells as having been grown on a surface that induces differentiation or on a surface that does not induce differentiation. The algorithm takes as input 3D volumetric models of the cells generated from confocal microscopy. This algorithm builds on our algorithm for surface area measurement and, like that algorithm, this algorithm is also suitable for implementation on a GPU and is progressive.
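The progressive character of the stochastic estimators above can be illustrated with a generic Monte Carlo sketch (ours, not the thesis algorithm, which estimates surface area rather than the volume shown here): each batch of samples refines a running estimate over an implicit representation, so a rough answer is available almost immediately and improves over time.

```python
import random

def progressive_volume(inside, bounds, batch=1000, batches=50, seed=0):
    """Yield a refined Monte Carlo volume estimate after every batch of
    samples drawn uniformly from the axis-aligned bounding box."""
    rng = random.Random(seed)
    (x0, x1), (y0, y1), (z0, z1) = bounds
    box = (x1 - x0) * (y1 - y0) * (z1 - z0)
    hits = total = 0
    for _ in range(batches):
        for _ in range(batch):
            p = (rng.uniform(x0, x1), rng.uniform(y0, y1), rng.uniform(z0, z1))
            hits += inside(p)                 # True counts as 1
            total += 1
        yield box * hits / total              # progressively refined estimate

# The unit ball as an implicit (inside/outside) function.
unit_ball = lambda p: p[0] ** 2 + p[1] ** 2 + p[2] ** 2 <= 1.0
```

The samples within each batch are independent, which is what makes the estimator both GPU-friendly and progressive: a thread per sample, a reduction per batch.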
  • Item
    Gyrofluid Modeling of Turbulent, Kinetic Physics
    (2011) Despain, Kate Marie; Dorland, William; Physics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Gyrofluid models of plasma turbulence combine the advantages of fluid models, such as lower dimensionality and well-developed intuition, with those of gyrokinetic models, such as finite Larmor radius (FLR) effects. This allows gyrofluid models to be more tractable computationally while still capturing much of the physics related to the FLR of the particles. We present a gyrofluid model derived to capture the behavior of slow solar wind turbulence and describe the computer code developed to implement the model. In addition, we describe the modifications we made to a gyrofluid model and code that simulate plasma turbulence in tokamak geometries. Specifically, we describe a nonlinear phase mixing phenomenon, part of the E×B term, that was previously missing from the model. An inherently FLR effect, it plays an important role in predicting turbulent heat flux and diffusivity levels for the plasma. We demonstrate this importance by comparing results from the updated code to studies done previously with gyrofluid and gyrokinetic codes. We further explain what would be necessary to couple the updated gyrofluid code, gryffin, to a turbulent transport code, thus allowing gryffin to play a role in predicting profiles for fusion devices such as ITER and in exploring novel fusion configurations. Such a coupling would require the use of Graphics Processing Units (GPUs) to make the modeling process fast enough to be viable. Consequently, we also describe our experience with GPU computing and demonstrate that we are poised to complete a port of gryffin to this innovative architecture.