Technical Reports from UMIACS

Permanent URI for this collection: http://hdl.handle.net/1903/7

Now showing 1 - 10 of 950
  • Pipelined CPU-GPU Scheduling for Caches
    (2021-03-23) Gerzhoy, Daniel; Yeung, Donald
    Heterogeneous microprocessors integrate a CPU and GPU with a shared cache hierarchy on the same chip, affording low-overhead communication between the CPU and GPU cores. Oftentimes, large array data structures are communicated from the CPU to the GPU and back. While the on-chip cache hierarchy can support such CPU-GPU producer-consumer sharing, this almost never happens due to poor temporal reuse. Because the data structures can be quite large, by the time the consumer reads the data, it has been evicted from cache even though the producer had brought it on-chip when it originally wrote the data. As a result, the CPU-GPU communication happens through main memory instead of the cache, hurting performance and energy. This paper exploits the on-chip caches in a heterogeneous microprocessor to improve CPU-GPU communication efficiency. We divide streaming computations executed by the CPU and GPU that exhibit producer-consumer sharing into chunks, and overlap the execution of CPU chunks with GPU chunks in a software pipeline. To enforce data dependences, the producer executes one chunk ahead of the consumer at all times. We also propose a low-overhead synchronization mechanism in which the CPU directly controls thread-block scheduling in the GPU to maintain the producer's "run-ahead distance" relative to the consumer. By adjusting the chunk size or run-ahead distance, we can make the CPU-GPU working set fit in the last-level cache (LLC), thus permitting the producer-consumer sharing to occur through the LLC. We show through simulation that our technique reduces the number of DRAM accesses by 30.4%, improves performance by 26.8%, and lowers memory system energy by 27.4%, averaged across 7 benchmarks.
  • Nervous system maps on the C. elegans genome
    (2020-09-28) Cherniak, Christopher; Mokhtarzada, Zekeria; Rodriguez-Esteban, Raul
    This project begins from a synoptic point of view, focusing upon the large-scale (global) landscape of the genome. This is along the lines of combinatorial network optimization in computational complexity theory [1]. Our research program here in turn originated along parallel lines in computational neuroanatomy [2,3,4,5]. Rather than mapping body structure onto the genome, the present report focuses upon statistically significant mappings of the Caenorhabditis elegans nervous system onto its genome. Via published datasets, evidence is derived for a "wormunculus", on the model of a homunculus representation, but on the C. elegans genome. The main method of testing somatic-genomic position-correlations here is via public genome databases, with r^2 analyses and p evaluations. These findings appear to yield some of the basic structural and functional organization of invertebrate nucleus and chromosome architecture. The design rationale for somatic maps on the genome in turn may be efficient interconnections. A next question this study raises: How do these various somatic maps mesh (interrelate, interact) with each other?
  • Design and Evaluation of Monolithic Computers Implemented Using Crossbar ReRAM
    (2019-07-16) Jagasivamani, Meenatchi; Walden, Candace; Singh, Devesh; Li, Shang; Kang, Luyi; Asnaashari, Mehdi; Dubois, Sylvain; Jacob, Bruce; Yeung, Donald
    A monolithic computer is an emerging architecture in which a multicore CPU and a high-capacity main memory system are all integrated in a single die. We believe such architectures will be possible in the near future due to nonvolatile memory technology, such as the resistive random access memory, or ReRAM, from Crossbar Incorporated. Crossbar's ReRAM can be fabricated in a standard CMOS logic process, allowing it to be integrated into a CPU's die. The ReRAM cells are manufactured in between metal wires and do not employ per-cell access transistors, leaving the bulk of the base silicon area vacant. This means that a CPU can be monolithically integrated directly underneath the ReRAM memory, allowing the cores to have massively parallel access to the main memory. This paper presents the characteristics of Crossbar's ReRAM technology, informing architects on how ReRAM can enable monolithic computers. Then, it develops a CPU and memory system architecture around those characteristics, especially to exploit the unprecedented memory-level parallelism. The architecture employs a tiled CPU, and incorporates memory controllers into every compute tile that support a variable access granularity to enable high scalability. Lastly, the paper conducts an experimental evaluation of monolithic computers on graph kernels and streaming computations. Our results show that compared to a DRAM-based tiled CPU, a monolithic computer achieves 4.7x higher performance on the graph kernels, and achieves rough parity on the streaming computations. Given a future 7nm technology node, a monolithic computer could outperform the conventional system by 66% for the streaming computations.
  • Exploiting Multi-Loop Parallelism on Heterogeneous Microprocessors
    (2016-11-10) Zuzak, Michael; Yeung, Donald
    Heterogeneous microprocessors integrate CPUs and GPUs on the same chip, providing fast CPU-GPU communication and enabling cores to compute on data "in place." These advantages will permit integrated GPUs to exploit a smaller unit of parallelism. But one challenge will be exposing sufficient parallelism to keep all of the on-chip compute resources fully utilized. In this paper, we argue that integrated CPU-GPU chips should exploit parallelism from multiple loops simultaneously. One example of this is nested parallelism in which one or more inner SIMD loops are nested underneath a parallel outer (non-SIMD) loop. By scheduling the parallel outer loop on multiple CPU cores, multiple dynamic instances of the inner SIMD loops can be scheduled on the GPU cores. This boosts GPU utilization and parallelizes the non-SIMD code. Our preliminary results show exploiting such multi-loop parallelism provides a 3.12x performance gain over exploiting parallelism from individual loops one at a time.
  • Body Maps on Human Chromosomes
    (2015-11-08) Cherniak, Christopher; Rodriguez-Esteban, Raul
    An exploration of the hypothesis that human genes are organized somatotopically: For each autosomal chromosome, its tissue-specific genes tend to have relative positions on the chromosome that mirror corresponding positions of the tissues in the body. In addition, there appears to be a division of labor: Such a homunculus representation on a chromosome holds significantly for either the anteroposterior or the dorsoventral body axis. In turn, anteroposterior and dorsoventral chromosomes tend to occupy separate zones in the sperm cell nucleus. One functional rationale of such large-scale organization is for efficient interconnections in the genome.
  • Accurate computation of Galerkin double surface integrals in the 3-D boundary element method
    (2015-05-29) Adelman, Ross; Gumerov, Nail A.; Duraiswami, Ramani
    Many boundary element integral equation kernels are based on the Green’s functions of the Laplace and Helmholtz equations in three dimensions. These include, for example, the Laplace, Helmholtz, elasticity, Stokes, and Maxwell equations. Integral equation formulations lead to more compact, but dense linear systems. These dense systems are often solved iteratively via Krylov subspace methods, which may be accelerated via the fast multipole method. There are advantages to Galerkin formulations for such integral equations, as they treat problems associated with kernel singularity, and lead to symmetric and better conditioned matrices. However, the Galerkin method requires each entry in the system matrix to be created via the computation of a double surface integral over one or more pairs of triangles. There are a number of semi-analytical methods to treat these integrals, which all have some issues, and are discussed in this paper. We present novel methods to compute all the integrals that arise in Galerkin formulations involving kernels based on the Laplace and Helmholtz Green’s functions to any specified accuracy. Integrals involving completely geometrically separated triangles are non-singular and are computed using a technique based on spherical harmonics and multipole expansions and translations, which results in the integration of polynomial functions over the triangles. Integrals involving cases where the triangles have common vertices, edges, or are coincident are treated via scaling and symmetry arguments, combined with automatic recursive geometric decomposition of the integrals. Example results are presented, and the developed software is available as open source.
  • A Stochastic Approach to Uncertainty in the Equations of MHD Kinematics
    (2014-07-10) Phillips, Edward G.; Elman, Howard C.
    The magnetohydrodynamic (MHD) kinematics model describes the electromagnetic behavior of an electrically conducting fluid when its hydrodynamic properties are assumed to be known. In particular, the MHD kinematics equations can be used to simulate the magnetic field induced by a given velocity field. While prescribing the velocity field leads to a simpler model than the fully coupled MHD system, this may introduce some epistemic uncertainty into the model. If the velocity of a physical system is not known with certainty, the magnetic field obtained from the model may not be reflective of the magnetic field seen in experiments. Additionally, uncertainty in physical parameters such as the magnetic resistivity may affect the reliability of predictions obtained from this model. By modeling the velocity and the resistivity as random variables in the MHD kinematics model, we seek to quantify the effects of uncertainty in these fields on the induced magnetic field. We develop stochastic expressions for these quantities and investigate their impact within a finite element discretization of the kinematics equations. We obtain mean and variance data through Monte-Carlo simulation for several test problems. Toward this end, we develop and test an efficient block preconditioner for the linear systems arising from the discretized equations.
  • Preconditioning Techniques for Reduced Basis Methods for Parameterized Partial Differential Equations
    (2014-05-27) Elman, Howard C.; Forstall, Virginia
    The reduced basis methodology is an efficient approach to solve parameterized discrete partial differential equations when the solution is needed at many parameter values. An offline step approximates the solution space and an online step utilizes this approximation, the reduced basis, to solve a smaller reduced problem, which provides an accurate estimate of the solution. Traditionally, the reduced problem is solved using direct methods. However, the size of the reduced system needed to produce solutions of a given accuracy depends on the characteristics of the problem, and it may happen that the size is significantly smaller than that of the original discrete problem but large enough to make direct solution costly. In this scenario, it may be more effective to use iterative methods to solve the reduced problem. We construct preconditioners for reduced iterative methods which are derived from preconditioners for the full problem. This approach permits reduced basis methods to be practical for larger bases than direct methods allow. We illustrate the effectiveness of iterative methods for solving reduced problems by considering two examples, the steady-state diffusion and convection-diffusion-reaction equations.
  • Anomaly Detection for Symbolic Representations
    (2014-03-25) Cox, Michael T.; Paisner, Matt; Oates, Tim; Perlis, Don
    A fully autonomous agent recognizes new problems, explains what causes such problems, and generates its own goals to solve these problems. Our approach to this goal-driven model of autonomy uses a methodology called the Note-Assess-Guide procedure. It instantiates a monitoring process in which an agent notes an anomaly in the world, assesses the nature and cause of that anomaly, and guides appropriate modifications to behavior. This report describes a novel approach to the note phase of that procedure. A-distance, a sliding-window statistical distance metric, is applied to numerical vector representations of intermediate states from plans generated for two symbolic domains. Using these representations, the metric is able to detect anomalous world states caused by restricting the actions available to the planner.
  • Recursive computation of spherical harmonic rotation coefficients of large degree
    (2014-03-28) Gumerov, Nail A.; Duraiswami, Ramani
    Computation of the spherical harmonic rotation coefficients or elements of Wigner's d-matrix is important in a number of quantum mechanics and mathematical physics applications. In particular, this is important for the Fast Multipole Methods in three dimensions for the Helmholtz, Laplace and related equations, if rotation-based decompositions of translation operators are used. In these and related problems involving the representation of functions on a sphere via spherical harmonic expansions, computation of the rotation coefficients of large degree n (of the order of thousands and more) may be necessary. Existing algorithms for their computation, based on recursions, are usually unstable, and do not extend to large n. We develop a new recursion and study its behavior for large degrees, via computational and asymptotic analyses. Stability of this recursion was studied based on a novel application of the Courant-Friedrichs-Lewy condition and the von Neumann method for stability of finite-difference schemes for solution of PDEs. A recursive algorithm of minimal complexity O(n^2) for degree n and FFT-based algorithms of complexity O(n^2 log n) suitable for computation of rotation coefficients of large degrees are proposed, studied numerically, and cross-validated. It is shown that the latter algorithm can be used for n <~ 10^3 in double precision, while the former algorithm was tested for large n (up to 10^4 in our experiments) and demonstrated better performance and accuracy compared to the FFT-based algorithm.
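
For readers unfamiliar with the quantities named in the last abstract above, the following is a minimal sketch of Wigner's small d-matrix for one degree n, computed from the textbook explicit finite sum. This is not the recursion or the FFT-based algorithm from the report: the factorial sum shown here is only practical for small n and loses accuracy quickly as n grows, which is the kind of limitation that motivates stable recursions. The function names `wigner_small_d` and `d_matrix` are hypothetical, and the sign and index convention is the common quantum-mechanics one, which may differ from the convention used in the report.

```python
from math import cos, sin, factorial, sqrt


def wigner_small_d(n, mp, m, beta):
    """Element d^n_{mp,m}(beta) of Wigner's small d-matrix.

    Uses the explicit finite sum (textbook formula). Assumes integer
    degree n >= 0 and integer orders -n <= mp, m <= n. Intended for
    small n only; this is not a stable large-degree algorithm.
    """
    c, s = cos(beta / 2.0), sin(beta / 2.0)
    prefactor = sqrt(
        factorial(n + mp) * factorial(n - mp)
        * factorial(n + m) * factorial(n - m)
    )
    total = 0.0
    # Sum only over k for which every factorial argument is non-negative.
    for k in range(max(0, m - mp), min(n + m, n - mp) + 1):
        sign = -1.0 if (mp - m + k) % 2 else 1.0
        denom = (
            factorial(n + m - k) * factorial(k)
            * factorial(mp - m + k) * factorial(n - mp - k)
        )
        total += (
            sign
            * c ** (2 * n + m - mp - 2 * k)
            * s ** (mp - m + 2 * k)
            / denom
        )
    return prefactor * total


def d_matrix(n, beta):
    """Full (2n+1) x (2n+1) matrix, rows mp = -n..n, columns m = -n..n."""
    orders = range(-n, n + 1)
    return [[wigner_small_d(n, mp, m, beta) for m in orders] for mp in orders]
```

A sketch like this can serve as a reference for cross-checking a faster method at small degrees, for example by confirming that `d_matrix(n, beta)` is orthogonal and reduces to the identity at `beta = 0`.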