# Technical Reports from UMIACS


### Recent Submissions

#### Pipelined CPU-GPU Scheduling for Caches (2021-03-23)
Gerzhoy, Daniel; Yeung, Donald

Heterogeneous microprocessors integrate a CPU and GPU with a shared cache hierarchy on the same chip, affording low-overhead communication between CPU and GPU cores. Often, large array data structures are communicated from the CPU to the GPU and back. While the on-chip cache hierarchy can support such CPU-GPU producer-consumer sharing, this almost never happens due to poor temporal reuse. Because the data structures can be quite large, by the time the consumer reads the data, it has been evicted from cache, even though the producer brought it on-chip when it originally wrote the data. As a result, the CPU-GPU communication happens through main memory instead of the cache, hurting performance and energy. This paper exploits the on-chip caches in a heterogeneous microprocessor to improve CPU-GPU communication efficiency. We divide streaming computations executed by the CPU and GPU that exhibit producer-consumer sharing into chunks, and overlap the execution of CPU chunks with GPU chunks in a software pipeline. To enforce data dependences, the producer executes one chunk ahead of the consumer at all times. We also propose a low-overhead synchronization mechanism in which the CPU directly controls thread-block scheduling in the GPU to maintain the producer's "run-ahead distance" relative to the consumer. By adjusting the chunk size or run-ahead distance, we can make the CPU-GPU working set fit in the last-level cache, thus permitting the producer-consumer sharing to occur through the LLC. We show through simulation that our technique reduces the number of DRAM accesses by 30.4%, improves performance by 26.8%, and lowers memory system energy by 27.4%, averaged across 7 benchmarks.

#### Nervous system maps on the C. elegans genome (2020-09-28)
Cherniak, Christopher; Mokhtarzada, Zekeria; Rodriguez-Esteban, Raul

This project begins from a synoptic point of view, focusing upon the large-scale (global) landscape of the genome. This is along the lines of combinatorial network optimization in computational complexity theory [1]. Our research program here in turn originated along parallel lines in computational neuroanatomy [2,3,4,5]. Rather than mapping body structure onto the genome, the present report focuses upon statistically significant mappings of the Caenorhabditis elegans nervous system onto its genome. Via published datasets, evidence is derived for a "wormunculus", on the model of a homunculus representation, but on the C. elegans genome. The main method of testing somatic-genomic position correlations here is via public genome databases, with r^2 analyses and p evaluations. These findings appear to yield some of the basic structural and functional organization of invertebrate nucleus and chromosome architecture. The design rationale for somatic maps on the genome may in turn be efficient interconnections. A next question this study raises: How do these various somatic maps mesh (interrelate, interact) with each other?

#### Design and Evaluation of Monolithic Computers Implemented Using Crossbar ReRAM (2019-07-16)
Jagasivamani, Meenatchi; Walden, Candace; Singh, Devesh; Li, Shang; Kang, Luyi; Asnaashari, Mehdi; Dubois, Sylvain; Jacob, Bruce; Yeung, Donald

A monolithic computer is an emerging architecture in which a multicore CPU and a high-capacity main memory system are all integrated in a single die. We believe such architectures will be possible in the near future due to nonvolatile memory technology, such as the resistive random access memory, or ReRAM, from Crossbar Incorporated. Crossbar's ReRAM can be fabricated in a standard CMOS logic process, allowing it to be integrated into a CPU's die. The ReRAM cells are manufactured in between metal wires and do not employ per-cell access transistors, leaving the bulk of the base silicon area vacant. This means that a CPU can be monolithically integrated directly underneath the ReRAM memory, allowing the cores to have massively parallel access to the main memory. This paper presents the characteristics of Crossbar's ReRAM technology, informing architects on how ReRAM can enable monolithic computers. It then develops a CPU and memory system architecture around those characteristics, especially to exploit the unprecedented memory-level parallelism. The architecture employs a tiled CPU and incorporates memory controllers into every compute tile that support a variable access granularity to enable high scalability. Lastly, the paper conducts an experimental evaluation of monolithic computers on graph kernels and streaming computations. Our results show that, compared to a DRAM-based tiled CPU, a monolithic computer achieves 4.7x higher performance on the graph kernels and roughly parity on the streaming computations. Given a future 7nm technology node, a monolithic computer could outperform the conventional system by 66% for the streaming computations.

#### Exploiting Multi-Loop Parallelism on Heterogeneous Microprocessors (2016-11-10)
Zuzak, Michael; Yeung, Donald

Heterogeneous microprocessors integrate CPUs and GPUs on the same chip, providing fast CPU-GPU communication and enabling cores to compute on data "in place." These advantages will permit integrated GPUs to exploit a smaller unit of parallelism. But one challenge will be exposing sufficient parallelism to keep all of the on-chip compute resources fully utilized. In this paper, we argue that integrated CPU-GPU chips should exploit parallelism from multiple loops simultaneously. One example of this is nested parallelism, in which one or more inner SIMD loops are nested underneath a parallel outer (non-SIMD) loop.
By scheduling the parallel outer loop on multiple CPU cores, multiple dynamic instances of the inner SIMD loops can be scheduled on the GPU cores. This boosts GPU utilization and parallelizes the non-SIMD code. Our preliminary results show that exploiting such multi-loop parallelism provides a 3.12x performance gain over exploiting parallelism from individual loops one at a time.

#### Body Maps on Human Chromosomes (2015-11-08)
Cherniak, Christopher; Rodriguez-Esteban, Raul

An exploration of the hypothesis that human genes are organized somatotopically: for each autosomal chromosome, its tissue-specific genes tend to have relative positions on the chromosome that mirror the corresponding positions of the tissues in the body. In addition, there appears to be a division of labor: such a homunculus representation on a chromosome holds significantly for either the anteroposterior or the dorsoventral body axis. In turn, anteroposterior and dorsoventral chromosomes tend to occupy separate zones in the sperm-cell nucleus. One functional rationale of such large-scale organization is efficient interconnections in the genome.

#### Accurate computation of Galerkin double surface integrals in the 3-D boundary element method (2015-05-29)
Adelman, Ross; Gumerov, Nail A.; Duraiswami, Ramani

Many boundary element integral equation kernels are based on the Green's functions of the Laplace and Helmholtz equations in three dimensions. These include, for example, the Laplace, Helmholtz, elasticity, Stokes, and Maxwell equations. Integral equation formulations lead to more compact, but dense, linear systems. These dense systems are often solved iteratively via Krylov subspace methods, which may be accelerated via the fast multipole method. There are advantages to Galerkin formulations for such integral equations, as they treat problems associated with kernel singularity and lead to symmetric and better-conditioned matrices. However, the Galerkin method requires each entry in the system matrix to be created via the computation of a double surface integral over one or more pairs of triangles. There are a number of semi-analytical methods to treat these integrals, which all have some issues and are discussed in this paper. We present novel methods to compute all the integrals that arise in Galerkin formulations involving kernels based on the Laplace and Helmholtz Green's functions to any specified accuracy. Integrals involving completely geometrically separated triangles are non-singular and are computed using a technique based on spherical harmonics and multipole expansions and translations, which results in the integration of polynomial functions over the triangles. Integrals involving cases where the triangles have common vertices or edges, or are coincident, are treated via scaling and symmetry arguments combined with automatic recursive geometric decomposition of the integrals. Example results are presented, and the developed software is available as open source.

#### A Stochastic Approach to Uncertainty in the Equations of MHD Kinematics (2014-07-10)
Phillips, Edward G.; Elman, Howard C.

The magnetohydrodynamic (MHD) kinematics model describes the electromagnetic behavior of an electrically conducting fluid when its hydrodynamic properties are assumed to be known. In particular, the MHD kinematics equations can be used to simulate the magnetic field induced by a given velocity field. While prescribing the velocity field leads to a simpler model than the fully coupled MHD system, this may introduce some epistemic uncertainty into the model. If the velocity of a physical system is not known with certainty, the magnetic field obtained from the model may not be reflective of the magnetic field seen in experiments. Additionally, uncertainty in physical parameters such as the magnetic resistivity may affect the reliability of predictions obtained from this model. By modeling the velocity and the resistivity as random variables in the MHD kinematics model, we seek to quantify the effects of uncertainty in these fields on the induced magnetic field. We develop stochastic expressions for these quantities and investigate their impact within a finite element discretization of the kinematics equations. We obtain mean and variance data through Monte Carlo simulation for several test problems. Toward this end, we develop and test an efficient block preconditioner for the linear systems arising from the discretized equations.

#### Preconditioning Techniques for Reduced Basis Methods for Parameterized Partial Differential Equations (2014-05-27)
Elman, Howard C.; Forstall, Virginia

The reduced basis methodology is an efficient approach for solving parameterized discrete partial differential equations when the solution is needed at many parameter values. An offline step approximates the solution space, and an online step utilizes this approximation, the reduced basis, to solve a smaller reduced problem, which provides an accurate estimate of the solution. Traditionally, the reduced problem is solved using direct methods. However, the size of the reduced system needed to produce solutions of a given accuracy depends on the characteristics of the problem, and it may happen that the size is significantly smaller than that of the original discrete problem but large enough to make direct solution costly. In this scenario, it may be more effective to use iterative methods to solve the reduced problem. We construct preconditioners for reduced iterative methods that are derived from preconditioners for the full problem. This approach permits reduced basis methods to be practical for larger bases than direct methods allow.
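The offline/online split just described can be sketched on a toy problem. Everything below (the tridiagonal A(mu), the training parameters, the helper solvers) is invented for illustration and is not the discretization or preconditioner studied in the report:

```python
# Toy offline/online reduced basis solve for A(mu) x = b, where
# A(mu) = tridiag(-1, 2, -1) + mu*I is a small invented SPD system.

def matvec(A, x):
    return [sum(A[i][j] * x[j] for j in range(len(x))) for i in range(len(A))]

def gauss_solve(A, b):
    # Dense Gaussian elimination with partial pivoting (small systems only).
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def build_A(mu, n):
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 2.0 + mu
        if i > 0: A[i][i - 1] = -1.0
        if i < n - 1: A[i][i + 1] = -1.0
    return A

n = 40
b = [1.0] * n

# Offline: full solves ("snapshots") at a few training parameters,
# orthonormalized into a basis Q by modified Gram-Schmidt.
Q = []
for mu in (0.1, 1.0, 10.0):
    v = gauss_solve(build_A(mu, n), b)
    for q in Q:
        c = sum(vi * qi for vi, qi in zip(v, q))
        v = [vi - c * qi for vi, qi in zip(v, q)]
    norm = sum(vi * vi for vi in v) ** 0.5
    Q.append([vi / norm for vi in v])

# Online: for a new parameter, solve only the k x k reduced system
# (Q^T A Q) y = Q^T b and expand x ~= Q y.
mu_new = 3.0
A = build_A(mu_new, n)
AQ = [matvec(A, q) for q in Q]
Ar = [[sum(qi * aq for qi, aq in zip(Q[i], AQ[j])) for j in range(len(Q))]
      for i in range(len(Q))]
br = [sum(qi * bi for qi, bi in zip(q, b)) for q in Q]
y = gauss_solve(Ar, br)
x_rb = [sum(y[j] * Q[j][i] for j in range(len(Q))) for i in range(n)]

# Compare against the full solve at mu_new.
x_full = gauss_solve(A, b)
err = max(abs(a - c) for a, c in zip(x_rb, x_full))
print("max reduced-basis error:", err)
```

When the reduced system is too large for a direct solve, this is the point where the report's projected preconditioners would be applied inside an iterative solver instead of `gauss_solve`.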
We illustrate the effectiveness of iterative methods for solving reduced problems by considering two examples, the steady-state diffusion and convection-diffusion-reaction equations.

#### Anomaly Detection for Symbolic Representations (2014-03-25)
Cox, Michael T.; Paisner, Matt; Oates, Tim; Perlis, Don

A fully autonomous agent recognizes new problems, explains what causes such problems, and generates its own goals to solve these problems. Our approach to this goal-driven model of autonomy uses a methodology called the Note-Assess-Guide procedure. It instantiates a monitoring process in which an agent notes an anomaly in the world, assesses the nature and cause of that anomaly, and guides appropriate modifications to behavior. This report describes a novel approach to the note phase of that procedure. A-distance, a sliding-window statistical distance metric, is applied to numerical vector representations of intermediate states from plans generated for two symbolic domains. Using these representations, the metric is able to detect anomalous world states caused by restricting the actions available to the planner.

#### Recursive computation of spherical harmonic rotation coefficients of large degree (2014-03-28)
Gumerov, Nail A.; Duraiswami, Ramani

Computation of the spherical harmonic rotation coefficients, or elements of Wigner's d-matrix, is important in a number of quantum mechanics and mathematical physics applications. In particular, this is important for the Fast Multipole Methods in three dimensions for the Helmholtz, Laplace, and related equations, if rotation-based decomposition of translation operators is used. In these and related problems involving the representation of functions on a sphere via spherical harmonic expansions, computation of the rotation coefficients of large degree n (of the order of thousands and more) may be necessary. Existing algorithms for their computation, based on recursions, are usually unstable and do not extend to large n. We develop a new recursion and study its behavior for large degrees via computational and asymptotic analyses. Stability of this recursion was studied based on a novel application of the Courant-Friedrichs-Lewy condition and the von Neumann method for stability of finite-difference schemes for the solution of PDEs. A recursive algorithm of minimal complexity O(n^2) for degree n, and FFT-based algorithms of complexity O(n^2 log n), suitable for computation of rotation coefficients of large degrees, are proposed, studied numerically, and cross-validated. It is shown that the latter algorithm can be used for n <~ 10^3 in double precision, while the former algorithm was tested for large n (up to 10^4 in our experiments) and demonstrated better performance and accuracy compared to the FFT-based algorithm.

#### Studying Directory Access Patterns via Reuse Distance Analysis and Evaluating Their Impact on Multi-Level Directory Caches (2014-01-13)
Zhao, Minshu; Yeung, Donald

The trend for multicore CPUs is towards increasing core count. One of the key limiters to scaling will be the on-chip directory cache. Our work investigates moving portions of the directory away from the cores, perhaps to off-chip DRAM, where ample capacity exists. While such multi-level directory caches exhibit increased latency, several aspects of directory accesses will shield CPU performance from the slower directory, including low access frequency and latency hiding underneath data accesses to main memory. While multi-level directory caches have been studied previously, no work has yet comprehensively quantified the directory access patterns themselves, making it difficult to understand multi-level behavior in depth. This paper presents a framework based on multicore reuse distance for studying directory cache access patterns.
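The underlying metric is reuse distance: the number of distinct addresses touched between consecutive references to the same address. A minimal single-trace version can be sketched as follows (the report's framework generalizes this to multicore directory accesses; the `trace` below is an invented example):

```python
# Minimal reuse (stack) distance computation over an address trace:
# the reuse distance of an access is the number of distinct addresses
# referenced since the previous access to the same address
# (infinite for first-time, "cold" accesses).

def reuse_distances(trace):
    stack = []          # addresses in LRU order, most recently used last
    out = []
    for addr in trace:
        if addr in stack:
            i = stack.index(addr)
            out.append(len(stack) - 1 - i)   # distinct addrs since last use
            stack.pop(i)
        else:
            out.append(float("inf"))         # cold access
        stack.append(addr)
    return out

trace = ["A", "B", "C", "A", "B", "B", "D", "A"]
print(reuse_distances(trace))   # → [inf, inf, inf, 2, 2, 0, inf, 2]
```

In a fully associative LRU cache of capacity C, exactly the accesses with reuse distance less than C hit, which is what makes the metric useful for reasoning about directory cache sizing.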
Using our analysis framework, we show that between 69-93% of directory entries are looked up only once or twice during their lifetimes in the directory cache, and between 51-71% of dynamic directory accesses are latency tolerant. Using cache simulations, we show that a very small L1 directory cache can service 80% of latency-critical directory lookups. Although a significant number of directory lookups and eviction notifications must access the slower L2 directory cache, virtually all of these are latency tolerant.

#### A Block Preconditioner for an Exact Penalty Formulation for Stationary MHD (2014-02-04)
Phillips, Edward G.; Elman, Howard C.; Cyr, Eric C.; Shadid, John N.; Pawlowski, Roger P.

The magnetohydrodynamics (MHD) equations are used to model the flow of electrically conducting fluids in such applications as liquid metals and plasmas. This system of non-self-adjoint, nonlinear PDEs couples the Navier-Stokes equations for fluids and Maxwell's equations for electromagnetics. There has been recent interest in fully coupled solvers for the MHD system because they allow for fast steady-state solutions that do not require pseudo-time stepping. When the fully coupled system is discretized, the strong coupling can make the resulting algebraic systems difficult to solve, requiring effective preconditioning of iterative methods for efficiency. In this work, we consider a finite element discretization of an exact penalty formulation for the stationary MHD equations. This formulation has the benefit of implicitly enforcing the divergence-free condition on the magnetic field without requiring a Lagrange multiplier. We consider extending block preconditioning techniques developed for the Navier-Stokes equations to the full MHD system. We analyze operators arising in block decompositions from a continuous perspective and apply arguments based on the existence of approximate commutators to develop new preconditioners that account for the physical coupling.
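For readers unfamiliar with block preconditioning, the general pattern being extended is the block-triangular solve used for saddle-point systems. The toy sketch below uses invented 2x2 matrices and a scalar Schur-complement approximation, not the MHD operators analyzed in the report:

```python
# Toy block-triangular preconditioner application for a 2x2-block
# saddle-point system [[F, Bt], [B, 0]]; S is an (invented) Schur
# complement approximation. Navier-Stokes-style and MHD block
# preconditioners follow this same solve-then-substitute pattern
# with operators taken from the discretization.

def gauss_solve(A, b):
    # Dense Gaussian elimination with partial pivoting (small systems only).
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def apply_block_preconditioner(F, Bt, S, r_u, r_p):
    # P = [[F, Bt], [0, -S]]: solve the bottom block first, then substitute.
    p = gauss_solve(S, [-ri for ri in r_p])
    rhs = [r_u[i] - sum(Bt[i][j] * p[j] for j in range(len(p)))
           for i in range(len(r_u))]
    u = gauss_solve(F, rhs)
    return u, p

F = [[4.0, 1.0], [1.0, 3.0]]      # "velocity" block (invented)
Bt = [[1.0], [0.5]]               # coupling block (invented)
S = [[2.0]]                       # Schur complement approximation (invented)
u, p = apply_block_preconditioner(F, Bt, S, [1.0, 0.0], [0.5])
print(u, p)
```

In practice each inner solve is itself approximated (e.g., by multigrid), and the quality of the Schur complement approximation is what the approximate-commutator arguments are used to control.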
This analysis results in a family of parameterized block preconditioners for both Picard and Newton linearizations. We develop an automated method for choosing the relevant parameters and demonstrate the robustness of these preconditioners for a range of the physical non-dimensional parameters and with respect to mesh refinement.

#### Proceedings of the 2013 Annual Conference on Advances in Cognitive Systems: Workshop on Metacognition about Artificial Situated Agents (2013-12-14)
Josyula, Darsana; Robertson, Paul; Cox, Michael T.

Metacognition is the process of thinking about thinking. It provides cognitive systems the ability to note and deal with anomalies, changes, opportunities, surprises, and uncertainty. It includes both monitoring of cognitive activities and control of such activities: monitoring helps to evaluate and explain the cognitive activities, while control helps to adapt or modify them. Situated agents are agents embedded in a dynamic environment that they can sense or perceive and manipulate or change through their actions. Similarly, they can act in order to manipulate other agents among which they are situated. Examples might include robots, natural language dialog interfaces, web-based agents, or virtual-reality bots. An agent can leverage metacognition of its own thinking about other agents in its situated environment. It can equally benefit from metacognition of the thinking of other agents towards itself. Metacognitive monitoring can help situated agents in negotiations, conflict resolution, and norm awareness. Metacognitive control can help coordination and coalition formation of situated social agents. In this workshop, we investigate the monitoring and control aspects of metacognition about self and other agents, and their application to situated artificial agents. The papers in this report cover some of the current work related to metacognition in the areas of meta-knowledge representation, meta-reasoning, and metacognitive architecture. Perlis et al. outline a high-level view of architectures for real-time situated agents and the reliance of such agents on metacognition. Mbale, K. and Josyula, D. present a generic metacognitive component based on preserving the homeostasis of a host agent. Pickett, M. presents a framework for representing, learning, and processing meta-knowledge. Riddle, P. et al. discuss meta-level search through a problem representation space for problem reformulation. Caro, M. et al. use metamemory to adapt to changes in memory retrieval constraints. Langley, P. et al. abstract general problem-specific abilities into strategic problem-solving knowledge in an architecture for flexible problem solving across various domains. Samsonovich, A. examines metacognition as a means to improve fluid intelligence in a cognitive architecture. Perlis, D. and Cox, M. discuss the application of metacognitive monitoring to anomaly detection and goal generation.

#### Goal Reasoning: Papers from the ACS workshop (2013-12-14)
Aha, David W.; Cox, Michael T.; Munoz-Avila, Hector

This technical report contains the 11 accepted papers presented at the Workshop on Goal Reasoning, which was held as part of the 2013 Conference on Advances in Cognitive Systems (ACS-13) in Baltimore, Maryland on 14 December 2013. This is the third in a series of workshops related to this topic; the first was the AAAI-10 Workshop on Goal-Directed Autonomy, while the second was the Self-Motivated Agents (SeMoA) Workshop, held at Lehigh University in November 2012. Our objective for holding this meeting was to encourage researchers to share information on the study, development, integration, evaluation, and application of techniques related to goal reasoning, which concerns the ability of an intelligent agent to reason about, formulate, select, and manage its goals/objectives. Goal reasoning differs from frameworks in which agents are told what goals to achieve (and possibly how goals can be decomposed into subgoals) but cannot dynamically and autonomously decide what goals they should pursue. This constraint can be limiting for agents that solve tasks in complex environments, where it is not feasible to manually engineer/encode complete knowledge of what goal(s) should be pursued for every conceivable state. Yet in such environments, states can be reached in which actions can fail, opportunities can arise, and events can otherwise take place that strongly motivate changing the goal(s) that the agent is currently trying to achieve. This topic is not new; researchers in several areas have studied goal reasoning (e.g., in the context of cognitive architectures, automated planning, game AI, and robotics). However, it has infrequently been the focus of intensive study, and (to our knowledge) no other series of meetings has focused specifically on goal reasoning. As shown in these papers, providing an agent with the ability to reason about its goals can increase performance measures for some tasks. Recent advances in hardware and software platforms (involving the availability of interesting/complex simulators or databases) have increasingly permitted the application of intelligent agents to tasks that involve partially observable and dynamically updated states (e.g., due to unpredictable exogenous events), stochastic actions, multiple (cooperating, neutral, or adversarial) agents, and other complexities. Thus, this is an appropriate time to foster dialogue among researchers with interests in goal reasoning. Research on goal reasoning is still in its early stages; no mature application of it yet exists (e.g., for controlling autonomous unmanned vehicles or in a deployed decision aid). However, it appears to have a bright future. For example, leaders in the automated planning community have specifically acknowledged that goal reasoning has a prominent role among intelligent agents that act on their own plans, and it is gathering increasing attention from roboticists and cognitive systems researchers. In addition to a survey, the papers in this workshop relate to, among other topics, cognitive architectures and models, environment modeling, game AI, machine learning, meta-reasoning, planning, self-motivated systems, simulation, and vehicle control. The authors discuss a wide range of issues pertaining to goal reasoning, including representations and reasoning methods for dynamically revising goal priorities. We hope that readers will find this theme for enhancing agent autonomy appealing and relevant to their own interests, and that these papers will spur further investigations on this important yet (mostly) understudied topic.

#### Efficient Iterative Algorithms for Linear Stability Analysis of Incompressible Flows (2013-11-07)
Elman, Howard C.; Rostami, Minghao W.

Linear stability analysis of a dynamical system entails finding the rightmost eigenvalue for a series of eigenvalue problems. For large-scale systems, it is known that conventional iterative eigenvalue solvers are not reliable for computing this eigenvalue. A more robust method recently developed in Elman & Wu (2012) and Meerbergen & Spence (2010), Lyapunov inverse iteration, involves solving large-scale Lyapunov equations, which in turn requires the solution of large, sparse linear systems analogous to those arising from solving the underlying partial differential equations.
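For reference, a Lyapunov equation has the form A X + X A^T + Q = 0. A tiny dense instance can be solved by unfolding it into an n^2 x n^2 linear system, as sketched below; this naive Kronecker-style approach is only for illustration, since the large-scale solvers the study considers never form this system explicitly:

```python
# A Lyapunov equation A X + X A^T + Q = 0 solved naively for a tiny
# dense case by rewriting it as an n^2 x n^2 linear system.
# A and Q below are invented test data (A is stable, Q is SPD).

def gauss_solve(A, b):
    # Dense Gaussian elimination with partial pivoting (small systems only).
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

n = 2
A = [[-2.0, 1.0], [0.0, -1.0]]
Q = [[1.0, 0.0], [0.0, 1.0]]

# Build M * vec(X) = -vec(Q), with vec(X) in row-major order.
M = [[0.0] * (n * n) for _ in range(n * n)]
rhs = []
for i in range(n):
    for j in range(n):
        row = i * n + j
        for k in range(n):
            M[row][k * n + j] += A[i][k]     # (A X)[i][j] term
            M[row][i * n + k] += A[j][k]     # (X A^T)[i][j] term
        rhs.append(-Q[i][j])

x = gauss_solve(M, rhs)
X = [[x[i * n + j] for j in range(n)] for i in range(n)]

# Check the residual A X + X A^T + Q (should vanish).
R = [[sum(A[i][k] * X[k][j] for k in range(n)) +
      sum(X[i][k] * A[j][k] for k in range(n)) + Q[i][j]
      for j in range(n)] for i in range(n)]
print("X =", X, "max residual =", max(abs(r) for row in R for r in row))
```

Because this unfolding scales as n^6 work, practical methods instead exploit low-rank structure and solve only sparse systems of the original size, which is the setting the study addresses.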
This study explores the efficient implementation of Lyapunov inverse iteration when it is used for linear stability analysis of incompressible flows. Efficiencies are obtained from effective solution strategies for the Lyapunov equations and for the underlying partial differential equations. Existing solution strategies are tested and compared, and a modified version of a Lyapunov solver is proposed that achieves significant savings in computational cost.

#### A Method to Compute Periodic Sums (2013-10-09)
Gumerov, Nail A.; Duraiswami, Ramani

In a number of problems in computational physics, a finite sum of kernel functions centered at N particle locations located in a box in three dimensions must be extended by imposing periodic boundary conditions on box boundaries. Even though the finite sum can be efficiently computed via fast summation algorithms, such as the fast multipole method (FMM), the periodized extension, posed as an infinite sum of kernel functions centered at the particle locations in the box and their images, is usually treated via a different algorithm, Ewald summation, which is then accelerated via the fast Fourier transform (FFT). A method for computing this periodized sum using just a black-box finite fast summation algorithm is presented in this paper. The method splits the periodized sum into two parts. The first, comprising the contribution of all points outside a large sphere enclosing the box and some of its neighbors, is approximated inside the box by a collection of kernel functions ("sources") placed on the surface of the sphere. These are approximated within the box using an expansion in terms of spectrally convergent local basis functions. The second part, comprising the part inside the sphere and including the box and its immediate neighborhood, is treated via the summation algorithm. The coefficients of the sources are determined by least squares collocation of the periodicity condition of the total potential, imposed on a circumspherical surface for the box. While the method is presented in general, details are worked out for the case of evaluating potentials and forces due to electrostatically charged particles in a box. Results show that when used with the FMM, the periodized sum can be computed to any specified accuracy, at a cost of about twice that of the free-space FMM with the same accuracy. Several technical details and efficient algorithms for auxiliary computations are also provided, as are numerical comparisons.

#### MIDCA: A Metacognitive, Integrated Dual-Cycle Architecture for Self-Regulated Autonomy (2013-09-23)
Cox, Michael T.; Oates, Tim

This report documents research performed under ONR grant N000141210172 for the period 1 June 2012 through 31 May 2013. The goals of this research are to provide a sound theoretical understanding of the role of metacognition in cognitive architectures and to demonstrate the underlying theory through implemented computational models. During the last year, the team has been integrating existing implemented systems to form an initial architectural structure that approximates the major functions of MIDCA. These include the SHOP2 hierarchical planning system and the Meta-AQUA integrated multistrategy learning system. We have also made substantial progress on the data-driven track of the interpretation procedure. Last year's work on using the A-distance metric for anomaly detection has matured, and we have collected substantial observations used in empirical evaluation. Additionally, we started implementation of a neural network to induce prototype nodes for observed anomalies, and we are developing methods to prioritize explanations and responses that have proven effective with past anomalies in prototype categories. The data are encouraging, and the research community has reacted favorably. Several new publications support our claims herein.

#### Symbiotic Cache Resizing for CMPs with Shared LLC (2013-09-11)
Choi, Inseok; Yeung, Donald

This paper investigates the problem of finding the optimal sizes of private caches and a shared LLC in CMPs. Resizing private and shared caches in modern CMPs is one way to squeeze wasteful power consumption out of architectures to improve power efficiency. However, shrinking each private/shared cache has a different impact on performance loss and power savings, because each cache contributes differently to performance and power. It is beneficial for both performance and power to shrink the LRU way of the private/shared cache that saves the most power and increases data traffic the least. This paper presents Symbiotic Cache Resizing (SCR), a runtime technique that reduces the total power consumption of the on-chip cache hierarchy in CMPs with a shared LLC. SCR turns off private/shared-cache ways in an inter-core and inter-level manner so that each disabling achieves the best power saving while maintaining high performance. SCR finds such optimal cache sizes by utilizing greedy algorithms that we develop in this study. In particular, Prioritized Way Selection picks the most power-inefficient way. LLC-Partitioning-aware Prioritized Way Selection finds optimal cache sizes from a multi-level perspective. Lastly, Weighted Threshold Throttling finds the optimal threshold per cache level. We evaluate SCR in two-core, four-core, and eight-core systems. Results show that SCR saves 13% power in the on-chip cache hierarchy and 4.2% power in the system compared to an even LLC partitioning technique.
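The greedy prioritized-way-selection idea can be caricatured in a few lines: repeatedly disable the candidate way with the best power-saved-per-performance-lost ratio while a performance-loss budget holds. All figures below are invented for illustration, not taken from the paper:

```python
# Toy greedy way selection: each candidate way has an estimated power
# saving and performance loss if disabled; repeatedly disable the most
# power-inefficient way while total slowdown stays under a budget.

def greedy_resize(candidates, budget):
    disabled, total_loss, total_saved = [], 0.0, 0.0
    remaining = list(candidates)
    while True:
        viable = [c for c in remaining if total_loss + c["perf_loss"] <= budget]
        if not viable:
            break
        # Pick the best power-saved / performance-lost ratio.
        best = max(viable, key=lambda c: c["power_saved"] / (c["perf_loss"] + 1e-9))
        remaining.remove(best)
        disabled.append(best["name"])
        total_loss += best["perf_loss"]
        total_saved += best["power_saved"]
    return disabled, total_saved, total_loss

ways = [
    {"name": "L1d-core0-way7", "power_saved": 2.0, "perf_loss": 0.5},
    {"name": "LLC-way15",      "power_saved": 5.0, "perf_loss": 0.2},
    {"name": "LLC-way14",      "power_saved": 5.0, "perf_loss": 1.5},
    {"name": "L1d-core1-way7", "power_saved": 2.0, "perf_loss": 2.5},
]
order, saved, loss = greedy_resize(ways, budget=2.5)
print(order, saved, loss)
```

The real technique additionally coordinates these choices across cache levels and with the LLC partitioning, which is what the multi-level and threshold variants in the paper address.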
SCR saves 2.7x more power in the cache hierarchy than the state-of-the-art LLC resizing technique while achieving better performance.

#### Hierarchical O(N) Computation of Small-Angle Scattering Profiles and their Associated Derivatives (2013-05-25)
Berlin, Konstantin; Gumerov, Nail A.; Fushman, David; Duraiswami, Ramani

Fast algorithms for Debye summation, which arises in computations performed in crystallography, small/wide-angle X-ray scattering (SAXS/WAXS), and small-angle neutron scattering (SANS), were recently presented in Gumerov et al. (J. Comput. Chem., 2012, 33:1981). The use of these algorithms can speed up computation of scattering profiles in macromolecular structure refinement protocols. However, these protocols often employ an iterative gradient-based optimization procedure, which then requires derivatives of the profile with respect to atomic coordinates. An extension to one of the algorithms is presented which allows accurate, O(N)-cost computation of the derivatives along with the scattering profile. Computational results show orders-of-magnitude improvement in computational efficiency while maintaining prescribed accuracy. This opens the possibility of efficiently integrating small-angle scattering data into structure determination and refinement of macromolecular systems.

#### The compiler for the XMTC parallel language: Lessons for compiler developers and in-depth description (2011-02-18)
Tzannes, Alexandros; Caragea, George C.; Vishkin, Uzi; Barua, Rajeev

In this technical report, we present information on the XMTC compiler and language. We start by presenting the XMTC memory model and the issues we encountered when using GCC, the popular GNU compiler for C and other sequential languages, as the basis for a compiler for XMTC, a parallel language. These topics, along with some information on XMT-specific optimizations, were presented in [10]. Then we give more details on how outer spawn statements (i.e., parallel loops) are compiled to take advantage of XMT's unique hardware primitives for scheduling flat parallelism, and how we extended this basic compiler to support nested parallelism.
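The nested-parallelism pattern targeted here and in the multi-loop parallelism report above (a parallel outer loop whose iterations each spawn inner data-parallel loops) can be sketched structurally in plain Python. XMTC itself expresses this with C-style spawn blocks, so the snippet below is a structural analogy, not XMTC code:

```python
# Structural sketch of nested parallelism: a parallel outer loop (one
# worker pool standing in for CPU cores or outer spawn blocks) whose
# iterations each spawn inner data-parallel loops (a second pool
# standing in for GPU cores or inner spawns). Shows the scheduling
# shape, not the performance.
from concurrent.futures import ThreadPoolExecutor

def inner_simd_loop(chunk):
    return [x * x for x in chunk]          # element-wise "SIMD" work

def outer_iteration(row, inner_pool):
    # Each outer iteration spawns inner work in chunks of 2 elements.
    chunks = [row[i:i + 2] for i in range(0, len(row), 2)]
    results = inner_pool.map(inner_simd_loop, chunks)
    return [x for chunk in results for x in chunk]

data = [[1, 2, 3, 4], [5, 6, 7, 8]]
with ThreadPoolExecutor(max_workers=2) as outer_pool, \
     ThreadPoolExecutor(max_workers=4) as inner_pool:
    out = list(outer_pool.map(lambda r: outer_iteration(r, inner_pool), data))
print(out)   # → [[1, 4, 9, 16], [25, 36, 49, 64]]
```

The compiler problem the report describes is precisely how to map this two-level structure onto hardware that natively schedules only flat (single-level) parallelism.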