Theses and Dissertations from UMD

Permanent URI for this community: http://hdl.handle.net/1903/2

New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be up to a four-month delay in the appearance of a given thesis/dissertation in DRUM.

More information is available at Theses and Dissertations at University of Maryland Libraries.

Search Results

Now showing 1 - 5 of 5
  • Item
    Fair and Scalable Algorithms on Massive Graphs
    (2023) Knittel, Marina; Hajiaghayi, MohammadTaghi; Dickerson, John; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    The impact of massive data processing on our daily lives is becoming increasingly clear. With this, we must guarantee that vital, data-driven decision-making systems mitigate the potential negative impacts of the technologies themselves and scale to the massive datasets we have access to today. We explore both of these facets by studying fairness and scalability for algorithms on large graphs. In Part I, we focus on fair hierarchical clustering. Our first work on this topic [Ahmadian et al., 2020b] initiates the study of fair hierarchical clustering by extending Chierichetti et al.’s [Chierichetti et al., 2017] notion of representationally proportional flat clustering to the hierarchical setting. From there, we introduce the first approximation algorithms for three well-studied hierarchical clustering optimization problems in the fair context: cost [Dasgupta, 2016], revenue [Moseley and Wang, 2017], and value [Cohen-Addad et al., 2018]. Our initial work studies all three fair optimization problems, and our follow-up works [Knittel et al., 2023a, Knittel et al., 2023b] dive deeper into the notoriously difficult cost optimization problem. Regarding scalability, we leverage the Massively Parallel Computation (MPC) model, as well as its recent successor Adaptive Massively Parallel Computation (AMPC), to develop efficient graph algorithms for big data. MPC, discussed in Part II, has been one of the most practically useful models for massively parallel algorithms in the past decade, influencing a number of major frameworks including MapReduce, Hadoop, Spark, and Flume. In this model, we present our work on edge coloring [Behnezhad et al., 2019b], hierarchical clustering [Hajiaghayi and Knittel, 2020], and tree embeddings [Ahanchi et al., 2023]. AMPC improves upon the MPC model by adding access to a distributed hash table while still remaining highly implementable in practice. This allows it to overcome some shortcomings identified in the MPC literature, notably the 1vs2Cycle Conjecture (i.e., that differentiating between a single cycle and two cycles is difficult in MPC). In Part III, we introduce a highly efficient and general tool for executing tree contractions in AMPC [Hajiaghayi et al., 2022b] and additionally exhibit the power of AMPC on minimum cut problems [Hajiaghayi et al., 2022a].
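    For reference, the sketch below illustrates the cost objective of [Dasgupta, 2016] that the abstract refers to: every graph edge pays its weight multiplied by the number of leaves under the lowest common ancestor of its endpoints in the hierarchy. This is a minimal illustration of the objective only; the tree encoding, function names, and toy graph are assumptions made for the example, not code from the dissertation.

    ```python
    # Illustrative sketch of Dasgupta's cost objective for hierarchical clustering.
    # A hierarchy is encoded as nested pairs of leaf ids, e.g. ((0, 1), (2, 3)).

    def leaves(tree):
        """Return the set of leaf ids under a node (a leaf is a bare int)."""
        if isinstance(tree, int):
            return {tree}
        left, right = tree
        return leaves(left) | leaves(right)

    def dasgupta_cost(tree, edges):
        """Sum over graph edges (u, v, w) of w * |leaves(LCA(u, v))|."""
        if isinstance(tree, int):
            return 0
        left, right = tree
        left_leaves, right_leaves = leaves(left), leaves(right)
        size_here = len(left_leaves) + len(right_leaves)
        cost = 0
        for u, v, w in edges:
            # An edge whose endpoints are split by this node has its LCA here.
            if (u in left_leaves and v in right_leaves) or (u in right_leaves and v in left_leaves):
                cost += w * size_here
        return cost + dasgupta_cost(left, edges) + dasgupta_cost(right, edges)

    if __name__ == "__main__":
        # Toy graph: two tightly connected pairs, weakly linked to each other.
        edges = [(0, 1, 10), (2, 3, 10), (1, 2, 1)]
        good_tree = ((0, 1), (2, 3))  # separates only the weak edge at the root
        bad_tree = ((0, 2), (1, 3))   # separates both heavy edges at the root
        print(dasgupta_cost(good_tree, edges))  # 10*2 + 10*2 + 1*4 = 44
        print(dasgupta_cost(bad_tree, edges))   # 10*4 + 10*4 + 1*4 = 84
    ```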
  • Item
    Scalable Domain Decomposition for Parallel Solution of 3D Finite Element Multibody Rotorcraft Aeromechanics
    (2022) Lumba, Ravi Tyler; Datta, Anubhav; Aerospace Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    A specialized mesh partitioner is developed for large-scale multibody three-dimensional finite element models. This partitioner enables modern domain decomposition algorithms to be leveraged for the parallel solution of complex, multibody, three-dimensional finite element-based rotor structural dynamics problems. The partitioner works with any domain decomposition algorithm, but contains special features for FETI-DP, a state-of-the-art iterative substructuring algorithm. The algorithm was implemented in the aeroelastic rotor solver X3D, with several modifications to improve performance. The parallel solver was applied to two practical test cases: the NASA Tiltrotor Aeroacoustic Model (TRAM) and the NASA Rotor Optimization for the Advancement of Mars eXploration (ROAMX) rotor blade. The mesh partitioner was developed from two sets of requirements: one standard to any domain decomposition algorithm and one specific to the FETI-DP method. The main feature of the partitioner is its ability to robustly partition any multibody structure, with several special features for rotary-wing structures. The NASA TRAM, a 1/4-scale V-22 model, was specially released by NASA as a challenge test case. This model contained four flexible parts, six joints, nearly twenty composite material decks, a fluid-structure interface, and trim control inputs. The solver performance was studied for three test problems of increasing complexity: 1) an elementary beam, 2) the isolated TRAM blade, and 3) the TRAM blade and hub assembly. A key conclusion is that the use of a skyline solver for the coarse problem eliminates the coarse problem scalability barrier. Overall, the principal barrier of computational time that prevented the use of high-fidelity three-dimensional structures in rotorcraft is thus resolved. The two selected cases provided a template for how 3D structures should be used in the future. A detailed aeromechanical analysis of the NASA TRAM rotor was conducted. The solver was validated against experimental results in hover. The stresses in the blade and hub components were examined, illustrating the unique benefit of 3D structures. The NASA ROAMX blade was, to our knowledge, the first rotor blade designed exclusively with 3D structures. The torsional stability, blade loads, blade deformations, and 3D stresses/strains were evaluated for multiple blade designs before the final selection. The aeroelastic behavior of this blade was studied in steady and unsteady hover. Inertial effects were found to dominate over aerodynamics on Mars. The rotor blade was found to have sufficient factor of safety and damping for all test conditions. Over 20 thousand cases were executed with detailed stresses/strains as a means of downselection, demonstrating the efficiency and utility of the parallel solver and providing a roadmap for its use in future designs.
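    As background for the partitioning step described above, the sketch below shows the basic idea of decomposing a finite element mesh into subdomains by greedily growing parts over the element adjacency graph, which is the kind of preprocessing that domain decomposition methods such as FETI-DP rely on. It is a simplified illustration under assumed inputs, not the specialized multibody partitioner or the X3D solver from this dissertation.

    ```python
    # Illustrative greedy partitioner: split a mesh's element adjacency graph
    # into subdomains of roughly equal size by breadth-first growth.
    from collections import deque

    def partition_elements(adjacency, num_parts):
        """Assign each element id to a subdomain (part id).

        adjacency: dict element id -> list of face-neighboring element ids.
        """
        target = -(-len(adjacency) // num_parts)  # ceiling division
        part_of = {}
        part_size = [0] * num_parts
        part = 0
        for seed in adjacency:
            if seed in part_of:
                continue
            # Grow the current part outward from an unassigned seed element.
            queue = deque([seed])
            while queue and part_size[part] < target:
                e = queue.popleft()
                if e in part_of:
                    continue
                part_of[e] = part
                part_size[part] += 1
                queue.extend(n for n in adjacency[e] if n not in part_of)
            if part_size[part] >= target:
                part = min(part + 1, num_parts - 1)
        return part_of

    if __name__ == "__main__":
        # A 1D "strip" of 8 elements, each sharing a face with its neighbors.
        adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}
        print(partition_elements(adj, 4))  # two consecutive elements per subdomain
    ```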
  • Item
    MULTI-SCALE SCHEDULING TECHNIQUES FOR SIGNAL PROCESSING SYSTEMS
    (2013) Zhou, Zheng; Bhattacharyya, Shuvra S; Qu, Gang; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    A variety of hardware platforms for signal processing has emerged, from distributed systems such as Wireless Sensor Networks (WSNs) to parallel systems such as Multicore Programmable Digital Signal Processors (PDSPs), Multicore General Purpose Processors (GPPs), and Graphics Processing Units (GPUs), to heterogeneous combinations of parallel and distributed devices. When a signal processing application is implemented on one of these platforms, the performance critically depends on the scheduling techniques, which in general allocate computation and communication resources for competing processing tasks in the application to optimize performance metrics such as power consumption, throughput, latency, and accuracy. Signal processing systems implemented on such platforms typically involve multiple levels of processing and communication hierarchy, such as network-level, chip-level, and processor-level in a structural context, and application-level, subsystem-level, component-level, and operation- or instruction-level in a behavioral context. In this thesis, we target scheduling issues that carefully address and integrate scheduling considerations at different levels of these structural and behavioral hierarchies. The core contributions of the thesis include the following. Considering both the network level and the chip level, we proposed an adaptive scheduling algorithm for wireless sensor networks (WSNs) designed for event detection. Our algorithm exploits discrepancies among the detection accuracy of individual sensors, which are derived from a collaborative training process, to allow each sensor to operate in a more energy-efficient manner while the network satisfies given constraints on overall detection accuracy. Considering the chip level and the processor level, we incorporated both temperature and process variations to develop new scheduling methods for throughput maximization on multicore processors. In particular, we studied how to process a large number of threads at high speed without violating a given maximum temperature constraint. We targeted our methods to multicore processors in which the cores may operate at different frequencies and different levels of leakage. We developed speed selection and thread assignment schedulers based on the notion of a core's steady-state temperature. Considering the application level, component level, and operation level, we developed a new dataflow-based design flow within the targeted dataflow interchange format (TDIF) design tool. Our new multiprocessor system-on-chip (MPSoC)-oriented design flow, called TDIF-PPG, is geared towards analysis and mapping of embedded DSP applications on MPSoCs. An important feature of TDIF-PPG is its capability to integrate graph-level parallelism and actor-level parallelism into the application mapping process. Here, graph-level parallelism is exposed by the dataflow graph application representation in TDIF, and actor-level parallelism is modeled by a novel model for multiprocessor dataflow graph implementation that we call the Parallel Processing Group (PPG) model. Building on the contribution above, we formulated a new type of parallel task scheduling problem called Parallel Actor Scheduling (PAS) for chip-level MPSoC mapping of DSP systems that are represented as synchronous dataflow (SDF) graphs.
    In contrast to traditional SDF-based scheduling techniques, which focus on exploiting graph-level (inter-actor) parallelism, the PAS problem targets the integrated exploitation of both intra- and inter-actor parallelism for platforms in which individual actors can be parallelized across multiple processing units. We address a special case of the PAS problem in which all of the actors in the DSP application or subsystem being optimized can be parallelized. For this special case, we develop and experimentally evaluate a two-phase scheduling framework with three workflows: particle swarm optimization with a mixed integer programming formulation, particle swarm optimization with a simulated annealing engine, and particle swarm optimization with a fast heuristic based on list scheduling. Then, we extend our scheduling framework to support the general PAS problem, in which some actors cannot be parallelized.
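    One of the three workflows above pairs particle swarm optimization with a fast heuristic based on list scheduling. The sketch below shows plain list scheduling for independent tasks (longest task first, placed on the processor that becomes free earliest); it is a generic illustration of that heuristic family, not the PAS scheduler developed in the thesis, and the task set is made up.

    ```python
    # Illustrative list scheduling for independent tasks on identical processors.
    import heapq

    def list_schedule(task_times, num_processors):
        """Return (makespan, {task: (processor, start_time)})."""
        # Min-heap of (time the processor becomes free, processor id).
        free_at = [(0.0, p) for p in range(num_processors)]
        heapq.heapify(free_at)
        assignment = {}
        # Priority order: longest task first.
        for task, duration in sorted(task_times.items(), key=lambda kv: -kv[1]):
            start, proc = heapq.heappop(free_at)
            assignment[task] = (proc, start)
            heapq.heappush(free_at, (start + duration, proc))
        makespan = max(t for t, _ in free_at)
        return makespan, assignment

    if __name__ == "__main__":
        tasks = {"fft": 4.0, "filter": 3.0, "decode": 3.0, "mix": 2.0, "log": 1.0}
        makespan, plan = list_schedule(tasks, num_processors=2)
        print(makespan)  # 7.0: fft, mix, log share one core; filter and decode the other
        print(plan)
    ```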
  • Item
    Reuse Distance Analysis for Large-Scale Chip Multiprocessors
    (2012) Wu, Meng-Ju; Yeung, Donald; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Multicore Reuse Distance (RD) analysis is a powerful tool that can potentially provide a parallel program's detailed memory behavior. Concurrent Reuse Distance (CRD) and Private-stack Reuse Distance (PRD) measure RD across thread-interleaved memory reference streams, addressing shared and private caches. Sensitivity to memory interleaving makes CRD and PRD profiles architecture-dependent, preventing a single profile from being used to analyze different processor configurations. However, such instability is minimal when all threads exhibit similar data-locality patterns. For loop-based parallel programs, interleaving threads are symmetric. CRD and PRD profiles are stable across cache size scaling, and exhibit predictable coherent movement across core count scaling. Hence, multicore RD analysis can provide accurate analysis for different processor configurations. Due to the prevalence of parallel loops, RD analysis will be valuable to multicore designers. This dissertation uses RD analysis to analyze multicore cache performance for loop-based parallel programs. First, we study the impacts of core count scaling and problem size scaling on CRD and PRD profiles. Two application parameters with architectural implications are identified: Ccore and Cshare. Core count scaling only impacts cache performance significantly below Ccore in shared caches, and Cshare is the capacity at which shared caches begin to outperform private caches in terms of data locality. Then, we develop techniques, in particular employing reference groups, to predict the coherent movement of CRD and PRD profiles due to scaling, and achieve accuracy of 80%-96%. After comparing our prediction techniques against profile sampling, we find that the prediction achieves higher speedup and accuracy, especially when the design space is large. Moreover, we evaluate the accuracy of using CRD and PRD profile predictions to estimate multicore cache performance, especially misses per thousand instructions (MPKI). When combined with the existing problem scaling prediction, our techniques can predict shared LLC (private L2 cache) MPKI to within 12% (14%) of simulation across 1,728 (1,440) configurations using only 36 measured CRD (PRD) profiles. Lastly, we propose a new framework based on RD analysis to optimize multicore cache hierarchies. Our study not only reveals several new insights, but it also demonstrates that RD analysis can help computer architects improve multicore designs.
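    For readers unfamiliar with the underlying notion, the sketch below computes single-stream reuse (LRU stack) distances from a toy address trace; the CRD and PRD profiles discussed above generalize this quantity to interleaved and per-thread multicore reference streams. The function and trace are illustrative assumptions, not code from the dissertation.

    ```python
    # Illustrative reuse (LRU stack) distance computation for one reference stream.
    def reuse_distances(trace):
        """For each reference, return the number of distinct addresses touched
        since the previous access to the same address (None on first touch)."""
        stack = []        # least recently used at the front, most recent at the end
        distances = []
        for addr in trace:
            if addr in stack:
                pos = stack.index(addr)
                distances.append(len(stack) - 1 - pos)  # distinct addresses in between
                stack.pop(pos)
            else:
                distances.append(None)                  # cold (compulsory) miss
            stack.append(addr)
        return distances

    if __name__ == "__main__":
        trace = ["a", "b", "c", "a", "b", "b"]
        print(reuse_distances(trace))  # [None, None, None, 2, 2, 0]
        # A fully associative LRU cache with C lines hits exactly those references
        # whose reuse distance is less than C, which is why such profiles predict MPKI.
    ```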
  • Item
    Parallel and Serial Algorithms for Vehicle Routing Problems
    (2008) Groer, Christopher Sebastian; Golden, Bruce; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    The vehicle routing problem (VRP) is a widely studied combinatorial optimization problem that has many applications. Due to its intrinsic difficulty and the size of problems encountered in practice, most solution methods for the VRP are heuristic in nature and lead to high-quality, yet not provably optimal, solutions. When one considers the additional constraints that can be encountered in practice, the need for high-quality heuristic methods is clear. We present two new variations of the VRP suggested to us by industry contacts, the Consistent VRP and the Balanced Billing Cycle VRP. We develop solution algorithms that incorporate heuristic methods as well as integer programming. Additionally, we develop a highly effective cooperative parallel algorithm for the classical VRP that generates new best solutions to a number of well-studied benchmark instances. We present extensive computational results and describe the C/C++ library that we developed to solve these vehicle routing problems. We describe the features and design philosophy behind this library and discuss how the framework can be used to implement additional heuristic algorithms and incorporate additional constraints.
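    As an example of the kind of classical construction heuristic this line of work builds on, the sketch below implements a capacity-constrained nearest-neighbor route builder for the VRP. It is only an illustration under assumed inputs; it is not the cooperative parallel algorithm or the C/C++ library described in the abstract.

    ```python
    # Illustrative nearest-neighbor construction heuristic for the capacitated VRP.
    import math

    def nearest_neighbor_routes(depot, customers, demands, capacity):
        """Greedily build routes: keep visiting the closest unserved customer that
        still fits in the vehicle; start a new route when none fits.

        Assumes every individual demand fits within the vehicle capacity.
        depot: (x, y); customers: dict name -> (x, y); demands: dict name -> load.
        """
        def dist(a, b):
            return math.hypot(a[0] - b[0], a[1] - b[1])

        unserved = set(customers)
        routes = []
        while unserved:
            route, load, here = [], 0.0, depot
            while True:
                feasible = [c for c in unserved if load + demands[c] <= capacity]
                if not feasible:
                    break
                nxt = min(feasible, key=lambda c: dist(here, customers[c]))
                route.append(nxt)
                load += demands[nxt]
                here = customers[nxt]
                unserved.remove(nxt)
            routes.append(route)
        return routes

    if __name__ == "__main__":
        depot = (0.0, 0.0)
        customers = {"A": (1, 0), "B": (2, 0), "C": (0, 3), "D": (0, 4)}
        demands = {"A": 2, "B": 2, "C": 3, "D": 3}
        print(nearest_neighbor_routes(depot, customers, demands, capacity=6))
        # [['A', 'B'], ['C', 'D']]
    ```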