Theses and Dissertations from UMD
Permanent URI for this communityhttp://hdl.handle.net/1903/2
New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be up to a 4 month delay in the appearance of a give thesis/dissertation in DRUM
More information is available at Theses and Dissertations at University of Maryland Libraries.
Browse
3 results
Search Results
Item Fair and Scalable Algorithms on Massive Graphs(2023) Knittel, Marina; Hajiaghayi, MohammadTaghi; Dickerson, John; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)The impact of massive data processing on our daily lives is becoming increasingly more clear. With this, we must guarantee that vital, data-driven decision-making systems mitigate the potential negative impacts of the technologies themselves and scale to the massive datasets we have access to today. We explore both of these facets by studying fairness and scalability for algorithms on large graphs. In Part I, we focus on on fair hierarchical clustering. Our first work on this topic [Ahmadian et al., 2020b] initiates the study of fair hierarchical clustering by extending Chierichetti et al.’s [Chierichetti et al., 2017] notion of representationally proportional flat clustering to the hierarchical setting. From there, we introduce the first approximation algorithms for three well studied hierarchical clustering optimization problems in the fair context: cost [Dasgupta, 2016], revenue [Moseley and Wang, 2017], and value [Cohen-Addad et al., 2018]. Our initial work studies all three fair optimization problems, and our follow-up works [Knittel et al., 2023a, Knittel et al., 2023b] dive deeper into the notoriously difficult cost optimization problem. Regarding scalability, we leverage the Massively Parallel Computation (MPC) model, as well as its recent successor Adaptive Massively Parallel Computation (AMPC), to develop efficient graph algorithms for big data. MPC, discussed in Part II, has been one of the most practically useful models for massively parallel algorithms in the past decade, influencing a number of major frameworks including MapReduce, Hadoop, Spark, and Flume. In this model, we present our work on edge coloring [Behnezhad et al., 2019b], hierarchical clustering [Hajiaghayi and Knittel, 2020], and tree embeddings [Ahanchi et al., 2023]. AMPC improves upon the MPC model by adding access to a distributed hash table while still remaining highly implementable in practice. This allows it to overcome some shortcomings proposed in MPC literature, notably, the 1vs2Cycle Conjecture (i.e., that differentiating between a single cycle and two cycles is difficult in MPC). In Part III, we introduce a highly efficient and general tool for executing tree contractions in AMPC [Hajiaghayi et al., 2022b] and additionally exhibit the power of AMPC on minimum cut problems [Hajiaghayi et al., 2022a].Item Applications of Graph Segmentation Algorithms For Quantitative Genomic Analyses(2020) Gunady, Mohamed; Bravo, Hector C; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)There is a growing interest in utilizing graph formulations and graph-based algorithms in different subproblems of genomic analysis. Since graphs provide a natural and efficient representation of sequences of data where some structural relationships are observed within the data, we study some graph applications in quantitative analysis of typical RNA-seq and Whole Genome Sequencing pipelines. Analysis of differential alternative splicing from RNA-seq data is complicated by the fact that many RNA-seq reads map to multiple transcripts, besides, the annotated transcripts are often a small subset of the possible transcripts of a gene. This work describes Yanagi, a tool for segmenting transcriptomes to create a library of maximal L-disjoint segments from a complete transcriptome annotation. That segment library preserves transcriptome substrings and structural relationships between transcripts while eliminating unnecessary sequence duplications. First, we formalize the concept of transcriptome segmentation and propose an efficient algorithm for generating segment libraries. The resulting segment sequences can be used with pseudo-alignment tools to quantify gene expression and alternative splicing at the segment level and provide gene-level visualization of the segments for more interpretability. The notion of transcript segmentation as introduced here and implemented in Yanagi opens the door for the application of lightweight, ultra-fast pseudo-alignment algorithms in a wide variety of RNA-seq analyses. Furthermore, we show how transcriptome quantification can be performed from segment-level statistics. We present an EM algorithm that uses segment counts as features to estimate transcripts relative abundances in a way that maximizes the likelihood of the observed sequenced data. Then we tackle the problem of quantification in an incomplete annotation setting. We propose an assembly-free correction procedure that reduces bias in the estimated abundances of the annotated transcripts caused by the presence of unannotated transcripts in an RNA-seq sample, while avoiding the need to assemble the missing transcripts first. Another use case of our graph segmentation approach is representing population reference genome graphs used in Whole Genome Sequencing (WGS), which can be crucial for some genomic analysis studying highly polymorphic genes like HLA. Usually graph-based aligners are slow and computationally demanding. Using segments empowers any linear aligner with the efficient graph representation of population variations, while avoiding the expensive computational overhead of aligning over graphs. Lastly, we explore the use of Generative Adversarial Networks (GANs) for imputing the sparse and noisy expression data obtained in single cell RNA sequencing (scRNA-seq) experiments. scRNA-seq provides a rich view into the heterogeneity underlying a cell population which is usually lost when performing bulk RNA-seq. However, these datasets are usually noisy and very sparse, and a number of methods have been proposed to impute zeros in these datasets with the goal of improving downstream analysis. In this work, we propose an approach, scGAIN, to impute zero counts of dropout genes in single cell data using Generative Adversarial Networks (GANs) by learning an approximation of the data distribution. The work presented here discusses an approach to adopt GAIN, a GAN model developed to impute data in image data, into the domain of imputing single cell data. Experiments show that scGAIN gives competitive results compared to the state-of-the-art imputation approaches while showing superiority in various aspects in simulation and real data. Imputation by scGAIN successfully recovers the underlying clustering of cell sub-populations, provides sharp estimates around true mean expression, reducing variability in the data, and increases the correspondence with matched bulk RNA-seq experiments.Item Multicasting in All-Optical WDM Networks(2008-09-26) Rawat, Anuj; Shayman, Mark; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)n this dissertation, we study the problem of (i) routing and wavelength assignment, and (ii) traffic grooming for multicast traffic in Wavelength Division Multiplexing (WDM) based all-optical networks. We focus on the 'static' case where the set of multicast traffic requests is assumed to be known in advance. For the routing and wavelength assignment problem, we study the objective of minimizing the number of wavelengths required; and for the traffic grooming problem, we study the objectives of minimizing (i) the number of wavelengths required, and (ii) the number of electronic components required. Both the problems are known to be hard for general fiber network topologies. Hence, it makes sense to study the problems under some restrictions on the network topology. We study the routing and wavelength assignment problem for bidirected trees, and the traffic grooming problem for unidirectional rings. The selected topologies are simple in the sense that the routing for any multicast traffic request is trivially determined, yet complex in the sense that the overall problems still remain hard. A motivation for selecting these topologies is that they are of practical interest since most of the deployed optical networks can be decomposed into these elemental topologies. In the first part of the thesis, we study the the problem of multicast routing and wavelength assignment in all-optical bidirected trees with the objective of minimizing the number of wavelengths required in the network. We give a 5/2-approximation algorithm for the case when the degree of the bidirected tree is at most 3. We give another algorithm with approximation ratio 10/3, 3 and 2 for the case when the degree of the bidirected tree is equal to 4, 3 and 2, respectively. The time complexity analysis for both these algorithms is also presented. Next we prove that the problem is hard even for the two restricted cases when the bidirected tree has (i) depth 2, and (ii) degree 2. Finally, we present another hardness result for a related problem of finding the clique number for a class for intersection graphs. In the second part of the thesis, we study the problem of multicast traffic grooming in all-optical unidirectional rings. For the case when the objective is to minimize the number of wavelengths required in the network, given an 'a'-approximation algorithm for the circular arc coloring problem, we give an algorithm having asymptotic approximation ratio 'a' for the multicast traffic grooming problem. We develop an easy to calculate lower bound on the minimum number of electronic components required to support a given set of multicast traffic requests on a given unidirectional ring network. We use this lower bound to analyze the worst case performance of a pair of simple grooming schemes. We also study the case when no grooming is carried out in order to get an estimate on the maximum number of electronic components that can be saved by applying intelligent grooming. Finally, we present a new grooming scheme and compare its average performance against other grooming schemes via simulations. The time complexity analysis for all the grooming schemes is also presented.