Theses and Dissertations from UMD

Permanent URI for this community: http://hdl.handle.net/1903/2

New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be up to a 4-month delay before a given thesis/dissertation appears in DRUM.

More information is available at Theses and Dissertations at University of Maryland Libraries.

Search Results

Now showing 1 - 5 of 5
  • APPROXIMATION ALGORITHMS FOR FACILITY LOCATION AND CLUSTERING PROBLEMS
    (2017) Trinh, Khoa; Srinivasan, Aravind; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Facility Location (FL) problems are among the most fundamental problems in combinatorial optimization. FL problems are also closely related to Clustering problems. Generally, we are given a set of facilities, a set of clients, and a symmetric distance metric on these facilities and clients. The goal is to ``open'' the ``best'' subset of facilities, subject to certain budget constraints, and connect all clients to the opened facilities so that some objective function of the connection costs is minimized. In this dissertation, we consider generalizations of classical FL problems. Since these problems are NP-hard, we aim to find good approximate solutions in polynomial time. We study the classic $k$-median problem, which asks for a subset of at most $k$ facilities such that the sum of connection costs of all clients to their closest facilities is as small as possible. Our main result is a $2.675$-approximation algorithm for this problem. We also consider the Knapsack Median (KM) problem, which is a generalization of the $k$-median problem. In the KM problem, each facility is assigned an opening cost, and a feasible set of opened facilities must have total opening cost at most a given budget. The main technical challenge here is that the natural LP relaxation has an unbounded integrality gap. We propose a $17.46$-approximation algorithm for the KM problem. We also show that, after a preprocessing step, the integrality gap of the residual instance is bounded by a constant. The next problem is a generalization of the $k$-center problem, called the Knapsack Center (KC) problem, which has a budget constraint similar to that of the KM problem. Here we want to minimize the maximum distance from any client to its closest opened facility. The KC problem is well known to be $3$-approximable. However, the current approximation algorithms for KC are deterministic, and it is not hard to construct instances in which almost all clients have the worst-possible connection cost. Unfairness also arises in this context: certain clients may consistently get connected to distant centers. We design a randomized algorithm which guarantees that the expected connection cost of ``most'' clients will be at most $(1+2/e) \approx 1.74$ times the optimal radius, while the worst-case distance remains the same. We also show a similar result for the $k$-center problem: all clients have an expected approximation ratio of about $1.592$, with a deterministic upper bound of $3$ in the worst case. It is well known that a few \emph{outliers} (very distant clients) may result in a very large optimal radius in center-type problems. One way to deal with this issue is to cover only some $t$ out of $n$ clients in the so-called robust model. In this thesis, we give tight approximation algorithms for both the robust $k$-center and robust matroid center problems. We also introduce a lottery model in which each client $j$ wants to be covered with probability at least $p_j \in [0,1]$. We then give randomized approximation algorithms for center-type problems in this model which match the worst-case bounds of the robust model while only slightly violating the coverage and fairness constraints. Several of our results for FL problems in this thesis rely on novel dependent rounding schemes. We develop these rounding techniques in the general setting and show that they guarantee new correlation properties. Given the wide applicability of standard dependent rounding, we believe that our new techniques are of independent interest. (A sketch of the standard dependent-rounding step appears after this listing.)
  • Data-Aware Scheduling in Datacenters
    (2016) Purohit, Manish; Khuller, Samir; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Datacenters have emerged as the dominant form of computing infrastructure over the last two decades. The tremendous increase in data analysis requirements has led to a proportional increase in power consumption, and datacenters are now one of the fastest-growing electricity consumers in the United States. Another rising concern is the loss of throughput due to network congestion. Scheduling models that do not explicitly account for data placement may lead to the transfer of large amounts of data over the network, causing unacceptable delays. In this dissertation, we study different scheduling models that are inspired by the dual objectives of minimizing energy costs and network congestion in a datacenter. As datacenters are equipped to handle peak workloads, the average server utilization in most datacenters is very low. As a result, one can achieve huge energy savings by selectively shutting down machines when demand is low. In this dissertation, we introduce the network-aware machine activation problem to find a schedule that simultaneously minimizes the number of machines necessary and the congestion incurred in the network. Our model significantly generalizes well-studied combinatorial optimization problems such as hard-capacitated hypergraph covering and is thus strongly NP-hard. As a result, we focus on finding good approximation algorithms. Data-parallel computation frameworks such as MapReduce have popularized the design of applications that require a large amount of communication between different machines. Efficient scheduling of these communication demands is essential to guarantee efficient execution of the different applications. In the second part of the thesis, we study the approximability of the co-flow scheduling problem, which has recently been introduced to capture these application-level demands. Finally, we also study the question, ``In what order should one process jobs?'' Often, precedence constraints specify a partial order over the set of jobs, and the objective is to find suitable schedules that satisfy the partial order. However, in the presence of hard deadline constraints, it may be impossible to find a schedule that satisfies all precedence constraints. In this thesis, we formalize different variants of job scheduling with soft precedence constraints and conduct the first systematic study of these problems. (A toy locality-aware assignment sketch appears after this listing.)
  • Decision making under uncertainty
    (2011) Li, Jian; Deshpande, Amol; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Almost all important decision problems are inevitably subject to some level of uncertainty, whether about data measurements, the parameters, or predictions describing future evolution. The significance of handling uncertainty is further amplified by the large volume of uncertain data automatically generated by modern data gathering or integration systems. Various types of problems of decision making under uncertainty have been the subject of extensive research in computer science, economics, and social science. In this dissertation, I study three major problems in this context: ranking, utility maximization, and matching, all involving uncertain datasets. First, we consider the problem of ranking and top-k query processing over probabilistic datasets. By illustrating the diverse and conflicting behaviors of the prior proposals, we contend that a single, specific ranking function may not suffice for probabilistic datasets. Instead we propose the notion of parameterized ranking functions, which generalize or can approximate many of the previously proposed ranking functions. We present novel exact or approximate algorithms for efficiently ranking large datasets according to these ranking functions, even if the datasets exhibit complex correlations or the probability distributions are continuous. The second problem concerns the stochastic versions of a broad class of combinatorial optimization problems. We observe that the expected value is inadequate for capturing different types of risk-averse or risk-prone behaviors, and instead we consider a more general objective, which is to maximize the expected utility of the solution for some given utility function. We present a polynomial-time approximation algorithm with additive error ε for any ε > 0, under certain conditions. Our result generalizes and improves several prior results on stochastic shortest path, stochastic spanning tree, and stochastic knapsack. The third is the stochastic matching problem, which finds interesting applications in online dating, kidney exchange, and online ad assignment. In this problem, the existence of each edge is uncertain and can only be determined by probing the edge. The goal is to design a probing strategy to maximize the expected weight of the matching. We give linear-programming-based constant-factor approximation algorithms for weighted stochastic matching, which answer an open question raised in prior work. (A Monte Carlo sketch of a parameterized ranking function appears after this listing.)
  • Algorithms for Data Migration
    (2005-06-15) Kim, Yoo-Ah; Khuller, Samir; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    This thesis is concerned with problems related to data storage and management. A large storage server consists of several hundred disks. To balance the load across disks, the system computes data layouts that are typically adjusted according to the workload. As workloads change over time, the system recomputes the data layout and rearranges the data items according to the new layout. We identify the problem of computing an efficient data migration plan that converts an initial layout to a target layout. We define the data migration problem as follows: for each item, there is a set of disks that have the item (sources) and a set of disks that want to receive the item (destinations). We want to migrate the data items from their sources to their destinations. The crucial constraint is that each disk can participate in only one transfer at a time. The most common objective has been to minimize the makespan, which is the time at which all migrations finish. The problem is NP-hard, and we develop polynomial-time algorithms with constant-factor approximation guarantees as well as several other heuristic algorithms. We present a performance evaluation of the different methods through an experimental study. We also consider the data migration problem with the objective of minimizing the sum of completion times over all migration jobs or storage devices. Minimizing the sum of completion times of jobs is one of the most common objectives in the scheduling literature. On the other hand, since a storage device may run inefficiently while it is involved in migrations, another interesting objective is to minimize the sum of completion times over all storage devices. We present hardness results and constant-factor approximation algorithms for these objectives. In addition, we consider the case where we have a heterogeneous collection of machines. We assume that heterogeneity is modeled by a non-uniform speed of the sending machine. For the basic problems of multicasting and broadcasting in this model, we show that the Fastest Node First scheme gives an approximation ratio of 1.5 for minimizing the makespan. We also prove that there is a polynomial-time approximation scheme. (A sketch of round-based transfer scheduling appears after this listing.)
  • Algorithms for Data Dissemination and Collection
    (2005-04-18) Wan, Yung-Chun Justin; Khuller, Samir; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Broadcasting and gossiping are classical problems that have been widely studied for decades. In broadcasting, one source node wishes to send a message to every other node, while in gossiping, each node has a message that it wishes to send to everyone else. These are among the most basic problems arising in communication networks. In this dissertation we study problems that generalize gossiping and broadcasting. For example, the source node may have several messages to broadcast or multicast. Much of the work on broadcasting in the literature focuses on homogeneous networks, and the algorithms developed are more applicable to managing data on local-area networks. However, large-scale storage systems often consist of storage devices clustered over a wide-area network. Finding a suitable model, and developing algorithms for broadcast that recognize the heterogeneous nature of the communication network, is a significant part of this dissertation. We also address the problem of data collection in a wide-area network, which has largely been neglected and is likely to become more significant as the Internet becomes more embedded in everyday life. We consider a situation where large amounts of data have to be moved from several different locations to a destination. In this work, we focus on two key properties: the available bandwidth can fluctuate, and the network may not choose the best route to transfer the data between two hosts. We focus on improving the task completion time by re-routing the data through intermediate hosts, and show that under certain network conditions we can reduce the total completion time by a factor of two. This is done by developing an approach for computing coordinated data collection schedules using network flows. (A sketch of the re-routing idea appears after this listing.)
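
Several of the facility-location results above (Trinh, 2017) build on dependent rounding. The following is a minimal Python sketch of the standard pairwise dependent-rounding step that such schemes extend; the epsilon tolerance, the pairing order, and the final Bernoulli step for a non-integral total are illustrative choices, not details taken from the dissertation.

    import random

    def dependent_round(x, rng=random):
        """Round a fractional vector x in [0,1]^n to {0,1}^n.

        Pairwise dependent rounding: each coordinate keeps its marginal
        (E[X_i] = x_i) and the pair sum is preserved at every step, so if
        sum(x) is an integer k, exactly k coordinates end up equal to 1.
        """
        x = list(x)
        eps = 1e-9
        frac = [i for i, v in enumerate(x) if eps < v < 1 - eps]
        while len(frac) >= 2:
            i, j = frac[0], frac[1]
            a = min(1 - x[i], x[j])   # shift that pushes x_i up, x_j down
            b = min(x[i], 1 - x[j])   # shift that pushes x_i down, x_j up
            if rng.random() < b / (a + b):
                x[i] += a
                x[j] -= a
            else:
                x[i] -= b
                x[j] += b
            frac = [t for t in frac if eps < x[t] < 1 - eps]
        # at most one coordinate may remain fractional; round it by its value
        for t in frac:
            x[t] = 1.0 if rng.random() < x[t] else 0.0
        return [int(round(v)) for v in x]

    # example: open exactly 2 of 4 facilities with the given marginals
    print(dependent_round([0.5, 0.5, 0.7, 0.3]))

Each step makes at least one coordinate integral while keeping the expectation of every coordinate unchanged, which is the basic correlation property the thesis's new schemes strengthen.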
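
The data-aware scheduling abstract (Purohit, 2016) argues that ignoring data placement forces large network transfers. The toy assignment below illustrates that tension only: it prefers machines that already hold a task's input block and counts the remote reads it cannot avoid. The function, its inputs, and the capacity model are hypothetical and are not the network-aware machine activation formulation from the thesis.

    def assign_tasks(tasks, machines, locations, capacity):
        """Toy locality-aware task assignment (hypothetical model).

        tasks:     dict mapping task id -> id of its input block
        machines:  list of machine ids
        locations: dict mapping block id -> set of machines storing a replica
        capacity:  maximum number of tasks a machine may run
        A machine already holding the input is preferred; otherwise the least
        loaded machine with spare capacity is used and a remote read is paid.
        """
        load = {m: 0 for m in machines}
        plan, remote_reads = {}, 0
        for task, block in tasks.items():
            free = [m for m in machines if load[m] < capacity]
            local = [m for m in free if m in locations.get(block, set())]
            if local:
                chosen = min(local, key=lambda m: load[m])
            else:
                chosen = min(free, key=lambda m: load[m])   # data must move
                remote_reads += 1
            load[chosen] += 1
            plan[task] = chosen
        return plan, remote_reads

    plan, remote = assign_tasks(
        tasks={"t1": "b1", "t2": "b1", "t3": "b2", "t4": "b1"},
        machines=["m1", "m2"],
        locations={"b1": {"m1"}, "b2": {"m2"}},
        capacity=2,
    )
    print(plan, "remote reads:", remote)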
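
For the probabilistic-ranking work (Li, 2011), one plausible way to read a parameterized ranking function is as a weighted sum, over rank positions, of the probability that a tuple lands at that position. The Monte Carlo sketch below estimates such scores under the simplifying assumption of independent tuples; the thesis itself gives exact and approximate algorithms that also handle correlations and continuous distributions.

    import random

    def prf_scores(tuples, omega, samples=10000, rng=random):
        """Monte Carlo estimate of a parameterized ranking function.

        tuples: list of (name, existence_probability, score)
        omega:  weight function on 0-based rank positions; e.g. a top-k style
                function uses omega(i) = 1 if i < k else 0.
        Assumes tuples exist independently -- a sketch, not the thesis method.
        """
        totals = {name: 0.0 for name, _, _ in tuples}
        for _ in range(samples):
            world = [(name, score) for name, p, score in tuples if rng.random() < p]
            world.sort(key=lambda t: t[1], reverse=True)   # rank by score
            for i, (name, _) in enumerate(world):
                totals[name] += omega(i)
        return {name: v / samples for name, v in totals.items()}

    data = [("a", 0.9, 10.0), ("b", 0.5, 12.0), ("c", 0.8, 7.0)]
    top2 = prf_scores(data, omega=lambda i: 1.0 if i < 2 else 0.0)
    print(sorted(top2.items(), key=lambda kv: -kv[1]))

Different choices of omega recover different ranking semantics, which is the sense in which a single parameterized family can subsume several earlier proposals.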
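
In the data migration model (Kim, 2005), each disk can take part in at most one transfer at a time, so every round of unit-length transfers must form a matching on the disks. The sketch below schedules transfers with a greedy maximal matching per round; it is a baseline heuristic for intuition, not one of the constant-factor approximation algorithms developed in the thesis.

    def migration_rounds(transfers):
        """Schedule unit-length transfers in rounds (a greedy sketch).

        transfers: list of (source_disk, destination_disk) pairs, one per copy.
        Each disk may take part in at most one transfer per round, so every
        round must be a matching; here each round is a greedy maximal matching
        of the remaining transfers.  The optimal makespan is at least the
        maximum number of transfers touching any single disk.
        """
        pending = list(transfers)
        rounds = []
        while pending:
            busy, this_round, leftover = set(), [], []
            for s, d in pending:
                if s not in busy and d not in busy:
                    busy.update((s, d))
                    this_round.append((s, d))
                else:
                    leftover.append((s, d))
            rounds.append(this_round)
            pending = leftover
        return rounds

    for i, rnd in enumerate(migration_rounds(
            [("d1", "d2"), ("d1", "d3"), ("d2", "d3"), ("d3", "d4")]), 1):
        print("round", i, rnd)

Here disk d3 appears in three transfers, so no schedule can finish in fewer than three rounds, and the greedy schedule happens to match that bound.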
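
The data collection work (Wan, 2005) improves completion time by re-routing transfers through intermediate hosts when the direct route is slow. The toy routine below picks, per transfer, either the direct path or a single intermediate host, whichever has the larger bottleneck bandwidth; the bandwidth map and host list are hypothetical, and the thesis instead computes coordinated collection schedules using network flows.

    def best_route(bandwidth, source, dest, hosts):
        """Choose the direct route or a one-hop detour with the largest
        bottleneck bandwidth (a toy illustration of re-routing).

        bandwidth: dict mapping (u, v) to available bandwidth from u to v.
        Returns (bottleneck_bandwidth, path).
        """
        direct = bandwidth.get((source, dest), 0.0)
        best = (direct, [source, dest])
        for h in hosts:
            if h in (source, dest):
                continue
            hop = min(bandwidth.get((source, h), 0.0),
                      bandwidth.get((h, dest), 0.0))
            if hop > best[0]:
                best = (hop, [source, h, dest])
        return best

    bw = {("s1", "c"): 10.0, ("s1", "h"): 80.0, ("h", "c"): 100.0}
    print(best_route(bw, "s1", "c", hosts=["h"]))

In this made-up instance the detour through host h is eight times faster than the direct path, which is the kind of gain re-routing is meant to capture, although the actual speedup reported in the dissertation is the factor-of-two improvement under certain network conditions.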