Parallel Algorithms for Processing Massive Texts and Graphs

Thumbnail Image

Publication or External Link





Designing parallel algorithms with sublinear space is an inevitable scenario when the input data does not fit in the memory of available systems. In recent years, with the abundance of data and the increasing demand for large scale data processing, the size of datasets easily surpasses that of the typical machine’s memory. Examples of the problem domain vary from social network graphs to DNA sequences. However, to address any problem subject to this restriction efficiently, we need alternative models of computation as the traditional RAM model is no longer an option. The main focus of our research is studying classical algorithms on texts and graphs in the Massively Parallel Computation model.

During the last decade, the Massively Parallel Computation (MPC) model attracted a considerable amount of attention. The MPC model was originally proposed to provide a theoretical foundation to algorithms implemented on modern large scale data processing frameworks such as MapReduce, Hadoop, and Spark. The key idea behind this model, and also aforementioned frameworks, is to use many machines to compensate for the shortage of space in individual machines. In this model, the data is distributed among a set of machines each with a sublinear memory, and the process is consisted of several rounds. In each round, machines perform an arbitrary amount of computation on their local data independently. At the end of each round, machines can communicate with each other. We propose constant rounds MPC algorithms for Edge Coloring, Pattern Matching, constructing Suffix Trees, Longest Common Substring, and Longest Palindrome Substring.

Recently, the Adaptive Massively Parallel Computation (AMPC) model was introduced as an extension of MPC. Despite the significant progress in designing highly efficient MPC algorithms, the community is facing an impenetrable barrier due to a hardness conjecture that Graph Connectivity requires at least a logarithmic number of rounds. Inspired by this limitation, the AMPC model adds distributed hash-tables to the setup where each machine can adaptively read from them during each round. This tweak not only leads to a constant-round Graph Connectivity algorithm and a plethora of logarithmic improvements in many graph problems, but it is also practical thanks to the recent developments in hardware infrastructure and technologies such as RDMA and eRPC. We explore the limitations of this model by introducing a general Tree Contraction framework which translates well-studied applications in the context of PRAM into constant-round AMPC algorithms, and solving the Minimum Cut problem in sublogarithmic rounds.