Towards scalability and efficiency in distributed systems
Abstract
Fault-tolerant consensus is one of the foundational problems in distributed computing. A number of autonomous processes aim to agree on a common value among the initial ones, despite failures in processes or the communication medium. The rise of decentralized systems, with their continuously growing user base and applications, poses an unprecedented challenge to the scalability of fault-tolerant consensus.
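To make the problem setting concrete, the following sketch simulates the classic $(f+1)$-round flooding algorithm for synchronous consensus under crash failures. This is a textbook baseline, not an algorithm from the thesis; its $\Theta(n^2)$ messages per round illustrate the communication cost that motivates the scalability question. The clean-crash model (a crashed process sends nothing from its crash round on) is a simplifying assumption.

```python
def flood_consensus(inputs, crash_round):
    """Classic (f+1)-round flooding consensus for synchronous systems
    with crash failures -- a textbook baseline, not from the thesis.
    inputs[i] is process i's initial value; crash_round[i] is the round
    at whose start process i crashes (None = correct)."""
    n = len(inputs)
    f = sum(1 for r in crash_round if r is not None)
    known = [{v} for v in inputs]            # values each process has seen
    for rnd in range(1, f + 2):              # f + 1 rounds of flooding
        inbox = [set() for _ in range(n)]
        for i in range(n):
            if crash_round[i] is not None and rnd >= crash_round[i]:
                continue                     # i has crashed: sends nothing
            for j in range(n):               # broadcast everything known
                inbox[j] |= known[i]
        for j in range(n):
            known[j] |= inbox[j]
    # each correct process decides the minimum value it has seen;
    # after f + 1 rounds these sets coincide, so decisions agree
    return [min(known[i]) for i in range(n) if crash_round[i] is None]
```

Agreement, validity, and termination hold here, but every process talks to every other process in every round; reducing this all-to-all pattern is precisely where expander-based techniques enter.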
Current consensus mechanisms often struggle to balance scalability with fault tolerance and communication efficiency, particularly in dynamic environments where failures can be adaptive or even malicious. These challenges extend beyond communication complexity: decentralized systems must also achieve low network latency, a critical measure in the era of fast access to information.
Surprisingly, these issues are not confined to practical implementations; they also extend to foundational theoretical models. Despite the introduction of the consensus problem more than 40 years ago by Lamport, Pease, and Shostak, and a substantial body of literature aimed at understanding its complexity across various theoretical domains, optimal, or even close-to-optimal, solutions for consensus in many of these domains remain elusive. The focus of this thesis is to bridge some of these gaps and to advance the understanding of the Consensus problem in synchronous peer-to-peer networks in which parties are subject to failures.
As a first result, we propose a new randomized algorithm, resilient to an adaptive adversary, that reduces the communication complexity of the problem by a factor linear in the size of the system (denoted $n$), leaving only a $\tilde{\Theta}(\sqrt{n})$ multiplicative gap to the known lower bounds. Under the same failure regime, the second result derandomizes this algorithm, yielding a time- and communication-optimal \textit{deterministic} solution to Consensus when the number of failed processes is at most $n / \log{n}$.
Interestingly, the next proposed result precisely captures the tradeoff between time complexity and randomness in any solution to Consensus against adaptive adversaries capable of crashing processes. We prove that solutions with lower latency require more queries to random sources.
For systems exhibiting more severe omission failures, we provide the first algorithm that is optimal, up to polylogarithmic factors, in two key measures: time complexity (latency) and communication complexity (scalability).
Finally, we present new solutions for distributed systems enhanced by quantum communication channels.
A common feature among the proposed solutions is their reliance on combinatorial properties of sparse yet well-connected graphs: probabilistic or deterministic expanders. The thesis unveils and provides a thorough treatment of numerous new properties of these graphs related to fast and efficient communication patterns in distributed systems. Last but not least, beyond the main results, it introduces several novel paradigms and algorithms for crash-resilient distributed computing (both randomized and deterministic), including, among others, counting, weak local coin, gossip, and checkpointing.
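The communication advantage of sparse, well-connected graphs can be illustrated with a small gossip simulation. The construction below (each process links to $d$ uniformly random peers, made symmetric) is a standard random-graph sketch, not a construction from the thesis; such graphs are expanders with high probability.

```python
import random

def gossip_rounds(n, d, seed=0):
    """Illustrative sketch: a rumor started at process 0 is pushed
    along all links of a sparse random graph each round.  With
    d = Theta(log n) the graph is an expander with high probability,
    so the rumor reaches all n processes in O(log n) rounds while each
    process sends only d messages per round (vs. n - 1 for flooding)."""
    rng = random.Random(seed)
    nbrs = [set() for _ in range(n)]
    for i in range(n):
        for j in rng.sample([k for k in range(n) if k != i], d):
            nbrs[i].add(j)
            nbrs[j].add(i)                   # symmetric links
    informed, rounds = {0}, 0
    while len(informed) < n and rounds < n:  # cap guards against a
        informed |= {j for i in informed     # (very unlikely) disconnected
                     for j in nbrs[i]}       # sample
        rounds += 1
    return rounds, len(informed)
```

For example, `gossip_rounds(256, 8)` informs all 256 processes in a handful of rounds, hinting at how expander-based communication patterns trade a logarithmic factor in degree for near-optimal latency and message counts.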