Thumbnail Image
Publication or External Link
Jackson, Gary Lee
Sussman, Alan
Tightly-coupled parallel computing is an important tool for problem solving. Structured peer-to-peer network overlays are failure-tolerant and have a low admin- istrative burden. This work seeks to unite the two. First, I present a completely decentralized algorithm for parallel job scheduling and load balancing in distributed peer-to-peer environments. This algorithm is useful for meta-scheduling across known clusters and scheduling on desktop grids. To accomplish this, I build on previous work to route jobs to appropriate resources then use the new algorithm to start parallel jobs and balance load across the grid. I also discuss what constitutes useful clusterings for this algorithm as well as inherent scaling limitations. Ultimately, I show that my algorithm performs comparably to one using centralized load balancing with global up-to-date information. The principal contribution of this work is that the parallel job scheduling is completely decentralized, which is not featured in previous work, and enables reliable ad hoc sharing of distributed resources to run parallel computations. Second, I show how clusters of computers can be found dynamically by using an existing latency prediction technique coupled with a new refinement algorithm. Several latency prediction techniques are compared experimentally. One, based on a tree metric space embedding, is found to be superior to the others. Nevertheless, I show that it is not quite accurate enough. To solve this problem, I present a refinement algorithm for producing quality clusters while still maintaining bounds for the amount of information any given node must store about other nodes. I show that clusters derived this way have scheduler performance comparable to those chosen statically with global knowledge. Lastly, I discuss previously undiscovered under-specifications in the Content Addressable Network (CAN) structured peer to peer system. In high-churn situ- ations, the CAN allows stale information and changes to the overlay structure to create routing problems. I show solutions to these two problems, as well as discuss other issues that may also disrupt a CAN.