OPTIMIZING COMMUNICATION IN PARALLEL DEEP LEARNING ON EXASCALE-CLASS MACHINES

Advisor

Bhatele, Abhinav

Abstract

Deep learning has made significant advancements across various fields, driven by increasingly large neural networks and massive datasets. However, these improvements come at the cost of high computational demands, necessitating the use of thousands of GPUs operating in parallel for extreme-scale model training. At such scales, the overheads associated with inter-GPU communication become a major bottleneck, severely limiting efficient utilization of hardware resources.

This dissertation addresses communication challenges in large-scale parallel training. It develops hybrid parallel algorithms designed to reduce communication overhead, along with asynchronous, message-driven communication methods that enable better overlap of computation and communication. A performance modeling framework is presented to identify communication-minimizing configurations for a given workload. Finally, scalable implementations of latency-optimal collective communication are developed to support efficient training at scale. Together, these contributions improve the performance and scalability of distributed deep learning, enabling faster model convergence and better resource utilization across large GPU clusters.
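
Below is a minimal sketch, not code from the dissertation, of the computation-communication overlap idea mentioned in the abstract: gradient buckets are all-reduced asynchronously with PyTorch's torch.distributed API while independent computation proceeds, and the program blocks only when the reduced values are needed. The bucket sizes, the compute_fn placeholder, and the process-group setup are illustrative assumptions.

    import torch
    import torch.distributed as dist

    def allreduce_with_overlap(grad_buckets, compute_fn):
        """Launch asynchronous all-reduces on gradient buckets and overlap
        them with independent computation before waiting on the results."""
        handles = []
        for bucket in grad_buckets:
            # async_op=True returns immediately with a work handle, so the
            # collective proceeds in the background.
            handles.append(dist.all_reduce(bucket, op=dist.ReduceOp.SUM,
                                           async_op=True))

        # Independent computation (e.g., the backward pass of earlier layers)
        # runs while the collectives are in flight.
        result = compute_fn()

        # Block only at the point where the reduced gradients are needed.
        for handle in handles:
            handle.wait()
        return result

    if __name__ == "__main__":
        # Assumes launch via torchrun, which sets RANK, WORLD_SIZE, etc.
        dist.init_process_group(backend="gloo")
        buckets = [torch.ones(1024) for _ in range(4)]  # stand-in gradient buckets
        out = allreduce_with_overlap(buckets, lambda: torch.zeros(1))
        dist.destroy_process_group()

With an NCCL backend on GPUs, the same pattern allows the reductions to proceed on communication streams while computation continues, which is one generic way to realize the kind of overlap the abstract describes.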
