OPTIMIZING COMMUNICATION IN PARALLEL DEEP LEARNING ON EXASCALE-CLASS MACHINES

Advisor

Bhatele, Abhinav

Abstract

Deep learning has made significant advancements across various fields, driven by increasingly large neural networks and massive datasets. However, these improvements come at the cost of high computational demands, necessitating the use of thousands of GPUs operating in parallel for extreme-scale model training. At such scales, the overheads associated with inter-GPU communication become a major bottleneck, severely limiting efficient utilization of hardware resources.

This dissertation addresses communication challenges in large-scale parallel training. It develops hybrid parallel algorithms designed to reduce communication overhead, along with asynchronous, message-driven communication methods that enable better overlap of computation and communication. A performance modeling framework is presented to identify communication-minimizing configurations for a given workload. Finally, scalable implementations of latency-optimal collective communication are developed to support efficient training at scale. Together, these contributions improve the performance and scalability of distributed deep learning, enabling faster model convergence and better resource utilization across large GPU clusters.
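
Below is a minimal sketch, not code from the dissertation, of the computation-communication overlap idea mentioned in the abstract: gradient buckets are all-reduced asynchronously with PyTorch's torch.distributed API while independent computation proceeds, and the program blocks only when the reduced values are needed. The bucket sizes, the compute_fn placeholder, and the process-group setup are illustrative assumptions.

    import torch
    import torch.distributed as dist

    def allreduce_with_overlap(grad_buckets, compute_fn):
        """Launch asynchronous all-reduces on gradient buckets and overlap
        them with independent computation before waiting on the results."""
        handles = []
        for bucket in grad_buckets:
            # async_op=True returns immediately with a work handle, so the
            # collective proceeds in the background.
            handles.append(dist.all_reduce(bucket, op=dist.ReduceOp.SUM,
                                           async_op=True))

        # Independent computation (e.g., the backward pass of earlier layers)
        # runs while the collectives are in flight.
        result = compute_fn()

        # Block only at the point where the reduced gradients are needed.
        for handle in handles:
            handle.wait()
        return result

    if __name__ == "__main__":
        # Assumes launch via torchrun, which sets RANK, WORLD_SIZE, etc.
        dist.init_process_group(backend="gloo")
        buckets = [torch.ones(1024) for _ in range(4)]  # stand-in gradient buckets
        out = allreduce_with_overlap(buckets, lambda: torch.zeros(1))
        dist.destroy_process_group()

With an NCCL backend on GPUs, the same pattern allows the reductions to proceed on communication streams while computation continues, which is one generic way to realize the kind of overlap the abstract describes.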
