OPTIMIZING COMMUNICATION IN PARALLEL DEEP LEARNING ON EXASCALE-CLASS MACHINES

dc.contributor.advisor: Bhatele, Abhinav
dc.contributor.author: Singh, Siddharth
dc.contributor.department: Computer Science
dc.contributor.publisher: Digital Repository at the University of Maryland
dc.contributor.publisher: University of Maryland (College Park, Md.)
dc.date.accessioned: 2025-08-08T12:04:02Z
dc.date.issued: 2025
dc.description.abstract: Deep learning has made significant advances across many fields, driven by increasingly large neural networks and massive datasets. These improvements come at the cost of high computational demands, requiring thousands of GPUs operating in parallel for extreme-scale model training. At such scales, the overheads of inter-GPU communication become a major bottleneck and severely limit efficient use of hardware resources. This dissertation addresses communication challenges in large-scale parallel training. It develops hybrid parallel algorithms that reduce communication overhead, along with asynchronous, message-driven communication methods that enable better overlap of computation and communication. A performance modeling framework is presented to identify communication-minimizing configurations for a given workload. Finally, scalable implementations of latency-optimal collective communication are developed to support efficient training at scale. Together, these contributions improve the performance and scalability of distributed deep learning systems, enabling faster model convergence and better resource utilization across large GPU clusters.
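The abstract's point about overlapping computation with communication can be illustrated with a minimal sketch, not taken from the dissertation itself: it uses PyTorch's asynchronous collectives (torch.distributed with async_op=True) as a stand-in for the message-driven communication methods described above, and the bucket sizes, tensor shapes, and backend choice are assumptions made purely for illustration.

```python
# Hypothetical sketch of computation/communication overlap with
# asynchronous all-reduce; not the dissertation's implementation.
# Launch with torchrun so RANK/WORLD_SIZE/MASTER_ADDR are set.
import torch
import torch.distributed as dist

def start_allreduce(buckets):
    """Kick off a non-blocking all-reduce for each gradient bucket and
    return the work handles so the caller can keep computing."""
    handles = []
    for grad in buckets:
        # async_op=True returns immediately with a handle instead of blocking
        handles.append(dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True))
    return handles

def finish_allreduce(buckets, handles, world_size):
    """Block only when the averaged gradients are actually needed."""
    for grad, h in zip(buckets, handles):
        h.wait()
        grad.div_(world_size)

if __name__ == "__main__":
    dist.init_process_group(backend="gloo")  # "nccl" on GPU clusters
    world = dist.get_world_size()
    grads = [torch.ones(1024) for _ in range(4)]  # stand-in gradient buckets
    handles = start_allreduce(grads)
    # ... remaining backward-pass computation could run here, overlapped
    # with the in-flight all-reduce operations ...
    finish_allreduce(grads, handles, world)
    dist.destroy_process_group()
```

In a real training loop the asynchronous collectives would typically be launched from backward hooks as each gradient bucket becomes ready, so that communication for early layers proceeds while later layers are still being computed.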
dc.identifier: https://doi.org/10.13016/nwae-tqjr
dc.identifier.uri: http://hdl.handle.net/1903/34204
dc.language.iso: en
dc.subject.pqcontrolled: Computer science
dc.subject.pqcontrolled: Artificial intelligence
dc.subject.pquncontrolled: Asynchronous Communication
dc.subject.pquncontrolled: Collective Communication
dc.subject.pquncontrolled: Expert Parallelism
dc.subject.pquncontrolled: Model Parallelism
dc.subject.pquncontrolled: Parallel Deep Learning
dc.subject.pquncontrolled: Performance Modeling
dc.title: OPTIMIZING COMMUNICATION IN PARALLEL DEEP LEARNING ON EXASCALE-CLASS MACHINES
dc.type: Dissertation

Files

Original bundle

Name: Singh_umd_0117E_25064.pdf
Size: 1.56 MB
Format: Adobe Portable Document Format