A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training

dc.contributor.author: Singh, Siddharth
dc.contributor.author: Ruwase, Olatunji
dc.contributor.author: Awan, Ammar Ahmad
dc.contributor.author: Rajbhandari, Samyam
dc.contributor.author: He, Yuxiong
dc.contributor.author: Bhatele, Abhinav
dc.date.accessioned: 2023-09-14T19:49:28Z
dc.date.available: 2023-09-14T19:49:28Z
dc.date.issued: 2023-06-21
dc.description.abstract: Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely activated expert blocks to a base model, increasing the number of parameters without impacting computational costs. However, current distributed deep learning frameworks are limited in their ability to train high-quality MoE models with large base models. In this work, we present DeepSpeed-TED, a novel, three-dimensional, hybrid parallel algorithm that combines data, tensor, and expert parallelism to enable the training of MoE models with 4–8× larger base models than the current state-of-the-art. We also describe memory optimizations in the optimizer step, and communication optimizations that eliminate unnecessary data movement. We implement our approach in DeepSpeed and achieve speedups of 26% over a baseline (i.e., without our communication optimizations) when training a 40 billion parameter MoE model (6.7 billion base model with 16 experts) on 128 V100 GPUs.
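
For readers unfamiliar with the architecture, the short PyTorch sketch below illustrates the idea the abstract relies on: a gated MoE block multiplies the parameter count by the number of experts while routing each token to a single expert, so per-token compute stays close to that of one dense feed-forward block. This is a minimal illustrative toy, not the DeepSpeed-TED implementation; the names (ToyMoELayer, d_model, d_ff, num_experts) and the top-1 gating choice are assumptions made for the example.

# Illustrative toy only -- not the DeepSpeed-TED implementation.
# Top-1 gated MoE feed-forward block: parameters grow with num_experts,
# but each token is processed by exactly one expert.
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=16):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)  # learned router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (num_tokens, d_model)
        expert_idx = self.gate(x).argmax(dim=-1)  # top-1 routing per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                out[mask] = expert(x[mask])  # only the chosen expert runs on each token
        return out

if __name__ == "__main__":
    tokens = torch.randn(8, 512)
    print(ToyMoELayer()(tokens).shape)  # torch.Size([8, 512])

In a distributed setting, the experts of such a layer can be spread across GPUs (expert parallelism); DeepSpeed-TED additionally shards each expert's weights (tensor parallelism) and replicates the whole arrangement over data-parallel groups.
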
dc.description.uri: https://doi.org/10.1145/3577193.3593704
dc.identifier: https://doi.org/10.13016/dspace/u6zm-83ii
dc.identifier.citation: Siddharth Singh, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He, and Abhinav Bhatele. 2023. A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training. In 2023 International Conference on Supercomputing (ICS ’23), June 21–23, 2023, Orlando, FL, USA. ACM, New York, NY, USA, 12 pages.
dc.identifier.uri: http://hdl.handle.net/1903/30507
dc.language.iso: en_US
dc.publisher: Association for Computing Machinery (ACM)
dc.relation.isAvailableAt: College of Computer, Mathematical & Natural Sciences
dc.relation.isAvailableAt: Computer Science
dc.relation.isAvailableAt: Digital Repository at the University of Maryland
dc.relation.isAvailableAt: University of Maryland (College Park, MD)
dc.subject: parallel deep learning
dc.subject: mixture-of-experts
dc.subject: tensor parallelism
dc.subject: expert parallelism
dc.title: A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
dc.type: Article
local.equitableAccessSubmission: No

Files

Original bundle

Name: Singh et al.pdf
Size: 731.24 KB
Format: Adobe Portable Document Format