
Large-Scale Training & Parallelism in DNN Models
Explore the challenges of and solutions for training large models without batches: pipeline parallelism, model parallelism, and maximizing hardware utilization. Learn about synchronous vs. asynchronous training, data parallelism vs. model parallelism, and the impact of pipeline latency.
Presentation Transcript
Pipelined Backpropagation at Scale: Training Large Models without Batches
Atli Kosson, Vitaliy Chiley, Abhinav Venigalla, Joel Hestness, Urs Koster (Cerebras)
vs.
PipeMare: Asynchronous Pipeline Parallel DNN Training
Bowen Yang, Jian Zhang, Jonathan Li, Christopher Re, Christopher Aberger, Christopher De Sa (SambaNova)

Motivation
- Models are growing larger: the model must fit in memory, and memory latency matters
- New accelerators offer large on-chip memory (SRAM) and many cores on a single chip (e.g. 40 GB of on-chip SRAM, 850,000 cores)
- Goal: maximize hardware utilization by using model parallelism/pipelining to fit these new accelerators

Preliminaries
Training procedure, for each data sample:
- Forward pass: from input to output
- Backward pass: from output to input
- Collect gradients (g) from 1 or more samples
- Update the model weights (w) with SGD or a variant:
  - Vanilla SGD: w_{t+1} = w_t - α g_t
  - (Polyak's) Momentum SGD: v_t = m v_{t-1} + α g_t, then w_{t+1} = w_t - v_t
  - α: learning rate, m: momentum, t: current iteration

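To make the update rules concrete, here is a minimal NumPy sketch of the two optimizers above; the toy objective and hyperparameter values are illustrative assumptions, not taken from either paper.

```python
import numpy as np

def sgd_step(w, g, lr=0.1):
    # Vanilla SGD: w_{t+1} = w_t - lr * g_t
    return w - lr * g

def momentum_sgd_step(w, v, g, lr=0.1, momentum=0.9):
    # Momentum SGD: v_t = m * v_{t-1} + lr * g_t ; w_{t+1} = w_t - v_t
    v = momentum * v + lr * g
    return w - v, v

# Toy usage: minimize f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.ones(4)
v = np.zeros_like(w)
for t in range(100):
    g = w                       # gradient of the toy objective at the current weights
    w, v = momentum_sgd_step(w, v, g)
print(w)                        # close to zero after these iterations
```
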
Preliminaries
Mini-batch training:
- Mini-batch: a group of random input samples
- Micro-batch: a smaller chunk obtained by breaking down the mini-batch
- Calculate the gradients for each micro-batch (and accumulate them over the whole mini-batch)

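A minimal sketch of splitting one mini-batch into micro-batches and accumulating their gradients; the mini-batch size, micro-batch size, and squared-error objective are assumptions made for illustration.

```python
import numpy as np

def micro_batch_gradients(w, X, y, micro_batch_size):
    """Yield the squared-error gradient for each micro-batch of the mini-batch."""
    for start in range(0, len(X), micro_batch_size):
        xb = X[start:start + micro_batch_size]
        yb = y[start:start + micro_batch_size]
        residual = xb @ w - yb
        yield xb.T @ residual / len(xb)      # gradient from this micro-batch only

rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 3)), rng.normal(size=32)   # one mini-batch of 32 samples
w = np.zeros(3)

# Average the micro-batch gradients, then apply a single update for the mini-batch.
grads = list(micro_batch_gradients(w, X, y, micro_batch_size=8))
w -= 0.1 * np.mean(grads, axis=0)
```
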
Preliminaries
Synchronous vs. asynchronous updates:
- Synchronous: w_{t+1} = w_t - α g_t
  - Always uses the latest model
  - Requires a slow global barrier
- Asynchronous: w_{t+1} = w_t - α g_{t-τ}
  - g_{t-τ}: gradient evaluated on w_{t-τ}; τ: staleness
  - No barrier, but the staleness incurs error

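The sketch below contrasts a synchronous update with an asynchronous one that applies a gradient computed on τ-step-old weights; the quadratic toy objective and the staleness value are assumptions for illustration only.

```python
import numpy as np

def train(num_steps=50, lr=0.1, staleness=0):
    # Toy objective f(w) = 0.5 * ||w||^2, so grad(w) = w.
    w = np.ones(4)
    history = [w.copy()]                 # past weight versions
    for t in range(num_steps):
        # staleness = 0 reproduces the synchronous update; otherwise the
        # gradient is evaluated on the weights from `staleness` steps ago.
        stale_w = history[max(0, len(history) - 1 - staleness)]
        w = w - lr * stale_w
        history.append(w.copy())
    return float(np.linalg.norm(w))

print(train(staleness=0))   # synchronous
print(train(staleness=4))   # stale gradients: slower progress, can diverge if lr is too large
```
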
Preliminaries
Data parallelism vs. model parallelism:
- Data parallelism: all workers hold the same copy of the model weights; different workers process different micro-batches
- Model parallelism: each worker holds a part of the model weights; different workers compute different parts
  - Pipeline parallelism: each worker keeps whole layers
  - Sharded parallelism: each layer is split across workers

Preliminaries
[Figure: data parallelism vs. pipeline parallelism, illustrating context switching and pipeline latency]

Preliminaries
Why pipeline parallelism?
- Assign each layer (its weights) to one pipeline stage/worker (fine-grained)
- Keep the weights in on-chip memory (SRAM): no context switching, low memory latency, improved utilization
- Fits the new accelerators: large SRAM can hold one layer, and pipelining happens inside a huge chip

Preliminaries
Issues in naïve pipelining (GPipe):
- The pipeline must be drained (emptied) before the next mini-batch can start
- This under-utilizes the pipeline (drained-pipeline bubbles)
- The problem gets worse as the pipeline gets deeper

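A rough back-of-the-envelope calculation (not from the slides) of the fill-and-drain bubble in a GPipe-style schedule, assuming equally sized stages: with S stages and M micro-batches per mini-batch, roughly (S - 1) of every (M + S - 1) time slots are idle per stage.

```python
def gpipe_bubble_fraction(num_stages: int, num_micro_batches: int) -> float:
    """Approximate idle fraction of a fill-and-drain pipeline schedule."""
    return (num_stages - 1) / (num_micro_batches + num_stages - 1)

# Deeper pipelines waste more time unless the micro-batch count grows with them.
for stages in (4, 16, 64):
    print(stages, round(gpipe_bubble_fraction(stages, num_micro_batches=8), 2))
```
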
Roadmap
Idea: eliminate the pipeline barrier
- Process the next mini-batch immediately, so the pipeline is never empty
- Keep the pipeline fully occupied for maximized utilization
- The cost: we need to deal with inconsistent weights

Related work
- GPipe: naïve pipelining
- PipeDream: the model weights are updated while the 2nd mini-batch is already in the pipeline
  - Cache the previous weights for the backward pass (weight stashing)
  - Memory usage: the stashed weights do not fit in SRAM
  - Gradient staleness: effectively asynchronous SGD
[Figure: pipeline timeline marking the end of the first mini-batch, the weight update, and the 1st/2nd micro-batches of the 2nd mini-batch]

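A minimal sketch of PipeDream-style weight stashing for a single pipeline stage: each micro-batch's forward pass records the weight version it used, and the matching backward pass reuses that stashed copy. The stage class, layer shapes, and micro-batch IDs are illustrative assumptions, not PipeDream's actual implementation.

```python
import numpy as np

class StashingStage:
    """One linear pipeline stage that stashes the weights used in each forward pass."""

    def __init__(self, in_dim, out_dim, lr=0.01):
        self.w = np.random.randn(in_dim, out_dim) * 0.1
        self.lr = lr
        self.stash = {}                              # micro-batch id -> (weights used, cached input)

    def forward(self, mb_id, x):
        self.stash[mb_id] = (self.w.copy(), x)       # weight stashing: this is the memory cost
        return x @ self.w

    def backward(self, mb_id, grad_out):
        w_fwd, x = self.stash.pop(mb_id)
        grad_w = x.T @ grad_out                      # gradient w.r.t. the stashed weights
        grad_in = grad_out @ w_fwd.T                 # gradient for the previous stage
        self.w -= self.lr * grad_w                   # but the update is applied to the latest weights
        return grad_in

stage = StashingStage(4, 2)
out = stage.forward(mb_id=0, x=np.ones((8, 4)))
stage.backward(mb_id=0, grad_out=np.ones((8, 2)))    # other micro-batches may have run in between
```
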
Roadmap
Inconsistent weights for gradient computation:
- w_fwd: the weight version used in the forward pass
- w_bwd: the weight version used in the backward pass
- w_cur: the weight at the current iteration
- In general w_fwd ≠ w_bwd ≠ w_cur; the methods differ in how they reconcile them:
  - GPipe: drain the pipeline, so w_fwd = w_bwd = w_cur
  - PipeDream: cache w_fwd, so w_bwd = w_fwd (but ≠ w_cur)
  - PipeMare: estimate the past w_fwd, so w_bwd ≈ w_fwd
  - PB (Pipelined Backpropagation): no correction, w_bwd = w_cur ≠ w_fwd
  - SpecTrain and Cerebras (plus other tricks): predict the future weight, so w_fwd ≈ w_cur at backward time

PipeMare
- PipeDream: w_bwd = w_fwd, with w_fwd cached
- PipeMare: no caching of w_fwd
  - Estimate the past w_fwd from the current w_cur
  - Maintain a moving average of the one-step weight update: δ_{t+1} = γ δ_t + (1 - γ)(w_{t+1} - w_t), with γ = c^{1/τ} for a constant c, where τ is the staleness
  - Use w_cur - τ δ (≈ w_fwd) for the backward pass

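A minimal sketch of the discrepancy correction described above, assuming the reconstruction of the garbled formulas is right: keep an exponential moving average of the per-step weight change and step back τ times that average instead of stashing old weights. The decay value, staleness, and toy update are assumptions, not PipeMare's exact hyperparameters.

```python
import numpy as np

class DiscrepancyCorrection:
    """Approximate w_fwd ≈ w_cur - tau * delta without caching old weight versions."""

    def __init__(self, shape, tau, decay=0.9):
        self.delta = np.zeros(shape)     # moving average of (w_{t+1} - w_t)
        self.tau = tau
        self.decay = decay               # stands in for gamma = c**(1/tau)

    def record_update(self, w_old, w_new):
        self.delta = self.decay * self.delta + (1 - self.decay) * (w_new - w_old)

    def estimate_forward_weights(self, w_cur):
        return w_cur - self.tau * self.delta     # roll back tau average steps

# Toy usage: the optimizer moves the weights by a roughly constant step each iteration.
w = np.zeros(3)
dc = DiscrepancyCorrection(w.shape, tau=4)
for t in range(20):
    w_new = w - 0.1 * np.ones(3)         # pretend this is one optimizer step
    dc.record_update(w, w_new)
    w = w_new
print(dc.estimate_forward_weights(w))    # close to the weights from 4 steps ago
```
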
Pipelined Backpropagation (PB)
- w_bwd = w_cur ≠ w_fwd
- No cache, no estimation: always use the latest version of w, in both forward and backward
- Easy to diverge due to the inconsistency
[Figure: PipeDream vs. PB pipeline schedules]

SpecTrain
- PipeMare estimates the past weight for the backward pass; SpecTrain predicts the future weight for the forward pass
- No caching: predict the future w_fwd from the current w_cur
- Maintain a moving average of the one-step weight update: v_{t+1} = γ v_t + (1 - γ)(w_{t+1} - w_t), with γ fixed (or use the gradient instead of w_{t+1} - w_t)
- Use w_cur + τ v (≈ the future weight) for the forward pass, where τ is the staleness

Cerebras method
- SpecTrain: predict the future weight in the forward pass with a moving average: v_{t+1} = γ v_t + (1 - γ)(w_{t+1} - w_t), and use w_cur + τ v in the forward pass
- Cerebras: Linear Weight Prediction (LWP): use w_cur + τ (w_{t+1} - w_t), i.e. basically just take γ = 0 in SpecTrain
- Plus other variants

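A minimal sketch covering both prediction schemes above: with a smoothing factor γ > 0 it behaves like the SpecTrain-style moving average, and with γ = 0 it reduces to linear weight prediction from the most recent step, which is how the slide characterizes the Cerebras method. Class and parameter names are illustrative assumptions.

```python
import numpy as np

class WeightPredictor:
    """Predict the weights tau steps ahead from the recent per-step weight change."""

    def __init__(self, shape, tau, gamma=0.9):
        self.avg_step = np.zeros(shape)   # smoothed (w_{t+1} - w_t); gamma = 0 keeps only the latest step
        self.tau = tau
        self.gamma = gamma

    def record_update(self, w_old, w_new):
        self.avg_step = self.gamma * self.avg_step + (1 - self.gamma) * (w_new - w_old)

    def predict(self, w_cur):
        # SpecTrain-style: w_cur + tau * smoothed step; LWP is the gamma = 0 special case.
        return w_cur + self.tau * self.avg_step

spectrain_like = WeightPredictor(shape=(3,), tau=4, gamma=0.9)
lwp_like = WeightPredictor(shape=(3,), tau=4, gamma=0.0)
```
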
PipeMare vs. Cerebras
- Mitigating the inconsistency:
  - PipeMare: estimate the past weight; the backward pass uses roughly g(ŵ_{t-τ}, x_{t-τ})
  - Cerebras: predict the future weight; the backward pass uses roughly g(ŵ_t, x_t)
- Gradient error relative to the ideal g(w_t, x_t):
  - PipeMare: [g(ŵ_{t-τ}, x_{t-τ}) - g(w_{t-τ}, x_{t-τ})] (inconsistency) + [g(w_{t-τ}, x_{t-τ}) - g(w_t, x_t)] (staleness)
  - Cerebras: g(ŵ_t, x_t) - g(w_t, x_t) (inconsistency only)
- PipeMare: additional error from staleness; estimating the past is (intuitively) easier; more flexible in asynchronous training
- Cerebras: inconsistency error only; predicting the future is harder; τ must be known and fixed in advance

PipeMare
Mitigate the staleness error:
- Re-scale the learning rate with respect to the staleness: α ∝ 1/τ

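A tiny sketch of the α ∝ 1/τ idea, assuming each pipeline stage sees a different staleness and therefore gets its own step size; the exact schedule PipeMare uses may differ.

```python
def rescaled_lr(base_lr: float, staleness: int) -> float:
    """Scale the learning rate inversely with the staleness (alpha proportional to 1/tau)."""
    return base_lr / max(1, staleness)

# Stages with larger delay take smaller steps.
for tau in (0, 1, 4, 16):
    print(tau, rescaled_lr(0.1, tau))
```
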
Cerebras
Compensate for the delayed update, with Momentum SGD: v_t = m v_{t-1} + α g_t, w_{t+1} = w_t - v_t
Spike Compensation (SC):
- Ideally, with no staleness, the weights would have been updated by this gradient τ steps ago
- But the actual update is delayed by τ steps
- SC compensates for the updates missed during that gap of τ steps:
  w_{t+1} = w_t - (a v_t + b α g_t), with a = m^τ and b = (1 - m^τ)/(1 - m)
- Some potential concerns

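A minimal sketch of the spike-compensated momentum step as reconstructed above; the coefficients a = m^τ and b = (1 - m^τ)/(1 - m) follow that reconstruction, and the staleness and toy gradient are assumptions. Note that τ = 0 gives a = 1, b = 0, which recovers plain momentum SGD.

```python
import numpy as np

def spike_compensated_step(w, v, g, lr=0.1, momentum=0.9, tau=4):
    """Momentum SGD step with spike compensation for a gradient delayed by tau steps."""
    a = momentum ** tau                              # remaining weight of the velocity term
    b = (1 - momentum ** tau) / (1 - momentum)       # lump sum of the tau missed contributions
    v = momentum * v + lr * g                        # ordinary momentum accumulation
    w = w - (a * v + b * lr * g)                     # compensated update (tau = 0 gives w - v)
    return w, v

w, v = np.ones(3), np.zeros(3)
g = np.ones(3)                                       # stands in for the stale gradient arriving now
w, v = spike_compensated_step(w, v, g)
```
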
Experiments
Applications:
- Image classification
- Machine translation
Baselines:
- GPipe
- PipeDream
- PB
- SpecTrain

Experiments: [result figures]

Final comments
- Pipelining + inconsistent weights + error mitigation
- SpecTrain seems to work better than PipeMare
- Compared to SpecTrain, the Cerebras method offers limited improvement
- Open questions:
  - Other ways to mitigate inconsistency and staleness
  - Better combinations with more complex SGD variants such as Adam