Question 6 of 10
Explain the three stages of ZeRO optimization in DeepSpeed. How does each stage reduce memory, and what are the communication trade-offs? When would you use each stage?
Sample answer preview
ZeRO, the Zero Redundancy Optimizer, eliminates memory redundancy in data parallel training by partitioning model states across devices instead of replicating them. DeepSpeed implements ZeRO in three stages, each providing greater memory savings with corresponding communication…
ZeRO, optimizer state partitioning, gradient partitioning, parameter partitioning, reduce-scatter, all-gather
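To make the memory and communication trade-offs in the question concrete, here is a rough back-of-the-envelope sketch. It is not taken from the answer above. It assumes mixed-precision training with Adam, where each parameter costs about 2 bytes for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes of fp32 optimizer state (master weights, momentum, variance), the accounting commonly used when describing ZeRO. The function names `zero_model_state_bytes` and `zero_comm_volume_elements` are hypothetical helpers, not part of the DeepSpeed API, and activations, buffers, and fragmentation are ignored.

```python
def zero_model_state_bytes(num_params, num_gpus, stage, optimizer_bytes_per_param=12):
    """Approximate per-GPU bytes for model states under a given ZeRO stage.

    Assumption: fp16 weights (2 B/param), fp16 gradients (2 B/param), and
    `optimizer_bytes_per_param` bytes of optimizer state (12 for Adam in
    mixed precision). Stage 0 means plain data parallelism.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    params = 2.0 * num_params
    grads = 2.0 * num_params
    optim = float(optimizer_bytes_per_param) * num_params

    if stage >= 1:  # Stage 1: partition optimizer states
        optim /= num_gpus
    if stage >= 2:  # Stage 2: also partition gradients
        grads /= num_gpus
    if stage >= 3:  # Stage 3: also partition parameters
        params /= num_gpus
    return params + grads + optim


def zero_comm_volume_elements(num_params, stage):
    """Approximate elements sent per GPU per training step.

    Assumption: stages 0-2 move about 2x the parameter count (reduce-scatter
    of gradients plus all-gather of updated parameters, equivalent to an
    all-reduce), while stage 3 adds parameter all-gathers for the forward
    and backward passes, for about 3x in total.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    return (3 if stage == 3 else 2) * num_params


if __name__ == "__main__":
    n_params, n_gpus = 7.5e9, 64  # illustrative model and cluster size
    for s in range(4):
        gb = zero_model_state_bytes(n_params, n_gpus, s) / 1e9
        comm = zero_comm_volume_elements(n_params, s) / n_params
        print(f"stage {s}: {gb:.1f} GB of model states per GPU, "
              f"{comm:.0f}x parameter count communicated per step")
```

Under these assumptions, a 7.5B-parameter model on 64 GPUs goes from about 120 GB of model states per GPU with plain data parallelism to roughly 31.4 GB at stage 1, 16.6 GB at stage 2, and 1.9 GB at stage 3, with only stage 3 paying the extra communication (about 1.5x the baseline). That is the usual reasoning behind picking the lowest stage that fits in memory.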