Question 5 of 10
How does tensor parallelism work for large transformer models? Explain how attention and MLP layers are partitioned, and what communication is required.
Sample answer preview
Tensor parallelism splits individual layers across multiple devices, making it possible to run layers whose weights are too large for a single device's memory. Megatron-LM pioneered efficient tensor-parallelism strategies for transformers, demonstrating how to partition attention and MLP blocks with minimal…
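The full answer is truncated above, so the following is only a minimal single-process sketch of the Megatron-style scheme it alludes to: the first MLP weight is split column-wise, the second row-wise, and one all-reduce sums the partial outputs. All names, shapes, and the simulated two-device setup are illustrative assumptions, not taken from the hidden answer.

```python
# Minimal NumPy sketch of Megatron-style tensor parallelism for one MLP block,
# simulating two "devices" in a single process (illustrative assumption).
import numpy as np

rng = np.random.default_rng(0)
batch, d_model, d_ff, n_dev = 4, 8, 32, 2

X = rng.normal(size=(batch, d_model))   # input activations, replicated on every device
W1 = rng.normal(size=(d_model, d_ff))   # first MLP weight (up-projection)
W2 = rng.normal(size=(d_ff, d_model))   # second MLP weight (down-projection)
gelu = lambda x: 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Reference: the unsharded MLP.
ref = gelu(X @ W1) @ W2

# Column parallelism: split W1 along its output (column) dimension.
# Each device computes gelu(X @ W1_shard) independently with no communication,
# because GeLU is elementwise over a disjoint slice of the hidden features.
W1_shards = np.split(W1, n_dev, axis=1)

# Row parallelism: split W2 along its input (row) dimension so each device
# consumes its own hidden slice and holds a *partial sum* of the output.
W2_shards = np.split(W2, n_dev, axis=0)

partials = [gelu(X @ W1_shards[d]) @ W2_shards[d] for d in range(n_dev)]

# The "all-reduce": sum partial outputs across devices so every device ends up
# with the full MLP output (one all-reduce in the forward pass of the block,
# and one more in the backward pass).
out = np.sum(partials, axis=0)

assert np.allclose(out, ref), "tensor-parallel MLP must match the unsharded MLP"
print("max abs error:", np.abs(out - ref).max())
```

Attention follows the same pattern: the QKV projections are split column-wise so each device owns a subset of heads, the output projection is split row-wise, and a single all-reduce combines the partial results, which is where the all-reduce and all-gather collectives in the tags below come in.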
Tags: tensor parallelism, column parallelism, row parallelism, Megatron-LM, all-reduce, all-gather