Question 5 of 10

How does tensor parallelism work for large transformer models? Explain how attention and MLP layers are partitioned, and what communication is required.

Sample answer preview

Tensor parallelism splits individual layers across multiple devices, making it possible to run layers that are too large for a single device's memory. Megatron-LM pioneered efficient tensor parallelism strategies for transformers, demonstrating how to partition attention and MLP blocks with minimal…
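The Megatron-LM partitioning mentioned above can be sketched numerically: the first MLP weight is split column-wise (each device computes an independent slice of the hidden activations, so the nonlinearity needs no communication), the second weight is split row-wise, and a single all-reduce sums the partial outputs. This is a minimal simulation using NumPy arrays in place of real devices; the shapes and the two-way split are illustrative assumptions, not taken from the full answer.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))    # activations: (batch, hidden)
A = rng.standard_normal((8, 16))   # first MLP weight (hidden -> 4*hidden style expansion)
B = rng.standard_normal((16, 8))   # second MLP weight (projection back to hidden)

def gelu(x):
    # tanh approximation of GELU, applied elementwise
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Unsharded reference computation
ref = gelu(X @ A) @ B

# Two-way tensor parallelism, Megatron style:
# A is split column-wise, B row-wise; each "device" holds one shard pair.
A_shards = np.split(A, 2, axis=1)  # column parallelism
B_shards = np.split(B, 2, axis=0)  # row parallelism

# Each device computes its partial output independently -- the elementwise
# GELU commutes with the column split, so no communication is needed here.
partials = [gelu(X @ Ai) @ Bi for Ai, Bi in zip(A_shards, B_shards)]

# One all-reduce (a sum across devices) combines the partial outputs.
out = sum(partials)

assert np.allclose(out, ref)  # sharded result matches the unsharded MLP
```

The same column-then-row pattern applies to attention: Q, K, V projections are split column-wise (one or more heads per device), and the output projection is split row-wise, again closing with a single forward all-reduce.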

Tags: tensor parallelism, column parallelism, row parallelism, Megatron-LM, all-reduce, all-gather
