Question 8 of 10
Explain the mathematical formulation of multi-head attention. Why is multi-head attention more effective than single-head attention, and how do different heads learn different patterns?
Sample answer preview
Multi-head attention extends the basic attention mechanism by running multiple attention operations in parallel, each with its own learned projections. This allows the model to jointly attend to information from different representation subspaces at different positions. A single head computes only one softmax distribution per position, so averaging forces it to compromise among competing patterns; separate heads remove that bottleneck. Empirically, trained heads specialize: some track syntactic relations such as subject-verb dependencies, while others attend to fixed positional offsets like the previous token.
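The standard formulation, following Vaswani et al. (2017), uses scaled dot-product attention as the building block:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)

where W_i^Q, W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}, W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}, and W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}, with d_k = d_v = d_{\text{model}}/h in the original paper. Because each head operates on a reduced dimension, the total compute is similar to a single head at full dimensionality.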
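A minimal NumPy sketch of this computation (function and variable names are illustrative, not from any particular library):

import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    # X: (seq_len, d_model). Wq/Wk/Wv/Wo: (d_model, d_model).
    # Slicing one large projection into num_heads chunks is equivalent to
    # keeping separate W_i^Q, W_i^K, W_i^V matrices per head.
    seq_len, d_model = X.shape
    d_k = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_k)
        return M.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ Wq), split_heads(X @ Wk), split_heads(X @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (num_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)                # one distribution per head, per position
    heads = weights @ V                               # (num_heads, seq_len, d_k)
    # Concatenate heads back to (seq_len, d_model), then apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

# Usage: random weights, just to check shapes.
rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 64, 8, 10
X = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads)
assert out.shape == (seq_len, d_model)

Each head computes its own softmax over the full sequence in a d_k-dimensional subspace, so different heads are free to place their attention mass on different positions, which a single full-width head cannot do with one distribution.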
Tags: scaled dot-product attention, attention heads, query/key/value projections, head specialization, syntactic heads, positional heads