Question 8 of 10
Explain the mathematical formulation of multi-head attention. Why is multi-head attention more effective than single-head attention, and how do different heads learn different patterns?
Sample answer preview
Multi-head attention extends the basic attention mechanism by running multiple attention operations in parallel, each with its own learned projections. This allows the model to jointly attend to information from different representation subspaces at different positions. A single head computes only one softmax distribution per position, so averaging forces it to compromise among competing patterns; separate heads remove that bottleneck. Empirically, trained heads specialize: some track syntactic relations such as subject-verb dependencies, while others attend to fixed positional offsets like the previous token.
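The standard formulation, following Vaswani et al. (2017), uses scaled dot-product attention as the building block:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)

where W_i^Q, W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}, W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}, and W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}, with d_k = d_v = d_{\text{model}}/h in the original paper. Because each head operates on a reduced dimension, the total compute is similar to a single head at full dimensionality.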
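A minimal NumPy sketch of this computation (function and variable names are illustrative, not from any particular library):

import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    # X: (seq_len, d_model). Wq/Wk/Wv/Wo: (d_model, d_model).
    # Slicing one large projection into num_heads chunks is equivalent to
    # keeping separate W_i^Q, W_i^K, W_i^V matrices per head.
    seq_len, d_model = X.shape
    d_k = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_k)
        return M.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ Wq), split_heads(X @ Wk), split_heads(X @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (num_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)                # one distribution per head, per position
    heads = weights @ V                               # (num_heads, seq_len, d_k)
    # Concatenate heads back to (seq_len, d_model), then apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

# Usage: random weights, just to check shapes.
rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 64, 8, 10
X = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads)
assert out.shape == (seq_len, d_model)

Each head computes its own softmax over the full sequence in a d_k-dimensional subspace, so different heads are free to place their attention mass on different positions, which a single full-width head cannot do with one distribution.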
Tags: scaled dot-product attention, attention heads, query/key/value projections, head specialization, syntactic heads, positional heads