Question 9 of 10
Standard transformers have quadratic complexity with respect to sequence length. What techniques exist to handle long sequences efficiently, and what are the trade-offs of each approach?
Sample answer preview
Standard transformer self-attention computes attention scores between all pairs of positions, resulting in O(n^2) time and memory complexity where n is sequence length. This becomes prohibitive for long documents, genomic sequences, or high-resolution images.
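To make the quadratic term concrete, here is a minimal sketch of vanilla scaled dot-product attention in PyTorch. The function name, shapes, and the sequence length of 4096 are illustrative choices, not part of the question; the point is the (n, n) score matrix, which is where the O(n^2) time and memory come from.

```python
import torch
import torch.nn.functional as F

def full_attention(q, k, v):
    """Vanilla scaled dot-product attention.

    q, k, v: (batch, n, d). The intermediate `scores` tensor has
    shape (batch, n, n), so both time and memory grow quadratically
    with the sequence length n.
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # (batch, n, n): the O(n^2) term
    weights = F.softmax(scores, dim=-1)
    return weights @ v                           # (batch, n, d)

q = k = v = torch.randn(1, 4096, 64)
out = full_attention(q, k, v)  # scores alone hold 4096 * 4096 floats per batch/head
```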
Longformer, BigBird, Performer, Transformer-XL, sparse attention, linear attention
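Linear attention, one of the techniques tagged above, contrasts cleanly with the full-attention sketch: replacing the softmax with a positive feature map phi lets the product be reassociated as phi(Q)(phi(K)^T V), so the (n, n) matrix is never materialized and the cost drops to O(n * d^2). The sketch below is a hedged illustration assuming the simple elu(x) + 1 feature map from Katharopoulos et al.'s linear transformers, not Performer's random-feature kernel.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Linear attention via the kernel trick.

    q, k, v: (batch, n, d). Computes phi(Q) @ (phi(K)^T @ V) with a
    per-row normalizer, never forming an (n, n) attention matrix.
    The elu + 1 feature map is an illustrative assumption.
    """
    phi_q = F.elu(q) + 1  # positive feature map, shape (batch, n, d)
    phi_k = F.elu(k) + 1
    kv = phi_k.transpose(-2, -1) @ v  # (batch, d, d): size independent of n
    # Normalizer: phi(q_i) . sum_j phi(k_j), one scalar per position.
    z = phi_q @ phi_k.sum(dim=-2, keepdim=True).transpose(-2, -1)  # (batch, n, 1)
    return (phi_q @ kv) / (z + eps)

q = k = v = torch.randn(1, 4096, 64)
out = linear_attention(q, k, v)  # O(n) in sequence length
```

The trade-off this makes explicit: the feature map only approximates softmax attention, so quality can degrade on tasks that depend on sharp, near-one-hot attention distributions, which is why sparse-pattern methods like Longformer and BigBird remain competitive despite their extra bookkeeping.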