Self-Attention

Self-attention computes a weighted representation of every position in a sequence by comparing each token against every other token within the same sequence; queries, keys, and values all derive from the same input. For a sequence of length n, self-attention computes an n×n attention matrix where entry (i,j) represents how much token i attends to token j. This O(n²) computation is the transformer's core mechanism for modeling long-range dependencies: any token can directly access information from any other token regardless of distance, unlike RNNs, where distant information must propagate through many steps.
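The n×n attention matrix described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production implementation: it assumes the standard scaled dot-product formulation (softmax over scores divided by the square root of the key dimension), uses no masking or batching, and the projection matrices `w_q`, `w_k`, `w_v` and all sizes are hypothetical values chosen for the example.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (no masking, no batching)."""
    # Queries, keys, and values are all projections of the same input x.
    q = x @ w_q                                   # (n, d_k)
    k = x @ w_k                                   # (n, d_k)
    v = x @ w_v                                   # (n, d_k)
    # Entry (i, j) scores how strongly token i attends to token j.
    scores = q @ k.T / np.sqrt(k.shape[-1])       # (n, n)
    # Row-wise softmax, shifted by the row max for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted mix of all value vectors.
    return weights @ v, weights

# Illustrative sizes only (assumptions, not taken from any real model).
rng = np.random.default_rng(0)
n, d_model, d_k = 6, 16, 8
x = rng.normal(size=(n, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(weights.shape)  # one row of attention weights per token
```

Each row of `weights` sums to 1, so every output vector is a convex combination of the value vectors. The `(n, n)` shape of `weights` is where the quadratic cost comes from.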

The "self" in self-attention means that every token attends to every other token in the same sequence, which is what lets transformers model context so effectively. The cost is quadratic: processing 2048 tokens ≈ 4M attention pairs; 32768 tokens ≈ 1B attention pairs, i.e. 256× more compute for 16× more tokens (16² = 256). This is the fundamental scaling wall for long contexts.
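The quadratic arithmetic is easy to check directly. The short sketch below counts token pairs for full (unmasked) attention at the two sequence lengths mentioned; the helper name `attention_pairs` is hypothetical and exists only for this illustration. A 16× increase in tokens works out to exactly 16² = 256× more pairs.

```python
def attention_pairs(n_tokens: int) -> int:
    """Number of (query, key) pairs in a full n x n attention matrix."""
    return n_tokens * n_tokens

short_ctx, long_ctx = 2048, 32768

pairs_short = attention_pairs(short_ctx)   # 4_194_304, about 4M
pairs_long = attention_pairs(long_ctx)     # 1_073_741_824, about 1B

token_ratio = long_ctx // short_ctx        # 16
pair_ratio = pairs_long // pairs_short     # 256

print(f"{short_ctx} tokens -> {pairs_short:,} pairs")
print(f"{long_ctx} tokens -> {pairs_long:,} pairs")
print(f"{token_ratio}x more tokens -> {pair_ratio}x more pairs")
```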

Self-attention is often the main bottleneck in inference, especially at long context. Monitor: (1) time_to_first_token (TTFT), which includes all the self-attention computation over the prompt, and (2) tokens_per_second (TPS) during generation, where each step computes attention only for the new token against all previous tokens, so per-token work grows with context length. A TTFT of around 5s for a 4K prompt can be normal, depending on hardware and model size. If TPS at 32K context drops from 50 to 5 tok/s, you're hitting the memory-bandwidth wall.
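One way to collect these two numbers is to time a streaming token iterator. The sketch below is an illustration under assumptions rather than any particular serving stack's API: `measure_stream` and `fake_token_stream` are hypothetical helpers, and the fake stream simply sleeps to imitate a slow prompt pass followed by faster per-token generation. In practice you would pass in whatever iterator your inference client yields tokens from.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple

def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (ttft_seconds, decode_tokens_per_second, token_count)."""
    start = clock()
    first = None
    last = start
    count = 0
    for _ in tokens:
        last = clock()
        if first is None:
            first = last          # arrival time of the first token
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    ttft = first - start
    decode_time = last - first    # time spent generating tokens 2..N
    tps = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return ttft, tps, count

def fake_token_stream(n_tokens: int, prefill_s: float, per_token_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a model: slow prompt pass, then steady decoding."""
    time.sleep(prefill_s)
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"

ttft, tps, count = measure_stream(fake_token_stream(5, prefill_s=0.05, per_token_s=0.01))
print(f"TTFT: {ttft:.3f}s, decode TPS: {tps:.1f}, tokens: {count}")
```

TPS here is computed over the decode phase only (tokens after the first), so a long prompt pass does not distort the generation rate. Tracking both values at several context lengths makes a drop like the 50 to 5 tok/s case easy to spot.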

Reviewed by Fredoline Eruo.

When it doesn't work

Practical example

Workflow example