Attention Mechanism
The attention mechanism is a neural network component that lets a model weigh the importance of different parts of the input when producing each output token. In transformers, it computes a weighted sum of values (learned projections of the token embeddings), with weights derived from learned query-key similarity scores. This allows the model to focus on relevant context, like a pronoun looking back at its noun. For operators, attention is the main computational bottleneck at long context: naive attention scales quadratically with sequence length (O(n²) memory and time), so longer contexts need more VRAM and slow down inference.
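To make the quadratic growth concrete, here is a minimal sketch that estimates the size of the full n x n score matrix a naive implementation would hold for one layer. The function name, the 32-head default, and the 2-byte (fp16) score size are illustrative assumptions, not values taken from any particular runtime.

```python
def score_matrix_bytes(seq_len: int, num_heads: int = 32, bytes_per_score: int = 2) -> int:
    """Bytes for one layer's full seq_len x seq_len score matrix across all heads.

    num_heads and bytes_per_score are assumed, illustrative defaults.
    """
    return num_heads * seq_len * seq_len * bytes_per_score


GIB = 2**30
for n in (2048, 8192, 32768):
    # Quadrupling the sequence length multiplies the result by 16.
    print(f"{n:>6} tokens -> {score_matrix_bytes(n) / GIB:.2f} GiB per layer")
```

Under these assumptions, 8K tokens comes to 4 GiB per layer and 32K tokens to 64 GiB, which is why runtimes avoid materializing the full matrix at long context.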
Deeper dive
Attention computes three matrices from each input by multiplying it with learned weight matrices: queries (Q), keys (K), and values (V). The attention output is softmax(QK^T / sqrt(d_k))V, where d_k is the key dimension: QK^T / sqrt(d_k) gives the scaled similarity scores, softmax turns each row of scores into weights that sum to 1, and multiplying by V takes the weighted sum. This produces a context-aware representation for each token. Multi-head attention runs this in parallel with multiple sets of Q/K/V, allowing the model to attend to different types of relationships (e.g., syntax vs. semantics). The quadratic complexity means the score matrix for a 32K context is ~16x larger than for 8K, since (32K / 8K)² = 16. Variants like FlashAttention compute the same result in tiles without storing the full n x n score matrix, which reduces VRAM usage. In local AI, attention directly impacts max context length and tokens/sec, especially on consumer GPUs with limited VRAM.
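The formula softmax(QK^T / sqrt(d_k))V can be sketched in plain Python for a single head. This is a readability-first illustration using nested lists, with toy Q, K, and V values chosen for the example; real implementations use batched tensor kernels.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Single-head scaled dot-product attention on lists of row vectors."""
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    out = []
    for q in Q:
        # One scaled similarity score per key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


# Toy inputs: two tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = attention(Q, K, V)
print(result)
```

Each output row is a blend of the rows of V, pulled toward the value whose key best matches the query. The inner loop over every key for every query is where the O(n²) cost comes from.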
Practical example
On an RTX 4090 (24 GB VRAM), running Llama 3.1 8B at Q4_K_M (~5 GB weights) with a 32K context leaves ~19 GB for the KV cache and attention buffers. FlashAttention v2 reduces attention memory from O(n²) to linear in sequence length, enabling 32K context at ~40 tok/s. Without FlashAttention, the same model might run out of memory (OOM) at 16K context. On an M2 Max (32 GB unified memory), the GPU shares memory with the CPU, so 32K context is feasible, but throughput drops from ~60 to ~30 tok/s as the context grows.
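A rough budget check for this scenario can be sketched by estimating the KV cache, which grows linearly with context. The model shape used here (32 layers, 8 KV heads, head dimension 128) is an assumption based on the published Llama 3.1 8B configuration, and the 2-byte (fp16) cache entries and the function name are likewise assumptions; runtime scratch buffers are not counted.

```python
def kv_cache_bytes(seq_len: int,
                   num_layers: int = 32,     # assumed Llama 3.1 8B shape
                   num_kv_heads: int = 8,    # assumed (grouped-query attention)
                   head_dim: int = 128,      # assumed
                   bytes_per_value: int = 2  # assumed fp16 cache
                   ) -> int:
    """Bytes needed to cache keys and values for seq_len tokens."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * seq_len


GIB = 2**30
for n in (8192, 16384, 32768):
    print(f"{n:>6} tokens -> {kv_cache_bytes(n) / GIB:.1f} GiB KV cache")
```

With these assumptions the cache costs 128 KiB per token, or 4 GiB at 32K context, which fits comfortably in the ~19 GB left after the weights. The memory pressure at long context therefore comes mostly from how the attention scores are computed, not from the cache itself.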
Workflow example
In llama.cpp, you control attention behavior via context length (-c 4096) and FlashAttention (--flash-attn). For example, ./llama-cli -m model.gguf -c 8192 --flash-attn sets an 8K context and enables the optimized attention kernel. In Ollama, set num_ctx in the Modelfile: PARAMETER num_ctx 16384. In vLLM, --max-model-len 32768 sets the maximum context, and the scheduler manages KV cache memory across requests. Operators monitor VRAM usage with nvidia-smi to see whether a given context length is pushing the GPU toward OOM.
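The nvidia-smi check can be scripted. The sketch below uses nvidia-smi's CSV query mode (--query-gpu with --format=csv,noheader,nounits) and parses the used and total memory in MiB. The helper names and the sample line are hypothetical, and gpu_memory_mib() only works on a machine with nvidia-smi on the PATH.

```python
import subprocess


def parse_memory_line(line: str) -> tuple[int, int]:
    """Parse one 'used, total' CSV line (MiB) from nvidia-smi."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def headroom_fraction(used: int, total: int) -> float:
    """Fraction of VRAM still free."""
    return (total - used) / total


def gpu_memory_mib() -> list[tuple[int, int]]:
    """Query every visible GPU; requires nvidia-smi to be installed."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_memory_line(l) for l in out.splitlines() if l.strip()]


# Hypothetical sample line, so the parser can be tried without a GPU.
used, total = parse_memory_line("5120, 24564")
print(f"{used} / {total} MiB used, {headroom_fraction(used, total):.0%} free")
```

Polling gpu_memory_mib() while raising the context length shows how quickly headroom shrinks before a run fails with OOM.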
Reviewed by Fredoline Eruo. See our editorial policy.