Transformer & LLM components

Multi-Head Attention

Multi-Head Attention is a mechanism in transformer models where the input is projected into multiple parallel "attention heads," each of which can learn different relationships between tokens. The head outputs are concatenated and passed through a final linear projection. This allows the model to attend to information from different representation subspaces at different positions, improving its ability to capture diverse patterns like syntax, semantics, and long-range dependencies. In practice, the number of heads (e.g., 32 in Llama 3.1 8B) is a key architectural parameter. At a fixed model width it sets the per-head dimension rather than the total size of the projection matrices, so its effect on compute cost is mostly indirect. The number of key/value heads has the more direct effect on memory use, because it determines the size of the KV cache.

Deeper dive

Multi-Head Attention extends single-head scaled dot-product attention by running the attention function multiple times in parallel with different learned linear projections. Each head operates on queries, keys, and values that are linearly transformed from the input, typically with dimension d_k = d_model / num_heads. The head outputs are concatenated and linearly projected back to d_model. This design enables the model to jointly attend to information from different representation subspaces at different positions. For example, one head might focus on subject-verb agreement while another captures positional relationships. The number of heads is a hyperparameter; common values range from 8 to 96 in large models. For local inference, head count alone does not determine speed: at a fixed d_model, the projection FLOPs are the same however the width is split into heads. What matters more for tokens per second is the number of key/value heads (num_key_value_heads in models that use grouped-query attention), which sets the KV-cache size and therefore the memory traffic per generated token. Operators will find these parameters in model config files (e.g., config.json under num_attention_heads). Whether head count affects sensitivity to quantization is model-dependent and best checked empirically rather than assumed.
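The projection, split, attend, concatenate, and project-again steps described above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular model's implementation: the sizes (seq_len=5, d_model=64, num_heads=8) and the random weights are arbitrary assumptions, and the causal mask, biases, and positional encoding that real decoders use are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); each weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_k = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_k)
        return t.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Scaled dot-product attention, computed for all heads at once.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    per_head = weights @ v                            # (heads, seq, d_k)

    # Concatenate the heads back to (seq_len, d_model), then project.
    concat = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

# Illustrative sizes only (assumed, not taken from any real model).
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 64, 8
x = rng.standard_normal((seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, weights.shape)
```

Note that the four projection matrices have the same shape whatever num_heads is; the head count only changes how the projected vectors are sliced before attention is applied.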

Practical example

Consider Llama 3.1 8B, which has 32 attention heads. Each head processes queries, keys, and values of dimension 128 (d_model = 4096, and 4096 / 32 = 128). On an RTX 4090, the heads are computed in parallel as batched matrix multiplications that map well onto the GPU's tensor cores. On an Apple M2 Max with 32 GB unified memory, the same model will usually generate fewer tokens per second, mainly because token generation is limited by memory bandwidth (the weights and the KV cache are read for every token) and the M2 Max has less bandwidth than the 4090. Changing the number of heads in a trained model (e.g., via model surgery) is not standard practice, but operators can compare models like Mistral 7B (32 heads) vs. Gemma 7B (16 heads) to see different performance profiles. These models differ in more than head count, so a performance gap between them cannot be attributed to the heads alone.
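The arithmetic behind those numbers is easy to check. The sketch below computes the per-head dimension and a rough KV-cache cost per token. The layer count (32) and key/value head count (8) are not stated in this article; they are assumed here from the model's published configuration, and the cache is assumed to be stored in 16-bit values. The helper name attention_shapes is made up for this example.

```python
def attention_shapes(d_model, num_heads, num_kv_heads, num_layers, bytes_per_value=2):
    """Return (head_dim, KV-cache bytes per token) for a decoder-only model."""
    head_dim = d_model // num_heads
    # Factor of 2: one key vector and one value vector are cached
    # per layer, per KV head, for every token in the context.
    kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return head_dim, kv_bytes_per_token

# Assumed values: 32 layers and 8 KV heads (grouped-query attention), fp16 cache.
head_dim, kv_per_token = attention_shapes(4096, 32, 8, 32)

# Same model width, but with one KV head per query head (no grouping).
_, kv_full_mha = attention_shapes(4096, 32, 32, 32)

print(head_dim)                                   # 128
print(kv_per_token // 1024, "KiB per token")      # 128 KiB per token
print(kv_full_mha // 1024, "KiB per token")       # 512 KiB per token
```

Under these assumptions, sharing key/value heads cuts the cache to a quarter of what 32 independent KV heads would need, which is why the KV head count matters more for memory bandwidth than the attention head count does.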

Workflow example

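Hugging Face model repositories store the head count in config.json as num_attention_heads. For example, in Mistral 7B's config, you'll see "num_attention_heads": 32. Ollama and LM Studio typically load GGUF files, which carry the same information as embedded metadata (for example, the llama.attention.head_count key) rather than in a separate config.json. During inference, each forward pass computes multi-head attention across all tokens in the context. When llama.cpp loads a model it prints the model's hyperparameters, including the head counts, and at the end of a run it reports overall prompt-processing and generation timings; it does not report usage per attention head. Attention's share of total inference time is not fixed and grows with context length. Operators tuning for speed may consider models with fewer heads (e.g., 16) if memory bandwidth is the bottleneck, but the number of key/value heads and the resulting KV-cache size are usually the more useful figures to compare.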
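To inspect these fields yourself, a short script along the following lines can help. The JSON below is a trimmed, illustrative excerpt in the style of a Hugging Face config.json, not a complete file; in practice you would read the real file from disk. The function name summarize_heads and the keys of its return value are made up for this sketch.

```python
import json

# Trimmed, illustrative excerpt; a real config.json contains many more fields.
config_text = """
{
  "hidden_size": 4096,
  "num_attention_heads": 32,
  "num_key_value_heads": 8,
  "num_hidden_layers": 32
}
"""

def summarize_heads(config):
    n_heads = config["num_attention_heads"]
    # Configs without num_key_value_heads use one KV head per attention head.
    n_kv = config.get("num_key_value_heads", n_heads)
    # Prefer an explicit head_dim when the config provides one, since it
    # does not always equal hidden_size / num_attention_heads.
    head_dim = config.get("head_dim", config["hidden_size"] // n_heads)
    return {
        "heads": n_heads,
        "kv_heads": n_kv,
        "head_dim": head_dim,
        "queries_per_kv": n_heads // n_kv,
    }

summary = summarize_heads(json.loads(config_text))
print(summary)
```

The fallbacks are a design choice for robustness: not every config lists a separate key/value head count, and some models set the per-head dimension explicitly instead of deriving it from the hidden size.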

Reviewed by Fredoline Eruo. See our editorial policy.
