RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.coIndependently operated
Transformer & LLM components

Multi-Head Attention

Multi-Head Attention is a mechanism in transformer models where the input is projected into multiple parallel 'attention heads,' each learning different relationships between tokens. The head outputs are concatenated and projected once more. This lets the model attend to information from different representation subspaces at different positions, improving its ability to capture diverse patterns such as syntax, semantics, and long-range dependencies. In practice, the number of heads (e.g., 32 in Llama 3.1 8B) is a key architectural parameter. At a fixed model width, adding heads adds little compute by itself, because each head becomes proportionally narrower; what moves inference cost more is how many key/value heads the model keeps, since that sets the size of the KV cache.

Deeper dive

Multi-Head Attention extends single-head scaled dot-product attention by running the attention function several times in parallel, each time with different learned linear projections. Each head operates on queries, keys, and values that are linearly transformed from the input, typically with per-head dimension d_k = d_model / num_heads. The head outputs are concatenated and linearly projected back to the original dimension. This design lets the model jointly attend to information from different representation subspaces at different positions. For example, one head might focus on subject-verb agreement while another captures positional relationships. The number of heads is a hyperparameter; common values range from roughly 8 to 128 in large models. In local AI, the head count alone says little about speed: with d_model fixed, total attention FLOPs stay about the same however the width is split. What does affect tokens per second is the number of key/value heads, because models using grouped-query attention share each key/value head across several query heads and so read a smaller KV cache per generated token. Operators will find these parameters in model config files (e.g., config.json under num_attention_heads, alongside num_key_value_heads in Llama- and Mistral-style configs). Whether head count affects sensitivity to quantization is model-dependent, so compare quantized outputs rather than assuming.
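The split, attend, concatenate, and project steps described above can be sketched in a few lines of NumPy. This is a minimal illustration, not how any particular runtime implements it: the function names, the random weights, and the toy sizes (5 tokens, d_model = 64, 8 heads) are assumptions chosen for readability, and the causal mask used by decoder-only LLMs is left out.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Unmasked multi-head self-attention over one sequence.

    x: (seq_len, d_model); every weight matrix: (d_model, d_model).
    Returns the output (seq_len, d_model) and the attention weights
    (num_heads, seq_len, seq_len).
    """
    seq_len, d_model = x.shape
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_k = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_k)
        return t.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    # One full-width projection per role, then split into heads.
    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Scaled dot-product attention, computed for all heads at once.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    per_head = weights @ v  # (num_heads, seq_len, d_k)

    # Concatenate the heads and apply the output projection.
    concat = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights


# Toy sizes, assumed for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 64, 8
x = rng.standard_normal((seq_len, d_model))
w_q, w_k, w_v, w_o = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)      # (5, 64)
print(weights.shape)  # (8, 5, 5)
```

Note that the projections are single d_model-wide matrix multiplications that are reshaped into heads afterwards, which is why changing num_heads at a fixed d_model leaves the projection cost unchanged.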

Practical example

Consider Llama 3.1 8B, which has 32 attention heads. Each head processes queries, keys, and values of dimension 128 (d_model = 4096, and 4096 / 32 = 128). The model uses grouped-query attention, so those 32 query heads share 8 key/value heads. On an RTX 4090, the heads are computed in parallel as batched matrix multiplications on the GPU's tensor cores. On an Apple M2 Max with 32 GB unified memory, the same model typically generates fewer tokens/sec, mainly because token generation is memory-bandwidth-bound and the M2 Max offers roughly 400 GB/s against about 1 TB/s on the 4090; the head count itself is not the cause. Changing the number of heads in an existing model (e.g., via model surgery) is not standard practice, but operators can compare models such as Mistral 7B (32 heads) and Gemma 7B (16 heads) to see different performance profiles.
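The head arithmetic above, and its effect on memory, can be worked through in a short Python sketch. The helper names are hypothetical. The layer count (32) and key/value head count (8, from grouped-query attention) are taken from the published Llama 3.1 8B configuration rather than from the text above, and the cache is assumed to be stored in 16-bit values.

```python
def head_dim(d_model: int, num_heads: int) -> int:
    """Per-head dimension when the model width is split evenly across heads."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    return d_model // num_heads


def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, dim: int, bytes_per_value: int = 2
) -> int:
    """Bytes of KV cache added per token: one key and one value vector
    per key/value head, per layer."""
    return 2 * num_layers * num_kv_heads * dim * bytes_per_value


# Llama 3.1 8B figures; 16-bit cache entries are an assumption.
dim = head_dim(4096, 32)
gqa = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, dim=dim)
mha = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, dim=dim)

print(dim)         # 128
print(gqa // 1024) # 128  (KiB per token with 8 key/value heads)
print(mha // 1024) # 512  (KiB per token if all 32 heads kept their own K/V)
```

Under these assumptions an 8,192-token context needs about 1 GiB of KV cache with 8 key/value heads, versus about 4 GiB if every query head had its own keys and values, which is the kind of difference that shows up as memory pressure and bandwidth use on local hardware.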

Workflow example

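Ollama and LM Studio (with its llama.cpp engine) load GGUF files, which carry the head count in embedded metadata (for Llama-family models, the keys llama.attention.head_count and llama.attention.head_count_kv) rather than in a separate config.json. In the original Hugging Face release, the same value appears in config.json as num_attention_heads; Mistral 7B's config, for example, contains "num_attention_heads": 32. During inference, each forward pass computes multi-head attention in every layer, over all tokens in the context. llama.cpp prints these hyperparameters (n_head and n_head_kv) in its model-load log. The share of inference time spent in attention is not fixed: it grows with context length, so measure on your own workload rather than relying on a single percentage. Operators tuning for speed when memory bandwidth is the bottleneck should look first at models with fewer key/value heads, which shrink the KV cache that is read for every generated token.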
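To check these values yourself, a Hugging Face style config.json can be read with the standard library. The sketch below is illustrative: read_attention_config is a hypothetical helper, and the sample file is a trimmed stand-in written to a temporary directory, using the hidden size, head count, and key/value head count published for Mistral 7B. It assumes the config uses the keys hidden_size, num_attention_heads, and optionally num_key_value_heads and head_dim; other architectures may name them differently.

```python
import json
import tempfile
from pathlib import Path


def read_attention_config(config_path):
    """Return the attention-related fields from a Hugging Face style config.json."""
    cfg = json.loads(Path(config_path).read_text())
    num_heads = cfg["num_attention_heads"]
    # Configs without grouped-query attention may omit this key;
    # in that case every query head has its own key/value head.
    num_kv_heads = cfg.get("num_key_value_heads", num_heads)
    hidden_size = cfg["hidden_size"]
    # Some configs set head_dim explicitly; otherwise derive it.
    dim = cfg.get("head_dim") or hidden_size // num_heads
    return {
        "num_attention_heads": num_heads,
        "num_key_value_heads": num_kv_heads,
        "hidden_size": hidden_size,
        "head_dim": dim,
    }


# Trimmed stand-in for a real config.json (Mistral 7B values).
sample = {"hidden_size": 4096, "num_attention_heads": 32, "num_key_value_heads": 8}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "config.json"
    path.write_text(json.dumps(sample))
    info = read_attention_config(path)

print(info)
```

For GGUF files the same information lives in the file's embedded metadata instead, so this helper applies only to the unconverted Hugging Face checkpoint.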

Reviewed by Fredoline Eruo. See our editorial policy.
