Multi-Query Attention (MQA)

Also known as: multi-query attention

Multi-Query Attention (MQA) is a transformer attention variant where all attention heads share a single key/value projection instead of having their own. This shrinks the KV cache by a factor equal to the number of query heads, which directly reduces VRAM pressure and increases the maximum context length a given card can hold. MQA was the precursor to Grouped-Query Attention (GQA): MQA shares one KV head across all queries, GQA shares one KV head across a group of queries.
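
A minimal NumPy sketch of the forward pass makes the sharing concrete (toy shapes, no causal mask or KV cache, all names illustrative): each query head gets its own projection, but every head scores against the same shared K and V.

```python
# MQA in miniature: N query heads, ONE shared key/value projection.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mqa_attention(x, wq, wk, wv, n_heads):
    # x: (seq_len, d_model); wq: (d_model, n_heads * head_dim)
    # wk, wv: (d_model, head_dim) -- a single K/V head, the MQA part
    seq_len, _ = x.shape
    head_dim = wk.shape[1]
    q = (x @ wq).reshape(seq_len, n_heads, head_dim)  # per-head queries
    k, v = x @ wk, x @ wv                             # shared by all heads
    scores = np.einsum("qhd,kd->hqk", q, k) / np.sqrt(head_dim)
    out = np.einsum("hqk,kd->qhd", softmax(scores), v)
    return out.reshape(seq_len, n_heads * head_dim)

rng = np.random.default_rng(0)
d_model, n_heads, head_dim, seq_len = 64, 8, 8, 16
out = mqa_attention(
    rng.standard_normal((seq_len, d_model)),
    rng.standard_normal((d_model, n_heads * head_dim)),
    rng.standard_normal((d_model, head_dim)),
    rng.standard_normal((d_model, head_dim)),
    n_heads,
)
print(out.shape)  # (16, 64) -- same output shape as standard MHA
```

The cache implication: a real decoder stores k and v for every generated token, and here that is one head's worth of tensors instead of n_heads' worth.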

Deeper dive

In standard multi-head attention, each of N heads has its own Q, K, and V projections, and the KV cache stores both K and V for every head, every layer, and every token in the context. MQA collapses K and V to a single shared head, so the per-layer cache shrinks from 2 × N × seq_len × head_dim elements to 2 × 1 × seq_len × head_dim. The quality cost is small for many tasks (early benchmarks showed minimal perplexity regression), but the inference benefit is significant: more context fits, batching scales better, and decode latency drops because there is less memory to move per token. GQA generalizes this by letting model designers trade quality against cache size with a head-grouping factor (e.g., 8 query heads share 1 KV head). Most modern open-weight models (Llama 3.x, Mistral, Qwen 2.x) use GQA; pure MQA remains in some smaller, older models.
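
A sketch of that arithmetic, straight from the per-layer formula (toy dimensions; the 8:1 grouping is just an example):

```python
# Per-layer KV cache size in elements: 2 (K and V) x kv_heads x seq_len x head_dim.
def kv_elems_per_layer(kv_heads, seq_len, head_dim):
    return 2 * kv_heads * seq_len * head_dim

n_heads, seq_len, head_dim = 32, 4096, 128
mha = kv_elems_per_layer(n_heads, seq_len, head_dim)       # kv_heads = N
mqa = kv_elems_per_layer(1, seq_len, head_dim)             # kv_heads = 1
gqa = kv_elems_per_layer(n_heads // 8, seq_len, head_dim)  # 8 query heads per KV head
print(mha // mqa, mha // gqa)  # 32 8 -- MQA shrinks the cache 32x, this GQA 8x
```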

Practical example

On a card with 16 GB of VRAM running an 8B model at FP16, standard multi-head attention (assuming Llama-style dimensions: 32 layers, 32 heads, head dim 128) consumes about 2 GB of KV cache at 4K context; the same model with MQA would need ~64 MB, a 32× reduction. The freed VRAM converts directly into larger usable context windows or higher batch concurrency. Quality loss on instruction-following or code tasks is usually within the noise floor of evaluation runs, but on summarization or long-context retrieval the regression can be measurable; check the model card before betting a production workload on the reduction.
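
The arithmetic behind those figures, for anyone who wants to rerun it (32 layers and a head dim of 128 are the assumed Llama-style values; FP16 means 2 bytes per element):

```python
# KV cache = 2 (K and V) x layers x heads x head_dim x context x bytes_per_elem
layers, heads, head_dim, ctx, fp16_bytes = 32, 32, 128, 4096, 2
mha_bytes = 2 * layers * heads * head_dim * ctx * fp16_bytes
print(mha_bytes / 2**30)          # 2.0 GiB with standard multi-head attention
print(mha_bytes / heads / 2**20)  # 64.0 MiB with MQA: divided by the 32 heads
```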

Workflow example

Most operators never configure MQA directly — it's a model-architecture choice baked into the weights. When you ollama pull a Llama 3 or Mistral model, you're getting GQA (the modern generalization). The relevance shows up in long-context behavior: if you switch from a non-MQA/GQA model to one that uses it, the same VRAM budget will tolerate a much longer context. Check the model's config.json for num_key_value_heads (KV heads) vs num_attention_heads (query heads) — if they're equal, it's standard MHA; if KV is 1, it's MQA; otherwise it's GQA with that ratio.
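
A small helper that automates that check against a Hugging Face-style config.json (the path is a placeholder; point it at any downloaded model directory):

```python
import json

def attention_variant(config_path):
    # num_attention_heads / num_key_value_heads are standard HF config fields;
    # older configs omit the KV field, which means full MHA.
    with open(config_path) as f:
        cfg = json.load(f)
    q = cfg["num_attention_heads"]
    kv = cfg.get("num_key_value_heads", q)
    if kv == q:
        return f"MHA ({q} heads)"
    if kv == 1:
        return f"MQA ({q} query heads, 1 KV head)"
    return f"GQA ({q} query heads, {kv} KV heads, {q // kv}:1)"

print(attention_variant("models/my-model/config.json"))  # hypothetical path
```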

Related terms

  • KV Cache
  • Grouped-Query Attention (GQA)
  • Multi-Head Attention

Reviewed by Fredoline Eruo. See our editorial policy.
