Multi-Query Attention (MQA)

Also known as: multi-query attention

Multi-Query Attention (MQA) is a transformer attention variant where all attention heads share a single key/value projection instead of having their own. This shrinks the KV cache by a factor equal to the number of query heads, which directly reduces VRAM pressure and increases the maximum context length a given card can hold. MQA was the precursor to Grouped-Query Attention (GQA): MQA shares one KV head across all queries, GQA shares one KV head across a group of queries.
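
A minimal NumPy sketch of the forward pass makes the sharing concrete (toy shapes, no causal mask or KV cache, all names illustrative): each query head gets its own projection, but every head scores against the same shared K and V.

```python
# MQA in miniature: N query heads, ONE shared key/value projection.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mqa_attention(x, wq, wk, wv, n_heads):
    # x: (seq_len, d_model); wq: (d_model, n_heads * head_dim)
    # wk, wv: (d_model, head_dim) -- a single K/V head, the MQA part
    seq_len, _ = x.shape
    head_dim = wk.shape[1]
    q = (x @ wq).reshape(seq_len, n_heads, head_dim)  # per-head queries
    k, v = x @ wk, x @ wv                             # shared by all heads
    scores = np.einsum("qhd,kd->hqk", q, k) / np.sqrt(head_dim)
    out = np.einsum("hqk,kd->qhd", softmax(scores), v)
    return out.reshape(seq_len, n_heads * head_dim)

rng = np.random.default_rng(0)
d_model, n_heads, head_dim, seq_len = 64, 8, 8, 16
out = mqa_attention(
    rng.standard_normal((seq_len, d_model)),
    rng.standard_normal((d_model, n_heads * head_dim)),
    rng.standard_normal((d_model, head_dim)),
    rng.standard_normal((d_model, head_dim)),
    n_heads,
)
print(out.shape)  # (16, 64) -- same output shape as standard MHA
```

The cache implication: a real decoder stores k and v for every generated token, and here that is one head's worth of tensors instead of n_heads' worth.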

Deeper dive

In standard multi-head attention, each of N heads has its own Q, K, and V projections, and the KV cache stores both K and V for every head, every layer, and every token in the context. MQA collapses K and V to a single shared head, so the per-layer cache shrinks from 2 × N × seq_len × head_dim elements to 2 × 1 × seq_len × head_dim. The quality cost is small for many tasks (early benchmarks showed minimal perplexity regression), but the inference benefit is significant: more context fits, batching scales better, and decode latency drops because there is less memory to move per token. GQA generalizes this by letting model designers trade quality against cache size with a head-grouping factor (e.g., 8 query heads share 1 KV head). Most modern open-weight models (Llama 3.x, Mistral, Qwen 2.x) use GQA; pure MQA remains in some smaller, older models.
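
A sketch of that arithmetic, straight from the per-layer formula (toy dimensions; the 8:1 grouping is just an example):

```python
# Per-layer KV cache size in elements: 2 (K and V) x kv_heads x seq_len x head_dim.
def kv_elems_per_layer(kv_heads, seq_len, head_dim):
    return 2 * kv_heads * seq_len * head_dim

n_heads, seq_len, head_dim = 32, 4096, 128
mha = kv_elems_per_layer(n_heads, seq_len, head_dim)       # kv_heads = N
mqa = kv_elems_per_layer(1, seq_len, head_dim)             # kv_heads = 1
gqa = kv_elems_per_layer(n_heads // 8, seq_len, head_dim)  # 8 query heads per KV head
print(mha // mqa, mha // gqa)  # 32 8 -- MQA shrinks the cache 32x, this GQA 8x
```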

Practical example

On a card with 16 GB of VRAM running an 8B model at FP16, standard multi-head attention (assuming Llama-style dimensions: 32 layers, 32 heads, head dim 128) consumes about 2 GB of KV cache at 4K context; the same model with MQA would need ~64 MB, a 32× reduction. The freed VRAM converts directly into larger usable context windows or higher batch concurrency. Quality loss on instruction-following or code tasks is usually within the noise floor of evaluation runs, but on summarization or long-context retrieval the regression can be measurable; check the model card before betting a production workload on the reduction.
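
The arithmetic behind those figures, for anyone who wants to rerun it (32 layers and a head dim of 128 are the assumed Llama-style values; FP16 means 2 bytes per element):

```python
# KV cache = 2 (K and V) x layers x heads x head_dim x context x bytes_per_elem
layers, heads, head_dim, ctx, fp16_bytes = 32, 32, 128, 4096, 2
mha_bytes = 2 * layers * heads * head_dim * ctx * fp16_bytes
print(mha_bytes / 2**30)          # 2.0 GiB with standard multi-head attention
print(mha_bytes / heads / 2**20)  # 64.0 MiB with MQA: divided by the 32 heads
```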

Workflow example

Most operators never configure MQA directly — it's a model-architecture choice baked into the weights. When you ollama pull a Llama 3 or Mistral model, you're getting GQA (the modern generalization). The relevance shows up in long-context behavior: if you switch from a non-MQA/GQA model to one that uses it, the same VRAM budget will tolerate a much longer context. Check the model's config.json for num_key_value_heads (KV heads) vs num_attention_heads (query heads) — if they're equal, it's standard MHA; if KV is 1, it's MQA; otherwise it's GQA with that ratio.
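
A small helper that automates that check against a Hugging Face-style config.json (the path is a placeholder; point it at any downloaded model directory):

```python
import json

def attention_variant(config_path):
    # num_attention_heads / num_key_value_heads are standard HF config fields;
    # older configs omit the KV field, which means full MHA.
    with open(config_path) as f:
        cfg = json.load(f)
    q = cfg["num_attention_heads"]
    kv = cfg.get("num_key_value_heads", q)
    if kv == q:
        return f"MHA ({q} heads)"
    if kv == 1:
        return f"MQA ({q} query heads, 1 KV head)"
    return f"GQA ({q} query heads, {kv} KV heads, {q // kv}:1)"

print(attention_variant("models/my-model/config.json"))  # hypothetical path
```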

Related terms

  • KV Cache
  • Grouped-Query Attention (GQA)
  • Multi-Head Attention

Reviewed by Fredoline Eruo. See our editorial policy.
