RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo

© 2026 runlocalai.co · Independently operated
Transformer & LLM components

Attention Mechanism

The attention mechanism is a neural network component that lets a model weigh the importance of different parts of the input when producing each output token. In transformers, it computes a weighted sum of value vectors (learned projections of the token embeddings), with weights derived from query-key similarity scores. This lets the model focus on relevant context, such as a pronoun looking back at its noun. For operators, attention is the main bottleneck at long context: naive attention scales quadratically with sequence length (O(n²) time and memory), so longer contexts need more VRAM and slow down inference.
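To make the quadratic growth concrete, here is a rough sketch of how large the score matrix becomes if a runtime stores it in full while processing a prompt. The function name, the 32-head count, and the 2-byte (fp16) element size are illustrative assumptions, not measurements from any specific runtime; optimized kernels avoid storing this matrix at all.

```python
def naive_score_matrix_bytes(seq_len: int, n_heads: int = 32, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold one layer's full n x n attention score matrix for every head.

    Assumed values: 32 heads and fp16 (2 bytes) are illustrative defaults.
    """
    return n_heads * seq_len * seq_len * bytes_per_elem


if __name__ == "__main__":
    for n in (8_192, 32_768):
        gib = naive_score_matrix_bytes(n) / 2**30
        print(f"{n:>6} tokens: {gib:.0f} GiB per layer if fully materialized")
```

Under these assumptions, quadrupling the context from 8K to 32K multiplies the score-matrix footprint by 16, which is why naive attention runs out of VRAM long before the weights do.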

Deeper dive

Attention projects each input token into three vectors, stacked into the matrices queries (Q), keys (K), and values (V). The attention output is softmax(QK^T / sqrt(d_k))V, where d_k is the key dimension; the softmax term holds the attention weights. This produces a context-aware representation for each token. Multi-head attention runs this in parallel with multiple sets of Q/K/V projections, allowing the model to attend to different types of relationships (e.g., syntax vs. semantics). Because the score matrix QK^T has n × n entries, a naive implementation needs ~16x more memory for it at 32K context than at 8K (4x the tokens, squared). Variants like FlashAttention compute the same result in tiles without ever storing the full score matrix, which cuts attention memory to linear in sequence length, though compute stays quadratic. In local AI, attention directly impacts max context length and tokens/sec, especially on consumer GPUs with limited VRAM.
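The formula above can be sketched in a few lines of plain Python. This is a minimal single-head illustration, assuming Q, K, and V are already computed and passed in as lists of vectors; it omits the causal mask, batching, and the learned projections that a real transformer layer adds, and the function names are our own.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Single-head scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # One similarity score per key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out


if __name__ == "__main__":
    Q = [[1.0, 0.0]]                # one query
    K = [[1.0, 0.0], [0.0, 1.0]]    # two keys; the first matches the query
    V = [[10.0, 0.0], [0.0, 10.0]]  # two values
    print(attention(Q, K, V))       # the output leans toward the first value
```

The nested loops make the cost visible: every query is scored against every key, which is where the n × n work comes from.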

Practical example

On an RTX 4090 (24 GB VRAM), running Llama 3.1 8B at Q4_K_M (~5 GB weights) leaves ~19 GB for the KV cache and attention buffers at a 32K context. FlashAttention-2 reduces attention memory from O(n²) to linear in sequence length, enabling 32K context at ~40 tok/s. Without FlashAttention, the same model might OOM at 16K context. On an M2 Max (32 GB unified memory), the GPU shares one memory pool with the CPU, so 32K context is feasible, but throughput drops from ~60 to ~30 tok/s as the context fills.
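As a rough cross-check on that budget, the sketch below estimates the key/value cache that attention keeps per token. It assumes Llama 3.1 8B's published shape (32 layers, 8 KV heads, head dimension 128) and an fp16 cache; the function name is hypothetical, and real runtimes add compute buffers and may quantize the cache, so treat the result as an estimate only.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Approximate KV-cache size: one K and one V vector per layer, per KV head, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * n_tokens


if __name__ == "__main__":
    vram_gb = 24.0     # RTX 4090, from the example above
    weights_gb = 5.0   # approximate Q4_K_M weights, from the example above
    cache_gb = kv_cache_bytes(32_768) / 1e9
    print(f"KV cache at 32K: {cache_gb:.1f} GB")
    print(f"Left for buffers: {vram_gb - weights_gb - cache_gb:.1f} GB")
```

With these assumptions the cache costs 128 KiB per token, or 4 GiB at 32K, so it fits in the ~19 GB left after the weights as long as the attention kernel does not also materialize the full score matrix.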

Workflow example

In llama.cpp, you control attention behavior via the context length (-c 4096) and FlashAttention (--flash-attn). Running ./llama-cli -m model.gguf -c 8192 --flash-attn enables the optimized attention kernel; flag syntax can differ between builds, so check ./llama-cli --help. In Ollama, set num_ctx in the Modelfile: PARAMETER num_ctx 16384. In vLLM, --max-model-len 32768 sets the maximum context, and the engine allocates KV-cache memory in blocks (PagedAttention) as requests arrive. Watch VRAM usage with nvidia-smi while raising the context length to see whether attention memory is what triggers an OOM.
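The monitoring step can be scripted. The sketch below shells out to nvidia-smi with its CSV query options and reports free VRAM per GPU in MiB; the helper names are our own, and it assumes an NVIDIA driver with nvidia-smi on the PATH.

```python
import subprocess

# nvidia-smi query flags; with "nounits" the values are plain MiB integers.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory(csv_text: str):
    """Turn lines like '20480, 24564' into (used_mib, total_mib) tuples, one per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def vram_headroom_mib(csv_text: str):
    """Free VRAM per GPU in MiB."""
    return [total - used for used, total in parse_memory(csv_text)]


def query_gpu_memory() -> str:
    """Run nvidia-smi and return its raw CSV output (requires an NVIDIA GPU)."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout


# Example on a machine with an NVIDIA GPU:
#   print(vram_headroom_mib(query_gpu_memory()))
```

Polling this while stepping the context length up (for example 8K, 16K, 32K) shows how quickly headroom shrinks before an OOM.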

Reviewed by Fredoline Eruo. See our editorial policy.
