RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Transformer & LLM components

Decoder

A decoder is the component of a transformer model that generates output tokens one at a time, using the tokens it has already produced and, in encoder-decoder models, the encoder's representation of the input. In local AI, decoders are the core of autoregressive language models like Llama, Mistral, and Qwen. Each generation step runs a forward pass through the decoder, which includes masked self-attention (to prevent looking ahead) and cross-attention (to attend to the encoder, if present). Decoder-only architectures (e.g., GPT, Llama) skip the encoder entirely and rely solely on causal self-attention. The decoder's parameter count sets the baseline VRAM requirement: a 70B-parameter decoder needs ~40 GB at 4-bit quantization, limiting which consumer GPUs can run it.
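The ~40 GB figure can be sanity-checked with a back-of-the-envelope estimate. This is a minimal sketch, not a measurement: the function name is hypothetical, and the 4.5 bits per weight is an assumed effective rate for a 4-bit quantization that also stores scales and metadata. Real GGUF files vary by quantization type.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# Pure 4-bit weights for a 70B-parameter decoder.
print(weight_memory_gb(70e9, 4.0))   # 35.0

# Assumed ~4.5 effective bits per weight (quantization scales included).
print(weight_memory_gb(70e9, 4.5))   # 39.375
```

The estimate covers weights only; the KV cache and runtime buffers come on top.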

Deeper dive

In the original transformer (Vaswani et al., 2017), each decoder layer had two attention sub-layers: masked self-attention over already-generated tokens, and cross-attention over encoder outputs. Most modern local-AI models are decoder-only (e.g., Llama, Mistral, Qwen), meaning they have no encoder. These models use causal (masked) self-attention in every layer, processing the entire prompt in parallel during prefill, then generating tokens autoregressively. The decoder's depth (number of layers) and width (hidden dimension) determine parameter count and compute cost. For operators, the decoder's KV cache is the main VRAM cost that grows during generation, on top of the fixed weights: each layer stores key and value tensors for all previous tokens, scaling linearly with sequence length and batch size. This is why longer contexts or larger batches quickly exhaust VRAM on consumer GPUs.
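Causal masking can be illustrated in a few lines. The following is a toy sketch in plain Python, assuming a single attention head and made-up scores; the helper names are hypothetical and not taken from any runtime. Position i may attend only to positions j <= i, and masked scores receive zero weight after the softmax.

```python
import math

def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, allowed):
    """Softmax over scores, with disallowed positions forced to zero weight."""
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    peak = max(masked)  # finite: the diagonal is always allowed
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
for row in mask:
    print("".join("x" if ok else "." for ok in row))

# Attention weights for position 1: it sees tokens 0 and 1, never 2 or 3.
weights = masked_softmax([0.5, 1.5, 9.0, 9.0], mask[1])
print(weights)
```

The printed mask is lower-triangular, which is why the whole prompt can be processed in parallel during prefill without any token seeing its successors.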

Practical example

When running Llama 3.1 8B (decoder-only) on an RTX 4090 (24 GB VRAM), the decoder's weights at Q4_K_M occupy about 5 GB. With a 32K-token context and an fp16 cache, the KV cache for 32 layers, 8 KV heads (grouped-query attention), and 128-dim heads uses 2 (K and V) * 32 * 8 * 128 * 32,768 tokens * 2 bytes ≈ 4.3 GB (4 GiB). Weights plus cache come to roughly 9 GB, which fits comfortably. But a 70B model at Q4 (~40 GB) exceeds 24 GB, forcing partial offload to system RAM and dropping throughput from roughly 30 tokens/sec to roughly 5.
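The KV cache arithmetic generalizes to other models and context lengths. Below is a minimal sketch, assuming an fp16 cache (2 bytes per element) and the Llama 3.1 8B figures of 32 layers, 8 KV heads, and 128-dim heads; the function name is hypothetical, and actual runtimes may add padding or use a quantized cache.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to store keys and values for every layer and token."""
    keys_and_values = 2  # one K tensor and one V tensor per layer
    return (keys_and_values * n_layers * n_kv_heads * head_dim
            * n_tokens * bytes_per_elem * batch_size)

# Assumed Llama 3.1 8B shape, fp16 cache.
for context in (4096, 32768):
    size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                          n_tokens=context)
    print(context, "tokens:", size / 2**30, "GiB")
```

Under these assumptions the cache costs 128 KiB per token, so a 4K context needs 0.5 GiB and a 32K context needs 4 GiB. Doubling the context or the batch size doubles the cache.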

Workflow example

In llama.cpp, the decoder's forward pass is exposed through llama_decode (llama_eval in older releases). When you run ./llama-cli -m model.gguf -p "Hello" -n 50 (./main in older builds), the runtime first prefills the prompt through the decoder's layers, then generates up to 50 tokens one by one. Each generation step runs a single forward pass through the decoder and appends to the KV cache. In Ollama, ollama run llama3.1:8b does the same under the hood. Operators can observe decoder behavior in the timing output: llama.cpp prints prompt processing and generation timings when it exits, and ollama run --verbose reports prompt eval rate and eval rate.
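The prefill-then-decode pattern can be sketched without any real model. This toy loop is an illustration only: toy_forward is a hypothetical stand-in for a decoder forward pass (it is not the llama.cpp API), the "cache" is a plain list instead of per-layer K/V tensors, and the next-token rule is arbitrary.

```python
def toy_forward(new_tokens, kv_cache):
    """Stand-in for one decoder pass: cache the new tokens, return a next token."""
    kv_cache.extend(new_tokens)  # a real runtime stores K/V tensors per layer
    return (sum(kv_cache) + len(kv_cache)) % 100  # arbitrary fake prediction

def generate(prompt_tokens, n_predict):
    """Prefill once with the whole prompt, then decode one token per pass."""
    kv_cache = []
    output = [toy_forward(prompt_tokens, kv_cache)]  # prefill: all prompt tokens
    passes = 1
    while len(output) < n_predict:
        # Decode step: only the newest token is fed; earlier ones are cached.
        output.append(toy_forward([output[-1]], kv_cache))
        passes += 1
    return output, len(kv_cache), passes

tokens, cached, passes = generate([1, 2, 3], n_predict=5)
print(tokens, cached, passes)
```

Generating 5 tokens from a 3-token prompt takes 5 forward passes here (1 prefill plus 4 decode steps), and the cache ends up holding 7 entries. The cache grows by one entry per generated token, which is the linear scaling described earlier.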

Reviewed by Fredoline Eruo. See our editorial policy.
