Decoder
A decoder is the component of a transformer model that generates output tokens one at a time, using the input's encoded representation and the tokens it has already produced. In local AI, decoders are the core of autoregressive language models like Llama, Mistral, and Qwen. Each generation step runs a forward pass through the decoder, which includes masked self-attention (to prevent looking ahead) and cross-attention (to attend to the encoder, if present). Decoder-only architectures (e.g., GPT, Llama) skip the encoder entirely and rely solely on causal self-attention. The decoder's size directly determines VRAM requirements: a 70B-parameter decoder needs ~40 GB at 4-bit quantization, limiting which consumer GPUs can run it.
Deeper dive
In the original transformer (Vaswani et al., 2017), the decoder had two attention layers: masked self-attention over already-generated tokens, and cross-attention over encoder outputs. Most modern local-AI models are decoder-only (e.g., Llama, Mistral, Qwen), meaning they have no encoder. These models use causal (masked) self-attention in every layer, processing the entire prompt in parallel during prefill, then generating tokens autoregressively. The decoder's depth (number of layers) and width (hidden dimension) determine parameter count and compute cost. For operators, the decoder's KV cache is the main VRAM consumer during generation: each layer stores key and value tensors for all previous tokens, scaling linearly with sequence length and batch size. This is why longer contexts or larger batches quickly exhaust VRAM on consumer GPUs.
Practical example
When running Llama 3.1 8B (decoder-only) on an RTX 4090 (24 GB VRAM), the decoder's weights at Q4_K_M occupy 5 GB. With a 32K-token context, the KV cache for 32 layers, 8 attention heads, and 128-dim head uses 32 * 2 * 4096 * 128 * 2 bytes ≈ 1 GB. This fits comfortably. But a 70B model at Q4 (40 GB) exceeds 24 GB, forcing offload to system RAM and dropping tokens/sec from ~30 to ~5.
Workflow example
In llama.cpp, the decoder is implemented in the llama_eval loop. When you run ./main -m model.gguf -p "Hello" -n 50, the runtime first prefills the prompt through the decoder's layers, then generates 50 tokens one by one. Each generation step runs a single forward pass through the decoder, updating the KV cache. In Ollama, ollama run llama3.1:8b does the same under the hood. Operators can observe decoder behavior via --verbose output showing prompt processing time and token generation rate.
Reviewed by Fredoline Eruo. See our editorial policy.