Transformer
The Transformer is a neural network architecture introduced in 2017 that replaced recurrent layers with a self-attention mechanism, enabling parallel processing of all tokens in a sequence. For local AI operators, this means models like Llama, Mistral, and Qwen are built on Transformer decoders: they process a prompt by computing attention across all of its tokens in parallel, then generate output one token at a time. This is why VRAM use grows with context length: the KV cache grows linearly with the number of tokens, and a naively materialized attention score matrix grows quadratically. The weight matrices in the attention and feed-forward layers hold nearly all of the parameters, so they are the primary targets for quantization (e.g., Q4_K_M) to fit models into consumer GPU memory.
Deeper dive
The original Transformer consists of an encoder stack and a decoder stack, but most local LLMs use only the decoder (GPT-style). The core innovation is the self-attention mechanism, which computes, for each token, a weighted sum of token representations; in a decoder, a causal mask restricts each token to itself and the tokens before it. This lets the model capture long-range dependencies without the sequential bottleneck of RNNs. Each layer has multi-head attention (multiple parallel attention computations) followed by a feed-forward network (two linear layers with a non-linearity). Layer normalization (or a variant such as RMSNorm) and residual connections stabilize training. For operators, the key practical detail is that attention compute grows quadratically with sequence length, and so does memory if the full score matrix is materialized: a 4096-token context has 16× as many attention scores as a 1024-token context. The KV cache, by contrast, grows linearly (4× for the same jump). This is why techniques like sliding window attention (used in Mistral 7B) and FlashAttention (fused GPU kernels that avoid storing the full score matrix) matter for long-context inference on consumer GPUs.
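The attention computation described above can be sketched in a few lines of plain Python. This is a minimal, single-head illustration with a causal mask, not how any real runtime implements it: the function names (softmax, causal_self_attention, score_count) are made up for this example, and the learned query/key/value projections are omitted by passing the same vectors for all three.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of n vectors (lists of floats). Token i attends
    only to tokens 0..i, as in a decoder-only model.
    Returns (outputs, weights), where weights[i] has i + 1 entries.
    """
    n = len(q)
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for i in range(n):
        # Similarity of query i with every key it is allowed to see.
        scores = [
            scale * sum(a * b for a, b in zip(q[i], k[j]))
            for j in range(i + 1)
        ]
        w = softmax(scores)
        # Weighted sum of the visible value vectors.
        out = [
            sum(w[j] * v[j][c] for j in range(i + 1))
            for c in range(len(v[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def score_count(n_tokens):
    """Entries in a full n x n attention score matrix, per head."""
    return n_tokens * n_tokens


# Three toy 2-dimensional token vectors (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
print(outputs[0])                               # first token sees only itself
print(score_count(4096) // score_count(1024))   # 16
```

The nested loops make the quadratic cost visible: every token computes a score against every earlier token, so quadrupling the sequence length multiplies the number of scores by about sixteen.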
Practical example
Llama 2 7B has 32 layers, each with 32 attention heads of dimension 128. When running inference on an RTX 4090 (24 GB VRAM), a 4096-token context requires ~2 GB for the KV cache alone in FP16: 2 tensors (key and value) × 32 layers × 32 heads × 128 dimensions × 2 bytes comes to 512 KiB per token. Models that use grouped-query attention, such as Llama 3 8B with 8 key/value heads, need proportionally less. Quantizing from FP16 to Q4_K_M reduces the model weights from ~14 GB to roughly 4 GB, freeing VRAM for larger batches or longer contexts.
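The arithmetic behind these figures is easy to reproduce. The sketch below is a back-of-the-envelope estimate only: the function names are hypothetical, the head dimension of 128 is assumed, the parameter count is the nominal 7e9, and the 4.85 bits per weight used for Q4_K_M is an approximate average rather than an exact property of the format. Real runtimes add overhead for activations and scratch buffers.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Size of the KV cache: one key and one value vector per layer,
    per key/value head, per token. bytes_per_value=2 means FP16."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


def weight_bytes(n_params, bits_per_weight):
    """Approximate size of the model weights at a given precision."""
    return n_params * bits_per_weight / 8


# 32 layers, 32 key/value heads, head_dim 128 (assumed), 4096 tokens, FP16.
full_heads = kv_cache_bytes(32, 32, 128, 4096)
# Same shape but with 8 key/value heads (grouped-query attention).
grouped = kv_cache_bytes(32, 8, 128, 4096)

fp16_weights = weight_bytes(7e9, 16)     # nominal 7B parameters at 16 bits
q4_weights = weight_bytes(7e9, 4.85)     # assumed average for Q4_K_M

print(f"KV cache, 32 KV heads: {full_heads / GIB:.2f} GiB")
print(f"KV cache,  8 KV heads: {grouped / GIB:.2f} GiB")
print(f"FP16 weights:   {fp16_weights / 1e9:.1f} GB")
print(f"Q4_K_M weights: {q4_weights / 1e9:.1f} GB (approximate)")
```

Changing n_tokens shows the linear growth of the cache directly: doubling the context doubles the result.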
Workflow example
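When you run llama-cli -m model.gguf -p "Explain transformers" -c 4096, llama.cpp loads the Transformer weights (into VRAM for the layers offloaded to the GPU, which the -ngl option controls), processes the prompt tokens in batches, and then generates one token at a time. llama.cpp typically reserves the KV cache for the full -c context size up front and fills it as tokens are processed and generated; you can monitor VRAM usage with nvidia-smi, or with ollama ps if you serve the model through Ollama. If the weights plus KV cache do not fit in VRAM, part of the work ends up in system RAM, either because fewer layers are offloaded to the GPU or because the driver spills into shared memory, and throughput can drop by an order of magnitude (e.g., from ~50 to ~5 tokens/sec).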
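To watch VRAM while a model is running, one option is to poll nvidia-smi from a small script. The sketch below assumes an NVIDIA GPU with nvidia-smi on the PATH; the helper names are hypothetical, and the parsing assumes the comma-separated MiB values that the query flags shown here produce.

```python
import subprocess

# Ask nvidia-smi for used and total memory in MiB, one line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_line(line):
    """Parse one output line such as '1234, 24564' into (used, total) MiB."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def headroom_mib(used, total):
    """VRAM still free on the GPU, in MiB."""
    return total - used


def query_gpu_memory():
    """Run nvidia-smi and return a list of (used, total) tuples, one per GPU.

    Raises FileNotFoundError if nvidia-smi is not installed.
    """
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return [
        parse_memory_line(line)
        for line in result.stdout.splitlines()
        if line.strip()
    ]
```

Calling query_gpu_memory() in a loop while llama-cli runs, and passing each tuple to headroom_mib, shows how close a given context size brings the card to its limit.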
Reviewed by Fredoline Eruo. See our editorial policy.