RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

© 2026 runlocalai.co · Independently operated
Neural network architectures

Transformer

The Transformer is a neural network architecture introduced in 2017 that replaced recurrent layers with a self-attention mechanism, enabling parallel processing of all tokens in a sequence. For local AI operators, this means models like Llama, Mistral, and Qwen are built on Transformer decoders: they process a prompt by computing attention across all of its tokens in parallel, then generate one token at a time while caching each token's keys and values. This is why VRAM use grows with context length (the KV cache grows linearly, and a naively materialized attention score matrix grows quadratically). The weight matrices in the attention and feed-forward layers hold nearly all of the parameters, so they are the primary targets for quantization (e.g., Q4_K_M) to fit models into consumer GPU memory.

Deeper dive

The original Transformer consists of an encoder stack and a decoder stack, but most local LLMs use only the decoder (GPT-style). The core innovation is the self-attention mechanism, which computes, for each token, a weighted sum of the representations of the tokens it attends to (in a decoder, itself and all earlier tokens), allowing the model to capture long-range dependencies without the sequential bottleneck of RNNs. Each layer has multi-head attention (multiple parallel attention computations) followed by a feed-forward network (two linear layers with a non-linearity). Layer normalization and residual connections stabilize training.

For operators, the key practical detail is that attention compute grows quadratically with sequence length, and so does memory if the full score matrix is materialized: a 4096-token context has 16× as many attention scores as a 1024-token context. The KV cache, by contrast, grows linearly (4× over the same range). This is why techniques like sliding window attention (used in Mistral 7B), which limits how far back each token attends, and FlashAttention, which uses fused GPU kernels that never store the full score matrix, are critical for long-context inference on consumer GPUs.
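The attention step described above can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not how any runtime implements it: the function names are hypothetical, and the queries, keys, and values are passed in directly, whereas a real model derives them from learned projection matrices.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of n vectors (one per token). Hypothetical helper
    for illustration; real models compute q, k, v with learned weights.
    """
    d = len(q[0])
    out = []
    for i in range(len(q)):
        # Decoder-style masking: token i only sees positions 0..i.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the visible value vectors.
        out.append([
            sum(w * v[j][c] for j, w in enumerate(weights))
            for c in range(len(v[0]))
        ])
    return out

def score_matrix_entries(n_tokens):
    # A fully materialized score matrix has one entry per token pair.
    return n_tokens * n_tokens

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = causal_self_attention(tokens, tokens, tokens)
ratio = score_matrix_entries(4096) // score_matrix_entries(1024)
```

The nested loop over token pairs is where the quadratic cost comes from; `ratio` reproduces the 16× figure for the score matrix when going from 1024 to 4096 tokens.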

Practical example

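A 7B-parameter Llama 2 model has 32 layers, each with 32 attention heads of dimension 128. When running inference on an RTX 4090 (24 GB VRAM), a 4096-token context requires ~2 GB for the FP16 KV cache alone (2 bytes per value × 128 values per head × 32 heads × 32 layers × 2 for keys and values × 4096 tokens). Llama 3 8B keeps the same layer and head counts but uses grouped-query attention with 8 key/value heads, so its cache for the same context is about 0.5 GB. Quantizing from FP16 to Q4_K_M reduces the 7B model weights from ~14 GB to roughly 4 GB, freeing VRAM for larger batches or longer contexts.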
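The KV cache arithmetic can be checked with a short calculation. This is a rough sketch under stated assumptions: the head dimension of 128 and the FP16 cache (2 bytes per value) are assumed rather than taken from a specific model file, and `kv_cache_bytes` is a hypothetical helper, not part of any runtime.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    # Factor of 2 because both a key and a value vector are cached
    # per token, per layer, per key/value head.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value

# 32 layers, 32 key/value heads, assumed head dimension 128, FP16 cache.
full_heads = kv_cache_bytes(32, 32, 128, 4096)

# Same shape but with grouped-query attention sharing 8 key/value heads.
grouped = kv_cache_bytes(32, 8, 128, 4096)

print(f"32 KV heads: {full_heads / GIB:.2f} GiB")
print(f" 8 KV heads: {grouped / GIB:.2f} GiB")
```

With 32 key/value heads the result is 2 GiB for 4096 tokens; a grouped-query layout with 8 key/value heads needs a quarter of that. The cost is linear in `n_tokens`, so doubling the context doubles the cache.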

Workflow example

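When you run llama-cli -m model.gguf -p "Explain transformers" -c 4096, llama.cpp loads the Transformer weights (into VRAM for any layers offloaded to the GPU, controlled by -ngl), then computes attention over the prompt tokens. llama.cpp reserves the KV cache for the full -c context up front and fills it as tokens are processed and generated. You can monitor VRAM usage with nvidia-smi, or with ollama ps if you serve the model through Ollama instead. If weights plus KV cache do not fit in VRAM, part of the model has to run from system RAM (Ollama splits layers between GPU and CPU automatically; with llama.cpp you lower -ngl), and throughput can drop by an order of magnitude, e.g., from ~50 to ~5 tokens/sec.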
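For scripted monitoring, nvidia-smi can emit machine-readable output. The sketch below assumes an NVIDIA GPU with nvidia-smi on the PATH; `parse_vram` and `read_vram` are hypothetical helper names, and the sample line is made-up data used only to show the parsing.

```python
import subprocess

# Query used and total memory in MiB, one CSV line per GPU, no header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

def parse_vram(csv_text):
    """Turn 'used, total' lines (MiB) into one dict per GPU."""
    gpus = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        gpus.append({"used_mib": used, "total_mib": total, "free_mib": total - used})
    return gpus

def read_vram():
    # Runs nvidia-smi; raises if the tool is missing or exits non-zero.
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_vram(result.stdout)

# Made-up sample line, not a measurement.
sample = parse_vram("18432, 24564\n")
```

Calling `read_vram()` before and after loading a model gives a rough picture of how much VRAM the weights and KV cache reserve together.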

Reviewed by Fredoline Eruo. See our editorial policy.
