RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts: there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend.

© 2026 runlocalai.co · Independently operated
Transformer & LLM components

Cross-Attention

Cross-attention is a mechanism in transformer models where the query vectors come from one sequence (e.g., the decoder's current output) and the key/value vectors come from a different sequence (e.g., the encoder's output). It allows the model to attend to relevant parts of the input when generating each output token. In local AI, cross-attention is used in encoder-decoder models like T5 and Whisper, and in some multimodal models (e.g., Llama 3.2 Vision) to align text with images or audio. Other multimodal models, such as LLaVA, instead project image features into the language model's token sequence and rely on self-attention. Cross-attention differs from self-attention, where all vectors come from the same sequence. It adds compute and memory usage because keys/values for the entire input sequence must be kept for the whole generation, which can strain VRAM on consumer GPUs when inputs are long.
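The definition above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular library implements it: it omits the learned query/key/value projections and multiple heads that real layers use, and the names cross_attention, decoder_queries, encoder_keys and encoder_values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention with queries from one sequence
    and keys/values from a different sequence."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # One score per key: how relevant each input position is to this query.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(wj * v[i] for wj, v in zip(w, values)) for i in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Decoder side: 2 positions. Encoder side: 3 positions.
decoder_queries = [[1.0, 0.0], [0.0, 1.0]]
encoder_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoder_values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

outputs, weights = cross_attention(decoder_queries, encoder_keys, encoder_values)
```

Each row of weights has one entry per encoder position and sums to 1. Passing the same sequence as queries, keys and values would turn the same function into self-attention.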

Deeper dive

Cross-attention is a core component of the encoder-decoder transformer described in the original 'Attention Is All You Need' paper for machine translation, where it is called encoder-decoder attention. In an encoder-decoder transformer, the encoder processes the input sequence (e.g., a sentence in French) and produces one hidden state per input token. Each cross-attention layer in the decoder projects those hidden states into key and value vectors and queries them while generating each output token (e.g., the English translation). This allows the decoder to dynamically focus on different parts of the input at each step. In practice, cross-attention has a cost: the keys/values for the entire input must be retained in memory during generation, although they are computed once and do not grow as output tokens are produced. For local AI, this means that models with cross-attention (e.g., T5, Whisper) carry this extra cache alongside the decoder's self-attention cache, so VRAM needs depend on input length as well as output length. Decoder-only architectures, such as GPT-style models and Llama, avoid cross-attention entirely by using only self-attention, so the input and output share a single cache, which simplifies memory management.
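The memory cost described above can be written as a rough formula. This is a sketch under assumptions that are not in the original: standard multi-head attention where keys and values each have width d_model per token, L_dec decoder layers, S_enc encoder tokens, t generated tokens, and b bytes per stored value.

```latex
M_{\text{cross}} = 2 \, L_{\text{dec}} \, S_{\text{enc}} \, d_{\text{model}} \, b
\qquad
M_{\text{self}}(t) = 2 \, L_{\text{dec}} \, t \, d_{\text{model}} \, b
```

The factor 2 counts keys and values separately. M_cross is fixed once the input is encoded, while M_self grows with every generated token. Models whose key/value width differs from d_model (for example with grouped-query attention) need the actual key/value width in place of d_model.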

Practical example

When running LLaVA 1.5 7B (a multimodal model) on an RTX 3090 (24 GB VRAM), handling a 336x336 image adds memory overhead on the order of ~2 GB; the exact figure depends on precision and runtime. The vision encoder produces 576 tokens (patch embeddings). LLaVA does not use dedicated cross-attention layers: a projector maps these embeddings into the language model's embedding space, and the language model attends to them with ordinary self-attention, so its cache must hold keys/values for all 576 image tokens. Either way, the image consumes context, which means the model can handle fewer text tokens compared to a pure text model of the same size.
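The 576-token figure, and the size of the keys/values cached for those tokens, can be checked with a little arithmetic. The numbers below are assumptions not stated in the original: a 14-pixel patch size (CLIP ViT-L/14), a 7B Llama-class language model with 32 layers and hidden size 4096, full multi-head attention, and 16-bit values. The function names are hypothetical.

```python
def image_token_count(image_size, patch_size):
    """Number of patch embeddings for a square image split into square patches."""
    side = image_size // patch_size
    return side * side


def kv_cache_bytes(num_layers, num_tokens, hidden_size, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens in every layer."""
    return 2 * num_layers * num_tokens * hidden_size * bytes_per_value


tokens = image_token_count(336, 14)
cache_bytes = kv_cache_bytes(num_layers=32, num_tokens=tokens, hidden_size=4096)

print(f"{tokens} image tokens, {cache_bytes / 2**20:.0f} MiB of cached keys/values")
# prints: 576 image tokens, 288 MiB of cached keys/values
```

This counts only the cached keys/values for the image tokens. Vision encoder weights, the projector, and activations are extra and are not estimated here.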

Workflow example

In Hugging Face Transformers, cross-attention is used when loading an encoder-decoder model like t5-base. When you call model.generate(input_ids), the encoder runs once and produces hidden states (a forward pass exposes them as encoder_last_hidden_state). The decoder's cross-attention layers project these encoder outputs into keys/values, cache them, and attend to them at every generation step. When you run an encoder-decoder model such as Whisper locally (it uses cross-attention between the audio encoder and the text decoder), VRAM usage includes this cross-attention cache in addition to the weights and the decoder's self-attention cache. You can monitor this in the task manager or nvidia-smi.
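The nvidia-smi check mentioned above can be scripted. The following is a minimal sketch, assuming an NVIDIA driver with nvidia-smi on the PATH and numeric output for every GPU; parse_memory_csv and read_gpu_memory are hypothetical helper names, and the sample line is illustrative rather than a real measurement.

```python
import subprocess

# Query flags supported by nvidia-smi: memory values are reported in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Parse lines like '5120, 24576' into (used_mib, total_mib) tuples, one per GPU."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def read_gpu_memory():
    """Run nvidia-smi and return (used_mib, total_mib) for each GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)


# Illustrative input only: one GPU with 5120 MiB used out of 24576 MiB.
sample = "5120, 24576\n"
parsed = parse_memory_csv(sample)
```

Calling read_gpu_memory() before and after loading a model gives a rough view of how much VRAM the model and its caches occupy.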

Reviewed by Fredoline Eruo. See our editorial policy.
