Cross-Attention — AI glossary

Cross-attention is a mechanism in transformer models where the query vectors come from one sequence (e.g., the decoder's current output) and the key/value vectors come from a different sequence (e.g., the encoder's output). It allows the model to attend to relevant parts of the input when generating each output token. In local AI, cross-attention is used in encoder-decoder models like T5 and Whisper, and in some multimodal models (e.g., Llama 3.2 Vision) to align text with images or audio. Not every multimodal model uses it: LLaVA projects image features into the language model's token sequence and relies on ordinary self-attention. Cross-attention differs from self-attention, where all vectors come from the same sequence. It adds compute and memory because the keys and values for the entire input sequence must stay cached for the whole generation, which can strain VRAM on consumer GPUs when the input is long.
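Written out in the standard scaled dot-product form, the definition looks as follows. The symbols are chosen here for illustration: X_dec is the decoder-side sequence, X_enc the encoder output, W_Q, W_K, W_V are learned projection matrices, and d_k is the key dimension.

```latex
Q = X_{\text{dec}} W_Q, \qquad K = X_{\text{enc}} W_K, \qquad V = X_{\text{enc}} W_V

\operatorname{CrossAttn}(X_{\text{dec}}, X_{\text{enc}})
  = \operatorname{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V
```

Self-attention is the same formula with a single sequence supplying Q, K, and V.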

Deeper dive

Cross-attention is a core component of the encoder-decoder transformer architecture introduced in the original 'Attention Is All You Need' paper for machine translation. In an encoder-decoder transformer, the encoder processes the input sequence (e.g., a sentence in French) and produces one hidden state per input token. Each decoder layer projects those hidden states into its own keys and values, and its cross-attention sublayer queries them while generating each output token (e.g., the English translation). This allows the decoder to dynamically focus on different parts of the input at each step. Because the encoder output does not change during generation, the cross-attention keys and values are computed once and reused, but they must stay in memory until generation finishes, alongside the decoder's own self-attention cache. For local AI, this means that models with cross-attention (e.g., T5, Whisper) carry a memory cost proportional to the input length on top of the decoder's self-attention cache, whereas a decoder-only model (e.g., Llama) keeps a single self-attention cache that covers prompt and output together. GPT-style models avoid cross-attention entirely by using only self-attention in a decoder-only stack, which simplifies memory management.
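The query/key/value flow described above can be sketched in a few lines of dependency-free Python. This is a minimal single-head illustration, not how any particular library implements it: the function names, the toy inputs, and the identity projection matrices are all assumptions made for the example.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_states, w_q, w_k, w_v):
    """Single-head cross-attention: queries from the decoder, keys/values from the encoder."""
    q = matmul(decoder_states, w_q)   # (tgt_len, d_k)
    k = matmul(encoder_states, w_k)   # (src_len, d_k)
    v = matmul(encoder_states, w_v)   # (src_len, d_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]          # transpose K
    scores = matmul(q, k_t)                       # (tgt_len, src_len)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy data: 3 encoder positions, 2 decoder positions, model width 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_states = [[1.0, 0.0], [0.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projection weights

output, weights = cross_attention(decoder_states, encoder_states, identity, identity, identity)
print(len(weights), len(weights[0]))  # one row of weights per decoder position
```

Each row of `weights` is a probability distribution over the encoder positions, which is what lets a decoder position focus on different parts of the input.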

Practical example

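LLaVA 1.5 7B (a multimodal model) running on an RTX 3090 (24 GB VRAM) is a useful contrast case. For a 336x336 image, its vision encoder produces 576 tokens (patch embeddings). LLaVA does not route these through cross-attention layers: a projector maps them into the language model's embedding space and they are inserted into the input sequence, where ordinary self-attention caches keys/values for all 576 tokens. The added memory comes from the vision encoder weights plus that cache, and each image consumes 576 positions of the context window. This means the model can handle fewer text tokens in context compared to a pure text model of the same size. A cross-attention model such as Llama 3.2 Vision instead keeps the image keys/values in dedicated cross-attention layers, so the patch embeddings do not occupy positions in the text sequence.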
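A back-of-the-envelope estimate of the key/value cache for the 576 image tokens alone can be computed as below. The figures are assumptions typical of a 7B Llama-family language model (32 layers, hidden size 4096, 16-bit cache, no grouped-query attention), and the helper name `kv_cache_bytes` is hypothetical. The estimate covers only cached keys and values, not the vision encoder weights or activations.

```python
def kv_cache_bytes(n_tokens, n_layers, hidden_size, bytes_per_value=2):
    """Bytes needed to cache keys and values for n_tokens (assumes no grouped-query attention)."""
    per_token = 2 * n_layers * hidden_size * bytes_per_value  # 2 = one key + one value per layer
    return n_tokens * per_token

image_tokens = 576  # 24 x 24 patches, assuming 14-pixel patches on a 336x336 image
image_cache = kv_cache_bytes(image_tokens, n_layers=32, hidden_size=4096)  # assumed 7B shape
print(f"{image_cache / 2**20:.0f} MiB")  # prints "288 MiB" under these assumptions
```

The same function applies to text tokens, so it also shows how much cache each additional token of context costs under the same assumptions.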

Workflow example

In Hugging Face Transformers, cross-attention is used when you load an encoder-decoder model like t5-base. When you call model.generate(input_ids), the encoder runs once and its final hidden states are handed to the decoder (the model's forward output exposes them as encoder_last_hidden_state). Each decoder layer's cross-attention projects those hidden states into keys/values, which are cached in past_key_values and reused at every decoding step. When you run a model like Whisper locally (it uses cross-attention between its audio encoder and text decoder), the VRAM figure you see includes the encoder weights and the cross-attention cache for the encoded audio in addition to the decoder. You can monitor this in the task manager or nvidia-smi.
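The encode-once, decode-many pattern behind generation can be illustrated with a toy loop. This is a sketch of the caching idea only, not the Transformers implementation: `ToyCrossAttention`, `toy_generate`, and the arithmetic stand-ins for the learned projections are hypothetical.

```python
import math

def attend(query, keys, values):
    """Dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[i] for e, v in zip(exps, values)) for i in range(len(values[0]))]

class ToyCrossAttention:
    """Projects the encoder states to keys/values on first use, then reuses the cache."""

    def __init__(self):
        self.cached_kv = None
        self.projection_count = 0

    def __call__(self, query, encoder_states):
        if self.cached_kv is None:
            # Stand-ins for the learned key and value projections.
            keys = [[2.0 * x for x in row] for row in encoder_states]
            values = [[x + 1.0 for x in row] for row in encoder_states]
            self.cached_kv = (keys, values)
            self.projection_count += 1
        keys, values = self.cached_kv
        return attend(query, keys, values)

def toy_generate(encoder_states, steps):
    """Run `steps` decoding steps against one fixed set of encoder states."""
    layer = ToyCrossAttention()
    query = [1.0, 0.0]  # stand-in for the decoder state at the first step
    outputs = []
    for _ in range(steps):
        query = layer(query, encoder_states)  # next query depends on the previous output
        outputs.append(query)
    return outputs, layer.projection_count

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # produced once by the "encoder"
outputs, projection_count = toy_generate(encoder_states, steps=4)
print(len(outputs), projection_count)  # 4 decoding steps, 1 key/value projection
```

The cache is what trades memory for compute: the keys and values occupy memory for the whole generation, but the encoder-side projection is never repeated.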