Transformer & LLM components

Flash Attention

Flash Attention is a memory-efficient implementation of the attention mechanism that reduces memory usage from O(n²) to O(n) for sequence length n, while also being faster on modern GPUs.
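In practice you rarely call a Flash Attention kernel by hand; frameworks dispatch to one when they can. A minimal sketch, assuming a CUDA build of PyTorch 2.x: `torch.nn.functional.scaled_dot_product_attention` selects a fused FlashAttention-style kernel for fp16/bf16 tensors on supported GPUs.

```python
import torch
import torch.nn.functional as F

# Toy shapes: batch 1, 8 heads, 4096 tokens, head dim 64 (requires a CUDA GPU).
q = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)

# On supported GPUs PyTorch dispatches this to a fused, memory-efficient
# attention kernel; the full 4096 x 4096 score matrix is never written to HBM.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```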

The key insight: standard attention materializes the full N×N attention matrix in HBM (the slow GPU memory). Flash Attention tiles the computation, keeps intermediate results in fast SRAM, and never writes the full matrix to HBM. Same math, much less memory bandwidth pressure.
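The trick that makes tiling possible is an online softmax: keys and values are processed one tile at a time while a running maximum and running denominator let each tile's contribution be folded in, so the full score matrix never exists. Below is a single-head NumPy sketch of that recurrence; the function name, `block_size`, and the lack of masking are illustrative choices, not details from the paper.

```python
import numpy as np

def flash_attention_reference(q, k, v, block_size=128):
    """Tiled attention with an online softmax (NumPy reference, one head).

    q, k, v: (n, d) arrays. Only O(n * block_size) scores exist at any
    time; the n x n attention matrix is never materialized.
    """
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)

    out = np.zeros((n, d))        # running weighted sum of value rows
    m = np.full(n, -np.inf)       # running row-wise max of scores
    l = np.zeros(n)               # running softmax denominator

    for start in range(0, n, block_size):
        kb = k[start:start + block_size]      # (b, d) key tile
        vb = v[start:start + block_size]      # (b, d) value tile

        s = (q @ kb.T) * scale                # scores against this tile only
        m_new = np.maximum(m, s.max(axis=1))  # updated running max
        correction = np.exp(m - m_new)        # rescale previous accumulators
        p = np.exp(s - m_new[:, None])        # tile's unnormalized weights

        l = l * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ vb
        m = m_new

    return out / l[:, None]
```

The real kernel runs the same recurrence over tiles of both queries and keys inside SRAM, fusing the rescaling with the matrix multiplies; the sketch above only tiles over keys and is meant to show the math, not the performance.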

For local inference this matters most at long context. Without Flash Attention, a 32K-context generation on a 7B model can run a 24GB card out of memory; with Flash Attention, it fits with room to spare. llama.cpp added support in 2024; vLLM and ExLlamaV2 use it by default.
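A back-of-the-envelope calculation shows where the naive approach hurts (assuming fp16 scores and a Llama-style 7B model with 32 attention heads; real runtimes chunk the prompt, so actual peaks vary):

```python
# Rough size of a fully materialized fp16 attention-score matrix at 32K context.
# Assumes 32 heads per layer (Llama-7B-style); illustrative, not measured.
n, heads, bytes_fp16 = 32_768, 32, 2
per_head_gib = n * n * bytes_fp16 / 2**30
print(f"per head:  {per_head_gib:.1f} GiB")          # ~2.0 GiB
print(f"per layer: {per_head_gib * heads:.0f} GiB")  # ~64 GiB across 32 heads
```

With Flash Attention that term disappears, leaving the KV cache as the dominant long-context memory cost.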

