Token generation slows as conversation gets longer
Cause
Generation tokens-per-second is bandwidth-bound: every new token requires reading the entire KV cache from VRAM. As context grows, the cache grows, and per-token reads take longer.
This is expected, not a bug — but the slowdown is steeper than people expect. A model that generates at 50 tok/s at 1K context may drop to 25 tok/s at 16K context and 10 tok/s at 64K. Quadratic-ish in some attention implementations, linear with Flash Attention.
Solution
Verify Flash Attention is enabled (linear instead of quadratic context cost):
# llama.cpp
./main -m model.gguf --flash-attn
# Ollama (newer versions enable by default; verify with):
ollama show llama3.1:8b --modelfile | grep flash
Quantize the KV cache (FP8 or INT4 KV halves or quarters memory bandwidth at minor quality cost):
# llama.cpp
./main -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0
Use a smaller context if you don't actually need 64K:
ollama run llama3.1:8b-32k # custom modelfile with num_ctx 32768
Consider context-summary patterns for chat — instead of feeding raw 50K tokens of history, summarize old turns. Trades fidelity for speed.
Faster card or more VRAM bandwidth is the hardware fix. RTX 5090 has 1.79 TB/s vs 5080's 960 GB/s — measurable speed advantage on long contexts.
Related errors
Did this fix it?
If your case was different, email Contact support with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.