Token generation slows as the conversation gets longer
Cause
Generation tokens-per-second is bandwidth-bound: every new token requires reading the entire KV cache from VRAM. As context grows, the cache grows, and per-token reads take longer.
This is expected behavior, not a bug, but the slowdown is steeper than most people expect. A model that generates at 50 tok/s at 1K context may drop to 25 tok/s at 16K and 10 tok/s at 64K. How quickly the cost grows depends on the attention implementation: roughly quadratic with naive attention, closer to linear with Flash Attention.
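As a rough back-of-envelope illustration, assuming Llama 3.1 8B's shape (32 layers, 8 KV heads, 128-dim heads) and an FP16 cache: each token adds about 128 KB of KV cache, so a 64K-token conversation carries roughly 8 GB of cache. Every new token has to stream all of that, plus the model weights, from VRAM, and at 1 TB/s the cache alone costs about 8 ms per token.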
Solution
Verify Flash Attention is enabled (linear instead of quadratic context cost):
# llama.cpp
./main -m model.gguf --flash-attn
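# (newer llama.cpp builds ship this binary as llama-cli)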
# Ollama (newer versions may enable it by default; it's a server setting rather
# than a modelfile parameter, so set it explicitly when starting the server):
OLLAMA_FLASH_ATTENTION=1 ollama serve
Quantize the KV cache. An 8-bit or 4-bit cache halves or quarters the amount of cache data read per token, at a minor quality cost:
# llama.cpp
./main -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0
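Note that llama.cpp requires flash attention to be enabled for a quantized V cache. If you serve the model through Ollama instead, recent releases expose an equivalent server setting; a sketch, assuming a version new enough to support the OLLAMA_KV_CACHE_TYPE variable:
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve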
Use a smaller context if you don't actually need 64K:
ollama run llama3.1:8b-32k # custom modelfile with num_ctx 32768
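The -32k tag above is not something you can pull; you build it yourself from a short Modelfile. A minimal sketch, assuming the file is saved as Modelfile and you keep the llama3.1:8b-32k name:
# Modelfile
FROM llama3.1:8b
PARAMETER num_ctx 32768
# then build and run the custom tag
ollama create llama3.1:8b-32k -f Modelfile
ollama run llama3.1:8b-32k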
Consider context-summary patterns for chat: instead of feeding the model 50K raw tokens of history, summarize older turns and pass the summary along with the recent turns. This trades fidelity for speed.
The hardware fix is a card with more memory bandwidth. An RTX 5090's 1.79 TB/s versus the 5080's 960 GB/s translates to a measurable speed advantage on long contexts.
Did this fix it?
If your case was different, email hello@runlocalai.co with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.