Token generation slows as conversation gets longer

Q: How do you fix "Token generation slows as conversation gets longer"?

**Verify Flash Attention is enabled** (linear instead of quadratic context cost): ```bash # llama.cpp ./main -m model.gguf --flash-attn # Ollama (newer versions enable by default; verify with): ollama show llama3.1:8b --modelfile | grep flash ``` **Quantize the KV cache** (FP8 or INT4 KV halves or quarters memory bandwidth at minor quality cost): ```bash # llama.cpp ./main -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0 ``` **Use a smaller context** if you don't actually need 64K: ```bash ollama run llama3.1:8b-32k # custom modelfile with num_ctx 32768 ``` **Consider context-summary patterns** for chat — instead of feeding raw 50K tokens of history, summarize old turns. Trades fidelity for speed. **Faster card or more VRAM bandwidth** is the hardware fix. RTX 5090 has 1.79 TB/s vs 5080's 960 GB/s — measurable speed advantage on long contexts.

(no error — tok/s drops from 50 to 5 as context fills)

By Fredoline Eruo · Last verified May 6, 2026

Cause

Generation tokens-per-second is bandwidth-bound: every new token requires reading the entire KV cache from VRAM. As context grows, the cache grows, and per-token reads take longer.

This is expected, not a bug — but the slowdown is steeper than people expect. A model that generates at 50 tok/s at 1K context may drop to 25 tok/s at 16K context and 10 tok/s at 64K. Quadratic-ish in some attention implementations, linear with Flash Attention.

Solution

Verify Flash Attention is enabled (linear instead of quadratic context cost):

# llama.cpp
./main -m model.gguf --flash-attn

# Ollama (newer versions enable by default; verify with):
ollama show llama3.1:8b --modelfile | grep flash

Quantize the KV cache (FP8 or INT4 KV halves or quarters memory bandwidth at minor quality cost):

# llama.cpp
./main -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0

Use a smaller context if you don't actually need 64K:

ollama run llama3.1:8b-32k  # custom modelfile with num_ctx 32768

Consider context-summary patterns for chat — instead of feeding raw 50K tokens of history, summarize old turns. Trades fidelity for speed.

Faster card or more VRAM bandwidth is the hardware fix. RTX 5090 has 1.79 TB/s vs 5080's 960 GB/s — measurable speed advantage on long contexts.

Related errors

Did this fix it?

If your case was different, email hello@runlocalai.co with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.