RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
Configuration
Verified by owner

Token generation slows as conversation gets longer

(no error — tok/s drops from 50 to 5 as context fills)
By Fredoline Eruo · Last verified Jun 12, 2026

Cause

Token generation speed is bound by memory bandwidth: producing each new token requires reading the model weights plus the entire KV cache from VRAM. As context grows, the cache grows, and the per-token reads take longer.

This is expected, not a bug, but the slowdown is steeper than people expect. A model that generates at 50 tok/s at 1K context may drop to 25 tok/s at 16K context and 10 tok/s at 64K. With Flash Attention, per-token cost grows roughly linearly with context length; implementations without it add extra memory traffic for intermediate attention scores and tend to degrade faster.
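
To get a feel for how fast the cache grows, here is a rough sizing sketch. It assumes the usual layout of one K and one V vector per layer, per KV head, per token, and it assumes a Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) with an FP16 cache. The function name `kv_cache_bytes` is made up for illustration.

```shell
# Estimate KV cache size in bytes.
# Formula: 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element * tokens
kv_cache_bytes() {
  # $1=layers  $2=kv_heads  $3=head_dim  $4=bytes_per_element  $5=context_tokens
  echo $(( 2 * $1 * $2 * $3 * $4 * $5 ))
}

# Assumed model shape: Llama 3.1 8B (32 layers, 8 KV heads, head dim 128), FP16 = 2 bytes
for ctx in 1024 16384 65536; do
  bytes=$(kv_cache_bytes 32 8 128 2 "$ctx")
  echo "ctx=$ctx  kv_cache=$(( bytes / 1048576 )) MiB"
done
```

Under these assumptions the cache goes from 128 MiB at 1K context to 8192 MiB at 64K, and all of it is read for every generated token.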

Solution

Verify Flash Attention is enabled (it avoids materializing the full attention matrix, so attention memory grows linearly rather than quadratically with context):

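# llama.cpp (the CLI binary is llama-cli in current builds; older builds call it main)
# Some newer builds expect an explicit value (--flash-attn on); check ./llama-cli --help
./llama-cli -m model.gguf --flash-attn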

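# Ollama: Flash Attention is a server setting, not a Modelfile parameter, so it will not show up in the modelfile.
# Newer versions may enable it by default; to force it on, set the environment variable when starting the server:
OLLAMA_FLASH_ATTENTION=1 ollama serve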

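Quantize the KV cache (relative to the default FP16 cache, 8-bit q8_0 roughly halves and 4-bit q4_0 roughly quarters the memory read per token, at a minor quality cost):

# llama.cpp (llama-cli, or main in older builds; a quantized V cache needs Flash Attention enabled, as in the previous step)
./llama-cli -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0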
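
The "halves or quarters" figure can be checked with a little arithmetic. This sketch assumes llama.cpp's block formats store 32 values per block, with q8_0 taking 34 bytes per block and q4_0 taking 18 bytes per block (the extra 2 bytes are the per-block scale). The function name `cache_type_bytes` and the model shape are illustrative assumptions.

```shell
# Bytes needed to store N cache values under each cache type.
# Assumed sizes: f16 = 2 bytes/value; q8_0 = 34 bytes per 32 values; q4_0 = 18 bytes per 32 values.
cache_type_bytes() {
  # $1=cache type  $2=number of values (assumed to be a multiple of 32)
  case "$1" in
    f16)  echo $(( $2 * 2 )) ;;
    q8_0) echo $(( $2 / 32 * 34 )) ;;
    q4_0) echo $(( $2 / 32 * 18 )) ;;
    *)    echo "unknown cache type: $1" >&2; return 1 ;;
  esac
}

# K+V values per token for an assumed Llama 3.1 8B shape: 2 * 32 layers * 8 KV heads * 128 dims
values_per_token=$(( 2 * 32 * 8 * 128 ))
ctx=32768
for t in f16 q8_0 q4_0; do
  bytes=$(cache_type_bytes "$t" $(( values_per_token * ctx )))
  echo "$t: $(( bytes / 1048576 )) MiB at ctx=$ctx"
done
```

Because of the per-block scale, the savings come out slightly under a clean 2x and 4x: about 53% and 28% of the FP16 size respectively.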

Use a smaller context if you don't actually need 64K:

ollama run llama3.1:8b-32k  # llama3.1:8b-32k is a custom model you create from a Modelfile that sets num_ctx 32768
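
The custom model in the command above has to exist before it can be run. A minimal sketch of creating it, assuming the base model `llama3.1:8b` is already pulled; the file name `Modelfile.32k` is arbitrary:

```shell
# Write a Modelfile that derives a 32K-context variant from the stock model.
cat > Modelfile.32k <<'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 32768
EOF

# Register the variant under the name used above.
# Guarded so the script does nothing here on machines without Ollama installed.
if command -v ollama >/dev/null 2>&1; then
  ollama create llama3.1:8b-32k -f Modelfile.32k
fi
```

Lowering num_ctx caps how large the KV cache can grow, which also caps how far generation speed can fall.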

For chat, consider summarizing context: instead of feeding 50K tokens of raw history, summarize older turns and keep only the recent ones verbatim. This trades fidelity for speed.
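
One way to structure this, sketched under simplifying assumptions: history is a text file with one turn per line, the last few turns are kept verbatim, and everything older is replaced by a summary. The `summarize` function here is a hypothetical placeholder that only counts turns; a real version would send its input to a model and print the returned summary.

```shell
# Turns older than the last KEEP lines: the candidates for summarization.
old_turns() {
  # $1=history file (one turn per line)  $2=number of recent turns to keep verbatim
  total=$(wc -l < "$1")
  drop=$(( total - $2 ))
  if [ "$drop" -gt 0 ]; then head -n "$drop" "$1"; fi
}

# The most recent KEEP turns, passed through unchanged.
recent_turns() { tail -n "$2" "$1"; }

# Hypothetical placeholder: reads old turns on stdin and reports how many it saw.
# Replace the body with a real model call that prints a summary.
summarize() { echo "[summary of $(wc -l | tr -d ' ') earlier turns]"; }

# Demo history: six turns, keep the last two verbatim.
hist=$(mktemp)
printf '%s\n' "user: q1" "assistant: a1" "user: q2" "assistant: a2" \
              "user: q3" "assistant: a3" > "$hist"

# New prompt = summary of old turns + recent turns.
{ old_turns "$hist" 2 | summarize; recent_turns "$hist" 2; } > "$hist.prompt"
cat "$hist.prompt"
```

The resulting prompt is one summary line followed by the two most recent turns, so the context the model has to read stays roughly constant as the conversation grows.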

The hardware fix is a card with more memory bandwidth. The RTX 5090 has 1.79 TB/s versus the RTX 5080's 960 GB/s, a measurable speed advantage on long contexts.
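
To see why bandwidth matters more as context grows, here is a back-of-envelope ceiling, assuming each generated token reads the weights and the KV cache exactly once and nothing else limits speed. The 5 GB weight size and the KV cache sizes (FP16, Llama 3.1 8B shape at 1K, 16K, and 64K context) are assumptions for illustration; the function name `max_tok_per_s` is made up.

```shell
# Theoretical upper bound on decode speed from memory bandwidth alone.
max_tok_per_s() {
  # $1=memory bandwidth in MB/s  $2=weights in MB  $3=KV cache in MB
  echo $(( $1 / ($2 + $3) ))
}

weights_mb=5000                 # assumption: roughly 5 GB for a quantized 8B model
for kv_mb in 134 2147 8590; do  # assumed FP16 KV cache at 1K, 16K, 64K context
  fast=$(max_tok_per_s 1790000 "$weights_mb" "$kv_mb")   # RTX 5090: 1.79 TB/s
  slow=$(max_tok_per_s 960000 "$weights_mb" "$kv_mb")    # RTX 5080: 960 GB/s
  echo "kv=${kv_mb}MB  5090 ceiling: ${fast} tok/s  5080 ceiling: ${slow} tok/s"
done
```

Real throughput lands well below these ceilings, but the ratio between the two cards and the drop as the cache grows follow the same pattern.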

Related errors

  • Ollama: Error: model 'X' not found
  • Ollama: bind: address already in use (port 11434)
  • Ollama: connection refused on localhost:11434
  • Ollama truncates input — default context length is only 2048
  • Slow tokens/sec on capable GPU (silent CPU fallback)

Did this fix it?

If your case was different, contact support with what you saw and we'll update the page. If it worked but needed different commands on your platform, we want to know that too.