Persistent KV cache vs RAG — which one should I use for 'chat with my docs'?
The answer
One paragraph. No hedging beyond what the data actually warrants.
Use persistent KV cache when your docs fit in the model's context. Use RAG when they don't.
Both solve "the model needs context it wasn't trained on." They solve it differently:
Persistent KV cache (prefix caching): The model processes your docs ONCE, the attention key+value tensors get cached in GPU memory, and every subsequent question re-uses that prefill. vLLM, llama.cpp, and SGLang all support this. The latency math:
- First request (cold): full prefill, with time scaling roughly linearly in input length (e.g., 32K context = 5-10s prefill on an RTX 4090)
- Every subsequent request (warm): ~50-100ms prefill — the cache hit
- Memory cost: KV cache size = 2 × num_layers × num_kv_heads × head_dim × tokens × bytes_per_value
For Llama 3.1 8B at 32K context: ~4 GB of cache at FP16. Alongside a 4-bit-quantized copy of the model, that fits comfortably on a 12-16GB card.
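As a sanity check on that formula, here is the arithmetic for Llama 3.1 8B, assuming its published shape (32 layers, 8 KV heads via grouped-query attention, head dim 128) and an FP16 cache:

```python
# KV cache size = 2 (K and V) x num_layers x num_kv_heads x head_dim x tokens x bytes_per_value
num_layers = 32        # Llama 3.1 8B
num_kv_heads = 8       # grouped-query attention: 8 KV heads, not 32 query heads
head_dim = 128
tokens = 32_768        # 32K context
bytes_per_value = 2    # FP16

kv_bytes = 2 * num_layers * num_kv_heads * head_dim * tokens * bytes_per_value
print(f"{kv_bytes / 1024**3:.1f} GiB")  # -> 4.0 GiB
```

Engines that support an FP8 or 8-bit KV cache halve that figure, which is how the same context fits on smaller cards.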
RAG (retrieval-augmented generation): You build a vector index of your docs. Every question triggers: embed query → retrieve top-K chunks → stuff into prompt → generate. The latency math:
- Embedding step: ~50-150ms (local embedder) or ~200-400ms (cloud API)
- Vector search: ~5-20ms on a 100K-chunk index
- Generation: full prefill of (query + retrieved chunks) — typically 4-8K tokens = 1-2s on a 4090
- Total per question: ~1.5-2.5s end-to-end
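To make those steps concrete, here is a minimal local RAG loop. It is a sketch, not a recommendation: it assumes a sentence-transformers embedder, a brute-force NumPy index standing in for a real vector store, and an OpenAI-compatible local server at localhost:8000 (the endpoint, model name, and chunks are placeholders).

```python
import numpy as np
import requests
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Index build (one-time): embed every chunk, normalized so dot product = cosine similarity.
chunks = ["...doc chunk 1...", "...doc chunk 2..."]   # your pre-split documents
index = embedder.encode(chunks, normalize_embeddings=True)

def answer(query: str, top_k: int = 4) -> str:
    # 1. Embed the query (the ~50-150ms step).
    q = embedder.encode([query], normalize_embeddings=True)[0]
    # 2. Vector search: brute-force dot product stands in for a real ANN index (~5-20ms).
    top = np.argsort(index @ q)[::-1][:top_k]
    context = "\n\n".join(chunks[i] for i in top)
    # 3. Generate: full prefill of query + retrieved chunks (the 1-2s step).
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",  # any OpenAI-compatible local server
        json={
            "model": "local-model",                   # placeholder model name
            "messages": [
                {"role": "system", "content": f"Answer using only this context:\n{context}"},
                {"role": "user", "content": query},
            ],
        },
    )
    return resp.json()["choices"][0]["message"]["content"]
```

Every question pays the embed-plus-prefill cost from scratch; there is no warm-prefix shortcut unless the serving side also enables prefix caching.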
The decision rule:
| Your corpus | Pick |
|---|---|
| Single document (< 200K tokens) | Persistent KV cache. Faster, simpler, no retrieval drift. |
| 5-10 docs you re-read constantly | Persistent KV cache; swap cached prefixes between them. |
| Large corpus (1000+ docs) | RAG. KV cache doesn't fit. |
| Mixed: hot 5 + cold archive | Hybrid — KV cache for hot, RAG for cold. |
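If you prefer the rule as code, here is an illustrative translation of the table; the thresholds are the ones above, not benchmarked cutoffs, so tune them to your model's context window and VRAM.

```python
def pick_strategy(total_tokens: int, num_docs: int, corpus_changes_often: bool) -> str:
    """Rough encoding of the decision table above (illustrative thresholds)."""
    if corpus_changes_often or num_docs > 1000:
        return "RAG"
    if num_docs == 1 and total_tokens < 200_000:
        return "persistent KV cache"
    if num_docs <= 10:
        return "persistent KV cache (one cached prefix per doc set)"
    return "hybrid: KV-cache the hot docs, RAG the cold archive"
```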
Why the r/Rag thread "we replaced RAG with persistent KV cache" works:
- Your application's "context" is a fixed set of code files / docs / specs that DON'T change per-query.
- Embedding + retrieval adds latency without much quality gain when your corpus is small enough to keep warm.
- KV cache hits beat retrieval round-trips for latency-sensitive use cases (interactive chat, IDE-integrated agents).
Why RAG still wins for most teams:
- Your corpus is too big to hold in the model's context window, and therefore too big to keep in VRAM as a cached prefix.
- You need to add new documents continuously (KV cache invalidates when prefix changes).
- You need source-level citations (RAG gives you chunk-level attribution; KV cache doesn't).
- You're serving multi-tenant queries where each user has their own document set.
The honest middle ground: prefix-cache the system prompt + always-needed context, then RAG the corpus-specific retrievals on top. Both work in vLLM 0.20+ simultaneously.
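A minimal sketch of that hybrid with vLLM's offline API; the model name, file path, and prompt layout are placeholders, and it assumes prefix caching is available in your vLLM build. Keep the shared prefix byte-identical across calls so its KV blocks are reused, and append the per-query retrievals after it.

```python
from vllm import LLM, SamplingParams

# enable_prefix_caching lets requests that share identical leading tokens reuse cached KV blocks.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)
params = SamplingParams(temperature=0.2, max_tokens=512)

SYSTEM_PROMPT = "Answer questions about the attached project documents."
HOT_DOCS = open("hot_docs.txt").read()        # placeholder: the always-needed context

# The "hot" part: prefilled once, then served from cache as long as it stays byte-identical.
shared_prefix = f"{SYSTEM_PROMPT}\n\n{HOT_DOCS}"

def ask(question: str, retrieved_chunks: str) -> str:
    # Per-query RAG results go after the shared prefix, so only this tail pays prefill cost.
    prompt = f"{shared_prefix}\n\n{retrieved_chunks}\n\nQuestion: {question}\nAnswer:"
    return llm.generate([prompt], params)[0].outputs[0].text
```

The same layout works against a vLLM server through the OpenAI-compatible endpoint; the only requirement is that the cached portion comes first and never changes mid-session.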
Where we got the numbers
Prefix caching support: vLLM 0.6+ release notes (--enable-prefix-caching). KV cache math: standard transformer attention sizing (per-layer K/V tensors) + community implementations. RAG latency numbers from AnythingLLM + Khoj benchmarks 2026.
Also see
- vLLM has first-class prefix caching. Enable with --enable-prefix-caching.
- The most battle-tested local-RAG app. Editorial verdict + setup guidance.
- See exactly how much VRAM your KV cache will need at different context lengths.
- The third option when local docs aren't enough — but local LLMs don't ship with web search.
Other questions in this thread
Other /q/ landings on the same topic — same editorial discipline.
- Can I distribute local LLM inference across multiple machines (P2P)?
- I want my AI conversations to stay private — what's the realistic local-first setup?
- Is fine-tuning dead in 2026? RAG vs distillation vs prompting — when does fine-tuning actually win?
- Is NVFP4 a game-changer? What is it, and does it matter for me?
Found this via a forum search? Bookmark the URL — we update these pages as new data lands. Have a question that should live here? Open a GitHub issue.