Persistent KV cache vs RAG — which one should I use for 'chat with my docs'?
The answer
One paragraph. No hedging beyond what the data actually warrants.
Use persistent KV cache when your docs fit in the model's context. Use RAG when they don't.
Both solve "the model needs context it wasn't trained on." They solve it differently:
Persistent KV cache (prefix caching): The model processes your docs ONCE, the attention key+value tensors get cached in GPU memory, and every subsequent question re-uses that prefill. vLLM, llama.cpp, and SGLang all support this. The latency math:
- First request (cold): full prefill, with time scaling roughly linearly in input length (e.g., 32K context = 5-10s prefill on an RTX 4090)
- Every subsequent request (warm): ~50-100ms prefill — the cache hit
- Memory cost: KV cache size = 2 × num_layers × num_kv_heads × head_dim × tokens × bytes_per_value
For Llama 3.1 8B at 32K context: ~4 GB of cache at FP16. Alongside a 4-bit-quantized copy of the model, that fits comfortably on a 12-16GB card.
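As a sanity check on that formula, here is the arithmetic for Llama 3.1 8B, assuming its published shape (32 layers, 8 KV heads via grouped-query attention, head dim 128) and an FP16 cache:

```python
# KV cache size = 2 (K and V) x num_layers x num_kv_heads x head_dim x tokens x bytes_per_value
num_layers = 32        # Llama 3.1 8B
num_kv_heads = 8       # grouped-query attention: 8 KV heads, not 32 query heads
head_dim = 128
tokens = 32_768        # 32K context
bytes_per_value = 2    # FP16

kv_bytes = 2 * num_layers * num_kv_heads * head_dim * tokens * bytes_per_value
print(f"{kv_bytes / 1024**3:.1f} GiB")  # -> 4.0 GiB
```

Engines that support an FP8 or 8-bit KV cache halve that figure, which is how the same context fits on smaller cards.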
RAG (retrieval-augmented generation): You build a vector index of your docs. Every question triggers: embed query → retrieve top-K chunks → stuff into prompt → generate. The latency math:
- Embedding step: ~50-150ms (local embedder) or ~200-400ms (cloud API)
- Vector search: ~5-20ms on a 100K-chunk index
- Generation: full prefill of (query + retrieved chunks) — typically 4-8K tokens = 1-2s on a 4090
- Total per question: ~1.5-2.5s end-to-end
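To make those steps concrete, here is a minimal local RAG loop. It is a sketch, not a recommendation: it assumes a sentence-transformers embedder, a brute-force NumPy index standing in for a real vector store, and an OpenAI-compatible local server at localhost:8000 (the endpoint, model name, and chunks are placeholders).

```python
import numpy as np
import requests
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Index build (one-time): embed every chunk, normalized so dot product = cosine similarity.
chunks = ["...doc chunk 1...", "...doc chunk 2..."]   # your pre-split documents
index = embedder.encode(chunks, normalize_embeddings=True)

def answer(query: str, top_k: int = 4) -> str:
    # 1. Embed the query (the ~50-150ms step).
    q = embedder.encode([query], normalize_embeddings=True)[0]
    # 2. Vector search: brute-force dot product stands in for a real ANN index (~5-20ms).
    top = np.argsort(index @ q)[::-1][:top_k]
    context = "\n\n".join(chunks[i] for i in top)
    # 3. Generate: full prefill of query + retrieved chunks (the 1-2s step).
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",  # any OpenAI-compatible local server
        json={
            "model": "local-model",                   # placeholder model name
            "messages": [
                {"role": "system", "content": f"Answer using only this context:\n{context}"},
                {"role": "user", "content": query},
            ],
        },
    )
    return resp.json()["choices"][0]["message"]["content"]
```

Every question pays the embed-plus-prefill cost from scratch; there is no warm-prefix shortcut unless the serving side also enables prefix caching.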
The decision rule:
| Your corpus | Pick |
|---|---|
| Single document (< 200K tokens) | Persistent KV cache. Faster, simpler, no retrieval drift. |
| 5-10 docs you re-read constantly | Persistent KV cache; swap cached prefixes between them. |
| Large corpus (1000+ docs) | RAG. KV cache doesn't fit. |
| Mixed: hot 5 + cold archive | Hybrid — KV cache for hot, RAG for cold. |
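If you prefer the rule as code, here is an illustrative translation of the table; the thresholds are the ones above, not benchmarked cutoffs, so tune them to your model's context window and VRAM.

```python
def pick_strategy(total_tokens: int, num_docs: int, corpus_changes_often: bool) -> str:
    """Rough encoding of the decision table above (illustrative thresholds)."""
    if corpus_changes_often or num_docs > 1000:
        return "RAG"
    if num_docs == 1 and total_tokens < 200_000:
        return "persistent KV cache"
    if num_docs <= 10:
        return "persistent KV cache (one cached prefix per doc set)"
    return "hybrid: KV-cache the hot docs, RAG the cold archive"
```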
Why the r/Rag thread "we replaced RAG with persistent KV cache" works:
- Your application's "context" is a fixed set of code files / docs / specs that DON'T change per-query.
- Embedding + retrieval adds latency without much quality gain when your corpus is small enough to keep warm.
- KV cache hits beat retrieval round-trips for latency-sensitive use cases (interactive chat, IDE-integrated agents).
Why RAG still wins for most teams:
- Your corpus is too big to hold in the model's context window, and therefore too big to keep in VRAM as a cached prefix.
- You need to add new documents continuously (KV cache invalidates when prefix changes).
- You need source-level citations (RAG gives you chunk-level attribution; KV cache doesn't).
- You're serving multi-tenant queries where each user has their own document set.
The honest middle ground: prefix-cache the system prompt + always-needed context, then RAG the corpus-specific retrievals on top. Both work in vLLM 0.20+ simultaneously.
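A minimal sketch of that hybrid with vLLM's offline API; the model name, file path, and prompt layout are placeholders, and it assumes prefix caching is available in your vLLM build. Keep the shared prefix byte-identical across calls so its KV blocks are reused, and append the per-query retrievals after it.

```python
from vllm import LLM, SamplingParams

# enable_prefix_caching lets requests that share identical leading tokens reuse cached KV blocks.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)
params = SamplingParams(temperature=0.2, max_tokens=512)

SYSTEM_PROMPT = "Answer questions about the attached project documents."
HOT_DOCS = open("hot_docs.txt").read()        # placeholder: the always-needed context

# The "hot" part: prefilled once, then served from cache as long as it stays byte-identical.
shared_prefix = f"{SYSTEM_PROMPT}\n\n{HOT_DOCS}"

def ask(question: str, retrieved_chunks: str) -> str:
    # Per-query RAG results go after the shared prefix, so only this tail pays prefill cost.
    prompt = f"{shared_prefix}\n\n{retrieved_chunks}\n\nQuestion: {question}\nAnswer:"
    return llm.generate([prompt], params)[0].outputs[0].text
```

The same layout works against a vLLM server through the OpenAI-compatible endpoint; the only requirement is that the cached portion comes first and never changes mid-session.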
Where we got the numbers
Prefix caching support: vLLM 0.6+ release notes (--enable-prefix-caching). KV cache math: standard transformer attention sizing (per-layer K/V tensors) + community implementations. RAG latency numbers from AnythingLLM + Khoj benchmarks 2026.
Also see
- vLLM has first-class prefix caching. Enable with --enable-prefix-caching.
- The most battle-tested local-RAG app. Editorial verdict + setup guidance.
- See exactly how much VRAM your KV cache will need at different context lengths.
- The third option when local docs aren't enough — but local LLMs don't ship with web search.
Other questions in this thread
Other /q/ landings on the same topic — same editorial discipline.
- Can I distribute local LLM inference across multiple machines (P2P)?
- I want my AI conversations to stay private — what's the realistic local-first setup?
- Is fine-tuning dead in 2026? RAG vs distillation vs prompting — when does fine-tuning actually win?
- Is NVFP4 a game-changer? What is it, and does it matter for me?
Found this via a forum search? Bookmark the URL — we update these pages as new data lands. Have a question that should live here? Open a GitHub issue.