How to enable prompt caching to speed up repeated queries
Prerequisites
vLLM or Ollama with prompt caching support
What this does
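Prompt caching stores the key-value (KV) cache entries computed for a prompt prefix. Later requests that start with the same prefix skip recomputing it during prefill, which can reduce latency by 50-90% when requests share a long system prompt or document context. The gain grows with the length of the shared prefix relative to the rest of the request.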
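To make the mechanism concrete, here is a toy sketch of block-level prefix matching. It is not the implementation used by vLLM or Ollama: the `ToyPrefixCache` class, the block size of 4, and the whitespace-split "tokens" are all assumptions chosen for illustration.

```python
from typing import List, Set, Tuple

BLOCK_SIZE = 4  # assumed block size for this toy example; real engines choose their own


class ToyPrefixCache:
    """Remembers full token blocks, each keyed by the entire prefix up to that block."""

    def __init__(self) -> None:
        self._blocks: Set[Tuple[str, ...]] = set()

    def lookup_and_store(self, tokens: List[str]) -> int:
        """Return how many leading tokens were already cached, then cache this prompt."""
        cached = 0
        still_matching = True
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            key = tuple(tokens[:end])  # the key covers the whole prefix, not one block
            if still_matching and key in self._blocks:
                cached = end  # this block can be reused
            else:
                still_matching = False  # once one block misses, later blocks miss too
                self._blocks.add(key)
        return cached


cache = ToyPrefixCache()
system = "You are a helpful coding assistant . Provide concise answers .".split()
first = cache.lookup_and_store(system + "What is a decorator ?".split())
second = cache.lookup_and_store(system + "Explain list comprehensions".split())
print(f"cached tokens on first request: {first}, on second request: {second}")
```

In this sketch the second request reuses only the full blocks that lie entirely inside the shared system prompt, which is why a longer shared prefix tends to give a larger benefit.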
Steps
Enable prefix caching in vLLM.
    python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-3.2-3B \
        --enable-prefix-caching \
        --gpu-memory-utilization 0.90
Send requests with a shared system prompt.
    import time
    import requests

    system_prompt = "You are a helpful coding assistant. Provide concise answers."
    queries = [
        "What is a decorator?",
        "Explain list comprehensions",
        "How do I use async/await?",
    ]
    for q in queries:
        # The system message is identical across requests, so its KV entries can be reused
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": q},
        ]
        start = time.perf_counter()
        r = requests.post("http://localhost:8000/v1/chat/completions",
                          json={"model": "meta-llama/Llama-3.2-3B", "messages": messages})
        elapsed = time.perf_counter() - start
        print(f"Query: {q[:30]}... Time: {elapsed:.2f}s")
For Ollama, caching is automatic for the same model. Keep the model loaded:
    ollama run llama3.2
Subsequent API calls reuse the loaded model's cache.
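If requests go to Ollama over HTTP rather than through the interactive prompt, the model can be kept loaded between calls with the `keep_alive` request field. The sketch below assumes Ollama's default local address and uses only the standard library; the helper names `build_chat_payload` and `chat` and the `30m` duration are illustrative choices, not part of the original guide.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default local Ollama address


def build_chat_payload(system_prompt, user_prompt, model="llama3.2", keep_alive="30m"):
    """Build a chat request that asks Ollama to keep the model loaded after replying."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},  # shared prefix across calls
            {"role": "user", "content": user_prompt},
        ],
        "stream": False,           # return one JSON object instead of a stream
        "keep_alive": keep_alive,  # how long the model stays in memory after this call
    }


def chat(payload, url=OLLAMA_URL):
    """Send the payload to a running Ollama server and return the parsed JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

With a server running, `chat(build_chat_payload(system_prompt, "What is a decorator?"))` sends one request. Reusing the same `system_prompt` string on every call keeps the prefix identical.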
Measure the cache hit improvement.
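    import time
    import requests

    url = "http://localhost:8000/v1/chat/completions"
    # Identical payload for both requests, so the second one shares the full prompt prefix
    payload = {"model": "meta-llama/Llama-3.2-3B", "messages": [
        {"role": "system", "content": "You are a helpful coding assistant. Provide concise answers."},
        {"role": "user", "content": "What is a decorator?"}]}

    # First request (cache miss): full computation
    start = time.perf_counter()
    requests.post(url, json=payload)
    first_time = time.perf_counter() - start

    # Second request (cache hit): reuse KV for system prompt
    start = time.perf_counter()
    requests.post(url, json=payload)
    second_time = time.perf_counter() - start

    print(f"First: {first_time:.2f}s, Cached: {second_time:.2f}s, Speedup: {first_time/second_time:.1f}x")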
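A single pair of timings can be noisy, so one option is to repeat the cached request several times and compare the first request against the median. The following is a minimal sketch: `time_call` and `measure_speedup` are hypothetical helper names, and `send` stands for any zero-argument function that issues the request (for example a lambda wrapping `requests.post`).

```python
import statistics
import time
from typing import Callable, Tuple


def time_call(fn: Callable[[], object]) -> float:
    """Return the wall-clock seconds taken by one call to fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def measure_speedup(send: Callable[[], object], cached_trials: int = 5) -> Tuple[float, float, float]:
    """Time one initial request, then the median of several repeats of the same request."""
    first = time_call(send)  # expected cache miss if this prefix is new to the server
    cached = [time_call(send) for _ in range(cached_trials)]  # expected cache hits
    median_cached = statistics.median(cached)
    speedup = first / median_cached if median_cached > 0 else float("inf")
    return first, median_cached, speedup
```

The first call is only a true cache miss if the server has not seen that prefix before, so restart the server or change the system prompt before each measurement run.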
Verification
# Expected: Second request with same prefix is 2-10x faster than the first
# Example output: First: 1.2s, Cached: 0.3s, Speedup: 4.0x
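Timing alone cannot distinguish a cache hit from ordinary variance. The vLLM OpenAI-compatible server exposes Prometheus-format metrics at `/metrics`, so a second check is to look for prefix-cache counters there. The sketch below assumes the relevant metric names contain the substring `prefix_cache`; exact names differ between vLLM versions, and `parse_prefix_cache_metrics` and `fetch_metrics` are hypothetical helpers.

```python
import urllib.request
from typing import Dict


def parse_prefix_cache_metrics(metrics_text: str) -> Dict[str, float]:
    """Extract samples whose name contains 'prefix_cache' from Prometheus text output."""
    result: Dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments, and unrelated metrics
        if not line or line.startswith("#") or "prefix_cache" not in line:
            continue
        name, _, value = line.rpartition(" ")  # assumes "name{labels} value" with no timestamp
        try:
            result[name] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return result


def fetch_metrics(url: str = "http://localhost:8000/metrics") -> str:
    """Download the raw metrics text from a running server (assumed local address)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")
```

Calling `parse_prefix_cache_metrics(fetch_metrics())` before and after the second request shows whether any hit-related counter moved. If the result is empty, the substring assumption does not match your version and the raw metrics text is worth inspecting directly.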
Common failures
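- No speedup observed: Prefix caching requires identical token sequences at the start of the prompt. Any difference (even a single space) invalidates the cache from that point onward, so keep shared content first and put variable content last.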
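- Cache memory overhead: Cached prefixes consume VRAM. For very large caches, reduce --max-num-seqs or set --max-pref-cache-tokens. The --max-pref-cache-tokens flag may not exist in your vLLM version, so check the server's --help output before relying on it.
- vLLM prefix caching not enabled: Depending on the vLLM version, prefix caching may be off by default. Pass --enable-prefix-caching explicitly and verify with server logs.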
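When no speedup appears, a quick diagnostic is to find where two prompts that should share a prefix first diverge. This sketch compares characters, which is only a proxy: the server matches token IDs, so a tokenizer-level comparison would be more exact. The function name `common_prefix_len` and the sample prompts are illustrative.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Return the number of leading characters that a and b have in common."""
    n = 0
    for char_a, char_b in zip(a, b):
        if char_a != char_b:
            break
        n += 1
    return n


prompt_a = "You are a helpful coding assistant. Provide concise answers."
prompt_b = "You are a helpful coding assistant.  Provide concise answers."  # extra space

shared = common_prefix_len(prompt_a, prompt_b)
print(f"Prompts match for {shared} of {len(prompt_a)} characters")
print(f"Prompt A continues with: {prompt_a[shared:shared + 12]!r}")
print(f"Prompt B continues with: {prompt_b[shared:shared + 12]!r}")
```

If the shared length is much shorter than expected, look for differences such as trailing whitespace, timestamps, or per-user text injected near the top of the prompt.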