What this does

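Prompt caching (called prefix caching in vLLM) stores the key-value (KV) cache entries computed for a prompt prefix. Later requests that start with the same tokens skip recomputing that prefix, which shortens prompt processing and time to first token. For long shared system prompts or document contexts, latency can drop by 50-90%; short prefixes see little change because there is little work to skip.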
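
To make the prefix-matching idea concrete, here is a toy sketch of a block-level prefix cache. It is not vLLM's implementation: `ToyPrefixCache`, `block_hashes`, and the block size of 4 are hypothetical names and values chosen for illustration, and the integers stand in for token IDs.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; illustrative only, real engines use larger blocks


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Return one chained hash per full block of tokens.

    Each hash covers its block and everything before it, so two prompts
    share a hash only if they are identical up to the end of that block.
    """
    hashes = []
    prev = ""
    full = len(tokens) - len(tokens) % block_size  # ignore the trailing partial block
    for i in range(0, full, block_size):
        block = tokens[i:i + block_size]
        data = prev + "|" + ",".join(str(t) for t in block)
        prev = hashlib.sha256(data.encode("utf-8")).hexdigest()
        hashes.append(prev)
    return hashes


class ToyPrefixCache:
    """Remembers block hashes and reports how much of a new prompt is already cached."""

    def __init__(self):
        self._seen = set()

    def lookup_and_store(self, tokens):
        """Return the number of leading tokens already cached, then cache this prompt."""
        hashes = block_hashes(tokens)
        hit_blocks = 0
        for h in hashes:
            if h not in self._seen:
                break  # the first miss ends the reusable prefix
            hit_blocks += 1
        self._seen.update(hashes)
        return hit_blocks * BLOCK_SIZE


cache = ToyPrefixCache()
system = list(range(100, 110))  # 10 stand-in token IDs for a shared system prompt
first = cache.lookup_and_store(system + [1, 2, 3])         # nothing cached yet
second = cache.lookup_and_store(system + [7, 8, 9])        # same prefix, new question
shifted = cache.lookup_and_store([0] + system + [7, 8, 9])  # one extra leading token
print(first, second, shifted)  # prints: 0 8 0
```

In this sketch the second prompt reuses the two full blocks it shares with the first, while a single extra token at the front shifts every block and leaves nothing to reuse.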

Steps

Enable prefix caching in vLLM.

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-3B \
    --enable-prefix-caching \
    --gpu-memory-utilization 0.90

Send requests with a shared system prompt.

import time

import requests

URL = "http://localhost:8000/v1/chat/completions"
system_prompt = "You are a helpful coding assistant. Provide concise answers."
queries = [
    "What is a decorator?",
    "Explain list comprehensions",
    "How do I use async/await?",
]

for q in queries:
    # The system message comes first and never changes, so it forms the shared prefix.
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": q},
    ]
    start = time.perf_counter()
    r = requests.post(
        URL,
        json={"model": "meta-llama/Llama-3.2-3B", "messages": messages},
        timeout=120,
    )
    r.raise_for_status()  # fail loudly on HTTP errors instead of timing a failed call
    elapsed = time.perf_counter() - start
    print(f"Query: {q[:30]}... Time: {elapsed:.2f}s")

For Ollama, caching is automatic for the same model. Keep the model loaded:
```
ollama run llama3.2
```
Subsequent API calls reuse the loaded model's cache.
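
For API use, a request can also ask Ollama to keep the model loaded. The sketch below assumes Ollama's default local endpoint on port 11434 and uses the `keep_alive` request field; the helper names `build_chat_payload` and `send_chat` are hypothetical, and the standard library is used so the example has no dependencies.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # assumed default local endpoint


def build_chat_payload(system_prompt, question, keep_alive="30m"):
    """Build a non-streaming chat request that asks Ollama to keep the model loaded."""
    return {
        "model": "llama3.2",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        "stream": False,
        "keep_alive": keep_alive,  # how long the model stays in memory after this call
    }


def send_chat(payload, url=OLLAMA_CHAT_URL):
    """POST the payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the payload needs no server; send_chat(payload) does.
payload = build_chat_payload(
    "You are a helpful coding assistant. Provide concise answers.",
    "What is a decorator?",
)
print(json.dumps(payload, indent=2))
```

With Ollama running, calling `send_chat(payload)` repeatedly with the same system prompt and different questions keeps the shared prefix at the start of every request.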

Measure the cache hit improvement.

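# Continues the script above (uses requests, time, system_prompt). The unique tag makes the
# prefix new on each run; max_tokens is small so timing reflects prompt processing.
payload = {
    "model": "meta-llama/Llama-3.2-3B",
    "messages": [
        {"role": "system", "content": f"[run {time.time()}] {system_prompt}"},
        {"role": "user", "content": "What is a generator?"},
    ],
    "max_tokens": 1,
}

# First request (cache miss): full computation
start = time.perf_counter()
requests.post("http://localhost:8000/v1/chat/completions", json=payload)
first_time = time.perf_counter() - start

# Second request (cache hit): reuse KV for the whole prompt
start = time.perf_counter()
requests.post("http://localhost:8000/v1/chat/completions", json=payload)
second_time = time.perf_counter() - start

print(f"First: {first_time:.2f}s, Cached: {second_time:.2f}s, Speedup: {first_time/second_time:.1f}x")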
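
A single pair of timings can be noisy. As a rough sketch, the comparison can be repeated and summarized with a median; `time_call` and `measure_speedup` are hypothetical helpers, and `send` is any zero-argument callable that issues one request.

```python
import statistics
import time


def time_call(send):
    """Return the wall-clock seconds taken by one call to send()."""
    start = time.perf_counter()
    send()
    return time.perf_counter() - start


def measure_speedup(send, warm_trials=5):
    """Time one cold call, then several warm calls, and compare cold to the warm median."""
    cold = time_call(send)
    warm = [time_call(send) for _ in range(warm_trials)]
    warm_median = statistics.median(warm)
    speedup = cold / warm_median if warm_median > 0 else float("inf")
    return {"cold_s": cold, "warm_median_s": warm_median, "speedup": speedup}
```

For example, `measure_speedup(lambda: requests.post(url, json=payload, timeout=120))` would time the same request six times. The first call is only a real cache miss if that prefix has not been sent to the server before.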

Verification

# Expected: with a long shared prefix, the second request is 2-10x faster than the first
# Example output: First: 1.20s, Cached: 0.30s, Speedup: 4.0x

Common failures

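No speedup observed: Prefix caching requires identical token sequences at the start of the prompt. Any difference, even a single space, invalidates the cache from that point onward. Very short prefixes also show little gain because there is little prompt processing to skip.
Cache memory overhead: Cached prefixes consume VRAM. If memory is tight, reduce --max-num-seqs or --max-model-len.
vLLM prefix caching not active: Depending on the vLLM version, prefix caching may be disabled by default. Pass --enable-prefix-caching explicitly and confirm the setting in the server startup logs.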
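
Besides the logs, cache activity can be checked from the server's metrics. The sketch below assumes the vLLM server exposes Prometheus-style text at `/metrics` and that the relevant metric names contain the substring `prefix_cache`; exact metric names vary by vLLM version, so the filter is deliberately loose. `prefix_cache_metrics` and `fetch_metrics` are hypothetical helpers.

```python
import urllib.request


def prefix_cache_metrics(metrics_text):
    """Return {sample_name: value} for metric samples that mention 'prefix_cache'."""
    found = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "prefix_cache" not in line:
            continue
        name, _, value = line.rpartition(" ")  # the value is the last field
        try:
            found[name] = float(value)
        except ValueError:
            continue  # not a "name value" sample line
    return found


def fetch_metrics(url="http://localhost:8000/metrics"):
    """Download the raw metrics text (assumed endpoint on the server from the steps above)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")
```

With the server running, `prefix_cache_metrics(fetch_metrics())` before and after a repeated request should show the matching counters changing if the cache is in use. An empty result suggests either that the feature is off or that this version names its metrics differently.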

How to enable prompt caching to speed up repeated queries
