RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts: there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend.

© 2026 runlocalai.co · Independently operated
HOW-TO · INF

How to enable prompt caching to speed up repeated queries

intermediate·10 min·By Fredoline Eruo
PREREQUISITES

vLLM or Ollama with prompt caching support

What this does

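Prompt caching stores the KV cache entries computed for a prompt prefix. Later requests that start with the same tokens skip recomputing that prefix, which shortens prompt processing (time to first token); token generation speed is unchanged. For long shared system prompts or document contexts this can cut latency by roughly 50-90%; short prefixes gain much less.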
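
To make the mechanism concrete, here is a toy model of block-level prefix matching. It is a sketch, not vLLM's or Ollama's actual implementation: the 16-token block size, the SHA-256 hash chain, and the ToyPrefixCache class are assumptions chosen for illustration, and the token ids are fake.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per cache block; an assumed value for illustration


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block together with the hash of everything before it."""
    hashes = []
    parent = b""
    for i in range(len(tokens) // block_size):
        block = tokens[i * block_size:(i + 1) * block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest  # chaining makes a block's hash depend on its whole prefix
    return hashes


class ToyPrefixCache:
    """Hypothetical stand-in for a server's KV store: remembers seen blocks."""

    def __init__(self):
        self.seen = set()

    def process(self, tokens):
        """Return how many leading tokens could be reused, then cache this prompt."""
        hashes = block_hashes(tokens)
        reused_blocks = 0
        for h in hashes:
            if h not in self.seen:
                break  # first miss ends the match; later blocks are recomputed
            reused_blocks += 1
        self.seen.update(hashes)
        return reused_blocks * BLOCK_SIZE


cache = ToyPrefixCache()
system = list(range(40))               # 40 fake token ids for a shared system prompt
first = system + [1001, 1002]          # first user question
second = system + [2001, 2002, 2003]   # different question, same prefix

print(cache.process(first))   # → 0  (nothing cached yet)
print(cache.process(second))  # → 32 (two full blocks of the shared prefix)
```

In this model only whole blocks are reusable, so 32 of the 40 shared tokens count as hits. That is one reason very short prefixes show little benefit.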

Steps

  1. Enable prefix caching in vLLM.

    python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-3.2-3B \
        --enable-prefix-caching \
        --gpu-memory-utilization 0.90
    
  2. Send requests with a shared system prompt.

    import requests, time
    
    system_prompt = "You are a helpful coding assistant. Provide concise answers."
    queries = [
        "What is a decorator?",
        "Explain list comprehensions",
        "How do I use async/await?"
    ]
    
    for q in queries:
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": q}
        ]
        start = time.perf_counter()
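        r = requests.post("http://localhost:8000/v1/chat/completions",
            json={"model": "meta-llama/Llama-3.2-3B", "messages": messages,
                  "max_tokens": 64})  # cap output so answer length does not dominate timing
        r.raise_for_status()  # fail loudly on HTTP errors instead of timing them
        elapsed = time.perf_counter() - start
        print(f"Query: {q[:30]}... Time: {elapsed:.2f}s")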
    
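  3. For Ollama, prompt caching is automatic. While a model stays loaded, a new request can reuse the cached prefix it shares with an earlier prompt. Load the model:

    ollama run llama3.2
    

    Ollama unloads an idle model after a timeout (5 minutes by default), which discards its cache. To keep the model resident longer, set keep_alive on API requests or set the OLLAMA_KEEP_ALIVE environment variable.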

  4. Measure the cache hit improvement.

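    import requests, time

    URL = "http://localhost:8000/v1/chat/completions"
    system_prompt = "You are a helpful coding assistant. " * 200  # long prefix so the saving is measurable
    payload = {
        "model": "meta-llama/Llama-3.2-3B",
        "messages": [{"role": "system", "content": system_prompt},
                     {"role": "user", "content": "What is a decorator?"}],
        "max_tokens": 1,  # near-zero generation, so timing reflects prompt processing
    }

    # First request (cache miss): full computation
    start = time.perf_counter()
    requests.post(URL, json=payload).raise_for_status()
    first_time = time.perf_counter() - start

    # Second request (cache hit): reuse KV for the repeated prompt
    start = time.perf_counter()
    requests.post(URL, json=payload).raise_for_status()
    second_time = time.perf_counter() - start

    print(f"First: {first_time:.2f}s, Cached: {second_time:.2f}s, Speedup: {first_time/second_time:.1f}x")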
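
For the Ollama path in step 3, a similar check can go through Ollama's HTTP API. The sketch below assumes a local Ollama server on its default port with llama3.2 pulled. The helper names (build_chat_payload, prompt_eval_ms, ask, demo) are hypothetical, and it uses only the standard library rather than requests. Ollama reports durations in nanoseconds, and prompt_eval_duration may be absent from some responses, so the helper falls back to 0.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local address


def build_chat_payload(system_prompt, question, keep_alive="30m"):
    """Build a non-streaming chat request that keeps the model loaded."""
    return {
        "model": "llama3.2",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        "stream": False,
        "keep_alive": keep_alive,  # how long the model stays in memory after this call
    }


def prompt_eval_ms(response):
    """Prompt-processing time in milliseconds, or 0 if the field is missing."""
    return response.get("prompt_eval_duration", 0) / 1e6


def ask(system_prompt, question):
    """Send one request to a running Ollama server and return the parsed JSON."""
    body = json.dumps(build_chat_payload(system_prompt, question)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)


def demo():
    """Needs a running Ollama server; not called automatically."""
    system = "You are a helpful coding assistant. Provide concise answers."
    for q in ["What is a decorator?", "Explain list comprehensions"]:
        reply = ask(system, q)
        print(f"{q}: prompt eval {prompt_eval_ms(reply):.1f} ms")
```

Call demo() with the server running. If the cache is being reused, the prompt evaluation time for the second question should be lower than for the first.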
    

Verification

# Expected: the second request with the same prefix is faster than the first, often 2-10x for long prefixes
# Illustrative output (not a measured result): First: 1.2s, Cached: 0.3s, Speedup: 4.0x
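
Timing alone can be noisy, so it helps to look at the server's own counters as well. vLLM's OpenAI-compatible server exposes Prometheus metrics at /metrics. The sketch below filters that text for lines mentioning the prefix cache. Exact metric names vary between vLLM versions, so it matches on the substring "prefix_cache" rather than a specific name. That substring is an assumption, and the sample text is made up purely to show the filter working.

```python
import urllib.request


def prefix_cache_lines(metrics_text):
    """Return non-comment Prometheus lines that mention the prefix cache."""
    return [
        line
        for line in metrics_text.splitlines()
        if line and not line.startswith("#") and "prefix_cache" in line
    ]


def fetch_metrics(base_url="http://localhost:8000"):
    """Download the raw metrics text from a running vLLM server."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=10) as resp:
        return resp.read().decode()


# Made-up sample in Prometheus text format; real metric names will differ.
sample = """# HELP example_prefix_cache_hits Illustrative counter
# TYPE example_prefix_cache_hits counter
example_prefix_cache_hits 32.0
example_requests_total 4.0
"""
print(prefix_cache_lines(sample))  # → ['example_prefix_cache_hits 32.0']
```

Against a live server, run prefix_cache_lines(fetch_metrics()) before and after the repeated request and compare the values.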

Common failures

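  • No speedup observed: Prefix caching matches identical token sequences from the start of the prompt. The first differing token (even one extra space) ends the match, and everything after it is recomputed. Put stable content (system prompt, documents) first and variable content last.
  • Cache memory overhead: Cached prefixes occupy blocks in vLLM's KV cache pool, whose size is set by --gpu-memory-utilization. Under memory pressure cached blocks are evicted, so hit rates drop. If VRAM is tight, reduce --max-num-seqs or --max-model-len.
  • vLLM prefix caching may be off: Older vLLM releases disable prefix caching unless you pass --enable-prefix-caching, while newer releases enable it by default. Check the server startup logs to confirm the setting.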
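
When no speedup appears, a quick check is to find where two prompts first differ. The helper below is a minimal sketch with hypothetical names (first_divergence, show_divergence). It compares characters, not tokens, so it only approximates what the server sees, but a stray space or a timestamp early in the prompt shows up immediately.

```python
def first_divergence(a, b):
    """Index of the first differing character, or None if none within the shared length."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i
    return None


def show_divergence(a, b, context=20):
    """Describe where two prompts diverge, with a little surrounding text."""
    i = first_divergence(a, b)
    if i is None:
        return "no divergence within the shared length"
    lo = max(0, i - context)
    return f"differs at char {i}: {a[lo:i + context]!r} vs {b[lo:i + context]!r}"


p1 = "You are a helpful coding assistant. Provide concise answers."
p2 = p1.replace(". P", ".  P")  # same prompt with one extra space

print(first_divergence(p1, p2))  # → 36
print(show_divergence(p1, p2))
```

Everything before the reported index can still be cached; everything from that point on is recomputed on each request.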

Related guides

  • How to structure prompts to maximize cache hit rates
  • How to enable and configure speculative decoding for faster generation