RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

© 2026 runlocalai.co · Independently operated
HOW-TO · INF

How to structure prompts to maximize cache hit rates

Intermediate · 15 min · By Fredoline Eruo
PREREQUISITES

Prompt caching enabled on inference server

What this does

Prompt caching only works when prefixes are token-identical. This guide teaches prompt engineering techniques that maximize cache reuse across different queries.

Steps

  1. Move all static content to the beginning of the prompt. The shared prefix should come first, before any variable content.

    [SYSTEM] You are a coding assistant. Answer concisely.
    [USER] <variable query here>
    
  2. Use a fixed system prompt structure across all requests.

    SYSTEM_PROMPT = "You are a helpful assistant. Follow these rules:\n1. Be concise\n2. Be accurate\n3. Cite sources when possible\n\n"
    
    def make_prompt(query):
        # Variable content at the END, after the shared prefix
        return SYSTEM_PROMPT + f"User query: {query}\nAssistant:"
    
  3. Align multi-turn conversations to preserve the prefix. The full message history before the new query is the cacheable prefix.

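    system_prompt = "You are a helpful assistant."  # identical on every request
    new_query = "New question"  # the only part that changes

    messages = [
        {"role": "system", "content": system_prompt},
        # Previous turns are resent unchanged, so they stay cacheable
        {"role": "user", "content": "Old question"},
        {"role": "assistant", "content": "Old answer"},
        # Only the last user message changes
        {"role": "user", "content": new_query}
    ]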
    
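  4. Keep the static prefix the same length on every request. If your system prompt is 300 tokens, every request should start with those same 300 tokens. Some servers cache in fixed-size token blocks (vLLM does), so only whole blocks of the prefix are reused and a prefix that ends mid-block leaves its tail uncached. If you pad to reach a block boundary, put identical filler inside the static prefix on every request, and test the output first: the model still reads padding tokens, it does not ignore them.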

  5. Measure cache hit rate.

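    import requests

    # Placeholder payload; set "model" to whatever your server is serving
    payload = {"model": "your-model", "messages": [{"role": "user", "content": "ping"}]}

    total, cached = 0, 0
    while True:  # stop with Ctrl+C
        r = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=60)
        total += 1
        # Assumes the server, or a proxy in front of it, sets X-Cache; many do not
        if r.headers.get("X-Cache") == "HIT":
            cached += 1
        if total % 10 == 0:
            print(f"Cache hit rate: {cached/total*100:.1f}%")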
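Before sending traffic, it can help to check offline that prompts built per steps 1 to 3 really share a long prefix. The sketch below is an illustration under stated assumptions, not how any server measures reuse: `render` is a hypothetical flat rendering (real servers apply the model's chat template), characters stand in for tokens, and `shared_prefix_len` and `extends_history` are helper names invented here.

```python
from typing import List, Sequence


def shared_prefix_len(a: Sequence, b: Sequence) -> int:
    """Number of leading items (characters or tokens) that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def render(messages: List[dict]) -> str:
    # Hypothetical rendering for illustration; a real server uses the model's chat template
    return "".join(f"[{m['role'].upper()}] {m['content']}\n" for m in messages)


def extends_history(previous: List[dict], current: List[dict]) -> bool:
    """True when current keeps every earlier message unchanged and only appends."""
    return current[:len(previous)] == previous


SYSTEM = {"role": "system", "content": "You are a coding assistant. Answer concisely."}

# Static content first: the whole system prompt is shared between requests
static_first_a = render([SYSTEM, {"role": "user", "content": "Explain mutexes"}])
static_first_b = render([SYSTEM, {"role": "user", "content": "Explain semaphores"}])

# Variable content first: the prompts diverge almost immediately
variable_first_a = render([{"role": "user", "content": "Explain mutexes"}, SYSTEM])
variable_first_b = render([{"role": "user", "content": "Explain semaphores"}, SYSTEM])

print("static first:  ", shared_prefix_len(static_first_a, static_first_b))
print("variable first:", shared_prefix_len(variable_first_a, variable_first_b))
```

For a closer estimate, pass token ID lists from your model's tokenizer to `shared_prefix_len` instead of strings. `extends_history` can guard a multi-turn client: if it returns False, an earlier turn was edited and the cached prefix ends at the first changed message.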
    

Verification

# Expected: > 80% cache hit rate for well-structured prompts
# vLLM reports a prefix cache hit rate in its periodic log stats
# Ollama shows reduced time-to-first-token for cached prefixes
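
When the server exposes no cache counters, timing the first streamed chunk is a rough stand-in for time-to-first-token. This is a minimal sketch using only the standard library. It assumes an OpenAI-compatible streaming endpoint; `time_to_first_chunk` and `open_chat_stream` are hypothetical helper names, and the first chunk may arrive slightly before the first generated token.

```python
import json
import time
import urllib.request
from typing import Callable, Iterable


def time_to_first_chunk(open_stream: Callable[[], Iterable[bytes]]) -> float:
    """Seconds from calling open_stream() until its first chunk arrives."""
    start = time.perf_counter()
    for _chunk in open_stream():
        return time.perf_counter() - start
    raise RuntimeError("stream ended without producing any chunks")


def open_chat_stream(url: str, payload: dict) -> Iterable[bytes]:
    """POST a streaming chat request and yield non-empty response lines."""
    body = json.dumps({**payload, "stream": True}).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        for line in resp:
            if line.strip():
                yield line
```

To compare, call `time_to_first_chunk(lambda: open_chat_stream(url, payload))` twice with the same long prefix. A clearly shorter second timing suggests the prefix was served from cache. Repeat a few times, since a single pair of timings is noisy.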

Common failures

  • Trailing whitespace breaks cache: "Hello " and "Hello" tokenize differently. Normalize the static prefix once (for example with .strip()) and reuse that exact string on every request.
  • Case sensitivity: "Hello" vs "hello" produce different tokens. Use consistent casing in the prefix.
  • Cache eviction under load: With many different prefixes, older cached entries are evicted. Keep the number of distinct prefixes small (ideally 1-3).
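
The first and third failures above can be caught client-side before a request is sent. The sketch below is one possible approach, not a feature of any server: `normalize_prefix`, `prefix_fingerprint`, and `count_distinct_prefixes` are hypothetical helpers. Case is deliberately left alone, because folding it would change what the prompt says.

```python
import hashlib
from collections import Counter
from typing import Iterable


def normalize_prefix(text: str) -> str:
    """Canonical form: unified newlines, no trailing spaces per line, no outer whitespace."""
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines).strip()


def prefix_fingerprint(text: str) -> str:
    """Short hash of the normalized prefix, handy for logging next to each request."""
    digest = hashlib.sha256(normalize_prefix(text).encode("utf-8")).hexdigest()
    return digest[:12]


def count_distinct_prefixes(prefixes: Iterable[str]) -> Counter:
    """Map each fingerprint to how many requests used that prefix."""
    return Counter(prefix_fingerprint(p) for p in prefixes)


if __name__ == "__main__":
    seen = count_distinct_prefixes(["Hello", "Hello ", "hello"])
    # "Hello" and "Hello " collapse to one prefix; "hello" stays separate
    print(len(seen), "distinct prefixes")
```

Send the output of `normalize_prefix`, not the raw string, so the server sees the canonical text. If `count_distinct_prefixes` over a sample of real traffic reports many more fingerprints than you expect, something upstream is varying the prefix.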

Related guides

  • How to enable prompt caching to speed up repeated queries
  • How to enable and configure speculative decoding for faster generation