What this does

Prompt caching only works when prefixes are token-identical. This guide teaches prompt engineering techniques that maximize cache reuse across different queries.

Steps

Move all static content to the beginning of the prompt, so the shared prefix comes before any variable content such as the user query, timestamps, or request IDs.
```
[SYSTEM] You are a coding assistant. Answer concisely.
[USER] <variable query here>
```

Use a fixed system prompt structure across all requests.

```python
SYSTEM_PROMPT = "You are a helpful assistant. Follow these rules:\n1. Be concise\n2. Be accurate\n3. Cite sources when possible\n\n"

def make_prompt(query):
    # Variable content at the END, after the shared prefix
    return SYSTEM_PROMPT + f"User query: {query}\nAssistant:"
```
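One way to sanity-check this step is to confirm that two generated prompts really share the whole static prefix. The sketch below is illustrative only: `shared_prefix_chars` is a hypothetical helper, and it compares characters rather than tokens, so it approximates what the cache sees.

```python
import os

SYSTEM_PROMPT = (
    "You are a helpful assistant. Follow these rules:\n"
    "1. Be concise\n2. Be accurate\n3. Cite sources when possible\n\n"
)

def make_prompt(query: str) -> str:
    # Variable content goes last, after the shared prefix
    return SYSTEM_PROMPT + f"User query: {query}\nAssistant:"

def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the common leading substring of a and b (hypothetical helper)."""
    return len(os.path.commonprefix([a, b]))

p1 = make_prompt("What is a mutex?")
p2 = make_prompt("Explain TCP slow start.")

# The prompts should agree at least through the static prefix
print(shared_prefix_chars(p1, p2) >= len(SYSTEM_PROMPT))
```

If the shared length comes out shorter than the system prompt, something variable has crept into the prefix.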

Align multi-turn conversations to preserve the prefix. The full message history before the new query is the cacheable prefix, so append new turns to the end and never edit, reorder, or drop earlier ones.

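```python
system_prompt = "You are a helpful assistant."  # must be identical on every request
new_query = "New question"                      # the only part that changes

messages = [
    {"role": "system", "content": system_prompt},
    # Earlier turns are resent unchanged, so they stay inside the cached prefix
    {"role": "user", "content": "Old question"},
    {"role": "assistant", "content": "Old answer"},
    # Only the last user message is new
    {"role": "user", "content": new_query},
]
```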
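A small sketch of the append-only pattern, assuming message histories are plain lists of dicts as above. The helper names `extend_history` and `is_prefix` are hypothetical and exist only for illustration.

```python
from typing import Dict, List

Message = Dict[str, str]

def extend_history(history: List[Message], role: str, content: str) -> List[Message]:
    """Return a new history with one message appended; earlier turns are untouched."""
    return history + [{"role": role, "content": content}]

def is_prefix(old: List[Message], new: List[Message]) -> bool:
    """True if `old` is an exact leading slice of `new`."""
    return new[:len(old)] == old

turn1 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Old question"},
]
turn2 = extend_history(turn1, "assistant", "Old answer")
turn2 = extend_history(turn2, "user", "New question")

# The previous request's messages should be a leading slice of the next request
print(is_prefix(turn1, turn2))
```

A check like `is_prefix` can run in tests or as a debug assertion to catch code that rewrites earlier turns, for example by summarizing or truncating history.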

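Keep the static prefix long enough to be cached. Many engines cache in fixed-size token blocks, and some hosted APIs only cache prompts above a minimum length, so a very short prefix may never be reused. If your system prompt is 300 tokens, make sure every request starts with those same 300 tokens. If you pad a shorter prefix, use identical filler in every request and keep it inside the static prefix; the model still reads padding tokens, so extra static instructions or examples are usually a better filler than whitespace.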
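To see why prefix length matters, here is a rough sketch of block-aligned reuse. It assumes the engine caches only complete blocks of tokens; the default `block_size=16` is an assumed value for illustration, so check your engine's configuration. `reusable_prefix_tokens` is a hypothetical helper.

```python
def reusable_prefix_tokens(prefix_tokens: int, block_size: int = 16) -> int:
    """Tokens of the prefix that fall in complete cache blocks (assumed model)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return (prefix_tokens // block_size) * block_size

for n in (10, 300, 1024):
    # Prints the prefix length next to the part that could be reused
    print(n, reusable_prefix_tokens(n))
```

Under this assumption a 10-token prefix yields no reusable blocks, while a 300-token prefix loses only the last partial block.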

Measure cache hit rate.

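```python
import requests

# Example payload; replace the model name and messages with your own
payload = {"model": "my-model", "messages": [{"role": "user", "content": "Hello"}]}

total, cached = 0, 0
for _ in range(100):  # bounded run instead of an endless loop
    r = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=60)
    total += 1
    # Assumes the server, or a proxy in front of it, sets an X-Cache header;
    # many inference servers do not, so check what your stack exposes
    if r.headers.get("X-Cache") == "HIT":
        cached += 1
    if total % 10 == 0:
        print(f"Cache hit rate: {cached/total*100:.1f}%")
```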
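If your server does not expose a cache header, the response body may be another source. Some OpenAI-compatible servers report cached prompt tokens under `usage.prompt_tokens_details.cached_tokens`; whether yours fills that field is an assumption to verify. The sketch below computes a token-level hit rate from already-parsed response bodies, and `cached_token_rate` is a hypothetical helper.

```python
from typing import Any, Dict, Iterable

def cached_token_rate(responses: Iterable[Dict[str, Any]]) -> float:
    """Fraction of prompt tokens reported as served from cache."""
    prompt_total = 0
    cached_total = 0
    for body in responses:
        usage = body.get("usage") or {}
        details = usage.get("prompt_tokens_details") or {}
        prompt_total += usage.get("prompt_tokens", 0) or 0
        # A missing field is treated as zero cached tokens
        cached_total += details.get("cached_tokens", 0) or 0
    return cached_total / prompt_total if prompt_total else 0.0

# Hand-written sample bodies, not real server output
sample = [
    {"usage": {"prompt_tokens": 400, "prompt_tokens_details": {"cached_tokens": 0}}},
    {"usage": {"prompt_tokens": 400, "prompt_tokens_details": {"cached_tokens": 384}}},
    {"usage": {"prompt_tokens": 400}},
]
print(f"{cached_token_rate(sample):.2f}")
```

A token-level rate is often more informative than a per-request hit flag, because a request can reuse most of its prefix without being a full hit.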

Verification

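Expect a cache hit rate above 80% for well-structured prompts that share one static prefix.
With prefix caching enabled, vLLM reports a prefix cache hit rate in its periodic stats log lines; the exact wording depends on the version.
With Ollama, look for reduced time-to-first-token on requests that reuse a cached prefix.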
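Where the server gives no explicit cache signal, time-to-first-token can be compared between a cold and a warm request. The sketch below times how long a stream takes to produce its first chunk. `time_to_first_chunk` and `fake_stream` are hypothetical names, and the fake stream only stands in for a real streaming HTTP call, which you would supply yourself.

```python
import time
from typing import Callable, Iterable, Iterator

def time_to_first_chunk(start_stream: Callable[[], Iterable[str]]) -> float:
    """Seconds from calling start_stream() until its first chunk arrives."""
    t0 = time.perf_counter()
    stream: Iterator[str] = iter(start_stream())
    next(stream, None)  # block until the first chunk (or the stream ends)
    return time.perf_counter() - t0

def fake_stream(delay: float) -> Callable[[], Iterable[str]]:
    """Stand-in for a streaming request; sleeps to imitate prefill time."""
    def run():
        time.sleep(delay)
        yield "first token"
        yield "rest"
    return run

cold = time_to_first_chunk(fake_stream(0.05))  # imitates an uncached prefix
warm = time_to_first_chunk(fake_stream(0.01))  # imitates a cached prefix
print(f"cold={cold:.3f}s warm={warm:.3f}s")
```

With a real server, send the same long prefix twice and compare the two timings. Repeat several times, since network and scheduling noise can hide small differences.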

Common failures

Trailing whitespace breaks the cache: "Hello " and "Hello" tokenize differently. Normalize the static prefix the same way on every request, for example with .strip().
Case sensitivity: "Hello" vs "hello" produce different tokens. Use consistent casing in the prefix.
Cache eviction under load: With many different prefixes, older cached entries are evicted. Keep the number of distinct prefixes small (ideally 1-3).
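The failures above can be caught early by normalizing prefixes and counting how many distinct ones your application actually sends. This is a minimal sketch with hypothetical helper names. It normalizes whitespace and line endings but deliberately leaves casing alone, so case differences still show up as separate prefixes.

```python
import hashlib
from collections import Counter

def normalize_prefix(prefix: str) -> str:
    """Strip surrounding whitespace and unify line endings."""
    return prefix.replace("\r\n", "\n").strip()

def prefix_fingerprint(prefix: str) -> str:
    """Short stable hash identifying a normalized prefix."""
    digest = hashlib.sha256(normalize_prefix(prefix).encode("utf-8")).hexdigest()
    return digest[:12]

# The first two differ only by trailing whitespace; the third differs by case
seen = Counter(
    prefix_fingerprint(p)
    for p in ["You are helpful.", "You are helpful. ", "you are helpful."]
)
print(len(seen))  # number of distinct prefixes after normalization
```

Logging the fingerprint with each request makes it easy to spot when the number of distinct prefixes grows beyond the handful you intended.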

How to structure prompts to maximize cache hit rates
