HOW-TO · INF
How to structure prompts to maximize cache hit rates
Prerequisites
Prompt caching enabled on the inference server
What this does
Prompt caching only works when prefixes are token-identical. This guide teaches prompt engineering techniques that maximize cache reuse across different queries.
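To make "token-identical prefix" concrete, here is a minimal sketch that counts how many leading tokens two prompts share. The byte-level `toy_tokenize` is a stand-in assumption, not a real model tokenizer, and the helper names are hypothetical.

```python
from itertools import takewhile


def toy_tokenize(text: str) -> list[int]:
    # Stand-in tokenizer (assumption): one token per UTF-8 byte.
    return list(text.encode("utf-8"))


def shared_prefix_len(a: list[int], b: list[int]) -> int:
    # Count tokens that match position by position from the start.
    return sum(1 for _ in takewhile(lambda pair: pair[0] == pair[1], zip(a, b)))


STATIC = "You are a coding assistant. Answer concisely.\n"
QUERIES = ("Sort a list", "Parse JSON")

# Static content first: the whole static block is reusable.
static_first = [toy_tokenize(STATIC + q) for q in QUERIES]
# Variable content first: the prompts diverge at the very first token.
variable_first = [toy_tokenize(q + "\n" + STATIC) for q in QUERIES]

reusable_good = shared_prefix_len(*static_first)
reusable_bad = shared_prefix_len(*variable_first)
print(reusable_good, reusable_bad)
```

With a real tokenizer the counts differ, but the pattern should hold: reuse stops at the first token that differs.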
Steps
Move all static content to the beginning of the prompt. The shared prefix should come first, before any variable content.
    [SYSTEM] You are a coding assistant. Answer concisely.
    [USER] <variable query here>

Use a fixed system prompt structure across all requests.
    SYSTEM_PROMPT = (
        "You are a helpful assistant. Follow these rules:\n"
        "1. Be concise\n2. Be accurate\n3. Cite sources when possible\n\n"
    )

    def make_prompt(query: str) -> str:
        # Variable content goes at the END, after the shared prefix
        return SYSTEM_PROMPT + f"User query: {query}\nAssistant:"

Align multi-turn conversations to preserve the prefix. The full message history before the new query is the cacheable prefix, so append new turns and never edit or reorder earlier ones.
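    system_prompt = "You are a helpful assistant."  # same string on every request
    new_query = "New question"

    messages = [
        {"role": "system", "content": system_prompt},
        # All previous turns are cached
        {"role": "user", "content": "Old question"},
        {"role": "assistant", "content": "Old answer"},
        # Only the last user message changes
        {"role": "user", "content": new_query},
    ]

Pad short prompts to a consistent length. If your system prompt is 300 tokens, ensure every request has at least 300 tokens of static prefix. If padding is needed, use filler the model ignores, such as whitespace, and keep it identical on every request. Padding mainly matters when the server caches in fixed-size blocks or requires a minimum prefix length, so check your server's documentation before relying on it.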
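The padding step can be sketched as follows. This is a rough illustration under stated assumptions: `approx_token_count` is a crude four-characters-per-token estimate rather than a real tokenizer, and `FILLER` and `pad_prefix` are hypothetical names. The point it shows is that padding must be deterministic, so the padded prefix is identical on every request.

```python
def approx_token_count(text: str) -> int:
    # Rough stand-in (assumption): about 4 characters per token.
    return len(text) // 4


# Hypothetical filler; pick something harmless for your model.
FILLER = "\n."


def pad_prefix(prefix: str, min_tokens: int, count=approx_token_count) -> str:
    # Append the same filler until the estimated token count reaches the minimum.
    padded = prefix
    while count(padded) < min_tokens:
        padded += FILLER
    return padded


short_prefix = "You are a helpful assistant.\n"
padded_prefix = pad_prefix(short_prefix, min_tokens=300)
print(approx_token_count(short_prefix), approx_token_count(padded_prefix))
```

In practice, pass your server's actual tokenizer as `count`, and compute the padded prefix once at startup rather than on every request.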
Measure cache hit rate.
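    import requests

    URL = "http://localhost:8000/v1/chat/completions"
    # Placeholder request body; adjust the model name and messages for your server
    payload = {"model": "my-model", "messages": [{"role": "user", "content": "ping"}]}

    total, cached = 0, 0
    for _ in range(100):  # bounded run instead of an endless loop
        r = requests.post(URL, json=payload, timeout=60)
        total += 1
        # Assumes the server, or a proxy in front of it, sets an X-Cache header
        if r.headers.get("X-Cache") == "HIT":
            cached += 1
        if total % 10 == 0:
            print(f"Cache hit rate: {cached/total*100:.1f}%")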
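Where no cache header is available, token-level usage counts can serve instead. Below is a minimal sketch, assuming the server returns OpenAI-style `usage` objects with a `prompt_tokens_details.cached_tokens` field. Not every server reports this, and the sample values are made up for illustration.

```python
def cached_token_rate(usages: list[dict]) -> float:
    # Fraction of prompt tokens that were served from the cache.
    prompt = sum(u.get("prompt_tokens", 0) for u in usages)
    cached = sum(
        (u.get("prompt_tokens_details") or {}).get("cached_tokens", 0)
        for u in usages
    )
    return cached / prompt if prompt else 0.0


# Illustrative usage payloads (made-up numbers, not measurements).
sample_usages = [
    {"prompt_tokens": 400, "prompt_tokens_details": {"cached_tokens": 0}},
    {"prompt_tokens": 410, "prompt_tokens_details": {"cached_tokens": 384}},
    {"prompt_tokens": 390},  # server omitted the details field
]

rate = cached_token_rate(sample_usages)
print(f"Cached token rate: {rate * 100:.1f}%")
```

A token-level rate also shows partial hits, which a per-request HIT/MISS flag hides.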
Verification
# Expected: > 80% cache hit rate for well-structured prompts
# vLLM's periodic stats log line reports "Prefix cache hit rate" when prefix caching is enabled
# Ollama shows reduced time-to-first-token for cached prefixes
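When the server exposes no cache indicator at all, time-to-first-token can be compared between a cold and a warm request. Below is a minimal sketch: `time_to_first_chunk` is a hypothetical helper that times any callable returning an iterable of chunks, and the `requests` wiring in the comment assumes a streaming-capable endpoint plus your own `URL` and `payload`.

```python
import time
from typing import Callable, Iterable


def time_to_first_chunk(open_stream: Callable[[], Iterable[bytes]]) -> float:
    # Start the clock before the request is issued, stop at the first chunk.
    start = time.perf_counter()
    for _chunk in open_stream():
        return time.perf_counter() - start
    raise RuntimeError("stream ended without producing any chunk")


# Example wiring with requests (assumed endpoint and payload):
#   open_stream = lambda: requests.post(
#       URL, json={**payload, "stream": True}, stream=True, timeout=60
#   ).iter_lines()
#   cold = time_to_first_chunk(open_stream)   # first request fills the cache
#   warm = time_to_first_chunk(open_stream)   # repeat should start sooner
```

Single timings are noisy, so repeat each measurement several times and compare medians.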
Common failures
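Most of the failures in this section come down to prefixes that look the same but differ by a character or two. Here is a minimal sketch of one way to catch that drift: normalize the static prefix once and fingerprint it, so the number of distinct prefixes in flight can be counted. The helper names are hypothetical and the sample strings are illustrative.

```python
import hashlib
from collections import Counter


def normalize_prefix(prefix: str) -> str:
    # Remove stray leading/trailing whitespace, then end with exactly one newline.
    return prefix.strip() + "\n"


def fingerprint(prefix: str) -> str:
    # Short, stable hash used to count distinct prefixes.
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]


observed = [
    "You are a helpful assistant.",
    "You are a helpful assistant. ",   # trailing space
    "You are a helpful assistant.\n",  # trailing newline
]

raw = Counter(fingerprint(p) for p in observed)
normalized = Counter(fingerprint(normalize_prefix(p)) for p in observed)
print(len(raw), len(normalized))
```

Logging the fingerprint with each request makes it easy to see when the distinct-prefix count starts to grow.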
- Trailing whitespace breaks cache: "Hello " and "Hello" tokenize differently. Normalize prompts with .strip().
- Case sensitivity: "Hello" and "hello" produce different tokens. Use consistent casing in the prefix.
- Cache eviction under load: With many different prefixes, older cached entries are evicted. Keep the number of distinct prefixes small (ideally 1-3).