13. Semantic Caching
Semantic caching stores query-response pairs indexed by the embedding of the query, so lookups match on semantic similarity rather than exact strings. For example, when a user asks "How do I reset my password?", a cached answer for "I forgot my login credentials" can be returned if the similarity between the two queries exceeds the configured threshold.
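To make the threshold comparison concrete, here is a minimal sketch that assumes cosine similarity as the metric. The three-dimensional vectors are made-up stand-ins for real embeddings, and the helper names cosine_similarity and is_cache_hit are illustrative, not part of any library.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def is_cache_hit(query_vec: list[float], cached_vec: list[float],
                 threshold: float = 0.85) -> bool:
    """A cached entry is reused only when similarity reaches the threshold."""
    return cosine_similarity(query_vec, cached_vec) >= threshold


# Toy vectors standing in for real embeddings (values are made up).
reset_password = [0.9, 0.1, 0.2]
forgot_credentials = [0.8, 0.2, 0.3]
pricing_question = [0.1, 0.9, 0.1]

print(is_cache_hit(reset_password, forgot_credentials))  # similar direction
print(is_cache_hit(reset_password, pricing_question))    # different direction
```

With these toy values the first pair clears the 0.85 threshold and the second does not; real embeddings have hundreds of dimensions, but the comparison is the same.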
The core implementation uses a vector database as the cache store:
import hashlib

import numpy as np
from redis import Redis
from redis.commands.search.query import Query
from sentence_transformers import SentenceTransformer


class SemanticCache:
    def __init__(self, model_name="all-MiniLM-L6-v2",
                 similarity_threshold=0.85):
        self.model = SentenceTransformer(model_name)
        self.redis = Redis(host='localhost', port=6379)
        self.threshold = similarity_threshold

    def _embed(self, query: str) -> np.ndarray:
        return self.model.encode(query).astype(np.float32)

    def get_cached_response(self, query: str) -> str | None:
        embedding = self._embed(query)
        # Assumes an index "idx:queries" over hashes prefixed "cache:" with a
        # FLOAT32 vector field "embedding" using the COSINE distance metric.
        knn = (
            Query("*=>[KNN 1 @embedding $vec AS score]")
            .sort_by("score")
            .return_fields("response", "score")
            .dialect(2)  # KNN syntax requires query dialect 2 or later
        )
        results = self.redis.ft("idx:queries").search(
            knn, query_params={"vec": embedding.tobytes()}
        )
        if not results.docs:
            return None
        top_result = results.docs[0]
        # Redis returns cosine distance as a string; convert to similarity.
        similarity = 1 - float(top_result.score)
        if similarity >= self.threshold:
            return top_result.response
        return None

    def store_response(self, query: str, response: str, ttl: int = 3600):
        embedding = self._embed(query)
        # Hash the full query so long queries sharing a prefix do not collide.
        key = f"cache:{hashlib.sha256(query.encode()).hexdigest()}"
        self.redis.hset(key, mapping={
            "embedding": embedding.tobytes(),
            "response": response,
        })
        self.redis.expire(key, ttl)
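The class above queries an index named idx:queries but never creates it. The following sketch shows one way the index could be set up, assuming a Redis server with the search module loaded, hashes stored under the cache: prefix, and 384-dimensional vectors (the output size of all-MiniLM-L6-v2). The function name ensure_index, the HNSW algorithm choice, and the TEXT type for the response field are assumptions, not something the chapter specifies. It issues raw FT._LIST and FT.CREATE commands through the client's execute_command method.

```python
INDEX_NAME = "idx:queries"
KEY_PREFIX = "cache:"
VECTOR_DIM = 384  # output size of all-MiniLM-L6-v2


def ensure_index(client) -> bool:
    """Create the vector index if it is missing. Returns True if created.

    `client` is any object exposing execute_command(*args), such as a
    redis.Redis instance connected to a server with the search module.
    """
    existing = client.execute_command("FT._LIST")
    names = {n.decode() if isinstance(n, bytes) else n for n in existing}
    if INDEX_NAME in names:
        return False
    client.execute_command(
        "FT.CREATE", INDEX_NAME,
        "ON", "HASH",
        "PREFIX", 1, KEY_PREFIX,
        "SCHEMA",
        "response", "TEXT",
        # 6 = number of attribute tokens that follow (three name/value pairs).
        "embedding", "VECTOR", "HNSW", 6,
        "TYPE", "FLOAT32",
        "DIM", VECTOR_DIM,
        "DISTANCE_METRIC", "COSINE",
    )
    return True
```

Calling ensure_index(Redis(host='localhost', port=6379)) once at startup, before the first lookup, would be a natural place for it. The COSINE metric matters because the lookup code converts the returned distance to a similarity with 1 - distance.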
Failure Modes:
- Embedding drift: Cached responses become semantically misaligned with newer queries as models evolve. Mitigation requires cache TTL limits and periodic invalidation.
- False positives: A high similarity score does not guarantee that the cached answer applies to the new query. The 0.85 default is a reasonable starting point for general queries, but medical or legal queries may require 0.95.
- Memory pressure: Unbounded cache growth can exhaust available memory. Set the Redis eviction policy to allkeys-lru and monitor cache size.
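The false-positive guidance above can be expressed as a per-domain threshold lookup. This is a minimal sketch: the 0.85 and 0.95 values come from the text, while the domain labels, the DOMAIN_THRESHOLDS table, and the helper names are hypothetical.

```python
# Thresholds from the text: 0.85 by default, 0.95 for higher-risk domains.
# The domain labels themselves are illustrative assumptions.
DOMAIN_THRESHOLDS = {"default": 0.85, "medical": 0.95, "legal": 0.95}


def threshold_for(domain: str) -> float:
    """Return the similarity threshold for a domain, falling back to default."""
    return DOMAIN_THRESHOLDS.get(domain, DOMAIN_THRESHOLDS["default"])


def accept_cached_answer(similarity: float, domain: str = "default") -> bool:
    """Accept a cached answer only if it clears the domain's threshold."""
    return similarity >= threshold_for(domain)


print(accept_cached_answer(0.90))           # True: clears the 0.85 default
print(accept_cached_answer(0.90, "legal"))  # False: below the 0.95 bar
```

How a query gets its domain label (routing rules, a classifier, per-tenant configuration) is left open here.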
Cache warming on startup loads frequently accessed queries into hot memory, reducing cold-start latency by 40-60% in typical deployments.
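A warming step can be as small as replaying known query-response pairs into the cache before serving traffic. The sketch below assumes a cache object with the store_response(query, response, ttl) method shown earlier; the warm_cache function, the ResponseStore protocol, and the source of the frequent pairs (for example, an export from query logs) are assumptions.

```python
from typing import Iterable, Protocol


class ResponseStore(Protocol):
    """Anything with a store_response method, such as SemanticCache."""

    def store_response(self, query: str, response: str,
                       ttl: int = 3600) -> None: ...


def warm_cache(cache: ResponseStore,
               frequent_pairs: Iterable[tuple[str, str]],
               ttl: int = 3600) -> int:
    """Store each (query, response) pair and return how many were loaded."""
    loaded = 0
    for query, response in frequent_pairs:
        cache.store_response(query, response, ttl=ttl)
        loaded += 1
    return loaded
```

Each stored pair still pays the embedding cost once, so warming is best run before the service starts accepting requests.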
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Implement a semantic cache that stores embeddings in Redis. Test it with five query variations and measure which threshold values produce false positives for your specific use case.
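One way to approach the threshold measurement is to label each query variation by hand and sweep candidate thresholds over the recorded similarities. The sketch below is a starting point, not a full evaluation harness: the similarity scores and labels are made up for illustration, and the function names are hypothetical. In practice the scores would come from your own cache lookups.

```python
def wrong_hit_fraction(scored_pairs, threshold: float) -> float:
    """Fraction of cache hits at this threshold that were false positives.

    scored_pairs: iterable of (similarity, same_intent) tuples, where
    same_intent is a manual label for whether the cached answer applies.
    """
    hits = [same_intent for sim, same_intent in scored_pairs if sim >= threshold]
    if not hits:
        return 0.0  # no hits, so no false positives
    return sum(1 for same_intent in hits if not same_intent) / len(hits)


def sweep(scored_pairs, thresholds=(0.75, 0.80, 0.85, 0.90, 0.95)):
    """Map each candidate threshold to its wrong-hit fraction."""
    pairs = list(scored_pairs)
    return {t: wrong_hit_fraction(pairs, t) for t in thresholds}


# Five query variations scored against one cached query (made-up values).
labelled = [
    (0.97, True),
    (0.91, True),
    (0.88, False),  # similar wording, different intent
    (0.82, True),
    (0.78, False),
]

for threshold, fraction in sweep(labelled).items():
    print(f"threshold={threshold:.2f} wrong-hit fraction={fraction:.2f}")
```

Raising the threshold also lowers the hit rate, so it is worth recording how many correct hits are lost at each step alongside the wrong-hit fraction.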