16. Caching Layer

Chapter 16 of 18 · 15 min

KEY INSIGHT

Caching eliminates redundant inference requestsΓÇöcache keys based on request hashes enable sub-millisecond responses for repeated identical inputs. Inference requests are computationally expensive. Identical inputs to large language models produce identical outputs. Caching these responses eliminates GPU cycles for requests that have been answered before. Redis serves as the caching backend for most API deployments. TTL policies balance cache freshness against storage costs. Cache invalidation strategies determine when stored responses become stale. ```python import hashlib import json import redis.asyncio as redis from typing import Optional from pydantic import BaseModel class CachedRequest(BaseModel): model: str messages: list[dict] temperature: float class InferenceCache: def __init__(self, redis_url: str, ttl_seconds: int = 3600): self.redis_url = redis_url self.ttl = ttl_seconds async def __aenter__(self): self.client = await redis.from_url(self.redis_url) return self async def __aexit__(self, *args): await self.client.close() def _make_key(self, request: CachedRequest) -> str: canonical = json.dumps(request.model_dump(), sort_keys=True) hash_value = hashlib.sha256(canonical.encode()).hexdigest()[:16] return f"inference:v1:{request.model}:{hash_value}" async def get(self, request: CachedRequest) -> Optional[dict]: key = self._make_key(request) cached = await self.client.get(key) return json.loads(cached) if cached else None async def set(self, request: CachedRequest, response: dict) -> None: key = self._make_key(request) await self.client.setex(key, self.ttl, json.dumps(response)) # Usage in endpoint @app.post("/v1/chat/completions") async def completions(request: CompletionRequest, cache: InferenceCache): cached_response = await cache.get(CachedRequest.model_validate(request)) if cached_response: cached_response["cached"] = True return cached_response # Generate response response = await generate_completion(request) await cache.set(CachedRequest.model_validate(request), response) return response ``` Cache hit rates above 80% typically indicate effective caching strategies. Monitor cache effectiveness with metrics tracking hit ratio, memory usage, and eviction rates. High eviction rates suggest insufficient cache capacity or TTL values that are too short. Consider semantic caching for near-identical requests. Requests differing only in whitespace or formatting should hit the cache. Normalize inputs before computing cache keys to maximize hit rates.

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Implement a cache invalidation mechanism that removes entries for a specific model. Add a DELETE /v1/cache/{model} endpoint that clears all cached completions for the specified model.