RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo

© 2026 runlocalai.co · Independently operated
Local AI APIs and Integration

16. Caching Layer

Chapter 16 of 18 · 15 min
KEY INSIGHT

Caching eliminates redundant inference requests: cache keys based on request hashes enable sub-millisecond responses for repeated identical inputs.

Inference requests are computationally expensive. With deterministic decoding settings (temperature 0 or a fixed seed), identical inputs to a large language model produce effectively identical outputs, so caching these responses saves the GPU cycles for requests that have been answered before. With sampling enabled, a cached response is one valid sample rather than the only possible output.

Redis is a common caching backend for API deployments. TTL policies balance cache freshness against storage costs. Cache invalidation strategies determine when stored responses are treated as stale and removed.

```python
import hashlib
import json
from typing import Optional

import redis.asyncio as redis
from fastapi import Depends
from pydantic import BaseModel


class CachedRequest(BaseModel):
    """The request fields that determine the model's output."""
    model: str
    messages: list[dict]
    temperature: float


class InferenceCache:
    def __init__(self, redis_url: str, ttl_seconds: int = 3600):
        self.redis_url = redis_url
        self.ttl = ttl_seconds

    async def __aenter__(self):
        self.client = await redis.from_url(self.redis_url)
        return self

    async def __aexit__(self, *args):
        # aclose() is available in redis-py 5.0.1+; older versions use close().
        await self.client.aclose()

    def _make_key(self, request: CachedRequest) -> str:
        # sort_keys gives a stable serialization, so equal requests hash equally.
        canonical = json.dumps(request.model_dump(), sort_keys=True)
        hash_value = hashlib.sha256(canonical.encode()).hexdigest()[:16]
        return f"inference:v1:{request.model}:{hash_value}"

    async def get(self, request: CachedRequest) -> Optional[dict]:
        key = self._make_key(request)
        cached = await self.client.get(key)
        return json.loads(cached) if cached else None

    async def set(self, request: CachedRequest, response: dict) -> None:
        key = self._make_key(request)
        # SETEX stores the value and its TTL in one command.
        await self.client.setex(key, self.ttl, json.dumps(response))


# Usage in an endpoint. `app`, `CompletionRequest`, `generate_completion`, and
# the `get_cache` dependency provider are assumed to be defined elsewhere.
@app.post("/v1/chat/completions")
async def completions(
    request: CompletionRequest,
    cache: InferenceCache = Depends(get_cache),
):
    # Keep only the fields that affect the output; extra fields are ignored.
    cache_request = CachedRequest.model_validate(request.model_dump())

    cached_response = await cache.get(cache_request)
    if cached_response is not None:
        cached_response["cached"] = True
        return cached_response

    # Cache miss: generate the response and store it for next time.
    response = await generate_completion(request)
    await cache.set(cache_request, response)
    return response
```

Cache hit rates above 80% typically indicate effective caching strategies. Monitor cache effectiveness with metrics that track hit ratio, memory usage, and eviction rate. High eviction rates suggest insufficient cache capacity; frequent expirations of entries that are requested again soon afterwards suggest TTL values that are too short.

Consider semantic caching for near-identical requests. Requests differing only in whitespace or formatting should hit the cache; normalize inputs before computing cache keys to maximize hit rates.
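The normalization step can be sketched as follows. This is a minimal illustration, not the chapter's implementation: `normalize_messages` and `make_normalized_key` are hypothetical helper names, and the rules chosen here (collapsing runs of whitespace, trimming the model name, coercing temperature to a float) are assumptions about what counts as "formatting only" for a given workload.

```python
import hashlib
import json
import re

_WHITESPACE = re.compile(r"\s+")


def normalize_messages(messages: list[dict]) -> list[dict]:
    """Return copies of the messages with whitespace collapsed in string content."""
    normalized = []
    for message in messages:
        item = dict(message)  # copy so the caller's request is not mutated
        if isinstance(item.get("content"), str):
            item["content"] = _WHITESPACE.sub(" ", item["content"]).strip()
        normalized.append(item)
    return normalized


def make_normalized_key(model: str, messages: list[dict], temperature: float) -> str:
    """Build a cache key from normalized request fields (hypothetical helper)."""
    payload = {
        "model": model.strip(),
        "messages": normalize_messages(messages),
        # float() makes 0 and 0.0 serialize identically.
        "temperature": float(temperature),
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode()).hexdigest()[:16]
    return f"inference:v1:{payload['model']}:{digest}"
```

Collapsing whitespace is only safe when it does not change the meaning of the prompt. Prompts that contain code, tables, or other whitespace-sensitive text may need a gentler rule, such as trimming leading and trailing whitespace only.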

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
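One way to capture the version and runtime details for the checkpoint is a small script like the one below. It is a sketch using only the standard library; `environment_record` is a hypothetical helper, and the default package list is an assumption based on the imports used in this chapter.

```python
import json
import platform
import sys
from importlib import metadata


def environment_record(
    packages: tuple[str, ...] = ("redis", "pydantic", "fastapi"),
) -> dict:
    """Collect runtime and package versions to store next to the observed output."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": versions,
    }


if __name__ == "__main__":
    print(json.dumps(environment_record(), indent=2))
```

Saving this record alongside the exercise output makes it easier to tell later whether a different result came from the code or from the environment.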

EXERCISE

Implement a cache invalidation mechanism that removes entries for a specific model. Add a DELETE /v1/cache/{model} endpoint that clears all cached completions for the specified model.
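As a starting point, the key-removal part of the exercise might look like the sketch below. It assumes the `inference:v1:{model}:{hash}` key format from this chapter and a `redis.asyncio` client; `invalidate_model` is a hypothetical helper, and wiring it into the DELETE endpoint is left as the exercise.

```python
async def invalidate_model(client, model: str, batch_size: int = 500) -> int:
    """Delete cached completions for one model and return how many keys were removed.

    `client` is assumed to be a redis.asyncio.Redis instance, or any object
    with compatible `scan_iter` and `delete` methods.
    """
    pattern = f"inference:v1:{model}:*"
    removed = 0
    batch = []
    # SCAN walks the keyspace incrementally; KEYS would block the server.
    async for key in client.scan_iter(match=pattern, count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            removed += await client.delete(*batch)
            batch.clear()
    if batch:
        removed += await client.delete(*batch)
    return removed
```

Two caveats apply to this sketch. The model name is placed directly into a glob pattern, so names containing characters such as `*`, `?`, or `[` would need escaping. Model names that themselves contain a colon can also overlap: the pattern for `llama3` would match keys stored for `llama3:8b`.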
