08. Multi-Model Gateway

Chapter 8 of 18 · 20 min

KEY INSIGHT

A gateway that routes requests to different local models based on the model identifier allows a single API endpoint to serve multiple use cases. The routing logic should be abstracted so that adding new models requires only configuration changes, not code changes.

### Architecture Overview

```
Client Request (model: "llama3.2")
        │
        ▼
   Gateway Layer
        │
        ├──► Model Registry
        │        │
        │    "llama3.2" → /models/llama3.2
        │    "mistral"  → /models/mistral
        │
        ▼
  Inference Engines
        │
        ├──► vLLM Engine
        └──► Ollama Engine
```

The gateway receives all requests, looks up the model in a registry, and forwards the request to the appropriate engine. The client never knows which engine handles its request.

### Model Registry

```python
from pydantic import BaseModel


class ModelConfig(BaseModel):
    name: str
    engine: str  # "vllm" or "ollama"
    endpoint: str
    max_tokens: int
    supports_streaming: bool


# Maps each model identifier to its serving configuration.
MODEL_REGISTRY: dict[str, ModelConfig] = {}


def register_model(config: ModelConfig) -> None:
    MODEL_REGISTRY[config.name] = config


register_model(ModelConfig(
    name="llama3.2:latest",
    engine="ollama",
    endpoint="http://localhost:11434",
    max_tokens=4096,
    supports_streaming=True,
))
```

The registry maps model identifiers to their serving configuration. Add new models by calling `register_model()` with their configuration. The registered name must match the identifier that clients send exactly: the entry above is keyed as `llama3.2:latest`, so a request for `llama3.2` would not find it.

### Request Routing

```python
# Assumed import: the status_code/detail arguments match FastAPI's HTTPException.
from fastapi import HTTPException


async def route_request(model: str, request_data: dict):
    if model not in MODEL_REGISTRY:
        raise HTTPException(status_code=404, detail=f"Model '{model}' not found")

    config = MODEL_REGISTRY[model]

    # ollama_generate and vllm_generate are the per-engine client functions,
    # defined elsewhere in the gateway.
    if config.engine == "ollama":
        return await ollama_generate(config.endpoint, request_data)
    elif config.engine == "vllm":
        return await vllm_generate(config.endpoint, request_data)

    # Without this, an unrecognized engine value would silently return None.
    raise HTTPException(status_code=500, detail=f"Unsupported engine '{config.engine}'")
```

The routing function selects the correct inference engine based on model configuration. Adding support for new engines requires adding a new branch and a new client function.
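One way to keep engine selection closer to configuration is to replace the `if`/`elif` chain with a lookup table of client functions. The following is a minimal sketch under stated assumptions, not the chapter's implementation: `ENGINE_CLIENTS`, `RoutingError`, and the two `fake_*_generate` functions are hypothetical names, a plain dataclass stands in for the pydantic `ModelConfig`, and the endpoints are placeholders. Real clients would issue HTTP requests to the engine.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass(frozen=True)
class ModelConfig:
    # Stand-in for the chapter's pydantic model, trimmed to what routing needs.
    name: str
    engine: str
    endpoint: str


# Hypothetical engine clients that echo their inputs instead of calling an engine.
async def fake_ollama_generate(endpoint: str, request_data: dict) -> dict:
    return {"engine": "ollama", "endpoint": endpoint, "echo": request_data}


async def fake_vllm_generate(endpoint: str, request_data: dict) -> dict:
    return {"engine": "vllm", "endpoint": endpoint, "echo": request_data}


EngineClient = Callable[[str, dict], Awaitable[dict]]

# Supporting a new engine means adding one entry here, not another elif branch.
ENGINE_CLIENTS: dict[str, EngineClient] = {
    "ollama": fake_ollama_generate,
    "vllm": fake_vllm_generate,
}

# Placeholder endpoints, chosen only for illustration.
MODEL_REGISTRY: dict[str, ModelConfig] = {
    "llama3.2": ModelConfig("llama3.2", "ollama", "http://localhost:11434"),
    "mistral": ModelConfig("mistral", "vllm", "http://localhost:8000"),
}


class RoutingError(LookupError):
    """Raised when a model or its engine cannot be resolved."""


async def route_request(model: str, request_data: dict) -> dict:
    config = MODEL_REGISTRY.get(model)
    if config is None:
        raise RoutingError(f"Model '{model}' not found")
    client = ENGINE_CLIENTS.get(config.engine)
    if client is None:
        raise RoutingError(f"No client for engine '{config.engine}'")
    return await client(config.endpoint, request_data)


if __name__ == "__main__":
    result = asyncio.run(route_request("mistral", {"prompt": "hi"}))
    print(result["engine"], result["endpoint"])
```

With this layout an unknown engine name fails loudly instead of falling through, and a web framework layer can translate `RoutingError` into whatever HTTP status it prefers.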
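### Failure Modes

Registry lookups are exact string matches, so a request for a model that is actually registered fails with a misleading "not found" error if case or tag suffixes (`Llama3.2` vs. `llama3.2`, or `llama3.2` vs. `llama3.2:latest`) are not handled consistently. A model marked as supporting streaming will return 500 errors if the actual engine does not support it. Unreachable engine endpoints leave requests hanging until they time out unless the gateway sets explicit timeouts and implements connection pooling and retry logic.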
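The guards below sketch one possible defense against each of these failure modes, using only the standard library. Everything here is an assumption made for illustration: `normalize`, `resolve`, and `call_with_retry` are hypothetical helpers, the case-insensitive naming policy is a choice rather than a requirement, and a trimmed dataclass stands in for the pydantic `ModelConfig`. The streaming check only rejects requests the registry already knows are unsupported; it cannot detect a registry entry that wrongly claims streaming support, which would need a probe against the engine itself.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass(frozen=True)
class ModelConfig:
    name: str
    supports_streaming: bool


def normalize(model: str) -> str:
    # Assumed policy: identifiers are case-insensitive and ignore outer whitespace.
    return model.strip().lower()


MODEL_REGISTRY: dict[str, ModelConfig] = {}


def register_model(config: ModelConfig) -> None:
    # Normalize on write so that lookups can normalize on read.
    MODEL_REGISTRY[normalize(config.name)] = config


def resolve(model: str, wants_stream: bool) -> ModelConfig:
    config = MODEL_REGISTRY.get(normalize(model))
    if config is None:
        raise KeyError(f"Model '{model}' not found")
    if wants_stream and not config.supports_streaming:
        # Reject up front so the client gets a clear error, not an engine failure.
        raise ValueError(f"Model '{config.name}' does not support streaming")
    return config


async def call_with_retry(
    call: Callable[[], Awaitable[dict]], *, timeout_s: float, attempts: int
) -> dict:
    # Bound each attempt with a timeout and retry a fixed number of times.
    last_error: Exception | None = None
    for _ in range(attempts):
        try:
            return await asyncio.wait_for(call(), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
    raise RuntimeError(f"Engine unreachable after {attempts} attempts") from last_error
```

A production gateway would likely add backoff between attempts and reuse a pooled HTTP client; both are omitted here to keep the sketch short.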

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
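A small helper can capture those details in one place. This is a sketch using only the standard library; `run_record` and `package_version` are hypothetical names, and the package list in the usage line is just an example of what this chapter's code imports.

```python
import json
import platform
import sys
from importlib import metadata
from pathlib import Path


def package_version(name: str) -> str:
    # Report a missing package explicitly instead of raising.
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def run_record(packages: list[str], data_path: str, observed_output: str) -> dict:
    # Collect the runtime, package versions, data path, and observed output.
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in packages},
        "data_path": str(Path(data_path).resolve()),
        "observed_output": observed_output,
    }


if __name__ == "__main__":
    record = run_record(["pydantic", "fastapi"], ".", "<paste observed output here>")
    print(json.dumps(record, indent=2))
```

Constraints such as model size or GPU backend are not detected automatically; add them to the record by hand when they affect the result.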

EXERCISE

Create a gateway that routes requests for "llama3.2" to one mock endpoint and "mistral" to a different mock endpoint. Verify that sending requests with different model identifiers reaches the correct destination.