08. Unified API Layer
The unified API layer presents a consistent interface that hides backend heterogeneity from consuming applications. Clients call this facade without handling routing logic, backend availability, or protocol translation themselves. Because clients depend only on the facade, operators can add, swap, or retire backends without forcing client-side changes.
Contract definition establishes the shared language between clients and the routing infrastructure. Request schemas specify expected fields, types, and validation rules. Response schemas guarantee a consistent output structure regardless of which backend served the request. Breaking changes require a version increment, and the previous version stays available during the migration period so existing clients keep working.
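A minimal sketch of contract enforcement at the API boundary. The `validate_request` helper, the `REQUEST_SCHEMA` table, and the `SUPPORTED_VERSIONS` table are hypothetical names introduced for illustration, and the rules are deliberately simple; a production system would more likely rely on a schema library.

```python
from typing import Any

# Hypothetical version table: v1 stays served while clients migrate to v2.
SUPPORTED_VERSIONS = {"v1": {"deprecated": True}, "v2": {"deprecated": False}}

# Field name -> (accepted type or tuple of types, required?). Illustrative only.
REQUEST_SCHEMA: dict[str, tuple[Any, bool]] = {
    "model": (str, True),
    "prompt": (str, True),
    "max_tokens": (int, False),
    "temperature": ((int, float), False),
}


def validate_request(version: str, payload: dict[str, Any]) -> list[str]:
    """Return a list of contract violations; an empty list means valid."""
    if version not in SUPPORTED_VERSIONS:
        return [f"unsupported API version: {version}"]
    errors: list[str] = []
    for name, (expected, required) in REQUEST_SCHEMA.items():
        if name not in payload:
            if required:
                errors.append(f"missing required field: {name}")
            continue
        value = payload[name]
        # bool is a subclass of int, so reject it explicitly; no field here
        # accepts a boolean.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"wrong type for field: {name}")
    # Unknown fields are reported rather than silently ignored.
    unknown = sorted(set(payload) - set(REQUEST_SCHEMA))
    errors.extend(f"unknown field: {name}" for name in unknown)
    return errors
```

Returning every violation at once, instead of failing on the first, lets clients fix a malformed request in a single round trip.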
Protocol translation bridges client-side expectations with backend capabilities. Some providers expect OpenAI-compatible payloads while others require vendor-specific formats. The API layer transforms requests based on target backend requirements. Response normalization reconciles divergent response structures into the unified schema.
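```python
import json
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, AsyncIterator

import httpx


@dataclass
class InferenceRequest:
    """Unified request format consumed by the API layer."""
    model: str
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7
    stream: bool = False
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class InferenceResponse:
    """Unified response format returned by the API layer."""
    content: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float
    finish_reason: str


class BackendAdapter(ABC):
    """Abstract interface for backend-specific request handling."""

    @abstractmethod
    async def complete(self, request: InferenceRequest) -> InferenceResponse:
        """Execute inference and return normalized response."""

    @abstractmethod
    def stream(self, request: InferenceRequest) -> AsyncIterator[str]:
        """Stream inference; implementations are async generators of content chunks."""


class OpenAIAdapter(BackendAdapter):
    """Adapter for OpenAI-compatible backend API."""

    def __init__(self, base_url: str, api_key: str):
        self.client = httpx.AsyncClient(
            base_url=base_url,
            headers={"Authorization": f"Bearer {api_key}"},
        )

    def _payload(self, request: InferenceRequest, stream: bool) -> dict[str, Any]:
        # Translate the unified request into a chat-completions body.
        return {
            "model": request.model,
            "messages": [{"role": "user", "content": request.prompt}],
            "max_tokens": request.max_tokens,
            "temperature": request.temperature,
            "stream": stream,
        }

    async def complete(self, request: InferenceRequest) -> InferenceResponse:
        started = time.perf_counter()
        response = await self.client.post(
            "/v1/chat/completions", json=self._payload(request, stream=False)
        )
        response.raise_for_status()
        latency_ms = (time.perf_counter() - started) * 1000
        return self._normalize_response(response.json(), latency_ms)

    async def stream(self, request: InferenceRequest) -> AsyncIterator[str]:
        # Sketch so the adapter is instantiable; assumes OpenAI-style "data: ..." events.
        async with self.client.stream(
            "POST", "/v1/chat/completions", json=self._payload(request, stream=True)
        ) as response:
            response.raise_for_status()
            async for line in response.aiter_lines():
                if not line.startswith("data: ") or line == "data: [DONE]":
                    continue
                choices = json.loads(line.removeprefix("data: ")).get("choices") or []
                delta = choices[0].get("delta", {}) if choices else {}
                if delta.get("content"):
                    yield delta["content"]

    def _normalize_response(self, payload: dict[str, Any], latency_ms: float) -> InferenceResponse:
        # Sketch: assumes the standard chat-completions response shape.
        choice = payload["choices"][0]
        usage = payload.get("usage") or {}
        return InferenceResponse(
            content=choice["message"].get("content") or "",
            model=payload.get("model", ""),
            prompt_tokens=usage.get("prompt_tokens", 0),
            completion_tokens=usage.get("completion_tokens", 0),
            latency_ms=latency_ms,
            finish_reason=choice.get("finish_reason") or "unknown",
        )
```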
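Response normalization for a backend that is not OpenAI-compatible might look like the following sketch. The vendor payload shape (`output.text`, `model_id`, `tokens`, `stop`) and the `FINISH_REASON_MAP` table are hypothetical, invented purely for illustration; only the target field names come from the unified response format above.

```python
from typing import Any

# Hypothetical mapping from one vendor's stop reasons to the unified vocabulary.
FINISH_REASON_MAP = {"end_turn": "stop", "token_limit": "length"}


def normalize_vendor_response(payload: dict[str, Any], latency_ms: float) -> dict[str, Any]:
    """Map a hypothetical vendor payload onto the unified response fields."""
    tokens = payload.get("tokens", {})
    stop = payload.get("stop", "unknown")
    return {
        "content": payload["output"]["text"],
        "model": payload["model_id"],
        "prompt_tokens": tokens.get("input", 0),
        "completion_tokens": tokens.get("output", 0),
        "latency_ms": latency_ms,
        # Unmapped reasons pass through unchanged instead of being dropped.
        "finish_reason": FINISH_REASON_MAP.get(stop, stop),
    }
```

Keeping the vocabulary mapping in a table makes it straightforward to review when a provider adds a new stop reason.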
Authentication delegation lets the API layer authenticate to backends on the client's behalf while maintaining security boundaries. Bearer token inspection extracts the caller's identity for logging. Backend-specific authentication schemes are injected during request forwarding. API key rotation happens centrally rather than in consuming applications.
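One way to sketch this flow, assuming the API layer swaps the client's bearer token for a centrally managed backend credential. The `BACKEND_CREDENTIALS` table, the backend names, the key values, and both helper functions are hypothetical; a real deployment would read credentials from a secrets manager and verify the token rather than merely fingerprint it.

```python
import hashlib

# Hypothetical central credential table; rotating a key means changing it here only.
BACKEND_CREDENTIALS = {
    "openai": {"header": "Authorization", "value": "Bearer backend-key-1"},
    "vendor-x": {"header": "x-api-key", "value": "backend-key-2"},
}
CREDENTIAL_HEADERS = {"authorization", "x-api-key"}


def client_identity(headers: dict[str, str]) -> str:
    """Derive a loggable fingerprint from the bearer token without exposing it."""
    auth = next((v for k, v in headers.items() if k.lower() == "authorization"), "")
    scheme, _, token = auth.partition(" ")
    if scheme.lower() != "bearer" or not token:
        raise PermissionError("missing or malformed bearer token")
    return hashlib.sha256(token.encode()).hexdigest()[:12]


def forward_headers(client_headers: dict[str, str], backend: str) -> tuple[str, dict[str, str]]:
    """Return (client fingerprint, headers to send to the chosen backend)."""
    identity = client_identity(client_headers)
    credential = BACKEND_CREDENTIALS[backend]
    # Strip client credentials so they never cross the boundary to a backend.
    headers = {k: v for k, v in client_headers.items() if k.lower() not in CREDENTIAL_HEADERS}
    headers[credential["header"]] = credential["value"]
    return identity, headers
```

The fingerprint goes to the request log; the raw client token is neither logged nor forwarded.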
Rate limiting coordination prevents quota exhaustion across multiple backends. Client-level limits aggregate a client's consumption across all backend calls. Per-model limits address provider-specific restrictions. Bucket algorithms such as token bucket enforce a steady-state rate while capping burst size, so short spikes do not exhaust quota.
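The two limit levels can be sketched with a token bucket, one common bucket algorithm. The `TokenBucket` class and the `allow` function are hypothetical names, the rates are arbitrary, and the sketch is single-process and not thread-safe; coordinating limits across several router instances would need shared state.

```python
import time
from typing import Callable


class TokenBucket:
    """Refills at `rate` tokens per second, holding at most `capacity` tokens."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def refill(self) -> None:
        # Credit tokens for the time elapsed since the last update, up to capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now


def allow(client: TokenBucket, model: TokenBucket, cost: float = 1.0) -> bool:
    """Admit a call only if both the client-level and per-model buckets can pay."""
    client.refill()
    model.refill()
    # Check both before spending, so a rejection does not drain either bucket.
    if client.tokens < cost or model.tokens < cost:
        return False
    client.tokens -= cost
    model.tokens -= cost
    return True
```

The injectable `clock` keeps the limiter deterministic under test; `capacity` bounds the burst size while `rate` sets the sustained throughput.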
Documentation generation pairs the API surface with machine-readable specifications. OpenAPI schemas enable client code generation. Postman collections simplify integration testing. Interactive documentation portals demonstrate usage patterns and provide example requests.
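As a rough illustration of deriving a machine-readable specification from the request contract, the sketch below builds a minimal OpenAPI 3.0 document from a dataclass using only the standard library. The `schema_for` and `openapi_document` helpers, the `/v1/completions` path, and the API title are assumptions; the dataclass is a trimmed copy of the unified request with `metadata` omitted. Frameworks that generate OpenAPI automatically would normally replace this hand-rolled approach.

```python
from dataclasses import MISSING, dataclass, fields
from typing import Any

# Only the primitive types used by the request fields are mapped.
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


@dataclass
class InferenceRequest:
    model: str
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7
    stream: bool = False


def schema_for(cls: type) -> dict[str, Any]:
    """Build a JSON Schema object from a dataclass with primitive fields."""
    properties: dict[str, Any] = {}
    required: list[str] = []
    for f in fields(cls):
        prop: dict[str, Any] = {"type": JSON_TYPES[f.type]}
        if f.default is MISSING and f.default_factory is MISSING:
            required.append(f.name)  # no default means the client must send it
        elif f.default is not MISSING:
            prop["default"] = f.default
        properties[f.name] = prop
    return {"type": "object", "properties": properties, "required": required}


def openapi_document(cls: type, path: str = "/v1/completions") -> dict[str, Any]:
    """Wrap the schema in a minimal OpenAPI 3.0 document with one POST operation."""
    name = cls.__name__
    return {
        "openapi": "3.0.3",
        "info": {"title": "Unified Inference API", "version": "1.0.0"},
        "paths": {
            path: {
                "post": {
                    "requestBody": {
                        "required": True,
                        "content": {
                            "application/json": {
                                "schema": {"$ref": f"#/components/schemas/{name}"}
                            }
                        },
                    },
                    "responses": {"200": {"description": "Normalized completion"}},
                }
            }
        },
        "components": {"schemas": {name: schema_for(cls)}},
    }
```

Serializing the result with `json.dumps(openapi_document(InferenceRequest), indent=2)` yields a document that client generators and documentation portals can consume.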
Document the current interface between your inference clients and routing infrastructure. Identify translation points where protocol conversion occurs and evaluate whether these boundaries align with logical abstraction seams.