Unified API Layer — Hybrid Local-Cloud AI Architecture (Chapter 8)

The unified API layer presents a consistent interface that hides backend heterogeneity from consuming applications. Clients interact with this facade without handling routing logic, backend availability, or protocol translation themselves. Because of this boundary, operators can add, replace, or retire backends without forcing changes in client code.
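
The facade idea can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `UnifiedClient` that holds an ordered list of backends and falls through to the next one when a backend is unreachable. The class name, the `Backend` callable type, and the fallback policy are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical backend type: takes a prompt, returns text,
# and raises ConnectionError when the backend is unreachable.
Backend = Callable[[str], str]

class UnifiedClient:
    """Facade: callers see one `complete` method, never the backend list."""

    def __init__(self, backends: list[tuple[str, Backend]]):
        self._backends = backends  # ordered by preference

    def complete(self, prompt: str) -> str:
        errors: list[str] = []
        for name, backend in self._backends:
            try:
                return backend(prompt)
            except ConnectionError as exc:
                errors.append(f"{name}: {exc}")  # remember why, try the next one
        raise RuntimeError("all backends failed: " + "; ".join(errors))

def local_backend(prompt: str) -> str:
    # Stand-in for a local model server that happens to be down.
    raise ConnectionError("local model server is not running")

def cloud_backend(prompt: str) -> str:
    # Stand-in for a hosted provider.
    return f"cloud answer to: {prompt}"

client = UnifiedClient([("local", local_backend), ("cloud", cloud_backend)])
result = client.complete("hello")  # served by the cloud backend after local fails
```

The caller never learns which backend answered, which is the property the facade is meant to provide.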

Contract definition establishes the shared language between clients and the routing infrastructure. Request schemas specify expected fields, types, and validation rules. Response schemas guarantee a consistent output structure regardless of which backend is selected. Breaking changes require a new version, and the previous version stays available during the migration period so that existing clients keep working.
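
As an illustration of such a contract, the sketch below validates a raw request body and accepts two versions side by side. The version names, the `max_length` to `max_tokens` rename, and the temperature range are hypothetical choices made for the example.

```python
from dataclasses import dataclass

# Hypothetical: v1 stays available while clients migrate to v2.
SUPPORTED_VERSIONS = {"v1", "v2"}

@dataclass(frozen=True)
class CompletionRequestV2:
    model: str
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7

def parse_request(version: str, body: dict) -> CompletionRequestV2:
    """Validate a raw JSON body against the contract for the given version."""
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported API version: {version}")
    data = dict(body)
    if version == "v1" and "max_length" in data:
        # Hypothetical breaking change: v2 renamed `max_length` to `max_tokens`.
        data["max_tokens"] = data.pop("max_length")
    for name in ("model", "prompt"):
        if not isinstance(data.get(name), str) or not data[name]:
            raise ValueError(f"'{name}' must be a non-empty string")
    max_tokens = data.get("max_tokens", 256)
    if isinstance(max_tokens, bool) or not isinstance(max_tokens, int) or max_tokens <= 0:
        raise ValueError("'max_tokens' must be a positive integer")
    temperature = data.get("temperature", 0.7)
    if isinstance(temperature, bool) or not isinstance(temperature, (int, float)):
        raise ValueError("'temperature' must be a number")
    if not 0.0 <= temperature <= 2.0:  # assumed range for this sketch
        raise ValueError("'temperature' must be between 0.0 and 2.0")
    return CompletionRequestV2(data["model"], data["prompt"], max_tokens, float(temperature))

# A v1 client still using the old field name is upgraded transparently.
legacy = parse_request("v1", {"model": "m", "prompt": "hi", "max_length": 64})
current = parse_request("v2", {"model": "m", "prompt": "hi"})
```

Both calls return the same internal type, so the routing code behind the contract only ever handles one request shape.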

Protocol translation bridges client-side expectations with backend capabilities. Some providers expect OpenAI-compatible payloads while others require vendor-specific formats. The API layer transforms requests based on target backend requirements. Response normalization reconciles divergent response structures into the unified schema.

python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, AsyncIterator
import httpx

@dataclass
class InferenceRequest:
    """Unified request format consumed by the API layer."""
    model: str
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7
    stream: bool = False
    # Free-form per-request data (for example, tracing IDs); one fresh dict per instance.
    metadata: dict[str, Any] = field(default_factory=dict)

@dataclass
class InferenceResponse:
    """Unified response format returned by the API layer."""
    content: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float
    finish_reason: str

class BackendAdapter(ABC):
    """Abstract interface for backend-specific request handling."""
    
    @abstractmethod
    async def complete(self, request: InferenceRequest) -> InferenceResponse:
        """Execute inference and return normalized response."""
        pass
    
    @abstractmethod
    async def stream(self, request: InferenceRequest) -> AsyncIterator[str]:
        """Execute inference with streaming and yield content chunks."""
        pass

class OpenAIAdapter(BackendAdapter):
    """Adapter for OpenAI-compatible backend API."""
    
    def __init__(self, base_url: str, api_key: str):
        self.client = httpx.AsyncClient(
            base_url=base_url,
            headers={"Authorization": f"Bearer {api_key}"}
        )
    
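    async def complete(self, request: InferenceRequest) -> InferenceResponse:
        # Translate the unified request into an OpenAI-style chat payload.
        response = await self.client.post("/v1/chat/completions", json={
            "model": request.model,
            "messages": [{"role": "user", "content": request.prompt}],
            "max_tokens": request.max_tokens,
            "temperature": request.temperature
        })
        response.raise_for_status()
        # httpx records the round-trip time once the response has been read.
        latency_ms = response.elapsed.total_seconds() * 1000
        return self._normalize_response(response.json(), latency_ms)

    async def stream(self, request: InferenceRequest) -> AsyncIterator[str]:
        # Streaming is omitted from this example; the method is defined so that
        # the adapter satisfies the abstract interface and can be instantiated.
        raise NotImplementedError("streaming is not implemented in this example")
        yield ""  # unreachable; marks this method as an async generator

    def _normalize_response(self, payload: dict[str, Any], latency_ms: float) -> InferenceResponse:
        """Map an OpenAI-style chat completion payload onto the unified schema."""
        choice = payload["choices"][0]
        usage = payload.get("usage") or {}
        return InferenceResponse(
            content=choice["message"]["content"],
            model=payload.get("model", ""),
            prompt_tokens=usage.get("prompt_tokens", 0),
            completion_tokens=usage.get("completion_tokens", 0),
            latency_ms=latency_ms,
            finish_reason=choice.get("finish_reason") or "",
        )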
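
The adapter above covers the OpenAI-compatible case. For a backend with its own wire format, the same translation and normalization can be sketched as two pure functions. The vendor format used here (`model_id`, `input_text`, `generation_config`, `outputs`, `token_counts`, `stop_reason`) is invented for illustration and does not correspond to a real provider.

```python
from typing import Any

def to_vendor_payload(model: str, prompt: str, max_tokens: int, temperature: float) -> dict[str, Any]:
    """Translate unified request fields into the hypothetical vendor format."""
    return {
        "model_id": model,
        "input_text": prompt,
        "generation_config": {"max_new_tokens": max_tokens, "temperature": temperature},
    }

# Map the hypothetical vendor stop reasons onto the unified vocabulary.
FINISH_REASONS = {"eos": "stop", "max_tokens": "length"}

def from_vendor_payload(payload: dict[str, Any]) -> dict[str, Any]:
    """Normalize the hypothetical vendor response into unified field names."""
    output = payload["outputs"][0]
    return {
        "content": output["text"],
        "model": payload["model_id"],
        "prompt_tokens": payload["token_counts"]["input"],
        "completion_tokens": payload["token_counts"]["output"],
        "finish_reason": FINISH_REASONS.get(output["stop_reason"], "unknown"),
    }

outgoing = to_vendor_payload("vendor-large", "Hello", max_tokens=32, temperature=0.2)
normalized = from_vendor_payload({
    "model_id": "vendor-large",
    "outputs": [{"text": "Hi there", "stop_reason": "eos"}],
    "token_counts": {"input": 1, "output": 2},
})
```

Keeping the mapping in pure functions makes each direction easy to unit test without a network call.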

Authentication delegation lets the API layer authenticate to backends on the client's behalf while maintaining security boundaries. The layer inspects the client's bearer token to extract an identity for logging, then injects the backend-specific authentication scheme when it forwards the request. Because backend API keys live in the API layer, key rotation happens centrally rather than in each consuming application.
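
A minimal sketch of this flow follows, assuming the layer swaps the client's token for a backend credential rather than forwarding it verbatim. The token table, the backend names, and the `X-Api-Key` header are hypothetical; a real deployment would verify tokens properly and read keys from a secrets manager.

```python
# Hypothetical stores for the example only.
CLIENT_TOKENS = {"tok-alice": "alice"}
BACKEND_AUTH = {
    "openai": lambda: {"Authorization": "Bearer backend-key-1"},
    "vendor_x": lambda: {"X-Api-Key": "backend-key-2"},
}

def forward_headers(client_headers: dict[str, str], backend: str) -> tuple[str, dict[str, str]]:
    """Return (client identity, headers to send to the chosen backend)."""
    scheme, _, token = client_headers.get("Authorization", "").partition(" ")
    if scheme.lower() != "bearer" or token not in CLIENT_TOKENS:
        raise PermissionError("missing or invalid bearer token")
    identity = CLIENT_TOKENS[token]  # used for logging and per-client limits
    # Drop the client's credential so it never crosses the boundary to the backend.
    headers = {k: v for k, v in client_headers.items() if k.lower() != "authorization"}
    # Inject whatever scheme the target backend expects.
    headers.update(BACKEND_AUTH[backend]())
    return identity, headers

identity, headers = forward_headers(
    {"Authorization": "Bearer tok-alice", "Accept": "application/json"},
    backend="vendor_x",
)
```

Rotating a backend key then means changing one entry in the layer's store, with no client redeployments.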

Rate limiting coordination prevents quota exhaustion across multiple backends. Client-level limits aggregate consumption across all backend calls made on a client's behalf. Per-model limits address provider-specific restrictions. Bucket algorithms such as the token bucket enforce a steady-state rate while capping burst size, so short spikes in traffic do not degrade the backends.
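
One common bucket algorithm is the token bucket, sketched below. The class name and parameters are illustrative, and the clock is injectable only so the example can be run deterministically.

```python
import time
from typing import Callable

class TokenBucket:
    """Refills at `rate` tokens per second, holding at most `capacity` tokens."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity      # start full, so an initial burst is allowed
        self._updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Credit tokens for the time elapsed since the last call, up to capacity.
        elapsed = now - self._updated
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._updated = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False

# Deterministic demonstration with a fake clock.
fake_now = [0.0]
bucket = TokenBucket(rate=1.0, capacity=2.0, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(3)]  # [True, True, False]
fake_now[0] = 1.0                           # one second later, one token is back
after_refill = bucket.allow()               # True
```

Here `capacity` bounds the burst and `rate` sets the long-run throughput. Keeping one bucket per client and another per model is one way to apply both kinds of limit described above.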

Documentation generation pairs the API surface with machine-readable specifications. OpenAPI schemas enable client code generation. Postman collections simplify integration testing. Interactive documentation portals demonstrate usage patterns and capture example requests.
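
As a rough sketch of how such a specification might be derived from the request contract, the example below builds a small OpenAPI 3.0 document from a dataclass. The `/v1/completions` path, the API title, and the `dataclass_to_schema` helper are assumptions for illustration, and the type mapping only handles the four primitive types shown.

```python
from dataclasses import MISSING, dataclass, fields

# Python primitive types mapped to JSON Schema type names.
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}

@dataclass
class CompletionRequest:
    model: str
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7
    stream: bool = False

def dataclass_to_schema(cls) -> dict:
    """Build a JSON Schema object from a dataclass with primitive fields."""
    properties, required = {}, []
    for f in fields(cls):
        prop = {"type": JSON_TYPES[f.type]}
        if f.default is MISSING:
            required.append(f.name)      # no default, so the client must send it
        else:
            prop["default"] = f.default
        properties[f.name] = prop
    return {"type": "object", "properties": properties, "required": required}

schema = dataclass_to_schema(CompletionRequest)
spec = {
    "openapi": "3.0.3",
    "info": {"title": "Unified Inference API", "version": "1.0.0"},
    "paths": {
        "/v1/completions": {  # hypothetical endpoint path
            "post": {
                "requestBody": {
                    "required": True,
                    "content": {"application/json": {"schema": schema}},
                },
                "responses": {"200": {"description": "Normalized completion"}},
            }
        }
    },
}
```

Serializing `spec` with `json.dumps` yields a document that client generators and documentation portals can consume, and deriving it from the dataclass keeps the published contract in step with the code.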