OpenAI-Compatible Gateway — Hybrid Local-Cloud AI Architecture (Chapter 9)

OpenAI-compatible gateways accept requests formatted according to the OpenAI API specification and route them to arbitrary backends. This compatibility enables drop-in replacement of OpenAI services with local or alternative cloud providers. Applications already written for OpenAI integrate without modification.

The Chat Completions endpoint represents the primary integration surface. Request format matches the official specification with messages arrays, role assignments, and completion parameters. Response format preserves field names and structures that consuming applications expect. This symmetry eliminates adaptation overhead during gateway deployment.

python
# Example OpenAI-compatible gateway server

from fastapi import FastAPI, Request, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from typing import Optional, Literal
import httpx
import asyncio

app = FastAPI(title="Hybrid OpenAI-Compatible Gateway")

class ChatMessage(BaseModel):
    role: Literal["system", "user", "assistant"]
    content: str

class ChatCompletionRequest(BaseModel):
    model: str
    messages: list[ChatMessage]
    temperature: Optional[float] = 0.7
    max_tokens: Optional[int] = 256
    stream: Optional[bool] = False
    seed: Optional[int] = None

# Backend router injected via dependency
# Production implementations would include full routing logic
ROUTES = {
    "gpt-3.5-turbo": "http://localhost:11434/v1/chat/completions",
    "gpt-4-turbo": "http://cloud-router:8000/v1/chat/completions",
    "claude-3": "http://anthropic-proxy:9000/v1/chat/completions",
    "local-llama": "http://ollama:11434/api/chat"
}

@app.post("/v1/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    """OpenAI-compatible chat completions endpoint."""
    
    # Route to appropriate backend based on model name
    backend_url = ROUTES.get(request.model)
    if not backend_url:
        # Attempt to find equivalent local model
        backend_url = ROUTES.get("local-llama")
    
    if not backend_url:
        raise HTTPException(status_code=404, detail="Model not found")
    
    # Forward request to selected backend
    async with httpx.AsyncClient(timeout=60.0) as client:
        response = await client.post(
            backend_url,
            json=request.model_dump(mode="json")
        )
    
    if response.status_code != 200:
        raise HTTPException(
            status_code=response.status_code,
            detail=response.text
        )
    
    return response.json()

@app.get("/v1/models")
async def list_models():
    """Return available models matching OpenAI schema."""
    return {
        "object": "list",
        "data": [
            {"id": model, "object": "model", "created": 1700000000}
            for model in ROUTES.keys()
        ]
    }

Embedding endpoints follow similar compatibility patterns. The embedding specification expects vector outputs in standard dimensions. Backend adapters translate local embedding model outputs into the expected format. Cosine similarity computations proceed identically whether inputs originated from OpenAI or local alternatives.

Fine-tuning compatibility extends the gateway surface area. Model upload endpoints prepare weights for local serving. Training completion callbacks maintain standard event schemas. Deployment endpoints trigger model loading with configurable parameters. The fine-tuning pipeline remains portable across infrastructure choices.

Streaming response handling preserves chunk-based delivery through the gateway. Server-sent events format matches client expectations. Token timestamps align with inference progression. Error chunks report failures consistently. This streaming compatibility enables real-time applications without protocol renegotiation.

Credential passthrough routes authenticated requests to backends maintaining their own identity tracking. Some backends implement their own quota and rate management. The gateway preserves authentication headers on supported routes. Rate limit responses propagate accurately when backends enforce consumption caps.