09. OpenAI-Compatible Gateway
OpenAI-compatible gateways accept requests formatted according to the OpenAI API specification and route them to arbitrary backends. This compatibility enables drop-in replacement of OpenAI services with local or alternative cloud providers. Applications already written for OpenAI integrate without modification.
The Chat Completions endpoint represents the primary integration surface. Request format matches the official specification with messages arrays, role assignments, and completion parameters. Response format preserves field names and structures that consuming applications expect. This symmetry eliminates adaptation overhead during gateway deployment.
python
# Example OpenAI-compatible gateway server
from fastapi import FastAPI, Request, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from typing import Optional, Literal
import httpx
import asyncio
app = FastAPI(title="Hybrid OpenAI-Compatible Gateway")
class ChatMessage(BaseModel):
role: Literal["system", "user", "assistant"]
content: str
class ChatCompletionRequest(BaseModel):
model: str
messages: list[ChatMessage]
temperature: Optional[float] = 0.7
max_tokens: Optional[int] = 256
stream: Optional[bool] = False
seed: Optional[int] = None
# Backend router injected via dependency
# Production implementations would include full routing logic
ROUTES = {
"gpt-3.5-turbo": "http://localhost:11434/v1/chat/completions",
"gpt-4-turbo": "http://cloud-router:8000/v1/chat/completions",
"claude-3": "http://anthropic-proxy:9000/v1/chat/completions",
"local-llama": "http://ollama:11434/api/chat"
}
@app.post("/v1/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
"""OpenAI-compatible chat completions endpoint."""
# Route to appropriate backend based on model name
backend_url = ROUTES.get(request.model)
if not backend_url:
# Attempt to find equivalent local model
backend_url = ROUTES.get("local-llama")
if not backend_url:
raise HTTPException(status_code=404, detail="Model not found")
# Forward request to selected backend
async with httpx.AsyncClient(timeout=60.0) as client:
response = await client.post(
backend_url,
json=request.model_dump(mode="json")
)
if response.status_code != 200:
raise HTTPException(
status_code=response.status_code,
detail=response.text
)
return response.json()
@app.get("/v1/models")
async def list_models():
"""Return available models matching OpenAI schema."""
return {
"object": "list",
"data": [
{"id": model, "object": "model", "created": 1700000000}
for model in ROUTES.keys()
]
}
Embedding endpoints follow similar compatibility patterns. The embedding specification expects vector outputs in standard dimensions. Backend adapters translate local embedding model outputs into the expected format. Cosine similarity computations proceed identically whether inputs originated from OpenAI or local alternatives.
Fine-tuning compatibility extends the gateway surface area. Model upload endpoints prepare weights for local serving. Training completion callbacks maintain standard event schemas. Deployment endpoints trigger model loading with configurable parameters. The fine-tuning pipeline remains portable across infrastructure choices.
Streaming response handling preserves chunk-based delivery through the gateway. Server-sent events format matches client expectations. Token timestamps align with inference progression. Error chunks report failures consistently. This streaming compatibility enables real-time applications without protocol renegotiation.
Credential passthrough routes authenticated requests to backends maintaining their own identity tracking. Some backends implement their own quota and rate management. The gateway preserves authentication headers on supported routes. Rate limit responses propagate accurately when backends enforce consumption caps.
Configure an OpenAI-compatible gateway to route requests to three distinct backends based on model name prefixes. Implement streaming support and verify that existing applications receive compatible responses without code changes.