RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Learn
  4. /Courses
  5. /Hybrid Local-Cloud AI Architecture
  6. /Ch. 9
Hybrid Local-Cloud AI Architecture

09. OpenAI-Compatible Gateway

Chapter 9 of 18 · 15 min
KEY INSIGHT

OpenAI-compatible gateways provide maximum integration flexibility with minimal friction. By conforming to established API contracts, hybrid infrastructure becomes transparent to existing toolchains and application code.

OpenAI-compatible gateways accept requests formatted according to the OpenAI API specification and route them to arbitrary backends. This compatibility enables drop-in replacement of OpenAI services with local or alternative cloud providers. Applications already written for OpenAI integrate without modification.

The Chat Completions endpoint represents the primary integration surface. Request format matches the official specification with messages arrays, role assignments, and completion parameters. Response format preserves field names and structures that consuming applications expect. This symmetry eliminates adaptation overhead during gateway deployment.

python
# Example OpenAI-compatible gateway server

from fastapi import FastAPI, Request, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from typing import Optional, Literal
import httpx
import asyncio

app = FastAPI(title="Hybrid OpenAI-Compatible Gateway")

class ChatMessage(BaseModel):
    role: Literal["system", "user", "assistant"]
    content: str

class ChatCompletionRequest(BaseModel):
    model: str
    messages: list[ChatMessage]
    temperature: Optional[float] = 0.7
    max_tokens: Optional[int] = 256
    stream: Optional[bool] = False
    seed: Optional[int] = None

# Backend router injected via dependency
# Production implementations would include full routing logic
ROUTES = {
    "gpt-3.5-turbo": "http://localhost:11434/v1/chat/completions",
    "gpt-4-turbo": "http://cloud-router:8000/v1/chat/completions",
    "claude-3": "http://anthropic-proxy:9000/v1/chat/completions",
    "local-llama": "http://ollama:11434/api/chat"
}

@app.post("/v1/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    """OpenAI-compatible chat completions endpoint."""
    
    # Route to appropriate backend based on model name
    backend_url = ROUTES.get(request.model)
    if not backend_url:
        # Attempt to find equivalent local model
        backend_url = ROUTES.get("local-llama")
    
    if not backend_url:
        raise HTTPException(status_code=404, detail="Model not found")
    
    # Forward request to selected backend
    async with httpx.AsyncClient(timeout=60.0) as client:
        response = await client.post(
            backend_url,
            json=request.model_dump(mode="json")
        )
    
    if response.status_code != 200:
        raise HTTPException(
            status_code=response.status_code,
            detail=response.text
        )
    
    return response.json()

@app.get("/v1/models")
async def list_models():
    """Return available models matching OpenAI schema."""
    return {
        "object": "list",
        "data": [
            {"id": model, "object": "model", "created": 1700000000}
            for model in ROUTES.keys()
        ]
    }

Embedding endpoints follow similar compatibility patterns. The embedding specification expects vector outputs in standard dimensions. Backend adapters translate local embedding model outputs into the expected format. Cosine similarity computations proceed identically whether inputs originated from OpenAI or local alternatives.

Fine-tuning compatibility extends the gateway surface area. Model upload endpoints prepare weights for local serving. Training completion callbacks maintain standard event schemas. Deployment endpoints trigger model loading with configurable parameters. The fine-tuning pipeline remains portable across infrastructure choices.

Streaming response handling preserves chunk-based delivery through the gateway. Server-sent events format matches client expectations. Token timestamps align with inference progression. Error chunks report failures consistently. This streaming compatibility enables real-time applications without protocol renegotiation.

Credential passthrough routes authenticated requests to backends maintaining their own identity tracking. Some backends implement their own quota and rate management. The gateway preserves authentication headers on supported routes. Rate limit responses propagate accurately when backends enforce consumption caps.

EXERCISE

Configure an OpenAI-compatible gateway to route requests to three distinct backends based on model name prefixes. Implement streaming support and verify that existing applications receive compatible responses without code changes.

← Chapter 8
Unified API Layer
Chapter 10 →
Fallback Chains