01. API Design Principles

Chapter 1 of 18 · 15 min

KEY INSIGHT

An API is a contract. The moment you publish an endpoint, you are promising stability to every consumer that depends on it. Designing for local AI serving means understanding the difference between the *capabilities* of your inference engine and the *interface* you expose. These two layers should remain independent. The interface should never leak implementation details about what model is running or how it is being served. ### What Makes a Good Local AI API A well-designed local AI API prioritizes three properties: compatibility, predictability, and observability. Compatibility means existing clients work without modification. Predictability means the API behaves consistently under load. Observability means failures can be diagnosed without guessing. Start by defining the request-response contract in a schema. For OpenAI-compatible endpoints, this means mapping your internal representation to the format clients expect. A request to `/v1/chat/completions` should receive a response that matches the OpenAI schema, not a custom format unique to your setup. ### Core Design Decisions Request validation happens at the boundary. Reject malformed requests with 422 status codes and clear error messages before they reach your inference pipeline. Never let invalid input propagate downstream where it becomes harder to debug. Response structure should be consistent even when the underlying operation fails. A 200 response and a 500 response should share the same top-level keys. Clients should never encounter different JSON shapes depending on which code path executed. Streaming responses require special consideration. A streaming response is not just a series of HTTP chunks. It is a structured byte stream where the client parses content and reconstructs the complete response. Breaking changes in the stream format will silently break all consumers. ### The Portability Principle Design interfaces that do not depend on the serving backend. If you switch from vLLM to Ollama or change your inference engine, the API layer should adapt without forcing clients to update their code. This means abstracting model loading, batching, and memory management behind a service layer that your HTTP handlers call.

EXERCISE

Document the request and response schema for a hypothetical /v1/embeddings endpoint. Include field names, types, required vs optional status, and example values. Identify three places where a breaking change to this schema would affect existing clients.