RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.coIndependently operated
Local AI APIs and Integration

05. Streaming with SSE

Chapter 5 of 18 · 20 min
KEY INSIGHT

Server-Sent Events (SSE) deliver real-time responses without WebSocket complexity. Understanding the SSE format and chunk serialization is essential for implementing streaming endpoints that clients like the OpenAI SDK expect.

### SSE Protocol Basics

SSE uses a text-based format in which each event ends with a blank line, that is, two consecutive newlines. Each event carries a `data:` field followed by the payload. OpenAI-compatible streams end with a `data: [DONE]` sentinel, which is an API convention rather than part of the SSE format itself.

```
data: {"id":"1","choices":[{"delta":{"content":"Hello"}}]}

data: {"id":"1","choices":[{"delta":{"content":" world"}}]}

data: [DONE]
```

Clients parse these lines and concatenate the `delta` payloads to reconstruct the complete response. If the blank line between two events is missing, the parser merges both `data:` lines into a single event whose payload is no longer valid JSON. How a client reacts to a malformed chunk varies, but many raise an error and stop reading, so everything after the bad chunk is lost.

### FastAPI Streaming Response

```python
import json
import time

from fastapi.responses import StreamingResponse

# CompletionRequest, format_messages, inference_client, and random_id are
# assumed to be defined elsewhere in the application.


async def stream_completion(request: CompletionRequest):
    async def event_generator():
        prompt = format_messages(request.messages)
        # Create the id and timestamp once so every chunk of this
        # completion shares them.
        completion_id = f"chatcmpl-{random_id()}"
        created = int(time.time())

        def make_event(delta, finish_reason=None):
            event = {
                "id": completion_id,
                "object": "chat.completion.chunk",
                "created": created,
                "model": request.model,
                "choices": [
                    {"index": 0, "delta": delta, "finish_reason": finish_reason}
                ],
            }
            # Each SSE event ends with a blank line.
            return f"data: {json.dumps(event)}\n\n"

        async for chunk in inference_client.stream_generate(prompt):
            yield make_event({"content": chunk.text})

        # Final chunk: empty delta with the finish reason set.
        yield make_event({}, finish_reason="stop")
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_generator(), media_type="text/event-stream")
```

`StreamingResponse` sends each item as soon as the generator produces it, and HTTP chunked transfer encoding is handled for you. The generator here yields strings, which are encoded to bytes before they go out on the wire.

### Chunk Structure

Each chunk follows the chat completion chunk schema, which carries a `delta` where a non-streaming response carries a `message`. The `finish_reason` is null while content is streaming and is set only on the final chunk.

```json
{
  "id": "chatcmpl-abc",
  "object": "chat.completion.chunk",
  "created": 1700000000,
  "model": "llama3.2",
  "choices": [{
    "index": 0,
    "delta": {"content": "Hello"},
    "finish_reason": null
  }]
}
```

### Common Failure Modes

Forgetting to flush the response buffer causes the client to receive all chunks at once instead of in real time. Mixing text and binary data in the stream breaks clients that expect text only. Chunks sent after `data: [DONE]` can cause parsing errors on the client or be silently dropped, because clients treat the sentinel as the end of the stream.
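To make the client side of the protocol concrete, here is a minimal sketch of how a consumer might turn `data:` lines back into the full response text. The function names `iter_sse_data` and `collect_content` are hypothetical, not taken from any SDK, and the parser handles only the `data:` field and the `[DONE]` sentinel. Real clients also handle other fields, reconnection, and errors.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event from an iterable of lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The format allows one optional space after the colon.
            if value.startswith(" "):
                value = value[1:]
            buffer.append(value)
        # Other fields and comment lines are ignored in this sketch.


def collect_content(lines: Iterable[str]) -> str:
    """Concatenate delta content until the [DONE] sentinel arrives."""
    parts = []
    for payload in iter_sse_data(lines):
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        parts.append(delta.get("content") or "")
    return "".join(parts)


stream = [
    'data: {"id":"1","choices":[{"delta":{"content":"Hello"}}]}\n',
    "\n",
    'data: {"id":"1","choices":[{"delta":{"content":" world"}}]}\n',
    "\n",
    "data: [DONE]\n",
    "\n",
]
print(collect_content(stream))  # Hello world
```

Dropping the blank line between the first two events makes `iter_sse_data` yield both payloads joined by a newline, and `json.loads` then raises. That is the missing-newline failure in practice.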

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Implement a streaming endpoint that yields chunks with a 100ms delay between each word of a static sentence. Use curl or a browser to verify the chunks arrive progressively rather than all at once.
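One possible starting point for the exercise, offered as a sketch rather than a reference solution. The helper `sentence_events`, the `/stream` route, the `create_app` factory, and the sample sentence are all assumptions made for illustration. The payload is simplified to a bare `content` field rather than a full chat completion chunk.

```python
import asyncio
import json
from typing import AsyncIterator


async def sentence_events(sentence: str, delay: float = 0.1) -> AsyncIterator[str]:
    """Yield one SSE event per word, pausing `delay` seconds between words."""
    for i, word in enumerate(sentence.split()):
        if i > 0:
            await asyncio.sleep(delay)
        # Re-attach the space that split() removed, except before the first word.
        text = word if i == 0 else " " + word
        payload = json.dumps({"content": text})
        yield f"data: {payload}\n\n"
    yield "data: [DONE]\n\n"


def create_app():
    """Build the FastAPI app. FastAPI is imported here, not at module level,
    so the generator above can be exercised without the framework installed."""
    from fastapi import FastAPI
    from fastapi.responses import StreamingResponse

    app = FastAPI()

    @app.get("/stream")
    async def stream():
        return StreamingResponse(
            sentence_events("Streaming makes local models feel responsive"),
            media_type="text/event-stream",
        )

    return app
```

If the code is saved as `sse_exercise.py` (a hypothetical file name), `uvicorn sse_exercise:create_app --factory` should serve it, and `curl -N http://127.0.0.1:8000/stream` should print one event roughly every 100 ms. The `-N` flag turns off curl's own output buffering, which would otherwise hide the progressive arrival.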
