04. Streaming Responses

Chapter 4 of 15 · 15 min

To stream from FastAPI to the browser, use StreamingResponse with an async generator. Add to app/main.py:

from fastapi.responses import StreamingResponse

@app.post("/chat")
async def chat(model: str, messages: list[dict]):
    from app.ollama_client import stream_chat
    return StreamingResponse(
        stream_chat(model, messages),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache"}
    )

The media_type="text/event-stream" header tells the browser this is SSE. Each yielded line must be prefixed with data: for the browser's EventSource to parse it. Fix stream_chat in app/ollama_client.py:

def stream_chat(model: str, messages: list[dict]):
    payload = {
        "model": model,
        "messages": messages,
        "stream": True,
    }
    with httpx.stream("POST", f"{OLLAMA_BASE}/api/chat", json=payload, timeout=120.0) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:
                yield f"data: {line}\n\n"

The double newline \n\n is the SSE message delimiter. Missing it causes the browser to buffer indefinitely.

A failure mode: if the client disconnects (user closes the tab) while FastAPI is streaming, the async generator raises CancelledError. Catch it in the route with a try/except or let the framework handle it—FastAPI handles CancelledError silently by default, which is fine.

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Use curl -N http://localhost:8000/chat with a POST body to test the stream manually. Watch the chunks arrive.