HOW-TO · INF
How to handle streaming response chunks in your application
PREREQUISITES
API endpoint with streaming support, Python or JavaScript
What this does
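Streaming APIs return output token by token, typically as Server-Sent Events (SSE) or, as with the endpoint used in this guide, as newline-delimited JSON (one JSON object per line). This guide covers parsing and accumulating chunks and handling errors in both Python and JavaScript applications.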
Steps
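Parse the streamed chunks in Python. Each line is a JSON object with "response" and "done" fields.

    import json

    import requests


    def stream_completion(model, prompt):
        """Yield response text piece by piece as it arrives."""
        full_response = []
        with requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": True},
            stream=True,
        ) as r:
            r.raise_for_status()  # fail early on HTTP errors
            for line in r.iter_lines():
                if not line:
                    continue  # skip blank keep-alive lines
                chunk = json.loads(line)
                if chunk.get("response"):
                    full_response.append(chunk["response"])
                    yield chunk["response"]
                if chunk.get("done"):
                    # eval_count may be absent, so avoid a KeyError
                    print(f"\nTotal tokens: {chunk.get('eval_count', 'unknown')}")
                    break
        # A generator's return value is reachable only through
        # StopIteration.value or `yield from`; a plain for loop discards it.
        return "".join(full_response)


    for token in stream_completion("llama3.2", "Explain streaming"):
        print(token, end="", flush=True)

Handle the stream in JavaScript (browser/Node.js).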
    // Run as an ES module (top-level await) or wrap in an async function.
    const response = await fetch('http://localhost:11434/api/generate', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ model: 'llama3.2', prompt: 'Hello', stream: true }),
    });
    if (!response.ok) throw new Error(`HTTP ${response.status}`);

    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    let buffer = '';

    // Parse one complete line and emit its text.
    const handleLine = (line) => {
      if (!line.trim()) return;
      const chunk = JSON.parse(line);
      if (chunk.response) {
        // process.stdout exists only in Node.js; in a browser, append to the DOM instead.
        process.stdout.write(chunk.response);
      }
    };

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      buffer += decoder.decode(value, { stream: true });
      const lines = buffer.split('\n');
      buffer = lines.pop(); // Keep incomplete line
      lines.forEach(handleLine);
    }
    // Flush a final line that arrived without a trailing newline.
    handleLine(buffer + decoder.decode());

Accumulate chunks for the final response. Concatenate all "response" fields and strip the trailing newline.

Handle streaming errors with timeout and retry.
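    import signal


    class StreamTimeout(Exception):
        """Raised when the stream exceeds the time limit."""


    def handler(signum, frame):
        raise StreamTimeout()


    # SIGALRM is available only on Unix and only in the main thread.
    signal.signal(signal.SIGALRM, handler)
    signal.alarm(30)  # 30 second timeout
    try:
        # stream_completion is the generator defined in the first step
        for token in stream_completion("llama3.2", "Long prompt"):
            pass
    except StreamTimeout:
        print("Stream timed out; consider shorter prompts")
    finally:
        signal.alarm(0)  # cancel the pending alarm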
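The timeout step mentions retry, so here is a minimal sketch of one way to add it. It assumes the whole stream is restarted from scratch after a failure, which means any partial output from a failed attempt is discarded. The name `collect_with_retry` and its parameters are hypothetical and not part of any library. The default `retry_on=(OSError,)` relies on the fact that the built-in `ConnectionError` and `TimeoutError`, as well as the exceptions raised by `requests`, derive from `OSError`.

```python
import time


def collect_with_retry(make_stream, attempts=3, base_delay=1.0,
                       retry_on=(OSError,), sleep=time.sleep):
    """Collect a full streamed response, restarting from scratch on failure.

    make_stream is a zero-argument callable that returns a fresh iterator
    of text pieces (hypothetical helper, for illustration only).
    """
    last_error = None
    for attempt in range(attempts):
        pieces = []  # partial output from a failed attempt is dropped
        try:
            for piece in make_stream():
                pieces.append(piece)
            return "".join(pieces)
        except retry_on as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"stream failed after {attempts} attempts") from last_error
```

With the generator from the first step, a call might look like `collect_with_retry(lambda: stream_completion("llama3.2", "Hello"))`. Restarting is only safe when tokens have not already been shown to the user; otherwise the caller would need to clear or reconcile the partial output first.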
Verification
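# Expected: tokens are printed to stdout incrementally, and the accumulated text equals the concatenation of all "response" fields.
# Compare: the streamed text should match a non-streamed response for the same prompt, provided sampling is deterministic; otherwise the two runs can differ.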
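The comparison above can be sketched in a few lines. This is an illustration under assumptions: `accumulate_response` is a hypothetical helper, and `streamed_lines` and `non_streamed_text` are made-up sample data standing in for a captured stream and a non-streamed reply to the same prompt. It also shows the accumulation step: join every "response" field, then strip the trailing newline.

```python
import json


def accumulate_response(lines):
    """Join the "response" fields from an iterable of JSON lines."""
    pieces = []
    for line in lines:
        if not line or not line.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        pieces.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # ignore anything after the final chunk
    return "".join(pieces).rstrip("\n")


# Made-up sample data, not real server output.
streamed_lines = [
    b'{"response": "Hello", "done": false}',
    b'',
    b'{"response": ", world\\n", "done": false}',
    b'{"response": "", "done": true}',
]
non_streamed_text = "Hello, world\n"

streamed_text = accumulate_response(streamed_lines)
print(streamed_text == non_streamed_text.rstrip("\n"))
```

Both sides are stripped of the trailing newline before comparing, so a difference in final whitespace does not cause a false mismatch.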
Common failures
- Partial JSON at buffer boundary: Messages may split across chunks. Always buffer and split by \n.
- Missing "done" signal: The stream may close without a final {"done": true}. Set a timeout as a safety net.
- Memory leak from unbounded accumulation: For very long responses, periodically flush accumulated text to disk or a database.
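The first failure above, a JSON line split across network chunks, can be handled in Python the same way the JavaScript step does it. The following is a minimal sketch; `iter_json_lines` is a hypothetical name. It buffers raw bytes rather than decoded text so that a multi-byte UTF-8 character split between chunks is never decoded in halves.

```python
import json


def iter_json_lines(byte_chunks):
    """Yield one parsed object per newline-terminated JSON line."""
    buffer = b""
    for data in byte_chunks:
        buffer += data
        # Everything before the last newline is complete; the rest waits
        # in the buffer for the next chunk.
        *complete, buffer = buffer.split(b"\n")
        for line in complete:
            if line.strip():
                yield json.loads(line)
    if buffer.strip():
        # The stream ended without a trailing newline.
        yield json.loads(buffer)


# Two JSON lines arriving in three arbitrarily split chunks.
chunks = [b'{"response": "He', b'llo"}\n{"resp', b'onse": "!", "done": true}']
for obj in iter_json_lines(chunks):
    print(obj)
```

With `requests`, the chunks could come from `r.iter_content(chunk_size=None)`, although `r.iter_lines()` already performs this buffering. If the connection drops mid-line, the final `json.loads` raises `json.JSONDecodeError`, which is one way to notice a stream that ended without its "done" chunk.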
Related guides