HOW-TO · INF
How to enable streaming responses for real-time output
PREREQUISITES
Ollama or OpenAI-compatible API endpoint
What this does
Streaming returns tokens incrementally as the model generates them, instead of holding the response until generation finishes. The first text appears sooner, which makes chat applications and other real-time tools feel more responsive.
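To make the idea concrete, here is a minimal sketch of how a client might assemble streamed output. The sample lines are hand-written stand-ins shaped like Ollama /api/generate chunks (a "response" text field and a "done" flag), not captured model output, and iter_text is an illustrative helper name.

```python
import json
from typing import Iterable, Iterator


def iter_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield each text fragment from newline-delimited JSON chunks."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between chunks
        chunk = json.loads(line)
        fragment = chunk.get("response", "")
        if fragment:
            yield fragment
        if chunk.get("done"):
            break  # the final chunk marks the end of the stream


# Hand-written sample chunks (illustrative only).
SAMPLE = [
    '{"response": "Hel", "done": false}',
    '{"response": "lo", "done": false}',
    '{"response": "", "done": true}',
]

# Print fragments as they are "received", the way a chat UI would.
for fragment in iter_text(SAMPLE):
    print(fragment, end="", flush=True)
print()
```

With a real server, the same function could be fed the decoded lines of the HTTP response instead of SAMPLE.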
Steps
- Enable streaming in an Ollama API request. Set "stream": true in the JSON body.
curl -N http://localhost:11434/api/generate \
  -d '{"model": "llama3.2", "prompt": "Write a short poem", "stream": true}'
Expected: Tokens arrive incrementally as newline-delimited JSON objects.
- Stream from the chat endpoint for multi-turn conversations.
curl -N http://localhost:11434/api/chat \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Tell me a joke"}], "stream": true}'
- Stream in Python, processing each chunk as it arrives.
import json

import requests

# stream=True stops requests from buffering the whole body before returning.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2", "prompt": "Write a haiku", "stream": True},
    stream=True,
)
response.raise_for_status()

for line in response.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    if chunk.get("response"):
        print(chunk["response"], end="", flush=True)
    if chunk.get("done"):
        # The final chunk carries the generation statistics.
        print()
        print(f"Tokens: {chunk['eval_count']}, Duration: {chunk['eval_duration'] / 1e9:.2f}s")
- Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
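The chat step above uses curl only. A Python equivalent might look like the sketch below. It assumes the Ollama /api/chat stream, where each newline-delimited chunk carries its text under message.content. The helper names chat_fragment and stream_chat are illustrative, and the sketch uses urllib from the standard library rather than requests so that it has no dependencies.

```python
import json
import urllib.request
from typing import Iterator, Optional


def chat_fragment(line: bytes) -> Optional[str]:
    """Return the text carried by one /api/chat chunk, or None if it has none."""
    if not line.strip():
        return None
    chunk = json.loads(line)
    return chunk.get("message", {}).get("content") or None


def stream_chat(messages: list, model: str = "llama3.2",
                url: str = "http://localhost:11434/api/chat") -> Iterator[str]:
    """Yield text fragments from a streaming chat request (needs a running server)."""
    body = json.dumps({"model": model, "messages": messages, "stream": True}).encode()
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        for line in response:  # the response object yields one line at a time
            fragment = chat_fragment(line)
            if fragment:
                yield fragment


# Usage (requires a local Ollama server):
# for text in stream_chat([{"role": "user", "content": "Tell me a joke"}]):
#     print(text, end="", flush=True)
```

Keeping the per-line parsing in its own function makes it possible to check the parsing logic without a server.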
Verification
curl -N -s http://localhost:11434/api/generate \
-d '{"model":"llama3.2","prompt":"Count to 5","stream":true}' \
| python -c "import sys,json; [print(json.loads(l)['response'],end='',flush=True) for l in sys.stdin if l.strip()]"
# Expected: Text appears incrementally, a token at a time, not all at once
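Watching the terminal is a subjective check. A more measurable one, sketched below, records when each chunk arrives and checks whether the arrivals are spread over time. The helper names and the 0.05 s threshold are assumptions chosen for illustration, and the stream here is simulated. With a real request, the iterable could be response.iter_lines().

```python
import time
from typing import Iterable, Iterator, List


def arrival_times(chunks: Iterable) -> List[float]:
    """Return the seconds elapsed (since iteration began) at each chunk's arrival."""
    start = time.monotonic()
    return [time.monotonic() - start for _ in chunks]


def looks_streamed(times: List[float], min_spread: float = 0.05) -> bool:
    """Heuristic: streamed chunks arrive spread over time, while a buffered
    response delivers them all at nearly the same instant."""
    return len(times) > 1 and (times[-1] - times[0]) >= min_spread


def fake_stream(n: int, delay: float) -> Iterator[str]:
    """Simulated stream, used only for demonstration."""
    for i in range(n):
        time.sleep(delay)
        yield f"chunk-{i}"


times = arrival_times(fake_stream(5, 0.03))
print(f"first chunk at {times[0]:.2f}s, last at {times[-1]:.2f}s, "
      f"streamed: {looks_streamed(times)}")
```

A short generation may finish within the threshold even when streaming works, so a longer prompt gives a clearer signal.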
Common failures
- No streaming, response arrives all at once: Verify "stream": true is in the request body. Some clients default to stream: false.
- Chunks arrive with delay: The first chunk includes model loading time. Keep the model loaded by sending a warm-up request first.
- Connection closed prematurely: Network proxies may buffer streaming responses. Use -N (--no-buffer) with curl and stream=True with Python requests. If nginx sits in front of the endpoint, disable response buffering there (proxy_buffering off).
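For the model loading delay, a warm-up request could be sketched as follows. This assumes that Ollama loads a model when /api/generate receives a request without a prompt, and that the keep_alive request field controls how long the model stays in memory afterwards. Both should be checked against the documentation for your Ollama version. The function names and the "10m" value are illustrative.

```python
import json
import urllib.request


def warmup_payload(model: str, keep_alive: str = "10m") -> dict:
    """Build a generate request with no prompt. keep_alive asks the server
    to keep the model in memory for that long (assumed Ollama behavior)."""
    return {"model": model, "keep_alive": keep_alive, "stream": False}


def warm_up(model: str = "llama3.2",
            url: str = "http://localhost:11434/api/generate") -> int:
    """Send the warm-up request and return the HTTP status (needs a running server)."""
    body = json.dumps(warmup_payload(model)).encode()
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        response.read()  # drain the body so the request completes
        return response.status


# Show the request body that would be sent.
print(json.dumps(warmup_payload("llama3.2")))
```

Calling warm_up() once before the first streamed request should move the loading cost out of the first chunk's latency, provided the assumptions above hold.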
Operator checkpoint
Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
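One way to capture the checkpoint is a small structured record. The sketch below is only a suggestion: the field names and the run_record helper are hypothetical, and the runtime version and observed output have to be pasted in by hand.

```python
import json
import platform
import sys
from datetime import datetime, timezone


def run_record(command: str, model: str, output: str,
               runtime_version: str = "unknown") -> dict:
    """Collect the details needed to reproduce a verification run."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "command": command,
        "model": model,
        "runtime_version": runtime_version,  # version reported by your runtime
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "output": output,
    }


record = run_record(
    command=("curl -N -s http://localhost:11434/api/generate "
             "-d '{\"model\":\"llama3.2\",\"prompt\":\"Count to 5\",\"stream\":true}'"),
    model="llama3.2",
    output="<paste the observed output here>",
)
print(json.dumps(record, indent=2))
```

Saving the printed JSON next to the guide keeps the command, environment, and result together.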