RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

© 2026 runlocalai.co · Independently operated
How to enable streaming responses for real-time output

intermediate·10 min·By Fredoline Eruo
PREREQUISITES

Ollama or OpenAI-compatible API endpoint

What this does

Streaming returns tokens one by one as they are generated instead of waiting for the full response. This provides a better user experience for chat applications and real-time tools.
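
As a rough illustration of what "one by one" means on the wire, the sketch below assembles text from newline-delimited JSON chunks. The sample lines are made up for illustration and are not captured model output. The "response" and "done" field names are the ones the Python step in this guide reads, and collect_text is a hypothetical helper name.

```python
import json

def collect_text(lines):
    """Join the "response" fragments from NDJSON chunks until "done" is true."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines between chunks
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final chunk marks the end of the stream
    return "".join(parts)

# Illustrative chunks only (not real model output).
sample = [
    '{"response": "Hel", "done": false}',
    '{"response": "lo", "done": false}',
    '{"response": "", "done": true}',
]
print(collect_text(sample))
```

A real client would print each fragment as soon as it is parsed instead of joining them at the end; the joined form is used here only to keep the example checkable offline.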

Steps

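  1. Request streaming from the Ollama generate endpoint by setting "stream": true in the JSON body. Ollama's native endpoints stream by default, so the flag mainly makes the intent explicit. The -N flag turns off curl's own output buffering so chunks print as they arrive.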

    curl -N http://localhost:11434/api/generate \
      -d '{"model": "llama3.2", "prompt": "Write a short poem", "stream": true}'
    

    Expected: Tokens arrive incrementally as newline-delimited JSON objects, each carrying a "response" fragment. The final object has "done": true.

  2. Stream from the chat endpoint for multi-turn conversations. Chat chunks carry their text under "message" and then "content" rather than under "response".

    curl -N http://localhost:11434/api/chat \
      -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Tell me a joke"}], "stream": true}'
    
  3. Stream in Python, processing each chunk as it arrives.

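    import json
    
    import requests
    
    # stream=True keeps requests from buffering the whole body before returning.
    response = requests.post("http://localhost:11434/api/generate",
        json={"model": "llama3.2", "prompt": "Write a haiku", "stream": True},
        stream=True)
    response.raise_for_status()  # fail early on HTTP errors such as an unknown model
    for line in response.iter_lines():
        if not line:
            continue  # skip empty lines between chunks
        chunk = json.loads(line)  # one JSON object per line
        if chunk.get("response"):
            print(chunk["response"], end="", flush=True)
        if chunk.get("done"):
            print()
            # eval_duration is reported in nanoseconds
            print(f"Tokens: {chunk.get('eval_count', 0)}, Duration: {chunk.get('eval_duration', 0)/1e9:.2f}s")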
    
  4. Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
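
The chat request from step 2 can be consumed in Python much like the generate request in step 3. This is a minimal sketch, assuming chat chunks carry their text under "message" and "content", and using the standard library's urllib in place of requests so it runs without extra packages. iter_chat_text and stream_chat are hypothetical helper names, not part of any library.

```python
import json
import urllib.request

def iter_chat_text(lines):
    """Yield text fragments from newline-delimited JSON chat chunks."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines between chunks
        chunk = json.loads(raw)
        text = chunk.get("message", {}).get("content", "")
        if text:
            yield text
        if chunk.get("done"):
            return  # final chunk: stop reading

def stream_chat(messages, model="llama3.2",
                url="http://localhost:11434/api/chat"):
    """POST a chat request and print fragments as they arrive."""
    body = json.dumps(
        {"model": model, "messages": messages, "stream": True}
    ).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        # The response object can be iterated line by line.
        for text in iter_chat_text(resp):
            print(text, end="", flush=True)
    print()
```

With a local server running, a call such as stream_chat([{"role": "user", "content": "Tell me a joke"}]) should print the reply incrementally. For a multi-turn conversation, append each assistant reply to the messages list before sending the next user turn.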

Verification

curl -N -s http://localhost:11434/api/generate \
  -d '{"model":"llama3.2","prompt":"Count to 5","stream":true}' \
  | python -c "import sys,json; [print(json.loads(l)['response'],end='',flush=True) for l in sys.stdin if l.strip()]"
# Expected: Text appears a token (a few characters) at a time, not all at once
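
Judging "not all at once" by eye can be unreliable for short replies. One way to make the check measurable is to timestamp each chunk as it arrives. The sketch below is an assumption-laden helper, not part of Ollama: timed and looks_streamed are hypothetical names, and the 0.05 second threshold is an arbitrary illustrative value.

```python
import time

def timed(lines, clock=time.monotonic):
    """Yield (seconds_since_start, line) for each non-empty line."""
    start = clock()
    for line in lines:
        if line.strip():
            yield clock() - start, line

def looks_streamed(arrivals, min_spread=0.05):
    """Return True if chunks arrived spread over time rather than in one burst.

    min_spread is an arbitrary threshold in seconds chosen for illustration.
    """
    times = [t for t, _ in arrivals]
    return len(times) > 1 and (times[-1] - times[0]) >= min_spread
```

To use it with the Python step above, wrap the line iterator, for example arrivals = list(timed(response.iter_lines())), then call looks_streamed(arrivals). A buffered response tends to show all chunks with nearly identical timestamps.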

Common failures

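  • No streaming, response arrives all at once: Verify "stream": true is in the request body. Some clients and OpenAI-compatible endpoints default to stream: false, and an HTTP client that buffers the body will hide streaming even when the server sends it.
  • First chunk arrives late: The first chunk includes model loading time when the model is not yet in memory. Send a warm-up request first so the model is already loaded.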
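  • Output stalls or the connection closes prematurely: A reverse proxy between client and server may be buffering the stream. In nginx, set proxy_buffering off for the API location. On the client, use -N (--no-buffer) with curl or stream=True with requests in Python.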
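
The prerequisites also allow an OpenAI-compatible endpoint, where streaming usually has to be requested explicitly and chunks arrive as server-sent events rather than bare JSON lines. The sketch below assumes the common chat-completions stream format: lines prefixed with "data: ", text under choices[0].delta.content, and a final "data: [DONE]" line. iter_sse_text is a hypothetical helper name, and the exact format should be checked against the server in use.

```python
import json

def iter_sse_text(lines):
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separators and comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream sentinel
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if choices:
            text = choices[0].get("delta", {}).get("content")
            if text:
                yield text
```

Feeding it the line iterator from a streaming HTTP response (for example response.iter_lines() from requests) should yield fragments in arrival order. If nothing is yielded until the end, the request body is likely missing "stream": true.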

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
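
One possible way to capture that record is a small JSON file written next to the project. The field names, the default file name, and the helper names build_run_record and save_run_record below are all illustrative assumptions; the runtime version is passed in by the operator rather than detected.

```python
import json
import platform
from datetime import datetime, timezone

def build_run_record(command, runtime_version, model, output):
    """Collect the details needed to reproduce a streaming check later."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "command": command,
        "runtime_version": runtime_version,  # supplied by the operator
        "model": model,
        "platform": platform.platform(),
        "python": platform.python_version(),
        "output": output,
    }

def save_run_record(record, path="stream-run.json"):
    """Write the record as indented JSON (the path is a hypothetical default)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2)
```

Hardware or backend details can be added as extra keys if they matter for the decision being recorded.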

Related guides

  • How to handle streaming response chunks in your application
  • How to configure context window size for long documents