RUNLOCALAIv38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
LangChain for Local AI

16. Streaming with LangChain

Chapter 16 of 18 · 20 min
KEY INSIGHT

Streaming requires explicit `stream()` or `astream()` calls: `invoke()` always waits for complete generation, even with `stream=True` in the LLM config.

Streaming delivers tokens as they are generated rather than waiting for the complete response. For local models with 10-30 second generation times, the first words appear almost immediately, so the wait feels shorter even though total generation time is unchanged.

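ChatOllama needs no streaming flag in its configuration. Create the model as usual; calling stream() on it is what produces incremental output.

from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="llama3.2",
    base_url="http://localhost:11434",  # default local Ollama endpoint
)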

# Stream response manually
for chunk in llm.stream("Explain Docker in one sentence."):
    print(chunk.content, end="", flush=True)
print()
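
Chunks often need to be kept as well as printed, for example to store the full answer once generation finishes. The following is a minimal sketch, assuming each chunk exposes a `.content` string as in the loop above. `stream_and_collect` is a hypothetical helper name, and the stand-in chunks replace a real `llm.stream(...)` call so the snippet runs without a model.

```python
import sys
from types import SimpleNamespace
from typing import Iterable


def stream_and_collect(chunks: Iterable, out=sys.stdout) -> str:
    """Print each chunk as it arrives and return the full text."""
    parts = []
    for chunk in chunks:
        text = chunk.content  # assumes the chunk has a .content string
        out.write(text)
        out.flush()
        parts.append(text)
    out.write("\n")
    return "".join(parts)


# Stand-in for llm.stream("..."): three fake chunks with a .content field.
fake_chunks = [SimpleNamespace(content=t) for t in ("Docker ", "packages ", "apps.")]
full_text = stream_and_collect(fake_chunks)
```

With a real model, `stream_and_collect(llm.stream("Explain Docker in one sentence."))` should behave the same way.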

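For chains, call astream() on the chain directly. A legacy chain such as RetrievalQA does not stream token by token in current LangChain releases, so expect the answer to arrive as a single chunk under the "result" key.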

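import asyncio

from langchain.chains import RetrievalQA

# `llm` is the ChatOllama instance created above; `retriever` is assumed to be
# defined already (for example, built from a vector store).
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

async def stream_answer():
    # Each chunk is a dict; the answer text is under the "result" key.
    async for chunk in qa_chain.astream({"query": "your question"}):
        if "result" in chunk:
            print(chunk["result"], end="", flush=True)
    print()

asyncio.run(stream_answer())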

Synchronous streaming uses the stream() method on the chain object.

for chunk in qa_chain.stream({"query": "your question"}):
    if "result" in chunk:
        print(chunk["result"], end="", flush=True)

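For chat interfaces, pass a list of messages to stream() and print each chunk as it arrives.

from langchain_core.messages import HumanMessage

messages = [HumanMessage(content="What is retrieval-augmented generation?")]

# A chunk may hold one token or several, depending on the backend.
for chunk in llm.stream(messages):
    print(chunk.content, end="", flush=True)
print()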

A common mistake is calling invoke() and expecting incremental output. invoke() always returns the complete response in one piece; streaming only happens with stream() or astream().

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Create a simple chain and implement streaming that prints tokens as they arrive. Measure total time for streaming completion versus invoke() completion.
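
One way to take the measurement is to record both the time until the first piece arrives and the total time. This is a minimal sketch, not a reference solution: `time_stream` and `fake_stream` are hypothetical names, and the fake stream with a fixed delay stands in for a real model so the snippet runs anywhere.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_stream(chunks: Iterable[str]) -> Tuple[float, float, str]:
    """Consume a stream of text pieces.

    Returns (seconds until first piece, total seconds, full text).
    """
    start = time.perf_counter()
    first = None
    parts = []
    for piece in chunks:
        if first is None:
            first = time.perf_counter() - start
        parts.append(piece)
    total = time.perf_counter() - start
    if first is None:  # empty stream: no first piece was ever seen
        first = total
    return first, total, "".join(parts)


def fake_stream(delay: float = 0.01) -> Iterator[str]:
    """Stand-in for a model stream: yields a piece every `delay` seconds."""
    for piece in ("Streaming ", "feels ", "faster."):
        time.sleep(delay)
        yield piece


first_s, total_s, text = time_stream(fake_stream())
print(f"first piece after {first_s:.3f}s, finished after {total_s:.3f}s")
```

Against a real model, passing `(c.content for c in llm.stream(prompt))` to `time_stream` and timing `llm.invoke(prompt)` separately with `time.perf_counter()` gives the two numbers to compare. Total times are likely to be similar; the gap to look at is the time to the first piece.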
