Streaming with LangChain — LangChain for Local AI (Chapter 16)

Streaming delivers tokens as they are generated instead of waiting for the complete response. For local models that take 10-30 seconds to produce an answer, streaming lets users start reading almost immediately, so the application feels faster even though total generation time is unchanged.

ChatOllama needs no special flag to stream. Create the model as usual, then call stream() instead of invoke().

from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="llama3.2",
    base_url="http://localhost:11434",  # default local Ollama endpoint
)

# Stream response manually
for chunk in llm.stream("Explain Docker in one sentence."):
    print(chunk.content, end="", flush=True)
print()
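
To see how much streaming helps on a given machine, it can be useful to measure how long the first chunk takes to arrive. The sketch below is one way to do that and is not part of LangChain: stream_and_collect is a hypothetical helper that accepts any iterable of objects with a .content attribute (which is what llm.stream() yields), prints each piece as it arrives, and returns the full text together with the time to the first chunk. The SimpleNamespace objects are stand-ins so the example runs without an Ollama server.

```python
import time
from types import SimpleNamespace
from typing import Iterable, Optional, Tuple


def stream_and_collect(chunks: Iterable) -> Tuple[str, Optional[float]]:
    """Print chunks as they arrive; return (full_text, seconds_until_first_chunk)."""
    start = time.perf_counter()
    first_chunk_seconds = None  # stays None if the stream is empty
    parts = []
    for chunk in chunks:
        if first_chunk_seconds is None:
            first_chunk_seconds = time.perf_counter() - start
        parts.append(chunk.content)
        print(chunk.content, end="", flush=True)
    print()
    return "".join(parts), first_chunk_seconds


# Stand-in chunks (assumption: real chunks expose .content the same way).
fake_chunks = [SimpleNamespace(content=t) for t in ["Docker ", "packages ", "apps."]]
text, first_chunk_seconds = stream_and_collect(fake_chunks)
print(f"first chunk after {first_chunk_seconds:.4f}s, {len(text)} characters total")
```

With a real model, pass llm.stream("Explain Docker in one sentence.") in place of fake_chunks.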

For chains, use astream on the chain directly.

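import asyncio

from langchain.chains import RetrievalQA

# `retriever` is assumed to be defined already (for example, from a vector store).
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

async def stream_answer():
    # Legacy chains such as RetrievalQA typically emit the finished answer
    # as a single chunk instead of token by token.
    async for chunk in qa_chain.astream({"query": "your question"}):
        if "result" in chunk:
            print(chunk["result"], end="", flush=True)
    print()

asyncio.run(stream_answer())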
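
The "result" check matters because a chain may yield dictionaries that carry other keys, such as the echoed input. As a minimal sketch of that filtering pattern, the hypothetical helper collect_results below consumes any async iterator of dictionaries, prints the values stored under one key, and returns them joined together. fake_chain_stream is a stand-in for qa_chain.astream(...) so the example runs without a model or retriever; the chunk contents are invented for illustration.

```python
import asyncio
from typing import Any, AsyncIterator, Dict


async def collect_results(
    chunks: AsyncIterator[Dict[str, Any]], key: str = "result"
) -> str:
    """Print every chunk[key] as it arrives and return the joined text."""
    parts = []
    async for chunk in chunks:
        if key in chunk:  # skip chunks that carry only other keys
            parts.append(chunk[key])
            print(chunk[key], end="", flush=True)
    print()
    return "".join(parts)


async def fake_chain_stream() -> AsyncIterator[Dict[str, Any]]:
    """Stand-in for qa_chain.astream({...}); the chunk contents are made up."""
    yield {"query": "your question"}
    yield {"result": "Streaming "}
    yield {"result": "works."}


answer = asyncio.run(collect_results(fake_chain_stream()))
```

Returning the joined text keeps the complete answer available for logging or caching after it has been displayed.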

Synchronous streaming uses the stream() method on the same chain object.

for chunk in qa_chain.stream({"query": "your question"}):
    if "result" in chunk:
        print(chunk["result"], end="", flush=True)

For chat interfaces, pass a list of messages and print each chunk as it arrives.

from langchain_core.messages import HumanMessage

messages = [HumanMessage(content="What is retrieval-augmented generation?")]

for token in llm.stream(messages):
    print(token.content, end="", flush=True)
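
Some chat front ends redraw a message from the full text so far instead of appending one token at a time. A minimal sketch of an adapter for that case follows. running_text is a hypothetical helper, not a LangChain function, and it assumes each chunk exposes a string .content attribute as the chunks from llm.stream() do. The SimpleNamespace objects stand in for real chunks so the example runs offline.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator


def running_text(chunks: Iterable) -> Iterator[str]:
    """Yield the accumulated text after each chunk arrives."""
    text = ""
    for chunk in chunks:
        text += chunk.content
        yield text


# Stand-in chunks (assumption: real chunks expose .content the same way).
fake_tokens = [SimpleNamespace(content=t) for t in ["RAG ", "retrieves ", "context."]]
snapshots = list(running_text(fake_tokens))
for snapshot in snapshots:
    print(repr(snapshot))
```

With a real model, iterate over running_text(llm.stream(messages)) and hand each snapshot to the UI's update call.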

A common mistake: calling invoke() always returns the complete response in one piece, however the model or chain is configured. Output arrives incrementally only through stream() or astream().

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.