16. Streaming with LangChain
Streaming delivers tokens as they are generated instead of waiting for the complete response. With local models that take 10-30 seconds to generate an answer, streaming shows output almost immediately, so the application feels faster even though total generation time stays roughly the same.
Create the model as usual. ChatOllama needs no streaming flag in its configuration; calling stream() is what produces incremental output.
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="llama3.2",
    base_url="http://localhost:11434",
)

# Stream the response manually, printing each chunk as it arrives
for chunk in llm.stream("Explain Docker in one sentence."):
    print(chunk.content, end="", flush=True)
print()  # final newline once the stream ends
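When the full answer is also needed after streaming finishes (for logging or caching, say), the chunks can be accumulated while they are printed. The sketch below assumes only that each chunk exposes a `.content` string, as the chunks from `llm.stream()` do. `stream_to_text` and `fake_chunks` are hypothetical names, and the stand-in list replaces a real model so the example runs on its own.

```python
import sys
from types import SimpleNamespace
from typing import Iterable


def stream_to_text(chunks: Iterable, out=sys.stdout) -> str:
    """Print each chunk's content as it arrives and return the joined text."""
    parts = []
    for chunk in chunks:
        text = chunk.content or ""  # tolerate empty or None content
        out.write(text)
        out.flush()
        parts.append(text)
    out.write("\n")
    return "".join(parts)


# Stand-in for llm.stream(...): any iterable of objects with .content works
fake_chunks = [SimpleNamespace(content=t) for t in ("Docker ", "packages ", "apps.")]
answer = stream_to_text(fake_chunks)
```

With a running Ollama server, the same helper should accept `llm.stream("...")` in place of `fake_chunks`.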
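For chains, call astream() on the chain directly. A legacy chain such as RetrievalQA typically yields its result as a single chunk, so the loop below may print the whole answer at once rather than token by token.
import asyncio

from langchain.chains import RetrievalQA

# `retriever` is assumed to be defined earlier (for example, from a vector store)
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

async def stream_answer():
    async for chunk in qa_chain.astream({"query": "your question"}):
        if "result" in chunk:
            print(chunk["result"], end="", flush=True)
    print()

asyncio.run(stream_answer())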
Synchronous streaming uses the stream() method on the same chain object.
for chunk in qa_chain.stream({"query": "your question"}):
    if "result" in chunk:
        print(chunk["result"], end="", flush=True)
print()
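Chain chunks are dictionaries rather than message objects, and depending on the chain type the answer may arrive in one chunk or in many. A small helper can hide that difference from the calling code. This is only a sketch: `iter_result_text` is a hypothetical name, the `"result"` key follows the RetrievalQA examples above, and the sample list stands in for a real chain.

```python
from typing import Iterable, Iterator


def iter_result_text(chunks: Iterable[dict], key: str = "result") -> Iterator[str]:
    """Yield the non-empty text stored under `key`, skipping chunks without it."""
    for chunk in chunks:
        text = chunk.get(key)
        if text:
            yield text


# Stand-in for qa_chain.stream({"query": "your question"})
sample_chunks = [
    {"query": "your question"},
    {"result": "RAG retrieves "},
    {"result": "context first."},
]
pieces = list(iter_result_text(sample_chunks))
print("".join(pieces))
```

The same generator should work whether the chain emits one large chunk or many small ones, which keeps the display code independent of the chain type.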
For chat interfaces, pass a list of messages and print each token as it arrives.
from langchain_core.messages import HumanMessage

messages = [HumanMessage(content="What is retrieval-augmented generation?")]
for token in llm.stream(messages):
    print(token.content, end="", flush=True)
print()
A common mistake is expecting invoke() to stream. invoke() always returns the complete response in one piece; output arrives incrementally only through stream() or astream().
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Create a simple chain and implement streaming that prints tokens as they arrive. Measure total time for streaming completion versus invoke() completion.
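One way to approach the measurement is a helper that records both the time to the first chunk and the total time for any iterable of chunks. The sketch below is one possible structure, not a prescribed solution. `time_stream` and `slow_chunks` are hypothetical names, and the generator simulates token delays so the code runs without a model.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_stream(chunks: Iterable[str]) -> Tuple[str, float, float]:
    """Consume chunks; return (text, seconds to first chunk, total seconds)."""
    start = time.perf_counter()
    first = None
    parts = []
    for chunk in chunks:
        if first is None:
            first = time.perf_counter() - start  # latency until output appears
        parts.append(chunk)
    total = time.perf_counter() - start
    return "".join(parts), (first if first is not None else total), total


def slow_chunks(tokens, delay: float = 0.01) -> Iterator[str]:
    """Simulated token stream: sleeps before yielding each token."""
    for token in tokens:
        time.sleep(delay)
        yield token


text, first_s, total_s = time_stream(slow_chunks(["a", "b", "c"]))
print(f"first chunk: {first_s:.3f}s, total: {total_s:.3f}s")
```

With a real model, the input could be `(c.content for c in llm.stream(prompt))`, with `llm.invoke(prompt)` timed separately using `time.perf_counter()`. The two totals are likely to be close; the figure worth comparing is the time to the first chunk.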