02. Ollama LLM Integration
Ollama runs GGUF-quantized models locally behind a long-running HTTP server. By default it listens on localhost:11434 and exposes a REST API for chat and text completions. LangChain's Ollama integration connects to this API through either the ChatOllama class (for chat models) or the OllamaLLM class (for plain text completion). Install the dedicated langchain-ollama package and import from langchain_ollama rather than from the older langchain.llms.ollama path, which is deprecated.
Verify Ollama is running first:
# Check if Ollama backend is available
curl -s http://localhost:11434/api/tags | head -20
If you see JSON listing available models, Ollama is up. If you see a connection error, start it:
# Linux/macOS
ollama serve
# Or, on Linux with systemd, start it as a background service
sudo systemctl start ollama
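The same availability check can be scripted from Python. The sketch below uses only the standard library and assumes the /api/tags endpoint returns a JSON object with a "models" array whose entries carry a "name" field; the helper name list_ollama_models is hypothetical.

```python
import json
import urllib.error
import urllib.request


def list_ollama_models(base_url="http://localhost:11434", timeout=3.0):
    """Return installed model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or DNS failure: treat as "not running".
        return None
    # Assumed response shape: {"models": [{"name": "..."}, ...]}
    return [m["name"] for m in payload.get("models", [])]


if __name__ == "__main__":
    models = list_ollama_models()
    if models is None:
        print("Ollama is not reachable; start it with `ollama serve`.")
    else:
        print("Ollama is up. Installed models:", models)
```

Returning None for "unreachable" and an empty list for "running but no models pulled" keeps the two situations distinguishable for the caller.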
Once running, list your installed models:
import subprocess

# Run the Ollama CLI and capture its tabular output
result = subprocess.run(
    ["ollama", "list"], capture_output=True, text=True
)
print(result.stdout)
# Output looks like:
# NAME ID SIZE MODIFIED
# llama3.2:3b a9e1f02f0de8 1.8GB 2024-12-01 10:00:00
# mixtral:8x7b 7d4e0f02f1a9 26GB 2024-11-28 08:00:00
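To act on that listing rather than just print it, the output can be parsed into model names. This is a minimal sketch that assumes the first line is a header row and the model name is the first whitespace-separated column, as in the sample output above; parse_ollama_list and require_model are hypothetical helper names.

```python
import subprocess


def parse_ollama_list(stdout):
    """Extract model names from the tabular output of `ollama list`."""
    lines = [ln for ln in stdout.splitlines() if ln.strip()]
    # Skip the header row; the name is assumed to be the first column.
    return [ln.split()[0] for ln in lines[1:]]


def require_model(name):
    """Raise a RuntimeError if `name` is not installed locally."""
    result = subprocess.run(
        ["ollama", "list"], capture_output=True, text=True, check=True
    )
    if name not in parse_ollama_list(result.stdout):
        raise RuntimeError(f"{name} is not installed; run `ollama pull {name}`")
```

Calling require_model("llama3.2:3b") before constructing the LangChain client surfaces a missing model early, with an actionable message.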
Connect LangChain to the running Ollama instance:
from langchain_ollama import ChatOllama
# Initialize with a model you have installed
llm = ChatOllama(
    model="llama3.2:3b",
    base_url="http://localhost:11434",
    temperature=0.7,
    # To stream tokens, call llm.stream(...) instead of llm.invoke(...)
)
# Test the connection with a simple invocation
response = llm.invoke("Say hello in exactly three words.")
print(response.content)
# Expected: something like "Hello there, friend." (3 words)
The ChatOllama class returns AIMessage objects (defined in langchain_core.messages). This matters because downstream components such as output parsers (for example, StrOutputParser) expect messages that conform to LangChain's BaseMessage interface.
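The same message interface applies when streaming: LangChain chat models expose a stream() method that yields message chunks, each with a content attribute. The sketch below assumes content is a plain string for every chunk, which is typical for text-only Ollama models; stream_to_text and build_llm are hypothetical helper names.

```python
def stream_to_text(llm, prompt):
    """Print chunks as they arrive and return the full response text."""
    parts = []
    for chunk in llm.stream(prompt):
        # Each chunk is a message chunk; .content holds the newly generated text.
        print(chunk.content, end="", flush=True)
        parts.append(chunk.content)
    print()
    return "".join(parts)


def build_llm():
    """Create the chat model (requires langchain-ollama and a running server)."""
    # Imported lazily so stream_to_text can be tried with any object
    # that offers a compatible stream() method.
    from langchain_ollama import ChatOllama

    return ChatOllama(model="llama3.2:3b", base_url="http://localhost:11434")


# Usage, with Ollama running:
#   text = stream_to_text(build_llm(), "Say hello in exactly three words.")
```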
Common failure modes with the Ollama integration (the exact exception class depends on the client version):
| Error | Cause | Fix |
|---|---|---|
| ConnectionError: HTTPConnectionPool | Ollama not running | Run ollama serve in another terminal |
| ValueError: model not found | Model not pulled | ollama pull llama3.2:3b |
| APIStatusError: 500 | Model loaded from previous session, context mismatch | ollama ps, then ollama stop <model-name> or restart Ollama |
| Slow first response | Cold start loading model into VRAM | Keep Ollama running; the first call after a load is always slow |
The 500 error is the most insidious. Ollama reloads the model if the context window size changes between invocations or if the model was evicted. Check its status:
ollama ps
# NAME ID SIZE PROCESSOR UNTIL
# llama3.2:3b a9e1f02f0de8 2.1GB 100% GPU 4 minutes from now
If the model is listed but calls still return 500s, unload it and let the next call reload it:
ollama stop llama3.2:3b
# Then retry your LangChain call; the model reloads automatically
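The unload step can also be done over the REST API, which is convenient inside a script. This sketch relies on the keep_alive request parameter described in the Ollama API documentation: a request to /api/generate with a model name and no prompt changes how long that model stays in memory, and keep_alive=0 asks the server to unload it. The helper name set_keep_alive is hypothetical.

```python
import json
import urllib.request


def set_keep_alive(model, keep_alive, base_url="http://localhost:11434", timeout=30.0):
    """Ask Ollama to keep `model` loaded for `keep_alive` (0 unloads it now)."""
    body = json.dumps(
        {"model": model, "keep_alive": keep_alive, "stream": False}
    ).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Raises urllib.error.URLError if the server is down or returns an error status.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Usage, with Ollama running:
#   set_keep_alive("llama3.2:3b", 0)      # unload immediately
#   set_keep_alive("llama3.2:3b", "30m")  # load and keep resident for 30 minutes
```

Passing a duration instead of 0 preloads the model, which is one way to avoid the cold-start delay on the first real request.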
For production-style deployments, keep Ollama running as a persistent service rather than on-demand. The model load time dominates latency for interactive applications if you restart Ollama between calls.
Write a Python script that checks Ollama availability, lists installed models, initializes ChatOllama, and prints a one-sentence response from the model. Handle the ConnectionError case explicitly with a message telling the operator to start Ollama.