07. Ollama Python Client
The Python library provides programmatic access to Ollama's API with typed objects and convenience methods. Install it with pip:
pip install ollama
Basic Usage
from ollama import chat
from ollama import ChatResponse
response: ChatResponse = chat(model='llama3.2:1b', messages=[
{
'role': 'user',
'content': 'What is recursion?',
},
])
print(response.message.content)
The chat function sends a request to http://localhost:11434/api/chat. It waits for the complete response by default.
Streaming Responses
from ollama import chat
stream = chat(model='llama3.2:1b', messages=[
{'role': 'user', 'content': 'List the planets'}
], stream=True)
for chunk in stream:
print(chunk['message']['content'], end='', flush=True)
print()
Streaming yields dictionaries as each chunk arrives. Set stream=True for real-time output.
Embeddings
from ollama import embeddings
response = embeddings(model='nomic-embed-text', prompt='Hello world')
embedding = response['embedding']
print(f'Embedding dimension: {len(embedding)}')
Error Handling
import httpx
from ollama import chat
try:
response = chat(model='nonexistent-model', messages=[
{'role': 'user', 'content': 'Hello'}
])
except httpx.ConnectError:
print("Cannot connect to Ollama. Is the server running?")
except httpx.HTTPStatusError as e:
print(f"API error: {e.response.status_code} - {e.response.text}")
The client raises httpx.ConnectError when the server is unreachable and httpx.HTTPStatusError for API-level errors (like requesting a non-existent model).
Client Configuration
from ollama import Client
client = Client(host='http://localhost:11434')
# Or connect to a remote host
client = Client(host='http://192.168.1.100:11434')
response = client.chat(model='llama3.2:1b', messages=[
{'role': 'user', 'content': 'Hello'}
])
The explicit client lets you target remote Ollama instances. By default, the client connects to http://localhost:11434.
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Write a Python script that uses the streaming chat function to echo tokens as they arrive, and measure the time to first token versus time to complete response.