HOW-TO · RAG
How to Build Multi-Modal RAG for Images and Text
Target environment
Ubuntu 24.04 · Ollama 0.4.x
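Prerequisites
A vision model pulled in Ollama (llava or moondream), the mxbai-embed-large embedding model used in the query step, the ollama, chromadb, and Pillow Python packages, and a vector store with multi-modal support (ChromaDB in this guide)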
What this does
This guide explains how to build a RAG pipeline that retrieves and reasons over both images and text. You will use an Ollama vision model to caption images, store the caption embeddings in a vector database alongside a reference to each image, and answer questions that require visual context.
Steps
Verify the vision model is available.
ollama list | grep -E "llava|moondream"
# Expected: a row whose name starts with llava or moondream, e.g. llava:latest

Create a ChromaDB collection with a compatible embedding function. Use chroma-utils or the built-in OpenCLIPEmbeddingFunction if available. If not, store images as base64 (or keep a path reference, as in the code here) and use a fallback text embedder for descriptions.

import chromadb
import ollama
from PIL import Image

EMBED_MODEL = "mxbai-embed-large"  # must match the model used at query time

client = chromadb.PersistentClient(path="./multimodal_db")

# Fallback: describe images with text and embed descriptions
def image_to_description(image_path: str, model: str = "llava") -> str:
    # Fail early if Pillow cannot read the file
    with Image.open(image_path) as img:
        img.verify()
    with open(image_path, "rb") as f:
        img_bytes = f.read()
    # Use llava to caption the image; the bytes go in `images`, not in the prompt
    response = ollama.generate(
        model=model,
        prompt="Describe this image concisely.",
        images=[img_bytes],
    )
    return response["response"]

# Store the caption alongside the image reference
def store_image_doc(client, image_path, metadata=None):
    caption = image_to_description(image_path)
    # Embed the caption with the same model the query step uses,
    # so stored and query vectors have the same dimension
    vec = ollama.embeddings(model=EMBED_MODEL, prompt=caption)["embedding"]
    col = client.get_or_create_collection("multimodal_docs")
    col.add(
        ids=[image_path],
        embeddings=[vec],
        documents=[caption],
        metadatas=[{**(metadata or {}), "image_path": image_path}],
    )

Index a set of images. Create metadata tags for filtering.
import os

image_dir = "./images"
for fname in os.listdir(image_dir):
    if fname.endswith((".png", ".jpg", ".jpeg")):
        store_image_doc(client, os.path.join(image_dir, fname), {"category": "diagram"})

Query the collection with a multi-modal prompt. Combine retrieved captions with the user's question.
def multimodal_query(client, query_text, n=3):
    # Embed the question with the text embedding model
    query_vec = ollama.embeddings(model="mxbai-embed-large", prompt=query_text)["embedding"]
    col = client.get_collection("multimodal_docs")
    results = col.query(query_embeddings=[query_vec], n_results=n)
    # Join the retrieved captions into one context block
    context = "\n".join(results["documents"][0])
    answer = ollama.generate(
        model="llava",
        prompt=(
            "Use the following image descriptions to answer the question.\n\n"
            f"Descriptions:\n{context}\n\nQuestion: {query_text}"
        ),
    )
    return answer["response"]
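The query step joins captions only, so the answer cannot indicate which image a description came from. The following is a minimal sketch of one way to label each caption with its source, assuming the result dictionary has the shape returned by col.query (lists of lists under "documents" and "metadatas"). The names build_context and max_chars are hypothetical and introduced here for illustration.

```python
def build_context(results: dict, max_chars: int = 800) -> str:
    """Label each retrieved caption with its image path and cap its length."""
    docs = results["documents"][0]
    metas = results["metadatas"][0]
    lines = []
    for caption, meta in zip(docs, metas):
        # Metadata may be missing for an entry; fall back to a placeholder
        path = (meta or {}).get("image_path", "unknown")
        lines.append(f"[{path}] {caption[:max_chars]}")
    return "\n".join(lines)
```

Inside multimodal_query, context = build_context(results) could take the place of the plain join, which lets the model cite image paths in its answer.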
Verification
python3 -c "from PIL import Image; print('Pillow OK')"
# Expected: Pillow OK
ollama run llava "Describe a simple diagram"
# Expected: [brief caption text]
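The same model check can be done from Python before ingestion starts. This is a rough sketch, assuming the text printed by ollama list is a header row followed by one row per model with the model name in the first column. The helper names parse_model_names and missing_models are hypothetical, not part of any library.

```python
def parse_model_names(list_output: str) -> set[str]:
    """Return base model names (tag stripped) from `ollama list` text output."""
    names = set()
    for line in list_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields:
            # "llava:latest" -> "llava"
            names.add(fields[0].split(":")[0])
    return names


def missing_models(list_output: str, required: list[str]) -> list[str]:
    """Return the required models that do not appear in the listing."""
    installed = parse_model_names(list_output)
    return [name for name in required if name not in installed]
```

To use it, capture the listing with subprocess.run(["ollama", "list"], capture_output=True, text=True).stdout and pass it to missing_models along with the models the pipeline needs, such as ["llava", "mxbai-embed-large"].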
Common failures
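- Vision model not pulled. Run ollama pull llava before using it in code; without the model on disk the API returns a "model not found" error, which the Python client raises as ollama.ResponseError.
- Image format not supported. Pillow handles PNG/JPEG well; for TIFF or HEIC, convert to PNG or JPEG first. Image.convert("RGB") only changes the color mode, so open the file with Pillow (HEIC needs a plugin such as pillow-heif), call convert("RGB"), and save it under the new extension.
- Duplicate IDs in ChromaDB. Adding the same image path twice produces a duplicate ID error or a skipped insert, depending on the ChromaDB version. Use unique IDs like f"{image_path}_{timestamp}", check existence first, or call upsert to overwrite the existing entry.
- Context window overflow. Very long caption concatenations can exceed the model context. Limit retrieval to top-3 results and truncate captions to 200 tokens.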
- Slow image captioning. Encoding images through Ollama per image is slow for large corpora. Batch-encode during ingestion and cache captions in the vector store.
- Version mismatch. The installed package or runtime differs from the command shown; check the version first (for example with ollama --version) and rerun the smallest verification command.
- Local environment drift. Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
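Three of the failures above (duplicate IDs, context overflow, slow captioning) can be reduced with small helpers at ingestion time. The following is a sketch under stated assumptions: IDs are derived from file contents, truncation counts whitespace-separated words as a rough stand-in for tokens, and captions are cached in a local JSON file rather than in the vector store. All function names and the cache file name are hypothetical; caption_fn stands for any captioning callable, such as image_to_description.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def content_id(image_path: str) -> str:
    """Derive a stable ID from the file bytes, so re-adding a file reuses its ID."""
    digest = hashlib.sha256(Path(image_path).read_bytes()).hexdigest()
    return digest[:16]


def truncate_words(text: str, max_words: int = 200) -> str:
    """Keep at most max_words whitespace-separated words (approximates tokens)."""
    return " ".join(text.split()[:max_words])


def cached_caption(image_path: str, caption_fn: Callable[[str], str],
                   cache_file: str = "captions.json") -> str:
    """Return a cached caption if the same bytes were captioned before."""
    cache_path = Path(cache_file)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    key = content_id(image_path)
    if key not in cache:
        # Only call the (slow) captioning model on a cache miss
        cache[key] = truncate_words(caption_fn(image_path))
        cache_path.write_text(json.dumps(cache))
    return cache[key]
```

With ChromaDB, passing ids=[content_id(path)] to col.upsert(...) instead of col.add(...) overwrites an existing entry rather than creating a second one. Rewriting the JSON file on every miss is simple but not safe for concurrent writers, so a larger corpus would likely want a different cache.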