What this does

This guide explains how to build a RAG pipeline that retrieves and reasons over both images and text. You will use a vision model served by Ollama (llava or moondream) to caption images, store the caption embeddings in ChromaDB, and answer questions that depend on visual context.

Steps

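Verify that the vision model and the embedding model used later in this guide are available.

ollama list | grep -E "llava|moondream|mxbai-embed-large"
# Expected: one line per installed model, for example llava:latest followed by its ID, size, and modification time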
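The same check can be scripted so that ingestion stops early when a model is missing. The sketch below is a minimal example, assuming that `ollama list` prints a NAME column whose entries look like name:tag. The helper names `installed_models` and `missing_models` are hypothetical and are not part of the Ollama CLI or Python client.

```python
def installed_models(list_output: str) -> set[str]:
    """Return model names (tag stripped) found in `ollama list` text output."""
    names = set()
    for line in list_output.splitlines():
        parts = line.split()
        # Skip blank lines and the header row (assumed to start with NAME)
        if not parts or parts[0] == "NAME":
            continue
        names.add(parts[0].split(":")[0])
    return names


def missing_models(required: list[str], list_output: str) -> list[str]:
    """Return the required model names that do not appear in the output."""
    have = installed_models(list_output)
    return [name for name in required if name not in have]
```

One way to use it is to pass in the captured output of `subprocess.run(["ollama", "list"], capture_output=True, text=True).stdout` and run `ollama pull` for every name that comes back.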

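Create a ChromaDB collection with a compatible embedding function. If OpenCLIPEmbeddingFunction from chromadb.utils.embedding_functions is available in your installation (it depends on the open-clip-torch package), it can embed images directly. If not, use the fallback shown here: caption each image with the vision model and embed the caption as text. You can also keep the image itself as a base64 string next to the caption.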

import chromadb
import ollama

client = chromadb.PersistentClient(path="./multimodal_db")

# Fallback: caption each image with the vision model, then embed the caption
def image_to_description(image_path: str, model: str = "llava") -> str:
    with open(image_path, "rb") as f:
        img_bytes = f.read()
    # Pass the raw bytes through the images parameter so the model actually sees the picture
    response = ollama.generate(
        model=model,
        prompt="Describe this image concisely.",
        images=[img_bytes],
    )
    return response["response"]

# Store the caption, its embedding, and a reference to the image
def store_image_doc(client, image_path, metadata=None):
    caption = image_to_description(image_path)
    # Embed with the same model used at query time so the vector dimensions match
    vector = ollama.embeddings(model="mxbai-embed-large", prompt=caption)["embedding"]
    col = client.get_or_create_collection("multimodal_docs")
    # upsert overwrites an existing entry, so re-indexing the same path is safe
    col.upsert(
        ids=[image_path],
        embeddings=[vector],
        documents=[caption],
        metadatas=[{**(metadata or {}), "image_path": image_path}],
    )
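The base64 option mentioned in this step could look roughly like the sketch below. It is only an illustration: the function names, the `image_b64` metadata key, and the `inline_limit` size cap are all hypothetical choices, made on the assumption that small images are worth inlining while large ones are better referenced by path.

```python
import base64


def image_to_base64(image_bytes: bytes) -> str:
    """Encode raw image bytes as an ASCII base64 string."""
    return base64.b64encode(image_bytes).decode("ascii")


def base64_to_image(encoded: str) -> bytes:
    """Decode a base64 string back into the original image bytes."""
    return base64.b64decode(encoded)


def image_metadata(image_path: str, image_bytes: bytes, inline_limit: int = 64_000) -> dict:
    """Build a metadata dict, inlining the image only when it is small enough."""
    meta = {"image_path": image_path}
    # inline_limit is an assumed cap (in bytes) to keep metadata records small
    if len(image_bytes) <= inline_limit:
        meta["image_b64"] = image_to_base64(image_bytes)
    return meta
```

The resulting dict contains only string values, so it can be merged into the metadata passed alongside each caption.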

Index a directory of images. Attach metadata tags to each entry so results can be filtered later.

import os

image_dir = "./images"
for fname in sorted(os.listdir(image_dir)):
    # Compare in lowercase so files such as PHOTO.JPG are not skipped
    if fname.lower().endswith((".png", ".jpg", ".jpeg")):
        store_image_doc(client, os.path.join(image_dir, fname), {"category": "diagram"})
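The loop above tags every image with the same category. As a hedged alternative, the sketch below derives the tag from the first subdirectory under the image root. This folder-per-category layout is an assumption rather than something the guide prescribes, and `collect_images` is a hypothetical helper name.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def collect_images(image_dir: str) -> list[dict]:
    """Walk image_dir recursively and return one metadata record per image."""
    root = Path(image_dir)
    records = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            rel = path.relative_to(root)
            # Images directly in the root have no folder to take a tag from
            category = rel.parts[0] if len(rel.parts) > 1 else "uncategorized"
            records.append({"image_path": str(path), "category": category})
    return records
```

Each record can then be passed to the storage function as the image path plus a `{"category": ...}` metadata dict.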

Query the collection. Embed the question, retrieve the closest captions, and combine them with the user's question in the prompt.

import ollama

def multimodal_query(client, query_text, n=3):
    # Embed the question as a vector for nearest-neighbor search
    query_vec = ollama.embeddings(model="mxbai-embed-large", prompt=query_text)["embedding"]
    col = client.get_collection("multimodal_docs")
    results = col.query(query_embeddings=[query_vec], n_results=n)

    # results["documents"][0] holds the captions retrieved for this single query
    context = "\n".join(results["documents"][0])
    answer = ollama.generate(
        model="llava",
        prompt=(
            "Use the following image descriptions to answer the question.\n\n"
            f"Descriptions:\n{context}\n\nQuestion: {query_text}"
        ),
    )
    return answer["response"]
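Prompt assembly can be pulled out into a small pure function, which makes it easier to cap the number and length of captions before they reach the model. The following is a minimal sketch, assuming that a word count is an acceptable rough stand-in for a token count. `truncate_words` and `build_prompt` are hypothetical helper names, and the default limits are illustrative.

```python
def truncate_words(text: str, max_words: int = 150) -> str:
    """Keep at most max_words whitespace-separated words of text."""
    words = text.split()
    if len(words) <= max_words:
        return text.strip()
    return " ".join(words[:max_words]) + " ..."


def build_prompt(question: str, captions: list[str],
                 max_captions: int = 3, max_words: int = 150) -> str:
    """Number the first max_captions captions and append the question."""
    kept = [truncate_words(c, max_words) for c in captions[:max_captions]]
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(kept, start=1))
    return (
        "Use the following image descriptions to answer the question.\n\n"
        f"Descriptions:\n{numbered}\n\nQuestion: {question}"
    )
```

Numbering the captions is a design choice that lets the answer refer back to a specific description. The wording of the instruction matches the prompt used in the query function.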

Verification

python3 -c "from PIL import Image; print('Pillow OK')"
# Expected: Pillow OK
ollama run llava "Describe a simple diagram"
# Expected: a short paragraph of generated text, which confirms the model loads and responds

Common failures

Vision model not pulled. Run ollama pull llava before using it in code. Without the model on disk, the Python client raises ollama.ResponseError (model not found) instead of returning a caption.
Image format not supported. Pillow handles PNG and JPEG well. For TIFF or HEIC, open the file with Pillow first (HEIC needs a plugin such as pillow-heif), call img.convert("RGB") on the opened image, and save it as PNG or JPEG.
Duplicate IDs in ChromaDB. IDs must be unique within a collection. Depending on the ChromaDB version, adding the same image path twice either raises a duplicate ID error or is skipped with a warning. Use col.upsert(...) to overwrite the existing entry, or check for the ID first with col.get(ids=[image_path]).
Context window overflow. Concatenating many long captions can exceed the model's context window. Limit retrieval to the top 3 results and truncate each caption to about 200 tokens.
Slow image captioning. Captioning each image through Ollama is slow for large corpora. Caption once during ingestion and cache the captions in the vector store so that queries never re-run the vision model.
Version mismatch. The installed package or runtime differs from the command shown. Check the version first and rerun the smallest verification command.
Local environment drift. Another service, virtual environment, model, or path is being used. Print the active binary path and configuration before changing the guide steps.
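For the slow-captioning case, one possible caching scheme is sketched below. It keys captions by a SHA-256 hash of the image bytes and keeps them in a local JSON file, which is a swapped-in simplification of caching in the vector store. The captioning function is injected as `caption_fn`, so the sketch stays independent of any model client. `cached_caption` and the cache file layout are hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_caption(image_bytes: bytes,
                   caption_fn: Callable[[bytes], str],
                   cache_path: str) -> str:
    """Return a cached caption, calling caption_fn only on a cache miss."""
    cache_file = Path(cache_path)
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    # Identical bytes always hash to the same key, whatever the file is named
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in cache:
        cache[key] = caption_fn(image_bytes)
        cache_file.write_text(json.dumps(cache))
    return cache[key]
```

Because the key depends on content rather than on the path, renamed or copied files reuse the existing caption. The whole file is rewritten on every miss, which is fine for small corpora but would need a proper store at scale.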

How to Build Multi-Modal RAG for Images and Text
