RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

© 2026 runlocalai.co · Independently operated
HOW-TO · RAG

How to Build Multi-Modal RAG for Images and Text

advanced · 40 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

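A vision model pulled in Ollama (llava or moondream), the mxbai-embed-large embedding model, a vector store (ChromaDB in this guide), and the ollama, chromadb, and Pillow Python packages.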

What this does

This guide explains how to build a RAG pipeline that retrieves and reasons over both images and text. You will use a vision model served by Ollama to caption images, store the caption embeddings in a vector database next to a reference to each image, and answer questions that require visual context.
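
The data flow can be sketched without any model at all. The toy example below is not part of the pipeline built in the steps: it replaces the vision model with hand-written captions and the embedding model with a bag-of-words vector, purely to show that retrieval runs over caption text while the image path rides along as a reference. All names here (`CAPTIONS`, `embed`, `cosine`, `retrieve`) are hypothetical.

```python
import math
from collections import Counter

# Hand-written stand-ins for captions a vision model would produce (assumption).
CAPTIONS = {
    "images/net.png": "network diagram with a router and three switches",
    "images/db.png": "database schema diagram with users and orders tables",
    "images/cat.jpg": "photo of a cat sleeping on a sofa",
}

def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real pipeline would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question: str, n: int = 2) -> list[tuple[str, str]]:
    """Return the n (image_path, caption) pairs whose captions best match the question."""
    q = embed(question)
    ranked = sorted(CAPTIONS.items(), key=lambda kv: cosine(q, embed(kv[1])), reverse=True)
    return ranked[:n]

if __name__ == "__main__":
    for path, caption in retrieve("which diagram shows the database tables"):
        print(path, "->", caption)
```

The steps that follow keep this shape and swap in llava for the captions, an embedding model for `embed`, and ChromaDB for the ranking.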

Steps

  1. Verify the vision model is available.

    ollama list | grep -E "llava|moondream"
    # Expected: a line starting with llava:latest or moondream:latest, followed by its ID, size, and modified time
    
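  2. Create a ChromaDB collection with a compatible embedding function. If the OpenCLIPEmbeddingFunction from chromadb.utils.embedding_functions is available (it depends on the open-clip-torch package), it can embed images directly. If not, use the fallback shown here: caption each image with the vision model, embed the caption with a text embedder, and keep the image path in the metadata.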

    import chromadb
    import ollama
    
    # Captions and queries must be embedded with the same model
    EMBED_MODEL = "mxbai-embed-large"
    
    client = chromadb.PersistentClient(path="./multimodal_db")
    
    # Fallback: describe images with text and embed descriptions
    def image_to_description(image_path: str, model: str = "llava") -> str:
        with open(image_path, "rb") as f:
            img_bytes = f.read()
        # The image is passed through the images argument, not inside the prompt text
        response = ollama.generate(
            model=model,
            prompt="Describe this image concisely.",
            images=[img_bytes],
        )
        return response["response"]
    
    # Store the caption and its embedding alongside the image reference
    def store_image_doc(client, image_path, metadata=None):
        caption = image_to_description(image_path)
        caption_vec = ollama.embeddings(model=EMBED_MODEL, prompt=caption)["embedding"]
        col = client.get_or_create_collection("multimodal_docs")
        col.add(
            ids=[image_path],
            documents=[caption],
            embeddings=[caption_vec],
            metadatas=[{**(metadata or {}), "image_path": image_path}],
        )
    
  3. Index a set of images. Create metadata tags for filtering.

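    import os
    
    image_dir = "./images"
    # sorted() keeps the indexing order stable across runs
    for fname in sorted(os.listdir(image_dir)):
        # Lowercase the name so .PNG and .JPG files are picked up too
        if fname.lower().endswith((".png", ".jpg", ".jpeg")):
            store_image_doc(client, os.path.join(image_dir, fname), {"category": "diagram"})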
    
  4. Query the collection. Embed the question, retrieve the closest captions, and combine them with the user's question in a single text prompt.

    import ollama
    
    def multimodal_query(client, query_text, n=3):
        # The query must be embedded with the same model used when the captions
        # were stored; a different model yields vectors of a different dimension
        query_vec = ollama.embeddings(model="mxbai-embed-large", prompt=query_text)["embedding"]
        col = client.get_collection("multimodal_docs")
        results = col.query(query_embeddings=[query_vec], n_results=n)
    
        # results["documents"][0] holds the captions retrieved for this one query
        context = "\n".join(results["documents"][0])
        answer = ollama.generate(
            model="llava",
            prompt=f"Use the following image descriptions to answer the question.\n\nDescriptions:\n{context}\n\nQuestion: {query_text}"
        )
        return answer["response"]
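
Step 4 joins raw captions, so the answer cannot say which image a description came from, and nothing caps caption length. A minimal sketch of a context builder follows, assuming the result layout that `col.query` returns (parallel lists under `documents` and `metadatas`, one inner list per query). The helper name `build_context` and the word-based cap are illustrative assumptions; whitespace words only approximate tokens.

```python
def build_context(results: dict, max_words: int = 150) -> str:
    """Format retrieved captions as numbered lines tagged with their image path.

    `results` is expected to look like a ChromaDB query result for one query:
    {"documents": [[...]], "metadatas": [[...]]}.
    """
    captions = results["documents"][0]
    # Fall back to empty metadata if the result carries none.
    metadatas = results.get("metadatas") or [[{}] * len(captions)]
    lines = []
    for i, (caption, meta) in enumerate(zip(captions, metadatas[0]), start=1):
        words = caption.split()
        if len(words) > max_words:
            # Crude truncation by whitespace words, a rough proxy for tokens.
            caption = " ".join(words[:max_words]) + " ..."
        path = (meta or {}).get("image_path", "unknown")
        lines.append(f"[{i}] ({path}) {caption}")
    return "\n".join(lines)
```

Inside `multimodal_query`, `context = build_context(results)` would take the place of the plain join. `col.query` also accepts a `where` filter, for example `where={"category": "diagram"}`, to restrict retrieval to the tags set in step 3.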
    

Verification

python3 -c "from PIL import Image; print('Pillow OK')"
# Expected: Pillow OK
ollama run llava "Describe a simple diagram"
# Expected: a short text response

Common failures

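  • Vision model not pulled. Run ollama pull llava before using it in code; without the model on disk, the Ollama API returns a model-not-found error, which the Python client raises as ollama.ResponseError.
  • Image format not supported. Pillow handles PNG/JPEG well; for TIFF or HEIC, open the file with Pillow first (HEIC needs a plugin such as pillow-heif), call .convert("RGB") on the image, and save it as PNG or JPEG before captioning.
  • Duplicate IDs in ChromaDB. Re-adding an image path that is already in the collection does not update it: depending on the ChromaDB version, the call raises a duplicate-ID error or skips the existing record. Use col.upsert when re-indexing, use unique IDs like f"{image_path}_{timestamp}", or check existence first with col.get(ids=[...]).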
  • Context window overflow. Very long caption concatenations can exceed the model context. Limit retrieval to top-3 results and truncate captions to 200 tokens.
  • Slow image captioning. Encoding images through Ollama per image is slow for large corpora. Batch-encode during ingestion and cache captions in the vector store.
  • Version mismatch. The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
  • Local environment drift. Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
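
Two of the failures above (duplicate IDs and slow captioning) can be handled together with a content-derived ID and a caption cache. The sketch below uses only the standard library. `image_id`, `cached_caption`, and the JSON cache file are hypothetical names, and `caption_fn` stands for whatever captioning function you use, such as `image_to_description` from step 2.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable

def image_id(image_path: str) -> str:
    """Derive a stable ID from the file contents, so identical files share one ID."""
    digest = hashlib.sha256(Path(image_path).read_bytes()).hexdigest()
    return digest[:16]

def cached_caption(image_path: str, caption_fn: Callable[[str], str],
                   cache_file: str = "caption_cache.json") -> str:
    """Return a cached caption if one exists; otherwise caption once and persist it."""
    cache_path = Path(cache_file)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    key = image_id(image_path)
    if key not in cache:
        # The slow call, made once per unique image.
        cache[key] = caption_fn(image_path)
        cache_path.write_text(json.dumps(cache, indent=2))
    return cache[key]
```

Using `image_id(path)` as the ChromaDB ID, together with `col.upsert` in place of `col.add`, would make repeated ingestion runs idempotent.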

Related guides

  • How to Optimize RAG for Low Latency
  • How to Set Up ChromaDB from Scratch