How to Set Up Multi-Modal RAG with Images

Advanced · 35 min · By Fredoline Eruo

Target environment

Ubuntu 24.04 · Ollama 0.4.x

Prerequisites

Vision model (LLaVA), vector store, image processing tools

What this does

Multi-modal RAG retrieves both text and images relevant to a query by indexing images with captions and storing them alongside text documents in a unified vector store. When a query arrives, the system retrieves the top-K most relevant items (images and text) and passes them to a vision-language model for synthesis.

Steps

Step 1 — Collect and preprocess images.

Gather images from the source corpus (local filesystem, S3, or a database). Resize each image so that it fits within 1024×1024 pixels while preserving its aspect ratio; this reduces embedding cost and latency without losing much visual detail. Convert to PNG or JPEG with a consistent color profile (for example, sRGB).
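The resize step can be sketched as follows. This is a minimal example, assuming the Pillow library is installed; the function names `fit_within` and `preprocess_image` and the JPEG quality setting are illustrative choices, not part of the original guide. Note that `convert("RGB")` only normalizes the pixel mode; it does not convert embedded ICC profiles.

```python
from pathlib import Path

MAX_SIDE = 1024  # upper bound on either side, taken from Step 1


def fit_within(width: int, height: int, max_side: int = MAX_SIDE) -> tuple[int, int]:
    """Return (width, height) scaled down to fit max_side, keeping aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Only shrink; images already within the bound keep their size.
    scale = min(1.0, max_side / max(width, height))
    # Never round a side down to zero for extremely elongated images.
    return max(1, round(width * scale)), max(1, round(height * scale))


def preprocess_image(src: Path, dst_dir: Path) -> Path:
    """Resize one image and save it as an RGB JPEG in dst_dir (hypothetical helper)."""
    from PIL import Image  # third-party dependency: pip install pillow

    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / (src.stem + ".jpg")
    with Image.open(src) as img:
        rgb = img.convert("RGB")  # drop alpha/palette so every file has the same mode
        rgb = rgb.resize(fit_within(*rgb.size))
        rgb.save(dst, format="JPEG", quality=90)  # quality value is an assumption
    return dst
```

For example, `fit_within(4000, 3000)` yields a 1024×768 target, while an 800×600 image is left at its original size.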

Step 2 — Generate captions for each image.

Use a vision model (LLaVA, GPT-4o, or equivalent) to generate a descriptive caption for each image. Prompt the model with: "Describe this image in 2-3 sentences as if for a document retrieval system." Store the caption alongside the image filename. For images with existing alt-text or surrounding context, prepend that text to the caption.
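One way to script the captioning step against a local Ollama server is sketched below. It assumes Ollama is listening on its default local port and that a model tagged `llava` has been pulled; the helper names (`build_caption_request`, `generate_caption`, `merge_context`) and the timeout are hypothetical and chosen for illustration only.

```python
import base64
import json
import urllib.request

CAPTION_PROMPT = (
    "Describe this image in 2-3 sentences as if for a document retrieval system."
)
OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def build_caption_request(image_bytes: bytes, model: str = "llava") -> dict:
    """Build the JSON body for a non-streaming Ollama generate call."""
    return {
        "model": model,
        "prompt": CAPTION_PROMPT,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def generate_caption(image_bytes: bytes, model: str = "llava") -> str:
    """Send the image to the local Ollama server and return the caption text."""
    body = json.dumps(build_caption_request(image_bytes, model)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"].strip()


def merge_context(caption: str, context: str = "") -> str:
    """Prepend existing alt-text or surrounding context to the generated caption."""
    context = context.strip()
    return f"{context} {caption}".strip() if context else caption.strip()
```

A caller would typically store `merge_context(generate_caption(data), alt_text)` next to the image filename, so that any existing alt-text comes first in the stored caption.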

Step 3 — Create a dual embedding strategy.

Generate two embeddings per image: one for the caption text (using a standard text embedding model) and one for the image itself (using a vision embedding model such as CLIP). Store both embeddings in the vector database, each referencing the same image_id. Embed text documents with the same text embedding model used for the captions, so that text chunks and captions are directly comparable.
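The dual-embedding idea can be illustrated with a small sketch in which the two embedding models are passed in as plain callables. The `EmbeddingRecord` type and `embed_image_item` function are hypothetical names, and the two "fake" embedders are stand-ins so the example runs without downloading any model; they are not real encoders and would be replaced by an actual text embedding model and a CLIP-style image encoder.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

TextEmbedder = Callable[[str], Sequence[float]]
ImageEmbedder = Callable[[bytes], Sequence[float]]


@dataclass(frozen=True)
class EmbeddingRecord:
    image_id: str                # shared by both records of one image
    kind: str                    # "caption" or "image"
    vector: tuple[float, ...]


def embed_image_item(
    image_id: str,
    caption: str,
    image_bytes: bytes,
    embed_text: TextEmbedder,
    embed_image: ImageEmbedder,
) -> list[EmbeddingRecord]:
    """Produce one caption embedding and one image embedding for the same image_id."""
    return [
        EmbeddingRecord(image_id, "caption", tuple(embed_text(caption))),
        EmbeddingRecord(image_id, "image", tuple(embed_image(image_bytes))),
    ]


# Stand-in embedders (assumptions for illustration only, NOT real models).
def fake_text_embedder(text: str) -> list[float]:
    return [float(len(text)), float(text.count(" "))]


def fake_image_embedder(data: bytes) -> list[float]:
    return [float(len(data)), float(sum(data) % 256)]


records = embed_image_item(
    "img-001",
    "Failover architecture diagram",
    b"\x89PNG",
    fake_text_embedder,
    fake_image_embedder,
)
```

Passing the embedders as arguments keeps the record-building logic independent of any particular model library, which makes it easier to swap models later if retrieval quality is poor.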

Step 4 — Index all items in the vector store.

Insert each text chunk and each image (caption + image embedding) as separate records. Assign a type field (text or image) to enable filtering. Ensure the image_id links the caption embedding to the image embedding for retrieval assembly.
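The record layout described above might look like the following in-memory sketch. It is a stand-in for a real vector database, not a recommendation of one: the `Record` and `InMemoryIndex` names are hypothetical, the vectors are toy two-dimensional values, and the sketch assumes all vectors share one dimension and one embedding space.

```python
import math
from dataclasses import dataclass
from typing import Optional

VALID_TYPES = {"text", "image"}


@dataclass
class Record:
    record_id: str
    type: str                       # "text" or "image", used for filtering
    vector: list[float]
    content: str                    # chunk text, or the caption for an image
    image_id: Optional[str] = None  # links the caption and image vectors


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryIndex:
    def __init__(self) -> None:
        self._records: list[Record] = []

    def insert(self, record: Record) -> None:
        if record.type not in VALID_TYPES:
            raise ValueError(f"unknown record type: {record.type!r}")
        if record.type == "image" and record.image_id is None:
            raise ValueError("image records need an image_id")
        self._records.append(record)

    def search(
        self, query: list[float], k: int = 5, type_filter: Optional[str] = None
    ) -> list[tuple[float, Record]]:
        """Return the k highest-scoring records, optionally restricted to one type."""
        candidates = [r for r in self._records if type_filter in (None, r.type)]
        scored = [(cosine(query, r.vector), r) for r in candidates]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


index = InMemoryIndex()
index.insert(Record("t1", "text", [1.0, 0.0], "Failover runbook, section 2"))
index.insert(Record("img1-caption", "image", [0.6, 0.8], "Failover diagram", image_id="img1"))
index.insert(Record("img1-pixels", "image", [0.0, 1.0], "Failover diagram", image_id="img1"))
```

Because both image records carry `image_id="img1"`, a retrieval layer can collapse them into a single result when assembling the final context.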

Step 5 — Build the retrieval and synthesis pipeline.

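On receiving a user query, embed the query text with the same model that produced the stored vectors you search against: the text embedding model for text chunks and captions, and the text encoder of the vision embedding model (for example, CLIP's) for image embeddings. Retrieve the top-K results from the vector store (optionally filtering by type or source). For each retrieved image, fetch the original image file and the generated caption. Package the top-K text chunks and images into a prompt for the vision-language model with a synthesis instruction: "Based on the retrieved documents and images, answer the user's question."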
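The packaging step can be sketched as a function that turns retrieved hits into a text prompt plus a list of image paths to attach. The hit dictionary keys (`type`, `text`, `path`, `caption`), the `assemble_prompt` name, and the sample hits are assumptions made for illustration; a real vector store will return its own result shape.

```python
SYNTHESIS_INSTRUCTION = (
    "Based on the retrieved documents and images, answer the user's question."
)


def assemble_prompt(question: str, hits: list[dict]) -> tuple[str, list[str]]:
    """Return (prompt_text, image_paths) for a vision-language model call."""
    context_lines: list[str] = []
    image_paths: list[str] = []
    for hit in hits:
        if hit["type"] == "image":
            image_paths.append(hit["path"])
            # Number images in attachment order so the model can refer to them.
            context_lines.append(f"[Image {len(image_paths)}] {hit['caption']}")
        else:
            context_lines.append(f"[Text] {hit['text']}")
    prompt = "\n".join(
        [SYNTHESIS_INSTRUCTION, "", *context_lines, "", f"Question: {question}"]
    )
    return prompt, image_paths


# Made-up sample hits, shaped the way this sketch expects them.
sample_hits = [
    {"type": "text", "text": "The standby node takes over when the primary fails."},
    {
        "type": "image",
        "path": "images/failover.png",
        "caption": "Diagram of primary and standby nodes.",
    },
]
prompt, images = assemble_prompt(
    "Which diagram shows the failover architecture?", sample_hits
)
```

Including the caption in the prompt text alongside the attached image gives the model a textual anchor for each image, which also keeps the prompt usable if an image later fails to load.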

Step 6 — Validate retrieval relevance.

Run a set of known questions against the pipeline. For each question, inspect the top-5 retrieved items. Verify that each item is contextually relevant to the query. If irrelevant images appear in the top-5, adjust the embedding model or increase the weight of the caption embedding relative to the image embedding.
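The manual inspection above can be supplemented with two small helpers: one that measures how many of the top-5 items were judged relevant, and one that blends caption and image similarity with an adjustable weight. Both are illustrative sketches; the linear weighting formula and the function names are assumptions, and the relevance labels must come from a human reviewer.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved item ids that are labeled relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for item in top if item in relevant) / len(top)


def fused_score(
    caption_sim: float, image_sim: float, caption_weight: float = 0.5
) -> float:
    """Blend the two similarities; a higher caption_weight favors the caption."""
    if not 0.0 <= caption_weight <= 1.0:
        raise ValueError("caption_weight must be between 0 and 1")
    return caption_weight * caption_sim + (1.0 - caption_weight) * image_sim


# Example: two of the five retrieved items were judged relevant.
score = precision_at_k(["a", "b", "c", "d", "e"], {"a", "c", "x"})
```

If precision stays low because of irrelevant images, raising `caption_weight` toward 1.0 is one way to act on the adjustment suggested in this step; the right value depends on the corpus and should be checked against the same question set.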

Step 7 — Handle image retrieval failures.

If an image file is missing or the vision-language model cannot load it, fall back to using only the caption text. Log the failure with the image_id and return a partial response rather than crashing.
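A minimal sketch of the fallback, assuming images are read from the local filesystem: the loader returns the caption either way and flags the result as degraded when the file cannot be read. The logger name, the `load_image_or_caption` function, and the returned dictionary keys are hypothetical.

```python
import logging
from pathlib import Path

logger = logging.getLogger("multimodal_rag")  # logger name is an assumption


def load_image_or_caption(image_id: str, path: str, caption: str) -> dict:
    """Load image bytes, or fall back to caption-only context if that fails."""
    try:
        data = Path(path).read_bytes()
    except OSError as exc:
        # Covers missing files, permission errors, and similar I/O failures.
        logger.warning(
            "image %s unavailable (%s); falling back to caption", image_id, exc
        )
        return {"image_id": image_id, "image": None, "caption": caption, "degraded": True}
    return {"image_id": image_id, "image": data, "caption": caption, "degraded": False}
```

Downstream code can check the `degraded` flag to omit the image attachment and to mark the final answer as partial, rather than letting the exception end the request.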

  • Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.

  • Confirm the local starting state. Print the active binary, package version, model name, or configuration path before changing the workflow.

  • Run the smallest complete path. Execute the minimum command or script that proves the guide works end to end on the local machine.

  • Compare against expected output. Check the final line, status code, generated artifact, or model response against the verification section before expanding the setup.

Verification

  • Submit a query about a visual feature (e.g., "Which diagram shows the failover architecture?"). Confirm an image is retrieved in the top-3 results.
  • Submit a text-only question. Confirm text chunks are retrieved and no images appear unless explicitly relevant.
  • Remove an image file from storage. Confirm the pipeline falls back gracefully to the caption text without crashing.

Common failures

  • Caption quality degradation: Generic captions ("a picture of a graph") do not aid retrieval. Include source document text, figure labels, or surrounding paragraphs in the caption to provide discriminative context.
  • Image embedding drift: Vision embedding models trained on natural photos underperform on charts and diagrams. Use a model specifically fine-tuned for document images if the corpus includes technical figures.
  • Mismatched embedding spaces: Text and image embeddings from different model families may not align in the shared vector space. Use CLIP or a model that natively produces aligned image-text embeddings to ensure cross-modal retrieval accuracy.

Related guides

  • How to Implement Vector Search with Metadata Filtering — applies metadata filters to image retrieval results
  • How to Set Up Batch Processing for Large Document Sets — handles the caption generation step for large image corpora