How to Set Up Multi-Modal RAG with Images
Prerequisites: a vision model (e.g., LLaVA), a vector store, and image processing tools.
What this does
Multi-modal RAG retrieves both text and images relevant to a query by indexing images with captions and storing them alongside text documents in a unified vector store. When a query arrives, the system retrieves the top-K most relevant items (images and text) and passes them to a vision-language model for synthesis.
Steps
Step 1 — Collect and preprocess images.
Gather images from the source corpus (local filesystem, S3, or a database). Resize each image so that its longer side is at most 1024 pixels, preserving the aspect ratio; this reduces embedding cost and latency while keeping enough detail for captioning and embedding. Convert to PNG or JPEG with a consistent color profile (for example, sRGB).
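The resize rule can be sketched as a small pure function that computes target dimensions. This is a minimal illustration, not a full preprocessing pipeline; the name `fit_within` and the `max_side` parameter are assumptions made for this example.

```python
def fit_within(width: int, height: int, max_side: int = 1024) -> tuple[int, int]:
    """Return (width, height) scaled so neither side exceeds max_side.

    Aspect ratio is preserved and images that already fit are left unchanged.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Never upscale: the scale factor is capped at 1.0.
    scale = min(1.0, max_side / max(width, height))
    # Round to the nearest pixel, but never collapse a side to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(4000, 3000))  # (1024, 768)
print(fit_within(800, 600))    # (800, 600)
```

The resulting size can then be handed to whichever image library performs the actual resampling.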
Step 2 — Generate captions for each image.
Use a vision model (LLaVA, GPT-4o, or equivalent) to generate a descriptive caption for each image. Prompt the model with: "Describe this image in 2-3 sentences as if for a document retrieval system." Store the caption alongside the image filename. For images with existing alt-text or surrounding context, prepend that text to the caption.
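A minimal sketch of the caption assembly described above. The vision-model call is represented by a hypothetical `caption_fn` callable (here a stub named `fake_caption_fn`); swap in whatever client your model actually exposes. The function and field names are assumptions for illustration.

```python
from typing import Callable, Optional

CAPTION_PROMPT = (
    "Describe this image in 2-3 sentences as if for a document retrieval system."
)


def build_caption(
    image_path: str,
    caption_fn: Callable[[str, str], str],
    context: Optional[str] = None,
) -> dict:
    """Caption one image and prepend any existing alt-text or surrounding context."""
    generated = caption_fn(image_path, CAPTION_PROMPT).strip()
    parts = [context.strip()] if context and context.strip() else []
    parts.append(generated)
    # The caption is stored alongside the image filename.
    return {"filename": image_path, "caption": " ".join(parts)}


# Stand-in for a real vision-model call (hypothetical; replace with your client).
def fake_caption_fn(image_path: str, prompt: str) -> str:
    return "A diagram with two database nodes connected by a replication arrow."


record = build_caption(
    "failover.png", fake_caption_fn, context="Figure 3: Failover architecture."
)
print(record["caption"])
```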
Step 3 — Create a dual embedding strategy.
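Generate two embeddings per image: one for the caption text (using a standard text embedding model) and one for the image itself (using a vision embedding model such as CLIP). The two vectors come from different models, so they generally differ in dimension and are not directly comparable; store them in separate vector fields or collections, both referencing the same image_id. Embed text chunks with the same text embedding model used for captions.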
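The dual embedding step might look like the sketch below. The embedders are passed in as callables, and `toy_embed` is a hash-based placeholder with no semantic meaning, used only so the example runs without a model. All names here are assumptions, not part of any specific library.

```python
import hashlib
from typing import Callable, List

Vector = List[float]


def toy_embed(data: bytes, dim: int) -> Vector:
    """Deterministic placeholder embedding derived from a hash (not semantic)."""
    digest = hashlib.sha256(data).digest()  # 32 bytes, so dim must be <= 32
    return [digest[i] / 255.0 for i in range(dim)]


def embed_image_record(
    image_id: str,
    caption: str,
    image_bytes: bytes,
    text_embed: Callable[[str], Vector],
    image_embed: Callable[[bytes], Vector],
) -> List[dict]:
    """Return two records for one image, both carrying the same image_id."""
    return [
        {"image_id": image_id, "kind": "caption", "vector": text_embed(caption)},
        {"image_id": image_id, "kind": "image", "vector": image_embed(image_bytes)},
    ]


records = embed_image_record(
    "img-001",
    "Figure 3: Failover architecture.",
    b"placeholder image bytes",
    text_embed=lambda text: toy_embed(text.encode("utf-8"), dim=8),
    image_embed=lambda raw: toy_embed(raw, dim=4),
)
for r in records:
    print(r["image_id"], r["kind"], len(r["vector"]))
```

The differing vector lengths in the example are deliberate: they mirror the fact that a text embedder and an image embedder usually produce vectors of different sizes.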
Step 4 — Index all items in the vector store.
Insert each text chunk and each image (caption + image embedding) as separate records. Assign a type field (text or image) to enable filtering. Ensure the image_id links the caption embedding to the image embedding for retrieval assembly.
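To make the record layout concrete, here is a toy in-memory index with a type field and an optional type filter. It stands in for a real vector store and assumes all vectors in the index share one embedding space; `Record`, `InMemoryIndex`, and the field names are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Record:
    record_id: str
    type: str                       # "text" or "image"
    vector: List[float]
    payload: str                    # chunk text, or the caption for an image
    image_id: Optional[str] = None  # set only for image records


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class InMemoryIndex:
    records: List[Record] = field(default_factory=list)

    def insert(self, record: Record) -> None:
        if record.type not in ("text", "image"):
            raise ValueError(f"unknown record type: {record.type!r}")
        self.records.append(record)

    def search(
        self, query: List[float], k: int = 5, type_filter: Optional[str] = None
    ) -> List[Record]:
        """Return the k most similar records, optionally restricted to one type."""
        pool = [r for r in self.records if type_filter in (None, r.type)]
        return sorted(pool, key=lambda r: cosine(query, r.vector), reverse=True)[:k]


index = InMemoryIndex()
index.insert(Record("chunk-1", "text", [1.0, 0.0], "Failover is automatic."))
index.insert(Record("img-001-cap", "image", [0.0, 1.0], "Failover diagram.", "img-001"))

print([r.record_id for r in index.search([0.9, 0.1], k=1)])
print([r.record_id for r in index.search([0.9, 0.1], k=1, type_filter="image")])
```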
Step 5 — Build the retrieval and synthesis pipeline.
On receiving a user query, embed the query text with each model used at indexing time: the text embedding model to search text chunks and captions, and the CLIP text encoder to search image embeddings. Retrieve the top-K results from the vector store (optionally filtering by type or source). For each retrieved image, fetch the original image file and the generated caption. Package the top-K text chunks and images into a prompt for the vision-language model with a synthesis instruction: "Based on the retrieved documents and images, answer the user's question."
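The packaging step could be sketched as follows. The retrieved items are assumed to be dictionaries with `type`, `content`, `caption`, and `path` keys; that shape, and the function name `build_synthesis_request`, are assumptions for this example rather than a fixed schema. How the image bytes are attached depends on the model client, so only the paths are returned here.

```python
from typing import Dict, List

SYNTHESIS_INSTRUCTION = (
    "Based on the retrieved documents and images, answer the user's question."
)


def build_synthesis_request(question: str, retrieved: List[Dict]) -> Dict:
    """Split retrieved items by type and assemble one request for the model."""
    text_chunks = [r["content"] for r in retrieved if r["type"] == "text"]
    images = [r for r in retrieved if r["type"] == "image"]

    lines = [SYNTHESIS_INSTRUCTION, "", "Retrieved documents:"]
    lines += [f"[{i}] {chunk}" for i, chunk in enumerate(text_chunks, start=1)]
    lines += ["", "Retrieved image captions:"]
    lines += [f"[image {i}] {img['caption']}" for i, img in enumerate(images, start=1)]
    lines += ["", f"Question: {question}"]

    return {
        "prompt": "\n".join(lines),
        # Paths are returned separately so the caller can attach the image files.
        "image_paths": [img["path"] for img in images],
    }


request = build_synthesis_request(
    "Which diagram shows the failover architecture?",
    [
        {"type": "text", "content": "Failover promotes the replica to primary."},
        {
            "type": "image",
            "caption": "Figure 3: Failover architecture.",
            "path": "figures/failover.png",
        },
    ],
)
print(request["prompt"])
```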
Step 6 — Validate retrieval relevance.
Run a set of known questions against the pipeline. For each question, inspect the top-5 retrieved items. Verify that each item is contextually relevant to the query. If irrelevant images appear in the top-5, adjust the embedding model or increase the weight of the caption embedding relative to the image embedding.
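One way to make this check repeatable is to score each known question with precision at 5, and to expose the caption/image weighting as a single tunable parameter. The sketch below assumes you have hand-labeled the relevant item IDs per question; `fused_score` and `precision_at_k` are illustrative names. Because the two similarities come from different models, the weighted sum is a heuristic to tune against your labeled questions, not a calibrated score.

```python
from typing import List, Set


def fused_score(
    caption_sim: float, image_sim: float, caption_weight: float = 0.5
) -> float:
    """Blend caption and image similarity; a higher caption_weight favors captions."""
    if not 0.0 <= caption_weight <= 1.0:
        raise ValueError("caption_weight must be between 0 and 1")
    return caption_weight * caption_sim + (1.0 - caption_weight) * image_sim


def precision_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved items that are labeled relevant."""
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    return sum(1 for item in top if item in relevant_ids) / len(top)


retrieved = ["img-001", "chunk-7", "img-014", "chunk-2", "img-009"]
relevant = {"img-001", "img-014", "img-009"}
print(precision_at_k(retrieved, relevant))  # 0.6

# Raising caption_weight shifts the ranking toward caption similarity.
print(fused_score(0.8, 0.2, caption_weight=0.5))
print(fused_score(0.8, 0.2, caption_weight=0.75))
```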
Step 7 — Handle image retrieval failures.
If an image file is missing or the vision-language model cannot load it, fall back to using only the caption text. Log the failure with the image_id and return a partial response rather than crashing.
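A minimal sketch of the fallback, assuming images are read from the local filesystem. The function name, the returned dictionary shape, and the logger name are assumptions for illustration; a store such as S3 would raise different exception types that would need to be caught instead.

```python
import logging
from pathlib import Path
from typing import Dict, Optional

logger = logging.getLogger("multimodal_rag")


def load_image_or_fallback(image_id: str, path: str, caption: str) -> Dict:
    """Return image bytes when available, otherwise a caption-only item."""
    data: Optional[bytes]
    try:
        data = Path(path).read_bytes()
    except OSError as exc:
        # Missing or unreadable file: log with the image_id and degrade gracefully.
        logger.warning("image load failed for image_id=%s: %s", image_id, exc)
        data = None
    return {
        "image_id": image_id,
        "caption": caption,
        "image_bytes": data,
        "partial": data is None,  # lets the caller mark the response as partial
    }


item = load_image_or_fallback(
    "img-404", "does/not/exist.png", "Figure 3: Failover architecture."
)
print(item["partial"], item["caption"])
```

The caller can check the `partial` flag and pass only the caption text to the model for that item.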
Confirm the local starting state. Print the active binary, package version, model name, or configuration path before changing the workflow.
Run the smallest complete path. Execute the minimum command or script that proves the guide works end to end on the local machine.
Compare against expected output. Check the final line, status code, generated artifact, or model response against the verification section before expanding the setup.
Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
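Run evidence can be captured as a small JSON record. This is only a sketch: the field names are assumptions, and the command and model name shown (`python run_pipeline.py`, `llava`) are placeholders to replace with whatever you actually ran.

```python
import json
import platform
from datetime import datetime, timezone


def run_evidence(command: str, model_name: str, observed_output: str) -> dict:
    """Collect the details needed to reproduce a local run."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "command": command,
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "model_name": model_name,
        "observed_output": observed_output,
    }


evidence = run_evidence(
    command="python run_pipeline.py",  # placeholder command
    model_name="llava",                # placeholder model name
    observed_output="retrieved 5 items; 2 images",
)
print(json.dumps(evidence, indent=2))
```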
Verification
- Submit a query about a visual feature (e.g., "Which diagram shows the failover architecture?"). Confirm an image is retrieved in the top-3 results.
- Submit a text-only question. Confirm text chunks are retrieved and no images appear unless explicitly relevant.
- Remove an image file from storage. Confirm the pipeline falls back gracefully to the caption text without crashing.
Common failures
- Caption quality degradation: Generic captions ("a picture of a graph") do not aid retrieval. Include source document text, figure labels, or surrounding paragraphs in the caption to provide discriminative context.
- Image embedding domain mismatch: Vision embedding models trained mostly on natural photos often underperform on charts and diagrams. Use a model specifically fine-tuned for document images if the corpus includes technical figures.
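- Mismatched embedding spaces: Embeddings from different model families do not share a vector space, so similarity scores between them are not meaningful. For cross-modal retrieval (a text query against image embeddings), use CLIP or another model that natively produces aligned image-text embeddings, and compare caption embeddings only against queries embedded with the same text model.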
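A simple guard against the mismatched-spaces failure is to tag every vector with the model that produced it and refuse to compare across tags. The sketch below is one possible approach; `TaggedVector`, `dot_same_space`, and the space names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TaggedVector:
    space: str          # name of the embedding model that produced the vector
    values: List[float]


def dot_same_space(a: TaggedVector, b: TaggedVector) -> float:
    """Dot product that fails loudly when the vectors come from different models."""
    if a.space != b.space:
        raise ValueError(f"cannot compare {a.space!r} with {b.space!r}")
    if len(a.values) != len(b.values):
        raise ValueError("vectors in the same space must have equal length")
    return sum(x * y for x, y in zip(a.values, b.values))


query = TaggedVector("text-model", [1.0, 2.0])
caption = TaggedVector("text-model", [3.0, 4.0])
image = TaggedVector("clip", [0.5, 0.5])

print(dot_same_space(query, caption))  # 11.0
try:
    dot_same_space(query, image)
except ValueError as exc:
    print("rejected:", exc)
```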
Related guides
- How to Implement Vector Search with Metadata Filtering — applies metadata filters to image retrieval results
- How to Set Up Batch Processing for Large Document Sets — handles the caption generation step for large image corpora