KEY INSIGHT
Retrieval pipelines must handle text, images, tables, and code as distinct modalities with modality-appropriate embeddings and generation.
### Beyond Text-Only Retrieval
Production RAG systems face documents that contain figures, diagrams, screenshots, code blocks, and tables. A text-only embedding pipeline discards the visual structure. Multi-modal RAG retrieves across modalities and generates answers that synthesize text with visual evidence.
### Modality-Specific Processing
```python
import pdfplumber  # used only for table extraction
import pypdf       # text and embedded images (page.images needs Pillow)


def extractModalContent(pdf_path: str) -> list[dict]:
    """
    Extract text, images, and tables from a PDF with modality labels.

    describeImage and formatTable are helper functions defined elsewhere.
    """
    content = []
    reader = pypdf.PdfReader(pdf_path)
    # pypdf has no table extraction, so tables come from pdfplumber,
    # opened on the same file and indexed by the same page number.
    with pdfplumber.open(pdf_path) as plumber_pdf:
        for page_num, page in enumerate(reader.pages):
            page_no = page_num + 1  # 1-based, for human-readable citations

            # Extract text
            text = page.extract_text() or ""
            if text.strip():
                content.append({
                    "modality": "text",
                    "content": text,
                    "page": page_no,
                    "source": f"{pdf_path}#page={page_no}"
                })

            # Extract images: index the description, keep the raw bytes
            for img_idx, image in enumerate(page.images):
                img_bytes = image.data
                img_desc = describeImage(img_bytes)  # LLM or vision model
                content.append({
                    "modality": "image",
                    "content": img_desc,
                    "raw_bytes": img_bytes,
                    "page": page_no,
                    "source": f"{pdf_path}#page={page_no}&image={img_idx}"
                })

            # Extract tables (each table is a list of rows of cell strings)
            tables = plumber_pdf.pages[page_num].extract_tables()
            for tbl_idx, table in enumerate(tables):
                table_text = formatTable(table)
                content.append({
                    "modality": "table",
                    "content": table_text,
                    "raw_table": table,
                    "page": page_no,
                    "source": f"{pdf_path}#page={page_no}&table={tbl_idx}"
                })
    return content
```
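The extraction code calls `formatTable`, which this section does not define. One possible shape, offered as a minimal sketch: it assumes each table arrives as a list of rows whose cells are strings or `None`, treats the first row as the header, and renders Markdown so the embedded text keeps its column alignment. The body is an illustration, not the author's implementation.

```python
from typing import Optional


def formatTable(table: list[list[Optional[str]]]) -> str:
    """Render rows of cells as a Markdown table; the first row is the header."""
    if not table:
        return ""

    def clean(cell: Optional[str]) -> str:
        # Empty cells become "", inner whitespace is collapsed, pipes escaped
        text = "" if cell is None else str(cell)
        return " ".join(text.split()).replace("|", "\\|")

    # Pad ragged rows so every row has the same number of columns
    width = max(len(row) for row in table)
    rows = [[clean(c) for c in row] + [""] * (width - len(row)) for row in table]

    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```

Padding ragged rows rather than raising keeps ingestion moving on imperfect extractions, but it can hide boundary-detection errors, so it may be worth logging how often padding occurs.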
### Describing Images for Retrieval
```python
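import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def describeImage(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """
    Use a vision model to generate a text description for image retrieval.
    """
    # A data URL needs base64 text; interpolating raw bytes would embed
    # their Python repr (b'...') instead of the image payload.
    b64_image = base64.b64encode(image_bytes).decode("ascii")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime_type};base64,{b64_image}"}},
                {"type": "text", "text": "Describe this image with enough technical detail "
                 "that a semantic search system can retrieve it for a RAG query. "
                 "Include labels, values, and relationships visible in the diagram."}
            ]}
        ],
        temperature=0.0,
        max_tokens=256
    )
    # content can be None (for example on a refusal), so guard before strip()
    return (response.choices[0].message.content or "").strip()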
```
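Images embedded in PDFs may be JPEG or another format rather than PNG, so a hard-coded `image/png` label in the data URL can be wrong. The sketch below shows one way to build the data URL from the image's leading magic bytes. The helper names `sniffImageMime` and `toDataUrl` are hypothetical, and only a few common formats are covered.

```python
import base64

# Leading byte signatures for a few common image formats
_MAGIC_PREFIXES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def sniffImageMime(image_bytes: bytes, default: str = "image/png") -> str:
    """Guess the MIME type from magic bytes; fall back to `default`."""
    for prefix, mime in _MAGIC_PREFIXES:
        if image_bytes.startswith(prefix):
            return mime
    # WebP is a RIFF container with "WEBP" at byte offset 8
    if image_bytes[:4] == b"RIFF" and image_bytes[8:12] == b"WEBP":
        return "image/webp"
    return default


def toDataUrl(image_bytes: bytes) -> str:
    """Build a base64 data URL suitable for an image_url message part."""
    payload = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{sniffImageMime(image_bytes)};base64,{payload}"
```

The result of `toDataUrl(img_bytes)` can be passed as the `url` value of the `image_url` part in the request.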
### Multi-Modal Indexing
```python
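def indexByModality(content: list[dict], vector_store) -> None:
    """
    Embed each item's text representation and index it in a shared
    vector store, tagged with its modality for filtering at query time.

    embed_texts and vector_store.insert are assumed to come from your
    embedding client and vector database wrapper.
    """
    if not content:
        return
    # One batched embedding call instead of one call per item
    embeddings = embed_texts([item["content"] for item in content])
    for item, embedding in zip(content, embeddings):
        vector_store.insert(
            embedding=embedding,
            text=item["content"],
            modality=item["modality"],
            source=item["source"],
            metadata={"page": item["page"]}
        )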
```
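The indexing code assumes a `vector_store` object with an `insert` method but does not show one. For local experiments, a stand-in could look like the sketch below. The class `InMemoryVectorStore` and its `search` method are hypothetical, written to match the keyword arguments used above; a real deployment would use a vector database with metadata filtering instead.

```python
import math
from typing import Optional


def _cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Brute-force store: fine for tests, not for production volumes."""

    def __init__(self) -> None:
        self.rows: list[dict] = []

    def insert(self, embedding, text, modality, source, metadata=None) -> None:
        # Stored under "content" so results can feed generation directly
        self.rows.append({
            "embedding": list(embedding),
            "content": text,
            "modality": modality,
            "source": source,
            "metadata": metadata or {},
        })

    def search(self, query_embedding, top_k: int = 5,
               modality: Optional[str] = None) -> list[dict]:
        # Optional modality filter, then rank by cosine similarity
        candidates = [r for r in self.rows
                      if modality is None or r["modality"] == modality]
        ranked = sorted(candidates,
                        key=lambda r: _cosine(query_embedding, r["embedding"]),
                        reverse=True)
        return ranked[:top_k]
```

Keeping all modalities in one store with a `modality` tag allows both mixed retrieval and filtered queries such as `store.search(q, modality="table")`.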
### Multi-Modal Generation
```python
def multiModalAnswer(query: str, retrieved: list[dict]) -> dict:
    """
    Generate an answer from retrieved text chunks plus image and table
    descriptions. Images are represented by their stored descriptions;
    raw image bytes are not sent to the model here.
    """
    text_chunks = [c for c in retrieved if c["modality"] == "text"]
    image_items = [c for c in retrieved if c["modality"] == "image"]
    table_items = [c for c in retrieved if c["modality"] == "table"]

    synthesis_parts = [c["content"] for c in text_chunks]
    # Include image descriptions inline
    for img in image_items:
        synthesis_parts.append(f"[Image: {img['content']}]")
    # Include table representations inline
    for tbl in table_items:
        synthesis_parts.append(f"[Table: {tbl['content']}]")
    synthesis_context = "\n\n".join(synthesis_parts)

    # client is the OpenAI client created for describeImage
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Answer using both the text and image/table descriptions provided."},
            {"role": "user", "content": f"Question: {query}\n\nContext: {synthesis_context}"}
        ],
        temperature=0.0
    )
    return {"answer": response.choices[0].message.content, "sources": retrieved}
```
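The returned `sources` list holds full retrieved items, which is more than a user needs to see. One way to turn it into short citation lines is sketched below. The helper `formatSources` is hypothetical and assumes each item carries the `modality` and `source` keys produced during extraction.

```python
def formatSources(retrieved: list[dict]) -> str:
    """Return numbered, de-duplicated citation lines for retrieved items."""
    seen: set[str] = set()
    lines: list[str] = []
    for item in retrieved:
        source = item["source"]
        if source in seen:
            continue  # the same page or image may be retrieved more than once
        seen.add(source)
        lines.append(f"[{len(lines) + 1}] ({item['modality']}) {source}")
    return "\n".join(lines)
```

Appending this string to the answer gives readers a direct pointer to the page, image, or table behind each claim.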
### Failure Modes
Image descriptions are lossy: the vision model may miss small axis labels, faint grid lines, or color-coded legend items. Always include the image source link so users can verify the original. Table extraction is sensitive to PDF structure, and the detected row and column boundaries vary from one layout to another; validate the output on a sample of your documents before production. Multi-modal pipelines can roughly triple processing time relative to text-only pipelines, so parallelize embedding calls across modalities.
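The parallelization advice can be sketched with a thread pool that issues one embedding call per modality concurrently. This is a minimal illustration, not a tuned implementation: the function name `embedByModalityParallel` is hypothetical, and `embed_fn` stands in for whatever batch embedding function you use (assumed to accept a list of strings and return one vector per string, in order).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def embedByModalityParallel(
    content: list[dict],
    embed_fn: Callable[[list[str]], list[list[float]]],
    max_workers: int = 3,
) -> list[list[float]]:
    """Embed items grouped by modality in parallel, preserving input order."""
    # Remember each item's original position within its modality group
    groups: dict[str, list[int]] = defaultdict(list)
    for idx, item in enumerate(content):
        groups[item["modality"]].append(idx)

    def embed_group(indices: list[int]):
        vectors = embed_fn([content[i]["content"] for i in indices])
        return indices, vectors

    results: list = [None] * len(content)
    # Threads suit this workload because embedding calls are network-bound
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for indices, vectors in pool.map(embed_group, groups.values()):
            for i, vec in zip(indices, vectors):
                results[i] = vec
    return results
```

Grouping by modality also leaves room to route each group to a different embedding model later without changing the caller.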