COURSE · BLD · I010

Multi-Modal AI: Vision and Text

Learn multi-modal AI for vision and text through RunLocalAI's practical lens: LLaVA-family vision models, image tasks, hardware fit, runtime settings, verification habits and local-vs-cloud tradeoffs.

18 chapters · 12h · Builder track · By Fredoline Eruo
PREREQUISITES
  • B002
  • B003

Why this course matters

Multi-Modal AI: Vision and Text is for builders turning local models into working tools, agents and retrieval systems. It connects vision models such as LLaVA, image understanding and captioning to the questions RunLocalAI wants every reader to answer before they install, upgrade or scale a model: will it run, what will it cost in memory, what setting changes the result, and how do you verify the answer instead of trusting a demo?

What you will be able to do

By the end, you should be able to explain the main tradeoffs in plain language, choose a safe next experiment, and use the chapter exercises as a repeatable operator checklist. The course favors local evidence, hardware fit, context limits, latency and failure modes over generic AI vocabulary.

How to use this course

Start at chapter one if the topic is new. If you already have a working stack, scan for chapters such as Multi-Modal Models Overview, LLaVA Installation, BakLLaVA Setup and Image Captioning and use those lessons as a quality-control pass before changing a workstation, team workflow or production-like local deployment.

CHAPTERS
  1. 01 Multi-Modal Models Overview · 15 min

Multi-modal models bridge visual and textual information by encoding images and text into a shared representation space, enabling tasks like image captioning and visual question answering that require joint understanding of both modalities.

Multi-modal large language models (MLLMs) represent a significant advancement over traditional computer vision systems. Where classic models required separate training for detection, classification, and captioning tasks, multi-modal architectures process images and text through unified transformer-based pathways.

The core architecture typically consists of three components: a vision encoder (often a Vision Transformer or CLIP-based encoder), a projection layer that maps visual features into the language model embedding space, and a large language model that generates text based on the combined visual and textual inputs.

```python
# Conceptual architecture of a basic multi-modal model
class MultiModalModel:
    def __init__(self, vision_encoder, projection_layer, llm):
        self.vision_encoder = vision_encoder
        self.projection = projection_layer
        self.llm = llm

    def forward(self, image, text_prompt):
        # Encode image into visual features
        visual_features = self.vision_encoder(image)
        # Project to language model space
        projected_features = self.projection(visual_features)
        # Generate text conditioned on image and prompt
        response = self.llm.generate(
            context=[projected_features, text_prompt]
        )
        return response
```

Local multi-modal models offer privacy advantages since images never leave your infrastructure. LLaVA and BakLLaVA are prominent open-source options that run entirely on local hardware. These models handle resolutions from 224×224 up to 448×448 pixels depending on architecture variant.

Failure modes to anticipate include memory exhaustion with high-resolution images, inconsistent performance across different image domains, and hallucination where the model generates descriptions that don't match image content. Running smaller 7B-parameter models first provides reasonable baseline behavior before scaling to larger variants.
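To make the link between input resolution and memory pressure concrete, the sketch below counts how many visual tokens a ViT-style encoder would hand to the language model. It assumes a square image, a patch size of 14 pixels, no class token and no image tiling; `vision_token_count` is an illustrative helper, and real models may add or merge tokens.

```python
def vision_token_count(image_size: int, patch_size: int = 14) -> int:
    """Number of patch tokens a plain ViT emits for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


# Each patch becomes one token in the language model's context window.
for size in (224, 336, 448):
    print(f"{size}x{size}: {vision_token_count(size)} visual tokens")
```

Doubling the side length quadruples the token count, which is why high-resolution inputs are the first thing to reduce when memory runs out.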
  2. 02 LLaVA Installation · 20 min

LLaVA requires compatible versions of CUDA, PyTorch, and transformer libraries. Version mismatches are the most common installation failure, so verify your environment before proceeding.

LLaVA (Large Language and Vision Assistant) is a widely used open-source multi-modal model. The recommended installation relies on the `llamafactory` package, which provides a unified interface for running various multi-modal architectures. Ensure NVIDIA driver version 525+ and CUDA 11.8 or 12.1 before beginning.

```bash
# Verify CUDA availability
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'CUDA version: {torch.version.cuda}')"
```

The installation sequence matters. Create a fresh virtual environment to avoid dependency conflicts:

```bash
python -m venv llava-env
source llava-env/bin/activate

# Install PyTorch with CUDA support
pip install torch==2.3.0 torchvision==0.18.0 --index-url https://download.pytorch.org/whl/cu118

# Install transformer libraries
pip install transformers==4.40.0
pip install accelerate==0.28.0
pip install bitsandbytes==0.43.1

# Install LLaVA interfacing through llamafactory
pip install llamafactory
```

Model weight download uses significant disk space. A 7B LLaVA model needs roughly 15GB in half precision (fp16); a 4-bit quantized copy is closer to 4 to 5GB. The `llava-hf` repositories hold checkpoints converted to the Transformers format, which the loading code in this course expects. Download in advance:

```bash
# Create model directory
mkdir -p models
huggingface-cli download --repo-type model \
  llava-hf/llava-v1.6-mistral-7b-hf --local-dir models/llava-v1.6-mistral-7b
```

Common failure modes include:

- **Out of memory at load**: Load in fp16 or 4-bit, keep batch size at 1, and enable CPU offloading
- **CUDA not found**: Verify `LD_LIBRARY_PATH` includes the CUDA lib directory
- **Weight download timeout**: Re-run the same download command; files that already completed are skipped

After installation, run a sanity check:

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

model_path = "models/llava-v1.6-mistral-7b"  # local directory from the download step

processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForVision2Seq.from_pretrained(
    model_path,
    torch_dtype=torch.float16,  # half precision to reduce memory use
    device_map="auto",          # requires accelerate
)
print("LLaVA loaded successfully")
```
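Before downloading, a back-of-envelope estimate of weight size helps decide which precision fits your disk and VRAM. The sketch below multiplies parameter count by bytes per parameter; it covers weights only (no vision tower, activations or KV cache), and `weight_size_gb` is an illustrative helper rather than part of any library.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_size_gb(num_params: float, precision: str) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in ("fp32", "fp16", "int8", "int4"):
    print(f"7B @ {precision}: {weight_size_gb(7e9, precision):.1f} GB")
```

Treat the result as a lower bound: real checkpoints add the vision encoder and projector, and inference needs extra headroom on top of the weights.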
  3. 03 BakLLaVA Setup · 20 min

BakLLaVA offers improved multi-modal performance through better vision-language alignment while maintaining compatibility with LLaVA's interfaces, making it a straightforward swap-in replacement.

BakLLaVA builds upon LLaVA's architecture with modified training objectives that improve visual comprehension. The model uses a Mistral-7B base combined with a CLIP-based vision encoder. Setup closely mirrors LLaVA with minor configuration differences.

```bash
# Add to the existing LLaVA environment
pip install einops==0.7.0
pip install xformers==0.0.26.post1  # optional; must match your torch build (this release targets torch 2.3.0)

# Download BakLLaVA weights (Transformers-format conversion)
huggingface-cli download --repo-type model \
  llava-hf/bakLlava-v1-hf --local-dir models/bakllava-1-7b
```

Configuration differs slightly from LLaVA. Create a custom config file to record the settings you intend to use. This is a project-level file; Transformers does not read it automatically:

```yaml
# config.yaml
model_name: models/bakllava-1-7b
vision_tower: clip-vit-large-patch14-336
freeze_vision_tower: false
pretrain_mm_mlp_adapter: models/bakllava-1-7b/mm_projector.bin
text_model:
  name: mistralai/Mistral-7B-v0.1
  quantize: 4bit
inference:
  max_length: 2048
  temperature: 0.7
  top_p: 0.9
```

Because BakLLaVA keeps LLaVA's interfaces, the converted checkpoint loads through the Transformers LLaVA classes:

```python
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

config = {
    "model_path": "models/bakllava-1-7b",
    "torch_dtype": torch.float16,
    "device_map": "auto",
}

processor = AutoProcessor.from_pretrained(config["model_path"])
model = LlavaForConditionalGeneration.from_pretrained(
    config["model_path"],
    torch_dtype=config["torch_dtype"],
    device_map=config["device_map"],
)
```

In side-by-side tests BakLLaVA often produces more detailed captions, but verify this on your own images before switching:

```python
# Test both models on same image
image_path = "test_images/sample.jpg"

# LLaVA output tends toward brief descriptions
# BakLLaVA often includes spatial relationships and fine details
```

Potential issues during setup:

- **Vision encoder mismatch**: Ensure CLIP weights match exactly
- **Quantization errors**: The 4-bit mode requires `bitsandbytes` 0.41+
- **Memory fragmentation**: Call `torch.cuda.empty_cache()` between tests
  4. 04 Image Captioning · 20 min

Image captioning converts visual content into natural language descriptions. Prompt engineering significantly affects output quality: specific, structured prompts yield more consistent results than open-ended queries.

Multi-modal models process the image through the vision encoder and produce text through autoregressive generation. The quality depends on prompt formulation, image resolution, and model capabilities.

Basic caption generation:

```python
from PIL import Image
import torch


def generate_caption(model, processor, image_path, max_new_tokens=100,
                     instruction="Describe this image in detail."):
    image = Image.open(image_path).convert("RGB")

    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": instruction},
            ],
        }
    ]
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

    inputs = processor(
        images=image, text=prompt, return_tensors="pt"
    ).to(model.device)

    with torch.no_grad():
        output = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )

    # Drop the prompt tokens so only the generated caption is decoded
    new_tokens = output[:, inputs["input_ids"].shape[1]:]
    caption = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
    return caption


# Usage
caption = generate_caption(model, processor, "photos/landscape.jpg")
print(caption)
```

Prompt variations affect output style:

```python
caption_prompts = {
    # Brief factual caption
    "brief": "Provide a concise, objective caption.",
    # Detailed descriptive caption
    "detailed": "Describe all visible objects, their positions, colors, and any text present.",
    # Narrative caption
    "narrative": "Write a caption as if for a photojournalism article.",
}
```

Failure modes in captioning:

- **Cropping**: Images may be center-cropped, losing peripheral details
- **Hallucination**: Models sometimes describe non-existent objects
- **Text unreadability**: Small text often gets skipped entirely

Benchmark caption quality with reference captions. `load_metric` has been removed from recent `datasets` releases, so use the `evaluate` package:

```python
import evaluate  # pip install evaluate

# Load captioning metrics
bleu = evaluate.load("bleu")


def evaluate_caption(prediction, references):
    """Score one predicted caption against a list of reference captions."""
    return bleu.compute(predictions=[prediction], references=[references])
```
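To see what a BLEU-style score actually measures, the sketch below computes clipped n-gram precision, which is one ingredient of BLEU, in plain Python. It is not the full metric (no brevity penalty, no averaging over n-gram orders), it tokenizes by whitespace only, and `clipped_ngram_precision` is an illustrative name.

```python
from collections import Counter


def clipped_ngram_precision(prediction: str, references: list[str], n: int = 1) -> float:
    """Share of predicted n-grams that also occur in a reference.

    Each n-gram is credited at most as often as it appears in the
    reference that contains it most often (the "clipping" step).
    """
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    predicted = ngrams(prediction)
    if not predicted:
        return 0.0

    max_reference = Counter()
    for reference in references:
        for gram, count in ngrams(reference).items():
            max_reference[gram] = max(max_reference[gram], count)

    matched = sum(min(count, max_reference[gram]) for gram, count in predicted.items())
    return matched / sum(predicted.values())


print(clipped_ngram_precision("a dog on the grass", ["a dog sits on grass"]))
```

A caption that repeats one correct word many times scores low because of clipping, which is the behaviour you want when checking for degenerate output.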
  5. 05 Visual Question Answering · 15 min

Visual Question Answering (VQA) combines image understanding with language generation, allowing free-form questions about visual content. Structured prompts with explicit context improve answer accuracy.

VQA extends captioning by accepting user questions. The model must identify relevant visual elements, reason about relationships, and format answers appropriately. This task is more challenging than captioning because answers must address specific queries.

```python
from PIL import Image
import torch


def answer_visual_question(model, processor, image_path, question):
    image = Image.open(image_path).convert("RGB")

    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": f"Question: {question}\nAnswer concisely."},
            ],
        }
    ]
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

    inputs = processor(
        images=image, text=prompt, return_tensors="pt"
    ).to(model.device)

    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=150,
            do_sample=False,  # Greedy decoding: deterministic for Q&A
        )

    # Decode only the newly generated tokens, not the prompt
    new_tokens = output[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]


# Example questions
questions = [
    "What is the main subject of this image?",
    "How many people are visible in the scene?",
    "What colors dominate the background?",
    "Is there any text visible? If so, what does it say?",
    "What time of day does this image appear to show?",
]

for question in questions:
    answer = answer_visual_question(model, processor, "test.jpg", question)
    print(f"Q: {question}\nA: {answer}\n")
```

Multi-turn VQA enables follow-up questions:

```python
conversation_history = []


def multi_turn_vqa(model, processor, image_path, question):
    """Ask a follow-up question while keeping earlier turns as context."""
    image = Image.open(image_path).convert("RGB")

    # Attach the image placeholder on the first turn only
    content = [{"type": "text", "text": question}]
    if not conversation_history:
        content.insert(0, {"type": "image"})
    conversation_history.append({"role": "user", "content": content})

    # Include previous turns for context
    prompt = processor.apply_chat_template(conversation_history, add_generation_prompt=True)
    inputs = processor(
        images=image, text=prompt, return_tensors="pt"
    ).to(model.device)

    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=150)

    new_tokens = output[:, inputs["input_ids"].shape[1]:]
    response = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]

    conversation_history.append(
        {"role": "assistant", "content": [{"type": "text", "text": response}]}
    )
    return response
```

Common failure modes in VQA:

- **Ambiguous questions**: "What is this?" produces varied responses
- **Counting errors**: Models struggle with precise counts
- **Spatial reasoning**: Questions about relative positions often fail
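To verify VQA answers rather than eyeball them, it helps to score predictions against known answers after light normalization. The sketch below is a simplified exact-match scorer; the normalization rules (lowercasing, stripping punctuation and articles, mapping small number words to digits) and the helper names are assumptions chosen for illustration, not a standard benchmark implementation.

```python
import re

ARTICLES = {"a", "an", "the"}
NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
    "six": "6", "seven": "7", "eight": "8", "nine": "9", "ten": "10",
}


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, map number words to digits."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    tokens = [NUMBER_WORDS.get(t, t) for t in text.split() if t not in ARTICLES]
    return " ".join(tokens)


def vqa_exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(
        normalize_answer(p) == normalize_answer(r)
        for p, r in zip(predictions, references)
    )
    return hits / len(predictions)


print(vqa_exact_match(["Two", "a red car", "cat"], ["2", "Red car.", "dog"]))
```

Counting questions are a good first test set because the reference answer is unambiguous and number-word normalization removes most formatting noise.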
  6. 06 Chart and Diagram Understanding · 15 min

Chart understanding requires the model to interpret visual encodings such as axes, scales and legends. This is often harder than describing natural images because a chart compresses multi-dimensional data into a 2D layout, so misreading a single tick label or legend entry changes the answer.
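One way to check a model's chart reading is to recompute a value from pixel positions yourself. The sketch below maps a pixel coordinate to a data value by linear interpolation between two known axis ticks. It assumes a linear (not logarithmic) axis and that you have measured the tick positions by hand; `pixel_to_value` is an illustrative helper.

```python
def pixel_to_value(pixel: float, pixel_a: float, value_a: float,
                   pixel_b: float, value_b: float) -> float:
    """Map a pixel coordinate to a data value on a linear axis.

    (pixel_a, value_a) and (pixel_b, value_b) are two calibration ticks.
    """
    if pixel_a == pixel_b:
        raise ValueError("calibration ticks must be at different pixels")
    fraction = (pixel - pixel_a) / (pixel_b - pixel_a)
    return value_a + fraction * (value_b - value_a)


# Image rows grow downward: tick "0" sits at row 400, tick "100" at row 100.
bar_top_row = 250
print(pixel_to_value(bar_top_row, 400, 0.0, 100, 100.0))
```

If the model reports a bar height far from the interpolated value, treat its reading of the scale as suspect.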
  7. 07 OCR with Vision Models · 20 min

Vision-language models perform OCR through learned visual patterns rather than explicit character recognition. They excel at contextual understanding but struggle with precise text extraction compared to dedicated OCR engines.

Vision models approach text recognition differently than Tesseract or similar OCR engines. Instead of pixel-to-character mapping, they learn to "read" as part of their language understanding. This produces more human-like interpretation but with different trade-offs.

```python
from PIL import Image
import torch


def extract_text_vision_model(model, processor, image_path, context_aware=True):
    image = Image.open(image_path).convert("RGB")

    if context_aware:
        instruction = (
            "Transcribe ALL text visible in this image. "
            "Preserve line breaks and formatting. "
            "Include every word, number, and symbol exactly as written."
        )
    else:
        instruction = "What text do you see? List all words."

    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": instruction},
            ],
        }
    ]
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

    inputs = processor(
        images=image, text=prompt, return_tensors="pt"
    ).to(model.device)

    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=500,
            do_sample=False,  # Deterministic for text extraction
        )

    # Decode only the generated transcription, not the prompt
    new_tokens = output[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```

Compare vision model OCR with dedicated engines:

```python
def hybrid_ocr(model, processor, image_path, use_fallback=True):
    """Combine vision model with Tesseract for optimal results."""
    # First attempt: Vision model
    vision_text = extract_text_vision_model(model, processor, image_path)

    if use_fallback:
        # Fallback: Tesseract for exact transcription
        import pytesseract

        tesseract_text = pytesseract.image_to_string(
            Image.open(image_path),
            output_type=pytesseract.Output.STRING,
        )
        return {
            "vision_model": vision_text,
            "tesseract": tesseract_text,
            "combined": f"{vision_text}\n\n---Tesseract---\n{tesseract_text}",
        }

    return vision_text
```

Text extraction across document types:

```python
document_prompts = {
    "screenshot": "Extract all visible UI text, labels, buttons, and any other textual elements.",
    "receipt": "Extract the itemized list, prices, totals, and vendor information.",
    "document": "Extract the full document text maintaining paragraph structure.",
    "signage": "Transcribe all visible text including size indicators if present.",
}
```

Performance characteristics:

- **Handwriting**: Poor performance; consider specialized handwriting models
- **Low resolution**: Text below roughly 12px in height often becomes unreadable
- **Perspective distortion**: Models often fail to correct tilted text
- **Multi-column layouts**: Column order may be confused
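To compare a vision model against a dedicated OCR engine with numbers rather than impressions, you can score both against a hand-typed reference using character error rate (CER). The sketch below implements Levenshtein edit distance in plain Python and divides by the reference length; the function names are illustrative, and the sample strings are made up for the demonstration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(hypothesis: str, reference: str) -> float:
    """Edit distance normalized by reference length (lower is better)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(hypothesis, reference) / len(reference)


# A classic OCR confusion: letter O read in place of digit 0.
print(character_error_rate("Total: 12.5O", "Total: 12.50"))
```

Running this on the `vision_model` and `tesseract` outputs of the same image gives a like-for-like comparison on your own documents.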
  8. 08 Document Image Analysis · 20 min

Document image analysis involves understanding layout, extracting content, and interpreting structure. Multi-modal models can identify sections, understand document type, and provide semantic interpretation beyond raw text extraction.

Document analysis combines layout understanding with content extraction and semantic interpretation. Professional documents have defined structures (headers, body text, tables, footnotes) that affect how information should be extracted.

```python
from PIL import Image
import torch


def ask_about_image(model, processor, image_path, prompt, max_new_tokens):
    """Send one image and one text prompt to the model and return its reply."""
    image = Image.open(image_path).convert("RGB")
    conversation = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt}]}]
    chat_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(images=image, text=chat_prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Decode only the generated tokens, not the prompt
    new_tokens = output[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]


def analyze_document_structure(model, processor, image_path):
    """Identify document layout and sections."""
    prompt = """Analyze this document and identify:
1. Document type (form, invoice, contract, report, etc.)
2. Major sections and their boundaries
3. Key data fields present
4. Table structures if any
5. Overall document structure"""
    return ask_about_image(model, processor, image_path, prompt, max_new_tokens=400)
```

Extract structured information from forms:

```python
def extract_form_fields(model, processor, image_path, field_list):
    """Extract values for expected form fields."""
    field_descriptions = "\n".join([f"- {field}" for field in field_list])
    prompt = f"""For each field below, extract the corresponding value from this document.
If a field is not found, respond 'N/A'.
Respond in the format: Field: Value
{field_descriptions}"""

    result = ask_about_image(model, processor, image_path, prompt, max_new_tokens=300)

    # Parse structured output
    fields = {}
    for line in result.split("\n"):
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields
```

Document type classification:

```python
def classify_document(model, processor, image_path):
    """Identify document type and characteristics."""
    types = [
        "invoice", "receipt", "contract", "form", "resume", "report",
        "letter", "handwritten_note", "screenshot", "presentation",
        "newspaper_article", "book_page", "label", "certificate", "unknown",
    ]
    type_list = ", ".join(types[:-1])  # 'unknown' is offered separately below
    prompt = f"""What type of document is this image?
Choose from: {type_list}. Answer 'unknown' if none fit.
Provide confidence level (high/medium/low) and brief justification."""
    return ask_about_image(model, processor, image_path, prompt, max_new_tokens=100)
```

Document challenges:

- **Tightly packed text**: Loss of detail in small print
- **Color-coded information**: May be missed or misunderstood
- **Complex tables**: Multi-level headers problematic
- **Signatures**: Difficult to interpret reliably
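Model output in `Field: Value` form is easy to parse but also easy to trust too much. The sketch below adds a validation pass that reports which expected fields are missing and which keys the model invented. `parse_form_fields` and its return shape are illustrative assumptions, and the sample output is made up.

```python
def parse_form_fields(model_output: str, expected_fields: list[str]) -> dict:
    """Parse 'Field: Value' lines and flag missing or unexpected fields."""
    lookup = {name.lower(): name for name in expected_fields}
    values = {name: None for name in expected_fields}
    unexpected = []

    for line in model_output.splitlines():
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        key = key.strip().lstrip("-* ").strip()  # tolerate bullet prefixes
        value = value.strip()
        canonical = lookup.get(key.lower())
        if canonical is None:
            unexpected.append(key)
        elif value and value.upper() != "N/A":
            values[canonical] = value

    missing = [name for name, value in values.items() if value is None]
    return {"values": values, "missing": missing, "unexpected": unexpected}


sample = "Invoice Number: INV-001\n- Total: 42.00\nDate: N/A\nNote: hello"
print(parse_form_fields(sample, ["Invoice Number", "Total", "Date"]))
```

Routing any document with a non-empty `missing` or `unexpected` list to manual review is a cheap guard against silent extraction errors.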
  9. 09 Batch Image Processing · 20 min

Efficient batch processing requires memory management, parallel processing strategies, and error handling. Processing multiple images sequentially with proper resource cleanup maintains throughput without crashes.

Production deployments typically process many images in batches. Memory management becomes critical: loading large models repeatedly wastes resources, but keeping them in memory may cause crashes with limited VRAM. Strategic batching maintains throughput.

```python
import gc
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

import torch
from PIL import Image


class BatchImageProcessor:
    def __init__(self, model, processor, batch_size=4):
        self.model = model
        self.processor = processor
        self.batch_size = batch_size
        self.device = model.device

    def process_image(self, image_path, prompt):
        """Process single image with given prompt."""
        try:
            image = Image.open(image_path).convert("RGB")
            conversation = [
                {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt}]}
            ]
            prompt_text = self.processor.apply_chat_template(
                conversation, add_generation_prompt=True
            )
            inputs = self.processor(
                images=image, text=prompt_text, return_tensors="pt"
            ).to(self.device)

            with torch.no_grad():
                output = self.model.generate(
                    **inputs, max_new_tokens=200, do_sample=False
                )

            # Decode only the generated tokens, not the prompt
            new_tokens = output[:, inputs["input_ids"].shape[1]:]
            result = self.processor.batch_decode(new_tokens, skip_special_tokens=True)[0]

            del inputs, output, image
            gc.collect()
            torch.cuda.empty_cache()

            return {"path": str(image_path), "result": result, "status": "success"}
        except Exception as e:
            return {"path": str(image_path), "result": None, "status": "error", "error": str(e)}

    def process_batch(self, image_paths, prompt, max_workers=2):
        """Process multiple images with worker pool.

        A single GPU model serializes most of the work, so extra workers
        mainly overlap image loading; keep max_workers small.
        """
        results = []
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            future_to_path = {
                executor.submit(self.process_image, path, prompt): path
                for path in image_paths
            }
            for future in as_completed(future_to_path):
                result = future.result()
                results.append(result)
                print(f"Processed: {result['path']} - {result['status']}")
        return results

    def process_directory(self, directory, prompt, pattern="*.jpg", max_workers=2):
        """Process all matching images in directory."""
        image_dir = Path(directory)
        image_paths = list(image_dir.glob(pattern))
        print(f"Found {len(image_paths)} images to process")
        return self.process_batch(image_paths, prompt, max_workers)
```

Memory-efficient processing sequence:

```python
def memory_safe_batch_processing(image_paths, batch_processor, prompt,
                                 batch_size=2, checkpoint_file="checkpoint.json"):
    """Process with checkpointing to recover from crashes.

    batch_processor is a BatchImageProcessor instance from the block above.
    """
    checkpoint_path = Path(checkpoint_file)

    # Load existing checkpoints
    if checkpoint_path.exists():
        completed = set(json.loads(checkpoint_path.read_text()))
    else:
        completed = set()

    pending = [p for p in image_paths if str(p) not in completed]
    print(f"Processing {len(pending)} pending images")

    results = []
    for i in range(0, len(pending), batch_size):
        batch = pending[i:i + batch_size]
        for path in batch:
            result = batch_processor.process_image(path, prompt)
            results.append(result)

            # Checkpoint after each image
            if result["status"] == "success":
                completed.add(str(path))
                checkpoint_path.write_text(json.dumps(sorted(completed)))

        # Memory cleanup between batches
        gc.collect()
        torch.cuda.empty_cache()

    return results
```

Common batch processing failures:

- **OOM errors**: Reduce batch_size, lower image resolution or `max_new_tokens`, or load quantized weights
- **Timeout stalls**: Set generation timeout, implement retry logic
- **Memory growth**: Clear cache periodically, monitor for signs of a memory leak
- **Partial results**: Implement checkpointing for crash recovery
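The retry logic mentioned above can be sketched as a small wrapper with a fixed attempt limit and exponential backoff. It assumes the wrapped task raises on failure (unlike `process_image`, which catches exceptions and returns an error dict); `run_with_retries` and its parameters are illustrative names, and the `sleep` argument exists only so the delay can be swapped out in tests.

```python
import time


def run_with_retries(task, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call task() until it succeeds or max_attempts is reached.

    Waits base_delay, then 2x, 4x, ... between attempts.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return {"status": "success", "result": task(), "attempts": attempt}
        except Exception as error:  # broad on purpose: any failure is retried
            last_error = error
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    return {"status": "error", "error": str(last_error), "attempts": max_attempts}


# Example: a task that fails twice before succeeding.
calls = {"count": 0}

def flaky_task():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "caption text"

print(run_with_retries(flaky_task, sleep=lambda seconds: None))
```

Keep `max_attempts` low for out-of-memory errors: retrying the same image at the same resolution rarely helps, so pair retries with a smaller input on the second attempt.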
  10. 10 Streaming with Vision · 15 min

Vision models must encode the image before the first token appears, so time to first token is longer than for a text-only prompt and grows with image size. Streaming the generated text as it is produced keeps the interface responsive.

Streaming visual responses differs from text-only streams mainly in that up-front cost: once generation starts, tokens arrive as they would from any language model. Users expect progressive rendering of generated captions or descriptions. The example below streams from the local `model` and `processor` loaded in earlier chapters, using the `TextIteratorStreamer` class from Transformers.

```python
import json
from threading import Thread

from PIL import Image
from transformers import TextIteratorStreamer


def stream_vision_response(model, processor, image_path, prompt, max_new_tokens=512):
    """Yield generated text progressively for one image and prompt."""
    image = Image.open(image_path).convert("RGB")
    conversation = [
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt}]}
    ]
    prompt_text = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(images=image, text=prompt_text, return_tensors="pt").to(model.device)

    streamer = TextIteratorStreamer(
        processor.tokenizer, skip_prompt=True, skip_special_tokens=True
    )

    # generate() blocks, so run it in a thread and consume the streamer here
    thread = Thread(
        target=model.generate,
        kwargs=dict(**inputs, streamer=streamer, max_new_tokens=max_new_tokens),
    )
    thread.start()

    accumulated = []
    for token_text in streamer:
        accumulated.append(token_text)
        # Stream text progressively
        yield {"token": token_text, "partial": "".join(accumulated)}
    thread.join()


def get_structured_stream(model, processor, image_path):
    """Stream chunks, then try to parse the complete text as JSON."""
    prompt = (
        "Analyze this image and return JSON with keys 'description' (string), "
        "'objects' (list of strings) and 'confidence' (number between 0 and 1)."
    )
    chunk = None
    for chunk in stream_vision_response(model, processor, image_path, prompt):
        yield chunk

    # Partial JSON is rarely valid, so parse only once the stream has ended
    if chunk is not None:
        try:
            yield {"parsed": json.loads(chunk["partial"])}
        except json.JSONDecodeError:
            yield {"parsed": None}
```

**Common Failure Patterns:**

- Forgetting image pre-processing causes timeout failures on large images. Always resize before inference.
- Streaming buffer exhaustion if accumulation logic ignores token limits. Set maximum buffer size.
- Mixing streaming and non-streaming calls on the same model at the same time can create race conditions. Serialize requests per model instance.
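The maximum buffer size recommended above can be enforced with a thin wrapper around any token iterator. The sketch below caps the accumulated text at `max_chars` and stops consuming once the cap is reached; `bounded_stream`, the character-based limit and the chunk shape are illustrative assumptions.

```python
def bounded_stream(tokens, max_chars=2000):
    """Yield streaming chunks while capping accumulated text at max_chars."""
    parts = []
    used = 0
    for token in tokens:
        remaining = max_chars - used
        if remaining <= 0:
            break  # buffer is full: stop reading from the source
        piece = token[:remaining]
        parts.append(piece)
        used += len(piece)
        yield {
            "token": piece,
            "partial": "".join(parts),
            "truncated": len(piece) < len(token),
        }


for chunk in bounded_stream(["Hello", " ", "world"], max_chars=7):
    print(chunk)
```

A consumer can watch the `truncated` flag to show an ellipsis or to cancel generation instead of silently dropping text.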
  11. 11 Vision Agents · 15 min

Vision agents chain observe-interpret-act cycles in which image context informs tool selection. The agent must reason about visual elements before deciding which tools to invoke.

Vision agents extend text-only agents by incorporating visual context into decision loops. When presented with a diagram, the agent might invoke a web search for technical specifications, or calculate dimensions using a Python tool. In the example below, `vision_llm`, `text_llm` and the entries in `tools` are placeholder callables that you supply; they stand in for whatever local model and tool implementations you use.

```python
from typing import Callable


class VisionReasoningAgent:
    """Vision agent with tool invocation based on visual analysis.

    vision_llm(image_path, prompt) -> str and text_llm(prompt) -> str are
    placeholders for your model calls. tools maps a keyword such as
    "calculate" or "search" to a function that takes the analysis text
    and returns a result string.
    """

    def __init__(
        self,
        vision_llm: Callable[[str, str], str],
        text_llm: Callable[[str], str],
        tools: dict[str, Callable[[str], str]],
    ):
        self.vision_llm = vision_llm
        self.text_llm = text_llm
        self.tools = tools

    def analyze_with_tools(self, image_path: str, user_intent: str) -> dict:
        # Initial visual understanding
        tool_names = ", ".join(self.tools)
        analysis_text = self.vision_llm(
            image_path,
            f"""Analyze this image. First, identify what you see.
Then determine: which tools would help answer the user's intent?
The user wants: {user_intent}
Available tools: {tool_names}
Output your analysis, then list tools needed.""",
        )

        # Execute every tool whose keyword appears in the analysis
        tool_results = {
            name: tool(analysis_text)
            for name, tool in self.tools.items()
            if name in analysis_text.lower()
        }

        # Synthesize results
        results_text = "\n".join(
            f"{name}: {result}" for name, result in tool_results.items()
        ) or "none"
        answer = self.text_llm(
            f"""Based on the image analysis and tool results:
Initial Analysis: {analysis_text}
Tool Results:
{results_text}
Synthesize into a coherent answer about: {user_intent}"""
        )
        return {"answer": answer, "tools_used": list(tool_results)}
```

**Failure Modes:**

- Agents selecting wrong tools when visual analysis misses key elements. Always include "what tools would help" prompts.
- Infinite loop risk when tool results trigger re-analysis. Implement iteration limits.
- Conflicting tool results when multiple sources disagree. Require source attribution.
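The iteration limit recommended above can be sketched as a bounded loop around a decision function. Here `decide` is a hypothetical callable that looks at the observations so far and returns `{"tool": name, "input": text}`, with the reserved name `"finish"` ending the run; the names and return shape are assumptions made for illustration.

```python
def run_agent_loop(decide, tools, max_steps=5):
    """Run decide -> tool -> observe cycles, stopping after max_steps."""
    observations = []
    for step in range(max_steps):
        action = decide(observations)
        if action["tool"] == "finish":
            return {"answer": action["input"], "steps": step, "stopped_early": False}

        tool = tools.get(action["tool"])
        if tool is None:
            # Record the mistake so the next decision can correct it
            observations.append({"tool": action["tool"], "error": "unknown tool"})
            continue
        observations.append({"tool": action["tool"], "result": tool(action["input"])})

    # Limit reached: return what was gathered instead of looping forever
    return {"answer": None, "steps": max_steps, "stopped_early": True,
            "observations": observations}


# Example: one search, then finish with the search result.
def decide(observations):
    if not observations:
        return {"tool": "search", "input": "resistor color code"}
    return {"tool": "finish", "input": observations[-1]["result"]}

print(run_agent_loop(decide, {"search": lambda query: f"results for {query}"}))
```

Returning the collected observations when the limit is hit makes it possible to inspect why the agent failed to converge.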
  12. 12 Multi-Modal RAG · 15 min

Multi-Modal RAG retrieves relevant images and text chunks, then uses vision models to ground generated responses in visual evidence from the corpus.

Standard RAG fails when queries rely on visual similarity. Multi-Modal RAG indexes both images and their descriptions, enabling retrieval across modalities. At query time, the system retrieves candidate images and generates responses that cite visual evidence. In the example below, `describe_image` and `embed_text` are placeholder callables for your local vision model and text embedding model.

```python
from pathlib import Path
from typing import Callable


class MultiModalVectorStore:
    """In-memory index over document text and image descriptions.

    describe_image(path) -> str and embed_text(text) -> list[float] are
    placeholders that you supply.
    """

    def __init__(
        self,
        describe_image: Callable[[Path], str],
        embed_text: Callable[[str], list[float]],
    ):
        self.describe_image = describe_image
        self.embed_text = embed_text
        self.text_embeddings = {}
        self.image_descriptions = {}
        self.image_embeddings = {}

    def index_document(self, doc_path: Path, images_dir: Path):
        """Index document with embedded images."""
        # Generate text embeddings
        text_content = doc_path.read_text()
        self.text_embeddings[doc_path.stem] = self.embed_text(text_content)

        # Process images
        for img_path in sorted(images_dir.glob("*.png")):
            # Generate a searchable description and store it for retrieval
            description = self.describe_image(img_path)
            img_key = img_path.stem
            self.image_descriptions[img_key] = description
            # Generate embedding for description
            self.image_embeddings[img_key] = self.embed_text(description)

    def retrieve(self, query: str, top_k: int = 4) -> list[dict]:
        """Retrieve the most relevant images by description similarity."""
        query_embedding = self.embed_text(query)

        # Simple similarity search (use vector DB in production)
        candidates = []
        for key, emb in self.image_embeddings.items():
            score = self._cosine_similarity(query_embedding, emb)
            candidates.append({
                "key": key,
                "description": self.image_descriptions[key],
                "score": score,
            })
        return sorted(candidates, key=lambda x: x["score"], reverse=True)[:top_k]

    @staticmethod
    def _cosine_similarity(a: list, b: list) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = sum(x * x for x in a) ** 0.5
        norm_b = sum(x * x for x in b) ** 0.5
        if norm_a == 0 or norm_b == 0:
            return 0.0  # avoid division by zero for empty or all-zero vectors
        return dot / (norm_a * norm_b)
```

**Failure Modes:**

- Mismatched embeddings when images have metadata but descriptions explain context. Index both.
- Missed relevance when query visual similarity differs from semantic similarity. Consider hybrid retrieval.
- RAG hallucination when response cites images loosely related to query. Include citation validation.
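Citation validation, mentioned in the last failure mode, can start as a simple set comparison between the image keys a response cites and the keys that were actually retrieved. The sketch below assumes the generator was instructed to cite images as `[img:key]`; that citation format and the `validate_citations` helper are hypothetical conventions, not part of any library.

```python
import re


def validate_citations(answer: str, retrieved_keys: set[str]) -> dict:
    """Compare [img:key] citations in an answer with the retrieved keys."""
    cited = set(re.findall(r"\[img:([\w\-]+)\]", answer))
    return {
        "valid": sorted(cited & retrieved_keys),      # cited and retrieved
        "invalid": sorted(cited - retrieved_keys),    # cited but never retrieved
        "uncited": sorted(retrieved_keys - cited),    # retrieved but unused
    }


answer = "The pump layout is shown in [img:fig-2], see also [img:fig-9]."
print(validate_citations(answer, {"fig-1", "fig-2"}))
```

Any entry under `invalid` means the response refers to an image the retriever never supplied, which is a strong signal of hallucination.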
  13. 13 Image Embeddings · 15 min

Vision embeddings compress visual information into dense vectors capturing semantic content. Embedding dimensionality and normalization significantly affect retrieval accuracy.

Image embeddings transform pixel data into fixed-length vectors where semantically similar images cluster together. The embedding model determines what aspects of similarity matter for your use case. The example below uses a proxy approach: describe the image with a vision model, then embed the description. `describe_image` and `embed_text` are placeholder callables that you supply.

```python
from typing import Callable, Protocol

import numpy as np


class EmbeddingModel(Protocol):
    def embed(self, image_path: str) -> np.ndarray: ...
    def batch_embed(self, image_paths: list[str]) -> list[np.ndarray]: ...


class DescriptionProxyEmbedder:
    """Embed images by describing them, then embedding the description.

    describe_image(image_path, prompt) -> str and
    embed_text(text) -> list[float] are placeholders for your local
    vision model and text embedding model.
    """

    def __init__(
        self,
        describe_image: Callable[[str, str], str],
        embed_text: Callable[[str], list[float]],
    ):
        self.describe_image = describe_image
        self.embed_text = embed_text

    def embed(self, image_path: str) -> np.ndarray:
        # Use vision model to generate description, then embed it as proxy
        desc = self.describe_image(image_path, "Describe this image in exactly 10 words.")
        return np.asarray(self.embed_text(desc), dtype=float)

    def batch_embed(self, image_paths: list[str]) -> list[np.ndarray]:
        return [self.embed(path) for path in image_paths]

    @staticmethod
    def compute_similarity(emb1, emb2) -> float:
        """Cosine similarity between two embeddings"""
        v1 = np.asarray(emb1, dtype=float)
        v2 = np.asarray(emb2, dtype=float)
        norm1 = np.linalg.norm(v1)
        norm2 = np.linalg.norm(v2)
        return float(np.dot(v1, v2) / (norm1 * norm2))

    def batch_similarity_matrix(self, embeddings) -> np.ndarray:
        """Compute pairwise similarity matrix"""
        n = len(embeddings)
        matrix = np.zeros((n, n))
        for i in range(n):
            for j in range(i, n):
                sim = self.compute_similarity(embeddings[i], embeddings[j])
                matrix[i, j] = sim
                matrix[j, i] = sim
        return matrix
```

**Common Mistakes:**

- Embedding mismatch when different runs use different model versions. Pin model versions.
- Ignoring normalization: unnormalized embeddings produce misleading similarity scores.
- Batch size limits: large images or batches cause timeout. Resize and chunk.
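The normalization point above is easy to demonstrate with small vectors. The sketch below uses plain Python lists so it runs without NumPy; `l2_normalize` and `dot` are illustrative helpers, and the vectors are toy values chosen so the arithmetic is easy to follow.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


query = [3.0, 4.0]
same_direction = [30.0, 40.0]  # same content, larger magnitude
different = [4.0, 3.0]

# Raw dot products reward magnitude, not direction.
print(dot(query, same_direction), dot(query, different))

# After normalization the scores are comparable cosine similarities.
q, s, d = (l2_normalize(v) for v in (query, same_direction, different))
print(round(dot(q, s), 6), round(dot(q, d), 6))
```

Normalizing once at index time also lets a vector database use the cheaper inner-product search while still ranking by cosine similarity.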
  14. 14 CLIP Models

CLIP learns joint image-text representations by training on image-caption pairs. This enables zero-shot classification and cross-modal retrieval without task-specific fine-tuning.

CLIP encodes images and text into a shared embedding space where related concepts cluster. The query text defines the classification space at inference time, which allows flexible recognition of concepts that were never labeled during training.

```python
import clip  # OpenAI CLIP package (github.com/openai/CLIP), not part of torchvision
import torch


class CLIPEmbedder:
    def __init__(self, model_name: str = "ViT-B/32"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        # Load pre-trained CLIP model and its matching image transform
        self.model, self.preprocess = clip.load(model_name, device=self.device)

    def encode_image(self, image_tensor: torch.Tensor) -> torch.Tensor:
        """
        Encode a single image into an embedding.
        image_tensor: output of self.preprocess(pil_image).unsqueeze(0)
        """
        with torch.no_grad():
            return self.model.encode_image(image_tensor.to(self.device))

    def encode_text(self, text: str) -> torch.Tensor:
        """Encode text into embedding"""
        with torch.no_grad():
            text_tokens = clip.tokenize([text]).to(self.device)
            return self.model.encode_text(text_tokens)

    def compute_similarity(
        self, image_emb: torch.Tensor, text_emb: torch.Tensor
    ) -> torch.Tensor:
        """Compute cosine similarity between image and text embeddings"""
        return torch.cosine_similarity(image_emb, text_emb, dim=-1)

    def zero_shot_classify(
        self, image_tensor: torch.Tensor, candidate_labels: list[str]
    ) -> list[dict]:
        """
        Classify image without task-specific training.
        candidate_labels: ["cat", "dog", "bird", "fish"]
        """
        with torch.no_grad():
            # Encode candidate labels
            text_tokens = clip.tokenize(candidate_labels).to(self.device)
            text_embeddings = self.model.encode_text(text_tokens)

            # Encode image
            image_embedding = self.model.encode_image(image_tensor.to(self.device))

            # Normalize embeddings so the dot product is cosine similarity
            text_embeddings = text_embeddings / text_embeddings.norm(dim=-1, keepdim=True)
            image_embedding = image_embedding / image_embedding.norm(dim=-1, keepdim=True)

            # Scale similarities before softmax (100 approximates CLIP's learned logit scale)
            similarity = 100.0 * image_embedding @ text_embeddings.T

            # Convert to probabilities for the first (only) image in the batch
            probs = similarity.softmax(dim=-1)[0]

        return [
            {"label": label, "probability": prob.item()}
            for label, prob in zip(candidate_labels, probs)
        ]
```

**Failure Modes:**

- CLIP struggles with fine-grained distinctions (for example, dog breeds). Consider specialized models for precision tasks.
- Domain mismatch: CLIP trained on internet images may underperform on specialized domains (medical, satellite).
- Text encoding limit: CLIP's text encoder has a 77-token context window, including start and end tokens. Longer prompts are rejected or truncated depending on tokenizer settings. Keep labels under 75 tokens.

15 min
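The zero-shot step in the CLIP chapter is a scaled cosine similarity followed by a softmax over the candidate labels. The sketch below reproduces only that arithmetic in plain Python, using invented 2-dimensional unit vectors in place of real CLIP embeddings; `zero_shot_probs` and the label vectors are hypothetical names and values chosen for illustration.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probs(
    image_emb: list[float],
    label_embs: list[list[float]],
    logit_scale: float = 100.0,
) -> list[float]:
    """Softmax over scaled dot products; assumes all inputs are L2-normalized."""
    logits = [
        logit_scale * sum(i * t for i, t in zip(image_emb, emb))
        for emb in label_embs
    ]
    return softmax(logits)


labels = ["cat", "dog", "bird"]
image_emb = [0.6, 0.8]                               # unit-length toy vector
label_embs = [[0.6, 0.8], [0.8, 0.6], [1.0, 0.0]]    # unit-length toy vectors

probs = zero_shot_probs(image_emb, label_embs)
best = labels[probs.index(max(probs))]

# The same similarities without scaling give a much flatter distribution
flat_probs = zero_shot_probs(image_emb, label_embs, logit_scale=1.0)

print(best, [round(p, 3) for p in probs])
print([round(p, 3) for p in flat_probs])
```

Because cosine similarities live in a narrow range, the scale factor largely decides how confident the output looks. That is one reason to treat these probabilities as relative rankings over the supplied labels rather than calibrated confidence.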
  15. 15 Performance Optimization

Vision models are computationally intensive. Optimization strategies include caching decoded images, batching requests, reducing resolution for preliminary filtering, and using task-specialized models for routing.

Production vision workloads face latency and cost pressures. Strategic optimizations can reduce costs substantially, by up to 10x in favorable cases, while maintaining accuracy for production use cases.

```python
import base64
import hashlib
import io
from pathlib import Path

from PIL import Image


class VisionPerformanceOptimizer:
    def __init__(self, client):
        # client is assumed to expose an async messages.create(...) call
        self.client = client
        self.preprocessed_cache = {}
        self.embedding_cache = {}

    def cache_key(self, image_path: str, target_size: tuple[int, int]) -> str:
        """Generate cache key from image path, modification time and target size"""
        stat = Path(image_path).stat()
        return hashlib.sha256(
            f"{image_path}:{stat.st_mtime}:{target_size}".encode()
        ).hexdigest()

    def preprocess_image(
        self, image_path: str, target_size: tuple[int, int] = (512, 512)
    ) -> bytes:
        """
        Preprocess and cache image once.
        Returns cached bytes for subsequent calls.
        """
        cache_key = self.cache_key(image_path, target_size)
        if cache_key in self.preprocessed_cache:
            return self.preprocessed_cache[cache_key]

        img = Image.open(image_path).convert("RGB")  # JPEG cannot store alpha
        # Resize in place, preserving aspect ratio
        img.thumbnail(target_size, Image.LANCZOS)

        # Save to bytes (PNG for quality, JPEG for size)
        buffer = io.BytesIO()
        img.save(buffer, format="JPEG", quality=85)
        processed = buffer.getvalue()

        self.preprocessed_cache[cache_key] = processed
        return processed

    @staticmethod
    def _image_part(image_bytes: bytes) -> dict:
        """Wrap JPEG bytes as a base64 image content part"""
        data = base64.b64encode(image_bytes).decode()
        return {"type": "image", "source": {"type": "base64", "data": data}}

    async def batch_vision_analysis(
        self, image_paths: list[str], prompt: str, batch_size: int = 4
    ) -> list[str]:
        """
        Process images in batches to reduce API overhead.
        """
        results = []
        for i in range(0, len(image_paths), batch_size):
            batch = image_paths[i:i + batch_size]

            # Preprocess all images in the batch
            batch_request = [
                self._image_part(self.preprocess_image(path)) for path in batch
            ]

            # Process sequentially for API compatibility
            for req in batch_request:
                response = await self.client.messages.create(
                    model="gemini-2.0-flash-thinking",
                    messages=[{
                        "role": "user",
                        "content": [req, {"type": "text", "text": prompt}]
                    }]
                )
                results.append(response.content[0].text)
        return results

    async def adaptive_quality_analysis(self, image_path: str, query: str) -> dict:
        """
        Use low-quality for initial filtering, high-quality for confident analysis.
        """
        # Quick low-res analysis
        low_res = self.preprocess_image(image_path, target_size=(256, 256))
        quick_response = await self.client.messages.create(
            model="gemini-2.0-flash-thinking",
            messages=[{
                "role": "user",
                "content": [
                    self._image_part(low_res),
                    {"type": "text",
                     "text": f"Is this image relevant to: {query}? Answer yes or no."}
                ]
            }]
        )

        if quick_response.content[0].text.strip().lower().startswith("yes"):
            # Upgrade to high-res analysis
            high_res = self.preprocess_image(image_path, target_size=(1024, 1024))
            detailed_response = await self.client.messages.create(
                model="gemini-2.0-flash-thinking",
                messages=[{
                    "role": "user",
                    "content": [
                        self._image_part(high_res),
                        {"type": "text", "text": query}
                    ]
                }]
            )
            return {"relevant": True, "analysis": detailed_response.content[0].text}

        return {"relevant": False}
```

**Failure Modes:**

- Cache invalidation bugs where stale cached preprocessed images cause incorrect outputs. Include modification time in cache keys.
- Batch size too large, causing memory exhaustion. Monitor GPU memory during batching.
- Adaptive quality making wrong filtering decisions. Audit precision/recall of the filtering stage.

15 min
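The cache-invalidation failure mode in the Performance Optimization chapter comes down to what goes into the cache key. The sketch below is a standalone illustration using only the standard library; `preprocess_cache_key` is a hypothetical helper, and the temporary file stands in for a real image. It folds the modification time, file size and target resolution into the key so that an edited file or a different resize target never reuses a stale entry.

```python
import hashlib
import os
import tempfile


def preprocess_cache_key(image_path: str, target_size: tuple[int, int]) -> str:
    """Key on path, mtime, file size and target size (hypothetical helper)."""
    stat = os.stat(image_path)
    raw = "|".join([
        os.path.abspath(image_path),
        str(stat.st_mtime_ns),
        str(stat.st_size),
        f"{target_size[0]}x{target_size[1]}",
    ])
    return hashlib.sha256(raw.encode()).hexdigest()


# A throwaway file stands in for an image on disk
with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as tmp:
    tmp.write(b"fake image bytes")
    path = tmp.name

key_small = preprocess_cache_key(path, (256, 256))
key_large = preprocess_cache_key(path, (1024, 1024))

# Simulate a later overwrite by moving the modification time forward one second
st = os.stat(path)
os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns + 1_000_000_000))
key_after_edit = preprocess_cache_key(path, (256, 256))

os.remove(path)

print(key_small != key_large, key_small != key_after_edit)
```

Modification time is cheap to read but can be preserved by some copy tools; hashing the file contents is a stricter alternative when that matters more than speed.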
  16. 16 Quantization for Vision

Vision model quantization reduces weight precision (typically from float32 to int8), which lets larger models fit on limited hardware. Accuracy trade-offs vary by visual domain: complex scenes suffer more than simple classification tasks.

Quantization works differently across model components. Weight quantization affects storage and memory bandwidth. Activation quantization requires careful calibration to avoid overflow while preserving semantic signal.

```python
import time
from typing import Callable

import torch
from torchvision.models import efficientnet_b0


class QuantizedVisionModel:
    def __init__(self, model_name: str = "efficientnet_b0"):
        self.model_name = model_name
        self.base_model = efficientnet_b0(weights=None)
        self.quantized_model = None

    def apply_dynamic_quantization(self):
        """
        Dynamic quantization: weights stored as int8, activations
        quantized on the fly at inference time.
        Fast conversion, moderate memory savings.
        """
        self.base_model.eval()
        self.quantized_model = torch.quantization.quantize_dynamic(
            self.base_model,
            {torch.nn.Linear},  # dynamic mode targets Linear/recurrent layers, not Conv2d
            dtype=torch.qint8
        )
        return self.quantized_model

    def apply_static_quantization(self):
        """
        Static quantization: calibrate with representative data, then
        convert both weights and activations to int8.
        Best memory savings but requires calibration dataset.
        """
        # Post-training quantization calibrates in eval mode
        self.base_model.eval()
        self.base_model.qconfig = torch.quantization.default_qconfig
        torch.quantization.prepare(self.base_model, inplace=True)

        # Calibrate. Random tensors are placeholders here;
        # use representative images in practice.
        calibration_loader = [torch.randn(1, 3, 224, 224) for _ in range(32)]
        with torch.no_grad():
            for cal_batch in calibration_loader:
                self.base_model(cal_batch)

        # Convert to quantized model.
        # Note: eager-mode static quantization also expects QuantStub/DeQuantStub
        # at the model boundaries and fused conv/bn blocks, so an unmodified
        # torchvision model may need those changes before the result runs.
        self.quantized_model = torch.quantization.convert(
            self.base_model, inplace=False
        )
        return self.quantized_model

    def benchmark_inference(
        self, model: torch.nn.Module, input_tensor: torch.Tensor, iterations: int = 100
    ) -> dict:
        """Measure latency and memory for model inference"""
        model.eval()

        with torch.no_grad():
            # Warmup
            for _ in range(10):
                model(input_tensor)

            # Time iterations
            if torch.cuda.is_available():
                torch.cuda.synchronize()
            start = time.perf_counter()
            for _ in range(iterations):
                model(input_tensor)
            if torch.cuda.is_available():
                torch.cuda.synchronize()
            end = time.perf_counter()

        avg_latency_ms = (end - start) / iterations * 1000

        # Memory usage
        if torch.cuda.is_available():
            memory_mb = torch.cuda.max_memory_allocated() / (1024 ** 2)
        else:
            # Approximate parameter memory from tensor sizes in the state dict
            memory_mb = sum(
                t.numel() * t.element_size()
                for t in model.state_dict().values()
                if isinstance(t, torch.Tensor)
            ) / (1024 ** 2)

        return {"avg_latency_ms": avg_latency_ms, "memory_mb": memory_mb}


class VisionModelCompressor:
    """Compress vision models for edge deployment"""

    def __init__(self):
        self.pruning_threshold = 0.01

    def prune_filters(
        self,
        model: torch.nn.Module,
        importance_metric: Callable[[torch.Tensor], torch.Tensor]
    ) -> torch.nn.Module:
        """
        Remove filters with low importance scores.
        Importance can be based on activation statistics or gradient magnitudes.
        importance_metric must return one score per output filter.
        """
        for name, module in model.named_modules():
            if isinstance(module, torch.nn.Conv2d):
                # Compute filter importance, shape (out_channels,)
                weights = module.weight.detach()
                importance = importance_metric(weights)

                # Create mask for important filters
                mask = (importance > self.pruning_threshold).to(weights.dtype)

                # Zero out unimportant filters; reshape so the mask
                # broadcasts over (out, in, kh, kw)
                module.weight.data *= mask.view(-1, 1, 1, 1)
        return model
```

**Failure Modes:**

- Static quantization accuracy collapse when the calibration set lacks diversity. Use a representative dataset spanning the input distribution.
- Quantization breaks models relying on precise thresholds (object detection with score cutoffs). Test threshold-dependent logic post-quantization.
- Asymmetric quantization ranges causing activation overflow. Monitor for NaN/Inf outputs.

15 min
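The calibration and overflow points in the Quantization chapter are easier to reason about with the underlying arithmetic in view. The sketch below is a plain-Python illustration of asymmetric affine int8 quantization under common textbook conventions; the function names and activation values are invented for the example and do not mirror any particular framework API.

```python
def quantize_params(min_val: float, max_val: float, qmin: int = -128, qmax: int = 127):
    """Return (scale, zero_point) for asymmetric affine quantization."""
    # The representable range must include zero so that 0.0 maps exactly
    min_val, max_val = min(min_val, 0.0), max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero range
    zero_point = int(round(qmin - min_val / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x: float, scale: float, zero_point: int, qmin: int = -128, qmax: int = 127) -> int:
    """Map a float to int8, clamping values outside the calibrated range."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Map an int8 value back to an approximate float."""
    return (q - zero_point) * scale


# Illustrative "calibration" activations
activations = [-1.0, -0.25, 0.0, 0.4, 2.0]
scale, zero_point = quantize_params(min(activations), max(activations))

restored = [dequantize(quantize(x, scale, zero_point), scale, zero_point) for x in activations]
max_error = max(abs(a - r) for a, r in zip(activations, restored))

# A value far outside the calibrated range saturates at the int8 limit
outlier_q = quantize(10.0, scale, zero_point)

print(f"scale={scale:.5f} zero_point={zero_point} max_error={max_error:.5f} outlier_q={outlier_q}")
```

Inside the calibrated range the round-trip error stays within half a quantization step. Outside it, values are clamped, which is why a calibration set that misses part of the input distribution can quietly distort activations.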
  17. 17 Multi-Modal Evaluation

Multi-modal evaluation requires measuring alignment between visual understanding and generated text, grounding accuracy in image evidence, and consistency across different phrasings of the same query.

Standard NLP metrics (BLEU, ROUGE) capture visual grounding accuracy poorly. Multi-modal evaluation needs metrics that verify that claims made about images are actually supported by image content.

```python
import base64
import re
from dataclasses import dataclass


@dataclass
class MultiModalMetrics:
    grounding_score: float      # Can claims be verified in image?
    consistency_score: float    # Do semantically equivalent queries get same answer?
    completeness_score: float   # Are key visual elements mentioned?
    hallucination_rate: float   # What fraction of claims cannot be verified?


class MultiModalEvaluator:
    def __init__(self, client):
        # client is assumed to expose an async messages.create(...) call
        self.client = client

    async def _ask(self, img_b64: str, prompt: str) -> str:
        """Send one image plus a text prompt and return the text reply"""
        response = await self.client.messages.create(
            model="gemini-2.0-flash-thinking",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image", "source": {"type": "base64", "data": img_b64}},
                    {"type": "text", "text": prompt}
                ]
            }]
        )
        return response.content[0].text

    async def evaluate_vqa_response(
        self, image_path: str, question: str, answer: str
    ) -> MultiModalMetrics:
        """
        Evaluate whether generated answer is grounded in image.
        """
        with open(image_path, "rb") as f:
            img_b64 = base64.b64encode(f.read()).decode()

        # Extract verifiable claims from answer
        claims_prompt = f"""
        Extract factual claims from this answer that can be verified
        by looking at the image.
        Answer: {answer}
        List each claim as a bullet point. Be specific.
        """
        claims = self._parse_claims(await self._ask(img_b64, claims_prompt))

        # Verify each claim against image
        verified = []
        unverified = []
        for claim in claims:
            verification = await self._verify_claim(image_path, img_b64, claim)
            if verification["verified"]:
                verified.append(claim)
            else:
                unverified.append(claim)

        grounding_score = len(verified) / len(claims) if claims else 0.0
        hallucination_rate = len(unverified) / len(claims) if claims else 0.0

        # Evaluate completeness
        completeness = await self._evaluate_completeness(
            image_path, img_b64, question, answer
        )

        return MultiModalMetrics(
            grounding_score=grounding_score,
            consistency_score=0.0,  # Requires separate consistency evaluation
            completeness_score=completeness,
            hallucination_rate=hallucination_rate
        )

    async def evaluate_consistency(self, image_path: str, queries: list[str]) -> float:
        """
        Evaluate whether semantically equivalent queries get consistent answers.
        """
        if not queries:
            return 0.0
        with open(image_path, "rb") as f:
            img_b64 = base64.b64encode(f.read()).decode()

        answers = [self._normalize(await self._ask(img_b64, q)) for q in queries]

        # Measure agreement between normalized answers
        unique_answers = set(answers)
        return 1.0 - (len(unique_answers) - 1) / len(answers)

    async def _verify_claim(self, image_path: str, img_b64: str, claim: str) -> dict:
        """Verify a single claim against image evidence"""
        verification_prompt = f"""
        Can this claim be verified by looking at the image?
        Claim: {claim}
        Answer YES if the claim is clearly supported by the image.
        Answer NO if the claim contradicts or cannot be verified from the image.
        Answer PARTIAL if the claim is partially supported.
        """
        text = await self._ask(img_b64, verification_prompt)
        normalized = text.strip().upper()
        return {
            "verified": normalized.startswith("YES"),
            "partial": "PARTIAL" in normalized,
            "response": text
        }

    async def _evaluate_completeness(
        self, image_path: str, img_b64: str, question: str, answer: str
    ) -> float:
        """
        Assumed implementation (not specified in the chapter): ask the model
        for a 0-1 completeness score and parse the first number in the reply.
        """
        prompt = f"""
        Question: {question}
        Answer: {answer}
        On a scale from 0 to 1, how completely does the answer cover the key
        visual elements relevant to the question? Reply with a single number.
        """
        match = re.search(r"\d*\.?\d+", await self._ask(img_b64, prompt))
        score = float(match.group()) if match else 0.0
        return max(0.0, min(1.0, score))

    def _parse_claims(self, text: str) -> list[str]:
        """Extract claims from model response"""
        claims = []
        for line in text.split("\n"):
            line = line.strip()
            if line.startswith("-") or line.startswith("*"):
                claims.append(line.lstrip("-* ").strip())
        return claims

    def _normalize(self, text: str) -> str:
        """Normalize text for consistency comparison"""
        text = text.lower()
        text = re.sub(r"[^\w\s]", "", text)   # drop punctuation
        text = re.sub(r"\s+", " ", text)      # collapse whitespace
        return text.strip()
```

**Failure Modes:**

- Grounding evaluator hallucinating verification when the image is ambiguous. Use conservative thresholds.
- Consistency false negatives when legitimate multiple interpretations exist. Distinguish semantic equivalence from answer format.
- Incomplete evaluation when the answer omits commonly tested visual elements. Compare against a ground-truth element list.

20 min
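The scores in the Multi-Modal Evaluation chapter can be checked offline once verdicts and answers have been collected. The sketch below is one possible aggregation, written without any model calls: it assumes the YES/NO/PARTIAL verdict labels from the verification prompt, reports PARTIAL separately instead of folding it into the unverified bucket, and uses a majority-agreement consistency score as an alternative to counting unique answers. The function names and sample data are hypothetical.

```python
from collections import Counter


def grounding_metrics(verdicts: list[str]) -> dict:
    """Aggregate per-claim verdicts ("YES", "NO", "PARTIAL") into rates."""
    total = len(verdicts)
    if total == 0:
        return {"grounding": 0.0, "partial": 0.0, "hallucination": 0.0}
    counts = Counter(v.strip().upper() for v in verdicts)
    return {
        "grounding": counts["YES"] / total,
        "partial": counts["PARTIAL"] / total,
        "hallucination": counts["NO"] / total,
    }


def majority_consistency(answers: list[str]) -> float:
    """Fraction of answers that match the most common normalized answer."""
    if not answers:
        return 0.0
    normalized = [" ".join(a.lower().split()) for a in answers]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)


# Invented sample data for illustration
metrics = grounding_metrics(["YES", "yes", "PARTIAL", "NO"])
consistency = majority_consistency(["Two dogs", "two  dogs", "Three dogs", "two dogs"])

print(metrics)
print(consistency)
```

Keeping PARTIAL as its own bucket makes it visible how often the verifier hedges, which helps when deciding how conservative the grounding threshold should be.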
  18. 18 Vision Analysis System Project

Building production vision systems requires integrating image preprocessing, model orchestration, caching layers, evaluation loops, and error handling into a coherent architecture, not just calling a model.

This chapter synthesizes all previous concepts into a complete vision analysis system. The system handles image ingestion, routes requests to appropriate models, maintains caches, and provides traceable outputs with confidence scores.

```python
import asyncio
import base64
import hashlib
import io
import logging
import time
from dataclasses import dataclass, replace
from typing import Optional

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class AnalysisResult:
    request_id: str
    image_path: str
    model_used: str
    response: str
    confidence: float
    processing_time_ms: float
    cache_hit: bool = False
    error: Optional[str] = None


class VisionAnalysisPipeline:
    def __init__(
        self,
        client,
        embedding_cache_size: int = 1000,
        preprocess_cache_size: int = 500
    ):
        self.client = client
        self.embedding_cache_size = embedding_cache_size
        self.preprocess_cache_size = preprocess_cache_size

        # Caches
        self._preprocess_cache = {}
        self._result_cache = {}

        # Metrics
        self.cache_hits = 0
        self.cache_misses = 0
        self.total_requests = 0

    def _generate_request_id(self, image_path: str, query: str) -> str:
        """Create deterministic request ID for caching"""
        content = f"{image_path}:{query}"
        return hashlib.sha256(content.encode()).hexdigest()[:16]

    def _preprocess_image(
        self, image_path: str, max_size: tuple[int, int] = (1024, 1024)
    ) -> bytes:
        """Preprocess image with caching"""
        cache_key = f"{image_path}:{max_size}"
        if cache_key in self._preprocess_cache:
            return self._preprocess_cache[cache_key]

        from PIL import Image

        img = Image.open(image_path).convert("RGB")  # JPEG cannot store alpha
        img.thumbnail(max_size, Image.LANCZOS)
        buffer = io.BytesIO()
        img.save(buffer, format="JPEG", quality=90)
        processed = buffer.getvalue()

        if len(self._preprocess_cache) >= self.preprocess_cache_size:
            # Simple eviction: remove oldest (dicts keep insertion order)
            oldest_key = next(iter(self._preprocess_cache))
            del self._preprocess_cache[oldest_key]

        self._preprocess_cache[cache_key] = processed
        return processed

    async def _ask(self, img_b64: str, prompt: str, model: str) -> str:
        """Send one image plus a text prompt and return the text reply"""
        response = await self.client.messages.create(
            model=model,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image", "source": {"type": "base64", "data": img_b64}},
                    {"type": "text", "text": prompt}
                ]
            }]
        )
        return response.content[0].text

    async def analyze_image(
        self,
        image_path: str,
        query: str,
        model: str = "gemini-2.0-flash-thinking",
        use_cache: bool = True
    ) -> AnalysisResult:
        """Main analysis entry point with full pipeline"""
        start_time = time.perf_counter()
        request_id = self._generate_request_id(image_path, query)
        self.total_requests += 1  # count hits and misses alike

        # Check result cache
        if use_cache and request_id in self._result_cache:
            self.cache_hits += 1
            logger.info(f"Cache hit for request {request_id}")
            # Return a copy so the stored entry is not mutated
            return replace(self._result_cache[request_id], cache_hit=True)

        self.cache_misses += 1

        try:
            # Preprocess image
            processed = self._preprocess_image(image_path)
            img_b64 = base64.b64encode(processed).decode()

            # Route to appropriate handler based on keywords in the query
            lowered = query.lower()
            if "classify" in lowered:
                response_text, confidence = await self._classify_image(img_b64, query, model)
            elif "describe" in lowered:
                response_text, confidence = await self._describe_image(img_b64, query, model)
            elif "compare" in lowered:
                response_text, confidence = await self._compare_images(image_path, query, model)
            else:
                response_text, confidence = await self._general_analysis(img_b64, query, model)

            processing_time = (time.perf_counter() - start_time) * 1000
            result = AnalysisResult(
                request_id=request_id,
                image_path=image_path,
                model_used=model,
                response=response_text,
                confidence=confidence,
                processing_time_ms=processing_time,
                cache_hit=False
            )

            # Cache result
            if use_cache and len(self._result_cache) < 1000:
                self._result_cache[request_id] = result
            return result

        except Exception as e:
            logger.error(f"Analysis failed for {image_path}: {e}")
            processing_time = (time.perf_counter() - start_time) * 1000
            return AnalysisResult(
                request_id=request_id,
                image_path=image_path,
                model_used=model,
                response="",
                confidence=0.0,
                processing_time_ms=processing_time,
                error=str(e)
            )

    async def _classify_image(self, img_b64: str, query: str, model: str) -> tuple[str, float]:
        """Classification analysis with confidence estimation"""
        prompt = f"""
        Analyze this image for classification purposes.
        Query: {query}
        Output structured response with categories and confidence levels.
        """
        # Estimate confidence from response language
        response_text = await self._ask(img_b64, prompt, model)
        return response_text, self._estimate_confidence(response_text)

    async def _describe_image(self, img_b64: str, query: str, model: str) -> tuple[str, float]:
        """Generate detailed image description"""
        prompt = f"""
        Provide a detailed description of this image.
        User interest: {query}
        Focus on relevant aspects per the user's interest.
        Include specific details that could be verified.
        """
        # 0.85 is a fixed placeholder confidence, not a measured value
        return await self._ask(img_b64, prompt, model), 0.85

    async def _general_analysis(self, img_b64: str, query: str, model: str) -> tuple[str, float]:
        """General purpose visual question answering"""
        response_text = await self._ask(img_b64, query, model)
        return response_text, self._estimate_confidence(response_text)

    async def _compare_images(self, image_path: str, query: str, model: str) -> tuple[str, float]:
        """Compare multiple images if referenced"""
        # Placeholder for multi-image comparison: analyzes the single image for now
        img_b64 = base64.b64encode(self._preprocess_image(image_path)).decode()
        response_text, _ = await self._general_analysis(img_b64, query, model)
        return response_text, 0.75

    def _estimate_confidence(self, response: str) -> float:
        """Heuristic confidence estimation from response characteristics"""
        base_confidence = 0.7
        lowered = response.lower()

        # Hedging language suggests lower confidence
        for word in ["might", "possibly", "perhaps", "maybe"]:
            if word in lowered:
                base_confidence -= 0.1

        # Specific details suggest higher confidence
        for pattern in ["specifically", "exactly", "clearly", "distinctly"]:
            if pattern in lowered:
                base_confidence += 0.05

        return max(0.0, min(1.0, base_confidence))

    def get_metrics(self) -> dict:
        """Return pipeline performance metrics"""
        return {
            "total_requests": self.total_requests,
            "cache_hits": self.cache_hits,
            "cache_misses": self.cache_misses,
            "cache_hit_rate": (
                self.cache_hits / self.total_requests
                if self.total_requests > 0 else 0.0
            ),
            "preprocess_cache_size": len(self._preprocess_cache),
            "result_cache_size": len(self._result_cache)
        }

    async def batch_analyze(
        self, image_queries: list[tuple[str, str]], batch_size: int = 10
    ) -> list[AnalysisResult]:
        """Process batch of image analysis requests"""
        results = []
        for i in range(0, len(image_queries), batch_size):
            batch = image_queries[i:i + batch_size]
            batch_tasks = [
                self.analyze_image(image_path, query)
                for image_path, query in batch
            ]
            # Run each batch concurrently; batches run one after another
            batch_results = await asyncio.gather(*batch_tasks)
            results.extend(batch_results)
        return results


# Mock client for local testing; replace with a real SDK client
@dataclass
class Content:
    text: str


@dataclass
class MockResponse:
    content: list


class _MockMessages:
    async def create(self, **kwargs):
        return MockResponse(content=[Content(text="Analysis result")])


class AsyncVertexAI:
    def __init__(self):
        self.messages = _MockMessages()


# Example usage
async def main():
    client = AsyncVertexAI()
    pipeline = VisionAnalysisPipeline(client)

    # Single analysis (if the file does not exist, result.error is set instead)
    result = await pipeline.analyze_image(
        image_path="product_image.jpg",
        query="What are the key features of this product?"
    )
    print(f"Analysis: {result.response}")
    print(f"Confidence: {result.confidence}")
    print(f"Processing time: {result.processing_time_ms:.2f}ms")
    print(f"Cache hit: {result.cache_hit}")

    # Batch analysis
    image_queries = [
        ("product1.jpg", "List visible features"),
        ("product2.jpg", "Compare with similar products"),
        ("product3.jpg", "Identify brand and model"),
    ]
    batch_results = await pipeline.batch_analyze(image_queries)
    for res in batch_results:
        print(f"{res.request_id}: {res.confidence}")

    # Print metrics
    print(pipeline.get_metrics())


if __name__ == "__main__":
    asyncio.run(main())
```

**Failure Modes:**

- Cache invalidation when images update but the same path is reused. Include file hash or mtime in cache keys.
- Batch queue backlog causing timeouts on large batches. Implement backpressure.
- Memory leaks from unbounded caches. Monitor cache sizes and implement eviction.

25 min
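The unbounded-cache failure mode in the Vision Analysis System Project chapter is commonly handled with a size-capped least-recently-used cache. The sketch below is one minimal way to do that with the standard library; `BoundedLRUCache` is a hypothetical class for illustration, not part of the pipeline shown in the chapter.

```python
from collections import OrderedDict


class BoundedLRUCache:
    """Size-capped cache that evicts the least recently used entry."""

    def __init__(self, max_entries: int):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self.max_entries = max_entries
        self._data: OrderedDict = OrderedDict()

    def get(self, key, default=None):
        """Return the cached value and mark the key as recently used."""
        if key not in self._data:
            return default
        self._data.move_to_end(key)
        return self._data[key]

    def put(self, key, value) -> None:
        """Insert or update a value, evicting old entries if over capacity."""
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # drop the least recently used entry

    def __len__(self) -> int:
        return len(self._data)


cache = BoundedLRUCache(max_entries=2)
cache.put("a.jpg:describe", "result A")
cache.put("b.jpg:describe", "result B")
cache.get("a.jpg:describe")               # touching A makes B the eviction candidate
cache.put("c.jpg:describe", "result C")   # exceeds capacity, so B is evicted

print(len(cache), cache.get("b.jpg:describe"))
```

Compared with removing the oldest inserted key, recency-based eviction keeps frequently requested images warm, which tends to matter when a small set of images receives most of the queries.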