RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Multi-Modal AI: Vision and Text

11. Vision Agents

Chapter 11 of 18 · 15 min
KEY INSIGHT

Vision agents chain observe-interpret-act cycles in which image context informs tool selection. The agent must reason about the visual elements before deciding which tools to invoke.

Vision agents extend text-only agents by incorporating visual context into the decision loop. Presented with a diagram, for example, the agent might run a web search for technical specifications or calculate dimensions with a Python tool.

```python
import base64
import mimetypes
from typing import Optional

# Async Anthropic client for Vertex AI; requires `pip install "anthropic[vertex]"`.
from anthropic import AsyncAnthropicVertex


class VisionReasoningAgent:
    def __init__(self, model: str, max_tokens: int = 1024):
        # Reads the region (CLOUD_ML_REGION) and Google Cloud credentials
        # from the environment.
        self.client = AsyncAnthropicVertex()
        # This client serves Claude models only, so pass a Claude model ID
        # that is enabled in your Vertex AI project.
        self.model = model
        self.max_tokens = max_tokens

    async def analyze_with_tools(self, image_path: str, user_intent: str) -> dict:
        """Vision agent with tool invocation based on visual analysis."""
        # The Messages API takes images as base64 data, not as file paths.
        media_type = mimetypes.guess_type(image_path)[0] or "image/png"
        with open(image_path, "rb") as f:
            image_data = base64.standard_b64encode(f.read()).decode("utf-8")

        # Initial visual understanding
        initial_analysis = await self.client.messages.create(
            model=self.model,
            max_tokens=self.max_tokens,
            messages=[{
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": image_data,
                        },
                    },
                    {
                        "type": "text",
                        # f-string, so that user_intent is interpolated
                        "text": f"""Analyze this image. First, identify what you see.
Then determine: which tools would help answer the user's intent?
The user wants: {user_intent}
Output your analysis, then list tools needed.
""",
                    },
                ],
            }],
        )
        analysis_text = initial_analysis.content[0].text
        lowered = analysis_text.lower()

        # Execute based on identified tools. The defaults keep the synthesis
        # prompt valid when a tool is not needed.
        calc_result = "not run"
        web_result = "not run"
        if "calculate" in lowered:
            calc_result = await self._run_calculation(analysis_text)
        if "search" in lowered:
            search_term = await self._extract_search_term(analysis_text)
            if search_term:
                web_result = await self._search_web(search_term)

        # Synthesize results
        final_response = await self.client.messages.create(
            model=self.model,
            max_tokens=self.max_tokens,
            messages=[{
                "role": "user",
                "content": f"""Based on the image analysis and tool results:
Initial Analysis: {analysis_text}
Calculations: {calc_result}
Web Search Results: {web_result}
Synthesize into a coherent answer about: {user_intent}
""",
            }],
        )
        return {"answer": final_response.content[0].text}

    async def _extract_search_term(self, analysis: str) -> Optional[str]:
        if "search" in analysis.lower():
            return "technical specifications"  # Placeholder: parse from context
        return None

    # Tool hooks: connect these to your calculator and web search tools.
    async def _run_calculation(self, analysis: str) -> str:
        raise NotImplementedError("Connect a calculation tool")

    async def _search_web(self, query: str) -> str:
        raise NotImplementedError("Connect a web search tool")
```

**Failure Modes:**
- Agents select the wrong tools when the visual analysis misses key elements. Always include a "what tools would help" prompt.
- Tool results that trigger re-analysis can cause infinite loops. Implement iteration limits.
- Multiple sources can return conflicting tool results. Require source attribution.

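The iteration-limit and source-attribution advice in the failure modes can be sketched without any model calls. The following is a minimal illustration under assumptions, not part of the chapter's agent: `ToolResult`, `LoopState`, `run_bounded`, and `format_with_sources` are hypothetical names, the planner stands in for the model's "which tool next?" decision, and the URL in the demo is a placeholder.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, Dict, List, Optional


@dataclass
class ToolResult:
    tool: str     # which tool produced the result
    source: str   # where the content came from (URL, "local calculation", ...)
    content: str


@dataclass
class LoopState:
    results: List[ToolResult] = field(default_factory=list)
    hit_limit: bool = False  # True when the tool-call budget was used up


# Hypothetical signatures: the planner looks at the results so far and returns
# the name of the next tool, or None when it is done.
Planner = Callable[[List[ToolResult]], Awaitable[Optional[str]]]
Tool = Callable[[], Awaitable[ToolResult]]


async def run_bounded(
    planner: Planner, tools: Dict[str, Tool], max_iterations: int = 3
) -> LoopState:
    """Run plan-then-act steps, stopping after at most max_iterations tool calls."""
    state = LoopState()
    for _ in range(max_iterations):
        next_tool = await planner(state.results)
        if next_tool is None:
            return state  # the planner finished on its own
        state.results.append(await tools[next_tool]())
    state.hit_limit = True  # budget exhausted; the planner is not asked again
    return state


def format_with_sources(results: List[ToolResult]) -> str:
    """Prefix every result with its tool and source so conflicts are traceable."""
    return "\n".join(f"[{r.tool} | {r.source}] {r.content}" for r in results)


if __name__ == "__main__":
    async def always_search(results: List[ToolResult]) -> Optional[str]:
        return "search"  # a planner that never finishes on its own

    async def fake_search() -> ToolResult:
        return ToolResult("search", "https://example.com/spec", "placeholder result")

    state = asyncio.run(run_bounded(always_search, {"search": fake_search}))
    print(state.hit_limit, len(state.results))
    print(format_with_sources(state.results))
```

Passing the output of `format_with_sources` into the synthesis prompt, rather than bare tool output, gives the model what it needs to name which source a claim came from when results disagree.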
EXERCISE

Build a vision agent that accepts a floor plan image and a user query about furniture placement. The agent must identify room dimensions, recommend furniture sizes using a calculator tool, and look up design guidelines using web search.
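One possible starting point for the calculator tool in this exercise is a plain geometric fit check. This is a sketch under assumptions: `Rect` and `fits` are hypothetical names, rooms and furniture are treated as axis-aligned rectangles, and the 0.6 m default clearance is an illustrative placeholder that should be replaced by whatever your design-guideline search returns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    width_m: float
    depth_m: float


def fits(room: Rect, item: Rect, clearance_m: float = 0.6) -> bool:
    """Return True if the item fits in the room with clearance on every side.

    Both orientations of the item (as given, and rotated by 90 degrees)
    are tried. The clearance default is an assumed placeholder value.
    """
    usable_w = room.width_m - 2 * clearance_m
    usable_d = room.depth_m - 2 * clearance_m
    as_given = item.width_m <= usable_w and item.depth_m <= usable_d
    rotated = item.depth_m <= usable_w and item.width_m <= usable_d
    return as_given or rotated


if __name__ == "__main__":
    living_room = Rect(width_m=4.0, depth_m=3.0)
    sofa = Rect(width_m=2.0, depth_m=0.9)
    print(fits(living_room, sofa))
```

The room dimensions would come from the vision step, so validate them (positive, plausible scale) before handing them to the calculator.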
