RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Advanced Multi-Modal Systems

10. Vision Agents

Chapter 10 of 24 · 20 min
KEY INSIGHT

Vision agents succeed when perception, reasoning, and action form an integrated loop. Failures occur when these components operate independently, causing the agent to act on outdated or irrelevant perceptions.

Vision agents use visual inputs to plan and execute actions in an environment. They perceive, reason, and act, closing the loop between perception and control.

Agent Architecture

A vision agent consists of three modules: visual perception, reasoning/planning, and action execution.

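import numpy as np
from PIL import Image


class VisionAgent:
    def __init__(self, vision_model, llm, action_space):
        self.vision = vision_model          # expected to expose describe(image) -> str
        self.llm = llm                      # expected to expose generate(prompt) -> str
        self.action_space = action_space    # expected to expose describe() -> str
        self.max_steps = 20
        self.history = []                   # per-episode history; reset in run()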
    
    def perceive(self, observation):
        """Convert observation to structured representation."""
        if isinstance(observation, np.ndarray):
            # Image observation
            frame = Image.fromarray(observation)
            visual_description = self.vision.describe(frame)
        else:
            visual_description = str(observation)
        
        return visual_description
    
    def plan(self, goal, state_description, history):
        """Generate plan given goal and current state."""
        prompt = f"""
You are a robot agent. You have seen the following in the environment:
Current observation: {state_description}

Goal: {goal}

Previous actions taken: {history}

What should you do next? Respond with:
1. Reasoning: Why this action
2. Action: The next action to take

Available actions: {self.action_space.describe()}
"""
        
        response = self.llm.generate(prompt)
        return self._parse_action(response)
    
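    def step(self, observation, goal):
        """Execute one step of perceive-plan-act loop."""
        # Perceive: describe the latest observation
        state = self.perceive(observation)

        # Plan from the fresh perception and the actions taken so far
        action = self.plan(goal, state, self.history)

        # Record the chosen action (not the observation), so the prompt's
        # "Previous actions taken" field really lists actions
        self.history.append(str(action))

        # Execute (returns next observation); the action object is assumed
        # to be bound to the environment it acts on
        next_obs = action.execute()

        return next_obs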
    
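    def run(self, env, goal, max_steps=None):
        """Run agent in environment until goal or max steps."""
        # Compare against None so an explicit max_steps=0 is respected
        max_steps = self.max_steps if max_steps is None else max_steps
        obs = env.reset()
        self.history = []

        for step in range(max_steps):
            obs = self.step(obs, goal)

            if self._check_goal(obs, goal):
                return {"success": True, "steps": step + 1}

        return {"success": False, "steps": max_steps}

    def _parse_action(self, response):
        """Turn the LLM reply into an action object with an execute() method.
        Task-specific; left for subclasses to implement."""
        raise NotImplementedError

    def _check_goal(self, observation, goal):
        """Return True when the observation satisfies the goal.
        Task-specific; left for subclasses to implement."""
        raise NotImplementedError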
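The planning prompt asks the model to answer in a "Reasoning / Action" format, but `_parse_action` is not implemented in the class above. The following is a minimal sketch of one way to do the parsing, assuming the action space can be represented as a set of lowercase action names. The function name `parse_action`, the `allowed_actions` parameter, and the example action names are illustrative assumptions, not part of the chapter's API.

```python
import re


def parse_action(response, allowed_actions):
    """Extract the action name from an LLM reply in the 'Reasoning / Action' format.

    Returns the matched action name, or None if the reply has no 'Action:' line
    or names an action outside allowed_actions (a hypothetical validation step).
    """
    # Accept an optional list number such as "2." before the label
    match = re.search(
        r"^\s*(?:\d+\.\s*)?Action:\s*(.+)$",
        response,
        flags=re.IGNORECASE | re.MULTILINE,
    )
    if match is None:
        return None
    candidate = match.group(1).strip().strip(".").lower()
    # Reject anything outside the declared action space rather than guessing
    return candidate if candidate in allowed_actions else None


reply = """1. Reasoning: The door is directly ahead and unobstructed.
2. Action: move_forward"""
print(parse_action(reply, {"move_forward", "turn_left", "turn_right"}))
print(parse_action("I am not sure what to do.", {"move_forward"}))
```

Returning `None` for unparseable or out-of-space replies lets the caller re-prompt instead of executing a guessed action.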

Grounding Visual Feedback

Agents must ground language plans in visual reality. The model plans actions, but only the environment can confirm whether those actions succeeded.

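import torch.nn.functional as F


def ground_action_in_visual_context(agent, planned_action, current_frame, threshold=0.5):
    """Verify planned action is achievable given visual context.

    Assumes agent.vision.encode and agent.llm.encode return 1-D tensors of equal
    size in a shared embedding space (for example, a CLIP-style encoder pair).
    """
    # Encode current frame
    frame_emb = agent.vision.encode(current_frame)

    # Encode action description
    action_emb = agent.llm.encode(planned_action)

    # Cosine similarity is bounded in [-1, 1], unlike a raw dot product,
    # so a fixed threshold stays meaningful across embedding scales
    feasibility = F.cosine_similarity(frame_emb, action_emb, dim=0).item()

    # Threshold-based grounding check; calibrate the threshold on held-out data
    if feasibility < threshold:
        # Action may not be grounded in current visual context
        # Request re-planning with updated context
        return {"grounded": False, "reason": "visual mismatch", "confidence": feasibility}

    return {"grounded": True, "confidence": feasibility}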
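A grounding check is only useful if a failed check feeds back into planning. The sketch below shows one possible re-planning loop under stated assumptions: `plan_fn` and `ground_fn` are hypothetical callables standing in for the agent's planner and grounding check, and the toy planner, candidate list, and `visible` set exist only to make the example runnable.

```python
def plan_with_grounding(plan_fn, ground_fn, max_attempts=3):
    """Re-plan until a candidate action passes the grounding check.

    plan_fn(rejected) -> candidate action, given the list of rejected actions.
    ground_fn(action) -> dict with a boolean "grounded" key.
    Both callables are hypothetical stand-ins for the agent's planner and checker.
    """
    rejected = []
    for attempt in range(1, max_attempts + 1):
        action = plan_fn(rejected)
        result = ground_fn(action)
        if result["grounded"]:
            return {"action": action, "attempts": attempt}
        # Feed the failure back so the planner can avoid repeating it
        rejected.append(action)
    return {"action": None, "attempts": max_attempts, "rejected": rejected}


# Toy stand-ins: the planner proposes candidates in order, skipping rejected ones
candidates = ["open drawer", "pick up cup", "push door"]
visible = {"pick up cup"}  # actions the pretend visual check accepts


def toy_plan(rejected):
    return next(c for c in candidates if c not in rejected)


def toy_ground(action):
    return {"grounded": action in visible}


print(plan_with_grounding(toy_plan, toy_ground))
# → {'action': 'pick up cup', 'attempts': 2}
```

Capping the number of attempts keeps the agent from looping forever when nothing in the scene supports the goal; in that case the caller receives `action: None` and can escalate or stop.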

Failure Mode: Perception-Action Mismatch

Agents can perceive correctly but fail to act on that perception. The model "sees" an obstacle but plans a path through it anyway, because planning and perception operate independently.

import numpy as np


def diagnose_perception_action_gap(agent, test_episodes):
    """Measure correlation between perception accuracy and action consistency.

    compute_map and verify_action_consistency are assumed helpers (not shown):
    the first scores detections against ground truth, the second checks that
    the planned action respects the detected objects.
    """
    perception_scores = []
    action_successes = []

    for episode in test_episodes:
        # Measure perception accuracy
        pred_objects = agent.vision.detect_objects(episode["frame"])
        gt_objects = episode["ground_truth_objects"]

        perception_acc = compute_map(pred_objects, gt_objects)
        perception_scores.append(perception_acc)

        # Measure if planned action matches perception
        action = agent.plan(episode["goal"], pred_objects, [])
        action_matches_perception = verify_action_consistency(
            action, pred_objects
        )
        action_successes.append(float(action_matches_perception))

    # Correlation is undefined (NaN) when either series is constant
    if np.std(perception_scores) == 0 or np.std(action_successes) == 0:
        print("Correlation undefined: scores or outcomes do not vary")
        return None

    # Low correlation means perception and action are disconnected
    correlation = np.corrcoef(perception_scores, action_successes)[0, 1]
    print(f"Perception-Action Correlation: {correlation:.3f}")

    if correlation < 0.3:
        print("WARNING: Low correlation - agent may have perception-action gap")

    return correlation
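The diagnostic above relies on `verify_action_consistency`, which the chapter does not define. As a rough illustration, here is one way such a check could look for grid-style moves. The data shapes are assumptions made for this sketch: an action is a dict with a `target_cell`, and each detection is a dict with a `label` and a `cell`.

```python
def verify_action_consistency(action, pred_objects):
    """Return True if the action's target cell is free of perceived obstacles.

    Hypothetical data shapes (not defined in this chapter):
      action       -> {"name": str, "target_cell": (row, col)}
      pred_objects -> [{"label": str, "cell": (row, col)}, ...]
    """
    blocked = {obj["cell"] for obj in pred_objects if obj["label"] == "obstacle"}
    return action["target_cell"] not in blocked


detections = [
    {"label": "obstacle", "cell": (0, 1)},
    {"label": "goal", "cell": (2, 2)},
]
# Moving into the perceived obstacle contradicts the agent's own perception
print(verify_action_consistency({"name": "move_right", "target_cell": (0, 1)}, detections))
# → False
print(verify_action_consistency({"name": "move_down", "target_cell": (1, 0)}, detections))
# → True
```

The check compares the plan against what the agent perceived, not against ground truth, so it isolates the gap between the two modules from plain perception errors.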
EXERCISE

Implement a simple grid-world vision agent. Given an image of the grid, it must plan a path to reach a goal location. Test the agent with varying obstacle configurations. Measure success rate as obstacle density increases.
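One possible starting point for the evaluation side of this exercise is sketched below. It assumes perfect perception: the grid is handed to the planner directly as a list of booleans, and a breadth-first search stands in for the agent's planner. The function names, the 8×8 grid size, and the fixed start and goal corners are choices made for this sketch. Replacing the grid with one produced by a vision model from an image is the part left to implement.

```python
import random
from collections import deque


def make_grid(size, density, rng):
    """Random grid: True marks an obstacle. Start and goal corners stay free."""
    grid = [[rng.random() < density for _ in range(size)] for _ in range(size)]
    grid[0][0] = False
    grid[size - 1][size - 1] = False
    return grid


def bfs_path(grid, start, goal):
    """Shortest 4-connected path as a list of cells, or None if unreachable."""
    size = len(grid)
    parents = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk parent links back to the start, then reverse
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < size and 0 <= nc < size
            if inside and not grid[nr][nc] and (nr, nc) not in parents:
                parents[(nr, nc)] = cell
                queue.append((nr, nc))
    return None


def success_rate(density, trials=200, size=8, seed=0):
    """Fraction of random grids where the planner reaches the goal."""
    rng = random.Random(seed)
    goal = (size - 1, size - 1)
    wins = sum(
        bfs_path(make_grid(size, density, rng), (0, 0), goal) is not None
        for _ in range(trials)
    )
    return wins / trials


for density in (0.0, 0.1, 0.2, 0.3, 0.4):
    print(f"density={density:.1f}  success={success_rate(density):.2f}")
```

Because the search is exhaustive, this baseline fails only when no path exists, which gives an upper bound to compare a vision-driven agent against at each obstacle density.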
