Vision Agents — Advanced Multi-Modal Systems (Chapter 10)

Vision agents use visual input to plan and execute actions in an environment. They perceive, reason, and act, closing the loop between perception and control.

Agent Architecture

A vision agent consists of three modules: visual perception, reasoning and planning, and action execution.

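import numpy as np
from PIL import Image


class VisionAgent:
    def __init__(self, vision_model, llm, action_space):
        self.vision = vision_model
        self.llm = llm
        self.action_space = action_space
        self.max_steps = 20
        # Initialized here so step() also works when called outside run()
        self.history = []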
    
    def perceive(self, observation):
        """Convert observation to structured representation."""
        if isinstance(observation, np.ndarray):
            # Image observation
            frame = Image.fromarray(observation)
            visual_description = self.vision.describe(frame)
        else:
            visual_description = str(observation)
        
        return visual_description
    
    def plan(self, goal, state_description, history):
        """Generate plan given goal and current state."""
        prompt = f"""
You are a robot agent. You have seen the following in the environment:
Current observation: {state_description}

Goal: {goal}

Previous actions taken: {history}

What should you do next? Respond with:
1. Reasoning: Why this action
2. Action: The next action to take

Available actions: {self.action_space.describe()}
"""
        
        response = self.llm.generate(prompt)
        return self._parse_action(response)
    
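    def step(self, observation, goal):
        """Execute one step of the perceive-plan-act loop."""
        # Perceive
        state = self.perceive(observation)

        # Plan, using the actions taken so far as context
        action = self.plan(goal, state, self.history)

        # Record the chosen action: the prompt presents history as
        # "Previous actions taken", so it must hold actions, not observations
        self.history.append(action)

        # Execute (assumes the parsed action is bound to the environment
        # and that execute() returns the next observation)
        next_obs = action.execute()

        return next_obs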
    
    def run(self, env, goal, max_steps=None):
        """Run the agent until the goal is reached or the step budget runs out."""
        # Compare against None so an explicit max_steps=0 is respected
        if max_steps is None:
            max_steps = self.max_steps
        obs = env.reset()
        self.history = []

        # step_idx avoids confusion with the step() method
        for step_idx in range(max_steps):
            obs = self.step(obs, goal)

            if self._check_goal(obs, goal):
                return {"success": True, "steps": step_idx + 1}

        return {"success": False, "steps": max_steps}
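
The plan() method relies on a _parse_action helper that the listing does not show. The following is a minimal sketch of one way to parse the "Reasoning / Action" reply format requested in the prompt. The names ParsedAction, parse_action, and valid_actions are hypothetical, and the sketch assumes actions are identified by plain string names; a real action space may need richer parsing (arguments, coordinates, and so on).

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ParsedAction:
    reasoning: str
    action: Optional[str]  # None when no valid action name was found


def parse_action(response: str, valid_actions: Sequence[str]) -> ParsedAction:
    """Extract the 'Reasoning:' and 'Action:' fields from an LLM reply."""
    # Reasoning runs until the (optionally numbered) "Action:" line or end of text
    reasoning_match = re.search(
        r"Reasoning:\s*(.+?)(?=\n\s*(?:\d+\.\s*)?Action:|\Z)",
        response,
        re.S | re.I,
    )
    # The action is whatever follows "Action:" on the same line
    action_match = re.search(r"Action:\s*(.+)", response, re.I)

    reasoning = reasoning_match.group(1).strip() if reasoning_match else ""
    if action_match is None:
        return ParsedAction(reasoning, None)

    # Normalize, then accept only names from the known action set
    candidate = action_match.group(1).strip().strip(".").lower()
    for name in valid_actions:
        if candidate == name.lower():
            return ParsedAction(reasoning, name)
    return ParsedAction(reasoning, None)
```

Returning None for an unrecognized action, rather than guessing the closest match, lets the caller decide whether to re-prompt the model or stop the episode.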

Grounding Visual Feedback

Agents must ground language plans in visual reality. The model plans actions, but only the environment can confirm whether those actions succeeded.

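import torch
import torch.nn.functional as F


def ground_action_in_visual_context(agent, planned_action, current_frame):
    """Check whether the planned action is plausible given the visual context."""

    # Encode current frame
    frame_emb = agent.vision.encode(current_frame)

    # Encode action description. This assumes both encoders project into a
    # shared embedding space (as in CLIP-style models); otherwise the score
    # below is not meaningful.
    action_emb = agent.llm.encode(planned_action)

    # Compute feasibility score. Cosine similarity is bounded in [-1, 1],
    # so a fixed threshold makes sense; a raw dot product scales with the
    # embedding norms.
    feasibility = F.cosine_similarity(frame_emb, action_emb, dim=-1).item()

    # Threshold-based grounding check (0.5 is a placeholder; calibrate it
    # on held-out data for the encoders in use)
    if feasibility < 0.5:
        # Action may not be grounded in current visual context
        # Request re-planning with updated context
        return {"grounded": False, "reason": "visual mismatch"}

    return {"grounded": True, "confidence": feasibility}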
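
The grounding check reports a mismatch but leaves the re-planning step implicit. One possible shape for that loop is sketched below. The function plan_with_grounding and its parameters are hypothetical: propose stands in for a call to the planner, and is_grounded stands in for a wrapper around a check like the one above.

```python
from typing import Callable, Optional


def plan_with_grounding(
    propose: Callable[[str], str],
    is_grounded: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Propose actions until one passes the grounding check.

    `propose` receives feedback text listing rejected actions and returns a
    new candidate; `is_grounded` returns True when the candidate fits the
    current frame.
    """
    feedback = ""
    for _ in range(max_attempts):
        action = propose(feedback)
        if is_grounded(action):
            return action
        # Tell the planner what was rejected so it can avoid repeating it
        feedback += f"Rejected (visual mismatch): {action}\n"
    return None  # caller chooses the fallback, e.g. stop or ask for help


# Toy usage with stand-in callables instead of real models
proposals = iter(["walk through wall", "open door"])
chosen = plan_with_grounding(
    propose=lambda fb: next(proposals),
    is_grounded=lambda a: a == "open door",
)
```

Capping the number of attempts keeps a planner that never produces a grounded action from looping forever.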

Failure Mode: Perception-Action Mismatch

An agent can perceive a scene correctly yet fail to act on what it perceived. The model "sees" an obstacle but plans a path through it anyway, because planning and perception operate independently: nothing forces the planner's output to respect the perception module's findings.

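import numpy as np


def diagnose_perception_action_gap(agent, test_episodes):
    """Measure correlation between perception accuracy and action consistency.

    compute_map and verify_action_consistency are assumed to be
    project-specific helpers defined elsewhere.
    """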
    
    perception_scores = []
    action_successes = []
    
    for episode in test_episodes:
        # Measure perception accuracy
        pred_objects = agent.vision.detect_objects(episode["frame"])
        gt_objects = episode["ground_truth_objects"]
        
        perception_acc = compute_map(pred_objects, gt_objects)
        perception_scores.append(perception_acc)
        
        # Measure if planned action matches perception
        action = agent.plan(episode["goal"], pred_objects, [])
        action_matches_perception = verify_action_consistency(
            action, pred_objects
        )
        action_successes.append(action_matches_perception)
    
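    # Pearson correlation is undefined when either series is constant
    # (for example, every action was consistent), so guard against NaN
    if (len(perception_scores) < 2
            or np.std(perception_scores) == 0
            or np.std(action_successes) == 0):
        print("Correlation undefined: too few episodes or constant values")
        return float("nan")

    # Low correlation means perception and action are disconnected
    outcomes = np.asarray(action_successes, dtype=float)
    correlation = np.corrcoef(perception_scores, outcomes)[0, 1]
    print(f"Perception-Action Correlation: {correlation:.3f}")

    if correlation < 0.3:
        print("WARNING: Low correlation - agent may have perception-action gap")

    return correlation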
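
A correlation between a continuous score and a binary outcome can be hard to interpret on its own. As a complementary check (a sketch, not part of the diagnostic above), one can look only at episodes where perception was accurate and measure how often the planned action still contradicted it. The function name and the min_score cutoff of 0.8 are hypothetical choices for illustration.

```python
import numpy as np


def inconsistency_given_good_perception(perception_scores, action_consistent,
                                        min_score=0.8):
    """Fraction of well-perceived episodes whose action contradicts perception.

    Returns NaN when no episode reaches `min_score`.
    """
    scores = np.asarray(perception_scores, dtype=float)
    consistent = np.asarray(action_consistent, dtype=bool)
    if scores.shape != consistent.shape:
        raise ValueError("inputs must have the same length")

    # Keep only episodes where perception was accurate enough
    good = scores >= min_score
    if not good.any():
        return float("nan")

    # Among those, count the share of inconsistent actions
    return float((~consistent[good]).mean())
```

A high value here points at the failure mode described in this section: the agent saw the scene correctly and planned against it anyway.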