10. Vision Agents
Vision agents use visual inputs to plan and execute actions in environments. These systems perceive, reason, and act—closing the loop between perception and control.
Agent Architecture
A vision agent consists of three modules: visual perception, reasoning/planning, and action execution.
import numpy as np
from PIL import Image


class VisionAgent:
    def __init__(self, vision_model, llm, action_space):
        self.vision = vision_model
        self.llm = llm
        self.action_space = action_space
        self.max_steps = 20
        self.history = []  # actions taken so far in the current episode

    def perceive(self, observation):
        """Convert observation to structured representation."""
        if isinstance(observation, np.ndarray):
            # Image observation
            frame = Image.fromarray(observation)
            visual_description = self.vision.describe(frame)
        else:
            visual_description = str(observation)
        return visual_description

    def plan(self, goal, state_description, history):
        """Generate plan given goal and current state."""
        prompt = f"""
        You are a robot agent. You have seen the following in the environment:
        Current observation: {state_description}
        Goal: {goal}
        Previous actions taken: {history}
        What should you do next? Respond with:
        1. Reasoning: Why this action
        2. Action: The next action to take
        Available actions: {self.action_space.describe()}
        """
        response = self.llm.generate(prompt)
        return self._parse_action(response)

    def step(self, observation, goal):
        """Execute one step of perceive-plan-act loop."""
        # Perceive
        state = self.perceive(observation)
        # Plan, using the actions taken so far
        action = self.plan(goal, state, self.history)
        # Record the action so the next prompt's history is accurate
        self.history.append(action)
        # Execute (assumed to return the next observation)
        next_obs = action.execute()
        return next_obs

    def run(self, env, goal, max_steps=None):
        """Run agent in environment until goal or max steps."""
        max_steps = max_steps or self.max_steps
        # Assumes env.reset() returns only the observation
        obs = env.reset()
        self.history = []
        for step_idx in range(max_steps):
            obs = self.step(obs, goal)
            if self._check_goal(obs, goal):
                return {"success": True, "steps": step_idx + 1}
        return {"success": False, "steps": max_steps}

    def _parse_action(self, response):
        """Map LLM text onto an executable action (task-specific)."""
        raise NotImplementedError

    def _check_goal(self, observation, goal):
        """Return True once the observation satisfies the goal (task-specific)."""
        raise NotImplementedError
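The `plan` method calls `_parse_action`, which this section does not define. One possible implementation is sketched below. It assumes the LLM follows the "Reasoning / Action" response format requested in the prompt and that the action space is a flat set of action names. The names `parse_action` and `VALID_ACTIONS`, and the fallback to `"stop"`, are illustrative assumptions rather than part of the agent class.

```python
import re

# Hypothetical discrete action space, used only for illustration.
VALID_ACTIONS = ("move_forward", "turn_left", "turn_right", "pick_up", "stop")

# Matches lines such as "2. Action: turn_left" or "action: stop".
_ACTION_RE = re.compile(
    r"^\s*(?:\d+\.\s*)?action\s*:\s*([A-Za-z_]+)",
    re.IGNORECASE | re.MULTILINE,
)


def parse_action(response, valid_actions=VALID_ACTIONS, fallback="stop"):
    """Extract the action name from an LLM response.

    Returns the fallback when no 'Action:' line is found or when the
    named action is not in the action space.
    """
    match = _ACTION_RE.search(response)
    if match is None:
        return fallback
    name = match.group(1).lower()
    return name if name in valid_actions else fallback


response = """1. Reasoning: The door is to the left of the robot.
2. Action: turn_left"""
print(parse_action(response))
print(parse_action("I am not sure what to do."))
```

Falling back to a safe action keeps a malformed response from crashing the loop. Whether `"stop"` is actually safe depends on the environment.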
Grounding Visual Feedback
Agents must ground language plans in visual reality. The model plans actions, but only the environment can confirm whether those actions succeeded.
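import torch
import torch.nn.functional as F


def ground_action_in_visual_context(agent, planned_action, current_frame, threshold=0.5):
    """Verify planned action is achievable given visual context."""
    # Encode current frame
    frame_emb = agent.vision.encode(current_frame)
    # Encode action description
    action_emb = agent.llm.encode(planned_action)
    # Compute feasibility as cosine similarity so the score is bounded in [-1, 1].
    # Assumes both encoders return tensors in a shared embedding space
    # (for example, a CLIP-style image/text pair); a raw dot product is unbounded.
    feasibility = F.cosine_similarity(frame_emb, action_emb, dim=-1).item()
    # Threshold-based grounding check (the threshold should be calibrated per model)
    if feasibility < threshold:
        # Action may not be grounded in current visual context
        # Request re-planning with updated context
        return {"grounded": False, "reason": "visual mismatch"}
    return {"grounded": True, "confidence": feasibility}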
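When the grounding check fails, the agent needs some way to re-plan. The following is a minimal sketch of such a loop. It uses a plain NumPy cosine similarity and toy embeddings in place of real encoders. The callables `propose` and `embed_action`, and the function `plan_with_grounding`, are hypothetical names introduced for illustration.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors; 0.0 if either is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0


def plan_with_grounding(propose, embed_action, frame_emb, threshold=0.5, max_attempts=3):
    """Ask for candidate actions until one passes the grounding check.

    propose(rejected) -> str or None : hypothetical planner; sees rejected actions
    embed_action(action) -> ndarray  : hypothetical text encoder in the frame's space
    """
    rejected = []
    for _ in range(max_attempts):
        action = propose(rejected)
        if action is None:  # planner has nothing left to suggest
            break
        score = cosine_similarity(frame_emb, embed_action(action))
        if score >= threshold:
            return {"action": action, "grounded": True, "confidence": score}
        rejected.append(action)
    return {"action": None, "grounded": False, "rejected": rejected}


# Toy embeddings: the frame shows a door, so "open door" aligns with it.
toy_embeddings = {
    "pick up cup": np.array([0.0, 1.0, 0.0]),
    "open door": np.array([0.9, 0.1, 0.0]),
}
frame_emb = np.array([1.0, 0.0, 0.0])
candidates = ["pick up cup", "open door"]


def propose(rejected):
    # Return the first candidate that has not been rejected yet, if any.
    return next((c for c in candidates if c not in rejected), None)


result = plan_with_grounding(propose, toy_embeddings.__getitem__, frame_emb)
print(result["action"], round(result["confidence"], 3))
```

Passing the rejected actions back to the planner is what lets a real LLM-based `propose` avoid repeating a plan that the visual context has already ruled out.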
Failure Mode: Perception-Action Mismatch
Agents can perceive correctly but fail to act on that perception. The model "sees" an obstacle but plans a path through it anyway, because planning and perception operate independently.
import numpy as np


def diagnose_perception_action_gap(agent, test_episodes):
    """Measure correlation between perception accuracy and action success."""
    # compute_map and verify_action_consistency are assumed to be defined elsewhere
    perception_scores = []
    action_successes = []
    for episode in test_episodes:
        # Measure perception accuracy
        pred_objects = agent.vision.detect_objects(episode["frame"])
        gt_objects = episode["ground_truth_objects"]
        perception_acc = compute_map(pred_objects, gt_objects)
        perception_scores.append(perception_acc)
        # Measure if planned action matches perception
        action = agent.plan(episode["goal"], pred_objects, [])
        action_matches_perception = verify_action_consistency(
            action, pred_objects
        )
        action_successes.append(float(action_matches_perception))
    # Pearson correlation is undefined for fewer than two points or a constant series
    if (len(perception_scores) < 2
            or np.std(perception_scores) == 0
            or np.std(action_successes) == 0):
        print("Perception-Action Correlation: undefined (no variation in the data)")
        return float("nan")
    # Low correlation means perception and action are disconnected
    correlation = np.corrcoef(perception_scores, action_successes)[0, 1]
    print(f"Perception-Action Correlation: {correlation:.3f}")
    if correlation < 0.3:
        print("WARNING: Low correlation - agent may have perception-action gap")
    return correlation
Exercise
Implement a simple grid-world vision agent. Given an image of the grid, it must plan a path to reach a goal location. Test the agent with varying obstacle configurations. Measure success rate as obstacle density increases.
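As a possible starting point, the sketch below generates random grids at a given obstacle density and uses breadth-first search as an oracle planner that reads the grid directly rather than an image. The oracle's success rate is an upper bound against which a vision-based agent can be compared. The function names, the grid encoding (1 marks an obstacle), and the choice of opposite corners for start and goal are all assumptions.

```python
from collections import deque

import numpy as np


def make_grid(size, density, rng):
    """Random grid where 1 marks an obstacle; start and goal corners stay free."""
    grid = (rng.random((size, size)) < density).astype(np.uint8)
    grid[0, 0] = 0
    grid[-1, -1] = 0
    return grid


def bfs_path_length(grid, start, goal):
    """Shortest path length in 4-connected moves, or None if the goal is unreachable."""
    rows, cols = grid.shape
    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr, nc] == 0 and (nr, nc) not in visited):
                visited.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None


def oracle_success_rate(size, density, trials, seed=0):
    """Fraction of random grids in which the goal is reachable at all."""
    rng = np.random.default_rng(seed)
    goal = (size - 1, size - 1)
    solved = sum(
        bfs_path_length(make_grid(size, density, rng), (0, 0), goal) is not None
        for _ in range(trials)
    )
    return solved / trials


for density in (0.0, 0.2, 0.4, 0.6):
    rate = oracle_success_rate(8, density, trials=200)
    print(f"density={density:.1f}  oracle success={rate:.2f}")
```

Comparing the agent's success rate against the oracle at each density separates failures caused by unsolvable grids from failures caused by perception or planning.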