Agent Observability — Multi-Agent Systems (Chapter 13)

Observability in multi-agent systems extends far beyond simple logging. An observable agent exposes its internal state transitions, decision reasoning, and tool invocation patterns in a structured, queryable format. This chapter examines how to instrument agents for full visibility without degrading performance.

Structured State Telemetry

Agents emit state changes as discrete events with consistent schemas. Each state transition includes timestamp, previous state, new state, reason, and metadata. This event stream feeds into the observability pipeline where downstream systems can reconstruct agent behavior timelines.

# agent/telemetry.py
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Any
import json

class AgentState(Enum):
    IDLE = "idle"
    REASONING = "reasoning"
    TOOL_CALL = "tool_call"
    AWAITING_RESPONSE = "awaiting_response"
    COMPLETED = "completed"
    ERROR = "error"

@dataclass
class StateEvent:
    timestamp: datetime
    agent_id: str
    state: AgentState
    previous_state: AgentState | None
    reason: str
    metadata: dict[str, Any]
    
    def to_json(self) -> str:
        return json.dumps({
            "timestamp": self.timestamp.isoformat(),
            "agent_id": self.agent_id,
            "state": self.state.value,
            "previous_state": self.previous_state.value if self.previous_state else None,
            "reason": self.reason,
            "metadata": self.metadata
        })

class AgentTelemetry:
    def __init__(self, agent_id: str, emitter):
        self.agent_id = agent_id
        self.emitter = emitter
        self._current_state = AgentState.IDLE
    
    def emit_state_change(self, new_state: AgentState, reason: str, metadata: dict = None):
        event = StateEvent(
            timestamp=datetime.utcnow(),
            agent_id=self.agent_id,
            state=new_state,
            previous_state=self._current_state,
            reason=reason,
            metadata=metadata or {}
        )
        self.emitter.emit(event)
        self._current_state = new_state

Context Injection Pattern

Telemetry works best when injected into the agent's context window. This allows the observing system to correlate agent reasoning with external events without cross-referencing separate logs.

Failure Detection

Observable agents enable real-time alerting on state anomalies. An agent stuck in AWAITING_RESPONSE for more than the configured threshold triggers automatic investigation workflows.

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.