13. Agent Observability
Observability in multi-agent systems extends far beyond simple logging. An observable agent exposes its internal state transitions, decision reasoning, and tool invocation patterns in a structured, queryable format. This chapter examines how to instrument agents for full visibility without degrading performance.
Structured State Telemetry
Agents emit state changes as discrete events that share one schema. Each state transition records a timestamp, the agent ID, the previous state, the new state, a human-readable reason, and free-form metadata. This event stream feeds the observability pipeline, where downstream systems can reconstruct a timeline of agent behavior.
# agent/telemetry.py
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Any
import json


class AgentState(Enum):
    IDLE = "idle"
    REASONING = "reasoning"
    TOOL_CALL = "tool_call"
    AWAITING_RESPONSE = "awaiting_response"
    COMPLETED = "completed"
    ERROR = "error"


@dataclass
class StateEvent:
    timestamp: datetime
    agent_id: str
    state: AgentState
    previous_state: AgentState | None
    reason: str
    metadata: dict[str, Any]

    def to_json(self) -> str:
        # Enum members and datetimes are not JSON-serializable, so convert them first.
        return json.dumps({
            "timestamp": self.timestamp.isoformat(),
            "agent_id": self.agent_id,
            "state": self.state.value,
            "previous_state": self.previous_state.value if self.previous_state else None,
            "reason": self.reason,
            "metadata": self.metadata,
        })


class AgentTelemetry:
    def __init__(self, agent_id: str, emitter):
        self.agent_id = agent_id
        self.emitter = emitter  # any object that provides an emit(event) method
        self._current_state = AgentState.IDLE

    def emit_state_change(
        self,
        new_state: AgentState,
        reason: str,
        metadata: dict[str, Any] | None = None,
    ):
        event = StateEvent(
            # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12.
            timestamp=datetime.now(timezone.utc),
            agent_id=self.agent_id,
            state=new_state,
            previous_state=self._current_state,
            reason=reason,
            metadata=metadata or {},
        )
        self.emitter.emit(event)
        # Update the tracked state only after the event has been handed to the emitter.
        self._current_state = new_state
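The listing above leaves the emitter unspecified beyond its emit(event) method. As a minimal sketch of what could sit behind it, the following assumes a hypothetical JsonLinesEmitter that writes one JSON document per line to a text stream. The class name, the JsonEvent protocol, and the _DemoEvent stand-in are illustrative assumptions, not part of the chapter's code.

```python
import io
import json
from typing import Protocol, TextIO


class JsonEvent(Protocol):
    """Anything with a to_json() method, such as StateEvent."""

    def to_json(self) -> str: ...


class JsonLinesEmitter:
    """Hypothetical emitter: appends one JSON document per line to a stream."""

    def __init__(self, stream: TextIO):
        self.stream = stream

    def emit(self, event: JsonEvent) -> None:
        self.stream.write(event.to_json() + "\n")
        self.stream.flush()


class _DemoEvent:
    """Stand-in for StateEvent so this sketch runs on its own."""

    def to_json(self) -> str:
        return json.dumps({"agent_id": "agent-1", "state": "reasoning"})


# An in-memory stream keeps the example self-contained; a file or socket
# wrapper would be used the same way.
buffer = io.StringIO()
emitter = JsonLinesEmitter(buffer)
emitter.emit(_DemoEvent())
emitter.emit(_DemoEvent())

# A downstream consumer can rebuild a timeline by parsing one line at a time.
timeline = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

A line-per-event format is only one option. It is chosen here because a consumer can parse the stream incrementally without buffering the whole log.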
Context Injection Pattern
Telemetry is most useful when a compact summary of recent events is also injected into the agent's context window. The agent's reasoning and the events it reacted to then appear in the same record, so an observing system can correlate the two without cross-referencing separate logs.
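One way to picture context injection is a small buffer that keeps the last few events and renders them as a delimited text block ahead of the task prompt. The sketch below is an assumption about how this could look: ContextTelemetryBuffer, build_prompt, and the [telemetry] delimiters are hypothetical names, and events are plain dictionaries to keep the example self-contained.

```python
from collections import deque


class ContextTelemetryBuffer:
    """Hypothetical helper: keeps the most recent events and renders them for a prompt."""

    def __init__(self, max_events: int = 5):
        # A bounded deque drops the oldest event automatically when full.
        self._events: deque[dict] = deque(maxlen=max_events)

    def record(self, event: dict) -> None:
        self._events.append(event)

    def render(self) -> str:
        lines = ["[telemetry]"]
        for e in self._events:
            lines.append(
                f"{e['timestamp']} {e['previous_state']} -> {e['state']}: {e['reason']}"
            )
        lines.append("[/telemetry]")
        return "\n".join(lines)


def build_prompt(task: str, buffer: ContextTelemetryBuffer) -> str:
    """Prefix the task with the rendered telemetry block."""
    return f"{buffer.render()}\n\n{task}"


buffer = ContextTelemetryBuffer(max_events=2)
buffer.record({"timestamp": "12:00:00", "previous_state": "idle",
               "state": "reasoning", "reason": "task received"})
buffer.record({"timestamp": "12:00:02", "previous_state": "reasoning",
               "state": "tool_call", "reason": "needs search"})
buffer.record({"timestamp": "12:00:03", "previous_state": "tool_call",
               "state": "awaiting_response", "reason": "waiting on search API"})

prompt = build_prompt("Summarize the search results.", buffer)
```

Bounding the buffer matters because every injected event consumes context tokens. The limit of two events here is arbitrary and would need tuning per agent.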
Failure Detection
Observable agents enable real-time alerting on state anomalies. An agent stuck in AWAITING_RESPONSE for more than the configured threshold triggers automatic investigation workflows.
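The stuck-agent check described above can be sketched as a small detector that remembers when each agent entered the watched state and reports those that have stayed longer than the threshold. StuckStateDetector and its method names are hypothetical, and the 30-second threshold is an arbitrary value for illustration. What happens after detection, such as starting an investigation workflow, is left out.

```python
from datetime import datetime, timedelta, timezone


class StuckStateDetector:
    """Hypothetical detector for agents that linger in one state too long."""

    def __init__(self, threshold: timedelta, watched_state: str = "awaiting_response"):
        self.threshold = threshold
        self.watched_state = watched_state
        self._entered: dict[str, datetime] = {}

    def observe(self, agent_id: str, state: str, timestamp: datetime) -> None:
        if state == self.watched_state:
            # Keep the first entry time if the agent re-reports the same state.
            self._entered.setdefault(agent_id, timestamp)
        else:
            # Any other state means the agent has moved on.
            self._entered.pop(agent_id, None)

    def stuck_agents(self, now: datetime) -> list[str]:
        """Return the IDs of agents that exceeded the threshold, sorted by ID."""
        return sorted(
            agent_id
            for agent_id, entered_at in self._entered.items()
            if now - entered_at > self.threshold
        )


detector = StuckStateDetector(threshold=timedelta(seconds=30))
t0 = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

detector.observe("agent-a", "awaiting_response", t0)
detector.observe("agent-b", "awaiting_response", t0)
detector.observe("agent-b", "completed", t0 + timedelta(seconds=5))

stuck = detector.stuck_agents(now=t0 + timedelta(seconds=45))
```

Passing the current time in explicitly, rather than reading the clock inside the detector, keeps the check deterministic and easy to test against recorded event streams.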
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Implement a telemetry decorator that wraps agent methods and automatically emits state events for any method execution exceeding 100ms. Record execution time, method name, and arguments in the metadata field.
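As a possible starting point for this exercise, the sketch below shows one shape such a decorator could take. It assumes the decorated object exposes a telemetry attribute with an emit_state_change(new_state, reason, metadata) method. RecordingTelemetry, its public current_state attribute, emit_if_slow, and DemoAgent are hypothetical names, and re-emitting the current state for a slow call is a design assumption rather than something the chapter prescribes.

```python
import functools
import time
from typing import Any, Callable


class RecordingTelemetry:
    """Stand-in with the same emit_state_change signature as AgentTelemetry."""

    def __init__(self):
        self.current_state = "idle"  # hypothetical public attribute
        self.events: list[dict[str, Any]] = []

    def emit_state_change(self, new_state, reason, metadata=None):
        self.events.append(
            {"state": new_state, "reason": reason, "metadata": metadata or {}}
        )
        self.current_state = new_state


def emit_if_slow(threshold_ms: float = 100.0) -> Callable:
    """Decorator factory: emit an event when a method call exceeds threshold_ms."""

    def decorator(method):
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            start = time.perf_counter()
            try:
                return method(self, *args, **kwargs)
            finally:
                # Runs on both normal return and exception.
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                if elapsed_ms > threshold_ms:
                    self.telemetry.emit_state_change(
                        self.telemetry.current_state,
                        reason=f"slow call: {method.__name__}",
                        metadata={
                            "method": method.__name__,
                            "elapsed_ms": round(elapsed_ms, 3),
                            # repr() keeps arbitrary arguments JSON-serializable.
                            "args": [repr(a) for a in args],
                            "kwargs": {k: repr(v) for k, v in kwargs.items()},
                        },
                    )

        return wrapper

    return decorator


class DemoAgent:
    def __init__(self):
        self.telemetry = RecordingTelemetry()

    @emit_if_slow(threshold_ms=1.0)  # low threshold so the demo triggers quickly
    def slow_lookup(self, query: str) -> str:
        time.sleep(0.02)
        return query.upper()

    @emit_if_slow(threshold_ms=10_000.0)
    def fast_lookup(self, query: str) -> str:
        return query.lower()


agent = DemoAgent()
result = agent.slow_lookup("status")
agent.fast_lookup("STATUS")
```

Recording arguments verbatim can leak secrets or large payloads into telemetry, so a fuller solution would likely truncate or redact them before emitting.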