14. Agent Tracing
Distributed tracing provides end-to-end visibility across agent interactions. Unlike simple logging, a trace records causality: each span links to the span that triggered it, so you can follow how one agent's output led to another agent's call.
Trace Context Propagation
When an agent spawns a sub-agent or calls another agent, trace context must propagate. This creates a tree structure where each span represents an agent operation and edges represent causal relationships.
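# tracing/context.py
from contextvars import ContextVar
from dataclasses import dataclass, field
from typing import Optional
import uuid

# Ambient trace state for the current execution context.
# An empty string means "no active trace or span".
# Note: nothing in this listing sets these variables; the caller must set
# them for TraceContext.new() to find a parent.
trace_id: ContextVar[str] = ContextVar('trace_id', default="")
span_id: ContextVar[str] = ContextVar('span_id', default="")
parent_span_id: ContextVar[str] = ContextVar('parent_span_id', default="")


@dataclass
class TraceContext:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    tags: dict = field(default_factory=dict)

    @classmethod
    def new(cls) -> 'TraceContext':
        # Join the ambient trace if one is active; otherwise start a new trace.
        tid = trace_id.get() or str(uuid.uuid4())
        current_span = span_id.get()
        return cls(
            trace_id=tid,
            span_id=str(uuid.uuid4()),
            # The currently active span, if any, becomes the parent.
            parent_span_id=current_span or None
        )

    def inject(self) -> dict:
        # Serialize the context into a carrier (headers or message metadata)
        # that travels with the call to another agent.
        return {
            "trace_id": self.trace_id,
            "span_id": self.span_id,
            "parent_span_id": self.parent_span_id
        }

    @classmethod
    def extract(cls, headers: dict) -> 'TraceContext':
        # Continue the caller's trace: reuse its trace_id and record the
        # caller's span as the parent of a fresh local span.
        return cls(
            # A missing or empty trace_id starts a new trace.
            trace_id=headers.get("trace_id") or str(uuid.uuid4()),
            span_id=str(uuid.uuid4()),
            parent_span_id=headers.get("span_id")
        )


class TracedAgent:
    def __init__(self, agent, exporter):
        self.agent = agent
        self.exporter = exporter

    def invoke(self, input_data: dict, context: TraceContext) -> dict:
        # exporter.start_span is expected to return a context manager
        # that yields a span object.
        with self.exporter.start_span(context) as span:
            try:
                result = self.agent.invoke(input_data)
                span.set_attribute("status", "ok")
                return result
            except Exception as e:
                span.set_attribute("status", "error")
                span.set_attribute("error.message", str(e))
                raise
            finally:
                # Explicit end. If the exporter's context manager already
                # ends the span on exit, this call is redundant.
                span.end()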
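The listing above reads the ambient trace and span from ContextVars but never sets them, so something must activate a span before nested calls can find their parent. The following is a minimal sketch of one way to do that within a single process. The names `current_trace_id`, `current_span_id`, `SpanRecord`, `FINISHED`, `active_span`, and `run_pipeline` are hypothetical and exist only for illustration; the `FINISHED` list stands in for a real exporter backend.

```python
from contextlib import contextmanager
from contextvars import ContextVar
from dataclasses import dataclass
from typing import Iterator, List, Optional
import uuid

# Hypothetical ambient state, mirroring the ContextVars in tracing/context.py.
current_trace_id: ContextVar[str] = ContextVar("current_trace_id", default="")
current_span_id: ContextVar[str] = ContextVar("current_span_id", default="")


@dataclass
class SpanRecord:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    name: str


# Stand-in for an exporter backend: finished spans are appended here.
FINISHED: List[SpanRecord] = []


@contextmanager
def active_span(name: str) -> Iterator[SpanRecord]:
    """Open a span as a child of whichever span is currently active."""
    record = SpanRecord(
        trace_id=current_trace_id.get() or str(uuid.uuid4()),
        span_id=str(uuid.uuid4()),
        parent_span_id=current_span_id.get() or None,
        name=name,
    )
    # set() returns a token; reset(token) restores the previous value,
    # so the parent span becomes active again when this span closes.
    trace_token = current_trace_id.set(record.trace_id)
    span_token = current_span_id.set(record.span_id)
    try:
        yield record
    finally:
        current_span_id.reset(span_token)
        current_trace_id.reset(trace_token)
        FINISHED.append(record)


def run_pipeline() -> None:
    """A planner agent that calls two sub-agents in sequence."""
    with active_span("planner"):
        with active_span("researcher"):
            pass  # sub-agent work would happen here
        with active_span("writer"):
            pass
```

Calling `run_pipeline()` leaves three records in `FINISHED` that share one trace ID, with the researcher and writer spans both pointing at the planner span as their parent. For calls that cross a process boundary, ContextVars do not travel with the request, which is why the listing above serializes the context with `inject` and rebuilds it with `extract`.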
Span Annotation
Traces become valuable when annotated with semantic information. Agents should annotate spans with tool invocations, token counts, and intermediate reasoning steps.
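As an illustration of these annotations, the sketch below records token counts as span attributes and tool invocations and reasoning steps as timestamped events. The `AnnotatedSpan` class, the helper functions, and the attribute keys such as `tokens.prompt` are assumptions made for this example, not part of any particular tracing library.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List
import time


@dataclass
class AnnotatedSpan:
    """Hypothetical span type: flat attributes plus timestamped events."""
    name: str
    attributes: Dict[str, Any] = field(default_factory=dict)
    events: List[Dict[str, Any]] = field(default_factory=list)

    def set_attribute(self, key: str, value: Any) -> None:
        self.attributes[key] = value

    def add_event(self, event_name: str, **fields: Any) -> None:
        self.events.append({"event": event_name, "time": time.time(), **fields})


def record_llm_call(span: AnnotatedSpan, prompt_tokens: int,
                    completion_tokens: int) -> None:
    """Accumulate token counts so several model calls in one span add up."""
    prompt = span.attributes.get("tokens.prompt", 0) + prompt_tokens
    completion = span.attributes.get("tokens.completion", 0) + completion_tokens
    span.set_attribute("tokens.prompt", prompt)
    span.set_attribute("tokens.completion", completion)
    span.set_attribute("tokens.total", prompt + completion)


def record_tool_call(span: AnnotatedSpan, tool: str,
                     duration_ms: float, ok: bool) -> None:
    """Record one tool invocation as an event on the span."""
    span.add_event("tool.call", tool=tool, duration_ms=duration_ms, ok=ok)


def record_reasoning_step(span: AnnotatedSpan, summary: str,
                          max_chars: int = 200) -> None:
    """Store a truncated summary to keep span payloads small."""
    span.add_event("reasoning.step", summary=summary[:max_chars])
```

Counters fit naturally as attributes because they describe the span as a whole, while tool calls and reasoning steps are modeled as events because their order and timing within the span matter.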
Trace Sampling
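High-volume systems require intelligent sampling. A common policy keeps the full trace whenever any span reports an error and retains only a fraction of successful traces to control storage costs. Because an error may occur late in a trace, this policy needs a tail-based decision, made after the trace's spans have been collected, rather than a decision made at the root span.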
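A sampling policy of this kind can be sketched as a single decision function applied once a trace is complete. The function name `keep_trace`, the span dictionary shape, and the default rate of 0.1 are assumptions for illustration. The baseline decision hashes the trace ID rather than calling a random number generator, so every collector that sees the same trace reaches the same verdict.

```python
import hashlib
from typing import Dict, List


def keep_trace(trace_id: str, spans: List[Dict[str, str]],
               baseline_rate: float = 0.1) -> bool:
    """Decide whether to store a completed trace.

    Assumes each span is a dict with a "status" key set to "ok" or "error",
    matching the status attribute set by TracedAgent.invoke.
    """
    # Always keep traces that contain at least one failed span.
    if any(span.get("status") == "error" for span in spans):
        return True

    # Map the trace ID to a stable 64-bit integer and keep the trace if it
    # falls below the threshold for the requested rate.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    threshold = int(baseline_rate * 2**64)
    return bucket < threshold
```

With `baseline_rate=0.1`, roughly one in ten successful traces is retained while every trace containing an error is kept. The cost of this approach is that all spans must be buffered until the trace finishes.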
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Implement a trace aggregator that reconstructs the full call graph from individual spans and generates a flame graph visualization showing time distribution across agent nodes.
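One possible starting point for this exercise is sketched below. It assumes each span is a dictionary with `span_id`, `parent_span_id`, `name`, `start_ms`, and `end_ms` keys, which are hypothetical field names chosen for the example. It rebuilds the parent-child tree and emits lines in the folded-stack text format (`frame;frame;frame value`) that flame graph tools such as flamegraph.pl accept as input. Rendering the image itself is left to such a tool.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Span = Dict[str, object]


def build_call_graph(spans: List[Span]) -> Tuple[List[Span], Dict[str, List[Span]]]:
    """Return (roots, children), where children maps span_id to child spans."""
    by_id = {s["span_id"]: s for s in spans}
    children: Dict[str, List[Span]] = defaultdict(list)
    roots: List[Span] = []
    for s in spans:
        parent = s.get("parent_span_id")
        if parent in by_id:
            children[parent].append(s)
        else:
            # No parent recorded, or the parent span was never received.
            roots.append(s)
    return roots, children


def folded_stacks(spans: List[Span]) -> List[str]:
    """Emit one 'a;b;c self_time' line per span, in depth-first order."""
    roots, children = build_call_graph(spans)
    lines: List[str] = []

    def walk(span: Span, path: List[str]) -> None:
        stack = path + [span["name"]]
        kids = sorted(children[span["span_id"]], key=lambda c: c["start_ms"])
        duration = span["end_ms"] - span["start_ms"]
        child_time = sum(c["end_ms"] - c["start_ms"] for c in kids)
        # Self time is the span's duration minus time spent in its children.
        # Assumes children run sequentially; clamp at zero if they overlap.
        self_time = max(duration - child_time, 0)
        lines.append(f"{';'.join(stack)} {self_time}")
        for child in kids:
            walk(child, stack)

    for root in sorted(roots, key=lambda r: r["start_ms"]):
        walk(root, [])
    return lines
```

For a planner span lasting 100 ms whose researcher and writer children take 50 ms and 30 ms, the planner's own line carries the remaining 20 ms. A fuller solution would also need to handle concurrent children, whose summed durations can exceed the parent's.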