What this does

Setting up agent observability with OpenTelemetry provides distributed tracing, metrics collection, and structured logging for AI agent workflows. The instrumentation captures each tool call, reasoning step, and external API request as spans in a trace, enabling root-cause analysis of slow or failing agent runs. The collected data flows to backends like Jaeger, Grafana, or an OTLP-compatible service for visualization and alerting.

Steps

Initialize the OpenTelemetry SDK at application startup. Create a tracing.py module that imports TracerProvider, BatchSpanProcessor, and the OTLP span exporter.
Set the service name and provider. Build resource = Resource(attributes={"service.name": "ai-agent"}), register it with trace.set_tracer_provider(TracerProvider(resource=resource)), and add a span processor that exports to http://localhost:4317, the default OTLP/gRPC port.
Create a tracer instance. Call tracer = trace.get_tracer(__name__) once per module and reuse it.
Instrument the agent's main loop. Wrap each step in a span with tracer.start_as_current_span("agent.step") as span, call span.set_attribute("step", step_count), and run result = agent.execute() inside the block.
Instrument each tool call. Inside the step, open a child span with tracer.start_as_current_span(f"tool.{tool_name}") as tool_span, call tool_span.set_attribute("args", str(args)), and run output = tool.run() inside the block.
Add error recording. Call span.record_exception(e) in except blocks so that handled errors still show up on the span; exceptions that escape a start_as_current_span block are recorded by default.
Enable metrics. Create a MeterProvider and register a counter for token usage and a histogram for step latency.
Correlate logs with traces. Install the opentelemetry-instrumentation-logging package and call LoggingInstrumentor().instrument() to inject trace context into log records.

Confirm the local starting state. Print the active binary, package version, model name, or configuration path before changing the workflow.
Run the smallest complete path. Execute the minimum command or script that proves the guide works end to end on the local machine.
Compare against expected output. Check the final line, status code, generated artifact, or model response against the verification section before expanding the setup.
Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
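One way to capture the starting state and the run evidence is a small script like the sketch below. It uses only the standard library; the package list and the command string are illustrative assumptions and should be replaced with whatever the local setup actually uses.

```python
import json
import platform
import sys
from importlib import metadata


def collect_run_evidence(packages, command):
    """Gather the interpreter path, versions, and the command into one dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "command": command,
        "python_executable": sys.executable,
        "python_version": platform.python_version(),
        "packages": versions,
    }


evidence = collect_run_evidence(
    # Assumed package names; adjust to the distributions installed locally.
    ["opentelemetry-api", "opentelemetry-sdk", "opentelemetry-exporter-otlp"],
    command="python agent.py --task smoke-test",  # hypothetical command
)
print(json.dumps(evidence, indent=2))
```

Saving the printed JSON next to the observed output gives a record that can be compared against later runs.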

Verification

Send a test task to the agent and open the observability backend UI. A trace should appear with a root span and a nested child span for each tool call. Inspect a tool call span and confirm that the args attribute, the span duration, and the status are present. Trigger an intentional error and check that it appears as an exception event on the relevant span. Run the agent 10 times and confirm that the step-latency histogram has recorded at least 10 samples with populated percentiles, and that the token counter has grown by the number of tokens those runs consumed. Verify that log lines in the console include trace_id and span_id fields.

Common failures

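OTLP exporter cannot reach backend - Verify the endpoint URL and port and check firewall rules. Port 4317 speaks gRPC, so a plain curl request against it is not a meaningful test; if the collector also exposes OTLP/HTTP, curl http://localhost:4318/v1/traces should return an HTTP response, and any status code shows that the port is reachable.
Spans not appearing - Ensure BatchSpanProcessor flushes before the process exits by calling trace.get_tracer_provider().shutdown().
High memory usage from span buffering - Reduce max_export_batch_size from 512 to 128 in the processor configuration, and consider lowering max_queue_size (default 2048), which bounds how many spans are buffered.
Missing traces for async agent loops - The current span follows async/await and asyncio tasks automatically, but not work handed to thread pools or callbacks. Capture the context with context.get_current() before the handoff and attach it in the worker.
Duplicate instrumentation - Check that only one TracerProvider is initialized. The SDK ignores a second trace.set_tracer_provider() call with a warning, so wrap initialization in an if not hasattr(sys.modules[__name__], '_otel_initialized') guard and set that flag after the first setup.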

Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.

Related guides

monitor-agent-token-usage-cost
deploy-ai-kubernetes-gpu-nodes
build-multi-agent-supervisor-workflow