HOW-TO · OPS

How to implement distributed tracing for multi-agent workflows with trace context propagation

advanced · 30 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

OpenTelemetry SDK, multi-agent system

What this does

This guide implements distributed tracing across multiple AI agent services that collaborate on a single workflow. When one agent delegates a subtask to another, the trace context (trace ID, span ID, and trace flags) propagates via HTTP headers or message queue metadata. The result is a single end-to-end trace showing every agent's contribution, including inter-agent latency and error attribution.
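To make those three fields concrete, here is a stdlib-only sketch of the W3C traceparent header that carries them between agents. The helper names make_traceparent and parse_traceparent are hypothetical, and the code handles only version 00 of the header. It is meant for reading raw headers while debugging. In the agent services themselves, OpenTelemetry's propagator builds and parses this header for you.

```python
import re
import secrets
from typing import Optional

# W3C Trace Context, version 00: "<version>-<trace-id>-<parent-id>-<trace-flags>", lowercase hex
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Build the header value a caller sends; span_id is the caller's current span."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(value: str) -> Optional[dict]:
    """Return trace ID, parent span ID, and sampled flag, or None for an invalid header."""
    match = TRACEPARENT_RE.match(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {"trace_id": trace_id, "parent_id": parent_id, "sampled": (int(flags, 16) & 0x01) == 1}


# Orchestrator: one trace ID for the whole workflow, plus the ID of the span making the call
workflow_trace_id = secrets.token_hex(16)
outbound_headers = {"traceparent": make_traceparent(workflow_trace_id, secrets.token_hex(8))}

# Worker: the parsed trace ID matches, so its span joins the same trace as a child
incoming = parse_traceparent(outbound_headers["traceparent"])
```

The trace ID stays fixed for the whole workflow while the parent ID changes at every hop. That is what lets the tracing backend stitch each agent's spans into one tree.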

Steps

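  1. Install the OpenTelemetry API, SDK, and OTLP gRPC exporter in every agent service. The W3C TraceContext propagator ships with opentelemetry-api and handles both HTTP headers and message-queue metadata, so no extra propagator package is needed:

    pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc
    

    The examples below also use requests, fastapi, and redis. Install opentelemetry-propagator-b3 only if some services must interoperate with B3-based tracers.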

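  2. Configure the global propagator at startup in every agent service, before any request is sent or handled:

    from opentelemetry.baggage.propagation import W3CBaggagePropagator
    from opentelemetry.propagate import set_global_textmap
    from opentelemetry.propagators.composite import CompositePropagator
    from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

    # W3C traceparent/tracestate plus W3C baggage, the same pair as the SDK default
    set_global_textmap(CompositePropagator([TraceContextTextMapPropagator(), W3CBaggagePropagator()]))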
    
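  3. In the orchestrator agent, create a parent span for the workflow and inject its context into every outbound HTTP call:

    import requests
    from opentelemetry import propagate, trace

    tracer = trace.get_tracer(__name__)
    tasks = ["summarize-report", "draft-reply"]  # placeholder subtasks

    with tracer.start_as_current_span("multi-agent-workflow") as workflow_span:
        workflow_span.set_attribute("workflow.subtask_count", len(tasks))
        for subtask in tasks:
            headers = {}
            propagate.inject(headers)  # adds traceparent for the active span
            response = requests.post("http://worker-agent:8000/execute",
                                     json={"task": subtask}, headers=headers, timeout=30)
            response.raise_for_status()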
    
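  4. In the worker agent, extract the propagated context and start child spans from it:

    from fastapi import FastAPI, Request
    from opentelemetry import propagate, trace

    app = FastAPI()
    tracer = trace.get_tracer(__name__)

    @app.post("/execute")
    async def execute(request: Request):
        # Rebuild the caller's context from the incoming traceparent header
        ctx = propagate.extract(dict(request.headers))
        with tracer.start_as_current_span("worker-execute", context=ctx) as span:
            body = await request.json()  # Request.json() is a coroutine
            result = await process_task(body["task"])  # your agent's task handler
            span.set_attribute("worker.result_size", len(str(result)))
            return {"result": result}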
    
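  5. For message-queue propagation, inject the context into a metadata field of the message. With Redis Pub/Sub, inject while a span is active, because outside a span the carrier stays empty:

    import json
    import redis
    from opentelemetry import propagate, trace

    tracer = trace.get_tracer(__name__)
    r = redis.Redis(host="redis", port=6379)  # assumed Redis hostname

    with tracer.start_as_current_span("publish-agent-task"):
        carrier = {}
        propagate.inject(carrier)  # e.g. {"traceparent": "00-<trace-id>-<span-id>-01"}
        r.publish("agent-tasks", json.dumps({"task": task_data, "trace_context": carrier}))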
    
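  6. On the consumer side, parse the message, pass its trace_context dict to propagate.extract(), and start the consumer's span with context= set to the result, so that it becomes a child of the publishing span.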

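  7. Point every service at the same OTLP endpoint, but give each one a distinct service.name resource attribute on its tracer provider (the Verification query filters on service=agent-orchestrator):

    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    exporter = OTLPSpanExporter(endpoint="http://jaeger-collector:4317", insecure=True)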
    

    Expected: spans from both orchestrator and worker appear under one trace ID in the tracing UI.

Verification

curl -s "http://jaeger:16686/api/traces?service=agent-orchestrator&limit=1" | jq '.data[0].spans | length'

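Expected output: an integer of 2 or more. The count alone does not prove that both agents contributed, so also confirm that the trace lists both agent-orchestrator and the worker's service name (for example, in the Jaeger UI trace view).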
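A small Python check against the same query endpoint can report which services emitted spans, not just how many spans exist. This assumes the JSON shape of Jaeger's /api/traces endpoint (data, spans, processID, processes, serviceName). That endpoint is the internal query API behind the Jaeger UI rather than a stable public contract, so verify the shape against your Jaeger version. fetch_latest_trace and summarize_trace are hypothetical names.

```python
import json
import urllib.request

# Same query as the curl command; hostname and service name come from this guide's setup
JAEGER_QUERY = "http://jaeger:16686/api/traces?service=agent-orchestrator&limit=1"


def fetch_latest_trace(url: str = JAEGER_QUERY) -> dict:
    """Fetch the raw JSON response from Jaeger's HTTP query API."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def summarize_trace(payload: dict) -> tuple[int, set[str]]:
    """Return (span count, service names) for the first trace in a Jaeger response."""
    first = payload["data"][0]
    # Each span points at a process via processID; processes carry serviceName
    services = {first["processes"][span["processID"]]["serviceName"] for span in first["spans"]}
    return len(first["spans"]), services
```

From a host that can reach jaeger:16686, summarize_trace(fetch_latest_trace()) should return a count of at least 2. It should also return a set that contains both agent-orchestrator and the worker's service name.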

Common failures

  • Spans appear as separate traces — the propagated context is not being extracted. Verify the traceparent header is present on the worker's incoming request: print(dict(request.headers).get("traceparent")).
  • Incomplete traces (missing worker spans) — the worker's span exporter is not configured or the OTLP endpoint is unreachable from the worker container. Check worker logs for exporter errors.
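  • Mismatched propagation formats — orchestrator uses W3C TraceContext but consumer expects B3 format. Standardize on W3C across all services, either by setting OTEL_PROPAGATORS=tracecontext,baggage or by using the same set_global_textmap call in every service. An explicit set_global_textmap call overrides the environment variable.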
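For the first and third failures, it helps to see which propagation headers actually reach the worker. Below is a minimal stdlib sketch. It assumes you pass it dict(request.headers) in the worker endpoint or the trace_context dict from a queue message. detect_trace_formats is a hypothetical helper. The B3 header names follow the B3 propagation convention: a single b3 header, or the X-B3-TraceId and X-B3-SpanId pair.

```python
from collections.abc import Mapping

# Header names each propagation format writes; HTTP header names are case-insensitive
FORMAT_HEADERS = {
    "w3c-tracecontext": ("traceparent",),
    "b3-single": ("b3",),
    "b3-multi": ("x-b3-traceid", "x-b3-spanid"),
}


def detect_trace_formats(carrier: Mapping[str, str]) -> list[str]:
    """List the propagation formats whose required headers are present in a carrier."""
    present = {name.lower() for name in carrier}
    return [fmt for fmt, names in FORMAT_HEADERS.items() if all(n in present for n in names)]
```

An empty result on the worker means nothing was injected upstream, often because propagate.inject ran outside an active span. A B3-only result means the sender is not using W3C TraceContext.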
