15. Performance Metrics

Chapter 15 of 24 · 15 min

Measuring multi-agent system performance requires metrics that capture both individual agent efficiency and system-level emergent behaviors. Standard LLM metrics miss the interaction dynamics critical to multi-agent systems.

Core Metric Definitions

Time to First Token (TTFT): Measures latency from request to first response. Critical for user-facing agents where perceived responsiveness dominates.

Token Efficiency Ratio: Output tokens divided by input tokens. Tracks whether agents consume excessive context through redundant reasoning or verbose output formats.

Tool Invocation Accuracy: Percentage of tool calls that produce relevant results. High accuracy indicates the agent correctly identifies when and how to invoke external capabilities.

Handoff Latency: Time between one agent completing its task and the receiving agent starting work. Measures coordination overhead.

# metrics/collector.py
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional
from collections import defaultdict
import threading

@dataclass
class AgentMetrics:
    agent_id: str
    invocation_count: int = 0
    total_latency_ms: float = 0.0
    token_count: int = 0
    tool_calls: int = 0
    successful_tool_calls: int = 0
    error_count: int = 0
    timestamps: list[datetime] = field(default_factory=list)

class MetricsCollector:
    def __init__(self):
        self._metrics: dict[str, AgentMetrics] = {}
        self._lock = threading.Lock()
    
    def record_invocation(self, agent_id: str, latency_ms: float, tokens: int):
        with self._lock:
            if agent_id not in self._metrics:
                self._metrics[agent_id] = AgentMetrics(agent_id=agent_id)
            m = self._metrics[agent_id]
            m.invocation_count += 1
            m.total_latency_ms += latency_ms
            m.token_count += tokens
            m.timestamps.append(datetime.utcnow())
    
    def record_tool_call(self, agent_id: str, success: bool):
        with self._lock:
            if agent_id not in self._metrics:
                self._metrics[agent_id] = AgentMetrics(agent_id=agent_id)
            m = self._metrics[agent_id]
            m.tool_calls += 1
            if success:
                m.successful_tool_calls += 1
    
    def get_summary(self) -> dict:
        with self._lock:
            return {
                agent_id: {
                    "invocations": m.invocation_count,
                    "avg_latency_ms": m.total_latency_ms / m.invocation_count if m.invocation_count > 0 else 0,
                    "total_tokens": m.token_count,
                    "tool_accuracy": m.successful_tool_calls / m.tool_calls if m.tool_calls > 0 else 0,
                    "error_rate": m.error_count / m.invocation_count if m.invocation_count > 0 else 0
                }
                for agent_id, m in self._metrics.items()
            }

Composite System Metrics

Individual metrics combine into system-level indicators. Cyclic dependency detection uses graph analysis. Escalation rates reveal boundary issues between agent responsibilities. End-to-end latency percentiles expose coordination bottlenecks.

Dashboard Integration

Metrics export to standard monitoring systems via Prometheus metrics endpoints or direct API integration. Dashboards surface real-time performance with historical comparison bands.

EXERCISE

Implement a composite metric calculator that detects latency spikes by comparing rolling averages against historical baselines and generates alerts when standard deviation exceeds 2σ.