15. Performance Metrics
Measuring multi-agent system performance requires metrics that capture both individual agent efficiency and system-level emergent behaviors. Standard LLM metrics miss the interaction dynamics critical to multi-agent systems.
Core Metric Definitions
Time to First Token (TTFT): Measures latency from request to first response. Critical for user-facing agents where perceived responsiveness dominates.
Token Efficiency Ratio: Output tokens divided by input tokens. Tracks whether agents consume excessive context through redundant reasoning or verbose output formats.
Tool Invocation Accuracy: Percentage of tool calls that produce relevant results. High accuracy indicates the agent correctly identifies when and how to invoke external capabilities.
Handoff Latency: Time between one agent completing its task and the receiving agent starting work. Measures coordination overhead.
# metrics/collector.py
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional
from collections import defaultdict
import threading
@dataclass
class AgentMetrics:
agent_id: str
invocation_count: int = 0
total_latency_ms: float = 0.0
token_count: int = 0
tool_calls: int = 0
successful_tool_calls: int = 0
error_count: int = 0
timestamps: list[datetime] = field(default_factory=list)
class MetricsCollector:
def __init__(self):
self._metrics: dict[str, AgentMetrics] = {}
self._lock = threading.Lock()
def record_invocation(self, agent_id: str, latency_ms: float, tokens: int):
with self._lock:
if agent_id not in self._metrics:
self._metrics[agent_id] = AgentMetrics(agent_id=agent_id)
m = self._metrics[agent_id]
m.invocation_count += 1
m.total_latency_ms += latency_ms
m.token_count += tokens
m.timestamps.append(datetime.utcnow())
def record_tool_call(self, agent_id: str, success: bool):
with self._lock:
if agent_id not in self._metrics:
self._metrics[agent_id] = AgentMetrics(agent_id=agent_id)
m = self._metrics[agent_id]
m.tool_calls += 1
if success:
m.successful_tool_calls += 1
def get_summary(self) -> dict:
with self._lock:
return {
agent_id: {
"invocations": m.invocation_count,
"avg_latency_ms": m.total_latency_ms / m.invocation_count if m.invocation_count > 0 else 0,
"total_tokens": m.token_count,
"tool_accuracy": m.successful_tool_calls / m.tool_calls if m.tool_calls > 0 else 0,
"error_rate": m.error_count / m.invocation_count if m.invocation_count > 0 else 0
}
for agent_id, m in self._metrics.items()
}
Composite System Metrics
Individual metrics combine into system-level indicators. Cyclic dependency detection uses graph analysis. Escalation rates reveal boundary issues between agent responsibilities. End-to-end latency percentiles expose coordination bottlenecks.
Dashboard Integration
Metrics export to standard monitoring systems via Prometheus metrics endpoints or direct API integration. Dashboards surface real-time performance with historical comparison bands.
Implement a composite metric calculator that detects latency spikes by comparing rolling averages against historical baselines and generates alerts when standard deviation exceeds 2σ.