HOW-TO · RAG
How to Monitor Agent Performance Metrics
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES
Agent system running, Prometheus or similar, Python 3.10+
What this does
Performance monitoring tracks latency, token usage, error rates, and tool call frequency per agent, and exposes them as Prometheus metrics that dashboards can use for capacity planning and bottleneck identification.
Steps
- Install the Prometheus client library.
pip install prometheus-client
- Define metric counters and histograms. Each signal gets a Prometheus metric.
from prometheus_client import Counter, Histogram, Gauge, start_http_server
TOOL_CALL_COUNTER = Counter(
    "agent_tool_calls_total",
    "Total tool calls by tool name",
    ["agent", "tool_name", "status"],
)
LLM_LATENCY = Histogram(
    "agent_llm_latency_seconds",
    "LLM call latency in seconds",
    ["agent"],
    buckets=(0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0),
)
TOKEN_USAGE = Counter(
    "agent_token_usage_total",
    "Token usage by agent",
    ["agent", "token_type"],  # prompt, completion
)
ACTIVE_AGENTS = Gauge(
    "agent_active_count",
    "Number of currently active agents",
    ["agent_type"],
)
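Every distinct combination of label values becomes its own time series, so it can help to estimate the worst-case series count before adding a label. The sketch below multiplies per-label value counts. The helper series_count and the counts (5 agents, 20 tools, 10,000 sessions) are made up for illustration and are not part of prometheus_client.

```python
from math import prod


def series_count(label_values: dict[str, int]) -> int:
    """Worst-case number of series: the product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical sizes for the labels on agent_tool_calls_total.
tool_calls = {"agent": 5, "tool_name": 20, "status": 2}
print(series_count(tool_calls))  # 200

# Adding one unbounded label multiplies everything that came before it.
with_session = {**tool_calls, "session_id": 10_000}
print(series_count(with_session))  # 2000000
```

Histograms multiply this further, since each label combination is stored once per bucket.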
- Instrument the agent loop. Record metrics around each operation.
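import time

# Tool time gets its own histogram so it does not skew LLM_LATENCY.
# The metric name agent_tool_latency_seconds is chosen here for illustration.
TOOL_LATENCY = Histogram(
    "agent_tool_latency_seconds",
    "Tool call latency in seconds",
    ["agent", "tool_name"],
)

def monitored_tool_call(agent_name: str, tool_name: str, args: dict):
    start = time.perf_counter()
    try:
        # execute_tool is the agent's own tool dispatcher (not defined in this guide).
        result = execute_tool(tool_name, args)
        TOOL_CALL_COUNTER.labels(agent=agent_name, tool_name=tool_name, status="success").inc()
        return result
    except Exception:
        TOOL_CALL_COUNTER.labels(agent=agent_name, tool_name=tool_name, status="error").inc()
        raise
    finally:
        # Runs on success and on error, so failed calls are timed too.
        TOOL_LATENCY.labels(agent=agent_name, tool_name=tool_name).observe(time.perf_counter() - start)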
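The wrapper above covers tool calls, but the token counter and the active-agent gauge defined earlier still need call sites. One possible way to record them around a model call is sketched below. The names fake_llm and monitored_llm_call, and the shape of the response dict, are assumptions for illustration. The metrics are re-declared on a private registry only so the sketch runs on its own; in the guide's module you would reuse the existing objects instead.

```python
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

# Private registry: avoids a duplicate-name error next to the guide's metrics.
registry = CollectorRegistry()

LLM_LATENCY = Histogram(
    "agent_llm_latency_seconds", "LLM call latency in seconds", ["agent"],
    buckets=(0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0), registry=registry,
)
TOKEN_USAGE = Counter(
    "agent_token_usage_total", "Token usage by agent",
    ["agent", "token_type"], registry=registry,
)
ACTIVE_AGENTS = Gauge(
    "agent_active_count", "Number of currently active agents",
    ["agent_type"], registry=registry,
)


def fake_llm(prompt: str) -> dict:
    """Hypothetical stand-in for a real model client."""
    return {"text": "ok", "prompt_tokens": len(prompt.split()), "completion_tokens": 1}


def monitored_llm_call(agent_name: str, agent_type: str, prompt: str, call_llm=fake_llm) -> str:
    # track_inprogress() increments the gauge on entry and decrements it on exit.
    with ACTIVE_AGENTS.labels(agent_type=agent_type).track_inprogress():
        # time() observes the elapsed seconds, including when the call raises.
        with LLM_LATENCY.labels(agent=agent_name).time():
            response = call_llm(prompt)
    TOKEN_USAGE.labels(agent=agent_name, token_type="prompt").inc(response["prompt_tokens"])
    TOKEN_USAGE.labels(agent=agent_name, token_type="completion").inc(response["completion_tokens"])
    return response["text"]


monitored_llm_call("planner", "planning", "summarize the three open tickets")
```

Keeping the LLM timing separate from tool timing means the P95 LLM latency query reflects model calls only.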
- Start the Prometheus HTTP server. Expose metrics on a dedicated port.
from prometheus_client import start_http_server
start_http_server(8001) # Metrics available at http://localhost:8001/metrics
- Add business-level metrics. Track task completion rate and quality.
TASK_COMPLETION = Counter(
    "agent_tasks_total",
    "Tasks by completion status",
    ["agent", "status"],  # success, failure, timeout
)
TASK_DURATION = Histogram(
    "agent_task_duration_seconds",
    "End-to-end task duration",
    ["agent"],
)
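The definitions above do not show where the three status values come from. A minimal sketch of one way to record them follows, assuming a task is a plain callable and that timeouts surface as TimeoutError. The helper run_monitored_task and both sample tasks are hypothetical, and the metrics are re-declared on a private registry only to keep the sketch self-contained.

```python
import time
from prometheus_client import CollectorRegistry, Counter, Histogram

registry = CollectorRegistry()
TASK_COMPLETION = Counter(
    "agent_tasks_total", "Tasks by completion status",
    ["agent", "status"], registry=registry,
)
TASK_DURATION = Histogram(
    "agent_task_duration_seconds", "End-to-end task duration",
    ["agent"], registry=registry,
)


def run_monitored_task(agent_name: str, task):
    """Run task() and record its outcome as success, timeout, or failure."""
    start = time.perf_counter()
    status = "failure"  # default covers any exception that is not a timeout
    try:
        result = task()
        status = "success"
        return result
    except TimeoutError:
        status = "timeout"
        raise
    finally:
        # Exactly one counter increment and one duration sample per task.
        TASK_COMPLETION.labels(agent=agent_name, status=status).inc()
        TASK_DURATION.labels(agent=agent_name).observe(time.perf_counter() - start)


def slow_task():
    raise TimeoutError("model did not answer in time")


run_monitored_task("planner", lambda: "done")
try:
    run_monitored_task("planner", slow_task)
except TimeoutError:
    pass
```

Because status is a small fixed set, this label stays cheap in terms of series count.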
- Query with PromQL. Example queries for monitoring.
# Top 5 most-called tools
topk(5, sum(rate(agent_tool_calls_total[5m])) by (tool_name))
# Error rate per agent
sum(rate(agent_tool_calls_total{status="error"}[5m])) by (agent)
/ sum(rate(agent_tool_calls_total[5m])) by (agent)
# P95 LLM latency
histogram_quantile(0.95, sum(rate(agent_llm_latency_seconds_bucket[5m])) by (le))
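To see why bucket boundaries matter for the P95 query, it may help to reproduce the estimate by hand. The sketch below is a simplified Python version of the linear interpolation that histogram_quantile applies to cumulative buckets; it is not the Prometheus server's code, and the bucket counts are invented for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Assumes buckets are sorted by upper bound and end with math.inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (le, cum) in enumerate(buckets):
        if cum >= rank:
            break
    if math.isinf(le):
        # The quantile fell past the last finite bound: the estimate is capped.
        return buckets[-2][0]
    lower, below = (0.0, 0.0) if i == 0 else buckets[i - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (le - lower) * (rank - below) / (cum - below)


bounds = (0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, math.inf)
observed = list(zip(bounds, (0, 10, 40, 70, 90, 98, 100, 100)))
print(histogram_quantile(0.95, observed))  # 8.125

# Same boundaries, but 10% of calls took longer than 30 s.
slow = list(zip(bounds, (0, 0, 5, 20, 50, 70, 90, 100)))
print(histogram_quantile(0.95, slow))  # 30.0
```

In the second case the true P95 is somewhere above 30 s, but the estimate cannot exceed the largest finite bucket boundary.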
Verification
python -c "
from prometheus_client import Counter, start_http_server
from urllib.request import urlopen
c = Counter('test_total', 'Test', ['label'])
c.labels(label='test').inc()
start_http_server(8002)  # serves from a background daemon thread
body = urlopen('http://localhost:8002/metrics').read().decode()
print('test_total' in body)
# Expected: True
"
Common failures
- Metric cardinality explosion. Using unique labels like session_id creates unlimited metric series. Pin to fixed labels like agent and tool_name.
- Port conflicts. Multiple agent instances crash if they all try to bind the same Prometheus port. Use a unique port per instance or a push gateway.
- Histogram buckets sized wrong. The default buckets stop at 10 seconds, so slower calls all land in the +Inf bucket and the long tail is invisible. Configure buckets up to 60s for LLM calls with slow local models.
- Version mismatch. The installed package or runtime differs from the command shown. Check the version first and rerun the smallest verification command.
- Local environment drift. Another service, virtual environment, model, or path is being used. Print the active binary path and configuration before changing the guide steps.
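For the port-conflict case, one option is to probe for a free port before starting the metrics server. The sketch below uses only the standard library; first_free_port is a hypothetical helper, and the base port 8001 simply mirrors the port used earlier in this guide. There is a small race between the probe and the real bind, so treat it as a starting point, not a guarantee.

```python
import socket


def first_free_port(base: int = 8001, attempts: int = 20) -> int:
    """Return the first port in [base, base + attempts) that can be bound."""
    for port in range(base, base + attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("", port))
            except OSError:
                continue  # port is taken, try the next one
            return port  # the probe socket closes on leaving the with block
    raise RuntimeError(f"no free port in {base}..{base + attempts - 1}")


port = first_free_port()
print(f"metrics port: {port}")
```

The chosen port would then be passed to start_http_server, and each instance's port has to be added to the Prometheus scrape configuration.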
Related guides
- How to Set Up Agent Observability and Tracing
- How to Implement Logging for Agent Debugging