14. Production Monitoring
Chapter 14 of 18 · 25 min
Running function calling in production requires visibility into model behavior, tool execution times, error rates, and resource consumption. Without monitoring, issues remain hidden until they cause user-facing failures.
Key Metrics
Monitor these metrics for function-calling systems:
- Tool call frequency: Which tools are called and how often
- Tool latency: Time from call to result for each tool
- Model response time: Time to generate tool calls
- Error rate by tool: Which tools fail most frequently
- Token usage: Input and output tokens per request
- Cache hit rate: How often similar queries use cached results
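Of the metrics listed, cache hit rate is the one that needs explicit bookkeeping in the application itself. A minimal sketch, assuming a simple in-memory cache keyed by tool name and arguments; the `CachedToolRunner` class and its attribute names are hypothetical and not part of any library:

```python
import json
from typing import Any, Callable


class CachedToolRunner:
    """Hypothetical in-memory result cache that counts hits and misses."""

    def __init__(self) -> None:
        self._cache: dict[str, Any] = {}
        self.hits = 0
        self.misses = 0

    def run(self, tool_name: str, func: Callable[..., Any], **kwargs: Any) -> Any:
        # Key on the tool name plus its arguments in a stable order.
        key = json.dumps([tool_name, kwargs], sort_keys=True, default=str)
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        result = func(**kwargs)
        self._cache[key] = result
        return result

    @property
    def hit_rate(self) -> float:
        # Fraction of calls served from the cache; 0.0 before any call.
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

In a real deployment the two counters would more likely be exported as metrics so the ratio can be computed over a time window rather than since process start.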
Prometheus Metrics
Instrument the tool executor with Prometheus metrics:
import time
from typing import Any, Callable

from prometheus_client import Counter, Histogram, Gauge

# Define metrics
tool_calls_total = Counter(
    'tool_calls_total',
    'Total tool calls',
    ['tool_name', 'status']
)
tool_duration_seconds = Histogram(
    'tool_duration_seconds',
    'Tool execution duration',
    ['tool_name'],
    # The largest finite bucket must exceed any latency you alert on:
    # histogram_quantile() cannot report values above it.
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 30.0]
)
model_tokens = Histogram(
    'model_tokens',  # no '_total' suffix: by convention that marks a counter
    'Token usage per request',
    ['type'],  # 'input' or 'output'
    buckets=[10, 100, 500, 1000, 5000]
)
active_requests = Gauge(
    'active_requests',
    'Currently processing requests'
)

class MonitoredToolExecutor:
    def __init__(self, base_executor):
        self.base = base_executor

    def execute(self, tool_name: str, func: Callable[..., Any], *args, **kwargs):
        active_requests.inc()
        # perf_counter() is monotonic, so system clock changes cannot skew durations
        start_time = time.perf_counter()
        try:
            result = func(*args, **kwargs)
            tool_calls_total.labels(
                tool_name=tool_name,
                status='success'
            ).inc()
            return result
        except Exception:
            tool_calls_total.labels(
                tool_name=tool_name,
                status='error'
            ).inc()
            raise
        finally:
            # Runs on both success and failure, so every call is timed
            duration = time.perf_counter() - start_time
            tool_duration_seconds.labels(tool_name=tool_name).observe(duration)
            active_requests.dec()
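The `model_tokens` histogram is defined in the code for this section but nothing observes into it. A minimal sketch of how it could be fed, assuming the model response is an Ollama-style dict carrying `prompt_eval_count` and `eval_count` fields (check these names against the API version you run); `token_counts` and `record_token_usage` are hypothetical helpers:

```python
from typing import Callable, Mapping


def token_counts(response: Mapping[str, object]) -> dict[str, int]:
    """Extract input/output token counts, using 0 when a field is absent."""
    return {
        "input": int(response.get("prompt_eval_count") or 0),
        "output": int(response.get("eval_count") or 0),
    }


def record_token_usage(
    response: Mapping[str, object],
    observe: Callable[[str, int], None],
) -> None:
    """Pass each count to `observe(kind, count)`, e.g. a histogram wrapper."""
    for kind, count in token_counts(response).items():
        observe(kind, count)
```

With the histogram from this section, the callback could be `lambda kind, n: model_tokens.labels(type=kind).observe(n)`. Taking the callback as a parameter keeps the extraction logic testable without a metrics registry.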
Structured Logging
Combine metrics with structured logs for correlation:
import logging

import structlog

# filter_by_level defers to the stdlib logger, whose default level is WARNING.
# Without this call, the info-level success events would be dropped.
logging.basicConfig(format="%(message)s", level=logging.INFO)

# Configure structured logging
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer()
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
)

def log_tool_execution(
    tool_name: str,
    duration_ms: float,
    success: bool,
    error: str | None = None,
    model_latency_ms: float | None = None,
    tokens_in: int | None = None,
    tokens_out: int | None = None
):
    logger = structlog.get_logger()
    # Failures are logged at error level so they can be filtered separately
    log = logger.info if success else logger.error
    log(
        "tool_execution",
        tool=tool_name,
        success=success,
        duration_ms=duration_ms,
        model_latency_ms=model_latency_ms,
        tokens_in=tokens_in,
        tokens_out=tokens_out,
        error=error
    )
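To give `log_tool_execution` consistent values, the timing and error capture can live in one place. A sketch, assuming a hypothetical `timed_tool` context manager that accepts any callable taking the same keyword arguments as `log_tool_execution`:

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def timed_tool(tool_name: str, log: Callable[..., None]) -> Iterator[None]:
    """Time the wrapped block and report the outcome through `log`."""
    start = time.perf_counter()
    try:
        yield
    except Exception as exc:
        duration_ms = (time.perf_counter() - start) * 1000
        log(tool_name=tool_name, duration_ms=duration_ms, success=False, error=str(exc))
        raise  # the caller still sees the original exception
    else:
        duration_ms = (time.perf_counter() - start) * 1000
        log(tool_name=tool_name, duration_ms=duration_ms, success=True)
```

Usage would look like `with timed_tool("get_weather", log_tool_execution): ...`, where `get_weather` stands in for any tool of your own. Because the exception is re-raised, the log entry is recorded without changing the caller's error handling.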
Health Endpoint
Expose a health endpoint for load balancers and orchestrators:
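import time

import requests
from fastapi import FastAPI, Response
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

app = FastAPI()
start_time = time.time()  # recorded at import time, used to report uptime
# `available_tools` is assumed to be the tool registry defined elsewhere in your application.

def check_ollama_connection() -> bool:
    try:
        # A timeout keeps the health check from hanging when Ollama is unresponsive
        response = requests.get("http://localhost:11434/api/tags", timeout=2)
        return response.status_code == 200
    except requests.RequestException:
        return False

@app.get("/health")
def health():
    # Plain `def`: FastAPI runs it in a threadpool, so the blocking
    # requests call does not stall the event loop.
    ollama_ok = check_ollama_connection()
    return {
        "status": "healthy" if ollama_ok else "degraded",
        "ollama_connected": ollama_ok,
        "active_tools": len(available_tools),
        "uptime_seconds": time.time() - start_time
    }

@app.get("/metrics")
def metrics():
    # Expose Prometheus-format metrics from the default registry
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)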
Alerting Thresholds
Set thresholds that trigger alerts when exceeded:
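# prometheus alert rules
groups:
  - name: function_calling_alerts
    rules:
      - alert: HighToolErrorRate
        # Aggregate both sides per tool. Dividing the raw series would only
        # match series with identical labels (error / error = 1).
        expr: |
          sum by (tool_name) (rate(tool_calls_total{status="error"}[5m]))
            / sum by (tool_name) (rate(tool_calls_total[5m])) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Tool error rate above 10%"
      - alert: ToolLatencyHigh
        # Requires histogram buckets above 5s: histogram_quantile() never
        # returns more than the largest finite bucket bound.
        expr: |
          histogram_quantile(0.95, sum by (le, tool_name) (rate(tool_duration_seconds_bucket[5m]))) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "95th percentile tool latency above 5 seconds"
      - alert: OllamaDown
        # Assumes a scrape job named "ollama" exists in your Prometheus config.
        expr: up{job="ollama"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Ollama server unreachable"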
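To see what the `HighToolErrorRate` threshold means in terms of raw counters, the ratio can be worked out by hand. A sketch, assuming two hypothetical snapshots of `tool_calls_total` taken at the start and end of a window and keyed by `(tool_name, status)`; it ignores counter resets and the extrapolation that `rate()` performs, so it approximates the alert expression rather than reproducing it:

```python
from collections import defaultdict

Snapshot = dict[tuple[str, str], float]


def error_ratio_by_tool(start: Snapshot, end: Snapshot) -> dict[str, float]:
    """Per-tool share of calls in the window that ended with status 'error'."""
    errors: dict[str, float] = defaultdict(float)
    totals: dict[str, float] = defaultdict(float)
    for (tool, status), value in end.items():
        increase = value - start.get((tool, status), 0.0)
        totals[tool] += increase
        if status == "error":
            errors[tool] += increase
    # Tools with no calls in the window have no defined ratio and are skipped.
    return {tool: errors[tool] / total for tool, total in totals.items() if total > 0}
```

For example, a tool whose counters grew by 80 successes and 20 errors during the window has a ratio of 0.2, which would exceed the 0.1 threshold.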
EXERCISE
Add Prometheus metrics to your tool executor tracking call count, duration histogram, and error count per tool. Verify metrics are exported correctly by querying the /metrics endpoint.
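For the verification step, one option is to fetch the endpoint and check that the expected metric names appear. A sketch, assuming the service listens on `http://localhost:8000` (an assumption; substitute your own address) and that the output follows the Prometheus text exposition format, in which lines starting with `#` are comments; `metric_names` and `fetch_metric_names` are hypothetical helpers:

```python
import urllib.request


def metric_names(exposition_text: str) -> set[str]:
    """Return the sample names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample looks like `name{labels} value` or `name value`.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names


def fetch_metric_names(url: str = "http://localhost:8000/metrics") -> set[str]:
    """Fetch the endpoint and parse it; the default URL is an assumption."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return metric_names(response.read().decode("utf-8"))
```

After a few tool calls, a check such as `{"tool_calls_total", "tool_duration_seconds_bucket"} <= fetch_metric_names()` should pass if the instrumentation is wired up.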