14. Production Monitoring

Chapter 14 of 18 · 25 min

Running function calling in production requires visibility into model behavior, tool execution times, error rates, and resource consumption. Without monitoring, issues remain hidden until they cause user-facing failures.

Key Metrics

Monitor these metrics for function-calling systems:

  • Tool call frequency: Which tools are called and how often
  • Tool latency: Time from call to result for each tool
  • Model response time: Time to generate tool calls
  • Error rate by tool: Which tools fail most frequently
  • Token usage: Input and output tokens per request
  • Cache hit rate: How often similar queries use cached results
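Two of these metrics, error rate and cache hit rate, are ratios derived from raw counts. To make the definitions concrete, here is a minimal in-memory sketch. The `ToolStats` class and the `get_weather` tool name are illustrative assumptions, not part of any library, and a production system would export the raw counts to a metrics backend rather than compute ratios in process.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ToolStats:
    """Illustrative in-memory tallies for a function-calling service."""
    calls: Counter = field(default_factory=Counter)
    errors: Counter = field(default_factory=Counter)
    cache_hits: int = 0
    cache_lookups: int = 0

    def record_call(self, tool: str, ok: bool) -> None:
        # Tool call frequency: one increment per call, keyed by tool name
        self.calls[tool] += 1
        if not ok:
            self.errors[tool] += 1

    def record_cache_lookup(self, hit: bool) -> None:
        self.cache_lookups += 1
        if hit:
            self.cache_hits += 1

    def error_rate(self, tool: str) -> float:
        # Error rate by tool: failed calls divided by all calls for that tool
        total = self.calls[tool]
        return self.errors[tool] / total if total else 0.0

    def cache_hit_rate(self) -> float:
        # Cache hit rate: hits divided by all lookups
        if not self.cache_lookups:
            return 0.0
        return self.cache_hits / self.cache_lookups


stats = ToolStats()
for ok in (True, True, True, False):
    stats.record_call("get_weather", ok)  # hypothetical tool name
stats.record_cache_lookup(hit=True)
stats.record_cache_lookup(hit=False)
```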

Prometheus Metrics

Instrument the tool executor with Prometheus metrics:

from prometheus_client import Counter, Histogram, Gauge
import time

# Define metrics
tool_calls_total = Counter(
    'tool_calls_total',
    'Total tool calls',
    ['tool_name', 'status']
)

tool_duration_seconds = Histogram(
    'tool_duration_seconds',
    'Tool execution duration',
    ['tool_name'],
    # Buckets above 5s let quantiles beyond 5 seconds be estimated
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 30.0]
)

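# Record with model_tokens.labels(type='input').observe(token_count)
model_tokens = Histogram(
    'model_tokens',  # no '_total' suffix: by convention that marks counters
    'Token usage per request',
    ['type'],  # 'input' or 'output'
    buckets=[10, 100, 500, 1000, 5000]
)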

active_requests = Gauge(
    'active_requests',
    'Currently processing requests'
)

class MonitoredToolExecutor:
    def __init__(self, base_executor):
        self.base = base_executor
    
    def execute(self, tool_name: str, func: callable, *args, **kwargs):
        # Track in-flight executions and time the call with a monotonic clock
        active_requests.inc()
        start_time = time.perf_counter()

        try:
            result = func(*args, **kwargs)
            tool_calls_total.labels(
                tool_name=tool_name,
                status='success'
            ).inc()
            return result
        except Exception:
            tool_calls_total.labels(
                tool_name=tool_name,
                status='error'
            ).inc()
            raise  # re-raise so callers still see the failure
        finally:
            # Runs on both success and failure
            duration = time.perf_counter() - start_time
            tool_duration_seconds.labels(tool_name=tool_name).observe(duration)
            active_requests.dec()
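The token histogram defined above is never observed by the executor, because token counts come from the model response rather than from tool execution. A minimal sketch of recording them follows. It assumes the response is a dict carrying `prompt_eval_count` and `eval_count`, the field names Ollama reports for input and output tokens; adjust these for other backends. `record_token_usage` is a hypothetical helper name, and a private registry is used only so the example does not collide with the metrics defined above.

```python
from prometheus_client import CollectorRegistry, Histogram

# Private registry so this sketch can run alongside the global metrics
registry = CollectorRegistry()

token_usage = Histogram(
    "model_tokens",
    "Token usage per request",
    ["type"],  # 'input' or 'output'
    buckets=[10, 100, 500, 1000, 5000],
    registry=registry,
)


def record_token_usage(response: dict) -> None:
    """Observe input and output token counts from one model response."""
    # Assumed field names; missing fields are recorded as 0 tokens
    token_usage.labels(type="input").observe(response.get("prompt_eval_count", 0))
    token_usage.labels(type="output").observe(response.get("eval_count", 0))


# Example with a hand-written response dict
record_token_usage({"prompt_eval_count": 42, "eval_count": 128})
```

Calling `record_token_usage` once per model response, next to the code that already parses the response, keeps token accounting separate from tool timing.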

Structured Logging

Combine metrics with structured logs so that a spike in a metric can be traced back to the individual calls behind it:

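import logging

import structlog

# structlog hands events to the standard library logger here, and
# filter_by_level uses that logger's level. The default level is WARNING,
# so without this line every info-level event would be dropped.
logging.basicConfig(format="%(message)s", level=logging.INFO)

# Configure structured logging
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer()
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
)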

def log_tool_execution(
    tool_name: str,
    duration_ms: float,
    success: bool,
    error: str | None = None,
    model_latency_ms: float | None = None,
    tokens_in: int | None = None,
    tokens_out: int | None = None
):
    logger = structlog.get_logger()
    
    log = logger.info if success else logger.error
    log(
        "tool_execution",
        tool=tool_name,
        duration_ms=duration_ms,
        model_latency_ms=model_latency_ms,
        tokens_in=tokens_in,
        tokens_out=tokens_out,
        error=error
    )

Health Endpoint

Expose a health endpoint for load balancers and orchestrators:

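import time

import requests
from fastapi import FastAPI, Response
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

app = FastAPI()
start_time = time.time()  # process start, used to report uptime
available_tools: dict = {}  # placeholder: replace with your tool registry

def check_ollama_connection() -> bool:
    try:
        # Short timeout so a hung Ollama server cannot stall the health check
        response = requests.get("http://localhost:11434/api/tags", timeout=2)
        return response.status_code == 200
    except requests.RequestException:
        return False

# Plain `def` handlers run in FastAPI's thread pool, so the blocking
# requests call does not stall the event loop.
@app.get("/health")
def health():
    ollama_ok = check_ollama_connection()
    return {
        "status": "healthy" if ollama_ok else "unhealthy",
        "ollama_connected": ollama_ok,
        "active_tools": len(available_tools),
        "uptime_seconds": time.time() - start_time
    }

@app.get("/metrics")
def metrics():
    # Expose Prometheus-format metrics from the default registry
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)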
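Load balancers and orchestrators usually act on the HTTP status code rather than the response body, so a handler that returns 200 regardless of dependency state will look healthy even when Ollama is down. The sketch below separates that decision into a pure function that is easy to test. `build_health_response` is a hypothetical helper, and treating an unreachable Ollama server as the only unhealthy condition is an assumption to adapt to your deployment.

```python
def build_health_response(
    ollama_ok: bool, tool_count: int, uptime_seconds: float
) -> tuple[int, dict]:
    """Return (HTTP status code, JSON body) for a health check."""
    body = {
        "status": "healthy" if ollama_ok else "unhealthy",
        "ollama_connected": ollama_ok,
        "active_tools": tool_count,
        "uptime_seconds": round(uptime_seconds, 1),
    }
    # 503 tells the load balancer to stop routing traffic to this instance
    return (200 if ollama_ok else 503), body


status_code, body = build_health_response(
    ollama_ok=False, tool_count=3, uptime_seconds=12.34
)
```

In a FastAPI handler, the pair can be returned as `JSONResponse(status_code=status_code, content=body)` using `fastapi.responses.JSONResponse`.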

Alerting Thresholds

Set thresholds that trigger alerts when exceeded:

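# Prometheus alert rules
groups:
  - name: function_calling_alerts
    rules:
      - alert: HighToolErrorRate
        # Aggregate both sides by tool_name. Without this, the division only
        # matches series with identical labels (status="error" on both
        # sides), so the ratio is always 1.
        expr: |
          sum by (tool_name) (rate(tool_calls_total{status="error"}[5m]))
          / sum by (tool_name) (rate(tool_calls_total[5m])) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Tool error rate above 10%"

      - alert: ToolLatencyHigh
        # histogram_quantile cannot return more than the largest finite
        # bucket bound, so tool_duration_seconds needs at least one bucket
        # above 5s for this rule to fire.
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, tool_name) (rate(tool_duration_seconds_bucket[5m]))
          ) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "95th percentile tool latency above 5 seconds"

      - alert: OllamaDown
        # Assumes a scrape job named "ollama" in your Prometheus configuration
        expr: up{job="ollama"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Ollama server unreachable"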
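Latency thresholds interact with histogram bucket boundaries. `histogram_quantile` estimates a percentile by interpolating inside the bucket that contains it, and when the percentile falls in the `+Inf` bucket it returns the largest finite bucket bound. The sketch below is a simplified Python re-implementation of that estimate for illustration only, not Prometheus's own code, and the bucket counts are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, cumulative = buckets[index]
    if math.isinf(upper):
        # Quantile lies beyond the last finite bound: report that bound
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    previous = buckets[index - 1][1] if index > 0 else 0.0
    # Linear interpolation within the bucket
    return lower + (upper - lower) * (rank - previous) / (cumulative - previous)


# Made-up data: 10 of 100 calls took longer than 5 seconds
observed = [(0.5, 50), (1.0, 80), (5.0, 90), (math.inf, 100)]
p95 = histogram_quantile(0.95, observed)
p50 = histogram_quantile(0.50, observed)
```

With these counts the 95th percentile falls in the `+Inf` bucket, so the estimate is capped at 5.0 and a `> 5` comparison is false even though a tenth of the calls were slower than that. Adding bucket bounds above the alert threshold avoids the cap.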

EXERCISE

Add Prometheus metrics to your tool executor tracking call count, duration histogram, and error count per tool. Verify metrics are exported correctly by querying the /metrics endpoint.