Production Monitoring — Function Calling for Local Models (Chapter 14)

Running function calling in production requires visibility into model behavior, tool execution times, error rates, and resource consumption. Without monitoring, issues remain hidden until they cause user-facing failures.

Key Metrics

Monitor these metrics for function-calling systems:

Tool call frequency: Which tools are called and how often
Tool latency: Time from call to result for each tool
Model response time: Time to generate tool calls
Error rate by tool: Which tools fail most frequently
Token usage: Input and output tokens per request
Cache hit rate: How often similar queries use cached results
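
To make these definitions concrete, the sketch below derives several of the metrics from a small in-memory list of call records. The `ToolCall` record, its field names, and the sample data are hypothetical and exist only for illustration; a production system would export counters and histograms rather than aggregate in process.

```python
import collections
from dataclasses import dataclass


@dataclass
class ToolCall:
    # Hypothetical record shape, for illustration only
    tool: str
    duration_s: float
    ok: bool
    cache_hit: bool


def summarize(calls: list[ToolCall]) -> dict:
    """Aggregate call records into per-tool and overall metrics."""
    frequency = collections.Counter(c.tool for c in calls)
    errors = collections.Counter(c.tool for c in calls if not c.ok)
    durations = collections.defaultdict(list)
    for c in calls:
        durations[c.tool].append(c.duration_s)
    return {
        "frequency": dict(frequency),
        "error_rate": {t: errors[t] / n for t, n in frequency.items()},
        "mean_latency_s": {t: sum(d) / len(d) for t, d in durations.items()},
        "cache_hit_rate": (
            sum(c.cache_hit for c in calls) / len(calls) if calls else 0.0
        ),
    }


# Made-up sample data
calls = [
    ToolCall("search", 0.25, True, False),
    ToolCall("search", 0.75, False, False),
    ToolCall("weather", 0.5, True, True),
    ToolCall("search", 0.5, True, True),
]
summary = summarize(calls)
```

Error rate is computed per tool because a single failing tool can hide behind a healthy overall average.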

Prometheus Metrics

Instrument the tool executor with Prometheus metrics:

from prometheus_client import Counter, Histogram, Gauge
import time

# Define metrics
tool_calls_total = Counter(
    'tool_calls_total',
    'Total tool calls',
    ['tool_name', 'status']
)

tool_duration_seconds = Histogram(
    'tool_duration_seconds',
    'Tool execution duration',
    ['tool_name'],
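    # Buckets extend past the 5 s alert threshold: histogram_quantile cannot
    # report a value above the largest finite bucket bound.
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 30.0]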
)

model_tokens = Histogram(
    'model_tokens',  # no _total suffix: by convention it is reserved for counters
    'Token usage per request',
    ['type'],  # 'input' or 'output'
    buckets=[10, 100, 500, 1000, 5000]
)

active_requests = Gauge(
    'active_requests',
    'Currently processing requests'
)

class MonitoredToolExecutor:
    def __init__(self, base_executor):
        self.base = base_executor
    
    def execute(self, tool_name: str, func: callable, *args, **kwargs):
        active_requests.inc()
        # perf_counter is monotonic, so clock adjustments cannot skew durations
        start_time = time.perf_counter()
        
        try:
            result = func(*args, **kwargs)
            tool_calls_total.labels(
                tool_name=tool_name,
                status='success'
            ).inc()
            return result
        except Exception:
            tool_calls_total.labels(
                tool_name=tool_name,
                status='error'
            ).inc()
            raise  # re-raise so the caller still sees the failure
        finally:
            # Runs on both paths: record the duration and release the gauge
            duration = time.perf_counter() - start_time
            tool_duration_seconds.labels(tool_name=tool_name).observe(duration)
            active_requests.dec()

Structured Logging

Pair metrics with structured logs so that a spike on a dashboard can be traced to the individual tool executions behind it:

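import logging
import sys

import structlog

# structlog hands events to the standard library logger, whose default level
# is WARNING; set INFO so successful tool executions are not dropped.
logging.basicConfig(format="%(message)s", stream=sys.stdout, level=logging.INFO)

# Configure structured logging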
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer()
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
)

def log_tool_execution(
    tool_name: str,
    duration_ms: float,
    success: bool,
    error: str | None = None,
    model_latency_ms: float | None = None,
    tokens_in: int | None = None,
    tokens_out: int | None = None
):
    logger = structlog.get_logger()
    
    # Successes log at info and failures at error, so level filters separate them
    log = logger.info if success else logger.error
    log(
        "tool_execution",
        tool=tool_name,
        success=success,
        duration_ms=duration_ms,
        model_latency_ms=model_latency_ms,
        tokens_in=tokens_in,
        tokens_out=tokens_out,
        error=error
    )
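
One way to gather the arguments for a logger like `log_tool_execution` is a context manager that times the call and records the outcome. This is a minimal sketch: `timed_tool_call` and its `emit` parameter are hypothetical names, and `emit` stands in for any function that accepts the same keyword arguments as `log_tool_execution`.

```python
import time
from contextlib import contextmanager
from typing import Callable


@contextmanager
def timed_tool_call(tool_name: str, emit: Callable[..., None]):
    """Time the wrapped block and report the outcome through emit()."""
    start = time.perf_counter()
    try:
        yield
    except Exception as exc:
        emit(
            tool_name=tool_name,
            duration_ms=(time.perf_counter() - start) * 1000,
            success=False,
            error=f"{type(exc).__name__}: {exc}",
        )
        raise  # the caller still sees the original exception
    else:
        emit(
            tool_name=tool_name,
            duration_ms=(time.perf_counter() - start) * 1000,
            success=True,
        )


# Demo with a list-backed emitter; a service would pass its logging function
events: list[dict] = []


def record(**fields) -> None:
    events.append(fields)


with timed_tool_call("get_weather", emit=record):
    pass  # the tool call would run here

try:
    with timed_tool_call("get_weather", emit=record):
        raise ValueError("unknown city")
except ValueError:
    pass
```

Passing the emitter as a parameter keeps the timing logic independent of the logging library, which also makes it easy to test.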

Health Endpoint

Expose a health endpoint for load balancers and orchestrators:

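import time

import requests
from fastapi import FastAPI, Response
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

app = FastAPI()
start_time = time.time()  # process start, used to report uptime

# Placeholder: replace with the application's actual tool registry
available_tools: dict = {}

def check_ollama_connection() -> bool:
    try:
        # Short timeout so a hung Ollama server cannot stall the health check
        response = requests.get("http://localhost:11434/api/tags", timeout=2)
        return response.status_code == 200
    except requests.RequestException:
        return False

@app.get("/health")
def health(response: Response):
    # Plain def: FastAPI runs it in a threadpool, so the blocking
    # requests call does not stall the event loop
    ollama_ok = check_ollama_connection()
    if not ollama_ok:
        response.status_code = 503  # lets load balancers stop routing here
    return {
        "status": "healthy" if ollama_ok else "degraded",
        "ollama_connected": ollama_ok,
        "active_tools": len(available_tools),
        "uptime_seconds": time.time() - start_time
    }

@app.get("/metrics")
def metrics():
    # Expose Prometheus-format metrics from the default registry
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)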
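
Load balancers and orchestrators may probe `/health` every few seconds, and each probe triggers a request to Ollama. One option is to cache the probe result for a short time. The sketch below assumes a hypothetical `CachedCheck` wrapper; the class name, the five-second default, and the injectable `clock` parameter are illustrative choices, not part of FastAPI or Ollama.

```python
import time
from typing import Callable


class CachedCheck:
    """Cache the result of a boolean health probe for ttl_seconds."""

    def __init__(
        self,
        probe: Callable[[], bool],
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._probe = probe
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._expires_at = float("-inf")  # forces a probe on first use
        self._value = False

    def __call__(self) -> bool:
        now = self._clock()
        if now >= self._expires_at:
            # Cached value is stale: run the real probe and restart the TTL
            self._value = self._probe()
            self._expires_at = now + self._ttl
        return self._value
```

Wrapping the probe once at startup, for example `ollama_ok = CachedCheck(check_ollama_connection)`, lets the handler call `ollama_ok()` on every request. The trade-off is that an outage can go unreported for up to one TTL, so the TTL should stay well below the alerting window.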

Alerting Thresholds

Set thresholds that trigger alerts when exceeded:

# prometheus alert rules
groups:
  - name: function_calling_alerts
    rules:
      - alert: HighToolErrorRate
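        # Aggregate per tool so the numerator (errors) and the denominator
        # (all calls) carry matching label sets for the division
        expr: |
          sum by (tool_name) (rate(tool_calls_total{status="error"}[5m]))
          / sum by (tool_name) (rate(tool_calls_total[5m])) > 0.1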
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Tool error rate above 10%"
      
      - alert: ToolLatencyHigh
        expr: |
          histogram_quantile(0.95, rate(tool_duration_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "95th percentile tool latency above 5 seconds"
      
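      # Assumes a scrape job named "ollama" exists, for example an exporter
      # or probe that targets the Ollama server
      - alert: OllamaDown
        expr: up{job="ollama"} == 0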
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Ollama server unreachable"
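
To see how the latency rule behaves, it helps to look at how a quantile is estimated from bucket counts. The sketch below is a simplified Python rendering of the linear interpolation that the Prometheus documentation describes for `histogram_quantile`. The function `estimate_quantile` and the sample bucket counts are illustrative and not part of any library.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: expects at least one finite bucket plus a final
    bucket whose upper bound is +Inf, as Prometheus histograms have.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # Quantile falls in the +Inf bucket: report the largest finite bound
        return buckets[-2][0]
    lower = buckets[idx - 1][0] if idx > 0 else 0.0
    below = buckets[idx - 1][1] if idx > 0 else 0
    upper, count = buckets[idx]
    if count == below:
        return upper  # empty bucket, nothing to interpolate
    # Linear interpolation inside the selected bucket
    return lower + (upper - lower) * (rank - below) / (count - below)


# Made-up cumulative counts: 20% of calls take longer than 5 seconds
slow = [(1.0, 50), (5.0, 80), (math.inf, 100)]
slow_wide = [(1.0, 50), (5.0, 80), (10.0, 100), (math.inf, 100)]

p95_slow = estimate_quantile(0.95, slow)
p95_slow_wide = estimate_quantile(0.95, slow_wide)
```

In this demo, `p95_slow` saturates at the largest finite bound (5.0) even though a fifth of the calls are slower, so a `> 5` comparison would not become true. With an extra 10-second bucket, `p95_slow_wide` lands between 5 and 10. Bucket bounds should therefore extend beyond any threshold an alert compares against.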