How to Set Up Model Fallback Chains (Local to Cloud)

Intermediate · 25 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

Local LLM running, cloud API key, router logic

What this does

Model fallback chains route each request to a primary local LLM first, then automatically escalate to a cloud API when the local model is unavailable, slow, or returning errors. Requests stay on the local path during normal operation, which keeps cloud spend down, and the cloud path preserves reliability when the local path fails.

Steps

Step 1 — Define a health check function for the local model.

Create a lightweight function that sends a probe request to the local endpoint (Ollama by default). Check the response status, latency, and a simple parse of the output. Return True only if all checks pass.
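
A minimal sketch of such a probe, assuming Ollama is listening on its default address (http://localhost:11434) and that listing installed models through GET /api/tags is an acceptable health signal. The function names, the 2-second latency budget, and the injectable fetch parameter are illustrative choices, not part of Ollama.

```python
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address
MAX_PROBE_LATENCY_S = 2.0              # assumed budget; tune for your hardware


def http_get(url, timeout_s):
    """Fetch a URL and return (status_code, body_bytes)."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.status, resp.read()


def local_model_healthy(base_url=OLLAMA_URL,
                        max_latency_s=MAX_PROBE_LATENCY_S,
                        fetch=http_get):
    """Return True only if status, latency, and output parsing all pass."""
    start = time.monotonic()
    try:
        status, body = fetch(f"{base_url}/api/tags", max_latency_s)
        models = json.loads(body).get("models")
    except (OSError, ValueError, AttributeError):
        # OSError covers refused connections, HTTP errors, and socket timeouts;
        # ValueError covers malformed JSON; AttributeError covers non-object JSON.
        return False
    latency_s = time.monotonic() - start
    return status == 200 and latency_s <= max_latency_s and isinstance(models, list)
```

Passing fetch as a parameter lets the check be exercised with a stub, so the three conditions can be tested without a running Ollama instance.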

Step 2 — Set a timeout threshold.

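Pick the maximum time to wait before the local call is considered stalled. For most use cases, 5–15 seconds (5,000–15,000 ms) is long enough for a slow but working local model to respond without leaving the user waiting on a dead one. Store this as a configurable constant, and keep the unit explicit in its name.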
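
One way to keep the threshold configurable is to read it from the environment with a default. This is a sketch only: the variable name LOCAL_LLM_TIMEOUT_MS and the 10-second default are hypothetical choices made for illustration.

```python
import os

DEFAULT_LOCAL_TIMEOUT_MS = 10_000  # assumed default, inside the 5-15 s range


def load_local_timeout_ms(env=os.environ):
    """Read the local-call timeout in milliseconds, falling back to the default."""
    raw = env.get("LOCAL_LLM_TIMEOUT_MS")  # hypothetical variable name
    if raw is None:
        return DEFAULT_LOCAL_TIMEOUT_MS
    value = int(raw)  # raises ValueError on non-numeric input
    if value <= 0:
        raise ValueError("LOCAL_LLM_TIMEOUT_MS must be a positive integer")
    return value


def to_seconds(timeout_ms):
    """Convert to seconds, the unit most Python HTTP clients expect."""
    return timeout_ms / 1000
```

Keeping the stored value in milliseconds and converting at the call site avoids silently mixing units between configuration and client code.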

Step 3 — Build the fallback wrapper.

Implement a wrapper function that attempts the local model first. Wrap the call in a try block. If an exception is raised or the timeout fires, catch it and route the same prompt to the cloud API. Return the local response when it succeeds and the cloud response otherwise.
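
The wrapper could look roughly like the sketch below. It assumes the local model is reached through Ollama's POST /api/generate endpoint with streaming disabled; the model name "llama3.2" is a placeholder. Because this guide does not name a cloud provider, the cloud client is passed in as a plain callable rather than implemented here.

```python
import json
import logging
import urllib.request

logger = logging.getLogger("fallback_chain")


def call_ollama(prompt, timeout_s, model="llama3.2",
                base_url="http://localhost:11434"):
    """Send one non-streaming generate request to a local Ollama server."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    # Note: this is a socket-level timeout, not a strict wall-clock deadline.
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.loads(resp.read())["response"]


def generate_with_fallback(prompt, call_local, call_cloud, timeout_s=10.0):
    """Try the local model first; on any error or timeout, use the cloud."""
    try:
        return call_local(prompt, timeout_s)
    except Exception as exc:
        # Log before falling back so the failure is never silent.
        logger.warning("local model failed (%s: %s); falling back to cloud",
                       type(exc).__name__, exc)
        return call_cloud(prompt)
```

A call such as generate_with_fallback(prompt, call_ollama, my_cloud_client) would wire the two paths together, where my_cloud_client stands in for whatever function wraps your cloud provider's SDK.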

Step 4 — Add a circuit breaker pattern.

Track consecutive failures on the local model. After three consecutive failures, mark the local path as degraded and skip it entirely for a cooldown period (e.g., 60 seconds). This prevents repeated failed attempts from blocking requests.
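
A compact sketch of the breaker described above, using the same thresholds (three failures, 60-second cooldown). The class and method names are illustrative, and the clock is injectable so the cooldown can be tested without waiting.

```python
import time


class CircuitBreaker:
    """Skip the local path after repeated failures, then retry after a cooldown."""

    def __init__(self, failure_threshold=3, cooldown_s=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # time the breaker tripped, or None if closed

    def allow_local(self):
        """Return True if the local model should be attempted right now."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.cooldown_s:
            # Cooldown elapsed: close the breaker and start counting afresh.
            self._opened_at = None
            self._failures = 0
            return True
        return False

    def record_success(self):
        """Reset the failure streak after a successful local call."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        """Count a local failure and trip the breaker at the threshold."""
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

The wrapper would consult allow_local() before each local attempt and call record_success() or record_failure() afterwards. This version resets the counter fully after the cooldown; a stricter variant would reopen the breaker on the first failed trial request.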

Step 5 — Pass routing metadata downstream.

Attach metadata to the response indicating which model handled the request, total latency, and whether a fallback occurred. This data is critical for monitoring and cost attribution.
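
One possible shape for that metadata is a small dataclass returned alongside the text. The field names below are assumptions chosen for illustration; adapt them to whatever your monitoring pipeline expects.

```python
import time
from dataclasses import dataclass


@dataclass
class RoutedResponse:
    text: str            # model output
    handled_by: str      # "local" or "cloud"
    latency_ms: float    # total time, including any failed local attempt
    fallback_used: bool  # True if the cloud path served the request


def route_with_metadata(prompt, call_local, call_cloud):
    """Run the fallback chain and record which path produced the answer."""
    start = time.monotonic()
    try:
        text, handled_by, fallback_used = call_local(prompt), "local", False
    except Exception:
        text, handled_by, fallback_used = call_cloud(prompt), "cloud", True
    latency_ms = (time.monotonic() - start) * 1000
    return RoutedResponse(text, handled_by, latency_ms, fallback_used)
```

Measuring latency across both attempts, rather than only the successful one, makes the cost of a failed local call visible in dashboards.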

Step 6 — Test the chain end-to-end.

Shut down the local service to simulate a failure. Send the same prompt through the wrapper. Verify the cloud API receives and processes the request without any code changes.

  • Confirm the local starting state. Print the active binary, package version, model name, or configuration path before changing the workflow.

  • Run the smallest complete path. Execute the minimum command or script that proves the guide works end to end on the local machine.

  • Compare against expected output. Check the final line, status code, generated artifact, or model response against the verification section before expanding the setup.

  • Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.

Verification

  • Send 10 identical prompts. Stop the local service midway. Confirm all remaining requests succeed via the cloud path.
  • Confirm latency jumps correlate with cloud routing in the response metadata.
  • Confirm the circuit breaker activates after three local failures and recovers after the cooldown period.
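
Before running these checks against real services, the first one can be rehearsed with stubs. The sketch below simulates the local service stopping midway through ten requests; FlakyLocal, cloud_stub, and route are hypothetical stand-ins for the real clients and wrapper.

```python
class FlakyLocal:
    """Stub local model that refuses connections after `fail_after` calls."""

    def __init__(self, fail_after):
        self.fail_after = fail_after
        self.calls = 0

    def __call__(self, prompt):
        self.calls += 1
        if self.calls > self.fail_after:
            raise ConnectionRefusedError("local service stopped")
        return f"local: {prompt}"


def cloud_stub(prompt):
    """Stand-in for the cloud API client."""
    return f"cloud: {prompt}"


def route(prompt, call_local, call_cloud):
    """Minimal fallback wrapper that reports which path answered."""
    try:
        return {"text": call_local(prompt), "handled_by": "local"}
    except Exception:
        return {"text": call_cloud(prompt), "handled_by": "cloud"}


def run_verification(n_requests=10, fail_after=5):
    """Send identical prompts and return the handler for each request."""
    local = FlakyLocal(fail_after)
    results = [route("ping", local, cloud_stub) for _ in range(n_requests)]
    return [r["handled_by"] for r in results]
```

With the defaults, run_verification() returns five "local" entries followed by five "cloud" entries, which is the pattern to look for in the real response metadata.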

Common failures

  • Timeout too short: A timeout of 1–2 seconds may fire on a legitimately slow but functional local model. Increase to 5+ seconds and adjust based on observed p95 latency.
  • Silent fallback: If exceptions are caught but not logged, operators never know the local path failed. Always emit a structured log entry on fallback.
  • Mismatched prompt formats: Local and cloud models may accept different input schemas. Normalize the prompt structure before passing to either endpoint.
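
To guard against the silent-fallback failure above, the wrapper can emit one JSON log line per fallback. This is a sketch: the logger name and field names are assumptions, and the prompt is summarized by length rather than logged verbatim to avoid leaking user content.

```python
import json
import logging

logger = logging.getLogger("fallback_chain")


def fallback_log_entry(exc, prompt_chars, local_model, cloud_model):
    """Build a JSON-serializable record describing one fallback event."""
    return {
        "event": "local_fallback",
        "error_type": type(exc).__name__,
        "error": str(exc),
        "local_model": local_model,
        "cloud_model": cloud_model,
        "prompt_chars": prompt_chars,
    }


def log_fallback(exc, prompt_chars, local_model, cloud_model):
    """Emit the fallback record as a single JSON line at WARNING level."""
    entry = fallback_log_entry(exc, prompt_chars, local_model, cloud_model)
    logger.warning(json.dumps(entry))
    return entry
```

Calling log_fallback from the except branch of the wrapper gives operators a machine-readable record of every escalation, including the exception type that triggered it.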

Related guides

  • How to Implement AI Agent Logging and Audit Trails — records which model handled each request
  • How to Set Up Batch Processing for Large Document Sets — applies fallback chains to batch workloads