What this does

Model fallback chains route requests through a primary local LLM, then automatically escalate to a cloud API when the local model is unavailable, slow, or returning errors. This architecture optimizes cost and latency while maintaining reliability.

Steps

Step 1 — Define a health check function for the local model.

Create a lightweight function that sends a probe request to the local endpoint (Ollama by default). Check the response status, latency, and a simple parse of the output. Return True only if all checks pass.

Step 2 — Set a timeout threshold.

Pick a maximum wait time in milliseconds before the local call is considered stalled. For most use cases, 5–15 seconds balances responsiveness and user experience. Store this as a configurable constant.

Step 3 — Build the fallback wrapper.

Implement a wrapper function that attempts the local model first. Wrap the call in a try block. If an exception is raised or the timeout fires, catch it and route the same prompt to the cloud API. Return whichever response succeeds first.

Step 4 — Add a circuit breaker pattern.

Track consecutive failures on the local model. After three consecutive failures, mark the local path as degraded and skip it entirely for a cooldown period (e.g., 60 seconds). This prevents repeated failed attempts from blocking requests.

Step 5 — Pass routing metadata downstream.

Attach metadata to the response indicating which model handled the request, total latency, and whether a fallback occurred. This data is critical for monitoring and cost attribution.

Step 6 — Test the chain end-to-end.

Shut down the local service to simulate a failure. Send the same prompt through the wrapper. Verify the cloud API receives and processes the request without any code changes.

Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
Confirm the local starting state. Print the active binary, package version, model name, or configuration path before changing the workflow.
Run the smallest complete path. Execute the minimum command or script that proves the guide works end to end on the local machine.
Compare against expected output. Check the final line, status code, generated artifact, or model response against the verification section before expanding the setup.
Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.

Verification

Send 10 identical prompts. Stop the local service midway. Confirm all remaining requests succeed via the cloud path.
Confirm latency jumps correlate with cloud routing in the response metadata.
Confirm the circuit breaker activates after three local failures and recovers after the cooldown period.

Common failures

Timeout too short: A timeout of 1–2 seconds may fire on a legitimately slow but functional local model. Increase to 5+ seconds and adjust based on observed p95 latency.
Silent fallback: If exceptions are caught but not logged, operators never know the local path failed. Always emit a structured log entry on fallback.
Mismatched prompt formats: Local and cloud models may accept different input schemas. Normalize the prompt structure before passing to either endpoint.

Related guides

How to Implement AI Agent Logging and Audit Trails — records which model handled each request
How to Set Up Batch Processing for Large Document Sets — applies fallback chains to batch workloads