
Local inference vs cloud APIs

Local AI is not strictly better than cloud APIs. It wins on privacy, predictable cost, offline capability, and freedom from lock-in. It loses on raw quality at the frontier, on operator complexity, and on time-to-deploy. The honest decision is workload-by-workload.

| Dimension | Local inference (owned hardware) | OpenRouter (multi-vendor router) | Together / Fireworks (open-source cloud) | Frontier APIs (OpenAI / Anthropic / Google) |
|---|---|---|---|---|
| Raw decode speed: tokens per second at the model the operator actually uses. | Acceptable. Bound by your hardware; 30-100 tok/s consumer, 200-400 with vLLM on pro hardware. | Strong. Routes to the fastest provider; usually 100-300 tok/s. | Strong. Optimized inference; competitive with frontier on open-source models. | Excellent. Fastest tokens per second on their proprietary models. |
| Privacy: where your prompts and outputs live. | Excellent. On your machine; logs are yours. | Limited. Routed across multiple providers, each with its own retention policy. | Acceptable. Stated retention controls; trust the policy or run on a dedicated tier. | Limited. Vendor-controlled; enterprise tiers offer a DPA, but the data still leaves your boundary. |
| Offline capability: works with the network unplugged. | Excellent. The whole point. | Poor. Internet required. | Poor. Internet required. | Poor. Internet required. |
| Lock-in risk: what you lose if your vendor changes. | Excellent. Open weights + open runtime; portable. | Strong. Multi-provider abstraction reduces single-vendor lock-in. | Strong. Open-weights focus; switching providers is realistic. | Limited. Closed models; switching means re-prompting and quality regression. |
| Predictable cost: can you forecast next month's bill within ±10%? | Excellent. Capex + electricity; predictable to the cent. | Acceptable. Per-token; predictable if usage is stable. | Acceptable. Per-token, plus dedicated tiers. | Limited. Per-token; one viral product moment can 10x the bill. |
| Latency floor: time to first token under good conditions. | Excellent. Sub-100 ms TTFT typical; no internet round trip. | Strong. Provider-dependent; 200-500 ms typical. | Strong. Optimized network; 200-400 ms. | Strong. Optimized infrastructure; 200-500 ms typical. |
| Model breadth: how many models you can choose from. | Strong. Anything with public weights; the Hugging Face + GGUF library. | Excellent. Hundreds of routes; aggregator-grade breadth. | Strong. Most popular open-source models plus custom fine-tunes. | Limited. Only the vendor's own models. |
| Quality on hardest tasks: frontier-tier reasoning + tool use + long context. | Acceptable. The best open-source models close the gap, but a quality gap to GPT/Claude/Gemini remains. | Strong. You can route to frontier models when you need them. | Strong. Open-source models, served well; same quality ceiling as local. | Excellent. Top-tier on the hardest reasoning and tool use. |
| Compliance / DPA: what enterprise procurement asks about. | Excellent. Data never leaves your infrastructure; air-gapping is real. | Limited. Vendor DPAs vary by route; complex to audit. | Strong. SOC 2, explicit DPAs, and a dedicated tier for stricter requirements. | Strong. Robust enterprise programs; zero-retention options. |
| Operator complexity: hours per month keeping the system working. | Limited. 5-15 hours/month on drivers, runtime updates, and model management. | Excellent. Effectively zero. | Excellent. Effectively zero. | Excellent. Effectively zero. |
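The "predictable cost" row invites a quick breakeven check: at what volume does owned hardware pay for itself? A minimal sketch; every dollar figure below (hardware price, power draw, electricity rate, per-token cloud price) is an illustrative assumption, not a vendor quote:

```python
# Breakeven: owned-hardware inference vs per-token cloud billing.
# All figures are illustrative assumptions, not vendor quotes.

HARDWARE_COST = 2500.0        # one-time capex for a capable GPU box (assumed)
POWER_DRAW_KW = 0.35          # average draw under inference load (assumed)
ELECTRICITY_RATE = 0.15       # $/kWh (assumed)
CLOUD_PRICE_PER_MTOK = 3.0    # blended $/million tokens on a cloud API (assumed)

def monthly_local_cost(hours_online: float) -> float:
    """Electricity only; capex is amortized via breakeven_months."""
    return POWER_DRAW_KW * hours_online * ELECTRICITY_RATE

def monthly_cloud_cost(million_tokens: float) -> float:
    """What the same volume would cost on a per-token API."""
    return million_tokens * CLOUD_PRICE_PER_MTOK

def breakeven_months(million_tokens_per_month: float,
                     hours_online: float = 720) -> float:
    """Months until capex is repaid by the cloud bill you avoid."""
    saved = monthly_cloud_cost(million_tokens_per_month) - monthly_local_cost(hours_online)
    if saved <= 0:
        return float("inf")   # cloud is cheaper at this volume
    return HARDWARE_COST / saved

# At 100M tokens/month: cloud ≈ $300/mo vs ≈ $38/mo of electricity,
# so the box repays its capex in under a year.
```

The shape of the curve is the point: below some volume the answer is infinity and cloud wins outright, which is exactly the "bursty or unpredictable usage" case in the decision rules below.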

Decision rule of thumb

Local wins when: you handle sensitive data, you have predictable inference volume, you want to control the entire stack, you need offline capability, or your workload is a fit for an open-source model.

Cloud wins when: you need frontier-grade reasoning, your usage is bursty or unpredictable, you don't have an operator on staff, or you're prototyping and time-to-first-result matters more than per-token cost.

Hybrid wins more often than people admit: local for the 80% of workloads where it's fine, a cloud frontier API for the 20% where quality matters.
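The hybrid rule above reduces to a trivial router. A sketch, with the model identifiers and the request flags as hypothetical placeholders (a real router would score requests rather than take booleans):

```python
from dataclasses import dataclass

# Route each request: local by default, frontier cloud only when the
# task demands it. Model names and flags are illustrative assumptions.

LOCAL_MODEL = "local/open-weights"   # hypothetical locally served model
FRONTIER_MODEL = "cloud/frontier"    # hypothetical frontier API model

@dataclass
class Request:
    prompt: str
    contains_sensitive_data: bool = False
    needs_frontier_reasoning: bool = False

def route(req: Request) -> str:
    # Privacy trumps quality: sensitive data never leaves the boundary.
    if req.contains_sensitive_data:
        return LOCAL_MODEL
    # Pay the per-token premium only for the hardest ~20% of tasks.
    if req.needs_frontier_reasoning:
        return FRONTIER_MODEL
    return LOCAL_MODEL
```

Note the ordering: the privacy check comes first, so a request that is both sensitive and hard still stays local, which matches the compliance row in the table above.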