My cloud LLM provider just changed pricing. What are my local options?
The answer
One paragraph. No hedging beyond what the data actually warrants.
Cloud LLM pricing changes are a normal part of the market. Local AI is the structural hedge — but only for the workloads where local actually works.
When a vendor changes pricing (splitting a free tier into paid, deprecating a "free" mode, raising per-token rates), the framing matters: it isn't a betrayal; it's how SaaS economics work. Cloud providers held prices below serving cost long enough to capture market share, and eventually the unit economics catch up. The same pattern played out with Anthropic's --print mode shift to credits in May 2026, and earlier with OpenAI's GPT-4 pricing tiers.
What's the structural defense? Run as much of your workload locally as the work itself allows. By workload class:
Coding (your highest-value workload):
- Tool: Aider / Cline / Continue / Tabby / Twinny (all in our /apps directory)
- Model: Qwen 2.5 Coder 32B Q4_K_M on 24GB VRAM, or Qwen 2.5 Coder 7B Q4_K_M on 12GB
- Cost crossover: at typical Claude Sonnet 4.5 coding usage (a few hundred thousand tokens/day across agent loops), a $700 used RTX 3090 typically pays back inside a year. Heavy users running multi-million-token days recoup in months. Plug your actual monthly token volume into /cost-vs-cloud for the exact crossover, or see the worked sketch after this list.
- Editorial honest take: local 32B coding is genuinely competitive with cloud Sonnet/Opus for most coding tasks, especially with agentic loops
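For the impatient, here is the shape of the math /cost-vs-cloud runs, as a minimal Python sketch. The pricing constants (Sonnet-class rates of roughly $3/M input and $15/M output), the 80/20 input/output split, and the power figure are all assumptions; substitute your own numbers from your provider's pricing page and your usage logs.

```python
# Back-of-envelope payback math for a local coding rig vs. a cloud API.
# Every constant here is an assumption -- replace with your own figures.

GPU_COST_USD = 700.0            # used RTX 3090 (assumed street price)
POWER_COST_PER_MONTH = 15.0     # rough guess: ~350W under load, a few hours/day

# Assumed Sonnet-class API pricing (USD per million tokens); check current rates.
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00
INPUT_FRACTION = 0.8            # agent loops are input-heavy; adjust to taste

def monthly_api_cost(tokens_per_day: float) -> float:
    """Cloud spend per 30-day month at the assumed pricing and token split."""
    monthly_tokens = tokens_per_day * 30
    input_cost = monthly_tokens * INPUT_FRACTION * INPUT_PRICE_PER_M / 1e6
    output_cost = monthly_tokens * (1 - INPUT_FRACTION) * OUTPUT_PRICE_PER_M / 1e6
    return input_cost + output_cost

def payback_months(tokens_per_day: float) -> float:
    """Months until the GPU pays for itself, net of local power cost."""
    savings = monthly_api_cost(tokens_per_day) - POWER_COST_PER_MONTH
    return float("inf") if savings <= 0 else GPU_COST_USD / savings

for tpd in (100_000, 300_000, 500_000, 1_000_000, 3_000_000):
    print(f"{tpd:>9,} tokens/day -> ${monthly_api_cost(tpd):7.2f}/mo cloud, "
          f"payback in {payback_months(tpd):6.1f} months")
```

Under these assumptions the under-a-year payback arrives around the half-million-tokens/day mark, and 100K tokens/day never pays back at all; your actual input/output split and rates will move both points.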
Chat (the easy migration):
- Tool: Open WebUI, Jan, LibreChat, AnythingLLM (all in /apps)
- Model: Llama 3.1 8B Q4_K_M, Qwen 3 14B Q4_K_M, or Qwen 3 32B Q4_K_M (if you have the VRAM)
- Cost crossover: nearly any consumer GPU pays back within months for daily chat use (plumbing sketch after this list)
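All of the chat front-ends above can point at a local runtime; if you want to sanity-check the plumbing without a UI, Ollama exposes an OpenAI-compatible endpoint on its default port. A minimal sketch, assuming a stock Ollama install and that you've already pulled the model tag (the api_key value is a placeholder the server ignores):

```python
# Minimal chat call against a local Ollama server through its
# OpenAI-compatible endpoint (a default install listens on port 11434).
# Assumes `ollama pull llama3.1:8b` has already been run.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the client library, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Summarize the tradeoffs of local vs. cloud LLMs."}],
)
print(response.choices[0].message.content)
```

The same base_url trick is how most of the listed apps attach to a local runtime, which is why migrating chat is usually a config change rather than a rewrite.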
RAG / docs:
- Tool: PrivateGPT, Verba, Khoj, AnythingLLM
- Model: 7-8B base model + local embedder (bge-small, nomic-embed)
- Editorial honest take: this is where local wins decisively. Your documents stay on your hardware, with no per-query cost and no rate limits (toy retrieval loop after this list)
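The moving parts are small enough to sketch. Below is a toy retrieval loop using bge-small via sentence-transformers plus plain numpy; the documents and query are made-up examples, and a real setup (what PrivateGPT and friends do for you) adds chunking and a persistent vector store on top:

```python
# Bare-bones local RAG: embed documents with a small local model,
# retrieve by cosine similarity, and stuff the hits into a prompt.
# bge-small runs comfortably on CPU; nothing leaves the machine.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")  # small, CPU-friendly

docs = [  # illustrative stand-ins for your real document chunks
    "Invoices are archived under /finance/2025 as signed PDFs.",
    "The VPN config was rotated in March; see the ops runbook.",
    "Quarterly OKRs live in the planning wiki, not in email.",
]
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

query = "Where do we keep old invoices?"
query_vec = embedder.encode([query], normalize_embeddings=True)[0]

# With normalized vectors, cosine similarity is just a dot product.
scores = doc_vecs @ query_vec
best = np.argsort(scores)[::-1][:2]

context = "\n".join(docs[i] for i in best)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # feed this to any local 7-8B chat model
```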
Voice / transcription:
- Tool: MacWhisper, Buzz, OpenedAI-Speech (TTS)
- Cost crossover: pays back almost instantly for any sustained transcription workload (minimal pipeline sketch below)
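The desktop apps above wrap Whisper-family models; the equivalent pipeline in Python is a few lines via the open-source openai-whisper package. A minimal sketch (the filename is a placeholder, and ffmpeg must be on the PATH):

```python
# Minimal local transcription with the open-source whisper package
# (pip install openai-whisper; also requires ffmpeg installed).
import whisper

model = whisper.load_model("small")    # "medium"/"large" trade speed for accuracy
result = model.transcribe("meeting.mp3")  # placeholder filename
print(result["text"])

# Segment-level timestamps, if the workload needs them:
for seg in result["segments"]:
    print(f"[{seg['start']:7.1f}s -> {seg['end']:7.1f}s] {seg['text']}")
```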
What stays cloud (be honest with yourself):
- Frontier-quality one-shot tasks where local 32B genuinely isn't enough
- Cutting-edge multimodal (vision-language at frontier scale)
- 405B+ model needs (datacenter-only locally)
- Workloads where latency dominates cost: if waiting on a slower local model costs you more in time than the API costs in dollars, stay cloud
The migration order: start with the workload where local genuinely beats cloud on quality + cost + latency. That's coding for most operators. Run Aider with Qwen 2.5 Coder 32B for a week. If the gap is small enough (it usually is), expand from there. Don't try to migrate everything at once.
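If you'd rather script that trial week than babysit a terminal, Aider also exposes a Python scripting interface. A sketch, assuming a default local Ollama install with qwen2.5-coder:32b already pulled (the filename and prompt are placeholders):

```python
# Drive Aider from Python against a local Qwen coder via Ollama.
# The ollama_chat/ model prefix and scripting interface follow Aider's
# documented usage; adjust the model tag to what you've pulled.
import os
from aider.coders import Coder
from aider.models import Model

os.environ["OLLAMA_API_BASE"] = "http://127.0.0.1:11434"

model = Model("ollama_chat/qwen2.5-coder:32b")
coder = Coder.create(main_model=model, fnames=["app.py"])  # placeholder file
coder.run("add input validation to the config-parsing function")
```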
Sanity check first: before any panic-migration, run /cost-vs-cloud with your actual monthly volume. The break-even on a $700 used 3090 vs Claude Sonnet 4.5 typically lands at 5-10M tokens/month of API usage. Below that, the migration overhead isn't worth it.
Explore the numbers for your specific stack
Where we got the numbers
Anthropic --print mode → credits change: r/ClaudeAI May 2026 threads (533+ upvotes, 463+ comments). General SaaS-pricing-unit-economics observation: industry analysis since 2020. Local-vs-cloud crossover math: /cost-vs-cloud calculator + /compounder.
Also see
- Aider, Cline, Continue, Tabby, Twinny: all run fully against local Ollama / vLLM.
- Decision rule by agent style + model pairing.
- The TCO compounder: drag a daily-volume slider, watch the crossover point.
- Full rig recipe: GPU + runtime + model + install script for replacing cloud coding.
Found this via a forum search? Bookmark the URL — we update these pages as new data lands. Have a question that should live here? Open a GitHub issue.