Gemini CLI with local models.
Google's Gemini CLI is the third frontier-cloud coding agent, alongside Codex and Claude Code, and like them it can be pointed at a local backend. The setup is slightly more involved than the other two (Gemini speaks its own wire protocol, so a LiteLLM proxy in the middle does the translation), but the mental model is the same: cloud tool, local brain. Here's the verified path.
TL;DR
Set GOOGLE_GEMINI_BASE_URL to a LiteLLM Proxy endpoint that translates Gemini-format requests to your local Ollama / vLLM / llama.cpp backend. LiteLLM's model_group_alias feature maps the Gemini model names the CLI requests (gemini-3-flash-preview, etc.) to whichever local model you want to serve behind them.
Tool calling is the hard part. Gemini CLI sends requests that carry a tools parameter; for those to work end-to-end, the local model's Ollama chat template must support tool-call schemas. Qwen 2.5 and Hermes 3/4 do; many older base models don't.
Cleaner alternative for Gemini-curious operators: Ollama-Code is a community fork of Gemini CLI that strips out the cloud dependency and uses local Ollama directly. No proxy needed.
Editorial stance
RunLocalAI is brand-agnostic. We don't earn referral fees from Google, Ollama, LiteLLM, or any other tool covered on this page. This guide documents the route from cloud-default to local backend — not an endorsement of Gemini CLI as the right surface for local coding. § 9 names the natively-local alternatives that often fit better.
See /how-we-make-money.
What Gemini CLI is
Gemini CLI is Google's open-source terminal AI agent (google-gemini/gemini-cli). Like Codex and Claude Code it reads your repo, plans changes, edits files, executes commands. Default backend is the Gemini family (Flash / Pro) in the cloud. Unique feature: the agent's code is fully open source, so the community has produced multiple local-first forks — see § 7.
Why LiteLLM is needed here
Unlike Codex (which uses OpenAI-compatible Responses API) and Claude Code (which now has direct Ollama support via the Anthropic Messages compat endpoint), Gemini CLI sends requests in Google's GenAI wire format. No local backend speaks that protocol natively — so the cleanest path is to put LiteLLM Proxy in between, accept Gemini-format requests on one side, and translate to OpenAI-format calls against Ollama on the other.
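To see what the proxy is translating, compare the two request shapes side by side. This is a minimal sketch: the prompt and model names are placeholders, and only the structurally different fields are shown.

```bash
# Gemini wire format (what Gemini CLI emits): Google's generateContent call.
curl -s "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"contents": [{"role": "user", "parts": [{"text": "Explain this diff"}]}]}'

# OpenAI-compatible format (what Ollama serves locally).
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-coder:32b", "messages": [{"role": "user", "content": "Explain this diff"}]}'
```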
LiteLLM's model_group_alias feature is the key knob: it maps a requested model name (what Gemini CLI asks for) to a different actual model (what the proxy calls). That lets you keep Gemini CLI's UI/model-picker unchanged while quietly serving local models behind it.
Setup walkthrough
```bash
# 1. Install Gemini CLI (npm — see google-gemini/gemini-cli docs)
npm install -g @google/gemini-cli

# 2. Run Ollama with a tool-use-capable coder
ollama serve &
ollama pull qwen2.5:7b          # 12GB rig, lightweight + tool-use
# OR
ollama pull qwen2.5-coder:32b   # 24GB rig, stronger

# 3. Install LiteLLM Proxy
pip install 'litellm[proxy]'

# 4. Write litellm-config.yaml — Gemini ↔ Ollama bridge
cat > litellm-config.yaml <<'YAML'
model_list:
  - model_name: gemini-3-flash-preview      # what Gemini CLI asks for
    litellm_params:
      model: ollama/qwen2.5-coder:32b       # what LiteLLM actually calls
      api_base: http://localhost:11434

router_settings:
  model_group_alias:
    "gemini-3-pro-preview": "gemini-3-flash-preview"
    "gemini-2.5-flash": "gemini-3-flash-preview"
    "gemini-2.5-pro": "gemini-3-flash-preview"
YAML

# 5. Start the proxy
litellm --config litellm-config.yaml --port 4000

# 6. Point Gemini CLI at LiteLLM
export GOOGLE_GEMINI_BASE_URL="http://localhost:4000"
export GEMINI_API_KEY="sk-local"   # any non-empty string

# 7. Run
cd ~/your-repo
gemini
```

The model_group_alias block matters: Gemini CLI ships with several model names hard-coded. Mapping them all to your one local model means the agent never encounters “model not found” errors when it tries to pick a different size for a specific task.
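Before launching the agent, it's worth confirming the alias mapping actually resolves. LiteLLM Proxy also exposes an OpenAI-compatible /v1/chat/completions route, which makes for a quick smoke test; the prompt is arbitrary, and the Bearer token only matters if you've set a master key.

```bash
# Ask the proxy for a Gemini-named model; the alias should route it to Ollama.
curl -s http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-local" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.5-pro",
    "messages": [{"role": "user", "content": "Reply with the single word: ready"}]
  }'
# A normal completion back means the gemini-2.5-pro → qwen2.5-coder:32b route works;
# a "model not found" error means the model_group_alias block didn't load.
```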
Tool-use matters — model picks
Gemini CLI relies heavily on tool calls (file read, file write, shell execute, etc.). The local model behind LiteLLM must support tools properly in its Ollama chat template — many base models don't. 2026 picks that hold up:
Note on the Hermes line: “specifically tuned for tool-use loops” reflects Nous Research's published function-calling fine-tunes plus observed uptake across community recipes on r/LocalLLaMA + Ollama threads. We don't hold an audited usage count; treat the framing as a popular default rather than a measured winner.
| Model | VRAM | Tool-use note |
|---|---|---|
| Qwen 2.5 Coder 32B | 24GB | Reliable tool-call schema; best 32B-class option |
| Qwen 2.5 7B | 12GB | Smallest defensible pick; official Ollama template supports tools |
| Hermes 3 8B | 12GB | Specifically tuned for tool-use loops — most robust small-class pick when tool-call reliability matters more than reasoning depth |
| Hermes 4 70B | 48GB+ | Strongest local pick for agentic loops when you have the headroom — tool-use is the differentiator |
| DeepSeek Coder V2 16B | 16GB | Decent coder but tool-call drift more common than Qwen line |
Avoid base Llama variants without tool-tuning — Gemini CLI will surface confused error states when the model hallucinates tool argument structure.
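Before wiring a candidate model behind the proxy, you can check its tool-call behaviour against Ollama directly: send a tools array to /api/chat and see whether a structured tool_calls entry comes back. The run_shell function below is a made-up stand-in, not one of Gemini CLI's actual tools.

```bash
curl -s http://localhost:11434/api/chat -d '{
  "model": "qwen2.5-coder:32b",
  "stream": false,
  "messages": [{"role": "user", "content": "List the files in the current directory."}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "run_shell",
      "description": "Run a shell command and return its output",
      "parameters": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"]
      }
    }
  }]
}'
# A tool-capable template returns message.tool_calls with structured arguments;
# models without tool support reply in plain prose or mangle the schema.
```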
Or: the Ollama-Code community fork
Several community forks of Gemini CLI strip out the cloud dependency entirely and ship with local Ollama support baked in. One active fork is Ollama-Code (ausstein/gemini-cli-ollama) — a privacy-focused fork of Gemini CLI (with code borrowed from Qwen Code) that processes everything on your infrastructure with no external API calls. Check commit cadence against alternatives before adopting; we don't hold a measured comparison.
When to pick the fork over the official-CLI-plus-LiteLLM path:
- You care about zero outbound traffic guarantees — Google's official CLI may still send telemetry; a fork is auditable
- You want one fewer moving part — no LiteLLM proxy process to babysit
- You're comfortable trading the official-vendor upgrade path for community maintenance velocity
Trade-offs: forks lag behind upstream Gemini CLI on new features and fixes. The pure-local fork is the better path for stable, privacy-strict daily use; the official-CLI-plus-LiteLLM path is the better path if you want Google's feature pace and just need local backend flexibility.
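If you want to evaluate the fork, the usual clone-and-build flow for a Node CLI is the obvious starting point. Treat the commands below as a sketch: the script names are assumptions, so defer to the fork's README for the real steps.

```bash
# Assumed build flow; verify against ausstein/gemini-cli-ollama's README.
git clone https://github.com/ausstein/gemini-cli-ollama
cd gemini-cli-ollama
npm install        # pull dependencies
npm run build      # script name may differ; check package.json
```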
Limits + when to keep using cloud
- Multi-step planning. Gemini Pro's frontier reasoning at scale (deep refactors, complex architectural moves) outperforms local 32B coders. Local is fine for “fix this bug” loops; cloud wins on “migrate this module” jobs.
- Multimodal Gemini. Gemini CLI's vision + audio handoffs need a vision-capable model. Local LLaVA / Llama-3.2-Vision works for basic image input but doesn't match cloud Gemini on complex visual reasoning.
- Tool-call drift. Gemini CLI updates its tool schemas. A new release can break a working LiteLLM mapping until the config catches up.
- Google AI Studio free tier. For small workloads, Gemini API has a generous free quota. Local is cheaper at scale but the break-even kicks in higher than for Codex/Claude — see /cost-calculator.
Natively-local alternatives
Same recommendation pattern as the Codex and Claude Code companion guides — if local backend is the priority, the natively-local tools usually feel cleaner:
- Aider — terminal-driven, git-aware, OpenAI-compatible from day one. The simplest path on local; see the sketch after this list.
- Cline — VS Code integrated; pairs naturally with Ollama through an OpenAI-compatible endpoint.
- Roo Code — faster-moving Cline fork with multi-mode personas.
- OpenHands — autonomous browser-native agent; runs full app generation in a sandboxed environment.
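For contrast with the proxy route above, here is roughly what the no-translation path looks like with Aider against the same Ollama instance. The env var and model prefix follow Aider's Ollama integration, but check Aider's current docs before relying on the exact flags.

```bash
# Aider talks to Ollama directly; no LiteLLM proxy in the middle.
export OLLAMA_API_BASE=http://localhost:11434
aider --model ollama_chat/qwen2.5-coder:32b
```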
Related guides:
- Codex guide: OpenAI's side of the same pattern.
- Claude Code guide: Anthropic's side. Two paths: Ollama-direct or LiteLLM.
- Ollama setup: get a local Ollama backend running first.
- Tools index: every local coding tool we track + when each fits.