Codex CLI with local models.
OpenAI's Codex CLI is a cloud-first coding agent for the terminal — but it speaks the OpenAI-compatible API, so it can be pointed at any local backend that does too. Here's how to set that up against Ollama / LM Studio / MLX, which local coder models actually work for agentic coding, and when a natively-local alternative (Aider, Cline) is the cleaner path.
TL;DR
Yes, Codex CLI works with local models. Set OPENAI_API_BASE=http://localhost:11434/v1, use OPENAI_API_KEY=ollama (any non-empty string), configure wire_api = "responses" in ~/.codex/config.toml, and Codex talks to your local Ollama instead of the OpenAI API. Or pass --oss on the command line for the same effect.
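In one screen (a minimal sketch; the model tag is the 24GB example used throughout this page):

```bash
# Point Codex CLI at a local Ollama endpoint instead of the OpenAI API
# (plus wire_api = "responses" in ~/.codex/config.toml; see the walkthrough below).
export OPENAI_API_BASE="http://localhost:11434/v1"
export OPENAI_API_KEY="ollama"    # any non-empty string
codex --model qwen2.5-coder:32b   # or: codex --oss
```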
Honest caveat: Codex CLI's prompts and tool-calling patterns are tuned for GPT-class models. Smaller local coders (≤32B) handle simple completions and edits but struggle with the multi-step planning Codex expects on harder tasks. The native-local-first alternatives (Aider, Cline, Roo Code) work better for serious agentic coding on local hardware — see § 8.
Editorial stance
RunLocalAI is brand-agnostic. We don't earn referral fees from OpenAI, Ollama, LM Studio, MLX, or anything else on this page. The purpose is to document a configuration that brings a cloud-default tool into a local-first workflow — not to promote Codex CLI, OpenAI, or any specific provider. If a natively-local tool fits your situation better (it often does), we say so plainly in § 8.
Same stance applies across every tool-focused editorial. See /how-we-make-money.
What Codex CLI is
Codex is OpenAI's coding-agent CLI. It runs in your terminal, reads your repo, plans multi-step changes, and writes / applies / iterates on them. The default backend is OpenAI's GPT-class models hosted in the cloud. The wire protocol is the standard OpenAI Responses API, which recent builds of the common local servers (Ollama, vLLM, llama.cpp's server) also speak through their OpenAI-compatible endpoints.
The implication for local-AI operators: anything that exposes an OpenAI-compatible endpoint can take Codex CLI's requests. Codex doesn't know — or care — whether the responses come from gpt-5-codex in the cloud or qwen2.5-coder:32b running on your 4090.
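An easy way to check whether a backend qualifies is the model-listing route that OpenAI-compatible servers implement. The port below assumes Ollama's default; adjust for LM Studio or vLLM:

```bash
# If this returns JSON listing your coder model, Codex CLI can talk to it.
curl -s http://localhost:11434/v1/models \
  -H "Authorization: Bearer ollama"   # placeholder key; local servers ignore it
```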
Why route it through local
- Privacy. Your code never leaves your machine. Critical for proprietary, regulated, or pre-publication work.
- Cost. Codex CLI against GPT can spend $10-50/day on real agentic workflows. Local backend = $0 marginal cost after the electricity.
- Offline. Travel, air-gap, network-flake. Local works.
- Familiar UX. You already know Codex CLI's commands. Keeping the surface and swapping the backend is lower friction than learning a whole new tool.
The trade you make: smaller local models are weaker on hard reasoning tasks than the frontier cloud models Codex was tuned for. § 6 is honest about which local coders hold up.
Setup walkthrough
The path most operators take in 2026:
```bash
# 1. Install Codex CLI (pip or npm; see OpenAI docs for current method)
pip install openai-codex

# 2. Make sure Ollama is running with a coder model
ollama serve &
ollama pull qwen2.5-coder:32b      # 24GB rig
# OR
ollama pull deepseek-coder-v2:16b  # 16GB rig

# 3. Tell Codex CLI where to find the local endpoint
export OPENAI_API_BASE="http://localhost:11434/v1"
export OPENAI_API_KEY="ollama"     # any non-empty string

# 4. (Optional) write ~/.codex/config.toml for persistence
mkdir -p ~/.codex
cat > ~/.codex/config.toml <<'TOML'
[model_providers.ollama]
name = "Ollama"
base_url = "http://localhost:11434/v1"
wire_api = "responses"

[providers.default]
provider = "ollama"
model = "qwen2.5-coder:32b"
TOML

# 5. Run it
cd ~/your-repo
codex --model qwen2.5-coder:32b
# OR if config.toml is set up:
codex
```
The wire_api = "responses" setting matters: OpenAI is sunsetting Chat Completions support in Codex CLI, so the Responses API is the only one guaranteed to keep working. Ollama's OpenAI-compatible endpoint speaks Responses from 0.5 onward; verify your local runtime's version before configuring.
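A quick way to confirm the Responses path before pointing Codex at it. This probe assumes your Ollama build exposes /v1/responses as described above; older builds will 404 here:

```bash
# Minimal Responses-API probe against the local endpoint.
curl -s http://localhost:11434/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ollama" \
  -d '{"model": "qwen2.5-coder:32b", "input": "Reply with OK."}'
```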
Alternative: pass --oss on the command line — Codex CLI ships an “OSS mode” that routes automatically to a local OpenAI-compatible provider on the default port.
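In practice that's the shortest invocation of all, assuming your local provider is already listening on its default port:

```bash
cd ~/your-repo
codex --oss   # auto-routes to the local OpenAI-compatible provider
```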
Which local models work well
Not all local coders work equally well as Codex backends. The honest 2026 ranking on consumer hardware:
Note on the Hermes line: “specifically tuned for tool-use loops” reflects Nous Research's published function-calling fine-tunes plus observed uptake across community recipes on r/LocalLLaMA + Ollama threads. We don't hold an audited usage count; treat the framing as a popular default rather than a measured winner.
| Model | VRAM | Codex-fit |
|---|---|---|
| Qwen 2.5 Coder 32B | 24GB | Best 2026 local choice — strong tool-calling + planning |
| Qwen 3 Coder 32B | 24GB | Newer, similar profile; pick whichever Ollama serves cleanly |
| Hermes 3 8B | 12GB | Specifically tuned for tool-use loops — most robust small-class pick when tool-call reliability matters more than reasoning depth |
| DeepSeek Coder V2 16B | 16GB | Solid 16GB option; weaker on multi-step planning |
| Hermes 4 70B | 48GB+ | Strongest local pick for agentic loops when you have the headroom — tool-use is the differentiator |
| Llama 3.3 70B Instruct | 48GB+ | Generalist; weaker than dedicated coders on code-specific tasks |
| GPT-OSS 20B | 14GB | Codex's own OSS default — designed for the OSS flow; fastest path on smaller rigs |
The 70B-class jump is real but expensive in VRAM. For most workflows on a 24GB card, Qwen 2.5 Coder 32B is the sweet spot — see its catalog page for measured tok/s.
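Swapping tiers is a one-key change against the config from the walkthrough above. A sketch for a 16GB rig, reusing the same keys; nothing here is new beyond the model tag from the table:

```bash
# Re-point the default provider at the 16GB-class coder from the table.
cat > ~/.codex/config.toml <<'TOML'
[model_providers.ollama]
name = "Ollama"
base_url = "http://localhost:11434/v1"
wire_api = "responses"

[providers.default]
provider = "ollama"
model = "deepseek-coder-v2:16b"
TOML
```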
Limits vs the cloud default
Honest about what local-Codex doesn't match yet:
- Multi-step planning quality. Frontier GPT can hold ~10-step refactors in working memory; 32B local coders typically fragment at 4-5 steps. Break work into smaller chunks.
- Long-context tool use. Codex CLI on GPT routinely operates on 128K+ context. Locally, 32K is the comfortable ceiling on 24GB; longer contexts mean degraded tool-call reliability. (A sketch for pinning the window follows this list.)
- Edge-case language coverage. Smaller models have less code in their training mix. Mainstream JS / Python / Rust / Go is fine; niche languages (Erlang, Crystal) and obscure DSLs degrade noticeably.
- Tool-call schema strictness. Codex expects clean JSON tool calls. Models without strong tool-use training (older base models) hallucinate tool arguments. Use Hermes 3/4 or Qwen 2.5 Coder, which are tuned for it.
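One practical mitigation for the context ceiling: pin the window explicitly, since runtimes often default lower than the model supports. A hedged sketch using Ollama's Modelfile num_ctx parameter and the 32K figure from the list above; the -32k tag is our own naming:

```bash
# Build a 32K-context variant of the coder model via an Ollama Modelfile.
cat > Modelfile <<'EOF'
FROM qwen2.5-coder:32b
PARAMETER num_ctx 32768
EOF
ollama create qwen2.5-coder:32b-32k -f Modelfile

# Point Codex at the variant (same flag as in the walkthrough above):
codex --model qwen2.5-coder:32b-32k
```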
Natively-local alternatives
The honest editorial line: Codex CLI works against local models, but it's a cloud-first tool routed to a local backend. Tools designed from the start for local backends usually feel better. Three to consider:
- Aider. Terminal-driven, git-aware, diff-first. Built with local backends as a first-class target. The lowest-friction path if you already live in the terminal (see the sketch after this list).
- Cline. VS Code-integrated autonomous loop. Multi-mode personas, project-level rules. Pairs naturally with Ollama via an OpenAI-compatible endpoint.
- Roo Code. A faster-moving Cline fork. Architect / Code / Ask / Debug / Orchestrator modes; opinionated about how to use local models.
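For contrast with the Codex routing above, Aider's native local path is two lines. A hedged sketch; the ollama_chat/ model prefix and the OLLAMA_API_BASE variable follow Aider's documented conventions, so check current docs if they've shifted:

```bash
# Aider pointed at the same local Ollama backend.
export OLLAMA_API_BASE="http://127.0.0.1:11434"
aider --model ollama_chat/qwen2.5-coder:32b
```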
For the full ecosystem map see State of Local AI 2026 § 5 and the coding-agents map.
Related
- Same pattern, Anthropic's side: routing via ANTHROPIC_BASE_URL + LiteLLM.
- Same pattern, Google's side: via GOOGLE_GEMINI_BASE_URL + a LiteLLM bridge.
- Get an Ollama backend running first via Docker, then point any CLI at it.
- Every local coding tool we track, and when each fits.