Codex CLI with local models.
OpenAI's Codex CLI is a cloud-first coding agent for the terminal — but it speaks the OpenAI-compatible API, so it can be pointed at any local backend that does too. Here's how to set that up against Ollama / LM Studio / MLX, which local coder models actually work for agentic coding, and when a natively-local alternative (Aider, Cline) is the cleaner path.
TL;DR
Yes, Codex CLI works with local models. Set OPENAI_API_BASE=http://localhost:11434/v1, use OPENAI_API_KEY=ollama (any non-empty string), configure wire_api = "responses" in ~/.codex/config.toml, and Codex talks to your local Ollama instead of the OpenAI API. Or pass --oss on the command line for the same effect.
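In one screen (a minimal sketch; the model tag is the 24GB example used throughout this page):

```bash
# Point Codex CLI at a local Ollama endpoint instead of the OpenAI API
# (plus wire_api = "responses" in ~/.codex/config.toml; see the walkthrough below).
export OPENAI_API_BASE="http://localhost:11434/v1"
export OPENAI_API_KEY="ollama"    # any non-empty string
codex --model qwen2.5-coder:32b   # or: codex --oss
```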
Honest caveat: Codex CLI's prompts and tool-calling patterns are tuned for GPT-class models. Smaller local coders (≤32B) handle simple completions and edits but struggle with the multi-step planning Codex expects on harder tasks. The native-local-first alternatives (Aider, Cline, Roo Code) work better for serious agentic coding on local hardware — see § 8.
Editorial stance
RunLocalAI is brand-agnostic. We don't earn referral fees from OpenAI, Ollama, LM Studio, MLX, or anything else on this page. The purpose is to document a configuration that brings a cloud-default tool into a local-first workflow — not to promote Codex CLI, OpenAI, or any specific provider. If a natively-local tool fits your situation better (it often does), we say so plainly in § 8.
Same stance applies across every tool-focused editorial. See /how-we-make-money.
What Codex CLI is
Codex is OpenAI's coding-agent CLI. It runs in your terminal, reads your repo, plans multi-step changes, and writes / applies / iterates on them. The default backend is OpenAI's GPT-class models hosted in the cloud. The wire protocol is the standard OpenAI Responses API, which recent builds of the common local servers (Ollama, vLLM, llama.cpp's server) also speak through their OpenAI-compatible endpoints.
The implication for local-AI operators: anything that exposes an OpenAI-compatible endpoint can take Codex CLI's requests. Codex doesn't know — or care — whether the responses come from gpt-5-codex in the cloud or qwen2.5-coder:32b running on your 4090.
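An easy way to check whether a backend qualifies is the model-listing route that OpenAI-compatible servers implement. The port below assumes Ollama's default; adjust for LM Studio or vLLM:

```bash
# If this returns JSON listing your coder model, Codex CLI can talk to it.
curl -s http://localhost:11434/v1/models \
  -H "Authorization: Bearer ollama"   # placeholder key; local servers ignore it
```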
Why route it through local
- Privacy. Your code never leaves your machine. Critical for proprietary, regulated, or pre-publication work.
- Cost. Codex CLI against GPT can spend $10-50/day on real agentic workflows. Local backend = $0 marginal cost after the electricity.
- Offline. Travel, air-gap, network-flake. Local works.
- Familiar UX. You already know Codex CLI's commands. Keeping the surface and swapping the backend is lower friction than learning a whole new tool.
The trade you make: smaller local models are weaker on hard reasoning tasks than the frontier cloud models Codex was tuned for. § 6 is honest about which local coders hold up.
Setup walkthrough
The path most operators take in 2026:
```bash
# 1. Install Codex CLI (pip or npm; see OpenAI docs for current method)
pip install openai-codex

# 2. Make sure Ollama is running with a coder model
ollama serve &
ollama pull qwen2.5-coder:32b      # 24GB rig
# OR
ollama pull deepseek-coder-v2:16b  # 16GB rig

# 3. Tell Codex CLI where to find the local endpoint
export OPENAI_API_BASE="http://localhost:11434/v1"
export OPENAI_API_KEY="ollama"     # any non-empty string

# 4. (Optional) write ~/.codex/config.toml for persistence
mkdir -p ~/.codex
cat > ~/.codex/config.toml <<'TOML'
[model_providers.ollama]
name = "Ollama"
base_url = "http://localhost:11434/v1"
wire_api = "responses"

[providers.default]
provider = "ollama"
model = "qwen2.5-coder:32b"
TOML

# 5. Run it
cd ~/your-repo
codex --model qwen2.5-coder:32b
# OR if config.toml is set up:
codex
```
The wire_api = "responses" setting matters: OpenAI is sunsetting Chat Completions support in Codex CLI, so the Responses API is the only one guaranteed to keep working. Ollama's OpenAI-compatible endpoint speaks Responses from 0.5 onward; verify your local runtime's version before configuring.
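A quick way to confirm the Responses path before pointing Codex at it. This probe assumes your Ollama build exposes /v1/responses as described above; older builds will 404 here:

```bash
# Minimal Responses-API probe against the local endpoint.
curl -s http://localhost:11434/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ollama" \
  -d '{"model": "qwen2.5-coder:32b", "input": "Reply with OK."}'
```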
Alternative: pass --oss on the command line — Codex CLI ships an “OSS mode” that routes automatically to a local OpenAI-compatible provider on the default port.
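In practice that's the shortest invocation of all, assuming your local provider is already listening on its default port:

```bash
cd ~/your-repo
codex --oss   # auto-routes to the local OpenAI-compatible provider
```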
Which local models work well
Not all local coders work equally well as Codex backends. The honest 2026 ranking on consumer hardware:
Note on the Hermes line: “specifically tuned for tool-use loops” reflects Nous Research's published function-calling fine-tunes plus observed uptake across community recipes on r/LocalLLaMA + Ollama threads. We don't hold an audited usage count; treat the framing as a popular default rather than a measured winner.
| Model | VRAM | Codex-fit |
|---|---|---|
| Qwen 2.5 Coder 32B | 24GB | Best 2026 local choice — strong tool-calling + planning |
| Qwen 3 Coder 32B | 24GB | Newer, similar profile; pick whichever Ollama serves cleanly |
| Hermes 3 8B | 12GB | Specifically tuned for tool-use loops — most robust small-class pick when tool-call reliability matters more than reasoning depth |
| DeepSeek Coder V2 16B | 16GB | Solid 16GB option; weaker on multi-step planning |
| Hermes 4 70B | 48GB+ | Strongest local pick for agentic loops when you have the headroom — tool-use is the differentiator |
| Llama 3.3 70B Instruct | 48GB+ | Generalist; weaker than dedicated coders on code-specific tasks |
| GPT-OSS 20B | 14GB | Codex's own OSS default — designed for the OSS flow; fastest path on smaller rigs |
The 70B-class jump is real but expensive in VRAM. For most workflows on a 24GB card, Qwen 2.5 Coder 32B is the sweet spot — see its catalog page for measured tok/s.
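Swapping tiers is a one-key change against the config from the walkthrough above. A sketch for a 16GB rig, reusing the same keys; nothing here is new beyond the model tag from the table:

```bash
# Re-point the default provider at the 16GB-class coder from the table.
cat > ~/.codex/config.toml <<'TOML'
[model_providers.ollama]
name = "Ollama"
base_url = "http://localhost:11434/v1"
wire_api = "responses"

[providers.default]
provider = "ollama"
model = "deepseek-coder-v2:16b"
TOML
```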
Limits vs the cloud default
Honest about what local-Codex doesn't match yet:
- Multi-step planning quality. Frontier GPT can hold ~10-step refactors in working memory; 32B local coders typically fragment at 4-5 steps. Break work into smaller chunks.
- Long-context tool use. Codex CLI on GPT routinely operates on 128K+ context. Locally, 32K is the comfortable ceiling on 24GB; longer contexts mean degraded tool-call reliability. (A sketch for pinning the window follows this list.)
- Edge-case language coverage. Smaller models have less code in their training mix. Mainstream JS / Python / Rust / Go is fine; niche languages (Erlang, Crystal) and obscure DSLs degrade noticeably.
- Tool-call schema strictness. Codex expects clean JSON tool calls. Models without strong tool-use training (older base models) hallucinate tool arguments. Use Hermes 3/4 or Qwen 2.5 Coder, which are tuned for it.
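One practical mitigation for the context ceiling: pin the window explicitly, since runtimes often default lower than the model supports. A hedged sketch using Ollama's Modelfile num_ctx parameter and the 32K figure from the list above; the -32k tag is our own naming:

```bash
# Build a 32K-context variant of the coder model via an Ollama Modelfile.
cat > Modelfile <<'EOF'
FROM qwen2.5-coder:32b
PARAMETER num_ctx 32768
EOF
ollama create qwen2.5-coder:32b-32k -f Modelfile

# Point Codex at the variant (same flag as in the walkthrough above):
codex --model qwen2.5-coder:32b-32k
```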
Natively-local alternatives
The honest editorial line: Codex CLI works against local models, but it's a cloud-first tool routed to a local backend. Tools designed from the start for local backends usually feel better. Three to consider:
- Aider. Terminal-driven, git-aware, diff-first. Built with local backends as a first-class target. The lowest-friction path if you already live in the terminal (see the sketch after this list).
- Cline. VS Code-integrated autonomous loop. Multi-mode personas, project-level rules. Pairs naturally with Ollama via an OpenAI-compatible endpoint.
- Roo Code. A faster-moving Cline fork. Architect / Code / Ask / Debug / Orchestrator modes; opinionated about how to use local models.
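For contrast with the Codex routing above, Aider's native local path is two lines. A hedged sketch; the ollama_chat/ model prefix and the OLLAMA_API_BASE variable follow Aider's documented conventions, so check current docs if they've shifted:

```bash
# Aider pointed at the same local Ollama backend.
export OLLAMA_API_BASE="http://127.0.0.1:11434"
aider --model ollama_chat/qwen2.5-coder:32b
```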
For the full ecosystem map see State of Local AI 2026 § 5 and the coding-agents map.
Related
- Same pattern, Anthropic's side: routing via ANTHROPIC_BASE_URL + LiteLLM.
- Same pattern, Google's side: via GOOGLE_GEMINI_BASE_URL + a LiteLLM bridge.
- Get an Ollama backend running first via Docker, then point any CLI at it.
- Every local coding tool we track, and when each fits.