
Command R+ 104B

Cohere's flagship — RAG-tuned, multilingual. Open weights but non-commercial.

License: CC BY-NC 4.0 · Released Aug 30, 2024 · Context: 131,072 tokens
Our verdict
By Fredoline Eruo · Last verified May 6, 2026
7.8/10
Positioning

Command R+ is Cohere's enterprise-aimed open-weight flagship — 104B parameters with explicit training for retrieval-augmented generation and tool-call workflows. The model to pick when RAG is the primary workload and you can afford workstation-class hardware.

Strengths
  • Best-in-class RAG behavior — explicitly trained on citation-aware retrieval.
  • Strong multilingual performance — better than Llama on European + Asian languages.
  • Mature tool-use format — Cohere's API conventions translate cleanly.
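The citation-aware retrieval behavior boils down to putting retrieved snippets into the prompt with stable ids the model can cite. A minimal sketch of such a request payload for Ollama's `/api/generate` endpoint is below — the prompt layout here is an illustrative assumption, not Cohere's official grounded-generation template, which uses its own special tokens.

```python
import json

def build_rag_prompt(documents, question):
    """Inline retrieved snippets into one prompt with citable ids.
    NOTE: illustrative layout only -- Cohere publishes a dedicated
    grounded-generation template; this is not that exact format."""
    parts = ["Answer using only the documents below. Cite document ids."]
    for i, doc in enumerate(documents):
        parts.append(f"[doc {i}] {doc}")
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)

docs = [
    "Command R+ has a 131,072-token context window.",
    "Command R+ weights are CC BY-NC 4.0 licensed.",
]
payload = {
    "model": "command-r-plus:104b-q4_K_M",  # tag from the install section below
    "prompt": build_rag_prompt(docs, "What is the context length?"),
    "stream": False,
}
print(json.dumps(payload)[:80])
```

POSTing this JSON to a local Ollama server (`http://localhost:11434/api/generate`) returns a completion that can cite `[doc 0]` / `[doc 1]`.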
Limitations
  • CC BY-NC 4.0 license permits non-commercial use only — commercial deployment requires a separate license from Cohere.
  • Heavy VRAM cost — ~58 GB at Q4; partial offload required on 24 GB cards.
  • General-chat quality below Llama 3.3 70B despite the size advantage.
Real-world performance on RTX 4090
  • Q4_K_M (58 GB) — heavy offload: 12–18 tok/s, 64+ GB system RAM required
  • Q5_K_M (68 GB) — workstation only
  • Q8_0 (104 GB) — workstation only
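The "heavy offload" figure follows from simple arithmetic: the quantized file is far bigger than a 24 GB card, so only a fraction of the layers can live on the GPU. A crude estimator is sketched below — the 64-layer count and the ~3 GB reserve for KV cache/overhead are assumptions; in practice you tune the layer count empirically.

```python
def gpu_layers(model_gb, n_layers, vram_gb, reserve_gb=3.0):
    """Rough partial-offload estimate: split the quantized model evenly
    across layers and see how many fit in free VRAM.
    Assumptions: uniform per-layer size, ~3 GB reserved for KV cache."""
    per_layer = model_gb / n_layers
    return int((vram_gb - reserve_gb) / per_layer)

# 58 GB Q4_K_M file, 64 layers assumed, RTX 4090 with 24 GB
print(gpu_layers(58, 64, 24))
```

The estimate is deliberately conservative; a smaller context or KV-cache quantization frees room for a few more layers, which is how settings like the `--n-gpu-layers ~30` suggested below become workable.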
Should you run this locally?

Yes, for RAG-heavy non-commercial work, or commercial deployments with the Cohere license. No, for general-purpose chat — Llama 3.3 70B is more memory-efficient and equally good.

How it compares
  • vs Llama 3.3 70B → Llama wins on general use + license; Command R+ wins on RAG specifically.
  • vs Command R 35B → 104B is meaningfully smarter; 35B is the practical-VRAM Cohere pick.
  • vs DeepSeek R1 Distill Llama 70B → R1 Distill wins on reasoning; Command R+ wins on RAG/tool-use behavior.
Run this yourself
ollama pull command-r-plus:104b-q4_K_M
ollama run command-r-plus:104b-q4_K_M
Settings: Q4_K_M GGUF, 16384 ctx, --n-gpu-layers ~30, RTX 4090 + 96 GB DDR5
Why this rating

7.8/10 — Cohere's enterprise-tuned 104B with built-in RAG and tool-use scaffolding. Excellent at structured RAG workflows, but eclipsed by Llama 3.3 70B for general use. Loses points on license restrictiveness and VRAM cost.

Overview

Cohere's flagship — RAG-tuned, multilingual. Open weights but non-commercial.

Strengths

  • Best open RAG model at release
  • Multilingual

Weaknesses

  • Non-commercial
  • 70 GB+ VRAM

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization   File size   VRAM required
Q4_K_M         60.0 GB     70 GB
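The file size falls out of parameter count times average bits per weight. A back-of-envelope check is below — the bits-per-weight averages are approximations (K-quants mix block formats), chosen to roughly match the sizes quoted in this review.

```python
def gguf_size_gb(params_b, bits_per_weight):
    """Back-of-envelope GGUF file size: parameters * avg bits/weight.
    bits-per-weight values are approximate averages, not exact."""
    return params_b * bits_per_weight / 8

# assumed averages: Q4_K_M ~4.5 bpw, Q5_K_M ~5.2 bpw, Q8_0 ~8.0 bpw
for name, bpw in [("Q4_K_M", 4.5), ("Q5_K_M", 5.2), ("Q8_0", 8.0)]:
    print(f"{name}: ~{gguf_size_gb(104, bpw):.0f} GB")
```

The VRAM-required column adds headroom on top of the file size for KV cache and activations, which is why 60 GB of weights wants ~70 GB of memory.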

Get the model

Ollama

One-line install

ollama run command-r-plus:104b

HuggingFace

Original weights

huggingface.co/CohereForAI/c4ai-command-r-plus

Source repository — raw weights only; you'll need to quantize them yourself (e.g. to GGUF).

Hardware that runs this

Cards with enough VRAM for at least one quantization of Command R+ 104B.

Compare alternatives

Models worth comparing

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Step up
More capable — bigger memory footprint
No reviewed models in the next tier up yet.

Frequently asked

What's the minimum VRAM to run Command R+ 104B?

70 GB of VRAM is enough to run Command R+ 104B at the Q4_K_M quantization (file size 60.0 GB). Higher-quality quantizations need more.

Can I use Command R+ 104B commercially?

Command R+ 104B is released under the CC BY-NC 4.0 license, which prohibits commercial use of the weights. Review the license terms before using it in a product.

What's the context length of Command R+ 104B?

Command R+ 104B supports a context window of 131,072 tokens (about 131K).
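For RAG workloads, the practical question is how much of that 131,072-token window the retrieved context eats before the model starts answering. Quick arithmetic, with a hypothetical ~500 tokens/page figure:

```python
CTX = 131_072  # Command R+ context window, in tokens

def generation_budget(prompt_tokens, ctx=CTX):
    """Tokens left for the model's reply after the prompt is loaded."""
    return max(ctx - prompt_tokens, 0)

# e.g. ~100 pages of retrieved text at an assumed ~500 tokens/page
print(generation_budget(100 * 500))  # 81072
```

Note that the local runtime caps this too: the `16384 ctx` setting suggested above uses only a fraction of the model's window to keep KV-cache memory manageable.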

How do I install Command R+ 104B with Ollama?

Run `ollama pull command-r-plus:104b` to download, then `ollama run command-r-plus:104b` to start a chat session. The default quantization is Q4_K_M.

Source: huggingface.co/CohereForAI/c4ai-command-r-plus

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.