
What can NVIDIA GB200 NVL72 run for agents?

Build: NVIDIA GB200 NVL72 + — + 32 GB RAM (Windows)

Memory: 13824 GB VRAM + 32 GB system RAM
Runner: llama.cpp / Ollama (CUDA)
Use case: Any · Chat · Coding · Agents · Reasoning · Vision · Long context · Creative

Runs comfortably: 147 models

Ranked by fit for the agents use case plus predicted speed. Click a row for the VRAM breakdown; how the components relate to the totals is sketched after the list.

#1 Hermes 3 Llama 3.1 8B · 8B · hermes · Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 22.9 GB · Headroom: 13801.1 GB
ollama run hermes3:8b
612 tok/s (E)
Weights 8.50 GB · KV cache 4.00 GB · Activations 8.62 GB · Runtime 1.80 GB
Model details → · Run-on benchmark page →
#2 Dolphin 3.0 Mistral 24B · 24B · dolphin · Commercial OK
Quant: Q4_K_M · Context: 8,192 · VRAM: 37.2 GB · Headroom: 13786.8 GB
ollama run dolphin-mistral:24b
359 tok/s (E)
Weights 14.49 GB · KV cache 12.00 GB · Activations 8.92 GB · Runtime 1.80 GB
Model details → · Run-on benchmark page →
#3 Qwen 3 30B-A3B · 30B · qwen · Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 58.5 GB · Headroom: 13765.5 GB
ollama run qwen3:30b
163 tok/s (E)
Weights 31.88 GB · KV cache 15.00 GB · Activations 9.79 GB · Runtime 1.80 GB
Model details → · Run-on benchmark page →
#4 Qwen 2.5 Coder 32B Instruct · 32B · qwen · Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 47.8 GB · Headroom: 13776.2 GB
ollama run qwen2.5-coder:32b
153 tok/s (E)
Weights 34.00 GB · KV cache 2.15 GB · Activations 9.89 GB · Runtime 1.80 GB
Model details → · Run-on benchmark page →
#5 Qwen 3 32B · 32B · qwen · Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 61.7 GB · Headroom: 13762.3 GB
ollama run qwen3:32b
153 tok/s (E)
Weights 34.00 GB · KV cache 16.00 GB · Activations 9.89 GB · Runtime 1.80 GB
Model details → · Run-on benchmark page →
#6 Qwen 2.5 32B Instruct · 32B · qwen · Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 61.7 GB · Headroom: 13762.3 GB
ollama run qwen2.5:32b
153 tok/s (E)
Weights 34.00 GB · KV cache 16.00 GB · Activations 9.89 GB · Runtime 1.80 GB
Model details → · Run-on benchmark page →
#7 Llama 3.1 Nemotron 70B Instruct · 70B · llama · Commercial OK
Quant: Q4_K_M · Context: 8,192 · VRAM: 89.4 GB · Headroom: 13734.6 GB
ollama run nemotron:70b
123 tok/s (E)
Weights 42.26 GB · KV cache 35.00 GB · Activations 10.31 GB · Runtime 1.80 GB
Model details → · Run-on benchmark page →
#8 Hermes 3 Llama 3.1 70B · 70B · hermes · Commercial OK
Quant: Q4_K_M · Context: 8,192 · VRAM: 89.4 GB · Headroom: 13734.6 GB
ollama run hermes3:70b
123 tok/s (E)
Weights 42.26 GB · KV cache 35.00 GB · Activations 10.31 GB · Runtime 1.80 GB
Model details → · Run-on benchmark page →
#9 DeepSeek R1 Distill Llama 70B · 70B · deepseek · Commercial OK
Quant: Q5_K_M · Context: 8,192 · VRAM: 95.5 GB · Headroom: 13728.5 GB
ollama run deepseek-r1:70b
108 tok/s (E)
Weights 48.13 GB · KV cache 35.00 GB · Activations 10.60 GB · Runtime 1.80 GB
Model details → · Run-on benchmark page →
#10 Llama 3.1 70B Instruct · 70B · llama · Commercial OK
Quant: Q5_K_M · Context: 8,192 · VRAM: 95.5 GB · Headroom: 13728.5 GB
ollama run llama3.1:70b
108 tok/s (E)
Weights 48.13 GB · KV cache 35.00 GB · Activations 10.60 GB · Runtime 1.80 GB
Model details → · Run-on benchmark page →
#11 Qwen 2.5 72B Instruct · 72B · qwen · Commercial OK
Quant: Q5_K_M · Context: 8,192 · VRAM: 98.0 GB · Headroom: 13726.0 GB
ollama run qwen2.5:72b
105 tok/s (E)
Weights 49.50 GB · KV cache 36.00 GB · Activations 10.67 GB · Runtime 1.80 GB
Model details → · Run-on benchmark page →
#12 Qwen 2.5 14B Instruct · 14B · qwen · Commercial OK
Quant: Q8_0 · Context: 8,192 · VRAM: 32.6 GB · Headroom: 13791.4 GB
ollama run qwen2.5:14b
350 tok/s (E)
Weights 14.88 GB · KV cache 7.00 GB · Activations 8.94 GB · Runtime 1.80 GB
Model details → · Run-on benchmark page →
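
How the per-row figures fit together: the four breakdown components add up to the row's VRAM requirement, and headroom is the build's total VRAM minus that requirement. Below is a minimal Python sketch using the #1 row's numbers; the helper name is illustrative, not the compatibility engine's.

# Minimal sketch of how each row's breakdown relates to its totals, using the
# #1 row (Hermes 3 Llama 3.1 8B, Q8_0, 8,192 context). Illustrative only.
TOTAL_VRAM_GB = 13824.0  # GB200 NVL72 memory as listed in the build summary

def required_vram_gb(weights: float, kv_cache: float, activations: float, runtime: float) -> float:
    # The row's VRAM figure is simply the sum of the four listed components.
    return weights + kv_cache + activations + runtime

required = required_vram_gb(weights=8.50, kv_cache=4.00, activations=8.62, runtime=1.80)
headroom = TOTAL_VRAM_GB - required
print(f"required ≈ {required:.1f} GB, headroom ≈ {headroom:.1f} GB")  # ≈ 22.9 GB and 13801.1 GB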

What if you upgraded?

Hypothetical scenarios. We re-ran the compatibility engine for each.

+32 GB system RAM

~$80–150

Doubles your CPU-offload working set. Helps when models don't quite fit in VRAM.

Unlocks: 36 newly comfortable models, including:
  • Gemma 3 1B
  • Llama 3.2 1B Instruct
  • Gemma 4 E2B (Effective 2B)
  • Llama 3.2 3B Instruct
Shop this upgrade↗

Add a second NVIDIA GB200 NVL72

see current pricing

Tensor parallelism splits the model across both units, effectively doubling VRAM. Bandwidth doesn't double, so expect roughly 1.5× single-unit speed in practice (both upgrade heuristics are sketched after these scenarios).

Unlocks: 36 newly comfortable models, including:
  • Gemma 3 1B
  • Llama 3.2 1B Instruct
  • Gemma 4 E2B (Effective 2B)
  • Llama 3.2 3B Instruct
Shop this upgrade↗

Some links above are affiliate links. We may earn a commission at no extra cost to you. How we make money.
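
Both scenarios lean on simple rules of thumb: extra system RAM grows the CPU-offload budget, and a second unit pools VRAM while delivering roughly 1.5× single-unit speed under tensor parallelism. The Python sketch below restates those rules; the multipliers are the ones quoted on this page, and the function names are illustrative, not the engine's.

# Minimal sketch of the two upgrade heuristics described above. The 0.60 RAM
# factor (quoted in the "Won't run" section below), the pooled VRAM, and the
# ~1.5x tensor-parallel speedup are this page's figures; the code is ours.

def offload_budget_gb(vram_gb: float, system_ram_gb: float) -> float:
    # CPU-offload budget: VRAM plus 60% of system RAM, so adding RAM grows
    # only the system-RAM share of the budget.
    return vram_gb + 0.60 * system_ram_gb

def add_second_unit(vram_gb: float, single_unit_tok_s: float) -> tuple[float, float]:
    # Tensor parallelism pools VRAM across both units, but bandwidth does not
    # scale linearly, so speed is taken as ~1.5x the single-unit estimate.
    return 2 * vram_gb, 1.5 * single_unit_tok_s

print(offload_budget_gb(13824, 32))        # baseline budget ≈ 13843 GB
print(offload_budget_gb(13824, 32 + 32))   # after +32 GB RAM ≈ 13862 GB (+~19 GB)
print(add_second_unit(13824, 153))         # (27648.0, 229.5) for a 153 tok/s single-unit row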

Won't run: top 3 popular models

These need more memory than you have. Shown for orientation; the cutoff rule they fail is sketched after the list.

Qwen 3.6 35B-A3B (MTP) · 35B · qwen · Commercial OK
Even with CPU offload, needs more memory than your VRAM (13824 GB) + 60% of system RAM (19 GB) combined.
—
Qwen 3.6 27B (MTP) · 27B · qwen · Commercial OK
Even with CPU offload, needs more memory than your VRAM (13824 GB) + 60% of system RAM (19 GB) combined.
—
Ring-2.6-1T · 1000B · other · Commercial OK
Even with CPU offload, needs more memory than your VRAM (13824 GB) + 60% of system RAM (19 GB) combined.
—
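
The cutoff quoted in each entry above, restated as a check. A minimal sketch: the 60% system-RAM budget and the memory figures come from the messages above, while the function itself is illustrative rather than the engine's actual code.

# Minimal sketch of the cutoff quoted above: a model is ruled out when even
# CPU offload cannot cover it, i.e. its requirement exceeds VRAM plus 60% of
# system RAM. Illustrative only; the real engine may apply further checks.
VRAM_GB = 13824.0
SYSTEM_RAM_GB = 32.0
OFFLOAD_BUDGET_GB = VRAM_GB + 0.60 * SYSTEM_RAM_GB  # ≈ 13824 + 19 GB

def wont_run(required_memory_gb: float) -> bool:
    # True when the model's estimated requirement exceeds the combined budget.
    return required_memory_gb > OFFLOAD_BUDGET_GB

print(f"offload budget ≈ {OFFLOAD_BUDGET_GB:.0f} GB")  # ≈ 13843 GB
print(wont_run(14000.0))  # True: anything above the budget lands in this list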

How to read these numbers

M · Measured — we ran this exact combo on owner hardware.
~ · Extrapolated — predicted from a measured benchmark on similar-bandwidth hardware.
E · Estimated — pure formula based on VRAM bandwidth and model architecture (rough sketch below).

Full methodology →
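
Every row on this page carries the E badge. The site's exact formula isn't published here, so the following is only a rough sketch of what a bandwidth-bound decode estimate looks like: it assumes each generated token streams the full quantized weights once, and the effective-bandwidth constant is a hypothetical value picked so the example lands near the #1 row, not a spec-sheet number.

# Rough sketch of a bandwidth-bound decode estimate. Illustration only: the
# real engine also accounts for model architecture, and the bandwidth constant
# below is a hypothetical fit, not a published figure.
EFFECTIVE_BANDWIDTH_GB_S = 5200.0  # assumed sustained weight-read bandwidth

def estimated_tok_per_s(weights_gb: float) -> float:
    # Assume every decoded token reads the full quantized weights once, so
    # throughput is bounded by bandwidth / bytes read per token.
    return EFFECTIVE_BANDWIDTH_GB_S / weights_gb

print(round(estimated_tok_per_s(8.50)))   # ~612 tok/s, near the #1 row above
print(round(estimated_tok_per_s(34.00)))  # ~153 tok/s, near the Qwen 3 32B row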

Want a specific benchmark we don't have? Email support@runlocalai.co and we'll prioritize it.