Best GPU for KoboldCpp
An honest 2026 guide to picking a GPU for KoboldCpp roleplay and story workflows. Long context is the dominant constraint — why 24 GB is the sweet spot for 32B Q4 at 32K context, and why the Mac mini works as a silent always-on server.
The short answer
For KoboldCpp roleplay and story workflows, 24 GB VRAM is the sweet spot — it fits 32B Q4 at 32K context comfortably. The used RTX 3090 at $800 is the single best card for this workload.
KoboldCpp is context-hungry. A 32B Q4 model at 32K context can consume 22-24 GB of VRAM — right at the 24 GB ceiling. Long context is the dominant constraint, and KV-cache VRAM scales linearly with context length. If you prioritize story continuity over raw throughput, VRAM trumps compute here.
For silent always-on KoboldCpp servers, the Mac mini M4 Pro 48 GB at $1,999 is a stealth pick — 48 GB of unified memory runs 70B Q4 with long context comfortably, with zero fan noise and 20W idle draw vs 200W+ for an x86 tower.
The picks, ranked by buyer-leverage
16 GB · $450-550 (2026 retail)
Fits 13B Q4 at 32K context comfortably. Good entry for basic roleplay — but 22B+ models don't fit with long context.
Good for:
- KoboldCpp users running 13B-class models (Mistral Nemo, Llama 3.1 8B)
- First-time roleplay setups under $600
- Shorter-context stories (4-8K tokens)
Not for:
- Users targeting 32B Q4 + 32K context (doesn't fit in 16 GB)
- 70B Q4 KoboldCpp (non-starter)
- Buyers who can stretch to a used 3090 (the extra 8 GB is game-changing)
24 GB · $700-1,000 (used RTX 3090, 2026)
Fits 32B Q4 at 32K context. The de facto KoboldCpp roleplay card — long context, real coherence.
Good for:
- 32B Q4 roleplay at 32K context
- Long-form story generation with full session memory
- Cost-conscious builders who want the KoboldCpp sweet spot
Not for:
- Buyers who hate used silicon
- Silent-server deployments (3090 fans are audible)
- 70B Q4 long-context (48 GB is the real tier for that)
24 GB · $1,400-1,900 used / $1,800-2,200 new
Same 24 GB as 3090 but 20-30% faster prompt eval on KoboldCpp. Better thermals, quieter fan curve.
Good for:
- KoboldCpp users who value prompt evaluation speed
- 32K context on 22B models (fits comfortably)
- Buyers wanting new silicon with warranty
Not for:
- Multi-session roleplay hosts (concurrent sessions exceed 24 GB)
- Buyers stretching to a 5090 for 70B long-context
- Cost-conscious builders (a used 3090 is half the price)
48 GB · $1,999 (Mac mini M4 Pro 48 GB, 2026)
48 GB unified runs 70B Q4 + 32K context silently. Best always-on KoboldCpp server — zero fan noise, 20W idle.
Good for:
- Always-on KoboldCpp roleplay server
- 70B Q4 models with 32K context
- Silent operation requirements (bedroom/home-office server)
Not for:
- Prompt evaluation speed enthusiasts (CUDA cards are faster)
- CUDA-locked KoboldCpp features
- Builders who prefer tinkering with multi-GPU PCIe rigs
Honesty: why benchmark numbers on this page might not reflect your real experience
- tok/s is not user experience. Humans read at ~10-15 tok/s — anything above that is buffer time, not perceived speed.
- Context length changes everything. A 70B Q4 model at 1024 tokens generates ~25 tok/s; the same model at 32K context drops to ~8-12 tok/s as the KV cache fills.
- Quantization changes the conclusion. Q4_K_M vs Q5_K_M vs Q8 produce different speed AND different quality. A benchmark at one quant doesn't translate to another.
- Thermal throttling changes long sessions. The first 15 minutes of a benchmark see boost-clock peak; the next 4 hours see steady-state, which is 5-15% slower depending on case airflow.
- Driver and runtime versions silently shift winners. A 2024 benchmark on PyTorch 2.4 + CUDA 12.4 doesn't reflect 2026 reality on PyTorch 2.6 + CUDA 12.6. Discount benchmarks older than 6 months.
- Vendor and YouTuber benchmarks are cherry-picked. The standard 'Llama 3.1 70B Q4 at 1024 tokens' chart shows peak decode on a tiny prompt — exactly the conditions least representative of daily use.
- Our ranking is by workload fit at the buyer's actual budget — not by raw benchmark order. A faster card that doesn't fit your workload ranks below a slower card that does.
We try to surface these caveats where they apply. If a number on this page reads more confident than it should, please email us via the contact page. See also our methodology and editorial philosophy.
How to think about VRAM tiers
KoboldCpp is context-dominant. A 32B Q4 model at 4K context uses ~18 GB; at 32K context, the same model uses ~23 GB. KV-cache VRAM grows roughly linearly with context — on a 32B-class model, about 0.2 GB per additional 1K tokens. For roleplay coherence, long context is non-negotiable.
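As a rough back-of-envelope check, total VRAM ≈ model weights + KV cache + runtime overhead. Here is a minimal sketch in Python — the per-1K-token KV cost and the overhead constant are assumptions calibrated to the figures above, and real values vary with layer count, GQA configuration, and cache quantization:

```python
def kobold_vram_estimate_gb(
    weights_gb: float,            # quantized model file size (~16 GB assumed for a 32B Q4)
    context_tokens: int,          # planned context window
    kv_gb_per_1k: float = 0.18,   # assumed KV-cache cost per 1K tokens on a 32B-class model
    overhead_gb: float = 1.5,     # assumed compute buffers + runtime overhead
) -> float:
    """Back-of-envelope VRAM budget: weights + KV cache + fixed overhead."""
    kv_cache_gb = (context_tokens / 1024) * kv_gb_per_1k
    return weights_gb + kv_cache_gb + overhead_gb

print(f"32B Q4 @ 4K:  {kobold_vram_estimate_gb(16.0, 4096):.1f} GB")   # ~18 GB
print(f"32B Q4 @ 32K: {kobold_vram_estimate_gb(16.0, 32768):.1f} GB")  # ~23 GB
```

Plugging in ~40 GB of weights for a 70B Q4 makes it obvious why long-context 70B work belongs in the 48 GB tier.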
- 8 GB — 7B Q4 at 4K context only. Basic roleplay, short memory window. Not recommended for serious use.
- 12 GB — 7B-8B Q4 at 8K context. 13B Q4 fits at 4K with offloading. Entry-level roleplay viable.
- 16 GB — 13-22B Q4 at 16K context. Solid mid-tier roleplay. 32B Q4 only fits at low context.
- 24 GB (KoboldCpp sweet spot) — 32B Q4 at 32K context comfortable. Coherent long-form storytelling. The roleplay goldilocks tier.
- 32-48 GB+ — 70B Q4 at 32K context. Full-session memory for complex multi-character roleplay.
Frequently asked questions
Does KoboldCpp need a fast GPU?
For prompt evaluation: yes — faster GPU = faster context processing. For token generation: less so — most modern GPUs deliver acceptable tok/s on quantized models. The VRAM constraint matters more than raw TFLOPS for KoboldCpp. A 3090 with 24 GB beats a 5080 with 16 GB on 32B Q4 long-context workloads.
Can I run KoboldCpp without a GPU?
Yes, KoboldCpp supports CPU-only inference via GGUF. CPU tok/s is model-dependent — 13B Q4 on a Ryzen 7950X runs at 8-12 tok/s, usable for casual roleplay. Larger models drop to 1-3 tok/s. GPU acceleration is strongly recommended for daily use.
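For reference, a minimal CPU-only launch, wrapped in Python purely for illustration — the model path is hypothetical, and flag availability can vary by KoboldCpp build, so check `python koboldcpp.py --help`:

```python
import subprocess

# CPU-only KoboldCpp launch (model path is hypothetical).
# --gpulayers 0 keeps every layer on the CPU; --threads should roughly match physical cores.
subprocess.run([
    "python", "koboldcpp.py",
    "--model", "models/llama-3.1-8b-instruct.Q4_K_M.gguf",
    "--gpulayers", "0",
    "--threads", "16",
    "--contextsize", "8192",
    "--port", "5001",
])
```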
How much VRAM do I need for long-context roleplay?
For coherent long-form roleplay: 24 GB minimum. 16 GB works for shorter sessions (4-8K context). The KV cache is the hidden VRAM eater — at 32K context on a 32B model, the cache alone consumes 4-6 GB. Plan for model weights + KV cache + headroom.
Is the Mac mini a good KoboldCpp server?
Yes, for serving. The M4 Pro 48 GB runs 70B Q4 at 32K silently, drawing 20W idle. Inference is slower than an RTX desktop (10-15 tok/s vs 30+ tok/s), but the silence + always-on reliability make it a compelling roleplay server for LAN access.
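A sketch of what an always-on LAN launch might look like on the Mac — assumptions: the model path is hypothetical, a large --gpulayers value requests full offload (Metal on Apple Silicon), and --host 0.0.0.0 exposes the server to the LAN; verify flag behavior against your build:

```python
import subprocess

# Always-on LAN server sketch (model path is hypothetical).
# --gpulayers 999 requests full offload (clamped to the model's actual layer count);
# --host 0.0.0.0 binds all interfaces so other machines on the LAN can connect.
subprocess.run([
    "python", "koboldcpp.py",
    "--model", "models/llama-3.1-70b-instruct.Q4_K_M.gguf",
    "--gpulayers", "999",
    "--contextsize", "32768",
    "--host", "0.0.0.0",
    "--port", "5001",
])
```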
Why does context length matter so much in KoboldCpp?
KoboldCpp is built for story-driven interaction. Short context = the model forgets earlier plot points and character details within minutes. Long context (16K-32K) maintains narrative coherence across multi-hour sessions. If roleplay quality matters, don't skimp on VRAM for context.
Do I need a dedicated GPU just for KoboldCpp?
Not necessarily. KoboldCpp coexists happily with other LLM tools — Ollama, vLLM, and llama.cpp can share the same GPU as long as they run sequentially. But if you run other VRAM-heavy tools (ComfyUI, video gen) at the same time, budget VRAM for the combined load.
Go deeper
- Best GPU for Llama models — Llama-family hardware picks (common KoboldCpp target)
- Best GPU for local AI (pillar) — All picks ranked across model families
- Best AI PC for developers — Rig builds that handle KoboldCpp + dev tools
- Running local AI on multiple GPUs — When one card isn't enough for 70B long-context
When it doesn't work
Hardware bought, set up correctly, still failing? See our guides to the highest-volume local-AI errors and their fixes.
Common alternatives readers consider:
- If your budget is tighter → best budget GPU for local AI
- If you'd rather buy used → best used GPU for local AI
- If you're on Apple Silicon → best Mac for local AI
- If you're not sure what fits your build → the will-it-run checker
- If you don't want to buy anything yet → our editorial philosophy