Best GPU for KoboldCpp
An honest 2026 guide to picking a GPU for KoboldCpp roleplay and story workflows. Long context is the dominant constraint — why 24 GB is the sweet spot for 32B Q4 at 32K context, and why the Mac mini works as a silent always-on server.
The short answer
For KoboldCpp roleplay and story workflows, 24 GB VRAM is the sweet spot — it fits 32B Q4 at 32K context comfortably. The used RTX 3090 at $800 is the single best card for this workload.
KoboldCpp is context-hungry. A 32B Q4 model at 32K context can consume 22-24 GB of VRAM — right at the 24 GB ceiling. Long context is the dominant constraint, and KV-cache VRAM scales linearly with context length. If you prioritize story continuity over raw throughput, VRAM trumps compute here.
For silent always-on KoboldCpp servers, the Mac mini M4 Pro 48 GB at $1,999 is a stealth pick — 48 GB of unified memory runs 70B Q4 with long context comfortably, with zero fan noise and 20W idle draw vs 200W+ for an x86 tower.
The picks, ranked by buyer-leverage
16 GB · $450-550 (2026 retail)
Fits 13B Q4 at 32K context comfortably. Good entry for basic roleplay — but 22B+ models don't fit with long context.
Good for:
- KoboldCpp users running 13B-class models (Mistral Nemo, Llama 3.1 8B)
- First-time roleplay setups under $600
- Shorter-context stories (4-8K tokens)
Not for:
- Users targeting 32B Q4 + 32K context (doesn't fit in 16 GB)
- 70B Q4 KoboldCpp (non-starter)
- Buyers who can stretch to a used 3090 (the extra 8 GB is game-changing)
24 GB · $700-1,000 (used RTX 3090, 2026)
Fits 32B Q4 at 32K context. The de facto KoboldCpp roleplay card — long context, real coherence.
Good for:
- 32B Q4 roleplay at 32K context
- Long-form story generation with full session memory
- Cost-conscious builders who want the KoboldCpp sweet spot
Not for:
- Buyers who hate used silicon
- Silent-server deployments (3090 fans are audible)
- 70B Q4 long-context (48 GB is the real tier for that)
24 GB · $1,400-1,900 used / $1,800-2,200 new
Same 24 GB as 3090 but 20-30% faster prompt eval on KoboldCpp. Better thermals, quieter fan curve.
Good for:
- KoboldCpp users who value prompt evaluation speed
- 32K context on 22B models (fits comfortably)
- Buyers wanting new silicon with warranty
Not for:
- Multi-session roleplay hosts (concurrent sessions exceed 24 GB)
- Buyers stretching to a 5090 for 70B long-context
- Cost-conscious builders (a used 3090 is half the price)
48 GB · $1,999 (Mac mini M4 Pro 48 GB, 2026)
48 GB unified runs 70B Q4 + 32K context silently. Best always-on KoboldCpp server — zero fan noise, 20W idle.
Good for:
- Always-on KoboldCpp roleplay server
- 70B Q4 models with 32K context
- Silent operation requirements (bedroom/home-office server)
Not for:
- Prompt evaluation speed enthusiasts (CUDA cards are faster)
- CUDA-locked KoboldCpp features
- Builders who prefer tinkering with multi-GPU PCIe rigs
Honesty: why benchmark numbers on this page might not reflect your real experience
- tok/s is not user experience. Humans read at ~10-15 tok/s — anything above that is buffer time, not perceived speed.
- Context length changes everything. A 70B Q4 model at 1024 tokens generates ~25 tok/s; the same model at 32K context drops to ~8-12 tok/s as the KV cache fills.
- Quantization changes the conclusion. Q4_K_M vs Q5_K_M vs Q8 produce different speed AND different quality. A benchmark at one quant doesn't translate to another.
- Thermal throttling changes long sessions. The first 15 minutes of a benchmark see boost-clock peak; the next 4 hours see steady-state, which is 5-15% slower depending on case airflow.
- Driver and runtime versions silently shift winners. A 2024 benchmark on PyTorch 2.4 + CUDA 12.4 doesn't reflect 2026 reality on PyTorch 2.6 + CUDA 12.6. Discount benchmarks older than 6 months.
- Vendor and YouTuber benchmarks are cherry-picked. The standard 'Llama 3.1 70B Q4 at 1024 tokens' chart shows peak decode on a tiny prompt — exactly the conditions least representative of daily use.
- Our ranking is by workload fit at the buyer's actual budget — not by raw benchmark order. A faster card that doesn't fit your workload ranks below a slower card that does.
We try to surface these caveats where they apply. If a number on this page reads more confident than it should, please email us via the contact page. See also our methodology and editorial philosophy.
How to think about VRAM tiers
KoboldCpp is context-dominant. A 32B Q4 model at 4K context uses ~18 GB; at 32K context, the same model uses ~23 GB. KV-cache VRAM grows roughly linearly with context — on a 32B-class model, about 0.2 GB per additional 1K tokens. For roleplay coherence, long context is non-negotiable.
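As a rough back-of-envelope check, total VRAM ≈ model weights + KV cache + runtime overhead. Here is a minimal sketch in Python — the per-1K-token KV cost and the overhead constant are assumptions calibrated to the figures above, and real values vary with layer count, GQA configuration, and cache quantization:

```python
def kobold_vram_estimate_gb(
    weights_gb: float,            # quantized model file size (~16 GB assumed for a 32B Q4)
    context_tokens: int,          # planned context window
    kv_gb_per_1k: float = 0.18,   # assumed KV-cache cost per 1K tokens on a 32B-class model
    overhead_gb: float = 1.5,     # assumed compute buffers + runtime overhead
) -> float:
    """Back-of-envelope VRAM budget: weights + KV cache + fixed overhead."""
    kv_cache_gb = (context_tokens / 1024) * kv_gb_per_1k
    return weights_gb + kv_cache_gb + overhead_gb

print(f"32B Q4 @ 4K:  {kobold_vram_estimate_gb(16.0, 4096):.1f} GB")   # ~18 GB
print(f"32B Q4 @ 32K: {kobold_vram_estimate_gb(16.0, 32768):.1f} GB")  # ~23 GB
```

Plugging in ~40 GB of weights for a 70B Q4 makes it obvious why long-context 70B work belongs in the 48 GB tier.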
- 8 GB — 7B Q4 at 4K context only. Basic roleplay, short memory window. Not recommended for serious use.
- 12 GB — 7B-8B Q4 at 8K context. 13B Q4 fits at 4K with offloading. Entry-level roleplay viable.
- 16 GB — 13-22B Q4 at 16K context. Solid mid-tier roleplay. 32B Q4 only fits at low context.
- 24 GB (KoboldCpp sweet spot) — 32B Q4 at 32K context comfortable. Coherent long-form storytelling. The roleplay goldilocks tier.
- 32-48 GB+ — 70B Q4 at 32K context. Full-session memory for complex multi-character roleplay.
Frequently asked questions
Does KoboldCpp need a fast GPU?
For prompt evaluation: yes — faster GPU = faster context processing. For token generation: less so — most modern GPUs deliver acceptable tok/s on quantized models. The VRAM constraint matters more than raw TFLOPS for KoboldCpp. A 3090 with 24 GB beats a 5080 with 16 GB on 32B Q4 long-context workloads.
Can I run KoboldCpp without a GPU?
Yes, KoboldCpp supports CPU-only inference via GGUF. CPU tok/s is model-dependent — 13B Q4 on a Ryzen 7950X runs at 8-12 tok/s, usable for casual roleplay. Larger models drop to 1-3 tok/s. GPU acceleration is strongly recommended for daily use.
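For reference, a minimal CPU-only launch, wrapped in Python purely for illustration — the model path is hypothetical, and flag availability can vary by KoboldCpp build, so check `python koboldcpp.py --help`:

```python
import subprocess

# CPU-only KoboldCpp launch (model path is hypothetical).
# --gpulayers 0 keeps every layer on the CPU; --threads should roughly match physical cores.
subprocess.run([
    "python", "koboldcpp.py",
    "--model", "models/llama-3.1-8b-instruct.Q4_K_M.gguf",
    "--gpulayers", "0",
    "--threads", "16",
    "--contextsize", "8192",
    "--port", "5001",
])
```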
How much VRAM do I need for long-context roleplay?
For coherent long-form roleplay: 24 GB minimum. 16 GB works for shorter sessions (4-8K context). The KV cache is the hidden VRAM eater — at 32K context on a 32B model, the cache alone consumes 4-6 GB. Plan for model weights + KV cache + headroom.
Is the Mac mini a good KoboldCpp server?
Yes, for serving. The M4 Pro 48 GB runs 70B Q4 at 32K silently, drawing 20W idle. Inference is slower than an RTX desktop (10-15 tok/s vs 30+ tok/s), but the silence + always-on reliability make it a compelling roleplay server for LAN access.
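A sketch of what an always-on LAN launch might look like on the Mac — assumptions: the model path is hypothetical, a large --gpulayers value requests full offload (Metal on Apple Silicon), and --host 0.0.0.0 exposes the server to the LAN; verify flag behavior against your build:

```python
import subprocess

# Always-on LAN server sketch (model path is hypothetical).
# --gpulayers 999 requests full offload (clamped to the model's actual layer count);
# --host 0.0.0.0 binds all interfaces so other machines on the LAN can connect.
subprocess.run([
    "python", "koboldcpp.py",
    "--model", "models/llama-3.1-70b-instruct.Q4_K_M.gguf",
    "--gpulayers", "999",
    "--contextsize", "32768",
    "--host", "0.0.0.0",
    "--port", "5001",
])
```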
Why does context length matter so much in KoboldCpp?
KoboldCpp is built for story-driven interaction. Short context = the model forgets earlier plot points and character details within minutes. Long context (16K-32K) maintains narrative coherence across multi-hour sessions. If roleplay quality matters, don't skimp on VRAM for context.
Do I need a dedicated GPU just for KoboldCpp?
Not necessarily. KoboldCpp coexists happily with other LLM tools — Ollama, vLLM, and llama.cpp can share the same GPU as long as they run sequentially. But if you run other VRAM-heavy tools (ComfyUI, video gen) at the same time, budget VRAM for the combined load.
Go deeper
- Best GPU for Llama models — Llama-family hardware picks (common KoboldCpp target)
- Best GPU for local AI (pillar) — All picks ranked across model families
- Best AI PC for developers — Rig builds that handle KoboldCpp + dev tools
- Running local AI on multiple GPUs — When one card isn't enough for 70B long-context
When it doesn't work
Hardware bought, set up correctly, still failing? See our guides to the highest-volume local-AI errors and their fixes.
Common alternatives readers consider:
- If your budget is tighter → best budget GPU for local AI
- If you'd rather buy used → best used GPU for local AI
- If you're on Apple Silicon → best Mac for local AI
- If you're not sure what fits your build → the will-it-run checker
- If you don't want to buy anything yet → our editorial philosophy