RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.co · Independently operated

Custom build engine

Describe your build — any GPUs, CPU, RAM, OS, runtime, use case. We'll compute effective VRAM honestly, recommend a runtime, and tell you which models fit comfortably, which are borderline, and which aren't practical.

Total VRAM ≠ pooled VRAM. We never sum VRAM unless the silicon truly pools (Apple unified memory). We always explain why effective is lower than total.

Describe your build

Add GPUs, set CPU/RAM/OS, optionally pick a runtime + use case. URL updates as you change fields — share a build by copying the URL.

Build summary

Total VRAM: 128 GB
Effective VRAM: ~126 GB (range 116-126 GB)
Topology: single GPU · interconnect: none
Setup difficulty: beginner · speed penalty ~0%
Why effective VRAM is lower than total

Single AMD Instinct MI300A (APU) — 128 GB VRAM minus ~1.8 GB runtime overhead = ~126 GB usable for weights + KV cache + activations. The 8% headroom we reserve covers the typical OS/driver footprint and gives KV-cache room for an 8K-32K context; that reserve is what produces the 116 GB low end of the range.
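As a sketch, that arithmetic can be written out directly — the overhead and reserve values here are this page's stated assumptions (1.8 GB runtime overhead, 8% reserve), not universal constants:

```python
def effective_vram(total_gb, overhead_gb=1.8, reserve_frac=0.08):
    """Usable VRAM after runtime overhead, plus a conservative floor.

    overhead_gb and reserve_frac are assumptions taken from this page,
    not measured constants for every runtime/driver combination.
    """
    usable = total_gb - overhead_gb           # ~126 GB on a 128 GB MI300A
    floor = usable - reserve_frac * total_gb  # conservative low end, ~116 GB
    return usable, floor

usable, floor = effective_vram(128)
```

For the 128 GB build above this returns roughly (126.2, 116.0), matching the ~126 GB figure and the 116 GB low end of the quoted range.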

Coding agents — agentic tool-call burst

Workload-specific bottleneck. Where this kind of work actually breaks first, and what to budget for.

Bottleneck: KV cache

Coding agents emit 5-15 tool calls per task, and each call re-sends the full agent system prompt plus accumulated context. The limit is the KV-cache budget for that prompt multiplied by concurrent requests. The decode side is well-served by any modern card; the prefill side bottlenecks first.

Budget for
  • 32K context with KV-cache room to spare (~3-4 GB on 4090 AWQ-INT4)
  • Prefix cache: prefer SGLang for >5 tool calls / task
  • Decode latency: aim for >40 tok/s sustained
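The KV-cache pressure above can be estimated per request. This sketch assumes a Llama-3.1-8B-style GQA shape (32 layers, 8 KV heads, head dim 128) at FP16 — illustrative numbers, not a measurement of any model in the tables below:

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_tokens,
                 bytes_per_elem=2, concurrent=1):
    """Rough KV-cache footprint in GiB for a GQA transformer.

    Model shape here (layers/heads/dims) is an assumed example config,
    not pulled from any specific checkpoint.
    """
    # Per token, each layer stores K and V: 2 * n_kv_heads * head_dim elements.
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * concurrent
    return elems * bytes_per_elem / 2**30

# 32K-token context, one request, FP16 cache → 4.0 GiB
kv = kv_cache_gib(32, 8, 128, 32_768)
```

Every concurrent agent request multiplies this figure, which is why the KV/prefill side is the first thing to budget for tool-call bursts.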

Recommended runtime

Best engine for this topology + skill level + use case.

llama.cpp (HIPBLAS)
primary pick · moderate setup difficulty

The most reliable AMD inference path in 2026. GGUF format works on every AMD card; HIPBLAS backend matches llama.cpp's CUDA backend within ~20% on RDNA3.

Models that fit your build

47 models considered (filtered by coding). Categorized by headroom at the recommended quant + a sensible context for your use case.

Comfortable
24 models · ≥15% headroom
Model · Params · Quant · VRAM est. · Context · Note
Phind CodeLlama 34B v2 · 34B · Q4_K_M · 73.9 GB · 16,384 · Comfortable fit with 41% headroom — room to extend context or run alongside other workloads.
Qwen 2.5 Coder 32B Instruct · 32B · Q8_0 · 79.1 GB · 32,768 · Fits cleanly at Q8_0 + 32,768 ctx with 37% headroom.
Devstral Small 2 24B · 24B · Q4_K_M · 98 GB · 32,768 · Fits cleanly at Q4_K_M + 32,768 ctx with 22% headroom.
Codestral 22B · 22B · Q8_0 · 103.3 GB · 32,768 · Fits cleanly at Q8_0 + 32,768 ctx with 18% headroom.
DeepSeek V3 Lite (16B MoE) · 16B · Q4_K_M · 76.9 GB · 32,768 · Fits cleanly at Q4_K_M + 32,768 ctx with 39% headroom.
DeepSeek Coder V2 Lite (16B) · 16B · Q4_K_M · 76.9 GB · 32,768 · Fits cleanly at Q4_K_M + 32,768 ctx with 39% headroom.
StarCoder 2 15B · 15B · Q4_K_M · 42.9 GB · 16,384 · Comfortable fit with 66% headroom — room to extend context or run alongside other workloads.
Qwen 2.5 14B Instruct · 14B · Q8_0 · 78.4 GB · 32,768 · Fits cleanly at Q8_0 + 32,768 ctx with 38% headroom.
Qwen 3 14B · 14B · Q8_0 · 78.4 GB · 32,768 · Fits cleanly at Q8_0 + 32,768 ctx with 38% headroom.
Qwen 2.5 Coder 14B Instruct · 14B · Q4_K_M · 71.6 GB · 32,768 · Comfortable fit with 43% headroom — room to extend context or run alongside other workloads.
Yi Coder 9B · 9B · Q4_K_M · 58.5 GB · 32,768 · Comfortable fit with 54% headroom — room to extend context or run alongside other workloads.
OpenCoder 8B · 8B · Q4_K_M · 55.8 GB · 32,768 · Comfortable fit with 56% headroom — room to extend context or run alongside other workloads.
Qwen 3 8B · 8B · Q8_0 · 59.7 GB · 32,768 · Comfortable fit with 53% headroom — room to extend context or run alongside other workloads.
Llama 3.1 8B Instruct · 8B · FP16 · 55.9 GB · 32,768 · Comfortable fit with 56% headroom — room to extend context or run alongside other workloads.
DeepSeek R1 Distill Llama 8B · 8B · Q4_K_M · 55.8 GB · 32,768 · Comfortable fit with 56% headroom — room to extend context or run alongside other workloads.
Qwen 2.5 Coder 7B Instruct · 7B · Q6_K · 54.8 GB · 32,768 · Comfortable fit with 56% headroom — room to extend context or run alongside other workloads.
Qwen 2.5 7B Instruct · 7B · Q8_0 · 44.5 GB · 32,768 · Comfortable fit with 65% headroom — room to extend context or run alongside other workloads.
CodeGemma 7B · 7B · Q4_K_M · 18.1 GB · 8,192 · Comfortable fit with 86% headroom — room to extend context or run alongside other workloads.
Codestral Mamba 7B · 7B · Q4_K_M · 53.2 GB · 32,768 · Comfortable fit with 58% headroom — room to extend context or run alongside other workloads.
CodeQwen 1.5 7B · 7B · Q4_K_M · 53.2 GB · 32,768 · Comfortable fit with 58% headroom — room to extend context or run alongside other workloads.
StarCoder 2 7B · 7B · Q4_K_M · 29.8 GB · 16,384 · Comfortable fit with 76% headroom — room to extend context or run alongside other workloads.
Qwen 2.5 Coder 3B · 3B · Q4_K_M · 42.7 GB · 32,768 · Comfortable fit with 66% headroom — room to extend context or run alongside other workloads.
StarCoder 2 3B · 3B · Q4_K_M · 23.3 GB · 16,384 · Comfortable fit with 82% headroom — room to extend context or run alongside other workloads.
Qwen 2.5 Coder 1.5B · 2B · Q4_K_M · 38.7 GB · 32,768 · Comfortable fit with 69% headroom — room to extend context or run alongside other workloads.
Borderline
8 models · tight, may need quant downgrade
Model · Params · Quant · VRAM est. · Context · Note
Llama 3.3 70B Instruct · 70B · Q8_0 · 123.6 GB · 32,768 · Tight fit at Q8_0 — only 2% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Qwen 3.6 35B-A3B (MTP) · 35B · Q3_K_M · 122.7 GB · 32,768 · Tight fit at Q3_K_M — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Qwen 2.5 32B Instruct · 32B · Q4_K_M · 119.1 GB · 32,768 · Tight fit at Q4_K_M — only 6% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Qwen 3 32B · 32B · Q5_K_M · 121.9 GB · 32,768 · Tight fit at Q5_K_M — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Gemma 4 31B Dense · 31B · Q4_K_M · 116.4 GB · 32,768 · Tight fit at Q4_K_M — only 8% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Qwen 3 30B-A3B · 30B · Q4_K_M · 113.8 GB · 32,768 · Tight fit at Q4_K_M — only 10% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Qwen 3.6 27B (MTP) · 27B · Q8_0 · 118.9 GB · 32,768 · Tight fit at Q8_0 — only 6% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Mistral Small 3 24B · 24B · Q8_0 · 109.5 GB · 32,768 · Tight fit at Q8_0 — only 13% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Not practical
15 models · oversize for this build
Model · Params · Quant · VRAM est. · Context · Note
DeepSeek R1 Distill Qwen 3 32B · 32B · AWQ-INT4 · 132.4 GB · 32,768 · ~132.4 GB needed at AWQ-INT4 + 32,768 ctx — overshoots effective VRAM by 5%. Drop quant or move to a larger build.
Qwen 3 Coder 32B · 32B · AWQ-INT4 · 132.4 GB · 32,768 · ~132.4 GB needed at AWQ-INT4 + 32,768 ctx — overshoots effective VRAM by 5%. Drop quant or move to a larger build.
DeepSeek Coder V3 · 33B · AWQ-INT4 · 135.4 GB · 32,768 · ~135.4 GB needed at AWQ-INT4 + 32,768 ctx — overshoots effective VRAM by 7%. Drop quant or move to a larger build.
Llama 4 70B · 70B · AWQ-INT4 · 248.3 GB · 32,768 · ~248.3 GB needed at AWQ-INT4 + 32,768 ctx — overshoots effective VRAM by 97%. Drop quant or move to a larger build.
Llama 3.1 70B Instruct · 70B · Q4_K_M · 219.1 GB · 32,768 · ~219.1 GB needed at Q4_K_M + 32,768 ctx — overshoots effective VRAM by 74%. Drop quant or move to a larger build.
Qwen 3 72B · 72B · AWQ-INT4 · 254.4 GB · 32,768 · ~254.4 GB needed at AWQ-INT4 + 32,768 ctx — overshoots effective VRAM by 102%. Drop quant or move to a larger build.
Llama 4 Scout · 109B · Q4_K_M · 321.9 GB · 32,768 · ~321.9 GB needed at Q4_K_M + 32,768 ctx — overshoots effective VRAM by 155%. Drop quant or move to a larger build.
Qwen 3 235B-A22B · 235B · Q4_K_M · 653.7 GB · 32,768 · ~653.7 GB needed at Q4_K_M + 32,768 ctx — overshoots effective VRAM by 419%. Drop quant or move to a larger build.
DeepSeek Coder V2 236B · 236B · Q4_K_M · 656.4 GB · 32,768 · ~656.4 GB needed at Q4_K_M + 32,768 ctx — overshoots effective VRAM by 421%. Drop quant or move to a larger build.
DeepSeek V2.5 236B · 236B · Q4_K_M · 656.4 GB · 32,768 · ~656.4 GB needed at Q4_K_M + 32,768 ctx — overshoots effective VRAM by 421%. Drop quant or move to a larger build.
DeepSeek V4 Flash (284B MoE) · 284B · Q4_K_M · 782.8 GB · 32,768 · ~782.8 GB needed at Q4_K_M + 32,768 ctx — overshoots effective VRAM by 521%. Drop quant or move to a larger build.
DeepSeek V4 · 745B · AWQ-INT4 · 2307 GB · 32,768 · ~2307.0 GB needed at AWQ-INT4 + 32,768 ctx — overshoots effective VRAM by 1731%. Drop quant or move to a larger build.
Kimi K2.6 · 1000B · Q4_K_M · 2668.7 GB · 32,768 · ~2668.7 GB needed at Q4_K_M + 32,768 ctx — overshoots effective VRAM by 2018%. Drop quant or move to a larger build.
Ring-2.6-1T · 1000B · Q3_K_M · 2546.6 GB · 32,768 · ~2546.6 GB needed at Q3_K_M + 32,768 ctx — overshoots effective VRAM by 1921%. Drop quant or move to a larger build.
DeepSeek V4 Pro (1.6T MoE) · 1600B · Q4_K_M · 4249.1 GB · 32,768 · ~4249.1 GB needed at Q4_K_M + 32,768 ctx — overshoots effective VRAM by 3272%. Drop quant or move to a larger build.
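The three buckets follow mechanically from headroom against effective VRAM. A minimal sketch, with thresholds inferred from the table notes (≥15% headroom = comfortable, any smaller positive headroom = borderline, overshoot = not practical) — an assumption about this page's rules, not its actual engine:

```python
def classify(vram_needed_gb, effective_gb=126.0, comfortable_margin=0.15):
    """Bucket a model the way the tables above do.

    effective_gb defaults to this build's ~126 GB; thresholds are
    assumed from the table notes, not confirmed internals.
    """
    headroom = 1 - vram_needed_gb / effective_gb  # fraction of effective VRAM left
    if headroom < 0:
        return "not practical"
    return "comfortable" if headroom >= comfortable_margin else "borderline"

# e.g. Phind CodeLlama 34B at 73.9 GB → "comfortable" (~41% headroom)
```

Spot-checking against the rows above: 73.9 GB lands comfortable, 123.6 GB borderline, 132.4 GB not practical, matching the listed buckets.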

Related

Multi-GPU buying guide →

NVLink vs PCIe, tensor- vs pipeline-parallel, mixed-card honesty.

Hardware combinations →

Curated multi-GPU / cluster setups with effective-VRAM math.

Setup path-finder →

OS + runtime install commands for your stack.

Compatibility matrix →

Runtime × OS × hardware support truth table.

Shopping a full build instead of a single card?

If you're sizing a fresh AI build (not just a card to drop into an existing system), the build-budget walkthroughs cover the whole BOM honestly: AI PC build under $1,000 and AI PC build under $2,000 cover the realistic 2026 budget tiers.

Vertical-fit shopping? AI PC for students covers the budget + portability tradeoffs; AI PC for developers covers the coding workflow specifics; AI PC for small business covers the document-RAG / always-on machine.

Form-factor first? See best laptop for local AI, best Mac for local AI, best mini PC for local AI, or best used GPU for local AI.

See something off? Submit a benchmark · Report outdated · Suggest a correction
We read every submission. Editorial review takes 1-7 days.