154 units indexed. GPUs, SoCs, laptops, and edge hardware ranked for local LLM inference.
Sortable tier list with estimated tok/s for 7B / 14B / 32B / 70B at Q4_K_M. Measured benchmarks where we have them, bandwidth-derived estimates where we don't — every cell labeled.
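Below is a minimal sketch of what a bandwidth-derived estimate can look like. It assumes single-stream decode is memory-bound, meaning each generated token reads roughly all quantized weights once. The ~4.85 bits/weight for Q4_K_M, the 0.6 efficiency factor, and the helper names are illustrative assumptions. They are not the formula behind this list.

```python
from typing import Optional, Tuple

# Assumed average bits per weight for llama.cpp Q4_K_M (illustrative, not exact).
Q4_K_M_BITS_PER_WEIGHT = 4.85


def weight_bytes(params_billion: float, bits_per_weight: float = Q4_K_M_BITS_PER_WEIGHT) -> float:
    """Approximate size of the quantized weights in bytes."""
    return params_billion * 1e9 * bits_per_weight / 8


def estimate_tok_s(bandwidth_gb_s: float, params_billion: float, efficiency: float = 0.6) -> float:
    """Decode tok/s if every token streams all weights once at `efficiency` x peak bandwidth."""
    usable_bytes_per_s = bandwidth_gb_s * 1e9 * efficiency
    return usable_bytes_per_s / weight_bytes(params_billion)


def tok_s_cell(bandwidth_gb_s: float, params_billion: float,
               measured: Optional[float] = None) -> Tuple[float, str]:
    """Prefer a measured number; otherwise fall back to a labeled estimate."""
    if measured is not None:
        return measured, "measured"
    return estimate_tok_s(bandwidth_gb_s, params_billion), "estimated"


if __name__ == "__main__":
    # Example: a 546 GB/s part (M4 Max class) across the four model sizes.
    for size in (7, 14, 32, 70):
        value, label = tok_s_cell(546, size)
        print(f"{size}B Q4_K_M: ~{value:.0f} tok/s ({label})")
```

This model ignores whether the weights fit in memory at all. It also ignores prompt processing and KV-cache reads, which is one reason measured numbers should win whenever they exist.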
The accessible Mac Studio tier, launched alongside the M3 Ultra. M4 Max with 36/48/64/128GB unified memory at up to 546 GB/s (410 GB/s on the binned 36GB chip), about 2x the M4 Pro's bandwidth. The…
Top-spec Mac Studio with M3 Ultra. Up to 512GB unified memory in custom configs.
The sweet-spot local-AI desktop for most people. M4 Pro with 24/48/64GB unified memory at 273 GB/s — more than double the base M4's bandwidth. A 64GB config…
The cheapest entry into Apple unified memory and the most-recommended starter box for local AI. Base M4 with 16/24/32GB unified memory at 120 GB/s, in a tiny…
The enthusiast-favorite Strix Halo box: a Ryzen AI Max+ 395 system with 128GB LPDDR5X-8000 unified memory (~256 GB/s), up to ~96GB allocatable as VRAM on…
The cheapest mainstream 128GB Strix Halo mini-PC (~$1,499 entry). Ryzen AI Max+ 395, 128GB LPDDR5X-8000 unified, Radeon 8060S iGPU, up to ~96GB allocatable as…
The popular OEM twin of the NVIDIA DGX Spark, ~$1,000 cheaper (~$2,999). Same GB10 Grace Blackwell superchip, 128GB LPDDR5X unified, ~1 PFLOP FP4, ConnectX-7…
NVIDIA's desktop AI box: Grace Blackwell GB10 with 128GB unified LPDDR5X. The closest a consumer can get to running 200B-class models locally without renting…
Blackwell flagship. 32GB GDDR7 on a 512-bit bus delivers ~1.79 TB/s memory bandwidth — the new top of consumer hardware for local LLM inference. Comfortably…
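The peak-bandwidth figures quoted on these cards follow from the per-pin data rate times the bus width. Here is a quick sketch. The per-pin rates are the commonly published memory speeds for each part, and the helper name is hypothetical.

```python
def peak_bandwidth_gb_s(data_rate_gbps_per_pin: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s: Gbit/s per pin x number of pins / 8 bits per byte."""
    return data_rate_gbps_per_pin * bus_width_bits / 8


examples = {
    "RTX 5090 (GDDR7 28 Gbps, 512-bit)": peak_bandwidth_gb_s(28, 512),      # 1792 GB/s, i.e. ~1.79 TB/s
    "RTX 5080 (GDDR7 30 Gbps, 256-bit)": peak_bandwidth_gb_s(30, 256),      # 960 GB/s
    "Strix Halo (LPDDR5X-8000, 256-bit)": peak_bandwidth_gb_s(8.0, 256),    # 256 GB/s
    "Lunar Lake (LPDDR5X-8533, 128-bit)": peak_bandwidth_gb_s(8.533, 128),  # ~137 GB/s
}

for name, gb_s in examples.items():
    print(f"{name}: {gb_s:.1f} GB/s")
```

The same arithmetic explains why unified-memory systems with wide buses, such as Strix Halo and the Max/Ultra Apple chips, outrun typical laptop iGPUs.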
Mobile Blackwell flagship. 24GB GDDR7 in a laptop is the new high-water mark.
Second-tier Blackwell. 16GB GDDR7, ~960 GB/s bandwidth. Fastest 16GB consumer card on the market.
Mobile Ada flagship. 16GB VRAM in a laptop. Premium gaming and AI laptop default.
AMD's 24GB challenger to the 4090. ROCm Linux now solid for llama.cpp and vLLM. Best price-per-VRAM-GB on the new market.
The community-default high-end local-AI card from 2022 to 2025. 24GB GDDR6X at ~1 TB/s comfortably holds 32B Q4; 70B Q4 (roughly 42GB of weights at Q4_K_M) needs a second card or partial CPU offload.
Highest-tier Ampere consumer card. Used market gold for AI: 24GB at sub-$1200 in 2026.
20GB RDNA 3. Cheaper alternative to XTX.
Refreshed 6900 XT with faster GDDR6 (576 GB/s). 16 GB VRAM, slightly more compute. ROCm officially supported. ~110-145 tok/s on 7B Q4. The bandwidth bump…
Ampere flagship-minus-one. 12 GB GDDR6X at 912 GB/s — closer to the 3090 in raw bandwidth than to the 3080. Fits 13B Q4 with full context, 32B Q4 with offload.…
The original 24GB CUDA value pick. Used market still strong in 2026 — many AI hobbyists run dual 3090 setups for 70B inference.
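Here is a rough fit check that shows why 70B pushes people to two cards. It assumes ~4.85 bits/weight for Q4_K_M plus a flat allowance for KV cache and runtime overhead. The allowance value and the helper names are hypothetical, and real headroom depends on context length and backend.

```python
def q4_weights_gb(params_billion: float, bits_per_weight: float = 4.85) -> float:
    """Approximate Q4_K_M weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(vram_gb: float, params_billion: float, kv_and_overhead_gb: float = 2.5) -> bool:
    """True if the weights plus a flat KV-cache/runtime allowance fit in total VRAM."""
    return q4_weights_gb(params_billion) + kv_and_overhead_gb <= vram_gb


for vram, label in ((24, "one 24GB card"), (48, "two 24GB cards")):
    for size in (32, 70):
        verdict = "fits" if fits(vram, size) else "does not fit"
        print(f"{size}B Q4 on {label}: {verdict}")
```

With two cards, the weights are split layer by layer, so each card has to hold its own share. Treating 48GB as one pool is therefore an optimistic check.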
RDNA 2 flagship. 16 GB VRAM at 512 GB/s, more compute than the 6800 XT. ROCm officially supported. ~95-130 tok/s on 7B Q4. The peak of RDNA 2 consumer AMD;…
RDNA 2 enthusiast. 16 GB VRAM, 512 GB/s bandwidth, more compute units than the base 6800. ROCm officially supported. ~85-110 tok/s on 7B Q4, 35-50 tok/s on 13B…
Turing flagship. 11 GB GDDR6 at 616 GB/s — fits 13B Q4 comfortably, 7B Q4 at ~110-140 tok/s. Used $360-420 in 2026 makes it the 'enthusiast on a budget' floor;…
The 12GB member of AMD's RDNA4 desktop line, now a global SKU ($549) after a year as a China-only 'Golden Rabbit Edition'. Navi 48, 48 CUs, 192-bit, 432 GB/s,…
RDNA 4 flagship. 16GB at $599 — best AMD value for local AI in 2026.
16GB Blackwell at the upper-mid price tier. Strong 14B–32B model performance.
16GB RDNA 4 at sub-$600. ROCm + Vulkan supported.
The volume mainstream RTX 50-series gaming-laptop GPU. Originally 8GB; a 12GB variant launched in April 2026 to relieve VRAM pressure. GB206, 4,608 CUDA cores,…
RDNA 3 'Golden Rabbit Edition' — 16 GB at 576 GB/s, between the 6800 XT and 7900 XT. ROCm officially supported. The current AMD value choice in the $500-600…
16GB upgrade of the 4070 Ti. Solid mid-high pick for local AI.
Refreshed 4080 with 16GB GDDR6X. Slightly behind 5080 but well-supported.
16GB RDNA 3 mid-range.
12GB Ada — fits 7B–14B Q4 with usable context.
Laptop variant of Ampere. 16GB VRAM in a portable form factor was rare and remains a sleeper pick on the used market.
Original 4080. 16GB GDDR6X. Still capable for 14B–32B Q4 work.
Refreshed 6700 XT with faster GDDR6 (432 GB/s). 12 GB VRAM fits 13B Q4 comfortably. ROCm officially supported. ~55-75 tok/s on 7B Q4. Strong AMD value pick at…
Mid-life 12GB refresh of the 3080. Decent 7B–14B card on the used market.
12 GB RDNA 2 at the $280 used tier. ROCm officially supported. The VRAM headroom matters — fits 13B Q4 comfortably. ~50-65 tok/s on 7B Q4. Strong AMD value…
Ampere Ti with GDDR6X at 608 GB/s. 8 GB VRAM is the ceiling — same as base 3070, no improvement on the variable that matters. Runs 7B Q4 at ~110-145 tok/s with…
16 GB RDNA 2 — the AMD answer to the 4070 12 GB question. Comfortable for any 7B/13B model + 32B Q4 with offload. ROCm officially supported. ~70-90 tok/s on 7B…
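"With offload" means some layers live in system RAM, and the slower memory then dominates decode speed. The sketch below is simplified. It assumes each token reads every layer's weights once, and it ignores CPU compute and PCIe transfers. The ~60 GB/s dual-channel system-RAM figure and the 14 GB weight budget are assumptions chosen for illustration.

```python
def offload_tok_s(model_gb: float, vram_for_weights_gb: float,
                  gpu_bw_gb_s: float, sys_bw_gb_s: float) -> float:
    """Rough decode-speed ceiling when weights are split between VRAM and system RAM."""
    on_gpu = min(model_gb, vram_for_weights_gb)
    on_cpu = model_gb - on_gpu
    # Time per token is the sum of the time to stream each portion from its memory.
    seconds_per_token = on_gpu / gpu_bw_gb_s + on_cpu / sys_bw_gb_s
    return 1.0 / seconds_per_token


# Hypothetical case: ~19.4 GB of 32B Q4 weights, 14 GB of a 16 GB / 512 GB/s card
# usable for weights, and ~60 GB/s of assumed dual-channel system RAM.
split = offload_tok_s(19.4, 14.0, 512, 60)
if_it_fit = offload_tok_s(19.4, 24.0, 512, 60)  # same bandwidth, if all weights fit in VRAM
print(f"split: ~{split:.1f} tok/s ceiling; all in VRAM: ~{if_it_fit:.1f} tok/s ceiling")
```

Only about 5 GB lands in system RAM in this example, yet that portion takes most of the per-token time. This is why offloaded models feel so much slower than their VRAM-resident size suggests.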
Original 10GB 3080. Tight on VRAM for AI but still capable for 7B work.
Ampere mid-high with 8 GB at 448 GB/s. Comfortable for 7B Q4 (~80-100 tok/s) and 13B Q4 with light offload. The 'middle' of Ampere — better bandwidth than the…
Turing 'almost-flagship'. 8 GB VRAM is the ceiling — same as base 2080 — but more bandwidth (496 GB/s) and Tensor compute. Runs 7B Q4 at ~80-105 tok/s with…
The Turing refresh that made 8 GB Turing genuinely fast. ~70-90 tok/s on 7B Q4 with ExLlamaV2. Same 8 GB ceiling as the base 2070 but more compute. Strong…
RDNA 1 flagship. ROCm support was always experimental and is effectively defunct in 2026. Vulkan via llama.cpp is the only operator-grade path; performance is…
Turing high-tier. 8 GB VRAM, similar bandwidth to the 2060 Super, slightly more compute. Runs 7B Q4 at ~65-80 tok/s, 13B Q4 with light offload. A solid…
Pascal halo card. 11 GB GDDR5X at 484 GB/s — outperforms many newer mid-range cards on raw bandwidth. Runs 7B Q4 at ~50-65 tok/s, 13B Q4 fits comfortably at…
AMD's RDNA 4 mainstream card. 16GB VRAM, ROCm + Vulkan support, $449 MSRP. Targets the same $400-500 price segment as NVIDIA's RTX 5060 Ti but ships 16GB by…
The 16GB sub-$500 sweet spot. Best value for entering local AI seriously.
Mid-range Blackwell with 12GB. 7B-14B Q4 territory.
10GB Battlemage at sub-$220. Entry budget compute.
8GB Blackwell. Capable of 7B Q4 only — go 16GB SKU instead for AI work.
Sub-$330 16GB AMD. Memory-bandwidth-limited but great VRAM-per-dollar.
Refreshed 4070. Strong mid-range value for 12GB-tier local AI.
Battlemage architecture. 12GB at $250 — the budget compute card. IPEX-LLM and Vulkan are usable paths for AI.
Chinese third-party modification of the stock RTX 2080 Ti, replacing the 11 GB GDDR6 with 22 GB. The TU102 chip, 352-bit memory bus, and 616 GB/s bandwidth are…
The poster child of 'cheap 16GB CUDA card'. Memory bandwidth is mediocre but 16GB at $400-something opens up 14B Q4.
Original 4070. 12GB Ada. Now eclipsed by 4070 Super at the same price.
12GB RDNA 3.
8GB version — go 16GB SKU for AI work.
Alchemist 16GB. Cheapest path to that VRAM tier. Vulkan llama.cpp is the most-tested route.
Refreshed 6600 XT with slightly faster GDDR6 (280 GB/s). 8 GB VRAM ceiling unchanged. ~40-55 tok/s on 7B Q4 with ROCm. Same buyer-decision shape as the 6600 XT…
The community pick for 'cheapest CUDA card with serious VRAM'. The value floor for local AI in 2026.
Entry RDNA 2. 8 GB VRAM, lower bandwidth (224 GB/s) — the bottleneck on AI. ROCm officially supported on Linux. ~30-45 tok/s on 7B Q4. Reasonable budget AMD…
RDNA 2 mid-tier. 8 GB GDDR6 at 256 GB/s. ROCm officially supported on Linux. ~35-50 tok/s on 7B Q4. Bandwidth-bottlenecked vs the 6700 XT/6750 XT siblings.…
8GB Ampere. Fits 7B Q4 only.
RDNA 1 mid-tier. 6 GB VRAM is the ceiling — fits 7B Q4 only with short context. No production ROCm support; Vulkan-only. ~20-30 tok/s on 7B Q4. The 'I have one…
Turing mid with the 8 GB upgrade — meaningful for AI. 7B Q4 fits comfortably with full context, 13B Q4 fits with offload. ~60-75 tok/s on 7B with ExLlamaV2.…
First consumer card with Tensor cores at the ~$200 used tier. 6 GB VRAM is the bottleneck — 7B Q4 fits with limited context. FP16/INT8 Tensor compute makes…
Turing mid-tier without RT/Tensor cores. 6 GB VRAM fits 7B Q4 with short context. Bandwidth (288 GB/s) is solid for the tier — ~30-40 tok/s on 7B Q4. Same VRAM…
Turing mid-range with GDDR6 — bandwidth jumps to 336 GB/s vs the base 1660. Same 6 GB VRAM ceiling but ~30-45 tok/s on 7B Q4 thanks to the bandwidth bump.…
Turing mid-range without RT cores. 6 GB VRAM fits 7B Q4 with short context. TU116 has no Tensor cores (it carries dedicated FP16 units instead), so inference is…
Pascal Ti slot between the 1070 and 1080. 8 GB GDDR5 at 256 GB/s. No FP16 acceleration on consumer Pascal — quantized inference only. Runs 7B Q4 at ~25-35…
AMD Polaris with 8 GB VRAM. Cheap on used market ($70-100) but Polaris was dropped from ROCm in 2022, so AMD's official AI runtimes won't work. Vulkan via…
Pascal high-tier with 8 GB VRAM. Comfortable for 7B Q4 models at ~25-35 tok/s. Bandwidth-limited like the rest of Pascal but the 8 GB headroom matters — fits…
Pascal flagship for two years. 8 GB GDDR5X at 320 GB/s — better bandwidth than the 1070. Runs 7B Q4 at ~30-45 tok/s; 13B Q4 with offload but slow. The…
The RTX PRO 5000 Blackwell is NVIDIA's Blackwell-generation workstation card, slotting between the consumer RTX 5090 (32GB) and the RTX PRO 6000 Blackwell (96GB).…
The Blackwell Ultra datacenter refresh of the B200. 288GB HBM3e per GPU, ~8 TB/s, up to 1,400W; GB300 NVL72 racks reach 1.1 ExaFLOPS FP4. The current top-end…
Latest CDNA 4. 288GB HBM3e, tied with NVIDIA's B300 for the most memory per accelerator on the market.
The air-coolable sibling of the listed MI355X — same CDNA 4 silicon, 288GB HBM3E, ~8 TB/s, at a lower (~1,000W-class) power profile. The variant most…
Pro Blackwell — 96GB GDDR7 ECC. The single-card answer to 70B and 100B+ local inference.
Mid-tier Blackwell workstation card: 32GB GDDR7, 200W, explicitly pitched for desktop LLM inference and generative AI. Fills the single-card 32GB…
Intel's workstation card explicitly marketed for low-cost local LLM inference. 24GB GDDR6, 456 GB/s, ~197 TOPS, ~$599. Board partners ship 48GB dual-GPU…
Single-slot 140W Blackwell workstation card with 24GB GDDR7. The low-power, compact entry to the RTX PRO Blackwell line — fits small workstations and dense…
72-GPU Blackwell rack with 36 Grace CPUs. Hyperscale-only — relevant context here for understanding 'what frontier training runs on'.
256GB HBM3e — direct competitor to NVIDIA H200 with more memory.
Datacenter Blackwell. 192GB HBM3e per chip, ~8 TB/s bandwidth. Cloud-tier — you rent these by the hour.
The PCIe-form-factor variant of the H200. Same 141 GB HBM3e, same memory subsystem (~4.8 TB/s bandwidth), in a dual-slot workstation card rather than the SXM5…
Hopper refresh — 141GB HBM3e at ~4.8 TB/s. Datacenter-class; rentable on RunPod, Lambda, etc.
Intel's enterprise AI accelerator. 128GB HBM2e. Habana stack required — limited ecosystem support.
The China-market Hopper SKU tuned for inference: 96GB HBM3 (more than the standard H100's 80GB), 4.0 TB/s, 400W, with ~41% fewer cores than a full H100.…
Third-party physical modification of a stock GIGABYTE / ASUS / MSI RTX 4090. Chinese specialty shops (and GPVLab in the US) replace the 24 GB of GDDR6X with…
192GB HBM3 datacenter card. Used by Microsoft, Oracle, Meta cloud deployments.
Dual-card H100 with 188GB combined memory. Built for LLM serving.
Ada-gen datacenter card. 48GB GDDR6 — popular at cloud GPU rentals as a budget H100 alternative.
32GB workstation Ada. Mid-tier pro card.
Inference-focused Ada datacenter card. Low-power 24GB suitable for 7B-14B serving.
Previous-gen Habana accelerator. 96GB HBM2e.
Hopper SXM5: 80GB HBM3 at 3.35 TB/s. The workhorse of the 2023-2024 frontier-model training buildout. Cloud-rentable.
PCIe Hopper. Lower power, lower bandwidth than SXM. Server-tier.
64GB CDNA 2. Lower-power AMD datacenter option.
Original Ada datacenter. Slower than L40S. 48GB GDDR6.
Pro Ada — 48GB ECC. Pre-Blackwell workstation default.
Previous-gen CDNA 2. 128GB HBM2e. Powered the Frontier supercomputer.
24GB Ampere workstation card. Tighter power envelope than RTX 3090.
Ampere datacenter flagship. 80GB HBM2e at 2 TB/s. Still common at cloud providers.
Ampere workstation/datacenter hybrid. 48GB GDDR6.
Ampere-gen workstation card with 48GB. Common in AI labs; used market is reasonable for 48GB at this point.
Original A100. 40GB HBM2 at 1.55 TB/s. Trained the early generation of frontier models.
Entry Blackwell. 8GB limits to 7B Q4 with limited context.
The cheapest Blackwell desktop card ($249) and the entry point to the RTX 50 series. 8GB GDDR6, 128-bit, 320 GB/s, 130W. A budget CUDA option for small LLMs…
Intel Lunar Lake's Arc 140V iGPU (Xe2 Battlemage architecture). Strong bandwidth for a thin-and-light iGPU (137 GB/s LPDDR5x-8533), though well short of Strix Halo's ~256 GB/s. ~12-18 tok/s on 7B Q4. Pairs…
AMD's 880M iGPU (Ryzen AI 300 series Strix Point). RDNA 3.5 with LPDDR5x-7500 unified memory, a bandwidth jump from the 780M (89 → 120 GB/s peak). ~8-15 tok/s on 7B Q4.…
Entry-level Ada. 8GB limits to 7B Q4.
AMD's 780M iGPU (Ryzen 7040/8040 series Phoenix). Shares system RAM via unified memory architecture; 32 GB DDR5 system gives effective 16-20 GB usable for…
Ampere entry with full Tensor + RT cores at the $200 tier. 8 GB VRAM is the practical floor for serious 7B work. Bandwidth (224 GB/s) is the bottleneck —…
RDNA 1 entry. 8 GB GDDR6 at 224 GB/s. ROCm support was always experimental on RDNA 1 and is effectively defunct in 2026 — Vulkan via llama.cpp is the only…
Turing entry refresh with GDDR6. 4 GB VRAM sits at the practical AI floor: 1-3B Q4 only. No Tensor cores. Common in pre-built office PCs from 2019-2021 that…
Turing entry without RT/Tensor cores. 4 GB VRAM keeps it at the practical floor — 1-3B Q4 only. The 'I built a budget gaming PC' audience runs into VRAM walls…
Cut-down RX 580 with 4 GB VRAM. Below the practical AI floor; 1-3B Q4 with offload at best. Polaris has long since been dropped from ROCm; Vulkan…
Pascal mid-range, 6 GB VRAM. The most-installed Steam GPU for many years; high probability the 'I have a GTX 1060' audience is asking about this card. Runs 7B…
Pascal-era entry GPU. 4 GB VRAM is the practical floor for any local model — fits 1-3B at Q4 with room for short context. CUDA-compatible but no FP16…
Pascal mid-range cut down to 3 GB VRAM. Below the practical AI floor — even 3B Q4 models need ~2 GB plus KV cache. Operators with this card almost universally…
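The "plus KV cache" term grows linearly with context. The sketch below applies the usual estimate, which counts keys and values per layer, per KV head, and per token. It assumes a Llama-3.2-3B-style configuration of 28 layers, 8 KV heads, and head dimension 128 with an FP16 cache. Treat those numbers as assumptions rather than a spec for any particular model file.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_elem: int = 2) -> int:
    """K and V tensors for every layer: 2 x layers x kv_heads x head_dim x tokens x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem


# Assumed 3B-class config with grouped-query attention and an FP16 cache.
for ctx in (2048, 4096, 8192):
    gib = kv_cache_bytes(28, 8, 128, ctx) / 2**30
    print(f"{ctx} tokens: {gib:.2f} GiB of KV cache")
```

Under these assumptions, a 4K context adds about 0.44 GiB on top of ~2 GB of weights. That leaves almost nothing on a 3 GB card for runtime buffers.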
Mobile-only Ampere with 4 GB VRAM at 192 GB/s. The 4 GB ceiling is the bottleneck — 1-3B Q4 only with no headroom for context. CUDA + Tensor cores work, but…
Top-end Windows AI laptop with 24GB RTX 5090 Mobile.
Desktop-replacement gaming/AI laptop with cooler thermals than ultraslims.
16-inch M4 Max — 128GB unified at 546 GB/s. The most capable AI laptop in 2026.
Ryzen 7 6800H + RTX 3080 16GB Mobile. The reference 'serious local-AI laptop' build. Look for the 16GB SKU.
The cheapest portable Apple unified-memory machine and a common 'try local LLMs on a laptop' entry point. M4 with 16/24/32GB at 120 GB/s, fanless. Runs 8-14B…
Modular AMD laptop. Limited GPU but the platform is the appeal.
The flagship Strix Halo mobile workstation: a 14" laptop with 128GB LPDDR5X-8000 unified memory (up to ~96GB allocatable to the Radeon 8060S iGPU). The…
M3 Ultra — up to 512GB unified in Mac Studio top spec. 819 GB/s bandwidth.
Two-chip Ultra fusing two M4 Max dies. Up to 256GB unified memory at 1.1 TB/s. On capacity it still trails the M3 Ultra Mac Studio's 512GB ceiling.
M4 Max — 546 GB/s memory bandwidth, up to 128GB unified. Most capable laptop SoC for 70B+ models.
M2 Ultra: up to 192GB at 800 GB/s. Ships in the Mac Studio and Mac Pro.
M3 Max — 400 GB/s bandwidth, up to 128GB.
Original Ultra — 800 GB/s. 64–128GB unified. Still capable for 70B Q4.
Mid-tier M4: 273 GB/s bandwidth, up to 64GB (48GB max in the MacBook Pro).
Copilot+ PC reference SoC. 12-core Oryon CPU + Adreno GPU + Hexagon NPU at 45 TOPS INT8. The first Arm Windows laptop platform with serious NPU compute; runs Phi Silica…
M2 Max — 400 GB/s bandwidth, up to 96GB.
Original M1 Max. 400 GB/s. 32–64GB unified.
Lower-tier Snapdragon X. 45 TOPS NPU.
AMD's first true Strix Halo chip, combining Zen 5 CPU cores, RDNA 3.5 integrated graphics, and a 50-TOPS XDNA 2 NPU in a single package. The headline feature for…
Combined CPU + GPU APU with 128GB unified HBM3. Powers the El Capitan supercomputer.
Pixel 9 SoC. Google's mobile chip optimized for Gemini Nano + on-device transcription / summarization. NPU TOPS not publicly disclosed by Google; treat as…
iPhone 16 Pro SoC. Improved Neural Engine for Apple Intelligence on-device workloads. 8GB RAM as the new mobile floor enables 3B-class on-device models.
iPad Pro M4 — 10-core CPU + 10-core GPU + 16-core Neural Engine (38 TOPS). 8/16GB unified memory. Best mobile-class chip for local-AI experimentation as of…
Late-2024 Android flagship SoC. Oryon CPU + Hexagon NPU at ~80 TOPS INT8. 8B-class models become viable on-device with adequate quantization.
Flagship Android SoC. Hexagon NPU at 45 TOPS INT8. First mainstream phone NPU to run 7B-class models on-device via Qualcomm AI Hub + ONNX Runtime Mobile.
iPhone 15 Pro / 15 Pro Max SoC. 6-core CPU + 6-core GPU + 16-core Neural Engine (35 TOPS INT8). First A-series with hardware ray tracing; runs MLX-Swift and…
Intel's current-gen AI-PC platform (launched Jan 2026, Intel 18A node), succeeding Lunar Lake. NPU5 delivers ~50 TOPS (clears the 40-TOPS Copilot+ bar);…
Second-gen Snapdragon PC platform (retail H1 2026), succeeding the X Elite/X Plus. Highest laptop NPU at 80 TOPS Hexagon, up to 18 Oryon cores at 5GHz; some…
Intel Lunar Lake 8-core (4 P-cores + 4 Skymont E-cores). NPU 4 at 48 TOPS INT8 + Xe2 iGPU. Copilot+ PC certified. Runs DirectML + ONNX Runtime + OpenVINO; primary…
Strix Point laptop SoC. XDNA 2 NPU at 50 TOPS INT8 + RDNA 3.5 iGPU. ROCm support on Linux unlocks llama.cpp ROCm path; on Windows, ONNX Runtime + DirectML.
Amazon search links: we may earn a small commission at no extra cost to you.