Community submitted · Public roadmap

Benchmark roadmap — what we want measured

The public version of our benchmark queue. Each entry names a model+hardware combo we'd like measured and explains why that measurement would unlock useful pages or sharpen a confidence tier. If you have the rig, click “I can measure this” — the submission form arrives prefilled with model, hardware, and runtime.

Pending: 16 · In progress: 0 · Measured: 0 · Critical: 1
Benchmark coverage

0 measured · 0 wanted · 168 unstarted

0.0% of 168 cells covered
Coverage matrix (hardware ↓ / model →):
Models: DeepSeek V4 Pro, Qwen 3.5 235B-A17B, Qwen 3 235B-A22B, Llama 3.1 8B Instruct, DeepSeek R1 (671B), DeepSeek V4 Flash, Llama 4 Scout, Qwen 3 30B-A3B, Qwen 2.5 Coder 32B, Llama 3.3 70B Instruct, Qwen 3 32B, Gemma 4 31B Dense, Qwen 3 8B, Mistral Medium 3.5
Hardware: AMD Instinct MI325X, NVIDIA H200, NVIDIA L40S, NVIDIA B200, NVIDIA H100 PCIe, NVIDIA L40, NVIDIA RTX PRO 6000 Blackwell, NVIDIA RTX 6000 Ada Generation, AMD Instinct MI300X, NVIDIA GB200 NVL72, AMD Instinct MI355X, AMD Instinct MI300A (APU)
Legend: High confidence · Medium confidence · Critical wanted · Wanted · Not started

Have a rig? Run a benchmark in 5 minutes.

The community-benchmark scripts capture power source, GPU clock, CUDA version, thermal state — every variable that affects the tok/s number. Output is a paste-ready result block. Nothing uploads automatically.

Windows
.\scripts\community-benchmark\run-benchmark.ps1
macOS / Linux / WSL
./scripts/community-benchmark/run-benchmark.sh
Get the scripts on GitHub → · Read the measurement methodology · Email a result block
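
Before running anything, it can help to see what capturing those variables looks like in practice. The checks below approximate the kind of data the scripts record; this is an illustrative sketch for NVIDIA-on-Linux rigs (nvidia-smi and nvcc on PATH), not the scripts themselves.

uname -srm                        # OS and kernel
nvidia-smi --query-gpu=name,clocks.gr,temperature.gpu,power.draw,memory.used --format=csv,noheader   # GPU model, clock, thermal state, power draw, VRAM in use
nvcc --version | grep release     # CUDA toolkit version
cat /sys/class/power_supply/A*/online 2>/dev/null   # 1 = mains power, 0 = battery (adapter name varies by laptop)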

How this works

1. We mark a model+hardware combo as “wanted” when measuring it would unlock new pages, sharpen a confidence tier, or fill a gap that operators are repeatedly asking about.

2. When you click “I can measure this,” the submission form opens with the model, hardware, and runtime prefilled. You add your measurements (tok/s, VRAM, context, runtime version, OS).

3. Submissions still go through editorial review. We don't auto-publish. If your numbers are plausible and well-documented, we mark the opportunity as measured and the benchmark goes live with full attribution.
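
If you use llama.cpp and want to cross-check the decode figure you are about to submit, llama-bench produces a comparable single-stream number. A minimal sketch, assuming a llama.cpp build with llama-bench on PATH; the model path and -ngl value are placeholders, and this is a sanity check rather than a replacement for the community-benchmark scripts.

llama-bench -m ./models/your-model-q4_k_m.gguf -p 512 -n 128 -ngl 99
# The tg128 row is single-stream decode tok/s (the number this page asks for); the pp512 row is prompt processing, a different metric.
# -ngl 99 offloads all layers to the GPU; lower it if the model doesn't fit in VRAM.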

  • Critical
    target: 55-75 tok/s decode (single stream)

    Single RTX 5090 + Qwen 3 Coder 32B (vLLM, AWQ-INT4)

    Qwen 3 Coder 32B on NVIDIA GeForce RTX 5090 · vLLM · AWQ-INT4

    Why we want this

    The single-5090 baseline is the comparison anchor for every multi-GPU recommendation on this site. Without it, the 'should I just buy one bigger card?' question can't be answered with confidence.

    Unlocks
    /hardware/rtx-5090 · /guides/choosing-a-gpu-for-local-ai-2026 · /guides/running-local-ai-on-multiple-gpus-2026 · /will-it-run/rtx-5090
    Run this benchmark on your rig
    MODEL="qwen-3:32b" ./scripts/community-benchmark/run-benchmark.sh
    I can measure this → · Model page · Hardware page
  • High
    target: 12-25 tok/s decode (Hexagon NPU, estimate)

    Snapdragon 8 Elite + Phi-3.5 Mini (Qualcomm AI Hub, INT8)

    Phi-3.5 Mini Instruct on Qualcomm Snapdragon 8 Elite · Qualcomm AI Hub · INT8

    Why we want this

    Snapdragon 8 Elite is the mid-2025 flagship for Android on-device LLM inference. Establishing the NPU-vs-GPU-fallback tradeoff numbers is critical for the Android-on-device guidance.

    Unlocks
    /hardware/snapdragon-8-elite · /tools/qualcomm-ai-hub · /stacks/android-on-device-ai
    Run this benchmark on your rig
    MODEL="phi-3.5:phi-3.5-mini-instruct" ./scripts/community-benchmark/run-benchmark.sh
    I can measure this → · Model page · Hardware page
  • High
    target: 8-15 tok/s decode (estimate, sustained)

    iPhone 16 Pro + Llama 3.2 3B (MLX Swift, INT4)

    Llama 3.2 3B Instruct on Apple A18 Pro · MLX Swift · MLX-INT4

    Why we want this

    Mobile on-device LLM viability is the most-asked question in the iPhone-developer ecosystem in 2026. A measured tok/s + battery drain + thermal throttling curve answers 'can I ship this in my app?'

    Unlocks
    /hardware/apple-a18-pro · /systems/mobile-local-ai · /stacks/iphone-on-device-ai
    Run this benchmark on your rig
    MODEL="llama-3.2:3b" ./scripts/community-benchmark/run-benchmark.sh
    I can measure this → · Model page · Hardware page
  • High
    target: 100-160 tok/s decode (single stream)

    4× H100 SXM + DeepSeek V4 Flash (vLLM TP-4, INT4)

    DeepSeek V4 Flash (284B MoE) on 4× H100 SXM · vLLM · AWQ-INT4

    Why we want this

    DeepSeek V4 Flash with the MTP head is claimed to be the throughput leader. Verifying the MTP advantage on production hardware is high-value for V4-Pro-vs-V4-Flash decision-making.

    Unlocks
    /hardware-combos/vllm-tensor-parallel-h100-workstation · /stacks/h100-tensor-parallel-workstation · /models/deepseek-v4-flash
    Run this benchmark on your rig
    MODEL="deepseek-v4:deepseek-v4-flash" ./scripts/community-benchmark/run-benchmark.sh
    I can measure this → · Model page
  • High
    target: 8-14 tok/s decode (single stream)

    Mac Studio M3 Ultra 192GB + Qwen 3.5 235B-A17B (MLX-4bit)

    Qwen 3.5 235B-A17B (MoE) on Mac Studio M3 Ultra 192GB · MLX-LM · MLX-4bit

    Why we want this

    The Apple-vs-NVIDIA comparison at the frontier-MoE tier is the most-asked question for Mac Studio buyers. Editorial estimate is 25-30% of NVIDIA throughput; measured value would close the loop.

    Unlocks
    /hardware-combos/mac-studio-m3-ultra-192gb · /stacks/apple-silicon-ai · /will-it-run/combo/mac-studio-m3-ultra-192gb · /guides/running-local-ai-on-multiple-gpus-2026
    Run this benchmark on your rig
    MODEL="qwen-3.5:17b" ./scripts/community-benchmark/run-benchmark.sh
    I can measure this → · Model page
  • High
    target: 60-90 tok/s decode (single stream)

    4× H100 SXM + Qwen 3.5 235B-A17B (vLLM TP-4, FP8)

    Qwen 3.5 235B-A17B (MoE) on 4× H100 SXM · vLLM · FP8

    Why we want this

    The frontier-MoE production reference. Organizations weighing $200k+ DGX-class purchases vs cloud rental need measured throughput to model cost-per-million-tokens accurately.

    Unlocks
    /hardware-combos/vllm-tensor-parallel-h100-workstation · /stacks/h100-tensor-parallel-workstation · /will-it-run/combo/vllm-tensor-parallel-h100-workstation · /guides/running-local-ai-on-multiple-gpus-2026
    Run this benchmark on your rig
    MODEL="qwen-3.5:17b" ./scripts/community-benchmark/run-benchmark.sh
    I can measure this → · Model page
  • High
    target: 28-36 tok/s decode (PCIe only)

    Dual RTX 4090 + Llama 3.3 70B Q4 (vLLM tensor-parallel)

    Llama 3.3 70B Instruct on 2× RTX 4090 · vLLM · AWQ-INT4

    Why we want this

    Pairs with the dual-3090 measurement to quantify the NVLink-vs-PCIe penalty. The 4090's lack of NVLink is the single most-misunderstood spec gap; a measured comparison ends the speculation. A minimal vLLM tensor-parallel launch sketch follows the list below.

    Unlocks
    /hardware-combos/dual-rtx-4090 · /stacks/dual-4090-workstation · /will-it-run/combo/dual-rtx-4090 · /guides/running-local-ai-on-multiple-gpus-2026
    Run this benchmark on your rig
    MODEL="llama-3.3:70b" ./scripts/community-benchmark/run-benchmark.sh
    I can measure this → · Model page
  • High
    target: 25-32 tok/s decode (NVLink)

    Dual RTX 3090 + Llama 3.3 70B Q4 (vLLM tensor-parallel)

    Llama 3.3 70B Instruct on 2× RTX 3090 (NVLink) · vLLM · AWQ-INT4

    Why we want this

    The dual-3090 NVLink build is the most-recommended prosumer multi-GPU configuration on this site. Without a measured benchmark, the 25-32 tok/s estimate carries editorial-only confidence — operators making $1,500+ buying decisions deserve real numbers.

    Unlocks
    /hardware-combos/dual-rtx-3090 · /stacks/dual-3090-workstation · /will-it-run/combo/dual-rtx-3090 · /guides/running-local-ai-on-multiple-gpus-2026
    Run this benchmark on your rig
    MODEL="llama-3.3:70b" ./scripts/community-benchmark/run-benchmark.sh
    I can measure this → · Model page
  • Medium
    target: 10-22 tok/s decode (Adreno GPU path)

    Snapdragon 8 Elite + Llama 3.2 3B (MLC LLM, GPU)

    Llama 3.2 3B Instruct on Qualcomm Snapdragon 8 Elite · MLC LLM · Q4_K_M (TVM-quant)

    Why we want this

    MLC LLM is cross-platform and the most-deployed mobile LLM runtime. The Adreno-vs-Hexagon comparison on the same SoC determines whether NPU lock-in is worth the throughput gain.

    Unlocks
    /hardware/snapdragon-8-elite · /tools/mlc-llm · /stacks/android-on-device-ai
    Run this benchmark on your rig
    MODEL="llama-3.2:3b" ./scripts/community-benchmark/run-benchmark.sh
    I can measure this → · Model page · Hardware page
  • Medium
    target: 20-35 tok/s decode (cold); throttle curve TBD

    iPad M4 + Qwen 2.5 3B (MLX, sustained-load curve)

    Qwen 2.5 3B Instruct on Apple M4 (iPad Pro) · MLX-LM · MLX-4bit

    Why we want this

    Tablet-class on-device viability for journaling / long-form summarization. Needs the throttle curve, not just peak tok/s.

    Unlocks
    /hardware/apple-m4-ipad · /systems/mobile-local-ai
    Run this benchmark on your rig
    MODEL="qwen-2.5:3b" ./scripts/community-benchmark/run-benchmark.sh
    I can measure this → · Model page · Hardware page
  • Medium
    target: 18-35 tok/s decode (estimate)

    Intel Lunar Lake + Phi-3.5 Mini (OpenVINO NPU)

    Phi-3.5 Mini Instruct on Intel Core Ultra 7 258V (Lunar Lake) · ONNX Runtime Mobile · INT8

    Why we want this

    Lunar Lake is the Intel reference for Copilot+ PCs. Comparison vs Snapdragon X NPU determines which Copilot+ chip operators should prefer for on-device LLMs.

    Unlocks
    /hardware/intel-lunar-lake-258v · /systems/mobile-local-ai
    Run this benchmark on your rig
    MODEL="phi-3.5:phi-3.5-mini-instruct" ./scripts/community-benchmark/run-benchmark.sh
    I can measure this → · Model page · Hardware page
  • Medium
    target: 20-40 tok/s decode (estimate)

    Snapdragon X Elite + Phi-3.5 Mini (ONNX Runtime + DirectML NPU)

    Phi-3.5 Mini Instruct on Qualcomm Snapdragon X Elite · ONNX Runtime Mobile · INT8

    Why we want this

    Copilot+ PC ecosystem is rapidly expanding. The Snapdragon X NPU vs Lunar Lake NPU vs CPU-fallback comparison is the operator decision for Windows on-device deployments.

    Unlocks
    /hardware/snapdragon-x-elite · /tools/onnx-runtime-mobile · /systems/mobile-local-ai
    Run this benchmark on your rig
    MODEL="phi-3.5:phi-3.5-mini-instruct" ./scripts/community-benchmark/run-benchmark.sh
    I can measure this → · Model page · Hardware page
  • Medium
    target: 30-45 tok/s per stream × 4-32 concurrent

    Ray Serve 4-node × 2× 4090 + Qwen 3 32B (concurrency scan)

    Qwen 3 32B on 4 nodes × 2× RTX 4090 · Ray Serve · AWQ-INT4

    Why we want this

    Distributed-serving patterns differ from tensor-parallel — replicas scale aggregate concurrency, not single-stream model size. The concurrency scan reveals where Ray Serve replicas plateau.

    Unlocks
    /hardware-combos/ray-serve-distributed-multi-node · /stacks/distributed-inference-homelab
    Run this benchmark on your rig
    MODEL="qwen-3:32b" ./scripts/community-benchmark/run-benchmark.sh
    I can measure this → · Model page
  • Medium
    target: 4-9 tok/s decode (Thunderbolt 5 inter-node)

    4× Mac Mini M4 Pro Exo cluster + Llama 3.1 70B (MLX-4bit)

    Llama 3.1 70B Instruct on 4× Mac Mini M4 Pro · Exo · MLX-4bit

    Why we want this

    Multi-Mac Exo clusters are an emerging pattern. The cluster-vs-single-Mac-Studio comparison establishes whether the cluster is ever the right answer outside extreme memory targets.

    Unlocks
    /hardware-combos/quad-mac-mini-m4-pro-exo · /stacks/multi-machine-apple-cluster · /will-it-run/combo/quad-mac-mini-m4-pro-exo
    Run this benchmark on your rig
    MODEL="llama-3.1:70b" ./scripts/community-benchmark/run-benchmark.sh
    I can measure this → · Model page
  • Medium
    target: 20-28 tok/s decode (with thinking-mode bloat)

    4× RTX 3090 + DeepSeek R1 Distill Llama 70B (vLLM TP-4)

    DeepSeek R1 Distill Llama 70B on 4× RTX 3090 · vLLM · AWQ-INT4

    Why we want this

    Quad-3090 is the prosumer-ceiling stack. R1 reasoning workloads are a high-traffic use case but the thinking-mode token bloat changes the throughput calculation — needed to set realistic operator expectations.

    Unlocks
    /hardware-combos/quad-rtx-3090 · /stacks/quad-3090-workstation · /will-it-run/combo/quad-rtx-3090
    Run this benchmark on your rig
    MODEL="deepseek-r1:70b" ./scripts/community-benchmark/run-benchmark.sh
    I can measure this → · Model page
  • Low
    target: 10-16 tok/s (asymmetric layer-split)

    llama.cpp layer-split + Mixtral 8x22B (mixed 4090+3090)

    Mixtral 8x22B Instruct on mixed RTX 4090 + RTX 3090 · llama.cpp · Q4_K_M

    Why we want this

    Research-tier benchmark — we editorially discourage mixed-GPU setups for new builds, but the numbers are useful for operators who already own an asymmetric pair.

    Unlocks
    /hardware-combos/mixed-4090-3090 · /stacks/mixed-4090-3090-workstation
    Run this benchmark on your rig
    MODEL="mixtral-8x22b:22b" ./scripts/community-benchmark/run-benchmark.sh
    I can measure this → · Model page
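
For the multi-GPU vLLM entries above (dual 3090, dual 4090, quad 3090, 4× H100 SXM), the serving side follows one pattern. A minimal launch sketch with a recent vLLM is below; the Hugging Face model ID, context length, and port are placeholders, and the result block still comes from the community-benchmark script.

# Sketch only: OpenAI-compatible vLLM server split across 2 GPUs via tensor parallelism.
# Use --tensor-parallel-size 4 for the quad-3090 and 4× H100 entries. The model ID is a placeholder.
vllm serve Qwen/Qwen2.5-32B-Instruct-AWQ \
  --tensor-parallel-size 2 \
  --quantization awq \
  --max-model-len 8192 \
  --port 8000
# Then run the entry's benchmark command, e.g. MODEL="llama-3.3:70b" ./scripts/community-benchmark/run-benchmark.sh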

Need a measurement we don't cover?

Got a measurement? Submit it directly — editorial adds the corresponding opportunity row and credits the gap to you.

Need one measured? Request it. Editorial reviews and surfaces accepted requests in the section above.

Submit a benchmark → · Request a benchmark →