Benchmark roadmap — what we want measured

The public version of our benchmark queue. Each entry is a model+hardware combo we'd like a measurement for, and why that measurement would unlock useful pages or sharpen a confidence tier. If you have the rig, click “I can measure this” — the submission form arrives prefilled with model, hardware, and runtime.

Benchmark coverage

0 measured · 0 wanted · 168 unstarted

0.0% of 168 cells covered

Hardware ↓ / Model →	DeepSeek V4 Pro (1	Qwen 3.5 235B-A17B	Qwen 3 235B-A22B	Llama 3.1 8B Instr	DeepSeek R1 (671B	DeepSeek V4 Flash	Llama 4 Scout	Qwen 3 30B-A3B	Qwen 2.5 Coder 32B	Llama 3.3 70B Inst	Qwen 3 32B	Gemma 4 31B Dense	Qwen 3 8B	Mistral Medium 3.5
AMD Instinct MI325X
NVIDIA H200
NVIDIA L40S
NVIDIA B200
NVIDIA H100 PCIe
NVIDIA L40
NVIDIA RTX PRO 6000 Blackwel
NVIDIA RTX 6000 Ada Generati
AMD Instinct MI300X
NVIDIA GB200 NVL72
AMD Instinct MI355X
AMD Instinct MI300A (APU)

Legend: High confidence · Medium confidence · Critical wanted · Wanted · Not started
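The coverage figure follows from the grid dimensions: 12 hardware rows by 14 model columns gives 168 cells. A minimal sketch of that arithmetic, assuming coverage is simply measured cells divided by total cells (the function name `coverage_percent` is illustrative and not taken from the site's code):

```python
def coverage_percent(measured: int, hardware_rows: int, model_columns: int) -> float:
    """Return the share of grid cells that have a measurement, as a percentage."""
    total_cells = hardware_rows * model_columns
    if total_cells == 0:
        return 0.0
    return round(100.0 * measured / total_cells, 1)


# The grid on this page: 12 hardware rows, 14 model columns, 0 measured cells.
print(12 * 14)                      # 168
print(coverage_percent(0, 12, 14))  # 0.0
```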

Have a rig? Run a benchmark in 5 minutes.

The community-benchmark scripts capture power source, GPU clock, CUDA version, and thermal state: the variables that most affect the tok/s number. The output is a paste-ready result block. Nothing uploads automatically.

Windows

.\scripts\community-benchmark\run-benchmark.ps1

macOS / Linux / WSL

./scripts/community-benchmark/run-benchmark.sh

Get the scripts on GitHub → · Read the measurement methodology · Email a result block

How this works

1. We mark a model+hardware combo as “wanted” when measuring it would unlock new pages, sharpen a confidence tier, or fill a gap that operators are repeatedly asking about.

2. When you click “I can measure this,” the submission form opens with the model, hardware, and runtime prefilled. You add your measurements (tok/s, VRAM, context, runtime version, OS).

3. Submissions still go through editorial review. We don't auto-publish. If your numbers are plausible and well-documented, we mark the opportunity as measured and the benchmark goes live with full attribution.
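Before submitting, it can help to check your own numbers against the entry's target range. The sketch below is one way to do that, not the editorial review process: the `Submission` record, the `precheck` function, and the 50% tolerance are all assumptions made for illustration. Only the field list (tok/s, VRAM, context, runtime version, OS) comes from step 2.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Submission:
    """Hypothetical record holding the fields listed in step 2."""
    model: str
    hardware: str
    runtime: str
    tok_per_s: float
    vram_gb: float
    context_tokens: int
    runtime_version: str
    os_name: str


def precheck(sub: Submission, target_low: float, target_high: float,
             tolerance: float = 0.5) -> List[str]:
    """Return issues worth resolving before submitting.

    A result outside the target range is not wrong; the flag only means
    a reviewer is likely to ask for more documentation.
    """
    issues = []
    if sub.tok_per_s <= 0:
        issues.append("tok/s must be positive")
    if not sub.runtime_version.strip():
        issues.append("runtime version is missing")
    if not sub.os_name.strip():
        issues.append("OS is missing")
    # Assumed tolerance band around the published target range.
    low = target_low * (1 - tolerance)
    high = target_high * (1 + tolerance)
    if sub.tok_per_s > 0 and not (low <= sub.tok_per_s <= high):
        issues.append("tok/s is far from the target range; document the setup")
    return issues


# Illustrative placeholder values, not a real measurement.
example = Submission(
    model="Qwen 3 Coder 32B", hardware="NVIDIA GeForce RTX 5090", runtime="vLLM",
    tok_per_s=62.0, vram_gb=24.0, context_tokens=8192,
    runtime_version="", os_name="Linux",
)
print(precheck(example, target_low=55, target_high=75))
# ['runtime version is missing']
```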

Critical
target: 55-75 tok/s decode (single stream)
Single RTX 5090 + Qwen 3 Coder 32B (vLLM, AWQ-INT4)
Qwen 3 Coder 32B on NVIDIA GeForce RTX 5090 · vLLM · AWQ-INT4
Why we want this
The single-5090 baseline is the comparison anchor for every multi-GPU recommendation on this site. Without it, the 'should I just buy one bigger card?' question can't be answered with confidence.
Unlocks
/hardware/rtx-5090 · /guides/choosing-a-gpu-for-local-ai-2026 · /guides/running-local-ai-on-multiple-gpus-2026 · /will-it-run/rtx-5090
Run this benchmark on your rig
```
MODEL="qwen-3:32b" ./scripts/community-benchmark/run-benchmark.sh
```
I can measure this → · Model page · Hardware page
High
target: 12-25 tok/s decode (Hexagon NPU, estimate)
Snapdragon 8 Elite + Phi-3.5 Mini (Qualcomm AI Hub, INT8)
Phi-3.5 Mini Instruct on Qualcomm Snapdragon 8 Elite · Qualcomm AI Hub · INT8
Why we want this
Snapdragon 8 Elite is the mid-2025 flagship for Android on-device LLM inference. Establishing the NPU-vs-GPU-fallback tradeoff numbers is critical for the Android-on-device guidance.
Unlocks
/hardware/snapdragon-8-elite · /tools/qualcomm-ai-hub · /stacks/android-on-device-ai
Run this benchmark on your rig
```
MODEL="phi-3.5:phi-3.5-mini-instruct" ./scripts/community-benchmark/run-benchmark.sh
```
I can measure this → · Model page · Hardware page
High
target: 8-15 tok/s decode (estimate, sustained)
iPhone 16 Pro + Llama 3.2 3B (MLX Swift, INT4)
Llama 3.2 3B Instruct on Apple A18 Pro · MLX Swift · MLX-INT4
Why we want this
Mobile on-device LLM viability is the most-asked question in the iPhone-developer ecosystem in 2026. A measured tok/s + battery drain + thermal throttling curve answers 'can I ship this in my app?'
Unlocks
/hardware/apple-a18-pro · /systems/mobile-local-ai · /stacks/iphone-on-device-ai
Run this benchmark on your rig
```
MODEL="llama-3.2:3b" ./scripts/community-benchmark/run-benchmark.sh
```
I can measure this → · Model page · Hardware page
High
target: 100-160 tok/s decode (single stream)
4× H100 SXM + DeepSeek V4 Flash (vLLM TP-4, INT4)
DeepSeek V4 Flash (284B MoE) on 4× H100 SXM · vLLM · AWQ-INT4
Why we want this
DeepSeek V4 Flash with the MTP head is claimed to be the throughput leader. Verifying the MTP advantage on production hardware is high-value for V4-Pro-vs-V4-Flash decision-making.
Unlocks
/hardware-combos/vllm-tensor-parallel-h100-workstation · /stacks/h100-tensor-parallel-workstation · /models/deepseek-v4-flash
Run this benchmark on your rig
```
MODEL="deepseek-v4:deepseek-v4-flash" ./scripts/community-benchmark/run-benchmark.sh
```
I can measure this → · Model page
High
target: 8-14 tok/s decode (single stream)
Mac Studio M3 Ultra 192GB + Qwen 3.5 235B-A17B (MLX-4bit)
Qwen 3.5 235B-A17B (MoE) on Mac Studio M3 Ultra 192GB · MLX-LM · MLX-4bit
Why we want this
The Apple-vs-NVIDIA comparison at the frontier-MoE tier is the most-asked question for Mac Studio buyers. Editorial estimate is 25-30% of NVIDIA throughput; measured value would close the loop.
Unlocks
/hardware-combos/mac-studio-m3-ultra-192gb · /stacks/apple-silicon-ai · /will-it-run/combo/mac-studio-m3-ultra-192gb · /guides/running-local-ai-on-multiple-gpus-2026
Run this benchmark on your rig
```
MODEL="qwen-3.5:17b" ./scripts/community-benchmark/run-benchmark.sh
```
I can measure this → · Model page
High
target: 60-90 tok/s decode (single stream)
4× H100 SXM + Qwen 3.5 235B-A17B (vLLM TP-4, FP8)
Qwen 3.5 235B-A17B (MoE) on 4× H100 SXM · vLLM · FP8
Why we want this
The frontier-MoE production reference. Organizations weighing $200k+ DGX-class purchases vs cloud rental need measured throughput to model cost-per-million-tokens accurately.
Unlocks
/hardware-combos/vllm-tensor-parallel-h100-workstation · /stacks/h100-tensor-parallel-workstation · /will-it-run/combo/vllm-tensor-parallel-h100-workstation · /guides/running-local-ai-on-multiple-gpus-2026
Run this benchmark on your rig
```
MODEL="qwen-3.5:17b" ./scripts/community-benchmark/run-benchmark.sh
```
I can measure this → · Model page
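To see why a measured figure matters for the cost model mentioned in this entry, here is a rough sketch of the cost-per-million-tokens arithmetic. The hourly cost is a placeholder assumption, not a quoted price, and the function name is illustrative. Single-stream decode speed is used because that is what the target states; a batched deployment serving many streams at once would spread the same hourly cost over more tokens.

```python
def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Cost in USD to generate one million tokens at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Placeholder hourly cost (an assumption for illustration only).
HOURLY_COST_USD = 10.0

# The two ends of the 60-90 tok/s single-stream target for this entry.
for rate in (60, 90):
    cost = cost_per_million_tokens(HOURLY_COST_USD, rate)
    print(f"{rate} tok/s -> ${cost:.2f} per million tokens")
# 60 tok/s -> $46.30 per million tokens
# 90 tok/s -> $30.86 per million tokens
```

The spread between the two ends of the estimate is roughly 50%, which is the uncertainty a measured value would remove.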
High
target: 28-36 tok/s decode (PCIe only)
Dual RTX 4090 + Llama 3.3 70B Q4 (vLLM tensor-parallel)
Llama 3.3 70B Instruct on Dual RTX 4090 · vLLM · AWQ-INT4
Why we want this
Pairs with the dual-3090 measurement to quantify the NVLink-vs-PCIe penalty. The RTX 4090's lack of NVLink is the single most-misunderstood spec gap; a measured comparison ends the speculation.
Unlocks
/hardware-combos/dual-rtx-4090 · /stacks/dual-4090-workstation · /will-it-run/combo/dual-rtx-4090 · /guides/running-local-ai-on-multiple-gpus-2026
Run this benchmark on your rig
```
MODEL="llama-3.3:70b" ./scripts/community-benchmark/run-benchmark.sh
```
I can measure this → · Model page
High
target: 25-32 tok/s decode (NVLink)
Dual RTX 3090 + Llama 3.3 70B Q4 (vLLM tensor-parallel)
Llama 3.3 70B Instruct on Dual RTX 3090 · vLLM · AWQ-INT4
Why we want this
The dual-3090 NVLink build is the most-recommended prosumer multi-GPU configuration on this site. Without a measured benchmark, the 25-32 tok/s estimate carries editorial-only confidence — operators making $1,500+ buying decisions deserve real numbers.
Unlocks
/hardware-combos/dual-rtx-3090 · /stacks/dual-3090-workstation · /will-it-run/combo/dual-rtx-3090 · /guides/running-local-ai-on-multiple-gpus-2026
Run this benchmark on your rig
```
MODEL="llama-3.3:70b" ./scripts/community-benchmark/run-benchmark.sh
```
I can measure this → · Model page
Medium
target: 10-22 tok/s decode (Adreno GPU path)
Snapdragon 8 Elite + Llama 3.2 3B (MLC LLM, GPU)
Llama 3.2 3B Instruct on Qualcomm Snapdragon 8 Elite · MLC LLM · Q4_K_M (TVM-quant)
Why we want this
MLC LLM is cross-platform and the most-deployed mobile LLM runtime. The Adreno-vs-Hexagon comparison on the same SoC determines whether NPU lock-in is worth the throughput gain.
Unlocks
/hardware/snapdragon-8-elite · /tools/mlc-llm · /stacks/android-on-device-ai
Run this benchmark on your rig
```
MODEL="llama-3.2:3b" ./scripts/community-benchmark/run-benchmark.sh
```
I can measure this → · Model page · Hardware page
Medium
target: 20-35 tok/s decode (cold); throttle curve TBD
iPad M4 + Qwen 2.5 3B (MLX, sustained-load curve)
Qwen 2.5 3B Instruct on Apple M4 (iPad Pro) · MLX-LM · MLX-4bit
Why we want this
Tablet-class on-device viability for journaling / long-form summarization. Needs the throttle curve, not just peak tok/s.
Unlocks
/hardware/apple-m4-ipad · /systems/mobile-local-ai
Run this benchmark on your rig
```
MODEL="qwen-2.5:3b" ./scripts/community-benchmark/run-benchmark.sh
```
I can measure this → · Model page · Hardware page
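Since this entry asks for a throttle curve rather than a single peak number, one possible way to summarize a sustained run is sketched below. The sample format (elapsed seconds, tok/s), the 60-second cold window, and the 600-second sustained window are all assumptions chosen for illustration; the benchmark scripts may record and report this differently.

```python
from statistics import mean
from typing import Dict, List, Tuple


def summarize_throttle(samples: List[Tuple[float, float]],
                       cold_until_s: float = 60.0,
                       sustained_after_s: float = 600.0) -> Dict[str, float]:
    """Summarize (elapsed_seconds, tok_per_s) samples into cold and sustained rates."""
    cold = [rate for t, rate in samples if t <= cold_until_s]
    sustained = [rate for t, rate in samples if t >= sustained_after_s]
    if not cold or not sustained:
        raise ValueError("need samples in both the cold and sustained windows")
    cold_rate = mean(cold)
    sustained_rate = mean(sustained)
    return {
        "cold_tok_per_s": cold_rate,
        "sustained_tok_per_s": sustained_rate,
        # Fraction of cold throughput that survives once the device heats up.
        "retained_fraction": sustained_rate / cold_rate,
    }


# Placeholder samples for illustration only, not measured data.
samples = [(0, 30.0), (60, 30.0), (300, 24.0), (600, 18.0), (900, 18.0)]
summary = summarize_throttle(samples)
print(round(summary["retained_fraction"], 2))  # 0.6
```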
Medium
target: 18-35 tok/s decode (estimate)
Intel Lunar Lake + Phi-3.5 Mini (OpenVINO NPU)
Phi-3.5 Mini Instruct on Intel Core Ultra 7 258V (Lunar Lake) · ONNX Runtime Mobile · INT8
Why we want this
Lunar Lake is the Intel reference for Copilot+ PCs. Comparison vs Snapdragon X NPU determines which Copilot+ chip operators should prefer for on-device LLMs.
Unlocks
/hardware/intel-lunar-lake-258v · /systems/mobile-local-ai
Run this benchmark on your rig
```
MODEL="phi-3.5:phi-3.5-mini-instruct" ./scripts/community-benchmark/run-benchmark.sh
```
I can measure this → · Model page · Hardware page
Medium
target: 20-40 tok/s decode (estimate)
Snapdragon X Elite + Phi-3.5 Mini (ONNX Runtime + DirectML NPU)
Phi-3.5 Mini Instruct on Qualcomm Snapdragon X Elite · ONNX Runtime Mobile · INT8
Why we want this
Copilot+ PC ecosystem is rapidly expanding. The Snapdragon X NPU vs Lunar Lake NPU vs CPU-fallback comparison is the operator decision for Windows on-device deployments.
Unlocks
/hardware/snapdragon-x-elite · /tools/onnx-runtime-mobile · /systems/mobile-local-ai
Run this benchmark on your rig
```
MODEL="phi-3.5:phi-3.5-mini-instruct" ./scripts/community-benchmark/run-benchmark.sh
```
I can measure this → · Model page · Hardware page
Medium
target: 30-45 tok/s per stream × 4-32 concurrent
Ray Serve 4-node × 2× 4090 + Qwen 3 32B (concurrency scan)
Qwen 3 32B on 4-node × 2× 4090 · Ray Serve · AWQ-INT4
Why we want this
Distributed-serving patterns differ from tensor-parallel — replicas scale aggregate concurrency, not single-stream model size. The concurrency scan reveals where Ray Serve replicas plateau.
Unlocks
/hardware-combos/ray-serve-distributed-multi-node · /stacks/distributed-inference-homelab
Run this benchmark on your rig
```
MODEL="qwen-3:32b" ./scripts/community-benchmark/run-benchmark.sh
```
I can measure this → · Model page
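The target for this entry is a per-stream rate across a range of concurrency levels, so the quantity of interest is aggregate throughput and the point where it stops growing. The sketch below shows one way to read a concurrency scan. The 10% minimum-gain threshold and both function names are assumptions for illustration, and the scan values are placeholders rather than measurements.

```python
from typing import Dict, Optional


def aggregate_tok_per_s(per_stream: float, concurrency: int) -> float:
    """Aggregate throughput if every stream sustains the same per-stream rate."""
    return per_stream * concurrency


def plateau_point(scan: Dict[int, float], min_gain: float = 0.10) -> Optional[int]:
    """Return the concurrency level after which aggregate tok/s grows by less
    than min_gain relative to the previous level, or None if it never does."""
    levels = sorted(scan)
    for prev, cur in zip(levels, levels[1:]):
        if scan[cur] < scan[prev] * (1 + min_gain):
            return prev
    return None


# Bounds implied by the target: 30-45 tok/s per stream at 4-32 concurrent streams.
print(aggregate_tok_per_s(30, 4))   # 120
print(aggregate_tok_per_s(45, 32))  # 1440

# Placeholder scan (concurrency -> aggregate tok/s), for illustration only.
scan = {4: 160.0, 8: 300.0, 16: 520.0, 32: 540.0}
print(plateau_point(scan))  # 16
```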
Medium
target: 4-9 tok/s decode (Thunderbolt 5 inter-node)
4× Mac Mini M4 Pro Exo cluster + Llama 3.1 70B (MLX-4bit)
Llama 3.1 70B Instruct on 4× Mac Mini M4 Pro · Exo · MLX-4bit
Why we want this
Multi-Mac Exo clusters are an emerging pattern. The cluster-vs-single-Mac-Studio comparison establishes whether the cluster is ever the right answer outside extreme memory targets.
Unlocks
/hardware-combos/quad-mac-mini-m4-pro-exo · /stacks/multi-machine-apple-cluster · /will-it-run/combo/quad-mac-mini-m4-pro-exo
Run this benchmark on your rig
```
MODEL="llama-3.1:70b" ./scripts/community-benchmark/run-benchmark.sh
```
I can measure this → · Model page
Medium
target: 20-28 tok/s decode (with thinking-mode bloat)
4× RTX 3090 + DeepSeek R1 Distill Llama 70B (vLLM TP-4)
DeepSeek R1 Distill Llama 70B on 4× RTX 3090 · vLLM · AWQ-INT4
Why we want this
Quad-3090 is the prosumer-ceiling stack. R1 reasoning workloads are a high-traffic use case but the thinking-mode token bloat changes the throughput calculation — needed to set realistic operator expectations.
Unlocks
/hardware-combos/quad-rtx-3090 · /stacks/quad-3090-workstation · /will-it-run/combo/quad-rtx-3090
Run this benchmark on your rig
```
MODEL="deepseek-r1:70b" ./scripts/community-benchmark/run-benchmark.sh
```
I can measure this → · Model page
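One way to read "thinking-mode token bloat" is that the decode rate stays the same while many of the generated tokens never reach the final answer. Under that assumption, the sketch below estimates the rate at which answer tokens arrive and the wait for a complete response. The token counts are placeholders and the function names are illustrative.

```python
def effective_answer_tok_per_s(decode_tok_per_s: float,
                               thinking_tokens: int,
                               answer_tokens: int) -> float:
    """Answer tokens delivered per second once thinking tokens are counted as overhead."""
    total = thinking_tokens + answer_tokens
    if total <= 0:
        raise ValueError("need at least one generated token")
    return decode_tok_per_s * answer_tokens / total


def response_latency_s(decode_tok_per_s: float,
                       thinking_tokens: int,
                       answer_tokens: int) -> float:
    """Seconds to decode the full response, thinking tokens included."""
    if decode_tok_per_s <= 0:
        raise ValueError("decode_tok_per_s must be positive")
    return (thinking_tokens + answer_tokens) / decode_tok_per_s


# 24 tok/s sits inside the 20-28 tok/s target; token counts are placeholders.
print(effective_answer_tok_per_s(24.0, thinking_tokens=1500, answer_tokens=500))  # 6.0
print(round(response_latency_s(24.0, thinking_tokens=1500, answer_tokens=500), 1))  # 83.3
```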
Low
target: 10-16 tok/s (asymmetric layer-split)
llama.cpp layer-split + Mixtral 8x22B (mixed 4090+3090)
Mixtral 8x22B Instruct on mixed 4090+3090 · llama.cpp · Q4_K_M
Why we want this
Research-tier benchmark: editorial discourages mixed-GPU setups for new builds. Useful for users who already own asymmetric pairs.
Unlocks
/hardware-combos/mixed-4090-3090 · /stacks/mixed-4090-3090-workstation
Run this benchmark on your rig
```
MODEL="mixtral-8x22b:22b" ./scripts/community-benchmark/run-benchmark.sh
```
I can measure this → · Model page

Need a measurement we don't cover?

Got a measurement? Submit it directly — editorial adds the corresponding opportunity row and credits the gap to you.

Need one measured? Request it. Editorial reviews and surfaces accepted requests in the section above.

Submit a benchmark → · Request a benchmark →