Benchmark roadmap — what we want measured
The public version of our benchmark queue. Each entry is a model+hardware combo we'd like a measurement for, and why that measurement would unlock useful pages or sharpen a confidence tier. If you have the rig, click “I can measure this” — the submission form arrives prefilled with model, hardware, and runtime.
22 measured or reviewed / 0 estimated / 0 wanted / 6 unstarted
Have a rig? Capture a useful benchmark in 10-20 minutes.
The public benchmark runner captures the variables we currently use for review: runtime, GPU, OS, driver, run spread, and a paste-ready result block. Nothing uploads unless you pass --submit.
Windows (PowerShell):
iwr -useb https://www.runlocalai.co/bench.mjs -OutFile bench.mjs; node bench.mjs --model llama3.1:8b
macOS / Linux:
curl -fsSL https://www.runlocalai.co/bench.mjs -o bench.mjs && node bench.mjs --model llama3.1:8b
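To make "run spread" and "paste-ready result block" concrete, here is a minimal sketch of one way to summarize repeated runs. It is not the code inside bench.mjs: the function names, the field names, and the definition of spread as (max - min) / median are all assumptions made for illustration.

```javascript
// Hypothetical helpers, NOT taken from bench.mjs. Names and fields are assumptions.

// Summarize repeated decode-speed measurements (tok/s) from the same setup.
function summarizeRuns(tokPerSecRuns) {
  if (!Array.isArray(tokPerSecRuns) || tokPerSecRuns.length === 0) {
    throw new Error("need at least one run");
  }
  const sorted = [...tokPerSecRuns].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median = sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
  const min = sorted[0];
  const max = sorted[sorted.length - 1];
  // Assumed definition of spread: range as a percentage of the median.
  const spreadPct = ((max - min) / median) * 100;
  return { runs: sorted.length, min, median, max, spreadPct };
}

// Build a plain-text block that could be pasted into a submission form.
function formatResultBlock(env, summary) {
  return [
    `model: ${env.model}`,
    `runtime: ${env.runtime}`,
    `gpu: ${env.gpu}`,
    `os: ${env.os}`,
    `driver: ${env.driver}`,
    `decode tok/s (median): ${summary.median.toFixed(1)}`,
    `run spread: ${summary.min.toFixed(1)}-${summary.max.toFixed(1)} tok/s ` +
      `(${summary.spreadPct.toFixed(1)}% of median, n=${summary.runs})`,
  ].join("\n");
}

// Usage with placeholder values (not real measurements):
const summary = summarizeRuns([60, 62, 64]);
console.log(formatResultBlock(
  {
    model: "llama3.1:8b",
    runtime: "example-runtime",
    gpu: "example-gpu",
    os: "example-os",
    driver: "example-driver",
  },
  summary
));
```

Reporting the median alongside the min-max range keeps a single slow or fast outlier from distorting the headline number while still showing reviewers how noisy the runs were.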
How this works
1. We mark a model+hardware combo as “wanted” when measuring it would unlock new pages, sharpen a confidence tier, or fill a gap that operators are repeatedly asking about.
2. When you click “I can measure this,” the submission form opens with the model, hardware, and runtime prefilled. You add your measurements (tok/s, VRAM, context, runtime version, OS).
3. Submissions still go through editorial review. We don't auto-publish. If your numbers are plausible and well-documented, we mark the opportunity as measured and the benchmark goes live with full attribution.
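Before submitting, it can help to see where a measurement falls relative to the target range listed on a request. The sketch below is a hypothetical self-check, not the site's review logic (review is editorial, per step 3). The function names and the assumption that targets follow a "LOW-HIGH tok/s" pattern are illustrative only.

```javascript
// Hypothetical pre-submission self-check. The site's actual review is editorial.

// Extract the first "LOW-HIGH tok/s" pair from a target string such as
// "55-75 tok/s decode (single stream)". Returns null when none is found.
function parseTargetRange(target) {
  const m = /(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*tok\/s/.exec(target);
  if (!m) return null;
  return { low: Number(m[1]), high: Number(m[2]) };
}

// Classify a measured decode speed against the parsed target range.
function compareToTarget(measuredTokPerSec, target) {
  const range = parseTargetRange(target);
  if (!range) return "no-target";
  if (measuredTokPerSec < range.low) return "below-target";
  if (measuredTokPerSec > range.high) return "above-target";
  return "within-target";
}

// Usage with a placeholder measurement (not a real result):
console.log(compareToTarget(60, "55-75 tok/s decode (single stream)"));
```

A result outside the range is not necessarily wrong, since the targets are estimates. It is a signal to document the setup (runtime version, context length, quantization) especially carefully.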
- Critical · target: 55-75 tok/s decode (single stream)
Single RTX 5090 + Qwen 3 Coder 32B (vLLM, AWQ-INT4)
Qwen 3 Coder 32B on NVIDIA GeForce RTX 5090 · vLLM · AWQ-INT4
Why we want this: The single-5090 baseline is the comparison anchor for every multi-GPU recommendation on this site. Without it, the 'should I just buy one bigger card?' question can't be answered with confidence.
Unlocks: /hardware/rtx-5090 · /guides/choosing-a-gpu-for-local-ai-2026 · /guides/running-local-ai-on-multiple-gpus-2026 · /will-it-run/rtx-5090
Use the benchmark script, then select this request in the submit form. We avoid guessed model commands unless a request stores a verified reproduction command.
- High · target: 12-25 tok/s decode (Hexagon NPU, estimate)
Snapdragon 8 Elite + Phi-3.5 Mini (Qualcomm AI Hub, INT8)
Phi-3.5 Mini Instruct on Qualcomm Snapdragon 8 Elite · Qualcomm AI Hub · INT8
Why we want this: Snapdragon 8 Elite is the mid-2025 flagship for Android on-device LLM inference. Establishing the NPU-vs-GPU-fallback tradeoff numbers is critical for the Android-on-device guidance.
Unlocks: /hardware/snapdragon-8-elite · /tools/qualcomm-ai-hub · /stacks/android-on-device-ai
- High · target: 8-15 tok/s decode (estimate, sustained)
iPhone 16 Pro + Llama 3.2 3B (MLX Swift, INT4)
Llama 3.2 3B Instruct on Apple A18 Pro · MLX Swift · MLX-INT4
Why we want this: Mobile on-device LLM viability is the most-asked question in the iPhone-developer ecosystem in 2026. A measured tok/s + battery drain + thermal throttling curve answers 'can I ship this in my app?'
Unlocks: /hardware/apple-a18-pro · /systems/mobile-local-ai · /stacks/iphone-on-device-ai
- High · target: 100-160 tok/s decode (single stream)
4× H100 SXM + DeepSeek V4 Flash (vLLM TP-4, INT4)
DeepSeek V4 Flash (284B MoE) on 4× H100 SXM · vLLM · AWQ-INT4
Why we want this: DeepSeek V4 Flash with the MTP head is claimed to be the throughput leader. Verifying the MTP advantage on production hardware is high-value for V4-Pro-vs-V4-Flash decision-making.
Unlocks: /hardware-combos/vllm-tensor-parallel-h100-workstation · /stacks/h100-tensor-parallel-workstation · /models/deepseek-v4-flash
- High · target: 8-14 tok/s decode (single stream)
Mac Studio M3 Ultra 192GB + Qwen 3.5 235B-A17B (MLX-4bit)
Qwen 3.5 235B-A17B (MoE) on Mac Studio M3 Ultra 192GB · MLX-LM · MLX-4bit
Why we want this: The Apple-vs-NVIDIA comparison at the frontier-MoE tier is the most-asked question for Mac Studio buyers. The editorial estimate is 25-30% of NVIDIA throughput; a measured value would close the loop.
Unlocks: /hardware-combos/mac-studio-m3-ultra-192gb · /stacks/apple-silicon-ai · /will-it-run/combo/mac-studio-m3-ultra-192gb · /guides/running-local-ai-on-multiple-gpus-2026
- High · target: 60-90 tok/s decode (single stream)
4× H100 SXM + Qwen 3.5 235B-A17B (vLLM TP-4, FP8)
Qwen 3.5 235B-A17B (MoE) on 4× H100 SXM · vLLM · FP8
Why we want this: The frontier-MoE production reference. Organizations weighing $200k+ DGX-class purchases vs cloud rental need measured throughput to model cost-per-million-tokens accurately.
Unlocks: /hardware-combos/vllm-tensor-parallel-h100-workstation · /stacks/h100-tensor-parallel-workstation · /will-it-run/combo/vllm-tensor-parallel-h100-workstation · /guides/running-local-ai-on-multiple-gpus-2026
- High · target: 28-36 tok/s decode (PCIe only)
Dual RTX 4090 + Llama 3.3 70B Q4 (vLLM tensor-parallel)
Llama 3.3 70B Instruct on Dual RTX 4090 · vLLM · AWQ-INT4
Why we want this: Pairs with the dual-3090 measurement to quantify the NVLink-vs-PCIe penalty. The 4090's lack of NVLink is the single most-misunderstood spec gap; a measured comparison ends the speculation.
Unlocks: /hardware-combos/dual-rtx-4090 · /stacks/dual-4090-workstation · /will-it-run/combo/dual-rtx-4090 · /guides/running-local-ai-on-multiple-gpus-2026
- High · target: 25-32 tok/s decode (NVLink)
Dual RTX 3090 + Llama 3.3 70B Q4 (vLLM tensor-parallel)
Llama 3.3 70B Instruct on Dual RTX 3090 · vLLM · AWQ-INT4
Why we want this: The dual-3090 NVLink build is the most-recommended prosumer multi-GPU configuration on this site. Without a measured benchmark, the 25-32 tok/s estimate carries editorial-only confidence, and operators making $1,500+ buying decisions deserve real numbers.
Unlocks: /hardware-combos/dual-rtx-3090 · /stacks/dual-3090-workstation · /will-it-run/combo/dual-rtx-3090 · /guides/running-local-ai-on-multiple-gpus-2026
- Medium · target: 10-22 tok/s decode (Adreno GPU path)
Snapdragon 8 Elite + Llama 3.2 3B (MLC LLM, GPU)
Llama 3.2 3B Instruct on Qualcomm Snapdragon 8 Elite · MLC LLM · Q4_K_M (TVM-quant)
Why we want this: MLC LLM is cross-platform and the most-deployed mobile LLM runtime. The Adreno-vs-Hexagon comparison on the same SoC determines whether NPU lock-in is worth the throughput gain.
Unlocks: /hardware/snapdragon-8-elite · /tools/mlc-llm · /stacks/android-on-device-ai
- Medium · target: 20-35 tok/s decode (cold); throttle curve TBD
iPad M4 + Qwen 2.5 3B (MLX, sustained-load curve)
Qwen 2.5 3B Instruct on Apple M4 (iPad Pro) · MLX-LM · MLX-4bit
Why we want this: Tablet-class on-device viability for journaling and long-form summarization. This request needs the throttle curve, not just peak tok/s.
Unlocks: /hardware/apple-m4-ipad · /systems/mobile-local-ai
- Medium · target: 18-35 tok/s decode (estimate)
Intel Lunar Lake + Phi-3.5 Mini (OpenVINO NPU)
Phi-3.5 Mini Instruct on Intel Core Ultra 7 258V (Lunar Lake) · ONNX Runtime Mobile · INT8
Why we want this: Lunar Lake is the Intel reference for Copilot+ PCs. The comparison against the Snapdragon X NPU determines which Copilot+ chip operators should prefer for on-device LLMs.
Unlocks: /hardware/intel-lunar-lake-258v · /systems/mobile-local-ai
- Medium · target: 20-40 tok/s decode (estimate)
Snapdragon X Elite + Phi-3.5 Mini (ONNX Runtime + DirectML NPU)
Phi-3.5 Mini Instruct on Qualcomm Snapdragon X Elite · ONNX Runtime Mobile · INT8
Why we want this: The Copilot+ PC ecosystem is expanding rapidly. The Snapdragon X NPU vs Lunar Lake NPU vs CPU-fallback comparison is the deciding one for operators planning Windows on-device deployments.
Unlocks: /hardware/snapdragon-x-elite · /tools/onnx-runtime-mobile · /systems/mobile-local-ai
- Medium · target: 30-45 tok/s per stream × 4-32 concurrent
Ray Serve 4-node × 2× 4090 + Qwen 3 32B (concurrency scan)
Qwen 3 32B on 4-node × 2× 4090 · Ray Serve · AWQ-INT4
Why we want this: Distributed-serving patterns differ from tensor-parallel: replicas scale aggregate concurrency, not single-stream model size. The concurrency scan reveals where Ray Serve replicas plateau.
Unlocks: /hardware-combos/ray-serve-distributed-multi-node · /stacks/distributed-inference-homelab
- Medium · target: 4-9 tok/s decode (Thunderbolt 5 inter-node)
4× Mac Mini M4 Pro Exo cluster + Llama 3.1 70B (MLX-4bit)
Llama 3.1 70B Instruct on 4× Mac Mini M4 Pro · Exo · MLX-4bit
Why we want this: Multi-Mac Exo clusters are an emerging pattern. The cluster-vs-single-Mac-Studio comparison establishes whether the cluster is ever the right answer outside extreme memory targets.
Unlocks: /hardware-combos/quad-mac-mini-m4-pro-exo · /stacks/multi-machine-apple-cluster · /will-it-run/combo/quad-mac-mini-m4-pro-exo
- Medium · target: 20-28 tok/s decode (with thinking-mode bloat)
4× RTX 3090 + DeepSeek R1 Distill Llama 70B (vLLM TP-4)
DeepSeek R1 Distill Llama 70B on 4× RTX 3090 · vLLM · AWQ-INT4
Why we want this: Quad-3090 is the prosumer-ceiling stack. R1 reasoning workloads are a high-traffic use case, but the thinking-mode token bloat changes the throughput calculation, so a measurement is needed to set realistic operator expectations.
Unlocks: /hardware-combos/quad-rtx-3090 · /stacks/quad-3090-workstation · /will-it-run/combo/quad-rtx-3090
- Low · target: 10-16 tok/s (asymmetric layer-split)
llama.cpp layer-split + Mixtral 8x22B (mixed 4090+3090)
Mixtral 8x22B Instruct on mixed 4090+3090 · llama.cpp · Q4_K_M
Why we want this: Research-tier benchmark. Mixed-GPU setups are editorially discouraged for new builds, but the numbers are useful for users who already own asymmetric pairs.
Unlocks: /hardware-combos/mixed-4090-3090 · /stacks/mixed-4090-3090-workstation
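Two of the requests above are motivated by arithmetic that turns a measured tok/s figure into a decision: cost per million tokens (the 4× H100 Qwen request) and the effect of thinking-mode tokens on useful output (the quad-3090 R1 request). The sketch below shows that arithmetic under simple assumptions: a flat hourly cost, a constant decode rate, and thinking tokens generated at the same speed as answer tokens. The function names are hypothetical, and the inputs in the usage lines are placeholders, not measurements.

```javascript
// Illustrative arithmetic only. Inputs are caller-supplied placeholders.

// USD per one million generated tokens for hardware billed or amortized per hour.
function costPerMillionTokens(hourlyCostUsd, aggregateTokPerSec) {
  if (aggregateTokPerSec <= 0) throw new Error("throughput must be positive");
  const tokensPerHour = aggregateTokPerSec * 3600;
  return (hourlyCostUsd / tokensPerHour) * 1e6;
}

// Rate at which *answer* tokens are delivered when a reasoning model first
// emits thinkingTokens and then answerTokens at the same decode rate.
function effectiveAnswerRate(decodeTokPerSec, thinkingTokens, answerTokens) {
  if (decodeTokPerSec <= 0) throw new Error("decode rate must be positive");
  const totalSeconds = (thinkingTokens + answerTokens) / decodeTokPerSec;
  return answerTokens / totalSeconds;
}

// Usage with placeholder inputs:
console.log(costPerMillionTokens(3.6, 100));
console.log(effectiveAnswerRate(24, 900, 300));
```

Both functions scale linearly with measured throughput, so any error in an estimated tok/s carries straight through to the cost or latency figure built on it. That is why these requests ask for measured values.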
Need a measurement we don't cover?
Got a measurement? Submit it directly. Editorial adds the corresponding opportunity row and credits you for filling the gap.
Need one measured? Request it. Editorial reviews and surfaces accepted requests in the section above.