RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Community submitted (Public roadmap)

Benchmark roadmap — what we want measured

The public version of our benchmark queue. Each entry is a model+hardware combo we'd like a measurement for, and why that measurement would unlock useful pages or sharpen a confidence tier. If you have the rig, click “I can measure this” — the submission form arrives prefilled with model, hardware, and runtime.

  • Pending: 16
  • In progress: 0
  • Measured: 0
  • Critical: 1
Selected coverage slice

22 measured or reviewed / 0 estimated / 0 wanted / 6 unstarted

78.6% of 28 selected cells measured or reviewed
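
The coverage figure is a simple ratio of measured-or-reviewed cells to all selected cells. A minimal sketch of that arithmetic, using the counts stated above (the function name and dictionary keys are illustrative, not taken from the site's code):

```python
def coverage_percent(counts: dict) -> float:
    """Share of selected cells that are measured or reviewed, in percent."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no cells selected")
    return round(100 * counts["measured_or_reviewed"] / total, 1)


# Counts copied from the coverage line above.
slice_counts = {
    "measured_or_reviewed": 22,
    "estimated": 0,
    "wanted": 0,
    "unstarted": 6,
}

total_cells = sum(slice_counts.values())
pct = coverage_percent(slice_counts)
print(f"{pct}% of {total_cells} selected cells measured or reviewed")
```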


Selected benchmark coverage by hardware and model. All values are decode speed in tok/s, measured by RunLocalAI with high confidence; cells without a measurement are marked "not started".

Model                         | NVIDIA GeForce RTX 3080 16GB (Mobile) | NVIDIA GeForce RTX 5080
Trendyol LLM Asure 12B        | 43.4                                  | 82.0
Kumru 2B                      | 174.2                                 | 443.7
Mistral Turkish v2 (brooqs)   | 106.8                                 | 161.1
Turkcell LLM 7B v1            | 85.8                                  | 145.1
YTU Turkish Gemma 9B v0.1     | 66.0                                  | 101.1
RefinedNeuro RN TR R1         | 79.9                                  | 133.6
RefinedNeuro RN TR R2         | 79.3                                  | 133.4
Malhajar Mistral 7B Turkish   | 87.3                                  | 130.4
Llama 3.1 8B Instruct         | not started                           | 135.6
Qwen 2.5 Coder 14B Instruct   | not started                           | 79.0
Llama 3.2 1B Instruct         | 189.5                                 | not started
CodeGemma 7B                  | 80.6                                  | not started
DeepSeek Coder V2 Lite (16B)  | 152.0                                 | not started
DeepSeek R1 Distill Qwen 7B   | 80.3                                  | not started
Legend: RunLocalAI measured · Reproduced · Vendor-published · Community-reviewed · Low/unverified measured · Estimated · Critical wanted · Wanted · Not started
Fast measured/reviewed cells
  • Kumru 2B on NVIDIA GeForce RTX 5080: 443.7 tok/s
  • Llama 3.2 1B Instruct on NVIDIA GeForce RTX 3080 16GB (Mobile): 189.5 tok/s
  • Kumru 2B on NVIDIA GeForce RTX 3080 16GB (Mobile): 174.2 tok/s
  • Mistral Turkish v2 (brooqs) on NVIDIA GeForce RTX 5080: 161.1 tok/s

Have a rig? Capture a useful benchmark in 10-20 minutes.

The public benchmark runner captures the variables we currently use for review: runtime, GPU, OS, driver, run spread, and a paste-ready result block. Nothing uploads unless you pass --submit.

Windows
iwr -useb https://www.runlocalai.co/bench.mjs -OutFile bench.mjs; node bench.mjs --model llama3.1:8b
macOS / Linux / WSL
curl -fsSL https://www.runlocalai.co/bench.mjs -o bench.mjs && node bench.mjs --model llama3.1:8b
Download the runner · Paste a result · Read the measurement methodology · Contact the benchmark desk
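
The runner reports a "run spread" alongside the headline number. The page does not say how that spread is defined, so the following is only a sketch under one plausible assumption: spread is the gap between the fastest and slowest run, expressed as a percentage of the mean. The function name, the returned keys, and the sample values are all hypothetical.

```python
from statistics import mean


def run_spread(tok_per_s_runs):
    """Summarize repeated decode-speed runs (tok/s).

    Assumed definition: spread_pct = (max - min) / mean * 100.
    The real runner may define spread differently.
    """
    if not tok_per_s_runs:
        raise ValueError("need at least one run")
    avg = mean(tok_per_s_runs)
    low, high = min(tok_per_s_runs), max(tok_per_s_runs)
    return {
        "mean": round(avg, 1),
        "min": low,
        "max": high,
        "spread_pct": round(100 * (high - low) / avg, 1),
    }


# Made-up sample: three runs of the same model on the same machine.
summary = run_spread([80.0, 82.0, 78.0])
print(summary)
```

A wide spread usually suggests thermal throttling or background load, which is why repeated runs say more than a single number.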

How this works

1. We mark a model+hardware combo as “wanted” when measuring it would unlock new pages, sharpen a confidence tier, or fill a gap that operators are repeatedly asking about.

2. When you click “I can measure this,” the submission form opens with the model, hardware, and runtime prefilled. You add your measurements (tok/s, VRAM, context, runtime version, OS).

3. Submissions still go through editorial review. We don't auto-publish. If your numbers are plausible and well-documented, we mark the opportunity as measured and the benchmark goes live with full attribution.
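
To make the steps above concrete, here is a hedged sketch of what a submission record and a first-pass plausibility check might look like. The field names follow the list in step 2; the `Submission` class, the `review_flags` helper, and the 50% tolerance are assumptions for illustration and do not describe the site's actual review process, which is editorial.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    # Prefilled by the form (step 2).
    model: str
    hardware: str
    runtime: str
    # Added by the submitter (step 2).
    tok_per_s: float
    vram_gb: float
    context_tokens: int
    runtime_version: str
    os: str


def review_flags(sub, target_low, target_high, tolerance=0.5):
    """Return reasons a reviewer might look more closely (hypothetical rules)."""
    flags = []
    if sub.tok_per_s <= 0:
        flags.append("tok/s must be positive")
    elif not (target_low * (1 - tolerance)
              <= sub.tok_per_s
              <= target_high * (1 + tolerance)):
        flags.append("tok/s far outside target range")
    if not sub.runtime_version.strip():
        flags.append("missing runtime version")
    if not sub.os.strip():
        flags.append("missing OS")
    return flags


# Made-up numbers for the Critical request (target: 55-75 tok/s decode).
ok = Submission("Qwen 3 Coder 32B", "NVIDIA GeForce RTX 5090", "vLLM",
                tok_per_s=62.0, vram_gb=24.0, context_tokens=8192,
                runtime_version="x.y.z", os="Linux")
odd = Submission("Qwen 3 Coder 32B", "NVIDIA GeForce RTX 5090", "vLLM",
                 tok_per_s=400.0, vram_gb=24.0, context_tokens=8192,
                 runtime_version="", os="Linux")

print(review_flags(ok, 55, 75))
print(review_flags(odd, 55, 75))
```

A flag here would not reject a submission; it only marks where a human reviewer would ask for more documentation.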

  • Critical
    target: 55-75 tok/s decode (single stream)

    Single RTX 5090 + Qwen 3 Coder 32B (vLLM, AWQ-INT4)

    Qwen 3 Coder 32B on NVIDIA GeForce RTX 5090 · vLLM · AWQ-INT4

    Why we want this

    The single-5090 baseline is the comparison anchor for every multi-GPU recommendation on this site. Without it, the 'should I just buy one bigger card?' question can't be answered with confidence.

    Unlocks
    /hardware/rtx-5090 · /guides/choosing-a-gpu-for-local-ai-2026 · /guides/running-local-ai-on-multiple-gpus-2026 · /will-it-run/rtx-5090
    Use the benchmark script, then select this request in the submit form. We avoid guessed model commands unless a request stores a verified reproduction command.
    I can measure this → · Model page · Hardware page
  • High
    target: 12-25 tok/s decode (Hexagon NPU, estimate)

    Snapdragon 8 Elite + Phi-3.5 Mini (Qualcomm AI Hub, INT8)

    Phi-3.5 Mini Instruct on Qualcomm Snapdragon 8 Elite · Qualcomm AI Hub · INT8

    Why we want this

    Snapdragon 8 Elite is the mid-2025 flagship for Android on-device LLM inference. Establishing the NPU-vs-GPU-fallback tradeoff numbers is critical for the Android-on-device guidance.

    Unlocks
    /hardware/snapdragon-8-elite · /tools/qualcomm-ai-hub · /stacks/android-on-device-ai
    Use the benchmark script, then select this request in the submit form. We avoid guessed model commands unless a request stores a verified reproduction command.
    I can measure this → · Model page · Hardware page
  • High
    target: 8-15 tok/s decode (estimate, sustained)

    iPhone 16 Pro + Llama 3.2 3B (MLX Swift, INT4)

    Llama 3.2 3B Instruct on Apple A18 Pro · MLX Swift · MLX-INT4

    Why we want this

    Mobile on-device LLM viability is the most-asked question in the iPhone-developer ecosystem in 2026. A measured tok/s + battery drain + thermal throttling curve answers 'can I ship this in my app?'

    Unlocks
    /hardware/apple-a18-pro · /systems/mobile-local-ai · /stacks/iphone-on-device-ai
    Use the benchmark script, then select this request in the submit form. We avoid guessed model commands unless a request stores a verified reproduction command.
    I can measure this → · Model page · Hardware page
  • High
    target: 100-160 tok/s decode (single stream)

    4× H100 SXM + DeepSeek V4 Flash (vLLM TP-4, INT4)

    DeepSeek V4 Flash (284B MoE) on 4× H100 SXM · vLLM · AWQ-INT4

    Why we want this

    DeepSeek V4 Flash with the MTP head is claimed to be the throughput leader. Verifying the MTP advantage on production hardware is high-value for V4-Pro-vs-V4-Flash decision-making.

    Unlocks
    /hardware-combos/vllm-tensor-parallel-h100-workstation · /stacks/h100-tensor-parallel-workstation · /models/deepseek-v4-flash
    Use the benchmark script, then select this request in the submit form. We avoid guessed model commands unless a request stores a verified reproduction command.
    I can measure this → · Model page
  • High
    target: 8-14 tok/s decode (single stream)

    Mac Studio M3 Ultra 192GB + Qwen 3.5 235B-A17B (MLX-4bit)

    Qwen 3.5 235B-A17B (MoE) on Mac Studio M3 Ultra 192GB · MLX-LM · MLX-4bit

    Why we want this

    The Apple-vs-NVIDIA comparison at the frontier-MoE tier is the most-asked question for Mac Studio buyers. Our editorial estimate is 25-30% of NVIDIA throughput; a measured value would close the loop.

    Unlocks
    /hardware-combos/mac-studio-m3-ultra-192gb · /stacks/apple-silicon-ai · /will-it-run/combo/mac-studio-m3-ultra-192gb · /guides/running-local-ai-on-multiple-gpus-2026
    Use the benchmark script, then select this request in the submit form. We avoid guessed model commands unless a request stores a verified reproduction command.
    I can measure this → · Model page
  • High
    target: 60-90 tok/s decode (single stream)

    4× H100 SXM + Qwen 3.5 235B-A17B (vLLM TP-4, FP8)

    Qwen 3.5 235B-A17B (MoE) on 4× H100 SXM · vLLM · FP8

    Why we want this

    The frontier-MoE production reference. Organizations weighing $200k+ DGX-class purchases vs cloud rental need measured throughput to model cost-per-million-tokens accurately.

    Unlocks
    /hardware-combos/vllm-tensor-parallel-h100-workstation · /stacks/h100-tensor-parallel-workstation · /will-it-run/combo/vllm-tensor-parallel-h100-workstation · /guides/running-local-ai-on-multiple-gpus-2026
    Use the benchmark script, then select this request in the submit form. We avoid guessed model commands unless a request stores a verified reproduction command.
    I can measure this → · Model page
  • High
    target: 28-36 tok/s decode (PCIe only)

    Dual RTX 4090 + Llama 3.3 70B Q4 (vLLM tensor-parallel)

    Llama 3.3 70B Instruct on dual RTX 4090 · vLLM · AWQ-INT4

    Why we want this

    Pairs with the dual-3090 measurement to quantify the NVLink-vs-PCIe penalty. The 4090's lack of NVLink is the single most-misunderstood spec gap; a measured comparison ends the speculation.

    Unlocks
    /hardware-combos/dual-rtx-4090 · /stacks/dual-4090-workstation · /will-it-run/combo/dual-rtx-4090 · /guides/running-local-ai-on-multiple-gpus-2026
    Use the benchmark script, then select this request in the submit form. We avoid guessed model commands unless a request stores a verified reproduction command.
    I can measure this → · Model page
  • High
    target: 25-32 tok/s decode (NVLink)

    Dual RTX 3090 + Llama 3.3 70B Q4 (vLLM tensor-parallel)

    Llama 3.3 70B Instruct on dual RTX 3090 · vLLM · AWQ-INT4

    Why we want this

    The dual-3090 NVLink build is the most-recommended prosumer multi-GPU configuration on this site. Without a measured benchmark, the 25-32 tok/s estimate carries editorial-only confidence — operators making $1,500+ buying decisions deserve real numbers.

    Unlocks
    /hardware-combos/dual-rtx-3090 · /stacks/dual-3090-workstation · /will-it-run/combo/dual-rtx-3090 · /guides/running-local-ai-on-multiple-gpus-2026
    Use the benchmark script, then select this request in the submit form. We avoid guessed model commands unless a request stores a verified reproduction command.
    I can measure this → · Model page
  • Medium
    target: 10-22 tok/s decode (Adreno GPU path)

    Snapdragon 8 Elite + Llama 3.2 3B (MLC LLM, GPU)

    Llama 3.2 3B Instruct on Qualcomm Snapdragon 8 Elite · MLC LLM · Q4_K_M (TVM-quant)

    Why we want this

    MLC LLM is cross-platform and the most-deployed mobile LLM runtime. The Adreno-vs-Hexagon comparison on the same SoC determines whether NPU lock-in is worth the throughput gain.

    Unlocks
    /hardware/snapdragon-8-elite · /tools/mlc-llm · /stacks/android-on-device-ai
    Use the benchmark script, then select this request in the submit form. We avoid guessed model commands unless a request stores a verified reproduction command.
    I can measure this → · Model page · Hardware page
  • Medium
    target: 20-35 tok/s decode (cold); throttle curve TBD

    iPad M4 + Qwen 2.5 3B (MLX, sustained-load curve)

    Qwen 2.5 3B Instruct on Apple M4 (iPad Pro) · MLX-LM · MLX-4bit

    Why we want this

    Tablet-class on-device viability for journaling / long-form summarization. Needs the throttle curve, not just peak tok/s.

    Unlocks
    /hardware/apple-m4-ipad · /systems/mobile-local-ai
    Use the benchmark script, then select this request in the submit form. We avoid guessed model commands unless a request stores a verified reproduction command.
    I can measure this → · Model page · Hardware page
  • Medium
    target: 18-35 tok/s decode (estimate)

    Intel Lunar Lake + Phi-3.5 Mini (OpenVINO NPU)

    Phi-3.5 Mini Instruct on Intel Core Ultra 7 258V (Lunar Lake) · ONNX Runtime Mobile · INT8

    Why we want this

    Lunar Lake is the Intel reference for Copilot+ PCs. Comparison vs Snapdragon X NPU determines which Copilot+ chip operators should prefer for on-device LLMs.

    Unlocks
    /hardware/intel-lunar-lake-258v · /systems/mobile-local-ai
    Use the benchmark script, then select this request in the submit form. We avoid guessed model commands unless a request stores a verified reproduction command.
    I can measure this → · Model page · Hardware page
  • Medium
    target: 20-40 tok/s decode (estimate)

    Snapdragon X Elite + Phi-3.5 Mini (ONNX Runtime + DirectML NPU)

    Phi-3.5 Mini Instruct on Qualcomm Snapdragon X Elite · ONNX Runtime Mobile · INT8

    Why we want this

    The Copilot+ PC ecosystem is rapidly expanding. The Snapdragon X NPU vs Lunar Lake NPU vs CPU-fallback comparison is the key operator decision for Windows on-device deployments.

    Unlocks
    /hardware/snapdragon-x-elite · /tools/onnx-runtime-mobile · /systems/mobile-local-ai
    Use the benchmark script, then select this request in the submit form. We avoid guessed model commands unless a request stores a verified reproduction command.
    I can measure this → · Model page · Hardware page
  • Medium
    target: 30-45 tok/s per stream × 4-32 concurrent

    Ray Serve 4-node × 2× 4090 + Qwen 3 32B (concurrency scan)

    Qwen 3 32B on 4-node × 2× RTX 4090 · Ray Serve · AWQ-INT4

    Why we want this

    Distributed-serving patterns differ from tensor-parallel — replicas scale aggregate concurrency, not single-stream model size. The concurrency scan reveals where Ray Serve replicas plateau.

    Unlocks
    /hardware-combos/ray-serve-distributed-multi-node · /stacks/distributed-inference-homelab
    Use the benchmark script, then select this request in the submit form. We avoid guessed model commands unless a request stores a verified reproduction command.
    I can measure this → · Model page
  • Medium
    target: 4-9 tok/s decode (Thunderbolt 5 inter-node)

    4× Mac Mini M4 Pro Exo cluster + Llama 3.1 70B (MLX-4bit)

    Llama 3.1 70B Instruct on 4× Mac Mini M4 Pro · Exo · MLX-4bit

    Why we want this

    Multi-Mac Exo clusters are an emerging pattern. The cluster-vs-single-Mac-Studio comparison establishes whether the cluster is ever the right answer outside extreme memory targets.

    Unlocks
    /hardware-combos/quad-mac-mini-m4-pro-exo · /stacks/multi-machine-apple-cluster · /will-it-run/combo/quad-mac-mini-m4-pro-exo
    Use the benchmark script, then select this request in the submit form. We avoid guessed model commands unless a request stores a verified reproduction command.
    I can measure this → · Model page
  • Medium
    target: 20-28 tok/s decode (with thinking-mode bloat)

    4× RTX 3090 + DeepSeek R1 Distill Llama 70B (vLLM TP-4)

    DeepSeek R1 Distill Llama 70B on 4× RTX 3090 · vLLM · AWQ-INT4

    Why we want this

    Quad-3090 is the prosumer-ceiling stack. R1 reasoning workloads are a high-traffic use case, but thinking-mode token bloat changes the throughput calculation, so a measurement is needed to set realistic operator expectations.

    Unlocks
    /hardware-combos/quad-rtx-3090 · /stacks/quad-3090-workstation · /will-it-run/combo/quad-rtx-3090
    Use the benchmark script, then select this request in the submit form. We avoid guessed model commands unless a request stores a verified reproduction command.
    I can measure this → · Model page
  • Low
    target: 10-16 tok/s (asymmetric layer-split)

    llama.cpp layer-split + Mixtral 8x22B (mixed 4090+3090)

    Mixtral 8x22B Instruct on mixed RTX 4090 + RTX 3090 · llama.cpp · Q4_K_M

    Why we want this

    Research-tier benchmark: mixed-GPU setups are editorially discouraged for new builds, but the result is useful for users who already own asymmetric pairs.

    Unlocks
    /hardware-combos/mixed-4090-3090 · /stacks/mixed-4090-3090-workstation
    Use the benchmark script, then select this request in the submit form. We avoid guessed model commands unless a request stores a verified reproduction command.
    I can measure this → · Model page
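
Several requests above exist because a measured tok/s feeds a downstream calculation: cost per million tokens (the 4× H100 entry), useful throughput once thinking-mode tokens are counted (the quad-3090 R1 entry), and aggregate throughput across concurrent streams (the Ray Serve entry). The sketch below shows that arithmetic. The function names, the hourly cost, and the token counts are illustrative assumptions; only the tok/s ranges come from the targets listed above.

```python
def cost_per_million_tokens(hourly_cost_usd, tok_per_s):
    """USD per 1M generated tokens for a single stream at a steady rate."""
    tokens_per_hour = tok_per_s * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


def effective_answer_tok_per_s(decode_tok_per_s, thinking_tokens, answer_tokens):
    """Rate at which user-visible answer tokens arrive when the model
    also spends decode time on hidden reasoning tokens."""
    total = thinking_tokens + answer_tokens
    return decode_tok_per_s * answer_tokens / total


def aggregate_tok_per_s(per_stream_tok_per_s, streams):
    """Upper bound on total throughput if per-stream speed held constant.
    Real replicas plateau, which is what a concurrency scan would reveal."""
    return per_stream_tok_per_s * streams


# 75 tok/s is the midpoint of the 60-90 target; the $27/h cost is made up.
usd_per_m = cost_per_million_tokens(27.0, 75.0)

# 24 tok/s sits inside the 20-28 target; the token counts are made up.
visible = effective_answer_tok_per_s(24.0, thinking_tokens=3000, answer_tokens=1000)

# Bounds implied by "30-45 tok/s per stream × 4-32 concurrent".
agg_low = aggregate_tok_per_s(30, 4)
agg_high = aggregate_tok_per_s(45, 32)

print(round(usd_per_m, 2), visible, agg_low, agg_high)
```

With placeholder inputs the outputs mean little on their own. The point is that each formula is linear in tok/s, so an estimate that is off by 30% shifts the buying decision by the same 30%.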

Need a measurement we don't cover?

Got a measurement? Submit it directly. Editorial adds the corresponding opportunity row and credits you for filling the gap.

Need one measured? Request it. Editorial reviews and surfaces accepted requests in the section above.

Submit a benchmark → · Request a benchmark →