RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts: there are cards we rate highly that we have no affiliate relationship with, and cards that sell well that we refuse to recommend.

© 2026 runlocalai.co · Independently operated
BLK · QUALITY LEADERBOARDS · reviewed · per-quant · raw logs

Quality benchmarks

Reviewed quality benchmark scores for open-weight local models. First-party rows are produced on RunLocalAI hardware; community rows render only after review. Each public score carries a stated quantization, runtime, hardware target, and raw-log Gist.
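
As an illustration of what each public row carries, the record can be sketched as a small Python dataclass. The class and field names below are assumptions made for this example, not the site's actual schema or JSON API. The sample values come from the first two HumanEval+ rows on this page, and the Gist URL is a placeholder.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRow:
    # Hypothetical field names; the site's real schema is not shown on this page.
    benchmark: str
    model: str
    quant: str      # stated quantization, e.g. "Q4_K_M"
    runtime: str    # e.g. "ollama-0.24"
    hardware: str   # hardware target, e.g. "rtx-3080-16gb-mobile"
    score: float    # out of 100
    trust: str      # trust tier, e.g. "first-party"
    gist_url: str   # link to the raw log (placeholder below)


rows = [
    ScoreRow("humaneval-plus", "Qwen 2.5 Coder 7B Instruct", "Q4_K_M",
             "ollama-0.24", "rtx-3080-16gb-mobile", 81.1, "first-party", "<gist-url>"),
    ScoreRow("humaneval-plus", "Phi-4 14B", "Q4_K_M",
             "ollama-0.24", "rtx-3080-16gb-mobile", 78.7, "first-party", "<gist-url>"),
]

# Tally rows per trust tier, in the spirit of the "By trust tier" summary line.
tier_counts = Counter(row.trust for row in rows)
print(dict(tier_counts))
```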

Benchmarks tracked: 3
Total test runs: 13
Unique models: 7
First-party: 13
By trust tier: first-party: 13 · verified: 0 · community: 0 · pending: 0 · rejected: 0
BENCHMARK · HUMANEVAL-PLUS

HumanEval+ (EvalPlus)

164 Python coding problems from the EvalPlus extension of the original HumanEval benchmark, with augmented test suites (roughly 80x as many tests as the original). Measures functional code gener

5 test runs
Coding
| # | Model | Quant | Rig (runtime / hardware) | Score | Trust | Log |
|---|-------|-------|--------------------------|-------|-------|-----|
| 1 | Qwen 2.5 Coder 7B Instruct | Q4_K_M | ollama-0.24 / rtx-3080-16gb-mobile | 81.1/100 | First-party | command published, Gist → |
| 2 | Phi-4 14B | Q4_K_M | ollama-0.24 / rtx-3080-16gb-mobile | 78.7/100 | First-party | command published, Gist → |
| 3 | Trendyol LLM Asure 12B | Q4_K_M | ollama-0.24.0 / rtx-5080 | 69.5/100 | First-party | command published, Gist → |
| 4 | Llama 3.1 8B Instruct | Q4_K_M | ollama-0.24 / rtx-3080-16gb-mobile | 56.1/100 | First-party | command published, Gist → |
| 5 | Qwen 3 8B | Q4_K_M | ollama-0.24 / rtx-3080-16gb-mobile | 2.4/100 | First-party | command published, Gist → |
Full HumanEval+ (EvalPlus) leaderboard →
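
A quick sanity check on these numbers: if each score is the percentage of the 164 problems solved (an assumption; this page does not spell out the scoring rule), the scores map back to whole problem counts. The helper name below is made up for this sketch.

```python
TOTAL_PROBLEMS = 164  # HumanEval+ problem count stated in the description above


def solved_from_score(score: float, total: int = TOTAL_PROBLEMS) -> int:
    """Nearest whole number of problems implied by a percentage score."""
    return round(score / 100 * total)


# Scores copied from the HumanEval+ table above.
scores = {
    "Qwen 2.5 Coder 7B Instruct": 81.1,
    "Phi-4 14B": 78.7,
    "Trendyol LLM Asure 12B": 69.5,
    "Llama 3.1 8B Instruct": 56.1,
    "Qwen 3 8B": 2.4,
}

for model, score in scores.items():
    solved = solved_from_score(score)
    # Recompute the percentage from the whole count to confirm it rounds back.
    recomputed = round(solved / TOTAL_PROBLEMS * 100, 1)
    print(f"{model}: {solved}/{TOTAL_PROBLEMS} -> {recomputed}")
```

Under that assumption, 81.1 corresponds to 133 of 164 problems and 2.4 to 4 of 164, and every listed score rounds back to itself from a whole count.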
BENCHMARK · MBPP-PLUS

MBPP+ (EvalPlus)

EvalPlus's augmented MBPP suite: introductory Python programming tasks with stronger hidden tests than the original MBPP benchmark. Measures short-form functional code generation q

4 test runs
Coding
| # | Model | Quant | Rig (runtime / hardware) | Score | Trust | Log |
|---|-------|-------|--------------------------|-------|-------|-----|
| 1 | Trendyol LLM Asure 12B | Q4_K_M | ollama-0.24.0 / rtx-5080 | 71.7/100 | First-party | command published, Gist → |
| 2 | Qwen 2.5 Coder 7B Instruct | Q4_K_M | ollama-0.24 / rtx-3080-16gb-mobile | 66.9/100 | First-party | command published, Gist → |
| 3 | Phi-4 14B | Q4_K_M | ollama-0.24 / rtx-3080-16gb-mobile | 60.3/100 | First-party | command published, Gist → |
| 4 | Llama 3.1 8B Instruct | Q4_K_M | ollama-0.24 / rtx-3080-16gb-mobile | 39.2/100 | First-party | command published, Gist → |
Full MBPP+ (EvalPlus) leaderboard →
BENCHMARK · TURKISH-MMLU-GENERATIVE

TurkishMMLU (Generative)

Turkish-translated Massive Multitask Language Understanding benchmark: 900 questions across 9 subjects (Biology, Chemistry, Geography, History, Mathematics, Philosophy, Physics, Re

4 test runs
Turkish
General knowledge
Multilingual
| # | Model | Quant | Rig (runtime / hardware) | Score | Trust | Log |
|---|-------|-------|--------------------------|-------|-------|-----|
| 1 | Trendyol LLM Asure 12B | Q4_K_M | ollama-0.24.0 / rtx-5080 | 58.9/100 | First-party | command published, Gist → |
| 2 | Llama 3.2 3B Instruct | Q4_K_M | ollama-0.24 / rtx-3080-16gb-mobile | 11.4/100 | First-party | command published, Gist → |
| 3 | Turkish Llama 8B Instruct v0.1 | Q4_K_M | ollama-0.24 / rtx-3080 | 11.0/100 | First-party | command published, Gist → |
| 4 | Turkish Llama 8B Instruct v0.1 | Q4_K_M | ollama-0.24 / rtx-3080-16gb-mobile | 11.0/100 | First-party | command published, Gist → |
Full TurkishMMLU (Generative) leaderboard →
HOW SCORES EARN PUBLIC RENDER

The trust gate

  • Every row must link to a public Gist with the raw stdout + stderr of the run. No Gist, no public render.
  • A trust tier is shown per row: first-party (we ran it), verified (a community submission reviewed with a named verifier, timestamp, raw log, and reproduction command), or pending (submitted, awaiting review, and hidden from public leaderboards).
  • Reproduction commands are published so a third party can replicate a run independently. Full methodology lives at /benchmarks/methodology.
  • Each model + quant + runtime + hardware tuple is unique. Anonymous submissions cannot overwrite an existing public row; reviewers decide whether a new run replaces or supplements the record.
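
The gate and the uniqueness rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the site's implementation: the names `Row`, `is_public`, `row_key`, and `submit` are hypothetical, hiding the "rejected" tier is assumed, and the resubmitted score of 12.0 is invented for the example. In this sketch every key collision goes to a review queue rather than overwriting.

```python
from typing import NamedTuple, Optional

# "pending" is hidden per the rules above; treating "rejected" the same way
# is an assumption of this sketch.
HIDDEN_TIERS = {"pending", "rejected"}


class Row(NamedTuple):
    model: str
    quant: str
    runtime: str
    hardware: str
    score: float
    trust: str
    gist_url: Optional[str]


def is_public(row: Row) -> bool:
    """No Gist, no public render; hidden tiers never render."""
    return bool(row.gist_url) and row.trust not in HIDDEN_TIERS


def row_key(row: Row) -> tuple:
    """The tuple that must be unique on the board."""
    return (row.model, row.quant, row.runtime, row.hardware)


def submit(board: dict, review_queue: list, row: Row) -> bool:
    """Store a new row; queue collisions for a reviewer instead of overwriting."""
    key = row_key(row)
    if key in board:
        review_queue.append(row)
        return False
    board[key] = row
    return True


board, review_queue = {}, []
a = Row("Turkish Llama 8B Instruct v0.1", "Q4_K_M", "ollama-0.24",
        "rtx-3080", 11.0, "first-party", "<gist-url>")
b = a._replace(hardware="rtx-3080-16gb-mobile")   # different hardware, new key
rerun = a._replace(score=12.0, trust="pending")   # hypothetical resubmission

results = [submit(board, review_queue, r) for r in (a, b, rerun)]
print(results)                         # [True, True, False]
print(len(board), len(review_queue))   # 2 1
print(is_public(a), is_public(rerun), is_public(a._replace(gist_url=None)))
```

The two Turkish Llama rows in the TurkishMMLU table differ only in hardware, which is why both can coexist under this keying.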