BENCHMARK · MBPP-PLUS
Coding

MBPP+ (EvalPlus)

EvalPlus's augmented MBPP suite: introductory Python programming tasks with stronger hidden tests than the original MBPP benchmark. Measures short-form functional code generation quality.

Test runs
4
Metric
pass@1
Range
/100
Best score
71.7
METHODOLOGY

How we run MBPP+ at runlocalai

  1. Spin up the target model on its native runtime (Ollama / vLLM / llama.cpp) at the listed quantization on the listed hardware.
  2. Generate one deterministic sample per task through the local OpenAI-compatible endpoint.
  3. Score the JSONL samples with the official EvalPlus evaluator: python -m evalplus.evaluate mbpp --samples $SAMPLES.
  4. Score = pass@1 on MBPP+ x 100.
  5. Publish the raw generation log, official scorer log, sanitized samples, and raw completions to a public GitHub Gist before the row enters the leaderboard.

Runner source: scripts/run-humaneval-plus.ts and scripts/evalplus_openai_generate.py in the public repo.

HOW TO READ THE SCORE

Score is pass@1 percentage on the EvalPlus-extended MBPP test suite. 100 = every task solved correctly. MBPP+ is usually easier than HumanEval+ for strong coding models, but the augmented tests catch many brittle or partial solutions.

LEADERBOARD

Public reviewed runs, ranked

4 rows
#ModelQuantRigScoreTrustLog
1Trendyol LLM Asure 12B
11.8B · gemma
Q4_K_M
ollama-0.24.0
rtx-5080
71.7
First-party
runlocalai
command published
Gist →
2Qwen 2.5 Coder 7B Instruct
7B · qwen
Q4_K_M
ollama-0.24
rtx-3080-16gb-mobile
66.9
First-party
runlocalai
command published
Gist →
3Phi-4 14B
14B · phi
Q4_K_M
ollama-0.24
rtx-3080-16gb-mobile
60.3
First-party
runlocalai
command published
Gist →
4Llama 3.1 8B Instruct
8B · llama
Q4_K_M
ollama-0.24
rtx-3080-16gb-mobile
39.2
First-party
runlocalai
command published
Gist →
OPERATOR NOTES
Trendyol LLM Asure 12B · Q4_K_M · ollama-0.24.0 · rtx-5080

First-party measured MBPP+ run. Generation used Ollama's OpenAI-compatible chat endpoint at temperature 0 and num_ctx 8192. Scoring used official EvalPlus 0.3.1 under WSL; public Gist includes metadata, generation log, official scorer log, sanitized samples, and raw model completions.

Qwen 2.5 Coder 7B Instruct · Q4_K_M · ollama-0.24 · rtx-3080-16gb-mobile

First-party MBPP+ on RTX 3080 Laptop 16GB via Ollama 0.24. Windows-safe scoring via scripts/evalplus_score_windows.py. Paired with the HumanEval+ row at 81.1/100 for the same model+quant+hardware.

Phi-4 14B · Q4_K_M · ollama-0.24 · rtx-3080-16gb-mobile

First-party MBPP+ on RTX 3080 Laptop 16GB via Ollama 0.24. Windows-safe scoring via scripts/evalplus_score_windows.py.

Llama 3.1 8B Instruct · Q4_K_M · ollama-0.24 · rtx-3080-16gb-mobile

First-party MBPP+ on RTX 3080 Laptop 16GB via Ollama 0.24. Windows-safe scoring via scripts/evalplus_score_windows.py.