MBPP+ (EvalPlus)

EvalPlus's augmented MBPP suite: introductory Python programming tasks with stronger hidden tests than the original MBPP benchmark. Measures short-form functional code generation quality.

Test runs

Metric

pass@1

Range

/100

Best score

71.7

+Submit a score Runner script →Original source →Vendor leaderboard →JSON →

METHODOLOGY

How we run MBPP+ at runlocalai

Spin up the target model on its native runtime (Ollama / vLLM / llama.cpp) at the listed quantization on the listed hardware.
Generate one deterministic sample per task through the local OpenAI-compatible endpoint.
Score the JSONL samples with the official EvalPlus evaluator: python -m evalplus.evaluate mbpp --samples $SAMPLES.
Score = pass@1 on MBPP+ x 100.
Publish the raw generation log, official scorer log, sanitized samples, and raw completions to a public GitHub Gist before the row enters the leaderboard.

Runner source: scripts/run-humaneval-plus.ts and scripts/evalplus_openai_generate.py in the public repo.

HOW TO READ THE SCORE

Score is pass@1 percentage on the EvalPlus-extended MBPP test suite. 100 = every task solved correctly. MBPP+ is usually easier than HumanEval+ for strong coding models, but the augmented tests catch many brittle or partial solutions.

LEADERBOARD

Public reviewed runs, ranked

4 rows

#	Model	Quant	Rig	Score	Trust	Log
1	Trendyol LLM Asure 12B 11.8B · gemma	Q4_K_M	ollama-0.24.0 rtx-5080	71.7	First-party runlocalai command published	Gist →
2	Qwen 2.5 Coder 7B Instruct 7B · qwen	Q4_K_M	ollama-0.24 rtx-3080-16gb-mobile	66.9	First-party runlocalai command published	Gist →
3	Phi-4 14B 14B · phi	Q4_K_M	ollama-0.24 rtx-3080-16gb-mobile	60.3	First-party runlocalai command published	Gist →
4	Llama 3.1 8B Instruct 8B · llama	Q4_K_M	ollama-0.24 rtx-3080-16gb-mobile	39.2	First-party runlocalai command published	Gist →

OPERATOR NOTES

Trendyol LLM Asure 12B · Q4_K_M · ollama-0.24.0 · rtx-5080

First-party measured MBPP+ run. Generation used Ollama's OpenAI-compatible chat endpoint at temperature 0 and num_ctx 8192. Scoring used official EvalPlus 0.3.1 under WSL; public Gist includes metadata, generation log, official scorer log, sanitized samples, and raw model completions.

Qwen 2.5 Coder 7B Instruct · Q4_K_M · ollama-0.24 · rtx-3080-16gb-mobile

First-party MBPP+ on RTX 3080 Laptop 16GB via Ollama 0.24. Windows-safe scoring via scripts/evalplus_score_windows.py. Paired with the HumanEval+ row at 81.1/100 for the same model+quant+hardware.

Phi-4 14B · Q4_K_M · ollama-0.24 · rtx-3080-16gb-mobile

First-party MBPP+ on RTX 3080 Laptop 16GB via Ollama 0.24. Windows-safe scoring via scripts/evalplus_score_windows.py.

Llama 3.1 8B Instruct · Q4_K_M · ollama-0.24 · rtx-3080-16gb-mobile

First-party MBPP+ on RTX 3080 Laptop 16GB via Ollama 0.24. Windows-safe scoring via scripts/evalplus_score_windows.py.