MBPP+ (EvalPlus)
EvalPlus's augmented MBPP suite: introductory Python programming tasks with stronger hidden tests than the original MBPP benchmark. Measures short-form functional code generation quality.
How we run MBPP+ at runlocalai
- Spin up the target model on its native runtime (Ollama / vLLM / llama.cpp) at the listed quantization on the listed hardware.
- Generate one deterministic sample per task through the local OpenAI-compatible endpoint.
- Score the JSONL samples with the official EvalPlus evaluator:
python -m evalplus.evaluate mbpp --samples $SAMPLES. - Score = pass@1 on MBPP+ x 100.
- Publish the raw generation log, official scorer log, sanitized samples, and raw completions to a public GitHub Gist before the row enters the leaderboard.
Runner source: scripts/run-humaneval-plus.ts and scripts/evalplus_openai_generate.py in the public repo.
Score is pass@1 percentage on the EvalPlus-extended MBPP test suite. 100 = every task solved correctly. MBPP+ is usually easier than HumanEval+ for strong coding models, but the augmented tests catch many brittle or partial solutions.
Public reviewed runs, ranked
| # | Model | Quant | Rig | Score | Trust | Log |
|---|---|---|---|---|---|---|
| 1 | Trendyol LLM Asure 12B 11.8B · gemma | Q4_K_M | ollama-0.24.0 rtx-5080 | 71.7 | First-party runlocalai command published | Gist → |
| 2 | Qwen 2.5 Coder 7B Instruct 7B · qwen | Q4_K_M | ollama-0.24 rtx-3080-16gb-mobile | 66.9 | First-party runlocalai command published | Gist → |
| 3 | Phi-4 14B 14B · phi | Q4_K_M | ollama-0.24 rtx-3080-16gb-mobile | 60.3 | First-party runlocalai command published | Gist → |
| 4 | Llama 3.1 8B Instruct 8B · llama | Q4_K_M | ollama-0.24 rtx-3080-16gb-mobile | 39.2 | First-party runlocalai command published | Gist → |
First-party measured MBPP+ run. Generation used Ollama's OpenAI-compatible chat endpoint at temperature 0 and num_ctx 8192. Scoring used official EvalPlus 0.3.1 under WSL; public Gist includes metadata, generation log, official scorer log, sanitized samples, and raw model completions.
First-party MBPP+ on RTX 3080 Laptop 16GB via Ollama 0.24. Windows-safe scoring via scripts/evalplus_score_windows.py. Paired with the HumanEval+ row at 81.1/100 for the same model+quant+hardware.
First-party MBPP+ on RTX 3080 Laptop 16GB via Ollama 0.24. Windows-safe scoring via scripts/evalplus_score_windows.py.
First-party MBPP+ on RTX 3080 Laptop 16GB via Ollama 0.24. Windows-safe scoring via scripts/evalplus_score_windows.py.