How to run a standardized benchmark suite against multiple models
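Prerequisites
Python 3.10+, pip, Git for cloning evaluation tooling, and multiple models to compare. The commands below load Hugging Face checkpoints through the hf backend, so models pulled only into Ollama must also be available as Hugging Face weights. The --device cuda:0 argument assumes a CUDA-capable GPU.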
What this does
Sets up lm-evaluation-harness to run the same benchmark against multiple locally hosted models, producing comparable scores for each model in a structured output format. After completing this guide, scores on tasks such as MMLU or HellaSwag are available for side-by-side comparison.
Steps
- Install lm-evaluation-harness. Uses pip to install the library and its dependencies.
pip install lm-eval
Expected output: a final line starting with "Successfully installed" that includes lm_eval-X.X.X.
- Run the harness against one model. Evaluates a single model as a baseline using the MMLU task with 5 few-shot examples.
lm_eval --model hf --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.3 --tasks mmlu --num_fewshot 5 --device cuda:0 --batch_size 4
Expected output: a results table with per-task scores, e.g. an MMLU (5-shot) accuracy of 0.6241. The exact value depends on the model and harness version.
- Create a shell loop to evaluate multiple models. Automates repeated runs across all models and saves the combined output to benchmark_results.txt, the file checked under Verification.
for model in "mistralai/Mistral-7B-Instruct-v0.3" "NousResearch/Llama-3.2-3B-Instruct"; do lm_eval --model hf --model_args pretrained="$model" --tasks hellaswag --device cuda:0 --batch_size 4; done 2>&1 | tee benchmark_results.txt
Expected output: Benchmark results for each model printed in sequence and written to benchmark_results.txt.
- Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
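The loop and the record-keeping step can also be driven from Python. The following is a minimal sketch, not part of lm-evaluation-harness itself: the helper names (build_command, package_version, run_and_record) and the run_evidence.jsonl file name are hypothetical, and the command-line flags are the ones used in the steps above.

```python
import json
import shlex
import subprocess
import sys
from datetime import datetime, timezone
from importlib import metadata


def build_command(model: str, task: str, batch_size: int = 4,
                  device: str = "cuda:0") -> list[str]:
    """Return the lm_eval invocation from the steps above as an argument list."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model}",
        "--tasks", task,
        "--device", device,
        "--batch_size", str(batch_size),
    ]


def package_version(name: str = "lm_eval") -> str:
    """Look up the installed package version, or report that it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def run_and_record(models, task, log_path="run_evidence.jsonl",
                   runner=subprocess.run):
    """Run each model in turn and append one JSON record per run.

    `runner` defaults to subprocess.run; it is a parameter only so the
    function can be exercised without a GPU (an assumption of this sketch).
    """
    records = []
    for model in models:
        cmd = build_command(model, task)
        result = runner(cmd, capture_output=True, text=True)
        record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "command": shlex.join(cmd),          # exact command, copy-pasteable
            "lm_eval_version": package_version(),
            "python_version": sys.version.split()[0],
            "model": model,
            "returncode": result.returncode,
            "output": result.stdout + result.stderr,
        }
        records.append(record)
        # Append immediately so a crash in a later run keeps earlier evidence.
        with open(log_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
    return records
```

Calling run_and_record with the two model names from the loop and the task "hellaswag" would run them in sequence and leave one JSON line per run in run_evidence.jsonl. If lm_eval is not on PATH, subprocess.run raises FileNotFoundError.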
Verification
grep -E "(mmlu|hellaswag)" benchmark_results.txt 2>/dev/null || echo "Run benchmark and save output to file"
# Expected: scores for each model on the evaluated task
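For a side-by-side view, the saved output can be parsed rather than grepped. The sketch below rests on assumptions that may not hold for every harness version: that each run prints a configuration line containing pretrained=<model>, followed by a pipe-delimited table whose header row has Metric and Value columns. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed: the run's configuration line contains "pretrained=<model>".
PRETRAINED_RE = re.compile(r"pretrained=([^,)\s]+)")


def parse_results(text: str) -> dict[str, dict[str, float]]:
    """Collect {model: {"task/metric": value}} from pipe-delimited result tables."""
    scores = defaultdict(dict)
    model = "unknown"
    columns = None  # (metric index, value index) once a header row is seen
    task = ""
    for line in text.splitlines():
        stripped = line.strip()
        if not (stripped.startswith("|") and stripped.endswith("|")):
            match = PRETRAINED_RE.search(stripped)
            if match:
                model = match.group(1)
            columns = None  # any non-table line ends the current table
            continue
        cells = [c.strip() for c in stripped[1:-1].split("|")]
        if "Metric" in cells and "Value" in cells:
            columns = (cells.index("Metric"), cells.index("Value"))
            task = ""
            continue
        # Skip rows outside a table and the |---|---:| separator row.
        if columns is None or all(set(c) <= set("-:") for c in cells):
            continue
        metric_idx, value_idx = columns
        if len(cells) <= max(columns):
            continue
        if cells[0]:
            # Continuation rows leave the task cell empty; subtasks may be
            # prefixed with "- ", which is dropped here.
            task = cells[0].lstrip("- ")
        try:
            value = float(cells[value_idx])
        except ValueError:
            continue
        # If a task reports the same metric under several filters,
        # the last row wins in this sketch.
        scores[model][f"{task}/{cells[metric_idx]}"] = value
    return {m: dict(v) for m, v in scores.items()}


def comparison_rows(scores: dict[str, dict[str, float]]) -> list[list[str]]:
    """Build rows with one column per model and one row per task/metric."""
    models = sorted(scores)
    metrics = sorted({m for per_model in scores.values() for m in per_model})
    rows = [["metric", *models]]
    for metric in metrics:
        rows.append([
            metric,
            *(f"{scores[m][metric]:.4f}" if metric in scores[m] else "-"
              for m in models),
        ])
    return rows
```

Reading benchmark_results.txt, passing its text to parse_results, and printing each row of comparison_rows joined with tabs gives a simple comparison table. If the harness prints a different layout, the header names in the sketch need adjusting.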
Common failures
- CUDA out of memory: Reduce --batch_size to 2 or 1, or switch to a smaller model.
- model not in HF cache: Download HF weights explicitly with huggingface-cli download <model-id>.
- lm_eval not found: Use python -m lm_eval instead, or ensure pip bin directory is on PATH.
- task fails to load: Confirm task name is valid with lm_eval --tasks list.
- inconsistent scores: Set --seed to a fixed value to reduce variance across runs.
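Two of the failures above lend themselves to automatic handling: a missing lm_eval script and CUDA out-of-memory errors. The sketch below shows one possible approach; the function names are hypothetical, and detecting an out-of-memory error by searching the output for the phrase "out of memory" is an assumption about the error text rather than a documented interface.

```python
import shutil
import subprocess
import sys


def lm_eval_prefix() -> list[str]:
    """Use the lm_eval script when it is on PATH, else fall back to python -m lm_eval."""
    if shutil.which("lm_eval"):
        return ["lm_eval"]
    return [sys.executable, "-m", "lm_eval"]


def batch_sizes_to_try(start: int = 4) -> list[int]:
    """Halve the batch size repeatedly until it reaches 1."""
    sizes = []
    size = max(1, start)
    while True:
        sizes.append(size)
        if size == 1:
            break
        size //= 2
    return sizes


def run_with_oom_retry(args, start_batch_size: int = 4, runner=subprocess.run):
    """Run lm_eval, retrying with a smaller batch size after an apparent OOM.

    `args` holds every argument except --batch_size. `runner` defaults to
    subprocess.run and is a parameter only to allow dry runs.
    Returns (batch size used in the last attempt, result of that attempt).
    """
    for size in batch_sizes_to_try(start_batch_size):
        cmd = [*lm_eval_prefix(), *args, "--batch_size", str(size)]
        result = runner(cmd, capture_output=True, text=True)
        output = (result.stdout + result.stderr).lower()
        # Assumption: an OOM failure mentions "out of memory" in its output.
        if result.returncode == 0 or "out of memory" not in output:
            break
    return size, result
```

With a starting batch size of 4, the attempts use batch sizes 4, 2, and 1 in that order, and the loop stops at the first run that either succeeds or fails for a reason other than memory.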