What this does

Sets up lm-evaluation-harness to run the same benchmarks against multiple locally hosted models and print comparable scores for each one as a results table. By the end of this guide, scores for tasks such as MMLU or HellaSwag will be available for side-by-side comparison.

Steps

Install lm-evaluation-harness. Uses pip to install the library and its dependencies.
```
pip install lm-eval
```
Expected output: A final line beginning with "Successfully installed" that lists lm_eval (shown as lm_eval-X.X.X or lm-eval-X.X.X) along with its dependencies.

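Run the harness against one model. Evaluates a single model on the MMLU task as a baseline. The --num_fewshot 5 flag is included because the expected score below is a 5-shot figure; without it the harness uses the task's default shot count.
```
lm_eval --model hf --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.3 --tasks mmlu --num_fewshot 5 --device cuda:0 --batch_size 4
```
Expected output: A results table with per-subject rows and an aggregate mmlu accuracy, e.g. 0.6241 at 5-shot.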

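Create a shell loop to evaluate multiple models. Runs the same task with the same settings for every model and appends the output to benchmark_results.txt, the file that the Verification section checks.
```
for model in "mistralai/Mistral-7B-Instruct-v0.3" "NousResearch/Llama-3.2-3B-Instruct"; do
  lm_eval --model hf --model_args "pretrained=$model" --tasks hellaswag --device cuda:0 --batch_size 4 2>&1 | tee -a benchmark_results.txt
done
```
Expected output: Benchmark results for each model printed in sequence and appended to benchmark_results.txt.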
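
For more than a couple of models, it can help to keep the model list in one place and give each run its own output directory. The following is a minimal sketch under stated assumptions, not the harness's own tooling: `safe_name`, `run_all`, `RESULTS_DIR`, and `DRY_RUN` are hypothetical names, and it assumes the installed lm_eval version accepts `--output_path`. Setting `DRY_RUN=1` only prints the commands, so they can be reviewed before committing GPU time.

```shell
#!/bin/sh
# Sketch: evaluate several models with identical settings, one output directory per model.
# safe_name, run_all, RESULTS_DIR and DRY_RUN are illustrative names, not part of lm_eval.

MODELS="mistralai/Mistral-7B-Instruct-v0.3 NousResearch/Llama-3.2-3B-Instruct"
TASKS="hellaswag"
RESULTS_DIR="results"

# Turn a Hugging Face model ID into a directory-safe name (each "/" becomes "__").
safe_name() {
  printf '%s\n' "$1" | sed 's|/|__|g'
}

# Run (or, with DRY_RUN=1, only print) one lm_eval command per model.
run_all() {
  for model in $MODELS; do
    out="$RESULTS_DIR/$(safe_name "$model")"
    # Build the command as positional parameters so every flag stays visible and quoted.
    set -- lm_eval --model hf --model_args "pretrained=$model" \
      --tasks "$TASKS" --device cuda:0 --batch_size 4 --output_path "$out"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "$*"
    else
      mkdir -p "$out"
      # Keep a copy of the console output next to the harness's own result files.
      "$@" 2>&1 | tee "$out/stdout.log"
    fi
  done
}
```

After sourcing the file, `DRY_RUN=1 run_all` prints one command per model, and `run_all` on its own executes them. Because the settings live in one command template, every model is evaluated with the same task, device, and batch size.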

Record the run details. Save the exact command, the installed lm-eval version, the model name, and the observed output so that each result can be reproduced later.
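
One way to capture these details is to append a short record to a plain-text manifest before each run. This is a sketch only: `record_run` and `run_manifest.txt` are hypothetical names, and it assumes `python -m pip show lm_eval` reports the installed version (the record falls back to "unknown" otherwise). The observed output itself is not stored here; it is assumed to be saved separately, for example by redirecting the harness output to a file.

```shell
#!/bin/sh
# Sketch: append one reproducibility record per run to a plain-text manifest.
# record_run and run_manifest.txt are illustrative names, not part of lm_eval.

MANIFEST="${MANIFEST:-run_manifest.txt}"

# Usage: record_run MODEL_ID COMMAND...
record_run() {
  model="$1"
  shift
  # Read the installed package version; empty if pip or the package is unavailable.
  version="$(python -m pip show lm_eval 2>/dev/null | sed -n 's/^Version: //p')"
  {
    echo "date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    echo "model: $model"
    echo "command: $*"
    echo "lm_eval: ${version:-unknown}"
    echo "---"
  } >> "$MANIFEST"
}
```

A call such as `record_run "$model" lm_eval --model hf --model_args "pretrained=$model" --tasks hellaswag` placed just before the evaluation itself keeps the manifest in step with the runs.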

Verification

```
grep -E "(mmlu|hellaswag)" benchmark_results.txt 2>/dev/null || echo "Run benchmark and save output to file"
# Expected: scores for each model on the evaluated task
```
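
The grep check passes as soon as any one task appears in the file. A slightly stricter variant, sketched below, reports each expected task that is missing. `check_results` is a hypothetical helper, and it only confirms that a task name appears in the saved output, not that the score is plausible.

```shell
#!/bin/sh
# Sketch: fail loudly if the saved output lacks a line for any expected task.
# check_results is an illustrative helper, not part of lm_eval.

# Usage: check_results FILE TASK...
check_results() {
  file="$1"
  shift
  if [ ! -s "$file" ]; then
    echo "missing or empty: $file" >&2
    return 2
  fi
  status=0
  for task in "$@"; do
    # -w matches the task name as a whole word; -F treats it as a literal string.
    if ! grep -qwF -- "$task" "$file"; then
      echo "no result line for task: $task" >&2
      status=1
    fi
  done
  return "$status"
}
```

For example, `check_results benchmark_results.txt hellaswag` returns a non-zero status when the file is absent, empty, or contains no hellaswag line.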

Common failures

CUDA out of memory: Reduce --batch_size to 2 or 1, try --batch_size auto, or switch to a smaller model.
model not in HF cache: Download the weights explicitly with huggingface-cli download <model-id>.
lm_eval not found: Use python -m lm_eval instead, or ensure pip's script directory is on PATH.
task fails to load: Confirm the task name is valid with lm_eval --tasks list.
inconsistent scores: Keep --seed, --num_fewshot, --batch_size, and the lm-eval version identical across runs; a difference in any of them can shift scores.
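
The out-of-memory fix can be automated by retrying with progressively smaller batch sizes. The sketch below is one possible approach, not a harness feature: `with_batch_fallback`, `BATCH_SIZES`, and `USED_BATCH_SIZE` are hypothetical names, and the helper retries on any non-zero exit status because it does not inspect the error message.

```shell
#!/bin/sh
# Sketch: retry a command with smaller batch sizes until one succeeds.
# with_batch_fallback is an illustrative helper, not an lm_eval feature.
# Note: it retries on ANY failure, not only on CUDA out-of-memory errors.

# Usage: with_batch_fallback COMMAND...   (the helper appends --batch_size N)
with_batch_fallback() {
  for bs in ${BATCH_SIZES:-4 2 1}; do
    echo "trying --batch_size $bs" >&2
    if "$@" --batch_size "$bs"; then
      # Remember which size worked so it can be recorded with the results.
      USED_BATCH_SIZE="$bs"
      return 0
    fi
  done
  echo "all batch sizes failed" >&2
  return 1
}
```

A call would look like `with_batch_fallback lm_eval --model hf --model_args "pretrained=$model" --tasks hellaswag --device cuda:0`. If different models end up running at different batch sizes, it is worth noting `USED_BATCH_SIZE` alongside each score, since the comparison is cleanest when settings match.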

How to run a standardized benchmark suite against multiple models
