RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

© 2026 runlocalai.co · Independently operated
How to run a standardized benchmark suite against multiple models

Advanced · 30 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

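Python 3.10+, pip, multiple models downloaded in Ollama, Git for cloning evaluation tooling. Note that the commands below use the harness's hf backend, which loads weights from Hugging Face rather than from Ollama, so each evaluated model is also downloaded into the Hugging Face cache. A CUDA-capable GPU is assumed, since every command passes --device cuda:0.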
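A quick preflight check can confirm these prerequisites before anything is installed. The sketch below is an illustration, not part of the original guide: the function name missing_tools is hypothetical, the nvidia-smi check is an added assumption based on the cuda:0 device, and the script only tests that each command is on PATH.

```shell
#!/bin/sh
# Hypothetical helper: print every command from the argument list
# that cannot be found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools named in the prerequisites; nvidia-smi is an added assumption,
# since the commands in this guide target cuda:0.
missing=$(missing_tools python3 pip git nvidia-smi)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing prerequisites:" $missing
else
    echo "All prerequisite commands found"
fi

# Python 3.10+ check, done by Python itself to avoid parsing version strings.
if command -v python3 >/dev/null 2>&1; then
    if python3 -c 'import sys; sys.exit(0 if sys.version_info >= (3, 10) else 1)'; then
        echo "Python version OK"
    else
        echo "Python 3.10+ required"
    fi
fi
```

The check does not look at Ollama or at the Hugging Face cache; it only reports which commands are absent.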

What this does

Sets up lm-evaluation-harness to run the same benchmark, with the same settings, against multiple locally hosted models, so that the resulting scores can be compared directly. After completing this guide, you will have scores such as MMLU or HellaSwag accuracy for each model, ready for side-by-side comparison.

Steps

  1. Install lm-evaluation-harness. Uses pip to install the library and its dependencies.

    pip install lm-eval
    

    Expected output: a final pip line beginning with Successfully installed that lists lm-eval and its dependencies with their version numbers.

  2. Run the harness against one model. Evaluates a single model as a baseline using the MMLU task.

    lm_eval --model hf --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.3 --tasks mmlu --device cuda:0 --batch_size 4
    

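    Expected output: A results table with an accuracy row for mmlu and its subject groups, e.g. an mmlu accuracy of 0.6241. The command above does not pass --num_fewshot, so it runs with the task's default shot count; add --num_fewshot 5 if you want the conventional 5-shot setting, and use the same value for every model.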

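  3. Create a shell loop to evaluate multiple models. Runs the same task with the same flags for every model, so the scores are comparable, and appends the output to benchmark_results.txt for the verification step.

    for model in "mistralai/Mistral-7B-Instruct-v0.3" "NousResearch/Llama-3.2-3B-Instruct"; do
      # Identical task, device, and batch size for each model
      lm_eval --model hf --model_args pretrained="$model" --tasks hellaswag --device cuda:0 --batch_size 4 | tee -a benchmark_results.txt
    done
    

    Expected output: Benchmark results for each model printed in sequence and appended to benchmark_results.txt.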

  4. Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
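The loop and the evidence-recording step can be combined so that each model gets its own log and a small manifest of what was run. The following is a sketch under assumptions: the results/ directory, the model_slug and build_cmd helpers, and the DRY_RUN switch are hypothetical names introduced here, while the lm_eval flags are the ones used in the steps above. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: run one task across several models, one log file per model.
# results/, model_slug, build_cmd and DRY_RUN are illustrative names.

TASK="hellaswag"
OUT_DIR="results"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands; set to 0 to run them

# Turn "org/name" into a file-system-safe "org__name".
model_slug() {
    printf '%s\n' "$1" | sed 's|/|__|g'
}

# Print the lm_eval command for one model, using the flags from step 3.
build_cmd() {
    printf 'lm_eval --model hf --model_args pretrained=%s --tasks %s --device cuda:0 --batch_size 4\n' "$1" "$TASK"
}

mkdir -p "$OUT_DIR"
for model in "mistralai/Mistral-7B-Instruct-v0.3" "NousResearch/Llama-3.2-3B-Instruct"; do
    slug=$(model_slug "$model")
    cmd=$(build_cmd "$model")

    # Manifest: the model, the exact command, a UTC timestamp, and the
    # installed package version (left empty if pip cannot report it).
    {
        echo "model:   $model"
        echo "command: $cmd"
        echo "date:    $(date -u +%Y-%m-%dT%H:%M:%SZ)"
        echo "lm-eval: $(python3 -m pip show lm-eval 2>/dev/null | sed -n 's/^Version: //p')"
    } > "$OUT_DIR/$slug.manifest.txt"

    if [ "$DRY_RUN" = "1" ]; then
        echo "$cmd"
    else
        # Model IDs contain no spaces, so word-splitting $cmd is safe here.
        $cmd | tee "$OUT_DIR/$slug.log"
    fi
done
```

Saved under a name of your choice, the script runs for real with DRY_RUN=0 in the environment. To reuse the verification command below, the per-model logs can be concatenated with cat results/*.log > benchmark_results.txt.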

Verification

grep -E "(mmlu|hellaswag)" benchmark_results.txt 2>/dev/null || echo "Run benchmark and save output to file"
# Expected: scores for each model on the evaluated task
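To confirm that every model actually produced a score, the same grep idea can be applied per log file. This sketch assumes one log per model in a hypothetical results/ directory, and it assumes the harness prints a pipe-delimited results table whose rows contain the task name. Column positions can differ between harness versions, so the hypothetical summarize_task function prints each matching row whole instead of slicing out a number.

```shell
#!/bin/sh
# Sketch: print the result-table rows for one task from each per-model log.
# The results/ layout and the function name are assumptions for illustration.

# Usage: summarize_task TASK LOG...
# Prints "<log>: <row>" for every table row (a line starting with "|")
# that mentions TASK. Returns 1 if any log has no such row.
summarize_task() {
    task="$1"
    shift
    missing=0
    for log in "$@"; do
        rows=$(grep -E "^\|.*${task}" "$log" 2>/dev/null || true)
        if [ -n "$rows" ]; then
            printf '%s\n' "$rows" | while IFS= read -r row; do
                printf '%s: %s\n' "$log" "$row"
            done
        else
            echo "${log}: no ${task} rows found" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Only attempt a summary when the assumed results/ directory exists.
if [ -d results ]; then
    summarize_task hellaswag results/*.log || echo "At least one model has no score yet"
fi
```

A non-zero return from summarize_task means at least one model has no table row for the task, which is the case the plain grep above cannot detect.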

Common failures

  • CUDA out of memory: Reduce --batch_size to 2 or 1, or switch to a smaller model.
  • model not in HF cache: Download HF weights explicitly with huggingface-cli download <model-id>.
  • lm_eval not found: Use python -m lm_eval instead, or ensure pip bin directory is on PATH.
  • task fails to load: Confirm task name is valid with lm_eval --tasks list.
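  • inconsistent scores: Keep --seed, --num_fewshot, --batch_size, and the installed harness version identical across runs and across models; a difference in any of them can shift scores.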
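The out-of-memory fix can be applied mechanically by retrying with a smaller batch. This is a sketch, not behavior built into the harness: run_with_backoff, LAST_BATCH_SIZE, and the {bs} placeholder are hypothetical names, and the loop retries on any non-zero exit, not only on out-of-memory errors, so check the log before trusting a retry. The harness also accepts --batch_size auto, which may make a loop like this unnecessary.

```shell
#!/bin/sh
# Sketch: retry a command with batch sizes 4, 2, 1 until one succeeds.
# run_with_backoff, LAST_BATCH_SIZE and {bs} are illustrative names,
# not lm_eval features.

# Usage: run_with_backoff 'command template containing {bs}'
# On success, LAST_BATCH_SIZE holds the batch size that worked.
run_with_backoff() {
    template="$1"
    LAST_BATCH_SIZE=""
    for bs in 4 2 1; do
        # Substitute the current batch size for every {bs} in the template.
        cmd=$(printf '%s\n' "$template" | sed "s/{bs}/$bs/g")
        echo "Trying batch size $bs" >&2
        if sh -c "$cmd"; then
            LAST_BATCH_SIZE="$bs"
            return 0
        fi
    done
    echo "All batch sizes failed" >&2
    return 1
}

# Example call, commented out because it needs a GPU and downloaded weights:
# run_with_backoff 'lm_eval --model hf --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.3 --tasks mmlu --device cuda:0 --batch_size {bs}'
```

Record the batch size that finally worked alongside the rest of the run evidence, so the same value can be used for every model in the comparison.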

Related guides

  • How to benchmark models with varying context lengths
  • How to create a systematic model comparison matrix