Local evaluation lab
Run reproducible benchmarks on local models. lm-evaluation-harness + bigcode-evaluation-harness + custom task runners + a Postgres results store + Grafana for tracking. The setup that turns 'this model feels smarter' into 'this model is +3.2 on HumanEval+'.
Build summary
Goal: Evaluate model + quant + runtime combinations against standard and custom benchmarks reproducibly.
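A minimal sketch of what one pinned run looks like, using lm-evaluation-harness's Python entry point (lm_eval.simple_evaluate, available in v0.4+). The model path and task list are illustrative, not prescriptive:

```python
# One eval run against a local HF checkpoint. Raw per-task results stay
# on disk; aggregate scores go to Postgres afterwards.
import json
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                             # transformers backend
    model_args="pretrained=/models/my-7b,dtype=bfloat16",   # illustrative path
    tasks=["gsm8k", "hellaswag"],                           # illustrative tasks
    batch_size=8,
)

# Keep the raw results blob; regression debugging needs more than the score.
with open("run_results.json", "w") as f:
    json.dump(results["results"], f, indent=2, default=str)
```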
Operator card
- ✓ Researchers comparing model + quant + runtime combinations
- ✓ Teams choosing between open-weights candidates
- ✓ Anyone fine-tuning who needs before/after measurements
- ✓ Authors of model lineage / benchmark articles
- ⚠ You only need one-off vibes-check evals
- ⚠ You don't have a dedicated GPU for the lab
- ⚠ You're not willing to pin every version (reproducible eval demands it)
Service ledger
6 services across 4 layers. Each entry includes a one-line operator note explaining why this pick over alternatives.
Hardware
Single 4090 (24 GB) covers 7B-32B model evaluation (quantized at the upper end of that range). Dual 3090s with NVLink let you eval 70B-class models without renting cloud time.
Reserve one GPU for evals only. Sharing it with chat or coding workloads invalidates throughput measurements; sharing with another agent can even invalidate accuracy if KV-cache eviction behavior differs between runs.
Power matters: throttling silently degrades scores. Run evals on a dedicated PSU rail; monitor via DCGM during runs.
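A minimal throttle watchdog, assuming pynvml (the nvidia-ml-py package); DCGM gives richer telemetry, but this is enough to flag a run as suspect:

```python
# Poll temperature and throttle reasons on the dedicated eval GPU every
# 10 s. The 82 °C threshold mirrors the guidance above.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # the dedicated eval GPU

while True:
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
    thermal = pynvml.nvmlClocksThrottleReasonSwThermalSlowdown
    if temp > 82 or reasons & thermal:
        print(f"WARNING temp={temp}C throttle_mask={reasons:#x}; run may be invalid")
    time.sleep(10)
```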
Storage
Each eval run produces ~50 MB raw outputs (per-task model generations). Keep them — regression debugging needs raw output, not just aggregate scores.
Per-model: ~10-20 GB weights + ~500 MB rolling output history. Postgres results tracking adds <100 MB / year.
Back up the Postgres results DB separately — that's the irreplaceable artifact. Per-run outputs can be regenerated.
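A sketch of the results table this section assumes; the column names (model_sha, harness_commit, runtime_version) are my naming, anticipating the audit-trail fields under Security below, not a standard schema:

```python
# Create the results store once; back up this DB, not the raw outputs.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS eval_results (
    id              bigserial PRIMARY KEY,
    run_at          timestamptz NOT NULL DEFAULT now(),
    model_sha       text NOT NULL,
    harness_commit  text NOT NULL,
    runtime_version text NOT NULL,
    task            text NOT NULL,
    metric          text NOT NULL,
    score           double precision NOT NULL,
    raw_output_path text  -- regenerable, so only loosely coupled
);
"""

with psycopg2.connect("dbname=evals") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```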
Networking
Eval lab is offline-friendly. Most harnesses pre-download datasets; once downloaded, runs are local-only.
If you publish eval results: a thin static-site renderer (Next.js / Hugo) reads from Postgres and emits a leaderboard page. Internal-only is fine for solo research.
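A sketch of that thin renderer: query the results store and emit a static Markdown table for Hugo (or any static-site pipeline) to include. It assumes the eval_results table sketched under Storage:

```python
# Aggregate scores per model/task/metric and render a leaderboard table.
import psycopg2

QUERY = """
SELECT model_sha, task, metric, avg(score) AS score
FROM eval_results
GROUP BY model_sha, task, metric
ORDER BY task, score DESC;
"""

with psycopg2.connect("dbname=evals") as conn, conn.cursor() as cur:
    cur.execute(QUERY)
    rows = cur.fetchall()

with open("leaderboard.md", "w") as f:
    f.write("| Model | Task | Metric | Score |\n|---|---|---|---|\n")
    for model, task, metric, score in rows:
        f.write(f"| {model[:12]} | {task} | {metric} | {score:.3f} |\n")
```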
Observability
Critical metrics during a run:
- Eval throughput (samples/min). Drops indicate VRAM pressure or thermal throttling.
- GPU temperature during the run. >82 °C sustained → possible throttling → invalid run.
- Sampling determinism check. Re-run a few canonical prompts at run start and compare outputs to the previous run's baseline. Drift means something changed (driver, runtime, model); a minimal check is sketched after this list.
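A sketch of that determinism canary, stdlib only. generate_fn is a stand-in for whatever pinned runtime you call, and the baseline file name and prompts are illustrative:

```python
# Greedy-decode fixed canary prompts and compare a hash of the outputs
# against the previous run's baseline before starting the real suite.
import hashlib
import json
import pathlib

CANARY_PROMPTS = ["2+2=", "Name the capital of France.", "def fib(n):"]
BASELINE = pathlib.Path("canary_baseline.json")

def check_determinism(generate_fn) -> bool:
    outputs = [generate_fn(p, temperature=0.0) for p in CANARY_PROMPTS]
    digest = hashlib.sha256("\n".join(outputs).encode()).hexdigest()
    if BASELINE.exists():
        if json.loads(BASELINE.read_text())["digest"] != digest:
            return False  # drift: keep the old baseline around for debugging
    BASELINE.write_text(json.dumps({"digest": digest}))
    return True
```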
Post-run:
- Score regression detection. Grafana alert when a new run scores >2σ below its model's historical mean (a query sketch follows this list).
- Reproducibility window. The same model + harness commit + vLLM commit should reproduce scores within ~0.5%. Wider variance means something is non-deterministic.
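The 2σ check as a query sketch against the eval_results table from Storage; in practice the Grafana alert runs the equivalent:

```python
# Flag latest-run scores more than 2σ below the per-(model, task, metric)
# mean. Sketch-level caveat: the mean here includes the new run itself.
import psycopg2

QUERY = """
SELECT t.model_sha, t.task, t.metric, t.score, s.mean, s.stddev
FROM eval_results t
JOIN (
    SELECT model_sha, task, metric,
           avg(score) AS mean, stddev_samp(score) AS stddev
    FROM eval_results GROUP BY model_sha, task, metric
) s USING (model_sha, task, metric)
WHERE t.run_at = (SELECT max(run_at) FROM eval_results)
  AND s.stddev > 0
  AND t.score < s.mean - 2 * s.stddev;
"""

with psycopg2.connect("dbname=evals") as conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for row in cur.fetchall():
        print("REGRESSION:", row)
```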
Security
Dataset contamination. Many public benchmarks have leaked into training data. lm-eval ships with leakage-aware variants when available; use them.
Custom eval data. If you write proprietary eval suites, treat the dataset like source code — don't paste into ChatGPT, don't train on it accidentally.
Reproducibility audit trail. Every result row should reference a Git commit (harness + your custom tasks), a model SHA, and a runtime version; this matters when a paper reviewer asks. A capture sketch follows.
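A sketch of capturing those fields at run start. Repo and weight paths are illustrative, and hashing the safetensors files is one reasonable definition of "model SHA":

```python
# Collect the provenance fields that every eval_results row references.
import hashlib
import importlib.metadata
import pathlib
import subprocess

def git_commit(repo: str) -> str:
    return subprocess.check_output(
        ["git", "-C", repo, "rev-parse", "HEAD"], text=True
    ).strip()

def model_sha(weights_dir: str) -> str:
    h = hashlib.sha256()
    for f in sorted(pathlib.Path(weights_dir).glob("*.safetensors")):
        with f.open("rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                h.update(chunk)
    return h.hexdigest()

audit = {
    "harness_commit": git_commit("/srv/lm-evaluation-harness"),
    "tasks_commit": git_commit("/srv/my-eval-tasks"),
    "model_sha": model_sha("/models/my-7b"),
    "runtime_version": importlib.metadata.version("vllm"),
}
```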
Upgrade path
More benchmarks: add MT-Bench, AlpacaEval 2 (LLM-as-judge), SWE-bench. Each has a different runtime profile; budget hardware time per benchmark.
Multi-model parallel eval: add a second GPU and run lm-eval-harness on both with different models. This improves throughput at the risk of cross-contamination if the runs mistakenly share state.
Production-grade: move from Postgres-on-Docker to managed Postgres (or a local HA pair), add a webhook that fires CI on each new model release, and automate leaderboard page generation.
Custom harness tasks: write task definitions for your domain — coding-style, factuality-on-internal-docs, instruction-following on your prompt library.
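A harness-agnostic sketch of such a custom task runner: a hypothetical prompts.jsonl holds {"prompt": ..., "reference": ...} rows, exact match is the simplest scoring, and generate_fn again stands in for your pinned runtime:

```python
# Score a domain-specific suite by exact match against references.
import json

def run_custom_task(generate_fn, path: str = "prompts.jsonl") -> float:
    correct = total = 0
    with open(path) as f:
        for line in f:
            case = json.loads(line)
            output = generate_fn(case["prompt"], temperature=0.0)
            correct += int(output.strip() == case["reference"].strip())
            total += 1
    return correct / total  # report alongside standard-benchmark scores
```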
What breaks first
- Driver / runtime drift mid-suite. Evals taking days span auto-update windows. Pin everything; never run NVIDIA driver upgrades during a multi-day eval campaign.
- Sampling non-determinism. Different vLLM versions sample differently with the same seed. Tag every run with the runtime SHA.
- Dataset version drift. HF datasets occasionally update; cached versions may differ from latest. Pin dataset revisions.
- Disk fill from raw outputs. A multi-task eval can drop 5 GB of model generations. Set up rotation (sketched after this list).
- Postgres integrity. Don't run evals against the same Postgres that stores production data. One bug in a custom harness can corrupt the results table.
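A rotation sketch for the raw outputs. The runs/<timestamp>/ layout is illustrative; scores stay in Postgres, so only regression-debugging depth is lost:

```python
# Keep the newest KEEP run directories, delete the rest.
import pathlib
import shutil

KEEP = 20
runs = sorted(pathlib.Path("runs").iterdir(), key=lambda p: p.stat().st_mtime)
for old in runs[:-KEEP]:
    shutil.rmtree(old)
```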
Composes these stacks
The /stacks layer covers what to assemble; this workflow shows how those assemblies operate as a system.
Validation
This workflow doesn't name a model + hardware pairing specifically enough to validate. Add explicit modelSlug + hardwareSlug entries to the services for the bridge to work.