Local evaluation lab
Run reproducible benchmarks on local models. lm-evaluation-harness + bigcode-evaluation-harness + custom task runners + a Postgres results store + Grafana for tracking. The setup that turns 'this model feels smarter' into 'this model is +3.2 on HumanEval+'.
Build summary
Goal: Evaluate model + quant + runtime combinations against standard and custom benchmarks reproducibly.
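A minimal sketch of what one pinned run looks like, using lm-evaluation-harness's Python entry point (lm_eval.simple_evaluate, available in v0.4+). The model path and task list are illustrative, not prescriptive:

```python
# One eval run against a local HF checkpoint. Raw per-task results stay
# on disk; aggregate scores go to Postgres afterwards.
import json
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                             # transformers backend
    model_args="pretrained=/models/my-7b,dtype=bfloat16",   # illustrative path
    tasks=["gsm8k", "hellaswag"],                           # illustrative tasks
    batch_size=8,
)

# Keep the raw results blob; regression debugging needs more than the score.
with open("run_results.json", "w") as f:
    json.dump(results["results"], f, indent=2, default=str)
```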
Operator card
- ✓ Researchers comparing model + quant + runtime combinations
- ✓ Teams choosing between open-weights candidates
- ✓ Anyone fine-tuning who needs before/after measurements
- ✓ Authors of model lineage / benchmark articles
- ⚠ You only need one-off vibes-check evals
- ⚠ You don't have a dedicated GPU for the lab
- ⚠ You're not willing to pin every version (reproducible eval demands it)
Service ledger
6 services across 4 layers. Each entry includes a one-line operator note explaining why this pick over alternatives.
Hardware
Single 4090 (24 GB) covers 7B-32B model evaluation (quantized at the upper end of that range). Dual 3090s with NVLink let you eval 70B-class models without renting cloud time.
Reserve one GPU for evals only. Sharing it with chat or coding workloads invalidates throughput measurements; sharing with another agent can even invalidate accuracy if KV-cache eviction behavior differs between runs.
Power matters: throttling silently degrades scores. Run evals on a dedicated PSU rail; monitor via DCGM during runs.
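A minimal throttle watchdog, assuming pynvml (the nvidia-ml-py package); DCGM gives richer telemetry, but this is enough to flag a run as suspect:

```python
# Poll temperature and throttle reasons on the dedicated eval GPU every
# 10 s. The 82 °C threshold mirrors the guidance above.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # the dedicated eval GPU

while True:
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
    thermal = pynvml.nvmlClocksThrottleReasonSwThermalSlowdown
    if temp > 82 or reasons & thermal:
        print(f"WARNING temp={temp}C throttle_mask={reasons:#x}; run may be invalid")
    time.sleep(10)
```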
Storage
Each eval run produces ~50 MB raw outputs (per-task model generations). Keep them — regression debugging needs raw output, not just aggregate scores.
Per-model: ~10-20 GB weights + ~500 MB rolling output history. Postgres results tracking adds <100 MB / year.
Back up the Postgres results DB separately — that's the irreplaceable artifact. Per-run outputs can be regenerated.
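A sketch of the results table this section assumes; the column names (model_sha, harness_commit, runtime_version) are my naming, anticipating the audit-trail fields under Security below, not a standard schema:

```python
# Create the results store once; back up this DB, not the raw outputs.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS eval_results (
    id              bigserial PRIMARY KEY,
    run_at          timestamptz NOT NULL DEFAULT now(),
    model_sha       text NOT NULL,
    harness_commit  text NOT NULL,
    runtime_version text NOT NULL,
    task            text NOT NULL,
    metric          text NOT NULL,
    score           double precision NOT NULL,
    raw_output_path text  -- regenerable, so only loosely coupled
);
"""

with psycopg2.connect("dbname=evals") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```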
Networking
Eval lab is offline-friendly. Most harnesses pre-download datasets; once downloaded, runs are local-only.
If you publish eval results: a thin static-site renderer (Next.js / Hugo) reads from Postgres and emits a leaderboard page. Internal-only is fine for solo research.
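A sketch of that thin renderer: query the results store and emit a static Markdown table for Hugo (or any static-site pipeline) to include. It assumes the eval_results table sketched under Storage:

```python
# Aggregate scores per model/task/metric and render a leaderboard table.
import psycopg2

QUERY = """
SELECT model_sha, task, metric, avg(score) AS score
FROM eval_results
GROUP BY model_sha, task, metric
ORDER BY task, score DESC;
"""

with psycopg2.connect("dbname=evals") as conn, conn.cursor() as cur:
    cur.execute(QUERY)
    rows = cur.fetchall()

with open("leaderboard.md", "w") as f:
    f.write("| Model | Task | Metric | Score |\n|---|---|---|---|\n")
    for model, task, metric, score in rows:
        f.write(f"| {model[:12]} | {task} | {metric} | {score:.3f} |\n")
```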
Observability
Critical metrics during a run:
- Eval throughput (samples/min). Drops indicate VRAM pressure or thermal throttling.
- GPU temperature during the run. >82 °C sustained → possible throttling → invalid run.
- Sampling determinism check. Re-run a few canonical prompts at run start and compare outputs to the previous run's baseline. Drift means something changed (driver, runtime, model); a minimal check is sketched after this list.
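A sketch of that determinism canary, stdlib only. generate_fn is a stand-in for whatever pinned runtime you call, and the baseline file name and prompts are illustrative:

```python
# Greedy-decode fixed canary prompts and compare a hash of the outputs
# against the previous run's baseline before starting the real suite.
import hashlib
import json
import pathlib

CANARY_PROMPTS = ["2+2=", "Name the capital of France.", "def fib(n):"]
BASELINE = pathlib.Path("canary_baseline.json")

def check_determinism(generate_fn) -> bool:
    outputs = [generate_fn(p, temperature=0.0) for p in CANARY_PROMPTS]
    digest = hashlib.sha256("\n".join(outputs).encode()).hexdigest()
    if BASELINE.exists():
        if json.loads(BASELINE.read_text())["digest"] != digest:
            return False  # drift: keep the old baseline around for debugging
    BASELINE.write_text(json.dumps({"digest": digest}))
    return True
```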
Post-run:
- Score regression detection. Grafana alert when a new run scores >2σ below its model's historical mean (a query sketch follows this list).
- Reproducibility window. The same model + harness commit + vLLM commit should reproduce scores within ~0.5%. Wider variance means something is non-deterministic.
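The 2σ check as a query sketch against the eval_results table from Storage; in practice the Grafana alert runs the equivalent:

```python
# Flag latest-run scores more than 2σ below the per-(model, task, metric)
# mean. Sketch-level caveat: the mean here includes the new run itself.
import psycopg2

QUERY = """
SELECT t.model_sha, t.task, t.metric, t.score, s.mean, s.stddev
FROM eval_results t
JOIN (
    SELECT model_sha, task, metric,
           avg(score) AS mean, stddev_samp(score) AS stddev
    FROM eval_results GROUP BY model_sha, task, metric
) s USING (model_sha, task, metric)
WHERE t.run_at = (SELECT max(run_at) FROM eval_results)
  AND s.stddev > 0
  AND t.score < s.mean - 2 * s.stddev;
"""

with psycopg2.connect("dbname=evals") as conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for row in cur.fetchall():
        print("REGRESSION:", row)
```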
Security
Dataset contamination. Many public benchmarks have leaked into training data. lm-eval ships with leakage-aware variants when available; use them.
Custom eval data. If you write proprietary eval suites, treat the dataset like source code — don't paste into ChatGPT, don't train on it accidentally.
Reproducibility audit trail. Every result row should reference a Git commit (harness + your custom tasks), a model SHA, and a runtime version; this matters when a paper reviewer asks. A capture sketch follows.
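A sketch of capturing those fields at run start. Repo and weight paths are illustrative, and hashing the safetensors files is one reasonable definition of "model SHA":

```python
# Collect the provenance fields that every eval_results row references.
import hashlib
import importlib.metadata
import pathlib
import subprocess

def git_commit(repo: str) -> str:
    return subprocess.check_output(
        ["git", "-C", repo, "rev-parse", "HEAD"], text=True
    ).strip()

def model_sha(weights_dir: str) -> str:
    h = hashlib.sha256()
    for f in sorted(pathlib.Path(weights_dir).glob("*.safetensors")):
        with f.open("rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                h.update(chunk)
    return h.hexdigest()

audit = {
    "harness_commit": git_commit("/srv/lm-evaluation-harness"),
    "tasks_commit": git_commit("/srv/my-eval-tasks"),
    "model_sha": model_sha("/models/my-7b"),
    "runtime_version": importlib.metadata.version("vllm"),
}
```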
Upgrade path
More benchmarks: add MT-Bench, AlpacaEval 2 (LLM-as-judge), SWE-bench. Each has a different runtime profile; budget hardware time per benchmark.
Multi-model parallel eval: add a second GPU and run lm-eval-harness on both with different models. This improves throughput at the risk of cross-contamination if the runs mistakenly share state.
Production-grade: move from Postgres-on-Docker to managed Postgres (or a local HA pair), add a webhook that fires CI on each new model release, and automate leaderboard page generation.
Custom harness tasks: write task definitions for your domain — coding-style, factuality-on-internal-docs, instruction-following on your prompt library.
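A harness-agnostic sketch of such a custom task runner: a hypothetical prompts.jsonl holds {"prompt": ..., "reference": ...} rows, exact match is the simplest scoring, and generate_fn again stands in for your pinned runtime:

```python
# Score a domain-specific suite by exact match against references.
import json

def run_custom_task(generate_fn, path: str = "prompts.jsonl") -> float:
    correct = total = 0
    with open(path) as f:
        for line in f:
            case = json.loads(line)
            output = generate_fn(case["prompt"], temperature=0.0)
            correct += int(output.strip() == case["reference"].strip())
            total += 1
    return correct / total  # report alongside standard-benchmark scores
```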
What breaks first
- Driver / runtime drift mid-suite. Evals taking days span auto-update windows. Pin everything; never run NVIDIA driver upgrades during a multi-day eval campaign.
- Sampling non-determinism. Different vLLM versions sample differently with the same seed. Tag every run with the runtime SHA.
- Dataset version drift. HF datasets occasionally update; cached versions may differ from latest. Pin dataset revisions.
- Disk fill from raw outputs. A multi-task eval can drop 5 GB of model generations. Set up rotation (sketched after this list).
- Postgres integrity. Don't run evals against the same Postgres that stores production data. One bug in a custom harness can corrupt the results table.
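A rotation sketch for the raw outputs. The runs/<timestamp>/ layout is illustrative; scores stay in Postgres, so only regression-debugging depth is lost:

```python
# Keep the newest KEEP run directories, delete the rest.
import pathlib
import shutil

KEEP = 20
runs = sorted(pathlib.Path("runs").iterdir(), key=lambda p: p.stat().st_mtime)
for old in runs[:-KEEP]:
    shutil.rmtree(old)
```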
Composes these stacks
The /stacks layer covers what to assemble; this workflow shows how those assemblies operate as a system.
Validation
This workflow doesn't name a model + hardware pairing specifically enough to validate. Add explicit modelSlug + hardwareSlug entries to the services for the bridge to work.