RunLocalAI Model Intelligence Pipeline

The Model Intelligence Pipeline imports public benchmark signals and labels them as external priors. These priors help answer which model is worth trying, but they never replace RunLocalAI's local fit math, its measured tok/s rows, or its evidence on quantization, context length, runtime, and confidence.

By Eruo Fredoline - Last reviewed 2026-05-28

Why this exists

RunLocalAI should not become a generic model leaderboard. LMSYS, LMArena, LiveBench, Artificial Analysis, and similar projects already specialize in model quality, human preference, provider speed, and contamination-resistant evaluation. RunLocalAI's job is to join that intelligence to the local reality: hardware fit, runtime choice, quantization, context pressure, speed, cost, and evidence.

The pipeline therefore produces model intelligence priors. A prior can say that one model appears stronger than another in external evaluations. It cannot say that the model will fit in your VRAM, run fast at Q4, or beat the cloud for your workload. Those claims come from the Will-It-Run Framework and RunLocalAI benchmark evidence.

Sources

OpenEvals leaderboard data

Provider and public benchmark scores such as GPQA, SWE-style coding scores, MMLU-Pro-like fields, and aggregate coverage.

LMArena leaderboard dataset

Human-preference ratings from the text leaderboard, preserved as sourced external preference data.

LiveBench model judgments

Public judgment rows aggregated by model and category without claiming RunLocalAI reproduced the evaluation.

Pipeline contract

weekly GitHub Action
  -> fetch upstream dataset-server rows
  -> normalize model aliases
  -> attach source/confidence/license notes
  -> aggregate LiveBench judgment rows from a stated sample
  -> compute transparent model-intelligence priors
  -> publish JSON snapshot + API + dated archive
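
The alias-normalization step can be pictured with a minimal sketch. The alias table, the function name, and the cleanup rules below are illustrative assumptions, not the pipeline's actual implementation; the point is only that differently spelled upstream names should collapse to one canonical id before scores are joined.

```python
import re

# Hypothetical alias table: cleaned spelling -> canonical model id.
ALIASES = {
    "llama-3.1-70b-instruct": "llama-3.1-70b-instruct",
    "meta-llama-3.1-70b-instruct": "llama-3.1-70b-instruct",
}


def normalize_alias(raw: str) -> str:
    """Map an upstream model name to a canonical id (assumed rules)."""
    name = raw.strip().lower()
    name = name.split("/")[-1]            # drop an org prefix such as "meta-llama/"
    name = re.sub(r"[\s_]+", "-", name)   # unify spaces and underscores into hyphens
    name = re.sub(r"-+", "-", name).strip("-")
    # Unknown names pass through cleaned rather than being guessed at.
    return ALIASES.get(name, name)
```

Names that are not in the table are returned cleaned but otherwise unchanged, so an unrecognized model never gets silently merged into another one.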

Every score in the snapshot keeps its source dataset, config, split, source URL, row count, capture time, scale, and evidence note. The public JSON includes not_local_measurement: true at the snapshot level because this layer is intentionally separate from local execution evidence.
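
As a rough illustration of that provenance contract, the sketch below models one score record and a snapshot wrapper. Only not_local_measurement is a key named in this document; the class name, the other field names, and the "scores" key are assumptions made for the example and may not match the published schema.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SourcedScore:
    """One external score plus the provenance listed above (field names assumed)."""
    model: str
    value: float
    scale: str           # e.g. "0-100" or a rating scale
    source_dataset: str
    config: str
    split: str
    source_url: str
    row_count: int
    captured_at: str     # capture time as an ISO 8601 string
    evidence_note: str


def build_snapshot(scores):
    """Wrap scores in a snapshot flagged as not being a local measurement."""
    return {
        "not_local_measurement": True,  # set once, at the snapshot level
        "scores": [asdict(s) for s in scores],
    }
```

Making every provenance field a required constructor argument means a score cannot be created without its source details, which mirrors the intent of the contract.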

OpenEvals is small enough to pull fully. LMArena and LiveBench are large enough to hit public rows-API rate limits, so the default job fetches a capped sample of each and publishes fetched_rows, total_rows_reported, and sampling_note in the source metadata, which lets readers see how much of each source was actually read. Deeper refreshes can raise those caps without changing the public schema.
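
A capped fetch of this kind might look like the sketch below. The three metadata keys come from the text above; the function name, the fetch_page callable, the page size, and the wording of the note are hypothetical stand-ins for whatever the real job uses.

```python
def fetch_capped(fetch_page, total_rows, max_rows, page_size=100):
    """Fetch at most max_rows rows and describe the sample (illustrative sketch).

    fetch_page(offset, length) is a hypothetical callable returning a list of rows.
    """
    rows = []
    offset = 0
    limit = min(max_rows, total_rows)
    while len(rows) < limit:
        length = min(page_size, limit - len(rows))
        page = fetch_page(offset, length)
        if not page:  # upstream returned nothing; stop instead of looping forever
            break
        rows.extend(page)
        offset += len(page)

    fetched = len(rows)
    if fetched >= total_rows:
        note = "full pull"
    else:
        note = f"first {fetched} of {total_rows} rows; capped to respect upstream rate limits"
    metadata = {
        "fetched_rows": fetched,
        "total_rows_reported": total_rows,
        "sampling_note": note,
    }
    return rows, metadata
```

Raising max_rows changes only the values in the metadata, not its keys, which is how a deeper refresh can leave the public schema untouched.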

Composite discipline

The composite score is a convenience sort key, not the truth. It normalizes whichever of the OpenEvals aggregate score, the LMArena text rating, and the LiveBench aggregate judgment are available into a 0-100 prior. Missing components are not backfilled. The confidence tier is based on source coverage: high when all three source families are present, moderate when two are present, and low when only one is present.
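
The rules above can be sketched as follows. The tier mapping follows the text; the normalization ranges, the unweighted mean, and all names are assumptions chosen for illustration, since the exact formula is not stated here.

```python
from statistics import mean

# Assumed raw-score ranges per source family (illustrative, not from the source).
RANGES = {
    "openevals": (0.0, 100.0),
    "lmarena": (1000.0, 1500.0),
    "livebench": (0.0, 1.0),
}

# Confidence tier by number of source families present, as described above.
TIERS = {3: "high", 2: "moderate", 1: "low"}


def normalize(value, lo, hi):
    """Clamp value to [lo, hi] and rescale it to 0-100."""
    clamped = max(lo, min(hi, value))
    return 100.0 * (clamped - lo) / (hi - lo)


def composite_prior(scores):
    """Combine available components; missing ones are skipped, never backfilled."""
    present = {k: v for k, v in scores.items() if k in RANGES and v is not None}
    if not present:
        return None  # no external evidence, so no prior at all
    parts = [normalize(v, *RANGES[k]) for k, v in present.items()]
    return {
        "composite": round(mean(parts), 1),  # assumed: unweighted mean of components
        "confidence": TIERS[len(present)],
        "components": sorted(present),
    }
```

For example, under these assumed ranges a model with an OpenEvals score of 60 and an LMArena rating of 1250, but no LiveBench data, gets a composite of 55.0 with moderate confidence, because only two source families contribute.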

The composite is never shown as a local verdict by itself. A useful RunLocalAI answer must join it with the local tuple:

model intelligence prior
+ hardware effective VRAM
+ quantization and context
+ runtime support
+ measured or estimated tok/s evidence
+ cost-vs-cloud economics
= Will-It-Run verdict
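
One way to enforce that join in code is to refuse a verdict until every part of the tuple is present. The sketch below only checks completeness; it does not reproduce the Will-It-Run Framework's logic, and the class name, field names, and units are assumptions made for the example.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class VerdictInputs:
    """The local tuple listed above, one optional field per term (names assumed)."""
    model_intelligence_prior: Optional[float] = None
    effective_vram_gb: Optional[float] = None
    quantization: Optional[str] = None
    context_tokens: Optional[int] = None
    runtime: Optional[str] = None
    tokens_per_second: Optional[float] = None  # measured or estimated
    cost_vs_cloud: Optional[float] = None


def missing_inputs(inputs):
    """Return the names of tuple fields that have not been supplied."""
    return [f.name for f in fields(inputs) if getattr(inputs, f.name) is None]


def ready_for_verdict(inputs):
    """A prior alone is never enough: every field must be filled in."""
    return not missing_inputs(inputs)
```

With only the prior filled in, ready_for_verdict returns False and missing_inputs names the local evidence that still has to be gathered.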

Public files

JSON snapshot: /model-intelligence/latest.json
API endpoint: /api/v1/model-intelligence
JSON schema: /schemas/runlocalai-model-intelligence-snapshot-v1.json

Limits

External benchmarks are useful but incomplete. They may use different prompts, providers, endpoints, tools, release dates, sampling settings, or model variants. They may not reflect open-weight quantized quality. They do not answer whether a model is legal for your use case, safe for your domain, cheap enough for your workload, or runnable on your exact machine.

That is why this pipeline exists as a supporting layer. RunLocalAI owns the final local decision, not the generic leaderboard.

Use the priors correctly

The fit methodology that turns priors into local decisions.

Read the Will-It-Run Framework

Or open the model intelligence API, or browse local benchmarks.