Runtime health methodology
The runtime-health surface labels each inference engine with a release cadence (active, maintained, slowing, stalled), a benchmark-freshness pill, common failure modes, and an ecosystem-stability note. These are useful signals when an operator is choosing between Ollama, vLLM, llama.cpp, MLX-LM, ExLlamaV2, SGLang, TensorRT-LLM, or anything new in the field. They are also proxies. This page is the honest accounting of what each label means and what it does not.
Release cadence — the four labels
The cadence label is computed from the runtime tool’s editorial timestamps in the catalog. The derivation lives at src/lib/runtime-health.ts. It picks the most recent of three signals — the operational-review date, the last-updated date, and the LLM-enrichment date — and bins the age of that timestamp into one of four buckets; a sketch of the derivation follows the label definitions below. This is a deliberately small surface; we’d rather have an honest proxy that’s easy to defend than a complicated metric that’s hard to interrogate.
- Active. The runtime has been editorially touched within the last 60 days. In practice this corresponds to one of three things: a release the editorial team incorporated, an operational review, or an LLM-enrichment refresh against current docs. Active is the strongest cadence signal we publish.
- Maintained. Editorial touch within 60 to 180 days. The runtime is still on the catalog’s map, the ecosystem isn’t treating it as abandoned, but the pace of change is moderate. Most production-grade runtimes spend long stretches of their life here.
- Slowing. Editorial touch 180 to 365 days ago. The runtime hasn’t been refreshed in our system in six months to a year. Could mean genuine project slowdown; could mean the editorial pipeline simply hasn’t got back to it. We render this label with the understanding that it’s a yellow flag, not a verdict.
- Stalled. Editorial touch more than 365 days ago. The runtime should not be treated as actively recommended. Operators are free to use it, but the catalog is honestly signalling that we haven’t re-validated it within a reasonable window.
A runtime with no editorial timestamp at all returns the unknown label and the UI omits the pill rather than guessing. This is the same discipline as the rest of the trust moat — missing data is rendered as missing, not assumed.
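For concreteness, here is a minimal sketch of that derivation in TypeScript. The field names (operationalReviewAt, lastUpdatedAt, llmEnrichedAt) are illustrative assumptions, not the actual schema in src/lib/runtime-health.ts; the binning thresholds are the ones defined above.

```ts
type Cadence = "active" | "maintained" | "slowing" | "stalled" | "unknown";

// Assumed shape: all three editorial timestamps are optional ISO dates.
interface RuntimeTimestamps {
  operationalReviewAt?: string;
  lastUpdatedAt?: string;
  llmEnrichedAt?: string;
}

const DAY_MS = 24 * 60 * 60 * 1000;

function deriveCadence(t: RuntimeTimestamps, now = new Date()): Cadence {
  const touches = [t.operationalReviewAt, t.lastUpdatedAt, t.llmEnrichedAt]
    .filter((d): d is string => Boolean(d))
    .map((d) => new Date(d).getTime());

  // No editorial timestamp at all: report unknown rather than guess,
  // and let the UI omit the pill entirely.
  if (touches.length === 0) return "unknown";

  const ageDays = (now.getTime() - Math.max(...touches)) / DAY_MS;

  if (ageDays <= 60) return "active";      // touched within 60 days
  if (ageDays <= 180) return "maintained"; // 60 to 180 days
  if (ageDays <= 365) return "slowing";    // 180 to 365 days
  return "stalled";                        // more than 365 days
}
```

The unknown branch is the design choice worth noting: a missing timestamp short-circuits before any binning, so absent data is rendered as absent rather than defaulted into stalled.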
Benchmark freshness
Independently of cadence, each runtime carries a benchmark-freshness signal: when was the most recent benchmark published against this runtime, and how many of its current benchmarks are within the 18-month staleness threshold from the confidence methodology? A runtime can be at active cadence (the editorial team is paying attention) but with stale benchmarks (no recent operator submissions). The inverse is also possible: a runtime at slowing cadence that just picked up a fresh batch of operator benchmarks because someone ran a comparison.
Both readings are useful and they don’t reduce to a single score. The runtime-health surface presents them side by side so the operator can read both signals; a sketch of the freshness computation follows.
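A minimal sketch of that computation, assuming an illustrative BenchmarkRow shape; the 18-month window comes from the confidence methodology:

```ts
interface BenchmarkRow {
  publishedAt: string; // ISO date of the benchmark submission
}

const STALENESS_MONTHS = 18;

function benchmarkFreshness(rows: BenchmarkRow[], now = new Date()) {
  // No benchmarks at all: render missing data as missing.
  if (rows.length === 0) return { latest: null, freshShare: null };

  const cutoff = new Date(now.getTime());
  cutoff.setMonth(cutoff.getMonth() - STALENESS_MONTHS);

  const times = rows.map((r) => new Date(r.publishedAt).getTime());
  const latest = new Date(Math.max(...times)); // most recent benchmark
  const fresh = times.filter((t) => t >= cutoff.getTime()).length;

  return { latest, freshShare: fresh / rows.length };
}
```

Note that the function returns two values rather than folding them into one score, mirroring the side-by-side presentation above.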
Common failure modes
Each runtime detail page surfaces the failure modes editorial has observed or operators have reported. These are not exhaustive lists; they’re the patterns that come up often enough to warrant calling out before an operator commits to the runtime. Common categories:
- Out-of-memory at long context. Different runtimes allocate KV cache differently; a runtime that’s fine at 4K context may OOM at 32K on the same hardware. We surface the threshold where editorial or operators have hit it.
- Quantisation-format coverage gaps. Most runtimes don’t support every quant format. The page surfaces what loads cleanly and what produces errors.
- Model-architecture coverage gaps. A new model architecture might land in llama.cpp before vLLM, or vice versa. We track the lag where it’s noticeable.
- Driver / CUDA / OS combinations that break. Specific known-bad pairings get flagged. ROCm has more of these than CUDA; Apple Silicon has fewer because the surface is narrower.
The discipline rule for this section is the same as everywhere else: we report observed failures, not speculative ones. A failure mode listed here means somebody hit it and wrote it down. A clean failure-mode list does not mean the runtime is bug-free; it means we haven’t collected reports.
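As a schema-level illustration of that rule, a failure-mode entry might look like the sketch below. Every field name is an assumption for illustration, not the catalog’s actual shape; the structural point is that the grounding report is a required field, so a speculative failure has nowhere to live.

```ts
type FailureCategory =
  | "oom-long-context"    // OOM at long context (KV-cache allocation)
  | "quant-format-gap"    // quantisation formats that fail to load
  | "architecture-gap"    // model architectures not yet supported
  | "driver-os-breakage"; // known-bad driver / CUDA / OS pairings

interface FailureModeEntry {
  category: FailureCategory;
  summary: string;                      // what breaks, and at what threshold
  observedBy: "editorial" | "operator"; // who hit it
  reportedAt: string;                   // ISO date of the grounding report
  reportRef: string;                    // link or ID of the write-up; required
}

// Purely illustrative entry, echoing the long-context example above.
const example: FailureModeEntry = {
  category: "oom-long-context",
  summary: "Fine at 4K context, OOMs at 32K on the same hardware",
  observedBy: "operator",
  reportedAt: "2025-01-15", // placeholder date
  reportRef: "<report-id>", // placeholder reference
};
```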
Ecosystem stability
Beyond the runtime itself, each detail page carries notes on ecosystem stability: how often the API surface changes, whether wrappers like LangChain or LlamaIndex track the runtime cleanly, whether the operational footprint (config files, env vars, defaults) shifts between releases. A runtime that’s technically active but rewrites its config syntax every two releases costs operators time even when it’s shipping improvements.
Ecosystem stability is editorially noted rather than derived from a metric. The notes carry the editorial pill; operators can disagree and submit corrections through /submit/feedback.
What this methodology cannot measure
Honest gaps in what the runtime-health surface can tell you.
- Bug rate. We don’t track open-issue count, time-to-close, or severity distribution. A runtime could ship a release every week with regressions every other one and still earn the active cadence pill. The confidence engine’s regression-candidate detector partially counterbalances this on the benchmark side, but the runtime-health surface itself is silent on bug volume.
- Maintainer health. Project sustainability is a real concern in inference-engine ecosystems — a one-maintainer project that ships frequently can collapse overnight. We don’t track contributor counts, bus factor, or governance structure.
- Documentation quality. A runtime with great performance and broken docs costs operators days. Editorial notes flag documentation problems where they’re egregious; we don’t systematically audit doc completeness.
- Long-tail compatibility. A runtime that works beautifully on the configurations we benchmark might break in subtle ways on configurations we don’t. Coverage is finite by definition.
- Production-deployment longevity. Whether a runtime is stable enough to put into 24/7 production over six months is something only operators running it that long can answer. We surface their reports when we get them.
Why not the GitHub API?
A frequent question: why not pull commit cadence, issue counts, and contributor activity directly from GitHub and produce a more sophisticated health score? The catalog doesn’t because the tradeoff isn’t favourable for this surface.
The good case for a GitHub integration is real — commit cadence is more granular than editorial timestamps, and it could catch the case where the editorial pipeline lags. The problem is that commit cadence isn’t the same as runtime health; a project rewriting its README weekly has the same commit pulse as one shipping a new attention kernel weekly. Filtering for “meaningful” commits requires editorial interpretation, which is what we’re already doing through the operational-review timestamp. Adding the API layer would swap one editorial signal for two competing signals without obviously improving accuracy. The operator question — is this runtime safe to commit to for the next 12 months? — is genuinely a judgement call, not a metric. We’d rather render the judgement honestly than dress it up with API-derived precision.
If we add release-tracking columns later, the health derivation will prefer them; a hedged sketch of that preference order follows. The current proxy is documented as a proxy.
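A minimal sketch of that fallback, assuming a hypothetical latestReleaseAt column that does not exist today:

```ts
interface HealthTimestamps {
  latestReleaseAt?: string;     // hypothetical future release-tracking column
  operationalReviewAt?: string; // existing editorial signals (names assumed)
  lastUpdatedAt?: string;
  llmEnrichedAt?: string;
}

// Prefer the release signal when present; otherwise fall back to the most
// recent editorial touch, as the current derivation does.
function healthSignalDate(t: HealthTimestamps): Date | null {
  if (t.latestReleaseAt) return new Date(t.latestReleaseAt);
  const editorial = [t.operationalReviewAt, t.lastUpdatedAt, t.llmEnrichedAt]
    .filter((d): d is string => Boolean(d))
    .map((d) => new Date(d).getTime());
  return editorial.length > 0 ? new Date(Math.max(...editorial)) : null;
}
```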
Adjacent reading
The versioned-benchmarking methodology documents the metadata fields that connect specific benchmark rows to specific runtime versions. The confidence methodology documents how the runtime-version-drift signal feeds confidence demotion. The scoring methodology covers the parallel engine for catalog dimensions like runtime maturity and setup complexity.
Next recommended step
See per-runtime detail pages with cadence pills, freshness signals, and failure-mode notes.
Back to /resources. See also /editorial-policy and /submit/feedback if you spot a runtime label that’s out of date.