Llama 3.1 8B Instruct on Apple M4 Max
Measured on 2026-05-04.
Measurement

| Metric    | Value      |
|-----------|------------|
| tok/s     | 78.5       |
| TTFT      | 92 ms      |
| VRAM used | n/a        |
| RAM used  | 4.5 GB     |
| Power     | 28 W       |
| Quant     | MLX-4bit   |
| Context   | 32K        |
| Run date  | 2026-05-04 |
| Source    | community  |
Apple M4 Max with 64 GB unified memory, running MLX-LM 4-bit. Throughput is roughly 75% of the RTX 4090 Ollama Q4_K_M baseline at an estimated 3-5x lower power draw, so the unified-memory architecture wins on tokens-per-watt for this size class. Long-context (32K) throughput holds within 5% of short-context, which is meaningfully stronger than llama.cpp Metal at the same context length. vramUsageGb is left null because Apple Silicon unified memory has no separate VRAM pool. Pulled from community runs verified against the published MLX-LM benchmark scripts.
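For readers who want to sanity-check the tok/s figure themselves, here is a minimal measurement sketch using the MLX-LM Python API. The model repo and prompt below are assumptions, since this page does not name the exact conversion used, and API details may differ across MLX-LM versions. With `verbose=True`, MLX-LM prints prompt and generation tokens-per-second alongside the output.

```python
import time

from mlx_lm import load, generate

# Assumption: a community 4-bit MLX conversion of Llama 3.1 8B Instruct;
# the exact repo behind this benchmark run is not stated on this page.
MODEL = "mlx-community/Meta-Llama-3.1-8B-Instruct-4bit"

model, tokenizer = load(MODEL)

prompt = "Summarize the trade-offs of unified memory for local LLM inference."

# verbose=True makes MLX-LM report prompt and generation tokens-per-second;
# the manual timer is only a coarse end-to-end cross-check.
start = time.perf_counter()
generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
print(f"wall time: {time.perf_counter() - start:.2f} s")
```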
Why this confidence tier?
Confidence is rule-based: every factor below contributed to the tier. We never expose a single numeric score; the tier label is auditable through this explanation alone. (A hypothetical sketch of this kind of rule set follows the list.)
- Source: community submission
- Reproduce this benchmark: an independent reproduction with matching numbers lifts the tier and reduces single-source risk.
- Read the confidence methodology: the full editorial standards for tiering.
- Why we don't use percentages: tier labels are auditable; an opaque score is not.
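Purely as an illustration of what "rule-based, no numeric score" means; the site's actual factor set and tier labels are not published on this page, so everything named below is hypothetical.

```python
# Hypothetical sketch of rule-based tiering; factor names and tier labels
# are invented for illustration and are not RunLocalAI's actual rules.
def confidence_tier(source: str, reproductions: int) -> str:
    # Each rule is individually auditable; no numeric score is ever computed.
    if source == "editorial" and reproductions >= 1:
        return "high"
    if reproductions >= 1:
        return "medium"  # an independent reproduction lifts the tier
    return "low"         # a single unreproduced source

# e.g. a single community submission with no reproductions yet
print(confidence_tier("community", 0))
```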
Cohort intelligence
How this measurement compares to the rest of the corpus. Only comparable rows are used: same model + hardware first, with relaxations labelled. We never average across runtimes or quant formats unless explicitly told to; a sketch of this grouping rule follows the tables below.
Same model + hardware, different runtime

1 matching row. Variance here is pure runtime / version drift; a wide spread suggests a runtime regression candidate worth investigating.

| tok/s | Hardware     | Quant    | Source    |
|------:|--------------|----------|-----------|
| 78.5  | apple-m4-max | MLX-4bit | Editorial |
Same model, different hardware

7 matching rows. What this model looks like on adjacent hardware; this drives the "should I upgrade?" question.

| tok/s | Hardware     | Quant  | Source    | Notes                  |
|------:|--------------|--------|-----------|------------------------|
| 86.4  | rx-7900-xtx  | Q4_K_M | Editorial |                        |
| 55.0  | apple-m3-max | Q4_K_M | Editorial |                        |
| 105.0 | rtx-3090     | Q4_K_M | Editorial |                        |
| 118.2 | rtx-5080     | Q4_K_M | Editorial |                        |
| 132.2 | rtx-5080     | Q4_K_M | Editorial | ollama version 0.23.2  |

+2 more
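To make the comparability rule concrete, here is a hypothetical grouping function; the row schema and field names are our assumptions for illustration, not the site's actual data model.

```python
from dataclasses import dataclass

# Hypothetical row schema; field names are assumptions for illustration.
@dataclass
class Row:
    model: str
    hardware: str
    runtime: str
    quant: str
    tok_s: float

def cohorts(rows: list[Row], ref: Row) -> dict[str, list[Row]]:
    """Group comparable rows, strictest match first, relaxations labelled.

    Rows from different runtimes or quant formats are listed side by side,
    never averaged together.
    """
    return {
        "same model + hardware, different runtime": [
            r for r in rows
            if r is not ref and r.model == ref.model and r.hardware == ref.hardware
        ],
        "same model, different hardware": [
            r for r in rows
            if r.model == ref.model and r.hardware != ref.hardware
        ],
    }
```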
Reproduce this benchmark
Got the same model + hardware combo? Run the same measurement and submit your numbers. We'll pre-fill model, hardware, quant, and context; you add only your tok/s, VRAM, and runtime version. If your numbers match within ±15%, this benchmark gets a confidence lift and a reproduction badge.
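The ±15% match rule is simple enough to state in code; the function name below is ours, not the site's.

```python
def matches_within_15_percent(reference: float, reproduction: float) -> bool:
    # True when the reproduced value is within ±15% of the reference value.
    return abs(reproduction - reference) <= 0.15 * reference

# Example: reproducing this page's 78.5 tok/s with a run of 72.0 tok/s
print(matches_within_15_percent(78.5, 72.0))  # True: 6.5 <= 11.775
```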
Related
Drill into the entity pages for this measurement.
Cite or export
Reference this benchmark in your work or paste it into a README. Copy-to-clipboard in multiple formats; the license is CC-BY-4.0, with attribution to RunLocalAI required.
<a href="https://runlocalai.co/benchmarks/331" rel="noopener">RunLocalAI: Llama 3.1 8B Instruct on Apple M4 Max — 78.5 tok/s</a>
Next recommended step
Got the same model + hardware? Run it and submit your numbers; successful reproductions lift this benchmark's confidence tier.