Llama 3.1 8B Instruct on Apple M4 Max
Measured on 2026-05-04.
Measurement

| Metric    | Value      |
|-----------|------------|
| tok/s     | 78.5       |
| TTFT      | 92 ms      |
| VRAM used | n/a        |
| RAM used  | 4.5 GB     |
| Power     | 28 W       |
| Quant     | MLX-4bit   |
| Context   | 32K        |
| Run date  | 2026-05-04 |
| Source    | community  |
Apple M4 Max with 64 GB unified memory, running MLX-LM 4-bit. Throughput is roughly 75% of the RTX 4090 Ollama Q4_K_M baseline at an estimated 3-5x lower power draw, so the unified-memory architecture wins on tokens-per-watt for this size class. Long-context (32K) throughput holds within 5% of short-context, which is meaningfully stronger than llama.cpp Metal at the same context length. vramUsageGb is left null because Apple Silicon unified memory has no separate VRAM pool. Pulled from community runs verified against the published MLX-LM benchmark scripts.
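For readers who want to sanity-check the tok/s figure themselves, here is a minimal measurement sketch using the MLX-LM Python API. The model repo and prompt below are assumptions, since this page does not name the exact conversion used, and API details may differ across MLX-LM versions. With `verbose=True`, MLX-LM prints prompt and generation tokens-per-second alongside the output.

```python
import time

from mlx_lm import load, generate

# Assumption: a community 4-bit MLX conversion of Llama 3.1 8B Instruct;
# the exact repo behind this benchmark run is not stated on this page.
MODEL = "mlx-community/Meta-Llama-3.1-8B-Instruct-4bit"

model, tokenizer = load(MODEL)

prompt = "Summarize the trade-offs of unified memory for local LLM inference."

# verbose=True makes MLX-LM report prompt and generation tokens-per-second;
# the manual timer is only a coarse end-to-end cross-check.
start = time.perf_counter()
generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
print(f"wall time: {time.perf_counter() - start:.2f} s")
```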
Why this confidence tier?
Confidence is rule-based: every factor below contributed to the tier. We never expose a single numeric score; the tier label is auditable through this explanation alone. (A hypothetical sketch of this kind of rule set follows the list.)
- Source: community submission
- Reproduce this benchmark: an independent reproduction with matching numbers lifts the tier and reduces single-source risk.
- Read the confidence methodology: the full editorial standards for tiering.
- Why we don't use percentages: tier labels are auditable; an opaque score is not.
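Purely as an illustration of what "rule-based, no numeric score" means; the site's actual factor set and tier labels are not published on this page, so everything named below is hypothetical.

```python
# Hypothetical sketch of rule-based tiering; factor names and tier labels
# are invented for illustration and are not RunLocalAI's actual rules.
def confidence_tier(source: str, reproductions: int) -> str:
    # Each rule is individually auditable; no numeric score is ever computed.
    if source == "editorial" and reproductions >= 1:
        return "high"
    if reproductions >= 1:
        return "medium"  # an independent reproduction lifts the tier
    return "low"         # a single unreproduced source

# e.g. a single community submission with no reproductions yet
print(confidence_tier("community", 0))
```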
Cohort intelligence
How this measurement compares to the rest of the corpus. Only comparable rows are used: same model + hardware first, with relaxations labelled. We never average across runtimes or quant formats unless explicitly told to; a sketch of this grouping rule follows the tables below.
Same model + hardware, different runtime

1 matching row. Variance here is pure runtime / version drift; a wide spread suggests a runtime regression candidate worth investigating.

| tok/s | Hardware     | Quant    | Source    |
|------:|--------------|----------|-----------|
| 78.5  | apple-m4-max | MLX-4bit | Editorial |
Same model, different hardware

7 matching rows. What this model looks like on adjacent hardware; this drives the "should I upgrade?" question.

| tok/s | Hardware     | Quant  | Source    | Notes                  |
|------:|--------------|--------|-----------|------------------------|
| 86.4  | rx-7900-xtx  | Q4_K_M | Editorial |                        |
| 55.0  | apple-m3-max | Q4_K_M | Editorial |                        |
| 105.0 | rtx-3090     | Q4_K_M | Editorial |                        |
| 118.2 | rtx-5080     | Q4_K_M | Editorial |                        |
| 132.2 | rtx-5080     | Q4_K_M | Editorial | ollama version 0.23.2  |

+2 more
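To make the comparability rule concrete, here is a hypothetical grouping function; the row schema and field names are our assumptions for illustration, not the site's actual data model.

```python
from dataclasses import dataclass

# Hypothetical row schema; field names are assumptions for illustration.
@dataclass
class Row:
    model: str
    hardware: str
    runtime: str
    quant: str
    tok_s: float

def cohorts(rows: list[Row], ref: Row) -> dict[str, list[Row]]:
    """Group comparable rows, strictest match first, relaxations labelled.

    Rows from different runtimes or quant formats are listed side by side,
    never averaged together.
    """
    return {
        "same model + hardware, different runtime": [
            r for r in rows
            if r is not ref and r.model == ref.model and r.hardware == ref.hardware
        ],
        "same model, different hardware": [
            r for r in rows
            if r.model == ref.model and r.hardware != ref.hardware
        ],
    }
```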
Reproduce this benchmark
Got the same model + hardware combo? Run the same measurement and submit your numbers. We'll pre-fill model, hardware, quant, and context; you add only your tok/s, VRAM, and runtime version. If your numbers match within ±15%, this benchmark gets a confidence lift and a reproduction badge.
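The ±15% match rule is simple enough to state in code; the function name below is ours, not the site's.

```python
def matches_within_15_percent(reference: float, reproduction: float) -> bool:
    # True when the reproduced value is within ±15% of the reference value.
    return abs(reproduction - reference) <= 0.15 * reference

# Example: reproducing this page's 78.5 tok/s with a run of 72.0 tok/s
print(matches_within_15_percent(78.5, 72.0))  # True: 6.5 <= 11.775
```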
Related
Drill into the entity pages for this measurement.
Cite or export
Reference this benchmark in your work or paste it into a README. Copy-to-clipboard in multiple formats; the license is CC-BY-4.0, with attribution to RunLocalAI required.
<a href="https://runlocalai.co/benchmarks/331" rel="noopener">RunLocalAI: Llama 3.1 8B Instruct on Apple M4 Max — 78.5 tok/s</a>
Next recommended step
Got the same model + hardware? Run it and submit your numbers; successful reproductions lift this benchmark's confidence tier.