RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

Metal / Apple Silicon

MLX: Memory pressure detected — consider reducing batch size

Warning: Memory pressure detected. Consider reducing the batch size.
By Fredoline Eruo · Last verified Jun 12, 2026

Cause

Environment: Apple Silicon running mlx-lm batch generation, fine-tuning, or RAG embedding.

Severity: low to medium — not fatal, but throughput collapses when macOS starts swapping.

  • macOS detects unified-memory pressure (yellow / red in Activity Monitor)
  • MLX's allocator hasn't hit its hard limit yet, but the OS is preparing to swap
  • Background indexing (Spotlight, Time Machine) competing for pages
  • MLX's buffer cache holding on to memory from arrays that have already been released
  • Batch size × sequence length × hidden dim exceeds practical free memory
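
As a rough worked example of the last bullet, the sketch below estimates how the key/value cache alone grows with batch size and sequence length. The formula is the usual transformer KV-cache size (one K and one V tensor per layer). The model dimensions are illustrative assumptions, roughly an 8B-class model with grouped-query attention, not values read from any specific checkpoint.

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Size of the key/value cache: one K and one V tensor per layer."""
    return 2 * n_layers * batch * seq_len * n_kv_heads * head_dim * bytes_per_elem

# Assumed, illustrative dimensions (not taken from a real checkpoint).
DIMS = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for batch in (32, 8, 1):
    gib = kv_cache_bytes(batch, seq_len=4096, **DIMS) / 1024**3
    print(f"batch={batch:>2}  seq=4096  KV cache ~ {gib:.1f} GiB")
```

With these assumed dimensions each sequence costs 0.5 GiB at 4096 tokens in 16-bit precision, so dropping the batch from 32 to 8 frees about 12 GiB before model weights are counted.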

Solution

1. Reduce batch size first (most direct fix):

# Was: batch_size=32
mlx_lm.generate(model, tokenizer, prompts, batch_size=8)
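
If the prompts arrive as one long list, a small helper can keep every call at the reduced size regardless of which generation entry point you use. This is a plain-Python sketch: `chunked` is a hypothetical helper name, not part of mlx-lm, and the generation call itself is left as a placeholder.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

prompts = [f"prompt {i}" for i in range(20)]  # stand-in data for illustration

for batch in chunked(prompts, 8):
    # Replace this print with your own generation call for `batch`.
    print(len(batch), batch[0])
```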

2. Set MLX's GPU memory limit explicitly so the warning happens before swap kicks in:

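import mlx.core as mx
# Cap at 75% of physical RAM (e.g. 24 GB on a 32 GB Mac)
# MLX releases that predate the top-level memory API expose these under mx.metal instead
mx.set_memory_limit(int(0.75 * 32 * 1024**3))
mx.set_cache_limit(0)  # disable the buffer cache: more room for tensors, slower allocations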
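
To avoid hard-coding the 32 GB figure, the limit can be derived from the machine's physical RAM. The sketch below is one way to do that, assuming POSIX `sysconf` reports page count and page size (it does on macOS and Linux). `memory_limit_bytes` and `physical_ram_bytes` are hypothetical helper names, and the MLX call is skipped when MLX is not installed.

```python
import os

def memory_limit_bytes(total_bytes, fraction=0.75):
    """Return the byte budget for a given fraction of physical RAM."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return int(total_bytes * fraction)

def physical_ram_bytes():
    """Physical RAM via POSIX sysconf (assumed available on this platform)."""
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")

limit = memory_limit_bytes(physical_ram_bytes())
print(f"limit: {limit / 1024**3:.1f} GiB")

try:
    import mlx.core as mx       # only present where MLX is installed
    mx.set_memory_limit(limit)  # older MLX releases: mx.metal.set_memory_limit
except ImportError:
    pass                        # nothing to configure without MLX
```

The 0.75 default mirrors the fraction used above. On a machine that also runs a browser or an IDE, a lower fraction may be the safer starting point.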

3. Free the cache after each batch:

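import gc
import mlx.core as mx
import mlx_lm
for batch in batches:
    out = mlx_lm.generate(model, tokenizer, batch, ...)
    gc.collect()      # drop unreachable Python objects that still hold arrays
    mx.clear_cache()  # then return cached buffers to the OS (mx.metal.clear_cache on older MLX)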
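
To confirm that clearing the cache actually helps, it can be useful to log MLX's own counters after each batch. The sketch below assumes a recent MLX release where `get_active_memory`, `get_cache_memory`, and `get_peak_memory` live at the top level of `mlx.core` (older releases keep them under `mx.metal`). `memory_report` is a hypothetical helper, and the fallback numbers are made up purely to show the output format.

```python
def fmt_gib(n_bytes):
    """Format a byte count as GiB with two decimals."""
    return f"{n_bytes / 1024**3:.2f} GiB"

def memory_report(active, cache, peak):
    """Build one log line from three byte counts."""
    return f"active={fmt_gib(active)} cache={fmt_gib(cache)} peak={fmt_gib(peak)}"

try:
    import mlx.core as mx
    print(memory_report(mx.get_active_memory(),
                        mx.get_cache_memory(),
                        mx.get_peak_memory()))
except ImportError:
    # No MLX on this machine: print illustrative, made-up numbers instead.
    print(memory_report(6 * 1024**3, 2 * 1024**3, 9 * 1024**3))
```

If the cache figure stays near zero between batches while the active figure keeps climbing, the growth is coming from live arrays rather than the cache, and a smaller batch is the more likely fix.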

4. Watch macOS pressure live:

vm_stat 1   # Watch the free / inactive / wired columns, plus swapins / swapouts
# Or: open Activity Monitor → Memory → Memory Pressure graph
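
For a long unattended job, the same numbers can be read from a script. The sketch below parses the one-shot output of `vm_stat` (run without an interval, it prints one `label: count.` line per counter). `SAMPLE` is an abbreviated stand-in with made-up counts, and `parse_vm_stat` is a hypothetical helper name.

```python
import re

# Abbreviated, made-up sample of one-shot `vm_stat` output.
SAMPLE = """Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                                3000.
Pages active:                            400000.
Pages inactive:                          350000.
Pages wired down:                        150000.
Swapins:                                   1200.
Swapouts:                                  5600.
"""

def parse_vm_stat(text):
    """Return (page_size, {label: count}) from one-shot `vm_stat` output."""
    page_size = int(re.search(r"page size of (\d+) bytes", text).group(1))
    counts = {m.group(1): int(m.group(2))
              for m in re.finditer(r"^(.+?):\s+(\d+)\.$", text, re.MULTILINE)}
    return page_size, counts

page_size, counts = parse_vm_stat(SAMPLE)
free_mib = counts["Pages free"] * page_size / 1024**2
print(f"free: {free_mib:.0f} MiB, swapouts so far: {counts['Swapouts']}")
```

On a Mac, replace `SAMPLE` with `subprocess.run(["vm_stat"], capture_output=True, text=True).stdout`. A swapouts count that rises between two readings means the job is already paging.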

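5. Keep long jobs from being interrupted. Neither command reduces swapping: kern.maxvnodes only raises the kernel's vnode (file-cache) ceiling, and caffeinate only stops the Mac from sleeping mid-run:

sudo sysctl -w kern.maxvnodes=750000   # optional; vnode limit, unrelated to swap
caffeinate -dimsu mlx_lm.generate ...  # prevent display, disk, and system sleep while the job runs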

6. Bigger picture: swap on Apple Silicon lives on a fast SSD, but it is still 10-50× slower than RAM. Once inference starts swapping, throughput collapses. Size the workload so the Memory Pressure graph stays green.
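
To get a feel for why even a little swapping hurts, here is a deliberately simple two-tier model: a fraction of memory accesses pays the swap penalty and the rest run at RAM speed. This is a toy approximation, not a measurement. The 10% swap fraction is an assumed value, and the penalties are the 10× and 50× bounds quoted above.

```python
def effective_slowdown(swap_fraction, swap_penalty):
    """Average cost per access relative to an all-RAM run (toy two-tier model)."""
    if not 0 <= swap_fraction <= 1:
        raise ValueError("swap_fraction must be in [0, 1]")
    return (1 - swap_fraction) + swap_fraction * swap_penalty

for penalty in (10, 50):
    slowdown = effective_slowdown(0.10, penalty)
    print(f"penalty {penalty}x, 10% of accesses swapped -> {slowdown:.1f}x slower")
```

Under this model, having just 10% of accesses hit swap costs roughly 2× to 6× in throughput, which is why reducing the working set tends to pay off more than tuning anything else.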

Alternative solutions

On a 16 GB Mac, treat the warning as fatal — swap will dominate and effective tok/s drops below CPU-only inference. Move the workload to a Mac with ≥ 32 GB unified memory, or to a Linux box with a discrete GPU.

Related errors

  • MLX / Metal: command buffer execution failed
  • Apple Silicon: RuntimeError: MPS backend out of memory
  • Metal Allocator: out of memory on Apple Silicon
  • Metal allocation failed — Apple Silicon OOM under unified memory pressure
