MLX: Memory pressure detected — consider reducing batch size — fix and explanation

Cause

Environment: Apple Silicon running mlx-lm batch generation, fine-tuning, or RAG embedding.

Severity: low to medium — not fatal, but throughput collapses when macOS starts swapping.

- macOS reports unified-memory pressure (yellow or red in Activity Monitor).
- MLX's allocator has not reached its hard limit yet, but the OS is preparing to swap.
- Background indexing (Spotlight, Time Machine) is competing for pages.
- MLX's buffer cache is holding memory that has not been released back to the system.
- Batch size × sequence length × hidden dim exceeds the memory that is practically free.
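
To see why the last item tends to dominate, a back-of-the-envelope estimate helps. The sketch below assumes a standard transformer whose key/value cache stores two tensors per layer, each of shape batch × sequence length × hidden dim, with no grouped-query attention. The function name `kv_cache_bytes` and the example model dimensions are illustrative assumptions, not values reported by MLX.

```python
def kv_cache_bytes(batch_size, seq_len, hidden_dim, num_layers, bytes_per_value=2):
    """Rough KV-cache size: 2 tensors (K and V) per layer, fp16 by default."""
    return 2 * num_layers * batch_size * seq_len * hidden_dim * bytes_per_value

# Hypothetical 32-layer model with hidden_dim=4096 and 4096-token sequences.
for batch_size in (32, 8):
    gib = kv_cache_bytes(batch_size, 4096, 4096, 32) / 1024**3
    print(f"batch_size={batch_size}: ~{gib:.0f} GiB of KV cache")
```

Under these assumptions, dropping the batch size from 32 to 8 cuts the estimate from about 64 GiB to about 16 GiB, before counting model weights or activations.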

Solution

1. Reduce batch size first (most direct fix):

import mlx_lm
# Was: batch_size=32 (support for this keyword depends on your mlx-lm version)
mlx_lm.generate(model, tokenizer, prompts, batch_size=8)
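
If the prompts already sit in one large list, one way to apply a smaller batch size without touching the rest of the pipeline is to slice the list before it reaches the model. This is a minimal sketch in plain Python; `chunked` is a hypothetical helper, not part of mlx-lm, and the prompts are placeholders.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

prompts = [f"prompt {i}" for i in range(20)]   # placeholder prompts
batches = list(chunked(prompts, 8))
print([len(b) for b in batches])               # [8, 8, 4]
```

Each element of `batches` can then be passed to the generation call in place of the full prompt list.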

2. Set MLX's GPU memory limit explicitly so the warning happens before swap kicks in:

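import mlx.core as mx

# Cap at 75% of physical RAM (24 GB on a 32 GB Mac).
# Newer MLX releases expose these at the top level (mx.set_memory_limit,
# mx.set_cache_limit) and deprecate the mx.metal variants; check your version.
mx.metal.set_memory_limit(int(0.75 * 32 * 1024**3))
mx.metal.set_cache_limit(0)  # disable the buffer cache so freed memory is returned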
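
Rather than hard-coding 32 GB, the cap can be derived from the installed memory. The sketch below is one possible approach using only the standard library. `memory_limit_bytes` and `physical_memory_bytes` are hypothetical helper names, and the `os.sysconf` keys are assumed to be available on the platform. The resulting value would be passed to the `set_memory_limit` call shown above.

```python
import os

def memory_limit_bytes(total_bytes, fraction=0.75):
    """Return `fraction` of `total_bytes` as an integer byte count."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return int(total_bytes * fraction)

def physical_memory_bytes():
    """Installed RAM according to POSIX sysconf (page size times page count)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")

limit = memory_limit_bytes(physical_memory_bytes())
print(f"memory limit: {limit / 1024**3:.1f} GiB")
```

Keeping the fraction as a parameter makes it easy to leave more headroom on machines that also run other memory-hungry applications.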

3. Free the cache after each batch:

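import gc
import mlx.core as mx
import mlx_lm

for batch in batches:
    out = mlx_lm.generate(model, tokenizer, batch, ...)
    mx.metal.clear_cache()  # release MLX's cached buffers
    gc.collect()            # drop unreachable Python objects that still hold arrays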

4. Watch macOS pressure live:

vm_stat 1   # Pages free / inactive / wired columns
# Or: open Activity Monitor → Memory → Memory Pressure graph
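
For logging inside a long-running job, the same counters can be read programmatically. The sketch below parses the one-shot output of `vm_stat` (run without an interval argument) and converts page counts to bytes. `parse_vm_stat` is a hypothetical helper, and the sample text is illustrative rather than captured from a real machine.

```python
import re

def parse_vm_stat(text):
    """Convert `vm_stat` page counts to bytes, keyed by counter name."""
    page_size = int(re.search(r"page size of (\d+) bytes", text).group(1))
    counters = {}
    for name, pages in re.findall(r"^(Pages [^:]+):\s+(\d+)\.", text, re.MULTILINE):
        counters[name] = int(pages) * page_size
    return counters

# Illustrative sample; on a Mac, pass the output of
# subprocess.run(["vm_stat"], capture_output=True, text=True).stdout instead.
sample = """Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                               10000.
Pages active:                            500000.
Pages inactive:                          400000.
Pages wired down:                        150000.
"""
stats = parse_vm_stat(sample)
print(f"free: {stats['Pages free'] / 1024**3:.2f} GiB")
```

Logging the free and wired figures between batches gives an early signal that the batch size is too large, before the pressure graph turns yellow.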

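5. Keep the machine awake for long jobs (these commands do not disable swapping):

sudo sysctl -w kern.maxvnodes=750000   # raises the kernel vnode limit; unrelated to swap
caffeinate -dimsu mlx_lm.generate ...  # prevents sleep while the job runs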

6. Bigger picture: swap on Apple Silicon is backed by a fast SSD, but it is still roughly 10-50× slower than RAM. Once inference starts swapping, throughput collapses. Size the workload so that memory pressure stays green.

Alternative solutions

On a 16 GB Mac, treat the warning as fatal: swap will dominate, and effective tok/s can drop below CPU-only inference. Move the workload to a Mac with ≥ 32 GB of unified memory, or to a Linux box with a discrete GPU.
