RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Troubleshooting
  4. /PyTorch MPS falling back to CPU on Apple Silicon
degrades✓Editorial·Reviewed May 2026

MPS fallback to CPU — get Apple Silicon GPU usage back

PyTorch on Apple Silicon silently falls back to CPU when an op isn't supported by MPS. Set PYTORCH_ENABLE_MPS_FALLBACK=1 to make it audible, then fix the actual op (cast dtype, disable flash-attention, lower batch).

PyTorchApple Silicon (M1-M4)MetalComfyUI on MacHugging Face Transformers
By Fredoline Eruo · Last verified 2026-05-08

Diagnostic order — most likely first

#1

Operation not supported by MPS backend (most common)

Diagnose

Performance is 5-20x slower than expected. Set `PYTORCH_ENABLE_MPS_FALLBACK=1` and re-run — you'll see warnings like 'The operator aten::xyz is not currently implemented for MPS, falling back to CPU.'

Fix

Either accept the fallback (the env var allows it) or restructure your code to avoid the unsupported op. Common offenders: complex64 ops, certain reduction ops, custom CUDA kernels.

#2

PyTorch version too old for current macOS

Diagnose

`python -c 'import torch; print(torch.__version__)'` shows < 2.5. Newer macOS (14+, 15) needs PyTorch 2.5+ for full MPS coverage.

Fix

Upgrade: `pip install --upgrade torch torchvision`. Newer PyTorch ships better MPS op coverage every release.

#3

Tensor dtype unsupported on MPS

Diagnose

Error mentions complex64, bfloat16 in some ops, or specific quantization types. MPS supports float32, float16, int*, but coverage is uneven.

Fix

Cast tensors to a supported dtype before the failing op: `tensor.to(torch.float32)`. For models that demand bf16, fall back to CPU for that specific op via PYTORCH_ENABLE_MPS_FALLBACK=1.

#4

Model architecture using flash-attention (NVIDIA-only)

Diagnose

Error mentions FlashAttention or 'no kernel image' on MPS. The model's config tries to load flash-attn-2 which doesn't exist on Apple Silicon.

Fix

In Transformers: pass `attn_implementation='eager'` or `attn_implementation='sdpa'` when loading. SDPA on MPS is reasonably fast; eager is the safe fallback.

#5

Memory pressure forcing macOS to evict

Diagnose

`top` or Activity Monitor shows yellow/red memory pressure. Sudden 10x slowdown mid-inference — macOS is paging tensors to disk.

Fix

Close memory-heavy apps (browsers with hardware accel, Photoshop, Final Cut). Lower batch size. For 70B+ workloads, more unified memory is the only real answer.

Best Mac for local AI →
If this keeps happening — the next decision is hardware

MPS fallback warnings are usually a code-path issue, but if they happen on every workload your Mac is undersized. The guide below frames the upgrade path.

  • best budget Mac for local AI

Frequently asked questions

Is MPS as fast as CUDA?

No, but closer than people assume on inference. M4 Max ~546 GB/s vs RTX 4090 ~1008 GB/s — bandwidth gap is real. For LLM decode, M4 Max hits ~70-90% of 4090 throughput. For prefill (compute-bound), 30-50% slower. For training with PyTorch, MPS lags more.

Should I use MLX instead of PyTorch on Mac?

For Apple-native fine-tuning + research, yes — MLX is faster and built for Apple Silicon. For PyTorch-ecosystem code (Hugging Face Transformers, diffusers, anything that imports torch), stick with PyTorch + MPS. The translation cost isn't worth it for inference-only workflows.

Why does my Mac feel slow during inference?

Three usual causes: (1) Memory pressure forcing OS to swap (close apps), (2) MPS fallback to CPU on unsupported ops (set the env var to surface them), (3) thermal throttling on MacBooks under sustained load (use a cooling pad or run on power, not battery).

Related troubleshooting

llama.cpp Metal: GGML_ASSERT / mtl_buffer crash on macOS

Most Metal crashes in llama.cpp on Apple Silicon trace to too-aggressive context size, an old GGUF format, or a model whose tensor shape Metal can't kernel. Diagnostic + fix order.

Model keeps crashing / segfault during inference

Mid-inference crashes (segfault, illegal memory access, kernel panic) usually mean VRAM ECC, thermal throttling, PSU instability, or a bad model file. Here's the diagnostic order.

torch.cuda.is_available() returns False

PyTorch falsely reporting no CUDA is the most common Python ML setup failure. The cause is almost always: wrong PyTorch wheel for your CUDA version, or a CPU-only build accidentally installed.

When the fix is hardware

A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time:

  • Best GPU for local AI
  • Best laptop for local AI
  • Best Mac for local AI

Where next?

All troubleshooting guides
OrBest GPU for local AIWill it run on my hardware?