05. MLX Framework
MLX is Apple's machine learning framework, designed from the ground up for Apple Silicon. llama.cpp began as a portable C/C++ CPU implementation and gained its Metal backend later, and Ollama builds on llama.cpp. MLX, by contrast, was built around Apple hardware from the start. Its arrays live in unified memory, so an operation can run on the CPU or the GPU without copying data between them.
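MLX models are not distributed as GGUF files. An MLX model is typically a directory (often a Hugging Face repository under the mlx-community organization) containing safetensors weight files plus config and tokenizer files; there is no single-file .mlx format. The quantization scheme also differs from GGUF: MLX uses group-wise quantization at bit widths including 2-bit, 4-bit, 6-bit, and 8-bit.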
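To make the bit widths concrete, the sketch below estimates weight storage at each width. It assumes each group of 64 weights carries one 16-bit scale and one 16-bit bias, which is our understanding of the MLX defaults rather than something stated here. The helper names are hypothetical and not part of MLX, and the estimate ignores any layers left unquantized.

```python
def bits_per_weight(bits: int, group_size: int = 64, meta_bits: int = 32) -> float:
    """Effective bits per weight, counting per-group scale and bias.

    Assumption: each group of `group_size` weights stores a 16-bit scale
    and a 16-bit bias (meta_bits = 32 in total).
    """
    return bits + meta_bits / group_size


def estimate_weight_gib(n_params: float, bits: int, group_size: int = 64) -> float:
    """Approximate weight storage in GiB for a fully quantized model."""
    total_bits = n_params * bits_per_weight(bits, group_size)
    return total_bits / 8 / 2**30


if __name__ == "__main__":
    n_params = 3e9  # a 3B-parameter model, chosen only for illustration
    for bits in (2, 4, 6, 8):
        print(f"{bits}-bit: ~{estimate_weight_gib(n_params, bits):.2f} GiB")
```

Under these assumptions a 4-bit model costs about 4.5 bits per weight, which is why quantized checkpoints are slightly larger than the nominal bit width suggests.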
Install MLX and run a model:
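# Using mlx-lm Python package
pip install mlx-lm
# List available models (if your mlx-lm version lacks mlx_lm.ls, mlx_lm.manage --scan lists models in the local Hugging Face cache)
mlx_lm.ls
# Generate with a model
python3 -c "
from mlx_lm import load, generate
# Downloads the model from Hugging Face on first use
model, tokenizer = load('mlx-community/Llama-3.2-3B-Instruct-4bit')
# generate is a module-level function that takes the model and tokenizer
response = generate(model, tokenizer, prompt='Explain GPU memory bandwidth on Apple Silicon.', max_tokens=256)
print(response)
"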
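The key advantage of MLX on Apple Silicon is throughput: in favorable cases, up to 2–4× higher than equivalent GGUF models running through llama.cpp at the same parameter count. The size of the gap depends on the model, quantization, context length, and chip, so benchmark on your own hardware before relying on a number. Where MLX is faster, the gain comes from memory access patterns tuned for unified memory and from GPU kernels written in Metal for Apple hardware.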
MLX also supports LoRA fine-tuning through the mlx_lm.lora command. The memory footprint stays low because only the small adapter matrices receive gradients and optimizer state, and the frozen base model can itself be quantized (QLoRA).
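One way to see where the LoRA saving comes from is to count trainable parameters. The following sketch compares a dense weight matrix with its LoRA adapter pair. The layer shape and rank are illustrative values, not taken from any particular model, and the function names are hypothetical.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters in a dense weight matrix of shape (d_out, d_in)."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in a LoRA adapter: A is (rank, d_in), B is (d_out, rank)."""
    return rank * d_in + d_out * rank


if __name__ == "__main__":
    d_in = d_out = 3072  # hypothetical projection size
    rank = 8             # hypothetical LoRA rank
    full = full_params(d_in, d_out)
    lora = lora_params(d_in, d_out, rank)
    print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

Gradients and optimizer state scale with the trainable count, so training well under 1% of a layer's parameters shrinks that part of the memory budget by a similar factor.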
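Real failure mode: mlx_lm.ls fails with "command not found" or a Python import error, or it returns an empty list. The first two mean the package did not install correctly, or it installed into a different Python environment than the one on your PATH. Run pip show mlx-lm to check. If it is not installed, run pip install --upgrade mlx-lm and confirm the install completed without errors. An empty list from a command that otherwise runs more likely means no models have been downloaded yet.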
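Another failure: running an MLX model on a machine with insufficient RAM ends in a SIGKILL with no Python error message. The OS terminated the process because it exceeded memory limits. On Apple Silicon the model weights, the KV cache, and every other application share the same unified memory. You need a smaller model, a lower-bit quantization, or more RAM.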
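A rough pre-flight check can catch this before the OS does. The sketch below sums the weight files in a locally downloaded model directory and compares the total with physical RAM. It assumes the weights are stored as .safetensors files and that POSIX sysconf reports physical pages on your system. The function names and the 0.7 headroom factor are hypothetical choices for illustration.

```python
import os
import sys
from pathlib import Path


def weights_size_bytes(model_dir: str) -> int:
    """Total size of all .safetensors files under model_dir."""
    return sum(p.stat().st_size for p in Path(model_dir).rglob("*.safetensors"))


def physical_memory_bytes() -> int:
    """Total physical RAM via POSIX sysconf (assumed to work on macOS and Linux)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def likely_fits(model_dir: str, headroom: float = 0.7) -> bool:
    """True if the weights fit within `headroom` of physical RAM.

    The 0.7 default is an arbitrary margin for the KV cache, the OS,
    and other applications; tune it for your machine.
    """
    return weights_size_bytes(model_dir) <= headroom * physical_memory_bytes()


if __name__ == "__main__":
    # Pass the model directory as the first argument; defaults to the current directory.
    model_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    size_gib = weights_size_bytes(model_dir) / 2**30
    ram_gib = physical_memory_bytes() / 2**30
    print(f"weights: {size_gib:.2f} GiB, RAM: {ram_gib:.2f} GiB, fits: {likely_fits(model_dir)}")
```

The on-disk weight size is only a lower bound on memory use, so a model that barely passes this check can still be killed once the context grows.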
Install mlx-lm, run mlx_lm.ls to list available models, then generate 50 tokens using mlx-community/Qwen2.5-0.5B-Instruct-4bit. Time the generation and note tokens per second.
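For the timing step, a small wrapper like the one below is one possible approach. It is a sketch: time_generation and benchmark_mlx are hypothetical helper names, the token count is approximated by re-encoding the output text, and the measured time includes prompt processing as well as generation.

```python
import time
from typing import Callable, Tuple


def time_generation(
    generate_fn: Callable[[], str],
    count_tokens: Callable[[str], int],
) -> Tuple[str, float, float]:
    """Run generate_fn once and return (text, seconds, tokens_per_second)."""
    start = time.perf_counter()
    text = generate_fn()
    elapsed = time.perf_counter() - start
    n_tokens = count_tokens(text)
    tokens_per_second = n_tokens / elapsed if elapsed > 0 else float("inf")
    return text, elapsed, tokens_per_second


def benchmark_mlx(model_id: str, prompt: str, max_tokens: int = 50):
    """Time one mlx-lm generation. Requires mlx-lm on Apple Silicon."""
    # Imported lazily so time_generation stays usable without mlx-lm installed.
    from mlx_lm import load, generate

    model, tokenizer = load(model_id)  # load time is excluded from the measurement
    return time_generation(
        lambda: generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens),
        lambda text: len(tokenizer.encode(text)),  # approximate token count
    )
```

Calling benchmark_mlx('mlx-community/Qwen2.5-0.5B-Instruct-4bit', 'Explain unified memory.') returns the text, the elapsed seconds, and the tokens per second. Passing verbose=True to generate also prints mlx-lm's own timing statistics, which makes a useful cross-check.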