LM Studio generation much slower than expected
Environment: LM Studio on Windows / macOS / Linux desktop with discrete GPU.
Severity: low (works, just slow).
Cause
- GPU offload layers slider too low (default sometimes auto-detects conservatively, leaving most layers on CPU)
- KV cache type set to F16 when Q4_0/Q8_0 would fit better in VRAM and run faster
- Context size set higher than VRAM allows, forcing CPU spillover
- Background apps (Chrome, Steam, Discord) holding VRAM
- LM Studio running CPU-only build because GPU runtime didn't initialize on first launch
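To see why the KV cache type and context size causes above matter, a rough size estimate helps. The sketch below assumes a Llama-3.1-8B-style architecture (32 layers, 8 KV heads, head dimension 128); those numbers and the function name `kv_cache_gib` are illustrative assumptions, not values read from LM Studio. The per-element sizes follow the llama.cpp block layouts for Q8_0 and Q4_0.

```python
# Rough KV cache size estimate. The architecture defaults are assumptions
# for a Llama-3.1-8B-style model; check the model card for your own model.

# Approximate bytes per cached element for each cache type.
# Q8_0 packs 32 values into 34 bytes, Q4_0 packs 32 values into 18 bytes.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(context_len, cache_type, n_layers=32, n_kv_heads=8, head_dim=128):
    """Return the estimated KV cache size in GiB (K and V together)."""
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len  # 2 = K + V
    return elements * BYTES_PER_ELEMENT[cache_type] / 2**30


if __name__ == "__main__":
    for ctx in (4096, 8192, 131072):
        row = ", ".join(
            f"{name}: {kv_cache_gib(ctx, name):.2f} GiB" for name in BYTES_PER_ELEMENT
        )
        print(f"context {ctx:>6}: {row}")
```

Under these assumptions, a 128K context at F16 needs about 16 GiB for the cache alone, while an 8192-token context needs about 1 GiB, which is why an oversized context can push layers off the GPU.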
Solution
1. Push GPU offload layers to max. In the chat side panel → "Hardware Settings":
- GPU Offload: drag to Max (e.g. 33/33 for Llama 3.1 8B)
- Watch the VRAM meter; if it spikes red, drop one or two layers
2. Set KV cache to a quantized type (F16 → Q8_0 cuts cache memory by roughly 50% at about a 5% quality cost):
- "KV Cache Quant Type" → Q8_0 for both K and V
- For aggressive memory pressure: Q4_0
3. Lower context size to what you actually use:
- Context Length slider: 4096 or 8192 instead of the model's 128K max
4. Verify the GPU is actually being used:
# While generating, in another terminal
nvidia-smi -l 1 # NVIDIA; on AMD (Linux) try: watch -n 1 rocm-smi
GPU utilization should be 70-100%; if 0%, LM Studio is running CPU-only — toggle "Hardware Settings → GPU type" to your card and reload the model.
5. Close VRAM hogs. Chrome (1-3 GB), other AI apps, Discord overlay. Verify with nvidia-smi before reloading the model.
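The checks in steps 4 and 5 can also be scripted. The following is a minimal sketch for NVIDIA cards only, using the `nvidia-smi --query-gpu` CSV output; the helper names `parse_gpu_line` and `read_gpus` are hypothetical, and the parser assumes every field is a plain integer (some GPUs report `[N/A]` for certain fields, which this sketch does not handle).

```python
import shutil
import subprocess

# Fields requested from nvidia-smi: utilization in percent, memory in MiB.
QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_line(line):
    """Parse one CSV line such as '87, 6144, 8192' into a dict."""
    util, used, total = (int(field.strip()) for field in line.split(","))
    return {"util_pct": util, "used_mib": used, "total_mib": total}


def read_gpus():
    """Return one dict per NVIDIA GPU, or an empty list if nvidia-smi is missing."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]


if __name__ == "__main__":
    for index, gpu in enumerate(read_gpus()):
        free_mib = gpu["total_mib"] - gpu["used_mib"]
        print(f"GPU {index}: {gpu['util_pct']}% busy, {free_mib} MiB VRAM free")
```

Run it once before loading the model to see how much VRAM is free, and again during generation: utilization near 0% suggests the CPU-only case described in step 4.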
Alternative solutions
Caveat: on Apple Silicon, LM Studio uses Metal automatically; there's no GPU offload slider. Instead, switch to a more heavily quantized KV cache type or pick a smaller model quant. Apple's tok/s ceiling is set by memory bandwidth (e.g. M2 Pro ≈ 200 GB/s vs M3 Max ≈ 400 GB/s).
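The bandwidth ceiling can be approximated with simple arithmetic: if generating each token requires reading all model weights from memory once, tokens per second cannot exceed bandwidth divided by model size. The sketch below uses that simplification; the 4.9 GB model size (roughly an 8B model at a 4-bit quant) and the function name are assumptions for illustration, and real throughput will be lower.

```python
def max_tokens_per_second(bandwidth_gb_s, model_size_gb):
    """Upper bound on decode speed if every token reads all weights once."""
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    model_size_gb = 4.9  # assumed: about an 8B model at a 4-bit quant
    for chip, bandwidth_gb_s in (("M2 Pro", 200), ("M3 Max", 400)):
        bound = max_tokens_per_second(bandwidth_gb_s, model_size_gb)
        print(f"{chip}: at most ~{bound:.0f} tok/s")
```

This is why a smaller quant helps on Apple Silicon: halving the bytes read per token roughly doubles the ceiling.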
Did this fix it?
If your case was different, contact support with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.