RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

Configuration

LM Studio generation much slower than expected

(no error message; tok/s reads, for example, 4 on hardware that should reach about 40)
By Fredoline Eruo · Last verified Jun 12, 2026

Cause

Environment: LM Studio on Windows / macOS / Linux desktop with discrete GPU.

Severity: low — works, just slow.

  • GPU offload layers slider too low (default sometimes auto-detects conservatively, leaving most layers on CPU)
  • KV cache type set to F16 when Q8_0 or Q4_0 would fit in VRAM and avoid spilling into system RAM
  • Context size set higher than VRAM allows, forcing CPU spillover
  • Background apps (Chrome, Steam, Discord) holding VRAM
  • LM Studio running CPU-only build because GPU runtime didn't initialize on first launch

Solution

1. Push GPU offload layers to max. In the chat side panel → "Hardware Settings":

  • GPU Offload: drag to Max (e.g. 33/33 for Llama 3.1 8B)
  • Watch the VRAM meter; if it spikes red, drop one or two layers
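
For a starting point before touching the slider, the number of layers that fit can be roughly estimated from the model file size and free VRAM. This is a minimal sketch, not LM Studio's own logic: `layers_that_fit` is a hypothetical helper, it assumes every layer is the same size, and the 1024 MiB reserve for KV cache and compute buffers is a guess to tune.

```shell
# Hypothetical helper: estimate how many layers fit in free VRAM.
# Assumes all layers are equally sized, which is only approximately true.
# usage: layers_that_fit MODEL_MIB N_LAYERS FREE_VRAM_MIB RESERVE_MIB
layers_that_fit() {
    usable=$(( $3 - $4 ))                      # VRAM left after the reserve
    if [ "$usable" -le 0 ]; then echo 0; return; fi
    fit=$(( usable * $2 / $1 ))                # usable VRAM / size per layer
    if [ "$fit" -gt "$2" ]; then fit=$2; fi    # cannot exceed the layer count
    echo "$fit"
}

# Illustrative numbers: a 4700 MiB model file with 33 layers, 3000 MiB free.
layers_that_fit 4700 33 3000 1024
```

With these illustrative inputs the estimate is 13 layers. Treat it as a first slider position and adjust by watching the VRAM meter as described above.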

2. Set the KV cache to a quantized type (F16 → Q8_0 cuts KV cache memory by roughly 50%, at a quality cost of about 5%):

  • "KV Cache Quant Type" → Q8_0 for both K and V
  • Under severe memory pressure: Q4_0 (smaller still, with a larger quality cost)

3. Lower context size to what you actually use:

  • Context Length slider: 4096 or 8192 instead of the model's 128K max
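
Steps 2 and 3 both shrink the KV cache, and the arithmetic shows why they matter. The sketch below estimates cache size from the model's shape. It assumes LM Studio's GGUF engine stores the cache in llama.cpp's block formats (64, 34 and 18 bytes per 32 values for F16, Q8_0 and Q4_0). `kv_cache_mib` is a hypothetical helper, and real usage adds compute buffers on top.

```shell
# Hypothetical helper: estimate KV cache size in MiB.
# usage: kv_cache_mib N_LAYERS N_KV_HEADS HEAD_DIM CONTEXT TYPE
kv_cache_mib() {
    # Bytes per 32-value block:
    # F16 = 32 * 2; Q8_0 = 32 + 2-byte scale; Q4_0 = 16 + 2-byte scale.
    case "$5" in
        f16)  block_bytes=64 ;;
        q8_0) block_bytes=34 ;;
        q4_0) block_bytes=18 ;;
        *)    echo "unknown type: $5" >&2; return 1 ;;
    esac
    # 2 tensors (K and V) per layer, one vector per KV head per token.
    echo $(( 2 * $1 * $2 * $3 * $4 * block_bytes / 32 / 1048576 ))
}

# Llama 3.1 8B: 32 layers, 8 KV heads, head dimension 128.
for ctx in 8192 131072; do
    for type in f16 q8_0 q4_0; do
        echo "ctx=$ctx type=$type kv_cache=$(kv_cache_mib 32 8 128 "$ctx" "$type") MiB"
    done
done
```

Under these assumptions, an F16 cache at the full 128K context is 16384 MiB, more than many consumer cards hold in total. At 8192 tokens it is 1024 MiB, and Q8_0 brings that down to 544 MiB.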

4. Verify the GPU is actually being used:

# While generating, run one of these in another terminal
nvidia-smi -l 1        # NVIDIA: refresh every second
watch -n 1 rocm-smi    # AMD on Linux: rerun rocm-smi every second

GPU utilization should be 70-100%; if 0%, LM Studio is running CPU-only — toggle "Hardware Settings → GPU type" to your card and reload the model.
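
If you prefer a compact reading over the full nvidia-smi table, the check can be scripted. This is a sketch for NVIDIA cards only, assuming GPU index 0. `classify_gpu_util` is a hypothetical helper, and its middle band ("partial") is an interpretation of the 70-100% guideline, not an LM Studio diagnostic.

```shell
# Hypothetical helper: classify one utilization reading (integer percent).
classify_gpu_util() {
    if [ "$1" -eq 0 ]; then
        echo "idle: likely CPU-only"
    elif [ "$1" -lt 70 ]; then
        echo "partial: some layers may be on CPU"
    else
        echo "ok: GPU is doing the work"
    fi
}

# Sample GPU 0 five times, one second apart, if nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    for i in 1 2 3 4 5; do
        util=$(nvidia-smi -i 0 --query-gpu=utilization.gpu \
            --format=csv,noheader,nounits | tr -d ' ')
        echo "${util}% $(classify_gpu_util "$util")"
        sleep 1
    done
fi
```

Run it while a response is streaming; readings taken between generations will show the GPU as idle even when offload is working.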

5. Close VRAM hogs: Chrome (1-3 GB), other AI apps, the Discord overlay. Check free VRAM with nvidia-smi before reloading the model.
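
The free-VRAM check can be reduced to one number. A minimal sketch for NVIDIA cards, assuming GPU index 0: `vram_headroom_mib` is a hypothetical helper, and the 5700 MiB requirement is an illustrative figure (model weights plus KV cache) that you should replace with your own.

```shell
# Hypothetical helper: free VRAM minus what the model needs, in MiB.
# A negative result is the shortfall.
# usage: vram_headroom_mib FREE_MIB NEEDED_MIB
vram_headroom_mib() {
    echo $(( $1 - $2 ))
}

if command -v nvidia-smi >/dev/null 2>&1; then
    free=$(nvidia-smi -i 0 --query-gpu=memory.free \
        --format=csv,noheader,nounits | tr -d ' ')
    # 5700 MiB is an illustrative requirement, not a measured value.
    echo "headroom: $(vram_headroom_mib "$free" 5700) MiB"
fi
```

Run it once before and once after closing background apps to see how much VRAM they were holding.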

Alternative solutions

Caveat: on Apple Silicon, LM Studio uses Metal automatically and there is no GPU offload slider; instead, switch to a more compact KV cache type or pick a smaller quant. On Apple hardware the tok/s ceiling is set by memory bandwidth (e.g. M2 Pro ≈ 200 GB/s vs M3 Max ≈ 300-400 GB/s, depending on the chip variant).
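
The bandwidth ceiling can be estimated with one division: for a dense model, generating a token reads every weight once, so tok/s cannot exceed memory bandwidth divided by model size. A rough sketch, where `tok_s_ceiling` is a hypothetical helper and the 5 GB model size is an illustrative figure for an 8B model at 4-bit quantization:

```shell
# Hypothetical helper: upper bound on tok/s for a dense model.
# Each generated token reads every weight once, so
# tok/s <= memory bandwidth / model size.
# usage: tok_s_ceiling BANDWIDTH_GB_S MODEL_GB
tok_s_ceiling() {
    awk -v bw="$1" -v size="$2" 'BEGIN { printf "%.1f\n", bw / size }'
}

tok_s_ceiling 200 5    # 200 GB/s (M2 Pro), ~5 GB model
tok_s_ceiling 400 5    # 400 GB/s (top M3 Max), ~5 GB model
```

These print 40.0 and 80.0. Real throughput lands below the ceiling because the KV cache is also read on every token and bandwidth is never fully utilized.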

Related errors

  • Ollama: Error: model 'X' not found
  • Ollama: bind: address already in use (port 11434)
  • Ollama: connection refused on localhost:11434
  • Ollama truncates input — default context length is only 2048
  • Token generation slows as conversation gets longer

Did this fix it?

If your case was different, contact support by email with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.