
Frameworks & tools

LM Studio

Also known as: lmstudio

LM Studio is a desktop application that provides a graphical interface for downloading, managing, and running local large language models (LLMs) on consumer hardware. It wraps llama.cpp as its inference backend, letting operators load GGUF models, pick quantization levels, and configure context length and GPU offloading without writing command-line arguments. The app handles model downloads from Hugging Face repositories, manages VRAM allocation automatically, and shows real-time token generation speed and memory usage. Operators use LM Studio to chat with models, run a local inference server compatible with OpenAI's API, and experiment with different model sizes and settings without scripting.

Deeper dive

LM Studio simplifies local LLM deployment by abstracting the complexities of llama.cpp and model management. When an operator selects a model (e.g., Llama 3.1 8B Q4_K_M), LM Studio downloads the GGUF file from Hugging Face, stores it in a local cache, and loads it into VRAM using GPU offloading. The interface exposes sliders for context length (e.g., 2048 to 8192 tokens), GPU layers (how many transformer layers run on GPU vs. CPU), and thread count. It also provides a built-in server mode that exposes an HTTP endpoint mimicking OpenAI's chat completions API, allowing other tools (e.g., SillyTavern, Open Interpreter) to connect. LM Studio is particularly useful for operators who prefer a visual workflow over terminal commands, though it offers less granular control than direct llama.cpp usage.

Practical example

An operator with an RTX 3060 12GB can run Llama 3.1 8B at Q4_K_M (5 GB) with 4096 context in LM Studio. The app shows ~30 tok/s and 80% VRAM usage. Trying Mistral 7B at Q8 (7 GB) might cause out-of-memory errors, prompting the operator to reduce context or switch to Q4. The server mode lets them point a script at http://localhost:1234/v1 to generate text via the OpenAI Python client.

Workflow example

In LM Studio, an operator clicks 'Search' to find 'Mistral-7B-Instruct-v0.3-GGUF' from Hugging Face, downloads the Q4_K_M file, and loads it. They set GPU Offload to 'Max' (all layers on GPU) and context to 4096. After clicking 'Start Server', they see 'Server running on http://localhost:1234'. They then call the model from a Python script via the OpenAI client's chat.completions.create (the legacy openai.ChatCompletion.create interface was removed in openai 1.0). The app's sidebar shows real-time tokens/sec and VRAM usage.

Related terms

  • GGUF
  • llama.cpp
  • Ollama

Reviewed by Fredoline Eruo. See our editorial policy.
