RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.


© 2026 runlocalai.co. Independently operated.

Model Deployment

Model deployment is the process of making a trained AI model available for inference in a production environment. For local AI operators, this means loading a model into a runtime (e.g., llama.cpp, Ollama, vLLM) on a specific hardware configuration (e.g., RTX 4090, Apple M-series) and exposing it via an API or CLI. The key decisions are quantization level (e.g., Q4 vs Q8), context length, batch size, and offloading strategy, all constrained by available memory (VRAM on a discrete GPU, unified memory on Apple silicon). Deployment is distinct from training: it focuses on serving a fixed model, not on updating its weights.
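The memory constraint described above can be sketched as a back-of-the-envelope check. This is a rough estimate, not how any particular runtime accounts for memory: the architecture numbers (32 layers, 8 KV heads, head dimension 128) are taken from the published Llama 3.1 8B configuration, the 4.9 GiB weight size is an approximate figure for a Q4_K_M file, and the 1 GiB overhead and the helper names are assumptions made for illustration.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class ModelShape:
    """Only the architecture fields that determine KV cache size."""
    n_layers: int
    n_kv_heads: int
    head_dim: int


def kv_cache_bytes(shape: ModelShape, context_len: int, bytes_per_value: int = 2) -> int:
    """KV cache size for one sequence; 2 bytes per value assumes an fp16 cache."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return (2 * shape.n_layers * shape.n_kv_heads * shape.head_dim
            * context_len * bytes_per_value)


def fits_in_vram(weights_gib: float, shape: ModelShape, context_len: int,
                 vram_gib: float, overhead_gib: float = 1.0):
    """Return (fits, needed_gib). overhead_gib is an assumed allowance
    for runtime buffers and other GPU users, not a measured value."""
    needed = weights_gib + kv_cache_bytes(shape, context_len) / GIB + overhead_gib
    return needed <= vram_gib, needed


# Assumed values for Llama 3.1 8B (from its published config).
LLAMA_3_1_8B = ModelShape(n_layers=32, n_kv_heads=8, head_dim=128)

for ctx in (4096, 131072):
    ok, needed = fits_in_vram(4.9, LLAMA_3_1_8B, ctx, vram_gib=12.0)
    print(f"context={ctx}: need about {needed:.1f} GiB, fits in 12 GiB: {ok}")
```

Under these assumptions the same quantized weights fit easily at a 4K context but not at the full 128K context, which is why context length sits next to quantization level in the list of key decisions.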

Practical example

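An operator deploys Llama 3.1 8B on an RTX 3060 (12 GB VRAM). They choose Q4_K_M quantization (about 5 GB of weights) and budget roughly 2 GB more for a 4K context (KV cache plus runtime buffers), which leaves comfortable headroom. Using Ollama, they run ollama run llama3.1:8b, which pulls the model if needed and asks the Ollama server (listening on port 11434 by default) to load the quantized weights into VRAM. Q8_0 (~8 GB) is a much tighter fit on the same card: once the weights, context, and anything else using the GPU no longer fit in 12 GB, Ollama offloads part of the model to system RAM, and throughput can drop from ~40 to ~5 tokens/sec.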
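One way to check the tokens/sec figure on your own hardware is to call the Ollama HTTP API and read the timing fields it returns. The sketch below assumes an Ollama server is already running on localhost:11434 with llama3.1:8b pulled; it uses the /api/generate endpoint and the eval_count and eval_duration response fields (the latter in nanoseconds). The function names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama address


def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama response; eval_duration is in nanoseconds."""
    seconds = response["eval_duration"] / 1e9
    return response["eval_count"] / seconds


def generate(prompt: str, model: str = "llama3.1:8b") -> dict:
    """Send one non-streaming generation request and return the parsed JSON."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


# Usage (requires a running Ollama server):
#   result = generate("Explain KV cache in one sentence.")
#   print(result["response"])
#   print(f"{tokens_per_second(result):.1f} tokens/sec")
```

A sharp drop in this number after switching quantization or context length is a practical sign that part of the model has been offloaded to system RAM.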

Workflow example

In LM Studio, deployment is a short point-and-click flow: select a model from the hub, choose a quantization preset (e.g., Q4_K_M), and click 'Start Server'. The UI shows VRAM usage and tokens/sec. For vLLM, deployment uses vllm serve <model> --quantization awq --max-model-len 4096, where <model> must be a checkpoint that has already been AWQ-quantized; the base meta-llama/Llama-3.1-8B repository holds unquantized weights and cannot be loaded with --quantization awq as is. The runtime loads the model, allocates KV cache, and exposes an OpenAI-compatible API (port 8000 by default). Operators monitor GPU memory with nvidia-smi to verify fit.
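Once the server is up, any OpenAI-style client can talk to it. The following is a minimal sketch using only the Python standard library, assuming a vLLM server on its default address (http://localhost:8000) and the /v1/completions endpoint. The model name passed in must match the name the server was started with; the value shown in the usage comment is a placeholder, and the function names are illustrative.

```python
import json
import urllib.request


def build_completion_request(prompt: str, model: str,
                             base_url: str = "http://localhost:8000",
                             max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /v1/completions endpoint."""
    body = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, model: str) -> str:
    """Send the request and return the text of the first completion choice."""
    request = build_completion_request(prompt, model)
    with urllib.request.urlopen(request) as resp:
        data = json.load(resp)
    return data["choices"][0]["text"]


# Usage (requires a running server; the model name is a placeholder):
#   print(complete("Local inference is useful because", "meta-llama/Llama-3.1-8B"))
```

Because the request shape follows the OpenAI completions format, the same client should also work against other runtimes that expose an OpenAI-compatible server by changing base_url and the model name.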

Reviewed by Fredoline Eruo. See our editorial policy.
