Model Deployment

Model deployment is the process of making a trained AI model available for inference in a production environment. For local AI operators, this means loading a model into a runtime (e.g., llama.cpp, Ollama, vLLM) on a specific hardware configuration (e.g., RTX 4090, Apple M-series) and exposing it via an API or CLI. The key decisions are quantization level (e.g., Q4 vs. Q8), context length, batch size, and offloading strategy. All four are constrained by VRAM, because the weights and the KV cache must fit in GPU memory together. Deployment is distinct from training: it focuses on serving, not learning.
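The VRAM constraint can be checked with simple arithmetic before downloading anything. The sketch below is a rough estimate, not a runtime's own accounting: it assumes an FP16 KV cache, a Llama 3.1 8B shape of 32 layers, 8 KV heads, and head dimension 128, approximate weight sizes of 4.9 GB (Q4_K_M) and 8.5 GB (Q8_0), and a hypothetical 1 GB allowance for runtime buffers. The function names are illustrative.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Bytes needed to hold keys and values for every layer at a given context length."""
    # The leading factor of 2 covers the separate key and value tensors.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


def fits_in_vram(weights_gb, kv_gb, vram_gb, overhead_gb=1.0):
    """Return (fits, headroom_gb). overhead_gb is an assumed allowance, not a measured value."""
    needed = weights_gb + kv_gb + overhead_gb
    return needed <= vram_gb, vram_gb - needed


# Assumed model shape for Llama 3.1 8B (see the lead-in above).
kv_gb = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=4096) / GIB
print(f"KV cache at 4K context: {kv_gb:.2f} GiB")

# Approximate weight sizes; treat these as assumptions and check the actual file size.
for label, weights_gb in [("Q4_K_M", 4.9), ("Q8_0", 8.5)]:
    ok, headroom = fits_in_vram(weights_gb, kv_gb, vram_gb=12.0)
    print(f"{label}: fits={ok}, estimated headroom={headroom:.1f} GiB")
```

Real usage is usually higher than this estimate, since runtimes reserve compute buffers and the desktop may hold some VRAM, so it is worth leaving margin rather than planning to the last gigabyte.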

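An operator deploys Llama 3.1 8B on an RTX 3060 (12 GB VRAM). They choose Q4_K_M quantization (about 5 GB of weights), which leaves room for a 4K context (budgeted at about 2 GB for KV cache and runtime buffers). Using Ollama, they run ollama run llama3.1:8b, which loads the quantized weights into VRAM; the Ollama server exposes an HTTP API on port 11434. If they instead try Q8_0 (~8 GB), headroom shrinks to a couple of GB. Once weights, KV cache, and anything else occupying the GPU exceed VRAM (for example, at a longer context), Ollama offloads layers to system RAM, and throughput can drop from ~40 to ~5 tokens/sec.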
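To confirm the deployment is serving and to measure tokens/sec, the HTTP API on port 11434 can be called directly. The following is a minimal sketch using only the Python standard library; it assumes Ollama's /api/generate endpoint with streaming disabled and the eval_count and eval_duration fields in its response (duration in nanoseconds). The helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default port from the example above


def build_request(prompt, model="llama3.1:8b"):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_per_second(result):
    """Generation speed from eval_count (tokens) and eval_duration (nanoseconds)."""
    return result["eval_count"] / (result["eval_duration"] / 1e9)


def generate(prompt):
    """Send the prompt and return (text, tokens_per_sec). Needs a running Ollama server."""
    with urllib.request.urlopen(build_request(prompt), timeout=300) as resp:
        result = json.load(resp)
    return result["response"], tokens_per_second(result)
```

With the server running, generate("Explain quantization in one sentence.") returns the completion and its measured speed. A sharp drop in that speed between two quantization levels is a practical sign that layers have been offloaded to system RAM.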

In LM Studio, deployment is a short GUI flow: select a model from the hub, choose a quantization preset (e.g., Q4_K_M), and click 'Start Server'. The UI shows VRAM usage and tokens/sec. For vLLM, deployment uses vllm serve meta-llama/Llama-3.1-8B --max-model-len 4096; add --quantization awq only when the checkpoint itself is AWQ-quantized, since that flag expects pre-quantized weights. The runtime loads the model, allocates KV cache, and exposes an OpenAI-compatible API (port 8000 by default). Operators monitor GPU memory with nvidia-smi to verify fit.
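The nvidia-smi check can be scripted so that fit is verified the same way after every model load. This is a minimal sketch that relies on nvidia-smi's --query-gpu and --format=csv,noheader,nounits options, which report memory in MiB; the 1024 MiB free-memory threshold is an arbitrary assumption and the function names are illustrative.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(csv_text):
    """Parse 'used, total' lines (MiB) into a list of (used_mib, total_mib), one per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def has_headroom(used_mib, total_mib, min_free_mib=1024):
    """True if at least min_free_mib remains free (threshold is an assumed safety margin)."""
    return total_mib - used_mib >= min_free_mib


def check_gpus():
    """Run nvidia-smi and return (used_mib, total_mib, ok) per GPU. Needs an NVIDIA driver."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [(used, total, has_headroom(used, total)) for used, total in parse_memory(out)]
```

Running check_gpus() after the server has loaded the model and handled a request at the intended context length gives a more honest reading than checking immediately after startup, because the KV cache may not be fully allocated until then on some runtimes.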

Reviewed by Fredoline Eruo. See our editorial policy.
