Model Serving
Model serving is the process of making a trained AI model available for inference via an API or local runtime. For operators running local AI, this means loading a model into memory (typically VRAM) and exposing it through a server endpoint (e.g., HTTP, gRPC) or a command-line interface. The serving layer handles request batching, concurrency, and resource management. Key operator concerns: VRAM capacity determines which models can be served without offloading; latency and throughput depend on quantization level, batch size, and hardware. Tools like vLLM and llama.cpp optimize serving with continuous batching and KV-cache management.
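The VRAM concern above can be made concrete with a back-of-the-envelope estimate. The sketch below is a rough approximation, not a measurement: it adds weight size, KV-cache size, and a flat overhead figure. The function names and the 1 GiB overhead are illustrative assumptions; the layer and head counts are those of Llama 3.1 8B, and the 5.5 figure is the approximate Q4_K_M size used elsewhere in this article.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Bytes needed to cache keys and values for context_tokens tokens."""
    # Factor 2: one key vector and one value vector per layer, per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens


def fits_in_vram(weights_gib, kv_bytes, vram_gib, overhead_gib=1.0):
    """Rough check: weights + KV cache + assumed runtime overhead vs. VRAM."""
    needed_gib = weights_gib + kv_bytes / 1024**3 + overhead_gib
    return needed_gib <= vram_gib, needed_gib


# Llama 3.1 8B: 32 layers, 8 KV heads (grouped-query attention), head dim 128.
# bytes_per_value=2 assumes an fp16 KV cache.
kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=8192)
ok, needed = fits_in_vram(weights_gib=5.5, kv_bytes=kv, vram_gib=24)
print(f"KV cache: {kv / 1024**3:.2f} GiB, total: {needed:.1f} GiB, fits: {ok}")
```

Under these assumptions a single 8,192-token sequence needs 1 GiB of KV cache, so the cache grows quickly once several long requests are served at the same time.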
Deeper dive
Model serving sits between a trained model and the applications that query it. In production, it involves deploying a model behind a REST or gRPC API, handling authentication, rate limiting, and scaling. For local AI, serving is often simpler: a single process (e.g., Ollama, LM Studio) loads the model and listens on localhost. The runtime must manage the model's weights, KV cache, and token generation loop. Key serving metrics: time-to-first-token (TTFT), tokens-per-second (TPS), and memory usage. Advanced serving frameworks like vLLM use PagedAttention to reduce memory fragmentation and support continuous batching, which processes multiple requests concurrently to improve throughput. Operators should consider whether their workload is throughput-bound (many concurrent requests, e.g., a shared chat app) or latency-bound (one interactive user, e.g., a real-time assistant).
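To make the TTFT and TPS metrics concrete, here is a minimal sketch that derives both from token arrival timestamps. The function name and the trace are hypothetical, and it assumes TPS is measured over the decode phase only (some tools also count prompt processing time).

```python
def serving_metrics(request_sent_at, token_times):
    """Compute TTFT and decode tokens-per-second from timestamps in seconds."""
    if not token_times:
        raise ValueError("no tokens were generated")
    ttft = token_times[0] - request_sent_at
    decode_time = token_times[-1] - token_times[0]
    # TPS covers the decode phase only, so the first token is excluded.
    tps = (len(token_times) - 1) / decode_time if decode_time > 0 else 0.0
    return ttft, tps


# Hypothetical trace: request sent at t=0, first token after 250 ms,
# then one further token every 25 ms.
times = [0.25 + 0.025 * i for i in range(41)]
ttft, tps = serving_metrics(0.0, times)
print(f"TTFT: {ttft * 1000:.0f} ms, TPS: {tps:.1f}")
```

Separating the two numbers matters because TTFT is dominated by prompt processing, while TPS reflects the speed of the generation loop.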
Practical example
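An operator with an RTX 4090 (24 GB VRAM) serves Llama 3.1 8B at Q4_K_M (~5.5 GB). Using llama.cpp's built-in HTTP server (./llama-server -m model.gguf --host 0.0.0.0 --port 8080; the binary was named ./server in older builds), they get ~40 tok/s for single requests. Note that --host 0.0.0.0 exposes the server to the whole network; bind to 127.0.0.1 for local-only access. If they switch to vLLM with continuous batching, aggregate throughput across requests can exceed 100 tok/s under concurrent load, but VRAM usage increases because vLLM by default preallocates most of the GPU's memory for its KV cache.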
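Why concurrent load raises throughput can be illustrated with a toy scheduler. The sketch below is a simplified model, not vLLM's actual scheduler: it assumes every decode step produces one token for each in-flight request, ignores prompt processing, and counts steps instead of time. All names and request lengths are hypothetical.

```python
from collections import deque


def static_batch_steps(output_lengths, batch_size):
    """Decode steps when each batch must finish before the next one starts."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        # The whole batch waits for its longest request.
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batch_steps(output_lengths, batch_size):
    """Decode steps when a finished request's slot is refilled immediately."""
    queue = deque(output_lengths)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1  # one decode step emits one token per active request
        active = [left - 1 for left in active if left > 1]
    return steps


# Six hypothetical requests (output lengths in tokens), two batch slots.
lengths = [2, 8, 2, 8, 2, 2]
print(static_batch_steps(lengths, 2), continuous_batch_steps(lengths, 2))
```

With these lengths, static batching needs 18 steps and continuous batching needs 12, because short requests no longer leave a slot idle while a long neighbour finishes.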
Workflow example
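In Ollama, serving needs no extra configuration: ollama serve starts the server (desktop installs usually run it in the background already) and ollama pull llama3.1:8b downloads the model. The runtime listens on localhost:11434. An operator can send a request via curl http://localhost:11434/api/generate -d '{"model": "llama3.1:8b", "prompt": "Hello"}'; by default the response streams back as one JSON object per line. For custom setups, llama.cpp's server binary loads a GGUF file and provides a REST API. vLLM requires a Python environment: vllm serve meta-llama/Llama-3.1-8B-Instruct starts an OpenAI-compatible server, on port 8000 by default. Add --quantization awq only when the checkpoint being served is itself AWQ-quantized; the flag does not quantize a full-precision model.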
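The curl request above can also be issued from Python using only the standard library. This is a minimal sketch that assumes an Ollama server is already running on localhost:11434 with llama3.1:8b pulled; the helper names are illustrative, and error handling is omitted.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt, url=OLLAMA_URL):
    """Build a POST request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers)


def collect_text(lines):
    """Join the 'response' fragments from a stream of JSON lines."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def generate(model, prompt):
    """Send the prompt to a running Ollama server and return the full reply."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return collect_text(resp)
```

Calling generate("llama3.1:8b", "Hello") returns the complete reply as one string. The parsing is kept in collect_text so it can be exercised without a live server.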
Reviewed by Fredoline Eruo.