03. Model Serving Setup
Model serving requires choosing between llama.cpp and vLLM based on hardware and throughput requirements. Llama.cpp runs on CPU with excellent memory efficiency through quantized weights. vLLM requires CUDA and delivers higher throughput for concurrent users through PagedAttention.
For llama.cpp, the server binary runs as a standalone process. Download the quantized model file and start the server:
# Download quantized model (Mistral 7B Q4_K_M)
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf
# Start llama-server
./llama-server \
-m mistral-7b-instruct-v0.2.Q4_K_M.gguf \
-c 4096 \
--host 0.0.0.0 \
--port 8080
The context size (-c) determines how much text the model processes. Higher values enable longer conversations but increase memory usage. The Q4_K_M quantization reduces model size by 4x with acceptable quality loss.
For vLLM, install via pip and start with the OpenAI-compatible server:
pip install vllm
vllm serve mistralai/Mistral-7B-Instruct-v0.2 \
--tensor-parallel-size 1 \
--port 8080
vLLM provides an OpenAI-compatible API by default. This simplifies integration—the same client code works for both providers.
Common failure modes with model serving include OOM kills when context windows exceed available RAM. Monitor model memory usage with nvidia-smi for GPU or ps aux | grep llama for CPU. Set container memory limits in Docker to trigger restarts before the host runs out of memory.
Health check endpoints should verify the model loads and responds to a simple completion request within a timeout. A failed health check should trigger container restart via Docker's restart policy.
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Set up llama.cpp server locally and verify the /completion endpoint works. Measure latency for a 10-token completion.