RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Learn
  4. /Courses
  5. /Capstone: Full-Stack AI App
  6. /Ch. 3
Capstone: Full-Stack AI App

03. Model Serving Setup

Chapter 3 of 18 · 15 min
KEY INSIGHT

Model serving is the bottleneck in most AI applications—design autoscaling around inference latency, not request throughput.

Model serving requires choosing between llama.cpp and vLLM based on hardware and throughput requirements. Llama.cpp runs on CPU with excellent memory efficiency through quantized weights. vLLM requires CUDA and delivers higher throughput for concurrent users through PagedAttention.

For llama.cpp, the server binary runs as a standalone process. Download the quantized model file and start the server:

# Download quantized model (Mistral 7B Q4_K_M)
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf

# Start llama-server
./llama-server \
  -m mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  -c 4096 \
  --host 0.0.0.0 \
  --port 8080

The context size (-c) determines how much text the model processes. Higher values enable longer conversations but increase memory usage. The Q4_K_M quantization reduces model size by 4x with acceptable quality loss.

For vLLM, install via pip and start with the OpenAI-compatible server:

pip install vllm

vllm serve mistralai/Mistral-7B-Instruct-v0.2 \
  --tensor-parallel-size 1 \
  --port 8080

vLLM provides an OpenAI-compatible API by default. This simplifies integration—the same client code works for both providers.

Common failure modes with model serving include OOM kills when context windows exceed available RAM. Monitor model memory usage with nvidia-smi for GPU or ps aux | grep llama for CPU. Set container memory limits in Docker to trigger restarts before the host runs out of memory.

Health check endpoints should verify the model loads and responds to a simple completion request within a timeout. A failed health check should trigger container restart via Docker's restart policy.

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Set up llama.cpp server locally and verify the /completion endpoint works. Measure latency for a 10-token completion.

← Chapter 2
Architecture Design
Chapter 4 →
API Gateway