How to run vLLM with a HuggingFace model
Prerequisites: vLLM installed; a HuggingFace account (needed only for gated models, optional for public ones)
What this does
Launches vLLM as a server process and loads a HuggingFace-compatible model into GPU memory, enabling inference via chat completions or plain completions endpoints. The model is served on the local host with token streaming and batching handled automatically.
Steps
1. Authenticate if accessing gated models. Gated repos such as meta-llama/Llama-* require license acceptance before downloading.

       huggingface-cli login

   Expected output: Login successful. Skip this step for fully public models.

2. Export optional HuggingFace cache variables. By default, models cache to ~/.cache/huggingface/. Setting HF_HOME redirects downloads to a faster volume.

       export HF_HOME=/path/to/fast/disk

   Expected output: no output; the variable is set in the shell.

3. Start vLLM with a HuggingFace model identifier.

       vllm serve meta-llama/Llama-3.2-1B-Instruct \
         --task generate \
         --tensor-parallel-size 1

   Expected output: INFO: Application startup complete. Uvicorn running on http://0.0.0.0:8000.

4. Send a test inference request.

       curl -X POST http://localhost:8000/v1/chat/completions \
         -H "Content-Type: application/json" \
         -d '{"model": "meta-llama/Llama-3.2-1B-Instruct", "messages": [{"role": "user", "content": "Say hello in one sentence."}], "max_tokens": 32, "temperature": 0}'

   Expected output: a JSON object containing a choices array with model-generated text.
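
The test request from the last step can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the server is listening on http://localhost:8000 as in the steps above. The helper names build_chat_payload, extract_reply, and chat are hypothetical and not part of vLLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: default host and port from the steps
MODEL = "meta-llama/Llama-3.2-1B-Instruct"


def build_chat_payload(prompt, model=MODEL, max_tokens=32, temperature=0):
    """Build the same JSON body that the curl example sends."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def extract_reply(response):
    """Return the generated text from the first entry of the choices array."""
    return response["choices"][0]["message"]["content"]


def chat(prompt, base_url=BASE_URL):
    """POST the payload to /v1/chat/completions and return the reply text."""
    body = json.dumps(build_chat_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as resp:
        return extract_reply(json.load(resp))
```

With the server running, chat("Say hello in one sentence.") should return the model's reply as a string. Keeping payload construction and response parsing in separate functions lets both be checked without a live server.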
Verification
curl -s http://localhost:8000/v1/models | python -m json.tool
# Expected: lists the served model name
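
Model loading can take a while, so a script may want to wait until the model appears in /v1/models before sending requests. The sketch below polls that endpoint, assuming the default http://localhost:8000 address and the OpenAI-style response shape (a data array whose entries carry an id field). The function names, timeout, and polling interval are illustrative choices, not vLLM settings.

```python
import json
import time
import urllib.request


def served_model_ids(models_response):
    """Extract model names from a parsed /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


def wait_for_model(model, base_url="http://localhost:8000", timeout_s=300, interval_s=5):
    """Poll /v1/models until `model` is listed or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                if model in served_model_ids(json.load(resp)):
                    return True
        except OSError:
            pass  # connection refused, HTTP error, or timeout: server not ready yet
        time.sleep(interval_s)
    return False
```

For example, wait_for_model("meta-llama/Llama-3.2-1B-Instruct") returns True once the server lists the model and False if it never does within the timeout.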
Common failures
- 401 or 403 error: Auth token missing or expired for a gated model. Re-run huggingface-cli login and accept the model license.
- CUDA out of memory: The model does not fit in the available GPU memory. Lower --gpu-memory-utilization to 0.7 if other processes share the GPU, or switch to a smaller model.
- Model not found (404): Typo in the model identifier, or the model has been renamed. Check the exact path on HuggingFace.
- Port 8000 already in use: Another process occupies the port. Find it with lsof -i :8000 or pass --port 8001.
- Slow first inference (cold start): vLLM compiles CUDA kernels on the first request. Subsequent requests run faster.
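
The port conflict above can be detected before launching the server. This is a rough sketch, assuming the server binds on the local machine. It checks whether anything accepts TCP connections on a port and picks the first free one to pass as --port. The names port_in_use and first_free_port are hypothetical helpers.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success and an errno value otherwise
        return sock.connect_ex((host, port)) == 0


def first_free_port(start=8000, attempts=10):
    """Return the first port in [start, start + attempts) with no listener, or None."""
    for port in range(start, start + attempts):
        if not port_in_use(port):
            return port
    return None
```

The check is only a snapshot: another process could claim the port between the check and the vllm serve launch, so the server's own startup error remains the final word.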