What this does

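Launches vLLM as an OpenAI-compatible HTTP server and loads a HuggingFace model into GPU memory. Clients send requests to the chat completions endpoint (/v1/chat/completions) or the plain completions endpoint (/v1/completions). The server listens on port 8000 by default. vLLM batches concurrent requests automatically, and a client can opt into token streaming per request by setting "stream": true.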

Steps

Authenticate if accessing gated models. Gated repos such as meta-llama/Llama-* require you to accept the license on the model's HuggingFace page and then log in with an access token before downloading.
```
huggingface-cli login
```
Expected output: Login successful. Skip this step for fully public models. The meta-llama model used in the steps below is gated, so it needs this step.
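
To confirm that a token is actually visible to the process that will start vLLM, a small check like the following can help. It is a sketch that assumes huggingface_hub's documented defaults: the HF_TOKEN environment variable takes precedence, and otherwise the login command stores the token in a file named token under HF_HOME (overridable with HF_TOKEN_PATH). The function names find_hf_token and mask are illustrative and not part of any library.

```python
import os


def find_hf_token(env=None):
    """Return (source, token) for the first token found, or (None, None)."""
    env = os.environ if env is None else env
    # Assumption: an explicit HF_TOKEN variable wins over the stored token file.
    if env.get("HF_TOKEN"):
        return "HF_TOKEN environment variable", env["HF_TOKEN"].strip()
    # Assumption: the token file lives at $HF_HOME/token unless HF_TOKEN_PATH is set.
    hf_home = os.path.expanduser(env.get("HF_HOME") or "~/.cache/huggingface")
    token_path = env.get("HF_TOKEN_PATH") or os.path.join(hf_home, "token")
    try:
        with open(token_path, encoding="utf-8") as fh:
            token = fh.read().strip()
    except OSError:
        return None, None  # no readable token file
    return (token_path, token) if token else (None, None)


def mask(token):
    """Show only the first few characters so the secret never lands in logs."""
    return token[:6] + "..." if len(token) > 6 else "***"


def main():
    source, token = find_hf_token()
    if token is None:
        print("no token found; run huggingface-cli login")
    else:
        print(f"token {mask(token)} found via {source}")
```

Calling main() prints where the token was found without echoing the full secret. It only checks that a token exists locally, not that the license for a particular model has been accepted.
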
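Optionally set the HuggingFace cache variables. By default, models are cached under ~/.cache/huggingface/. Setting HF_HOME redirects the cache, and therefore model downloads, to another volume such as a faster or larger disk. The login token is stored under HF_HOME as well, so export the variable in the same shell before running huggingface-cli login, or log in again after changing it.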
```
export HF_HOME=/path/to/fast/disk
```
Expected output: no output; the variable is set in the shell.
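
Before a large download, it can be useful to see where the files will land and how much space that volume has. The sketch below assumes the usual huggingface_hub layout, in which model files go to HF_HUB_CACHE if set and otherwise to the hub directory under HF_HOME. It ignores XDG_CACHE_HOME for brevity, and the helper names are illustrative.

```python
import os
import shutil
from pathlib import Path


def resolve_hub_cache(env=None):
    """Return the directory where downloaded model files are expected to land."""
    env = os.environ if env is None else env
    # Assumption: HF_HUB_CACHE, when set, overrides the HF_HOME-derived default.
    if env.get("HF_HUB_CACHE"):
        return Path(os.path.expanduser(env["HF_HUB_CACHE"]))
    hf_home = env.get("HF_HOME") or "~/.cache/huggingface"
    return Path(os.path.expanduser(hf_home)) / "hub"


def free_gib(path):
    """Free space in GiB on the volume holding `path`."""
    probe = Path(path)
    # The cache directory may not exist yet, so walk up to an existing parent.
    while not probe.exists():
        probe = probe.parent
    return shutil.disk_usage(probe).free / 2**30


def main():
    cache = resolve_hub_cache()
    print(f"hub cache: {cache}")
    print(f"free space: {free_gib(cache):.1f} GiB")
```

Run main() from the same shell that exported HF_HOME, so it reports the location vLLM will use.
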
Start vLLM with a HuggingFace model identifier.
```
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --task generate \
  --tensor-parallel-size 1
```
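Expected output, once the weights have been downloaded and loaded: INFO: Application startup complete. Uvicorn running on http://0.0.0.0:8000. The server does not accept requests until these lines appear, and the first launch takes longer because the weights are downloaded.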
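
When the server is started from a script, watching the log by hand is not practical. One option is to poll until the server answers. The sketch below assumes the /health route that vLLM's OpenAI-compatible server exposes, which returns HTTP 200 when the server is up. The function names, the 600-second deadline, and the 2-second interval are arbitrary illustrative choices.

```python
import http.client
import time
import urllib.request


def is_healthy(base_url, timeout=2.0):
    """True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and HTTP errors all mean "not ready yet".
        return False


def wait_until_ready(base_url, deadline_s=600.0, interval_s=2.0, probe=is_healthy):
    """Poll `probe` until it succeeds or the deadline passes; return the outcome."""
    deadline = time.monotonic() + deadline_s
    while True:
        if probe(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For example, wait_until_ready("http://localhost:8000") returns True as soon as the server responds and False if it never does within the deadline. The probe parameter exists only so the loop can be exercised without a live server.
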

Send a test inference request.

```
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.2-1B-Instruct", "messages": [{"role": "user", "content": "Say hello in one sentence."}], "max_tokens": 32, "temperature": 0}'
```

Expected output: a JSON object whose choices array holds the generated text in choices[0].message.content.
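
The same request can be sent from Python using only the standard library. The sketch below mirrors the curl call and also shows how a streamed reply could be read, assuming the OpenAI-style server-sent-event format (lines starting with "data:" and a final "data: [DONE]"). BASE_URL, MODEL, and all function names are illustrative, and the code assumes the server from the previous step is running on the default port.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: default host and port
MODEL = "meta-llama/Llama-3.2-1B-Instruct"


def build_chat_request(prompt, max_tokens=32, temperature=0.0, stream=False):
    """Build the JSON body for POST /v1/chat/completions."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": stream,
    }


def extract_text(response):
    """Pull the assistant text out of a non-streaming response body."""
    return response["choices"][0]["message"]["content"]


def iter_stream_text(lines):
    """Yield text fragments from the event lines of a streaming response."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines carry no payload
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        if not chunk.get("choices"):
            continue
        piece = chunk["choices"][0].get("delta", {}).get("content")
        if piece:
            yield piece


def post_chat(body, timeout=60.0):
    """Send the request and return the open HTTP response."""
    request = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(request, timeout=timeout)


def main():
    prompt = "Say hello in one sentence."
    with post_chat(build_chat_request(prompt)) as resp:
        print(extract_text(json.load(resp)))
    with post_chat(build_chat_request(prompt, stream=True)) as resp:
        for piece in iter_stream_text(resp):
            print(piece, end="", flush=True)
        print()
```

Call main() once the server is up. The first request prints the whole reply at once, and the second prints it fragment by fragment as tokens arrive.
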

Verification

```
curl -s http://localhost:8000/v1/models | python -m json.tool
```
Expected output: a JSON object that lists the served model name.
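
For an automated check, the model list can be parsed instead of read by eye. The sketch below assumes the OpenAI-style response shape, a data array whose entries each carry an id. The helper names are illustrative.

```python
import json
import urllib.request


def fetch_models(base_url="http://localhost:8000", timeout=10.0):
    """GET /v1/models and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return json.load(resp)


def served_model_ids(payload):
    """Extract the model names from a /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def check_served(payload, expected):
    """Return the served ids, or raise if `expected` is not among them."""
    ids = served_model_ids(payload)
    if expected not in ids:
        raise RuntimeError(f"{expected!r} is not served; server reports {ids}")
    return ids
```

With the server running, check_served(fetch_models(), "meta-llama/Llama-3.2-1B-Instruct") either returns the list of served names or raises with the names the server actually reports. Those names are the values to put in the model field of a request.
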

Common failures

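401 or 403 error: The auth token is missing or expired (typically 401), or the license for a gated model has not been accepted (typically 403). Re-run huggingface-cli login and accept the license on the model page.
CUDA out of memory: The weights plus KV cache do not fit in the available VRAM. Reduce --max-model-len, switch to a smaller or quantized model, or shard across several GPUs with a larger --tensor-parallel-size. Lower --gpu-memory-utilization (default 0.9) to a value such as 0.7 only when another process shares the GPU; a lower value gives vLLM less memory, not more.
Model not found (404): Typo in the model identifier, or the model has been renamed. Check the exact path on HuggingFace, and make sure the model field in each request matches the name the server reports at /v1/models.
Port 8000 already in use: Another process occupies the port. Find it with lsof -i :8000 or start the server on another port with --port 8001.
Slow first inference (cold start): Some one-time work, such as kernel compilation and warm-up, can still happen on the first request after startup. Subsequent requests run faster.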
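
Two of these failures can be checked before launching anything: whether the port is free, and roughly how much VRAM the weights alone need. The sketch below is a back-of-the-envelope aid, not a vLLM feature. The function names are illustrative, and the estimate covers weights only, so the KV cache and activations need additional headroom.

```python
import socket


def port_in_use(host="127.0.0.1", port=8000, timeout=0.5):
    """True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


def first_free_port(start=8000, attempts=20, host="127.0.0.1"):
    """Return the first port at or after `start` that nothing is listening on."""
    for port in range(start, start + attempts):
        if not port_in_use(host, port):
            return port
    raise RuntimeError(f"no free port in {start}..{start + attempts - 1}")


def estimate_weight_gib(num_params, bytes_per_param=2):
    """Approximate size of the weights in GiB (2 bytes per parameter for fp16/bf16)."""
    return num_params * bytes_per_param / 2**30
```

For instance, estimate_weight_gib(8_000_000_000) gives the approximate weight footprint of a hypothetical 8-billion-parameter model in 16-bit precision, which can be compared against the GPU's VRAM before attempting a launch. The value from first_free_port() can be passed to --port.
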
