RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

© 2026 runlocalai.co · Independently operated
How to run vLLM with a HuggingFace model

Intermediate · 20 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x | Windows 11 · Ollama 0.4.x | macOS 15 · Ollama 0.4.x
PREREQUISITES

vLLM installed; a HuggingFace account (required only for gated models)

What this does

Launches vLLM as a server process and loads a model from the HuggingFace Hub into GPU memory. The server exposes OpenAI-compatible chat completions and plain completions endpoints, on port 8000 by default. vLLM batches concurrent requests automatically and streams tokens when a request asks for it.
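The two endpoint families differ mainly in the shape of the request body. The following is a minimal sketch, assuming the server follows the OpenAI-style request format: chat completions take a list of role-tagged messages, plain completions take a raw prompt string. The helper names chat_payload and completion_payload are illustrative and not part of vLLM.

```python
import json

MODEL = "meta-llama/Llama-3.2-1B-Instruct"  # model id used in the steps of this guide


def chat_payload(user_text, max_tokens=32):
    """Body for POST /v1/chat/completions: a list of role-tagged messages."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }


def completion_payload(prompt, max_tokens=32):
    """Body for POST /v1/completions: a raw prompt string, no chat roles."""
    return {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }


# Print both bodies so the difference is visible side by side.
print(json.dumps(chat_payload("Say hello in one sentence."), indent=2))
print(json.dumps(completion_payload("Hello, my name is"), indent=2))
```

For an instruction-tuned model, the chat form is usually the better fit, since the server can wrap the messages in the model's chat template.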

Steps

  1. Authenticate if accessing gated models. Gated repos such as meta-llama/Llama-* require license acceptance before downloading.

    huggingface-cli login
    

    Expected output: Login successful. Skip this step only for fully public models; the Llama model used in step 3 is gated, so it needs the login.

  2. Export optional HuggingFace cache variables. By default, models cache to ~/.cache/huggingface/. Setting HF_HOME redirects downloads to a faster volume.

    export HF_HOME=/path/to/fast/disk
    

    Expected output: no output; the variable is set in the shell.

  3. Start vLLM with a HuggingFace model identifier.

    vllm serve meta-llama/Llama-3.2-1B-Instruct \
      --task generate \
      --tensor-parallel-size 1
    

    Expected output: INFO: Application startup complete. Uvicorn running on http://0.0.0.0:8000.

  4. Send a test inference request.

    curl -X POST http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "meta-llama/Llama-3.2-1B-Instruct", "messages": [{"role": "user", "content": "Say hello in one sentence."}], "max_tokens": 32, "temperature": 0}'
    

    Expected output: a JSON object containing a choices array with model-generated text.
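The request in step 4 can also be sent from a script. This is a rough sketch using only the Python standard library, assuming the server from step 3 is listening on localhost:8000 and returns the OpenAI-style response shape (text under choices[0].message.content). The function names are hypothetical helpers, not a vLLM client API.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # address from step 3's expected output
MODEL = "meta-llama/Llama-3.2-1B-Instruct"


def build_request(prompt, base_url=BASE_URL, model=MODEL):
    """Build the same POST request as the curl command in step 4."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 32,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_json):
    """Return the generated text from the first entry of the choices array."""
    return response_json["choices"][0]["message"]["content"]


def ask(prompt, timeout=60):
    """Send the request and return the reply text. Needs a running server."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return extract_text(json.load(resp))
```

With the server running, print(ask("Say hello in one sentence.")) prints the reply text. Keeping build_request and extract_text separate from ask lets the request and parsing logic be checked without a GPU.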

Verification

curl -s http://localhost:8000/v1/models | python3 -m json.tool
# Expected: lists the served model name
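Because loading a model can take a while, a script may want to wait for the server instead of checking once. The sketch below polls the same /v1/models endpoint until the model appears, assuming the response lists models under a "data" key with an "id" field per entry (the OpenAI-style shape). The function names, attempt count, and delay are illustrative choices.

```python
import json
import time
import urllib.request


def fetch_model_ids(base_url="http://localhost:8000", timeout=5):
    """GET /v1/models and return the ids listed under the "data" key."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return [entry["id"] for entry in json.load(resp)["data"]]


def wait_until_served(model, fetch=fetch_model_ids, attempts=30, delay=2.0):
    """Poll until `model` is listed; return True on success, False otherwise."""
    for _ in range(attempts):
        try:
            if model in fetch():
                return True
        except OSError:
            # Covers refused connections, timeouts, and urllib's URLError:
            # the server is not accepting requests yet, so retry.
            pass
        time.sleep(delay)
    return False
```

Calling wait_until_served("meta-llama/Llama-3.2-1B-Instruct") before sending the first request avoids connection errors during startup. The fetch parameter exists so the loop can be exercised with a stub.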

Common failures

  • 401 or 403 error: Auth token missing or expired for a gated model. Re-run huggingface-cli login and accept the model license on the model's HuggingFace page.
  • CUDA out of memory: The model weights plus KV cache do not fit in available VRAM. Shorten the context with --max-model-len, switch to a smaller or quantized model, or shard across GPUs with --tensor-parallel-size. Lower --gpu-memory-utilization (default 0.9) to about 0.7 only if another process is already holding GPU memory.
  • Model not found (404): Typo in the model identifier, or the model has been renamed. Check the exact path on HuggingFace.
  • Port 8000 already in use: Another process occupies the port. Find it with lsof -i :8000 or pass --port 8001.
  • Slow first inference (cold start): vLLM does one-time warm-up work, such as kernel compilation, around startup and the first request. Subsequent requests run faster.
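The port conflict can be detected before launching the server. This is a small sketch, assuming the server will be reached over 127.0.0.1; the helper names port_in_use and first_free_port are hypothetical, and the check only sees listeners that accept TCP connections on that address.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


def first_free_port(start=8000, tries=10):
    """Return the first port in [start, start + tries) with no listener, else None."""
    for port in range(start, start + tries):
        if not port_in_use(port):
            return port
    return None
```

The value returned by first_free_port() can be passed to vllm serve through --port. Another process could still claim the port between the check and the launch, so treat the result as a hint rather than a guarantee.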

Related guides

  • How to install vLLM with pip
  • How to enable tensor parallelism in vLLM
  • Course: Local AI Fundamentals