HOW-TO · SET

How to run vLLM with an OpenAI-compatible API

intermediate · 15 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
Windows 11 · Ollama 0.4.x
macOS 15 · Ollama 0.4.x
PREREQUISITES

vLLM installed, a model downloaded

What this does

Exposes a running vLLM instance through REST endpoints that match the OpenAI Chat Completions and Completions APIs. Any client library built for the OpenAI API can target the local vLLM server by changing its base URL and, if the server requires one, its API key.
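
As a rough illustration of how small that configuration change can be, the sketch below builds the two settings an OpenAI-style client typically needs. The helper name local_vllm_settings is hypothetical, the host and port assume the values used in the steps of this guide, and the placeholder key "EMPTY" assumes the server was started without --api-key.

```python
def local_vllm_settings(host: str = "localhost", port: int = 8000,
                        api_key: str = "EMPTY") -> dict:
    """Return client settings that point at a local vLLM server (hypothetical helper)."""
    return {
        # vLLM serves its OpenAI-compatible routes under the /v1 prefix.
        "base_url": f"http://{host}:{port}/v1",
        # Placeholder value; it only matters if the server enforces an API key.
        "api_key": api_key,
    }


settings = local_vllm_settings()
print(settings["base_url"])  # http://localhost:8000/v1
```

With the openai Python package installed, these settings could be passed as OpenAI(**settings); the rest of the client code would stay as it was written for the hosted API.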

Steps

  1. Start the vLLM server. The --host flag binds the listening interface; --port sets the TCP port.

    vllm serve <model> --host 0.0.0.0 --port 8000
    

    Expected output: Uvicorn running on http://0.0.0.0:8000.

  2. Check server readiness. The /v1/models endpoint confirms the API is responding.

    curl -s http://localhost:8000/v1/models | python -m json.tool
    

    Expected output: a JSON object with a data array containing model metadata.

  3. Send a chat completions request. The request body follows the OpenAI Chat Completions schema; no Authorization header is needed unless the server was started with --api-key.

    curl -X POST http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "<model-id>",
        "messages": [{"role": "user", "content": "What is 2+2?"}],
        "max_tokens": 20,
        "temperature": 0
      }'
    

    Expected output: JSON with choices[0].message.content containing the model's response.

  4. Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
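
The request, parsing, and record-keeping steps above can also be sketched in Python using only the standard library. This is a minimal sketch, not the guide's own tooling: every function name here is hypothetical, BASE_URL assumes the host and port from step 1, and the fields in run_record are one possible layout for the evidence described in the last step.

```python
import json
import urllib.request
from datetime import datetime, timezone

BASE_URL = "http://localhost:8000/v1"  # assumption: host and port from step 1


def build_chat_request(model_id, prompt, max_tokens=20, temperature=0,
                       base_url=BASE_URL):
    """Build (but do not send) the POST request from step 3."""
    payload = {
        "model": model_id,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_json(target, timeout=30):
    """Send a URL or prepared Request and decode the JSON response."""
    with urllib.request.urlopen(target, timeout=timeout) as response:
        return json.load(response)


def first_model_id(models_response):
    """Pick the first model id from a /v1/models response."""
    return models_response["data"][0]["id"]


def extract_reply(chat_response):
    """Read choices[0].message.content from a chat completions response."""
    return chat_response["choices"][0]["message"]["content"]


def run_record(command, version, model_id, output):
    """Bundle the run evidence into a JSON-serializable dict."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "command": command,
        "version": version,
        "model": model_id,
        "output": output,
    }


def run_steps(base_url=BASE_URL):
    """Steps 2 and 3 end to end; requires the server to be running."""
    model_id = first_model_id(fetch_json(f"{base_url}/models"))
    chat = fetch_json(build_chat_request(model_id, "What is 2+2?",
                                         base_url=base_url))
    return model_id, extract_reply(chat)
```

Calling run_steps() against a running server would return the model id and the reply text, which can then be passed to run_record together with the launch command and the installed version, and written out with json.dump.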

Verification

curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"<model-id>","messages":[{"role":"user","content":"ping"}],"max_tokens":5}'
# Expected: a short text response (not an error)
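
For scripted checks, "a short text response (not an error)" can be turned into a small predicate over the decoded JSON body. The sketch below is an assumption-laden illustration: the function name is hypothetical, the two sample bodies are made up for the example, and the error shapes it looks for (a top-level "error" key, or "object" set to "error") are guesses that may differ between server versions.

```python
def looks_like_success(body: dict) -> bool:
    """Return True if a decoded chat completions body carries non-empty text."""
    # Assumed error shapes; adjust to whatever the server actually returns.
    if "error" in body or body.get("object") == "error":
        return False
    choices = body.get("choices") or []
    if not choices:
        return False
    content = (choices[0].get("message") or {}).get("content")
    return isinstance(content, str) and content.strip() != ""


# Made-up sample bodies, for illustration only.
ok_body = {"choices": [{"message": {"role": "assistant", "content": "pong"}}]}
error_body = {"object": "error", "message": "model does not exist", "code": 404}

print(looks_like_success(ok_body), looks_like_success(error_body))  # True False
```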

Common failures

  • 404 Not Found on /v1/chat/completions — Model identifier mismatch. Verify with curl http://localhost:8000/v1/models and use the exact string.
  • Connection refused — The server is not running or the port is blocked. Confirm with lsof -i :8000.
  • Invalid request error — A required field such as model or messages is missing or malformed. max_tokens is optional, so check the JSON syntax and the shape of the messages array first.
  • model not found in response — The model was registered under a different name. Use the exact string from /v1/models.
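
When the requests are sent from Python's urllib rather than curl, the failures listed above surface as exceptions. The following sketch maps them to the same hints; the function name diagnose is hypothetical, and treating both 400 and 422 as an invalid request is an assumption about the status codes the server may use.

```python
import urllib.error


def diagnose(exc: Exception) -> str:
    """Translate a urllib exception into a troubleshooting hint."""
    # HTTPError is a subclass of URLError, so it has to be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        if exc.code == 404:
            return "model or path not found: compare the model id with /v1/models"
        if exc.code in (400, 422):
            return "invalid request: check the JSON payload for missing or malformed fields"
        return f"HTTP {exc.code}: inspect the response body"
    if isinstance(exc, urllib.error.URLError):
        if isinstance(exc.reason, ConnectionRefusedError):
            return "connection refused: the server is not running or the port is blocked"
        return f"network error: {exc.reason}"
    return f"unexpected error: {exc!r}"
```

A caller would typically wrap urllib.request.urlopen in a try/except block that catches urllib.error.URLError and prints diagnose(exc).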
