How to run vLLM with an OpenAI-compatible API
Prerequisites: vLLM installed and a model downloaded.
What this does
Exposes a running vLLM instance through REST endpoints that match the OpenAI Chat Completions and Completions APIs. Any client library built for the OpenAI API can target the local vLLM server with minimal configuration changes, typically just the base URL.
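To make the "same interface" point concrete, here is a minimal sketch of a client that uses only the Python standard library. It assumes the server from this guide is listening on localhost:8000; the names `BASE_URL`, `build_chat_request`, and `send_chat` are illustrative and not part of vLLM or any OpenAI library.

```python
import json
import urllib.request

# Assumption: the vLLM server from this guide listens on localhost:8000.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(model, prompt, base_url=BASE_URL, max_tokens=20):
    """Build (but do not send) a POST request for /chat/completions."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(model, prompt, base_url=BASE_URL, timeout=60):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, prompt, base_url=base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]
```

With the server running, `send_chat("<model-id>", "What is 2+2?")` should return the model's reply as a string. Only `base_url` ties the sketch to the local server, so pointing it elsewhere is a one-argument change.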
Steps
- Start the vLLM server. The --host flag binds the listening interface; --port sets the TCP port.
  vllm serve <model> --host 0.0.0.0 --port 8000
  Expected output: Uvicorn running on http://0.0.0.0:8000.
- Check server readiness. The /v1/models endpoint confirms the API is responding.
  curl -s http://localhost:8000/v1/models | python -m json.tool
  Expected output: a JSON object with a data array containing model metadata.
- Send a chat completions request. The header and JSON shape mirror the OpenAI API exactly.
  curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "<model-id>",
      "messages": [{"role": "user", "content": "What is 2+2?"}],
      "max_tokens": 20,
      "temperature": 0
    }'
  Expected output: JSON with choices[0].message.content containing the model's response.
- Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
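One possible way to capture the run evidence is a small JSON record. The sketch below is an assumption about what such a record could look like, not a format that vLLM prescribes; the field names and the helpers `build_run_record` and `save_run_record` are hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def build_run_record(command, version, model, output):
    """Collect the details needed to reproduce a run (hypothetical layout)."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "command": command,
        "version": version,   # e.g. the installed vLLM package version
        "model": model,       # the exact ID reported by /v1/models
        "observed_output": output,
    }


def save_run_record(record, path):
    """Write the record to disk as indented JSON."""
    Path(path).write_text(json.dumps(record, indent=2), encoding="utf-8")
```

Passing the exact command string, rather than a paraphrase, keeps the record copy-pasteable when the run has to be repeated.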
Verification
curl -s -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"<model-id>","messages":[{"role":"user","content":"ping"}],"max_tokens":5}'
# Expected: a short text response (not an error)
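The "not an error" check can also be made programmatic. The following sketch inspects a parsed JSON response and returns the reply text, raising if the payload looks like an error. It assumes errors arrive either as a top-level `error` key or as an object whose `object` field is `"error"`; the exact error shape may differ between server versions, and `extract_reply` is an illustrative name.

```python
def extract_reply(payload):
    """Return the assistant text from a chat completions response.

    Raises ValueError when the payload looks like an error or has no text.
    """
    # Assumption: errors carry an "error" key or use object == "error".
    if "error" in payload or payload.get("object") == "error":
        raise ValueError(f"server returned an error: {payload}")

    choices = payload.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")

    content = choices[0].get("message", {}).get("content")
    if not content:
        raise ValueError("response contains no text content")
    return content
```

Feeding it the decoded body of the curl call above, for example via `json.loads`, turns the manual check into a pass or fail result.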
Common failures
- 404 Not Found on /v1/chat/completions: Model identifier mismatch. Verify with curl http://localhost:8000/v1/models and use the exact string.
- Connection refused: The server is not running or the port is blocked. Confirm with lsof -i :8000.
- Invalid request error: The payload is malformed or omits a required field (model or messages). max_tokens is optional, but setting it, for example "max_tokens": 20, bounds the response length.
- model not found in response: The model was registered under a different name. Use the exact string from /v1/models.
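Since two of the failures above come down to a model name mismatch, a small helper can compare the requested name against what the server reports. This sketch works on the already parsed JSON from /v1/models, which has a `data` array whose entries each carry an `id`. The function names and the sample model ID are hypothetical.

```python
def list_model_ids(models_payload):
    """Return the model IDs from a parsed /v1/models response."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def diagnose_model_name(models_payload, requested):
    """Report whether the requested name matches a served model ID."""
    ids = list_model_ids(models_payload)
    if requested in ids:
        return f"'{requested}' is served; the model name is not the problem."
    available = ", ".join(ids) or "none"
    return f"'{requested}' is not served. Available model IDs: {available}"


# Hypothetical payload, shaped like the /v1/models response.
sample = {"object": "list", "data": [{"id": "example-org/example-model", "object": "model"}]}
print(diagnose_model_name(sample, "example-model"))
```

The comparison is an exact string match on purpose, because the server expects the ID exactly as listed.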