How to run inference with llama.cpp server
Prerequisites: llama.cpp compiled and a GGUF model file downloaded.
What this does
Starts an HTTP server that loads an LLM and exposes it through a REST API. Once it is running, any HTTP client can send prompts and receive completions (streamed if the request sets "stream": true) without managing the model directly.
Steps
- Launch the server binary with model and port flags. The server process binds to the specified port and loads the model into memory before accepting connections.
./llama-server -m /path/to/model.gguf --host 0.0.0.0 --port 8080
Expected output: HTTP server listening on 0.0.0.0:8080, followed by model loading logs.
- Send a completion request via curl. POST to the completions endpoint with a JSON body containing the prompt and generation parameters.
curl -X POST http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain the capital of France", "n_predict": 128}'
Expected output: JSON response containing the generated text completion.
- Query the server properties endpoint. Returns server metadata and loaded model parameters.
curl http://localhost:8080/props
Expected output: JSON object with the model name, n_ctx, and server configuration.
- Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
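The record-keeping step above can be sketched as a small shell function. This is only one possible layout, not something llama.cpp provides: the function name `record_run`, the log file name, and the field labels are all hypothetical and chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of llama.cpp): append one run record to a log file.
# Arguments: log file, exact command, version string, model name, observed output.
record_run() {
    log_file=$1
    {
        printf 'date: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
        printf 'command: %s\n' "$2"
        printf 'version: %s\n' "$3"
        printf 'model: %s\n' "$4"
        printf 'output: %s\n' "$5"
        printf '%s\n' '---'   # separator between records
    } >> "$log_file"
}

# Example call (all values are placeholders):
# record_run run-evidence.log \
#     "./llama-server -m /path/to/model.gguf --host 0.0.0.0 --port 8080" \
#     "build version as reported by your binary" \
#     "model.gguf" \
#     "HTTP server listening on 0.0.0.0:8080"
```

Appending rather than overwriting keeps earlier runs available for comparison. If your build prints a version string, for example through a `--version` flag, that output is a natural value for the third argument.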
Verification
curl -s http://localhost:8080/props | grep -o '"model[^,]*'
# Expected: a non-empty model entry such as "model_path":"/path/to/model.gguf" (the exact field name varies by server version)
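A verification command may fail simply because the model is still loading. The following sketch polls a URL until it answers successfully before you run the check. The function name `wait_for_server` and its defaults are hypothetical, and the example call assumes your build exposes a `/health` endpoint that returns a success status once the model is ready.

```shell
#!/bin/sh
# Hypothetical helper: poll a URL until it answers with an HTTP success
# status, or give up after a fixed number of attempts.
# Arguments: URL, maximum attempts (default 30), seconds between attempts (default 1).
wait_for_server() {
    url=$1
    attempts=${2:-30}
    delay=${3:-1}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f makes curl exit non-zero on HTTP error statuses (for example
        # while the model is still loading); --max-time bounds each attempt.
        if curl -sf --max-time 2 -o /dev/null "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example call (assumes a /health endpoint on the default port):
# wait_for_server http://localhost:8080/health 60 1 || echo "server not ready" >&2
```

Bounding the number of attempts means a failed launch surfaces as a non-zero exit status instead of an endless wait.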
Common failures
- Segmentation fault during model load: Insufficient RAM for the model's context size. Reduce -c (context size) to a lower value matching available memory.
- Port 8080 already in use: Another process occupies the port. Identify it with lsof -i :8080 and stop the conflicting service or launch on a different port.
- Model file path not found: The -m argument points to a non-existent file. Verify the path with ls -la.
- Slow token generation on CPU: Expected for large models without GPU acceleration. Consider enabling CUDA support or using a quantized GGUF variant.
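Two of the failures above, a missing model file and an occupied port, can be caught before launch. The sketch below is one way to do that. The function names `model_file_ok`, `port_in_use`, and `preflight` are hypothetical, and the port check assumes `lsof` is installed (if it is not, the port is reported as free).

```shell
#!/bin/sh
# Hypothetical pre-launch checks; none of these functions ship with llama.cpp.

# Succeeds only if the path names a readable regular file.
model_file_ok() {
    [ -f "$1" ] && [ -r "$1" ]
}

# Succeeds if some process is already listening on the given TCP port.
# Assumes lsof is available; without it the port is treated as free.
port_in_use() {
    lsof -iTCP:"$1" -sTCP:LISTEN >/dev/null 2>&1
}

# Prints one message per problem found and returns non-zero if any check fails.
preflight() {
    model=$1
    port=$2
    status=0
    if ! model_file_ok "$model"; then
        echo "model file not found or unreadable: $model" >&2
        status=1
    fi
    if port_in_use "$port"; then
        echo "port $port already in use" >&2
        status=1
    fi
    return "$status"
}

# Example call (paths are placeholders):
# preflight /path/to/model.gguf 8080 && \
#     ./llama-server -m /path/to/model.gguf --host 0.0.0.0 --port 8080
```

Memory and speed problems are not covered here because they depend on the model size and hardware, which a simple shell check cannot judge reliably.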