How to run inference with llama.cpp server

Intermediate · 15 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
Windows 11 · Ollama 0.4.x
macOS 15 · Ollama 0.4.x
PREREQUISITES

llama.cpp compiled, GGUF model file downloaded

What this does

Starts an HTTP server that exposes a loaded LLM through a REST API. Once it is running, any HTTP client can send prompts and receive completions, optionally streamed, without managing the model directly.
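Streaming is requested per call rather than enabled by default. The following is a minimal sketch, assuming the server from the steps below is listening on localhost:8080 and that the completion endpoint emits server-sent-event lines prefixed with `data: ` when `"stream": true` is set. The helper name `strip_sse_prefix` is hypothetical and not part of llama.cpp.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only the SSE "data: " lines and drop the prefix,
# leaving one JSON chunk per output line. Blank separator lines are discarded.
strip_sse_prefix() {
  sed -n 's/^data: //p'
}

# Usage (requires a running server; -N turns off curl's output buffering):
#   curl -sN http://localhost:8080/completion \
#     -H "Content-Type: application/json" \
#     -d '{"prompt": "Hello", "n_predict": 32, "stream": true}' | strip_sse_prefix
```

Each emitted line should then be a standalone JSON object that a client can parse as it arrives.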

Steps

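  1. Launch the server binary with model and port flags. The server binds to the specified host and port and loads the model into memory; it cannot serve completions until loading finishes. Note that --host 0.0.0.0 listens on all network interfaces; use 127.0.0.1 to restrict access to the local machine.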

    ./llama-server -m /path/to/model.gguf --host 0.0.0.0 --port 8080
    

    Expected output: a log line reporting that the server is listening on 0.0.0.0:8080, along with model loading logs. The exact wording varies between builds.

  2. Send a completion request via curl. POST to the completions endpoint with a JSON body containing the prompt and generation parameters.

    curl -X POST http://localhost:8080/completion \
      -H "Content-Type: application/json" \
      -d '{"prompt": "Explain the capital of France", "n_predict": 128}'
    

    Expected output: a JSON response whose content field holds the generated text.

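  3. Query the server properties endpoint. GET /props returns server metadata and the default generation settings for the loaded model.

    curl http://localhost:8080/props
    

    Expected output: JSON object including the model path, default generation settings such as n_ctx, and the slot count. Field names vary between llama.cpp versions.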

  4. Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
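Recording the run evidence can be scripted. This is a minimal sketch, not a prescribed format: the helper `record_run`, the log file name, and the field labels are all hypothetical, and the usage comment assumes the binary accepts a `--version` flag that prints build information.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append one run record to a plain-text log.
# Arguments: log file, command line used, version string, model name, observed output.
record_run() {
  local log_file="$1" cmd="$2" version="$3" model="$4" output="$5"
  {
    printf 'date: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    printf 'command: %s\n' "$cmd"
    printf 'version: %s\n' "$version"
    printf 'model: %s\n' "$model"
    printf 'output: %s\n' "$output"
    printf -- '---\n'   # record separator
  } >> "$log_file"
}

# Usage (requires the built binary and a running server):
#   version="$(./llama-server --version 2>&1 | head -n 1)"
#   output="$(curl -s http://localhost:8080/completion \
#     -H 'Content-Type: application/json' \
#     -d '{"prompt": "Explain the capital of France", "n_predict": 128}')"
#   record_run run-evidence.log \
#     './llama-server -m /path/to/model.gguf --host 0.0.0.0 --port 8080' \
#     "$version" 'model.gguf' "$output"
```

Appending rather than overwriting keeps earlier runs available for comparison.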

Verification

curl -s http://localhost:8080/props | grep -o '"model[^,]*'
# Expected: a non-empty model field such as "model_path":"/path/to/model.gguf" (the field name varies by version)
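Because large models can take a while to load, the verification may fail if it runs too early. The sketch below polls until the server reports ready, assuming the server exposes a /health endpoint that returns HTTP 200 once the model is loaded and an error status while it is still loading. The helper names `retry_until` and `health_ok` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command once per second until it succeeds
# or the attempt budget is used up. Returns 0 on success, 1 otherwise.
retry_until() {
  local attempts="$1"
  shift
  local i
  for ((i = 0; i < attempts; i++)); do
    if "$@"; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Probe: -f makes curl exit non-zero on HTTP error statuses such as 503.
health_ok() {
  curl -sf --max-time 2 -o /dev/null "$1/health"
}

# Usage (requires a running or starting server):
#   retry_until 60 health_ok http://localhost:8080 && echo "server ready"
```

Keeping the retry loop separate from the probe makes it easy to swap in a different readiness check.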

Common failures

  • Segmentation fault during model load — Insufficient RAM for the model's context size. Reduce -c (context size) to a lower value matching available memory.
  • Port 8080 already in use — Another process occupies the port. Identify with lsof -i :8080 and stop the conflicting service or launch on a different port.
  • Model file path not found — The -m argument points to a non-existent file. Verify the path with ls -la.
  • Slow token generation on CPU — Expected for large models without GPU acceleration. Consider enabling CUDA support or using a quantized GGUF variant.
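Two of the failures above, a missing model file and an occupied port, can be caught before launch. The following is a minimal preflight sketch under stated assumptions: the function names are hypothetical, and the port check relies on lsof being installed (if it is not, the port is treated as free).

```shell
#!/usr/bin/env bash
# Hypothetical preflight helpers mirroring the failure list above.

# The -m path must point to an existing regular file.
model_exists() {
  [ -f "$1" ]
}

# Same lsof probe as in the failure list; succeeds when something uses the port.
port_in_use() {
  lsof -i ":$1" >/dev/null 2>&1
}

# Returns 1 if the model file is missing, 2 if the port is taken, 0 otherwise.
preflight() {
  local model="$1" port="$2"
  if ! model_exists "$model"; then
    echo "model file not found: $model" >&2
    return 1
  fi
  if port_in_use "$port"; then
    echo "port $port already in use" >&2
    return 2
  fi
  return 0
}

# Usage:
#   preflight /path/to/model.gguf 8080 && \
#     ./llama-server -m /path/to/model.gguf --host 0.0.0.0 --port 8080
```

Distinct return codes let a wrapper script decide whether to fix the path or pick another port.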
