KoboldCpp
Also known as: koboldai, kobold-cpp
KoboldCpp is a single-file, self-contained executable that bundles llama.cpp with a web-based UI and a built-in API, designed for running large language models locally on consumer hardware. It is a fork of llama.cpp that adds a graphical interface, persistent story/chat management, and integration with KoboldAI's character card and lorebook (world info) systems. Operators use it to run GGUF-quantized models without installing Python or managing dependencies: download the binary, load a model file, and access the UI in a browser. It is particularly popular among roleplayers and writers for its ease of use and built-in text-adventure features.
Deeper dive
KoboldCpp originated from the KoboldAI project, whose client historically connected to Python-based backends or remote APIs; KoboldCpp was created to provide a fully local, dependency-free alternative built on llama.cpp's efficient inference engine. Unlike llama.cpp's command-line interface, KoboldCpp offers a full web UI with chat history, character cards, lorebooks (world info), and a text-adventure mode. It also exposes an API compatible with KoboldAI's client, so existing frontends can connect to a local backend. The executable ships with its acceleration backends (OpenBLAS, cuBLAS, CLBlast, Metal) compiled in, and operators pick the build that matches their hardware (CPU, NVIDIA CUDA, AMD ROCm, Apple Metal). It supports GGUF model loading, extended context sizes (e.g., 8K or 32K via RoPE scaling), and a wide range of samplers (top-k, top-p, Mirostat, and others). For operators, the key trade-off is convenience versus flexibility: KoboldCpp is easier to set up than raw llama.cpp but exposes fewer knobs for advanced tuning.
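As a sketch of that API compatibility, the following request hits the KoboldAI-style generate endpoint on a default install (localhost, port 5001); the prompt and sampler values are placeholders:

    # Query a running KoboldCpp instance through its KoboldAI-compatible API.
    # Assumes the default address http://localhost:5001.
    curl -s http://localhost:5001/api/v1/generate \
      -H "Content-Type: application/json" \
      -d '{"prompt": "Once upon a time", "max_length": 80, "temperature": 0.7}'

The server replies with JSON of the form {"results": [{"text": "..."}]}, which is the shape KoboldAI-family frontends expect.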
Practical example
An operator with an RTX 3060 12GB downloads the KoboldCpp CUDA binary (koboldcpp.exe on Windows, or the corresponding Linux build) and a GGUF model such as Llama-3.1-8B-Instruct-Q4_K_M.gguf (about 5 GB). They launch the executable, select the model file, set the context size to 4096, and click 'Start'. The web UI opens at http://localhost:5001, where they can chat, load a character card, and watch throughput (roughly 30-40 tokens/sec with the model fully on the GPU). If the model does not fit in VRAM, the layers that are not offloaded run on the CPU from system RAM, dropping generation to around 5 tok/s.
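The same launch can be done headlessly from a terminal. A minimal sketch, assuming the CUDA build and that all 33 layers of the 8B model fit in 12 GB of VRAM (the layer count here is illustrative):

    # Load the model with full GPU offload on an NVIDIA card.
    ./koboldcpp --model Llama-3.1-8B-Instruct-Q4_K_M.gguf \
      --usecublas --gpulayers 33 --contextsize 4096

Lowering --gpulayers trades speed for VRAM headroom, which is the usual first adjustment when a model almost fits.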
Workflow example
In a typical workflow, an operator first downloads a GGUF model from Hugging Face (for example, from a quantizer's account such as TheBloke's). Then they run KoboldCpp with the model path: ./koboldcpp --model /path/to/model.gguf --contextsize 8192 --blasbatchsize 512. The UI loads, and they can import a character card (JSON or PNG) or start a new story. For API usage, they configure a client such as SillyTavern to point at http://localhost:5001/api. KoboldCpp also supports a --port flag for custom ports and --usecublas for NVIDIA GPU acceleration.
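Put together, a hedged sketch of that frontend wiring might look like this (the model path is a placeholder; the health check uses the KoboldAI API's model endpoint):

    # Start the backend on the default port and push it to the background.
    ./koboldcpp --model /path/to/model.gguf --contextsize 8192 --port 5001 &
    # Verify the API is up before pointing SillyTavern at http://localhost:5001/api.
    curl -s http://localhost:5001/api/v1/model

A response naming the loaded model confirms the backend is ready for a client connection.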