Large language models

Prompt

A prompt is the input text you give a language model to generate a response. It can be a simple question, a set of instructions, or a structured format such as a system message followed by user input. The model conditions its generation on the prompt, so the prompt's content shapes the output. In practice, prompt length affects both latency and VRAM usage: the model must process every prompt token before it generates the first new token, and the runtime keeps a key-value cache entry for each token it has processed. Longer prompts therefore consume more of the context window, use more memory, and increase time-to-first-token.
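One reason prompt length affects VRAM is the key-value (KV) cache that transformer runtimes keep for each processed token. The sketch below uses the usual back-of-envelope estimate (keys and values, per layer, per KV head, per token). The model dimensions are assumptions chosen to resemble an 8B Llama-style model with grouped-query attention and a 16-bit cache; real runtimes add memory for weights, activations, and buffers on top of this.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV cache size: one key and one value vector
    per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


# Assumed, illustrative dimensions (not read from any real model file).
ASSUMED_MODEL = dict(n_layers=32, n_kv_heads=8, head_dim=128)


def kv_cache_mib(n_tokens):
    """KV cache size in MiB for the assumed model above."""
    return kv_cache_bytes(n_tokens, **ASSUMED_MODEL) / (1024 ** 2)


if __name__ == "__main__":
    for n in (4096, 32768):
        print(f"{n} tokens: {kv_cache_mib(n):.0f} MiB")
```

Under these assumptions the cache grows linearly with prompt length, so an eightfold longer prompt needs eight times the cache memory.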

Deeper dive

Prompts are the primary interface for controlling model behavior. They can include system instructions (e.g., 'You are a helpful assistant'), few-shot examples, or chain-of-thought cues. How a prompt is structured significantly affects output quality. For example, adding 'Let's think step by step' often improves reasoning on complex tasks. Operators must also watch prompt length, because the context window (e.g., 4K, 8K, or 128K tokens) limits how much text the model can process at once, and that budget is shared between the prompt and the generated output. A prompt that exceeds the context window is either rejected or truncated, depending on the runtime, and truncation loses information. Prompt engineering is the practice of crafting prompts to achieve desired outputs, and it is a key skill for running local models effectively.
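The pieces mentioned above (a system instruction, few-shot examples, and a step-by-step cue) can be assembled into a single prompt string. This is a minimal sketch: the function name and the Q:/A: layout are illustrative assumptions, and chat-tuned models usually expect their own chat template, which most runtimes apply for you.

```python
def build_prompt(system, examples, question, step_by_step=False):
    """Assemble a plain-text prompt from a system instruction,
    few-shot (question, answer) pairs, and the final question."""
    parts = [system.strip(), ""]
    for q, a in examples:
        parts += [f"Q: {q}", f"A: {a}", ""]
    parts.append(f"Q: {question}")
    # Placing the cue at the start of the answer invites the model
    # to continue with its reasoning before the final answer.
    parts.append("A: Let's think step by step." if step_by_step else "A:")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_prompt(
        "You are a helpful assistant.",
        [("What is 2 + 2?", "4")],
        "What is 3 + 5?",
        step_by_step=True,
    ))
```

Each few-shot example adds tokens, so the number of examples trades off against the room left in the context window.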

Practical example

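A prompt like 'Translate the following English text to French: "Hello, how are you?"' will cause the model to output a French translation. If the prompt is too long, say a 10,000-token document on a model with a 4K context window, the model sees only part of it. Which part depends on the runtime: some keep the most recent tokens and drop the beginning, others keep the beginning and cut the end, and some refuse the request. Memory matters too. On an RTX 3090 with 24 GB VRAM, the model weights plus the cache for a 4K prompt may fit entirely in VRAM, while a 32K prompt needs a much larger cache and, depending on model size and quantization, may force part of the workload into system RAM, slowing generation.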
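Which end of an over-long prompt survives truncation matters. The sketch below is a hypothetical helper, not the behavior of any particular runtime: it trims a list of token IDs to fit a context window while reserving room for the output, keeping either the tail or the head. Plain integers stand in for real token IDs.

```python
def truncate_tokens(tokens, n_ctx, max_new_tokens=0, keep="tail"):
    """Trim a token list so the prompt plus reserved output fits in n_ctx.

    keep="tail" drops the oldest tokens; keep="head" drops the newest.
    """
    budget = n_ctx - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    if len(tokens) <= budget:
        return list(tokens)
    if keep == "tail":
        return list(tokens[-budget:])
    return list(tokens[:budget])


if __name__ == "__main__":
    doc = list(range(10_000))  # stand-in for a 10,000-token document
    tail = truncate_tokens(doc, n_ctx=4096)
    print(len(tail), tail[0])  # the beginning of the document is gone
```

Reserving `max_new_tokens` up front is a design choice: without it, a prompt that exactly fills the window leaves the model no space to answer.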

Workflow example

In Ollama, you can pass a prompt on the command line: ollama run llama3.2 'What is the capital of France?'. In LM Studio, you type the prompt into the chat interface. In llama.cpp, you pass the prompt with the -p flag: ./llama-cli -m model.gguf -p 'Once upon a time' (older builds name the binary ./main). In each case the prompt is tokenized and fed into the model. If the prompt exceeds the context window, the runtime may truncate it silently, so operators should check the context length setting before sending long inputs.
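The same prompt can also be sent programmatically. The sketch below targets Ollama's local REST endpoint (/api/generate on port 11434) and sets the context length explicitly through the num_ctx option. The helper names are illustrative, and the example assumes an Ollama server is running locally with the llama3.2 model already pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_request(model, prompt, num_ctx=4096):
    """Build a POST request for Ollama's generate endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # one JSON reply instead of a stream
        "options": {"num_ctx": num_ctx},  # context window size in tokens
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def generate(model, prompt, num_ctx=4096):
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(model, prompt, num_ctx)) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    print(generate("llama3.2", "What is the capital of France?"))
```

Setting num_ctx in the request makes the context limit visible in code instead of leaving it to the runtime's default.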

Reviewed by Fredoline Eruo. See our editorial policy.