In-Context Learning — AI glossary

In-context learning (ICL) is a capability of large language models where the model adapts its behavior based solely on examples or instructions provided in the input prompt, without updating its weights. Operators encounter ICL when they include a few examples of desired input-output pairs (few-shot) or a task description (zero-shot) in the prompt. The model uses its attention mechanism to infer the pattern from the context and apply it to new queries. ICL is distinct from fine-tuning, which modifies model weights. It is a key reason operators can adapt a single model to many tasks without retraining, but it consumes context window tokens and may be less reliable than fine-tuning for complex tasks.

Deeper dive

In-context learning works because transformer models process all tokens in the context window simultaneously through self-attention. When examples are placed in the prompt, the model's attention heads learn to associate patterns (e.g., 'sentiment: positive' after a movie review) and apply that mapping to the final query. The number of examples (shots) and their ordering significantly affect performance. ICL is sensitive to prompt formatting, example quality, and model size—larger models tend to perform ICL more reliably. Operators can use ICL for quick prototyping, data labeling, or task switching without retraining. However, ICL consumes context tokens, so long examples reduce the space available for the actual task. It also does not guarantee consistent performance across diverse inputs, and the model may overfit to spurious patterns in the examples. For mission-critical tasks, fine-tuning or RLHF is often preferred over ICL.

Practical example

An operator wants to classify customer emails as 'urgent' or 'normal'. Instead of fine-tuning a model, they craft a prompt with three examples: 'Email: "Server down!" -> urgent', 'Email: "Password reset request" -> normal', 'Email: "Meeting rescheduled" -> normal'. Then they append the new email: 'Email: "Billing issue, payment failed" ->'. The model outputs 'urgent' based on the pattern. This few-shot ICL works on a 7B model running on an RTX 3060 12 GB at ~20 tok/s, but consumes ~200 tokens of context for the examples.

Workflow example

In LM Studio or Ollama, an operator loads a model (e.g., Llama 3.1 8B) and types a prompt with few-shot examples directly in the chat interface. In llama.cpp, they run ./main -m model.gguf -p "Translate English to French: hello -> bonjour, goodbye -> au revoir, cat ->" to see ICL in action. In Hugging Face Transformers, they set tokenizer.apply_chat_template with a list of example messages. The operator must ensure the total prompt length (examples + query) fits within the context window; if it exceeds, the model may truncate or ignore later examples.