Annotation

Annotation is the process of adding labels, tags, or metadata to raw data (text, images, audio) to create a training dataset for supervised learning. In local AI, operators encounter annotation when fine-tuning a model on custom data: each example must be paired with a correct output. For text, this means writing prompt-response pairs; for images, drawing bounding boxes or classifying objects. The quality and consistency of annotations directly determine model performance—noisy or sparse labels produce unreliable fine-tuned models.

An operator fine-tuning Llama 3.1 8B to answer customer support queries needs a dataset of ~500 annotated examples. Each example is a JSON object with a "prompt" field (e.g., "How do I reset my password?") and a "completion" field (e.g., "Go to Settings > Account > Reset Password."). If the annotations are inconsistent—sometimes using "completion", other times "response"—the training script will fail or learn incorrectly.

In Hugging Face Transformers, annotation is done before training. Operators prepare a CSV or JSONL file with columns like "instruction" and "output". They then load it with datasets.load_dataset('json', data_files='annotations.jsonl') and pass it to the Trainer. In LM Studio, the fine-tuning UI expects a dataset in OpenAI chat format, where each message has a "role" and "content"—annotations define the assistant's expected reply.

Reviewed by Fredoline Eruo. See our editorial policy.

When it doesn't work

Practical example

Workflow example