Test Data

Test data is a set of examples used to evaluate a model's performance after training, kept separate from the training data the model learned from. In local AI workflows, test data measures how well a model generalizes to unseen inputs, which helps operators detect overfitting and assess quality before deploying a model to production. Test data is never used for training or for validation (hyperparameter tuning, checkpoint selection); it is held back until the final evaluation. Common formats include text prompts with reference answers for LLMs, image-label pairs for vision models, and structured CSV files for tabular models. Operators encounter test data when running benchmarks such as MMLU or HumanEval, or when splitting their own datasets with tools such as train_test_split from scikit-learn.
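A minimal sketch of the hold-out idea, using only the Python standard library as a stand-in for scikit-learn's train_test_split. The function name three_way_split and the default fractions are illustrative assumptions, not part of any library.

```python
import random

def three_way_split(examples, val_frac=0.1, test_frac=0.2, seed=0):
    """Shuffle once, then carve out disjoint train/validation/test lists.

    Hypothetical helper for illustration; the fractions are assumptions.
    """
    if val_frac < 0 or test_frac < 0 or val_frac + test_frac >= 1:
        raise ValueError("val_frac and test_frac must be >= 0 and sum to < 1")
    shuffled = list(examples)
    # A fixed seed keeps the split reproducible, so the same examples
    # stay in the test set across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]                 # held back until final evaluation
    val = shuffled[n_test:n_test + n_val]    # used for tuning during training
    train = shuffled[n_test + n_val:]        # the only data the model learns from
    return train, val, test

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))  # 70 10 20
```

Fixing the seed matters here: if the split changes between runs, examples can leak from the test set into training without anyone noticing.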

Practical example

An operator fine-tunes Llama 3.1 8B on a custom dataset of 10,000 customer support conversations. They hold out 2,000 of those conversations as test data and never show them to the model during training. After fine-tuning, they run inference on the test set and measure how often the model's responses are relevant, reported as accuracy. If accuracy on the test data is much lower than on the training data (e.g., 95% train vs. 70% test), the model is likely overfitting, and the operator may need more data, stronger regularization, or a smaller model.
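The train-versus-test comparison can be sketched as a simple gap check. The 10-point threshold (max_gap=0.10) is an arbitrary assumption for illustration; what counts as a significant gap depends on the task and the size of the test set.

```python
def generalization_gap(train_acc, test_acc):
    """Difference between training accuracy and test accuracy."""
    return train_acc - test_acc

def looks_overfit(train_acc, test_acc, max_gap=0.10):
    """Flag a model whose test accuracy trails training accuracy by more
    than max_gap. The default threshold is an illustrative assumption."""
    return generalization_gap(train_acc, test_acc) > max_gap

# Figures from the example: 95% accuracy on training data, 70% on test data.
print(looks_overfit(0.95, 0.70))  # True
```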

Workflow example

With the Hugging Face Datasets library, an operator loads a dataset and calls dataset.train_test_split(test_size=0.2) to reserve 20% as test data. During training with the Transformers Trainer, the eval_dataset parameter is set to the validation split, not the test split. After training, the operator runs trainer.predict(test_dataset) to get final metrics. With Ollama, operators can evaluate a model by sending a set of test prompts to a model started with ollama run llama3.1:8b and comparing the outputs against expected answers, often using a script to compute exact match or pass@k scores.
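A scoring script of the kind mentioned above might look like the following sketch. It assumes the model outputs have already been collected as strings. The function names are hypothetical, and pass_at_k uses the unbiased estimator commonly reported with HumanEval, 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass.

```python
from math import comb

def exact_match(predictions, references):
    """Fraction of predictions equal to their reference after trimming
    surrounding whitespace. Stricter or looser normalization is a choice."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("need at least one example")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

def pass_at_k(n, c, k):
    """Estimate the chance that at least one of k samples is correct,
    given that c out of n generated samples were correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate becomes 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)

print(exact_match(["Paris", " 4 "], ["Paris", "5"]))  # 0.5
```

Exact match suits tasks with a single correct string; pass@k suits code generation, where each sample is checked by running tests rather than by string comparison.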

Reviewed by Fredoline Eruo. See our editorial policy.
