Training Data
Training data is the dataset a model learns its patterns and behaviors from. For LLMs, this typically means trillions of tokens of text drawn from the web, books, code, and other sources. During training, the model learns to predict the next token across this data. The quality, diversity, and size of the training data strongly shape the model's capabilities and biases. Operators encounter training data when choosing a model: at a given parameter count, a model trained on more or better data (e.g., Llama 3.1 8B compared with older models of similar size) generally performs better. Hardware requirements are a separate matter: the VRAM and compute needed to run a model depend on its parameter count and quantization, not on how much data it was trained on.
Deeper dive
Training data for LLMs is usually a static corpus collected before training begins. Common sources include Common Crawl (web pages), Wikipedia, books, academic papers, and code repositories. The data is cleaned, deduplicated, and often filtered for quality or to remove toxic content. Tokenization converts the raw text into tokens, and the model is trained to minimize the cross-entropy loss on next-token prediction. The size of training data has grown from a few gigabytes of text for early models (GPT-1) to tens of terabytes for modern ones (Meta reports that Llama 3.1 was trained on more than 15 trillion tokens). Operators rarely interact with training data directly, but they benefit from understanding that a model's performance is largely bounded by its training data: a model trained on diverse, high-quality data tends to generalize better to niche tasks.
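To make the training objective concrete, here is a minimal sketch of next-token cross-entropy on a toy example. The four-token vocabulary, the probability values, and the function name next_token_cross_entropy are illustrative assumptions, not taken from any real model or library. The loss is the average negative log-probability the model assigns to the token that actually came next.

```python
import math


def next_token_cross_entropy(predicted_probs, target_ids):
    """Average negative log-probability assigned to each correct next token."""
    losses = [
        -math.log(probs[target])
        for probs, target in zip(predicted_probs, target_ids)
    ]
    return sum(losses) / len(losses)


# Hypothetical predictions over a 4-token vocabulary, one row per position.
predicted_probs = [
    [0.70, 0.10, 0.10, 0.10],  # confident and correct (target is token 0)
    [0.25, 0.25, 0.25, 0.25],  # no idea: uniform over the vocabulary
    [0.10, 0.10, 0.10, 0.70],  # confident and correct (target is token 3)
]
target_ids = [0, 2, 3]  # the tokens that actually came next in the text

loss = next_token_cross_entropy(predicted_probs, target_ids)
perplexity = math.exp(loss)  # exponentiated loss, a common way to report it
print(f"loss={loss:.3f} perplexity={perplexity:.3f}")
```

Training adjusts the weights so that this number falls across the whole corpus. A model that always guessed uniformly over this toy vocabulary would score ln(4), about 1.386, at every position.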
Practical example
When you download a model like llama3.1:8b from Ollama, you're getting weights learned from Meta's training data, reported as more than 15 trillion tokens. An instruction-tuned variant such as llama3.1:8b-instruct starts from the base weights and is further trained on instruction-following data (e.g., user-assistant dialogues). The original training data is not included in the download; only the learned parameters (weights) are distributed.
Workflow example
In practice, operators select models partly on the reputation of their training data. For example, when running ollama pull llama3.1:8b, you trust that Meta's training data pipeline produced a capable model. If you want a model specialized for code, you might pull codellama:7b, which builds on Llama 2 with additional training on code-heavy data. The training data itself is never loaded into VRAM; only the model weights derived from it are.
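Because only the weights are loaded, a rough memory estimate needs just the parameter count and the precision of each weight. The sketch below is a back-of-the-envelope calculation under stated assumptions: the function name weight_memory_gib is hypothetical, the 8 billion figure is a nominal parameter count, and the result covers weights only, ignoring the KV cache, activations, and runtime overhead.

```python
def weight_memory_gib(num_parameters, bits_per_weight):
    """Approximate memory needed for the weights alone, in GiB."""
    total_bytes = num_parameters * bits_per_weight / 8
    return total_bytes / (1024 ** 3)


# Nominal parameter count for an "8b" model (assumption: the exact
# count of any real model differs somewhat from the rounded label).
params_8b = 8_000_000_000

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(params_8b, bits):.1f} GiB")
```

The number of training tokens does not appear anywhere in the calculation: two models with the same parameter count and quantization need roughly the same memory for weights, however much data each was trained on.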
Reviewed by Fredoline Eruo. See our editorial policy.