Data Pipeline

A data pipeline is a sequence of automated steps that ingest, transform, and load data from a source to a destination. In local AI, operators build pipelines to prepare training datasets or to feed inputs to a model during streaming inference. Common steps include downloading, cleaning, tokenizing, and batching data before it reaches the model. Pipelines matter because raw data (e.g., web scrapes, logs) is rarely model-ready; a broken pipeline means garbage in, garbage out, with VRAM and compute wasted on malformed inputs.
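The cleaning, tokenizing, and batching steps can be sketched as chained Python generators. This is a minimal illustration, not a production design: the function names are hypothetical, and the whitespace tokenizer is an assumption standing in for a real model tokenizer.

```python
from typing import Iterable, Iterator


def clean(lines: Iterable[str]) -> Iterator[str]:
    """Collapse runs of whitespace and drop lines that end up empty."""
    for line in lines:
        text = " ".join(line.split())
        if text:
            yield text


def tokenize(texts: Iterable[str]) -> Iterator[list[str]]:
    """Toy tokenizer (assumption): lowercase, then split on whitespace."""
    for text in texts:
        yield text.lower().split()


def batch(items: Iterable[list[str]], batch_size: int) -> Iterator[list[list[str]]]:
    """Group items into lists of at most batch_size; the last batch may be short."""
    current: list[list[str]] = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        yield current


raw_lines = ["  Hello   world ", "", "Local AI\tpipelines", "   ", "Third line"]
batches = list(batch(tokenize(clean(raw_lines)), batch_size=2))
print(batches)
```

Because each stage is a generator, nothing is held in memory beyond the batch currently being assembled, which is the same property that makes streaming pipelines practical on machines with limited RAM.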

Deeper dive

Data pipelines in local AI typically involve three phases: extraction (pulling data from APIs, files, or databases), transformation (deduplication, filtering, format conversion, tokenization), and loading (writing to a dataset format like JSONL or Arrow, or feeding directly into a model's inference loop). Operators often use tools like Hugging Face Datasets, Apache Arrow, or custom Python scripts with multiprocessing to parallelize the work. For training, the pipeline must shuffle, batch, and pad sequences efficiently to keep GPU utilization high. A poorly designed pipeline becomes the bottleneck, leaving the GPU idle while the CPU struggles to prepare the next batch. Comparing pipeline throughput with model throughput helps identify the mismatch, but the two rates are only comparable in the same unit: measure both in tokens/sec (or both in samples/sec). If the pipeline's rate is not comfortably above the model's, the GPU is waiting on data.
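One way to check for a data-side bottleneck is to time the pipeline on its own and compare the result with the model's measured rate. The sketch below assumes batches are lists of token-ID lists; the function names, the simulated slow pipeline, and the 1.2x headroom factor are illustrative assumptions rather than established conventions.

```python
import time
from typing import Iterable, Iterator


def tokens_per_second(batches: Iterable[list[list[int]]]) -> float:
    """Consume every batch and return how many tokens were produced per second."""
    start = time.perf_counter()
    total_tokens = 0
    for batch in batches:
        total_tokens += sum(len(seq) for seq in batch)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed if elapsed > 0 else float("inf")


def is_data_bound(pipeline_tps: float, model_tps: float, headroom: float = 1.2) -> bool:
    """True when the pipeline cannot supply tokens with the requested headroom.

    headroom=1.2 (assumption) asks for the pipeline to be 20% faster than the model.
    """
    return pipeline_tps < model_tps * headroom


def slow_batches() -> Iterator[list[list[int]]]:
    """Stand-in for a real pipeline: each batch takes about 10 ms to prepare."""
    for _ in range(5):
        time.sleep(0.01)
        yield [[1, 2, 3, 4], [5, 6, 7, 8]]


measured = tokens_per_second(slow_batches())
print(f"pipeline: {measured:.0f} tokens/sec")
# 5000.0 is a made-up model throughput used only for illustration.
print("data-bound:", is_data_bound(measured, model_tps=5000.0))
```

Timing the pipeline in isolation, with no model attached, gives its upper bound; if that number is already below the model's rate, adding workers or precomputing tokenization is worth trying before touching the model side.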

Practical example

An operator fine-tuning Llama 3.1 8B on a custom dataset uses a five-step pipeline: (1) download 10 GB of text files, (2) deduplicate with MinHash, (3) tokenize with the model's tokenizer, (4) pack the tokens into sequences of 4096 tokens, (5) write the result to sharded Arrow files. Without deduplication, the model may overfit on repeated text; without packing, short sequences waste VRAM on padding.
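Steps (2) and (4) can be sketched in plain Python. This is a simplified illustration under stated assumptions: the MinHash here compares every document against every kept document, whereas large-scale pipelines usually add locality-sensitive hashing to avoid that quadratic cost; the 3-word shingles, 64 hash functions, 0.5 threshold, and all function names are illustrative choices, and the packer simply drops the trailing partial block.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text: str, num_perm: int = 64) -> list[int]:
    """One minimum hash value per seeded hash function."""
    items = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash
        signature.append(min(
            int.from_bytes(hashlib.blake2b(item, digest_size=8, salt=salt).digest(), "big")
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions, which estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs: list[str], threshold: float = 0.5) -> list[str]:
    """Keep a document only if it is not too similar to any already kept one."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept


def pack(token_seqs: list[list[int]], seq_len: int, eos_id: int) -> list[list[int]]:
    """Join sequences with eos_id between them, then cut into full seq_len blocks."""
    stream: list[int] = []
    for seq in token_seqs:
        stream.extend(seq)
        stream.append(eos_id)
    full_blocks = len(stream) // seq_len  # the trailing partial block is dropped
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(full_blocks)]


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",
    "sharded arrow files keep large datasets manageable on disk",
]
print(len(deduplicate(docs)), "documents kept")
print(pack([[1, 2, 3], [4, 5], [6, 7, 8, 9]], seq_len=4, eos_id=0))
```

Packing fills every position with real tokens, so no padding is needed, at the cost of blocks that span document boundaries. Whether that trade-off is acceptable depends on the training setup.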

Workflow example

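When training with the Hugging Face stack, operators typically define the pipeline with datasets.load_dataset() followed by dataset.map(tokenize_function, batched=True). With batched=True, map passes many rows to the tokenizer at once, which lets fast tokenizers work efficiently; adding num_proc=N spreads the map across N CPU processes. A PyTorch DataLoader with num_workers=4 then assembles batches in background worker processes on the CPU, and the training loop moves each batch to the GPU. If the pipeline is slow, operators can increase num_workers or num_proc. If the dataset is too large to download or store up front, streaming mode (streaming=True) reads examples lazily instead of materializing the whole dataset first.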

Reviewed by Fredoline Eruo.
