ETL

ETL (Extract, Transform, Load) is a data pipeline process that pulls raw data from sources (Extract), cleans or reformats it (Transform), and stores it in a target system (Load). In local AI, ETL prepares datasets for fine-tuning or for RAG (Retrieval-Augmented Generation). Operators run ETL scripts to convert scraped web pages, PDFs, or logs into structured formats (e.g., JSONL) that training and indexing tools can ingest. The transform step often includes deduplication, tokenization, or chunking so that each piece fits within the model's context window. ETL matters because raw data rarely matches the format expected by training or inference pipelines; skipping it leads to garbage in, garbage out.
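The three stages can be sketched as three small functions. This is a minimal illustration using only the Python standard library, not a reference implementation: the function names, the choice of .txt files as the source, and the {"source", "text"} record layout are all assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def extract(source_dir):
    """Extract: yield (filename, raw text) for every .txt file in a folder."""
    for path in sorted(Path(source_dir).glob("*.txt")):
        yield path.name, path.read_text(encoding="utf-8")


def transform(records):
    """Transform: collapse whitespace and drop empty or duplicate texts."""
    seen = set()
    for name, raw in records:
        text = " ".join(raw.split())  # normalize runs of whitespace
        if not text:
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier record
        seen.add(digest)
        yield {"source": name, "text": text}


def load(rows, target_path):
    """Load: write one JSON object per line (JSONL) and return the row count."""
    count = 0
    with open(target_path, "w", encoding="utf-8") as out:
        for row in rows:
            out.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count


def run_etl(source_dir, target_path):
    """Chain the three stages; generators keep memory use flat."""
    return load(transform(extract(source_dir)), target_path)
```

Calling run_etl("raw_docs", "dataset.jsonl") (both paths hypothetical) would write the cleaned records and return how many were kept. Keeping each stage as its own function makes it easy to swap the source or the target without touching the cleaning logic.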

An operator building a RAG pipeline for internal docs runs a Python ETL script. Extract reads 500 PDFs from a folder. Transform uses PyMuPDF to pull the text out of each PDF, splits it into 256-token chunks with a 64-token overlap, and embeds the chunks with a local model such as all-MiniLM-L6-v2, which by default truncates input beyond 256 word pieces, so larger chunks would be only partly embedded. Load inserts the chunks and their embeddings into a ChromaDB vector store. Without ETL there is nothing to index: the embedding model takes plain text, not PDF files.
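A rough sketch of that pipeline follows. The overlapping-window chunker is plain Python; build_index then wires it to PyMuPDF, sentence-transformers, and ChromaDB. Several things here are assumptions for illustration: the function names, the database path and collection name, and above all the use of whitespace-separated words as a stand-in for model tokens. A real pipeline would count tokens with the embedding model's own tokenizer and keep the chunk size within that model's input limit.

```python
from pathlib import Path


def chunk_tokens(tokens, size, overlap):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def build_index(pdf_dir, size, overlap,
                db_path="./chroma_db", collection_name="internal_docs"):
    """Extract text from PDFs, chunk and embed it, and load it into ChromaDB."""
    # Third-party imports stay local so chunk_tokens works without them.
    import fitz  # PyMuPDF
    import chromadb
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_or_create_collection(collection_name)

    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        # Extract: concatenate the text of every page.
        with fitz.open(str(pdf_path)) as doc:
            text = "\n".join(page.get_text() for page in doc)

        # Transform: words approximate tokens here (an assumption, see above).
        chunks = [" ".join(c) for c in chunk_tokens(text.split(), size, overlap)]
        if not chunks:
            continue  # e.g. a scanned PDF with no text layer
        embeddings = model.encode(chunks).tolist()

        # Load: one record per chunk, keyed by file name and chunk index.
        collection.add(
            ids=[f"{pdf_path.name}-{i}" for i in range(len(chunks))],
            documents=chunks,
            embeddings=embeddings,
            metadatas=[{"source": pdf_path.name} for _ in chunks],
        )
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.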

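An operator preparing a fine-tuning dataset exports chat logs from a Discord bot (Extract), runs a Python script to deduplicate messages and format them as {"prompt":"...","completion":"..."} JSONL (Transform), then copies the file into the training folder of their fine-tuning tool (Load). LM Studio and Ollama run models rather than train them, so the training itself happens in a separate tool. The ETL output reaches them indirectly: in an Ollama workflow, for example, ollama create builds a model from a Modelfile, which can reference a LoRA adapter trained on the pre-processed dataset through its ADAPTER instruction.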
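The transform step for chat logs might look like the sketch below. The export format is hypothetical: it assumes a chronological list of messages, each a dict with "author" and "content" keys, and it treats a bot reply that directly follows a non-bot message as that message's completion. Real Discord exports vary by tool, so the field names would need adjusting.

```python
import json


def to_training_pairs(messages, bot_name):
    """Pair each non-bot message with the bot reply that immediately follows it."""
    pairs = []
    seen = set()
    for prev, curr in zip(messages, messages[1:]):
        # Keep only (user message, bot reply) neighbours.
        if prev["author"] == bot_name or curr["author"] != bot_name:
            continue
        prompt = prev["content"].strip()
        completion = curr["content"].strip()
        if not prompt or not completion:
            continue
        key = (prompt.lower(), completion.lower())
        if key in seen:
            continue  # repeat of an earlier pair, ignoring case
        seen.add(key)
        pairs.append({"prompt": prompt, "completion": completion})
    return pairs


def to_jsonl(pairs):
    """Serialize the pairs as JSONL: one JSON object per line."""
    return "".join(json.dumps(p, ensure_ascii=False) + "\n" for p in pairs)
```

Deduplicating on the lowercased pair is one simple choice; a stricter pipeline might also filter near-duplicates or very short exchanges before training.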

Reviewed by Fredoline Eruo. See our editorial policy.
