Dataset Curation — Custom Training Pipelines (Chapter 3)

Garbage in, garbage out. No amount of architecture tuning compensates for a corrupted dataset. Dataset curation is unglamorous work that determines whether the project succeeds.

Integrity Checking

Before training begins, validate the dataset programmatically:

from pathlib import Path
import torch
from PIL import Image
import json

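def validate_dataset(data_dir: Path) -> None:
    """Fail fast on corrupted data: collect every issue, then raise once."""
    issues = []

    for split in ["train", "val", "test"]:
        split_dir = data_dir / split
        if not split_dir.exists():
            issues.append(f"Missing split directory: {split_dir}")
            continue

        manifest_path = split_dir / "manifest.json"
        if not manifest_path.exists():
            # A split without a manifest cannot be checked, so report it
            # instead of letting validation pass silently.
            issues.append(f"Missing manifest: {manifest_path}")
            continue

        with open(manifest_path) as f:
            manifest = json.load(f)

        for entry in manifest:
            # Paths are used exactly as written in the manifest, so relative
            # paths resolve against the current working directory.
            path = Path(entry["path"])
            if not path.exists():
                issues.append(f"Missing file: {path}")
            elif entry.get("type") == "image":
                try:
                    # verify() checks the file for damage without decoding
                    # pixel data; the with-block closes the file handle.
                    with Image.open(path) as img:
                        img.verify()
                except Exception as e:
                    issues.append(f"Corrupt image: {path} - {e}")

    if issues:
        raise ValueError("Dataset validation failed:\n" + "\n".join(issues))

    print("Dataset validation passed")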
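The validator above reads entry["path"] and entry["type"] from each manifest entry, so a malformed manifest would crash it before any file is checked. The sketch below is one possible pre-check, assuming the manifest is a JSON list of objects with those two keys. That shape is inferred from the code, not documented anywhere in this chapter, and the names REQUIRED_KEYS and find_malformed_entries are hypothetical.

```python
# Assumed manifest shape: a list of objects, each with "path" and "type".
REQUIRED_KEYS = ("path", "type")


def find_malformed_entries(manifest):
    """Return (index, reason) pairs for entries that lack the assumed shape."""
    if not isinstance(manifest, list):
        return [(-1, "manifest must be a list of entries")]

    problems = []
    for i, entry in enumerate(manifest):
        if not isinstance(entry, dict):
            problems.append((i, "entry is not an object"))
            continue
        missing = [key for key in REQUIRED_KEYS if key not in entry]
        if missing:
            problems.append((i, f"missing keys: {', '.join(missing)}"))
    return problems


# Illustrative entries only; these paths are made up for the example.
example_manifest = [
    {"path": "data/train/img_0001.png", "type": "image"},
    {"path": "data/train/notes.txt"},   # no "type" key
    "data/train/img_0002.png",          # bare string instead of an object
]

problems = find_malformed_entries(example_manifest)
for index, reason in problems:
    print(f"entry {index}: {reason}")
```

Running a check like this before the file-level validation keeps structural errors (wrong manifest shape) separate from content errors (missing or corrupt files).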

Handling Class Imbalance

Many real datasets are severely imbalanced. A model trained naively on such data can learn to predict the majority class almost every time, which yields high accuracy while missing the minority class that usually matters most.
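To make the problem concrete, the sketch below scores a "model" that always predicts the most common class. The 95/5 split and the function name majority_baseline_report are made up for illustration; the point is only that accuracy and minority-class recall can diverge sharply.

```python
from collections import Counter


def majority_baseline_report(labels, positive_class):
    """Score a baseline that always predicts the most common label."""
    majority_class, _ = Counter(labels).most_common(1)[0]
    predictions = [majority_class] * len(labels)

    # Accuracy: fraction of samples where the prediction matches the label.
    correct = sum(p == y for p, y in zip(predictions, labels))
    accuracy = correct / len(labels)

    # Recall on the class of interest: how many of its samples were found.
    positive_idx = [i for i, y in enumerate(labels) if y == positive_class]
    hits = sum(predictions[i] == positive_class for i in positive_idx)
    recall = hits / len(positive_idx) if positive_idx else 0.0
    return accuracy, recall


# Hypothetical dataset: 95 negatives, 5 positives.
labels = [0] * 95 + [1] * 5
accuracy, recall = majority_baseline_report(labels, positive_class=1)
print(f"accuracy={accuracy:.2f}, minority recall={recall:.2f}")
# prints: accuracy=0.95, minority recall=0.00
```

This is why per-class metrics such as recall are worth reporting alongside accuracy on imbalanced data.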

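from collections import Counter

import torch

def compute_class_weights(labels):
    """Compute inverse-frequency weights for weighted sampling.

    `labels` should be a sequence of hashable class labels (for example a
    list of ints); convert a tensor with .tolist() first.
    """
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    # weight = total / (num_classes * count): a class holding an equal share
    # of the data gets weight 1.0, and rarer classes get proportionally more.
    return {cls: total / (num_classes * count) for cls, count in counts.items()}

def create_balanced_sampler(labels):
    """WeightedRandomSampler for mini-batches that are balanced in expectation."""
    class_weights = compute_class_weights(labels)
    # One weight per sample, looked up from that sample's class.
    sample_weights = [class_weights[label] for label in labels]
    # replacement=True lets minority-class samples be drawn more than once
    # per epoch.
    return torch.utils.data.WeightedRandomSampler(
        weights=sample_weights,
        num_samples=len(labels),
        replacement=True,
    )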
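A short worked example shows why these weights balance the classes. The sketch below repeats the same formula in a standalone helper (inverse_frequency_weights is a hypothetical name, as are the "cat"/"dog" labels) and checks the probability of drawing each class when samples are picked in proportion to their weight, which is how WeightedRandomSampler chooses indices.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Same formula as above: total / (num_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}


# Hypothetical 90/10 split between two classes.
labels = ["cat"] * 90 + ["dog"] * 10
counts = Counter(labels)
weights = inverse_frequency_weights(labels)

# Total sampling weight held by each class: per-sample weight * sample count.
class_mass = {cls: weights[cls] * counts[cls] for cls in counts}

# Probability that a single weighted draw lands in each class.
total_mass = sum(class_mass.values())
draw_prob = {cls: mass / total_mass for cls, mass in class_mass.items()}

print({cls: round(w, 3) for cls, w in weights.items()})
print({cls: round(p, 3) for cls, p in draw_prob.items()})
# Each class ends up with draw probability 0.5 despite the 90/10 split.
```

Each "dog" sample carries nine times the weight of a "cat" sample, so the ten dogs together are as likely to be drawn as the ninety cats.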

Data Versioning

Never overwrite a dataset without versioning. Use DVC, Delta Lake, or even dated directories. A training run that does not record its data version (a hash, a tag, or a dated directory name) cannot be reproduced exactly.

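# DVC example (run inside an existing Git repository)
dvc init
dvc add data/raw
git add data/raw.dvc data/.gitignore
dvc remote add -d myremote s3://my-bucket/data
git add .dvc/config
git commit -m "Track data/raw with DVC"
dvc push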
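Where a dedicated tool is not yet in place, a data version hash can be approximated with the standard library. The following is a minimal sketch, not a replacement for DVC: it assumes the dataset fits a simple "hash every file under a directory" model, and the name dataset_fingerprint is hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def dataset_fingerprint(data_dir: Path) -> str:
    """SHA-256 over the relative path and contents of every file."""
    digest = hashlib.sha256()
    files = [p for p in data_dir.rglob("*") if p.is_file()]
    # Sort by relative POSIX path so the result does not depend on
    # directory listing order or on the platform's path separator.
    files.sort(key=lambda p: p.relative_to(data_dir).as_posix())

    for path in files:
        digest.update(path.relative_to(data_dir).as_posix().encode("utf-8"))
        digest.update(b"\0")
        with open(path, "rb") as f:
            # Read in 1 MiB chunks so large files are not loaded whole.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        digest.update(b"\0")
    return digest.hexdigest()


# Demonstration on a throwaway directory: any content change alters the hash.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "train").mkdir()
    (root / "train" / "a.txt").write_text("hello")
    before = dataset_fingerprint(root)

    (root / "train" / "a.txt").write_text("hello!")
    after = dataset_fingerprint(root)

print(before != after)  # prints: True
```

Logging this fingerprint next to the model checkpoint and hyperparameters ties each training run to the exact bytes it was trained on.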