03. Dataset Curation
Chapter 3 of 18 · 20 min
Garbage in, garbage out. No amount of architecture tuning compensates for a corrupted dataset. Dataset curation is unglamorous work that determines whether the project succeeds.
Integrity Checking
Before training begins, validate the dataset programmatically:
from pathlib import Path
import torch
from PIL import Image
import json
def validate_dataset(data_dir: Path):
"""Fail fast on corrupted data."""
issues = []
for split in ["train", "val", "test"]:
split_dir = data_dir / split
if not split_dir.exists():
issues.append(f"Missing split directory: {split_dir}")
continue
manifest_path = split_dir / "manifest.json"
if manifest_path.exists():
with open(manifest_path) as f:
manifest = json.load(f)
for entry in manifest:
if not Path(entry["path"]).exists():
issues.append(f"Missing file: {entry['path']}")
elif entry["type"] == "image":
try:
img = Image.open(entry["path"])
img.verify()
except Exception as e:
issues.append(f"Corrupt image: {entry['path']} - {e}")
if issues:
raise ValueError(f"Dataset validation failed:\n" + "\n".join(issues))
print("Dataset validation passed")
Handling Class Imbalance
Many real datasets have severe class imbalance. Training naively leads to a model that predicts the majority class and achieves high accuracy while doing nothing useful.
from collections import Counter
import numpy as np
def compute_class_weights(labels):
"""Compute inverse-frequency weights for weighted sampling."""
counts = Counter(labels)
total = len(labels)
weights = {}
for cls, count in counts.items():
weights[cls] = total / (len(counts) * count)
return weights
def create_balanced_sampler(labels):
"""WeightedRandomSampler for balanced mini-batches."""
class_weights = compute_class_weights(labels)
sample_weights = [class_weights[label] for label in labels]
sampler = torch.utils.data.WeightedRandomSampler(
weights=sample_weights,
num_samples=len(labels),
replacement=True
)
return sampler
Data Versioning
Never overwrite a dataset without versioning. Use DVC, Delta Lake, or even dated directories. A training run without a data version hash is not reproducible.
# DVC example
dvc init
dvc add data/raw
git add data/raw.dvc
dvc remote add -d myremote s3://my-bucket/data
dvc push
EXERCISE
Write a script that loads 100 random samples from your dataset and checks: file exists, readable format, expected dimensions, and valid label range. Run it and fix any issues found.