RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Custom Training Pipelines

03. Dataset Curation

Chapter 3 of 18 · 20 min
KEY INSIGHT

Validate dataset integrity before training begins. Catch corrupted images, missing labels, and class imbalance in the curation phase, not during training.

Garbage in, garbage out. No amount of architecture tuning compensates for a corrupted dataset. Dataset curation is unglamorous work that determines whether the project succeeds.

Integrity Checking

Before training begins, validate the dataset programmatically:

from pathlib import Path
import torch
from PIL import Image
import json

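def validate_dataset(data_dir: Path):
    """Fail fast on corrupted data."""
    issues = []

    for split in ["train", "val", "test"]:
        split_dir = data_dir / split
        if not split_dir.exists():
            issues.append(f"Missing split directory: {split_dir}")
            continue

        manifest_path = split_dir / "manifest.json"
        if not manifest_path.exists():
            # A split without a manifest cannot be checked, so report it
            # instead of silently passing.
            issues.append(f"Missing manifest: {manifest_path}")
            continue

        with open(manifest_path) as f:
            manifest = json.load(f)

        for entry in manifest:
            # Paths are used exactly as written in the manifest. If yours
            # are relative to the split directory, resolve them first.
            path = Path(entry["path"])
            if not path.exists():
                issues.append(f"Missing file: {path}")
            elif entry.get("type") == "image":
                try:
                    # verify() checks file structure without decoding pixels;
                    # the context manager closes the file handle afterwards.
                    with Image.open(path) as img:
                        img.verify()
                except Exception as e:
                    issues.append(f"Corrupt image: {path} - {e}")

    if issues:
        raise ValueError("Dataset validation failed:\n" + "\n".join(issues))

    print("Dataset validation passed")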
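
The validator above covers missing files and corrupt images, but not the missing labels mentioned in the key insight. A label check could sit alongside it. What follows is a minimal sketch, assuming each manifest entry carries an integer "label" field in the range 0 to num_classes - 1. The source only shows the "path" and "type" fields, so the "label" key and the helper name find_label_issues are hypothetical.

```python
def find_label_issues(manifest, num_classes):
    """Return a list of problems with the labels in a manifest.

    Assumes each entry is a dict with a "path" and an integer "label"
    in the range [0, num_classes). The "label" key is an assumption,
    not something the original manifest format is known to contain.
    """
    issues = []
    for entry in manifest:
        path = entry.get("path", "<no path>")
        label = entry.get("label")
        if label is None:
            issues.append(f"Missing label: {path}")
        # bool is a subclass of int in Python, so reject it explicitly
        elif isinstance(label, bool) or not isinstance(label, int):
            issues.append(f"Non-integer label {label!r}: {path}")
        elif not 0 <= label < num_classes:
            issues.append(f"Label {label} out of range: {path}")
    return issues


# Small in-memory manifest for illustration only
manifest = [
    {"path": "train/a.png", "type": "image", "label": 0},
    {"path": "train/b.png", "type": "image"},
    {"path": "train/c.png", "type": "image", "label": 7},
]
for issue in find_label_issues(manifest, num_classes=3):
    print(issue)
```

Returning a list rather than raising keeps the helper composable: its output can be appended to the issues list that validate_dataset already collects, so all problems are reported in one pass.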

Handling Class Imbalance

Many real datasets have severe class imbalance. Training naively leads to a model that predicts the majority class and achieves high accuracy while doing nothing useful.
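
To make that failure mode concrete, here is a small illustration using made-up numbers: 990 samples of one class and 10 of another, scored against a "model" that always predicts the majority class. The label counts are hypothetical and chosen only to show the effect.

```python
from collections import Counter

# Hypothetical label distribution: 990 negatives, 10 positives
labels = [0] * 990 + [1] * 10

# A "model" that always predicts the most common class
majority_class = Counter(labels).most_common(1)[0][0]
predictions = [majority_class] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def recall(cls):
    """Fraction of samples of class `cls` that were predicted correctly."""
    hits = sum(1 for p, y in zip(predictions, labels) if y == cls and p == cls)
    return hits / labels.count(cls)


print(f"accuracy: {accuracy:.2f}")          # 0.99
print(f"recall, class 0: {recall(0):.2f}")  # 1.00
print(f"recall, class 1: {recall(1):.2f}")  # 0.00
```

Accuracy alone hides the problem; per-class recall exposes it immediately, which is why it is worth tracking on imbalanced data.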

from collections import Counter

import torch  # needed for torch.utils.data.WeightedRandomSampler below

def compute_class_weights(labels):
    """Compute inverse-frequency weights for weighted sampling."""
    counts = Counter(labels)
    total = len(labels)
    weights = {}
    for cls, count in counts.items():
        weights[cls] = total / (len(counts) * count)
    return weights

def create_balanced_sampler(labels):
    """WeightedRandomSampler for balanced mini-batches."""
    class_weights = compute_class_weights(labels)
    sample_weights = [class_weights[label] for label in labels]
    sampler = torch.utils.data.WeightedRandomSampler(
        weights=sample_weights,
        num_samples=len(labels),
        replacement=True
    )
    return sampler
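
To see what the inverse-frequency formula produces, the sketch below recomputes the weights for a hypothetical 900/100 split and sums the per-sample weights for each class. It repeats the compute_class_weights logic so that it runs without torch installed; the class names and counts are made up for illustration.

```python
from collections import Counter


def compute_class_weights(labels):
    """Same inverse-frequency formula as above: total / (num_classes * count)."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * count) for cls, count in counts.items()}


# Hypothetical imbalanced dataset: 900 cats, 100 dogs
labels = ["cat"] * 900 + ["dog"] * 100
class_weights = compute_class_weights(labels)
print(class_weights)  # the rare class gets the larger weight

# Total sampling mass per class: sum of the per-sample weights
mass = Counter()
for label in labels:
    mass[label] += class_weights[label]
print(dict(mass))  # both classes end up with roughly the same mass
```

Because each class carries about the same total weight, every draw from the sampler picks either class with roughly equal probability. With replacement=True, this means minority-class samples are repeated many times within one epoch.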

Data Versioning

Never overwrite a dataset without versioning. Use DVC, Delta Lake, or even dated directories. A training run without a data version hash is not reproducible.

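# DVC example
dvc init
dvc add data/raw
# dvc add writes data/raw.dvc and adds data/raw to data/.gitignore
git add data/raw.dvc data/.gitignore
git commit -m "Track data/raw with DVC"
dvc remote add -d myremote s3://my-bucket/data
# the remote is stored in .dvc/config, which also belongs in Git
git add .dvc/config
git commit -m "Configure DVC remote"
dvc push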
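
DVC tracks content hashes for you. For the dated-directory approach, a hash can be computed by hand and logged with each training run. The following is a minimal sketch using only the standard library; the function name hash_dataset and the choice of SHA-256 are assumptions, not part of any tool mentioned above.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_dataset(data_dir: Path) -> str:
    """Return a SHA-256 hex digest over every file under data_dir."""
    digest = hashlib.sha256()
    # Sort so the digest does not depend on filesystem listing order
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        # Include the relative path so renames and moves change the digest
        digest.update(path.relative_to(data_dir).as_posix().encode("utf-8"))
        digest.update(b"\0")
        # Read in 1 MiB chunks to keep memory use flat on large files
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway directory: editing one file changes the digest
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "train").mkdir()
    (root / "train" / "a.txt").write_text("sample a")
    before = hash_dataset(root)
    (root / "train" / "a.txt").write_text("sample a, edited")
    after = hash_dataset(root)
    print(before != after)  # True
```

Storing the returned digest next to the model checkpoint and hyperparameters makes it possible to confirm later that a rerun saw the same bytes.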
EXERCISE

Write a script that loads 100 random samples from your dataset and checks: file exists, readable format, expected dimensions, and valid label range. Run it and fix any issues found.
