RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
MLOps for Local AI

11. Data Validation

Chapter 11 of 24 · 15 min
KEY INSIGHT

Schema validation catches structural problems. Distribution validation catches semantic problems. A dataset can have a perfect schema yet shifted distributions that degrade model performance. For distribution validation, compare the current training data against a baseline:

```python
# Distribution drift detection
import pandas as pd
import scipy.stats as stats

def detect_drift(baseline_path: str, current_path: str, column: str, threshold: float = 0.05):
    baseline = pd.read_csv(baseline_path)[column].dropna()
    current = pd.read_csv(current_path)[column].dropna()

    # Chi-square test for categorical, KS test for continuous
    if not pd.api.types.is_numeric_dtype(baseline):
        observed = current.value_counts()
        # Align baseline counts to the observed categories. The +0.5 smoothing
        # is an assumed choice that keeps every expected frequency above zero.
        expected = baseline.value_counts().reindex(observed.index, fill_value=0) + 0.5
        # chisquare needs expected and observed frequencies to share one total
        expected = expected / expected.sum() * observed.sum()
        stat, p_value = stats.chisquare(f_obs=observed.to_numpy(), f_exp=expected.to_numpy())
    else:
        stat, p_value = stats.ks_2samp(baseline, current)

    drift_detected = p_value < threshold
    return {
        "drift_detected": bool(drift_detected),
        "p_value": float(p_value),
        "statistic": float(stat),
    }
```
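To make the schema-versus-distribution distinction concrete, here is a minimal sketch, assuming two numeric samples that would both pass the same schema checks. The helper `ks_statistic` is hypothetical and uses only the standard library. It computes the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical CDFs. It does not compute a p-value.

```python
from bisect import bisect_right

def ks_statistic(baseline: list[float], current: list[float]) -> float:
    """Largest absolute gap between the two empirical CDFs."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(baseline), sorted(current)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

# Both samples are floats in [0, 1] with no nulls, so a schema check passes for each.
baseline = [0.1, 0.2, 0.3, 0.4, 0.5]
shifted = [0.5, 0.6, 0.7, 0.8, 0.9]

same = ks_statistic(baseline, baseline)
drift = ks_statistic(baseline, shifted)
print(f"same data: {same:.2f}, shifted data: {drift:.2f}")
```

A statistic near 0 means the samples look alike, and a value near 1 means they barely overlap. Where to draw the line is a judgment call for your dataset.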

Data validation ensures incoming data meets quality standards before training or inference. Bad data produces bad models. Unlike model validation (which evaluates predictions), data validation evaluates inputs.

Core validation dimensions:

  • Completeness: Missing values stay within an acceptable threshold
  • Correctness: Values fall within valid ranges or allowed sets
  • Consistency: Formats and value distributions match across batches and sources
  • Freshness: Data is recent enough for the use case
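The four dimensions above can be sketched as one check per dimension. This is an illustration only: the `check_dimensions` helper, the `collected_at` field, the 5% missing-value threshold, and the 30-day freshness window are all assumptions, and the sketch uses the standard library so it runs without extra installs.

```python
from datetime import datetime, timedelta, timezone

def check_dimensions(rows: list[dict], now: datetime) -> dict[str, bool]:
    """Return one pass/fail flag per validation dimension."""
    n = len(rows)
    texts = [r.get("text") for r in rows]
    missing = sum(1 for t in texts if t is None or t == "")
    return {
        # Completeness: at most 5% of rows may lack text (assumed threshold).
        "completeness": n > 0 and missing / n <= 0.05,
        # Correctness: every label must be one of the allowed values.
        "correctness": all(r.get("label") in (0, 1) for r in rows),
        # Consistency: every text value that is present has the same type.
        "consistency": all(isinstance(t, str) for t in texts if t is not None),
        # Freshness: the newest row is at most 30 days old (assumed window).
        "freshness": n > 0
        and now - max(r["collected_at"] for r in rows) <= timedelta(days=30),
    }

now = datetime(2026, 1, 31, tzinfo=timezone.utc)
rows = [
    {"text": "great card", "label": 1, "collected_at": now - timedelta(days=2)},
    {"text": "runs hot", "label": 0, "collected_at": now - timedelta(days=40)},
    {"text": "fine", "label": 2, "collected_at": now - timedelta(days=10)},
]
report = check_dimensions(rows, now)
print(report)
```

Only the correctness check fails here, because the third row carries a label outside the allowed set.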

The Great Expectations library provides declarative data validation:

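# validate_data.py
# Note: ge.from_pandas belongs to the legacy Great Expectations Dataset API.
# If the call is missing in your installed version, pin an older release that
# still ships it and record that version with your results.
import great_expectations as ge
import pandas as pd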

def validate_training_data(path: str) -> dict:
    """Validate dataset and return validation report."""
    df = ge.from_pandas(pd.read_csv(path))
    
    # Define expectations
    expectations = [
        {"expectation": "expect_column_to_exist", "kwargs": {"column": "text"}},
        {"expectation": "expect_column_to_exist", "kwargs": {"column": "label"}},
        {"expectation": "expect_column_values_to_not_be_null", "kwargs": {"column": "text"}},
        {"expectation": "expect_column_value_lengths_to_be_between", 
         "kwargs": {"column": "text", "min_value": 1, "max_value": 10000}},
        {"expectation": "expect_column_values_to_be_in_set", 
         "kwargs": {"column": "label", "value_set": [0, 1]}},
    ]
    
    results = {"passed": [], "failed": []}
    
    for exp in expectations:
        expectation = getattr(df, exp["expectation"])
        try:
            result = expectation(**exp["kwargs"])
            if result.success:
                results["passed"].append(exp["expectation"])
            else:
                results["failed"].append({
                    "expectation": exp["expectation"],
                    "details": result
                })
        except Exception as e:
            results["failed"].append({
                "expectation": exp["expectation"],
                "error": str(e)
            })
    
    return results

if __name__ == "__main__":
    report = validate_training_data("data/training.csv")
    print(f"Passed: {len(report['passed'])}, Failed: {len(report['failed'])}")
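Because validation is meant to run before training or inference, the report is only useful if a failure stops the next step. The sketch below is one hedged way to do that. The `gate` helper is hypothetical and assumes a report shaped like the one `validate_training_data` returns, with `passed` and `failed` lists.

```python
import sys

def gate(report: dict) -> int:
    """Map a validation report to a process exit code (0 means proceed)."""
    failed = report.get("failed", [])
    for item in failed:
        # Each failed entry carries either "details" or "error".
        reason = item.get("error", "expectation not met")
        print(f"FAILED {item['expectation']}: {reason}", file=sys.stderr)
    return 1 if failed else 0

# Hand-written report in the same shape, so the sketch runs without a dataset.
example = {
    "passed": ["expect_column_to_exist"],
    "failed": [
        {"expectation": "expect_column_values_to_not_be_null", "error": "3 nulls"}
    ],
}
exit_code = gate(example)
print(f"exit code: {exit_code}")
```

In a script, passing the result to `sys.exit` gives the shell a non-zero status on failure, so a command chained with `&&` would not start training.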

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
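One possible way to capture those details is a small record written next to your results. This is a sketch using only the standard library. The `environment_record` helper and the package list are assumptions, so adjust both to what your exercise actually imports.

```python
import json
import platform
import sys
from importlib import metadata

def environment_record(packages: list[str], data_path: str) -> dict:
    """Collect the facts the checkpoint asks for into one JSON-friendly dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
        "data_path": data_path,
    }

record = environment_record(
    ["great_expectations", "pandas", "scipy"], "data/training.csv"
)
print(json.dumps(record, indent=2))
```

Paste the printed record beside the observed output so a later run can be compared against the same versions.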

EXERCISE

Install Great Expectations. Create expectations for a dataset you're using. Run validation and fix any failures. Add distribution expectations comparing your data to a baseline distribution.
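For the last step, one possible starting point is to derive expectation bounds from baseline statistics instead of hard-coding them. The sketch below is not the chapter's solution. The `mean_expectation_from_baseline` helper, the `text_length` column, and the 10% relative tolerance are assumptions. It only builds an entry in the same shape as the chapter's `expectations` list and does not call Great Expectations itself.

```python
import statistics

def mean_expectation_from_baseline(
    column: str, baseline: list[float], rel_tolerance: float = 0.10
) -> dict:
    """Build an expectation entry whose bounds come from the baseline mean."""
    if not baseline:
        raise ValueError("baseline must be non-empty")
    mean = statistics.fmean(baseline)
    # Allow the current mean to move by rel_tolerance of the baseline mean
    # (an assumed tolerance; tune it to how stable your data usually is).
    margin = abs(mean) * rel_tolerance
    return {
        "expectation": "expect_column_mean_to_be_between",
        "kwargs": {
            "column": column,
            "min_value": mean - margin,
            "max_value": mean + margin,
        },
    }

# Hypothetical baseline: character lengths of the text column in an older batch.
baseline_lengths = [100.0, 120.0, 80.0, 100.0]
entry = mean_expectation_from_baseline("text_length", baseline_lengths)
print(entry)
```

A mean check only notices shifts in the center of a distribution. Changes in spread or shape need a test like the drift detection in the key insight.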
