Data Validation — MLOps for Local AI (Chapter 11)

Data validation checks that incoming data meets defined quality standards before it is used for training or inference. Bad data produces bad models: a model trained on mislabeled or malformed records learns those defects. Model validation evaluates predictions, whereas data validation evaluates the inputs.

Core validation dimensions:

Completeness: Missing values stay below an agreed threshold for each column
Correctness: Values fall within valid ranges or allowed sets
Consistency: Fields use the same format across records, and distributions match those of earlier data
Freshness: Data is recent enough for the use case
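
Each dimension can be expressed as a small check. The following is a minimal sketch in plain Python, assuming rows arrive as dictionaries with hypothetical text, label, and collected_at fields. The function name check_dimensions and the thresholds (5% missing, 30 days) are illustrative assumptions, not recommendations, and the consistency check covers format only, not distributions.

```python
from datetime import datetime, timedelta, timezone

DIMENSIONS = ("completeness", "correctness", "consistency", "freshness")

def check_dimensions(rows, now, max_missing=0.05, max_age_days=30):
    """Return a dict mapping each validation dimension to True or False."""
    total = len(rows)
    if total == 0:
        # An empty dataset fails every dimension by convention in this sketch.
        return dict.fromkeys(DIMENSIONS, False)

    # Completeness: the share of rows with missing or empty text is bounded.
    missing = sum(1 for r in rows if not r.get("text"))
    completeness = missing / total <= max_missing

    # Correctness: every label belongs to the allowed set.
    correctness = all(r.get("label") in (0, 1) for r in rows)

    # Consistency (format only): every timestamp is a datetime object.
    consistency = all(isinstance(r.get("collected_at"), datetime) for r in rows)

    # Freshness: the newest record is no older than max_age_days.
    stamps = [r["collected_at"] for r in rows
              if isinstance(r.get("collected_at"), datetime)]
    freshness = bool(stamps) and (now - max(stamps)) <= timedelta(days=max_age_days)

    return dict(zip(DIMENSIONS, (completeness, correctness, consistency, freshness)))

# Illustrative rows: one empty text and one out-of-range label.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
rows = [
    {"text": "good product", "label": 1, "collected_at": now - timedelta(days=2)},
    {"text": "", "label": 0, "collected_at": now - timedelta(days=5)},
    {"text": "late delivery", "label": 2, "collected_at": now - timedelta(days=1)},
]
dimension_report = check_dimensions(rows, now)
print(dimension_report)
```

Passing the current time as an argument, rather than reading the clock inside the function, keeps the freshness check repeatable.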

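The Great Expectations library provides declarative data validation. The example below uses the legacy ge.from_pandas interface from pre-1.0 releases; newer releases restructure the API, so check the installed version before running it: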

# validate_data.py
import great_expectations as ge
import pandas as pd

def validate_training_data(path: str) -> dict:
    """Validate dataset and return validation report."""
    df = ge.from_pandas(pd.read_csv(path))
    
    # Define expectations
    expectations = [
        {"expectation": "expect_column_to_exist", "kwargs": {"column": "text"}},
        {"expectation": "expect_column_to_exist", "kwargs": {"column": "label"}},
        {"expectation": "expect_column_values_to_not_be_null", "kwargs": {"column": "text"}},
        {"expectation": "expect_column_value_lengths_to_be_between", 
         "kwargs": {"column": "text", "min_value": 1, "max_value": 10000}},
        {"expectation": "expect_column_values_to_be_in_set", 
         "kwargs": {"column": "label", "value_set": [0, 1]}},
    ]
    
    results = {"passed": [], "failed": []}
    
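    for exp in expectations:
        try:
            # Look up the expectation method by name inside the try block, so a
            # misspelled name is recorded as a failure instead of crashing.
            expectation = getattr(df, exp["expectation"])
            result = expectation(**exp["kwargs"])
            if result.success:
                results["passed"].append(exp["expectation"])
            else:
                results["failed"].append({
                    "expectation": exp["expectation"],
                    # Convert to a plain dict so the report can be serialized.
                    "details": result.to_json_dict()
                })
        except Exception as e:
            results["failed"].append({
                "expectation": exp["expectation"],
                "error": str(e)
            })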
    
    return results

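# Run the validation only when the file is executed as a script, not on import.
if __name__ == "__main__":
    report = validate_training_data("data/training.csv")
    print(f"Passed: {len(report['passed'])}, Failed: {len(report['failed'])}")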
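
A report matters only if something acts on it. The sketch below shows one way to stop a pipeline when validation fails, assuming the report shape returned by validate_training_data above (a dict with "passed" and "failed" lists). The gate_on_report helper and its exit codes are hypothetical, and the report literal is illustrative rather than real output.

```python
import sys

def gate_on_report(report: dict) -> int:
    """Return a process exit code: 0 if no expectation failed, 1 otherwise."""
    failed = report.get("failed", [])
    for item in failed:
        name = item.get("expectation", "<unknown>")
        # Entries carry either an "error" string or a "details" payload.
        reason = item.get("error", "expectation not met")
        print(f"FAILED {name}: {reason}", file=sys.stderr)
    return 1 if failed else 0

# Illustrative report with one failed expectation.
example_report = {
    "passed": ["expect_column_to_exist"],
    "failed": [{"expectation": "expect_column_values_to_be_in_set",
                "error": "unexpected label value"}],
}
exit_code = gate_on_report(example_report)
print(exit_code)
```

In a training script, calling sys.exit(gate_on_report(report)) before the training step would prevent a run on data that failed validation.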

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
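
One way to capture those details is a small script run alongside the exercise. The sketch below uses only the Python standard library. The build_run_record helper and its field names are hypothetical, the package names passed to it are assumptions about what is installed, and the observed output shown is a placeholder to be replaced with what the run actually printed.

```python
import json
import platform
import sys
from importlib import metadata

def package_version(name: str) -> str:
    """Return the installed version of a package, or a marker if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"

def build_run_record(data_path, observed_output,
                     packages=("great_expectations", "pandas"), notes=""):
    """Collect runtime, package versions, data path, and output in one dict."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": {name: package_version(name) for name in packages},
        "data_path": data_path,
        "observed_output": observed_output,
        # Free-form constraints: model size, CPU/GPU backend, available memory.
        "notes": notes,
    }

record = build_run_record(
    data_path="data/training.csv",
    observed_output="<paste the printed summary here>",
    notes="CPU only",
)
print(json.dumps(record, indent=2))
```

Saving the printed JSON next to the exercise notes makes it possible to compare runs across machines or package upgrades.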