11. Data Validation
Data validation ensures incoming data meets quality standards before training or inference. Bad data produces bad models. Unlike model validation (which evaluates predictions), data validation evaluates inputs.
Core validation dimensions:
- Completeness: No missing values beyond acceptable thresholds
- Correctness: Values within valid ranges
- Consistency: Formats and value distributions match across batches and sources
- Freshness: Data is recent enough for the use case
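The four dimensions above can be sketched as plain Python checks, without any validation library. This is a minimal illustration, not a reference implementation: the column names (`text`, `label`, `created_at`), the 5% null threshold, the label set, and the 30-day freshness window are all assumptions made for the example. The consistency check covers only format (value types), not distributions.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds and column names, chosen only for illustration.
MAX_NULL_FRACTION = 0.05
VALID_LABELS = {0, 1}
MAX_AGE = timedelta(days=30)


def check_dimensions(rows: list[dict], now: datetime) -> dict[str, bool]:
    """Return one pass/fail flag per validation dimension."""
    names = ("completeness", "correctness", "consistency", "freshness")
    if not rows:
        # An empty dataset fails every dimension.
        return {name: False for name in names}

    texts = [row.get("text") for row in rows]
    labels = [row.get("label") for row in rows]

    # Completeness: the fraction of missing text values stays under the threshold.
    null_fraction = sum(text is None for text in texts) / len(rows)
    completeness = null_fraction <= MAX_NULL_FRACTION

    # Correctness: every label that is present falls in the allowed set.
    correctness = all(label in VALID_LABELS for label in labels if label is not None)

    # Consistency (format only): all non-missing text values share one type.
    consistency = len({type(text) for text in texts if text is not None}) <= 1

    # Freshness: the newest record is recent enough.
    newest = max(row["created_at"] for row in rows)
    freshness = (now - newest) <= MAX_AGE

    return dict(zip(names, (completeness, correctness, consistency, freshness)))


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        {"text": "great product", "label": 1, "created_at": now - timedelta(days=2)},
        {"text": "never again", "label": 0, "created_at": now - timedelta(days=9)},
    ]
    print(check_dimensions(sample, now))
```

Returning one flag per dimension keeps the report easy to log and makes it clear which kind of problem a dataset has.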
The Great Expectations library provides declarative data validation:
# validate_data.py
# Note: ge.from_pandas belongs to the legacy Great Expectations dataset API
# and is not available in the 1.x releases; pin a version that still ships it.
import great_expectations as ge
import pandas as pd


def validate_training_data(path: str) -> dict:
    """Validate dataset and return validation report."""
    # Wrap the DataFrame so expectation methods are available on it
    df = ge.from_pandas(pd.read_csv(path))

    # Define expectations as data so they are easy to extend
    expectations = [
        {"expectation": "expect_column_to_exist", "kwargs": {"column": "text"}},
        {"expectation": "expect_column_to_exist", "kwargs": {"column": "label"}},
        {"expectation": "expect_column_values_to_not_be_null", "kwargs": {"column": "text"}},
        {"expectation": "expect_column_value_lengths_to_be_between",
         "kwargs": {"column": "text", "min_value": 1, "max_value": 10000}},
        {"expectation": "expect_column_values_to_be_in_set",
         "kwargs": {"column": "label", "value_set": [0, 1]}},
    ]

    results = {"passed": [], "failed": []}
    for exp in expectations:
        try:
            # Look up the method inside the try so an unknown name is reported, not raised
            expectation = getattr(df, exp["expectation"])
            result = expectation(**exp["kwargs"])
            if result.success:
                results["passed"].append(exp["expectation"])
            else:
                results["failed"].append({
                    "expectation": exp["expectation"],
                    "details": result
                })
        except Exception as e:
            # Record the error (for example, a missing column) as a failure
            results["failed"].append({
                "expectation": exp["expectation"],
                "error": str(e)
            })
    return results


report = validate_training_data("data/training.csv")
print(f"Passed: {len(report['passed'])}, Failed: {len(report['failed'])}")
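A report like the one returned by validate_training_data is most useful when it can stop a pipeline. The sketch below turns a report dictionary with "passed" and "failed" lists into an exit code and readable messages. The function name `summarize_report` is hypothetical, and the only assumption about the report is the shape used above: each failure carries either an "error" string or a "details" entry.

```python
def summarize_report(report: dict) -> tuple[int, list[str]]:
    """Return (exit_code, messages) for a report with 'passed' and 'failed' lists."""
    messages = []
    for failure in report.get("failed", []):
        # "error" is set when the expectation call raised;
        # otherwise the check ran and the data did not meet it.
        reason = failure.get("error", "expectation not met")
        messages.append(f"{failure['expectation']}: {reason}")
    exit_code = 1 if messages else 0
    return exit_code, messages


if __name__ == "__main__":
    # Hand-written example report, not real validation output.
    example = {
        "passed": ["expect_column_to_exist"],
        "failed": [{"expectation": "expect_column_values_to_be_in_set",
                    "error": "column 'label' not found"}],
    }
    code, messages = summarize_report(example)
    for message in messages:
        print(message)
    print(f"exit code: {code}")
```

In a training script, passing the returned code to `sys.exit` would make a failed validation stop the job before any model is trained.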
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
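One way to capture the details listed above is a small helper that collects them into a single record. This is a sketch using only the standard library; the function name `build_run_record` and the field names are assumptions, and the observed output is a placeholder to be replaced with what the run actually printed.

```python
import json
import platform
from importlib import metadata


def build_run_record(package: str, data_path: str, observed_output: str) -> dict:
    """Collect package version, runtime, data path, and output in one dict."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        # Keep the record usable even when the package is missing.
        version = "not installed"
    return {
        "package": package,
        "package_version": version,
        "python": platform.python_version(),
        "platform": platform.platform(),
        "data_path": data_path,
        "observed_output": observed_output,
    }


if __name__ == "__main__":
    record = build_run_record(
        "great_expectations",
        "data/training.csv",
        "<paste the observed output here>",  # placeholder, not a real result
    )
    print(json.dumps(record, indent=2))
```

Saving the printed JSON next to the exercise notes makes it possible to compare runs across machines later.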
Install Great Expectations. Create expectations for a dataset you're using. Run validation and fix any failures. Add distribution expectations comparing your data to a baseline distribution.
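For the distribution comparison, one common drift measure is the Population Stability Index (PSI), sketched here in plain Python and independent of Great Expectations. This is a swapped-in illustration rather than the library's own mechanism. The bin count, the epsilon floor for empty bins, and the choice to take bin edges from the baseline range are assumptions. A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but that is a convention, not a standard.

```python
import math


def population_stability_index(baseline, current, bins: int = 10, eps: float = 1e-6) -> float:
    """Compare two numeric samples; 0.0 means identical binned distributions."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")

    # Bin edges come from the baseline range (an assumption of this sketch).
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # constant baseline: avoid division by zero

    def bin_fractions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            index = min(max(index, 0), bins - 1)  # clamp out-of-range values
            counts[index] += 1
        # Floor empty bins at eps so the logarithm is defined.
        return [max(count / len(values), eps) for count in counts]

    expected = bin_fractions(baseline)
    actual = bin_fractions(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


if __name__ == "__main__":
    # Synthetic data for illustration only.
    baseline_lengths = list(range(100))
    shifted_lengths = [value + 50 for value in baseline_lengths]
    print(population_stability_index(baseline_lengths, baseline_lengths))
    print(population_stability_index(baseline_lengths, shifted_lengths))
```

Applied to a column such as text length, the baseline would be a trusted earlier dataset and the current sample the batch under validation.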