15. Quality Checks
Chapter 15 of 18 · 20 min
Processing documents without validation produces silent failures: documents that pass through the pipeline with missing or corrupted output. Quality checks catch these issues before downstream systems encounter them.
Defining Quality Metrics
Document processing quality depends on:
- Completeness: All expected content extracted
- Accuracy: Content matches source (no garbled text, missing sections)
- Format: Output matches schema and encoding requirements
- Timeliness: Processing completes within SLA
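Of these four metrics, timeliness is the simplest to measure directly. The following is a minimal sketch, assuming a per-document SLA expressed in seconds and a caller-supplied processing function. The names `process_with_sla`, `process_fn`, and `sla_seconds` are hypothetical and do not come from the chapter.

```python
import time

def process_with_sla(process_fn, document, sla_seconds=30.0):
    """Run process_fn(document) and report whether it finished within the SLA.

    process_fn and sla_seconds are illustrative assumptions, not part of
    any particular pipeline API.
    """
    start = time.perf_counter()
    result = process_fn(document)
    elapsed = time.perf_counter() - start
    return {
        "result": result,
        "elapsed_seconds": elapsed,
        "within_sla": elapsed <= sla_seconds,
    }

# Usage: a stand-in processing function that just upper-cases text
outcome = process_with_sla(str.upper, "sample text", sla_seconds=5.0)
print(outcome["within_sla"])
```

Recording `elapsed_seconds` alongside the pass/fail flag makes it possible to track latency trends, not just SLA breaches.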
Structural Validation
Validate output structure against expected schema:
from jsonschema import validate, ValidationError

OUTPUT_SCHEMA = {
    "type": "object",
    "required": ["path", "content", "metadata"],
    "properties": {
        "path": {"type": "string"},
        "content": {"type": "string", "minLength": 1},
        "metadata": {
            "type": "object",
            "required": ["page_count", "processed_at"]
        }
    }
}

def validate_output(result):
    """Return (is_valid, message) for one processing result."""
    try:
        validate(instance=result, schema=OUTPUT_SCHEMA)
        return True, "Valid"
    except ValidationError as e:
        return False, f"Schema validation failed: {e.message}"
Content Quality Checks
Beyond structure, verify content quality:
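def check_content_quality(content, source_path):
    # source_path is accepted for context (e.g. logging) but is not inspected here.
    issues = []
    if len(content) < 100:
        issues.append("Content suspiciously short")
    if "\x00" in content:
        issues.append("Contains null bytes")
    if content:  # guard against division by zero on empty content
        # Count control characters other than newline and tab
        non_printable = sum(1 for c in content if ord(c) < 32 and c not in "\n\t")
        non_printable_ratio = non_printable / len(content)
        if non_printable_ratio > 0.1:
            issues.append(f"High non-printable character ratio: {non_printable_ratio:.2%}")
    return issues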
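Garbled text often shows up as Unicode replacement characters (U+FFFD), which appear when bytes are decoded with the wrong encoding and decoding errors are replaced. The sketch below shows one possible additional check. The function name `check_replacement_chars` and the 1% threshold are assumptions chosen for illustration, not values from the chapter.

```python
REPLACEMENT_CHAR = "\ufffd"

def check_replacement_chars(content, max_ratio=0.01):
    """Flag content with a high share of U+FFFD replacement characters.

    max_ratio is an assumed threshold; tune it against your own corpus.
    """
    issues = []
    if not content:
        return issues  # nothing to measure; length checks handle empty content
    ratio = content.count(REPLACEMENT_CHAR) / len(content)
    if ratio > max_ratio:
        issues.append(f"High replacement character ratio: {ratio:.2%}")
    return issues

# Latin-1 bytes decoded as UTF-8 with errors="replace" yield U+FFFD characters
garbled = b"caf\xe9 \xe9t\xe9".decode("utf-8", errors="replace")
print(check_replacement_chars(garbled))
```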
Comparison Against Source
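For extracted text, estimate quality by comparing the length of the extracted text to the size of the source PDF. The ratio is a coarse signal, since file size also reflects embedded images and fonts:
from pathlib import Path

def estimate_extraction_quality(pdf_path, extracted_text):
    source_size = Path(pdf_path).stat().st_size  # file size in bytes
    if source_size == 0:
        return "Source file is empty"  # avoid division by zero
    # Characters extracted per byte of source file
    extraction_ratio = len(extracted_text) / source_size
    if extraction_ratio < 0.01:
        return "Very low extraction ratio - possible OCR failure or image-only PDF"
    elif extraction_ratio > 50:
        return "Very high extraction ratio - possible binary data included"
    elif 1 < extraction_ratio < 10:
        return "Healthy extraction ratio"
    return "Acceptable"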
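A whole-file ratio can hide individual pages that extracted nothing. If the extractor returns text per page (an assumption; the chapter does not specify this), a per-page check along these lines can locate the gaps. `find_sparse_pages` and `min_chars` are hypothetical names used only for this sketch.

```python
def find_sparse_pages(page_texts, min_chars=20):
    """Return 1-based page numbers whose stripped text is shorter than min_chars.

    Assumes page_texts is a list with one string per page, in page order.
    """
    return [
        page_number
        for page_number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]

pages = [
    "Introduction to the quarterly report and its scope.",
    "   \n",  # whitespace only, as an image-only page might produce
    "Summary of findings across all regions.",
]
print(find_sparse_pages(pages))  # [2]
```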
Automated Quality Reporting
Generate quality reports:
def generate_quality_report(results):
    total = len(results)
    passed = sum(1 for r in results if r["quality_pass"])
    report = {
        "total_documents": total,
        "passed": passed,
        "failed": total - passed,
        "pass_rate": passed / total if total > 0 else 0,
        "issues_by_type": {},
        "sample_failures": []
    }
    for result in results:
        if not result["quality_pass"]:
            # Tally failures by category
            issue_type = result.get("issue_type", "unknown")
            report["issues_by_type"][issue_type] = report["issues_by_type"].get(issue_type, 0) + 1
            # Keep at most five failing results as examples
            if len(report["sample_failures"]) < 5:
                report["sample_failures"].append(result)
    return report
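The report dictionary is convenient for machines; for a log line or an email, a plain-text rendering may be easier to scan. This is a minimal sketch, assuming the report has the keys built by `generate_quality_report`. The helper `format_quality_report` and the `short_content` issue type in the example are hypothetical.

```python
def format_quality_report(report):
    """Render a quality report dict as a short plain-text summary."""
    lines = [
        f"Documents: {report['total_documents']}",
        f"Passed: {report['passed']}  Failed: {report['failed']}",
        f"Pass rate: {report['pass_rate']:.1%}",
    ]
    # List issue types with the most frequent first
    by_count = sorted(
        report["issues_by_type"].items(), key=lambda item: item[1], reverse=True
    )
    for issue_type, count in by_count:
        lines.append(f"  {issue_type}: {count}")
    return "\n".join(lines)

# Example report with made-up numbers, shaped like generate_quality_report's output
example = {
    "total_documents": 4, "passed": 3, "failed": 1, "pass_rate": 0.75,
    "issues_by_type": {"short_content": 1}, "sample_failures": [],
}
print(format_quality_report(example))
```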
Quality Gates in Pipelines
Integrate checks into processing pipelines:
class QualityGateStage:
    def __init__(self, min_content_length=100, max_missing_pages=0):
        self.min_content_length = min_content_length
        self.max_missing_pages = max_missing_pages

    def execute(self, document):
        issues = []
        if not document.content or len(document.content) < self.min_content_length:
            issues.append("Content below minimum length threshold")
        expected_pages = document.metadata.get("expected_pages")
        # "page_count" is the metadata key required by OUTPUT_SCHEMA
        actual_pages = document.metadata.get("page_count", 0)
        if expected_pages and actual_pages < expected_pages - self.max_missing_pages:
            issues.append(f"Missing pages: expected {expected_pages}, got {actual_pages}")
        # Record the verdict on the document so later stages can act on it
        if issues:
            document.metadata["quality_issues"] = issues
            document.metadata["quality_pass"] = False
        else:
            document.metadata["quality_pass"] = True
        return document
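One way to act on the gate's verdict is to route failing documents to a quarantine list instead of passing them downstream. The sketch below is illustrative only: the `Document` dataclass, `run_with_gate`, and the simplified `length_gate` are hypothetical stand-ins, since the chapter does not show the real document class.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Hypothetical document type with the two attributes a gate reads."""
    content: str
    metadata: dict = field(default_factory=dict)

def length_gate(document, min_content_length=100):
    """Simplified stand-in for a quality gate: checks content length only."""
    passed = len(document.content or "") >= min_content_length
    document.metadata["quality_pass"] = passed
    return document

def run_with_gate(documents, gate):
    """Apply gate to each document and split results by the quality_pass flag."""
    accepted, quarantined = [], []
    for document in documents:
        checked = gate(document)
        if checked.metadata.get("quality_pass"):
            accepted.append(checked)
        else:
            quarantined.append(checked)
    return accepted, quarantined

docs = [Document("x" * 150), Document("too short")]
accepted, quarantined = run_with_gate(docs, length_gate)
print(len(accepted), len(quarantined))
```

With a stage object such as `QualityGateStage`, the `gate` argument would be the stage's `execute` method.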
EXERCISE
Implement a quality checker that validates extracted content against source file properties, flags documents with suspiciously low extraction rates, and generates a summary report with failure categorization.