15. Quality Checks

Chapter 15 of 18 · 20 min

Processing documents without validation produces silent failures: documents that pass through the pipeline with missing or corrupted output. Quality checks catch these issues before downstream systems encounter them.

Defining Quality Metrics

Document processing quality depends on:

  • Completeness: All expected content extracted
  • Accuracy: Content matches source (no garbled text or missing sections)
  • Format: Output matches schema and encoding requirements
  • Timeliness: Processing completes within SLA
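
One way to make these four dimensions concrete is to record them per document. The sketch below is illustrative only: the `QualityMetrics` class, its field names, and the 30-second SLA default are assumptions, not part of the pipeline described in this chapter.

```python
from dataclasses import dataclass


@dataclass
class QualityMetrics:
    """Hypothetical per-document record of the four quality dimensions."""
    expected_sections: int      # completeness: sections we expected to extract
    extracted_sections: int     # completeness: sections actually extracted
    garbled_chars: int          # accuracy: characters flagged as garbled
    total_chars: int            # accuracy: total characters extracted
    schema_valid: bool          # format: output matched the schema
    processing_seconds: float   # timeliness: wall-clock processing time

    def completeness(self) -> float:
        # With nothing expected, treat the document as complete.
        if self.expected_sections == 0:
            return 1.0
        return self.extracted_sections / self.expected_sections

    def accuracy(self) -> float:
        # With no extracted text there is nothing that can be accurate.
        if self.total_chars == 0:
            return 0.0
        return 1.0 - self.garbled_chars / self.total_chars

    def within_sla(self, sla_seconds: float = 30.0) -> bool:
        # The 30-second default is an assumed SLA for illustration.
        return self.processing_seconds <= sla_seconds


metrics = QualityMetrics(10, 9, 5, 1000, True, 12.5)
```

Keeping the raw counts rather than only the derived ratios makes it possible to change thresholds later without reprocessing documents.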

Structural Validation

Validate output structure against expected schema:

from jsonschema import validate, ValidationError

OUTPUT_SCHEMA = {
    "type": "object",
    "required": ["path", "content", "metadata"],
    "properties": {
        "path": {"type": "string"},
        "content": {"type": "string", "minLength": 1},
        "metadata": {
            "type": "object",
            "required": ["page_count", "processed_at"]
        }
    }
}

def validate_output(result):
    try:
        validate(instance=result, schema=OUTPUT_SCHEMA)
        return True, "Valid"
    except ValidationError as e:
        return False, f"Schema validation failed: {e.message}"
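
The schema above only requires that `page_count` and `processed_at` are present; it does not constrain their values. A possible follow-up check is sketched below using only the standard library. The helper name `check_metadata_values` is hypothetical, and the sketch assumes `processed_at` is stored as an ISO 8601 string, which the chapter does not state.

```python
from datetime import datetime


def check_metadata_values(metadata):
    """Return a list of problems with metadata values (hypothetical helper)."""
    problems = []

    page_count = metadata.get("page_count")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(page_count, int) or isinstance(page_count, bool) or page_count < 1:
        problems.append(f"page_count must be a positive integer, got {page_count!r}")

    processed_at = metadata.get("processed_at")
    try:
        # Assumption: timestamps are ISO 8601 strings such as "2024-05-01T12:00:00".
        datetime.fromisoformat(processed_at)
    except (TypeError, ValueError):
        problems.append(f"processed_at is not an ISO 8601 timestamp: {processed_at!r}")

    return problems
```

Returning a list instead of raising on the first problem lets a caller report every metadata issue for a document at once.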

Content Quality Checks

Beyond structure, verify content quality:

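def check_content_quality(content, source_path):
    issues = []

    # Guard against empty content so the ratio below cannot divide by zero.
    if not content:
        return ["Content is empty"]

    if len(content) < 100:
        issues.append("Content suspiciously short")

    if "\x00" in content:
        issues.append("Contains null bytes")

    # Count control characters other than newline, carriage return, and tab.
    non_printable = sum(1 for c in content if ord(c) < 32 and c not in "\n\r\t")
    non_printable_ratio = non_printable / len(content)
    if non_printable_ratio > 0.1:
        issues.append(f"High non-printable character ratio: {non_printable_ratio:.2%}")

    return issues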
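
Garbled text often shows up as U+FFFD replacement characters rather than control characters, so the control-character ratio above may miss it. The following is a minimal sketch of an additional check; the function name `check_encoding_damage` and the 1% default threshold are assumptions chosen for illustration.

```python
def check_encoding_damage(content, max_ratio=0.01):
    """Flag text with many U+FFFD replacement characters (hypothetical helper)."""
    if not content:
        return ["Content is empty"]

    issues = []
    # U+FFFD is what decoders emit for bytes they cannot decode
    # when called with errors="replace".
    replacement_ratio = content.count("\ufffd") / len(content)
    if replacement_ratio > max_ratio:
        issues.append(f"High replacement-character ratio: {replacement_ratio:.2%}")
    return issues


# A Latin-1 byte decoded as UTF-8 produces one replacement character.
damaged = b"caf\xe9 menu".decode("utf-8", errors="replace")
```

The returned list has the same shape as the one from `check_content_quality`, so the two can be concatenated.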

Comparison Against Source

For extracted text, estimate quality by comparing extracted length to PDF size:

from pathlib import Path

def estimate_extraction_quality(pdf_path, extracted_text):
    source_size = Path(pdf_path).stat().st_size
    # An empty source file would make the ratio undefined.
    if source_size == 0:
        return "Source file is empty"

    # Characters extracted per byte of PDF; the thresholds are rough heuristics.
    extraction_ratio = len(extracted_text) / source_size

    if extraction_ratio < 0.01:
        return "Very low extraction ratio - possible OCR failure or image-only PDF"
    elif extraction_ratio > 50:
        return "Very high extraction ratio - possible binary data included"
    elif 1 < extraction_ratio < 10:
        return "Healthy extraction ratio"

    return "Acceptable"
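
File size is also affected by embedded images and fonts, so a bytes-based ratio can be noisy. As a complementary signal, text density per page can be checked when a page count is available. This is a sketch under assumptions: the function name `check_chars_per_page` and the 200-character default are illustrative, not recommended values.

```python
def check_chars_per_page(extracted_text, page_count, min_chars_per_page=200):
    """Flag documents whose average text per page is suspiciously low.

    min_chars_per_page is an illustrative default, not a recommended value.
    """
    if page_count <= 0:
        return "Unknown page count - cannot estimate text density"

    # Ignore leading and trailing whitespace so padding does not inflate the count.
    chars_per_page = len(extracted_text.strip()) / page_count
    if chars_per_page < min_chars_per_page:
        return f"Low text density: {chars_per_page:.0f} chars/page"
    return "Acceptable"
```

For example, 50 characters extracted from a 10-page document averages 5 characters per page, which would be flagged under the default threshold.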

Automated Quality Reporting

Aggregate per-document results into a report with the pass rate, issue counts by type, and a small sample of failures:

def generate_quality_report(results):
    total = len(results)
    passed = sum(1 for r in results if r["quality_pass"])
    
    report = {
        "total_documents": total,
        "passed": passed,
        "failed": total - passed,
        "pass_rate": passed / total if total > 0 else 0,
        "issues_by_type": {},
        "sample_failures": []
    }
    
    for result in results:
        if not result["quality_pass"]:
            issue_type = result.get("issue_type", "unknown")
            report["issues_by_type"][issue_type] = report["issues_by_type"].get(issue_type, 0) + 1
            if len(report["sample_failures"]) < 5:
                report["sample_failures"].append(result)
    
    return report
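
A report dictionary of this shape can be rendered as plain text for logs or notifications. The helper below is a hypothetical sketch; `format_quality_report` and the sample values in `example_report` are made up for illustration.

```python
def format_quality_report(report):
    """Render a report dict as plain text (hypothetical helper)."""
    lines = [
        f"Documents: {report['total_documents']}",
        f"Passed:    {report['passed']}",
        f"Failed:    {report['failed']}",
        f"Pass rate: {report['pass_rate']:.1%}",
    ]
    if report["issues_by_type"]:
        lines.append("Issues by type:")
        # Most frequent issue first, ties broken alphabetically.
        ranked = sorted(report["issues_by_type"].items(), key=lambda kv: (-kv[1], kv[0]))
        for issue_type, count in ranked:
            lines.append(f"  {issue_type}: {count}")
    return "\n".join(lines)


# Made-up sample data in the same shape as the report built above.
example_report = {
    "total_documents": 4,
    "passed": 3,
    "failed": 1,
    "pass_rate": 0.75,
    "issues_by_type": {"short_content": 1},
    "sample_failures": [],
}
print(format_quality_report(example_report))
```

Sorting by count puts the most common failure category at the top, which is usually the first thing to investigate.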

Quality Gates in Pipelines

Integrate checks into processing pipelines:

class QualityGateStage:
    def __init__(self, min_content_length=100, max_missing_pages=0):
        self.min_content_length = min_content_length
        self.max_missing_pages = max_missing_pages
    
    def execute(self, document):
        issues = []
        
        if not document.content or len(document.content) < self.min_content_length:
            issues.append("Content below minimum length threshold")
        
        expected_pages = document.metadata.get("expected_pages")
        # Use the same "page_count" key that the output schema requires.
        actual_pages = document.metadata.get("page_count", 0)
        if expected_pages and actual_pages < expected_pages - self.max_missing_pages:
            issues.append(f"Missing pages: expected {expected_pages}, got {actual_pages}")
        
        if issues:
            document.metadata["quality_issues"] = issues
            document.metadata["quality_pass"] = False
        else:
            document.metadata["quality_pass"] = True
        
        return document
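
A gate only marks documents; something downstream still has to act on the verdict. The sketch below shows one possible way to route documents based on `quality_pass`. The `Document` dataclass, the simplified `MinLengthGate`, and `route_documents` are stand-ins invented for this example so that it runs on its own; they are assumptions about the pipeline's shape, not its actual types.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Minimal stand-in for the pipeline's document type (assumed shape)."""
    content: str
    metadata: dict = field(default_factory=dict)


class MinLengthGate:
    """Simplified stand-in for QualityGateStage, kept short for illustration."""
    def __init__(self, min_content_length=100):
        self.min_content_length = min_content_length

    def execute(self, document):
        ok = len(document.content or "") >= self.min_content_length
        document.metadata["quality_pass"] = ok
        return document


def route_documents(documents, gate):
    """Split documents into (accepted, quarantined) using the gate's verdict."""
    accepted, quarantined = [], []
    for doc in documents:
        checked = gate.execute(doc)
        target = accepted if checked.metadata.get("quality_pass") else quarantined
        target.append(checked)
    return accepted, quarantined


docs = [Document("x" * 150), Document("too short")]
accepted, quarantined = route_documents(docs, MinLengthGate())
```

Quarantining failed documents instead of dropping them keeps the evidence needed to diagnose why they failed.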

EXERCISE

Implement a quality checker that validates extracted content against source file properties, flags documents with suspiciously low extraction rates, and generates a summary report with failure categorization.