Error Handling — Document Processing with Local AI (Chapter 14)

Document processing fails often. PDFs are password-protected, images lack extractable text, file paths contain special characters, and AI models return unexpected formats. Reliable error handling keeps a single failure from cascading through the rest of the pipeline.

Categorizing Errors

Not all errors are equal. Distinguish between:

Recoverable: Can retry or skip gracefully (corrupt PDF page, timeout)
Fatal: Cannot proceed (encrypted file, unsupported format)
Partial: Some content extracted, some lost (multi-page PDF with one bad page)
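
One way to make these categories explicit in code is to map exception types to an enum. The sketch below is illustrative only: `ErrorCategory`, `EncryptedDocumentError`, `PartialExtractionError`, and `categorize` are hypothetical names, and the choice of which built-in exceptions count as recoverable is an assumption to adapt to your own pipeline.

```python
from enum import Enum


class ErrorCategory(Enum):
    RECOVERABLE = "recoverable"  # retry or skip gracefully
    FATAL = "fatal"              # cannot proceed with this document
    PARTIAL = "partial"          # some content extracted, some lost


# Hypothetical exception types, defined here only for illustration.
class EncryptedDocumentError(Exception):
    pass


class PartialExtractionError(Exception):
    def __init__(self, message, extracted_pages, failed_pages):
        super().__init__(message)
        self.extracted_pages = extracted_pages
        self.failed_pages = failed_pages


def categorize(exc):
    """Map an exception instance to one of the three categories."""
    if isinstance(exc, PartialExtractionError):
        return ErrorCategory.PARTIAL
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return ErrorCategory.RECOVERABLE
    if isinstance(exc, (EncryptedDocumentError, PermissionError)):
        return ErrorCategory.FATAL
    # Unknown errors are treated as fatal so they are not retried blindly.
    return ErrorCategory.FATAL
```

Defaulting unknown errors to fatal is a conservative choice: it avoids retry loops on problems nobody has classified yet.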

Retry Logic with Exponential Backoff

Transient failures often succeed on retry:

import functools
import time

import pymupdf  # PyMuPDF

def retry(max_attempts=3, base_delay=1.0, backoff_factor=2.0):
    """Retry a function, multiplying the wait by backoff_factor after each failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            last_exception = None
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    last_exception = e
                    # Sleep only if another attempt remains.
                    if attempt < max_attempts - 1:
                        time.sleep(delay)
                        delay *= backoff_factor
            # Every attempt failed: surface the last error to the caller.
            raise last_exception
        return wrapper
    return decorator

@retry(max_attempts=3, base_delay=2.0, backoff_factor=2.0)
def extract_text_from_pdf(path):
    # The context manager closes the document even if a page fails.
    with pymupdf.open(path) as doc:
        return "".join(page.get_text() for page in doc)
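
The decorator above retries on every exception, including ones that will never succeed on a second attempt, such as an encrypted file. A possible refinement is to retry only on a caller-supplied set of transient exception types and let everything else propagate immediately. The following is a sketch under that assumption; `retry_on` and its `sleep` parameter (injectable so the example runs without waiting) are hypothetical names, not part of the original decorator.

```python
import functools
import time


def retry_on(exceptions, max_attempts=3, base_delay=1.0, backoff_factor=2.0,
             sleep=time.sleep):
    """Retry only when one of `exceptions` is raised; re-raise anything else."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    sleep(delay)
                    delay *= backoff_factor
        return wrapper
    return decorator


# Usage: record the delays instead of sleeping so the schedule is visible.
delays = []
calls = {"count": 0}


@retry_on((TimeoutError,), max_attempts=3, base_delay=2.0, sleep=delays.append)
def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("model did not respond")
    return "ok"
```

Calling `flaky()` here succeeds on the third attempt after waits of 2.0 and 4.0 seconds, while a `ValueError` raised inside a decorated function would not be retried at all.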

Graceful Degradation

When full extraction fails, attempt partial extraction:

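# ocr_extraction is assumed to be defined elsewhere in the pipeline.
def extract_with_fallback(path):
    try:
        return {"method": "full", "content": extract_text_from_pdf(path)}
    except Exception as e:
        print(f"Full extraction failed: {e}")
        # Fall back to OCR, which is slower but works on scanned pages.
        try:
            return {"method": "ocr", "content": ocr_extraction(path)}
        except Exception as e2:
            # Report both errors so neither failure is lost.
            return {"method": "none", "error": f"Both methods failed: {e}, {e2}"}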
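
Degradation can also happen at page level, which matches the "Partial" category: extract each page independently and record the ones that fail. This is a minimal sketch; `extract_pages_tolerantly` is a hypothetical helper, and `pages` and `extract_page` are placeholders for whatever page objects and per-page extraction call your library provides.

```python
def extract_pages_tolerantly(pages, extract_page):
    """Extract each page independently so one bad page does not lose the rest.

    `pages` is any iterable of page objects and `extract_page` is a callable
    returning the text of one page; both are placeholders for illustration.
    """
    texts, failed = [], []
    for index, page in enumerate(pages):
        try:
            texts.append(extract_page(page))
        except Exception as exc:
            # Keep going, but remember which page failed and why.
            failed.append({"page": index, "error": str(exc)})

    if not failed:
        method = "full"
    elif texts:
        method = "partial"
    else:
        method = "none"
    return {"method": method, "content": "\n".join(texts), "failed_pages": failed}
```

Returning the failed page indices alongside the content lets downstream steps decide whether a partial result is good enough.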

Error Context Preservation

Include context when logging failures:

import logging

logger = logging.getLogger(__name__)

# process_document, PasswordEncryptedError, and CorruptPDFError are assumed
# to be defined elsewhere in the pipeline.
def safe_process(path):
    try:
        result = process_document(path)
        return {"status": "success", "result": result}
    except PasswordEncryptedError:
        logger.warning("Password-protected document skipped: %s", path)
        return {"status": "skipped", "reason": "encrypted", "path": path}
    except CorruptPDFError:
        # exc_info=True attaches the traceback to the log record.
        logger.error("Corrupt PDF could not be processed: %s", path, exc_info=True)
        return {"status": "failed", "reason": "corrupt", "path": path}
    except Exception as e:
        logger.error("Unexpected error processing %s: %s", path, e, exc_info=True)
        return {"status": "failed", "reason": "unknown", "error": str(e), "path": path}
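
Context can also travel with the exception itself rather than only with the log message. The sketch below wraps a low-level error in a pipeline-level one using `raise ... from`, which keeps the original traceback reachable through `__cause__`. `DocumentProcessingError` and `run_stage` are hypothetical names introduced for illustration.

```python
class DocumentProcessingError(Exception):
    """Pipeline-level error that records which file and stage failed."""

    def __init__(self, message, path, stage):
        super().__init__(f"{message} (path={path}, stage={stage})")
        self.path = path
        self.stage = stage


def run_stage(stage, path, func):
    """Run one pipeline stage, attaching path and stage to any failure."""
    try:
        return func(path)
    except Exception as exc:
        # "from exc" chains the original error so its traceback is preserved.
        raise DocumentProcessingError(str(exc), path=path, stage=stage) from exc
```

A caller that catches `DocumentProcessingError` can then log `err.path` and `err.stage` without parsing the message text.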

Dead Letter Queue Pattern

Failed documents go to a dead letter queue for later analysis:

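import json
import shutil
from datetime import datetime
from pathlib import Path

def handle_failure(path, error_info, dead_letter_dir="/errors"):
    dead_letter = Path(dead_letter_dir)
    # Create the queue directory on first use.
    dead_letter.mkdir(parents=True, exist_ok=True)

    # A same-named file already in the queue is not handled here.
    dest = dead_letter / Path(path).name
    shutil.move(str(path), str(dest))

    # Append one JSON object per line; error_info must be JSON-serializable.
    error_log = dead_letter / "error_log.jsonl"
    with open(error_log, "a", encoding="utf-8") as f:
        f.write(json.dumps({
            "original_path": str(path),
            "failure_path": str(dest),
            "error": error_info,
            "timestamp": datetime.now().isoformat()
        }) + "\n")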
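
For the later analysis step, the JSON Lines log can be read back and grouped by error. This is a sketch that assumes the log format written above, with one JSON object per line and an `error` field; `summarize_dead_letters` is a hypothetical helper name.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_dead_letters(error_log):
    """Count dead-letter entries per error value, most frequent first."""
    counts = Counter()
    with open(Path(error_log), encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the log
            entry = json.loads(line)
            # str() keeps the key hashable even if error is a dict or list.
            counts[str(entry.get("error"))] += 1
    return counts.most_common()
```

A summary like this makes it easy to see whether the queue is dominated by one failure mode that deserves a dedicated handler.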

Validation Before Processing

Validate documents before processing to fail fast:

def validate_pdf(path):
    try:
        # The context manager closes the document on every return path.
        with pymupdf.open(path) as doc:
            if doc.page_count == 0:
                return False, "PDF has no pages"
            return True, "Valid PDF"
    except Exception as e:
        return False, str(e)
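
An even cheaper check can run before the file is opened with a PDF library at all. The sketch below uses only the standard library: it confirms the path is a non-empty regular file and looks for the `%PDF-` marker near the start. `quick_pdf_check` is a hypothetical helper, and searching the first 1024 bytes rather than only the first five is an assumption made to tolerate files with leading junk.

```python
from pathlib import Path


def quick_pdf_check(path):
    """Cheap pre-check that avoids parsing the file; returns (ok, message)."""
    p = Path(path)
    if not p.is_file():
        return False, "Not a regular file"
    if p.stat().st_size == 0:
        return False, "File is empty"
    with open(p, "rb") as f:
        header = f.read(1024)
    # PDF files carry a "%PDF-" version marker at or near the start.
    if b"%PDF-" not in header:
        return False, "Missing %PDF- header"
    return True, "Looks like a PDF"
```

Passing this check does not prove the file is a valid PDF; it only filters out obvious mismatches before the more expensive validation.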