14. Error Handling
Chapter 14 of 18 · 20 min
Document processing fails often. PDFs are password-protected, images lack extractable text, file paths contain special characters, and AI models return output in unexpected formats. Reliable error handling keeps a single bad document from cascading into a failed batch.
Categorizing Errors
Not all errors are equal. Distinguish between:
- Recoverable: Can retry or skip gracefully (corrupt PDF page, timeout)
- Fatal: Cannot proceed (encrypted file, unsupported format)
- Partial: Some content extracted, some lost (multi-page PDF with one bad page)
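One way to make these categories usable in code is to map exception types to a category up front. The sketch below is illustrative only: ErrorCategory, EncryptedFileError, PartialExtractionError, and categorize are hypothetical names, and the mapping is an assumption to adjust for your own pipeline.

```python
from enum import Enum


class ErrorCategory(Enum):
    RECOVERABLE = "recoverable"
    FATAL = "fatal"
    PARTIAL = "partial"


# Hypothetical exception types for illustration; a real pipeline would
# raise its own exceptions or those of the library it uses.
class EncryptedFileError(Exception):
    pass


class PartialExtractionError(Exception):
    def __init__(self, message, failed_pages):
        super().__init__(message)
        self.failed_pages = failed_pages


def categorize(exc):
    """Map an exception instance to one of the three categories."""
    if isinstance(exc, PartialExtractionError):
        return ErrorCategory.PARTIAL
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return ErrorCategory.RECOVERABLE
    if isinstance(exc, (EncryptedFileError, PermissionError)):
        return ErrorCategory.FATAL
    # Assumption: unknown errors count as fatal so they are not retried blindly.
    return ErrorCategory.FATAL
```

Treating unknown exceptions as fatal is a conservative default: it avoids retrying something that will never succeed, at the cost of giving up on some errors that a retry would have fixed.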
Retry Logic with Exponential Backoff
Transient failures often succeed on retry:
    import functools
    import time

    import pymupdf  # PyMuPDF


    def retry(max_attempts=3, base_delay=1.0, backoff_factor=2.0):
        """Retry the wrapped function, multiplying the wait by backoff_factor each time."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                delay = base_delay
                last_exception = None
                for attempt in range(max_attempts):
                    try:
                        return func(*args, **kwargs)
                    except Exception as e:
                        last_exception = e
                        # Sleep only if another attempt remains.
                        if attempt < max_attempts - 1:
                            time.sleep(delay)
                            delay *= backoff_factor
                # Every attempt failed: surface the most recent error.
                raise last_exception
            return wrapper
        return decorator


    @retry(max_attempts=3, base_delay=2.0, backoff_factor=2.0)
    def extract_text_from_pdf(path):
        # The context manager closes the document even if a page fails.
        with pymupdf.open(path) as doc:
            return "".join(page.get_text() for page in doc)
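To see what a given set of retry settings costs in waiting time, the delay schedule can be computed without sleeping. This is a minimal sketch: backoff_delays is a hypothetical helper, written to mirror the loop in the retry decorator from this chapter.

```python
def backoff_delays(max_attempts=3, base_delay=1.0, backoff_factor=2.0):
    """Return the sleep durations a retry loop would use between attempts."""
    delays = []
    delay = base_delay
    # There is one fewer sleep than attempts: no wait follows the last try.
    for _ in range(max_attempts - 1):
        delays.append(delay)
        delay *= backoff_factor
    return delays


print(backoff_delays(max_attempts=3, base_delay=2.0, backoff_factor=2.0))  # [2.0, 4.0]
print(sum(backoff_delays(max_attempts=5, base_delay=1.0)))  # 15.0
```

With three attempts and a 2-second base delay, a document that never succeeds holds up the pipeline for 6 seconds of sleeping plus the time of the three attempts themselves, which is worth knowing before applying the decorator to a large batch.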
Graceful Degradation
When full extraction fails, fall back to a second method such as OCR before giving up:
    def extract_with_fallback(path):
        try:
            return {"method": "full", "content": extract_text_from_pdf(path)}
        except Exception as e:
            print(f"Full extraction failed: {e}")
            # Fall back to OCR; ocr_extraction is assumed to be defined elsewhere.
            try:
                return {"method": "ocr", "content": ocr_extraction(path)}
            except Exception as e2:
                return {"method": "none", "error": f"Both methods failed: {e}, {e2}"}
Error Context Preservation
Include context when logging failures:
    import logging

    logger = logging.getLogger(__name__)

    # process_document, PasswordEncryptedError, and CorruptPDFError are
    # assumed to be defined elsewhere in the pipeline.
    def safe_process(path):
        try:
            result = process_document(path)
            return {"status": "success", "result": result}
        except PasswordEncryptedError:
            logger.warning("Password-protected document skipped: %s", path)
            return {"status": "skipped", "reason": "encrypted", "path": path}
        except CorruptPDFError:
            logger.error("Corrupt PDF could not be processed: %s", path, exc_info=True)
            return {"status": "failed", "reason": "corrupt", "path": path}
        except Exception as e:
            logger.error("Unexpected error processing %s: %s", path, e, exc_info=True)
            return {"status": "failed", "reason": "unknown", "error": str(e), "path": path}
Dead Letter Queue Pattern
Failed documents go to a dead letter queue for later analysis:
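    import json
    import shutil
    from datetime import datetime
    from pathlib import Path

    def handle_failure(path, error_info, dead_letter_dir="/errors"):
        dead_letter_dir = Path(dead_letter_dir)
        # Create the queue directory on first use.
        dead_letter_dir.mkdir(parents=True, exist_ok=True)
        # Note: name collisions between failed files are not handled here.
        dest = dead_letter_dir / Path(path).name
        shutil.move(str(path), str(dest))
        error_log = dead_letter_dir / "error_log.jsonl"
        with open(error_log, "a") as f:
            # error_info must be JSON-serializable (a string or plain dict).
            f.write(json.dumps({
                "original_path": str(path),
                "failure_path": str(dest),
                "error": error_info,
                "timestamp": datetime.now().isoformat()
            }) + "\n")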
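For the "later analysis" step, the JSON Lines log can be read back and grouped. This is a sketch under assumptions: summarize_dead_letters is a hypothetical helper, and it expects each line to be a JSON object with an "error" field, as written by handle_failure above.

```python
import json
from collections import Counter


def summarize_dead_letters(error_log):
    """Count entries in a JSON Lines error log, grouped by the "error" field."""
    counts = Counter()
    with open(error_log, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            entry = json.loads(line)
            # str() keeps the key hashable even if "error" holds a dict.
            counts[str(entry.get("error", "unknown"))] += 1
    return counts
```

Calling summarize_dead_letters("/errors/error_log.jsonl").most_common(5) would then list the five most frequent failure reasons, which is usually the quickest way to decide what to fix first.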
Validation Before Processing
Validate documents before processing to fail fast:
    import pymupdf

    def validate_pdf(path):
        try:
            # The context manager closes the document on every return path.
            with pymupdf.open(path) as doc:
                if doc.page_count == 0:
                    return False, "PDF has no pages"
            return True, "Valid PDF"
        except Exception as e:
            return False, str(e)
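An even cheaper check can run before the file is handed to a PDF library at all. PDF files begin with the marker %PDF-, so reading the first five bytes catches renamed or truncated files quickly. The sketch below uses only the standard library; looks_like_pdf is a hypothetical helper, and it returns the same (ok, message) pair as validate_pdf.

```python
def looks_like_pdf(path):
    """Cheap pre-check: does the file start with the PDF header marker?"""
    try:
        with open(path, "rb") as f:
            header = f.read(5)
    except OSError as exc:
        return False, f"Cannot read file: {exc}"
    if header != b"%PDF-":
        return False, "Missing %PDF- header"
    return True, "Header looks like a PDF"
```

This is deliberately strict. Some readers tolerate a few stray bytes before the header, so a failure here is better treated as a reason to look closer than as proof that the file is unusable.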
EXERCISE
Create an error handler class that categorizes exceptions, applies appropriate recovery strategies, logs failures with full context, and updates a metrics counter for each error type.