06. Data Extraction from Documents

Chapter 6 of 18 · 20 min

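Business documents come in unstructured formats: PDFs, scanned documents, emails, Word files. Extracting structured data from them takes two steps: first get the raw text, by parsing the file directly or by running OCR on scans, then have a model pull the fields you need out of that text.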

For text-based documents, direct extraction works:

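import fitz  # PyMuPDF (pip install pymupdf)
from docx import Document  # python-docx (pip install python-docx)

def extract_text_from_pdf(pdf_path):
    """Extract text from a text-based PDF document."""
    # The context manager closes the file even if a page fails to parse.
    with fitz.open(pdf_path) as doc:
        return "".join(page.get_text() for page in doc)

def extract_text_from_docx(docx_path):
    """Extract paragraph text from a Word document (text inside tables is not included)."""
    doc = Document(docx_path)
    return "\n".join(para.text for para in doc.paragraphs)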
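
It is not always obvious up front whether a PDF has a text layer. One rough check, sketched here, is to look at how much text direct extraction returned per page. The looks_scanned helper and its min_chars_per_page threshold are hypothetical, not part of any library, and the threshold would need tuning on your own documents.

```python
def looks_scanned(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is a scan from the text extracted per page.

    page_texts: list of strings, one per page (e.g. from page.get_text()).
    min_chars_per_page: hypothetical threshold; tune it on your own files.
    """
    if not page_texts:
        return True  # nothing was extracted at all
    # Count non-whitespace characters so blank lines do not look like content.
    total = sum(len("".join(text.split())) for text in page_texts)
    return total / len(page_texts) < min_chars_per_page


# A text-based page yields plenty of characters; a scanned page yields few or none.
print(looks_scanned(["Invoice INV-1001\nAcme Corp\nTotal: 120.00"]))
print(looks_scanned(["", "  \n"]))
```

If the check returns True, route the file to OCR instead of trusting the near-empty text.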

For scanned PDFs or images, OCR is necessary:

# Install Tesseract OCR
# Ubuntu/Debian: apt install tesseract-ocr
# macOS: brew install tesseract

# Convert image to text using Tesseract
# (the second argument is the output base name; the text lands in output_text.txt)
tesseract document_image.png output_text
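
To keep OCR inside the same Python pipeline as the other extractors, the command above can be wrapped with subprocess. This is a minimal sketch that assumes the tesseract binary is installed and on your PATH; the function names build_tesseract_command and ocr_image are illustrative. Passing stdout as the output base makes Tesseract print the text instead of writing a file.

```python
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Build the CLI call; 'stdout' as the output base prints text instead of writing a file."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on one image and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract exits with an error
    )
    return result.stdout
```

A call such as ocr_image("document_image.png") then returns a string you can pass on to the extraction step.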

Once you have text, the extraction prompt guides the model to output structured data:

EXTRACTION_PROMPT = """Extract structured data from this document. Output valid JSON only.

Fields to extract:
- vendor_name: Company name of the vendor/supplier
- invoice_number: Invoice reference number
- invoice_date: Date in YYYY-MM-DD format
- line_items: Array of items with description, quantity, unit_price, total
- subtotal: Numerical subtotal
- tax: Numerical tax amount
- total: Numerical total amount
- due_date: Payment due date in YYYY-MM-DD format

Document:
{document_text}

Output JSON only, no explanation:
"""

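import json

# Assumption: chat() is the Ollama Python client (pip install ollama),
# which matches the call and response shape used below.
from ollama import chat

MAX_CHARS = 4000  # rough character budget for the document text

def extract_invoice_data(document_text):
    """Extract structured invoice data from document text."""
    # Truncate long documents so the prompt stays small. Everything past
    # MAX_CHARS is dropped, so totals at the end of a long invoice may be lost.
    if len(document_text) > MAX_CHARS:
        document_text = document_text[:MAX_CHARS]

    prompt = EXTRACTION_PROMPT.format(document_text=document_text)

    response = chat(model='llama3.1:8b', messages=[
        {'role': 'user', 'content': prompt}
    ])
    content = response['message']['content']

    try:
        return json.loads(content)
    except json.JSONDecodeError:
        # Return the raw reply so the caller can inspect or retry it.
        return {'error': 'Failed to parse extraction', 'raw': content}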
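
Even with "JSON only" in the prompt, a model may add a sentence before or after the JSON, which makes json.loads fail and sends the function above down its error path. A tolerant parser, sketched here under the assumption that the reply contains a single JSON object, tries the whole reply first and then falls back to the text between the first and last brace. The name parse_model_json is hypothetical.

```python
import json


def parse_model_json(raw):
    """Parse a JSON object from a model reply, tolerating extra text around it."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Fall back to the span between the first '{' and the last '}'.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    return json.loads(raw[start:end + 1])


reply = 'Sure, here is the data:\n{"vendor_name": "Acme Corp", "total": 120.0}\nLet me know if you need more.'
print(parse_model_json(reply))
```

The fallback still raises if the extracted span is not valid JSON, so the caller should keep an error path for that case.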

The same pattern works for invoices, receipts, contracts, forms, and other documents with a predictable set of fields. The output schema depends on your downstream system: design the extraction prompt to match what your database or processing pipeline expects.
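
Because the output feeds a database or pipeline, it helps to check it against the schema before storing it. The sketch below validates the invoice fields named in the prompt. The validate_invoice helper is hypothetical, and the rule that subtotal plus tax equals total is an assumption that will not hold for invoices with discounts or shipping charges.

```python
from datetime import datetime

REQUIRED_FIELDS = ["vendor_name", "invoice_number", "invoice_date",
                   "line_items", "subtotal", "tax", "total", "due_date"]


def validate_invoice(data, tolerance=0.01):
    """Return a list of problems found in an extracted invoice dict (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in data]

    # Dates must parse with the format requested in the prompt.
    for name in ("invoice_date", "due_date"):
        value = data.get(name)
        if value is not None:
            try:
                datetime.strptime(str(value), "%Y-%m-%d")
            except ValueError:
                problems.append(f"{name} is not a valid YYYY-MM-DD date: {value!r}")

    # Assumed arithmetic rule: subtotal + tax == total, within a small tolerance.
    try:
        subtotal, tax, total = (float(data[k]) for k in ("subtotal", "tax", "total"))
        if abs(subtotal + tax - total) > tolerance:
            problems.append("subtotal + tax does not equal total")
    except (KeyError, TypeError, ValueError):
        problems.append("subtotal, tax, or total is missing or not numeric")

    return problems


sample = {
    "vendor_name": "Acme Corp", "invoice_number": "INV-1001",
    "invoice_date": "2024-03-01", "due_date": "03/31/2024",
    "line_items": [], "subtotal": 100.0, "tax": 20.0, "total": 120.0,
}
print(validate_invoice(sample))
```

For this made-up sample the only problem reported is the due date, which is not in the requested format.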

EXERCISE

Find a PDF invoice or receipt from your business. Extract the text and run the extraction prompt. Verify that the output matches the values in the actual document.
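
One way to do the comparison is to type the correct values in by hand and diff them against the model's output field by field. The compare_fields helper below is a hypothetical sketch, and the sample values are made up for illustration.

```python
def compare_fields(extracted, expected, fields):
    """Return {field: (extracted_value, expected_value)} for every mismatch."""
    mismatches = {}
    for name in fields:
        got, want = extracted.get(name), expected.get(name)
        if got != want:
            mismatches[name] = (got, want)
    return mismatches


# Hand-entered values from the document versus what the model returned.
expected = {"invoice_number": "INV-1001", "total": 120.0}
extracted = {"invoice_number": "INV-1001", "total": 102.0}
print(compare_fields(extracted, expected, ["invoice_number", "total"]))
```

An empty result means every checked field matched; here the transposed digits in total are reported.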