06. Data Extraction from Documents
Business documents come in unstructured formats: PDFs, scanned documents, emails, Word files. Extracting structured data from them takes two steps: first get the raw text out, by parsing the file or running OCR, then have a model extract the fields you need from that text.
For text-based documents, direct extraction works:
import fitz  # PyMuPDF
from docx import Document  # python-docx


def extract_text_from_pdf(pdf_path):
    """Extract text from PDF document."""
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text


def extract_text_from_docx(docx_path):
    """Extract text from Word document."""
    doc = Document(docx_path)
    # Note: doc.paragraphs does not include text inside tables.
    return "\n".join(para.text for para in doc.paragraphs)
For scanned PDFs or images, OCR is necessary:
# Install Tesseract OCR
# Ubuntu/Debian: apt install tesseract-ocr
# macOS: brew install tesseract
# Convert image to text using Tesseract (writes output_text.txt)
tesseract document_image.png output_text
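A PDF can carry a text layer on some pages and only scanned images on others, so it can help to look at what direct extraction returned before reaching for OCR. The sketch below shows one possible check. The function name `needs_ocr` and the `min_chars` threshold are hypothetical, chosen for illustration, and are not part of PyMuPDF or Tesseract.

```python
def needs_ocr(page_texts, min_chars=20):
    """Return the indexes of pages whose extracted text looks empty.

    page_texts: list of strings, one per page (for example the result
                of page.get_text() for each page).
    min_chars:  assumed threshold; pages with fewer non-whitespace
                characters are treated as scanned images.
    """
    flagged = []
    for index, text in enumerate(page_texts):
        visible = "".join(text.split())  # drop all whitespace
        if len(visible) < min_chars:
            flagged.append(index)
    return flagged
```

Pages flagged this way would be rendered to images and passed through Tesseract, while the rest keep their directly extracted text.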
Once you have text, the extraction prompt guides the model to output structured data:
EXTRACTION_PROMPT = """Extract structured data from this document. Output valid JSON only.

Fields to extract:
- vendor_name: Company name of the vendor/supplier
- invoice_number: Invoice reference number
- invoice_date: Date in YYYY-MM-DD format
- line_items: Array of items with description, quantity, unit_price, total
- subtotal: Numerical subtotal
- tax: Numerical tax amount
- total: Numerical total amount
- due_date: Payment due date in YYYY-MM-DD format

Document:
{document_text}

Output JSON only, no explanation:
"""
import json

from ollama import chat  # assumed: chat() comes from the Ollama Python client


def extract_invoice_data(document_text):
    """Extract structured invoice data from document text."""
    # Truncate long documents so the prompt stays a manageable size.
    # Note: everything after the first 4000 characters is dropped.
    if len(document_text) > 4000:
        document_text = document_text[:4000]
    prompt = EXTRACTION_PROMPT.format(document_text=document_text)
    response = chat(model='llama3.1:8b', messages=[
        {'role': 'user', 'content': prompt}
    ])
    raw = response['message']['content']
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Return the raw reply so the caller can inspect what went wrong.
        return {'error': 'Failed to parse extraction', 'raw': raw}
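Even with "Output JSON only" in the prompt, a model may wrap its answer in a code fence or add a sentence before or after it, which makes a plain `json.loads` fail. The sketch below is one possible fallback, not a guaranteed fix: it first tries the reply as-is, then retries on the span between the first `{` and the last `}`. The helper name `parse_json_reply` is hypothetical.

```python
import json


def parse_json_reply(reply_text):
    """Best-effort parse of a reply that should contain one JSON object.

    Returns the parsed dict, or None if no JSON object can be recovered.
    """
    # First try the reply exactly as returned.
    try:
        parsed = json.loads(reply_text)
        return parsed if isinstance(parsed, dict) else None
    except json.JSONDecodeError:
        pass

    # Otherwise take the span from the first "{" to the last "}",
    # which skips code fences and surrounding prose.
    start = reply_text.find("{")
    end = reply_text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(reply_text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```

A caller could use this in place of the bare `json.loads` call and treat a `None` result the same way as the parse error above.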
The extraction pattern works for invoices, receipts, contracts, forms, and any other document with a predictable set of fields. The output schema depends on your downstream system, so design the extraction prompt to match what your database or processing pipeline expects.
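Because the model's output feeds a database or pipeline, it is worth checking its shape before passing it on. The following is a minimal sketch of such a check for the invoice fields listed in the prompt. The expected types and the helper name `find_schema_problems` are assumptions for illustration; a real downstream system may need different rules.

```python
from datetime import datetime

# Field names come from the extraction prompt; the expected types are an
# assumption about what a downstream system might want.
REQUIRED_FIELDS = {
    "vendor_name": str,
    "invoice_number": str,
    "invoice_date": str,
    "line_items": list,
    "subtotal": (int, float),
    "tax": (int, float),
    "total": (int, float),
    "due_date": str,
}
DATE_FIELDS = ("invoice_date", "due_date")


def find_schema_problems(data):
    """Return a list of readable problems; an empty list means the shape is OK."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(data[field], bool) or not isinstance(data[field], expected):
            problems.append(f"wrong type for {field}")
    for field in DATE_FIELDS:
        value = data.get(field)
        if isinstance(value, str):
            try:
                datetime.strptime(value, "%Y-%m-%d")
            except ValueError:
                problems.append(f"{field} is not YYYY-MM-DD")
    return problems
```

Records with a non-empty problem list can be routed to manual review instead of being written to the database.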
Find a PDF invoice or receipt from your business. Extract the text and run the extraction prompt. Verify that the output matches the actual values in the document.
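Part of that check can be automated by testing the extracted amounts against each other. The sketch below assumes the simplest case, where line item totals sum to the subtotal and subtotal plus tax equals the total; invoices with discounts or shipping charges would need extra terms. The helper name `find_total_mismatches` and the `tolerance` value are hypothetical.

```python
def find_total_mismatches(invoice, tolerance=0.01):
    """Cross-check extracted amounts against each other.

    tolerance is an assumed rounding allowance in currency units.
    """
    mismatches = []
    line_sum = sum(item["total"] for item in invoice["line_items"])
    if abs(line_sum - invoice["subtotal"]) > tolerance:
        mismatches.append("line items do not sum to subtotal")
    if abs(invoice["subtotal"] + invoice["tax"] - invoice["total"]) > tolerance:
        mismatches.append("subtotal + tax does not equal total")
    return mismatches
```

A mismatch does not say which field is wrong, only that the extraction should be compared with the source document by hand.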