RUNLOCALAIv38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

© 2026 runlocalai.co · Independently operated
COURSE · BLD · I008

Document Processing with Local AI

Learn document processing with local AI through RunLocalAI's practical lens: documents, PDF, OCR and classification, hardware fit, runtime settings, verification habits and local-vs-cloud tradeoffs.

18 chapters·10h·Builder track·By Fredoline Eruo
PREREQUISITES
  • B002
  • B012

Why this course matters

Document Processing with Local AI is for builders turning local models into working tools, agents and retrieval systems. It connects documents, PDF, OCR, classification and extraction to the questions RunLocalAI wants every reader to answer before they install, upgrade or scale a model: will it run, what will it cost in memory, what setting changes the result, and how do you verify the answer instead of trusting a demo?

What you will be able to do

By the end, you should be able to explain the main tradeoffs in plain language, choose a safe next experiment, and use the chapter exercises as a repeatable operator checklist. The course favors local evidence, hardware fit, context limits, latency and failure modes over generic AI vocabulary.

How to use this course

Start at chapter one if the topic is new. If you already have a working stack, scan for chapters such as Document Processing Overview, PDF Text Extraction, PyMuPDF Deep Dive and OCR with Tesseract and use those lessons as a quality-control pass before changing a workstation, team workflow or production-like local deployment.

CHAPTERS
  1. 01 Document Processing Overview

Document processing pipelines follow a predictable pattern (convert, clean, extract, structure), but the choice of extraction strategy depends entirely on how the document was created.

### What This Course Covers

Local AI enables processing of sensitive documents without sending data to external APIs. This course builds complete pipelines from raw PDFs through to structured data suitable for RAG systems, analytics, or database ingestion.

Documents fall into two fundamental categories based on their origin.

**Born-digital documents** exist as electronic files: text PDFs exported from Word, digitally created forms, exported reports. These contain embedded text that extraction tools can read directly.

**Scanned documents** exist only as images: photographs of paper documents, fax transmissions, poorly scanned PDFs. These require OCR (optical character recognition) to extract any text.

This distinction determines your entire processing approach.

### The Document Processing Pipeline

A complete pipeline consists of five stages:

1. **Format Detection**: identify the file type and verify it's actually a document
2. **Text Extraction**: pull embedded text or perform OCR
3. **Preprocessing**: clean artifacts, deskew, binarize images
4. **Content Analysis**: classify, summarize, extract entities
5. **Output Formatting**: structure data for downstream use

Each stage has multiple implementation options with different accuracy/speed/cost tradeoffs.

### Common Failure Modes

Processing pipelines fail for predictable reasons. Using the wrong extraction method for the document type wastes time and produces garbage. Memory exhaustion occurs when multi-gigabyte PDFs are processed without streaming. Encoding issues plague extracted text when source documents use unusual character sets. Garbage output from OCR typically indicates inadequate preprocessing.
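The format-detection stage above can be sketched with nothing more than a check of a file's leading "magic bytes". This is a minimal illustration, not part of the course code: the names `MAGIC_SIGNATURES`, `detect_format` and `detect_file_format` are hypothetical, and the signature list covers only a few common formats.

```python
# Leading byte signatures for formats a document pipeline commonly sees.
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
]


def detect_format(header: bytes) -> str:
    """Return a format name based on the first bytes of a file, or 'unknown'."""
    for signature, name in MAGIC_SIGNATURES:
        if header.startswith(signature):
            return name
    return "unknown"


def detect_file_format(path: str) -> str:
    """Read only the first 16 bytes of the file and classify them."""
    with open(path, "rb") as f:
        return detect_format(f.read(16))


print(detect_format(b"%PDF-1.7\n"))      # pdf
print(detect_format(b"\xff\xd8\xff\xe0"))  # jpeg
print(detect_format(b"hello world"))     # unknown
```

Checking content rather than the file extension catches renamed or mislabeled uploads before they reach an extractor. Some real-world PDFs carry a few junk bytes before the header, so a production check may need to be more lenient than this one.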
### Tooling Overview

This course uses these primary tools:

- **PyMuPDF**: PDF text and image extraction, metadata access
- **pytesseract**: Tesseract OCR wrapper for Python
- **pdf2image + PIL**: image preprocessing and format conversion
- **Transformers (Hugging Face)**: classification, summarization, NER models
- **llama.cpp / llama.cpp Python bindings**: local LLM inference

15 min
  2. 02 PDF Text Extraction

Text extraction from born-digital PDFs is trivial with PyMuPDF (just call `get_text()`), but handling multi-column layouts and tables requires analyzing the document's structure.

### Extracting Text with PyMuPDF

PyMuPDF (imported as `fitz`) is a widely used tool for PDF text extraction on local systems. Installation:

```bash
pip install pymupdf
```

Basic extraction:

```python
import fitz

doc = fitz.open("document.pdf")
for page_num, page in enumerate(doc):
    text = page.get_text()
    print(f"--- Page {page_num + 1} ---")
    print(text[:500])
doc.close()
```

This works for straightforward documents but fails on complex layouts.

### Handling Multi-Column Documents

Academic papers and legal documents often use multi-column layouts. Default text extraction produces scrambled output. Use `get_text("blocks")` to understand the layout:

```python
import fitz

doc = fitz.open("paper.pdf")
page = doc[0]
blocks = page.get_text("blocks")
for block in blocks:
    x0, y0, x1, y1, text, block_no, block_type = block
    # block_type 0 = text, 1 = image
    if block_type == 0:
        print(f"Y:{y0:.0f} - {text[:100]}")
doc.close()
```

Sorting blocks by `y0` alone interleaves the columns. To reconstruct reading order, assign each block to a column by its `x0`, then sort by column and by `y0` (vertical position) within each column:

```python
def extract_by_layout(page, num_cols=2):
    blocks = page.get_text("blocks")
    col_width = page.rect.width / num_cols
    # Sort key: (column index, y0, x0) so each column is read top to bottom
    text_blocks = [
        (min(int(b[0] // col_width), num_cols - 1), b[1], b[0], b[4])
        for b in blocks
        if b[6] == 0
    ]
    text_blocks.sort()
    return "\n".join(text for _, _, _, text in text_blocks)
```

### Extracting Tables

Tables are notoriously difficult. PyMuPDF's `find_tables()` attempts detection but often fails on complex formats:

```python
page = doc[0]
tables = page.find_tables()
for table in tables.tables:
    for row in table.extract():
        print("\t".join(str(cell) for cell in row))
```

For more reliable table extraction, consider `tabula-py` (requires Java) or `camelot`. Table extraction remains an imperfectly solved problem, so expect to iterate.

### Handling Encodings

Extracted text sometimes contains encoding issues. Handle them explicitly:

```python
text = page.get_text()
# Remove null bytes that break processing
text = text.replace("\x00", "")
# Normalize line endings
text = text.replace("\r\n", "\n").replace("\r", "\n")
```

### Extracting Metadata

PDF metadata includes author, creation date, and custom fields:

```python
doc = fitz.open("document.pdf")
meta = doc.metadata
print(f"Author: {meta.get('author')}")
print(f"Created: {meta.get('creationDate')}")
print(f"Pages: {len(doc)}")
doc.close()
```

20 min
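The `creationDate` value printed above is a PDF date string such as `D:20240115103000+01'00'`, not an ISO timestamp, so it usually needs parsing before it can be sorted or stored. The sketch below is one way to do that with the standard library; `parse_pdf_date` is a hypothetical helper, and it returns `None` for values it does not recognize instead of guessing.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS followed by an optional Z or +HH'mm' / -HH'mm' offset.
# Every field after the year is optional.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+-])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(value):
    """Convert a PDF date string to a datetime, or return None if unparseable."""
    match = _PDF_DATE.match(value.strip()) if value else None
    if match is None:
        return None
    defaults = (0, 1, 1, 0, 0, 0)  # missing month/day default to 1, times to 0
    fields = [int(g) if g else d for g, d in zip(match.groups()[:6], defaults)]
    sign, tz_hours, tz_minutes = match.group(7), match.group(8), match.group(9)
    tz = None
    if sign in ("Z", "z"):
        tz = timezone.utc
    elif sign in ("+", "-"):
        offset = timedelta(hours=int(tz_hours or 0), minutes=int(tz_minutes or 0))
        tz = timezone(offset if sign == "+" else -offset)
    try:
        return datetime(*fields, tzinfo=tz)
    except ValueError:  # e.g. month 13 in a malformed file
        return None


print(parse_pdf_date("D:20240115103000+01'00'"))  # 2024-01-15 10:30:00+01:00
```

Metadata is written by whatever tool produced the PDF, so treat missing or malformed dates as normal rather than exceptional.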
  3. 03 PyMuPDF Deep Dive

PyMuPDF's granular control (page rotation, image extraction, redaction, annotations) enables sophisticated document preprocessing directly within your pipeline.

### Beyond Basic Text Extraction

PyMuPDF provides low-level PDF manipulation that enables preprocessing workflows. Understanding these capabilities lets you build sophisticated pipelines without external tools.

### Image Extraction from PDFs

Extracting images embedded in PDFs:

```python
import fitz

doc = fitz.open("document.pdf")
page = doc[0]
# List images on page
image_list = page.get_images(full=True)
for img_index, img in enumerate(image_list):
    xref = img[0]
    base_image = doc.extract_image(xref)
    image_bytes = base_image["image"]
    image_ext = base_image["ext"]
    with open(f"image_{img_index}.{image_ext}", "wb") as f:
        f.write(image_bytes)
doc.close()
```

This extracts raster images. Vector graphics require different handling.

### Page Rotation and Orientation

Many scanned documents arrive rotated. Detect and fix:

```python
import fitz

def detect_rotation(page):
    # Check rotation attribute
    rotation = page.rotation
    # Analyze text block positions for implicit rotation
    blocks = page.get_text("blocks")
    if not blocks:
        return 0
    # Crude heuristic: if most text starts on the right side of the page,
    # assume a 90-degree rotation. Verify on sample documents before relying on it.
    right_heavy = sum(1 for b in blocks if b[0] > page.rect.width / 2)
    if right_heavy > len(blocks) * 0.7:
        return 90
    return rotation

doc = fitz.open("scanned.pdf")
for page in doc:
    current_rotation = detect_rotation(page)
    if current_rotation:
        page.set_rotation(current_rotation)
doc.save("corrected.pdf")
doc.close()
```

### Text Extraction Methods

PyMuPDF offers multiple text extraction modes with different output structures:

| Method | Use Case |
|--------|----------|
| `get_text()` | Raw text with optional block sorting |
| `get_text("dict")` | Structured dict with font info |
| `get_text("blocks")` | List of positioned text rectangles |
| `get_text("words")` | Individual words with positions |
| `get_text("rawdict")` | Low-level internal structure |

```python
# Word-level extraction for precise positioning
page = doc[0]
words = page.get_text("words")
for word in words:
    x0, y0, x1, y1, content, block_no, line_no, word_no = word
    if "search_term" in content.lower():
        print(f"Found at ({x0:.0f}, {y0:.0f})")
```

### Redacting Content

Redaction permanently removes content (unlike an annotation, which only overlays it):

```python
doc = fitz.open("document.pdf")
page = doc[0]
# Mark every occurrence of a specific string ("CONFIDENTIAL" is a placeholder)
for rect in page.search_for("CONFIDENTIAL"):
    page.add_redact_annot(rect, fill=(1, 1, 1))
# Permanently remove the marked content
page.apply_redactions()
doc.save("redacted.pdf")
doc.close()
```

### Creating New PDFs

Generate processed output:

```python
doc = fitz.open("source.pdf")
new_doc = fitz.open()
for page in doc:
    # Copy each source page onto a new page of the same size
    new_page = new_doc.new_page(width=page.rect.width, height=page.rect.height)
    new_page.show_pdf_page(new_page.rect, doc, page.number)
new_doc.save("processed.pdf")
new_doc.close()
doc.close()
```

### Performance Considerations

Reading a whole file into memory (for example, by passing its bytes via `stream=`) is costly for large PDFs. Open by path and process one page at a time:

```python
import fitz

def iter_page_text(path):
    # Yield each page's text immediately so the caller never holds
    # the text of the whole document at once
    with fitz.open(path) as doc:
        for page in doc:
            yield page.number, page.get_text()

for page_number, text in iter_page_text("large.pdf"):
    print(page_number, len(text))
```

20 min
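The word-level search shown earlier in this chapter only matches single words, because each tuple from `get_text("words")` holds one word. A small helper can regroup those tuples into lines, which makes phrase matching and line-level bounding boxes straightforward. This is a sketch under the assumption that the tuples have the eight-field shape used above; `group_words_into_lines` is a hypothetical name and the sample data is made up.

```python
from collections import defaultdict


def group_words_into_lines(words):
    """Group (x0, y0, x1, y1, text, block_no, line_no, word_no) tuples by line.

    Returns (bbox, text) pairs in block/line order, where bbox is the union
    rectangle of all words on that line.
    """
    lines = defaultdict(list)
    for x0, y0, x1, y1, text, block_no, line_no, word_no in words:
        lines[(block_no, line_no)].append((word_no, x0, y0, x1, y1, text))

    result = []
    for key in sorted(lines):
        parts = sorted(lines[key])  # restore word order within the line
        bbox = (
            min(p[1] for p in parts),
            min(p[2] for p in parts),
            max(p[3] for p in parts),
            max(p[4] for p in parts),
        )
        result.append((bbox, " ".join(p[5] for p in parts)))
    return result


# Made-up sample in the same shape PyMuPDF returns
sample = [
    (115, 100, 140, 112, "Due:", 0, 0, 1),
    (72, 100, 110, 112, "Total", 0, 0, 0),
    (72, 120, 120, 132, "$450.00", 0, 1, 0),
]
for bbox, text in group_words_into_lines(sample):
    print(bbox, text)
```

With lines reassembled, a check such as `"total due" in text.lower()` can locate a phrase and its rectangle in one pass.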
  4. 04 OCR with Tesseract

Tesseract OCR needs preprocessing to give reliable results: raw scanned images produce roughly 60-70% accuracy, while properly preprocessed images often exceed 95%.

### Installing and Configuring Tesseract

Tesseract is the most widely used open-source OCR engine. Install both the engine and the Python bindings:

```bash
# Ubuntu/Debian
sudo apt install tesseract-ocr
pip install pytesseract

# macOS
brew install tesseract
pip install pytesseract

# Windows
# Download installer from github.com/UB-Mannheim/tesseract/wiki
# Add to PATH, then:
pip install pytesseract
```

Verify installation:

```bash
tesseract --version
```

Tesseract supports multiple languages. Install additional language packs for non-English documents:

```bash
# Install French and German language data
sudo apt install tesseract-ocr-fra tesseract-ocr-deu
```

### Basic OCR with pytesseract

```python
import pytesseract
from PIL import Image

image = Image.open("scanned_page.png")
text = pytesseract.image_to_string(image)
print(text)
```

The single `image_to_string` call handles basic cases. The output quality depends entirely on input image quality.

### Image Preprocessing for OCR

Preprocessing dramatically affects accuracy. Key operations:

```python
from PIL import Image, ImageFilter, ImageOps

def preprocess_for_ocr(image_path, output_path=None):
    img = Image.open(image_path)
    # Convert to grayscale
    img = img.convert('L')
    # Increase contrast
    img = ImageOps.autocontrast(img)
    # Resize if too small (Tesseract works better on larger images)
    if min(img.size) < 1500:
        scale = 1500 / min(img.size)
        new_size = tuple(int(s * scale) for s in img.size)
        img = img.resize(new_size, Image.LANCZOS)
    # Apply sharpening
    img = img.filter(ImageFilter.SHARPEN)
    if output_path:
        img.save(output_path)
    return img
```

### Page Segmentation Modes

Tesseract uses different page segmentation modes (PSM) for different document types:

```python
# PSM 6: assume a single uniform block of text
text = pytesseract.image_to_string(img, config='--psm 6')
# PSM 7: treat the image as a single text line
text = pytesseract.image_to_string(img, config='--psm 7')
# PSM 8: treat the image as a single word
text = pytesseract.image_to_string(img, config='--psm 8')
# PSM 11: sparse text, find as much text as possible in no particular order
text = pytesseract.image_to_string(img, config='--psm 11')
# PSM 10: treat the image as a single character
text = pytesseract.image_to_string(img, config='--psm 10')
```

A rough heuristic for picking a mode from the image's aspect ratio:

```python
def auto_psm(img):
    width, height = img.size
    aspect = width / height
    # Very wide crop: likely a single line from a form
    if aspect > 3:
        return '--psm 7'
    # Tall page: likely a book page with one block of text
    elif aspect < 0.8:
        return '--psm 6'
    # Standard document: fully automatic segmentation
    else:
        return '--psm 3'
```

### Extracting Structure

Get more than just text by extracting bounding boxes and confidence:

```python
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
for i, text in enumerate(data['text']):
    if text.strip():
        conf = float(data['conf'][i])  # -1 marks entries without recognized text
        x, y, w, h = data['left'][i], data['top'][i], data['width'][i], data['height'][i]
        print(f"'{text}' (conf: {conf:.0f}%) at ({x}, {y})")
```

### Handling Multi-column Documents

For multi-column layouts, split the image before OCR:

```python
def split_columns(img, num_cols=2):
    width, height = img.size
    col_width = width // num_cols
    columns = []
    for i in range(num_cols):
        left = i * col_width
        # The last column absorbs any leftover pixels
        right = (i + 1) * col_width if i < num_cols - 1 else width
        column = img.crop((left, 0, right, height))
        columns.append(column)
    return columns

# Process each column separately
for i, col in enumerate(split_columns(img)):
    text = pytesseract.image_to_string(col, config='--psm 6')
    print(f"Column {i + 1}: {text[:200]}")
```

### Common Failure Modes

Tesseract fails in predictable ways:

- **Inverted images** produce gibberish. Detect by checking the average pixel value and invert if needed
- **Excessive noise** confuses recognition. Apply thresholding after deskewing
- **Skewed text** reduces accuracy. Correct skew before OCR
- **Low resolution** loses detail. Upscale before processing

25 min
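The inverted-image check in the list above can be sketched in a few lines. This is an illustration only: `needs_inversion` and `invert` are hypothetical helpers that work on a plain sequence of 8-bit grayscale values, and the cutoff of 128 is an assumed starting point that may need tuning for dark or low-contrast scans.

```python
def needs_inversion(pixels, dark_threshold=128):
    """Return True when the image is mostly dark (light text on a dark page).

    `pixels` is an iterable of 8-bit grayscale values (0 = black, 255 = white).
    """
    pixels = list(pixels)
    if not pixels:
        return False
    mean = sum(pixels) / len(pixels)
    return mean < dark_threshold


def invert(pixels):
    """Flip every grayscale value so dark becomes light and vice versa."""
    return [255 - p for p in pixels]


# Made-up 10x10 "image": 90 black background pixels, 10 white text pixels
white_on_black = [0] * 90 + [255] * 10
print(needs_inversion(white_on_black))          # True
print(needs_inversion(invert(white_on_black)))  # False
```

With Pillow, `ImageOps.invert` performs the same inversion directly on a grayscale `Image` object.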
  5. 05 OCR with AI Models

Modern OCR models (TrOCR, EasyOCR) handle imperfect input better than Tesseract, but they run best on a GPU and produce different error patterns. Know when to use which approach.

### The AI OCR Landscape

Tesseract remains the fastest option for clean documents but struggles with challenging inputs. AI-based OCR models using transformer architectures handle imperfect images better through learned features.

Three primary options for local AI OCR:

- **TrOCR** (Microsoft): encoder-decoder transformer, excels at handwriting
- **EasyOCR**: multi-language support, balanced speed/accuracy
- **PaddleOCR**: fast, good Chinese support, quantized models available

### TrOCR for Document Recognition

TrOCR uses a vision transformer architecture and is best suited to handwriting and form fields. It recognizes one cropped line of text per call, so full pages need a text-detection or line-segmentation step first:

```bash
pip install transformers torch
```

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

def ocr_trocr(image_path):
    # image_path should point to an image of a single text line
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return generated_text

text = ocr_trocr("handwritten_notes.jpg")
print(text)
```

The base model requires roughly 2GB of VRAM. Larger models improve accuracy but increase memory requirements.

### EasyOCR for Multi-language Support

EasyOCR supports 80+ languages and handles mixed-language documents:

```bash
pip install easyocr
```

```python
import easyocr

reader = easyocr.Reader(['en', 'de', 'fr'], gpu=True)
results = reader.readtext("multilingual_doc.png")
for (bbox, text, confidence) in results:
    print(f"{text} (conf: {confidence:.2f})")
```

Results include bounding boxes for each detected text region, which is useful for document understanding tasks beyond simple extraction.

### PaddleOCR for Speed

PaddleOCR emphasizes inference speed while maintaining accuracy:

```bash
pip install paddlepaddle paddleocr
```

```python
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang='en')
results = ocr.ocr("document.png")
for line in results[0]:
    bbox, (text, confidence) = line
    print(f"{text}")
```

### Comparing Approaches

| Engine | Speed (CPU) | Accuracy | Memory | Best For |
|--------|-------------|----------|--------|----------|
| Tesseract | Fast | Medium | Low | Clean documents, batch processing |
| TrOCR | Slow | High | High | Handwriting, structured forms |
| EasyOCR | Medium | High | Medium | Multi-language, varied quality |
| PaddleOCR | Fast | High | Medium | Production pipelines, Chinese |

### Hybrid Pipelines

Combine approaches for more reliable results:

```python
import easyocr
import pytesseract
from PIL import Image

def hybrid_ocr(image_path, min_confidence=0.7):
    # Try EasyOCR first: it reports a confidence for each text region
    try:
        reader = easyocr.Reader(['en'], gpu=False)
        results = reader.readtext(image_path)  # (bbox, text, confidence) tuples
        if not results:
            raise ValueError("No text detected")
        # If confidence is low, fall back to Tesseract
        avg_conf = sum(conf for _, _, conf in results) / len(results)
        if avg_conf < min_confidence:
            raise ValueError("Low confidence")
        return "\n".join(text for _, text, _ in results)
    except Exception:
        # Fallback to Tesseract
        img = Image.open(image_path)
        return pytesseract.image_to_string(img)
```

### Quantized Models for CPU

When no GPU is available, run on CPU with quantization where the engine supports it:

```python
import easyocr
from paddleocr import PaddleOCR

# EasyOCR on CPU; quantize=True applies dynamic quantization
reader = easyocr.Reader(['en'], gpu=False, quantize=True, model_storage_directory='./models')

# PaddleOCR with TensorRT (a GPU-only accelerator) disabled
ocr = PaddleOCR(use_tensorrt=False, use_angle_cls=True)
```

Expect processing to be 2-3x slower than on a GPU. Output quality is usually close, but quantization can change individual results, so spot-check against a sample of your own documents.

25 min
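The accuracy column in the comparison table earlier in this chapter is qualitative. To compare engines on your own documents, one common measure is character error rate (CER): the edit distance between the OCR output and a hand-checked reference, divided by the reference length. The sketch below uses only the standard library; `levenshtein` and `character_error_rate` are hypothetical helper names.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by reference length (0.0 means a perfect match)."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


reference = "Invoice total: 450.00"
print(character_error_rate(reference, "Invoice total: 450.00"))  # 0.0
print(character_error_rate(reference, "Invoice tota1: 45O.OO"))
```

Running each engine over the same handful of hand-transcribed pages and comparing CER gives a local answer to "which engine is more accurate here" that a generic table cannot.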
  6. 06 Image Preprocessing

OCR accuracy often depends more on preprocessing quality than on the OCR engine itself. Invest time in proper image preparation and you can get away with a less capable (faster) OCR model.

### Why Preprocessing Matters

Raw scanned documents contain noise, skew, uneven lighting, and artifacts. Preprocessing transforms these into clean inputs that OCR engines can handle reliably. The improvement from good preprocessing typically exceeds the improvement from switching OCR engines.

### Binarization: Converting to Black and White

Binarization converts grayscale or color images to pure black and white. This removes noise and standardizes input.

```python
import cv2
import numpy as np

def binarize(image_path, method='otsu'):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if method == 'otsu':
        # Otsu's method finds the optimal threshold automatically
        _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    elif method == 'adaptive':
        # Adaptive threshold handles uneven lighting
        binary = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                       cv2.THRESH_BINARY, 11, 2)
    else:
        raise ValueError(f"Unknown method: {method}")
    cv2.imwrite('binary.png', binary)
    return binary
```

Adaptive thresholding excels on documents with uneven lighting, which is common in photographs of documents.

### Deskewing: Correcting Rotation

Skewed documents reduce OCR accuracy. Detect and correct skew:

```python
import cv2
import numpy as np

def deskew(image_path):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Apply Gaussian blur to reduce noise
    blurred = cv2.GaussianBlur(img, (5, 5), 0)
    # Edge detection
    edges = cv2.Canny(blurred, 50, 150, apertureSize=3)
    # Hough line transform to find lines
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 200)
    if lines is None:
        return img  # No lines found: return the grayscale image unchanged
    # Estimate the skew angle from the first 20 detected lines
    angles = []
    for line in lines[:20]:
        rho, theta = line[0]
        angle = np.degrees(theta) - 90
        angles.append(angle)
    median_angle = np.median(angles)
    # Rotate image
    h, w = img.shape
    center = (w // 2, h // 2)
    rotation_matrix = cv2.getRotationMatrix2D(center, median_angle, 1.0)
    rotated = cv2.warpAffine(img, rotation_matrix, (w, h),
                             flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
    cv2.imwrite('deskewed.png', rotated)
    return rotated
```

Typical deskew angles range from -5° to +5°. Angles outside this range often indicate scanning errors rather than document skew.

### Noise Reduction

Scanner noise and compression artifacts create false text. Apply morphological operations:

```python
def denoise(image_path):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Morphological opening (remove small white noise)
    kernel = np.ones((2, 2), np.uint8)
    opened = cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)
    # Morphological closing (remove small black noise)
    closed = cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel)
    cv2.imwrite('denoised.png', closed)
    return closed
```

For heavy noise, apply bilateral filtering, which preserves edges while smoothing:

```python
denoised = cv2.bilateralFilter(img, 9, 75, 75)
```

### Contrast Enhancement

Enhance text visibility through histogram equalization:

```python
def enhance_contrast(image_path):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # CLAHE (Contrast Limited Adaptive Histogram Equalization)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(img)
    cv2.imwrite('enhanced.png', enhanced)
    return enhanced
```

CLAHE handles documents with both bright and dark regions better than global histogram equalization.

### Complete Preprocessing Pipeline

```python
import cv2
import numpy as np

def preprocess_document(input_path, output_path):
    # Load image and convert to grayscale
    img = cv2.imread(input_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Step 1: Denoise
    denoised = cv2.bilateralFilter(gray, 9, 75, 75)
    # Step 2: Binarize with adaptive threshold
    binary = cv2.adaptiveThreshold(denoised, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 11, 2)
    # Step 3: Morphological cleanup (a 1x1 kernel would leave the image unchanged)
    kernel = np.ones((2, 2), np.uint8)
    cleaned = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
    # Step 4: Deskew using deskew() from above, which expects a file path
    cv2.imwrite(output_path, cleaned)
    deskewed = deskew(output_path)
    cv2.imwrite(output_path, deskewed)
    return output_path
```

25 min
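The `binarize` function earlier in this chapter relies on `cv2.THRESH_OTSU` to pick the threshold. To make that step less of a black box, here is a plain-Python sketch of the idea behind Otsu's method: try every threshold and keep the one that maximizes the variance between the two resulting pixel classes. `otsu_threshold` is a hypothetical teaching helper, far slower than OpenCV, and the sample pixel data is made up.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0-255) that best separates dark from light.

    Pixels <= t form one class and pixels > t the other; the chosen t
    maximizes the between-class variance w0 * w1 * (mean0 - mean1) ** 2.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_variance = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of their values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue      # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break         # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        variance = w0 * w1 * (mean0 - mean1) ** 2
        if variance > best_variance:
            best_t, best_variance = t, variance
    return best_t


# Made-up bimodal "page": dark ink around 30-40, light paper around 210-220
page = [30] * 200 + [40] * 100 + [210] * 500 + [220] * 400
t = otsu_threshold(page)
binary = [255 if p > t else 0 for p in page]
print(t, binary.count(0), binary.count(255))
```

Because the method assumes two well-separated brightness groups, it struggles with uneven lighting, which is the case where the adaptive threshold from the same function is the better choice.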
  7. 07 Document Classification

Classifying documents before routing them to specialized processing pipelines reduces overall processing time and enables different extraction strategies for different document types.

### Why Classify First

A single document processing pipeline optimized for invoices fails on contracts. Classification enables routing: each document type follows its optimal path. Additionally, classification metadata helps downstream systems understand document provenance.

### Rule-Based Classification

Simple classification based on file metadata and basic content analysis:

```python
import fitz

def classify_simple(doc_path):
    doc = fitz.open(doc_path)
    text = doc[0].get_text()[:2000]  # Sample beginning
    doc.close()  # Close before returning; the text is already extracted
    # Check for specific patterns
    if any(marker in text for marker in ['INVOICE', 'Invoice #', 'Bill To:', 'Total Due']):
        return 'invoice'
    elif any(marker in text for marker in ['Contract', 'Agreement', 'Whereas']):
        return 'contract'
    elif any(marker in text for marker in ['Dear', 'Sincerely', 'Regards']):
        return 'letter'
    else:
        return 'unknown'
```

Rule-based classification is fast and requires no ML model, but it is fragile against variations in document formatting.

### ML-Based Classification

Train a classifier on document features:

```bash
pip install scikit-learn transformers
```

```python
import fitz
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Training data: (path, label) tuples
training_data = [
    ('./data/invoices/inv1.pdf', 'invoice'),
    ('./data/invoices/inv2.pdf', 'invoice'),
    ('./data/contracts/contract1.pdf', 'contract'),
    # ... more training examples
]

def extract_features(path):
    doc = fitz.open(path)
    text = ""
    for page_num in range(min(3, len(doc))):  # First 3 pages
        text += doc[page_num].get_text()
    doc.close()
    return text

# Build training set
X_train = [extract_features(path) for path, _ in training_data]
y_train = [label for _, label in training_data]

# Train classifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ('clf', LogisticRegression(max_iter=1000))
])
pipeline.fit(X_train, y_train)

# Predict on the extracted text of a new document, not on its filename
prediction = pipeline.predict([extract_features('new_document.pdf')])[0]
print(f"Classification: {prediction}")
```

### Transformer-Based Classification

For higher accuracy on diverse document types:

```bash
pip install transformers torch
```

```python
from transformers import pipeline
from functools import lru_cache

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

@lru_cache(maxsize=1000)
def classify_transformer(text, doc_type):
    # doc_type is not used by the model; it only forms part of the cache key
    candidate_labels = ["invoice", "contract", "letter", "form", "report"]
    result = classifier(text[:2000], candidate_labels)
    return result['labels'][0], result['scores'][0]

# Usage
text = extract_features('document.pdf')
label, confidence = classify_transformer(text, 'document')
print(f"{label} ({confidence:.2f})")
```

Zero-shot classification requires no training data: you specify candidate labels and the model picks among them. It works well when you have 5-10 known document types.

### Multi-Modal Classification

Some documents are primarily images. Classify by visual features:

```python
import io

import fitz
import torch
from PIL import Image
from torchvision import transforms, models

# ImageNet-pretrained backbone with a new 5-class head. The new head is
# randomly initialized, so fine-tune on labeled page images before
# trusting its predictions.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = torch.nn.Linear(model.fc.in_features, 5)  # 5 document types
model.eval()

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

def classify_visual(doc_path):
    # Render first page as image
    doc = fitz.open(doc_path)
    mat = fitz.Matrix(2, 2)  # 2x zoom
    pix = doc[0].get_pixmap(matrix=mat)
    img_data = pix.tobytes("png")
    doc.close()
    image = Image.open(io.BytesIO(img_data)).convert("RGB")
    tensor = transform(image).unsqueeze(0)
    with torch.no_grad():
        logits = model(tensor)
    classes = ['form', 'invoice', 'letter', 'report', 'contract']
    return classes[logits.argmax().item()]
```

### Classification Confidence and Fallback

Always check confidence and route low-confidence predictions:

```python
def classify_with_fallback(doc_path):
    text = extract_features(doc_path)
    # Try ML classifier
    label, conf = classify_transformer(text, 'document')
    if conf < 0.6:
        # Low confidence: use rule-based as fallback
        rule_label = classify_simple(doc_path)
        print(f"Low confidence ({conf:.2f}), rule-based suggests: {rule_label}")
        return rule_label
    return label
```

25 min
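The routing idea from the start of this chapter (each document type follows its own path) can be sketched as a dispatch table keyed by label. Everything named here is hypothetical: the handler functions are stand-ins for real pipelines, and the 0.6 cutoff simply mirrors the threshold used in the fallback example above.

```python
def process_invoice(path):
    return f"invoice pipeline: {path}"


def process_contract(path):
    return f"contract pipeline: {path}"


def send_to_review(path):
    return f"manual review: {path}"


# Map each known label to the pipeline that should handle it
HANDLERS = {
    "invoice": process_invoice,
    "contract": process_contract,
}


def route_document(path, label, confidence, min_confidence=0.6):
    """Send a classified document to its pipeline, or to review when unsure."""
    handler = HANDLERS.get(label)
    if handler is None or confidence < min_confidence:
        return send_to_review(path)
    return handler(path)


print(route_document("a.pdf", "invoice", 0.92))  # invoice pipeline: a.pdf
print(route_document("b.pdf", "invoice", 0.41))  # manual review: b.pdf
print(route_document("c.pdf", "memo", 0.99))     # manual review: c.pdf
```

Keeping an explicit review path means an unknown label or a weak prediction never silently enters the wrong extractor.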
  8. 08 Document Summarization

Extractive summarization (selecting important sentences) works without LLMs and runs fast; abstractive summarization (generating new text) requires LLMs but produces more coherent output.

### Two Summarization Approaches

Extractive summarization selects existing sentences from the document. No language generation is required, so it is faster and more reliable, but it may produce choppy output. Abstractive summarization generates new text that paraphrases the content. It is more coherent but requires LLMs and may introduce hallucinations.

### Extractive Summarization with TF-IDF

Extract the most important sentences using TF-IDF scoring:

```python
import fitz
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def extractive_summarize(text, num_sentences=5):
    # Split into sentences
    sentences = re.split(r'(?<=[.!?])\s+', text)
    sentences = [s for s in sentences if len(s) > 20]  # Filter short sentences
    if len(sentences) <= num_sentences:
        return text
    # TF-IDF scoring
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(sentences)
    # Score each sentence by sum of TF-IDF values
    sentence_scores = np.array(tfidf_matrix.sum(axis=1)).flatten()
    # Pick the top-scoring sentences, then restore document order
    top_indices = sentence_scores.argsort()[-num_sentences:]
    top_indices.sort()  # Sort by position in document
    summary = ' '.join(sentences[i] for i in top_indices)
    return summary

# Usage
doc = fitz.open("document.pdf")
text = doc[0].get_text()
doc.close()
summary = extractive_summarize(text, num_sentences=5)
print(summary)
```

### LexRank for Better Extraction

LexRank uses graph-based ranking similar to Google's PageRank.
It often produces more coherent summaries than plain TF-IDF scoring:

```bash
pip install sumy
```

```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words

def lexrank_summarize(text, num_sentences=5):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    stemmer = Stemmer("english")
    summarizer = LexRankSummarizer(stemmer)
    summarizer.stop_words = get_stop_words("english")
    summary = summarizer(parser.document, sentences_count=num_sentences)
    return ' '.join(str(sentence) for sentence in summary)

summary = lexrank_summarize(text)
print(summary)
```

### Abstractive Summarization with Local LLMs

For coherent, human-readable summaries, use local LLMs:

```bash
pip install llama-cpp-python transformers
```

```python
from llama_cpp import Llama
import fitz

llm = Llama(
    model_path="./models/llama-2-7b-chat.gguf",
    n_ctx=4096,
    n_threads=4
)

def summarize_llm(text, max_tokens=200):
    # Truncate the input so prompt plus output stays inside n_ctx
    prompt = f"""Summarize the following document in 3-5 sentences:

{text[:4000]}

Summary:"""
    response = llm(prompt, max_tokens=max_tokens, temperature=0.3)
    return response['choices'][0]['text']

doc = fitz.open("document.pdf")
text = " ".join(page.get_text() for page in doc)
doc.close()
summary = summarize_llm(text)
print(summary)
```

A low temperature such as 0.3 makes output more deterministic and tends to keep it closer to the source text. It reduces hallucination but does not eliminate it, so verify summaries against the document. Higher temperatures produce more creative but less reliable summaries.

### Hybrid Approach: Extract + Abstract

Combine extractive and abstractive summarization for best results:

```python
def hybrid_summarize(text):
    # First extract key sentences
    extracted = extractive_summarize(text, num_sentences=10)
    # Then abstract with LLM
    summary = summarize_llm(extracted)
    return summary
```

This approach reduces input length for the LLM (faster, cheaper) while preserving key information.
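The `text[:4000]` cut in `summarize_llm` is a fixed guess at what fits in the context window. A rough budget can be derived from `n_ctx` instead. The sketch below assumes about 4 characters per token, which is a common rule of thumb for English text and not a measurement; `max_prompt_chars` and its `template_tokens` allowance are hypothetical.

```python
def max_prompt_chars(n_ctx, max_tokens, template_tokens=50, chars_per_token=4):
    """Estimate how many document characters fit alongside the reply.

    n_ctx:           model context window in tokens
    max_tokens:      tokens reserved for the generated summary
    template_tokens: assumed allowance for the instruction text around the document
    chars_per_token: rough rule of thumb for English; varies by tokenizer and language
    """
    available_tokens = n_ctx - max_tokens - template_tokens
    if available_tokens <= 0:
        raise ValueError("n_ctx is too small for the requested max_tokens")
    return available_tokens * chars_per_token


budget = max_prompt_chars(n_ctx=4096, max_tokens=200)
print(budget)  # 15384
```

By this estimate the 4000-character cut above is conservative for a 4096-token context. For an exact count rather than an estimate, llama-cpp-python's `Llama.tokenize` can count the tokens of the actual prompt.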
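### Handling Long Documents

Documents longer than the LLM's context window require chunking:

```python
def chunk_summarize(text, chunk_size=2000, overlap=200):
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # Avoid a final chunk that only repeats the overlap
        start = end - overlap  # Overlap for continuity
    # Summarize each chunk
    chunk_summaries = [summarize_llm(chunk) for chunk in chunks]
    # Final summary of summaries
    combined = " ".join(chunk_summaries)
    return summarize_llm(combined)

long_text = "..."  # Your full document
summary = chunk_summarize(long_text)
```

Overlap helps preserve context across chunk boundaries. For very long documents the combined chunk summaries can themselves exceed the context window, in which case the final step needs another round of chunking.

25 min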
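Fixed character offsets, as used in `chunk_summarize`, can cut a sentence in half at every boundary. A variation is to chunk on sentence boundaries and carry the last sentence forward as overlap. This is a sketch, not the course's method: `chunk_by_sentences` is a hypothetical helper, its regex-based splitter is naive about abbreviations, and `max_chars` is a soft limit because a carried-over or very long sentence can push a chunk past it.

```python
import re


def chunk_by_sentences(text, max_chars=2000, overlap_sentences=1):
    """Split text into chunks that end on sentence boundaries.

    The last `overlap_sentences` sentences of each chunk are repeated at the
    start of the next one. max_chars is a soft limit.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk forward as overlap
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


demo = "A one. B two. C three. D four."
for chunk in chunk_by_sentences(demo, max_chars=14):
    print(repr(chunk))
```

Each chunk can then be passed to a summarizer in place of the fixed-width slices.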
  9. 09 Entity Extraction

Named Entity Recognition extracts structured data (names, dates, amounts) from unstructured text, turning documents into queryable records.

### What is NER

Named Entity Recognition identifies and classifies text spans into predefined categories: people, organizations, locations, dates, monetary values, product identifiers. Extracted entities enable database population, search indexing, and relationship analysis.

### Rule-Based Entity Extraction

Simple patterns work for structured documents:

```python
import re
import fitz

def extract_invoice_entities(text):
    entities = {}
    # Invoice number: "Invoice"/"Inv", an optional "#"/"No."/"number",
    # then an ID that contains at least one digit
    invoice_match = re.search(
        r'\b(?:invoice|inv)\b\s*(?:#|no\.?|number)?\s*[:.]?\s*([A-Z0-9-]*\d[A-Z0-9-]*)',
        text, re.I)
    if invoice_match:
        entities['invoice_number'] = invoice_match.group(1)
    # Date patterns
    date_patterns = [
        r'\d{1,2}/\d{1,2}/\d{2,4}',
        r'\d{1,2}-\d{1,2}-\d{2,4}',
        r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{1,2},? \d{4}'
    ]
    for pattern in date_patterns:
        date_match = re.search(pattern, text)
        if date_match:
            entities['date'] = date_match.group()
            break
    # Currency amounts
    amounts = re.findall(r'\$\d[\d,]*(?:\.\d+)?', text)
    if amounts:
        entities['amounts'] = amounts
        # Heuristic: the last amount on the page is treated as the total
        entities['total'] = amounts[-1]
    # Email addresses
    emails = re.findall(r'[\w.-]+@[\w.-]+\.\w+', text)
    if emails:
        entities['email'] = emails[0]
    return entities

doc = fitz.open("invoice.pdf")
text = doc[0].get_text()
doc.close()
entities = extract_invoice_entities(text)
print(entities)
```

Rule-based extraction works for predictable formats but fails on varied documents.
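The amounts captured above are still strings like `$1,296.00`, which cannot be compared or summed. A follow-up step can convert them to `Decimal` values. In this sketch `parse_amounts` and `pick_total` are hypothetical helpers, the pattern only handles US-style dollar amounts, and treating the largest amount as the total is an assumption that differs from the "last amount" heuristic in the code above.

```python
import re
from decimal import Decimal

# A dollar sign, then digits with optional thousands separators and cents
AMOUNT_PATTERN = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})+|\d+)(\.\d{1,2})?")


def parse_amounts(text):
    """Return every dollar amount in `text` as a Decimal, in document order."""
    amounts = []
    for whole, fraction in AMOUNT_PATTERN.findall(text):
        amounts.append(Decimal(whole.replace(",", "") + fraction))
    return amounts


def pick_total(text):
    """Assumed heuristic: the largest amount on an invoice is its total."""
    amounts = parse_amounts(text)
    return max(amounts) if amounts else None


sample_text = "Subtotal $1,200.00  Tax $96.00  Total Due $1,296.00"
print(parse_amounts(sample_text))
print(pick_total(sample_text))  # 1296.00
```

Using `Decimal` rather than `float` avoids rounding surprises when amounts are later added or compared, for example when checking that subtotal plus tax equals the total.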
### Transformer-Based NER

For varied document types, use pre-trained NER models:

```bash
pip install transformers torch
```

```python
from transformers import pipeline
import fitz  # PyMuPDF

# Load NER pipeline; "simple" aggregation merges word pieces into whole entities
ner_pipeline = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

def extract_entities_ner(text):
    entities = ner_pipeline(text)

    # Group by entity type
    by_type = {}
    for entity in entities:
        label = entity['entity_group']
        by_type.setdefault(label, []).append(entity['word'])
    return by_type

doc = fitz.open("document.pdf")
text = doc[0].get_text()
doc.close()

entities = extract_entities_ner(text)
for entity_type, values in entities.items():
    print(f"{entity_type}: {values}")
```

`dslim/bert-base-NER` is trained on CoNLL-2003 and emits four entity types: PER (person), ORG (organization), LOC (location), and MISC (miscellaneous). It does not tag dates or amounts, so combine it with the rule-based patterns above or choose a model whose label set includes them.

### Custom NER for Domain-Specific Entities

Train custom models for domain-specific entities (product codes, case numbers, medical terms):

```python
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          Trainer, TrainingArguments)
from datasets import Dataset

# One id per entity type, plus "O" for tokens outside any entity
label2id = {"O": 0, "INVOICE_ID": 1, "CASE_NUMBER": 2}

# Prepare training data; spans are character offsets with an exclusive end
training_data = [
    {"text": "Invoice #INV-2024-001",
     "entities": [{"start": 9, "end": 21, "label": "INVOICE_ID"}]},
    {"text": "Case No. 23-CV-00451",
     "entities": [{"start": 9, "end": 20, "label": "CASE_NUMBER"}]},
    # ... more examples
]

# Tokenize and align labels
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_and_align(examples):
    tokenized = tokenizer(examples["text"], truncation=True, padding="max_length",
                          max_length=128, return_offsets_mapping=True)
    all_labels = []
    for offsets, entities in zip(tokenized["offset_mapping"], examples["entities"]):
        labels = []
        for tok_start, tok_end in offsets:
            if tok_start == tok_end:
                # Special and padding tokens have empty offsets; -100 is ignored by the loss
                labels.append(-100)
                continue
            label = label2id["O"]
            for ent in entities:
                # A token takes the entity label when it overlaps the character span
                if tok_start < ent["end"] and tok_end > ent["start"]:
                    label = label2id[ent["label"]]
            labels.append(label)
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    tokenized.pop("offset_mapping")
    return tokenized

train_dataset = Dataset.from_list(training_data).map(
    tokenize_and_align, batched=True, remove_columns=["text", "entities"])

# Fine-tune model (output_dir and the hyperparameters are placeholder values)
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(label2id))
training_args = TrainingArguments(output_dir="./ner-model", num_train_epochs=3,
                                  per_device_train_batch_size=16)
trainer = Trainer(model=model, train_dataset=train_dataset, args=training_args)
trainer.train()
```

Expect to need 1000+ labeled examples for reasonable accuracy. For smaller datasets, use few-shot learning with LLMs.

### LLM-Based Entity Extraction

Local LLMs handle entity extraction without training:

```python
import json
from llama_cpp import Llama

# n_ctx must cover the prompt plus max_tokens; the library default
# (512 at the time of writing) is too small for a 3000-character excerpt
llm = Llama(model_path="./models/llama-2-7b-chat.gguf", n_ctx=4096)

def extract_entities_llm(text):
    prompt = f"""Extract entities from the following text. Return as JSON with entity types as keys and lists of values.

Text: {text[:3000]}

Entities to extract: PERSON, ORGANIZATION, LOCATION, DATE, CURRENCY, PRODUCT

Output format:
{{
  "PERSON": [],
  "ORGANIZATION": [],
  "LOCATION": [],
  "DATE": [],
  "CURRENCY": [],
  "PRODUCT": []
}}"""
    response = llm(prompt, max_tokens=500, temperature=0.1)
    return response['choices'][0]['text']

result = extract_entities_llm(text)
entities = json.loads(result)  # Raises json.JSONDecodeError if the model adds text around the JSON
print(entities)
```

A temperature of 0.1 keeps the output nearly deterministic, which helps the JSON stay well-formed. Higher temperatures may introduce formatting errors.
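The `json.loads(result)` call fails as soon as the model adds a sentence before or after the JSON object, which can happen even at low temperature. A minimal sketch of a more tolerant parser, assuming the output contains at most one top-level JSON object; `parse_entity_json` and `EXPECTED_KEYS` are illustrative names, not part of llama-cpp-python:

```python
import json

EXPECTED_KEYS = ["PERSON", "ORGANIZATION", "LOCATION", "DATE", "CURRENCY", "PRODUCT"]

def parse_entity_json(raw_output):
    """Parse model output into {entity_type: [values]}, tolerating extra text."""
    empty = {key: [] for key in EXPECTED_KEYS}

    # Take the span from the first "{" to the last "}" and try to parse it
    start = raw_output.find("{")
    end = raw_output.rfind("}")
    if start == -1 or end <= start:
        return empty
    try:
        data = json.loads(raw_output[start:end + 1])
    except json.JSONDecodeError:
        return empty

    # Keep only the expected keys and coerce every value to a list of strings
    result = {}
    for key in EXPECTED_KEYS:
        value = data.get(key, [])
        if not isinstance(value, list):
            value = [value]
        result[key] = [str(v) for v in value if v is not None]
    return result

sample = (
    'Sure! Here are the entities:\n'
    '{"PERSON": ["John Smith"], "ORGANIZATION": "Acme Corp", "DATE": []}\n'
    'Let me know if you need more.'
)
print(parse_entity_json(sample))
```

Returning an empty result keeps a batch running but also hides failures, so in practice you would probably log the raw output or retry the prompt when parsing fails.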
### Relationship Extraction

Beyond isolated entities, an LLM can also extract the relationships between them. This example reuses the `llm` object loaded in the previous section:

```python
def extract_relationships(text):
    prompt = f"""Extract relationships between entities from this text. Format as subject|relation|object tuples.

Text: {text[:2000]}

Relations: works_for, located_in, purchased_by, dated_on, amount_is

Example output:
John Smith|works_for|Acme Corp
Acme Corp|located_in|New York
"""
    response = llm(prompt, max_tokens=300, temperature=0.1)

    relationships = []
    for line in response['choices'][0]['text'].strip().split('\n'):
        if '|' in line:
            # Strip whitespace so "John Smith | works_for | Acme Corp" parses cleanly
            parts = [part.strip() for part in line.split('|')]
            # Lines that do not have exactly three fields are skipped
            if len(parts) == 3:
                relationships.append(tuple(parts))
    return relationships

rels = extract_relationships(text)
for subject, relation, obj in rels:
    print(f"{subject} -> {relation} -> {obj}")
```

(25 min)
  10. 10 Table Extraction: Table extraction requires both layout analysis and semantic interpretation. No single strategy works across all document types. Test extraction accuracy on samples before processing large batches. (20 min)
  11. 11 Batch Processing Architecture: Batch processing architecture balances concurrency against memory constraints. Start with worker pools, add progress tracking for visibility, and implement chunked processing when memory limits emerge. (20 min)
  12. 12 Watch-Folder Automation: Watch-folder systems must handle incomplete writes, rapid changes, and clean shutdown. Without stability delays and debouncing, processing triggers on partial files or floods with duplicate events. (20 min)
  13. 13 Processing Pipelines: Pipelines convert sequential operations into reusable, configurable workflows. Stage-based design separates concerns, simplifies testing, and enables visual pipeline builders. (20 min)
  14. 14 Error Handling: Error handling determines system reliability. Categorize failures, implement appropriate recovery strategies, preserve context for debugging, and route failures to dead letter queues for analysis. (20 min)
  15. 15 Quality Checks: Quality checks transform silent failures into visible issues. Without automated validation, bad outputs reach users and downstream systems. Integrate checks at pipeline boundaries. (20 min)
  16. 16 Document Search: Document search requires balancing keyword matching against semantic understanding. Start with FTS5 for fast, reliable keyword search, then add embeddings for natural language queries when needed. (25 min)
  17. 17 Multi-Format Support: Unified extraction requires format detection, specialized processors, and fallback handling. Design the interface around a common return structure regardless of source format. (20 min)
  18. 18 Document Pipeline Project: A complete document pipeline combines extraction, quality checks, indexing, and automation. Each component remains independent, testable, and configurable. Start simple, add monitoring, and expand features as requirements emerge. (25 min)
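
The independent, configurable components described for the final project could be wired together as plain functions that pass along one shared record. A minimal sketch, assuming each stage takes and returns a dict; the stage names, the `min_chars` threshold, and the in-memory `store` are hypothetical stand-ins for real extraction, quality-check, and indexing code:

```python
from functools import partial

def extract(record):
    """Stand-in for real extraction: normalizes whitespace in the raw text."""
    record["text"] = " ".join(record["raw"].split())
    return record

def quality_check(record, min_chars=20):
    """Flag suspiciously short extractions instead of passing them on silently."""
    record["ok"] = len(record["text"]) >= min_chars
    return record

def index(record, store):
    """Stand-in for indexing: only records that passed the check are stored."""
    if record["ok"]:
        store[record["name"]] = record["text"]
    return record

def run_pipeline(record, stages):
    """Run the stages in order; each one can be tested or swapped on its own."""
    for stage in stages:
        record = stage(record)
    return record

store = {}
# Configuration happens when the stage list is built, not inside the stages
stages = [extract, partial(quality_check, min_chars=10), partial(index, store=store)]

good = run_pipeline({"name": "a.txt", "raw": "Invoice   #INV-1\nTotal: $10"}, stages)
bad = run_pipeline({"name": "b.txt", "raw": "  ?? "}, stages)
print(good["ok"], bad["ok"], sorted(store))
```

Because each stage is an ordinary function, it can be unit-tested with a hand-built record, and monitoring can be added later by wrapping `run_pipeline` rather than editing the stages.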