KEY INSIGHT
Named Entity Recognition extracts structured data (names, dates, amounts) from unstructured text, transforming documents into queryable databases.
### What is NER
Named Entity Recognition identifies and classifies text spans into predefined categories: people, organizations, locations, dates, monetary values, product identifiers. Extracted entities enable database population, search indexing, and relationship analysis.
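As a minimal sketch of the database-population step, the snippet below stores extracted entities in an in-memory SQLite table and queries them by type. The table layout, the `store_entities` and `find_by_label` helpers, and the sample rows are illustrative assumptions, not part of any NER library.

```python
import sqlite3

def store_entities(conn, doc_id, entities):
    """Insert (label, value) pairs for one document; the schema is illustrative."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entities (doc_id TEXT, label TEXT, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO entities (doc_id, label, value) VALUES (?, ?, ?)",
        [(doc_id, label, value) for label, value in entities],
    )
    conn.commit()

def find_by_label(conn, label):
    """Return (doc_id, value) rows for one entity type, in insertion order."""
    rows = conn.execute(
        "SELECT doc_id, value FROM entities WHERE label = ? ORDER BY rowid",
        (label,),
    )
    return rows.fetchall()

conn = sqlite3.connect(":memory:")
# Hand-written sample entities standing in for real extractor output
store_entities(conn, "invoice-1", [("ORG", "Acme Corp"), ("DATE", "03/15/2024")])
store_entities(conn, "invoice-2", [("ORG", "Globex"), ("MONEY", "$1,250.00")])
orgs = find_by_label(conn, "ORG")
print(orgs)
```

Once entities sit in a table like this, ordinary SQL covers search and cross-document lookups.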
### Rule-Based Entity Extraction
Simple patterns work for structured documents:
```python
import re
import fitz  # PyMuPDF

def extract_invoice_entities(text):
    entities = {}

    # Invoice number pattern. Requiring at least one digit keeps the capture
    # from grabbing plain words such as "Number" or the tail of "Invoice".
    invoice_match = re.search(
        r'\b(?:invoice|inv)\b\.?\s*(?:number|no\.?|#)?\s*:?\s*([A-Z0-9-]*\d[A-Z0-9-]*)',
        text, re.I)
    if invoice_match:
        entities['invoice_number'] = invoice_match.group(1)

    # Date patterns, tried in order; the first match wins
    date_patterns = [
        r'\d{1,2}/\d{1,2}/\d{2,4}',
        r'\d{1,2}-\d{1,2}-\d{2,4}',
        r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{1,2},? \d{4}'
    ]
    for pattern in date_patterns:
        date_match = re.search(pattern, text)
        if date_match:
            entities['date'] = date_match.group()
            break

    # Currency amounts
    amounts = re.findall(r'\$[\d,]+\.?\d*', text)
    if amounts:
        entities['amounts'] = amounts
        # Heuristic: treat the last amount on the page as the total
        entities['total'] = amounts[-1]

    # Email addresses
    emails = re.findall(r'[\w.-]+@[\w.-]+\.\w+', text)
    if emails:
        entities['email'] = emails[0]

    return entities

doc = fitz.open("invoice.pdf")
text = doc[0].get_text()  # first page only
doc.close()

entities = extract_invoice_entities(text)
print(entities)
```
Rule-based extraction works for predictable formats but fails on varied documents.
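To make that failure mode concrete, the sketch below runs the same three date patterns from the invoice example over three differently formatted dates. The `first_date` helper and the sample strings are made up for illustration.

```python
import re

DATE_PATTERNS = [
    r'\d{1,2}/\d{1,2}/\d{2,4}',
    r'\d{1,2}-\d{1,2}-\d{2,4}',
    r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{1,2},? \d{4}',
]

def first_date(text):
    """Return the first match of any pattern, trying patterns in order."""
    for pattern in DATE_PATTERNS:
        match = re.search(pattern, text)
        if match:
            return match.group()
    return None

samples = ["Due 03/15/2024", "Due 2024-03-15", "Due 15 March 2024"]
results = {sample: first_date(sample) for sample in samples}
for sample, found in results.items():
    print(f"{sample!r} -> {found!r}")
```

The US-style date matches as intended, the ISO date yields the truncated fragment `24-03-15`, and the day-first date is missed entirely. Each new layout needs another pattern, which is why hand-written rules scale poorly across document types.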
### Transformer-Based NER
For varied document types, use pre-trained NER models:
```bash
pip install transformers torch
```
```python
from transformers import pipeline
import fitz  # PyMuPDF

# Load NER pipeline; "simple" aggregation merges word pieces into whole entities
ner_pipeline = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

def extract_entities_ner(text):
    entities = ner_pipeline(text)

    # Group by entity type
    by_type = {}
    for entity in entities:
        label = entity['entity_group']
        by_type.setdefault(label, []).append(entity['word'])
    return by_type

doc = fitz.open("document.pdf")
# First page only. BERT models accept at most 512 tokens,
# so split longer pages into chunks before calling the pipeline.
text = doc[0].get_text()
doc.close()

entities = extract_entities_ner(text)
for entity_type, values in entities.items():
    print(f"{entity_type}: {values}")
```
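`dslim/bert-base-NER` is fine-tuned on CoNLL-2003 and emits four entity types: PER (person), ORG (organization), LOC (location), and MISC (miscellaneous). It does not tag dates or monetary amounts; use the rule-based patterns above for those, or a model trained on a dataset that includes them, such as OntoNotes.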
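With `aggregation_strategy="simple"`, each result from the pipeline is a dict with `entity_group`, `score`, `word`, `start`, and `end` keys. The sketch below uses the score to drop low-confidence hits and removes duplicates while grouping. The `group_confident_entities` helper, the 0.8 threshold, and the sample list are illustrative assumptions; the sample is hand-written, not real model output.

```python
from collections import defaultdict

def group_confident_entities(entities, min_score=0.8):
    """Group entity texts by label, dropping low-score and duplicate hits."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] < min_score:
            continue
        word = ent["word"].strip()
        if word and word not in grouped[ent["entity_group"]]:
            grouped[ent["entity_group"]].append(word)
    return dict(grouped)

# Hand-written sample shaped like aggregated pipeline output
sample = [
    {"entity_group": "PER", "score": 0.99, "word": "John Smith", "start": 0, "end": 10},
    {"entity_group": "ORG", "score": 0.97, "word": "Acme Corp", "start": 20, "end": 29},
    {"entity_group": "ORG", "score": 0.41, "word": "Inv", "start": 35, "end": 38},
    {"entity_group": "PER", "score": 0.95, "word": "John Smith", "start": 50, "end": 60},
]
grouped = group_confident_entities(sample)
print(grouped)
```

A suitable threshold depends on the documents; tune it against a handful of manually checked pages rather than treating 0.8 as a recommended value.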
### Custom NER for Domain-Specific Entities
Train custom models for domain-specific entities (product codes, case numbers, medical terms):
```python
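# Requires: pip install datasets accelerate
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset

# Prepare training data: (start, end, label) are character offsets, end exclusive
training_data = [
    {"text": "Invoice #INV-2024-001", "entities": [(9, 21, "INVOICE_ID")]},
    {"text": "Case No. 23-CV-00451", "entities": [(9, 20, "CASE_NUMBER")]},
    # ... more examples
]
# One class per token, without a B-/I- prefix scheme, to keep the sketch short
label2id = {"O": 0, "INVOICE_ID": 1, "CASE_NUMBER": 2}

# Tokenize and align labels
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_and_align(examples):
    tokenized = tokenizer(examples["text"], truncation=True, padding="max_length",
                          max_length=128, return_offsets_mapping=True)
    labels = []
    for offsets, entities in zip(tokenized["offset_mapping"], examples["entities"]):
        label = []
        for tok_start, tok_end in offsets:
            if tok_start == tok_end:
                label.append(-100)  # special and padding tokens are ignored by the loss
                continue
            tag = label2id["O"]
            # Map character positions to token positions
            for start, end, entity_type in entities:
                if tok_start >= start and tok_end <= end:
                    tag = label2id[entity_type]
            label.append(tag)
        labels.append(label)
    encoded = {k: v for k, v in tokenized.items() if k != "offset_mapping"}
    encoded["labels"] = labels
    return encoded

batch = {
    "text": [ex["text"] for ex in training_data],
    "entities": [ex["entities"] for ex in training_data],
}
train_dataset = Dataset.from_dict(tokenize_and_align(batch))

# Fine-tune model
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(label2id))
training_args = TrainingArguments(output_dir="./ner-model", num_train_epochs=3)
trainer = Trainer(model=model, train_dataset=train_dataset, args=training_args)
trainer.train()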
```
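As a rough guide, fine-tuning needs on the order of 1,000 labeled examples for reasonable accuracy; the exact number depends on the domain and on how many entity types you define. For smaller datasets, use few-shot prompting with LLMs.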
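A few-shot prompt can be assembled from the same kind of labeled examples. The sketch below only builds the prompt string; the `build_few_shot_prompt` helper, the prompt wording, and the `INV-2025-117` identifier are illustrative assumptions, and sending the prompt to a model is left out.

```python
def build_few_shot_prompt(examples, text, labels):
    """Build a prompt that shows labeled examples before the new text."""
    lines = [f"Extract these entity types: {', '.join(labels)}.", ""]
    for example_text, entities in examples:
        lines.append(f"Text: {example_text}")
        for value, label in entities:
            lines.append(f"- {label}: {value}")
        lines.append("")
    # The unlabeled text goes last so the model continues with its entities
    lines.append(f"Text: {text}")
    return "\n".join(lines)

examples = [
    ("Invoice #INV-2024-001", [("INV-2024-001", "INVOICE_ID")]),
    ("Case No. 23-CV-00451", [("23-CV-00451", "CASE_NUMBER")]),
]
prompt = build_few_shot_prompt(
    examples, "Ref: Invoice #INV-2025-117", ["INVOICE_ID", "CASE_NUMBER"]
)
print(prompt)
```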
### LLM-Based Entity Extraction
Local LLMs handle entity extraction without training:
```python
import json
from llama_cpp import Llama

# n_ctx must be large enough to hold the prompt plus the response
llm = Llama(model_path="./models/llama-2-7b-chat.gguf", n_ctx=4096)

def extract_entities_llm(text):
    prompt = f"""Extract entities from the following text. Return as JSON with entity types as keys and lists of values.
Text: {text[:3000]}
Entities to extract: PERSON, ORGANIZATION, LOCATION, DATE, CURRENCY, PRODUCT
Output format:
{{
"PERSON": [],
"ORGANIZATION": [],
"LOCATION": [],
"DATE": [],
"CURRENCY": [],
"PRODUCT": []
}}"""
    response = llm(prompt, max_tokens=500, temperature=0.1)
    return response['choices'][0]['text']

# `text` is the page text extracted with fitz in the earlier examples
result = extract_entities_llm(text)
try:
    # Keep only the outermost {...} in case the model adds surrounding prose
    entities = json.loads(result[result.find('{'):result.rfind('}') + 1])
except json.JSONDecodeError:
    entities = {}
print(entities)
```
A low temperature such as 0.1 keeps the output close to deterministic. Higher temperatures make malformed JSON more likely.
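Even well-formed JSON can contain values the model invented. One possible safeguard, sketched below, keeps only the expected keys and drops any value that does not appear verbatim in the source text. The `validate_entities` helper and the sample data are illustrative assumptions, and exact substring matching is a deliberately strict choice that also rejects values the model reformatted.

```python
EXPECTED_KEYS = ["PERSON", "ORGANIZATION", "LOCATION", "DATE", "CURRENCY", "PRODUCT"]

def validate_entities(data, source_text):
    """Keep expected keys only, and only string values found in the source."""
    cleaned = {}
    for key in EXPECTED_KEYS:
        values = data.get(key, [])
        if not isinstance(values, list):
            values = [values]  # tolerate a bare string where a list was expected
        cleaned[key] = [
            v for v in values
            if isinstance(v, str) and v and v in source_text
        ]
    return cleaned

source = "John Smith of Acme Corp paid $500.00 on March 3, 2024."
# Hand-written stand-in for parsed model output, including one invented value
parsed = {
    "PERSON": ["John Smith"],
    "ORGANIZATION": ["Acme Corp", "Globex"],
    "DATE": "March 3, 2024",
    "NOTES": ["ignored key"],
}
validated = validate_entities(parsed, source)
print(validated)
```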
### Relationship Extraction
Beyond isolated entities, extract relationships:
```python
def extract_relationships(text):
    prompt = f"""Extract relationships between entities from this text. Format as subject|relation|object tuples.
Text: {text[:2000]}
Relations: works_for, located_in, purchased_by, dated_on, amount_is
Example output:
John Smith|works_for|Acme Corp
Acme Corp|located_in|New York
"""
    response = llm(prompt, max_tokens=300, temperature=0.1)

    relationships = []
    for line in response['choices'][0]['text'].strip().split('\n'):
        # Keep only well-formed subject|relation|object lines
        parts = [part.strip() for part in line.split('|')]
        if len(parts) == 3 and all(parts):
            relationships.append(tuple(parts))
    return relationships

# Reuses `llm` and `text` from the previous example
rels = extract_relationships(text)
for subject, relation, obj in rels:
    print(f"{subject} -> {relation} -> {obj}")
```
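The extracted triples can then be indexed for lookup. The sketch below groups them by subject and discards relations outside the list given in the prompt; the `build_relation_index` helper and the sample triples are illustrative assumptions.

```python
from collections import defaultdict

ALLOWED_RELATIONS = {"works_for", "located_in", "purchased_by", "dated_on", "amount_is"}

def build_relation_index(triples):
    """Map each subject to its (relation, object) pairs, skipping unknown relations."""
    index = defaultdict(list)
    for subject, relation, obj in triples:
        if relation in ALLOWED_RELATIONS:
            index[subject].append((relation, obj))
    return dict(index)

# Hand-written sample triples standing in for model output
triples = [
    ("John Smith", "works_for", "Acme Corp"),
    ("Acme Corp", "located_in", "New York"),
    ("Acme Corp", "founded_by", "Jane Doe"),  # not an allowed relation, dropped
]
index = build_relation_index(triples)
print(index["Acme Corp"])
```

Filtering against the allowed set guards against the model inventing relation names that downstream queries would not expect.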