HOW-TO · RAG
How to Extract Text from PDFs Using PyMuPDF
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES
Python 3.10+, PyMuPDF installed
What this does
Structured text extraction from PDF documents is the ingestion starting point for most RAG pipelines. PyMuPDF provides fast, layout-aware text extraction with page-level granularity and metadata access. This guide shows how to extract readable text from a PDF file and write it out as page-level records, ready for embedding and vector database indexing.
Steps
Install PyMuPDF.
pip install pymupdf --quiet
Open the PDF and iterate over pages.
import fitz

doc = fitz.open("document.pdf")
print(f"Pages: {doc.page_count}")
print(f"Metadata: {doc.metadata}")
for page_num in range(doc.page_count):
    page = doc[page_num]
    text = page.get_text("text")
    print(f"\n--- Page {page_num + 1} ---")
    print(text[:200])  # preview the first 200 characters of each page
doc.close()
Extract text with layout preservation.
import fitz

doc = fitz.open("document.pdf")
page = doc[0]
# Each block is (x0, y0, x1, y1, text, block_no, block_type).
blocks = page.get_text("blocks")
for block in blocks:
    x0, y0, x1, y1, content, block_no, block_type = block
    if block_type == 0:  # 0 = text block, 1 = image block
        print(f"Y={y0:.0f}: {content[:80].replace(chr(10), ' ')}")
doc.close()
Stream results for pipeline ingestion.
import fitz
import json

doc = fitz.open("document.pdf")
written = 0
# Write each page as soon as it is extracted instead of buffering all records.
with open("extracted.jsonl", "w", encoding="utf-8") as f:
    for page_num in range(doc.page_count):
        text = doc[page_num].get_text("text").strip()
        if text:  # skip pages with no extractable text
            rec = {"source": "document.pdf", "page": page_num + 1, "content": text}
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
            written += 1
doc.close()
print(f"Wrote {written} page records")
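The block tuples from the layout step can also be put into a rough reading order before they are joined into page text. The sketch below is only an illustration, not PyMuPDF's own method: `blocks_to_text` is a hypothetical helper name, and it assumes each tuple has the shape `(x0, y0, x1, y1, text, block_no, block_type)` with type 0 marking text blocks. Sorting top-to-bottom and then left-to-right is a simplification that will interleave columns on multi-column pages.

```python
def blocks_to_text(blocks):
    """Join text blocks in a simple top-to-bottom, left-to-right order.

    Assumes each block is (x0, y0, x1, y1, text, block_no, block_type),
    the tuple shape produced by page.get_text("blocks").
    """
    text_blocks = [b for b in blocks if b[6] == 0]  # 0 = text, 1 = image
    # Round y0 so blocks that start on nearly the same line sort by x0.
    text_blocks.sort(key=lambda b: (round(b[1]), b[0]))
    return "\n\n".join(b[4].strip() for b in text_blocks if b[4].strip())


# Synthetic blocks standing in for page.get_text("blocks") output.
sample = [
    (50, 200, 300, 220, "Second paragraph.\n", 1, 0),
    (50, 100, 300, 120, "Title line\n", 0, 0),
    (50, 300, 300, 400, "<image placeholder>", 2, 1),
]
print(blocks_to_text(sample))
```

With a real page, the call would be `blocks_to_text(page.get_text("blocks"))`, and the result can replace the plain `get_text("text")` string in the page record.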
Verification
python3 -c "import fitz; print(f'PyMuPDF version: {fitz.__version__}')"
# Expected: PyMuPDF version: <version-number>
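Beyond confirming the install, it can help to check the output file itself. The following is a minimal sketch, assuming the `source`, `page`, and `content` keys written in the last step; `check_jsonl` is a hypothetical helper name, and the demo runs on a throwaway file so it works without a PDF at hand.

```python
import json
import os
import tempfile


def check_jsonl(path):
    """Return the number of valid page records in a JSONL file.

    Raises ValueError on the first malformed record.
    """
    required = {"source", "page", "content"}
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            rec = json.loads(line)
            if not isinstance(rec, dict):
                raise ValueError(f"line {line_no}: record is not an object")
            missing = required - rec.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing {sorted(missing)}")
            if not str(rec["content"]).strip():
                raise ValueError(f"line {line_no}: empty content")
            count += 1
    return count


# Demo on a temporary file; in practice, pass "extracted.jsonl".
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"source": "document.pdf", "page": 1, "content": "Hello"}) + "\n")
        f.write(json.dumps({"source": "document.pdf", "page": 2, "content": "World"}) + "\n")
    print(f"Valid records: {check_jsonl(demo_path)}")
```

A record count lower than the PDF's page count is not necessarily an error, since pages without extractable text are skipped during extraction.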
Common failures
- Scanned PDFs contain no extractable text. Use OCR libraries like Tesseract for scanned documents.
- Unicode rendering errors. Pass extracted text through unicodedata.normalize("NFKC", text).
- Memory exhaustion on large PDFs. Use page-level iteration instead of loading the entire document at once.
- Table extraction misaligned. Adjust coordinates manually for edge cases the detector misses.
- Hidden metadata in extracted text. Filter blocks by type and skip annotations with zero height.
- Version mismatch. The installed package or runtime differs from the command shown; check the version first, then rerun the smallest verification command.
- Local environment drift. Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
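Two of the failures above, Unicode artifacts and scanned pages with no text layer, can be caught with a small post-processing pass over the extracted page strings. This is a rough sketch using only the standard library: `clean_page_text` and `pages_needing_ocr` are hypothetical helper names, and the 20-character threshold is an arbitrary assumption to tune per corpus.

```python
import unicodedata


def clean_page_text(text):
    """Normalize to NFKC and collapse runs of whitespace within each line."""
    text = unicodedata.normalize("NFKC", text)
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(lines).strip()


def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based page numbers whose cleaned text is shorter than min_chars.

    A near-empty page is only a hint that it may be scanned; min_chars is
    an assumed threshold, not a PyMuPDF setting.
    """
    return [
        i
        for i, text in enumerate(page_texts, start=1)
        if len(clean_page_text(text)) < min_chars
    ]


# Stand-ins for the strings returned by page.get_text("text").
pages = [
    "The \ufb01rst page has a ligature and a\u00a0non-breaking space.",
    "  \n",
    "p. 3",
]
print(clean_page_text(pages[0]))
print(pages_needing_ocr(pages))
```

Pages flagged this way are candidates for an OCR pass; the rest can be cleaned and written to the JSONL file as usual.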
Related guides