What this does

Extracting text from PDF documents is the first ingestion step in many RAG pipelines. PyMuPDF provides fast, layout-aware text extraction with page-level granularity and access to document metadata. This guide shows how to extract readable text from a PDF file so that it is ready for embedding and indexing in a vector database.

Steps

Install PyMuPDF.
```
pip install pymupdf --quiet
```

Open the PDF and iterate over pages.

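```
import fitz  # PyMuPDF is imported as fitz

doc = fitz.open("document.pdf")
print(f"Pages: {doc.page_count}")
print(f"Metadata: {doc.metadata}")

# Visit pages one at a time and preview the text of each.
for page_num in range(doc.page_count):
    page = doc[page_num]
    text = page.get_text("text")  # plain-text extraction mode
    print(f"\n--- Page {page_num + 1} ---")
    print(text[:200])  # first 200 characters only
doc.close()
```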

Extract text with layout preservation.

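```
import fitz

doc = fitz.open("document.pdf")
page = doc[0]
# Each block is a tuple: (x0, y0, x1, y1, text, block_no, block_type).
blocks = page.get_text("blocks")
for block in blocks:
    x0, y0, x1, y1, content, block_no, block_type = block
    if block_type == 0:  # 0 = text block, 1 = image block
        print(f"Y={y0:.0f}: {content[:80].replace(chr(10), ' ')}")
doc.close()
```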
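
The block tuples can also be put into a rough reading order before they are printed or stored. The following is a minimal sketch, not part of PyMuPDF: `text_blocks_in_reading_order` is a hypothetical helper, it assumes each tuple follows the `(x0, y0, x1, y1, text, block_no, block_type)` layout, and the `line_tolerance` value is an arbitrary illustration. A plain top-to-bottom, left-to-right sort like this will not handle multi-column pages.

```python
def text_blocks_in_reading_order(blocks, line_tolerance=3.0):
    """Return text-only blocks sorted top-to-bottom, then left-to-right.

    Hypothetical helper. `blocks` is assumed to be a list of tuples shaped
    like (x0, y0, x1, y1, text, block_no, block_type); block_type 0 is text.
    """
    # Keep text blocks that contain something other than whitespace.
    text_blocks = [b for b in blocks if b[6] == 0 and b[4].strip()]
    # Group y0 into vertical buckets so blocks in the same bucket
    # are ordered by their left edge (x0) instead of tiny y differences.
    return sorted(text_blocks, key=lambda b: (int(b[1] // line_tolerance), b[0]))


# Synthetic tuples standing in for the output of page.get_text("blocks").
sample = [
    (300.0, 100.5, 500.0, 120.0, "right heading\n", 0, 0),
    (50.0, 400.0, 500.0, 420.0, "footer\n", 1, 0),
    (50.0, 100.0, 250.0, 120.0, "left heading\n", 2, 0),
    (50.0, 200.0, 250.0, 350.0, "<image>", 3, 1),
]
ordered = text_blocks_in_reading_order(sample)
print([b[4].strip() for b in ordered])
# prints ['left heading', 'right heading', 'footer']
```

Keeping the helper independent of an open document makes it easy to try on hand-written tuples before running it against real pages.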

Stream results for pipeline ingestion.

```
import fitz
import json

doc = fitz.open("document.pdf")
written = 0
# Write each page record as soon as it is extracted instead of collecting all pages first.
with open("extracted.jsonl", "w", encoding="utf-8") as f:
    for page_num in range(doc.page_count):
        text = doc[page_num].get_text("text").strip()
        if text:  # skip pages with no extractable text
            rec = {"source": "document.pdf", "page": page_num + 1, "content": text}
            f.write(json.dumps(rec) + "\n")
            written += 1
doc.close()
print(f"Wrote {written} page records")
```

Verification

```
python3 -c "import fitz; print(f'PyMuPDF version: {fitz.__version__}')"
# Expected: PyMuPDF version: <version-number>
```
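
Beyond the version check, the output file itself can be sanity-checked before it goes to an embedding step. This is a minimal sketch that assumes the file uses the `source`, `page`, and `content` keys written in the ingestion step; `check_jsonl` is a hypothetical helper name, and the demo writes a small temporary file rather than reading a real extraction.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"source", "page", "content"}


def check_jsonl(path):
    """Return (line_count, problems) for a JSONL file of page records."""
    count, problems = 0, []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            count += 1
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                problems.append(f"line {line_no}: invalid JSON")
                continue
            if not isinstance(rec, dict) or not REQUIRED_KEYS.issubset(rec):
                problems.append(f"line {line_no}: missing required keys")
            elif not str(rec["content"]).strip():
                problems.append(f"line {line_no}: empty content")
    return count, problems


# Demo on a throwaway file: one good record, one empty page, one broken line.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "extracted.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"source": "document.pdf", "page": 1, "content": "Hello"}) + "\n")
        f.write(json.dumps({"source": "document.pdf", "page": 2, "content": "   "}) + "\n")
        f.write("not json\n")
    count, problems = check_jsonl(demo_path)

print(count, problems)
# prints 3 ['line 2: empty content', 'line 3: invalid JSON']
```

On a real run, calling `check_jsonl("extracted.jsonl")` and confirming that the problem list is empty is a quick way to catch truncated writes.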

Common failures

Scanned PDFs contain no extractable text. Image-only pages typically yield an empty string; run them through an OCR engine such as Tesseract.
Unicode rendering errors. Pass extracted text through unicodedata.normalize("NFKC", text) to fold ligatures and other compatibility characters into their plain forms.
Memory exhaustion on large PDFs. Process and write one page at a time instead of accumulating the text of the entire document in memory.
Table extraction misaligned. If a table detector such as page.find_tables() (available in recent PyMuPDF versions) misses or misaligns a table, adjust the coordinates manually for those edge cases.
Hidden metadata in extracted text. Filter blocks by block type and skip any block or annotation whose bounding box has zero height.
Version mismatch. The installed package or runtime differs from the one the commands assume; check the version first and rerun the smallest verification command.
Local environment drift. Another service, virtual environment, or path is being used; print the active binary path and configuration before changing the guide steps.
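
The scanned-page and Unicode failures can be handled in one small post-processing step. A minimal sketch, assuming the page text has already been extracted as a string: `clean_page_text` and `looks_scanned` are hypothetical helpers, and the 20-character threshold is an arbitrary illustration, not a PyMuPDF default.

```python
import unicodedata


def clean_page_text(text):
    """Fold compatibility characters (ligatures, non-breaking spaces) with NFKC."""
    return unicodedata.normalize("NFKC", text).strip()


def looks_scanned(text, min_chars=20):
    """Heuristic: a page with almost no extractable text probably needs OCR."""
    return len(clean_page_text(text)) < min_chars


# "\ufb01" is the "fi" ligature and "\u00a0" is a non-breaking space.
raw = "The \ufb01le uses a ligature and a non-breaking\u00a0space.\n"
cleaned = clean_page_text(raw)
print(cleaned)
# prints: The file uses a ligature and a non-breaking space.
print(looks_scanned(cleaned), looks_scanned("  \n"))
# prints: False True
```

Pages flagged by `looks_scanned` can be routed to an OCR step, while the rest are normalized and written out as usual.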

How to Extract Text from PDFs Using PyMuPDF
