HOW-TO · RAG

How to Extract Text from PDFs Using PyMuPDF

Intermediate · 15 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

Python 3.10+, PyMuPDF installed

What this does

Extracting text from PDF documents is the first ingestion step in many RAG pipelines. PyMuPDF provides fast, layout-aware text extraction with page-level granularity and access to document metadata. This guide shows how to extract readable text from a PDF file, page by page, and write it out in a form ready for embedding and vector database indexing.

Steps

  1. Install PyMuPDF.

    pip install pymupdf --quiet
    
  2. Open the PDF and iterate over pages.

    import fitz
    
    doc = fitz.open("document.pdf")
    print(f"Pages: {doc.page_count}")
    print(f"Metadata: {doc.metadata}")
    
    for page_num in range(doc.page_count):
        page = doc[page_num]
        text = page.get_text("text")
        print(f"\n--- Page {page_num + 1} ---")
        print(text[:200])
    doc.close()
    
  3. Extract text with layout preservation.

    doc = fitz.open("document.pdf")
    page = doc[0]
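    blocks = page.get_text("blocks")
    for block in blocks:
        # Each block is (x0, y0, x1, y1, text, block_no, block_type).
        x0, y0, x1, y1, content, block_no, block_type = block
        if block_type == 0:  # 0 = text block, 1 = image block
            print(f"Y={y0:.0f}: {content[:80].replace(chr(10), ' ')}")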
    doc.close()
    
  4. Stream results for pipeline ingestion.

    import fitz, json
    
    count = 0
    doc = fitz.open("document.pdf")
    # Write each page as soon as it is extracted so memory use stays flat.
    with open("extracted.jsonl", "w", encoding="utf-8") as f:
        for page_num in range(doc.page_count):
            text = doc[page_num].get_text("text").strip()
            if text:  # skip pages with no extractable text
                rec = {"source": "document.pdf", "page": page_num + 1, "content": text}
                f.write(json.dumps(rec, ensure_ascii=False) + "\n")
                count += 1
    doc.close()
    print(f"Wrote {count} page records")
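
The page loop and the JSONL writer from the steps above can also be folded into small reusable helpers. The following is a minimal sketch, not a definitive implementation: the function names iter_records, write_jsonl, and extract_pdf are hypothetical, and fitz is imported inside extract_pdf only so that the two pure helpers can be exercised without PyMuPDF installed.

```python
import json
from typing import Iterable, Iterator


def iter_records(source: str, page_texts: Iterable[str]) -> Iterator[dict]:
    """Yield one record per non-empty page, numbering pages from 1."""
    for page_num, text in enumerate(page_texts, start=1):
        text = text.strip()
        if text:  # skip pages with no extractable text
            yield {"source": source, "page": page_num, "content": text}


def write_jsonl(records: Iterable[dict], out_path: str) -> int:
    """Write records one per line and return how many were written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
            count += 1
    return count


def extract_pdf(pdf_path: str, out_path: str) -> int:
    """Extract a PDF page by page into a JSONL file (hypothetical helper)."""
    import fitz  # PyMuPDF; imported lazily so the helpers above work without it

    with fitz.open(pdf_path) as doc:
        # The generator pulls one page at a time, so only one page's text
        # is held in memory at any moment.
        page_texts = (page.get_text("text") for page in doc)
        return write_jsonl(iter_records(pdf_path, page_texts), out_path)
```

Calling extract_pdf("document.pdf", "extracted.jsonl") would then produce the same kind of file as step 4, with page numbers preserved even when empty pages are skipped.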
    

Verification

python3 -c "import fitz; print(f'PyMuPDF version: {fitz.__version__}')"
# Expected: PyMuPDF version: <version-number>

Common failures

  • Scanned PDFs contain no extractable text. Use OCR libraries like Tesseract for scanned documents.
  • Unicode rendering errors. Pass extracted text through unicodedata.normalize("NFKC", text).
  • Memory exhaustion on large PDFs. Process and write out one page at a time instead of accumulating the text of the entire document in memory.
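  • Table extraction misaligned. Table detection (for example page.find_tables() in recent PyMuPDF releases) can misplace cell boundaries; adjust coordinates manually for the edge cases the detector misses.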
  • Hidden metadata in extracted text. Filter blocks by type and skip annotations with zero height.
  • Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
  • Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
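
The Unicode and block-filtering fixes above can be sketched as one cleanup pass over the tuples returned by page.get_text("blocks"). This is an illustrative sketch under stated assumptions: clean_blocks is a hypothetical helper, the zero-height rule is read here as applying to blocks, and sorting by top edge and then left edge is only a rough reading order that will not suit multi-column layouts.

```python
import unicodedata

TEXT_BLOCK = 0  # block_type PyMuPDF reports for text blocks (1 is an image block)


def clean_blocks(blocks):
    """Return normalized text of visible text blocks, top-to-bottom, left-to-right."""
    kept = []
    for x0, y0, x1, y1, content, block_no, block_type in blocks:
        if block_type != TEXT_BLOCK:
            continue  # drop image blocks
        if y1 - y0 <= 0:
            continue  # drop zero-height blocks (assumed to be hidden text)
        text = unicodedata.normalize("NFKC", content).strip()
        if text:
            kept.append((y0, x0, text))
    kept.sort(key=lambda item: (item[0], item[1]))
    return [text for _, _, text in kept]


# Synthetic block tuples standing in for page.get_text("blocks") output.
sample = [
    (72, 200, 300, 220, "Second \ufb01le\n", 1, 0),  # contains the "fi" ligature
    (72, 100, 300, 120, "First\n", 0, 0),
    (72, 300, 300, 300, "hidden\n", 2, 0),           # zero height
    (72, 400, 300, 500, "<image>", 3, 1),            # image block
]
print(clean_blocks(sample))  # ['First', 'Second file']
```

NFKC normalization is what turns the single-character ligature into the two letters "fi", which tends to matter for keyword search and embedding consistency.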
