PDF Text Extraction Methods Compared — pymupdf, pdfplumber, pdfminer, and OCR
If you've ever Googled "extract text from PDF Python", you've seen the same three libraries recommended in different orders: pymupdf, pdfplumber, and pdfminer.six, plus a fourth recommendation that says "just use OCR." Each has real strengths and real failure modes. This article runs them against the same test documents and tells you which to use for which job.
What "extracting text from a PDF" actually means
PDFs aren't really text documents. They're descriptions of how to draw shapes on a page, some of which happen to be letters. A "text extraction" library has to reconstruct reading order from positioned glyphs — which letter comes after which, where one paragraph ends and the next begins, where a table cell is.
This is harder than it sounds because the rules for "which glyph comes next" are heuristics, not exact. Even good libraries disagree on the same PDF. Failure modes shared by every library:
- Multi-column layouts where the columns get interleaved (left-to-right reading produces zigzag prose)
- Tables collapsed into space-separated runs
- Footnotes inserted mid-sentence into body text
- Ligatures (fi, fl) coming through as the ligature glyph instead of the constituent letters
- Headers and footers appearing in the middle of body text instead of being filtered out
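Of the failure modes above, ligatures are usually the easiest to patch after extraction. A minimal sketch, assuming the extracted text contains the Unicode ligature code points U+FB00 to U+FB04; the helper name `expand_ligatures` is illustrative and not part of any library discussed here:

```python
# Map the Latin ligature code points (U+FB00..U+FB04) to their letters.
_LIGATURES = str.maketrans({
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
})

def expand_ligatures(text: str) -> str:
    """Replace ligature glyphs with their constituent letters."""
    return text.translate(_LIGATURES)

print(expand_ligatures("The \ufb01nal work\ufb02ow was e\ufb03cient."))
# prints: The final workflow was efficient.
```

`unicodedata.normalize("NFKC", text)` also expands these ligatures, but it rewrites other compatibility characters too (superscript digits, full-width forms), so a targeted table is the more conservative choice.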
Each library makes different choices about how aggressively to apply heuristics, and those choices show up as different outputs on the same input.
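To make the reading-order problem concrete, here is a small sketch in plain Python. The `Word` tuple, the coordinates, and the `split_x` column boundary are all hypothetical; real libraries infer column boundaries with more elaborate heuristics than a fixed split.

```python
from typing import NamedTuple

class Word(NamedTuple):
    x: float  # left edge, in points
    y: float  # top edge, in points (smaller = higher on the page)
    text: str

# Hypothetical two-column page: left column at x=50, right column at x=300.
words = [
    Word(50, 100, "Left-1"), Word(300, 100, "Right-1"),
    Word(50, 120, "Left-2"), Word(300, 120, "Right-2"),
]

def naive_order(words):
    """Top-to-bottom, then left-to-right: interleaves the columns."""
    return [w.text for w in sorted(words, key=lambda w: (w.y, w.x))]

def column_order(words, split_x):
    """Read everything left of split_x first, then everything right of it."""
    return [w.text for w in sorted(words, key=lambda w: (w.x >= split_x, w.y, w.x))]

print(naive_order(words))                # ['Left-1', 'Right-1', 'Left-2', 'Right-2']
print(column_order(words, split_x=200))  # ['Left-1', 'Left-2', 'Right-1', 'Right-2']
```

The naive ordering is the zigzag prose described above. The column-aware ordering only works here because the split position is known in advance, which is the part each library has to guess.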
The contenders
Brief introduction to each before the comparison:
- pymupdf (PyMuPDF, the Python binding for MuPDF) — C engine, fast, comprehensive feature set
- pdfplumber — pure Python, built on pdfminer.six, strong on tables and visual layout
- pdfminer.six — pure Python, oldest of the three, lower-level
- pymupdf4llm — a wrapper on pymupdf that targets Markdown output for LLM consumption
- marker — newer, uses ML to handle layouts, slower but much better on complex documents
- OCR (Tesseract or vision models) — fallback for image-only or scanned PDFs
Side-by-side comparison
| Library | Speed | Tables | Reading order | Markdown output | Image-only PDFs | Install |
|---|---|---|---|---|---|---|
| pymupdf | Very fast | Mediocre | Good | No (raw text) | No | pip only |
| pdfplumber | Slow | Best (non-ML) | OK | No | No | pip only |
| pdfminer.six | Slow | Poor | OK | No | No | pip only |
| pymupdf4llm | Fast | Decent | Good | Yes | No | pip only |
| marker | Slow | Excellent | Excellent | Yes | No | Heavy ML deps |
| Tesseract OCR | Slow | Poor | Variable | No | Yes | System binary |
A summary in prose:
- pymupdf is the speed king. For large batches of digital PDFs where you want plain text fast and don't need fancy structure, nothing else comes close.
- pdfplumber has the strongest table extraction outside the ML-based tools. The cost is speed — it's 5–10× slower than pymupdf for the same document.
- pdfminer.six is the lowest-level option and the most flexible if you want to walk the PDF's actual structure objects yourself. For straight text extraction, it's slower than pymupdf without being more accurate.
- pymupdf4llm is the right default when the destination is an LLM or a Markdown editor. It produces structured Markdown output (headings, lists, code blocks) with most of pymupdf's speed.
- marker is the highest-quality option and the slowest. It uses ML models to detect layouts, and it handles multi-column papers, embedded equations, and complex tables far better than the heuristic-based tools.
- OCR is the only option for image-only PDFs. See the Tesseract OCR guide.
When to pick each
Concrete situations and the right tool for each:
Large batches of digital PDFs, plain text output: pymupdf. Process 1000 PDFs in a few minutes; spend the saved time on downstream cleanup.
Documents where tables are the point: pdfplumber for non-financial documents; AWS Textract or Azure Document Intelligence for financial statements and forms. See preserving tables when converting PDF to Markdown for table-specific advice.
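For the pdfplumber route, a rough sketch of pulling tables out and rendering them as Markdown. `page.extract_tables()` is pdfplumber's table API; the helper names `rows_to_markdown` and `tables_as_markdown` are made up for this example, and the sketch assumes the first row of each table is its header.

```python
def rows_to_markdown(rows):
    """Render a table (list of rows, first row = header) as a Markdown table.

    Cells may be None, which is how empty cells are assumed to arrive.
    """
    def clean(cell):
        # Collapse whitespace and escape pipes so each cell stays on one line.
        return " ".join(str(cell or "").split()).replace("|", "\\|")

    width = max(len(row) for row in rows)
    padded = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]
    header, body = padded[0], padded[1:]
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * width]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)

def tables_as_markdown(path):
    """Return a Markdown string for every table pdfplumber finds in the PDF."""
    import pdfplumber  # imported lazily so rows_to_markdown works without it

    with pdfplumber.open(path) as pdf:
        return [rows_to_markdown(table)
                for page in pdf.pages
                for table in page.extract_tables()
                if table]
```

Calling `tables_as_markdown("paper.pdf")` would return one Markdown string per detected table. Merged cells and multi-row headers are not handled here and still need manual cleanup.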
Lower-level inspection, debugging a weird PDF: pdfminer.six. When you need to walk the actual layout objects to figure out why another library got the output wrong.
Feeding documents to LLMs: pymupdf4llm. Markdown headings, lists, and basic tables — exactly what LLMs parse well. See Markdown vs plain text for LLMs.
Highest quality regardless of speed: marker. It suits mixed-content PDFs with figures, equations, and multi-column layouts. Plan for roughly half a second per page on a GPU and about two seconds per page on CPU, consistent with the 30-page timings under Code snippets.
Scanned or image-only PDFs: OCR. Tesseract for clean print; a vision model for handwriting or messy layouts.
A benchmark on three real documents
Running all five libraries against three representative documents:
Document 1: a clean digital research paper (single-column, embedded text, no tables)
All five extract usable text. Differences show up in heading detection and reference formatting. pymupdf4llm and marker both produce Markdown headings; the others produce flat text. pdfplumber and pdfminer.six format references on separate lines; pymupdf concatenates them. No clear winner — pick by output format.
Document 2: a multi-column journal article with tables and figures
This is where the choices diverge:
- pymupdf: interleaves the two columns into zigzag prose. Tables collapse into space-separated rows. Unusable without significant cleanup.
- pdfplumber: detects column boundaries correctly. Tables come through as structured rows. Slow but high-quality.
- pdfminer.six: similar column handling to pdfplumber but weaker on tables.
- pymupdf4llm: column-aware, decent table support, Markdown output. Best speed/quality balance.
- marker: best output of the five. Handles equations, captions, and embedded figures cleanly. Slowest.
Document 3: a scanned book chapter (image-only PDF)
Everything except OCR returns nothing. Tesseract recovers about 92% of the text with the right --psm mode. A vision model (GPT-4o) recovers ~98% but costs per page.
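A sketch of what the Tesseract pass might look like, assuming `pymupdf`, `pytesseract`, and Pillow are installed alongside the Tesseract binary. The function names, the 300 dpi render, and the default mode are assumptions for illustration, not the settings behind the figures above.

```python
def tesseract_config(psm: int) -> str:
    """Build the config string for a page segmentation mode (0-13)."""
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return f"--psm {psm}"

def ocr_pdf(path, psm=3, dpi=300):
    """OCR every page of an image-only PDF and return the joined text."""
    # Third-party imports are local so the helper above runs without them.
    import pymupdf
    import pytesseract
    from PIL import Image

    texts = []
    with pymupdf.open(path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)  # render the page to an RGB bitmap
            image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            texts.append(pytesseract.image_to_string(image, config=tesseract_config(psm)))
    return "\n".join(texts)
```

Mode 3 is Tesseract's fully automatic default; mode 6 (a single uniform block of text) or mode 4 (a single column of variable-size text) is often worth trying on book pages.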
Combining tools
A pragmatic stack that handles most real-world inputs:
- Try pymupdf4llm first. Fast, Markdown-aware, handles 80% of PDFs well.
- Check the output. If it's empty or very short relative to the page count, fall back to OCR.
- For pages where pymupdf4llm produces text but misses content that exists only as embedded images (for example, a scanned page in the middle of a digital document), run OCR on just those pages.
This is what the converter on this site does internally: pymupdf4llm for the bulk extraction, with a fallback OCR pass that catches scanned pages embedded mid-document and re-processes them.
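The document-level check can be sketched as follows. This is an illustration under stated assumptions rather than the converter's actual code: the 100-characters-per-page threshold is arbitrary, and `ocr` stands for whatever OCR function you supply.

```python
def looks_scanned(text: str, page_count: int, min_chars_per_page: int = 100) -> bool:
    """True when extraction produced suspiciously little text for the page count."""
    if page_count <= 0:
        return True
    return len(text.strip()) / page_count < min_chars_per_page

def extract_with_fallback(path, ocr):
    """Try pymupdf4llm first; hand the file to ocr(path) if the result is too thin."""
    import pymupdf
    import pymupdf4llm

    with pymupdf.open(path) as doc:
        page_count = doc.page_count
    markdown = pymupdf4llm.to_markdown(path)
    return ocr(path) if looks_scanned(markdown, page_count) else markdown
```

The threshold is worth tuning on your own corpus: slide decks and forms carry little text per page even when nothing is wrong.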
For batch pipelines processing millions of pages, the same pattern with a per-page text-density check (route empty pages to OCR, send the rest through pymupdf) keeps costs low while catching the edge cases.
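A per-page version of the same idea, again as a sketch. The 50-character cutoff is an assumed value, and `ocr_page` is a caller-supplied function that takes a pymupdf page and returns its text.

```python
def pages_needing_ocr(page_texts, min_chars=50):
    """Return the 0-based indexes of pages whose text layer is empty or near-empty."""
    return [i for i, text in enumerate(page_texts) if len(text.strip()) < min_chars]

def extract_pages(path, ocr_page, min_chars=50):
    """Use the text layer where one exists; call ocr_page(page) for the rest."""
    import pymupdf

    with pymupdf.open(path) as doc:
        texts = [page.get_text() for page in doc]
        for i in pages_needing_ocr(texts, min_chars):
            texts[i] = ocr_page(doc[i])
    return texts
```

Because only the flagged pages reach OCR, the expensive step scales with the number of scanned pages rather than the size of the corpus.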
Code snippets
The same task — extract page text and count characters — in each library:
# pymupdf
import pymupdf
doc = pymupdf.open("paper.pdf")
text = "\n".join(page.get_text() for page in doc)
print(len(text))
# pdfplumber
import pdfplumber
with pdfplumber.open("paper.pdf") as pdf:
    # "or ''" guards against pages with no extractable text
    text = "\n".join(page.extract_text() or "" for page in pdf.pages)
print(len(text))
# pdfminer.six (high-level API)
from pdfminer.high_level import extract_text
text = extract_text("paper.pdf")
print(len(text))
# pymupdf4llm
import pymupdf4llm
md = pymupdf4llm.to_markdown("paper.pdf")
print(len(md))
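# marker (rough sketch; the actual API has more setup, and these import paths
# match older marker releases, so check the version you have installed)
from marker.convert import convert_single_pdf
from marker.models import load_all_models
models = load_all_models()  # loads the ML models; slow on first call
text, _, _ = convert_single_pdf("paper.pdf", models)
print(len(text))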
On the same 30-page paper, expect roughly: pymupdf 1s, pymupdf4llm 2s, pdfplumber 8s, pdfminer.six 12s, marker 60s (CPU) or 15s (GPU).
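To reproduce timings like these on your own documents, a small harness along these lines is enough. It is a sketch: `time_extractor` is a name invented here, and taking the best of three runs is one reasonable choice among several.

```python
import time

def time_extractor(extract, path, repeats=3):
    """Return (best_seconds, char_count) for extract(path) over several runs."""
    best = float("inf")
    chars = 0
    for _ in range(repeats):
        start = time.perf_counter()
        text = extract(path)
        best = min(best, time.perf_counter() - start)
        chars = len(text)
    return best, chars
```

For example, `time_extractor(pymupdf4llm.to_markdown, "paper.pdf")` times the pymupdf4llm snippet above. Wrap the other libraries in a function that takes a path and returns a string to compare them the same way, and expect the first run to be slower for marker because of model loading.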
Conclusion
Default to pymupdf4llm for most jobs — fast, Markdown-aware, decent table support, the right balance for general PDF extraction. Reach for marker when you need top-shelf quality on a small batch. Reach for OCR when the PDF has no extractable text at all.
If you want a no-install starting point, the converter on this site wraps pymupdf4llm and OCR fallback into a single upload. For larger batches, install the library directly and run it locally.