Preserving Tables When Converting PDF to Markdown — Why It's Hard and How to Fix It
Tables are the single hardest thing to get right when converting a PDF to Markdown. A financial report, scientific paper, or product datasheet you converted yesterday probably has at least one table that came out as a space-separated mess.
This article explains why this happens (it's not the converter's fault — mostly), which tools handle tables well, and how to fix the rest by hand without spending an afternoon doing it.
Why PDF tables are uniquely hard
PDFs don't store tables. They store positioned text and the lines that draw the table's borders. A "table" in a PDF is a visual coincidence: numbers happen to align in a grid because the document creator placed them at specific x-coordinates.
To extract a table from a PDF, a converter has to:
- Detect that a group of text positions forms a grid
- Infer the column boundaries from those positions
- Decide which text belongs in which cell
- Handle merged cells, multi-line cells, and cells with embedded formatting
This is heuristic work, and the heuristics fail on edge cases. Different libraries make different guesses; the same PDF produces different table output depending on the tool.
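To make the steps above concrete, here is a minimal sketch of the column-inference heuristic. It assumes each word arrives as a dict with text, x0 (left edge), and top (vertical position), which mirrors the shape of pdfplumber's extract_words() output, though any source of positioned words would do. The function names and the min_gap and row_tol thresholds are illustrative choices, not taken from any particular library.

```python
from bisect import bisect_right

def infer_column_starts(words, min_gap=20.0):
    """Cluster left edges: a jump wider than min_gap starts a new column."""
    starts, prev = [], None
    for x in sorted(w["x0"] for w in words):
        if prev is None or x - prev > min_gap:
            starts.append(x)
        prev = x
    return starts

def build_grid(words, min_gap=20.0, row_tol=3.0):
    """Assign each word to a (row, column) cell using position alone."""
    starts = infer_column_starts(words, min_gap)
    rows = []  # list of (top_of_row, cells)
    for w in sorted(words, key=lambda w: (w["top"], w["x0"])):
        # A word further than row_tol below the current row opens a new row.
        if not rows or w["top"] - rows[-1][0] > row_tol:
            rows.append((w["top"], [""] * len(starts)))
        # Pick the last column whose start is at or left of this word.
        col = bisect_right(starts, w["x0"]) - 1
        cells = rows[-1][1]
        cells[col] = (cells[col] + " " + w["text"]).strip()
    return [cells for _, cells in rows]
```

Both thresholds are guesses that depend on font size and page layout, which is exactly why this kind of heuristic breaks on right-aligned numbers, wrapped cells, and tightly spaced columns.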
Tables without visible ruling lines — borderless tables — are harder than ruled tables. The converter has to infer the grid from whitespace alone. Borderless table extraction is roughly 20–30 percentage points less accurate than ruled-table extraction across all the major tools.
Common failure modes
What goes wrong, with the patterns to recognize:
- Column collapse. A multi-column table becomes a single run of text where values run together: Q1 Q2 Q3 Q4 100 150 200 250. Caused by the converter reading left-to-right per pixel row instead of detecting column boundaries first.
- Row mixing. Rows interleave because the converter reads across columns instead of down them. You'll see Q1 values mixed with Q2 values.
- Header detached. The header row gets emitted as a regular paragraph, leaving an unlabeled grid of numbers below.
- Multi-line cells flatten. A cell containing "Strongly\nagree" gets split: the second line ("agree") comes out as the next column's value, shifting everything to the right of it.
- Merged cells expand or repeat. A merged "Total" header copies into both columns it spans, or one of the columns ends up unlabeled.
- Footnote markers inside cells. Superscript footnote references break cell boundary detection in many tools.
Knowing the failure mode tells you whether to fix the output by hand or re-run with a different tool.
Tools ranked by table quality
From best to worst at table extraction specifically:
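- AWS Textract: purpose-built for forms and tables. ~95% cell accuracy on standard tables. Table analysis costs about $15 per 1,000 pages (the ~$1.50 per 1,000 rate applies to plain text detection only), so check current AWS pricing. The right pick if tables are the point.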
- Azure Document Intelligence — comparable to Textract, particularly strong on financial documents.
- Google Cloud Document AI — strong on structured forms and templated documents.
- marker — open-source, ML-based layout detection, best free option for tables, but slow.
- AI vision models (GPT-4o, Claude, Gemini) — increasingly competitive. Screenshot the table, ask the model to convert to Markdown pipe syntax. Works surprisingly well on complex layouts.
- pdfplumber — good table detection for visually-ruled tables; weaker on borderless tables. Best of the pure-Python options.
- pymupdf4llm — basic table support. Works on simple ruled tables; collapses complex ones.
- camelot-py — specialized for tables, often competitive with pdfplumber but with a different set of strengths (better on lattice tables, weaker on stream tables).
- pdfminer.six / pymupdf raw — minimal table support. Expect to do most of the work yourself.
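As a starting point for the pure-Python route, the sketch below pulls tables out with pdfplumber. The extract_tables() call and the vertical_strategy / horizontal_strategy settings are part of pdfplumber's documented API. The wrapper names (extract_pdf_tables, normalize_rows) and the ruled flag are hypothetical helpers added for illustration.

```python
def normalize_rows(table):
    """Replace None cells (empty in pdfplumber output) with "" and trim whitespace."""
    return [[(cell or "").strip() for cell in row] for row in table]

def extract_pdf_tables(pdf_path, ruled=True):
    """Return (page_number, rows) pairs for every table pdfplumber detects."""
    import pdfplumber  # third-party: pip install pdfplumber

    # "lines" follows drawn borders; "text" infers the grid from word alignment.
    strategy = "lines" if ruled else "text"
    settings = {"vertical_strategy": strategy, "horizontal_strategy": strategy}

    results = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            for table in page.extract_tables(table_settings=settings):
                results.append((page_number, normalize_rows(table)))
    return results
```

Running it once with ruled=True and once with ruled=False on the same file is a cheap way to see which strategy suits a given document before committing to a tool.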
The Markdown table format and its limits
Markdown's pipe-syntax tables support:
- A header row (separated from data by a --- row)
- Basic column alignment (:---, :---:, ---:)
- Inline formatting inside cells (bold, italic, code)
They do not support:
- Merged cells (no rowspan or colspan)
- Multi-line cells (workaround: use <br> inside the cell)
- Nested tables
- Block-level content inside cells (no lists or code blocks within a cell)
For tables that need any of the above, switch to HTML tables inside the Markdown file. Most Markdown renderers (including GitHub, Obsidian, and the major static-site generators) render HTML tables fine.
For LLM consumption, pipe-syntax tables are fine — LLMs handle them well. See Markdown vs plain text for LLMs.
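A small renderer shows how these rules fit together. This is a sketch, not a full Markdown writer: to_pipe_table is a hypothetical helper that takes rows of strings (header first), emits the alignment row, swaps newlines for <br>, and escapes literal pipes with a backslash, which GitHub-flavored Markdown accepts inside cells.

```python
def to_pipe_table(rows, align=None):
    """Render rows (header first) as a Markdown pipe table."""
    header, *body = rows
    width = len(header)
    align = align or ["left"] * width
    rules = {"left": ":---", "center": ":---:", "right": "---:"}

    def fmt(row):
        if len(row) > width:
            raise ValueError(f"row has {len(row)} cells, header has {width}")
        # Escape pipes and turn multi-line cells into <br>-separated text.
        cells = [str(c).replace("|", "\\|").replace("\n", "<br>") for c in row]
        cells += [""] * (width - len(cells))  # pad short rows
        return "| " + " | ".join(cells) + " |"

    lines = [fmt(header), "| " + " | ".join(rules[a] for a in align) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)
```

Merged cells have no equivalent here, so a table that needs rowspan or colspan still has to go out as HTML.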
Manual cleanup workflow
For a single table that came out mangled, the fastest fix is usually manual:
- Open the original PDF alongside your Markdown editor.
- Re-type just the header row in pipe syntax.
- Copy-paste the data rows from the PDF into a scratch buffer.
- Use find-and-replace to convert runs of multiple spaces into | separators.
- Verify the column count matches the header on every row.
- Test render in a Markdown preview to catch missed pipes.
Budget two to five minutes per table. For documents with dozens of tables, this stops being economical, so re-run the conversion with a better tool first.
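If you would rather script the find-and-replace and column-count steps, something like the following works on text pasted from the PDF. It assumes columns are separated by at least two spaces (or a tab) and that single spaces only occur inside a cell. The function names are illustrative, and the header's --- separator row still has to be added by hand.

```python
import re

def split_row(line, min_spaces=2):
    """Split one pasted line into cells on runs of whitespace or a tab."""
    pattern = rf"\s{{{min_spaces},}}|\t"
    return re.split(pattern, line.strip())

def text_to_rows(text, min_spaces=2):
    """Convert pasted text into a list of rows, skipping blank lines."""
    return [split_row(line, min_spaces) for line in text.splitlines() if line.strip()]

def mismatched_rows(rows):
    """Return 1-based row numbers whose cell count differs from the first row."""
    expected = len(rows[0])
    return [i + 1 for i, row in enumerate(rows) if len(row) != expected]

def as_pipe_lines(rows):
    """Wrap each row in pipe syntax."""
    return ["| " + " | ".join(row) + " |" for row in rows]
```

Any row reported by mismatched_rows usually has an empty cell in the original, which leaves no whitespace run to split on, so those rows need a manual look.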
When to ask a vision model to redo a table
For a one-off complex table (irregular shape, merged cells, embedded math), screenshot the table and paste into Claude or GPT with the prompt:
Convert this table to Markdown pipe syntax, preserving exactly what you see. Use <br> for multi-line cells. If pipe syntax can't represent the structure, output an HTML table instead.
This usually beats heuristic extractors for complex layouts and runs in seconds. The output isn't always correct — verify the numbers against the original, especially for financial data — but the structural recovery is excellent.
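The number check can be partly automated. The sketch below assumes you have the raw text of the table region (for example, copied from the PDF) and the model's Markdown output, and it compares the multiset of numeric tokens in each. The helper names are hypothetical, and the regex is deliberately simple: it ignores signs, currency symbols, and parentheses for negatives.

```python
import re
from collections import Counter

NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")

def numbers_in(text):
    """Count numeric tokens, ignoring thousands separators."""
    return Counter(m.group().replace(",", "") for m in NUMBER.finditer(text))

def diff_numbers(source_text, markdown_table):
    """Report numbers present in one text but not the other."""
    src, out = numbers_in(source_text), numbers_in(markdown_table)
    return {
        "missing": sorted((src - out).elements()),     # in the PDF, not in the output
        "unexpected": sorted((out - src).elements()),  # in the output, not in the PDF
    }
```

An empty result on both sides does not prove the table is right, since values can still land in the wrong cell, but a non-empty one reliably flags transposed digits and dropped rows.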
A good vision model handles cases that no heuristic-based tool gets right:
- Multi-row headers with column-spanning categories
- Side-by-side mini-tables on the same page
- Tables with embedded images or icons in cells
- Tables with footnote markers that need to stay associated with their referent
A note on financial documents
Financial statements deserve their own section because they're uniquely hostile to PDF-to-Markdown conversion. They tend to combine:
- Nested totals (subtotals within subtotals within a grand total)
- Hierarchical row headers (indented to show nesting depth)
- Currency symbols floating in their own columns
- Footnote markers everywhere
- Multi-column layouts on the same page (a 10-K filing often has running text and a small table interleaved)
For these, AWS Textract or Azure Document Intelligence is worth the API cost. The open-source path: camelot-py for the tables, pymupdf4llm for everything else, then stitch the outputs together.
If you process financial PDFs at scale, build a per-page router: detect tables with a layout model, route table regions to Textract, route the rest to a general extractor.
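A router like that can start out very small. The sketch below works at page granularity rather than region granularity, and it assumes some detector has already reported how many tables each page contains. The Page class, the route_pages function, and the backend labels are placeholders for illustration, not real client code for either service.

```python
from dataclasses import dataclass

@dataclass
class Page:
    number: int
    table_count: int  # supplied by whatever layout or table detector you use

def route_pages(pages, table_backend="textract", text_backend="pymupdf4llm"):
    """Group page numbers by the backend that should process them."""
    plan = {table_backend: [], text_backend: []}
    for page in pages:
        backend = table_backend if page.table_count > 0 else text_backend
        plan[backend].append(page.number)
    return plan
```

Since the paid API only sees pages that actually contain tables, cost scales with table density rather than document length.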
Quick reference: choosing a table extraction tool
A decision flow:
- Tables are simple, ruled, and you want a one-tool solution? pymupdf4llm covers it.
- Tables are the main content (financial reports, datasheets)? AWS Textract or Azure Document Intelligence.
- Open-source only, table quality matters? marker, with pdfplumber as a fallback.
- One complex table, not a batch? Screenshot and ask a vision model.
- Borderless tables (no visible grid lines)? Vision model or marker. Heuristic tools fail here.
Conclusion
Table extraction quality is the single best benchmark for picking a PDF converter. Most tools handle simple ruled tables; few handle complex ones well. If tables are critical to your output, test your tool of choice on your real documents before scaling up — a converter that works well on textbook examples often fails on actual financial statements.
For a no-install starting point, the converter on this site wraps pymupdf4llm, which handles simple tables but punts on complex ones. For table-critical work, plan to combine tools.