Preserving Tables When Converting PDF to Markdown — Why It's Hard and How to Fix It

Tables are the single hardest thing to get right when converting a PDF to Markdown. A financial report, scientific paper, or product datasheet you converted yesterday probably has at least one table that came out as a space-separated mess.

This article explains why this happens (it's not the converter's fault — mostly), which tools handle tables well, and how to fix the rest by hand without spending an afternoon doing it.

Why PDF tables are uniquely hard

With the exception of tagged PDFs, which many files are not, PDFs don't store tables. They store positioned text and the lines that draw the table's borders. A "table" in a PDF is a visual coincidence: numbers happen to align in a grid because the document creator placed them at specific x-coordinates.

To extract a table from a PDF, a converter has to:

  1. Detect that a group of text positions forms a grid
  2. Infer the column boundaries from those positions
  3. Decide which text belongs in which cell
  4. Handle merged cells, multi-line cells, and cells with embedded formatting

This is heuristic work, and the heuristics fail on edge cases. Different libraries make different guesses; the same PDF produces different table output depending on the tool.
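To make the four steps concrete, here is a minimal sketch of steps 2 and 3 in plain Python. It assumes you already have word boxes from some PDF library. The `Word` class, the `min_gap` and `row_tolerance` thresholds, and the top-down y-axis are illustrative assumptions, not any particular tool's behaviour.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x0: float  # left edge of the word on the page
    x1: float  # right edge of the word on the page
    y: float   # vertical position; assumed to grow downward


def infer_columns(words, min_gap=10.0):
    """Merge word spans that sit closer than min_gap into column x-ranges."""
    columns = []
    for x0, x1 in sorted((w.x0, w.x1) for w in words):
        if columns and x0 - columns[-1][1] < min_gap:
            # Overlaps or nearly touches the previous span: same column.
            columns[-1][1] = max(columns[-1][1], x1)
        else:
            columns.append([x0, x1])
    return [tuple(c) for c in columns]


def assign_cells(words, columns, row_tolerance=2.0):
    """Group words into rows by y, then drop each word into the column holding its centre."""
    rows = []  # each entry: [row_y, [words on that row]]
    for w in sorted(words, key=lambda w: w.y):
        if rows and abs(w.y - rows[-1][0]) <= row_tolerance:
            rows[-1][1].append(w)
        else:
            rows.append([w.y, [w]])
    table = []
    for _, row_words in rows:
        cells = [[] for _ in columns]
        for w in sorted(row_words, key=lambda w: w.x0):
            centre = (w.x0 + w.x1) / 2
            for i, (left, right) in enumerate(columns):
                if left <= centre <= right:
                    cells[i].append(w.text)
                    break
        table.append([" ".join(c) for c in cells])
    return table


words = [
    Word("Item", 10, 40, 100), Word("Q1", 120, 135, 100), Word("Q2", 200, 215, 100),
    Word("Net", 10, 30, 115), Word("income", 34, 70, 115),
    Word("1,200", 110, 135, 115), Word("1,350", 190, 215, 115),
]
print(assign_cells(words, infer_columns(words)))
# [['Item', 'Q1', 'Q2'], ['Net income', '1,200', '1,350']]
```

Both thresholds are guesses. Set `min_gap` too high and adjacent columns merge; set it too low and a two-word cell splits in two. That is the kind of edge case on which real extractors disagree.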

Tables without visible ruling lines — borderless tables — are harder than ruled tables. The converter has to infer the grid from whitespace alone. Borderless table extraction is roughly 20–30 percentage points less accurate than ruled-table extraction across all the major tools.

Common failure modes

What goes wrong, with the patterns to recognize:

Knowing the failure mode tells you whether to fix the output by hand or re-run with a different tool.

Tools ranked by table quality

From best to worst at table extraction specifically:

The Markdown table format and its limits

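Markdown's pipe-syntax tables (as defined in GitHub Flavored Markdown) support a single header row, left, center, or right alignment per column, and inline formatting such as bold, links, and inline code inside cells.

They do not support merged cells (no colspan or rowspan), block content inside a cell (lists, multiple paragraphs, fenced code), nested tables, or tables without a header row.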

For tables that need any of the above, switch to HTML tables inside the Markdown file. Most Markdown renderers (including GitHub, Obsidian, and the major static-site generators) render HTML tables fine.
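As an illustration of the HTML fallback, the sketch below renders rows as an HTML table that can be pasted straight into a Markdown file. The `html_table` helper and its `(text, colspan)` tuple convention are hypothetical, chosen only to show a merged cell.

```python
from html import escape


def html_table(header, rows):
    """Render a table as HTML. A cell is a string or a (text, colspan) tuple."""

    def render_cell(cell, tag):
        text, span = cell if isinstance(cell, tuple) else (cell, 1)
        attr = f' colspan="{span}"' if span > 1 else ""
        # Escape the text so stray <, > or & in the data cannot break the markup.
        return f"<{tag}{attr}>{escape(text)}</{tag}>"

    lines = ["<table>"]
    lines.append("  <tr>" + "".join(render_cell(c, "th") for c in header) + "</tr>")
    for row in rows:
        lines.append("  <tr>" + "".join(render_cell(c, "td") for c in row) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


print(html_table(["Region", "Q1", "Q2"], [["North", "1,200", "1,350"], [("Total", 2), "2,550"]]))
```

Keep the generated block free of blank lines: many Markdown renderers end an HTML block at the first blank line.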

For LLM consumption, pipe-syntax tables are fine — LLMs handle them well. See Markdown vs plain text for LLMs.

Manual cleanup workflow

For a single table that came out mangled, the fastest fix is usually manual:

  1. Open the original PDF alongside your Markdown editor.
  2. Re-type just the header row in pipe syntax.
  3. Copy-paste the data rows from the PDF into a scratch buffer.
  4. Use find-and-replace to convert runs of multiple spaces into | separators.
  5. Verify the column count matches the header on every row.
  6. Test render in a Markdown preview to catch missed pipes.

Budget two to five minutes per table. For documents with dozens of tables, this stops being economical, so re-run the conversion with a better tool first.
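If your editor's find-and-replace is awkward, steps 4 and 5 can be scripted. The following is a rough sketch using only the standard library; the function names are made up for this example, and it assumes columns are separated by at least two spaces.

```python
import re


def split_cells(line):
    """Split one space-aligned row on runs of two or more spaces."""
    return re.split(r" {2,}", line.strip())


def to_pipe_row(cells):
    # Escape literal pipes so they are not read as column separators.
    return "| " + " | ".join(c.replace("|", r"\|") for c in cells) + " |"


def rebuild_table(header, data_lines):
    """Return (markdown_lines, bad_rows); bad_rows holds 1-based data rows with the wrong cell count."""
    markdown = [to_pipe_row(header), "|" + " --- |" * len(header)]
    bad_rows = []
    for number, line in enumerate(data_lines, start=1):
        cells = split_cells(line)
        if len(cells) != len(header):
            bad_rows.append(number)
        markdown.append(to_pipe_row(cells))
    return markdown, bad_rows


lines, bad = rebuild_table(
    ["Item", "2023", "2024"],
    ["Revenue    1,200   1,350", "Net income  300  410", "Tax 90   95"],
)
print("\n".join(lines))
print("Rows to fix by hand:", bad)  # Rows to fix by hand: [3]
```

The third row is flagged because only one space separates "Tax" from "90", so the two cells merge. Cases like that still need a look at the original PDF.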

When to ask a vision model to redo a table

For a one-off complex table (irregular shape, merged cells, embedded math), screenshot the table and paste it into Claude or GPT with the prompt:

Convert this table to Markdown pipe syntax, preserving exactly what you see. Use <br> for multi-line cells. If pipe syntax can't represent the structure, output an HTML table instead.

This usually beats heuristic extractors for complex layouts and runs in seconds. The output isn't always correct — verify the numbers against the original, especially for financial data — but the structural recovery is excellent.
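Part of that verification can be automated. The sketch below assumes you have the PDF's text layer as a plain string from whatever extractor you already use, and it flags any number in the model's table that does not occur in that text. The regex and function names are illustrative, not a standard.

```python
import re
from collections import Counter

# Digits with optional thousands separators and an optional decimal part.
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def numbers_in(text):
    """Count every numeric token in the text, ignoring thousands separators."""
    return Counter(tok.replace(",", "") for tok in NUMBER.findall(text))


def unverified_numbers(model_output, source_text):
    """Numbers that appear more often in the model output than in the source text."""
    extra = numbers_in(model_output) - numbers_in(source_text)
    return sorted(extra)


source = "Revenue 1,200 1,350 Net income 300 410"
table = "| Revenue | 1,200 | 1,850 |\n| Net income | 300 | 410 |"
print(unverified_numbers(table, source))  # ['1850']
```

This only checks that each number exists somewhere in the source, not that it landed in the right cell, so it narrows the manual check rather than replacing it.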

A good vision model handles cases that no heuristic-based tool gets right:

A note on financial documents

Financial statements deserve their own paragraph because they're uniquely hostile to PDF-to-Markdown conversion:

For these, AWS Textract or Azure Document Intelligence is worth the API cost. The open-source path: camelot-py for the tables, pymupdf4llm for everything else, then stitch the outputs together.

If you process financial PDFs at scale, build a per-page router: detect tables with a layout model, route table regions to Textract, route the rest to a general extractor.
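The router itself can be very small. Here is a hedged sketch of the dispatch logic only: the `Region` type, the `kind` labels, and the two extractor callables are placeholders for whatever layout model and extraction services you actually use, and the reading order assumes a single-column page with y growing downward.

```python
from dataclasses import dataclass


@dataclass
class Region:
    page: int
    kind: str    # "table" or "text", as labelled by the layout model
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates


def route(regions, table_extractor, text_extractor):
    """Send each region to the matching extractor and join the Markdown in reading order."""
    ordered = sorted(regions, key=lambda r: (r.page, r.bbox[1], r.bbox[0]))
    parts = []
    for region in ordered:
        extractor = table_extractor if region.kind == "table" else text_extractor
        parts.append(extractor(region))
    return "\n\n".join(parts)


# Stub extractors stand in for the real table service and general extractor.
regions = [
    Region(1, "table", (50, 300, 550, 500)),
    Region(1, "text", (50, 80, 550, 280)),
]
print(route(regions, lambda r: f"[table on page {r.page}]", lambda r: f"[text on page {r.page}]"))
```

Passing the extractors in as callables keeps the paid API behind one function, which makes it easy to swap in a cheaper tool for documents where it is good enough.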

Quick reference: choosing a table extraction tool

A decision flow:

Conclusion

Table extraction quality is the single best benchmark for picking a PDF converter. Most tools handle simple ruled tables; few handle complex ones well. If tables are critical to your output, test your tool of choice on your real documents before scaling up — a converter that works well on textbook examples often fails on actual financial statements.
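One simple way to run that test is to hand-correct a few tables from your own documents and score each tool against them. The metric below is an illustrative choice (exact match per cell position), not an established benchmark.

```python
def cell_accuracy(expected, actual):
    """Fraction of expected cells that the extracted table reproduces in the same position."""
    total = sum(len(row) for row in expected)
    if total == 0:
        raise ValueError("expected table has no cells")
    matches = 0
    for r, row in enumerate(expected):
        for c, cell in enumerate(row):
            in_range = r < len(actual) and c < len(actual[r])
            if in_range and actual[r][c].strip() == cell.strip():
                matches += 1
    return matches / total


expected = [["Item", "Q1"], ["Revenue", "1,200"]]
actual = [["Item", "Q1"], ["Revenue 1,200"]]  # second row collapsed into one cell
print(cell_accuracy(expected, actual))  # 0.5
```

Position-based matching is deliberately strict: one dropped column shifts every later cell in the row and the score falls sharply, which matches how such a table reads to a human.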

For a no-install starting point, the converter on this site wraps pymupdf4llm, which handles simple tables but punts on complex ones. For table-critical work, plan to combine tools.
