Extracting Tables from PDFs into CSV and Excel — A Practical Workflow
The most common reason people convert a PDF isn't to read it — it's to get the numbers out. Quarterly reports, scientific tables, government statistics, scraped invoices: the data is sitting inside the document but locked behind PDF's layout-first design. Getting it into a CSV or Excel sheet without spending half a day cleaning up merged cells is its own small skill.
This guide walks through what makes PDF-to-spreadsheet conversion hard, which tools actually work for each kind of table, and how to script the workflow for repeating jobs.
Why this is harder than it looks
A PDF table is rarely a "table" in the structural sense. The PDF format has no first-class concept of a row or a column — what you see as a table is a set of text runs positioned to look gridded, sometimes with line segments drawn on top for borders. A converter has to reconstruct the table by clustering nearby text into cells and inferring the grid.
The places this goes wrong:
- Merged cells — a header cell that spans three columns confuses naive grid detection. The converter either splits the header across cells or collapses neighbors into a single super-cell.
- Multi-line cells — a description field that wraps onto two lines often gets read as two separate rows, with the second row missing data in the other columns.
- Inferred vs drawn borders — tables without visible borders rely entirely on whitespace clustering. The slightest column drift turns a clean two-column table into a single jumbled column.
- Rotated or vertical text — headers rotated 90° to fit narrow columns get extracted as random characters or skipped entirely.
- Tables that span pages — the header repeats on each page; the converter has to recognize the repeat and stitch the data rows together.
Each tool below makes different trade-offs on these failure modes.
Picking the right tool
A practical decision chart for the most common situations:
Born-digital PDF, simple bordered tables (financial statements, reports): pdfplumber — Python library with the best non-ML table extractor. Free, scriptable, reasonably fast. See the extraction methods comparison for context.
Born-digital PDF, complex tables (academic papers, mixed text-and-table layouts): camelot — older Python library, two engines (lattice for ruled tables, stream for whitespace tables). Less actively maintained than pdfplumber but its stream mode still beats most alternatives for borderless tables.
Born-digital PDF with ML help: Tabula — GUI tool where you draw a box around the table and it extracts. Good for one-off jobs without scripting.
Scanned PDFs or image-only documents: AWS Textract, Azure Document Intelligence, or Google Document AI. None of the free Python libraries handle scanned tables well — see the cloud OCR comparison.
Truly messy tables (rotated headers, merged cells, multi-line rows): A vision model with a careful prompt. GPT-4o, Claude, and Gemini can read messy tables and output CSV directly, with the trade-off of cost and the occasional hallucinated cell.
One-off, low-volume: Adobe Acrobat Pro's "Export PDF to Excel" feature, or smallpdf's PDF-to-Excel tool. Inconsistent quality, but zero setup.
A pdfplumber workflow for clean tables
For born-digital PDFs with reasonably structured tables, this is the pragmatic starting point:
import pdfplumber
import csv
with pdfplumber.open("report.pdf") as pdf:
rows = []
for page in pdf.pages:
tables = page.extract_tables()
for table in tables:
rows.extend(table)
with open("report.csv", "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(rows)
This extracts every table on every page and concatenates them. For most simple cases it produces usable CSV out of the box. When it doesn't, the typical fixes:
- Wrong column boundaries. Pass
table_settings={"vertical_strategy": "lines", "horizontal_strategy": "lines"}if the table has visible borders, or"text"if it doesn't. - Merged cells appearing as empty. Forward-fill empty cells using pandas:
df.fillna(method="ffill"). - Headers misidentified. Slice off the first N rows manually:
rows = rows[skip_n_header_rows:].
For a single document, the trial-and-error on settings is fast. For batch jobs, write a small wrapper that tries lines first, falls back to text if the result has too few columns, and logs which strategy was used per page.
Vision-model fallback for messy tables
When pdfplumber and camelot both fail — and they will, on documents like utility bills with rotated text, hand-marked invoices, or anything scanned — a vision model is the next step. The prompt matters more than the model choice:
Extract the table from this image as CSV.
- Preserve original column order.
- For merged cells spanning multiple rows, repeat the value in each row.
- For empty cells, output an empty string.
- For unclear values, output [???] with a brief note in a separate column.
- Do not paraphrase or "clean up" the data — output what's on the page.
- Output ONLY CSV, no commentary.
The "do not paraphrase" instruction is critical. Vision models love to "fix" inconsistencies like "Q1" vs "Q1 2024" by silently normalizing one to the other, which silently breaks downstream data work.
Cost: roughly $0.005–$0.02 per page on a typical vision model, depending on image resolution. For large batches, that adds up — see bulk PDF conversion for batching patterns.
Handling multi-page tables
A table that spans 5 pages with the header repeating on each is one of the most common real-world cases and one of the worst-handled by naive extractors.
The pattern that works:
- Extract every table on every page.
- Look at row 0 of each table. If it matches the same string across pages, it's a repeating header.
- Keep row 0 only once (from the first page); concatenate all other rows.
- Watch for "Continued from previous page" footer rows that look like data — strip them before concatenating.
In pdfplumber:
all_tables = []
header = None
for page in pdf.pages:
for table in page.extract_tables():
if header is None:
header = table[0]
all_tables.append(table)
elif table[0] == header:
all_tables.append(table[1:]) # skip repeated header
else:
all_tables.append(table)
For tables where the header changes slightly (page number embedded, dates updated), use a fuzzy match instead of ==.
Excel-specific gotchas
If your destination is Excel rather than CSV:
- Leading zeros disappear. A column of zip codes or product IDs with leading zeros (
07930,01024) gets silently converted to numbers by Excel. Either pre-pad with a leading apostrophe ('07930) or write.xlsxdirectly withopenpyxland explicitly set the column type to text. - Dates auto-format unpredictably. A column with
01/02/2025is interpreted as January 2 in the US or February 1 in much of Europe. UseYYYY-MM-DDeverywhere, or write the column as text and parse explicitly. - Long numbers truncate. A 16-digit credit card number or scientific identifier gets converted to scientific notation. Same fix as above — store as text.
- Cells over 32,767 characters get split. Excel's per-cell character limit. If you're extracting long footnotes or descriptions, check for clipping.
When writing programmatically, prefer openpyxl over xlsxwriter for editing existing workbooks; prefer xlsxwriter for writing new files with complex formatting.
import openpyxl
wb = openpyxl.Workbook()
ws = wb.active
for row in rows:
ws.append(row)
# Force a column to text type to preserve leading zeros
for cell in ws["A"]:
cell.number_format = "@"
wb.save("report.xlsx")
A practical end-to-end example: extracting an invoice
A real workflow that recurs frequently: extract line items from a stack of supplier invoices into a single spreadsheet.
- Sort the invoices by template. Invoices from the same supplier follow the same layout. Group them; tune extraction settings per template once instead of per invoice.
- For each template, manually inspect one document. Identify the line-item table, the columns you need, where the totals row is.
- Write a small extractor per template. Use pdfplumber with settings tuned for that template. Output a normalized row schema (supplier, date, line_no, sku, description, qty, unit_price, total).
- Stitch the per-template outputs into one master CSV. Add a
source_invoicecolumn so you can trace back to the original document. - Spot-check the totals. Sum the line-item totals per invoice and compare against the invoice's stated total. Discrepancies flag extraction errors.
For invoices without a consistent template (one-off vendors, hand-edited PDFs), fall back to vision-model extraction with the prompt template above. The cost per invoice is low enough that it beats writing template-specific code for documents that appear once.
Validating the output
A conversion is only useful if you can trust it. A short validation pass:
- Row count check. Compare the extracted row count against a manual count from one or two source pages. Big discrepancies usually mean multi-line rows got split or merged.
- Column count check. Every row should have the same number of columns. Rows with extra or missing columns are extraction errors.
- Type check. Numeric columns should parse as numbers. A "currency" column with embedded
$and,characters needs cleanup; one with stray text means extraction picked up an adjacent cell. - Sum-and-total check. Where the source PDF has subtotals or grand totals, sum the extracted data and compare. This catches the silent errors that all the above miss.
Skipping this validation is the most common reason people get burned by PDF-extracted data — they trust the converter, the numbers look right at a glance, and the silent off-by-one row shift only surfaces in a downstream analysis weeks later.
When to give up on automation
Some PDFs aren't worth automating. A 3-page document with one table that gets extracted once is faster to retype manually than to engineer a robust extractor for. The rough break-even point: if the table has fewer than ~30 cells of data and you only need to extract it once, retype it. Save automation for tables that repeat, scale, or where retyping would introduce its own errors.
If you want a no-setup option for occasional table extractions, the converter on this site outputs Markdown tables that paste cleanly into Excel via the clipboard, which handles small one-off cases without writing code.
← Back to all guides