Extracting Tables from PDFs into CSV and Excel — A Practical Workflow

The most common reason people convert a PDF isn't to read it — it's to get the numbers out. Quarterly reports, scientific tables, government statistics, scraped invoices: the data is sitting inside the document but locked behind PDF's layout-first design. Getting it into a CSV or Excel sheet without spending half a day cleaning up merged cells is its own small skill.

This guide walks through what makes PDF-to-spreadsheet conversion hard, which tools actually work for each kind of table, and how to script the workflow for repeating jobs.

Why this is harder than it looks

A PDF table is rarely a "table" in the structural sense. The PDF format has no first-class concept of a row or a column — what you see as a table is a set of text runs positioned to look gridded, sometimes with line segments drawn on top for borders. A converter has to reconstruct the table by clustering nearby text into cells and inferring the grid.

The places this goes wrong:

Each tool below makes different trade-offs on these failure modes.

Picking the right tool

A practical decision chart for the most common situations:

Born-digital PDF, simple bordered tables (financial statements, reports): pdfplumber — Python library with the best non-ML table extractor. Free, scriptable, reasonably fast. See the extraction methods comparison for context.

Born-digital PDF, complex tables (academic papers, mixed text-and-table layouts): camelot — older Python library, two engines (lattice for ruled tables, stream for whitespace tables). Less actively maintained than pdfplumber but its stream mode still beats most alternatives for borderless tables.

Born-digital PDF with ML help: Tabula — GUI tool where you draw a box around the table and it extracts. Good for one-off jobs without scripting.

Scanned PDFs or image-only documents: AWS Textract, Azure Document Intelligence, or Google Document AI. None of the free Python libraries handle scanned tables well — see the cloud OCR comparison.

Truly messy tables (rotated headers, merged cells, multi-line rows): A vision model with a careful prompt. GPT-4o, Claude, and Gemini can read messy tables and output CSV directly, with the trade-off of cost and the occasional hallucinated cell.

One-off, low-volume: Adobe Acrobat Pro's "Export PDF to Excel" feature, or smallpdf's PDF-to-Excel tool. Inconsistent quality, but zero setup.

A pdfplumber workflow for clean tables

For born-digital PDFs with reasonably structured tables, this is the pragmatic starting point:

import pdfplumber
import csv

with pdfplumber.open("report.pdf") as pdf:
    rows = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            rows.extend(table)

with open("report.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(rows)

This extracts every table on every page and concatenates them. For most simple cases it produces usable CSV out of the box. When it doesn't, the typical fixes:

For a single document, the trial-and-error on settings is fast. For batch jobs, write a small wrapper that tries lines first, falls back to text if the result has too few columns, and logs which strategy was used per page.

Vision-model fallback for messy tables

When pdfplumber and camelot both fail — and they will, on documents like utility bills with rotated text, hand-marked invoices, or anything scanned — a vision model is the next step. The prompt matters more than the model choice:

Extract the table from this image as CSV.
- Preserve original column order.
- For merged cells spanning multiple rows, repeat the value in each row.
- For empty cells, output an empty string.
- For unclear values, output [???] with a brief note in a separate column.
- Do not paraphrase or "clean up" the data — output what's on the page.
- Output ONLY CSV, no commentary.

The "do not paraphrase" instruction is critical. Vision models love to "fix" inconsistencies like "Q1" vs "Q1 2024" by silently normalizing one to the other, which silently breaks downstream data work.

Cost: roughly $0.005–$0.02 per page on a typical vision model, depending on image resolution. For large batches, that adds up — see bulk PDF conversion for batching patterns.

Handling multi-page tables

A table that spans 5 pages with the header repeating on each is one of the most common real-world cases and one of the worst-handled by naive extractors.

The pattern that works:

  1. Extract every table on every page.
  2. Look at row 0 of each table. If it matches the same string across pages, it's a repeating header.
  3. Keep row 0 only once (from the first page); concatenate all other rows.
  4. Watch for "Continued from previous page" footer rows that look like data — strip them before concatenating.

In pdfplumber:

all_tables = []
header = None
for page in pdf.pages:
    for table in page.extract_tables():
        if header is None:
            header = table[0]
            all_tables.append(table)
        elif table[0] == header:
            all_tables.append(table[1:])  # skip repeated header
        else:
            all_tables.append(table)

For tables where the header changes slightly (page number embedded, dates updated), use a fuzzy match instead of ==.

Excel-specific gotchas

If your destination is Excel rather than CSV:

When writing programmatically, prefer openpyxl over xlsxwriter for editing existing workbooks; prefer xlsxwriter for writing new files with complex formatting.

import openpyxl
wb = openpyxl.Workbook()
ws = wb.active
for row in rows:
    ws.append(row)
# Force a column to text type to preserve leading zeros
for cell in ws["A"]:
    cell.number_format = "@"
wb.save("report.xlsx")

A practical end-to-end example: extracting an invoice

A real workflow that recurs frequently: extract line items from a stack of supplier invoices into a single spreadsheet.

  1. Sort the invoices by template. Invoices from the same supplier follow the same layout. Group them; tune extraction settings per template once instead of per invoice.
  2. For each template, manually inspect one document. Identify the line-item table, the columns you need, where the totals row is.
  3. Write a small extractor per template. Use pdfplumber with settings tuned for that template. Output a normalized row schema (supplier, date, line_no, sku, description, qty, unit_price, total).
  4. Stitch the per-template outputs into one master CSV. Add a source_invoice column so you can trace back to the original document.
  5. Spot-check the totals. Sum the line-item totals per invoice and compare against the invoice's stated total. Discrepancies flag extraction errors.

For invoices without a consistent template (one-off vendors, hand-edited PDFs), fall back to vision-model extraction with the prompt template above. The cost per invoice is low enough that it beats writing template-specific code for documents that appear once.

Validating the output

A conversion is only useful if you can trust it. A short validation pass:

Skipping this validation is the most common reason people get burned by PDF-extracted data — they trust the converter, the numbers look right at a glance, and the silent off-by-one row shift only surfaces in a downstream analysis weeks later.

When to give up on automation

Some PDFs aren't worth automating. A 3-page document with one table that gets extracted once is faster to retype manually than to engineer a robust extractor for. The rough break-even point: if the table has fewer than ~30 cells of data and you only need to extract it once, retype it. Save automation for tables that repeat, scale, or where retyping would introduce its own errors.

If you want a no-setup option for occasional table extractions, the converter on this site outputs Markdown tables that paste cleanly into Excel via the clipboard, which handles small one-off cases without writing code.

← Back to all guides