Extracting Structured JSON Data from PDFs — Schemas, Tools, and Validation
Plenty of work doesn't need the prose of a PDF — it needs the fields. The invoice total, the patient's date of birth, the line items, the contract's effective date. For those jobs the target format isn't Markdown or plain text, it's JSON: a typed, validated record you can drop into a database or hand to another program.
This is a different problem from converting a PDF to readable text, and it fails in different ways. This guide covers how to define what you want, the three families of extraction technique, and — the part people skip — how to know whether the JSON you got back is actually correct.
Start with the schema, not the PDF
The most common mistake is to extract first and figure out the shape later. Do the opposite. Write down the JSON schema you want before you touch a single document:
{
  "invoice_number": "string",
  "issue_date": "YYYY-MM-DD",
  "vendor": { "name": "string", "tax_id": "string|null" },
  "line_items": [
    { "description": "string", "quantity": "number", "unit_price": "number" }
  ],
  "total": "number",
  "currency": "ISO 4217 code"
}
The schema is your contract. It tells you which fields are required, which can be null, and what type each value must be. Every extraction method below gets dramatically more reliable when it's aimed at an explicit schema rather than asked to "pull out the important data."
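To make that contract machine-checkable, the informal shape can be written as a formal JSON Schema. The sketch below is one possible translation, held in a Python dict; which fields are required, the `additionalProperties` setting, and the three-letter currency pattern are assumptions made for illustration.

```python
import json

# One possible JSON Schema (draft 2020-12) for the invoice shape.
# The "required" lists and "additionalProperties" are illustrative assumptions.
INVOICE_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "required": ["invoice_number", "issue_date", "vendor", "line_items", "total", "currency"],
    "additionalProperties": False,
    "properties": {
        "invoice_number": {"type": "string"},
        # "format" is often treated as an annotation unless the validator is told to enforce it.
        "issue_date": {"type": "string", "format": "date"},
        "vendor": {
            "type": "object",
            "required": ["name", "tax_id"],
            "properties": {
                "name": {"type": "string"},
                "tax_id": {"type": ["string", "null"]},  # present but nullable
            },
        },
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["description", "quantity", "unit_price"],
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                    "unit_price": {"type": "number"},
                },
            },
        },
        "total": {"type": "number"},
        "currency": {"type": "string", "pattern": "^[A-Z]{3}$"},
    },
}

# The schema is plain data, so it serializes for use in a prompt or a validator.
schema_text = json.dumps(INVOICE_SCHEMA, indent=2)
```

Because the schema is plain data, the same object can feed a validator and be pasted into an LLM prompt.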
Two decisions to make up front:
- Required vs optional. A missing required field is an error you want to catch; a missing optional field is normal. Mark them now.
- Normalization rules. Dates to ISO 8601, currency to a numeric amount plus a separate ISO code, phone numbers to E.164. Decide the canonical form so downstream code never has to guess.
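As a rough illustration of normalization rules in code, the sketch below converts dates to ISO 8601 and amounts to a decimal number. The accepted date layouts and the assumption that "." is the decimal mark and "," the thousands separator are choices made for this example; real documents may need different ones.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Date layouts accepted here are an assumption; tune the list to your documents.
# Ambiguous layouts (day/month vs month/day) need an explicit decision per source.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y", "%d %b %Y")

def normalize_date(raw):
    """Return an ISO 8601 date string, or None if no known layout matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

def normalize_amount(raw):
    """Drop currency symbols and thousands separators; return a Decimal or None.

    Assumes '.' is the decimal mark and ',' is the thousands separator.
    """
    cleaned = "".join(ch for ch in raw if ch.isdigit() or ch in ".-")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None
```

Returning `None` instead of raising keeps "could not normalize" distinct from "normalized to something wrong", which the validation step can then flag.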
The three extraction approaches
There are three broad ways to get fields out of a PDF, and they trade off in predictable ways.
1. Rule-based (regex and positional)
You convert the PDF to text first (see PDF text extraction methods), then match patterns. Invoice\s+#?\s*([A-Z0-9-]+) pulls the invoice number; a fixed bounding box pulls the total from the bottom-right of every page.
- Best for: high-volume documents from a single, stable template — your own system's exports, one vendor's invoices, government forms that never change layout.
- Strengths: fast, free, fully deterministic, no per-document cost, easy to audit.
- Weakness: brittle. One layout change and the regex silently returns the wrong substring. Useless across heterogeneous documents.
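A minimal sketch of the rule-based approach, using the invoice-number pattern quoted above on already-extracted text. The `TOTAL` pattern and the sample text are hypothetical and only stand in for whatever your template actually looks like.

```python
import re

# Pattern from the text: "Invoice", an optional "#", then the identifier.
INVOICE_NO = re.compile(r"Invoice\s+#?\s*([A-Z0-9-]+)")
# Hypothetical pattern for a line such as "Total: $1,240.00".
TOTAL = re.compile(r"^Total:?\s*\$?([\d,]+\.\d{2})\s*$", re.MULTILINE)

def extract_fields(text):
    """Return the fields this template's rules can find; None where a rule misses."""
    number = INVOICE_NO.search(text)
    total = TOTAL.search(text)
    return {
        "invoice_number": number.group(1) if number else None,
        "total": float(total.group(1).replace(",", "")) if total else None,
    }

sample = "ACME Corp\nInvoice # INV-2024-0042\nSubtotal: $1,000.00\nTotal: $1,240.00\n"
fields = extract_fields(sample)

# Brittleness in action: on a different layout the same pattern still "matches".
wrong = extract_fields("Invoice Date: 2024-01-01")
```

On the sample text the rules find `INV-2024-0042` and `1240.0`. On the second input the invoice pattern captures the `D` of "Date", which is the silent wrong-substring failure described above.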
2. Layout-aware models
Purpose-built form/document services use a layout model that understands key-value pairs and tables spatially. AWS Textract's AnalyzeDocument, Azure Document Intelligence's prebuilt and custom models, and Google Document AI all return structured key-value JSON directly.
- Best for: semi-structured documents at scale — invoices, receipts, tax forms, IDs — especially when layouts vary between sources.
- Strengths: robust to layout variation, strong on tables, prebuilt models for common document types mean near-zero setup.
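- Weakness: per-page API cost (from about $1.50 per 1,000 pages for plain text detection to tens of dollars per 1,000 pages for form and table analysis, depending on the service and model; check current pricing), and the output schema is the service's schema, so you'll still map it to yours. See cloud OCR services compared.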
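The mapping step might look something like the sketch below. The input shape (a flat list of key, value, and confidence entries) is a hypothetical simplification, not the actual response format of Textract, Azure, or Google, each of which returns its own more deeply nested structure. The `FIELD_MAP` labels are likewise invented for illustration.

```python
# Hypothetical label-to-field mapping; extend it with the labels your documents use.
FIELD_MAP = {
    "invoice no": "invoice_number",
    "invoice number": "invoice_number",
    "invoice date": "issue_date",
    "total": "total",
    "amount due": "total",
}

def map_to_schema(pairs):
    """Map detected key-value pairs onto our field names.

    `pairs` is assumed to be a list of dicts with "key", "value" and
    "confidence". When several labels map to one field, the most confident wins.
    """
    record = {}
    confidence = {}
    for pair in pairs:
        label = pair["key"].strip().lower().rstrip(":")
        target = FIELD_MAP.get(label)
        if target is None:
            continue  # a label we have no schema field for
        if pair["confidence"] > confidence.get(target, -1.0):
            record[target] = pair["value"]
            confidence[target] = pair["confidence"]
    return record, confidence

pairs = [
    {"key": "Invoice No:", "value": "INV-7", "confidence": 0.98},
    {"key": "Total", "value": "1240.00", "confidence": 0.71},
    {"key": "Amount Due", "value": "1240.00", "confidence": 0.93},
    {"key": "PO Number", "value": "88", "confidence": 0.99},
]
record, confidence = map_to_schema(pairs)
```

Keeping the per-field confidence next to the record lets a later step decide which values need review.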
3. LLM extraction with a schema
Convert the PDF to Markdown, then ask an LLM to populate your schema, ideally using a structured-output / JSON mode so the response is guaranteed to parse as JSON (the values inside it can still be wrong).
- Best for: messy, varied, or "long-tail" documents where writing rules is hopeless and no prebuilt model fits — contracts, research papers, mixed correspondence.
- Strengths: handles wild layout variation, understands context ("net 30" → a due date), needs no training data.
- Weakness: can hallucinate a plausible-but-wrong value, costs per token, non-deterministic. Never trust it without validation.
A good default for diverse documents: feed clean Markdown (not raw PDF bytes) to the model, because better-structured input produces better-structured output. The Markdown vs plain text for LLMs guide explains why.
The LLM extraction prompt that works
If you go the LLM route, the prompt matters. A pattern that holds up:
Extract data from the document below into JSON matching this exact schema:
{schema}
Rules:
(1) Use null for any field not present in the document; never guess or infer a value that isn't written.
(2) Copy values verbatim; do not summarize or reformat except where the schema specifies a format.
(3) For dates, output ISO 8601.
(4) Output only the JSON object, nothing else.
Document:
The single most important line is the instruction to use null rather than guess. Without it, models invent confident answers for missing fields — the most dangerous failure mode in data extraction, because it produces no error, just wrong data.
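One way to wire that prompt into code is sketched below. The model call itself is deliberately left out, since it depends on your provider; `build_prompt`, `parse_response`, and the `{document}` placeholder are names chosen for this example. The fence-stripping in `parse_response` assumes the model may wrap its answer in a Markdown code fence despite rule (4).

```python
import json

PROMPT_TEMPLATE = """Extract data from the document below into JSON matching this exact schema:
{schema}
Rules:
(1) Use null for any field not present in the document; never guess or infer a value that isn't written.
(2) Copy values verbatim; do not summarize or reformat except where the schema specifies a format.
(3) For dates, output ISO 8601.
(4) Output only the JSON object, nothing else.
Document:
{document}"""

FENCE = "`" * 3  # Markdown code fence marker

def build_prompt(schema, document):
    """Fill the template with a pretty-printed schema and the document text."""
    return PROMPT_TEMPLATE.format(schema=json.dumps(schema, indent=2), document=document)

def parse_response(raw):
    """Parse the model's reply, tolerating a code fence around the JSON.

    Raises json.JSONDecodeError if the reply is not valid JSON.
    """
    text = raw.strip()
    if text.startswith(FENCE):
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    return json.loads(text)
```

A parse failure here is worth treating as a validation error like any other, not as a crash.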
Validation: the step that makes it usable
Extraction output is a hypothesis, not a fact. Validate every record before it reaches a database.
- Schema validation. Run the JSON against a JSON Schema validator (jsonschema in Python, ajv in JS, pydantic if you like models). This catches type errors, missing required fields, and malformed enums for free.
- Internal consistency checks. Domain rules the data must obey: line items sum to the subtotal; subtotal plus tax equals total; the due date is after the issue date. These catch the errors schema validation can't: values that are well-typed but wrong.
- Confidence routing. Layout models return per-field confidence scores; use them. Route anything below a threshold (e.g. 0.9) to human review instead of straight to the database.
- Source grounding. For high-stakes fields, verify the extracted value actually appears in the source text. If the LLM returns a total of $1,240.00, that string should exist somewhere in the document; if it doesn't, it's likely a hallucination.
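The consistency and grounding checks can be sketched with the standard library alone. The field names `subtotal`, `tax`, and `due_date` are assumptions (they are not in the example schema), as are the one-cent rounding tolerance and the "," thousands separator.

```python
import re
from datetime import date
from decimal import Decimal

def consistency_errors(record, tolerance=Decimal("0.01")):
    """Return a list of violated domain rules; an empty list means the record passes."""
    errors = []
    items_sum = sum(
        (Decimal(str(i["quantity"])) * Decimal(str(i["unit_price"]))
         for i in record["line_items"]),
        Decimal("0"),
    )
    subtotal = Decimal(str(record["subtotal"]))
    tax = Decimal(str(record["tax"]))
    total = Decimal(str(record["total"]))
    if abs(items_sum - subtotal) > tolerance:
        errors.append(f"line items sum to {items_sum}, but subtotal is {subtotal}")
    if abs(subtotal + tax - total) > tolerance:
        errors.append(f"subtotal plus tax is {subtotal + tax}, but total is {total}")
    due = record.get("due_date")
    if due and date.fromisoformat(due) <= date.fromisoformat(record["issue_date"]):
        errors.append("due_date is not after issue_date")
    return errors

# A digit run with optional thousands separators and an optional decimal part.
NUMBER_TOKEN = re.compile(r"\d[\d,]*(?:\.\d+)?")

def amount_in_source(amount, source_text):
    """True if some number in the source equals `amount`, ignoring separators."""
    target = Decimal(str(amount))
    return any(
        Decimal(token.replace(",", "")) == target
        for token in NUMBER_TOKEN.findall(source_text)
    )
```

Comparing numeric values rather than raw strings means a normalized `1240.0` still matches a printed `$1,240.00`.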
A practical pattern: extract → validate → if it fails, re-run once with the validation errors fed back into the prompt → if it still fails, flag for review. The retry-with-errors loop fixes a surprising share of failures automatically.
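That loop can be sketched as a small function that takes the extractor and validator as arguments. Both callables are hypothetical interfaces chosen for this example: `extract(document, feedback)` returns a record, and `validate(record)` returns a list of error strings.

```python
def extract_with_retry(document, extract, validate, max_attempts=2):
    """Extract, validate, and retry with the validation errors fed back.

    `extract(document, feedback)` and `validate(record)` are supplied by the
    caller; `feedback` is the error list from the previous attempt (empty at first).
    """
    feedback = []
    record = None
    for _ in range(max_attempts):
        record = extract(document, feedback)
        feedback = validate(record)
        if not feedback:
            return {"status": "ok", "record": record}
    # Still failing after the final attempt: hand it to a human.
    return {"status": "needs_review", "record": record, "errors": feedback}
```

With the default `max_attempts=2` this gives exactly one retry before the record is flagged, matching the pattern above.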
A hybrid pipeline that scales
For production volume across mixed documents, no single method wins. The strongest architecture is a router:
- Classify the document type first (a cheap step — even filename or first-page keywords work).
- Route known stable templates to rule-based extraction (free, instant).
- Route common semi-structured types (invoices, receipts) to a prebuilt layout model.
- Route the long tail to LLM-with-schema extraction.
- Validate everything through the same gate regardless of source.
- Review queue for anything that fails validation or scores low confidence.
This puts the cheapest method on the highest-volume documents and reserves the expensive, flexible method for the cases that actually need it.
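A rough sketch of such a router follows. The document types, keywords, and method names are hypothetical placeholders, and each extractor is assumed to return a `(record, confidence)` pair; a real pipeline would plug in its own classifiers and extractors.

```python
# Hypothetical keyword tables; stable templates are checked first because
# they are the cheapest to process.
STABLE_TEMPLATES = {"acme_export": ("acme billing system",)}
COMMON_TYPES = {"invoice": ("invoice",), "receipt": ("receipt",)}

def choose_method(first_page_text):
    """Return (doc_type, method) using cheap first-page keyword matching."""
    text = first_page_text.lower()
    for doc_type, keywords in STABLE_TEMPLATES.items():
        if any(k in text for k in keywords):
            return doc_type, "rules"
    for doc_type, keywords in COMMON_TYPES.items():
        if any(k in text for k in keywords):
            return doc_type, "layout_model"
    return "unknown", "llm"  # the long tail

def process(text, extractors, validate, min_confidence=0.9):
    """Run the routed extractor, then the shared validation gate.

    `extractors` maps a method name to a callable returning (record, confidence);
    `validate` returns a list of errors. Both are supplied by the caller.
    """
    doc_type, method = choose_method(text)
    record, confidence = extractors[method](text)
    errors = validate(record)
    needs_review = bool(errors) or confidence < min_confidence
    return {
        "doc_type": doc_type,
        "method": method,
        "record": record,
        "status": "review" if needs_review else "accepted",
        "errors": errors,
    }
```

Every path ends at the same `validate` call, so adding a new extraction method does not add a new way for unchecked data to reach the database.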
Common pitfalls
- Numbers as strings. "1,240.00" is a string with a thousands separator, not a number. Normalize during extraction, not downstream.
- Multi-page records. A single logical record (one invoice) spanning several pages needs to be assembled before extraction, or fields on later pages get dropped.
- Repeating groups. Line items, transactions, and table rows are arrays of unknown length — the place schema-blind extraction most often truncates. Test specifically with documents that have many rows.
- Encoding artifacts. If the underlying text extraction is broken (ligatures, bad font maps), every method downstream inherits the garbage. Diagnose that first — see why your PDF text won't copy.
- Scanned input. If the source is a scan, you need OCR before any of this. Quality of the OCR caps the quality of the JSON. Image preprocessing for OCR pays off here.
Quick reference
- One fixed template, high volume? Rule-based regex/positional.
- Invoices, receipts, IDs, forms at scale? Layout model (Textract / Azure / Google).
- Messy, varied, or rare document types? LLM with explicit schema and JSON mode.
- Mixed everything in production? Router that combines all three behind one validation gate.
- Any approach: validate against a schema and domain rules before trusting the output.
Conclusion
Turning PDFs into structured JSON is less about the extraction tool and more about discipline around it: define the schema first, pick the method that matches the document's variability, and validate relentlessly because every method — regex, layout model, and LLM alike — produces confident wrong answers under the right conditions.
If your first step is getting clean, structured text out of the PDF, the converter on this site produces Markdown that feeds an LLM extractor well. From there, the schema and the validation gate are what make the JSON trustworthy.