Extracting Structured JSON Data from PDFs — Schemas, Tools, and Validation

Plenty of work doesn't need the prose of a PDF — it needs the fields. The invoice total, the patient's date of birth, the line items, the contract's effective date. For those jobs the target format isn't Markdown or plain text, it's JSON: a typed, validated record you can drop into a database or hand to another program.

This is a different problem from converting a PDF to readable text, and it fails in different ways. This guide covers how to define what you want, the three families of extraction technique, and — the part people skip — how to know whether the JSON you got back is actually correct.

Start with the schema, not the PDF

The most common mistake is to extract first and figure out the shape later. Do the opposite. Write down the JSON schema you want before you touch a single document:

{
  "invoice_number": "string",
  "issue_date": "YYYY-MM-DD",
  "vendor": { "name": "string", "tax_id": "string|null" },
  "line_items": [
    { "description": "string", "quantity": "number", "unit_price": "number" }
  ],
  "total": "number",
  "currency": "ISO 4217 code"
}

The schema is your contract. It tells you which fields are required, which can be null, and what type each value must be. Every extraction method below gets dramatically more reliable when it's aimed at an explicit schema rather than asked to "pull out the important data."
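One way to make that contract machine-checkable is to write it as a JSON Schema document. The sketch below is one possible translation of the informal shape above, not a prescribed one: the name INVOICE_SCHEMA, the choice of which fields are required, and the two regex patterns are assumptions made for illustration.

```python
import json

# Hypothetical JSON Schema rendering of the informal invoice shape above.
# Which fields are required, and the patterns used, are illustrative assumptions.
INVOICE_SCHEMA = {
    "type": "object",
    "required": ["invoice_number", "issue_date", "vendor",
                 "line_items", "total", "currency"],
    "properties": {
        "invoice_number": {"type": "string"},
        # YYYY-MM-DD shape only; this does not check that the date is real.
        "issue_date": {"type": "string", "pattern": r"^\d{4}-\d{2}-\d{2}$"},
        "vendor": {
            "type": "object",
            "required": ["name", "tax_id"],
            "properties": {
                "name": {"type": "string"},
                # Nullable field: present in every record, but may be null.
                "tax_id": {"type": ["string", "null"]},
            },
        },
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["description", "quantity", "unit_price"],
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                    "unit_price": {"type": "number"},
                },
            },
        },
        "total": {"type": "number"},
        # Three uppercase letters, the shape of an ISO 4217 code.
        "currency": {"type": "string", "pattern": "^[A-Z]{3}$"},
    },
}

# Serialized form, e.g. for embedding in a prompt or handing to a validator.
schema_text = json.dumps(INVOICE_SCHEMA, indent=2)
```

Writing nullable fields as required-but-nullable, rather than optional, makes a missing value an explicit null in every record, which is easier to audit than an absent key.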

Two decisions to make up front:

The three extraction approaches

There are three broad families of technique for getting fields out of a PDF, and they trade off in predictable ways.

1. Rule-based (regex and positional)

You convert the PDF to text first (see PDF text extraction methods), then match patterns. Invoice\s+#?\s*([A-Z0-9-]+) pulls the invoice number; a fixed bounding box pulls the total from the bottom-right of every page.
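As a minimal sketch of the pattern-matching half, the snippet below applies the invoice-number regex from the paragraph above to text that has already been extracted. The function name and the sample text are hypothetical.

```python
import re
from typing import Optional

# Pattern copied from the text above; it assumes the number is uppercase
# letters, digits, and hyphens following the word "Invoice".
INVOICE_RE = re.compile(r"Invoice\s+#?\s*([A-Z0-9-]+)")

def extract_invoice_number(text: str) -> Optional[str]:
    """Return the first invoice number found in the text, or None."""
    match = INVOICE_RE.search(text)
    return match.group(1) if match else None

# Hypothetical extracted text from a single-template invoice.
sample = "ACME Corp\nInvoice # INV-2024-0042\nDue on receipt"
invoice_number = extract_invoice_number(sample)
```

The same pattern also shows why rules are brittle: on a line reading "Invoice Date: 2024-03-01" it would capture the single letter "D", so real templates usually need tighter anchoring than this.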

2. Layout-aware models

Purpose-built form/document services use a layout model that understands key-value pairs and tables spatially. AWS Textract's AnalyzeDocument, Azure Document Intelligence's prebuilt and custom models, and Google Document AI all return structured key-value JSON directly.

3. LLM extraction with a schema

Convert the PDF to Markdown, then ask an LLM to populate your schema, ideally using a structured-output / JSON mode so the response is constrained to syntactically valid JSON. That constraint covers syntax only; the values inside can still be wrong.

A good default for diverse documents: feed clean Markdown (not raw PDF bytes) to the model, because better-structured input produces better-structured output. The Markdown vs plain text for LLMs guide explains why.

The LLM extraction prompt that works

If you go the LLM route, the prompt matters. A pattern that holds up:

Extract data from the document below into JSON matching this exact schema: {schema}. Rules: (1) Use null for any field not present in the document — never guess or infer a value that isn't written. (2) Copy values verbatim; do not summarize or reformat except where the schema specifies a format. (3) For dates, output ISO 8601. (4) Output only the JSON object, nothing else.

Document:

The single most important line is the instruction to use null rather than guess. Without it, models invent confident answers for missing fields — the most dangerous failure mode in data extraction, because it produces no error, just wrong data.
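A small helper can assemble that prompt so the schema and rules stay identical across calls. This is a sketch only: build_extraction_prompt and its parameter names are hypothetical, and sending the prompt to a model is deliberately left out because that call depends on the provider.

```python
import json

# Prompt wording follows the pattern above; {schema} and {document}
# are filled in by str.format.
PROMPT_TEMPLATE = (
    "Extract data from the document below into JSON matching this exact "
    "schema: {schema}. Rules: "
    "(1) Use null for any field not present in the document; never guess "
    "or infer a value that isn't written. "
    "(2) Copy values verbatim; do not summarize or reformat except where "
    "the schema specifies a format. "
    "(3) For dates, output ISO 8601. "
    "(4) Output only the JSON object, nothing else.\n\n"
    "Document:\n{document}"
)

def build_extraction_prompt(schema: dict, document_markdown: str) -> str:
    """Fill the template with a serialized schema and the document text."""
    return PROMPT_TEMPLATE.format(
        schema=json.dumps(schema),
        document=document_markdown,
    )
```

Keeping the template in one constant means a change to the rules, such as tightening the null instruction, applies to every document type at once.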

Validation: the step that makes it usable

Extraction output is a hypothesis, not a fact. Validate every record before it reaches a database.

  1. Schema validation. Run the JSON against a JSON Schema validator (jsonschema in Python, ajv in JS, pydantic if you like models). This catches type errors, missing required fields, and malformed enums for free.
  2. Internal consistency checks. Domain rules the data must obey: line items sum to the subtotal; subtotal plus tax equals total; the due date is after the issue date. These catch the errors schema validation can't — values that are well-typed but wrong.
  3. Confidence routing. Layout models return per-field confidence scores; use them. Route anything below a threshold (e.g. 0.9) to human review instead of straight to the database.
  4. Source grounding. For high-stakes fields, verify the extracted value actually appears in the source text. If the LLM returns a total of $1,240.00, that amount should exist somewhere in the document (allowing for formatting differences such as currency symbols and thousands separators); if it doesn't, it's likely a hallucination.
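Checks 2 and 4 can be sketched in a few lines of standard-library Python. The field names subtotal, tax, and due_date are assumptions (they are not part of the example schema earlier in this guide), and the grounding check is deliberately crude: it only strips thousands separators before searching for the amount.

```python
from datetime import date
from typing import List

def consistency_errors(record: dict, tolerance: float = 0.01) -> List[str]:
    """Return human-readable rule violations; an empty list means it passed."""
    errors = []
    items_sum = sum(i["quantity"] * i["unit_price"] for i in record["line_items"])
    if abs(items_sum - record["subtotal"]) > tolerance:
        errors.append("line items do not sum to subtotal")
    if abs(record["subtotal"] + record["tax"] - record["total"]) > tolerance:
        errors.append("subtotal + tax does not equal total")
    # Dates are expected as ISO 8601 strings (YYYY-MM-DD).
    if date.fromisoformat(record["due_date"]) <= date.fromisoformat(record["issue_date"]):
        errors.append("due_date is not after issue_date")
    return errors

def is_grounded(amount: float, source_text: str) -> bool:
    """Crude grounding check: does the amount appear in the source text?

    Commas are removed from the source so "1,240.00" matches 1240.0.
    A substring search like this can produce false positives.
    """
    return f"{amount:.2f}" in source_text.replace(",", "")
```

Returning a list of messages rather than a boolean is a design choice that pays off later: the same strings can be fed back to the model on a retry or shown to a reviewer.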

A practical pattern: extract → validate → if it fails, re-run once with the validation errors fed back into the prompt → if it still fails, flag for review. The retry-with-errors loop fixes a surprising share of failures automatically.
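The retry-with-errors loop might look like the sketch below. The extract and validate callables are placeholders you would supply (for example, an LLM call and a schema-plus-consistency validator); their signatures here are assumptions, not any library's API.

```python
from typing import Callable, List, Optional

def extract_with_retry(
    document: str,
    extract: Callable[[str, List[str]], dict],
    validate: Callable[[dict], List[str]],
    max_attempts: int = 2,
) -> Optional[dict]:
    """Extract, validate, and retry with the errors fed back in.

    Returns the first record that validates, or None when every attempt
    fails, in which case the caller should flag the document for review.
    """
    errors: List[str] = []
    for _ in range(max_attempts):
        # On the first pass errors is empty; on a retry it carries the
        # previous validation failures so the extractor can correct them.
        record = extract(document, errors)
        errors = validate(record)
        if not errors:
            return record
    return None
```

With max_attempts left at 2 this matches the pattern described above: one initial attempt, one retry, then human review.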

A hybrid pipeline that scales

For production volume across mixed documents, no single method wins. The strongest architecture is a router:

  1. Classify the document type first (a cheap step — even filename or first-page keywords work).
  2. Route known stable templates to rule-based extraction (free, instant).
  3. Route common semi-structured types (invoices, receipts) to a prebuilt layout model.
  4. Route the long tail to LLM-with-schema extraction.
  5. Validate everything through the same gate regardless of source.
  6. Review queue for anything that fails validation or scores low confidence.

This puts the cheapest method on the highest-volume documents and reserves the expensive, flexible method for the cases that actually need it.
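The routing and review decisions in the list above reduce to two small functions. Everything named here is hypothetical: the template and type sets, the extractor labels, and the 0.9 threshold (borrowed from the confidence-routing example earlier) would all be tuned to your own document mix.

```python
from typing import List, Optional

# Hypothetical document types; in practice these come from your classifier.
STABLE_TEMPLATES = {"acme_invoice"}
PREBUILT_TYPES = {"invoice", "receipt"}

def choose_extractor(doc_type: str) -> str:
    """Pick the cheapest extraction method that fits the document type."""
    if doc_type in STABLE_TEMPLATES:
        return "rules"
    if doc_type in PREBUILT_TYPES:
        return "layout_model"
    return "llm_schema"  # the long tail

def needs_review(
    validation_errors: List[str],
    min_confidence: Optional[float],
    threshold: float = 0.9,
) -> bool:
    """Send a record to the review queue on any failure or low confidence.

    min_confidence is None for methods that report no confidence score.
    """
    if validation_errors:
        return True
    return min_confidence is not None and min_confidence < threshold
```

Because needs_review takes only validation errors and a confidence score, the same gate applies to records from all three extractors.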

Common pitfalls

Quick reference

Conclusion

Turning PDFs into structured JSON is less about the extraction tool and more about discipline around it: define the schema first, pick the method that matches the document's variability, and validate relentlessly because every method — regex, layout model, and LLM alike — produces confident wrong answers under the right conditions.

If your first step is getting clean, structured text out of the PDF, the converter on this site produces Markdown that feeds an LLM extractor well. From there, the schema and the validation gate are what make the JSON trustworthy.
