Extracting Data from PDF Forms — AcroForms, XFA, and Scanned Forms

PDF forms are among the most common business documents you'll encounter, and among the least predictable to extract data from. Two forms that look identical to the user can have radically different internal structures: one a fully extractable AcroForm, the other a flattened image with no recoverable structure. The right extraction approach depends entirely on the form's internal type.

This guide covers the three main kinds of PDF forms and the extraction workflow for each.

Three kinds of PDF forms

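When someone says "PDF form" they could mean any of these:

AcroForm. Interactive fields defined inside the PDF itself, which PDF libraries can read directly.

XFA. An XML-based form embedded in the PDF. The data lives in an XML packet rather than in ordinary field objects.

Static or scanned form. A page that only looks like a form. There are no fields and no XML, just drawn or scanned content.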

Knowing which kind you have determines the approach. Most extraction failures come from treating one kind like another (running AcroForm extraction on a scanned form and getting empty results, or trying to OCR an AcroForm and missing the actual field values).

Detecting which kind you have

A quick programmatic check:

import pymupdf

doc = pymupdf.open("form.pdf")
for page in doc:
    widgets = list(page.widgets())
    if widgets:
        # page.number is zero-based
        print("AcroForm: found", len(widgets), "fields on page", page.number)
        break
else:
    # The loop ended without a break, so no page has widgets. Check for XFA.
    # Metadata values can be missing, so fall back to an empty string.
    subject = ((doc.metadata or {}).get("subject") or "").lower()
    if doc.is_form_pdf and "xfa" in subject:
        # Weak heuristic: only fires if the producer wrote "XFA" into the Subject field
        print("Likely XFA")
    elif doc.is_form_pdf:
        print("Form PDF but no visible widgets: possibly XFA-only")
    else:
        print("Static or scanned form")

Or visually: open the form in Acrobat. If clicking on a blank line places a cursor and lets you type, it's an AcroForm or XFA. If clicking does nothing or just selects the page area, it's a static/scanned form.

Extracting AcroForm data

AcroForms are by far the easiest case. Mainstream PDF libraries such as PyMuPDF and pypdf expose the form fields directly.

With pymupdf:

import pymupdf

doc = pymupdf.open("form.pdf")
data = {}
for page in doc:
    for widget in page.widgets():
        if widget.field_name:
            data[widget.field_name] = widget.field_value
print(data)

That's the entire happy path. The fields come out as a flat dictionary mapping field name to value.

Field types you'll encounter:

The gotchas:

Extracting XFA data

XFA forms store data as XML. Extracting it means reading the XFA packets from the PDF and parsing the XML. The example below uses pypdf, whose PdfReader.xfa property returns the packets as a dict.

from lxml import etree
from pypdf import PdfReader

reader = PdfReader("xfa_form.pdf")
xfa = reader.xfa  # None or empty when the PDF carries no XFA data
if xfa:
    # xfa is a dict mapping packet names to XML bytes
    datasets_xml = xfa.get("datasets")
    if datasets_xml:
        tree = etree.fromstring(datasets_xml)
        # Walk every element (skipping comments) and print the ones that carry text
        for elem in tree.iter("*"):
            if elem.text and elem.text.strip():
                print(elem.tag, "=", elem.text.strip())

The XFA datasets packet contains the form's data. The schema is form-specific — different XFA forms have different XML structures — so you'll need to write a small parser per form template.
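
As a rough sketch of what a per-template parser can look like, the snippet below flattens a datasets packet into a dict keyed by element path. The sample XML and its element names (form1, applicant, name, dob, consent) are invented for illustration, and the standard-library parser is used only to keep the example free of dependencies. Real templates will need their own paths.

```python
import xml.etree.ElementTree as ET

# Hypothetical datasets packet; real forms use their own element names.
SAMPLE_DATASETS = b"""<xfa:datasets xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
  <xfa:data>
    <form1>
      <applicant>
        <name>Jane Doe</name>
        <dob>1990-04-12</dob>
      </applicant>
      <consent>1</consent>
    </form1>
  </xfa:data>
</xfa:datasets>"""


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def flatten_datasets(xml_bytes):
    """Map each leaf element's path (below the root) to its text."""
    root = ET.fromstring(xml_bytes)
    values = {}

    def walk(elem, path):
        children = list(elem)
        text = (elem.text or "").strip()
        if not children and text:
            values["/".join(path)] = text
        for child in children:
            walk(child, path + [local_name(child.tag)])

    walk(root, [])
    return values


fields = flatten_datasets(SAMPLE_DATASETS)
print(fields)
```

Repeated elements that share a path (rows of a table, for example) overwrite each other in this sketch, so a template with repeating sections would need list handling on top.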

Some XFA forms are XFA-only (no AcroForm equivalent inside the PDF), and some are hybrid (AcroForm widgets backed by XFA datasets). For hybrid forms, AcroForm extraction usually works. For XFA-only forms, you have to parse the XML.

A wrinkle: XFA is deprecated as of PDF 2.0 (2017), but it's still common in government and financial forms. Tools sometimes refuse to render XFA forms entirely (browser PDF viewers historically have, and newer Adobe versions are phasing it out). If you receive an XFA form that doesn't display correctly, it's not necessarily broken — your viewer might just not support it.

Extracting scanned form data

The hard case: a PDF that looks like a form but is really an image. There are no fields, no XML, just pixels.

The workflow:

  1. OCR the form. Use Tesseract, a cloud OCR service, or a vision model. See scanned PDFs to text.
  2. Match labels to values. OCR gives you text; you have to figure out which text is a label and which is a filled-in value.
  3. Validate the extracted data. Scanned forms have higher error rates; cross-check what you can.

For matching labels to values, three approaches work:

Template matching. If you process the same form repeatedly (a tax form, an enrollment form), define the layout once: "the name field is at coordinates (100, 200) to (400, 230)." OCR each page, then crop to each field's coordinates and read just the filled-in text. The most reliable approach for repeating forms. Cloud OCR services (Azure Document Intelligence's custom models, Google Document AI's custom extractors) automate this with a UI for marking fields on a sample.
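
A minimal sketch of the template idea, assuming your OCR step returns word boxes as (x0, y0, x1, y1, text) tuples in the same coordinate system as the template. Rather than cropping the image and re-running OCR per field, this variant filters the page-level OCR words by region. The rectangles, field names, and sample words are made up for illustration.

```python
# Hypothetical template: field name -> (x0, y0, x1, y1) region on the page.
TEMPLATE = {
    "name": (100, 200, 400, 230),
    "date_of_birth": (100, 240, 400, 270),
}

# Hypothetical OCR output: (x0, y0, x1, y1, text) per word.
ocr_words = [
    (20, 205, 90, 225, "Name:"),
    (110, 205, 150, 225, "Jane"),
    (155, 205, 190, 225, "Doe"),
    (20, 245, 90, 265, "DOB:"),
    (110, 245, 200, 265, "1990-04-12"),
]


def read_fields(words, template):
    """Assign each word to the field whose rectangle contains the word's center."""
    result = {}
    for field, (fx0, fy0, fx1, fy1) in template.items():
        inside = []
        for x0, y0, x1, y1, text in words:
            cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
            if fx0 <= cx <= fx1 and fy0 <= cy <= fy1:
                inside.append((y0, x0, text))
        # Sort top-to-bottom, then left-to-right, to restore reading order.
        inside.sort()
        result[field] = " ".join(text for _, _, text in inside)
    return result


values = read_fields(ocr_words, TEMPLATE)
print(values)
```

The printed labels fall outside the field rectangles, so only the filled-in text is kept. Scans that are skewed or shifted would need alignment before fixed coordinates like these can be trusted.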

Label-proximity heuristics. For unknown forms, look for label-like text ("Name:", "Date of Birth:", "Address:") and read the text immediately to the right or below. Works for simple forms; breaks on multi-column layouts or forms with non-standard placement.
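
One way to sketch the heuristic, assuming OCR output grouped into text lines with bounding boxes. The label pattern, the tolerances (y_tol, max_gap), and the sample data are all illustrative assumptions that you would tune per document set.

```python
import re

# Hypothetical OCR output grouped into lines: (x0, y0, x1, y1, text).
ocr_lines = [
    (50, 100, 120, 118, "Name:"),
    (130, 100, 260, 118, "Jane Doe"),
    (50, 140, 170, 158, "Date of Birth:"),
    (180, 140, 270, 158, "1990-04-12"),
    (50, 180, 130, 198, "Address:"),
    (50, 202, 300, 220, "12 Main St, Springfield"),
]

# Label-like text: words and spaces ending in a colon.
LABEL = re.compile(r"^[A-Za-z][A-Za-z /]*:$")


def nearest_value(label, lines, y_tol=8, max_gap=40):
    """Prefer text on the same row to the right; otherwise the closest line below."""
    _, ly0, lx1, ly1, _ = label
    candidates = [l for l in lines if not LABEL.match(l[4])]
    right = [l for l in candidates if abs(l[1] - ly0) <= y_tol and l[0] >= lx1]
    if right:
        return min(right, key=lambda l: l[0])[4]
    below = [l for l in candidates if 0 <= l[1] - ly1 <= max_gap]
    if below:
        return min(below, key=lambda l: l[1])[4]
    return None


pairs = {}
for line in ocr_lines:
    if LABEL.match(line[4]):
        pairs[line[4].rstrip(":")] = nearest_value(line, ocr_lines)
print(pairs)
```

The "below" search ignores horizontal position, which is exactly why this kind of heuristic misfires on multi-column layouts.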

LLM extraction. Pass the OCR text to an LLM with a prompt like "extract the filled-in values from this form into JSON with the schema {name, date_of_birth, address, ...}". This works on a wider variety of forms than heuristics but costs more per page. The vision-model variant, which sends the page image directly instead of OCR text, can be more accurate still, at a higher cost.

The vision-model prompt that works well:

Extract the filled-in values from this form into JSON.
Schema: {field_name: value}
For unfilled fields, omit them from the output.
For values you can't read clearly, use [???] as the value.
For checkboxes, use true/false.
Output ONLY the JSON.
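
On the receiving side, the reply still has to be parsed and the unreadable values routed to review. The helper below is a sketch under the assumption that the model followed the prompt above. The function name and the sample reply are hypothetical, and no model API is called here.

```python
import json

UNREADABLE = "[???]"  # the marker requested in the prompt


def parse_form_response(raw):
    """Parse the model's reply and separate clean values from unreadable ones."""
    # Tolerate stray text around the JSON by slicing from the first "{" to the last "}".
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model reply")
    data = json.loads(raw[start:end + 1])
    needs_review = sorted(k for k, v in data.items() if v == UNREADABLE)
    clean = {k: v for k, v in data.items() if v != UNREADABLE}
    return clean, needs_review


# Hypothetical model reply for illustration.
reply = '{"name": "Jane Doe", "date_of_birth": "[???]", "us_citizen": true}'
clean, needs_review = parse_form_response(reply)
print(clean)
print(needs_review)
```

Keeping the unreadable fields in a separate list makes it easy to send only those to a human reviewer instead of re-checking the whole form.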

Combining approaches

For mixed inputs (some AcroForm, some scanned, all claiming to be "the application form"):

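import pymupdf
from pypdf import PdfReader

def extract_form(pdf_path):
    # parse_xfa and vision_extract are placeholders for your own helpers
    doc = pymupdf.open(pdf_path)
    # First try AcroForm; only fields with a value are kept
    data = {}
    for page in doc:
        for w in page.widgets():
            if w.field_name and w.field_value:
                data[w.field_name] = w.field_value
    if data:
        return ("acroform", data)
    # Try XFA (pypdf returns the packets as a dict, or None/empty if absent)
    xfa = PdfReader(pdf_path).xfa
    if xfa:
        return ("xfa", parse_xfa(xfa))
    # Fall back to OCR + vision model
    return ("scanned", vision_extract(pdf_path))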

The routing layer is what makes this work in production. A vendor pipeline that always OCRs every form wastes money on AcroForms where the data is one API call away.

Handling multi-form documents

A common case: a single PDF that contains multiple instances of the same form (a stack of 100 applications, each one page). The extraction needs to produce one row per form, not one row total.

For AcroForms, this rarely happens — typically each PDF is one form. But for OCR-based extraction, multiple forms in one PDF is common (someone scanned a stack into a single document).

The approach:

  1. Detect form boundaries. Look for a recurring header element (a logo, a form title) and treat each occurrence as the start of a new form.
  2. Process each section independently. Run the form extraction per section.
  3. Validate consistency. Forms in a stack usually have similar shapes. If one extracted record has wildly different fields than the others, flag it for review.
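
The steps above can be sketched roughly as follows, assuming you already have the OCR text of each page and that the form title appears on the first page of every form. The marker string, function names, and sample data are illustrative.

```python
from collections import Counter


def split_forms(page_texts, header_marker):
    """Group page indices into forms; a page containing the marker starts a new form."""
    forms = []
    for index, text in enumerate(page_texts):
        if header_marker.lower() in text.lower() or not forms:
            forms.append([index])
        else:
            forms[-1].append(index)
    return forms


def flag_outliers(records):
    """Return indices of records whose set of fields differs from the most common set."""
    shapes = [frozenset(record) for record in records]
    common, _ = Counter(shapes).most_common(1)[0]
    return [i for i, shape in enumerate(shapes) if shape != common]


# Hypothetical OCR text per page of a scanned stack.
page_texts = [
    "APPLICATION FORM\nName: A",
    "(continued) Signature: ...",
    "Application Form\nName: B",
    "Application Form\nName: C",
]
forms = split_forms(page_texts, "Application Form")
print(forms)

# Hypothetical extracted records, one per form.
records = [
    {"name": "A", "dob": "1990-01-01"},
    {"name": "B", "dob": "1985-06-30"},
    {"name": "C", "phone": "555-0100", "fax": "555-0101"},
]
print(flag_outliers(records))
```

A text marker is the simplest boundary signal. A logo-based boundary would need image matching instead, which this sketch does not attempt.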

Validation matters more for forms

A misread field is worse than a missing one. A missing value is visible and gets flagged for re-processing. A name spelled wrong or a date off by a year looks valid, so it passes through unnoticed and becomes a data quality problem that propagates downstream.
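
A small plausibility check is one cheap line of defense. The sketch below assumes records with name and date_of_birth fields and ISO-formatted dates, both of which are illustrative choices rather than anything a given form guarantees.

```python
from datetime import date, datetime


def validate_record(record, required=("name", "date_of_birth")):
    """Return a list of problems found; an empty list means nothing was flagged."""
    problems = []
    for field in required:
        value = record.get(field)
        if value is None or not str(value).strip():
            problems.append(f"missing: {field}")
    dob = record.get("date_of_birth")
    if dob:
        try:
            parsed = datetime.strptime(str(dob), "%Y-%m-%d").date()
        except ValueError:
            problems.append(f"unparseable date: {dob!r}")
        else:
            # Range check: catches gross misreads, not a date that is off by one year.
            if not date(1900, 1, 1) <= parsed <= date.today():
                problems.append(f"implausible date: {dob}")
    return problems


print(validate_record({"name": "Jane Doe", "date_of_birth": "1990-04-12"}))
print(validate_record({"name": "", "date_of_birth": "2990-04-12"}))
```

Checks like these only catch values that are impossible. A plausible but wrong value still needs a cross-check against a second source or a human reviewer.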

For high-stakes form data (medical, legal, financial):

When to give up on automation

A form you will process only once is faster to transcribe by hand than to automate. As a rule of thumb, the break-even point is roughly 50–100 forms of the same template. Below that, manual entry plus a quick validation pass beats building infrastructure. Above that, template-based extraction pays for itself within the first batch.

For ad-hoc form extraction, the converter on this site with a vision-model backend handles most cases without setup — upload the form, and the model reads the filled-in values along with the surrounding labels. Good enough for one-off jobs.
