Reading, Editing, and Stripping PDF Metadata — Document Info, XMP, and Hidden Data

2026-05-29 · 4 min read

Every PDF carries metadata you don't see when you read it — who created it, with what software, when, and sometimes a good deal more. That metadata is useful when you're organizing a library and dangerous when you're sharing a document, because it can leak information you never meant to send. This guide covers what's in there, how to read and edit it, and how to strip it cleanly before a file leaves your hands.

The two metadata systems in every PDF

PDFs confusingly carry metadata in two places, and they don't always agree.

The Document Information Dictionary

The older, simpler system: a small set of key-value fields in a dictionary referenced from the PDF's trailer (its /Info entry). The standard keys:

Title, Author, Subject, Keywords
Creator (the app that authored the content, e.g. "Microsoft Word")
Producer (the library that wrote the PDF, e.g. "macOS Quartz PDFContext")
CreationDate, ModDate

This is what most viewers show under "Document Properties."
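
CreationDate and ModDate are stored as strings in the PDF date format (D:YYYYMMDDHHmmSS plus an optional UTC offset such as +02'00'), not ISO 8601, so they usually need parsing before you can compare or sort them. A minimal sketch using only the standard library; parse_pdf_date is a hypothetical helper name, and the pattern is deliberately lenient about missing trailing parts:

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS followed by an optional Z, or +HH'mm' / -HH'mm'.
# Everything after the year is optional in the PDF date format.
_PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?P<sign>[+\-Z])?(?:(?P<tzh>\d{2})'?(?P<tzm>\d{2})?'?)?$"
)


def parse_pdf_date(value):
    """Parse a PDF date string; return a datetime, or None if it is malformed."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return None
    g = match.groupdict()
    try:
        parsed = datetime(
            int(g["year"]), int(g["month"] or 1), int(g["day"] or 1),
            int(g["hour"] or 0), int(g["minute"] or 0), int(g["second"] or 0),
        )
    except ValueError:  # e.g. month 13
        return None
    sign = g["sign"]
    if sign is None:
        return parsed  # no offset recorded: leave the datetime naive
    if sign == "Z":
        return parsed.replace(tzinfo=timezone.utc)
    offset = timedelta(hours=int(g["tzh"] or 0), minutes=int(g["tzm"] or 0))
    if sign == "-":
        offset = -offset
    return parsed.replace(tzinfo=timezone(offset))
```

Returning None for malformed input rather than raising keeps batch jobs running when a file carries a nonstandard date string, which does happen in practice.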

XMP metadata

The newer system: an XML packet (Adobe's Extensible Metadata Platform) embedded in the file. It can hold everything the info dictionary does plus arbitrary structured metadata — Dublin Core fields, copyright and licensing, edit history, camera data on embedded images, and application-specific extensions.

The catch: a PDF can have both, and they can disagree (e.g. an old title in the info dictionary, a new one in XMP). Tools may read one and ignore the other. When you edit or strip metadata, handle both or you'll leave stale data behind — the source of many "but I changed the author!" surprises.
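
One way to catch such disagreements is to compare the two sources field by field. The sketch below is an illustration, not a complete XMP reader: it assumes you already have the info dictionary as a plain dict with lowercase keys (the shape PyMuPDF's doc.metadata uses) and the XMP packet as a string, and it only checks title and author against dc:title and dc:creator. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"


def xmp_dc_values(xmp_xml, field):
    """Return the text of every rdf:li found under dc:<field> in an XMP packet."""
    root = ET.fromstring(xmp_xml)
    values = []
    for node in root.iter("{%s}%s" % (DC_NS, field)):
        for item in node.iter("{%s}li" % RDF_NS):
            if item.text and item.text.strip():
                values.append(item.text.strip())
    return values


def find_disagreements(info, xmp_xml):
    """List (key, info value, XMP values) where both sources are set but differ."""
    pairs = {"title": "title", "author": "creator"}  # info key -> Dublin Core field
    problems = []
    for info_key, dc_field in pairs.items():
        info_value = (info.get(info_key) or "").strip()
        xmp_values = xmp_dc_values(xmp_xml, dc_field)
        if info_value and xmp_values and info_value not in xmp_values:
            problems.append((info_key, info_value, xmp_values))
    return problems
```

A field that is set in only one of the two places is not reported here; whether that counts as a problem depends on which system your downstream tools read.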

Reading metadata

Command line

exiftool is the most thorough reader — it surfaces both the info dictionary and XMP:

exiftool document.pdf

pdfinfo (from poppler-utils) shows the document info dictionary and basic structure:

pdfinfo document.pdf

Python

PyMuPDF reads the info dictionary directly and exposes XMP separately:

import fitz  # PyMuPDF

doc = fitz.open("document.pdf")
print(doc.metadata)            # info dictionary: title, author, producer, dates...
xmp = doc.get_xml_metadata()   # raw XMP XML as a string, empty if there is none

pikepdf gives lower-level access to both, which is what you want for careful editing.

Editing metadata

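To correct or set fields, write them back, remembering both systems. PyMuPDF's set_metadata() writes only the info dictionary, so deal with the XMP packet explicitly:

import fitz  # PyMuPDF

doc = fitz.open("document.pdf")
doc.set_metadata({
    "title": "Q3 Report",
    "author": "Finance Team",
    "subject": "Quarterly results",
    "keywords": "finance, quarterly, 2026",
})
# set_metadata() leaves any XMP packet untouched. Remove it so it cannot
# contradict the new values (or write a fresh packet with set_xml_metadata()).
doc.del_xml_metadata()
doc.save("updated.pdf")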

exiftool can edit from the command line and is good at keeping XMP and the info dictionary in sync:

exiftool -Title="Q3 Report" -Author="Finance Team" document.pdf

By default exiftool keeps the unmodified file as a backup named document.pdf_original. That is convenient, but remember to remove it if you don't want the old metadata lingering on disk, or pass -overwrite_original to skip the backup.
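
Backups are easy to lose track of after a batch edit. A small sketch for listing them, assuming they sit somewhere under one directory and follow exiftool's default naming (the original filename with _original appended); find_exiftool_backups is a hypothetical helper:

```python
from pathlib import Path


def find_exiftool_backups(root):
    """Return files under root whose names end in _original, sorted by path."""
    return sorted(p for p in Path(root).rglob("*_original") if p.is_file())
```

The function only lists candidates. Review the result before deleting anything, since other tools may use the same suffix.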

Stripping metadata before sharing

This is the part that matters most. Before sending a PDF outside your organization, consider what its metadata reveals:

Author and creator can expose a real name, username, or internal software stack.
Dates reveal when a document was really made or last edited — sometimes contradicting what you've told the recipient.
Producer strings fingerprint your toolchain.
Embedded image metadata can carry GPS coordinates, camera serial numbers, and editing history from photos placed in the document.
Revision history and hidden XMP from some authoring tools can include earlier states of the document.
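
A pre-send check can start by listing which of these fields are actually populated. The sketch below assumes the metadata arrives as a dict keyed the way PyMuPDF's doc.metadata is (author, creator, producer, creationDate, modDate); the helper name and the choice of keys are illustrative, and it does not look at embedded images or XMP.

```python
# Info-dictionary fields that most often reveal a person, a toolchain, or a timeline.
SENSITIVE_KEYS = ("author", "creator", "producer", "creationDate", "modDate")


def fields_to_review(metadata):
    """Return the non-empty sensitive fields so a person can inspect them."""
    return {key: metadata[key] for key in SENSITIVE_KEYS if metadata.get(key)}
```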

Note that metadata is separate from visible sensitive content. Stripping metadata does not remove text you tried to hide with black boxes — that's a different and more serious problem covered in PDF redaction done right. Do both when sanitizing a document.

How to strip it

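The quickest single command is exiftool:

exiftool -all:all= document.pdf

This blanks every metadata field exiftool can write. Re-run exiftool document.pdf afterward to confirm the fields are gone, but don't stop there: exiftool edits PDFs by appending an incremental update, so the old values remain in the file and can be recovered (exiftool itself warns that its PDF edits are reversible). To make the removal permanent, rewrite the file, for example with qpdf --linearize document.pdf clean.pdf or the Ghostscript rebuild below.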
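
A metadata reader only reports the current values, so a complementary check is to search the raw bytes of the file for strings you know should be gone. This is a rough sketch with hypothetical helper names: it only finds strings stored as plain UTF-8 or UTF-16BE text, so anything inside a compressed or encrypted stream, or written as a hex string, will slip past it.

```python
def find_leftovers(pdf_bytes, needles):
    """Return the needles that still occur somewhere in the raw file bytes."""
    found = []
    for needle in needles:
        candidates = (needle.encode("utf-8"), needle.encode("utf-16-be"))
        if any(candidate in pdf_bytes for candidate in candidates):
            found.append(needle)
    return found


def count_eof_markers(pdf_bytes):
    """Count %%EOF markers; more than one can indicate appended revisions."""
    return pdf_bytes.count(b"%%EOF")
```

In use, read the file with open("document.pdf", "rb").read() and pass in names, usernames, or project codes that must not leave the building. Treat a second %%EOF marker as a prompt to look closer rather than proof of a leak, because linearized files also contain more than one.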

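For a stronger guarantee, re-create the PDF so nothing survives in the structure. Printing to PDF, or running it through Ghostscript, rebuilds the file. Put the pdfmark after the input file so that it overrides the document info Ghostscript copies from the source:

gs -o clean.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress \
   document.pdf \
   -c "[/Title () /Author () /Creator () /Producer () /DOCINFO pdfmark"

Rebuilding is the most reliable approach because it doesn't just blank fields: it writes a fresh file, so earlier revisions and other structural remnants are not carried along. Still verify the output with exiftool, since Ghostscript stamps its own Producer string, and fields the pdfmark doesn't mention, such as the dates, may be copied from the source.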

Metadata as a workflow tool

Used deliberately, metadata is genuinely helpful:

Library organization. Title, author, and keywords drive search and filing across a document collection far better than filenames.
Provenance. Keeping accurate creation/modification metadata helps audit where a document came from.
Automation. A batch conversion pipeline can read metadata to route documents (by author, subject, or producer) and can write metadata onto outputs for traceability.
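
The routing idea can be sketched as a small rule table. Everything here is illustrative: the rule format, the folder names, and the route_document helper are assumptions, and the metadata dict is expected to use lowercase keys such as author, subject, and producer.

```python
def route_document(metadata, rules, default="unsorted"):
    """Return the destination of the first rule whose text occurs in its field.

    Each rule is a (field, text, destination) tuple; matching is a
    case-insensitive substring test, and rules are tried in order.
    """
    for field, text, destination in rules:
        value = (metadata.get(field) or "").lower()
        if text.lower() in value:
            return destination
    return default


# Hypothetical rule table for illustration.
RULES = [
    ("author", "finance team", "finance/"),
    ("subject", "quarterly", "reports/"),
    ("producer", "ghostscript", "rebuilt/"),
]
```

First match wins, so order the rules from most to least specific. Documents with empty metadata fall through to the default folder instead of raising an error.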

When you convert a PDF, decide whether metadata should carry forward. For text and Markdown output it usually doesn't matter, but if you're producing new PDFs, set sensible metadata rather than leaving the converting tool's default producer string.

Quick reference

Read everything? exiftool document.pdf (covers both info dictionary and XMP).
Edit a field? PyMuPDF set_metadata() or exiftool -Field=Value — handle both metadata systems.
Strip before sharing? exiftool -all:all=, then verify; because exiftool's PDF edits are reversible, rebuild the file with Ghostscript to make the removal permanent.
Hiding visible content? That's redaction, not metadata — see the redaction guide.

Conclusion

PDF metadata is small, easy to forget, and occasionally embarrassing — a leaked author name or a creation date that contradicts your story. Knowing that PDFs carry it in two systems (the info dictionary and XMP) is what keeps you from stripping one and leaking the other. Read it with exiftool, edit it carefully in both places, and rebuild the file when you need to be sure it's gone.

For the broader picture of what happens to your document when you process it, see PDF privacy and security.
