PDF Accessibility — Tagged PDFs, Screen Readers, and Making Output Inclusive

A PDF that looks fine to a sighted reader can be a brick wall for someone using a screen reader. The page renders perfectly, but the underlying structure says nothing about headings, reading order, lists, or images. Accessibility is the dimension of PDF quality that authors and tooling pipelines most often overlook, and converting PDFs to Markdown or HTML is a chance to fix it.

This guide walks through how PDFs encode accessibility, what tagged PDFs do differently, and how to produce output that works for screen-reader users, search engines, and downstream automation.

How screen readers read PDFs

A screen reader announces text and structure: "Heading level 2: Methodology. List of three items. First item: ..." For that to work, the document has to declare its structure — there has to be machine-readable tagging that says "this paragraph is a heading, this image is a figure, this group of paragraphs is a list."

PDFs without that structure are flat to a screen reader. The screen reader can announce the text in approximate reading order, but it can't say "heading" or "list" because the document doesn't claim those things exist. Users have to navigate paragraph by paragraph with no way to skim or jump to sections.

Three categories of PDF, by accessibility:

For each category, the conversion path to accessible output is different.

What "tagged PDF" actually means

A tagged PDF contains a tree of structure elements alongside the visible content. Each element has a tag (H1, P, L, LI, etc.) and points to the content it represents.

A simplified example:

StructTreeRoot
├── H1 "Introduction"
├── P "This document discusses..."
├── H2 "Background"
│   ├── P "The first paper to address..."
│   └── L (unordered list)
│       ├── LI "Item one"
│       ├── LI "Item two"
│       └── LI "Item three"
└── Figure (alt text: "A chart showing growth from 2020 to 2024")

The structure tree is what enables:

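Recent versions of Word and InDesign can export tagged PDFs, though whether they do depends on the export settings. LaTeX generally needs extra setup (for example the tagpdf package) before it emits tags. PDFs generated from older tools or via "print to PDF" often lose the tags.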

Detecting tagged vs. untagged PDFs

Two quick ways to check:

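import pymupdf

doc = pymupdf.open("document.pdf")
catalog = doc.pdf_catalog()  # xref number of the document catalog
# xref_get_key returns a (type, value) tuple, not a bare string
mark_info = doc.xref_get_key(catalog, "MarkInfo")
struct_tree = doc.xref_get_key(catalog, "StructTreeRoot")
print(mark_info)    # e.g. ("dict", "<</Marked true>>") when tagged; ("null", "null") if absent
print(struct_tree)  # a tagged PDF also has a structure tree, e.g. ("xref", "12 0 R")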

For batch processing, you can fall back gracefully: try to extract the tag tree, and if it doesn't exist, treat the document as untagged and infer structure heuristically.
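The fallback decision can be kept separate from the PDF library so it is easy to test. The sketch below is one way to do that, not a definitive implementation: it assumes both inputs are (type, value) tuples in the shape PyMuPDF's xref_get_key returns, and the names is_tagged and choose_strategy are hypothetical. If MarkInfo is stored as an indirect reference, the sketch conservatively treats the document as untagged.

```python
import re


def is_tagged(mark_info, struct_tree_root):
    """Guess whether a PDF is tagged from two catalog entries.

    Both arguments are assumed to be (type, value) tuples such as
    ("dict", "<</Marked true>>") or ("null", "null") for a missing key.
    """
    mark_type, mark_value = mark_info
    tree_type, _ = struct_tree_root
    # Only an inline MarkInfo dictionary is inspected here; an indirect
    # reference ("xref", "n 0 R") would need resolving by the caller.
    marked = mark_type == "dict" and re.search(r"/Marked\s+true\b", mark_value) is not None
    has_tree = tree_type != "null"
    return marked and has_tree


def choose_strategy(mark_info, struct_tree_root):
    """Pick the conversion path: trust the tag tree or infer structure."""
    if is_tagged(mark_info, struct_tree_root):
        return "use-tag-tree"
    return "infer-heuristically"
```

In a batch loop you would read the two catalog keys for each file and pass them in, so a malformed or missing entry simply routes that file to the heuristic path.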

Converting to accessible Markdown

Markdown is naturally accessible because it has semantic elements built in: #, ##, ### are headings; - is a list; ![alt](image) carries alt text. When Markdown is rendered to HTML, screen readers walk the HTML tree and announce structure correctly.

The conversion goal: produce Markdown where the heading levels match the document's logical structure, lists are real lists, and images have alt text.
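To make that goal concrete, here is a minimal sketch that maps a structure tree like the simplified example earlier onto Markdown. The Node class, the to_markdown function, and the figure.png path are all hypothetical stand-ins; a real converter would read the tags from the PDF and write out the extracted image files.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str                     # e.g. "H1", "P", "L", "LI", "Figure"
    text: str = ""               # text content, or alt text for a Figure
    children: list["Node"] = field(default_factory=list)


def to_markdown(node: Node) -> list[str]:
    """Return Markdown blocks for a node and its descendants, in tree order."""
    tag = node.tag
    if tag[:1] == "H" and tag[1:].isdigit():
        level = min(int(tag[1:]), 6)          # Markdown has six heading levels
        blocks = ["#" * level + " " + node.text]
    elif tag == "P":
        blocks = [node.text]
    elif tag == "L":
        items = [f"- {item.text}" for item in node.children if item.tag == "LI"]
        return ["\n".join(items)]             # list items are consumed here
    elif tag == "Figure":
        blocks = [f"![{node.text}](figure.png)"]  # placeholder image path
    else:
        blocks = []                           # container tags add no text
    for child in node.children:
        blocks.extend(to_markdown(child))
    return blocks


tree = Node("Document", children=[
    Node("H1", "Introduction"),
    Node("P", "This document discusses..."),
    Node("H2", "Background", [
        Node("P", "The first paper to address..."),
        Node("L", children=[Node("LI", "Item one"), Node("LI", "Item two"),
                            Node("LI", "Item three")]),
    ]),
    Node("Figure", "A chart showing growth from 2020 to 2024"),
])
markdown = "\n\n".join(to_markdown(tree))
```

Because the heading level comes straight from the tag, the Markdown outline matches the document's declared structure rather than its visual styling.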

For tagged PDFs:

For untagged PDFs:
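For the untagged case, one common heuristic is to treat text set larger than the body font as a heading. The sketch below assumes you already have a list of (text, font_size) pairs from your extractor; the function name infer_headings is hypothetical, and the approach will misfire on documents that signal headings with bold or colour rather than size.

```python
from collections import Counter


def infer_headings(lines):
    """Turn (text, font_size) pairs into Markdown lines.

    The body size is taken to be the size that covers the most characters.
    Each distinct larger size becomes a heading level, largest first.
    """
    if not lines:
        return []
    weight = Counter()
    for text, size in lines:
        weight[size] += len(text)
    body_size = weight.most_common(1)[0][0]

    heading_sizes = sorted({size for _, size in lines if size > body_size}, reverse=True)
    level_of = {size: min(rank + 1, 6) for rank, size in enumerate(heading_sizes)}

    output = []
    for text, size in lines:
        level = level_of.get(size)
        output.append(f"{'#' * level} {text}" if level else text)
    return output
```

The result is a guess and should be reviewed, since a large pull quote or a cover-page title can be mistaken for a section heading.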

Alt text for images

Images in PDFs sometimes have alt text (in tagged PDFs) and usually don't. When converting to Markdown, you have three options for an image:

How to get descriptive alt text:

For PDFs converted via the tool on this site, the OCR backends that use vision models (OpenAI, Gemini) effectively describe images in their output, which goes a long way toward accessible Markdown.
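Whatever produces the descriptions, it helps to check the converted Markdown for images that ended up with no alt text at all. A minimal sketch, assuming inline image syntax only (reference-style images are not handled) and using a hypothetical function name:

```python
import re

# Matches ![alt](src) and ![alt](src "title"); alt may be empty.
IMAGE = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<src>[^)\s]+)[^)]*\)")


def images_missing_alt(markdown):
    """Return the source paths of images whose alt text is empty or blank."""
    return [
        match.group("src")
        for match in IMAGE.finditer(markdown)
        if not match.group("alt").strip()
    ]
```

An empty alt is legitimate for purely decorative images, so treat the result as a list to review rather than a list of errors.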

Tables

Tables are the trickiest accessibility case. A sighted reader scans a table visually; a screen-reader user navigates cell by cell and needs the headers announced for each cell.

The pipe-table syntax (an extension popularized by GitHub Flavored Markdown rather than part of core Markdown) works for screen readers when rendered to HTML: the header row (the first row, followed by a --- delimiter row) becomes <th> elements that get announced. The failure modes:

See preserving tables when converting PDF to Markdown for the broader conversion side.
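To illustrate the header mapping described in this section, here is a small sketch that turns a simple pipe table into HTML with the first row as <th> cells. It is not a full Markdown parser: it assumes one header row, a delimiter row, no escaped pipes inside cells, and the function names are hypothetical.

```python
from html import escape


def split_row(line):
    """Split a pipe-table row into stripped cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def table_to_html(markdown_table):
    """Render a simple pipe table as HTML, with the first row as headers."""
    lines = [ln for ln in markdown_table.strip().splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("a table needs a header row and a delimiter row")
    header, delimiter, body = lines[0], lines[1], lines[2:]
    # The delimiter row is what marks the first row as a header.
    if not all(c and set(c) <= set("-:") and "-" in c for c in split_row(delimiter)):
        raise ValueError("second line is not a header delimiter row")

    head_cells = "".join(f"<th>{escape(c)}</th>" for c in split_row(header))
    rows = [
        "<tr>" + "".join(f"<td>{escape(c)}</td>" for c in split_row(line)) + "</tr>"
        for line in body
    ]
    return "\n".join(
        ["<table>", "<thead>", f"<tr>{head_cells}</tr>", "</thead>", "<tbody>"]
        + rows
        + ["</tbody>", "</table>"]
    )
```

A table whose delimiter row was lost in conversion raises an error here, which is a useful signal: without that row, a renderer has no header cells to announce.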

Form fields

PDFs often contain form fields (text inputs, checkboxes, dropdowns). When converted to Markdown, these become flat text and lose their interactive nature. For accessibility:

See PDF forms data extraction for the data-extraction side.

Reading order

A subtle accessibility failure: reading order in the source PDF doesn't match the visual layout. This happens when:

For a sighted user this is invisible. For a screen-reader user the document reads as gibberish — sentences interleaved with sidebar content, footnotes appearing mid-paragraph.

The fix during conversion:
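One heuristic for multi-column pages, sketched here under strong assumptions, is to re-sort extracted text blocks by column and then by vertical position. The sketch assumes exactly two columns split at the page midpoint, blocks given as (x0, y0, text) with y measured from the top of the page, and a hypothetical function name; full-width elements such as titles would need separate handling.

```python
def reading_order(blocks, page_width):
    """Order (x0, y0, text) blocks left column first, top to bottom."""
    midpoint = page_width / 2

    def sort_key(block):
        x0, y0, _ = block
        column = 0 if x0 < midpoint else 1
        return (column, y0, x0)

    return [text for _, _, text in sorted(blocks, key=sort_key)]


# Blocks as they might come out of a content stream: interleaved by row.
raw = [
    (50, 100, "Left, first paragraph"),
    (320, 100, "Right, first paragraph"),
    (50, 200, "Left, second paragraph"),
    (320, 200, "Right, second paragraph"),
]
ordered = reading_order(raw, page_width=612)
```

Reading the output aloud, or with a screen reader, remains the most reliable way to confirm the order actually makes sense.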

Accessibility-conscious conversion checklist

When converting a PDF for accessibility purposes:
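Parts of such a review can be automated. As one example, the sketch below flags skipped heading levels in converted Markdown (a jump from # straight to ###), since screen-reader users rely on the heading outline to navigate. It assumes ATX-style headings, skips fenced code blocks, and uses a hypothetical function name; a document that starts below level 1 is flagged as well.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # fenced code block marker


def heading_skips(markdown):
    """Return (line_number, previous_level, level) for each skipped level."""
    problems = []
    previous = 0
    in_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence       # ignore '#' lines inside code
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if not match:
            continue
        level = len(match.group(1))
        if level > previous + 1:
            problems.append((number, previous, level))
        previous = level
    return problems
```

An empty result means the outline descends one level at a time; it says nothing about whether the headings are the right ones, which still needs a human read.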

When the conversion itself is the accessibility solution

For users of screen readers and other assistive technology, a well-converted Markdown or HTML file is often more accessible than the original PDF. PDFs are designed for visual fidelity; Markdown is designed for semantic structure.

Converting an untagged PDF to clean Markdown — with proper headings, lists, alt text, and reading order — produces something that works in any web browser, with any screen reader, on any device. That's often a better deliverable than spending hours retroactively tagging the source PDF.

For one-off conversions, the tool on this site with a vision-model backend produces Markdown with image descriptions inline. The output may need light cleanup before final publication, but the heavy lifting (text extraction, image description, structure detection) is done automatically.
