PDF Accessibility — Tagged PDFs, Screen Readers, and Making Output Inclusive

2026-05-19 · 8 min read

A PDF that looks fine to a sighted reader can be a brick wall for someone using a screen reader. The page renders perfectly, but the underlying structure says nothing about headings, reading order, lists, or images. Accessibility is the dimension of PDF quality that authors and tooling pipelines most often overlook, and converting PDFs to Markdown or HTML is a chance to fix it.

This guide walks through how PDFs encode accessibility, what tagged PDFs do differently, and how to produce output that works for screen-reader users, search engines, and downstream automation.

How screen readers read PDFs

A screen reader announces text and structure: "Heading level 2: Methodology. List of three items. First item: ..." For that to work, the document has to declare its structure — there has to be machine-readable tagging that says "this paragraph is a heading, this image is a figure, this group of paragraphs is a list."

PDFs without that structure are flat to a screen reader. The reader can announce the text in approximate reading order, but it can't say "heading" or "list" because the document doesn't claim those things exist. Users have to navigate paragraph by paragraph with no way to skim or jump to sections.

Three categories of PDF, by accessibility:

Tagged PDFs. The PDF includes a structure tree of semantic tags: H1, H2, P, L (list), LI (list item), Figure, Table, etc. Screen readers walk the tree and announce structure properly. This is the goal state.
Untagged but text-extractable PDFs. Text is present and ordered, but there are no semantic tags. Screen readers can read the text but can't announce structure. Most PDFs from older Office exports fall here.
Image-only PDFs. No extractable text. A screen reader announces "image" and the user gets nothing. Requires OCR before any accessibility is possible. See scanned PDFs to text.

For each category, the conversion path to accessible output is different.

What "tagged PDF" actually means

A tagged PDF contains a tree of structure elements alongside the visible content. Each element has a tag (H1, P, L, LI, etc.) and points to the content it represents.

A simplified example:

StructTreeRoot
├── H1 "Introduction"
├── P "This document discusses..."
├── H2 "Background"
│   ├── P "The first paper to address..."
│   └── L (unordered list)
│       ├── LI "Item one"
│       ├── LI "Item two"
│       └── LI "Item three"
└── Figure (alt text: "A chart showing growth from 2020 to 2024")

The structure tree is what enables:

Heading-level navigation (H key in NVDA, rotor in VoiceOver)
Reading order independent of visual layout
Alt text announcement for images
Table cell navigation with header announcement
Forms tagged with field labels
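
To make the tree concrete, here is a minimal sketch that models the simplified example as plain Python objects and walks it the way a screen reader might. The Node class and the announce function are illustrative names, not part of any PDF library, and real assistive technology announces far more detail than this.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One structure element: a tag, optional text, optional alt text, children."""
    tag: str
    text: str = ""
    alt: str = ""
    children: list["Node"] = field(default_factory=list)


def announce(node: Node) -> list[str]:
    """Walk the tree depth-first and return screen-reader-style announcements."""
    lines = []
    if node.tag.startswith("H") and node.tag[1:].isdigit():
        lines.append(f"Heading level {node.tag[1:]}: {node.text}")
    elif node.tag == "L":
        lines.append(f"List of {len(node.children)} items")
    elif node.tag == "LI":
        lines.append(f"Item: {node.text}")
    elif node.tag == "Figure":
        lines.append(f"Figure: {node.alt or 'no description'}")
    elif node.tag == "P":
        lines.append(node.text)
    # Children are visited in tree order, which is the logical reading order.
    for child in node.children:
        lines.extend(announce(child))
    return lines


tree = Node("StructTreeRoot", children=[
    Node("H1", "Introduction"),
    Node("P", "This document discusses..."),
    Node("H2", "Background", children=[
        Node("P", "The first paper to address..."),
        Node("L", children=[
            Node("LI", "Item one"),
            Node("LI", "Item two"),
            Node("LI", "Item three"),
        ]),
    ]),
    Node("Figure", alt="A chart showing growth from 2020 to 2024"),
])

for line in announce(tree):
    print(line)
```

The point of the sketch is that every announcement comes from the tree, not from the visual layout: remove the tags and only the paragraph text is left to read.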

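Recent versions of Word produce tagged PDFs by default, and InDesign does when its tagged-PDF export option is enabled. LaTeX usually needs tagging support to be turned on explicitly. PDFs generated from older tools or via "print to PDF" often lose the tags.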

Detecting tagged vs. untagged PDFs

Two quick ways to check:

In Adobe Acrobat: View → Show/Hide → Navigation Panes → Tags (menu names vary between Acrobat versions). A tag tree means tagged; "No Tags available" means untagged.
Via Python:

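import pymupdf

doc = pymupdf.open("document.pdf")
catalog = doc.pdf_catalog()  # xref number of the document catalog
# xref_get_key returns a (type, value) tuple, not a bare string
mark_info = doc.xref_get_key(catalog, "MarkInfo")
struct_root = doc.xref_get_key(catalog, "StructTreeRoot")
print(mark_info)    # typically ('dict', '<</Marked true>>') when tagged, ('null', 'null') when absent
print(struct_root)  # ('null', 'null') means there is no structure tree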

For batch processing, you can fall back gracefully: try to extract the tag tree, and if it doesn't exist, treat the document as untagged and infer structure heuristically.
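
One way to sketch that fallback is a small pure function that sorts each document into one of the three categories. The function name classify_pdf, the category labels, and the min_chars threshold are assumptions made for illustration. The two tuple arguments are expected in the (type, value) shape that pymupdf's xref_get_key returns for the catalog's MarkInfo and StructTreeRoot keys.

```python
def classify_pdf(mark_info: tuple[str, str],
                 struct_root: tuple[str, str],
                 text_chars: int,
                 min_chars: int = 20) -> str:
    """Return 'tagged', 'untagged-text', or 'image-only'.

    mark_info / struct_root: (type, value) tuples for the catalog keys
    "MarkInfo" and "StructTreeRoot"; ("null", "null") means the key is absent.
    text_chars: total number of characters of extractable text in the document.
    min_chars: assumed cutoff below which the file is treated as image-only.
    """
    has_tree = struct_root[0] != "null"
    # Normalize whitespace so "<< /Marked  false >>" is also recognized.
    mark_value = " ".join(mark_info[1].split())
    explicitly_unmarked = "/Marked false" in mark_value
    if has_tree and not explicitly_unmarked:
        return "tagged"
    if text_chars >= min_chars:
        return "untagged-text"
    return "image-only"


print(classify_pdf(("dict", "<</Marked true>>"), ("xref", "12 0 R"), 5000))
print(classify_pdf(("null", "null"), ("null", "null"), 5000))
print(classify_pdf(("null", "null"), ("null", "null"), 0))
```

In a pymupdf pipeline the text length could come from summing len(page.get_text()) over the pages. A structure tree can exist and still be poorly built, so "tagged" here means only "worth trying the tags first".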

Converting to accessible Markdown

Markdown is naturally accessible because it has semantic elements built in: #, ##, ### are headings; - is a list; ![alt](image) carries alt text. When Markdown is rendered to HTML, screen readers walk the HTML tree and announce structure correctly.

The conversion goal: produce Markdown where the heading levels match the document's logical structure, lists are real lists, and images have alt text.

For tagged PDFs:

pymupdf4llm preserves heading structure from the tag tree. Headings come out as #, ##, ### reflecting the source.
marker does the same, with even better structure preservation on complex documents.
Adobe Acrobat's "Export to HTML" also preserves structure but produces verbose HTML.

For untagged PDFs:

pymupdf4llm and similar tools heuristically detect headings based on font size and weight. The output has headings, but the heading levels don't always match the original logical structure (a sub-section might come out as ## because of its font size, even though it's a ### in document logic).
marker does a better job with heuristic structure detection.
Manual correction is often necessary if accessibility is a hard requirement.
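
The font-size heuristic can be sketched in a few lines. This is not how pymupdf4llm or marker implement it; it is a simplified illustration that assumes you already have (text, font size) pairs, treats the most common size as body text, and ranks larger sizes as heading levels. The function names are hypothetical.

```python
from collections import Counter


def infer_heading_levels(spans: list[tuple[str, float]],
                         max_level: int = 3) -> list[tuple[int, str]]:
    """Map (text, font_size) spans to (level, text); level 0 means body text."""
    if not spans:
        return []
    # Assume the size that covers the most characters is the body text size.
    weights: Counter = Counter()
    for text, size in spans:
        weights[round(size, 1)] += len(text)
    body_size = weights.most_common(1)[0][0]
    # Larger sizes become heading levels, largest first, capped at max_level.
    heading_sizes = sorted((s for s in weights if s > body_size), reverse=True)
    level_of = {size: i + 1 for i, size in enumerate(heading_sizes[:max_level])}
    return [(level_of.get(round(size, 1), 0), text) for text, size in spans]


def to_markdown(leveled: list[tuple[int, str]]) -> str:
    """Render (level, text) pairs as Markdown headings and paragraphs."""
    return "\n\n".join(
        ("#" * level + " " + text) if level else text for level, text in leveled
    )


spans = [
    ("Report", 24.0),
    ("Methods", 18.0),
    ("We sampled 300 users over two years.", 11.0),
    ("Sampling", 14.0),
    ("Users were chosen at random from the list.", 11.0),
]
print(to_markdown(infer_heading_levels(spans)))
```

The sketch also shows why the heuristic misfires: a level is assigned purely by size rank, so a sub-section set in an unusually large font is promoted regardless of where it sits in the document's logic.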

Alt text for images

Images in PDFs sometimes have alt text (in tagged PDFs), but usually they don't. When converting to Markdown, you have three options for an image:

Drop the image entirely. Worst for accessibility — users don't know it was there. Acceptable only when the image is purely decorative.
Include with empty alt text. ![](figure-1.png) — the image displays but screen readers skip it. Acceptable for decorative images only.
Include with descriptive alt text. ![Line chart showing user growth from 2020 to 2024](figure-1.png) — the gold standard.

How to get descriptive alt text:

From the source PDF tags when available.
From the figure caption when the PDF has captions but not alt tags. Take the caption text or a summary of it.
Generate with a vision model when neither is available. A prompt like "Describe this image in one sentence for a screen reader user" produces usable alt text. Expect a cost on the order of $0.005 per image, depending on the model and image size.
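
The fallback order above (tags, then caption, then a vision model) can be sketched as a small helper. The names choose_alt_text and markdown_image are hypothetical, and the describe argument stands in for whatever vision-model call you use; no real API is assumed here.

```python
from typing import Callable, Optional


def choose_alt_text(tag_alt: Optional[str],
                    caption: Optional[str],
                    describe: Optional[Callable[[], str]] = None) -> str:
    """Pick alt text: PDF tag first, then caption, then a generated description."""
    for candidate in (tag_alt, caption):
        if candidate and candidate.strip():
            # Collapse runs of whitespace left over from PDF extraction.
            return " ".join(candidate.split())
    if describe is not None:
        # Only call the (possibly paid) model when nothing else is available.
        return " ".join(describe().split())
    return ""  # empty alt: acceptable for decorative images only


def markdown_image(alt: str, path: str) -> str:
    """Build a Markdown image; square brackets in alt text would break the syntax."""
    safe = alt.replace("[", "(").replace("]", ")")
    return f"![{safe}]({path})"


alt = choose_alt_text(None, "Figure 1:  User growth, 2020 to 2024")
print(markdown_image(alt, "figure-1.png"))
```

Passing the model call as a function keeps it lazy, so images that already have usable tags or captions never trigger a request.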

For PDFs converted via the tool on this site, the OCR backends that use vision models (OpenAI, Gemini) effectively describe images in their output, which goes a long way toward accessible Markdown.

Tables

Tables are the trickiest accessibility case. A sighted reader scans a table visually; a screen-reader user navigates cell by cell and needs the headers announced for each cell.

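The pipe-table syntax (an extension popularized by GitHub Flavored Markdown and supported by most renderers, not part of core Markdown) works for screen readers when rendered to HTML: the first row, separated from the body by a --- line, becomes <th> elements that get announced. The failure modes: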

Merged cells in the source PDF that get split or duplicated in Markdown. Markdown doesn't support colspan/rowspan, so a complex table can't be losslessly converted. Either re-design the table, or fall back to HTML for that section.
Layout tables (using table structure for visual alignment, not for tabular data). Strip these — they're not real tables and screen readers shouldn't announce them as such.
Headers in the first column instead of the first row. Mark up explicitly with HTML if Markdown can't express it.
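
For the cases Markdown can't express, such as headers in the first column, a fallback is to emit HTML for that one table. The sketch below assumes the table has already been extracted into a list of rows; table_to_html is a hypothetical helper, and it deliberately leaves out merged cells, which would need colspan/rowspan handling on top.

```python
from html import escape


def table_to_html(rows: list[list[str]],
                  header_row: bool = True,
                  header_col: bool = False) -> str:
    """Render rows as an HTML table with explicit column and row headers."""
    lines = ["<table>"]
    for r, row in enumerate(rows):
        cells = []
        for c, cell in enumerate(row):
            text = escape(cell)  # keep literal <, >, & from breaking the markup
            if header_row and r == 0:
                cells.append(f'<th scope="col">{text}</th>')
            elif header_col and c == 0:
                cells.append(f'<th scope="row">{text}</th>')
            else:
                cells.append(f"<td>{text}</td>")
        lines.append("  <tr>" + "".join(cells) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


print(table_to_html(
    [["Metric", "2023", "2024"],
     ["Users", "1,200", "3,400"],
     ["Churn", "4%", "3%"]],
    header_col=True,
))
```

The scope attribute is what tells a screen reader whether a header applies to its column or its row, which is the association a plain Markdown table can't carry for the first column.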

See preserving tables when converting PDF to Markdown for the broader conversion side.

Form fields

PDFs often contain form fields (text inputs, checkboxes, dropdowns). When converted to Markdown, these become flat text and lose their interactive nature. For accessibility:

Document the field labels even when the interactive nature is gone. A list of "Name: _", "Email: _" in the output preserves the structure for someone reading the conversion.
For tagged PDFs, the field labels are typically associated with each input. Extract them.
For untagged scanned forms, OCR captures the visible labels but loses the label-to-input association, so manual cleanup may be needed.
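
As a rough illustration of documenting field labels, the sketch below turns already-extracted fields into a Markdown list. The dictionary keys (label, name, type, value) and the function name are assumptions chosen for the example, not the output format of any particular extraction library.

```python
def fields_to_markdown(fields: list[dict]) -> str:
    """Render form fields as a Markdown list that keeps each label visible."""
    lines = []
    for fld in fields:
        # Prefer the human-readable label; fall back to the internal field name.
        label = (fld.get("label") or fld.get("name") or "Unlabeled field").strip()
        value = fld.get("value")
        if fld.get("type", "text") == "checkbox":
            box = "x" if value else " "
            lines.append(f"- [{box}] {label}")
        else:
            lines.append(f"- {label}: {value if value else '_'}")
    return "\n".join(lines)


print(fields_to_markdown([
    {"label": "Name", "type": "text", "value": ""},
    {"name": "email", "type": "text", "value": "a@example.com"},
    {"label": "Subscribe", "type": "checkbox", "value": True},
]))
```

A field that falls back to its internal name, or to "Unlabeled field", is a useful signal that the source form was missing an accessible label and needs a manual fix.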

See PDF forms data extraction for the data-extraction side.

Reading order

A subtle accessibility failure: reading order in the source PDF doesn't match the visual layout. This happens when:

The PDF was generated by a tool that didn't write content in visual order (some LaTeX configurations, older PowerPoint).
The page has multiple columns and the tagging walks the columns in the wrong order.
Sidebars or pull quotes are tagged in the middle of the body text.

For a sighted user this is invisible. For a screen-reader user the document reads as gibberish — sentences interleaved with sidebar content, footnotes appearing mid-paragraph.

The fix during conversion:

Trust the tags in well-tagged PDFs from major modern tools.
Override the tags when reading order is clearly broken. Open the original PDF visually and read it; if the screen-reader order would differ from what you expect, manually re-order the converted Markdown.
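
For simple two-column pages, part of that re-ordering can be automated. The sketch below is a heuristic under stated assumptions: each text block comes with a bounding box (for example from pymupdf's page.get_text("blocks")), the page has at most two columns, and any block wider than 60% of the page is a full-width element such as a title. Block and reading_order are illustrative names.

```python
from typing import NamedTuple


class Block(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def reading_order(blocks: list[Block], page_width: float) -> list[Block]:
    """Sort blocks: full-width blocks split the page into bands; within a
    band, read the left column top to bottom, then the right column."""
    mid = page_width / 2

    def is_full(b: Block) -> bool:
        return b.x0 < mid < b.x1 and (b.x1 - b.x0) > 0.6 * page_width

    full_tops = sorted(b.y0 for b in blocks if is_full(b))

    def key(b: Block) -> tuple:
        # Band = number of full-width blocks that start above this block.
        band = sum(1 for top in full_tops if top < b.y0)
        if is_full(b):
            return (band, 1, 0, b.y0)  # after the column blocks of its band
        column = 0 if (b.x0 + b.x1) / 2 < mid else 1
        return (band, 0, column, b.y0)

    return sorted(blocks, key=key)


# Blocks listed top to bottom, which interleaves the two columns.
page = [
    Block(50, 40, 550, 70, "Title"),
    Block(50, 100, 290, 300, "Left 1"),
    Block(310, 100, 550, 300, "Right 1"),
    Block(50, 320, 290, 500, "Left 2"),
    Block(310, 320, 550, 500, "Right 2"),
]
print([b.text for b in reading_order(page, page_width=600)])
```

Sidebars, pull quotes, and three-column layouts break these assumptions, so the result still needs the visual read-through described above.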

Accessibility-conscious conversion checklist

When converting a PDF for accessibility purposes:

[ ] Headings come through as #, ##, ### matching the source's logical structure.
[ ] Lists are real Markdown lists (- or 1.), not paragraphs with manual hyphens.
[ ] Images have descriptive alt text or are marked decorative.
[ ] Tables use proper Markdown table syntax with a header row.
[ ] Reading order is logical and linear; no sidebars or footnotes interrupting paragraphs.
[ ] Mathematical content has both visual rendering and a text equivalent (LaTeX in Markdown, or a description).
[ ] Hyperlinks have descriptive link text ("see the methodology section"), not bare URLs or "click here".
[ ] The document language is identified (for example, a lang attribute on the rendered HTML) so screen readers pronounce the text correctly.
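
A few of these checks can be automated. The sketch below is a partial linter with a hypothetical name, audit_markdown; it covers only heading-level jumps, empty alt text, non-descriptive link text, and bare URLs. It works line by line with regular expressions, so it does not skip fenced code blocks and it can't judge reading order or the quality of alt text.

```python
import re


def audit_markdown(md: str) -> list[str]:
    """Return a list of likely accessibility problems found in Markdown text."""
    problems = []
    previous_level = 0  # the first heading is expected to be level 1
    for number, line in enumerate(md.splitlines(), start=1):
        heading = re.match(r"^(#{1,6})\s+\S", line)
        if heading:
            level = len(heading.group(1))
            if level > previous_level + 1:
                problems.append(
                    f"line {number}: heading jumps from level {previous_level} to {level}"
                )
            previous_level = level
        if re.search(r"!\[\s*\]\(", line):
            problems.append(f"line {number}: image has empty alt text (fine only if decorative)")
        if re.search(r"\[(click here|here|link)\]\(", line, flags=re.IGNORECASE):
            problems.append(f"line {number}: non-descriptive link text")
        # A URL not directly preceded by "(", "<", or "[" is treated as bare.
        if re.search(r"(?<![(<\[])\bhttps?://\S+", line):
            problems.append(f"line {number}: bare URL")
    return problems


sample = (
    "# Title\n\n### Deep\n\n![](fig.png)\n\n"
    "See [click here](https://example.com) or https://example.com/raw\n\n"
    "![Chart of growth](fig2.png)"
)
for problem in audit_markdown(sample):
    print(problem)
```

Treat the output as a list of places to look, not a pass/fail result: an empty alt attribute is correct for a decorative image, and only a human can confirm that.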

When the conversion itself is the accessibility solution

For users of screen readers and other assistive technology, a well-converted Markdown or HTML file is often more accessible than the original PDF. PDFs are designed for visual fidelity; Markdown is designed for semantic structure.

Converting an untagged PDF to clean Markdown — with proper headings, lists, alt text, and reading order — produces something that works in any web browser, with any screen reader, on any device. That's often a better deliverable than spending hours retroactively tagging the source PDF.

For one-off conversions, the tool on this site with a vision-model backend produces Markdown with image descriptions inline. The output may need light cleanup before final publication, but the heavy lifting (text extraction, image description, structure detection) is done automatically.
