Extracting Bookmarks and Tables of Contents from PDFs
A PDF's table of contents is one of the most useful pieces of structure it carries — a ready-made map of the document's sections and where they start. Extracting it well lets you build navigation, split a document by chapter, generate Markdown headings, or feed a structured outline to an LLM. But "the table of contents" can mean two different things in a PDF, and they're extracted in completely different ways.
Two different "tables of contents"
The outline (real bookmarks)
A PDF can carry an outline — also called bookmarks — a hierarchical navigation tree stored as actual metadata in the file. This is what populates the sidebar panel in your PDF viewer. Each entry has a title, a nesting level, and a destination (a page and often a precise position). This is structured data: clean, reliable, and machine-readable.
The printed TOC
Separately, a document may have a printed table of contents — the "Contents" page near the front, with chapter names and page numbers typeset as visible text. This is just text on a page, with no inherent structure beyond what you can parse from its layout.
The crucial point: these are independent. A PDF can have one, both, or neither. A document with a beautiful printed TOC page may have no outline metadata (so the viewer's bookmark sidebar is empty), and a document with rich bookmarks may have no printed TOC. Always check which you actually have before choosing an approach.
Extracting the outline (the easy, reliable case)
If the PDF has an outline, extract that — it's structured and accurate. With PyMuPDF:
import fitz  # PyMuPDF
doc = fitz.open("document.pdf")
toc = doc.get_toc()  # list of [level, title, page_number]; page numbers are 1-based
for level, title, page in toc:
    indent = "  " * (level - 1)  # two spaces per nesting level
    print(f"{indent}{title} (p. {page})")
get_toc() returns a flat list where each entry carries its nesting level, so you can reconstruct the hierarchy directly. Turning it into Markdown headings or a nested list is trivial:
lines = []
for level, title, page in doc.get_toc():
    # Two spaces per level so Markdown renders the entries as a nested list
    lines.append("  " * (level - 1) + f"- [{title}](#page-{page})")
markdown_toc = "\n".join(lines)
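If a nested structure is more convenient than the flat list, the levels can be folded into a tree. The following is a minimal sketch, not part of PyMuPDF: `build_outline_tree` and the dict layout are illustrative names, and the sample list stands in for the result of `get_toc()`.

```python
def build_outline_tree(toc):
    """Turn a flat [level, title, page] list into nested dicts."""
    root = {"title": None, "page": None, "children": []}
    stack = [root]  # stack[i] is the most recent node at level i (root is level 0)
    for level, title, page in toc:
        node = {"title": title, "page": page, "children": []}
        # Clamp the level so a malformed jump (e.g. 1 -> 3) still attaches somewhere.
        level = max(1, min(level, len(stack)))
        del stack[level:]  # close any deeper or sibling nodes
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]


# Sample data in the same shape get_toc() returns
sample_toc = [
    [1, "Chapter 1", 1],
    [2, "Section 1.1", 3],
    [2, "Section 1.2", 6],
    [1, "Chapter 2", 10],
]
tree = build_outline_tree(sample_toc)
for chapter in tree:
    print(chapter["title"], [child["title"] for child in chapter["children"]])
```

The stack keeps one open node per level, so each entry attaches to the nearest preceding entry one level above it.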
Other tools that read the outline:
- pdftk: `pdftk in.pdf dump_data` includes the bookmark structure (BookmarkTitle, BookmarkLevel, BookmarkPageNumber).
- pikepdf / qpdf: expose the outline at a lower level for programmatic access.
- pdfminer.six: `PDFDocument.get_outlines()` walks the outline tree.
If the outline exists, you're done — it's the cleanest structure a PDF offers.
When there's no outline: parsing the printed TOC
Many PDFs, especially scanned documents and files exported from word processors, have no outline. In that case you have to extract the printed "Contents" page as text and parse it, which is messier:
- Locate the TOC page(s). Usually near the front; often detectable by the heading "Contents" / "Table of Contents" and the characteristic pattern of `title .... page-number` lines.
- Extract the text of those pages (see PDF text extraction methods).
- Parse each entry. The common pattern is a title, a run of dot leaders, and a page number: `Introduction ............ 12`. A regex like `^(.*?)\.{2,}\s*(\d+)$` captures most of them, but dot leaders aren't universal; some TOCs use tabs or just whitespace alignment.
- Infer hierarchy from indentation or numbering (`1`, `1.1`, `1.1.1`), since the flat text doesn't carry explicit levels.
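The parsing and hierarchy steps above can be sketched as follows. This is one possible heuristic, not a general solution: the pattern is a slightly widened variant of the regex above (it also accepts space-separated dots, tabs, and runs of spaces), levels come from section numbering only, and `parse_toc_lines` is an illustrative name.

```python
import re

# Title, then a separator (dot leaders, a tab, or 2+ spaces), then the page number.
ENTRY_RE = re.compile(
    r"^(?P<title>.+?)\s*(?:(?:\.\s?){2,}|\s{2,}|\t)\s*(?P<page>\d+)\s*$"
)
# Leading section number such as "1", "1.1", or "2." followed by whitespace.
NUMBER_RE = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+")


def parse_toc_lines(lines):
    """Return [level, title, printed_page] for every line that looks like a TOC entry."""
    entries = []
    for line in lines:
        match = ENTRY_RE.match(line.strip())
        if not match:
            continue  # headings like "Contents" and stray lines are skipped
        title = match.group("title").strip()
        number = NUMBER_RE.match(title)
        # "1" -> level 1, "1.1" -> level 2, "1.1.1" -> level 3; unnumbered -> level 1
        level = number.group(1).count(".") + 1 if number else 1
        entries.append([level, title, int(match.group("page"))])
    return entries


sample = [
    "Contents",
    "1 Introduction ............ 1",
    "1.1 Background . . . . . 7",
    "2. Methods\t12",
    "Appendix A    42",
]
for entry in parse_toc_lines(sample):
    print(entry)
```

Unnumbered entries all land at level 1 here; a TOC that signals hierarchy only through indentation would need the leading whitespace measured before the line is stripped.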
This is heuristic and brittle. Page numbers in a printed TOC are also the printed numbers, which may not match the PDF's physical page index (front matter often uses roman numerals), so mapping a TOC entry to an actual PDF page takes an extra offset step.
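A minimal sketch of that offset step, assuming a single constant offset for the whole body of the document (the simplest case; documents with several numbering schemes need one offset per range). The function name and the sample numbers are illustrative.

```python
def apply_page_offset(entries, offset, page_count):
    """Shift printed page numbers to 1-based PDF page numbers.

    Entries that land outside the document are dropped rather than guessed at.
    """
    shifted = []
    for level, title, printed_page in entries:
        pdf_page = printed_page + offset
        if 1 <= pdf_page <= page_count:
            shifted.append([level, title, pdf_page])
    return shifted


# Assumption for this example: printed page 1 sits on the 15th page of the PDF,
# because 14 pages of front matter come first.
offset = 15 - 1
entries = [[1, "Introduction", 1], [1, "Methods", 12], [1, "Bad entry", 500]]
print(apply_page_offset(entries, offset, page_count=200))
```

The offset is usually found by locating one known heading in the extracted page text and subtracting its printed number from the page where it actually appears.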
Generating an outline that's missing
Sometimes the most useful move is to create an outline a PDF lacks. If you've extracted headings during conversion — pymupdf4llm and similar tools detect heading levels from font size and weight — you can build a TOC from those, or write a real outline back into the PDF:
import fitz  # PyMuPDF
doc = fitz.open("document.pdf")
# Each entry is [level, title, page]; pages are 1-based
toc = [
    [1, "Chapter 1", 1],
    [2, "Section 1.1", 3],
    [1, "Chapter 2", 10],
]
doc.set_toc(toc)
doc.save("with_bookmarks.pdf")
This is handy for adding navigation to reports, scanned books, or anything that came out of a tool that doesn't write bookmarks.
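Going from detected headings to a list that `set_toc()` accepts takes one normalization step, because PyMuPDF expects the first entry to be level 1 and no entry to sit more than one level below the one before it. The sketch below assumes headings have already been detected and arrive as `(font_size, text, page)` tuples; that input shape and the name `headings_to_toc` are assumptions for illustration, not the output format of any particular tool.

```python
def headings_to_toc(headings):
    """Convert (font_size, text, page) tuples in reading order to [level, title, page]."""
    # Largest font size becomes level 1, the next largest level 2, and so on.
    sizes = sorted({size for size, _, _ in headings}, reverse=True)
    size_to_level = {size: rank + 1 for rank, size in enumerate(sizes)}

    toc = []
    previous_level = 0
    for size, text, page in headings:
        # Clamp so the first entry is level 1 and no entry skips a level.
        level = min(size_to_level[size], previous_level + 1)
        toc.append([level, text, page])
        previous_level = level
    return toc


detected = [
    (18, "Chapter 1", 1),
    (12, "A small heading", 2),  # would be level 3, clamped to 2
    (14, "Section 1.2", 3),
    (18, "Chapter 2", 10),
]
for entry in headings_to_toc(detected):
    print(entry)
```

Clamping keeps the outline valid at the cost of flattening an occasional heading; whether that is acceptable depends on how reliable the font-size signal is for the document.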
Using the TOC downstream
A clean outline is more than navigation — it's a structural backbone:
- Splitting by chapter. The page destinations tell you exactly where to cut a long PDF into per-chapter files for separate processing.
- Markdown heading hierarchy. Map outline levels to `#`, `##`, `###` to seed a converted document's structure.
- Chunking for RAG. Section boundaries from the outline make far better chunk boundaries than fixed-size windows (see building a RAG pipeline from PDFs).
- Context for LLMs. Handing a model the document's outline up front gives it a map of what's where, which improves its answers to requests like "summarize section 3."
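For the chapter-splitting case, the cut points fall out of the outline with a little arithmetic. This sketch computes 1-based, inclusive page ranges for the top-level entries; `chapter_page_ranges` is an illustrative helper, and the sample list stands in for the result of `get_toc()`.

```python
def chapter_page_ranges(toc, page_count, level=1):
    """Return (title, first_page, last_page) for each entry at the given level."""
    # Skip entries without a usable page number (some outline items do not point at a page).
    starts = [(title, page) for lvl, title, page in toc if lvl == level and page >= 1]
    ranges = []
    for i, (title, start) in enumerate(starts):
        if i + 1 < len(starts):
            end = starts[i + 1][1] - 1  # page before the next chapter starts
        else:
            end = page_count  # last chapter runs to the end of the file
        # If two chapters start on the same page, keep a one-page range.
        ranges.append((title, start, max(start, end)))
    return ranges


sample_toc = [
    [1, "Chapter 1", 1],
    [2, "Section 1.1", 3],
    [1, "Chapter 2", 10],
]
for title, first, last in chapter_page_ranges(sample_toc, page_count=20):
    print(f"{title}: pages {first}-{last}")
```

With PyMuPDF, each range could then be copied into a new document with `insert_pdf(doc, from_page=first - 1, to_page=last - 1)`, since that method takes 0-based page indices.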
Quick reference
- Need the section structure? Check for an outline first (`doc.get_toc()`); it's clean and reliable.
- No outline, but a printed Contents page? Parse it with a dot-leader regex and infer hierarchy from numbering/indentation; expect cleanup.
- Page numbers don't match? Account for the offset between printed numbers and physical PDF page index.
- No TOC at all? Build one from detected headings and optionally write it back with `set_toc()`.
Conclusion
The first question with any PDF table of contents is which kind you have: a structured outline (extract it directly and you're done) or a printed page of text (parse it and brace for edge cases). The outline is one of the few genuinely reliable pieces of structure a PDF carries — when it's there, use it, and when it's not, the headings recovered during conversion are usually enough to reconstruct one.