Extracting Highlights, Notes, and Comments from Annotated PDFs
You've read a PDF carefully, highlighted the passages that matter, scribbled notes in the margins, and now you want those annotations somewhere you can search and reuse them. They shouldn't be locked up in the PDF — they should be in your notes system, your reading app, or your knowledge base.
PDF annotations are surprisingly extractable. The PDF format stores them as structured objects separate from the page content, which means a few lines of code can pull them out cleanly. This guide walks through what annotation types exist, what tools extract them, and how to turn them into something useful downstream.
Annotation types
A PDF can contain a long list of annotation types. The ones that actually carry user content:
- Highlight — colored overlay on a region of text. The annotation stores the region's coordinates and an optional comment; the highlighted text itself is read from the page underneath.
- Underline — same as highlight but with an underline instead of a fill.
- Strikeout — strikethrough on a region of text.
- Squiggly — wavy underline, usually for "this needs attention."
- Sticky note (Text annotation) — a small note icon placed at a point on the page. Contains a written note that opens when clicked.
- Free text — text directly typed onto the page (not anchored to source text).
- Ink (Drawing) — freehand pen strokes. Hardest to extract because they're paths, not text.
- Stamp — a graphic placed on the page ("Approved", "Draft", custom images).
- Link — clickable region pointing to a URL or another page in the document.
For most knowledge-work use cases, the high-value annotations are highlights with their associated text, and sticky notes with their written content. Everything else is usually noise.
Extracting annotations with pymupdf
pymupdf is the most ergonomic tool for annotation extraction in Python:
```python
import json

import pymupdf

doc = pymupdf.open("annotated.pdf")
annotations = []

for page_num, page in enumerate(doc, start=1):
    for annot in page.annots():
        info = annot.info           # author, content, creation date, etc.
        annot_type = annot.type[1]  # human-readable type, e.g. "Highlight"
        text = ""
        if annot_type in ("Highlight", "Underline", "Squiggly", "StrikeOut"):
            # Text markup stores four vertices per quad; read the text under each quad
            points = annot.vertices or []
            for i in range(0, len(points) - 3, 4):
                quad = pymupdf.Quad(points[i:i + 4])
                text += page.get_textbox(quad.rect) + " "
            text = text.strip()
        annotations.append({
            "page": page_num,
            "type": annot_type,
            "highlighted_text": text,
            "comment": info.get("content", ""),
            "author": info.get("title", ""),  # PDF stores the author in the "title" field
            "color": annot.colors.get("stroke") or annot.colors.get("fill"),
            "created": info.get("creationDate", ""),
        })

print(json.dumps(annotations, indent=2))
```
This produces a flat JSON list of every annotation, with the highlighted text, any associated comment, and the page number. From here you can format as Markdown, CSV, or anything else.
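As one example of that last step, the list can be written out as CSV with the standard library. This is a minimal sketch: it assumes each dict carries the keys used in the extraction code above, and it flattens the color tuple into a string because CSV cells hold plain text. The name `annotations_to_csv` is illustrative.

```python
import csv
import io

FIELDS = ["page", "type", "highlighted_text", "comment", "author", "color", "created"]

def annotations_to_csv(annotations):
    """Return the annotation dicts as CSV text, one row per annotation."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=FIELDS, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for a in annotations:
        row = dict(a)
        color = row.get("color")
        # Flatten (r, g, b) into "r,g,b"; leave missing colors empty.
        row["color"] = ",".join(f"{c:.2f}" for c in color) if color else ""
        writer.writerow(row)
    return buf.getvalue()
```

Write the returned string to a file with `Path("notes.csv").write_text(...)`, or pass it on to a spreadsheet import.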
For highlights specifically, the structure of the output depends on how the user highlighted. A single uninterrupted highlight across one line produces one annotation with a single rectangle. A highlight that wraps across multiple lines produces one annotation with multiple rectangles ("quads"); you need to read the text under each quad and concatenate.
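The grouping step can be sketched without any PDF library. The sketch below assumes the vertices arrive as a flat list of (x, y) points, four per quad, which is how PyMuPDF's `annot.vertices` reports them for text markup; the helper names are illustrative.

```python
def quads_to_rects(vertices):
    """Group a flat vertex list into quads and return one bounding box per quad."""
    if len(vertices) % 4 != 0:
        raise ValueError("expected four points per quad")
    rects = []
    for i in range(0, len(vertices), 4):
        xs = [p[0] for p in vertices[i:i + 4]]
        ys = [p[1] for p in vertices[i:i + 4]]
        rects.append((min(xs), min(ys), max(xs), max(ys)))
    return rects

def join_fragments(fragments):
    """Concatenate per-quad text, collapsing stray whitespace and line breaks."""
    return " ".join(" ".join(fragments).split())
```

Each rectangle is the region to read text from; `join_fragments` then stitches the per-line pieces into one passage.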
Per-app extraction quirks
Different PDF readers write annotations slightly differently. Knowing what to expect from each helps the extraction:
Adobe Acrobat / Reader — produces clean, well-structured annotations. The author field contains the user's name. Comments are stored in the content field; highlights with no comment have an empty content field.
PDF Expert (macOS/iOS) — similar to Acrobat. Free text annotations are stored as FreeText type and contain rich text formatting that you may want to strip.
GoodReader, LiquidText — produce standard annotations but sometimes use proprietary extensions for advanced features (linked notes in LiquidText). Standard extraction gets the basic content; the advanced links are lost.
Apple Preview — produces a smaller subset of annotation types and sometimes stores them slightly differently. Highlight color is sometimes not in the standard color field.
Foxit, Xodo, Drawboard — generally standard. Watch for non-standard color encodings.
Tablet apps (GoodNotes, Notability, Noteshelf) — usually export annotations as flattened ink drawings rather than text-anchored highlights. Hard to extract meaningfully. If you use these apps and want extractable annotations, configure them to use the standard highlight tool, not the highlighter pen.
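One way to cope with the color variations mentioned above is to normalize whatever comes back into an RGB triple before doing anything else. The PDF specification allows an annotation's color array to have 0 (no color), 1 (gray), 3 (RGB), or 4 (CMYK) components. The 0-255 rescaling below is an extra assumption about misbehaving writers, not something the specification describes, and the function name is illustrative.

```python
def normalize_color(components):
    """Return an (r, g, b) tuple of floats in 0..1, or None if no color is set."""
    if not components:
        return None
    values = [float(c) for c in components]
    # Assumption: values above 1 come from a writer using a 0-255 scale.
    if max(values) > 1.0:
        values = [v / 255.0 for v in values]
    if len(values) == 1:  # grayscale
        return (values[0],) * 3
    if len(values) == 3:  # already RGB
        return tuple(values)
    if len(values) == 4:  # naive CMYK to RGB conversion
        c, m, y, k = values
        return ((1 - c) * (1 - k), (1 - m) * (1 - k), (1 - y) * (1 - k))
    return None
```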
Turning annotations into Markdown notes
A common goal: produce a Markdown document with the highlights organized by page, with comments inline. A template:
```python
def annotations_to_markdown(annotations, doc_title):
    md = [f"# Notes from {doc_title}\n"]
    current_page = None
    for a in annotations:
        # Start a new section whenever the page number changes
        if a["page"] != current_page:
            md.append(f"\n## Page {a['page']}\n")
            current_page = a["page"]
        if a["type"] in ("Highlight", "Underline"):
            md.append(f"> {a['highlighted_text']}\n")
            if a["comment"]:
                md.append(f"\n*{a['comment']}*\n")
        elif a["type"] == "Text":  # sticky note
            md.append(f"📝 **Note:** {a['comment']}\n")
        elif a["type"] == "FreeText":
            md.append(f"✏️ {a['comment']}\n")
        md.append("\n")
    return "".join(md)
```
The output Markdown drops cleanly into Obsidian, Notion, Logseq, or anywhere else you take notes. See "Converting research papers to Markdown for Obsidian, Notion, and Logseq" for the broader workflow of building a knowledge base from PDFs and their annotations.
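The template starts a new heading whenever the page number changes, so it expects its input grouped by page and, ideally, in reading order. A minimal sketch of a sort step follows. It assumes each dict also carries hypothetical `top` and `left` keys, which the extraction code above does not produce; they could be filled from `annot.rect.y0` and `annot.rect.x0`. It also assumes y grows downward, as it does in PyMuPDF's page coordinates.

```python
def sort_reading_order(annotations):
    """Sort by page, then top-to-bottom, then left-to-right."""
    return sorted(
        annotations,
        # Missing positions fall back to 0.0 so the sort still works.
        key=lambda a: (a["page"], a.get("top", 0.0), a.get("left", 0.0)),
    )
```

Calling `annotations_to_markdown(sort_reading_order(annotations), title)` then yields one section per page with notes in the order they appear.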
Color-coded highlights
Many readers use highlight colors to encode meaning: yellow for important, green for evidence, red for disagreement. The annotation extraction can preserve color so the downstream system can filter or render by category.
```python
def color_category(rgb):
    # Missing colors, or non-RGB encodings (gray, CMYK), can't be unpacked below
    if not rgb or len(rgb) != 3:
        return "unknown"
    r, g, b = rgb
    # Naive matching; refine for your color scheme
    if r > 0.7 and g > 0.7 and b < 0.3:
        return "yellow"  # important
    if g > 0.5 and r < 0.5:
        return "green"   # evidence
    if r > 0.7 and g < 0.3:
        return "red"     # disagree
    if r < 0.3 and g < 0.3 and b > 0.7:
        return "blue"    # for follow-up
    return "other"
```
A small thing that pays off: define your color scheme up front and stick to it. A library of 500 PDFs with consistent color meaning becomes a structured note system; one with random highlight colors becomes noise.
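Once the scheme is fixed, an alternative to the threshold rules above is to match each highlight to the nearest color in a declared palette. This is a sketch of a different technique (nearest neighbor by Euclidean distance in RGB), not the matching used above. The palette values and the `max_distance` cutoff are assumptions to tune for your reader's actual colors.

```python
import math

# Assumed reference colors; replace with the values your PDF reader writes.
PALETTE = {
    "yellow": (1.0, 1.0, 0.0),
    "green": (0.0, 1.0, 0.0),
    "red": (1.0, 0.0, 0.0),
    "blue": (0.0, 0.0, 1.0),
}

def nearest_category(rgb, max_distance=0.6):
    """Return the palette name closest to rgb, or "other" if nothing is close."""
    if not rgb or len(rgb) != 3:
        return "unknown"
    name, ref = min(PALETTE.items(), key=lambda item: math.dist(rgb, item[1]))
    return name if math.dist(rgb, ref) <= max_distance else "other"
```

Adding a category becomes a one-line change to `PALETTE` rather than a new branch of comparisons.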
Sticky notes and free text
Sticky notes (the small note icon placed at a point on the page) contain longer-form written content. They're often more valuable than highlights because they're your own words.
Extraction is straightforward — the content is in the content field of the annotation. The wrinkle is that sticky notes aren't anchored to source text the way highlights are. The annotation has page coordinates but no associated "what text is this about." In your Markdown output, you can include the surrounding page text for context:
```python
import pymupdf

def context_around_point(page, x, y, radius=100):
    # Get text within a bounding box around the annotation's anchor point
    rect = pymupdf.Rect(x - radius, y - radius, x + radius, y + radius)
    return page.get_textbox(rect)
```
The resulting note in Markdown looks like:
📝 Note on page 12, near: "...the methodology fails to account for selection bias when..."
The authors don't mention the well-known critique from Hofmann (2019).
This format gives future you something useful to search and read.
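A small formatter can produce that layout from the context string and the note's content. This is a sketch: `format_sticky_note` and its `max_context` parameter are illustrative names, and the trimming rule (cut at the last whole word) is one choice among several.

```python
def format_sticky_note(page_num, context, comment, max_context=80):
    """Render a sticky note with a trimmed snippet of nearby page text."""
    snippet = " ".join(context.split())  # collapse line breaks from the page
    if len(snippet) > max_context:
        # Cut at the limit, then drop the trailing partial word.
        snippet = snippet[:max_context].rsplit(" ", 1)[0]
    header = f"📝 Note on page {page_num}"
    if snippet:
        header += f', near: "...{snippet}..."'
    return f"{header}\n{comment.strip()}\n"
```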
Tools that do this without code
If you don't want to write code, a few options:
- Adobe Acrobat: File → Export To → Summarize Comments. Produces a PDF or text summary. Cumbersome formatting; usable in a pinch.
- PDF Expert: "Export Annotations" produces a clean Markdown or text file with all annotations. The most ergonomic GUI option on macOS.
- Readwise Reader: import the PDF into Readwise, annotate there or sync existing annotations, and Readwise builds a structured highlights database. Subscription service; good for heavy readers.
- Mendeley, Zotero: academic reference managers with annotation extraction. Better for papers than for general documents.
- Highlights.app (macOS): dedicated annotation extractor with Markdown export. Pleasant interface for academic reading.
For occasional extraction, the GUI tools are faster than writing code. For batch processing (a research library, a stack of contracts), code wins.
Extracting ink drawings
Ink annotations (freehand drawings) are paths, not text. There's no text in the annotation; just a sequence of x,y coordinates that the reader joins into strokes.
For most knowledge-work purposes, ink annotations aren't worth extracting — they're not searchable, and rendering them outside the original PDF is awkward. The two reasonable approaches:
- Skip them. Most ink content is supplementary; the highlight + sticky note pipeline captures the actual knowledge.
- Render the page with annotations, then OCR. If the ink contains text (handwritten notes, hand-drawn diagrams with labels), render the annotated page to an image and run a vision model on it. See handwritten OCR for the workflow.
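If you take the second approach, it can help to crop the rendered image to the inked region before passing it on. A sketch, assuming the strokes are available as a list of strokes, each a list of (x, y) points, which is roughly the shape PyMuPDF reports for Ink vertices. The name `ink_bounding_box` and the `margin` parameter are illustrative.

```python
def ink_bounding_box(strokes, margin=5.0):
    """Return (x0, y0, x1, y1) enclosing all strokes, padded by margin."""
    points = [pt for stroke in strokes for pt in stroke]
    if not points:
        return None
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs) - margin, min(ys) - margin, max(xs) + margin, max(ys) + margin)
```

The resulting box is in page coordinates, so scale it by the render resolution before cropping the image.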
Annotation extraction at scale
A small batch script for processing a folder of annotated PDFs:
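```python
import json
from pathlib import Path

import pymupdf

input_dir = Path("annotated-pdfs/")
output_dir = Path("extracted-notes/")
output_dir.mkdir(exist_ok=True)

# extract_annot(annot, page, page_num) is assumed to wrap the per-annotation
# logic from the pymupdf example earlier in this guide and return one dict.
for pdf in input_dir.glob("*.pdf"):
    annotations = []
    with pymupdf.open(pdf) as doc:  # close each file when done
        for page_num, page in enumerate(doc, start=1):
            for annot in page.annots():
                annotations.append(extract_annot(annot, page, page_num))
    # Only write output for PDFs that actually contain annotations
    if annotations:
        out = output_dir / f"{pdf.stem}.json"
        out.write_text(json.dumps(annotations, indent=2))
```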
Run periodically against your reading folder, and you accumulate a structured archive of every annotation you've ever made.
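For periodic runs, it may be worth skipping PDFs that have not changed since their notes were last written. The sketch below compares file modification times; `needs_extraction` is a hypothetical helper, and it assumes the `<stem>.json` naming used by the batch script. Because that script writes no file for PDFs without annotations, those would still be rescanned on every run.

```python
from pathlib import Path

def needs_extraction(pdf_path, output_dir):
    """True if the PDF has no JSON output yet or was modified after it was written."""
    pdf_path = Path(pdf_path)
    out = Path(output_dir) / f"{pdf_path.stem}.json"
    if not out.exists():
        return True
    # Re-extract only when the PDF is newer than its extracted notes.
    return pdf_path.stat().st_mtime > out.stat().st_mtime
```

In the batch loop, this would sit at the top of the body as `if not needs_extraction(pdf, output_dir): continue`.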
When annotations aren't preserved
A few things destroy annotations:
- Printing to PDF. Re-printing an annotated PDF as a new PDF flattens annotations into the page content. The visible highlights stay (as colored backgrounds), but the structured annotations are gone.
- PDF flattening. Some tools have an explicit "flatten" option that does the same.
- Converting to other formats. Word, EPUB, plain text — none of these formats preserve PDF annotations. Extract before converting.
The lesson: extract annotations as early in your workflow as possible. Once a PDF has been flattened or converted, the structured annotations are gone for good.