Extracting Highlights, Notes, and Comments from Annotated PDFs

You've read a PDF carefully, highlighted the passages that matter, scribbled notes in the margins, and now you want those annotations somewhere you can search and reuse them. They shouldn't be locked up in the PDF — they should be in your notes system, your reading app, or your knowledge base.

PDF annotations are surprisingly extractable. The PDF format stores them as structured objects separate from the page content, which means a few lines of code can pull them out cleanly. This guide walks through what annotation types exist, what tools extract them, and how to turn them into something useful downstream.

Annotation types

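A PDF can contain a long list of annotation types. The ones that carry user content, and the ones this guide deals with, are text markup (Highlight, Underline, Squiggly, StrikeOut), sticky notes (Text), free-text boxes (FreeText), and freehand ink (Ink).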

For most knowledge-work use cases, the high-value annotations are highlights with their associated text, and sticky notes with their written content. Everything else is usually noise.

Extracting annotations with pymupdf

PyMuPDF is a convenient Python library for annotation extraction. Recent releases are imported as pymupdf; older ones use the name fitz:

import json

import pymupdf

# Annotation types that mark up existing page text
MARKUP_TYPES = ("Highlight", "Underline", "Squiggly", "StrikeOut")

doc = pymupdf.open("annotated.pdf")
annotations = []

for page_num, page in enumerate(doc, start=1):
    for annot in page.annots():
        info = annot.info  # author, content, creation date, etc.
        annot_type = annot.type[1]  # human-readable type name
        text = ""

        if annot_type in MARKUP_TYPES:
            # annot.vertices is a flat list of points, four per quad
            # (one quad per highlighted line). Read the text under each.
            vertices = annot.vertices or []
            pieces = []
            for i in range(0, len(vertices) - 3, 4):
                quad = pymupdf.Quad(vertices[i:i + 4])
                pieces.append(page.get_textbox(quad.rect))
            # Collapse line breaks and repeated spaces
            text = " ".join(" ".join(pieces).split())

        colors = annot.colors or {}
        annotations.append({
            "page": page_num,
            "type": annot_type,
            "highlighted_text": text,
            "comment": info.get("content", ""),
            "author": info.get("title", ""),
            "color": colors.get("stroke") or colors.get("fill"),
            "created": info.get("creationDate", ""),
        })

doc.close()
print(json.dumps(annotations, indent=2))

This produces a flat JSON list of every annotation, with the highlighted text, any associated comment, and the page number. From here you can format as Markdown, CSV, or anything else.
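As one example of reformatting, here is a minimal sketch of a CSV export. The function name annotations_to_csv and the FIELDS list are illustrative; the keys match the dictionaries built in the script above, and the color is written as a comma-separated string purely as a convention chosen here.

```python
import csv
import io

# Column order for the CSV; these keys match the annotation dictionaries.
FIELDS = ["page", "type", "highlighted_text", "comment", "author", "color", "created"]


def annotations_to_csv(annotations):
    """Serialize a list of annotation dicts to a CSV string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for a in annotations:
        row = dict(a)
        # Colors arrive as sequences of floats; store them as "r,g,b" text.
        color = row.get("color")
        row["color"] = ",".join(f"{c:.2f}" for c in color) if color else ""
        writer.writerow(row)
    return buffer.getvalue()
```

The csv module quotes commas and line breaks inside comments, so the result opens cleanly in a spreadsheet.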

For highlights specifically, the structure of the output depends on how the user highlighted. A single uninterrupted highlight across one line produces one annotation with a single rectangle. A highlight that wraps across multiple lines produces one annotation with multiple rectangles ("quads"); you need to read the text under each quad and concatenate.

Per-app extraction quirks

Different PDF readers write annotations slightly differently. Knowing what to expect from each makes extraction easier:

Adobe Acrobat / Reader: produces clean, well-structured annotations. The author field contains the user's name, comments are stored in the content field, and highlights without a comment have an empty content field.

PDF Expert (macOS/iOS): similar to Acrobat. Free text annotations are stored as FreeText type and contain rich text formatting that you may want to strip.

GoodReader, LiquidText: produce standard annotations but sometimes use proprietary extensions for advanced features (linked notes in LiquidText). Standard extraction gets the basic content; the advanced links are lost.

Apple Preview: produces a smaller subset of annotation types and sometimes stores them slightly differently. Highlight color is sometimes not in the standard color field.

Foxit, Xodo, Drawboard: generally standard. Watch for non-standard color encodings.

Tablet apps (GoodNotes, Notability, Noteshelf): usually export annotations as flattened ink drawings rather than text-anchored highlights, which makes them hard to extract meaningfully. If you use these apps and want extractable annotations, configure them to use the standard highlight tool, not the highlighter pen.
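For the rich text case mentioned under PDF Expert, a small sketch of stripping markup with the standard library follows. It assumes the comment string you end up with contains XHTML-style tags; if your extraction already yields plain text, the function only tidies whitespace. The names _TextOnly and strip_rich_text are illustrative.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and discards tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Treat line breaks as word separators.
        if tag == "br":
            self.parts.append(" ")

    def handle_endtag(self, tag):
        # Keep paragraphs from running into each other.
        if tag in ("p", "div"):
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_rich_text(comment):
    """Return the comment with markup removed and whitespace collapsed."""
    parser = _TextOnly()
    parser.feed(comment)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

Running every comment through a function like this before writing Markdown keeps stray tags out of your notes.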

Turning annotations into Markdown notes

A common goal: produce a Markdown document with the highlights organized by page, with comments inline. A template:

def annotations_to_markdown(annotations, doc_title):
    """Render annotation dicts as Markdown, grouped by page."""
    md = [f"# Notes from {doc_title}\n"]
    current_page = None
    for a in annotations:
        if a["page"] != current_page:
            md.append(f"\n## Page {a['page']}\n")
            current_page = a["page"]
        if a["type"] in ("Highlight", "Underline"):
            # Collapse line breaks so the whole passage stays in one blockquote
            quote = " ".join(a["highlighted_text"].split())
            md.append(f"> {quote}\n")
            if a["comment"]:
                md.append(f"\n*{a['comment']}*\n")
        elif a["type"] == "Text":  # sticky note
            md.append(f"📝 **Note:** {a['comment']}\n")
        elif a["type"] == "FreeText":
            md.append(f"✏️ {a['comment']}\n")
        else:
            continue  # other types are not rendered
        md.append("\n")
    return "".join(md)

The output Markdown drops cleanly into Obsidian, Notion, Logseq, or anywhere else you take notes. See "Converting research papers to Markdown for Obsidian, Notion, and Logseq" for the broader workflow of building a knowledge base from PDFs and their annotations.

Color-coded highlights

Many readers use highlight colors to encode meaning: yellow for important, green for evidence, red for disagreement. The annotation extraction can preserve color so the downstream system can filter or render by category.

def color_category(rgb):
    # Colors can also be grayscale (1 value) or CMYK (4 values); only RGB is handled
    if not rgb or len(rgb) != 3:
        return "unknown"
    r, g, b = rgb
    # Naive matching; refine for your color scheme
    if r > 0.7 and g > 0.7 and b < 0.3:
        return "yellow"   # important
    if g > 0.5 and r < 0.5:
        return "green"    # evidence
    if r > 0.7 and g < 0.3:
        return "red"      # disagree
    if r < 0.3 and g < 0.3 and b > 0.7:
        return "blue"     # for follow-up
    return "other"

A small thing that pays off: define your color scheme up front and stick to it. A library of 500 PDFs with consistent color meaning becomes a structured note system; one with random highlight colors becomes noise.
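One way to write the scheme down is as a table in code. The sketch below swaps the threshold checks for nearest-color matching (Euclidean distance in RGB), which is a different technique from the function above. The reference RGB values, the 0.6 distance cutoff, and the names COLOR_SCHEME, nearest_color, and group_by_meaning are all assumptions to adjust for your reader's palette.

```python
import math
from collections import defaultdict

# Hypothetical scheme: reference RGB values (0-1 floats) and their meaning.
COLOR_SCHEME = {
    "yellow": ((1.0, 1.0, 0.0), "important"),
    "green": ((0.0, 1.0, 0.0), "evidence"),
    "red": ((1.0, 0.0, 0.0), "disagree"),
    "blue": ((0.0, 0.0, 1.0), "follow-up"),
}


def nearest_color(rgb, max_distance=0.6):
    """Return the scheme color closest to rgb, or "other" if none is close."""
    if not rgb or len(rgb) != 3:
        return "unknown"
    name, dist = min(
        ((n, math.dist(rgb, ref)) for n, (ref, _) in COLOR_SCHEME.items()),
        key=lambda pair: pair[1],
    )
    return name if dist <= max_distance else "other"


def group_by_meaning(annotations):
    """Group annotation dicts by the meaning assigned to their color."""
    groups = defaultdict(list)
    for a in annotations:
        color = nearest_color(a.get("color"))
        meaning = COLOR_SCHEME[color][1] if color in COLOR_SCHEME else color
        groups[meaning].append(a)
    return dict(groups)
```

Keeping the scheme in one dictionary means a change of meaning or shade is a one-line edit rather than a hunt through conditionals.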

Sticky notes and free text

Sticky notes (the small note icon placed at a point on the page) contain longer-form written content. They're often more valuable than highlights because they're your own words.

Extraction is straightforward — the content is in the content field of the annotation. The wrinkle is that sticky notes aren't anchored to source text the way highlights are. The annotation has page coordinates but no associated "what text is this about." In your Markdown output, you can include the surrounding page text for context:

def context_around_point(page, x, y, radius=100):
    # Get text within a bounding box around the annotation's anchor point
    rect = pymupdf.Rect(x - radius, y - radius, x + radius, y + radius)
    return page.get_textbox(rect)

The resulting note in Markdown looks like:

📝 Note on page 12, near: "...the methodology fails to account for selection bias when..."

The authors don't mention the well-known critique from Hofmann (2019).

This format gives future you something useful to search and read.
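A minimal sketch of producing that layout is below. The function name format_sticky_note and the 80-character context limit are illustrative choices, not part of any library.

```python
def format_sticky_note(page_num, context, comment, max_context=80):
    """Render a sticky note as Markdown with a snippet of nearby page text."""
    # Collapse line breaks and runs of spaces from the extracted page text.
    snippet = " ".join(context.split())
    if len(snippet) > max_context:
        snippet = snippet[:max_context].rstrip() + "..."
    header = f"📝 Note on page {page_num}"
    if snippet:
        header += f', near: "{snippet}"'
    return f"{header}\n\n{comment.strip()}\n"
```

The context argument would typically come from context_around_point, called with the center of the annotation's rectangle (annot.rect in PyMuPDF). When no text is found near the anchor, the header falls back to the page number alone.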

Tools that do this without code

If you don't want to write code, a few options:

For occasional extraction, the GUI tools are faster than writing code. For batch processing (a research library, a stack of contracts), code wins.

Extracting ink drawings

Ink annotations (freehand drawings) are paths, not text. There's no text in the annotation; just a sequence of x,y coordinates that the reader joins into strokes.
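If all you want is a record that a drawing exists and where it sits on the page, a small summary is enough. The sketch below assumes the strokes arrive as a list of strokes, each a list of (x, y) pairs, which is how PyMuPDF is expected to expose them through annot.vertices for Ink annotations; the function name ink_summary is illustrative.

```python
def ink_summary(strokes):
    """Summarize ink strokes: stroke count, point count, bounding box."""
    points = [p for stroke in strokes for p in stroke]
    if not points:
        return {"strokes": 0, "points": 0, "bbox": None}
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return {
        "strokes": len(strokes),
        "points": len(points),
        # (x0, y0, x1, y1) in page coordinates
        "bbox": (min(xs), min(ys), max(xs), max(ys)),
    }
```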

For most knowledge-work purposes, ink annotations aren't worth extracting — they're not searchable, and rendering them outside the original PDF is awkward. The two reasonable approaches:

Annotation extraction at scale

A small batch script for processing a folder of annotated PDFs:

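import json
from pathlib import Path

import pymupdf

input_dir = Path("annotated-pdfs/")
output_dir = Path("extracted-notes/")
output_dir.mkdir(parents=True, exist_ok=True)

for pdf in sorted(input_dir.glob("*.pdf")):
    annotations = []
    # The context manager closes each file before the next one is opened
    with pymupdf.open(pdf) as doc:
        for page_num, page in enumerate(doc, start=1):
            for annot in page.annots():
                # extract_annot is not defined here: wrap the per-annotation
                # logic from the first script in a function with this signature
                annotations.append(extract_annot(annot, page, page_num))
    if annotations:
        out = output_dir / f"{pdf.stem}.json"
        out.write_text(json.dumps(annotations, indent=2), encoding="utf-8")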

Run periodically against your reading folder, and you accumulate a structured archive of every annotation you've ever made.
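The batch loop relies on an extract_annot helper that the script itself does not define. A minimal sketch of one is below; it builds the same record as the first script and is written against the attributes that script uses (info, type, vertices, colors, get_textbox). Passing a plain 4-tuple as the rectangle assumes get_textbox accepts any rect-like sequence.

```python
MARKUP_TYPES = ("Highlight", "Underline", "Squiggly", "StrikeOut")


def extract_annot(annot, page, page_num):
    """Build one annotation record from an annotation and its page."""
    info = annot.info
    annot_type = annot.type[1]
    pieces = []
    if annot_type in MARKUP_TYPES:
        vertices = annot.vertices or []
        # Text markup stores four corner points per highlighted line.
        for i in range(0, len(vertices) - 3, 4):
            quad = vertices[i:i + 4]
            xs = [p[0] for p in quad]
            ys = [p[1] for p in quad]
            # Bounding box of the quad, used as the text clip rectangle.
            pieces.append(page.get_textbox((min(xs), min(ys), max(xs), max(ys))))
    colors = annot.colors or {}
    return {
        "page": page_num,
        "type": annot_type,
        # Collapse line breaks and repeated spaces across the quads.
        "highlighted_text": " ".join(" ".join(pieces).split()),
        "comment": info.get("content", ""),
        "author": info.get("title", ""),
        "color": colors.get("stroke") or colors.get("fill"),
        "created": info.get("creationDate", ""),
    }
```

Because the helper only touches attributes, it can be exercised with simple stand-in objects before you point it at a real library of PDFs.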

When annotations aren't preserved

A few things destroy annotations:

The lesson: extract annotations as early in your workflow as possible. Once a PDF has been flattened or converted, the structured annotations are gone for good.
