Extracting Images and Figures from PDFs — Embedded Bitmaps vs Rendered Pages

"Extract the images from this PDF" sounds like one task. It's actually two, and confusing them is why people end up with blurry thumbnails or, worse, hundreds of tiny sliver-images that are useless. This guide draws the distinction clearly and shows how to get exactly the figures you want at the resolution you need.

Two completely different operations

Operation A: Extract embedded images

A PDF can contain embedded raster images — photographs, scanned pages, logos, chart bitmaps — stored inside the file as compressed image data (usually JPEG, PNG-like Flate, or CCITT for scans). Extracting these pulls out the original image bytes exactly as they were embedded.

Operation B: Render pages to images

Rendering rasterizes a whole page (or a region of it) into a new image at a resolution you choose. This captures everything as it visually appears — vector graphics, text, and embedded images flattened together.

The single most common mistake is using Operation A on a chart that was drawn as vector graphics. There's no embedded bitmap to extract, so the image extractor returns nothing for that figure (and a text extractor returns only the chart's axis labels as loose text). Vector figures must be rendered, not extracted.

How to tell which kind of figure you have

Open the PDF and try to select text inside the figure. If you can highlight individual labels and numbers, the figure is at least partly vector, so you'll need to render it. If the whole figure selects as one block or not at all, it is most likely an embedded bitmap you can extract directly; the exception is a vector figure whose text was converted to outlines. Zooming in is a second check: vector edges stay sharp at any magnification, while bitmaps turn blocky. Charts exported from Excel, matplotlib's savefig to PDF, and most LaTeX figures are vector. Screenshots, photos, and scanned figures are bitmaps.
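The manual check can be backed up with a rough programmatic one. The sketch below assumes a PyMuPDF page object and counts what its get_images() and get_drawings() methods report; the function name and the wording of the hints are illustrative, not part of any library. Treat the result as a hint only, since table rules, underlines, and borders also count as vector paths.

```python
def summarize_page_content(page):
    """Count embedded bitmaps and vector drawing paths on one page.

    `page` is expected to be a PyMuPDF page; only its get_images()
    and get_drawings() methods are used.
    """
    n_images = len(page.get_images(full=True))
    n_paths = len(page.get_drawings())
    if n_images and n_paths:
        hint = "mixed: check each figure individually"
    elif n_images:
        hint = "bitmaps only: extraction should work"
    elif n_paths:
        hint = "vector only: render the page or a clip"
    else:
        hint = "no figures detected"
    return {"images": n_images, "vector_paths": n_paths, "hint": hint}

# Usage (assumes PyMuPDF is installed and document.pdf exists):
#   import fitz
#   doc = fitz.open("document.pdf")
#   for page in doc:
#       print(page.number, summarize_page_content(page))
```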

Extracting embedded images with PyMuPDF

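PyMuPDF (imported as fitz) handles embedded image extraction well. The loop below decodes each image into a Pixmap and saves it as PNG, which re-encodes the pixels rather than copying the stored bytes: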

import fitz  # PyMuPDF

doc = fitz.open("document.pdf")
for page_num, page in enumerate(doc):
    # Each entry describes one image used on the page; its first item is the xref.
    for img_index, img in enumerate(page.get_images(full=True)):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)     # decode the image into pixels
        if pix.n - pix.alpha > 3:        # CMYK: convert to RGB
            pix = fitz.Pixmap(fitz.csRGB, pix)
        pix.save(f"page{page_num}_img{img_index}.png")
        pix = None                       # release the pixel buffer
doc.close()
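If you want the image data as stored rather than a re-encoded PNG, PyMuPDF's Document.extract_image() returns a dictionary that includes the image bytes and a file extension. The sketch below is one way to use it; the function name and output directory are illustrative. As far as I know, formats such as JPEG come back unchanged, while formats with no standalone file equivalent are converted by the library, so check the "ext" value rather than assuming.

```python
from pathlib import Path

def save_original_images(doc, out_dir="images"):
    """Write each embedded image in the format PyMuPDF returns; list the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for page_num, page in enumerate(doc):
        for img_index, img in enumerate(page.get_images(full=True)):
            info = doc.extract_image(img[0])   # dict with "image" bytes and "ext"
            if not info:                       # defensive: nothing came back
                continue
            path = out / f"page{page_num}_img{img_index}.{info['ext']}"
            path.write_bytes(info["image"])
            written.append(path)
    return written

# Usage (assumes PyMuPDF is installed and document.pdf exists):
#   import fitz
#   doc = fitz.open("document.pdf")
#   print(save_original_images(doc, "images"))
```

Note that transparency masks are stored as separate objects in a PDF, so an image extracted this way may lose its transparency.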

To filter out the noise (logos, icons), skip anything below a size threshold. Inside the inner loop, right after creating pix:

if pix.width < 100 or pix.height < 100:
    continue   # likely an icon or decoration

A useful refinement is to deduplicate by image xref — a logo on every page is the same xref repeated, so you only need it once.
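Both filters can be applied before decoding anything, because each entry returned by get_images(full=True) already begins with the xref, the soft-mask xref, the width, and the height. A minimal sketch, with a hypothetical helper name and the same 100-pixel threshold as above:

```python
def select_image_xrefs(entries, min_side=100):
    """Return unique xrefs of images at least min_side pixels on both sides.

    `entries` is the combined output of page.get_images(full=True) for all
    pages; each entry starts with (xref, smask, width, height, ...).
    """
    seen = set()
    keep = []
    for entry in entries:
        xref, width, height = entry[0], entry[2], entry[3]
        if xref in seen:
            continue            # same object reused, e.g. a logo on every page
        seen.add(xref)
        if width < min_side or height < min_side:
            continue            # likely an icon or decoration
        keep.append(xref)
    return keep

# Usage (assumes an open PyMuPDF document named doc):
#   entries = [e for page in doc for e in page.get_images(full=True)]
#   for xref in select_image_xrefs(entries):
#       ...  # build a Pixmap or call doc.extract_image(xref)
```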

Rendering pages (or regions) to images

For vector figures or full-page captures, render at a chosen DPI. With PyMuPDF:

import fitz

doc = fitz.open("document.pdf")
page = doc[0]
zoom = 300 / 72            # PDF units are points (72 per inch); this targets 300 DPI
mat = fitz.Matrix(zoom, zoom)
pix = page.get_pixmap(matrix=mat)
pix.save("page0_300dpi.png")

The zoom factor sets the output resolution. PDF coordinates are measured in points, 72 to the inch, so rendering with the default matrix gives one pixel per point (72 DPI), which is a small, soft image. For print-quality or OCR-quality output, target 300 DPI (zoom = 300/72 ≈ 4.17). For a screen thumbnail, 150 DPI is plenty.
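To see what a DPI choice costs in pixels, the arithmetic can be worked out before rendering. The helper names below are illustrative, and the sizes are estimates: a renderer may round a fractional pixel differently.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch

def zoom_for_dpi(dpi):
    """Scale factor to pass to fitz.Matrix for a target DPI."""
    return dpi / POINTS_PER_INCH

def rendered_size(width_pt, height_pt, dpi):
    """Approximate pixel dimensions of a region rendered at dpi."""
    zoom = zoom_for_dpi(dpi)
    return round(width_pt * zoom), round(height_pt * zoom)

# A US Letter page is 612 x 792 points.
print(rendered_size(612, 792, 72))    # (612, 792)
print(rendered_size(612, 792, 150))   # (1275, 1650)
print(rendered_size(612, 792, 300))   # (2550, 3300)
```

At 300 DPI a Letter page is about 8.4 million pixels, roughly 25 MB as uncompressed RGB, which is worth keeping in mind when rendering long documents.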

To extract just one figure rather than the whole page, render a clip rectangle:

clip = fitz.Rect(72, 200, 540, 480)   # x0, y0, x1, y1 in points
pix = page.get_pixmap(matrix=mat, clip=clip)

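Finding the rectangle is the manual part: you can read coordinates from a viewer or detect figure regions with a layout model. PyMuPDF measures from the page's top-left corner with y increasing downward, whereas the PDF format itself and some viewers measure from the bottom-left, so flip y if the clip lands in the wrong place.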
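Short of a layout model, one crude heuristic is to take the union of the bounding boxes of a page's vector paths and render that. This is a swapped-in shortcut, not figure detection: it only makes sense when the page holds a single vector figure and little other line art. The helper name and padding value are assumptions for illustration.

```python
def union_bbox(rects, pad=6.0):
    """Smallest rectangle covering all rects, grown by pad points per side.

    Each rect is (x0, y0, x1, y1) in points. Returns None for empty input.
    """
    rects = list(rects)
    if not rects:
        return None
    x0 = min(r[0] for r in rects) - pad
    y0 = min(r[1] for r in rects) - pad
    x1 = max(r[2] for r in rects) + pad
    y1 = max(r[3] for r in rects) + pad
    return (x0, y0, x1, y1)

# Usage (assumes an open PyMuPDF page and the matrix `mat` defined earlier):
#   rects = [d["rect"] for d in page.get_drawings()]
#   box = union_bbox(rects)
#   if box is not None:
#       clip = fitz.Rect(box) & page.rect    # keep the clip inside the page
#       pix = page.get_pixmap(matrix=mat, clip=clip)
```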

Tools at a glance

Resolution and quality gotchas

How this fits the convert-to-text workflow

If your goal is text, not the images themselves, you usually don't extract images at all — you OCR them. The pipeline behind this site's converter does exactly that: it finds embedded images and pages without selectable text, and runs OCR on them so the recognized text lands in the Markdown output. Extracting images as files is the right move when you need the figures as figures — for a slide deck, a dataset of charts, or to feed individual figures to a vision model. For a chart whose data you want, rendering the figure and asking a vision model to read it often beats trying to reconstruct the numbers (related: PDF tables to CSV and Excel).
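Deciding which pages need OCR can be approximated by checking how much selectable text each page has. The sketch below is not the converter's actual pipeline, just one plausible way to flag candidates; the function name and the 20-character threshold are assumptions.

```python
def pages_needing_ocr(doc, min_chars=20):
    """Return indices of pages whose extractable text is shorter than min_chars."""
    flagged = []
    for index, page in enumerate(doc):
        text = page.get_text()          # plain text from the page's text layer
        if len(text.strip()) < min_chars:
            flagged.append(index)
    return flagged

# Usage (assumes PyMuPDF is installed and document.pdf exists):
#   import fitz
#   doc = fitz.open("document.pdf")
#   print(pages_needing_ocr(doc))
```

Flagged pages can then be rendered at 300 DPI and passed to whichever OCR engine you use.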

Quick reference

Conclusion

The whole game is knowing whether your figure is an embedded bitmap or vector graphics, because that decides between extraction and rendering — and using the wrong one is why people get empty output or blurry junk. Once you've made that call, PyMuPDF and the poppler tools handle both cleanly, and the only remaining knob that matters is DPI.
