Extracting Images and Figures from PDFs — Embedded Bitmaps vs Rendered Pages
"Extract the images from this PDF" sounds like one task. It's actually two, and confusing them is why people end up with blurry thumbnails or, worse, hundreds of tiny sliver-images that are useless. This guide draws the distinction clearly and shows how to get exactly the figures you want at the resolution you need.
Two completely different operations
Operation A: Extract embedded images
A PDF can contain embedded raster images — photographs, scanned pages, logos, chart bitmaps — stored inside the file as compressed image data (usually JPEG, PNG-like Flate, or CCITT for scans). Extracting these pulls out the original image bytes exactly as they were embedded.
- You get: the source images at their original resolution.
- Use it when: you want the actual photos or figures a document contains.
- Gotcha: you get every embedded image, including decorative icons, repeated logos, and background textures — often dozens of tiny fragments per page.
Operation B: Render pages to images
Rendering rasterizes a whole page (or a region of it) into a new image at a resolution you choose. This captures everything as it visually appears — vector graphics, text, and embedded images flattened together.
- You get: a faithful picture of the page as printed.
- Use it when: the "figure" is drawn with vector graphics (most charts, diagrams, and equations are), or you need a screenshot-style capture, or you're feeding pages to a vision model.
- Gotcha: text becomes pixels — no longer selectable or searchable.
The single most common mistake is using Operation A on a chart that was drawn as vector graphics. There's no embedded bitmap to extract, so image extraction returns nothing; the chart's axis labels exist only as ordinary page text, not as part of any image. Vector figures must be rendered, not extracted.
How to tell which kind of figure you have
Open the PDF and try to select text inside the figure. If you can highlight individual labels and numbers, the figure is (at least partly) vector, so you'll need to render it. If the whole figure selects as one block or not at all, it's most likely an embedded bitmap you can extract directly; the exception is a vector figure whose text was converted to outlines, which also won't select. Charts exported from Excel, matplotlib's savefig to PDF, and most LaTeX figures are vector. Screenshots, photos, and scanned figures are bitmaps.
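The selection test can also be approximated in code. The sketch below assumes PyMuPDF is installed and counts a page's embedded bitmaps with page.get_images() and its vector paths with page.get_drawings(). The helper names classify_page and inspect_page, and the labels they return, are hypothetical and chosen only for illustration. The counts are a hint, not a verdict, since a page can mix both kinds.

```python
def classify_page(n_images: int, n_drawings: int) -> str:
    """Rough hint about a page's figure content (hypothetical helper)."""
    if n_images and n_drawings:
        return "mixed"
    if n_images:
        return "bitmap"
    if n_drawings:
        return "vector"
    return "no-figures"


def inspect_page(pdf_path: str, page_index: int = 0) -> str:
    """Count embedded images and vector paths on one page with PyMuPDF."""
    import fitz  # PyMuPDF; imported here so classify_page works without it

    with fitz.open(pdf_path) as doc:
        page = doc[page_index]
        n_images = len(page.get_images(full=True))  # embedded bitmaps
        n_drawings = len(page.get_drawings())       # vector paths
    return classify_page(n_images, n_drawings)
```

A "vector" or "mixed" result suggests rendering the page or region; "bitmap" suggests extraction is worth trying first.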
Extracting embedded images with PyMuPDF
PyMuPDF (fitz) is the most reliable tool for embedded image extraction:
import fitz

doc = fitz.open("document.pdf")
for page_num, page in enumerate(doc):
    for img_index, img in enumerate(page.get_images(full=True)):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha > 3:  # CMYK: convert to RGB
            pix = fitz.Pixmap(fitz.csRGB, pix)
        pix.save(f"page{page_num}_img{img_index}.png")
        pix = None
To filter out the noise (logos, icons), skip anything below a size threshold. Inside the inner loop, right after the pixmap is created:
if pix.width < 100 or pix.height < 100:
    continue  # likely an icon or decoration
A useful refinement is to deduplicate by image xref — a logo on every page is the same xref repeated, so you only need it once.
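Both refinements can be combined in one pass. The following is a minimal sketch, assuming the tuple layout that PyMuPDF's page.get_images(full=True) returns (xref at index 0, width at index 2, height at index 3). The function names, the out_prefix parameter, and the 100-pixel default are illustrative choices, not part of any library. It writes files with doc.extract_image(), which returns the image bytes together with a file extension, rather than re-encoding through a Pixmap.

```python
def unique_large_xrefs(image_entries, min_side=100):
    """Return xrefs of images at least min_side px on both sides, each once.

    image_entries: tuples shaped like page.get_images(full=True) items,
    where index 0 is the xref, 2 the width and 3 the height.
    """
    seen = set()
    keep = []
    for entry in image_entries:
        xref, width, height = entry[0], entry[2], entry[3]
        if xref in seen:
            continue  # same object reused, e.g. a logo on every page
        seen.add(xref)
        if width >= min_side and height >= min_side:
            keep.append(xref)
    return keep


def extract_unique_images(pdf_path, out_prefix="img", min_side=100):
    """Write each qualifying embedded image once (hypothetical helper)."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        entries = [e for page in doc for e in page.get_images(full=True)]
        for xref in unique_large_xrefs(entries, min_side):
            info = doc.extract_image(xref)  # dict with "image" bytes and "ext"
            with open(f"{out_prefix}_{xref}.{info['ext']}", "wb") as fh:
                fh.write(info["image"])
```

Note that this path skips the CMYK-to-RGB conversion shown earlier, so it suits cases where you want the stored data untouched; use the Pixmap route when you need normalized RGB output.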
Rendering pages (or regions) to images
For vector figures or full-page captures, render at a chosen DPI. With PyMuPDF:
import fitz
doc = fitz.open("document.pdf")
page = doc[0]
zoom = 300 / 72 # PDF units are points (1/72 inch), so this targets 300 DPI
mat = fitz.Matrix(zoom, zoom)
pix = page.get_pixmap(matrix=mat)
pix.save("page0_300dpi.png")
The zoom factor is everything. PDF coordinates are measured in points of 1/72 inch, so rendering at the default zoom of 1 produces one pixel per point (72 DPI), which gives you a small, soft image. For print-quality or OCR-quality output, target 300 DPI (zoom = 300/72 ≈ 4.17). For a screen thumbnail, 150 DPI is plenty.
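The arithmetic behind the zoom factor is easy to check by hand. This small sketch uses hypothetical helper names (zoom_for_dpi, rendered_size) and assumes a US Letter page of 612 by 792 points; the pixmap PyMuPDF actually produces may differ by a pixel because of its own rounding.

```python
POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def zoom_for_dpi(dpi):
    """Zoom factor to pass to fitz.Matrix for a target DPI."""
    return dpi / POINTS_PER_INCH


def rendered_size(width_pt, height_pt, dpi):
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    return (
        round(width_pt * dpi / POINTS_PER_INCH),
        round(height_pt * dpi / POINTS_PER_INCH),
    )


# US Letter is 612 x 792 points (8.5 x 11 inches).
for dpi in (72, 150, 300):
    width_px, height_px = rendered_size(612, 792, dpi)
    print(f"{dpi} DPI: zoom {zoom_for_dpi(dpi):.2f}, {width_px} x {height_px} px")
```

At 300 DPI the Letter page comes out at 2550 by 3300 pixels, roughly 17 times the pixel count of the 72 DPI default, which is worth remembering when batch-rendering long documents.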
To extract just one figure rather than the whole page, render a clip rectangle:
clip = fitz.Rect(72, 200, 540, 480) # x0, y0, x1, y1 in points
pix = page.get_pixmap(matrix=mat, clip=clip)
Finding the rectangle is the manual part — you can read coordinates from a viewer or detect figure regions with a layout model.
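If you measure the figure's box on a rendered preview rather than in a viewer that reports points, the pixel coordinates have to be scaled back to points before they can serve as a clip. The sketch below assumes the preview was rendered at a known DPI with the origin at the top-left (as PyMuPDF does); a viewer that reports coordinates from the bottom-left would need its y values flipped first. Both function names and the 6-point margin are illustrative assumptions.

```python
def pixels_to_points(box_px, dpi):
    """Convert (x0, y0, x1, y1) in preview pixels to PDF points."""
    scale = 72 / dpi
    return tuple(v * scale for v in box_px)


def pad_and_clamp(box_pt, page_box_pt, margin_pt=6):
    """Grow a box by a margin, but never past the page boundary."""
    x0, y0, x1, y1 = box_pt
    px0, py0, px1, py1 = page_box_pt
    return (
        max(px0, x0 - margin_pt),
        max(py0, y0 - margin_pt),
        min(px1, x1 + margin_pt),
        min(py1, y1 + margin_pt),
    )


# Usage with PyMuPDF (assumes `page` and `mat` from the rendering example):
#   box = pixels_to_points((150, 300, 1125, 1000), dpi=150)
#   clip = fitz.Rect(*pad_and_clamp(box, tuple(page.rect)))
#   pix = page.get_pixmap(matrix=mat, clip=clip)
```

The small margin keeps axis labels and captions that sit just outside a tightly drawn box from being cut off.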
Tools at a glance
- PyMuPDF (fitz): best all-rounder for both extraction and rendering; fast, scriptable, handles CMYK and color spaces correctly.
- pdfimages (poppler-utils): command-line embedded-image extraction, e.g. pdfimages -all in.pdf out_prefix. Great for quick batch dumps; the -list flag inventories every image with its resolution.
- pdftoppm / pdftocairo (poppler-utils): command-line page rendering to PNG/JPEG/TIFF at a chosen DPI.
- pdf2image (Python wrapper around poppler): convenient page rendering in Python, popular for OCR pipelines.
- ImageMagick / Ghostscript: render pages, but slower and easier to misconfigure on DPI; fine for one-offs.
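The poppler tools can be driven from the same Python script that uses PyMuPDF. This is a minimal sketch, assuming poppler-utils is installed and on the PATH; the wrapper function names are hypothetical, while the flags used (-all and -list for pdfimages, -r and -png for pdftoppm) are the tools' own.

```python
import shutil
import subprocess


def pdfimages_extract_cmd(pdf_path, out_prefix):
    """Command that dumps every embedded image in its stored format."""
    return ["pdfimages", "-all", pdf_path, out_prefix]


def pdfimages_list_cmd(pdf_path):
    """Command that inventories embedded images without writing files."""
    return ["pdfimages", "-list", pdf_path]


def pdftoppm_cmd(pdf_path, out_prefix, dpi=300):
    """Command that renders every page to PNG at the given DPI."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, out_prefix]


def run(cmd):
    """Run a poppler command and return its standard output."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found; is poppler-utils installed?")
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

For example, print(run(pdfimages_list_cmd("in.pdf"))) shows the inventory before you decide whether extraction is worth running at all.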
Resolution and quality gotchas
- Don't upscale. Rendering a page at 600 DPI doesn't add detail that wasn't there if the embedded content is a 96 DPI bitmap — you just get a bigger blurry file. Extract embedded images at their native resolution instead.
- CMYK and color profiles. Print PDFs often use CMYK; extract without conversion and colors look wrong. Convert to RGB on the way out.
- Inline images. A few PDFs use "inline images" embedded directly in the content stream rather than as referenced objects. page.get_images() can miss these; rendering the page captures them regardless.
- JBIG2 and CCITT scans. Scanned pages are often stored as 1-bit black-and-white in these formats. They extract fine but may need conversion to a common format before other tools accept them.
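To judge whether a higher render DPI can help, it is useful to know the effective resolution of an embedded bitmap: its pixel size divided by the size it occupies on the page in inches. The sketch below assumes PyMuPDF's page.get_image_info(), which reports each displayed image's pixel dimensions and bounding box; the function names are hypothetical, and the figure is only approximate for rotated images because the bounding box is axis-aligned.

```python
def effective_dpi(pixels, displayed_points):
    """Pixels per inch for an image side shown at a given size in points."""
    return pixels / (displayed_points / 72)


def page_image_dpis(pdf_path, page_index=0):
    """Return (horizontal_dpi, vertical_dpi) for each image on a page."""
    import fitz  # PyMuPDF

    results = []
    with fitz.open(pdf_path) as doc:
        page = doc[page_index]
        for info in page.get_image_info():
            x0, y0, x1, y1 = info["bbox"]
            if x1 > x0 and y1 > y0:  # skip degenerate boxes
                results.append((
                    effective_dpi(info["width"], x1 - x0),
                    effective_dpi(info["height"], y1 - y0),
                ))
    return results
```

If a scanned page reports around 100 DPI, rendering it at 300 or 600 DPI only produces a larger file; extracting the bitmap as stored is the better choice.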
How this fits the convert-to-text workflow
If your goal is text, not the images themselves, you usually don't extract images at all — you OCR them. The pipeline behind this site's converter does exactly that: it finds embedded images and pages without selectable text, and runs OCR on them so the recognized text lands in the Markdown output. Extracting images as files is the right move when you need the figures as figures — for a slide deck, a dataset of charts, or to feed individual figures to a vision model. For a chart whose data you want, rendering the figure and asking a vision model to read it often beats trying to reconstruct the numbers (related: PDF tables to CSV and Excel).
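One simple way to spot pages that need OCR is to check how much selectable text each page yields. This sketch is an illustration of the idea only, not the converter's actual implementation: the function names and the 20-character threshold are assumptions, and it relies on PyMuPDF's page.get_text() returning an empty or near-empty string for scanned pages.

```python
def needs_ocr(page_text, min_chars=20):
    """True when a page has too little selectable text to be useful."""
    return len(page_text.strip()) < min_chars


def pages_needing_ocr(pdf_path, min_chars=20):
    """Return zero-based indexes of pages that look like scans."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return [
            index
            for index, page in enumerate(doc)
            if needs_ocr(page.get_text(), min_chars)
        ]
```

Pages flagged this way can then be rendered at 300 DPI and passed to an OCR engine, while the rest go through ordinary text extraction.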
Quick reference
- Photos or scanned figures (bitmaps)? Extract embedded images with PyMuPDF or pdfimages.
- Charts, diagrams, equations (vector)? Render the page or region; extraction returns nothing useful.
- Not sure which? Try to select text inside the figure — selectable means vector means render.
- For OCR or vision models? Render at 300 DPI.
- Drowning in tiny icon images? Filter by minimum size and deduplicate by xref.
Conclusion
The whole game is knowing whether your figure is an embedded bitmap or vector graphics, because that decides between extraction and rendering — and using the wrong one is why people get empty output or blurry junk. Once you've made that call, PyMuPDF and the poppler tools handle both cleanly, and the only remaining knob that matters is DPI.