Why Your PDF Text Won't Copy — Encoding, Fonts, and Image-Only Pages Explained
You select text in a PDF and copy it. You paste — and get nothing, or worse, garbled symbols like "˙˙ÿÿ@@" or every letter mapped to a Cyrillic lookalike. The frustrating part: the PDF clearly shows the text on screen. So where did it go?
This article walks through the four most common reasons PDFs misbehave on copy-paste, with a way to diagnose and fix each one.
A quick mental model of how text lives in a PDF
A PDF page is a script of drawing commands: "place glyph #47 of font F1 at position (x, y); place glyph #92 of font F1 at position (x+5, y); ..." The PDF doesn't store words — it stores glyph IDs and positions.
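For copy-paste to work reliably, the viewer needs a way to turn glyph IDs back into characters. The usual mechanism is a ToUnicode map stored alongside each font (fonts that use a standard encoding or standard glyph names can sometimes get by without one). The ToUnicode map says "glyph #47 in font F1 means the letter A; glyph #92 means the letter B." When your PDF viewer copies text, it walks the glyph sequence in the selected region and uses the ToUnicode map to translate each glyph back to a Unicode character.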
If the ToUnicode map is missing, broken, or intentionally scrambled, the page still renders perfectly (the viewer is just drawing shapes) but copy produces nothing useful. This is why a PDF can look fine on screen but produce garbage when copied: what you see is drawn from glyph shapes, while what you copy comes from the glyph-to-character mapping, and the two are stored independently.
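The model above can be sketched in a few lines of Python. This is a toy illustration only: the glyph IDs, positions, and the font name F1 are invented, and real viewers fall back in various ways when a mapping is missing rather than emitting a fixed placeholder.

```python
# Toy model: a page stores glyph IDs and positions, not characters.
# All glyph IDs and coordinates here are made up for illustration.
page_glyphs = [(47, 10.0, 700.0), (92, 15.0, 700.0), (47, 20.0, 700.0)]

# A ToUnicode map for the hypothetical font F1: glyph ID -> Unicode string.
to_unicode_f1 = {47: "A", 92: "B"}


def copy_text(glyphs, to_unicode):
    """Translate a glyph sequence back to text, roughly as a viewer does on copy."""
    out = []
    for glyph_id, _x, _y in glyphs:
        # Simplification: with no mapping, emit the U+FFFD replacement character.
        out.append(to_unicode.get(glyph_id, "\ufffd"))
    return "".join(out)


print(copy_text(page_glyphs, to_unicode_f1))  # ABA
print(repr(copy_text(page_glyphs, {})))       # placeholders only: the map is missing
```

Drawing the page never consults `to_unicode_f1`, which is why rendering and copying can disagree.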
The four common failure cases all trace back to this model.
Case 1: image-only PDF
Symptom: You can't select any text at all. Clicking and dragging across a sentence either highlights nothing or highlights the entire page as a single block.
Cause: The page is a single embedded image — a scan, a photo, or a screenshot saved as PDF. There are no glyph drawing commands; there's just one image command per page.
How to confirm: Zoom in to 400%. If text edges look pixelated, show JPEG compression halos, or reveal the texture of paper, the PDF is image-only.
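The same check can be approximated in code by asking whether a page yields any extractable text. The sketch below assumes the third-party pypdf package is installed; the function names and the 20-character threshold are arbitrary choices for illustration, not part of any standard workflow.

```python
def extract_page_texts(pdf_path):
    """Return the extracted text of each page (requires the third-party pypdf package)."""
    from pypdf import PdfReader  # imported lazily so the helper below runs without pypdf

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def likely_image_only(page_texts, min_chars=20):
    """Flag pages whose extracted text is empty or nearly empty.

    min_chars is an arbitrary threshold chosen for this sketch.
    """
    return [len(text.strip()) < min_chars for text in page_texts]
```

A call such as `likely_image_only(extract_page_texts("scan.pdf"))` returns one boolean per page. A scan that has already been through OCR carries a hidden text layer, so it will report as having text even though the visible page is an image.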
Fix: Run OCR. The guide on converting scanned PDFs to text walks through the workflow end to end.
Case 2: missing ToUnicode map
Symptom: Text selects fine, but copying produces gibberish — sometimes random Unicode characters, sometimes characters from an unrelated alphabet entirely. Letters may look right visually but paste as wrong characters.
Cause: The PDF was generated by a tool that subsetted or renamed font glyphs but didn't include the ToUnicode map. The viewer can render the page because it has the font shapes, but it can't map glyphs back to characters.
This happens most often with:
- Certain older LaTeX configurations, especially with custom or non-standard fonts
- Older versions of Microsoft PowerPoint export
- Some Asian-language typesetting tools
- Documents that embed proprietary or designer fonts without proper Unicode metadata
How to confirm: In Adobe Acrobat, go to Tools → Print Production → Preflight, and run the check "Text uses encodings without ToUnicode mapping." The report tells you which fonts on which pages are missing the map.
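Without Acrobat, a rough heuristic on the pasted text can hint at the same problem. The sketch below counts characters that rarely appear in real prose (private-use, unassigned, and control code points, plus the U+FFFD replacement character). The function name and the choice of categories are assumptions for illustration, and the check will not catch glyphs that map to valid letters of the wrong alphabet.

```python
import unicodedata


def suspicious_ratio(text):
    """Fraction of non-whitespace characters that are unlikely in real prose."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = 0
    for c in chars:
        category = unicodedata.category(c)
        # Co = private use, Cn = unassigned, Cc = control; U+FFFD = replacement character
        if c == "\ufffd" or category in ("Co", "Cn", "Cc"):
            bad += 1
    return bad / len(chars)
```

A ratio near zero on a page of pasted text says little either way; a ratio well above zero suggests the glyph-to-character mapping is missing or broken.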
Fix paths:
- If you have access to the source document, re-save with embedded text using a different export option or tool.
- Treat the document as an image-only PDF on purpose: render each page to an image, then run OCR or a vision model on the rendered pages. This bypasses the encoding problem entirely.
- For the technically inclined: mutool (part of MuPDF) can sometimes repair ToUnicode maps via heuristic glyph-shape matching, but results are variable.
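The render-then-OCR path can be sketched with two common command-line tools, assuming pdftoppm (from Poppler) and tesseract are installed and on the PATH. The helper names, the 300 dpi default, and the "page" file prefix are illustrative choices, not the only way to do this.

```python
import subprocess
from pathlib import Path


def render_command(pdf_path, out_prefix, dpi=300):
    """Build the pdftoppm command that renders every page to a PNG file."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def ocr_command(image_path):
    """Build the tesseract command that prints recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout"]


def ocr_pdf(pdf_path, work_dir):
    """Render each page to an image, OCR it, and return the text per page."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(render_command(pdf_path, work_dir / "page"), check=True)
    texts = []
    # pdftoppm names its output page-1.png, page-2.png, ... (zero-padded as needed)
    for image in sorted(work_dir.glob("page-*.png")):
        result = subprocess.run(
            ocr_command(image), check=True, capture_output=True, text=True
        )
        texts.append(result.stdout)
    return texts
```

Because the text is read from pixels, the PDF's font encoding never enters the picture. OCR accuracy then depends on the render resolution and the typeface.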
Case 3: custom encoding or cipher fonts
Symptom: Text selects and copies in the right shape, but every letter is shifted or substituted. The pasted text might look like Caesar-cipher output (every letter shifted by one position), or it might map cleanly into another known alphabet.
Cause: The PDF uses a custom font where glyph #1 maps visually to "A" but is encoded as "X" — sometimes accidentally (subsetting gone wrong), sometimes intentionally as a light DRM technique to discourage text extraction.
How to spot it: Copy a recognizable English sentence ("The quick brown fox") and inspect the pasted result. If the pattern is consistent — every "T" becomes the same wrong character, every "H" becomes another consistent wrong character — you're looking at a custom encoding.
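That consistency test is easy to automate if you know what a short passage should say. The sketch below is a minimal version with hypothetical helper names; it assumes the pasted text has the same length as the expected text, which will not hold if ligatures or dropped characters are also involved.

```python
def infer_substitution(expected, pasted):
    """Return a pasted-char -> real-char mapping if the substitution is consistent."""
    if len(expected) != len(pasted):
        return None  # lengths differ: not a simple one-to-one substitution
    mapping = {}
    for want, got in zip(expected, pasted):
        # setdefault records the first meaning seen for each pasted character
        if mapping.setdefault(got, want) != want:
            return None  # one pasted character stands for two different letters
    return mapping


def decode(pasted, mapping):
    """Apply the inferred mapping; characters not seen in the sample pass through."""
    return "".join(mapping.get(ch, ch) for ch in pasted)


sample = infer_substitution("The quick brown fox", "Uif rvjdl cspxo gpy")
print(decode("Uif gpy", sample))  # The fox
```

A mapping built from one sentence only covers the letters that sentence contains, so for a whole document OCR is usually the more practical route.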
Fix: OCR the rendered pages. OCR ignores the PDF's character encoding entirely and reads from pixels, so it sidesteps the cipher. Modern vision models (GPT-4o, Claude, Gemini) handle this case especially cleanly because they're robust to minor rendering quirks.
Case 4: protected or restricted PDFs
Symptom: Selection works but copy doesn't, sometimes accompanied by a tooltip or warning message stating that copy is disabled. The Edit → Copy menu item may be greyed out.
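Cause: The PDF has copy and/or print permissions disabled in its security settings (the permissions flag lives in the PDF's encryption dictionary rather than in ordinary document metadata). The flag is a request to the viewer rather than something enforced cryptographically, but most viewers honor it.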
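The permissions flag is an integer (the /P entry of the encryption dictionary) in which individual bits grant individual rights. The sketch below decodes the four most familiar bits, using the bit positions from the PDF specification's user access permissions table; the function and dictionary names are illustrative.

```python
# Bit positions (1-based) from the PDF specification's user access permissions table.
PERMISSION_BITS = {
    "print": 3,
    "modify": 4,
    "copy": 5,
    "annotate": 6,
}


def decode_permissions(p_value):
    """Decode the /P integer into a dict of permission name -> allowed."""
    bits = p_value & 0xFFFFFFFF  # /P is a signed 32-bit value; view it as unsigned
    return {
        name: bool(bits & (1 << (position - 1)))
        for name, position in PERMISSION_BITS.items()
    }


print(decode_permissions(-20)["copy"])   # False: bit 5 is clear
print(decode_permissions(-20)["print"])  # True: bit 3 is set
```

Nothing in these bits prevents a program from reading the text, which is why the restriction depends on the viewer choosing to honor it.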
Legal caveat: Bypassing copy protection on a document you don't own may violate copyright law or the terms under which you received the document. The cases below assume you're working with your own PDFs or have permission from the rights holder.
Fix for your own documents:
- qpdf --decrypt input.pdf output.pdf strips the permissions flag if you have the user password (or if no password was set).
- Re-print to PDF using your operating system's print-to-PDF feature. This produces a clean copy without the permissions flag, though the new PDF may render slightly differently.
- Open in a viewer that ignores permissions for documents you own (some open-source viewers do this).
For documents you don't own, contact the publisher and ask for a copy-enabled version.
The ligature problem
A separate but related issue: many PDFs copy mostly correctly but garble fi, fl, ffi, and ff combinations.
Symptom: Copy works, the text is readable, but every "fi" or "fl" comes out as a single odd Unicode character (ﬁ, ﬂ) or sometimes as no character at all.
Cause: Typographic ligatures are stored in the font as single glyphs (ﬁ is one glyph, not two). If the ToUnicode map doesn't break the ligature back into its constituent letters, the ligature glyph ends up in the copied text.
Fixes:
- Quick: a regex find-replace on the copied text: ﬁ → fi, ﬂ → fl, ﬃ → ffi, ﬄ → ffl.
- Better: run the document through a Markdown converter like the one on this site. Most modern converters handle ligature normalization automatically.
- Important if you publish the output: some search and indexing tools treat ligature characters as distinct from the plain letter pairs, so unfixed ligatures can hurt discoverability.
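The quick fix can be written as a small Python function. This is a minimal sketch covering only the five Latin f-ligatures; the names are illustrative.

```python
import re

# Unicode "Alphabetic Presentation Forms" ligatures and their plain-letter expansions.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# One character class matching any of the ligature code points above.
_LIGATURE_RE = re.compile("[" + "".join(LIGATURES) + "]")


def expand_ligatures(text):
    """Replace each ligature character with its constituent letters."""
    return _LIGATURE_RE.sub(lambda match: LIGATURES[match.group(0)], text)


print(expand_ligatures("\ufb01nd the \ufb02oor o\ufb03ce"))  # find the floor office
```

`unicodedata.normalize("NFKC", text)` also expands these ligatures, but it rewrites other compatibility characters too (superscripts, full-width forms), so an explicit table is the more conservative choice.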
A simple decision tree
When a PDF refuses to give up its text, work through these questions in order:
- Can you select any text? If no, the PDF is image-only. Run OCR.
- Does selection work but copy produce garbage? Either the ToUnicode map is missing or the encoding is custom. The fastest universal fix is to OCR the rendered pages — it bypasses every encoding problem.
- Does copy mostly work but ligatures break? Regex find-replace, or run through a Markdown converter.
- Is copy blocked by a permissions warning? Check legal rights first, then run qpdf --decrypt on your own document.
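The questions above can also be written as a single function, which is handy if you are triaging many files. This is only a sketch: the parameter names are hypothetical, and the answers still have to come from inspecting each PDF by hand or with your own checks.

```python
def next_step(can_select, copy_garbled=False, ligatures_broken=False, copy_blocked=False):
    """Walk the questions in the order listed and return the suggested fix."""
    if not can_select:
        return "Image-only PDF: run OCR."
    if copy_garbled:
        return "Missing ToUnicode map or custom encoding: OCR the rendered pages."
    if ligatures_broken:
        return "Ligature glyphs: regex find-replace or a Markdown converter."
    if copy_blocked:
        return "Permissions flag: check your rights, then qpdf --decrypt your own document."
    return "No known issue: copy should work."


print(next_step(can_select=False))
print(next_step(can_select=True, ligatures_broken=True))
```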
The thread connecting most of these solutions: when in doubt, treat the PDF as an image and OCR it. That single technique handles image-only PDFs, missing ToUnicode maps, custom encodings, and (with the right tool) even decent-quality cipher fonts.
If you want to try this without installing anything, the converter on this site includes an OCR option that handles all of these cases.