How to Convert Scanned PDFs to Searchable Text — A Complete Guide

You select text in a scanned PDF and copy it. You paste — and get nothing. Or, sometimes, a confused string of square boxes and question marks. The pages on screen are full of words, but somehow the words aren't there when you ask for them.

That's because a scanned PDF is essentially a stack of pictures. Each page is an image of text rather than text itself. Your PDF reader can render the page perfectly — it's just drawing the picture — but the underlying file contains no actual character data for you to copy.

To get text out, you have to do what your eyes do automatically: look at the picture and identify each letter. That's the job of OCR — optical character recognition. This guide walks through identifying scanned PDFs, picking the right OCR approach for the job, running it cleanly, and dealing with the messy edges that always show up in real documents.

How to tell if your PDF is scanned

Before reaching for OCR tools, confirm you actually need one. Three quick checks usually settle the question.

Try to select text. Open the PDF and drag your cursor across a sentence. If selection works smoothly and copy-paste produces readable text, the PDF has an embedded text layer and OCR isn't needed. If selection skips erratically, highlights entire pages as a single block, or produces gibberish on paste, the text layer is missing or broken.

Check the file size. A digital text PDF runs about 50 KB per page. A scanned PDF usually runs at least five to ten times more because each page is stored as an image, typically a JPEG. A 30-page document weighing 30 MB (roughly 1 MB per page) is almost certainly scanned.
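The size check comes down to a little arithmetic. The sketch below divides file size by page count and compares the result to a cutoff. The function names and the 250 KB cutoff (five times the ~50 KB figure above) are assumptions made for illustration, not an established threshold.

```shell
# Average size per page in KB, given file size in bytes and page count.
kb_per_page() {
  echo $(( $1 / 1024 / $2 ))
}

# Rough guess from size alone. The 250 KB cutoff is an assumed value.
looks_scanned() {
  if [ "$(kb_per_page "$1" "$2")" -ge 250 ]; then
    echo "likely scanned"
  else
    echo "likely digital text"
  fi
}

# The 30-page, 30 MB example: 31457280 bytes across 30 pages.
looks_scanned 31457280 30   # prints "likely scanned"
```

On a real file, the byte count can come from wc -c < input.pdf, and the page count from the Pages: line printed by Poppler's pdfinfo, if it is installed.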

Zoom in to 400%. Digital text stays sharp at any zoom level; scanned text gets pixelated, shows JPEG compression artifacts, or reveals the texture of the original paper.

One trap to watch for: some scanned PDFs already carry a hidden text layer from a previous OCR pass. You can copy from them, but the quality varies and is often poor. If the copied text is mostly readable but riddled with stray errors, you're working with low-quality OCR output, and re-running OCR with a better tool may give cleaner results. For a deeper look at the different ways PDF text gets garbled on copy, see why your PDF text won't copy.
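The selection check can also be scripted. The sketch below assumes Poppler's pdftotext is available and counts how many non-whitespace characters it extracts. The function name and the 100-character cutoff are illustrative assumptions.

```shell
# Reads extracted text on stdin and reports whether it looks like a text layer.
# The 100-character minimum is an assumed cutoff, not a standard.
classify_text_layer() {
  # Count non-whitespace characters only; strip padding some wc versions add.
  chars=$(tr -d '[:space:]' | wc -c | tr -d ' ')
  if [ "$chars" -ge 100 ]; then
    echo "text layer present"
  else
    echo "no usable text layer"
  fi
}

# Usage (requires pdftotext): check the first three pages only.
#   pdftotext -f 1 -l 3 input.pdf - | classify_text_layer
```

This only detects that some text layer exists. It cannot tell a good layer from the low-quality OCR output described above, so a quick read of the extracted text is still worth doing.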

What OCR actually does

OCR — optical character recognition — converts pictures of text into actual text data. The basic problem is harder than it sounds. Letters in real documents vary by font, size, weight, color, lighting, scanner quality, and the age and condition of the original.

Modern OCR engines handle this with neural networks trained on large datasets of text images. Tesseract, the most widely used open-source engine, moved to an LSTM-based recognition model in 2018. The newer commercial and AI-based options — cloud vision APIs and vision language models — handle even messier inputs.

That capability has real limits, though. Even the best OCR struggles with handwriting, math notation, complex tables, multi-column layouts mixed with figures, and scans below 200 DPI. Knowing where each tool fails is the difference between a clean extraction and a weekend of manual cleanup.

OCR tool options compared

Five families of tools cover almost every realistic use case:

| Tool | Cost | Best at | Main limitation | Setup |
|---|---|---|---|---|
| Tesseract | Free | Clean printed text in 100+ languages | Weak on tables, handwriting, complex layouts | Local install |
| Adobe Acrobat Pro | ~$20/mo | Mature GUI workflow, batch jobs | Subscription only | Desktop app |
| Online converters | Usually free | One-off documents, no install | Privacy depends on provider | Web upload |
| Cloud APIs (Textract, Document AI) | ~$1.50 / 1000 pages | Tables, forms, structured documents | Requires cloud account | API integration |
| AI vision models (GPT-4o, Claude, Gemini) | ~$0.01–0.05 / page | Handwriting, messy layouts, diagrams | Costs per call, network needed | API key |

Tesseract is the free, local default. Its development began at Hewlett-Packard in 1985, and it has 100+ language packs. For clean, single-column printed text at 300+ DPI, it produces solid output. For anything with tables, handwriting, or complex layouts, look elsewhere. A deeper dive on Tesseract's strengths and tuning is in the Tesseract OCR guide.

Adobe Acrobat Pro is mature and reliable, with strong batch features. It's the right pick if you already pay for Adobe and don't want to assemble a pipeline.

Online converters trade convenience for varying privacy guarantees. Fine for non-sensitive documents; not appropriate for legal, medical, or financial files unless the operator publishes a strict no-retention policy. For more on what actually happens to your file during an online conversion, see PDF privacy and security.

Cloud APIs from AWS, Azure, and Google are the right answer when tables or forms matter. They're specifically trained on those structures. Expect about $1.50 per 1000 pages on standard tiers.

AI vision models are the newest option and now the strongest choice for handwriting and messy real-world documents. The downside is per-page cost (small but non-zero) and reliance on a network round-trip. For handwriting specifically, see the handwritten OCR guide.
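To put the per-page prices in perspective, here is a small cost sketch using the approximate rates from the table above. The function name and the 2000-page job are assumptions, and actual provider pricing will differ.

```shell
# Cost in dollars for a page count at a given per-page rate.
ocr_cost() {
  awk -v pages="$1" -v rate="$2" 'BEGIN { printf "%.2f\n", pages * rate }'
}

pages=2000
# Cloud API at ~$1.50 per 1000 pages is $0.0015 per page.
echo "Cloud API:          \$$(ocr_cost "$pages" 0.0015)"
# Vision models at ~$0.01 to $0.05 per page.
echo "Vision model, low:  \$$(ocr_cost "$pages" 0.01)"
echo "Vision model, high: \$$(ocr_cost "$pages" 0.05)"
```

At these rates, a 2000-page job costs about $3 through a cloud API and between $20 and $100 through a vision model, which is why the vision model is worth reserving for documents that need it.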

A simple decision rule: clean printed text goes to Tesseract; forms and tables go to a cloud API; anything with handwriting or unusual layouts goes to a vision model.
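For batch scripts, the rule above can be written as a small lookup. The category labels and the function name are made up for this sketch. Only the mapping itself comes from the rule.

```shell
# Maps a document type to the tool family suggested by the decision rule.
# The type labels (printed, form, table, handwriting, layout) are hypothetical.
pick_ocr_tool() {
  case "$1" in
    printed)            echo "tesseract" ;;
    form|table)         echo "cloud-api" ;;
    handwriting|layout) echo "vision-model" ;;
    *)                  echo "unknown document type: $1" >&2; return 1 ;;
  esac
}

pick_ocr_tool table   # prints "cloud-api"
```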

Running OCR locally with Tesseract

Tesseract reads images, not PDFs, so the local workflow is two steps: convert each PDF page to an image, then OCR each image. Both tools involved, Poppler for the conversion and Tesseract for the OCR, are free.

Install Tesseract and Poppler (which provides pdftoppm):

# macOS
brew install tesseract poppler

# Ubuntu/Debian
sudo apt-get install tesseract-ocr poppler-utils

Convert your PDF to PNG images at 300 DPI (lower than this hurts accuracy noticeably):

pdftoppm -r 300 input.pdf page -png

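You'll now have page-1.png, page-2.png, and so on (pdftoppm zero-pads the numbers for longer documents, for example page-01.png, so the files sort in page order). Run Tesseract on each: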

for img in page-*.png; do
  tesseract "$img" "${img%.png}" -l eng --psm 6
done
cat page-*.txt > output.txt

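A few flags matter here: -r 300 sets the resolution, in DPI, at which pdftoppm renders each page; -l eng tells Tesseract which language pack to use; and --psm 6 sets the page segmentation mode so that each page is treated as a single uniform block of text.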

For a ten-page document, this pipeline runs in under a minute on a modern laptop.
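The steps above can be folded into one reusable function. This is a minimal sketch rather than a hardened script: the name ocr_pdf, its argument order, and the use of a scratch directory are choices made here, and it assumes pdftoppm and tesseract are on the PATH.

```shell
# Sketch: OCR a whole PDF into a single text file.
# Usage: ocr_pdf input.pdf output.txt [language]
# Note: plain sh has no local variables, so the names below are global.
ocr_pdf() {
  pdf=$1
  out=$2
  lang=${3:-eng}

  # Fail early if either tool is missing.
  for tool in pdftoppm tesseract; do
    command -v "$tool" >/dev/null 2>&1 || {
      echo "missing required tool: $tool" >&2
      return 1
    }
  done

  # Work in a scratch directory so page images don't pile up beside the PDF.
  workdir=$(mktemp -d) || return 1

  # Render every page at 300 DPI, then OCR each page image.
  pdftoppm -r 300 -png "$pdf" "$workdir/page" || { rm -rf "$workdir"; return 1; }
  for img in "$workdir"/page-*.png; do
    tesseract "$img" "${img%.png}" -l "$lang" --psm 6 || { rm -rf "$workdir"; return 1; }
  done

  # Page numbers are zero-padded by pdftoppm, so the glob keeps page order.
  cat "$workdir"/page-*.txt > "$out"
  rm -rf "$workdir"
}
```

A call such as ocr_pdf input.pdf output.txt runs the same commands as the manual steps. A third argument like deu would switch the language pack, provided that pack is installed.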

Running OCR through the browser

If installing tools isn't an option, pdfs2txt and similar online converters offer a no-install path. Upload the PDF, select an OCR option, and download the result as Markdown or plain text. The trade-off is privacy: your file lives on someone else's server for the duration of the conversion. For non-sensitive documents this is fine; for legal, medical, or financial files, run OCR locally instead.

The browser workflow makes sense when you can't install software, the document is a one-off, and nothing in it is sensitive.

Improving OCR accuracy

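The single biggest lever on OCR accuracy is the input image quality. Garbage in, garbage out holds strongly here: scan or render at 300 DPI or higher, since accuracy drops noticeably below that, and use the language pack that matches the document.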
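When you only have the image and not the scanner settings, the effective resolution can be estimated from the pixel width and the physical page width. The sketch below shows that arithmetic. The function name is an assumption, and the example uses a US Letter page, which is 8.5 inches wide.

```shell
# Effective DPI = pixel width of the page image / page width in inches.
effective_dpi() {
  awk -v px="$1" -v inches="$2" 'BEGIN { printf "%.0f\n", px / inches }'
}

# A Letter page stored as a 2550-pixel-wide image meets the 300 DPI target.
effective_dpi 2550 8.5   # prints 300
# The same page at 1700 pixels wide is only 200 DPI.
effective_dpi 1700 8.5   # prints 200
```

If Poppler is installed, pdfimages -list input.pdf reports the pixel dimensions of each embedded image, which can supply the first argument.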

When OCR alone isn't enough

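OCR is great for what it was designed for, printed text, but four categories of documents need different approaches: handwriting, math notation, complex tables and forms, and multi-column layouts mixed with figures. Handwriting and unusual layouts are better served by a vision model, and tables and forms by a cloud API.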

If your document falls into one of these categories, save yourself the cleanup time and start with the right tool.

Putting it together

Identifying a scanned PDF takes thirty seconds; picking the right OCR tool takes another minute. After that, the work is largely mechanical. Pick the tool that matches your document type, scan at 300 DPI, use the right language pack, and budget a few minutes per 100 pages for cleanup.

For most printed documents, Tesseract works well and runs locally for free. For anything with handwriting, complex tables, or unusual layouts, a vision model or specialized cloud API earns its cost in cleanup time saved.

If you want to try OCR without installing anything, the converter on this site offers a free path: upload the PDF, choose the OCR option, download the result. For sensitive documents, run OCR locally with the steps above.
