How to Convert Scanned PDFs to Searchable Text — A Complete Guide
You select text in a scanned PDF and copy it. You paste — and get nothing. Or, sometimes, a confused string of square boxes and question marks. The pages on screen are full of words, but somehow the words aren't there when you ask for them.
That's because a scanned PDF is essentially a stack of pictures. Each page is an image of text rather than text itself. Your PDF reader can render the page perfectly — it's just drawing the picture — but the underlying file contains no actual character data for you to copy.
To get text out, you have to do what your eyes do automatically: look at the picture and identify each letter. That's the job of OCR — optical character recognition. This guide walks through identifying scanned PDFs, picking the right OCR approach for the job, running it cleanly, and dealing with the messy edges that always show up in real documents.
How to tell if your PDF is scanned
Before reaching for OCR tools, confirm you actually need one. Three quick checks usually settle the question.
Try to select text. Open the PDF and drag your cursor across a sentence. If selection works smoothly and copy-paste produces readable text, the PDF has an embedded text layer and OCR isn't needed. If selection skips erratically, highlights entire pages as a single block, or produces gibberish on paste, the text layer is missing or broken.
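The same check can be scripted. The sketch below is one possible approach, assuming Poppler's pdftotext is installed. The name text_layer_verdict is a hypothetical helper, and the 100-character threshold is an arbitrary assumption rather than an established cutoff.

```shell
# Classify extracted text read from stdin.
# Hypothetical helper; the 100-character threshold is an assumption.
text_layer_verdict() {
    # Count the non-whitespace characters in whatever was extracted.
    chars=$(tr -d '[:space:]' | wc -c | tr -d ' ')
    if [ "$chars" -ge 100 ]; then
        echo "text layer present ($chars characters)"
    else
        echo "probably scanned ($chars characters)"
    fi
}

# Usage (requires Poppler): check the first three pages of input.pdf.
# pdftotext -f 1 -l 3 input.pdf - | text_layer_verdict
```

A PDF with a poor hidden OCR layer will also pass this check, so it only rules out files that contain no text at all.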
Check the file size. A digital text PDF typically runs about 50 KB per page. A scanned PDF often runs five to ten times that, and sometimes far more, because each page is stored as a full-page image. A 30-page document weighing 30 MB (roughly 1 MB per page) is almost certainly scanned.
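To put a number on it, divide the file size by the page count. This is a minimal sketch: kb_per_page is a hypothetical helper, and the commented usage assumes Poppler's pdfinfo is available to report the page count.

```shell
# Average size per page in KB, using integer division.
# Hypothetical helper: arguments are file size in bytes and page count.
kb_per_page() {
    echo $(( $1 / 1024 / $2 ))
}

# The 30 MB, 30-page example: 31457280 bytes over 30 pages.
kb_per_page 31457280 30   # prints 1024

# Usage with a real file (requires Poppler's pdfinfo):
# pages=$(pdfinfo input.pdf | awk '/^Pages:/ {print $2}')
# kb_per_page "$(wc -c < input.pdf)" "$pages"
```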
Zoom in to 400%. Digital text stays sharp at any zoom level; scanned text gets pixelated, shows JPEG compression artifacts, or reveals the texture of the original paper.
One trap to watch for: some scanned PDFs already carry a hidden text layer from a previous OCR pass. You can copy from them, but the quality varies and is often poor. If the copied text is mostly readable but riddled with stray errors, you're working with low-quality OCR output, and re-running OCR with a better tool may give cleaner results. For a deeper look at the different ways PDF text gets garbled on copy, see why your PDF text won't copy.
What OCR actually does
OCR converts pictures of text into actual text data. The basic problem is harder than it sounds: letters in real documents vary by font, size, weight, and color, and the images themselves vary with lighting, scanner quality, and the age and condition of the original.
Modern OCR engines handle this with neural networks trained on large datasets of text images. Tesseract, the most widely used open-source engine, moved to an LSTM-based recognition model in 2018. The newer commercial and AI-based options — cloud vision APIs and vision language models — handle even messier inputs.
That capability has real limits, though. Even the best OCR struggles with handwriting, math notation, complex tables, multi-column layouts mixed with figures, and scans below 200 DPI. Knowing where each tool fails is the difference between a clean extraction and a weekend of manual cleanup.
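If you only know a page image's pixel dimensions, the effective resolution follows from the paper size. The sketch below assumes the scan covers the full page width. The name effective_dpi is a hypothetical helper, and the paper width is passed in hundredths of an inch to stay within integer arithmetic.

```shell
# Effective DPI of a scanned page, using integer arithmetic.
# Arguments: image width in pixels, paper width in hundredths of an inch
# (850 for US Letter). Hypothetical helper for illustration.
effective_dpi() {
    echo $(( $1 * 100 / $2 ))
}

effective_dpi 2550 850   # prints 300
effective_dpi 1275 850   # prints 150, below the 200 DPI floor
```

Poppler's pdfimages -list input.pdf reports the pixel dimensions of the images embedded in a PDF, which is one way to get the first argument.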
OCR tool options compared
Five families of tools cover almost every realistic use case:
| Tool | Cost | Best at | Main limitation | Setup |
|---|---|---|---|---|
| Tesseract | Free | Clean printed text in 100+ languages | Weak on tables, handwriting, complex layouts | Local install |
| Adobe Acrobat Pro | ~$20/mo | Mature GUI workflow, batch jobs | Subscription only | Desktop app |
| Online converters | Usually free | One-off documents, no install | Privacy depends on provider | Web upload |
| Cloud APIs (Textract, Document AI) | ~$1.50 / 1000 pages | Tables, forms, structured documents | Requires cloud account | API integration |
| AI vision models (GPT-4o, Claude, Gemini) | ~$0.01–0.05 / page | Handwriting, messy layouts, diagrams | Costs per call, network needed | API key |
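
The per-page prices in the table translate into a rough budget with a little arithmetic. The sketch below takes the table's approximate figures as inputs. The name estimate_cost is a hypothetical helper, and real pricing varies by provider and tier.

```shell
# Estimated cost in dollars for a page count at a given per-page price.
# Hypothetical helper; prices are the approximate figures from the table.
estimate_cost() {
    awk -v pages="$1" -v price="$2" 'BEGIN { printf "%.2f\n", pages * price }'
}

estimate_cost 5000 0.0015   # cloud API at about $1.50 per 1000 pages: prints 7.50
estimate_cost 5000 0.01     # vision model, low end: prints 50.00
estimate_cost 5000 0.05     # vision model, high end: prints 250.00
```
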
Tesseract is the free, local default. Its development began at Hewlett-Packard in 1985, it was released as open source in 2005, and it now has language packs for more than 100 languages. For clean, single-column printed text at 300+ DPI, it produces solid output. For anything with tables, handwriting, or complex layouts, look elsewhere. A deeper dive on Tesseract's strengths and tuning is in the Tesseract OCR guide.
Adobe Acrobat Pro is mature and reliable, with strong batch features. It's the right pick if you already pay for Adobe and don't want to assemble a pipeline.
Online converters trade convenience for varying privacy guarantees. Fine for non-sensitive documents; not appropriate for legal, medical, or financial files unless the operator publishes a strict no-retention policy. For more on what actually happens to your file during an online conversion, see PDF privacy and security.
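Cloud APIs from AWS, Azure, and Google are the right answer when tables or forms matter. They're specifically trained on those structures. Expect about $1.50 per 1000 pages for basic text detection on standard tiers. Table and form extraction is usually billed at a higher per-page rate, so check the provider's current pricing before committing to a large batch.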
AI vision models are the newest option and now the strongest choice for handwriting and messy real-world documents. The downside is per-page cost (small but non-zero) and reliance on a network round-trip. For handwriting specifically, see the handwritten OCR guide.
A simple decision rule: clean printed text goes to Tesseract; forms and tables go to a cloud API; anything with handwriting or unusual layouts goes to a vision model.
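That rule is simple enough to encode directly, which helps when a batch script has to route files. This is only a sketch: pick_ocr_tool and the document-type labels are hypothetical names chosen for illustration, and deciding which label applies to a given file is still up to you.

```shell
# Map a document type to the tool family suggested by the decision rule.
# The type labels and tool names are hypothetical, chosen for this sketch.
pick_ocr_tool() {
    case "$1" in
        printed)             echo "tesseract" ;;
        form|table)          echo "cloud-api" ;;
        handwriting|layout)  echo "vision-model" ;;
        *)                   echo "unknown"; return 1 ;;
    esac
}

pick_ocr_tool printed       # prints tesseract
pick_ocr_tool table         # prints cloud-api
pick_ocr_tool handwriting   # prints vision-model
```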
Running OCR locally with Tesseract
Tesseract reads images, not PDFs, so the local workflow is two steps: convert each PDF page to an image, then OCR each image. Both tools used here, Tesseract and Poppler, are free.
Install Tesseract and Poppler (which provides pdftoppm):
# macOS
brew install tesseract poppler
# Ubuntu/Debian
sudo apt-get install tesseract-ocr poppler-utils
Convert your PDF to PNG images at 300 DPI (lower than this hurts accuracy noticeably):
pdftoppm -r 300 input.pdf page -png
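You'll now have page-1.png, page-2.png, and so on (pdftoppm zero-pads the numbers when the document has ten or more pages, for example page-01.png). Run Tesseract on each:
for img in page-*.png; do
    # The second argument is the output base name; Tesseract appends .txt
    tesseract "$img" "${img%.png}" -l eng --psm 6
done
# Join the per-page text files into one document
cat page-*.txt > output.txt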
A few flags matter here:
- -l eng selects the English language pack. Use -l eng+deu for bilingual documents. The wrong language produces much worse output.
- --psm 6 (page segmentation mode 6) treats each image as a uniform block of text. The default mode 3 (auto-detect) sometimes interleaves columns; mode 6 produces cleaner output for most single-column scans.
- For receipts, forms, or sparse text, try --psm 11 instead.
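When the language list comes from a script or a config file rather than being typed by hand, the -l value can be assembled programmatically. A small sketch follows. The name join_langs is a hypothetical helper, and the codes must match language packs that are actually installed (tesseract --list-langs shows them).

```shell
# Join language codes with "+" to build the value for Tesseract's -l flag.
# Hypothetical helper; runs in a subshell so the caller's IFS is untouched.
join_langs() {
    ( IFS=+; echo "$*" )   # "$*" joins arguments with the first character of IFS
}

join_langs eng deu   # prints eng+deu

# Usage (requires Tesseract with both packs installed):
# tesseract page-1.png page-1 -l "$(join_langs eng deu)" --psm 6
```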
For a ten-page document, this pipeline runs in under a minute on a modern laptop.
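One detail worth guarding against is page order when the text files are joined. A shell glob sorts names alphabetically, so page-10.txt would land before page-2.txt if the numbers were not zero-padded. pdftoppm pads its output names, so the plain glob works for images it produced. The sketch below is a defensive alternative for images that come from a tool that does not pad. The name concat_pages is a hypothetical helper, and it assumes files named page-N.txt in the current directory.

```shell
# Concatenate page-N.txt files in numeric page order.
# Hypothetical helper; assumes simple file names without spaces or newlines.
concat_pages() {
    # Sort on the number after the dash, then print each file in turn.
    printf '%s\n' page-*.txt | sort -t- -k2,2n | while IFS= read -r f; do
        cat "$f"
    done
}

# Usage, from the directory holding the per-page text files:
# concat_pages > output.txt
```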
Running OCR through the browser
If installing tools isn't an option, pdfs2txt and similar online converters offer a no-install path. Upload the PDF, select an OCR option, and download the result as Markdown or plain text. The trade-off is privacy: your file lives on someone else's server for the duration of the conversion. For non-sensitive documents this is fine; for legal, medical, or financial files, run OCR locally instead.
The browser workflow makes sense when:
- You only have a few documents to convert
- The PDF isn't sensitive
- You don't want to maintain a local Python or command-line environment
- You need OCR for a language Tesseract doesn't handle well, and the converter offers a cloud-vision option
Improving OCR accuracy
The single biggest lever on OCR accuracy is the input image quality. Garbage-in patterns hold strongly:
- DPI matters more than tooling. 300 DPI is the sweet spot. Below 200, accuracy degrades fast — letters get blurry enough that even neural networks struggle. If you control the scanning step, scan at 300 DPI or higher.
- Pre-process the image before running OCR. Skewed scans, faded ink, and uneven contrast all hurt. ImageMagick's deskew filter (convert input.png -deskew 40% fixed.png) and a simple contrast bump (-level 20%,80%) often improve accuracy by several percentage points.
- Choose the right language pack. Tesseract supports more than 100 languages, but you have to tell it which one to use. Mixed-language documents need multiple packs: -l eng+spa for English plus Spanish.
- Pick the right page segmentation mode. Mode 6 (uniform block) is the right default for most printed documents. Mode 11 (sparse text) helps with receipts, license plates, and form fields. Mode 4 (single column of variable-sized text) helps with mixed-size content like magazine pages.
- Plan for cleanup. Even good OCR produces a few errors per page. Budget five to ten minutes per 100 pages for spot-fixing. Common errors include ligature artifacts (fi becoming a single character) and zero-vs-O confusion in tables of numbers.
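Part of that cleanup can be automated. The sketch below expands the common Latin ligature characters back into plain letters with sed. The name fix_ligatures is a hypothetical helper, and it assumes the OCR output is UTF-8. Zero-vs-O confusion is left out because fixing it safely depends on context.

```shell
# Replace single-character Latin ligatures with their plain-letter spellings.
# Hypothetical helper; reads text on stdin and writes the result to stdout.
fix_ligatures() {
    sed -e 's/ﬃ/ffi/g' -e 's/ﬄ/ffl/g' \
        -e 's/ﬁ/fi/g' -e 's/ﬂ/fl/g' -e 's/ﬀ/ff/g'
}

# Usage on the combined OCR output:
# fix_ligatures < output.txt > cleaned.txt
```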
When OCR alone isn't enough
OCR is great for what it was designed for — printed text — but four categories of documents need different approaches:
- Handwritten content is essentially noise to Tesseract. Switch to a vision model. GPT-4o, Claude, and Gemini all produce decent transcriptions on clear handwriting, with realistic accuracy around 90% for legible cursive.
- Math formulas get misread by general-purpose OCR because the symbols and 2D layout don't match the engine's training distribution. Specialized tools like Mathpix handle equations cleanly and output LaTeX.
- Complex tables (anything with merged cells, nested headers, or sparse columns) get mangled by general-purpose OCR engines. AWS Textract and Azure Document Intelligence are purpose-built for this and worth the API cost when tables are the point.
- Mixed layouts with figures and captions confuse segmentation engines. Vision models do better here because they can use visual context — they understand that a caption refers to the figure above it.
If your document falls into one of these categories, save yourself the cleanup time and start with the right tool.
Putting it together
Identifying a scanned PDF takes thirty seconds; picking the right OCR tool takes another minute. After that, the work is largely mechanical. Pick the tool that matches your document type, scan at 300 DPI, use the right language pack, and budget a few minutes per 100 pages for cleanup.
For most printed documents, Tesseract works well and runs locally for free. For anything with handwriting, complex tables, or unusual layouts, a vision model or specialized cloud API earns its cost in cleanup time saved.
If you want to try OCR without installing anything, the converter on this site offers a free path: upload the PDF, choose the OCR option, download the result. For sensitive documents, run OCR locally with the steps above.