Tesseract OCR Explained — Strengths, Weaknesses, and Tuning Tips
Tesseract has been around since 1985 — originally an HP Labs project, then taken up by Google, now community-maintained. It's still the default free OCR engine for a reason: it works, it runs locally, it's fast, and it costs nothing at any volume. But it has rough edges that bite first-time users, and most of the "Tesseract is bad at X" advice you'll find online is from the pre-2018 era and no longer applies.
This guide covers what Tesseract is genuinely good at, where it falls down today, and the half-dozen settings that meaningfully change output quality.
A short history
Tesseract started in 1985 at HP Labs as a desktop OCR project. By 1995 it was placing among the top three OCR engines in independent accuracy tests run by the University of Nevada, Las Vegas, and then it sat on a shelf for a decade. HP open-sourced the code in 2005, Google sponsored development from 2006 through the 2010s, and the project is now community-maintained.
The version most people use today is descended from Tesseract 4 (2018), which introduced an LSTM-based neural network recognizer as the default and kept the older character-classifier engine as a legacy option. Tesseract 5 (2021) refined the architecture but didn't fundamentally change it. The practical implication: a lot of older advice ("Tesseract is hopeless on italics", "Tesseract can't handle ligatures") was true in 2010 and isn't true now.
How modern Tesseract works
Tesseract runs three stages on every page:
- Page segmentation. The engine analyzes the image and identifies text regions, blocks, lines, and individual words. This is heuristic work and the most common failure point on real-world documents.
- Line recognition. Each line of pixels gets fed to an LSTM neural network that reads it as a sequence and outputs a sequence of characters. This is the stage that benefits most from the neural-network move — it handles ligatures, italics, and most font variations.
- Language model post-processing. A dictionary-based step re-ranks candidate readings using language statistics. This is why picking the right -l flag matters so much: telling Tesseract "this is German" lets the language model bias toward German words.
Three things to take from this:
- Segmentation is fragile. If Tesseract gets the page layout wrong, even perfect line recognition won't save the output.
- The neural-net stage is genuinely good — better than most users give it credit for on clean inputs.
- The language model is necessary but limited. If your document mixes languages or contains many proper nouns, the language model can do as much harm as good.
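Because segmentation is the stage that fails most often, it can help to look at the layout Tesseract actually found before reading the text. The sketch below counts blocks, lines, and words in Tesseract's TSV output, where the first column is the layout level (2 = block, 4 = line, 5 = word). The function name layout_summary is hypothetical, not part of Tesseract.

```shell
#!/usr/bin/env bash
# layout_summary: reads Tesseract TSV on stdin and prints how many
# blocks, lines, and words the segmentation stage produced.
# Hypothetical helper for illustration only.
layout_summary() {
  awk -F'\t' '
    NR == 1 { next }        # skip the TSV header row
    $1 == 2 { blocks++ }    # level 2 rows are text blocks
    $1 == 4 { lines++ }     # level 4 rows are text lines
    $1 == 5 { words++ }     # level 5 rows are individual words
    END { printf "blocks=%d lines=%d words=%d\n", blocks, lines, words }
  '
}
```

A possible invocation is: tesseract page.png stdout --psm 3 tsv | layout_summary. A single-column page that reports many small blocks is a hint that segmentation, not recognition, is where the output will go wrong.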
Where Tesseract shines
Tesseract is the right tool for:
- Clean, printed, single-column text. Novels, technical manuals, reports, government forms with mostly text.
- High-DPI scans (300+). The recognition engine needs sharp pixel edges; at 300 DPI on a clean document, accuracy regularly hits 99%+.
- Standard fonts in supported languages. Times, Arial, Helvetica, most book typefaces, most print newspaper fonts.
- Documents where you can tune parameters once and run on many files. Batch jobs where the per-document setup cost is amortized.
- Offline workflows. Legal, medical, classified, or air-gapped environments where cloud APIs are out of the question.
- Any volume. Free at one page, free at one million pages.
Where Tesseract fails
Honest list of limitations:
- Handwriting — Tesseract was never trained on handwriting and produces essentially noise. Use a vision model. See the handwritten OCR guide.
- Complex tables — Tesseract reads tables as flowing text. Column alignment is lost, multi-line cells flatten into adjacent cells, headers detach from data. For tables that matter, use AWS Textract, Azure Document Intelligence, or a vision model. See preserving tables when converting PDF to Markdown.
- Multi-column layouts mixed with figures. Segmentation often interleaves columns or grabs figure-caption text into a body paragraph.
- Skewed or rotated pages. Tesseract has only mild built-in deskew. Pre-process with ImageMagick first.
- Low-DPI scans. Below 200 DPI, accuracy drops sharply. Rescan if you can.
- Stylized fonts. Display fonts, blackletter, very condensed typefaces, decorative scripts.
- Math notation. Equations get misread as random punctuation. Use Mathpix or a math-aware tool.
If your document falls into one of these categories, save yourself the cleanup time and start with a different tool.
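The low-DPI case is easy to check before running OCR: effective resolution is just pixel width divided by physical page width. A minimal sketch, assuming the paper size is known; effective_dpi and dpi_verdict are hypothetical helper names, and the 200 DPI cutoff is the one from the list above.

```shell
#!/usr/bin/env bash
# effective_dpi PIXELS INCHES: prints pixels per inch, rounded to the
# nearest whole number. Hypothetical helper, not part of Tesseract.
effective_dpi() {
  awk -v px="$1" -v inches="$2" 'BEGIN { printf "%d\n", px / inches + 0.5 }'
}

# dpi_verdict PIXELS INCHES: prints "ok" at 200 DPI or above, else "low".
dpi_verdict() {
  local dpi
  dpi=$(effective_dpi "$1" "$2")
  if [ "$dpi" -ge 200 ]; then
    echo "ok"
  else
    echo "low"
  fi
}
```

For example, a US Letter page (8.5 inches wide) scanned to 1275 pixels across works out to 150 DPI, so dpi_verdict 1275 8.5 reports "low". The pixel width can be read with ImageMagick's identify -format '%w' page-1.png.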
The settings that actually matter
Tesseract has dozens of command-line flags but only six change output quality meaningfully:
- -l <lang>: language pack. Use -l eng+spa for bilingual documents. The wrong language pack produces noticeably worse output. Tesseract ships with English; install others separately (see below).
- --psm <N>: page segmentation mode. The default is 3 (fully automatic), which can mis-order columns on complex pages. Try mode 6 (a single uniform block of text) as a better default for most single-column printed documents. Mode 11 (sparse text) works well on receipts and forms.
- --oem <N>: OCR engine mode. Mode 1 is LSTM-only and usually best on modern inputs. Mode 3 (the default) uses whichever engine the installed language data supports; mode 2 combines LSTM with the older legacy engine and can be slower without being better.
- -c preserve_interword_spaces=1: preserves spacing in code blocks or tabular text. Off by default, which collapses multi-space runs to single spaces.
- -c tessedit_char_whitelist=...: restricts output to a specific character set. Useful for license plates (uppercase letters and digits only) or ID numbers.
- --dpi <N>: manually sets scan DPI when the image's resolution metadata is missing or wrong. Important because the engine uses DPI to calibrate its expectations about character size.
A solid baseline command for most printed documents:
tesseract input.png output -l eng --psm 6 --oem 1
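When a batch mixes document types, one option is to keep the flag choices above in a single place. The sketch below maps a document type to flags; the profile names (document, receipt, code, plate) are hypothetical labels, and using --psm 7 (single text line) for plates is an assumption on top of the whitelist advice above. Setting DRY_RUN=1 prints the command instead of running it.

```shell
#!/usr/bin/env bash
# ocr_flags PROFILE: prints the Tesseract flags for a document type.
# Profile names are hypothetical labels for this sketch.
ocr_flags() {
  case "$1" in
    document) echo "--psm 6 --oem 1" ;;
    receipt)  echo "--psm 11 --oem 1" ;;
    code)     echo "--psm 6 --oem 1 -c preserve_interword_spaces=1" ;;
    # Assumption: a plate is one line of text, so single-line mode (7).
    plate)    echo "--psm 7 --oem 1 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789" ;;
    *)        echo "unknown profile: $1" >&2; return 1 ;;
  esac
}

# ocr_page IMAGE OUTBASE LANGS PROFILE: runs Tesseract with the profile's
# flags, or only prints the command when DRY_RUN=1 is set.
ocr_page() {
  local flags
  flags=$(ocr_flags "$4") || return 1
  # $flags is deliberately unquoted so it splits into separate arguments.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo tesseract "$1" "$2" -l "$3" $flags
  else
    tesseract "$1" "$2" -l "$3" $flags
  fi
}
```

With DRY_RUN=1 set, ocr_page receipt.png receipt eng receipt prints the command it would run, which makes it easy to review flags before committing to a long batch.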
Practical pre-processing pipeline
Tesseract reads images, not PDFs, so the typical workflow converts each page to an image, cleans it up, runs OCR, and combines the results:
# Convert PDF pages to PNG at 300 DPI
pdftoppm -r 300 input.pdf page -png
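# Optional: deskew and contrast-bump each page
# (on ImageMagick 7, use "magick" in place of "convert")
for img in page-*.png; do
  # Skip cleaned files left over from a previous run
  case "$img" in *-clean.png) continue ;; esac
  convert "$img" -deskew 40% -level 20%,80% "${img%.png}-clean.png"
done
# Run Tesseract
for img in page-*-clean.png; do
  tesseract "$img" "${img%-clean.png}" -l eng --psm 6 --oem 1
done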
# Combine results
cat page-*.txt > output.txt
The deskew step in particular is worth running on any scanned document. Even a 1–2 degree page rotation knocks several percentage points off accuracy, and ImageMagick's -deskew corrects it quickly enough that there is little reason to skip it.
Language packs in practice
Tesseract supports more than 100 languages but ships with only English by default. To install others:
# macOS — install all packs at once
brew install tesseract-lang
# Ubuntu/Debian — install specific languages
sudo apt-get install tesseract-ocr-deu tesseract-ocr-fra tesseract-ocr-spa
# Verify what's installed
tesseract --list-langs
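Before a long batch run, it can be worth confirming that every language in the -l spec is actually installed. A minimal sketch that compares a spec such as eng+deu against the output of tesseract --list-langs; missing_langs is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# missing_langs SPEC: reads `tesseract --list-langs` output on stdin and
# prints each language code from SPEC (for example "eng+deu") that is
# not in the list. Hypothetical helper for illustration only.
missing_langs() {
  local installed lang
  installed=$(cat)
  # Split the spec on "+" and check each code against whole lines only,
  # so the "List of available languages..." header never matches.
  for lang in $(printf '%s' "$1" | tr '+' ' '); do
    if ! printf '%s\n' "$installed" | grep -qxF -- "$lang"; then
      echo "$lang"
    fi
  done
}
```

A possible invocation is: tesseract --list-langs 2>&1 | missing_langs eng+deu. The 2>&1 is there in case a build writes the list to stderr. Empty output means every requested pack is present.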
Per-language quality varies:
- Excellent: English, Spanish, German, French, Portuguese, Italian, Dutch
- Good: Russian, Polish, Czech, Swedish, Norwegian
- Decent on print: Chinese (simplified and traditional), Japanese, Korean
- Variable: Arabic, Hebrew, Thai, Vietnamese (script complexity varies)
- Weak: Rare scripts and historical orthographies
For Chinese, Japanese, and Korean (CJK) documents, verify output carefully: Tesseract handles printed CJK reasonably but mis-segments characters in dense layouts. For Arabic, right-to-left reading order comes from the Arabic language data (-l ara); --psm 1 (automatic segmentation with orientation and script detection) can help when page orientation is uncertain, and even then results are usable but require cleanup.
When to reach for something else
Modern AI vision models — GPT-4o, Claude, Gemini — often beat Tesseract on messy real-world documents, especially anything with handwriting, complex layouts, or low scan quality. The tradeoff is per-page cost (roughly $0.01–0.05) and a network round-trip.
A pragmatic hybrid workflow:
- Run Tesseract first on every document.
- Score the output: if it's empty, very short, or contains many unusual character sequences (a sign of low confidence), flag the page.
- Route flagged pages to a vision model for re-processing.
This keeps costs low (most pages use free Tesseract) while catching the documents where Tesseract struggles.
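The scoring step can be sketched with Tesseract's own per-word confidence from its TSV output (column 11, with -1 on non-word rows). Note that this swaps the character-sequence heuristic described above for the confidence values directly. The function names and the thresholds in the usage note are hypothetical and would need tuning on real documents.

```shell
#!/usr/bin/env bash
# page_score: reads Tesseract TSV on stdin and prints
# "<word_count> <mean_confidence>" over non-empty word rows.
# Hypothetical helper for illustration only.
page_score() {
  awk -F'\t' '
    NR > 1 && $1 == 5 && $11 >= 0 && $12 !~ /^ *$/ { n++; sum += $11 }
    END { if (n) printf "%d %.1f\n", n, sum / n; else print "0 0.0" }
  '
}

# needs_review MIN_WORDS MIN_CONF: exits 0 (flag the page) when the TSV
# on stdin has too few words or too low a mean confidence, else exits 1.
needs_review() {
  local score words conf
  score=$(page_score)
  words=${score% *}
  conf=${score#* }
  awk -v w="$words" -v c="$conf" -v mw="$1" -v mc="$2" \
    'BEGIN { exit !(w < mw || c < mc) }'
}
```

A possible use in the routing loop is: tesseract page.png stdout tsv | needs_review 20 60 && echo "page.png" >> flagged.txt, where 20 words and a mean confidence of 60 are placeholder thresholds.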
Conclusion
Tesseract still earns its place as the free, local, offline OCR option. The neural-network engine introduced in 2018 closed the gap with commercial OCR on clean inputs; the remaining weakness is segmentation on complex layouts, not character recognition itself.
Pick the right settings (--psm 6, --oem 1, the correct -l language pack), pre-process scanned images with ImageMagick, and budget realistic cleanup time. For anything beyond clean printed text — handwriting, complex tables, math — start with a tool that's designed for the job.
If you'd like to try Tesseract without installing anything, the converter on this site runs it server-side and returns Markdown.