Multi-Language OCR — Handling Non-English and Mixed-Script Documents

2026-05-22

Most OCR setups are tuned for a single language, usually English, and their defaults assume it. Throw a Korean document, a French-Arabic bilingual contract, or a Japanese form with embedded English brand names at a stock OCR setup and you'll get gibberish, missing characters, or a confident misread of one script as another. Multi-language OCR needs its own techniques, separate from the English-only path.

This guide walks through what changes for non-English documents, where each tool stands, and the prompting and configuration tricks that turn unusable output into clean text.

What "multi-language" actually means

Three different cases hide under the same label:

Single non-English document. A book in Spanish, a Korean tax form, a Hindi report. You need the right language model so that the engine knows what to expect on every page.
Multi-language document. A bilingual contract with English and Mandarin in parallel columns. The engine needs to recognize both, and ideally tag the language of each block.
Mixed-script document. A primarily-Japanese document with embedded English brand names and Arabic numerals. The engine has to switch scripts mid-line, sometimes mid-word.

Each case has different best-tool recommendations. Knowing which case you're in is half the work.

Tesseract for non-English documents

Tesseract supports over 100 languages, but you have to install each language pack separately and tell the engine which language(s) to use. The basic invocation:

tesseract document.tif output -l spa     # Spanish only
tesseract document.tif output -l deu+eng # German + English (mixed)
tesseract document.tif output -l chi_sim+eng  # Simplified Chinese + English

The +eng pattern matters more than people realize. Most non-English documents contain some English — brand names, URLs, technical terms, citations. If you don't include eng, Tesseract maps those English words into the primary language's closest-looking characters and produces nonsense for them.
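When the call is made from a script rather than a shell, the same -l value can be assembled programmatically. The following is a minimal sketch, assuming the tesseract binary is on PATH. The helper names build_tesseract_command and run_ocr are hypothetical, and appending eng by default is this sketch's choice, not Tesseract behavior.

```python
import subprocess


def build_tesseract_command(image_path, output_base, languages, add_english=True):
    """Return the argv list for one Tesseract run.

    `languages` is a list of Tesseract language codes such as ["deu"] or
    ["chi_sim"]. "eng" is appended when missing, because most documents
    contain at least some English brand names, URLs, or citations.
    """
    codes = list(dict.fromkeys(languages))  # drop duplicates, keep order
    if add_english and "eng" not in codes:
        codes.append("eng")
    return ["tesseract", str(image_path), str(output_base), "-l", "+".join(codes)]


def run_ocr(image_path, output_base, languages):
    """Run Tesseract; requires the binary and the language packs to be installed."""
    command = build_tesseract_command(image_path, output_base, languages)
    subprocess.run(command, check=True)


print(" ".join(build_tesseract_command("document.tif", "output", ["deu"])))
```

Passing add_english=False keeps the run single-language, which can be useful when a corpus is known to contain no Latin-script text.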

Installing language packs:

# Debian/Ubuntu
sudo apt install tesseract-ocr-fra tesseract-ocr-deu tesseract-ocr-chi-sim

# macOS Homebrew
brew install tesseract-lang     # installs all language packs

# Direct download
# https://github.com/tesseract-ocr/tessdata_best — best accuracy, larger files
# https://github.com/tesseract-ocr/tessdata_fast — faster, slightly less accurate

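Use the _best data files when accuracy matters more than runtime. Package managers typically install the tessdata_fast models, so the _best files usually have to be downloaded and placed in the tessdata directory by hand.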
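A missing language pack is a common cause of a failed run, so it can help to check the tessdata directory before starting a batch. The sketch below assumes each language code maps to a file named <code>.traineddata in that directory. The function name missing_language_packs is hypothetical.

```python
from pathlib import Path


def missing_language_packs(tessdata_dir, lang_spec):
    """Return the codes in `lang_spec` (e.g. "chi_sim+eng") that have no
    matching .traineddata file in `tessdata_dir`."""
    directory = Path(tessdata_dir)
    codes = [code for code in lang_spec.split("+") if code]
    return [
        code for code in codes
        if not (directory / f"{code}.traineddata").is_file()
    ]
```

A directory of manually downloaded _best files can then be selected at run time with Tesseract's --tessdata-dir option.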

Realistic Tesseract accuracy by language family:

Western European (English, French, German, Spanish, Italian, Portuguese, Dutch) — 95%+ on clean scans, the engine's home turf.
Eastern European (Polish, Czech, Russian, Ukrainian, Greek) — 90–95%. Slightly weaker on diacritics.
CJK (Chinese, Japanese, Korean) — 80–90%. Tesseract's CJK is competent but trails cloud OCR substantially.
Right-to-left (Arabic, Hebrew, Persian) — 75–85%. Connected letters and contextual forms confuse the segmenter.
South Asian (Hindi, Tamil, Bengali) — 70–85%. Complex ligatures and conjunct characters are hard.
Southeast Asian (Thai, Khmer, Lao) — 60–80%. No spaces between words; Tesseract's segmentation often misfires.

For the lower-accuracy languages, a cloud OCR service or vision model is usually worth the cost.

CJK-specific gotchas

Chinese, Japanese, and Korean each have specific issues worth knowing:

Chinese. Choose chi_sim for Simplified, chi_tra for Traditional. Loading both produces lower accuracy than picking the right one. The standard models handle vertical text poorly; for vertical-flow documents (some Traditional Chinese, some Japanese), use the dedicated vertical models (chi_sim_vert, chi_tra_vert) together with --psm 5, which assumes a single block of vertically aligned text. For modern documents (subtitles, signs, advertising), Document AI substantially outperforms Tesseract.

Japanese. jpn covers Hiragana, Katakana, and the most common Kanji. Tesseract sometimes confuses similar Kanji (e.g., 末 vs 未). Vertical text is more common in Japanese than in Chinese; jpn_vert is the matching vertical model. For furigana (small phonetic guides above Kanji), Tesseract often picks up the furigana as separate text; strip it or filter it by position after OCR.

Korean. kor is decent but Hangul is composed of syllabic blocks of jamo (consonant/vowel pieces); engines sometimes split or merge blocks incorrectly. Document AI and GPT-4o handle Hangul noticeably better than Tesseract.

For all three, vision models in 2026 produce 95%+ accuracy on typical documents — usually the right pick when accuracy matters.
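The furigana filtering mentioned for Japanese can be approximated with a size filter, since furigana is set much smaller than the Kanji it annotates. This is a rough sketch, assuming the OCR output has already been parsed into dicts with "text" and "height" keys (for example from the height column of Tesseract's TSV output). The function name and the 0.6 ratio are illustrative assumptions, not tuned values.

```python
from statistics import median


def drop_small_boxes(boxes, min_height_ratio=0.6):
    """Keep only boxes at least `min_height_ratio` times the median height.

    `boxes` is a list of dicts with "text" and "height" (pixels) keys.
    Boxes well below the typical height are treated as furigana.
    """
    if not boxes:
        return []
    typical_height = median(box["height"] for box in boxes)
    threshold = min_height_ratio * typical_height
    return [box for box in boxes if box["height"] >= threshold]
```

The median only reflects body text when furigana boxes are in the minority, so pages with dense annotation may need a fixed pixel threshold instead.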

Right-to-left scripts

Arabic, Hebrew, Persian, and Urdu present an extra dimension of difficulty: text flows right-to-left, and Arabic-script letters have different forms depending on position (initial, medial, final, isolated). A naive OCR engine that ignores positional forms gets confused, and tools downstream of the OCR sometimes don't know how to display the output correctly.

The practical guidance:

Use a vision model or Google Document AI. Tesseract's Arabic is usable but the modern alternatives are clearly better.
Preserve the Unicode direction marks. Output that includes U+202B (Right-to-Left Embedding) and U+202C (Pop Directional Formatting) displays correctly in editors that respect them; output that strips them can render with punctuation, digits, and embedded Latin words in the wrong order on left-to-right systems.
Numbers stay left-to-right within RTL text. A date like "23 يناير 2024" has the digits flowing left-to-right inside the broader RTL paragraph. OCR sometimes reverses digit groups; spot-check dates and figures.
English brand names in Arabic documents. Always include +eng when using Tesseract. Vision models handle this transparently.
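If an OCR step has dropped the direction marks, they can be restored per line before the text is displayed or stored. The sketch below uses only the standard library and decides direction from the first strongly directional character on each line, which is a simplification of the full Unicode bidirectional algorithm. The function names are hypothetical.

```python
import unicodedata

RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def is_rtl_line(line):
    """True if the first strongly directional character is right-to-left."""
    for ch in line:
        direction = unicodedata.bidirectional(ch)
        if direction in ("R", "AL"):  # Hebrew-style or Arabic-style letter
            return True
        if direction == "L":          # Latin, CJK, and other LTR letters
            return False
    return False  # digits, spaces, and punctuation only


def wrap_rtl_lines(text):
    """Wrap each RTL line in RLE...PDF; leave other lines untouched."""
    wrapped = []
    for line in text.split("\n"):
        if is_rtl_line(line) and not line.startswith(RLE):
            line = f"{RLE}{line}{PDF}"
        wrapped.append(line)
    return "\n".join(wrapped)
```

A line such as "23 يناير 2024" is treated as right-to-left because its leading digits are not strongly directional, so the first Arabic letter decides.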

Vision models for hard cases

For the languages where Tesseract is weak (CJK, RTL, South Asian) and for any mixed-script document, vision models in 2026 are the pragmatic choice. They've been trained on internet-scale multilingual data and handle script-switching naturally.

A prompt that improves results:

This is a document in [language(s)]. The script is primarily [script name],
with embedded [other scripts/languages] in [where they appear: brand names,
citations, technical terms].

Transcribe the document exactly. Preserve:
- Original line breaks
- Original script and direction
- Numbers as they appear (do not convert Eastern Arabic numerals to Western)
- Punctuation specific to the language (CJK quotation marks, Arabic comma ،,
  Spanish opening ¿ and ¡)

Output ONLY the transcription, no commentary.

The "do not convert numerals" instruction is critical for Arabic and Persian documents — vision models love to "helpfully" normalize ٢٠٢٤ to 2024, which silently destroys data.
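A cheap guard against silent numeral normalization is to count which digit systems appear in the transcription. The sketch below covers Western digits, Arabic-Indic digits (U+0660 to U+0669), and the Extended Arabic-Indic digits used in Persian and Urdu (U+06F0 to U+06F9). The function names are hypothetical, and the check is only meaningful for documents known to use Eastern numerals in the original.

```python
def count_digit_scripts(text):
    """Count digits by numeral system."""
    counts = {"western": 0, "arabic_indic": 0, "extended_arabic_indic": 0}
    for ch in text:
        if "0" <= ch <= "9":
            counts["western"] += 1
        elif "\u0660" <= ch <= "\u0669":
            counts["arabic_indic"] += 1
        elif "\u06f0" <= ch <= "\u06f9":
            counts["extended_arabic_indic"] += 1
    return counts


def numerals_look_normalized(text):
    """True if the text has Western digits but no Eastern ones.

    For a source known to use Eastern numerals, this suggests the model
    converted them during transcription.
    """
    counts = count_digit_scripts(text)
    eastern = counts["arabic_indic"] + counts["extended_arabic_indic"]
    return counts["western"] > 0 and eastern == 0
```

Flagged pages can be re-run with a stricter prompt or sent for manual review.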

Detecting language automatically

For batch jobs where you don't know the language of each document in advance:

Tesseract has an Orientation and Script Detection mode (--psm 0, which relies on the osd data file) that identifies the script but not the specific language.
Cloud OCR services auto-detect as part of their normal output. Document AI is the most accurate at this.
For a Python pipeline: run a fast language detector (lingua-py, langdetect) on a small sample of pymupdf-extracted text first; pick the OCR language accordingly. If pymupdf extracts no text (scanned PDF), do a low-resolution Tesseract pass in script-detection mode first.

A pattern for mixed corpora:

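import pymupdf
from lingua import LanguageDetectorBuilder

# Building the detector is expensive; do it once and reuse it.
detector = LanguageDetectorBuilder.from_all_languages().build()

def detect_language(pdf_path, max_pages=5, min_chars=100):
    """Return a lingua Language for the PDF's text layer, or None if it has too little text."""
    with pymupdf.open(pdf_path) as doc:
        # Sample only the first few pages; pages() is PyMuPDF's page-range generator.
        sample = "\n".join(page.get_text() for page in doc.pages(0, max_pages))
    if len(sample.strip()) < min_chars:
        return None  # likely scanned, do OCR-based script detection
    return detector.detect_language_of(sample)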
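For the scanned-PDF fallback, the script reported by Tesseract's detection mode can be mapped to a -l value. The sketch below assumes the "Key: value" line format that tesseract --psm 0 prints (including a "Script:" line). The script names and the mapping in SCRIPT_TO_LANGS are illustrative assumptions and should be adjusted to the corpus.

```python
def parse_osd_output(osd_text):
    """Parse "Key: value" lines from script-detection output into a dict."""
    result = {}
    for line in osd_text.splitlines():
        key, separator, value = line.partition(":")
        if separator:
            result[key.strip()] = value.strip()
    return result


# Hypothetical mapping from detected script to Tesseract language codes.
SCRIPT_TO_LANGS = {
    "Latin": "eng",
    "Cyrillic": "rus+eng",
    "Arabic": "ara+eng",
    "Han": "chi_sim+eng",
    "Japanese": "jpn+eng",
    "Korean": "kor+eng",
}


def languages_for_osd(osd_text, default="eng"):
    """Pick a -l value from detection output, falling back to `default`."""
    script = parse_osd_output(osd_text).get("Script")
    return SCRIPT_TO_LANGS.get(script, default)
```

Because the script alone does not identify the language (Latin covers dozens), the mapping is only a first guess; a language detector run on the resulting OCR text can refine it.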

Diacritics and accented characters

A failure mode shared across many languages: diacritics (accents, umlauts, cedillas) get dropped, converted to base letters, or substituted by unrelated characters.

The fixes:

Use the _best Tesseract data files. They preserve diacritics noticeably better than the default tier.
OCR at higher DPI. Diacritics are small features; 600 DPI captures them where 300 DPI smooths them away.
Don't pre-binarize aggressively. A single global threshold can erase diacritic dots and accents. Use adaptive thresholding instead. See image preprocessing for OCR.
Validate post-OCR. A document in French with no acute accents anywhere is almost certainly mis-OCR'd; flag it for re-processing.
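The post-OCR validation can be automated by measuring how many letters carry a diacritic. This is a heuristic sketch using only the standard library; the function names and both thresholds are assumptions to tune per language, since the expected share of accented letters differs between, say, French and Dutch.

```python
import unicodedata


def diacritic_ratio(text):
    """Fraction of letters that decompose into a base letter plus a combining mark."""
    text = unicodedata.normalize("NFC", text)  # compose first so marks attach to letters
    letters = 0
    accented = 0
    for ch in text:
        if not ch.isalpha():
            continue
        letters += 1
        decomposed = unicodedata.normalize("NFD", ch)
        if any(unicodedata.combining(part) for part in decomposed):
            accented += 1
    return accented / letters if letters else 0.0


def looks_stripped(text, min_ratio=0.005, min_letters=200):
    """Flag text that is long enough to judge but has almost no diacritics."""
    letter_count = sum(1 for ch in text if ch.isalpha())
    return letter_count >= min_letters and diacritic_ratio(text) < min_ratio
```

Letters such as ß, ø, and ł have no combining-mark decomposition, so this check does not count them; it targets accents, umlauts, and cedillas.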

Multi-language search and storage

After OCR, multi-language text still has gotchas in downstream pipelines:

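Normalize Unicode. Use NFC normalization (unicodedata.normalize("NFC", text)) before storing. The same character can be encoded in multiple ways, and identical-looking text won't match otherwise.
Strip BOM and zero-width characters. Some OCR pipelines leak zero-width joiners (especially in Arabic and Devanagari output) that break exact-match search. Strip them from the search key rather than from the stored text, because joiners and non-joiners can be orthographically meaningful in Persian and Indic scripts.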
Be careful with case folding. Turkish has dotless I (ı) vs dotted İ; German ß lowercases to ß but uppercases to ẞ (recently) or SS (traditionally). Locale-aware case folding matters.
Chinese and Japanese have no spaces between words. A keyword search expecting space-separated tokens will fail. Use a language-aware analyzer (Kuromoji for Japanese, Jieba for Chinese) when indexing.
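The normalization points above can be combined into two small helpers: one for the text that gets stored and one for the key used in exact-match comparison. This is a sketch using only the standard library; the function names and the set of invisible characters are assumptions. Note that str.casefold() is not locale-aware, so the Turkish dotted and dotless I still need separate handling.

```python
import unicodedata

# BOM, zero-width space, zero-width non-joiner, zero-width joiner, word joiner
INVISIBLE = dict.fromkeys(map(ord, "\ufeff\u200b\u200c\u200d\u2060"))


def normalize_for_storage(text):
    """NFC-normalize and drop a leading BOM; joiners are kept intact."""
    return unicodedata.normalize("NFC", text.lstrip("\ufeff"))


def search_key(text):
    """Comparison key: NFC, invisible characters removed, case-folded."""
    cleaned = unicodedata.normalize("NFC", text).translate(INVISIBLE)
    return cleaned.casefold()
```

Keeping the stored text and the search key separate means rendering is unaffected while lookups still match, for example search_key("Straße") and search_key("STRASSE") compare equal.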

When to give up on Tesseract for a language

A pragmatic threshold: if Tesseract is producing under 85% character accuracy on clean scans of your target language, the engineering effort to fix it is rarely worth it. Move to a vision model or cloud OCR service. The cost difference at small volumes is negligible; at large volumes, the cost difference is real but so is the accuracy improvement.
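To apply the threshold, character accuracy has to be measured against a hand-checked reference transcription. One common definition is one minus the edit distance divided by the reference length; the sketch below implements that definition with a plain Levenshtein distance. The function names are hypothetical, and other definitions of accuracy will give slightly different numbers.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, ocr_output):
    """1 - (edit distance / reference length), clamped to the range [0, 1]."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))
```

Both strings should be NFC-normalized first, otherwise a correctly recognized accented letter in a different encoding counts as an error. A few representative pages per language are usually enough to see which side of 85% an engine falls on.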

For one-off translations of non-English documents, the converter on this site with GPT-4o or Gemini handles most of the cases above without manual configuration — just pick the vision model and upload the PDF.
