Multi-Language OCR — Handling Non-English and Mixed-Script Documents

Most OCR setups are configured for one language, usually English, and their defaults are tuned to it. Throw a Korean document, a French-Arabic bilingual contract, or a Japanese form with embedded English brand names at a stock OCR setup and you'll get gibberish, missing characters, or a confident misread of one script as another. Multi-language OCR needs its own techniques, separate from the English-only path.

This guide walks through what changes for non-English documents, where each tool stands, and the prompting and configuration tricks that turn unusable output into clean text.

What "multi-language" actually means

Three different cases hide under the same label:

Each case has different best-tool recommendations. Knowing which case you're in is half the work.

Tesseract for non-English documents

Tesseract supports over 100 languages, but you have to install each language pack separately and tell the engine which language(s) to use. The basic invocation:

tesseract document.tif output -l spa     # Spanish only
tesseract document.tif output -l deu+eng # German + English (mixed)
tesseract document.tif output -l chi_sim+eng  # Simplified Chinese + English

The +eng pattern matters more than people realize. Most non-English documents contain some English — brand names, URLs, technical terms, citations. If you don't include eng, Tesseract maps those English words into the primary language's closest-looking characters and produces nonsense for them.
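
To make the +eng habit hard to forget in scripts, the language argument can be built by a small helper. This is a minimal sketch, not part of Tesseract: build_tesseract_cmd and its always_add parameter are hypothetical names, and it only assembles the same command line shown above.

```python
def build_tesseract_cmd(image_path, output_base, languages, always_add=("eng",)):
    """Build a `tesseract` argument list with language codes joined by '+'."""
    langs = []
    for code in list(languages) + list(always_add):
        if code not in langs:  # keep the caller's order, drop duplicates
            langs.append(code)
    return ["tesseract", image_path, output_base, "-l", "+".join(langs)]


cmd = build_tesseract_cmd("document.tif", "output", ["deu"])
print(" ".join(cmd))  # prints: tesseract document.tif output -l deu+eng
# To actually run it: subprocess.run(cmd, check=True)
```

The primary language stays first and eng is appended only when it is missing, so passing ["chi_sim", "eng"] does not produce a duplicate.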

Installing language packs:

# Debian/Ubuntu
sudo apt install tesseract-ocr-fra tesseract-ocr-deu tesseract-ocr-chi-sim

# macOS Homebrew
brew install tesseract-lang     # installs all language packs

# Direct download
# https://github.com/tesseract-ocr/tessdata_best — best accuracy, larger files
# https://github.com/tesseract-ocr/tessdata_fast — faster, slightly less accurate

Use the tessdata_best files when accuracy matters more than runtime. The plain tessdata repository (https://github.com/tesseract-ocr/tessdata) is the middle tier between the two.
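
Before a batch run it helps to confirm the packs you need are actually installed. The sketch below parses the output of tesseract --list-langs; it assumes the output is a header line starting with "List of" followed by one language code per line, which may differ between versions. The function names are made up for this example.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from `tesseract --list-langs` text."""
    codes = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blanks and the header; the rest is assumed to be one code per line.
        if not line or line.lower().startswith("list of"):
            continue
        codes.append(line)
    return codes


def missing_languages(wanted, installed):
    """Return the wanted codes that are not installed, in the order given."""
    return [code for code in wanted if code not in installed]


def installed_languages():
    """Ask the local tesseract binary which language packs it can see."""
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=True
    )
    # Some versions print the list to stderr instead of stdout.
    return parse_list_langs(result.stdout or result.stderr)
```

A typical use is missing_languages(["fra", "deu", "chi_sim"], installed_languages()), failing early if the result is not empty.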

Realistic Tesseract accuracy by language family:

For the lower-accuracy languages, a cloud OCR service or vision model is usually worth the cost.

CJK-specific gotchas

Chinese, Japanese, and Korean each have specific issues worth knowing:

Chinese. Choose chi_sim for Simplified and chi_tra for Traditional. Mixing them produces lower accuracy than picking the right one. Tesseract's default models handle vertical text poorly. For vertical-flow documents (some Traditional Chinese, some Japanese), rotate the image 90° before OCR and rotate the output back, or try the dedicated vertical models (chi_sim_vert, chi_tra_vert) with --psm 5, which tells Tesseract to expect a single block of vertical text. For modern documents (subtitles, signs, advertising), Document AI substantially outperforms Tesseract.

Japanese. jpn covers Hiragana, Katakana, and the most common Kanji. Tesseract sometimes confuses visually similar Kanji (e.g., 末 vs 未). Vertical text is more common in Japanese than in Chinese; the same rotation trick applies. Tesseract often picks up furigana (the small phonetic guides above Kanji) as separate text, so strip it or filter it out by position and size after OCR.
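
One way to filter furigana is by size, since it is set much smaller than the main text. The sketch below assumes you already have word boxes with a height in pixels (for example from Tesseract's TSV output, which includes a height column). The dict layout, the function name, and the 0.6 ratio are assumptions to tune on your own scans.

```python
from statistics import median


def drop_small_boxes(words, ratio=0.6):
    """Keep words whose box height is at least `ratio` times the median height."""
    if not words:
        return []
    typical = median(w["height"] for w in words)
    return [w for w in words if w["height"] >= ratio * typical]


words = [
    {"text": "漢字", "height": 40},
    {"text": "かんじ", "height": 14},  # furigana-sized box
    {"text": "を", "height": 38},
    {"text": "読む", "height": 41},
]
print([w["text"] for w in drop_small_boxes(words)])  # prints: ['漢字', 'を', '読む']
```

A size filter also drops legitimately small text such as footnote markers, so it is worth applying per line or per block rather than across a whole page with mixed font sizes.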

Korean. kor is decent, but Hangul is written in syllabic blocks composed of jamo (consonant and vowel pieces), and engines sometimes split or merge blocks incorrectly. Document AI and GPT-4o handle Hangul noticeably better than Tesseract.
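
Split blocks often show up in the output as standalone jamo. A rough check, using only Python's unicodedata module, is to recompose what Unicode can recompose and flag what is left. The function names are made up for this sketch, and a flagged character is a hint for review, not proof of an OCR error.

```python
import unicodedata


def recompose(text):
    """NFC joins conjoining jamo sequences into precomposed syllable blocks."""
    return unicodedata.normalize("NFC", text)


def stray_jamo(text):
    """Return characters that are standalone jamo rather than full syllable blocks."""
    found = []
    for ch in text:
        cp = ord(ch)
        # U+1100-U+11FF: conjoining jamo; U+3130-U+318F: compatibility jamo
        if 0x1100 <= cp <= 0x11FF or 0x3130 <= cp <= 0x318F:
            found.append(ch)
    return found


decomposed = "\u1112\u1161\u11ab"  # the three jamo that make up one syllable
print(recompose(decomposed) == "\ud55c")  # prints: True
print(stray_jamo("한국어"))  # prints: []
```

NFC leaves compatibility jamo such as ㄱ alone, so those are only reported by stray_jamo, not repaired.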

For all three, vision models in 2026 produce 95%+ accuracy on typical documents — usually the right pick when accuracy matters.

Right-to-left scripts

Arabic, Hebrew, Persian, and Urdu present an extra dimension of difficulty: text flows right-to-left, and Arabic-script letters have different forms depending on position (initial, medial, final, isolated). A naive OCR engine that ignores positional forms gets confused, and tools downstream of the OCR sometimes don't know how to display the output correctly.
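
Both problems can be checked on the OCR output with Python's standard unicodedata module. The sketch below guesses the dominant direction of a string from its bidirectional character classes and folds Arabic presentation forms (the code points for positional glyphs) back to base letters. The function names are hypothetical, and the folding is an assumption about what your downstream tools expect.

```python
import unicodedata


def dominant_direction(text):
    """Return 'rtl', 'ltr', or 'neutral' based on strong bidi classes."""
    rtl = ltr = 0
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):  # Hebrew-type and Arabic-type letters
            rtl += 1
        elif bidi == "L":
            ltr += 1
    if rtl == 0 and ltr == 0:
        return "neutral"
    return "rtl" if rtl > ltr else "ltr"


def fold_presentation_forms(text):
    """Map Arabic presentation forms to base letters, leaving other text alone."""
    out = []
    for ch in text:
        cp = ord(ch)
        # Arabic Presentation Forms-A (U+FB50-U+FDFF) and Forms-B (U+FE70-U+FEFC)
        if 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFC:
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)


print(dominant_direction("مرحبا"))  # prints: rtl
print(fold_presentation_forms("\ufe8d") == "\u0627")  # prints: True
```

Text stored as base letters is easier to search and lets the rendering layer pick the positional forms itself.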

The practical guidance:

Vision models for hard cases

For the languages where Tesseract is weak (CJK, RTL, South Asian) and for any mixed-script document, vision models in 2026 are the pragmatic choice. They've been trained on internet-scale multilingual data and handle script-switching naturally.

A prompt that improves results:

This is a document in [language(s)]. The script is primarily [script name],
with embedded [other scripts/languages] in [where they appear: brand names,
citations, technical terms].

Transcribe the document exactly. Preserve:
- Original line breaks
- Original script and direction
- Numbers as they appear (do not convert Eastern Arabic numerals to Western)
- Punctuation specific to the language (CJK quotation marks, Arabic comma ،,
  Spanish opening ¿ and ¡)

Output ONLY the transcription, no commentary.

The "do not convert numerals" instruction is critical for Arabic and Persian documents — vision models love to "helpfully" normalize ٢٠٢٤ to 2024, which silently destroys data.
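
Silent numeral normalization can be caught with a cheap check on the model's output. This is a heuristic sketch with made-up function names: it counts digits by script and flags text that contains only Western digits. Plenty of real Arabic and Persian documents do use Western digits, so a flag should queue the page for review rather than reject it.

```python
def digit_scripts(text):
    """Count digits by script: Western, Arabic-Indic, Extended Arabic-Indic."""
    counts = {"western": 0, "arabic_indic": 0, "extended_arabic_indic": 0}
    for ch in text:
        cp = ord(ch)
        if 0x30 <= cp <= 0x39:
            counts["western"] += 1
        elif 0x0660 <= cp <= 0x0669:  # used in Arabic
            counts["arabic_indic"] += 1
        elif 0x06F0 <= cp <= 0x06F9:  # used in Persian and Urdu
            counts["extended_arabic_indic"] += 1
    return counts


def looks_normalized(text):
    """True when the text has digits, but none from the native digit sets."""
    counts = digit_scripts(text)
    native = counts["arabic_indic"] + counts["extended_arabic_indic"]
    return counts["western"] > 0 and native == 0


print(digit_scripts("٢٠٢٤"))
# prints: {'western': 0, 'arabic_indic': 4, 'extended_arabic_indic': 0}
print(looks_normalized("سنة 2024"))  # prints: True
```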

Detecting language automatically

For batch jobs where you don't know the language of each document in advance:

A pattern for mixed corpora:

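import pymupdf  # PyMuPDF; older releases import as `fitz`
from lingua import LanguageDetectorBuilder

# Building the detector loads the language models, so build it once and reuse it.
detector = LanguageDetectorBuilder.from_all_languages().build()

def detect_language(pdf_path, max_pages=5, min_chars=100):
    """Return a lingua Language for the PDF's text layer, or None."""
    with pymupdf.open(pdf_path) as doc:
        # Sample only the first few pages to keep batch jobs fast.
        page_count = min(max_pages, doc.page_count)
        sample = "\n".join(doc.load_page(i).get_text() for i in range(page_count))
    if len(sample.strip()) < min_chars:
        return None  # likely scanned, do OCR-based script detection
    # lingua can also return None when it cannot decide.
    return detector.detect_language_of(sample)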
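
For the scanned case, Tesseract can report the script itself when run in orientation-and-script-detection mode (--psm 0, which needs the osd data file). The sketch below parses that report and picks a -l argument. It assumes the report is made of "Key: value" lines including a "Script" line, and the SCRIPT_TO_LANGS table is a hypothetical mapping whose script names you should check against what your build prints.

```python
def parse_osd(output):
    """Turn 'Key: value' lines from Tesseract's OSD report into a dict."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


# Hypothetical mapping from reported script name to a -l argument.
SCRIPT_TO_LANGS = {
    "Latin": "eng",
    "Arabic": "ara+eng",
    "Japanese": "jpn+eng",
    "Hangul": "kor+eng",
}


def langs_for_script(osd_output, default="eng"):
    """Pick a language argument from an OSD report, falling back to `default`."""
    script = parse_osd(osd_output).get("Script")
    return SCRIPT_TO_LANGS.get(script, default)


# Illustrative report text, not captured from a real run.
sample = "Orientation in degrees: 0\nRotate: 0\nScript: Arabic\nScript confidence: 4.2\n"
print(langs_for_script(sample))  # prints: ara+eng
```

Script detection only narrows the choice: Latin script alone cannot tell French from German, so a second pass with a language detector on the OCR text is still useful.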

Diacritics and accented characters

A failure mode shared across many languages: diacritics (accents, umlauts, cedillas) get dropped, converted to base letters, or substituted by unrelated characters.
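
Dropped diacritics are easy to miss by eye but easy to measure. A rough sketch using Python's unicodedata module: decompose the text and count the combining marks per letter, then compare against what you expect for the language. The function names and the idea of a per-language expected rate are assumptions, and the check is meant for Latin-script languages.

```python
import unicodedata


def count_combining_marks(text):
    """Count accents and other nonspacing marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return sum(1 for ch in decomposed if unicodedata.category(ch) == "Mn")


def diacritic_rate(text):
    """Combining marks per letter; near zero on French or German text is suspicious."""
    composed = unicodedata.normalize("NFC", text)
    letters = sum(1 for ch in composed if ch.isalpha())
    return count_combining_marks(text) / letters if letters else 0.0


print(count_combining_marks("café"))  # prints: 1
print(count_combining_marks("cafe"))  # prints: 0
print(diacritic_rate("déjà"))  # prints: 0.5
```

Letters without a Unicode decomposition, such as ø or ł, are not counted, so the rate understates diacritics for languages that rely on them.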

The fixes:

Multi-language search and storage

After OCR, multi-language text still has gotchas in downstream pipelines:
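
One gotcha of this kind is that the same word can be stored as different code point sequences, so exact-match search misses it. A minimal sketch, assuming you keep the original OCR text for display and index a normalized copy; search_key is a made-up helper built on the standard unicodedata module.

```python
import unicodedata


def search_key(text):
    """Normalize text for indexing: compatibility-fold, then case-fold."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    # Case folding can denormalize some strings, so normalize once more.
    return unicodedata.normalize("NFKC", folded)


print(search_key("Ｔｏｋｙｏ"))  # full-width Latin; prints: tokyo
print(search_key("cafe\u0301") == search_key("café"))  # prints: True
```

Apply the same function to the query string at search time, otherwise the normalized index and the raw query still disagree.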

When to give up on Tesseract for a language

A pragmatic threshold: if Tesseract is producing under 85% character accuracy on clean scans of your target language, the engineering effort to fix it is rarely worth it. Move to a vision model or cloud OCR service. The cost difference at small volumes is negligible; at large volumes, the cost difference is real but so is the accuracy improvement.
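
To apply a threshold like this you need a number. A common way to get one is to hand-transcribe a few pages and compute character accuracy as one minus the edit distance divided by the reference length. The sketch below is a plain Levenshtein implementation with made-up function names; the source does not say how its 85% figure is measured, so treat this definition as an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def char_accuracy(reference, ocr_output):
    """1 minus the character error rate, clamped at 0."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    return max(0.0, 1 - edit_distance(reference, ocr_output) / len(reference))


print(edit_distance("kitten", "sitting"))  # prints: 3
print(char_accuracy("kitten", "sitting"))  # prints: 0.5
```

Normalize both strings the same way (for example to NFC) before comparing, or encoding differences will be counted as OCR errors.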

For one-off translations of non-English documents, the converter on this site with GPT-4o or Gemini handles most of the cases above without manual configuration — just pick the vision model and upload the PDF.
