Measuring and Improving OCR Accuracy — CER, WER, and What "95%" Really Means
Every OCR tool advertises an accuracy number. "99% accurate!" "State-of-the-art recognition!" These numbers are close to meaningless without knowing how they were measured and on what. A tool that's 99% accurate on clean printed text might be 70% accurate on your faxed, coffee-stained forms.
This guide explains how OCR accuracy is actually quantified, why a single percentage hides important detail, and how to benchmark tools on your documents instead of trusting a marketing figure.
Why "accuracy" needs a definition
"95% accurate" could mean any of several different things:
- 95% of characters correct
- 95% of words correct
- 95% of documents with zero errors
- 95% of fields (in a form) correct
These are wildly different bars. Because errors compound across units, 95% character accuracy implies far worse word accuracy: if 5% of characters are wrong, the errors are spread evenly, and the average word is 5 characters, then 1 − 0.95^5 ≈ 23% of words, roughly 1 in 4, contain an error. The headline number is almost always the most flattering framing. To compare tools meaningfully you need standard metrics.
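The compounding effect can be checked with a few lines of arithmetic. This is a rough sketch that assumes character errors are independent and evenly spread (real OCR errors tend to cluster), and the function name is only illustrative:

```python
def word_error_estimate(char_accuracy: float, avg_word_len: int) -> float:
    """Estimate the share of words containing at least one wrong character.

    Assumption: every character fails independently with the same
    probability. Real OCR errors cluster, so treat this as a rough guide.
    """
    # A word is fully correct only if all of its characters are correct.
    return 1.0 - char_accuracy ** avg_word_len


for acc in (0.99, 0.95, 0.90):
    share = word_error_estimate(acc, 5)
    print(f"char accuracy {acc:.0%} -> about {share:.1%} of 5-letter words affected")
```

Under those assumptions, 95% character accuracy leaves about 22.6% of five-character words with at least one error.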
The two metrics that matter: CER and WER
Character Error Rate (CER)
CER is the edit distance between the OCR output and the ground truth, divided by the number of characters in the ground truth:
CER = (substitutions + insertions + deletions) / total_characters
A CER of 0.02 means 2% of characters are wrong — substituted, missing, or invented. Lower is better. CER is the most fundamental OCR metric because it's granular and insensitive to how text is tokenized.
- Excellent: CER < 1% (clean modern print, good scan)
- Usable: CER 1–5% (typical real-world scans)
- Marginal: CER 5–10% (degraded documents; needs cleanup)
- Poor: CER > 10% (faxes, old print, bad scans, handwriting)
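To make the formula concrete, here is a minimal pure-Python sketch of character-level edit distance that also reports how many substitutions, insertions, and deletions were needed. The function names and the sample strings are made up for illustration; a library would normally do this for you.

```python
def edit_ops(ref, hyp):
    """Return (substitutions, insertions, deletions) for one minimal edit script."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i  # delete everything in ref[:i]
    for j in range(m + 1):
        dist[0][j] = j  # insert everything in hyp[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitution
                             dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1)          # insertion
    # Walk back from the bottom-right corner, counting each operation type.
    subs = ins = dels = 0
    i, j = n, m
    while i > 0 or j > 0:
        mismatch = int(i > 0 and j > 0 and ref[i - 1] != hyp[j - 1])
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + mismatch:
            subs += mismatch
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, ins, dels


def char_error_rate(ref, hyp):
    """CER = (substitutions + insertions + deletions) / characters in ref."""
    subs, ins, dels = edit_ops(ref, hyp)
    return (subs + ins + dels) / len(ref)


truth = "Total: 1024"
ocr = "Tota1: 1O24."  # two look-alike substitutions plus a stray period
print(edit_ops(truth, ocr), round(char_error_rate(truth, ocr), 3))
```

Note that the denominator is the ground-truth length, so a page with many invented characters can score a CER above 100%.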
Word Error Rate (WER)
WER is the same edit-distance idea applied to whole words:
WER = (word_subs + word_inserts + word_deletes) / total_words
WER is usually higher than CER because a single wrong character makes the whole word wrong; the gap narrows or reverses only when errors cluster in a few words. WER matters more when downstream use is word-oriented: search indexing, NLP, keyword matching. A document with CER 2% might have WER 8–10%.
Which to optimize depends on use: CER for character-level fidelity (and for languages without clear word boundaries), WER for search and text-mining workflows.
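The relationship between the two metrics is easy to see by running the same edit-distance routine once over characters and once over words. The sketch below is self-contained and assumes plain whitespace tokenization (punctuation stays attached to words), which is a simplification:

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j - 1] + (r != h),  # match or substitution
                            prev[j] + 1,              # deletion
                            curr[j - 1] + 1))         # insertion
        prev = curr
    return prev[-1]


def error_rates(ground_truth, ocr_output):
    """Return (CER, WER) for one pair of strings."""
    cer = levenshtein(ground_truth, ocr_output) / len(ground_truth)
    ref_words, hyp_words = ground_truth.split(), ocr_output.split()
    wer = levenshtein(ref_words, hyp_words) / len(ref_words)
    return cer, wer


ground_truth = "The quick brown fox jumps over the lazy dog."
ocr_output = "The qulck brown fox jumps ovor the Iazy dog."
cer, wer = error_rates(ground_truth, ocr_output)
print(f"CER: {cer:.3f}  WER: {wer:.3f}")
```

Three wrong characters out of 44 give a CER near 7%, but they land in three different words out of nine, so WER is about 33%.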
Computing the metrics yourself
You don't need special tooling. In Python, jiwer computes both:
from jiwer import cer, wer
# Both functions take the reference (ground truth) first, then the OCR output.
ground_truth = "The quick brown fox jumps over the lazy dog."
ocr_output = "The qulck brown fox jumps ovor the Iazy dog."
print(f"CER: {cer(ground_truth, ocr_output):.3f}")
print(f"WER: {wer(ground_truth, ocr_output):.3f}")
For a real benchmark, the work isn't the math — it's the ground truth. You need a set of documents where you know the correct text. The honest way to get it: pick 10–20 pages representative of your actual document mix and transcribe them by hand (or correct the OCR output meticulously). This is tedious but it's the only way to get numbers that mean something for your data.
Build a benchmark on your own documents
Marketing numbers are measured on clean, curated test sets — often academic corpora that look nothing like your inputs. The only accuracy figure that predicts your results is one measured on your documents. The process:
- Sample representatively. Pull pages that span your real range: clean ones, degraded ones, different fonts, any tables or columns, the languages you actually handle.
- Create ground truth. Hand-transcribe the sample. Decide rules up front: do you include headers/footers? page numbers? How do you handle figures?
- Run each candidate tool on the same pages with the same settings.
- Compute CER and WER per page, then aggregate. Look at the distribution, not just the mean — a tool with low average CER but a few catastrophic pages may be worse in practice than a consistent one.
- Inspect the errors. The kinds of errors matter: systematic confusions (l↔I, 0↔O, rn↔m) are fixable with post-processing; random garbage means the input is too degraded.
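The aggregation step in the process above might look something like the following sketch. The per-page CER values are invented purely for illustration, and the 10% cutoff for a "bad" page is an assumption you would tune to your own tolerance:

```python
from statistics import mean, median


def summarize(per_page_cer, bad_page_threshold=0.10):
    """Summarize one tool's per-page CER values (page id -> CER)."""
    values = list(per_page_cer.values())
    worst_page, worst_cer = max(per_page_cer.items(), key=lambda kv: kv[1])
    return {
        "mean": mean(values),
        "median": median(values),
        "worst_page": worst_page,
        "worst_cer": worst_cer,
        "pages_over_threshold": sum(v > bad_page_threshold for v in values),
    }


# Hypothetical results: tool A is excellent on most pages but fails badly on one.
tool_a = {"p1": 0.005, "p2": 0.004, "p3": 0.006, "p4": 0.200, "p5": 0.005}
tool_b = {"p1": 0.048, "p2": 0.052, "p3": 0.050, "p4": 0.049, "p5": 0.051}

for name, pages in (("A", tool_a), ("B", tool_b)):
    print(name, summarize(pages))
```

In this made-up data, tool A has the lower mean, yet it is the only one with a page over the threshold, which is exactly the case a mean alone would hide.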
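Twenty pages of ground truth is usually enough to rank tools when the differences between them are clear; if two engines land within a fraction of a percent of each other, add pages before declaring a winner. It's a half-day of work that saves you from picking the wrong engine for a 100,000-page project.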
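For the error-inspection step, one possible way to surface systematic confusions is to align each ground-truth line with its OCR output and tally the replaced spans. This sketch uses the standard library's difflib.SequenceMatcher, whose alignment is a heuristic rather than a guaranteed minimal edit script, and the sample pairs are made up:

```python
from collections import Counter
from difflib import SequenceMatcher


def confusion_counts(pairs):
    """Count (expected, got) replacements across (ground_truth, ocr) pairs."""
    counts = Counter()
    for truth, ocr in pairs:
        matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace":
                counts[(truth[i1:i2], ocr[j1:j2])] += 1
    return counts


# Hypothetical samples showing the rn -> m and l -> I confusions.
samples = [("corner", "comer"), ("modern", "modem"), ("lazy", "Iazy")]
for (expected, got), n in confusion_counts(samples).most_common():
    print(f"{expected!r} read as {got!r}: {n}")
```

A short list dominated by a few pairs suggests post-processing can help; a long tail of one-off replacements points back at the input image.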
What actually moves accuracy
Once you can measure, here's where the gains are, roughly in order of impact:
1. Input image quality (biggest lever)
OCR accuracy is capped by what the engine can see. Resolution below 300 DPI, skew, low contrast, and noise hurt more than tool choice. Rescanning at 300–400 DPI often beats switching engines. The full playbook is in image preprocessing for OCR — deskew, binarize, denoise, and ensure adequate DPI before recognition.
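A quick way to check whether a scan clears the 300 DPI bar is to divide its pixel dimensions by the physical page size. The sketch below assumes you know the paper size (US Letter is used as the default here) and that the scan covers the full page; both function names are illustrative:

```python
def effective_dpi(width_px, height_px, page_width_in, page_height_in):
    """Return the lower of the horizontal and vertical pixels per inch."""
    return min(width_px / page_width_in, height_px / page_height_in)


def needs_rescan(width_px, height_px, page_width_in=8.5, page_height_in=11.0,
                 min_dpi=300):
    """True when the scan's effective resolution falls below min_dpi."""
    return effective_dpi(width_px, height_px, page_width_in, page_height_in) < min_dpi


# A 1700 x 2200 pixel image of a US Letter page is only 200 DPI.
print(effective_dpi(1700, 2200, 8.5, 11.0), needs_rescan(1700, 2200))
# A 2550 x 3300 pixel image of the same page reaches 300 DPI.
print(effective_dpi(2550, 3300, 8.5, 11.0), needs_rescan(2550, 3300))
```

DPI metadata stored in the image file can be missing or wrong, so computing it from pixels and paper size is often the safer check.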
2. The right engine for the material
- Clean modern print: Tesseract is excellent and free. See the Tesseract OCR guide.
- Degraded, complex-layout, or mixed content: cloud services (Textract, Azure, Google) measurably outperform — see cloud OCR services compared.
- Handwriting: a specialized handwriting model, not general OCR — see OCRing handwritten documents.
- Non-English or mixed scripts: a tool with the right language packs — see multi-language OCR.
3. Correct language and segmentation settings
Telling the engine the right language and page-segmentation mode can cut error rates substantially, especially for non-Latin scripts and multi-column pages.
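With Tesseract, both settings are command-line options: `-l` selects one or more language packs and `--psm` selects the page-segmentation mode. The helper below is a small sketch that only builds the command; the file name is hypothetical, and the language packs named must actually be installed for the command to run:

```python
def tesseract_command(image_path, languages, psm):
    """Build an argv list for the Tesseract CLI that prints text to stdout.

    languages: language codes such as ["deu", "eng"], joined with "+".
    psm: page-segmentation mode, e.g. 3 (fully automatic, the default),
         4 (single column of variable-size text), 6 (single uniform block).
    """
    return ["tesseract", image_path, "stdout",
            "-l", "+".join(languages),
            "--psm", str(psm)]


# Hypothetical input: a German/English page laid out as a single column.
cmd = tesseract_command("scan_page_01.png", ["deu", "eng"], psm=4)
print(" ".join(cmd))

# To actually run it (requires Tesseract and the language data installed):
# import subprocess
# text = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

Running the same benchmark pages through two or three segmentation modes and comparing CER is a cheap way to find the right setting for a given layout.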
4. Post-processing
Dictionary-based correction, regex fixes for systematic confusions, and — increasingly — passing the OCR output to an LLM with the prompt "correct obvious OCR errors without changing meaning" can recover several percentage points of WER, especially on natural-language text. Be cautious with this on data like serial numbers or codes, where the model may "correct" something that was actually right.
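As an example of a regex fix for a systematic confusion, the sketch below rewrites O, I, and l as digits, but only inside tokens that consist entirely of digits and those look-alike letters and contain at least one digit. The pattern and the function name are illustrative choices, not a standard recipe:

```python
import re

# Tokens made up only of digits and look-alike letters, with at least one digit.
NUMERIC_LOOKALIKE = re.compile(r"\b(?=\w*\d)[0-9OIl]+\b")
TO_DIGITS = str.maketrans("OIl", "011")


def fix_numeric_confusions(text):
    """Replace O -> 0 and I/l -> 1 inside tokens that are otherwise numeric."""
    return NUMERIC_LOOKALIKE.sub(lambda m: m.group(0).translate(TO_DIGITS), text)


print(fix_numeric_confusions("Invoice 2O24, total 1l5 units"))
print(fix_numeric_confusions("Ill at ease"))  # no digits, so left alone
```

The same caution applies here as with LLM correction: an alphanumeric code such as "1O7I" could be legitimately mixed, so a rule like this belongs only on fields known to be numeric.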
Confidence scores ≠ accuracy
Most engines emit a per-word or per-character confidence score. These are useful for routing — flagging low-confidence regions for review — but they are not the same as accuracy. A confidently-recognized wrong character (a clean O that should be 0) scores high confidence and is still an error. Use confidence to triage, use ground-truth CER/WER to evaluate.
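Triage by confidence could be sketched as follows. The input is assumed to be a mapping from page id to a list of per-word confidences on a 0 to 100 scale (the scale Tesseract reports); the thresholds and sample values are made up and would need calibrating against pages whose true CER you know:

```python
def pages_for_review(page_confidences, min_mean=85.0, min_word=50.0):
    """Return page ids whose confidences suggest a human should look at them."""
    flagged = []
    for page, confs in page_confidences.items():
        if not confs:
            flagged.append(page)  # nothing recognized at all: always review
            continue
        mean_conf = sum(confs) / len(confs)
        # Flag on a low average or on any single very uncertain word.
        if mean_conf < min_mean or min(confs) < min_word:
            flagged.append(page)
    return flagged


# Hypothetical per-word confidences for four pages.
scores = {
    "p1": [96, 94, 98],
    "p2": [91, 40, 95],
    "p3": [70, 72, 68],
    "p4": [],
}
print(pages_for_review(scores))
```

This routes pages to review but says nothing about how many errors survive on the pages that pass, which is why the evaluation itself still needs ground truth.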
Quick reference
- Comparing tools? Compute CER and WER on 10–20 hand-transcribed pages of your own documents — never trust the marketing number.
- CER vs WER? CER for character-level fidelity and word-boundary-free languages; WER for search and text mining.
- Accuracy too low? Fix the image first (DPI, deskew, contrast), then the engine, then settings, then post-process — in that order.
- Need to find bad pages at scale? Use the engine's confidence scores to route low-confidence pages to review.
Conclusion
"Accuracy" is only a useful word once you attach a metric and a test set to it. CER and WER, computed on a small ground-truth sample of your own documents, turn vague vendor claims into a number you can act on. And when the number disappoints, the input image — not the engine — is usually the cheapest thing to fix.
To experiment quickly, the converter here runs Tesseract on scanned pages so you can eyeball results on a real document before committing to a full benchmark.