Measuring and Improving OCR Accuracy — CER, WER, and What "95%" Really Means

Every OCR tool advertises an accuracy number. "99% accurate!" "State-of-the-art recognition!" These numbers are close to meaningless without knowing how they were measured and on what. A tool that's 99% accurate on clean printed text might be 70% accurate on your faxed, coffee-stained forms.

This guide explains how OCR accuracy is actually quantified, why a single percentage hides important detail, and how to benchmark tools on your documents instead of trusting a marketing figure.

Why "accuracy" needs a definition

"95% accurate" could mean any of several different things:

These are wildly different bars. Because errors compound across units, 95% character accuracy implies far worse word accuracy: if 5% of characters are wrong, the errors fall independently, and the average word is 5 characters long, then 1 − 0.95^5 ≈ 23% of words, roughly 1 in 4, contain at least one error. The headline number is almost always the most flattering framing. To compare tools meaningfully you need standard metrics.
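The same arithmetic can be run for other accuracy levels. This is a rough sketch: the function name is illustrative, and it assumes character errors are independent of each other, which real OCR errors usually are not (they tend to cluster on bad regions of a page).

```python
def word_error_probability(char_accuracy: float, word_length: int) -> float:
    """Chance that a word contains at least one wrong character.

    Assumes every character is right or wrong independently of its
    neighbours. That is a simplification, not a property of real OCR.
    """
    return 1.0 - char_accuracy ** word_length


for accuracy in (0.99, 0.95, 0.90):
    p = word_error_probability(accuracy, word_length=5)
    print(f"character accuracy {accuracy:.0%}: about {p:.0%} of 5-letter words affected")
```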

The two metrics that matter: CER and WER

Character Error Rate (CER)

CER is the edit distance between the OCR output and the ground truth, divided by the number of characters in the ground truth:

CER = (substitutions + insertions + deletions) / total_characters

A CER of 0.02 means 2% of characters are wrong — substituted, missing, or invented. Lower is better. CER is the most fundamental OCR metric because it's granular and insensitive to how text is tokenized.

Word Error Rate (WER)

WER is the same edit-distance idea applied to whole words:

WER = (word_subs + word_inserts + word_deletes) / total_words

WER is almost always higher than CER on real documents, because one wrong character makes the whole word wrong. WER matters more when downstream use is word-oriented: search indexing, NLP, keyword matching. A document with a CER of 2% might have a WER of 8–10%.
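To make the two formulas concrete, here is a minimal pure-Python sketch. The function names and the sample strings are illustrative, and splitting words on whitespace is an assumption; real pipelines often normalize case and punctuation before scoring.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    # prev[j] = distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion: r is missing from the output
                curr[j - 1] + 1,         # insertion: h was invented by the OCR
                prev[j - 1] + (r != h),  # substitution, or free if they match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits / ground-truth characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits / ground-truth words (whitespace split)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


reference = "invoice total 1042"
hypothesis = "invo1ce total 1O42"
print(f"CER: {cer(reference, hypothesis):.3f}")  # 2 edits over 18 characters
print(f"WER: {wer(reference, hypothesis):.3f}")  # 2 wrong words out of 3
```

Because the denominator is the ground-truth length, both rates can exceed 1.0 when the output contains many inserted characters or words.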

Which to optimize depends on use: CER for character-level fidelity (and for languages without clear word boundaries), WER for search and text-mining workflows.

Computing the metrics yourself

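You don't need to write the math yourself. In Python, the jiwer package (pip install jiwer) computes both:

from jiwer import cer, wer

# Both functions take the reference (ground truth) first, then the OCR output.
ground_truth = "The quick brown fox jumps over the lazy dog."
ocr_output   = "The qulck brown fox jumps ovor the Iazy dog."

# 3 of 44 characters differ, so CER is about 0.068.
print(f"CER: {cer(ground_truth, ocr_output):.3f}")
# 3 of 9 words differ, so WER is about 0.333.
print(f"WER: {wer(ground_truth, ocr_output):.3f}")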

For a real benchmark, the work isn't the math — it's the ground truth. You need a set of documents where you know the correct text. The honest way to get it: pick 10–20 pages representative of your actual document mix and transcribe them by hand (or correct the OCR output meticulously). This is tedious but it's the only way to get numbers that mean something for your data.

Build a benchmark on your own documents

Marketing numbers are measured on clean, curated test sets — often academic corpora that look nothing like your inputs. The only accuracy figure that predicts your results is one measured on your documents. The process:

  1. Sample representatively. Pull pages that span your real range: clean ones, degraded ones, different fonts, any tables or columns, the languages you actually handle.
  2. Create ground truth. Hand-transcribe the sample. Decide rules up front: do you include headers/footers? page numbers? How do you handle figures?
  3. Run each candidate tool on the same pages with the same settings.
  4. Compute CER and WER per page, then aggregate. Look at the distribution, not just the mean — a tool with low average CER but a few catastrophic pages may be worse in practice than a consistent one.
  5. Inspect the errors. The kinds of errors matter: systematic confusions (l↔I, 0↔O, rn↔m) are fixable with post-processing; random garbage means the input is too degraded.
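Step 4 can be sketched as follows. The per-page CER values are invented purely for illustration and are not measurements of any real tool; the function name and the 10% "bad page" cut-off are likewise assumptions to adapt to your own tolerance.

```python
from statistics import mean, median


def summarize(per_page_cer: dict[str, float], bad_page_threshold: float = 0.10) -> dict:
    """Summarize per-page CER so that outlier pages stay visible."""
    values = list(per_page_cer.values())
    return {
        "mean": mean(values),
        "median": median(values),
        "worst": max(values),
        # Pages above the (arbitrary) threshold deserve a manual look.
        "bad_pages": sorted(p for p, v in per_page_cer.items() if v > bad_page_threshold),
    }


# Made-up numbers: tool A is excellent on most pages but fails badly on one.
tool_a = {"p1": 0.005, "p2": 0.005, "p3": 0.005, "p4": 0.005, "p5": 0.20}
tool_b = {"p1": 0.05, "p2": 0.05, "p3": 0.05, "p4": 0.05, "p5": 0.05}

for name, pages in (("A", tool_a), ("B", tool_b)):
    s = summarize(pages)
    print(f"tool {name}: mean={s['mean']:.3f} median={s['median']:.3f} "
          f"worst={s['worst']:.3f} bad_pages={s['bad_pages']}")
```

In this invented sample, tool A wins on mean CER yet is the only one with a page above the threshold, which is the situation the mean alone would hide.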

Twenty pages of ground truth is usually enough to rank tools when the gaps between them are clear. It's a half-day of work that saves you from picking the wrong engine for a 100,000-page project.
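For step 5, one way to surface systematic confusions is to tally which ground-truth fragments were replaced by which OCR fragments. The sketch below uses `difflib.SequenceMatcher` from the standard library; its alignment is not guaranteed to be the minimal edit alignment, so treat the counts as a guide for eyeballing rather than an exact metric. The function name and sample strings are illustrative.

```python
from collections import Counter
from difflib import SequenceMatcher


def confusion_counts(reference: str, hypothesis: str) -> Counter:
    """Count (ground-truth fragment, OCR fragment) pairs that were swapped."""
    matcher = SequenceMatcher(None, reference, hypothesis, autojunk=False)
    counts = Counter()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":  # skip pure insertions and deletions
            counts[(reference[i1:i2], hypothesis[j1:j2])] += 1
    return counts


counts = confusion_counts("The lazy dog ran 100 km", "The Iazy dog ran 1OO km")
for (truth, seen), n in counts.most_common():
    print(f"{truth!r} read as {seen!r}: {n}")
```

Summing these counters over every benchmark page shows which confusions recur often enough to be worth a post-processing rule.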

What actually moves accuracy

Once you can measure, here's where the gains are, roughly in order of impact:

1. Input image quality (biggest lever)

OCR accuracy is capped by what the engine can see. Resolution below 300 DPI, skew, low contrast, and noise often hurt more than tool choice. Rescanning at 300–400 DPI often beats switching engines. The full playbook is in the guide on image preprocessing for OCR: deskew, binarize, denoise, and ensure adequate DPI before recognition.
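A quick sanity check before recognition is to estimate the effective resolution of a scan from its pixel dimensions and the physical page size. This is only a sketch: the pixel dimensions are example values, the page is assumed to be US Letter, and the function name is illustrative.

```python
def effective_dpi(pixels: int, inches: float) -> float:
    """Pixels per inch along one dimension of the scanned page."""
    if inches <= 0:
        raise ValueError("inches must be positive")
    return pixels / inches


# Example values: a scan of a US Letter page (8.5 x 11 inches).
width_px, height_px = 1700, 2200
dpi = min(effective_dpi(width_px, 8.5), effective_dpi(height_px, 11.0))

if dpi < 300:
    print(f"about {dpi:.0f} DPI: consider rescanning at 300-400 DPI")
else:
    print(f"about {dpi:.0f} DPI: resolution is adequate")
```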

2. The right engine for the material

3. Correct language and segmentation settings

Telling the engine the right language and page-segmentation mode can cut error rates substantially, especially for non-Latin scripts and multi-column pages.
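As an illustration, assuming the engine is the Tesseract command-line tool, language and page-segmentation mode are passed with `-l` and `--psm`. The sketch below only builds the command; the file name is a placeholder, the helper name is illustrative, and the requested language data must be installed for the command to work.

```python
import shlex


def tesseract_command(image_path: str, lang: str = "eng", psm: int = 3) -> list[str]:
    """Build a Tesseract CLI invocation that writes recognized text to stdout.

    lang: language code(s), e.g. "deu" or "deu+eng" for mixed documents.
    psm:  page segmentation mode, e.g. 3 = fully automatic (the default),
          4 = single column of variable-size text, 6 = single uniform block.
    """
    if not 0 <= psm <= 13:
        raise ValueError("Tesseract page segmentation modes run from 0 to 13")
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


cmd = tesseract_command("page_001.png", lang="deu", psm=4)
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, capture_output=True, text=True, check=True)
```

Running the same benchmark pages through two or three modes and comparing CER is usually quicker than guessing which mode fits a layout.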

4. Post-processing

Dictionary-based correction, regex fixes for systematic confusions, and — increasingly — passing the OCR output to an LLM with the prompt "correct obvious OCR errors without changing meaning" can recover several percentage points of WER, especially on natural-language text. Be cautious with this on data like serial numbers or codes, where the model may "correct" something that was actually right.
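A regex fix for one systematic confusion might look like the sketch below. It rewrites the letters O, I and l as digits only inside tokens that otherwise consist of digits, and leaves ordinary words and mixed alphanumeric codes alone. The rule, the names, and the sample text are assumptions for illustration; any real rule should come from the confusions you actually observed, and be checked against fields where a letter next to a digit is legitimate.

```python
import re

# Letters that OCR commonly confuses with digits (extend from your own data).
_DIGIT_LOOKALIKES = str.maketrans({"O": "0", "I": "1", "l": "1"})

# A whole token made only of digits and look-alike letters,
# containing at least one real digit.
_NUMERIC_TOKEN = re.compile(r"\b[0-9OIl]*[0-9][0-9OIl]*\b")


def fix_digit_confusions(text: str) -> str:
    """Replace look-alike letters with digits inside numeric-looking tokens."""
    return _NUMERIC_TOKEN.sub(
        lambda m: m.group(0).translate(_DIGIT_LOOKALIKES), text
    )


print(fix_digit_confusions("Total: 1O42 units, lot l5, code X1O4"))
```

Here `1O42` and `l5` are rewritten, while `X1O4` is left untouched because it contains a letter outside the look-alike set.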

Confidence scores ≠ accuracy

Most engines emit a per-word or per-character confidence score. These are useful for routing — flagging low-confidence regions for review — but they are not the same as accuracy. A confidently-recognized wrong character (a clean O that should be 0) scores high confidence and is still an error. Use confidence to triage, use ground-truth CER/WER to evaluate.
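Triage by confidence can be sketched like this. The `Word` type, the 0–100 scale, the threshold of 80, and the sample values are all assumptions for illustration; confidence scales and sensible thresholds differ between engines.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    confidence: float  # assumed 0-100 scale; engine-specific in practice


def triage(words: list[Word], threshold: float = 80.0) -> tuple[list[Word], list[Word]]:
    """Split words into (auto-accepted, needs human review) by confidence."""
    accepted = [w for w in words if w.confidence >= threshold]
    review = [w for w in words if w.confidence < threshold]
    return accepted, review


# Invented values: "1O42" is wrong (O for 0) yet scores high confidence.
words = [Word("Invoice", 96.0), Word("1O42", 91.0), Word("tota1", 54.0)]
accepted, review = triage(words)
print("accepted:", [w.text for w in accepted])
print("review:  ", [w.text for w in review])
```

The confidently wrong `1O42` passes the filter, which is why the review queue reduces effort but cannot stand in for a ground-truth measurement.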

Quick reference

Conclusion

"Accuracy" is only a useful word once you attach a metric and a test set to it. CER and WER, computed on a small ground-truth sample of your own documents, turn vague vendor claims into a number you can act on. And when the number disappoints, the input image — not the engine — is usually the cheapest thing to fix.

To experiment quickly, the converter here runs Tesseract on scanned pages so you can eyeball results on a real document before committing to a full benchmark.
