Image Preprocessing for OCR — DPI, Deskew, Contrast, Binarization

OCR accuracy is more sensitive to image quality than to engine choice. The same Tesseract installation can produce 60% accuracy on a poorly prepared image and 97% accuracy on the same content scanned and processed properly. Modern vision models are more forgiving but still benefit substantially from clean inputs.

This guide walks through the preprocessing steps that consistently improve OCR output and the ones that look helpful but actually hurt.

DPI: the most important variable

Resolution is the single biggest lever. Tesseract, in particular, is calibrated for content that's roughly 30–40 pixels tall per character — equivalent to body text rendered at 300 DPI. Below that, the engine's character segmenter starts failing. Above 600 DPI, you're spending compute without accuracy gain.

The practical guidance:

For batch jobs, the rendering cost scales with DPI squared. 600 DPI uses 4× the memory and compute of 300 DPI. Don't oversample if you don't need to.
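The arithmetic behind these numbers is simple enough to check before rendering anything. The sketch below is illustrative only: the function names are made up for this example, and it assumes you know the physical width of the page in inches.

```python
def effective_dpi(width_px: int, page_width_in: float) -> float:
    """Pixels per inch implied by an image width and the physical page width."""
    return width_px / page_width_in


def upscale_factor(current_dpi: float, target_dpi: float = 300.0) -> float:
    """Resize factor needed to reach target_dpi; 1.0 if already at or above it."""
    return max(1.0, target_dpi / current_dpi)


def relative_cost(dpi: float, baseline_dpi: float = 300.0) -> float:
    """Pixel count (and so memory and compute) relative to the baseline DPI."""
    return (dpi / baseline_dpi) ** 2


# A US Letter page (8.5 in wide) captured 1275 px across is only 150 DPI.
dpi = effective_dpi(1275, 8.5)
print(dpi)                  # 150.0
print(upscale_factor(dpi))  # 2.0
print(relative_cost(600))   # 4.0
```

The last line reproduces the 4× figure: doubling the DPI doubles both dimensions, so the pixel count quadruples.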

Deskew

Documents fed through a scanner or photographed by hand are rarely perfectly aligned. Even a 1–2° rotation hurts OCR accuracy because the engine's character segmenter assumes horizontal baselines.

Detection: most OCR engines can detect skew but won't correct it for you. Use a separate deskew step:

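import cv2
import numpy as np

def deskew(image):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Otsu-threshold so that only ink pixels (white after inversion) feed the angle estimate
    _, ink = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # np.where yields (row, col); flip to (x, y), which is what OpenCV expects
    coords = np.column_stack(np.where(ink > 0))[:, ::-1].astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    # Older OpenCV releases report the angle in [-90, 0), newer ones in (0, 90].
    # Fold it into [-45, 45] so both conventions give the same small skew angle.
    if angle > 45:
        angle -= 90
    elif angle < -45:
        angle += 90
    h, w = image.shape[:2]
    M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
    return cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)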

The OpenCV approach above is adequate for clean scans. For more robust deskew across varied documents, deskew (the Python package) or ImageMagick's -deskew 40% produces slightly better results on tricky cases.

When not to deskew: documents where the rotation is intentional (a landscape page, a rotated form). Most deskew algorithms can't tell the difference and will rotate the whole page. Add a check: if the detected angle is greater than ~20°, leave it alone.
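The angle check described above can be a small guard in front of the rotation step. This is a sketch, not a fixed rule: the function name is hypothetical, the 20° ceiling comes from the text, and the 0.1° floor is an added assumption to avoid resampling pages that are already straight.

```python
def should_deskew(angle_deg: float,
                  min_angle: float = 0.1,
                  max_angle: float = 20.0) -> bool:
    """Decide whether a detected skew angle is worth correcting.

    Angles above max_angle are treated as intentional rotation (landscape
    pages, rotated forms). Angles below min_angle are treated as noise,
    because rotating by them only blurs the image through resampling.
    """
    magnitude = abs(angle_deg)
    return min_angle <= magnitude <= max_angle


print(should_deskew(1.5))    # True: typical scanner skew
print(should_deskew(-1.5))   # True: direction does not matter
print(should_deskew(90.0))   # False: probably a landscape page
print(should_deskew(0.02))   # False: not worth resampling
```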

Denoising

Scans often have noise: paper texture, JPEG compression artifacts, dust speckles, residual stamps. Denoising helps traditional OCR engines, but it is risky: the same filters that remove speckles also soften character edges.

For Tesseract:

For vision models:

The anti-pattern: applying heavy denoising on already-clean digital documents. It removes nothing useful, softens character edges, and hurts accuracy.
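One way to avoid that anti-pattern is to measure the noise first and only denoise when there is something to remove. The sketch below is a rough heuristic, not an established method: both function names and the threshold of 3.0 gray levels are assumptions for illustration, and the threshold should be tuned on your own scans.

```python
import numpy as np


def estimate_noise(gray: np.ndarray) -> float:
    """Rough noise level of a grayscale image, in gray levels.

    Compares each interior pixel with the mean of its four neighbours and
    takes a robust spread (scaled median absolute deviation) of the
    difference, so sparse text edges barely affect the result.
    """
    g = gray.astype(np.float32)
    neighbours = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]) / 4.0
    residual = g[1:-1, 1:-1] - neighbours
    mad = np.median(np.abs(residual - np.median(residual)))
    return float(1.4826 * mad)


def needs_denoising(gray: np.ndarray, threshold: float = 3.0) -> bool:
    """True when the estimated noise exceeds the (hypothetical) threshold."""
    return estimate_noise(gray) > threshold


# A clean digital page: flat white background with a solid black block.
clean = np.full((64, 64), 255, dtype=np.uint8)
clean[20:40, 20:40] = 0

# A simulated noisy scan: grey background with Gaussian noise added.
rng = np.random.default_rng(0)
noisy = np.clip(200 + rng.normal(0, 10, (64, 64)), 0, 255).astype(np.uint8)

print(needs_denoising(clean))  # False
print(needs_denoising(noisy))  # True
```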

Binarization

Binarization converts a grayscale or color image to pure black-and-white. It is critical for Tesseract, which works best on binary input, and irrelevant or even harmful for vision models.

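Two main approaches: Otsu's method picks a single global threshold and suits uniform backgrounds; adaptive thresholding computes a local threshold for each neighborhood and suits uneven lighting.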

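import cv2

# Load as 8-bit single-channel grayscale, which both thresholding calls require
gray = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)

# Otsu: one global threshold, for uniform backgrounds
_, binary_otsu = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Adaptive: per-neighborhood threshold, for uneven lighting
binary_adaptive = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                        cv2.THRESH_BINARY, 31, 10)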

Tune the block size (31 above) to roughly 2–3× the height of a character; OpenCV requires it to be an odd number. The constant offset (10) is subtracted from each local mean, so it controls how aggressively the threshold separates ink from paper.
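The block-size rule is easy to get slightly wrong because cv2.adaptiveThreshold only accepts odd sizes greater than 1. A minimal helper, assuming you have a rough character height in pixels (the function name and the 2.5× default are illustrative choices within the 2–3× range):

```python
def adaptive_block_size(char_height_px: float, multiple: float = 2.5) -> int:
    """Block size for adaptive thresholding from an approximate character height.

    Scales the character height by `multiple`, then bumps the result to the
    next odd integer and enforces a minimum of 3.
    """
    size = int(round(char_height_px * multiple))
    if size % 2 == 0:
        size += 1
    return max(3, size)


print(adaptive_block_size(12))  # 31
```

With 12 px characters this lands on 31, the value used in the adaptive threshold call above.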

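For images going to a vision model, skip binarization. The vision model uses grayscale and color information; throwing it away hurts. Tesseract is the main engine that benefits from binary input: it accepts grayscale and binarizes internally with Otsu's method, but that global threshold can be suboptimal when the background is uneven.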

Contrast and levels

For faded ink, poor scans, or printed material with low contrast:

# ImageMagick: stretch the histogram, pushing the lightest to white and darkest to black
convert input.png -auto-level output.png

# More aggressive: stretch with explicit cutoffs
convert input.png -level 20%,80% output.png

# Increase contrast without clipping
convert input.png -contrast -contrast output.png

The -level 20%,80% form is the workhorse. It maps every pixel at or below 20% of the full intensity range to black and every pixel at or above 80% to white, stretching the values in between linearly. The cutoffs are fixed intensity values, not percentiles of the image's histogram. Effective for most documents that look "dim" or "washed out".

For very faded documents (carbon copies, old thermal-paper receipts), pushing the levels harder (-level 35%,75%) recovers more text but starts to introduce false strokes from paper texture. Test on a sample before batch-processing.
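If the rest of your pipeline is in Python, the same linear stretch can be approximated with NumPy. This is a sketch of the arithmetic only, assuming an 8-bit grayscale array and black and white points given as fractions of the intensity range; it is not a byte-for-byte reimplementation of ImageMagick, and the function name is made up for this example.

```python
import numpy as np


def apply_levels(gray: np.ndarray, black: float = 0.20, white: float = 0.80) -> np.ndarray:
    """Linear levels stretch on an 8-bit grayscale image.

    Values at or below `black` (fraction of full range) become 0, values at
    or above `white` become 255, and everything in between is scaled linearly.
    """
    if not 0.0 <= black < white <= 1.0:
        raise ValueError("expected 0 <= black < white <= 1")
    scaled = gray.astype(np.float32) / 255.0
    stretched = (scaled - black) / (white - black)
    return (np.clip(stretched, 0.0, 1.0) * 255.0).round().astype(np.uint8)


sample = np.array([[0, 51, 128, 204, 255]], dtype=np.uint8)
print(apply_levels(sample).tolist())  # [[0, 0, 128, 255, 255]]
```

Tightening the interval, for example apply_levels(gray, 0.35, 0.75), corresponds to the harder push described above and carries the same risk of amplifying paper texture.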

Cropping

OCR engines don't waste time on white space, but they do sometimes pick up content from page edges (binder shadows, ruler marks, fingertips in phone photos) that isn't part of the document.

Practical cropping:

For phone photos specifically, a "perspective correction" step that detects the four corners of the document and warps it to a flat rectangle has dramatic effects on OCR accuracy. The OpenCV approach:

import cv2
import numpy as np

def perspective_correct(image, corners, size=(850, 1100)):
    """corners: 4 (x,y) points in TL, TR, BR, BL order; size: output (width, height)."""
    # 850x1100 is a US Letter page at only 100 DPI; pass (2550, 3300) for 300 DPI output
    w, h = size
    src = np.array(corners, dtype=np.float32)
    dst = np.array([[0, 0], [w, 0], [w, h], [0, h]], dtype=np.float32)
    M = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(image, M, (w, h))

Mobile document scanners (Adobe Scan, Microsoft Lens, Apple Notes) do this automatically. If you're using one of them as the capture path, the perspective is already corrected.
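If you detect the corners yourself, they rarely arrive in the TL, TR, BR, BL order that the perspective function above expects. A common heuristic, sketched here with a hypothetical helper name, sorts them using the sum and difference of the coordinates. It assumes the page is not rotated by much more than about 45°; beyond that the labels become ambiguous.

```python
import numpy as np


def order_corners(points) -> np.ndarray:
    """Return four (x, y) points ordered TL, TR, BR, BL.

    In image coordinates (y grows downward) the top-left corner has the
    smallest x + y and the bottom-right the largest; the top-right has the
    smallest y - x and the bottom-left the largest.
    """
    pts = np.asarray(points, dtype=np.float32)
    if pts.shape != (4, 2):
        raise ValueError("expected four (x, y) points")
    sums = pts.sum(axis=1)
    diffs = pts[:, 1] - pts[:, 0]
    tl = pts[np.argmin(sums)]
    br = pts[np.argmax(sums)]
    tr = pts[np.argmin(diffs)]
    bl = pts[np.argmax(diffs)]
    return np.array([tl, tr, br, bl], dtype=np.float32)


detected = [(820, 40), (30, 1050), (25, 30), (800, 1080)]
print(order_corners(detected).tolist())
# [[25.0, 30.0], [820.0, 40.0], [800.0, 1080.0], [30.0, 1050.0]]
```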

Color handling

Most OCR engines convert to grayscale internally; passing a color image just slows them down. But:

For vision models, send color. They use it.

What the engines do internally

Knowing what the engine does automatically tells you what not to do:

The high-level rule: traditional OCR engines (Tesseract being the main one) benefit from aggressive preprocessing; modern ML-based engines do not.

A pragmatic preprocessing pipeline

For Tesseract on scanned documents:

  1. Upscale to 300 DPI if lower.
  2. Convert to grayscale.
  3. Light Gaussian blur (1 px radius) for noisy scans.
  4. Deskew.
  5. Adaptive threshold.

For vision models:

  1. Resize to 1024–2048 px on the long side.
  2. Send as-is. Skip everything else.
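The resize step for vision models is just a bounds check on the long side. A minimal sketch, with a made-up function name, that returns the target dimensions and leaves the actual resampling to whatever image library you use:

```python
def fit_long_side(width: int, height: int,
                  min_long: int = 1024, max_long: int = 2048) -> tuple:
    """Target (width, height) whose long side falls within [min_long, max_long].

    The aspect ratio is preserved; images already inside the range are
    returned unchanged.
    """
    long_side = max(width, height)
    if long_side > max_long:
        scale = max_long / long_side
    elif long_side < min_long:
        scale = min_long / long_side
    else:
        return width, height
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_long_side(2550, 3300))  # (1583, 2048): a 300 DPI Letter scan, shrunk
print(fit_long_side(1200, 1600))  # (1200, 1600): already in range
print(fit_long_side(400, 512))    # (800, 1024): small image, enlarged
```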

For the converter on this site, preprocessing happens automatically before OCR. If you're running OCR locally and getting poor results, the highest-leverage fixes are checking the DPI and turning off any preprocessing that isn't earning its keep.
