Optical Character Recognition (OCR) - Getting Started: Tesseract

Basic overview of several tools (both open source such as Tesseract and commercial such as Adobe Acrobat) that perform optical character recognition (OCR).

Tesseract

Tesseract Overview

Tesseract is an open-source (Apache 2.0 license) command-line program without a built-in graphical interface. It was originally developed by HP, was developed by Google from 2006 to 2018, and is now maintained as a community open-source project. In addition to plain text output, Tesseract can produce PDF and hOCR formats.

Installation

Tesseract is available on GitHub, and can be installed on Mac, Windows, or Linux. A Docker container is also available.
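After installing, it can help to confirm that the command is reachable from your shell. The following is a minimal sketch, assuming a POSIX-style shell; the function name check_tesseract is made up for this example and is not part of Tesseract.

```shell
# Report whether the tesseract command is available on the PATH.
# check_tesseract is a hypothetical helper name, not a Tesseract command.
check_tesseract() {
    if command -v tesseract >/dev/null 2>&1; then
        # --version prints the version followed by library details;
        # keep only the first line.
        tesseract --version 2>&1 | head -n 1
    else
        echo "tesseract not found on PATH" >&2
        return 1
    fi
}
```

Running check_tesseract should print a single version line if the installation succeeded, or an error message if the command cannot be found.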

Using Tesseract from the command line

Basic usage guidance for Tesseract is available on GitHub.

OCR a single image (e.g., JPG, TIF) to plain text

$ tesseract [file/path/to/image/file] [file/path/for/text/output]

Tesseract adds the '.txt' extension to the output path itself, so give the output name without an extension. For example, the following would take the file 'myimage.jpg' from the current directory and create a new text file called 'myimage.txt' in the same directory:

$ tesseract myimage.jpg myimage
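To OCR several images to separate text files, the same command can be wrapped in a loop. This is a rough sketch rather than an official recipe: the function name ocr_images_to_text is invented for illustration, and it assumes every filename has an extension such as .jpg or .tif. Because Tesseract appends '.txt' to the output name on its own, the loop strips the image extension and passes the remainder as the output name.

```shell
# OCR each image passed as an argument to a text file alongside it.
# ocr_images_to_text is a hypothetical helper name.
ocr_images_to_text() {
    for img in "$@"; do
        # Strip the final extension: scans/page1.jpg -> scans/page1
        base="${img%.*}"
        # Tesseract writes its result to "$base.txt"
        tesseract "$img" "$base" || echo "failed: $img" >&2
    done
}
```

For example, ocr_images_to_text page1.jpg page2.tif would be expected to create page1.txt and page2.txt in the current directory.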

OCR a single image (e.g., JPG, TIF) to PDF

$ tesseract [file/path/to/image/file] [file/path/for/pdf/output] pdf

OCR multiple images into a single plain text file

Tesseract will also accept a list of filenames as input, which it will turn into a single text file of the OCR'd output. Create a text file called 'filenames.txt' with the paths to the image files, e.g.,

[file/path/to/image/file1]
[file/path/to/image/file2]
[file/path/to/image/file3]
[file/path/to/image/file4]

Then run:

$ tesseract filenames.txt [path/to/combined/output/file]
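Typing the list by hand gets tedious for a large folder, so the list file can also be generated. The sketch below assumes the images are .jpg or .tif files in a single directory; the function name make_file_list and its two arguments (the image directory and the list file to write) are hypothetical.

```shell
# Write the paths of the .jpg and .tif files in a directory to a list
# file, one path per line: JPG files first, then TIF files, each group
# in the shell's glob order.
# make_file_list is a hypothetical helper name.
make_file_list() {
    list_dir="$1"
    list_out="$2"
    : > "$list_out"                  # create or empty the list file
    for f in "$list_dir"/*.jpg "$list_dir"/*.tif; do
        [ -e "$f" ] || continue      # skip a pattern that matched nothing
        printf '%s\n' "$f" >> "$list_out"
    done
}
```

For example, make_file_list scans filenames.txt would write one line per image found in the 'scans' directory. The order of the lines determines the order of the pages in the combined output, so check the list before running Tesseract on it.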

Languages

Support for many languages is available on GitHub. You may need to download the data files for a specific language after you've installed Tesseract.

For example, to specify Spanish:

$ tesseract -l spa [file/path/to/image/file] [file/path/for/text/output]
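Before running OCR in a given language, you may want to check that its data file is installed. The sketch below relies on Tesseract's --list-langs option, which prints a header line followed by one language code per line; the function name has_tesseract_lang is made up for this example.

```shell
# Succeed if the given language code (e.g. spa) appears in the list of
# installed languages. has_tesseract_lang is a hypothetical helper name.
has_tesseract_lang() {
    # grep -x matches a whole line exactly, so "spa" does not match
    # longer codes such as "spa_old"; -q suppresses the output.
    tesseract --list-langs 2>&1 | grep -qx "$1"
}
```

For example, has_tesseract_lang spa || echo "Spanish data not installed" prints the message only when 'spa' is missing. Tesseract also accepts several languages joined with a plus sign, such as -l eng+spa, for documents that mix languages.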