Tesseract is an open-source (Apache 2 license) command line program without a built-in graphical interface. It was originally developed by HP, and is now developed by Google. In addition to plain text outputs, Tesseract can produce PDF and hOCR formats.
Tesseract is available on GitHub, and can be installed on Mac, Windows, or Linux. A Docker container is also available.
Basic usage guidance for Tesseract is available on GitHub.
$ tesseract [file/path/to/image/file] [file/path/for/text/output]
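For example, the following would take the file 'myimage.jpg' from the current directory and create a new text file called 'myimage.txt' in the same directory. Tesseract adds the '.txt' extension to the output name itself, so the extension is left off the command:
$ tesseract myimage.jpg myimage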
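The single-image command can be repeated over a folder of images. The following is a minimal sketch, assuming a POSIX shell, that 'tesseract' is on the PATH, and that the images are JPEG files in the current directory. The names 'output_base' and 'ocr_all_jpgs' are hypothetical helpers made up for this example and are not part of Tesseract. Because Tesseract appends '.txt' to the output name, the helper strips the image extension first.

```shell
#!/bin/sh
# output_base is a hypothetical helper, not part of Tesseract.
# It strips the final extension from a path, e.g. myimage.jpg -> myimage,
# so that Tesseract can append ".txt" to the result itself.
output_base() {
    printf '%s\n' "${1%.*}"
}

# ocr_all_jpgs is also a hypothetical helper.
# It runs Tesseract once per JPEG in the current directory,
# writing one text file per image.
ocr_all_jpgs() {
    for img in *.jpg; do
        # Skip the unexpanded pattern when no .jpg files are present.
        [ -e "$img" ] || continue
        tesseract "$img" "$(output_base "$img")"
    done
}
```

Running 'ocr_all_jpgs' in a directory that contains 'myimage.jpg' would call 'tesseract myimage.jpg myimage'.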
Tesseract will also accept a list of filenames as input, which it will turn into a single text file of the OCR'd output. Create a text file called 'filenames.txt' with the paths to the image files, one per line, e.g.,
[file/path/to/image/file1]
Then run:
$ tesseract filenames.txt [path/to/combined/output/file]
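The list file can be generated rather than typed by hand. The sketch below assumes a shell with 'find' (including the widely available '-maxdepth' option) and 'sort', and that the images use the '.jpg', '.png', or '.tif' extensions. 'make_filelist' is a hypothetical helper name chosen for this example, not a Tesseract command.

```shell
#!/bin/sh
# make_filelist is a hypothetical helper, not part of Tesseract.
# Usage: make_filelist <image-dir> <list-file>
# Writes the paths of the .jpg, .png, and .tif files directly inside
# <image-dir> into <list-file>, one path per line, sorted by name.
make_filelist() {
    find "$1" -maxdepth 1 -type f \
        \( -name '*.jpg' -o -name '*.png' -o -name '*.tif' \) \
        | LC_ALL=C sort > "$2"
}
```

For instance, 'make_filelist scans filenames.txt' would list the images in a folder called 'scans' (an assumed folder name), after which the 'tesseract filenames.txt' command shown above can be run on the result. Sorting by name only gives the right page order if the files are named so that they sort in that order.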
Support for many languages is available on GitHub. You may need to download the data files for a specific language after you've installed Tesseract.
For example, to specify Spanish:
$ tesseract -l spa [file/path/to/image/file] [file/path/for/text/output]
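Before running with a particular language, it can help to confirm that its data file is installed. Tesseract's '--list-langs' option prints the installed language codes, one per line. The sketch below assumes a recent Tesseract version that writes this list to standard output. 'has_lang' is a hypothetical helper name for this example, not part of Tesseract.

```shell
#!/bin/sh
# has_lang is a hypothetical helper, not part of Tesseract.
# It reads the output of `tesseract --list-langs` on standard input and
# succeeds (exit status 0) if the given language code appears on a line
# by itself.
has_lang() {
    grep -qxF -- "$1"
}
```

A possible use is 'tesseract --list-langs | has_lang spa || echo "Spanish data not installed"'. Several installed languages can also be combined with a plus sign, for example '-l eng+spa'.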