What Is OCR?

OCR stands for Optical Character Recognition: software that lets a computer read the text inside images. Imagine taking a photo of your favorite passage from one of the Lord of the Rings books.

You’d like to quote it elsewhere, but all you have is a photo. OCR software can help by parsing that image and finding all the text within it.

For each character it finds, the OCR software analyzes the pixels in the image and translates them into actual text that a computer can use, for example in a word processor.

While many OCR packages are available, some paid and some free, they are not all of the same quality. Some produce poor results; others closely match the text seen in the photo or image.

Generally speaking, standard books (or printed web pages) work very well and should produce reasonable results, as the fonts are straight, uniform, and at a single angle, provided that the original photo or scan is of reasonable quality.

Keep in mind that even advanced software packages may struggle with poor-quality or blurred images, and most packages struggle with varied handwriting styles. Other challenges include text mixed with images or photos, and text running in different directions (for example left-to-right as well as top-down, or angled text) within the same page.

This can make choosing, and potentially paying for, an OCR package a long-winded process, especially if you want to test and evaluate each package.

For those using Linux, there is a great alternative: a free, top-quality OCR package based on an LSTM neural network, with Unicode (UTF-8) support, that can recognize more than 100 languages by default. It also supports many output formats, such as HTML (hOCR), PDF, and plain text.

Without further ado: welcome to Tesseract OCR!

Installing Tesseract OCR

To install Tesseract OCR on your Debian/Apt-based Linux distribution (like Ubuntu and Mint), do:

sudo apt install tesseract-ocr libtesseract-dev tesseract-ocr-eng

To install Tesseract OCR on RHEL and CentOS, do:

sudo yum install epel-release
sudo yum install tesseract-devel leptonica-devel

To install Tesseract OCR on Fedora, do:

sudo yum install tesseract-devel leptonica-devel

To install Tesseract OCR on macOS (using Homebrew), do:

brew install tesseract
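Whichever route you used, a quick check confirms that the tesseract binary ended up on your PATH. This is only a minimal sketch; the status variable is a name chosen for this example and is not part of Tesseract:

```shell
# Check whether the tesseract binary can be found, and print its version if so.
if command -v tesseract >/dev/null 2>&1; then
    tesseract --version
    status="installed"
else
    echo "tesseract was not found in PATH" >&2
    status="missing"
fi
```

If the version banner appears, the installation worked and you are ready to continue.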

Let’s OCR!

We will use a simple image, input_for_ocr.png, that contains some sample text.

To convert this image, open a terminal and change to the directory that contains your images with cd your_directory_with_images (for example, cd ~/images if you created an images directory in your home directory). Then run tesseract with the language option, the input image, and an output base name.
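Going by the options this section describes (the -l eng language flag, the input image input_for_ocr.png, and the output name output_from_ocr), the invocation can be sketched as follows. The guard around the call is an addition for this example, so that the snippet fails gently when Tesseract or the image is missing:

```shell
# File names are the ones used in this article; run this from the image directory.
input="input_for_ocr.png"
output_base="output_from_ocr"
text_file="$output_base.txt"   # Tesseract adds the .txt extension itself

if command -v tesseract >/dev/null 2>&1 && [ -f "$input" ]; then
    # Image first, then the output base name, then the language option.
    tesseract "$input" "$output_base" -l eng
    # Show the recognized text.
    cat "$text_file"
else
    echo "Skipping: tesseract or $input not found" >&2
fi
```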

Very simple and straightforward, and the recognized text matches the image perfectly.

We specify the English language by using the -l eng option. You can check the tesseract manual (man tesseract) for any other available language codes.
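Besides the manual, Tesseract can report which language codes are actually installed on your machine. The sketch below assumes a small helper, has_lang, which is a hypothetical name defined here and not a Tesseract feature; deu (German) is just an example code:

```shell
# Reads a language list on stdin; succeeds if code $1 appears on a line by itself.
has_lang() {
    grep -qx -- "$1"
}

if command -v tesseract >/dev/null 2>&1; then
    # Print the language codes that are installed on this machine.
    tesseract --list-langs

    # Older releases print the list on stderr, hence the 2>&1.
    if tesseract --list-langs 2>&1 | has_lang deu; then
        echo "German is installed; select it with: -l deu"
    else
        echo "German is not installed"
    fi
fi
```

On Debian-based systems, extra languages typically ship as packages that follow the same naming pattern as tesseract-ocr-eng above.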

We also specified the input image (input_for_ocr.png) as well as the output base name output_from_ocr, without any file extension; Tesseract appends .txt for the default plain-text format.

We can also change the output format to PDF by using a slightly longer command, which simply specifies the output format, pdf, at the end.
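Reusing the file names from this article, the PDF variant can be sketched like this. As before, the guard is an addition for this example and not something Tesseract requires:

```shell
# File names are the ones used in this article.
input="input_for_ocr.png"
output_base="output_from_ocr"
pdf_file="$output_base.pdf"   # Tesseract adds the .pdf extension itself

if command -v tesseract >/dev/null 2>&1 && [ -f "$input" ]; then
    # The trailing "pdf" selects PDF output instead of plain text.
    tesseract "$input" "$output_base" -l eng pdf
    ls -l "$pdf_file"
else
    echo "Skipping: tesseract or $input not found" >&2
fi
```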

Adding the pdf suffix made Tesseract write its output as a PDF. When we open the PDF file (output_from_ocr.pdf), the text can be selected and copied/pasted, as we did with the word Readers! from the sample.

In other words, the PDF file contains text-based and selectable data, not graphical (and therefore unselectable) information. Great!
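You can also confirm the text layer from the command line. This sketch assumes the pdftotext tool (part of poppler-utils on most distributions) is installed; contains_text is a hypothetical helper defined only for this example:

```shell
# Reads text on stdin; succeeds if it contains at least one letter or digit.
contains_text() {
    grep -q '[[:alnum:]]'
}

pdf_file="output_from_ocr.pdf"

if command -v pdftotext >/dev/null 2>&1 && [ -f "$pdf_file" ]; then
    # "-" sends the extracted text to standard output instead of a file.
    if pdftotext "$pdf_file" - | contains_text; then
        echo "$pdf_file has a text layer"
    else
        echo "$pdf_file appears to contain images only"
    fi
fi
```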

What if I Want to OCR a PDF file?

Sometimes you may receive a PDF file which – though the PDF format supports actual text inside pages – contains only images of text. This can be frustrating, as copy and paste will not be available. You can OCR these pages too, with a small workaround.

You will first want to convert your PDF file to images – one image per page – and then OCR the individual pages into text. A little more work, but still a great time saver over re-typing text manually.
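The two steps can be sketched as a small shell function. This assumes pdftoppm (from poppler-utils) for the page conversion, which may differ from the tool used in the article referenced below; ocr_pdf, output_base_for, and scanned.pdf are hypothetical names chosen for this example:

```shell
# Strip the .png extension to get the output base name Tesseract expects.
output_base_for() {
    printf '%s\n' "${1%.png}"
}

# Convert each page of a PDF to a PNG, OCR every page, and join the text.
# Usage: ocr_pdf scanned.pdf
ocr_pdf() {
    pdf="$1"
    prefix="${pdf%.pdf}-page"

    # One PNG per page at 300 DPI, named <prefix>-<page number>.png.
    pdftoppm -r 300 -png "$pdf" "$prefix" || return 1

    # OCR each page image into a matching .txt file.
    for img in "$prefix"-*.png; do
        tesseract "$img" "$(output_base_for "$img")" -l eng || return 1
    done

    # pdftoppm pads page numbers to a fixed width, so the glob expands in page order.
    cat "$prefix"-*.txt > "${pdf%.pdf}.txt"
}
```

Calling ocr_pdf scanned.pdf would leave the combined text in scanned.txt. Remove page images left over from an earlier run first, as the glob would pick them up too.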

For simple steps to convert a PDF file to images, or even to script and automate the conversion of multiple PDF files, you can read our article Convert PDF to Images From the Linux Command Line!

Wrapping Up

In this article, we explored Tesseract, a top-quality free command-line OCR engine for Linux. We saw how easily we can convert images to text using a simple command.

We also looked at converting images to text-based PDF files, and referred to an article where you can find information on how to pre-convert image-based PDF files to images so they can subsequently be converted to text using the OCR method shown here.

Enjoy!