How to Extract Text from Images with Python?

OCR (Optical Character Recognition) is the process of electronically converting digital images into machine-encoded text, where the image typically contains regions that resemble characters of a language. OCR is a field of research in pattern recognition, artificial intelligence and computer vision. Modern OCR engines are trained by running sample data through a machine learning algorithm. This kind of text extraction is generally carried out in work environments where it is certain that the images contain text data. In this article, we will learn how to extract text from images using the Python programming language.

To give our Python program character recognition capabilities, we will use the pytesseract OCR library. The library can be installed into our Python environment by executing the following command in the command interpreter of the OS:
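The standard install command uses pip; Pillow is included here as well, since the code opens images through PIL:

```shell
pip install pytesseract pillow
```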

On Windows, the library also requires the tesseract.exe binary to be present. During the installation of this executable, we will be prompted to specify a path for it. This path needs to be remembered, as it will be used later in the code. For most installations the path will be C:\Program Files (x86)\Tesseract-OCR\tesseract.exe.


Explanation:

First, we import the Image module from the PIL library (for opening an image) and the pytesseract module from the pytesseract library (for text extraction). Next, we define the path_to_tesseract variable, which contains the path to the executable binary (tesseract.exe) installed in the prerequisite step (this path depends on where the binary was installed). We then define the image_path variable, which contains the path to the image file; this path is passed to the open() function to create an image object from our image. After this, we assign the path stored in path_to_tesseract to the pytesseract.pytesseract.tesseract_cmd variable (the library uses this to find the executable and run it for extraction). We then pass the image object (img) to the image_to_string() function, which takes an image object as an argument and returns the text recognized inside it. Finally, we display the text found in the image using text[:-1] (to drop the additional form-feed character (^L) that Tesseract appends by default).

Image for demonstration:

An image of white text with black background

Below is the full implementation:


How to Build Optical Character Recognition (OCR) in Python

Optical character recognition (OCR) is a tool that can recognize text in images. Here’s how to build an OCR engine in Python.

Fahmi Nurfikri is a software engineer for the machine learning firm Prosa.ai, with more than five years of experience in artificial intelligence, Python and data science. Nurfikri holds a master’s degree in Informatics from Telkom University.


Optical character recognition (OCR) is a technology that recognizes text in images, such as scanned documents and photos. Perhaps you’ve taken a photo of a text just because you didn’t want to take notes, or because taking a photo is faster than typing. Fortunately, thanks to smartphones, we can apply OCR to copy the text from a picture we took without having to retype it.

What Is Python Optical Character Recognition (OCR)?

Python OCR is a technology that recognizes and pulls out text in images like scanned documents and photos using Python. It can be completed using the open-source OCR engine Tesseract.

We can do this in Python using a few lines of code. One of the most commonly used OCR tools is Tesseract, an optical character recognition engine for various operating systems.

Python OCR Installation

Tesseract runs on Windows, macOS and Linux platforms. It supports Unicode (UTF-8) and more than 100 languages. In this article, we will start with the Tesseract OCR installation process, and test the extraction of text in images.

The first step is to install Tesseract. In order to use the Tesseract library, we need to install it on our system. If you’re using Ubuntu, you can simply use apt-get to install Tesseract OCR:

sudo apt-get install tesseract-ocr

For macOS users, we’ll be using Homebrew to install Tesseract.
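With Homebrew, the install is a single command:

```shell
brew install tesseract
```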

For Windows, please see the Tesseract documentation .

Let’s begin by getting pytesseract installed.
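The Python wrapper installs via pip:

```shell
pip install pytesseract
```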

Python OCR Implementation

After installation is completed, let’s move forward by applying Tesseract with Python. First, we import the dependencies.

from PIL import Image
import pytesseract
import numpy as np

I will use a simple image to test the usage of Tesseract.

The sample image containing the test text.

Let’s load this image and convert it to text.

filename = 'image_01.png'
img1 = np.array(Image.open(filename))
text = pytesseract.image_to_string(img1)
print(text)

Result after running the OCR in Python.

The results obtained from Tesseract are good enough for simple images. However, in the real world it is difficult to find images that are really simple, so I will add noise to test Tesseract’s performance.

The same sample image with noise added.

We’ll do the same process as before.

filename = 'image_02.png'
img2 = np.array(Image.open(filename))
text = pytesseract.image_to_string(img2)
print(text)

No result after trying to pull text from an image with noise.

The result is nothing. This means that Tesseract cannot read words in images that have noise.

Next we’ll try to use a little image processing to eliminate noise in the image. Here I will use the OpenCV library. In this experiment, I’m using normalization, thresholding and image blur.

import numpy as np
import cv2

norm_img = np.zeros((img2.shape[0], img2.shape[1]))
img = cv2.normalize(img2, norm_img, 0, 255, cv2.NORM_MINMAX)
img = cv2.threshold(img, 100, 255, cv2.THRESH_BINARY)[1]
img = cv2.GaussianBlur(img, (1, 1), 0)

The result will be like this:

The sample image with noise cleaned to reveal the text.

Now that the image is clean enough, we will try again with the same process as before. And this is the result.

Result revealing that the OCR picked up the text.

As you can see, the results are in accordance with what we expect.

Video introducing the basics of how to use PyTesseract to extract text from images. | Video: ProgrammingKnowledge

Text Localization and Detection in Python OCR

With Tesseract, we can also do text localization and detection from images. First, we’ll import the dependencies we need.

from pytesseract import Output
import pytesseract
import cv2

I will use a simple image like the example above to test the usage of Tesseract.

Sample image to run in the OCR.

Now, let’s load this image and extract the data.

filename = 'image_01.png'
image = cv2.imread(filename)

This is different from the previous example, where we immediately converted the image into a string. In this example, we’ll convert the image into a dictionary.

results = pytesseract.image_to_data(image, output_type=Output.DICT)

The following results are the contents of the dictionary.

I will not explain the purpose of each value in the dictionary. Instead, we will use the left, top, width and height values to draw a bounding box around the text, along with the text itself. In addition, we will need the conf key, which holds the confidence score of each detected piece of text.

Now, we will extract the bounding box coordinates of each text region from the result, keeping only detections whose confidence exceeds a threshold. Here, I’ll use a threshold of 70. The code will look like this:

for i in range(0, len(results["text"])):
    x = results["left"][i]
    y = results["top"][i]
    w = results["width"][i]
    h = results["height"][i]
    text = results["text"][i]
    conf = int(results["conf"][i])
    if conf > 70:
        text = "".join([c if ord(c) < 128 else "" for c in text]).strip()
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(image, text, (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 200), 2)

Now that everything is set, we can display the results using this code.

Results with box coordinates around the text.

Ultimately, Tesseract is most suitable when building a document processing pipeline where images are scanned and processed. It works best on high-resolution input where the foreground text is neatly segmented from the background.

For text localization and detection, there are several parameters you can change, such as the confidence threshold. And if you find the output unattractive, you can change the thickness or color of the bounding box or text.
