Tesseract OCR in Java

OCR in Java with Tess4J

Optical character recognition (OCR) is the conversion of images containing text to machine-encoded text. A popular tool for this is the open source project Tesseract. Tesseract can be used as a standalone application from the command line. Alternatively, it can be integrated into applications using its C++ API. For other programming languages, various wrapper APIs are available. In this post we will use the Java wrapper Tess4J.

Getting started

We start with adding the Tess4J maven dependency to our project:

<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.5.2</version>
</dependency>

Next we need to make sure the native libraries required by Tess4J are accessible from our application. Tess4J jar files ship with native libraries included. However, they need to be extracted before they can be loaded. We can do this programmatically using a Tess4J utility method:

File tmpFolder = LoadLibs.extractTessResources("win32-x86-64");
System.setProperty("java.library.path", tmpFolder.getPath());

With LoadLibs.extractTessResources(..) we can extract resources from the jar file to a local temp directory. Note that the argument (here win32-x86-64) depends on the system you are using. You can see available options by looking into the Tess4J jar file. We can instruct Java to load native libraries from the temp directory by setting the Java system property java.library.path.

Another way to provide the libraries is to install Tesseract on your system. If you do not want to change the java.library.path property, you can also load the libraries manually using System.load(..).

Next we need to provide language-dependent data files to Tesseract. These data files contain trained models for Tesseract's LSTM OCR engine and can be downloaded from GitHub. For example, for detecting German text we have to download deu.traineddata (deu is the ISO 639-3 language code for German). We place one or more downloaded data files in the resources/data directory.
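Before handing the data directory to Tesseract, it can help to check that the expected .traineddata files are really there. The following is a minimal sketch using only the JDK. The class name TrainedDataCheck, its methods, and the src/main/resources/data path are assumptions made for illustration and are not part of Tess4J. The sketch also assumes the Tesseract convention of joining several languages with a plus sign, as in "deu+eng".

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Tess4J): checks for language data files. */
final class TrainedDataCheck {

    /** Maps a language code such as "deu" to the file name "deu.traineddata". */
    static String fileNameFor(String languageCode) {
        return languageCode + ".traineddata";
    }

    /**
     * Returns the codes from a language string (for example "deu" or "deu+eng")
     * that have no matching .traineddata file in the given data directory.
     */
    static List<String> missingLanguages(Path dataDirectory, String languages) {
        List<String> missing = new ArrayList<>();
        for (String code : languages.split("\\+")) {
            if (!Files.isRegularFile(dataDirectory.resolve(fileNameFor(code)))) {
                missing.add(code);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumed project layout; adjust to wherever the data files were placed.
        Path dataDirectory = Paths.get("src", "main", "resources", "data");
        List<String> missing = missingLanguages(dataDirectory, "deu");
        System.out.println(missing.isEmpty()
                ? "All language data found"
                : "Missing language data: " + missing);
    }
}
```

Running such a check at startup gives a readable message naming the missing file before any OCR call is made.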

Detecting Text

Now we are ready to use Tesseract within our Java application. The following snippet shows a minimal example:

Tesseract tesseract = new Tesseract();
tesseract.setLanguage("deu");
tesseract.setOcrEngineMode(1);

Path dataDirectory = Paths.get(ClassLoader.getSystemResource("data").toURI());
tesseract.setDatapath(dataDirectory.toString());

BufferedImage image = ImageIO.read(Main.class.getResourceAsStream("/ocrexample.jpg"));
String result = tesseract.doOCR(image);
System.out.println(result);

First we create a new Tesseract instance. We set the language we want to recognize (here: German). With setOcrEngineMode(1) we tell Tesseract to use the LSTM OCR engine.

Next we set the data directory with setDatapath(..) to the directory containing our downloaded LSTM models (here: resources/data).

Finally we load an example image from the classpath and use the doOCR(..) method to perform character recognition. As a result we get a String containing detected characters.

For example, feeding Tesseract with this photo from the German Wikipedia OCR article might produce the following text output.

[Image: ocr-example]

Grundsätzliches [Quelltext bearbeiten] Texterkennung ist deshalb notwendig, weil optische Eingabegeräte (Scanner oder Digitalkameras, aber auch Faxempfänger) als Ergebnis ausschließlich Rastergrafiken liefern können. d. h. in Zeiten und Spaten angeordnete Punkte unterschiedlicher Färbung (Pixel). Texterkennung bezeichnet dabei die Aufgabe, die so dargestellten Buchstaben als solche zu erkennen, dh. zu identifizieren und ihnen den Zahlenwert zuzuordnen, der ihnen nach üblicher Textcodierung zukommt (ASCII, Unicode). Automatische Texterkennung und OCR werden im deutschen Sprachraum oft als Synonym verwendet In technischer Hinsicht bezieht sich OCR jedoch nur auf den Teilbereich der Muster vergleiche von separierten Bildteilen als Kandidaten zur ( Erkennung von Einzelzeichen. Diesem OCR—Prozess geht eine globale Strukturerkennung voraus, in der zuerst Textblöcke von graphischen Elementen unterschieden, die Zeilenstrukturen erkannt und schließlich | Einzeizeichen separiert werden. Bei der Entscheidung, welches Zeichen vorliegt, kann über weitere \ . Algorithmen ein sprachlicher Kontext berücksichtigt werden

Summary

Tesseract is a popular open source project for OCR. With Tess4J we can access the Tesseract API in Java. A little bit of setup is required for loading native libraries and downloading Tesseract's LSTM data. After that it is quite easy to perform OCR in Java. If you are not happy with the recognized text, it is a good idea to have a look at the "Improving the quality of the output" section of the Tesseract documentation.

You can find the source code for the shown example on GitHub.

Tesseract OCR with Java with Examples

In this article, we will learn how to work with Tesseract OCR in Java using the Tesseract API.

What is Tesseract OCR?
Tesseract OCR is an optical character recognition engine that was originally developed at HP Laboratories, starting in 1985, and open sourced in 2005. From 2006 its development was sponsored by Google. Tesseract has Unicode (UTF-8) support and can recognize more than 100 languages "out of the box", so it can also be used to build scanning software for different languages. Tesseract 4 added a new neural net (LSTM) based OCR engine that is focused on line recognition, while still supporting the legacy Tesseract OCR engine, which works by recognizing character patterns.

Generally OCR works as follows:

  1. Pre-process image data, for example: convert to gray scale, smooth, de-skew, filter.
  2. Detect lines, words and characters.
  3. Produce a ranked list of candidate characters based on the trained data set. (In Tess4J, the setDatapath() method sets the path to the trained data.)
  4. Post process recognized characters, choose best characters based on confidence from previous step and language data. Language data includes dictionary, grammar rules, etc.
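To make the last two steps more concrete, here is a toy sketch of choosing the best character per position from a ranked candidate list. It is not how Tesseract is implemented internally: the Candidate type, the confidence values, and the bestText method are invented for illustration, and the sketch ignores the language data (dictionary, grammar rules) that a real engine also takes into account.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Toy illustration of confidence-based selection; not Tesseract's actual algorithm. */
final class CandidateSelection {

    /** One possible reading of a character with a confidence between 0 and 1. */
    static final class Candidate {
        final char character;
        final double confidence;

        Candidate(char character, double confidence) {
            this.character = character;
            this.confidence = confidence;
        }
    }

    /** Picks the candidate with the highest confidence for every character position. */
    static String bestText(List<List<Candidate>> positions) {
        StringBuilder text = new StringBuilder();
        for (List<Candidate> candidates : positions) {
            Candidate best = candidates.stream()
                    .max(Comparator.comparingDouble((Candidate c) -> c.confidence))
                    .orElseThrow(() -> new IllegalArgumentException("position without candidates"));
            text.append(best.character);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Made-up candidates for three character positions.
        List<List<Candidate>> positions = Arrays.asList(
                Arrays.asList(new Candidate('O', 0.55), new Candidate('0', 0.40)),
                Arrays.asList(new Candidate('C', 0.80), new Candidate('G', 0.15)),
                Arrays.asList(new Candidate('R', 0.70), new Candidate('K', 0.25)));
        System.out.println(bestText(positions));
    }
}
```

A post-processing step that also consults a dictionary could, for instance, prefer a slightly less confident character when it completes a known word.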

The advantages of OCR are numerous, namely:

  • It increases the efficiency and effectiveness of office work.
  • Content becomes instantly searchable, which is immensely useful, especially in an office setting that has to deal with high-volume scanning or a high document inflow.
  • OCR is quick and keeps the document’s content intact while saving time.
  • Workflow improves, since employees no longer have to spend time on manual labour and can work more quickly and efficiently.

OCR also has disadvantages:

  • OCR is limited to recognizing text in the languages it has data for.
  • A lot of effort is required to create trained data for different languages and to integrate it.
  • Extra work on image processing is also needed, as it is the part that matters most for OCR performance.
  • Even after all this work, no OCR can offer 100% accuracy. Unrecognized characters still have to be determined from their neighbouring characters using machine learning methods, or corrected manually.

How to use Tesseract OCR

  1. Download the Tess4J API from the link.
  2. Extract the files from the downloaded archive.
  3. Open your IDE and create a new project.
  4. Link the JAR file with your project. Refer to this link.
  5. To find the JAR file, navigate to the path “..\Tess4J-3.4.8-src\Tess4J\dist”.

Performing OCR on unclear images

Note that the image selected above is actually very clear and grayscaled, but this is not the case most of the time. Usually we get a noisy image and thus a very noisy output. To deal with this, we need to apply some image processing first.

Tesseract works best when the foreground text is cleanly segmented from the background. In practice, it can be extremely challenging to guarantee good segmentation, and there are a variety of reasons why the output quality may be poor if the image has noise in the background. Removing that noise is part of image processing, and to do it we need to know how the image should be processed. Refer to this article for a detailed understanding of how to improve the accuracy.

To implement this in Java, we write a small routine that scans the RGB content of the image, converts it to grayscale, and also zooms the image. The image is grayscaled based on its RGB content: very dark images become brighter and clearer, while whitish images are darkened slightly so that the text stays visible.
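The grayscale and zoom preprocessing described above could look roughly like the following sketch, which uses only the JDK. It is not the original article's code. The class OcrPreprocessing and its method names are hypothetical, and the brightness handling is a plain linear contrast stretch, chosen here as a simple stand-in for whatever rule the original code applied.

```java
import java.awt.image.BufferedImage;
import java.awt.image.WritableRaster;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

/** Hypothetical preprocessing helper; names and technique are illustrative only. */
final class OcrPreprocessing {

    /** Converts an image to 8-bit grayscale using the common Rec. 601 luma weights. */
    static BufferedImage toGray(BufferedImage source) {
        int width = source.getWidth();
        int height = source.getHeight();
        BufferedImage gray = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY);
        WritableRaster raster = gray.getRaster();
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = source.getRGB(x, y);
                int red = (rgb >> 16) & 0xFF;
                int green = (rgb >> 8) & 0xFF;
                int blue = rgb & 0xFF;
                int luma = (int) Math.round(0.299 * red + 0.587 * green + 0.114 * blue);
                raster.setSample(x, y, 0, luma);
            }
        }
        return gray;
    }

    /**
     * Linearly stretches a TYPE_BYTE_GRAY image so that its darkest pixel becomes 0
     * and its brightest 255. A flat image (single gray value) is returned unchanged.
     */
    static BufferedImage stretchContrast(BufferedImage gray) {
        int width = gray.getWidth();
        int height = gray.getHeight();
        WritableRaster in = gray.getRaster();
        int min = 255;
        int max = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int value = in.getSample(x, y, 0);
                min = Math.min(min, value);
                max = Math.max(max, value);
            }
        }
        if (max == min) {
            return gray; // nothing to stretch
        }
        BufferedImage result = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY);
        WritableRaster out = result.getRaster();
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                out.setSample(x, y, 0, (in.getSample(x, y, 0) - min) * 255 / (max - min));
            }
        }
        return result;
    }

    /** Enlarges a TYPE_BYTE_GRAY image by an integer factor (nearest-neighbour zoom). */
    static BufferedImage scaleUp(BufferedImage gray, int factor) {
        if (factor < 1) {
            throw new IllegalArgumentException("factor must be at least 1");
        }
        int width = gray.getWidth() * factor;
        int height = gray.getHeight() * factor;
        BufferedImage result = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY);
        WritableRaster in = gray.getRaster();
        WritableRaster out = result.getRaster();
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                out.setSample(x, y, 0, in.getSample(x / factor, y / factor, 0));
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: OcrPreprocessing <input image> <output.png>");
            return;
        }
        BufferedImage source = ImageIO.read(new File(args[0]));
        if (source == null) {
            System.err.println("unsupported image format: " + args[0]);
            return;
        }
        BufferedImage prepared = scaleUp(stretchContrast(toGray(source)), 2);
        ImageIO.write(prepared, "png", new File(args[1]));
    }
}
```

The resulting BufferedImage can then be passed to doOCR(..) in place of the original image. Whether a zoom factor of 2 actually helps depends on the input, so it is worth comparing the recognized text with and without each step.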

Tesseract OCR with Java with Examples

Optical Character Recognition (OCR) plays an instrumental role in digitizing printed text, allowing it to be edited, searched, and stored more compactly. One of the most powerful OCR tools available is Tesseract OCR. This article will explore how to use Tesseract OCR with Java, providing detailed examples to enhance your understanding.

What is Tesseract OCR?

Tesseract OCR is an open-source OCR engine sponsored by Google that can recognize more than 100 languages out of the box. It’s widely regarded for its accuracy and adaptability, making it a popular choice for developers across various applications.

Integrating Tesseract OCR with Java

To integrate Tesseract OCR with Java, we need to use the Tesseract API for Java, typically known as Tess4J. Tess4J provides a Java JNA wrapper for Tesseract OCR API, bridging the gap between the Tesseract engine and Java applications.

Step 1: Setting Up the Environment

First, we need to install Tesseract OCR and Tess4J. Tesseract can be installed on Windows, Linux, and macOS using their respective package managers. To include Tess4J in your Java project, you can add it as a Maven dependency:

<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.5.4</version>
</dependency>

Step 2: Performing OCR on an Image

Below is a simple Java code snippet that performs OCR on an image file:

import java.io.File;
import net.sourceforge.tess4j.*;

public class OCRExample {
    public static void main(String[] args) {
        File imageFile = new File("path_to_your_image_file");
        ITesseract instance = new Tesseract(); // JNA Interface Mapping
        instance.setDatapath("path_to_tessdata"); // replace with your tessdata path

        try {
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}

In this example, we instantiate a Tesseract object and set the path to the tessdata directory, which contains language data files. We then call doOCR() on our image file, which returns a String containing the recognized text.

Step 3: Handling Multiple Languages

Tesseract OCR supports over 100 languages. To perform OCR with a different language, simply set the language on the Tesseract instance:

instance.setLanguage("fra"); // for French

try {
    String result = instance.doOCR(imageFile);
    System.out.println(result);
} catch (TesseractException e) {
    System.err.println(e.getMessage());
}

This will now perform OCR on the image using French language data.

Conclusion

Tesseract OCR, combined with Java, presents a powerful toolset for developers needing to implement OCR capabilities into their applications. The flexibility, accuracy, and extensive language support of Tesseract make it an excellent choice for a broad range of OCR tasks.
