kyper999/tesseract-ocr-for-php
A wrapper to work with Tesseract OCR inside PHP.
$ composer require thiagoalessio/tesseract_ocr
‼️ This library depends on Tesseract OCR, version 3.02 or later.
There are many ways to install Tesseract OCR on your system, but if you just want something quick to get up and running, I recommend installing the Capture2Text package with Chocolatey.
choco install capture2text --version 3.9
⚠️ Recent versions of Capture2Text stopped shipping the tesseract binary.
With MacPorts you can install support for individual languages, like so:
$ sudo port install tesseract-
But that is not possible with Homebrew. It comes only with English support by default, so if you intend to use it for other languages, the quickest solution is to install them all:
$ brew install tesseract --with-all-languages
use thiagoalessio\TesseractOCR\TesseractOCR; echo (new TesseractOCR('text.png')) ->run();
The quick brown fox jumps over the lazy dog.
use thiagoalessio\TesseractOCR\TesseractOCR; echo (new TesseractOCR('german.png')) ->lang('deu') ->run();
use thiagoalessio\TesseractOCR\TesseractOCR; echo (new TesseractOCR('mixed-languages.png')) ->lang('eng', 'jpn', 'spa') ->run();
use thiagoalessio\TesseractOCR\TesseractOCR; echo (new TesseractOCR('8055.png')) ->whitelist(range('A', 'Z')) ->run();
Yes, I know some of you might want to use this library for the noble purpose of breaking CAPTCHAs, so please take a look at this comment:
Define the path of an image to be recognized by tesseract.
$ocr = new TesseractOCR(); $ocr->image('/path/to/image.png'); $ocr->run();
Define a custom location for the tesseract executable, if for any reason it is not present in $PATH.
echo (new TesseractOCR('img.png')) ->executable('/path/to/tesseract') ->run();
Returns the current version of tesseract.
echo (new TesseractOCR())->version();
Returns a list of available languages/scripts.
foreach((new TesseractOCR())->availableLanguages() as $lang) echo $lang;
Specify a custom location for the tessdata directory.
echo (new TesseractOCR('img.png')) ->tessdataDir('/path') ->run();
Specify the location of the user words file.
This is a plain text file containing a list of words that you want tesseract to treat as normal dictionary words.
Useful when dealing with contents that contain technical terminology, jargon, etc.
$ cat /path/to/user-words.txt
foo
bar
echo (new TesseractOCR('img.png')) ->userWords('/path/to/user-words.txt') ->run();
Specify the location of the user patterns file.
If the contents you are dealing with have known patterns, this option can significantly improve tesseract's recognition accuracy.
$ cat /path/to/user-patterns.txt
1-\d\d\d-GOOG-441
www.\n\\\*.com
echo (new TesseractOCR('img.png')) ->userPatterns('/path/to/user-patterns.txt') ->run();
Define one or more languages to be used during the recognition. A complete list of available languages can be found at: https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages
Tip from @daijiale: Use the combination ->lang('chi_sim', 'chi_tra') for proper recognition of Chinese.
echo (new TesseractOCR('img.png')) ->lang('lang1', 'lang2', 'lang3') ->run();
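For context, the tesseract command line accepts several languages in a single -l value joined by "+". The sketch below shows how a list of language codes could be turned into that value. The helper name joinLanguages is hypothetical and not part of this library, whose own implementation may differ.

```php
<?php
// Hypothetical helper (not part of the library): joins language codes with
// "+", the separator tesseract's -l option uses for multiple languages.
function joinLanguages(string ...$langs): string
{
    return implode('+', $langs);
}

echo joinLanguages('eng', 'jpn', 'spa'), PHP_EOL; // prints eng+jpn+spa
```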
Specify the Page Segmentation Method, which instructs tesseract how to interpret the given image.
echo (new TesseractOCR('img.png')) ->psm(6) ->run();
Specify the OCR Engine Mode (see tesseract --help-oem).
echo (new TesseractOCR('img.png')) ->oem(2) ->run();
This is a shortcut for ->config('tessedit_char_whitelist', 'abcdef...').
echo (new TesseractOCR('img.png')) ->whitelist(range('a', 'z'), range(0, 9), '-_@') ->run();
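To make the shortcut concrete, here is a minimal sketch of how mixed array and string arguments could be flattened into the single string that tessedit_char_whitelist expects. The helper name buildWhitelist is hypothetical and not part of this library; the library's own implementation may differ.

```php
<?php
// Hypothetical helper (not part of the library): flattens arrays and strings
// into one whitelist string, similar in spirit to what ->whitelist() accepts.
function buildWhitelist(...$parts): string
{
    $chars = '';
    foreach ($parts as $part) {
        // range() returns an array; plain strings are appended unchanged.
        $chars .= is_array($part) ? implode('', $part) : (string) $part;
    }
    return $chars;
}

echo buildWhitelist(range('a', 'c'), range(0, 2), '-_@'), PHP_EOL; // prints abc012-_@
```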
Specify a config file to be used. It can either be the path to your own config file or the name of one of the predefined config files: https://github.com/tesseract-ocr/tesseract/tree/master/tessdata/configs
echo (new TesseractOCR('img.png')) ->configFile('hocr') ->run();
Specify an output format other than text. Available options are HOCR and TSV (TSV is only available on Tesseract 3.05 or later).
echo (new TesseractOCR('img.png')) ->format('hocr') ->run();
Shortcut for ->configFile('digits').
echo (new TesseractOCR('img.png')) ->digits() ->run();
Shortcut for ->configFile('hocr').
echo (new TesseractOCR('img.png')) ->hocr() ->run();
Shortcut for ->configFile('pdf').
echo (new TesseractOCR('img.png')) ->pdf() ->run();
Shortcut for ->configFile('quiet').
echo (new TesseractOCR('img.png')) ->quiet() ->run();
Shortcut for ->configFile('tsv').
echo (new TesseractOCR('img.png')) ->tsv() ->run();
Shortcut for ->configFile('txt').
echo (new TesseractOCR('img.png')) ->txt() ->run();
Define a custom directory to store temporary files generated by tesseract. Make sure the directory actually exists and that the user running PHP is allowed to write to it.
echo (new TesseractOCR('img.png')) ->tempDir('./my/custom/temp/dir') ->run();
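Since the directory must exist and be writable, it can help to check both conditions before handing the path to tempDir. The following is a minimal sketch using only standard PHP functions. The helper name ensureWritableDir and the ocr-tmp directory name are assumptions made for illustration, not part of this library.

```php
<?php
// Hypothetical helper (not part of the library): returns the directory if it
// exists (creating it when missing) and is writable, otherwise throws.
function ensureWritableDir(string $dir): string
{
    // The second is_dir() check covers a concurrent process creating it first.
    if (!is_dir($dir) && !mkdir($dir, 0700, true) && !is_dir($dir)) {
        throw new RuntimeException("Cannot create directory: $dir");
    }
    if (!is_writable($dir)) {
        throw new RuntimeException("Directory is not writable: $dir");
    }
    return $dir;
}

// Assumed location for illustration; any path of your choosing works.
$tempDir = ensureWritableDir(sys_get_temp_dir() . '/ocr-tmp');
```

The resulting path could then be passed along as ->tempDir($tempDir).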
Any configuration option offered by Tesseract can be set with ->config(), or with an equivalent camelCase method call:
echo (new TesseractOCR('img.png')) ->config('config_var', 'value') ->config('other_config_var', 'other value') ->run();
echo (new TesseractOCR('img.png')) ->configVar('value') ->otherConfigVar('other value') ->run();
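The two forms above suggest that a camelCase method name corresponds to a snake_case configuration variable. A minimal sketch of such a conversion is shown below. The function name camelToSnake is hypothetical, and the library's actual mapping may be implemented differently.

```php
<?php
// Hypothetical helper (not part of the library): converts a camelCase method
// name into the snake_case form used for configuration variable names.
function camelToSnake(string $name): string
{
    // Put "_" before every uppercase letter except a leading one, then lowercase.
    return strtolower(preg_replace('/(?<!^)[A-Z]/', '_$0', $name));
}

echo camelToSnake('configVar'), PHP_EOL;      // prints config_var
echo camelToSnake('otherConfigVar'), PHP_EOL; // prints other_config_var
```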
Sometimes it may be useful to limit the number of threads that tesseract is allowed to use (e.g. in this case). Set the maximum number of threads with threadLimit():
echo (new TesseractOCR('img.png')) ->threadLimit(1) ->run();
You can contribute to this project by:
- Helping new users on Gitter;
- Opening an Issue if you found a bug or wish to propose a new feature;
- Submitting a Pull Request with code that fixes a bug, corrects missing or wrong documentation, or implements a new feature.
Just make sure you take a look at our Code of Conduct and Contributing instructions.
tesseract-ocr-for-php is released under the MIT License.