Search for predefined keywords in scanned PDF files using the Levenshtein distance algorithm.
Python
Tesseract
Requires libtesseract (>=3.04) and libleptonica (>=1.71).
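Once the system packages for your platform are installed, the minimum versions above can be checked with a small script. The following is only a sketch and not part of ocrmatcher: the helper names (`parse_version`, `check_versions`, `read_tesseract_version`) are hypothetical, and it assumes the `tesseract` command-line binary reports the same versions as the libraries it was built against.

```python
import re
import shutil
import subprocess

# Minimum versions taken from the requirement line above (3.04 and 1.71).
MIN_TESSERACT = (3, 4)
MIN_LEPTONICA = (1, 71)


def parse_version(text, pattern):
    """Return the first version matched by pattern as a tuple of ints, or None."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def check_versions(version_output):
    """Report whether the versions found in `tesseract --version` output are new enough."""
    tess = parse_version(version_output, r"tesseract\s+v?(\d+(?:\.\d+)*)")
    lept = parse_version(version_output, r"leptonica-(\d+(?:\.\d+)*)")
    return {
        "tesseract": tess is not None and tess >= MIN_TESSERACT,
        "leptonica": lept is not None and lept >= MIN_LEPTONICA,
    }


def read_tesseract_version():
    """Run `tesseract --version`; return its output, or None if the binary is missing."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        ["tesseract", "--version"], capture_output=True, text=True, check=False
    )
    # Older releases print the version banner to stderr, newer ones to stdout.
    return result.stdout + result.stderr
```

Calling `check_versions(read_tesseract_version() or "")` returns a dictionary with one boolean per library. A `False` value means the version was either too old or could not be found in the output.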
On Debian/Ubuntu:
$ sudo apt-get install tesseract-ocr libtesseract-dev libleptonica-dev pkg-config
On RedHat/Fedora:
$ sudo dnf install tesseract tesseract-devel leptonica-devel leptonica
$ git clone <project_repo>
$ cd <project_directory>/
$ pip install -r requirements/dev.txt
$ python -m build
For Windows
$ pip install dist/ocrmatcher-<version>-py3-none-any.whl
For Linux
$ pip install dist/ocrmatcher-<version>.tar.gz
- Add a dataset folder to the current directory
- Add scanned PDF files to the dataset directory
- Add a keywords.txt file to the dataset directory
- Add search keywords to the keywords.txt file (each keyword must be on its own line, without numbering)
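The keywords.txt layout described above (one keyword per line, no numbering) can be read with a few lines of Python. This is an illustrative sketch rather than ocrmatcher's own loader: `parse_keywords` and `load_keywords` are hypothetical names, and skipping blank lines and trimming whitespace are assumptions about how such a file is usually handled.

```python
from pathlib import Path


def parse_keywords(text):
    """Return the non-empty lines of text, stripped of surrounding whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_keywords(path="dataset/keywords.txt"):
    """Read a keywords file that holds one keyword per line."""
    return parse_keywords(Path(path).read_text(encoding="utf-8"))
```

A keyword may contain spaces (for example `total amount`), since only line breaks separate entries.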
List of available commands
$ ocrmatcher --help
Or
$ python -m ocrmatcher --help
Add new keywords with the add-keywords command
$ ocrmatcher add-keywords --k my-search-keyword1 my-search-keyword2 etc.
Search Keywords
$ ocrmatcher search
Run with a specific language
$ ocrmatcher search --lang Occupant-Pigs
Run with a specific similarity threshold for two strings (default: 95)
$ ocrmatcher search --threshold 75
Convert PDF files to images
$ ocrmatcher pdf2img
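To make the threshold option more concrete, here is a minimal pure-Python sketch of Levenshtein-based matching. It is not ocrmatcher's implementation. The function names are hypothetical, and the score formula, `100 * (1 - distance / longer length)`, is one common normalisation chosen for illustration, so the tool's real scores may differ.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Score from 0 to 100; 100 means the strings are identical (assumed formula)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def find_matches(keyword, text, threshold=95):
    """Return (candidate, score) pairs for word windows in text that reach the threshold."""
    words = text.split()
    size = max(1, len(keyword.split()))  # compare windows with as many words as the keyword
    hits = []
    for start in range(len(words) - size + 1):
        candidate = " ".join(words[start:start + size])
        score = similarity(keyword.lower(), candidate.lower())
        if score >= threshold:
            hits.append((candidate, score))
    return hits
```

With this scoring, an OCR misread such as `Tota1 amount` for the keyword `total amount` scores about 91.7. It is rejected at the default threshold of 95 but accepted at 75, which is the kind of trade-off a lower threshold makes: more tolerance for OCR noise at the cost of more false positives.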