About - OCR Toolkit

Search for predefined keywords in scanned PDF files using the Levenshtein distance algorithm.
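For readers unfamiliar with Levenshtein distance, here is a minimal sketch of the idea: count the single-character insertions, deletions, and substitutions needed to turn one string into another, then scale the result to a 0-100 score. The function names `levenshtein` and `similarity` and the normalization by the longer string's length are assumptions made for illustration; the scoring used inside ocrmatcher may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to a 0-100 score (100 means identical).

    Assumption: normalize by the length of the longer string.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


print(levenshtein("kitten", "sitting"))  # 3
```

A fuzzy score of this kind is useful for OCR output because a single misread character (for example `l` instead of `i`) lowers the score slightly instead of causing the match to fail outright.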

Prerequisites

Python
Tesseract

Install dependencies for Linux

Requires libtesseract (>=3.04) and libleptonica (>=1.71).

On Debian/Ubuntu:

$ sudo apt-get install tesseract-ocr libtesseract-dev libleptonica-dev pkg-config

On RedHat/Fedora:

$ sudo dnf install tesseract tesseract-devel leptonica-devel leptonica

Install dependencies for Windows

Setup Project

$ git clone <project_repo>

$ cd <project_directory>/

Install Source dependencies from `requirements`

$ pip install -r requirements/dev.txt

Package Build and Install

$ python -m build

For Windows

$ pip install dist/ocrmatcher-<version>-py3-none-any.whl

For Linux

$ pip install dist/ocrmatcher-<version>.tar.gz

Using

Create a dataset folder in the current directory
Add the scanned PDF files to the dataset directory
Add a keywords.txt file to the dataset directory
Add the search keywords to keywords.txt (one keyword per line, without numbering)
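The keywords.txt layout described above (one keyword per line, no numbering) can be read with a few lines of Python. This is only a sketch: the helper name `load_keywords` is hypothetical, and skipping blank lines and duplicate entries is an assumption, not documented ocrmatcher behavior.

```python
from pathlib import Path


def load_keywords(path):
    """Read one keyword per line.

    Assumption: blank lines and repeated keywords are ignored.
    """
    keywords = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        keyword = line.strip()
        if keyword and keyword not in keywords:
            keywords.append(keyword)
    return keywords
```

For example, a file containing the lines `invoice` and `tax number` would yield `["invoice", "tax number"]`.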

Commands

List of available commands

$ ocrmatcher --help

Or

$ python -m ocrmatcher --help

Add new keywords with the add-keywords command

$ ocrmatcher add-keywords --k my-search-keyword1 my-search-keyword2 etc.

Search Keywords

$ ocrmatcher search

Run with a specific language

Search Keywords

$ ocrmatcher search --lang Occupant-Pigs

Run with a specific similarity threshold for comparing two strings (default: 95)

Search Keywords

$ ocrmatcher search --threshold 75
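To illustrate what lowering the threshold does, the sketch below slides a window over whitespace-separated words of some OCR text and keeps the candidates whose similarity to the keyword meets the threshold. Everything here is an assumption made for illustration: the function `find_matches`, the word-window strategy, lowercasing, and the 0-100 normalization are not taken from the ocrmatcher source. A compact Levenshtein helper is included so the snippet runs on its own.

```python
def levenshtein(a, b):
    """Edit distance via the classic two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                    # deletion
                current[j - 1] + 1,                 # insertion
                previous[j - 1] + (ch_a != ch_b),   # substitution
            ))
        previous = current
    return previous[-1]


def find_matches(keyword, text, threshold=95):
    """Return (candidate, score) pairs whose similarity meets the threshold."""
    target = keyword.lower().strip()
    if not target:
        return []
    words = text.lower().split()
    size = len(target.split())  # compare windows with as many words as the keyword
    matches = []
    for start in range(len(words) - size + 1):
        candidate = " ".join(words[start:start + size])
        longest = max(len(candidate), len(target))
        score = 100 * (1 - levenshtein(candidate, target) / longest)
        if score >= threshold:
            matches.append((candidate, round(score, 1)))
    return matches


ocr_text = "Total lnvoice amount due"  # OCR misread "Invoice" as "lnvoice"
print(find_matches("invoice", ocr_text, threshold=95))  # []
print(find_matches("invoice", ocr_text, threshold=75))  # [('lnvoice', 85.7)]
```

With the default of 95 the misread word is rejected; at 75 it is accepted. A lower threshold tolerates more OCR noise at the cost of more false positives.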

Convert PDF files to images

$ ocrmatcher pdf2img
