A visual speech recognition (VSR) tool that reads your lips in real time and types whatever you silently mouth. Runs fully locally.
Relies on a model trained on the Lip Reading Sentences 3 (LRS3) dataset as part of the Auto-AVSR project.
Watch a demo of Chaplin here.
- Clone the repository, and `cd` into it:

  ```bash
  git clone https://github.com/amanvirparhar/chaplin
  cd chaplin
  ```

- Run the setup script, which will automatically download the required model files from Hugging Face Hub and place them in the appropriate directories (an optional way to sanity-check the downloaded files is sketched after this list):

  ```bash
  ./setup.sh
  ```

  ```
  chaplin/
  ├── benchmarks/
  │   └── LRS3/
  │       ├── language_models/
  │       │   └── lm_en_subword/
  │       └── models/
  │           └── LRS3_V_WER19.1/
  ├── ...
  ```

- Install and run `ollama`, and pull the `qwen3:4b` model (e.g. `ollama pull qwen3:4b`).
- Install `uv`.
- Run the following command:

  ```bash
  uv run --with-requirements requirements.txt --python 3.12 main.py config_filename=./configs/LRS3_V_WER19.1.ini detector=mediapipe
  ```
- Once the camera feed is displayed, you can start "recording" by pressing the `option` key (Mac) or the `alt` key (Windows/Linux), and start mouthing words.
- To stop recording, press the `option` key (Mac) or the `alt` key (Windows/Linux) again. The raw VSR output will be logged in your terminal, and the LLM-corrected version will be typed at your cursor (a rough sketch of this last step is included after this list).
- To exit gracefully, focus on the window displaying the camera feed and press `q`.
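
Optional: if you want to confirm that `./setup.sh` put the model files where the run command expects them, a quick check along these lines should do. This is a sketch, not part of the project; it only looks for the directories shown in the tree above, and assumes you run it from the repository root.

```python
# Optional sanity check (not part of Chaplin): verify that the model directories
# downloaded by ./setup.sh exist and are non-empty. Run from the repository root.
from pathlib import Path

EXPECTED_DIRS = [
    Path("benchmarks/LRS3/models/LRS3_V_WER19.1"),          # VSR model
    Path("benchmarks/LRS3/language_models/lm_en_subword"),  # subword language model
]

for d in EXPECTED_DIRS:
    if d.is_dir() and any(d.iterdir()):
        print(f"ok: {d}")
    else:
        print(f"missing or empty: {d}")
```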
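
For a sense of what happens between the raw VSR output in your terminal and the text that lands at your cursor, here is a rough, hypothetical sketch of that last step. It is not Chaplin's actual implementation: it assumes the `ollama` Python package for the LLM call and `pynput` for typing at the cursor, both of which are assumptions rather than the project's real internals.

```python
# Hypothetical sketch (not Chaplin's actual code): ask the local LLM to clean up
# a raw lip-reading transcript, then "type" the result wherever the cursor is.
# Assumes `pip install ollama pynput`, a running ollama server, and qwen3:4b pulled.
import re

import ollama
from pynput.keyboard import Controller


def correct_and_type(raw_vsr_text: str) -> str:
    response = ollama.chat(
        model="qwen3:4b",
        messages=[{
            "role": "user",
            "content": (
                "Fix the casing, punctuation, and obvious word errors in this "
                "lip-read transcript. Reply with the corrected text only:\n"
                + raw_vsr_text
            ),
        }],
    )
    corrected = response["message"]["content"]
    # qwen3 may emit a <think>...</think> block before its answer; strip it.
    corrected = re.sub(r"<think>.*?</think>", "", corrected, flags=re.DOTALL).strip()
    Controller().type(corrected)  # sends keystrokes to whatever window has focus
    return corrected


if __name__ == "__main__":
    print(correct_and_type("HELLO WORLD THIS IS A TEST"))
```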