A fast, local neural text-to-speech system that sounds great and is optimized for the Raspberry Pi 4.
echo 'Welcome to the world of speech synthesis!' | \
./piper --model en-us-blizzard_lessac-medium.onnx --output_file welcome.wav

Voices are trained with VITS and exported to ONNX for use with onnxruntime.
Our goal is to support Home Assistant and the Year of Voice.
Download voices from the releases page.
Supported languages:
- Catalan (ca)
- Danish (da)
- Dutch (nl)
- French (fr)
- German (de)
- Italian (it)
- Kazakh (kk)
- Nepali (ne)
- Norwegian (no)
- Spanish (es)
- Ukrainian (uk)
- U.S. English (en-us)
- Vietnamese (vi)
Download a release. If you want to build from source instead, see the Makefile and C++ source. Last tested with onnxruntime 1.13.1.
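A source build might look like the following, assuming the default make target produces the piper binary (the Makefile documents the actual compiler and onnxruntime requirements):

# from the root of a piper source checkout
make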
- Download a voice and extract the .onnx and .onnx.json files (see the extraction sketch after this list)
- Run the piper binary with text on standard input, --model /path/to/your-voice.onnx, and --output_file output.wav
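Voice downloads are archives containing both files; extracting one might look like this (the archive and file names below are only illustrative):

tar -xzf voice-en-us-blizzard_lessac-medium.tar.gz
# should produce en-us-blizzard_lessac-medium.onnx and en-us-blizzard_lessac-medium.onnx.json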
For example:
echo 'Welcome to the world of speech synthesis!' | \
./piper --model blizzard_lessac-medium.onnx --output_file welcome.wav

For multi-speaker models, use --speaker <number> to change speakers (default: 0).
See piper --help for more options.
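Because piper reads text from standard input, you can also feed it a file; a small sketch that reuses only the options shown above (the file name and voice path are placeholders):

# speak the contents of hello.txt with speaker 2 of a multi-speaker voice
./piper --model /path/to/your-voice.onnx --speaker 2 --output_file hello.wav < hello.txt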
See src/python for the training code.
Start by creating a virtual environment:
cd piper/src/python
python3 -m venv .venv
source .venv/bin/activate
pip3 install --upgrade pip
pip3 install --upgrade wheel setuptools
pip3 install -r requirements.txt

Run the build_monotonic_align.sh script in the src/python directory to build the extension.
Ensure you have espeak-ng installed (sudo apt-get install espeak-ng).
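Put together, those two extra setup steps might look like this (run from src/python with the virtual environment still active):

bash build_monotonic_align.sh
sudo apt-get install espeak-ng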
Next, preprocess your dataset:
python3 -m piper_train.preprocess \
--language en-us \
--input-dir /path/to/ljspeech/ \
--output-dir /path/to/training_dir/ \
--dataset-format ljspeech \
--sample-rate 22050

Datasets must either be in the LJSpeech format or from Mimic Recording Studio (--dataset-format mycroft).
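For reference, a dataset in the LJSpeech format is a directory containing a pipe-delimited metadata.csv and a folder of WAV files. A minimal sketch of the layout (file names and transcripts are made up; confirm the exact columns with python3 -m piper_train.preprocess --help):

/path/to/ljspeech/
  metadata.csv
  wavs/
    utt_0001.wav
    utt_0002.wav

where each metadata.csv line pairs a file id with its transcript:

utt_0001|Welcome to the world of speech synthesis!
utt_0002|This is the second utterance.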
Finally, you can train:
python3 -m piper_train \
--dataset-dir /path/to/training_dir/ \
--accelerator 'gpu' \
--devices 1 \
--batch-size 32 \
--validation-split 0.05 \
--num-test-examples 5 \
--max_epochs 10000 \
--precision 32

Training uses PyTorch Lightning. Run tensorboard --logdir /path/to/training_dir/lightning_logs to monitor. See python3 -m piper_train --help for many additional options.
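If a run is interrupted, you can usually continue from the most recent checkpoint under lightning_logs. A sketch, assuming the standard PyTorch Lightning --resume_from_checkpoint argument is exposed (confirm with python3 -m piper_train --help):

python3 -m piper_train \
  --dataset-dir /path/to/training_dir/ \
  --accelerator 'gpu' \
  --devices 1 \
  --batch-size 32 \
  --validation-split 0.05 \
  --num-test-examples 5 \
  --max_epochs 10000 \
  --precision 32 \
  --resume_from_checkpoint /path/to/training_dir/lightning_logs/path/to/checkpoint.ckpt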
It is highly recommended to train with the following Dockerfile:
FROM nvcr.io/nvidia/pytorch:22.03-py3
RUN pip3 install \
'pytorch-lightning'
ENV NUMBA_CACHE_DIR=.numba_cache
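One way to use it is to build an image and work inside the container with the piper source and your training directory mounted; a sketch, assuming the NVIDIA Container Toolkit is installed (the image tag and mount paths are arbitrary choices):

# with the Dockerfile above saved in the current directory
docker build -t piper-train .
docker run --rm -it --gpus all \
    -v /path/to/piper/src/python:/piper \
    -v /path/to/training_dir:/data \
    -w /piper \
    piper-train bash
# inside the container, install the training requirements and run piper_train with --dataset-dir /data as shown above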
See the various infer_* and export_* scripts in src/python/piper_train to test and export your voice from the checkpoint in lightning_logs. The dataset.jsonl file in your training directory can be used with python3 -m piper_train.infer for quick testing:

head -n5 /path/to/training_dir/dataset.jsonl | \
python3 -m piper_train.infer \
--checkpoint lightning_logs/path/to/checkpoint.ckpt \
--sample-rate 22050 \
--output-dir wavs
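Once the samples sound right, the checkpoint can be exported to ONNX for use with the piper binary. A sketch, assuming piper_train.export_onnx takes a checkpoint path and an output path, and that the config.json written during preprocessing becomes the voice's .onnx.json (check the script's --help):

python3 -m piper_train.export_onnx \
    lightning_logs/path/to/checkpoint.ckpt \
    /path/to/my-voice.onnx
cp /path/to/training_dir/config.json /path/to/my-voice.onnx.json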
See src/python_run to run voices from Python.

Run scripts/setup.sh to create a virtual environment and install the requirements. Then run:
echo 'Welcome to the world of speech synthesis!' | scripts/piper \
--model /path/to/voice.onnx \
--output_file welcome.wav

If you'd like to use a GPU, install the onnxruntime-gpu package:
.venv/bin/pip3 install onnxruntime-gpu

and then run scripts/piper with the --cuda argument. You will need to have a functioning CUDA environment, such as what's available in NVIDIA's PyTorch containers.
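To check that onnxruntime can actually see your GPU, list the available execution providers from the same virtual environment; CUDAExecutionProvider should appear when the CUDA setup is working:

.venv/bin/python3 -c 'import onnxruntime; print(onnxruntime.get_available_providers())'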