This is a simplified implementation of F5R-TTS, based on the paper "F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization". It is intended for learning purposes.
# Create a Python 3.10 conda env (virtualenv also works)
conda create -n f5r-tts python=3.10
conda activate f5r-tts
pip install -r requirements.txt
python ./src/f5_tts/infer/infer_cli.py \
--model F5TTS_v1_Base \
--ckpt_file "your_model_path" \
--ref_audio "path_to_reference.wav" \
--ref_text "reference_text" \
--gen_text "generated_text" \
--output_dir ./tests
Before running the GRPO phase, download the pretrained wespeaker model and place it in the src/rl/wespeaker/multilingual directory.
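If you want to confirm the model is in place before launching a long run, a small check like the one below may help. It is only a sketch and not part of this repo: the helper name wespeaker_dir_ready is hypothetical, and it only tests that the directory exists and is non-empty, since the exact file names of the wespeaker download are not listed here.

```python
from pathlib import Path

# Path taken from the instruction above; adjust it if your checkout differs.
WESPEAKER_DIR = Path("src/rl/wespeaker/multilingual")


def wespeaker_dir_ready(model_dir: Path = WESPEAKER_DIR) -> bool:
    """Hypothetical helper: True if the directory exists and has at least one entry."""
    return model_dir.is_dir() and any(model_dir.iterdir())


if __name__ == "__main__":
    if wespeaker_dir_ready():
        print(f"Found files in {WESPEAKER_DIR}")
    else:
        print(f"{WESPEAKER_DIR} is missing or empty; download the wespeaker model first")
```

Run it from the repository root so the relative path resolves correctly.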
accelerate config
# Data preparation
python src/f5_tts/train/datasets/prepare_libritts.py
# Pretraining phase
accelerate launch src/f5_tts/train/train.py
# GRPO phase
accelerate launch src/f5_tts/train/train_rl.py
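For readers new to GRPO, the core idea is that several outputs are sampled for the same prompt and each one is scored relative to the rest of its group. The sketch below illustrates that normalization step only. It is not taken from train_rl.py: the function name, the use of population standard deviation, and the reward values are all assumptions made for illustration.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Using the population std here is an illustrative choice.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Placeholder rewards for four samples of one prompt (made-up numbers).
    rewards = [0.2, 0.5, 0.9, 0.4]
    for r, a in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={r:.2f} advantage={a:+.3f}")
```

Samples that score above the group mean get a positive advantage and those below get a negative one, so no separate value model is needed to provide a baseline.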