
cowperc/ComfyUI_StepAudioTTS

 
 


A Text To Speech node using Step-Audio-TTS in ComfyUI. Can speak, rap, sing, or clone voice.

Update

[2025-03-21] ⚒️: Completely refactored the code and added more tunable parameters. max_length can now be adjusted to suit the text length. The optional unload_model setting controls whether the model is unloaded after a run; keeping it loaded speeds up subsequent inference.

[2025-03-07]⚒️: Custom speakers can be defined directly in ComfyUI\models\TTS\Step-Audio-speakers\speakers_info.json without the need for input in the node.

Move the Step-Audio-speakers folder from this repository to the ComfyUI\models\TTS folder. The structure is as follows:

ComfyUI\models\TTS
├── Step-Audio-Tokenizer
├── Step-Audio-speakers
├── Step-Audio-TTS-3B

You can then freely customize speakers in the ComfyUI\models\TTS\Step-Audio-speakers folder. Make sure each speaker name in the configuration matches exactly.
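Because speaker names must match exactly, it can help to list the names that are actually configured. The snippet below is a minimal sketch, not part of this project: it assumes speakers_info.json is a JSON object whose top-level keys are the speaker names, and the helper names `list_speakers` and `has_speaker` are hypothetical.

```python
import json


def list_speakers(info_path):
    """Return the sorted speaker names found in a speakers_info.json file.

    Assumption: the file holds a JSON object keyed by speaker name.
    """
    with open(info_path, encoding="utf-8") as f:
        info = json.load(f)
    if not isinstance(info, dict):
        raise ValueError("expected a JSON object keyed by speaker name")
    return sorted(info)


def has_speaker(info_path, name):
    """Exact, case-sensitive check that a speaker name is configured."""
    return name in list_speakers(info_path)


# Example call (the path follows the folder layout described above):
# list_speakers(r"ComfyUI\models\TTS\Step-Audio-speakers\speakers_info.json")
```

The comparison is deliberately case-sensitive, so a name that differs only in capitalization is reported as missing.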

[2025-03-06]⚒️: Added a new recording node, MW Audio Recorder, which records audio from a microphone. A progress bar shows the recording progress. Its parameters are:

| Parameter | Description | Range | Notes |
|---|---|---|---|
| trigger | Recording trigger; set to True to start recording | Boolean (True/False) | Must change from False to True to activate |
| record_sec | Main recording duration (seconds) | 1-60 (integer) | Actual duration |
| n_fft | FFT window size (affects frequency resolution) | 512, 1024, ..., 4096 (multiples of 512) | Higher values give better frequency resolution |
| sensitivity | Noise reduction sensitivity (higher = more aggressive) | 0.5-3.0 (step 0.1) | 1.2 = standard office environment |
| smooth | Time-frequency smoothing (higher = more natural) | 1, 3, 5, 7, 9, 11 (odd numbers) | Recommended: 5 for speech, 7 for music |
| sample_rate | Sampling rate (affects quality and file size) | 16000 / 44100 / 48000 Hz | 44100 = CD quality |
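The ranges in the table can be checked before values are passed to the node. The following is a sketch under stated assumptions: `validate_recorder_params` and `frequency_resolution_hz` are hypothetical helpers written for illustration and are not functions from this repository. The second helper uses the standard relation that one FFT bin spans sample_rate / n_fft Hz, which is why a larger n_fft gives finer frequency resolution.

```python
ALLOWED_SAMPLE_RATES = (16000, 44100, 48000)
ALLOWED_SMOOTH = (1, 3, 5, 7, 9, 11)


def validate_recorder_params(record_sec, n_fft, sensitivity, smooth, sample_rate):
    """Return a list of problems; an empty list means every value is in range."""
    problems = []
    if not (isinstance(record_sec, int) and 1 <= record_sec <= 60):
        problems.append("record_sec must be an integer from 1 to 60")
    if not (isinstance(n_fft, int) and 512 <= n_fft <= 4096 and n_fft % 512 == 0):
        problems.append("n_fft must be a multiple of 512 from 512 to 4096")
    # Compare in tenths so that values such as 1.2 are not rejected by float noise.
    tenths = round(sensitivity * 10)
    if not (5 <= tenths <= 30 and abs(sensitivity * 10 - tenths) < 1e-6):
        problems.append("sensitivity must be 0.5 to 3.0 in steps of 0.1")
    if smooth not in ALLOWED_SMOOTH:
        problems.append("smooth must be an odd number from 1 to 11")
    if sample_rate not in ALLOWED_SAMPLE_RATES:
        problems.append("sample_rate must be 16000, 44100, or 48000")
    return problems


def frequency_resolution_hz(sample_rate, n_fft):
    """Width of one FFT bin in Hz."""
    return sample_rate / n_fft
```

For example, at 44100 Hz an n_fft of 2048 gives bins about 21.5 Hz wide, while an n_fft of 512 gives bins about 86.1 Hz wide.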

[2025-03-02]⚒️: Added an experimental custom_mark input. Wrap each mark in parentheses, for example (温柔)(东北话), meaning "gentle" and "Northeastern dialect". It may affect the output.
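The parenthesized format can be assembled from a list of marks. This is only an illustrative sketch: `build_custom_mark` is a hypothetical helper, not part of the node, and it assumes marks are simply concatenated with no separator, as in the example above.

```python
def build_custom_mark(*marks):
    """Wrap each non-empty mark in parentheses and join them with no separator."""
    cleaned = [m.strip() for m in marks if m and m.strip()]
    for m in cleaned:
        # Nested parentheses would make the mark boundaries ambiguous.
        if "(" in m or ")" in m:
            raise ValueError(f"mark must not contain parentheses: {m!r}")
    return "".join(f"({m})" for m in cleaned)


# build_custom_mark("温柔", "东北话") produces the string shown in the example above.
```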

[2025-02-25]⚒️: Added support for custom speakers via custom_speaker.

Installation

cd ComfyUI/custom_nodes
git clone https://github.com/billwuhao/ComfyUI_StepAudioTTS.git
cd ComfyUI_StepAudioTTS
pip install -r requirements.txt

# For the portable build, use the embedded interpreter instead (adjust the path to your python_embeded folder)
./python_embeded/python.exe -m pip install -r requirements.txt

Model Download

Download to the ComfyUI\models\TTS folder

Huggingface

| Models | Links |
|---|---|
| Step-Audio-Tokenizer | 🤗huggingface |
| Step-Audio-TTS-3B | 🤗huggingface |

Modelscope

| Models | Links |
|---|---|
| Step-Audio-Tokenizer | modelscope |
| Step-Audio-TTS-3B | modelscope |
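After downloading, a quick check can confirm that ComfyUI\models\TTS contains the three folders this README lists. This is a minimal sketch; `missing_tts_dirs` is a hypothetical helper, and the path you pass depends on where your ComfyUI installation lives.

```python
from pathlib import Path

# Folder names taken from the directory structure given in this README.
EXPECTED_DIRS = ("Step-Audio-Tokenizer", "Step-Audio-speakers", "Step-Audio-TTS-3B")


def missing_tts_dirs(tts_root):
    """Return the expected folder names that are absent under tts_root."""
    root = Path(tts_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


# Example call; an empty list means all three folders are present:
# missing_tts_dirs("ComfyUI/models/TTS")
```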

Supports Chinese, English, Korean, Japanese, Sichuanese, Cantonese, and more.

Acknowledgements

Part of the code for this project comes from:

Thank you to all the open-source projects for their contributions to this project!
