A semantic–acoustic dual-stream speech codec achieving state-of-the-art performance in speech reconstruction and semantic representation across bitrates.
```bash
conda create -n sac python=3.10
conda activate sac
pip install -r requirements.txt  # pip version == 24.0
```

To use SAC, you need to prepare the pretrained dependencies, including the GLM-4-Voice-Tokenizer for semantic tokenization and the ERes2Net speaker encoder for speaker feature extraction (used during codec training). Make sure the corresponding model paths are correctly set in your configuration file (e.g., `configs/xxx.yaml`).
The following table lists the available SAC checkpoints:
| Model Name | Hugging Face | Sample Rate | Token Rate | BPS |
|---|---|---|---|---|
| SAC | 🤗 Soul-AILab/SAC-16k-37_5Hz | 16 kHz | 37.5 Hz | 525 |
| SAC | 🤗 Soul-AILab/SAC-16k-62_5Hz | 16 kHz | 62.5 Hz | 875 |
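As a sanity check on the table, the bitrate (BPS) is simply the token rate multiplied by the number of bits spent per frame; both checkpoints work out to 14 bits per frame. A minimal sketch of this arithmetic (the per-frame bit budget is derived from the table, not from the official configuration):

```python
# Bitrate (bits/s) = token rate (frames/s) x bits per frame.
def bps(token_rate_hz: float, bits_per_frame: int) -> float:
    return token_rate_hz * bits_per_frame

# Both released checkpoints spend 14 bits per frame:
# 37.5 Hz * 14 = 525 bps, 62.5 Hz * 14 = 875 bps.
assert bps(37.5, 14) == 525.0
assert bps(62.5, 14) == 875.0
```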
To perform audio reconstruction, you can use the following command:
```bash
python -m bins.infer
```

We also provide batch scripts for audio reconstruction, encoding, decoding, and embedding extraction in the `scripts/batch` directory as references (see the batch scripts guide for details).
You can run the following command to perform evaluation:
```bash
bash scripts/eval.sh
```

For details on dataset preparation and evaluation setup, please refer to the evaluation guide first.
Before training, organize your dataset in JSONL format. You can refer to example/training_data.jsonl. Each entry should include:
- `utt` — unique utterance ID (customizable)
- `wav_path` — path to the raw audio file
- `ssl_path` — path to offline-extracted Whisper features (for semantic supervision)
- `semantic_token_path` — path to offline-extracted semantic tokens
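The schema above can be sketched as follows; a minimal Python snippet that writes one JSONL entry (the utterance ID and all paths here are hypothetical placeholders):

```python
import json

# One entry per line; field names follow the schema above.
# Paths and the utterance ID are placeholders for illustration.
entries = [
    {
        "utt": "spk001_0001",
        "wav_path": "data/wavs/spk001_0001.wav",
        "ssl_path": "data/ssl/spk001_0001.npy",
        "semantic_token_path": "data/tokens/spk001_0001.npy",
    },
]

with open("training_data.jsonl", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```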
To accelerate training, extract semantic tokens and Whisper features offline before starting. Refer to the feature extraction guide for detailed instructions.
You can adjust training and DeepSpeed configurations by editing:
- `configs/xxx.yaml` — main training configuration
- `configs/ds_stage2.json` — DeepSpeed configuration
Run the following script to start SAC training:
```bash
bash scripts/train.sh
```

Our codebase builds upon the awesome SparkVox and DAC. We thank the authors for their excellent work.
If you find this work useful in your research, please consider citing:
```bibtex
@article{chen2025sac,
  title={SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization},
  author={Chen, Wenxi and Wang, Xinsheng and Yan, Ruiqi and Chen, Yushen and Niu, Zhikang and Ma, Ziyang and Li, Xiquan and Liang, Yuzhe and Wen, Hanlin and Yin, Shunshun and others},
  journal={arXiv preprint arXiv:2510.16841},
  year={2025}
}
```

This project is licensed under the Apache 2.0 License.