Starred repositories
A modern GUI client based on Tauri, designed to run on Windows, macOS and Linux for a tailored proxy experience
A lightweight, streaming zero-shot voice conversion system via Mean Flows
A 10,000+ hour dataset for Chinese speech recognition
Speaker anonymization pipeline that hides a speaker's identity by changing the voice in a recording.
A Repository for Single- and Multi-modal Speaker Verification, Speaker Recognition and Speaker Diarization
[CVPR 2025] "DiC: Rethinking Conv3x3 Designs in Diffusion Models", a performant & speedy Conv3x3 diffusion model.
[Interspeech 2025] DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec
[ICLR 2025] SOTA discrete acoustic codec models with 40/75 tokens per second for audio language modeling
State-of-the-art audio codec with 90x compression factor. Supports 44.1kHz, 24kHz, and 16kHz mono/stereo audio.
Official repository for FlowSE (Interspeech 2025)
TinyNeuralNetwork is an efficient and easy-to-use deep learning model compression framework.
[Unofficial] PyTorch implementation of "Conformer: Convolution-augmented Transformer for Speech Recognition" (INTERSPEECH 2020)
[AAAI 2025] EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditioning
This project uses a variety of advanced voiceprint recognition models such as EcapaTdnn, ResNetSE, ERes2Net, CAM++, etc., and more models may be supported in the future. At the …
Official implementation of "Sonic: Shifting Focus to Global Audio Perception in Portrait Animation"
grazder / DeepFilterNet
Forked from Rikorose/DeepFilterNet
Noise suppression using deep filtering
Limiter, compressor, convolver, equalizer, auto volume, and many other plugins for PipeWire applications
Unofficial SoundStream implementation in PyTorch with training code and a 16kHz pretrained checkpoint
LibriSpeech-Long is a benchmark dataset for long-form speech generation and processing. Released as part of "Long-Form Speech Generation with Spoken Language Models" (arXiv 2024).
SONAR, a new multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders.
🚀 Train a 26M-parameter visual multimodal VLM from scratch in just 1 hour! 🌏
🚀🚀 Train a small 26M-parameter GPT completely from scratch in just 2 hours! 🌏
An AI-powered speech processing toolkit with open-source SOTA pretrained models, supporting speech enhancement, separation, target speaker extraction, and more.
PyTorch implementation of TCSinger (EMNLP 2024): Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control