Stars
Vogent Turn: fast, open-source turn detection for voice AI applications
SoulX-Podcast is an inference codebase by the Soul AI team for generating high-fidelity podcasts from text.
Long-form streaming TTS system for multi-speaker dialogue generation
Awesome Neural Codec Models, Text-to-Speech Synthesizers & Speech Language Models
A Conversational Speech Generation Model
Mellea is a library for writing generative programs.
AG-UI: the Agent-User Interaction Protocol. Bring Agents into Frontend Applications.
UI over MCP. Create next-gen UI experiences with the protocol and SDK!
Example apps for the Apps SDK
MiMo-Audio: Audio Language Models are Few-Shot Learners
Build an email assistant with human-in-the-loop and memory
Chat with your Letta agents over a low-latency voice connection. Advanced voice mode, but with advanced memory.
Speech-to-text, text-to-speech, speaker diarization, speech enhancement, source separation, and VAD using next-gen Kaldi with onnxruntime without Internet connection. Support embedded systems, Andr…
[ICASSP 2024] 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching
Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.
Fast and local neural text-to-speech engine
Flexible and powerful framework for managing multiple AI agents and handling complex conversations
Next Generation Agentic Proxy for AI Agents and MCP servers
A lightweight end-of-utterance detection model fine-tuned on SmolLM2-135M, optimized for Raspberry Pi and low-power devices.