Stars
Easily turn large sets of image URLs into an image dataset. Can download, resize, and package 100M URLs in 20 hours on one machine.
On-device voice activity detection (VAD) powered by deep learning
A simple library and set of tools for parsing, modifying, and composing SRT files.
OCR & Document Extraction using vision models
The official ElevenLabs MCP server
FlashMLA: Efficient Multi-head Latent Attention Kernels
Text to speech alignment using CTC forced alignment
HunyuanVideo: A Systematic Framework For Large Video Generation Model
DUSTED: Spoken-Term Discovery using Discrete Speech Units
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.
The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use th…
Schedule-Free Optimization in PyTorch
♾️ A React hook that makes it easy to add infinite scroll to any component. It is very simple to integrate and supports any direction.
The codemod Stripe used to migrate 6.5M+ lines of code from Flow to TypeScript
AI Audio Datasets (AI-ADS) 🎵, including Speech, Music, and Sound Effects, which can provide training data for Generative AI, AIGC, AI model training, intelligent audio tool development, and audio a…
A modern replacement for Redis and Memcached
Python re-implementation of the (constrained) spectral clustering algorithms used in Google's speaker diarization papers.
Material UI: Comprehensive React component library that implements Google's Material Design. Free forever.
Shared data types for building collaborative software
Unofficial implementation of NaturalSpeech2 for Voice Conversion and Text to Speech
Caption-Anything is a versatile tool combining image segmentation, visual captioning, and ChatGPT, generating tailored captions with diverse controls for user preferences. https://huggingface.co/sp…
Phoneme tokenizer and grapheme-to-phoneme model for 8k languages
🤖 Assemble, configure, and deploy autonomous AI Agents in your browser.
A family of diffusion models for text-to-audio generation.