Real-time voice-to-text transcription with hotkey support
Maivi (My AI Voice Input) is a cross-platform desktop application that turns your voice into text using a state-of-the-art AI speech-recognition model. Simply press Alt+Q to start recording, and press it again to stop. Your transcription appears in real time and is automatically copied to your clipboard.
- 🎤 Hotkey Recording - Toggle recording with Alt+Q
- ⚡ Real-time Transcription - See text appear as you speak
- 📋 Clipboard Integration - Automatic copy to clipboard
- 🪟 Floating Overlay - Live transcription in a sleek overlay window
- 🔀 Smart Chunk Merging - Advanced overlap-based merging eliminates duplicates
- 💻 CPU-Only - No GPU required (though GPU acceleration is supported)
- 📊 High Accuracy - Powered by the NVIDIA Parakeet TDT 0.6B model (~6-9% WER)
- 🚀 Fast - ~0.36x RTF (processes 7s of audio in ~2.5s on CPU)
Installation:

CPU-only (recommended - much faster, 100MB vs 2GB+):

```bash
pip install maivi --extra-index-url https://download.pytorch.org/whl/cpu
```

Or with GPU support (if you have an NVIDIA GPU):

```bash
pip install maivi --extra-index-url https://download.pytorch.org/whl/cu121
```

Standard install (may download large CUDA files):

```bash
pip install maivi
```
System dependencies (PortAudio):

Linux:

```bash
sudo apt-get install portaudio19-dev python3-pyaudio
```

macOS:

```bash
brew install portaudio
```

Windows:

- PortAudio is usually included with PyAudio
GUI Mode (Recommended):

```bash
maivi
```

Press Alt+Q to start recording, press Alt+Q again to stop. The transcription will appear in a floating overlay and be copied to your clipboard.
CLI Mode:

```bash
# Basic CLI
maivi-cli

# With live terminal UI
maivi-cli --show-ui

# Custom parameters
maivi-cli --window 10 --slide 5 --show-ui
```
Controls:
- Alt+Q - Start/stop recording (toggle mode)
- Esc - Exit application
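Under the hood, a global hotkey toggle can be wired up in just a few lines. Here is a minimal sketch assuming the pynput library (an illustration only, not Maivi's actual implementation, which lives in the GUI code):

```python
# Minimal sketch of an Alt+Q recording toggle, assuming the pynput library.
# Illustrative only; this is not Maivi's actual implementation.
from pynput import keyboard

recording = False

def toggle_recording():
    global recording
    recording = not recording
    print("Recording started" if recording else "Recording stopped")

# GlobalHotKeys runs in a background thread and fires on Alt+Q system-wide.
listener = keyboard.GlobalHotKeys({"<alt>+q": toggle_recording})
listener.start()
listener.join()  # block until the listener thread exits
```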
Maivi uses a streaming architecture:

- Sliding Window Recording - Captures audio in overlapping 7-second chunks every 3 seconds
- Real-time Transcription - Each chunk is transcribed by the NVIDIA Parakeet model
- Smart Merging - Chunks are merged using overlap detection (4-second overlap); a simplified sketch follows the example below
- Live Updates - The UI updates in real-time as transcription progresses
```
Chunk 1: "hello world how are you"
Chunk 2:             "how are you doing today"
                      ^^^^^^^^^^^
                      Overlap detected → merge!

Result:  "hello world how are you doing today"
```
This approach ensures:

- ✅ No words cut mid-syllable
- ✅ Context preserved for better accuracy
- ✅ Seamless merging without duplicates
- ✅ Fast processing (no queue buildup)
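As a rough sketch of the idea, here is a simplified word-level version of the overlap merge (the real implementation in `chunk_merger.py` operates on the 4-second audio overlap, so the details differ):

```python
def merge_chunks(prev: str, new: str, min_overlap: int = 2) -> str:
    """Merge two transcript chunks by finding the longest word suffix of
    `prev` that is also a prefix of `new`, then dropping the duplicate."""
    prev_words, new_words = prev.split(), new.split()
    matched = 0
    # Try the longest possible overlap first, shrinking until a match is found.
    for k in range(min(len(prev_words), len(new_words)), min_overlap - 1, -1):
        if prev_words[-k:] == new_words[:k]:
            matched = k
            break
    return " ".join(prev_words + new_words[matched:])

print(merge_chunks("hello world how are you", "how are you doing today"))
# -> "hello world how are you doing today"
```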
```bash
maivi-cli --window 7.0 --slide 3.0 --delay 2.0
```

- `--window`: Chunk size in seconds (default: 7.0). Larger = better quality, slower processing.
- `--slide`: Slide interval in seconds (default: 3.0). Smaller = more overlap, higher CPU usage. Rule: must be greater than window × 0.36 (the RTF) to avoid queue buildup; see the check after this list.
- `--delay`: Processing start delay in seconds (default: 2.0).
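That rule is just the real-time constraint: each chunk must finish transcribing before the next one arrives. A quick sanity check, using the ~0.36 RTF figure from this README (the helper function is hypothetical):

```python
RTF = 0.36  # real-time factor on CPU (~0.36x, per the performance notes)

def keeps_up(window: float, slide: float, rtf: float = RTF) -> bool:
    # A chunk of `window` seconds takes window * rtf seconds to transcribe;
    # a new chunk arrives every `slide` seconds, so processing must be faster.
    return window * rtf < slide

print(keeps_up(7.0, 3.0))   # True:  2.52s < 3s (default settings are safe)
print(keeps_up(10.0, 3.0))  # False: 3.6s > 3s -> queue buildup
```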
```bash
# Speed adjustment (experimental)
maivi-cli --speed 1.5

# Custom UI width
maivi-cli --show-ui --ui-width 50

# Disable pause detection
maivi-cli --no-pause-breaks

# Stream to file (for voice commands)
maivi-cli --output-file transcription.txt
```
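A downstream script can watch the `--output-file` for new text and react to it. A minimal sketch (the file name and trigger phrase are illustrative assumptions, not part of Maivi):

```python
# Illustrative consumer for --output-file; not part of Maivi itself.
# Assumes the transcript file already exists.
import time

def follow(path: str):
    """Yield lines appended to `path`, polling once per second."""
    with open(path, "r") as f:
        while True:
            line = f.readline()
            if line:
                yield line.strip().lower()
            else:
                time.sleep(1.0)

for text in follow("transcription.txt"):
    if "open browser" in text:  # example trigger phrase
        print("Voice command detected: open browser")
```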
Maivi can be packaged as standalone executables for easy distribution:

```bash
# Install build dependencies
pip install "maivi[build]"

# Build executable
pyinstaller --onefile --windowed \
  --name maivi \
  --add-data "src/maivi:maivi" \
  src/maivi/__main__.py
```
Pre-built executables are available in Releases.
```bash
# Clone repository
git clone https://github.com/MaximeRivest/maivi.git
cd maivi

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest
```
Project structure:

```
maivi/
├── src/maivi/
│   ├── __init__.py
│   ├── __main__.py          # GUI entry point
│   ├── core/
│   │   ├── streaming_recorder.py
│   │   ├── chunk_merger.py
│   │   └── pause_detector.py
│   ├── gui/
│   │   └── qt_gui.py
│   ├── cli/
│   │   ├── cli.py
│   │   ├── server.py
│   │   └── terminal_ui.py
│   └── utils/
├── tests/
├── docs/
├── pyproject.toml
├── README.md
└── LICENSE
```
Troubleshooting:

Transcription contains "..." markers: This is expected behavior when there are long pauses (5+ seconds of silence). The system adds "..." gap markers to indicate the pause.
Transcription falls behind: Check that processing time < slide interval:

- Processing time per chunk: window_seconds × 0.36 (the RTF)
- This should be less than slide_seconds
- Default: 7 × 0.36 = 2.52s < 3s ✅
Model download fails: The first run downloads the NVIDIA Parakeet model (~600MB) from HuggingFace. If the download fails:

- Check your internet connection
- Verify HuggingFace is accessible
- Clear the cache: `rm -rf ~/.cache/huggingface/`
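You can also pre-download the model before the first run. A sketch assuming NeMo is installed; the exact model id below is an assumption, so check Maivi's source for the one it actually loads:

```python
# Hypothetical pre-download of the Parakeet model via NeMo.
# The model id is an assumption; verify it against Maivi's source.
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-0.6b-v2"
)
print("Model downloaded and cached.")
```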
GUI crashes on Linux:

```bash
# Check Qt installation
python -c "from PySide6 import QtWidgets; print('Qt OK')"

# Fall back to CLI mode
maivi-cli --show-ui
```
Performance:

Memory:
- Model: ~2GB RAM
- Audio buffer: ~1MB
- Total: ~2.5GB RAM
CPU:
- Idle: <5% CPU
- Recording: 30-40% of 1 core
- Transcription: 100% of 1 core (during processing)
Latency:
- First transcription: 2s (start delay)
- Updates: Every 3s (slide interval)
- Completion: 1-3s after recording stops
Accuracy:
- Model WER: ~5-8%
- Overlap merging: <1% word loss
- Total effective WER: ~6-9%
Roadmap:

v0.2 - Platform Support:
- Test and verify macOS support
- Test and verify Windows support
- Platform-specific installers (.app, .exe)
v0.3 - Features:
- Configurable hotkeys via GUI
- Multi-language support
- Custom model selection
- Voice commands support
v0.4 - Optimization:
- GPU acceleration (CUDA)
- Export formats (JSON, SRT)
- Text editor integration
- Plugin system
MIT License - see LICENSE file for details.
Acknowledgments:

- Built with the NVIDIA NeMo ASR toolkit
- Uses the NVIDIA Parakeet TDT 0.6B model
- GUI powered by PySide6
Contributions are welcome! Please feel free to submit a Pull Request:

- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Support:

- 📫 Create an issue
- 💡 Feature requests
- 🐛 Bug reports
Made with ❤️ by Maxime Rivest