A cross-platform desktop application that records audio and transcribes it to text using OpenAI's Whisper API or compatible services. Perfect for dictation, note-taking, and accessibility.
This project is a tool built for my personal needs. I use Linux + Wayland, and the tool has been tested only on this platform.
It supports only OpenAI-compatible Whisper APIs. The supported output methods are listed below.
Feel free to fork the project and make it compatible with your needs. PRs are welcome.
The backend was rewritten to an actor-based architecture using Akka.NET and the pipeline was extended with optional AI post‑processing and dataset saving. Comprehensive unit and integration tests were added.
Key changes:
- Akka.NET actor model with a supervised pipeline and clear FSM states
- Frozen settings per session, stashing updates while processing
- Observer actor exposes a reactive stream for UI state updates
- Optional post‑processing via Microsoft.Extensions.AI (OpenAI‑compatible)
- Optional dataset saving (original → processed pairs) when post‑processing is enabled
- Robust error handling and retries per actor (configurable policy)
- Tests: FSM/unit, pipeline integration with deterministic timing, and error scenarios
- Audio Recording: Capture audio from selected microphone (system default or user‑selected)
- Speech-to-Text Transcription: Convert speech to text using OpenAI's Whisper API or compatible services
- Multiple Output Options:
  - Copy to clipboard (Avalonia clipboard; splash workaround due to platform issue)
  - Use `wl-copy` for Wayland systems
  - Type text directly using `ydotool`
  - Type text directly using `wtype`
- System Tray Integration: Monitor recording status with color-coded tray icon
- Unix Socket Control: Control the application via command line scripts
- Configurable Settings:
  - API endpoint and key
  - Whisper model selection
  - Language preference
  - Custom prompts for better recognition
- Optional Post‑Processing: Improve text with an LLM via Microsoft.Extensions.AI
- Optional Dataset Saving (for ML datasets): Append original and processed pairs when post‑processing is enabled (see Configuration → Dataset Saving)
- Safety Timeouts (optional): Hard cut‑offs for Recording, Transcribing, Post‑Processing steps
- Remove the splash screen after clipboard issue is fixed
- Add shortcut support
- Add more post-processing options
- For Linux: `lame`, `socat` (for socket control)
- For Wayland clipboard support: `wl-copy`
- For typing output: `ydotool` or `wtype`
- OpenAL (see dedicated section below)
- OpenAI API key or compatible Whisper API endpoint
  - OpenAI base URL: `https://api.openai.com`
  - OpenAI model name: `whisper-1`
  - Self-hosted servers often use Whisper Large variants (e.g., faster‑whisper). The UI defaults use a large model name. Adjust to `whisper-1` if you call OpenAI directly.
The application requires a native OpenAL runtime for audio capture. The repository contains only the managed wrapper (OpenTK.OpenAL); the native runtime is not bundled.
Linux:
- Usually already installed as a dependency of other desktop software.
- If recording fails with `DllNotFoundException: libopenal.so`, install your distro package:
  - Arch / Manjaro: `pacman -S openal`
  - Debian / Ubuntu: `sudo apt install libopenal1`
  - Fedora: `sudo dnf install openal-soft`
  - openSUSE: `sudo zypper install openal-soft`
macOS:
- A system OpenAL is present. If you explicitly need OpenAL Soft, you can install it with Homebrew: `brew install openal-soft` (normally not required).
Windows:
- Install OpenAL using the official installer from https://www.openal.org/downloads/ (oalinst.exe) and restart the application; OR use a package manager:
  - WinGet: `winget install --id CreativeLabs.OpenAL --source winget`
  - Chocolatey: `choco install openal`
- Symptom if missing: `System.DllNotFoundException: Could not load the dll 'openal32.dll'` when starting recording.
- For Linux: Install `lame` from your package manager.
- Ensure OpenAL is available (see OpenAL section).
1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/WhisperVoiceInput.git
   cd WhisperVoiceInput
   ```

2. Build the application:

   ```bash
   dotnet build -c Release
   ```

3. Run the application:

   ```bash
   dotnet run --project WhisperVoiceInput/WhisperVoiceInput.csproj
   ```
Download the latest release from the Releases page.
On first run, the application creates a configuration directory at:
~/.config/WhisperVoiceInput/ (Linux/macOS)
%APPDATA%\WhisperVoiceInput\ (Windows)
- Open the settings window by clicking on the tray icon
- Enter your OpenAI API key or configure a compatible endpoint
- Select the Whisper model
  - OpenAI: `whisper-1`
  - Self-hosted: a Faster-Whisper model name (e.g., `whisper-large-v3`)
- Set your preferred language (e.g., "en")
- Optionally add a prompt to guide the transcription
- In Settings → Audio Settings, use the “Input Device” dropdown to choose a microphone:
  - `System default` uses your OS default input device.
  - Or select a specific device from the list.
- Click “Refresh” to enumerate devices on demand (keeps startup/settings opening light‑weight).
- Under the hood, the app queries OpenAL capture devices and, when supported, also uses the extended enumeration to include more (e.g., virtual) devices.
- The selection is saved as a plain string setting (`PreferredCaptureDevice`).
  - Empty value means `System default`.
- If the preferred device is unavailable at runtime, the recorder automatically falls back to the system default.
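For reference, here is a minimal sketch of the enumeration and fallback behavior described above, using OpenTK's OpenAL bindings. It is illustrative only; the helper name, the 16 kHz capture format, and the error handling are assumptions, not the app's actual code.

```csharp
// Illustrative sketch: list OpenAL capture devices and open the preferred one,
// falling back to the system default when it is missing or unset.
using System;
using System.Collections.Generic;
using System.Linq;
using OpenTK.Audio.OpenAL;

public static class CaptureDeviceSelector
{
    // Enumerate the capture devices OpenAL reports (what "Refresh" would show).
    public static IEnumerable<string> ListCaptureDevices() =>
        ALC.GetStringList(GetEnumerationStringList.CaptureDeviceSpecifier);

    public static ALCaptureDevice OpenPreferredOrDefault(string? preferredCaptureDevice)
    {
        const int sampleRate = 16000;               // assumed capture rate for speech
        const int bufferSizeInSamples = sampleRate; // ~1 second of buffered audio

        // An empty PreferredCaptureDevice setting means "System default".
        var name = string.IsNullOrWhiteSpace(preferredCaptureDevice) ? null : preferredCaptureDevice;

        // If the preferred device is unavailable, fall back to the default (null name).
        if (name != null && !ListCaptureDevices().Contains(name))
            name = null;

        var device = ALC.CaptureOpenDevice(name, sampleRate, ALFormat.Mono16, bufferSizeInSamples);
        if (device == ALCaptureDevice.Null)
            throw new InvalidOperationException("Could not open an OpenAL capture device.");
        return device;
    }
}
```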
Choose your preferred output method:
- Clipboard (Avalonia API)
- wl-copy (Wayland)
- ydotool (types the text)
- wtype (types the text)
- Enable to improve transcriptions via Microsoft.Extensions.AI
- Endpoint and model are OpenAI‑compatible (OpenAI or local LLM gateways)
- Defaults in the app may point to a local endpoint and model (e.g., Ollama at `http://localhost:11434` with `llama3.2`); adjust as needed
- Provide an API key if your endpoint requires it
- Three independent limits in minutes: Recording, Transcribing, Post‑Processing
- Each timeout can be enabled via a toggle and a minutes spinner (minimum 1 minute)
- Semantics:
  - Value > 0: timeout is enabled; the corresponding actor schedules a self‑timeout message
  - Value ≤ 0 (internally stored as -1): timeout is disabled
- Behavior on timeout (see the sketch after this list):
  - The actor throws `UserConfiguredTimeoutException`, which is treated as unrecoverable by supervision (no retries)
  - For Recording and Transcribing, the current audio file is deleted to avoid leaving temporary files behind
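A minimal sketch of this self-timeout pattern with Akka.NET's scheduler follows. The actor and the `RecordingTimeout` message are illustrative; only `UserConfiguredTimeoutException` and the -1 "disabled" convention come from the pipeline description above.

```csharp
// Illustrative sketch of the per-actor safety timeout: when the configured value
// is > 0, the actor schedules a message to itself and treats its arrival as an
// unrecoverable failure, so supervision does not retry the step.
using System;
using Akka.Actor;

public sealed class UserConfiguredTimeoutException : Exception
{
    public UserConfiguredTimeoutException(string message) : base(message) { }
}

public sealed record RecordingTimeout; // hypothetical self-timeout message

public class RecordingActorSketch : ReceiveActor
{
    public RecordingActorSketch(double timeoutMinutes)
    {
        // timeoutMinutes <= 0 (stored as -1) means "disabled": nothing is scheduled.
        if (timeoutMinutes > 0)
        {
            Context.System.Scheduler.ScheduleTellOnce(
                TimeSpan.FromMinutes(timeoutMinutes), Self, new RecordingTimeout(), Self);
        }

        Receive<RecordingTimeout>(_ =>
        {
            // Delete the partially written audio file here, then escalate.
            throw new UserConfiguredTimeoutException("Recording exceeded the configured limit.");
        });
    }
}
```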
Build your own training datasets from the pipeline output.
- Availability: Only works when Post‑Processing is enabled
- Format per entry: `<original text> - <processed text> ---`
- How to enable:
  - In Settings, enable Post‑Processing
  - Turn on "Save dataset"
  - Choose the target file path (created if missing)
  - Run the pipeline; after post‑processing, an entry is appended asynchronously
- Notes:
  - Appends are non-blocking and won’t stall the UI (see the sketch after this list)
  - Success and errors are logged
  - Ensure the chosen location is writable by your user
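A minimal sketch of such a non-blocking append is shown below. It is a hypothetical helper, not the app's actual writer; the entry layout assumes the "Format per entry" string above, and the exact separator placement may differ.

```csharp
// Illustrative sketch: append one "<original> - <processed>" pair followed by a
// "---" separator to the dataset file without blocking the caller.
using System.IO;
using System.Threading.Tasks;

public static class DatasetWriterSketch
{
    public static async Task AppendAsync(string datasetPath, string original, string processed)
    {
        // Create the directory on first use so the target file can be created if missing.
        var directory = Path.GetDirectoryName(datasetPath);
        if (!string.IsNullOrEmpty(directory))
            Directory.CreateDirectory(directory);

        var entry = $"{original} - {processed}\n---\n";
        await File.AppendAllTextAsync(datasetPath, entry);
    }
}
```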
I personally use Speaches as a self-hosted Whisper API.
An example docker-compose file for the GPU-enhanced version of Speaches:
```yaml
speaches:
  image: ghcr.io/speaches-ai/speaches:0.7.0-cuda # https://github.com/speaches-ai/speaches/pkgs/container/speaches/versions?filters%5Bversion_type%5D=tagged
  container_name: speaches
  restart: unless-stopped
  ports:
    - "1264:8000"
  volumes:
    - ./speaches_cache:/home/ubuntu/.cache/huggingface/hub
  environment:
    - ENABLE_UI=false
    - WHISPER__TTL=-1 # default TTL is 300 (5min), -1 to disable, 0 to unload directly, 43200=12h
    - WHISPER__INFERENCE_DEVICE=cuda
    - WHISPER__COMPUTE_TYPE=float16
    - WHISPER__MODEL=deepdml/faster-whisper-large-v3-turbo-ct2 # uses ~2.5Gb VRAM in CUDA version
    #- WHISPER__MODEL=Systran/faster-whisper-large-v3
    - WHISPER__DEVICE_INDEX=1
    - ALLOW_ORIGINS=[ "*", "app://obsidian.md" ]
    - API_KEY=sk-1234567890
    - LOOPBACK_HOST_URL=yourdomain.com
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]
```

- Click the tray icon to start/stop recording
- When recording, the icon turns yellow
- During transcription/post‑processing/saving, the icon turns light blue
- On success, the icon briefly turns green and the text is output per your settings
- On error, the icon turns red and a tooltip shows details
The application can be controlled via a Unix socket. Two scripts are provided in the repo root:
- `transcribe_toggle_simplified.sh` (simple)
- `transcribe_toggle.sh` (enhanced checks)
Make the scripts executable:

```bash
chmod +x transcribe_toggle_simplified.sh transcribe_toggle.sh
```

Run to toggle recording:

```bash
./transcribe_toggle_simplified.sh
```

Global hotkey support is available on Windows, macOS, and Linux X11. It is automatically disabled on Wayland. Configure the hotkey in Settings → Global Hotkey by focusing the field and pressing your desired combination. A Reset button clears it.
Shortcuts are implemented with the SharpHook library. Check its documentation for platform-specific limitations.
On Wayland, use the provided toggle scripts and bind them in your DE (examples below).
```bash
gsettings set org.gnome.settings-daemon.plugins.media-keys custom-keybindings "['/org/gnome/settings-daemon/plugins/media-keys/custom-keybindings/custom0/']"
gsettings set org.gnome.settings-daemon.plugins.media-keys.custom-keybinding:/org/gnome/settings-daemon/plugins/media-keys/custom-keybindings/custom0/ name "Toggle WhisperVoiceInput"
gsettings set org.gnome.settings-daemon.plugins.media-keys.custom-keybinding:/org/gnome/settings-daemon/plugins/media-keys/custom-keybindings/custom0/ command "/path/to/transcribe_toggle_simplified.sh"
gsettings set org.gnome.settings-daemon.plugins.media-keys.custom-keybinding:/org/gnome/settings-daemon/plugins/media-keys/custom-keybindings/custom0/ binding "<Ctrl><Alt>w"
```

- System Settings > Shortcuts > Custom Shortcuts
- Add a new shortcut
- Set the command to `/path/to/transcribe_toggle_simplified.sh`
- Assign a keyboard shortcut
A local Seq server is supported and should be reachable at `http://localhost:5341`.
- Ensure your microphone is properly connected and set as the default input device
- Check system permissions for microphone access
- Ensure OpenAL is installed (see OpenAL section). Windows symptom if missing: `System.DllNotFoundException: Could not load the dll 'openal32.dll'`
- Verify your API key is correct (if required by your endpoint)
- Check your internet connection
- Ensure the server address is correct
- Try a different Whisper model (smaller models may be faster but less accurate)
- Verify endpoint URL, model, and API key
- If using a local LLM gateway, confirm it’s running and reachable
- Ensure the application is running
- Check if the socket file exists at `/tmp/WhisperVoiceInput/pipe`
- Verify `socat` is installed: `sudo apt install socat`
On Linux/macOS: ~/.config/WhisperVoiceInput/logs/
On Windows: %APPDATA%\WhisperVoiceInput\logs\
Actors and responsibilities:
- MainOrchestratorActor (FSM): Coordinates the pipeline (Idle → Recording → Transcribing → PostProcessing → Saving). Supervises children, freezes settings per session, stashes settings updates, notifies UI via Observer.
- AudioRecordingActor: Records from OpenAL and writes MP3 using NAudio.Lame. Emits AudioRecordedEvent.
- TranscribingActor: Calls `{ServerAddress}/v1/audio/transcriptions` with model/language/prompt (async via PipeTo). Emits TranscriptionCompletedEvent. Handles temp file cleanup/move and deletes the temp file on timeout/failure.
- PostProcessorActor (optional): Uses Microsoft.Extensions.AI to enhance text. Emits PostProcessedEvent.
- ResultSaverActor: Outputs final text per selected strategy (clipboard, wl-copy, ydotool, wtype). Emits ResultSavedEvent.
- ObserverActor: Bridges actor system to UI with IObservable.
- SocketListenerActor (Linux): Listens on `/tmp/WhisperVoiceInput/pipe` and forwards `transcribe_toggle` to the orchestrator.
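For illustration, this is roughly what the provided toggle scripts do, expressed in C#. It is a sketch that assumes the listener accepts the plain `transcribe_toggle` string on the socket; the scripts themselves use `socat` for the same purpose.

```csharp
// Illustrative sketch: send the toggle command to the running app over its
// Unix domain socket, like transcribe_toggle_simplified.sh does with socat.
using System.Net.Sockets;
using System.Text;

var endpoint = new UnixDomainSocketEndPoint("/tmp/WhisperVoiceInput/pipe");
using var socket = new Socket(AddressFamily.Unix, SocketType.Stream, ProtocolType.Unspecified);
socket.Connect(endpoint);
socket.Send(Encoding.UTF8.GetBytes("transcribe_toggle"));
```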
Primary messages:
- Commands: ToggleCommand, UpdateSettingsCommand, RecordCommand, StopRecordingCommand, TranscribeCommand(audioPath), PostProcessCommand(text), StartListeningCommand, StopListeningCommand, GetStateObservableCommand
- Events: AudioRecordedEvent, TranscriptionCompletedEvent, PostProcessedEvent, ResultAvailableEvent, ResultSavedEvent, StateUpdatedEvent, StateObservableResult
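A sketch of how a few of these messages might be declared as immutable records follows. The shapes are illustrative; only the names and the constructor arguments noted above come from the list, and the actual definitions in the repo may differ.

```csharp
// Illustrative message shapes for the actor protocol (not the repo's definitions).
public sealed record ToggleCommand;
public sealed record TranscribeCommand(string AudioPath);
public sealed record PostProcessCommand(string Text);
public sealed record AudioRecordedEvent(string AudioPath);
public sealed record TranscriptionCompletedEvent(string Text);
public sealed record PostProcessedEvent(string ProcessedText);
public sealed record ResultAvailableEvent(string Text);
public sealed record ResultSavedEvent;
```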
A dedicated test project validates the actor pipeline.
- FSM/unit tests for `MainOrchestratorActor` transitions and messaging
- Pipeline integration tests using `TestScheduler` for deterministic timing (see the sketch after this list)
- Error scenario tests (network timeouts, auth failures, file not found, multi‑error cases)
- Dataset saving behavior with and without post‑processing
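A minimal sketch of the deterministic-timing approach: a hypothetical test (not one of the repo's tests) assuming `TestScheduler` is wired in through Akka configuration and xUnit is used via Akka.TestKit.Xunit2.

```csharp
// Illustrative sketch: drive a scheduled (self-)timeout deterministically by
// advancing Akka.TestKit's TestScheduler instead of waiting in real time.
using System;
using Akka.Actor;
using Akka.Configuration;
using Akka.TestKit;
using Xunit;

public class TimeoutSchedulingTests : Akka.TestKit.Xunit2.TestKit
{
    private static readonly Config TestConfig = ConfigurationFactory.ParseString(
        @"akka.scheduler.implementation = ""Akka.TestKit.TestScheduler, Akka.TestKit""");

    public TimeoutSchedulingTests() : base(TestConfig) { }

    [Fact]
    public void Scheduled_timeout_fires_when_virtual_time_is_advanced()
    {
        var scheduler = (TestScheduler)Sys.Scheduler;

        // Schedule a message the way a pipeline actor schedules its self-timeout.
        scheduler.ScheduleTellOnce(TimeSpan.FromMinutes(5), TestActor, "timeout", ActorRefs.NoSender);

        ExpectNoMsg(TimeSpan.FromMilliseconds(100)); // nothing arrives in real time
        scheduler.Advance(TimeSpan.FromMinutes(5));  // jump virtual time forward
        ExpectMsg("timeout");
    }
}
```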
Project layout (simplified):
```
WhisperVoiceInput.Tests/
  Actors/
    MainOrchestratorActorTests.cs
    PipelineIntegrationTests.cs
    SpecificErrorScenariosTests.cs
  TestBase/
    AkkaTestBase.cs
  TestDoubles/
    ... (probes, mocks, configurable error actors)
```
```mermaid
flowchart LR
UI["UI / ViewModels"] -- Toggle --> Orchestrator["MainOrchestratorActor (FSM)"]
SettingsService -- UpdateSettingsCommand --> Orchestrator
Orchestrator -- RecordCommand --> Audio["AudioRecordingActor"]
Audio -- AudioRecordedEvent --> Orchestrator
Audio -- (self) RecordingTimeout --> Audio
Orchestrator -- TranscribeCommand --> Trans["TranscribingActor"]
Trans -- TranscriptionCompletedEvent --> Orchestrator
Trans -- (self) TranscriptionTimeout --> Trans
Orchestrator -- PostProcessCommand --> Post["PostProcessorActor (optional)"]
Post -- PostProcessedEvent --> Orchestrator
Post -- (self) PostProcessingTimeout --> Post
Orchestrator -- ResultAvailableEvent --> Saver["ResultSaverActor"]
Saver -- ResultSavedEvent --> Orchestrator
Orchestrator -- StateUpdatedEvent --> Observer["ObserverActor"]
Observer -- StateObservableResult --> UI
    Socket["SocketListenerActor (/tmp/WhisperVoiceInput/pipe)"] -- transcribe_toggle --> Orchestrator
```
```mermaid
flowchart TD
subgraph user["/user/"]
Orchestrator[MainOrchestratorActor]
Observer[ObserverActor]
subgraph SocketSup["SocketSupervisorActor"]
SocketListener[SocketListenerActor]
end
end
Orchestrator --> Audio[AudioRecordingActor]
Orchestrator --> Trans[TranscribingActor]
Orchestrator --> Post[PostProcessorActor]
Orchestrator --> Saver[ResultSaverActor]
    Note["Note: SocketSupervisorActor exists but current listener is created as top-level sibling under /user."]
```
```mermaid
stateDiagram-v2
[*] --> idle
idle --> recording: ToggleCommand
recording --> transcribing: AudioRecordedEvent
transcribing --> postprocessing: TranscriptionCompletedEvent
postprocessing --> saving: PostProcessedEvent
transcribing --> saving: (post-processing disabled)
saving --> idle: ResultSavedEvent
recording --> idle: error after retries or user timeout
transcribing --> idle: error after retries or user timeout
postprocessing --> idle: error after retries or user timeout
    saving --> idle: error after retries
```
```mermaid
sequenceDiagram
participant User as User
participant UI as UI/ViewModel
participant Orch as MainOrchestrator
participant Aud as AudioRecording
participant Tr as Transcribing
participant PP as PostProcessing
participant Sav as ResultSaver
participant Obs as Observer
User->>UI: Toggle
UI->>Orch: ToggleCommand
Orch->>Aud: RecordCommand
Aud-->>Orch: AudioRecordedEvent
Orch->>Tr: TranscribeCommand
alt Success
Tr-->>Orch: TranscriptionCompletedEvent(text)
alt Post-processing enabled
Orch->>PP: PostProcessCommand(text)
PP-->>Orch: PostProcessedEvent(processed)
Orch->>Sav: ResultAvailableEvent(processed)
else Disabled
Orch->>Sav: ResultAvailableEvent(text)
end
Sav-->>Orch: ResultSavedEvent
Orch-->>Obs: StateUpdatedEvent(Success)
else Error
Note over Tr,Orch: Error at any stage (recording/transcribing/post-processing/saving)
Orch-->>Obs: StateUpdatedEvent(Error, details)
Orch->>Orch: Cleanup and transition to Idle
end
    Obs-->>UI: IObservable<StateUpdatedEvent>
```
- OpenAI Whisper - Speech recognition model
- Avalonia UI - Cross-platform UI framework
- ReactiveUI - MVVM framework
- NAudio - Audio library for .NET
- OpenTK.OpenAL - OpenAL bindings for .NET
- Akka.NET — Actor framework
- Microsoft.Extensions.AI — AI abstractions for post‑processing
- SharpHook — Global hotkey support