
WhisperVoiceInput

WhisperVoiceInput Logo

A cross-platform desktop application that records audio and transcribes it to text using OpenAI's Whisper API or compatible services. Perfect for dictation, note-taking, and accessibility.

Disclaimer

This project is a tool built for my personal needs. I use Linux + Wayland, and the tool has been tested only on that platform.

It supports only OpenAI-compatible Whisper APIs. The supported output methods are listed below.

Feel free to fork the project and make it compatible with your needs. PRs are welcome.

What’s new (major refactor) - 10.08.2025

The backend was rewritten to an actor-based architecture using Akka.NET, and the pipeline was extended with optional AI post‑processing and dataset saving. Comprehensive unit and integration tests were added.

Key changes:

  • Akka.NET actor model with a supervised pipeline and clear FSM states
  • Frozen settings per session, stashing updates while processing
  • Observer actor exposes a reactive stream for UI state updates
  • Optional post‑processing via Microsoft.Extensions.AI (OpenAI‑compatible)
  • Optional dataset saving (original → processed pairs) when post‑processing is enabled
  • Robust error handling and retries per actor (configurable policy)
  • Tests: FSM/unit, pipeline integration with deterministic timing, and error scenarios

Features

  • Audio Recording: Capture audio from selected microphone (system default or user‑selected)
  • Speech-to-Text Transcription: Convert speech to text using OpenAI's Whisper API or compatible services
  • Multiple Output Options:
    • Copy to clipboard (Avalonia clipboard API; a splash-screen workaround is used due to a platform issue)
    • Use wl-copy for Wayland systems
    • Type text directly using ydotool
    • Type text directly using wtype
  • System Tray Integration: Monitor recording status with color-coded tray icon
  • Unix Socket Control: Control the application via command line scripts
  • Configurable Settings:
    • API endpoint and key
    • Whisper model selection
    • Language preference
    • Custom prompts for better recognition
  • Optional Post‑Processing: Improve text with an LLM via Microsoft.Extensions.AI
  • Optional Dataset Saving (for ML datasets): Append original and processed pairs when post‑processing is enabled (see Configuration → Dataset Saving)
  • Safety Timeouts (optional): Hard cut‑offs for Recording, Transcribing, Post‑Processing steps

Roadmap

  • Remove the splash screen once the clipboard issue is fixed
  • Add shortcut support
  • Add more post-processing options

Requirements

  • For Linux: lame, socat (for socket control)
  • For Wayland clipboard support: wl-copy
  • For typing output: ydotool or wtype
  • OpenAL (see dedicated section below)
  • OpenAI API key or compatible Whisper API endpoint
    • OpenAI base URL: https://api.openai.com
    • OpenAI model name: whisper-1
    • Self-hosted servers often use Whisper Large variants (e.g., faster‑whisper). The UI defaults to a large model name; adjust it to whisper-1 if you call OpenAI directly.

OpenAL (audio backend dependency)

The application requires a native OpenAL runtime for audio capture. The repository contains only the managed wrapper (OpenTK.OpenAL); the native runtime is not bundled.

Linux:

  • Usually already installed as a dependency of other desktop software.
  • If recording fails with DllNotFoundException: libopenal.so, install your distro's package:
    • Arch / Manjaro: pacman -S openal
    • Debian / Ubuntu: sudo apt install libopenal1
    • Fedora: sudo dnf install openal-soft
    • openSUSE: sudo zypper install openal-soft
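
To verify that the native library is visible to the dynamic loader, you can check the linker cache (a quick, optional sanity check):

ldconfig -p | grep -i openal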

macOS:

  • macOS ships with a system OpenAL. If you explicitly need OpenAL Soft, you can install it with Homebrew: brew install openal-soft (normally not required).

Windows:

  • Install OpenAL using the official installer from https://www.openal.org/downloads/ (oalinst.exe) and restart the application; OR use a package manager:
    • WinGet: winget install --id CreativeLabs.OpenAL --source winget
    • Chocolatey: choco install openal
  • Symptom if missing: System.DllNotFoundException: Could not load the dll 'openal32.dll' when starting recording.

Installation

Prerequisites

  • For Linux: Install lame from your package manager.
  • Ensure OpenAL is available (see OpenAL section).

From Source

  1. Clone the repository:

    git clone https://github.com/V0v1kkk/WhisperVoiceInput.git
    cd WhisperVoiceInput
  2. Build the application:

    dotnet build -c Release
  3. Run the application:

    dotnet run --project WhisperVoiceInput/WhisperVoiceInput.csproj
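
If you want a standalone binary instead of running from source, a sketch using the standard dotnet publish command (the runtime identifier linux-x64 is an example; substitute your platform's RID):

dotnet publish WhisperVoiceInput/WhisperVoiceInput.csproj -c Release -r linux-x64 --self-contained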

Pre-built Binaries

Download the latest release from the Releases page.

Configuration

On first run, the application creates a configuration directory at:

~/.config/WhisperVoiceInput/ (Linux/macOS)
%APPDATA%\WhisperVoiceInput\ (Windows)

API Configuration

  1. Open the settings window by clicking on the tray icon
  2. Enter your OpenAI API key or configure a compatible endpoint
  3. Select the Whisper model
    • OpenAI: whisper-1
    • Self-hosted: a Faster-Whisper model name (e.g., whisper-large-v3)
  4. Set your preferred language (e.g., "en")
  5. Optionally add a prompt to guide the transcription
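
For reference, the request the app sends is equivalent to the following curl call (a sketch assuming OpenAI's hosted endpoint; the /v1/audio/transcriptions path and the model/language/prompt fields match what the app uses):

curl -s https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file=@recording.mp3 \
  -F model=whisper-1 \
  -F language=en \
  -F prompt="Names or jargon to bias recognition"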

Audio Input Device Selection

  • In Settings → Audio Settings, use the “Input Device” dropdown to choose a microphone:
    • System default uses your OS default input device.
    • Or select a specific device from the list.
  • Click “Refresh” to enumerate devices on demand (keeps startup and opening the settings window lightweight).
    • Under the hood, the app queries OpenAL capture devices and, when supported, also uses the extended enumeration to include more (e.g., virtual) devices.
  • The selection is saved as a plain string setting (PreferredCaptureDevice).
    • Empty value means System default.
  • If the preferred device is unavailable at runtime, the recorder automatically falls back to the system default.

Output Configuration

Choose your preferred output method:

  • Clipboard (Avalonia API)
  • wl-copy (Wayland)
  • ydotool (types the text)
  • wtype (types the text)
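
Conceptually, the non-clipboard outputs behave like the commands below (illustrative only; the app invokes these tools itself, and ydotool additionally requires its ydotoold daemon to be running):

echo -n "transcribed text" | wl-copy
ydotool type "transcribed text"
wtype "transcribed text"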

Post-Processing (optional)

  • Enable to improve transcriptions via Microsoft.Extensions.AI
  • Endpoint and model are OpenAI‑compatible (OpenAI or local LLM gateways)
  • Defaults in the app may point to a local endpoint and model (e.g., Ollama http://localhost:11434 with llama3.2); adjust as needed
  • Provide API key if your endpoint requires it

Safety Timeouts (optional)

  • Three independent limits in minutes: Recording, Transcribing, Post‑Processing
  • Each timeout can be enabled via a toggle and a minutes spinner (minimum 1 minute)
  • Semantics:
    • Value > 0: timeout is enabled; the corresponding actor schedules a self‑timeout message
    • Value ≤ 0 (internally stored as -1): timeout is disabled
  • Behavior on timeout:
    • The actor throws UserConfiguredTimeoutException which is treated as unrecoverable by supervision (no retries)
    • For Recording and Transcribing, the current audio file is deleted to avoid leaving temporary files behind

Dataset Saving (optional)

Build your own training datasets from the pipeline output.

  • Availability: Only works when Post‑Processing is enabled
  • Format per entry:
    <original text>
    -
    <processed text>
    ---
    
  • How to enable:
    1. In Settings, enable Post‑Processing
    2. Turn on "Save dataset"
    3. Choose the target file path (created if missing)
    4. Run the pipeline; after post‑processing, an entry is appended asynchronously
  • Notes:
    • Appends are non-blocking and won’t stall the UI
    • Success and errors are logged
    • Ensure the chosen location is writable by your user
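
For example, a single (hypothetical) entry in the dataset file would look like:

    um so the meeting is at three pm
    -
    The meeting is at 3 PM.
    ---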

Self-Hosted Whisper API

I personally use Speaches as a self-hosted Whisper API.

An example docker-compose file for the GPU-enabled version of Speaches:

services:
  speaches:
    image: ghcr.io/speaches-ai/speaches:0.7.0-cuda # https://github.com/speaches-ai/speaches/pkgs/container/speaches/versions?filters%5Bversion_type%5D=tagged
    container_name: speaches
    restart: unless-stopped
    ports:
      - "1264:8000"
    volumes:
      - ./speaches_cache:/home/ubuntu/.cache/huggingface/hub
    environment:
      - ENABLE_UI=false
      - WHISPER__TTL=-1 # default TTL is 300 (5min), -1 to disable, 0 to unload directly, 43200=12h
      - WHISPER__INFERENCE_DEVICE=cuda
      - WHISPER__COMPUTE_TYPE=float16
      - WHISPER__MODEL=deepdml/faster-whisper-large-v3-turbo-ct2 # uses ~2.5Gb VRAM in CUDA version
      #- WHISPER__MODEL=Systran/faster-whisper-large-v3
      - WHISPER__DEVICE_INDEX=1
      - ALLOW_ORIGINS=[ "*", "app://obsidian.md" ]
      - API_KEY=sk-1234567890
      - LOOPBACK_HOST_URL=yourdomain.com
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
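
Once the container is up, you can smoke-test it with a curl call (a sketch reusing the port, API key, and model from the compose file above; sample.mp3 is any short audio file):

curl -s http://localhost:1264/v1/audio/transcriptions \
  -H "Authorization: Bearer sk-1234567890" \
  -F file=@sample.mp3 \
  -F model=deepdml/faster-whisper-large-v3-turbo-ct2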

Usage

GUI Usage

  1. Click the tray icon to start/stop recording
  2. When recording, the icon turns yellow
  3. During transcription/post‑processing/saving, the icon turns light blue
  4. On success, the icon briefly turns green and the text is output per your settings
  5. On error, the icon turns red and a tooltip shows details

Command Line Control

The application can be controlled via a Unix socket. Two scripts are provided in the repo root:

  • transcribe_toggle_simplified.sh (simple)
  • transcribe_toggle.sh (enhanced checks)

Make the scripts executable:

chmod +x transcribe_toggle_simplified.sh transcribe_toggle.sh

Run to toggle recording:

./transcribe_toggle_simplified.sh
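
Under the hood, toggling amounts to writing the transcribe_toggle command to the Unix socket; a minimal sketch of what the simplified script does:

echo "transcribe_toggle" | socat - UNIX-CONNECT:/tmp/WhisperVoiceInput/pipe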

Keyboard Shortcuts

Global hotkey support is available on Windows, macOS, and Linux X11. It is automatically disabled on Wayland. Configure the hotkey in Settings → Global Hotkey by focusing the field and pressing your desired combination. A Reset button clears it.

Shortcuts are implemented with the SharpHook library. Check its documentation for platform-specific limitations.

On Wayland, use the provided toggle scripts and bind them in your DE (examples below).

GNOME Example:

gsettings set org.gnome.settings-daemon.plugins.media-keys custom-keybindings "['/org/gnome/settings-daemon/plugins/media-keys/custom-keybindings/custom0/']"
gsettings set org.gnome.settings-daemon.plugins.media-keys.custom-keybinding:/org/gnome/settings-daemon/plugins/media-keys/custom-keybindings/custom0/ name "Toggle WhisperVoiceInput"
gsettings set org.gnome.settings-daemon.plugins.media-keys.custom-keybinding:/org/gnome/settings-daemon/plugins/media-keys/custom-keybindings/custom0/ command "/path/to/transcribe_toggle_simplified.sh"
gsettings set org.gnome.settings-daemon.plugins.media-keys.custom-keybinding:/org/gnome/settings-daemon/plugins/media-keys/custom-keybindings/custom0/ binding "<Ctrl><Alt>w"

KDE Example:

  1. System Settings > Shortcuts > Custom Shortcuts
  2. Add a new shortcut
  3. Set the command to /path/to/transcribe_toggle_simplified.sh
  4. Assign a keyboard shortcut

Troubleshooting

A local Seq server is supported for structured log viewing; the application expects it to be reachable at http://localhost:5341.
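
If you don't have Seq running, a minimal Docker invocation might look like this (a sketch assuming the official datalust/seq image; check its docs for current port mappings):

docker run -d --name seq -e ACCEPT_EULA=Y -p 5341:5341 -p 8081:80 datalust/seq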

Recording Issues

  • Ensure your microphone is properly connected and set as the default input device
  • Check system permissions for microphone access
  • Ensure OpenAL is installed (see OpenAL section). Windows symptom if missing: System.DllNotFoundException: Could not load the dll 'openal32.dll'

Transcription Issues

  • Verify your API key is correct (if required by your endpoint)
  • Check your internet connection
  • Ensure the server address is correct
  • Try a different Whisper model (smaller models may be faster but less accurate)

Post‑Processing Issues

  • Verify endpoint URL, model, and API key
  • If using a local LLM gateway, confirm it’s running and reachable

Socket Control Issues

  • Ensure the application is running
  • Check if the socket file exists at /tmp/WhisperVoiceInput/pipe
  • Verify socat is installed: sudo apt install socat

Logs

On Linux/macOS: ~/.config/WhisperVoiceInput/logs/
On Windows: %APPDATA%\WhisperVoiceInput\logs\

Architecture (actor-based)

Actors and responsibilities:

  • MainOrchestratorActor (FSM): Coordinates the pipeline (Idle → Recording → Transcribing → PostProcessing → Saving). Supervises children, freezes settings per session, stashes settings updates, notifies UI via Observer.
  • AudioRecordingActor: Records from OpenAL and writes MP3 using NAudio.Lame. Emits AudioRecordedEvent.
  • TranscribingActor: Calls {ServerAddress}/v1/audio/transcriptions with model/language/prompt (async via PipeTo). Emits TranscriptionCompletedEvent. Handles temp file cleanup/move and deletes temp file on timeout/failure.
  • PostProcessorActor (optional): Uses Microsoft.Extensions.AI to enhance text. Emits PostProcessedEvent.
  • ResultSaverActor: Outputs final text per selected strategy (clipboard, wl-copy, ydotool, wtype). Emits ResultSavedEvent.
  • ObserverActor: Bridges actor system to UI with IObservable.
  • SocketListenerActor (Linux): Listens on /tmp/WhisperVoiceInput/pipe and forwards transcribe_toggle to the orchestrator.

Primary messages:

  • Commands: ToggleCommand, UpdateSettingsCommand, RecordCommand, StopRecordingCommand, TranscribeCommand(audioPath), PostProcessCommand(text), StartListeningCommand, StopListeningCommand, GetStateObservableCommand
  • Events: AudioRecordedEvent, TranscriptionCompletedEvent, PostProcessedEvent, ResultAvailableEvent, ResultSavedEvent, StateUpdatedEvent, StateObservableResult

Testing

A dedicated test project validates the actor pipeline.

  • FSM/Unit tests for MainOrchestratorActor transitions and messaging
  • Pipeline integration tests using TestScheduler for deterministic timing
  • Error scenario tests (network timeouts, auth failures, file not found, multi‑error cases)
  • Dataset saving behavior with and without post‑processing

Project layout (simplified):

WhisperVoiceInput.Tests/
  Actors/
    MainOrchestratorActorTests.cs
    PipelineIntegrationTests.cs
    SpecificErrorScenariosTests.cs
  TestBase/
    AkkaTestBase.cs
  TestDoubles/
    ... (probes, mocks, configurable error actors)

Diagrams

Data Flow

flowchart LR
    UI["UI / ViewModels"] -- Toggle --> Orchestrator["MainOrchestratorActor (FSM)"]
    SettingsService -- UpdateSettingsCommand --> Orchestrator

    Orchestrator -- RecordCommand --> Audio["AudioRecordingActor"]
    Audio -- AudioRecordedEvent --> Orchestrator
    Audio -- (self) RecordingTimeout --> Audio

    Orchestrator -- TranscribeCommand --> Trans["TranscribingActor"]
    Trans -- TranscriptionCompletedEvent --> Orchestrator
    Trans -- (self) TranscriptionTimeout --> Trans

    Orchestrator -- PostProcessCommand --> Post["PostProcessorActor (optional)"]
    Post -- PostProcessedEvent --> Orchestrator
    Post -- (self) PostProcessingTimeout --> Post

    Orchestrator -- ResultAvailableEvent --> Saver["ResultSaverActor"]
    Saver -- ResultSavedEvent --> Orchestrator

    Orchestrator -- StateUpdatedEvent --> Observer["ObserverActor"]
    Observer -- StateObservableResult --> UI

    Socket["SocketListenerActor (/tmp/WhisperVoiceInput/pipe)"] -- transcribe_toggle --> Orchestrator

Supervision (runtime)

flowchart TD
    subgraph user["/user/"]
      Orchestrator[MainOrchestratorActor]
      Observer[ObserverActor]
      subgraph SocketSup["SocketSupervisorActor"]
        SocketListener[SocketListenerActor]
      end
    end

    Orchestrator --> Audio[AudioRecordingActor]
    Orchestrator --> Trans[TranscribingActor]
    Orchestrator --> Post[PostProcessorActor]
    Orchestrator --> Saver[ResultSaverActor]

    Note["Note: SocketSupervisorActor exists but current listener is created as top-level sibling under /user."]

FSM States

stateDiagram-v2
    [*] --> idle
    idle --> recording: ToggleCommand
    recording --> transcribing: AudioRecordedEvent
    transcribing --> postprocessing: TranscriptionCompletedEvent
    postprocessing --> saving: PostProcessedEvent
    transcribing --> saving: (post-processing disabled)
    saving --> idle: ResultSavedEvent

    recording --> idle: error after retries or user timeout
    transcribing --> idle: error after retries or user timeout
    postprocessing --> idle: error after retries or user timeout
    saving --> idle: error after retries

Sequence (happy path + error path)

sequenceDiagram
    participant User as User
    participant UI as UI/ViewModel
    participant Orch as MainOrchestrator
    participant Aud as AudioRecording
    participant Tr as Transcribing
    participant PP as PostProcessing
    participant Sav as ResultSaver
    participant Obs as Observer

    User->>UI: Toggle
    UI->>Orch: ToggleCommand
    Orch->>Aud: RecordCommand
    Aud-->>Orch: AudioRecordedEvent
    Orch->>Tr: TranscribeCommand
    alt Success
        Tr-->>Orch: TranscriptionCompletedEvent(text)
        alt Post-processing enabled
            Orch->>PP: PostProcessCommand(text)
            PP-->>Orch: PostProcessedEvent(processed)
            Orch->>Sav: ResultAvailableEvent(processed)
        else Disabled
            Orch->>Sav: ResultAvailableEvent(text)
        end
        Sav-->>Orch: ResultSavedEvent
        Orch-->>Obs: StateUpdatedEvent(Success)
    else Error
        Note over Tr,Orch: Error at any stage (recording/transcribing/post-processing/saving)
        Orch-->>Obs: StateUpdatedEvent(Error, details)
        Orch->>Orch: Cleanup and transition to Idle
    end
    Obs-->>UI: IObservable<StateUpdatedEvent>

License

MIT License

Acknowledgements
