A PyTorch implementation of WhisQ, a multimodal approach for music quality assessment that leverages Whisper (audio) and Qwen (text) models with optimal transport alignment for enhanced cross-modal understanding.
WhisQ combines audio and text modalities to predict both Overall Quality (OQM) and Textual Alignment (TA) scores for musical content. The model uses optimal transport theory to align audio and text representations, enabling better cross-modal understanding for music quality assessment.
- Multimodal Architecture: Combines Whisper-Base (audio) and Qwen3-0.6B (text) pretrained models
- Optimal Transport Alignment: Uses Sinkhorn divergence for cross-modal alignment (see the sketch after this list)
- Sequence Co-Attention: Attention mechanism for enhanced feature fusion
- Dual Prediction: Simultaneous prediction of overall quality and textual alignment scores
- Flexible Training: Support for multiple loss functions (MSE, L1, Huber) and optimizers
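As a minimal sketch of the optimal transport alignment idea, a Sinkhorn divergence between audio and text token embeddings can be computed with `geomloss`. The tensor shapes, `blur` value, and the assumption that both sequences are already projected to a shared dimension are illustrative, not the exact settings used in `train_align.py`:

```python
import torch
from geomloss import SamplesLoss

# Entropic-regularized Sinkhorn divergence; blur=0.05 is an illustrative value.
sinkhorn = SamplesLoss(loss="sinkhorn", p=2, blur=0.05)

def ot_alignment_loss(audio_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    """audio_feats: (B, T_audio, D), text_feats: (B, T_text, D), assumed to share dim D.
    Treats each sequence as a point cloud and returns the mean Sinkhorn divergence."""
    return sinkhorn(audio_feats, text_feats).mean()

# In training, such a term would be added to the MOS regression loss,
# scaled by --alignment_weight, e.g.:
#   total_loss = prediction_loss + alignment_weight * ot_alignment_loss(a, t)
```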
src/
├── train_align.py # Main training script with optimal transport alignment
├── evaluate.py # Evaluation script for validation and test sets
├── wrapper.py # WhisperQwenWrapper
├── model.py # MosPredictor model definition
├── data_utils.py # Dataset utilities and data loading
├── utils.py # Helper functions and utilities
└── sweep.yaml # W&B hyperparameter sweep configuration
pip install torch torchvision torchaudio
pip install transformers
pip install geomloss
pip install wandb
pip install tqdm scipy numpy

The implementation uses the following pretrained models (see the loading sketch below):

- Audio: `openai/whisper-base`
- Text: `Qwen/Qwen3-0.6B-Base`
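For reference, the two backbones can be loaded with `transformers` roughly as follows; `wrapper.py` may set them up differently (e.g. encoder-only extraction or frozen weights), so treat this as a sketch:

```python
from transformers import (
    AutoModel,
    AutoTokenizer,
    WhisperFeatureExtractor,
    WhisperModel,
)

# Audio branch: Whisper-Base (only the encoder states are needed for embeddings).
audio_model = WhisperModel.from_pretrained("openai/whisper-base")
audio_processor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")

# Text branch: Qwen3-0.6B base model (requires a recent transformers release with Qwen3 support).
text_model = AutoModel.from_pretrained("Qwen/Qwen3-0.6B-Base")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B-Base")
```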
Basic training with optimal transport alignment:
python train_align.py \
--datadir ../data/MusicEval-phase1 \
--expname exp_whisq_ot \
--alignment_loss ot \
--alignment_weight 0.00004056897283396114 \
--use_seq_coatt \
--lr 0.0007306998648015165 \
--optimizer sgd \
--batch_size 128 \
--epochs 250 \
--momentum 0.7435171279297897 \
--loss_fn huber

| Argument | Default | Description |
|---|---|---|
| `--datadir` | `../data/MusicEval-phase1` | Path to dataset directory |
| `--expname` | `exp_hub` | Experiment name for W&B logging |
| `--alignment_loss` | `ot` | Alignment loss type (none, ot) |
| `--alignment_weight` | `0.00004056897283396114` | Weight for alignment loss |
| `--use_seq_coatt` | `False` | Enable sequence co-attention |
| `--lr` | `0.0007306998648015165` | Learning rate |
| `--optimizer` | `sgd` | Optimizer (sgd, adam) |
| `--batch_size` | `128` | Training batch size |
| `--epochs` | `250` | Number of training epochs |
| `--momentum` | `0.7435171279297897` | SGD momentum |
| `--loss_fn` | `huber` | Loss function (mse, l1, huber) |
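As an illustration of how the `--loss_fn` and `--optimizer` flags typically map onto PyTorch objects (the helper names below are hypothetical, not taken from `train_align.py`):

```python
import torch
import torch.nn as nn

# Hypothetical mapping used for illustration only.
LOSS_FNS = {"mse": nn.MSELoss, "l1": nn.L1Loss, "huber": nn.HuberLoss}

def build_optimizer(name, params, lr, momentum=0.0):
    if name == "sgd":
        return torch.optim.SGD(params, lr=lr, momentum=momentum)
    if name == "adam":
        return torch.optim.Adam(params, lr=lr)
    raise ValueError(f"unknown optimizer: {name}")

criterion = LOSS_FNS["huber"]()  # --loss_fn huber
# optimizer = build_optimizer("sgd", model.parameters(),
#                             lr=0.0007306998648015165, momentum=0.7435171279297897)
```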
Evaluate on validation set:
python evaluate.py \
--datadir ../data/eval \
--ckpt ../track1_ckpt/expot_4ot/best_ckpt_148

Generate test predictions:
python evaluate.py \
--datadir ../data/eval \
--ckpt ../track1_ckpt/expot_4ot/best_ckpt_148 \
--test_mode \
--test_list sets/test_list.txt

The project includes a W&B sweep configuration (sweep.yaml) for hyperparameter optimization:
program: train_align.py
method: bayes
metric:
  name: val_loss
  goal: minimize
parameters:
  lr:
    distribution: log_uniform_values
    min: 0.0001
    max: 0.01
  alignment_weight:
    distribution: log_uniform_values
    min: 0.00001
    max: 0.001
  momentum:
    distribution: uniform
    min: 0.5
    max: 0.9
  batch_size:
    values: [64, 128, 256]
  optimizer:
    values: ['sgd', 'adam']
  loss_fn:
    values: ['mse', 'l1', 'huber']

Run sweep:
wandb sweep sweep.yaml
wandb agent <sweep_id>

The model achieves state-of-the-art performance on MusicEval Track-1:
| Metric | Baseline (Utterance) | WhisQ + OT (Utterance) | Baseline (System) | WhisQ + OT (System) |
|---|---|---|---|---|
| Overall Quality (OQM) | | | | |
| MSE↓ | 0.6175 | 0.3584 | 0.3863 | 0.1095 |
| LCC↑ | 0.6908 | 0.7523 | 0.8016 | 0.8991 |
| SRCC↑ | 0.6881 | 0.7558 | 0.7764 | 0.8773 |
| KTAU↑ | 0.5143 | 0.5746 | 0.5862 | 0.7094 |
| Textual Alignment (TA) | | | | |
| MSE↓ | 0.5936 | 0.4735 | 0.2322 | 0.0773 |
| LCC↑ | 0.5803 | 0.6176 | 0.7461 | 0.8721 |
| SRCC↑ | 0.5425 | 0.6109 | 0.7202 | 0.8695 |
| KTAU↑ | 0.3933 | 0.4474 | 0.5074 | 0.6749 |
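The metrics above can be reproduced with `scipy` roughly as follows (a sketch; `evaluate.py` may aggregate system-level scores differently):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

def mos_metrics(pred: np.ndarray, true: np.ndarray) -> dict:
    """MSE, linear correlation (LCC), Spearman rank correlation (SRCC),
    and Kendall's tau (KTAU) between predicted and ground-truth scores."""
    return {
        "MSE": float(np.mean((pred - true) ** 2)),
        "LCC": pearsonr(pred, true)[0],
        "SRCC": spearmanr(pred, true)[0],
        "KTAU": kendalltau(pred, true)[0],
    }

# System-level scores are typically obtained by averaging utterance-level
# predictions per system before computing the same metrics (assumption).
```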
Expected dataset structure:
data/
├── MusicEval-phase1/
│ ├── wav/ # Audio files
│ ├── sets/
│ │ ├── train_mos_list.txt # Training list
│ │ └── dev_mos_list.txt # Validation list
│ └── system_mos/
│ └── system_mos_phase1.csv # System-level ground truth
└── eval/
├── wav/ # Test audio files
├── sets/
│ └── eval_list.txt # Test list
└── system_mos/
└── system_mos_phase1.csv
Training/Validation lists (train_mos_list.txt, dev_mos_list.txt):
filename1,overall_score,textual_score
filename2,overall_score,textual_score
Test list (eval_list.txt):
filename1
filename2
System-level ground truth (system_mos_phase1.csv):
system_id,overall_mos,textual_mos
S001,4.2,3.8
S002,3.9,4.1
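A hedged sketch of reading these files (the actual loaders live in `data_utils.py`; the column handling here is inferred from the formats above):

```python
import csv

def read_mos_list(path: str):
    """Parse train/dev lists: filename,overall_score,textual_score per line."""
    rows = []
    with open(path) as f:
        for line in f:
            name, overall, textual = line.strip().split(",")
            rows.append((name, float(overall), float(textual)))
    return rows

def read_system_mos(path: str):
    """Parse system_mos_phase1.csv (assumed header: system_id,overall_mos,textual_mos)."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        return {row["system_id"]: (float(row["overall_mos"]), float(row["textual_mos"]))
                for row in reader}
```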
If you use this code in your research, please cite:
Coming Soon!

This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI for the Whisper model
- Alibaba for the Qwen model series
- The MusicEval challenge organizers