A PyTorch implementation of WhisQ, a multimodal approach for music quality assessment that leverages Whisper (audio) and Qwen (text) models with optimal transport alignment for enhanced cross-modal understanding.
WhisQ combines audio and text modalities to predict both Overall Quality (OQM) and Textual Alignment (TA) scores for musical content. The model uses optimal transport theory to align audio and text representations, enabling better cross-modal understanding for music quality assessment.
- Multimodal Architecture: Combines Whisper-Base (audio) and Qwen3-0.6B (text) pretrained models
- Optimal Transport Alignment: Uses Sinkhorn divergence for cross-modal alignment (see the sketch after this list)
- Sequence Co-Attention: Attention mechanism for enhanced feature fusion
- Dual Prediction: Simultaneous prediction of overall quality and textual alignment scores
- Flexible Training: Support for multiple loss functions (MSE, L1, Huber) and optimizers
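As a minimal sketch of the optimal transport alignment idea, a Sinkhorn divergence between audio and text token embeddings can be computed with `geomloss`. The tensor shapes, `blur` value, and the assumption that both sequences are already projected to a shared dimension are illustrative, not the exact settings used in `train_align.py`:

```python
import torch
from geomloss import SamplesLoss

# Entropic-regularized Sinkhorn divergence; blur=0.05 is an illustrative value.
sinkhorn = SamplesLoss(loss="sinkhorn", p=2, blur=0.05)

def ot_alignment_loss(audio_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    """audio_feats: (B, T_audio, D), text_feats: (B, T_text, D), assumed to share dim D.
    Treats each sequence as a point cloud and returns the mean Sinkhorn divergence."""
    return sinkhorn(audio_feats, text_feats).mean()

# In training, such a term would be added to the MOS regression loss,
# scaled by --alignment_weight, e.g.:
#   total_loss = prediction_loss + alignment_weight * ot_alignment_loss(a, t)
```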
src/
├── train_align.py # Main training script with optimal transport alignment
├── evaluate.py # Evaluation script for validation and test sets
├── wrapper.py # WhisperQwenWrapper
├── model.py # MosPredictor model definition
├── data_utils.py # Dataset utilities and data loading
├── utils.py # Helper functions and utilities
└── sweep.yaml # W&B hyperparameter sweep configuration
pip install torch torchvision torchaudio
pip install transformers
pip install geomloss
pip install wandb
pip install tqdm scipy numpy

The implementation uses the following pretrained models (see the loading sketch below):

- Audio: `openai/whisper-base`
- Text: `Qwen/Qwen3-0.6B-Base`
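For reference, the two backbones can be loaded with `transformers` roughly as follows; `wrapper.py` may set them up differently (e.g. encoder-only extraction or frozen weights), so treat this as a sketch:

```python
from transformers import (
    AutoModel,
    AutoTokenizer,
    WhisperFeatureExtractor,
    WhisperModel,
)

# Audio branch: Whisper-Base (only the encoder states are needed for embeddings).
audio_model = WhisperModel.from_pretrained("openai/whisper-base")
audio_processor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")

# Text branch: Qwen3-0.6B base model (requires a recent transformers release with Qwen3 support).
text_model = AutoModel.from_pretrained("Qwen/Qwen3-0.6B-Base")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B-Base")
```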
Basic training with optimal transport alignment:
python train_align.py \
--datadir ../data/MusicEval-phase1 \
--expname exp_whisq_ot \
--alignment_loss ot \
--alignment_weight 0.00004056897283396114 \
--use_seq_coatt \
--lr 0.0007306998648015165 \
--optimizer sgd \
--batch_size 128 \
--epochs 250 \
--momentum 0.7435171279297897 \
--loss_fn huber

| Argument | Default | Description |
|---|---|---|
| `--datadir` | `../data/MusicEval-phase1` | Path to dataset directory |
| `--expname` | `exp_hub` | Experiment name for W&B logging |
| `--alignment_loss` | `ot` | Alignment loss type (none, ot) |
| `--alignment_weight` | `0.00004056897283396114` | Weight for alignment loss |
| `--use_seq_coatt` | `False` | Enable sequence co-attention |
| `--lr` | `0.0007306998648015165` | Learning rate |
| `--optimizer` | `sgd` | Optimizer (sgd, adam) |
| `--batch_size` | `128` | Training batch size |
| `--epochs` | `250` | Number of training epochs |
| `--momentum` | `0.7435171279297897` | SGD momentum |
| `--loss_fn` | `huber` | Loss function (mse, l1, huber) |
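As an illustration of how the `--loss_fn` and `--optimizer` flags typically map onto PyTorch objects (the helper names below are hypothetical, not taken from `train_align.py`):

```python
import torch
import torch.nn as nn

# Hypothetical mapping used for illustration only.
LOSS_FNS = {"mse": nn.MSELoss, "l1": nn.L1Loss, "huber": nn.HuberLoss}

def build_optimizer(name, params, lr, momentum=0.0):
    if name == "sgd":
        return torch.optim.SGD(params, lr=lr, momentum=momentum)
    if name == "adam":
        return torch.optim.Adam(params, lr=lr)
    raise ValueError(f"unknown optimizer: {name}")

criterion = LOSS_FNS["huber"]()  # --loss_fn huber
# optimizer = build_optimizer("sgd", model.parameters(),
#                             lr=0.0007306998648015165, momentum=0.7435171279297897)
```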
Evaluate on validation set:
python evaluate.py \
--datadir ../data/eval \
--ckpt ../track1_ckpt/expot_4ot/best_ckpt_148

Generate test predictions:
python evaluate.py \
--datadir ../data/eval \
--ckpt ../track1_ckpt/expot_4ot/best_ckpt_148 \
--test_mode \
--test_list sets/test_list.txt

The project includes a W&B sweep configuration (sweep.yaml) for hyperparameter optimization:
program: train_align.py
method: bayes
metric:
  name: val_loss
  goal: minimize
parameters:
  lr:
    distribution: log_uniform_values
    min: 0.0001
    max: 0.01
  alignment_weight:
    distribution: log_uniform_values
    min: 0.00001
    max: 0.001
  momentum:
    distribution: uniform
    min: 0.5
    max: 0.9
  batch_size:
    values: [64, 128, 256]
  optimizer:
    values: ['sgd', 'adam']
  loss_fn:
    values: ['mse', 'l1', 'huber']

Run sweep:
wandb sweep sweep.yaml
wandb agent <sweep_id>

The model achieves state-of-the-art performance on MusicEval Track-1:
| Metric | Baseline (Utterance) | WhisQ + OT (Utterance) | Baseline (System) | WhisQ + OT (System) |
|---|---|---|---|---|
| Overall Quality (OQM) | | | | |
| MSE↓ | 0.6175 | 0.3584 | 0.3863 | 0.1095 |
| LCC↑ | 0.6908 | 0.7523 | 0.8016 | 0.8991 |
| SRCC↑ | 0.6881 | 0.7558 | 0.7764 | 0.8773 |
| KTAU↑ | 0.5143 | 0.5746 | 0.5862 | 0.7094 |
| Textual Alignment (TA) | | | | |
| MSE↓ | 0.5936 | 0.4735 | 0.2322 | 0.0773 |
| LCC↑ | 0.5803 | 0.6176 | 0.7461 | 0.8721 |
| SRCC↑ | 0.5425 | 0.6109 | 0.7202 | 0.8695 |
| KTAU↑ | 0.3933 | 0.4474 | 0.5074 | 0.6749 |
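The metrics above can be reproduced with `scipy` roughly as follows (a sketch; `evaluate.py` may aggregate system-level scores differently):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

def mos_metrics(pred: np.ndarray, true: np.ndarray) -> dict:
    """MSE, linear correlation (LCC), Spearman rank correlation (SRCC),
    and Kendall's tau (KTAU) between predicted and ground-truth scores."""
    return {
        "MSE": float(np.mean((pred - true) ** 2)),
        "LCC": pearsonr(pred, true)[0],
        "SRCC": spearmanr(pred, true)[0],
        "KTAU": kendalltau(pred, true)[0],
    }

# System-level scores are typically obtained by averaging utterance-level
# predictions per system before computing the same metrics (assumption).
```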
Expected dataset structure:
data/
├── MusicEval-phase1/
│ ├── wav/ # Audio files
│ ├── sets/
│ │ ├── train_mos_list.txt # Training list
│ │ └── dev_mos_list.txt # Validation list
│ └── system_mos/
│ └── system_mos_phase1.csv # System-level ground truth
└── eval/
├── wav/ # Test audio files
├── sets/
│ └── eval_list.txt # Test list
└── system_mos/
└── system_mos_phase1.csv
Training/Validation lists (train_mos_list.txt, dev_mos_list.txt):
filename1,overall_score,textual_score
filename2,overall_score,textual_score
Test list (eval_list.txt):
filename1
filename2
System-level ground truth (system_mos_phase1.csv):
system_id,overall_mos,textual_mos
S001,4.2,3.8
S002,3.9,4.1
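A hedged sketch of reading these files (the actual loaders live in `data_utils.py`; the column handling here is inferred from the formats above):

```python
import csv

def read_mos_list(path: str):
    """Parse train/dev lists: filename,overall_score,textual_score per line."""
    rows = []
    with open(path) as f:
        for line in f:
            name, overall, textual = line.strip().split(",")
            rows.append((name, float(overall), float(textual)))
    return rows

def read_system_mos(path: str):
    """Parse system_mos_phase1.csv (assumed header: system_id,overall_mos,textual_mos)."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        return {row["system_id"]: (float(row["overall_mos"]), float(row["textual_mos"]))
                for row in reader}
```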
If you use this code in your research, please cite:
Coming Soon!

This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI for the Whisper model
- Alibaba for the Qwen model series
- The MusicEval challenge organizers