WhisQ: Music Quality Assessment with Multimodal Alignment

A PyTorch implementation of WhisQ, a multimodal approach for music quality assessment that leverages Whisper (audio) and Qwen (text) models with optimal transport alignment for enhanced cross-modal understanding. WhisQ was developed as an automatic MOS predictor for Track 1 of the AudioMOS 2025 Challenge.

Overview

WhisQ combines audio and text modalities to predict both Overall Quality (OQM) and Textual Alignment (TA) scores for musical content. The model uses optimal transport theory to align audio and text representations, enabling better cross-modal understanding for music quality assessment.
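
At a high level, pooled audio and text embeddings are fused and passed to two regression heads. Below is a minimal, hypothetical sketch of that dual-prediction idea in PyTorch; the class name, layer sizes, and pooling choices are illustrative assumptions, not the repository's MosPredictor.

import torch
import torch.nn as nn

class DualHeadSketch(nn.Module):
    # Hypothetical dual-head regressor: one shared fused representation feeds
    # separate heads for Overall Quality (OQM) and Textual Alignment (TA).
    def __init__(self, audio_dim=512, text_dim=1024, hidden_dim=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(audio_dim + text_dim, hidden_dim),
            nn.ReLU(),
        )
        self.oqm_head = nn.Linear(hidden_dim, 1)  # overall quality score
        self.ta_head = nn.Linear(hidden_dim, 1)   # textual alignment score

    def forward(self, audio_emb, text_emb):
        # audio_emb: (B, audio_dim) pooled audio features
        # text_emb:  (B, text_dim)  pooled text features
        h = self.fuse(torch.cat([audio_emb, text_emb], dim=-1))
        return self.oqm_head(h).squeeze(-1), self.ta_head(h).squeeze(-1)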

Key Features

  • Multimodal Architecture: Combines Whisper-Base (audio) and Qwen3-0.6B (text) pretrained models
  • Optimal Transport Alignment: Uses Sinkhorn divergence for cross modal alignment
  • Sequence Co-Attention: Attention mechanism for enhanced feature fusion
  • Dual Prediction: Simultaneous prediction of overall quality and textual alignment scores
  • Flexible Training: Support for multiple loss functions (MSE, L1, Huber) and optimizers
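
The Sinkhorn-based alignment term can be sketched with geomloss (listed under Installation). The blur value and the way the term is folded into the training objective via --alignment_weight and --loss_fn are illustrative assumptions, not the exact settings of train_align.py.

import torch.nn.functional as F
from geomloss import SamplesLoss

# Sinkhorn divergence between audio and text token sequences (shared feature dim).
sinkhorn = SamplesLoss(loss="sinkhorn", p=2, blur=0.05)  # blur is an assumption

def training_loss(pred_oqm, pred_ta, gt_oqm, gt_ta,
                  audio_tokens, text_tokens, alignment_weight=4e-5):
    # Prediction loss on both scores (mirrors --loss_fn huber).
    pred_loss = F.huber_loss(pred_oqm, gt_oqm) + F.huber_loss(pred_ta, gt_ta)
    # Cross-modal alignment: audio_tokens (B, N, D), text_tokens (B, M, D).
    align_loss = sinkhorn(audio_tokens, text_tokens).mean()
    # Weighted combination (mirrors --alignment_weight).
    return pred_loss + alignment_weight * align_loss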

Architecture

[Architecture diagram: WhisQ overview]

Project Structure

src/
├── train_align.py      # Main training script with optimal transport alignment
├── evaluate.py         # Evaluation script for validation and test sets
├── wrapper.py          # WhisperQwenWrapper 
├── model.py           # MosPredictor model definition
├── data_utils.py      # Dataset utilities and data loading
├── utils.py           # Helper functions and utilities
└── sweep.yaml         # W&B hyperparameter sweep configuration

Installation

Requirements

pip install torch torchvision torchaudio
pip install transformers
pip install geomloss 
pip install wandb     
pip install tqdm scipy numpy

Model Dependencies

The implementation uses the following pretrained models:

  • Audio: openai/whisper-base
  • Text: Qwen/Qwen3-0.6B-Base
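
Both backbones are available through Hugging Face transformers. The snippet below is a minimal sketch of loading the checkpoints; the actual feature extraction, pooling, and fusion live in wrapper.py, and Qwen3 checkpoints require a sufficiently recent transformers release.

from transformers import (
    AutoModel, AutoTokenizer,
    WhisperFeatureExtractor, WhisperModel,
)

# Audio branch: Whisper-Base encoder plus its log-mel feature extractor.
audio_model = WhisperModel.from_pretrained("openai/whisper-base")
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")

# Text branch: Qwen3-0.6B base model and tokenizer.
text_model = AutoModel.from_pretrained("Qwen/Qwen3-0.6B-Base")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B-Base")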

Usage

Training

Basic training with optimal transport alignment:

python train_align.py \
    --datadir ../data/MusicEval-phase1 \
    --expname exp_whisq_ot \
    --alignment_loss ot \
    --alignment_weight 0.00004056897283396114 \
    --use_seq_coatt \
    --lr 0.0007306998648015165 \
    --optimizer sgd \
    --batch_size 128 \
    --epochs 250 \
    --momentum 0.7435171279297897 \
    --loss_fn huber

Training Arguments

Argument            Default                    Description
--datadir           ../data/MusicEval-phase1   Path to dataset directory
--expname           exp_hub                    Experiment name for W&B logging
--alignment_loss    ot                         Alignment loss type (none, ot)
--alignment_weight  0.00004056897283396114     Weight for alignment loss
--use_seq_coatt     False                      Enable sequence co-attention
--lr                0.0007306998648015165      Learning rate
--optimizer         sgd                        Optimizer (sgd, adam)
--batch_size        128                        Training batch size
--epochs            250                        Number of training epochs
--momentum          0.7435171279297897         SGD momentum
--loss_fn           huber                      Loss function (mse, l1, huber)
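
For reference, these flags map onto an argparse setup roughly as follows. This is an illustrative sketch; the authoritative types, defaults, and any additional options are defined in train_align.py.

import argparse

def build_parser():
    # Illustrative mirror of the documented CLI; train_align.py is authoritative.
    p = argparse.ArgumentParser(description="WhisQ training (sketch)")
    p.add_argument("--datadir", default="../data/MusicEval-phase1")
    p.add_argument("--expname", default="exp_hub")
    p.add_argument("--alignment_loss", default="ot", choices=["none", "ot"])
    p.add_argument("--alignment_weight", type=float, default=0.00004056897283396114)
    p.add_argument("--use_seq_coatt", action="store_true")
    p.add_argument("--lr", type=float, default=0.0007306998648015165)
    p.add_argument("--optimizer", default="sgd", choices=["sgd", "adam"])
    p.add_argument("--batch_size", type=int, default=128)
    p.add_argument("--epochs", type=int, default=250)
    p.add_argument("--momentum", type=float, default=0.7435171279297897)
    p.add_argument("--loss_fn", default="huber", choices=["mse", "l1", "huber"])
    return p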

Evaluation

Evaluate on validation set:

python evaluate.py \
    --datadir ../data/eval \
    --ckpt ../track1_ckpt/expot_4ot/best_ckpt_148

Generate test predictions:

python evaluate.py \
    --datadir ../data/eval \
    --ckpt ../track1_ckpt/expot_4ot/best_ckpt_148 \
    --test_mode \
    --test_list sets/test_list.txt

Hyperparameter Sweeps

The project includes a W&B sweep configuration (sweep.yaml) for hyperparameter optimization:

program: train_align.py
method: bayes
metric:
  name: val_loss
  goal: minimize
parameters:
  lr:
    distribution: log_uniform_values
    min: 0.0001
    max: 0.01
  alignment_weight:
    distribution: log_uniform_values
    min: 0.00001
    max: 0.001
  momentum:
    distribution: uniform
    min: 0.5
    max: 0.9
  batch_size:
    values: [64, 128, 256]
  optimizer:
    values: ['sgd', 'adam']
  loss_fn:
    values: ['mse', 'l1', 'huber']

Run sweep:

wandb sweep sweep.yaml
wandb agent <sweep_id>

Results

The model achieves state-of-the-art performance on MusicEval Track-1, outperforming the challenge baseline at both the utterance and system level:

Baseline Comparison

Metric   Baseline (Utterance)   WhisQ + OT (Utterance)   Baseline (System)   WhisQ + OT (System)

Overall Quality (OQM)
MSE↓     0.6175                 0.3584                   0.3863              0.1095
LCC↑     0.6908                 0.7523                   0.8016              0.8991
SRCC↑    0.6881                 0.7558                   0.7764              0.8773
KTAU↑    0.5143                 0.5746                   0.5862              0.7094

Textual Alignment (TA)
MSE↓     0.5936                 0.4735                   0.2322              0.0773
LCC↑     0.5803                 0.6176                   0.7461              0.8721
SRCC↑    0.5425                 0.6109                   0.7202              0.8695
KTAU↑    0.3933                 0.4474                   0.5074              0.6749
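
The four metrics can be computed from predicted and ground-truth score arrays with numpy and scipy (both in the requirements); evaluate.py is the authoritative implementation, so treat this as a sketch.

import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

def score_metrics(pred, gt):
    # Metrics for one score type (OQM or TA) at utterance or system level.
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    return {
        "MSE": float(np.mean((pred - gt) ** 2)),   # mean squared error (lower is better)
        "LCC": float(pearsonr(pred, gt)[0]),       # linear (Pearson) correlation
        "SRCC": float(spearmanr(pred, gt)[0]),     # Spearman rank correlation
        "KTAU": float(kendalltau(pred, gt)[0]),    # Kendall's tau
    }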

Dataset Structure

Expected dataset structure:

data/
├── MusicEval-phase1/
│   ├── wav/                    # Audio files
│   ├── sets/
│   │   ├── train_mos_list.txt  # Training list
│   │   └── dev_mos_list.txt    # Validation list
│   └── system_mos/
│       └── system_mos_phase1.csv  # System-level ground truth
└── eval/
    ├── wav/                    # Test audio files
    ├── sets/
    │   └── eval_list.txt       # Test list
    └── system_mos/
        └── system_mos_phase1.csv

Data Format

Training/Validation lists (train_mos_list.txt, dev_mos_list.txt):

filename1,overall_score,textual_score
filename2,overall_score,textual_score

Test list (eval_list.txt):

filename1
filename2

System-level ground truth (system_mos_phase1.csv):

system_id,overall_mos,textual_mos
S001,4.2,3.8
S002,3.9,4.1
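
A minimal sketch of reading these files, assuming exactly the comma-separated layouts shown above (the repository's data_utils.py is the authoritative loader):

import csv

def read_mos_list(path):
    # train_mos_list.txt / dev_mos_list.txt: filename,overall_score,textual_score
    with open(path, newline="") as f:
        return [(name, float(oqm), float(ta)) for name, oqm, ta in csv.reader(f)]

def read_system_mos(path):
    # system_mos_phase1.csv with a system_id,overall_mos,textual_mos header row
    with open(path, newline="") as f:
        return {row["system_id"]: (float(row["overall_mos"]), float(row["textual_mos"]))
                for row in csv.DictReader(f)}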

Citation

If you use this code in your research, please cite:

Coming Soon!  

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • OpenAI for the Whisper model
  • Alibaba for the Qwen model series
  • The MusicEval challenge organizers
