This is the code accompanying the paper “MM-HSD: Multi-Modal Hate Speech Detection in Videos” by B. Céspedes-Sarrias, C. Collado-Capell, P. Rodenas-Ruiz, O. Hrynenko, and A. Cavallaro, published in the Proceedings of the 33rd ACM International Conference on Multimedia (ACM MM ’25).
The code reproduces the paper’s three experimental setups: (I) Late fusion of modality encoders, (II) Late fusion + CMA as an additional modality (MM-HSD), and (III) CMA as a standalone early feature extractor.
Content warning: research artifacts and examples may contain hate speech.
Clone the repository:
git clone git@github.com:idiap/mm-hsd.git
cd mm-hsd
This project targets Python 3.9 and uses pyproject.toml for all dependencies.
Choose one of the options below (uv is recommended).
uv will automatically provide a matching Python interpreter (3.9) and create an isolated environment.
Find installation instructions here
CPU-only:
uv sync --extra cpu
CUDA 12.9:
uv sync --extra cu129 --extra-index-url https://download.pytorch.org/whl/cu129
Use conda only to supply Python 3.9, then install from pyproject.toml.
conda create -n mm_hsd python=3.9
conda activate mm_hsd
# CPU-only
pip install -e ".[cpu]"
# CUDA 12.9
pip install --extra-index-url https://download.pytorch.org/whl/cu129 -e ".[cu129]"
Create a virtual environment with Python 3.9 and install from pyproject.toml.
python3.9 -m venv .venv
source .venv/bin/activate
# CPU-only
pip install -e ".[cpu]"
# CUDA 12.9
pip install --extra-index-url https://download.pytorch.org/whl/cu129 -e ".[cu129]"
[cpu] and [cu129] in pyproject.toml decide whether PyTorch is installed; they don't lock the CUDA build.
For example, to install CUDA 12.4 (cu124) instead of cu129, install the desired PyTorch wheels first, then install this package without its dependencies:
Using uv
# 1) Install the exact CUDA wheels you want (example: CUDA 12.4)
uv pip install --extra-index-url https://download.pytorch.org/whl/cu124 \
"torch==2.8.*+cu124" "torchvision==0.23.*+cu124"
# 2) Install your package without touching deps
uv pip install -e . --no-deps
Using venv or conda + pip
# venv example
python3.9 -m venv .venv && source .venv/bin/activate
# OR: conda create -n mm_hsd python=3.9 && conda activate mm_hsd
# 1) CUDA wheels you want (example: CUDA 12.4)
pip install --extra-index-url https://download.pytorch.org/whl/cu124 \
"torch==2.8.*+cu124" "torchvision==0.23.*+cu124"
# 2) Install your package without re-resolving deps
pip install -e . --no-deps
With uv:
uv run python -c "import mm_hsd, torch; print('mm_hsd:', mm_hsd.__file__); print('torch:', torch.__version__, 'CUDA:', torch.cuda.is_available())"
With conda/venv:
python -c "import mm_hsd, torch; print('mm_hsd:', mm_hsd.__file__); print('torch:', torch.__version__, 'CUDA:', torch.cuda.is_available())"- Use
-e(editable) during development (i.e. edits insrc/mm_hsdare immediately reflected when youimport mm_hsd); omit it for a regular install:pip install ".[cpu]". - Package layout:
src/mm_hsd/__init__.py. After installation,import mm_hsdshould work from anywhere.
Below is the description of how features were extracted and the format expected by the model.
The code expects videos grouped by label (hate, non_hate). For reference, <video_name> should be hate_video_1, hate_video_2, non_hate_video_1, non_hate_video_2, etc.
The pipeline preprocesses each video to obtain four modalities, out of which we extract different features.
Below are the exact models used in our pipeline.
- Audio transcripts (Whisper): openai/whisper-small model.
- On-screen text (OCR): paddleocr English pipeline. Requires the paddleocr and paddlepaddle packages.
- Audio features (Wav2Vec2): jonatasgrosman/wav2vec2-large-xlsr-53-english (1024-dimensional, mean-pooled last hidden state).
- Video features (ViT): google/vit-base-patch16-224-in21k (per-frame [CLS], 768-dimensional).
- Text embeddings (Detoxify): Detoxify("original") (last-layer [CLS], 768-dimensional). Requires the detoxify package.
Here we describe how to obtain modalities before feature extraction.
- Audio extraction: For each video, extract a mono 16 kHz (PCM 16-bit) WAV. We use the file name <video_name>.wav.
- Frame extraction: Sample 1 frame per second from each video and save the frames as JPGs.
- Audio transcripts: Run OpenAI's whisper-small on each audio file <video_name>.wav and keep one transcript string per audio.
- On-screen text: Run OCR on all sampled frames of a video in temporal order. Clean the text (punctuation and whitespace), remove near-duplicates, and merge overlaps into a single string.
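The following is a minimal sketch of how these four preprocessing steps could be scripted; it is illustrative only, not the exact code used for the paper. It assumes ffmpeg is on the PATH, uses the openai/whisper-small checkpoint through the transformers pipeline, and treats the input/output paths and the near-duplicate filtering as placeholders to adapt.
# Sketch of the preprocessing steps above (illustrative; paths are placeholders).
import subprocess
from pathlib import Path

from paddleocr import PaddleOCR           # on-screen text
from transformers import pipeline         # Whisper transcription

video_path = Path("videos/hate/hate_video_1.mp4")      # placeholder input video
out_dir = Path("preprocessed") / video_path.stem
out_dir.mkdir(parents=True, exist_ok=True)

# 1) Audio extraction: mono, 16 kHz, PCM 16-bit WAV named <video_name>.wav
wav_path = out_dir / f"{video_path.stem}.wav"
subprocess.run(["ffmpeg", "-y", "-i", str(video_path), "-ac", "1", "-ar", "16000",
                "-acodec", "pcm_s16le", str(wav_path)], check=True)

# 2) Frame extraction: 1 frame per second, saved as JPGs in temporal order
frames_dir = out_dir / "frames"
frames_dir.mkdir(exist_ok=True)
subprocess.run(["ffmpeg", "-y", "-i", str(video_path), "-vf", "fps=1",
                str(frames_dir / "frame_%04d.jpg")], check=True)

# 3) Audio transcript: whisper-small, one transcript string per audio
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small",
               chunk_length_s=30)
transcript = asr(str(wav_path))["text"]

# 4) On-screen text: OCR each sampled frame in order, clean and merge into one string
#    (the near-duplicate removal here is deliberately naive)
ocr = PaddleOCR(lang="en")
seen, pieces = set(), []
for frame in sorted(frames_dir.glob("frame_*.jpg")):
    result = ocr.ocr(str(frame))          # result layout depends on the paddleocr version
    for line in (result[0] or []):
        text = line[1][0].strip()
        if text and text.lower() not in seen:
            seen.add(text.lower())
            pieces.append(text)
ocr_text = " ".join(pieces)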
Our models expect a feature folder features/ with the following structure:
features/
├── audio/
│ ├── hate_features.csv
│ └── non_hate_features.csv
├── video/
│ ├── <video_name>_vit.p
│ └── ...
├── text/
│ └── embeddings_detoxify.json
└── ocr/
└── embeddings_detoxify.json
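Before pointing the config at this folder, it can help to verify that every video has features in all four places. The snippet below is a hypothetical sanity check, not part of the repository; the feature path, the example video names, and the assumption that the JSON keys are bare <video_name> strings are placeholders to adapt.
# Hypothetical completeness check for the features/ folder (not part of the repo).
import json
import pickle
from pathlib import Path

import pandas as pd

features = Path("/path/to/features")                       # placeholder feature folder
video_names = ["hate_video_1", "non_hate_video_1"]         # placeholder video names

# Audio: one CSV per label, with <video_name>.wav in the file_name column
audio = pd.concat([pd.read_csv(features / "audio" / f"{label}_features.csv")
                   for label in ("hate", "non_hate")])
audio_files = set(audio["file_name"])

# Text / OCR: JSONs assumed to be keyed by bare <video_name> strings
text_emb = json.loads((features / "text" / "embeddings_detoxify.json").read_text())
ocr_emb = json.loads((features / "ocr" / "embeddings_detoxify.json").read_text())

for name in video_names:
    assert f"{name}.wav" in audio_files, f"missing audio features for {name}"
    assert name in text_emb and name in ocr_emb, f"missing text/OCR embeddings for {name}"
    with open(features / "video" / f"{name}_vit.p", "rb") as f:
        frames = pickle.load(f)
    assert len(frames) == 100 and frames[0].shape == (768,), f"bad video features for {name}"
print("feature folder looks complete")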
The config paths should be provided as such:
dir_frames: /path/to/features/video
dir_audio: /path/to/features/audio
dir_text: /path/to/features/text/embeddings_detoxify.json
dir_ocr: /path/to/features/ocr/embeddings_detoxify.json
One CSV file per label. The columns are the following:
- file_name (string): the audio file name, <video_name>.wav
- feature_0, feature_1, ..., feature_1023 (floats): the 1024-dimensional wav2vec2 feature vector, obtained as the mean over time of the last hidden state from wav2vec2.
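As an illustration (not the repository's exact extraction script), the CSV for one label could be produced along these lines; the input WAV folder and the output path are placeholders, and the non_hate CSV is built the same way.
# Illustrative sketch for producing one label's audio CSV (not the repo's exact script).
from pathlib import Path

import pandas as pd
import soundfile as sf
import torch
from transformers import Wav2Vec2Model, Wav2Vec2Processor

model_id = "jonatasgrosman/wav2vec2-large-xlsr-53-english"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2Model.from_pretrained(model_id).eval()

rows = []
for wav_path in sorted(Path("audio/hate").glob("*.wav")):   # placeholder folder of hate-label WAVs
    speech, sr = sf.read(str(wav_path))                      # 16 kHz mono, as prepared above
    inputs = processor(speech, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state           # shape (1, T, 1024)
    vec = hidden.mean(dim=1).squeeze(0).numpy()              # mean over time -> (1024,)
    rows.append({"file_name": wav_path.name,
                 **{f"feature_{i}": float(v) for i, v in enumerate(vec)}})

out_dir = Path("features/audio")
out_dir.mkdir(parents=True, exist_ok=True)
pd.DataFrame(rows).to_csv(out_dir / "hate_features.csv", index=False)   # non_hate is analogous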
One pickle file per video (all videos in the same folder), named <video_name>_vit.p. Each file stores a Python list of length 100 (100 frames), where each element is a 768-dimensional feature vector extracted with ViT.
Particularities
- We build a sequence of 100 frames per video:
- If the video has fewer than 100 frames at 1 FPS, we pad with 224×224 black images.
- If it has more than 100, 100 frames are selected, evenly spaced across the video.
- The pickled object must yield a Python list of length 100, each element a NumPy array of shape (768,).
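A sketch of how a <video_name>_vit.p file matching this format could be produced is shown below; it is illustrative only, with the frames folder and output path as placeholders.
# Illustrative sketch for producing <video_name>_vit.p (not the repo's exact script).
import pickle
from pathlib import Path

import numpy as np
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

model_id = "google/vit-base-patch16-224-in21k"
processor = ViTImageProcessor.from_pretrained(model_id)
model = ViTModel.from_pretrained(model_id).eval()

video_name = "hate_video_1"                                   # placeholder
frame_paths = sorted(Path(f"preprocessed/{video_name}/frames").glob("frame_*.jpg"))

# Keep exactly 100 frames: subsample evenly or pad with 224x224 black images
if len(frame_paths) >= 100:
    idx = np.linspace(0, len(frame_paths) - 1, 100).astype(int)
    images = [Image.open(frame_paths[i]).convert("RGB") for i in idx]
else:
    images = [Image.open(p).convert("RGB") for p in frame_paths]
    images += [Image.new("RGB", (224, 224))] * (100 - len(images))

features = []
for img in images:
    inputs = processor(images=img, return_tensors="pt")
    with torch.no_grad():
        cls = model(**inputs).last_hidden_state[:, 0, :]      # per-frame [CLS], shape (1, 768)
    features.append(cls.squeeze(0).numpy())                   # list element: np.ndarray of shape (768,)

out_dir = Path("features/video")
out_dir.mkdir(parents=True, exist_ok=True)
with open(out_dir / f"{video_name}_vit.p", "wb") as f:
    pickle.dump(features, f)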
A JSON file mapping each file name to a 768-dimensional vector extracted from the audio transcripts using Detoxify.
- JSON structure: { "<video_name>": [0.124, ... , 0.765] }
A JSON file mapping each audio file name to a 768-dimensional vector extracted using Detoxify from the OCR text of the corresponding video.
- JSON structure: same as transcripts JSON.
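Both JSON files could be produced along the lines of the sketch below. This is illustrative only: reading the [CLS] embedding through detox.tokenizer / detox.model relies on the detoxify package's internals, and the transcript dictionary and output paths are placeholders.
# Illustrative sketch for producing the Detoxify embedding JSONs
# (accessing .tokenizer/.model relies on the detoxify package's internals).
import json
from pathlib import Path

import torch
from detoxify import Detoxify

detox = Detoxify("original")
detox.model.eval()

# Placeholder: mapping <video_name> -> transcript string (use the OCR strings for the ocr/ JSON)
transcripts = {"hate_video_1": "example transcript ..."}

embeddings = {}
for name, text in transcripts.items():
    inputs = detox.tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = detox.model(**inputs, output_hidden_states=True)
    cls = out.hidden_states[-1][:, 0, :].squeeze(0)           # last-layer [CLS], shape (768,)
    embeddings[name] = cls.tolist()

out_dir = Path("features/text")                               # use features/ocr for the OCR JSON
out_dir.mkdir(parents=True, exist_ok=True)
with open(out_dir / "embeddings_detoxify.json", "w") as f:
    json.dump(embeddings, f)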
This repository provides a full pipeline for training, evaluating, and logging MM-HSD (and other configurations). It supports multi-modal data loading, model setup and training with configurable parameters, optional experiment tracking with wandb (which requires the user to provide their own initialization script), and flexible testing, run either automatically after training or separately. The pipeline also saves the best model per fold during cross-validation, or a single best model when not using cross-validation.
Before launching the code, a YAML config file needs to be created. Below is a guide to the variables that are specific to the setups presented in the paper (excluding those that are common, such as input_size_text or dir_text, or training parameters such as batch_size).
Variable load_modalities determines which videos are used for training by requiring presence across all specified modalities. For consistency in experiments, keep this the same across setups, as it does not control which modalities are actually used in the model.
load_modalities:
- text
- audio
- video
- ocr
Variable include_modalities determines which modalities will be included in the training input. E.g. if only using text, audio and video:
include_modalities:
- text
- audio
- video
For all experiments, provide:
input_size_text: # Size of input embeddings
input_size_audio: # Size of audio embeddings
input_size_ocr: # Size of on-screen text embeddings
input_size_video: # Size of video embeddings
The main setups presented in the paper and their corresponding parameters (those that are not common for all setups) are:
I) Late fusion experiments: Each modality is first independently encoded (e.g. by LSTM or Fully Connected (FC) layers), and the outputs are fused using either concatenation or Cross-Modal Attention (CMA).
Example: Unimodal Audio
fusion: null
include_modalities:
- audio
load_modalities:
- text
- audio
- video
- ocr
Example: Audio+Video+Text with CMA fusion
fusion: cross_modal
include_modalities:
- audio
- video
- text
load_modalities:
- text
- audio
- video
- ocr
query_modalities:
- audio
key_modalities:
- video
- text
II) Late Fusion with CMA as Additional Modality: CMA is applied to raw modality embeddings to generate an additional feature, which is then combined (via concatenation) with the encoded features.
Example: Audio+Video+Text+OCR+CMA with concat fusion (MM-HSD)
fusion: concat
include_modalities:
- audio
- video
- text
- ocr
- cross_modal
query_modalities:
- ocr
key_modalities:
- video
- text
- audio
load_modalities:
- text
- audio
- video
- ocr
III) Early Fusion with CMA as Unique Feature Extractor: Raw modality embeddings are fused directly using CMA, without any modality-specific processing.
Example: CMA (as a standalone feature extractor) using Video+Audio+OCR
fusion: null
include_modalities:
- cross_modal
load_modalities:
- text
- audio
- video
- ocr
query_modalities:
- ocr
key_modalities:
- video
- audio
Launch training with:
python mm-hsd/src/mm_hsd/scripts/train.py -c mm-hsd/src/mm_hsd/configs/config_indiv_and_cma.yml
Berta Céspedes-Sarrias, Carlos Collado-Capell, Pablo Rodenas-Ruiz, Olena Hrynenko, and Andrea Cavallaro. 2025. MM-HSD: Multi-Modal Hate Speech Detection in Videos. In Proceedings of the 33rd ACM International Conference on Multimedia (MM ’25), October 27–31, 2025, Dublin, Ireland. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3746027.3754558
Bibtex
@inproceedings{cespedes-sarrias_mm-hsd_2025,
title = {MM-HSD: Multi-Modal Hate Speech Detection in Videos},
doi = {10.1145/3746027.3754558},
booktitle = {Proceedings of the 33rd {ACM} {International} {Conference} on {Multimedia}},
publisher = {ACM},
author = {Céspedes-Sarrias, Berta and Collado-Capell, Carlos and Rodenas-Ruiz, Pablo and Hrynenko, Olena and Cavallaro, Andrea},
year = {2025},
}