This is the code accompanying the paper “MM-HSD: Multi-Modal Hate Speech Detection in Videos” by B. Céspedes-Sarrias, C. Collado-Capell, P. Rodenas-Ruiz, O. Hrynenko, and A. Cavallaro, published in the Proceedings of the 33rd ACM International Conference on Multimedia (ACM MM ’25).
The code reproduces the paper’s three experimental setups: (I) Late fusion of modality encoders, (II) Late fusion + CMA as an additional modality (MM-HSD), and (III) CMA as a standalone early feature extractor.
Content warning: research artifacts and examples may contain hate speech.
Clone the repository:
git clone git@github.com:idiap/mm-hsd.git
cd mm-hsd
This project targets Python 3.9 and uses pyproject.toml for all dependencies.
Choose one of the options below (uv is recommended).
uv will automatically provide a matching Python interpreter (3.9) and create an isolated environment.
Find installation instructions here
CPU-only:
uv sync --extra cpu
CUDA 12.9:
uv sync --extra cu129 --extra-index-url https://download.pytorch.org/whl/cu129
Use conda only to supply Python 3.9, then install from pyproject.toml.
conda create -n mm_hsd python=3.9
conda activate mm_hsd
# CPU-only
pip install -e ".[cpu]"
# CUDA 12.9
pip install --extra-index-url https://download.pytorch.org/whl/cu129 -e ".[cu129]"
Create a virtual environment with Python 3.9 and install from pyproject.toml.
python3.9 -m venv .venv
source .venv/bin/activate
# CPU-only
pip install -e ".[cpu]"
# CUDA 12.9
pip install --extra-index-url https://download.pytorch.org/whl/cu129 -e ".[cu129]"
[cpu] and [cu129] in pyproject.toml decide whether PyTorch is installed; they don't lock the CUDA build.
For example, to install CUDA 12.4 (cu124) instead of cu129, install the desired PyTorch wheels first, then install this package without its dependencies:
Using uv
# 1) Install the exact CUDA wheels you want (example: CUDA 12.4)
uv pip install --extra-index-url https://download.pytorch.org/whl/cu124 \
"torch==2.8.*+cu124" "torchvision==0.23.*+cu124"
# 2) Install your package without touching deps
uv pip install -e . --no-deps
Using venv or conda + pip
# venv example
python3.9 -m venv .venv && source .venv/bin/activate
# OR: conda create -n mm_hsd python=3.9 && conda activate mm_hsd
# 1) CUDA wheels you want (example: CUDA 12.4)
pip install --extra-index-url https://download.pytorch.org/whl/cu124 \
"torch==2.8.*+cu124" "torchvision==0.23.*+cu124"
# 2) Install your package without re-resolving deps
pip install -e . --no-deps
With uv:
uv run python -c "import mm_hsd, torch; print('mm_hsd:', mm_hsd.__file__); print('torch:', torch.__version__, 'CUDA:', torch.cuda.is_available())"
With conda/venv:
python -c "import mm_hsd, torch; print('mm_hsd:', mm_hsd.__file__); print('torch:', torch.__version__, 'CUDA:', torch.cuda.is_available())"- Use
-e(editable) during development (i.e. edits insrc/mm_hsdare immediately reflected when youimport mm_hsd); omit it for a regular install:pip install ".[cpu]". - Package layout:
src/mm_hsd/__init__.py. After installation,import mm_hsdshould work from anywhere.
Below is the description of how features were extracted and the format expected by the model.
The code expects videos grouped by label (hate, non_hate). For reference, <video_name> should be hate_video_1, hate_video_2, non_hate_video_1, non_hate_video_2, etc.
The pipeline preprocesses each video to obtain four modalities, out of which we extract different features.
Below are the exact models used in our pipeline.
- Audio transcripts (Whisper): openai/whisper-small model.
- On-screen text (OCR): paddleocr English pipeline. Requires the paddleocr and paddlepaddle packages.
- Audio features (Wav2Vec2): jonatasgrosman/wav2vec2-large-xlsr-53-english (1024-dimensional, mean-pooled last hidden state).
- Video features (ViT): google/vit-base-patch16-224-in21k (per-frame [CLS], 768-dimensional).
- Text embeddings (Detoxify): Detoxify("original") (last-layer [CLS], 768-dimensional). Requires the detoxify package.
Here we describe how to obtain modalities before feature extraction.
- Audio extraction: For each video, extract a mono 16 kHz (PCM 16-bit) WAV. We use the file name <video_name>.wav.
- Frame extraction: Sample 1 frame per second from each video and save the frames as JPGs.
- Audio transcripts: Run OpenAI's whisper-small on each audio file <video_name>.wav and keep one transcript string per audio.
- On-screen text: Run OCR on all sampled frames of a video in temporal order. Clean the text (punctuation and whitespace), remove near-duplicates, and merge overlaps into a single string.
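The following is a minimal sketch of how these four preprocessing steps could be scripted; it is illustrative only, not the exact code used for the paper. It assumes ffmpeg is on the PATH, uses the openai/whisper-small checkpoint through the transformers pipeline, and treats the input/output paths and the near-duplicate filtering as placeholders to adapt.
# Sketch of the preprocessing steps above (illustrative; paths are placeholders).
import subprocess
from pathlib import Path

from paddleocr import PaddleOCR           # on-screen text
from transformers import pipeline         # Whisper transcription

video_path = Path("videos/hate/hate_video_1.mp4")      # placeholder input video
out_dir = Path("preprocessed") / video_path.stem
out_dir.mkdir(parents=True, exist_ok=True)

# 1) Audio extraction: mono, 16 kHz, PCM 16-bit WAV named <video_name>.wav
wav_path = out_dir / f"{video_path.stem}.wav"
subprocess.run(["ffmpeg", "-y", "-i", str(video_path), "-ac", "1", "-ar", "16000",
                "-acodec", "pcm_s16le", str(wav_path)], check=True)

# 2) Frame extraction: 1 frame per second, saved as JPGs in temporal order
frames_dir = out_dir / "frames"
frames_dir.mkdir(exist_ok=True)
subprocess.run(["ffmpeg", "-y", "-i", str(video_path), "-vf", "fps=1",
                str(frames_dir / "frame_%04d.jpg")], check=True)

# 3) Audio transcript: whisper-small, one transcript string per audio
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small",
               chunk_length_s=30)
transcript = asr(str(wav_path))["text"]

# 4) On-screen text: OCR each sampled frame in order, clean and merge into one string
#    (the near-duplicate removal here is deliberately naive)
ocr = PaddleOCR(lang="en")
seen, pieces = set(), []
for frame in sorted(frames_dir.glob("frame_*.jpg")):
    result = ocr.ocr(str(frame))          # result layout depends on the paddleocr version
    for line in (result[0] or []):
        text = line[1][0].strip()
        if text and text.lower() not in seen:
            seen.add(text.lower())
            pieces.append(text)
ocr_text = " ".join(pieces)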
Our models expect a feature folder features/ with the following structure:
features/
├── audio/
│ ├── hate_features.csv
│ └── non_hate_features.csv
├── video/
│ ├── <video_name>_vit.p
│ └── ...
├── text/
│ └── embeddings_detoxify.json
└── ocr/
└── embeddings_detoxify.json
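Before pointing the config at this folder, it can help to verify that every video has features in all four places. The snippet below is a hypothetical sanity check, not part of the repository; the feature path, the example video names, and the assumption that the JSON keys are bare <video_name> strings are placeholders to adapt.
# Hypothetical completeness check for the features/ folder (not part of the repo).
import json
import pickle
from pathlib import Path

import pandas as pd

features = Path("/path/to/features")                       # placeholder feature folder
video_names = ["hate_video_1", "non_hate_video_1"]         # placeholder video names

# Audio: one CSV per label, with <video_name>.wav in the file_name column
audio = pd.concat([pd.read_csv(features / "audio" / f"{label}_features.csv")
                   for label in ("hate", "non_hate")])
audio_files = set(audio["file_name"])

# Text / OCR: JSONs assumed to be keyed by bare <video_name> strings
text_emb = json.loads((features / "text" / "embeddings_detoxify.json").read_text())
ocr_emb = json.loads((features / "ocr" / "embeddings_detoxify.json").read_text())

for name in video_names:
    assert f"{name}.wav" in audio_files, f"missing audio features for {name}"
    assert name in text_emb and name in ocr_emb, f"missing text/OCR embeddings for {name}"
    with open(features / "video" / f"{name}_vit.p", "rb") as f:
        frames = pickle.load(f)
    assert len(frames) == 100 and frames[0].shape == (768,), f"bad video features for {name}"
print("feature folder looks complete")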
The config paths should be provided as such:
dir_frames: /path/to/features/video
dir_audio: /path/to/features/audio
dir_text: /path/to/features/text/embeddings_detoxify.json
dir_ocr: /path/to/features/ocr/embeddings_detoxify.json
One CSV file per label. The columns are the following:
- file_name (string): the audio file name, <video_name>.wav
- feature_0, feature_1, ..., feature_1023 (floats): the 1024-dimensional wav2vec2 feature vector, obtained as the mean over time of the last hidden state from wav2vec2.
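As an illustration (not the repository's exact extraction script), the CSV for one label could be produced along these lines; the input WAV folder and the output path are placeholders, and the non_hate CSV is built the same way.
# Illustrative sketch for producing one label's audio CSV (not the repo's exact script).
from pathlib import Path

import pandas as pd
import soundfile as sf
import torch
from transformers import Wav2Vec2Model, Wav2Vec2Processor

model_id = "jonatasgrosman/wav2vec2-large-xlsr-53-english"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2Model.from_pretrained(model_id).eval()

rows = []
for wav_path in sorted(Path("audio/hate").glob("*.wav")):   # placeholder folder of hate-label WAVs
    speech, sr = sf.read(str(wav_path))                      # 16 kHz mono, as prepared above
    inputs = processor(speech, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state           # shape (1, T, 1024)
    vec = hidden.mean(dim=1).squeeze(0).numpy()              # mean over time -> (1024,)
    rows.append({"file_name": wav_path.name,
                 **{f"feature_{i}": float(v) for i, v in enumerate(vec)}})

out_dir = Path("features/audio")
out_dir.mkdir(parents=True, exist_ok=True)
pd.DataFrame(rows).to_csv(out_dir / "hate_features.csv", index=False)   # non_hate is analogous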
One pickle file per video (all videos in the same folder), named <video_name>_vit.p. Each file stores a Python list of length 100 (100 frames), where each element is a 768-dimensional feature vector extracted with ViT.
Particularities
- We build a sequence of 100 frames per video:
- If the video has fewer than 100 frames at 1 FPS, we pad with 224×224 black images.
- If it has more than 100, 100 frames are selected, evenly spaced across the video.
- The pickled object must yield a Python list of length 100, each element a NumPy array of shape (768,).
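A sketch of how a <video_name>_vit.p file matching this format could be produced is shown below; it is illustrative only, with the frames folder and output path as placeholders.
# Illustrative sketch for producing <video_name>_vit.p (not the repo's exact script).
import pickle
from pathlib import Path

import numpy as np
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

model_id = "google/vit-base-patch16-224-in21k"
processor = ViTImageProcessor.from_pretrained(model_id)
model = ViTModel.from_pretrained(model_id).eval()

video_name = "hate_video_1"                                   # placeholder
frame_paths = sorted(Path(f"preprocessed/{video_name}/frames").glob("frame_*.jpg"))

# Keep exactly 100 frames: subsample evenly or pad with 224x224 black images
if len(frame_paths) >= 100:
    idx = np.linspace(0, len(frame_paths) - 1, 100).astype(int)
    images = [Image.open(frame_paths[i]).convert("RGB") for i in idx]
else:
    images = [Image.open(p).convert("RGB") for p in frame_paths]
    images += [Image.new("RGB", (224, 224))] * (100 - len(images))

features = []
for img in images:
    inputs = processor(images=img, return_tensors="pt")
    with torch.no_grad():
        cls = model(**inputs).last_hidden_state[:, 0, :]      # per-frame [CLS], shape (1, 768)
    features.append(cls.squeeze(0).numpy())                   # list element: np.ndarray of shape (768,)

out_dir = Path("features/video")
out_dir.mkdir(parents=True, exist_ok=True)
with open(out_dir / f"{video_name}_vit.p", "wb") as f:
    pickle.dump(features, f)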
A JSON file mapping each file name to a 768-dimensional vector extracted from the audio transcripts using Detoxify.
- JSON structure: { "<video_name>": [0.124, ... , 0.765] }
A JSON file mapping each audio file name to a 768-dimensional vector extracted using Detoxify from the OCR text of the corresponding video.
- JSON structure: same as transcripts JSON.
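Both JSON files could be produced along the lines of the sketch below. This is illustrative only: reading the [CLS] embedding through detox.tokenizer / detox.model relies on the detoxify package's internals, and the transcript dictionary and output paths are placeholders.
# Illustrative sketch for producing the Detoxify embedding JSONs
# (accessing .tokenizer/.model relies on the detoxify package's internals).
import json
from pathlib import Path

import torch
from detoxify import Detoxify

detox = Detoxify("original")
detox.model.eval()

# Placeholder: mapping <video_name> -> transcript string (use the OCR strings for the ocr/ JSON)
transcripts = {"hate_video_1": "example transcript ..."}

embeddings = {}
for name, text in transcripts.items():
    inputs = detox.tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = detox.model(**inputs, output_hidden_states=True)
    cls = out.hidden_states[-1][:, 0, :].squeeze(0)           # last-layer [CLS], shape (768,)
    embeddings[name] = cls.tolist()

out_dir = Path("features/text")                               # use features/ocr for the OCR JSON
out_dir.mkdir(parents=True, exist_ok=True)
with open(out_dir / "embeddings_detoxify.json", "w") as f:
    json.dump(embeddings, f)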
This repository provides a full pipeline for training, evaluating, and logging MM-HSD (and other configurations). It supports multi-modal data loading, model setup and training with configurable parameters, optional experiment tracking with wandb (which requires the user to provide their own initialization script), and flexible testing, run either automatically after training or separately. The pipeline also saves the best model per fold during cross-validation, or a single best model when not using cross-validation.
Before launching the code, a YAML config file needs to be created. Below is a guide to the variables that are specific to the setups presented in the paper (excluding those that are common, such as input_size_text or dir_text, or training parameters such as batch_size).
Variable load_modalities determines which videos are used for training by requiring presence across all specified modalities. For consistency in experiments, keep this the same across setups, as it does not control which modalities are actually used in the model.
load_modalities:
- text
- audio
- video
- ocr
Variable include_modalities determines which modalities will be included in the training input. E.g. if only using text, audio and video:
include_modalities:
- text
- audio
- video
For all experiments, provide:
input_size_text: # Size of input embeddings
input_size_audio: # Size of audio embeddings
input_size_ocr: # Size of on-screen text embeddings
input_size_video: # Size of video embeddings
The main setups presented in the paper and their corresponding parameters (those that are not common for all setups) are:
I) Late fusion experiments: Each modality is first independently encoded (e.g. by LSTM or Fully Connected (FC) layers), and the outputs are fused using either concatenation or Cross-Modal Attention (CMA).
Example: Unimodal Audio
fusion: null
include_modalities:
- audio
load_modalities:
- text
- audio
- video
- ocr
Example: Audio+Video+Text with CMA fusion
fusion: cross_modal
include_modalities:
- audio
- video
- text
load_modalities:
- text
- audio
- video
- ocr
query_modalities:
- audio
key_modalities:
- video
- text
II) Late Fusion with CMA as Additional Modality: CMA is applied to raw modality embeddings to generate an additional feature, which is then combined (via concatenation) with the encoded features.
Example: Audio+Video+Text+OCR+CMA with concat fusion (MM-HSD)
fusion: concat
include_modalities:
- audio
- video
- text
- ocr
- cross_modal
query_modalities:
- ocr
key_modalities:
- video
- text
- audio
load_modalities:
- text
- audio
- video
- ocr
III) Early Fusion with CMA as Unique Feature Extractor: Raw modality embeddings are fused directly using CMA, without any modality-specific processing.
Example: CMA (as a standalone feature extractor) using Video+Audio+OCR
fusion: null
include_modalities:
- cross_modal
load_modalities:
- text
- audio
- video
- ocr
query_modalities:
- ocr
key_modalities:
- video
- audio
Launch training with:
python mm-hsd/src/mm_hsd/scripts/train.py -c mm-hsd/src/mm_hsd/configs/config_indiv_and_cma.yml
Berta Céspedes-Sarrias, Carlos Collado-Capell, Pablo Rodenas-Ruiz, Olena Hrynenko, and Andrea Cavallaro. 2025. MM-HSD: Multi-Modal Hate Speech Detection in Videos. In Proceedings of the 33rd ACM International Conference on Multimedia (MM ’25), October 27–31, 2025, Dublin, Ireland. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3746027.3754558
Bibtex
@inproceedings{cespedes-sarrias_mm-hsd_2025,
title = {MM-HSD: Multi-Modal Hate Speech Detection in Videos},
doi = {10.1145/3746027.3754558},
booktitle = {Proceedings of the 33rd {ACM} {International} {Conference} on {Multimedia}},
publisher = {ACM},
author = {Céspedes-Sarrias, Berta and Collado-Capell, Carlos and Rodenas-Ruiz, Pablo and Hrynenko, Olena and Cavallaro, Andrea},
year = {2025},
}