🤗 Model Weights | 📊 AnyCapEval Benchmark | 📝 Paper | 📚 Code
- 🏆 Unified Multi-modal Captioning: One framework covers image, audio, and video captioning with controllable styles.
- 📝 Customizable Caption Styles: Control caption styles through predefined instructions and models.
- 📊 Open Benchmark & Evaluation: AnyCapEval—an industry-level, multi-modal benchmark with comprehensive evaluation protocols.
- 🛠️ End-to-End Open Source: Full training pipeline, evaluation toolkits, dataset pipeline and open benchmark.
- Paper released
- AnyCapEval benchmark available
- Pretrained model weights released
- Training dataset (AnyCapDataset) to be open-sourced soon
```bash
git clone https://github.com/qishisuren123/AnyCap.git
cd AnyCap
pip install -r requirements.txt
```

Install Fairseq manually:

```bash
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
```

- AnyCapModel weights on HuggingFace
- We recommend storing the downloaded model weights in the following directory structure:
```
AnyCap/
├── AnyCapDataset/
├── assets/
├── model_weights/
│   ├── ACM
│   ├── InternVL
│   └── ...
├── eval
└── ...
```
- AnyCapEval benchmark on HuggingFace
- Training dataset (AnyCapDataset) will be released soon.
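As a minimal sketch, the ACM weights can be pulled into the recommended layout with `huggingface_hub`; the repository ID below is a placeholder, so substitute the actual AnyCapModel repo linked above:

```python
# Minimal sketch: download AnyCapModel weights into the recommended layout.
# NOTE: "<org>/AnyCapModel" is a placeholder repo ID, not the real one;
# replace it with the AnyCapModel repository linked above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<org>/AnyCapModel",    # placeholder; use the actual HF repo ID
    local_dir="model_weights/ACM",  # matches the recommended directory structure
)
```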
AnyCap is a unified and controllable omni-modal captioning framework, supporting caption generation for images, audio, and videos with fine-grained style control. The framework is fully open-source, featuring training code, benchmark datasets, and a comprehensive evaluation toolkit—all-in-one.
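To make "fine-grained style control" concrete, here is a hypothetical sketch of the kind of control instructions a user can pair with an image, audio clip, or video; the actual predefined instruction set ships with the released training and evaluation code and may differ:

```python
# Illustrative only: hypothetical control instructions for controllable captioning.
# The real predefined instruction set is defined in the AnyCap training/eval code.
control_instructions = [
    "Describe only the background of the image in one brief sentence.",
    "Write a detailed caption of the video, focusing on the main character's actions.",
    "Summarize the audio clip in a single, concise sentence.",
    "Write a narrative-style caption of the video in under 30 words.",
]
```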
Figure 1 above shows an overview of the AnyCap architecture and data pipeline.
Figure 2 – Evaluation methodology of AnyCapEval.
(a) Examples demonstrating content scoring with Key-point Density (KPD) and style scoring rules.
(b) KPD correlation analysis, showing that KPD achieves higher Pearson/Spearman/Kendall correlations with human judgments than length-based metrics.
(c) Radar chart illustrating the large performance gains delivered by ACM integration across ten dimensions (IApt–Thm).
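For intuition, here is a minimal sketch of a KPD-style content score, assuming KPD is roughly the number of reference key points covered by a caption normalized by caption length; the exact matching and normalization used in AnyCapEval are defined in the paper and evaluation code, so treat this purely as an illustration:

```python
# Illustrative sketch of a key-point-density-style score (not the official AnyCapEval metric).
# Assumption: score ~ (reference key points mentioned in the caption) / (caption length in words).
def keypoint_density(caption: str, key_points: list[str]) -> float:
    caption_lower = caption.lower()
    covered = sum(1 for kp in key_points if kp.lower() in caption_lower)
    return covered / max(len(caption.split()), 1)

caption = "A brown dog chases a red ball across a sunny park."
key_points = ["brown dog", "red ball", "park", "nighttime scene"]
print(f"{keypoint_density(caption, key_points):.3f}")  # 3 of 4 key points over 11 words ≈ 0.273
```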
| | GPT-4o | GPT-4o + ACM | InternVL2.5-8B | InternVL2.5-8B + ACM |
|---|---|---|---|---|
| Average ↑ | 2.79 | 4.15 | 2.75 | 3.98 |
Key takeaway: ACM boosts GPT-4o’s content scores by +45% and style scores by +12%, and yields similar gains on strong open-source models, highlighting the reliability and coverage of AnyCapEval.
Here we illustrate usage for the video modality (the audio and image modalities follow a similar structure).
- Generate Captions:
```bash
python /path/to/AnyCapEval/gen/gen_xxx.py
```

This generates two files: `content.jsonl` and `style.jsonl` (a quick sanity-check sketch follows these steps).
- Configure Generated Files:
Edit `anycapeval_video.sh` and update the paths:

```bash
OUTPUT_PATH_CONTENT=/path/to/generated/content.jsonl
OUTPUT_PATH_STYLE=/path/to/generated/style.jsonl
```

- Run Evaluation:
Activate the proxy and run:

```bash
bash anycapeval_video.sh
```
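To sanity-check the generated files before or after running the evaluation, a small sketch like the one below can help (field names depend on the generation script, so it only prints whatever keys are present):

```python
# Quick sanity check on a generated .jsonl file: count records and show the keys of the first one.
# Field names are whatever the generation script emits; nothing here assumes a specific schema.
import json

path = "/path/to/generated/content.jsonl"  # or style.jsonl
with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"{len(records)} records")
print("keys of first record:", sorted(records[0].keys()))
```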
We illustrate usage with the video-modality benchmark, VidCapBench (image and audio modalities follow a similar approach).
- Generate Captions:

```bash
python /path/to/vidcapbench/gen/gen_xxx.py
```

- Run Evaluation:
Pass the generated `.jsonl` file via the `--caption_path` parameter:

```bash
python eval_xxx.py --caption_path /path/to/generated/captions.jsonl
```

High-quality, fully annotated datasets for all three modalities (image, audio, video) will be released soon on HuggingFace. Stay tuned!
We welcome contributions! Please open issues or submit PRs for feedback and improvements.
```bibtex
@article{ren2025anycap,
  title={AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning},
  author={Ren, Yiming and Lin, Zhiqiang and Li, Yu and Meng, Gao and Wang, Weiyun and Wang, Junjie and Lin, Zicheng and Dai, Jifeng and Yang, Yujiu and Wang, Wenhai and others},
  journal={arXiv preprint arXiv:2507.12841},
  year={2025}
}
```

This project is licensed under the MIT License – see the LICENSE file for details.