
AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning

🤗 Model Weights  |  📊 AnyCapEval Benchmark  |  📝 Paper  |  📚 Code


🚩 Highlights

  • 🏆 Unified Multi-modal Captioning: One framework covers image, audio, and video captioning with controllable styles.
  • 📝 Customizable Caption Styles: Control caption styles through predefined instructions, with ACM plugging into existing captioning models.
  • 📊 Open Benchmark & Evaluation: AnyCapEval—a multi-modal benchmark that scores captions separately for content and style, with comprehensive evaluation protocols.
  • 🛠️ End-to-End Open Source: Full training pipeline, evaluation toolkits, dataset pipeline and open benchmark.

📑 Todo List

  • Paper released
  • AnyCapEval benchmark available
  • Pretrained model weights released
  • Training dataset (AnyCapDataset) to be open-sourced soon

🚀 Quick Start

Installation

git clone https://github.com/qishisuren123/AnyCap.git
cd AnyCap
pip install -r requirements.txt

Install Fairseq manually:

git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
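
As a quick sanity check (not part of the official setup), you can verify that fairseq imports after the editable install:

import fairseq  # should import cleanly after `pip install --editable ./`
print(fairseq.__version__)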

Download Weights

Download the pretrained checkpoints (linked under 🤗 Model Weights above) and place them under model_weights/ so the repository is organized as follows:

AnyCap/
├── AnyCapDataset/   
├── assets/                    
├── model_weights/                       
│   ├── ACM
│   ├── InternVL
│   └── ...
├── eval
├── ...    
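
If the checkpoints are hosted on Hugging Face (see the 🤗 Model Weights link above), a minimal sketch for fetching the ACM weights with huggingface_hub could look like the following; the repo ID below is a placeholder, not the actual release name:

from huggingface_hub import snapshot_download

# Sketch: fetch ACM checkpoints into model_weights/.
snapshot_download(
    repo_id="qishisuren123/AnyCap-ACM",  # hypothetical repo ID; use the one linked under Model Weights
    local_dir="model_weights/ACM",
)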

Download Benchmark Data

Download the AnyCapEval evaluation data from the 📊 AnyCapEval Benchmark link above.

💡 Introduction

AnyCap is a unified, controllable omni-modal captioning framework that generates captions for images, audio, and video with fine-grained style control. The framework is fully open source, providing training code, benchmark datasets, and a comprehensive evaluation toolkit in one place.

Figure 1 above shows an overview of the AnyCap architecture and data pipeline.


📊 Benchmark & Evaluation

AnyCapEval Benchmark

Figure 2 – Evaluation methodology of AnyCapEval.
(a) Examples demonstrating content scoring with Key-point Density (KPD) and style scoring rules.
(b) KPD correlation analysis, showing that length-based KPD metrics achieve the highest Pearson/Spearman/Kendall correlations with human judgments.
(c) Radar chart illustrating the large performance gains delivered by ACM integration across ten dimensions (IApt–Thm).
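
To make the content metric concrete, here is a toy sketch of the key-point density idea, assuming KPD counts how many reference key points a caption covers, normalized by caption length. This is an illustrative reading of the metric name using plain substring matching, not the official AnyCapEval scorer:

def key_point_density(caption: str, key_points: list[str]) -> float:
    """Toy interpretation of KPD: matched key points per caption word."""
    text = caption.lower()
    matched = sum(1 for kp in key_points if kp.lower() in text)
    return matched / max(len(text.split()), 1)

print(key_point_density(
    "A brown dog chases a red ball in a park.",
    ["brown dog", "red ball", "park"],
))  # 3 matched key points / 10 words = 0.3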

Metric        GPT-4o    GPT-4o + ACM    InternVL2.5-8B    InternVL2.5-8B + ACM
Average ↑     2.79      4.15            2.75              3.98

Key takeaway • ACM boosts GPT-4o’s content scores by 45% and style scores by 12%, and yields similar gains on strong open-source models, highlighting the reliability and coverage of AnyCapEval.

Here we illustrate usage for the video modality (the audio and image modalities follow a similar structure).

  1. Generate Captions:

python /path/to/AnyCapEval/gen/gen_xxx.py

This generates two files:

  • content.jsonl
  • style.jsonl

  2. Configure Generated Files:

Edit anycapeval_video.sh and update the output paths:

OUTPUT_PATH_CONTENT=/path/to/generated/content.jsonl
OUTPUT_PATH_STYLE=/path/to/generated/style.jsonl

  3. Run Evaluation:

Activate your proxy if required, then run:

bash anycapeval_video.sh

Related Caption Benchmarks (e.g., VidCapBench)

We illustrate usage with the video-modality benchmark VidCapBench (the image and audio modalities follow a similar approach).

  1. Generate Captions:

python /path/to/vidcapbench/gen/gen_xxx.py

  2. Run Evaluation:

Pass the generated .jsonl file via the --caption_path argument:

python eval_xxx.py --caption_path /path/to/generated/captions.jsonl

📂 Dataset

AnyCapDataset (Coming Soon)

High-quality, fully annotated datasets for all three modalities (image, audio, video) will be released soon on HuggingFace. Stay tuned!


🤝 Contributing

We welcome contributions! Please open issues or submit PRs for feedback and improvements.


📝 Citation

@article{ren2025anycap,
  title={AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning},
  author={Ren, Yiming and Lin, Zhiqiang and Li, Yu and Meng, Gao and Wang, Weiyun and Wang, Junjie and Lin, Zicheng and Dai, Jifeng and Yang, Yujiu and Wang, Wenhai and others},
  journal={arXiv preprint arXiv:2507.12841},
  year={2025}
}

License

This project is licensed under the MIT License – see the LICENSE file for details.
