FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity

CVPR 2025 (Data Released!)

FINECAPTION is a vision-language model with improved capabilities in attribute-aware regional captioning, regional dense captioning, and comprehensive global image captioning. It accepts arbitrary masks as referential inputs and processes high-resolution images.

Comparison of Models

Model	Region Referral			Semantic Evaluation
Model	Visual Prompt	Resolution	# Image Token	ROUGE-L ↑	BLEU-4 ↑	METEOR ↑	CIDEr ↑	BERT Score ↑
Zero-Shot Learning
Kosmos-2	Bbox	224	256	9.21	0.14	1.98	1.07	37.69
Alpha-CLIP-13B	Mask	336	576	13.89	0.51	5.94	2.68	42.01
Qwen2-VL-7B	Bbox	AnyRes	-	14.12	0.57	6.18	2.74	42.97
Ferret-13B	MContour	336	576	15.01	1.06	5.86	3.12	43.82
ViP-LLaVA-13B	MContour	336	576	15.47	1.48	5.76	3.84	44.29
LLaMA-3.2-11B-Vision-Instruction	Bbox	-	-	15.64	1.59	9.73	3.95	44.53
LLaMA-3.2-90B-Vision-Instruction	Bbox	-	-	16.21	1.75	11.70	4.53	48.29
InternVL-2-40B	Bbox	1792	4096	16.21	1.79	11.91	4.63	48.38
GPT-4o	Bbox	-	-	17.87	3.21	12.87	6.49	49.85
Supervised Learning
Qwen2-VL-7B	Bbox	AnyRes	-	31.59	9.11	13.56	90.32	75.86
LLaVA-1.6-13B	Bbox	AnyRes	576	31.72	9.35	13.64	90.71	75.89
VILA1.5-8B	Bbox	336	144	31.87	9.03	13.79	90.01	75.95
ViP-LLaVA-13B	MContour	336	576	32.42	9.97	14.82	91.44	76.77
Alpha-CLIP-13B	Mask	336	576	35.68	10.96	16.11	93.85	77.66
LLaVA-HR-X	Bbox	1024	1024	35.97	11.25	16.57	95.12	78.08
LLaMA-3.2-11B-Vision	Bbox	-	-	38.14	12.87	18.31	99.11	78.94
FINECAPTION-8B (ours)	Mask	1024	1024	41.05	14.46	22.01	127.95	80.97

Table 2. Comparison of the capabilities of FINECAPTION and other related VLMs including both open-sourced models and API-based models.

Install

Please follow the guide here to prepare the environment on Linux.

Clone this repository

git clone https://github.com/hanghuacs/FineCaption.git
cd FineCaption

Create environment and install package

. init.sh

Mask Decoding

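import base64
import gzip
import numpy as np

def decompress_mask(comp_string, height, width):
    compressed_bytes = base64.b64decode(comp_string.encode('ascii'))  # base64 text -> gzip bytes
    decompressed_bytes = gzip.decompress(compressed_bytes)  # gzip bytes -> raw mask bytes
    return np.frombuffer(decompressed_bytes, dtype=np.uint8).reshape((height, width))  # one uint8 per pixel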
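
The decoder above expects a base64 string wrapping gzip-compressed raw `uint8` bytes, one byte per pixel. The following is a minimal round-trip sketch under that assumption. `compress_mask` is a hypothetical helper written here as the inverse of `decompress_mask`; it is not taken from the repository, and the released data may have been produced differently.

```python
import base64
import gzip

import numpy as np


def compress_mask(mask):
    """Hypothetical inverse of decompress_mask: uint8 mask -> base64 string."""
    # Assumption: the mask is stored as raw row-major uint8 bytes.
    raw = np.ascontiguousarray(mask, dtype=np.uint8).tobytes()
    return base64.b64encode(gzip.compress(raw)).decode("ascii")


def decompress_mask(comp_string, height, width):
    """Same logic as the README's decoder, repeated so this block runs alone."""
    compressed_bytes = base64.b64decode(comp_string.encode("ascii"))
    decompressed_bytes = gzip.decompress(compressed_bytes)
    return np.frombuffer(decompressed_bytes, dtype=np.uint8).reshape((height, width))


# Toy 4x6 mask with a 2x3 foreground rectangle.
mask = np.zeros((4, 6), dtype=np.uint8)
mask[1:3, 2:5] = 1

encoded = compress_mask(mask)
decoded = decompress_mask(encoded, 4, 6)
round_trip_ok = bool(np.array_equal(mask, decoded))
```

Note that `np.frombuffer` returns a read-only array, so call `.copy()` on the result if the mask needs to be modified in place. The `height` and `width` arguments must match the original image; a mismatch raises a `ValueError` in `reshape`.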

✏️ Citation

@article{hua2024finecaption,
  title={FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity},
  author={Hua, Hang and Liu, Qing and Zhang, Lingzhi and Shi, Jing and Sooye, Kim and Zhang, Zhifei and Wang, Yilin and Zhang, Jianming and Lin, Zhe and Luo, Jiebo},
  journal={arXiv preprint arXiv:2411.15411},
  year={2024}
}
