FINECAPTION is a novel vision-language model with improved capabilities in Attribute-Aware Regional
Captioning, Regional Dense Captioning, and Comprehensive Global Image Captioning. FINECAPTION can recognize arbitrary masks
as referential inputs and process high-resolution images.

In the table below, the Visual Prompt, Resolution, and # Image Tokens columns describe region referral; the remaining metric columns report semantic evaluation.

| Model | Visual Prompt | Resolution | # Image Tokens | ROUGE-L ↑ | BLEU-4 ↑ | METEOR ↑ | CIDEr ↑ | BERT Score ↑ |
|---|---|---|---|---|---|---|---|---|
| Zero-Shot Learning | | | | | | | | |
| Kosmos-2 | Bbox | 224 | 256 | 9.21 | 0.14 | 1.98 | 1.07 | 37.69 |
| Alpha-CLIP-13B | Mask | 336 | 576 | 13.89 | 0.51 | 5.94 | 2.68 | 42.01 |
| Qwen2-VL-7B | Bbox | AnyRes | - | 14.12 | 0.57 | 6.18 | 2.74 | 42.97 |
| Ferret-13B | MContour | 336 | 576 | 15.01 | 1.06 | 5.86 | 3.12 | 43.82 |
| ViP-LLaVA-13B | MContour | 336 | 576 | 15.47 | 1.48 | 5.76 | 3.84 | 44.29 |
| LLaMA-3.2-11B-Vision-Instruction | Bbox | - | - | 15.64 | 1.59 | 9.73 | 3.95 | 44.53 |
| LLaMA-3.2-90B-Vision-Instruction | Bbox | - | - | 16.21 | 1.75 | 11.70 | 4.53 | 48.29 |
| InternVL-2-40B | Bbox | 1792 | 4096 | 16.21 | 1.79 | 11.91 | 4.63 | 48.38 |
| GPT-4o | Bbox | - | - | 17.87 | 3.21 | 12.87 | 6.49 | 49.85 |
| Supervised Learning | | | | | | | | |
| Qwen2-VL-7B | Bbox | AnyRes | - | 31.59 | 9.11 | 13.56 | 90.32 | 75.86 |
| LLaVA-1.6-13B | Bbox | AnyRes | 576 | 31.72 | 9.35 | 13.64 | 90.71 | 75.89 |
| VILA1.5-8B | Bbox | 336 | 144 | 31.87 | 9.03 | 13.79 | 90.01 | 75.95 |
| ViP-LLaVA-13B | MContour | 336 | 576 | 32.42 | 9.97 | 14.82 | 91.44 | 76.77 |
| Alpha-CLIP-13B | Mask | 336 | 576 | 35.68 | 10.96 | 16.11 | 93.85 | 77.66 |
| LLaVA-HR-X | Bbox | 1024 | 1024 | 35.97 | 11.25 | 16.57 | 95.12 | 78.08 |
| LLaMA-3.2-11B-Vision | Bbox | - | - | 38.14 | 12.87 | 18.31 | 99.11 | 78.94 |
| FINECAPTION-8B (ours) | Mask | 1024 | 1024 | 41.05 | 14.46 | 22.01 | 127.95 | 80.97 |

Table 2. Comparison of the capabilities of FINECAPTION and other related VLMs, including both open-source and API-based models.

Please follow the guide below to prepare the environment on Linux.

- Clone this repository

  ```bash
  git clone https://github.com/hanghuacs/FineCaption.git
  cd FineCaption
  ```

- Create the environment and install the package

  ```bash
  . init.sh
  ```

```python
import base64
import gzip

import numpy as np


def decompress_mask(comp_string, height, width):
    # base64-decode, gunzip, then view the raw bytes as a (height, width) uint8 mask
    compressed_bytes = base64.b64decode(comp_string.encode('ascii'))
    decompressed_bytes = gzip.decompress(compressed_bytes)
    return np.frombuffer(decompressed_bytes, dtype=np.uint8).reshape((height, width))
```

```bibtex
@article{hua2024finecaption,
  title={FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity},
  author={Hua, Hang and Liu, Qing and Zhang, Lingzhi and Shi, Jing and Sooye, Kim and Zhang, Zhifei and Wang, Yilin and Zhang, Jianming and Lin, Zhe and Luo, Jiebo},
  journal={arXiv preprint arXiv:2411.15411},
  year={2024}
}
```
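
The `decompress_mask` helper above implies that each mask is stored as a gzip-compressed, base64-encoded buffer of `uint8` values in row-major order. The sketch below round-trips a toy mask under that assumption. `compress_mask` is a hypothetical inverse written for illustration and is not part of the repository; the actual dataset may have been encoded with different settings.

```python
import base64
import gzip

import numpy as np


def compress_mask(mask):
    """Hypothetical inverse of decompress_mask (an assumption, not repo code)."""
    # Force a contiguous uint8 layout so the byte order matches the reshape on decode.
    mask = np.ascontiguousarray(mask, dtype=np.uint8)
    return base64.b64encode(gzip.compress(mask.tobytes())).decode("ascii")


def decompress_mask(comp_string, height, width):
    # Same logic as the helper shown earlier: base64 -> gunzip -> reshape.
    compressed_bytes = base64.b64decode(comp_string.encode("ascii"))
    decompressed_bytes = gzip.decompress(compressed_bytes)
    return np.frombuffer(decompressed_bytes, dtype=np.uint8).reshape((height, width))


# Toy 4x6 binary mask with a 2x3 foreground rectangle.
mask = np.zeros((4, 6), dtype=np.uint8)
mask[1:3, 2:5] = 1

encoded = compress_mask(mask)
restored = decompress_mask(encoded, *mask.shape)

# np.frombuffer over a bytes object yields a read-only array;
# copy it before modifying the mask in place.
editable = restored.copy()
```

Note that the height and width are not stored in the encoded string, so they must be supplied from elsewhere (for example, from the image size) when decoding.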