hanghuacs/FineCaption

FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity

CVPR 2025 (Data Released!)

FINECAPTION is a novel vision-language model with improved capabilities in Attribute-Aware Regional Captioning, Regional Dense Captioning, and Comprehensive Global Image Captioning. It accepts arbitrary masks as referential inputs and processes high-resolution images.


Comparison of Models

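Region referral columns: Visual Prompt, Resolution, # Image Token. Semantic evaluation columns: ROUGE-L, BLEU-4, METEOR, CIDEr, BERT Score (higher is better for all five).

Zero-Shot Learning

| Model | Visual Prompt | Resolution | # Image Token | ROUGE-L ↑ | BLEU-4 ↑ | METEOR ↑ | CIDEr ↑ | BERT Score ↑ |
|---|---|---|---|---|---|---|---|---|
| Kosmos-2 | Bbox | 224 | 256 | 9.21 | 0.14 | 1.98 | 1.07 | 37.69 |
| Alpha-CLIP-13B | Mask | 336 | 576 | 13.89 | 0.51 | 5.94 | 2.68 | 42.01 |
| Qwen2-VL-7B | Bbox | AnyRes | - | 14.12 | 0.57 | 6.18 | 2.74 | 42.97 |
| Ferret-13B | MContour | 336 | 576 | 15.01 | 1.06 | 5.86 | 3.12 | 43.82 |
| ViP-LLaVA-13B | MContour | 336 | 576 | 15.47 | 1.48 | 5.76 | 3.84 | 44.29 |
| LLaMA-3.2-11B-Vision-Instruction | Bbox | - | - | 15.64 | 1.59 | 9.73 | 3.95 | 44.53 |
| LLaMA-3.2-90B-Vision-Instruction | Bbox | - | - | 16.21 | 1.75 | 11.70 | 4.53 | 48.29 |
| InternVL-2-40B | Bbox | 1792 | 4096 | 16.21 | 1.79 | 11.91 | 4.63 | 48.38 |
| GPT-4o | Bbox | - | - | 17.87 | 3.21 | 12.87 | 6.49 | 49.85 |

Supervised Learning

| Model | Visual Prompt | Resolution | # Image Token | ROUGE-L ↑ | BLEU-4 ↑ | METEOR ↑ | CIDEr ↑ | BERT Score ↑ |
|---|---|---|---|---|---|---|---|---|
| Qwen2-VL-7B | Bbox | AnyRes | - | 31.59 | 9.11 | 13.56 | 90.32 | 75.86 |
| LLaVA-1.6-13B | Bbox | AnyRes | 576 | 31.72 | 9.35 | 13.64 | 90.71 | 75.89 |
| VILA1.5-8B | Bbox | 336 | 144 | 31.87 | 9.03 | 13.79 | 90.01 | 75.95 |
| ViP-LLaVA-13B | MContour | 336 | 576 | 32.42 | 9.97 | 14.82 | 91.44 | 76.77 |
| Alpha-CLIP-13B | Mask | 336 | 576 | 35.68 | 10.96 | 16.11 | 93.85 | 77.66 |
| LLaVA-HR-X | Bbox | 1024 | 1024 | 35.97 | 11.25 | 16.57 | 95.12 | 78.08 |
| LLaMA-3.2-11B-Vision | Bbox | - | - | 38.14 | 12.87 | 18.31 | 99.11 | 78.94 |
| FINECAPTION-8B (ours) | Mask | 1024 | 1024 | 41.05 | 14.46 | 22.01 | 127.95 | 80.97 |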

Table 2. Comparison of the capabilities of FINECAPTION and other related VLMs, including both open-source and API-based models.


Install

Please follow the guide here to prepare the environment on Linux.

  1. Clone this repository
git clone https://github.com/hanghuacs/FineCaption.git
cd FineCaption
  2. Create environment and install package
. init.sh

Mask Decoding

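import base64
import gzip
import numpy as np

def decompress_mask(comp_string, height, width):
    compressed_bytes = base64.b64decode(comp_string.encode('ascii'))  # base64 text -> gzip bytes
    decompressed_bytes = gzip.decompress(compressed_bytes)  # gzip bytes -> raw uint8 pixels
    return np.frombuffer(decompressed_bytes, dtype=np.uint8).reshape((height, width))  # read-only (height, width) array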
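
To check the decoder without the released data, a round trip on a synthetic mask can help. The sketch below pairs `decompress_mask` with a hypothetical `compress_mask` helper that simply inverts the same steps (raw `uint8` bytes, then gzip, then base64). `compress_mask` is an assumption made for illustration and is not part of this repository; the encoding actually used to produce the released masks may differ in details such as compression level.

```python
import base64
import gzip

import numpy as np


def compress_mask(mask):
    """Hypothetical inverse of decompress_mask (not from the repository)."""
    # Flatten the mask to raw uint8 bytes in row-major order.
    raw = np.ascontiguousarray(mask, dtype=np.uint8).tobytes()
    # gzip the bytes, then base64-encode them into an ASCII string.
    return base64.b64encode(gzip.compress(raw)).decode("ascii")


def decompress_mask(comp_string, height, width):
    compressed_bytes = base64.b64decode(comp_string.encode("ascii"))
    decompressed_bytes = gzip.decompress(compressed_bytes)
    return np.frombuffer(decompressed_bytes, dtype=np.uint8).reshape((height, width))


# Synthetic 4x6 mask with a 2x3 foreground rectangle.
mask = np.zeros((4, 6), dtype=np.uint8)
mask[1:3, 2:5] = 1

encoded = compress_mask(mask)
decoded = decompress_mask(encoded, 4, 6)

# The round trip should reproduce the original mask exactly.
assert np.array_equal(decoded, mask)
```

Note that `np.frombuffer` returns a read-only view over the decompressed bytes, so call `decoded.copy()` before modifying the mask in place. The `height` and `width` passed to `decompress_mask` must match the original mask, since the compressed string does not store the shape.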

✏️ Citation

@article{hua2024finecaption,
  title={FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity},
  author={Hua, Hang and Liu, Qing and Zhang, Lingzhi and Shi, Jing and Sooye, Kim and Zhang, Zhifei and Wang, Yilin and Zhang, Jianming and Lin, Zhe and Luo, Jiebo},
  journal={arXiv preprint arXiv:2411.15411},
  year={2024}
}
