This project extends CLIP to enable text-conditioned vision feature alignment at multiple granularities.
- Python 3.10
- CUDA GPUs
- CUDA Toolkit 12.4+ (for PyTorch 2.6.0)
For detailed installation instructions, see INSTALL.md. For data setup, see DATA.md.
Ensure the following configuration flags point to the correct paths:
--metadata data/ShareGPT4V/annotations/share-captioner_coco_lcs_sam_1246k_1107_filtered.json \
--root data/ShareGPT4V/data \
--resume pretrained/ViT-B-16.pt \
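To catch path problems early, it can help to verify that the metadata file, image root, and pretrained checkpoint exist before launching (a quick sanity check; adjust if your data lives elsewhere):

```bash
# Quick sanity check that the configured paths exist.
ls -lh data/ShareGPT4V/annotations/share-captioner_coco_lcs_sam_1246k_1107_filtered.json
ls -d  data/ShareGPT4V/data
ls -lh pretrained/ViT-B-16.pt
```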
For training β-CLIP, we provide the following example scripts for local training (adjust the number of GPUs via `--nproc_per_node`):
- `train_CE.sh`: local training with CE loss
- `train_BCE.sh`: local training with BCE loss
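For example, a typical single-node run is simply (the GPU count is set inside the script via `--nproc_per_node`):

```bash
# Fine-tune locally with the cross-entropy (CE) variant of the loss.
bash train_CE.sh

# Or fine-tune with the binary cross-entropy (BCE) variant.
bash train_BCE.sh
```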
If using submitit on a shared SLURM cluster for multi-node training, use the following scripts (ensure the environment is activated before submitting):
- `submitit_text_1+5+30_global+bcal_CE_beta_0.5.sh`: multi-node training with CE loss
- `submitit_text_1+5+30_global+bcal_BCE_beta_0.5.sh`: multi-node training with BCE loss
Change `/path/to/ckpts` in `run_with_submitit_ft.py` to your full checkpoint path.
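For example (a sketch; the conda environment name below is an assumption, substitute whatever you created during installation):

```bash
# Activate the project environment before submitting (environment name is a placeholder).
conda activate beta-clip

# Submit the multi-node CE job to the SLURM cluster via submitit.
bash submitit_text_1+5+30_global+bcal_CE_beta_0.5.sh
```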
For evaluation only, change the `--resume` flag to point to a fine-tuned checkpoint and add the `--evaluate` flag, as in the example below.
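A minimal sketch of such an evaluation run, assuming the provided scripts ultimately invoke `main_clip_ft.py` through `torchrun` (the checkpoint filename below is a placeholder):

```bash
# Evaluation-only run: resume from a fine-tuned checkpoint (placeholder path) and skip training.
torchrun --nproc_per_node=1 main_clip_ft.py \
    --model CLIP_VITB16_OPENAI \
    --metadata data/ShareGPT4V/annotations/share-captioner_coco_lcs_sam_1246k_1107_filtered.json \
    --root data/ShareGPT4V/data \
    --resume /path/to/ckpts/finetuned_beta_clip.pt \
    --evaluate
```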
- β-CLIP checkpoint: CE, β=0.5, 30 phrases
- β-CLIP checkpoint: BCE, β=0.5, 30 phrases
These checkpoints can be downloaded using download_ckpts.sh.
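For example:

```bash
# Fetch the released checkpoints (see the script for the download destination).
bash download_ckpts.sh
```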
| Method | FG-OVD Hard | FG-OVD Medium | FG-OVD Easy | FG-OVD Trivial | SV-1k T2I | SV-1k I2T | U-1k T2I | U-1k I2T |
|---|---|---|---|---|---|---|---|---|
| K=36, β=0.5, CE | 30.9 | 55.4 | 60.4 | 80.3 | 93.7 | 94.0 | 89.0 | 88.6 |
| K=36, β=0.5, BCE | 20.1 | 38.5 | 34.2 | 71.3 | 94.4 | 94.1 | 91.8 | 92.3 |
- `--tcil-loss-mode k_positives_ce`: Specifies the β-CAL loss computation mode:
  - `k_positives_bce`: Binary cross-entropy loss with K positive samples per image
  - `k_positives_ce`: Cross-entropy loss with K positive samples
- `--beta 0.5`: Beta parameter for controlling contextualization in the loss.
- `--fg-loss-fn cls+tcil`: Specifies which loss functions to use. Options include:
  - `cls`: Standard CLIP contrastive loss between CLS and EOS embeddings
  - `tcil`: β Contextualized Contrastive Alignment Loss (β-CAL) for fine-grained alignment
- `--epochs 10`: Number of training epochs.
- `--batch-size 64`: Batch size per GPU. The effective batch size is `batch-size × update-freq × num_gpus × num_nodes` (e.g., 64 × 1 × 4 × 1 = 256 on a single 4-GPU node); see the combined example after this list.
- `--model CLIP_VITB16_OPENAI`: Model architecture specification. Currently supports CLIP ViT-B/16 and CLIP ViT-L/14 with OpenAI pretrained weights.
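Putting these flags together, a full local training command might look like the following sketch (assuming the provided shell scripts wrap `torchrun` and `main_clip_ft.py`; check `train_CE.sh` for the exact invocation before relying on this):

```bash
# Single-node fine-tuning with the CE-based β-CAL loss; flag values mirror the defaults documented above.
# Effective batch size here: 64 (per GPU) × 1 (update-freq) × 4 (GPUs) × 1 (node) = 256.
torchrun --nproc_per_node=4 main_clip_ft.py \
    --model CLIP_VITB16_OPENAI \
    --metadata data/ShareGPT4V/annotations/share-captioner_coco_lcs_sam_1246k_1107_filtered.json \
    --root data/ShareGPT4V/data \
    --resume pretrained/ViT-B-16.pt \
    --fg-loss-fn cls+tcil \
    --tcil-loss-mode k_positives_ce \
    --beta 0.5 \
    --epochs 10 \
    --batch-size 64
```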
β-CLIP/
├── main_clip_ft.py # Main training script
├── datasets.py # Dataset loading and preprocessing
├── models_tome.py # Model definitions
├── losses.py # Loss function implementations
@article{zohra2025betaclip,
title={β-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision Language Alignment in CLIP},
author={Zohra, Fatimah and Zhao, Chen and Itani, Hani and Ghanem, Bernard},
journal={arXiv preprint},
year={2025},
eprint={2512.12678},
url={https://arxiv.org/abs/2512.12678}
}