This project extends CLIP to enable text-conditioned vision feature alignment at multiple granularities.
- Python 3.10
- CUDA GPUs
- CUDA Toolkit 12.4+ (for PyTorch 2.6.0)
For detailed installation instructions, see INSTALL.md. For data setup, see DATA.md.
Ensure the following configuration flags point to the correct paths:
--metadata data/ShareGPT4V/annotations/share-captioner_coco_lcs_sam_1246k_1107_filtered.json \
--root data/ShareGPT4V/data \
--resume pretrained/ViT-B-16.pt \
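To catch path problems early, it can help to verify that the metadata file, image root, and pretrained checkpoint exist before launching (a quick sanity check; adjust if your data lives elsewhere):

```bash
# Quick sanity check that the configured paths exist.
ls -lh data/ShareGPT4V/annotations/share-captioner_coco_lcs_sam_1246k_1107_filtered.json
ls -d  data/ShareGPT4V/data
ls -lh pretrained/ViT-B-16.pt
```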
For training β-CLIP, we provide the following example scripts for local training (adjust the number of GPUs via `--nproc_per_node`):
- `train_CE.sh`: local training with CE loss
- `train_BCE.sh`: local training with BCE loss
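For example, a typical single-node run is simply (the GPU count is set inside the script via `--nproc_per_node`):

```bash
# Fine-tune locally with the cross-entropy (CE) variant of the loss.
bash train_CE.sh

# Or fine-tune with the binary cross-entropy (BCE) variant.
bash train_BCE.sh
```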
If using submitit on a shared SLURM cluster for multi-node training, use the following scripts (ensure the environment is activated before submitting):
- `submitit_text_1+5+30_global+bcal_CE_beta_0.5.sh`: multi-node training with CE loss
- `submitit_text_1+5+30_global+bcal_BCE_beta_0.5.sh`: multi-node training with BCE loss
Change `/path/to/ckpts` in `run_with_submitit_ft.py` to your full checkpoint path.
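For example (a sketch; the conda environment name below is an assumption, substitute whatever you created during installation):

```bash
# Activate the project environment before submitting (environment name is a placeholder).
conda activate beta-clip

# Submit the multi-node CE job to the SLURM cluster via submitit.
bash submitit_text_1+5+30_global+bcal_CE_beta_0.5.sh
```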
For evaluation only, change the `--resume` flag to point to a fine-tuned checkpoint and add the `--evaluate` flag, as in the example below.
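A minimal sketch of such an evaluation run, assuming the provided scripts ultimately invoke `main_clip_ft.py` through `torchrun` (the checkpoint filename below is a placeholder):

```bash
# Evaluation-only run: resume from a fine-tuned checkpoint (placeholder path) and skip training.
torchrun --nproc_per_node=1 main_clip_ft.py \
    --model CLIP_VITB16_OPENAI \
    --metadata data/ShareGPT4V/annotations/share-captioner_coco_lcs_sam_1246k_1107_filtered.json \
    --root data/ShareGPT4V/data \
    --resume /path/to/ckpts/finetuned_beta_clip.pt \
    --evaluate
```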
- β-CLIP checkpoint: CE, β=0.5, 30 phrases
- β-CLIP checkpoint: BCE, β=0.5, 30 phrases
These checkpoints can be downloaded using download_ckpts.sh.
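For example:

```bash
# Fetch the released checkpoints (see the script for the download destination).
bash download_ckpts.sh
```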
| Method | FG-OVD Hard | FG-OVD Medium | FG-OVD Easy | FG-OVD Trivial | SV-1k T2I | SV-1k I2T | U-1k T2I | U-1k I2T |
|---|---|---|---|---|---|---|---|---|
| K=36, β=0.5, CE | 30.9 | 55.4 | 60.4 | 80.3 | 93.7 | 94.0 | 89.0 | 88.6 |
| K=36, β=0.5, BCE | 20.1 | 38.5 | 34.2 | 71.3 | 94.4 | 94.1 | 91.8 | 92.3 |
- `--tcil-loss-mode k_positives_ce`: Specifies the β-CAL loss computation mode:
  - `k_positives_bce`: Binary cross-entropy loss with K positive samples per image
  - `k_positives_ce`: Cross-entropy loss with K positive samples
- `--beta 0.5`: Beta parameter for controlling contextualization in the loss.
- `--fg-loss-fn cls+tcil`: Specifies which loss functions to use. Options include:
  - `cls`: Standard CLIP contrastive loss between CLS and EOS embeddings
  - `tcil`: β Contextualized Contrastive Alignment Loss (β-CAL) for fine-grained alignment
- `--epochs 10`: Number of training epochs.
- `--batch-size 64`: Batch size per GPU. The effective batch size is `batch-size × update-freq × num_gpus × num_nodes` (e.g., 64 × 1 × 4 × 1 = 256 on a single 4-GPU node); see the combined example after this list.
- `--model CLIP_VITB16_OPENAI`: Model architecture specification. Currently supports CLIP ViT-B/16 and CLIP ViT-L/14 with OpenAI pretrained weights.
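Putting these flags together, a full local training command might look like the following sketch (assuming the provided shell scripts wrap `torchrun` and `main_clip_ft.py`; check `train_CE.sh` for the exact invocation before relying on this):

```bash
# Single-node fine-tuning with the CE-based β-CAL loss; flag values mirror the defaults documented above.
# Effective batch size here: 64 (per GPU) × 1 (update-freq) × 4 (GPUs) × 1 (node) = 256.
torchrun --nproc_per_node=4 main_clip_ft.py \
    --model CLIP_VITB16_OPENAI \
    --metadata data/ShareGPT4V/annotations/share-captioner_coco_lcs_sam_1246k_1107_filtered.json \
    --root data/ShareGPT4V/data \
    --resume pretrained/ViT-B-16.pt \
    --fg-loss-fn cls+tcil \
    --tcil-loss-mode k_positives_ce \
    --beta 0.5 \
    --epochs 10 \
    --batch-size 64
```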
β-CLIP/
├── main_clip_ft.py # Main training script
├── datasets.py # Dataset loading and preprocessing
├── models_tome.py # Model definitions
├── losses.py # Loss function implementations
@article{zohra2025betaclip,
title={β-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision Language Alignment in CLIP},
author={Zohra, Fatimah and Zhao, Chen and Itani, Hani and Ghanem, Bernard},
journal={arXiv preprint},
year={2025},
eprint={2512.12678},
url={https://arxiv.org/abs/2512.12678}
}