CatVTON_ViWear

Setting up Conda Env

To set up a conda environment compatible with the SegFormer-SCHP, CatVTON and RTMPose architectures, follow the steps below.

  1. Create the environment using the environment.yml file.
#Create and activate Env
conda env create -f environment.yml
conda activate schp_catvton
  2. Install CUDA libraries
conda install nvidia/label/cuda-12.1.0::cuda-nvcc -y
conda install -c nvidia cuda-toolkit=12.1 -y
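
To confirm that the conda-provided compiler is the one on your PATH (it should report CUDA 12.1 given the packages above):

#Verify the conda-provided CUDA compiler
which nvcc
nvcc --version
#Expected: nvcc inside the schp_catvton env, reporting release 12.1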
  3. Set CUDA paths

To properly access the CUDA libraries inside the conda environment, we need to set environment variables that reference the required paths. For this, we modify the environment's activation and deactivation scripts, which set and unset the CUDA paths respectively.

In case certain headers are missing (like libcudacxx), please install them or clone them to a local directory.

#Clone headers
git clone https://github.com/NVIDIA/libcudacxx.git ~/.local/libcudacxx
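
The include paths set below also reference $HOME/.local/thrust. If the Thrust headers are missing as well, they can be cloned the same way (an assumption based on the paths used below, not an original instruction):

#Clone Thrust headers (only if missing)
git clone https://github.com/NVIDIA/thrust.git ~/.local/thrust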

To initialize path variables at the time of environment activation, do the following.

mkdir -p $CONDA_PREFIX/etc/conda/activate.d
vim $CONDA_PREFIX/etc/conda/activate.d/cuda_includes.sh

# -------- Put everything below inside the script -------------

#!/bin/bash
# Set CUDA include paths dynamically on env activation

# Remember old values (so we can restore them on deactivate)
export _OLD_CPLUS_INCLUDE_PATH="$CPLUS_INCLUDE_PATH"
export _OLD_C_INCLUDE_PATH="$C_INCLUDE_PATH"

# Add conda CUDA headers and any vendored includes
export CPLUS_INCLUDE_PATH="$CONDA_PREFIX/targets/x86_64-linux/include:$CONDA_PREFIX/include:$HOME/.local/libcudacxx/include:$HOME/.local/thrust:${CPLUS_INCLUDE_PATH}"
export C_INCLUDE_PATH="$CONDA_PREFIX/targets/x86_64-linux/include:$CONDA_PREFIX/include:$HOME/.local/libcudacxx/include:$HOME/.local/thrust:${C_INCLUDE_PATH}"

# Optional: define CUDAHOME & architecture
export CUDAHOME="$CONDA_PREFIX"
export CUDACXX="$CONDA_PREFIX/bin/nvcc"
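
# Optional (an addition, not part of the original script): the deactivate script
# below also unsets TORCH_CUDA_ARCH_LIST, so if you pin the GPU architecture for
# CUDA extension builds, export it here as well; the value depends on your GPU,
# e.g. 8.6 for an RTX 30-series card.
# export TORCH_CUDA_ARCH_LIST="8.6"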

To unset the path variables when deactivating the environment, please do the following.

mkdir -p $CONDA_PREFIX/etc/conda/deactivate.d
vim $CONDA_PREFIX/etc/conda/deactivate.d/cuda_includes.sh

# -------- Put everything below inside the script -------------

#!/bin/bash
# Restore previous include paths
export CPLUS_INCLUDE_PATH="$_OLD_CPLUS_INCLUDE_PATH"
export C_INCLUDE_PATH="$_OLD_C_INCLUDE_PATH"
unset _OLD_CPLUS_INCLUDE_PATH
unset _OLD_C_INCLUDE_PATH
unset CUDAHOME
unset CUDACXX
unset TORCH_CUDA_ARCH_LIST

To test if these scripts work, do the following:

conda deactivate
conda activate schp_catvton

echo $CPLUS_INCLUDE_PATH | tr ':' '\n' | head -n 5

#Expected Result : 
# /mnt/anaconda3/envs/schp_catvton/targets/x86_64-linux/include
# /mnt/anaconda3/envs/schp_catvton/include
# /home/ubuntu/.local/libcudacxx/include
# /home/ubuntu/.local/thrust
  4. Install Open-MMLab libraries (Required for CatVTON, RTMPose)
#MMLab packages (the mim CLI comes from the openmim package;
#install it first if it is not already provided by environment.yml)
pip install -U openmim
mim install "mmengine>=0.10.0,<1.0.0"
mim install "mmcv==2.1.0"
mim install "mmdet==3.2.0"
mim install "mmpose==1.3.2"

#Check that the SegFormer-SCHP modules import correctly
python -c "from model.SegFormer_SCHP.modules import InPlaceABNSync; print('OK')"

Retrieving Checkpoints

pip install gdown

#Download to checkpoints dir
gdown --folder https://drive.google.com/drive/folders/17nyD5fUD8mjqVMOhSZ_kOF1_uBshIqrl?usp=drive_link -O ./checkpoints
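
After the download, the SCHP checkpoints referenced by the commands later in this README should be present:

ls checkpoints/SegFormer_SCHP/
#Expected to include (per the commands below):
# mit_b4_schp_6_sep_checkpoint.pth.tar
# mit_b4_schp_6_torso_5k_checkpoint.pth.tar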

Setting up Dataset

First, download the original VITON-HD dataset into the directory VITON-HD.

The download link can be obtained from the VITON-HD repo : https://github.com/shadow2496/VITON-HD

pip install gdown
mkdir VITON-HD
cd VITON-HD

gdown 1tLx8LRp-sxDp0EcYmYoV_vXdSc-jJ79w
unzip zalando-hd-resized.zip
rm zalando-hd-resized.zip

mv test_pairs.txt test_pairs_unpaired.txt
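
A quick sanity check on the resulting layout (abridged; the dataset contains more annotation folders than listed here):

ls VITON-HD/
#Expected to include: train/  test/  test_pairs_unpaired.txt
ls VITON-HD/test/
#Expected to include (among others): image/  cloth/  cloth-mask/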

Now, to create the required cloth and pose masks for the dataset, run the preprocessing script below.

CUDA_VISIBLE_DEVICES=0 python preprocess_agnostic_mask_ViWear.py \
--data_root_path VITON-HD/train \
--lip_chkpt checkpoints/SegFormer_SCHP/mit_b4_schp_6_sep_checkpoint.pth.tar \
--masker_type only-schp-pose

Finetuning

To finetune CatVTON, we have two available scripts:

  • train.py : The standard training script, as per the experiment details in the source paper.

  • train_pose.py : Two main modifications are made to the standard script:

    • Pose Loss Term : The standard loss term (the MSE between noise vectors) is augmented with a pose loss term. To implement the pose loss, we use RTMPose-m to predict the pose logits of the source and predicted images (see the sketch after this list).

    • Fusion of Masks (Optional) : To aid the model, we fuse the cloth and pose masks of the source person, and use the fused mask as part of the condition input to the model. To use this functionality, please use the argument --fuse_with_pose.
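
As a rough sketch of the combined objective (the pose-loss weight lambda_pose and the use of an MSE between pose logits are illustrative assumptions, not taken from the script):

loss = MSE(noise, noise_pred) + lambda_pose * MSE(RTMPose(source_image), RTMPose(predicted_image))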

In both of these scripts, the most relevant arguments for finetuning are:

  • --train_unet_groups: Finetune only selected modules or parameters under a common group. For example, assigning ".attn1.,down_blocks|.attn2." would define two groups (separated by "|"):

    • Group 1 : All parameters containing both ".attn1." and "down_blocks" in their names.
    • Group 2 : All parameters containing ".attn2." in their names.
  • --lr_unet_groups : Assign individual learning rates to each group mentioned in the --train_unet_groups argument.

Together, these allow flexible finetuning of selected parameter groups, optionally with group-specific learning rates.
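
For example, to finetune only the two groups above with different learning rates, the invocation might look like the following (a sketch: the value format of --lr_unet_groups, here one learning rate per group separated by "|", and the output directory are assumptions):

python -m train_pose \
  --dataset_root VITON-HD/ \
  --train_unet_groups ".attn1.,down_blocks|.attn2." \
  --lr_unet_groups "1e-5|5e-5" \
  --batch_size 128 \
  --output_dir checkpoints/finetune_groups_example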

  python -m train_pose \
    --dataset_root VITON-HD/ \
    --mask_dirname agnostic-mask \
    --pose_dirname pose-mask-viwear-lip-pose \
    --batch_size 128 \
    --allow_tf32 \
    --output_dir checkpoints/finetune_with_pose \
    --fuse_with_pose

The loss values and the checkpoint will be saved to the directory given by the --output_dir argument.

Evaluation

Evaluating the model consists of three main stages:

  • Preprocessing Masks : Generate cloth masks given an image of a person
  • Inference : Using the data (person, target cloth, cloth mask), generate try-on images, which are saved locally.
  • Evaluation : The generated images (results) are evaluated on four different metrics (FID, KID, SSIM, LPIPS)

There are a few model variants available for producing masks and aiding try-on generation. We developed two main variants (all-schp, only-schp) which produce better cloth masks, resulting in enhanced performance. Note that the second variant (only-schp) produces better results than the standard case (stand) at a lower inference time and memory cost. Please refer to the table below for a quantitative comparison (evaluation done on ~730 test samples).

The mask generation process expands on the variants, and even possible extensions. Please refer to the next section for more details.

Ablation Results

Note that the inference time and memory correspond to mask generation only, excluding CatVTON's image generation. Since the latter takes ~209 ms per image, the only-schp variant produces try-on images in ~369 ms overall (~3 fps), a noticeable speedup over the stand variant at ~933 ms (~1 fps). This further motivates the choice of only-schp.

Preprocessing Masks

NOTE: To train or finetune models using SCHP to produce segmentation masks, please refer to the Self-Correction-Human-Parsing directory. Below, we use trained and finetuned models to produce masks.

To generate the try-on image, the CatVTON model requires an image of the person, the target cloth, and a cloth mask that covers the source garment on the person. In short, the input to the model consists of the masked person and the target cloth, and the model inpaints the target cloth into the masked region.

Therefore, each data sample needs an assigned cloth mask, which can be generated using the preprocess_agnostic_mask_ViWear.py script. There are five different ways to generate the masks:

  • Standard (stand) : Standard CatVTON pipeline to generate masks. This consists of using three models (SCHP_LIP, SCHP_ATR, DensePose) to refine the masking region.

  • Replace SCHP (all-schp) : We replace only the SCHP_LIP model (from the standard setting) with SCHP_SegFormer (a better-performing model). This provides better results than stand with a lower inference time and memory cost. The weights for this model can be found at checkpoints/SegFormer_SCHP/mit_b4_schp_6_torso_5k_checkpoint.pth.tar.

  • Only SCHP (only-schp) : Instead of the three models of the standard case, we keep only SCHP_SegFormer. This leads to a lower inference time while producing results closely aligned with all-schp. The weights for this model can be found at checkpoints/SegFormer_SCHP/mit_b4_schp_6_sep_checkpoint.pth.tar.

  • Replace SCHP + Pose (all-schp-pose) : Generates cloth masks as in all-schp, but also produces pose predictions to aid the model (for mask fusion).

  • Only SCHP + Pose (only-schp-pose) : Generates cloth masks as in only-schp, but also produces pose predictions to aid the model (for mask fusion).

To select a variant, pass the variant name (shown in brackets above) to the --masker_type CLI argument, and pass the path to the model weights to the --lip_chkpt CLI argument.

CUDA_VISIBLE_DEVICES=0 python preprocess_agnostic_mask_ViWear.py \
--data_root_path VITON-HD/test \
--lip_chkpt checkpoints/SegFormer_SCHP/mit_b4_schp_6_sep_checkpoint.pth.tar \
--masker_type only-schp-pose

Inference

Once the data is prepared (a mask for each data sample), we can run the CatVTON model to generate try-on results for each test sample. The resulting images are saved locally.

If you require the model to process pose masks (fusion of cloth and pose masks) in the pipeline, make sure to pass one of the pose variants (only-schp-pose, all-schp-pose) to the --masker_type CLI argument.

CUDA_VISIBLE_DEVICES=0 python inference_ViWear.py \
--masker_type only-schp-pose \
--local_ckpt_dir checkpoints/finetune_with_pose \
--dataset vitonhd \
--data_root_path VITON-HD/ \
--output_dir results/viwear/lip_pose_mask  \
--dataloader_num_workers 8 \
--batch_size 8 \
--seed 555 \
--mixed_precision fp16 \
--allow_tf32 \
--repaint \
--eval_pair \
--save_fused_masks
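
When inference completes, the generated images are saved under the chosen output directory; the paired results consumed by the evaluation command below are expected at:

ls results/viwear/lip_pose_mask/vitonhd-512/paired | head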

Evaluation

The saved local results are evaluated under four different metrics: FID, KID, SSIM, LPIPS.

CUDA_VISIBLE_DEVICES=0 python eval.py \
--gt_folder VITON-HD/test/image/ \
--pred_folder results/viwear/lip_pose_mask/vitonhd-512/paired \
--paired \
--batch_size=16 \
--num_workers=16

About

[ICLR 2025] CatVTON is a simple and efficient virtual try-on diffusion model with 1) a lightweight network (899.06M parameters in total), 2) parameter-efficient training (49.57M trainable parameters), and 3) simplified inference (< 8 GB VRAM for 1024×768 resolution).
