CatVTON_ViWear

Setting up Conda Env

To set up a conda environment compatible with the SegFormer-SCHP, CatVTON and RTMPose architectures, follow the steps below.

  1. Create the environment using the environment.yml file.
#Create and activate Env
conda env create -f environment.yml
conda activate schp_catvton
  2. Install CUDA libraries
conda install nvidia/label/cuda-12.1.0::cuda-nvcc -y
conda install -c nvidia cuda-toolkit=12.1 -y
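
To confirm that the conda-provided compiler is the one on your PATH (it should report CUDA 12.1 given the packages above):

#Verify the conda-provided CUDA compiler
which nvcc
nvcc --version
#Expected: nvcc inside the schp_catvton env, reporting release 12.1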
  3. Set CUDA paths

To properly access the CUDA libraries inside the conda environment, we need to set environment variables that reference the required paths. For this, we modify the environment's activation and deactivation scripts, which set and unset the CUDA paths respectively.

In case certain headers are missing (like libcudacxx), please install them or clone them to a local directory.

#Clone headers
git clone https://github.com/NVIDIA/libcudacxx.git ~/.local/libcudacxx
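
The include paths set below also reference $HOME/.local/thrust. If the Thrust headers are missing as well, they can be cloned the same way (an assumption based on the paths used below, not an original instruction):

#Clone Thrust headers (only if missing)
git clone https://github.com/NVIDIA/thrust.git ~/.local/thrust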

To initialize path variables at the time of environment activation, do the following.

mkdir -p $CONDA_PREFIX/etc/conda/activate.d
vim $CONDA_PREFIX/etc/conda/activate.d/cuda_includes.sh

# -------- Put everything below inside the script -------------

#!/bin/bash
# Set CUDA include paths dynamically on env activation

# Remember old values (so we can restore them on deactivate)
export _OLD_CPLUS_INCLUDE_PATH="$CPLUS_INCLUDE_PATH"
export _OLD_C_INCLUDE_PATH="$C_INCLUDE_PATH"

# Add conda CUDA headers and any vendored includes
export CPLUS_INCLUDE_PATH="$CONDA_PREFIX/targets/x86_64-linux/include:$CONDA_PREFIX/include:$HOME/.local/libcudacxx/include:$HOME/.local/thrust:${CPLUS_INCLUDE_PATH}"
export C_INCLUDE_PATH="$CONDA_PREFIX/targets/x86_64-linux/include:$CONDA_PREFIX/include:$HOME/.local/libcudacxx/include:$HOME/.local/thrust:${C_INCLUDE_PATH}"

# Optional: define CUDAHOME & architecture
export CUDAHOME="$CONDA_PREFIX"
export CUDACXX="$CONDA_PREFIX/bin/nvcc"
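
# Optional (an addition, not part of the original script): the deactivate script
# below also unsets TORCH_CUDA_ARCH_LIST, so if you pin the GPU architecture for
# CUDA extension builds, export it here as well; the value depends on your GPU,
# e.g. 8.6 for an RTX 30-series card.
# export TORCH_CUDA_ARCH_LIST="8.6"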

To unset the path variables when deactivating the environment, please do the following.

mkdir -p $CONDA_PREFIX/etc/conda/deactivate.d
vim $CONDA_PREFIX/etc/conda/deactivate.d/cuda_includes.sh

# -------- Put everything below inside the script -------------

#!/bin/bash
# Restore previous include paths
export CPLUS_INCLUDE_PATH="$_OLD_CPLUS_INCLUDE_PATH"
export C_INCLUDE_PATH="$_OLD_C_INCLUDE_PATH"
unset _OLD_CPLUS_INCLUDE_PATH
unset _OLD_C_INCLUDE_PATH
unset CUDAHOME
unset CUDACXX
unset TORCH_CUDA_ARCH_LIST

To test if these scripts work, do the following:

conda deactivate
conda activate schp_catvton

echo $CPLUS_INCLUDE_PATH | tr ':' '\n' | head -n 5

#Expected Result : 
# /mnt/anaconda3/envs/schp_catvton/targets/x86_64-linux/include
# /mnt/anaconda3/envs/schp_catvton/include
# /home/ubuntu/.local/libcudacxx/include
# /home/ubuntu/.local/thrust
  4. Install Open-MMLab libraries (Required for CatVTON, RTMPose)
#MMLab packages (the mim CLI comes from the openmim package;
#install it first if it is not already provided by environment.yml)
pip install -U openmim
mim install "mmengine>=0.10.0,<1.0.0"
mim install "mmcv==2.1.0"
mim install "mmdet==3.2.0"
mim install "mmpose==1.3.2"

#Check that the SegFormer-SCHP modules import correctly
python -c "from model.SegFormer_SCHP.modules import InPlaceABNSync; print('OK')"

Retrieving Checkpoints

pip install gdown

#Download to checkpoints dir
gdown --folder https://drive.google.com/drive/folders/17nyD5fUD8mjqVMOhSZ_kOF1_uBshIqrl?usp=drive_link -O ./checkpoints
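
After the download, the SCHP checkpoints referenced by the commands later in this README should be present:

ls checkpoints/SegFormer_SCHP/
#Expected to include (per the commands below):
# mit_b4_schp_6_sep_checkpoint.pth.tar
# mit_b4_schp_6_torso_5k_checkpoint.pth.tar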

Setting up Dataset

First, download the original VITON-HD dataset into the directory VITON-HD.

The download link can be obtained from the VITON-HD repo : https://github.com/shadow2496/VITON-HD

pip install gdown
mkdir VITON-HD
cd VITON-HD

gdown 1tLx8LRp-sxDp0EcYmYoV_vXdSc-jJ79w
unzip zalando-hd-resized.zip
rm zalando-hd-resized.zip

mv test_pairs.txt test_pairs_unpaired.txt
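
A quick sanity check on the resulting layout (abridged; the dataset contains more annotation folders than listed here):

ls VITON-HD/
#Expected to include: train/  test/  test_pairs_unpaired.txt
ls VITON-HD/test/
#Expected to include (among others): image/  cloth/  cloth-mask/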

Now, to create the required cloth and pose masks for the dataset, run the preprocessing script below.

CUDA_VISIBLE_DEVICES=0 python preprocess_agnostic_mask_ViWear.py \
--data_root_path VITON-HD/train \
--lip_chkpt checkpoints/SegFormer_SCHP/mit_b4_schp_6_sep_checkpoint.pth.tar \
--masker_type only-schp-pose

Finetuning

To finetune CatVTON, we have two available scripts:

  • train.py : The standard training script, as per the experiment details in the source paper.

  • train_pose.py : Two main modifications are made to the standard script:

    • Pose Loss Term : The standard loss term (the MSE between noise vectors) is augmented with a pose loss term. To implement the pose loss, we use RTMPose-m to predict the pose logits of the source and predicted images (see the sketch after this list).

    • Fusion of Masks (Optional) : To aid the model, we fuse the cloth and pose masks of the source person, and use the fused mask as part of the condition input to the model. To use this functionality, please use the argument --fuse_with_pose.
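
As a rough sketch of the combined objective (the pose-loss weight lambda_pose and the use of an MSE between pose logits are illustrative assumptions, not taken from the script):

loss = MSE(noise, noise_pred) + lambda_pose * MSE(RTMPose(source_image), RTMPose(predicted_image))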

In both of these scripts, the most relevant arguments for finetuning are:

  • --train_unet_groups: Finetune only selected modules or parameters under a common group. For example, assigning ".attn1.,down_blocks|.attn2." would define two groups (separated by "|"):

    • Group 1 : All parameters containing both ".attn1." and "down_blocks" in their names.
    • Group 2 : All parameters containing ".attn2." in their names.
  • --lr_unet_groups : Assign individual learning rates to each group mentioned in the --train_unet_groups argument.

Together, these allow flexible finetuning of selected parameter groups, optionally with group-specific learning rates.
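
For example, to finetune only the two groups above with different learning rates, the invocation might look like the following (a sketch: the value format of --lr_unet_groups, here one learning rate per group separated by "|", and the output directory are assumptions):

python -m train_pose \
  --dataset_root VITON-HD/ \
  --train_unet_groups ".attn1.,down_blocks|.attn2." \
  --lr_unet_groups "1e-5|5e-5" \
  --batch_size 128 \
  --output_dir checkpoints/finetune_groups_example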

  python -m train_pose \
    --dataset_root VITON-HD/ \
    --mask_dirname agnostic-mask \
    --pose_dirname pose-mask-viwear-lip-pose \
    --batch_size 128 \
    --allow_tf32 \
    --output_dir checkpoints/finetune_with_pose \
    --fuse_with_pose

The loss values and the checkpoint will be saved to the directory given by the --output_dir argument.

Evaluation

Evaluating the model consists of three main stages:

  • Preprocessing Masks : Generate cloth masks given an image of a person
  • Inference : Using the data (person, target cloth, cloth mask), generate try-on images, which are saved locally.
  • Evaluation : The generated images (results) are evaluated on four different metrics (FID, KID, SSIM, LPIPS)

There are a few model variants available for producing masks and aiding try-on generation. We developed two main variants (all-schp, only-schp) which produce better cloth masks, resulting in enhanced performance. Note that the second variant (only-schp) produces better results than the standard case (stand) at a lower inference time and memory cost. Please refer to the table below for a quantitative comparison (evaluation done on ~730 test samples).

The mask generation process expands on the variants, and even possible extensions. Please refer to the next section for more details.

Ablation Results

Note that the inference time and memory correspond to mask generation only, excluding CatVTON's image generation. Since the latter takes ~209 ms per image, the only-schp variant produces try-on images in ~369 ms overall (~3 fps), a noticeable speedup over the stand variant at ~933 ms (~1 fps). This further motivates the choice of only-schp.

Preprocessing Masks

NOTE: To train or finetune models using SCHP to produce segmentation masks, please refer to the Self-Correction-Human-Parsing directory. Below, we use trained and finetuned models to produce masks.

To generate the try-on image, the CatVTON model requires an image of the person, the target cloth, and a cloth mask that covers the source garment on the person. In short, the input to the model consists of the masked person and the target cloth, and the model inpaints the target cloth into the masked region.

Therefore, each data sample needs an assigned cloth mask, which can be generated using the preprocess_agnostic_mask_ViWear.py script. There are five different ways to generate the masks:

  • Standard (stand) : Standard CatVTON pipeline to generate masks. This consists of using three models (SCHP_LIP, SCHP_ATR, DensePose) to refine the masking region.

  • Replace SCHP (all-schp) : We replace only the SCHP_LIP model (from the standard setting) with SCHP_SegFormer (a better-performing model). This provides better results than stand with a lower inference time and memory cost. The weights for this model can be found at checkpoints/SegFormer_SCHP/mit_b4_schp_6_torso_5k_checkpoint.pth.tar.

  • Only SCHP (only-schp) : Instead of the three models of the standard case, we keep only SCHP_SegFormer. This leads to a lower inference time while producing results closely aligned with all-schp. The weights for this model can be found at checkpoints/SegFormer_SCHP/mit_b4_schp_6_sep_checkpoint.pth.tar.

  • Replace SCHP + Pose (all-schp-pose) : Generates cloth masks as in all-schp, but also produces pose predictions to aid the model (for mask fusion).

  • Only SCHP + Pose (only-schp-pose) : Generates cloth masks as in only-schp, but also produces pose predictions to aid the model (for mask fusion).

To select a variant, pass the variant name (shown in brackets above) to the --masker_type CLI argument, and pass the path to the model weights to the --lip_chkpt CLI argument.

CUDA_VISIBLE_DEVICES=0 python preprocess_agnostic_mask_ViWear.py \
--data_root_path VITON-HD/test \
--lip_chkpt checkpoints/SegFormer_SCHP/mit_b4_schp_6_sep_checkpoint.pth.tar \
--masker_type only-schp-pose

Inference

Once the data is prepared (a mask for each data sample), we can run the CatVTON model to generate try-on results for each test sample. The resulting images are saved locally.

If you require the model to process pose masks (fusion of cloth and pose masks) in the pipeline, make sure to pass one of the pose variants (only-schp-pose, all-schp-pose) to the --masker_type CLI argument.

CUDA_VISIBLE_DEVICES=0 python inference_ViWear.py \
--masker_type only-schp-pose \
--local_ckpt_dir checkpoints/finetune_with_pose \
--dataset vitonhd \
--data_root_path VITON-HD/ \
--output_dir results/viwear/lip_pose_mask  \
--dataloader_num_workers 8 \
--batch_size 8 \
--seed 555 \
--mixed_precision fp16 \
--allow_tf32 \
--repaint \
--eval_pair \
--save_fused_masks
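
When inference completes, the generated images are saved under the chosen output directory; the paired results consumed by the evaluation command below are expected at:

ls results/viwear/lip_pose_mask/vitonhd-512/paired | head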

Evaluation

The saved local results are evaluated under four different metrics: FID, KID, SSIM, LPIPS.

CUDA_VISIBLE_DEVICES=0 python eval.py \
--gt_folder VITON-HD/test/image/ \
--pred_folder results/viwear/lip_pose_mask/vitonhd-512/paired \
--paired \
--batch_size=16 \
--num_workers=16

About

[ICLR 2025] CatVTON is a simple and efficient virtual try-on diffusion model with 1) a lightweight network (899.06M parameters in total), 2) parameter-efficient training (49.57M trainable parameters), and 3) simplified inference (< 8 GB VRAM for 1024×768 resolution).
