VEnhancer is a generative space-time enhancement framework that improves existing text-to-video (T2V) results.
*(Side-by-side video comparisons: original AIGC video vs. +VEnhancer result.)*
📖 For more visual results, check out our project page.
- [2024.08.23] We have enhanced T2V results from Keling🤗. (The VEnhancer checkpoint used is the released one 🤗.)
  *(Videos: `brickman_art_gallery.mp4` / `A.little.brick.man.visiting.an.art.gallery.mp4`. Prompt: "A little brick man visiting an art gallery.")*
- [2024.08.19] We have enhanced some T2V results from CogVideoX🤗. (The VEnhancer checkpoint used here is not the released one 😰.)
  Short captions (fewer than three sentences) are more suitable for VEnhancer; please shorten long captions before running it.
  *(Videos: `boat_input.mp4` (input) / `boat_up3.mp4` (enhanced). Prompt: "A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.")*
- [2024.08.18] 😸 Support enhancement for arbitrarily long videos (by splitting the videos into multiple chunks with overlaps; see the sketch after this list); faster sampling with only 15 steps without obvious quality loss (by setting `--solver_mode 'fast'` in the script command); use a temporal VAE to reduce video flickering.
- [2024.07.28] 🔥 Inference code and pretrained video enhancement model are released.
- [2024.07.10] 🤗 This repo is created.
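On the long-video support mentioned above: here is a minimal sketch of overlapped chunking, assuming illustrative names and chunk sizes (not the repo's actual API). Neighboring chunks share frames so the enhanced pieces can be blended back together without visible seams.

```python
# Minimal sketch of overlapped chunking for long videos (illustrative only;
# chunk_len/overlap defaults and names are assumptions, not VEnhancer's API).
def chunk_indices(n_frames: int, chunk_len: int = 32, overlap: int = 8):
    """Yield (start, end) frame ranges that cover the video with overlaps."""
    assert chunk_len > overlap, "chunks must advance by at least one frame"
    stride = chunk_len - overlap
    start = 0
    while True:
        end = min(start + chunk_len, n_frames)
        yield start, end
        if end == n_frames:
            break
        start += stride

# Neighboring chunks share `overlap` frames, which can be cross-faded
# after enhancement to hide seams.
print(list(chunk_indices(100)))  # [(0, 32), (24, 56), (48, 80), (72, 100)]
```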
VEnhancer achieves spatial super-resolution, temporal super-resolution (frame interpolation), and video refinement in a unified framework. It flexibly adapts to different upsampling factors (e.g., 1x~8x) for either spatial or temporal super-resolution, and it provides adjustable control over the refinement strength for handling diverse video artifacts.
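To make the factors concrete, here is a minimal sketch (function name and defaults are illustrative assumptions, not the repo's API) of how the output shape follows from the chosen spatial and temporal factors:

```python
# Illustrative only: how spatial/temporal SR factors determine the output shape.
def enhanced_shape(h: int, w: int, n_frames: int, input_fps: float,
                   up_scale: int = 4, target_fps: int = 24):
    """Spatial SR scales H and W; temporal SR interpolates frames to target_fps."""
    out_h, out_w = h * up_scale, w * up_scale
    out_frames = round(n_frames * target_fps / input_fps)
    return out_h, out_w, out_frames

# e.g. a 576x320, 49-frame, 8-fps clip with 4x spatial SR and a 24-fps target:
print(enhanced_shape(320, 576, 49, 8))  # (1280, 2304, 147) -> 3x more frames
```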
It follows ControlNet and copies the architectures and weights of the multi-frame encoder and middle block of a pretrained video diffusion model to build a trainable condition network.
This video ControlNet accepts both low-resolution key frames and full frames of noisy latents as inputs.
Also, the noise level used for noise augmentation is injected as an additional condition to control the refinement strength: higher noise corresponds to stronger refinement.
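For intuition, here is a hedged sketch of diffusion-style noise augmentation on the condition latents (illustrative only; VEnhancer's actual conditioning code may differ):

```python
# Illustrative sketch: corrupt the low-resolution condition latents with
# Gaussian noise and keep the level as an extra conditioning signal.
import torch

def noise_augment(cond_latents: torch.Tensor, noise_level: int):
    """noise_level in [0, 300]; higher -> stronger corruption -> stronger refinement."""
    sigma = noise_level / 300.0  # map the discrete level to a std in [0, 1]
    noisy = cond_latents + sigma * torch.randn_like(cond_latents)
    # The level is also embedded (like a timestep) so the network knows how
    # much to trust the condition versus regenerating content.
    return noisy, torch.tensor([noise_level])
```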
```bash
# clone this repo
git clone https://github.com/Vchitect/VEnhancer.git
cd VEnhancer

# create environment
conda create -n venhancer python=3.10
conda activate venhancer
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
pip install -r requirements.txt
```

Note that the `ffmpeg` command must be available. If you have sudo access, you can install it with:

```bash
sudo apt-get update && sudo apt-get install ffmpeg libsm6 libxext6 -y
```

| Model Name | Description | HuggingFace | BaiduNetdisk |
|---|---|---|---|
| venhancer_paper.pth | video enhancement model, paper version | download | download |
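To fetch the checkpoint programmatically, something like the following works with `huggingface_hub` (the `repo_id` below is a placeholder assumption — use the HuggingFace link in the table as the source of truth):

```python
# Scripted download sketch; the repo id is a placeholder, not verified.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="<hf-user>/VEnhancer",   # placeholder: see the table's HuggingFace link
    filename="venhancer_paper.pth",
    local_dir="ckpts",               # matches the VEnhancer/ckpts layout below
)
print(ckpt_path)
```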
- Download the VEnhancer checkpoint and put it in the `VEnhancer/ckpts` directory (optional, as this can be done automatically).
- Run the following command:
```bash
bash run_VEnhancer.sh
```

In `run_VEnhancer.sh`,

- `up_scale` is the upsampling factor ($1\sim8$) for spatial super-resolution; $\times2,3,4$ are recommended.
- `target_fps` is your expected target FPS; the default is 24.
- `noise_aug` is the noise level ($0\sim300$) for noise augmentation; higher noise corresponds to stronger refinement.
The same functionality is also available as a Gradio demo:

```bash
python gradio_app.py
```

If you use our work in your research, please cite our publication:
```bibtex
@article{he2024venhancer,
  title={VEnhancer: Generative Space-Time Enhancement for Video Generation},
  author={He, Jingwen and Xue, Tianfan and Liu, Dongyang and Lin, Xinqi and Gao, Peng and Lin, Dahua and Qiao, Yu and Ouyang, Wanli and Liu, Ziwei},
  journal={arXiv preprint arXiv:2407.07667},
  year={2024}
}
```
Our codebase builds on modelscope. Thanks to the authors for sharing their awesome codebase!
If you have any questions, please feel free to reach us at [email protected].