Zijian Zhou1,2, Shikun Liu1, Haozhe Liu1, Haonan Qiu1, Zhaochong An1, Weiming Ren1, Zhiheng Liu1, Xiaoke Huang1, Kam Woh Ng1, Tian Xie1, Xiao Han1, Yuren Cong1, Hang Li1, Chuyan Zhu1, Aditya Patel1, Tao Xiang1, Sen He1
1 Meta AI 2 King's College London
The training and inference code will be released once it has been cleaned up. Please stay tuned.
Saber is a scalable zero-shot framework for reference-to-video (R2V) generation. By introducing a masked training strategy, Saber sidesteps the data bottleneck of explicit reference-image–video–text triplet datasets: it trains exclusively on video-text pairs, yet achieves zero-shot R2V generation.
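The exact masking scheme is described in the paper; as a rough illustration of the general idea, pseudo-reference inputs can be derived from the video itself, so that only video-text pairs are needed for training. A minimal sketch (the function name, patch-style masking, and all parameters below are illustrative assumptions, not Saber's actual implementation):

```python
import numpy as np

def make_pseudo_references(video, num_refs=2, mask_ratio=0.5, seed=None):
    """Sample frames from a video clip and mask out random pixels,
    producing pseudo reference images for training (hypothetical sketch).

    video: float array of shape (T, H, W, C).
    Returns an array of shape (num_refs, H, W, C).
    """
    rng = np.random.default_rng(seed)
    T, H, W, C = video.shape
    frame_ids = rng.choice(T, size=num_refs, replace=False)
    refs = []
    for t in frame_ids:
        ref = video[t].copy()
        mask = rng.random((H, W)) < mask_ratio  # per-pixel Bernoulli mask
        ref[mask] = 0.0  # zero out masked regions
        refs.append(ref)
    return np.stack(refs)

# Toy example: an 8-frame, 16x16 RGB clip of all ones.
video = np.ones((8, 16, 16, 3), dtype=np.float32)
refs = make_pseudo_references(video, num_refs=2, mask_ratio=0.5, seed=0)
print(refs.shape)  # (2, 16, 16, 3)
```

Because the masked references come from the video itself, every ordinary video-text pair yields a (reference, video, text) training example for free, which is what removes the need for curated R2V triplet data.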
If you find Saber useful for your research, please cite our paper:
@article{zhou2025scaling,
  title={Scaling Zero-Shot Reference-to-Video Generation},
  author={Zhou, Zijian and Liu, Shikun and Liu, Haozhe and Qiu, Haonan and An, Zhaochong and Ren, Weiming and Liu, Zhiheng and Huang, Xiaoke and Ng, Kam Woh and Xie, Tian and Han, Xiao and Cong, Yuren and Li, Hang and Zhu, Chuyan and Patel, Aditya and Xiang, Tao and He, Sen},
  journal={arXiv preprint arXiv:2512.06905},
  year={2025}
}