Stars
Part-X-MLLM: Part-aware 3D Multimodal Large Language Model
Educational implementation of the Discrete Flow Matching paper
Monkey (LMM): Image Resolution and Text Label Are Important Things for Large Multi-modal Models (CVPR 2024 Highlight)
PyTorch code for Vision Transformers training with the Self-Supervised learning method DINO
Reference PyTorch implementation and models for DINOv3
StyleShot: A SnapShot on Any Style. A model that can transfer any style to any content, generating high-quality, personalized stylized images without per-image fine-tuning.
Code of ICCV 2023 paper titled General Image-to-Image Translation with One-Shot Image Guidance
Official Code for ICCV 2025 paper — Beyond Isolated Words: Diffusion Brush for Handwritten Text-Line Generation
A collection of AWESOME things about information geometry
An xray/v2ray client for iOS/macOS with support for vmess/vless/shadowsocks
Official implementation of "UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing"
Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
Official PyTorch Implementation of "Latent Diffusion Model Without Variational Autoencoder".
Official Repo For "Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos"
Pure TypeScript media toolkit for reading, writing, and converting video and audio files, directly in the browser.
[SIGGRAPH Asia 2025] OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion
MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation
PyTorch code and models for V-JEPA self-supervised learning from video.
Official codebase for I-JEPA, the Image-based Joint-Embedding Predictive Architecture. First outlined in the CVPR paper, "Self-supervised learning from images with a joint-embedding predictive arch…
Code for the paper "Conditional Representation Learning for Customized Tasks" (NeurIPS 2025 Spotlight)
[NeurIPS 2025] T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
HunyuanImage-3.0: A Powerful Native Multimodal Model for Image Generation