- HUST | Research intern at ByteDance
- Wuhan, China
- https://wjf5203.github.io/

Stars
MiroThinker is a series of open-source agentic models trained for deep research and complex tool use scenarios.
[ICCV 2025] OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning
Fully Open Framework for Democratized Multimodal Training
WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
Official PyTorch implementation of FlowMo.
Official codebase used to develop Vision Transformer, SigLIP, MLP-Mixer, LiT and more.
Summarizes existing representative LLM text datasets.
Curated list of datasets and tools for post-training.
A quick guide to trending instruction fine-tuning datasets.
Awesome LLM pre-training resources, including data, frameworks, and methods.
AliTok: Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model
FlexTok: Resampling Images into 1D Token Sequences of Flexible Length
GPT-IMAGE-EDIT-1.5M: A Million-Scale, GPT-Generated Image Dataset
Official inference code and LongText-Bench benchmark for our paper X-Omni (https://arxiv.org/pdf/2507.22058).
A PyTorch library for implementing flow matching algorithms, featuring continuous and discrete flow matching implementations. It includes practical examples for both text and image modalities.
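The core training loop of conditional flow matching is simple enough to sketch in a few lines. The snippet below is a minimal NumPy illustration (not code from the library above): it samples a point on the straight-line path between noise and data and regresses a model toward the path's constant velocity. The `zero_model` stand-in and the function name are hypothetical, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(model, x1, rng):
    """One conditional flow-matching training step on a linear interpolation path.

    model: callable (x_t, t) -> predicted velocity field
    x1:    batch of data samples, shape (B, D)
    """
    x0 = rng.standard_normal(x1.shape)       # noise endpoints of the path
    t = rng.uniform(size=(x1.shape[0], 1))   # per-sample time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1             # point on the straight-line path
    target_v = x1 - x0                       # constant velocity along that path
    pred_v = model(xt, t)
    return np.mean((pred_v - target_v) ** 2)  # MSE regression target

# toy stand-in "model" that always predicts zero velocity
zero_model = lambda xt, t: np.zeros_like(xt)
x1 = rng.standard_normal((4, 2))
loss = flow_matching_loss(zero_model, x1, rng)
```

In practice the model would be a neural network trained by gradient descent on this loss; sampling then integrates the learned velocity field from noise at t=0 to data at t=1.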
[NeurIPS 2025] Efficient Reasoning Vision Language Models
[NeurIPS 2025] MMaDA - Open-Sourced Multimodal Large Diffusion Language Models
[IJCV] Liquid: Language Models are Scalable and Unified Multi-modal Generators
A repo tracking the latest autoregressive visual generation papers.
[ICCV 2025] SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer
A selection guide for image and video tokenizers/VAEs, with text and face reconstruction evaluation.
[ICLR 2025][arXiv:2406.07548] Image and Video Tokenization with Binary Spherical Quantization
Official PyTorch Implementation of "SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers"
High-Resolution Image Synthesis with Latent Diffusion Models