This is a PyTorch/GPU re-implementation of the paper Masked Autoencoders Are Scalable Vision Learners:
@Article{MaskedAutoencoders2021,
  author  = {Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Doll{\'a}r and Ross Girshick},
  journal = {arXiv:2111.06377},
  title   = {Masked Autoencoders Are Scalable Vision Learners},
  year    = {2021},
}
- The original implementation was in TensorFlow+TPU. This re-implementation is in PyTorch+GPU.
- This repo is a modification on the DeiT repo. Installation and preparation follow that repo.
- This repo is based on timm==0.3.2, for which a fix is needed to work with PyTorch 1.8.1+.
We build on the following implementation and its base models: Masked Autoencoders: A PyTorch Implementation (GitHub).
pretrained/mae_pretrain_vit_{base|large|huge}
# Links:
# https://dl.fbaipublicfiles.com/mae/visualize/mae_visualize_vit_base.pth
# https://dl.fbaipublicfiles.com/mae/visualize/mae_visualize_vit_large.pth
# https://dl.fbaipublicfiles.com/mae/visualize/mae_visualize_vit_huge.pth
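The three links above share one naming pattern, so fetching them can be scripted. The following is a minimal sketch, not part of the repo: the helper names `checkpoint_url` and `download_checkpoint` are hypothetical, and the `pretrained` target directory is only assumed from the path shown above. Files keep the `mae_visualize_vit_*` names used in the links.

```python
import urllib.request
from pathlib import Path

# URL prefix and model sizes taken from the links listed above.
BASE_URL = "https://dl.fbaipublicfiles.com/mae/visualize"
SIZES = ("base", "large", "huge")


def checkpoint_url(size):
    """Build the download URL for one model size (hypothetical helper)."""
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}, expected one of {SIZES}")
    return f"{BASE_URL}/mae_visualize_vit_{size}.pth"


def download_checkpoint(size, target_dir="pretrained"):
    """Download a checkpoint into target_dir unless it is already there."""
    url = checkpoint_url(size)
    target = Path(target_dir) / url.rsplit("/", 1)[-1]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        urllib.request.urlretrieve(url, str(target))
    return target
```

Calling `download_checkpoint("base")` would save the file under `pretrained/`; the download is skipped when the file already exists.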
Double-check that timm==0.4.12 is installed. The code in mae.py should run without issues, although changes in newer NumPy versions may affect its behavior.
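One way to check the installed timm version without importing timm is to ask the package metadata. This is a small sketch using only the standard library (Python 3.8+); `installed_version` and `check_timm` are illustrative names, not functions from this repo.

```python
from importlib import metadata

# Version named in the text above.
REQUIRED_TIMM = "0.4.12"


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_timm(required=REQUIRED_TIMM):
    """Return "ok" or a short message describing the mismatch."""
    found = installed_version("timm")
    if found is None:
        return "timm is not installed"
    if found != required:
        return f"timm {found} found, expected {required}"
    return "ok"


if __name__ == "__main__":
    print(check_timm())
```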
- Visualization demo
- Pre-trained checkpoints + fine-tuning code
- Pre-training code
Run our interactive visualization demo using the Colab notebook (no GPU needed).
You can also run Masked_AutoEncoder_Scalable_Vision_Learner.ipynb locally.
The following table provides the pre-trained checkpoints used in the paper, converted from TF/TPU to PT/GPU:
| | ViT-Base | ViT-Large | ViT-Huge |
|---|---|---|---|
| pre-trained checkpoint | download | download | download |
| md5 | 8cad7c | b8b06e | 9bdbb0 |
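A downloaded checkpoint can be checked against the md5 row of the table. The sketch below assumes the six-character values are the leading hex digits of each file's MD5 digest, which the table does not state explicitly; `md5_hex` and `matches_prefix` are hypothetical helper names.

```python
import hashlib

# Six-character values copied from the md5 row of the table above.
# Assumption: they are prefixes of the full MD5 hex digest.
MD5_PREFIXES = {
    "vit_base": "8cad7c",
    "vit_large": "b8b06e",
    "vit_huge": "9bdbb0",
}


def md5_hex(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_prefix(path, expected_prefix):
    """True if the file's MD5 digest starts with the expected prefix."""
    return md5_hex(path).startswith(expected_prefix.lower())
```

For example, `matches_prefix(path_to_base_checkpoint, MD5_PREFIXES["vit_base"])` should return `True` for an intact ViT-Base download, provided the prefix assumption holds.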
Fine-tuning instructions are in FINETUNE.md.
By fine-tuning these pre-trained models, we rank #1 in these classification tasks (detailed in the paper):
| | ViT-B | ViT-L | ViT-H | ViT-H448 | prev best |
|---|---|---|---|---|---|
| ImageNet-1K (no external data) | 83.6 | 85.9 | 86.9 | 87.8 | 87.1 |
| *The following are evaluations of the same model weights (fine-tuned on the original ImageNet-1K):* | | | | | |
| ImageNet-Corruption (error rate) | 51.7 | 41.8 | 33.8 | 36.8 | 42.5 |
| ImageNet-Adversarial | 35.9 | 57.1 | 68.2 | 76.7 | 35.8 |
| ImageNet-Rendition | 48.3 | 59.9 | 64.4 | 66.5 | 48.7 |
| ImageNet-Sketch | 34.5 | 45.3 | 49.6 | 50.9 | 36.0 |
| *The following are transfer learning results from fine-tuning the pre-trained MAE on the target dataset:* | | | | | |
| iNaturalists 2017 | 70.5 | 75.7 | 79.3 | 83.4 | 75.4 |
| iNaturalists 2018 | 75.4 | 80.1 | 83.0 | 86.8 | 81.2 |
| iNaturalists 2019 | 80.5 | 83.4 | 85.7 | 88.3 | 84.1 |
| Places205 | 63.9 | 65.8 | 65.9 | 66.8 | 66.0 |
| Places365 | 57.9 | 59.4 | 59.8 | 60.3 | 58.0 |
Pre-training instructions are in PRETRAIN.md.
This project is under the CC-BY-NC 4.0 license. See LICENSE for details.