SimpleFold: Folding Proteins is Simpler than You Think

This GitHub repository accompanies the research paper SimpleFold: Folding Proteins is Simpler than You Think (arXiv 2025).

Yuyang Wang, Jiarui Lu, Navdeep Jaitly, Joshua M. Susskind, Miguel Angel Bautista

[Paper] [BibTeX]

Introduction

We introduce SimpleFold, the first flow-matching-based protein folding model that uses only general-purpose transformer layers. SimpleFold does not rely on expensive modules like triangle attention or pair-representation biases, and is trained via a generative flow-matching objective. We scale SimpleFold to 3B parameters and train it on more than 8.6M distilled protein structures together with experimental PDB data. To the best of our knowledge, SimpleFold is the largest-scale folding model developed to date. On standard folding benchmarks, the SimpleFold-3B model achieves competitive performance compared to state-of-the-art baselines. Due to its generative training objective, SimpleFold also demonstrates strong performance in ensemble prediction. SimpleFold challenges the reliance on complex domain-specific architectural designs in folding, highlighting an alternative yet important avenue of progress in protein structure prediction.

Installation

To install the simplefold package from the GitHub repository, run:

git clone https://github.com/apple/ml-simplefold.git
cd ml-simplefold
conda create -n simplefold python=3.10
conda activate simplefold
python -m pip install -U pip build; pip install -e .
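
To verify the installation, you can check that the simplefold command-line entry point (used in the Inference section below) is available, assuming the console script was installed correctly:

simplefold --help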

If you want to use the MLX backend on Apple silicon, additionally install:

pip install mlx==0.28.0
pip install git+https://github.com/facebookresearch/esm.git

Example

We provide a Jupyter notebook, sample.ipynb, that predicts protein structures from example protein sequences.

Inference

Once the simplefold package is installed, you can predict protein structures from target FASTA file(s) via the command line below. Both PyTorch and MLX (recommended for Apple hardware) backends are supported for inference.

simplefold \
    --simplefold_model simplefold_100M \  # choose from simplefold_100M/360M/700M/1.1B/1.6B/3B
    --num_steps 500 --tau 0.01 \          # inference settings
    --nsample_per_protein 1 \             # number of generated conformers per target
    --plddt \                             # also output pLDDT
    --fasta_path [FASTA_PATH] \           # path to the target FASTA directory or file
    --output_dir [OUTPUT_DIR] \           # path to the output directory
    --backend [mlx, torch]                # inference backend: MLX or PyTorch
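
For example, the following command (with illustrative local paths) folds every FASTA file in ./examples/fastas with the 700M model using the MLX backend:

simplefold \
    --simplefold_model simplefold_700M \
    --num_steps 500 --tau 0.01 \
    --nsample_per_protein 1 \
    --plddt \
    --fasta_path ./examples/fastas \
    --output_dir ./outputs \
    --backend mlx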

Evaluation

We provide structures predicted by SimpleFold models of different sizes:

https://ml-site.cdn-apple.com/models/simplefold/cameo22_predictions.zip # predicted structures of CAMEO22
https://ml-site.cdn-apple.com/models/simplefold/casp14_predictions.zip  # predicted structures of CASP14
https://ml-site.cdn-apple.com/models/simplefold/apo_predictions.zip     # predicted structures of Apo
https://ml-site.cdn-apple.com/models/simplefold/codnas_predictions.zip  # predicted structures of Fold-switch (CoDNaS)
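
For instance, assuming wget and unzip are available, the CAMEO22 predictions can be fetched and unpacked with:

wget https://ml-site.cdn-apple.com/models/simplefold/cameo22_predictions.zip
unzip cameo22_predictions.zip -d cameo22_predictions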

We use the Docker image of OpenStructure 2.9.1 to evaluate generated structures on folding tasks (i.e., CASP14/CAMEO22). Once the Docker image is available, you can run the evaluation via:

python src/simplefold/evaluation/analyze_folding.py \
    --data_dir [PATH_TO_TARGET_MMCIF] \
    --sample_dir [PATH_TO_PREDICTED_MMCIF] \
    --out_dir [PATH_TO_OUTPUT] \
    --max-workers [NUMBER_OF_WORKERS]

To evaluate results of two-state prediction (i.e., Apo/CoDNaS), one needs to compile the TMscore binary and then run the evaluation via:

python src/simplefold/evaluation/analyze_two_state.py \ 
    --data_dir [PATH_TO_TARGET_DATA_DIRECTORY] \
    --sample_dir [PATH_TO_PREDICTED_PDB] \
    --tm_bin [PATH_TO_TMscore_BINARY] \
    --task apo \ # choose from apo and codnas
    --nsample 5
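
TMscore is distributed as a single C++ source file by the Zhang Lab; a typical build looks like the following (an assumption based on the standard TMscore distribution, not this repository; check the TMscore documentation for the exact flags):

g++ -O3 -o TMscore TMscore.cpp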

Train

You can also train or fine-tune SimpleFold yourself. The instructions below cover data preparation and SimpleFold training.

Data preparation

Training targets

SimpleFold is trained on joint datasets comprising experimental structures from the PDB as well as distilled predictions from AFDB SwissProt and AFESM. The lists of filtered SwissProt and AFESM targets used in our training can be found here:

https://ml-site.cdn-apple.com/models/simplefold/swissprot_list.csv # list of filtered SwissProt targets (~270K targets)
https://ml-site.cdn-apple.com/models/simplefold/afesm_list.csv # list of filtered AFESM targets (~1.9M targets)
https://ml-site.cdn-apple.com/models/simplefold/afesme_dict.json # cluster dictionary of the filtered extended AFESM (AFESM-E) set (~8.6M targets)

In afesme_dict.json, the data is stored in the following structure:

{
    "<cluster 1 ID>": {"members": ["<protein 1 ID>", "<protein 2 ID>", ...]},
    "<cluster 2 ID>": {"members": ["<protein 1 ID>", "<protein 2 ID>", ...]},
    ...
}
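
As a minimal sketch (assuming only that the file matches the structure above), the cluster dictionary can be loaded and flattened into a per-protein target list:

import json

# Load the AFESM-E cluster dictionary described above.
with open("afesme_dict.json") as f:
    clusters = json.load(f)

# Flatten cluster members into a single list of protein IDs.
protein_ids = [pid for entry in clusters.values() for pid in entry["members"]]
print(f"{len(clusters)} clusters, {len(protein_ids)} member proteins")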

Of course, you can also use your own custom datasets to train or fine-tune SimpleFold models. The instructions below describe how to process a dataset for SimpleFold training.

Process mmcif structures

To process downloaded mmCIF files, you need Redis installed; download the CCD database and launch the Redis server:

wget https://boltz1.s3.us-east-2.amazonaws.com/ccd.rdb
redis-server --dbfilename ccd.rdb --port 7777
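
You can confirm the server is up with a standard Redis ping against the chosen port:

redis-cli -p 7777 ping   # should reply PONG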

You can then convert mmCIF files into the input format for SimpleFold:

python src/simplefold/process_mmcif.py \
    --data_dir [MMCIF_DIR] \  # directory of mmCIF files
    --out_dir [OUTPUT_DIR] \  # directory of processed targets
    --use-assembly

Training

Model configuration is managed with Hydra. An example training configuration can be found in configs/experiment/train. To change dataset and model settings, refer to the config files in configs/data and configs/model. To launch SimpleFold training:

python train.py experiment=train

To train SimpleFold with the FSDP strategy:

python train_fsdp.py experiment=train_fsdp
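
Since the configuration is composed by Hydra, individual settings can also be overridden directly on the command line using Hydra's dotted-key syntax; the keys below are illustrative placeholders, not actual config names from this repository:

python train.py experiment=train data.batch_size=8 trainer.max_epochs=100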

Citation

If you find this code useful, please cite the following paper:

@article{simplefold,
  title={SimpleFold: Folding Proteins is Simpler than You Think},
  author={Wang, Yuyang and Lu, Jiarui and Jaitly, Navdeep and Susskind, Josh and Bautista, Miguel Angel},
  journal={arXiv preprint arXiv:2509.18480},
  year={2025}
}

Acknowledgements

Our codebase builds on multiple open-source contributions; please see ACKNOWLEDGEMENTS for more details.

License

Please check the repository LICENSE before using the provided code, and LICENSE_MODEL for the released models.
