- MSA Generation
- Model Inference
- Generating Input Structures for MD
- Running MD Simulations
- Clustering MD Simulations into Macrostates
- References
Scripts to generate MSAs and corresponding input features for structure prediction via openfold for monomers or multimers.
- Generate MSAs given either UniProt IDs or PDB IDs
- Utilize pre-computed alignments from OpenProteinSet when available
- Computes alignments from scratch when no corresponding MSAs are found in OpenProteinSet
- Create pickled feature file for downstream structure prediction
Before running the pipeline, install custom_openfold_utils (locally via pip) and set up AWS credentials for OpenProteinSet access:
#create AWS credentials file
mkdir -p ~/.aws
vi ~/.aws/credentialsAdd the following content to your credentials file:
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEYEnsure your AWS user has the proper permissions:
- Log in to the AWS Management Console
- Go to IAM (Identity and Access Management)
- Attach the
AmazonS3ReadOnlyAccesspolicy to your user/group.
python gen_msa_monomer.py [OPTIONS]
python gen_msa_multimer.py [OPTIONS]| Argument | Description |
|---|---|
--uniprot_id |
UniProt identifier for the protein of interest |
--pdb_id |
PDB identifier with chain (format: XXX_Y) |
--fasta_path |
Path to a FASTA file |
--msa_save_dir |
Parent directory to save the generated alignments (note: files are not directly saved in this directory, see Output section for further details) |
--template_mmcif_dir |
Directory containing mmCIF files for template search |
The scripts can take three different input methods:
UniProt ID:
python ./msa_utils/gen_msa_monomer.py --uniprot_id A0A003 --msa_save_dir=./test
python ./msa_utils/gen_msa_multimer.py --uniprot_id_list A0A003,A0A003 --msa_save_dir=./testPDB ID with chain:
python ./msa_utils/gen_msa_monomer.py --pdb_id 6kvc_Avi --msa_save_dir=./test
python ./msa_utils/gen_msa_multimer.py --pdb_id_list 6kvc_A,6kvc_A --msa_save_dir=./testFASTA file:
python ./msa_utils/gen_msa_monomer.py --fasta_path /path/A0A003.fasta --msa_save_dir=./test
python ./msa_utils/gen_msa_multimer.py --fasta_path /path/A0A003-A0A003.fasta --msa_save_dir=./testExample FASTA files for monomer:
>XXXX_Y where XXXX corresponds to PDB ID and Y corresponds to chain num
seqExample FASTA files for multimer:
>AAAA-BBBB-CCCC_A where AAAA,BBBB,CCCC corresponds to PDB ID of each respective chain (can be the same or different)
seq1
>AAAA-BBBB-CCCC_B
seq2
>AAAA-BBBB-CCCC_C where
seq3The relevant database paths also need to be passed in as inputs. We provide examples for the monomer/multimer script below:
python gen_msa_monomer.py --uniprot_database_path=$DOWNLOAD_DIR/uniprot/uniprot.fasta --uniref90_database_path=$DOWNLOAD_DIR/uniref90/uniref90.fasta --mgnify_database_path=$DOWNLOAD_DIR/mgnify/mgy_clusters_2022_05.fa --pdb70_database_path=$DOWNLOAD_DIR/pdb70/pdb70 --uniclust30_database_path=$DOWNLOAD_DIR/uniref30/UniRef30_2021_03 --bfd_database_path=$DOWNLOAD_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt --jackhmmer_binary_path=/opt/applications/hmmer/3.3.2/gnu/bin/jackhmmer --hhblits_binary_path=/opt/applications/hhsuite/3.3.0/gnu/bin/hhblits --hhsearch_binary_path=/opt/applications/hhsuite/3.3.0/gnu/bin/hhsearch --kalign_binary_path=/opt/applications/kalign/2.04/gnu/bin/kalign --template_mmcif_dir=$DOWNLOAD_DIR/pdb_mmcif/mmcif_files
python gen_msa_multimer.py --uniprot_database_path=$DOWNLOAD_DIR/uniprot/uniprot.fasta --uniref90_database_path=$DOWNLOAD_DIR/uniref90/uniref90.fasta --mgnify_database_path=$DOWNLOAD_DIR/mgnify/mgy_clusters_2022_05.fa --pdb70_database_path=$DOWNLOAD_DIR/pdb70/pdb70 --uniclust30_database_path=$DOWNLOAD_DIR/uniref30/UniRef30_2021_03 --bfd_database_path=$DOWNLOAD_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt --jackhmmer_binary_path=/opt/applications/hmmer/3.3.2/gnu/bin/jackhmmer --hhblits_binary_path=/opt/applications/hhsuite/3.3.0/gnu/bin/hhblits --hhsearch_binary_path=/opt/applications/hhsuite/3.3.0/gnu/bin/hhsearch --hmmsearch_binary_path=/opt/applications/hmmer/3.3.2/gnu/bin/hmmsearch --hmmbuild_binary_path=/opt/applications/hmmer/3.3.2/gnu/bin/hmmbuild --kalign_binary_path=/opt/applications/kalign/2.04/gnu/bin/kalign --pdb_seqres_database_path=$DOWNLOAD_DIR/pdb_seqres/pdb_seqres.txt --template_mmcif_dir=$DOWNLOAD_DIR/pdb_mmcif/mmcif_files
msa_save_dir
|
+----> uniprot_id
|
+----> pdb_id
+----> pdb_id.fasta
+----> bfd_uniclust_hits.a3m
+----> mgnify_hits.a3m
+----> pdb70_hits.hhr
+----> uniref90_hits.a3m
+----> features.pkl
Scripts to run inference via MSA subsampling (AFSample2), AF-RW (AlphaFold-RandomWalk), or AF-Ensemble.
- openfold
- mkdssp
- CUDA modules for intrinsic dimensionality (install via
python intrinsic_said_setup.py develop --user) custom_openfold_utils(install locally via pip)
AFSample2: python run_openfold_benchmark_monomer_wrapper.py [OPTIONS]
AF-RW: python run_openfold_rw_monomer_wrapper.py [OPTIONS]
AF-Ensemble: python run_openfold_ensemble_monomer_wrapper.py [OPTIONS]| Argument | Description |
|---|---|
--alignment_dir |
Directory in which sequence alignment files are saved |
--output_dir_base |
Name of base directory in which output files are saved |
--template_mmcif_dir |
Directory containing mmCIF files for template search |
| Argument | Description |
|---|---|
--openfold_weights_dir |
Directory in which OpenFold weights are saved |
--af_weights_dir |
Directory in which AlphaFold weights are saved |
--use_af_weights |
Boolean flag indicating whether to use AlphaFold weights or OpenFold weights (defaults to AlphaFold weights) |
--num_predictions_per_model |
Number of predictions to generate via MSA masking (defaults to 100) |
| Argument | Description |
|---|---|
--openfold_weights_dir |
Directory in which OpenFold weights are saved |
--af_weights_dir |
Directory in which AlphaFold weights are saved |
--use_af_weights |
Boolean flag indicating whether to use AlphaFold weights or OpenFold weights (defaults to AlphaFold weights) |
--num_training_conformations |
Number of bootstrapped conformations used to generate parameters to run random walk (defaults to 3, increasing/decreasing this number can increase/decrease diversity of predictions with a corresponding linear increase/decrease in runtime) |
--num_rw_steps |
Number of predictions to generate for each set of random walk parameters (defaults to 100). The number of total predictions corresponds to num_rw_steps*num_training_conformations |
--module_config |
module_config_x where x is a number. Defaults to module_config_0. These configurable parameters affect which modules and layers of AlphaFold are updated. See rw_monomer_config.json for more info. |
--rw_hp_config |
hp_config_x where x is a number. Defaults to hp_config_0. These configurable parameters correspond to hyperparameters affecting the random walk. See rw_monomer_config.json for more info. |
--train_hp_config |
hp_config_x where x is a number. Defaults to hp_config_1. These configurable parameters correspond to hyperparameters used in the gradient descent phase of the pipeline. See rw_monomer_config.json for more info. |
| Argument | Description |
|---|---|
--ft_weights_dir |
Directory in which weights for all relevant fine-tuned models are saved |
--num_predictions_per_model |
Number of predictions to generate via MSA masking for each fine-tuned model (defaults to 3, increasing/decreasing this number can increase/decrease diversity of predictions with a corresponding linear increase/decrease in runtime). The total number of predcitions corresponds to num_predictions_per_model times the number of fine-tuned models in ft_weights_dir (566 by default) |
output_dir_base
|
+----> msa_mask_fraction=15
|
+---->template=none
|
+---->
initial_pred_msa_mask_fraction-0_unrelaxed.pdb
pred_*_msa_mask_fraction-15_unrelaxed.pdb
conformation_info.pkl
output_dir_base
|
+----> alternative_conformations-summary
|
+----> initial_pred
+----> initial_pred_unrelaxed.pdb
+----> rw_output
|
+----> target=conformation[0-9]
+----> ACCEPTED
+----> *.pdb
+----> REJECTED
+----> *.pdb
+----> conformation_info.pkl
output_dir_base
|
+---->model=*
|
+----> msa_mask_fraction=15
|
+---->template=none
|
+---->
initial_pred_msa_mask_fraction-0_unrelaxed.pdb
pred_*_msa_mask_fraction-15_unrelaxed.pdb
conformation_info.pkl
Clusters conformations generated by AF, extracts a representative subset, and refines each conformation to account for any structural violations.
numpypandassklearnscipypymolbiopythonpdbfixeropenmmcustom_openfold_utils
python gen_md_input_structures.py [OPTIONS]| Argument | Description |
|---|---|
--conformation_info_dir |
Parent or ancestor directory in which relevant conformation_info.pkl files are stored. |
--initial_pred_path |
Path to prediction made by default version of AF |
--num_clusters |
Number of clusters to group conformations into |
--num_md_structures |
Number of structures to refine and use as input to MD simulations |
--remove_disordered_tails |
Boolean flag indicating whether or not to remove disordered tails from structures |
--plddt_threshold |
Only clusters conformations whose pLDDT score is greater than this threshold (which is a value between 0-100). Defaults to 70. |
--disordered_threshold |
Only clusters conformations whose disordered percentage is less than this threshold (which is a value between 0-100). Defaults to 70. |
AFSample2: python gen_md_input_structures.py --conformation_info_dir=./misc_results/UniProt_ID/benchmark/msa_mask_fraction=15/template=none --initial_pred_path=./misc_results/UniProt_ID/benchmark/msa_mask_fraction=15/template=none/initial_pred_msa_mask_fraction-0_unrelaxed.pdb --num_clusters 10 --num_md_structures=50 --remove_disordered_tails
AF-RW: python gen_md_input_structures.py --conformation_info_dir=./misc_results/UniProt_ID/rw/alternative_conformations-summary/rw_output --initial_pred_path=./misc_results/UniProt_ID/rw/alternative_conformations-summary/initial_pred/initial_pred_unrelaxed.pdb --num_clusters 10 --num_md_structures=50 --remove_disordered_tails
AF-Ensemble: python gen_md_input_structures.py --conformation_info_dir=./misc_results/UniProt_ID/ensemble --initial_pred_path=./misc_results/UniProt_ID/benchmark/msa_mask_fraction=15/template=none/initial_pred_msa_mask_fraction-0_unrelaxed.pdb --num_clusters 10 --num_md_structures=50 --remove_disordered_tails
conformation_info_dir
|
+----> md_starting_structures
|
+---->num_clusters=10
|
+---->plddt_threshold=70
|
+---->disordered_threshold=70
|
+---->openmm_refined_structures
+---->cluster_*_idx_*_plddt_*_openmmrefinement.pdb
+---->cluster_*_idx_*_plddt_*_openmmrefinement_info.json
Runs unbiased MD simulations for structures generated in the previous part of pipeline.
pdbfixeropenmmparmedmdtraj
python unbiased_md_wrapper.py [OPTIONS]| Argument | Description |
|---|---|
--refined_conformations_dir |
Directory pointing to the location storing the folder openmm_refined_structures. |
--run_all_conformations |
Boolean flag specifying to run MD simulations for all conformations |
--run_single_conformation_per_cluster |
Boolean flag specifying to run MD simulations for only a single conformation per cluster (i.e if 10 clusters were generated, 10 simulations will run, each one starting from a specific cluster) |
--specific_cluster_to_run |
Only run simulations corresponding to this cluster (e.g 0, 1, 2, initial etc.) |
--output_dir |
Directory to save results from MD simulations |
--production_steps |
Number of steps to run for MD simulations (defaults to 125000000, corresponding to 250 ns) |
#runs simulations for single conformation per cluster
python unbiased_md_wrapper.py --refined_conformations_dir=./misc_results/UniProt_ID/rw/alternative_conformations-summary/rw_output/md_starting_structures/num_clusters=5/plddt_threshold=70/disordered_threshold=70/openmm_refined_structures --output_dir=./misc_results/UniProt_ID/md_output/rw --run_single_conformation_per_cluster
#runs simulations for all conformations
python unbiased_md_wrapper.py --refined_conformations_dir=./misc_results/UniProt_ID/rw/alternative_conformations-summary/rw_output/md_starting_structures/num_clusters=5/plddt_threshold=70/disordered_threshold=70/openmm_refined_structures --output_dir=./misc_results/UniProt_ID/md_output/rw --run_all_conformations
#runs simulations for single conformation for cluster 0 only
python unbiased_md_wrapper.py --refined_conformations_dir=./misc_results/UniProt_ID/rw/alternative_conformations-summary/rw_output/md_starting_structures/num_clusters=5/plddt_threshold=70/disordered_threshold=70/openmm_refined_structures --output_dir=./misc_results/UniProt_ID/md_output/rw --run_single_conformation_per_cluster --specific_cluster_to_run=0
#runs simulations for single conformation for default AF prediction only (i.e initial)
python unbiased_md_wrapper.py --refined_conformations_dir=./misc_results/UniProt_ID/rw/alternative_conformations-summary/rw_output/md_starting_structures/num_clusters=5/plddt_threshold=70/disordered_threshold=70/openmm_refined_structures --output_dir=./misc_results/UniProt_ID/md_output/rw --run_single_conformation_per_cluster --specific_cluster_to_run=initial
#runs simulations for all conformations for cluster 0 only
python unbiased_md_wrapper.py --refined_conformations_dir=./misc_results/UniProt_ID/rw/alternative_conformations-summary/rw_output/md_starting_structures/num_clusters=5/plddt_threshold=70/disordered_threshold=70/openmm_refined_structures --output_dir=./misc_results/UniProt_ID/md_output/rw --run_all_conformations --specific_cluster_to_run=0
output_dir
|
+----> cluster*-idx*
+---->cluster_*_idx_*_plddt_*_openmmrefinement.pdb
+---->trajectory.nc
+----> cluster*-idx*
+---->cluster_*_idx_*_plddt_*_openmmrefinement.pdb
+---->trajectory.nc
Runs unbiased MD simulations for structures generated in the previous part of pipeline.
MDAnalysisnumpypandasscipysklearnpymolcustom_openfold_utils- msmhelper
- Clustering
python msm_pipeline.py [OPTIONS]| Argument | Description |
|---|---|
--uniprot_id |
UniProt identifier |
--traj_parent_dir |
Absolute path to parent directory in which relevant trajectory.nc files are saved |
--initial_pred_path |
Absolute path to where default AF prediction refined via openmm is saved |
--exp_pdb_ids |
If you want to retrieve experimentally determined reference pdbs to supplement your analysis, feed in a list of strings separated by commas where strings are of the format XXX_Y, where XXXX is the pdb_id and Y is the chain_id |
--calc_similarity_md_frames_exp_pdb |
Boolean flag that if true, calculates the RMSD and TM-score between MD frames and experimentally determined reference pdbs (as specified by exp_pdb_ids). This adds a significant amount of extra computation and is primarily useful for detailed quantiative evaluation |
--TMalign_path |
Absolute path to TMalign executable. Used if calc_similarity_md_frames_exp_pdb is set to True |
--save_dir |
Absolute path to directory to save results |
--skip_input_coordinate_gen |
Boolean flag that if true, skips PCA feature construction and assumes it has already been generated. |
#without calculating similarity to experimental PDBs (faster and a more typical use case)
python msm_pipeline.py --uniprot_id=UniProt_ID --traj_parent_dir=/abs/path/to/UniProt_ID/md_output/rw --initial_pred_path=/abs/path/to/UniProt_ID/rw/alternative_conformations-summary/rw_output/md_starting_structures/num_clusters=10/plddt_threshold=70/disordered_threshold=70/openmm_refined_structures/initial_pred_openmm_refinement.pdb --exp_pdb_ids=XXX_Y --save_dir=/abs/path/to/misc_results/UniProt_ID/md_output/rw
#with calculating similarity to experimental PDBs (slower)
python msm_pipeline.py --uniprot_id=UniProt_ID --traj_parent_dir=/abs/path/to/UniProt_ID/md_output/rw --initial_pred_path=/abs/path/to/UniProt_ID/rw/alternative_conformations-summary/rw_output/md_starting_structures/num_clusters=10/plddt_threshold=70/disordered_threshold=70/openmm_refined_structures/initial_pred_openmm_refinement.pdb --exp_pdb_ids=XXX_Y --calc_similarity_md_frames_exp_pdb --TMalign_path=/abs/path/to/TMalign --save_dir=/abs/path/to/misc_results/UniProt_ID/md_output/rw
save_dir
|
+---->msm_pipeline_output
|
+---->pca_contacts (clustering results whose input features correspond PCA applied to C-alpha contacts)
|
+---->macrostate_info
|
+---->component*
|
+---->pymol_sessions
| +---->macrostate*.pse.gz
+---->pymol_sessions_publication (same as pymol_sessions except ribbon is shown in cylindrical representation)
+---->macrostate*.pse.gz
|
+---->pymol_sessions
+---->cluster_summary.pse.gz
+---->pca_pdist (clustering results whose input features correspond PCA applied to C-alpha pairwise distances)
|
+---->macrostate_info
|
+---->component*
|
+---->pymol_sessions
| +---->macrostate*.pse.gz
+---->pymol_sessions_publication (same as pymol_sessions except ribbon is shown in cylindrical representation)
+---->macrostate*.pse.gz
|
+---->pymol_sessions
+---->cluster_summary.pse.gz
This project was built on top of OpenFold
Citation: Ahdritz, G. et al. OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. Nat Methods 21, 1514-1524 (2024).