DRFP

An NLP-inspired chemical reaction fingerprint based on basic set arithmetic.

Read the associated open access article

Description

Predicting the nature and outcome of reactions using computational methods is an important tool to accelerate chemical research. The recent application of deep learning-based learned fingerprints to reaction classification and reaction yield prediction has shown an impressive increase in performance compared to previous methods such as DFT- and structure-based fingerprints. However, learned fingerprints require large training data sets, are inherently biased, and are based on complex deep learning architectures. Here we present the differential reaction fingerprint DRFP. The DRFP algorithm takes a reaction SMILES as input and creates a binary fingerprint based on the symmetric difference of two sets containing the circular molecular n-grams generated from the molecules listed to the left and to the right of the reaction arrow, respectively, without the need to distinguish between reactants and reagents. We show that DRFP outperforms DFT-based fingerprints in reaction yield prediction and other structure-based fingerprints in reaction classification, and reaches the performance of state-of-the-art learned fingerprints in both tasks while being data-independent.
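
The set arithmetic described above can be illustrated with a minimal sketch. This is not the DRFP implementation: the n-gram strings are hand-written placeholders rather than circular substructures extracted from the molecules, and the function name `symmetric_difference_bits`, the 32-bit length, and the use of BLAKE2b as the hash are assumptions made for illustration only.

```python
import hashlib


def symmetric_difference_bits(left_ngrams, right_ngrams, n_bits=32):
    """Fold the symmetric difference of two n-gram sets into a binary vector."""
    # Keep only the n-grams that occur on exactly one side of the reaction arrow.
    changed = set(left_ngrams) ^ set(right_ngrams)

    bits = [0] * n_bits
    for ngram in changed:
        # Hash each n-gram to an integer, then fold it into the vector with a modulo.
        digest = hashlib.blake2b(ngram.encode("utf-8"), digest_size=4).digest()
        bits[int.from_bytes(digest, "big") % n_bits] = 1
    return changed, bits


# Hypothetical n-grams for an ester hydrolysis (illustrative placeholders only).
left = {"CC", "C(=O)OC", "CO", "O"}
right = {"CC", "C(=O)O", "CO"}

changed, bits = symmetric_difference_bits(left, right)
print(sorted(changed))
print(sum(bits), "of", len(bits), "bits set")
```

N-grams shared by both sides ("CC" and "CO" here) cancel out, so only the substructures that change during the reaction contribute to the fingerprint.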

Getting Started

The best way to start exploring DRFP is on Binder. A notebook that gets you started on creating and using DRFP:

Binder

A notebook that explains how you can use SHAP to analyse and interpret your machine learning models when using DRFP:

Binder

To reproduce the electronic laboratory notebook (ELN) experiments, see the Reproduce section.

Installation and Usage

DRFP can be installed from PyPI using pip install drfp.

Once DRFP is installed, there are two ways to use it: the CLI app drfp or the library provided by the package.

CLI

drfp my_rxn_smiles.txt my_rxn_fps.pkl -d 512

This will create a pickle dump containing a NumPy ndarray of DRFP fingerprints with a dimensionality of 512. To also export the mapping, use the flag --mapping. This will create the additional file my_rxn_fps.map.pkl. You can call drfp --help to show all available flags and options.
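
The pickle written by the CLI can be read back with the standard library. The following is a minimal sketch that assumes the file name from the command above. `load_pickle` is a helper name introduced here, not part of drfp, and unpickling the fingerprint file requires NumPy to be installed.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load a single pickled object from `path`."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


fps_path = Path("my_rxn_fps.pkl")
if fps_path.exists():
    fps = load_pickle(fps_path)
    # With -d 512, each fingerprint is expected to have 512 entries.
    print(type(fps), len(fps))
```

The same helper can load my_rxn_fps.map.pkl when --mapping was used. As with any pickle, only load files from a source you trust.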

Library

The following is a basic example of how to use DRFP in a Python script.

from drfp import DrfpEncoder

rxn_smiles = [
    "CO.O[C@@H]1CCNC1.[C-]#[N+]CC(=O)OC>>[C-]#[N+]CC(=O)N1CC[C@@H](O)C1",
    "CCOC(=O)C(CC)c1cccnc1.Cl.O>>CCC(C(=O)O)c1cccnc1",
]

fps = DrfpEncoder.encode(rxn_smiles)

The variable fps now points to a list containing the fingerprints for the two reaction SMILES as numpy arrays.
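
Because each fingerprint is a binary vector, two reactions can be compared by the overlap of their set bits. The sketch below computes a Tanimoto (Jaccard) similarity in plain Python. `tanimoto` is a helper introduced here, not part of drfp, and the two short vectors are hand-written stand-ins for real fingerprints.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two equal-length binary fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    # Bits set in both fingerprints, and bits set in at least one of them.
    shared = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    total = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return shared / total if total else 0.0


# Toy 8-bit vectors (illustrative only): 3 shared bits, 5 bits set overall.
fp_a = [1, 0, 1, 1, 0, 0, 0, 1]
fp_b = [1, 0, 0, 1, 0, 1, 0, 1]
print(tanimoto(fp_a, fp_b))  # 0.6
```

With the output of the example above, the same function could be called as tanimoto(fps[0], fps[1]).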

Documentation

The library contains the class DrfpEncoder with one public method encode.

| DrfpEncoder.encode() | Description | Type | Default |
| --- | --- | --- | --- |
| X | An iterable (e.g. a list) of reaction SMILES or a single reaction SMILES to be encoded | Iterable or str | |
| n_folded_length | The folded length of the fingerprint (the parameter for the modulo hashing) | int | 2048 |
| min_radius | The minimum radius of a substructure (0 includes single atoms) | int | 0 |
| radius | The maximum radius of a substructure | int | 3 |
| rings | Whether to include full rings as substructures | bool | True |
| mapping | Return a feature to substructure mapping in addition to the fingerprints. If true, the return signature of this method is Tuple[List[np.ndarray], Dict[int, Set[str]]] | bool | False |
| atom_index_mapping | Return the atom indices of mapped substructures for each reaction | bool | False |
| root_central_atom | Whether to root the central atom of substructures when generating SMILES | bool | True |
| include_hydrogens | Whether to explicitly include hydrogens in the molecular graph | bool | False |
| show_progress_bar | Whether to show a progress bar when encoding reactions | bool | False |
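
When mapping is set to True, the returned dictionary links fingerprint positions to the substructure SMILES that set them, following the return signature given for that parameter. The sketch below shows one way to read such a mapping. The fingerprint and mapping are small hand-written stand-ins, and `substructures_for_set_bits` is a hypothetical helper, not part of drfp.

```python
def substructures_for_set_bits(fingerprint, mapping):
    """Return {bit index: sorted substructure SMILES} for every set bit."""
    return {
        index: sorted(mapping.get(index, set()))
        for index, bit in enumerate(fingerprint)
        if bit
    }


# Toy stand-ins for illustration. In practice they would come from:
#   fps, mapping = DrfpEncoder.encode(rxn_smiles, mapping=True)
toy_fp = [0, 1, 0, 0, 1, 0, 0, 0]
toy_mapping = {1: {"C(=O)O"}, 4: {"CO", "O"}, 6: {"CC"}}

print(substructures_for_set_bits(toy_fp, toy_mapping))
```

A bit that lists more than one substructure, such as index 4 in the toy data, corresponds to several substructures folded onto the same position.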

Reproduce

Want to reproduce the results in our paper? You can find all the data in the data folder and encoding and training scripts in the scripts folder.

Electronic Laboratory Notebook (ELN) Experiment

To reproduce the experiments on the electronic laboratory notebook (ELN) data:

  1. Clone this repository: git clone git@github.com:reymond-group/drfp.git
  2. Install drfp either with pip install drfp or, from the cloned directory, with pip install . (the latter installs the current development version of drfp).
  3. Encode the data using the script encode_az_reactions.py. This will write the files az-2048-3-true.pkl and az-2048-3-true.pkl.gz to the folder data/az.
cd scripts
python encoding/encode_az_reactions.py
  4. Train and test the XGBoost model using the script yield_prediction_az.py (or yield_prediction_az_rf.py for the random forest model):
python training/yield_prediction_az.py
python training/yield_prediction_az_rf.py

Cite Us

@article{probst2022reaction,
  title={Reaction Classification and Yield Prediction using the Differential Reaction Fingerprint DRFP},
  author={Probst, Daniel and Schwaller, Philippe and Reymond, Jean-Louis},
  journal={Digital Discovery},
  year={2022},
  publisher={Royal Society of Chemistry}
}

Development Setup

This project uses UV for dependency management. To set up a development environment:

  1. Install UV following the official instructions

  2. Clone the repository:

git clone https://github.com/reymond-group/drfp
cd drfp
  3. Install dependencies including development packages:
uv sync --dev
  4. Run tests:
uv run pytest
