The BEND paper (ICLR 2024) is available here:
"BEND: BENCHMARKING DNA LANGUAGE MODELS ON BIOLOGICALLY MEANINGFUL TASKS"
Frederikke Isa Marin, Felix Teufel, Marc Horlacher, Dennis Madsen, Dennis Pultz, Ole Winther, Wouter Boomsma
Documentation for the BEND code repository
All data is available for download here
The data can be downloaded via a script, see section 2.5
The data for each task is stored as a bed file. This file includes the genomic coordinates for each sample, as well as its split membership and potentially a label. Together with a reference genome, the file is used to extract the DNA sequences for training. Labels that are too complex to be stored in a column of the text-based bed file are stored in an hdf5 file. The two files share their index, so that sample i in the bed file matches record i in the hdf5 file.
bed is a tab-separated format that can be read like a regular table. All our task files include a `split` column, and optionally a `label` column. If `label` is missing, the labels are found in the hdf5 file of the same name.
```
chromosome  start    end      split  label
chr1        1055037  1055849  train  1
chr3        1070026  1070436  valid  0
```
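As a sketch of how the paired files can be consumed (the file paths and the hdf5 dataset key `'labels'` are assumptions; inspect the files of the task you downloaded):

```python
import h5py
import pandas as pd

# read the tab-separated bed file as a regular table
bed = pd.read_csv('data/some_task/some_task.bed', sep='\t')
train = bed[bed['split'] == 'train']

# when there is no label column, record i in the hdf5 file
# corresponds to row i of the bed file
with h5py.File('data/some_task/some_task.hdf5', 'r') as f:
    labels = f['labels'][train.index.values]  # dataset key is an assumption
```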
Prerequisites: Install uv, a fast Python package manager:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

One-command setup:
```bash
git clone https://github.com/frederikkemarin/BEND.git
cd BEND
make setup          # Set up environment with Python 3.11
make download-data  # Download dataset via Google Cloud Storage (fast)
source .venv/bin/activate  # Activate environment
```

Data download options:
```bash
# Fast download via Google Cloud Storage (recommended)
make download-data           # Requires gsutil (Google Cloud SDK)

# Original download method (fallback)
make download-data-original  # Uses the original download script
```

Interactive setup:
```bash
./setup_dev.sh  # Interactive script with setup options
```

Available make commands:
```bash
make help                    # Show all available commands
make setup                   # Basic environment setup
make setup-dev               # Development setup with testing tools
make download-data           # Download dataset (Google Cloud Storage - fast)
make download-data-original  # Download dataset (original method - fallback)
make test                    # Run tests
make format                  # Format code
make clean                   # Clean build artifacts
make status                  # Check environment status
```

If you prefer manual setup or don't have make available:
- Clone the BEND repository:
  ```bash
  git clone https://github.com/frederikkemarin/BEND.git
  ```
- Change to the BEND directory:
  ```bash
  cd BEND
  ```
- Install Python 3.11:
  ```bash
  uv python install 3.11
  ```
- Create a virtual environment:
  ```bash
  uv venv --python 3.11
  ```
- Install the requirements:
  ```bash
  uv pip install -r requirements.txt
  ```
- Install BEND in development mode:
  ```bash
  source .venv/bin/activate && uv pip install -e .
  ```
- Download the data:
  - Fast (recommended):
    ```bash
    mkdir -p data && gsutil -m cp -r gs://curvebio-mahdibaghbanzadeh/bend/* data/
    ```
  - Fallback:
    ```bash
    python scripts/download_bend.py
    ```
Prerequisites for fast download:
- Install Google Cloud SDK: https://cloud.google.com/sdk/docs/install
- This provides the `gsutil` command for efficient data transfer
Prerequisites for HyenaDNA models:
- Install Git LFS (Large File Storage): Required for downloading HyenaDNA model checkpoints
```bash
# On Ubuntu/Debian
sudo apt-get install git-lfs

# On macOS with Homebrew
brew install git-lfs

# On other systems, see: https://git-lfs.github.io/
```
- Initialize Git LFS in your repository:
```bash
git lfs install
```
Note: We recommend Python 3.11 due to compatibility issues with some dependencies in Python 3.12.
When training models on multi-GPU systems, you may want to specify which GPU to use. Here are some useful tricks for GPU device selection:
Option 1: This is the cleanest approach: make only a specific GPU visible to the process via CUDA_VISIBLE_DEVICES:
```bash
# Use only GPU 3 (appears as cuda:0 to PyTorch)
CUDA_VISIBLE_DEVICES=3 python scripts/train_on_task.py --config-name gene_finding embedder=dnabert2

# Use only GPU 1
CUDA_VISIBLE_DEVICES=1 python scripts/precompute_embeddings.py --config-name embed task=gene_finding model=hyenadna-medium-160k
```

Option 2: You can specify the device ID directly in the command:
```bash
python scripts/train_on_task.py --config-name gene_finding embedder=dnabert2 params.device_id=3
```

Note: This may cause issues with DataParallel on multi-GPU systems. Use Option 1 or 3 instead.
Option 3: For training with multiple specific GPUs:
```bash
# Use GPUs 2 and 3 (appear as cuda:0 and cuda:1 to PyTorch)
CUDA_VISIBLE_DEVICES=2,3 python scripts/train_on_task.py --config-name gene_finding embedder=dnabert2

# Use GPUs 0, 1, and 3
CUDA_VISIBLE_DEVICES=0,1,3 python scripts/train_on_task.py --config-name gene_finding embedder=dnabert2
```

Benefits of this approach:
- Avoids device conflicts: Prevents DataParallel from trying to use all available GPUs
- Clean resource allocation: Other processes can't accidentally use your specified GPUs
- Local effect only: This is an environment variable that only affects the specific command/process you run it with
- No system-wide changes: Your system GPU configuration remains unchanged
- No impact on other terminals: Other terminal sessions and processes are unaffected
- Temporary: The effect ends when the command completes
- Process-specific: Each command can use different GPU configurations independently
- Works with all training modes: Compatible with single-GPU, DataParallel, and DDP training
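The same effect can be achieved from inside Python, as long as the variable is set before torch is first imported (a minimal sketch, not part of the BEND scripts):

```python
import os

# Must be set before torch is imported: CUDA enumerates devices on first use.
os.environ["CUDA_VISIBLE_DEVICES"] = "3"  # physical GPU 3 becomes cuda:0

import torch

print(torch.cuda.device_count())  # 1 visible device
device = torch.device("cuda:0")   # refers to physical GPU 3
```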
For training downstream models, it is practical to precompute and save the embeddings to avoid recomputing them at each epoch. As embeddings can grow large when working with genomes, we use WebDataset tar.gz files as the format.
First, download the desired data from the data folder and place it in BEND/ (for ease of use, maintain the same folder structure).
To precompute the embeddings for all models and tasks, run:
```bash
python scripts/precompute_embeddings.py
```
This script automatically uses the hydra config file at conf/embedding/embed.yaml.
By default, embeddings are generated for all tasks and models. To alter the tasks/models for which to compute the embeddings, please alter the tasks and/or models lists in the config file (under `hydra.sweeper`) or override the behaviour from the command line in the following manner:
```bash
python scripts/precompute_embeddings.py model=resnetlm,awdlstm task=gene_finding,enhancer_annotation
```
Train, validation and test embeddings are saved in chunks of 50,000 samples (default). To parallelize embedding generation, you can call precompute_embeddings.py as above multiple times, adding arguments of the form chunk=[10,11,12] splits=[train,valid] to the individual calls so that each call only computes specific chunks; see the launcher sketch below. If these arguments are not provided, the command defaults to computing all chunks and splits.
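One way to fan chunk ranges out in parallel is a small launcher script; the sketch below (chunk ranges, model, and task are illustrative) starts two such calls as separate processes:

```python
import subprocess

# launch one precompute_embeddings.py call per chunk range (illustrative ranges)
jobs = [
    subprocess.Popen([
        "python", "scripts/precompute_embeddings.py",
        "model=resnetlm", "task=gene_finding",
        f"chunk={chunks}", "splits=[train,valid]",
    ])
    for chunks in ("[0,1,2,3,4]", "[5,6,7,8,9]")
]

# wait for all chunk jobs to finish
for job in jobs:
    job.wait()
```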
If you need to make embeddings for other purposes than preparing downstream task data, bend.utils.embedders contains wrapper classes around the individual models. Each embedder takes a path (or name, if available on HuggingFace) of a checkpoint as the first argument, and provides an embed() method that takes a list of sequences and returns a list of embeddings.
Embedders have a default-true argument remove_special_tokens=True in embed() that removes any [CLS] and [SEP] tokens from the returned embeddings. For models that return fewer embedding vectors than their number of input nucleotides, embeddings can be upsampled to the original input sequence length using the upsample_embeddings=True argument in embed().
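For instance, both options can be combined in a single embed() call (a sketch using the Nucleotide Transformer checkpoint shown later in this document):

```python
from bend.utils.embedders import NucleotideTransformerEmbedder

embedder = NucleotideTransformerEmbedder('InstaDeepAI/nucleotide-transformer-2.5b-multi-species')

# keep special-token embeddings, and upsample so that one embedding
# vector is returned per input nucleotide
embeddings = embedder.embed(
    ['AGGATGCCGAGAGTATATGGGA'],
    remove_special_tokens=False,
    upsample_embeddings=True,
)
```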
Some of the embedders currently also support computing logits and cross entropy losses, see the documentation for more information.
| Embedder | Reference | Models | Info |
|---|---|---|---|
| DNABertEmbedder | Ji et al. | 4 different k-mer tokenizations available | has an additional argument kmer=6 to specify the k-mer size. |
| NucleotideTransformerEmbedder | Dalla-Torre et al. | 8 different models available | |
| ConvNetEmbedder | BEND | 1 model available | A baseline LM used in BEND. |
| AWDLSTMEmbedder | BEND | 1 model available | A baseline LM used in BEND. |
| GPNEmbedder | Benegas et al. | Models trained on A. thaliana and Brassicales available | This LM was not evaluated in BEND as it was not trained on the human genome. |
| GENALMEmbedder | Fishman et al. | 8 different models available | |
| HyenaDNAEmbedder | Nguyen et al. | 5 different models available | Requires Git LFS to be installed to automatically download checkpoints. Instead of the HF checkpoint name, the argument when instantiating needs to be of the format path/to/save/checkpoints/checkpoint_name |
| DNABert2Embedder | Zhou et al. | 1 model available | |
| GROVEREmbedder | Sanabria et al. | 1 model available | The original BPE tokenizer is not available, so we apply MaxMatch for segmentation of the input sequence into tokens. |
| CaduceusEmbedder | Schiff et al. | 2 different models available | Requires mamba-ssm==1.2.0.post1 to be installed in the environment. |
All embedders can be used as follows:
```python
from bend.utils.embedders import NucleotideTransformerEmbedder

# load the embedder with a valid checkpoint name or path
embedder = NucleotideTransformerEmbedder('InstaDeepAI/nucleotide-transformer-2.5b-multi-species')

# embed a list of sequences
embeddings = embedder.embed(['AGGATGCCGAGAGTATATGGGA', 'CCCAACCGAGAGTATATGTTAT'])

# or just call directly to embed a single sequence
embedding = embedder('AGGATGCCGAGAGTATATGGGA')
```

```python
# This requires Git LFS and will automatically download the checkpoint, if not already present
from bend.utils.embedders import HyenaDNAEmbedder
embedder = HyenaDNAEmbedder('pretrained_models/hyenadna/hyenadna-tiny-1k-seqlen')
```

BEND now includes support for models that have been continuously pretrained on bisulfite-sequencing (BS-seq) data. These models are fine-tuned versions of existing architectures that have been further trained to understand methylation patterns and modified DNA sequences.
Available BS-seq Models:
| Model | Base Architecture | Description | Configuration |
|---|---|---|---|
| dnabert2-bs-seq | DNABERT-2 | DNABERT-2 continuously pretrained on BS-seq data | Uses DNABert2Embedder |
| hyenadna-bs-seq | HyenaDNA | HyenaDNA continuously pretrained on BS-seq data | Uses HyenaDNAEmbedder |
Setting up BS-seq Models:
To use these models, you need to download the pretrained checkpoints from Google Cloud Storage and configure them locally:
```bash
# Download DNABERT-2 BS-seq checkpoint
mkdir -p ./pretrained_models/dnabert2-bs-seq
gsutil -m cp -r gs://curvebio-mahdibaghbanzadeh/neurips_2025/pretrain_dnabert2_s3/checkpoint-100000/* ./pretrained_models/dnabert2-bs-seq/

# Download HyenaDNA BS-seq checkpoint
mkdir -p ./pretrained_models/hyenadna-bs-seq
gsutil -m cp -r gs://curvebio-mahdibaghbanzadeh/neurips_2025/pretrain_hyena_s3/checkpoint-100000/* ./pretrained_models/hyenadna-bs-seq/
```

Usage:
```python
# Using DNABERT-2 BS-seq model
from bend.utils.embedders import DNABert2Embedder
embedder = DNABert2Embedder('pretrained_models/dnabert2-bs-seq')
embeddings = embedder.embed(['AGGATGCCGAGAGTATATGGGA'])

# Using HyenaDNA BS-seq model
from bend.utils.embedders import HyenaDNAEmbedder
embedder = HyenaDNAEmbedder('pretrained_models/hyenadna-bs-seq')
embeddings = embedder.embed(['AGGATGCCGAGAGTATATGGGA'])
```

Running with BS-seq Models:
```bash
# Precompute embeddings with BS-seq models
python scripts/precompute_embeddings.py model=dnabert2-bs-seq task=gene_finding
python scripts/precompute_embeddings.py model=hyenadna-bs-seq task=cpg_methylation

# Train downstream tasks
python scripts/train_on_task.py --config-name cpg_methylation embedder=dnabert2-bs-seq,hyenadna-bs-seq
```

These BS-seq models are particularly well-suited for tasks involving:
- CpG methylation prediction
- Chromatin accessibility analysis
- Epigenetic modifications
- Any task where understanding of methylation patterns is important
Training on a task requires that the above step (computing the embeddings) has been completed first.
The embeddings should afterwards be located in BEND/data/{task_name}/{embedder}/*tar.gz
To run a downstream task, run (from BEND/):
```bash
python scripts/train_on_task.py --config-name {task}
```
By default the task is run on all embeddings. To alter this, either modify the config file or change the settings from the command line. E.g., to run gene finding on all embeddings, the command line is:
```bash
python scripts/train_on_task.py --config-name gene_finding
```
To run only on resnetlm and awdlstm embeddings:
```bash
python scripts/train_on_task.py --config-name gene_finding embedder=resnetlm,awdlstm
```
The full list of current task names is:
`gene_finding`, `enhancer_annotation`, `variant_effects`, `histone_modification`, `chromatin_accessibility`, `cpg_methylation`
And the list of available embedders/models used for training on the tasks is:
`awdlstm`, `resnetlm`, `nt_transformer_ms`, `nt_transformer_human_ref`, `dnabert6`, `resnet_supervised`, `onehot`, `nt_transformer_1000g`, `dnabert2`, `dnabert2-bs-seq`, `gena-lm-bigbird-base-t2t`, `gena-lm-bert-large-t2t`, `hyenadna-large-1m`, `hyenadna-tiny-1k`, `hyenadna-small-32k`, `hyenadna-medium-160k`, `hyenadna-bs-seq`, `grover`
The train_on_task.py script calls a trainer class, bend.utils.task_trainer. All configurations required to adapt the script and trainer to a specific task (input data, downstream model, parameters, evaluation metric etc.) are specified in the task-specific hydra config files stored in the conf directory. This minimizes the changes required to the scripts in order to introduce a new task.
The results of a run can be found at:
BEND/downstream_tasks/{task_name}/{embedder}/
If desired, the config files can be modified to change parameters, output/input directory etc.
For unsupervised prediction of variant effects, embeddings don't have to be precomputed and stored. Embeddings are generated and directly evaluated using:
```bash
python3 scripts/predict_variant_effects.py {variant_file_name}.bed {output_file_name}.csv {model_type} {path_to_checkpoint} {path_to_reference_genome_fasta} --embedding_idx {position_of_embedding}
```

There are two variant effect prediction tasks available for {variant_file_name}: variants with expression effect (eQTLs) in variant_effects_expression.bed and disease-causing variants in variant_effects_disease.bed.
A notebook with an example of how to run the script and evaluate the results can be found in examples/unsupervised_variant_effects.ipynb. To run all models, you can use the script scripts/run_variant_effects.sh.
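Conceptually, the unsupervised score compares the embedding of the reference sequence with that of the variant sequence at the variant position. The sketch below illustrates the idea with a cosine distance on a single substitution; it is a simplification, so consult the notebook and script for the exact procedure:

```python
import numpy as np
from bend.utils.embedders import NucleotideTransformerEmbedder

embedder = NucleotideTransformerEmbedder('InstaDeepAI/nucleotide-transformer-2.5b-multi-species')

ref_seq = 'AGGATGCCGAGAGTATATGGGA'
alt_seq = 'AGGATGCCGAGCGTATATGGGA'  # single substitution at position 11
pos = 11

# upsample so that there is one embedding vector per nucleotide
ref = embedder.embed([ref_seq], upsample_embeddings=True)[0]
alt = embedder.embed([alt_seq], upsample_embeddings=True)[0]

# cosine distance between the two embeddings at the variant position
r, a = ref[pos], alt[pos]
score = 1 - np.dot(r, a) / (np.linalg.norm(r) * np.linalg.norm(a))
print(score)
```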
All embedders are defined in bend/utils/embedders.py and inherit BaseEmbedder. A new embedder needs to implement load_model, which should set up all required attributes of the class and handle loading the model checkpoint into memory. It also needs to implement embed, which takes a list of sequences, and returns a list of embedding matrices formatted as numpy arrays. The embed method should be able to handle sequences of different lengths.
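A minimal skeleton for a new embedder might look as follows (a sketch; it assumes BaseEmbedder forwards constructor arguments to load_model, and uses zero matrices as placeholder embeddings):

```python
import numpy as np

from bend.utils.embedders import BaseEmbedder


class MyEmbedder(BaseEmbedder):
    def load_model(self, checkpoint_path, *args, **kwargs):
        # set up tokenizer/model attributes and load the checkpoint into memory
        self.checkpoint_path = checkpoint_path

    def embed(self, sequences, *args, **kwargs):
        # return one (sequence_length, hidden_dim) numpy array per input
        # sequence; must handle sequences of different lengths
        hidden_dim = 8  # placeholder for this sketch
        return [np.zeros((len(seq), hidden_dim)) for seq in sequences]
```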
As the first step, the data for a new task needs to be formatted in the bed-based format. If necessary, split and label columns should be included. The next step is to add new config files to conf/supervised_tasks. You should create a new directory named after the task, and add a config file for each embedder you want to evaluate. The config files should be named after the embedder.
The datasets included in BEND were collected from a variety of sources. When you use any of the datasets, please ensure to correctly cite the respective original publications describing each dataset.
@inproceedings{
marin2024bend,
title={{BEND}: Benchmarking {DNA} Language Models on Biologically Meaningful Tasks},
author={Frederikke Isa Marin and Felix Teufel and Marc Horlacher and Dennis Madsen and Dennis Pultz and Ole Winther and Wouter Boomsma},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=uKB4cFNQFg}
}
Gene finding (GENCODE)
@article{frankish_gencode_2021,
title = {{GENCODE} 2021},
volume = {49},
issn = {0305-1048},
url = {https://doi.org/10.1093/nar/gkaa1087},
doi = {10.1093/nar/gkaa1087},
number = {D1},
urldate = {2022-09-26},
journal = {Nucleic Acids Research},
author = {Frankish, Adam and Diekhans, Mark and Jungreis, Irwin and Lagarde, Julien and Loveland, Jane E and Mudge, Jonathan M and Sisu, Cristina and Wright, James C and Armstrong, Joel and Barnes, If and Berry, Andrew and Bignell, Alexandra and Boix, Carles and Carbonell Sala, Silvia and Cunningham, Fiona and Di Domenico, Tomás and Donaldson, Sarah and Fiddes, Ian T and García Girón, Carlos and Gonzalez, Jose Manuel and Grego, Tiago and Hardy, Matthew and Hourlier, Thibaut and Howe, Kevin L and Hunt, Toby and Izuogu, Osagie G and Johnson, Rory and Martin, Fergal J and Martínez, Laura and Mohanan, Shamika and Muir, Paul and Navarro, Fabio C P and Parker, Anne and Pei, Baikang and Pozo, Fernando and Riera, Ferriol Calvet and Ruffier, Magali and Schmitt, Bianca M and Stapleton, Eloise and Suner, Marie-Marthe and Sycheva, Irina and Uszczynska-Ratajczak, Barbara and Wolf, Maxim Y and Xu, Jinuri and Yang, Yucheng T and Yates, Andrew and Zerbino, Daniel and Zhang, Yan and Choudhary, Jyoti S and Gerstein, Mark and Guigó, Roderic and Hubbard, Tim J P and Kellis, Manolis and Paten, Benedict and Tress, Michael L and Flicek, Paul},
month = jan,
year = {2021},
pages = {D916--D923},
}
Chromatin accessibility (ENCODE)
Histone modification (ENCODE)
CpG methylation (ENCODE)
@article{noauthor_integrated_2012,
title = {An {Integrated} {Encyclopedia} of {DNA} {Elements} in the {Human} {Genome}},
volume = {489},
issn = {0028-0836},
url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3439153/},
doi = {10.1038/nature11247},
number = {7414},
urldate = {2023-05-23},
journal = {Nature},
month = sep,
year = {2012},
pmid = {22955616},
pmcid = {PMC3439153},
pages = {57--74},
}
Enhancer annotation (Fulco et al., Gasperini et al., Avsec et al. )
Enhancers
@article{fulco_activity-by-contact_2019,
title = {Activity-by-contact model of enhancer–promoter regulation from thousands of {CRISPR} perturbations},
volume = {51},
copyright = {2019 The Author(s), under exclusive licence to Springer Nature America, Inc.},
issn = {1546-1718},
url = {https://www.nature.com/articles/s41588-019-0538-0},
doi = {10.1038/s41588-019-0538-0},
language = {en},
number = {12},
urldate = {2023-05-23},
journal = {Nature Genetics},
author = {Fulco, Charles P. and Nasser, Joseph and Jones, Thouis R. and Munson, Glen and Bergman, Drew T. and Subramanian, Vidya and Grossman, Sharon R. and Anyoha, Rockwell and Doughty, Benjamin R. and Patwardhan, Tejal A. and Nguyen, Tung H. and Kane, Michael and Perez, Elizabeth M. and Durand, Neva C. and Lareau, Caleb A. and Stamenova, Elena K. and Aiden, Erez Lieberman and Lander, Eric S. and Engreitz, Jesse M.},
month = dec,
year = {2019},
note = {Number: 12
Publisher: Nature Publishing Group},
keywords = {Epigenetics, Epigenomics, Functional genomics, Gene expression, Gene regulation},
pages = {1664--1669},
}
Enhancers
@article{gasperini_genome-wide_2019,
title = {A {Genome}-wide {Framework} for {Mapping} {Gene} {Regulation} via {Cellular} {Genetic} {Screens}},
volume = {176},
issn = {0092-8674},
url = {https://www.sciencedirect.com/science/article/pii/S009286741831554X},
doi = {10.1016/j.cell.2018.11.029},
language = {en},
number = {1},
urldate = {2023-05-23},
journal = {Cell},
author = {Gasperini, Molly and Hill, Andrew J. and McFaline-Figueroa, José L. and Martin, Beth and Kim, Seungsoo and Zhang, Melissa D. and Jackson, Dana and Leith, Anh and Schreiber, Jacob and Noble, William S. and Trapnell, Cole and Ahituv, Nadav and Shendure, Jay},
month = jan,
year = {2019},
keywords = {CRISPR, CRISPRi, RNA-seq, crisprQTL, eQTL, enhancer, gene regulation, genetic screen, human genetics, single cell},
pages = {377--390.e19},
}
Transcription start sites
@article{avsec_effective_2021,
title = {Effective gene expression prediction from sequence by integrating long-range interactions},
volume = {18},
copyright = {2021 The Author(s)},
issn = {1548-7105},
url = {https://www.nature.com/articles/s41592-021-01252-x},
doi = {10.1038/s41592-021-01252-x},
language = {en},
number = {10},
urldate = {2023-05-23},
journal = {Nature Methods},
author = {Avsec, Žiga and Agarwal, Vikram and Visentin, Daniel and Ledsam, Joseph R. and Grabska-Barwinska, Agnieszka and Taylor, Kyle R. and Assael, Yannis and Jumper, John and Kohli, Pushmeet and Kelley, David R.},
month = oct,
year = {2021},
note = {Number: 10
Publisher: Nature Publishing Group},
keywords = {Gene expression, Machine learning, Software, Transcriptomics},
pages = {1196--1203},
}
Noncoding Variant Effects (Expression) (DeepSEA)
DeepSEA's data was sourced from GRASP and the 1000 Genomes Project, which should also be attributed accordingly.
@article{zhou_predicting_2015,
title = {Predicting effects of noncoding variants with deep learning–based sequence model},
url = {https://www.nature.com/articles/nmeth.3547},
doi = {10.1038/nmeth.3547},
language = {en},
number = {10},
urldate = {2023-06-07},
journal = {Nature Methods},
author = {Zhou, Jian and Troyanskaya, Olga G},
year = {2015},
}
Noncoding variant effects (Disease) (ClinVar)
In case the variant consequences categories are used, Ensembl VEP should be attributed.
@article{10.1093/nar/gkz972,
author = {Landrum, Melissa J and Chitipiralla, Shanmuga and Brown, Garth R and Chen, Chao and Gu, Baoshan and Hart, Jennifer and Hoffman, Douglas and Jang, Wonhee and Kaur, Kuljeet and Liu, Chunlei and Lyoshin, Vitaly and Maddipatla, Zenith and Maiti, Rama and Mitchell, Joseph and O’Leary, Nuala and Riley, George R and Shi, Wenyao and Zhou, George and Schneider, Valerie and Maglott, Donna and Holmes, J Bradley and Kattman, Brandi L},
title = "{ClinVar: improvements to accessing data}",
journal = {Nucleic Acids Research},
volume = {48},
number = {D1},
pages = {D835-D844},
year = {2019},
month = {11},
issn = {0305-1048},
doi = {10.1093/nar/gkz972},
url = {https://doi.org/10.1093/nar/gkz972},
eprint = {https://academic.oup.com/nar/article-pdf/48/D1/D835/31698033/gkz972.pdf},
}
Git LFS Not Installed: If you get "mkdir: missing operand" errors when trying to use HyenaDNA models, ensure Git LFS is installed:
```bash
# Install Git LFS
sudo apt-get install git-lfs  # Ubuntu/Debian
brew install git-lfs          # macOS

# Initialize in your repository
git lfs install
```

PyTorch UnpicklingError:
If you encounter an `UnpicklingError` related to `weights_only` when loading HyenaDNA models, this is due to PyTorch 2.6+ security changes. The issue has been fixed in the codebase by setting `weights_only=False` for trusted model checkpoints.
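For reference, if you load checkpoints with your own code, the explicit opt-in under PyTorch 2.6+ looks like this (only use weights_only=False for checkpoints you trust, as unpickling can execute arbitrary code; the path is illustrative):

```python
import torch

# explicit opt-in to full unpickling for a trusted checkpoint
state = torch.load('path/to/checkpoint.ckpt', map_location='cpu', weights_only=False)
```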
Model Download Issues:
- Ensure you have sufficient disk space (HyenaDNA models can be >100MB)
- Check your internet connection for large file downloads
- Verify the model path format: pretrained_models/hyenadna/model-name
Due to their tokenization strategies, some models by default return fewer embedding vectors than their number of input nucleotides. As we still require nucleotide-level output for nucleotide-level prediction tasks, we implement upsampling strategies to match the number of returned embeddings to the number of input nucleotides.
| Model | Upsampling strategy |
|---|---|
| DNABert | The overlapping k-mer tokenization strategy of DNABert causes some "missing embeddings" at the start and the end of the input sequence, as there is no context to build the k-mer tokens from. For k=3, we repeat the first and the last embedding vectors once. For k=4, we repeat the first once and the last twice. For k=5, we repeat the first and the last twice. For k=6, we repeat the first twice and the last three times. |
| Nucleotide Transformer | Due to 6-mer tokenization, each embedding is repeated 6 times. Remainder tokens are single nucleotides and left as-is. |
| GENA-LM, DNABERT-2, GROVER | BPE tokens have variable length. We repeat each embedding vector to the length of the sequence represented by its token. |
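As an illustration of the repeat-based strategies above, the BPE case amounts to repeating each token embedding once per nucleotide covered by its token (a sketch with made-up shapes):

```python
import numpy as np

token_embeddings = np.random.rand(3, 768)  # 3 BPE tokens, hidden size 768
token_lengths = np.array([4, 2, 6])        # nucleotides covered by each token

# repeat each token embedding to the length of the sequence its token represents
upsampled = np.repeat(token_embeddings, token_lengths, axis=0)
assert upsampled.shape == (12, 768)  # one vector per input nucleotide
```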