This is the internal codebase for the MosaicFM series of single-cell RNA-seq foundation models developed by Vevo Therapeutics. Our repository follows a similar structure to llm-foundry and imports several utility functions from it. Please follow the developer guidelines if you are contributing to this repository. For main results and documentation, please refer to the results section. If you are looking to train or finetune a model on single-cell data, please refer to the training section.
The repository is organized as follows:

- `mosaicfm/` — contains the building blocks for the MosaicFM models.
  - `mosaicfm/model/blocks` — building-block modules that may be used across models.
  - `mosaicfm/model/model` — full architectures subclassed from `ComposerModel`.
  - `mosaicfm/tasks/` — helper functions for downstream applications, such as embedding extraction.
  - `mosaicfm/tokenizer` — vocabulary building and tokenization functions.
  - `mosaicfm/data` — data loaders and collators.
  - `mosaicfm/utils` — miscellaneous utility functions, such as downloading files from S3.
- `scripts/` — contains scripts to train/evaluate models and to build datasets.
  - `scripts/train.py` — trains a model; accepts a YAML file or command-line arguments for specifying job parameters (see the example after this list).
  - `scripts/prepare_for_inference.py` — saves a model for inference by packaging it with the vocabulary and saving metadata.
  - `scripts/depmap` — scripts to run the DepMap benchmark.
- `mcli` — YAML files to configure and launch runs on the MosaicML platform.
- `runai` — YAML files to configure and launch runs on RunAI.
- `tutorials` — notebooks demonstrating some applications of the models.
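For example, a training run can be launched by pointing `scripts/train.py` at a config file. This is a minimal sketch: the YAML filename and the override key below are hypothetical placeholders, and the exact keys depend on your config schema.

```bash
# Launch a training run from a YAML config (filename is a placeholder).
python scripts/train.py yamls/my-train-config.yaml

# Parameters can also be supplied or overridden on the command line,
# llm-foundry-style; the override key here is only an example.
python scripts/train.py yamls/my-train-config.yaml max_duration=1ep
```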
We have tested our code on NVIDIA A100 and H100 GPUs with CUDA 12.1. At the moment, we are restricted to llm-foundry versions no later than v0.6.0, since support for the Triton implementation of flash-attention was removed in v0.7.0.
We support launching runs on the MosaicML platform as well as on local machines through RunAI. The recommended way to use MosaicFM is via the pre-built `vevotx/ml-scgpt` Docker image.
Currently, we have the following images available:

| Image Name | Base Image | Description |
|---|---|---|
| `vevotx/ml-scgpt:shreshth` | `docker.io/mosaicml/llm-foundry:2.2.1_cu121_flash2-813d596` | Image used for MosaicFM-1.3B (July 2024 release) |
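As a sketch, the image can be pulled and started with standard Docker commands (the volume mount is an example, not a requirement):

```bash
# Pull the pre-built image and start an interactive GPU-enabled container.
docker pull vevotx/ml-scgpt:shreshth
docker run --gpus all -it --rm \
    -v "$PWD":/workspace \
    vevotx/ml-scgpt:shreshth bash
```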
To install MosaicFM from source:

```bash
git clone https://github.com/vevotx/mosaicfm.git
cd mosaicfm
pip install -e .
```
Alternatively, to create a standalone environment with mamba:

```bash
git clone https://github.com/vevotx/mosaicfm.git
cd mosaicfm
mamba env create -f envs/mosaicfm_env.yml
mamba activate mosaicfm
pip install -e . --no-deps  # inside the mosaicfm directory
```
> **Note:** If you are on an H100 GPU, you may see `'sm_90' is not a recognized processor for this target (ignoring processor)`. This warning is expected and safe to ignore.
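A quick sanity check after installation (assuming a CUDA-capable GPU is visible):

```bash
# Verify that the package imports cleanly and that PyTorch can see the GPU.
python -c "import mosaicfm; import torch; print(torch.cuda.is_available())"
```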
The following datasets are used for training and evaluation:

| Dataset Path | Description |
|---|---|
| `s3://vevo-ml-datasets/vevo-scgpt/datasets/cellxgene_primary_2024-04-29_MDS/` | MDS dataset comprising ~45M cells from the Apr 2024 CellxGene release and Vevo dataset 35 (resistance-is-futile) |
| `s3://vevo-ml-datasets/vevo-scgpt/datasets/cellxgene_primary_2023-12-15_MDS_v2/` | MDS dataset comprising ~34M cells from the Dec 2023 CellxGene release |
| `s3://vevo-ml-datasets/umair/scgpt-depmap/` | Root folder containing the DepMap dataset and model predictions |
| `s3://vevo-drives/drive_3/ANALYSIS/analysis_107/` | Root folder containing MSigDB data and model predictions |
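Assuming your AWS credentials are configured, you can inspect or mirror a dataset with the AWS CLI (the local destination path is an example):

```bash
# List the shards of the Apr 2024 training dataset.
aws s3 ls s3://vevo-ml-datasets/vevo-scgpt/datasets/cellxgene_primary_2024-04-29_MDS/

# Optionally mirror the dataset to a local cache.
aws s3 sync s3://vevo-ml-datasets/vevo-scgpt/datasets/cellxgene_primary_2024-04-29_MDS/ \
    /data/cellxgene_primary_2024-04-29_MDS/
```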
The following pretrained checkpoints are available:

| Model Name | Run Name | Path to Checkpoints | WandB ID |
|---|---|---|---|
| MosaicFM-1.3B | scgpt-1_3b-2048-prod | `s3://vevo-scgpt/models/scgpt-1_3b-2048-prod/` | lv6jl8kl |
| MosaicFM-70M | scgpt-70m-1024-fix-norm-apr24-data | `s3://vevo-scgpt/models/scgpt-70m-1024-fix-norm-apr24-data/` | 55n5wvdm |
| MosaicFM-25M | scgpt-25m-1024-fix-norm-apr24-data | `s3://vevo-scgpt/models/scgpt-25m-1024-fix-norm-apr24-data/` | bt4a1luo |
| MosaicFM-9M | scgpt-test-9m-full-data | `s3://vevo-scgpt/models/scgpt-test-9m-full-data/` | di7kyyf1 |
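To fetch a checkpoint locally, for example before running `scripts/prepare_for_inference.py` or an evaluation, copy it down from S3 (the destination path is an example):

```bash
# Download the MosaicFM-70M checkpoints from S3.
aws s3 cp --recursive \
    s3://vevo-scgpt/models/scgpt-70m-1024-fix-norm-apr24-data/ \
    checkpoints/scgpt-70m-1024-fix-norm-apr24-data/
```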
Links to evaluations and benchmarks, along with detailed results and analysis, are provided in our technical report: Internal Link
We use the Black code style and the Ruff linter to maintain consistency across contributions.
Please set up pre-commit and run the repository-level hooks before committing any changes. Please do not push to master directly: create a new branch and open a pull request for review.

To set up the pre-commit hooks, run the following commands:

```bash
pip install pre-commit
pre-commit install
pre-commit run --all-files  # before committing
```
We also encourage new contributions to use type annotations and docstrings for functions and classes. In the future we will add pyright and pydocstyle checks to the pre-commit hooks. We encourage the use of Google-style docstrings.
If you will be launching any training or evaluation runs, please also make sure you have access to S3, WandB, and mcli/RunAI by reaching out on #infrastructure.
We would like to thank the developers of the following open-source projects: