MosaicFM

This is the internal codebase for the MosaicFM series of single-cell RNA-seq foundation models developed by Vevo Therapeutics. Our repository follows a similar structure to llm-foundry and imports several utility functions from it. Please follow the developer guidelines if you are contributing to this repository. For main results and documentation, please refer to the results section. If you are looking to train or finetune a model on single-cell data, please refer to the training section.

The repository is organized as follows:

  • mosaicfm/ contains the building blocks for the MosaicFM models.
    • mosaicfm/model/blocks Building block modules that may be used across models.
    • mosaicfm/model/model Full architectures subclassed from ComposerModel.
    • mosaicfm/tasks/ Helper functions for downstream applications, such as embedding extraction.
    • mosaicfm/tokenizer Vocabulary building and tokenization functions.
    • mosaicfm/data Data loaders and collators.
    • mosaicfm/utils Miscellaneous utility functions, such as downloading files from S3.
  • scripts/ contains scripts to train/evaluate models and to build datasets.
    • scripts/train.py Script to train a model. Accepts a YAML file or command-line arguments for specifying job parameters (see the launch sketch after this list).
    • scripts/prepare_for_inference.py Script to save a model for inference by packaging it with the vocabulary and saving metadata.
    • scripts/depmap Scripts to run the depmap benchmark.
  • mcli/ contains yaml files to configure and launch runs on the MosaicML platform.
  • runai/ contains yaml files to configure and launch runs on RunAI.
  • tutorials/ contains notebooks demonstrating applications of the models.
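
As a minimal sketch of a training launch (the YAML path below is hypothetical; use one of the configs under mcli/ or runai/, or your own):

python scripts/train.py path/to/config.yaml
composer scripts/train.py path/to/config.yaml # optional: multi-GPU launch via the Composer launcher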

Hardware and Software Requirements

We have tested our code on NVIDIA A100 and H100 GPUs with CUDA 12.1. At the moment, we are also restricted to a version of llm-foundry no later than v0.6.0, since support for the Triton implementation of flash-attention was removed in v0.7.0.
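
If you are building your own environment rather than using the provided Docker image or environment file, this constraint can be expressed directly with pip (a sketch only; the pin in the environment file takes precedence):

pip install "llm-foundry<0.7.0" # i.e. v0.6.0 or earlier, so the triton flash-attention path is still available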

We support launching runs on the MosaicML platform as well as on local machines through RunAI. The recommended way to use MosaicFM is the pre-built vevotx/ml-scgpt Docker image.

Currently, we have the following images available:

  • vevotx/ml-scgpt:shreshth
    Base image: docker.io/mosaicml/llm-foundry:2.2.1_cu121_flash2-813d596
    Description: image used for MosaicFM-1.3B (July 2024 release)
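
As a sketch (assuming Docker with the NVIDIA Container Toolkit is installed for GPU access), the image can be pulled and started interactively with:

docker pull vevotx/ml-scgpt:shreshth
docker run --gpus all -it vevotx/ml-scgpt:shreshth bash # then follow the installation steps below inside the container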

Installation

With docker

# Run inside the vevotx/ml-scgpt container (see Hardware and Software Requirements)
git clone https://github.com/vevotx/mosaicfm.git
cd mosaicfm
pip install -e .

Without docker

git clone https://github.com/vevotx/mosaicfm.git 
cd mosaicfm
mamba env create -f envs/mosaicfm_env.yml
mamba activate mosaicfm
pip install -e . --no-deps # Inside the mosaicfm directory

Note

If you are on an H100 GPU, you may see the warning 'sm_90' is not a recognized processor for this target (ignoring processor). This is expected and safe to ignore.
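
To verify the installation (a minimal check that only confirms the package imports cleanly):

python -c "import mosaicfm"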

Datasets

The following datasets are used for training and evaluation:

  • s3://vevo-ml-datasets/vevo-scgpt/datasets/cellxgene_primary_2024-04-29_MDS/
    MDS dataset comprising ~45M cells from the April 2024 CellxGene release and Vevo dataset 35 (resistance-is-futile).
  • s3://vevo-ml-datasets/vevo-scgpt/datasets/cellxgene_primary_2023-12-15_MDS_v2/
    MDS dataset comprising ~34M cells from the December 2023 CellxGene release.
  • s3://vevo-ml-datasets/umair/scgpt-depmap/
    Root folder containing the DepMap dataset and model predictions.
  • s3://vevo-drives/drive_3/ANALYSIS/analysis_107/
    Root folder containing MSigDB data and model predictions.
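
As a sketch (assuming the AWS CLI is installed and you have credentials with read access to these buckets; see Developer Guidelines), the MDS shards can be inspected with:

aws s3 ls s3://vevo-ml-datasets/vevo-scgpt/datasets/cellxgene_primary_2024-04-29_MDS/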

Pre-trained Models

  • MosaicFM-1.3B
    Run name: scgpt-1_3b-2048-prod
    Checkpoints: s3://vevo-scgpt/models/scgpt-1_3b-2048-prod/
    WandB ID: lv6jl8kl
  • MosaicFM-70M
    Run name: scgpt-70m-1024-fix-norm-apr24-data
    Checkpoints: s3://vevo-scgpt/models/scgpt-70m-1024-fix-norm-apr24-data/
    WandB ID: 55n5wvdm
  • MosaicFM-25M
    Run name: scgpt-25m-1024-fix-norm-apr24-data
    Checkpoints: s3://vevo-scgpt/models/scgpt-25m-1024-fix-norm-apr24-data/
    WandB ID: bt4a1luo
  • MosaicFM-9M
    Run name: scgpt-test-9m-full-data
    Checkpoints: s3://vevo-scgpt/models/scgpt-test-9m-full-data/
    WandB ID: di7kyyf1
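
To pull a checkpoint locally (a sketch; assumes AWS credentials with access to the bucket, and the local destination path is hypothetical):

aws s3 sync s3://vevo-scgpt/models/scgpt-25m-1024-fix-norm-apr24-data/ ./checkpoints/mosaicfm-25m/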

Results

Please refer to our technical report for detailed results and analysis, including links to evaluations and benchmarks: Internal Link

Developer Guidelines

We use the black code style and the Ruff linter to maintain consistency across contributions. Please set up pre-commit and run the repository-level hooks before committing any changes. Please do not push to master directly; create a new branch and open a pull request for review. To set up pre-commit hooks, run the following commands:

pip install pre-commit
pre-commit install
pre-commit run --all-files # Before committing

We also encourage new contributions to use type annotations and Google-style docstrings for functions and classes. In the future, we will add pyright and pydocstyle checks to the pre-commit hooks.

If you will be launching any training or evaluation runs, please also make sure you have access to S3, WandB, and mcli/RunAI by reaching out on #infrastructure.
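
As a quick sketch for checking that credentials are wired up (assumes the AWS CLI, wandb, and mcli clients are installed):

aws sts get-caller-identity # confirms AWS/S3 credentials resolve
wandb login # prompts for (or confirms) a WandB API key
mcli init # configures access to the MosaicML platform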

Acknowledgements

We would like to thank the developers of llm-foundry, Composer, and flash-attention, as well as the other open-source projects that this codebase builds on.
