kepler

scRNA-seq cell annotation using a VAE architecture

overview

kepler is a minimal, production-ready pipeline for training variational autoencoders (VAEs) on single-cell RNA-seq data. it learns latent representations of cells and uses them for cell type annotation.

installation

# install dependencies (uv recommended)
uv sync --extra ml --extra dev

# generate machine config (optional, for optimized defaults)
python config/generate_config.py

quick start

tutorial workflow (steps 1-3): data preparation

prepare data with batch correction and qc filtering:

# run all data preparation steps (load, batch correction, qc filtering)
uv run python -m src.app.cli --config config/default.yaml prepare_data

# or run steps individually:
uv run python -m src.app.cli --config config/default.yaml load_data
uv run python -m src.app.cli --config config/default.yaml batch_correct --output_dir outputs/<run_id>
uv run python -m src.app.cli --config config/default.yaml qc_filter --output_dir outputs/<run_id>

expected outputs:

  • outputs/<run_id>/figures/qc_violins_baseline.png - qc metrics before filtering
  • outputs/<run_id>/figures/umap_comparison.png - before/after batch correction
  • outputs/<run_id>/figures/qc_violins_filtered.png - qc metrics after filtering
  • outputs/<run_id>/data_processed.h5ad - ready for model training
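
to sanity-check the prepared file, it can be read back with scanpy (a minimal sketch; scanpy is already part of the preprocessing stack, and <run_id> must be replaced with an actual run directory):

import scanpy as sc

adata = sc.read_h5ad("outputs/<run_id>/data_processed.h5ad")  # replace <run_id>
print(adata)             # n_obs x n_vars summary plus stored annotations
print(adata.obs.head())  # per-cell metadata, e.g. qc metrics and batch labels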

model training (steps 4-5): VAE and transformer

train and compare both models:

# train VAE model (unsupervised representation learning)
uv run python -m src.app.cli --config config/default.yaml train_vae --output_dir outputs/<run_id>

# train transformer model (supervised cell type classification)
uv run python -m src.app.cli --config config/default.yaml train_transformer --output_dir outputs/<run_id>

# compare both models
uv run python -m src.app.cli --config config/default.yaml compare --output_dir outputs/<run_id>

expected outputs:

  • outputs/<run_id>/figures/vae_training_curves.png - VAE loss curves (ELBO, recon, KL)
  • outputs/<run_id>/figures/transformer_training_curves.png - transformer loss and accuracy curves
  • outputs/<run_id>/figures/transformer_confusion_matrix.png - transformer prediction confusion matrix
  • outputs/<run_id>/figures/model_comparison.png - side-by-side performance comparison
  • outputs/<run_id>/model/ - trained VAE model
  • outputs/<run_id>/transformer_model/ - trained transformer model
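
for reference, the curves in vae_training_curves.png track the standard VAE objective: the negative ELBO decomposes into a reconstruction term plus a KL term. a generic PyTorch sketch of that decomposition (illustrative only, not the repo's exact code; an scVI-style model would use a negative binomial likelihood rather than MSE):

import torch
import torch.nn.functional as F

def negative_elbo(x, encoder, decoder):
    mu, logvar = encoder(x)                                        # q(z|x) parameters
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()           # reparameterization trick
    x_hat = decoder(z)
    recon = F.mse_loss(x_hat, x, reduction="sum")                  # "recon" curve
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())   # "KL" curve
    return recon + kl                                              # negative ELBO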

full pipeline

# run legacy pipeline (train VAE + eval + viz)
uv run python -m src.app.cli --config config/default.yaml pipeline

individual commands

# train only
uv run python -m src.app.cli --config config/default.yaml train

# eval (requires trained model)
uv run python -m src.app.cli --config config/default.yaml --output_dir outputs/<run_id> eval

# visualize (requires trained model)
uv run python -m src.app.cli --config config/default.yaml --output_dir outputs/<run_id> viz
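
conceptually, eval scores cell type annotation from the learned latent space, e.g. with a kNN classifier controlled by knn_k (see configuration below). a generic scikit-learn sketch with stand-in data, not the repo's actual eval code:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
z_train, y_train = rng.normal(size=(200, 10)), rng.integers(0, 3, 200)  # stand-ins for latent embeddings + labels
z_test, y_test = rng.normal(size=(50, 10)), rng.integers(0, 3, 50)

clf = KNeighborsClassifier(n_neighbors=15)  # n_neighbors plays the role of knn_k
clf.fit(z_train, y_train)
pred = clf.predict(z_test)
print(accuracy_score(y_test, pred), f1_score(y_test, pred, average="macro"))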

configuration

edit config/default.yaml to customize:

  • model architecture (latent_dim, hidden_dims, dropout)
  • training hyperparameters (batch_size, learning_rate, epochs)
  • dataset preprocessing (n_top_genes, normalization)
  • evaluation settings (knn_k, use_logistic_regression)

machine-specific settings (threads, batch size) auto-merge from config.yml if present (generated by config/generate_config.py, see installation).
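
the merge can be pictured as the defaults being overlaid with any machine-specific values. a sketch of the idea (the repo's real merge may be recursive, and key names should be checked against config/default.yaml):

import os
import yaml

with open("config/default.yaml") as f:
    cfg = yaml.safe_load(f)

if os.path.exists("config.yml"):       # machine config, if generated
    with open("config.yml") as f:
        cfg.update(yaml.safe_load(f))  # shallow overlay: machine values win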

testing

# run all tests
uv run pytest -q

# run specific test suite
uv run pytest tests/unit/ -v
uv run pytest tests/fuzz/ -v

# with coverage
uv run pytest --cov=src --cov-report=html

outputs

results are saved to outputs/<run_id>/ (a timestamped directory):

  • config.used.yaml - resolved configuration
  • metrics.json - training metrics and loss curves
  • eval_metrics.json - cell annotation performance (accuracy, f1-macro)
  • model/model.pt - trained model checkpoint
  • figures/ - UMAP plots, training curves, latent distributions
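
these artifacts can be reloaded for downstream analysis. a sketch (the checkpoint layout inside model.pt is repo-specific, so treat the loaded object accordingly):

import json
import torch

run = "outputs/<run_id>"    # replace with a real run directory
with open(f"{run}/eval_metrics.json") as f:
    metrics = json.load(f)  # accuracy, f1-macro
state = torch.load(f"{run}/model/model.pt", map_location="cpu")
print(metrics)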

development

# install pre-commit hooks
uv run pre-commit install

# run linter
uv run ruff check .
uv run ruff format .

# run type checking (once mypy is added)
# uv run mypy src/

performance

  • training: ~5-10 min on CPU (PBMC 3k cells, 20 epochs)
  • memory: <4GB RAM
  • tests: <30s

references

  • dataset: PBMC 3k from 10x Genomics (via Scanpy)
  • model: scVI-style VAE (Lopez et al., 2018)
  • preprocessing: standard Scanpy pipeline

license

MIT
