scRNA-seq cell annotation using a VAE architecture
kepler is a minimal, production-ready pipeline for training Variational Autoencoders (VAEs) on single-cell RNA-seq data. It learns latent representations of cells and uses them for cell type annotation tasks.
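For orientation, here is a hedged sketch of the scVI-style idea: an encoder maps each cell's expression vector to a latent Gaussian, a sampled latent code is decoded back into expression space, and training maximizes the ELBO (a reconstruction term plus a KL term). This is illustrative PyTorch, not the kepler model code; the layer sizes are placeholders, and a Gaussian/MSE reconstruction is used for brevity where scVI-style models use a count likelihood.

```python
# Illustrative sketch only -- not the kepler implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleVAE(nn.Module):
    def __init__(self, n_genes: int, latent_dim: int = 16, hidden_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_genes, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, latent_dim)      # mean of q(z|x)
        self.logvar = nn.Linear(hidden_dim, latent_dim)  # log-variance of q(z|x)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, n_genes)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar

def elbo_loss(x, recon, mu, logvar):
    # Negative ELBO = reconstruction error + KL(q(z|x) || N(0, I))
    recon_loss = F.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl
```

The latent means produced by the encoder serve as the per-cell representation used downstream for annotation.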
# install dependencies (using uv recommended)
uv sync --extra ml --extra dev
# generate machine config (optional, for optimized defaults)
python config/generate_config.py

Prepare data with batch correction and QC filtering:
# run all data preparation steps (load, batch correction, qc filtering)
uv run python -m src.app.cli --config config/default.yaml prepare_data
# or run steps individually:
uv run python -m src.app.cli --config config/default.yaml load_data
uv run python -m src.app.cli --config config/default.yaml batch_correct --output_dir outputs/<run_id>
uv run python -m src.app.cli --config config/default.yaml qc_filter --output_dir outputs/<run_id>

Expected outputs:
- outputs/<run_id>/figures/qc_violins_baseline.png - QC metrics before filtering
- outputs/<run_id>/figures/umap_comparison.png - before/after batch correction
- outputs/<run_id>/figures/qc_violins_filtered.png - QC metrics after filtering
- outputs/<run_id>/data_processed.h5ad - ready for model training
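Roughly, the prepare_data steps correspond to standard Scanpy operations. The snippet below is only a sketch of what QC filtering and batch correction can look like: the thresholds, the input path, the batch key, and the choice of ComBat are assumptions, and kepler's actual settings come from config/default.yaml.

```python
# Hedged Scanpy sketch; thresholds, paths, and the batch key are assumptions.
import scanpy as sc

adata = sc.read_h5ad("data.h5ad")  # hypothetical input file

# QC metrics: counts, genes per cell, mitochondrial fraction
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

# Drop low-quality cells and rarely expressed genes
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata = adata[adata.obs["pct_counts_mt"] < 20].copy()

# One possible batch correction (ComBat); other methods slot in the same way
if "batch" in adata.obs:
    sc.pp.combat(adata, key="batch")

adata.write("data_processed.h5ad")
```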
Train and compare both models:
# train VAE model (unsupervised representation learning)
uv run python -m src.app.cli --config config/default.yaml train_vae --output_dir outputs/<run_id>
# train transformer model (supervised cell type classification)
uv run python -m src.app.cli --config config/default.yaml train_transformer --output_dir outputs/<run_id>
# compare both models
uv run python -m src.app.cli --config config/default.yaml compare --output_dir outputs/<run_id>

Expected outputs:
- outputs/<run_id>/figures/vae_training_curves.png - VAE loss curves (ELBO, reconstruction, KL)
- outputs/<run_id>/figures/transformer_training_curves.png - transformer loss and accuracy curves
- outputs/<run_id>/figures/transformer_confusion_matrix.png - transformer prediction confusion matrix
- outputs/<run_id>/figures/model_comparison.png - side-by-side performance comparison
- outputs/<run_id>/model/ - trained VAE model
- outputs/<run_id>/transformer_model/ - trained transformer model
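The transformer's exact architecture isn't documented in this section, so the following is only a generic sketch of supervised transformer-style classification over expression features; the token construction, layer sizes, and pooling are all assumptions rather than kepler's design.

```python
# Generic illustrative sketch, not the kepler transformer.
import torch
import torch.nn as nn

class CellTypeTransformer(nn.Module):
    def __init__(self, n_genes: int, n_classes: int, d_model: int = 64,
                 n_heads: int = 4, n_tokens: int = 16):
        super().__init__()
        # Project the expression vector into a short token sequence; real
        # gene-token transformers typically embed individual genes instead.
        self.embed = nn.Linear(n_genes, n_tokens * d_model)
        self.n_tokens, self.d_model = n_tokens, d_model
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                                   # x: (batch, n_genes)
        tokens = self.embed(x).view(-1, self.n_tokens, self.d_model)
        h = self.encoder(tokens).mean(dim=1)                # mean-pool over tokens
        return self.head(h)                                 # cell-type logits
```

Training such a classifier is ordinary supervised learning with cross-entropy on the cell-type labels, which is what the accuracy curves and confusion matrix above summarize.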
# run legacy pipeline (train VAE + eval + viz)
uv run python -m src.app.cli --config config/default.yaml pipeline

# train only
uv run python -m src.app.cli --config config/default.yaml train
# eval (requires trained model)
uv run python -m src.app.cli --config config/default.yaml --output_dir outputs/<run_id> eval
# visualize (requires trained model)
uv run python -m src.app.cli --config config/default.yaml --output_dir outputs/<run_id> viz

Edit config/default.yaml to customize:
- model architecture (latent_dim, hidden_dims, dropout)
- training hyperparameters (batch_size, learning_rate, epochs)
- dataset preprocessing (n_top_genes, normalization)
- evaluation settings (knn_k, use_logistic_regression)
Machine-specific settings (threads, batch size) are auto-merged from config.yml if present.
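As a rough illustration of that merge, the sketch below loads config/default.yaml and overlays keys from config.yml when it exists; the merge logic and key names in kepler may differ (a shallow dict update is an assumption).

```python
# Hedged sketch of config loading with a machine-specific overlay.
from pathlib import Path
import yaml

def load_config(default_path="config/default.yaml", machine_path="config.yml"):
    with open(default_path) as f:
        cfg = yaml.safe_load(f)
    machine = Path(machine_path)
    if machine.exists():
        with machine.open() as f:
            overrides = yaml.safe_load(f) or {}
        cfg.update(overrides)  # machine-specific keys (threads, batch size) win
    return cfg
```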
# run all tests
uv run pytest -q
# run specific test suite
uv run pytest tests/unit/ -v
uv run pytest tests/fuzz/ -v
# with coverage
uv run pytest --cov=src --cov-report=html

Results are saved to outputs/<timestamp>/:
- config.used.yaml: resolved configuration
- metrics.json: training metrics and loss curves
- eval_metrics.json: cell annotation performance (accuracy, f1-macro)
- model/model.pt: trained model checkpoint
- figures/: UMAP plots, training curves, latent distributions
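For context, numbers like those in eval_metrics.json can be produced by classifying cells directly from the VAE latent space. The sketch below uses scikit-learn with random placeholder data; knn_k maps to n_neighbors here, and the repo may also use logistic regression per the use_logistic_regression setting.

```python
# Illustrative latent-space annotation evaluation; not the kepler eval code.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 16))    # placeholder for per-cell latent codes
labels = rng.integers(0, 3, size=300)  # placeholder cell-type labels

z_train, z_test, y_train, y_test = train_test_split(
    latent, labels, test_size=0.2, stratify=labels, random_state=0
)
knn = KNeighborsClassifier(n_neighbors=15)  # knn_k from the config
knn.fit(z_train, y_train)
pred = knn.predict(z_test)
print({"accuracy": accuracy_score(y_test, pred),
       "f1_macro": f1_score(y_test, pred, average="macro")})
```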
# install pre-commit hooks
uv run pre-commit install
# run linter
uv run ruff check .
uv run ruff format .
# run type checking (if mypy added)
# uv run mypy src/

- training: ~5-10 min on CPU (PBMC 3k cells, 20 epochs)
- memory: <4GB RAM
- tests: <30s
- dataset: PBMC 3k from 10x Genomics (via Scanpy)
- model: scVI-style VAE (Lopez et al., 2018)
- preprocessing: standard Scanpy pipeline
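The standard Scanpy preprocessing referenced above looks roughly like the sketch below; the parameter values are illustrative, and the pipeline's actual settings (e.g. n_top_genes) live in config/default.yaml.

```python
# Standard Scanpy preprocessing sketch for PBMC 3k; values are illustrative.
import scanpy as sc

adata = sc.datasets.pbmc3k()                          # PBMC 3k from 10x Genomics
sc.pp.normalize_total(adata, target_sum=1e4)          # library-size normalization
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)  # n_top_genes from config
adata = adata[:, adata.var["highly_variable"]].copy() # keep HVGs for the model
```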
MIT