Bodenmiller Lab template for Python data analysis projects using Jupyter notebooks
To create a new project from this template:
```
cookiecutter https://github.com/BodenmillerGroup/cookiecutter-jupyter
```
After project creation, it is recommended to initialize git and add the origin:
```
cd <package_name>
git init
git remote add origin <origin_url>
```
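As a concrete sketch, the same sequence with hypothetical values filled in (`my_analysis` and the origin URL are placeholders standing in for `<package_name>` and `<origin_url>`), followed by an optional first commit of the generated template files:

```shell
# Hypothetical example values; replace with your own project name and remote.
cd my_analysis
git init
git remote add origin git@github.com:user/my_analysis.git

# Optionally commit the generated template files right away:
git add .
git commit -m "Initialize project from cookiecutter-jupyter template"
```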
The created project contains both a pip-style requirements.txt file and a conda-style environment.yml file:
- The requirements.txt file should contain all Python packages required for executing the code in the project repository. This file allows the user to install the most recent versions of all packages and should therefore not be version-pinned, unless specific package versions are required.
- The environment.yml file should contain a conda environment for which the correct execution of the code in the project is guaranteed, including binary packages and the Python runtime. This file allows the user to reproduce the analysis results for which the project was created and should therefore be version-pinned.
Both the requirements.txt file and the environment.yml file are prepopulated with essential dependencies. Initially, the environment.yml file is not version-pinned, to allow for the creation of a fresh conda environment after project initialization. Following conda environment creation, this file should be replaced by an all-pinned environment file as follows:
```
conda env export > environment.yml
```
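Put together, the post-creation workflow could look like the following sketch (the environment name `my_analysis` is an assumption; the actual name is defined in the generated environment.yml):

```shell
# Create a fresh environment from the initially unpinned environment.yml ...
conda env create -f environment.yml

# ... activate it ("my_analysis" is a hypothetical environment name) ...
conda activate my_analysis

# ... and replace environment.yml with a fully version-pinned export:
conda env export > environment.yml
```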
Also, for sharing your analysis environments, consider containerization tools such as Singularity or Docker.
This topic was discussed in more detail in issue #2.
By default, Jupyter notebook files ending with .ipynb are not versioned. This can be changed anytime by removing the corresponding line from the project's generated .gitignore file.
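For example, assuming the generated .gitignore excludes notebooks with a line reading exactly `*.ipynb` (an assumption about the template's output), the following sketch removes that line so notebooks become versioned:

```shell
# Assumption: the generated .gitignore contains a line "*.ipynb".
# Remove that line so Jupyter notebooks are picked up by git again.
sed -i '/^\*\.ipynb$/d' .gitignore
git add .gitignore
```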
Advantages of versioning Jupyter notebooks:
- GitHub supports and renders the static output of Jupyter notebook files.
- "Lab journal-style" Jupyter notebooks: not only the code, but also the output embedded in the notebooks is versioned. This makes it possible to track analysis results, as long as they are embedded in the Jupyter notebook.
- Pure code changes can still be tracked on an individual file level by simultaneously version-controlling the `.py` files autogenerated by jupytext (enabled by default, see example.ipynb).
- Per-commit code changes can be viewed using third-party tools such as nbdime.
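The jupytext pairing mentioned above can also be managed from the command line. A sketch, assuming jupytext is installed in the project environment and that the percent script format is the desired pairing (both are assumptions, not prescribed by this template):

```shell
# Pair example.ipynb with an autogenerated percent-format .py script
# ("py:percent" is an assumed format choice):
jupytext --set-formats ipynb,py:percent example.ipynb

# Keep the paired files in sync after editing either one:
jupytext --sync example.ipynb
```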
Disadvantages of versioning Jupyter notebooks:
- Processed data should be easily reproducible by simply rerunning the code at the respective revision; storing such data is therefore often neither required nor desirable.
- In principle, large and/or binary data should not be stored in git repositories, but tracked using appropriate data storage systems (file system, dolt, dvc, ...) instead. Storing (large) binary files in git repositories increases the physical disk space and data transfer requirements and makes it harder to understand changes on a per-commit level without third-party tooling (see above).
- Code & version history duplication: both the `.ipynb` and the autogenerated `.py` files contain the same code.
- Processed data not embedded in Jupyter notebooks has to be tracked separately from the output embedded in Jupyter notebooks. Also, not all data embedded in Jupyter notebooks can be stored (e.g. interactive visualization results).
Whether to version-control Jupyter notebooks or not is a design choice. See issue #3 for details.
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
- Jonas Windhager
- Vito Zanotelli