-
-
Notifications
You must be signed in to change notification settings - Fork 237
Beta support for configurable dependency resolution & Biocontainers. #214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
This is awesome! I will review at the soonest opportunity (likely early next week). "CWL-specific Dockerfiles" I think you mean Docker-specific CWL files? |
|
I definitely want this feature, but I'd like to see some more work before merging. Are you able to continue working on this branch in response to comments? Otherwise can see about working off of this branch. The intended design of SoftwareRequirement is to avoid picking favorites with the packaging system. The Resolution should work something like this:
|
|
Reviewing #212, I'm not sure why a separate PYTHON_RUN_SCRIPT is necessary. Everything that it does can be accomplished generating shell code? |
|
@tetron I don't think the specs should strictly be needed - I can look at seqtk and find it Conda, Brew, and Debian. Same is going to hold for most packages in our experience. Back in #93 I made the same argument passionately - that time about bwa.
That comment was +1'd by @pvanheus who has a lot more experience adapting software inventories these platforms - Galaxy and cwl. The tool author shouldn't need to know every last package manager that this may be deployed against - forcing them to know this a priori is a mistake and is a recipe for non-future facing, brittle tools. If instead you want to use the spec field as an optional hint when the answer isn't the intuitive, natural one, then I think this idea is acceptable and I'm more than happy to add it to the galaxy-lib stuff myself. The resulting improvement would benefit Galaxy at the margins also. But it is the margins - I assure you. If you ignore the spec thing however - |
cfc9da0 to
1625611
Compare
|
@tetron Rebased with noted spelling fix - thanks for the catch! |
Ok, this seems reasonable. In that case, the behavior should be something like, check Let me see if I'm following: the strategy is to generate shell script which does the package installation prior to running the command. This may run on a separate node (or VM or container environment) from the workflow engine. How does this work with multiple package managers or package versions? Does it generate an if-then-else chain? |
|
@tetron So a limitation of the current implementation is that it does the install or generates the shell code to execute runs on the node running cwltool. (This works on fine on clusters that have a certain directory mounted across them and in Galaxy land our remote job execution tool Pulsar knows how to resolve dependencies so for jobs without a shared file system this would happen closer to the job.) I guess cwltool is somewhat explicitly single node so this shouldn't be a problem right? Explicit support for something like Toil would be great - I'd be happy to field PRs to galaxy-lib to enable that or do it myself if I sense enough eagerness for it. If multiple "package managers" are used (the term is a misnomer since something like Galaxy packages or environment modules are not package mangers) - each may resolve packages individually and produce its own string for injecting the environment into the shell. |
This adapts CWL-style ``SoftwareRequirement`` ``specs`` to solve galaxyproject#1927. Here I'm trying to implement the CWL specification in a way that helps enable the feasibility of Conda packaging in Galaxy. It is a delicate balancing act aimed to upset as many interested parties as I can. To understand the problem - consider the ``blast+`` requirement found in the Galaxy wrappers. It looks something like this: ``` <requirement type="package" version="2.2.31" name="blast+"> ``` Great, that works for Galaxy and Tool Shed packages. It doesn't work for bioconda at all. I think this problem is fairly uncommon - most packages have simple names shared across Debian, Brew, Conda, etc... - but it does happen in some cases that there are inconsistencies. Some people have taken to duplicating the requirement - this is bad and should not be done since they are mutually exclusive and Galaxy will attempt to resolve both. This introduces the following syntax for tools with profile >= 16.10: ``` <requirement type="package" version="2.2.31" name="blast+"> <specification uri="https://anaconda.org/bioconda/blast" /> <specification uri="https://packages.debian.org/sid/ncbi-blast+" version="2.2.31-3" /> </requirement> ``` This allows finer grain resolution of the requirement without sacrificing the abstract name at the top. It allows the name and the version to be adapted by resolvers as needed (hopefully rarely so). This syntax is the future facing one, but obviously this tool would not work on older Galaxy versions. To remedy this - an alternative syntax can be used for tools targetting Galaxy verions pre-16.10: ``` <requirement type="package" version="2.2" specification_uris="https://anaconda.org/bioconda/[email protected],https://packages.debian.org/jessie/[email protected]">blast+</requirement> ``` This syntax sucks - but it does give newer Galaxies the ability to resolve the specifications without breaking the more simple functionality for older Galaxies. For more information on the CWL side of this - checkout the discussion on common-workflow-language/cwltool#214. The CWL specification information is defined at http://www.commonwl.org/v1.0/CommandLineTool.html#SoftwarePackage.
This adapts CWL-style ``SoftwareRequirement`` ``specs`` to solve galaxyproject#1927. Here I'm trying to implement the CWL specification in a way that helps enable the feasibility of Conda packaging in Galaxy. It is a delicate balancing act aimed to upset as many interested parties as I can. To understand the problem - consider the ``blast+`` requirement found in the Galaxy wrappers. It looks something like this: ``` <requirement type="package" version="2.2.31" name="blast+"> ``` Great, that works for Galaxy and Tool Shed packages. It doesn't work for bioconda at all. I think this problem is fairly uncommon - most packages have simple names shared across Debian, Brew, Conda, etc... - but it does happen in some cases that there are inconsistencies. Some people have taken to duplicating the requirement - this is bad and should not be done since they are mutually exclusive and Galaxy will attempt to resolve both. This introduces the following syntax for tools with profile >= 16.10: ``` <requirement type="package" version="2.2.31" name="blast+"> <specification uri="https://anaconda.org/bioconda/blast" /> <specification uri="https://packages.debian.org/sid/ncbi-blast+" version="2.2.31-3" /> </requirement> ``` This allows finer grain resolution of the requirement without sacrificing the abstract name at the top. It allows the name and the version to be adapted by resolvers as needed (hopefully rarely so). This syntax is the future facing one, but obviously this tool would not work on older Galaxy versions. To remedy this - an alternative syntax can be used for tools targetting Galaxy verions pre-16.10: ``` <requirement type="package" version="2.2" specification_uris="https://anaconda.org/bioconda/[email protected],https://packages.debian.org/jessie/[email protected]">blast+</requirement> ``` This syntax sucks - but it does give newer Galaxies the ability to resolve the specifications without breaking the more simple functionality for older Galaxies. For more information on the CWL side of this - checkout the discussion on common-workflow-language/cwltool#214. The CWL specification information is defined at http://www.commonwl.org/v1.0/CommandLineTool.html#SoftwarePackage.
|
Can one of the admins verify this patch? |
This adapts CWL-style ``SoftwareRequirement`` ``specs`` to solve galaxyproject#1927. Here I'm trying to implement the CWL specification in a way that helps enable the feasibility of Conda packaging in Galaxy. It is a delicate balancing act aimed to upset as many interested parties as I can. To understand the problem - consider the ``blast+`` requirement found in the Galaxy wrappers. It looks something like this: ``` <requirement type="package" version="2.2.31" name="blast+"> ``` Great, that works for Galaxy and Tool Shed packages. It doesn't work for bioconda at all. I think this problem is fairly uncommon - most packages have simple names shared across Debian, Brew, Conda, etc... - but it does happen in some cases that there are inconsistencies. Some people have taken to duplicating the requirement - this is bad and should not be done since they are mutually exclusive and Galaxy will attempt to resolve both. This introduces the following syntax for tools with profile >= 16.10: ``` <requirement type="package" version="2.2.31" name="blast+"> <specification uri="https://anaconda.org/bioconda/blast" /> <specification uri="https://packages.debian.org/sid/ncbi-blast+" version="2.2.31-3" /> </requirement> ``` This allows finer grain resolution of the requirement without sacrificing the abstract name at the top. It allows the name and the version to be adapted by resolvers as needed (hopefully rarely so). This syntax is the future facing one, but obviously this tool would not work on older Galaxy versions. To remedy this - an alternative syntax can be used for tools targetting Galaxy verions pre-16.10: ``` <requirement type="package" version="2.2" specification_uris="https://anaconda.org/bioconda/[email protected],https://packages.debian.org/jessie/[email protected]">blast+</requirement> ``` This syntax sucks - but it does give newer Galaxies the ability to resolve the specifications without breaking the more simple functionality for older Galaxies. For more information on the CWL side of this - checkout the discussion on common-workflow-language/cwltool#214. The CWL specification information is defined at http://www.commonwl.org/v1.0/CommandLineTool.html#SoftwarePackage. Conflicts: lib/galaxy/tools/deps/__init__.py test/functional/tools/samples_tool_conf.xml
This adapts CWL-style ``SoftwareRequirement`` ``specs`` to solve galaxyproject#1927. Here I'm trying to implement the CWL specification in a way that helps enable the feasibility of Conda packaging in Galaxy. It is a delicate balancing act aimed to upset as many interested parties as I can. To understand the problem - consider the ``blast+`` requirement found in the Galaxy wrappers. It looks something like this: ``` <requirement type="package" version="2.2.31" name="blast+"> ``` Great, that works for Galaxy and Tool Shed packages. It doesn't work for bioconda at all. I think this problem is fairly uncommon - most packages have simple names shared across Debian, Brew, Conda, etc... - but it does happen in some cases that there are inconsistencies. Some people have taken to duplicating the requirement - this is bad and should not be done since they are mutually exclusive and Galaxy will attempt to resolve both. This introduces the following syntax for tools with profile >= 16.10: ``` <requirement type="package" version="2.2.31" name="blast+"> <specification uri="https://anaconda.org/bioconda/blast" /> <specification uri="https://packages.debian.org/sid/ncbi-blast+" version="2.2.31-3" /> </requirement> ``` This allows finer grain resolution of the requirement without sacrificing the abstract name at the top. It allows the name and the version to be adapted by resolvers as needed (hopefully rarely so). This syntax is the future facing one, but obviously this tool would not work on older Galaxy versions. To remedy this - an alternative syntax can be used for tools targetting Galaxy verions pre-16.10: ``` <requirement type="package" version="2.2" specification_uris="https://anaconda.org/bioconda/[email protected],https://packages.debian.org/jessie/[email protected]">blast+</requirement> ``` This syntax sucks - but it does give newer Galaxies the ability to resolve the specifications without breaking the more simple functionality for older Galaxies. For more information on the CWL side of this - checkout the discussion on common-workflow-language/cwltool#214. The CWL specification information is defined at http://www.commonwl.org/v1.0/CommandLineTool.html#SoftwarePackage. Conflicts: lib/galaxy/tools/deps/__init__.py test/functional/tools/samples_tool_conf.xml
This adapts CWL-style ``SoftwareRequirement`` ``specs`` to solve galaxyproject#1927. Here I'm trying to implement the CWL specification in a way that helps enable the feasibility of Conda packaging in Galaxy. It is a delicate balancing act aimed to upset as many interested parties as I can. To understand the problem - consider the ``blast+`` requirement found in the Galaxy wrappers. It looks something like this: ``` <requirement type="package" version="2.2.31" name="blast+"> ``` Great, that works for Galaxy and Tool Shed packages. It doesn't work for bioconda at all. I think this problem is fairly uncommon - most packages have simple names shared across Debian, Brew, Conda, etc... - but it does happen in some cases that there are inconsistencies. Some people have taken to duplicating the requirement - this is bad and should not be done since they are mutually exclusive and Galaxy will attempt to resolve both. This introduces the following syntax for tools with profile >= 16.10: ``` <requirement type="package" version="2.2.31" name="blast+"> <specification uri="https://anaconda.org/bioconda/blast" /> <specification uri="https://packages.debian.org/sid/ncbi-blast+" version="2.2.31-3" /> </requirement> ``` This allows finer grain resolution of the requirement without sacrificing the abstract name at the top. It allows the name and the version to be adapted by resolvers as needed (hopefully rarely so). This syntax is the future facing one, but obviously this tool would not work on older Galaxy versions. To remedy this - an alternative syntax can be used for tools targetting Galaxy verions pre-16.10: ``` <requirement type="package" version="2.2" specification_uris="https://anaconda.org/bioconda/[email protected],https://packages.debian.org/jessie/[email protected]">blast+</requirement> ``` This syntax sucks - but it does give newer Galaxies the ability to resolve the specifications without breaking the more simple functionality for older Galaxies. For more information on the CWL side of this - checkout the discussion on common-workflow-language/cwltool#214. The CWL specification information is defined at http://www.commonwl.org/v1.0/CommandLineTool.html#SoftwarePackage. Conflicts: lib/galaxy/tools/deps/__init__.py test/functional/tools/samples_tool_conf.xml
This adapts CWL-style ``SoftwareRequirement`` ``specs`` to solve galaxyproject#1927. Here I'm trying to implement the CWL specification in a way that helps enable the feasibility of Conda packaging in Galaxy. It is a delicate balancing act aimed to upset as many interested parties as I can. To understand the problem - consider the ``blast+`` requirement found in the Galaxy wrappers. It looks something like this: ``` <requirement type="package" version="2.2.31" name="blast+"> ``` Great, that works for Galaxy and Tool Shed packages. It doesn't work for bioconda at all. I think this problem is fairly uncommon - most packages have simple names shared across Debian, Brew, Conda, etc... - but it does happen in some cases that there are inconsistencies. Some people have taken to duplicating the requirement - this is bad and should not be done since they are mutually exclusive and Galaxy will attempt to resolve both. This introduces the following syntax for tools with profile >= 16.10: ``` <requirement type="package" version="2.2.31" name="blast+"> <specification uri="https://anaconda.org/bioconda/blast" /> <specification uri="https://packages.debian.org/sid/ncbi-blast+" version="2.2.31-3" /> </requirement> ``` This allows finer grain resolution of the requirement without sacrificing the abstract name at the top. It allows the name and the version to be adapted by resolvers as needed (hopefully rarely so). This syntax is the future facing one, but obviously this tool would not work on older Galaxy versions. To remedy this - an alternative syntax can be used for tools targetting Galaxy verions pre-16.10: ``` <requirement type="package" version="2.2" specification_uris="https://anaconda.org/bioconda/[email protected],https://packages.debian.org/jessie/[email protected]">blast+</requirement> ``` This syntax sucks - but it does give newer Galaxies the ability to resolve the specifications without breaking the more simple functionality for older Galaxies. For more information on the CWL side of this - checkout the discussion on common-workflow-language/cwltool#214. The CWL specification information is defined at http://www.commonwl.org/v1.0/CommandLineTool.html#SoftwarePackage. Conflicts: lib/galaxy/tools/deps/__init__.py test/functional/tools/samples_tool_conf.xml
|
And we are up and running on this branch again at the GCC 2017 Hackathon. Just summarize what I have done and some conversations...
So my TODO list for the rest of the hackathon is:
These last two won't actually happen - but they sound fun. |
|
Tested using |
| log = ... # type: Any | ||
| CONFIG_VAL_NOT_FOUND = ... # type: Any | ||
|
|
||
| def build_dependency_manager(config): ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a full type signature. It returns a DependencyManager, but I am unsure of what type config is.
| def __init__(self, default_base_path, conf_file: Optional[Any] = ..., app_config: Any = ...) -> None: ... | ||
| def get_resolver_option(self, resolver, key, explicit_resolver_options: Any = ...): ... | ||
| def get_app_option(self, key, default: Optional[Any] = ...): ... | ||
| def dependency_shell_commands(self, requirements, **kwds): ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a full type signature.
| def to_dict(self): ... | ||
| def copy(self): ... | ||
| @staticmethod | ||
| def from_dict(dict): ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a full type description
| tool_requirements = ... # type: Any | ||
| def __init__(self, tool_requirements: Optional[Any] = ...) -> None: ... | ||
| @staticmethod | ||
| def from_list(requirements): ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a full type description
cwltool/draft2tool.py
Outdated
| def __init__(self, toolpath_object, **kwargs): | ||
| # type: (Dict[Text, Any], **Any) -> None | ||
| super(CommandLineTool, self).__init__(toolpath_object, **kwargs) | ||
| self.find_default_container = kwargs["find_default_container"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is causing multiple tests to fail.
cwltool/main.py
Outdated
| _logger.addHandler(defaultStreamHandler) | ||
|
|
||
|
|
||
| COMMAND_WITH_DEPENDENCIES_TEMPLATE = string.Template("""#!/bin/bash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the rest of this be moved to an external file?
450f646 to
d25285c
Compare
|
Consider the included tool ``seqtk_seq.cwl``. It includes the following SoftwareRequirement hint:
```
hints:
SoftwareRequirement:
packages:
- package: seqtk
version:
- 1.2
```
I'm not happy that ``version`` is a list - but I can live with it for now I guess.
If cwltool is executed with the hidden ``--beta-conda-dependencies`` flag, this requirement will be processed by galaxy-lib, Conda will be installed, and seqtk will be installed, and a Conda environment including seqtk will be setup for the job.
```
virtualenv .venv
. .venv/bin/activate
python setup.py install
pip install galaxy-lib
cwltool --beta-conda-dependencies tests/seqtk_seq.cwl tests/seqtk_seq_job.json
```
Additional flags are available to configure dependency resolution in a more fine grained way - using Conda however has a number of advantages that make it particularily well suited to CWL. Conda packages are distributed as binaries that work across Mac and Linux and work on relatively old version of Linux (great for HPC). Conda also doesn't require root and supports installation of multiple different versions of a package - again these factors make it great for HPC and non-Docker targets.
The Biocontainers project (previously Biodocker) dovetails nicely with this. Every version of every Bioconda package has a corresponding best-practice (very lightweight, very small) Docker container on quay.io (assembled by @bgruening and colleagues). There are over 1800 such containers currently.
Continuing with the example above, the new ``--beta-use-biocontainers`` flag instructs cwltool to fetch the corresponding Biocontainers container from quay.io automatically or build one to use locally (required for instance for tools with multiple software requirements - fat tools).
```
cwltool --beta-use-biocontainers tests/seqtk_seq.cwl tests/seqtk_seq_job.json
```
These containers contain the same binaries that the package would use locally (outside of Docker). Therefore this technique allows cross platform reproducibility/remixability across CWL, Galaxy, and CLI - both inside and outside of Docker.
My sincerest hope is that we move away from CWL-specific Dockerfiles. For less effort, a community bioconda package can be made and the result can be used in many more contexts. The Docker image will then be maintained by the community Biocontainer project.
Rebased with correct spelling of DependenciesConfiguration thanks to @tetron.
|
@mr-c This branch's README now contains lots of documentation about SoftwareRequirements including how to resolve them with environment modules, custom package scripts, and Conda. It covers mapping to local infrastructure (for admins) as well mapping to multiple package managers from the tools themselves (for tool developers). It has many example calls that should work without external dependencies covering most of the test data I added and that can be run to try these things out today. It also contains some external links and information about installing the optional dependencies required to run these. Here are the docs - https://github.com/jmchilton/cwltool/tree/galaxy_deps#leveraging-softwarerequirements-beta. |
|
@mr-c Consider this comment documentation that I agree to allow uploading of galaxy-lib stubs to typeshed. Hope this is good enough? |
|
|
||
| $ pip install 'cwltool[deps]' | ||
|
|
||
| Installing cwltool in this fashion enables several new command line options. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm.. should we hide them when the deps extra isn't modified? What happens if they are called and galaxy-lib is missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have clarified this behavior with #459 and improved it a bit.
|
Merging as this is tremendous; we can add additional polish later. Thank you @jmchilton ! |
Consider the included tool
seqtk_seq.cwl. It includes the following SoftwareRequirement hint:I'm not happy that
versionis a list - but I can live with it for now I guess.If cwltool is executed with the hidden
--beta-conda-dependenciesflag, this requirement will be processed by galaxy-lib, Conda will be installed, and seqtk will be installed, and a Conda environment including seqtk will be setup for the job.Additional flags are available to configure dependency resolution in a more fine grained way - using Conda however has a number of advantages that make it particularily well suited to CWL. Conda packages are distributed as binaries that work across Mac and Linux and work on relatively old version of Linux (great for HPC). Conda also doesn't require root and supports installation of multiple different versions of a package - again these factors make it great for HPC and non-Docker targets.
The Biocontainers project (previously Biodocker) dovetails nicely with this. Every version of every Bioconda package has a corresponding best-practice (very lightweight, very small) Docker container on quay.io (assembled by @bgruening and colleagues). There are over 1800 such containers currently.
Continuing with the example above, the new
--beta-use-biocontainersflag instructs cwltool to fetch the corresponding Biocontainers container from quay.io automatically or build one to use locally (required for instance for tools with multiple software requirements - fat tools).These containers contain the same binaries that the package would use locally (outside of Docker). Therefore this technique allows cross platform reproducibility/remixability across CWL, Galaxy, and CLI - both inside and outside of Docker.
My sincerest hope is that we move away from CWL-specific Dockerfiles. For less effort, a community bioconda package can be made and the result can be used in many more contexts. The Docker image will then be maintained by the community Biocontainer project.