User Guide¶
Recall for a moment the vague instructions your PhD advisor hastily scribbled onto the chalkboard about how to do a calculation. Now imagine that those scribbles were actually executable! That’s the goal: to allow high-level, domain-specific concepts to be specified directly in a user-friendly YAML format. The abstract scientific protocol is then automatically translated into specific concrete steps, executed on a remote job cluster, and analyzed automatically.
Main Features / Design Overview¶
See overview
auto-discovery¶
Many software packages have a way of automatically discovering files which they can use (examples: pytest, pylint).
By default, wic will recursively search for tools / workflows within the directories (and subdirectories) listed in the config file’s JSON tags search_paths_cwl and search_paths_yml. The paths listed can be absolute or relative. The default config.json is shown.
We strongly recommend placing all repositories of tools / workflows in the same parent directory.
(All your repos should be side-by-side in sibling directories, as shown.)
......
    "search_paths_cwl": {
        "global": [
            "../workflow-inference-compiler/cwl_adapters",
            "../image-workflows/cwl_adapters",
            "../biobb_adapters/biobb_adapters",
            "../mm-workflows/cwl_adapters"
        ],
        "gpu": [
            "../mm-workflows/gpu"
        ]
    },
    "search_paths_yml": {
        "global": [
            "./workflow-inference-compiler/docs/tutorials",
            "../image-workflows/workflows",
            "../mm-workflows/examples"
        ]
    }
.....
If you do not specify a config file using the command line argument --config, one will be automatically created for you at ~/wic/global_config.json the first time you run wic. (Because of this, the first time you run wic you should be in the root directory of any one of your repos.) You can then manually edit this file to add additional sources of tools / workflows.
To avoid dealing with relative file paths in YAML files, by default all tool / workflow names are required to be unique!
See namespaces for details.
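The recursive search and the uniqueness rule can be illustrated with a minimal Python sketch. This is not wic's actual implementation: the function name discover_files is hypothetical, the assumption that a tool / workflow name equals the file name without its extension is ours, and the grouping of paths into namespaces such as "global" and "gpu" is deliberately ignored.

```python
from pathlib import Path


def discover_files(search_paths, pattern):
    """Recursively collect files matching `pattern` under each search path.

    Hypothetical helper, for illustration only. Returns a dict mapping each
    file stem (assumed here to be the tool / workflow name) to its path, and
    raises ValueError if two discovered files share the same stem.
    """
    found = {}
    for root in search_paths:
        base = Path(root).expanduser()
        if not base.is_dir():
            # Silently skip search paths that do not exist on this machine.
            continue
        # rglob descends into all subdirectories; sorting keeps the result
        # deterministic across runs.
        for path in sorted(base.rglob(pattern)):
            if not path.is_file():
                continue
            name = path.stem
            if name in found:
                raise ValueError(
                    f"duplicate name {name!r}: {found[name]} and {path}"
                )
            found[name] = path
    return found
```

For example, assuming tools use the .cwl extension, discover_files(["../mm-workflows/cwl_adapters"], "*.cwl") would return every tool found under that directory, keyed by name. Raising an error on a duplicate stem is one way to enforce unique names without ever writing a relative path in a YAML file.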
Edges¶
What do the edges in a workflow represent? In many workflow languages (e.g. Argo), the edges represent dependencies between entire steps. Note that there could be multiple files or directories implicitly passed between two steps, but these workflow languages only model that as a single edge.
CWL models edges differently. In CWL, edges represent dependencies between individual explicit inputs and outputs. This fine-grained approach has several benefits, first and foremost increased parallelism. CWL also allows individual inputs and outputs to be tagged with metadata such as type: and format: tags. This additional information is what makes edge inference possible!
Edge Inference Algorithm¶
First of all, a reminder that we can only connect an input in the current step to an output that already exists from some previous step.
The edge inference algorithm is actually rather simple: For each input with a given type and format, it checks for outputs that have the same type and format from one of the previous steps. Since many workflows are linear-ish pipelines, the steps are checked in reverse order (and the outputs of each step are also checked in reverse order). If there is a unique match, then great! If there are multiple matches, it arbitrarily chooses the first (i.e. most recent) match. For technical reasons, edge inference is far from unique, so users should always check that edge inference actually produces the intended DAG.
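The matching rule described above can be sketched in a few lines of Python. This is a simplified illustration, not wic's real code: the Port and Step classes and the infer_edge function are hypothetical names, and the type and format values are treated as plain strings that must match exactly.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Port:
    """A single input or output, tagged with CWL-style type and format."""
    name: str
    type: str
    format: str


@dataclass
class Step:
    """A workflow step with ordered inputs and outputs (hypothetical model)."""
    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


def infer_edge(steps, step_index, inp) -> Optional[tuple]:
    """Find the most recent earlier output whose type and format match `inp`.

    Earlier steps are scanned in reverse order, and the outputs of each step
    are also scanned in reverse order. The first match wins. Returns a
    (step name, output name) pair, or None if nothing matches.
    """
    for prev in reversed(steps[:step_index]):
        for out in reversed(prev.outputs):
            if out.type == inp.type and out.format == inp.format:
                return (prev.name, out.name)
    return None
```

Because only steps[:step_index] is scanned, an input can never be connected to an output of its own step or of a later step. When several earlier outputs share the same type and format, the sketch returns the most recent one, which is exactly the situation in which the resulting DAG should be checked by hand.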
Explicit Edges¶
If for some reason edge inference fails, you can always explicitly specify the edges using '&var' and '*var' notation. Simply use '&var' to create a reference to an output filename and then, in an input in any later step, use '*var' to dereference the filename and create an explicit edge between the output and the input. See examples/gromacs in the mm-workflows repository for a concrete example. Because YAML has its own anchors and aliases notation (which you can still use!), these variables need to be in quotes. (The notation is intended to be nearly identical, but instead of using '*var' to refer to the contents of '&var', it refers to the path to '&var'.)
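As a rough illustration of how quoted '&var' and '*var' strings could be turned into explicit edges, here is a minimal Python sketch. It is an assumption-laden model, not wic's implementation: the function resolve_explicit_edges is hypothetical, and steps are represented as simple (step name, arguments) pairs rather than as wic's real YAML structure.

```python
def resolve_explicit_edges(steps):
    """Turn '&var' / '*var' string values into explicit edges.

    `steps` is a list of (step_name, args) pairs in execution order, where
    `args` maps an argument name to its value (hypothetical representation).
    Returns a list of (source step, source argument, target step, target
    argument) tuples. Raises KeyError if '*var' is used before any earlier
    step has defined '&var'.
    """
    anchors = {}  # variable name -> (step name, argument name)
    edges = []
    for step_name, args in steps:
        # Dereference first, so that '*var' can only see anchors that were
        # defined in strictly earlier steps.
        for arg, value in args.items():
            if isinstance(value, str) and value.startswith("*"):
                var = value[1:]
                if var not in anchors:
                    raise KeyError(f"'*{var}' used before '&{var}' was defined")
                src_step, src_arg = anchors[var]
                edges.append((src_step, src_arg, step_name, arg))
        # Then record the anchors that this step defines.
        for arg, value in args.items():
            if isinstance(value, str) and value.startswith("&"):
                anchors[value[1:]] = (step_name, arg)
    return edges
```

The values only arrive as ordinary strings because they are quoted in the YAML file; unquoted, a YAML parser would consume & and * as its own anchor and alias syntax before a function like this ever saw them.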
Inline Inputs¶
Of course, if you supply an input value directly, then the algorithm doesn’t need to do either inference or explicit edges. The message input is a great example of this.
steps:
  - echo:
      in:
        message: Hello World
Note that this is one key difference between WIC and CWL. In CWL, all inputs must be given in a separate file. In WIC, inputs can be given inline; after compilation they are automatically extracted into a separate file.
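The extraction step can be pictured with a short Python sketch. It is only a conceptual model: the function extract_inline_inputs and the generated names of the form step_1_echo_message are hypothetical, and the real compiler's naming scheme and output format may differ.

```python
def extract_inline_inputs(steps):
    """Move inline input values out of the steps into a separate mapping.

    `steps` mirrors the parsed YAML: a list of single-key dicts such as
    {"echo": {"in": {"message": "Hello World"}}}. Returns a pair
    (rewritten steps, inputs mapping). In the rewritten steps each inline
    value is replaced by a generated name, and the inputs mapping (which
    plays the role of the separate inputs file) maps that name to the value.
    """
    inputs_file = {}
    rewritten = []
    for index, step in enumerate(steps, start=1):
        # Each step is a dict with exactly one key: the tool name.
        (tool, body), = step.items()
        new_in = {}
        for key, value in (body or {}).get("in", {}).items():
            generated = f"step_{index}_{tool}_{key}"  # hypothetical naming scheme
            inputs_file[generated] = value
            new_in[key] = generated
        rewritten.append({tool: {"in": new_in}})
    return rewritten, inputs_file
```

Applied to the echo example, the sketch leaves a step whose message input refers to a generated name, while the value Hello World moves into the separate mapping. A real compiler would also have to skip inputs that are edges rather than literal values, which this sketch does not attempt.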