Algorithms¶
Namespacing¶
Before outlining the main compilation algorithm, we discuss names and namespacing. Sophios uses namespaces to keep autogenerated CWL input and output variables unique across nested workflows and generated graphs.
For now, each individual Namespace is a string with the following encoding:
namespace_str = f'{plugin_name}__step__{i + 1}__{step_key}' i.e. namespace_str = utils.step_name_str(plugin_name, i+1, step_key)
plugin_name can be either a CWL filename stem or a .wic filename.
In the autogenerated CWL files, multiple Namespaces are encoded as a string separated by triple underscores (i.e. namespaces_str = '___'.join(namespace_strs))
so that we can later split the string back into its original Namespace strings (i.e. namespace_strs = namespaces_str.split('___'))
In subworkflows, we also discard the leading namespaces from the parent workflows (i.e. we use relative namespacing).
This keeps the CWL files for subworkflows independent of the parent workflow
that embeds them. The regression test for this behavior is
test_cwl_embedding_independence.
This information is stored in specially encoded strings because the CWL schema does not allow arbitrary extra tags in the places Sophios needs them. The Galaxy workflow platform uses a similar encoded-string scheme, although its indexing starts at zero.
On the other hand, GraphViz and NetworkX require all names to be globally unique, so by default we do not truncate the namespaces when constructing graphs. However, –graph_inline_depth allows users to hide irrelevant details by collapsing all subgraphs below the given depth to a single node. This is implemented by simply truncating the trailing namespaces. Thus, keeping track of namespaces allows us to trivially implement two key features.
Compilation Algorithm¶
One of the main design criteria is that users should be able to recursively
combine workflow steps into reusable subworkflows. Compilation should be
independent of how the root workflow is partitioned into subworkflows. The
regression test for this behavior is test_inline_subworkflows.
First let’s consider the base case, i.e. the case that all of the steps in the workflow are already CWL CommandLineTools and thus there is no recursion. In other words, we have a list of steps with inputs that need to be connected to previous outputs, either explicitly or using inference, which will be discussed below.
Now for the recursive case: If we are in the process of compiling the steps of a workflow and we encounter a subworkflow, we simply compile the subworkflow’s wic file contents to CWL, replace the wic file contents in-memory with the compiled CWL, and continue the compilation of the parent workflow as if the subworkflow was already a CWL CommandLineTool.
However, there are two major additional points: After compilation, a namespace is prepended to the input and output variables of a subworkflow to guarantee uniqueness. More importantly, from within the subworkflow, it may not be possible to completely determine all inputs concretely; satisfaction of some inputs may need to be deferred.
Deferred Satisfaction¶
When compiling a subworkflow, inputs which originate in a parent workflow (and are thus external to the subworkflow) are not yet in scope. Thus, we cannot yet make an edge (either explicit or inferred) and the inputs cannot yet be concretely satisfied. So we simply create an intermediate input variable in the intermediate subworkflow(s), and as the recursion unwinds there will eventually be a concrete input in some parent workflow. It is very important to note that deferring inputs does NOT affect the DAGs of any of the parent workflows! (Again, see test_inline_subworkflows()) Also note that deferred intermediate inputs will be namespaced accordingly.
Explicit Edges¶
Explicit edges are handled first, to prevent edge inference from being applied. Whereas the edge inference algorithm operates ‘locally’, at one level of recursion at a time, the explicit edge algorithm inherently operates ‘globally’, requiring deferred information to be passed around through the various levels of recursion. (As such, it was actually much more difficult to implement correctly.)
First, when an edge definition site (!&) is encountered, its namespaces are
stored in explicit_edge_defs. Then, when an edge call site (!*) is
encountered, Sophios compares the namespaces of the definition and call sites
using the lowest common ancestor
algorithm to see whether they have leading namespaces in common. If compilation
is already in the common namespace, the definition information can be applied
immediately. Otherwise, the information is stored in explicit_edge_calls and
deferred until recursion reaches the common namespace.
Edge Inference¶
Since users may eventually need to know how edge inference works, the edge inference algorithm is described in Advanced YAML and Operations.
Again note that if we are in a subworkflow, edge inference may temporarily fail for some inputs and we may need to defer to a parent workflow.
Mathematical Aside¶
Every DAG has a topological ordering. Since CWL workflows are DAGs, there must be an associated topological ordering. However, a .wic workflow can start as a linear sequence of steps without explicit edge information, so the compiler begins with a linear ordering. The challenge is to promote the linear ordering to a topological ordering by inferring the missing edges. Since edge inference can have more than one plausible answer, users should inspect important generated DAGs before relying on them.
Speculative Compilation¶
Speculative compilation is a constrained version of the Automated Pipeline Explorer idea. The problem with APE, and SAT-style search in general, is that the number of solutions can increase exponentially with the number of variables. Sophios aims to find unique insertions when possible, so automatic insertion is limited to one inserted step.
The algorithm is actually rather simple: first we attempt to perform edge inference. If it fails, that means there are no outputs that directly match the given input. So what happens if we insert an intermediate step? Specifically, the compiler attempts to transitively match the input of the current step with the outputs of the intermediate step, and the inputs of the intermediate step with the outputs (plural) that failed to directly match.
Due to implementation details, the compiler temporarily attempts to match a single output with the intermediate inputs. At that point the insertion is still tentative, so Sophios stops compiling the current subworkflow, inserts the candidate step, and speculatively recompiles the subworkflow from scratch. If the insertion is valid, inference should then match all required outputs.
Known Issues¶
Speculative compilation can repeatedly fail while repeatedly attempting to
insert an additional step. To guarantee termination, compiler.compile_workflow
currently stops after a fixed maximum number of iterations and reports an error.
Future work could detect this condition earlier and report a more specific
diagnostic.
There is a pathological case where speculative recompilation has time
complexity O(2^n). Insertions can happen at any level of recursion, so Sophios
may recompile deeply nested subworkflows unnecessarily. This should be
amenable to dynamic programming
or subworkflow caching if it becomes a practical bottleneck.