Algorithms

Namespacing

Before outlining the main compilation algorithm, we need to discuss names and namespacing. We use namespacing to guarantee that input and output variables have unique names in the autogenerated CWL files and graphs.

For now, each individual Namespace is a string with the following encoding:

```python
namespace_str = f'{plugin_name}__step__{i + 1}__{step_key}'
# i.e.
namespace_str = utils.step_name_str(plugin_name, i + 1, step_key)
```

plugin_name can be either a CWL filename (excluding the .cwl extension) or a wic filename (including the .wic extension).
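For illustration, here is a minimal sketch mirroring the formula above (the plugin name and step key below are hypothetical):

```python
def step_name_str(plugin_name: str, i: int, step_key: str) -> str:
    # Mirrors utils.step_name_str; i is the 1-based step index.
    return f'{plugin_name}__step__{i}__{step_key}'

# Hypothetical example: the first step of a workflow defined in align.wic
print(step_name_str('align.wic', 1, 'bwa_mem'))
# align.wic__step__1__bwa_mem
```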

In the autogenerated CWL files, multiple Namespaces are encoded as a single string separated by triple underscores (i.e. namespaces_str = '___'.join(namespace_strs)), so that we can later split the string back into its original Namespace strings (i.e. namespace_strs = namespaces_str.split('___')).
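A minimal sketch of the round trip (the namespace strings are hypothetical):

```python
namespace_strs = ['align.wic__step__1__bwa_mem',
                  'align.wic__step__2__samtools_sort']  # hypothetical
namespaces_str = '___'.join(namespace_strs)  # encode for the CWL file
# Decoding round-trips, assuming no single namespace contains '___'.
assert namespaces_str.split('___') == namespace_strs
```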

In subworkflows, we also discard the leading namespaces from the parent workflows (i.e. we use relative namespacing).

This is absolutely critical to ensuring that the CWL files corresponding to subworkflows are completely independent of their embedding into a parent workflow. This is one of the main design criteria, so do not mess this up!!! (In fact, you can’t mess this up, because test_cwl_embedding_independence() will catch it.)

(FYI, we are forced to store this info in specially encoded strings because the CWL schema does not allow us to add extra tags. Specially encoded strings are almost never the right answer, but for now I don’t see any other way. Also note that the Galaxy workflow platform uses a nearly identical specially encoded string scheme, although their indexing starts at zero.)

On the other hand, GraphViz and NetworkX require all names to be globally unique, so by default we do not truncate the namespaces when constructing graphs. However, --graph_inline_depth allows users to hide irrelevant details by collapsing all subgraphs below the given depth to a single node. This is implemented by simply truncating the trailing namespaces. Thus, keeping track of namespaces allows us to trivially implement both of these features.
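Both features then reduce to simple list operations on the namespaces; a minimal sketch (the function names and example namespaces are hypothetical):

```python
def relative_namespaces(namespaces: list, parent_depth: int) -> list:
    # Discard the leading namespaces inherited from parent workflows,
    # so a subworkflow's CWL is independent of its embedding.
    return namespaces[parent_depth:]

def collapse_to_depth(namespaces: list, graph_inline_depth: int) -> list:
    # Truncate the trailing namespaces so that all nodes below the
    # given depth collapse onto a single subgraph node.
    return namespaces[:graph_inline_depth]

ns = ['root', 'sub_a', 'sub_b']
print(relative_namespaces(ns, 1))  # ['sub_a', 'sub_b']
print(collapse_to_depth(ns, 2))    # ['root', 'sub_a']
```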

Compilation Algorithm

One of the main design criteria is that users should be able to (recursively) combine workflow steps into reusable building blocks, i.e. subworkflows. Moreover, everything should be independent of how the user chooses to partition their root workflow into subworkflows. (See test_inline_subworkflows().) To implement this, we use a recursive compilation strategy.

First, let’s consider the base case, i.e. the case in which all of the steps in the workflow are already CWL CommandLineTools and thus there is no recursion. In other words, we have a list of steps with inputs that need to be connected to previous outputs, either explicitly or using inference, both of which are discussed below.

Now for the recursive case: if we encounter a subworkflow while compiling the steps of a workflow, we simply compile the subworkflow’s wic file contents to CWL, replace the wic file contents in-memory with the compiled CWL, and continue compiling the parent workflow as if the subworkflow were already a CWL CommandLineTool.
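Schematically, the recursion looks something like the following sketch (the dictionary layout and names are illustrative, not the actual compiler API):

```python
def compile_workflow(wic_tree: dict, namespaces: list) -> dict:
    # Recursively compile a wic workflow tree to in-memory CWL.
    cwl_steps = {}
    for i, (step_key, step) in enumerate(wic_tree.get('steps', {}).items()):
        if step_key.endswith('.wic'):
            # Recursive case: compile the subworkflow's wic contents to CWL
            # and substitute the result in-memory, so the rest of the loop
            # can treat it exactly like a pre-existing CommandLineTool.
            sub_ns = namespaces + [f'{step_key}__step__{i + 1}__{step_key}']
            step = compile_workflow(step, sub_ns)
        # Base case: a CWL CommandLineTool whose inputs still need to be
        # connected to previous outputs (explicit edges, then inference).
        cwl_steps[step_key] = step
    return {'class': 'Workflow', 'steps': cwl_steps}
```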

However, there are two additional points. First, after compilation, a namespace is prepended to the input and output variables of a subworkflow to guarantee uniqueness. Second, and more importantly, from within the subworkflow it may not be possible to completely determine all inputs concretely; satisfaction of some inputs may need to be deferred.

Deferred Satisfaction

When compiling a subworkflow, inputs which originate in a parent workflow (and are thus external to the subworkflow) are not yet in scope. Thus, we cannot yet make an edge (either explicit or inferred), and the inputs cannot yet be concretely satisfied. Instead, we simply create an intermediate input variable in each intervening subworkflow; as the recursion unwinds, there will eventually be a concrete input in some parent workflow. It is very important to note that deferring inputs does NOT affect the DAGs of any of the parent workflows! (Again, see test_inline_subworkflows().) Also note that deferred intermediate inputs will be namespaced accordingly.
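A minimal sketch of deferral (the names and the File type are illustrative):

```python
def satisfy_or_defer(input_name: str, in_scope_outputs: dict,
                     workflow_inputs: dict) -> str:
    # If a previous output is in scope, make the edge immediately.
    if input_name in in_scope_outputs:
        return in_scope_outputs[input_name]
    # Otherwise, defer: promote the input to an intermediate input of the
    # current subworkflow. As the recursion unwinds, a parent workflow will
    # eventually satisfy it concretely (or defer it yet again).
    workflow_inputs[input_name] = {'type': 'File'}  # type is illustrative
    return input_name
```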

Explicit Edges

Explicit edges are handled first, to prevent edge inference from being applied. Whereas the edge inference algorithm operates ‘locally’, at one level of recursion at a time, the explicit edge algorithm inherently operates ‘globally’, requiring deferred information to be passed around through the various levels of recursion. (As such, it was actually much more difficult to implement correctly.)

First, when an edge definition site (!&) is encountered, its namespaces are stored in explicit_edge_defs. Then, when an edge call site (!*) is encountered, we compare the namespaces of the definition and call sites using the lowest common ancestor algorithm to see if they have any leading namespaces in common. If we are already in the common namespace, we can immediately apply the definition information. Otherwise, we need to store the information in explicit_edge_calls and defer it until the recursion reaches the common namespace. (This is the part that was difficult; please be careful when modifying this code!)
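The lowest common ancestor of two namespace paths is just their longest common prefix; a minimal sketch (the example namespaces are hypothetical):

```python
def common_prefix(def_namespaces: list, call_namespaces: list) -> list:
    # Lowest common ancestor of two namespace paths.
    prefix = []
    for d, c in zip(def_namespaces, call_namespaces):
        if d != c:
            break
        prefix.append(d)
    return prefix

defs = ['root', 'sub_a', 'sub_b']   # namespaces of the !& definition site
calls = ['root', 'sub_a', 'sub_c']  # namespaces of the !* call site
assert common_prefix(defs, calls) == ['root', 'sub_a']
# If we are currently compiling in ['root', 'sub_a'], apply the definition
# immediately; otherwise, store it in explicit_edge_calls and defer until
# the recursion reaches the common namespace.
```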

Edge Inference

Since users may eventually need to know how edge inference works, the edge inference algorithm is documented in the user’s guide.

Again note that if we are in a subworkflow, edge inference may temporarily fail for some inputs and we may need to defer to a parent workflow.

Mathematical Aside

Every DAG has a topological ordering. Since CWL workflows are DAGs, each workflow has an associated topological ordering. However, since the input wic DSL only contains a linear sequence of steps and does not contain any edge information, we merely have a linear ordering. The challenge is to promote the linear ordering to a topological ordering by inferring all of the edges. Since we are initially missing information, this promotion is far from unique, so users should always check that edge inference actually produces the intended DAG.
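As a toy illustration of the non-uniqueness (the step names and the 'pdb' type are made up):

```python
# Three steps in linear order; the last step needs one input of type 'pdb'.
steps = [('minimize',    {'out': ['pdb']}),
         ('equilibrate', {'out': ['pdb']}),
         ('analyze',     {'in':  ['pdb']})]

# Both earlier steps produce a 'pdb', so more than one DAG is consistent
# with the linear ordering; edge inference must pick one candidate.
candidates = [name for name, io in steps[:2] if 'pdb' in io.get('out', [])]
print(candidates)  # ['minimize', 'equilibrate']
```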

Speculative Compilation

What we are doing here is basically a simplified version of the Automated Pipeline Explorer. The problem with APE (and SAT solvers in general) is that the number of solutions typically increases exponentially with the number of variables. APE forces the user to manually choose which solution is correct. We want to find unique solutions (if possible), so we limit ourselves to n=1 inserted steps.

The algorithm is actually rather simple: first, we attempt to perform edge inference. If it fails, that means there are no previous outputs that directly match the given input. So what happens if we insert an intermediate step? Specifically, the compiler attempts to transitively match the input of the current step with the outputs of the intermediate step, and the inputs of the intermediate step with the previous outputs (plural) that failed to match directly.

Actually, due to implementation details, the compiler initially attempts to match only a single output with the intermediate step’s inputs. At this point we may or may not have a valid insertion, so we stop compiling the current subworkflow, tentatively insert the step, and speculatively re-compile the subworkflow from scratch. If we indeed have a valid insertion, the inference algorithm should now succeed in matching all of the outputs, and we are done! This is a rather roundabout way of doing it, but it turned out to be much easier to implement.
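Schematically, with the compilation and matching helpers injected as parameters (all of the names here are illustrative, not the actual compiler API):

```python
MAX_ITERATIONS = 10  # guard against non-convergence (see Known Issues)

def speculative_compile(steps, compile_once, find_intermediate_step, insert):
    # compile_once returns ('ok', cwl) on success, or ('fail', failed_input)
    # when edge inference fails for some input.
    for _ in range(MAX_ITERATIONS):
        status, result = compile_once(steps)
        if status == 'ok':
            return result
        # Edge inference failed: no output directly matches `result`.
        # Tentatively insert a single (n=1) intermediate step whose outputs
        # match the failed input, then re-compile from scratch.
        step = find_intermediate_step(result)
        if step is None:
            raise ValueError(f'No intermediate step matches input {result}')
        steps = insert(steps, step)
    raise RuntimeError('Speculative compilation did not converge')
```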

Known Issues

There is a case where speculative compilation can repeatedly fail, yet repeatedly attempt to insert an additional step, and thus go into an infinite loop. For now, I have simply hardcoded a maximum number of iterations, after which an error message is displayed. It should be possible to detect and recover from this error, which I believe occurs when a single match is initially found but then all of the matches cannot subsequently be found.

There is a pathological case where speculatively re-compiling has time complexity O(2^n)! This is because insertions can happen at any level of recursion, so we may end up re-compiling deeply nested subworkflows unnecessarily. However, this is amenable to dynamic programming techniques, so we should be able to cache the subworkflows and eliminate this problem. Moreover, this pathological case would never occur with any workflows that a human would write.
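For example, compilation could be memoized on the serialized subworkflow contents, as in the sketch below (note that deferred inputs depend on the embedding context, so a real cache key would need more care):

```python
import functools
import json

@functools.lru_cache(maxsize=None)
def compile_subworkflow_cached(wic_contents: str) -> str:
    # Keyed on the serialized subworkflow contents, so that speculative
    # re-compilation reuses unchanged, deeply nested subworkflows instead
    # of recursing into them again.
    wic_tree = json.loads(wic_contents)
    # ... the actual recursive compilation would go here ...
    return json.dumps({'class': 'Workflow', 'steps': wic_tree.get('steps', {})})
```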