Validation guide

Formal Schemas

What is validation? Validation is the process of checking whether the contents of a file agree with a formal specification. A popular modern choice of formal specification language is jsonschema, which is what wic uses. (Don’t worry, you do not need to learn jsonschema. See below.)

Different levels of strictness

Notice that I said a formal specification, not the formal specification. That’s because there are many different kinds of schemas, which may be more or less strict.

The trivial schema {}

For example, the trivial schema {} is the least strict; it “validates” anything! (This sounds silly, but can be useful at times.)
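To see why {} accepts everything, it helps to think of a schema as a bag of constraints: an empty bag has nothing to violate. The sketch below is a toy checker written only for this illustration (it is not the jsonschema library, and it understands just the `type` and `required` keywords); the `stricter` schema and its `steps` key are made-up examples.

```python
from typing import Any, Dict, List

# Map JSON Schema type names to Python types (toy subset only).
_TYPES = {"object": dict, "array": list, "string": str}


def violations(instance: Any, schema: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the instance validates.

    Toy checker: only the `type` and `required` keywords are understood.
    """
    problems: List[str] = []
    expected = schema.get("type")
    if expected is not None and not isinstance(instance, _TYPES[expected]):
        problems.append(f"expected type {expected!r}")
    # `required` only constrains objects (dicts); other values ignore it.
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                problems.append(f"missing required key {key!r}")
    return problems


trivial: Dict[str, Any] = {}  # no keywords, so no constraints
stricter = {"type": "object", "required": ["steps"]}  # hypothetical example

for candidate in ({"steps": []}, "just a string", 42):
    print(repr(candidate), violations(candidate, trivial), violations(candidate, stricter))
```

Every candidate passes the trivial schema, while only the first one passes the stricter schema. For real validation, use a full JSON Schema implementation rather than a toy like this.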

The Syntax schemas

Next, there are various basic schemas that simply check that a yml file is syntactically correct, but otherwise do not place any restrictions on the file contents.

The CWL schema

CWL files are just special yml files, so there are validators that additionally check the CWL-specific structure.

We are getting closer, but wic workflows aren’t just any yml or cwl files; they are even more special. So how can we create a special schema that will only accept wic workflows?

The WIC schema

At the opposite end of the spectrum (from {}), we could create the most strict schema possible. This would be a schema which only accepts inputs from all known tools and subworkflows. In other words, we can use a whitelist instead of a blacklist. We can assume a closed world instead of an open world. Where does this whitelist come from? It comes from the auto-discovery mechanism!

So the command wic --generate_schemas

  • makes a list of every single tool and workflow available

  • generates a separate sub-schema for each tool / workflow individually (this is why the flag says schemas, plural)

  • combines all of the sub-schemas into a single giant disjoint union!

In other words, each step in a WIC workflow had better be chosen from a list of valid steps!
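The three bullet points above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the code that wic --generate_schemas actually runs: the tool names, the `{tool_name: {"in": {...}}}` step layout, and the helper names `step_schema` and `combine` are all assumptions made for the example. The `anyOf`, `required`, `properties`, and `additionalProperties` keywords are standard JSON Schema.

```python
from typing import Any, Dict, List

Schema = Dict[str, Any]


def step_schema(tool_name: str, input_names: List[str]) -> Schema:
    """Sub-schema that accepts exactly one kind of step: {tool_name: {"in": {...}}}."""
    return {
        "type": "object",
        "properties": {
            tool_name: {
                "type": "object",
                "properties": {
                    "in": {
                        "type": "object",
                        # Any value is allowed for a known input ({} is the trivial schema).
                        "properties": {name: {} for name in input_names},
                        # Closed world: inputs not listed above are rejected.
                        "additionalProperties": False,
                    }
                },
                "additionalProperties": False,
            }
        },
        "required": [tool_name],
        "additionalProperties": False,
    }


def combine(sub_schemas: List[Schema]) -> Schema:
    """A step is valid if it matches one of the sub-schemas."""
    return {"anyOf": sub_schemas}


# Hypothetical tools and inputs, standing in for whatever auto-discovery finds.
discovered = {
    "download_file": ["url"],
    "convert_format": ["input_path", "output_format"],
}

wic_schema = combine([step_schema(name, inputs) for name, inputs in sorted(discovered.items())])
print(len(wic_schema["anyOf"]), "sub-schemas combined")
```

Because each sub-schema in this sketch requires a different tool name and forbids extra keys, no step can match two of them, which is what makes the union disjoint.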

Stale schemas

A direct consequence of using the most strict possible schema is that, as you add and/or modify tools and workflows, you have to keep re-generating the WIC schema so it doesn’t become stale. Otherwise you will get validation errors, because you will be attempting to validate against an old schema which no longer reflects the tools and workflows that are currently available.
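Here is a tiny model of how a stale schema produces a spurious validation error. The whitelist is reduced to a plain set of tool names, and all of the names (including the helper `unknown_steps`) are hypothetical; the real schema is far richer than this.

```python
from typing import List, Set


def unknown_steps(workflow_steps: List[str], whitelist: Set[str]) -> List[str]:
    """Steps that a schema generated from `whitelist` would reject."""
    return [step for step in workflow_steps if step not in whitelist]


# Hypothetical tool names. The schema was generated when only two tools existed.
tools_at_generation = {"download_file", "convert_format"}

# Later, a new tool is added and immediately used in a workflow.
tools_now = tools_at_generation | {"align_sequences"}
workflow = ["download_file", "align_sequences"]

print(unknown_steps(workflow, tools_at_generation))  # stale schema: ['align_sequences']
print(unknown_steps(workflow, tools_now))            # regenerated schema: []
```

The workflow itself is fine in both cases; only the whitelist it is checked against has changed.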

The exact same thing happens in Python when you are in the middle of editing a file. Until you push the save button, VSCode will keep attempting to validate your Python code against the functions and methods that were available the last time you saved. It isn’t a surprise that you get a red squiggly line while you are still typing. Makes sense, right?

That said, for technical and performance reasons, we do not (for now) automatically re-generate the wic schema.

TL;DR

You must periodically run this command!

rm -rf autogenerated/schemas/ && wic --generate_schemas

If you are getting validation errors, try re-running this command!

Property Based Testing

So why are we going through all this trouble to create the most strict possible schema?

The answer is that thanks to the excellent hypothesis-jsonschema library, we can use the WIC schema to perform property-based integration testing on the entire WIC language, for every single tool and workflow simultaneously!

We can randomly generate entire synthetic workflows!
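To give a feel for what "randomly generate, then check a property" means, here is a deliberately crude stand-in that uses only the standard library `random` module. It is not how the wic test suite works: hypothesis-jsonschema provides a `from_schema` function that derives a generator from the schema itself, whereas this sketch hand-rolls one from a hypothetical whitelist. The tool names, the step layout, and `compile_workflow` (a stub standing in for the real compiler) are all assumptions.

```python
import random
from typing import Dict, List

# Hypothetical whitelist: tool name -> input names.
KNOWN_TOOLS: Dict[str, List[str]] = {
    "download_file": ["url"],
    "convert_format": ["input_path", "output_format"],
}

Step = Dict[str, Dict[str, Dict[str, str]]]


def random_workflow(rng: random.Random, max_steps: int = 5) -> List[Step]:
    """Build a synthetic workflow whose steps all come from the whitelist."""
    steps: List[Step] = []
    for _ in range(rng.randint(1, max_steps)):
        tool = rng.choice(sorted(KNOWN_TOOLS))
        inputs = {name: f"value_{rng.randint(0, 99)}" for name in KNOWN_TOOLS[tool]}
        steps.append({tool: {"in": inputs}})
    return steps


def compile_workflow(steps: List[Step]) -> List[str]:
    """Hypothetical stand-in for the real compiler: returns one name per step."""
    return [next(iter(step)) for step in steps]


def check_property(num_examples: int = 200, seed: int = 0) -> int:
    """Property: compiling any valid workflow succeeds and keeps every step."""
    rng = random.Random(seed)  # seeded so failures are reproducible
    for _ in range(num_examples):
        workflow = random_workflow(rng)
        compiled = compile_workflow(workflow)
        assert len(compiled) == len(workflow)
        assert all(name in KNOWN_TOOLS for name in compiled)
    return num_examples


print(check_property())
```

The point is that nobody writes the individual test cases by hand. You state one property that should hold for every valid workflow and let the generator hunt for a counterexample.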

This has already been implemented, and it has been incredibly useful for finding a few (and thankfully only a few!) bugs. It also revealed a few subtle design issues (which were fixed long ago).

The Fuzzy Compile CI logs should speak for themselves.

Well what about … !?!

Pydantic

The jsonschema validation for the yml syntax is independent of the Pydantic validation for the Python API. You can use both, or neither. They are not mutually exclusive!

Unit Tests

Notice that we get all of this for free! We do not necessarily have to write any additional tests! In particular, we can get away with writing a much smaller number of unit tests, which have a tendency to be strongly coupled with the implementation they are testing and in extreme cases can become a maintenance burden of their own.

Code Coverage

~70% code coverage is plenty good enough.