Comparing Yadage and Parsl Workflow Languages
Yadage and Parsl are two popular workflow languages. As part of project SCAILFIN, we are looking into scripting the REANA reproducible scientific workflow framework from CERN, which currently supports Yadage and the Common Workflow Language for specifying workflows. We are keenly interested in a possible role for Parsl in this framework.
Basics
Yadage and Parsl are both workflow languages, and both generate directed acyclic graphs (DAGs). Yadage represents workflows in YAML, with dependencies declared as references to other tasks. Parsl uses Python to describe the workflow and takes a more dataflow-oriented view of dependencies: they are represented as futures over data results, and a task starts once the futures it depends on have completed.
Tasks
Yadage
The atomic unit of the workflow is a packtivity, a packaged activity. It represents a single parametrized processing step. The parameters are passed as YAML documents and the processing step is executed using one of several backends. After processing, the packtivity publishes JSON data that references results relevant for further processing (e.g. files created during the step).[1]
Parsl
In Parsl an “app” is a piece of code that can be asynchronously executed on an execution resource. An execution resource in this context is any target system such as a laptop, cluster, cloud, or even supercomputer. Execution on these resources can be performed by a pool of threads, processes, or remote workers.
Parsl apps are defined by annotating Python functions with an app decorator. Currently two types of apps can be defined: Python, with the corresponding @python_app decorator, and Bash, with the corresponding @bash_app decorator. Python apps encapsulate pure Python code, while Bash apps wrap calls to external applications and scripts.[2]
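The decorator model can be sketched without Parsl itself. The hypothetical `app` decorator below mimics what `@python_app` does (submit the call to an executor and immediately return a future) using only the standard library; in real Parsl code one would use `from parsl import python_app` instead:

```python
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)

def app(func):
    """Minimal stand-in for Parsl's @python_app: calling the decorated
    function submits it to an executor and returns a future."""
    def wrapper(*args, **kwargs):
        return _executor.submit(func, *args, **kwargs)
    return wrapper

@app
def double(x):
    return 2 * x

future = double(21)    # returns immediately with a future
print(future.result())  # blocks until the task completes; prints 42
```

Because the decorated call returns a future rather than a value, a script composed of such calls naturally describes a graph of asynchronous tasks.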
Feature | Yadage | Parsl |
---|---|---|
Task description | Can be a string-interpolated command line executed as a single bash command, with string interpolation from a parameter set. Can also be a multi-line script executed by bash, Python, or the ROOT C++ interpreter. | Can be a bash script or a Python function. Python apps are simple: write a function and decorate it with @python_app. Bash apps are also written as decorated Python functions, but return a string containing the command to be executed. |
Parameters | A dictionary provided in the parameters property of the task. Each key is available for string interpolation in the task's script or command. | Arguments to the Python function can be used directly in Python apps, or interpolated into the returned command string for Bash apps. |
Environment | Encoded as part of the task definition. Mostly used to identify the docker image to run the step in as well as some CERN-specific extras to mount a filesystem and provide authentication. Yadage also supports a local process environment which runs on the host server. | Not part of the task description and must be associated with the task in the main driver script when setting up the workflow. |
Data Publication | Yadage requires a shared filesystem. A step publishes data using the publish property. The publisher-type property under it is not well documented; values seen are frompar-pub, which publishes a value taken from a parameter, fromglob-pub, which expands a directory listing from a wildcard given in globexpression, and constant-pub. | Parsl apps can take inputs and outputs parameters, each a list of futures. Parsl waits for the input futures to resolve before starting the task; likewise, data published to the output futures does not resolve until the task is complete. Parsl also has a data management layer that allows data to be consumed from and written to various local and remote filesystems. |
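The futures-based data passing described in the table can be sketched with the standard library (plain `concurrent.futures` rather than Parsl's API, and manual resolution of the upstream future, which Parsl would do automatically for anything passed via `inputs=`):

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor()

def produce():
    return [1, 2, 3]

def consume(upstream):
    # Wait for the upstream future to resolve before using its data,
    # mirroring how Parsl resolves input futures before a task runs.
    data = upstream.result()
    return sum(data)

a = pool.submit(produce)
b = pool.submit(consume, a)  # b depends on a's data
print(b.result())            # prints 6
```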
Workflows
Yadage
Instead of describing a specific graph of tasks, a yadage workflow definition consists of a collection of stages that describe how an existing graph should be extended with additional nodes and edges. Starting from an empty graph (0 nodes, 0 edges), the workflow is built up sequentially through application of these stages. This allows yadage to process workflows whose graph structure is not known at definition time (such as a workflow producing a variable number of data fragments).
A stage consists of two pieces:
A stage body (i.e. its scheduler): This section describes the logic for defining new nodes (i.e. packtivities with a specific parameter input) and the new edges that attach them to the existing graph. Currently yadage supports two stage types, one defining a single node and one defining multiple nodes, both of which add edges according to the data accessed from upstream nodes.
A predicate (i.e. its dependencies): The predicate (also referred to as the stage's dependencies) is a description of when the stage body is ready to be applied. Currently yadage supports a single predicate that takes a number of JSONPath expressions. Each expression selects a number of stages. The dependency is considered satisfied when all packtivities (i.e. nodes) associated with those stages have a published result.[1]
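The stage-application loop described above can be sketched in plain Python (hypothetical data structures, not yadage's actual API): start from an empty graph and repeatedly apply any stage whose predicate is satisfied, adding nodes and edges as you go.

```python
# Sketch of yadage's build-up model: the graph starts empty and is
# extended by applying stages whose dependencies are all satisfied.
graph = {}  # node name -> list of upstream node names

stages = [
    {"name": "select", "deps": []},
    {"name": "fit",    "deps": ["select"]},
    {"name": "plot",   "deps": ["fit"]},
]

applied = set()
while len(applied) < len(stages):
    for stage in stages:
        ready = stage["name"] not in applied and all(
            d in applied for d in stage["deps"])
        if ready:
            # The stage body adds a node plus edges to its upstreams.
            graph[stage["name"]] = list(stage["deps"])
            applied.add(stage["name"])

print(graph)  # {'select': [], 'fit': ['select'], 'plot': ['fit']}
```

Because stages fire only when their predicates are satisfied, the final graph shape can depend on data produced at run time, which is the point of the model.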
Parsl
Workflows in Parsl are created implicitly based on the passing of control or data between apps. The flexibility of this model allows for the implementation of a wide range of workflow patterns from sequential through to complex nested, parallel workflows.
Parsl is also designed to address broad execution requirements, from workflows that run a large number of very small tasks to those that run a few long-running tasks. In each case, Parsl can be configured to optimize deployment towards performance or fault tolerance.[2]
Workflow Patterns
Pattern | Description | Yadage | Parsl |
---|---|---|---|
Procedural workflows | Simple sequential or procedural workflows | Yadage manages the transitions based on outputs from a task completing | Parsl can track app futures which link end of one task to start of another |
Parallel workflows | Parallel execution, respecting dependencies among app executions. | Automatically builds parallel DAGs from spec | Automatically generated from App dependencies |
Parallel workflows with loops | Looping constructs that launch a variable number of parallel tasks | Can be implemented with stages that produce a variable number of data fragments | A simple Python loop that appends calls to the Parsl app |
Parallel dataflows | Parallel workflows driven by data results, and not task completion | Yadage won't start a new task until all of the outputs from a previous step are complete | Parsl tracks dependencies either by task, or by data future. Tasks can be triggered by one or more of these futures completing |
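The loop pattern from the table can be sketched with stdlib futures (in real Parsl code, `pool.submit(square, i)` would simply be a call to a decorated app, `square(i)`):

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

pool = ThreadPoolExecutor()

# Fan out: a plain loop launches one task per input fragment.
futures = [pool.submit(square, i) for i in range(5)]

# Fan in: collect results once every future has resolved.
results = [f.result() for f in futures]
print(results)  # [0, 1, 4, 9, 16]
```

The number of loop iterations can itself come from an earlier task's result, which is how the variable-fan-out pattern falls out of ordinary Python control flow.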
Execution Backends
A workflow description language is only as useful as its execution backends. Both packages offer extensive options here.
Yadage
Yadage comes with a command line tool for running workflows. It accepts a workflow definition and optional parameters. By default it will just run the workflow as a multiprocessing pool on the local machine.
Yadage can run on various backends such as multiprocessing pools, IPython clusters, or Celery clusters. If human intervention is needed for certain steps, it can also be run interactively. Significantly, Yadage is supported by CERN's reproducible science framework, REANA, which uses it to execute yadage workflows inside Kubernetes in a standardized environment.
Parsl
Parsl scripts can be executed on different execution providers (e.g., PCs, clusters, supercomputers) and using different execution models (e.g., threads, pilot jobs, etc.). Parsl separates the code from the configuration that specifies which execution provider(s) and executor(s) to use. It provides a high-level abstraction, called a block, that gives a uniform description of a resource configuration irrespective of the specific execution provider.
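A block-based configuration might look like the following sketch, built from Parsl's documented Config, HighThroughputExecutor, and SlurmProvider classes; the partition name and block counts are placeholder values, not recommendations:

```python
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider

# Placeholder values throughout; tune for the target cluster.
config = Config(
    executors=[
        HighThroughputExecutor(
            label="htex_slurm",
            provider=SlurmProvider(
                partition="compute",  # hypothetical partition name
                nodes_per_block=2,    # nodes acquired per block
                init_blocks=1,
                max_blocks=4,
            ),
        )
    ],
)
```

Loading this configuration with `parsl.load(config)` binds subsequent app invocations to the described resources; swapping the provider retargets the same workflow script to a different cluster.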
The Parsl ecosystem includes specific interfaces for running on many of the popular HTC environments such as Jetstream, Condor, and Slurm. It also has interfaces for commercial cloud providers.
The HighThroughputExecutor implements hierarchical scheduling and batching, and consistently delivers high-throughput task execution on the order of 1,000 nodes. Key to this performance is the coupling between the Parsl driver program and the executors: Parsl depends on the IPython library being present in the executor image, and the workflow steps are pickled using CloudPickle and transmitted to the workers.
Conclusions
Both frameworks offer flexible and easy to read workflow definitions which are automatically translated into parallel processing graphs. Both support a number of execution backends to enable execution of the workflows on a variety of hosts and clusters available to our community.
The obvious difference between the two is the choice of language used to represent the workflow. Yadage starts with the popular YAML file format. This follows a traditional approach that represents workflow using a configuration file. Parsl’s unique approach is to use python as the specification language.
Parsl’s unfamiliar “workflow as code” model takes a little getting used to for new developers, but once the basic concepts are grasped it is easy to work with and can express very complicated parallel workflows. Yadage’s YAML files are easy to get started with, but it seems it would be difficult to specify anything much more complicated than a few parallel threads.
The Parsl HighThroughputExecutor has gained a reputation as a high-performance workflow execution back-end and can scale to thousands of nodes. Delivering this performance requires that Parsl take control over execution of the code, and it depends on the presence of the Parsl library, IPython, and Python 3 inside any execution environment where it will run. The versions of these dependencies must match those of the driver program.
Yadage execution back-ends take a less invasive approach: they launch docker containers and set the command to run as well as any environment variables. This makes Yadage ideal for reproducible workflows in REANA, since there are very few constraints on what can go into the docker image.
Given the expressiveness of Parsl for complicated parallel workflows and the highly performant set of executors, Parsl should be considered for demanding production workflows. However the need to introduce dependencies into the built images would seem to make it unsuitable for preservation. Yadage is a better fit for those applications.