Distributed workflows » History » Version 6
Tom Clegg, 03/08/2018 07:10 PM
h1. Distributed workflows

h2. Problem description

A user wants to run a meta-analysis on data located on several different clusters. For either efficiency or legal reasons, the data should be analyzed in place and the results aggregated and returned to a central location. The user should be able to express the multi-cluster computation as a single CWL workflow, with no manual intervention (data transfer, etc.) required to complete the workflow.

h2. Simplifying assumptions

The user explicitly indicates in the workflow on which cluster a given computation (data+code) happens.

Data transfer only occurs between the primary cluster and the secondary clusters, not between secondary clusters.

h2. Proposed solution

h3. Run subworkflow on cluster

A workflow step can be given a CWL hint, "RunOnCluster". This indicates that the tool or subworkflow run by the workflow step should run on a specific Arvados cluster, instead of being submitted to the cluster that the workflow runner is currently running on. The implementation would be similar to the "RunInSingleContainer" feature: construct a container request that runs the workflow runner on the remote cluster, and wait for results.
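As a sketch only, the proposed hint might appear in a workflow step as below. The "RunOnCluster" class name comes from this proposal; the @arv:@ namespace prefix and the "cluster" field are illustrative assumptions, not a settled schema.

```yaml
# Hypothetical CWL workflow step using the proposed RunOnCluster hint.
# Only the hint name comes from this proposal; the namespace prefix and
# the "cluster" field are assumptions for illustration.
steps:
  analyze_remote_data:
    run: subworkflow.cwl
    in:
      reference: remote_reference_data
    out: [results]
    hints:
      arv:RunOnCluster:
        cluster: clsr2        # ID of the cluster where this step should run
```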

h3. Data transfer

In order for the workflow to run successfully on the remote cluster, it needs its data dependencies (docker images, scripts, reference data, etc.). There are several options:

# Don't do any data transfer of dependencies. Workflows will fail if dependencies are not available; the user must manually transfer collections using arv-copy.
** pros: least work
** cons: terrible user experience; workflow patterns that transfer data out to remote clusters don't work
# Distribute dependencies as part of workflow registration (requires proactively distributing dependencies to every cluster that might ever need them).
** pros: less burden on the user compared to option (1)
** cons: doesn't guarantee the dependencies are available where needed; the --create/update-workflow option of arvados-cwl-runner has to orchestrate upload of data to every cluster in the federation; workflow patterns that transfer data out to remote clusters don't work
# The workflow runner determines which dependencies are missing from the remote cluster and pushes them before scheduling the subworkflow.
** pros: no user intervention required; only copies data to clusters that we think will need it
** cons: copies all dependencies regardless of whether they are actually used; requires that the primary runner have all the dependencies, or be able to facilitate transfer from some other cluster
# The workflow runner on the remote cluster determines which dependencies are missing and pulls them from federated peers on demand.
** pros: no user intervention required; only copies data we actually need
** cons: requires that the primary runner have all the dependencies, or be able to facilitate transfer from some other cluster
# Federated access to collections: fetch data blocks on demand from another cluster.
** pros: only fetches data blocks that are actually needed; no collection record copy in the remote database
** cons: requires SDK improvements to handle multiple clusters; requires a caching proxy to avoid re-fetching the same block (for example, if 100 nodes are all trying to run a docker image from a federated collection)
# Hybrid federation: copy a collection to the remote cluster but retain the UUID/permissions from the source.
** pros: no user intervention; only fetches blocks we need; fetches data blocks from local keep if available, remote keep if necessary
** cons: semantics/permission model for "cached" collection records are not yet defined
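Options 3 and 4 come down to the same "sync missing dependencies" computation, differing only in which cluster's runner performs it. A minimal sketch, where @local_pdhs@ and @pull_collection@ are hypothetical stand-ins for the real Arvados API calls (collection listing, federated copy):

```python
# Sketch of the "transfer required collections" step shared by options
# 3 and 4. Collections are identified by portable data hash (PDH).
# local_pdhs and pull_collection are hypothetical stand-ins for real
# Arvados API operations, not actual SDK calls.

def sync_dependencies(required_pdhs, local_pdhs, pull_collection):
    """Copy any dependency collections not already present locally.

    required_pdhs   -- PDHs the subworkflow needs (docker images,
                       scripts, reference data, ...)
    local_pdhs      -- PDHs already present on this cluster
    pull_collection -- callback that fetches one collection from a
                       federated peer
    Returns the set of PDHs that were transferred.
    """
    missing = set(required_pdhs) - set(local_pdhs)
    for pdh in sorted(missing):
        pull_collection(pdh)
    return missing
```

In option 3 the primary runner would call this before scheduling the subworkflow; in option 4 the remote runner would call it on startup.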

h4. Notes

Options 1 and 2 cannot support workflows that involve some local computation followed by passing intermediate results to a remote cluster for computation.

Options 2, 3 and 4 involve a similar level of effort, mainly in arvados-cwl-runner. Of these, option 4 seems to cover the most use cases: a general "transfer required collections" method will cover data transfer for dependencies, intermediate collections, and outputs.

Option 5 involves adding federation-awareness to the Python/Go SDKs, arv-mount, crunch-run, and the API server. In order to work efficiently when clusters are distant and large remote collections are accessed more than once per workflow, it will need an infrastructure-level cache solution.
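The caching proxy that option 5 calls for amounts to a shared block cache keyed by block locator. A minimal in-process sketch of the idea, where @fetch_remote@ is a hypothetical stand-in for a federated Keep block read (the real solution would be a shared infrastructure service, not per-process):

```python
from collections import OrderedDict

class BlockCache:
    """LRU cache in front of remote block fetches, so that e.g. 100
    nodes pulling the same docker image blocks hit the remote cluster
    only once per block (the re-fetching problem noted for option 5)."""

    def __init__(self, fetch_remote, max_blocks=1024):
        self.fetch_remote = fetch_remote  # locator -> block bytes (stand-in)
        self.cache = OrderedDict()        # locator -> block bytes, LRU order
        self.max_blocks = max_blocks
        self.remote_fetches = 0           # instrumentation

    def get(self, locator):
        if locator in self.cache:
            self.cache.move_to_end(locator)   # mark as recently used
            return self.cache[locator]
        self.remote_fetches += 1
        data = self.fetch_remote(locator)
        self.cache[locator] = data
        if len(self.cache) > self.max_blocks:
            self.cache.popitem(last=False)    # evict least recently used
        return data
```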

Option 6's level of effort is probably somewhere between that of options 4 and 5.

h3. Outputs

Finally, after a subworkflow runs on a remote cluster, the primary cluster needs to access the output and possibly run additional steps. This requires accessing a single output collection, either by pulling it to the primary cluster (using the same features supporting option 4) or by federation (options 5 and 6).