Distributed workflows » History » Version 6
Tom Clegg, 03/08/2018 07:10 PM
h1. Distributed workflows

h2. Problem description

A user wants to run a meta-analysis on data located on several different clusters. For either efficiency or legal reasons, the data should be analyzed in place and the results aggregated and returned to a central location. The user should be able to express the multi-cluster computation as a single CWL workflow, with no manual intervention (data transfer, etc.) required to complete the workflow.

h2. Simplifying assumptions

The user explicitly indicates in the workflow on which cluster a given computation (data+code) happens.

Data transfer only occurs between the primary cluster and the secondary clusters, not between secondary clusters.

h2. Proposed solution

h3. Run subworkflow on cluster

A workflow step can be given a CWL hint, "RunOnCluster". This indicates that the tool or subworkflow run by the workflow step should run on a specific Arvados cluster, instead of being submitted to the cluster that the workflow runner is currently running on. The implementation would be similar to the "RunInSingleContainer" feature: construct a container request that runs the workflow runner on the remote cluster, and wait for results.
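As a sketch only, the proposed hint might appear in a workflow step as below. The "RunOnCluster" class name comes from this proposal; the @arv:@ namespace prefix and the "cluster" field are illustrative assumptions, not a settled schema.

```yaml
# Hypothetical CWL workflow step using the proposed RunOnCluster hint.
# Only the hint name comes from this proposal; the namespace prefix and
# the "cluster" field are assumptions for illustration.
steps:
  analyze_remote_data:
    run: subworkflow.cwl
    in:
      reference: remote_reference_data
    out: [results]
    hints:
      arv:RunOnCluster:
        cluster: clsr2        # ID of the cluster where this step should run
```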

h3. Data transfer

In order for the workflow to run successfully on the remote cluster, it needs its data dependencies (docker images, scripts, reference data, etc.). There are several options:

# Don't do any data transfer of dependencies. Workflows will fail if dependencies are not available; the user must manually transfer collections using arv-copy.
** pros: least work
** cons: terrible user experience; workflow patterns that transfer data out to remote clusters don't work
# Distribute dependencies as part of workflow registration (requires proactively distributing dependencies to every cluster that might ever need them).
** pros: less burden on the user compared to option (1)
** cons: doesn't guarantee the dependencies are available where needed; the --create/update-workflow option of arvados-cwl-runner has to orchestrate upload of data to every cluster in the federation; workflow patterns that transfer data out to remote clusters don't work
# The workflow runner determines which dependencies are missing from the remote cluster and pushes them before scheduling the subworkflow.
** pros: no user intervention required; only copies data to clusters that we think will need it
** cons: copies all dependencies regardless of whether they are actually used; requires that the primary runner have all the dependencies, or be able to facilitate transfer from some other cluster
# The workflow runner on the remote cluster determines which dependencies are missing and pulls them from federated peers on demand.
** pros: no user intervention required; only copies data we actually need
** cons: requires that the primary runner have all the dependencies, or be able to facilitate transfer from some other cluster
# Federated access to collections: fetch data blocks on demand from another cluster.
** pros: only fetches data blocks that are actually needed; no collection record copy in the remote database
** cons: requires SDK improvements to handle multiple clusters; requires a caching proxy to avoid re-fetching the same block (for example, if 100 nodes are all trying to run a docker image from a federated collection)
# Hybrid federation: copy a collection to the remote cluster but retain the UUID/permissions from the source.
** pros: no user intervention; only fetches blocks we need; fetches data blocks from local keep if available, remote keep if necessary
** cons: semantics/permission model for "cached" collection records are not yet defined
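Options 3 and 4 come down to the same "sync missing dependencies" computation, differing only in which cluster's runner performs it. A minimal sketch, where @local_pdhs@ and @pull_collection@ are hypothetical stand-ins for the real Arvados API calls (collection listing, federated copy):

```python
# Sketch of the "transfer required collections" step shared by options
# 3 and 4. Collections are identified by portable data hash (PDH).
# local_pdhs and pull_collection are hypothetical stand-ins for real
# Arvados API operations, not actual SDK calls.

def sync_dependencies(required_pdhs, local_pdhs, pull_collection):
    """Copy any dependency collections not already present locally.

    required_pdhs   -- PDHs the subworkflow needs (docker images,
                       scripts, reference data, ...)
    local_pdhs      -- PDHs already present on this cluster
    pull_collection -- callback that fetches one collection from a
                       federated peer
    Returns the set of PDHs that were transferred.
    """
    missing = set(required_pdhs) - set(local_pdhs)
    for pdh in sorted(missing):
        pull_collection(pdh)
    return missing
```

In option 3 the primary runner would call this before scheduling the subworkflow; in option 4 the remote runner would call it on startup.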

h4. Notes

Options 1 and 2 cannot support workflows that involve some local computation followed by passing intermediate results to a remote cluster for computation.

Options 2, 3 and 4 involve a similar level of effort, mainly in arvados-cwl-runner. Of these, option 4 seems to cover the most use cases: a general "transfer required collections" method will cover data transfer for dependencies, intermediate collections, and outputs.

Option 5 involves adding federation-awareness to the Python/Go SDKs, arv-mount, crunch-run, and the API server. In order to work efficiently when clusters are distant and large remote collections are accessed more than once per workflow, it will need an infrastructure-level cache solution.
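The caching proxy that option 5 calls for amounts to a shared block cache keyed by block locator. A minimal in-process sketch of the idea, where @fetch_remote@ is a hypothetical stand-in for a federated Keep block read (the real solution would be a shared infrastructure service, not per-process):

```python
from collections import OrderedDict

class BlockCache:
    """LRU cache in front of remote block fetches, so that e.g. 100
    nodes pulling the same docker image blocks hit the remote cluster
    only once per block (the re-fetching problem noted for option 5)."""

    def __init__(self, fetch_remote, max_blocks=1024):
        self.fetch_remote = fetch_remote  # locator -> block bytes (stand-in)
        self.cache = OrderedDict()        # locator -> block bytes, LRU order
        self.max_blocks = max_blocks
        self.remote_fetches = 0           # instrumentation

    def get(self, locator):
        if locator in self.cache:
            self.cache.move_to_end(locator)   # mark as recently used
            return self.cache[locator]
        self.remote_fetches += 1
        data = self.fetch_remote(locator)
        self.cache[locator] = data
        if len(self.cache) > self.max_blocks:
            self.cache.popitem(last=False)    # evict least recently used
        return data
```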

Option 6's level of effort is probably somewhere between that of options 4 and 5.

h3. Outputs

Finally, after a subworkflow runs on a remote cluster, the primary cluster needs to access the output and possibly run additional steps. This requires accessing a single output collection, either by pulling it to the primary cluster (using the same features supporting option 4) or by federation (options 5 and 6).