Story #8645


Store pipeline resources in new group type

Added by Peter Amstutz about 9 years ago. Updated over 5 years ago.

Status: Closed

Priority: Normal

Description

"A pipeline is a group"

Proposal: In crunch v2, users will treat an invocation of a pipeline and its related resources (or a subset of those resources) as a "bundle" that can be shared, copied, moved, downloaded, etc. as a unit. The bundle can include:

Container requests
Copies of input collections
Copies of docker images
(Perhaps incomplete) clones of git repositories
Copies of container logs
Copies of container output collections

Implementation overview

When running a pipeline, rather than creating a "pipeline instance" object as in crunch1, Arvados creates a new group with group_class="pipeline". Inputs are copied into the pipeline group when (or even before?) the pipeline starts, and container outputs are copied into the pipeline group as container requests are completed.
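A minimal in-memory sketch of the flow described above, to make the ownership model concrete. It does not use the Arvados SDK or API; the names `Group`, `Item`, `start_pipeline`, and `on_request_completed` are hypothetical and exist only for illustration. The only detail taken from the proposal is group_class="pipeline" and the timing of the copies.

```python
from dataclasses import dataclass, field
from typing import List
from uuid import uuid4


@dataclass
class Item:
    kind: str        # e.g. "collection" or "container_request" (illustrative)
    name: str
    owner_uuid: str  # uuid of the group that owns this copy


@dataclass
class Group:
    name: str
    group_class: str
    uuid: str = field(default_factory=lambda: str(uuid4()))
    items: List[Item] = field(default_factory=list)


def copy_into(group: Group, kind: str, name: str) -> Item:
    """Record a copy of an object as owned by the given group."""
    item = Item(kind=kind, name=name, owner_uuid=group.uuid)
    group.items.append(item)
    return item


def start_pipeline(name: str, input_names: List[str]) -> Group:
    """Create the pipeline group and copy the inputs into it up front."""
    group = Group(name=name, group_class="pipeline")
    for input_name in input_names:
        copy_into(group, "collection", input_name)
    return group


def on_request_completed(group: Group, request_name: str, output_name: str) -> None:
    """Copy a finished container request and its output into the pipeline group."""
    copy_into(group, "container_request", request_name)
    copy_into(group, "collection", output_name)
```

In this sketch everything in `group.items` is what would be shared when the group is shared; objects that are merely referenced never appear in it.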

Pipelines get special treatment in Workbench. ("special" tbd?)

Benefits

Workbench can show (and control) "what you will share when you press Share".

It is easy to distinguish objects that are "included" in the bundle -- and therefore will be shared when the bundle is shared -- from objects that are referenced by the pipeline (and might be readable by the current user) but aren't in the bundle.
If you don't want to share some bits (e.g., non-free code, private data), simply delete them from the bundle. Optionally, make a full copy for yourself first.

By default -- if you don't delete any inputs from your bundle -- you protect yourself from accidentally deleting or modifying one of your pipeline dependencies and making your pipeline impossible to reproduce. Examples:

Even after deleting a commit from your git repo with a non-FF push, you should still be able to view that version of the source code if you used it in a pipeline. (But you should also have the option of deleting/unsharing code and data without deleting the metadata about the pipeline, if that's really what you want.)
The user doesn't have the burden of remembering which input collections should be "frozen" in order to make pipelines reproducible. Currently, it's too easy to modify a dataset (e.g., rename a file) and then much later realize that you can no longer run a pipeline that used the old version as an input. With the proposed approach, the version of each input needed to re-run the pipeline is preserved until the user deletes it from that pipeline.

Limited sharing is easy to reason about: copy a pipeline and then delete the parts you don't want to share.

A pipeline can include information about failed containers / container requests that were later re-attempted.
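The "copy a pipeline, then delete the parts you don't want to share" workflow could look roughly like the sketch below. The bundle is modelled as a plain dict, and `share_copy` is a hypothetical helper, not an existing Arvados call; the point is only that the original bundle stays intact while the shared copy is pruned.

```python
import copy


def share_copy(bundle: dict, exclude: set) -> dict:
    """Return a copy of the bundle with the named items removed.

    The original bundle is not modified, so the owner keeps a full copy.
    """
    shared = copy.deepcopy(bundle)
    shared["items"] = [item for item in shared["items"] if item["name"] not in exclude]
    return shared
```

For example, excluding an item named "private-data" yields a bundle that can be shared as a unit, while the owner's bundle still contains every input needed to re-run the pipeline.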

Other side effects

More groups in the system. This is one incentive (of many) to improve the permission system implementation to use a Postgres join instead of keeping a cache of all group UUIDs readable/writable by each user.
More identical copies of collections. Search results will be noisier unless we de-dup/filter/sort results effectively.

Implementation details

git

A pipeline bundle should include a snapshot of the parts of the original git repository that were used to run the job. The snapshot should be made efficiently -- for example, using "git clone /path/to/original/repo" to make hardlinks rather than copying the git data objects.
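A rough sketch of the hardlink-based snapshot, assuming the original repository and the snapshot live on the same filesystem (git can only hardlink objects in that case and otherwise copies them). When the source is given as a local path, `git clone` hardlinks the files under the objects directory by default; `--local` is passed here only to make that explicit. The helper names and the choice of `--bare` (no working tree) are assumptions for illustration. This clones the whole repository, so limiting the snapshot to "the parts that were used" would need additional work.

```python
import subprocess
from pathlib import Path
from typing import List


def snapshot_command(src_repo: Path, dest: Path) -> List[str]:
    """Build the git command for a local, hardlinked, bare clone."""
    return ["git", "clone", "--quiet", "--local", "--bare", str(src_repo), str(dest)]


def snapshot_repo(src_repo: Path, dest: Path) -> None:
    """Run the clone; raises CalledProcessError if git fails."""
    subprocess.run(snapshot_command(src_repo, dest), check=True)
```

A call such as `snapshot_repo(Path("/path/to/original/repo"), Path("/path/to/snapshot.git"))` would then produce the snapshot; the destination path here is a placeholder.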
