Build docker images as part of a workflow » History » Version 2

Tom Clegg, 12/05/2022 10:08 PM

h1. Build docker images as part of a workflow

(draft)

h2. Background

Container images provide a well-defined execution environment for doing reproducible work. As long as the image is runnable by a container engine, a job can be repeated. However, the point of reproducibility isn't just to allow repetition of the same computation -- it's to make it possible to use prior work as the starting point for future work. Much of this opportunity is lost if the provenance trail ends at a binary image.

Ideally, when a bug is discovered in an analysis tool or library, it should be easy to identify which existing results are affected, and re-run those analyses with the updated software.

Users should have the option of building container images
* ...as part of a CWL workflow (so they can update the image-building instructions and hit one "re-run" button to see the result)
* ...in Arvados containers (so the build environment is controlled, build logs are saved, etc.)
* ...without having docker on the client side (so build-and-run workflows can be initiated from browsers, non-Linux workstations, and shared VM environments)

However, Arvados currently (2022) relies on workstations and shell nodes to build docker images (or download them from external sources) and upload them to Keep before starting a containerized workflow.

h2. Implementation

1. Migrate docker links to collection properties
* arv-keepdocker should set collection properties["docker-image-repo-tag"] when adding (already done in #16046, #17508)
* arv-keepdocker should set collection properties["docker-image-hash"]
* arv-keepdocker should search collections with properties["docker-image-repo-tag"] instead of "docker_image_repo+tag" links
* arvados-cwl-runner should search collections with properties["docker-image-repo-tag"] instead of "docker_image_repo+tag" links
* RailsAPI "resolve docker image spec to container" code should search collection properties for given repo:tag or hash, instead of searching links
* RailsAPI data migration should copy the information from any pre-existing "docker_image_repo+tag" and "docker_image_hash" links into collection properties
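
The migration step can be sketched in plain Python. This is only an in-memory illustration, not the RailsAPI migration itself: links and collections are modelled as dicts, and it assumes that each link's @name@ holds the repo:tag or hash and that its @head_uuid@ points at the image collection. The helper names @migrate_links@ and @find_by_repo_tag@ are hypothetical.

```python
# Hypothetical in-memory sketch of the link-to-property migration.
# Assumption: link["name"] holds the repo:tag (or image hash) and
# link["head_uuid"] is the UUID of the image collection.

LINK_CLASS_TO_PROPERTY = {
    "docker_image_repo+tag": "docker-image-repo-tag",
    "docker_image_hash": "docker-image-hash",
}


def migrate_links(links, collections):
    """Copy docker image info from links into collection properties.

    links: iterable of dicts with "link_class", "name", "head_uuid".
    collections: dict mapping collection UUID -> collection dict.
    A property that is already set is left untouched, so if several
    links point at one collection only the first value is kept.
    """
    for link in links:
        prop = LINK_CLASS_TO_PROPERTY.get(link["link_class"])
        coll = collections.get(link["head_uuid"])
        if prop is None or coll is None:
            continue  # unrelated link, or link to a missing collection
        coll.setdefault("properties", {}).setdefault(prop, link["name"])
    return collections


def find_by_repo_tag(collections, repo_tag):
    """Return UUIDs of collections whose property matches repo_tag."""
    return [
        uuid
        for uuid, coll in collections.items()
        if coll.get("properties", {}).get("docker-image-repo-tag") == repo_tag
    ]
```

After the migration, lookups such as @find_by_repo_tag(collections, "debian:11")@ no longer need to consult links at all, which is the point of the first four bullets.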

2. Support "pull image" container request
* Accept as a special case docker_image="none" (or empty collection PDH) to mean "builtin command"
* Builtin command @["docker", "pull", "repo:tag"]@ causes crunch-run to run @docker pull@ and save the resulting image @sha256:*.tar@ as the output collection instead of running a container
* @mounts@ hash is expected/required to be empty
* @runtime_constraints.API@ is expected/required to be true
* @output_path@ is expected/required to be "/"
* crunch-run sets output_properties @{"docker-image-hash":"...", "docker-image-repo-tag":"repo:tag"}@
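
As a rough illustration, a "pull image" request meeting the constraints above might look like the dict built below. This is a sketch of a proposed feature, not an existing API. The image field is written as @container_image@ on the assumption that this is the attribute the text calls docker_image, the @vcpus@ and @ram@ values are placeholders (see TBD), and both helper functions are hypothetical.

```python
# Hypothetical sketch of a "pull image" container request.
# Assumptions: the image field is named container_image; the vcpus and
# ram defaults are placeholders, not recommended values.


def pull_image_request(repo_tag, vcpus=1, ram=1 << 30):
    """Build a request dict for the proposed builtin docker pull command."""
    return {
        "container_image": "none",  # special case meaning "builtin command"
        "command": ["docker", "pull", repo_tag],
        "mounts": {},  # expected/required to be empty
        "output_path": "/",
        "runtime_constraints": {"API": True, "vcpus": vcpus, "ram": ram},
    }


def validate_pull_request(req):
    """Return a list of violations of the constraints listed above."""
    errors = []
    cmd = req.get("command", [])
    if len(cmd) != 3 or cmd[:2] != ["docker", "pull"]:
        errors.append('command must be ["docker", "pull", "repo:tag"]')
    if req.get("mounts"):
        errors.append("mounts must be empty")
    if req.get("runtime_constraints", {}).get("API") is not True:
        errors.append("runtime_constraints.API must be true")
    if req.get("output_path") != "/":
        errors.append('output_path must be "/"')
    return errors
```

A request that passes @validate_pull_request@ with no errors is one that crunch-run could, under this proposal, handle by pulling the image and saving it as the output collection.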

3. arvados-cwl-runner submits a "pull image" container request when needed
* i.e., if the requested image is not already available in Keep, and docker is not installed/usable directly (e.g., running in an arvados container)
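
The decision described above reduces to a small function. The sketch below is not arvados-cwl-runner code: the function names and return values are hypothetical, and @docker_on_path@ only checks for a @docker@ executable on @PATH@, which is weaker than checking that docker is actually usable.

```python
import shutil


def docker_on_path():
    """Rough availability check: is a docker executable on PATH?

    This does not prove the daemon is reachable or usable.
    """
    return shutil.which("docker") is not None


def image_strategy(image_in_keep, docker_available):
    """Pick how to obtain an image (hypothetical labels).

    image_in_keep: True if the requested image is already stored in Keep.
    docker_available: True if docker can be used directly on this host.
    """
    if image_in_keep:
        return "use-existing"
    if docker_available:
        return "local-docker"
    # Not in Keep and no usable docker, e.g. running in an Arvados container.
    return "submit-pull-request"
```

For example, @image_strategy(False, docker_on_path())@ would select the "pull image" request only on hosts without a docker executable.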

4. Support "build image" container request
* Another builtin command: @["docker", "build"]@
* url uses docker syntax to indicate a collection or remote git repo containing Dockerfile
* @environment@ can be used to pass build args
* @mounts@ establishes build context (e.g., mount a collection or git tree at "/")
* If Dockerfile is not at the root of build context, use @["docker", "build", "/path/to/Dockerfile"]@
* @output_path@ is expected/required to be "/"
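
A "build image" request could be assembled along the same lines. The sketch below covers only the case where the build context is a collection mounted at "/". It assumes the build request also uses the @container_image="none"@ special case, and the helper name and its parameters are hypothetical.

```python
# Hypothetical sketch of a "build image" container request whose build
# context is a collection mounted at "/".
# Assumption: the request uses the same container_image="none" special
# case as the "pull image" request.


def build_image_request(context_pdh, build_args=None, dockerfile=None):
    """Build a request dict for the proposed builtin docker build command.

    context_pdh: portable data hash of the collection holding the context.
    build_args: optional dict of build args, passed via environment.
    dockerfile: path to the Dockerfile if it is not at the context root.
    """
    command = ["docker", "build"]
    if dockerfile:
        command.append(dockerfile)
    return {
        "container_image": "none",
        "command": command,
        "environment": dict(build_args or {}),
        "mounts": {
            "/": {"kind": "collection", "portable_data_hash": context_pdh},
        },
        "output_path": "/",
    }
```

Passing build args through @environment@ keeps the @command@ array fixed, so the only variable parts of a request are the context, the args, and the optional Dockerfile path.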

h2. TBD

How do we avoid the situation where someone copies and modifies an image collection, unwittingly leaves the properties in place, and thereby causes the modified collection to be used unintentionally?

For a @docker pull@ request, should @runtime_constraints@ be automatic (site configurable), or should the client specify? (Consider the case of pulling a 2 GiB image from dockerhub.)