Build docker images as part of a workflow » History » Version 6
Tom Clegg, 12/14/2022 05:43 PM
h1. Build docker images as part of a workflow

(draft)

h2. Background

Container images provide a well-defined execution environment for doing reproducible work. As long as the image is runnable by a container engine, a job can be repeated. However, the point of reproducibility isn't just to allow repetition of the same computation -- it's to make it possible to use prior work as the starting point for future work. Much of this opportunity is lost if the provenance trail ends at a binary image.

Ideally, when a bug is discovered in an analysis tool or library, it should be easy to identify which existing results are affected, and re-run those analyses with the updated software.

Users should have the option of building container images:
* as part of a CWL workflow (so they can update the image-building instructions and hit one "re-run" button to see the result)
* in Arvados containers (so the build environment is controlled, build logs are saved, etc.)
* without having docker on the client side (so build-and-run workflows can be initiated from browsers, non-Linux workstations, and shared VM environments)

However, Arvados currently (2022) relies on workstations and shell nodes to build docker images (or download them from external sources) and upload them to Keep before starting a containerized workflow.

h2. Implementation

1. Migrate docker links to collection properties
* arv-keepdocker should set collection properties["docker-image-repo-tag"] when adding (already done in #16046, #17508)
* arv-keepdocker should set collection properties["docker-image-hash"] and properties["docker-image-timestamp"]
* arv-keepdocker should search collections with properties["docker-image-repo-tag"] instead of "docker_image_repo+tag" links, and sort by properties["docker-image-timestamp"]
* arvados-cwl-runner should search collections with properties["docker-image-repo-tag"] instead of "docker_image_repo+tag" links, and sort by properties["docker-image-timestamp"]
* RailsAPI "resolve docker image spec to container" code should search collection properties for given repo:tag or hash, instead of searching links, and sort by properties["docker-image-timestamp"]
* RailsAPI data migration should copy any pre-existing "docker_image_repo+tag", hash, and timestamp values from links into collection properties
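
The property-based lookup in step 1 can be sketched as follows. This is a minimal in-memory illustration, not the arv-keepdocker or RailsAPI code: @find_image_collection@ is a hypothetical helper, the collection records are plain dicts, and the timestamp format (uniform ISO 8601 UTC strings) is an assumption.

```python
from typing import Optional


def find_image_collection(collections: list, repo_tag: str) -> Optional[dict]:
    """Return the newest collection whose properties name repo_tag, or None.

    Hypothetical helper: `collections` is a list of dicts shaped like
    collection records, each with an optional "properties" dict.
    """
    matches = [
        c for c in collections
        if c.get("properties", {}).get("docker-image-repo-tag") == repo_tag
    ]
    # Assumption: timestamps are ISO 8601 UTC strings in one uniform format,
    # so lexicographic order matches chronological order.
    return max(
        matches,
        key=lambda c: c["properties"].get("docker-image-timestamp", ""),
        default=None,
    )
```

When several collections carry the same repo:tag, the one with the latest @docker-image-timestamp@ is returned; a collection with no timestamp sorts before any collection that has one.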

2. Support "pull image" container request
* Accept as a special case docker_image="arvados/none" (or empty collection PDH) to mean "builtin command"
** or maybe the special value is "arvados/builtin" or "arvados/none"
* Builtin command @["docker", "pull", "repo:tag"]@ causes crunch-run to run @docker pull@ and save the resulting image @sha256:*.tar@ as the output collection instead of running a container
* @mounts@ hash is expected/required to be empty
* @runtime_constraints.API@ is expected/required to be true
* @output_path@ is expected/required to be "/"
* crunch-run sets output_properties @{"docker-image-hash":"...", "docker-image-repo-tag":"repo:tag"}@
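
A "pull image" request that satisfies the constraints above might look like the sketch below. The helper names are hypothetical, the @container_image@ key is assumed to be the attribute the text calls docker_image, and the "arvados/none" value is only one of the candidates under discussion. Resource limits are left out because how they get set is still an open question (see TBD).

```python
def make_pull_request(repo_tag: str) -> dict:
    """Build a hypothetical "pull image" container request body."""
    return {
        # Assumption: "arvados/none" is the special "builtin command" value.
        "container_image": "arvados/none",
        "command": ["docker", "pull", repo_tag],
        "mounts": {},
        "runtime_constraints": {"API": True},
        "output_path": "/",
    }


def validate_pull_request(req: dict) -> list:
    """Return a list of constraint violations (empty if the request is valid)."""
    errors = []
    cmd = req.get("command", [])
    if len(cmd) != 3 or cmd[:2] != ["docker", "pull"]:
        errors.append('command must be ["docker", "pull", "repo:tag"]')
    if req.get("mounts"):
        errors.append("mounts must be empty")
    if req.get("runtime_constraints", {}).get("API") is not True:
        errors.append("runtime_constraints.API must be true")
    if req.get("output_path") != "/":
        errors.append('output_path must be "/"')
    return errors


def pull_output_properties(repo_tag: str, image_hash: str) -> dict:
    """Properties crunch-run would set on the output collection."""
    return {"docker-image-hash": image_hash, "docker-image-repo-tag": repo_tag}
```

Returning a list of violations rather than raising on the first one lets a client report every problem with a request at once.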

3. arvados-cwl-runner submits a "pull image" container request when needed
* i.e., if the requested image is not already available in Keep, and docker is not installed/usable directly (e.g., running in an arvados container)
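
The decision in step 3 reduces to a two-condition check, sketched here with hypothetical function names. @docker_cli_present@ only looks for a @docker@ executable on @PATH@, which is a rough stand-in for "installed/usable": it does not confirm that a docker daemon is reachable.

```python
import shutil


def docker_cli_present() -> bool:
    """Rough usability probe: is a `docker` executable on PATH?"""
    return shutil.which("docker") is not None


def should_submit_pull_request(image_in_keep: bool, docker_usable: bool) -> bool:
    """Submit a "pull image" request only when neither shortcut applies."""
    return not image_in_keep and not docker_usable
```

For example, @should_submit_pull_request(image_in_keep=False, docker_usable=docker_cli_present())@ is true only when the image is missing from Keep and no local docker client is found.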

4. Support "build image" container request
* Another builtin command: @["docker", "build"]@
* The URL argument uses docker syntax to indicate a collection or remote git repo containing the Dockerfile
* @environment@ can be used to pass build args
* @mounts@ establishes build context (e.g., mount a collection or git tree at "/")
* If Dockerfile is not at the root of build context, use @["docker", "build", "/path/to/Dockerfile"]@
* @output_path@ is expected/required to be "/"
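
A "build image" request along these lines could be assembled as in the sketch below. @make_build_request@ is a hypothetical helper, the @container_image@ value carries the same assumption as for "pull image" requests, and the build context is assumed to be a collection mounted at "/" by portable data hash.

```python
from typing import Optional


def make_build_request(
    context_pdh: str,
    build_args: dict,
    dockerfile: Optional[str] = None,
) -> dict:
    """Build a hypothetical "build image" container request body."""
    command = ["docker", "build"]
    if dockerfile is not None:
        # Only needed when Dockerfile is not at the root of the build context.
        command.append(dockerfile)
    return {
        # Assumption: same special "builtin command" value as for pull requests.
        "container_image": "arvados/none",
        "command": command,
        # Build args are passed through the environment hash.
        "environment": dict(build_args),
        # The build context is a collection mounted at "/".
        "mounts": {
            "/": {"kind": "collection", "portable_data_hash": context_pdh},
        },
        "output_path": "/",
    }
```

The @environment@ dict is copied so that later changes to the caller's @build_args@ do not alter a request that has already been assembled.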

h2. TBD

How do we avoid the situation of copying & modifying an image collection, and unwittingly leaving the properties in place, causing the modified collection to be used unintentionally?

For a @docker pull@ request, should @runtime_constraints@ be automatic (site configurable), or should the client specify? (Consider the case of pulling a 2 GiB image from dockerhub.)

In a @docker build@ request, if @Dockerfile@ says @FROM foo/bar@ and there is already an image in Arvados tagged @foo/bar@, should that image be used as the build base, or should docker pull @foo/bar@ from dockerhub and use that as the base?