Revision 4 (Brett Smith, 12/14/2022 04:52 PM) → Revision 5/6 (Tom Clegg, 12/14/2022 05:41 PM)

h1. Build docker images as part of a workflow 

 (draft) 

h2. Background

 Container images provide a well-defined execution environment for doing reproducible work. As long as the image is runnable by a container engine, a job can be repeated. However, the point of reproducibility isn't just to allow repetition of the same computation -- it's to make it possible to use prior work as the starting point for future work. Much of this opportunity is lost if the provenance trail ends at a binary image. 

 Ideally, when a bug is discovered in an analysis tool or library, it should be easy to identify which existing results are affected, and re-run those analyses with the updated software. 

 Users should have the option of building container images:
 * ...as part of a CWL workflow (so they can update the image-building instructions and hit one "re-run" button to see the result) 
 * ...in Arvados containers (so the build environment is controlled, build logs are saved, etc.) 
 * ...without having docker on the client side (so build-and-run workflows can be initiated from browsers, non-Linux workstations, and shared VM environments) 

 However, Arvados currently (2022) relies on workstations and shell nodes to build docker images (or download them from external sources) and upload them to Keep before starting a containerized workflow. 

 

h2. Implementation

 1. Migrate docker links to collection properties 
 * arv-keepdocker should set collection properties["docker-image-repo-tag"] when adding (already done in #16046, #17508) 
 * arv-keepdocker should set collection properties["docker-image-hash"] and properties["docker-image-timestamp"] 
 * arv-keepdocker should search collections with properties["docker-image-repo-tag"] instead of "docker_image_repo+tag" links, and sort by properties["docker-image-timestamp"] 
 * arvados-cwl-runner should search collections with properties["docker-image-repo-tag"] instead of "docker_image_repo+tag" links, and sort by properties["docker-image-timestamp"] 
 * RailsAPI "resolve docker image spec to container" code should search collection properties for the given repo:tag or hash, instead of searching links, and sort by properties["docker-image-timestamp"] 
 * RailsAPI data migration should copy any pre-existing "docker-image-repo+tag" and "-hash" and "-timestamp" values from links into collection properties 
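
The property-based lookup described in step 1 can be sketched in a few lines. This is only an illustration over in-memory records, not code from arv-keepdocker, arvados-cwl-runner, or RailsAPI: the function names, the sample UUIDs, and the assumption that @docker-image-timestamp@ holds an ISO 8601 string are all hypothetical.

```python
from datetime import datetime
from typing import Optional


def parse_timestamp(value: str) -> datetime:
    # Assumption: timestamps are ISO 8601 strings such as "2022-12-14T17:41:00Z".
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def newest_image_collection(collections: list, repo_tag: str) -> Optional[dict]:
    """Return the most recent collection tagged with repo_tag, or None."""
    matches = [
        c for c in collections
        if c.get("properties", {}).get("docker-image-repo-tag") == repo_tag
        and "docker-image-timestamp" in c["properties"]
    ]
    if not matches:
        return None
    # Sort key is the timestamp property, so the latest upload wins.
    return max(
        matches,
        key=lambda c: parse_timestamp(c["properties"]["docker-image-timestamp"]),
    )


# Hypothetical sample records standing in for collection search results.
collections = [
    {"uuid": "example-old", "properties": {
        "docker-image-repo-tag": "debian:11",
        "docker-image-timestamp": "2022-11-01T00:00:00Z"}},
    {"uuid": "example-new", "properties": {
        "docker-image-repo-tag": "debian:11",
        "docker-image-timestamp": "2022-12-14T17:41:00Z"}},
    {"uuid": "example-other", "properties": {
        "docker-image-repo-tag": "ubuntu:22.04",
        "docker-image-timestamp": "2022-12-01T00:00:00Z"}},
]

selected = newest_image_collection(collections, "debian:11")
```

In a real client the filtering and ordering would presumably be pushed to the API server rather than done client-side; the sketch only shows the intended selection rule.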

 2. Support "pull image" container request 
 * Accept as a special case docker_image="arvados/none" (or empty collection PDH) to mean "builtin command" 
 ** or maybe the special value should be "arvados/builtin" instead of "arvados/none" 
 * Builtin command @["docker", "pull", "repo:tag"]@ causes crunch-run to run @docker pull@ and save the resulting image @sha256:*.tar@ as the output collection instead of running a container 
 * @mounts@ hash is expected/required to be empty 
 * @runtime_constraints.API@ is expected/required to be true 
 * @output_path@ is expected/required to be "/" 
 * crunch-run sets output_properties @{"docker-image-hash":"...", "docker-image-repo-tag":"repo:tag"}@ 
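
To make the constraints in step 2 concrete, here is a rough sketch of what a "pull image" request could look like as a plain dictionary, plus a checker for the rules listed above. It assumes the special image value ends up being "arvados/none" and that it is carried in a @container_image@ field; the helper names @make_pull_request@ and @validate_pull_request@ are hypothetical and nothing here calls the Arvados API.

```python
BUILTIN_IMAGE = "arvados/none"  # assumption: the special value is still undecided


def make_pull_request(repo_tag: str) -> dict:
    """Build a dict shaped like the proposed "pull image" container request."""
    return {
        "container_image": BUILTIN_IMAGE,  # assumed field name
        "command": ["docker", "pull", repo_tag],
        "mounts": {},                        # expected/required to be empty
        "runtime_constraints": {"API": True},
        "output_path": "/",
    }


def validate_pull_request(req: dict) -> list:
    """Return a list of rule violations (empty if the request looks valid)."""
    errors = []
    command = req.get("command", [])
    if len(command) != 3 or command[:2] != ["docker", "pull"]:
        errors.append('command must be ["docker", "pull", "repo:tag"]')
    if req.get("mounts"):
        errors.append("mounts must be empty")
    if req.get("runtime_constraints", {}).get("API") is not True:
        errors.append("runtime_constraints.API must be true")
    if req.get("output_path") != "/":
        errors.append('output_path must be "/"')
    return errors


good = make_pull_request("debian:11")
bad = dict(good, output_path="/out")
```

Whether these rules are enforced by the API server, by crunch-run, or by both is not settled by the list above; the checker just restates them in executable form.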

 3. arvados-cwl-runner submits a "pull image" container request when needed 
 * i.e., if the requested image is not already available in Keep, and docker is not installed/usable directly (e.g., running in an arvados container) 
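
The decision in step 3 might be sketched as below. The function and return-value names are hypothetical, and checking for a @docker@ executable on @PATH@ is only a rough stand-in for "installed/usable": a binary can be present without a reachable daemon.

```python
import shutil


def docker_seems_usable() -> bool:
    # Rough heuristic only: finds a docker executable on PATH,
    # but does not prove that a daemon is reachable.
    return shutil.which("docker") is not None


def image_source(in_keep: bool, docker_usable: bool) -> str:
    """Decide where the requested image should come from."""
    if in_keep:
        return "keep"            # already available, nothing to do
    if docker_usable:
        return "local-docker"    # pull/build locally and upload to Keep
    return "pull-request"        # submit a "pull image" container request
```

For example, @image_source(False, docker_seems_usable())@ would select the "pull image" container request when running inside an Arvados container without docker.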

 4. Support "build image" container request 
 * Another builtin command: @["docker", "build"]@ 
 * url uses docker syntax to indicate a collection or remote git repo containing Dockerfile 
 * @environment@ can be used to pass build args 
 * @mounts@ establishes build context (e.g., mount a collection or git tree at "/") 
 * If Dockerfile is not at the root of build context, use @["docker", "build", "/path/to/Dockerfile"]@ 
 * @output_path@ is expected/required to be "/" 
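
A "build image" request following step 4 could be assembled along these lines. As with the pull case, this is a sketch under assumptions: the @container_image@ field name and its "arvados/none" value, the helper name @make_build_request@, and the placeholder portable data hash are all hypothetical, and the mount uses the existing collection mount shape (@kind@ plus @portable_data_hash@).

```python
from typing import Optional


def make_build_request(context_pdh: str,
                       build_args: dict,
                       dockerfile: Optional[str] = None) -> dict:
    """Build a dict shaped like the proposed "build image" container request."""
    command = ["docker", "build"]
    if dockerfile:
        # Only needed when the Dockerfile is not at the root of the build context.
        command.append(dockerfile)
    return {
        "container_image": "arvados/none",   # assumed field name and value
        "command": command,
        "environment": dict(build_args),      # build args passed via environment
        "mounts": {
            # Build context: a collection mounted at "/".
            "/": {"kind": "collection", "portable_data_hash": context_pdh},
        },
        "output_path": "/",
    }


# Placeholder hash for illustration only; not a real collection.
request = make_build_request(
    "0123456789abcdef0123456789abcdef+123",
    {"BASE_IMAGE": "debian:11"},
    dockerfile="/docker/Dockerfile",
)
```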

 

h2. TBD

 How do we avoid the situation of copying & modifying an image collection, and unwittingly leaving the properties in place, causing the modified collection to be used unintentionally? 

 For a @docker pull@ request, should @runtime_constraints@ be set automatically (site configurable), or should the client specify them? (Consider the case of pulling a 2 GiB image from Docker Hub.)