Bug #11138
Protection against running new style docker 1.10+ image on old docker host
Status: closed (100%)
Description
I upgraded docker to 1.13.1 over the weekend (or so I thought). Since then, I've been struggling to get the system working reliably.
It turns out the reason is that the docker upgrade did not take effect on 3 of the 31 nodes, so those 3 nodes are still running 1.9.1.
There does not appear to be any check in crunch to ensure that the image being loaded from keep can actually be run by the docker version installed on the compute node. I would suggest implementing a `docker version` check in the pre-flight sanity checks that happen just prior to job execution.
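As a rough illustration of the kind of check I have in mind (a sketch only, not existing crunch code): the function names below are hypothetical, the 1.10.0 threshold is taken from the "1.10+ image" wording above, and the comparison assumes a `sort` that supports `-V` (version sort), as GNU coreutils does.

```shell
# Hypothetical pre-flight helpers; these are NOT part of crunch-job.

# Return 0 if dotted version $1 is >= dotted version $2.
# Assumes `sort -V` (version sort) is available, e.g. GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Return 0 if the given Docker server version should be able to load
# new style (1.10+) images. Any suffix after a "-" is dropped before
# comparing, so "1.13.1-rc1" is treated as "1.13.1".
docker_supports_new_images() {
    version_ge "${1%%-*}" "1.10.0"
}
```

On a compute node this could be driven with something like `docker_supports_new_images "$(docker version --format '{{.Server.Version}}')"`, assuming the installed docker client accepts `--format` for `docker version`. A job would then be refused (or sent to another node) when the check fails, instead of failing later with a misleading error.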
The underlying problem seems to be with docker - the 1.9.1 version of docker does not return a failure exit status when attempting to load a 1.10+ image, but the load does fail:
root@humgen-05-04:/tmp# arv-get e829be6274c110c4c16bf6381efae022+594/sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac.tar | /usr/bin/docker load
753 MiB / 753 MiB 100.0%
root@humgen-05-04:/tmp# echo $?
0
root@humgen-05-04:/tmp# /usr/bin/docker images -q --no-trunc --all |grep a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac
root@humgen-05-04:/tmp#
The logic in crunch-job does not seem to handle this situation well - it assumes (reasonably, but incorrectly) that if `docker load` returns 0 then it is safe to try to run the container. Docker ends up attempting to pull the image from the `library/sha256` repo on docker hub, which obviously fails.
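Independently of any version check, the load step could confirm that the image is really there before going on. The following is a minimal sketch of that idea, reusing the `docker images -q --no-trunc --all | grep -xF` test that crunch-job already runs before loading. The function names and the `DOCKER` override variable are my own assumptions for illustration, not anything crunch-job provides.

```shell
# Hypothetical helpers; these are NOT part of crunch-job.
# DOCKER may be overridden (for example with a stub when testing);
# it defaults to /usr/bin/docker, the path used in the job log.

# Return 0 only if image ID $1 is listed by `docker images` on this host.
image_present() {
    "${DOCKER:-/usr/bin/docker}" images -q --no-trunc --all \
        | grep -xF -- "$1" >/dev/null
}

# Load an image tarball from stdin, then verify that the image exists.
# A zero exit status from `docker load` alone is not trusted.
load_and_verify() {
    "${DOCKER:-/usr/bin/docker}" load || return 1
    if ! image_present "$1"; then
        echo "docker load exited 0 but image $1 is not present on this host" >&2
        return 1
    fi
}
```

Usage would look like `arv-get <collection>/<image>.tar | load_and_verify sha256:<hash>`. With this in place, the job in this report would have stopped at "load docker image" with a clear message rather than going on to the UID checks and trying to pull from `library/sha256`.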
Full logs from a failing job:
2017-02-20_11:53:05 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 running from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/arvados-cli-0.1.20170217221854/bin/crunch-job with arvados-cli Gem version(s) 0.1.20170217221854, 0.1.20161017193526, 0.1.20160503204200, 0.1.20151207150126, 0.1.20151023190001
2017-02-20_11:53:05 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 check slurm allocation
2017-02-20_11:53:05 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 node humgen-05-04 - 10 slots
2017-02-20_11:53:06 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 start
2017-02-20_11:53:07 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 clean work dirs: start
2017-02-20_11:53:07 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr starting: ['srun','--nodelist=humgen-05-04','-D','/data/crunch-tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") == 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
2017-02-20_11:53:08 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 clean work dirs: exit 0
2017-02-20_11:53:08 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 Install docker image e829be6274c110c4c16bf6381efae022+594
2017-02-20_11:53:09 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 docker image hash is sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac
2017-02-20_11:53:09 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 load docker image: start
2017-02-20_11:53:09 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr starting: ['srun','--nodelist=humgen-05-04','/bin/bash','-o','pipefail','-ec','
if /usr/bin/docker images -q --no-trunc --all | grep -xF sha256\\:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac >/dev/null; then
    exit 0
fi
declare -a exit_codes=("${PIPESTATUS[@]}")
if [ 0 != "${exit_codes[0]}" ]; then
    exit "${exit_codes[0]}"  # `docker images` failed
elif [ 1 != "${exit_codes[1]}" ]; then
    exit "${exit_codes[1]}"  # `grep` encountered an error
else
    # Everything worked fine, but grep didn\'t find the image on this host.
    arv-get e829be6274c110c4c16bf6381efae022\\+594\\/sha256\\:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac\\.tar | /usr/bin/docker load
fi
']
2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 load docker image: exit 0
2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 check --memory-swap feature: start
2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr starting: ['srun','--nodes=1','/usr/bin/docker','run','--help']
2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 check --memory-swap feature: exit 0
2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 check whether default user is UID 0: start
2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr starting: ['srun','--nodes=1','/bin/sh','-ec','/usr/bin/docker run --add-host=api.arvados.sanger.ac.uk:172.17.180.10 --add-host=humgen-01-01:172.17.180.10 --add-host=humgen-01-01.internal.sanger.ac.uk:172.17.180.10 --add-host=humgen-01-02:172.17.180.11 --add-host=humgen-01-02.internal.sanger.ac.uk:172.17.180.11 --add-host=humgen-01-03:172.17.180.12 --add-host=humgen-01-03.internal.sanger.ac.uk:172.17.180.12 --add-host=humgen-02-01:172.17.180.13 --add-host=humgen-02-01.internal.sanger.ac.uk:172.17.180.13 --add-host=humgen-02-02:172.17.180.14 --add-host=humgen-02-02.internal.sanger.ac.uk:172.17.180.14 --add-host=humgen-02-03:172.17.180.15 --add-host=humgen-02-03.internal.sanger.ac.uk:172.17.180.15 --add-host=humgen-03-01:172.17.180.16 --add-host=humgen-03-01.internal.sanger.ac.uk:172.17.180.16 --add-host=humgen-03-02:172.17.180.17 --add-host=humgen-03-02.internal.sanger.ac.uk:172.17.180.17 --add-host=humgen-03-03:172.17.180.18 --add-host=humgen-03-03.internal.sanger.ac.uk:172.17.180.18 --add-host=humgen-04-01:172.17.180.19 --add-host=humgen-04-01.internal.sanger.ac.uk:172.17.180.19 --add-host=humgen-04-02:172.17.180.20 --add-host=humgen-04-02.internal.sanger.ac.uk:172.17.180.20 --add-host=humgen-04-03:172.17.180.21 --add-host=humgen-04-03.internal.sanger.ac.uk:172.17.180.21 sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac id --user']
2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr Unable to find image 'sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac' locally
2017-02-20_11:53:19 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr Pulling repository docker.io/library/sha256
2017-02-20_11:53:19 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr Error: image library/sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac not found
2017-02-20_11:53:19 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr srun: error: humgen-05-04: task 0: Exited with exit code 1
2017-02-20_11:53:19 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 check whether default user is UID 0: exit 1
2017-02-20_11:53:19 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 check whether user 'crunch' is UID 0: start
2017-02-20_11:53:19 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr starting: ['srun','--nodes=1','/bin/sh','-ec','/usr/bin/docker run --add-host=api.arvados.sanger.ac.uk:172.17.180.10 --add-host=humgen-01-01:172.17.180.10 --add-host=humgen-01-01.internal.sanger.ac.uk:172.17.180.10 --add-host=humgen-01-02:172.17.180.11 --add-host=humgen-01-02.internal.sanger.ac.uk:172.17.180.11 --add-host=humgen-01-03:172.17.180.12 --add-host=humgen-01-03.internal.sanger.ac.uk:172.17.180.12 --add-host=humgen-02-01:172.17.180.13 --add-host=humgen-02-01.internal.sanger.ac.uk:172.17.180.13 --add-host=humgen-02-02:172.17.180.14 --add-host=humgen-02-02.internal.sanger.ac.uk:172.17.180.14 --add-host=humgen-02-03:172.17.180.15 --add-host=humgen-02-03.internal.sanger.ac.uk:172.17.180.15 --add-host=humgen-03-01:172.17.180.16 --add-host=humgen-03-01.internal.sanger.ac.uk:172.17.180.16 --add-host=humgen-03-02:172.17.180.17 --add-host=humgen-03-02.internal.sanger.ac.uk:172.17.180.17 --add-host=humgen-03-03:172.17.180.18 --add-host=humgen-03-03.internal.sanger.ac.uk:172.17.180.18 --add-host=humgen-04-01:172.17.180.19 --add-host=humgen-04-01.internal.sanger.ac.uk:172.17.180.19 --add-host=humgen-04-02:172.17.180.20 --add-host=humgen-04-02.internal.sanger.ac.uk:172.17.180.20 --add-host=humgen-04-03:172.17.180.21 --add-host=humgen-04-03.internal.sanger.ac.uk:172.17.180.21 --user=crunch sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac id --user']
2017-02-20_11:53:20 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr Unable to find image 'sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac' locally
2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr Pulling repository docker.io/library/sha256
2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr Error: image library/sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac not found
2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr srun: error: humgen-05-04: task 0: Exited with exit code 1
2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 check whether user 'crunch' is UID 0: exit 1
2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 check whether user 'nobody' is UID 0: start
2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr starting: ['srun','--nodes=1','/bin/sh','-ec','/usr/bin/docker run --add-host=api.arvados.sanger.ac.uk:172.17.180.10 --add-host=humgen-01-01:172.17.180.10 --add-host=humgen-01-01.internal.sanger.ac.uk:172.17.180.10 --add-host=humgen-01-02:172.17.180.11 --add-host=humgen-01-02.internal.sanger.ac.uk:172.17.180.11 --add-host=humgen-01-03:172.17.180.12 --add-host=humgen-01-03.internal.sanger.ac.uk:172.17.180.12 --add-host=humgen-02-01:172.17.180.13 --add-host=humgen-02-01.internal.sanger.ac.uk:172.17.180.13 --add-host=humgen-02-02:172.17.180.14 --add-host=humgen-02-02.internal.sanger.ac.uk:172.17.180.14 --add-host=humgen-02-03:172.17.180.15 --add-host=humgen-02-03.internal.sanger.ac.uk:172.17.180.15 --add-host=humgen-03-01:172.17.180.16 --add-host=humgen-03-01.internal.sanger.ac.uk:172.17.180.16 --add-host=humgen-03-02:172.17.180.17 --add-host=humgen-03-02.internal.sanger.ac.uk:172.17.180.17 --add-host=humgen-03-03:172.17.180.18 --add-host=humgen-03-03.internal.sanger.ac.uk:172.17.180.18 --add-host=humgen-04-01:172.17.180.19 --add-host=humgen-04-01.internal.sanger.ac.uk:172.17.180.19 --add-host=humgen-04-02:172.17.180.20 --add-host=humgen-04-02.internal.sanger.ac.uk:172.17.180.20 --add-host=humgen-04-03:172.17.180.21 --add-host=humgen-04-03.internal.sanger.ac.uk:172.17.180.21 --user=nobody sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac id --user']
2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr Unable to find image 'sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac' locally
2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr Pulling repository docker.io/library/sha256
2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr Error: image library/sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac not found
2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr srun: error: humgen-05-04: task 0: Exited with exit code 1
2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 check whether user 'nobody' is UID 0: exit 1
2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 Could not find a user in container that is not UID 0 (tried default user, crunch nobody) or there was a problem running 'id' in the container. at /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/arvados-cli-0.1.20170217221854/bin/crunch-job line 484
2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 release job allocation
2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 Freeze not implemented
2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 collate
2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 collated output manifest text to send to API server is 0 bytes with access tokens