Bug #11138

closed

Protection against running new style docker 1.10+ image on old docker host

Added by Joshua Randall almost 8 years ago. Updated almost 8 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Start date: 02/20/2017
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 1.0

Description

I upgraded docker to 1.13.1 over the weekend (or so I thought). Since then, I've been struggling to get the system working reliably.

It turns out the reason is that the docker upgrade did not take on 3 of the 31 nodes, so those 3 nodes are still running 1.9.1.

There does not appear to be any check in crunch to ensure that the image loaded from keep can actually be run by the docker version installed on the node. I would suggest adding a `docker version` check to the pre-flight sanity checks that happen just prior to job execution.
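A minimal sketch of what such a pre-flight check could look like, written as bash to match the existing pre-flight commands. The function names `version_ge` and `docker_at_least` are hypothetical and not part of crunch-job; the comparison assumes GNU `sort -V` is available on the node, and that the installed docker accepts `docker version --format`.

```shell
# Hypothetical pre-flight helpers; not part of crunch-job.

# Succeeds when version $1 is greater than or equal to version $2.
# Assumes GNU coreutils `sort -V` (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Succeeds when the docker daemon on this host reports at least version $1.
# Assumes `docker version --format` is supported by the installed client.
docker_at_least() {
    local have
    have=$(docker version --format '{{.Server.Version}}') || return 1
    version_ge "$have" "$1"
}
```

Something like `docker_at_least 1.10.0 || exit 1` could then be run via srun on each allocated node before the image is loaded, so that a node still on 1.9.1 is rejected with a clear message instead of failing later.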

The underlying problem seems to be with docker: version 1.9.1 does not return a failure exit status when attempting to load a 1.10+ image, even though the load fails:

root@humgen-05-04:/tmp# arv-get e829be6274c110c4c16bf6381efae022+594/sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac.tar | /usr/bin/docker load
753 MiB / 753 MiB 100.0%
root@humgen-05-04:/tmp# echo $?
0
root@humgen-05-04:/tmp# /usr/bin/docker images -q --no-trunc --all |grep a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac
root@humgen-05-04:/tmp#

The logic in crunch-job does not handle this situation well: it assumes (reasonably, but incorrectly) that if `docker load` returns 0 then it is safe to try to run the container. Docker, not finding the image locally, ends up attempting to pull it from the `library/sha256` repo on docker hub, which fails.
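One way to avoid trusting the exit status of `docker load` would be to repeat the `docker images` lookup after the load and fail if the image is still missing. The following is only a sketch under that assumption, not the change that was actually made; `image_listed` and `load_and_verify` are hypothetical names.

```shell
# Hypothetical helpers; not part of crunch-job.

# Succeeds when image ID $1 appears as a whole line on stdin.
image_listed() {
    grep -xF -- "$1" >/dev/null
}

# Reads a saved image tarball on stdin, loads it, then confirms that the
# daemon really lists the expected image ID before reporting success.
load_and_verify() {
    local image_id=$1
    docker load || return 1
    if ! docker images -q --no-trunc --all | image_listed "$image_id"; then
        echo "docker load exited 0 but $image_id is not present" >&2
        return 1
    fi
}
```

With this in place the load step would look like `arv-get <collection>/<image>.tar | load_and_verify sha256:<hash>`, and a silent load failure such as the one on 1.9.1 would surface as a non-zero exit at the "load docker image" step rather than at the later UID checks.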

Full logs from a failing job:

2017-02-20_11:53:05 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  running from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/arvados-cli-0.1.20170217221854/bin/crunch-job with arvados-cli Gem version(s) 0.1.20170217221854, 0.1.20161017193526, 0.1.20160503204200, 0.1.20151207150126, 0.1.20151023190001
2017-02-20_11:53:05 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  check slurm allocation
2017-02-20_11:53:05 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  node humgen-05-04 - 10 slots
2017-02-20_11:53:06 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  start
2017-02-20_11:53:07 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  clean work dirs: start
2017-02-20_11:53:07 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr starting: ['srun','--nodelist=humgen-05-04','-D','/data/crunch-tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") == 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
2017-02-20_11:53:08 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  clean work dirs: exit 0
2017-02-20_11:53:08 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  Install docker image e829be6274c110c4c16bf6381efae022+594
2017-02-20_11:53:09 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  docker image hash is sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac
2017-02-20_11:53:09 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  load docker image: start
2017-02-20_11:53:09 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr starting: ['srun','--nodelist=humgen-05-04','/bin/bash','-o','pipefail','-ec',' if /usr/bin/docker images -q --no-trunc --all | grep -xF sha256\\:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac >/dev/null; then     exit 0 fi declare -a exit_codes=("${PIPESTATUS[@]}") if [ 0 != "${exit_codes[0]}" ]; then    exit "${exit_codes[0]}"  # `docker images` failed elif [ 1 != "${exit_codes[1]}" ]; then    exit "${exit_codes[1]}"  # `grep` encountered an error else    # Everything worked fine, but grep didn\'t find the image on this host.    arv-get e829be6274c110c4c16bf6381efae022\\+594\\/sha256\\:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac\\.tar | /usr/bin/docker load fi ']
2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  load docker image: exit 0
2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  check --memory-swap feature: start
2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr starting: ['srun','--nodes=1','/usr/bin/docker','run','--help']
2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  check --memory-swap feature: exit 0
2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  check whether default user is UID 0: start
2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr starting: ['srun','--nodes=1','/bin/sh','-ec','/usr/bin/docker run  --add-host=api.arvados.sanger.ac.uk:172.17.180.10 --add-host=humgen-01-01:172.17.180.10 --add-host=humgen-01-01.internal.sanger.ac.uk:172.17.180.10 --add-host=humgen-01-02:172.17.180.11 --add-host=humgen-01-02.internal.sanger.ac.uk:172.17.180.11 --add-host=humgen-01-03:172.17.180.12 --add-host=humgen-01-03.internal.sanger.ac.uk:172.17.180.12 --add-host=humgen-02-01:172.17.180.13 --add-host=humgen-02-01.internal.sanger.ac.uk:172.17.180.13 --add-host=humgen-02-02:172.17.180.14 --add-host=humgen-02-02.internal.sanger.ac.uk:172.17.180.14 --add-host=humgen-02-03:172.17.180.15 --add-host=humgen-02-03.internal.sanger.ac.uk:172.17.180.15 --add-host=humgen-03-01:172.17.180.16 --add-host=humgen-03-01.internal.sanger.ac.uk:172.17.180.16 --add-host=humgen-03-02:172.17.180.17 --add-host=humgen-03-02.internal.sanger.ac.uk:172.17.180.17 --add-host=humgen-03-03:172.17.180.18 --add-host=humgen-03-03.internal.sanger.ac.uk:172.17.180.18 --add-host=humgen-04-01:172.17.180.19 --add-host=humgen-04-01.internal.sanger.ac.uk:172.17.180.19 --add-host=humgen-04-02:172.17.180.20 --add-host=humgen-04-02.internal.sanger.ac.uk:172.17.180.20 --add-host=humgen-04-03:172.17.180.21 --add-host=humgen-04-03.internal.sanger.ac.uk:172.17.180.21  sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac id --user']
2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr Unable to find image 'sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac' locally
2017-02-20_11:53:19 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr Pulling repository docker.io/library/sha256
2017-02-20_11:53:19 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr Error: image library/sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac not found
2017-02-20_11:53:19 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr srun: error: humgen-05-04: task 0: Exited with exit code 1
2017-02-20_11:53:19 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  check whether default user is UID 0: exit 1
2017-02-20_11:53:19 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  check whether user 'crunch' is UID 0: start
2017-02-20_11:53:19 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr starting: ['srun','--nodes=1','/bin/sh','-ec','/usr/bin/docker run  --add-host=api.arvados.sanger.ac.uk:172.17.180.10 --add-host=humgen-01-01:172.17.180.10 --add-host=humgen-01-01.internal.sanger.ac.uk:172.17.180.10 --add-host=humgen-01-02:172.17.180.11 --add-host=humgen-01-02.internal.sanger.ac.uk:172.17.180.11 --add-host=humgen-01-03:172.17.180.12 --add-host=humgen-01-03.internal.sanger.ac.uk:172.17.180.12 --add-host=humgen-02-01:172.17.180.13 --add-host=humgen-02-01.internal.sanger.ac.uk:172.17.180.13 --add-host=humgen-02-02:172.17.180.14 --add-host=humgen-02-02.internal.sanger.ac.uk:172.17.180.14 --add-host=humgen-02-03:172.17.180.15 --add-host=humgen-02-03.internal.sanger.ac.uk:172.17.180.15 --add-host=humgen-03-01:172.17.180.16 --add-host=humgen-03-01.internal.sanger.ac.uk:172.17.180.16 --add-host=humgen-03-02:172.17.180.17 --add-host=humgen-03-02.internal.sanger.ac.uk:172.17.180.17 --add-host=humgen-03-03:172.17.180.18 --add-host=humgen-03-03.internal.sanger.ac.uk:172.17.180.18 --add-host=humgen-04-01:172.17.180.19 --add-host=humgen-04-01.internal.sanger.ac.uk:172.17.180.19 --add-host=humgen-04-02:172.17.180.20 --add-host=humgen-04-02.internal.sanger.ac.uk:172.17.180.20 --add-host=humgen-04-03:172.17.180.21 --add-host=humgen-04-03.internal.sanger.ac.uk:172.17.180.21 --user=crunch sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac id --user']
2017-02-20_11:53:20 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr Unable to find image 'sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac' locally
2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr Pulling repository docker.io/library/sha256
2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr Error: image library/sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac not found
2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr srun: error: humgen-05-04: task 0: Exited with exit code 1
2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  check whether user 'crunch' is UID 0: exit 1
2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  check whether user 'nobody' is UID 0: start
2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr starting: ['srun','--nodes=1','/bin/sh','-ec','/usr/bin/docker run  --add-host=api.arvados.sanger.ac.uk:172.17.180.10 --add-host=humgen-01-01:172.17.180.10 --add-host=humgen-01-01.internal.sanger.ac.uk:172.17.180.10 --add-host=humgen-01-02:172.17.180.11 --add-host=humgen-01-02.internal.sanger.ac.uk:172.17.180.11 --add-host=humgen-01-03:172.17.180.12 --add-host=humgen-01-03.internal.sanger.ac.uk:172.17.180.12 --add-host=humgen-02-01:172.17.180.13 --add-host=humgen-02-01.internal.sanger.ac.uk:172.17.180.13 --add-host=humgen-02-02:172.17.180.14 --add-host=humgen-02-02.internal.sanger.ac.uk:172.17.180.14 --add-host=humgen-02-03:172.17.180.15 --add-host=humgen-02-03.internal.sanger.ac.uk:172.17.180.15 --add-host=humgen-03-01:172.17.180.16 --add-host=humgen-03-01.internal.sanger.ac.uk:172.17.180.16 --add-host=humgen-03-02:172.17.180.17 --add-host=humgen-03-02.internal.sanger.ac.uk:172.17.180.17 --add-host=humgen-03-03:172.17.180.18 --add-host=humgen-03-03.internal.sanger.ac.uk:172.17.180.18 --add-host=humgen-04-01:172.17.180.19 --add-host=humgen-04-01.internal.sanger.ac.uk:172.17.180.19 --add-host=humgen-04-02:172.17.180.20 --add-host=humgen-04-02.internal.sanger.ac.uk:172.17.180.20 --add-host=humgen-04-03:172.17.180.21 --add-host=humgen-04-03.internal.sanger.ac.uk:172.17.180.21 --user=nobody sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac id --user']
2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr Unable to find image 'sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac' locally
2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr Pulling repository docker.io/library/sha256
2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr Error: image library/sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac not found
2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr srun: error: humgen-05-04: task 0: Exited with exit code 1
2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  check whether user 'nobody' is UID 0: exit 1
2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  Could not find a user in container that is not UID 0 (tried default user,  crunch nobody) or there was a problem running 'id' in the container. at /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/arvados-cli-0.1.20170217221854/bin/crunch-job line 484
2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  release job allocation
2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  Freeze not implemented
2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  collate
2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  collated output manifest text to send to API server is 0 bytes with access tokens


Subtasks 2 (0 open, 2 closed)

Task #11207: Make docker_install_script more robust (Resolved, Tom Clegg, 02/20/2017)
Task #11197: Review 11138-docker-load-fail (Resolved, Lucas Di Pentima, 02/20/2017)
