Bug #8810
closed
[Crunch] `docker load` fails to connect to endpoint; srun exits 0
Start date:
04/05/2016
Due date:
% Done:
100%
Estimated time:
(Total: 0.00 h)
Story points:
-
Description
2016-03-22_16:33:38 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr starting: ['srun','--nodelist=compute11','/bin/bash','-o','pipefail','-ec',' if ! docker.io images -q --no-trunc --all | grep -qxF d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83; then arv-get 17b65db74aae73465b5e286d1cdb0e23\\+798\\/d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83\\.tar | docker.io load fi ']
2016-03-22_16:33:40 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr Post http:///var/run/docker.sock/v1.20/images/load: EOF.
2016-03-22_16:33:40 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr * Are you trying to connect to a TLS-enabled daemon without TLS?
2016-03-22_16:33:40 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr * Is your docker daemon up and running?
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr srun: error: Node failure on compute11
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 load docker image: exit 0
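Although the Docker daemon was unreachable and srun reported a node failure on compute11, the "load docker image" step was logged as exit 0. From here, the job continued running and generating errors until the UID 0 check failed. Instead, crunch-job should detect this error and exit in a way that causes crunch-dispatch to retry the job.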
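One possible way to detect this condition is sketched below in Python (crunch-job itself is not written in Python, so this only illustrates the idea). It treats the load step as failed if the exit status is non-zero or if stderr contains one of the failure messages seen in the log above. The function name `classify_load_step`, the pattern list, and the use of exit status 75 (the conventional sysexits `EX_TEMPFAIL` value) to mean "retry" are all assumptions made for illustration, not the actual crunch-job implementation.

```python
import re

# Messages copied from the log in this report. Any of them on stderr
# suggests the load step did not really succeed, whatever the exit status.
FAILURE_PATTERNS = [
    re.compile(r"srun: error: Node failure"),
    re.compile(r"srun: Job step aborted"),
    re.compile(r"Is your docker daemon up and running\?"),
]

# Conventional sysexits.h value for a temporary failure. Using it as the
# "please retry" signal to the dispatcher is an assumption of this sketch.
EX_TEMPFAIL = 75


def classify_load_step(exit_code, stderr_text):
    """Return 0 if the load step looks successful, EX_TEMPFAIL otherwise.

    exit_code   -- exit status reported for the srun step
    stderr_text -- everything the step wrote to stderr
    """
    if exit_code != 0:
        return EX_TEMPFAIL
    for pattern in FAILURE_PATTERNS:
        if pattern.search(stderr_text):
            # Exit status was 0, but the output shows a failure.
            return EX_TEMPFAIL
    return 0
```

With the stderr from this report, `classify_load_step(0, "srun: error: Node failure on compute11")` returns the retry status even though the reported exit status was 0. Matching on message text is fragile, so a sturdier check might re-run the `docker.io images ... | grep -qxF <hash>` test after the load and fail if the image is still absent.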