Bug #8203
Status: Closed
[Crunch] Mystery problem running container / Could not find a user in container that is not UID 0
Description
GATK Queue parent job: https://workbench.qr2hi.arvadosapi.com/jobs/qr2hi-8i9sb-idahcsb4sdi9ko3#Log
GATK Queue child job: https://workbench.qr2hi.arvadosapi.com/jobs/qr2hi-8i9sb-pxxnn8d4ldc7tjz#Log
2016-01-13_16:11:32 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 running from /usr/local/arvados/src/sdk/cli/bin/crunch-job with arvados-cli Gem version(s) 0.1.20151023190001
2016-01-13_16:11:32 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 check slurm allocation
2016-01-13_16:11:32 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 node compute19 - 1 slots
2016-01-13_16:11:32 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 start
2016-01-13_16:11:32 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 Clean work dirs
2016-01-13_16:12:03 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 starting: ['srun','--nodelist=compute19','/bin/sh','-ec','docker.io run --help | grep -qe --memory-swap=']
2016-01-13_16:12:05 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 Checking if container default user is not UID 0
2016-01-13_16:12:05 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 starting: ['srun','--nodelist=compute19','/bin/sh','-ec','a=`docker.io run d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user` && test $a -ne 0']
2016-01-13_16:12:07 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 Checking if user 'crunch' is not UID 0
2016-01-13_16:12:07 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 starting: ['srun','--nodelist=compute19','/bin/sh','-ec','a=`docker.io run --user=crunch d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user` && test $a -ne 0']
2016-01-13_16:12:08 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 Checking if user 'nobody' is not UID 0
2016-01-13_16:12:09 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 starting: ['srun','--nodelist=compute19','/bin/sh','-ec','a=`docker.io run --user=nobody d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user` && test $a -ne 0']
2016-01-13_16:12:10 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 Could not find a user in container that is not UID 0 (tried default user, crunch nobody) or there was a problem running 'id' in the container. at /usr/local/arvados/src/sdk/cli/bin/crunch-job line 486
2016-01-13_16:12:11 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 Freeze not implemented
2016-01-13_16:12:11 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 collate
2016-01-13_16:12:11 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 collated output manifest text to send to API server is 0 bytes with access tokens
Updated by Brett Smith about 10 years ago
Testing by hand indicates the test should've passed for user=crunch, so yeah, something's up. Might've been a problem running docker on the compute node at the time?
Updated by Tom Clegg about 10 years ago
- Subject changed from [Crunch] Could not find a user in container that is not UID 0 to [Crunch] Mystery problem running container / Could not find a user in container that is not UID 0
- Category set to Crunch
- Run `docker ps -q` to make sure we can run docker
- Run `docker images | grep` (and possibly `arv-get | docker load` if the image isn't installed yet)
- Run `docker run --help` to find out whether we can use `--memory-swap`
But this is the first point where we actually try to start a container, and we don't capture docker's stderr in the job log. Even the "or there was a problem running 'id' in the container" part of the error message might be misleading, since "a problem running 'id'" can include "a problem starting the container" (see #7592#note-19).
I notice arv-get | docker load will succeed if docker load exits 0, even if arv-get exits non-zero. I wonder whether there are cases where docker load exits 0 even though the image received on stdin is truncated?
Of course, even if the image was loaded correctly, container startup could have failed for various reasons.
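A quick illustration of the pipeline-exit-status point, with `false` standing in for a failing `arv-get` and `cat` for a `docker load` that exits 0:

```shell
# Without pipefail, a pipeline's exit status is the status of its LAST
# command, so a failing producer (think arv-get) is masked by a consumer
# (think docker load) that exits 0.  `false`/`cat` are stand-ins.
if sh -c 'false | cat'; then
  echo "default shell: pipeline reports success despite the failure"
fi

# With -o pipefail, any failing stage makes the whole pipeline fail.
if ! bash -o pipefail -c 'false | cat'; then
  echo "pipefail: the upstream failure now propagates"
fi
```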
I think the next sensible step here is to capture stdout/stderr from this docker run invocation, and display stderr (and error out) if stdout is anything but a UID number.
#8099 should give us access to stdout/stderr from every use of srun(), or at least move us in that direction.
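The validation described above could be sketched like this (a hypothetical helper, not the actual crunch-job code): treat the probe as successful only when the captured stdout is exactly one decimal number, and surface stderr otherwise.

```shell
# Hypothetical helper for the proposed fix: accept the probe only if its
# captured stdout is a single decimal UID.  Anything else (srun noise,
# docker error text, multiple lines) should be reported and treated as
# a failure rather than silently falling through.
looks_like_uid() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty, multi-line, or non-numeric output
    *)           return 0 ;;
  esac
}

looks_like_uid "1000" && echo "ok: 1000"
looks_like_uid "0" && echo "ok: 0 (root, but well-formed output)"
looks_like_uid "docker: error" || echo "rejected: error text"
looks_like_uid "1000
1000" || echo "rejected: multi-line output"
```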
Updated by Brett Smith about 10 years ago
My commit on this did not totally prevent the problem. See wx7k5-8i9sb-vh8f43korgr9en6, which failed the exact same way and ran with the fix installed.
Updated by Brett Smith about 10 years ago
Tom Clegg wrote:
I notice `arv-get | docker load` will succeed if `docker load` exits 0, even if `arv-get` exits non-zero. I wonder whether there are cases where `docker load` exits 0 even though the image received on stdin is truncated?
My commit tackled this problem by running the install script with `bash -o pipefail`.
I think the next sensible step here is to capture stdout/stderr from this `docker run` invocation, and display stderr (and error out) if stdout is anything but a UID number.
#8099 should give us access to stdout/stderr from every use of srun(), or at least move us in that direction.
Agreed this is the best next step available to us.
Updated by Bryan Cosca about 10 years ago
This happened again: https://workbench.wx7k5.arvadosapi.com/jobs/wx7k5-8i9sb-1moswbbv1re4b4u#Log
2016-03-14_20:39:15 wx7k5-8i9sb-1moswbbv1re4b4u 1964 running from /usr/local/arvados/src/sdk/cli/bin/crunch-job with arvados-cli Gem version(s) 0.1.20151207150126, 0.1.20151023190001, 0.1.20150930141818, 0.1.20150205181653
2016-03-14_20:39:15 wx7k5-8i9sb-1moswbbv1re4b4u 1964 check slurm allocation
2016-03-14_20:39:15 wx7k5-8i9sb-1moswbbv1re4b4u 1964 node compute4 - 1 slots
2016-03-14_20:39:15 wx7k5-8i9sb-1moswbbv1re4b4u 1964 node compute5 - 1 slots
2016-03-14_20:39:15 wx7k5-8i9sb-1moswbbv1re4b4u 1964 node compute6 - 1 slots
2016-03-14_20:39:15 wx7k5-8i9sb-1moswbbv1re4b4u 1964 node compute7 - 1 slots
2016-03-14_20:39:15 wx7k5-8i9sb-1moswbbv1re4b4u 1964 start
2016-03-14_20:39:15 wx7k5-8i9sb-1moswbbv1re4b4u 1964 clean work dirs: start
2016-03-14_20:39:15 wx7k5-8i9sb-1moswbbv1re4b4u 1964 stderr starting: ['srun','--nodelist=compute4,compute5,compute6,compute7','-D','/tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") == 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
2016-03-14_20:39:16 wx7k5-8i9sb-1moswbbv1re4b4u 1964 clean work dirs: exit 0
2016-03-14_20:39:16 wx7k5-8i9sb-1moswbbv1re4b4u 1964 Install docker image be73c9ab5395123d42badd280d692486+966
2016-03-14_20:39:16 wx7k5-8i9sb-1moswbbv1re4b4u 1964 docker image hash is 0c1c44a72aa91ab8f566e4762d1ce20b957d45e3a671567218815806cfa5c4c9
2016-03-14_20:39:16 wx7k5-8i9sb-1moswbbv1re4b4u 1964 load docker image: start
2016-03-14_20:39:16 wx7k5-8i9sb-1moswbbv1re4b4u 1964 stderr starting: ['srun','--nodelist=compute4,compute5,compute6,compute7','/bin/bash','-o','pipefail','-ec',' if ! docker.io images -q --no-trunc --all | grep -qxF 0c1c44a72aa91ab8f566e4762d1ce20b957d45e3a671567218815806cfa5c4c9; then arv-get be73c9ab5395123d42badd280d692486\\+966\\/0c1c44a72aa91ab8f566e4762d1ce20b957d45e3a671567218815806cfa5c4c9\\.tar | docker.io load fi ']
2016-03-14_20:40:09 wx7k5-8i9sb-1moswbbv1re4b4u 1964 load docker image: exit 0
2016-03-14_20:40:09 wx7k5-8i9sb-1moswbbv1re4b4u 1964 check --memory-swap feature: start
2016-03-14_20:40:09 wx7k5-8i9sb-1moswbbv1re4b4u 1964 stderr starting: ['srun','--nodelist=compute4','docker.io','run','--help']
2016-03-14_20:40:09 wx7k5-8i9sb-1moswbbv1re4b4u 1964 check --memory-swap feature: exit 0
2016-03-14_20:40:09 wx7k5-8i9sb-1moswbbv1re4b4u 1964 check whether default user is UID 0: start
2016-03-14_20:40:09 wx7k5-8i9sb-1moswbbv1re4b4u 1964 stderr starting: ['srun','--nodelist=compute4','/bin/sh','-ec','docker.io run 0c1c44a72aa91ab8f566e4762d1ce20b957d45e3a671567218815806cfa5c4c9 id --user']
2016-03-14_20:40:10 wx7k5-8i9sb-1moswbbv1re4b4u 1964 check whether default user is UID 0: exit 0
2016-03-14_20:40:10 wx7k5-8i9sb-1moswbbv1re4b4u 1964 check whether user 'crunch' is UID 0: start
2016-03-14_20:40:10 wx7k5-8i9sb-1moswbbv1re4b4u 1964 stderr starting: ['srun','--nodelist=compute4','/bin/sh','-ec','docker.io run --user=crunch 0c1c44a72aa91ab8f566e4762d1ce20b957d45e3a671567218815806cfa5c4c9 id --user']
2016-03-14_20:40:11 wx7k5-8i9sb-1moswbbv1re4b4u 1964 check whether user 'crunch' is UID 0: exit 0
2016-03-14_20:40:11 wx7k5-8i9sb-1moswbbv1re4b4u 1964 check whether user 'nobody' is UID 0: start
2016-03-14_20:40:11 wx7k5-8i9sb-1moswbbv1re4b4u 1964 stderr starting: ['srun','--nodelist=compute4','/bin/sh','-ec','docker.io run --user=nobody 0c1c44a72aa91ab8f566e4762d1ce20b957d45e3a671567218815806cfa5c4c9 id --user']
2016-03-14_20:40:11 wx7k5-8i9sb-1moswbbv1re4b4u 1964 check whether user 'nobody' is UID 0: exit 0
2016-03-14_20:40:11 wx7k5-8i9sb-1moswbbv1re4b4u 1964 Could not find a user in container that is not UID 0 (tried default user, crunch nobody) or there was a problem running 'id' in the container. at /usr/local/arvados/src/sdk/cli/bin/crunch-job line 474
2016-03-14_20:40:11 wx7k5-8i9sb-1moswbbv1re4b4u 1964 Freeze not implemented
2016-03-14_20:40:11 wx7k5-8i9sb-1moswbbv1re4b4u 1964 collate
2016-03-14_20:40:11 wx7k5-8i9sb-1moswbbv1re4b4u 1964 collated output manifest text to send to API server is 0 bytes with access tokens
Updated by Brett Smith about 10 years ago
Note that we saw the bug hit 6/6 jobs in the span of an hour. This is a marked change: previously we only saw the bug very occasionally.
Of course, right after I added a debugging log to try to track down the problem, my own test job succeeded…
Updated by Brett Smith about 10 years ago
I ran 12 jobs successfully. I noticed that Bryan's jobs all require multiple nodes. So I tried rerunning one of his with an extra debugging statement. https://workbench.wx7k5.arvadosapi.com/jobs/wx7k5-8i9sb-v5vcl6tol7lc4vp#Log is very illustrative of the problem.
Updated by Brett Smith about 10 years ago
Brett Smith wrote:
I ran 12 jobs successfully. I noticed that Bryan's jobs all require multiple nodes. So I tried rerunning one of his with an extra debugging statement. https://workbench.wx7k5.arvadosapi.com/jobs/wx7k5-8i9sb-v5vcl6tol7lc4vp#Log is very illustrative of the problem.
For posterity: I added this statement to the end of the @tryusers loop:
Log(undef, "'$try_user' UID was '$stdout'");
Here's the log excerpt for a job with four nodes:
2016-03-15_00:27:19 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779 check whether default user is UID 0: start
2016-03-15_00:27:19 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779 stderr starting: ['srun','--nodelist=compute6','/bin/sh','-ec','docker.io run 0c1c44a72aa91ab8f566e4762d1ce20b957d45e3a671567218815806cfa5c4c9 id --user']
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779 check whether default user is UID 0: exit 0
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779 '' UID was '0
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779 0
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779 0
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779 0'
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779 check whether user 'crunch' is UID 0: start
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779 stderr starting: ['srun','--nodelist=compute6','/bin/sh','-ec','docker.io run --user=crunch 0c1c44a72aa91ab8f566e4762d1ce20b957d45e3a671567218815806cfa5c4c9 id --user']
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779 check whether user 'crunch' is UID 0: exit 0
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779 'crunch' UID was '1000
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779 1000
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779 1000
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779 1000'
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779 check whether user 'nobody' is UID 0: start
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779 stderr starting: ['srun','--nodelist=compute6','/bin/sh','-ec','docker.io run --user=nobody 0c1c44a72aa91ab8f566e4762d1ce20b957d45e3a671567218815806cfa5c4c9 id --user']
2016-03-15_00:27:21 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779 check whether user 'nobody' is UID 0: exit 0
2016-03-15_00:27:21 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779 'nobody' UID was '65534
2016-03-15_00:27:21 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779 65534
2016-03-15_00:27:21 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779 65534
2016-03-15_00:27:21 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779 65534'
2016-03-15_00:27:21 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779 Could not find a user in container that is not UID 0 (tried default user, crunch nobody) or there was a problem running 'id' in the container. at /usr/local/arvados/src/sdk/cli/bin/crunch-job line 475
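The excerpt above pins down the failure mode: with a multi-node allocation, srun runs the probe on every node, so stdout carries one UID per node, and the unquoted `$a` in the check's `test $a -ne 0` word-splits into extra arguments. A minimal reproduction (plain sh; no srun or docker needed):

```shell
# What the single-node probe sees: one UID, and the check behaves.
a='1000'
test $a -ne 0 && echo "one node: crunch accepted"

# What a four-node srun fan-out produces: four copies of the UID.
# Unquoted, $a expands to four words, so `test` gets too many
# arguments and exits nonzero -- the check fails even though the
# UID is not 0.
a='1000
1000
1000
1000'
if ! test $a -ne 0 2>/dev/null; then
  echo "four nodes: check fails on word splitting"
fi
```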
Updated by Bryan Cosca about 10 years ago
Not sure if this is related, because it looks like the real error was a node failure, but I'm putting it here just in case. It comes from a GATK Queue child job.
parent job: wx7k5-8i9sb-2dhzflda4gsymw8
child job: wx7k5-8i9sb-ose8gk9vuxqe9gd
2016-03-22_16:33:38 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr starting: ['srun','--nodelist=compute11','/bin/bash','-o','pipefail','-ec',' if ! docker.io images -q --no-trunc --all | grep -qxF d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83; then arv-get 17b65db74aae73465b5e286d1cdb0e23\\+798\\/d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83\\.tar | docker.io load fi ']
2016-03-22_16:33:40 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr Post http:///var/run/docker.sock/v1.20/images/load: EOF.
2016-03-22_16:33:40 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr * Are you trying to connect to a TLS-enabled daemon without TLS?
2016-03-22_16:33:40 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr * Is your docker daemon up and running?
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr srun: error: Node failure on compute11
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 load docker image: exit 0
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 check --memory-swap feature: start
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr starting: ['srun','--nodes=1','docker.io','run','--help']
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr srun: error: SLURM job 1206 has expired.
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 check --memory-swap feature: exit 1
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 check whether default user is UID 0: start
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr starting: ['srun','--nodes=1','/bin/sh','-ec','docker.io run d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user']
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 sending 2x signal 2 to pid 57736
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr srun: error: SLURM job 1206 has expired.
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 check whether default user is UID 0: exit 1
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 '' UID was ''
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 check whether user 'crunch' is UID 0: start
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr starting: ['srun','--nodes=1','/bin/sh','-ec','docker.io run --user=crunch d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user']
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 sending 2x signal 2 to pid 57737
2016-03-22_16:41:15 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr srun: error: SLURM job 1206 has expired.
2016-03-22_16:41:15 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
2016-03-22_16:41:15 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 check whether user 'crunch' is UID 0: exit 1
2016-03-22_16:41:15 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 'crunch' UID was ''
2016-03-22_16:41:15 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 check whether user 'nobody' is UID 0: start
2016-03-22_16:41:15 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr starting: ['srun','--nodes=1','/bin/sh','-ec','docker.io run --user=nobody d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user']
2016-03-22_16:41:15 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 sending 2x signal 2 to pid 57746
2016-03-22_16:41:15 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr srun: error: SLURM job 1206 has expired.
2016-03-22_16:41:15 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
2016-03-22_16:41:15 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 check whether user 'nobody' is UID 0: exit 1
2016-03-22_16:41:15 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 'nobody' UID was ''
2016-03-22_16:41:15 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 Could not find a user in container that is not UID 0 (tried default user, crunch nobody) or there was a problem running 'id' in the container. at /usr/local/arvados/src/sdk/cli/bin/crunch-job line 475
Updated by Bryan Cosca almost 10 years ago
Still running into this.
https://workbench.wx7k5.arvadosapi.com/collections/613ef5b90f65f39fe16bc58ff757c8dc+83/wx7k5-8i9sb-bfelvg19a2j4qxr.log.txt
https://workbench.wx7k5.arvadosapi.com/collections/08a6d26909afda58833a81191323bf0b+83/wx7k5-8i9sb-1wvsbbyc8vsteeb.log.txt
17b65db74aae73465b5e286d1cdb0e23\\+798\\/d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83\\.tar | docker.io load fi ']
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584 load docker image: exit 0
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584 check --memory-swap feature: start
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584 stderr starting: ['srun','--nodes=1','docker.io','run','--help']
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584 stderr srun: error: Unable to create job step: Required node not available (down or drained)
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584 check --memory-swap feature: exit 1
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584 check whether default user is UID 0: start
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584 stderr starting: ['srun','--nodes=1','/bin/sh','-ec','docker.io run d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user']
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584 stderr srun: error: Unable to create job step: Required node not available (down or drained)
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584 check whether default user is UID 0: exit 1
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584 '' UID was ''
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584 check whether user 'crunch' is UID 0: start
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584 stderr starting: ['srun','--nodes=1','/bin/sh','-ec','docker.io run --user=crunch d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user']
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584 stderr srun: error: Unable to create job step: Required node not available (down or drained)
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584 check whether user 'crunch' is UID 0: exit 1
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584 'crunch' UID was ''
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584 check whether user 'nobody' is UID 0: start
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584 stderr starting: ['srun','--nodes=1','/bin/sh','-ec','docker.io run --user=nobody d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user']
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584 stderr srun: error: Unable to create job step: Required node not available (down or drained)
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584 check whether user 'nobody' is UID 0: exit 1
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584 'nobody' UID was ''
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584 Could not find a user in container that is not UID 0 (tried default user, crunch nobody) or there was a problem running 'id' in the container. at /usr/local/arvados/src/sdk/cli/bin/crunch-job line 475
Updated by Bryan Cosca almost 10 years ago
Ran into this again. The difference I noticed between this one and the previous one is that here squeue exits 256, whereas in the other the node was not available.
2016-03-26_11:01:23 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 check --memory-swap feature: start
2016-03-26_11:01:23 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 stderr starting: ['srun','--nodes=1','docker.io','run','--help']
2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 warning: squeue exit status 256 ()
2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 stderr srun: error: Unable to confirm allocation for job 100: Slurm backup controller in standby mode
2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 stderr srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 check --memory-swap feature: exit 1
2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 check whether default user is UID 0: start
2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 stderr starting: ['srun','--nodes=1','/bin/sh','-ec','docker.io run d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user']
2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 sending 2x signal 2 to pid 50593
2016-03-26_11:09:15 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 sending 2x signal 15 to pid 50593
2016-03-26_11:09:20 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 sending 2x signal 9 to pid 50593
2016-03-26_11:09:20 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 check whether default user is UID 0: exit 0 with signal 9
2016-03-26_11:09:20 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 '' UID was ''
2016-03-26_11:09:20 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 check whether user 'crunch' is UID 0: start
2016-03-26_11:09:20 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 stderr starting: ['srun','--nodes=1','/bin/sh','-ec','docker.io run --user=crunch d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user']
2016-03-26_11:09:20 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 sending 2x signal 2 to pid 50602
2016-03-26_11:09:25 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 sending 2x signal 15 to pid 50602
2016-03-26_11:09:30 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 sending 2x signal 9 to pid 50602
2016-03-26_11:09:30 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 check whether user 'crunch' is UID 0: exit 0 with signal 9
2016-03-26_11:09:30 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 'crunch' UID was ''
2016-03-26_11:09:30 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 check whether user 'nobody' is UID 0: start
2016-03-26_11:09:30 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 stderr starting: ['srun','--nodes=1','/bin/sh','-ec','docker.io run --user=nobody d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user']
2016-03-26_11:09:30 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 sending 2x signal 2 to pid 50641
2016-03-26_11:09:35 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 sending 2x signal 15 to pid 50641
2016-03-26_11:09:40 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 sending 2x signal 9 to pid 50641
2016-03-26_11:09:40 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 check whether user 'nobody' is UID 0: exit 0 with signal 9
2016-03-26_11:09:40 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 'nobody' UID was ''
2016-03-26_11:09:40 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 Could not find a user in container that is not UID 0 (tried default user, crunch nobody) or there was a problem running 'id' in the container. at /usr/local/arvados/src/sdk/cli/bin/crunch-job line 475
Updated by Bryan Cosca almost 10 years ago
Another one?
2016-03-26_11:01:07 wx7k5-8i9sb-lrignmqsulpjnzz 48330 check --memory-swap feature: start 2016-03-26_11:01:07 wx7k5-8i9sb-lrignmqsulpjnzz 48330 stderr starting: ['srun','--nodes=1','docker.io','run','--help'] 2016-03-26_11:01:07 wx7k5-8i9sb-lrignmqsulpjnzz 48330 check --memory-swap feature: exit 0 2016-03-26_11:01:07 wx7k5-8i9sb-lrignmqsulpjnzz 48330 check whether default user is UID 0: start 2016-03-26_11:01:07 wx7k5-8i9sb-lrignmqsulpjnzz 48330 stderr starting: ['srun','--nodes=1','/bin/sh','-ec','docker.io run d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user'] 2016-03-26_11:01:08 wx7k5-8i9sb-lrignmqsulpjnzz 48330 check whether default user is UID 0: exit 0 2016-03-26_11:01:08 wx7k5-8i9sb-lrignmqsulpjnzz 48330 '' UID was '0' 2016-03-26_11:01:08 wx7k5-8i9sb-lrignmqsulpjnzz 48330 check whether user 'crunch' is UID 0: start 2016-03-26_11:01:08 wx7k5-8i9sb-lrignmqsulpjnzz 48330 stderr starting: ['srun','--nodes=1','/bin/sh','-ec','docker.io run --user=crunch d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user'] 2016-03-26_11:01:08 wx7k5-8i9sb-lrignmqsulpjnzz 48330 check whether user 'crunch' is UID 0: exit 0 2016-03-26_11:01:08 wx7k5-8i9sb-lrignmqsulpjnzz 48330 Container will run with --user=crunch 2016-03-26_11:01:08 wx7k5-8i9sb-lrignmqsulpjnzz 48330 Packing Arvados SDK version 23cf2d8b1245c4594bf13b5904e8b72600645d0f for installation 2016-03-26_11:01:08 wx7k5-8i9sb-lrignmqsulpjnzz 48330 Looking for version e8b7fbbd711836c59824327364e15ef2253a5848 from repository arvados 2016-03-26_11:01:08 wx7k5-8i9sb-lrignmqsulpjnzz 48330 Using local repository '/var/lib/arvados/internal.git' 2016-03-26_11:01:08 wx7k5-8i9sb-lrignmqsulpjnzz 48330 Version e8b7fbbd711836c59824327364e15ef2253a5848 is commit e8b7fbbd711836c59824327364e15ef2253a5848 2016-03-26_11:01:09 wx7k5-8i9sb-lrignmqsulpjnzz 48330 run install script on all workers: start 2016-03-26_11:01:09 wx7k5-8i9sb-lrignmqsulpjnzz 48330 stderr starting: 
['srun','--nodelist=compute9','-D','/tmp','--job-name=wx7k5-8i9sb-lrignmqsulpjnzz','sh','-c','mkdir -p /tmp/crunch-job/opt && cd /tmp/crunch-job && perl -'] 2016-03-26_11:01:10 wx7k5-8i9sb-lrignmqsulpjnzz 48330 stderr Extracting archive: 1fd7ced65979fbf9a4255a9ba3ef02ca 2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330 run install script on all workers: exit 0 2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330 script run-command 2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330 script_version e8b7fbbd711836c59824327364e15ef2253a5848 2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330 script_parameters {"command":["/bin/sh","-c"," 'java' '-Xmx60416m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=$(task.tmpdir)' '-cp' '/keep/2b4815867fe6b4b640037e374429b4cc+65/Queue.jar' 'org.broadinstitute.gatk.engine.CommandLineGATK' '-T' 'HaplotypeCaller' '-I' '/keep/170898c3e447cfe8ec210b04022554ac+54094/24385-200_AH5G7WCCXX.realigned.bam.realigned.bam' '-L' 'scatter.intervals' '-R' '/keep/3514b8e5da0e8d109946bc809b20a78a+5698/human_g1k_v37.fasta' '-nct' '$(node.cores)' '-variant_index_type' 'LINEAR' '-variant_index_parameter' '128000' '-o' '24385-200_AH5G7WCCXX.realigned.bam.realigned.g.vcf.gz' '-ERC' 'GVCF' '-stand_call_conf' '30.0' '-stand_emit_conf' '15.0' '-ploidy' '2' "],"task.vwd":"d51c66a0eec2ce81611932fcaac3d536+95"} 2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330 runtime_constraints {"docker_image":"bcosc/arv-base-java","max_tasks_per_node":1,"min_cores_per_node":16,"min_ram_mb_per_node":59000,"arvados_sdk_version":"23cf2d8b1245c4594bf13b5904e8b72600645d0f"} 2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330 start level 0 with 1 slots 2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330 status: 0 done, 0 running, 1 todo 2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 job_task wx7k5-ot0gb-tl4rdb77w4pk8iv 2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 child 49162 
started on compute9.1 2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 stderr starting: ['srun','--nodelist=compute9','-n1','-c1','-N1','-D','/tmp','--job-name=wx7k5-8i9sb-lrignmqsulpjnzz.0.49162','bash','-c','if [ -e \\/tmp\\/crunch\\-job\\/task\\/compute9\\.1 ]; then rm -rf \\/tmp\\/crunch\\-job\\/task\\/compute9\\.1; fi; mkdir -p \\/tmp\\/crunch\\-job \\/tmp\\/crunch\\-job\\/work \\/tmp\\/crunch\\-job\\/task\\/compute9\\.1 \\/tmp\\/crunch\\-job\\/task\\/compute9\\.1\\.keep && cd \\/tmp\\/crunch\\-job && MEM=$(awk \'($1 == "MemTotal:"){print $2}\' </proc/meminfo) && SWAP=$(awk \'($1 == "SwapTotal:"){print $2}\' </proc/meminfo) && MEMLIMIT=$(( ($MEM * 95) / (1 * 100) )) && let SWAPLIMIT=$MEMLIMIT+$SWAP && exec arv-mount --read-write --mount-by-pdh=by_pdh --mount-tmp=tmp --crunchstat-interval=10 --allow-other \\/tmp\\/crunch\\-job\\/task\\/compute9\\.1\\.keep --exec crunchstat -cgroup-root=\\/sys\\/fs\\/cgroup -cgroup-parent=docker -cgroup-cid=/tmp/crunch-job/wx7k5-ot0gb-tl4rdb77w4pk8iv-0.cid -poll=10000 docker.io run --name=wx7k5-ot0gb-tl4rdb77w4pk8iv-0 --attach=stdout --attach=stderr --attach=stdin -i \\-\\-user\\=crunch --cidfile=/tmp/crunch-job/wx7k5-ot0gb-tl4rdb77w4pk8iv-0.cid --sig-proxy --memory=${MEMLIMIT}k --memory-swap=${SWAPLIMIT}k --volume=\\/tmp\\/crunch\\-job\\/src\\:\\/tmp\\/crunch\\-job\\/src\\:ro --volume=\\/tmp\\/crunch\\-job\\/opt\\:\\/tmp\\/crunch\\-job\\/opt\\:ro --volume=\\/tmp\\/crunch\\-job\\/task\\/compute9\\.1\\.keep\\/by_pdh\\:\\/keep\\:ro --volume=\\/tmp\\/crunch\\-job\\/task\\/compute9\\.1\\.keep\\/tmp\\:\\/keep_tmp --volume=/tmp --env=JOB_PARAMETER_COMMAND\\=ARRAY\\(0x31e5318\\) --env=TASK_UUID\\=wx7k5\\-ot0gb\\-tl4rdb77w4pk8iv --env=TASK_WORK\\=\\/tmp\\/crunch\\-job\\-task\\-work\\/compute9\\.1 --env=TASK_TMPDIR\\=\\/tmp\\/crunch\\-job\\-task\\-work\\/compute9\\.1 --env=TASK_SEQUENCE\\=0 --env=CRUNCH_TMP\\=\\/tmp\\/crunch\\-job --env=CRUNCH_SRC_COMMIT\\=e8b7fbbd711836c59824327364e15ef2253a5848 
--env=CRUNCH_JOB_BIN\\=\\/usr\\/local\\/arvados\\/src\\/services\\/crunch\\/crunch\\-job --env=CRUNCH_GIT_ARCHIVE_HASH\\=1fd7ced65979fbf9a4255a9ba3ef02ca --env=TASK_QSEQUENCE\\=0 --env=CRUNCH_NODE_SLOTS\\=1 --env=JOB_PARAMETER_TASK\\.VWD\\=d51c66a0eec2ce81611932fcaac3d536\\+95 --env=CRUNCH_SRC\\=\\/tmp\\/crunch\\-job\\/src --env=JOB_UUID\\=wx7k5\\-8i9sb\\-lrignmqsulpjnzz --env=TASK_SLOT_NUMBER\\=1 --env=TASK_KEEPMOUNT_TMP\\=\\/keep_tmp --env=JOB_SCRIPT\\=run\\-command --env=ARVADOS_API_HOST\\=wx7k5\\.arvadosapi\\.com --env=CRUNCH_JOB_UUID\\=wx7k5\\-8i9sb\\-lrignmqsulpjnzz --env=ARVADOS_API_TOKEN\\=[...] --env=TASK_KEEPMOUNT\\=\\/keep --env=CRUNCH_SRC_URL\\=\\/var\\/lib\\/arvados\\/internal\\.git --env=CRUNCH_REFRESH_TRIGGER\\=\\/tmp\\/crunch_refresh_trigger --env=TASK_SLOT_NODE\\=compute9 --env=JOB_WORK\\=\\/tmp\\/crunch\\-job\\-work --env=CRUNCH_INSTALL\\=\\/tmp\\/crunch\\-job\\/opt --env=CRUNCH_WORK\\=\\/tmp\\/crunch\\-job\\/work --env=HOME\\=\\/tmp\\/crunch\\-job\\-task\\-work\\/compute9\\.1 d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 stdbuf --output=0 --error=0 perl - \\/tmp\\/crunch\\-job\\/src\\/crunch_scripts\\/run\\-command'] 2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330 status: 0 done, 1 running, 0 todo 2016-03-26_11:09:10 wx7k5-8i9sb-lrignmqsulpjnzz 48330 warning: squeue exit status 256 () 2016-03-26_11:09:10 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 stderr srun: error: Unable to confirm allocation for job 99: Slurm backup controller in standby mode 2016-03-26_11:09:10 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 stderr srun: Check SLURM_JOB_ID environment variable for expired or invalid job. 2016-03-26_11:09:10 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 child 49162 on compute9.1 exit 1 success= 2016-03-26_11:09:10 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 ERROR: Task process exited 1, but never updated its task record to indicate success and record its output.
Updated by Brett Smith almost 10 years ago
There's good news and bad news here.
The bad news is, as you've noticed, a lot of these failures are happening for different reasons, and we're probably going to have to address each problem separately. I have filed bugs about each case that looks different to me: #8810, #8811, #8812.
The good news is, our logging has gotten good enough that we actually have some new information to work with to tackle these. The reason so many of them seem to trigger the "Could not find a user in container that is not UID 0" message is that this is the first piece of code in crunch-job that requires SLURM and Docker to affirmatively work. If we can't run commands in Docker and report the result all the way back to crunch-job, the check fails. So it acts as a gate in front of all kinds of other SLURM and Docker failures. If we didn't have this check, the jobs would still be failing, just for different reasons.
Updated by Joshua Randall almost 10 years ago
N.B. this was also the error I got when docker didn't have enough disk space to create a container.