Bug #8203

closed

[Crunch] Mystery problem running container / Could not find a user in container that is not UID 0

Added by Bryan Cosca about 10 years ago. Updated about 6 years ago.

Status:
Closed
Priority:
Normal
Assigned To:
-
Category:
Crunch
Target version:
-
Story points:
-

Description

GATK Queue parent job: https://workbench.qr2hi.arvadosapi.com/jobs/qr2hi-8i9sb-idahcsb4sdi9ko3#Log
GATK Queue child job: https://workbench.qr2hi.arvadosapi.com/jobs/qr2hi-8i9sb-pxxnn8d4ldc7tjz#Log

2016-01-13_16:11:32 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 running from /usr/local/arvados/src/sdk/cli/bin/crunch-job with arvados-cli Gem version(s) 0.1.20151023190001
2016-01-13_16:11:32 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 check slurm allocation
2016-01-13_16:11:32 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 node compute19 - 1 slots
2016-01-13_16:11:32 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 start
2016-01-13_16:11:32 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 Clean work dirs
2016-01-13_16:12:03 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 starting: ['srun','--nodelist=compute19','/bin/sh','-ec','docker.io run --help | grep -qe --memory-swap=']
2016-01-13_16:12:05 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 Checking if container default user is not UID 0
2016-01-13_16:12:05 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 starting: ['srun','--nodelist=compute19','/bin/sh','-ec','a=`docker.io run d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user` && test $a -ne 0']
2016-01-13_16:12:07 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 Checking if user 'crunch' is not UID 0
2016-01-13_16:12:07 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 starting: ['srun','--nodelist=compute19','/bin/sh','-ec','a=`docker.io run --user=crunch d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user` && test $a -ne 0']
2016-01-13_16:12:08 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 Checking if user 'nobody' is not UID 0
2016-01-13_16:12:09 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 starting: ['srun','--nodelist=compute19','/bin/sh','-ec','a=`docker.io run --user=nobody d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user` && test $a -ne 0']
2016-01-13_16:12:10 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 Could not find a user in container that is not UID 0 (tried default user, crunch nobody) or there was a problem running 'id' in the container. at /usr/local/arvados/src/sdk/cli/bin/crunch-job line 486
2016-01-13_16:12:11 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 Freeze not implemented
2016-01-13_16:12:11 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 collate
2016-01-13_16:12:11 qr2hi-8i9sb-pxxnn8d4ldc7tjz 12557 collated output manifest text to send to API server is 0 bytes with access tokens

Actions #1

Updated by Bryan Cosca about 10 years ago

P.S. This is currently blocking #8029.

Actions #2

Updated by Bryan Cosca about 10 years ago

  • Description updated (diff)
Actions #3

Updated by Brett Smith about 10 years ago

Testing by hand indicates the test should've passed for user=crunch, so yeah, something's up. Might've been a problem running docker on the compute node at the time?

Actions #4

Updated by Tom Clegg about 10 years ago

  • Subject changed from [Crunch] Could not find a user in container that is not UID 0 to [Crunch] Mystery problem running container / Could not find a user in container that is not UID 0
  • Category set to Crunch
Before we do the UID check, we
  • Run docker ps -q to make sure we can run docker
  • Run docker images | grep and possibly arv-get | docker load if the image isn't installed yet
  • Run docker run --help to find out whether we can use --memory-swap

But this is the first point where we actually try to start a container, and we don't capture docker's stderr in the job log. Even the "or there was a problem running 'id' in the container" part of the error message might be misleading, since "a problem running 'id'" can include "a problem starting the container" (see #7592#note-19).

I notice arv-get | docker load will succeed if docker load exits 0, even if arv-get exits non-zero. I wonder whether there are cases where docker load exits 0 even though the image received on stdin is truncated?

Of course, even if the image was loaded correctly, container startup could have failed for various reasons.

I think the next sensible step here is to capture stdout/stderr from this docker run invocation, and display stderr (and error out) if stdout is anything but a UID number.

#8099 should give us access to stdout/stderr from every use of srun(), or at least move us in that direction.
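
The validation proposed here can be sketched in a few lines of shell. This is an illustrative sketch, not the actual crunch-job code: the `is_uid` helper name is invented, and it simply rejects anything that is not exactly one decimal number, so stray error text on stdout, empty output, or concatenated multi-node output would all trigger the error path instead of being misread as "user is UID 0".

```shell
# is_uid STR: succeed only when STR is a single non-negative decimal number
# (no newlines, no error text mixed in). Hypothetical helper, not crunch-job.
is_uid() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;  # empty, or contains a non-digit (incl. newline)
    *) return 0 ;;
  esac
}

is_uid 1000 && echo accept                    # a real UID: accept
is_uid "$(printf '0\n0\n0')" || echo reject   # multi-line output: reject
is_uid "srun: error" || echo reject           # error text on stdout: reject
```

With a check like this, crunch-job could print the captured stderr and abort as soon as the probe's stdout fails the test, rather than concluding that no non-root user exists.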

Actions #5

Updated by Brett Smith about 10 years ago

My commit on this did not totally prevent the problem. See wx7k5-8i9sb-vh8f43korgr9en6, which failed the exact same way and ran with the fix installed.

Actions #6

Updated by Brett Smith about 10 years ago

Tom Clegg wrote:

> I notice arv-get | docker load will succeed if docker load exits 0, even if arv-get exits non-zero. I wonder whether there are cases where docker load exits 0 even though the image received on stdin is truncated?

My commit tackled this problem by running the install script with bash -o pipefail.

> I think the next sensible step here is to capture stdout/stderr from this docker run invocation, and display stderr (and error out) if stdout is anything but a UID number.
>
> #8099 should give us access to stdout/stderr from every use of srun(), or at least move us in that direction.

Agreed this is the best next step available to us.
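
For reference, a two-line demonstration of why `-o pipefail` matters here (generic shell, not the actual install script): without it, a pipeline reports only the last command's status, so a failing producer like arv-get is invisible when docker load still exits 0.

```shell
# Default behavior: the pipeline's status is cat's (0), hiding false's failure.
sh -c 'false | cat'; echo "default: $?"                  # default: 0

# With pipefail, any failing stage makes the whole pipeline fail.
bash -o pipefail -c 'false | cat'; echo "pipefail: $?"   # pipefail: 1
```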

Actions #7

Updated by Bryan Cosca about 10 years ago

This happened again: https://workbench.wx7k5.arvadosapi.com/jobs/wx7k5-8i9sb-1moswbbv1re4b4u#Log

2016-03-14_20:39:15 wx7k5-8i9sb-1moswbbv1re4b4u 1964  running from /usr/local/arvados/src/sdk/cli/bin/crunch-job with arvados-cli Gem version(s) 0.1.20151207150126, 0.1.20151023190001, 0.1.20150930141818, 0.1.20150205181653
2016-03-14_20:39:15 wx7k5-8i9sb-1moswbbv1re4b4u 1964  check slurm allocation
2016-03-14_20:39:15 wx7k5-8i9sb-1moswbbv1re4b4u 1964  node compute4 - 1 slots
2016-03-14_20:39:15 wx7k5-8i9sb-1moswbbv1re4b4u 1964  node compute5 - 1 slots
2016-03-14_20:39:15 wx7k5-8i9sb-1moswbbv1re4b4u 1964  node compute6 - 1 slots
2016-03-14_20:39:15 wx7k5-8i9sb-1moswbbv1re4b4u 1964  node compute7 - 1 slots
2016-03-14_20:39:15 wx7k5-8i9sb-1moswbbv1re4b4u 1964  start
2016-03-14_20:39:15 wx7k5-8i9sb-1moswbbv1re4b4u 1964  clean work dirs: start
2016-03-14_20:39:15 wx7k5-8i9sb-1moswbbv1re4b4u 1964  stderr starting: ['srun','--nodelist=compute4,compute5,compute6,compute7','-D','/tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") == 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
2016-03-14_20:39:16 wx7k5-8i9sb-1moswbbv1re4b4u 1964  clean work dirs: exit 0
2016-03-14_20:39:16 wx7k5-8i9sb-1moswbbv1re4b4u 1964  Install docker image be73c9ab5395123d42badd280d692486+966
2016-03-14_20:39:16 wx7k5-8i9sb-1moswbbv1re4b4u 1964  docker image hash is 0c1c44a72aa91ab8f566e4762d1ce20b957d45e3a671567218815806cfa5c4c9
2016-03-14_20:39:16 wx7k5-8i9sb-1moswbbv1re4b4u 1964  load docker image: start
2016-03-14_20:39:16 wx7k5-8i9sb-1moswbbv1re4b4u 1964  stderr starting: ['srun','--nodelist=compute4,compute5,compute6,compute7','/bin/bash','-o','pipefail','-ec',' if ! docker.io images -q --no-trunc --all | grep -qxF 0c1c44a72aa91ab8f566e4762d1ce20b957d45e3a671567218815806cfa5c4c9; then     arv-get be73c9ab5395123d42badd280d692486\\+966\\/0c1c44a72aa91ab8f566e4762d1ce20b957d45e3a671567218815806cfa5c4c9\\.tar | docker.io load fi ']
2016-03-14_20:40:09 wx7k5-8i9sb-1moswbbv1re4b4u 1964  load docker image: exit 0
2016-03-14_20:40:09 wx7k5-8i9sb-1moswbbv1re4b4u 1964  check --memory-swap feature: start
2016-03-14_20:40:09 wx7k5-8i9sb-1moswbbv1re4b4u 1964  stderr starting: ['srun','--nodelist=compute4','docker.io','run','--help']
2016-03-14_20:40:09 wx7k5-8i9sb-1moswbbv1re4b4u 1964  check --memory-swap feature: exit 0
2016-03-14_20:40:09 wx7k5-8i9sb-1moswbbv1re4b4u 1964  check whether default user is UID 0: start
2016-03-14_20:40:09 wx7k5-8i9sb-1moswbbv1re4b4u 1964  stderr starting: ['srun','--nodelist=compute4','/bin/sh','-ec','docker.io run   0c1c44a72aa91ab8f566e4762d1ce20b957d45e3a671567218815806cfa5c4c9 id --user']
2016-03-14_20:40:10 wx7k5-8i9sb-1moswbbv1re4b4u 1964  check whether default user is UID 0: exit 0
2016-03-14_20:40:10 wx7k5-8i9sb-1moswbbv1re4b4u 1964  check whether user 'crunch' is UID 0: start
2016-03-14_20:40:10 wx7k5-8i9sb-1moswbbv1re4b4u 1964  stderr starting: ['srun','--nodelist=compute4','/bin/sh','-ec','docker.io run  --user=crunch 0c1c44a72aa91ab8f566e4762d1ce20b957d45e3a671567218815806cfa5c4c9 id --user']
2016-03-14_20:40:11 wx7k5-8i9sb-1moswbbv1re4b4u 1964  check whether user 'crunch' is UID 0: exit 0
2016-03-14_20:40:11 wx7k5-8i9sb-1moswbbv1re4b4u 1964  check whether user 'nobody' is UID 0: start
2016-03-14_20:40:11 wx7k5-8i9sb-1moswbbv1re4b4u 1964  stderr starting: ['srun','--nodelist=compute4','/bin/sh','-ec','docker.io run  --user=nobody 0c1c44a72aa91ab8f566e4762d1ce20b957d45e3a671567218815806cfa5c4c9 id --user']
2016-03-14_20:40:11 wx7k5-8i9sb-1moswbbv1re4b4u 1964  check whether user 'nobody' is UID 0: exit 0
2016-03-14_20:40:11 wx7k5-8i9sb-1moswbbv1re4b4u 1964  Could not find a user in container that is not UID 0 (tried default user,  crunch nobody) or there was a problem running 'id' in the container. at /usr/local/arvados/src/sdk/cli/bin/crunch-job line 474
2016-03-14_20:40:11 wx7k5-8i9sb-1moswbbv1re4b4u 1964  Freeze not implemented
2016-03-14_20:40:11 wx7k5-8i9sb-1moswbbv1re4b4u 1964  collate
2016-03-14_20:40:11 wx7k5-8i9sb-1moswbbv1re4b4u 1964  collated output manifest text to send to API server is 0 bytes with access tokens
Actions #8

Updated by Brett Smith about 10 years ago

Note that we saw the bug hit 6/6 jobs in the span of an hour. This is a marked change: previously we only saw the bug very occasionally.

Of course, right after I added a debugging log to try to track down the problem, my own test job succeeded…

Actions #9

Updated by Brett Smith about 10 years ago

I ran 12 jobs successfully. I noticed that Bryan's jobs all require multiple nodes. So I tried rerunning one of his with an extra debugging statement. https://workbench.wx7k5.arvadosapi.com/jobs/wx7k5-8i9sb-v5vcl6tol7lc4vp#Log is very illustrative of the problem.

Actions #10

Updated by Brett Smith about 10 years ago

Brett Smith wrote:

> I ran 12 jobs successfully. I noticed that Bryan's jobs all require multiple nodes. So I tried rerunning one of his with an extra debugging statement. https://workbench.wx7k5.arvadosapi.com/jobs/wx7k5-8i9sb-v5vcl6tol7lc4vp#Log is very illustrative of the problem.

For posterity: I added this statement to the end of the @tryusers loop:

Log(undef, "'$try_user' UID was '$stdout'");

Here's the log excerpt for a job with four nodes:

2016-03-15_00:27:19 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779  check whether default user is UID 0: start
2016-03-15_00:27:19 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779  stderr starting: ['srun','--nodelist=compute6','/bin/sh','-ec','docker.io run   0c1c44a72aa91ab8f566e4762d1ce20b957d45e3a671567218815806cfa5c4c9 id --user']
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779  check whether default user is UID 0: exit 0
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779  '' UID was '0
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779  0
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779  0
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779  0'
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779  check whether user 'crunch' is UID 0: start
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779  stderr starting: ['srun','--nodelist=compute6','/bin/sh','-ec','docker.io run  --user=crunch 0c1c44a72aa91ab8f566e4762d1ce20b957d45e3a671567218815806cfa5c4c9 id --user']
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779  check whether user 'crunch' is UID 0: exit 0
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779  'crunch' UID was '1000
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779  1000
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779  1000
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779  1000'
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779  check whether user 'nobody' is UID 0: start
2016-03-15_00:27:20 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779  stderr starting: ['srun','--nodelist=compute6','/bin/sh','-ec','docker.io run  --user=nobody 0c1c44a72aa91ab8f566e4762d1ce20b957d45e3a671567218815806cfa5c4c9 id --user']
2016-03-15_00:27:21 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779  check whether user 'nobody' is UID 0: exit 0
2016-03-15_00:27:21 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779  'nobody' UID was '65534
2016-03-15_00:27:21 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779  65534
2016-03-15_00:27:21 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779  65534
2016-03-15_00:27:21 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779  65534'
2016-03-15_00:27:21 wx7k5-8i9sb-v5vcl6tol7lc4vp 56779  Could not find a user in container that is not UID 0 (tried default user,  crunch nobody) or there was a problem running 'id' in the container. at /usr/local/arvados/src/sdk/cli/bin/crunch-job line 475
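
The debug output makes the failure mode concrete: on a four-node allocation, srun runs the probe on every node, so stdout carries the UID once per node rather than a single number. A minimal reproduction of how that defeats an integer comparison follows; it mimics the January-era `test $a -ne 0` form of the check, and the exact crunch-job logic may differ.

```shell
# Simulate the probe output crunch-job received from a 4-node srun:
# the same UID, repeated once per node.
a=$(printf '1000\n1000\n1000\n1000')

# The multi-line value is not an integer, so test errors out (exit status 2)
# and the user is rejected even though every node reported a non-root UID.
if test "$a" -ne 0 2>/dev/null; then
  echo "non-root user accepted"
else
  echo "check failed despite UID 1000 on every node"
fi
```

Single-node runs succeed because `$a` is then exactly one number, which is consistent with only the multi-node jobs hitting this error.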
Actions #11

Updated by Bryan Cosca about 10 years ago

Not sure if this is related, because it looks like the real error was a node failure, but putting it in here just in case. It comes from a GATK Queue child job.

parent job: wx7k5-8i9sb-2dhzflda4gsymw8
child job: wx7k5-8i9sb-ose8gk9vuxqe9gd

https://workbench.wx7k5.arvadosapi.com/collections/0e9d5fda03623c3fdfd368740fef178c+83/wx7k5-8i9sb-ose8gk9vuxqe9gd.log.txt

2016-03-22_16:33:38 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  stderr starting: ['srun','--nodelist=compute11','/bin/bash','-o','pipefail','-ec',' if ! docker.io images -q --no-trunc --all | grep -qxF d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83; then     arv-get 17b65db74aae73465b5e286d1cdb0e23\\+798\\/d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83\\.tar | docker.io load fi ']
2016-03-22_16:33:40 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  stderr Post http:///var/run/docker.sock/v1.20/images/load: EOF.
2016-03-22_16:33:40 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  stderr * Are you trying to connect to a TLS-enabled daemon without TLS?
2016-03-22_16:33:40 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  stderr * Is your docker daemon up and running?
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  stderr srun: error: Node failure on compute11
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  load docker image: exit 0
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  check --memory-swap feature: start
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  stderr starting: ['srun','--nodes=1','docker.io','run','--help']
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  stderr srun: error: SLURM job 1206 has expired.
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  stderr srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  check --memory-swap feature: exit 1
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  check whether default user is UID 0: start
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  stderr starting: ['srun','--nodes=1','/bin/sh','-ec','docker.io run   d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user']
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  sending 2x signal 2 to pid 57736
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  stderr srun: error: SLURM job 1206 has expired.
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  stderr srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  check whether default user is UID 0: exit 1
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  '' UID was ''
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  check whether user 'crunch' is UID 0: start
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  stderr starting: ['srun','--nodes=1','/bin/sh','-ec','docker.io run  --user=crunch d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user']
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  sending 2x signal 2 to pid 57737
2016-03-22_16:41:15 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  stderr srun: error: SLURM job 1206 has expired.
2016-03-22_16:41:15 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  stderr srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
2016-03-22_16:41:15 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  check whether user 'crunch' is UID 0: exit 1
2016-03-22_16:41:15 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  'crunch' UID was ''
2016-03-22_16:41:15 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  check whether user 'nobody' is UID 0: start
2016-03-22_16:41:15 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  stderr starting: ['srun','--nodes=1','/bin/sh','-ec','docker.io run  --user=nobody d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user']
2016-03-22_16:41:15 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  sending 2x signal 2 to pid 57746
2016-03-22_16:41:15 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  stderr srun: error: SLURM job 1206 has expired.
2016-03-22_16:41:15 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  stderr srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
2016-03-22_16:41:15 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  check whether user 'nobody' is UID 0: exit 1
2016-03-22_16:41:15 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  'nobody' UID was ''
2016-03-22_16:41:15 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  Could not find a user in container that is not UID 0 (tried default user,  crunch nobody) or there was a problem running 'id' in the container. at /usr/local/arvados/src/sdk/cli/bin/crunch-job line 475
Actions #12

Updated by Bryan Cosca almost 10 years ago

Still running into this.

https://workbench.wx7k5.arvadosapi.com/collections/613ef5b90f65f39fe16bc58ff757c8dc+83/wx7k5-8i9sb-bfelvg19a2j4qxr.log.txt
https://workbench.wx7k5.arvadosapi.com/collections/08a6d26909afda58833a81191323bf0b+83/wx7k5-8i9sb-1wvsbbyc8vsteeb.log.txt

17b65db74aae73465b5e286d1cdb0e23\\+798\\/d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83\\.tar | docker.io load fi ']
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584  load docker image: exit 0
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584  check --memory-swap feature: start
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584  stderr starting: ['srun','--nodes=1','docker.io','run','--help']
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584  stderr srun: error: Unable to create job step: Required node not available (down or drained)
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584  check --memory-swap feature: exit 1
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584  check whether default user is UID 0: start
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584  stderr starting: ['srun','--nodes=1','/bin/sh','-ec','docker.io run   d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user']
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584  stderr srun: error: Unable to create job step: Required node not available (down or drained)
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584  check whether default user is UID 0: exit 1
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584  '' UID was ''
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584  check whether user 'crunch' is UID 0: start
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584  stderr starting: ['srun','--nodes=1','/bin/sh','-ec','docker.io run  --user=crunch d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user']
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584  stderr srun: error: Unable to create job step: Required node not available (down or drained)
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584  check whether user 'crunch' is UID 0: exit 1
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584  'crunch' UID was ''
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584  check whether user 'nobody' is UID 0: start
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584  stderr starting: ['srun','--nodes=1','/bin/sh','-ec','docker.io run  --user=nobody d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user']
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584  stderr srun: error: Unable to create job step: Required node not available (down or drained)
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584  check whether user 'nobody' is UID 0: exit 1
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584  'nobody' UID was ''
2016-03-26_03:29:46 wx7k5-8i9sb-bfelvg19a2j4qxr 24584  Could not find a user in container that is not UID 0 (tried default user,  crunch nobody) or there was a problem running 'id' in the container. at /usr/local/arvados/src/sdk/cli/bin/crunch-job line 475
Actions #13

Updated by Bryan Cosca almost 10 years ago

Ran into this again. I noticed the difference between this one and the previous one: here squeue exits 256, while in the other the node was not available.

https://workbench.wx7k5.arvadosapi.com/collections/6ca346fbd8724d3e79ce17100fd9b7d1+83/wx7k5-8i9sb-dwycsrmvrnh5ec1.log.txt

2016-03-26_11:01:23 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  check --memory-swap feature: start
2016-03-26_11:01:23 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  stderr starting: ['srun','--nodes=1','docker.io','run','--help']
2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  warning: squeue exit status 256 ()
2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  stderr srun: error: Unable to confirm allocation for job 100: Slurm backup controller in standby mode
2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  stderr srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  check --memory-swap feature: exit 1
2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  check whether default user is UID 0: start
2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  stderr starting: ['srun','--nodes=1','/bin/sh','-ec','docker.io run   d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user']
2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  sending 2x signal 2 to pid 50593
2016-03-26_11:09:15 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  sending 2x signal 15 to pid 50593
2016-03-26_11:09:20 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  sending 2x signal 9 to pid 50593
2016-03-26_11:09:20 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  check whether default user is UID 0: exit 0 with signal 9
2016-03-26_11:09:20 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  '' UID was ''
2016-03-26_11:09:20 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  check whether user 'crunch' is UID 0: start
2016-03-26_11:09:20 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  stderr starting: ['srun','--nodes=1','/bin/sh','-ec','docker.io run  --user=crunch d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user']
2016-03-26_11:09:20 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  sending 2x signal 2 to pid 50602
2016-03-26_11:09:25 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  sending 2x signal 15 to pid 50602
2016-03-26_11:09:30 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  sending 2x signal 9 to pid 50602
2016-03-26_11:09:30 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  check whether user 'crunch' is UID 0: exit 0 with signal 9
2016-03-26_11:09:30 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  'crunch' UID was ''
2016-03-26_11:09:30 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  check whether user 'nobody' is UID 0: start
2016-03-26_11:09:30 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  stderr starting: ['srun','--nodes=1','/bin/sh','-ec','docker.io run  --user=nobody d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user']
2016-03-26_11:09:30 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  sending 2x signal 2 to pid 50641
2016-03-26_11:09:35 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  sending 2x signal 15 to pid 50641
2016-03-26_11:09:40 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  sending 2x signal 9 to pid 50641
2016-03-26_11:09:40 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  check whether user 'nobody' is UID 0: exit 0 with signal 9
2016-03-26_11:09:40 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  'nobody' UID was ''
2016-03-26_11:09:40 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  Could not find a user in container that is not UID 0 (tried default user,  crunch nobody) or there was a problem running 'id' in the container. at /usr/local/arvados/src/sdk/cli/bin/crunch-job line 475
Actions #14

Updated by Bryan Cosca almost 10 years ago

Another one?

https://workbench.wx7k5.arvadosapi.com/collections/5d9bbfa9a4521929de6cb093557f061a+85/wx7k5-8i9sb-lrignmqsulpjnzz.log.txt

2016-03-26_11:01:07 wx7k5-8i9sb-lrignmqsulpjnzz 48330  check --memory-swap feature: start
2016-03-26_11:01:07 wx7k5-8i9sb-lrignmqsulpjnzz 48330  stderr starting: ['srun','--nodes=1','docker.io','run','--help']
2016-03-26_11:01:07 wx7k5-8i9sb-lrignmqsulpjnzz 48330  check --memory-swap feature: exit 0
2016-03-26_11:01:07 wx7k5-8i9sb-lrignmqsulpjnzz 48330  check whether default user is UID 0: start
2016-03-26_11:01:07 wx7k5-8i9sb-lrignmqsulpjnzz 48330  stderr starting: ['srun','--nodes=1','/bin/sh','-ec','docker.io run   d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user']
2016-03-26_11:01:08 wx7k5-8i9sb-lrignmqsulpjnzz 48330  check whether default user is UID 0: exit 0
2016-03-26_11:01:08 wx7k5-8i9sb-lrignmqsulpjnzz 48330  '' UID was '0'
2016-03-26_11:01:08 wx7k5-8i9sb-lrignmqsulpjnzz 48330  check whether user 'crunch' is UID 0: start
2016-03-26_11:01:08 wx7k5-8i9sb-lrignmqsulpjnzz 48330  stderr starting: ['srun','--nodes=1','/bin/sh','-ec','docker.io run  --user=crunch d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user']
2016-03-26_11:01:08 wx7k5-8i9sb-lrignmqsulpjnzz 48330  check whether user 'crunch' is UID 0: exit 0
2016-03-26_11:01:08 wx7k5-8i9sb-lrignmqsulpjnzz 48330  Container will run with --user=crunch
2016-03-26_11:01:08 wx7k5-8i9sb-lrignmqsulpjnzz 48330  Packing Arvados SDK version 23cf2d8b1245c4594bf13b5904e8b72600645d0f for installation
2016-03-26_11:01:08 wx7k5-8i9sb-lrignmqsulpjnzz 48330  Looking for version e8b7fbbd711836c59824327364e15ef2253a5848 from repository arvados
2016-03-26_11:01:08 wx7k5-8i9sb-lrignmqsulpjnzz 48330  Using local repository '/var/lib/arvados/internal.git'
2016-03-26_11:01:08 wx7k5-8i9sb-lrignmqsulpjnzz 48330  Version e8b7fbbd711836c59824327364e15ef2253a5848 is commit e8b7fbbd711836c59824327364e15ef2253a5848
2016-03-26_11:01:09 wx7k5-8i9sb-lrignmqsulpjnzz 48330  run install script on all workers: start
2016-03-26_11:01:09 wx7k5-8i9sb-lrignmqsulpjnzz 48330  stderr starting: ['srun','--nodelist=compute9','-D','/tmp','--job-name=wx7k5-8i9sb-lrignmqsulpjnzz','sh','-c','mkdir -p /tmp/crunch-job/opt && cd /tmp/crunch-job && perl -']
2016-03-26_11:01:10 wx7k5-8i9sb-lrignmqsulpjnzz 48330  stderr Extracting archive: 1fd7ced65979fbf9a4255a9ba3ef02ca
2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330  run install script on all workers: exit 0
2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330  script run-command
2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330  script_version e8b7fbbd711836c59824327364e15ef2253a5848
2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330  script_parameters {"command":["/bin/sh","-c"," 'java'  '-Xmx60416m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=$(task.tmpdir)'  '-cp' '/keep/2b4815867fe6b4b640037e374429b4cc+65/Queue.jar'  'org.broadinstitute.gatk.engine.CommandLineGATK'  '-T' 'HaplotypeCaller'  '-I' '/keep/170898c3e447cfe8ec210b04022554ac+54094/24385-200_AH5G7WCCXX.realigned.bam.realigned.bam'  '-L' 'scatter.intervals'  '-R' '/keep/3514b8e5da0e8d109946bc809b20a78a+5698/human_g1k_v37.fasta'  '-nct' '$(node.cores)'  '-variant_index_type' 'LINEAR'  '-variant_index_parameter' '128000'  '-o' '24385-200_AH5G7WCCXX.realigned.bam.realigned.g.vcf.gz'  '-ERC' 'GVCF'  '-stand_call_conf' '30.0'  '-stand_emit_conf' '15.0'  '-ploidy' '2' "],"task.vwd":"d51c66a0eec2ce81611932fcaac3d536+95"}
2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330  runtime_constraints {"docker_image":"bcosc/arv-base-java","max_tasks_per_node":1,"min_cores_per_node":16,"min_ram_mb_per_node":59000,"arvados_sdk_version":"23cf2d8b1245c4594bf13b5904e8b72600645d0f"}
2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330  start level 0 with 1 slots
2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330  status: 0 done, 0 running, 1 todo
2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 job_task wx7k5-ot0gb-tl4rdb77w4pk8iv
2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 child 49162 started on compute9.1
2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 stderr starting: ['srun','--nodelist=compute9','-n1','-c1','-N1','-D','/tmp','--job-name=wx7k5-8i9sb-lrignmqsulpjnzz.0.49162','bash','-c','if [ -e \\/tmp\\/crunch\\-job\\/task\\/compute9\\.1 ]; then rm -rf \\/tmp\\/crunch\\-job\\/task\\/compute9\\.1; fi; mkdir -p \\/tmp\\/crunch\\-job \\/tmp\\/crunch\\-job\\/work \\/tmp\\/crunch\\-job\\/task\\/compute9\\.1 \\/tmp\\/crunch\\-job\\/task\\/compute9\\.1\\.keep && cd \\/tmp\\/crunch\\-job && MEM=$(awk \'($1 == "MemTotal:"){print $2}\' </proc/meminfo) && SWAP=$(awk \'($1 == "SwapTotal:"){print $2}\' </proc/meminfo) && MEMLIMIT=$(( ($MEM * 95) / (1 * 100) )) && let SWAPLIMIT=$MEMLIMIT+$SWAP && exec arv-mount --read-write --mount-by-pdh=by_pdh --mount-tmp=tmp --crunchstat-interval=10 --allow-other  \\/tmp\\/crunch\\-job\\/task\\/compute9\\.1\\.keep --exec crunchstat -cgroup-root=\\/sys\\/fs\\/cgroup -cgroup-parent=docker -cgroup-cid=/tmp/crunch-job/wx7k5-ot0gb-tl4rdb77w4pk8iv-0.cid -poll=10000 docker.io run  --name=wx7k5-ot0gb-tl4rdb77w4pk8iv-0 --attach=stdout --attach=stderr --attach=stdin -i \\-\\-user\\=crunch --cidfile=/tmp/crunch-job/wx7k5-ot0gb-tl4rdb77w4pk8iv-0.cid --sig-proxy --memory=${MEMLIMIT}k --memory-swap=${SWAPLIMIT}k --volume=\\/tmp\\/crunch\\-job\\/src\\:\\/tmp\\/crunch\\-job\\/src\\:ro --volume=\\/tmp\\/crunch\\-job\\/opt\\:\\/tmp\\/crunch\\-job\\/opt\\:ro --volume=\\/tmp\\/crunch\\-job\\/task\\/compute9\\.1\\.keep\\/by_pdh\\:\\/keep\\:ro --volume=\\/tmp\\/crunch\\-job\\/task\\/compute9\\.1\\.keep\\/tmp\\:\\/keep_tmp --volume=/tmp --env=JOB_PARAMETER_COMMAND\\=ARRAY\\(0x31e5318\\) --env=TASK_UUID\\=wx7k5\\-ot0gb\\-tl4rdb77w4pk8iv --env=TASK_WORK\\=\\/tmp\\/crunch\\-job\\-task\\-work\\/compute9\\.1 --env=TASK_TMPDIR\\=\\/tmp\\/crunch\\-job\\-task\\-work\\/compute9\\.1 --env=TASK_SEQUENCE\\=0 --env=CRUNCH_TMP\\=\\/tmp\\/crunch\\-job --env=CRUNCH_SRC_COMMIT\\=e8b7fbbd711836c59824327364e15ef2253a5848 
--env=CRUNCH_JOB_BIN\\=\\/usr\\/local\\/arvados\\/src\\/services\\/crunch\\/crunch\\-job --env=CRUNCH_GIT_ARCHIVE_HASH\\=1fd7ced65979fbf9a4255a9ba3ef02ca --env=TASK_QSEQUENCE\\=0 --env=CRUNCH_NODE_SLOTS\\=1 --env=JOB_PARAMETER_TASK\\.VWD\\=d51c66a0eec2ce81611932fcaac3d536\\+95 --env=CRUNCH_SRC\\=\\/tmp\\/crunch\\-job\\/src --env=JOB_UUID\\=wx7k5\\-8i9sb\\-lrignmqsulpjnzz --env=TASK_SLOT_NUMBER\\=1 --env=TASK_KEEPMOUNT_TMP\\=\\/keep_tmp --env=JOB_SCRIPT\\=run\\-command --env=ARVADOS_API_HOST\\=wx7k5\\.arvadosapi\\.com --env=CRUNCH_JOB_UUID\\=wx7k5\\-8i9sb\\-lrignmqsulpjnzz --env=ARVADOS_API_TOKEN\\=[...] --env=TASK_KEEPMOUNT\\=\\/keep --env=CRUNCH_SRC_URL\\=\\/var\\/lib\\/arvados\\/internal\\.git --env=CRUNCH_REFRESH_TRIGGER\\=\\/tmp\\/crunch_refresh_trigger --env=TASK_SLOT_NODE\\=compute9 --env=JOB_WORK\\=\\/tmp\\/crunch\\-job\\-work --env=CRUNCH_INSTALL\\=\\/tmp\\/crunch\\-job\\/opt --env=CRUNCH_WORK\\=\\/tmp\\/crunch\\-job\\/work --env=HOME\\=\\/tmp\\/crunch\\-job\\-task\\-work\\/compute9\\.1 d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83  stdbuf --output=0 --error=0 perl - \\/tmp\\/crunch\\-job\\/src\\/crunch_scripts\\/run\\-command']
2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330  status: 0 done, 1 running, 0 todo
2016-03-26_11:09:10 wx7k5-8i9sb-lrignmqsulpjnzz 48330  warning: squeue exit status 256 ()
2016-03-26_11:09:10 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 stderr srun: error: Unable to confirm allocation for job 99: Slurm backup controller in standby mode
2016-03-26_11:09:10 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 stderr srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
2016-03-26_11:09:10 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 child 49162 on compute9.1 exit 1 success=
2016-03-26_11:09:10 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 ERROR: Task process exited 1, but never updated its task record to indicate success and record its output.
Actions #15

Updated by Brett Smith almost 10 years ago

There's good news and bad news here.

The bad news is, as you've noticed, a lot of these failures are happening for different reasons, and we're probably going to have to address each problem separately. I have filed bugs about each case that looks different to me: #8810, #8811, #8812.

The good news is, our logging has gotten good enough that we actually have some new information to work with to tackle these. The reason so many of them seem to trigger the "Could not find a user in container that is not UID 0" message is that this is the first piece of code in crunch-job that requires SLURM and Docker to affirmatively work. If we can't run commands in Docker and report the result all the way back to crunch-job, the check fails. So it's acting as a gate for all kinds of other SLURM and Docker failures. If we didn't have this check, the jobs would still be failing, just for different reasons.

Actions #16

Updated by Joshua Randall almost 10 years ago

N.B. this was also the error I got when docker didn't have enough disk space to create a container.

Actions #17

Updated by Peter Amstutz about 6 years ago

  • Status changed from New to Closed