Bug #8812 (closed)
[Crunch] squeue checks apparently fail when the backup controller kicks in
Description
This job apparently successfully loaded its Docker image:
2016-03-26_11:00:41 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 load docker image: start
2016-03-26_11:00:41 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 stderr starting: ['srun','--nodelist=compute10','/bin/bash','-o','pipefail','-ec',' if ! docker.io images -q --no-trunc --all | grep -qxF d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83; then arv-get 17b65db74aae73465b5e286d1cdb0e23\\+798\\/d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83\\.tar | docker.io load fi ']
2016-03-26_11:01:23 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 load docker image: exit 0
Then it looks like the backup SLURM controller kicked in while we were doing Docker capability checks:
2016-03-26_11:01:23 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 check --memory-swap feature: start
2016-03-26_11:01:23 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 stderr starting: ['srun','--nodes=1','docker.io','run','--help']
2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 warning: squeue exit status 256 ()
2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 stderr srun: error: Unable to confirm allocation for job 100: Slurm backup controller in standby mode
2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 stderr srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 check --memory-swap feature: exit 1
This caused later squeue checks to go awry:
2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 check whether default user is UID 0: start
2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 stderr starting: ['srun','--nodes=1','/bin/sh','-ec','docker.io run d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user']
2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 sending 2x signal 2 to pid 50593
2016-03-26_11:09:15 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 sending 2x signal 15 to pid 50593
2016-03-26_11:09:20 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 sending 2x signal 9 to pid 50593
2016-03-26_11:09:20 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606 check whether default user is UID 0: exit 0 with signal 9
There are a lot of possible problems here. Maybe our backup controller setup has a bug, or isn't fully baked? This needs more investigation.
Updated by Brett Smith almost 10 years ago
Possibly related: we saw a similar error when a job tried to run its first task, and that caused an immediate abort.
2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 job_task wx7k5-ot0gb-tl4rdb77w4pk8iv
2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 child 49162 started on compute9.1
2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 stderr starting: ['srun','--nodelist=compute9','-n1','-c1','-N1','-D','/tmp','--job-name=wx7k5-8i9sb-lrignmqsulpjnzz.0.49162','bash','-c','if [ -e \\/tmp\\/crunch\\-job\\/task\\/compute9\\.1 ]; then rm -rf \\/tmp\\/crunch\\-job\\/task\\/compute9\\.1; fi; mkdir -p \\/tmp\\/crunch\\-job \\/tmp\\/crunch\\-job\\/work \\/tmp\\/crunch\\-job\\/task\\/compute9\\.1 \\/tmp\\/crunch\\-job\\/task\\/compute9\\.1\\.keep && cd \\/tmp\\/crunch\\-job && MEM=$(awk \'($1 == "MemTotal:"){print $2}\' </proc/meminfo) && SWAP=$(awk \'($1 == "SwapTotal:"){print $2}\' </proc/meminfo) && MEMLIMIT=$(( ($MEM * 95) / (1 * 100) )) && let SWAPLIMIT=$MEMLIMIT+$SWAP && exec arv-mount --read-write --mount-by-pdh=by_pdh --mount-tmp=tmp --crunchstat-interval=10 --allow-other \\/tmp\\/crunch\\-job\\/task\\/compute9\\.1\\.keep --exec crunchstat -cgroup-root=\\/sys\\/fs\\/cgroup -cgroup-parent=docker -cgroup-cid=/tmp/crunch-job/wx7k5-ot0gb-tl4rdb77w4pk8iv-0.cid -poll=10000 docker.io run --name=wx7k5-ot0gb-tl4rdb77w4pk8iv-0 --attach=stdout --attach=stderr --attach=stdin -i \\-\\-user\\=crunch --cidfile=/tmp/crunch-job/wx7k5-ot0gb-tl4rdb77w4pk8iv-0.cid --sig-proxy --memory=${MEMLIMIT}k --memory-swap=${SWAPLIMIT}k --volume=\\/tmp\\/crunch\\-job\\/src\\:\\/tmp\\/crunch\\-job\\/src\\:ro --volume=\\/tmp\\/crunch\\-job\\/opt\\:\\/tmp\\/crunch\\-job\\/opt\\:ro --volume=\\/tmp\\/crunch\\-job\\/task\\/compute9\\.1\\.keep\\/by_pdh\\:\\/keep\\:ro --volume=\\/tmp\\/crunch\\-job\\/task\\/compute9\\.1\\.keep\\/tmp\\:\\/keep_tmp --volume=/tmp --env=JOB_PARAMETER_COMMAND\\=ARRAY\\(0x31e5318\\) --env=TASK_UUID\\=wx7k5\\-ot0gb\\-tl4rdb77w4pk8iv --env=TASK_WORK\\=\\/tmp\\/crunch\\-job\\-task\\-work\\/compute9\\.1 --env=TASK_TMPDIR\\=\\/tmp\\/crunch\\-job\\-task\\-work\\/compute9\\.1 --env=TASK_SEQUENCE\\=0 --env=CRUNCH_TMP\\=\\/tmp\\/crunch\\-job --env=CRUNCH_SRC_COMMIT\\=e8b7fbbd711836c59824327364e15ef2253a5848 --env=CRUNCH_JOB_BIN\\=\\/usr\\/local\\/arvados\\/src\\/services\\/crunch\\/crunch\\-job --env=CRUNCH_GIT_ARCHIVE_HASH\\=1fd7ced65979fbf9a4255a9ba3ef02ca --env=TASK_QSEQUENCE\\=0 --env=CRUNCH_NODE_SLOTS\\=1 --env=JOB_PARAMETER_TASK\\.VWD\\=d51c66a0eec2ce81611932fcaac3d536\\+95 --env=CRUNCH_SRC\\=\\/tmp\\/crunch\\-job\\/src --env=JOB_UUID\\=wx7k5\\-8i9sb\\-lrignmqsulpjnzz --env=TASK_SLOT_NUMBER\\=1 --env=TASK_KEEPMOUNT_TMP\\=\\/keep_tmp --env=JOB_SCRIPT\\=run\\-command --env=ARVADOS_API_HOST\\=wx7k5\\.arvadosapi\\.com --env=CRUNCH_JOB_UUID\\=wx7k5\\-8i9sb\\-lrignmqsulpjnzz --env=ARVADOS_API_TOKEN\\=[...] --env=TASK_KEEPMOUNT\\=\\/keep --env=CRUNCH_SRC_URL\\=\\/var\\/lib\\/arvados\\/internal\\.git --env=CRUNCH_REFRESH_TRIGGER\\=\\/tmp\\/crunch_refresh_trigger --env=TASK_SLOT_NODE\\=compute9 --env=JOB_WORK\\=\\/tmp\\/crunch\\-job\\-work --env=CRUNCH_INSTALL\\=\\/tmp\\/crunch\\-job\\/opt --env=CRUNCH_WORK\\=\\/tmp\\/crunch\\-job\\/work --env=HOME\\=\\/tmp\\/crunch\\-job\\-task\\-work\\/compute9\\.1 d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 stdbuf --output=0 --error=0 perl - \\/tmp\\/crunch\\-job\\/src\\/crunch_scripts\\/run\\-command']
2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330 status: 0 done, 1 running, 0 todo
2016-03-26_11:09:10 wx7k5-8i9sb-lrignmqsulpjnzz 48330 warning: squeue exit status 256 ()
2016-03-26_11:09:10 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 stderr srun: error: Unable to confirm allocation for job 99: Slurm backup controller in standby mode
2016-03-26_11:09:10 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 stderr srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
2016-03-26_11:09:10 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 child 49162 on compute9.1 exit 1 success=
2016-03-26_11:09:10 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 ERROR: Task process exited 1, but never updated its task record to indicate success and record its output.
It's possible these jobs would've failed differently without a backup controller.
Updated by Brett Smith almost 10 years ago
I'm having trouble finding solid documentation about the failover model for SLURM's backup controller, but right now my theory is that clients contact the backup controller whenever an attempt to contact the primary times out. Contacting the backup controller will not get the desired results if the primary is actually still alive (e.g., the client's connection failed for unrelated network reasons), or if the backup controller hasn't noticed the primary's failure yet. I noticed the SlurmctldTimeout setting in slurm.conf:
The interval, in seconds, that the backup controller waits for the primary controller to respond before assuming control. The default value is 120 seconds. May not exceed 65533.
It seems like you could have to wait at least that long for the backup controller to take control.
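For context, that setting lives next to the controller declarations in slurm.conf on both controller nodes. A minimal fragment might look like the following; the hostnames and timeout value here are made up for illustration, not taken from our actual configuration:

# slurm.conf (hypothetical excerpt)
ControlMachine=slurmctl-primary      # primary controller
BackupController=slurmctl-backup     # takes over if the primary stops responding
SlurmctldTimeout=120                 # seconds the backup waits before assuming control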
These problems would be less noticeable if you were submitting jobs to SLURM's own queue, but since we treat SLURM more like an API server that gives us immediate results, we have to be able to handle brief periods when no controller is available.
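One way to handle that would be to treat the "backup controller in standby mode" error as transient and keep retrying for up to SlurmctldTimeout before giving up. Here's a rough sketch of the idea, in Python rather than crunch-job's actual Perl; the timeout, retry interval, and the exact stderr string it matches are assumptions for illustration, not taken from our code:

import subprocess
import time

# Assumed to match SlurmctldTimeout in slurm.conf; not read from the cluster.
SLURMCTLD_TIMEOUT = 120
RETRY_INTERVAL = 10

def squeue_with_failover_grace(args):
    """Run squeue, retrying while a controller failover appears to be in progress."""
    deadline = time.monotonic() + SLURMCTLD_TIMEOUT + RETRY_INTERVAL
    while True:
        proc = subprocess.run(["squeue"] + list(args),
                              capture_output=True, text=True)
        if proc.returncode == 0:
            return proc.stdout
        # Only wait out failures that look like "no controller has taken over
        # yet"; anything else (or running past the deadline) is a real error.
        transient = "backup controller in standby mode" in proc.stderr.lower()
        if not transient or time.monotonic() >= deadline:
            raise RuntimeError("squeue failed: " + proc.stderr.strip())
        time.sleep(RETRY_INTERVAL)

if __name__ == "__main__":
    print(squeue_with_failover_grace(["--noheader", "--format=%i %T"]))

The same pattern would apply to the srun-based capability checks: a single failure during the failover window shouldn't be enough to mark the job failed.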