Bug #8812


[Crunch] squeue checks apparently fail when the backup controller kicks in

Added by Brett Smith almost 10 years ago. Updated about 6 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: -
Target version: -
Story points: -

Description

This job apparently loaded its Docker image successfully:

2016-03-26_11:00:41 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  load docker image: start
2016-03-26_11:00:41 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  stderr starting: ['srun','--nodelist=compute10','/bin/bash','-o','pipefail','-ec',' if ! docker.io images -q --no-trunc --all | grep -qxF d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83; then     arv-get 17b65db74aae73465b5e286d1cdb0e23\\+798\\/d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83\\.tar | docker.io load fi ']
2016-03-26_11:01:23 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  load docker image: exit 0

Then it looks like the backup SLURM controller took over while we were doing Docker capability checks:

2016-03-26_11:01:23 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  check --memory-swap feature: start
2016-03-26_11:01:23 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  stderr starting: ['srun','--nodes=1','docker.io','run','--help']
2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  warning: squeue exit status 256 ()
2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  stderr srun: error: Unable to confirm allocation for job 100: Slurm backup controller in standby mode
2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  stderr srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  check --memory-swap feature: exit 1

This caused later squeue checks to go awry:

2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  check whether default user is UID 0: start
2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  stderr starting: ['srun','--nodes=1','/bin/sh','-ec','docker.io run   d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83 id --user']
2016-03-26_11:09:10 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  sending 2x signal 2 to pid 50593
2016-03-26_11:09:15 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  sending 2x signal 15 to pid 50593
2016-03-26_11:09:20 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  sending 2x signal 9 to pid 50593
2016-03-26_11:09:20 wx7k5-8i9sb-dwycsrmvrnh5ec1 48606  check whether default user is UID 0: exit 0 with signal 9

There are a lot of possible problems here. Maybe our backup controller setup has a bug, or isn't fully baked? This needs more investigation.

Actions #1

Updated by Brett Smith almost 10 years ago

Possibly related: we saw a similar error when a job tried to run its first task, and that caused an immediate abort.

2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 job_task wx7k5-ot0gb-tl4rdb77w4pk8iv
2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 child 49162 started on compute9.1
2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 stderr starting: ['srun','--nodelist=compute9','-n1','-c1','-N1','-D','/tmp','--job-name=wx7k5-8i9sb-lrignmqsulpjnzz.0.49162','bash','-c','if [ -e \\/tmp\\/crunch\\-job\\/task\\/compute9\\.1 ]; then rm -rf \\/tmp\\/crunch\\-job\\/task\\/compute9\\.1; fi; mkdir -p \\/tmp\\/crunch\\-job \\/tmp\\/crunch\\-job\\/work \\/tmp\\/crunch\\-job\\/task\\/compute9\\.1 \\/tmp\\/crunch\\-job\\/task\\/compute9\\.1\\.keep && cd \\/tmp\\/crunch\\-job && MEM=$(awk \'($1 == "MemTotal:"){print $2}\' </proc/meminfo) && SWAP=$(awk \'($1 == "SwapTotal:"){print $2}\' </proc/meminfo) && MEMLIMIT=$(( ($MEM * 95) / (1 * 100) )) && let SWAPLIMIT=$MEMLIMIT+$SWAP && exec arv-mount --read-write --mount-by-pdh=by_pdh --mount-tmp=tmp --crunchstat-interval=10 --allow-other  \\/tmp\\/crunch\\-job\\/task\\/compute9\\.1\\.keep --exec crunchstat -cgroup-root=\\/sys\\/fs\\/cgroup -cgroup-parent=docker -cgroup-cid=/tmp/crunch-job/wx7k5-ot0gb-tl4rdb77w4pk8iv-0.cid -poll=10000 docker.io run  --name=wx7k5-ot0gb-tl4rdb77w4pk8iv-0 --attach=stdout --attach=stderr --attach=stdin -i \\-\\-user\\=crunch --cidfile=/tmp/crunch-job/wx7k5-ot0gb-tl4rdb77w4pk8iv-0.cid --sig-proxy --memory=${MEMLIMIT}k --memory-swap=${SWAPLIMIT}k --volume=\\/tmp\\/crunch\\-job\\/src\\:\\/tmp\\/crunch\\-job\\/src\\:ro --volume=\\/tmp\\/crunch\\-job\\/opt\\:\\/tmp\\/crunch\\-job\\/opt\\:ro --volume=\\/tmp\\/crunch\\-job\\/task\\/compute9\\.1\\.keep\\/by_pdh\\:\\/keep\\:ro --volume=\\/tmp\\/crunch\\-job\\/task\\/compute9\\.1\\.keep\\/tmp\\:\\/keep_tmp --volume=/tmp --env=JOB_PARAMETER_COMMAND\\=ARRAY\\(0x31e5318\\) --env=TASK_UUID\\=wx7k5\\-ot0gb\\-tl4rdb77w4pk8iv --env=TASK_WORK\\=\\/tmp\\/crunch\\-job\\-task\\-work\\/compute9\\.1 --env=TASK_TMPDIR\\=\\/tmp\\/crunch\\-job\\-task\\-work\\/compute9\\.1 --env=TASK_SEQUENCE\\=0 --env=CRUNCH_TMP\\=\\/tmp\\/crunch\\-job --env=CRUNCH_SRC_COMMIT\\=e8b7fbbd711836c59824327364e15ef2253a5848 
--env=CRUNCH_JOB_BIN\\=\\/usr\\/local\\/arvados\\/src\\/services\\/crunch\\/crunch\\-job --env=CRUNCH_GIT_ARCHIVE_HASH\\=1fd7ced65979fbf9a4255a9ba3ef02ca --env=TASK_QSEQUENCE\\=0 --env=CRUNCH_NODE_SLOTS\\=1 --env=JOB_PARAMETER_TASK\\.VWD\\=d51c66a0eec2ce81611932fcaac3d536\\+95 --env=CRUNCH_SRC\\=\\/tmp\\/crunch\\-job\\/src --env=JOB_UUID\\=wx7k5\\-8i9sb\\-lrignmqsulpjnzz --env=TASK_SLOT_NUMBER\\=1 --env=TASK_KEEPMOUNT_TMP\\=\\/keep_tmp --env=JOB_SCRIPT\\=run\\-command --env=ARVADOS_API_HOST\\=wx7k5\\.arvadosapi\\.com --env=CRUNCH_JOB_UUID\\=wx7k5\\-8i9sb\\-lrignmqsulpjnzz --env=ARVADOS_API_TOKEN\\=[...] --env=TASK_KEEPMOUNT\\=\\/keep --env=CRUNCH_SRC_URL\\=\\/var\\/lib\\/arvados\\/internal\\.git --env=CRUNCH_REFRESH_TRIGGER\\=\\/tmp\\/crunch_refresh_trigger --env=TASK_SLOT_NODE\\=compute9 --env=JOB_WORK\\=\\/tmp\\/crunch\\-job\\-work --env=CRUNCH_INSTALL\\=\\/tmp\\/crunch\\-job\\/opt --env=CRUNCH_WORK\\=\\/tmp\\/crunch\\-job\\/work --env=HOME\\=\\/tmp\\/crunch\\-job\\-task\\-work\\/compute9\\.1 d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83  stdbuf --output=0 --error=0 perl - \\/tmp\\/crunch\\-job\\/src\\/crunch_scripts\\/run\\-command']
2016-03-26_11:01:12 wx7k5-8i9sb-lrignmqsulpjnzz 48330  status: 0 done, 1 running, 0 todo
2016-03-26_11:09:10 wx7k5-8i9sb-lrignmqsulpjnzz 48330  warning: squeue exit status 256 ()
2016-03-26_11:09:10 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 stderr srun: error: Unable to confirm allocation for job 99: Slurm backup controller in standby mode
2016-03-26_11:09:10 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 stderr srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
2016-03-26_11:09:10 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 child 49162 on compute9.1 exit 1 success=
2016-03-26_11:09:10 wx7k5-8i9sb-lrignmqsulpjnzz 48330 0 ERROR: Task process exited 1, but never updated its task record to indicate success and record its output.

It's possible these jobs would've failed differently without a backup controller.

Actions #2

Updated by Brett Smith almost 10 years ago

I'm having trouble finding solid documentation about the failover model for SLURM's backup controller. My current theory is that clients contact the backup controller whenever an attempt to contact the primary times out. Contacting the backup controller will not get the desired results if the primary is actually still alive (e.g., the client's connection failed for unrelated network reasons), or if the backup controller hasn't noticed the primary's failure yet. I noticed the SlurmctldTimeout setting in slurm.conf:

The interval, in seconds, that the backup controller waits for the primary controller to respond before assuming control. The default value is 120 seconds. May not exceed 65533.

It seems like clients could have to wait at least that long before the backup controller actually takes control.
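For context, a failover pair in a slurm.conf of that era would be configured with something like the following (hostnames are illustrative, not our cluster's actual config; ControlMachine/BackupController are the option names from that SLURM generation):

```
# Primary and backup slurmctld hosts (hypothetical hostnames).
ControlMachine=slurm-primary
BackupController=slurm-backup
# Seconds the backup waits for the primary before assuming control.
SlurmctldTimeout=120
```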

These problems would be less noticeable in a situation where you're submitting jobs to SLURM's own queue, but since we treat SLURM more like an API server that gives us immediate results, we have to be able to handle situations where no controller is available for a brief period of time.
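One way to ride out that window would be to retry SLURM client calls for roughly SlurmctldTimeout before treating a failure as fatal. A minimal shell sketch, where the `retry_slurm` wrapper, the `SLURMCTLD_TIMEOUT` variable, and the timings are all hypothetical (this is not crunch-job's actual behavior):

```shell
#!/bin/sh
# Hypothetical sketch: retry a SLURM client command until it succeeds or
# until SlurmctldTimeout (default 120s) has elapsed, giving a standby
# backup controller time to assume control.
retry_slurm() {
    timeout=${SLURMCTLD_TIMEOUT:-120}   # mirror slurm.conf's SlurmctldTimeout
    interval=5
    elapsed=0
    until "$@"; do
        elapsed=$((elapsed + interval))
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "retry_slurm: giving up after ${elapsed}s: $*" >&2
            return 1
        fi
        sleep "$interval"
    done
}

# Usage would look like:
#   retry_slurm squeue --noheader --format='%i %T'
```

The point is only that transient "backup controller in standby mode" errors become retries instead of immediate job failures; the final timeout still surfaces a real outage.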

Actions #3

Updated by Peter Amstutz about 6 years ago

  • Status changed from New to Closed