Bug #8219
[Crunch] SLURM doesn't run anything, complaining "Job credential replayed"
Status: Closed
Description
I've noticed this often happens on a job that has multiple nodes associated with it. An example: https://workbench.qr2hi.arvadosapi.com/jobs/qr2hi-8i9sb-vkm8liup3uhp55k
Error message from logs:
2016-01-16_01:21:43 salloc: Granted job allocation 1068
2016-01-16_01:21:45 5068 Sanity check is `docker.io ps -q`
2016-01-16_01:21:45 5068 starting: ['srun','--nodes=6','--ntasks-per-node=1','docker.io','ps','-q']
2016-01-16_01:21:45 srun: error: Task launch for 1068.0 failed on node compute79: Job credential replayed
2016-01-16_01:21:45 srun: error: Task launch for 1068.0 failed on node compute23: Job credential replayed
2016-01-16_01:21:45 srun: error: Application launch failed: Job credential replayed
2016-01-16_01:21:45 srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-01-16_01:21:45 srun: error: compute28: task 2: Killed
2016-01-16_01:21:45 slurmd[compute28]: error: *** STEP 1068.0 KILLED AT 2016-01-16T01:21:46 WITH SIGNAL 9 ***
2016-01-16_01:21:47 srun: error: Timed out waiting for job step to complete
2016-01-16_01:21:47 5068 Sanity check failed: 1
2016-01-16_01:21:47 salloc: Relinquishing job allocation 1068
2016-01-16_01:24:31 salloc: Granted job allocation 1069
2016-01-16_01:24:31 6768 Sanity check is `docker.io ps -q`
2016-01-16_01:24:31 6768 starting: ['srun','--nodes=6','--ntasks-per-node=1','docker.io','ps','-q']
2016-01-16_01:24:31 srun: error: Task launch for 1069.0 failed on node compute79: Job credential replayed
2016-01-16_01:24:31 srun: error: Task launch for 1069.0 failed on node compute23: Job credential replayed
2016-01-16_01:24:31 srun: error: Application launch failed: Job credential replayed
2016-01-16_01:24:31 srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-01-16_01:24:31 slurmd[compute28]: error: *** STEP 1069.0 KILLED AT 2016-01-16T01:24:31 WITH SIGNAL 9 ***
2016-01-16_01:24:31 slurmd[compute0]: error: *** STEP 1069.0 KILLED AT 2016-01-16T01:24:31 WITH SIGNAL 9 ***
2016-01-16_01:24:31 srun: error: compute28: task 2: Killed
2016-01-16_01:24:31 slurmd[compute36]: error: *** STEP 1069.0 KILLED AT 2016-01-16T01:24:30 WITH SIGNAL 9 ***
2016-01-16_01:24:31 slurmd[compute79]: error: *** STEP 1069.0 KILLED AT 2016-01-16T01:24:31 WITH SIGNAL 9 ***
2016-01-16_01:24:33 srun: error: Timed out waiting for job step to complete
2016-01-16_01:24:33 6768 Sanity check OK
2016-01-16_01:24:35 qr2hi-8i9sb-vkm8liup3uhp55k 6768 running from /usr/local/arvados/src/sdk/cli/bin/crunch-job with arvados-cli Gem version(s) 0.1.20151023190001
2016-01-16_01:24:35 qr2hi-8i9sb-vkm8liup3uhp55k 6768 check slurm allocation
2016-01-16_01:24:36 qr2hi-8i9sb-vkm8liup3uhp55k 6768 node compute0 - 1 slots
2016-01-16_01:24:36 qr2hi-8i9sb-vkm8liup3uhp55k 6768 node compute23 - 1 slots
2016-01-16_01:24:36 qr2hi-8i9sb-vkm8liup3uhp55k 6768 node compute28 - 1 slots
2016-01-16_01:24:36 qr2hi-8i9sb-vkm8liup3uhp55k 6768 node compute36 - 1 slots
2016-01-16_01:24:36 qr2hi-8i9sb-vkm8liup3uhp55k 6768 node compute38 - 1 slots
2016-01-16_01:24:36 qr2hi-8i9sb-vkm8liup3uhp55k 6768 node compute79 - 1 slots
2016-01-16_01:24:36 qr2hi-8i9sb-vkm8liup3uhp55k 6768 start
2016-01-16_01:24:36 qr2hi-8i9sb-vkm8liup3uhp55k 6768 Clean work dirs
2016-01-16_01:24:36 starting: ['srun','--nodelist=compute0,compute23,compute28,compute36,compute38,compute79','-D','/tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") == 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
2016-01-16_01:24:36 srun: error: Task launch for 1069.1 failed on node compute38: Job credential replayed
2016-01-16_01:24:36 srun: error: Task launch for 1069.1 failed on node compute23: Job credential replayed
2016-01-16_01:24:36 srun: error: Application launch failed: Job credential replayed
2016-01-16_01:24:36 srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-01-16_01:24:36 srun: error: compute28: task 2: Killed
2016-01-16_01:24:36 srun: error: compute36: task 3: Killed
2016-01-16_01:24:36 srun: error: compute0: task 0: Killed
2016-01-16_01:24:36 srun: error: compute79: task 5: Killed
2016-01-16_01:24:36 slurmd[compute79]: error: *** STEP 1069.1 KILLED AT 2016-01-16T01:24:35 WITH SIGNAL 9 ***
2016-01-16_01:24:36 slurmd[compute28]: error: *** STEP 1069.1 KILLED AT 2016-01-16T01:24:36 WITH SIGNAL 9 ***
2016-01-16_01:24:36 slurmd[compute0]: error: *** STEP 1069.1 KILLED AT 2016-01-16T01:24:35 WITH SIGNAL 9 ***
2016-01-16_01:24:36 slurmd[compute36]: error: *** STEP 1069.1 KILLED AT 2016-01-16T01:24:35 WITH SIGNAL 9 ***
2016-01-16_01:24:36 slurmd[compute79]: error: *** STEP 1069.1 KILLED AT 2016-01-16T01:24:35 WITH SIGNAL 9 ***
2016-01-16_01:24:37 srun: error: Timed out waiting for job step to complete
2016-01-16_01:24:37 qr2hi-8i9sb-vkm8liup3uhp55k 6768 Clean work dirs: exit 1
2016-01-16_01:24:37 salloc: Relinquishing job allocation 1069
2016-01-16_01:24:37 close failed in file object destructor:
2016-01-16_01:24:37 sys.excepthook is missing
2016-01-16_01:24:37 lost sys.stderr
2016-01-16_01:28:36 salloc: Granted job allocation 1071
2016-01-16_01:28:36 9503 Sanity check is `docker.io ps -q`
2016-01-16_01:28:36 9503 starting: ['srun','--nodes=6','--ntasks-per-node=1','docker.io','ps','-q']
2016-01-16_01:28:36 srun: error: Task launch for 1071.0 failed on node compute25: Job credential replayed
2016-01-16_01:28:36 srun: error: Task launch for 1071.0 failed on node compute0: Job credential replayed
2016-01-16_01:28:36 srun: error: Application launch failed: Job credential replayed
2016-01-16_01:28:36 srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-01-16_01:28:36 srun: error: compute36: task 4: Killed
2016-01-16_01:28:36 slurmd[compute36]: error: *** STEP 1071.0 KILLED AT 2016-01-16T01:28:36 WITH SIGNAL 9 ***
2016-01-16_01:28:38 srun: error: Timed out waiting for job step to complete
2016-01-16_01:28:38 9503 Sanity check failed: 1
2016-01-16_01:28:38 salloc: Relinquishing job allocation 1071
Updated by Brett Smith about 10 years ago
- Subject changed from Job hangs in pending state forever to [Crunch] SLURM doesn't run anything, complaining "Job credential replayed"
Updated by Brett Smith about 10 years ago
- Target version set to Arvados Future Sprints
Updated by Tom Clegg about 10 years ago
https://computing.llnl.gov/linux/slurm/faq.html#cred_replay
"This solution to this problem is to cold-start all slurmd daemons whenever the slurmctld daemon is cold-started."
Is this plausible?
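For reference, the coordinated cold-start the FAQ describes would look roughly like this. This is a sketch assuming sysvinit-style slurm init scripts (the style the FAQ itself references); adjust for however our images actually manage the daemons:

# On the controller: cold-start slurmctld, discarding its saved state
# (same effect as "slurmctld -c"; job IDs restart at FirstJobId).
/etc/init.d/slurm startclean
# Then on every compute node: cold-start slurmd too (slurmd also takes
# a -c flag), so its credential state matches the controller's.
/etc/init.d/slurm startclean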
Updated by Tom Clegg about 10 years ago
"If the slurmctld daemon is cold-started (with the "-c" option or "/etc/init.d/slurm startclean"), it starts job ID values over based upon FirstJobId"
Maybe we have some script that does a cold-start when it should be doing a warm-start?
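If so, it should show up in the scripts themselves. A quick way to audit (the search paths here are guesses; point it at wherever our boot/deploy scripts actually live):

grep -rn -e 'startclean' -e 'slurmctld -c' -e 'slurmd -c' /etc/init.d /usr/local/sbin 2>/dev/null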
Updated by Tom Clegg about 10 years ago
Looks like the sanity check should have failed here:
2016-01-16_01:24:31 6768 starting: ['srun','--nodes=6','--ntasks-per-node=1','docker.io','ps','-q']
2016-01-16_01:24:31 srun: error: Task launch for 1069.0 failed on node compute79: Job credential replayed
2016-01-16_01:24:31 srun: error: Task launch for 1069.0 failed on node compute23: Job credential replayed
2016-01-16_01:24:31 srun: error: Application launch failed: Job credential replayed
2016-01-16_01:24:31 srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-01-16_01:24:31 slurmd[compute28]: error: *** STEP 1069.0 KILLED AT 2016-01-16T01:24:31 WITH SIGNAL 9 ***
2016-01-16_01:24:31 slurmd[compute0]: error: *** STEP 1069.0 KILLED AT 2016-01-16T01:24:31 WITH SIGNAL 9 ***
2016-01-16_01:24:31 srun: error: compute28: task 2: Killed
2016-01-16_01:24:31 slurmd[compute36]: error: *** STEP 1069.0 KILLED AT 2016-01-16T01:24:30 WITH SIGNAL 9 ***
2016-01-16_01:24:31 slurmd[compute79]: error: *** STEP 1069.0 KILLED AT 2016-01-16T01:24:31 WITH SIGNAL 9 ***
2016-01-16_01:24:33 srun: error: Timed out waiting for job step to complete
2016-01-16_01:24:33 6768 Sanity check OK
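The check apparently trusts srun's exit status alone, and here the step ended in a way that read as success despite the launch errors. A stricter version would also treat srun's error output as fatal. An illustrative shell sketch (the real check lives in crunch-job, which is Perl):

# Capture srun's combined output so we can inspect it as well as the exit code.
out=$(srun --nodes=6 --ntasks-per-node=1 docker.io ps -q 2>&1)
rc=$?
printf '%s\n' "$out"
# Fail on a nonzero exit OR any "srun: error: ..." line, e.g.
# "srun: error: Task launch for 1069.0 failed ... Job credential replayed".
if [ "$rc" -ne 0 ] || printf '%s\n' "$out" | grep -q '^srun: error:'; then
    echo "Sanity check failed: $rc" >&2
    exit 1
fi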
Updated by Brett Smith about 10 years ago
Tom Clegg wrote:
Looks like the sanity check should have failed here:
That sounds like a nice improvement, but I'm not sure it fixes the issue as originally reported, where the job stays in the pending state forever. I note that the sanity check failed both before and after the attempt you pasted; apparently neither failure was sufficient to move the job closer to an end state.
Updated by Brett Smith about 10 years ago
Looking at our ops stuff, I don't see anything that ever does a clean start of the SLURM controller. There's a lot of indirection here, of course, so I might be missing something, but everywhere I know to look looks good.
We've previously had memory contention on our manage nodes because of Node Manager; we're improving that on the Node Manager end. This could have happened if slurmctld was stopped without a chance to checkpoint its state, e.g., because of a power failure, or because it got SIGKILL from the OOM killer.
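If that's what happened, the kernel log on the manage node should show it. Something along these lines (log paths vary by distro, so treat this as a sketch):

# Look for evidence that the kernel OOM-killed slurmctld around the incident:
grep -i 'killed process' /var/log/kern.log* /var/log/syslog* | grep slurmctld
dmesg | grep -i -e oom -e slurmctld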
I'm inclined to treat this as an ops ticket that we're incrementally making progress on through other stories.
Updated by Brett Smith about 10 years ago
- Target version deleted (Arvados Future Sprints)