Bug #8219


[Crunch] SLURM doesn't run anything, complaining "Job credential replayed"

Added by Sarah Guthrie about 10 years ago. Updated about 6 years ago.

Status:
Closed
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
-
Story points:
-

Description

I've noticed this often happens on a job that has multiple nodes associated with it. An example: https://workbench.qr2hi.arvadosapi.com/jobs/qr2hi-8i9sb-vkm8liup3uhp55k

Error message from logs:

2016-01-16_01:21:43 salloc: Granted job allocation 1068 
2016-01-16_01:21:45 5068 Sanity check is `docker.io ps -q` 
2016-01-16_01:21:45 5068 starting: ['srun','--nodes=6','--ntasks-per-node=1','docker.io','ps','-q'] 
2016-01-16_01:21:45 srun: error: Task launch for 1068.0 failed on node compute79: Job credential replayed 
2016-01-16_01:21:45 srun: error: Task launch for 1068.0 failed on node compute23: Job credential replayed 
2016-01-16_01:21:45 srun: error: Application launch failed: Job credential replayed 
2016-01-16_01:21:45 srun: Job step aborted: Waiting up to 2 seconds for job step to finish. 
2016-01-16_01:21:45 srun: error: compute28: task 2: Killed 
2016-01-16_01:21:45 slurmd[compute28]: error: *** STEP 1068.0 KILLED AT 2016-01-16T01:21:46 WITH SIGNAL 9 *** 
2016-01-16_01:21:47 srun: error: Timed out waiting for job step to complete 
2016-01-16_01:21:47 5068 Sanity check failed: 1 
2016-01-16_01:21:47 salloc: Relinquishing job allocation 1068 
2016-01-16_01:24:31 salloc: Granted job allocation 1069 
2016-01-16_01:24:31 6768 Sanity check is `docker.io ps -q` 
2016-01-16_01:24:31 6768 starting: ['srun','--nodes=6','--ntasks-per-node=1','docker.io','ps','-q'] 
2016-01-16_01:24:31 srun: error: Task launch for 1069.0 failed on node compute79: Job credential replayed 
2016-01-16_01:24:31 srun: error: Task launch for 1069.0 failed on node compute23: Job credential replayed 
2016-01-16_01:24:31 srun: error: Application launch failed: Job credential replayed 
2016-01-16_01:24:31 srun: Job step aborted: Waiting up to 2 seconds for job step to finish. 
2016-01-16_01:24:31 slurmd[compute28]: error: *** STEP 1069.0 KILLED AT 2016-01-16T01:24:31 WITH SIGNAL 9 *** 
2016-01-16_01:24:31 slurmd[compute0]: error: *** STEP 1069.0 KILLED AT 2016-01-16T01:24:31 WITH SIGNAL 9 *** 
2016-01-16_01:24:31 srun: error: compute28: task 2: Killed 
2016-01-16_01:24:31 slurmd[compute36]: error: *** STEP 1069.0 KILLED AT 2016-01-16T01:24:30 WITH SIGNAL 9 *** 
2016-01-16_01:24:31 slurmd[compute79]: error: *** STEP 1069.0 KILLED AT 2016-01-16T01:24:31 WITH SIGNAL 9 *** 
2016-01-16_01:24:33 srun: error: Timed out waiting for job step to complete 
2016-01-16_01:24:33 6768 Sanity check OK 
2016-01-16_01:24:35 qr2hi-8i9sb-vkm8liup3uhp55k 6768 running from /usr/local/arvados/src/sdk/cli/bin/crunch-job with arvados-cli Gem version(s) 0.1.20151023190001 
2016-01-16_01:24:35 qr2hi-8i9sb-vkm8liup3uhp55k 6768 check slurm allocation 
2016-01-16_01:24:36 qr2hi-8i9sb-vkm8liup3uhp55k 6768 node compute0 - 1 slots 
2016-01-16_01:24:36 qr2hi-8i9sb-vkm8liup3uhp55k 6768 node compute23 - 1 slots 
2016-01-16_01:24:36 qr2hi-8i9sb-vkm8liup3uhp55k 6768 node compute28 - 1 slots 
2016-01-16_01:24:36 qr2hi-8i9sb-vkm8liup3uhp55k 6768 node compute36 - 1 slots 
2016-01-16_01:24:36 qr2hi-8i9sb-vkm8liup3uhp55k 6768 node compute38 - 1 slots 
2016-01-16_01:24:36 qr2hi-8i9sb-vkm8liup3uhp55k 6768 node compute79 - 1 slots 
2016-01-16_01:24:36 qr2hi-8i9sb-vkm8liup3uhp55k 6768 start 
2016-01-16_01:24:36 qr2hi-8i9sb-vkm8liup3uhp55k 6768 Clean work dirs 
2016-01-16_01:24:36 starting: ['srun','--nodelist=compute0,compute23,compute28,compute36,compute38,compute79','-D','/tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") == 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid'] 
2016-01-16_01:24:36 srun: error: Task launch for 1069.1 failed on node compute38: Job credential replayed 
2016-01-16_01:24:36 srun: error: Task launch for 1069.1 failed on node compute23: Job credential replayed 
2016-01-16_01:24:36 srun: error: Application launch failed: Job credential replayed 
2016-01-16_01:24:36 srun: Job step aborted: Waiting up to 2 seconds for job step to finish. 
2016-01-16_01:24:36 srun: error: compute28: task 2: Killed 
2016-01-16_01:24:36 srun: error: compute36: task 3: Killed 
2016-01-16_01:24:36 srun: error: compute0: task 0: Killed 
2016-01-16_01:24:36 srun: error: compute79: task 5: Killed 
2016-01-16_01:24:36 slurmd[compute79]: error: *** STEP 1069.1 KILLED AT 2016-01-16T01:24:35 WITH SIGNAL 9 *** 
2016-01-16_01:24:36 slurmd[compute28]: error: *** STEP 1069.1 KILLED AT 2016-01-16T01:24:36 WITH SIGNAL 9 *** 
2016-01-16_01:24:36 slurmd[compute0]: error: *** STEP 1069.1 KILLED AT 2016-01-16T01:24:35 WITH SIGNAL 9 *** 
2016-01-16_01:24:36 slurmd[compute36]: error: *** STEP 1069.1 KILLED AT 2016-01-16T01:24:35 WITH SIGNAL 9 *** 
2016-01-16_01:24:36 slurmd[compute79]: error: *** STEP 1069.1 KILLED AT 2016-01-16T01:24:35 WITH SIGNAL 9 *** 
2016-01-16_01:24:37 srun: error: Timed out waiting for job step to complete 
2016-01-16_01:24:37 qr2hi-8i9sb-vkm8liup3uhp55k 6768 Clean work dirs: exit 1 
2016-01-16_01:24:37 salloc: Relinquishing job allocation 1069 
2016-01-16_01:24:37 close failed in file object destructor: 
2016-01-16_01:24:37 sys.excepthook is missing 
2016-01-16_01:24:37 lost sys.stderr 
2016-01-16_01:28:36 salloc: Granted job allocation 1071 
2016-01-16_01:28:36 9503 Sanity check is `docker.io ps -q` 
2016-01-16_01:28:36 9503 starting: ['srun','--nodes=6','--ntasks-per-node=1','docker.io','ps','-q'] 
2016-01-16_01:28:36 srun: error: Task launch for 1071.0 failed on node compute25: Job credential replayed 
2016-01-16_01:28:36 srun: error: Task launch for 1071.0 failed on node compute0: Job credential replayed 
2016-01-16_01:28:36 srun: error: Application launch failed: Job credential replayed 
2016-01-16_01:28:36 srun: Job step aborted: Waiting up to 2 seconds for job step to finish. 
2016-01-16_01:28:36 srun: error: compute36: task 4: Killed 
2016-01-16_01:28:36 slurmd[compute36]: error: *** STEP 1071.0 KILLED AT 2016-01-16T01:28:36 WITH SIGNAL 9 *** 
2016-01-16_01:28:38 srun: error: Timed out waiting for job step to complete 
2016-01-16_01:28:38 9503 Sanity check failed: 1 
2016-01-16_01:28:38 salloc: Relinquishing job allocation 1071

Actions #1

Updated by Brett Smith about 10 years ago

  • Subject changed from Job hangs in pending state forever to [Crunch] SLURM doesn't run anything, complaining "Job credential replayed"
Actions #2

Updated by Brett Smith about 10 years ago

  • Target version set to Arvados Future Sprints
Actions #3

Updated by Tom Clegg about 10 years ago

https://computing.llnl.gov/linux/slurm/faq.html#cred_replay

"This solution to this problem is to cold-start all slurmd daemons whenever the slurmctld daemon is cold-started."

Is this plausible?
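
If so, the recovery procedure would be roughly the following. This is only a sketch based on the FAQ and a Debian-style install; the init script name and the node list are assumptions, not taken from our actual deployment:

# On the SLURM controller: cold-start slurmctld (restarts job IDs from FirstJobId)
/etc/init.d/slurm startclean        # equivalent to slurmctld -c

# Per the FAQ, every slurmd must then also be cold-started so stale
# credential state is discarded (node names here are placeholders):
for node in compute0 compute23 compute28 compute36 compute38 compute79; do
    ssh "$node" '/etc/init.d/slurm startclean'    # equivalent to slurmd -c
done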

Actions #4

Updated by Tom Clegg about 10 years ago

"If the slurmctld daemon is cold-started (with the "-c" option or "/etc/init.d/slurm startclean"), it starts job ID values over based upon FirstJobId"

Maybe we have some script that does a cold-start when it should be doing a warm-start?
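
One quick way to check, as a sketch only (the paths are guesses and would need to match wherever our init scripts and slurm.conf actually live):

# Look for anything that invokes a cold start instead of a plain restart:
grep -rn -e 'startclean' -e 'slurmctld -c' /etc/init.d /etc/default /etc/cron.d 2>/dev/null

# FirstJobId controls where job IDs restart after a cold start, which is
# what makes credential replay possible:
grep -i 'FirstJobId' /etc/slurm-llnl/slurm.conf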

Actions #5

Updated by Tom Clegg about 10 years ago

Looks like the sanity check should have failed here:

2016-01-16_01:24:31 6768 starting: ['srun','--nodes=6','--ntasks-per-node=1','docker.io','ps','-q'] 
2016-01-16_01:24:31 srun: error: Task launch for 1069.0 failed on node compute79: Job credential replayed 
2016-01-16_01:24:31 srun: error: Task launch for 1069.0 failed on node compute23: Job credential replayed 
2016-01-16_01:24:31 srun: error: Application launch failed: Job credential replayed 
2016-01-16_01:24:31 srun: Job step aborted: Waiting up to 2 seconds for job step to finish. 
2016-01-16_01:24:31 slurmd[compute28]: error: *** STEP 1069.0 KILLED AT 2016-01-16T01:24:31 WITH SIGNAL 9 *** 
2016-01-16_01:24:31 slurmd[compute0]: error: *** STEP 1069.0 KILLED AT 2016-01-16T01:24:31 WITH SIGNAL 9 *** 
2016-01-16_01:24:31 srun: error: compute28: task 2: Killed 
2016-01-16_01:24:31 slurmd[compute36]: error: *** STEP 1069.0 KILLED AT 2016-01-16T01:24:30 WITH SIGNAL 9 *** 
2016-01-16_01:24:31 slurmd[compute79]: error: *** STEP 1069.0 KILLED AT 2016-01-16T01:24:31 WITH SIGNAL 9 *** 
2016-01-16_01:24:33 srun: error: Timed out waiting for job step to complete 
2016-01-16_01:24:33 6768 Sanity check OK 
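The check could treat any srun error output as a failure instead of relying on the exit status alone. A minimal sketch of the idea in shell (the real sanity check lives in crunch-job, which is Perl, so this is illustrative only):

# Fail the sanity check if srun exits nonzero OR prints any "error:" lines,
# e.g. "Job credential replayed", even when the step later "completes".
if ! output=$(srun --nodes=6 --ntasks-per-node=1 docker.io ps -q 2>&1) \
   || grep -q 'error:' <<<"$output"; then
    echo "Sanity check failed" >&2
    exit 1
fi
echo "Sanity check OK"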
Actions #6

Updated by Brett Smith about 10 years ago

Tom Clegg wrote:

Looks like the sanity check should have failed here:

That sounds like a nice improvement, but I'm not sure it fixes the issue as originally reported, where the job stays in the pending state forever. I note that a sanity check did fail both before and after the one you pasted; apparently neither failure was enough to move the job toward an end state.

Actions #7

Updated by Brett Smith about 10 years ago

Looking at our ops stuff, I don't see anything that ever does a clean start of the SLURM controller. There's a lot of indirection here, of course, so I might be missing something, but everywhere I know to look looks good.

We've previously had memory contention on our manage nodes because of Node Manager, and we're improving that on the Node Manager end. This could have happened if slurmctld was stopped without a chance to checkpoint its state, e.g. because of a power failure or because the OOM killer sent it SIGKILL.

I'm inclined to treat this as an ops ticket that we're incrementally making progress on through other stories.
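
If we want to confirm that theory after the fact, something along these lines might help. The paths assume a Debian-style slurm-llnl install and are guesses, not checked against our hosts:

# Was slurmctld ever OOM-killed on the manage node?
dmesg | grep -i -E 'out of memory|killed process.*slurm'

# StateSaveLocation holds the job-ID and credential state slurmctld needs
# for a warm start; if it's missing or stale, a restart behaves like a cold start:
grep -i 'StateSaveLocation' /etc/slurm-llnl/slurm.conf
ls -l /var/lib/slurm-llnl/slurmctld/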

Actions #8

Updated by Brett Smith about 10 years ago

  • Target version deleted (Arvados Future Sprints)
Actions #9

Updated by Peter Amstutz about 6 years ago

  • Status changed from New to Closed