Bug #8869

[Crunch] Job was repeatedly retried on same bad compute node until abandoned

Added by Brett Smith about 8 years ago. Updated almost 3 years ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
-
Start date:
03/31/2016
Due date:
% Done:
0%

Estimated time:
Story points:
-

Description

gatk queue parent job: https://workbench.wx7k5.arvadosapi.com/collections/c224325251c4194e854235c7877ce6f5+89/wx7k5-8i9sb-w0sevdd7ysszqjn.log.txt
child job: wx7k5-8i9sb-f0ygdqygwonamfr

This is the last log, from the logs table:

2016-03-26_20:51:23 salloc: Granted job allocation 228
2016-03-26_20:51:23 13514  Sanity check is `docker.io ps -q`
2016-03-26_20:51:23 13514  sanity check: start
2016-03-26_20:51:23 13514  stderr starting: ['srun','--nodes=1','--ntasks-per-node=1','docker.io','ps','-q']
2016-03-26_20:51:23 13514  stderr srun: error: Task launch for 228.0 failed on node compute15: No such file or directory
2016-03-26_20:51:23 13514  stderr srun: error: Application launch failed: No such file or directory
2016-03-26_20:51:23 13514  stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-03-26_20:51:23 13514  stderr srun: error: Timed out waiting for job step to complete
2016-03-26_20:51:23 13514  sanity check: exit 2
2016-03-26_20:51:23 13514  Sanity check failed: 2
2016-03-26_20:51:23 salloc: Relinquishing job allocation 228

The job was marked failed immediately after this. That's a little surprising: why wasn't it retried as intended?
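
The log above is crunch-job's Docker sanity check: after salloc grants the allocation, it runs `docker.io ps -q` on the node via srun and gives up if that exits nonzero. A minimal illustrative sketch of that flow (crunch-job itself is Perl; this Python approximation and its function name are mine, not the real code):

import subprocess

def docker_sanity_check(timeout=60):
    """Return True if the Docker daemon on the allocated node responds."""
    # Mirrors the srun command shown in the log above.
    cmd = ['srun', '--nodes=1', '--ntasks-per-node=1', 'docker.io', 'ps', '-q']
    try:
        proc = subprocess.run(cmd, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0

if not docker_sanity_check():
    # Corresponds to "Sanity check failed: 2" followed by relinquishing
    # the job allocation.
    raise SystemExit('Sanity check failed')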


Related issues

Related to Arvados - Bug #8810: [Crunch] `docker load` fails to connect to endpoint; srun exits 0 (Resolved, Brett Smith, 04/05/2016)
Copied from Arvados - Bug #8807: [Crunch] crunch-job doesn't save logs when exiting EX_TEMPFAIL (Closed, Brett Smith, 03/31/2016)
#1

Updated by Brett Smith about 8 years ago

  • Assigned To deleted (Brett Smith)
#2

Updated by Brett Smith about 8 years ago

To see all the logs for this job:

arv log list -f '[["event_type", "=", "stderr"], ["object_uuid", "=", "wx7k5-8i9sb-f0ygdqygwonamfr"]]' --order 'created_at asc' | jq -r '.items | map(.properties.text) | join("")'
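
Roughly the same query through the Arvados Python SDK, for anyone scripting this; an untested sketch that assumes ARVADOS_API_HOST and ARVADOS_API_TOKEN are configured as for the arv CLI, and that the log fits in a single page of list results:

import arvados

api = arvados.api('v1')
logs = api.logs().list(
    filters=[['event_type', '=', 'stderr'],
             ['object_uuid', '=', 'wx7k5-8i9sb-f0ygdqygwonamfr']],
    order='created_at asc',
).execute()
# Long jobs may need paging (offset/limit); this prints one page of text.
print(''.join(item['properties']['text'] for item in logs['items']))
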
#3

Updated by Brett Smith about 8 years ago

The system behaved as designed, but I wonder if we need a design improvement.

The original error was basically the same as #8810, except the failure happened early enough that arv-get also reported an error and exited 1:

2016-03-26_20:50:32 wx7k5-8i9sb-f0ygdqygwonamfr 12962  load docker image: start
2016-03-26_20:50:32 wx7k5-8i9sb-f0ygdqygwonamfr 12962  stderr starting: ['srun','--nodelist=compute15','/bin/bash','-o','pipefail','-ec',' if ! docker.io images -q --no-trunc --all | grep -qxF d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83; then     arv-get [redacted hash]\\.tar | docker.io load fi ']
2016-03-26_20:50:53 wx7k5-8i9sb-f0ygdqygwonamfr 12962  stderr Post http:///var/run/docker.sock/v1.20/images/load: EOF.
2016-03-26_20:50:53 wx7k5-8i9sb-f0ygdqygwonamfr 12962  stderr * Are you trying to connect to a TLS-enabled daemon without TLS?
2016-03-26_20:50:53 wx7k5-8i9sb-f0ygdqygwonamfr 12962  stderr * Is your docker daemon up and running?
2016-03-26_20:50:53 wx7k5-8i9sb-f0ygdqygwonamfr 12962  stderr Traceback (most recent call last):
2016-03-26_20:50:53 wx7k5-8i9sb-f0ygdqygwonamfr 12962  stderr   File "/usr/local/bin/arv-get", line 209, in <module>
2016-03-26_20:50:53 wx7k5-8i9sb-f0ygdqygwonamfr 12962  stderr     outfile.write(data)
2016-03-26_20:50:53 wx7k5-8i9sb-f0ygdqygwonamfr 12962  stderr IOError: [Errno 32] Broken pipe
2016-03-26_20:50:53 wx7k5-8i9sb-f0ygdqygwonamfr 12962  stderr srun: error: compute15: task 0: Exited with exit code 1
2016-03-26_20:50:53 wx7k5-8i9sb-f0ygdqygwonamfr 12962  load docker image: exit 1
2016-03-26_20:50:53 salloc: Relinquishing job allocation 223

Here crunch-job exited EX_RETRY_UNLOCKED. Fair enough.

crunch-dispatch tried to run the job three more times. Each time, it was allocated to compute15, and failed in the sanity check as in the description. So crunch-dispatch gave up:

2016-03-26_20:51:23.35132 dispatch: job wx7k5-8i9sb-f0ygdqygwonamfr exceeded node failure retry limit -- giving up
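
For clarity, the retry behavior described here boils down to a per-job counter with no memory of which node failed. A hedged Python sketch (crunch-dispatch is actually Ruby; the names and constant values below are illustrative, not the real implementation):

EX_RETRY_UNLOCKED = 93          # placeholder value, just for the sketch
NODE_FAILURE_RETRY_LIMIT = 3    # "tried to run the job three more times"

attempts = {}                   # job UUID -> number of failed attempts

def handle_crunch_job_exit(job_uuid, exit_code, rerun_job, fail_job):
    if exit_code != EX_RETRY_UNLOCKED:
        return
    attempts[job_uuid] = attempts.get(job_uuid, 0) + 1
    if attempts[job_uuid] > NODE_FAILURE_RETRY_LIMIT:
        # "exceeded node failure retry limit -- giving up"
        fail_job(job_uuid)
    else:
        # Nothing here excludes the bad node, so SLURM is free to hand the
        # job back to compute15 on every retry.
        rerun_job(job_uuid)
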
#4

Updated by Brett Smith about 8 years ago

  • Subject changed from "[Crunch] crunch-job exited TEMPFAIL, but job was not retried" to "[Crunch] Job was repeatedly retried on same bad compute node until abandoned"
#5

Updated by Brett Smith about 8 years ago

Idea: Give crunch-job a new exit code that specifically indicates the sanity check failed. crunch-dispatch should recognize this exit code and blacklist the allocated compute node(s) for a little while (a configurable amount of time?).

Right now this doesn't seem super high priority, since it's only happened literally once. We're going to wait a little bit and watch out for recurrences.
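
A hypothetical sketch of that idea, to make it concrete (every name, exit code, and duration below is made up, not existing Arvados behavior): the dispatcher records nodes that caused a sanity-check exit and keeps them out of scheduling for a configurable window.

import time

EX_SANITY_CHECK_FAILED = 94     # hypothetical new crunch-job exit code
BLACKLIST_SECONDS = 15 * 60     # hypothetical configurable cool-off period

blacklist = {}                  # node name -> time it was blacklisted

def record_failure(node, exit_code):
    if exit_code == EX_SANITY_CHECK_FAILED:
        blacklist[node] = time.time()

def usable_nodes(nodes):
    """Drop nodes that are still inside their blacklist window."""
    now = time.time()
    return [n for n in nodes
            if n not in blacklist or now - blacklist[n] >= BLACKLIST_SECONDS]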

#6

Updated by Ward Vandewege almost 3 years ago

  • Target version deleted (Arvados Future Sprints)