Bug #8807

closed

[Crunch] crunch-job doesn't save logs when exiting EX_TEMPFAIL

Added by Bryan Cosca almost 9 years ago. Updated almost 9 years ago.

Status: Closed
Priority: Normal
Assigned To:
Category: -
Target version: -
Start date: 03/31/2016
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -


Subtasks 1 (0 open, 1 closed)

Task #8863: Review 8807-crunch-job-tempfail-logs-wip (Closed, Brett Smith, 03/31/2016)

Related issues 2 (1 open, 1 closed)

Related to Arvados - Feature #5694: [Workbench] Log tab of running job should include logs from before you opened the tab (Resolved, Radhika Chippada, 04/10/2015)
Copied to Arvados - Bug #8869: [Crunch] Job was repeatedly retried on same bad compute node until abandoned (New, 03/31/2016)

Actions #1

Updated by Brett Smith almost 9 years ago

  • Subject changed from Cannot find logs to [Crunch] crunch-job doesn't save logs when exiting EX_TEMPFAIL

This is the last log, from the logs table:

2016-03-26_20:51:23 salloc: Granted job allocation 228
2016-03-26_20:51:23 13514  Sanity check is `docker.io ps -q`
2016-03-26_20:51:23 13514  sanity check: start
2016-03-26_20:51:23 13514  stderr starting: ['srun','--nodes=1','--ntasks-per-node=1','docker.io','ps','-q']
2016-03-26_20:51:23 13514  stderr srun: error: Task launch for 228.0 failed on node compute15: No such file or directory
2016-03-26_20:51:23 13514  stderr srun: error: Application launch failed: No such file or directory
2016-03-26_20:51:23 13514  stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-03-26_20:51:23 13514  stderr srun: error: Timed out waiting for job step to complete
2016-03-26_20:51:23 13514  sanity check: exit 2
2016-03-26_20:51:23 13514  Sanity check failed: 2
2016-03-26_20:51:23 salloc: Relinquishing job allocation 228

The bug is that crunch-job doesn't save logs when it exits with EX_TEMPFAIL. It can safely do that now: the code knows how to check the job record for an existing log collection and append to it.

The job was failed immediately after this. That's a little surprising—why wasn't it retried as intended? There might be a second bug here.

Actions #2

Updated by Brett Smith almost 9 years ago

  • Project changed from 35 to Arvados
Actions #3

Updated by Brett Smith almost 9 years ago

  • Target version set to Arvados Future Sprints
Actions #4

Updated by Brett Smith almost 9 years ago

  • Status changed from New to In Progress
  • Assigned To set to Brett Smith
  • Target version changed from Arvados Future Sprints to 2016-04-13 sprint
Actions #5

Updated by Brett Smith almost 9 years ago

Brett Smith wrote:

The job was failed immediately after this. That's a little surprising—why wasn't it retried as intended? There might be a second bug here.

Filed separately as #8869.

Actions #6

Updated by Peter Amstutz almost 9 years ago

What are the implications of calling save_meta() when the job record is locked by another crunch-job? I believe a failed api_call() results in die() (which is why the locking code catches it in an eval{}), which would prevent crunch-job from exiting EX_TEMPFAIL as intended.
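
To make the concern concrete, here is a minimal sketch of the failure mode and the guard that the eval{} pattern provides. It is a Python analogue, not crunch-job's actual Perl code: `ApiError`, `save_meta_stub`, and `exit_code_after_save_attempt` are hypothetical names invented for illustration, and the only fact taken from outside this thread is that EX_TEMPFAIL is 75 in sysexits.h.

```python
import sys

EX_TEMPFAIL = 75  # value defined in sysexits.h


class ApiError(Exception):
    """Hypothetical stand-in for a failed API call."""


def save_meta_stub(locked_by_other):
    # Hypothetical stand-in for save_meta(). It raises when the job record
    # is locked, mirroring how a failed api_call() ends in die().
    if locked_by_other:
        raise ApiError("job record is locked by another process")
    return "saved"


def exit_code_after_save_attempt(locked_by_other):
    # Guard the save (the Python counterpart of wrapping it in eval{}) so
    # that a failure cannot replace the intended exit status.
    try:
        save_meta_stub(locked_by_other)
    except ApiError as err:
        print(f"could not save logs: {err}", file=sys.stderr)
    return EX_TEMPFAIL
```

Without the guard, the exception would propagate and the process would end with a generic failure status instead of EX_TEMPFAIL. The guard only preserves the exit status; it does not establish that updating the job's log field is safe at that point.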

Actions #7

Updated by Brett Smith almost 9 years ago

  • Status changed from In Progress to Closed
  • Target version deleted (2016-04-13 sprint)

Peter Amstutz wrote:

What are the implications of calling save_meta() when the job record is locked by another crunch-job? I believe a failed api_call() results in die() (which is why the locking code catches it in an eval{}), which would prevent crunch-job from exiting EX_TEMPFAIL as intended.

You're right, that's no good. And in general, this is a bad way to try to solve the problem: this early in crunch-job, it's hard to be sure that it's safe to update the job's log field in any case.

On reflection, I think I'm going to declare this bug closed by #5694. I checked whether Workbench on current master renders logs for this job, and it does, which would address the complaint in the original report. For cases where the job fails this early, rendering logs from the logs table seems like a more robust strategy than trying to save them in the job record.
