Bug #8807: [Crunch] crunch-job doesn't save logs when exiting EX_TEMPFAIL - Arvados

Bug #8807

closed

[Crunch] crunch-job doesn't save logs when exiting EX_TEMPFAIL

Added by Bryan Cosca almost 9 years ago. Updated almost 9 years ago.

Status: Closed
Priority: Normal
Assigned To: Brett Smith
Category:
Target version:
Start date: 03/31/2016
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points:

Description

gatk queue parent job: https://workbench.wx7k5.arvadosapi.com/collections/c224325251c4194e854235c7877ce6f5+89/wx7k5-8i9sb-w0sevdd7ysszqjn.log.txt
child job: wx7k5-8i9sb-f0ygdqygwonamfr

The Log tab is blank.

Subtasks 1 (0 open — 1 closed)

Task #8863: Review 8807-crunch-job-tempfail-logs-wip (Closed, Brett Smith, 03/31/2016)

Related issues 2 (1 open — 1 closed)

Related to Arvados - Feature #5694: [Workbench] Log tab of running job should include logs from before you opened the tab (Resolved, Radhika Chippada, 04/10/2015)
Copied to Arvados - Bug #8869: [Crunch] Job was repeatedly retried on same bad compute node until abandoned (New, 03/31/2016)

History

Updated by Brett Smith almost 9 years ago

Subject changed from Cannot find logs to [Crunch] crunch-job doesn't save logs when exiting EX_TEMPFAIL

This is the last log, from the logs table:

2016-03-26_20:51:23 salloc: Granted job allocation 228
2016-03-26_20:51:23 13514  Sanity check is `docker.io ps -q`
2016-03-26_20:51:23 13514  sanity check: start
2016-03-26_20:51:23 13514  stderr starting: ['srun','--nodes=1','--ntasks-per-node=1','docker.io','ps','-q']
2016-03-26_20:51:23 13514  stderr srun: error: Task launch for 228.0 failed on node compute15: No such file or directory
2016-03-26_20:51:23 13514  stderr srun: error: Application launch failed: No such file or directory
2016-03-26_20:51:23 13514  stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-03-26_20:51:23 13514  stderr srun: error: Timed out waiting for job step to complete
2016-03-26_20:51:23 13514  sanity check: exit 2
2016-03-26_20:51:23 13514  Sanity check failed: 2
2016-03-26_20:51:23 salloc: Relinquishing job allocation 228

The bug is that crunch-job doesn't save logs when it exits with EX_TEMPFAIL. It can safely do that now: the code knows how to check the job record for an existing log collection and append to it.
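To illustrate the "check for an existing log collection and append to it" idea, here is a minimal sketch. It is written in Python rather than crunch-job's Perl, and every name in it (`save_log`, the `collections` dict standing in for collection storage, the hash used as a key) is a hypothetical stand-in, not the actual crunch-job code or Arvados API.

```python
import hashlib


def save_log(job, new_lines, collections):
    """Append new_lines to the job's saved log, if any, and update job["log"].

    `job` is a plain dict standing in for the job record, and `collections`
    is a dict standing in for collection storage. Both are assumptions made
    for this sketch.
    """
    # Look up whatever log the job record already points at (may be nothing).
    previous = collections.get(job.get("log"), [])
    combined = previous + list(new_lines)

    # Hypothetical content address: a hash of the combined log text.
    key = hashlib.md5("\n".join(combined).encode()).hexdigest()
    collections[key] = combined
    job["log"] = key
    return key
```

With this shape, a process that exits early and a later retry of the same job can both call `save_log`, and the second call keeps the first attempt's lines instead of overwriting them.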

The job was failed immediately after this. That's a little surprising—why wasn't it retried as intended? There might be a second bug here.

Updated by Brett Smith almost 9 years ago

Project changed from 35 to Arvados

Updated by Brett Smith almost 9 years ago

Target version set to Arvados Future Sprints

Updated by Brett Smith almost 9 years ago

Status changed from New to In Progress
Assigned To set to Brett Smith
Target version changed from Arvados Future Sprints to 2016-04-13 sprint

Updated by Brett Smith almost 9 years ago

Brett Smith wrote:

The job was failed immediately after this. That's a little surprising—why wasn't it retried as intended? There might be a second bug here.

Filed separately as #8869.

Updated by Peter Amstutz almost 9 years ago

What are the implications of calling save_meta() when the job record is locked by another crunch-job? I believe a failed api_call() results in die() (which is why the locking code catches it in an eval{}), which would prevent crunch-job from exiting EX_TEMPFAIL as intended.
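The concern can be sketched as follows. This is an illustration in Python, not crunch-job's Perl: `ApiError`, `save_meta_locked`, and the two `exit_code_*` functions are hypothetical names, and the only fixed fact is that EX_TEMPFAIL is 75 in `sysexits.h`. The unguarded version never reaches its return value when the save call raises; the guarded version plays the role of wrapping the call in `eval{}`.

```python
import sys

EX_TEMPFAIL = 75  # value defined in sysexits.h


class ApiError(Exception):
    """Hypothetical stand-in for a failed API call."""


def save_meta_locked():
    # Simulates saving logs while another process holds the job lock.
    raise ApiError("job record is locked by another process")


def exit_code_unguarded(save):
    # If save() raises, the exception propagates and the intended
    # EX_TEMPFAIL exit status is never produced.
    save()
    return EX_TEMPFAIL


def exit_code_guarded(save):
    # Catch the failure so the temporary-failure status is preserved.
    try:
        save()
    except ApiError as err:
        print(f"could not save logs: {err}", file=sys.stderr)
    return EX_TEMPFAIL
```

A caller would pass the result to `sys.exit()`. Only the guarded variant still signals a temporary failure when the save step fails.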

Updated by Brett Smith almost 9 years ago

Status changed from In Progress to Closed
Target version deleted (2016-04-13 sprint)

Peter Amstutz wrote:

What are the implications of calling save_meta() when the job record is locked by another crunch-job? I believe a failed api_call() results in die() (which is why the locking code catches it in an eval{}), which would prevent crunch-job from exiting EX_TEMPFAIL as intended.

You're right, that's no good. And in general, this is a bad way to try to solve the problem: this early in crunch-job, it's hard to be sure that it's safe to update the job's log field in any case.

On reflection, I'm going to declare this bug closed by #5694. I checked whether Workbench on current master renders logs for this job, and it does, which addresses the complaint in the original report. For cases where the job fails this early, rendering logs from the logs table seems like a more robust strategy than trying to save them in the job record.
