Bug #8807
closed
[Crunch] crunch-job doesn't save logs when exiting EX_TEMPFAIL
Added by Bryan Cosca almost 9 years ago.
Updated almost 9 years ago.
- Subject changed from Cannot find logs to [Crunch] crunch-job doesn't save logs when exiting EX_TEMPFAIL
This is the last log, from the logs table:
2016-03-26_20:51:23 salloc: Granted job allocation 228
2016-03-26_20:51:23 13514 Sanity check is `docker.io ps -q`
2016-03-26_20:51:23 13514 sanity check: start
2016-03-26_20:51:23 13514 stderr starting: ['srun','--nodes=1','--ntasks-per-node=1','docker.io','ps','-q']
2016-03-26_20:51:23 13514 stderr srun: error: Task launch for 228.0 failed on node compute15: No such file or directory
2016-03-26_20:51:23 13514 stderr srun: error: Application launch failed: No such file or directory
2016-03-26_20:51:23 13514 stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-03-26_20:51:23 13514 stderr srun: error: Timed out waiting for job step to complete
2016-03-26_20:51:23 13514 sanity check: exit 2
2016-03-26_20:51:23 13514 Sanity check failed: 2
2016-03-26_20:51:23 salloc: Relinquishing job allocation 228
The bug is that crunch-job doesn't save logs when it exits TEMPFAIL. It can safely do that now; the code knows how to check the job record for an existing log collection, and append to it.
The job was failed immediately after this. That's a little surprising—why wasn't it retried as intended? There might be a second bug here.
- Project changed from 35 to Arvados
- Target version set to Arvados Future Sprints
- Status changed from New to In Progress
- Assigned To set to Brett Smith
- Target version changed from Arvados Future Sprints to 2016-04-13 sprint
Brett Smith wrote:
The job was failed immediately after this. That's a little surprising—why wasn't it retried as intended? There might be a second bug here.
Filed separately as #8869.
What are the implications of calling save_meta() when the job record is locked by another crunch-job? I believe a failed api_call() results in die() (which is why the locking code catches it in an eval{}), which would prevent crunch-job from exiting EX_TEMPFAIL as intended.
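A minimal sketch of this concern, written in Python rather than crunch-job's own code. Everything here is an illustrative stand-in: `ApiError` plays the role of the `die()` from a failed `api_call()`, this `save_meta()` is a hypothetical stub and not the real function, and 75 is the `EX_TEMPFAIL` value from sysexits.h. It only shows how an unguarded failure replaces the intended exit status. It does not address whether updating the job's log field is safe at that point.

```python
import sys

EX_TEMPFAIL = 75  # value from sysexits.h


class ApiError(Exception):
    """Stand-in for the die() raised when an API call fails."""


def save_meta(locked_by_other):
    # Hypothetical stub: fails when another process holds the job record.
    if locked_by_other:
        raise ApiError("job record is locked by another process")


def tempfail_unguarded(locked_by_other):
    # If save_meta() raises, the error propagates and the
    # EX_TEMPFAIL exit below is never reached.
    save_meta(locked_by_other)
    raise SystemExit(EX_TEMPFAIL)


def tempfail_guarded(locked_by_other):
    # Catch the failure (the analogue of wrapping the call in eval{})
    # so the process still exits EX_TEMPFAIL.
    try:
        save_meta(locked_by_other)
    except ApiError as err:
        print(f"could not save logs: {err}", file=sys.stderr)
    raise SystemExit(EX_TEMPFAIL)
```

With the record locked, `tempfail_unguarded(True)` ends with `ApiError` instead of the temporary-failure status, while `tempfail_guarded(True)` still exits with 75.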
- Status changed from In Progress to Closed
- Target version deleted (2016-04-13 sprint)
Peter Amstutz wrote:
What are the implications of calling save_meta() when the job record is locked by another crunch-job? I believe a failed api_call() results in die() (which is why the locking code catches it in an eval{}), which would prevent crunch-job from exiting EX_TEMPFAIL as intended.
You're right, that's no good. And in general, this is a bad way to try to solve the problem: this early in crunch-job, it's hard to be sure that it's safe to update the job's log field in any case.
On reflection, I think I'm going to declare this bug closed by #5694. I tested that current master Workbench renders logs for this job, and it does, which would address the complaint in the original report. For cases where the job fails this early, rendering logs from the logs table seems like a more robust strategy than trying to save them in the job record.