Bug #8807
Closed
[Crunch] crunch-job doesn't save logs when exiting EX_TEMPFAIL
100%
Description
GATK Queue parent job: https://workbench.wx7k5.arvadosapi.com/collections/c224325251c4194e854235c7877ce6f5+89/wx7k5-8i9sb-w0sevdd7ysszqjn.log.txt
Child job: wx7k5-8i9sb-f0ygdqygwonamfr
The child job's Log tab is blank.
Updated by Brett Smith almost 9 years ago
- Subject changed from Cannot find logs to [Crunch] crunch-job doesn't save logs when exiting EX_TEMPFAIL
This is the last log, from the logs table:
2016-03-26_20:51:23 salloc: Granted job allocation 228
2016-03-26_20:51:23 13514 Sanity check is `docker.io ps -q`
2016-03-26_20:51:23 13514 sanity check: start
2016-03-26_20:51:23 13514 stderr starting: ['srun','--nodes=1','--ntasks-per-node=1','docker.io','ps','-q']
2016-03-26_20:51:23 13514 stderr srun: error: Task launch for 228.0 failed on node compute15: No such file or directory
2016-03-26_20:51:23 13514 stderr srun: error: Application launch failed: No such file or directory
2016-03-26_20:51:23 13514 stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-03-26_20:51:23 13514 stderr srun: error: Timed out waiting for job step to complete
2016-03-26_20:51:23 13514 sanity check: exit 2
2016-03-26_20:51:23 13514 Sanity check failed: 2
2016-03-26_20:51:23 salloc: Relinquishing job allocation 228
The bug is that crunch-job doesn't save logs when it exits TEMPFAIL. It can safely do that now; the code knows how to check the job record for an existing log collection, and append to it.
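A minimal sketch of that approach, assuming the save_meta() and api_call() helpers discussed later in this thread; the call shapes and the append_to_log_collection() helper are hypothetical, not actual crunch-job code:

    use constant EX_TEMPFAIL => 75;   # sysexits.h "temporary failure" status

    sub save_logs_and_tempfail {
        my ($job_uuid, $log_buffer) = @_;
        # Re-read the job record: an earlier attempt may already have written
        # a log collection that we should append to rather than overwrite.
        my $job = api_call("jobs/get", uuid => $job_uuid);
        if ($job->{log}) {
            append_to_log_collection($job->{log}, $log_buffer);
        } else {
            # No log yet: write a fresh collection and record it on the job.
            save_meta();
        }
        exit EX_TEMPFAIL;
    }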
The job was failed immediately after this. That's a little surprising—why wasn't it retried as intended? There might be a second bug here.
Updated by Brett Smith almost 9 years ago
- Target version set to Arvados Future Sprints
Updated by Brett Smith almost 9 years ago
- Status changed from New to In Progress
- Assigned To set to Brett Smith
- Target version changed from Arvados Future Sprints to 2016-04-13 sprint
Updated by Brett Smith almost 9 years ago
Brett Smith wrote:
The job was failed immediately after this. That's a little surprising—why wasn't it retried as intended? There might be a second bug here.
Filed separately as #8869.
Updated by Peter Amstutz almost 9 years ago
What are the implications of calling save_meta() when the job record is locked by another crunch-job? I believe a failed api_call() results in die() (which is why the locking code catches it in an eval{}), which would prevent crunch-job from exiting EX_TEMPFAIL as intended.
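To make the concern concrete, a sketch of the failure mode (illustrative only, not the actual crunch-job code):

    use constant EX_TEMPFAIL => 75;

    # Unguarded: if another crunch-job holds the job lock, the failed
    # api_call() inside save_meta() die()s, and the script aborts before it
    # can reach the intended exit status.
    save_meta();
    exit EX_TEMPFAIL;   # never reached when save_meta() die()s

    # Guarded, following the eval{} pattern the locking code already uses:
    # a failed save is reported, but the tempfail exit still happens.
    eval { save_meta(); 1 } or warn "could not save logs: $@";
    exit EX_TEMPFAIL;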
Updated by Brett Smith almost 9 years ago
- Status changed from In Progress to Closed
- Target version deleted (2016-04-13 sprint)
Peter Amstutz wrote:
What are the implications of calling save_meta() when the job record is locked by another crunch-job? I believe a failed api_call() results in die() (which is why the locking code catches it in an eval{}), which would prevent crunch-job from exiting EX_TEMPFAIL as intended.
You're right, that's no good. And in general, this is a bad way to try to solve the problem: this early in crunch-job, it's hard to be sure that it's safe to update the job's log field in any case.
On reflection, I think I'm going to declare this bug closed by #5694. I tested that current master Workbench renders logs for this job, and it does, which would address the complaint in the original report. For cases where the job fails this early, rendering logs from the logs table seems like a more robust strategy than trying to save them in the job record.