Bug #14574
Workflow deadlocked
Status: Closed (100% done)
Description
When an ExpressionTool is executed, it doesn't take the workflow execution lock before calling the output callback. This is a problem when multiple ExpressionTool jobs are executing in threads.
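For context, a minimal, self-contained Python sketch of the hazard follows. It is illustrative only (none of these names come from arvados-cwl-runner or cwltool), and the lost decrement it shows is just one way an unlocked callback can leave the run loop waiting forever.

```python
# Illustration only (not Arvados/cwltool code): several "expression tool"
# threads finish and report their outputs through a shared callback. If that
# callback mutates shared workflow state without the workflow lock, its
# updates can race with the run loop's check-then-wait on the same state.
import threading

workflow_eval_lock = threading.Condition(threading.RLock())
state = {"pending": 3, "outputs": {}}

def output_callback_unlocked(step_name, out):
    # BUG: shared state is updated outside the lock, so the read-modify-write
    # on "pending" can interleave with another thread's and lose a decrement;
    # "pending" then never reaches zero and the run loop below never exits.
    state["outputs"][step_name] = out
    state["pending"] -= 1

def output_callback_locked(step_name, out):
    # Fix pattern: hold the workflow lock while updating shared state and
    # waking the run loop.
    with workflow_eval_lock:
        state["outputs"][step_name] = out
        state["pending"] -= 1
        workflow_eval_lock.notify_all()

def run_loop():
    # Main workflow loop: sleeps until every step has reported its output.
    with workflow_eval_lock:
        while state["pending"] > 0:
            workflow_eval_lock.wait(3)
```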
Updated by Peter Amstutz about 6 years ago
- Status changed from New to In Progress
Updated by Peter Amstutz about 6 years ago
The quick fix is to change the default, but the best option is to fix the underlying problem.
We need to add a note to the 1.3.0 release notes about the bug and its workaround, and try to fix it for 1.3.1.
Updated by Peter Amstutz about 6 years ago
14574-expression-fix @ 4ad50255921b571d8e7748b4c6c098b53d803183
https://ci.curoverse.com/view/Developer/job/developer-run-tests/997/
Override ExpressionTool with ArvadosExpressionTool and ensure that the output callback is wrapped to take the workflow lock.
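A rough sketch of that shape, for readers following along; the attribute and parameter names here (including workflow_eval_lock and the constructor signature) are assumptions, not a copy of the branch.

```python
# Sketch only: assumes cwltool's ExpressionTool exposes a
# job(joborder, output_callback, runtimeContext) generator and that the runner
# object carries the shared workflow lock as .workflow_eval_lock.
from cwltool.command_line_tool import ExpressionTool  # import path varies by cwltool version


class ArvadosExpressionTool(ExpressionTool):
    def __init__(self, arvrunner, toolpath_object, loadingContext):
        super(ArvadosExpressionTool, self).__init__(toolpath_object, loadingContext)
        self.arvrunner = arvrunner

    def job(self, joborder, output_callback, runtimeContext):
        def locked_callback(out, processStatus):
            # Take the workflow execution lock before reporting outputs, so the
            # parent workflow's state is only mutated while the lock is held.
            with self.arvrunner.workflow_eval_lock:
                output_callback(out, processStatus)

        return super(ArvadosExpressionTool, self).job(
            joborder, locked_callback, runtimeContext)
```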
Updated by Peter Amstutz about 6 years ago
While I'm pretty sure failure to lock the callback from ExpressionTool is a bug, and it could plausibly cause the behavior being reported, I haven't actually been able to reproduce the reported deadlock, so I can't say definitively that this fixes it.
Updated by Lucas Di Pentima about 6 years ago
The locking LGTM. How can we test this? Maybe with the original workflow?
Updated by Peter Amstutz about 6 years ago
Lucas Di Pentima wrote:
The locking LGTM. How can we test this? Maybe with the original workflow?
Yeah, I've already tried it with the original workflow; the problem is that I haven't been able to reproduce the bug, so it's speculative. There's definitely a race condition that is fixed by this branch, and a race could create the problems we're seeing, but I can't pin it down either way. I can run it a few more times and see what happens.
Updated by Peter Amstutz about 6 years ago
Peter Amstutz wrote:
Lucas Di Pentima wrote:
The locking LGTM. How can we test this? Maybe with the original workflow?
Yeah, I've already tried it with the original workflow; the problem is that I haven't been able to reproduce the bug, so it's speculative. There's definitely a race condition that is fixed by this branch, and a race could create the problems we're seeing, but I can't pin it down either way. I can run it a few more times and see what happens.
I re-ran the job (e51c5-xvhdp-g1kjpf3j7zo6ou1) from the original failure report (e51c5-xvhdp-tlnzytroy9m380j). It finished successfully in 2 minutes (all containers reused).
Running with job reuse isn't exactly the same as running a normal job, so the only other thing I can think of would be to re-run without job reuse, but that's expensive.
Updated by Peter Amstutz about 6 years ago
- Status changed from In Progress to Resolved