Bug #14574
Workflow deadlocked
Status: Closed (100% done)
Description
When an ExpressionTool is executed, it doesn't take the workflow execution lock before calling the output callback. This is a problem when multiple ExpressionTool jobs are executing in threads.
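For context, a minimal, self-contained Python sketch of the hazard follows. It is illustrative only (none of these names come from arvados-cwl-runner or cwltool), and the lost decrement it shows is just one way an unlocked callback can leave the run loop waiting forever.

```python
# Illustration only (not Arvados/cwltool code): several "expression tool"
# threads finish and report their outputs through a shared callback. If that
# callback mutates shared workflow state without the workflow lock, its
# updates can race with the run loop's check-then-wait on the same state.
import threading

workflow_eval_lock = threading.Condition(threading.RLock())
state = {"pending": 3, "outputs": {}}

def output_callback_unlocked(step_name, out):
    # BUG: shared state is updated outside the lock, so the read-modify-write
    # on "pending" can interleave with another thread's and lose a decrement;
    # "pending" then never reaches zero and the run loop below never exits.
    state["outputs"][step_name] = out
    state["pending"] -= 1

def output_callback_locked(step_name, out):
    # Fix pattern: hold the workflow lock while updating shared state and
    # waking the run loop.
    with workflow_eval_lock:
        state["outputs"][step_name] = out
        state["pending"] -= 1
        workflow_eval_lock.notify_all()

def run_loop():
    # Main workflow loop: sleeps until every step has reported its output.
    with workflow_eval_lock:
        while state["pending"] > 0:
            workflow_eval_lock.wait(3)
```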
Updated by Peter Amstutz about 6 years ago
- Status changed from New to In Progress
Updated by Peter Amstutz about 6 years ago
The quick fix is to change the default, but the best option is to fix the underlying problem.
We need to add a note to the 1.3.0 release notes about the bug and its workaround, and try to fix it for 1.3.1.
Updated by Peter Amstutz about 6 years ago
14574-expression-fix @ 4ad50255921b571d8e7748b4c6c098b53d803183
https://ci.curoverse.com/view/Developer/job/developer-run-tests/997/
Override ExpressionTool with ArvadosExpressionTool and ensure that the output callback is wrapped to take the workflow lock.
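A rough sketch of that shape, for readers following along; the attribute and parameter names here (including workflow_eval_lock and the constructor signature) are assumptions, not a copy of the branch.

```python
# Sketch only: assumes cwltool's ExpressionTool exposes a
# job(joborder, output_callback, runtimeContext) generator and that the runner
# object carries the shared workflow lock as .workflow_eval_lock.
from cwltool.command_line_tool import ExpressionTool  # import path varies by cwltool version


class ArvadosExpressionTool(ExpressionTool):
    def __init__(self, arvrunner, toolpath_object, loadingContext):
        super(ArvadosExpressionTool, self).__init__(toolpath_object, loadingContext)
        self.arvrunner = arvrunner

    def job(self, joborder, output_callback, runtimeContext):
        def locked_callback(out, processStatus):
            # Take the workflow execution lock before reporting outputs, so the
            # parent workflow's state is only mutated while the lock is held.
            with self.arvrunner.workflow_eval_lock:
                output_callback(out, processStatus)

        return super(ArvadosExpressionTool, self).job(
            joborder, locked_callback, runtimeContext)
```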
Updated by Peter Amstutz about 6 years ago
While I'm pretty sure failure to lock the callback from ExpressionTool is a bug, and it could plausibly cause the behavior being reported, I haven't actually been able to reproduce the reported deadlock, so I can't say definitively that this fixes it.
Updated by Lucas Di Pentima about 6 years ago
The locking LGTM. How can we test this? Maybe with the original workflow?
Updated by Peter Amstutz about 6 years ago
Lucas Di Pentima wrote:
The locking LGTM. How can we test this? Maybe with the original workflow?
Yeah, I've already tried it with the original workflow; the problem is that I haven't been able to reproduce the bug, so it's speculative. There's definitely a race condition that is fixed by this branch, and a race could create the problems we're seeing, but I can't pin it down either way. I can run it a few more times and see what happens.
Updated by Peter Amstutz about 6 years ago
Peter Amstutz wrote:
Lucas Di Pentima wrote:
The locking LGTM. How can we test this? Maybe with the original workflow?
Yeah, I've already tried it with the original workflow; the problem is that I haven't been able to reproduce the bug, so it's speculative. There's definitely a race condition that is fixed by this branch, and a race could create the problems we're seeing, but I can't pin it down either way. I can run it a few more times and see what happens.
I re-ran the job (e51c5-xvhdp-g1kjpf3j7zo6ou1) from the original failure report (e51c5-xvhdp-tlnzytroy9m380j). It finished successfully in 2 minutes (all containers reused).
Running with job reuse isn't exactly the same as running a normal job, so the only other thing I can think of would be to re-run without job reuse, but that's expensive.
Updated by Peter Amstutz about 6 years ago
- Status changed from In Progress to Resolved