Feature #8018
[Crunch2] Identify container failure and retry
Status: Closed
% Done: 100%
Description
When a container goes into the Cancelled state, create a duplicate container if all of these conditions are met by at least one of its container requests:
- request is in "committed" state (not uncommitted or finalized)
- request has priority > 0
- request container_count < container_count_max
Add a new column on container requests, container_count. Each time a new container is assigned to a committed container request, increment its container_count.
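A minimal sketch of this logic, assuming Rails-style models; names beyond those given above (e.g. handle_cancelled, runtime_attrs) are hypothetical, not necessarily the real Arvados code paths:

    # Sketch only: assumes ContainerRequest/Container models with the
    # attributes described above; handle_cancelled and runtime_attrs are
    # illustrative names.
    def handle_cancelled(container)
      retryable_requests = ContainerRequest.
        where(container_uuid: container.uuid,
              state: ContainerRequest::Committed).
        where('priority > 0').
        where('container_count < container_count_max')
      return if retryable_requests.empty?

      # Create a fresh container with the same runtime attributes and point
      # each eligible request at it, bumping its attempt counter.
      c = Container.create!(container.runtime_attrs)
      retryable_requests.each do |cr|
        cr.update_attributes!(container_uuid: c.uuid,
                              container_count: cr.container_count + 1)
      end
    end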
Updated by Tom Clegg over 8 years ago
- Automatically retry failed containers when appropriate (so users don't have to push "try again" every time there's a transient infrastructure problem)
- Re-use existing containers when appropriate
(In both cases, the definition of "appropriate" is one thing we need to figure out.)
Among others, we should consider the use case where:
- Container exits zero, state=Complete
- User inspects the output and concludes the container failed
- User wants to mark the container as "failed" in some way, so it doesn't get reused in future workflows
- Until the code and workflow are fixed (which can be non-trivial), there is no indication anywhere that the output is bogus
- Updating the workflow to require a new version invalidates all previous containers/outputs, not just the known-bad one. This can cause an unacceptably huge amount of unnecessary computation.
Updated by Peter Amstutz over 8 years ago
- Target version set to 2016-09-28 sprint
Updated by Tom Clegg over 8 years ago
Suggest adding a "number of attempts already made" attribute to a container request, incrementing it atomically when retrying, and comparing it to count_max. (Using "remaining tries" is troublesome because it's hard for clients to update that atomically.)
We could expose this in API responses but prohibit clients from updating it.
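For illustration, one way the atomic increment could look in a Rails API server; this is a sketch under assumed model names, not the actual implementation:

    # Sketch: bump the attempts counter with a single SQL UPDATE rather than
    # read-modify-write, so concurrent retries cannot lose an increment.
    # (Illustrative; cr is a ContainerRequest record.)
    ContainerRequest.where(uuid: cr.uuid).
      update_all('container_count = container_count + 1')
    # The API would still return container_count in responses, but reject
    # attempts by clients to set it (enforcement not shown here).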
Updated by Peter Amstutz over 8 years ago
Tom Clegg wrote:
Suggest adding a "number of attempts already made" attribute to a container request, incrementing it atomically when retrying, and comparing it to count_max. (Using "remaining tries" is troublesome because it's hard for clients to update that atomically.)
We could expose this in API responses but prohibit clients from updating it.
This sounds good, will update description.
Updated by Peter Amstutz about 8 years ago
- Status changed from New to In Progress
Updated by Radhika Chippada about 8 years ago
- Target version changed from 2016-09-28 sprint to 2016-10-12 sprint
Updated by Radhika Chippada about 8 years ago
- Did you forget to commit the db migration script to add the container_count?
- When retryable_requests exist in container.handle_completed, should we be using find_reusable and then create if none found instead?
- In the container_request_test, I think you can assert that the cr.container_uuid is the same as that after the first cancel related update.
- Of course, I would need to re-review after you push the migration script
Updated by Peter Amstutz about 8 years ago
Radhika Chippada wrote:
- Did you forget to commit the db migration script to add the container_count?
Oops. Committed.
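(For reference, a migration adding this column would look roughly like the sketch below; the class name, default, and Rails migration syntax here are assumptions, not the committed script.)

    # Hypothetical sketch of the migration, not the committed version.
    class AddContainerCountToContainerRequests < ActiveRecord::Migration
      def change
        add_column :container_requests, :container_count, :integer, default: 0
      end
    end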
- When retryable_requests exist in container.handle_completed, should we be using find_reusable and then create if none found instead?
We could do that, although because of job reuse I don't know how likely it is that another active container with exactly the same attributes will exist: any request for such a container would already have been associated with the container that was just cancelled (the one being retried). Also, if/when we adjust the reuse behavior (for impure containers, or an explicit "don't reuse" flag on container requests), it might not play well with the idea of reusing containers on retry.
So this is what I coded up first:

    reusable = Container.find_reusable c_attrs
    c = if not reusable.nil?
          reusable
        else
          Container.create! c_attrs
        end
But I think I just talked myself out of it, so I'm going to put it back to always creating a new container.
c = Container.create! c_attrs
- In the container_request_test, I think you can assert that the cr.container_uuid is the same as that after the first cancel related update.
I'm a little confused by what you're suggesting, but I added some additional assertions checking whether cr.container_uuid changed or not.
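Roughly the kind of assertion in question, as a sketch only; the setup values are assumptions and the real container state machine is more involved than the direct update shown here:

    # Sketch of a retry assertion; attribute values and direct state
    # changes are simplified, hypothetical stand-ins for the real tests.
    test "retry after cancel assigns a new container and bumps container_count" do
      cr = ContainerRequest.create!(state: ContainerRequest::Committed,
                                    priority: 1,
                                    container_count_max: 2,
                                    command: ['echo', 'hello'],
                                    container_image: 'test',
                                    output_path: '/out')
      first_uuid = cr.container_uuid
      first_count = cr.container_count

      # Pretend the assigned container was cancelled.
      Container.find_by_uuid(first_uuid).update_attributes!(state: Container::Cancelled)

      cr.reload
      assert_equal first_count + 1, cr.container_count
      refute_equal first_uuid, cr.container_uuid
    end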
- Of course, I would need to re-review after you push the migration script
Now at 8317b6e
Updated by Peter Amstutz about 8 years ago
- Status changed from In Progress to Resolved
- % Done changed from 67 to 100
Applied in changeset arvados|commit:47e42f1129363c2565e69c36ff26ce9c42731fb8.
Updated by Tom Clegg almost 6 years ago
- Related to Feature #14706: [Crunch2] Retain references + permissions to earlier containers when retrying a container request added