Feature #8018
[Crunch2] Identify container failure and retry
Status: Closed
% Done: 100%
Description
When a container goes into the Cancelled state, create a duplicate container if all of these conditions are met by at least one of its container requests:
- request is in "committed" state (not uncommitted or finalized)
- request has priority > 0
- request container_count < container_count_max
Add a new column on container requests, container_count. Each time a new container is assigned to a committed container request, increment its container_count.
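A minimal sketch of this logic, assuming Rails-style models; names beyond those given above (e.g. handle_cancelled, runtime_attrs) are hypothetical, not necessarily the real Arvados code paths:

    # Sketch only: assumes ContainerRequest/Container models with the
    # attributes described above; handle_cancelled and runtime_attrs are
    # illustrative names.
    def handle_cancelled(container)
      retryable_requests = ContainerRequest.
        where(container_uuid: container.uuid,
              state: ContainerRequest::Committed).
        where('priority > 0').
        where('container_count < container_count_max')
      return if retryable_requests.empty?

      # Create a fresh container with the same runtime attributes and point
      # each eligible request at it, bumping its attempt counter.
      c = Container.create!(container.runtime_attrs)
      retryable_requests.each do |cr|
        cr.update_attributes!(container_uuid: c.uuid,
                              container_count: cr.container_count + 1)
      end
    end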
Updated by Tom Clegg over 8 years ago
- Automatically retry failed containers when appropriate (so users don't have to push "try again" every time there's a transient infrastructure problem)
- Re-use existing containers when appropriate
(In both cases, the definition of "appropriate" is one thing we need to figure out.)
Among others, we should consider the use case where:
- Container exits zero, state=Complete
- User inspects the output and concludes the container failed
- User wants to mark the container as "failed" in some way, so it doesn't get reused in future workflows
- Until the code and workflow are fixed (which can be non-trivial), there is no indication anywhere that the output is bogus
- Updating the workflow to require a new version invalidates all previous containers/outputs, not just the known-bad one. This can cause an unacceptably huge amount of unnecessary computation.
Updated by Peter Amstutz over 8 years ago
- Target version set to 2016-09-28 sprint
Updated by Tom Clegg over 8 years ago
Suggest adding a "number of attempts already made" attribute to a container request, incrementing it atomically when retrying, and comparing it to count_max. (Using "remaining tries" is troublesome because it's hard for clients to update that atomically.)
We could expose this in API responses but prohibit clients from updating it.
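For illustration, one way the atomic increment could look in a Rails API server; this is a sketch under assumed model names, not the actual implementation:

    # Sketch: bump the attempts counter with a single SQL UPDATE rather than
    # read-modify-write, so concurrent retries cannot lose an increment.
    # (Illustrative; cr is a ContainerRequest record.)
    ContainerRequest.where(uuid: cr.uuid).
      update_all('container_count = container_count + 1')
    # The API would still return container_count in responses, but reject
    # attempts by clients to set it (enforcement not shown here).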
Updated by Peter Amstutz over 8 years ago
Tom Clegg wrote:
Suggest adding a "number of attempts already made" attribute to a container request, incrementing it atomically when retrying, and comparing it to count_max. (Using "remaining tries" is troublesome because it's hard for clients to update that atomically.)
We could expose this in API responses but prohibit clients from updating it.
This sounds good, will update description.
Updated by Peter Amstutz about 8 years ago
- Status changed from New to In Progress
Updated by Radhika Chippada about 8 years ago
- Target version changed from 2016-09-28 sprint to 2016-10-12 sprint
Updated by Radhika Chippada about 8 years ago
- Did you forget to commit the db migration script to add the container_count?
- When retryable_requests exist in container.handle_completed, should we be using find_reusable and then create if none found instead?
- In the container_request_test, I think you can assert that the cr.container_uuid is the same as that after the first cancel related update.
- Of course, I would need to re-review after you push the migration script
Updated by Peter Amstutz about 8 years ago
Radhika Chippada wrote:
- Did you forget to commit the db migration script to add the container_count?
Oops. Committed.
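(For reference, a migration adding this column would look roughly like the sketch below; the class name, default, and Rails migration syntax here are assumptions, not the committed script.)

    # Hypothetical sketch of the migration, not the committed version.
    class AddContainerCountToContainerRequests < ActiveRecord::Migration
      def change
        add_column :container_requests, :container_count, :integer, default: 0
      end
    end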
- When retryable_requests exist in container.handle_completed, should we be using find_reusable and then create if none found instead?
We could do that, although because of job reuse I don't know how likely it is that another active container with exactly the same attributes will exist: any request for such a container would already have been associated with the container that was just cancelled (the one being retried). Also, if/when we adjust the reuse behavior (for impure containers, or an explicit "don't reuse" flag on container requests), it might not play well with the idea of reusing containers on retry.
So this is what I coded up first:

    reusable = Container.find_reusable c_attrs
    c = if not reusable.nil?
          reusable
        else
          Container.create! c_attrs
        end
But I think I just talked myself out of it, so I'm going to put it back to always creating a new container.
c = Container.create! c_attrs
- In the container_request_test, I think you can assert that the cr.container_uuid is the same as that after the first cancel related update.
I'm a little confused by what you're suggesting, but I added some additional assertions checking whether cr.container_uuid changed or not.
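Roughly the kind of assertion in question, as a sketch only; the setup values are assumptions and the real container state machine is more involved than the direct update shown here:

    # Sketch of a retry assertion; attribute values and direct state
    # changes are simplified, hypothetical stand-ins for the real tests.
    test "retry after cancel assigns a new container and bumps container_count" do
      cr = ContainerRequest.create!(state: ContainerRequest::Committed,
                                    priority: 1,
                                    container_count_max: 2,
                                    command: ['echo', 'hello'],
                                    container_image: 'test',
                                    output_path: '/out')
      first_uuid = cr.container_uuid
      first_count = cr.container_count

      # Pretend the assigned container was cancelled.
      Container.find_by_uuid(first_uuid).update_attributes!(state: Container::Cancelled)

      cr.reload
      assert_equal first_count + 1, cr.container_count
      refute_equal first_uuid, cr.container_uuid
    end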
- Of course, I would need to re-review after you push the migration script
Now at 8317b6e
Updated by Peter Amstutz about 8 years ago
- Status changed from In Progress to Resolved
- % Done changed from 67 to 100
Applied in changeset arvados|commit:47e42f1129363c2565e69c36ff26ce9c42731fb8.
Updated by Tom Clegg almost 6 years ago
- Related to Feature #14706: [Crunch2] Retain references + permissions to earlier containers when retrying a container request added