Idea #22765: a-d-c should not run every container dispatch attempt on the same cloud node - Arvados

Idea #22765

open

Added by Brett Smith 12 months ago. Updated 10 months ago.

Status: New
Priority: Normal
Assigned To:
Category: Crunch
Target version: Future
Start date:
Due date:
Story points:

Description

The current implementation of MaxDispatchAttempts can make every attempt on the same cloud node. We would like to make this more sophisticated:

- It should be willing to try to dispatch to the same node a "few" times, to try to recover from slow/misordered boots.
  - It would be slick to incorporate a little backoff into this retry logic.
  - See below for discussion of the retry count.
- If dispatch makes no progress after those "few" attempts, assume the node is bad and shut it down. The container goes back to the previous state of waiting for a cloud node. We remember dispatch attempts across nodes and keep trying until we hit the configured total maximum.
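
The bookkeeping described above could be sketched roughly as follows. This is only an illustration in Python, not a-d-c code: `DispatchTracker`, `Action`, and `max_attempts_per_node` are hypothetical names, and `max_total_attempts` merely stands in for MaxDispatchAttempts.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    RETRY_SAME_NODE = "retry_same_node"  # try the same node again
    REPLACE_NODE = "replace_node"        # shut the node down, requeue the container
    GIVE_UP = "give_up"                  # total attempts exhausted


@dataclass
class DispatchTracker:
    """Hypothetical per-container record of failed dispatch attempts."""
    max_total_attempts: int      # stands in for MaxDispatchAttempts
    max_attempts_per_node: int   # assumed value for the "few" attempts per node
    total_attempts: int = 0
    per_node: dict = field(default_factory=dict)

    def record_failure(self, node_id: str) -> Action:
        # Attempts are remembered across nodes, so the total keeps growing
        # even after the container moves to a fresh node.
        self.total_attempts += 1
        self.per_node[node_id] = self.per_node.get(node_id, 0) + 1
        if self.total_attempts >= self.max_total_attempts:
            return Action.GIVE_UP
        if self.per_node[node_id] >= self.max_attempts_per_node:
            return Action.REPLACE_NODE
        return Action.RETRY_SAME_NODE
```

With a total of 5 and a per-node limit of 2, a container would use two attempts on each of two nodes and its final attempt on a third.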

We need to discuss how to relate this new implementation to the existing MaxDispatchAttempts configuration. Should we let administrators configure both "attempts per node" and "total attempts"? Or do we want to hardcode the first value as some constant or percentage and let MaxDispatchAttempts remain as total attempts? Bear in mind any new configuration will only affect a-d-c, not other dispatchers.
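
To make the options concrete, here is one way the per-node limit could be derived if MaxDispatchAttempts stays as the total. The function name and its `fixed` and `fraction` parameters are hypothetical; nothing like them exists in the Arvados configuration today.

```python
import math


def attempts_per_node(max_dispatch_attempts, fixed=None, fraction=None):
    """Derive a per-node attempt limit from the total (all names hypothetical).

    fixed:    hardcoded constant number of attempts per node
    fraction: share of the total allowed on one node, e.g. 0.25
    """
    if max_dispatch_attempts < 1:
        raise ValueError("max_dispatch_attempts must be at least 1")
    if fixed is not None and fraction is not None:
        raise ValueError("choose either fixed or fraction, not both")
    if fixed is not None:
        limit = fixed
    elif fraction is not None:
        limit = math.ceil(max_dispatch_attempts * fraction)
    else:
        # Current behavior: every attempt may land on the same node.
        limit = max_dispatch_attempts
    # Never allow fewer than one attempt or more than the total.
    return max(1, min(limit, max_dispatch_attempts))
```

Clamping to the total means a small MaxDispatchAttempts still behaves sensibly whichever option is chosen.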

Updated by Peter Amstutz 10 months ago

Target version set to Development 2025-06-25

Updated by Peter Amstutz 10 months ago

Target version changed from Development 2025-06-25 to Future
Tracker changed from Bug to Idea

We would certainly benefit from a more sophisticated policy.

The problem is that it is hard to know which of several situations we are in:

- Sometimes the instance is bad, and we need a new instance.
- Sometimes the container record is bad, and the container won't ever start. In this case, waiting for a new instance is a waste of time.
- Sometimes there was some other network hiccup, and retrying anywhere (same instance or a different one) is likely to work.

So what we probably want is some sequence of: fast retry, backoff, and retrying on a different instance.
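
As a rough sketch of the timing side of that sequence, the waits before each attempt on a single instance might look like this. `retry_delays` and its default values are assumptions for illustration only; once the list is used up, the caller would move on to a different instance.

```python
def retry_delays(same_instance_attempts, base_delay=1.0, factor=2.0, max_delay=30.0):
    """Return the wait in seconds before each attempt on one instance.

    The first attempt and the first (fast) retry happen immediately;
    later retries back off exponentially, capped at max_delay.
    All parameter names and defaults are hypothetical.
    """
    delays = []
    for attempt in range(same_instance_attempts):
        if attempt <= 1:
            delays.append(0.0)
        else:
            delays.append(min(max_delay, base_delay * factor ** (attempt - 2)))
    return delays
```

For example, `retry_delays(5)` gives `[0.0, 0.0, 1.0, 2.0, 4.0]`. This does not help with the bad container record case, where no amount of retrying will work.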

Updated by Peter Amstutz 10 months ago

Target version changed from Future to Development 2025-07-09

Updated by Peter Amstutz 10 months ago

Target version changed from Development 2025-07-09 to Development 2025-07-23

Updated by Peter Amstutz 10 months ago

Target version changed from Development 2025-07-23 to Development 2025-08-06

Updated by Peter Amstutz 10 months ago

Target version changed from Development 2025-08-06 to Future
