Idea #22765

open

a-d-c should not run every container dispatch attempt on the same cloud node

Added by Brett Smith 12 months ago. Updated 10 months ago.

Status: New
Priority: Normal
Assigned To: -
Category: Crunch
Target version:
Start date:
Due date:
Story points: -

Description

The current implementation of MaxDispatchAttempts can make every attempt on the same cloud node. We would like to make this more sophisticated:

  • It should be willing to try to dispatch to the same node a "few" times, to try to recover from slow/misordered boots.
    • It would be slick to incorporate a little backoff into this retry logic.
    • See below for discussion of the retry count.
  • If dispatch makes no progress after those "few" attempts, assume the node is bad and shut it down. The container goes back to the previous state of waiting for a cloud node. We remember dispatch attempts across nodes and keep trying until we hit the configured total maximum.
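
To make the proposal above concrete, here is a minimal Go sketch of the bookkeeping it implies. It is not the a-d-c implementation: the `Policy`, `Tracker`, and `Decision` names are hypothetical, and so is the `MaxAttemptsPerNode` knob (only the total corresponds to the existing MaxDispatchAttempts). It assumes a simple exponential backoff and that "making no progress" is reported by calling `RecordFailure`.

```go
package main

import (
	"fmt"
	"time"
)

// Decision is what the dispatcher should do after a failed dispatch attempt.
type Decision int

const (
	RetrySameNode Decision = iota // wait for the returned delay, then retry on the same node
	ReplaceNode                   // shut the node down; the container waits for a new node
	GiveUp                        // the total attempt budget is exhausted
)

func (d Decision) String() string {
	return [...]string{"RetrySameNode", "ReplaceNode", "GiveUp"}[d]
}

// Policy holds the tuning knobs. MaxAttemptsPerNode and BaseBackoff are
// hypothetical; MaxTotalAttempts plays the role of MaxDispatchAttempts.
type Policy struct {
	MaxAttemptsPerNode int
	MaxTotalAttempts   int
	BaseBackoff        time.Duration
}

// Tracker remembers dispatch attempts for one container across nodes.
type Tracker struct {
	policy    Policy
	total     int // failed attempts on all nodes so far
	onCurrent int // failed attempts on the current node
}

// RecordFailure counts one failed attempt and returns the next action.
// The delay is only meaningful for RetrySameNode.
func (t *Tracker) RecordFailure() (Decision, time.Duration) {
	t.total++
	t.onCurrent++
	if t.total >= t.policy.MaxTotalAttempts {
		return GiveUp, 0
	}
	if t.onCurrent >= t.policy.MaxAttemptsPerNode {
		t.onCurrent = 0 // the next attempt lands on a fresh node
		return ReplaceNode, 0
	}
	// Exponential backoff on the same node: base, 2*base, 4*base, ...
	return RetrySameNode, t.policy.BaseBackoff << uint(t.onCurrent-1)
}

func main() {
	tr := &Tracker{policy: Policy{MaxAttemptsPerNode: 3, MaxTotalAttempts: 5, BaseBackoff: time.Second}}
	for i := 1; i <= 5; i++ {
		d, delay := tr.RecordFailure()
		fmt.Printf("failure %d: %v (delay %v)\n", i, d, delay)
	}
}
```

With 3 attempts per node and 5 total, the third failure replaces the node and the fifth gives up, so the container never spends its whole budget on one node.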

We need to discuss how to relate this new implementation to the existing MaxDispatchAttempts configuration. Should we let administrators configure both "attempts per node" and "total attempts?" Or do we want to hardcode the first value as some constant or percentage and let MaxDispatchAttempts remain as total attempts? Bear in mind any new configuration will only affect a-d-c, not other dispatchers.
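
For discussion, the two "hardcode the first value" options could look like the following Go sketch. Both function names and the example numbers are hypothetical; the only input taken from the existing configuration is the total, i.e. MaxDispatchAttempts.

```go
package main

import "fmt"

// fixedPerNode is the "constant" option: a hardcoded per-node limit,
// capped so it never exceeds the configured total.
func fixedPerNode(total, constant int) int {
	if total < constant {
		return total
	}
	return constant
}

// percentPerNode is the "percentage" option: a share of the total,
// rounded up, at least 1 (when total allows) and never above the total.
func percentPerNode(total, percent int) int {
	n := (total*percent + 99) / 100 // integer ceiling of total*percent/100
	if n < 1 {
		n = 1
	}
	if n > total {
		n = total
	}
	return n
}

func main() {
	for _, total := range []int{2, 3, 10} {
		fmt.Printf("total=%d constant(3)=%d percent(30)=%d\n",
			total, fixedPerNode(total, 3), percentPerNode(total, 30))
	}
}
```

The constant behaves the same regardless of how large administrators set the total, while the percentage scales with it; either way no new configuration key is needed.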

Actions #1

Updated by Peter Amstutz 10 months ago

  • Target version set to Development 2025-06-25
Actions #2

Updated by Peter Amstutz 10 months ago

  • Target version changed from Development 2025-06-25 to Future
  • Tracker changed from Bug to Idea

We would certainly benefit from a more sophisticated policy.

The problem is that it is hard to know which of several situations we are in:

Sometimes the instance is bad, and we need a new instance.

Sometimes the container record is bad, and the container won't ever start. In this case, waiting for a new instance is a waste of time.

Sometimes there was some other network hiccup, and retrying anywhere (same instance or a different one) is likely to work.

So what we probably want is some sequence of: fast retry, backoff, and retrying on a different instance.
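
A toy Go model of why that sequence is attractive even though the dispatcher cannot tell the situations apart: walk the ladder in order and see where each case resolves. The type names and the `wouldSucceed` table are illustrative assumptions that simply encode the three cases listed above, nothing measured.

```go
package main

import "fmt"

// Situation is the real cause of the failure, unknown to the dispatcher.
type Situation int

const (
	TransientHiccup Situation = iota // retrying anywhere is likely to work
	BadInstance                      // only a new instance helps
	BadContainer                     // the container will never start
)

// Step is one rung of the escalation ladder.
type Step int

const (
	FastRetry Step = iota
	BackoffRetry
	NewInstanceRetry
)

// ladder is tried in order, cheapest step first.
var ladder = []Step{FastRetry, BackoffRetry, NewInstanceRetry}

// wouldSucceed encodes the three cases from the comment above.
func wouldSucceed(s Situation, step Step) bool {
	switch s {
	case TransientHiccup:
		return true
	case BadInstance:
		return step == NewInstanceRetry
	default: // BadContainer
		return false
	}
}

// firstSuccess walks the ladder and returns the first step that works.
// ok is false when no step helps, which is when the container should fail.
func firstSuccess(s Situation) (step Step, ok bool) {
	for _, st := range ladder {
		if wouldSucceed(s, st) {
			return st, true
		}
	}
	return 0, false
}

func main() {
	for _, s := range []Situation{TransientHiccup, BadInstance, BadContainer} {
		step, ok := firstSuccess(s)
		fmt.Printf("situation=%d resolvedAt=%d ok=%v\n", s, step, ok)
	}
}
```

In this model a hiccup is resolved by the cheap fast retry, a bad instance costs the same-node retries before the replacement fixes it, and a bad container record exhausts the ladder, so the cost of waiting for a new instance is paid once rather than on every attempt.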

Actions #3

Updated by Peter Amstutz 10 months ago

  • Target version changed from Future to Development 2025-07-09
Actions #4

Updated by Peter Amstutz 10 months ago

  • Target version changed from Development 2025-07-09 to Development 2025-07-23
Actions #5

Updated by Peter Amstutz 10 months ago

  • Target version changed from Development 2025-07-23 to Development 2025-08-06
Actions #6

Updated by Peter Amstutz 10 months ago

  • Target version changed from Development 2025-08-06 to Future