Story #4293
[Node Manager] Write off cloud nodes that spend too long in booted state (closed)
Added by Brett Smith about 10 years ago. Updated about 10 years ago.
100%
Description
If the cloud has an internal error starting a node, Node Manager won't shut it down until the normal shutdown window opens. There should be a separate timer for this case: if a cloud node doesn't appear in the node listing within N minutes (configurable), assume it failed to start, and shut it down.
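A minimal sketch of the timer described above, not Node Manager's actual code. The class name `BootWatchdog`, the `boot_fail_after` parameter, and the method names are hypothetical, chosen only for illustration. It records when each node finished setup, forgets nodes once they show up in a listing, and reports the ones that have waited longer than the configured limit.

```python
import time


class BootWatchdog:
    """Track nodes that finished setup but are not yet in the cloud listing.

    Hypothetical helper for illustration; names are assumptions.
    """

    def __init__(self, boot_fail_after, clock=time.time):
        self.boot_fail_after = boot_fail_after  # seconds (the "N minutes" setting)
        self.clock = clock                      # injectable for testing
        self.booted_at = {}                     # node id -> time setup finished

    def node_booted(self, node_id):
        # Start the clock for a node whose setup just completed.
        self.booted_at[node_id] = self.clock()

    def update_listing(self, listed_ids):
        # A node that appears in the listing is no longer waiting.
        for node_id in listed_ids:
            self.booted_at.pop(node_id, None)

    def nodes_to_write_off(self):
        # Nodes that have waited at least boot_fail_after seconds.
        deadline = self.clock() - self.boot_fail_after
        return sorted(n for n, t in self.booted_at.items() if t <= deadline)
```

A caller would run `nodes_to_write_off()` on a timer and start a shutdown for each node it returns.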
Updated by Ward Vandewege about 10 years ago
- Target version changed from Bug Triage to Arvados Future Sprints
Updated by Ward Vandewege about 10 years ago
- Target version changed from Arvados Future Sprints to 2014-11-19 sprint
Updated by Brett Smith about 10 years ago
- Story points changed from 1.0 to 1.5
Fixing this effectively requires fixing #4322, which was originally filed separately. Adjusting this story to account for this.
Updated by Brett Smith about 10 years ago
- Target version changed from 2014-11-19 sprint to 2014-12-10 sprint
Updated by Brett Smith about 10 years ago
- Status changed from New to In Progress
Updated by Peter Amstutz about 10 years ago
A few comments:
- It's a little confusing to have "cloud_nodes", "booting", "booted", and "shutdowns", where the state of the node depends on which collection it is in. (From the comments I see that the different dicts don't quite hold the same thing, so maybe it's justified, but it seems more complex than representing a node as a single record whose state changes over time.) In particular, this is confusing, as it is not obvious why the "cloud_nodes" and "booted" sets should be disjoint:
for record_dict in [self.cloud_nodes, self.booted]:
- Arvados nodes get paired with cloud nodes based on IP address. Is it possible that a (reused) Arvados node record could have a stale IP address and get a bogus pairing because the compute node IP address gets reused?
- Is there a race condition if the node starts talking to Arvados after the "node is taking too long" shutdown is initiated?
Updated by Brett Smith about 10 years ago
Peter Amstutz wrote:
A few comments:
- It's a little confusing to have "cloud_nodes", "booting", "booted", and "shutdowns", where the state of the node depends on which collection it is in. (From the comments I see that the different dicts don't quite hold the same thing, so maybe it's justified, but it seems more complex than representing a node as a single record whose state changes over time.) In particular, this is confusing, as it is not obvious why the "cloud_nodes" and "booted" sets should be disjoint:
[...]
"booted" nodes are ones that have finished going through the setup process, but haven't appeared in the listing of cloud nodes yet (i.e., we're waiting for eventual consistency). "cloud_nodes" have appeared in a listing. Since this method fires on a timer, we don't know how far in the boot process it's gotten, so we need to look for it in both places.
I agree that the daemon has gotten hairier than I'd really like, and I'd like to have an excuse to clean it up. But self.booted was added in an earlier branch, and there's no reason to deal with it in this one.
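To illustrate the point about looking in both places, here is a rough sketch built around the loop quoted above. The function name `find_record` and the plain-dict arguments are assumptions for illustration; the real daemon keeps these as attributes on itself.

```python
def find_record(node_id, cloud_nodes, booted):
    """Return the record for node_id from whichever collection holds it.

    cloud_nodes: nodes that have appeared in a cloud listing.
    booted: nodes that finished setup but are not yet listed.
    A timer callback cannot know which stage the node has reached,
    so it checks both. Returns None if the node is in neither.
    """
    for record_dict in [cloud_nodes, booted]:
        if node_id in record_dict:
            return record_dict[node_id]
    return None
```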
- Arvados nodes get paired with cloud nodes based on IP address. Is it possible that a (reused) Arvados node record could have a stale IP address and get a bogus pairing because the compute node IP address gets reused?
No, because Node Manager clears the IP address and other fields before reusing the record. See ComputeNodeSetupActor.prepare_arvados_node.
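As a sketch of why clearing the record prevents a bogus pairing (this is not the body of `prepare_arvados_node`; the helper names and the `ip_address` and `hostname` field names below are assumptions for illustration):

```python
def prepare_for_reuse(record, stale_fields=("ip_address", "hostname")):
    """Return a copy of an Arvados node record with stale fields cleared."""
    cleaned = dict(record)
    for field in stale_fields:
        cleaned[field] = None
    return cleaned


def pair_by_ip(cloud_ip, arvados_records):
    """Return the first record whose IP matches cloud_ip, or None.

    Records with a cleared (None) address never match.
    """
    for record in arvados_records:
        ip = record.get("ip_address")
        if ip is not None and ip == cloud_ip:
            return record
    return None
```

Once the address is cleared, a new cloud node that happens to receive the same IP cannot match the reused record until that record is updated with a fresh address.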
- Is there a race condition if the node starts talking to Arvados after the "node is taking too long" shutdown is initiated?
Within Node Manager itself, no. It will stop the corresponding ComputeNodeMonitorActor when the node disappears from the cloud listing, regardless of the node's state, and not before. So if the node pairs with an Arvados node in a later update, the ComputeNodeMonitorActor will successfully receive that update, then be shut down later when its shutdown registers in the cloud node listing.
In the larger Arvados context, it's possible that Arvados (e.g., Crunch) will try to start working with the node in between the time it pairs and the time it's shut down, but I think Crunch has to be responsible for dealing with that kind of failure. Node Manager can't tell Arvados anything about the shutdown, because what creates this situation is that there's no meaningful record of the node in Arvados to talk about.
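The ordering described above can be sketched roughly as follows. This is a simplified stand-in, not the actual actor code: the `Monitor` class and `sync_monitors` function are hypothetical, and plain objects replace actors. The point it shows is that a monitor is stopped only when its node drops out of the cloud listing, so a late pairing update still reaches it.

```python
class Monitor:
    """Simplified stand-in for a per-node monitor (hypothetical)."""

    def __init__(self, cloud_id):
        self.cloud_id = cloud_id
        self.arvados_node = None
        self.running = True

    def pair(self, arvados_node):
        # Pairing updates are accepted for as long as the monitor runs.
        if self.running:
            self.arvados_node = arvados_node

    def stop(self):
        self.running = False


def sync_monitors(monitors, listed_ids):
    """Stop and drop monitors whose nodes are gone from the cloud listing.

    Node state (booting, shutting down, ...) is deliberately not consulted.
    """
    for cloud_id in list(monitors):
        if cloud_id not in listed_ids:
            monitors.pop(cloud_id).stop()
```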
Updated by Peter Amstutz about 10 years ago
Brett Smith wrote:
No, because Node Manager clears the IP address and other fields before reusing the record. See ComputeNodeSetupActor.prepare_arvados_node.
I was actually thinking of stale records in the nodes table, but on further thought presumably the arvados_nodes list in NodeManager only includes the records where the last ping time is up to date.
The rest of it looks good to me.
Updated by Brett Smith about 10 years ago
Peter Amstutz wrote:
Brett Smith wrote:
No, because Node Manager clears the IP address and other fields before reusing the record. See ComputeNodeSetupActor.prepare_arvados_node.
I was actually thinking of stale records in the nodes table, but on further thought presumably the arvados_nodes list in NodeManager only includes the records where the last ping time is up to date.
It doesn't. You're right, there is a bug here. But it predates this branch and can happen even when nodes come up from sources outside of Node Manager's control. Created #4751 to track this.
Updated by Brett Smith about 10 years ago
- Status changed from In Progress to Resolved
- % Done changed from 50 to 100
Applied in changeset arvados|commit:8141501a6ef0a3cf4f40da14671c31c0257472e4.