Bug #7286 (closed): [Node Manager] Should recognize and shut down broken nodes
Status: Resolved
Priority: Normal
Assigned To:
Category: Node Manager
Target version:
Start date: 09/10/2015
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 1.0
Description
"Broken" means the cloud provider asserts that it is broken, and one of the following is true:
- The cloud node is unpaired, and at least boot_fail_after seconds old.
- The cloud node is paired, and the associated Arvados record has status "missing".
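A rough sketch of that combined check in Python (node_is_broken, the driver's broken() method, the arvados_node dict, and cloud_node.created_at are all illustrative names, not the actual Node Manager interfaces):

    import time

    def node_is_broken(driver, cloud_node, arvados_node, boot_fail_after):
        # The cloud provider itself must assert that the node is broken.
        if not driver.broken(cloud_node):
            return False
        if arvados_node is None:
            # Unpaired node: broken only once it is at least
            # boot_fail_after seconds old.
            age = time.time() - cloud_node.created_at  # created_at assumed
            return age >= boot_fail_after
        # Paired node: broken when the Arvados record reports "missing".
        return arvados_node.get('status') == 'missing'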
Steps:
- Add a method to node drivers that takes a cloud node record as an argument and returns True if the record indicates the node is broken, False otherwise (first sketch after this list).
- ComputeNodeMonitorActor suggests its node for shutdown if this new method returns True and one of the conditions above holds (second sketch after this list).
- Remove the shutdown_unpaired_node logic from NodeManagerDaemonActor, since the previous step effectively moves it into ComputeNodeMonitorActor.
- Update the daemon's _nodes_wanted/_nodes_excess math so that we boot replacements for nodes in this failed state unless we're at max_nodes (last sketch below). You could do this by simply counting them in _nodes_busy, but be careful: we don't want to accidentally revert e81a3e9b. The daemon should be able to affirmatively know that the node has completely failed.
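For the first step, a minimal sketch of the driver-side method, assuming a libcloud-style node object whose state field carries the provider's health assertion (which state values mean "broken" varies per provider, so ERROR/UNKNOWN below is an assumption):

    from libcloud.compute.types import NodeState

    class BaseComputeNodeDriver(object):
        def broken(self, cloud_node):
            # Conservative default: subclasses override this with
            # provider-specific checks.
            return False

    class ExampleCloudDriver(BaseComputeNodeDriver):
        def broken(self, cloud_node):
            # Treat provider-reported error/unknown states as broken.
            return cloud_node.state in (NodeState.ERROR, NodeState.UNKNOWN)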
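For the second and third steps, one way the monitor actor's eligibility check could fold in the new method, reusing the node_is_broken() helper sketched above (a simplified sketch, not the actor's real interface):

    import pykka

    class ComputeNodeMonitorActor(pykka.ThreadingActor):  # simplified
        def shutdown_eligible(self):
            # Suggest shutdown when the provider says the node is broken
            # and one of the two conditions from the description holds.
            if node_is_broken(self._cloud, self.cloud_node,
                              self.arvados_node, self.boot_fail_after):
                return True
            # ... the existing idle/busy eligibility checks continue here ...
            return False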
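For the last step, one way to keep the arithmetic explicit so the daemon affirmatively knows a node has failed, instead of folding broken nodes into _nodes_busy (every name except _nodes_wanted and max_nodes is illustrative):

    def _nodes_wanted(self, size):
        up = self._nodes_up(size)          # booting + running; assumed helper
        broken = self._nodes_broken(size)  # monitor-confirmed failures; assumed
        # Subtract confirmed-broken nodes from the usable pool so a
        # replacement boots, but never let the total node count (including
        # the broken node awaiting shutdown) exceed max_nodes.
        wanted = self._size_wishlist(size) - (up - broken)
        return max(0, min(wanted, self.max_nodes - up))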