Bug #7286 (closed): [Node Manager] Should recognize and shut down broken nodes
Status: Resolved
Priority: Normal
Assigned To:
Category: Node Manager
Target version:
Start date: 09/10/2015
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 1.0
Description
"Broken" means the cloud provider asserts that it is broken, and one of the following is true:
- The cloud node is unpaired, and at least boot_fail_after seconds old.
- The cloud node is paired, and the associated Arvados record has status "missing".
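A rough sketch of that combined check in Python (node_is_broken, the driver's broken() method, the arvados_node dict, and cloud_node.created_at are all illustrative names, not the actual Node Manager interfaces):

    import time

    def node_is_broken(driver, cloud_node, arvados_node, boot_fail_after):
        # The cloud provider itself must assert that the node is broken.
        if not driver.broken(cloud_node):
            return False
        if arvados_node is None:
            # Unpaired node: broken only once it is at least
            # boot_fail_after seconds old.
            age = time.time() - cloud_node.created_at  # created_at assumed
            return age >= boot_fail_after
        # Paired node: broken when the Arvados record reports "missing".
        return arvados_node.get('status') == 'missing'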
Steps:
- Add a method to node drivers that takes a cloud node record as an argument and returns True if the record indicates the node is broken, False otherwise (first sketch after this list).
- ComputeNodeMonitorActor suggests its node for shutdown if this new method returns True and one of the conditions above holds (second sketch after this list).
- Remove the shutdown_unpaired_node logic from NodeManagerDaemonActor, since the previous step effectively moves it into ComputeNodeMonitorActor.
- Update the daemon's _nodes_wanted/_nodes_excess math so that we boot replacements for nodes in this failed state unless we're at max_nodes (last sketch below). You could do this by simply counting them in _nodes_busy, but be careful: we don't want to accidentally revert e81a3e9b. The daemon should be able to affirmatively know that the node has completely failed.
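For the first step, a minimal sketch of the driver-side method, assuming a libcloud-style node object whose state field carries the provider's health assertion (which state values mean "broken" varies per provider, so ERROR/UNKNOWN below is an assumption):

    from libcloud.compute.types import NodeState

    class BaseComputeNodeDriver(object):
        def broken(self, cloud_node):
            # Conservative default: subclasses override this with
            # provider-specific checks.
            return False

    class ExampleCloudDriver(BaseComputeNodeDriver):
        def broken(self, cloud_node):
            # Treat provider-reported error/unknown states as broken.
            return cloud_node.state in (NodeState.ERROR, NodeState.UNKNOWN)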
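For the second and third steps, one way the monitor actor's eligibility check could fold in the new method, reusing the node_is_broken() helper sketched above (a simplified sketch, not the actor's real interface):

    import pykka

    class ComputeNodeMonitorActor(pykka.ThreadingActor):  # simplified
        def shutdown_eligible(self):
            # Suggest shutdown when the provider says the node is broken
            # and one of the two conditions from the description holds.
            if node_is_broken(self._cloud, self.cloud_node,
                              self.arvados_node, self.boot_fail_after):
                return True
            # ... the existing idle/busy eligibility checks continue here ...
            return False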
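For the last step, one way to keep the arithmetic explicit so the daemon affirmatively knows a node has failed, instead of folding broken nodes into _nodes_busy (every name except _nodes_wanted and max_nodes is illustrative):

    def _nodes_wanted(self, size):
        up = self._nodes_up(size)          # booting + running; assumed helper
        broken = self._nodes_broken(size)  # monitor-confirmed failures; assumed
        # Subtract confirmed-broken nodes from the usable pool so a
        # replacement boots, but never let the total node count (including
        # the broken node awaiting shutdown) exceed max_nodes.
        wanted = self._size_wishlist(size) - (up - broken)
        return max(0, min(wanted, self.max_nodes - up))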