Bug #12073
closed[Node manager] Clean up stale arvados node records
100%
Description
If a node goes away without node manager doing clean_arvados_node(), it will remain in the nodes table tying up a slot.
Node manager should detect when there is a node in the nodes table that doesn't correspond to a real cloud node and clean up the stale node record.
Updated by Peter Amstutz over 7 years ago
- Target version changed from 2017-08-30 Sprint to 2017-09-13 Sprint
Updated by Peter Amstutz over 7 years ago
- Target version changed from 2017-09-13 Sprint to 2017-08-30 Sprint
Updated by Tom Morris over 7 years ago
- Target version changed from 2017-08-30 Sprint to 2017-09-13 Sprint
Updated by Lucas Di Pentima over 7 years ago
- Assigned To set to Lucas Di Pentima
Updated by Lucas Di Pentima over 7 years ago
- Status changed from New to In Progress
Updated by Lucas Di Pentima over 7 years ago
Following a related ticket note: https://dev.arvados.org/issues/12064#note-2
Is it safe to assume that the correct way to detect dangling arvados node records should be the following?:
- Ask the api client for all node records with a
slot_number
assigned - For every node record, check using its status with
sinfo
- If the status is
drain*
, reset the node record
This would be implemented on a on_start()
method on the ArvadosNodeListMonitorActor
class.
Updated by Peter Amstutz over 7 years ago
Lucas Di Pentima wrote:
Following a related ticket note: https://dev.arvados.org/issues/12064#note-2
Is it safe to assume that the correct way to detect dangling arvados node records should be the following?:
- Ask the api client for all node records with a
slot_number
assigned- For every node record, check using its status with
sinfo
- If the status is
drain*
, reset the node recordThis would be implemented on a
on_start()
method on theArvadosNodeListMonitorActor
class.
This looks about right. One additional consideration is you should make sure the record is "stale" (the node record in the arvados table hasn't been touched for at least node_stale_after
seconds, by default this is two hours.)
Updated by Lucas Di Pentima over 7 years ago
Is the nodemanager periodically restarted? If some cloud node goes down and the nodemanager immediately restarts before cleaning up its record, that record won't be stale and won't get cleaned up unless nodemanager get restarted later on.
Updated by Peter Amstutz over 7 years ago
Lucas Di Pentima wrote:
Is the nodemanager periodically restarted? If some cloud node goes down and the nodemanager immediately restarts before cleaning up its record, that record won't be stale and won't get cleaned up unless nodemanager get restarted later on.
So maybe instead of running once on startup, this should be implemented as an Actor which runs every couple of hours and sleeps in between?
Updated by Peter Amstutz over 7 years ago
Peter Amstutz wrote:
Lucas Di Pentima wrote:
Is the nodemanager periodically restarted? If some cloud node goes down and the nodemanager immediately restarts before cleaning up its record, that record won't be stale and won't get cleaned up unless nodemanager get restarted later on.
So maybe instead of running once on startup, this should be implemented as an Actor which runs every couple of hours and sleeps in between?
(04:09:07 PM) tetron: lucas: actually, I think the right place might be in daemon.py in update_arvados_nodes
(04:09:32 PM) tetron: that's where it actually has access to the list of both cloud nodes and arvados nodes
(04:09:57 PM) tetron: and can determine if there's an arvados node which is tying up a slot, but doesn't have a corresponding cloud node, and is stale
(04:15:04 PM) lucas: tetron: Ok, I think that method is the one that it's executed when the ArvadosNodeListMonitorActor poller returns the nodelist. I should check those with a crunch_worker_state == 'down' in that case because the 'drain*' state isn't taken as a special case on _send_request().
(04:16:05 PM) tetron: yes
(04:18:39 PM) tetron: yes, if a node record is down, and stale, but has a slot assigned, it needs to be cleaned
(04:20:07 PM) tetron: lucas: you'll want to use something like ComputeNodeUpdateActor to do the actual Arvados request to avoid blocking the daemon thread
Updated by Lucas Di Pentima over 7 years ago
- Target version changed from 2017-09-13 Sprint to 2017-09-27 Sprint
Updated by Lucas Di Pentima over 7 years ago
Updates at 0f1bac0cc - branch 12073-nodemanager-stale-nodes-recs
Test run: https://ci.curoverse.com/job/developer-run-tests/463/
- Added a new actor that takes care of sending the arvados records update requests to the API Server when the
NodeManagerDaemonActor
decides that some stale node record should be cleaned up. - Added related tests.
Updated by Lucas Di Pentima over 7 years ago
Re-ran tests to see if they still failed, but they finished ok: https://ci.curoverse.com/job/developer-run-tests/464/
Updated by Peter Amstutz over 7 years ago
if node['crunch_worker_state'] == 'down' and \ node['slot_number'] and \ arvados_node_missing(node, self.node_stale_after):
Should be node['slot_number'] is None
, because bool(node['slot_number'])
when node['slot_number'] == 0
evaluates to False
.
In doc/*.cfg
there is a comment about the value of "Node stale time":
# "Node stale time" affects two related behaviors. # 1. If a compute node has been running for at least this long, but it # isn't paired with an Arvados node, do not shut it down, but leave it alone. # This prevents the node manager from shutting down a node that might # actually be doing work, but is having temporary trouble contacting the # API server. # 2. When the Node Manager starts a new compute node, it will try to reuse # an Arvados node that hasn't been updated for this long.
Part (1.) is wrong (it may have done that at some point, but I'm pretty sure doesn't do that any more.) Then it needs to mention the new behavior of clearing stale node records.
There's a copy of these sample configuration files in arvados/doc/install/install-nodemanager.html.textile.liquid
which unfortunately makes six places the comment needs to be updated :-/ Maybe the arvados/doc/_includes
could have a relative symlink to services/nodemanager/docs/*.cfg
and then use {% include 'azure.example.cfg' %}
.
It feels like there is a potential race between _arvados_node_cleaner.clean_node()
and ComputeNodeSetupActor
. Example: a call to clean_node()
is delayed, meanwhile the record is assigned to ComputeNodeSetupActor
and paired, then clean_node()
comes in later and wipes the node record. The find_stale_node()
method should check slot_numer is None
to ensure that it doesn't assign a record that might be cleaned later.
On further consideration, the clean_node()
process should probably be the main process responsible for cleaning stale node record, find_stale_node()
should only pick up cleaned nodes, and ComputeNodeSetupActor
shouldn't need to call _clean_arvados_node()
at all.
Similarly, the daemon callback method node_finished_shutdown
could call clean_node()
on a successful shutdown instead of having it be part of the ComputeNodeShutdownActor
.
Updated by Peter Amstutz over 7 years ago
So to clarify:
- A node record should get cleaned by
NodeManagerDaemonActor.update_arvados_nodes
whenmodified_at
is older thannode_stale_time
,crunch_worker_state in ('down', None)
and any ofslot_number
,hostname
,ip_address
,first_ping_at
,last_ping_at
are notNone
. - The node record also gets cleaned immediately by
NodeManagerDaemonActor.node_finished_shutdown
on successful shutdown. - A stale record is candidate for reuse when
modified_at
is older thannode_stale_time
,crunch_worker_state in ('down', None) and all of @slot_number
,hostname
,ip_address
,first_ping_at
,last_ping_at
areNone
. - ComputeNodeSetupActor and ComputeNodeShutdownActor should not longer clean node records. (ComputeNodeSetupActor may still create a new one if no record is provided to it).
Updated by Tom Morris over 7 years ago
I've also seen this table with hundreds of records with a state of 'startup-fail'. Are these candidates for cleanup too?
Updated by Peter Amstutz over 7 years ago
On further thought, I realize there's a reason the SetupActor performs the record cleaning, which we need to maintain.
Node records effectively have the following states:
- Unassigned: node record is available for use
- Assigned: node record is reserved to a node that is being created, but hasn't pinged yet
- In use: node is assigned a slot and pinging the API server every minute
- Not in use: recently shut down, slot is null
- Stale: node has not pinged in some time, but is taking up a slot
Updated by Lucas Di Pentima over 7 years ago
Updates at 6ec328c00
Test run: https://ci.curoverse.com/job/developer-run-tests/468/
- Updated documentation to include nodemanager cloud config files examples from
services/nodemanager/doc/
directory. - Updated criteria by which an arvados node record is considered stale.
- Added node record cleanup call on successful cloud node shutdown.
- Added related test.
Updated by Peter Amstutz over 7 years ago
Sorry, I had to cut the previous thought short.
Currently the "clean" behavior of SetupActor has the effect of touching the Node record (updating modified_at). Because the node record reuse logic picks a record where both modified_at is at least 2 hours old and slot is null, this protects us from assigning the record to another node even if node manager is restarted.
If node creation fails, the node record won't be available to reassign for 2 hours (if a lot of node creation attempts fail, it can result in creating 1000s of extra node records.)
The slot is assigned when the nodes successfully pings.
The "clean" behavior of ShutdownActor is currently responsible for releasing the slot. If node manager is restarted between sending the request to destroy the the node and cleaning the record, then the slot remains non-null and the record can't be reused. Also, because modified_at is updated, even cleaned node records for recently shut down nodes won't be reused for two hours.
Effectively the current node manager treats nodes as having a two-hour lock, and avoids modifying nodes which have been modified more recently than that, with the single exception of nodes that have been explicitly shut down.
So desired node record states are:
- Cleaned: "clean" state is indicated, safe to assign
- Assigned: "assigned" state is indicated, recently modified, slot is null
- In use: "assigned" state is indicated, recently modified, slot is set
- Stale: "assigned" state is indicated, not recently modified, should be cleaned
The SetupActor should put the node record into the "Assigned" state, the ShutdownActor should put the record back into the "Cleaned" state. We don't have explicit a state field on the Node record, but could define it through some combination of fields.
Updated by Lucas Di Pentima over 7 years ago
- Target version changed from 2017-09-27 Sprint to 2017-10-11 Sprint
Updated by Lucas Di Pentima over 7 years ago
Updates at 96da34b18 - new branch: 12073-nodemanager-stale-nodes-recs-bis
Test run: https://ci.curoverse.com/job/developer-run-tests/473/
After some thought, I came up with a fairly simple fix, that I believe avoid race conditions and don't try to solve the bigger problem that causes this issue.
I think the main cause of this is that nodemanager only creates or updates node records (never deletes), and because of the 2 hour window that any record has before being considered as stale, there are times when problems make the nodemanager create new records because it doesn't have any stale ones (all modified_at recently, for example) so the node record count grows over time.
To this previous issue we have to add the fact that sometimes a node record isn't cleaned after a node shutdown, and it is kept on the database as stale and with a slot number assigned, mixed with lots of other stale and slot_number-less.
So, what I believe is happening is that find_stale_node()
is randomically returning the first (of a lot) stale node, and over time the ones with a slot number are more of a minority against the other kind, until there are no more slots and node creation start to fail.
Quick Solution: I've updated find_stale_node()
to sort the node record list before checking for the first stale one, prioritizing those with a slot number assigned.
We'll have to think how to safely remove those stale node records from the api server, but I think this can help avoid the issue reported here.
Updated by Peter Amstutz over 7 years ago
Lucas Di Pentima wrote:
Updates at 96da34b18 - new branch:
12073-nodemanager-stale-nodes-recs-bis
Test run: https://ci.curoverse.com/job/developer-run-tests/473/After some thought, I came up with a fairly simple fix, that I believe avoid race conditions and don't try to solve the bigger problem that causes this issue.
I think the main cause of this is that nodemanager only creates or updates node records (never deletes), and because of the 2 hour window that any record has before being considered as stale, there are times when problems make the nodemanager create new records because it doesn't have any stale ones (all modified_at recently, for example) so the node record count grows over time.
To this previous issue we have to add the fact that sometimes a node record isn't cleaned after a node shutdown, and it is kept on the database as stale and with a slot number assigned, mixed with lots of other stale and slot_number-less.So, what I believe is happening is that
find_stale_node()
is randomically returning the first (of a lot) stale node, and over time the ones with a slot number are more of a minority against the other kind, until there are no more slots and node creation start to fail.Quick Solution: I've updated
find_stale_node()
to sort the node record list before checking for the first stale one, prioritizing those with a slot number assigned.We'll have to think how to safely remove those stale node records from the api server, but I think this can help avoid the issue reported here.
While this doesn't totally avoid the situation of booting nodes that can't be assigned slots (since there is a 2 hour window where a slot might remain tied up), it should mitigate the problem a bit. LGTM @ 96da34b1888638648c291d989d990b4e5738db88
Updated by Anonymous over 7 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:1339298d5812df668e08d9e77d595012cffd3171.