Story #8000
Status: closed
[Node Manager] Shut down nodes in SLURM 'down' state
Description
Apparently Node Manager only shuts down nodes that are "idle" in SLURM; if they are "down", they don't get shut down?
2015-12-11_20:41:05.08909 2015-12-11 20:41:05 arvnodeman.cloud_nodes[11545] DEBUG: CloudNodeListMonitorActor (at 140548410010704) got response with 1 items
2015-12-11_20:41:05.09007 2015-12-11 20:41:05 arvnodeman.daemon[11545] INFO: Registering new cloud node /subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-tj4hwdsw3yjiyjt-c97qk
2015-12-11_20:41:05.09273 2015-12-11 20:41:05 pykka[11545] DEBUG: Registered ComputeNodeMonitorActor (urn:uuid:83697dab-e718-4fd5-8595-b6563015585c)
2015-12-11_20:41:05.09280 2015-12-11 20:41:05 pykka[11545] DEBUG: Starting ComputeNodeMonitorActor (urn:uuid:83697dab-e718-4fd5-8595-b6563015585c)
2015-12-11_20:41:05.09391 2015-12-11 20:41:05 arvnodeman.computenode[11545] DEBUG: Node /subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-tj4hwdsw3yjiyjt-c97qk suggesting shutdown.
2015-12-11_20:41:05.09584 2015-12-11 20:41:05 arvnodeman.cloud_nodes[11545] DEBUG: <pykka.proxy._CallableProxy object at 0x7fd3f81b0850> subscribed to events for '/subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-tj4hwdsw3yjiyjt-c97qk'
2015-12-11_20:41:05.09804 2015-12-11 20:41:05 arvnodeman.daemon[11545] INFO: Cloud node /subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-tj4hwdsw3yjiyjt-c97qk has associated with Arvados node c97qk-7ekkf-tj4hwdsw3yjiyjt
2015-12-11_20:41:05.09921 2015-12-11 20:41:05 arvnodeman.computenode[11545] DEBUG: Node /subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-tj4hwdsw3yjiyjt-c97qk shutdown window open but node busy.
2015-12-11_20:41:05.10064 2015-12-11 20:41:05 arvnodeman.arvados_nodes[11545] DEBUG: <pykka.proxy._CallableProxy object at 0x7fd3f8e11250> subscribed to events for 'c97qk-7ekkf-tj4hwdsw3yjiyjt'
$ arv node get -u c97qk-7ekkf-tj4hwdsw3yjiyjt
{
 "href":"/nodes/c97qk-7ekkf-tj4hwdsw3yjiyjt",
 "kind":"arvados#node",
 "etag":"984qlz3msed6utdnndclhuz0o",
 "uuid":"c97qk-7ekkf-tj4hwdsw3yjiyjt",
 "owner_uuid":"c97qk-tpzed-000000000000000",
 "created_at":"2015-09-09T14:26:19.832861000Z",
 "modified_by_client_uuid":null,
 "modified_by_user_uuid":"c97qk-tpzed-000000000000000",
 "modified_at":"2015-12-11T20:58:01.734010000Z",
 "hostname":"compute0",
 "domain":"c97qk.arvadosapi.com",
 "ip_address":"10.25.64.10",
 "last_ping_at":"2015-12-11T20:58:01.734010000Z",
 "slot_number":0,
 "status":"running",
 "job_uuid":null,
 "crunch_worker_state":"down",
 "properties":{
  "cloud_node":{
   "price":0,
   "size":"Standard_D1"
  },
  "total_cpu_cores":1,
  "total_ram_mb":3442,
  "total_scratch_mb":51172
 },
 "first_ping_at":"2015-12-08T02:17:01.949316000Z",
 "info":{
  "ec2_instance_id":"/subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-tj4hwdsw3yjiyjt-c97qk",
  "last_action":"Prepared by Node Manager",
  "ping_secret":"35vaizroj3kkoqzm2vad92t6fewg7hbdix8jgj0wpklh3rdo4v",
  "slurm_state":"down"
 },
 "nameservers":[
  "10.25.0.6"
 ]
}
PARTITION AVAIL  TIMELIMIT  NODES STATE  NODELIST
compute*     up   infinite      2 drain* compute[2-3]
compute*     up   infinite    252 down*  compute[1,4-14,16-255]
compute*     up   infinite      1 idle   compute15
compute*     up   infinite      1 down   compute0
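For context, the behavior described above amounts to an eligibility check keyed on the SLURM state alone. The following is a minimal illustrative sketch in Python, not the actual arvnodeman code; the names shutdown_eligible and arvados_node are hypothetical.

def shutdown_eligible(arvados_node, shutdown_window_open):
    # Hypothetical sketch: treat a node as shutdown-eligible only while the
    # shutdown window is open and SLURM reports it 'idle'. With this rule a
    # 'down' node never qualifies, which matches the "shutdown window open
    # but node busy" message in the log above.
    slurm_state = arvados_node['info'].get('slurm_state')
    return shutdown_window_open and slurm_state == 'idle'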
Updated by Brett Smith about 9 years ago
- Subject changed from [NodeManager] shuts down 'idle' nodes but not 'down' nodes to [Node Manager] Does not shut down nodes in SLURM 'down' state
- Category set to Node Manager
This was discussed, and it was the desired behavior at the time the code was written. The thinking then was that a node being down in SLURM may just mean there's a network issue, and plenty of jobs can do their compute work without network access just fine, so it's better to leave the node up and try to let the work finish than to shut it down. An admin will intervene if necessary.
Since then:
- Now that we have Node Manager, admins want to intervene less.
- Nobody's said it in so many words, but I think we've shifted our philosophy about how to handle weird cases from "avoid doing anything that might interrupt compute work" to "get the cluster into a known-good state ASAP."
- Given what I know about SLURM now, it's not clear to me that compute work can continue successfully even against transient network failures. It seems more likely that, in that case, SLURM will note the node failure and cancel the job allocation.
If all of that makes sense to everyone else, I agree we should change the behavior in this case.
Updated by Tom Clegg about 9 years ago
I'd say "slurm says node is down but everything will be fine if we're lucky" was somewhat true before we figured out that we needed to flatten the slurm node-communication tree.
Updated by Brett Smith about 9 years ago
- Target version set to Arvados Future Sprints
Updated by Brett Smith about 9 years ago
- Tracker changed from Bug to Story
- Subject changed from [Node Manager] Does not shut down nodes in SLURM 'down' state to [Node Manager] Shut down nodes in SLURM 'down' state
Updated by Peter Amstutz almost 8 years ago
- Status changed from New to Resolved
This was fixed in #8953 with the addition of an explicit state transition table.
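As a rough illustration only (hypothetical names and values; see #8953 for the actual implementation), an explicit table makes the shutdown decision for each SLURM state visible in one place instead of being implied by a single 'idle' check:

# Hypothetical sketch of a per-state shutdown policy. The real table in #8953
# also factors in things like whether the shutdown window is open and whether
# the cloud node ever paired with an Arvados node record.
SLURM_STATE_OK_TO_SHUT_DOWN = {
    'idle': True,    # no work assigned
    'down': True,    # the case reported in this story
    'alloc': False,  # running a job
    'drng': False,   # draining while a job finishes
}

def ok_to_shut_down(slurm_state):
    # Unknown or unexpected states default to leaving the node alone.
    return SLURM_STATE_OK_TO_SHUT_DOWN.get(slurm_state, False)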
Updated by Tom Morris over 6 years ago
- Target version deleted (Arvados Future Sprints)