Story #8000
[Node Manager] Shut down nodes in SLURM 'down' state (closed)
Status: Resolved
Priority: Normal
Assigned To: -
Category: Node Manager
Target version: -
Start date:
Due date:
% Done: 0%
Estimated time:
Story points: -
Description
Apparently Node Manager only shuts down nodes that are "idle" in SLURM. Nodes that are "down" do not get shut down: in the log below, the node's shutdown window is open, but the node is reported as busy.
2015-12-11_20:41:05.08909 2015-12-11 20:41:05 arvnodeman.cloud_nodes[11545] DEBUG: CloudNodeListMonitorActor (at 140548410010704) got response with 1 items
2015-12-11_20:41:05.09007 2015-12-11 20:41:05 arvnodeman.daemon[11545] INFO: Registering new cloud node /subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-tj4hwdsw3yjiyjt-c97qk
2015-12-11_20:41:05.09273 2015-12-11 20:41:05 pykka[11545] DEBUG: Registered ComputeNodeMonitorActor (urn:uuid:83697dab-e718-4fd5-8595-b6563015585c)
2015-12-11_20:41:05.09280 2015-12-11 20:41:05 pykka[11545] DEBUG: Starting ComputeNodeMonitorActor (urn:uuid:83697dab-e718-4fd5-8595-b6563015585c)
2015-12-11_20:41:05.09391 2015-12-11 20:41:05 arvnodeman.computenode[11545] DEBUG: Node /subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-tj4hwdsw3yjiyjt-c97qk suggesting shutdown.
2015-12-11_20:41:05.09584 2015-12-11 20:41:05 arvnodeman.cloud_nodes[11545] DEBUG: <pykka.proxy._CallableProxy object at 0x7fd3f81b0850> subscribed to events for '/subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-tj4hwdsw3yjiyjt-c97qk'
2015-12-11_20:41:05.09804 2015-12-11 20:41:05 arvnodeman.daemon[11545] INFO: Cloud node /subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-tj4hwdsw3yjiyjt-c97qk has associated with Arvados node c97qk-7ekkf-tj4hwdsw3yjiyjt
2015-12-11_20:41:05.09921 2015-12-11 20:41:05 arvnodeman.computenode[11545] DEBUG: Node /subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-tj4hwdsw3yjiyjt-c97qk shutdown window open but node busy.
2015-12-11_20:41:05.10064 2015-12-11 20:41:05 arvnodeman.arvados_nodes[11545] DEBUG: <pykka.proxy._CallableProxy object at 0x7fd3f8e11250> subscribed to events for 'c97qk-7ekkf-tj4hwdsw3yjiyjt'
$ arv node get -u c97qk-7ekkf-tj4hwdsw3yjiyjt
{
 "href":"/nodes/c97qk-7ekkf-tj4hwdsw3yjiyjt",
 "kind":"arvados#node",
 "etag":"984qlz3msed6utdnndclhuz0o",
 "uuid":"c97qk-7ekkf-tj4hwdsw3yjiyjt",
 "owner_uuid":"c97qk-tpzed-000000000000000",
 "created_at":"2015-09-09T14:26:19.832861000Z",
 "modified_by_client_uuid":null,
 "modified_by_user_uuid":"c97qk-tpzed-000000000000000",
 "modified_at":"2015-12-11T20:58:01.734010000Z",
 "hostname":"compute0",
 "domain":"c97qk.arvadosapi.com",
 "ip_address":"10.25.64.10",
 "last_ping_at":"2015-12-11T20:58:01.734010000Z",
 "slot_number":0,
 "status":"running",
 "job_uuid":null,
 "crunch_worker_state":"down",
 "properties":{
  "cloud_node":{
   "price":0,
   "size":"Standard_D1"
  },
  "total_cpu_cores":1,
  "total_ram_mb":3442,
  "total_scratch_mb":51172
 },
 "first_ping_at":"2015-12-08T02:17:01.949316000Z",
 "info":{
  "ec2_instance_id":"/subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-tj4hwdsw3yjiyjt-c97qk",
  "last_action":"Prepared by Node Manager",
  "ping_secret":"35vaizroj3kkoqzm2vad92t6fewg7hbdix8jgj0wpklh3rdo4v",
  "slurm_state":"down"
 },
 "nameservers":[
  "10.25.0.6"
 ]
}
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      2 drain* compute[2-3]
compute*     up   infinite    252  down* compute[1,4-14,16-255]
compute*     up   infinite      1   idle compute15
compute*     up   infinite      1   down compute0
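The behavior this story asks for can be sketched as an eligibility check that treats "down" the same as "idle". This is only an illustration, not Node Manager's actual code: the names `SHUTDOWN_ELIGIBLE_STATES` and `shutdown_eligible` are hypothetical, and the sketch assumes the node record is a dict shaped like the `arv node get` output above.

```python
# Hypothetical sketch; these names are not taken from arvnodeman.
# States in which a node is assumed to be doing no useful work.
SHUTDOWN_ELIGIBLE_STATES = frozenset(["idle", "down"])


def shutdown_eligible(arvados_node, window_open):
    """Return True if the node may be shut down.

    arvados_node: dict shaped like the `arv node get` output in this story.
    window_open:  whether the node's shutdown window is currently open.
    """
    if not window_open:
        return False
    # A node with a job assigned is never eligible, whatever its state.
    if arvados_node.get("job_uuid") is not None:
        return False
    # Treat "down" like "idle" instead of reporting the node as busy.
    return arvados_node.get("crunch_worker_state") in SHUTDOWN_ELIGIBLE_STATES


# compute0 from the record above: no job assigned, worker state "down".
compute0 = {"hostname": "compute0", "job_uuid": None,
            "crunch_worker_state": "down"}
print(shutdown_eligible(compute0, window_open=True))
```

With a check like this, compute0 would be shut down once its window opens, rather than logging "shutdown window open but node busy".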