Bug #8149

node failure compute3 and compute2

Added by Bryan Cosca about 10 years ago. Updated about 6 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: -
Target version: -
Story points: -

Description

https://workbench.wx7k5.arvadosapi.com/pipeline_instances/wx7k5-d1hrv-t3sv3hy0buij2ku#Log

2016-01-07_14:55:34 srun: error: Node failure on compute3
2016-01-07_14:55:34 salloc: error: Node failure on compute3
2016-01-07_14:55:35 salloc: Job allocation 4503 has been revoked.
2016-01-07_14:55:35 salloc: error: Node failure on compute2
2016-01-07_14:55:35 srun: error: Node failure on compute2
2016-01-07_14:55:35 srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-01-07_14:55:35 srun: error: compute3: task 1: Terminated
2016-01-07_14:55:35 srun: error: compute2: task 0: Killed
2016-01-07_14:55:35 wx7k5-8i9sb-d0lg6x2zc447hj4 4590 Installing Docker image from cdaec3f190665b1cb03b9f52437f013c+798 exited 15 at /usr/local/arvados/src/sdk/cli/bin/crunch-job line 444
2016-01-07_14:55:35 wx7k5-8i9sb-d0lg6x2zc447hj4 4590 Freeze not implemented
2016-01-07_14:55:35 wx7k5-8i9sb-d0lg6x2zc447hj4 4590 collate
2016-01-07_14:55:35 wx7k5-8i9sb-d0lg6x2zc447hj4 4590 collated output manifest text to send to API server is 0 bytes with access tokens
2016-01-07_14:55:39 wx7k5-8i9sb-d0lg6x2zc447hj4 4590 log collection is a97926e9cbd862803a24bec7f8358a05+83
2016-01-07_14:55:39 Died at /usr/local/arvados/src/sdk/cli/bin/crunch-job line 1786, <DATA> line 1.

#1

Updated by Nico César about 10 years ago

This is from the Node Manager log. Why did it shut the nodes down?


2016-01-07_15:02:43.60416 2016-01-07 15:02:43 arvnodeman.jobqueue[46420] DEBUG: JobQueueMonitorActor (at 140202901046032) got response with 4 items
2016-01-07_15:02:43.60597 2016-01-07 15:02:43 arvnodeman.daemon[46420] DEBUG: Standard_D14_v2: idle nodes 0, wishlist size 0
2016-01-07_15:02:43.60788 2016-01-07 15:02:43 arvnodeman.daemon[46420] DEBUG: Standard_D5_v2: idle nodes 4, wishlist size 4
2016-01-07_15:02:43.60807 2016-01-07 15:02:43 arvnodeman.daemon[46420] DEBUG: Standard_D13_v2: idle nodes 0, wishlist size 0
2016-01-07_15:02:43.60825 2016-01-07 15:02:43 arvnodeman.daemon[46420] DEBUG: Standard_D4_v2: idle nodes 0, wishlist size 0
2016-01-07_15:02:43.60850 2016-01-07 15:02:43 arvnodeman.daemon[46420] DEBUG: Standard_D12_v2: idle nodes 0, wishlist size 0
2016-01-07_15:02:43.60870 2016-01-07 15:02:43 arvnodeman.daemon[46420] DEBUG: Standard_D3_v2: idle nodes 0, wishlist size 0
2016-01-07_15:02:43.60888 2016-01-07 15:02:43 arvnodeman.daemon[46420] DEBUG: Standard_D11_v2: idle nodes 0, wishlist size 0
2016-01-07_15:02:43.60905 2016-01-07 15:02:43 arvnodeman.daemon[46420] DEBUG: Standard_D2_v2: idle nodes 0, wishlist size 0
2016-01-07_15:02:43.61068 2016-01-07 15:02:43 arvnodeman.daemon[46420] DEBUG: Standard_D1_v2: idle nodes 2, wishlist size 0
2016-01-07_15:02:44.00138 2016-01-07 15:02:44 arvnodeman.arvados_nodes[46420] DEBUG: ArvadosNodeListMonitorActor (at 140202901693200) got response with 187 items
2016-01-07_15:02:46.44856 2016-01-07 15:02:46 arvnodeman.cloud_nodes[46420] DEBUG: CloudNodeListMonitorActor (at 140202922197200) sending poll
2016-01-07_15:02:46.49473 2016-01-07 15:02:46 pykka[46420] DEBUG: Registered ComputeNodeShutdownActor (urn:uuid:c9992a0d-03f6-4cb1-a20f-2968246c48de)
2016-01-07_15:02:46.49487 2016-01-07 15:02:46 pykka[46420] DEBUG: Starting ComputeNodeShutdownActor (urn:uuid:c9992a0d-03f6-4cb1-a20f-2968246c48de)
2016-01-07_15:02:46.49593 2016-01-07 15:02:46 arvnodeman.nodedown[46420] INFO: Draining SLURM node compute2
2016-01-07_15:02:46.51610 2016-01-07 15:02:46 arvnodeman.nodedown[46420] INFO: Waiting for SLURM node compute2 to drain
2016-01-07_15:02:52.74014 2016-01-07 15:02:52 pykka[46420] DEBUG: Registered ComputeNodeShutdownActor (urn:uuid:1db691e3-cd57-4d09-93cc-fd45e26c975f)
2016-01-07_15:02:52.74029 2016-01-07 15:02:52 pykka[46420] DEBUG: Starting ComputeNodeShutdownActor (urn:uuid:1db691e3-cd57-4d09-93cc-fd45e26c975f)
2016-01-07_15:02:52.74136 2016-01-07 15:02:52 arvnodeman.nodedown[46420] INFO: Draining SLURM node compute3
2016-01-07_15:02:52.76170 2016-01-07 15:02:52 arvnodeman.nodedown[46420] INFO: Waiting for SLURM node compute3 to drain
#2

Updated by Nico César about 10 years ago

Just FYI, this is the moment where the two nodes drop out of the wishlist:

2016-01-07_14:52:23.56980 2016-01-07 14:52:23 arvnodeman.jobqueue[46420] DEBUG: JobQueueMonitorActor (at 140202901046032) sending poll
2016-01-07_14:52:23.57685 2016-01-07 14:52:23 arvnodeman.arvados_nodes[46420] DEBUG: ArvadosNodeListMonitorActor (at 140202901693200) sending poll
2016-01-07_14:52:23.62317 2016-01-07 14:52:23 arvnodeman.jobqueue[46420] DEBUG: Sending server wishlist: Standard_D5_v2, Standard_D5_v2, Standard_D5_v2, Standard_D5_v2, Standard_D1_v2, Standard_D1_v2
2016-01-07_14:52:23.62331 2016-01-07 14:52:23 arvnodeman.jobqueue[46420] DEBUG: JobQueueMonitorActor (at 140202901046032) got response with 6 items
2016-01-07_14:52:23.62477 2016-01-07 14:52:23 arvnodeman.daemon[46420] DEBUG: Standard_D14_v2: idle nodes 0, wishlist size 0
2016-01-07_14:52:23.62827 2016-01-07 14:52:23 arvnodeman.daemon[46420] DEBUG: Standard_D5_v2: idle nodes 4, wishlist size 4
2016-01-07_14:52:23.62846 2016-01-07 14:52:23 arvnodeman.daemon[46420] DEBUG: Standard_D13_v2: idle nodes 0, wishlist size 0
2016-01-07_14:52:23.62864 2016-01-07 14:52:23 arvnodeman.daemon[46420] DEBUG: Standard_D4_v2: idle nodes 0, wishlist size 0
2016-01-07_14:52:23.62884 2016-01-07 14:52:23 arvnodeman.daemon[46420] DEBUG: Standard_D12_v2: idle nodes 0, wishlist size 0
2016-01-07_14:52:23.62903 2016-01-07 14:52:23 arvnodeman.daemon[46420] DEBUG: Standard_D3_v2: idle nodes 0, wishlist size 0
2016-01-07_14:52:23.62920 2016-01-07 14:52:23 arvnodeman.daemon[46420] DEBUG: Standard_D11_v2: idle nodes 0, wishlist size 0
2016-01-07_14:52:23.62937 2016-01-07 14:52:23 arvnodeman.daemon[46420] DEBUG: Standard_D2_v2: idle nodes 0, wishlist size 0
2016-01-07_14:52:23.63088 2016-01-07 14:52:23 arvnodeman.daemon[46420] DEBUG: Standard_D1_v2: idle nodes 2, wishlist size 2
2016-01-07_14:52:24.06102 2016-01-07 14:52:24 arvnodeman.arvados_nodes[46420] DEBUG: ArvadosNodeListMonitorActor (at 140202901693200) got response with 187 items
2016-01-07_14:52:33.57009 2016-01-07 14:52:33 arvnodeman.jobqueue[46420] DEBUG: JobQueueMonitorActor (at 140202901046032) sending poll
2016-01-07_14:52:33.57603 2016-01-07 14:52:33 arvnodeman.arvados_nodes[46420] DEBUG: ArvadosNodeListMonitorActor (at 140202901693200) sending poll
2016-01-07_14:52:33.60586 2016-01-07 14:52:33 arvnodeman.jobqueue[46420] DEBUG: Sending server wishlist: Standard_D5_v2, Standard_D5_v2, Standard_D5_v2, Standard_D5_v2
2016-01-07_14:52:33.60602 2016-01-07 14:52:33 arvnodeman.jobqueue[46420] DEBUG: JobQueueMonitorActor (at 140202901046032) got response with 4 items
2016-01-07_14:52:33.60752 2016-01-07 14:52:33 arvnodeman.daemon[46420] DEBUG: Standard_D14_v2: idle nodes 0, wishlist size 0
2016-01-07_14:52:33.61107 2016-01-07 14:52:33 arvnodeman.daemon[46420] DEBUG: Standard_D5_v2: idle nodes 4, wishlist size 4
2016-01-07_14:52:33.61126 2016-01-07 14:52:33 arvnodeman.daemon[46420] DEBUG: Standard_D13_v2: idle nodes 0, wishlist size 0
2016-01-07_14:52:33.61144 2016-01-07 14:52:33 arvnodeman.daemon[46420] DEBUG: Standard_D4_v2: idle nodes 0, wishlist size 0
2016-01-07_14:52:33.61161 2016-01-07 14:52:33 arvnodeman.daemon[46420] DEBUG: Standard_D12_v2: idle nodes 0, wishlist size 0
2016-01-07_14:52:33.61178 2016-01-07 14:52:33 arvnodeman.daemon[46420] DEBUG: Standard_D3_v2: idle nodes 0, wishlist size 0
2016-01-07_14:52:33.61196 2016-01-07 14:52:33 arvnodeman.daemon[46420] DEBUG: Standard_D11_v2: idle nodes 0, wishlist size 0
2016-01-07_14:52:33.61213 2016-01-07 14:52:33 arvnodeman.daemon[46420] DEBUG: Standard_D2_v2: idle nodes 0, wishlist size 0
2016-01-07_14:52:33.61359 2016-01-07 14:52:33 arvnodeman.daemon[46420] DEBUG: Standard_D1_v2: idle nodes 2, wishlist size 0
2016-01-07_14:52:34.01272 2016-01-07 14:52:34 arvnodeman.arvados_nodes[46420] DEBUG: ArvadosNodeListMonitorActor (at 140202901693200) got response with 187 items
2016-01-07_14:52:41.95592 2016-01-07 14:52:41 arvnodeman.cloud_nodes[46420] DEBUG: CloudNodeListMonitorActor (at 140202922197200) got response with 8 items
#3

Updated by Brett Smith about 10 years ago

Nico César wrote:

This is from the Node Manager log. Why did it shut the nodes down?

Note the timestamps: this is happening several minutes after the node failures SLURM reported in the logs. Node Manager shut the nodes down because, by this point, whatever caused the original node error had left them idle.
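
For context on the "Draining SLURM node" lines in the log above: before destroying a node, Node Manager first drains it in SLURM and then waits for the drain to complete. A minimal sketch of that interaction using the standard SLURM command-line tools (the function names are illustrative, not the actual Node Manager code):

import subprocess

def drain_node(nodename, reason="node failure"):
    # Tell SLURM to stop scheduling new work on the node; SLURM
    # requires a Reason when setting the DRAIN state.
    subprocess.check_call([
        "scontrol", "update",
        "NodeName=" + nodename,
        "State=DRAIN",
        "Reason=" + reason,
    ])

def node_state(nodename):
    # Read back the node's compact state, e.g. "idle", "drng", "drain".
    return subprocess.check_output(
        ["sinfo", "--noheader", "-o", "%t", "-n", nodename]
    ).decode().strip()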

#4

Updated by Brett Smith about 10 years ago

Nico César wrote:

Just FYI, this is the moment where the two nodes drop out of the wishlist

This is because crunch-dispatch started running the job, which moves it from the Queued state to the Running state. Once a job is no longer in the queue, Node Manager doesn't factor it into the wishlist.
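
A hypothetical sketch of that selection logic (names are illustrative, not the actual Node Manager code): only jobs still in the Queued state generate wishlist entries, which is why the two Standard_D1_v2 entries disappear as soon as the job starts running.

def build_wishlist(jobs, size_for_job):
    # Only queued jobs create demand for new nodes. A job that
    # crunch-dispatch has moved to Running already holds an allocation,
    # so it contributes nothing here.
    return [size_for_job(job) for job in jobs
            if job["state"] == "Queued"]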

Why the nodes are still considered idle at this point is harder to say. It may just be an "eventual consistency" problem: crunch-dispatch has started running the job but hasn't yet updated the node records to reflect their current SLURM state. It only does that periodically, so there can be a window where the records lag behind SLURM.
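
A rough illustration of that window, assuming a periodic sync loop like the one described (the poll interval and the update callback are hypothetical):

import subprocess
import time

POLL_INTERVAL = 60  # hypothetical sync period, in seconds

def slurm_node_states():
    # Ask SLURM for every node's name and compact state (idle, alloc, ...).
    out = subprocess.check_output(
        ["sinfo", "--noheader", "-o", "%n %t"]).decode()
    return dict(line.split() for line in out.splitlines())

def sync_loop(update_node_record):
    # Periodically copy SLURM state into the Arvados node records.
    # Anything that happens between two passes, such as crunch-dispatch
    # allocating a node, is invisible until the next pass: that gap is
    # the "eventual consistency" window described above.
    while True:
        for name, state in slurm_node_states().items():
            update_node_record(name, state)
        time.sleep(POLL_INTERVAL)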

#5

Updated by Peter Amstutz about 6 years ago

  • Status changed from New to Closed
