Project

General

Profile

Actions

Bug #4380

closed

[Node Manager] Should drain nodes via SLURM before terminating them

Added by Brett Smith about 10 years ago. Updated about 10 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Node Manager
Target version:
Start date:
11/11/2014
Due date:
% Done:

100%

Estimated time:
(Total: 4.00 h)
Story points:
1.0

Subtasks 2 (0 open2 closed)

Task #4496: Review 4380-node-manager-computenode-reorg-wipResolvedTim Pierce11/11/2014

Actions
Task #4519: Review 4380-node-manager-slurm-drain-wipResolvedBrett Smith11/13/2014

Actions

Related issues 1 (0 open1 closed)

Related to Arvados - Bug #4368: [Crunch] Improve node failure detection and job retry logicClosed10/31/2014

Actions
Actions #1

Updated by Brett Smith about 10 years ago

To start draining a node:

scontrol update NodeName=computeNN State=DRAIN

To check current node state:

sinfo --noheader -o %t -n computeNN

One idea: ComputeNodeShutdownActor starts with a start_shutdown() message, that class uses it to send the current message. SlurmComputeNodeShutdownActor overrides it to initiate shutdown, then check for the result. Testers can override it for better isolation.

Config variable to set the local dispatch method. If none specified, use the actors from base computenode module.

Does ComputeNodeShutdownActor need to be passed the ComputeNodeMonitorActor, so that SlurmComputeNodeShutdownActor can re-check shutdown eligibility after draining is done? This would be a pretty significant overhaul…

Actions #2

Updated by Brett Smith about 10 years ago

  • Status changed from New to In Progress
Actions #3

Updated by Brett Smith about 10 years ago

Ward says that if the node's shutdown window closes while the node is draining, Node Manager should cancel the shutdown and undrain the node.

Actions #4

Updated by Brett Smith about 10 years ago

4380-node-manager-computenode-reorg-wip is up for review. See the commit message in 0d49d9d for rationale. I'm pushing this for review separately to try to avoid too much mutual blocking between this and #4138.

Actions #5

Updated by Tim Pierce about 10 years ago

Reviewing 4380-node-manager-computenode-reorg-wip at 0d49d9d0a

Looks good. Only minor comment: launcher.py still imports ComputeNodeSetupActor, ComputeNodeShutdownActor, ComputeNodeUpdateActor and ShutdownTimer, but of these, it only appears to use ComputeNodeUpdateActor. Are all of these imports necessary for reasons I can't obviously see?

Other than that LGTM. Thanks.

Actions #6

Updated by Brett Smith about 10 years ago

Tim Pierce wrote:

Looks good. Only minor comment: launcher.py still imports ComputeNodeSetupActor, ComputeNodeShutdownActor, ComputeNodeUpdateActor and ShutdownTimer, but of these, it only appears to use ComputeNodeUpdateActor. Are all of these imports necessary for reasons I can't obviously see?

Nope. That bug even predates this branch. I cleaned these up, along with ShutdownTimer too, and merged. Thanks.

Actions #7

Updated by Brett Smith about 10 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 25 to 100

Applied in changeset arvados|commit:6c68141eb50255128cf38b5717b15b16f2a8cdff.

Actions

Also available in: Atom PDF