Bug #4380
Closed
[Node Manager] Should drain nodes via SLURM before terminating them
100%
Updated by Brett Smith about 10 years ago
To start draining a node:
scontrol update NodeName=computeNN State=DRAIN
To check current node state:
sinfo --noheader -o %t -n computeNN
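A minimal sketch of how these two commands could be wrapped from Python. The helper names (drain_command, state_command, node_state) and the injectable run parameter are assumptions for illustration, not Node Manager code. The sample state strings are the compact forms sinfo is expected to print for %t, such as drng for a draining node and drain for a drained one.

```python
import subprocess


def drain_command(node_name):
    """Build the scontrol argv that starts draining a node."""
    return ['scontrol', 'update', 'NodeName=' + node_name, 'State=DRAIN']


def state_command(node_name):
    """Build the sinfo argv that prints the node's compact state."""
    return ['sinfo', '--noheader', '-o', '%t', '-n', node_name]


def node_state(node_name, run=subprocess.check_output):
    """Return the state string printed by sinfo, stripped of whitespace.

    `run` is injectable (hypothetical parameter) so tests do not need
    a SLURM installation.
    """
    output = run(state_command(node_name))
    if isinstance(output, bytes):
        output = output.decode()
    return output.strip()
```

Keeping the argv construction separate from execution means the commands can be checked in tests without invoking SLURM.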
One idea: ComputeNodeShutdownActor starts by sending itself a start_shutdown() message. The base class handles it by sending the message it currently sends at startup. SlurmComputeNodeShutdownActor overrides it to initiate the shutdown, then check for the result. Tests can override it for better isolation.
Add a config variable to set the local dispatch method. If none is specified, use the actors from the base computenode module.
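A rough sketch of what that config lookup might look like. The section and option names (Daemon, dispatcher), the registry dict, and both stand-in classes are hypothetical; nothing in this ticket fixes those names.

```python
import configparser


class BaseShutdownActor:
    """Stand-in for the shutdown actor in the base computenode module."""


class SlurmShutdownActor(BaseShutdownActor):
    """Stand-in for a SLURM-aware subclass."""


# Hypothetical registry of dispatch-specific actor classes.
DISPATCH_ACTORS = {'slurm': SlurmShutdownActor}


def shutdown_actor_class(config):
    """Pick the shutdown actor class named by the config, if any."""
    # Assumed option name; an unset or empty value means "use the base actors".
    name = config.get('Daemon', 'dispatcher', fallback='').strip()
    if not name:
        return BaseShutdownActor
    try:
        return DISPATCH_ACTORS[name]
    except KeyError:
        raise ValueError('unknown dispatcher: %r' % name)
```

With no value set, the base class is returned, matching the fallback described above.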
Does ComputeNodeShutdownActor need to be passed the ComputeNodeMonitorActor, so that SlurmComputeNodeShutdownActor can re-check shutdown eligibility after draining is done? This would be a pretty significant overhaul…
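To make the start_shutdown() idea concrete, here is a sketch using plain classes rather than real actors. The constructor arguments (destroy_node, drain, get_state), the await_drained method, and the DRAINED_STATES set are all assumptions for illustration. It does not try to answer the eligibility re-check question.

```python
class ComputeNodeShutdownActor:
    """Stand-in for the base shutdown actor (a plain class, not an actor)."""

    def __init__(self, node_name, destroy_node):
        self.node_name = node_name
        self._destroy_node = destroy_node  # hypothetical cloud-side callable
        self.success = None

    def start_shutdown(self):
        # Base behaviour: go straight to terminating the node.
        self.shutdown_node()

    def shutdown_node(self):
        self.success = bool(self._destroy_node(self.node_name))


class SlurmComputeNodeShutdownActor(ComputeNodeShutdownActor):
    # Assumed set of sinfo states that mean the node is safe to terminate.
    DRAINED_STATES = frozenset(['drain', 'down'])

    def __init__(self, node_name, destroy_node, drain, get_state):
        super().__init__(node_name, destroy_node)
        self._drain = drain          # hypothetical: runs the scontrol drain
        self._get_state = get_state  # hypothetical: runs the sinfo check

    def start_shutdown(self):
        # Override: drain first, then check whether draining has finished.
        self._drain(self.node_name)
        self.await_drained()

    def await_drained(self):
        """Terminate once drained. Return False if the caller should retry."""
        if self._get_state(self.node_name) in self.DRAINED_STATES:
            self.shutdown_node()
            return True
        return False
```

A test can subclass either class and override start_shutdown() to skip the SLURM and cloud calls entirely.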
Updated by Brett Smith about 10 years ago
- Status changed from New to In Progress
Updated by Brett Smith about 10 years ago
Ward says that if the node's shutdown window closes while the node is draining, Node Manager should cancel the shutdown and undrain the node.
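One way that rule could look as a single polling step. The function names, the window_open callable, and the return values are hypothetical. The sketch assumes that setting State=RESUME with scontrol is how the node gets undrained.

```python
def resume_command(node_name):
    """Build the scontrol argv assumed to return a draining node to service."""
    return ['scontrol', 'update', 'NodeName=' + node_name, 'State=RESUME']


def drain_step(node_name, window_open, get_state, run):
    """Run one polling step while a node is draining.

    Returns 'cancelled' if the shutdown window closed (the node is undrained),
    'drained' if the node is ready to terminate, or 'waiting' otherwise.
    All three callables are hypothetical injection points.
    """
    if not window_open():
        # The window closed mid-drain: cancel the shutdown and undrain.
        run(resume_command(node_name))
        return 'cancelled'
    if get_state(node_name) in ('drain', 'down'):
        return 'drained'
    return 'waiting'
```

Checking the window before the state means a node that finishes draining after the window closes is resumed rather than terminated.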
Updated by Brett Smith about 10 years ago
Updated by Tim Pierce about 10 years ago
Reviewing 4380-node-manager-computenode-reorg-wip at 0d49d9d0a
Looks good. Only minor comment: launcher.py still imports ComputeNodeSetupActor, ComputeNodeShutdownActor, ComputeNodeUpdateActor and ShutdownTimer, but of these, it only appears to use ComputeNodeUpdateActor. Are all of these imports necessary for reasons I can't obviously see?
Other than that LGTM. Thanks.
Updated by Brett Smith about 10 years ago
Tim Pierce wrote:
Looks good. Only minor comment: launcher.py still imports ComputeNodeSetupActor, ComputeNodeShutdownActor, ComputeNodeUpdateActor and ShutdownTimer, but of these, it only appears to use ComputeNodeUpdateActor. Are all of these imports necessary for reasons I can't obviously see?
Nope. That bug even predates this branch. I cleaned these up, along with ShutdownTimer, and merged. Thanks.
Updated by Brett Smith about 10 years ago
- Status changed from In Progress to Resolved
- % Done changed from 25 to 100
Applied in changeset arvados|commit:6c68141eb50255128cf38b5717b15b16f2a8cdff.