Bug #6142
closed
[Node Manager] When canceling a SLURM shutdown, check state before resuming the node
Added by Brett Smith over 9 years ago.
Updated over 9 years ago.
Estimated time: (Total: 0.00 h)
Description
Node Manager's SLURM dispatcher always tries to resume the node in SLURM when a shutdown is canceled. However, this request is only valid if the node is drained or failed. In other cases, for example when the node is idle or alloc because it was never drained to begin with, the request is invalid and scontrol exits 1. This sends ComputeNodeShutdownActor into an infinite loop, repeatedly retrying a resume that can never succeed.
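For illustration, here is a minimal Python sketch of the doomed call described above (the node name and call site are hypothetical, not the actual Node Manager code):

```python
import subprocess

# Hypothetical node name. RESUME is only valid from the drained/failed
# states, so against an idle or alloc node scontrol exits 1 and
# check_call raises CalledProcessError. Retrying this same call without
# first checking the node's state is the infinite loop described above.
subprocess.check_call(
    ['scontrol', 'update', 'NodeName=compute64', 'State=RESUME'])
```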
Check the node's current state (you can refactor code from await_slurm_drain), and only issue the resume request if that state is drain or drng.
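A minimal sketch of that check, assuming the dispatcher shells out to sinfo and scontrol; the class and method names here are illustrative stand-ins for ComputeNodeShutdownActor, not the actual implementation:

```python
import subprocess

class SlurmShutdownActor:
    """Illustrative stand-in for the SLURM ComputeNodeShutdownActor."""

    # SLURM states from which a RESUME request is valid.
    RESUMABLE_STATES = frozenset(['drain', 'drng'])

    def __init__(self, nodename):
        self._nodename = nodename

    def _get_slurm_state(self):
        # Ask sinfo for the node's compact state code (idle, alloc,
        # drain, drng, ...); this is the kind of query the suggested
        # refactoring could share with await_slurm_drain's polling.
        return subprocess.check_output(
            ['sinfo', '--noheader', '-o', '%t', '-n', self._nodename]
        ).strip().decode()

    def cancel_shutdown(self):
        # Only resume the node if it was actually drained; from idle or
        # alloc the RESUME request is invalid and scontrol exits 1.
        if self._get_slurm_state() in self.RESUMABLE_STATES:
            subprocess.check_call(
                ['scontrol', 'update',
                 'NodeName=' + self._nodename, 'State=RESUME'])
```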
- Target version changed from Arvados Future Sprints to 2015-09-02 sprint
- Target version changed from 2015-09-02 sprint to Arvados Future Sprints
- Target version changed from Arvados Future Sprints to 2015-10-14 sprint
- Assigned To set to Peter Amstutz
This state occurs if the node is allocated before the drain request goes through. This can happen if Node Manager simply loses the race with crunch-dispatch, or if something interferes with the drain request, like #6321.
Once the node is allocated, it is no longer eligible for shutdown, so Node Manager tries to cancel the pending node shutdown. The first step of that cancellation is resuming the node in SLURM, but that can't succeed because the node was never drained. The request fails, and Node Manager is left in a bad state.
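Building on the sketch above, a hypothetical unit test (not from the Arvados test suite) pins down the desired behavior on the losing side of that race: canceling a shutdown on an alloc node must not issue a resume request.

```python
import unittest
from unittest import mock

class CancelShutdownTest(unittest.TestCase):
    # Patches are applied bottom-up: sinfo is the check_output mock,
    # scontrol is the check_call mock.
    @mock.patch('subprocess.check_call')
    @mock.patch('subprocess.check_output', return_value=b'alloc')
    def test_no_resume_for_allocated_node(self, sinfo, scontrol):
        # SlurmShutdownActor is the illustrative class sketched earlier.
        actor = SlurmShutdownActor('compute64')
        actor.cancel_shutdown()
        scontrol.assert_not_called()

if __name__ == '__main__':
    unittest.main()
```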
review @ 2a94b125b93a3aba204f55c37ecdc2876d81d642
I looked at the code and ran the tests (all passed). This code is new to me, so as far as I can tell there is no major problem here.
LGTM
- Status changed from New to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:7c97dd88e541a0245272b8e93a33e4d2fe4e32cd.