Feature #14325
closed[crunch-dispatch-cloud] Dispatch containers to cloud VMs directly, without slurm or nodemanager
100%
Description
This issue covers the smallest version of Dispatching containers to cloud VMs that can be deployed on a dev cluster.
Background -- already done in #14360:- Bring up nodes and run containers on them
- Structured logs for diagnostics+statistics: cloud API errors, node lifecycle, container lifecycle
- HTTP status report with current set of containers (queued/running) and VMs (busy/idle) -- see Dispatching containers to cloud VMs "Operator view"
- Shutdown idle nodes automatically
- Handle cloud API quota errors
- Package-building changes are in place, but commented out
- Ops mechanism for draining a node (e.g., curl command using a management token) -- see Dispatching containers to cloud VMs "Management API"
- Resource consumption metrics (number of instances, number of containers running, total hourly price of all existing VMs) -- see Dispatching containers to cloud VMs "Metrics"
- Drain (not kill) instances that exist at startup, fail boot probe, but are already running containers -- see Dispatching containers to cloud VMs "Special cases / synchronizing state"
- Configurable port number for connecting to VM SSH servers
- Pass API host and dispatcher's token to crunch-run command via
ARVADOS_API_*
environment variables - Test SSH host key verification (dispatcher's token is not sent to a remote host unless the host's SSH key passes the VerifyHostKey() method provided by the cloud driver)
- Test container.Queue using real railsAPI/controller
- Test resuming state after restart (some instances are booting, some idle, some running containers, some draining, some on admin-hold)
- Cancel container after some number of start/requeue cycles (i.e.,
crunch-run --detach
succeeded, but child exited without moving container past Locked state) - Cancel container with no suitable instance type
- Enable package build
- Handle cloud API ratelimit errors (obey holdoff time returned by driver... incl. test)
- Update management API response format (lowercase keys)
- Confirm all probe failures are logged once instance is booted (see #14360#note-38, fixed in 7a047d8b6)
Updated by Tom Clegg about 6 years ago
- Related to Feature #14324: [crunch-dispatch-cloud] Azure driver added
Updated by Tom Clegg about 6 years ago
- Related to Bug #13964: crunch-dispatch-cloud spike added
Updated by Tom Clegg about 6 years ago
- Related to Story #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching added
Updated by Tom Clegg about 6 years ago
- Related to Story #14360: [crunch-dispatch-cloud] Merge incomplete implementation added
Updated by Tom Clegg about 6 years ago
- Description updated (diff)
- Target version deleted (
To Be Groomed)
Updated by Tom Clegg about 6 years ago
- Target version set to Arvados Future Sprints
- Story points set to 4.0
Updated by Peter Amstutz about 6 years ago
Management APIs should return {"items": [...]} not {"Items": [...]} for consistency with the Arvados API.
Updated by Tom Clegg about 6 years ago
- Status changed from New to In Progress
- Assigned To set to Tom Clegg
Updated by Tom Morris almost 6 years ago
- Target version changed from Arvados Future Sprints to 2019-01-16 Sprint
Updated by Tom Clegg almost 6 years ago
- Target version changed from 2019-01-16 Sprint to 2019-01-30 Sprint
Updated by Tom Clegg almost 6 years ago
- Ops mechanism for draining a node (e.g., curl command using a management token) -- see Dispatching containers to cloud VMs "Management API"
Added "hold" and "drain". (Wiki also mentions a "kill" API -- not included here.)
- Resource consumption metrics (number of instances, number of containers running, total hourly price of all existing VMs) -- see Dispatching containers to cloud VMs "Metrics"
Added total hourly price. The others were already in place.
- Drain (not kill) instances that exist at startup, fail boot probe, but are already running containers -- see Dispatching containers to cloud VMs "Special cases / synchronizing state"
Added what the wiki says, which is a little different:
"...instances are left alive at least until the containers finish. After that, the usual rules apply: if boot probe succeeds before boot timeout, start scheduling containers; otherwise, shut down."
This is a bit more consistent since it's more consistent with the "inherited node is not running a container and fails boot probe" case: we allow the boot timeout to run out before killing it, rather than expecting its boot probe to succeed before the existing container finishes.
- Configurable port number for connecting to VM SSH servers
CloudVMs→SSHPort can be given as a name ("ssh") or number ("22").
- Pass API host and dispatcher's token to crunch-run command via
ARVADOS_API_*
environment variables
Added.
- Test SSH host key verification (dispatcher's token is not sent to a remote host unless the host's SSH key passes the VerifyHostKey() method provided by the cloud driver)
Added.
- Test container.Queue using real railsAPI/controller
Added. Revealed & fixed SDK bug, see f696f142e.
- Test resuming state after restart (some instances are booting, some idle, some running containers, some draining, some on admin-hold)
Added restart/resume test to confirm "hold" and instance-type labels are saved/loaded effectively.
Added a slew of worker tests to confirm proper state changes in probeAndUpdate.
- Cancel container after some number of start/requeue cycles (i.e.,
crunch-run --detach
succeeded, but child exited without moving container past Locked state)
Didn't do this. (We've already implemented it on the API side.)
- Cancel container with no suitable instance type
Added.
- Enable package build
Uncommented.
- Handle cloud API ratelimit errors (obey holdoff time returned by driver... incl. test)
Added.
- Update management API response format (lowercase keys)
Updated.
- Confirm all probe failures are logged once instance is booted (see #14360#note-38, fixed in 7a047d8b6)
Confirmed.
14325-dispatch-cloud @ b105602902e38f18a48505e2091ffea77b2c7c89 https://ci.curoverse.com/view/Developer/job/developer-run-tests/1040/
Updated by Tom Clegg almost 6 years ago
Now at a27b2bf3e with some test cleanup (move LameInstanceSet's one remaining useful feature to StubDriver and retire LameInstanceSet).
Updated by Tom Clegg almost 6 years ago
- Target version changed from 2019-01-30 Sprint to 2019-02-13 Sprint
Updated by Tom Clegg almost 6 years ago
- Precedes Story #14796: [crunch-dispatch-cloud] Document installation / migration from c-d-slurm + node manager added
Updated by Peter Amstutz almost 6 years ago
worker.shutdownIfIdle():
if !(wkr.state == StateIdle || (wkr.state == StateBooting && wkr.idleBehavior == IdleBehaviorDrain)) { return false }
The double-negative logic (do nothing if these things are NOT true...) makes this expression hard to read. Please add comments clarifying the intention that we want to shut down when certain things are true.
if wkr.idleBehavior != IdleBehaviorDrain && age < wkr.wp.timeoutIdle { return false }
Same comment about confusing expression. If I'm understanding the intended behavior, it would be clearer to write wkr.idleBehavior == IdleBehaviorRun && age < wkr.wp.timeoutIdle
because the IdleBehaviorHold case has already been eliminated, and IdleBehaviorDrain ignores the timeout (but having IdleBehaviorDrain and timeoutIdle appear on the same line implies they are related).
Queue.Update():
if _, keep := cq.dontupdate[uuid]; keep { continue } ... if _, keep := cq.dontupdate[uuid]; keep { continue } else if _, keep = next[uuid]; keep { continue } else { delete(cq.current, uuid) }
Comment from last time that "keeplocal" was confusing and was renamed to "dontupdate" but there's still a few local variables called "keep" and I don't know how to read it. Should those also be called "dontupdate"? Maybe add some comments?
In Queue.addEnt()
there's an embedded assumption that if the current dispatcher can't find a instance type for a container, nobody can, so it should always cancel the container (even if it has to lock it first). I think that's fine (heterogeneous dispatchers has complexity we don't want to get into yet, if ever) but should probably be mentioned in a comment.
worker.probeAndUpdate():
for _, uuid := range ctrUUIDs { running[uuid] = struct{}{} if _, ok := wkr.running[uuid]; !ok { changed = true } }
Another place that would benefit from some more comments expressing the intent / context of the code. I think what this is doing is determining if there a container UUID was found on the node which isn't present in wkr.running
. The next block looks like it checks the opposite case where a container is known to wkr.running
but not present on the instance.
if wkr.state == StateUnknown || wkr.state == StateBooting { wkr.state = StateIdle }
It is implied by getting to this point in the code that probeBooted() and probeRunning() both passed successfully, could use a comment making that assumption explicit.
... to be continued ....
Updated by Peter Amstutz almost 6 years ago
Pool.Unallocated():
if !(wkr.state == StateIdle || wkr.state == StateBooting || wkr.state == StateUnknown) || wkr.idleBehavior != IdleBehaviorRun || len(wkr.running) > 0 { continue }
This line is way too long.
Similar comment to earlier about hard-to-follow double-negative logic. Here !(wkr.state StateIdle || wkr.state StateBooting || wkr.state == StateUnknown)
is much clearer written as (wkr.state != StateIdle && wkr.state != StateBooting && wkr.state != StateUnknown)
Updated by Peter Amstutz almost 6 years ago
Cancel container after some number of start/requeue cycles (i.e., crunch-run --detach succeeded, but child exited without moving container past Locked state)
Didn't do this. (We've already implemented it on the API side.)
We've agreed to do so, but haven't actually done it yet (#11561)
# git.curoverse.com/arvados.git/lib/dispatchcloud/container ./queue_test.go:38:17: undefined: test ./queue_test.go:95:17: undefined: test FAIL git.curoverse.com/arvados.git/lib/dispatchcloud/container [build failed]
import ( "github.com/julienschmidt/httprouter" )
What's the goal of introducing yet another routing framework here? We already use both http.ServeMux and gorilla/mux.
# Layouter fails if we add these
Maybe use graphviz instead? (Requires slightly different notation).
Updated by Tom Clegg almost 6 years ago
Indeed, those are some confusing conditional expressions, thanks. Clarified and added comments.
Peter Amstutz wrote:
Didn't do this. (We've already implemented it on the API side.)
We've agreed to do so, but haven't actually done it yet (#11561)
Ah, indeed.
What's the goal of introducing yet another routing framework here? We already use both http.ServeMux and gorilla/mux.
Cheap, easy to use, does what we need (filter on methods + extract path params), didn't think of a reason not to use it. (It happens to be much more efficient with time and memory than gorilla, not that that's a big concern here.)
Maybe use graphviz instead? (Requires slightly different notation).
Maybe. I didn't find this ascii art exercise particularly rewarding. If I were to sink more time into different ways of doing this, I'd probably just give up and make a drawing in Google Drive.
lib/dispatchcloud/container tests are fixed.
14325-dispatch-cloud @ 71fd4da18b22100682ae7e2079aadfd66360d310 https://ci.curoverse.com/view/Developer/job/developer-run-tests/1051/
Updated by Tom Clegg almost 6 years ago
- Precedes Story #14807: [arvados-dispatch-cloud] Features/fixes needed before first production deploy added
Updated by Tom Clegg almost 6 years ago
Added one more fix that wasn't mentioned here: Log stderr from last boot-probe when giving up on boot.
14325-dispatch-cloud @ ee53a267ded17bc50eaf4dfebba5ff4a3273753c https://ci.curoverse.com/view/Developer/job/developer-run-tests/1053/
Updated by Peter Amstutz almost 6 years ago
Tom Clegg wrote:
Added one more fix that wasn't mentioned here: Log stderr from last boot-probe when giving up on boot.
14325-dispatch-cloud @ ee53a267ded17bc50eaf4dfebba5ff4a3273753c https://ci.curoverse.com/view/Developer/job/developer-run-tests/1053/
This LGTM, thanks.
Updated by Tom Clegg almost 6 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|800139c8dee7d9a563a8a2dca9e45e283c55c22c.