Project

General

Profile

Actions

Feature #14325

closed

[crunch-dispatch-cloud] Dispatch containers to cloud VMs directly, without slurm or nodemanager

Added by Tom Clegg about 6 years ago. Updated almost 6 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
01/28/2019
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0
Release relationship:
Auto

Description

This issue covers the smallest version of Dispatching containers to cloud VMs that can be deployed on a dev cluster.

Background -- already done in #14360:
  • Bring up nodes and run containers on them
  • Structured logs for diagnostics+statistics: cloud API errors, node lifecycle, container lifecycle
  • HTTP status report with current set of containers (queued/running) and VMs (busy/idle) -- see Dispatching containers to cloud VMs "Operator view"
  • Shutdown idle nodes automatically
  • Handle cloud API quota errors
  • Package-building changes are in place, but commented out
Requirements covered here:
  • Ops mechanism for draining a node (e.g., curl command using a management token) -- see Dispatching containers to cloud VMs "Management API"
  • Resource consumption metrics (number of instances, number of containers running, total hourly price of all existing VMs) -- see Dispatching containers to cloud VMs "Metrics"
  • Drain (not kill) instances that exist at startup, fail boot probe, but are already running containers -- see Dispatching containers to cloud VMs "Special cases / synchronizing state"
  • Configurable port number for connecting to VM SSH servers
  • Pass API host and dispatcher's token to crunch-run command via ARVADOS_API_* environment variables
  • Test SSH host key verification (dispatcher's token is not sent to a remote host unless the host's SSH key passes the VerifyHostKey() method provided by the cloud driver)
  • Test container.Queue using real railsAPI/controller
  • Test resuming state after restart (some instances are booting, some idle, some running containers, some draining, some on admin-hold)
  • Cancel container after some number of start/requeue cycles (i.e., crunch-run --detach succeeded, but child exited without moving container past Locked state)
  • Cancel container with no suitable instance type
  • Enable package build
  • Handle cloud API ratelimit errors (obey holdoff time returned by driver... incl. test)
  • Update management API response format (lowercase keys)
  • Confirm all probe failures are logged once instance is booted (see #14360#note-38, fixed in 7a047d8b6)
Requirements covered elsewhere:
  • One cloud vendor driver (Azure = #14324)
  • Production-readiness (#14807)
Refs

Subtasks 1 (0 open1 closed)

Task #14664: Review 14325-dispatch-cloudResolvedPeter Amstutz01/28/2019

Actions

Related issues 6 (0 open6 closed)

Related to Arvados - Feature #14324: [crunch-dispatch-cloud] Azure driverResolvedPeter Amstutz01/09/2019

Actions
Related to Arvados - Bug #13964: crunch-dispatch-cloud spikeResolvedTom Clegg

Actions
Related to Arvados - Story #13908: [Epic] Replace SLURM for cloud job scheduling/dispatchingResolved

Actions
Related to Arvados - Story #14360: [crunch-dispatch-cloud] Merge incomplete implementationResolvedTom Clegg10/26/2018

Actions
Precedes Arvados - Story #14796: [crunch-dispatch-cloud] Document installation / migration from c-d-slurm + node managerResolvedTom Clegg01/29/2019

Actions
Precedes Arvados - Story #14807: [arvados-dispatch-cloud] Features/fixes needed before first production deployResolvedTom Clegg01/29/2019

Actions
Actions #1

Updated by Tom Clegg about 6 years ago

  • Related to Feature #14324: [crunch-dispatch-cloud] Azure driver added
Actions #2

Updated by Tom Clegg about 6 years ago

  • Related to Bug #13964: crunch-dispatch-cloud spike added
Actions #3

Updated by Tom Clegg about 6 years ago

  • Related to Story #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching added
Actions #4

Updated by Tom Clegg about 6 years ago

  • Description updated (diff)
Actions #5

Updated by Tom Clegg about 6 years ago

  • Description updated (diff)
Actions #6

Updated by Tom Clegg about 6 years ago

  • Description updated (diff)
Actions #7

Updated by Tom Clegg about 6 years ago

  • Description updated (diff)
Actions #8

Updated by Tom Clegg about 6 years ago

  • Related to Story #14360: [crunch-dispatch-cloud] Merge incomplete implementation added
Actions #9

Updated by Tom Clegg about 6 years ago

  • Description updated (diff)
Actions #10

Updated by Tom Clegg about 6 years ago

  • Description updated (diff)
Actions #11

Updated by Tom Clegg about 6 years ago

  • Description updated (diff)
Actions #12

Updated by Tom Clegg about 6 years ago

  • Description updated (diff)
Actions #13

Updated by Tom Clegg about 6 years ago

  • Description updated (diff)
Actions #14

Updated by Tom Morris about 6 years ago

  • Target version set to To Be Groomed
Actions #15

Updated by Tom Clegg about 6 years ago

  • Description updated (diff)
Actions #16

Updated by Tom Clegg about 6 years ago

  • Description updated (diff)
Actions #17

Updated by Tom Clegg about 6 years ago

  • Description updated (diff)
  • Target version deleted (To Be Groomed)
Actions #18

Updated by Tom Clegg about 6 years ago

  • Description updated (diff)
Actions #19

Updated by Tom Clegg about 6 years ago

  • Description updated (diff)
Actions #20

Updated by Tom Clegg about 6 years ago

  • Description updated (diff)
Actions #21

Updated by Tom Clegg about 6 years ago

  • Description updated (diff)
Actions #22

Updated by Tom Clegg about 6 years ago

  • Description updated (diff)
Actions #23

Updated by Tom Clegg about 6 years ago

  • Target version set to Arvados Future Sprints
  • Story points set to 4.0
Actions #24

Updated by Peter Amstutz about 6 years ago

  • Description updated (diff)
Actions #25

Updated by Peter Amstutz about 6 years ago

Management APIs should return {"items": [...]} not {"Items": [...]} for consistency with the Arvados API.

Actions #26

Updated by Peter Amstutz about 6 years ago

  • Description updated (diff)
Actions #27

Updated by Tom Clegg about 6 years ago

  • Description updated (diff)
Actions #28

Updated by Tom Clegg about 6 years ago

  • Description updated (diff)
Actions #29

Updated by Tom Clegg about 6 years ago

  • Description updated (diff)
Actions #30

Updated by Tom Clegg about 6 years ago

  • Status changed from New to In Progress
  • Assigned To set to Tom Clegg
Actions #31

Updated by Tom Morris almost 6 years ago

  • Target version changed from Arvados Future Sprints to 2019-01-16 Sprint
Actions #32

Updated by Tom Clegg almost 6 years ago

  • Target version changed from 2019-01-16 Sprint to 2019-01-30 Sprint
Actions #33

Updated by Tom Clegg almost 6 years ago

Added "hold" and "drain". (Wiki also mentions a "kill" API -- not included here.)

  • Resource consumption metrics (number of instances, number of containers running, total hourly price of all existing VMs) -- see Dispatching containers to cloud VMs "Metrics"

Added total hourly price. The others were already in place.

  • Drain (not kill) instances that exist at startup, fail boot probe, but are already running containers -- see Dispatching containers to cloud VMs "Special cases / synchronizing state"

Added what the wiki says, which is a little different:

"...instances are left alive at least until the containers finish. After that, the usual rules apply: if boot probe succeeds before boot timeout, start scheduling containers; otherwise, shut down."

This is a bit more consistent since it's more consistent with the "inherited node is not running a container and fails boot probe" case: we allow the boot timeout to run out before killing it, rather than expecting its boot probe to succeed before the existing container finishes.

  • Configurable port number for connecting to VM SSH servers

CloudVMs→SSHPort can be given as a name ("ssh") or number ("22").

  • Pass API host and dispatcher's token to crunch-run command via ARVADOS_API_* environment variables

Added.

  • Test SSH host key verification (dispatcher's token is not sent to a remote host unless the host's SSH key passes the VerifyHostKey() method provided by the cloud driver)

Added.

  • Test container.Queue using real railsAPI/controller

Added. Revealed & fixed SDK bug, see f696f142e.

  • Test resuming state after restart (some instances are booting, some idle, some running containers, some draining, some on admin-hold)

Added restart/resume test to confirm "hold" and instance-type labels are saved/loaded effectively.

Added a slew of worker tests to confirm proper state changes in probeAndUpdate.

  • Cancel container after some number of start/requeue cycles (i.e., crunch-run --detach succeeded, but child exited without moving container past Locked state)

Didn't do this. (We've already implemented it on the API side.)

  • Cancel container with no suitable instance type

Added.

  • Enable package build

Uncommented.

  • Handle cloud API ratelimit errors (obey holdoff time returned by driver... incl. test)

Added.

  • Update management API response format (lowercase keys)

Updated.

Confirmed.

14325-dispatch-cloud @ b105602902e38f18a48505e2091ffea77b2c7c89 https://ci.curoverse.com/view/Developer/job/developer-run-tests/1040/

Actions #34

Updated by Tom Clegg almost 6 years ago

Now at a27b2bf3e with some test cleanup (move LameInstanceSet's one remaining useful feature to StubDriver and retire LameInstanceSet).

Actions #35

Updated by Tom Clegg almost 6 years ago

  • Target version changed from 2019-01-30 Sprint to 2019-02-13 Sprint
Actions #36

Updated by Tom Clegg almost 6 years ago

  • Story points changed from 4.0 to 1.0
Actions #37

Updated by Tom Clegg almost 6 years ago

  • Precedes Story #14796: [crunch-dispatch-cloud] Document installation / migration from c-d-slurm + node manager added
Actions #38

Updated by Peter Amstutz almost 6 years ago

worker.shutdownIfIdle():

    if !(wkr.state == StateIdle || (wkr.state == StateBooting && wkr.idleBehavior == IdleBehaviorDrain)) {
        return false
    }

The double-negative logic (do nothing if these things are NOT true...) makes this expression hard to read. Please add comments clarifying the intention that we want to shut down when certain things are true.

    if wkr.idleBehavior != IdleBehaviorDrain && age < wkr.wp.timeoutIdle {
        return false
    }

Same comment about confusing expression. If I'm understanding the intended behavior, it would be clearer to write wkr.idleBehavior == IdleBehaviorRun && age < wkr.wp.timeoutIdle because the IdleBehaviorHold case has already been eliminated, and IdleBehaviorDrain ignores the timeout (but having IdleBehaviorDrain and timeoutIdle appear on the same line implies they are related).

Queue.Update():

        if _, keep := cq.dontupdate[uuid]; keep {
            continue
        }
...
        if _, keep := cq.dontupdate[uuid]; keep {
            continue
        } else if _, keep = next[uuid]; keep {
            continue
        } else {
            delete(cq.current, uuid)
        }

Comment from last time that "keeplocal" was confusing and was renamed to "dontupdate" but there's still a few local variables called "keep" and I don't know how to read it. Should those also be called "dontupdate"? Maybe add some comments?

In Queue.addEnt() there's an embedded assumption that if the current dispatcher can't find a instance type for a container, nobody can, so it should always cancel the container (even if it has to lock it first). I think that's fine (heterogeneous dispatchers has complexity we don't want to get into yet, if ever) but should probably be mentioned in a comment.

worker.probeAndUpdate():

    for _, uuid := range ctrUUIDs {
        running[uuid] = struct{}{}
        if _, ok := wkr.running[uuid]; !ok {
            changed = true
        }
    }

Another place that would benefit from some more comments expressing the intent / context of the code. I think what this is doing is determining if there a container UUID was found on the node which isn't present in wkr.running. The next block looks like it checks the opposite case where a container is known to wkr.running but not present on the instance.

    if wkr.state == StateUnknown || wkr.state == StateBooting {
        wkr.state = StateIdle
        }

It is implied by getting to this point in the code that probeBooted() and probeRunning() both passed successfully, could use a comment making that assumption explicit.

... to be continued ....

Actions #39

Updated by Peter Amstutz almost 6 years ago

Pool.Unallocated():

        if !(wkr.state == StateIdle || wkr.state == StateBooting || wkr.state == StateUnknown) || wkr.idleBehavior != IdleBehaviorRun || len(wkr.running) > 0 {
            continue
        }

This line is way too long.

Similar comment to earlier about hard-to-follow double-negative logic. Here !(wkr.state StateIdle || wkr.state StateBooting || wkr.state == StateUnknown) is much clearer written as (wkr.state != StateIdle && wkr.state != StateBooting && wkr.state != StateUnknown)

Actions #40

Updated by Peter Amstutz almost 6 years ago

Cancel container after some number of start/requeue cycles (i.e., crunch-run --detach succeeded, but child exited without moving container past Locked state)

Didn't do this. (We've already implemented it on the API side.)

We've agreed to do so, but haven't actually done it yet (#11561)

# git.curoverse.com/arvados.git/lib/dispatchcloud/container
./queue_test.go:38:17: undefined: test
./queue_test.go:95:17: undefined: test
FAIL    git.curoverse.com/arvados.git/lib/dispatchcloud/container [build failed]
import (
    "github.com/julienschmidt/httprouter" 
)

What's the goal of introducing yet another routing framework here? We already use both http.ServeMux and gorilla/mux.

# Layouter fails if we add these

Maybe use graphviz instead? (Requires slightly different notation).

Actions #41

Updated by Tom Clegg almost 6 years ago

Indeed, those are some confusing conditional expressions, thanks. Clarified and added comments.

Peter Amstutz wrote:

Didn't do this. (We've already implemented it on the API side.)

We've agreed to do so, but haven't actually done it yet (#11561)

Ah, indeed.

What's the goal of introducing yet another routing framework here? We already use both http.ServeMux and gorilla/mux.

Cheap, easy to use, does what we need (filter on methods + extract path params), didn't think of a reason not to use it. (It happens to be much more efficient with time and memory than gorilla, not that that's a big concern here.)

Maybe use graphviz instead? (Requires slightly different notation).

Maybe. I didn't find this ascii art exercise particularly rewarding. If I were to sink more time into different ways of doing this, I'd probably just give up and make a drawing in Google Drive.

lib/dispatchcloud/container tests are fixed.

14325-dispatch-cloud @ 71fd4da18b22100682ae7e2079aadfd66360d310 https://ci.curoverse.com/view/Developer/job/developer-run-tests/1051/

Actions #42

Updated by Tom Clegg almost 6 years ago

  • Precedes Story #14807: [arvados-dispatch-cloud] Features/fixes needed before first production deploy added
Actions #43

Updated by Tom Clegg almost 6 years ago

  • Description updated (diff)
Actions #44

Updated by Tom Clegg almost 6 years ago

Added one more fix that wasn't mentioned here: Log stderr from last boot-probe when giving up on boot.

14325-dispatch-cloud @ ee53a267ded17bc50eaf4dfebba5ff4a3273753c https://ci.curoverse.com/view/Developer/job/developer-run-tests/1053/

Actions #45

Updated by Peter Amstutz almost 6 years ago

Tom Clegg wrote:

Added one more fix that wasn't mentioned here: Log stderr from last boot-probe when giving up on boot.

14325-dispatch-cloud @ ee53a267ded17bc50eaf4dfebba5ff4a3273753c https://ci.curoverse.com/view/Developer/job/developer-run-tests/1053/

This LGTM, thanks.

Actions #46

Updated by Tom Clegg almost 6 years ago

  • Status changed from In Progress to Resolved
Actions #47

Updated by Tom Morris almost 6 years ago

  • Release set to 15
Actions

Also available in: Atom PDF