Feature #14325: [crunch-dispatch-cloud] Dispatch containers to cloud VMs directly, without slurm or nodemanager - Arvados

Actions

Copy link

Feature #14325

closed

[crunch-dispatch-cloud] Dispatch containers to cloud VMs directly, without slurm or nodemanager

Added by Tom Clegg over 6 years ago. Updated about 6 years ago.

Status:

Resolved

Priority:

Normal

Assigned To:

Tom Clegg

Category:

Target version:

2019-02-13 Sprint

Start date:

01/28/2019

Due date:

% Done:

100%

Estimated time:

(Total: 0.00 h)

Story points:

1.0

Release:

Arvados v1.4 - Q1/Q2 2019

Release relationship:

Auto

Description

This issue covers the smallest version of Dispatching containers to cloud VMs that can be deployed on a dev cluster.

Background -- already done in #14360:

Bring up nodes and run containers on them
Structured logs for diagnostics+statistics: cloud API errors, node lifecycle, container lifecycle
HTTP status report with current set of containers (queued/running) and VMs (busy/idle) -- see Dispatching containers to cloud VMs "Operator view"
Shutdown idle nodes automatically
Handle cloud API quota errors
Package-building changes are in place, but commented out

Requirements covered here:

Ops mechanism for draining a node (e.g., curl command using a management token) -- see Dispatching containers to cloud VMs "Management API"
Resource consumption metrics (number of instances, number of containers running, total hourly price of all existing VMs) -- see Dispatching containers to cloud VMs "Metrics"
Drain (not kill) instances that exist at startup, fail boot probe, but are already running containers -- see Dispatching containers to cloud VMs "Special cases / synchronizing state"
Configurable port number for connecting to VM SSH servers
Pass API host and dispatcher's token to crunch-run command via ARVADOS_API_* environment variables
Test SSH host key verification (dispatcher's token is not sent to a remote host unless the host's SSH key passes the VerifyHostKey() method provided by the cloud driver)
Test container.Queue using real railsAPI/controller
Test resuming state after restart (some instances are booting, some idle, some running containers, some draining, some on admin-hold)
Cancel container after some number of start/requeue cycles (i.e., crunch-run --detach succeeded, but child exited without moving container past Locked state)
Cancel container with no suitable instance type
Enable package build
Handle cloud API ratelimit errors (obey holdoff time returned by driver... incl. test)
Update management API response format (lowercase keys)
Confirm all probe failures are logged once instance is booted (see #14360#note-38, fixed in 7a047d8b6)

Requirements covered elsewhere:

One cloud vendor driver (Azure = #14324)
Production-readiness (#14807)

Refs

Dispatching containers to cloud VMs
#13964 spike

Subtasks 1 (0 open — 1 closed)

Related issues 6 (0 open — 6 closed)

Actions

Copy link

Updated by Tom Clegg over 6 years ago

Related to Feature #14324: [crunch-dispatch-cloud] Azure driver added

Actions

Copy link

Updated by Tom Clegg over 6 years ago

Related to Bug #13964: crunch-dispatch-cloud spike added

Actions

Copy link

Updated by Tom Clegg over 6 years ago

Related to Story #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching added

Actions

Copy link

Updated by Tom Clegg over 6 years ago

Description updated (diff)

Actions

Copy link

Updated by Tom Clegg over 6 years ago

Description updated (diff)

Actions

Copy link

Updated by Tom Clegg over 6 years ago

Description updated (diff)

Actions

Copy link

Updated by Tom Clegg over 6 years ago

Description updated (diff)

Actions

Copy link

Updated by Tom Clegg over 6 years ago

Related to Story #14360: [crunch-dispatch-cloud] Merge incomplete implementation added

Actions

Copy link

Updated by Tom Clegg over 6 years ago

Description updated (diff)

Actions

Copy link

#10

Updated by Tom Clegg over 6 years ago

Description updated (diff)

Actions

Copy link

#11

Updated by Tom Clegg over 6 years ago

Description updated (diff)

Actions

Copy link

#12

Updated by Tom Clegg over 6 years ago

Description updated (diff)

Actions

Copy link

#13

Updated by Tom Clegg over 6 years ago

Description updated (diff)

Actions

Copy link

#14

Updated by Tom Morris over 6 years ago

Target version set to To Be Groomed

Actions

Copy link

#15

Updated by Tom Clegg over 6 years ago

Description updated (diff)

Actions

Copy link

#16

Updated by Tom Clegg over 6 years ago

Description updated (diff)

Actions

Copy link

#17

Updated by Tom Clegg over 6 years ago

Description updated (diff)
Target version deleted (~~To Be Groomed~~)

Actions

Copy link

#18

Updated by Tom Clegg over 6 years ago

Description updated (diff)

Actions

Copy link

#19

Updated by Tom Clegg over 6 years ago

Description updated (diff)

Actions

Copy link

#20

Updated by Tom Clegg over 6 years ago

Description updated (diff)

Actions

Copy link

#21

Updated by Tom Clegg over 6 years ago

Description updated (diff)

Actions

Copy link

#22

Updated by Tom Clegg over 6 years ago

Description updated (diff)

Actions

Copy link

#23

Updated by Tom Clegg over 6 years ago

Target version set to Arvados Future Sprints
Story points set to 4.0

Actions

Copy link

#24

Updated by Peter Amstutz over 6 years ago

Description updated (diff)

Actions

Copy link

#25

Updated by Peter Amstutz over 6 years ago

Management APIs should return {"items": [...]} not {"Items": [...]} for consistency with the Arvados API.

Actions

Copy link

#26

Updated by Peter Amstutz over 6 years ago

Description updated (diff)

Actions

Copy link

#27

Updated by Tom Clegg over 6 years ago

Description updated (diff)

Actions

Copy link

#28

Updated by Tom Clegg over 6 years ago

Description updated (diff)

Actions

Copy link

#29

Updated by Tom Clegg over 6 years ago

Description updated (diff)

Actions

Copy link

#30

Updated by Tom Clegg over 6 years ago

Status changed from New to In Progress
Assigned To set to Tom Clegg

Actions

Copy link

#31

Updated by Tom Morris over 6 years ago

Target version changed from Arvados Future Sprints to 2019-01-16 Sprint

Actions

Copy link

#32

Updated by Tom Clegg over 6 years ago

Target version changed from 2019-01-16 Sprint to 2019-01-30 Sprint

Actions

Copy link

#33

Updated by Tom Clegg about 6 years ago

Ops mechanism for draining a node (e.g., curl command using a management token) -- see Dispatching containers to cloud VMs "Management API"

Added "hold" and "drain". (Wiki also mentions a "kill" API -- not included here.)

Resource consumption metrics (number of instances, number of containers running, total hourly price of all existing VMs) -- see Dispatching containers to cloud VMs "Metrics"

Added total hourly price. The others were already in place.

Drain (not kill) instances that exist at startup, fail boot probe, but are already running containers -- see Dispatching containers to cloud VMs "Special cases / synchronizing state"

Added what the wiki says, which is a little different:

"...instances are left alive at least until the containers finish. After that, the usual rules apply: if boot probe succeeds before boot timeout, start scheduling containers; otherwise, shut down."

This is a bit more consistent since it's more consistent with the "inherited node is not running a container and fails boot probe" case: we allow the boot timeout to run out before killing it, rather than expecting its boot probe to succeed before the existing container finishes.

Configurable port number for connecting to VM SSH servers

CloudVMs→SSHPort can be given as a name ("ssh") or number ("22").

Pass API host and dispatcher's token to crunch-run command via ARVADOS_API_* environment variables

Added.

Test SSH host key verification (dispatcher's token is not sent to a remote host unless the host's SSH key passes the VerifyHostKey() method provided by the cloud driver)

Added.

Test container.Queue using real railsAPI/controller

Added. Revealed & fixed SDK bug, see f696f142e.

Test resuming state after restart (some instances are booting, some idle, some running containers, some draining, some on admin-hold)

Added restart/resume test to confirm "hold" and instance-type labels are saved/loaded effectively.

Added a slew of worker tests to confirm proper state changes in probeAndUpdate.

Cancel container after some number of start/requeue cycles (i.e., crunch-run --detach succeeded, but child exited without moving container past Locked state)

Didn't do this. (We've already implemented it on the API side.)

Cancel container with no suitable instance type

Added.

Enable package build

Uncommented.

Handle cloud API ratelimit errors (obey holdoff time returned by driver... incl. test)

Added.

Update management API response format (lowercase keys)

Updated.

Confirm all probe failures are logged once instance is booted (see #14360#note-38, fixed in 7a047d8b6)

Confirmed.

14325-dispatch-cloud @ b105602902e38f18a48505e2091ffea77b2c7c89 https://ci.curoverse.com/view/Developer/job/developer-run-tests/1040/

Actions

Copy link

#34

Updated by Tom Clegg about 6 years ago

Now at a27b2bf3e with some test cleanup (move LameInstanceSet's one remaining useful feature to StubDriver and retire LameInstanceSet).

Actions

Copy link

#35

Updated by Tom Clegg about 6 years ago

Target version changed from 2019-01-30 Sprint to 2019-02-13 Sprint

Actions

Copy link

#36

Updated by Tom Clegg about 6 years ago

Story points changed from 4.0 to 1.0

Actions

Copy link

#37

Updated by Tom Clegg about 6 years ago

Precedes Story #14796: [crunch-dispatch-cloud] Document installation / migration from c-d-slurm + node manager added

Actions

Copy link

#38

Updated by Peter Amstutz about 6 years ago

worker.shutdownIfIdle():

    if !(wkr.state == StateIdle || (wkr.state == StateBooting && wkr.idleBehavior == IdleBehaviorDrain)) {
        return false
    }

The double-negative logic (do nothing if these things are NOT true...) makes this expression hard to read. Please add comments clarifying the intention that we want to shut down when certain things are true.

    if wkr.idleBehavior != IdleBehaviorDrain && age < wkr.wp.timeoutIdle {
        return false
    }

Same comment about confusing expression. If I'm understanding the intended behavior, it would be clearer to write wkr.idleBehavior == IdleBehaviorRun && age < wkr.wp.timeoutIdle because the IdleBehaviorHold case has already been eliminated, and IdleBehaviorDrain ignores the timeout (but having IdleBehaviorDrain and timeoutIdle appear on the same line implies they are related).

Queue.Update():

        if _, keep := cq.dontupdate[uuid]; keep {
            continue
        }
...
        if _, keep := cq.dontupdate[uuid]; keep {
            continue
        } else if _, keep = next[uuid]; keep {
            continue
        } else {
            delete(cq.current, uuid)
        }

Comment from last time that "keeplocal" was confusing and was renamed to "dontupdate" but there's still a few local variables called "keep" and I don't know how to read it. Should those also be called "dontupdate"? Maybe add some comments?

In Queue.addEnt() there's an embedded assumption that if the current dispatcher can't find a instance type for a container, nobody can, so it should always cancel the container (even if it has to lock it first). I think that's fine (heterogeneous dispatchers has complexity we don't want to get into yet, if ever) but should probably be mentioned in a comment.

worker.probeAndUpdate():

    for _, uuid := range ctrUUIDs {
        running[uuid] = struct{}{}
        if _, ok := wkr.running[uuid]; !ok {
            changed = true
        }
    }

Another place that would benefit from some more comments expressing the intent / context of the code. I think what this is doing is determining if there a container UUID was found on the node which isn't present in wkr.running. The next block looks like it checks the opposite case where a container is known to wkr.running but not present on the instance.

    if wkr.state == StateUnknown || wkr.state == StateBooting {
        wkr.state = StateIdle
        }

It is implied by getting to this point in the code that probeBooted() and probeRunning() both passed successfully, could use a comment making that assumption explicit.

... to be continued ....

Actions

Copy link

#39

Updated by Peter Amstutz about 6 years ago

Pool.Unallocated():

        if !(wkr.state == StateIdle || wkr.state == StateBooting || wkr.state == StateUnknown) || wkr.idleBehavior != IdleBehaviorRun || len(wkr.running) > 0 {
            continue
        }

This line is way too long.

Similar comment to earlier about hard-to-follow double-negative logic. Here !(wkr.state StateIdle || wkr.state StateBooting || wkr.state == StateUnknown) is much clearer written as (wkr.state != StateIdle && wkr.state != StateBooting && wkr.state != StateUnknown)

Actions

Copy link

#40

Updated by Peter Amstutz about 6 years ago

Cancel container after some number of start/requeue cycles (i.e., crunch-run --detach succeeded, but child exited without moving container past Locked state)

Didn't do this. (We've already implemented it on the API side.)

We've agreed to do so, but haven't actually done it yet (#11561)

# git.curoverse.com/arvados.git/lib/dispatchcloud/container
./queue_test.go:38:17: undefined: test
./queue_test.go:95:17: undefined: test
FAIL    git.curoverse.com/arvados.git/lib/dispatchcloud/container [build failed]

import (
    "github.com/julienschmidt/httprouter" 
)

What's the goal of introducing yet another routing framework here? We already use both http.ServeMux and gorilla/mux.

# Layouter fails if we add these

Maybe use graphviz instead? (Requires slightly different notation).

Actions

Copy link

#41

Updated by Tom Clegg about 6 years ago

Indeed, those are some confusing conditional expressions, thanks. Clarified and added comments.

Peter Amstutz wrote:

Didn't do this. (We've already implemented it on the API side.)

We've agreed to do so, but haven't actually done it yet (#11561)

Ah, indeed.

What's the goal of introducing yet another routing framework here? We already use both http.ServeMux and gorilla/mux.

Cheap, easy to use, does what we need (filter on methods + extract path params), didn't think of a reason not to use it. (It happens to be much more efficient with time and memory than gorilla, not that that's a big concern here.)

Maybe use graphviz instead? (Requires slightly different notation).

Maybe. I didn't find this ascii art exercise particularly rewarding. If I were to sink more time into different ways of doing this, I'd probably just give up and make a drawing in Google Drive.

lib/dispatchcloud/container tests are fixed.

14325-dispatch-cloud @ 71fd4da18b22100682ae7e2079aadfd66360d310 https://ci.curoverse.com/view/Developer/job/developer-run-tests/1051/

Actions

Copy link

#42

Updated by Tom Clegg about 6 years ago

Precedes Story #14807: [arvados-dispatch-cloud] Features/fixes needed before first production deploy added

Actions

Copy link

#43

Updated by Tom Clegg about 6 years ago

Description updated (diff)

Actions

Copy link

#44

Updated by Tom Clegg about 6 years ago

Added one more fix that wasn't mentioned here: Log stderr from last boot-probe when giving up on boot.

14325-dispatch-cloud @ ee53a267ded17bc50eaf4dfebba5ff4a3273753c https://ci.curoverse.com/view/Developer/job/developer-run-tests/1053/

Actions

Copy link

#45

Updated by Peter Amstutz about 6 years ago

Tom Clegg wrote:

Added one more fix that wasn't mentioned here: Log stderr from last boot-probe when giving up on boot.

14325-dispatch-cloud @ ee53a267ded17bc50eaf4dfebba5ff4a3273753c https://ci.curoverse.com/view/Developer/job/developer-run-tests/1053/

This LGTM, thanks.

Actions

Copy link

#46

Updated by Tom Clegg about 6 years ago

Status changed from In Progress to Resolved

Applied in changeset arvados|800139c8dee7d9a563a8a2dca9e45e283c55c22c.

Actions

Copy link

#47

Updated by Tom Morris about 6 years ago

Release set to 15

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Arvados

Custom queries

Feature #14325

[crunch-dispatch-cloud] Dispatch containers to cloud VMs directly, without slurm or nodemanager

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Morris over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Peter Amstutz over 6 years ago

Updated by Peter Amstutz over 6 years ago

Updated by Peter Amstutz over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Morris over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg about 6 years ago

Updated by Tom Clegg about 6 years ago

Updated by Tom Clegg about 6 years ago

Updated by Tom Clegg about 6 years ago

Updated by Tom Clegg about 6 years ago

Updated by Peter Amstutz about 6 years ago

Updated by Peter Amstutz about 6 years ago

Updated by Peter Amstutz about 6 years ago

Updated by Tom Clegg about 6 years ago

Updated by Tom Clegg about 6 years ago

Updated by Tom Clegg about 6 years ago

Updated by Tom Clegg about 6 years ago

Updated by Peter Amstutz about 6 years ago

Updated by Tom Clegg about 6 years ago

Updated by Tom Morris about 6 years ago