Dispatching containers to cloud VMs » History » Version 3
Tom Clegg, 08/04/2018 04:00 AM
h1. Dispatching containers to cloud VMs

(Draft. In fact, this might not be needed at all. For example, we might dispatch to Kubernetes, and find/make a Kubernetes auto-scaler, instead.)

h2. Background

This page is about dispatching to on-demand cloud nodes like Amazon EC2 instances.

It should not be confused with dispatching to a cloud-based container service like Amazon Elastic Container Service, Azure Batch or Google Kubernetes Engine.

In crunch1, and the early days of crunch2, we made something work with arvados-nodemanager and SLURM.

One of the goals of crunch2 is eliminating all uses of SLURM with the exception of crunch-dispatch-slurm, whose purpose is to dispatch Arvados containers to a SLURM cluster that already exists for non-Arvados tasks.

This doc doesn’t describe a sequence of development tasks or a migration plan. It describes the end state: how dispatch will work when all implementation tasks and migrations are complete.

h2. Relevant components

API server (backed by PostgreSQL) is the source of truth about which containers the system should be trying to execute (or cancel) at any given time.

Arvados configuration (currently via file in /etc, in future via consul/etcd/similar) is the source of truth about cloud provider credentials, allowed node types, spending limits/policies, etc.

crunch-dispatch-cloud-node (a new component) arranges for queued containers to run on worker nodes, brings up new worker nodes in order to run the queue faster, and shuts down idle worker nodes.

h2. Overview of crunch-dispatch-cloud-node operation

* When first starting up, inspect the API server’s container queue and the cloud provider’s list of dispatcher-tagged cloud nodes, and restore internal state accordingly.
* When the API server puts a container in Queued state, lock it, select or create a cloud node to run it on, and start a crunch-run process there to run it.
* When the API server says a container (locked or dispatched by this dispatcher) should be cancelled, ensure the actual container and its crunch-run supervisor get shut down and the relevant node becomes idle.
* When a crunch-run invocation (dispatched by this dispatcher) exits without updating the container record on the API server -- or can’t run at all -- clean up accordingly.

Invariant: every dispatcher-tagged cloud node is either needed by this dispatcher, or should be shut down (so if there are multiple dispatchers, they must use different tags).

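As a rough illustration of the queue-handling behaviour described in this overview, the sketch below matches queued containers to idle nodes in a single planning pass. It is a simplification under stated assumptions: the names @plan@ and @makePlan@ are hypothetical, node types and prices are ignored, and leftover idle nodes are only reported as shutdown candidates (a real dispatcher would presumably apply an idle timeout first).

```go
package main

import "fmt"

// plan is the outcome of one hypothetical scheduling pass.
type plan struct {
	Assign  map[string]string // container UUID -> ID of the idle node chosen for it
	Create  int               // number of new nodes to request from the cloud provider
	Surplus []string          // idle nodes with no work (candidates for shutdown)
}

// makePlan pairs queued containers with idle nodes in order. Containers
// that do not get an idle node each need a new node; idle nodes that do
// not get a container are reported as surplus.
func makePlan(queued, idle []string) plan {
	p := plan{Assign: map[string]string{}}
	n := len(queued)
	if len(idle) < n {
		n = len(idle)
	}
	for i := 0; i < n; i++ {
		p.Assign[queued[i]] = idle[i]
	}
	p.Create = len(queued) - n
	p.Surplus = append(p.Surplus, idle[n:]...)
	return p
}

func main() {
	// Three queued containers, two idle nodes: one new node is needed.
	p := makePlan(
		[]string{"ctr-a", "ctr-b", "ctr-c"},
		[]string{"node-1", "node-2"},
	)
	fmt.Println(p.Assign["ctr-a"], p.Assign["ctr-b"], p.Create, len(p.Surplus))
}
```

Keeping the planning step a pure function of the two lists makes it easy to re-run after a restart, once state has been restored from the API server and the cloud provider.
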
h2. Mechanisms

h3. Interface between dispatcher and operator

The dispatcher's management status endpoint provides a list of its cloud VMs, each with cloud instance ID, UUID of the current / most recent container it attempted (if known), hourly price, and idle time. This snapshot info is not saved to PostgreSQL.

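One possible shape for the status endpoint's response is sketched below. Everything beyond the four listed fields is an assumption: the JSON field names, the @/status@ path, and the @vmStatus@, @encodeStatus@ and @statusHandler@ names are hypothetical, and idle time is expressed in whole seconds for simplicity.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// vmStatus is a hypothetical snapshot record for one cloud VM.
type vmStatus struct {
	InstanceID    string  `json:"instance_id"`
	ContainerUUID string  `json:"container_uuid,omitempty"` // omitted if not known
	HourlyPrice   float64 `json:"hourly_price"`
	IdleSeconds   int64   `json:"idle_seconds"`
}

// encodeStatus renders a snapshot as a JSON array.
func encodeStatus(vms []vmStatus) ([]byte, error) {
	return json.Marshal(vms)
}

// statusHandler serves whatever the snapshot function returns at request
// time. Nothing is persisted; the data lives only in dispatcher memory.
func statusHandler(snapshot func() []vmStatus) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		body, err := encodeStatus(snapshot())
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		w.Write(body)
	}
}

func main() {
	h := statusHandler(func() []vmStatus {
		return []vmStatus{{InstanceID: "i-0001", HourlyPrice: 0.25, IdleSeconds: 90}}
	})
	// Exercise the handler in-process instead of opening a listening socket.
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest("GET", "/status", nil))
	fmt.Println(rec.Code, rec.Body.String())
}
```
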
h3. Interface between dispatcher and cloud provider

The "VMProvider" interface has the few cloud instance operations we need (list instances+tags, create instance with tags, update instance tags, destroy instance).

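A minimal Go sketch of such an interface follows. Only the four operations come from the text above; the method names, signatures, and the @Instance@ type are assumptions, and @memProvider@ is a hypothetical in-memory stand-in that would let dispatcher logic be exercised without a real cloud account.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Instance is a hypothetical record describing one cloud VM.
type Instance struct {
	ID   string
	Type string
	Tags map[string]string
}

// VMProvider has the four operations named in the text. The exact
// method names and signatures are assumptions.
type VMProvider interface {
	List() ([]Instance, error)
	Create(instanceType string, tags map[string]string) (Instance, error)
	UpdateTags(id string, tags map[string]string) error
	Destroy(id string) error
}

var errNoSuchInstance = errors.New("no such instance")

// memProvider is an in-memory VMProvider for tests and experiments.
type memProvider struct {
	next      int
	instances map[string]Instance
}

func newMemProvider() *memProvider {
	return &memProvider{instances: map[string]Instance{}}
}

func copyTags(in map[string]string) map[string]string {
	out := make(map[string]string, len(in))
	for k, v := range in {
		out[k] = v
	}
	return out
}

// List returns all instances sorted by ID.
func (p *memProvider) List() ([]Instance, error) {
	out := make([]Instance, 0, len(p.instances))
	for _, inst := range p.instances {
		out = append(out, inst)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	return out, nil
}

func (p *memProvider) Create(instanceType string, tags map[string]string) (Instance, error) {
	p.next++
	inst := Instance{ID: fmt.Sprintf("i-%04d", p.next), Type: instanceType, Tags: copyTags(tags)}
	p.instances[inst.ID] = inst
	return inst, nil
}

// UpdateTags replaces the instance's tag set.
func (p *memProvider) UpdateTags(id string, tags map[string]string) error {
	inst, ok := p.instances[id]
	if !ok {
		return errNoSuchInstance
	}
	inst.Tags = copyTags(tags)
	p.instances[id] = inst
	return nil
}

func (p *memProvider) Destroy(id string) error {
	if _, ok := p.instances[id]; !ok {
		return errNoSuchInstance
	}
	delete(p.instances, id)
	return nil
}

func main() {
	var provider VMProvider = newMemProvider()
	inst, _ := provider.Create("small", map[string]string{"dispatcher": "d1"})
	list, _ := provider.List()
	fmt.Println(inst.ID, len(list), provider.Destroy(inst.ID))
}
```

Keeping the interface this small means each supported cloud needs only a thin driver, and the dispatcher's own logic can be tested against the in-memory version.
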
h3. Interface between dispatcher and worker node

Each worker node has a public key in /root/.ssh/authorized_keys. The dispatcher holds the corresponding private key.

The dispatcher uses the Go SSH client library to connect to worker nodes.

h3. Probe operation

On the happy path, the dispatcher already knows the state of each worker: whether it's idle, and which container it's running. In general, though, it's necessary to probe the worker node itself.

Probe:
* Check whether the SSH connection is alive; reopen if needed.
* Run the configured "ready?" command (e.g., "grep /encrypted-tmp /etc/mtab"); if this fails, conclude the node is still booting.
* Run "crunch-run --list" to get a list of crunch-run supervisors (pid + container UUID).

After a successful probe, the dispatcher should tag the cloud node record with the dispatcher's ID and probe timestamp. (In case the tagging API fails, remember the probe time in memory too.)

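The last two probe steps can be sketched as follows. This is not the actual implementation: the @runner@ type is a hypothetical stand-in for "run a command over the existing SSH connection" (the connection check itself is left out), and the output format assumed for "crunch-run --list", one "pid UUID" pair per line, is a guess based on the description above.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// Supervisor is one crunch-run process reported by the worker.
type Supervisor struct {
	PID           int
	ContainerUUID string
}

// ProbeResult summarizes what a probe learned about a worker node.
type ProbeResult struct {
	Booting     bool
	Supervisors []Supervisor
}

// runner runs a command on the worker and returns its stdout.
// Hypothetical: a real dispatcher would back this with an SSH session.
type runner func(cmd string) (string, error)

// parseList parses the assumed "pid uuid" line format, skipping blank lines.
func parseList(out string) ([]Supervisor, error) {
	var sups []Supervisor
	for _, line := range strings.Split(out, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 {
			continue
		}
		if len(fields) != 2 {
			return nil, fmt.Errorf("unexpected line %q", line)
		}
		pid, err := strconv.Atoi(fields[0])
		if err != nil {
			return nil, fmt.Errorf("bad pid in %q: %v", line, err)
		}
		sups = append(sups, Supervisor{PID: pid, ContainerUUID: fields[1]})
	}
	return sups, nil
}

// probe runs the "ready?" command, then lists supervisors. A failing
// "ready?" command means "still booting", which is not a probe error.
func probe(run runner, readyCmd string) (ProbeResult, error) {
	if _, err := run(readyCmd); err != nil {
		return ProbeResult{Booting: true}, nil
	}
	out, err := run("crunch-run --list")
	if err != nil {
		return ProbeResult{}, err
	}
	sups, err := parseList(out)
	if err != nil {
		return ProbeResult{}, err
	}
	return ProbeResult{Supervisors: sups}, nil
}

// fakeRunner simulates a worker for demonstration and testing.
func fakeRunner(ready bool, listOutput string) runner {
	return func(cmd string) (string, error) {
		if cmd == "crunch-run --list" {
			return listOutput, nil
		}
		if !ready {
			return "", errors.New("exit status 1")
		}
		return "", nil
	}
}

func main() {
	run := fakeRunner(true, "123 zzzzz-dz642-000000000000001\n")
	res, err := probe(run, "grep /encrypted-tmp /etc/mtab")
	fmt.Println(res.Booting, len(res.Supervisors), err)
}
```

Passing the command runner in as a function keeps the probe logic independent of the SSH library, so it can be tested without a live worker.
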
h3. Dead/lame nodes

If a node has been up for N seconds without a successful probe, despite at least M attempts, shut it down. (M handles the case where the dispatcher restarts during a time when the "update tags" operation isn't effective, e.g., the provider is rate-limiting API calls.)

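The shutdown rule can be written as a small predicate. The sketch below is one reading of it, with assumptions called out: the clock for "N seconds" starts at the later of boot time and the last successful probe, and the attempt counter covers only failed probes since that point. The names @lamePolicy@ and @nodeProbeState@ are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// lamePolicy holds the two thresholds from the text.
type lamePolicy struct {
	Timeout     time.Duration // "N seconds"
	MinAttempts int           // "M attempts"
}

// nodeProbeState is what the dispatcher remembers about one node.
type nodeProbeState struct {
	BootTime    time.Time
	LastSuccess time.Time // zero value if no probe has succeeded yet
	FailedSince int       // failed probes since LastSuccess (or since boot)
}

// shouldShutDown reports whether the node has gone at least Timeout
// without a successful probe AND at least MinAttempts probes have
// failed in that period. Both conditions must hold.
func (p lamePolicy) shouldShutDown(n nodeProbeState, now time.Time) bool {
	since := n.BootTime
	if n.LastSuccess.After(since) {
		since = n.LastSuccess
	}
	return now.Sub(since) >= p.Timeout && n.FailedSince >= p.MinAttempts
}

func main() {
	policy := lamePolicy{Timeout: 10 * time.Minute, MinAttempts: 3}
	boot := time.Date(2018, 8, 4, 4, 0, 0, 0, time.UTC)
	node := nodeProbeState{BootTime: boot, FailedSince: 5}
	fmt.Println(policy.shouldShutDown(node, boot.Add(5*time.Minute)))  // within the timeout
	fmt.Println(policy.shouldShutDown(node, boot.Add(15*time.Minute))) // past the timeout
}
```

Requiring both conditions is what protects a healthy node after a dispatcher restart: even if the stored probe timestamp is stale, the node is not shut down until this dispatcher has itself failed to probe it M times.
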
h3. Dead dispatchers

Every cluster should run multiple dispatchers. If one dies and stays down, the others must notice and take over (or shut down) its nodes -- otherwise those worker nodes will run forever.

Each dispatcher, when inspecting the list of cloud nodes, should try to claim (or, failing that, destroy) any node that belongs to a different dispatcher and hasn't completed a probe for N seconds. (This timeout might be longer than the "lame node" timeout.)
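
The claim-or-destroy pass can be sketched as below. The details are assumptions rather than a settled design: ownership is taken to be recorded in a tag named "dispatcher", claiming a node means rewriting that tag, and the @cloud@ interface, @cloudNode@ type, and @reclaimStale@ function are hypothetical names. A real dispatcher would also probe a node after claiming it.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// cloudNode is the per-node information read back from instance tags.
type cloudNode struct {
	ID        string
	Owner     string    // dispatcher ID recorded in the node's tags
	LastProbe time.Time // probe timestamp recorded in the node's tags
}

// cloud is the subset of provider operations this pass needs.
type cloud interface {
	UpdateTags(id string, tags map[string]string) error
	Destroy(id string) error
}

// reclaimStale tries to claim every node owned by another dispatcher
// whose last probe is at least timeout old. If claiming fails, it tries
// to destroy the node. Nodes for which both calls fail are left alone
// and would be reconsidered on the next pass.
func reclaimStale(c cloud, self string, nodes []cloudNode, now time.Time, timeout time.Duration) (claimed, destroyed []string) {
	for _, n := range nodes {
		if n.Owner == self || now.Sub(n.LastProbe) < timeout {
			continue
		}
		if err := c.UpdateTags(n.ID, map[string]string{"dispatcher": self}); err == nil {
			claimed = append(claimed, n.ID)
		} else if err := c.Destroy(n.ID); err == nil {
			destroyed = append(destroyed, n.ID)
		}
	}
	return claimed, destroyed
}

// fakeCloud rejects tag updates for the instance IDs in failUpdate.
type fakeCloud struct{ failUpdate map[string]bool }

func (f fakeCloud) UpdateTags(id string, tags map[string]string) error {
	if f.failUpdate[id] {
		return errors.New("tag update rejected")
	}
	return nil
}

func (f fakeCloud) Destroy(id string) error { return nil }

func main() {
	now := time.Date(2018, 8, 4, 4, 0, 0, 0, time.UTC)
	nodes := []cloudNode{
		{ID: "i-1", Owner: "d1", LastProbe: now.Add(-time.Hour)},   // ours: skipped
		{ID: "i-2", Owner: "d2", LastProbe: now.Add(-time.Hour)},   // stale: claimed
		{ID: "i-3", Owner: "d2", LastProbe: now.Add(-time.Hour)},   // stale, claim fails: destroyed
		{ID: "i-4", Owner: "d2", LastProbe: now.Add(-time.Minute)}, // fresh: left alone
	}
	c := fakeCloud{failUpdate: map[string]bool{"i-3": true}}
	claimed, destroyed := reclaimStale(c, "d1", nodes, now, 30*time.Minute)
	fmt.Println(claimed, destroyed)
}
```
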