Dispatching containers to cloud VMs » History » Version 2
Peter Amstutz, 08/03/2018 02:31 PM
1 | 1 | Tom Clegg | h1. Dispatching containers to cloud VMs |
---|---|---|---|
2 | |||
3 | (Draft. In fact, this might not be needed at all. For example, we might instead dispatch to Kubernetes and find or build a Kubernetes auto-scaler.) |
||
4 | |||
5 | h2. Background |
||
6 | |||
7 | This is about dispatching to on-demand cloud nodes like Amazon EC2 instances. |
||
8 | |||
9 | Not to be confused with dispatching to a cloud-based container service like Amazon Elastic Container Service, Azure Batch or Google Kubernetes Engine. |
||
10 | |||
11 | In crunch1, and the early days of crunch2, we made something work with arvados-nodemanager and SLURM. |
||
12 | |||
13 | One of the goals of crunch2 is to eliminate all uses of SLURM except crunch-dispatch-slurm, whose purpose is to dispatch Arvados containers to a SLURM cluster that already exists for non-Arvados tasks. |
||
14 | |||
15 | This doc doesn’t describe a sequence of development tasks or a migration plan. It describes the end state: how dispatch will work when all implementation tasks and migrations are complete. |
||
16 | |||
17 | h2. Relevant components |
||
18 | |||
19 | API server (backed by PostgreSQL) is the source of truth about which containers the system should be trying to execute (or cancel) at any given time. |
||
20 | |||
21 | Arvados configuration (currently via a file in /etc, in future via Consul, etcd, or similar) is the source of truth about cloud provider credentials, allowed node types, spending limits/policies, etc. |
||
22 | |||
23 | crunch-dispatch-cloud-node (a new component) arranges for queued containers to run on worker nodes, brings up new worker nodes in order to run the queue faster, and shuts down idle worker nodes. |
||
24 | |||
25 | h2. Overview of crunch-dispatch-cloud-node operation |
||
26 | |||
27 | When first starting up, inspect the API server’s container queue and the cloud provider’s list of dispatcher-tagged cloud nodes, and restore internal state accordingly. |
||
28 | |||
29 | When the API server puts a container in Queued state, lock it, select or create a cloud node to run it on, and start a crunch-run process there to run it. |
||
30 | |||
31 | When the API server says a container (locked or dispatched by this dispatcher) should be cancelled, ensure the actual container and its crunch-run supervisor get shut down and the relevant node becomes idle. |
||
32 | |||
33 | When a crunch-run invocation (dispatched by this dispatcher) exits without updating the container record on the API server -- or can’t run at all -- clean up accordingly. |
||
34 | |||
35 | Invariant: every dispatcher-tagged cloud node is either needed by this dispatcher, or should be shut down (so if there are multiple dispatchers, they must use different tags). |
||
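The invariant above suggests how state could be restored at startup: every node carrying this dispatcher's tag is either kept or shut down. The following is a minimal sketch, not the actual implementation. @CloudNode@, @reconcile@, and the rule "keep one idle node per queued container" are hypothetical names and assumptions introduced for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CloudNode:
    """Hypothetical view of one cloud instance as listed by the provider."""
    node_id: str
    tags: frozenset
    container_uuid: Optional[str] = None  # container the node is running, if any


def reconcile(nodes, dispatcher_tag, active_uuids, queued_count):
    """Split this dispatcher's nodes into (keep, shut_down) lists of node ids.

    Assumptions (not from the design doc): a busy node is needed only while
    its container is still active on the API server, and at most one idle
    node is kept per queued container.
    """
    keep, shut_down = [], []
    idle_budget = queued_count
    for node in nodes:
        if dispatcher_tag not in node.tags:
            continue  # tagged by another dispatcher: never touch it
        if node.container_uuid is not None and node.container_uuid in active_uuids:
            keep.append(node.node_id)
        elif node.container_uuid is None and idle_budget > 0:
            idle_budget -= 1
            keep.append(node.node_id)
        else:
            shut_down.append(node.node_id)
    return keep, shut_down
```

Nodes tagged by a different dispatcher are skipped entirely, which is why multiple dispatchers must use different tags.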
36 | |||
37 | h2. TBD |
||
38 | |||
39 | Mechanism for running commands on worker nodes: SSH? |
||
40 | 2 | Peter Amstutz | |
41 | |||
42 | h1. "crunch-dispatch-cloud" (PA) |
||
43 | |||
44 | Node manager generates a wishlist based on the container queue. Compute nodes run crunch-dispatch-local or a similar service, which asks the API server for work and then runs it. |
||
45 | |||
46 | Advantages: |
||
47 | |||
48 | * Complete control over scheduling decisions / priority |
||
49 | |||
50 | Disadvantages: |
||
51 | |||
52 | * Additional load on API server (but probably not that much) |
||
53 | * Need a new scheme for nodes to report their status so that node manager knows whether they are busy or idle. Node manager has to be able to put nodes in the equivalent of a "draining" state to ensure they don't get shut down while doing work. (We can use the "nodes" table for this.) |
||
54 | * Need to be able to detect node failure. |
||
55 | |||
56 | h3. Starting up |
||
57 | |||
58 | # Node manager looks at pending containers to get a "wishlist" |
||
59 | # Nodes spin up the way they do now. However, instead of registering with SLURM, they start crunch-dispatch-local (c-d-l). |
||
60 | # The node ping token should have a corresponding API token, which the dispatcher uses to talk to the API server |
||
61 | # C-d-l pings the API server to ask for work. The ping operation puts the node in either "busy" (if work is returned) or "idle" state |
||
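The wishlist step could be sketched as follows. The document does not say how a node type is chosen for a container, so the rule used here (the cheapest type that satisfies the container's CPU and RAM request) is purely an assumption, as are the function name @build_wishlist@ and the dictionary keys.

```python
def build_wishlist(pending_containers, node_types):
    """Return one node type name per pending container that some type can fit.

    Hypothetical inputs: each container is a dict with "vcpus" and "ram";
    each node type is a dict with "name", "vcpus", "ram", and "price".
    Selection rule (an assumption): cheapest type that satisfies the request.
    """
    wishlist = []
    for container in pending_containers:
        fits = [
            t for t in node_types
            if t["vcpus"] >= container["vcpus"] and t["ram"] >= container["ram"]
        ]
        if fits:
            wishlist.append(min(fits, key=lambda t: t["price"])["name"])
        # Containers that fit no allowed node type are skipped in this sketch;
        # a real dispatcher would need to report them somehow.
    return wishlist
```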
62 | |||
63 | h3. Running containers |
||
64 | |||
65 | Assumption: Each node runs only one container at a time. |
||
66 | |||
67 | # Add "I am idle, give me work" API which locks and returns the next container that is appropriate for the node, or marks the node as "idle" if no work is available |
||
68 | # Node record records which container it is supposed to be running (can be part of the "Lock" call based on the per-node API token) |
||
69 | # C-d-l makes API call to nodes table to say it is "busy" |
||
70 | # C-d-l calls crunch-run to run the container |
||
71 | # C-d-l must continue to ping that it is "busy" every X seconds |
||
72 | # When container finishes, c-d-l pings that it is "idle" |
||
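The worker-side steps above can be sketched as one cycle of a loop. This is only an illustration under stated assumptions: the @api@ object with its @request_work()@ and @ping(state)@ methods stands in for API calls that do not exist yet, @run_container@ stands in for invoking crunch-run, and @FakeAPI@ is an in-memory stub for trying the loop without a server.

```python
import threading


def run_one(api, run_container, ping_interval=1.0):
    """Run one cycle: ask for work, run it with "busy" heartbeats, report "idle".

    Returns the container UUID that was run, or None if no work was available
    (in which case the API server is assumed to have marked the node idle).
    """
    uuid = api.request_work()  # hypothetical "I am idle, give me work" call
    if uuid is None:
        return None
    api.ping("busy")

    done = threading.Event()

    def heartbeat():
        # Event.wait returns False on timeout, so this pings every
        # ping_interval seconds until the container has finished.
        while not done.wait(ping_interval):
            api.ping("busy")

    beat = threading.Thread(target=heartbeat, daemon=True)
    beat.start()
    try:
        run_container(uuid)  # stand-in for calling crunch-run
    finally:
        done.set()
        beat.join()
        api.ping("idle")
    return uuid


class FakeAPI:
    """In-memory stand-in for the API server, for illustration only."""

    def __init__(self, queue):
        self.queue = list(queue)
        self.pings = []

    def request_work(self):
        return self.queue.pop(0) if self.queue else None

    def ping(self, state):
        self.pings.append(state)
```

Stopping the heartbeat before sending the final "idle" ping ensures no stale "busy" ping arrives after the node has reported itself idle.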
73 | |||
74 | h3. Shutting down |
||
75 | |||
76 | # When node manager decides a node is ready for shutdown, it makes an API call on the node record to indicate "draining". |
||
77 | # C-d-l pings "I am idle" on a "draining" record. This puts the state in "drained" and c-d-l does not get any new work. |
||
78 | # Node manager sees the node is "drained" and can proceed with destroying the cloud node. |
||
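On the API server side, the busy/idle/draining/drained transitions described above could be expressed as a small state function. This is a sketch under assumptions: the function names and the exact state strings are illustrative, and the document does not specify what a "busy" ping does to a "draining" record (here it leaves the record draining).

```python
def handle_ping(state, reported, work_available=False):
    """Return (new_state, assign_work) for a ping from a node.

    state:    node record state before the ping
    reported: what the node says about itself, "busy" or "idle"
    """
    if reported == "busy":
        # A busy heartbeat must not clear a pending drain request (assumption).
        return ("draining" if state == "draining" else "busy"), False
    if reported != "idle":
        raise ValueError("unknown ping state: %r" % (reported,))
    if state in ("draining", "drained"):
        # An idle ping on a draining record completes the drain; no new work.
        return "drained", False
    if work_available:
        return "busy", True
    return "idle", False


def request_drain(state):
    """Node manager marks a node for shutdown; it is destroyed once drained."""
    return "drained" if state == "drained" else "draining"
```

Because a draining node never receives new work, node manager can destroy it as soon as the record reads "drained" without racing against a newly assigned container.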
79 | |||
80 | h3. Handling failure |
||
81 | |||
82 | # If a node enters a failure state and there is a container associated with it, the container should either be unlocked (if container is in locked state) or cancelled (if in running state). |
||
83 | # API server should have a background process which looks for nodes that haven't pinged recently and puts them into failed state. |
||
84 | # Node can also put itself into failed state with an API call. |
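
The failure rules above could be implemented by a periodic sweep along these lines. This is a sketch, not the actual design: the record layout (plain dicts with "state", "last_ping", and "container" keys), the function name @sweep_failed_nodes@, and the timeout value are all assumptions. Unlocking is modelled as returning the container to the Queued state.

```python
def sweep_failed_nodes(nodes, containers, now, timeout):
    """Mark silent nodes as failed and release or cancel their containers.

    nodes:      dict of node id -> {"state", "last_ping", "container"}
    containers: dict of container UUID -> {"state"}
    Returns the ids of nodes newly put into the failed state.
    """
    newly_failed = []
    for node_id, node in nodes.items():
        if node["state"] != "failed" and now - node["last_ping"] > timeout:
            node["state"] = "failed"
            newly_failed.append(node_id)
        # Also covers nodes that put themselves into failed state via the API.
        if node["state"] == "failed" and node.get("container"):
            container = containers[node["container"]]
            if container["state"] == "Locked":
                container["state"] = "Queued"  # unlock so it can be retried
            elif container["state"] == "Running":
                container["state"] = "Cancelled"
            node["container"] = None
    return newly_failed
```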