Dispatching containers to cloud VMs » History » Version 6

Tom Clegg, 09/05/2018 09:08 PM

h1. Dispatching containers to cloud VMs

(Draft)

h2. Component name

(TBD) crunch-dispatch-cloud, or arvados-dispatch-cloud, or arvados-dispatch (c-d-slurm and c-d-local could become arvados-dispatch modules, selected at runtime via config, rather than shipping as separate packages/programs).

h2. Overview

The dispatcher waits for containers to appear in the queue, and runs them on appropriately sized cloud VMs. When there are no idle cloud VMs of the desired size, the dispatcher brings up more VMs using the cloud provider's API. The dispatcher also shuts down VMs that have been idle longer than the configured idle timeout, or sooner if the provider refuses to create new VMs.

h2. Interaction with other components

The API server (backed by PostgreSQL) supplies the container queue: which containers the system should be trying to execute (or cancel) at any given time.

The cloud provider's API supplies a list of VMs that exist (or are being created) at a given time and their network addresses, accepts orders to create new VMs, updates instance tags, and (optionally, depending on the driver) obtains the VMs' SSH server public keys.

The SSH server on each cloud VM allows the dispatcher to authenticate with a private key and execute shell commands as root.

h2. Configuration

Arvados configuration (currently a file in /etc) supplies cloud provider credentials, allowed node types, spending limits/policies, etc.

<pre><code class="yaml">
    CloudVMs:
      BootTimeout: 20m
      Driver: Amazon
      DriverParameters:
        Region: us-east-1
        APITimeout: 20s
        EC2Key: abcdef
        EC2Secret: abcdefghijklmnopqrstuvwxyz
        StorageKey: abcdef
        StorageSecret: abcdefghijklmnopqrstuvwxyz
        ImageID: ami-0123456789abcdef0
        SubnetID: subnet-01234567
        SecurityGroups: sg-01234567
</code></pre>
 
h2. Scheduling policy

The container priority field determines the order in which resources are allocated.
* If container C1 has priority P1,
* ...and C2 has higher priority P2,
* ...and there is no pending/booting/idle VM suitable for running C2,
* ...then C1 will not be started.

However, containers that run on different VM types don't necessarily start in priority order.
* If container C1 has priority P1,
* ...and C2 has higher priority P2,
* ...and there is no idle VM suitable for running C2,
* ...and there is a pending/booting VM that will be suitable for running C2 when it comes up,
* ...and there is an idle VM suitable for running C1,
* ...then C1 will start before C2.
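 
The two rules above can be sketched as a single scheduling pass. This is only an illustration in Go, not the dispatcher's actual code: the types VM, Container and Plan, the function name plan, and the choice to end the pass as soon as one container has no suitable VM at all are assumptions made for the example.

<pre><code class="go">
package main

import (
	"fmt"
	"sort"
)

// VMState is a hypothetical, simplified view of a cloud VM's lifecycle.
type VMState int

const (
	Booting VMState = iota // pending or booting: not ready yet
	Idle                   // ready and not running a container
	Busy                   // running a container
)

// VM and Container carry only the fields this sketch needs.
type VM struct {
	ID    string
	Type  string
	State VMState
}

type Container struct {
	UUID     string
	Priority int
	Type     string // desired instance type
}

// Plan is the outcome of one pass over the queue.
type Plan struct {
	Start  map[string]string // container UUID -> ID of the idle VM to start it on
	Create []string          // instance types to request from the provider
}

// plan walks the queue from highest to lowest priority.
func plan(queue []Container, vms []VM) Plan {
	sorted := append([]Container(nil), queue...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Priority > sorted[j].Priority
	})
	claimed := map[string]bool{}
	p := Plan{Start: map[string]string{}}
	for _, c := range sorted {
		if vm := find(vms, claimed, c.Type, Idle); vm != nil {
			claimed[vm.ID] = true
			p.Start[c.UUID] = vm.ID
			continue
		}
		if vm := find(vms, claimed, c.Type, Booting); vm != nil {
			// c waits for this VM; lower-priority containers may still start.
			claimed[vm.ID] = true
			continue
		}
		// No pending/booting/idle VM is suitable: request one and stop,
		// so that nothing with lower priority is started in this pass.
		p.Create = append(p.Create, c.Type)
		break
	}
	return p
}

// find returns the first unclaimed VM with the given type and state, or nil.
func find(vms []VM, claimed map[string]bool, typ string, st VMState) *VM {
	for i := range vms {
		if vms[i].Type == typ && vms[i].State == st && !claimed[vms[i].ID] {
			return &vms[i]
		}
	}
	return nil
}

func main() {
	vms := []VM{{"vm-1", "small", Idle}, {"vm-2", "large", Booting}}
	queue := []Container{{"c1", 1, "small"}, {"c2", 2, "large"}}
	p := plan(queue, vms)
	fmt.Println(p.Start, p.Create)
}
</code></pre>

With one idle "small" VM and one booting "large" VM, the lower-priority container c1 is started right away while c2 waits for its VM. If the booting VM is removed from the input, nothing is started and one "large" VM is requested instead.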
 
h2. Synchronizing state

When first starting up, the dispatcher inspects the API server's container queue and the cloud provider's list of dispatcher-tagged cloud nodes, and restores its internal state accordingly.

Often, at startup there will be some containers with state=Locked. To avoid breaking priority order, the dispatcher won't schedule any new containers until either all such locked containers are matched up with crunch-run processes on existing VMs (typically preparing a docker image), or all of the existing VMs have been probed successfully (meaning the locked containers aren't running anywhere and need to be rescheduled).

When a user cancels a container request whose container has state=Locked or Running, the container priority changes to 0. On its next poll, the dispatcher notices this and kills any corresponding crunch-run processes (or, if there is no such process, just unlocks the container).

When a crunch-run process ends without finalizing its container's state, the dispatcher notices this and sets the container's state to Cancelled.

h2. Operator view

The management status endpoint provides:
* list of cloud VMs, each with
** provider's instance ID
** hourly price (from configuration file)
** instance type (from configuration file)
** instance type (from provider's menu)
** UUID of the current / most recent container attempted (if known)
** time last container finished (or boot time, if nothing run yet)
* list of queued/running containers, each with
** UUID
** state (queued/locked/running/complete/cancelled)
** desired instance type
** time appeared in queue
** time started (if started)

The metrics endpoint tracks:
* (each VM) time elapsed between VM creation and first successful SSH connection
* (each VM) time elapsed between first successful SSH connection and ready to run a container
* total hourly price of all existing VMs
* total VCPUs and memory allocated to containers
* number of containers running
* number of containers allocated to VMs but not started yet (because VMs are pending/booting)
* number of containers not allocated to VMs (because provider quota is reached)

h2. SSH keys

Each worker node has a public key in /root/.ssh/authorized_keys. The dispatcher has the corresponding private key.

(Future) The dispatcher generates its own keys and installs its public key on new VMs using cloud provider bootstrapping/metadata features.

h3. Probes

Sometimes (on the happy path) the dispatcher knows the state of each worker, whether it's idle, and which container it's running. In general, though, it's necessary to probe the worker node itself.

Probe:
* Check whether the SSH connection is alive; reopen if needed.
* Run the configured "ready?" command (e.g., "grep /encrypted-tmp /etc/mtab"); if this fails, conclude the node is still booting.
* Run "crunch-run --list" to get a list of crunch-run supervisors (pid + container UUID).

After a successful probe, the dispatcher should tag the cloud node record with the dispatcher's ID and probe timestamp. (In case the tagging API fails, remember the probe time in memory too.)
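 
As a rough illustration of the probe steps, the following Go sketch runs the two commands through an abstract executor. Everything named here is hypothetical: the Executor interface (assumed to hide the SSH connection, including reopening it when needed), the ProbeResult type, the fakeExec test double, and in particular the output format of "crunch-run --list", which is assumed to be one "pid UUID" pair per line because this page only says that the command yields a pid and a container UUID. The container UUID in the example is a placeholder.

<pre><code class="go">
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// Executor is a hypothetical abstraction over "run a shell command on the
// worker via SSH". Reconnecting is assumed to happen inside Run.
type Executor interface {
	Run(cmd string) (stdout string, err error)
}

// ProbeResult summarizes what one probe learned about a worker.
type ProbeResult struct {
	Ready   bool           // false means the node is still booting
	Running map[int]string // crunch-run pid -> container UUID
}

// probe runs the configured "ready?" command, then lists crunch-run
// supervisors. The "pid UUID" line format is an assumption.
func probe(ex Executor, readyCmd string) (ProbeResult, error) {
	if _, err := ex.Run(readyCmd); err != nil {
		// A failing "ready?" command means the node is still booting.
		return ProbeResult{}, nil
	}
	out, err := ex.Run("crunch-run --list")
	if err != nil {
		return ProbeResult{}, fmt.Errorf("listing supervisors: %w", err)
	}
	res := ProbeResult{Ready: true, Running: map[int]string{}}
	for _, line := range strings.Split(out, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 {
			continue // skip blank lines
		}
		if len(fields) != 2 {
			return ProbeResult{}, fmt.Errorf("unexpected line %q", line)
		}
		pid, err := strconv.Atoi(fields[0])
		if err != nil {
			return ProbeResult{}, fmt.Errorf("bad pid in %q: %w", line, err)
		}
		res.Running[pid] = fields[1]
	}
	return res, nil
}

// fakeExec maps commands to canned output; unknown commands fail.
type fakeExec map[string]string

func (f fakeExec) Run(cmd string) (string, error) {
	out, ok := f[cmd]
	if !ok {
		return "", errors.New("command failed: " + cmd)
	}
	return out, nil
}

func main() {
	ready := "grep /encrypted-tmp /etc/mtab"
	ex := fakeExec{
		ready:               "", // output is ignored, only success matters
		"crunch-run --list": "123 zzzzz-dz642-000000000000001\n",
	}
	res, err := probe(ex, ready)
	fmt.Println(res.Ready, res.Running, err)
}
</code></pre>

Keeping the SSH details behind an interface lets the parsing and the "still booting" decision be exercised without a real worker node.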
 
h3. Detecting dead/lame nodes

If a node has been up for N seconds without a successful probe, despite at least M attempts, shut it down. (M handles the case where the dispatcher restarts during a time when the "update tags" operation isn't effective, e.g., provider is rate-limiting API calls.)
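 
The shutdown rule can be written as a small predicate. The Go sketch below is one possible reading, not the dispatcher's implementation: the Worker type, its field names, the helper at, and the choice to measure the N seconds from the later of the node's creation time and its last successful probe are all assumptions made for the example.

<pre><code class="go">
package main

import (
	"fmt"
	"time"
)

// Worker holds the hypothetical bookkeeping needed for the rule.
type Worker struct {
	Created     time.Time // when the node came up
	LastSuccess time.Time // zero value if no probe has succeeded yet
	Attempts    int       // probe attempts since Created or LastSuccess
}

// shouldShutdown reports whether the node has gone nSeconds without a
// successful probe despite at least m attempts.
func shouldShutdown(w Worker, now time.Time, nSeconds, m int) bool {
	ref := w.Created
	if w.LastSuccess.After(ref) {
		ref = w.LastSuccess
	}
	tooLong := now.Sub(ref) >= time.Duration(nSeconds)*time.Second
	return tooLong && w.Attempts >= m
}

// at is a helper that builds a time from seconds since the Unix epoch.
func at(sec int64) time.Time {
	return time.Unix(sec, 0)
}

func main() {
	w := Worker{Created: at(0), Attempts: 3}
	fmt.Println(shouldShutdown(w, at(600), 300, 3))
}
</code></pre>

Requiring both conditions means a node is not shut down merely because a freshly restarted dispatcher has not had time to probe it M times yet.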
 
h3. Multiple dispatchers

Not supported in the initial version.