Dispatching containers to cloud VMs » History » Version 14
Tom Clegg, 10/16/2018 03:22 PM
h1. Dispatching containers to cloud VMs

(Draft)

{{>toc}}

h2. Component name / purpose

crunch-dispatch-cloud runs Arvados user containers on generic public cloud infrastructure by automatically creating and destroying VMs of various sizes according to demand, preparing the VMs' runtime environments, and running containers on them.

h2. Deployment

The crunch-dispatch-cloud process can run anywhere, as long as it has network access to the Arvados controller, the cloud provider's API, and the worker VMs. Each Arvados cluster should run only one crunch-dispatch-cloud process (future versions will support multiple dispatchers).

h2. Overview of operation

The dispatcher waits for containers to appear in the queue, and runs them on appropriately sized cloud VMs. When there are no idle cloud VMs with the desired size, the dispatcher brings up more VMs using the cloud provider's API. The dispatcher also shuts down VMs that have been idle longer than the configured idle timeout, or sooner if the provider starts refusing to create new VMs.

h2. Interaction with other components

Controller (backed by RailsAPI and PostgreSQL) supplies the container queue: which containers the system should be trying to execute (or cancel) at any given time.

The cloud provider's API supplies a list of VMs that exist (or are being created) at a given time and their network addresses, accepts orders to create new VMs, updates instance tags, and (optionally, depending on the driver) provides the VMs' SSH server public keys.

The SSH server on each cloud VM allows the dispatcher to authenticate with a private key and execute shell commands as root.

h2. Configuration

Arvados configuration (currently a file in /etc) supplies cloud provider credentials, allowed node types, spending limits/policies, etc.

<pre><code class="yaml">
CloudVMs:
  BootProbeCommand: "docker ps -q"
  SyncInterval: 1m    # get list of
  TimeoutIdle: 1m     # shutdown if idle longer than this
  TimeoutBooting: 10m # shutdown if exists longer than this without running BootProbeCommand successfully
  TimeoutProbe: 2m    # shutdown if (after booting) communication fails longer than this, even if ctrs are running
  TimeoutShutdown: 1m # shutdown again if node still exists this long after shutdown
  Driver: Amazon
  DriverParameters:   # following configs are driver dependent
    Region: us-east-1
    APITimeout: 20s
    EC2Key: abcdef
    EC2Secret: abcdefghijklmnopqrstuvwxyz
    StorageKey: abcdef
    StorageSecret: abcdefghijklmnopqrstuvwxyz
    ImageID: ami-0123456789abcdef0
    SubnetID: subnet-01234567
    SecurityGroups: sg-01234567
Dispatch:
  StaleLockTimeout: 1m     # after restart, time to wait for workers to come up before abandoning locks from previous run
  PollInterval: 1m         # how often to get latest queue from arvados controller
  ProbeInterval: 10s       # how often to probe each instance for current status/vital signs
  MaxProbesPerSecond: 1000 # limit total probe rate for dispatch process (across all instances)
  PrivateKey: |            # SSH key able to log in as root@ worker VMs
    -----BEGIN RSA PRIVATE KEY-----
    MIIEowIBAAKCAQEAqYm4XsQHm8sBSZFwUX5VeW1OkGsfoNzcGPG2nzzYRhNhClYZ
    0ABHhUk82HkaC/8l6d/jpYTf42HrK42nNQ0r0Yzs7qw8yZMQioK4Yk+kFyVLF78E
    GRG4pGAWXFs6pUchs/lm8fo9zcda4R3XeqgI+NO+nEERXmdRJa1FhI+Za3/S/+CV
    mg+6O00wZz2+vKmDPptGN4MCKmQOCKsMJts7wSZGyVcTtdNv7jjfr6yPAIOIL8X7
    LtarBCFaK/pD7uWll/Uj7h7D8K48nIZUrvBJJjXL8Sm4LxCNoz3Z83k8J5ZzuDRD
    gRiQe/C085mhO6VL+2fypDLwcKt1tOL8fI81MwIDAQABAoIBACR3tEnmHsDbNOav
    Oxq8cwRQh9K2yDHg8BMJgz/TZa4FIx2HEbxVIw0/iLADtJ+Z/XzGJQCIiWQuvtg6
    exoFQESt7JUWRWkSkj9JCQJUoTY9Vl7APtBpqG7rIEQzd3TvzQcagZNRQZQO6rR7
    p8sBdBSZ72lK8cJ9tM3G7Kor/VNK7KgRZFNhEWnmvEa3qMd4hzDcQ4faOn7C9NZK
    dwJAuJVVfwOLlOORYcyEkvksLaDOK2DsB/p0AaCpfSmThRbBKN5fPXYaKgUdfp3w
    70Hpp27WWymb1cgjyqSH3DY+V/kvid+5QxgxCBRq865jPLn3FFT9bWEVS/0wvJRj
    iMIRrjECgYEA4Ffv9rBJXqVXonNQbbstd2PaprJDXMUy9/UmfHL6pkq1xdBeuM7v
    yf2ocXheA8AahHtIOhtgKqwv/aRhVK0ErYtiSvIk+tXG+dAtj/1ZAKbKiFyxjkZV
    X72BH7cTlR6As5SRRfWM/HaBGEgED391gKsI5PyMdqWWdczT5KfxAksCgYEAwXYE
    ewPmV1GaR5fbh2RupoPnUJPMj36gJCnwls7sGaXDQIpdlq56zfKgrLocGXGgj+8f
    QH7FHTJQO15YCYebtsXWwB3++iG43gVlJlecPAydsap2CCshqNWC5JU5pan0QzsP
    exzNzWqfUPSbTkR2SRaN+MenZo2Y/WqScOAth7kCgYBgVoLujW9EXH5QfXJpXLq+
    jTvE38I7oVcs0bJwOLPYGzcJtlwmwn6IYAwohgbhV2pLv+EZSs42JPEK278MLKxY
    lgVkp60npgunFTWroqDIvdc1TZDVxvA8h9VeODEJlSqxczgbMcIUXBM9yRctTI+5
    7DiKlMUA4kTFW2sWwuOlFwKBgGXvrYS0FVbFJKm8lmvMu5D5x5RpjEu/yNnFT4Pn
    G/iXoz4Kqi2PWh3STl804UF24cd1k94D7hDoReZCW9kJnz67F+C67XMW+bXi2d1O
    JIBvlVfcHb1IHMA9YG7ZQjrMRmx2Xj3ce4RVPgUGHh8ra7gvLjd72/Tpf0doNClN
    ti/hAoGBAMW5D3LhU05LXWmOqpeT4VDgqk4MrTBcstVe7KdVjwzHrVHCAmI927vI
    pjpphWzpC9m3x4OsTNf8m+g6H7f3IiQS0aiFNtduXYlcuT5FHS2fSATTzg5PBon9
    1E6BudOve+WyFyBs7hFWAqWFBdWujAl4Qk5Ek09U2ilFEPE7RTgJ
    -----END RSA PRIVATE KEY-----
InstanceTypes:
- Name: m4.large
  VCPUs: 2
  RAM: 7782000000
  Scratch: 32000000000
  Price: 0.1
- Name: m4.large.spot
  Preemptible: true
  VCPUs: 2
  RAM: 7782000000
  Scratch: 32000000000
  Price: 0.1
- Name: m4.xlarge
  VCPUs: 4
  RAM: 15564000000
  Scratch: 80000000000
  Price: 0.2
- Name: m4.xlarge.spot
  Preemptible: true
  VCPUs: 4
  RAM: 15564000000
  Scratch: 80000000000
  Price: 0.2
- Name: m4.2xlarge
  VCPUs: 8
  RAM: 31129000000
  Scratch: 160000000000
  Price: 0.4
- Name: m4.2xlarge.spot
  Preemptible: true
  VCPUs: 8
  RAM: 31129000000
  Scratch: 160000000000
  Price: 0.4
</code></pre>
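
As an illustration of how the @InstanceTypes@ entries might be used, the following Python sketch picks a VM size for a container. It is not the dispatcher's actual code: the function name @choose_instance_type@ and the "cheapest type that fits" rule are assumptions made for this example, and the preemptible (@.spot@) entries are left out for brevity.

<pre><code class="python">
# Non-preemptible entries copied from the example configuration above.
INSTANCE_TYPES = [
    {"Name": "m4.large", "VCPUs": 2, "RAM": 7782000000, "Scratch": 32000000000, "Price": 0.1},
    {"Name": "m4.xlarge", "VCPUs": 4, "RAM": 15564000000, "Scratch": 80000000000, "Price": 0.2},
    {"Name": "m4.2xlarge", "VCPUs": 8, "RAM": 31129000000, "Scratch": 160000000000, "Price": 0.4},
]


def choose_instance_type(vcpus, ram, scratch, types=INSTANCE_TYPES):
    """Return the cheapest type meeting all three requirements, or None.

    Assumption: "appropriately sized" means the lowest-priced type whose
    VCPUs, RAM and Scratch are all at least what the container asks for.
    """
    candidates = [
        t for t in types
        if t["VCPUs"] >= vcpus and t["RAM"] >= ram and t["Scratch"] >= scratch
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda t: t["Price"])
</code></pre>

For example, a container asking for 3 VCPUs and 8 GB of RAM does not fit on m4.large, so this sketch selects m4.xlarge; a request that no configured type can satisfy returns @None@.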

h2. Management API

APIs for monitoring/diagnostics/control are available via HTTP on a configurable address/port. Request headers must include "Authorization: Bearer {management token}".

Responses are JSON-encoded and resemble other Arvados APIs:

<pre><code class="json">
{
  "Items": [
    {
      "Name": "...",
      ...
    },
    ...
  ]
}
</code></pre>

@GET /arvados/v1/dispatch/instances@ lists cloud VMs. Each returned item includes:
* provider's instance ID
* hourly price (from configuration file)
* instance type (from configuration file)
* instance type (from provider's menu)
* UUID of the current / most recent container attempted (if known)
* time last container finished (or boot time, if nothing run yet)

@GET /arvados/v1/dispatch/containers@ lists queued/locked/running containers. Each returned item includes:
* container UUID
* container state (Queued/Locked/Running/Complete/Cancelled)
* desired instance type
* time appeared in queue
* time started (if started)

@POST /arvados/v1/dispatch/instances/:instance_id/drain@ puts an instance in "drain" state.
* if the instance is currently running a container, it is allowed to continue
* no further containers will be scheduled on the instance
* (TBD) the instance will not be shut down automatically

@POST /arvados/v1/dispatch/instances/:instance_id/shutdown@ puts an instance in "shutdown" state.
* if the instance is currently running a container, the instance is shut down when the container finishes
* otherwise, the instance is shut down immediately
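
The endpoints above could be called from a small client along the following lines. This is a sketch using only the Python standard library; the helper names @management_request@ and @list_items@, and the base URL in the usage note, are hypothetical. Only the paths, the @Authorization@ header and the @Items@ response key come from this page.

<pre><code class="python">
import json
import urllib.request


def management_request(base_url, token, path, method="GET"):
    """Build (but do not send) an authenticated management API request."""
    req = urllib.request.Request(base_url.rstrip("/") + path, method=method)
    req.add_header("Authorization", "Bearer " + token)
    return req


def list_items(req, opener=urllib.request.urlopen):
    """Send the request and return the "Items" array of the JSON response.

    The opener argument exists so the function can be exercised without
    a running dispatcher.
    """
    with opener(req) as resp:
        return json.load(resp)["Items"]
</code></pre>

With a hypothetical dispatcher address, @list_items(management_request("http://dispatcher.example:9006", token, "/arvados/v1/dispatch/instances"))@ would return the list of instances, and passing @method="POST"@ with a @.../instances/:instance_id/drain@ path would build a drain request.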

h2. Metrics

Metrics are available via HTTP on a configurable address/port (conventionally :9005). Request headers must include "Authorization: Bearer {management token}".

Metrics include:
* [future] (summary) time elapsed between VM creation and first successful SSH connection to that VM
* [future] (summary) time elapsed between first successful SSH connection on a VM and ready to run a container on that VM
* (gauge) total hourly price of all existing VMs
* (gauge) total VCPUs and memory allocated to containers
* (gauge) number of containers running
* (gauge) number of containers allocated to VMs but not started yet (because VMs are pending/booting)
* (gauge) number of containers not allocated to VMs (because provider quota is reached)

h2. Logs

For purposes of troubleshooting, a log message is printed on stderr when:
* a new instance is created/ordered
* an instance appears on the provider's list of instances
* an instance's boot probe succeeds
* an instance shutdown is requested
* an instance disappears from the provider's list of instances
* a cloud provider API error occurs
* a new container appears in the Arvados queue
* a container is locked by the dispatcher
* a crunch-run process is started on an instance
* a crunch-run process ends
* an active container's state changes to Complete or Cancelled
* an active container is requeued after being locked
* an Arvados API error occurs

(Example log entries should be shown here)

If the dispatcher starts with a non-empty ARVADOS_DEBUG environment variable, it also prints more detailed logs about other internal state changes, using level=debug.

h2. Internal details

h3. Scheduling policy

The container priority field determines the order in which resources are allocated.
* If container C1 has priority P1,
* ...and C2 has higher priority P2,
* ...and there is no pending/booting/idle VM suitable for running C2,
* ...then C1 will not be started.

However, containers that run on different VM types don't necessarily start in priority order.
* If container C1 has priority P1,
* ...and C2 has higher priority P2,
* ...and there is no idle VM suitable for running C2,
* ...and there is a pending/booting VM that will be suitable for running C2 when it comes up,
* ...and there is an idle VM suitable for running C1,
* ...then C1 will start before C2.
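
The two rules above can be read as a single pass over the queue in priority order. The Python sketch below is one such reading, not the dispatcher's implementation: the function name @schedule@, the dictionary layout, and the idea of "reserving" a pending VM for a waiting container are assumptions made for illustration.

<pre><code class="python">
def schedule(containers, vms):
    """Decide which containers can start now.

    containers: list of dicts with "uuid", "priority" and "type" (desired VM type).
    vms: dict mapping VM type -> {"idle": n, "pending": n}.
    Returns (to_start, blocked): UUIDs to start now, and the UUID of the
    container that stopped the scan (None if the whole queue was scanned).
    """
    counts = {t: dict(c) for t, c in vms.items()}  # work on a copy
    to_start = []
    for c in sorted(containers, key=lambda c: c["priority"], reverse=True):
        have = counts.setdefault(c["type"], {"idle": 0, "pending": 0})
        if have["idle"] > 0:
            # A suitable idle VM exists: start right away.
            have["idle"] -= 1
            to_start.append(c["uuid"])
        elif have["pending"] > 0:
            # A suitable VM is coming up: reserve it and keep scanning, so a
            # lower-priority container with an idle VM may start first.
            have["pending"] -= 1
        else:
            # No pending/booting/idle VM is suitable: nothing of lower
            # priority is started.
            return to_start, c["uuid"]
    return to_start, None
</code></pre>

With C2 (higher priority) wanting a type that has one pending VM and C1 wanting a type that has one idle VM, this returns C1 as startable, matching the second rule. If no VM of any kind suits C2, the scan stops at C2 and C1 is not started, matching the first.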

h3. Special cases / synchronizing state

When first starting up, the dispatcher inspects the API server's container queue and the cloud provider's list of dispatcher-tagged cloud nodes, and restores its internal state accordingly.

Some containers might have state=Locked at startup. The dispatcher can't be sure these have no corresponding crunch-run process anywhere until it establishes communication with all running instances. To avoid breaking priority order by guessing wrong, the dispatcher avoids scheduling any new containers until all such "stale-locked" containers are matched up with crunch-run processes on existing VMs (typically preparing a docker image) or all of the existing VMs have been probed successfully (meaning the locked containers aren't running anywhere and need to be rescheduled).
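
The startup condition described above can be expressed as a small predicate. This is a sketch under assumed data structures (sets of container UUIDs and VM IDs); the name @may_schedule@ is hypothetical, and the @StaleLockTimeout@ setting from the configuration is not modelled here.

<pre><code class="python">
def may_schedule(stale_locked, running_by_vm, probed_ok, all_vms):
    """Return True once it is safe to schedule new containers after a restart.

    stale_locked:  set of container UUIDs found with state=Locked at startup
    running_by_vm: dict VM id -> set of container UUIDs that crunch-run
                   reports on that VM
    probed_ok:     set of VM ids probed successfully since startup
    all_vms:       set of all existing VM ids
    """
    found = set().union(*running_by_vm.values())
    all_matched = stale_locked <= found  # every stale lock has a crunch-run process
    all_probed = all_vms <= probed_ok    # every VM has answered a probe
    return all_matched or all_probed
</code></pre>

If every VM has been probed and a locked container was found on none of them, the predicate becomes true and that container can be rescheduled.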

When a user cancels a container request with state=Locked or Running, the container priority changes to 0. On its next poll, the dispatcher notices this and kills any corresponding crunch-run processes (or, if there is no such process, just unlocks the container).

When a crunch-run process ends without finalizing its container's state, the dispatcher notices this and sets state to Cancelled.
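
Read literally, the two paragraphs above amount to a small decision table. The following Python sketch encodes it; the function name @next_action@, the action labels, and the three-valued @crunch_run@ argument are assumptions made for this example.

<pre><code class="python">
def next_action(state, priority, crunch_run):
    """Pick the dispatcher's action for one container.

    state:      container state from the queue, e.g. "Locked" or "Running"
    priority:   container priority (0 means the user cancelled)
    crunch_run: "running", "ended" (exited without finalizing the container's
                state) or "none" (no process known)
    """
    if state not in ("Locked", "Running"):
        return "none"
    if crunch_run == "ended":
        # crunch-run ended without finalizing the container's state.
        return "cancel"
    if priority == 0:
        # Cancelled by the user: kill the process, or just unlock if there is none.
        return "kill" if crunch_run == "running" else "unlock"
    return "none"
</code></pre>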

h3. SSH keys

The operator must install a public key in /root/.ssh/authorized_keys on each worker node. The dispatcher holds the corresponding private key.

(Future) The dispatcher generates its own keys and installs its public key on new VMs using cloud provider bootstrapping/metadata features.

h3. Probes

Sometimes (on the happy path) the dispatcher already knows the state of each worker: whether it's idle, and which container it's running. In general, though, it's necessary to probe the worker node itself.

Probe:
* Check whether the SSH connection is alive; reopen if needed.
* Run the configured "ready?" command (e.g., "grep /encrypted-tmp /etc/mtab"); if this fails, conclude the node is still booting.
* Run "crunch-run --list" to get a list of crunch-run supervisors (pid + container UUID)
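
The probe steps above can be sketched as follows. The SSH layer is abstracted behind a caller-supplied @run@ function, which is an assumption of this example, as are the state labels. The output of @crunch-run --list@ is returned as raw lines rather than parsed, since its exact format is not specified here.

<pre><code class="python">
def probe(run, ready_command):
    """Probe one worker node.

    run(cmd) must execute cmd on the worker over SSH (reopening the
    connection if needed) and return (exit_code, stdout); it is expected to
    raise ConnectionError when the node cannot be reached.
    """
    try:
        code, _ = run(ready_command)
        if code != 0:
            # The "ready?" command failed: the node is still booting.
            return {"state": "booting", "supervisors": []}
        code, out = run("crunch-run --list")
    except ConnectionError:
        return {"state": "unreachable", "supervisors": []}
    if code != 0:
        return {"state": "probe-failed", "supervisors": []}
    # One raw line per crunch-run supervisor (pid + container UUID).
    return {"state": "ready", "supervisors": out.splitlines()}
</code></pre>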

h3. Detecting dead/lame nodes

If a node has been up for N seconds without a successful probe, despite at least M attempts, it is shut down, even if it was running a container last time it was contacted successfully.
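
As a minimal sketch of this rule, assuming the dispatcher tracks, per node, the time elapsed and the number of probe attempts since the last successful probe (or since the node came up, if no probe has succeeded yet). The function and parameter names are hypothetical.

<pre><code class="python">
def should_shut_down(seconds_without_success, attempts_without_success,
                     n_seconds, m_attempts):
    """Return True when a node counts as dead/lame.

    Both conditions must hold: at least N seconds have passed without a
    successful probe, and at least M probes were attempted in that time.
    """
    return (seconds_without_success >= n_seconds
            and attempts_without_success >= m_attempts)
</code></pre>

Requiring both conditions means a node is not shut down merely because the dispatcher itself was too busy to probe it during those N seconds.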

h3. Multiple dispatchers

Not supported in initial version.