Dispatching containers to cloud VMs » History » Version 72
Tom Clegg, 06/26/2019 01:31 PM
h1. Dispatching containers to cloud VMs

(Draft)

{{>toc}}

See also:
* [[cloudtest utility]]

h2. Component name / purpose

arvados-dispatch-cloud runs Arvados user containers on generic public cloud infrastructure by automatically creating and destroying VMs of various sizes according to demand, preparing the VMs' runtime environments, and running containers on them.

h2. Overview of operation

The dispatcher waits for containers to appear in the queue, and runs them on appropriately sized cloud VMs. When there are no idle cloud VMs with the desired size, the dispatcher brings up more VMs using the cloud provider's API. The dispatcher also shuts down VMs that have been idle longer than the configured idle timeout, or sooner if the provider starts refusing to create new VMs.

h2. Interaction with other components

Controller (backed by RailsAPI and PostgreSQL) supplies the container queue: which containers the system should be trying to execute (or cancel) at any given time.

The cloud provider's API supplies a list of VMs that exist (or are being created) at a given time and their network addresses, accepts orders to create new VMs and update instance tags, and (optionally, depending on the driver) provides the VMs' SSH server public keys.

The SSH server on each cloud VM allows the dispatcher to authenticate with a private key and execute shell commands as root (either directly or via sudo).

h2. Instance tags

The dispatcher relies on the cloud provider's tagging feature to persist state across server restarts.
* {"InstanceType": "foo"} indicates that the instance was created with the specs from the instance type named "foo" in the cluster configuration file.
* {"IdleBehavior": "hold"} indicates that the management API has been used to put the instance in "hold" state.
* {"InstanceSecret": "ad23b6a8912f2b75d8a5e6887fbcb82f8024daea"} is a random string used to verify the instance's SSH host key.

Provider-specific drivers (Amazon, Google, Azure) determine exactly how these tags are encoded in the cloud API, and can use tags to persist their own internal state as well. For example, a driver might save tags named "Arvados-DispatchCloud-InstanceType" rather than just "InstanceType".
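
The prefixing idea in the example above can be sketched as follows. This is only an illustration: the prefix is taken from the example name in the text, the function names are hypothetical, and a real driver may encode tags differently.

```python
# Hypothetical prefix, borrowed from the "Arvados-DispatchCloud-InstanceType"
# example in the text; real drivers choose their own encoding.
TAG_PREFIX = "Arvados-DispatchCloud-"


def encode_tags(tags, prefix=TAG_PREFIX):
    """Map dispatcher tag names to provider-side tag names."""
    return {prefix + name: value for name, value in tags.items()}


def decode_tags(provider_tags, prefix=TAG_PREFIX):
    """Recover dispatcher tags, ignoring tags that carry a different prefix."""
    return {
        name[len(prefix):]: value
        for name, value in provider_tags.items()
        if name.startswith(prefix)
    }


dispatcher_tags = {"InstanceType": "foo", "IdleBehavior": "hold"}
provider_tags = encode_tags(dispatcher_tags)
provider_tags["Name"] = "unrelated-tag"  # e.g. a tag set by someone else
restored = decode_tags(provider_tags)
```

Because unrelated tags are skipped when decoding, state restored after a restart contains only what the dispatcher itself saved.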

h2. Deployment

*Where to install:* The arvados-dispatch-cloud process can run anywhere, as long as it has network access to the Arvados controller, the cloud provider's API, and the worker VMs. Each Arvados cluster should run only one arvados-dispatch-cloud process.
* Future versions will support multiple dispatchers.

*Dispatcher's SSH key:* The operator must generate an SSH key pair for the dispatcher to use when connecting to cloud VMs. The private key is stored (without a passphrase) in the cluster configuration file. It does not need to be saved in @~/.ssh/@.

*Cloud VM image:* The operator must provide a VM image with an SSH server on a port reachable by the dispatcher (default 22, configurable per cluster). The dispatcher's SSH public key must be listed in @/root/.ssh/authorized_keys@. The image should also include systemd-cat (part of systemd) and suitable versions of docker and crunch-run. The @/var/lock@ directory must be available for lockfiles with names matching "@crunch-run-*.*@".
* It is possible to install docker and crunch-run using a custom boot probe command, but pre-installing is more efficient.
* Future versions will automatically sync the crunch-run binary from the dispatcher host to each worker node.
* The Azure driver creates a new admin user account and installs the SSH public key by itself, so @/root/.ssh/authorized_keys@ is not needed. The VM image must include @sudo@.

*Cloud provider account:* The dispatcher uses cloud provider credentials to create and delete VMs and other cloud resources. An Arvados user can create an arbitrary number of long-running containers, and the dispatcher will try to run all of them. Currently the dispatcher does not enforce any resource limits of its own, so the operator must ensure the cloud provider itself is enforcing a suitable quota.

*Migrating from nodemanager/SLURM:* When VM images, SSH keys, and configuration files are ready, disable nodemanager and crunch-dispatch-slurm. Install the arvados-dispatch-cloud deb/rpm package. Confirm success with @systemctl status arvados-dispatch-cloud@ and @journalctl -fu arvados-dispatch-cloud@. See [[Migrating from arvados-node-manager to arvados-dispatch-cloud]].

h2. Configuration

Arvados [[Cluster configuration]] (currently a file in /etc) supplies cloud provider credentials, allowed node types, spending limits/policies, etc.

<pre><code class="yaml">
CloudVMs:
  BootProbeCommand: "docker ps -q"
  SSHPort: 22
  SyncInterval: 1m # how often to get list of active instances from cloud provider
  TimeoutIdle: 1m # shutdown if idle longer than this
  TimeoutBooting: 10m # shutdown if exists longer than this without running BootProbeCommand successfully
  TimeoutProbe: 2m # shutdown if (after booting) communication fails longer than this, even if ctrs are running
  TimeoutShutdown: 1m # shutdown again if node still exists this long after shutdown
  Driver: Amazon
  DriverParameters: # following configs are driver dependent
    Region: us-east-1
    AccessKeyID: abcdef
    SecretAccessKey: abcdefghijklmnopqrstuvwxyz
    SubnetID: subnet-01234567
    SecurityGroupIDs: sg-01234567
    AdminUsername: ubuntu
    EBSVolumeType: gp2
Dispatch:
  StaleLockTimeout: 1m # after restart, time to wait for workers to come up before abandoning locks from previous run
  PollInterval: 1m # how often to get latest queue from arvados controller
  ProbeInterval: 10s # how often to probe each instance for current status/vital signs
  MaxProbesPerSecond: 1000 # limit total probe rate for dispatch process (across all instances)
  PrivateKey: | # SSH key able to log in as root@ worker VMs
    -----BEGIN RSA PRIVATE KEY-----
    MIIEowIBAAKCAQEAqYm4XsQHm8sBSZFwUX5VeW1OkGsfoNzcGPG2nzzYRhNhClYZ
    0ABHhUk82HkaC/8l6d/jpYTf42HrK42nNQ0r0Yzs7qw8yZMQioK4Yk+kFyVLF78E
    GRG4pGAWXFs6pUchs/lm8fo9zcda4R3XeqgI+NO+nEERXmdRJa1FhI+Za3/S/+CV
    mg+6O00wZz2+vKmDPptGN4MCKmQOCKsMJts7wSZGyVcTtdNv7jjfr6yPAIOIL8X7
    LtarBCFaK/pD7uWll/Uj7h7D8K48nIZUrvBJJjXL8Sm4LxCNoz3Z83k8J5ZzuDRD
    gRiQe/C085mhO6VL+2fypDLwcKt1tOL8fI81MwIDAQABAoIBACR3tEnmHsDbNOav
    Oxq8cwRQh9K2yDHg8BMJgz/TZa4FIx2HEbxVIw0/iLADtJ+Z/XzGJQCIiWQuvtg6
    exoFQESt7JUWRWkSkj9JCQJUoTY9Vl7APtBpqG7rIEQzd3TvzQcagZNRQZQO6rR7
    p8sBdBSZ72lK8cJ9tM3G7Kor/VNK7KgRZFNhEWnmvEa3qMd4hzDcQ4faOn7C9NZK
    dwJAuJVVfwOLlOORYcyEkvksLaDOK2DsB/p0AaCpfSmThRbBKN5fPXYaKgUdfp3w
    70Hpp27WWymb1cgjyqSH3DY+V/kvid+5QxgxCBRq865jPLn3FFT9bWEVS/0wvJRj
    iMIRrjECgYEA4Ffv9rBJXqVXonNQbbstd2PaprJDXMUy9/UmfHL6pkq1xdBeuM7v
    yf2ocXheA8AahHtIOhtgKqwv/aRhVK0ErYtiSvIk+tXG+dAtj/1ZAKbKiFyxjkZV
    X72BH7cTlR6As5SRRfWM/HaBGEgED391gKsI5PyMdqWWdczT5KfxAksCgYEAwXYE
    ewPmV1GaR5fbh2RupoPnUJPMj36gJCnwls7sGaXDQIpdlq56zfKgrLocGXGgj+8f
    QH7FHTJQO15YCYebtsXWwB3++iG43gVlJlecPAydsap2CCshqNWC5JU5pan0QzsP
    exzNzWqfUPSbTkR2SRaN+MenZo2Y/WqScOAth7kCgYBgVoLujW9EXH5QfXJpXLq+
    jTvE38I7oVcs0bJwOLPYGzcJtlwmwn6IYAwohgbhV2pLv+EZSs42JPEK278MLKxY
    lgVkp60npgunFTWroqDIvdc1TZDVxvA8h9VeODEJlSqxczgbMcIUXBM9yRctTI+5
    7DiKlMUA4kTFW2sWwuOlFwKBgGXvrYS0FVbFJKm8lmvMu5D5x5RpjEu/yNnFT4Pn
    G/iXoz4Kqi2PWh3STl804UF24cd1k94D7hDoReZCW9kJnz67F+C67XMW+bXi2d1O
    JIBvlVfcHb1IHMA9YG7ZQjrMRmx2Xj3ce4RVPgUGHh8ra7gvLjd72/Tpf0doNClN
    ti/hAoGBAMW5D3LhU05LXWmOqpeT4VDgqk4MrTBcstVe7KdVjwzHrVHCAmI927vI
    pjpphWzpC9m3x4OsTNf8m+g6H7f3IiQS0aiFNtduXYlcuT5FHS2fSATTzg5PBon9
    1E6BudOve+WyFyBs7hFWAqWFBdWujAl4Qk5Ek09U2ilFEPE7RTgJ
    -----END RSA PRIVATE KEY-----
InstanceTypes:
- Name: m4.large
  VCPUs: 2
  RAM: 7782000000
  Scratch: 32000000000
  IncludedScratch: 32000000000
  Price: 0.1
- Name: m4.large.spot
  Preemptible: true
  VCPUs: 2
  RAM: 7782000000
  Scratch: 32000000000
  IncludedScratch: 32000000000
  Price: 0.1
- Name: m4.xlarge
  VCPUs: 4
  RAM: 15564000000
  Scratch: 80000000000
  IncludedScratch: 80000000000
  Price: 0.2
- Name: m4.xlarge.spot
  Preemptible: true
  VCPUs: 4
  RAM: 15564000000
  Scratch: 80000000000
  IncludedScratch: 80000000000
  Price: 0.2
- Name: m4.2xlarge
  VCPUs: 8
  RAM: 31129000000
  Scratch: 160000000000
  IncludedScratch: 160000000000
  Price: 0.4
- Name: m4.2xlarge.spot
  Preemptible: true
  VCPUs: 8
  RAM: 31129000000
  Scratch: 160000000000
  IncludedScratch: 160000000000
  Price: 0.4
</code></pre>

h2. Management API

APIs for monitoring/diagnostics/control are available via HTTP on a configurable address/port. Request headers must include "Authorization: Bearer {management token}".

Responses are JSON-encoded and resemble other Arvados APIs:
<pre><code class="json">
{
  "items": [
    {
      "name": "...",
      ...
    },
    ...
  ]
}
</code></pre>
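
A minimal sketch of a client for this API, using only the Python standard library. The host, port, and token below are placeholders, and the helper names are hypothetical; only the @Authorization: Bearer@ header and the @items@ envelope come from the description above.

```python
import json
import urllib.request


def management_request(base_url, path, token, method="GET"):
    """Build (but do not send) a management API request with the bearer token."""
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        method=method,
        headers={"Authorization": "Bearer " + token},
    )


def item_names(response_body):
    """Extract the "name" field of each entry in the "items" list."""
    return [item.get("name") for item in json.loads(response_body)["items"]]


# Placeholder address and token: substitute the configured values.
req = management_request(
    "http://dispatcher.example:9006",
    "/arvados/v1/dispatch/containers",
    "example-management-token",
)

# A made-up response body in the envelope format shown above.
sample_body = '{"items": [{"name": "a"}, {"name": "b"}]}'
names = item_names(sample_body)
```

To send the request, pass @req@ to @urllib.request.urlopen@ and feed the response body to @item_names@ (or to @json.loads@ directly).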

@GET /arvados/v1/dispatch/containers@ lists queued/locked/running containers. If you're switching from slurm, this is roughly *equivalent to squeue*. Each returned item includes:
* container UUID
* container state (Queued/Locked/Running/Complete/Cancelled)
* desired instance type
* time appeared in queue
* time started (if started)

@POST /arvados/v1/dispatch/containers/kill?container_uuid=X@ terminates a container immediately. If you're switching from slurm, this is roughly *equivalent to scancel*.
* a single attempt is made to send SIGTERM to the container's supervisor (crunch-run) process
* container state/priority fields are not affected
* assuming SIGTERM works, the container record will end up with state "Cancelled"

@GET /arvados/v1/dispatch/instances@ lists cloud VMs. If you're switching from slurm, this is roughly *equivalent to sinfo*. Each returned item includes:
* provider's instance ID
* hourly price (from configuration file)
* instance type (from configuration file)
* instance type (from provider's menu)
* UUID of the current / most recent container attempted (if known)
* time last container finished (or boot time, if nothing run yet)

@POST /arvados/v1/dispatch/instances/hold?instance_id=X@ puts an instance in "hold" state.
* if the instance is currently running a container, it is allowed to continue
* no further containers will be scheduled on the instance
* the instance will not be shut down automatically

@POST /arvados/v1/dispatch/instances/drain?instance_id=X@ puts an instance in "drain" state.
* if the instance is currently running a container, it is allowed to continue
* no further containers will be scheduled on the instance
* the instance will be shut down automatically when all containers finish

@POST /arvados/v1/dispatch/instances/run?instance_id=X@ puts an instance in the default "run" state.
* if the instance is currently running a container, it is allowed to continue
* more containers will be scheduled on the instance when it becomes available
* the instance will be shut down automatically when it exceeds the configured idle timeout

@POST /arvados/v1/dispatch/instances/kill?instance_id=X@ shuts down an instance immediately.
* the instance is terminated immediately via cloud API
* SIGTERM is sent to the container if one is running, but no effort is made to give it time to end gracefully before terminating the instance

†@POST /arvados/v1/dispatch/loglevel?level=debug@ sets the logging threshold to "debug" or "info".
* @.../loglevel?level=debug@ enables debug logs
* @.../loglevel?level=info@ disables debug logs

† not yet implemented
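
The "hold", "drain", and "run" states described above differ only in whether new containers may be scheduled and when an idle instance is shut down. The following sketch restates those rules as a lookup table; the function and key names are hypothetical and this is not the dispatcher's actual code.

```python
# One row per idle behavior, restating the bullet lists in the text.
IDLE_BEHAVIORS = {
    "hold": {"schedulable": False, "shutdown": "never"},
    "drain": {"schedulable": False, "shutdown": "when_empty"},
    "run": {"schedulable": True, "shutdown": "after_idle_timeout"},
}


def can_schedule(behavior):
    """May the dispatcher place another container on this instance?"""
    return IDLE_BEHAVIORS[behavior]["schedulable"]


def should_shut_down(behavior, running_containers, idle_seconds, idle_timeout_seconds):
    """Should an instance in this state be shut down automatically?"""
    if running_containers > 0:
        # In all three states a running container is allowed to continue.
        return False
    policy = IDLE_BEHAVIORS[behavior]["shutdown"]
    if policy == "never":
        return False
    if policy == "when_empty":
        return True
    return idle_seconds > idle_timeout_seconds
```

The "kill" endpoint is deliberately left out: it bypasses these rules and terminates the instance immediately.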

h2. Metrics

Metrics are available via HTTP on a configurable address/port (conventionally :9006). Request headers must include "Authorization: Bearer {management token}".

Metrics include:
* (gauge) number of existing VMs
* (gauge) total hourly price of all existing VMs
* (gauge) total VCPUs and memory in all existing VMs
* (gauge) total VCPUs and memory allocated to containers
* (gauge) number of containers running
* †(gauge) number of containers allocated to VMs but not started yet (because VMs are pending/booting)
* †(gauge) number of containers not allocated to VMs (because provider quota is reached)
* (gauge) total hourly price of VMs, partitioned by allocation state (booting, running, idle, shutdown)
* †(summary) time elapsed between VM creation and first successful SSH connection to that VM
* †(summary) time elapsed between first successful SSH connection on a VM and ready to run a container on that VM
* †(summary) time elapsed between first shutdown attempt on a VM and its disappearance from the provider listing
* †(summary) wait times (between seeing a container in the queue or requeueing, and starting its crunch-run process on a worker) across previous starts
* †(gauge) longest wait time of any unstarted container

† not yet implemented

h2. Logs

For purposes of troubleshooting, a JSON-formatted log entry is printed on stderr when...

| |... if loglevel ≥ ...|...including timestamp and...|
|a new instance is created/ordered |info |instance type name|
|an instance appears on the provider's list of instances |info |instance ID|
|an instance's boot probe succeeds |info |instance ID|
|an instance is shut down after boot timeout |warn |instance ID, †stdout/stderr/error from last boot probe attempt|
|an instance shutdown is requested |info |instance ID|
|an instance disappears from the provider's list of instances |info |instance ID and previous state (booting/idle/shutdown)|
|a cloud provider API or driver error occurs |error |provider/driver's error message|
|a new container appears in the Arvados queue |info |container UUID, desired instance type name|
|a container is locked by the dispatcher |debug |container UUID|
|a crunch-run process is started on an instance |info |container UUID, instance ID, crunch-run PID|
|a crunch-run process fails to start on an instance |info |container UUID, instance ID, stdout/stderr/exitcode|
|a crunch-run process ends |info |container UUID, instance ID|
|an active container's state changes to Complete or Cancelled |info |container UUID, new state|
|an active container is requeued after being locked |info |container UUID|
|an Arvados API error occurs |warn |error message|

† not yet implemented

Example log entries from test suite (note test suite uses text formatting, production logging uses JSON formatting):
<pre>
INFO[0000] creating new instance ContainerUUID=zzzzz-dz642-000000000000160 InstanceType=type8
INFO[0000] instance appeared in cloud IdleBehavior=run Instance=stub-providertype8-6ec34c367674cb74 InstanceType=type8 State=booting
INFO[0000] boot probe succeeded Command=true Instance=stub-providertype8-6ec34c367674cb74 InstanceType=type8 stderr= stdout=
INFO[0000] instance booted; will try probeRunning Instance=stub-providertype8-6ec34c367674cb74 InstanceType=type8 ProbeStart="2019-02-05 15:49:49.183431341 -0500 EST m=+0.126074285"
INFO[0000] probes succeeded, instance is in service Instance=stub-providertype8-6ec34c367674cb74 InstanceType=type8 ProbeStart="2019-02-05 15:49:49.183431341 -0500 EST m=+0.126074285" RunningContainers=0 State=idle
INFO[0000] crunch-run process started ContainerUUID=zzzzz-dz642-000000000000160 Instance=stub-providertype8-6ec34c367674cb74 InstanceType=type8 Priority=20
INFO[0000] container finished ContainerUUID=zzzzz-dz642-000000000000160 State=Complete
...
INFO[0002] shutdown idle worker Age=151.615512ms IdleBehavior=run Instance=stub-providertype8-6ec34c367674cb74 InstanceType=type8 State=idle
INFO[0002] instance disappeared in cloud Instance=stub-providertype8-6ec34c367674cb74 WorkerState=shutdown
</pre>

If the dispatcher starts with a non-empty ARVADOS_DEBUG environment variable, it also prints more detailed logs about other internal state changes, using level=debug.
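
Since production logs are JSON, entries at or above a given level can be selected with a few lines of code. The sketch below assumes each line is one JSON object with a @level@ field holding a name such as "info" or "warning"; those field names and the sample lines are assumptions for illustration, not taken from real dispatcher output.

```python
import json

# Assumed level names; "warn" and "warning" are treated as the same level.
LEVELS = {"debug": 0, "info": 1, "warn": 2, "warning": 2, "error": 3}


def filter_log(lines, min_level="warn"):
    """Yield parsed log entries whose level is at least min_level."""
    threshold = LEVELS[min_level]
    for line in lines:
        try:
            entry = json.loads(line)
        except ValueError:
            continue  # skip anything that is not JSON
        if not isinstance(entry, dict):
            continue
        # Entries with a missing or unrecognized level are never selected.
        if LEVELS.get(entry.get("level"), -1) >= threshold:
            yield entry


# Made-up sample lines, loosely modeled on the text-format examples above.
sample_lines = [
    '{"level": "info", "msg": "creating new instance", "InstanceType": "type8"}',
    '{"level": "warning", "msg": "example warning", "Instance": "stub-1"}',
    "this line is not JSON",
]
selected = list(filter_log(sample_lines, "warn"))
```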

h2. Internal details

h3. Worker lifecycle

Worker states and the events that move a worker between them:

|_. From |_. Event |_. To |
|Nonexistent |appears in cloud list |Unknown |
|Nonexistent |create() returns ID |Booting |
|Unknown |create() returns ID |Booting |
|Booting |boot+run probes succeed |Idle |
|Booting |boot timeout |Shutdown |
|Idle |container starts |Running |
|Idle |idle timeout |Shutdown |
|Idle |probe timeout |Shutdown |
|Idle |want=drain |Shutdown |
|Running |container ends |Idle |
|Running |container ends, want=drain |Shutdown |
|Shutdown |instance disappears from cloud |Gone |
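
The worker lifecycle can also be written as a small state machine. The sketch below is one reading of the states and arrow labels in this section, not the dispatcher's actual implementation; the @replay@ helper is hypothetical.

```python
# (state, event) -> next state, following the lifecycle in this section.
TRANSITIONS = {
    ("Nonexistent", "appears in cloud list"): "Unknown",
    ("Nonexistent", "create() returns ID"): "Booting",
    ("Unknown", "create() returns ID"): "Booting",
    ("Booting", "boot+run probes succeed"): "Idle",
    ("Booting", "boot timeout"): "Shutdown",
    ("Idle", "container starts"): "Running",
    ("Idle", "idle timeout"): "Shutdown",
    ("Idle", "probe timeout"): "Shutdown",
    ("Idle", "want=drain"): "Shutdown",
    ("Running", "container ends"): "Idle",
    ("Running", "container ends, want=drain"): "Shutdown",
    ("Shutdown", "instance disappears from cloud"): "Gone",
}


def replay(events, state="Nonexistent"):
    """Apply events in order and return the final state.

    Raises ValueError if an event is not allowed in the current state.
    """
    for event in events:
        try:
            state = TRANSITIONS[(state, event)]
        except KeyError:
            raise ValueError(f"event {event!r} is not valid in state {state!r}")
    return state


# A worker that boots, runs one container, idles out, and disappears.
final_state = replay([
    "create() returns ID",
    "boot+run probes succeed",
    "container starts",
    "container ends",
    "idle timeout",
    "instance disappears from cloud",
])
```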

h3. Scheduling policy

The container priority field determines the order in which resources are allocated.
* If container C1 has priority P1,
* ...and C2 has higher priority P2,
* ...and there is no pending/booting/idle VM suitable for running C2,
* ...then C1 will not be started.

However, containers that run on different VM types don't necessarily start in priority order.
* If container C1 has priority P1,
* ...and C2 has higher priority P2,
* ...and there is no idle VM suitable for running C2,
* ...and there is a pending/booting VM that will be suitable for running C2 when it comes up,
* ...and there is an idle VM suitable for running C1,
* ...then C1 will start before C2.
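
The two rules above can be illustrated with a toy scheduler. This is a simplified sketch under stated assumptions: VMs are counted per instance type, each container needs exactly one VM of its type, and whatever the dispatcher does when nothing suitable exists (for example, trying to create a VM) is left out. The function and variable names are hypothetical.

```python
from collections import Counter


def schedule(containers, idle, pending):
    """Decide what to do with each queued container.

    containers: list of (uuid, priority, instance_type) tuples
    idle, pending: mapping of instance type -> number of idle / pending-or-booting VMs
    Returns (started, waiting_for_vm, blocked) lists of container UUIDs.
    """
    idle = Counter(idle)
    pending = Counter(pending)
    started, waiting, blocked = [], [], []
    ordered = sorted(containers, key=lambda c: c[1], reverse=True)
    for i, (uuid, _priority, instance_type) in enumerate(ordered):
        if idle[instance_type] > 0:
            idle[instance_type] -= 1
            started.append(uuid)
        elif pending[instance_type] > 0:
            # A VM is on its way; reserve it and keep going down the queue.
            pending[instance_type] -= 1
            waiting.append(uuid)
        else:
            # No pending/booting/idle VM: nothing of lower priority may start.
            blocked = [c[0] for c in ordered[i:]]
            break
    return started, waiting, blocked


queue = [("C1", 1, "small"), ("C2", 2, "large")]

# First rule: no VM of any kind for C2, so C1 is not started either.
rule1 = schedule(queue, idle={"small": 1}, pending={})

# Second rule: a "large" VM is booting, so C1 starts before C2.
rule2 = schedule(queue, idle={"small": 1}, pending={"large": 1})
```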

h3. Special cases / synchronizing state

When first starting up, the dispatcher inspects the API server's container queue and the cloud provider's list of dispatcher-tagged cloud nodes, and restores internal state accordingly.

At startup, some containers might have state=Locked. The dispatcher can't be sure these have no corresponding crunch-run process anywhere until it establishes communication with all running instances. To avoid breaking priority order by guessing wrong, the dispatcher avoids scheduling any new containers until one of two things happens: either all such "stale-locked" containers are matched up with crunch-run processes on existing VMs (typically preparing a docker image), or all of the existing VMs have been probed successfully (meaning the locked containers aren't running anywhere and need to be rescheduled).
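
The startup condition for stale-locked containers reduces to a simple set check. The sketch below uses hypothetical names and plain sets of container UUIDs and instance IDs; it only illustrates the rule, not how the dispatcher tracks this state internally.

```python
def may_schedule(stale_locked, containers_found_on_vms, existing_vms, probed_vms):
    """Return True once it is safe to start scheduling new containers.

    stale_locked: UUIDs of containers that were Locked at startup
    containers_found_on_vms: UUIDs reported by crunch-run on probed VMs
    existing_vms: instance IDs currently listed by the cloud provider
    probed_vms: instance IDs that have been probed successfully
    """
    unmatched = set(stale_locked) - set(containers_found_on_vms)
    all_vms_probed = set(existing_vms) <= set(probed_vms)
    return not unmatched or all_vms_probed


# One locked container, two VMs, only one of them probed so far: keep waiting.
waiting = may_schedule({"c1"}, set(), {"vm1", "vm2"}, {"vm1"})

# The container turned up on a probed VM: scheduling can begin.
matched = may_schedule({"c1"}, {"c1"}, {"vm1", "vm2"}, {"vm1"})

# Every VM probed and the container was not found: it needs rescheduling.
all_probed = may_schedule({"c1"}, set(), {"vm1", "vm2"}, {"vm1", "vm2"})
```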

At startup, some instances might still be running containers that were started by a prior invocation, even though the (new) boot probe command fails. Such instances are left alive at least until the containers finish. After that, the usual rules apply: if boot probe succeeds before boot timeout, start scheduling containers; otherwise, shut down. This allows the operator to configure a new image along with a new boot probe command that only works on the new image, without disrupting users' work.

When a user cancels a container request with state=Locked or Running, the container priority changes to 0. On its next poll, the dispatcher notices this and kills any corresponding crunch-run processes (or, if there is no such process, just unlocks the container).

When a crunch-run process ends without finalizing its container's state, the dispatcher notices this and sets state to Cancelled.

h3. Probes

Sometimes (on the happy path) the dispatcher knows the state of each worker, whether it's idle, and which container it's running. In general, it's necessary to probe the worker node itself.

Probe:
* Check whether the SSH connection is alive; reopen if needed.
* Run the configured "ready?" command (e.g., "grep /encrypted-tmp /etc/mtab"); if this fails, conclude the node is still booting.
* Run "crunch-run --list" to get a list of crunch-run supervisors (pid + container UUID)
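
The probe steps above can be sketched with the SSH layer abstracted away. Here @run@ is a hypothetical callable that executes a command on the worker and returns @(exit_code, stdout)@, raising @ConnectionError@ if the connection cannot be (re)established. The output of @crunch-run --list@ is treated as opaque non-empty lines, because its exact format is not specified in this document.

```python
def probe(run, ready_command):
    """Probe one worker and return (state, crunch_run_lines)."""
    try:
        code, _ = run(ready_command)
        if code != 0:
            # The "ready?" command failed: the node is still booting.
            return ("booting", [])
        code, out = run("crunch-run --list")
    except ConnectionError:
        return ("unreachable", [])
    if code != 0:
        return ("unknown", [])
    lines = [line.strip() for line in out.splitlines() if line.strip()]
    return ("running" if lines else "idle", lines)


def fake_ssh(responses):
    """Stand-in for an SSH session: maps command -> result or exception."""
    def run(command):
        result = responses[command]
        if isinstance(result, Exception):
            raise result
        return result
    return run


still_booting = probe(fake_ssh({"true": (1, "")}), "true")
busy = probe(
    fake_ssh({
        "true": (0, ""),
        "crunch-run --list": (0, "zzzzz-dz642-000000000000160\n"),
    }),
    "true",
)
idle = probe(fake_ssh({"true": (0, ""), "crunch-run --list": (0, "")}), "true")
down = probe(fake_ssh({"true": ConnectionError("no route to host")}), "true")
```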

h3. Detecting dead/lame nodes

If a node has been up for N seconds without a successful probe (see @TimeoutProbe@ in the configuration example), it is shut down, even if it was running a container last time it was contacted successfully.
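
As a sketch, the dead-node rule is a single time comparison. The names below are hypothetical; counting from boot time when no probe has ever succeeded is an assumption about how "up for N seconds without a successful probe" should be read.

```python
from datetime import datetime, timedelta


def is_dead(now, booted_at, last_successful_probe, timeout, running_container=False):
    """Return True if the node should be shut down as unresponsive.

    running_container is accepted but deliberately ignored: per the text,
    the node is shut down even if it was last seen running a container.
    """
    reference = last_successful_probe or booted_at
    return now - reference > timeout


now = datetime(2019, 6, 26, 13, 31, 0)
timeout = timedelta(minutes=2)

healthy = is_dead(now, now - timedelta(hours=1), now - timedelta(seconds=30), timeout)
silent = is_dead(now, now - timedelta(hours=1), now - timedelta(minutes=5), timeout,
                 running_container=True)
never_probed = is_dead(now, now - timedelta(minutes=3), None, timeout)
```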

h1. Future plans / features

Per-instance-type VM images: It can be useful to run differently configured/tuned kernels/systems on different instance types, use different ops/monitoring systems on preemptible instances, etc. In addition to a system-wide default, each instance type could optionally specify an image.

Selectable VM images: When upgrading a production system, it can be useful to run a few trial containers on a new VM image before making it the default.