Dispatching containers to cloud VMs » History » Version 16

Tom Clegg, 10/16/2018 06:31 PM

h1. Dispatching containers to cloud VMs

(Draft)

{{>toc}}

h2. Component name / purpose

crunch-dispatch-cloud runs Arvados user containers on generic public cloud infrastructure by automatically creating and destroying VMs of various sizes according to demand, preparing the VMs' runtime environments, and running containers on them.

h2. Deployment

The crunch-dispatch-cloud process can run anywhere, as long as it has network access to the Arvados controller, the cloud provider's API, and the worker VMs. Each Arvados cluster should run only one crunch-dispatch-cloud process (future versions will support multiple dispatchers).

h2. Overview of operation

The dispatcher waits for containers to appear in the queue, and runs them on appropriately sized cloud VMs. When there are no idle cloud VMs of the desired size, the dispatcher brings up more VMs using the cloud provider's API. The dispatcher also shuts down VMs that have been idle longer than the configured idle timeout (TimeoutIdle), or sooner if the provider starts refusing to create new VMs.

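As a rough illustration of the scale-up decision described above, the following Python sketch computes how many VMs of each type would need to be created. It is not the dispatcher's actual code: the function name @vms_to_create@ and its arguments are hypothetical, and it assumes that VMs which are still pending/booting count toward supply.

<pre><code class="python">
from collections import Counter


def vms_to_create(wanted_types, available_types):
    """Return {instance type name: number of new VMs needed}.

    wanted_types: the desired instance type name of each queued container.
    available_types: the instance type name of each VM that is idle or
    still pending/booting (assumption: such VMs will take a container).
    """
    shortfall = Counter(wanted_types)
    shortfall.subtract(Counter(available_types))
    # Keep only the types where demand exceeds supply.
    return {name: n for name, n in shortfall.items() if n > 0}
</code></pre>

For example, two queued containers wanting m4.large and one idle m4.large VM would yield a request for one more m4.large.
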
h2. Interaction with other components

The controller (backed by RailsAPI and PostgreSQL) supplies the container queue: which containers the system should be trying to execute (or cancel) at any given time.

The cloud provider's API supplies a list of VMs that exist (or are being created) at a given time and their network addresses, accepts orders to create new VMs, updates instance tags, and (optionally, depending on the driver) supplies the VMs' SSH server public keys.

The SSH server on each cloud VM allows the dispatcher to authenticate with a private key and execute shell commands as root.

h2. Configuration

Arvados configuration (currently a file in /etc) supplies cloud provider credentials, allowed node types, spending limits/policies, etc.

<pre><code class="yaml">
    CloudVMs:
      BootProbeCommand: "docker ps -q"
      SyncInterval: 1m    # get list of
      TimeoutIdle: 1m     # shutdown if idle longer than this
      TimeoutBooting: 10m # shutdown if exists longer than this without running BootProbeCommand successfully
      TimeoutProbe: 2m    # shutdown if (after booting) communication fails longer than this, even if ctrs are running
      TimeoutShutdown: 1m # shutdown again if node still exists this long after shutdown
      Driver: Amazon
      DriverParameters:   # following configs are driver dependent
        Region: us-east-1
        APITimeout: 20s
        EC2Key: abcdef
        EC2Secret: abcdefghijklmnopqrstuvwxyz
        StorageKey: abcdef
        StorageSecret: abcdefghijklmnopqrstuvwxyz
        ImageID: ami-0123456789abcdef0
        SubnetID: subnet-01234567
        SecurityGroups: sg-01234567
    Dispatch:
      StaleLockTimeout: 1m     # after restart, time to wait for workers to come up before abandoning locks from previous run
      PollInterval: 1m         # how often to get latest queue from arvados controller
      ProbeInterval: 10s       # how often to probe each instance for current status/vital signs
      MaxProbesPerSecond: 1000 # limit total probe rate for dispatch process (across all instances)
      PrivateKey: |            # SSH key able to log in as root@ worker VMs
        -----BEGIN RSA PRIVATE KEY-----
        MIIEowIBAAKCAQEAqYm4XsQHm8sBSZFwUX5VeW1OkGsfoNzcGPG2nzzYRhNhClYZ
        0ABHhUk82HkaC/8l6d/jpYTf42HrK42nNQ0r0Yzs7qw8yZMQioK4Yk+kFyVLF78E
        GRG4pGAWXFs6pUchs/lm8fo9zcda4R3XeqgI+NO+nEERXmdRJa1FhI+Za3/S/+CV
        mg+6O00wZz2+vKmDPptGN4MCKmQOCKsMJts7wSZGyVcTtdNv7jjfr6yPAIOIL8X7
        LtarBCFaK/pD7uWll/Uj7h7D8K48nIZUrvBJJjXL8Sm4LxCNoz3Z83k8J5ZzuDRD
        gRiQe/C085mhO6VL+2fypDLwcKt1tOL8fI81MwIDAQABAoIBACR3tEnmHsDbNOav
        Oxq8cwRQh9K2yDHg8BMJgz/TZa4FIx2HEbxVIw0/iLADtJ+Z/XzGJQCIiWQuvtg6
        exoFQESt7JUWRWkSkj9JCQJUoTY9Vl7APtBpqG7rIEQzd3TvzQcagZNRQZQO6rR7
        p8sBdBSZ72lK8cJ9tM3G7Kor/VNK7KgRZFNhEWnmvEa3qMd4hzDcQ4faOn7C9NZK
        dwJAuJVVfwOLlOORYcyEkvksLaDOK2DsB/p0AaCpfSmThRbBKN5fPXYaKgUdfp3w
        70Hpp27WWymb1cgjyqSH3DY+V/kvid+5QxgxCBRq865jPLn3FFT9bWEVS/0wvJRj
        iMIRrjECgYEA4Ffv9rBJXqVXonNQbbstd2PaprJDXMUy9/UmfHL6pkq1xdBeuM7v
        yf2ocXheA8AahHtIOhtgKqwv/aRhVK0ErYtiSvIk+tXG+dAtj/1ZAKbKiFyxjkZV
        X72BH7cTlR6As5SRRfWM/HaBGEgED391gKsI5PyMdqWWdczT5KfxAksCgYEAwXYE
        ewPmV1GaR5fbh2RupoPnUJPMj36gJCnwls7sGaXDQIpdlq56zfKgrLocGXGgj+8f
        QH7FHTJQO15YCYebtsXWwB3++iG43gVlJlecPAydsap2CCshqNWC5JU5pan0QzsP
        exzNzWqfUPSbTkR2SRaN+MenZo2Y/WqScOAth7kCgYBgVoLujW9EXH5QfXJpXLq+
        jTvE38I7oVcs0bJwOLPYGzcJtlwmwn6IYAwohgbhV2pLv+EZSs42JPEK278MLKxY
        lgVkp60npgunFTWroqDIvdc1TZDVxvA8h9VeODEJlSqxczgbMcIUXBM9yRctTI+5
        7DiKlMUA4kTFW2sWwuOlFwKBgGXvrYS0FVbFJKm8lmvMu5D5x5RpjEu/yNnFT4Pn
        G/iXoz4Kqi2PWh3STl804UF24cd1k94D7hDoReZCW9kJnz67F+C67XMW+bXi2d1O
        JIBvlVfcHb1IHMA9YG7ZQjrMRmx2Xj3ce4RVPgUGHh8ra7gvLjd72/Tpf0doNClN
        ti/hAoGBAMW5D3LhU05LXWmOqpeT4VDgqk4MrTBcstVe7KdVjwzHrVHCAmI927vI
        pjpphWzpC9m3x4OsTNf8m+g6H7f3IiQS0aiFNtduXYlcuT5FHS2fSATTzg5PBon9
        1E6BudOve+WyFyBs7hFWAqWFBdWujAl4Qk5Ek09U2ilFEPE7RTgJ
        -----END RSA PRIVATE KEY-----
    InstanceTypes:
    - Name: m4.large
      VCPUs: 2
      RAM: 7782000000
      Scratch: 32000000000
      Price: 0.1
    - Name: m4.large.spot
      Preemptible: true
      VCPUs: 2
      RAM: 7782000000
      Scratch: 32000000000
      Price: 0.1
    - Name: m4.xlarge
      VCPUs: 4
      RAM: 15564000000
      Scratch: 80000000000
      Price: 0.2
    - Name: m4.xlarge.spot
      Preemptible: true
      VCPUs: 4
      RAM: 15564000000
      Scratch: 80000000000
      Price: 0.2
    - Name: m4.2xlarge
      VCPUs: 8
      RAM: 31129000000
      Scratch: 160000000000
      Price: 0.4
    - Name: m4.2xlarge.spot
      Preemptible: true
      VCPUs: 8
      RAM: 31129000000
      Scratch: 160000000000
      Price: 0.4
</code></pre>

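The InstanceTypes entries carry enough information to match a container's resource needs to a VM size. The selection rule is not spelled out in this document; the Python sketch below assumes that the cheapest type satisfying the request is chosen, and that preemptible types are used only when asked for. The class, function, and field names are hypothetical; the values are copied from the example configuration.

<pre><code class="python">
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: int
    ram: int        # bytes
    scratch: int    # bytes
    price: float    # per hour
    preemptible: bool = False


# A subset of the InstanceTypes list from the example configuration.
EXAMPLE_TYPES = [
    InstanceType("m4.large", 2, 7782000000, 32000000000, 0.1),
    InstanceType("m4.large.spot", 2, 7782000000, 32000000000, 0.1, True),
    InstanceType("m4.xlarge", 4, 15564000000, 80000000000, 0.2),
    InstanceType("m4.2xlarge", 8, 31129000000, 160000000000, 0.4),
]


def choose_instance_type(types, vcpus, ram, scratch, preemptible=False):
    """Return the cheapest type that fits, or None if nothing fits.

    Assumption (not stated in the design): "cheapest that fits" is the rule.
    """
    candidates = [
        t for t in types
        if t.preemptible == preemptible
        and t.vcpus >= vcpus and t.ram >= ram and t.scratch >= scratch
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda t: t.price)
</code></pre>

Under these assumptions, a container asking for 3 VCPUs and 8 GB of RAM would be assigned m4.xlarge, and a request that no configured type can satisfy returns None.
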
h2. Management API

APIs for monitoring/diagnostics/control are available via HTTP on a configurable address/port. Request headers must include "Authorization: Bearer {management token}".

Responses are JSON-encoded and resemble other Arvados APIs:
<pre><code class="json">
{
  "Items": [
    {
      "Name": "...",
      ...
    },
    ...
  ]
}
</code></pre>

@GET /arvados/v1/dispatch/instances@ lists cloud VMs. Each returned item includes:
* provider's instance ID
* hourly price (from configuration file)
* instance type (as named in the configuration file)
* instance type (as named in the provider's menu)
* UUID of the current / most recent container attempted (if known)
* time the last container finished (or boot time, if no container has run yet)

@GET /arvados/v1/dispatch/containers@ lists queued/locked/running containers. Each returned item includes:
* container UUID
* container state (Queued/Locked/Running/Complete/Cancelled)
* desired instance type
* time appeared in queue
* time started (if started)

@POST /arvados/v1/dispatch/instances/:instance_id/drain@ puts an instance in "drain" state.
* if the instance is currently running a container, it is allowed to continue
* no further containers will be scheduled on the instance
* (TBD) the instance will not be shut down automatically

@POST /arvados/v1/dispatch/instances/:instance_id/shutdown@ puts an instance in "shutdown" state.
* if the instance is currently running a container, the instance is shut down when the container finishes
* otherwise, the instance is shut down immediately

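A client for the endpoints above might look like the following Python sketch, which uses only the standard library. It builds a request carrying the bearer token and extracts the "Name" field shown in the response skeleton. The helper names, host, port, and token are placeholders, not part of the design.

<pre><code class="python">
import json
import urllib.request


def management_request(base_url, path, token, method="GET"):
    """Build (but do not send) an authenticated management API request."""
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        method=method,
        headers={"Authorization": "Bearer " + token},
    )


def item_names(response_body):
    """Return the "Name" field of each entry in a JSON list response."""
    return [item.get("Name") for item in json.loads(response_body)["Items"]]


# Hypothetical address and token, for illustration only.
list_req = management_request(
    "http://dispatcher.example:9006", "/arvados/v1/dispatch/instances", "xyzzy")
</code></pre>

Passing the request to @urllib.request.urlopen@ would send it; a drain request is built the same way with @method="POST"@ and the @/drain@ path.
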
h2. Metrics

Metrics are available via HTTP on a configurable address/port (conventionally :9005). Request headers must include "Authorization: Bearer {management token}".

Metrics include:
* [future] (summary) time elapsed between VM creation and first successful SSH connection to that VM
* [future] (summary) time elapsed between first successful SSH connection on a VM and ready to run a container on that VM
* (gauge) total hourly price of all existing VMs
* (gauge) total VCPUs and memory allocated to containers
* (gauge) number of containers running
* (gauge) number of containers allocated to VMs but not started yet (because VMs are pending/booting)
* (gauge) number of containers not allocated to VMs (because provider quota is reached)

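To make the gauge definitions concrete, here is a minimal Python sketch that derives some of them from a snapshot of dispatcher state. The gauge names, the function name, and the three status strings are invented for illustration; the design above does not fix any of them.

<pre><code class="python">
from collections import Counter


def compute_gauges(vm_hourly_prices, container_statuses):
    """Summarize a state snapshot as a dict of gauge values.

    vm_hourly_prices: hourly price of each existing VM.
    container_statuses: per container, one of "running", "allocated"
    (VM pending/booting) or "unallocated" (e.g. provider quota reached).
    """
    counts = Counter(container_statuses)
    return {
        "vm_hourly_price_total": sum(vm_hourly_prices),
        "containers_running": counts["running"],
        "containers_allocated_not_started": counts["allocated"],
        "containers_not_allocated": counts["unallocated"],
    }
</code></pre>
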
h2. Logs

For troubleshooting purposes, a log message is printed on stderr when any of the following events occurs. Each message includes the details listed.

|_. Event|_. Message includes|
|a new instance is created/ordered|instance type name|
|an instance appears on the provider's list of instances|instance ID|
|an instance's boot probe succeeds|instance ID|
|an instance is shut down after boot timeout|instance ID, stdout/stderr/error from last boot probe attempt|
|an instance shutdown is requested|instance ID|
|an instance disappears from the provider's list of instances|instance ID and previous state (booting/idle/shutdown)|
|a cloud provider API or driver error occurs|provider/driver's error message|
|a new container appears in the Arvados queue|container UUID, desired instance type name|
|a container is locked by the dispatcher|container UUID|
|a crunch-run process is started on an instance|container UUID, instance ID, crunch-run PID|
|a crunch-run process fails to start on an instance|container UUID, instance ID, stdout/stderr/exitcode|
|a crunch-run process ends|container UUID, instance ID|
|an active container's state changes to Complete or Cancelled|container UUID, new state|
|an active container is requeued after being locked|container UUID|
|an Arvados API error occurs|error message|

(Example log entries should be shown here)

If the dispatcher starts with a non-empty ARVADOS_DEBUG environment variable, it also prints more detailed logs about other internal state changes, using level=debug.

h2. Internal details

h3. Scheduling policy

The container priority field determines the order in which resources are allocated.
* If container C1 has priority P1,
* ...and C2 has higher priority P2,
* ...and there is no pending/booting/idle VM suitable for running C2,
* ...then C1 will not be started.

However, containers that run on different VM types don't necessarily start in priority order.
* If container C1 has priority P1,
* ...and C2 has higher priority P2,
* ...and there is no idle VM suitable for running C2,
* ...and there is a pending/booting VM that will be suitable for running C2 when it comes up,
* ...and there is an idle VM suitable for running C1,
* ...then C1 will start before C2.

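The two rules above can be sketched as a single pass over the queue in priority order. This Python sketch is an illustration only: the function name and data shapes are hypothetical, "suitable" is simplified to an exact instance type match, and the pass does not model the creation of new VMs.

<pre><code class="python">
def containers_to_start(containers, idle, pending):
    """Return the names of containers that can start right now.

    containers: list of (name, priority, instance_type) tuples.
    idle: instance type names of idle VMs.
    pending: instance type names of pending/booting VMs.
    """
    idle = list(idle)
    pending = list(pending)
    started = []
    # Highest priority first; ties keep their queue order (stable sort).
    for name, _priority, instance_type in sorted(containers, key=lambda c: -c[1]):
        if instance_type in idle:
            idle.remove(instance_type)
            started.append(name)
        elif instance_type in pending:
            # A VM is on its way: reserve it and let lower priorities proceed.
            pending.remove(instance_type)
        else:
            # No VM at all for this container: lower priorities must wait.
            break
    return started
</code></pre>

With C1 (lower priority) wanting a type that has an idle VM and C2 (higher priority) wanting a type with no VM at all, nothing starts. If a VM of C2's type is pending, C2 reserves it and C1 starts first.
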
h3. Special cases / synchronizing state

When first starting up, the dispatcher inspects the API server's container queue and the cloud provider's list of dispatcher-tagged cloud nodes, and restores its internal state accordingly.

Some containers might have state=Locked at startup. The dispatcher can't be sure these have no corresponding crunch-run process anywhere until it establishes communication with all running instances. To avoid breaking priority order by guessing wrong, the dispatcher does not schedule any new containers until one of the following is true:
* all such "stale-locked" containers have been matched up with crunch-run processes on existing VMs (typically preparing a docker image), or
* all of the existing VMs have been probed successfully (meaning the remaining locked containers aren't running anywhere and need to be rescheduled).

When a user cancels a container request whose container has state=Locked or Running, the container priority changes to 0. On its next poll, the dispatcher notices this and kills any corresponding crunch-run processes (or, if there is no such process, just unlocks the container).

When a crunch-run process ends without finalizing its container's state, the dispatcher notices this and sets state to Cancelled.

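The startup gate for stale-locked containers reduces to two set comparisons. The following Python sketch shows the idea; the function names and the use of plain sets of UUIDs and instance IDs are assumptions made for illustration.

<pre><code class="python">
def may_schedule_new_containers(stale_locked, matched, existing_vms, probed_ok):
    """Return True once it is safe to schedule new containers.

    stale_locked: UUIDs of containers found in state=Locked at startup.
    matched: UUIDs matched to a crunch-run process on some VM.
    existing_vms: instance IDs of all existing VMs.
    probed_ok: instance IDs that have been probed successfully.
    """
    all_matched = stale_locked <= matched        # subset test
    all_probed = existing_vms <= probed_ok
    return all_matched or all_probed


def containers_to_reschedule(stale_locked, matched, existing_vms, probed_ok):
    """Stale-locked containers known not to be running anywhere."""
    if existing_vms <= probed_ok:
        return stale_locked - matched
    return set()  # not known yet: some VM has not answered a probe
</code></pre>
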
h3. SSH keys

The operator must install a public key in /root/.ssh/authorized_keys on each worker node. The dispatcher holds the corresponding private key (the PrivateKey configuration entry).

(Future) The dispatcher generates its own keys and installs its public key on new VMs using cloud provider bootstrapping/metadata features.

h3. Probes

On the happy path, the dispatcher already knows the state of each worker: whether it's idle, and which container it's running. In general, though, it's necessary to probe the worker node itself.

Probe:
* Check whether the SSH connection is alive; reopen if needed.
* Run the configured "ready?" command (BootProbeCommand; e.g., "grep /encrypted-tmp /etc/mtab"); if this fails, conclude the node is still booting.
* Run "crunch-run --list" to get a list of crunch-run supervisors (pid + container UUID)

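The probe steps above can be sketched as follows. This is a simplified Python illustration, not the dispatcher's implementation: @run@ is a hypothetical callable that executes a command on the worker over an already-open SSH connection and returns an exit code and stdout, the state names are invented, and each non-empty output line of "crunch-run --list" is assumed to describe one supervisor.

<pre><code class="python">
def probe(run, boot_probe_command):
    """Classify a worker as "booting", "idle", "running" or "unknown".

    run(cmd) -> (exit_code, stdout) is supplied by the caller.
    Returns (state, supervisor_lines).
    """
    exit_code, _ = run(boot_probe_command)
    if exit_code != 0:
        # The "ready?" command failed: the node is still booting.
        return "booting", []
    exit_code, stdout = run("crunch-run --list")
    if exit_code != 0:
        return "unknown", []
    supervisors = [line for line in stdout.splitlines() if line.strip()]
    return ("running" if supervisors else "idle"), supervisors
</code></pre>

Injecting @run@ keeps the sketch testable without a real SSH connection; a caller would wrap its SSH client in a function of that shape.
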
h3. Detecting dead/lame nodes

If a node has been up for N seconds without a successful probe, despite at least M attempts, it is shut down, even if it was running a container last time it was contacted successfully.

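Expressed as code, the rule is a conjunction of the two thresholds. In this minimal Python sketch the function and parameter names are hypothetical, and N and M are passed in rather than read from configuration.

<pre><code class="python">
def is_dead(seconds_without_success, failed_attempts, n_seconds, m_attempts):
    """Return True if the node should be shut down as dead/lame.

    Both conditions must hold: enough time has passed without a
    successful probe, and enough probes were actually attempted.
    Whether the node was last seen running a container is deliberately
    not considered.
    """
    return seconds_without_success >= n_seconds and failed_attempts >= m_attempts
</code></pre>

Requiring a minimum number of attempts avoids shutting down a node merely because the dispatcher itself was too busy to probe it.
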
h3. Multiple dispatchers

Not supported in the initial version.