Feature #14325
Updated by Tom Clegg about 6 years ago
This issue covers the smallest version that can be deployed on a dev cluster.

Background -- already done in #14360:
* Bring up nodes and run containers on them
* Structured logs for diagnostics+statistics: cloud API errors, node lifecycle, container lifecycle
* HTTP status report with current set of containers (queued/running) and VMs (busy/idle) -- see [[Dispatching containers to cloud VMs#Operator view]] "Operator view"
* Shutdown idle nodes automatically
* Handle cloud API quota/ratelimit errors
* Package-building changes are in place, but commented out

Requirements:
* One cloud vendor driver (Azure = #14324)
* Ops mechanism for draining a node (e.g., curl command using a management token)
* Resource consumption metrics (instances running/allocated, hourly cost)
* Go from unknown/booting to drain state automatically if boot probe fails + containers are running
* Configurable port number for connecting to VM SSH servers
* Pass API host and token to crunch-run command
* Test SSH host key verification
* Test container.Queue using real railsAPI/controller
* Test resuming state after restart (some instances are booting, some idle, some running containers, some on admin-hold)
* Cancel containers that can't be scheduled
* Cancel container after some number of start/requeue cycles
* Cancel container with no suitable instance type
* Enable package build

Undecided: (might not be blockers for first dev deploy)
* Update runtime_status field when cancelling containers
* Ops mechanism for hold/release (add tags so hold state survives dispatcher restart)
* Test activity/resource usage metrics
* "Broken node" hook
* crunch-run --detach: retrieve stdout/stderr during probe, and show it in dispatcher logs
* crunch-run --detach: cleanup old stdout/stderr
* Handle cloud API ratelimit errors
* Clean up testing code -- eliminate LameInstanceSet in favor of test.StubDriver, move fakeVM to test pkg test.StubDriver
* Send SIGKILL if container process still running after several SIGTERM attempts / N seconds after first SIGTERM
* Shutdown node if container process still running after several SIGKILL attempts

Non-requirements:
* Multiple cloud drivers
* Test suite that uses a real cloud provider
* Prometheus metrics (containers in queue, time container queued before starting, workers in each state, etc)
* Periodic status reports in logs
* Optimize worker VM deployment (for now, we still expect the operator to provide an image with a suitable version of crunch-run)
* Configurable spending limits
* Generic driver test suite

Refs
* [[Dispatching containers to cloud VMs]]
* #13964 spike
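
Sketch for the two signal-escalation items under "Undecided" (SIGTERM, then SIGKILL, then node shutdown). This is only an illustration of one possible decision rule, not the dispatcher's actual code. The names @Policy@, @Action@, @Next@, and @DefaultPolicy@, and all threshold values, are hypothetical; the issue leaves "several" and "N seconds" unspecified.

```go
package main

import (
	"fmt"
	"time"
)

// Action is the next step to take for a container process that is still running.
// (Hypothetical type, not taken from the dispatcher source.)
type Action string

const (
	SendTERM     Action = "send SIGTERM"
	SendKILL     Action = "send SIGKILL"
	ShutdownNode Action = "shut down node"
)

// Policy holds the escalation thresholds. All fields are assumptions:
// the issue only says "several attempts" and "N seconds".
type Policy struct {
	MaxTERM     int           // SIGTERM attempts before escalating
	TERMTimeout time.Duration // time after first SIGTERM before escalating
	MaxKILL     int           // SIGKILL attempts before giving up on the node
}

// DefaultPolicy uses placeholder values chosen only for this example.
var DefaultPolicy = Policy{MaxTERM: 3, TERMTimeout: 30 * time.Second, MaxKILL: 2}

// Next decides what to do given how many signals were already sent and
// how long ago the first SIGTERM was sent.
func (p Policy) Next(termSent, killSent int, sinceFirstTERM time.Duration) Action {
	if killSent >= p.MaxKILL {
		// SIGKILL did not work either: stop using this node.
		return ShutdownNode
	}
	if termSent >= p.MaxTERM || (termSent > 0 && sinceFirstTERM >= p.TERMTimeout) {
		// Either attempt count or elapsed time triggers escalation.
		return SendKILL
	}
	return SendTERM
}

func main() {
	fmt.Println(DefaultPolicy.Next(1, 0, 5*time.Second))  // send SIGTERM
	fmt.Println(DefaultPolicy.Next(1, 0, 45*time.Second)) // send SIGKILL
	fmt.Println(DefaultPolicy.Next(3, 2, time.Minute))    // shut down node
}
```

Keeping the decision in a pure function like this would let the thresholds be unit-tested without real processes or VMs; the actual signal delivery and node shutdown would live elsewhere.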