Feature #14325: [crunch-dispatch-cloud] Dispatch containers to cloud VMs directly, without slurm or nodemanager - Arvados

This issue covers the smallest version that can be deployed on a dev cluster. 

 Background -- already done in #14360: 
 * Bring up nodes and run containers on them 
 * Structured logs for diagnostics+statistics: cloud API errors, node lifecycle, container lifecycle 
 * HTTP status report with current set of containers (queued/running) and VMs (busy/idle) -- see [[Dispatching containers to cloud VMs#Operator view]] 
 * Shutdown idle nodes automatically 
 * Handle cloud API quota/ratelimit errors 
 * Package-building changes are in place, but commented out 

 Requirements: 
 * One cloud vendor driver (Azure = #14324) 
 * Ops mechanism for draining a node (e.g., curl command using a management token) 
 * Resource consumption metrics (instances running/allocated, hourly cost) 
 * Move an instance from unknown/booting to drain state automatically if the boot probe fails while containers are running on it 
 * Configurable port number for connecting to VM SSH servers 
 * Pass API host and token to crunch-run command 
 * Test SSH host key verification 
 * Test container.Queue using real railsAPI/controller 
 * Test resuming state after restart (some instances are booting, some idle, some running containers, some on admin-hold) 
 * Cancel containers that can't be scheduled 
 * Cancel container after some number of start/requeue cycles 
 * Cancel container with no suitable instance type 
 * Enable package build 
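
The boot-probe requirement above (unknown/booting to drain when the probe fails and containers are running) can be read as a small state transition rule. The following is a minimal sketch under assumptions: the @State@ names and the @afterBootProbe@ function are hypothetical, and the behaviour for cases the requirement does not mention (probe succeeds, or probe fails with nothing running) is a guess made only to complete the example.

```go
package main

import "fmt"

// State is a hypothetical worker lifecycle state; the names are
// illustrative and not taken from the dispatcher source.
type State int

const (
	StateUnknown State = iota
	StateBooting
	StateIdle
	StateRunning
	StateDrain
)

func (s State) String() string {
	return [...]string{"unknown", "booting", "idle", "running", "drain"}[s]
}

// afterBootProbe returns the state a worker should be in after one boot
// probe. Only workers in unknown/booting are affected.
func afterBootProbe(cur State, probeOK bool, runningContainers int) State {
	if cur != StateUnknown && cur != StateBooting {
		return cur // rule applies only before the worker is known to be up
	}
	if probeOK {
		if runningContainers > 0 {
			return StateRunning
		}
		return StateIdle
	}
	if runningContainers > 0 {
		// Probe failed but work is in progress: stop scheduling new
		// containers here and let the existing ones finish.
		return StateDrain
	}
	return cur // assumption: keep waiting (a boot timeout would apply elsewhere)
}

func main() {
	fmt.Println(afterBootProbe(StateBooting, false, 2))
	fmt.Println(afterBootProbe(StateBooting, false, 0))
	fmt.Println(afterBootProbe(StateUnknown, true, 0))
}
```

The case that matters most is a dispatcher restart: an instance whose state is unknown may already be running containers, so a failed probe should not lead straight to shutdown.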

 Undecided (might not be blockers for first dev deploy): 
 * Update runtime_status field when cancelling containers 
 * Ops mechanism for hold/release (add tags so hold state survives dispatcher restart) 
 * Test activity/resource usage metrics 
 * "Broken node" hook 
 * crunch-run --detach: retrieve stdout/stderr during probe, and show it in dispatcher logs 
 * crunch-run --detach: cleanup old stdout/stderr 
 * Handle cloud API ratelimit errors 
 * Clean up testing code -- eliminate LameInstanceSet in favor of test.StubDriver, move fakeVM to test pkg test.StubDriver 
 * Send SIGKILL if container process still running after several SIGTERM attempts / N seconds after first SIGTERM 
 * Shutdown node if container process still running after several SIGKILL attempts 

 Non-requirements: 
 * Multiple cloud drivers 
 * Test suite that uses a real cloud provider 
 * Prometheus metrics (containers in queue, time container queued before starting, workers in each state, etc) 
 * Periodic status reports in logs 
 * Optimize worker VM deployment (for now, we still expect the operator to provide an image with a suitable version of crunch-run) 
 * Configurable spending limits 
 * Generic driver test suite 

 Refs: 
 * [[Dispatching containers to cloud VMs]] 
 * #13964 spike
