Story #14807
Updated by Tom Clegg almost 6 years ago
Issues encountered & fixed/worked around during dev deploy:
* -Include instance address (host or IP) in logs and management API responses-
* -Ensure @crunch-run --list@ works even if /var/lock is a symlink-
* -Log full instance ID, not (Instance)String(), which might be an abbreviated name-
* -Fix management API endpoints to allow specifying instance IDs that have slashes-
* -Pass SSH public key to Azure so it doesn't crash (Azure refuses to create a node without adding an admin account)-
* -Fix host part of SSH target address being dropped-
* -Allow driver to specify a login username-
* -Send ARVADOS_API_* values on stdin instead of environment vars (typical SSH server is configured to refuse these env vars)-
* -If ProviderType is not given in an instance type in the cluster config, default to the type name (not the empty string)- (see the sketch after this list)
* -Pass a random string to Azure driver as "node-token" (or fix Azure driver so it doesn't expect that)-
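For the ProviderType fallback above (already fixed), the change amounts to defaulting each entry of the instance-type map when the config is loaded. A minimal sketch -- the struct below is an illustrative stand-in, not the real config type:

<pre><code class="go">
package main

// InstanceType is a stand-in for the dispatcher's instance type
// config entry; the real struct has more fields.
type InstanceType struct {
	Name         string
	ProviderType string
}

// defaultProviderTypes fills in ProviderType from the map key when a
// configured instance type leaves it empty, so the driver never sees
// an empty provider type.
func defaultProviderTypes(types map[string]InstanceType) {
	for name, it := range types {
		if it.ProviderType == "" {
			it.ProviderType = name
			types[name] = it
		}
	}
}
</code></pre>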
Further improvements necessary to run in production:
* -Send detached crunch-run stdout+stderr to systemd journal so sysadmin can make subsequent arrangements if needed-
* -Metrics: total cost of nodes in idle or booting state-
* -Metrics: total cost of nodes with admin-hold flag set-
* -Log when an instance goes down unexpectedly (i.e., state != Shutdown when deleted from list)-
* -Log when a container is added to or dropped from the queue-
* -Obey logging format in cluster config file (as of #14325, HTTP request logs were JSON, operational logs were text)-
* Send SIGKILL if the container process is still running after several SIGTERM attempts (or N seconds after the first SIGTERM) -- see the escalation sketch after this list
* Shut down the node if the container process is still running after several SIGKILL attempts
* Provide a "mark node as broken" callback mechanism for crunch-run (drain node, unless it's already marked "hold" -- see #14807#note-20)
* Configurable rate limit for Create and Destroy calls to the cloud API (background: exceeding API call rate limits can incur penalties; also, when multiple instance types are created concurrently, the cloud might create the lower-priority types but then hit quota before creating the higher-priority types; see #14360#note-36) -- a limiter sketch follows this list
* Metrics: number of containers, split by state and instance type (see the GaugeVec sketch after this list)
* Load API host & token from cluster config file instead of env vars
* Ensure crunch-run exits instead of hanging if ARVADOS_API_HOST/TOKEN is empty or broken
* Kill containers (or at least log a warning) if a worker is kept busy by a container whose UUID does not exist according to the API server's queue (e.g., container deleted from database)
* "Kill instance now" management API
* (Azure) Error out if AddedScratch>0, because adding scratch space isn't implemented yet
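For the SIGTERM/SIGKILL escalation above, the shutdown path could look roughly like this; the attempt count and interval are placeholders, not real config knobs:

<pre><code class="go">
package main

import (
	"os"
	"syscall"
	"time"
)

// stopContainer escalates from SIGTERM to SIGKILL. The retry count and
// polling interval are illustrative, not actual configuration values.
func stopContainer(proc *os.Process) {
	for i := 0; i < 3; i++ {
		proc.Signal(syscall.SIGTERM)
		time.Sleep(5 * time.Second)
		// Signal 0 only checks whether the process still exists.
		if err := proc.Signal(syscall.Signal(0)); err != nil {
			return // process has exited
		}
	}
	// Still running after several SIGTERMs; stop asking politely.
	proc.Signal(syscall.SIGKILL)
}
</code></pre>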
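For the cloud API rate limit, one plausible shape is a shared token bucket in front of the driver's Create/Destroy calls. Sketch only -- the cloudOps interface below is a stand-in, not the real driver interface:

<pre><code class="go">
package main

import (
	"context"

	"golang.org/x/time/rate"
)

// cloudOps is a stand-in for the subset of the driver interface we
// want to throttle; the real interface has different signatures.
type cloudOps interface {
	Create(ctx context.Context) error
	Destroy(ctx context.Context, instanceID string) error
}

// rateLimited wraps a driver so Create and Destroy share one token
// bucket, keeping us under the configured calls-per-second.
type rateLimited struct {
	cloudOps
	limiter *rate.Limiter
}

func newRateLimited(d cloudOps, callsPerSecond float64) rateLimited {
	return rateLimited{cloudOps: d, limiter: rate.NewLimiter(rate.Limit(callsPerSecond), 1)}
}

func (rl rateLimited) Create(ctx context.Context) error {
	if err := rl.limiter.Wait(ctx); err != nil {
		return err
	}
	return rl.cloudOps.Create(ctx)
}

func (rl rateLimited) Destroy(ctx context.Context, instanceID string) error {
	if err := rl.limiter.Wait(ctx); err != nil {
		return err
	}
	return rl.cloudOps.Destroy(ctx, instanceID)
}
</code></pre>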
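And for the per-state/per-type container counts, a Prometheus GaugeVec on the existing metrics registry should be enough; the metric and label names below are guesses, not the names that will actually ship:

<pre><code class="go">
package main

import "github.com/prometheus/client_golang/prometheus"

// containersByState counts containers by state and instance type.
// Metric and label names are illustrative only.
var containersByState = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Namespace: "arvados",
	Subsystem: "dispatchcloud",
	Name:      "containers",
	Help:      "Number of containers, by state and instance type.",
}, []string{"state", "instance_type"})

func registerContainerMetrics(reg *prometheus.Registry) {
	reg.MustRegister(containersByState)
}

// updateContainerMetrics would run after each queue poll, with counts
// tallied from the in-memory queue keyed by (state, instance type).
func updateContainerMetrics(counts map[[2]string]int) {
	containersByState.Reset()
	for key, n := range counts {
		containersByState.WithLabelValues(key[0], key[1]).Set(float64(n))
	}
}
</code></pre>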
Improvements that are desired, but not necessary to run in production (noted here for clarity until they move to their own tickets):
* -crunch-run --detach: retrieve stdout/stderr during probe, and show it in dispatcher logs- (logs go to journal instead)
* -crunch-run --detach: cleanup old stdout/stderr- (logs go to journal instead)
* -Move "cat .../node-token" host key verification mechanism out of Azure driver (instead, have the dispatcher do this itself if the driver returns cloud.ErrNotImplemented)-
* Metrics that indicate cloud failure (time we’ve spent trying but failing to create a new instance) -- see the metrics sketch after this list
* Test suite that uses a real cloud provider
* Test activity/resource usage metrics
* Multiple cloud drivers
* Generic driver test suite
* Performance metrics for dispatching (e.g., time between seeing a container in the queue and starting its crunch-run process on a worker) -- see the metrics sketch after this list
* Configurable spending limits
* Update runtime_status field when cancelling containers after crunch-run crashes or the cloud VM dies without finalizing the container (already done for the “no suitable instance type” case)
* If present, use VM image ID given in runtime_constraints instead of image ID from cluster config file
* (API) Allow admin users to specify image ID in runtime_constraints
* Metrics: count unexpected shutdowns, split by instance type
* Atomically install the correct version of crunch-run (perhaps by copying /proc/self/exe) on the worker VM as part of the boot probe (see the rename-into-place sketch after this list)
* Run crunch-run as a non-root user
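Two of the metrics mentioned above (cloud-failure time and dispatch latency) could look something like this; both metric names are made up for illustration:

<pre><code class="go">
package main

import "github.com/prometheus/client_golang/prometheus"

var (
	// Cumulative time spent in failed attempts to create instances,
	// i.e. periods when the cloud is not delivering what we ask for.
	instanceCreateFailSeconds = prometheus.NewCounter(prometheus.CounterOpts{
		Namespace: "arvados",
		Subsystem: "dispatchcloud",
		Name:      "instance_create_failure_seconds_total",
		Help:      "Time spent trying and failing to create new instances.",
	})
	// Time from first seeing a container in the queue to starting its
	// crunch-run process on a worker.
	containerStartLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
		Namespace: "arvados",
		Subsystem: "dispatchcloud",
		Name:      "container_queue_to_start_seconds",
		Help:      "Latency from queue appearance to crunch-run start.",
		Buckets:   prometheus.ExponentialBuckets(1, 2, 12),
	})
)

func registerDispatchMetrics(reg *prometheus.Registry) {
	reg.MustRegister(instanceCreateFailSeconds, containerStartLatency)
}
</code></pre>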
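For the atomic crunch-run install, the usual pattern is to write to a temporary file and rename it into place, so the boot probe never sees a half-written binary. A local-filesystem sketch (the SSH/remote plumbing is omitted and paths are made up):

<pre><code class="go">
package main

import (
	"io"
	"os"
	"path/filepath"
)

// installAtomically copies src (e.g. the dispatcher's own binary via
// /proc/self/exe) to dst, writing a temp file first and renaming it
// into place so readers never see a partially written executable.
func installAtomically(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()

	tmp, err := os.CreateTemp(filepath.Dir(dst), ".crunch-run-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // cleans up on error; harmless after rename

	if _, err := io.Copy(tmp, in); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Chmod(0o755); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), dst)
}
</code></pre>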
Improvements that might never be implemented at all (noted here for clarity):
* Periodic status reports in logs. This kind of logging should normally (always?) be handled by an external monitoring system that connects to the existing metrics endpoint.
* Cancel containers that take longer than a configurable time limit to schedule (e.g., no nodes ever come up). Unsure whether this is useful: maybe containers should just stay queued until the problem is fixed.
[[Dispatching containers to cloud VMs]]