Feature #14325

closed

[crunch-dispatch-cloud] Dispatch containers to cloud VMs directly, without slurm or nodemanager

Added by Tom Clegg about 6 years ago. Updated almost 6 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 01/28/2019
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 1.0
Release relationship: Auto

Description

This issue covers the smallest version of Dispatching containers to cloud VMs that can be deployed on a dev cluster.

Background -- already done in #14360:
  • Bring up nodes and run containers on them
  • Structured logs for diagnostics+statistics: cloud API errors, node lifecycle, container lifecycle
  • HTTP status report with current set of containers (queued/running) and VMs (busy/idle) -- see Dispatching containers to cloud VMs "Operator view"
  • Shut down idle nodes automatically
  • Handle cloud API quota errors
  • Package-building changes are in place, but commented out
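
As a rough illustration of the "shut down idle nodes automatically" item above, the selection step could look something like the following. This is only a sketch in Python, not the dispatcher's actual code: the `Instance` record, its field names, the state strings, and the `idle_timeout` parameter are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Instance:
    # Hypothetical record; field names are illustrative only.
    name: str
    state: str                          # e.g. "booting", "idle", "running"
    idle_since: Optional[datetime] = None


def instances_to_shut_down(
    instances: List[Instance], now: datetime, idle_timeout: timedelta
) -> List[Instance]:
    """Return the instances that have been idle for at least idle_timeout."""
    return [
        inst
        for inst in instances
        if inst.state == "idle"
        and inst.idle_since is not None
        and now - inst.idle_since >= idle_timeout
    ]
```

Instances that are booting or running containers are never selected here; only the time spent in the (assumed) "idle" state counts toward the timeout.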
Requirements covered here:
  • Ops mechanism for draining a node (e.g., curl command using a management token) -- see Dispatching containers to cloud VMs "Management API"
  • Resource consumption metrics (number of instances, number of containers running, total hourly price of all existing VMs) -- see Dispatching containers to cloud VMs "Metrics"
  • Drain (not kill) instances that exist at startup, fail boot probe, but are already running containers -- see Dispatching containers to cloud VMs "Special cases / synchronizing state"
  • Configurable port number for connecting to VM SSH servers
  • Pass API host and dispatcher's token to crunch-run command via ARVADOS_API_* environment variables
  • Test SSH host key verification (dispatcher's token is not sent to a remote host unless the host's SSH key passes the VerifyHostKey() method provided by the cloud driver)
  • Test container.Queue using real railsAPI/controller
  • Test resuming state after restart (some instances are booting, some idle, some running containers, some draining, some on admin-hold)
  • Cancel container after some number of start/requeue cycles (i.e., crunch-run --detach succeeded, but child exited without moving container past Locked state)
  • Cancel container with no suitable instance type
  • Enable package build
  • Handle cloud API ratelimit errors (obey holdoff time returned by driver... incl. test)
  • Update management API response format (lowercase keys)
  • Confirm all probe failures are logged once instance is booted (see #14360#note-38, fixed in 7a047d8b6)
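
To make the "cancel container after some number of start/requeue cycles" requirement more concrete, here is one possible shape for the bookkeeping, sketched in Python for illustration only. The class name, method names, and the `max_attempts` threshold are hypothetical; the issue only says "some number" and does not specify a value or an API.

```python
from collections import defaultdict


class StartAttemptTracker:
    """Counts start attempts that left a container stuck in the Locked state."""

    def __init__(self, max_attempts: int):
        # max_attempts is an assumed, configurable threshold.
        self.max_attempts = max_attempts
        self._attempts = defaultdict(int)

    def record_failed_start(self, container_uuid: str) -> str:
        """Record one start/requeue cycle and return "requeue" or "cancel"."""
        self._attempts[container_uuid] += 1
        if self._attempts[container_uuid] >= self.max_attempts:
            return "cancel"
        return "requeue"

    def forget(self, container_uuid: str) -> None:
        """Drop the counter once a container moves past Locked or is cancelled."""
        self._attempts.pop(container_uuid, None)
```

The counter is kept per container, so one container that repeatedly fails to start does not affect the decision for any other container in the queue.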
Requirements covered elsewhere:
  • One cloud vendor driver (Azure = #14324)
  • Production-readiness (#14807)
Refs

Subtasks 1 (0 open, 1 closed)

Task #14664: Review 14325-dispatch-cloud (Resolved, Peter Amstutz, 01/28/2019)

Related issues 6 (0 open, 6 closed)

Related to Arvados - Feature #14324: [crunch-dispatch-cloud] Azure driver (Resolved, Peter Amstutz, 01/09/2019)
Related to Arvados - Bug #13964: crunch-dispatch-cloud spike (Resolved, Tom Clegg)
Related to Arvados - Story #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching (Resolved)
Related to Arvados - Story #14360: [crunch-dispatch-cloud] Merge incomplete implementation (Resolved, Tom Clegg, 10/26/2018)
Precedes Arvados - Story #14796: [crunch-dispatch-cloud] Document installation / migration from c-d-slurm + node manager (Resolved, Tom Clegg, 01/29/2019)
Precedes Arvados - Story #14807: [arvados-dispatch-cloud] Features/fixes needed before first production deploy (Resolved, Tom Clegg, 01/29/2019)
