Story #15026

closed

[arvados-dispatch-cloud] Cloud driver/config testing tool

Added by Tom Clegg over 5 years ago. Updated almost 5 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 06/21/2019
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 0.5
Release relationship: Auto

Description

Provide an arvados-server "cloudtest" subcommand (lib/cloud/test) that uses the configured credentials (from the cluster config file) to verify that:
  • the selected driver implements the cloud.Driver interface properly (empty and non-empty instance tag sets; no implicit filtering of instances list; Instances() includes the new instance if called immediately after Create() returns success; Destroy() works)
  • the cloud provider accepts the configured credentials
  • resulting VMs accept the configured SSH private key and run commands as root
This has three main uses:
  1. Dev tests when creating/modifying a driver
  2. CI tests
  3. Verify/debug config while creating/updating a real cluster
Specs:
  • By default, use InstanceSetID "cloudtest-$(whoami)@$(hostname)" so a series of aborted/broken runs will recognize any abandoned instances. Accept a command line argument -instance-set-id=string to override.
  • Use the selected driver directly: don't use a worker.Pool, rateLimitingInstanceSet, etc.
  • Start by listing all instances and checking whether any are tagged with the selected InstanceSetID.
    • If so, and a -clear command line flag was given: destroy them, get an updated list, and repeat until they're all gone.
    • If so, and a -clear command line flag was not given: log a message mentioning the "-clear" option, and error out.
  • Create an instance, using a {"CloudTestPID":"$PID","InstanceSetID":"$InstanceSetID"} tag plus any ResourceTags in the cluster config. If an error is returned, log it (and exit non-zero later), but keep going in case an instance was created.
  • Verify that the Tags() on the returned instance match the ones passed to Create().
  • List all instances. If an error is returned, log it but keep going so the test instance (if any) can be destroyed.
  • Verify that the instance list has an instance with the same ID as the one returned from Create(). If not, keep going, but log an error: the instance is supposed to appear in the very next Instances() call after Create() returns.
  • Verify that the instance returned in the list has the same tags.
  • If a new instance was created (either Create() succeeded or Instances() returned an instance with our InstanceSetID):
    • Poll Instances() until the instance has a non-empty Address() or TimeoutBooting expires.
    • Use ssh_executor to run BootProbeCommand on the instance (or "docker ps -q" if that's empty). Retry until it succeeds or TimeoutBooting expires.
    • If a -command value is given, execute it as a shell command on the instance, with stdin/stdout/stderr connected to cloudtest's own stdin/stdout/stderr.
    • If the -pause-before-destroy flag is given, show a sample SSH command line for connecting to the instance, and wait for the user to press Enter before proceeding.
    • Destroy the instance.
    • Poll Instances() until the instance disappears.
  • Exit 0 if everything succeeded, otherwise 1.

If the -quiet flag isn't given, log progress to stdout.

$ arvados-server cloudtest -command 'echo $(hostname) $(date)' -pause-before-destroy
getting instance list
got instance list (N=13)
no instances are tagged with our InstanceSetID (7 instances are not tagged with any InstanceSetID at all)
creating instance with tags map[CloudTestPID:1234, InstanceSetID:cloudtest-ops@4xphq]
created instance with id i-12345abcde
all requested tags are present
getting instance list
got instance list (N=14)
found our instance i-12345abcde in returned list
all requested tags are present
instance has no address
waiting probeInterval 10s
getting instance list
got instance list (N=14)
found our instance i-12345abcde in returned list
instance has no address
waiting probeInterval 10s
getting instance list
got instance list (N=14)
found our instance i-12345abcde in returned list
instance i-12345abcde has addr 10.2.3.4
executing command "docker ps -q" on i-12345abcde addr 10.2.3.4 port 2222
executing command failed (attempt 1): connection refused, output "" 
waiting probeInterval 10s
executing command "docker ps -q" on i-12345abcde addr 10.2.3.4 port 2222
executing command failed (attempt 2): connection refused, output "" 
waiting probeInterval 10s
executing command "docker ps -q" on i-12345abcde addr 10.2.3.4 port 2222
executing command succeeded (attempt 3), output "" 
executing command "echo $(hostname) $(date)" on i-12345abcde addr 10.2.3.4 port 2222
executing command succeeded (attempt 1), output "i-12345abcde.cloud.example Tue Jun 11 11:28:23 EDT 2019\n" 
instance is booted
... you can connect with "ssh -p2222 debian@10.2.3.4" 
... hit Enter when you are finished, and ready to destroy the instance: {pause until user hits Enter}
destroying instance i-12345abcde
destroyed instance i-12345abcde
getting instance list
got instance list (N=14)
found our instance i-12345abcde in returned list
waiting probeInterval 10s
getting instance list
got instance list (N=14)
found our instance i-12345abcde in returned list
waiting probeInterval 10s
getting instance list
got instance list (N=14)
instance i-12345abcde not found in returned list
done
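
For reference, the command-line options named in the specs (-instance-set-id, -clear, -command, -pause-before-destroy, -quiet) and the default InstanceSetID could be wired up along these lines. This is only a sketch using the standard flag package: the options struct, the helper names, and the usage strings are assumptions made for illustration, and the real subcommand may differ.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"os/user"
)

// options holds the command-line settings named in the specs.
// The struct itself is hypothetical.
type options struct {
	instanceSetID      string
	clear              bool
	command            string
	pauseBeforeDestroy bool
	quiet              bool
}

// defaultInstanceSetID builds "cloudtest-<user>@<host>", the equivalent of
// "cloudtest-$(whoami)@$(hostname)".
func defaultInstanceSetID(username, hostname string) string {
	return "cloudtest-" + username + "@" + hostname
}

// parseOptions parses args (without the program name) into options.
func parseOptions(args []string, defaultID string) (options, error) {
	var o options
	fs := flag.NewFlagSet("cloudtest", flag.ContinueOnError)
	fs.StringVar(&o.instanceSetID, "instance-set-id", defaultID, "InstanceSetID tag for test instances")
	fs.BoolVar(&o.clear, "clear", false, "destroy existing instances tagged with our InstanceSetID")
	fs.StringVar(&o.command, "command", "", "shell command to run on the test instance")
	fs.BoolVar(&o.pauseBeforeDestroy, "pause-before-destroy", false, "wait for Enter before destroying the instance")
	fs.BoolVar(&o.quiet, "quiet", false, "do not log progress to stdout")
	err := fs.Parse(args)
	return o, err
}

// checkStale returns an error when n instances already carry our
// InstanceSetID and -clear was not given.
func checkStale(n int, o options) error {
	if n > 0 && !o.clear {
		return fmt.Errorf("%d instance(s) already tagged with %q; use -clear to destroy them", n, o.instanceSetID)
	}
	return nil
}

func main() {
	username, hostname := "unknown", "unknown"
	if u, err := user.Current(); err == nil {
		username = u.Username
	}
	if h, err := os.Hostname(); err == nil {
		hostname = h
	}
	o, err := parseOptions(os.Args[1:], defaultInstanceSetID(username, hostname))
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("%+v\n", o)
}
```

Deriving the default from the user and host names means repeated runs by the same person on the same machine share one InstanceSetID, which is what lets a later run recognize instances abandoned by an earlier aborted one.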

Files

15026-dispatcher.png (64.7 KB) Tom Clegg, 06/27/2019 02:55 PM
15026-cloudtest.png (193 KB) Tom Clegg, 06/27/2019 02:55 PM

Subtasks 1 (0 open, 1 closed)

Task #15389: Review 15026-cloudtest - Resolved - Peter Amstutz - 06/21/2019

Related issues 1 (0 open, 1 closed)

Blocks Arvados - Story #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching - Resolved
