Bug #15695: [a-d-c] Long delay before cloud dispatcher starts jobs on playground - Arvados

Actions

Copy link

Bug #15695

closed

[a-d-c] Long delay before cloud dispatcher starts jobs on playground

Added by Tom Morris over 6 years ago. Updated over 6 years ago.

Status:

Closed

Priority:

Normal

Assigned To:

Category:

Target version:

2020-01-02 Sprint

Story points:

Description

This workflow wants 100 parallel jobs running the same code over different date.

There are two separate runs shown in the Prometheus graph below:
https://prometheus.curoverse.com/consoles/qr1hi/index.html#pctc%7B%22duration%22%3A7200%2C%22endTime%22%3A1570223940%7D

The timeline is (all times UTC):
19:24 First run submitted with requirements for 100 x 4 core nodes - https://workbench.qr1hi.arvadosapi.com/container_requests/qr1hi-xvhdp-ctw1t6m8z718emc
19:41 17 nodes with 4 cores each started
19:57 Workflow canceled
19:57 75 nodes idle
19:58 Second run submitted with edited run time requirements for 100 x 2 core nodes - https://workbench.qr1hi.arvadosapi.com/container_requests/qr1hi-xvhdp-2ry6g3l031wlygu
20:00 71 nodes idle from 1st cancelled workflow
20:03 1 node busy, 0 nodes idle
20:26 1st node child container started
20:31 2nd node for child container start
20:36 39 nodes booting for child containers
20:40 another 22 nodes start booting
20:46 final 7 nodes start booting
20:55 All 100 containers finally running

Files

Download all files

15695-queue.png (57.7 KB) 15695-queue.png		Tom Clegg, 10/07/2019 06:08 PM
15695-nodes.png (22.5 KB) 15695-nodes.png		Tom Clegg, 10/07/2019 06:11 PM

Actions

Copy link

Updated by Tom Clegg over 6 years ago

File 15695-queue.png 15695-queue.png added

Actions

Copy link

Updated by Tom Clegg over 6 years ago

File 15695-nodes.png added

Actions

Copy link

Updated by Tom Clegg over 6 years ago

File deleted (~~15695-nodes.png~~)

Actions

Copy link

Updated by Tom Clegg over 6 years ago

File 15695-nodes.png 15695-nodes.png added

Actions

Copy link

Updated by Tom Morris over 6 years ago

The 26 queued containers are some secondary artifact left over from the first cancelled run.

Not sure it adds any additional info, but a run submitted over the weekend showed a normal startup profile with 120 nodes up and running within 15 minutes:
https://prometheus.curoverse.com/consoles/qr1hi/index.html#pctc%7B%22duration%22%3A21600%2C%22endTime%22%3A1570445340%7D
https://workbench.qr1hi.arvadosapi.com/container_requests/qr1hi-xvhdp-o40xgjibykcsqmk (it failed later for an unrelated reason)

Actions

Copy link

Updated by Tom Clegg over 6 years ago

During the dead zone (20:00 to 20:36) containers are locked but no nodes are booting. It seems like arvados-dispatch-cloud isn't able to start new nodes. Errors would have been logged, but logs from that time are already deleted.

Config has MaxCloudOpsPerSecond=1. That seems low, but not low enough to cause starvation.

Actions

Copy link

Updated by Ward Vandewege over 6 years ago

Subject changed from Long delay before cloud dispatcher starts jobs on playground to [a-d-c] Long delay before cloud dispatcher starts jobs on playground

Actions

Copy link

Updated by Tom Morris over 6 years ago

Assigned To set to Tom Morris
Target version set to 2019-10-23 Sprint

I'll attempt to reproduce this next week when folks are available to look at the logs before they get deleted.

Actions

Copy link

Updated by Tom Morris over 6 years ago

Target version changed from 2019-10-23 Sprint to 2019-11-06 Sprint

Actions

Copy link

#10

Updated by Tom Morris over 6 years ago

Target version changed from 2019-11-06 Sprint to 2019-11-20 Sprint

Actions

Copy link

#11

Updated by Tom Morris over 6 years ago

Target version changed from 2019-11-20 Sprint to 2019-12-04 Sprint

Actions

Copy link

#12

Updated by Tom Morris over 6 years ago

Target version changed from 2019-12-04 Sprint to 2020-01-02 Sprint

Actions

Copy link

#13

Updated by Tom Clegg over 6 years ago

Assigned To deleted (~~Tom Morris~~)
Status changed from New to Closed

Problem still unknown, but hasn't happened (or hasn't been noticed) since first reported.

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Arvados

Custom queries

Bug #15695

[a-d-c] Long delay before cloud dispatcher starts jobs on playground

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Morris over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Ward Vandewege over 6 years ago

Updated by Tom Morris over 6 years ago

Updated by Tom Morris over 6 years ago

Updated by Tom Morris over 6 years ago

Updated by Tom Morris over 6 years ago

Updated by Tom Morris over 6 years ago

Updated by Tom Clegg over 6 years ago