Bug #15695
closed[a-d-c] Long delay before cloud dispatcher starts jobs on playground
0%
Description
This workflow wants 100 parallel jobs running the same code over different date.
There are two separate runs shown in the Prometheus graph below:
https://prometheus.curoverse.com/consoles/qr1hi/index.html#pctc%7B%22duration%22%3A7200%2C%22endTime%22%3A1570223940%7D
The timeline is (all times UTC):
19:24 First run submitted with requirements for 100 x 4 core nodes - https://workbench.qr1hi.arvadosapi.com/container_requests/qr1hi-xvhdp-ctw1t6m8z718emc
19:41 17 nodes with 4 cores each started
19:57 Workflow canceled
19:57 75 nodes idle
19:58 Second run submitted with edited run time requirements for 100 x 2 core nodes - https://workbench.qr1hi.arvadosapi.com/container_requests/qr1hi-xvhdp-2ry6g3l031wlygu
20:00 71 nodes idle from 1st cancelled workflow
20:03 1 node busy, 0 nodes idle
20:26 1st node child container started
20:31 2nd node for child container start
20:36 39 nodes booting for child containers
20:40 another 22 nodes start booting
20:46 final 7 nodes start booting
20:55 All 100 containers finally running
Files