Bug #16132
closedce8i5 test timeouts
0%
Description
The run-cwl-test-ce8i5 job runs the CWL test suite against the test cluster ce8i5.
Tests have been failing with timeouts. There is a 900s (15m) limit on tests, so that means it is taking longer than that to either bring up a node or schedule a container.
It is unclear if the issue is Azure delay, arvados-dispatch-cloud delay, some interaction between the two (waiting for something it shouldn't), stalled ssh connections, or what.
One way to begin diagnosing this would be to start collecting prometheus metrics, and see if we can capture the delay on a graph.
Updated by Peter Amstutz almost 5 years ago
- Status changed from New to In Progress
Updated by Peter Amstutz almost 5 years ago
- Status changed from In Progress to New
Updated by Peter Amstutz almost 5 years ago
ce8i5 has a really tiny quota - 10 cores. The containers that are timing out seem to be hitting the quota, that prevents starting a new compute node to run the container.
We need a way to communicate back to the user when the cluster limited by quota.