Bug #19437
[crunch-run] Require >1 watchdog errors before giving up and killing docker container
Status: closed (100% done)
Description
Observed on a customer cluster. This seems to have failed multiple times but eventually succeeded (it seems to have run to completion and was only canceled at the very end).
2022-08-31T00:00:01.820945772Z Creating Docker container
2022-08-31T00:00:09.932234553Z Starting container
2022-08-31T00:00:10.896745626Z Waiting for container to finish
2022-08-31T02:25:10.898243240Z Error inspecting container: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.21/containers/230188325e24f42d3ad8dfd8ceef5c7069733bacdaafe7adaf5bf5a3c4c644f5/json": context deadline exceeded
2022-08-31T02:25:10.898483541Z error in Run: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.21/containers/230188325e24f42d3ad8dfd8ceef5c7069733bacdaafe7adaf5bf5a3c4c644f5/json": context deadline exceeded
2022-08-31T02:38:12.612609772Z copying "/temp.txt" (0 bytes)
2022-08-31T02:38:13.468649279Z Cancelled
Updated by Peter Amstutz over 2 years ago
- Subject changed from Error inspecting container to Error inspecting container: context deadline exceeded
Updated by Peter Amstutz over 2 years ago
- Target version changed from 2022-08-31 sprint to 2022-09-14 sprint
Updated by Tom Clegg over 2 years ago
This means ContainerInspect took >1 minute, and (according to dockerclient.ContainerWait) the container hasn't finished, which we take to mean that docker has died / become unresponsive.
Whether or not the docker daemon is in fact dead/unresponsive in this case, it would be more convincing (and no less robust wrt avoiding the waiting-forever problem the watchdog solves) if we just log a warning on a single ContainerInspect failure/timeout, and error out only after two consecutive failures.
Updated by Peter Amstutz over 2 years ago
Tom Clegg wrote in #note-5:
This means ContainerInspect took >1 minute, and (according to dockerclient.ContainerWait) the container hasn't finished, which we take to mean that docker has died / become unresponsive.
Whether or not the docker daemon is in fact dead/unresponsive in this case, it would be more convincing (and no less robust wrt avoiding the waiting-forever problem the watchdog solves) if we just log a warning on a single ContainerInspect failure/timeout, and error out only after two consecutive failures.
It seems likely that the Docker daemon is in the throes of tearing down the container and either there's an edge case it can fall into where the Inspect request gets dropped, or it really just takes 1+ minute to shut down some containers.
I think it would be a good idea to count 2 or 3 consecutive failures before giving up.
Updated by Tom Clegg over 2 years ago
- Subject changed from Error inspecting container: context deadline exceeded to [crunch-run] Require >1 watchdog errors before giving up and killing docker container
- Assigned To set to Tom Clegg
Updated by Tom Clegg over 2 years ago
- Status changed from New to In Progress
19437-docker-watchdog @ 9c254acbd78ed50e1e9fec508fb9ec4164867dda -- developer-run-tests: #3282
Updated by Tom Clegg over 2 years ago
- Target version changed from 2022-09-14 sprint to 2022-09-28 sprint
Updated by Tom Clegg over 2 years ago
cherry-picked to 2.4-staging as 8ad66154df528ad2020e80bc255896537f1c712a
Updated by Tom Clegg over 2 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|ef25b3d6bef2288c1aaf99f6bff68b0d9d05ef89.