Bug #19437
closed
[crunch-run] Require >1 watchdog errors before giving up and killing docker container
Added by Peter Amstutz over 2 years ago.
Updated over 2 years ago.
Estimated time:
(Total: 0.00 h)
Release relationship:
Auto
Description
Observed on a customer cluster. This seems to have failed multiple times but eventually succeeded (the container appears to have run to completion and was only canceled at the very end).
2022-08-31T00:00:01.820945772Z Creating Docker container
2022-08-31T00:00:09.932234553Z Starting container
2022-08-31T00:00:10.896745626Z Waiting for container to finish
2022-08-31T02:25:10.898243240Z Error inspecting container: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.21/containers/230188325e24f42d3ad8dfd8ceef5c7069733bacdaafe7adaf5bf5a3c4c644f5/json": context deadline exceeded
2022-08-31T02:25:10.898483541Z error in Run: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.21/containers/230188325e24f42d3ad8dfd8ceef5c7069733bacdaafe7adaf5bf5a3c4c644f5/json": context deadline exceeded
2022-08-31T02:38:12.612609772Z copying "/temp.txt" (0 bytes)
2022-08-31T02:38:13.468649279Z Cancelled
- Description updated (diff)
- Subject changed from Error inspecting container to Error inspecting container: context deadline exceeded
- Description updated (diff)
- Target version changed from 2022-08-31 sprint to 2022-09-14 sprint
This means ContainerInspect took >1 minute, and (according to dockerclient.ContainerWait) the container hasn't finished, which we take to mean that docker has died / become unresponsive.
Whether or not the docker daemon is in fact dead/unresponsive in this case, it would be more convincing (and no less robust wrt avoiding the waiting-forever problem the watchdog solves) if we just log a warning on a single ContainerInspect failure/timeout, and error out only after two consecutive failures.
Tom Clegg wrote in #note-5:
This means ContainerInspect took >1 minute, and (according to dockerclient.ContainerWait) the container hasn't finished, which we take to mean that docker has died / become unresponsive.
Whether or not the docker daemon is in fact dead/unresponsive in this case, it would be more convincing (and no less robust wrt avoiding the waiting-forever problem the watchdog solves) if we just log a warning on a single ContainerInspect failure/timeout, and error out only after two consecutive failures.
It seems likely that the Docker daemon is in the throes of tearing down the container and either there's an edge case it can fall into where the Inspect request gets dropped, or it really just takes 1+ minute to shut down some containers.
I think it would be a good idea to count 2 or 3 consecutive failures before giving up.
- Subject changed from Error inspecting container: context deadline exceeded to [crunch-run] Require >1 watchdog errors before giving up and killing docker container
- Assigned To set to Tom Clegg
- Status changed from New to In Progress
- Target version changed from 2022-09-14 sprint to 2022-09-28 sprint
- Status changed from In Progress to Resolved