Bug #13049
some containers fail "While loading container image: While loading container image into Docker: error during connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.21/images/load?quiet=1: unexpected EOF"
Status: Closed
Description
Most containers are working, but sometimes I get one that ends up cancelled, with logs like this:
2018-02-07T19:53:39.491333390Z crunch-run 0.1.20180111190404.453f922 started
2018-02-07T19:53:39.491354829Z Executing container 'ncucu-dz642-bj25f37seau10y4'
2018-02-07T19:53:39.491369694Z Executing on host 'arvados-compute-node-ncucu-031'
2018-02-07T19:53:39.506507181Z Fetching Docker image from collection 'b45ca55d372ee1d42591e5bacb15d75b+342'
2018-02-07T19:53:39.525928919Z Using Docker image id 'sha256:ccb085c0eee522d2ab0a2bd0069345c477419d8a467ac0b132c7b2f1a6b6e370'
2018-02-07T19:53:39.526527753Z Loading Docker image from keep
2018-02-07T19:54:39.089300367Z While loading container image: While loading container image into Docker: error during connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.21/images/load?quiet=1: unexpected EOF
2018-02-07T19:54:39.089329926Z Cancelled
2018-02-07T19:54:39.617211875Z crunch-run finished
Updated by Joshua Randall about 8 years ago
This is repeatedly happening, with different containers on different compute nodes (although when it happens, it seems to happen to a bunch of containers attempting to run on the same compute node). Later on, the same compute node starts working fine. Is it possible there is some kind of race condition when attempting to load the docker image from keep multiple times at the same time on a given host?
For example, another one:
2018-02-08T13:59:22.206971855Z arvados.cwl-runner INFO: [container verify_bam_id_601] ncucu-xvhdp-p09pfnk8uryjqth is Final
2018-02-08T13:59:22.286580590Z arvados.cwl-runner INFO: [container verify_bam_id_601] error log:
2018-02-08T13:59:22.286580590Z arvados.cwl-runner INFO:
2018-02-08T13:59:22.286580590Z 2018-02-08T13:55:19.657227305Z crunch-run crunch-run 0.1.20180111190404.453f922 started
2018-02-08T13:59:22.286580590Z 2018-02-08T13:55:19.657257186Z crunch-run Executing container 'ncucu-dz642-rubsej34h38adxe'
2018-02-08T13:59:22.286580590Z 2018-02-08T13:55:19.657286715Z crunch-run Executing on host 'arvados-compute-node-ncucu-026'
2018-02-08T13:59:22.286580590Z 2018-02-08T13:55:19.704063858Z crunch-run Fetching Docker image from collection 'b3e6ee000c7c211025b7257bc77dd26a+383'
2018-02-08T13:59:22.286580590Z 2018-02-08T13:55:19.805692310Z crunch-run Using Docker image id 'sha256:1b43a6bc26fee99abea3032ed7da80847a121845f4fc67fd3e42730e4c7536f7'
2018-02-08T13:59:22.286580590Z 2018-02-08T13:55:19.806424119Z crunch-run Loading Docker image from keep
2018-02-08T13:59:22.286580590Z 2018-02-08T13:59:18.792669583Z crunch-run While loading container image: While loading container image into Docker: error during connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.21/images/load?quiet=1: unexpected EOF
2018-02-08T13:59:22.286580590Z 2018-02-08T13:59:18.792713907Z crunch-run Cancelled
In this example there was a ~4 minute delay between the "Loading Docker image from keep" log entry (13:55:19) and the "unexpected EOF" error (13:59:18).
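If the race hypothesis above holds, one mitigation would be to serialize image loads across the crunch-run processes on a host (each container is handled by its own process, so an in-process mutex would not be enough). Here is a hypothetical sketch using flock(2); the lock path and function names are illustrative, not crunch-run's actual behavior:

package main

import (
	"os"
	"syscall"
)

// withImageLoadLock runs f while holding an exclusive host-wide file
// lock, so concurrent crunch-run processes load images one at a time.
func withImageLoadLock(f func() error) error {
	lock, err := os.OpenFile("/var/lock/crunch-run-docker-load.lock", os.O_CREATE|os.O_RDWR, 0666)
	if err != nil {
		return err
	}
	defer lock.Close()
	// LOCK_EX blocks until no other process holds the lock.
	if err := syscall.Flock(int(lock.Fd()), syscall.LOCK_EX); err != nil {
		return err
	}
	defer syscall.Flock(int(lock.Fd()), syscall.LOCK_UN)
	return f() // e.g. fetch the tar from Keep and POST it to /images/load
}

func main() {
	_ = withImageLoadLock(func() error { return nil })
}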
Updated by Joshua Randall about 8 years ago
Looks like the code path that cancels the container is being triggered here: https://github.com/curoverse/arvados/blob/5f6738f134e1abb359fc8e8c61751263db88d4b6/services/crunch-run/crunchrun.go#L1625
When in fact the issue is, in this case, transient.
Updated by Tom Clegg about 8 years ago
The dockerclient library annotates the http stdlib's error with "error during connect" because it doesn't match any of the network error types it knows about. In this case I think it's guessing wrong.
The "unexpected EOF" error probably comes from (*body)readLocked() here: https://golang.org/src/net/http/transfer.go
The (unexpected-by-http-client) EOF is probably coming from ManifestFileReader when it fails to read a Keep block.
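A minimal sketch of that distinction, using modern Go error wrapping (errors.Is/errors.As postdate this report); the classify function is illustrative, not the dockerclient's actual logic:

package main

import (
	"errors"
	"fmt"
	"io"
	"net"
)

// classify guesses whether an error from the /images/load POST is a
// real connection problem or a request body that ended early.
func classify(err error) string {
	var netErr net.Error
	if errors.As(err, &netErr) {
		return "network error (a genuine problem reaching the Docker socket)"
	}
	if errors.Is(err, io.ErrUnexpectedEOF) {
		// net/http reports io.ErrUnexpectedEOF when a body ends before
		// its declared length, which here happens when ManifestFileReader
		// fails partway through streaming a Keep block.
		return "truncated body (likely a failed Keep read, not a Docker problem)"
	}
	return "other"
}

func main() {
	fmt.Println(classify(io.ErrUnexpectedEOF))
}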
Some things to investigate further:
- (on the cluster) Keepstores might be unhealthy
- ManifestFileReader might deserve better retry (ideally by adding retry to collectionfs from #12483, and using that)
- crunch-run should not cancel the container if image loading fails due to a transient Keep error (even if it's not a "broken node" sign); see the retry sketch after this list
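A hedged sketch of the third bullet's retry idea: attempt the load step a few times with backoff before giving up, rather than cancelling the container on the first transient failure. loadImage is a hypothetical stand-in for crunch-run's real load step (read the tar from Keep, POST it to /images/load):

package main

import (
	"fmt"
	"time"
)

// loadImage is a placeholder for the real image-load step.
func loadImage() error {
	return fmt.Errorf("unexpected EOF")
}

// loadImageWithRetry retries transient failures with exponential
// backoff instead of treating the first error as fatal.
func loadImageWithRetry(attempts int, backoff time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = loadImage(); err == nil {
			return nil
		}
		time.Sleep(backoff)
		backoff *= 2 // double the wait between attempts
	}
	return fmt.Errorf("image load failed after %d attempts: %v", attempts, err)
}

func main() {
	fmt.Println(loadImageWithRetry(3, time.Second))
}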
Updated by Joshua Randall about 8 years ago
We did have an issue with a keep_service entry that didn't really exist; that has now been fixed, but the issue is still occurring (although possibly less often).
It does seem possible the issue is related to Keep reads taking a very long time: after the most recent failure I tried downloading the same Docker image to a shell node, and it took over 4 minutes to fetch the <700 MiB file; judging by the progress meter, the first block alone took over a minute.
2018-02-08T17:15:29.708484953Z crunch-run 0.1.20180111190404.453f922 started
2018-02-08T17:15:29.708515081Z Executing container 'ncucu-dz642-laypo47jc5utyp6'
2018-02-08T17:15:29.708545306Z Executing on host 'arvados-compute-node-ncucu-006'
2018-02-08T17:15:29.725152412Z Fetching Docker image from collection 'd66406ae1d6b0551690f6616b4171cd7+552'
2018-02-08T17:15:29.735717535Z Using Docker image id 'sha256:7984dda02f275b1144f4ec9dc601ea3ca5bbc174ee7eb50326ca3a808c0f76a5'
2018-02-08T17:15:29.736826369Z Loading Docker image from keep
2018-02-08T17:16:16.879325054Z While loading container image: While loading container image into Docker: error during connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.21/images/load?quiet=1: unexpected EOF
2018-02-08T17:16:16.879361346Z Cancelled
2018-02-08T17:16:18.501443268Z crunch-run finished
mercury@arvados-shell-ncucu:~$ arv keep ls d66406ae1d6b0551690f6616b4171cd7+552
./sha256:7984dda02f275b1144f4ec9dc601ea3ca5bbc174ee7eb50326ca3a808c0f76a5.tar
mercury@arvados-shell-ncucu:~$ time arv keep get -v d66406ae1d6b0551690f6616b4171cd7+552/sha256:7984dda02f275b1144f4ec9dc601ea3ca5bbc174ee7eb50326ca3a808c0f76a5.tar sha256:7984dda02f275b1144f4ec9dc601ea3ca5bbc174ee7eb50326ca3a808c0f76a5.tar
2018-02-08 17:39:16 arvados.arv-get[13205] INFO: X-Request-Id: req-49rg54awvyfuhjl1piye
679 MiB / 679 MiB 100.0%

real	4m20.723s
user	0m2.476s
sys	0m1.516s