Bug #13049

Some containers fail while loading container image: "While loading container image into Docker: error during connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.21/images/load?quiet=1: unexpected EOF"

Added by Joshua Randall about 8 years ago. Updated about 6 years ago.

Status:
Closed
Priority:
Normal
Assigned To:
-
Category:
Crunch
Target version:
-
Story points:
-

Description

Most containers are working, but sometimes I get one that ends up cancelled, with logs like this:

2018-02-07T19:53:39.491333390Z crunch-run 0.1.20180111190404.453f922 started 
2018-02-07T19:53:39.491354829Z Executing container 'ncucu-dz642-bj25f37seau10y4' 
2018-02-07T19:53:39.491369694Z Executing on host 'arvados-compute-node-ncucu-031' 
2018-02-07T19:53:39.506507181Z Fetching Docker image from collection 'b45ca55d372ee1d42591e5bacb15d75b+342' 
2018-02-07T19:53:39.525928919Z Using Docker image id 'sha256:ccb085c0eee522d2ab0a2bd0069345c477419d8a467ac0b132c7b2f1a6b6e370' 
2018-02-07T19:53:39.526527753Z Loading Docker image from keep 
2018-02-07T19:54:39.089300367Z While loading container image: While loading container image into Docker: error during connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.21/images/load?quiet=1: unexpected EOF 
2018-02-07T19:54:39.089329926Z Cancelled 
2018-02-07T19:54:39.617211875Z crunch-run finished 

Actions #1

Updated by Joshua Randall about 8 years ago

This is happening repeatedly, with different containers on different compute nodes (although when it happens, it seems to hit a bunch of containers attempting to run on the same compute node; later on, the same compute node starts working fine). Is it possible there is some kind of race condition when the same Docker image is loaded from Keep multiple times concurrently on a given host?

For example, another one:

2018-02-08T13:59:22.206971855Z arvados.cwl-runner INFO: [container verify_bam_id_601] ncucu-xvhdp-p09pfnk8uryjqth is Final 
2018-02-08T13:59:22.286580590Z arvados.cwl-runner INFO: [container verify_bam_id_601] error log: 
2018-02-08T13:55:19.657227305Z crunch-run crunch-run 0.1.20180111190404.453f922 started 
2018-02-08T13:55:19.657257186Z crunch-run Executing container 'ncucu-dz642-rubsej34h38adxe' 
2018-02-08T13:55:19.657286715Z crunch-run Executing on host 'arvados-compute-node-ncucu-026' 
2018-02-08T13:55:19.704063858Z crunch-run Fetching Docker image from collection 'b3e6ee000c7c211025b7257bc77dd26a+383' 
2018-02-08T13:55:19.805692310Z crunch-run Using Docker image id 'sha256:1b43a6bc26fee99abea3032ed7da80847a121845f4fc67fd3e42730e4c7536f7' 
2018-02-08T13:55:19.806424119Z crunch-run Loading Docker image from keep 
2018-02-08T13:59:18.792669583Z crunch-run While loading container image: While loading container image into Docker: error during connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.21/images/load?quiet=1: unexpected EOF 
2018-02-08T13:59:18.792713907Z crunch-run Cancelled

In this example there was a ~4 minute delay between "Loading Docker image from keep" (13:55:19) and the "unexpected EOF" error (13:59:18).

Actions #2

Updated by Joshua Randall about 8 years ago

Looks like the code path that cancels the container is being triggered here: https://github.com/curoverse/arvados/blob/5f6738f134e1abb359fc8e8c61751263db88d4b6/services/crunch-run/crunchrun.go#L1625
That code path treats the failure as fatal, when in fact in this case it is transient.

Actions #3

Updated by Tom Clegg about 8 years ago

The dockerclient library annotates the stdlib http error with "error during connect" because it doesn't match any of the network error types it knows about. In this case I think it's guessing wrong.

The "unexpected EOF" error probably comes from (*body).readLocked() here: https://golang.org/src/net/http/transfer.go

The (unexpected-by-http-client) EOF is probably coming from ManifestFileReader when it fails to read a Keep block.

Some things to investigate further:
  • (on the cluster) Keepstores might be unhealthy
  • ManifestFileReader might deserve better retry (ideally by adding retry to collectionfs from #12483, and using that)
  • crunch-run should not cancel the container if image loading fails due to a transient Keep error (even if it's not a "broken node" sign)
Actions #4

Updated by Joshua Randall about 8 years ago

We did have an issue with a keep_service entry that didn't really exist, but that has now been fixed and the issue is still occurring (although possibly less often).

It does seem possible the issue is related to Keep reads taking a very long time: after the most recent failure I tried downloading the same Docker image to a shell node, and the <700 MiB file took over 4 minutes; judging by the progress meter, the first block alone took over a minute.

2018-02-08T17:15:29.708484953Z crunch-run 0.1.20180111190404.453f922 started 
2018-02-08T17:15:29.708515081Z Executing container 'ncucu-dz642-laypo47jc5utyp6' 
2018-02-08T17:15:29.708545306Z Executing on host 'arvados-compute-node-ncucu-006' 
2018-02-08T17:15:29.725152412Z Fetching Docker image from collection 'd66406ae1d6b0551690f6616b4171cd7+552' 
2018-02-08T17:15:29.735717535Z Using Docker image id 'sha256:7984dda02f275b1144f4ec9dc601ea3ca5bbc174ee7eb50326ca3a808c0f76a5' 
2018-02-08T17:15:29.736826369Z Loading Docker image from keep 
2018-02-08T17:16:16.879325054Z While loading container image: While loading container image into Docker: error during connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.21/images/load?quiet=1: unexpected EOF 
2018-02-08T17:16:16.879361346Z Cancelled 
2018-02-08T17:16:18.501443268Z crunch-run finished 
mercury@arvados-shell-ncucu:~$ arv keep ls d66406ae1d6b0551690f6616b4171cd7+552
./sha256:7984dda02f275b1144f4ec9dc601ea3ca5bbc174ee7eb50326ca3a808c0f76a5.tar
mercury@arvados-shell-ncucu:~$ time arv keep get -v d66406ae1d6b0551690f6616b4171cd7+552/sha256:7984dda02f275b1144f4ec9dc601ea3ca5bbc174ee7eb50326ca3a808c0f76a5.tar sha256:7984dda02f275b1144f4ec9dc601ea3ca5bbc174ee7eb50326ca3a808c0f76a5.tar
2018-02-08 17:39:16 arvados.arv-get[13205] INFO: X-Request-Id: req-49rg54awvyfuhjl1piye
679 MiB / 679 MiB 100.0%

real    4m20.723s
user    0m2.476s
sys     0m1.516s
Actions #5

Updated by Peter Amstutz about 6 years ago

  • Status changed from New to Closed