Bug #8786
[FUSE] arv-mount "Resolving timed out" failure leads to job failure
Status: closed
Description
arv-mount failure: "Resolving timed out after ... milliseconds"
Over the past few days, jobs are occasionally dying with error messages like the above.
An example of the relevant line from crunch-job log output is:
2016-03-24_03:08:17 z8ta6-8i9sb-ee2iki5xb0d7nxz 49547 289 stderr ERROR: could not read from compressed VCF: failed to read 7de4a43c3be89d867ef4220147dfc171+67108864+A4e9aaafb69a92591a1edd1dc484d71bd4b4ef9e7@5705ceca: service http://humgen-02-02.internal.sanger.ac.uk:25107/ responded with 0 (28, 'Resolving timed out after 2525 milliseconds'); service http://humgen-04-02.internal.sanger.ac.uk:25107/ responded with 0 (28, 'Resolving timed out after 2511 milliseconds'); service http://humgen-04-01.internal.sanger.ac.uk:25107/ responded with 0 (28, 'Resolving timed out after 2512 milliseconds'); service http://humgen-04-03.internal.sanger.ac.uk:25107/ responded with 404 HTTP/1.1 404 Not Found\015
2016-03-24_03:08:17 z8ta6-8i9sb-ee2iki5xb0d7nxz 49547 289 stderr ; service http://humgen-02-01.internal.sanger.ac.uk:25107/ responded with 0 (28, 'Resolving timed out after 2512
The "Resolving timed out" message seems to originate from libcurl (via pycurl). We aren't aware of any DNS problems but systems is looking into whether there were any network issues during the time periods when these errors occurred.
In any case, it does not seem like ideal behaviour for arv-mount to give up when the name server is temporarily unavailable, for whatever reason. I would have expected it to catch this situation and retry.
Updated by Joshua Randall almost 10 years ago
- Subject changed from arv-mount "Resolving timed out" failure cause job failure to arv-mount "Resolving timed out" failure leads to job failure
Updated by Joshua Randall almost 10 years ago
This seems to be due to CURLOPT_CONNECTTIMEOUT_MS, which is set at https://github.com/curoverse/arvados/blob/83172cf795687bd4f618d2c673be8fb30ca840df/sdk/python/arvados/keep.py#L521
Unfortunately, it looks like if you specify a timeout, libcurl offers no way to tell that the connect failure was due to name resolution, because it simply gave up when the specified timeout expired: https://curl.haxx.se/mail/tracker-2012-11/0030.html
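For illustration, here is a minimal pycurl sketch (the keepstore URL is hypothetical) showing why the two cases are hard to distinguish: once CONNECTTIMEOUT_MS is set, a stalled DNS lookup and a slow TCP connect both raise the same error code 28 (E_OPERATION_TIMEDOUT):

    import pycurl

    c = pycurl.Curl()
    # Hypothetical keepstore URL, for illustration only.
    c.setopt(pycurl.URL, "http://keep0.example.internal:25107/")
    # The same option keep.py sets: the entire connect phase, including
    # DNS resolution, must finish within this many milliseconds.
    c.setopt(pycurl.CONNECTTIMEOUT_MS, 2500)
    try:
        c.perform()
    except pycurl.error as exc:
        errno, message = exc.args
        # errno 28 (pycurl.E_OPERATION_TIMEDOUT) comes back whether DNS
        # resolution stalled or the TCP connect was slow; the "Resolving
        # timed out after ... milliseconds" text is the only clue.
        print(errno, message)
    finally:
        c.close()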
Although, actually, if the keep servers had connection timeouts rather than "actual" failures, wouldn't it be prudent to try them again with longer timeouts?
Updated by Joshua Randall almost 10 years ago
It looks like the actual "cause" of this bug is that we have saturated the 20Gbps link between our compute-only nodes and the keep+compute nodes.
We will look into tuning and/or upgrading the network to avoid keep transfers completely saturating the link, but I think it is still a good idea to handle this particular failure mode in arv-mount in the interest of being robust to transient DNS failures.
Updated by Brett Smith almost 10 years ago
- Subject changed from arv-mount "Resolving timed out" failure leads to job failure to [FUSE] arv-mount "Resolving timed out" failure leads to job failure
Josh,
We do retry these failures. The higher-level KeepClient class keeps retrying until a request succeeds, all services return a permanent failure (e.g., HTTP 404), or the number of retries is exhausted. See https://github.com/curoverse/arvados/blob/83172cf795687bd4f618d2c673be8fb30ca840df/sdk/python/arvados/keep.py#L936-L957
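A minimal sketch of that pattern (the names and exception types here are stand-ins, not the actual KeepClient code):

    # Stand-ins for the real SDK's error classification.
    class TemporaryError(Exception): pass   # e.g. connect timeout
    class PermanentError(Exception): pass   # e.g. HTTP 404

    def get_with_retries(services, locator, num_retries=3):
        """Ask the remaining services until one succeeds, every service
        has failed permanently, or the retry budget runs out."""
        for _ in range(num_retries + 1):
            still_worth_trying = []
            for svc in services:
                try:
                    return svc.get(locator)
                except PermanentError:
                    continue                      # never ask this one again
                except TemporaryError:
                    still_worth_trying.append(svc)
            if not still_worth_trying:
                break                             # all failures were permanent
            services = still_worth_trying
        raise IOError("failed to read %s from any service" % locator)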
However, the default number of retries is three, and there isn't much delay between requests. If your network is saturated, it's very believable that the default retries could be exhausted.
We recently had some discussion about improving our default retry behavior in a variety of ways. See #8539, #8774, #8148.
In the meantime, if you want a quick improvement, you could pass the --retries=N switch where crunch-job builds the arv-mount command line. Note that this affects each individual Keep request. The client uses exponential backoff between retries.
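To give a feel for the timing, here is one plausible backoff schedule (the base, cap, and jitter are assumptions for illustration, not the SDK's actual constants):

    import random

    def backoff_delays(num_retries, base=2.0, cap=60.0):
        """Yield an exponentially growing delay before each retry, with
        a little jitter so many clients don't all retry in lockstep."""
        for attempt in range(num_retries):
            delay = min(cap, base * (2 ** attempt))
            yield delay + random.uniform(0, delay / 10)

    # With --retries=10 the later waits approach the cap:
    for attempt, delay in enumerate(backoff_delays(10), start=1):
        print("retry %2d after %5.1fs" % (attempt, delay))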
Updated by Joshua Randall almost 10 years ago
A closer inspection of our logs shows that when jobs died from this error, it was because they had failed three times in a row; the effective job failure rate was therefore lower than the raw error rate (other logs show 1-2 failures followed by success).
I agree this could be rolled into the general "improve retry behaviour" stories, ideally detecting DNS resolution failure as a specific type of connect failure that is clearly not the job's fault, probably warranting a different (i.e. much more persistent) retry policy.
I'm not sure where to add the --retries=N, but I'll try to figure it out as soon as we have some downtime - I guess with exponential backoff, something like --retries=10 would probably be reasonable?
Updated by Brett Smith almost 10 years ago
Joshua Randall wrote:
I agree this could be rolled into the general "improve retry behaviour" stories, ideally detecting DNS resolution failure as a specific type of connect failure that is clearly not the job's fault, probably warranting a different (i.e. much more persistent) retry policy.
One thing we've talked about in #8774 is that the retry policy needs to take into account both interactive use and running jobs. If DNS lookup fails because your DNS is misconfigured, for example, it would be better for an interactive arv-mount to report that and fail immediately, so the user can fix it, rather than retrying for a long time.
Basically, we think we want the number of retries to be roughly commensurate with the amount of effort that has already gone into the work, whether that's a quick CLI call, a long-running arv-put, or a job on step 6/6 of a day-long pipeline. If you have more general thoughts, it'd be great to hear them on #8774.
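As a purely illustrative sketch of that idea (the thresholds and names below are made up, not anything decided on #8774), a retry budget might scale with the work already invested:

    def retry_budget(elapsed_seconds, interactive):
        """Hypothetical policy: fail fast for interactive use, but fight
        harder to keep long-running work alive."""
        if interactive:
            return 0                 # report the problem immediately
        if elapsed_seconds < 60:
            return 3                 # cheap to restart: default retries
        if elapsed_seconds < 3600:
            return 10
        return 20                    # hours of work at stake: be persistent

    print(retry_budget(5, interactive=True))       # 0
    print(retry_budget(86400, interactive=False))  # 20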
I'm not sure where to add the --retries=N, but I'll try to figure it out as soon as we have some downtime - I guess with exponential backoff, something like --retries=10 would probably be reasonable?
The downside is that it's a source patch to crunch-job, on the line $command .= "&& exec arv-mount --read-write ..."; you can add it there. On the plus side, you should be able to make that change safely while jobs are running: any jobs already running will continue with the copy of crunch-job they have in RAM, while new ones will immediately pick up the additional retries. 10 sounds reasonable to me.