Bug #8868: reliability of crunch jobs with saturated network links

Added by Joshua Randall almost 10 years ago. Updated about 6 years ago.

Status:
Closed
Priority:
Normal
Assigned To:
-
Category:
Documentation
Target version:
-
Story points:
-

Description

We have struggled for the past few weeks with a very I/O-intensive set of crunch jobs causing what basically amounts to a DoS attack on core system services: the keep traffic generated by the jobs was saturating the 20Gbps link between the keep nodes and one set of compute nodes. The result was a variety of job failures on the nodes on the far side of the contended link.

I have other tickets open for some of the underlying problems that Arvados could address directly (this basically amounts to better retry behaviour).

However, I think it is also worth documenting the best practices that avoid these problems as much as possible. Below I list the issues we ran into, along with what appear to be successful mitigation strategies we are now employing.

In our systems, the issues I believe were attributable to network link congestion were:

1. arv-mount DNS failure looking up keep servers
Symptom: DNS failure leading to arv-mount failure leading to job failure
Explanation: The 2.5s connect timeout when connecting to each keep server doesn't allow enough time for a DNS lookup when the network is congested. In addition, the TTL on the DNS A records for the keep servers was set lower than the typical job lifetime.
Mitigation: Add all keep servers to /etc/hosts on all compute nodes
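
A minimal sketch of what this adds to /etc/hosts (the hostnames and addresses here are illustrative, not our real ones):

```
# Pin keep server name resolution locally so arv-mount never
# depends on DNS while the network link is congested
10.0.1.10  keep0.example.com
10.0.1.11  keep1.example.com
10.0.1.12  keep2.example.com
```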

2. arv-mount DNS failure looking up API server
Symptom: DNS failure leading to arv-mount failure leading to job failure
Explanation: The 2.5s connect timeout when connecting to the API server to fetch the list of keep servers doesn't allow enough time for a DNS lookup when the network is congested. In addition, the TTL on our DNS CNAME record for the API server was set lower than the typical job lifetime.
Mitigation: Add the API server to /etc/hosts on all compute nodes
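
The same file gets one more line for the API server (again with a made-up name and address), sidestepping the short-TTL CNAME entirely:

```
# Pin the API server name as well
10.0.0.5  api.example.com
```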

3. python crunch script DNS failure looking up API server and/or keep server
Symptom: python crunch scripts that access keep directly (e.g. via `arvados.collection.CollectionReader()`) hit the same issue as arv-mount (items 1 and 2 above), but from inside the docker container
Explanation: same as 1/2
Mitigation: add `--add-host=${host}:${ip}` entries for the API server and all keep servers to the CRUNCH_JOB_DOCKER_RUN_ARGS environment variable before starting crunch-dispatch.rb --jobs
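
As a sketch, assuming the same illustrative names and addresses as above, the dispatcher startup might look like this:

```
# Pass static host mappings into every job's docker container;
# list the API server and all of your keep servers
export CRUNCH_JOB_DOCKER_RUN_ARGS="--add-host=api.example.com:10.0.0.5 \
  --add-host=keep0.example.com:10.0.1.10 \
  --add-host=keep1.example.com:10.0.1.11 \
  --add-host=keep2.example.com:10.0.1.12"
crunch-dispatch.rb --jobs
```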

4. SLURM send/recv failures starting job tasks
Symptom: errors involving "send/recv" failures from SLURM when attempting to start job task steps
Explanation: this appears not to be an RPC failure between the SLURM compute node and the master at all, but rather a failure to look up the crunch user entry via LDAP (in our case through SSSD): the default 5s timeout that SSSD uses when contacting LDAP servers is not long enough when the network is congested
Mitigation: ensure the crunch user is in the local passwd file and that local nss files are searched before LDAP
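
On an SSSD-backed system like ours, that lookup ordering would look something like this in /etc/nsswitch.conf (assuming the sss NSS module; verify afterwards with `getent passwd crunch`):

```
# Consult local files before falling back to SSSD/LDAP
passwd: files sss
group:  files sss
shadow: files sss
```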

#1 Updated by Joshua Randall almost 10 years ago

5. SLURM send/recv RPC failures
Symptom: errors involving "send/recv" RPC failures from SLURM
Explanation: after resolving the LDAP/SSSD issues, we still got occasional send/recv failures from SLURM. We had been running with the default MessageTimeout of 10s in slurm.conf
Mitigation: Raise MessageTimeout to 60s
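
The slurm.conf change is a one-liner (the SLURM daemons need a restart or reconfigure to pick it up):

```
# Allow slow RPCs to complete over the congested link (default 10s)
MessageTimeout=60
```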

#2 Updated by Peter Amstutz about 6 years ago

  • Status changed from New to Closed