Bug #13142 (open)

crunch-dispatch-slurm uses a surprising number of file descriptors, and stops dispatching when it gets a "too many open files" error

Added by Joshua Randall almost 7 years ago. Updated almost 2 years ago.

Status: New
Priority: Normal
Assigned To: -
Category: Crunch
Target version: -
Start date: 04/09/2018
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release:
Release relationship: Auto

Description

For about 3 minutes this morning (between 09:04:40 and 09:07:11), our crunch-dispatch-slurm reported 7565 "too many open files" errors in its log:

#  grep 'arvados-master-ncucu crunch-dispatch-slurm' /var/log/syslog | grep 'too many' | wc -l
7565
# grep 'arvados-master-ncucu crunch-dispatch-slurm' /var/log/syslog | grep 'too many' | cut -d" " -f8- | sort | uniq -c | sort -rg | head -n 10
   5059 scancel: /usr/bin/scancel: open /dev/null: too many open files ("")
    472 scancel: /usr/bin/scancel: pipe2: too many open files ("")
     98 Error running "/usr/bin/squeue" ["squeue" "--all" "--format=%j %y %Q"]: open /dev/null: too many open files "" 
     21 Error running "/usr/bin/squeue" ["squeue" "--all" "--format=%j %y %Q"]: pipe2: too many open files "" 
      6 Error running "/usr/bin/squeue" ["squeue" "--all" "--format=%j %y %Q"]: fork/exec /usr/bin/squeue: too many open files "" 
      6 2 times: [ 2018/02/28 09:05:23 scancel: /usr/bin/scancel: open /dev/null: too many open files ("")]
      1 Error updating container ncucu-dz642-zwq2f3i5rj6r5wx to state "Cancelled": Put https://arvados-api-ncucu.hgi.sanger.ac.uk/arvados/v1/containers/ncucu-dz642-zwq2f3i5rj6r5wx: dial tcp 10.101.0.179:443: socket: too many open files
      1 Error updating container ncucu-dz642-zhsu1zqrsizyaxg to state "Cancelled": Put https://arvados-api-ncucu.hgi.sanger.ac.uk/arvados/v1/containers/ncucu-dz642-zhsu1zqrsizyaxg: dial tcp 10.101.0.179:443: socket: too many open files
      1 Error updating container ncucu-dz642-zeftzdg5kifnvcp to state "Cancelled": Put https://arvados-api-ncucu.hgi.sanger.ac.uk/arvados/v1/containers/ncucu-dz642-zeftzdg5kifnvcp: dial tcp 10.101.0.179:443: socket: too many open files
      1 Error updating container ncucu-dz642-yrv1f4husp85ivo to state "Cancelled": Put https://arvados-api-ncucu.hgi.sanger.ac.uk/arvados/v1/containers/ncucu-dz642-yrv1f4husp85ivo: dial tcp 10.101.0.179:443: socket: too many open files

The process has the default nofile limits, which are a soft limit of 1024 and a hard limit of 4096.

We installed crunch-dispatch-slurm according to the instructions (http://doc.arvados.org/install/crunch2-slurm/install-dispatch.html) and it is being run by systemd using the unit installed by the apt package.
Version:

# dpkg -l |grep crunch
ii  crunch-dispatch-slurm            0.1.20171102145827.cc6f86f-1      amd64        Dispatch Crunch containers to a SLURM cluster

Between 09:07:11 and 17:17:27, crunch-dispatch-slurm continued to run and logged:
- 8594 "Done monitoring container <UUID>" lines.
- 37 "Error getting final container state for <UUID>: arvados API server error: Path not found (404: 404 Not Found) returned by arvados-api-ncucu.hgi.sanger.ac.uk" lines.
- 32 "Error updating container <UUID> to state "Cancelled": arvados API server error: State cannot change from Complete to Cancelled (422: 422 Unprocessable Entity) returned by arvados-api-ncucu.hgi.sanger.ac.uk" lines.

Between 17:17:27 and 19:27:46, crunch-dispatch-slurm logged nothing (I think nothing was running at that point).

At 19:27:46, I ran `systemctl restart crunch-dispatch-slurm`; it logged "crunch-dispatch-slurm dev started" and then proceeded to submit 662 batch jobs.

It appears that the end result of running into the "too many open files" errors was that crunch-dispatch-slurm continued to run but stopped processing any new work. It would be better if, once it is in an errored state like this, it terminated so that systemd could restart it automatically.

It would also be good if the systemd unit raised the crunch-dispatch-slurm nofile limit to an appropriate level.
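
For reference, a systemd drop-in along these lines would cover both suggestions. This is only a sketch: the LimitNOFILE value is an example, not a recommended default, and the packaged unit may already set a Restart policy.

# /etc/systemd/system/crunch-dispatch-slurm.service.d/override.conf (hypothetical drop-in)
[Service]
# Raise the open-file limit well above the 1024 soft default.
LimitNOFILE=65536
# Restart the dispatcher if it exits after a fatal error.
Restart=always
RestartSec=5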


Subtasks 1 (0 open, 1 closed)

Task #13293: Review 13142-dispatch-nfiles (Resolved, Tom Clegg, 04/09/2018)

Actions #1

Updated by Joshua Randall almost 7 years ago

I have manually raised max open files by creating a systemd drop-in:

# printf 'a\n[Service]\nLimitNOFILE=1000000\n.\nw\nq' | EDITOR=ed systemctl edit crunch-dispatch-slurm
# systemctl restart crunch-dispatch-slurm
# cat /proc/$(ps auxwww |grep crunch-dispatch-slurm | grep -v grep | awk '{print $2}')/limits | grep files
Max open files            1000000              1000000              files

Actions #2

Updated by Tom Morris almost 7 years ago

  • Target version set to 2018-04-11 Sprint
Actions #3

Updated by Tom Clegg almost 7 years ago

We should check for leaking FDs. We might be relying on garbage collection to clean up abandoned exec.Command fds.
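
For context, the failure mode to look for is exec.Cmd pipes (and the /dev/null descriptors it opens) that are only released by finalizers because Wait is never called. A minimal sketch of the leak-free pattern, illustrative only and not the dispatcher's actual code:

package main

import (
	"log"
	"os/exec"
)

// runSqueue shows the pattern that releases FDs promptly: Output (like Run,
// or an explicit Wait after Start) closes the command's pipes as soon as the
// child exits, instead of leaving them for the garbage collector.
func runSqueue() ([]byte, error) {
	cmd := exec.Command("squeue", "--all", "--format=%j %y %Q")
	out, err := cmd.Output() // Output starts the command and waits for it
	if err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	if _, err := runSqueue(); err != nil {
		log.Printf("squeue: %v", err)
	}
}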

Actions #4

Updated by Tom Clegg almost 7 years ago

  • Assigned To set to Tom Clegg
Actions #5

Updated by Tom Clegg almost 7 years ago

Tried to find a leak, but it looks like
  • we call Wait (or something that calls Wait) every time we use exec.Command
  • we use one arvados client object for all api calls
  • a dispatcher that's been running jobs all day has 4 FDs open

...and we only run one scancel at a time, so I'm guessing this happens during a busy time (lots of sbatch invocations).

I suppose we could rate-limit sbatch, but either way, 1024 is a silly limit for a program like this, so:

13142-dispatch-nfiles @ 580949047505722b11956f09177a8537a1fef015
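
One way a long-running dispatcher can avoid the 1024 soft limit regardless of how the unit file is configured is to raise its own soft RLIMIT_NOFILE to the hard limit at startup. The sketch below only illustrates that general approach with the standard syscall package; it is not taken from the branch above.

package main

import (
	"log"
	"syscall"
)

// raiseNOFILE bumps the process's soft RLIMIT_NOFILE up to the hard limit,
// so the default 1024 soft limit does not cap the number of concurrent
// sbatch/squeue pipes and API sockets during busy periods.
func raiseNOFILE() error {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return err
	}
	if lim.Cur < lim.Max {
		lim.Cur = lim.Max
		return syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim)
	}
	return nil
}

func main() {
	if err := raiseNOFILE(); err != nil {
		log.Printf("warning: could not raise RLIMIT_NOFILE: %v", err)
	}
}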

Actions #6

Updated by Tom Clegg over 6 years ago

  • Status changed from New to In Progress
Actions #7

Updated by Lucas Di Pentima over 6 years ago

Not much to review :) this LGTM, thanks.

Actions #8

Updated by Tom Clegg over 6 years ago

  • Subject changed from crunch-dispatch-slurm systemd unit does not set nofile limit, and crunch-dispatch-slurm does not handle "too many open files" errors well to crunch-dispatch-slurm uses a surprising number of file descriptors, and stops dispatching when it gets a "too many open files" error
  • Status changed from In Progress to New
  • Assigned To deleted (Tom Clegg)
  • Target version changed from 2018-04-11 Sprint to Arvados Future Sprints
Actions #9

Updated by Peter Amstutz over 3 years ago

  • Target version deleted (Arvados Future Sprints)
Actions #10

Updated by Lucas Di Pentima almost 2 years ago

  • Release set to 60