Bug #8225

System was unusable after API overload condition (reported in 8224)

Added by Joshua Randall about 10 years ago. Updated about 6 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: -
Target version: -
Story points: -

Description

After raising the number of passenger jobs in my nginx.conf, I tried to re-run the job that had failed, but it failed immediately during work-directory cleanup, because `fusermount` could not unmount a leftover arv-mount (keep) mount:

2016-01-19_10:48:30 starting: ['srun','--nodelist=humgen-01-02,humgen-01-03,humgen-02-01,humgen-02-02,humgen-02-03,humgen-03-01,humgen-03-02,humgen-03-03,humgen-04-01,humgen-04-02,humgen-04-03,humgen-05-01,humgen-05-02,humgen-05-03,humgen-05-04,humgen-05-05,humgen-05-06,humgen-05-07,humgen-05-08,humgen-05-09,humgen-05-10,humgen-05-11,humgen-05-12,humgen-05-13,humgen-05-14,humgen-05-15,humgen-05-16','-D','/data/crunch-tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
2016-01-19_10:48:31 srun: error: humgen-05-01: task 11: Exited with exit code 123
2016-01-19_10:48:31 fusermount: failed to unmount /data/crunch-tmp/crunch-job/task/humgen-05-01.17.keep: Invalid argument
2016-01-19_10:48:32 z8ta6-8i9sb-e37o65uy79f83f6 64871 Clean work dirs: exit 123

Following that error, it relinquished the job allocation and then retried, failing with the same error each time. This happened three times.

At the end of the third attempt, there were additional errors:

2016-01-19_10:48:39 fusermount: failed to unmount /data/crunch-tmp/crunch-job/task/humgen-05-01.17.keep: Invalid argument
2016-01-19_10:48:41 z8ta6-8i9sb-e37o65uy79f83f6 1156 Clean work dirs: exit 123
2016-01-19_10:48:41 salloc: Relinquishing job allocation 4270
2016-01-19_10:48:41 close failed in file object destructor:
2016-01-19_10:48:41 sys.excepthook is missing
2016-01-19_10:48:41 lost sys.stderr

It then tried one last time before marking the job as failed:

2016-01-19_10:48:44 starting: ['srun','--nodelist=humgen-01-02,humgen-01-03,humgen-02-01,humgen-02-02,humgen-02-03,humgen-03-01,humgen-03-02,humgen-03-03,humgen-04-01,humgen-04-02,humgen-04-03,humgen-05-01,humgen-05-02,humgen-05-03,humgen-05-04,humgen-05-05,humgen-05-06,humgen-05-07,humgen-05-08,humgen-05-09,humgen-05-10,humgen-05-11,humgen-05-12,humgen-05-13,humgen-05-14,humgen-05-15,humgen-05-16','-D','/data/crunch-tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
2016-01-19_10:48:44 fusermount: failed to unmount /data/crunch-tmp/crunch-job/task/humgen-05-01.17.keep: Invalid argument
2016-01-19_10:48:44 srun: error: humgen-05-01: task 11: Exited with exit code 123
2016-01-19_10:48:45 z8ta6-8i9sb-e37o65uy79f83f6 1380 Clean work dirs: exit 123
2016-01-19_10:48:46 salloc: Relinquishing job allocation 4271
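
A note on the `exit 123` seen in the logs above: GNU xargs documents exit status 123 for the case where any invocation of the command it runs exits with status 1-125. So the code most likely comes from the `xargs ... fusermount -u -z` stage of the cleanup command, not from `fusermount` directly. The following is a minimal sketch, assuming GNU xargs; `simulate_failed_cleanup` and the example paths are hypothetical, and `false` stands in for a failing `fusermount`.

```shell
# Stand-in for the cleanup pipeline: `false` plays the role of a
# `fusermount -u -z` call that fails with a non-zero exit status.
# Assumption: GNU xargs (the -r flag is a GNU extension).
simulate_failed_cleanup() {
    printf '%s\n' "$@" | xargs -r -n 1 false
}

# Capture the pipeline's exit status without aborting under `set -e`.
status=0
simulate_failed_cleanup /tmp/example-a.keep /tmp/example-b.keep || status=$?
echo "pipeline exit status: $status"
```

Because the real command runs under `bash -ec`, a non-zero status at this stage stops the rest of the cleanup (the `rm -rf` step), which matches the repeated "Clean work dirs: exit 123" lines.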


Files

mount-grep-fuse.png (282 KB), Joshua Randall, 01/19/2016 12:44 PM
Actions #1

Updated by Joshua Randall about 10 years ago

`mount` reports that the keep fuse mount is still mounted on three of the hosts (see screenshot), but arv-mount is not running on those machines.
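
A check like the one described above could be scripted roughly as follows. This is only a sketch: `list_fuse_mounts_under` is a hypothetical helper, and it assumes the usual Linux `mount` output format (`DEVICE on MOUNTPOINT type FSTYPE (OPTIONS)`) and mount points without spaces.

```shell
# Read `mount` output on stdin and print every FUSE mount point that
# lives under the given prefix (for example, the crunch tmp directory).
# Assumes lines of the form: DEVICE on MOUNTPOINT type FSTYPE (OPTIONS)
list_fuse_mounts_under() {
    awk -v prefix="$1" '
        $2 == "on" && $4 == "type" && $5 ~ /^fuse/ && index($3, prefix) == 1 {
            print $3
        }
    '
}
```

On an affected host, `mount | list_fuse_mounts_under /data/crunch-tmp` would list the leftover keep mounts, and `pgrep -f arv-mount` returning nothing would confirm that no arv-mount process is serving them.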

Actions #2

Updated by Joshua Randall about 10 years ago

I couldn't get `fusermount` to unmount them, but a `sudo umount ...` does appear to have worked.
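
The manual workaround above can be sketched as a fallback: try the unprivileged lazy unmount first, and only if that fails, unmount as root. `unmount_with_fallback` is a hypothetical helper name, and using plain `umount` with no extra flags is an assumption, since the exact command is not recorded in this ticket.

```shell
# Try `fusermount -u -z` (unmount, lazy) first; if it fails, fall back
# to a privileged `umount`. Returns the status of whichever ran last.
unmount_with_fallback() {
    if fusermount -u -z "$1" 2>/dev/null; then
        return 0
    fi
    echo "fusermount could not unmount $1; falling back to sudo umount" >&2
    # Assumption: plain umount, no flags.
    sudo umount "$1"
}
```

It would be called once per stale mount point, for example `unmount_with_fallback /data/crunch-tmp/crunch-job/task/humgen-05-01.17.keep`.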

Actions #3

Updated by Peter Amstutz about 6 years ago

  • Status changed from New to Closed