Bug #8225
closed
System was unusable after API overload condition (reported in 8224)
Description
After raising the number of passenger jobs in my nginx.conf, I tried to re-run the job that had failed, but it failed immediately with an arv-mount error:
2016-01-19_10:48:30 starting: ['srun','--nodelist=humgen-01-02,humgen-01-03,humgen-02-01,humgen-02-02,humgen-02-03,humgen-03-01,humgen-03-02,humgen-03-03,humgen-04-01,humgen-04-02,humgen-04-03,humgen-05-01,humgen-05-02,humgen-05-03,humgen-05-04,humgen-05-05,humgen-05-06,humgen-05-07,humgen-05-08,humgen-05-09,humgen-05-10,humgen-05-11,humgen-05-12,humgen-05-13,humgen-05-14,humgen-05-15,humgen-05-16','-D','/data/crunch-tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
2016-01-19_10:48:31 srun: error: humgen-05-01: task 11: Exited with exit code 123
2016-01-19_10:48:31 fusermount: failed to unmount /data/crunch-tmp/crunch-job/task/humgen-05-01.17.keep: Invalid argument
2016-01-19_10:48:32 z8ta6-8i9sb-e37o65uy79f83f6 64871 Clean work dirs: exit 123
Following that error, it relinquished the job allocation and retried, failing with the same error each time. It did that three times.
At the end of the third attempt, there were additional errors:
2016-01-19_10:48:39 fusermount: failed to unmount /data/crunch-tmp/crunch-job/task/humgen-05-01.17.keep: Invalid argument
2016-01-19_10:48:41 z8ta6-8i9sb-e37o65uy79f83f6 1156 Clean work dirs: exit 123
2016-01-19_10:48:41 salloc: Relinquishing job allocation 4270
2016-01-19_10:48:41 close failed in file object destructor:
2016-01-19_10:48:41 sys.excepthook is missing
2016-01-19_10:48:41 lost sys.stderr
It then tried one last time before marking the job as failed:
2016-01-19_10:48:44 starting: ['srun','--nodelist=humgen-01-02,humgen-01-03,humgen-02-01,humgen-02-02,humgen-02-03,humgen-03-01,humgen-03-02,humgen-03-03,humgen-04-01,humgen-04-02,humgen-04-03,humgen-05-01,humgen-05-02,humgen-05-03,humgen-05-04,humgen-05-05,humgen-05-06,humgen-05-07,humgen-05-08,humgen-05-09,humgen-05-10,humgen-05-11,humgen-05-12,humgen-05-13,humgen-05-14,humgen-05-15,humgen-05-16','-D','/data/crunch-tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
2016-01-19_10:48:44 fusermount: failed to unmount /data/crunch-tmp/crunch-job/task/humgen-05-01.17.keep: Invalid argument
2016-01-19_10:48:44 srun: error: humgen-05-01: task 11: Exited with exit code 123
2016-01-19_10:48:45 z8ta6-8i9sb-e37o65uy79f83f6 1380 Clean work dirs: exit 123
2016-01-19_10:48:46 salloc: Relinquishing job allocation 4271
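A note on the `exit 123` lines in the logs above: GNU xargs documents exit status 123 for the case where any invocation of the command exits with a status in the range 1-125. So the non-zero exit from `fusermount` on humgen-05-01 is most likely what fails the whole "Clean work dirs" step. A minimal sketch of that behaviour follows, with `false` standing in for the failing `fusermount` and the two mount paths being hypothetical placeholders:

```shell
# Stand-in for the cleanup pipeline: `false` plays the role of a fusermount
# call that fails. The input paths are hypothetical and are never touched.
status=0
printf '%s\n' /mnt/a /mnt/b | xargs -r -n 1 false || status=$?

# xargs reports 123 when any invocation of the command exits with status 1-125.
echo "xargs exit status: $status"
```

Because the cleanup command runs under `bash -ec`, any non-zero status from the pipeline would be expected to abort it before the `rm -rf` step runs.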
Files
Updated by Joshua Randall about 10 years ago
- File mount-grep-fuse.png added
`mount` reports that the Keep FUSE mount is still mounted on three of the hosts (see screenshot), but arv-mount is not running on those machines.
Updated by Joshua Randall about 10 years ago
I couldn't get `fusermount` to unmount them, but a `sudo umount ...` does appear to have worked.
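For anyone hitting the same state, here is a rough sketch of how the leftover mount points could be listed before unmounting them by hand. It is not the crunch-job code: the helper name `mounts_under`, the sample `mount` line (including its device name and mount options), and the `== 1` prefix test are all assumptions made for illustration. It also assumes mount points contain no whitespace.

```shell
# List mount points from `mount`-style output on stdin that live under a prefix.
# In Linux `mount` output ("<dev> on <mountpoint> type ..."), field 3 is the mount point.
# mounts_under is a hypothetical helper name, not part of Arvados.
mounts_under() {
  awk -v prefix="$1" 'index($3, prefix) == 1 { print $3 }'
}

# Hypothetical sample line shaped like `mount` output; only the path comes from the logs.
sample='keep on /data/crunch-tmp/crunch-job/task/humgen-05-01.17.keep type fuse.keep (rw,nosuid,nodev)'
stale=$(printf '%s\n' "$sample" | mounts_under /data/crunch-tmp)
echo "$stale"

# On an affected host one might then run (requires root, so check the list first):
#   mount -t fuse,fuse.keep | mounts_under /data/crunch-tmp | xargs -r -n 1 sudo umount
```

Printing the list first and running `umount` as a separate step keeps the destructive part manual, which seems safer while the cause of the stale mounts is still unknown.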