Bug #11209 (closed)
stuck keep fuse mounts not cleared by crunch-job
Start date: 03/02/2017
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Description
crunch-job attempts to unmount any fuse filesystems mounted under $CRUNCH_TMP, but it does so using only fusermount. On our system this often fails, and a "umount -f <mount_point>" is required to make the node work again.
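A minimal sketch of the kind of fallback we have in mind, reusing the mount/awk selection that crunch-job's clean-work-dirs step already runs; the escalation to umount -f / umount -l is our suggestion, not current crunch-job behaviour:

<pre>
#!/bin/bash
# Sketch: clear keep fuse mounts under $CRUNCH_TMP, escalating to a forced
# unmount when the plain fusermount call leaves the mount wedged.
# Assumes $CRUNCH_TMP is set, as it is in crunch-job's clean-work-dirs step.
mount -t fuse,fuse.keep | awk -v tmp="$CRUNCH_TMP" '(index($3, tmp) == 1){print $3}' |
while read -r mountpoint; do
    # Same lazy fusermount call crunch-job uses today.
    if ! fusermount -u -z "$mountpoint"; then
        # Wedged mount: force it (requires root); fall back to a lazy
        # umount if the forced one is refused as well.
        umount -f "$mountpoint" || umount -l "$mountpoint"
    fi
done
</pre>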
In addition, this often happens on multiple nodes at the same time, and by the time we have three nodes with wedged fuse mounts they rapidly fail all pending jobs. There appears to be no mechanism by which crunch-dispatch can decide to stop dispatching to a node that is known to be broken.
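Since no such mechanism exists today, the stopgap is for an operator to take the broken node out of the SLURM pool by hand. A sketch of that workaround (the node name is only an example taken from the log below):

<pre>
# Stop SLURM scheduling new crunch work onto a node with wedged keep mounts.
scontrol update NodeName=humgen-05-07 State=DRAIN Reason="wedged keep fuse mount (#11209)"

# After the mounts have been cleared (see the unmount sketch above),
# put the node back into service.
scontrol update NodeName=humgen-05-07 State=RESUME
</pre>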
Here is the log from a job that suffered from this issue.
<pre>
dispatching job z8ta6-8i9sb-8mp2qww92moa644 {"docker_image"=>"mercury/gatk-3.5", "min_nodes"=>1, "max_tasks_per_node"=>10, "keep_cache_mb_per_task"=>1280} to humgen-05-07 z8ta6-7ekkf-sa1q59632vhxov6 {"total_cpu_cores":32,"total_ram_mb":257867,"total_scratch_mb":788561}
2017-02-28_17:23:33 salloc: Granted job allocation 17536
2017-02-28_17:23:33 58397 Sanity check is `/usr/bin/docker ps -q`
2017-02-28_17:23:33 58397 sanity check: start
2017-02-28_17:23:33 58397 stderr starting: ['srun','--nodes=1','--ntasks-per-node=1','/usr/bin/docker','ps','-q']
2017-02-28_17:23:33 58397 sanity check: exit 0
2017-02-28_17:23:33 58397 Sanity check OK
2017-02-28_17:23:33 z8ta6-8i9sb-8mp2qww92moa644 58397 running from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/arvados-cli-0.1.20170217221854/bin/crunch-job with arvados-cli Gem version(s) 0.1.20170217221854, 0.1.20161017193526, 0.1.20160503204200, 0.1.20151207150126, 0.1.20151023190001
2017-02-28_17:23:33 z8ta6-8i9sb-8mp2qww92moa644 58397 check slurm allocation
2017-02-28_17:23:33 z8ta6-8i9sb-8mp2qww92moa644 58397 node humgen-05-07 - 10 slots
2017-02-28_17:23:33 z8ta6-8i9sb-8mp2qww92moa644 58397 start
2017-02-28_17:23:34 z8ta6-8i9sb-8mp2qww92moa644 58397 clean work dirs: start
2017-02-28_17:23:34 z8ta6-8i9sb-8mp2qww92moa644 58397 stderr starting: ['srun','--nodelist=humgen-05-07','-D','/data/crunch-tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") == 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
2017-02-28_17:23:34 z8ta6-8i9sb-8mp2qww92moa644 58397 stderr fusermount: failed to unmount /data/crunch-tmp/crunch-job/task/humgen-05-07.10.keep: Invalid argument
2017-02-28_17:23:34 z8ta6-8i9sb-8mp2qww92moa644 58397 stderr srun: error: humgen-05-07: task 0: Exited with exit code 123
2017-02-28_17:23:34 z8ta6-8i9sb-8mp2qww92moa644 58397 clean work dirs: exit 123
2017-02-28_17:23:34 salloc: Relinquishing job allocation 17536
dispatching job z8ta6-8i9sb-8mp2qww92moa644 {"docker_image"=>"mercury/gatk-3.5", "min_nodes"=>1, "max_tasks_per_node"=>10, "keep_cache_mb_per_task"=>1280} to humgen-04-02 z8ta6-7ekkf-ekzlxvozts92sqm {"total_cpu_cores":40,"total_ram_mb":193289,"total_scratch_mb":68302106}
2017-02-28_17:23:35 salloc: error: Unable to allocate resources: Requested nodes are busy
2017-02-28_17:23:35 salloc: Job allocation 17539 has been revoked.
dispatching job z8ta6-8i9sb-8mp2qww92moa644 {"docker_image"=>"mercury/gatk-3.5", "min_nodes"=>1, "max_tasks_per_node"=>10, "keep_cache_mb_per_task"=>1280} to humgen-05-03 z8ta6-7ekkf-1i1v5zotflg26jn {"total_cpu_cores":32,"total_ram_mb":257867,"total_scratch_mb":788561}
2017-02-28_17:23:36 salloc: Granted job allocation 17540
2017-02-28_17:23:36 58715 Sanity check is `/usr/bin/docker ps -q`
2017-02-28_17:23:36 58715 sanity check: start
2017-02-28_17:23:36 58715 stderr starting: ['srun','--nodes=1','--ntasks-per-node=1','/usr/bin/docker','ps','-q']
2017-02-28_17:23:36 58715 sanity check: exit 0
2017-02-28_17:23:36 58715 Sanity check OK
2017-02-28_17:23:38 z8ta6-8i9sb-8mp2qww92moa644 58715 running from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/arvados-cli-0.1.20170217221854/bin/crunch-job with arvados-cli Gem version(s) 0.1.20170217221854, 0.1.20161017193526, 0.1.20160503204200, 0.1.20151207150126, 0.1.20151023190001
2017-02-28_17:23:38 z8ta6-8i9sb-8mp2qww92moa644 58715 check slurm allocation
2017-02-28_17:23:38 z8ta6-8i9sb-8mp2qww92moa644 58715 node humgen-05-03 - 10 slots
2017-02-28_17:23:38 z8ta6-8i9sb-8mp2qww92moa644 58715 start
2017-02-28_17:23:38 z8ta6-8i9sb-8mp2qww92moa644 58715 clean work dirs: start
2017-02-28_17:23:38 z8ta6-8i9sb-8mp2qww92moa644 58715 stderr starting: ['srun','--nodelist=humgen-05-03','-D','/data/crunch-tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") == 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
2017-02-28_17:23:38 z8ta6-8i9sb-8mp2qww92moa644 58715 stderr fusermount: failed to unmount /data/crunch-tmp/crunch-job/task/humgen-05-03.4.keep: Invalid argument
2017-02-28_17:23:38 z8ta6-8i9sb-8mp2qww92moa644 58715 stderr srun: error: humgen-05-03: task 0: Exited with exit code 123
2017-02-28_17:23:38 z8ta6-8i9sb-8mp2qww92moa644 58715 clean work dirs: exit 123
2017-02-28_17:23:38 salloc: Relinquishing job allocation 17540
2017-02-28_17:23:38 close failed in file object destructor:
2017-02-28_17:23:38 sys.excepthook is missing
2017-02-28_17:23:38 lost sys.stderr
dispatching job z8ta6-8i9sb-8mp2qww92moa644 {"docker_image"=>"mercury/gatk-3.5", "min_nodes"=>1, "max_tasks_per_node"=>10, "keep_cache_mb_per_task"=>1280} to humgen-04-02 z8ta6-7ekkf-ekzlxvozts92sqm {"total_cpu_cores":40,"total_ram_mb":193289,"total_scratch_mb":68302106}
2017-02-28_17:23:40 salloc: Granted job allocation 17544
2017-02-28_17:23:40 58985 Sanity check is `/usr/bin/docker ps -q`
2017-02-28_17:23:40 58985 sanity check: start
2017-02-28_17:23:40 58985 stderr starting: ['srun','--nodes=1','--ntasks-per-node=1','/usr/bin/docker','ps','-q']
2017-02-28_17:23:40 58985 sanity check: exit 0
2017-02-28_17:23:40 58985 Sanity check OK
2017-02-28_17:23:41 z8ta6-8i9sb-8mp2qww92moa644 58985 running from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/arvados-cli-0.1.20170217221854/bin/crunch-job with arvados-cli Gem version(s) 0.1.20170217221854, 0.1.20161017193526, 0.1.20160503204200, 0.1.20151207150126, 0.1.20151023190001
2017-02-28_17:23:41 z8ta6-8i9sb-8mp2qww92moa644 58985 check slurm allocation
2017-02-28_17:23:41 z8ta6-8i9sb-8mp2qww92moa644 58985 node humgen-04-02 - 10 slots
2017-02-28_17:23:41 z8ta6-8i9sb-8mp2qww92moa644 58985 start
2017-02-28_17:23:41 z8ta6-8i9sb-8mp2qww92moa644 58985 clean work dirs: start
2017-02-28_17:23:41 z8ta6-8i9sb-8mp2qww92moa644 58985 stderr starting: ['srun','--nodelist=humgen-04-02','-D','/data/crunch-tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") == 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
2017-02-28_17:23:41 z8ta6-8i9sb-8mp2qww92moa644 58985 stderr fusermount: failed to unmount /data/crunch-tmp/crunch-job/task/humgen-04-02.9.keep: Invalid argument
2017-02-28_17:23:41 z8ta6-8i9sb-8mp2qww92moa644 58985 stderr srun: error: humgen-04-02: task 0: Exited with exit code 123
2017-02-28_17:23:41 z8ta6-8i9sb-8mp2qww92moa644 58985 clean work dirs: exit 123
2017-02-28_17:23:41 salloc: Relinquishing job allocation 17544
2017-02-28_17:23:41 close failed in file object destructor:
2017-02-28_17:23:41 sys.excepthook is missing
2017-02-28_17:23:41 lost sys.stderr
</pre>