Bug #10586
Closed
Python keep client (CollectionWriter) appears to deadlock
100%
Description
Some of the jobs across our cluster (those that are neither stuck due to #10585 nor among the handful still running) appear to be stuck in our Python crunch script, in one of its calls to arvados.CollectionWriter().
Our crunch script is stuck after printing "...writing output to keep" but before "...validating it", which means it is in one of these three calls: https://github.com/wtsi-hgi/arvados-pipelines/blob/master/crunch_scripts/gatk-haplotypecaller-cram.py#L73-L80
It seems likely that #10585 and this issue share the same underlying cause: some sort of deadlock in the Python keep client. That assumes arv-mount has a supervisor process that eventually notices hung operations and kills them off, whereas our crunch script has no such supervisor.
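To narrow down which of the three calls is hanging, one option is to wrap the suspect section in a watchdog that dumps every thread's stack once a time limit passes. The following is only a sketch using the Python standard library. `HangWatchdog` and `dump_all_stacks` are hypothetical helper names, not part of the Arvados SDK, and the timeout value is an arbitrary assumption. It reports a hang but does not kill the process or fix the deadlock.

```python
import sys
import threading
import traceback


def dump_all_stacks(out=sys.stderr):
    """Write the current stack of every live thread to `out`."""
    # Map thread idents to names so the dump is easier to read.
    names = {t.ident: t.name for t in threading.enumerate()}
    for ident, frame in sys._current_frames().items():
        out.write("--- thread %s (%s) ---\n" % (ident, names.get(ident, "?")))
        out.write("".join(traceback.format_stack(frame)))
    out.flush()


class HangWatchdog(object):
    """Hypothetical helper: call `on_timeout` if the `with` body
    is still running after `timeout` seconds."""

    def __init__(self, timeout, on_timeout=dump_all_stacks):
        self._timer = threading.Timer(timeout, on_timeout)
        # Daemon thread, so a pending timer never blocks interpreter exit.
        self._timer.daemon = True

    def __enter__(self):
        self._timer.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        # Body finished (or raised) in time: disarm the timer.
        self._timer.cancel()
        return False  # never swallow exceptions from the body
```

In the crunch script, the block between "...writing output to keep" and "...validating it" could be placed inside `with HangWatchdog(timeout=3600):` (the one-hour limit is a guess). If the section hangs, the stack dump on stderr should show which call, and which lock or socket wait inside the keep client, each thread is blocked in.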