Bug #8404
Closed
Interrupted system call / warning: squeue exit status 256 ()
Description
A few jobs failed:
2016-02-09_14:43:28 wx7k5-8i9sb-jiy58uzpgsjmy2v 7805 0 stderr run-command: caught exception
2016-02-09_14:43:28 wx7k5-8i9sb-jiy58uzpgsjmy2v 7805 0 stderr Traceback (most recent call last):
2016-02-09_14:43:28 wx7k5-8i9sb-jiy58uzpgsjmy2v 7805 0 stderr File "/tmp/crunch-job/src/crunch_scripts/run-command", line 393, in <module>
2016-02-09_14:43:28 wx7k5-8i9sb-jiy58uzpgsjmy2v 7805 0 stderr (pid, status) = os.wait()
2016-02-09_14:43:28 wx7k5-8i9sb-jiy58uzpgsjmy2v 7805 0 stderr OSError: [Errno 4] Interrupted system call
One example: wx7k5-8i9sb-bnmltdlp4ucwlbe, a GATK queue parent job, failed. One job, wx7k5-8i9sb-qhd7u9uhpugccn3, failed due to what looks like a manual cancellation at 2016-02-09 14:23:41 UTC by me (wx7k5-tpzed-j0r27xny18tkq45).
However, I was not in the office at that time and was still commuting, so I don't know how this could have happened.
Another example: wx7k5-8i9sb-jiy58uzpgsjmy2v failed due to it at 2016-02-09_14:43:28, but I can't find any child queue job that failed due to it.
Updated by Nico César about 10 years ago
Just a random thought: in https://hg.python.org/cpython/rev/c3193b7156bb/ they simply catch the EINTR and continue. Shall we do the same?
Updated by Tom Clegg about 10 years ago
8404-catch-interrupted-syscall
Style nit: Perhaps better to use try: except: else: here, so the only thing in try: is the os.wait() call. That makes it easy to see where we expect the exception to occur.
    try:
        (pid, status) = os.wait()
    except OSError as e:
        if e.errno == errno.EINTR:
            pass
        else:
            raise
    else:
        pids.discard(pid)
        if not taskp.get("task.ignore_rcode"):
            rcode[pid] = (status >> 8)
        else:
            rcode[pid] = 0
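For anyone who wants to try the pattern outside run-command, here is a minimal self-contained sketch. It is not the code from the branch: `wait_for_children`, `spawn`, and the `ignore_rcode` parameter are hypothetical names that stand in for the surrounding loop, the child launch, and the `task.ignore_rcode` lookup. It also assumes a POSIX system where `os.fork()` is available. On Python 3.5 and later, PEP 475 makes `os.wait()` retry on EINTR by itself unless a signal handler raises, so the `except` branch mostly matters on older interpreters.

```python
import errno
import os


def wait_for_children(pids, ignore_rcode=False):
    """Reap every pid in `pids`, retrying os.wait() if a signal interrupts it.

    Hypothetical helper: `ignore_rcode` stands in for the
    taskp.get("task.ignore_rcode") check in the snippet above.
    Returns a dict mapping each reaped pid to its exit code.
    """
    rcode = {}
    while pids:
        try:
            # Keep only the call that may raise EINTR inside the try block.
            (pid, status) = os.wait()
        except OSError as e:
            if e.errno == errno.EINTR:
                # A signal arrived while waiting; go around and wait again.
                continue
            raise
        else:
            pids.discard(pid)
            # For a normal exit, the exit code is the high byte of the status.
            rcode[pid] = 0 if ignore_rcode else (status >> 8)
    return rcode


def spawn(exit_code):
    """Fork a child that exits immediately with `exit_code` (illustration only)."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)
    return pid
```

Calling `wait_for_children({spawn(0), spawn(3)})` should return a dict with one entry per child pid, holding the exit codes 0 and 3.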
Updated by Peter Amstutz about 10 years ago
- Status changed from New to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:1fc6d7713baabfe85b49191e156b6c093d22b69f.