Bug #8217
closed
qr2hi random killed job?
Description
parent gatk queue job: https://workbench.qr2hi.arvadosapi.com/jobs/qr2hi-8i9sb-le78pctmw2ldrih
2016-01-15_18:38:38 qr2hi-8i9sb-oaw9ccj1ss83djw 7906 0 stderr crunchstat: keepcalls 0 put 92152 get -- interval 10.0000 seconds 0 put 0 get
2016-01-15_18:38:38 qr2hi-8i9sb-oaw9ccj1ss83djw 7906 0 stderr crunchstat: net:keep0 0 tx 7243644001 rx -- interval 10.0000 seconds 0 tx 0 rx
2016-01-15_18:38:38 qr2hi-8i9sb-oaw9ccj1ss83djw 7906 0 stderr crunchstat: keepcache 92041 hit 111 miss -- interval 10.0000 seconds 0 hit 0 miss
2016-01-15_18:38:38 qr2hi-8i9sb-oaw9ccj1ss83djw 7906 0 stderr crunchstat: fuseops 0 write 46617 read -- interval 10.0000 seconds 0 write 0 read
2016-01-15_18:38:38 qr2hi-8i9sb-oaw9ccj1ss83djw 7906 0 stderr crunchstat: blkio:0:0 0 write 6080790552 read -- interval 10.0000 seconds 0 write 0 read
2016-01-15_18:38:43 qr2hi-8i9sb-oaw9ccj1ss83djw 7906 0 stderr Killed
2016-01-15_18:38:43 qr2hi-8i9sb-oaw9ccj1ss83djw 7906 0 stderr run-command: /bin/sh completed with exit code 137 (failed)
2016-01-15_18:38:43 qr2hi-8i9sb-oaw9ccj1ss83djw 7906 0 stderr run-command: the following output files will be saved to keep:
2016-01-15_18:38:43 qr2hi-8i9sb-oaw9ccj1ss83djw 7906 0 stderr run-command: 13284 ./scatter.intervals
2016-01-15_18:38:43 qr2hi-8i9sb-oaw9ccj1ss83djw 7906 0 stderr run-command: 0 ./.scatter.intervals.done
2016-01-15_18:38:43 qr2hi-8i9sb-oaw9ccj1ss83djw 7906 0 stderr run-command: 1334274207 ./24385-200_AH5G7WCCXX.realigned.bam.realigned.g.vcf.gz
2016-01-15_18:38:43 qr2hi-8i9sb-oaw9ccj1ss83djw 7906 0 stderr run-command: start writing output to keep
2016-01-15_18:38:45 qr2hi-8i9sb-oaw9ccj1ss83djw 7906 0 stderr crunchstat: mem 361123840 cache 0 swap 32 pgmajfault 462368768 rss
2016-01-15_18:38:45 qr2hi-8i9sb-oaw9ccj1ss83djw 7906 0 stderr crunchstat: cpu 82070.0000 user 447.1700 sys 16 cpus -- interval 9.9999 seconds 17.1700 user 4.4400 sys
2016-01-15_18:38:45 qr2hi-8i9sb-oaw9ccj1ss83djw 7906 0 stderr crunchstat: net:eth0 403426335 tx 1628930 rx -- interval 9.9998 seconds 403399187 tx 567038 rx
2016-01-15_18:38:48 qr2hi-8i9sb-oaw9ccj1ss83djw 7906 0 stderr crunchstat: keepcalls 0 put 92152 get -- interval 10.0000 seconds 0 put 0 get
2016-01-15_18:38:48 qr2hi-8i9sb-oaw9ccj1ss83djw 7906 0 stderr crunchstat: net:keep0 0 tx 7243644001 rx -- interval 10.0000 seconds 0 tx 0 rx
2016-01-15_18:38:48 qr2hi-8i9sb-oaw9ccj1ss83djw 7906 0 stderr crunchstat: keepcache 92041 hit 111 miss -- interval 10.0000 seconds 0 hit 0 miss
2016-01-15_18:38:48 qr2hi-8i9sb-oaw9ccj1ss83djw 7906 0 stderr crunchstat: fuseops 0 write 46618 read -- interval 10.0000 seconds 0 write 1 read
2016-01-15_18:38:48 qr2hi-8i9sb-oaw9ccj1ss83djw 7906 0 stderr crunchstat: blkio:0:0 0 write 6080790845 read -- interval 10.0000 seconds 0 write 293 read
2016-01-15_18:38:55 qr2hi-8i9sb-oaw9ccj1ss83djw 7906 0 stderr crunchstat: mem 1006260224 cache 0 swap 32 pgmajfault 495964160 rss
2016-01-15_18:38:55 qr2hi-8i9sb-oaw9ccj1ss83djw 7906 0 stderr crunchstat: cpu 82072.3900 user 449.1700 sys 16 cpus -- interval 10.0001 seconds 2.3900 user 2.0000 sys
2016-01-15_18:38:55 qr2hi-8i9sb-oaw9ccj1ss83djw 7906 0 stderr crunchstat: net:eth0 2275656409 tx 3993886 rx -- interval 10.0003 seconds 1872230074 tx 2364956 rx
Updated by Brett Smith about 10 years ago
- Status changed from New to Resolved
Linux killed the java process because it was using too much RAM.
For future debugging: one good hint in this log is the "completed with exit code 137" message. An exit code over 128 means the process was terminated by a signal, and you get the signal number by subtracting 128: 137 - 128 == 9. Looking at man 7 signal, we see that signal 9 is SIGKILL, which terminates the program immediately and cannot be caught or ignored.
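The exit-code arithmetic above can be sketched in a few lines of Python. This is only an illustration, assuming a Linux host; the helper name `describe_exit_code` is made up for this example and is not part of Crunch or run-command. A program can also return a status above 128 on purpose, so treat the result as a strong hint rather than proof.

```python
import signal


def describe_exit_code(code):
    """Interpret a shell-style exit status (hypothetical helper).

    Shells report "terminated by signal N" as status 128 + N, so a
    status above 128 usually points at a signal.
    """
    if code > 128:
        signum = code - 128
        try:
            # signal.Signals maps a signal number to its symbolic name.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited normally with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))
```

For the status in this log, 137, the helper reports signal 9, SIGKILL.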
Crunch sometimes sends SIGKILL to the job when things go very wrong, but it logs when it does. If a process stops from SIGKILL with no warning, it almost certainly means that Linux sent SIGKILL directly because the system was out of RAM.
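One way to confirm the out-of-memory theory is to look for OOM killer messages in the kernel log on the compute node. The sketch below only filters lines of text that you supply, for example the saved output of dmesg. The function name `find_oom_lines` and the two search phrases are assumptions for illustration; the exact wording of kernel OOM messages varies between kernel versions, so adjust the pattern to what your node actually logs.

```python
import re

# Phrases that commonly appear in kernel OOM killer messages.
# Assumption: wording differs across kernel versions.
OOM_PATTERN = re.compile(r"oom-killer|out of memory", re.IGNORECASE)


def find_oom_lines(lines):
    """Return the lines that look like kernel OOM killer messages."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


if __name__ == "__main__":
    import sys

    # Usage: dmesg | python find_oom.py
    for hit in find_oom_lines(sys.stdin):
        print(hit)
```

If this prints nothing for the time the job died, the kernel log may have rotated, or the kill may have come from somewhere else.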
Use less RAM or request a bigger node. ;)