Bug #8803: [Crunch] Job interrupted by unknown SIGTERM, not retried - Arvados

Bug #8803

closed

[Crunch] Job interrupted by unknown SIGTERM, not retried

Added by Bryan Cosca almost 10 years ago. Updated about 6 years ago.

Status: Closed
Priority: Normal
Assigned To:
Category:
Target version:
Story points:

Description

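While running a GATK Queue indel realigner job, the task was terminated by signal 15 and the job failed permanently instead of being retried: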

instance: https://workbench.wx7k5.arvadosapi.com/pipeline_instances/wx7k5-d1hrv-58vf4mow54w8k7h
parent job: wx7k5-8i9sb-ymacd9y4somp5pd
child job: wx7k5-8i9sb-9emd52krip2qqvz

2016-03-25_14:47:47 wx7k5-8i9sb-9emd52krip2qqvz 49358 0 stderr INFO  14:47:47,218 ProgressMeter -     12:92505219   2.7600585E7    16.2 m      35.0 s       37.1%    43.5 m      27.4 m 
2016-03-25_14:47:47 wx7k5-8i9sb-9emd52krip2qqvz 49358 0 stderr crunchstat: keepcalls 0 put 40092 get -- interval 10.0000 seconds 0 put 418 get
2016-03-25_14:47:47 wx7k5-8i9sb-9emd52krip2qqvz 49358 0 stderr crunchstat: net:keep0 0 tx 6821827566 rx -- interval 10.0000 seconds 0 tx 134217728 rx
2016-03-25_14:47:47 wx7k5-8i9sb-9emd52krip2qqvz 49358 0 stderr crunchstat: keepcache 39988 hit 104 miss -- interval 10.0000 seconds 416 hit 2 miss
2016-03-25_14:47:47 wx7k5-8i9sb-9emd52krip2qqvz 49358 0 stderr crunchstat: fuseops 0 write 20590 read -- interval 10.0000 seconds 0 write 209 read
2016-03-25_14:47:47 wx7k5-8i9sb-9emd52krip2qqvz 49358 0 stderr crunchstat: blkio:0:0 0 write 2668644995 read -- interval 10.0000 seconds 0 write 27394048 read
2016-03-25_14:47:56 wx7k5-8i9sb-9emd52krip2qqvz 49358 0 stderr crunchstat: mem 5316657152 cache 1 pgmajfault 6211383296 rss
2016-03-25_14:47:56 wx7k5-8i9sb-9emd52krip2qqvz 49358 0 stderr crunchstat: cpu 944.6800 user 7.8400 sys 4 cpus -- interval 9.9998 seconds 9.9300 user 0.0400 sys
2016-03-25_14:47:56 wx7k5-8i9sb-9emd52krip2qqvz 49358 0 stderr crunchstat: net:eth0 39157 tx 1118893 rx -- interval 9.9997 seconds 0 tx 0 rx
2016-03-25_14:47:57 wx7k5-8i9sb-9emd52krip2qqvz 49358 0 stderr crunchstat: keepcalls 0 put 40518 get -- interval 10.0000 seconds 0 put 426 get
2016-03-25_14:47:57 wx7k5-8i9sb-9emd52krip2qqvz 49358 0 stderr crunchstat: net:keep0 0 tx 6821827566 rx -- interval 10.0000 seconds 0 tx 0 rx
2016-03-25_14:47:57 wx7k5-8i9sb-9emd52krip2qqvz 49358 0 stderr crunchstat: keepcache 40414 hit 104 miss -- interval 10.0000 seconds 426 hit 0 miss
2016-03-25_14:47:57 wx7k5-8i9sb-9emd52krip2qqvz 49358 0 stderr crunchstat: fuseops 0 write 20803 read -- interval 10.0000 seconds 0 write 213 read
2016-03-25_14:47:57 wx7k5-8i9sb-9emd52krip2qqvz 49358 0 stderr crunchstat: blkio:0:0 0 write 2696563331 read -- interval 10.0000 seconds 0 write 27918336 read
2016-03-25_14:47:58 wx7k5-8i9sb-9emd52krip2qqvz 49358 0 stderr run-command: terminating on signal 15
2016-03-25_14:47:59 wx7k5-8i9sb-9emd52krip2qqvz 49358 0 stderr srun: error: compute27: task 0: Exited with exit code 2
2016-03-25_14:47:59 wx7k5-8i9sb-9emd52krip2qqvz 49358 0 child 50371 on compute27.1 exit 2 success=
2016-03-25_14:47:59 wx7k5-8i9sb-9emd52krip2qqvz 49358 0 ERROR: Task process exited 2, but never updated its task record to indicate success and record its output.
2016-03-25_14:47:59 wx7k5-8i9sb-9emd52krip2qqvz 49358 0 failure (#1, permanent) after 1014 seconds
2016-03-25_14:47:59 wx7k5-8i9sb-9emd52krip2qqvz 49358 0 task output (0 bytes): 
2016-03-25_14:47:59 wx7k5-8i9sb-9emd52krip2qqvz 49358  status: 0 done, 0 running, 1 todo

Updated by Brett Smith almost 10 years ago

Project changed from 35 to Arvados
Subject changed from run-command exits signal 15 to [Crunch] Job interrupted by unknown SIGTERM, not retried

Right now the best explanation I can come up with is that the node was shut down by forces outside our control (e.g., planned maintenance). Every running process gets SIGTERM (signal 15) when the host is shut down, so that would explain why this came out of nowhere.

If I'm right that the node shut down, the job should've been retried. I'm not sure where that handling should go, though. In the Crunch scripts themselves? Somewhere in Crunch?

Updated by Tom Clegg almost 10 years ago

How should we decide whether SIGTERM comes from a system shutdown -- as opposed to a SIGTERM from crunch-job trying to cancel the job, for example, or a program sending itself SIGTERM? (Even detecting SIGTERM in the first place looks slightly tricky with run-command here, since it seems to swallow the "exited on signal" part of exitstatus after logging it.)
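
To illustrate the detection half of this question, here is a minimal Python sketch of how a wrapper could tell "killed by a signal" apart from an ordinary nonzero exit. It relies on the documented behavior of Python's subprocess module, where a negative returncode of -N means the child was terminated by signal N on POSIX. The helper name `describe_returncode` is hypothetical and is not part of run-command or crunch-job. It only recovers which signal arrived, not who sent it, so it does not answer the shutdown-versus-cancel question.

```python
import signal


def describe_returncode(returncode):
    """Summarize a subprocess.Popen.returncode value.

    Hypothetical helper, not part of run-command or crunch-job.
    On POSIX, subprocess reports a child killed by signal N as -N.
    """
    if returncode is None:
        return "still running"
    if returncode < 0:
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known to this platform's signal module.
            name = "unknown signal"
        return "killed by signal {} ({})".format(signum, name)
    return "exited with code {}".format(returncode)
```

In the log above, the task is reported as "exit 2" right after run-command logs "terminating on signal 15". That looks consistent with the wrapper handling the signal itself and then exiting normally, in which case the parent only ever sees an ordinary exit code and a check like this one would not fire.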

Perhaps, rather than put too much stock in exit status, crunch-job should do a sanity check on the worker node after a task fails -- at least in cases where the task failure is about to cause the job to fail. If that sanity check fails, the task failure itself was probably an infrastructure problem, so it would be appropriate to return tempfail to crunch-dispatch.
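
The proposal above could be sketched roughly as follows. This is only an illustration of the decision logic in Python, not crunch-job's actual code: `classify_task_failure`, the `node_is_healthy` callback, and the returned strings are all hypothetical names, and what a real sanity check would probe on the worker node is left open.

```python
def classify_task_failure(node_is_healthy, failure_would_fail_job=True):
    """Decide how to report a failed task (hypothetical sketch).

    node_is_healthy: zero-argument callable returning True if the worker
        node passes a sanity check. What it checks is an assumption left
        to the caller.
    failure_would_fail_job: only check the node when this task failure
        is about to cause the whole job to fail.
    """
    if not failure_would_fail_job:
        return "task-failed"
    try:
        healthy = node_is_healthy()
    except Exception:
        # Being unable to run the check at all is treated as a sign that
        # the node, not the task, is the problem.
        healthy = False
    if healthy:
        # The node looks fine, so the failure is blamed on the task.
        return "permanent-failure"
    # The node looks broken, so report a temporary failure for retry.
    return "tempfail"
```

The point of this shape is that the retry decision depends on the state of the node after the failure rather than on the task's exit status, which sidesteps the question of who sent the SIGTERM.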

Updated by Peter Amstutz about 6 years ago

Status changed from New to Closed
