Bug #8191 (closed)
[Crunch] Tool fails and srun reports error: Abandoning IO 60 secs after job shutdown initiated
Description
It looks like the job finished prematurely and no output was recorded because of this error:
2016-01-11_22:17:38 wx7k5-8i9sb-3m1ey3oj3sibv7l 48530 1 stderr 00:07:30.881\011\011\011X:85419918
2016-01-11_22:17:38 wx7k5-8i9sb-3m1ey3oj3sibv7l 48530 1 stderr ERROR: ALT field does not match
2016-01-11_22:17:38 wx7k5-8i9sb-3m1ey3oj3sibv7l 48530 1 stderr \011VCF entry dbsnp_138\011X\01185470602\011rs201008478\011TC\011T\011.\011.\011CAF=[0.9553,0.04474];COMMON=1;INT;KGPROD;KGPhase1;KGPilot123;RS=201008478;RSPOS=85470603;SAO=0;SSR=0;VC=DIV;VP=0x05000008000110001c000200;WGT=1;dbSNPBuildID=137
2016-01-11_22:17:38 wx7k5-8i9sb-3m1ey3oj3sibv7l 48530 1 stderr srun: error: Abandoning IO 60 secs after job shutdown initiated
2016-01-11_22:17:38 wx7k5-8i9sb-3m1ey3oj3sibv7l 48530 1 child 50302 on compute2.1 exit 1 success=
2016-01-11_22:17:38 wx7k5-8i9sb-3m1ey3oj3sibv7l 48530 1 ERROR: Task process exited 1, but never updated its task record to indicate success and record its output.
2016-01-11_22:17:39 wx7k5-8i9sb-3m1ey3oj3sibv7l 48530 1 failure (#1, permanent) after 1009 seconds
Updated by Bryan Cosca about 9 years ago
I reran the job with SDK version 722e147756526579ba32a31f967e9e00d47fd3ed
(previously I used 92768ce858673678aa7924f83ad41e2a9f8dd678):
https://workbench.wx7k5.arvadosapi.com/pipeline_instances/wx7k5-d1hrv-ejxs4nxc0dy21xr
and it worked; it's no longer blocked.
Updated by Brett Smith about 9 years ago
- Target version set to Arvados Future Sprints
Updated by Bryan Cosca about 9 years ago
This job used 722e147756526579ba32a31f967e9e00d47fd3ed
and failed with the same error.
Updated by Bryan Cosca about 9 years ago
Also, I think it's currently blocking #7933. It's a little hard to tell, but I think this is making the job end prematurely.
Updated by Tom Clegg about 9 years ago
Two possible explanations:
- It takes >60 seconds to process the buffered stderr after the slurm jobstep exits. The process already exited 1 for some unrelated reason, but we don't see the relevant part of stderr because slurm cut us off before we got to it.
- The task process forks/detaches a child process that is still running after the task process exits, and stderr is still coming from that daemon process for >= 60s when slurm decides something is wrong. (But: wouldn't "docker run" shut down the container and kill off any such daemon processes before exiting?)
Updated by Bryan Cosca about 9 years ago
https://workbench.wx7k5.arvadosapi.com/jobs/wx7k5-8i9sb-6k7kp8hgr81hicn#Status
This job prints its stderr to a file instead, and the error does not show up.
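For reference, a minimal sketch of that workaround in the task's shell command; the SnpSift invocation follows its documented usage, and the file names are illustrative rather than the job's actual command line:

    # Write stderr to a file kept with the task's work,
    # instead of letting srun forward it over the network.
    java -jar SnpSift.jar annotate dbsnp_138.vcf input.vcf \
        > annotated.vcf 2> snpsift.stderr.log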
Updated by Brett Smith almost 9 years ago
- Subject changed from "srun: error: Abandoning IO 60 secs after job shutdown initiated" to "[Crunch] Tool fails and srun reports error: Abandoning IO 60 secs after job shutdown initiated"
Bryan Cosca wrote:
https://workbench.wx7k5.arvadosapi.com/jobs/wx7k5-8i9sb-6k7kp8hgr81hicn#Status
This job prints its stderr to a file instead, and the error does not show up.
I think I'm convinced this is basically a problem in snpsift, and SLURM is actually doing what we want it to do.
snpsift writes to stderr so prolifically that SLURM can't send it all over the network within 60 seconds of snpsift finishing.
It's possible that these things are related: that snpsift is exiting with an error because it can't write to stderr immediately, because the buffer is full, or something like that. But I don't think we want to turn off SLURM's I/O timeout to fix this: that could threaten the stability of the cluster more generally.
Writing stderr to a file as you've done seems like a decent fix. Is all that stderr actually useful, though? If not, you might consider calling snpsift with switches to turn down the messages, or piping its stderr through grep to filter out some of the less useful messages.
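For example (a sketch only; the SnpSift command line and the grep pattern are illustrative, since the job's actual command isn't shown in this ticket):

    # Drop the per-position progress lines (e.g. "00:07:30.881 ... X:85419918")
    # from stderr before srun has to ship them, while keeping real errors visible.
    # Uses bash process substitution, so this requires bash, not plain sh.
    java -jar SnpSift.jar annotate dbsnp_138.vcf input.vcf \
        2> >(grep -vE '^[0-9]{2}:[0-9]{2}:[0-9]{2}' >&2) > annotated.vcf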
Updated by Brett Smith almost 9 years ago
- Target version deleted (Arvados Future Sprints)