Bug #8191 (closed)
[Crunch] Tool fails and srun reports error: Abandoning IO 60 secs after job shutdown initiated
Description
It looks like the job finished prematurely and no output was recorded because of this error:
2016-01-11_22:17:38 wx7k5-8i9sb-3m1ey3oj3sibv7l 48530 1 stderr 00:07:30.881\011\011\011X:85419918
2016-01-11_22:17:38 wx7k5-8i9sb-3m1ey3oj3sibv7l 48530 1 stderr ERROR: ALT field does not match
2016-01-11_22:17:38 wx7k5-8i9sb-3m1ey3oj3sibv7l 48530 1 stderr \011VCF entry dbsnp_138\011X\01185470602\011rs201008478\011TC\011T\011.\011.\011CAF=[0.9553,0.04474];COMMON=1;INT;KGPROD;KGPhase1;KGPilot123;RS=201008478;RSPOS=85470603;SAO=0;SSR=0;VC=DIV;VP=0x05000008000110001c000200;WGT=1;dbSNPBuildID=137
2016-01-11_22:17:38 wx7k5-8i9sb-3m1ey3oj3sibv7l 48530 1 stderr srun: error: Abandoning IO 60 secs after job shutdown initiated
2016-01-11_22:17:38 wx7k5-8i9sb-3m1ey3oj3sibv7l 48530 1 child 50302 on compute2.1 exit 1 success=
2016-01-11_22:17:38 wx7k5-8i9sb-3m1ey3oj3sibv7l 48530 1 ERROR: Task process exited 1, but never updated its task record to indicate success and record its output.
2016-01-11_22:17:39 wx7k5-8i9sb-3m1ey3oj3sibv7l 48530 1 failure (#1, permanent) after 1009 seconds
Updated by Bryan Cosca about 9 years ago
I reran the job with SDK version 722e147756526579ba32a31f967e9e00d47fd3ed
(previously I used 92768ce858673678aa7924f83ad41e2a9f8dd678):
https://workbench.wx7k5.arvadosapi.com/pipeline_instances/wx7k5-d1hrv-ejxs4nxc0dy21xr
and it worked; it's no longer blocked.
Updated by Brett Smith about 9 years ago
- Target version set to Arvados Future Sprints
Updated by Bryan Cosca about 9 years ago
This job used 722e147756526579ba32a31f967e9e00d47fd3ed
and failed with the same error.
Updated by Bryan Cosca about 9 years ago
Also, I think it's currently blocking #7933. It's a little hard to tell, but I think this is making the job end prematurely.
Updated by Tom Clegg about 9 years ago
Two possible explanations:
- It takes >60 seconds to process the buffered stderr after the slurm jobstep exits. The process already exited 1 for some unrelated reason, but we don't see the relevant part of stderr because slurm cut us off before we got to it.
- The task process forks/detaches a child process that is still running after the task process exits, and stderr is still coming from that daemon process for >= 60s when slurm decides something is wrong. (But: wouldn't "docker run" shut down the container and kill off any such daemon processes before exiting?)
Updated by Bryan Cosca about 9 years ago
https://workbench.wx7k5.arvadosapi.com/jobs/wx7k5-8i9sb-6k7kp8hgr81hicn#Status
This job prints its stderr to a file instead, and the error does not show up.
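For reference, a minimal sketch of that workaround in the task's shell command; the SnpSift invocation follows its documented usage, and the file names are illustrative rather than the job's actual command line:

    # Write stderr to a file kept with the task's work,
    # instead of letting srun forward it over the network.
    java -jar SnpSift.jar annotate dbsnp_138.vcf input.vcf \
        > annotated.vcf 2> snpsift.stderr.log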
Updated by Brett Smith almost 9 years ago
- Subject changed from "srun: error: Abandoning IO 60 secs after job shutdown initiated" to "[Crunch] Tool fails and srun reports error: Abandoning IO 60 secs after job shutdown initiated"
Bryan Cosca wrote:
https://workbench.wx7k5.arvadosapi.com/jobs/wx7k5-8i9sb-6k7kp8hgr81hicn#Status
This job prints its stderr to a file instead, and the error does not show up.
I think I'm convinced this is basically a problem in snpsift, and SLURM is actually doing what we want it to do.
snpsift writes to stderr so prolifically that SLURM can't send it all over the network within 60 seconds of snpsift finishing.
It's possible that these things are related: that snpsift is exiting with an error because it can't write to stderr immediately, because the buffer is full, or something like that. But I don't think we want to turn off SLURM's I/O timeout to fix this: that could threaten the stability of the cluster more generally.
Writing stderr to a file as you've done seems like a decent fix. Is all that stderr actually useful, though? If not, you might consider calling snpsift with switches to turn down the messages, or piping its stderr through grep to filter out some of the less useful messages.
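For example (a sketch only; the SnpSift command line and the grep pattern are illustrative, since the job's actual command isn't shown in this ticket):

    # Drop the per-position progress lines (e.g. "00:07:30.881 ... X:85419918")
    # from stderr before srun has to ship them, while keeping real errors visible.
    # Uses bash process substitution, so this requires bash, not plain sh.
    java -jar SnpSift.jar annotate dbsnp_138.vcf input.vcf \
        2> >(grep -vE '^[0-9]{2}:[0-9]{2}:[0-9]{2}' >&2) > annotated.vcf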
Updated by Brett Smith almost 9 years ago
- Target version deleted (Arvados Future Sprints)