Bug #11190: Containers seem to run more than once, which isn't supposed to happen - Arvados

2017-03-01_17:27:35.24495 2017/03/01 17:27:35 Submitting container tb05z-dz642-eie1eal1059y9bb to slurm
2017-03-01_17:27:35.24517 2017/03/01 17:27:35 exec sbatch ["sbatch" "--share" "--workdir=/tmp" "--job-name=tb05z-dz642-eie1eal1059y9bb" "--mem-per-cpu=6250" "--cpus-per-task=8"]
2017-03-01_17:27:35.35069 2017/03/01 17:27:35 sbatch succeeded: "Submitted batch job 2948" 
2017-03-01_17:27:35.35071 2017/03/01 17:27:35 Start monitoring container tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:29:37.15184 2017/03/01 17:29:37 debug: runner is handling updates slowly, discarded previous update for tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:29:42.32428 2017/03/01 17:29:42 debug: runner is handling updates slowly, discarded previous update for tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:29:46.97205 2017/03/01 17:29:46 debug: runner is handling updates slowly, discarded previous update for tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:29:51.83317 2017/03/01 17:29:51 debug: runner is handling updates slowly, discarded previous update for tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:29:56.42094 2017/03/01 17:29:56 debug: runner is handling updates slowly, discarded previous update for tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:29:57.89127 2017/03/01 17:29:57 Done monitoring container tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:30:01.25862 2017/03/01 17:30:01 Submitting container tb05z-dz642-eie1eal1059y9bb to slurm
2017-03-01_17:30:01.25865 2017/03/01 17:30:01 exec sbatch ["sbatch" "--share" "--workdir=/tmp" "--job-name=tb05z-dz642-eie1eal1059y9bb" "--mem-per-cpu=6250" "--cpus-per-task=8"]
2017-03-01_17:30:01.32075 2017/03/01 17:30:01 sbatch succeeded: "Submitted batch job 2949" 
2017-03-01_17:30:01.32077 2017/03/01 17:30:01 Start monitoring container tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:30:06.85462 2017/03/01 17:30:06 Dispatcher says container tb05z-dz642-eie1eal1059y9bb is done: cancel slurm job
2017-03-01_17:30:07.23672 2017/03/01 17:30:07 container tb05z-dz642-eie1eal1059y9bb is still in squeue after scancel
2017-03-01_17:30:13.53918 2017/03/01 17:30:13 Done monitoring container tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:31:02.73009 2017/03/01 17:31:02 Submitting container tb05z-dz642-eie1eal1059y9bb to slurm
2017-03-01_17:31:02.73013 2017/03/01 17:31:02 exec sbatch ["sbatch" "--share" "--workdir=/tmp" "--job-name=tb05z-dz642-eie1eal1059y9bb" "--mem-per-cpu=6250" "--cpus-per-task=8"]
2017-03-01_17:31:02.76251 2017/03/01 17:31:02 sbatch succeeded: "Submitted batch job 2950" 
2017-03-01_17:31:02.76253 2017/03/01 17:31:02 Start monitoring container tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:32:35.91008 2017/03/01 17:32:35 Done monitoring container tb05z-dz642-eie1eal1059y9bb

Updated by Tom Morris about 8 years ago

Target version set to 2017-03-29 sprint

Updated by Tom Clegg about 8 years ago

Category set to Crunch
Assigned To set to Tom Clegg

Updated by Tom Clegg about 8 years ago

Target version changed from 2017-03-29 sprint to 2017-04-12 sprint

Updated by Peter Amstutz about 8 years ago

I wonder if we should make the state transition to "Running" as soon as crunch-run has started doing anything substantive. E.g., if it fails to load the Docker image, that shouldn't put it back into Locked state; it should go Running->Cancelled.
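
To make the proposal concrete, here is a minimal sketch of the transition rule it describes. It is not the API server's or crunch-run's actual code: the transition table, the "Queued" and "Complete" state names, and the helpers `transition` and `start_container` are assumptions made for illustration. Only the Locked, Running and Cancelled states come from the comment above.

```python
from enum import Enum


class State(Enum):
    QUEUED = "Queued"        # assumed name, not mentioned in this thread
    LOCKED = "Locked"
    RUNNING = "Running"
    COMPLETE = "Complete"    # assumed name, not mentioned in this thread
    CANCELLED = "Cancelled"


# Hypothetical transition table for this sketch. Note that Running never
# goes back to Locked: once a container is Running, it can only finish.
ALLOWED = {
    State.QUEUED: {State.LOCKED},
    State.LOCKED: {State.QUEUED, State.RUNNING},
    State.RUNNING: {State.COMPLETE, State.CANCELLED},
    State.COMPLETE: set(),
    State.CANCELLED: set(),
}


def transition(current, target):
    """Return the new state, or raise if the sketch's table forbids the move."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def start_container(load_image):
    """Simulate startup under the proposed rule (hypothetical helper).

    The container is marked Running before any substantive work, so a
    failure while loading the image ends in Cancelled, not Locked.
    """
    state = transition(State.LOCKED, State.RUNNING)
    try:
        load_image()
    except RuntimeError:
        return transition(state, State.CANCELLED)
    return state
```

With this rule, a startup failure can no longer trigger another dispatch attempt, at the cost of counting the failed start as a real attempt.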

Updated by Tom Clegg about 8 years ago

Target version changed from 2017-04-12 sprint to 2017-04-26 sprint

Updated by Tom Clegg almost 8 years ago

Allowing multiple dispatch attempts is a deliberate feature: when the dispatch/startup infrastructure fails early enough that it's absolutely certain the container has never been started, we don't count an "attempt" against a container request.

Currently there is no limit on the number of lock-attempt-unlock cycles, though. We should have a site-configurable limit. This counter doesn't have to be visible to anyone except the api server, although it would be useful to expose it to admin clients for troubleshooting purposes.
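
A site-configurable limit of this kind could look roughly like the sketch below. Everything in it is hypothetical: the `ContainerRecord` class, the `max_lock_cycles` setting, the `lock_count` field, and in particular the choice to cancel the container once the limit is reached, which the comment above does not specify.

```python
class ContainerRecord:
    """Toy model of a container with a bounded number of lock/unlock cycles."""

    def __init__(self, uuid, max_lock_cycles):
        self.uuid = uuid
        self.max_lock_cycles = max_lock_cycles  # assumed site-wide setting
        self.lock_count = 0                     # counter kept by the server
        self.state = "Queued"

    def lock(self):
        """A dispatcher claims the container; each claim counts as one cycle."""
        if self.state != "Queued":
            raise ValueError(f"cannot lock container in state {self.state}")
        self.lock_count += 1
        self.state = "Locked"

    def unlock(self):
        """The dispatcher gives up before the container has started.

        Assumption: once the limit is used up, the container is cancelled
        instead of being returned to the queue.
        """
        if self.state != "Locked":
            raise ValueError(f"cannot unlock container in state {self.state}")
        if self.lock_count >= self.max_lock_cycles:
            self.state = "Cancelled"
        else:
            self.state = "Queued"
```

With `max_lock_cycles=3`, a container that fails to start three times in a row, like the three sbatch submissions in the log above, would end up Cancelled after the third unlock rather than being queued again. Exposing `lock_count` to admin clients would cover the troubleshooting case.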

Updated by Tom Morris almost 8 years ago

Target version changed from 2017-04-26 sprint to 2017-05-10 sprint

Updated by Tom Clegg almost 8 years ago

Status changed from New to Resolved
