Bug #11190

closed

Containers seem to run more than once, which isn't supposed to happen

Added by Tom Clegg almost 8 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Start date: 03/01/2017
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -

Description


Subtasks 1 (0 open, 1 closed)

Task #11263: Review (Resolved, Peter Amstutz, 03/01/2017)

Related issues 3 (0 open, 3 closed)

Related to Arvados - Bug #11166: [Crunch2] crunchrun.go should avoid name collisions when creating log collections (Resolved, Tom Clegg, 02/24/2017)
Related to Arvados - Bug #11220: [SDKs] Fix misleading arv-mount/pysdk error messages by removing obsolete "fetch manifest from Keep" code (Resolved, Tom Clegg, 10/20/2017)
Related to Arvados - Bug #11561: [API] Limit number of lock/unlock cycles for a given container (Resolved, Peter Amstutz, 04/26/2017)

#2

Updated by Tom Clegg almost 8 years ago

2017-03-01_17:27:35.24495 2017/03/01 17:27:35 Submitting container tb05z-dz642-eie1eal1059y9bb to slurm
2017-03-01_17:27:35.24517 2017/03/01 17:27:35 exec sbatch ["sbatch" "--share" "--workdir=/tmp" "--job-name=tb05z-dz642-eie1eal1059y9bb" "--mem-per-cpu=6250" "--cpus-per-task=8"]
2017-03-01_17:27:35.35069 2017/03/01 17:27:35 sbatch succeeded: "Submitted batch job 2948" 
2017-03-01_17:27:35.35071 2017/03/01 17:27:35 Start monitoring container tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:29:37.15184 2017/03/01 17:29:37 debug: runner is handling updates slowly, discarded previous update for tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:29:42.32428 2017/03/01 17:29:42 debug: runner is handling updates slowly, discarded previous update for tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:29:46.97205 2017/03/01 17:29:46 debug: runner is handling updates slowly, discarded previous update for tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:29:51.83317 2017/03/01 17:29:51 debug: runner is handling updates slowly, discarded previous update for tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:29:56.42094 2017/03/01 17:29:56 debug: runner is handling updates slowly, discarded previous update for tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:29:57.89127 2017/03/01 17:29:57 Done monitoring container tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:30:01.25862 2017/03/01 17:30:01 Submitting container tb05z-dz642-eie1eal1059y9bb to slurm
2017-03-01_17:30:01.25865 2017/03/01 17:30:01 exec sbatch ["sbatch" "--share" "--workdir=/tmp" "--job-name=tb05z-dz642-eie1eal1059y9bb" "--mem-per-cpu=6250" "--cpus-per-task=8"]
2017-03-01_17:30:01.32075 2017/03/01 17:30:01 sbatch succeeded: "Submitted batch job 2949" 
2017-03-01_17:30:01.32077 2017/03/01 17:30:01 Start monitoring container tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:30:06.85462 2017/03/01 17:30:06 Dispatcher says container tb05z-dz642-eie1eal1059y9bb is done: cancel slurm job
2017-03-01_17:30:07.23672 2017/03/01 17:30:07 container tb05z-dz642-eie1eal1059y9bb is still in squeue after scancel
2017-03-01_17:30:13.53918 2017/03/01 17:30:13 Done monitoring container tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:31:02.73009 2017/03/01 17:31:02 Submitting container tb05z-dz642-eie1eal1059y9bb to slurm
2017-03-01_17:31:02.73013 2017/03/01 17:31:02 exec sbatch ["sbatch" "--share" "--workdir=/tmp" "--job-name=tb05z-dz642-eie1eal1059y9bb" "--mem-per-cpu=6250" "--cpus-per-task=8"]
2017-03-01_17:31:02.76251 2017/03/01 17:31:02 sbatch succeeded: "Submitted batch job 2950" 
2017-03-01_17:31:02.76253 2017/03/01 17:31:02 Start monitoring container tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:32:35.91008 2017/03/01 17:32:35 Done monitoring container tb05z-dz642-eie1eal1059y9bb
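
A quick way to confirm repeated dispatches in a log like the one above is to count the "Submitting container ... to slurm" messages per container UUID. The following is only a sketch: the function names `count_submissions` and `repeated` are made up for illustration, and the pattern assumes the message wording shown in this log.

```python
import re
from collections import Counter

# Matches the dispatcher message seen in the log above.
SUBMIT_RE = re.compile(r"Submitting container (\S+) to slurm")


def count_submissions(lines):
    """Return a Counter mapping container UUID to number of submissions."""
    counts = Counter()
    for line in lines:
        match = SUBMIT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def repeated(counts):
    """Return only the containers that were submitted more than once."""
    return {uuid: n for uuid, n in counts.items() if n > 1}
```

Fed the log above, `repeated(count_submissions(lines))` would report `tb05z-dz642-eie1eal1059y9bb` with three submissions (slurm jobs 2948, 2949 and 2950).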
#3

Updated by Tom Morris almost 8 years ago

  • Target version set to 2017-03-29 sprint
#4

Updated by Tom Clegg almost 8 years ago

  • Category set to Crunch
  • Assigned To set to Tom Clegg
#5

Updated by Tom Clegg over 7 years ago

  • Target version changed from 2017-03-29 sprint to 2017-04-12 sprint
#6

Updated by Peter Amstutz over 7 years ago

I wonder if we should make the state transition to "Running" happen as soon as crunch-run has started doing anything substantive. E.g., if it fails to load the Docker image, that shouldn't put it back into the Locked state; it should go Running->Cancelled.
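
To make the proposal concrete, here is a rough sketch of the failure handling it implies. The transition table is a simplified assumption for illustration, not copied from the API server, and `transition` and `state_after_failure` are hypothetical names.

```python
# Assumed, simplified transition table (illustrative only).
ALLOWED = {
    "Queued": {"Locked", "Cancelled"},
    "Locked": {"Queued", "Running", "Cancelled"},
    "Running": {"Complete", "Cancelled"},
    "Complete": set(),
    "Cancelled": set(),
}


def transition(state, new_state):
    """Return new_state if the move is allowed, else raise ValueError."""
    if new_state not in ALLOWED[state]:
        raise ValueError(f"invalid transition {state} -> {new_state}")
    return new_state


def state_after_failure(state):
    """Pick the next state when the dispatch or run attempt fails.

    Locked: nothing substantive has happened, so unlock and allow a retry.
    Running: crunch-run already started work, so cancel instead of retrying.
    """
    if state == "Locked":
        return transition(state, "Queued")
    if state == "Running":
        return transition(state, "Cancelled")
    raise ValueError(f"no failure handling defined for state {state}")


# Under the proposal, crunch-run would move to Running before loading the
# Docker image, so a failed image load ends in Cancelled rather than Queued.
state = transition("Locked", "Running")
state = state_after_failure(state)
```

With the earlier transition, a failed image load is handled from Running, so the container cannot bounce back through Locked and be dispatched again.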

#7

Updated by Tom Clegg over 7 years ago

  • Target version changed from 2017-04-12 sprint to 2017-04-26 sprint
#8

Updated by Tom Clegg over 7 years ago

Allowing multiple dispatch attempts is a deliberate feature: when the dispatch/startup infrastructure fails early enough that it's absolutely certain the container has never been started, we don't count an "attempt" against a container request.

Currently there is no limit on the number of lock-attempt-unlock cycles, though. We should have a site-configurable limit. This counter doesn't have to be visible to anyone except the API server, although it would be useful to expose it to admin clients for troubleshooting purposes.
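
A minimal sketch of such a limit, assuming a per-container counter that is incremented on each lock. The names `max_lock_cycles`, `lock_count` and `LockLimitExceeded` are hypothetical, and what the server should do once the limit is hit (for example, cancel the container) is left open here.

```python
class LockLimitExceeded(Exception):
    """Raised when a container has used up its lock/unlock cycles."""


class Container:
    def __init__(self, max_lock_cycles):
        # max_lock_cycles stands in for the site-configurable limit.
        self.state = "Queued"
        self.lock_count = 0
        self.max_lock_cycles = max_lock_cycles

    def lock(self):
        if self.state != "Queued":
            raise ValueError(f"cannot lock from state {self.state}")
        if self.lock_count >= self.max_lock_cycles:
            raise LockLimitExceeded(
                f"locked {self.lock_count} times, limit is {self.max_lock_cycles}"
            )
        self.lock_count += 1
        self.state = "Locked"

    def unlock(self):
        if self.state != "Locked":
            raise ValueError(f"cannot unlock from state {self.state}")
        # Unlocking returns the container to the queue for another attempt.
        self.state = "Queued"
```

Counting at lock time rather than at unlock time means a dispatcher that crashes while holding the lock still consumes an attempt.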

#9

Updated by Tom Morris over 7 years ago

  • Target version changed from 2017-04-26 sprint to 2017-05-10 sprint
#10

Updated by Tom Clegg over 7 years ago

  • Status changed from New to Resolved