Feature #6518

closed

[Crunch] [Crunch2] Dispatch containers via slurm

Added by Tom Clegg over 9 years ago. Updated almost 9 years ago.

Status: Resolved
Priority: Normal
Assigned To: Radhika Chippada
Category: Crunch
Target version:
Start date: 07/08/2015
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 1.0
Release:
Release relationship: Auto

Description

When containers appear in the queue, use SLURM to execute them on worker nodes.

For now, the queue is arvados.v1.containers.queue (much like the Crunch1 job queue).

From Crunch2 dispatch:

slurm batch mode
  • Use "sinfo" to determine whether it is possible to run the container
  • Submit a batch job to the queue: "echo crunch-run --job {uuid} | sbatch -N1"
  • When container priority changes, use scontrol and scancel to propagate changes to slurm
  • Use strigger to run a cleanup script when a container exits

The cleanup script only has to deal with cases like the node dying before crunch-run has a chance to update the container record to state="Complete".
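The submission step above can be sketched in Go roughly as follows. This is a minimal illustration, not the code from the branch: the function names batchScript and sbatchCmd are hypothetical, the "--job" flag is copied from the command quoted in the description, and using the container UUID as the slurm job name is an assumption. The sketch also adds a "#!/bin/sh" line because sbatch normally rejects a script that does not start with an interpreter line.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// batchScript returns the script text that is fed to sbatch on stdin.
// The "--job" flag mirrors the command quoted in the description.
// The UUID is inserted unquoted, which assumes it contains only
// letters, digits and dashes.
func batchScript(uuid string) string {
	return fmt.Sprintf("#!/bin/sh\nexec crunch-run --job %s\n", uuid)
}

// sbatchCmd builds (but does not run) "sbatch -N1" with the script on stdin.
// Naming the slurm job after the container UUID is an assumption that makes
// it possible to look the job up again later.
func sbatchCmd(uuid string) *exec.Cmd {
	cmd := exec.Command("sbatch", "-N1", "--job-name="+uuid)
	cmd.Stdin = strings.NewReader(batchScript(uuid))
	return cmd
}

func main() {
	uuid := "zzzzz-dz642-queuedcontainer" // placeholder UUID for illustration
	cmd := sbatchCmd(uuid)
	fmt.Println(strings.Join(cmd.Args, " "))
	fmt.Print(batchScript(uuid))
}
```

Building the command separately from running it keeps the sketch testable on a machine without slurm installed; a real dispatcher would call cmd.Run() or cmd.CombinedOutput() and check the error.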


Subtasks 4 (0 open, 4 closed)

Task #8474: Review 6518-crunch2-dispatch-slurm (Resolved, Peter Amstutz, 07/08/2015)
Task #8522: Implement crunch-dispatch-slurm (Resolved, Peter Amstutz, 07/08/2015)
Task #8608: Review tests branch: 6518-crunch2-dispatch-slurm-tests (Resolved, Peter Amstutz, 07/08/2015)
Task #8607: Add tests (Resolved, Radhika Chippada, 07/08/2015)

Related issues 4 (0 open, 4 closed)

Related to Arvados - Story #6282: [Crunch] Write stories for implementation of Crunch v2 (Resolved, Peter Amstutz, 06/23/2015)
Related to Arvados - Feature #7816: [Crunch2] Execute minimal container spec with logging (Resolved, Peter Amstutz, 11/17/2015)
Related to Arvados - Feature #8128: [Crunch2] API support for crunch-dispatch (Resolved, Tom Clegg, 04/28/2016)
Blocked by Arvados - Story #6429: [API] [Crunch2] Implement "containers" and "container requests" tables, models and controllers (Resolved, Peter Amstutz, 12/03/2015)

Actions #1

Updated by Tom Clegg over 9 years ago

  • Tracker changed from Bug to Feature
Actions #2

Updated by Brett Smith about 9 years ago

  • Target version set to Arvados Future Sprints
Actions #3

Updated by Peter Amstutz about 9 years ago

Suggest writing the Crunch 2 job dispatcher as a new set of actors in node manager.

This would enable us to solve the question of communication between the scheduler and cloud node management (#6520).

Node manager already has a lot of the framework we will want, such as concurrency (we can have one actor per job) and a configuration system.

Different schedulers (slurm, sge, kubernetes) can be implemented as modules similarly to how different cloud providers are supported now.

Actions #4

Updated by Peter Amstutz about 9 years ago

More ideas:

Have a "dispatchers" table. Dispatcher processes are responsible for pinging the API server to show they are alive, similar to how it is done for nodes.

A dispatcher claims a container by setting the "dispatcher" field to its UUID. This field can only be set once, and that locks the record so that only the dispatcher can update it.

If a dispatcher stops pinging, the containers it has claimed should be marked as TempFail.

Dispatchers should be able to annotate containers (preferably through links), for example: "I can't run this because I don't have any nodes with 40 GiB of RAM".

Actions #5

Updated by Peter Amstutz about 9 years ago

If we go with the architecture described in #8001, that will be a prerequisite.

Actions #6

Updated by Peter Amstutz about 9 years ago

  • Description updated (diff)
Actions #7

Updated by Peter Amstutz about 9 years ago

#7816 is now the story for actually running containers

Actions #8

Updated by Brett Smith about 9 years ago

  • Target version deleted (Arvados Future Sprints)
  • Release set to 11
Actions #9

Updated by Brett Smith almost 9 years ago

  • Story points set to 3.0
Actions #10

Updated by Peter Amstutz almost 9 years ago

I think we can narrow this down to a 1 point story that just submits to "sbatch" and possibly checks "squeue" for status updates.

Actions #11

Updated by Peter Amstutz almost 9 years ago

  • Story points changed from 3.0 to 1.0
Actions #12

Updated by Peter Amstutz almost 9 years ago

  • Target version set to 2016-03-02 sprint
Actions #13

Updated by Peter Amstutz almost 9 years ago

  • Assigned To set to Peter Amstutz
Actions #14

Updated by Tom Clegg almost 9 years ago

  • Description updated (diff)
Actions #15

Updated by Tom Clegg almost 9 years ago

  • Description updated (diff)
Actions #16

Updated by Radhika Chippada almost 9 years ago

Review feedback for branch 6518-crunch2-dispatch-slurm

crunch-dispatch-slurm.go

  • Comment for runQueuedContainers says “Invoke dispatchLocal for each ticker cycle”. Please update to say “Invoke dispatchSlurm …” instead
  • It would greatly improve the readability of the code if camel case were used consistently for names such as submiterr and stdinerr, as is already done for updateErr. There are several variables that could be updated this way.
  • func strigger: can you please rename it to say what it does, e.g. “set up trigger for when job finishes”?
  • Comment for func run (line 225): please update it to say that it submits the batch command, etc. The current comment is not quite correct (it applies to crunch-dispatch-local).
  • Can you please add comments to the submit and strigger funcs?
  • This comment “#uuid=$(squeue --jobs=$jobid --states=all --format=%j --noheader)” in the shell script seems to be out of sync?
Actions #17

Updated by Peter Amstutz almost 9 years ago

  • Status changed from New to In Progress
Actions #18

Updated by Brett Smith almost 9 years ago

  • Assigned To changed from Peter Amstutz to Radhika Chippada
  • Target version changed from 2016-03-02 sprint to 2016-03-16 sprint
Actions #19

Updated by Radhika Chippada almost 9 years ago

Added tests in branch 6518-crunch2-dispatch-slurm-tests, derived from 6518-crunch2-dispatch-slurm at bf3a2814843a8f7a78592e3fb4c629fc9f4819b9

Actions #20

Updated by Peter Amstutz almost 9 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 50 to 100

Applied in changeset arvados|commit:7bb66fca9371232cc32dd6b365ceb33e926eb0e7.
