Feature #6518
[Crunch] [Crunch2] Dispatch containers via slurm (closed)
100%
Description
When containers appear in the queue, use SLURM to execute them on worker nodes.
For now, the queue is arvados.v1.containers.queue (much like the Crunch1 job queue).
From Crunch2 dispatch:
slurm batch mode
- Use "sinfo" to determine whether it is possible to run the container
- Submit a batch job to the queue: "echo crunch-run --job {uuid} | sbatch -N1"
- When container priority changes, use scontrol and scancel to propagate changes to slurm
- Use strigger to run a cleanup script when a container exits
The cleanup script just has to deal with cases like the node dying before crunch-run has a chance to update the container record to state="Complete"
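The submit and cancel steps above could be sketched in Go roughly as follows. This is a minimal sketch, not the dispatcher's actual code: the helper names (batchScript, sbatchCommand, scancelCommand) are hypothetical, naming the slurm job after the container UUID is an assumption, and the "#!/bin/sh" line is added because sbatch expects a script on stdin to start with an interpreter line.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// batchScript returns the script fed to sbatch on stdin. It wraps the
// "crunch-run --job {uuid}" command from the description in a shell script.
func batchScript(uuid string) string {
	return "#!/bin/sh\nexec crunch-run --job " + uuid + "\n"
}

// sbatchCommand prepares (but does not start) the submission of one
// container. Using the container UUID as the job name is an assumption
// of this sketch; it lets later commands find the job by name.
func sbatchCommand(uuid string) *exec.Cmd {
	cmd := exec.Command("sbatch", "-N1", "--job-name="+uuid)
	cmd.Stdin = strings.NewReader(batchScript(uuid))
	return cmd
}

// scancelCommand prepares the cancellation of the job named after the
// container, for example when its priority drops to zero.
func scancelCommand(uuid string) *exec.Cmd {
	return exec.Command("scancel", "--name="+uuid)
}

func main() {
	uuid := "zzzzz-dz642-000000000000001" // hypothetical container UUID
	fmt.Println(strings.Join(sbatchCommand(uuid).Args, " "))
	fmt.Print(batchScript(uuid))
	fmt.Println(strings.Join(scancelCommand(uuid).Args, " "))
}
```

Calling Run() or CombinedOutput() on the returned commands would perform the actual submission or cancellation on a host where slurm is installed.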
Updated by Brett Smith about 9 years ago
- Target version set to Arvados Future Sprints
Updated by Peter Amstutz about 9 years ago
Suggest writing the crunch 2 job dispatcher as a new set of actors in node manager.
This would enable us to solve the question of communication between the scheduler and cloud node management (#6520).
Node manager already has a lot of the framework we will want like concurrency (can have one actor per job) and a configuration system.
Different schedulers (slurm, sge, kubernetes) can be implemented as modules similarly to how different cloud providers are supported now.
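To illustrate the "schedulers as modules" idea, here is a minimal Go sketch. It is not node manager code (node manager is a separate codebase); the Scheduler interface, the two implementations, and the lookup table are all hypothetical names made up for this example, and job naming by container UUID is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// Scheduler is a hypothetical interface that each scheduler module
// would implement.
type Scheduler interface {
	Name() string
	// SubmitArgs returns the command line used to submit one container.
	SubmitArgs(containerUUID string) []string
}

type slurmScheduler struct{}

func (slurmScheduler) Name() string { return "slurm" }
func (slurmScheduler) SubmitArgs(uuid string) []string {
	return []string{"sbatch", "-N1", "--job-name=" + uuid}
}

type sgeScheduler struct{}

func (sgeScheduler) Name() string { return "sge" }
func (sgeScheduler) SubmitArgs(uuid string) []string {
	return []string{"qsub", "-N", uuid}
}

// schedulers maps a configured name to its module.
var schedulers = map[string]Scheduler{
	"slurm": slurmScheduler{},
	"sge":   sgeScheduler{},
}

// lookupScheduler returns the module for a name, or an error if no
// module with that name has been written yet.
func lookupScheduler(name string) (Scheduler, error) {
	s, ok := schedulers[name]
	if !ok {
		return nil, fmt.Errorf("unknown scheduler %q", name)
	}
	return s, nil
}

func main() {
	for _, name := range []string{"slurm", "sge", "kubernetes"} {
		s, err := lookupScheduler(name)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Println(s.Name()+":", strings.Join(s.SubmitArgs("zzzzz-dz642-000000000000001"), " "))
	}
}
```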
Updated by Peter Amstutz about 9 years ago
More ideas:
Have a "dispatchers" table. Dispatcher processes are responsible for pinging the API server similar to how it is done for nodes to show they are alive.
A dispatcher claims a container by setting the "dispatcher" field to its UUID. This field can only be set once, and that locks the record so that only the dispatcher can update it.
If a dispatcher stops pinging, the containers it has claimed should be marked as TempFail.
Dispatchers should be able to annotate containers (preferably through links) for example "I can't run this because I don't have any nodes with 40 GiB of RAM".
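The claim-once and ping-timeout rules above could look something like this in Go. It is an in-memory sketch of the proposed semantics only, not API server code: the type and function names are hypothetical, the "Queued" and "Running" states are assumptions, and timestamps are plain Unix seconds to keep the example small.

```go
package main

import (
	"errors"
	"fmt"
)

// Container holds the fields relevant to claiming.
type Container struct {
	UUID       string
	Dispatcher string // empty until a dispatcher claims the container
	State      string
}

// Dispatcher records when a dispatcher process last pinged.
type Dispatcher struct {
	UUID     string
	LastPing int64 // Unix seconds
}

var (
	errClaimed  = errors.New("container already claimed")
	errNotOwner = errors.New("only the claiming dispatcher may update")
)

// claim sets the dispatcher field; it can only be set once.
func claim(c *Container, dispatcherUUID string) error {
	if c.Dispatcher != "" {
		return errClaimed
	}
	c.Dispatcher = dispatcherUUID
	return nil
}

// update changes the state, but only on behalf of the claiming dispatcher.
func update(c *Container, dispatcherUUID, state string) error {
	if c.Dispatcher != dispatcherUUID {
		return errNotOwner
	}
	c.State = state
	return nil
}

// expireStale marks unfinished containers of a silent dispatcher as
// TempFail and returns how many were changed.
func expireStale(cs []*Container, d Dispatcher, now, timeout int64) int {
	if now-d.LastPing <= timeout {
		return 0
	}
	n := 0
	for _, c := range cs {
		if c.Dispatcher == d.UUID && c.State != "Complete" {
			c.State = "TempFail"
			n++
		}
	}
	return n
}

func main() {
	c := &Container{UUID: "c1", State: "Queued"}
	fmt.Println(claim(c, "d1"), claim(c, "d2"))
	fmt.Println(update(c, "d1", "Running"))
	fmt.Println(expireStale([]*Container{c}, Dispatcher{UUID: "d1", LastPing: 100}, 1000, 60), c.State)
}
```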
Updated by Peter Amstutz about 9 years ago
If we go with the architecture described in #8001, that will be a prerequisite.
Updated by Peter Amstutz about 9 years ago
#7816 is now the story for actually running containers
Updated by Brett Smith about 9 years ago
- Target version deleted (Arvados Future Sprints)
- Release set to 11
Updated by Peter Amstutz almost 9 years ago
I think we can narrow this down to a 1 point story that just submits to "sbatch" and possibly checks "squeue" for status updates.
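For the "checks squeue" part, a rough Go sketch of the idea: list the job names slurm knows about and compare them with the containers that were dispatched. It assumes jobs are named after container UUIDs, and the function names are hypothetical. The squeue flags are the ones that appear in the shell script comment quoted later in this ticket.

```go
package main

import (
	"fmt"
	"strings"
)

// squeueArgs is the command line that prints one job name per line.
func squeueArgs() []string {
	return []string{"squeue", "--states=all", "--format=%j", "--noheader"}
}

// parseJobNames turns squeue output into a set of job names,
// ignoring blank lines and surrounding whitespace.
func parseJobNames(output string) map[string]bool {
	names := map[string]bool{}
	for _, line := range strings.Split(output, "\n") {
		if name := strings.TrimSpace(line); name != "" {
			names[name] = true
		}
	}
	return names
}

// missing returns the dispatched containers that slurm no longer lists,
// which are candidates for a status update.
func missing(dispatched []string, inQueue map[string]bool) []string {
	var gone []string
	for _, uuid := range dispatched {
		if !inQueue[uuid] {
			gone = append(gone, uuid)
		}
	}
	return gone
}

func main() {
	// Sample output standing in for a real squeue call.
	sample := "zzzzz-dz642-000000000000001\nzzzzz-dz642-000000000000003\n"
	inQueue := parseJobNames(sample)
	dispatched := []string{
		"zzzzz-dz642-000000000000001",
		"zzzzz-dz642-000000000000002",
	}
	fmt.Println(strings.Join(squeueArgs(), " "))
	fmt.Println(missing(dispatched, inQueue))
}
```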
Updated by Peter Amstutz almost 9 years ago
- Story points changed from 3.0 to 1.0
Updated by Peter Amstutz almost 9 years ago
- Target version set to 2016-03-02 sprint
Updated by Radhika Chippada almost 9 years ago
Review feedback for branch 6518-crunch2-dispatch-slurm
crunch-dispatch-slurm.go
- Comment for runQueuedContainers says “Invoke dispatchLocal for each ticker cycle”. Please update to say “Invoke dispatchSlurm …” instead
- It would greatly improve readability of code if camel case is used consistently for names such as submiterr, stdinerr, similar to updateErr etc. There are several variables that could be updated as such.
- func strigger: can you please rename it to say what it does, e.g. “setup trigger for when job finishes”?
- comment for func run (line 225): please update it to say submit batch command etc. The current comment is not quite correct (it applies to crunch-dispatch-local)
- can you please add comments to submit and strigger funcs
- It would be helpful if we added tests that mock slurm commands. I searched a little and found a few links that may come in handy.
- This link says "helperCommand" can be used to mock a command. http://stackoverflow.com/questions/24286683/how-to-test-system-commands-in-go
- https://golang.org/src/os/exec/exec_test.go
- This comment “#uuid=$(squeue --jobs=$jobid --states=all --format=%j --noheader)” in the shell script seems to be out of sync?
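One way to make the slurm calls mockable, sketched in Go. Note this uses plain dependency injection (passing in the function that runs the command) rather than the "helperCommand" re-exec approach from the links above, and it is not what the branch does; runner, submit, and fakeSlurm are hypothetical names, and the job naming is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// runner runs an external command with the given stdin and returns its
// output. Production code would back it with os/exec; tests pass a fake.
type runner func(stdin, name string, args ...string) (string, error)

// submit sends one container to slurm through the supplied runner.
func submit(run runner, uuid string) error {
	script := "#!/bin/sh\nexec crunch-run --job " + uuid + "\n"
	out, err := run(script, "sbatch", "-N1", "--job-name="+uuid)
	if err != nil {
		return fmt.Errorf("sbatch failed for %s: %v (output %q)", uuid, err, out)
	}
	return nil
}

// fakeSlurm records which commands were requested and can simulate failure.
type fakeSlurm struct {
	calls []string
	fail  bool
}

func (f *fakeSlurm) run(stdin, name string, args ...string) (string, error) {
	f.calls = append(f.calls, name)
	if f.fail {
		return "", errors.New("simulated failure")
	}
	return "Submitted batch job 1\n", nil
}

func main() {
	ok := &fakeSlurm{}
	fmt.Println(submit(ok.run, "zzzzz-dz642-000000000000001"), ok.calls)

	bad := &fakeSlurm{fail: true}
	fmt.Println(submit(bad.run, "zzzzz-dz642-000000000000001"))
}
```

The same pattern would cover the strigger call, since the fake only needs to record the command name and arguments it was given.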
Updated by Peter Amstutz almost 9 years ago
- Status changed from New to In Progress
Updated by Brett Smith almost 9 years ago
- Assigned To changed from Peter Amstutz to Radhika Chippada
- Target version changed from 2016-03-02 sprint to 2016-03-16 sprint
Updated by Radhika Chippada almost 9 years ago
Added tests in branch 6518-crunch2-dispatch-slurm-tests, derived from 6518-crunch2-dispatch-slurm at bf3a2814843a8f7a78592e3fb4c629fc9f4819b9
Updated by Peter Amstutz almost 9 years ago
- Status changed from In Progress to Resolved
- % Done changed from 50 to 100
Applied in changeset arvados|commit:7bb66fca9371232cc32dd6b365ceb33e926eb0e7.