Feature #6518
closed
[Crunch] [Crunch2] Dispatch containers via slurm
Added by Tom Clegg over 9 years ago.
Updated almost 9 years ago.
Assigned To:
Radhika Chippada
Estimated time:
(Total: 0.00 h)
Release relationship:
Auto
Description
When containers appear in the queue, use SLURM to execute them on worker nodes.
For now, the queue is arvados.v1.containers.queue (much like the Crunch1 job queue).
From Crunch2 dispatch:
slurm batch mode
- Use "sinfo" to determine whether it is possible to run the container
- Submit a batch job to the queue: "echo crunch-run --job {uuid} | sbatch -N1"
- When container priority changes, use scontrol and scancel to propagate changes to slurm
- Use strigger to run a cleanup script when a container exits
The cleanup script just has to deal with cases like the node dying before crunch-run has a chance to update the container record to state="Complete"
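The submission step above could be sketched in Go roughly as follows. This is a minimal illustration, not the code from the branch: the helper names batchScript and sbatchCmd are hypothetical, the "#!/bin/sh" line is added on the assumption that sbatch expects a script with an interpreter line, and using --job-name to carry the container UUID is likewise an assumption. The command is only built here, not started.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// batchScript returns the shell script that is fed to sbatch on stdin.
// It mirrors "echo crunch-run --job {uuid}" from the description, with an
// interpreter line added (assumption). The single quotes assume the
// arguments contain no quote characters, which holds for UUIDs.
func batchScript(crunchRunCommand, containerUUID string) string {
	return fmt.Sprintf("#!/bin/sh\nexec '%s' --job '%s'\n", crunchRunCommand, containerUUID)
}

// sbatchCmd builds, but does not start, an sbatch invocation requesting one
// node (-N1). Naming the slurm job after the container UUID is an
// illustrative choice so the job can be located later.
func sbatchCmd(containerUUID, script string) *exec.Cmd {
	cmd := exec.Command("sbatch", "-N1", "--job-name="+containerUUID)
	cmd.Stdin = strings.NewReader(script)
	return cmd
}
```

A caller would then run something like sbatchCmd(uuid, batchScript("crunch-run", uuid)).CombinedOutput() and inspect the error and output to decide whether the submission succeeded.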
- Tracker changed from Bug to Feature
- Target version set to Arvados Future Sprints
Suggest writing the crunch 2 job dispatcher as a new set of actors in node manager.
This would enable us to solve the question of communication between the scheduler and cloud node management (#6520).
Node manager already has a lot of the framework we will want like concurrency (can have one actor per job) and a configuration system.
Different schedulers (slurm, sge, kubernetes) can be implemented as modules similarly to how different cloud providers are supported now.
More ideas:
Have a "dispatchers" table. Dispatcher processes are responsible for pinging the API server similar to how it is done for nodes to show they are alive.
A dispatcher claims a container by setting the "dispatcher" field to its UUID. This field can only be set once, and that locks the record so that only the dispatcher can update it.
If a dispatcher stops pinging, the containers it has claimed should be marked as TempFail.
Dispatchers should be able to annotate containers (preferably through links), for example "I can't run this because I don't have any nodes with 40 GiB of RAM".
If we go with the architecture described in #8001, that will be a prerequisite.
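The claim-once idea could look roughly like the sketch below. It is an in-memory illustration only: the Container type, its fields, and the Claim and UpdateState methods are hypothetical names, and a real implementation would enforce the rule in the API server rather than in the dispatcher process.

```go
package main

import (
	"errors"
	"sync"
)

var (
	// ErrAlreadyClaimed is returned when the dispatcher field is already set.
	ErrAlreadyClaimed = errors.New("container already claimed by a dispatcher")
	// ErrNotOwner is returned when a non-owning dispatcher tries to update.
	ErrNotOwner = errors.New("container is not claimed by this dispatcher")
)

// Container is a hypothetical stand-in for a container record.
type Container struct {
	mu         sync.Mutex
	UUID       string
	Dispatcher string // empty until a dispatcher claims the record
	State      string
}

// Claim sets the dispatcher field exactly once; later claims fail.
func (c *Container) Claim(dispatcherUUID string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if dispatcherUUID == "" {
		return errors.New("dispatcher UUID must not be empty")
	}
	if c.Dispatcher != "" {
		return ErrAlreadyClaimed
	}
	c.Dispatcher = dispatcherUUID
	return nil
}

// UpdateState changes the state only when called by the claiming dispatcher.
func (c *Container) UpdateState(dispatcherUUID, state string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.Dispatcher == "" || c.Dispatcher != dispatcherUUID {
		return ErrNotOwner
	}
	c.State = state
	return nil
}
```

Under this scheme, marking containers as TempFail when a dispatcher stops pinging would have to be done by the API server itself, since no other dispatcher is allowed to update the locked record.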
- Description updated (diff)
#7816 is now the story for actually running containers
- Target version deleted (Arvados Future Sprints)
- Release set to 11
I think we can narrow this down to a 1 point story that just submits to "sbatch" and possibly checks "squeue" for status updates.
- Story points changed from 3.0 to 1.0
- Target version set to 2016-03-02 sprint
- Assigned To set to Peter Amstutz
- Description updated (diff)
- Description updated (diff)
Review feedback for branch 6518-crunch2-dispatch-slurm
crunch-dispatch-slurm.go
- Comment for runQueuedContainers says “Invoke dispatchLocal for each ticker cycle”. Please update to say “Invoke dispatchSlurm …” instead
- It would greatly improve readability of code if camel case is used consistently for names such as submiterr, stdinerr, similar to updateErr etc. There are several variables that could be updated as such.
- func strigger: can you please rename it to say what it does, e.g. “setup trigger for when job finishes”?
- Comment for func run (line 225): please update it to describe submitting the batch command, etc. The current comment is not quite correct (it applies to crunch-dispatch-local).
- can you please add comments to submit and strigger funcs
- It would be helpful if we added tests that mock slurm commands. I searched a little and found a few links that may come in handy.
- This comment “#uuid=$(squeue --jobs=$jobid --states=all --format=%j --noheader)” in the shell script seems to be out of sync?
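One possible way to make the slurm calls mockable, as suggested in the feedback above, is to route them through a package-level function variable that tests can replace. The sketch below is illustrative only: runSqueue and queuedJobNames are hypothetical names, not functions from the branch, and the squeue options are the ones quoted from the shell script comment.

```go
package main

import (
	"os/exec"
	"strings"
)

// runSqueue is a package-level variable so that tests can substitute a stub
// instead of invoking the real squeue binary.
var runSqueue = func() ([]byte, error) {
	return exec.Command("squeue", "--states=all", "--format=%j", "--noheader").Output()
}

// queuedJobNames returns the job names reported by squeue, one per line of
// output. If jobs are named after container UUIDs (an assumption), this is
// the set of containers slurm currently knows about.
func queuedJobNames() ([]string, error) {
	out, err := runSqueue()
	if err != nil {
		return nil, err
	}
	return strings.Fields(string(out)), nil
}
```

In a test, runSqueue can be reassigned to a function returning canned output (and restored afterwards), so the parsing logic is exercised without a slurm installation.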
- Status changed from New to In Progress
- Assigned To changed from Peter Amstutz to Radhika Chippada
- Target version changed from 2016-03-02 sprint to 2016-03-16 sprint
- Status changed from In Progress to Resolved
- % Done changed from 50 to 100
Applied in changeset arvados|commit:7bb66fca9371232cc32dd6b365ceb33e926eb0e7.