Feature #23076
Status: Closed
arvados-dispatch-slurm supports GPU requirements
Description
Currently, arvados-dispatch-slurm does not honor GPU requirements the way its LSF counterpart does, making it impossible to run GPU-backed workflows on Slurm HPC clusters.
Updated by Lucas Di Pentima 8 months ago
- Related to Idea #15957: GPU support added
Updated by Brett Smith 8 months ago
- Target version changed from Future to Development 2025-08-21
Updated by Tom Clegg 8 months ago
23076-slurm-gpu @ 0bfe025b3ab2bab96d39cccbf73f91c90872a670 -- developer-run-tests: #4849
retry run-tests-remainder: #5431
- All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
- ✅ add `--gpus=%d` to sbatch args when `RuntimeConstraints.GPU.DeviceCount > 0`
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- n/a
- Code is tested and passing, both automated and manual, what manual testing was done is described.
- ✅ add test to confirm `--gpus=%d` gets added when appropriate (and existing tests confirm it isn't when not) - no manual testing with actual slurm
- The tested code incorporates recent main branch changes.
- ✅
- New or changed UI/UX has gotten feedback from stakeholders.
- n/a
- Documentation has been updated.
- n/a
- Behaves appropriately at the intended scale (describe intended scale).
- ✅ no scaling consequences anticipated
- Considered backwards and forwards compatibility issues between client and server.
- ✅ n/a
- Follows our coding standards and GUI style guidelines.
- ✅
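The change summarized in the checklist above can be sketched roughly as follows. This is a minimal illustration, not the actual Arvados source: the type and function names (`GPUConstraint`, `sbatchGPUArgs`) are assumptions, and only the `--gpus=%d` flag and the `DeviceCount > 0` condition come from the ticket.

```go
// Hypothetical sketch: append --gpus=N to the sbatch argument list
// when the container's runtime constraints request GPUs.
package main

import "fmt"

// GPUConstraint mirrors the relevant part of RuntimeConstraints.GPU
// (name assumed for this sketch).
type GPUConstraint struct {
	DeviceCount int
}

// sbatchGPUArgs returns the extra sbatch arguments for the given GPU
// constraint, or nil when no GPUs are requested.
func sbatchGPUArgs(gpu GPUConstraint) []string {
	if gpu.DeviceCount > 0 {
		return []string{fmt.Sprintf("--gpus=%d", gpu.DeviceCount)}
	}
	return nil
}

func main() {
	args := []string{"sbatch", "--nodes=1"}
	args = append(args, sbatchGPUArgs(GPUConstraint{DeviceCount: 2})...)
	fmt.Println(args)
}
```

Returning nil for the zero-GPU case keeps existing (non-GPU) submissions byte-for-byte identical, which is what the existing tests mentioned in the checklist verify.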
Updated by Lucas Di Pentima 7 months ago
This looks simple enough, but I wonder if we're creating inconsistencies in how we handle GPU requests between different schedulers.
While looking at the LSF-related documentation (the `Containers.LSF.BsubArgumentsList` section), AFAICT we use some kind of argument templating to pass the number of GPUs to bsub. This would allow a cluster admin to disable GPUs site-wide, and that's something we wouldn't offer in Slurm's case.
So even though I think the changes in this branch are ready for merging, I wanted to double-check whether my interpretation is correct and whether we should offer users the same flexibility on both LSF and Slurm (this might also be useful for tuning sbatch with other arguments like --gpus-per-task and --gpus-per-node).
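The LSF-style templating being described might work along these lines. This is an illustrative sketch only: the placeholder names (`%C`, `%G`) and the `expandArgs` helper are assumptions, not necessarily what `BsubArgumentsList` actually uses. The point is that because the admin configures the full argument list, omitting the GPU placeholder disables GPU requests site-wide.

```go
// Hypothetical sketch of placeholder-based argument templating, in the
// style attributed to Containers.LSF.BsubArgumentsList.
package main

import (
	"fmt"
	"strings"
)

// expandArgs substitutes each placeholder (e.g. "%G" -> GPU count) in
// the admin-configured argument list with its per-container value.
func expandArgs(configured []string, vals map[string]string) []string {
	out := make([]string, 0, len(configured))
	for _, arg := range configured {
		for ph, v := range vals {
			arg = strings.ReplaceAll(arg, ph, v)
		}
		out = append(out, arg)
	}
	return out
}

func main() {
	// Hypothetical configured list; "%C" = CPU count, "%G" = GPU count.
	configured := []string{"bsub", "-n", "%C", "-gpu", "num=%G"}
	fmt.Println(expandArgs(configured, map[string]string{"%C": "4", "%G": "2"}))
}
```

Under this scheme the dispatcher's arguments are fully admin-controlled, whereas a hard-coded `--gpus=%d` in the Slurm dispatcher cannot be overridden from configuration.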
Updated by Tom Clegg 7 months ago
I do like the idea of unifying the lsf/slurm approaches. It would avoid the (documented) weirdness that it's possible to effectively set defaults for slurm arguments but not to override the dispatcher's arguments. But this is a pre-existing discrepancy, and the change will involve a config migration, so I think we should have a separate ticket for it.
Practically speaking, I think the various Slurm options (--gpus-per-task, --gpus-per-node) are all equivalent here, since crunch-dispatch-slurm always submits containers as single-node, single-task Slurm jobs.
Updated by Tom Clegg 7 months ago
- Related to Feature #23091: Unify SLURM (SbatchArgumentsList) and LSF (BsubArgumentsList) configuration style added