Feature #23076
Status: Closed
arvados-dispatch-slurm supports GPU requirements
Description
Currently, arvados-dispatch-slurm does not honor GPU requirements the way its LSF counterpart does, making it impossible to run GPU-backed workflows on Slurm HPC clusters.
Updated by Lucas Di Pentima 8 months ago
- Related to Idea #15957: GPU support added
Updated by Brett Smith 8 months ago
- Target version changed from Future to Development 2025-08-21
Updated by Tom Clegg 8 months ago
23076-slurm-gpu @ 0bfe025b3ab2bab96d39cccbf73f91c90872a670 -- developer-run-tests: #4849
retry run-tests-remainder: #5431
- All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
- ✅ add `--gpus=%d` to sbatch args when `RuntimeConstraints.GPU.DeviceCount > 0`
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- n/a
- Code is tested and passing, both automated and manual, what manual testing was done is described.
- ✅ add test to confirm `--gpus=%d` gets added when appropriate (and existing tests confirm it isn't when not) - no manual testing with actual slurm
- The tested code incorporates recent main branch changes.
- ✅
- New or changed UI/UX has gotten feedback from stakeholders.
- n/a
- Documentation has been updated.
- n/a
- Behaves appropriately at the intended scale (describe intended scale).
- ✅ no scaling consequences anticipated
- Considered backwards and forwards compatibility issues between client and server.
- ✅ n/a
- Follows our coding standards and GUI style guidelines.
- ✅
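The change summarized in the checklist above can be sketched roughly as follows. This is a minimal illustration, not the actual Arvados source: the type and function names (`GPUConstraint`, `sbatchGPUArgs`) are assumptions, and only the `--gpus=%d` flag and the `DeviceCount > 0` condition come from the ticket.

```go
// Hypothetical sketch: append --gpus=N to the sbatch argument list
// when the container's runtime constraints request GPUs.
package main

import "fmt"

// GPUConstraint mirrors the relevant part of RuntimeConstraints.GPU
// (name assumed for this sketch).
type GPUConstraint struct {
	DeviceCount int
}

// sbatchGPUArgs returns the extra sbatch arguments for the given GPU
// constraint, or nil when no GPUs are requested.
func sbatchGPUArgs(gpu GPUConstraint) []string {
	if gpu.DeviceCount > 0 {
		return []string{fmt.Sprintf("--gpus=%d", gpu.DeviceCount)}
	}
	return nil
}

func main() {
	args := []string{"sbatch", "--nodes=1"}
	args = append(args, sbatchGPUArgs(GPUConstraint{DeviceCount: 2})...)
	fmt.Println(args)
}
```

Returning nil for the zero-GPU case keeps existing (non-GPU) submissions byte-for-byte identical, which is what the existing tests mentioned in the checklist verify.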
Updated by Lucas Di Pentima 7 months ago
This looks simple enough, but I wonder if we're creating inconsistencies in how we handle GPU requests between different schedulers.
While looking at the LSF-related documentation (the `Containers.LSF.BsubArgumentsList` section), AFAICT we use some kind of argument templating to pass the number of GPUs to bsub. This would allow a cluster admin to disable GPUs site-wide, and that's something we wouldn't offer in Slurm's case.
So even though I think the changes in this branch are ready for merging, I wanted to double-check whether my interpretation is correct and whether we should offer users the same flexibility on both LSF and Slurm (this might also be useful for tuning sbatch with other arguments like --gpus-per-task and --gpus-per-node).
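The LSF-style templating being described might work along these lines. This is an illustrative sketch only: the placeholder names (`%C`, `%G`) and the `expandArgs` helper are assumptions, not necessarily what `BsubArgumentsList` actually uses. The point is that because the admin configures the full argument list, omitting the GPU placeholder disables GPU requests site-wide.

```go
// Hypothetical sketch of placeholder-based argument templating, in the
// style attributed to Containers.LSF.BsubArgumentsList.
package main

import (
	"fmt"
	"strings"
)

// expandArgs substitutes each placeholder (e.g. "%G" -> GPU count) in
// the admin-configured argument list with its per-container value.
func expandArgs(configured []string, vals map[string]string) []string {
	out := make([]string, 0, len(configured))
	for _, arg := range configured {
		for ph, v := range vals {
			arg = strings.ReplaceAll(arg, ph, v)
		}
		out = append(out, arg)
	}
	return out
}

func main() {
	// Hypothetical configured list; "%C" = CPU count, "%G" = GPU count.
	configured := []string{"bsub", "-n", "%C", "-gpu", "num=%G"}
	fmt.Println(expandArgs(configured, map[string]string{"%C": "4", "%G": "2"}))
}
```

Under this scheme the dispatcher's arguments are fully admin-controlled, whereas a hard-coded `--gpus=%d` in the Slurm dispatcher cannot be overridden from configuration.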
Updated by Tom Clegg 7 months ago
I do like the idea of unifying the lsf/slurm approaches. It would avoid the (documented) weirdness that it's possible to effectively set defaults for slurm arguments but not to override the dispatcher's arguments. But this is a pre-existing discrepancy, and the change will involve a config migration, so I think we should have a separate ticket for it.
Practically speaking, I think the various Slurm options (--gpus-per-task, --gpus-per-node) are all equivalent here, since crunch-dispatch-slurm always submits containers as single-node, single-task Slurm jobs.
Updated by Tom Clegg 7 months ago
- Related to Feature #23091: Unify SLURM (SbatchArgumentsList) and LSF (BsubArgumentsList) configuration style added