Feature #23076

closed

arvados-dispatch-slurm supports GPU requirements

Added by Lucas Di Pentima 8 months ago. Updated 6 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Dispatchers
Target version:
Story points: -
Release relationship: Auto

Description

Currently, a-d-s does not honor GPU requirements the way its LSF counterpart does, so GPU-backed workflows cannot run on Slurm HPC clusters.

Subtasks 1 (0 open, 1 closed)

Task #23089: Review 23076-slurm-gpu (Resolved, Tom Clegg, 08/11/2025)

Related issues 2 (0 open, 2 closed)

Related to Arvados Epics - Idea #15957: GPU support (Resolved, 10/01/2021, 03/31/2022)
Related to Arvados - Feature #23091: Unify SLURM (SbatchArgumentsList) and LSF (BsubArgumentsList) configuration style (Resolved, Tom Clegg)
Actions #1

Updated by Lucas Di Pentima 8 months ago

Actions #3

Updated by Brett Smith 8 months ago

  • Target version changed from Future to Development 2025-08-21
Actions #4

Updated by Tom Clegg 8 months ago

  • Assigned To set to Tom Clegg
  • Status changed from New to In Progress
Actions #5

Updated by Tom Clegg 8 months ago

23076-slurm-gpu @ 0bfe025b3ab2bab96d39cccbf73f91c90872a670 -- developer-run-tests: #4849

retry run-tests-remainder: #5431

  • All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
    • ✅ add --gpus=%d to sbatch args when RuntimeConstraints.GPU.DeviceCount > 0
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • n/a
  • Code is tested and passing, both automated and manual, what manual testing was done is described.
    • ✅ add test to confirm --gpus=%d gets added when appropriate (and existing tests confirm it doesn't when not)
    • no manual testing with actual slurm
  • The tested code incorporates recent main branch changes.
  • New or changed UI/UX has gotten feedback from stakeholders.
    • n/a
  • Documentation has been updated.
    • n/a
  • Behaves appropriately at the intended scale (describe intended scale).
    • ✅ no scaling consequences anticipated
  • Considered backwards and forwards compatibility issues between client and server.
    • ✅ n/a
  • Follows our coding standards and GUI style guidelines.
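
The rule in the first checklist item can be sketched roughly as below. This is a minimal illustration, not the code in the branch: the struct types are simplified stand-ins for the real runtime constraints, and `sbatchArgs` is a hypothetical helper name that covers only the CPU and GPU flags.

```go
package main

import "fmt"

// GPURuntimeConstraints and RuntimeConstraints are simplified stand-ins
// (assumptions for this sketch), keeping only the fields used here.
type GPURuntimeConstraints struct {
	DeviceCount int
}

type RuntimeConstraints struct {
	VCPUs int
	GPU   GPURuntimeConstraints
}

// sbatchArgs is a hypothetical helper: it builds a partial sbatch argument
// list and appends --gpus=N only when the container asks for GPUs.
func sbatchArgs(rc RuntimeConstraints) []string {
	args := []string{fmt.Sprintf("--cpus-per-task=%d", rc.VCPUs)}
	if rc.GPU.DeviceCount > 0 {
		args = append(args, fmt.Sprintf("--gpus=%d", rc.GPU.DeviceCount))
	}
	return args
}

func main() {
	withGPU := RuntimeConstraints{VCPUs: 4, GPU: GPURuntimeConstraints{DeviceCount: 2}}
	withoutGPU := RuntimeConstraints{VCPUs: 4}
	fmt.Println(sbatchArgs(withGPU))    // [--cpus-per-task=4 --gpus=2]
	fmt.Println(sbatchArgs(withoutGPU)) // [--cpus-per-task=4]
}
```

A container with no GPU requirement gets no `--gpus` flag at all, which matches the note that existing tests confirm the flag is absent when not needed.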
Actions #7

Updated by Tom Clegg 7 months ago

  • Subtask #23089 added
Actions #8

Updated by Lucas Di Pentima 7 months ago

This looks simple enough, but I wonder if we're creating inconsistencies in how we handle GPU requests between different schedulers.

While looking at the LSF-related documentation (the Containers.LSF.BsubArgumentsList section), AFAICT we use some kind of argument templating to pass the number of GPUs to bsub. This would allow a cluster admin to disable GPUs site-wide, and that's something we wouldn't offer in SLURM's case.

So even though I think the changes in this branch are ready for merging, I wanted to double-check if what I'm interpreting is correct and if we should offer users the same flexibility on both LSF and Slurm (this might also be useful to tune sbatch with other arguments like --gpus-per-task and --gpus-per-node).
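
For illustration, the templating idea could look roughly like the sketch below if applied to sbatch. Everything here is an assumption made for the example: the `expandArgs` helper, its parameters, and the `%C` / `%G` placeholder names are hypothetical and do not describe the existing dispatcher configuration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// expandArgs is a hypothetical helper. base holds admin-supplied argument
// templates that always apply; gpuArgs holds templates that are appended
// only when the container requests GPUs. %C and %G are illustrative
// placeholders for the CPU and GPU counts.
func expandArgs(base, gpuArgs []string, vcpus, gpus int) []string {
	repl := strings.NewReplacer(
		"%C", strconv.Itoa(vcpus),
		"%G", strconv.Itoa(gpus),
	)
	// Copy base so the caller's slice is never modified.
	tmpl := append([]string{}, base...)
	if gpus > 0 {
		tmpl = append(tmpl, gpuArgs...)
	}
	out := make([]string, 0, len(tmpl))
	for _, a := range tmpl {
		out = append(out, repl.Replace(a))
	}
	return out
}

func main() {
	base := []string{"--cpus-per-task=%C"}
	// Site that passes GPU requests through to sbatch.
	fmt.Println(expandArgs(base, []string{"--gpus=%G"}, 4, 2)) // [--cpus-per-task=4 --gpus=2]
	// Site that disables GPU arguments by leaving the GPU template empty.
	fmt.Println(expandArgs(base, nil, 4, 2)) // [--cpus-per-task=4]
}
```

With this shape, an admin could swap `--gpus=%G` for `--gpus-per-node=%G` or drop the GPU template entirely, without any change to the dispatcher code.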

Actions #9

Updated by Tom Clegg 7 months ago

I do like the idea of unifying the lsf/slurm approaches. It would avoid the (documented) weirdness that it's possible to effectively set defaults for slurm arguments but not to override the dispatcher's arguments. But this is a pre-existing discrepancy, and the change will involve a config migration, so I think we should have a separate ticket for it.

Practically speaking I think the various slurm options (--gpus-per-task, --gpus-per-node) are all equivalent in that crunch-dispatch-slurm always submits containers as single-node single-task slurm jobs.

Actions #10

Updated by Lucas Di Pentima 7 months ago

Ok, then this LGTM. Thanks!

Actions #11

Updated by Tom Clegg 7 months ago

  • Related to Feature #23091: Unify SLURM (SbatchArgumentsList) and LSF (BsubArgumentsList) configuration style added
Actions #12

Updated by Tom Clegg 7 months ago

  • Status changed from In Progress to Resolved
Actions #13

Updated by Brett Smith 6 months ago

  • Release set to 79