Feature #23091

closed

Unify SLURM (SbatchArgumentsList) and LSF (BsubArgumentsList) configuration style

Added by Tom Clegg 7 months ago. Updated 7 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Dispatchers
Target version:
Story points: -
Release relationship: Auto

Description

Background

Currently the LSF dispatcher configuration uses a template approach:

        # Template variables starting with % will be substituted as follows:
        #
        # %U uuid
        # %C number of VCPUs
        # %M memory in MB
        # %T tmp in MB
        # %G number of GPU devices (runtime_constraints.gpu.device_count)
[...]
        BsubArgumentsList: ["-o", "/tmp/crunch-run.%%J.out", "-e", "/tmp/crunch-run.%%J.err", "-J", "%U", "-n", "%C", "-D", "%MMB", "-R", "rusage[mem=%MMB:tmp=%TMB] span[hosts=1]", "-R", "select[mem>=%MMB]", "-R", "select[tmp>=%TMB]", "-R", "select[ncpus>=%C]", "-We", "%W"]
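To make the template rules concrete, here is a minimal Go sketch of how such arguments could be expanded. It is not the Arvados implementation. The helper name `expandArgs`, the one-letter variable map, and the choice to reject unknown `%X` sequences are assumptions made for illustration. The sketch scans each argument once, so `%%J` becomes a literal `%J` (which LSF then interprets as its job ID) instead of being treated as a template variable.

```go
package main

import (
	"fmt"
	"strings"
)

// expandArgs (hypothetical helper) substitutes %X template sequences in
// each argument in a single left-to-right pass. "%%" yields a literal "%";
// any other %X must have an entry in vars, otherwise an error is returned.
func expandArgs(args []string, vars map[byte]string) ([]string, error) {
	out := make([]string, 0, len(args))
	for _, arg := range args {
		var b strings.Builder
		for i := 0; i < len(arg); i++ {
			if arg[i] != '%' {
				b.WriteByte(arg[i])
				continue
			}
			if i+1 >= len(arg) {
				return nil, fmt.Errorf("trailing %% in argument %q", arg)
			}
			i++ // look at the character after '%'
			if arg[i] == '%' {
				b.WriteByte('%')
				continue
			}
			val, ok := vars[arg[i]]
			if !ok {
				return nil, fmt.Errorf("unknown template variable %%%c in argument %q", arg[i], arg)
			}
			b.WriteString(val)
		}
		out = append(out, b.String())
	}
	return out, nil
}

func main() {
	// Example values for one container (illustrative only).
	vars := map[byte]string{
		'U': "zzzzz-dz642-queuedcontainer", // container UUID
		'C': "2",                           // number of VCPUs
		'M': "1024",                        // memory in MB
		'T': "2048",                        // tmp in MB
		'G': "0",                           // number of GPU devices
	}
	args, err := expandArgs([]string{
		"-o", "/tmp/crunch-run.%%J.out",
		"-J", "%U",
		"-n", "%C",
		"-R", "rusage[mem=%MMB:tmp=%TMB] span[hosts=1]",
	}, vars)
	if err != nil {
		panic(err)
	}
	fmt.Println(strings.Join(args, " "))
}
```

Because variables are a single letter, `%MMB` expands to the memory value followed by the literal unit suffix `MB`.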

The SLURM dispatcher just accepts literal command line arguments, with no template capability. The arguments used to set job name, CPU count, memory, etc., are hard-coded.

        SbatchArgumentsList: []

Our documentation mentions:

Note: If an argument is supplied multiple times, slurm uses the value of the last occurrence of the argument on the command line. Arguments specified through Arvados are added after the arguments listed in SbatchArguments. This means, for example, that an Arvados container that specifies partitions in scheduling_parameters will override an occurrence of --partition in SbatchArguments. As a result, for container parameters that can be specified through Arvados, SbatchArguments can be used to specify defaults but not to enforce a specific policy.
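The "last occurrence wins" rule quoted above can be illustrated with a small Go sketch that mimics how the effective value is resolved. This is a simulation for illustration, not sbatch itself: the helper `effectiveValue` is hypothetical, it only understands the `--name=value` spelling, and the partition names are made-up examples.

```go
package main

import (
	"fmt"
	"strings"
)

// effectiveValue (hypothetical helper) returns the value that would take
// effect for a "--name=value" option when the last occurrence wins.
// The "--name value" spelling and short options are not handled here.
func effectiveValue(args []string, name string) (string, bool) {
	prefix := "--" + name + "="
	val, found := "", false
	for _, arg := range args {
		if strings.HasPrefix(arg, prefix) {
			val, found = strings.TrimPrefix(arg, prefix), true
		}
	}
	return val, found
}

func main() {
	// Site default from SbatchArgumentsList (example value).
	configured := []string{"--partition=general"}
	// Arguments derived from the container's scheduling_parameters,
	// appended after the configured ones (example value).
	fromContainer := []string{"--partition=gpu"}

	cmdline := append(append([]string{}, configured...), fromContainer...)
	p, _ := effectiveValue(cmdline, "partition")
	fmt.Println(p) // prints "gpu": the container's value overrides the site default
}
```

This is why the configured list can only supply defaults: anything Arvados appends later on the command line takes precedence.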

Proposal

Update the SLURM dispatcher to behave similarly to the LSF dispatcher: use template variables and make the default SbatchArgumentsList something like

        ["--mem=%M", "--cpus-per-task=%C", "--tmp=%T", "--gpus=%G", "--no-requeue"]
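As a rough illustration of what the proposed default would produce for one container, the Go sketch below fills in `%C`, `%M`, `%T` and `%G` from a simplified set of runtime constraints. Several details are assumptions rather than confirmed Arvados behavior: the `runtimeConstraints` struct and its `TmpBytes` field are hypothetical stand-ins, sizes are rounded up to whole MiB, and ReserveExtraRAM is added to the memory request (suggested by the branch notes, but not spelled out here). It uses `strings.NewReplacer`, which replaces matches in a single left-to-right pass, so `%%M` stays a literal `%M`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// runtimeConstraints is a hypothetical, simplified stand-in for a
// container's runtime constraints. Sizes are in bytes.
type runtimeConstraints struct {
	VCPUs    int
	RAM      int64
	TmpBytes int64 // hypothetical: scratch space requested by the container
	GPUCount int
}

// toMB rounds a byte count up to whole MiB (assumed rounding rule).
func toMB(n int64) int64 {
	const mb = 1 << 20
	return (n + mb - 1) / mb
}

// expand substitutes %C, %M, %T, %G and %% in each argument.
// reserveExtraRAM is added to the memory figure (assumption).
func expand(args []string, rc runtimeConstraints, reserveExtraRAM int64) []string {
	r := strings.NewReplacer(
		"%%", "%",
		"%C", strconv.Itoa(rc.VCPUs),
		"%M", strconv.FormatInt(toMB(rc.RAM+reserveExtraRAM), 10),
		"%T", strconv.FormatInt(toMB(rc.TmpBytes), 10),
		"%G", strconv.Itoa(rc.GPUCount),
	)
	out := make([]string, len(args))
	for i, a := range args {
		out[i] = r.Replace(a)
	}
	return out
}

func main() {
	// Proposed default SbatchArgumentsList from this issue.
	proposed := []string{"--mem=%M", "--cpus-per-task=%C", "--tmp=%T", "--gpus=%G", "--no-requeue"}
	rc := runtimeConstraints{VCPUs: 4, RAM: 3 << 30, TmpBytes: 10 << 30, GPUCount: 1}
	fmt.Println(strings.Join(expand(proposed, rc, 256<<20), " "))
	// prints: --mem=3328 --cpus-per-task=4 --tmp=10240 --gpus=1 --no-requeue
}
```

sbatch interprets a unitless --mem or --tmp value as megabytes, which is why the sketch emits bare numbers.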

A note about nice: The dispatcher updates each job's nice value after queueing it, to maintain relative priority ordering among Arvados-submitted SLURM jobs. Because of this, we will not provide any template support for setting nice. The configuration reference will document this limitation and explain why.

Write an upgrade note explaining that you must add the previously default arguments to your configuration. Unless it's utterly trivial, do not bother detecting and reporting the old format, or implementing some fancy auto-detected migration.


Files

clipboard-202508251008-pu1ev.png (62.3 KB) Brett Smith, 08/25/2025 02:08 PM

Subtasks 1 (0 open, 1 closed)

Task #23128: Review 23091-sbatch-template-args (Resolved, Tom Clegg, 09/02/2025)

Related issues 2 (0 open, 2 closed)

Related to Arvados - Feature #23076: arvados-dispatch-slurm supports GPU requirements (Resolved, Tom Clegg)
Related to Arvados - Feature #23110: Add SLURM.SbatchGPUArguments configuration (Resolved, Tom Clegg)
#1

Updated by Tom Clegg 7 months ago

  • Related to Feature #23076: arvados-dispatch-slurm supports GPU requirements added
#2

Updated by Tom Clegg 7 months ago

  • Category set to Dispatchers
#3

Updated by Brett Smith 7 months ago

  • Release set to 79
#4

Updated by Brett Smith 7 months ago

  • Related to Feature #23110: Add SLURM.SbatchGPUArguments configuration added
#5

Updated by Brett Smith 7 months ago

  • Target version set to Development 2025-09-03
  • Assigned To set to Tom Clegg
  • Description updated (diff)
#6

Updated by Brett Smith 7 months ago

  • Subtask #23128 added
#7

Updated by Tom Clegg 7 months ago

  • Status changed from New to In Progress
#8

Updated by Tom Clegg 7 months ago

23091-sbatch-template-args @ bb493db09c0470294316ee94cceaa011234c5d89 -- developer-run-tests: #4859

Based on unmerged branch 23110-sbatch-gpu-arguments.

  • All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
    • ✅ SbatchArgumentsList and SbatchGPUArgumentsList configs are templated
    • ✨ Updated existing tests (args for test cases got re-ordered)
    • ✨ Updated existing tests to exercise a non-zero ReserveExtraRAM config (existing tests weren't checking that it was being used at all)
    • ✨ Added an error-checking test for LSF BsubArgumentsList (noticed there wasn't one)
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • n/a
  • Code is tested and passing, both automated and manual, what manual testing was done is described.
    • ✅ Updated existing tests
  • The tested code incorporates recent main branch changes.
  • New or changed UI/UX has gotten feedback from stakeholders.
  • Documentation has been updated.
    • ✅ Updated config reference.
    • ✅ Added upgrade note (if SbatchArgumentsList is already configured, it must be updated during upgrade in order to retain previous behavior).
  • Behaves appropriately at the intended scale (describe intended scale).
    • ✅ n/a
  • Considered backwards and forwards compatibility issues between client and server.
    • ✅ n/a
  • Follows our coding standards and GUI style guidelines.

crunch-dispatch-slurm has a seemingly undocumented feature: if you have InstanceTypes in your Arvados config, it will choose one and run sbatch --constraint=instancetype=X instead of sbatch --mem=X --cpus-per-task=Y --tmp=Z. This branch preserves the feature, but you have to request it explicitly with SbatchArgumentsList: ["--constraint=instancetype=%I"]. And now the capability is (very slightly) documented, because the %I sequence is mentioned in the config reference. The current upgrade note doesn't mention it, though, so we're basically assuming nobody is using it. Or does it deserve to be mentioned there?
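The instance-type path described above can be sketched in Go as two steps: pick an instance type for the container, then substitute its name into `--constraint=instancetype=%I`. This is only an illustration under stated assumptions. The `instanceType` struct is a hypothetical simplification of an InstanceTypes entry, and "cheapest type with enough VCPUs and RAM" is an assumed selection rule; the comment above only says the dispatcher "will choose one".

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// instanceType is a hypothetical, simplified InstanceTypes entry.
type instanceType struct {
	Name  string
	VCPUs int
	RAMMB int64
	Price float64
}

// chooseInstanceType returns the cheapest type that has at least the
// requested VCPUs and RAM. The selection rule is an assumption.
func chooseInstanceType(types []instanceType, vcpus int, ramMB int64) (instanceType, error) {
	var best instanceType
	found := false
	for _, it := range types {
		if it.VCPUs < vcpus || it.RAMMB < ramMB {
			continue // too small for this container
		}
		if !found || it.Price < best.Price {
			best, found = it, true
		}
	}
	if !found {
		return instanceType{}, errors.New("no instance type satisfies the constraints")
	}
	return best, nil
}

func main() {
	// Example instance types (made-up values).
	types := []instanceType{
		{"small", 2, 4096, 0.10},
		{"medium", 4, 16384, 0.20},
		{"large", 16, 65536, 0.80},
	}
	it, err := chooseInstanceType(types, 4, 8192)
	if err != nil {
		panic(err)
	}
	// With templated arguments, this behavior must be requested explicitly
	// by putting "--constraint=instancetype=%I" in SbatchArgumentsList.
	r := strings.NewReplacer("%%", "%", "%I", it.Name)
	fmt.Println(r.Replace("--constraint=instancetype=%I")) // prints "--constraint=instancetype=medium"
}
```

If the configured list does not contain `%I`, no instance-type constraint is passed to sbatch at all, which is the behavior change the upgrade-note question is about.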

#9

Updated by Brett Smith 7 months ago

Tom Clegg wrote in #note-8:

23091-sbatch-template-args @ bb493db09c0470294316ee94cceaa011234c5d89 -- developer-run-tests: #4859

The test failure definitely looks like it could be related to the branch. Can you please at least look and weigh in?

When you start a code block in the docs, please write the first line on the same physical line as the opening <pre> to avoid this blank line on rendering:

It would also be nice to highlight the formerly-default arguments with <span class="userinput"> to call attention to where they are on the line. See the existing upgrade note about RHEL repository GPG keys for an example.

Code LGTM otherwise, thanks.

#10

Updated by Tom Clegg 7 months ago

23091-sbatch-template-args @ a62dd27026fec42caeb1c2e9fcd2e1ed889692cc -- developer-run-tests: #4860

Fixed a bug that caused the LSF test suite to deadlock and time out when there was more than one test func.

#11

Updated by Tom Clegg 7 months ago

Brett Smith wrote in #note-9:

The test failure definitely looks like it could be related to the branch. Can you please at least look and weigh in?

Indeed. See #note-10 above.

When you start a code block in the docs, please write the first line on the same physical line as the opening <pre> to avoid this blank line on rendering:

Oops, fixed.

It would also be nice to highlight the formerly-default arguments with <span class="userinput"> to call attention to where they are on the line. See the existing upgrade note about RHEL repository GPG keys for an example.

Good point, done.

23091-sbatch-template-args @ 7ccd2d1f543b57a9e45931f34d30f37d2afc962e

#12

Updated by Brett Smith 7 months ago

Tom Clegg wrote in #note-11:

23091-sbatch-template-args @ 7ccd2d1f543b57a9e45931f34d30f37d2afc962e

Manually reviewed the new docs and LGTM, thank you.

#13

Updated by Tom Clegg 7 months ago

  • Status changed from In Progress to Resolved