Feature #18324

closed

LSF support for requesting node with CUDA support

Added by Peter Amstutz about 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 01/05/2022
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release relationship: Auto

Description

https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=features-enabling-jobs-use-gpu-resources

According to this, GPUs can be configured at the job level, but also at the queue level, so depending on the site, you might need to request a specific queue.

Customer email:

these are the parameters we're using to request GPUs:

-gpu "num=1:j_exclusive=yes"

The exclusive part should probably be configurable as it's not mandatory, but on our cluster the default is that GPUs are shared, so we recommend our users to request them exclusively.

Maybe having a parameter for the GPU string with a placeholder for the number of GPUs similar to the Memory or CPUs.

Proposed design:

Add a new option, "LSF.BsubCUDAArguments". It is appended to the end of "BsubArgumentsList" when CUDA.DeviceCount > 0 in the container runtime constraints. Introduce a new template variable, %G, for the value of DeviceCount.

Example:

BsubCUDAArguments: ["-gpu", "num=%G:j_exclusive=yes"]


Subtasks 1 (0 open, 1 closed)

Task #18598: review 18324-lsf-gpu, Resolved, Peter Amstutz, 01/05/2022

Related issues 1 (0 open, 1 closed)

Related to Arvados Epics - Story #15957: GPU support, Resolved, 10/01/2021, 03/31/2022

Actions #1

Updated by Peter Amstutz about 3 years ago

Actions #2

Updated by Peter Amstutz about 3 years ago

  • Release set to 46
Actions #3

Updated by Peter Amstutz about 3 years ago

  • Assigned To set to Peter Amstutz
  • Target version set to 2022-01-05 sprint
Actions #4

Updated by Peter Amstutz about 3 years ago

  • Description updated (diff)
Actions #5

Updated by Peter Amstutz about 3 years ago

  • Description updated (diff)
Actions #6

Updated by Tom Clegg about 3 years ago

Seems like we could just add ["-gpu", "num=%G:j_exclusive=yes"] to BsubArgumentsList all the time, without a separate "only if GPU" config? Or would -gpu num=0:j_exclusive=yes do something undesirable to non-GPU jobs?

Actions #7

Updated by Peter Amstutz about 3 years ago

Tom Clegg wrote:

Seems like we could just add ["-gpu", "num=%G:j_exclusive=yes"] to BsubArgumentsList all the time, without a separate "only if GPU" config? Or would -gpu num=0:j_exclusive=yes do something undesirable to non-GPU jobs?

Carlos wrote:

I just tested this as I was not sure what would happen and LSF doesn't seem to like it:

$ bsub -gpu num=0 hostname
GPU num not valid in gpu requirement. Job not submitted.

Which is pretty much the outcome I expected. Time to go back to plan A?

Actions #8

Updated by Peter Amstutz about 3 years ago

  • Status changed from New to In Progress
Actions #9

Updated by Peter Amstutz about 3 years ago

  • Target version changed from 2022-01-05 sprint to 2022-01-19 sprint
Actions #10

Updated by Peter Amstutz about 3 years ago

  • Category deleted (Crunch)
  • Target version changed from 2022-01-19 sprint to 2022-01-05 sprint
  • Release deleted (46)
Actions #11

Updated by Tom Clegg almost 3 years ago

Too bad, I thought it would be least surprising for it to work the same way as "mem" and "tmp". But indeed, if 0 is not a valid number of GPUs then the proposal makes sense to me.

Actions #13

Updated by Peter Amstutz almost 3 years ago

  • Target version changed from 2022-01-05 sprint to 2022-01-19 sprint
Actions #14

Updated by Tom Clegg almost 3 years ago

LGTM, thanks!

If you didn't already see it, while you were waiting for me to review this I did #18604 which removes the generated_config.go file entirely -- if git-merge seems surprised, that's probably why.

Actions #15

Updated by Peter Amstutz almost 3 years ago

  • Target version deleted (2022-01-19 sprint)
Actions #16

Updated by Peter Amstutz almost 3 years ago

  • Target version set to 2022-01-19 sprint
Actions #17

Updated by Peter Amstutz almost 3 years ago

  • Status changed from In Progress to Resolved
Actions #18

Updated by Peter Amstutz over 2 years ago

  • Release set to 46