Feature #18324
Closed
LSF support for requesting node with CUDA support
100%
Description
https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=features-enabling-jobs-use-gpu-resources
According to this, GPUs can be configured at the job level, but also at the queue level, so depending on the site, you might need to request a specific queue.
Customer email:
these are the parameters we're using to request GPUs:
-gpu "num=1:j_exclusive=yes"
The exclusive part should probably be configurable as it's not mandatory, but on our cluster the default is that GPUs are shared, so we recommend our users to request them exclusively.
Maybe having a parameter for the GPU string with a placeholder for the number of GPUs similar to the Memory or CPUs.
Proposed design:
Add a new option "LSF.BsubCUDAArguments". It is appended to the end of "BsubArgumentsList" when CUDA.DeviceCount > 0 in the container runtime constraints. Introduce a new template variable %G for the value of DeviceCount.
Example:
BsubCUDAArguments: ["-gpu", "num=%G:j_exclusive=yes"]
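To make the proposed behavior concrete, here is a minimal sketch of the argument assembly, not the actual dispatcher code. The function name `build_bsub_args` and the sample base arguments are hypothetical, and the sketch handles only the %G placeholder, ignoring the other template variables.

```python
def build_bsub_args(base_args, cuda_args, device_count):
    """Return the bsub argument list for one container (illustrative only).

    base_args    -- stands in for BsubArgumentsList
    cuda_args    -- stands in for BsubCUDAArguments
    device_count -- stands in for CUDA.DeviceCount in the runtime constraints
    """
    args = list(base_args)
    # Append the CUDA arguments only when the container requests a GPU.
    if device_count > 0:
        args.extend(cuda_args)
    # Substitute the %G placeholder with the requested number of GPUs.
    return [arg.replace("%G", str(device_count)) for arg in args]


base = ["-J", "example-job"]  # hypothetical base arguments
cuda = ["-gpu", "num=%G:j_exclusive=yes"]

print(build_bsub_args(base, cuda, 0))
print(build_bsub_args(base, cuda, 2))
```

With this shape, a container that requests no GPUs never passes a -gpu option to bsub at all, while a GPU container gets the site-configured string with its device count filled in.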
Updated by Peter Amstutz about 3 years ago
- Related to Story #15957: GPU support added
Updated by Peter Amstutz about 3 years ago
- Assigned To set to Peter Amstutz
- Target version set to 2022-01-05 sprint
Updated by Tom Clegg about 3 years ago
Seems like we could just add ["-gpu", "num=%G:j_exclusive=yes"] to BsubArgumentsList all the time, without a separate "only if GPU" config? Or would -gpu num=0:j_exclusive=yes do something undesirable to non-GPU jobs?
Updated by Peter Amstutz about 3 years ago
Tom Clegg wrote:
Seems like we could just add ["-gpu", "num=%G:j_exclusive=yes"] to BsubArgumentsList all the time, without a separate "only if GPU" config? Or would -gpu num=0:j_exclusive=yes do something undesirable to non-GPU jobs?
Carlos wrote:
I just tested this as I was not sure what would happen and LSF doesn't seem to like it:
$ bsub -gpu num=0 hostname
GPU num not valid in gpu requirement. Job not submitted.
Which is pretty much the outcome I expected. Time to go back to plan A?
Updated by Peter Amstutz about 3 years ago
- Status changed from New to In Progress
Updated by Peter Amstutz about 3 years ago
- Target version changed from 2022-01-05 sprint to 2022-01-19 sprint
Updated by Peter Amstutz about 3 years ago
- Category deleted (Crunch)
- Target version changed from 2022-01-19 sprint to 2022-01-05 sprint
- Release deleted (46)
Updated by Tom Clegg almost 3 years ago
Too bad, I thought it would be least surprising for it to work the same way as "mem" and "tmp". But indeed, if 0 is not a valid number of GPUs then the proposal makes sense to me.
Updated by Peter Amstutz almost 3 years ago
18324-lsf-gpu @ 565612fd40474044e2afaa4fcb993c8c0197ca8e
Updated by Peter Amstutz almost 3 years ago
- Target version changed from 2022-01-05 sprint to 2022-01-19 sprint
Updated by Tom Clegg almost 3 years ago
LGTM, thanks!
If you didn't already see it, while you were waiting for me to review this I did #18604 which removes the generated_config.go file entirely -- if git-merge seems surprised, that's probably why.
Updated by Peter Amstutz almost 3 years ago
- Target version deleted (2022-01-19 sprint)
Updated by Peter Amstutz almost 3 years ago
- Target version set to 2022-01-19 sprint
Updated by Peter Amstutz almost 3 years ago
- Status changed from In Progress to Resolved