Feature #18324
closed
LSF support for requesting node with CUDA support
Added by Peter Amstutz about 3 years ago.
Updated over 2 years ago.
Estimated time:
(Total: 0.00 h)
Release relationship:
Auto
Description
https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=features-enabling-jobs-use-gpu-resources
According to this, GPUs can be configured at the job level, but also at the queue level, so depending on the site, you might need to request a specific queue.
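For example, on a site where GPU nodes sit behind a dedicated queue, a job-level request might look like this (the queue name "gpu_queue" is hypothetical; -q selects the queue and -gpu is the GPU request option described in the IBM docs linked above):
$ bsub -q gpu_queue -gpu "num=1" ./my_job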
Customer email:
These are the parameters we're using to request GPUs:
-gpu "num=1:j_exclusive=yes"
The exclusive part should probably be configurable as it's not mandatory, but on our cluster the default is that GPUs are shared, so we recommend that our users request them exclusively.
Maybe have a parameter for the GPU string with a placeholder for the number of GPUs, similar to the memory or CPU ones.
Proposed design:
Add a new option "LSF.BsubCUDAArguments". It is appended to the end of "BsubArgumentsList" when CUDA.DeviceCount > 0 in the container runtime constraints. Introduce a new template variable %G for the value of DeviceCount.
Example:
BsubCUDAArguments: ["-gpu", "num=%G:j_exclusive=yes"]
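A minimal sketch of the proposed expansion, in Go (illustrative only; the function name and signature are made up here, not the actual crunch-dispatch-lsf code):

// bsubArgs appends the CUDA arguments only when the container requests
// GPUs (bsub rejects num=0, as noted in the discussion below), then
// substitutes the proposed %G placeholder with the CUDA device count.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

func bsubArgs(argumentsList, cudaArguments []string, deviceCount int) []string {
	args := append([]string{}, argumentsList...)
	if deviceCount > 0 {
		args = append(args, cudaArguments...)
	}
	for i, a := range args {
		args[i] = strings.ReplaceAll(a, "%G", strconv.Itoa(deviceCount))
	}
	return args
}

func main() {
	fmt.Println(bsubArgs(
		[]string{"-J", "example-job", "-n", "4"},
		[]string{"-gpu", "num=%G:j_exclusive=yes"},
		1,
	))
	// prints: [-J example-job -n 4 -gpu num=1:j_exclusive=yes]
}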
- Assigned To set to Peter Amstutz
- Target version set to 2022-01-05 sprint
- Description updated (diff)
- Description updated (diff)
Seems like we could just add ["-gpu", "num=%G:j_exclusive=yes"] to BsubArgumentsList all the time, without a separate "only if GPU" config? Or would -gpu num=0:j_exclusive=yes do something undesirable to non-GPU jobs?
Tom Clegg wrote:
Seems like we could just add ["-gpu", "num=%G:j_exclusive=yes"] to BsubArgumentsList all the time, without a separate "only if GPU" config? Or would -gpu num=0:j_exclusive=yes do something undesirable to non-GPU jobs?
Carlos wrote:
I just tested this as I was not sure what would happen and LSF doesn't seem to like it:
$ bsub -gpu num=0 hostname
GPU num not valid in gpu requirement. Job not submitted.
Which is pretty much the outcome I expected. Time to go back to plan A?
- Status changed from New to In Progress
- Target version changed from 2022-01-05 sprint to 2022-01-19 sprint
- Category deleted (Crunch)
- Target version changed from 2022-01-19 sprint to 2022-01-05 sprint
- Release deleted (46)
Too bad, I thought it would be least surprising for it to work the same way as "mem" and "tmp". But indeed, if 0 is not a valid number of GPUs then the proposal makes sense to me.
- Target version changed from 2022-01-05 sprint to 2022-01-19 sprint
LGTM, thanks!
If you didn't already see it: while you were waiting for me to review this, I did #18604, which removes the generated_config.go file entirely -- if git-merge seems surprised, that's probably why.
- Target version deleted (2022-01-19 sprint)
- Target version set to 2022-01-19 sprint
- Status changed from In Progress to Resolved