Bug #23459
closed
Special case of instance types with GPU.VRAM==0 broken after #14922
Description
When you try to run a CUDA container on tordo, arvados-dispatch-cloud logs a message: "BUG? insufficient resources on idle instance i-0fc6227c2bc61bbcd type g4dnxlarge for container tordo-dz642-2lr67y2o4ajwhx9: ir {VCPUs:4 RAM:17179869184 Scratch:8000000000000 GPUs:1 GPUVRAM:0 sharedVCPUUsed:false} ctrr {VCPUs:4 RAM:14405888969 Scratch:15737028608 GPUs:1 GPUVRAM:8388608000 sharedVCPUUsed:false}"
Updated by Brett Smith 28 days ago
git log -G GPUVRAM only returns commits from #14922. I think this is effectively a regression from that ticket. Since that work isn't slated to be included in 3.2.1, I think we can proceed with testing. We should just double-check that a CUDA workflow passes with a 3.2.1 release candidate.
Updated by Brett Smith 28 days ago
- Related to Feature #14922: Run multiple containers concurrently on a single cloud VM added
Updated by Brett Smith 27 days ago
· Edited
This might just be a cluster configuration issue: the software might be expecting the configured instance types to list GPUVRAM, and tordo's don't.
If that's the case, there should probably be an upgrade note about this. Even if the configuration was available before, if we're only now starting to enforce it, that's worth mentioning.
Updated by Brett Smith 27 days ago
· Edited
The command I found most helpful for getting the history here is git log -p --follow lib/dispatchcloud/container/node_size.go.
2394ebe80486c4f163ee536109f068d24e5f813c says "Handle backwards compatability CUDA->GPU in config," comments "If VRAM is unspecified for an instance type, it can't be used to reject the instance," and adds a check it.GPU.VRAM > 0 to ChooseInstanceType.
In fc698018e83e0f7258c0adc24d1c6054541a3515, the new Less and Minus functions (later renamed Accommodates and Sub in 4116ede52db4ab244620dc16636a8023b85c9384) don't have any similar guard. They expect GPUVRAM to be set.
So when we call ChooseInstanceType, it's willing to select a type with GPUVRAM ⩵ 0, but then later on Accommodates says no, that type can't accommodate this container request. That's the inconsistency that leads to the "BUG?" log.
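To make the mismatch concrete, here is a minimal sketch using the numbers from the log message. The resources struct and the selectable and accommodates functions are hypothetical stand-ins written for this comment, not the actual code in node_size.go; they only assume that selection skips the VRAM comparison when the instance type's VRAM is 0, while the later check compares it unconditionally.

```go
package main

import "fmt"

// resources is a hypothetical stand-in for the dispatcher's resource struct.
// Field names mirror the log message, not the real Arvados types.
type resources struct {
	VCPUs   int
	RAM     int64
	Scratch int64
	GPUs    int
	GPUVRAM int64
}

// selectable mimics the guarded check described for ChooseInstanceType:
// an instance type with unspecified (zero) VRAM is never rejected on VRAM.
func selectable(inst, need resources) bool {
	if inst.GPUVRAM > 0 && inst.GPUVRAM < need.GPUVRAM {
		return false
	}
	return inst.VCPUs >= need.VCPUs && inst.RAM >= need.RAM &&
		inst.Scratch >= need.Scratch && inst.GPUs >= need.GPUs
}

// accommodates mimics an unguarded comparison: zero VRAM is treated as
// "no VRAM available", so any container that requests VRAM is refused.
func accommodates(inst, need resources) bool {
	return inst.VCPUs >= need.VCPUs && inst.RAM >= need.RAM &&
		inst.Scratch >= need.Scratch && inst.GPUs >= need.GPUs &&
		inst.GPUVRAM >= need.GPUVRAM
}

func main() {
	// Values copied from the "ir" and "ctrr" parts of the log message.
	inst := resources{VCPUs: 4, RAM: 17179869184, Scratch: 8000000000000, GPUs: 1, GPUVRAM: 0}
	need := resources{VCPUs: 4, RAM: 14405888969, Scratch: 15737028608, GPUs: 1, GPUVRAM: 8388608000}

	fmt.Println("selectable:", selectable(inst, need))     // selectable: true
	fmt.Println("accommodates:", accommodates(inst, need)) // accommodates: false
}
```

The two answers disagree for the same pair of values, which is the state the "BUG?" message reports.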
Now the question is what should we do about it. The simplest thing would be to document that you must set GPU.VRAM in your instance types now.
We could add code to Accommodates and Sub to deal with the case of GPUVRAM ⩵ 0. But then that kind of undermines the container sharing efforts we're building towards. I'm not sure we should.
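For illustration only, a guarded version might look roughly like the sketch below. The gpuResources type and the accommodatesLenient and subLenient functions are hypothetical names, not the real Accommodates and Sub. The reading of why this hurts container sharing is my assumption: if 0 means "unknown", the remaining VRAM stays unknown after every subtraction, so nothing stops a second GPU container from landing on the same instance.

```go
package main

import "fmt"

// gpuResources is a hypothetical, GPU-only slice of the resource struct.
type gpuResources struct {
	GPUs    int
	GPUVRAM int64
}

// accommodatesLenient treats GPUVRAM == 0 as "unspecified" and does not
// reject on VRAM in that case.
func accommodatesLenient(inst, need gpuResources) bool {
	if inst.GPUs < need.GPUs {
		return false
	}
	return inst.GPUVRAM == 0 || inst.GPUVRAM >= need.GPUVRAM
}

// subLenient subtracts need from inst, leaving unspecified VRAM unspecified
// instead of letting it go negative.
func subLenient(inst, need gpuResources) gpuResources {
	out := gpuResources{GPUs: inst.GPUs - need.GPUs}
	if inst.GPUVRAM > 0 {
		out.GPUVRAM = inst.GPUVRAM - need.GPUVRAM
	}
	return out
}

func main() {
	inst := gpuResources{GPUs: 2, GPUVRAM: 0} // VRAM not configured
	need := gpuResources{GPUs: 1, GPUVRAM: 8388608000}

	left := subLenient(inst, need)
	// Both placements are accepted because VRAM is never accounted for.
	fmt.Println(accommodatesLenient(inst, need), accommodatesLenient(left, need)) // true true
	fmt.Printf("%+v\n", left)                                                     // {GPUs:1 GPUVRAM:0}
}
```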
Updated by Brett Smith 27 days ago
- Subject changed from Can't run CUDA containers on tordo because of insufficient GPUVRAM to Special case of instance types with GPU.VRAM==0 broken after #14922
Updated by Brett Smith 27 days ago
- Related to Feature #21926: AMD ROCm GPU support added
Updated by Brett Smith 27 days ago
I will sleep on it but right now I'm leaning towards removing the special case and support for the old CUDA configuration sections anyway. We added the GPU configuration in Arvados 3.1.0 and wrote in the upgrade note:
Admins are advised to update the configuration file as the legacy field will be removed in a future version.
It's been about a year so I feel pretty okay doing that now.
Updated by Brett Smith 25 days ago
23459-remove-instance-type-cuda @ f1d9653659f563fb02ad2f9630c022e0122037f2 - developer-run-tests: #5039
The original problem arises when configured InstanceTypes use the old CUDA section. This was deprecated in Arvados 3.1.0, so this branch simply removes that support to avoid the problem, and documents the change.
- All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
- Yes
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- N/A
- Code is tested and passing, both automated and manual, what manual testing was done is described.
- See above
- Manually tested this version of arvados-server config-check reports an error if any InstanceTypes have a CUDA section.
- Manually tested this version of arvados-server config-check accepts GPU.VRAM with a byte suffix.
- Tested code incorporates recent main branch changes.
- Yes
- New or changed UI/UX and has gotten feedback from stakeholders.
- N/A
- Documentation has been updated.
- Yes
- Behaves appropriately at the intended scale (describe intended scale).
- No change
- Considered backwards and forwards compatibility issues between client and server.
- We said we would remove the feature, now we are
- Follows our coding standards and GUI style guidelines.
- Yes
Updated by Brett Smith 22 days ago
- Status changed from New to Resolved
Applied in changeset arvados|3985c82195ff53c4c2cbeb0451bcc5925eefccc8.