Bug #23459

closed

Special case of instance types with GPU.VRAM==0 broken after #14922

Added by Brett Smith about 1 month ago. Updated 22 days ago.

Status: Resolved
Priority: Normal
Assigned To: -
Category: Dispatchers
Target version: -
Story points: -

Description

When you try to run a CUDA container on tordo, arvados-dispatch-cloud logs a message: "BUG? insufficient resources on idle instance i-0fc6227c2bc61bbcd type g4dnxlarge for container tordo-dz642-2lr67y2o4ajwhx9: ir {VCPUs:4 RAM:17179869184 Scratch:8000000000000 GPUs:1 GPUVRAM:0 sharedVCPUUsed:false} ctrr {VCPUs:4 RAM:14405888969 Scratch:15737028608 GPUs:1 GPUVRAM:8388608000 sharedVCPUUsed:false}"
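Comparing the two resource structs from the log field by field shows that only GPUVRAM falls short: the instance reports 0 while the container requests 8388608000 bytes. A minimal sketch (hypothetical helper, not the actual Arvados code):

```go
package main

import "fmt"

// Resources mirrors the fields printed in the "BUG?" log message.
type Resources struct {
	VCPUs   int
	RAM     int64
	Scratch int64
	GPUs    int
	GPUVRAM int64
}

// insufficient lists the fields where the instance (ir) falls short of
// the container request (ctrr).
func insufficient(ir, ctrr Resources) []string {
	var out []string
	if ir.VCPUs < ctrr.VCPUs {
		out = append(out, "VCPUs")
	}
	if ir.RAM < ctrr.RAM {
		out = append(out, "RAM")
	}
	if ir.Scratch < ctrr.Scratch {
		out = append(out, "Scratch")
	}
	if ir.GPUs < ctrr.GPUs {
		out = append(out, "GPUs")
	}
	if ir.GPUVRAM < ctrr.GPUVRAM {
		out = append(out, "GPUVRAM")
	}
	return out
}

func main() {
	// Values copied from the log message above.
	ir := Resources{VCPUs: 4, RAM: 17179869184, Scratch: 8000000000000, GPUs: 1, GPUVRAM: 0}
	ctrr := Resources{VCPUs: 4, RAM: 14405888969, Scratch: 15737028608, GPUs: 1, GPUVRAM: 8388608000}
	fmt.Println(insufficient(ir, ctrr)) // only GPUVRAM is insufficient
}
```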


Subtasks 1 (0 open, 1 closed)

Task #23462: Review 23459-remove-instance-type-cuda (Resolved, Brett Smith, 03/02/2026)

Related issues 2 (0 open, 2 closed)

Related to Arvados - Feature #14922: Run multiple containers concurrently on a single cloud VM (Resolved, Tom Clegg)
Related to Arvados - Feature #21926: AMD ROCm GPU support (Resolved, Peter Amstutz)
Actions #1

Updated by Brett Smith 28 days ago

git log -G GPUVRAM only returns commits from #14922. I think this is effectively a regression from that ticket. Since that work isn't slated to be included in 3.2.1, I think we can proceed with testing. We should just double-check that a CUDA workflow passes with a 3.2.1 release candidate.
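For reference, git log -G searches the patch text itself (unlike --grep, which only searches commit messages), which is why it surfaces every commit that touched GPUVRAM. A self-contained demonstration in a throwaway repo:

```shell
# Demonstrate git log -G: it matches commits whose diff text contains
# the pattern, not commits whose message mentions it.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git -c user.email=a@b -c user.name=t commit -q --allow-empty -m "init"
echo "GPUVRAM int64" > node_size.go
git add node_size.go
git -c user.email=a@b -c user.name=t commit -q -m "add GPUVRAM field"
# Only the second commit touched GPUVRAM in its diff:
git log --oneline -G GPUVRAM
```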

Actions #2

Updated by Brett Smith 28 days ago

  • Related to Feature #14922: Run multiple containers concurrently on a single cloud VM added
Actions #3

Updated by Brett Smith 27 days ago · Edited

This might just be a cluster configuration issue: the software might be expecting the configured instance types to list GPUVRAM, and tordo isn't.

If that's the case, there should probably be an upgrade note about this. Even if the configuration was available before, if we're only now starting to enforce it, that's worth mentioning.

Actions #4

Updated by Brett Smith 27 days ago · Edited

The command I found most helpful for getting the history here is git log -p --follow lib/dispatchcloud/container/node_size.go.

2394ebe80486c4f163ee536109f068d24e5f813c says "Handle backwards compatability CUDA->GPU in config," comments "If VRAM is unspecified for an instance type, it can't be used to reject the instance," and adds a check it.GPU.VRAM > 0 to ChooseInstanceType.

In fc698018e83e0f7258c0adc24d1c6054541a3515, the new Less and Minus functions (later renamed Accommodates and Sub in 4116ede52db4ab244620dc16636a8023b85c9384) don't have any similar guard. They expect GPUVRAM to be set.

So when we call ChooseInstanceType, it's willing to select a type with GPUVRAM == 0, but then later on Accommodates says no, that type can't accommodate this container request. That's the inconsistency that leads to the "BUG?" log.

Now the question is what should we do about it. The simplest thing would be to document that you must set GPU.VRAM in your instance types now.

We could add code to Accommodates and Sub to deal with the case of GPUVRAM == 0. But then that kind of undermines the container sharing efforts we're building towards. I'm not sure we should.
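The inconsistency can be sketched like this (simplified, hypothetical stand-ins for the guard in ChooseInstanceType and for Accommodates; not the actual Arvados code):

```go
package main

import "fmt"

type InstanceType struct {
	GPUs    int
	GPUVRAM int64 // 0 when the cluster config omits GPU.VRAM
}

type Container struct {
	GPUs    int
	GPUVRAM int64
}

// chooseOK mimics the guard from 2394ebe8: if VRAM is unspecified
// (zero) for an instance type, it can't be used to reject the type.
func chooseOK(it InstanceType, ctr Container) bool {
	if it.GPUs < ctr.GPUs {
		return false
	}
	if it.GPUVRAM > 0 && it.GPUVRAM < ctr.GPUVRAM {
		return false
	}
	return true
}

// accommodates mimics the later Accommodates check, which has no such
// guard and compares VRAM unconditionally.
func accommodates(it InstanceType, ctr Container) bool {
	return it.GPUs >= ctr.GPUs && it.GPUVRAM >= ctr.GPUVRAM
}

func main() {
	it := InstanceType{GPUs: 1, GPUVRAM: 0}        // VRAM not configured
	ctr := Container{GPUs: 1, GPUVRAM: 8388608000} // needs ~8 GB VRAM
	// ChooseInstanceType says yes, Accommodates says no -> "BUG?" log.
	fmt.Println(chooseOK(it, ctr), accommodates(it, ctr))
}
```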

Actions #5

Updated by Brett Smith 27 days ago

  • Subject changed from Can't run CUDA containers on tordo because of insufficient GPUVRAM to Special case of instance types with GPU.VRAM==0 broken after #14922
Actions #6

Updated by Brett Smith 27 days ago

Actions #7

Updated by Brett Smith 27 days ago

I will sleep on it but right now I'm leaning towards removing the special case and support for the old CUDA configuration sections anyway. We added the GPU configuration in Arvados 3.1.0 and wrote in the upgrade note:

Admins are advised to update the configuration file as the legacy field will be removed in a future version.

It's been about a year so I feel pretty okay doing that now.

Actions #8

Updated by Brett Smith 25 days ago

23459-remove-instance-type-cuda @ f1d9653659f563fb02ad2f9630c022e0122037f2 - developer-run-tests: #5039

The original problem arises when configured InstanceTypes use the old CUDA section. This was deprecated in Arvados 3.1.0, so this branch simply removes that support to avoid the problem, and documents the change.
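For affected clusters, the fix is to replace the old CUDA section with the GPU section and set VRAM explicitly. An illustrative sketch (field names are based on our reading of the Arvados config schema; verify against the current config reference before use):

```yaml
InstanceTypes:
  g4dnxlarge:
    VCPUs: 4
    RAM: 16GiB
    # Old, now-removed form:
    # CUDA:
    #   DriverVersion: "11.4"
    #   HardwareCapability: "7.5"
    #   DeviceCount: 1
    # New form; VRAM accepts a byte suffix and should be set:
    GPU:
      Stack: cuda
      DriverVersion: "11.4"
      HardwareTarget: "7.5"
      DeviceCount: 1
      VRAM: 16GiB
```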

  • All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
    • Yes
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • N/A
  • Code is tested and passing, both automated and manual, what manual testing was done is described.
    • See above
    • Manually tested that this version of arvados-server config-check reports an error if any InstanceTypes have a CUDA section.
    • Manually tested that this version of arvados-server config-check accepts GPU.VRAM with a byte suffix.
  • Tested code incorporates recent main branch changes.
    • Yes
  • New or changed UI/UX and has gotten feedback from stakeholders.
    • N/A
  • Documentation has been updated.
    • Yes
  • Behaves appropriately at the intended scale (describe intended scale).
    • No change
  • Considered backwards and forwards compatibility issues between client and server.
    • We said we would remove the feature; now we are following through.
  • Follows our coding standards and GUI style guidelines.
    • Yes
Actions #9

Updated by Brett Smith 25 days ago

  • Subtask #23462 added
Actions #10

Updated by Lisa Knox 22 days ago

lgtm

Actions #11

Updated by Brett Smith 22 days ago

  • Status changed from New to Resolved