Bug #23459

closed

Special case of instance types with GPU.VRAM==0 broken after #14922

Added by Brett Smith about 1 month ago. Updated 22 days ago.

Status: Resolved
Priority: Normal
Assigned To: -
Category: Dispatchers
Target version: -
Story points: -

Description

When you try to run a CUDA container on tordo, arvados-dispatch-cloud logs a message: "BUG? insufficient resources on idle instance i-0fc6227c2bc61bbcd type g4dnxlarge for container tordo-dz642-2lr67y2o4ajwhx9: ir {VCPUs:4 RAM:17179869184 Scratch:8000000000000 GPUs:1 GPUVRAM:0 sharedVCPUUsed:false} ctrr {VCPUs:4 RAM:14405888969 Scratch:15737028608 GPUs:1 GPUVRAM:8388608000 sharedVCPUUsed:false}"
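Comparing the two resource structs from the log field by field shows that only GPUVRAM falls short: the instance reports 0 while the container requests 8388608000 bytes. A minimal sketch (hypothetical helper, not the actual Arvados code):

```go
package main

import "fmt"

// Resources mirrors the fields printed in the "BUG?" log message.
type Resources struct {
	VCPUs   int
	RAM     int64
	Scratch int64
	GPUs    int
	GPUVRAM int64
}

// insufficient lists the fields where the instance (ir) falls short of
// the container request (ctrr).
func insufficient(ir, ctrr Resources) []string {
	var out []string
	if ir.VCPUs < ctrr.VCPUs {
		out = append(out, "VCPUs")
	}
	if ir.RAM < ctrr.RAM {
		out = append(out, "RAM")
	}
	if ir.Scratch < ctrr.Scratch {
		out = append(out, "Scratch")
	}
	if ir.GPUs < ctrr.GPUs {
		out = append(out, "GPUs")
	}
	if ir.GPUVRAM < ctrr.GPUVRAM {
		out = append(out, "GPUVRAM")
	}
	return out
}

func main() {
	// Values copied from the log message above.
	ir := Resources{VCPUs: 4, RAM: 17179869184, Scratch: 8000000000000, GPUs: 1, GPUVRAM: 0}
	ctrr := Resources{VCPUs: 4, RAM: 14405888969, Scratch: 15737028608, GPUs: 1, GPUVRAM: 8388608000}
	fmt.Println(insufficient(ir, ctrr)) // only GPUVRAM is insufficient
}
```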


Subtasks 1 (0 open, 1 closed)

Task #23462: Review 23459-remove-instance-type-cuda (Resolved, Brett Smith, 03/02/2026)

Related issues 2 (0 open, 2 closed)

Related to Arvados - Feature #14922: Run multiple containers concurrently on a single cloud VM (Resolved, Tom Clegg)
Related to Arvados - Feature #21926: AMD ROCm GPU support (Resolved, Peter Amstutz)
Actions #1

Updated by Brett Smith 28 days ago

git log -G GPUVRAM only returns commits from #14922. I think this is effectively a regression from that ticket. Since that work isn't slated to be included in 3.2.1, I think we can proceed with testing. We should just double-check that a CUDA workflow passes with a 3.2.1 release candidate.
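For reference, git log -G searches the patch text itself (unlike --grep, which only searches commit messages), which is why it surfaces every commit that touched GPUVRAM. A self-contained demonstration in a throwaway repo:

```shell
# Demonstrate git log -G: it matches commits whose diff text contains
# the pattern, not commits whose message mentions it.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git -c user.email=a@b -c user.name=t commit -q --allow-empty -m "init"
echo "GPUVRAM int64" > node_size.go
git add node_size.go
git -c user.email=a@b -c user.name=t commit -q -m "add GPUVRAM field"
# Only the second commit touched GPUVRAM in its diff:
git log --oneline -G GPUVRAM
```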

Actions #2

Updated by Brett Smith 28 days ago

  • Related to Feature #14922: Run multiple containers concurrently on a single cloud VM added
Actions #3

Updated by Brett Smith 27 days ago · Edited

This might just be a cluster configuration issue: the software might be expecting the configured instance types to list GPUVRAM, and tordo isn't.

If that's the case, there should probably be an upgrade note about this. Even if the configuration was available before, if we're only now starting to enforce it, that's worth mentioning.

Actions #4

Updated by Brett Smith 27 days ago · Edited

The command I found most helpful for getting the history here is git log -p --follow lib/dispatchcloud/container/node_size.go.

2394ebe80486c4f163ee536109f068d24e5f813c says "Handle backwards compatability CUDA->GPU in config," comments "If VRAM is unspecified for an instance type, it can't be used to reject the instance," and adds a check it.GPU.VRAM > 0 to ChooseInstanceType.

In fc698018e83e0f7258c0adc24d1c6054541a3515, the new Less and Minus functions (later renamed Accommodates and Sub in 4116ede52db4ab244620dc16636a8023b85c9384) don't have any similar guard. They expect GPUVRAM to be set.

So when we call ChooseInstanceType, it's willing to select a type with GPUVRAM == 0, but then later on Accommodates says no, that type can't accommodate this container request. That's the inconsistency that leads to the "BUG?" log.

Now the question is what should we do about it. The simplest thing would be to document that you must set GPU.VRAM in your instance types now.

We could add code to Accommodates and Sub to deal with the case of GPUVRAM == 0. But then that kind of undermines the container sharing efforts we're building towards. I'm not sure we should.
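The inconsistency can be sketched like this (simplified, hypothetical stand-ins for the guard in ChooseInstanceType and for Accommodates; not the actual Arvados code):

```go
package main

import "fmt"

type InstanceType struct {
	GPUs    int
	GPUVRAM int64 // 0 when the cluster config omits GPU.VRAM
}

type Container struct {
	GPUs    int
	GPUVRAM int64
}

// chooseOK mimics the guard from 2394ebe8: if VRAM is unspecified
// (zero) for an instance type, it can't be used to reject the type.
func chooseOK(it InstanceType, ctr Container) bool {
	if it.GPUs < ctr.GPUs {
		return false
	}
	if it.GPUVRAM > 0 && it.GPUVRAM < ctr.GPUVRAM {
		return false
	}
	return true
}

// accommodates mimics the later Accommodates check, which has no such
// guard and compares VRAM unconditionally.
func accommodates(it InstanceType, ctr Container) bool {
	return it.GPUs >= ctr.GPUs && it.GPUVRAM >= ctr.GPUVRAM
}

func main() {
	it := InstanceType{GPUs: 1, GPUVRAM: 0}        // VRAM not configured
	ctr := Container{GPUs: 1, GPUVRAM: 8388608000} // needs ~8 GB VRAM
	// ChooseInstanceType says yes, Accommodates says no -> "BUG?" log.
	fmt.Println(chooseOK(it, ctr), accommodates(it, ctr))
}
```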

Actions #5

Updated by Brett Smith 27 days ago

  • Subject changed from Can't run CUDA containers on tordo because of insufficient GPUVRAM to Special case of instance types with GPU.VRAM==0 broken after #14922
Actions #6

Updated by Brett Smith 27 days ago

Actions #7

Updated by Brett Smith 27 days ago

I will sleep on it but right now I'm leaning towards removing the special case and support for the old CUDA configuration sections anyway. We added the GPU configuration in Arvados 3.1.0 and wrote in the upgrade note:

Admins are advised to update the configuration file as the legacy field will be removed in a future version.

It's been about a year so I feel pretty okay doing that now.

Actions #8

Updated by Brett Smith 25 days ago

23459-remove-instance-type-cuda @ f1d9653659f563fb02ad2f9630c022e0122037f2 - developer-run-tests: #5039

The original problem arises when configured InstanceTypes use the old CUDA section. This was deprecated in Arvados 3.1.0, so this branch simply removes that support to avoid the problem, and documents the change.
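For affected clusters, the fix is to replace the old CUDA section with the GPU section and set VRAM explicitly. An illustrative sketch (field names are based on our reading of the Arvados config schema; verify against the current config reference before use):

```yaml
InstanceTypes:
  g4dnxlarge:
    VCPUs: 4
    RAM: 16GiB
    # Old, now-removed form:
    # CUDA:
    #   DriverVersion: "11.4"
    #   HardwareCapability: "7.5"
    #   DeviceCount: 1
    # New form; VRAM accepts a byte suffix and should be set:
    GPU:
      Stack: cuda
      DriverVersion: "11.4"
      HardwareTarget: "7.5"
      DeviceCount: 1
      VRAM: 16GiB
```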

  • All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
    • Yes
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • N/A
  • Code is tested and passing, both automated and manual, what manual testing was done is described.
    • See above
    • Manually tested that this version of arvados-server config-check reports an error if any InstanceTypes have a CUDA section.
    • Manually tested that this version of arvados-server config-check accepts GPU.VRAM with a byte suffix.
  • Tested code incorporates recent main branch changes.
    • Yes
  • New or changed UI/UX and has gotten feedback from stakeholders.
    • N/A
  • Documentation has been updated.
    • Yes
  • Behaves appropriately at the intended scale (describe intended scale).
    • No change
  • Considered backwards and forwards compatibility issues between client and server.
    • We said we would remove the feature; now we are following through.
  • Follows our coding standards and GUI style guidelines.
    • Yes
Actions #9

Updated by Brett Smith 25 days ago

  • Subtask #23462 added
Actions #10

Updated by Lisa Knox 22 days ago

lgtm

Actions #11

Updated by Brett Smith 22 days ago

  • Status changed from New to Resolved