Bug #23104
compute_nvidia ansible role fails to install packages
Status: Closed
Description
While trying to build a compute node image with NVIDIA support, I'm seeing package install errors like the ones in this run: packer-build-compute-image: #339
I've manually reproduced this in a docker container; the full error output is the following:
# apt install cuda
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:
The following packages have unmet dependencies:
nvidia-driver-cuda : Depends: nvidia-opencl-icd (= 570.172.08-1) but it is not going to be installed
Depends: nvidia-persistenced (= 570.172.08-1) but it is not going to be installed
nvidia-driver-libs : Depends: libnvidia-egl-xcb1 (>= 1:1.0.1) but 570.124.06-1 is to be installed
Depends: libnvidia-egl-xlib1 (>= 1:1.0.1) but 570.124.06-1 is to be installed
Recommends: nvidia-driver-libs:i386 (= 570.172.08-1)
nvidia-open-560 : Depends: nvidia-kernel-open-dkms (< 561) but 570.172.08-1 is to be installed
Depends: nvidia-driver (< 561) but 570.172.08-1 is to be installed
Depends: libcuda1 (< 561) but 570.172.08-1 is to be installed
Depends: libcudadebugger1 (< 561) but 570.172.08-1 is to be installed
Depends: libnvcuvid1 (< 561) but 570.172.08-1 is to be installed
Depends: libnvidia-allocator1 (< 561) but 570.172.08-1 is to be installed
Depends: libnvidia-pkcs11-openssl3 (< 561) but 570.172.08-1 is to be installed
Depends: libnvidia-encode1 (< 561) but 570.172.08-1 is to be installed
Depends: libnvidia-fbc1 (< 561) but 570.172.08-1 is to be installed
Depends: libnvidia-opticalflow1 (< 561) but 570.172.08-1 is to be installed
Depends: libnvidia-ptxjitcompiler1 (< 561) but 570.172.08-1 is to be installed
Depends: libnvoptix1 (< 561) but 570.172.08-1 is to be installed
Depends: libxnvctrl-dev (< 561) but 570.172.08-1 is to be installed
Depends: nvidia-cuda-mps (>= 560.35.05)
Depends: nvidia-cuda-mps (< 561)
Depends: nvidia-opencl-common (< 561)
Depends: nvidia-opencl-icd (>= 560.35.05) but it is not going to be installed
Depends: nvidia-opencl-icd (< 561) but it is not going to be installed
Depends: nvidia-smi (>= 560.35.05)
Depends: nvidia-smi (< 561)
Depends: nvidia-modprobe (< 561) but 570.172.08-1 is to be installed
Depends: nvidia-settings (< 561) but 570.172.08-1 is to be installed
nvidia-xconfig : Depends: libnvidia-cfg1 (= 580.65.06-1) but 570.172.08-1 is to be installed
Recommends: libglx-nvidia0 (= 580.65.06-1) but 570.172.08-1 is to be installed
E: Unable to correct problems, you have held broken packages.
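For anyone wanting to reproduce this outside the image build, something along these lines should hit the same failure (the cuda-keyring URL here is an assumption on my part; use whatever repository the compute_nvidia role actually configures):
# docker run --rm -it debian:12 bash
# apt update && apt install -y wget ca-certificates
# wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
# dpkg -i cuda-keyring_1.1-1_all.deb
# apt update
# apt install cuda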
Updated by Brett Smith 7 months ago
To be completely effective, our apt pins need to cover not just cuda but all of its transitive dependencies, at least the ones provided by NVIDIA.
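As a sketch of what a broader pin could look like (the glob list and version here are illustrative, not the role's actual pin file), an apt preferences stanza covering the driver-series packages as a group might be:
Package: nvidia-* libnvidia-* libcuda1 libcudadebugger1 libnvcuvid1 libnvoptix1 libxnvctrl*
Pin: version 570.172.08-1
Pin-Priority: 1001
The toolkit packages (cuda, cuda-toolkit-*, etc.) carry their own version numbering, so they would need a separate stanza pinned to a matching toolkit release.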
A bug we've had in the past, and it looks like it has happened again, is: we pin a bunch of packages and it works for the current version. Then a new version comes out, and the combination of the packages we did pin plus newer versions of the packages we failed to pin yields an error like this, because that combination makes no sense.
The basic fix is just to figure out what pins we're missing and add them. Unfortunately that's a bear. There's a whole web of packages and they have different version numbers that don't obviously relate to each other. The best I've been able to do in the past is an arduous manual process of traversing the packages myself.
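One way to take some of the manual work out of that traversal (a sketch, not something the role does today) is to let apt-cache enumerate the recursive dependencies and filter for the NVIDIA-provided packages, then compare that list against the pins we have:
# apt-cache depends --recurse --no-recommends --no-suggests --no-conflicts --no-breaks --no-replaces --no-enhances cuda | grep -E '^(nvidia-|libnvidia-|libcuda|libnv|libxnvctrl)' | sort -u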
One thing that might make this easier: now that we've dropped support for Debian 11, we can upgrade to the latest CUDA version available in Ubuntu 22.04. Starting the pinning process over with whatever version that is might be easier than trying to reverse-engineer what happened with our current pins.
Updated by Lucas Di Pentima 7 months ago
Updates at 6bc5229f67 - branch 23104-ansible-nvidia-fix
Test run: packer-build-compute-image: #344
Pins manually tested on both debian12 and ubuntu2204.
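For reference, a pin set like this can be sanity-checked in a throwaway container per distribution (sketch; assumes the NVIDIA repository and the role's pin file are already in place under /etc/apt/preferences.d/):
# apt update
# apt-get install --dry-run cuda
# apt-get install --dry-run cuda | grep -E '^Inst (nvidia|libnvidia)'
The Inst lines in the --dry-run output show which versions the resolver would pick, which makes it easy to spot a mixed 570/580 set before building an image.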
Updated by Brett Smith 7 months ago
Lucas Di Pentima wrote in #note-2:
Updates at 6bc5229f67 - branch 23104-ansible-nvidia-fix
Test run: packer-build-compute-image: #344
Pins manually tested on both debian12 and ubuntu2204.
Ideally we would test by deploying this compute node image to tordo then running a CUDA workflow. We would want to do this anyway for 3.2.0 release prep, so we might as well do it now. If you can take care of the deployment I can take care of the workflow. As long as that works the changes LGTM. Thanks.
Updated by Lucas Di Pentima 7 months ago
Brett Smith wrote in #note-3:
Ideally we would test by deploying this compute node image to tordo then running a CUDA workflow. We would want to do this anyway for 3.2.0 release prep, so we might as well do it now. If you can take care of the deployment I can take care of the workflow. As long as that works the changes LGTM. Thanks.
Makes sense. I've deployed the config changes so that tordo uses the newly created AMI ami-0e7958a0fd8c2eee9.
Updated by Brett Smith 7 months ago
Lucas Di Pentima wrote in #note-4:
Makes sense. I've deployed the config changes so that tordo uses the newly created AMI ami-0e7958a0fd8c2eee9.
tordo-xvhdp-6lz7vvi0n3tdmvh - Unfortunately the logs don't seem to report the CUDA version directly, but it's clearly using a recently-built compute image (based on the arv-mount version) and using CUDA successfully, so this looks good to me. Please merge. Thanks for taking care of that.
Updated by Lucas Di Pentima 7 months ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|8e22e8b2a757280c319d341a6ce8e1043790e521.