Bug #23104
compute_nvidia ansible role fails to install packages
Status: Closed
Description
While trying to build a compute node image with NVIDIA support, I'm seeing package install errors like the ones in this run: packer-build-compute-image: #339
I've manually reproduced this in a docker container; the full error output is the following:
# apt install cuda
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:
The following packages have unmet dependencies:
nvidia-driver-cuda : Depends: nvidia-opencl-icd (= 570.172.08-1) but it is not going to be installed
Depends: nvidia-persistenced (= 570.172.08-1) but it is not going to be installed
nvidia-driver-libs : Depends: libnvidia-egl-xcb1 (>= 1:1.0.1) but 570.124.06-1 is to be installed
Depends: libnvidia-egl-xlib1 (>= 1:1.0.1) but 570.124.06-1 is to be installed
Recommends: nvidia-driver-libs:i386 (= 570.172.08-1)
nvidia-open-560 : Depends: nvidia-kernel-open-dkms (< 561) but 570.172.08-1 is to be installed
Depends: nvidia-driver (< 561) but 570.172.08-1 is to be installed
Depends: libcuda1 (< 561) but 570.172.08-1 is to be installed
Depends: libcudadebugger1 (< 561) but 570.172.08-1 is to be installed
Depends: libnvcuvid1 (< 561) but 570.172.08-1 is to be installed
Depends: libnvidia-allocator1 (< 561) but 570.172.08-1 is to be installed
Depends: libnvidia-pkcs11-openssl3 (< 561) but 570.172.08-1 is to be installed
Depends: libnvidia-encode1 (< 561) but 570.172.08-1 is to be installed
Depends: libnvidia-fbc1 (< 561) but 570.172.08-1 is to be installed
Depends: libnvidia-opticalflow1 (< 561) but 570.172.08-1 is to be installed
Depends: libnvidia-ptxjitcompiler1 (< 561) but 570.172.08-1 is to be installed
Depends: libnvoptix1 (< 561) but 570.172.08-1 is to be installed
Depends: libxnvctrl-dev (< 561) but 570.172.08-1 is to be installed
Depends: nvidia-cuda-mps (>= 560.35.05)
Depends: nvidia-cuda-mps (< 561)
Depends: nvidia-opencl-common (< 561)
Depends: nvidia-opencl-icd (>= 560.35.05) but it is not going to be installed
Depends: nvidia-opencl-icd (< 561) but it is not going to be installed
Depends: nvidia-smi (>= 560.35.05)
Depends: nvidia-smi (< 561)
Depends: nvidia-modprobe (< 561) but 570.172.08-1 is to be installed
Depends: nvidia-settings (< 561) but 570.172.08-1 is to be installed
nvidia-xconfig : Depends: libnvidia-cfg1 (= 580.65.06-1) but 570.172.08-1 is to be installed
Recommends: libglx-nvidia0 (= 580.65.06-1) but 570.172.08-1 is to be installed
E: Unable to correct problems, you have held broken packages.
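For anyone wanting to reproduce this outside the image build, something along these lines should hit the same failure (the cuda-keyring URL here is an assumption on my part; use whatever repository the compute_nvidia role actually configures):
# docker run --rm -it debian:12 bash
# apt update && apt install -y wget ca-certificates
# wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
# dpkg -i cuda-keyring_1.1-1_all.deb
# apt update
# apt install cuda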
Updated by Brett Smith 7 months ago
To be completely effective, our apt pins need to cover not just cuda but all of its transitive dependencies, at least the ones provided by NVIDIA.
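As a sketch of what a broader pin could look like (the glob list and version here are illustrative, not the role's actual pin file), an apt preferences stanza covering the driver-series packages as a group might be:
Package: nvidia-* libnvidia-* libcuda1 libcudadebugger1 libnvcuvid1 libnvoptix1 libxnvctrl*
Pin: version 570.172.08-1
Pin-Priority: 1001
The toolkit packages (cuda, cuda-toolkit-*, etc.) carry their own version numbering, so they would need a separate stanza pinned to a matching toolkit release.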
A bug we've had in the past, and it looks like it has happened again, is: we pin a bunch of packages and it works for the current version. Then a new version comes out, and the combination of the packages we did pin plus newer versions of the packages we failed to pin yields an error like this, because that combination makes no sense.
The basic fix is just to figure out what pins we're missing and add them. Unfortunately that's a bear. There's a whole web of packages and they have different version numbers that don't obviously relate to each other. The best I've been able to do in the past is an arduous manual process of traversing the packages myself.
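One way to take some of the manual work out of that traversal (a sketch, not something the role does today) is to let apt-cache enumerate the recursive dependencies and filter for the NVIDIA-provided packages, then compare that list against the pins we have:
# apt-cache depends --recurse --no-recommends --no-suggests --no-conflicts --no-breaks --no-replaces --no-enhances cuda | grep -E '^(nvidia-|libnvidia-|libcuda|libnv|libxnvctrl)' | sort -u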
One thing that might make this easier: now that we've dropped support for Debian 11, we can upgrade to the latest CUDA version available in Ubuntu 22.04. Starting the pinning process over with whatever version that is might be easier than trying to reverse-engineer what happened with our current pins.
Updated by Lucas Di Pentima 7 months ago
Updates at 6bc5229f67 - branch 23104-ansible-nvidia-fix
Test run: packer-build-compute-image: #344
Pins manually tested on both debian12 and ubuntu2204.
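For reference, a pin set like this can be sanity-checked in a throwaway container per distribution (sketch; assumes the NVIDIA repository and the role's pin file are already in place under /etc/apt/preferences.d/):
# apt update
# apt-get install --dry-run cuda
# apt-get install --dry-run cuda | grep -E '^Inst (nvidia|libnvidia)'
The Inst lines in the --dry-run output show which versions the resolver would pick, which makes it easy to spot a mixed 570/580 set before building an image.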
Updated by Brett Smith 7 months ago
Lucas Di Pentima wrote in #note-2:
Updates at 6bc5229f67 - branch 23104-ansible-nvidia-fix
Test run: packer-build-compute-image: #344
Pins manually tested on both debian12 and ubuntu2204.
Ideally we would test by deploying this compute node image to tordo then running a CUDA workflow. We would want to do this anyway for 3.2.0 release prep, so we might as well do it now. If you can take care of the deployment I can take care of the workflow. As long as that works the changes LGTM. Thanks.
Updated by Lucas Di Pentima 7 months ago
Brett Smith wrote in #note-3:
Ideally we would test by deploying this compute node image to tordo then running a CUDA workflow. We would want to do this anyway for 3.2.0 release prep, so we might as well do it now. If you can take care of the deployment I can take care of the workflow. As long as that works the changes LGTM. Thanks.
Makes sense. I've deployed the config changes so that tordo uses the newly created AMI ami-0e7958a0fd8c2eee9.
Updated by Brett Smith 7 months ago
Lucas Di Pentima wrote in #note-4:
Makes sense. I've deployed the config changes so that tordo uses the newly created AMI ami-0e7958a0fd8c2eee9.
tordo-xvhdp-6lz7vvi0n3tdmvh - Unfortunately the logs don't seem to report the CUDA version directly, but it's clearly using a recently-built compute image (based on the arv-mount version) and using CUDA successfully, so this looks good to me. Please merge. Thanks for taking care of that.
Updated by Lucas Di Pentima 7 months ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|8e22e8b2a757280c319d341a6ce8e1043790e521.