Feature #18325
closed
Option to include CUDA tooling in cloud compute image
100%
Description
Node type
I used a "g4dn.xlarge" node for testing because, on brief inspection, it seemed to be the cheapest GPU node available (something like $0.526/hr). It has a Tesla T4 GPU. However, you could probably have packer install all this stuff on a non-GPU node.
Kernel stuff
You need the linux-headers package that corresponds exactly to the kernel image, because the NVIDIA driver uses dkms to compile the kernel module on demand.
For Buster the latest versions seem to be:
linux-image-4.19.0-18-cloud-amd64
linux-headers-4.19.0-18-cloud-amd64
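The exact-match requirement can be checked mechanically. A minimal sketch, assuming Debian's usual `linux-headers-<release>` package naming; `headers_package_for` is a hypothetical helper and not part of the compute image scripts:

```shell
# Hypothetical helper: print the headers package name for a kernel release string.
headers_package_for() {
    printf 'linux-headers-%s\n' "$1"
}

# Package name matching the kernel that is running right now.
headers_package_for "$(uname -r)"
```

During a packer build the running kernel is not necessarily the one the image will boot, so it may be safer to pass the release explicitly, e.g. `headers_package_for 4.19.0-18-cloud-amd64`.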
CUDA stuff
Note: starting with CUDA 11.5, NVIDIA only supports Debian Bullseye. The previous version, 11.4.3, only supports Buster.
Installation commands from https://developer.nvidia.com/cuda-11-4-3-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Debian&target_version=10&target_type=deb_network
apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/debian10/x86_64/7fa2af80.pub
apt-get install software-properties-common
add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/debian10/x86_64/ /"
add-apt-repository contrib
apt-get update
apt-get -y install cuda
After everything is installed, use "nvidia-detect" to make sure the GPU is detected and "nvidia-smi" to make sure the kernel module / driver is loaded.
If "nvidia-smi" doesn't work, it probably means the kernel module didn't build. Try "dkms autoinstall" and see what failed.
Docker stuff
We need to have Docker 19.03 or later installed -- the current compute image is using the "docker.io" package shipped with Buster, which is 18.xx. The latest version in the docker-ce 19.03.xx series is 19.03.15. We could also upgrade to a more recent version.
curl -fsSL https://download.docker.com/linux/debian/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
mkdir -p /etc/apt/sources.list.d && \
echo deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/debian/ buster stable > /etc/apt/sources.list.d/docker.list && \
apt-get update && \
apt-get -yq --no-install-recommends install docker-ce=5:19.03.15~3-0~debian-buster && \
apt-get clean
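To check whether a host already meets the 19.03 minimum, a version comparison along these lines could be used. This is a sketch, assuming GNU `sort -V` is available (it is in coreutils on Debian); `version_at_least` is a hypothetical helper name:

```shell
# Hypothetical helper: succeed when version $1 is greater than or equal to version $2.
version_at_least() {
    # sort -V orders version strings; if the smaller of the two is $2, then $1 >= $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage on a host with Docker installed (commented out because it needs a running daemon):
# version_at_least "$(docker version --format '{{.Server.Version}}')" 19.03 || echo "Docker too old for --gpus"
```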
nvidia-container-toolkit
This is some additional tooling used by both Singularity and Docker to support CUDA.
DIST=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$DIST/libnvidia-container.list | \
sudo tee /etc/apt/sources.list.d/libnvidia-container.list
sudo apt-get update
apt-get install libnvidia-container1 libnvidia-container-tools nvidia-container-toolkit
You might also need to restart Docker after this is installed:
systemctl restart docker
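The DIST value used in the repository URL above is just ID and VERSION_ID from os-release concatenated. A small sketch that makes this explicit and lets you try it against any os-release style file; `dist_from_os_release` is a hypothetical helper name:

```shell
# Hypothetical helper: print ID followed by VERSION_ID from an os-release style file.
dist_from_os_release() {
    # Source the file in a subshell so its variables do not leak into the caller.
    ( . "$1" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}

# Usage on the build host:
# DIST=$(dist_from_os_release /etc/os-release)
```

On a Buster host this should yield the same string as the one-liner above, since both read the same two fields.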
Testing that GPU is available inside the container
docker run --rm --gpus 1 nvidia/cuda:11.0-base nvidia-smi
singularity exec --nv docker://nvidia/cuda:11.0-base nvidia-smi
Updated by Peter Amstutz about 3 years ago
- Related to Story #15957: GPU support added
Updated by Peter Amstutz about 3 years ago
- Description updated
- Assigned To set to Ward Vandewege
- Target version set to 2022-01-05 sprint
Updated by Ward Vandewege about 3 years ago
I built the debian11 version of the Tordo compute image in packer-build-compute-image: #155; the resulting AMI is ami-0f6e25f2051835461.
This was built from fcbfddb10723cb876a1c83e883ce3bfb4f6a2565 on branch 18325-compute-image-cuda.
# nvidia-smi
Thu Dec 16 21:37:03 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   51C    P8    11W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Docker can't find libnvidia-ml.so:
# docker run --rm --gpus 1 nvidia/cuda:11.0-base nvidia-smi
Unable to find image 'nvidia/cuda:11.0-base' locally
11.0-base: Pulling from nvidia/cuda
54ee1f796a1e: Pull complete
f7bfea53ad12: Pull complete
46d371e02073: Pull complete
b66c17bbf772: Pull complete
3642f1a6dfb3: Pull complete
e5ce55b8b4b9: Pull complete
155bc0332b0a: Pull complete
Digest: sha256:774ca3d612de15213102c2dbbba55df44dc5cf9870ca2be6c6e9c627fa63d67a
Status: Downloaded newer image for nvidia/cuda:11.0-base
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
Specifying LD_PRELOAD and pointing it at the so works though:
docker run -it --rm --gpus 1 nvidia/cuda:11.0-base bash
root@db39cd1cf9cf:/# nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
root@db39cd1cf9cf:/# LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.495.29.05 nvidia-smi
Thu Dec 16 21:44:05 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   28C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Oh, running `ldconfig` first works too:
root@121eff917be3:/# ldconfig
root@121eff917be3:/# nvidia-smi
Thu Dec 16 21:58:25 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   27C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
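One difference between the two workarounds is visible in the nvidia-smi header: the LD_PRELOAD run reports "CUDA Version: N/A", while the run after `ldconfig` reports 11.5. A sketch for pulling that field out of the output so a test script can tell the two cases apart; `cuda_version_field` is a hypothetical helper, and it assumes the header keeps the "CUDA Version:" label seen in the output above:

```shell
# Hypothetical helper: print the "CUDA Version" field from nvidia-smi output read on stdin.
cuda_version_field() {
    # Capture the token after "CUDA Version:" up to the next space or table border.
    sed -n 's/.*CUDA Version: *\([^ |]*\).*/\1/p' | head -n 1
}

# Usage inside the container (commented out because it needs a GPU):
# [ "$(nvidia-smi | cuda_version_field)" != "N/A" ] || echo "CUDA runtime not visible"
```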
Singularity works without issue:
# singularity exec --nv docker://nvidia/cuda:11.0-base nvidia-smi
INFO: Converting OCI blobs to SIF format
INFO: Starting build...
Getting image source signatures
Copying blob 54ee1f796a1e done
Copying blob f7bfea53ad12 done
Copying blob 46d371e02073 done
Copying blob b66c17bbf772 done
Copying blob 3642f1a6dfb3 done
Copying blob e5ce55b8b4b9 done
Copying blob 155bc0332b0a done
Copying config e0bead0b96 done
Writing manifest to image destination
Storing signatures
2021/12/16 21:40:25 info unpack layer: sha256:54ee1f796a1e650627269605cb8e6a596b77b324e6f0a1e4443dc41def0e58a6
2021/12/16 21:40:26 info unpack layer: sha256:f7bfea53ad120b47cea5488f0b8331e737a97b33003517b0bd05e83925b578f0
2021/12/16 21:40:26 info unpack layer: sha256:46d371e02073acecf750a166495a63358517af793de739a51b680c973fae8fb9
2021/12/16 21:40:26 info unpack layer: sha256:b66c17bbf772fa072c280b10fe87bc999420042b5fce5b111db38b4fe7c40b49
2021/12/16 21:40:26 info unpack layer: sha256:3642f1a6dfb3bdd5ba6a8173363fe58bf4d46a01fa3f4b3907a1e60b803527bf
2021/12/16 21:40:26 info unpack layer: sha256:e5ce55b8b4b9ff443d9de73e100d843d57d8708ebaef7bbc33c0c0544c14d1b1
2021/12/16 21:40:26 info unpack layer: sha256:155bc0332b0a8aee5dad9ff6e67299b34ed4cd81d5a6a0a75418b7f48f378998
INFO: Creating SIF file...
Thu Dec 16 21:40:36 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   31C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Updated by Peter Amstutz about 3 years ago
Looks like singularity is working, but Docker isn't.
Updated by Peter Amstutz about 3 years ago
FWIW after I finished installing CUDA stuff, here's what the root disk usage looked like, so you can give packer a 15-16 GB root disk instead of the default 8 GB.
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme1n1p1   28G   14G   14G  50% /
Updated by Ward Vandewege about 3 years ago
I built the debian10 version of the Tordo compute image in packer-build-compute-image: #157; the resulting AMI is ami-0e7f0fbea4715116c.
This was built from ec5a52d3551e558e6df50c50e94118d84b0cde08 on branch 18325-compute-image-cuda.
root@ip-10-253-254-164:/home/admin# nvidia-smi
Fri Dec 17 15:57:12 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   25C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Docker is OK:
root@ip-10-253-254-164:/home/admin# docker run --rm --gpus 1 nvidia/cuda:11.0-base nvidia-smi
Unable to find image 'nvidia/cuda:11.0-base' locally
11.0-base: Pulling from nvidia/cuda
54ee1f796a1e: Pull complete
f7bfea53ad12: Pull complete
46d371e02073: Pull complete
b66c17bbf772: Pull complete
3642f1a6dfb3: Pull complete
e5ce55b8b4b9: Pull complete
155bc0332b0a: Pull complete
Digest: sha256:774ca3d612de15213102c2dbbba55df44dc5cf9870ca2be6c6e9c627fa63d67a
Status: Downloaded newer image for nvidia/cuda:11.0-base
Fri Dec 17 15:57:49 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   25C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Singularity is OK too:
root@ip-10-253-254-164:/home/admin# singularity exec --nv docker://nvidia/cuda:11.0-base nvidia-smi
INFO: Converting OCI blobs to SIF format
INFO: Starting build...
Getting image source signatures
Copying blob 54ee1f796a1e done
Copying blob f7bfea53ad12 done
Copying blob 46d371e02073 done
Copying blob b66c17bbf772 done
Copying blob 3642f1a6dfb3 done
Copying blob e5ce55b8b4b9 done
Copying blob 155bc0332b0a done
Copying config e0bead0b96 done
Writing manifest to image destination
Storing signatures
2021/12/17 16:00:25 info unpack layer: sha256:54ee1f796a1e650627269605cb8e6a596b77b324e6f0a1e4443dc41def0e58a6
2021/12/17 16:00:26 info unpack layer: sha256:f7bfea53ad120b47cea5488f0b8331e737a97b33003517b0bd05e83925b578f0
2021/12/17 16:00:26 info unpack layer: sha256:46d371e02073acecf750a166495a63358517af793de739a51b680c973fae8fb9
2021/12/17 16:00:26 info unpack layer: sha256:b66c17bbf772fa072c280b10fe87bc999420042b5fce5b111db38b4fe7c40b49
2021/12/17 16:00:26 info unpack layer: sha256:3642f1a6dfb3bdd5ba6a8173363fe58bf4d46a01fa3f4b3907a1e60b803527bf
2021/12/17 16:00:26 info unpack layer: sha256:e5ce55b8b4b9ff443d9de73e100d843d57d8708ebaef7bbc33c0c0544c14d1b1
2021/12/17 16:00:26 info unpack layer: sha256:155bc0332b0a8aee5dad9ff6e67299b34ed4cd81d5a6a0a75418b7f48f378998
INFO: Creating SIF file...
Fri Dec 17 16:00:38 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   25C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Updated by Ward Vandewege about 3 years ago
ready for review at ec5a52d3551e558e6df50c50e94118d84b0cde08 on branch 18325-compute-image-cuda.
The image based on Debian10 works out of the box for Docker and Singularity. The Debian11 image requires an extra call to `ldconfig` in the docker image, I'm still not sure why. Singularity works out of the box though.
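For the Debian11 image, the extra `ldconfig` call can be folded into the one-shot test command rather than run interactively. A sketch only; `with_ldconfig` is a hypothetical helper, and it assumes the container image ships `sh` and `ldconfig`, as nvidia/cuda:11.0-base did in the session above:

```shell
# Hypothetical helper: prefix a container command with ldconfig so the linker cache
# is refreshed before the command runs.
with_ldconfig() {
    printf 'ldconfig && %s\n' "$*"
}

# Usage on the Debian11 compute image (commented out because it needs Docker and a GPU):
# docker run --rm --gpus 1 nvidia/cuda:11.0-base sh -c "$(with_ldconfig nvidia-smi)"
```

This only works around the symptom; it does not explain why the library is missing from the linker cache in the first place.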
Updated by Peter Amstutz about 3 years ago
Ward Vandewege wrote:
ready for review at ec5a52d3551e558e6df50c50e94118d84b0cde08 on branch 18325-compute-image-cuda.
The image based on Debian10 works out of the box for Docker and Singularity. The Debian11 image requires an extra call to `ldconfig` in the docker image, I'm still not sure why. Singularity works out of the box though.
LGTM
Updated by Peter Amstutz about 3 years ago
- Status changed from New to In Progress
Updated by Ward Vandewege about 3 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados-private:commit:arvados|adfb76eacbb5677ae1db2efd102c674481a3b065.
Updated by Ward Vandewege about 3 years ago
- Status changed from Resolved to In Progress
The equivalent change for Azure, plus support for Ubuntu, is now ready for review at 00cee49e2c3cfa62e7ec8a58437a7d432013c4c3 on branch 18325-cuda-azure-image.
I built a compute image for ce8i5 with nvidia/cuda support in packer-build-compute-image: #162.
Testing on a machine spun up with that image (type NV6, with a Tesla M60 GPU):
# nvidia-smi
Mon Dec 20 19:22:52 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 495.44       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00000001:00:00.0 Off |                  Off |
| N/A   28C    P0    37W / 150W |      0MiB /  8129MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Docker is OK:
# docker run --rm --gpus 1 nvidia/cuda:11.0-base nvidia-smi
Unable to find image 'nvidia/cuda:11.0-base' locally
11.0-base: Pulling from nvidia/cuda
54ee1f796a1e: Pull complete
f7bfea53ad12: Pull complete
46d371e02073: Pull complete
b66c17bbf772: Pull complete
3642f1a6dfb3: Pull complete
e5ce55b8b4b9: Pull complete
155bc0332b0a: Pull complete
Digest: sha256:774ca3d612de15213102c2dbbba55df44dc5cf9870ca2be6c6e9c627fa63d67a
Status: Downloaded newer image for nvidia/cuda:11.0-base
Mon Dec 20 19:24:29 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 495.44       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00000001:00:00.0 Off |                  Off |
| N/A   30C    P0    37W / 150W |      0MiB /  8129MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Singularity is also OK:
# singularity exec --nv docker://nvidia/cuda:11.0-base nvidia-smi
INFO: Converting OCI blobs to SIF format
INFO: Starting build...
Getting image source signatures
Copying blob 54ee1f796a1e done
Copying blob f7bfea53ad12 done
Copying blob 46d371e02073 done
Copying blob b66c17bbf772 done
Copying blob 3642f1a6dfb3 done
Copying blob e5ce55b8b4b9 done
Copying blob 155bc0332b0a done
Copying config e0bead0b96 done
Writing manifest to image destination
Storing signatures
2021/12/20 19:25:20 info unpack layer: sha256:54ee1f796a1e650627269605cb8e6a596b77b324e6f0a1e4443dc41def0e58a6
2021/12/20 19:25:21 info unpack layer: sha256:f7bfea53ad120b47cea5488f0b8331e737a97b33003517b0bd05e83925b578f0
2021/12/20 19:25:21 info unpack layer: sha256:46d371e02073acecf750a166495a63358517af793de739a51b680c973fae8fb9
2021/12/20 19:25:21 info unpack layer: sha256:b66c17bbf772fa072c280b10fe87bc999420042b5fce5b111db38b4fe7c40b49
2021/12/20 19:25:21 info unpack layer: sha256:3642f1a6dfb3bdd5ba6a8173363fe58bf4d46a01fa3f4b3907a1e60b803527bf
2021/12/20 19:25:21 info unpack layer: sha256:e5ce55b8b4b9ff443d9de73e100d843d57d8708ebaef7bbc33c0c0544c14d1b1
2021/12/20 19:25:21 info unpack layer: sha256:155bc0332b0a8aee5dad9ff6e67299b34ed4cd81d5a6a0a75418b7f48f378998
INFO: Creating SIF file...
Mon Dec 20 19:25:30 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 495.44       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00000001:00:00.0 Off |                  Off |
| N/A   31C    P0    37W / 150W |      0MiB /  8129MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Updated by Ward Vandewege about 3 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|6ab97c819cd92a212f804a0895fed88c935ff92b.
Updated by Ward Vandewege about 3 years ago
- Related to Support #18606: GPU support on tordo cluster added