Feature #18325
Updated by Peter Amstutz about 3 years ago
h2. Node type

I used a "g4dn.xlarge" node for testing because, on brief inspection, it seemed to be the cheapest GPU node type available (something like $0.526/hr). It has a Tesla T4 GPU. However, you could probably have packer install all of this on a non-GPU node.

h2. Kernel stuff

The linux-headers package must correspond exactly to the kernel image, because @dkms@ is used to compile the nvidia kernel module on demand. For Buster the latest seem to be:

linux-image-4.19.0-18-cloud-amd64
linux-headers-4.19.0-18-cloud-amd64

h2. CUDA stuff

Note: starting with CUDA 11.5 they only support Debian Bullseye. The previous version, 11.4.3, only supports Buster.

Installation commands from https://developer.nvidia.com/cuda-11-4-3-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Debian&target_version=10&target_type=deb_network

<pre>
apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/debian10/x86_64/7fa2af80.pub
apt-get install software-properties-common
add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/debian10/x86_64/ /"
add-apt-repository contrib
apt-get update
apt-get -y install cuda
</pre>

After everything is installed, use "nvidia-detect" to make sure the GPU is detected and "nvidia-smi" to make sure the kernel module / driver is loaded. If "nvidia-smi" doesn't work, it probably means the kernel module didn't build. Try "dkms autoinstall" and see what failed.

h2. Docker stuff

We need Docker 19.03 or later installed. The current compute image uses the "docker.io" package shipped with Buster, which is 18.xx. The latest version in the docker-ce 19.03.xx series is 19.03.15. We could also upgrade to a more recent version.
To install docker-ce 19.03.15 from Docker's apt repository:

<pre>
curl -fsSL https://download.docker.com/linux/debian/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
mkdir -p /etc/apt/sources.list.d && \
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/debian/ buster stable" > /etc/apt/sources.list.d/docker.list && \
apt-get update && \
apt-get -yq --no-install-recommends install docker-ce=5:19.03.15~3-0~debian-buster && \
apt-get clean
</pre>

h2. nvidia-container-toolkit

This is some additional tooling used by both Singularity and Docker to support CUDA.

<pre>
DIST=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | \
    sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$DIST/libnvidia-container.list | \
    sudo tee /etc/apt/sources.list.d/libnvidia-container.list
sudo apt-get update
sudo apt-get install libnvidia-container1 libnvidia-container-tools nvidia-container-toolkit
</pre>

You might also need to restart Docker after this is installed:

<pre>
systemctl restart docker
</pre>

h2. Testing that GPU is available inside the container

With Docker:

<pre>
docker run --rm --gpus 1 nvidia/cuda:11.0-base nvidia-smi
</pre>

With Singularity:

<pre>
singularity exec --nv docker://nvidia/cuda:11.0-base nvidia-smi
</pre>
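The checks from this note could be collected into one script for the image build. The following is a rough sketch rather than something that has been run on the compute image: @check@ and @verify_gpu_node@ are hypothetical helper names, and the commands inside are the same ones listed above.

```shell
#!/bin/sh
# Run a command quietly and report whether it succeeded.
check() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS: $label"
    else
        echo "FAIL: $label"
        return 1
    fi
}

# Hypothetical wrapper around the checks in this note; needs a GPU node.
verify_gpu_node() {
    rc=0
    check "kernel driver loaded (nvidia-smi)" nvidia-smi || rc=1
    check "GPU visible in Docker" \
        docker run --rm --gpus 1 nvidia/cuda:11.0-base nvidia-smi || rc=1
    check "GPU visible in Singularity" \
        singularity exec --nv docker://nvidia/cuda:11.0-base nvidia-smi || rc=1
    return $rc
}
```

Calling @verify_gpu_node@ as root on the GPU node prints one PASS/FAIL line per check and returns non-zero if any of them failed.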