Bug #18713
[gpu] nvidia-persistenced.service fails when booted on a node without GPUs
Status: closed (100% done)
Description
The systemd nvidia-persistenced.service fails to start when a compute image with Nvidia GPU support is started on a non-GPU node:
# systemctl
...
● nvidia-persistenced.service loaded failed failed NVIDIA Persistence Daemon
...

# systemctl status nvidia-persistenced.service
● nvidia-persistenced.service - NVIDIA Persistence Daemon
   Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; disabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Thu 2022-02-03 18:38:16 UTC; 16min ago

Feb 03 18:38:15 ip-10-253-254-98 systemd[1]: Starting NVIDIA Persistence Daemon...
Feb 03 18:38:15 ip-10-253-254-98 nvidia-persistenced[559]: Started (559)
Feb 03 18:38:16 ip-10-253-254-98 nvidia-persistenced[559]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 108 has read and write permissions for those files.
Feb 03 18:38:16 ip-10-253-254-98 nvidia-persistenced[552]: nvidia-persistenced failed to initialize. Check syslog for more details.
Feb 03 18:38:16 ip-10-253-254-98 nvidia-persistenced[559]: Shutdown (559)
Feb 03 18:38:16 ip-10-253-254-98 systemd[1]: nvidia-persistenced.service: Control process exited, code=exited, status=1/FAILURE
Feb 03 18:38:16 ip-10-253-254-98 systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
Feb 03 18:38:16 ip-10-253-254-98 systemd[1]: Failed to start NVIDIA Persistence Daemon.
This is a problem because it means that "systemctl is-system-running" returns "degraded". That command is our default for BootProbeCommand. In other words, the compute nodes never reach "ready" state from Arvados' perspective.
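The effect on the boot probe can be illustrated with a small shell sketch. It assumes the probe relies only on the exit status of "systemctl is-system-running", which is zero only when the reported state is "running". The helper name probe_ok is hypothetical and not part of Arvados.

```shell
#!/bin/sh
# Hypothetical helper (not part of Arvados): map the state string printed by
# `systemctl is-system-running` to a boot-probe style exit status.
probe_ok() {
    case "$1" in
        running) return 0 ;;   # fully booted, no failed units
        *)       return 1 ;;   # degraded, starting, maintenance, ...
    esac
}

# On a real node one might run (shown as comments so the sketch has no side effects):
#   state=$(systemctl is-system-running); probe_ok "$state"
#   systemctl --failed    # lists the units that caused a "degraded" state
```

With a single failed unit such as nvidia-persistenced.service, the state is "degraded", so a probe of this kind keeps failing even though the node is otherwise usable.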
Updated by Ward Vandewege almost 3 years ago
- Status changed from New to In Progress
Updated by Ward Vandewege almost 3 years ago
- Related to Story #15957: GPU support added
Updated by Peter Amstutz almost 3 years ago
It is probably fine to have it disabled, because crunch-run does some GPU driver initialization on its own already.
Updated by Ward Vandewege almost 3 years ago
I updated the script that builds the compute node image to disable the nvidia-persistenced service in ac52d7ee23b39779712c702945eb9db7e17dd814 on branch 18713-nvidia-persistenced. Ready for review.
I then built a compute image for Tordo from this commit, and that made Tordo work again, cf. https://workbench.tordo.arvadosapi.com/container_requests/tordo-xvhdp-x824fng56ciyvoo
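The ticket does not show the exact change made in the image build script, so the following is only a sketch of what disabling the unit at image build time might look like. The function name and the SYSTEMCTL override are hypothetical; the override exists only so the sketch can be dry-run without touching a real system.

```shell
#!/bin/sh
# Sketch only: the exact command used in the commit is not shown in this ticket.
# SYSTEMCTL is a hypothetical override for dry runs (e.g. SYSTEMCTL=echo);
# by default the real systemctl is invoked.
disable_nvidia_persistenced() {
    "${SYSTEMCTL:-systemctl}" disable nvidia-persistenced.service
}

# In an image build script running as root, one might call:
#   disable_nvidia_persistenced
```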
Updated by Peter Amstutz almost 3 years ago
Ward Vandewege wrote:
> I updated the script that builds the compute node image to disable the nvidia-persistenced service in ac52d7ee23b39779712c702945eb9db7e17dd814 on branch 18713-nvidia-persistenced. Ready for review.
> I then built a compute image for Tordo from this commit, and that made Tordo work again, cf. https://workbench.tordo.arvadosapi.com/container_requests/tordo-xvhdp-x824fng56ciyvoo

In the comment I would include a note that this doesn't matter, because crunch-run does its own basic CUDA initialization.
We should also confirm that in fact GPUs still work.
Updated by Peter Amstutz almost 3 years ago
- File tf-mnist-tutorial.py added
- File tf-mnist-tutorial-gpu.cwl added
- Assigned To deleted (Ward Vandewege)
- Target version deleted (2022-02-16 sprint)
Updated by Peter Amstutz almost 3 years ago
- Assigned To set to Ward Vandewege
- Target version set to 2022-02-16 sprint
Updated by Ward Vandewege almost 3 years ago
Peter Amstutz wrote:
> Ward Vandewege wrote:
>> I updated the script that builds the compute node image to disable the nvidia-persistenced service in ac52d7ee23b39779712c702945eb9db7e17dd814 on branch 18713-nvidia-persistenced. Ready for review.
>> I then built a compute image for Tordo from this commit, and that made Tordo work again, cf. https://workbench.tordo.arvadosapi.com/container_requests/tordo-xvhdp-x824fng56ciyvoo
> In the comment I would include a note that this doesn't matter, because crunch-run does its own basic CUDA initialization.

Sure, updated in 12c1c51313e897abd0e9d1801b42bc8dc3b8d1d9 on branch 18713-nvidia-persistenced

> We should also confirm that in fact GPUs still work.

Thanks for the sample workflow, it completed at tordo-xvhdp-h7cu2u53dtjf3ag (without reuse!).
Updated by Ward Vandewege almost 3 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados-private:commit:arvados|8685251f024c4519c5f61413b9dcb66a86e3efd6.