Bug #6157
closed
[Documentation] Explain extra steps needed when compute hostnames are not fooN
100%
Description
background
Changing slurm config files, and keeping them synchronized across controller+workers, is a bit painful and can cause race conditions that are annoying to diagnose, so we try to avoid setups where the config has to change during normal operation.
Hostnames of the form "fooN", where N is a decimal number, let you write foo[0-199] or foo[000-199] in your slurm config files, so a single entry covers every node in advance. nodes.ping makes a setup like this easy to manage: in the API server configuration, you can set assign_node_hostname to a corresponding format string, so that nodes that ping without a hostname are assigned one matching the scheme, and max_compute_nodes to make sure the number of nodes doesn't go over your allocation.
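As a rough illustration of why the format string and the slurm range need to agree, here is a minimal Python sketch. It is not Arvados or slurm code: expand_hostlist and assigned_hostnames are hypothetical helpers, the printf-style format "foo%d" and slot numbering starting at 0 are assumptions made for the example, and only a single prefix[lo-hi] range is handled.

```python
import re


def expand_hostlist(expr):
    """Expand a single 'prefix[lo-hi]' range into a list of hostnames.

    Only this one simple form is handled; real slurm hostlist
    expressions are richer than this sketch.
    """
    m = re.fullmatch(r"([^\[\]]+)\[(\d+)-(\d+)\]", expr)
    if m is None:
        raise ValueError(f"unsupported hostlist expression: {expr!r}")
    prefix, lo, hi = m.groups()
    # Keep zero padding when the lower bound is written with leading zeros.
    width = len(lo) if len(lo) > 1 and lo.startswith("0") else 0
    return [prefix + str(n).zfill(width) for n in range(int(lo), int(hi) + 1)]


def assigned_hostnames(fmt, max_compute_nodes):
    """Names a printf-style format string would yield for slots 0..max-1.

    Assumption for illustration: slots are numbered from 0.
    """
    return [fmt % slot for slot in range(max_compute_nodes)]


if __name__ == "__main__":
    in_slurm_conf = set(expand_hostlist("foo[0-199]"))
    assigned = set(assigned_hostnames("foo%d", 200))
    # Every name that could be assigned should already be known to slurm.
    print(assigned <= in_slurm_conf)
```

If the range were written as foo[000-199], the matching format in this sketch would be "foo%03d"; mixing the padded and unpadded forms would produce names slurm does not know about.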
However, in some setups it might be inconvenient/difficult/impossible to use hostnames like "fooN".
improvement
Install docs should include a section explaining:
- Why foo[0-N] is a good idea (see above)
- What to do differently if you use a naming scheme besides string+decimal (e.g., your worker nodes' hostnames are {alice, bob, clay, ...})
We should make the simplifying assumption that the hostnames are assigned manually/OOB, and known in advance. IOW, instead of covering scenarios where slurm config has to change every time a new compute node is turned up, we should just advise against that.
AFAIK, as long as the available/powered-on nodes' hostnames are a subset of the hostnames given in slurm.conf, and no two hosts have the same name, slurm and Arvados should work without any code changes.
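The two conditions above (live hostnames are a subset of the configured ones, and no name is used twice) could be checked with something like the following Python sketch. check_hostnames is a hypothetical helper, and how the configured and live lists are obtained (e.g., by reading slurm.conf or listing nodes) is left out and would be site-specific.

```python
from collections import Counter


def check_hostnames(configured, live):
    """Return a list of problems; an empty list means both conditions hold.

    configured: hostnames listed in slurm.conf (known in advance)
    live:       hostnames reported by powered-on nodes, one entry per node
    """
    problems = []

    # Condition 1: no two hosts have the same name.
    dupes = sorted(name for name, count in Counter(live).items() if count > 1)
    if dupes:
        problems.append("duplicate hostnames: " + ", ".join(dupes))

    # Condition 2: live hostnames are a subset of the configured ones.
    unknown = sorted(set(live) - set(configured))
    if unknown:
        problems.append("not in slurm.conf: " + ", ".join(unknown))

    return problems


if __name__ == "__main__":
    configured = ["alice", "bob", "clay"]
    print(check_hostnames(configured, ["alice", "clay"]))
    print(check_hostnames(configured, ["alice", "alice", "dave"]))
```

Nodes that are configured but powered off (bob in the first call) are not reported, which matches the subset condition: only extra or repeated names are a problem.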