Bug #12057
Updated by Peter Amstutz over 7 years ago
While running the node manager stress test, I noticed the following behavior from slurm:

<pre>
 JOBID PARTITION     NAME USER ST   TIME NODES NODELIST(REASON)
 14269   compute c97qk-dz root PD   0:00     1 (Resources)
 14270   compute c97qk-dz root PD   0:00     1 (Priority)
 14271   compute c97qk-dz root PD   0:00     1 (Priority)
 14272   compute c97qk-dz root PD   0:00     1 (Priority)
 14273   compute c97qk-dz root PD   0:00     1 (Priority)
 14274   compute c97qk-dz root PD   0:00     1 (Priority)
 14275   compute c97qk-dz root PD   0:00     1 (Priority)
 14276   compute c97qk-dz root PD   0:00     1 (Priority)
 14277   compute c97qk-dz root PD   0:00     1 (Priority)
 14278   compute c97qk-dz root PD   0:00     1 (Priority)
 14279   compute c97qk-dz root PD   0:00     1 (Priority)
 14280   compute c97qk-dz root PD   0:00     1 (Priority)
 14281   compute c97qk-dz root PD   0:00     1 (Priority)
 14282   compute c97qk-dz root PD   0:00     1 (Priority)
 14283   compute c97qk-dz root PD   0:00     1 (Priority)
 14284   compute c97qk-dz root PD   0:00     1 (Priority)
 14285   compute c97qk-dz root PD   0:00     1 (Priority)
 14286   compute c97qk-dz root PD   0:00     1 (Priority)
 14287   compute c97qk-dz root PD   0:00     1 (Priority)
 14288   compute c97qk-dz root PD   0:00     1 (Priority)
 14289   compute c97qk-dz root PD   0:00     1 (Priority)
 14290   compute c97qk-dz root PD   0:00     1 (Priority)
 14291   compute c97qk-dz root PD   0:00     1 (Priority)
 14292   compute c97qk-dz root PD   0:00     1 (Priority)
 14293   compute c97qk-dz root PD   0:00     1 (Priority)
 14294   compute c97qk-dz root PD   0:00     1 (Priority)
 14296   compute c97qk-dz root PD   0:00     1 (Priority)
 14297   compute c97qk-dz root PD   0:00     1 (Priority)
 14298   compute c97qk-dz root PD   0:00     1 (Priority)
 14299   compute c97qk-dz root PD   0:00     1 (Priority)
 14300   compute c97qk-dz root PD   0:00     1 (Priority)
 14302   compute c97qk-dz root PD   0:00     1 (Priority)
 14303   compute c97qk-dz root PD   0:00     1 (Priority)
 14306   compute c97qk-dz root PD   0:00     1 (Priority)
 14307   compute c97qk-dz root PD   0:00     1 (Priority)
 14308   compute c97qk-dz root PD   0:00     1 (Priority)
 14309   compute c97qk-dz root PD   0:00     1 (Priority)
 14310   compute c97qk-dz root PD   0:00     1 (Priority)
 14311   compute c97qk-dz root PD   0:00     1 (Priority)
 14312   compute c97qk-dz root PD   0:00     1 (Priority)
 14186   compute c97qk-dz root  R  20:12     1 compute8
 14259   compute c97qk-dz root  R   0:59     1 compute3
 14260   compute c97qk-dz root  R   0:58     1 compute13
 14261   compute c97qk-dz root  R   0:51     1 compute4
 14262   compute c97qk-dz root  R   0:50     1 compute5
 14263   compute c97qk-dz root  R   0:42     1 compute6
 14264   compute c97qk-dz root  R   0:30     1 compute9
 14265   compute c97qk-dz root  R   0:30     1 compute12
 14266   compute c97qk-dz root  R   0:28     1 compute11
 14267   compute c97qk-dz root  R   0:27     1 compute10
 14268   compute c97qk-dz root  R   0:17     1 compute7
</pre>

Slurm marks only one pending job as limited by (Resources); the rest are marked as limited by (Priority). Currently node manager only boots new nodes for jobs marked (Resources) and does not recognize (Priority). The effect is to dribble out one new node at a time instead of booting many nodes at once, despite a deep queue.

Node manager should create nodes for slurm jobs marked (Priority).

Idea: to throttle node creation (which helps avoid API rate limits, and also avoids over-shooting and booting excessive new nodes, which can happen if a job completes and a queued job is assigned to the existing node so that no new node is needed), suggest returning a desired additional node count that is 75% of the pending job count (rounded up). For example, if 4 nodes were needed, it would boot 3 nodes on the 1st round, then 1 node on the next round.
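A minimal sketch of the 75% heuristic described above; the function name and structure are illustrative only, not the actual node manager code:

<pre>
import math

def desired_new_nodes(pending_job_count):
    """Request 75% of the pending job count, rounded up (hypothetical helper)."""
    if pending_job_count <= 0:
        return 0
    return math.ceil(pending_job_count * 0.75)

# Worked example from above, with 4 pending jobs:
# Round 1: ceil(4 * 0.75) = 3 nodes booted, 1 job still pending.
# Round 2: ceil(1 * 0.75) = 1 node booted.
</pre>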
Alternately, we could cap the number of nodes per type created in a round (perhaps 10); this would also throttle node creation and help avoid API rate limits. Rounds are about 60 seconds apart. A sketch of this alternative is below.
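For comparison, a sketch of the per-round cap alternative; the constant and function names are assumptions for illustration:

<pre>
# Assumed name for the proposed per-type, per-round limit of 10.
MAX_NODES_PER_ROUND = 10

def desired_new_nodes_capped(pending_job_count, cap=MAX_NODES_PER_ROUND):
    """Boot one node per pending job, but no more than `cap` in a single ~60s round."""
    return min(max(pending_job_count, 0), cap)
</pre>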