Bug #13489

closed

what version of SLURM does arvados expect?

Added by Joshua Randall almost 8 years ago. Updated about 6 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: Crunch
Target version: -
Story points: -

Description

We are currently running SLURM 15.08.7, which results in a large number of errors logged from crunch-dispatch-slurm along the lines of this:

May 14 06:33:14 arvados-master-eglyx crunch-dispatch-slurm[28897]: 2018/05/14 06:33:14 "/usr/bin/scontrol" ["scontrol" "update" "JobName=eglyx-dz642-1j8w55pfx9dzcq4" "Nice=42101"]: "scontrol: error: Invalid nice value, must be between -10000 and 10000" 
May 14 06:33:14 arvados-master-eglyx crunch-dispatch-slurm[28897]: 2018/05/14 06:33:14 "/usr/bin/scontrol" ["scontrol" "update" "JobName=eglyx-dz642-1fefihrik76dg2f" "Nice=42057"]: "scontrol: error: Invalid nice value, must be between -10000 and 10000" 
May 14 06:33:14 arvados-master-eglyx crunch-dispatch-slurm[28897]: 2018/05/14 06:33:14 "/usr/bin/scontrol" ["scontrol" "update" "JobName=eglyx-dz642-1bcwtvp4k464iz9" "Nice=43371"]: "scontrol: error: Invalid nice value, must be between -10000 and 10000" 
May 14 06:33:14 arvados-master-eglyx crunch-dispatch-slurm[28897]: 2018/05/14 06:33:14 "/usr/bin/scontrol" ["scontrol" "update" "JobName=eglyx-dz642-194marn0fcrbnfe" "Nice=42013"]: "scontrol: error: Invalid nice value, must be between -10000 and 10000" 
May 14 06:33:14 arvados-master-eglyx crunch-dispatch-slurm[28897]: 2018/05/14 06:33:14 "/usr/bin/scontrol" ["scontrol" "update" "JobName=eglyx-dz642-17cyzh7quowf5xg" "Nice=42061"]: "scontrol: error: Invalid nice value, must be between -10000 and 10000" 
May 14 06:33:14 arvados-master-eglyx crunch-dispatch-slurm[28897]: 2018/05/14 06:33:14 "/usr/bin/scontrol" ["scontrol" "update" "JobName=eglyx-dz642-15a08f4gfg5vjm3" "Nice=42032"]: "scontrol: error: Invalid nice value, must be between -10000 and 10000" 

It looks like the range of valid nice values on 15.08 is -10000 to 10000, whereas on the newest version of SLURM (17.11) this is documented to be +/- 2147483645 (https://slurm.schedmd.com/scontrol.html). What version of SLURM does arvados require/expect?
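
To gauge how far the requested values overshoot the old limit, the `Nice=` arguments can be pulled out of log lines like the ones above. This is only a rough sketch and not part of Arvados: the function name `out_of_range_nice` and its `limit` parameter are made up for illustration, the 10000 bound is taken from the scontrol error message, and the second sample line is hypothetical.

```python
import re

# Matches the quoted Nice=<value> argument in a logged scontrol invocation.
NICE_RE = re.compile(r'"Nice=(-?\d+)"')

def out_of_range_nice(log_lines, limit=10000):
    """Return the nice values in log_lines whose magnitude exceeds limit."""
    values = []
    for line in log_lines:
        match = NICE_RE.search(line)
        if match and abs(int(match.group(1))) > limit:
            values.append(int(match.group(1)))
    return values

sample = [
    # Argument list copied from the first log line above.
    '["scontrol" "update" "JobName=eglyx-dz642-1j8w55pfx9dzcq4" "Nice=42101"]',
    # Hypothetical in-range request, for contrast.
    '["scontrol" "update" "JobName=hypothetical-job" "Nice=9000"]',
]
print(out_of_range_nice(sample))  # prints [42101]
```

Running it over the real journal output would show whether the requests cluster just above the limit or well beyond it.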

Actions #1

Updated by Tom Clegg almost 8 years ago

I'd recommend a newer version if possible (slurm 16.05.9 also allows nice values up to 2147483645) but slurm 15 is supported.

Reducing PrioritySpread (from the default 10) to 1 or 2 in your crunch-dispatch-slurm config should improve/fix this. (see https://doc.arvados.org/install/crunch2-slurm/install-dispatch.html)

Even with that, there might be situations where the limited range, combined with slurm's approach of reducing the default priority with each job submission, makes it impossible for c-d-s to achieve its desired priority ordering.

In the worst case, it's possible to deadlock by consuming all of your compute nodes with arvados-cwl-runner jobs, each waiting for a child which is waiting for a compute node to run on.

All slurm jobs are submitted with nice=10000 and adjusted from there, so (apart from the deadlock situation) containers will still run even if the adjustment fails.

That said, we should consider
  • rate-limiting these errors
  • updating the error message to mention PrioritySpread and the possibility of upgrading slurm
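
As a rough illustration of the rate-limiting idea in the list above, here is a minimal Python sketch. It is not the dispatcher's actual code: the class name `RateLimitedLogger`, the 60-second default interval, and the injectable `clock` and `emit` parameters are all assumptions made for the example.

```python
import time

class RateLimitedLogger:
    """Emit a given message key at most once per interval; count the rest."""

    def __init__(self, interval=60.0, clock=time.monotonic, emit=print):
        self.interval = interval
        self.clock = clock
        self.emit = emit
        self.last = {}        # key -> time the key was last emitted
        self.suppressed = {}  # key -> messages dropped since the last emit

    def log(self, key, message):
        """Emit message unless key was emitted within the interval.

        Returns True if the message was emitted, False if it was suppressed.
        """
        now = self.clock()
        last = self.last.get(key)
        if last is not None and now - last < self.interval:
            self.suppressed[key] = self.suppressed.get(key, 0) + 1
            return False
        dropped = self.suppressed.pop(key, 0)
        suffix = f" ({dropped} similar messages suppressed)" if dropped else ""
        self.emit(message + suffix)
        self.last[key] = now
        return True
```

Keying on something like the error text rather than the job name would collapse the burst of "Invalid nice value" lines into one message per interval while still reporting how many were dropped.
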
Actions #2

Updated by Joshua Randall almost 8 years ago

Even with `PrioritySpread: 1` we are seeing these errors, attempting to set nice values as high as 67012.

Actions #3

Updated by Peter Amstutz about 6 years ago

  • Status changed from New to Closed