Bug #14117
c-d-s reniceAll sets nice on jobs that are not pending
Description
The reniceAll function makes no distinction based on job state, and SqueueChecker runs `squeue` with the `--all` option, which returns jobs in all states.
As a result, it appears that reniceAll ends up setting priority on jobs whose priority has no impact on scheduling, including jobs that are already running and those that have recently completed, been cancelled, or failed.
I would suggest adding `state` to the `slurmJob` struct, and then adding a new conditional block before the `if j.wantPriority == 0` one, something like:
`if j.state != "PENDING" { continue }`
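A minimal sketch of what that could look like (purely illustrative: the real `slurmJob` and `SqueueChecker` have more fields than shown here, `uuid` is a hypothetical placeholder, and only `wantPriority` and `sqc.queue` are names already referenced in this ticket):

```go
package sketch

// slurmJob, reduced to the fields discussed here. "state" is the proposed
// addition, to be populated from squeue output; "uuid" is a hypothetical
// placeholder for whatever identifier the real struct carries.
type slurmJob struct {
	uuid         string
	state        string // e.g. "PENDING", "RUNNING"
	wantPriority int64
}

// SqueueChecker, reduced to the queue field used by reniceAll.
type SqueueChecker struct {
	queue map[string]*slurmJob
}

// reniceAll, reduced to its loop structure, with the proposed state check
// placed before the existing wantPriority check.
func (sqc *SqueueChecker) reniceAll() {
	for _, j := range sqc.queue {
		if j.state != "PENDING" {
			// A priority change cannot affect scheduling once the job is no longer pending.
			continue
		}
		if j.wantPriority == 0 {
			continue
		}
		// ... existing renice logic ...
	}
}
```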
Updated by Joshua Randall over 6 years ago
According to the squeue docs (https://slurm.schedmd.com/squeue.html), the complete set of possible job states is:
BF BOOT_FAIL - Job terminated due to launch failure, typically due to a hardware failure (e.g. unable to boot the node or block and the job can not be requeued).
CA CANCELLED - Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
CD COMPLETED - Job has terminated all processes on all nodes with an exit code of zero.
CF CONFIGURING - Job has been allocated resources, but are waiting for them to become ready for use (e.g. booting).
CG COMPLETING - Job is in the process of completing. Some processes on some nodes may still be active.
DL DEADLINE - Job terminated on deadline.
F FAILED - Job terminated with non-zero exit code or other failure condition.
NF NODE_FAIL - Job terminated due to failure of one or more allocated nodes.
OOM OUT_OF_MEMORY - Job experienced out of memory error.
PD PENDING - Job is awaiting resource allocation.
PR PREEMPTED - Job terminated due to preemption.
R RUNNING - Job currently has an allocation.
RD RESV_DEL_HOLD - Job is held.
RF REQUEUE_FED - Job is being requeued by a federation.
RH REQUEUE_HOLD - Held job is being requeued.
RQ REQUEUED - Completing job is being requeued.
RS RESIZING - Job is about to change size.
RV REVOKED - Sibling was removed from cluster due to other cluster starting the job.
SE SPECIAL_EXIT - The job was requeued in a special state. This state can be set by users, typically in EpilogSlurmctld, if the job has terminated with a particular exit value.
ST STOPPED - Job has an allocation, but execution has been stopped with SIGSTOP signal. CPUS have been retained by this job.
S SUSPENDED - Job has an allocation, but execution has been suspended and CPUs have been released for other jobs.
TO TIMEOUT - Job terminated upon reaching its time limit.
Of those, I think the ones relevant to prioritisation (collected into a set in the sketch after these two lists) are:
NODE_FAIL - because in some SLURM configurations jobs that experience node failure can be automatically requeued
PENDING - this is the normal state that requires prioritisation
PREEMPTED - because when job preemption is configured, preempted jobs are automatically requeued
RESV_DEL_HOLD - not really sure when this would happen, but it sounds like it could still be queued if the priority changes
REQUEUE_FED - probably not relevant to arvados use-case but if it is being requeued, then priority still matters
REQUEUE_HOLD - again, if it is requeued then priority probably matters
REQUEUED - same
RESIZING - I think this probably only happens to running jobs but the priority may influence whether the resize is successful (I'm not sure)
SPECIAL_EXIT - this says that it means the job has been requeued so I guess priority may matter
SUSPENDED - because when using preemption SLURM can be configured to suspend jobs rather than requeuing them, and a priority change could be relevant to a resume decision
And the ones that I would argue should not be subject to ongoing renice adjustments are:
BOOT_FAIL - failed and can not be requeued in this state
CANCELLED - jobs that are done should not be prioritised
COMPLETED - jobs that are done should not be prioritised
CONFIGURING - as the job has already been allocated resources, the prioritisation decision has already been made
COMPLETING - no need to prioritise jobs that are already running
DEADLINE - jobs that are done should not be prioritised
FAILED - jobs that are done should not be prioritised
OUT_OF_MEMORY - jobs that are done should not be prioritised
RUNNING - no need to prioritise jobs that are already running
REVOKED - this seems unlikely to happen in an arvados configuration, but it sounds like it is a final state for this cluster
STOPPED - no need to prioritise jobs that are already running
TIMEOUT - jobs that are done should not be prioritised
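As a pure illustration (the state names come from the squeue documentation quoted above; nothing here is taken from the actual c-d-s source, and the exact membership is the judgement call discussed in the two lists), the first list could be encoded as an allow-list, so that anything not explicitly listed is skipped:

```go
package sketch

// States in which a job's priority can still influence scheduling, per the
// reasoning above.
var renicableStates = map[string]bool{
	"NODE_FAIL":     true,
	"PENDING":       true,
	"PREEMPTED":     true,
	"RESV_DEL_HOLD": true,
	"REQUEUE_FED":   true,
	"REQUEUE_HOLD":  true,
	"REQUEUED":      true,
	"RESIZING":      true,
	"SPECIAL_EXIT":  true,
	"SUSPENDED":     true,
}

// needsRenice reports whether a job in the given squeue state should still
// have its nice value adjusted.
func needsRenice(state string) bool {
	return renicableStates[state]
}
```

An allow-list has the property that a new or unanticipated state is skipped by default rather than reniced; enumerating the states to skip, as in the switch below, makes the opposite trade-off.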
Updated by Joshua Randall over 6 years ago
so, perhaps:
```diff
 	for _, j := range sqc.queue {
+		switch j.state {
+		case
+			"BOOT_FAIL",
+			"CANCELLED",
+			"COMPLETED",
+			"CONFIGURING",
+			"COMPLETING",
+			"DEADLINE",
+			"FAILED",
+			"OUT_OF_MEMORY",
+			"RUNNING",
+			"REVOKED",
+			"STOPPED",
+			"TIMEOUT":
+			continue
+		}
 		if j.wantPriority == 0 {
```
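A remaining question is where `j.state` would come from. squeue's `%T` format specifier prints the job state in extended form (e.g. PENDING, RUNNING), so the state could be read alongside whatever fields SqueueChecker already requests. A rough standalone sketch, assuming nothing about the format string c-d-s actually passes to squeue (the `%i %T` format and the `jobStates` helper are illustrative only):

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// jobStates runs squeue with an explicit format string ("%i %T" = job ID and
// extended state) and returns a map from job ID to state, e.g. "12345" ->
// "PENDING". The real SqueueChecker's invocation and field layout are not
// shown in this ticket and will differ.
func jobStates() (map[string]string, error) {
	out, err := exec.Command("squeue", "--all", "--noheader", "--format=%i %T").Output()
	if err != nil {
		return nil, fmt.Errorf("squeue: %w", err)
	}
	states := map[string]string{}
	scanner := bufio.NewScanner(bytes.NewReader(out))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) == 2 {
			states[fields[0]] = fields[1]
		}
	}
	return states, scanner.Err()
}

func main() {
	states, err := jobStates()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	for id, state := range states {
		fmt.Println(id, state)
	}
}
```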