Bug #13991

open

crunch-dispatch-slurm does not warn when slurm MaxJobCount reached

Added by Joshua Randall over 6 years ago. Updated almost 2 years ago.

Status: New
Priority: Normal
Assigned To: -
Category: Crunch
Target version: -
Start date:
Due date:
% Done: 0%
Estimated time:
Story points: -
Release:
Release relationship: Auto

Description

SLURM has a default MaxJobCount of 10000. MaxJobCount is not specified in the example slurm.conf suggested by the Arvados install docs (https://doc.arvados.org/install/crunch2-slurm/install-slurm.html). Perhaps it should be, so it is clear that this parameter can matter to Arvados.

When MaxJobCount was reached, I expected crunch-dispatch-slurm to log some sort of warning indicating that it was unable to queue all of the jobs because of the limit. I did not see any log message to that effect.
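
To illustrate the kind of check that could make this limit visible, here is a minimal sketch (not part of crunch-dispatch-slurm) that reads MaxJobCount from configuration text. It assumes the text uses `Name = value` lines, as in the output of `scontrol show config` or in slurm.conf itself. The function names, the 90% threshold, and the sample text are hypothetical.

```python
import re


def parse_max_job_count(config_text):
    """Return MaxJobCount as an int, or None if the setting is absent."""
    match = re.search(r"^\s*MaxJobCount\s*=\s*(\d+)\s*$", config_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def near_limit(job_count, max_job_count, threshold=0.9):
    """True when job_count has reached threshold * max_job_count (threshold is an assumption)."""
    return job_count >= threshold * max_job_count


# Illustrative sample only; real configuration output contains many more settings.
sample = """\
MaxArraySize            = 1001
MaxJobCount             = 10000
"""

limit = parse_max_job_count(sample)
if limit is not None and near_limit(9500, limit):
    print("warning: job count is close to MaxJobCount=%d" % limit)
```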

Actions #1

Updated by Tom Morris over 6 years ago

  • Target version set to To Be Groomed

Actions #2

Updated by Tom Clegg over 6 years ago

Tested on 9tee4 with MaxJobCount=2.

$ sbatch -N1 <(printf '#!/bin/sh\nsleep 8000\n')
sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying.
^C

It would be more convenient to get a non-zero exit code, so we don't have to scrape stderr while sbatch is still running.

There's an sbatch --immediate option that fails if the allocation can't be granted, but that's not what we want either. We want to fail only if the job can't be queued.

So it seems the solution is for crunch-dispatch-slurm to monitor stderr while sbatch is running, and if that message appears:
  • Log a suggestion to increase MaxJobCount
  • Avoid starting more sbatch processes until this one exits (incidentally, we only recently stopped serializing all sbatch invocations).

We should also log any other unexpected messages from sbatch, to make other similar problems easier to diagnose.
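
The approach described above could be sketched roughly as follows. This is an illustration in Python, not the dispatcher's actual code. The message text is taken from the sbatch output shown above; `SubmitGate`, `classify_stderr_line`, `submit`, and the logger name are hypothetical.

```python
import logging
import subprocess
import threading

# Text observed in the test above when MaxJobCount is reached.
QUEUE_FULL_MSG = "Slurm temporarily unable to accept job"

log = logging.getLogger("dispatch-sketch")  # hypothetical logger name


class SubmitGate:
    """Holds back new submissions while any running sbatch is stuck retrying."""

    def __init__(self):
        self._cond = threading.Condition()
        self._blocked = 0

    def wait_until_open(self):
        with self._cond:
            while self._blocked > 0:
                self._cond.wait()

    def mark_blocked(self):
        with self._cond:
            self._blocked += 1

    def mark_unblocked(self):
        with self._cond:
            self._blocked -= 1
            self._cond.notify_all()


gate = SubmitGate()


def classify_stderr_line(line):
    """Return 'queue_full', 'other', or None for a blank line."""
    text = line.strip()
    if not text:
        return None
    return "queue_full" if QUEUE_FULL_MSG in text else "other"


def _watch_stderr(stream, state):
    # Read stderr line by line while the child process is still running.
    for line in stream:
        kind = classify_stderr_line(line)
        if kind == "queue_full" and not state["blocked"]:
            state["blocked"] = True
            gate.mark_blocked()
            log.warning("sbatch cannot queue the job; consider increasing MaxJobCount in slurm.conf")
        elif kind == "other":
            log.warning("unexpected sbatch message: %s", line.strip())


def submit(cmd):
    """Run cmd (for example an sbatch command line) and return (exit code, stdout)."""
    gate.wait_until_open()
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    state = {"blocked": False}
    watcher = threading.Thread(target=_watch_stderr, args=(proc.stderr, state))
    watcher.start()
    stdout = proc.stdout.read()  # stderr is drained concurrently by the watcher thread
    returncode = proc.wait()
    watcher.join()
    if state["blocked"]:
        gate.mark_unblocked()  # this sbatch has exited, so submissions may resume
    return returncode, stdout
```

Reading stderr in a separate thread means the warning can be logged while sbatch is still sleeping and retrying, instead of only after it exits. The gate counts blocked submissions, so it reopens only once every stuck sbatch has finished.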

Actions #3

Updated by Peter Amstutz over 3 years ago

  • Target version deleted (To Be Groomed)

Actions #4

Updated by Lucas Di Pentima almost 2 years ago

  • Release set to 60