Feature #11146

[Crunch2] [Workbench] Show slurm queue position of containers submitted to slurm but not yet running

Added by Tom Clegg almost 8 years ago. Updated over 3 years ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
Crunch
Target version:
-
Start date:
Due date:
% Done:

0%

Estimated time:
Story points:
3.0

Description

Background

From the user's perspective, it's hard to see what (if anything) is happening between the time a container is created/queued and the time it actually starts running.

In a SLURM setup, the container typically moves quickly from the Queued state to the Locked state when crunch-dispatch-slurm submits it to the SLURM queue, and then stays Locked for some time while it waits for SLURM resources to run it.

Proposed feature

Soon after a container is submitted to the SLURM queue, Workbench should start indicating how close the resulting SLURM job is to the front of the queue.

Implementation

When checking squeue, crunch-dispatch-slurm should note the SLURM queue position of each "Locked" container and propagate this information to the API server.
  • API: Add a new serialized Hash field dispatch_info
  • crunch-dispatch-slurm: store queue position as dispatch_info["queue_position"]
  • crunch-dispatch-slurm: only update containers for which this process has the lock
  • crunch-dispatch-slurm: rate-limit queue position updates for any given container: send at most one update per second, and skip redundant updates like "update queue position from 5 to 5"
  • crunch-dispatch-slurm: ensure no races between "update queue position" and "update container state" requests
  • Workbench: display the latest queue position when available
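
The rate-limiting and redundancy rules in the list above could be sketched roughly as follows. This is only an illustration, not crunch-dispatch-slurm code: the class name QueuePositionReporter, the send callback, and the forget method are hypothetical, and the sketch does not address locking or the race with "update container state" requests.

```python
import time
from typing import Callable, Dict, Tuple


class QueuePositionReporter:
    """Hypothetical helper: decides when a queue position update is worth sending."""

    def __init__(self,
                 send: Callable[[str, dict], None],
                 min_interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # send(uuid, dispatch_info) stands in for the real API call (assumption).
        self._send = send
        self._min_interval = min_interval
        self._clock = clock
        # container UUID -> (last position sent, time it was sent)
        self._last: Dict[str, Tuple[int, float]] = {}

    def report(self, uuid: str, position: int) -> bool:
        """Send an update if it is neither redundant nor too soon. Returns True if sent."""
        now = self._clock()
        last = self._last.get(uuid)
        if last is not None:
            last_position, last_time = last
            if position == last_position:
                return False  # redundant, e.g. "from 5 to 5"
            if now - last_time < self._min_interval:
                return False  # at most one update per interval per container
        self._send(uuid, {"queue_position": position})
        self._last[uuid] = (position, now)
        return True

    def forget(self, uuid: str) -> None:
        """Drop bookkeeping once the container leaves the Locked state."""
        self._last.pop(uuid, None)
```

A position change that arrives inside the one-second window is dropped here rather than queued, on the assumption that the next squeue check will report it again.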
Actions #1

Updated by Tom Clegg almost 8 years ago

  • Description updated (diff)
  • Category set to Crunch
  • Assigned To set to Tom Clegg
  • Target version set to Arvados Future Sprints
Actions #2

Updated by Tom Clegg almost 8 years ago

  • Description updated (diff)
Actions #3

Updated by Tom Morris over 7 years ago

  • Story points set to 3.0
Actions #4

Updated by Tom Clegg over 7 years ago

  • Story points deleted (3.0)

from squeue(1): "The default value of sort for jobs is "P,t,-p" (increasing partition name then within a given partition by increasing [job] state and then decreasing priority)"

We might want to use "S,-p,V,P" (expected start time, decreasing priority, submission time, partition name).

If we include %t (job state) in the format string, {number of PENDING jobs seen before this one}+1 can be used as the queue position for a job.
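
A minimal sketch of that counting rule, assuming squeue is invoked with the arguments shown and that each SLURM job name (%j) identifies the container; both are assumptions for illustration, as is the function name queue_positions. Note that %t prints the compact state code, so a pending job appears as PD.

```python
from typing import Dict

# Assumed invocation: job name and compact job state, no header, sorted by
# expected start time, decreasing priority, submission time, partition name.
SQUEUE_ARGS = ["squeue", "--all", "--noheader",
               "--format=%j %t", "--sort=S,-p,V,P"]


def queue_positions(squeue_output: str) -> Dict[str, int]:
    """Map each pending job name to (number of pending jobs seen before it) + 1."""
    positions: Dict[str, int] = {}
    pending_seen = 0
    for line in squeue_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        name, state = fields
        if state == "PD":  # compact code for PENDING
            pending_seen += 1
            positions[name] = pending_seen
    return positions
```

Jobs in other states, such as running ones, are skipped and do not count toward the position. Pending jobs that do not belong to this dispatcher still count, since they are ahead in the same queue.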

Actions #5

Updated by Tom Clegg over 7 years ago

  • Story points set to 3.0
Actions #6

Updated by Tom Morris over 7 years ago

  • Assigned To deleted (Tom Clegg)
Actions #7

Updated by Ward Vandewege over 3 years ago

  • Target version deleted (Arvados Future Sprints)