Story #12552
closedWhen priority is equal, the children of an earlier-submitted workflow should run first
100%
Description
Currently containers are prioritized by descending priority, then creation time.
Typically, each workflow runs many child containers. When multiple workflows are running, the creation times of their child containers are interleaved, and therefore so is their execution order. As more work is added to the system, earlier workflows make slower and slower progress.
Desired behavior:
If two concurrent workflows have equal priority, the children of the earlier-submitted workflow should be prioritized ahead of children of the later-submitted workflow.
Updated by Peter Amstutz about 7 years ago
- Related to Story #12574: [API] Propagate priority of parent container to children added
Updated by Peter Amstutz about 7 years ago
One way to do this is something like what slurm does: submit the first workflow at a high priority (say 900) and each subsequent workflow is given the next lower priority, 899, 898, 897, ... and the priority propagates to all child jobs (#12574)
Updated by Tom Clegg about 7 years ago
Peter Amstutz wrote:
One way to do this is something like what slurm does: submit the first workflow at a high priority (say 900) and each subsequent workflow is given the next lower priority, 899, 898, 897, ... and the priority propagates to all child jobs (#12574)
I don't think we need to change the meaning of the "priority" field or ask users to track a moving target. If we sort first by top-level priority, then by top-level submit time, and propagate that order to the slurm queue (#12573) or other scheduling backend, it seems like users will get what they want: higher-priority workflows go first, and equal-priority workflows run ~serially instead of slowing one another down.
Updated by Peter Amstutz about 7 years ago
- Related to Bug #12573: Priority is ignored in Crunch2 added
Updated by Peter Amstutz about 7 years ago
Updated by Tom Clegg almost 7 years ago
The container priority field would be much easier for dispatchers (and workbench and other clients) to use if it were offered as a single number, with the semantics "a container with priority=X has precedence over every container with priority<X."
The container request priority field should still be controlled directly by the caller, though, and the behavior requested here (i.e., if top-level containers WF1 and WF2 have equal priority, new children of WF1 run before old children of WF2) should still be possible.
Proposed implementation:- Container priority is an int64: low ~50 bits are a negative timestamp1, high ~10 bits are "Pmax", the highest priority of any container request that's being satisfied by this container
- The timestamp is not necessarily the creation time of the container itself -- it's the creation time of the oldest top-level container that has priority=Pmax.
In this example, the container at the bottom has priority (2<<50)–1911 because the highest priority of any requesting CR is 2, and the oldest top-level container with priority=2 was created at 1911.
CTR REQ CTR REQ CTR REQ priority=1 priority=2 priority=2 created=1900 created=1910 created=1920 | | | CTR CTR CTR priority=(1<<50)-1901 priority=(2<<50)-1911 priority=(2<<50)-1921 created=1901 created=1911 created=1921 | | | CTR REQ CTR REQ CTR REQ priority=1 priority=2 priority=2 created=1902 created=1912 created=1922 | | | +-------------------------+------------------------+ | CTR priority=(2<<50)-1911 created=1904This bottom-level container would be delayed only by a competing container with an ancestor that has
- priority>2, or
- priority=2 and created<1911.
1 javascript timestamp might be a reasonable choice: 44 bits are enough to span 1970-01-01 to 2527-06-23 with 1ms granularity.
Updated by Tom Clegg almost 7 years ago
Priority won't be in the 1...1000 range any more. Therefore it seems like we'd need to bundle this with the more complete SLURM scheduling fix we punted during #12573.
Updated by Tom Clegg almost 7 years ago
- new jobs are submitted with nice=10000
- config.PriorityElbowRoom is
- ideally 1000
- lower (10?) on sites where slurm nice values only go to 10000
- lower (0?) on sites where Arvados-submitted jobs need to compete with other slurm tenants and it's more important to get priority higher than to minimize scontrol invocations
func reprioritize() {
wantQueue := []struct {
ctr *arvados.Container
priority int // current slurm priority
nice int // current slurm nice value
}{}
// Get current set of "our" containers (queued, running, and
// locked) into wantQueue
// Call "squeue" and fill in "priority" and "nice" fields of
// wantQueue
// Sort wantQueue by ctr.Priority desc, ctr.CreatedAt asc (i.e.,
// wantQueue[0] should end up with the highest slurm priority)
// Check & fix:
var nextPriority int
for i, ent := range wantQueue {
adjust := 0 // how much priority should increase
if i == 0 {
// Might as well make the highest priority job
// nice=0.
adjust = ent.nice
} else if ent.priority > nextPriority {
// Current slurm priority is too high.
adjust = nextPriority - ent.priority
if i == len(wantQueue) || wantQueue[i+1].priority > nextPriority-1 {
// We're going to renice the next item
// anyway (or this is the last item).
// We might as well make some elbow
// room so we don't go full N^2.
adjust -= config.PriorityElbowRoom
}
setSlurmNice(ent.ctr.UUID, ent.nice-adjust)
} else if ent.priority < nextPriority-config.PriorityElbowRoom*10 {
// Too much distance. This ensures we pull
// newly submitted jobs up to a reasonable
// priority number, even if they're at the
// back of the queue.
adjust = nextPriority - ent.priority - config.PriorityElbowRoom
}
if adjust > ent.nice {
// Can't adjust that much without being root.
// Just go as far as we can.
adjust = ent.nice
}
if adjust != 0 {
setSlurmNice(ent.ctr.UUID, ent.nice-adjust)
}
nextPriority = ent.priority + adjust - 1
}
}
func setSlurmNice(ctrUUID string, newNiceValue int) {
// ...
}
(moved from #12753#note-19)
Updated by Tom Morris almost 7 years ago
- Target version changed from To Be Groomed to 2018-01-31 Sprint
- Story points set to 2.0
Updated by Tom Clegg almost 7 years ago
- Subject changed from Job scheduler should prioritize based on time workflow is submitted to When priority is equal, the children of an earlier-submitted workflow should run first
- Description updated (diff)
Updated by Tom Morris almost 7 years ago
- Target version changed from 2018-01-31 Sprint to To Be Groomed
Updated by Tom Morris almost 7 years ago
- Target version changed from To Be Groomed to 2018-02-28 Sprint
Updated by Tom Clegg almost 7 years ago
- ensures slurm queue priority order matches arvados priority order
- sorts on container priority and creation time
- compatible with (but does not yet implement) api server reporting "workflow priority" (taking into account the highest-priority ancestor) in container response
- "elbow room" config renamed to PrioritySpread and added to install docs
Updated by Lucas Di Pentima almost 7 years ago
Sorry for taking so long reviewing this, there are some concepts that I’m not being able to understand yet, for example the “nice” value of slurm, is it used to adjust the priority on the slurm side? If so, what happens to the nice value? is set to 0 by slurm once it’s used? Is it used the same way as on Linux processes (eg: greater niceness => lower priority) or the opposite? I suppose the TestReniceChurn
test is simulating Slurm behavior, by adding the niceness to the priority so maybe the greater the nice value is, the less nice is the job?
I think that if the PrioritySpread documentation would need more detail, for example some guidance on which symptoms could be observed when having a non ideal value? Maybe some examples, etc?
If lots of scontrol
executions are going to happen, can the time of adjustments be reduced by executing several commands at once? Like in a thread pool or similar... don't know if slurm is able to handle that.
Could it be that the main topic of this story (When priority is equal, the children of an earlier-submitted workflow should run first) is still pending to be tested? That is, AFAICT the tests implicitly accept that the jobs list is already sorted by the encoded priority/created_at value, maybe it's convenient to also test the encoding part?
Updated by Tom Clegg almost 7 years ago
Lucas Di Pentima wrote:
Sorry for taking so long reviewing this, there are some concepts that I’m not being able to understand yet, for example the “nice” value of slurm, is it used to adjust the priority on the slurm side? If so, what happens to the nice value? is set to 0 by slurm once it’s used? Is it used the same way as on Linux processes (eg: greater niceness => lower priority) or the opposite? I suppose the
TestReniceChurn
test is simulating Slurm behavior, by adding the niceness to the priority so maybe the greater the nice value is, the less nice is the job?
(priority reported by slurm) = (priority assigned by slurm at submit time) - (nice value)
(priority assigned by slurm at submit time) = some function of submit time/order, slurm configuration, etc., IOW, a bunch of stuff we can't control.
"more nice" = "lower priority".
I think that if the PrioritySpread documentation would need more detail, for example some guidance on which symptoms could be observed when having a non ideal value? Maybe some examples, etc?
Seems reasonable. How about:
"If your Arvados containers are waiting too long in the SLURM queue because their "nice" values are too high for them to compete with other SLURM jobs, you should use a lower PrioritySpread value."
If lots of
scontrol
executions are going to happen, can the time of adjustments be reduced by executing several commands at once? Like in a thread pool or similar... don't know if slurm is able to handle that.
I wondered about batching and/or parallelizing update invocations, and although I think slurm should be able to handle it, I'm wary of it being a premature optimization.
Could it be that the main topic of this story (When priority is equal, the children of an earlier-submitted workflow should run first) is still pending to be tested? That is, AFAICT the tests implicitly accept that the jobs list is already sorted by the encoded priority/created_at value, maybe it's convenient to also test the encoding part?
The thing implemented/tested so far is just reordering the slurm job priorities to match the Arvados container priorities.
The API server update (which isn't done yet) is still needed to make the Arvados priorities obey the rule "when container request priority is equal, children of an earlier-submitted workflow have a higher container priority".
Taken together, these should add up to "user-desired order is communicated to slurm scheduler".
Updated by Lucas Di Pentima almost 7 years ago
Thanks for the explanations. I think the additional comment on the documentation will be useful mainly for ops people who are upgrading and see scheduling behaviour changes.
With that, it LGTM.
Updated by Tom Clegg almost 7 years ago
- Target version changed from 2018-02-28 Sprint to 2018-03-14 Sprint
Updated by Tom Clegg almost 7 years ago
- Related to Bug #13166: [node manager] wishlist should consist of top priority containers added
Updated by Tom Clegg almost 7 years ago
Aside:
We seem to have a surprising API behavior: When a container_requests#create call is recognized as coming from a running container, the supplied priority is ignored, and the submitter's own priority is used instead -- even if the supplied priority is zero. In other cases, requesting priority=0 results in priority=0 (i.e., if state=Committed, find/create a suitable container, but don't run it). I think we should check for components that are relying on this behavior, and fix it. Meanwhile it looks like it shouldn't affect the current issue.
Updated by Tom Clegg almost 7 years ago
- when container request priority is equal, children of an earlier-submitted workflow have a higher container priority
- when container request priority is higher, all of its children/dependencies also have higher container priority
Container priorities are equal in many cases (e.g., if only one workflow is running, its children all have the same priority). The slurm dispatcher doesn't have an "equal priority" concept so it sorts them next on container UUID to avoid incessant slurm-queue-rearranging.
It would be better to run containers in creation order when priority is equal. Working on that branch next.
Updated by Lucas Di Pentima almost 7 years ago
12552-wf-priority
LGTM, with just one comment:
- File
services/api/test/unit/container_request_test.rb
- Lines 328 & 329: I think assert_operator is meant to check parents[n] instead of children[n]
Updated by Tom Clegg almost 7 years ago
- Target version changed from 2018-03-14 Sprint to 2018-03-28 Sprint
Updated by Tom Clegg almost 7 years ago
- Status changed from In Progress to Resolved