Project

General

Profile

Actions

Story #9187

closed

[Crunch2] Update dispatcher to use new API support

Added by Brett Smith over 8 years ago. Updated over 8 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Start date:
05/02/2016
Due date:
% Done:

75%

Estimated time:
(Total: 0.00 h)
Story points:
1.0

Subtasks 4 (0 open4 closed)

Task #9196: Review 9187-crunchv2-dispatchingResolvedTom Clegg05/02/2016

Actions
Task #9195: Update local dispatcherResolvedPeter Amstutz05/02/2016

Actions
Task #9194: Update slurm dispatcherResolvedPeter Amstutz05/02/2016

Actions
Task #9365: Review 9187-requeued-containersResolvedTom Clegg05/02/2016

Actions

Related issues 2 (0 open2 closed)

Related to Arvados - Story #9626: [Crunch2] c-d-slurm should avoid new --priority optionResolvedBrett Smith07/18/2016

Actions
Follows (1 day) Arvados - Feature #8128: [Crunch2] API support for crunch-dispatchResolvedTom Clegg04/28/2016

Actions
Actions #1

Updated by Brett Smith over 8 years ago

  • Target version set to Arvados Future Sprints
Actions #2

Updated by Tom Clegg over 8 years ago

See #8128
  • Dispatchers should honor priority list: "order by priority DESC, modified_at ASC"
  • Also filter priority > 0
  • Slurm dispatcher should pass priority on sbatch command line, --priority=<value>
  • Slurm dispatcher, when claiming a Locked container previously owned by itself, should check squeue to make sure the slurm job still exists, clean up record if not.
  • crunch-run needs to get the container auth, and pass that (instead of its own dispatch token) into the container env and arv-mount
  • changing state to Running should happen in crunch-run, not crunch-dispatch.
Actions #3

Updated by Peter Amstutz over 8 years ago

  • Target version changed from Arvados Future Sprints to 2016-05-25 sprint
Actions #4

Updated by Peter Amstutz over 8 years ago

  • Category set to Crunch
  • Assigned To set to Peter Amstutz
  • Story points set to 2.0
Actions #5

Updated by Tom Clegg over 8 years ago

Still need: "Slurm dispatcher should pass priority on sbatch command line, --priority=<value>"

Still need: "Slurm dispatcher, when claiming a Locked container previously owned by itself, should check squeue to make sure the slurm job still exists, clean up record if not."

dispatch.go

"XXX needs to handle paging" ... for now, maybe we should at least log something if items_available>len(items)? use limit=1000?

pollContainers() calls dispatcher.getContainers() to get updated state for containers that are "mine" but (according to the "locked_by_uuid" query) aren't locked by us. (Would something like "notReallyMine" be a more precise name for this "monitored" slice?) If I'm following correctly, a container that is now locked by a different dispatcher (e.g., this dispatcher unlocked it, and a different dispatcher took it) will then end up in the "State Locked || State Running" case in handleContainers(), where we start up a new monitoring goroutine, which seems wrong.

Inline function in handleContainers() seems redundant, suggest:

-            go func(uuid string, ch chan Container) {
-                dispatcher.monitorContainer(dispatcher, uuid, ch)
-            }(container.UUID, dispatcher.setMine(container.UUID))
+            go dispatcher.monitorContainer(dispatcher, container.UUID, dispatcher.setMine(container.UUID))

updateMine() needs its comment ... updated.

Suggest renaming "DispatcherState" to "Dispatcher".

Rather than mapping each Dispatcher field to a MakeDispatcher argument, we could make the caller-controlled fields public (e.g., PollInterval, RunContainer, MonitorContainer), add comments in the type definition to explain how they work, and put the private initialization stuff in RunDispatcher(). That way we can add fields without having to update all callers. So, instead of MakeDispatcher(...), callers would say

d := &Dispatcher{
    RunContainer: myRunFunc,
    PollInterval: time.Minute,
    ...,
}
d.Run()

Suggest changing "framework" / "framework struct" to "object".

Suggest renaming handleContainers to handleUpdates. Changing it to handleUpdate() and putting the range loop in RunDispatcher might make tests a little more convenient.

The monitorContainer interface seems a bit weird: it seems to mean "start a goroutine and listen on this channel for updates". Wouldn't it be easier to just send each update directly to a "handleUpdate" callback? It seems the local case would be simpler if runContainer() started a goroutine to wait for the child proc to finish (it's a bit awkward for monitorRun to do this because there isn't necessarily a process to wait for). The slurm case would just lose boilerplate.

crunch-dispatch-slurm.go

The first "container" variable here seems to be useless:

    var container dispatch.Container
    dispatcher.Arv.Get("containers", uuid, nil, &container)

    for container := range status {

dispatcher.Arv.Update("containers", ... -- should this be dispatcher.UpdateState()?

review/update comments in crunch-dispatch-*.go on the funcs that were rewritten to use Dispatcher.

Actions #6

Updated by Tom Clegg over 8 years ago

Also still need: "crunch-run needs to get the container auth, and pass that (instead of its own dispatch token) into the container env and arv-mount" oh yeah, we moved that to #9272

Actions #7

Updated by Tom Clegg over 8 years ago

Why is the Locked→Complete state transition allowed? This seems wrong: how can it be complete if nobody ever started running it?

If we somehow decide after locking that it's not possible for any dispatcher to run the container, shouldn't we call that Cancelled?

Actions #8

Updated by Peter Amstutz over 8 years ago

  • Status changed from New to In Progress
Actions #9

Updated by Brett Smith over 8 years ago

  • Target version changed from 2016-05-25 sprint to 2016-06-08 sprint
Actions #10

Updated by Peter Amstutz over 8 years ago

Tom Clegg wrote:

Why is the Locked→Complete state transition allowed? This seems wrong: how can it be complete if nobody ever started running it?

If we somehow decide after locking that it's not possible for any dispatcher to run the container, shouldn't we call that Cancelled?

If the container was locked but then determined to be unrunnable, what should it do?

If the container is started, but then the system crashes and we have a ghost record in the "Locked" state, what should it do?

We could transition to "Cancelled", but I thought "Cancelled" means that it was cancelled intentionally when the user sets priority to 0. You're suggesting that "Cancelled" is overloaded to also mean the container was unrunnable or lost for some other reason. When we discussed it yesterday, you said that the "Complete" state with a null return code means "the container was lost or something went wrong" which is why I did changed the API server. I would vastly prefer to have an unambiguous representation with an explicit "Error" state for the above conditions.

Actions #11

Updated by Peter Amstutz over 8 years ago

  • Story points changed from 2.0 to 1.0
Actions #12

Updated by Tom Clegg over 8 years ago

Peter Amstutz wrote:

Tom Clegg wrote:

If we somehow decide after locking that it's not possible for any dispatcher to run the container, shouldn't we call that Cancelled?

If the container was locked but then determined to be unrunnable, what should it do?

Set state to Cancelled. (If you mean "unrunnable by anything", that is. If you mean "_this_ dispatcher can't run it right now, possibly ever", set state to Queued.)

If the container is started, but then the system crashes and we have a ghost record in the "Locked" state, what should it do?

Surely it's an error to start a container without changing state to Running?

If the dispatcher crashes while the container is Locked, and then restarts, it should either unlock it (if it can assure itself a crunch-run process isn't still working on it -- e.g., by checking the slurm queue) or leave it alone.

If a container is in the Running state, but crunch-run crashes and loses its ability to track the container's progress, the container is effectively cancelled, so we should set state to Cancelled.

But it's really not crunch-dispatch's job to clean up abandoned containers. That has to happen higher up anyway, so we can clean up containers that are locked by crunch-dispatch processes/tokens that don't exist, for example.

We could transition to "Cancelled", but I thought "Cancelled" means that it was cancelled intentionally when the user sets priority to 0. You're suggesting that "Cancelled" is overloaded to also mean the container was unrunnable or lost for some other reason. When we discussed it yesterday, you said that the "Complete" state with a null return code means "the container was lost or something went wrong" which is why I did changed the API server. I would vastly prefer to have an unambiguous representation with an explicit "Error" state for the above conditions.

Sorry, I guess I misspoke. I didn't intend to creep API changes into this story.

I did make a distinction between "container definitely didn't start" and "container did (or maybe did) start". This is the Locked→Running distinction. If we never even got to Running state, it makes sense to put the container back in the queue and retry -- even if the user asked for retries=0 -- because we know it wasn't the container's own fault. Once we invoke the container, though, we shouldn't put it back in the queue: we allocated resources and ran user-provided code, so it makes sense to show the user "we ran your container on machine M, and X happened".

"Cancelled" means we (the infrastructure) terminated the container: it did not run to completion. ("User sets priority to zero" isn't the only case where we'd do that deliberately.)

The difference between Complete and Cancelled is that Complete means the container exited on its own accord. In the case of Cancelled, for purposes of using the result it doesn't much matter whether we deliberately cut it off before it had a chance to exit by itself, or we accidentally lost track of it and don't know what happened to it -- we have effectively cancelled it and can't say anything about the container's outcome (output / exit code). It doesn't make sense to reuse this result to satisfy a different container request, or even say "this program, when run on this input, is known to fail".

Deciding whether a failure was caused by infrastructure seems orthogonal to this distinction, and something we can deal with separately. For example, if a container exits non-zero after timing out connecting to the API server, is that an infrastructure error? I think a user would say yes, but detecting that and calling it state=Error would be hard to get right.

Actions #13

Updated by Peter Amstutz over 8 years ago

Tom Clegg wrote:

Still need: "Slurm dispatcher should pass priority on sbatch command line, --priority=<value>"

Done

Still need: "Slurm dispatcher, when claiming a Locked container previously owned by itself, should check squeue to make sure the slurm job still exists, clean up record if not."

Done.

dispatch.go

"XXX needs to handle paging" ... for now, maybe we should at least log something if items_available>len(items)? use limit=1000?

Done.

pollContainers() calls dispatcher.getContainers() to get updated state for containers that are "mine" but (according to the "locked_by_uuid" query) aren't locked by us. (Would something like "notReallyMine" be a more precise name for this "monitored" slice?) If I'm following correctly, a container that is now locked by a different dispatcher (e.g., this dispatcher unlocked it, and a different dispatcher took it) will then end up in the "State Locked || State Running" case in handleContainers(), where we start up a new monitoring goroutine, which seems wrong.

The reason it has to do this is because when a container record goes to "Cancelled" or "Complete" state, the "locked_by_uuid" field gets set to nil (because need_lock = nil if the state is not Locked or Running in validate_lock of container.rb). I don't know if that was the intended behavior but that's what it does now. You're right, this could be a problem if the dispatcher put a locked container back into the queue, but that currently doesn't happen.

updateMine() needs its comment ... updated.

Done

Suggest renaming "DispatcherState" to "Dispatcher".

Done

Rather than mapping each Dispatcher field to a MakeDispatcher argument, we could make the caller-controlled fields public (e.g., PollInterval, RunContainer, MonitorContainer), add comments in the type definition to explain how they work, and put the private initialization stuff in RunDispatcher(). That way we can add fields without having to update all callers. So, instead of MakeDispatcher(...), callers would say

Done

Suggest renaming handleContainers to handleUpdates. Changing it to handleUpdate() and putting the range loop in RunDispatcher might make tests a little more convenient.

Done.

The monitorContainer interface seems a bit weird: it seems to mean "start a goroutine and listen on this channel for updates". Wouldn't it be easier to just send each update directly to a "handleUpdate" callback? It seems the local case would be simpler if runContainer() started a goroutine to wait for the child proc to finish (it's a bit awkward for monitorRun to do this because there isn't necessarily a process to wait for). The slurm case would just lose boilerplate.

I simplified this to a single RunContainer callback goroutine which is responsible for determining if it should execute/submit a container, monitor an existing submission, or clean up a stale record.

The idea to use a channel is to decouple the get containers / send updates process from the goroutines that babysit each individual container. It feels cleaner than a callback, because the goroutines have the flexibility to process status changes only when it is appropriate.

Now @ 4153cb6

Actions #14

Updated by Peter Amstutz over 8 years ago

The reason it has to do this is because when a container record goes to "Cancelled" or "Complete" state, the "locked_by_uuid" field gets set to nil (because need_lock = nil if the state is not Locked or Running in validate_lock of container.rb). I don't know if that was the intended behavior but that's what it does now. You're right, this could be a problem if the dispatcher put a locked container back into the queue, but that currently doesn't happen.

I see from your comments on #9272, that putting a locked container back to "queued" can happen now.

On further consideration, I think the logic just needs to be tweaked to something like "if locked_by_uuid is nil and container is complete or cancelled or queued then release it"

Actions #15

Updated by Tom Clegg over 8 years ago

Peter Amstutz wrote:

Tom Clegg wrote:

Still need: "Slurm dispatcher should pass priority on sbatch command line, --priority=<value>"

Done

I know my comment above says "--priority=x" but did you try this? I get:

crunch@c97qk:~$ sbatch --priority=123 test
sbatch: unrecognized option '--priority=123'
Try "sbatch --help" for more information
crunch@c97qk:~$ sbatch --nice=-123 test
sbatch: error: Nice value must be non-negative, value ignored
Submitted batch job 556
crunch@c97qk:~$ sbatch --nice=0 test
Submitted batch job 557
crunch@c97qk:~$ sbatch --nice=100 test
Submitted batch job 558

Thoughts?

Still need: "Slurm dispatcher, when claiming a Locked container previously owned by itself, should check squeue to make sure the slurm job still exists, clean up record if not."

Done.

checkSqueue should check squeue's exit code.

An error in checkSqueue() doesn't mean our job has disappeared from the queue, so we shouldn't cancel the container in this case. It just means we don't know what's in the slurm queue right now. We should probably just log the error and try again. Can we move the squeue check into the "range status" loop so we notice even if slurm drops it on the floor but crunch-dispatch-slurm is still running?

crunch-dispatch-slurm should notice when a monitored container disappears from the slurm queue even if the Dispatcher doesn't send any updates about that container. Dispatcher sends periodic updates about every running/owned container even when nothing has changed, so this isn't necessary, right?

pollContainers() calls dispatcher.getContainers() to get updated state for containers that are "mine" but (according to the "locked_by_uuid" query) aren't locked by us. (Would something like "notReallyMine" be a more precise name for this "monitored" slice?) If I'm following correctly, a container that is now locked by a different dispatcher (e.g., this dispatcher unlocked it, and a different dispatcher took it) will then end up in the "State Locked || State Running" case in handleContainers(), where we start up a new monitoring goroutine, which seems wrong.

The reason it has to do this is because when a container record goes to "Cancelled" or "Complete" state, the "locked_by_uuid" field gets set to nil (because need_lock = nil if the state is not Locked or Running in validate_lock of container.rb). I don't know if that was the intended behavior but that's what it does now. You're right, this could be a problem if the dispatcher put a locked container back into the queue, but that currently doesn't happen.

The API already allows Locked→Queued, and I think it has to: otherwise a slurm dispatcher would prematurely fail all containers if it started getting errors running "sbatch", instead of letting someone else (or itself) try to run them, for example. #7292 does this in crunch-run: if we aren't able to start the container, just put it back in the queue so someone else can take it.

Anyway, it seems like the bug can be fixed by adding a filter like ["locked_by_uuid","=",dispatcher.auth.UUID]. Anything that isn't locked by our token (and wasn't already received in the earlier "queued" query) should be un-monitored, regardless of state, right?

Basically it seems like "Dispatcher doesn't give you updates about containers that don't actually belong to you" would be a sensible feature.

I simplified this to a single RunContainer callback goroutine which is responsible for determining if it should execute/submit a container, monitor an existing submission, or clean up a stale record.

I think this is better, but can you update RunDispatcher's comment to reflect the new behavior?

getContainers() should return after logging the API error, instead of pyramiding the rest of the function in an else.

Actions #16

Updated by Tom Clegg over 8 years ago

If "cancel container if it disappears from the queue without cleaning itself up" moved into the "range status" loop, then we could drop the "crunch-finish-slurm.sh" thing and the dependency on Ruby etc. Is that right?

Actions #17

Updated by Peter Amstutz over 8 years ago

Tom Clegg wrote:

I know my comment above says "--priority=x" but did you try this? I get:

[...]

Thoughts?

root@75d5ed700c4d:/# echo '#!/bin/sh\necho foo' | sbatch --priority=5 
Submitted batch job 2
root@75d5ed700c4d:/# sbatch --version
slurm 14.03.9

Still need: "Slurm dispatcher, when claiming a Locked container previously owned by itself, should check squeue to make sure the slurm job still exists, clean up record if not."

Done.

checkSqueue should check squeue's exit code.

Done.

An error in checkSqueue() doesn't mean our job has disappeared from the queue, so we shouldn't cancel the container in this case. It just means we don't know what's in the slurm queue right now. We should probably just log the error and try again. Can we move the squeue check into the "range status" loop so we notice even if slurm drops it on the floor but crunch-dispatch-slurm is still running?

Sure. Now on each status tick it checks squeue and updates state if not in the queue.

crunch-dispatch-slurm should notice when a monitored container disappears from the slurm queue even if the Dispatcher doesn't send any updates about that container. Dispatcher sends periodic updates about every running/owned container even when nothing has changed, so this isn't necessary, right?

FWIW part of the reasoning for the design is to make it easy to plug in websockets or some other notification mechanism instead of polling in the future. In which case there won't be a default "heartbeat".

pollContainers() calls dispatcher.getContainers() to get updated state for containers that are "mine" but (according to the "locked_by_uuid" query) aren't locked by us. (Would something like "notReallyMine" be a more precise name for this "monitored" slice?) If I'm following correctly, a container that is now locked by a different dispatcher (e.g., this dispatcher unlocked it, and a different dispatcher took it) will then end up in the "State Locked || State Running" case in handleContainers(), where we start up a new monitoring goroutine, which seems wrong.

Fixed. Now stops monitoring any container where container.LockedByUUID != dispatcher.Auth.UUID and not Queued.

The API already allows Locked→Queued, and I think it has to: otherwise a slurm dispatcher would prematurely fail all containers if it started getting errors running "sbatch", instead of letting someone else (or itself) try to run them, for example. #7292 does this in crunch-run: if we aren't able to start the container, just put it back in the queue so someone else can take it.

So the intended behavior is that any errors that occur up to actually running the container should move it from "Locked" to "Queued"? Ok. Fixed.

Anyway, it seems like the bug can be fixed by adding a filter like ["locked_by_uuid","=",dispatcher.auth.UUID]. Anything that isn't locked by our token (and wasn't already received in the earlier "queued" query) should be un-monitored, regardless of state, right?

There's 3 queries that the dispatcher makes:

  1. for queued containers with > 0
  2. for containers where locked_by_uuid = dispatcher.auth.uuid
  3. for containers that known to the dispatcher and not included in either of the prior two queries. This is necessary to notice the transitions that result in locked_by_uuid = null; e.g. when a container goes from "Running" to "Complete".

Containers retrieved in the last query will probably also result in "notMine()" being called in handleUpdate().

Basically it seems like "Dispatcher doesn't give you updates about containers that don't actually belong to you" would be a sensible feature.

Done.

I simplified this to a single RunContainer callback goroutine which is responsible for determining if it should execute/submit a container, monitor an existing submission, or clean up a stale record.

I think this is better, but can you update RunDispatcher's comment to reflect the new behavior?

Done.

getContainers() should return after logging the API error, instead of pyramiding the rest of the function in an else.

Done.

Actions #18

Updated by Tom Clegg over 8 years ago

Peter Amstutz wrote:

root@75d5ed700c4d:/# echo '#!/bin/sh\necho foo' | sbatch --priority=5 
Submitted batch job 2
root@75d5ed700c4d:/# sbatch --version
slurm 14.03.9

Please clarify. Does this mean we depend on slurm >= 14 (?) and crunch-dispatch should submit jobs as root?

This is supposed to run jobs on existing slurm installations that aren't "owned" by Arvados -- it seems like either one of those could be a non-starter.

(fwiw, debian7 has 2.3.4, ubuntu1404 has 2.6.5.)

crunch-dispatch-slurm should notice when a monitored container disappears from the slurm queue even if the Dispatcher doesn't send any updates about that container. Dispatcher sends periodic updates about every running/owned container even when nothing has changed, so this isn't necessary, right?

FWIW part of the reasoning for the design is to make it easy to plug in websockets or some other notification mechanism instead of polling in the future. In which case there won't be a default "heartbeat".

I guess Dispatcher should explicitly document what the caller is supposed to expect (update when something changes? or periodic updates even when nothing changes?) and crunch-dispatch-slurm should either rely on the heartbeat or not, accordingly.

Suggest previewing the godoc page by running GOPATH={buildtmp}/GOPATH godoc -http :6060 and looking at http://localhost:6060/pkg/git.curoverse.com/arvados.git/sdk/go/dispatch/, if you haven't already.

(It sounds like you have some unpushed commits? 9187-crunchv2-dispatching is at 4153cb6)

Actions #19

Updated by Brett Smith over 8 years ago

Tom Clegg wrote:

This is supposed to run jobs on existing slurm installations that aren't "owned" by Arvados -- it seems like either one of those could be a non-starter.

(fwiw, debian7 has 2.3.4, ubuntu1404 has 2.6.5.)

I'm not sure how much we should care about Debian 7 anymore, since it's out of official security support and off to the LTS team. But the point about Ubuntu 14.04 is pretty compelling.

RHEL 6 came out in November 2010, and is still fully supported. SLURM 2.1 was the latest version at that time; SLURM 2.2 came out in December 2010. If we're going to require a later version (and I realize that's still an "if" from the comments), we need to have a discussion about what we're imposing on our customers.

Actions #20

Updated by Peter Amstutz over 8 years ago

Tom Clegg wrote:

Peter Amstutz wrote:

[...]

Please clarify. Does this mean we depend on slurm >= 14 (?) and crunch-dispatch should submit jobs as root?

It does seem to be an issue with the slurm version. We don't need to submit jobs as root, that's just an artifact of happening to run sbatch as root from within the arvbox container.

This is supposed to run jobs on existing slurm installations that aren't "owned" by Arvados -- it seems like either one of those could be a non-starter.

(fwiw, debian7 has 2.3.4, ubuntu1404 has 2.6.5.)

Ok, we need check with users what version to target.

crunch-dispatch-slurm should notice when a monitored container disappears from the slurm queue even if the Dispatcher doesn't send any updates about that container. Dispatcher sends periodic updates about every running/owned container even when nothing has changed, so this isn't necessary, right?

FWIW part of the reasoning for the design is to make it easy to plug in websockets or some other notification mechanism instead of polling in the future. In which case there won't be a default "heartbeat".

I guess Dispatcher should explicitly document what the caller is supposed to expect (update when something changes? or periodic updates even when nothing changes?) and crunch-dispatch-slurm should either rely on the heartbeat or not, accordingly.

Documentation is updated to reflect that.

Suggest previewing the godoc page by running GOPATH={buildtmp}/GOPATH godoc -http :6060 and looking at http://localhost:6060/pkg/git.curoverse.com/arvados.git/sdk/go/dispatch/, if you haven't already.

(It sounds like you have some unpushed commits? 9187-crunchv2-dispatching is at 4153cb6)

Coming right up...

Actions #21

Updated by Peter Amstutz over 8 years ago

Now @ 3ae9a789410e93eeb31ca5670c17a6d03d77f608

Slurm dispatcher improvements around squeue:

  • Clarify that status updates are not guaranteed to be delivered on a
    heartbeat.
  • Refactor slurm dispatcher to monitor the container in squeue in a separate
    goroutine.
  • Refactor polling squeue to a single goroutine and cache the results so that
    monitoring 100 containers doesn't result in 100 calls to squeue.
  • No longer set up strigger to cancel job on finish, instead cancel running
    jobs not in squeue.
  • Test both cases where a job is/is not in squeue.
Actions #22

Updated by Tom Clegg over 8 years ago

Thanks for reducing "one squeue per poll per container" to "one squeue per poll", and getting rid of strigger. This is looking much better.

An error in checkSqueue() doesn't mean our job has disappeared from the queue, so we shouldn't cancel the container in this case. It just means we don't know what's in the slurm queue right now. We should probably just log the error and try again.

I think we still have this problem in the if container.State == dispatch.Locked block: if we get an squeue error we unlock/re-queue the container, but we shouldn't touch it in this case: we don't actually know what's happening.

Perhaps we could address this by doing all of the checking in one goroutine (i.e., the body of run()), something like this:

submitted := false
for { select {
case tick:
  inQ, err := CheckSqueue()
  if err {
    log
  } else if inQ {
    submitted = true
  } else if !submitted {
    submit
  } else {
    change state Locked->Queued or Running->Cancelled
  }
case status:
  if status is closed {
    break
  }
  if submitted && priority == 0 {
    scancel
  }
}}

Comments on Squeue methods would be helpful.

Might be better to log squeue errors in CheckSqueue instead of in run() -- that way logs will reveal when squeue is failing, even if no new containers are starting.

nit: could use an RWMutex in Squeue -- CheckSqueue could use squeue.RLocker().Lock(), so many goroutines can read the squeue (but writes are still safe).

nit: updating squeue.squeueContents seems like it could be done in runSqueue. Then SyncSqueue could call runSqueue, and CheckSqueue wouldn't need the "empty uuid means I'm just using you to call runSqueue" feature.

run() has a variable uuid which is just container.UUID, but most of the code still uses container.UUID instead of the local copy anyway. The place that does need its own copy (to avoid races with the for container = ... loop) is the squeue-checking goroutine. Suggest:
  • removing the variable and using container.UUID instead in the ~4 affected places in the main func
  • passing a copy of container into the goroutine func. go func(container dispatch.Container) { ... }(container) (unless we get rid of the goroutine as above, of course)

Change formally→formerly in handleUpdate comment

Actions #23

Updated by Peter Amstutz over 8 years ago

Tom Clegg wrote:

Thanks for reducing "one squeue per poll per container" to "one squeue per poll", and getting rid of strigger. This is looking much better.

Agree. I further and added an condition variable so that the monitoring goroutines only turn over when squeue has actually been successfully updated. This has some benefits for reasoning about sequencing because CheckSqueue() now blocks until the next update instead of having to interleave polling with calls to run squeue immediately to ensure that the results of sbatch/scancel show up.

Perhaps we could address this by doing all of the checking in one goroutine (i.e., the body of run()), something like this:

[...]

Done. Except it needs to be in a separate goroutine, because you can't use the select { } construct on mutexes or condition variables. (An alternate construct would be to have a goroutine which waits on CheckSqueue() and sends events over a channel, but I'm not sure if that would have any advantages here.)

Comments on Squeue methods would be helpful.

Done.

Might be better to log squeue errors in CheckSqueue instead of in run() -- that way logs will reveal when squeue is failing, even if no new containers are starting.

Done.

nit: could use an RWMutex in Squeue -- CheckSqueue could use squeue.RLocker().Lock(), so many goroutines can read the squeue (but writes are still safe).

It's structured a little bit differently now so the lock doesn't extend over the loop that searches the array. The critical section is extremely small, just enough to read or write the squeueContents field.

(As an aside, read-write locks are not actually more efficient than standard mutexes because you still need to perform an atomic operation on each RLock/RUnlock, and when the critical section is small the overhead of the atomic operations dominate the time spent in the critical section.)

nit: updating squeue.squeueContents seems like it could be done in runSqueue. Then SyncSqueue could call runSqueue, and CheckSqueue wouldn't need the "empty uuid means I'm just using you to call runSqueue" feature.

Done, along with the more substantial refactor of CheckSqueue() into a blocking operation.

run() has a variable uuid which is just container.UUID, but most of the code still uses container.UUID instead of the local copy anyway. The place that does need its own copy (to avoid races with the for container = ... loop) is the squeue-checking goroutine. Suggest:
  • removing the variable and using container.UUID instead in the ~4 affected places in the main func
  • passing a copy of container into the goroutine func. go func(container dispatch.Container) { ... }(container) (unless we get rid of the goroutine as above, of course)

Cleaned that up.

Change formally→formerly in handleUpdate comment

now @ bb10b77

Actions #24

Updated by Tom Clegg over 8 years ago

Refactoring mishap -- Squeue's methods have a pointer receiver called "squeue" but they all seem to ignore it and operate on the squeueUpdater global instead.

RunSqueue only calls Broadcast when it successfully gets fresh squeue information. Meanwhile, CheckSqueue returns an error if squeueError != nil after Wait(). I see that it's possible for squeueError to be non-nil after Wait(), but I don't think it actually matters. What matters is that, if Wait() returns, squeueContents is fresh (newer than when we called Wait()). If, in the time since we called Wait(), there has been a successful update followed by an error, we do have fresh information to return, right? If I'm following correctly, this means CheckSqueue() doesn't need to return an error. As its comment says, it simply doesn't return until there is a successful update.

This also seems to mean squeueError is useless, which would be nice, because RunSqueue could lose some repetitive Lock/Unlock code.

StartMonitor calls RunSqueue() once before starting the periodic updates, which I assume is intended to try preloading a good squeue. But I suspect this is in vain: this first result (if any) can never be used, because no caller can possibly be waiting in squeueCond.Wait() yet.

Actions #25

Updated by Peter Amstutz over 8 years ago

Tom Clegg wrote:

Refactoring mishap -- Squeue's methods have a pointer receiver called "squeue" but they all seem to ignore it and operate on the squeueUpdater global instead.

facepalm you're right. Fixed.

RunSqueue only calls Broadcast when it successfully gets fresh squeue information. Meanwhile, CheckSqueue returns an error if squeueError != nil after Wait(). I see that it's possible for squeueError to be non-nil after Wait(), but I don't think it actually matters. What matters is that, if Wait() returns, squeueContents is fresh (newer than when we called Wait()). If, in the time since we called Wait(), there has been a successful update followed by an error, we do have fresh information to return, right? If I'm following correctly, this means CheckSqueue() doesn't need to return an error. As its comment says, it simply doesn't return until there is a successful update.

This also seems to mean squeueError is useless, which would be nice, because RunSqueue could lose some repetitive Lock/Unlock code.

That's a good insight. The key is to wait for a successful squeue check occuring strictly after sbatch/scancel. Removed squeueError.

StartMonitor calls RunSqueue() once before starting the periodic updates, which I assume is intended to try preloading a good squeue. But I suspect this is in vain: this first result (if any) can never be used, because no caller can possibly be waiting in squeueCond.Wait() yet.

Done.

I also refactored the tests a little bit and added a new test for the priority=0 -> scancel case.

now @ c405873d87e0764acf3855c559c85fa6d7a63cfb

Actions #26

Updated by Tom Clegg over 8 years ago

LGTM @ c405873

Aside: I accidentally ran tests with an out-of-date API server, and got to observe an unfortunate interaction between the signal handling and "25 retries": the dispatcher can take a while to get through 26 API calls before eventually noticing SIGTERM, which seemed especially silly because [a] the failing API call was "lock container" which gets retried by crunch-dispatch-slurm itself on the next squeue update anyway, and [b] the API error was 422 because the request was invalid, but we consider that retryable (because Rails/API uses 422 for "miscellaneous"?).

Given dispatch's "figure out what to do next; do it; repeat" approach, I wonder whether the 25 retries are really worth anything -- or, if we reduced it to zero, would everything still work correctly in the face of failures? It seems like sometimes even 25 retries won't be enough anyway, so it might be better (wrt heisenbugs) to use one fault tolerance path more often, rather than relying on 25 retries most of the time, and relying on crunch-dispatch's own fault tolerance only in the cases where retrying doesn't work.

I'd be fine with retries=0, or a smaller number than 25, or just leaving it alone for now, whatever you think is good. This story has crept enough already...

Actions #27

Updated by Peter Amstutz over 8 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 67 to 100

Applied in changeset arvados|commit:55febaa6856eb2d77657edd685aaae78bba0ec82.

Actions #28

Updated by Peter Amstutz over 8 years ago

  • Status changed from Resolved to In Progress

Reopened for 9187-requeued-containers bug fixes based on testing.

Actions #29

Updated by Tom Clegg over 8 years ago

checkMine's doc comment isn't quite correct: it says it only returns true if update is true.

I think it would be helpful to add a comment on the "if Queued and already mine" case. The idea here is that we close the existing monitoring channel, open a new one, and start a new RunContainer() goroutine, right? Normally this means crunch-dispatch-* or crunch-run has set state back to Queued after some failure... And in other cases a lost-and-confused crunch-run process is expected to die naturally due to changed auth token, because some version of #9328 will be in place... But in any case it doesn't make sense to sit around waiting for the failed attempt to do something interesting... right?

Aside from adding/fixing comments, LGTM. Thanks.

Actions #30

Updated by Peter Amstutz over 8 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 50 to 100

Applied in changeset arvados|commit:84b538b627cba20558e0a52f3efe73eec90a46fe.

Actions

Also available in: Atom PDF