Bug #11626
closed[Crunch2] Still not propagating slurm errors to user
Added by Peter Amstutz over 7 years ago. Updated over 7 years ago.
100%
Description
If a user submits a job and slurm rejects it, the slurm error should be propagated to the container log so the user can see what is wrong.
Updated by Peter Amstutz over 7 years ago
- Target version set to 2017-05-10 sprint
Updated by Peter Amstutz over 7 years ago
- Status changed from New to In Progress
Updated by Tom Clegg over 7 years ago
I'm not sure why the HasUUID() check is added in run() in crunch-dispatch-slurm.go. ctx.Done() fires as soon as HasUUID() returns false, so it seems like it could only make a difference during a race between <-status and <-ctx.Done() -- the rest of the time it just introduces an unnecessary polling delay between dispatch telling crunch-dispatch-slurm to cancel and c-d-s actually running scancel.
Also not sure why we're adding the extra call to tracker.update(c) before tracker.close(). The dispatch library is supposed to reduce the amount of state/priority logic each dispatcher has to implement -- if the idea here is just to log every state change, perhaps it should go in dispatch.go instead?
event_type="dispatch" might be better than "crunch-dispatch-slurm" for filtering.
In TestIntegrationNormal, why is the *bool better than a bool?
Updated by Peter Amstutz over 7 years ago
However I've rebased out the HasUUID changes so we can discuss it separately.
Now at @ f8675ad473b45387b1286c6b7a41edf36148ebac
(note-5 refering to 26d7e1809376534b9060534e25399cbd462f38da)
Hhe problem I was trying to solve was that the logs are full of this:
2017-05-08_14:09:51.64626 2017/05/08 14:09:51 Dispatcher says container qr1hi-dz642-tjezupwylh67k2b is done: cancel slurm job 2017-05-08_14:09:52.35643 2017/05/08 14:09:52 container qr1hi-dz642-tjezupwylh67k2b is still in squeue after scancel 2017-05-08_14:09:53.35654 2017/05/08 14:09:53 Dispatcher says container qr1hi-dz642-tjezupwylh67k2b is done: cancel slurm job 2017-05-08_14:09:54.35664 2017/05/08 14:09:54 container qr1hi-dz642-tjezupwylh67k2b is still in squeue after scancel 2017-05-08_14:09:55.35680 2017/05/08 14:09:55 Dispatcher says container qr1hi-dz642-tjezupwylh67k2b is done: cancel slurm job 2017-05-08_14:09:56.35509 2017/05/08 14:09:56 container qr1hi-dz642-tjezupwylh67k2b is still in squeue after scancel 2017-05-08_14:09:57.35538 2017/05/08 14:09:57 Dispatcher says container qr1hi-dz642-tjezupwylh67k2b is done: cancel slurm job 2017-05-08_14:09:58.35661 2017/05/08 14:09:58 container qr1hi-dz642-tjezupwylh67k2b is still in squeue after scancel 2017-05-08_14:09:59.35681 2017/05/08 14:09:59 Dispatcher says container qr1hi-dz642-tjezupwylh67k2b is done: cancel slurm job 2017-05-08_14:10:00.35669 2017/05/08 14:10:00 container qr1hi-dz642-tjezupwylh67k2b is still in squeue after scancel 2017-05-08_14:10:01.35804 2017/05/08 14:10:01 Dispatcher says container qr1hi-dz642-tjezupwylh67k2b is done: cancel slurm job 2017-05-08_14:10:02.35522 2017/05/08 14:10:02 container qr1hi-dz642-tjezupwylh67k2b is still in squeue after scancel 2017-05-08_14:10:03.35664 2017/05/08 14:10:03 Dispatcher says container qr1hi-dz642-tjezupwylh67k2b is done: cancel slurm job 2017-05-08_14:10:04.35882 2017/05/08 14:10:04 container qr1hi-dz642-tjezupwylh67k2b is still in squeue after scancel 2017-05-08_14:10:05.35906 2017/05/08 14:10:05 Dispatcher says container qr1hi-dz642-tjezupwylh67k2b is done: cancel slurm job 2017-05-08_14:10:06.36192 2017/05/08 14:10:06 container qr1hi-dz642-tjezupwylh67k2b is still in squeue after scancel 2017-05-08_14:10:07.36220 2017/05/08 14:10:07 Dispatcher says container qr1hi-dz642-tjezupwylh67k2b is done: cancel slurm job 2017-05-08_14:10:08.36188 2017/05/08 14:10:08 container qr1hi-dz642-tjezupwylh67k2b is still in squeue after scancel 2017-05-08_14:10:09.36211 2017/05/08 14:10:09 Dispatcher says container qr1hi-dz642-tjezupwylh67k2b is done: cancel slurm job 2017-05-08_14:10:10.35595 2017/05/08 14:10:10 container qr1hi-dz642-tjezupwylh67k2b is still in squeue after scancel
This is because once the channel is closed, it just keeps hitting scancel() with no delay in between. So I thought it could at least wait for the next HasUUID() broadcast.
Updated by Peter Amstutz over 7 years ago
- Target version changed from 2017-05-10 sprint to 2017-05-24 sprint
Updated by Tom Clegg over 7 years ago
Peter Amstutz wrote:
Hhe problem I was trying to solve was that the logs are full of this:
[...]
This is because once the channel is closed, it just keeps hitting scancel() with no delay in between. So I thought it could at least wait for the next HasUUID() broadcast.
It looks to me like there's a 1-second delay. This delay is inserted by scancel(). The reason it keeps hitting scancel() is that the mocked squeue keeps reporting that the slurm job is still running.
I think the correct way to get rid of those messages would be to fix the squeue mock, so it mocks the behavior we're hoping to test against, instead of a slurm failure mode that makes us (correctly) produce a bunch of debug logs.
Updated by Tom Clegg over 7 years ago
oops, just noticed
log.Printf(text)
should be
log.Print(text)
Other than that, LGTM @ f8675ad473b45387b1286c6b7a41edf36148ebac
(as tempting as it is to ask for an error check on disp.Arv.Create() so we can log "error logging error: ...")
Updated by Tom Clegg over 7 years ago
I missed that the example logs above are from a real site, not the test suite. So the mystery is why scancel wasn't effective.
AFAICT crunch-dispatch-slurm was doing the right thing in the circumstances: keep hitting scancel, with a ≥1 second delay between attempts, until the job goes away.
In any case it seems like a separate topic from "propagate slurm errors to user", because at this point the container is already done and the requesting user probably has little interest in (and ability to fix) slurm problems that happen later. So I think if this scancel behavior needs attention, we should open a new issue. discuss further on #11669.
Updated by Peter Amstutz over 7 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:a309f8c5b4842075d6c83f99a8f2a1e1016976f5.