Bug #14115
openc-d-s reniceAll takes a very long time and holds up SqueueChecker updates and initial job submission
0%
Description
reniceAll is taking many many times longer than the PollPeriod. see this partial c-d-s log output with added debug prints:
Aug 23 22:50:00 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:00 SqueueChecker.check(): calling sqc.notify.Broadcast() Aug 23 22:50:00 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:00 SqueueChecker.start goroutine loop: check() took 2.149983887s Aug 23 22:50:00 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:00 SqueueChecker.start goroutine loop: reniceAll() took 1.16567ms Aug 23 22:50:10 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:10 SqueueChecker.check(): calling sqc.notify.Broadcast() Aug 23 22:50:10 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:10 SqueueChecker.start goroutine loop: check() took 2.160180571s Aug 23 22:50:28 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:28 SqueueChecker.start goroutine loop: reniceAll() took 17.809860522s Aug 23 22:50:28 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:28 SqueueChecker.start goroutine loop: consuming ticker without update Aug 23 22:50:29 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:29 SqueueChecker.check(): calling sqc.notify.Broadcast() Aug 23 22:50:29 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:29 SqueueChecker.start goroutine loop: check() took 1.109416296s Aug 23 22:54:16 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:54:16 SqueueChecker.start goroutine loop: reniceAll() took 3m46.741228235s Aug 23 22:54:16 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:54:16 SqueueChecker.start goroutine loop: consuming ticker without update Aug 23 22:54:18 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:54:18 SqueueChecker.check(): calling sqc.notify.Broadcast() Aug 23 22:54:19 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:54:18 SqueueChecker.start goroutine loop: check() took 391.544338ms
The comment at the top of reniceAll() admits that it is slow:
// This is slow (it shells out to scontrol many times) and no // other goroutines update sqc.queue or any of the job fields // we use here, so we don't acquire a lock.
The comment was introduced recently along with removing the explicit locking in https://github.com/curoverse/arvados/commit/23574f4d4039641f9a9f624d3b16b64148de3826
The commit message there is: "13399: Don't block job submissions during release/renice batches." and although I cannot view issue 13399 the message suggests this is an attempt to stop reniceAll from blocking submission of jobs to SLURM. However, reniceAll still blocks the SqueueChecker main loop goroutine from getting to the next iteration of check, which in turn blocks the sqc.notify.BroadCast, so all goroutines that enter sqc.notify.Wait() while reniceAll is running will be blocked until it completes (and the next check completes). I would note that includes HasUUID() which is called at the beginning of runContainer, so it will delay the submission of the container until reniceAll returns.