Bug #13933
crunch-dispatch-slurm / Go SDK Dispatcher performs poorly in the presence of a large backlog
Status: closed (100% done)
Description
When there is a large backlog of queued containers, crunch-dispatch-slurm takes a long time to process them, and can be very sensitive to (even transient) API server issues.
For example, we currently have ~37000 containers in state Queued with priority > 0. crunch-dispatch-slurm requests these from the API server in batches of 100, and on every trip through the loop it also (wastefully) asks for the count of items available, a query that takes the database roughly as long as fetching the batch itself. The API server additionally makes a pre-flight check of the size of the mounts field in the container records, so fulfilling each batch of 100 ends up querying the database three times with the same conditions (but different select values), at roughly 3x the time cost. Changing the dispatcher code so that it makes a single `limit: 0` count request at the beginning and then uses `count: none` in each loop iteration improves performance significantly (see the sketch below). Changing the API server so that it skips the mounts size check when the limit is already at the minimum (100 seems to be the minimum?) could yield an additional ~50% speedup on these queries.
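A minimal sketch of the suggested query pattern, assuming the Go SDK's `arvadosclient.ArvadosClient.List` and `arvados.ContainerList` types as they exist in the tree; the `fetchQueuedContainers` helper, the filter values, the `order` parameter, and the import paths are illustrative, not the dispatcher's actual code:

```go
package example

import (
	"log"

	"git.curoverse.com/arvados.git/sdk/go/arvados"
	"git.curoverse.com/arvados.git/sdk/go/arvadosclient"
)

// fetchQueuedContainers illustrates the suggested pattern: one "limit: 0" request up
// front to learn how many containers match (the only time the database has to run the
// expensive count), then "count: none" on every subsequent page so each batch costs a
// single query.
func fetchQueuedContainers(arv *arvadosclient.ArvadosClient) ([]arvados.Container, error) {
	filters := [][]interface{}{
		{"state", "=", "Queued"},
		{"priority", ">", 0},
	}

	// One count-only request: limit 0 returns items_available but no items.
	var countList arvados.ContainerList
	err := arv.List("containers", arvadosclient.Dict{
		"filters": filters,
		"limit":   0,
	}, &countList)
	if err != nil {
		return nil, err
	}

	// Page through the results with counting disabled.
	containers := make([]arvados.Container, 0, countList.ItemsAvailable)
	for len(containers) < countList.ItemsAvailable {
		var page arvados.ContainerList
		err := arv.List("containers", arvadosclient.Dict{
			"filters": filters,
			"count":   "none",
			"order":   []string{"uuid"}, // stable order so offset paging doesn't skip rows
			"limit":   100,
			"offset":  len(containers),
		}, &page)
		if err != nil {
			return nil, err
		}
		if len(page.Items) == 0 {
			break
		}
		containers = append(containers, page.Items...)
	}
	log.Printf("retrieved %d of %d queued containers", len(containers), countList.ItemsAvailable)
	return containers, nil
}
```

In the dispatcher itself the pages would be fed to the existing tracking logic rather than accumulated into a slice; the point here is only the request-parameter shape.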
If any one of the (in our case ~370) batched list requests to the API server fails for any reason, crunch-dispatch-slurm (really the Go SDK Dispatcher) gives up and starts again from the beginning (N.B. it doesn't even log a warning in this situation, for some reason). The code path is that checkForUpdates returns false at https://github.com/curoverse/arvados/blob/master/sdk/go/dispatch/dispatch.go#L172, which then triggers the `if !querySuccess { continue }` block at either https://github.com/curoverse/arvados/blob/master/sdk/go/dispatch/dispatch.go#L100 or https://github.com/curoverse/arvados/blob/master/sdk/go/dispatch/dispatch.go#L128. In an environment with a large backlog and a nonzero API server error rate, this makes it difficult to ever reach the later stages of the Run() function. I don't have a complete solution to suggest, but at a minimum it would be helpful if both of those continue blocks logged a message indicating that not all containers were retrieved from the API server successfully, so that operators have a chance to notice the problem (sketched below).
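To make the failure mode concrete, here is a small self-contained sketch of the shape of the polling loop; `pollOnce` and `runLoop` are hypothetical stand-ins for checkForUpdates and Run (the real Run() has two such loops, hence the two continue blocks linked above), and only the `log.Printf` line is the suggested change:

```go
package main

import (
	"log"
	"time"
)

// pollOnce stands in for Dispatcher.checkForUpdates: it returns false if any of the
// batched container list requests failed, i.e. the view of the queue is incomplete.
func pollOnce() bool {
	// ... issue the ~370 batched list requests here ...
	return true
}

func runLoop(poll <-chan time.Time) {
	for range poll {
		querySuccess := pollOnce()
		if !querySuccess {
			// Suggested addition: surface the partial failure instead of silently
			// starting over from the beginning on the next tick.
			log.Printf("dispatch: not all queued containers were retrieved from the API server; retrying on next poll")
			continue
		}
		// ... later stages of Run(): start/track/clean up containers ...
	}
}

func main() {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	runLoop(ticker.C)
}
```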