Bug #23382
closedPrevent unnecessary row locking on final containers
Description
We have evidence that when a completed container is reused, we go through row_lock_for_priority_update at least once. This can start getting very expensive when the same container is reused many (100K+) times.
Avoid that: make it so we do not call row_lock_for_priority_update when the associated container is Complete or Cancelled. In these cases, there is no point to propagating a priority change, and therefore there is no need to lock all these rows.
Thinking it all the way through: this strategy does mean that we'll still do potentially a lot of locking if the same container is reused many times while it's still running. However, we have evidence that performance is acceptable when a container is reused ~10K times. So, today, getting to this point would require having more than 10K workflow runners running. I'm not aware we have any users today who have MaxInstances * SupervisorFraction > 10_000, so that's tomorrow's problem.
Updated by Brett Smith 3 months ago
- Description updated (diff)
- Subject changed from Prevent unnecessary row locking on container reuse to Prevent unnecessary row locking on final containers
Updated by Tom Clegg 3 months ago
23382-optimize-container-reuse @ 367af1bd07bcc89973bfd4e6a3a3d99a1c17e418 -- developer-run-tests: #4983
Updated by Brett Smith 3 months ago
Tom Clegg wrote in #note-4:
23382-optimize-container-reuse @ 367af1bd07bcc89973bfd4e6a3a3d99a1c17e418 -- developer-run-tests: #4983
Assuming tests pass, LGTM. My one nit-suggestion is, y'know, I always like it when regexps have anchors at the boundary (at least \b) to avoid matching superstrings. But I concede that's an awfully remote concern for these particular regexps.
Thank you especially for the comments in the test, I think that's helpful to clarify what's actually being tested vs. what's sort of checking test integrity.
Updated by Tom Clegg 3 months ago
23382-optimize-container-reuse @ 06a7fda1028f9d3379c602b1b45f7aa2900a59e3 -- developer-run-tests: #4984
Fixes the "update container request with dangling container UUID" case.
Updated by Tom Clegg 3 months ago
There are also two places where we explicitly row-lock a container when selecting it for reuse, to avoid leaving the container un-finalized after a race between container finalization and container reuse. These row locks are unnecessary when the container is already finalized before being selected for reuse.
The row lock statements in the Rails logs look identical to the statements we've seen taking multiple minutes in production when a single container is reused by lots of concurrent workflows.
SELECT "containers".* FROM "containers" WHERE "containers"."id" = $1 LIMIT $2 FOR UPDATE [["id", 1048559796], ["LIMIT", 1]]
23382-optimize-container-reuse @ d7a6e71bd1cc8a3754abcadfc93d5f1901fcedb7 -- developer-run-tests: #4990
Updated by Lucas Di Pentima 3 months ago
Tom Clegg wrote in #note-7:
23382-optimize-container-reuse @ d7a6e71bd1cc8a3754abcadfc93d5f1901fcedb7 -- developer-run-tests: #4990
This LGTM, thank you!
Updated by Tom Clegg 3 months ago
cherry-picked to 3.2-staging, building 3.2.1~dev20251224022200-1 packages at build-and-publish-rc-packages: #292
Updated by Brett Smith 3 months ago
- Status changed from In Progress to Resolved