Bug #22824
closedkeep-web IntegrationSuite.TestConcurrentWrites can fail with 409 Conflict
Description
FAIL: handler_test.go:2329: IntegrationSuite.TestConcurrentWrites
IntegrationSuite.TestConcurrentWrites: n=2
IntegrationSuite.TestConcurrentWrites: n=4
handler_test.go:2367:
c.Check(resp.StatusCode, check.Equals, http.StatusCreated, check.Commentf("%s", content))
... obtained int = 409
... expected int = 201
... i=1/j=3
handler_test.go:2367:
c.Check(resp.StatusCode, check.Equals, http.StatusCreated, check.Commentf("%s", content))
... obtained int = 409
... expected int = 201
... i=1/j=0
handler_test.go:2372:
c.Check(resp.StatusCode, check.Equals, http.StatusOK, check.Commentf("%s", content))
... obtained int = 404
... expected int = 200
... i=1/j=3
handler_test.go:2374:
c.Check(strings.TrimSuffix(string(body), filler), check.Equals, content)
... obtained string = "Not Found"
... expected string = "i=1/j=3"
handler_test.go:2372:
c.Check(resp.StatusCode, check.Equals, http.StatusOK, check.Commentf("%s", content))
... obtained int = 404
... expected int = 200
... i=1/j=0
handler_test.go:2374:
c.Check(strings.TrimSuffix(string(body), filler), check.Equals, content)
... obtained string = "Not Found"
... expected string = "i=1/j=0"
Full log attached.
Files
Updated by Brett Smith 6 months ago
- Release set to 82
- Target version deleted (
Future)
Seen again last night.
Updated by Brett Smith 6 months ago
- Subject changed from IntegrationSuite.TestConcurrentWrites can fail with 409 Conflict to keep-web IntegrationSuite.TestConcurrentWrites can fail with 409 Conflict
Updated by Brett Smith 5 months ago
- Target version set to Development 2025-10-29
- Assigned To set to Tom Clegg
Updated by Tom Clegg 5 months ago
- MKCOL "i=1", then PUT "i=1/j=5"
- MKCOL "i=2", then PUT "i=2/j=5"
- MKCOL "i=3", then PUT "i=3/j=5"
- MKCOL "i=4", then PUT "i=4/j=5"
- Concurrent Sync() operations make overlapping "get latest version" API calls
- Those calls are handled out-of-order in Rails, and return different versions (because the collection was also being modified)
- When handling
PUT "i=4/j=5",needSync()indicates the cache is up-to-date, but a moment later when attempting to write the file, the cache has reverted to an earlier version from beforeMKCOL "i=4"was applied, so it fails.
- collection is in initial state "v0"
- MKCOL "i=4": determine collection cache is up-to-date (forceReload==false)
- MKCOL "i=1": mkdir, update database (via controller->rails), resulting collection is "v1"
- MKCOL "i=2": determine collection cache is not up-to-date (forceReload==true)
- MKCOL "i=2": call Sync(), request latest version from controller->rails
- MKCOL "i=3": determine collection cache is not up-to-date (forceReload==true)
- MKCOL "i=3": call Sync(), request latest version from controller->rails
- MKCOL "i=4": mkdir in cache
- MKCOL "i=4": mkdir in database (via controller->rails), resulting collection is "v2"
- MKCOL "i=4": respond 201
- MKCOL "i=2": receive "latest version" response, update cache to "v2"
- PUT "i=4/j=5": determine collection cache is up-to-date (forceReload==false) at "v2"
- MKCOL "i=3": receive "latest version" response, update cache to "v1"
- PUT "i=4/j=5": attempt to write file, fail because dir "i=4" does not exist
- Use
modified_atto avoid reverting to an older version during Sync - Aside: propagate x-request-id from the keep-web request through to controller when calling the collection update API
I don't have a [great idea for a] test that can expose this consistently. Currently running many test reps. Without this fix, the slightly modified TestConcurrentWrites test on 22824-debug failed 4 of 200 runs.
Updated by Brett Smith 5 months ago
- Target version changed from Development 2025-10-29 to Development 2025-11-12
Updated by Tom Clegg 5 months ago
Reverted "Propagate X-Request-Id from keep-web to controller." It turns out the only reason (*IntegrationSuite)TestRequestTimeoutExemption works is that we're not propagating context to controller.
22824-keep-web-conflict @ f4c3c601d2796ca3bb496af20e3ee285f7c2a828 -- developer-run-tests: #4928
The 22824-debug branch (with the 22824-keep-web-conflict fix, plus more debug output) failed 1% (2 of 200 runs), which is better than the previous 2% (4 of 200 runs) but indicates there is another race bug lurking.
I think we should merge this fix in the meantime.
Updated by Tom Clegg 5 months ago
- Status changed from New to In Progress
- File concurrent-writes-fail-a62a533020-1.txt concurrent-writes-fail-a62a533020-1.txt added
- File concurrent-writes-fail-a62a533020-2.txt concurrent-writes-fail-a62a533020-2.txt added
Updated by Brett Smith 5 months ago
Tom Clegg wrote in #note-12:
22824-keep-web-conflict @ f4c3c601d2796ca3bb496af20e3ee285f7c2a828 -- developer-run-tests: #4928
The 22824-debug branch (with the 22824-keep-web-conflict fix, plus more debug output) failed 1% (2 of 200 runs), which is better than the previous 2% (4 of 200 runs) but indicates there is another race bug lurking.
I think we should merge this fix in the meantime.
Agreed, this LGTM, thanks.
Updated by Brett Smith 4 months ago
- Target version changed from Development 2025-11-12 to Development 2025-12-10
Updated by Tom Clegg 4 months ago
22824-fix-concurrent-sync @ d3069586e45dea1c833a5a945d52a2808d001ef2 -- developer-run-tests: #4965
keep-web's uses (*collectionFileSystem)Sync() to fetch the latest collection version from controller after it detects that the local version is out of date.
Sync is, roughly:- compute manifest and PDH of local content
- compare PDH to the most recently loaded/saved version
- if they are not equal, save computed manifest to controller
- if they are equal (no local changes), load current manifest from controller
- Sync "A" and Sync "B" compute the same manifest and PDH from local content
- Sync "A" determines that currentPDH = savedPDH and loads current content from controller
- Sync "B" determines that currentPDH != savedPDH (because Sync "A" just updated savedPDH) and writes the now-outdated local manifest (computed in step 1) to controller
- another Sync has loaded a new version from controller and there are no local changes, as in the keep-web case, so the only correct behavior is to return without saving local changes, or
- another Sync has saved local changes, but those changes might not include all of the changes that existed when this Sync was called, so the only correct behavior is to save local changes.
In principle this could cause starvation (Sync could retry on each iteration for a long time) but in practice, even the test case trying to exercise this case has only been able to hit it very rarely.
With this fix in place, the test case previously failing 2/100 attempts has passed 300/300 attempts.
The other commit on this branch fixes the X-Request-Id propagation issue again, but this time with a fix for the timeout issue mentioned in #note-12.
Updated by Brett Smith 3 months ago
- Target version changed from Development 2025-12-10 to Development 2026-01-06
Updated by Lisa Knox 3 months ago
I don't like the idea of a never-ending (in theory) loop, even if it always ends in practice. If we could figure out how many retries occur on average, we could set a limit of that*1000, just so there is some limit there. I doubt that's practical though, given that we would have to find an average of an event that happens randomly 1% of the time. I'd just feel safer with some limit than none.
Updated by Tom Clegg 3 months ago
That's fair. Added a comment about worst-case behavior, and a failsafe to give up after 1000 attempts since that would indicate something is happening that we don't anticipate (1 is typical, 2 happens in about 2% of TestConcurrentWrites runs).
22824-fix-concurrent-sync @ 16a55aa63d974df717580b55df7790b0bbf8d3ed -- developer-run-tests: #4979
Updated by Tom Clegg 3 months ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|5cda31e0978bfdcbabea678ed8e6960ae72bb799.