Project

General

Profile

Actions

Bug #22824

closed

keep-web IntegrationSuite.TestConcurrentWrites can fail with 409 Conflict

Added by Brett Smith 11 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Tests
Target version:
Story points:
-
Release relationship:
Auto

Description

FAIL: handler_test.go:2329: IntegrationSuite.TestConcurrentWrites

IntegrationSuite.TestConcurrentWrites: n=2
IntegrationSuite.TestConcurrentWrites: n=4
handler_test.go:2367:
    c.Check(resp.StatusCode, check.Equals, http.StatusCreated, check.Commentf("%s", content))
... obtained int = 409
... expected int = 201
... i=1/j=3

handler_test.go:2367:
    c.Check(resp.StatusCode, check.Equals, http.StatusCreated, check.Commentf("%s", content))
... obtained int = 409
... expected int = 201
... i=1/j=0

handler_test.go:2372:
    c.Check(resp.StatusCode, check.Equals, http.StatusOK, check.Commentf("%s", content))
... obtained int = 404
... expected int = 200
... i=1/j=3

handler_test.go:2374:
    c.Check(strings.TrimSuffix(string(body), filler), check.Equals, content)
... obtained string = "Not Found" 
... expected string = "i=1/j=3" 

handler_test.go:2372:
    c.Check(resp.StatusCode, check.Equals, http.StatusOK, check.Commentf("%s", content))
... obtained int = 404
... expected int = 200
... i=1/j=0

handler_test.go:2374:
    c.Check(strings.TrimSuffix(string(body), filler), check.Equals, content)
... obtained string = "Not Found" 
... expected string = "i=1/j=0" 

Full log attached.


Files


Subtasks 2 (0 open2 closed)

Task #23224: Review 22824-keep-web-conflictResolvedTom Clegg10/30/2025Actions
Task #23354: Review 22824-fix-concurrent-syncResolvedTom Clegg12/15/2025Actions
Actions #1

Updated by Peter Amstutz 10 months ago

  • Target version set to Development 2025-07-09
Actions #2

Updated by Peter Amstutz 10 months ago

  • Target version changed from Development 2025-07-09 to Development 2025-07-23
Actions #3

Updated by Peter Amstutz 10 months ago

  • Target version changed from Development 2025-07-23 to Future
Actions #4

Updated by Brett Smith 6 months ago

  • Release set to 82
  • Target version deleted (Future)

Seen again last night.

Actions #5

Updated by Brett Smith 6 months ago

  • Subject changed from IntegrationSuite.TestConcurrentWrites can fail with 409 Conflict to keep-web IntegrationSuite.TestConcurrentWrites can fail with 409 Conflict
Actions #6

Updated by Brett Smith 5 months ago

  • Target version set to Development 2025-10-29
  • Assigned To set to Tom Clegg
Actions #7

Updated by Brett Smith 5 months ago

  • Subtask #23224 added
Actions #9

Updated by Tom Clegg 5 months ago

The test checks that the following sequences can happen concurrently without interfering with one another:
  1. MKCOL "i=1", then PUT "i=1/j=5"
  2. MKCOL "i=2", then PUT "i=2/j=5"
  3. MKCOL "i=3", then PUT "i=3/j=5"
  4. MKCOL "i=4", then PUT "i=4/j=5"
There is a problem when:
  • Concurrent Sync() operations make overlapping "get latest version" API calls
  • Those calls are handled out-of-order in Rails, and return different versions (because the collection was also being modified)
  • When handling PUT "i=4/j=5", needSync() indicates the cache is up-to-date, but a moment later when attempting to write the file, the cache has reverted to an earlier version from before MKCOL "i=4" was applied, so it fails.
Example caught in a test run:
  1. collection is in initial state "v0"
  2. MKCOL "i=4": determine collection cache is up-to-date (forceReload==false)
  3. MKCOL "i=1": mkdir, update database (via controller->rails), resulting collection is "v1"
  4. MKCOL "i=2": determine collection cache is not up-to-date (forceReload==true)
  5. MKCOL "i=2": call Sync(), request latest version from controller->rails
  6. MKCOL "i=3": determine collection cache is not up-to-date (forceReload==true)
  7. MKCOL "i=3": call Sync(), request latest version from controller->rails
  8. MKCOL "i=4": mkdir in cache
  9. MKCOL "i=4": mkdir in database (via controller->rails), resulting collection is "v2"
  10. MKCOL "i=4": respond 201
  11. MKCOL "i=2": receive "latest version" response, update cache to "v2"
  12. PUT "i=4/j=5": determine collection cache is up-to-date (forceReload==false) at "v2"
  13. MKCOL "i=3": receive "latest version" response, update cache to "v1"
  14. PUT "i=4/j=5": attempt to write file, fail because dir "i=4" does not exist
22824-keep-web-conflict @ b797206c8ae261f5657d6f655ccbcff50907871d -- developer-run-tests: #4927
  • Use modified_at to avoid reverting to an older version during Sync
  • Aside: propagate x-request-id from the keep-web request through to controller when calling the collection update API

I don't have a [great idea for a] test that can expose this consistently. Currently running many test reps. Without this fix, the slightly modified TestConcurrentWrites test on 22824-debug failed 4 of 200 runs.

Actions #11

Updated by Brett Smith 5 months ago

  • Target version changed from Development 2025-10-29 to Development 2025-11-12
Actions #12

Updated by Tom Clegg 5 months ago

Reverted "Propagate X-Request-Id from keep-web to controller." It turns out the only reason (*IntegrationSuite)TestRequestTimeoutExemption works is that we're not propagating context to controller.

22824-keep-web-conflict @ f4c3c601d2796ca3bb496af20e3ee285f7c2a828 -- developer-run-tests: #4928

The 22824-debug branch (with the 22824-keep-web-conflict fix, plus more debug output) failed 1% (2 of 200 runs), which is better than the previous 2% (4 of 200 runs) but indicates there is another race bug lurking.

I think we should merge this fix in the meantime.

Actions #14

Updated by Brett Smith 5 months ago

Tom Clegg wrote in #note-12:

22824-keep-web-conflict @ f4c3c601d2796ca3bb496af20e3ee285f7c2a828 -- developer-run-tests: #4928

The 22824-debug branch (with the 22824-keep-web-conflict fix, plus more debug output) failed 1% (2 of 200 runs), which is better than the previous 2% (4 of 200 runs) but indicates there is another race bug lurking.

I think we should merge this fix in the meantime.

Agreed, this LGTM, thanks.

Actions #15

Updated by Brett Smith 4 months ago

  • Target version changed from Development 2025-11-12 to Development 2025-12-10
Actions #16

Updated by Tom Clegg 4 months ago

22824-fix-concurrent-sync @ d3069586e45dea1c833a5a945d52a2808d001ef2 -- developer-run-tests: #4965

keep-web's uses (*collectionFileSystem)Sync() to fetch the latest collection version from controller after it detects that the local version is out of date.

Sync is, roughly:
  1. compute manifest and PDH of local content
  2. compare PDH to the most recently loaded/saved version
  3. if they are not equal, save computed manifest to controller
  4. if they are equal (no local changes), load current manifest from controller
This can fail with concurrent Sync calls:
  1. Sync "A" and Sync "B" compute the same manifest and PDH from local content
  2. Sync "A" determines that currentPDH = savedPDH and loads current content from controller
  3. Sync "B" determines that currentPDH != savedPDH (because Sync "A" just updated savedPDH) and writes the now-outdated local manifest (computed in step 1) to controller
The fix is just to check for this race. If Sync detects local changes, but savedPDH has changed since starting to compute the manifest of local content, that means the test for local changes was inconclusive, so it needs to retry. The test was inconclusive because either
  • another Sync has loaded a new version from controller and there are no local changes, as in the keep-web case, so the only correct behavior is to return without saving local changes, or
  • another Sync has saved local changes, but those changes might not include all of the changes that existed when this Sync was called, so the only correct behavior is to save local changes.

In principle this could cause starvation (Sync could retry on each iteration for a long time) but in practice, even the test case trying to exercise this case has only been able to hit it very rarely.

With this fix in place, the test case previously failing 2/100 attempts has passed 300/300 attempts.

The other commit on this branch fixes the X-Request-Id propagation issue again, but this time with a fix for the timeout issue mentioned in #note-12.

Actions #17

Updated by Tom Clegg 4 months ago

  • Subtask #23354 added
Actions #18

Updated by Brett Smith 3 months ago

  • Target version changed from Development 2025-12-10 to Development 2026-01-06
Actions #19

Updated by Lisa Knox 3 months ago

I don't like the idea of a never-ending (in theory) loop, even if it always ends in practice. If we could figure out how many retries occur on average, we could set a limit of that*1000, just so there is some limit there. I doubt that's practical though, given that we would have to find an average of an event that happens randomly 1% of the time. I'd just feel safer with some limit than none.

Actions #20

Updated by Tom Clegg 3 months ago

That's fair. Added a comment about worst-case behavior, and a failsafe to give up after 1000 attempts since that would indicate something is happening that we don't anticipate (1 is typical, 2 happens in about 2% of TestConcurrentWrites runs).

22824-fix-concurrent-sync @ 16a55aa63d974df717580b55df7790b0bbf8d3ed -- developer-run-tests: #4979

Actions #21

Updated by Lisa Knox 3 months ago

LGTM

Actions #22

Updated by Tom Clegg 3 months ago

  • Status changed from In Progress to Resolved
Actions #23

Updated by Brett Smith about 2 months ago

  • Release changed from 82 to 84
Actions

Also available in: Atom PDF