Bug #9435
closedkeep-balance failing when a small page of collections has the same modified_at timestamp
Description
I got an error from keep-balance:
2016/06/18 10:55:31 collections: 3253996/3273174 2016/06/18 10:57:02 collections: 3278996/3273174 2016/06/18 10:58:31 collections: 3303996/3273174 2016/06/18 11:00:00 collections: 3328996/3273174 2016/06/18 11:01:39 collections: 3353996/3273174 2016/06/18 11:03:11 collections: 3378996/3273174 2016/06/18 11:04:44 collections: 3403996/3273174 2016/06/18 11:06:09 collections: 3428996/3273174 2016/06/18 11:07:42 collections: 3453996/3273174 2016/06/18 11:09:09 collections: 3478996/3273174 2016/06/18 11:09:48 collections: 3490827/3273174 2016/06/18 11:09:52 collections: 3491730/3273174 2016/06/18 11:09:52 collections: 3491822/3273174 2016/06/18 11:09:55 collections: 3491824/3273174 2016/06/18 11:09:55 collections: 3491901/3273174 2016/06/18 11:09:55 collections: 3491904/3273174 2016/06/18 11:09:55 GetCurrentState: took 3h32m5.093893905s 2016/06/18 11:09:55 Run: took 3h32m5.150470067s 2016/06/18 11:09:55 run failed: BUG: Received an entire page with the same modified_at timestamp (2016-06-18 10:09:55.410114 +0000 UTC), cannot make progress
Updated by Tom Clegg almost 10 years ago
It looks like the last page received had 3 items, and the repeated timestamp is very new, so presumably the reason we got 3 items was that there are no more recent ones to get.
I think this means we need an exception to that starvation check: it's perfectly fine to receive a page of identical timestamps if we don't need to get the next page (i.e., if there is no next page).
Updated by Tom Clegg almost 10 years ago
Tom Clegg wrote:
it's perfectly fine to receive a page of identical timestamps if we don't need to get the next page (i.e., if there is no next page).
We already have this check:
if last.ModifiedAt == nil || *last.ModifiedAt == filterTime {
if page.ItemsAvailable > len(page.Items) {
// TODO: use "mtime=X && UUID>Y"
// filters to get all collections with
// this timestamp, then use "mtime>X"
// to get the next timestamp.
return fmt.Errorf("BUG: Received an entire page with the same modified_at timestamp (%v), cannot make progress", filterTime)
}
break
}
I wonder if this is a race on the API server, when a collection is added or modified between the "get items" query and the "count items for items_available" query, so they don't reflect the same database state.
Updated by Tom Clegg almost 10 years ago
See https://www.postgresql.org/docs/9.1/static/transaction-iso.html → 13.2.2. Repeatable Read Isolation Level
Using "repeatable read" for every API transaction sounds like it would be nice and safe, but the locking is optimistic, so we would need auto-retry and associated safeguards. OTOH, using it for "index" should be relatively unobtrusive.- "Applications using this level must be prepared to retry transactions due to serialization failures."
- "Note that only updating transactions might need to be retried; read-only transactions will never have serialization conflicts."
Updated by Tom Clegg almost 10 years ago
We should also implement the "mtime=X and UUID>Y" fallback described in the comment, like we did in datamanager. This is useful even with the transaction behavior fixed: it will still be possible to get a full page of results with identical timestamps.