Bug #9435: keep-balance failing when a small page of collections has the same modified_at timestamp - Arvados

Actions

Copy link

Bug #9435

closed

keep-balance failing when a small page of collections has the same modified_at timestamp

Added by Joshua Randall almost 10 years ago. Updated about 9 years ago.

Status:

Resolved

Priority:

Normal

Assigned To:

Category:

Keep

Target version:

Story points:

Description

I got an error from keep-balance:

2016/06/18 10:55:31 collections: 3253996/3273174
2016/06/18 10:57:02 collections: 3278996/3273174
2016/06/18 10:58:31 collections: 3303996/3273174
2016/06/18 11:00:00 collections: 3328996/3273174
2016/06/18 11:01:39 collections: 3353996/3273174
2016/06/18 11:03:11 collections: 3378996/3273174
2016/06/18 11:04:44 collections: 3403996/3273174
2016/06/18 11:06:09 collections: 3428996/3273174
2016/06/18 11:07:42 collections: 3453996/3273174
2016/06/18 11:09:09 collections: 3478996/3273174
2016/06/18 11:09:48 collections: 3490827/3273174
2016/06/18 11:09:52 collections: 3491730/3273174
2016/06/18 11:09:52 collections: 3491822/3273174
2016/06/18 11:09:55 collections: 3491824/3273174
2016/06/18 11:09:55 collections: 3491901/3273174
2016/06/18 11:09:55 collections: 3491904/3273174
2016/06/18 11:09:55 GetCurrentState: took 3h32m5.093893905s
2016/06/18 11:09:55 Run: took 3h32m5.150470067s
2016/06/18 11:09:55 run failed: BUG: Received an entire page with the same modified_at timestamp (2016-06-18 10:09:55.410114 +0000 UTC), cannot make
 progress

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by Tom Clegg almost 10 years ago

It looks like the last page received had 3 items, and the repeated timestamp is very new, so presumably the reason we got 3 items was that there are no more recent ones to get.

I think this means we need an exception to that starvation check: it's perfectly fine to receive a page of identical timestamps if we don't need to get the next page (i.e., if there is no next page).

Actions

Copy link

Updated by Tom Clegg almost 10 years ago

Tom Clegg wrote:

it's perfectly fine to receive a page of identical timestamps if we don't need to get the next page (i.e., if there is no next page).

We already have this check:

        if last.ModifiedAt == nil || *last.ModifiedAt == filterTime {
            if page.ItemsAvailable > len(page.Items) {
                // TODO: use "mtime=X && UUID>Y" 
                // filters to get all collections with
                // this timestamp, then use "mtime>X" 
                // to get the next timestamp.
                return fmt.Errorf("BUG: Received an entire page with the same modified_at timestamp (%v), cannot make progress", filterTime)
            }
            break
        }

I wonder if this is a race on the API server, when a collection is added or modified between the "get items" query and the "count items for items_available" query, so they don't reflect the same database state.

Actions

Copy link

Updated by Tom Clegg almost 10 years ago

See https://www.postgresql.org/docs/9.1/static/transaction-iso.html → 13.2.2. Repeatable Read Isolation Level

Using "repeatable read" for every API transaction sounds like it would be nice and safe, but the locking is optimistic, so we would need auto-retry and associated safeguards. OTOH, using it for "index" should be relatively unobtrusive.

"Applications using this level must be prepared to retry transactions due to serialization failures."
"Note that only updating transactions might need to be retried; read-only transactions will never have serialization conflicts."

Actions

Copy link

Updated by Tom Clegg almost 10 years ago

We should also implement the "mtime=X and UUID>Y" fallback described in the comment, like we did in datamanager. This is useful even with the transaction behavior fixed: it will still be possible to get a full page of results with identical timestamps.

Actions

Copy link

Updated by Tom Clegg about 9 years ago

Status changed from New to Resolved

In #9998 keep-balance started using count="none" and accommodating multiple pages of identical timestamps, so this shouldn't affect keep-balance any more. #9496 remains.

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Arvados

Custom queries

Bug #9435

keep-balance failing when a small page of collections has the same modified_at timestamp

Updated by Tom Clegg almost 10 years ago

Updated by Tom Clegg almost 10 years ago

Updated by Tom Clegg almost 10 years ago

Updated by Tom Clegg almost 10 years ago

Updated by Tom Clegg about 9 years ago