Bug #8483 (closed): datamanager fails if >=50 collections have the same modified_at timestamp

Added by Joshua Randall about 10 years ago. Updated about 10 years ago.

Status: Resolved
Priority: Normal
Assigned To: -
Category: Keep
Target version: -
Story points: -

Description

The loop in GetCollections() that queries the API server for collection data uses a filter on the modified_at timestamp to page through results (presumably so that we won't miss any collections that are modified while the loop is running, which makes sense).

Unfortunately, this means if one has a number of collections with identical modified_at timestamps, the loop will exit early because it thinks it has finished. This failure mode is potentially catastrophic, as all the subsequent collections are missed and all of their associated blocks can be marked for deletion as if they were orphaned fragments.
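The failure mode can be reproduced with a minimal sketch in Go. This is illustrative, not the real datamanager code: the types, the in-memory `fetch` helper, and the `flawedGetCollections` name are all assumptions standing in for the actual API paging loop. The key point is that the loop stops as soon as a batch adds no new UUIDs, which happens whenever a full batch shares one timestamp.

```go
package main

import "fmt"

// Collection is a minimal stand-in for an API collection record
// (illustrative, not the real Arvados schema).
type Collection struct {
	UUID       string
	ModifiedAt int64
}

// flawedGetCollections reproduces the early-exit bug: it pages using only a
// modified_at >= filter and stops as soon as a batch adds no new UUIDs.
func flawedGetCollections(all []Collection, batchSize int) map[string]bool {
	// fetch simulates the API: up to batchSize records with ModifiedAt >= since.
	fetch := func(since int64) []Collection {
		var out []Collection
		for _, c := range all {
			if c.ModifiedAt >= since && len(out) < batchSize {
				out = append(out, c)
			}
		}
		return out
	}
	seen := map[string]bool{}
	since := int64(0)
	for {
		batch := fetch(since)
		grew := false
		for _, c := range batch {
			if !seen[c.UUID] {
				seen[c.UUID] = true
				grew = true
			}
		}
		if !grew {
			return seen // exits early: a full batch of identical timestamps repeats forever
		}
		since = batch[len(batch)-1].ModifiedAt
	}
}

func main() {
	// Three collections share modified_at=100 and batchSize is also 3, so the
	// second query returns the same batch and the loop gives up.
	all := []Collection{
		{"a", 100}, {"b", 100}, {"c", 100}, {"d", 200}, {"e", 300},
	}
	seen := flawedGetCollections(all, 3)
	fmt.Println(len(seen)) // 3: collections "d" and "e" were never retrieved
}
```

With the real BatchSize of 50, any group of 50 or more collections sharing a timestamp triggers the same silent truncation.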

I would propose two fixes:

1. Sanity-check the number of collections retrieved after exiting the query loop in GetCollections. It should in all cases be >= initialNumberOfCollectionsAvailable. If not, GetCollections should return an error so that datamanager won't make bad decisions based on partial information.

2. Fix the loop/query code so that an offset is used for paging for records with identical modified_at values. If the first and last record of a batch have identical modified_at timestamps, increment the offset by BatchSize. As long as that continues to be the case, keep incrementing the offset by BatchSize. Once there is any range in modified_at time in the batch results, set offset back to 0.


Subtasks (2): 0 open, 2 closed

Task #8484: Sanity check the number of collections successfully retrieved before returning from GetCollection (datamanager) (Resolved, Joshua Randall, 02/17/2016)
Task #8485: Fix GetCollections (datamanager) so that it can reliably retrieve all collections, regardless of the distribution of modified_at values (Resolved, Joshua Randall, 02/17/2016)
#1 Updated by Joshua Randall about 10 years ago

While testing my fix for 8485, I found another way that the loop can terminate early (my sanity check implemented in 8484 successfully caught that situation).

In addition to a set of collections sharing exactly the same modified_at timestamp, early loop termination can also occur if a set of collections is modified after the loop starts running. Because the termination test checks only whether any new collections were added to the map, the loop also exits early if every collection returned in a batch has already been loaded (because those collections previously had earlier modified_at timestamps).

The fix would be to change the loop termination condition to be specifically that there are no collections remaining according to the count that the API server returns for the batch (taking into account the offset and BatchSize). I'll roll that into the 8485 fix, which I'm working on now.
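The revised termination condition can be sketched as a small Go predicate. The `batchResult` type and field names here are illustrative, not the real Arvados response schema; the assumption is only that the list API reports the total number of matching records alongside each page.

```go
package main

import "fmt"

// batchResult mimics a list-API response: one page of items plus the
// server's count of all records matching the query (illustrative names,
// not the real Arvados response schema).
type batchResult struct {
	Items          []string
	ItemsAvailable int
}

// done reports whether paging can stop: the current page, together with the
// offset records already skipped, accounts for every record the server says
// matches the query. This replaces the "no new UUIDs were added" test.
func done(b batchResult, offset int) bool {
	return offset+len(b.Items) >= b.ItemsAvailable
}

func main() {
	// 130 matching records; this page holds the final 30 after skipping 100.
	fmt.Println(done(batchResult{Items: make([]string, 30), ItemsAvailable: 130}, 100)) // true
	// 130 matching records; only 100 accounted for so far.
	fmt.Println(done(batchResult{Items: make([]string, 50), ItemsAvailable: 130}, 50)) // false
}
```

Unlike the map-growth test, this condition is unaffected by records being refetched after their modified_at changes mid-run, because it compares counts against the server's own total rather than against what the client has already seen.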

#2 Updated by Joshua Randall about 10 years ago

  • Status changed from New to Resolved
