Bug #5319
closed[API] Portable data hash mismatches for collections with location hints (+K)
Added by Bryan Cosca almost 10 years ago. Updated almost 10 years ago.
100%
Description
Files
portabledatahash.png (56.3 KB) portabledatahash.png | Bryan Cosca, 02/26/2015 04:31 PM | ||
portablehash-2.png (30.6 KB) portablehash-2.png | Bryan Cosca, 02/26/2015 04:33 PM |
Updated by Bryan Cosca almost 10 years ago
- File portablehash-2.png portablehash-2.png added
My current workaround is to select all the files and then create a new collection, which i guess makes a new hash. and I can use that new collection.
Updated by Brett Smith almost 10 years ago
- Subject changed from Portable Data hash does not match computed hash to [Workbench] Can't copy collection between projects due to portable data hash mismatch
- Category set to Workbench
- Target version set to Bug Triage
Updated by Brett Smith almost 10 years ago
- Status changed from New to In Progress
- Assigned To set to Brett Smith
Updated by Brett Smith almost 10 years ago
- Target version changed from Bug Triage to 2015-03-11 sprint
Updated by Bryan Cosca almost 10 years ago
qr1hi-4zz18-yrocuypi31jo3ru and qr1hi-4zz18-h03wet37z45z2c9 are known culprits
Updated by Brett Smith almost 10 years ago
The affected collections have manifests with Keep location hints in them (+Kqr1hi
), and the portable data hash corresponds to a manifest text including those hints. Now that API server disregards all hints when calculating a portable data hash, a straightforward copy fails.
Updated by Brett Smith almost 10 years ago
I believe this change is our culprit, specifically on old line 87/new line 106. That's validation-related code that changed from using self.manifest_text to portable_manifest_text. portable_manifest_text strips all hints, so that would explain the difference.
Questions:
- What's the right policy here? Should the PDH correspond to the text without any hints, or just without access tokens? Tom made this change, so I'm guessing it's the former…
- Once we figure that out, do we need a migration to correct the PDHs of older collections?
Updated by Brett Smith almost 10 years ago
- Subject changed from [Workbench] Can't copy collection between projects due to portable data hash mismatch to [API] Portable data hash mismatches for collections with location hints (+K)
- Category changed from Workbench to API
Updated by Peter Amstutz almost 10 years ago
+1 to doing a migration. It should probably include updating the PDH of the collection in pipelines templates.
Updated by Tom Clegg almost 10 years ago
Yes, PDH should stay the same when the hints change (e.g., when the data comes and goes from different storage systems, and when the manifest is copied between instances).
As for cleaning up the existing incorrect PDH values:
A migration that changes the PDH values would make the old/broken PDH unfindable, which means either editing evidence about existing jobs and pipeline instances (yuck) or making them un-repeatable (also yuck, although less so). Perhaps we can make copies of the affected collections instead, so we use the new PDH for future work but the old ones can still be looked up. The remaining challenge would be to prevent the old/bogus copies from showing up as confusing duplicates in searches.
(A "copy" API call would help address this (and, perhaps more importantly, #5096), but seems like excessive creep to do here, and doesn't address the need for a migration to fix the bogus PDH values.)
Updated by Brett Smith almost 10 years ago
Tom Clegg wrote:
A migration that changes the PDH values would make the old/broken PDH unfindable, which means either editing evidence about existing jobs and pipeline instances (yuck) or making them un-repeatable (also yuck, although less so). Perhaps we can make copies of the affected collections instead, so we use the new PDH for future work but the old ones can still be looked up. The remaining challenge would be to prevent the old/bogus copies from showing up as confusing duplicates in searches.
The longer-term plan is for Workbench to ignore objects that expire in the future (showing them only in a Trash), right? If that's the case, what if the migration set expires_at for the affected Collections far in the future, like 2038? That way they'd still be easy to find for jobs, and we have a way to hide them in Workbench that's consistent with other plans.
Updated by Tom Clegg almost 10 years ago
Brett Smith wrote:
The longer-term plan is for Workbench to ignore objects that expire in the future (showing them only in a Trash), right? If that's the case, what if the migration set expires_at for the affected Collections far in the future, like 2038? That way they'd still be easy to find for jobs, and we have a way to hide them in Workbench that's consistent with other plans.
Yes, that sounds like it should work nicely.
Updated by Peter Amstutz almost 10 years ago
General comment: my understanding is that best practices for migrations to modify the tables directly using SQL, instead of using models, since the model code can change to no longer match the assumptions at the time the migration was written. However, since most of the time the migrations are run promptly, this hasn't been much of an issue in practice.
- Since existing Collection records are getting updated, perhaps it should be writing updates to the log table?
- Why does each_bad_collection search for tags matching
\+[B-Z]
and not\+[A-Z]
? - Have you tried this out on a test database?
Updated by Brett Smith almost 10 years ago
Peter Amstutz wrote:
General comment: my understanding is that best practices for migrations to modify the tables directly using SQL, instead of using models, since the model code can change to no longer match the assumptions at the time the migration was written. However, since most of the time the migrations are run promptly, this hasn't been much of an issue in practice.
I have reworked the migration to follow the Rails Guide's advice about this.
- Since existing Collection records are getting updated, perhaps it should be writing updates to the log table?
Sure, added.
- Why does each_bad_collection search for tags matching
\+[B-Z]
and not\+[A-Z]
?
The API server has been pretty good for a while now about making sure that manifests don't get saved with permission hints. But it can't hurt to be sure, so changed.
- Have you tried this out on a test database?
Yes. The easiest way I've found is to add this test collection fixture:
with_location_hints: owner_uuid: zzzzz-tpzed-xurymjxw79nv3jz created_at: 2014-02-03 17:22:54.000000000 Z modified_by_client_uuid: zzzzz-ozdt8-brczlopd8u8d0jr modified_by_user_uuid: zzzzz-tpzed-d9tiejq69daie8f modified_at: 2014-02-03 17:22:54.000000000 Z portable_data_hash: 3eab909704896fd5e1c70cd135ab6b8d+52 updated_at: 2014-02-03 17:22:54.000000000 Z uuid: bcsdv-4zz18-6pxgxibl4vmyt09 manifest_text: | . 37b51d194a7513e45b56f6524f2d51f2+3+Kfhqwh 0:3:bar name: manifest with location hints description: properties: {}
With that in place, you can db:fixtures:load
, then db:migrate
and db:rollback
and observe the results.
Now at cea7dc4. Thanks.
Updated by Peter Amstutz almost 10 years ago
I noticed that it doesn't actually update the contents of manifest_text
. Is that intentional or an oversight?
modified_at
also isn't updated, and the copied collection has created_at
copied instead of set to now. This might be the right thing to do, but I'd like to hear your rationale.
Everything else looks good.
Updated by Brett Smith almost 10 years ago
Peter Amstutz wrote:
I noticed that it doesn't actually update the contents of
manifest_text
. Is that intentional or an oversight?
Intentional: that's what the API server is currently doing. See strip_manifest_text
in the Collection model: the only hints it strips are signatures. You can also see qr1hi-4zz18-cs8ezglaomoym37, which is the collection created by creating a "new" collection with all the same files as the original: its manifest text still has the hints. And this makes sense to me. This way, the portable data hash stays the same as long as the data does, but we retain hint information as much as possible to help find that data faster.
modified_at
also isn't updated, and the copied collection hascreated_at
copied instead of set to now. This might be the right thing to do, but I'd like to hear your rationale.
I guess the question boils down to, should those timestamps be true to the literal database record, or should they be true to the underlying collection? From the database perspective, I think your ideas are the right ones. I was thinking about it more at the collection level: the collection isn't new or changed, we've just patched up how we address it in the database. And so the logic reflects that.
To the extent that this is a bugfix migration that we're trying to keep invisible to the user (and admittedly, that's still not very invisible), I think this approach supports that goal. Imagine a user doing API requests for collections ordered by one of these timestamps descending. I think it would surprise them to see "old" collections suddenly spring up to the top of those results. We can't hide the second record from them, but I think this approach puts it in a context that can make it easier to see where it's coming from.
Thanks.
Updated by Brett Smith almost 10 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:19656a41f019488120f950b06ecf9e19074b11a3.