Story #16427
closed"undelete" command to recover trashed blocks and restore a deleted collection
Added by Tom Clegg over 4 years ago. Updated about 4 years ago.
100%
Description
- some blocks may not have been trashed because they are still referenced by other collections
- some blocks may still be in the recoverable "trashed" state, if BlobTrashLifetime has not arrived yet
Given a manifest (or the UUID of a a collection update/delete log entry that has a manifest in old_attributes), the recovery command ("arvados-server undelete") should use keepstore's HEAD and untrash APIs to untrash the data as needed, make a new manifest with fresh signatures, and save a new collection.
If a block is not recoverable, the command should continue to untrash as many as it can, and report how many were attempted/successful, but not save a new collection. (Future improvement: save a partial collection in this case, omitting any files affected by the missing blocks.)
Files
16427-doc.png (205 KB) 16427-doc.png | Tom Clegg, 06/05/2020 09:16 PM |
Updated by Peter Amstutz over 4 years ago
- Target version set to 2020-06-03 Sprint
Updated by Tom Clegg over 4 years ago
- Status changed from New to In Progress
work in progress at 16427-undelete
Updated by Ward Vandewege over 4 years ago
- Related to Support #16421: [doc] document deletion lifecycle of collections, and steps to undelete collections added
Updated by Tom Clegg over 4 years ago
16427-undelete @ 167e2537365ec68fe6be6a330a2eb698f177aa05 -- developer-run-tests: #1879
TODO: accept UUID of collection-delete/update log entry (currently must provide file containing manifest text)
TODO: doc page
Updated by Ward Vandewege over 4 years ago
Review comments:
lib/undelete/cmd.go
+ func (und undeleter) newestMtime(logger logrus.FieldLogger, blk string, svc arvados.KeepService) (time.Time, bool) {
It would be nice to follow the Go error pattern instead, and make the second arg an 'error'. Presumably, block not on svc is an error too!
+func (und undeleter) ensureSafe(ctx context.Context, logger logrus.FieldLogger, blk string, svc arvados.KeepService, blobsigttl time.Duration, blobsigexp time.Time) bool {
Same - why not return an error instead of a bool?
+func (und undeleter) RecoverManifest(mtxt string) (string, error) {
Is there a particular reason why half a BlobSigningTTL is a good choice for the save deadline? Can the reason go into a comment?
+ for i := 0; i < 2*len(services); i++ {
It would be great to make it a bit more clear that these are worker threads, and that the number of threads is being sized by the number of keepstore nodes (rather than the cores on the client, or something). Maybe something like
"workerThreads := 2 * len(services)"
+ if havenot > 0 {
+ if have > 0 {
+ und.logger.Warn("partial recovery is not implemented")
+ }
+ return "", fmt.Errorf("unable to recover %d of %d blocks", havenot, have+havenot)
+ }
Partial unrecovery will be nice. Just listing the affected blocks, could also be nice. Blocks + affected filenames even better. That should probably be part of the partial recovery feature.
lib/undelete/cmd_test.go
+func (*Suite) TestUntrashAndTouchBlock(c *check.C) {
+ if fi, err := os.Stat(datadir); err != nil || !fi.IsDir() {
+ c.Logf("skipping datadir %q, evidently neither keepstore is using it", datadir)
+ continue
+ }
This seems like it could generate unnecessary noise in the tests, no? Can we make this debug-only, or even just skip the logging altogether? It doesn't seem super useful.
Functionally: running arvados-server with the `undelete` command prints this, which seems unhelpful:
$ ./arvados-server undelete
Usage: ./arvados-server undelete [options] uuid_or_file ...
-config file
Site configuration file (default may be overridden by setting an ARVADOS_CONFIG environment variable) (default "/etc/arvados/config.yml")
-legacy-crunch-dispatch-slurm-config file
Legacy crunch-dispatch-slurm configuration file (default "/etc/arvados/crunch-dispatch-slurm/crunch-dispatch-slurm.yml")
-legacy-git-httpd-config file
Legacy arv-git-httpd configuration file (default "/etc/arvados/git-httpd/git-httpd.yml")
-legacy-keepbalance-config file
Legacy keep-balance configuration file (default "/etc/arvados/keep-balance/keep-balance.yml")
-legacy-keepproxy-config file
Legacy keepproxy configuration file (default "/etc/arvados/keepproxy/keepproxy.yml")
-legacy-keepstore-config file
Legacy keepstore configuration file (default "/etc/arvados/keepstore/keepstore.yml")
-legacy-keepweb-config file
Legacy keep-web configuration file (default "/etc/arvados/keep-web/keep-web.yml")
-legacy-ws-config file
Legacy arvados-ws configuration file (default "/etc/arvados/ws/ws.yml")
-log-level string
logging level (debug, info, ...) (default "info")
-skip-legacy
Don't load legacy config files
INFO[2020-06-03T09:59:49.409731940-04:00] exiting
Can we have some help text?
Otherwise, LGTM so far!
Updated by Tom Clegg over 4 years ago
- Target version changed from 2020-06-03 Sprint to 2020-06-17 Sprint
Updated by Tom Clegg over 4 years ago
Ward Vandewege wrote:
It would be nice to follow the Go error pattern instead, and make the second arg an 'error'. Presumably, block not on svc is an error too!
Not exactly (we might go on to find it on a different server), but yes, I've updated both funcs to return error instead of an ok bool.
Is there a particular reason why half a BlobSigningTTL is a good choice for the save deadline? Can the reason go into a comment?
Yes, added comment.
+ for i := 0; i < 2*len(services); i++ {
Added explanatory comment for this too.
Partial unrecovery will be nice. Just listing the affected blocks, could also be nice. Blocks + affected filenames even better. That should probably be part of the partial recovery feature.
Yeah, I think the first "partial recovery" feature would remove any files that are affected by the missing blocks, and save the files that are still intact, and exit non-zero.
Manifests don't support sparse files, so our only options for saving partial files seem to be zero-padding and truncating, so if you start with "abcdefghi" and "def" goes missing, you can recover either "abcghi" or "abc\0\0\0ghi". (I don't think we're going to implement either right away though.)
This seems like it could generate unnecessary noise in the tests, no? Can we make this debug-only, or even just skip the logging altogether? It doesn't seem super useful.
Yes, it only shows up when the test fails or you run -check.vv, but I changed it to print the datadirs that are used instead of the ones that aren't, which is pretty much the same information but less mysterious.
Functionally: running arvados-server with the `undelete` command prints this, which seems unhelpful:
Removed the legacy config flags, added more explanation, and made the "Usage: ..." part appear in the "bad args" case as well as the "no positional args" case.
Usage: ./tmp/GOPATH/bin/arvados-server undelete [options ...] /path/to/manifest.txt [...] This program recovers deleted collections. Recovery is possible when the collection's manifest is still available and all of its data blocks are still available or recoverable (e.g., garbage collection is not enabled, the blocks are too new for garbage collection, the blocks are referenced by other collections, or the blocks have been trashed but not yet deleted). For each provided collection manifest, once all data blocks are recovered/protected from garbage collection, a new collection is saved and its UUID is printed on stdout. Exit status will be zero if recovery is successful, i.e., a collection is saved for each provided manifest. Options: -config file Site configuration file (default may be overridden by setting an ARVADOS_CONFIG environment variable) (default "/etc/arvados/config.yml") -log-level string logging level (debug, info, ...) (default "info")
16427-undelete @ f525d01de971b8b06dabc827bc0bbf46ee7ce9cc -- developer-run-tests: #1892
and with master merged:
16427-undelete @ 2be95ce7f069d9eb131b0d2d922a5e556f75810c -- developer-run-tests: #1893
Updated by Ward Vandewege over 4 years ago
Tom Clegg wrote:
Ward Vandewege wrote:
It would be nice to follow the Go error pattern instead, and make the second arg an 'error'. Presumably, block not on svc is an error too!
Not exactly (we might go on to find it on a different server), but yes, I've updated both funcs to return error instead of an ok bool.
Cool. Looks like the comments for both functions still needs updating.
Partial unrecovery will be nice. Just listing the affected blocks, could also be nice. Blocks + affected filenames even better. That should probably be part of the partial recovery feature.
Yeah, I think the first "partial recovery" feature would remove any files that are affected by the missing blocks, and save the files that are still intact, and exit non-zero.
Yes. That would be very useful already.
Manifests don't support sparse files, so our only options for saving partial files seem to be zero-padding and truncating, so if you start with "abcdefghi" and "def" goes missing, you can recover either "abcghi" or "abc\0\0\0ghi". (I don't think we're going to implement either right away though.)
I would probably prefer the zero padding approach (because it shows exactly what sections are missing). I agree that this could be a follow on story.
This seems like it could generate unnecessary noise in the tests, no? Can we make this debug-only, or even just skip the logging altogether? It doesn't seem super useful.
Yes, it only shows up when the test fails or you run -check.vv, but I changed it to print the datadirs that are used instead of the ones that aren't, which is pretty much the same information but less mysterious.
Perfect, thanks.
Functionally: running arvados-server with the `undelete` command prints this, which seems unhelpful:
Removed the legacy config flags, added more explanation, and made the "Usage: ..." part appear in the "bad args" case as well as the "no positional args" case.
Excellent thank you.
LGTM with those function comments updated. Thanks!
Updated by Tom Clegg over 4 years ago
- File 16427-doc.png 16427-doc.png added
This branch also adds the "get old manifest from log entry UUID" feature.
16427-undelete @ 570c793f1aa43fd9763a0368554dd395dacdd238 -- developer-run-tests: #1897
Should we also add "provide collection UUID -> get old version from most recent log entry concerning that collection" ...? That would probably be the most convenient way to do the most obvious case, i.e., undo the last change, whether or not it was a "delete".
Updated by Ward Vandewege over 4 years ago
Tom Clegg wrote:
Nice! I like the documentation.
This branch also adds the "get old manifest from log entry UUID" feature.
16427-undelete @ 570c793f1aa43fd9763a0368554dd395dacdd238 -- developer-run-tests: #1897
Yeah that LGTM. The test failure looks unrelated, do you agree?
Should we also add "provide collection UUID -> get old version from most recent log entry concerning that collection" ...? That would probably be the most convenient way to do the most obvious case, i.e., undo the last change, whether or not it was a "delete".
Yes, that would be nice. Should we call that 'undo' instead of 'undelete' ?
Updated by Tom Clegg over 4 years ago
Should we also add "provide collection UUID -> get old version from most recent log entry concerning that collection" ...? That would probably be the most convenient way to do the most obvious case, i.e., undo the last change, whether or not it was a "delete".
Yes, that would be nice. Should we call that 'undo' instead of 'undelete' ?
I think "undo" isn't quite right because we're (intentionally) saving the recovered data elsewhere, rather than restoring the same collection UUID back to an earlier state. Same reasoning applies to calling the current functionality "undelete" for that matter -- it doesn't actually make the collection UUID come back, there's no attempt to restore name/metadata, etc. Would "recover-collection-data" be better, and make sense whether passing a log uuid, collection uuid, or manifest saved in a file? It's a bit unwieldy/verbose but I'm thinking that's ok for something only used in rare/unusual cases.
Updated by Ward Vandewege over 4 years ago
Tom Clegg wrote:
Should we also add "provide collection UUID -> get old version from most recent log entry concerning that collection" ...? That would probably be the most convenient way to do the most obvious case, i.e., undo the last change, whether or not it was a "delete".
Yes, that would be nice. Should we call that 'undo' instead of 'undelete' ?
I think "undo" isn't quite right because we're (intentionally) saving the recovered data elsewhere, rather than restoring the same collection UUID back to an earlier state. Same reasoning applies to calling the current functionality "undelete" for that matter -- it doesn't actually make the collection UUID come back, there's no attempt to restore name/metadata, etc. Would "recover-collection-data" be better, and make sense whether passing a log uuid, collection uuid, or manifest saved in a file? It's a bit unwieldy/verbose but I'm thinking that's ok for something only used in rare/unusual cases.
Yes, good call. You could shorten it to recover-collection if you want.
Updated by Peter Amstutz over 4 years ago
- Related to Story #16514: Actionable insight into keep usage added
Updated by Tom Clegg over 4 years ago
- rename subcommand to "recover-collection"
- accept collection UUID (equivalent to passing UUID of most recent log entry with matching object_uuid)
Updated by Ward Vandewege over 4 years ago
Tom Clegg wrote:
16427-undelete @ e48478841828b1dbab8b69eb9453db23b42ed63f -- developer-run-tests: #1913
- rename subcommand to "recover-collection"
- accept collection UUID (equivalent to passing UUID of most recent log entry with matching object_uuid)
LGTM thanks!
Updated by Anonymous over 4 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|9447d2f3bd4c15a4b3bda16dc3ddef0faeb68771.