S3 bucket volume implementation » History » Version 5
Tom Clegg, 07/07/2016 06:59 PM
| 1 | 1 | Tom Clegg | h1. S3 bucket volume implementation |
|---|---|---|---|
| 2 | |||
| 3 | {{toc}} |
||
| 4 | |||
| 5 | h2. Background - unsafe delete |
||
| 6 | |||
| 7 | With the current (2016-07-04) S3 volume implementation, an S3 volume can store blocks safely, but cannot do garbage collection safely. |
||
| 8 | |||
| 9 | There is a command line flag @-s3-unsafe-delete@ that enables unsafe garbage collection. With this setting, it is possible to lose data: if a client writes a new copy of a previously stored block that is now scheduled for deletion, the newly written copy can be deleted even though the client is told the data is safely stored. |
||
| 10 | |||
| 11 | h2. S3 API challenge |
||
| 12 | |||
| 13 | S3 offers eventual consistency. Specifically: |
||
| 14 | * Reading an object doesn't necessarily return the latest version: "write X=foo; write X=bar; read X" might return foo. |
||
| 15 | * Deleting is not immediate: "write X=foo; delete X; read X" might return foo. |
||
| 16 | * There is no atomic conditional write operation (e.g., "overwrite X only if X has etag=foo"). |
||
| 17 | |||
| 18 | This makes it hard to do garbage collection. How can we delete X if we don't know whether X has been written recently? |
||
| 19 | |||
| 20 | 4 | Tom Clegg | h2. Proposed approach |
| 21 | 1 | Tom Clegg | |
| 22 | 4 | Tom Clegg | The state of a block is represented by the existence and timestamps of three objects: |
| 23 | 1 | Tom Clegg | * Data object, name="{md5}" -- contains the stored data; timestamp and metadata are irrelevant. |
| 24 | * Mtime object, name="recent/{md5}" -- zero data bytes; Last-Modified reflects most recent Put or Touch. |
||
| 25 | * Trash object, name="trash/{md5}" -- contains a copy of the stored data; Last-Modified reflects most recent Trash. |
||
| 26 | 4 | Tom Clegg | |
| 27 | The strategy is: |
||
| 28 | # Tolerate, and automatically clean up, untidy state caused by races. |
||
| 29 | # Avoid entering races where there's a possibility of losing data. |
||
| 30 | |||
| 31 | 5 | Tom Clegg | Specifically, there is a window, just before a trashed block becomes eligible for final deletion, when it is impossible to safely trash a newer copy of that block: |
| 32 | * Racing EmptyTrash ("delete trash/X if it's old enough") against Trash ("copy X to trash/X, then delete X") can result in both copies (X and trash/X) being deleted: EmptyTrash inspects state and issues a "delete trash/X" request; then Trash issues "copy X to trash/X" and "delete X" requests; then the S3 service performs those two operations in the reverse order, and both copies are deleted. |
||
| 33 | * Meanwhile, if Put(X) was also racing, we would discover next time we call FixRace() that X shouldn't have been trashed at all, and we should rescue it from the trash ... but that would be impossible. |
||
| 34 | |||
| 35 | 4 | Tom Clegg | |X|recent/X|trash/X|State| |
| 36 | |✓|-|-|X is not trash. It was written by an old version, or there was a race between EmptyTrash and Put. To fix, create a new recent/X.| |
||
| 37 | |✓|new|-|X is not trash, and is new. If it appears in a trash list, do nothing.| |
||
| 38 | |✓|old|-|X is not trash, and is old. If it appears in a trash list, trash it.| |
||
| 39 | |✓|any|✓|X is not trash. There was a race between Trash and Put, or Trash didn't (yet) finish.| |
||
| 40 | |-|old|✓|X is trash.| |
||
| 41 | |-|new|✓|X is not trash. There was a race between Trash and Put.| |
||
| 42 | |-|-|✓|Impossible? (Race between Trash, EmptyTrash, and Put?) To fix, touch recent/X and copy trash/X to X.| |
||
| 43 | |||
| 44 | h2. Implementation draft |
||
| 45 | 1 | Tom Clegg | |
| 46 | Put(X): |
||
| 47 | # Write data to data object, "X" |
||
| 48 | # Write zero bytes to mtime object, "recent/X" |
||
| 49 | |||
| 50 | Touch(X): |
||
| 51 | # Write zero bytes to mtime object, "recent/X" |
||
| 52 | # Ensure data object "X" exists. If not: |
||
| 53 | #* If FixRace(X) returns false, return ErrNotExist |
||
| 54 | #* Else return nil (OK) |
||
| 55 | #* (Handlers only use Touch after Get succeeds, so a 404 here is almost certainly an early opportunity to catch and correct a touch-vs.-trash race.) |
||
| 56 | |||
| 57 | Mtime(X): |
||
| 58 | # Ensure data object "X" exists. If not, return ErrNotExist. |
||
| 59 | # Ensure "recent/X" exists. If not, create it. |
||
| 60 | # Return LastModified("recent/X") |
||
| 61 | |||
| 62 | Trash(X): |
||
| 63 | # If trashLifetime==0 && !unsafeDelete: "not implemented" error |
||
| 64 | 2 | Tom Clegg | # If "trash/X" exists and LastModified("trash/X") + trashLifetime > now + raceWindow, then it is not safe to delete "X", so return without doing anything. |
| 65 | 1 | Tom Clegg | # Get "recent/X" |
| 66 | # PutCopy("X", "trash/X") |
||
| 67 | # Del("X") |
||
| 68 | |||
| 69 | Get(X): |
||
| 70 | # If Get("X") succeeds, return content |
||
| 71 | # Else if Get returned a 404, and FixRace(X) returns true, call Get("X") again and return content |
||
| 72 | |||
| 73 | FixRace(X): "check whether X needs to be cleaned up after a race; if so, clean it up. Return true if X was untrashed during cleanup." |
||
| 74 | # Use HEAD to get LastModified("recent/X"). |
||
| 75 | # If "recent/X" does not exist, an emptyTrash/Trash race went wrong, so we need to fix it by saving the very conservative value "now" as "most recent Put or Touch": |
||
| 76 | #* Put("recent/X", zeroBytes) |
||
| 77 | #* Log this event. |
||
| 78 | # Use HEAD to get LastModified("trash/X"). If "trash/X" does not exist, return false. |
||
| 79 | # If LastModified("trash/X") < LastModified("recent/X") + permissionTTL, a touch/trash race went wrong, so we need to fix it: |
||
| 80 | #* Perform HEAD requests for "X" and "trash/X" |
||
| 81 | #* If "X" exists but etag differs from "trash/X", log a collision warning and do nothing |
||
| 82 | #* If "X" exists and etag matches, delete "trash/X" |
||
| 83 | #* If "X" does not exist, PutCopy("trash/X", "X") (but don't delete "trash/X" -- EmptyTrash will get it next time) |
||
| 84 | #* Log this event. |
||
| 85 | #* Return true. |
||
| 86 | # Return false. |
||
| 87 | |||
| 88 | Untrash(X): |
||
| 89 | # PutCopy("trash/X", "X") (but don't delete "trash/X" -- EmptyTrash will get it next time) |
||
| 90 | |||
| 91 | EmptyTrash(): |
||
| 92 | # List all trash objects and mtime objects ("trash/X" and "recent/X"), using a merge sort to get them in matching pairs. For each X... |
||
| 93 | # If LastModified("trash/X") < LastModified("recent/X") + permissionTTL: call FixRace(X) |
||
| 94 | # Else if now() > LastModified("trash/X") + trashLifetime, it's time to really delete this object: |
||
| 95 | #* Del("trash/X") |
||
| 96 | 2 | Tom Clegg | #* Del("recent/X") unless "X" exists |
| 97 | 1 | Tom Clegg | |
| 98 | Index(): |
||
| 99 | # List all data objects and mtime objects ("X" and "recent/X"), using a merge sort to get them in matching pairs. |
||
| 100 | # Include X in the index only if "X" *and* "recent/X" exist. |
||
| 101 | # In the index response row "hash+size timestamp": hash is X, size is Size("X"), timestamp is LastModified("recent/X") |
||
| 102 | |||
| 103 | 2 | Tom Clegg | Sometimes, races will cause valuable blocks to be trashed, but they will be recovered automatically by FixRace(), provided: |
| 104 | * the S3 service achieves eventual consistency faster than raceWindow, and |
||
| 105 | * raceWindow < trashLifetime |
||
| 106 | 1 | Tom Clegg | |
| 107 | These "accidentally trashed" blocks will be omitted from the index, which means keep-balance will occasionally report them as "lost". This is undesirable, because never cry wolf. |
||
| 108 | |||
| 109 | h2. Unresolved issues |
||
| 110 | |||
| 111 | 3 | Tom Clegg | Is there a deadline for eventual consistency? IOW, is there a value raceWindow can be set to? |