S3 bucket volume implementation » History » Version 2
Tom Clegg, 07/04/2016 08:30 PM
| 1 | 1 | Tom Clegg | h1. S3 bucket volume implementation |
|---|---|---|---|
| 2 | |||
| 3 | {{toc}} |
||
| 4 | |||
| 5 | h2. Background - unsafe delete |
||
| 6 | |||
| 7 | With the current (2016-07-04) S3 volume implementation, an S3 volume can store blocks safely, but cannot do garbage collection safely. |
||
| 8 | |||
| 9 | There is a command line flag @-s3-unsafe-delete@ that enables unsafe garbage collection. With this setting, it is possible to lose data: if a client writes a new copy of a previously stored block that is now scheduled for deletion, the newly written copy can be deleted even though the client is told the data is safely stored. |
||
| 10 | |||
| 11 | h2. S3 API challenge |
||
| 12 | |||
| 13 | S3 offers eventual consistency. Specifically: |
||
| 14 | * Reading an object doesn't necessarily return the latest version: "write X=foo; write X=bar; read X" might return foo. |
||
| 15 | * Deleting is not immediate: "write X=foo; delete X; read X" might return foo. |
||
| 16 | * There is no atomic conditional write operation (e.g., "overwrite X only if X has etag=foo"). |
||
| 17 | |||
| 18 | This makes it hard to do garbage collection. How can we delete X if we don't know whether X has been written recently? |
||
| 19 | |||
| 20 | h2. Implementation draft |
||
| 21 | |||
| 22 | Proposed approach: Each block is represented by (up to) three S3 objects. |
||
| 23 | * Data object, name="{md5}" -- contains the stored data; timestamp and metadata are irrelevant. |
||
| 24 | * Mtime object, name="recent/{md5}" -- zero data bytes; Last-Modified reflects most recent Put or Touch. |
||
| 25 | * Trash object, name="trash/{md5}" -- contains a copy of the stored data; Last-Modified reflects most recent Trash. |
||
| 26 | |||
| 27 | Put(X): |
||
| 28 | # Write data to data object, "X" |
||
| 29 | # Write zero bytes to mtime object, "recent/X" |
||
| 30 | |||
| 31 | Touch(X): |
||
| 32 | # Write zero bytes to mtime object, "recent/X" |
||
| 33 | # Ensure data object "X" exists. If not: |
||
| 34 | #* If FixRace(X) returns false, return ErrNotExist |
||
| 35 | #* Else return nil (OK) |
||
| 36 | #* (Handlers only use Touch after Get succeeds, so a 404 here is almost certainly an early opportunity to catch and correct a touch-vs.-trash race.) |
||
| 37 | |||
| 38 | Mtime(X): |
||
| 39 | # Ensure data object "X" exists. If not, return ErrNotExist. |
||
| 40 | # Ensure "recent/X" exists. If not, create it. |
||
| 41 | # Return LastModified("recent/X") |
||
| 42 | |||
| 43 | Trash(X): |
||
| 44 | # If trashLifetime==0 && !unsafeDelete: "not implemented" error |
||
| 45 | 2 | Tom Clegg | # If "trash/X" exists and LastModified("trash/X") + trashLifetime > now + raceWindow, then it is not safe to delete "X", so return without doing anything. |
| 46 | 1 | Tom Clegg | # Get "recent/X" |
| 47 | # PutCopy("X", "trash/X") |
||
| 48 | # Del("X") |
||
| 49 | |||
| 50 | Get(X): |
||
| 51 | # If Get("X") succeeds, return content |
||
| 52 | # Else if Get returned a 404, and FixRace(X) returns true, call Get("X") again and return content |
||
| 53 | |||
| 54 | FixRace(X): "check whether X needs to be cleaned up after a race; if so, clean it up. Return true if X was untrashed during cleanup." |
||
| 55 | # Use HEAD to get LastModified("recent/X"). |
||
| 56 | # If "recent/X" does not exist, an emptyTrash/Trash race went wrong, so we need to fix it by saving the very conservative value "now" as "most recent Put or Touch": |
||
| 57 | #* Put("recent/X", zeroBytes) |
||
| 58 | #* Log this event. |
||
| 59 | # Use HEAD to get LastModified("trash/X"). If "trash/X" does not exist, return false. |
||
| 60 | # If LastModified("trash/X") < LastModified("recent/X") + permissionTTL, a touch/trash race went wrong, so we need to fix it: |
||
| 61 | #* Perform HEAD requests for "X" and "trash/X" |
||
| 62 | #* If "X" exists but etag differs from "trash/X", log a collision warning and do nothing |
||
| 63 | #* If "X" exists and etag matches, delete "trash/X" |
||
| 64 | #* If "X" does not exist, PutCopy("trash/X", "X") (but don't delete "trash/X" -- EmptyTrash will get it next time) |
||
| 65 | #* Log this event. |
||
| 66 | #* Return true. |
||
| 67 | # Return false. |
||
| 68 | |||
| 69 | Untrash(X): |
||
| 70 | # PutCopy("trash/X", "X") (but don't delete "trash/X" -- EmptyTrash will get it next time) |
||
| 71 | |||
| 72 | EmptyTrash(): |
||
| 73 | # List all trash objects and mtime objects ("trash/X" and "recent/X"), using a merge sort to get them in matching pairs. For each X... |
||
| 74 | # If LastModified("trash/X") < LastModified("recent/X") + permissionTTL: call FixRace(X) |
||
| 75 | # Else if now() > LastModified("trash/X") + trashLifetime, it's time to really delete this object: |
||
| 76 | #* Del("trash/X") |
||
| 77 | 2 | Tom Clegg | #* Del("recent/X") unless "X" exists |
| 78 | 1 | Tom Clegg | |
| 79 | Index(): |
||
| 80 | # List all data objects and mtime objects ("X" and "recent/X"), using a merge sort to get them in matching pairs. |
||
| 81 | # Include X in the index only if "X" *and* "recent/X" exist. |
||
| 82 | # In the index response row "hash+size timestamp": hash is X, size is Size("X"), timestamp is LastModified("recent/X") |
||
| 83 | |||
| 84 | 2 | Tom Clegg | Sometimes, races will cause valuable blocks to be trashed, but they will be recovered automatically by FixRace(), provided: |
| 85 | * the S3 service achieves eventual consistency faster than raceWindow, and |
||
| 86 | * raceWindow < trashLifetime |
||
| 87 | 1 | Tom Clegg | |
| 88 | These "accidentally trashed" blocks will be omitted from the index, which means keep-balance will occasionally report them as "lost". This is undesirable, because never cry wolf. |
||
| 89 | |||
| 90 | h2. Unresolved issues |
||
| 91 | |||
| 92 | If both X and trash/X exist (e.g., a client has written a copy of a block that was already in trash), a race between Trash and FixRace can delete both X and trash/X (see #8555#note-21). |