Keep Proxy Specification » History » Version 13
Peter Amstutz, 07/23/2014 09:12 AM
1 | 1 | Peter Amstutz | h1. Reverse Keep Proxy |
---|---|---|---|
2 | |||
3 | 13 | Peter Amstutz | _Archived for informational purposes. The proposal described here is now implemented in arvados/services/keep/src/arvados.org/keepproxy_ |
4 | |||
5 | 1 | Peter Amstutz | h2. Problem |
6 | |||
7 | We need to be able to automatically upload huge (1 TiB or larger) datasets into Arvados. The currently proposed solution is to upload the data to a staging area and then copy it into Keep. On further consideration, this solution is inadequate for several reasons: |
||
8 | * Must set aside staging area big enough to accommodate large uploads. |
||
9 | * When uploads are not occurring, this empty space just sits around, costing money. |
||
10 | * Amazon has a 1 TiB limit on EBS volumes, which means we can't accept datasets larger than 1 TiB unless we create volume-spanning partitions. |
||
11 | * Multiple users uploading to the same staging partition can end up in a starvation deadlock if the volume fills up. |
||
12 | * Some of these problems could be addressed by allocating/deallocating volumes on the fly, but this adds significant complexity. |
||
13 | * Once data is uploaded, it still needs to be copied into Keep, which adds wait time between when the upload finishes and when the data is actually ready to use. |
||
14 | |||
15 | h2. Solution |
||
16 | |||
17 | Provide a Keep client that sends blocks to a reverse Keep proxy, which forwards the blocks to appropriate internal Keep servers. |
||
18 | * Doesn't require staging except in RAM of the Keep proxy. |
||
19 | * No dataset size limit other than Keep's overall capacity. |
||
20 | * Fewer contention problems (although many uploaders could overwhelm the proxy node...) |
||
21 | * Data is available immediately once upload is completed |
||
22 | * This is the right thing to do in the long term anyway. We shouldn't waste our time with messy hacks. |
||
23 | |||
24 | h2. Approach |
||
25 | |||
26 | 8 | Peter Amstutz | # Develop a subset of the Arvados Go SDK that can access the API server and write to Keep servers (reading from Keep is out of scope). |
27 | 2 | Peter Amstutz | ** Read files in 64 MiB blocks and calculate hashes |
28 | 4 | Peter Amstutz | ** Pack small files into a single block |
29 | 2 | Peter Amstutz | ** Put 64 MiB blocks to Keep server over HTTPS |
30 | ** Create manifest (should be normalized form) |
||
31 | ** Write manifest to Keep |
||
32 | ** Use Google API client to talk to API server to create collection, metadata links |
||
33 | 7 | Peter Amstutz | # Develop uploader program in Go to recursively upload a directory structure |
34 | 6 | Peter Amstutz | ** Take the API server, API token, and directory path on the command line (plus additional metadata links to set on the collection once the upload completes) |
35 | 2 | Peter Amstutz | ** Should be self-contained static x64 ELF binary with minimal dependencies that will run on any modern x64 Linux. |
36 | 7 | Peter Amstutz | ** Use Go Keep client library to upload blocks, create manifest, upload manifest to API server, add metadata links. |
37 | 2 | Peter Amstutz | ** Should checkpoint during upload so that upload can be canceled and resumed. |
38 | 3 | Peter Amstutz | # Reverse Keep Proxy |
39 | 2 | Peter Amstutz | ** Publicly accessible head node providing write access into Keep (read access is out of scope for this task) |
40 | 9 | Peter Amstutz | ** List proxy contact info in discovery document |
41 | 2 | Peter Amstutz | ** Check API token to ensure client has permission to write |
42 | ** Accept blocks from client, forward them to internal Keep cluster. Extend existing Keep Go server by writing a new volume backend that writes to the appropriate internal Keep servers instead of to the disk. |
||
43 | 10 | Peter Amstutz | ** Log the block hash and user UUID for each block to the API server |
44 | 2 | Peter Amstutz | ** Writing to internal Keep servers and API server will use Arvados Go SDK |
45 | # API server |
||
46 | 11 | Peter Amstutz | ** API call allowing normal users to create special user accounts that combine limited permissions and scopes to restrict the account to uploading tasks. Scopes alone are not powerful enough because a scope cannot restrict the uploader to only creating links about collections known to the uploader. |
47 | 2 | Peter Amstutz | ** Restricted to a few tasks, such as creating collections and creating metadata links about those collections. |
48 | ** Restricted account is owned by the Arvados user, so user can see and change everything the uploader account owns. |
||
49 | 1 | Peter Amstutz | ** Can deactivate uploader account when done with it. |
50 | 12 | Peter Amstutz | ** (This task can probably be separated from tasks 1-3 but is necessary to support delegation) |