Keep Proxy Specification » History » Version 4

Peter Amstutz, 04/28/2014 12:41 PM

h1. Reverse Keep Proxy

h2. Problem

We need to be able to automatically upload huge (1 TiB or larger) datasets into Arvados.  The currently proposed solution is to upload the data to a staging area and then copy it into Keep.  On further consideration, this solution is inadequate for a number of reasons:
* We must set aside a staging area big enough to accommodate large uploads.
* When no uploads are in progress, this empty space sits idle, costing money.
* Amazon has a 1 TiB limit on EBS volumes, which means we can't accept datasets larger than 1 TiB unless we create volume-spanning partitions.
* Multiple users uploading to the same staging partition can end up in a starvation deadlock if the volume fills up.
* Some of these problems could be addressed by allocating and deallocating volumes on the fly, but this adds significant complexity.
* Once data is uploaded, it still needs to be copied into Keep, which adds to the wait between when the data is uploaded and when it is actually ready to use.

h2. Solution

Provide a Keep client that sends blocks to a reverse Keep proxy, which forwards the blocks to the appropriate internal Keep servers.
* Doesn't require staging, except in the RAM of the Keep proxy.
* No limit on dataset size except Keep's overall capacity.
* Fewer contention problems (although many simultaneous uploaders could still overwhelm the proxy node).
* Data is available immediately once the upload is completed.
* This is the right thing to do in the long term anyway.  We shouldn't waste our time with messy hacks.

h2. Approach

# Develop an Arvados Go SDK that supports accessing the API server and can upload to a Keep server.
** Read files in 64 MiB blocks and calculate hashes.
** Pack small files into a single block.
** Put 64 MiB blocks to the Keep server over HTTPS.
** Create the manifest (should be in normalized form).
** Write the manifest to Keep.
** Use the Google API client to talk to the API server to create the collection and metadata links.
# Create an uploader program to recursively upload a directory structure.
** Take the API server, API token, and directory path on the command line (probably also additional metadata links).
** Should be a self-contained static x64 ELF binary with minimal dependencies that will run on any modern x64 Linux.
** Use the Keep client library to upload blocks, create the manifest, upload the manifest to the API server, and add metadata links.
** Should checkpoint during upload so that an upload can be canceled and resumed.
# Reverse Keep Proxy
** Publicly accessible head node providing write access into Keep (read access is out of scope for this task).
** List the proxy node in the discovery document.
** Check the API token to ensure the client has permission to write.
** Accept blocks from the client and forward them to the internal Keep cluster.  Extend the existing Keep Go server by writing a new volume backend that writes to the appropriate internal Keep servers instead of to disk.
** Log the hash and user account associated with each uploaded block to the API server.
** Writing to the internal Keep servers and the API server will use the Arvados Go SDK.
# API server
** API call allowing normal users to create special user accounts that use a combination of limited permissions and scopes to restrict them to uploading tasks.  Scopes alone are not powerful enough because a scope cannot restrict the uploader to only creating links about collections known to the uploader.
** Restricted to a few tasks, such as creating collections and creating metadata links about those collections.
** The restricted account is owned by the Arvados user, so the user can see and change everything the uploader account owns.
** The user can deactivate the uploader account when done with it.
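
The client-side steps in item 1 (read a file in fixed-size blocks, hash each block, then describe the file in a manifest) can be sketched in Go roughly as follows. This is a minimal illustration and not the SDK's actual API: the function names @blockLocator@, @splitAndHash@ and @manifestLine@ are hypothetical, the hash is assumed to be MD5 with a @+size@ suffix, and the manifest line is assumed to have the form @. <locators> 0:<size>:<filename>@. Packing small files into one block, normalizing the manifest, and the HTTPS upload itself are left out.

```go
package main

import (
	"bytes"
	"crypto/md5"
	"errors"
	"fmt"
	"io"
	"strings"
)

// blockLocator returns "<md5 hex>+<size>" for one block.
// Assumption: the locator format is MD5 plus a size hint.
func blockLocator(block []byte) string {
	return fmt.Sprintf("%x+%d", md5.Sum(block), len(block))
}

// splitAndHash reads r in chunks of at most blockSize bytes and returns one
// locator per chunk together with the total number of bytes read.
// A real client would also send each chunk to the proxy at this point.
func splitAndHash(r io.Reader, blockSize int) ([]string, int64, error) {
	if blockSize <= 0 {
		return nil, 0, errors.New("blockSize must be positive")
	}
	var locators []string
	var total int64
	buf := make([]byte, blockSize)
	for {
		n, err := io.ReadFull(r, buf)
		if n > 0 {
			locators = append(locators, blockLocator(buf[:n]))
			total += int64(n)
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			if len(locators) == 0 {
				// Empty input: emit the locator of the empty block so the
				// manifest line always names at least one block.
				locators = append(locators, blockLocator(nil))
			}
			return locators, total, nil
		}
		if err != nil {
			return nil, 0, err
		}
	}
}

// manifestLine describes a single file stored in the stream "." as a
// sequence of blocks (assumed manifest format, not normalized).
func manifestLine(name string, locators []string, size int64) string {
	return fmt.Sprintf(". %s 0:%d:%s\n", strings.Join(locators, " "), size, name)
}

func main() {
	// A tiny block size keeps the demo readable; the spec calls for 64 MiB.
	data := bytes.Repeat([]byte("a"), 10)
	locators, size, err := splitAndHash(bytes.NewReader(data), 4)
	if err != nil {
		panic(err)
	}
	fmt.Print(manifestLine("example.txt", locators, size))
}
```

With a block size of 4, the 10-byte input is split into three blocks of 4, 4 and 2 bytes, and the first two locators are identical because the blocks have the same content. The block size is a parameter here only so that the sketch can be exercised with small inputs.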