Federated collections » History » Version 7
Tom Clegg, 08/09/2018 07:03 PM
| 1 | 1 | Peter Amstutz | h1. Federated collections |
|---|---|---|---|
| 2 | 2 | Peter Amstutz | |
| 3 | 7 | Tom Clegg | In a federation, a client on cluster A can read a collection that is hosted on cluster B. Cluster A pulls the metadata and file content from cluster B as needed. The client's behavior is exactly the same as it is for collections hosted on cluster A. |
| 4 | 2 | Peter Amstutz | |
| 5 | 7 | Tom Clegg | Cases: |
| 6 | * Read collection by uuid |
| 7 | * Read collection by pdh |
| 8 | * Update collection by uuid (not covered here yet; needs a strategy for writing the data through to the remote cluster) |
| 9 | 1 | Peter Amstutz | |
| 10 | 7 | Tom Clegg | h2. Differences from federated workflow retrieval |
| 11 | 1 | Peter Amstutz | |
| 12 | 7 | Tom Clegg | If the collection is requested from cluster A with @GET /arvados/v1/collections/{uuid}@, cluster A can proxy a request to cluster B, using the same approach used for workflows in #13493. |
| 13 | 2 | Peter Amstutz | |
| 14 | 7 | Tom Clegg | If the collection is requested from cluster A with @GET /arvados/v1/collections/{pdh}@, and cluster A does not have a matching collection, it can scan remote clusters until it finds one. |
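The PDH lookup described above could be sketched roughly as follows. This is only an illustration: @find_collection_by_pdh@, the @fetch@ callback, the cluster ID @aaaaa@, and the in-memory @_FAKE_DATA@ table are all hypothetical stand-ins for real API calls, and the scan order (local first, then remotes in list order) is an assumption.

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical callback: returns the collection record held by a cluster,
# or None if that cluster has no collection with the given PDH.
Fetcher = Callable[[str, str], Optional[dict]]


def find_collection_by_pdh(
    pdh: str,
    local_cluster: str,
    remote_clusters: Iterable[str],
    fetch: Fetcher,
) -> Optional[dict]:
    """Return the first matching collection, trying the local cluster first."""
    for cluster_id in [local_cluster, *remote_clusters]:
        record = fetch(cluster_id, pdh)
        if record is not None:
            # Record the source so block locators can later be annotated
            # with the right cluster ID.
            return {**record, "source_cluster": cluster_id}
    return None


# Toy in-memory stand-in for per-cluster API requests (illustration only).
_FAKE_DATA: Dict[str, Dict[str, dict]] = {
    "bbbbb": {"acbd18db4cc2f85cedef654fccc4a4d8+54": {"name": "example"}},
}


def fake_fetch(cluster_id: str, pdh: str) -> Optional[dict]:
    return _FAKE_DATA.get(cluster_id, {}).get(pdh)
```

A real implementation would probably query remotes concurrently and stop at the first hit; the sequential loop here just keeps the control flow visible.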
| 15 | 1 | Peter Amstutz | |
| 16 | 7 | Tom Clegg | Once the collection is retrieved, the client also needs to read the data blocks. Without some additional mechanism, this won't work: the local keepstore servers will reject the blob signatures provided by the remote cluster, and they generally won't have the requested data anyway. |
| 17 | 1 | Peter Amstutz | |
| 18 | 7 | Tom Clegg | h2. Remote data hints |
| 19 | 1 | Peter Amstutz | |
| 20 | 7 | Tom Clegg | If cluster A uses a salted token to retrieve a collection from cluster B, cluster B provides a signed manifest: |
| 21 | 5 | Peter Amstutz | |
| 22 | 7 | Tom Clegg | <pre> |
| 23 | . acbd18db4cc2f85cedef654fccc4a4d8+3+Aabcdef@12345678 0:3:foo.txt |
| 24 | </pre> |
| 25 | 3 | Peter Amstutz | |
| 26 | 7 | Tom Clegg | Cluster A propagates cluster B's signature but includes the remote cluster ID: |
| 27 | 3 | Peter Amstutz | |
| 28 | 7 | Tom Clegg | <pre> |
| 29 | . acbd18db4cc2f85cedef654fccc4a4d8+3+Abbbbb-abcdef@12345678 0:3:foo.txt |
| 30 | </pre> |
| 31 | 3 | Peter Amstutz | |
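The rewrite from cluster B's manifest to the one cluster A hands to its client could look something like this. It is a minimal sketch: the function name @add_remote_hint@ is hypothetical, and the regular expressions assume signatures and expiry timestamps are lowercase hex, as in the examples above.

```python
import re

# Tokens that start with a 32-hex-digit hash and "+" are block locators.
_LOCATOR_PREFIX = re.compile(r"^[0-9a-f]{32}\+")
# A signature hint without a cluster ID: "+A", signature, "@", expiry.
# (Assumed character classes; already-annotated hints do not match.)
_LOCAL_SIG = re.compile(r"\+A([0-9a-f]+)@([0-9a-f]+)")


def add_remote_hint(manifest_text: str, cluster_id: str) -> str:
    """Insert cluster_id and '-' into every locator signature hint."""
    out_lines = []
    for line in manifest_text.split("\n"):
        tokens = []
        for token in line.split(" "):
            if _LOCATOR_PREFIX.match(token):
                token = _LOCAL_SIG.sub(
                    lambda m: f"+A{cluster_id}-{m.group(1)}@{m.group(2)}",
                    token,
                )
            tokens.append(token)
        out_lines.append(" ".join(tokens))
    return "\n".join(out_lines)
```

Only locator tokens are touched, so stream names and file segments such as @0:3:foo.txt@ pass through unchanged.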
| 32 | 7 | Tom Clegg | Any keepstore service on cluster A will be able to fetch the block from cluster B: |
| 33 | * Look up bbbbb in remote cluster list in discovery doc |
| 34 | * Look up bbbbb's keepproxy address in bbbbb's discovery doc |
| 35 | * Fetch <code>https://{keepproxy}/acbd18db4cc2f85cedef654fccc4a4d8+3+Aabcdef@12345678</code> (the cluster ID is stripped, leaving the signature cluster B originally issued) |
| 36 | 5 | Peter Amstutz | |
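As an illustration of the fetch steps above, the following sketch turns a remote-hinted locator into the URL to request. The two discovery-document lookups are collapsed into a plain mapping, @keepproxy_hosts@, which is an assumption, as are the function name, the example host name, and the 5-character cluster ID pattern. The sketch also assumes the upstream request carries cluster B's signature with the cluster ID removed.

```python
import re
from typing import Mapping

# "+A", a 5-character cluster ID, "-", then signature "@" expiry.
# (Assumed syntax, following the examples in this section.)
_REMOTE_SIG = re.compile(r"\+A([0-9a-z]{5})-([0-9a-f]+@[0-9a-f]+)")


def remote_block_url(locator: str, keepproxy_hosts: Mapping[str, str]) -> str:
    """Build the keepproxy URL for a locator carrying a remote hint."""
    m = _REMOTE_SIG.search(locator)
    if m is None:
        raise ValueError("locator has no remote signature hint")
    cluster_id, signature = m.group(1), m.group(2)
    if cluster_id not in keepproxy_hosts:
        raise ValueError(f"unknown remote cluster {cluster_id!r}")
    # Drop the cluster ID so the remote cluster sees the hint it issued.
    upstream = locator[: m.start()] + "+A" + signature + locator[m.end():]
    return f"https://{keepproxy_hosts[cluster_id]}/{upstream}"
```

In practice the mapping would be filled from the local discovery document and then from each remote cluster's own discovery document, and cached.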
| 37 | 7 | Tom Clegg | h2. Remote signature hint |
| 38 | 3 | Peter Amstutz | |
| 39 | 7 | Tom Clegg | Possible syntaxes: |
| 40 | * acbd18db4cc2f85cedef654fccc4a4d8+3+Abbbbb-bcdefa@12345678 |
| 41 | * acbd18db4cc2f85cedef654fccc4a4d8+3+Rbbbbb-bcdefa@12345678 |
| 42 | 3 | Peter Amstutz | |
| 43 | 7 | Tom Clegg | The chosen syntax must support having both local and remote signatures on a single locator. This would let a future, more sophisticated controller tell keepstore securely, on a per-block or per-collection basis, whether it may skip contacting the remote cluster when the requested remote data also happens to be stored locally. |
| 44 | * acbd18db4cc2f85cedef654fccc4a4d8+3+Abbbbb-bcdefa@12345678+Aabcdef@12345678 |
| 45 | * acbd18db4cc2f85cedef654fccc4a4d8+3+Rbbbbb-bcdefa@12345678+Aabcdef@12345678 |
| 46 | 3 | Peter Amstutz | |
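To show that both candidate syntaxes can carry a local and a remote signature at once, here is a small parsing sketch. The names @Signatures@ and @parse_locator@ are hypothetical, and the patterns (5-character cluster ID, hex signature and expiry) are assumptions drawn from the examples above rather than a settled grammar.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Signatures(NamedTuple):
    block: str                          # "hash+size"
    local: Optional[str]                # "signature@expiry"
    remote: Optional[Tuple[str, str]]   # (cluster ID, "signature@expiry")


# Remote hint under either candidate syntax: "A" or "R" prefix.
_REMOTE = re.compile(r"^[AR]([0-9a-z]{5})-([0-9a-f]+@[0-9a-f]+)$")
# Local hint: "A" followed directly by the signature.
_LOCAL = re.compile(r"^A([0-9a-f]+@[0-9a-f]+)$")


def parse_locator(locator: str) -> Signatures:
    """Split a locator into its block part and signature hints."""
    parts = locator.split("+")
    local = None
    remote = None
    for hint in parts[2:]:
        m = _REMOTE.match(hint)
        if m:
            remote = (m.group(1), m.group(2))
            continue
        m = _LOCAL.match(hint)
        if m:
            local = m.group(1)
    return Signatures("+".join(parts[:2]), local, remote)
```

With the @A@ prefix the two kinds of hint are told apart only by the @-@ after the cluster ID; the @R@ prefix makes the distinction explicit at the first character.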
| 47 | 7 | Tom Clegg | h2. Optimization: Data cache on cluster A |
| 48 | 3 | Peter Amstutz | |
| 49 | 7 | Tom Clegg | A keepstore service on cluster A, when proxying a GET request to cluster B, has some opportunities to conserve network resources: |
| 50 | # Before proxying, check whether the block exists on a local volume. If so: |
| 51 | ## Request a content challenge from the remote cluster to ensure the remote cluster does in fact have the data. (This can be skipped if cluster A trusts cluster B to enforce data access permissions.) |
| 52 | ## Return the local copy. |
| 53 | # When passing a proxied response through to the client, write the data to a local volume as well, so it can be returned more efficiently next time. |
| 54 | 3 | Peter Amstutz | |
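The two steps above amount to a read-through cache. The toy model below sketches the control flow only: @CachingKeepProxy@, @fetch_remote@, @challenge_remote@, and @trust_remote@ are hypothetical names, the local volume is a plain dict, and falling back to a normal proxied fetch when the challenge fails is an assumption, since the text does not say what should happen in that case.

```python
import hashlib
from typing import Callable, Dict


class CachingKeepProxy:
    """Toy model of a cluster A keepstore that caches remote blocks."""

    def __init__(
        self,
        fetch_remote: Callable[[str], bytes],      # proxied GET to cluster B
        challenge_remote: Callable[[str], bool],   # content challenge
        trust_remote: bool = False,                # skip the challenge if True
    ) -> None:
        self.local_volume: Dict[str, bytes] = {}
        self.fetch_remote = fetch_remote
        self.challenge_remote = challenge_remote
        self.trust_remote = trust_remote

    def get(self, locator: str) -> bytes:
        block_hash = locator.split("+", 1)[0]
        cached = self.local_volume.get(block_hash)
        # Step 1: serve the local copy once the remote side is confirmed
        # (or trusted) to hold the same data.
        if cached is not None and (
            self.trust_remote or self.challenge_remote(locator)
        ):
            return cached
        # Otherwise proxy the request and verify what came back.
        data = self.fetch_remote(locator)
        if hashlib.md5(data).hexdigest() != block_hash:
            raise ValueError("remote data does not match the requested hash")
        # Step 2: write through so the next read can be served locally.
        self.local_volume[block_hash] = data
        return data
```

A real keepstore would stream the response to the client while writing it to the volume rather than buffering the whole block first.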
| 55 | 7 | Tom Clegg | h2. Optimization: Identical content exists on cluster A |
| 56 | 3 | Peter Amstutz | |
| 57 | 7 | Tom Clegg | When proxying a "get collection by UUID" request to cluster B, cluster A might notice that the PDH returned by cluster B matches a collection stored on cluster A. In this case, all data blocks are already stored locally: cluster A can replace cluster B's signatures with its own, and the client will end up reading the blocks from local volumes. |
| 58 | 3 | Peter Amstutz | |
| 59 | 7 | Tom Clegg | To avoid an information leak, a configuration setting can restrict this optimization to cases where the caller's token has permission to read the existing local collection. |
| 60 | |||
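The decision described in this section, including the optional permission check, can be written as a small predicate. Everything here is illustrative: the function name, the two sets of PDHs, and the @require_read_permission@ flag are hypothetical stand-ins for database lookups and a configuration setting.

```python
from typing import AbstractSet


def may_substitute_local_signatures(
    pdh: str,
    local_pdhs: AbstractSet[str],       # PDHs of collections stored on cluster A
    readable_pdhs: AbstractSet[str],    # subset the caller's token can read
    require_read_permission: bool = True,
) -> bool:
    """Decide whether cluster A may re-sign a remote manifest locally."""
    if pdh not in local_pdhs:
        # No identical local collection, so the blocks may not be local.
        return False
    if require_read_permission and pdh not in readable_pdhs:
        # Re-signing would reveal that the content already exists on
        # cluster A to a caller who cannot read the local collection.
        return False
    return True
```

When the predicate returns false, cluster A would simply propagate cluster B's signatures with the remote cluster ID, as in the unoptimized path.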
| 61 | h2. Implementation |
| 62 | |||
| 63 | * #13993 [API] Fetch remote-hosted collection by UUID |
| 64 | * #13994 [Keepstore] Fetch blocks from federated clusters |