Keep » History » Version 3
Tom Clegg, 04/10/2013 10:21 PM
h1. Keep

Keep is a distributed content-addressable storage system designed for high performance in I/O-bound cluster environments.

Notable design goals and features include:

* High scalability
* Node-level redundancy
* Maximum overall throughput in a busy cluster environment
* Maximum data bandwidth from client to disk
* Minimum transaction overhead
* Elimination of disk thrashing (commonly caused by multiple simultaneous readers)
* Client-controlled redundancy

h2. Design

The above goals are accomplished by the following design features.

* Data is transferred directly between the client and the physical node where the disk is installed.
* Data collections are encoded in large (≤64 MiB) blocks to minimize short read/write operations.
* Each disk accepts only one block read or write operation at a time. This prevents disk thrashing and maximizes total throughput when many clients compete for a disk.
* The client controls storage redundancy directly by writing a block of data to multiple nodes, and can verify it just as easily by reading the block back from those nodes.
* Data block distribution is computed from a cryptographic digest of the data block being stored or retrieved. This eliminates the need for a central or synchronized database of block storage locations.

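The last design feature can be illustrated with a minimal sketch. It assumes a rendezvous-hashing scheme in which every node is ranked by the MD5 of the block digest concatenated with the node name. The function @probe_order@, the node names, and the ranking rule are hypothetical and are not taken from Keep itself; the point is only that any client can derive the same node order from the digest alone, without consulting a location database.

```python
import hashlib


def probe_order(block_digest: str, node_ids: list[str]) -> list[str]:
    """Rank nodes for one block using only the block digest and node names.

    Hypothetical scheme (rendezvous hashing), not Keep's documented algorithm.
    """
    def weight(node_id: str) -> str:
        # Every client computes the same weight for the same (block, node) pair.
        return hashlib.md5((block_digest + node_id).encode("utf-8")).hexdigest()

    return sorted(node_ids, key=weight)


nodes = ["keep0", "keep1", "keep2", "keep3"]  # hypothetical node names
digest = hashlib.md5(b"example block").hexdigest()

# A writer that wants two replicas uses the first two nodes in the order;
# a reader computes the same order and tries those nodes first.
replica_nodes = probe_order(digest, nodes)[:2]
```

Because the order depends only on the digest and the set of node names, a reader and a writer that know the same nodes agree on where a block belongs.
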
h2. Components

The Keep storage system consists of data block read/write services, SDKs, and management agents.

The responsibilities of the Keep service are:

* Write data blocks
* When writing: ensure data integrity by checking the data against the client-supplied cryptographic digest
* Read data blocks (subject to permission, which is determined by the system/metadata DB)
* Send read/write/error event logs to management agents

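The write-time integrity check could look roughly like the following sketch. It assumes the digest is a hex MD5 string and uses an in-memory dictionary in place of a disk; the names @BlockStore@ and @DigestMismatch@ are made up for illustration, and permission checks and event logging are left out.

```python
import hashlib


class DigestMismatch(Exception):
    """Raised when the data does not hash to the digest the client supplied."""


class BlockStore:
    """In-memory stand-in for one Keep service's disk (illustrative only)."""

    def __init__(self) -> None:
        self._blocks: dict[str, bytes] = {}

    def put(self, client_digest: str, data: bytes) -> str:
        # Assumption: the client-supplied digest is a hex MD5 string.
        actual = hashlib.md5(data).hexdigest()
        if actual != client_digest.lower():
            # Refuse to store data that was corrupted in transit.
            raise DigestMismatch(f"expected {client_digest}, got {actual}")
        self._blocks[actual] = data
        return actual

    def get(self, digest: str) -> bytes:
        # Raises KeyError if the block was never stored.
        return self._blocks[digest]
```

Rejecting the write before anything is stored means a block can never sit on disk under a digest that does not describe it.
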
The responsibilities of the SDK are:

* When writing: split data into ≤64 MiB chunks
* When writing: encode directory trees as manifests
* When writing: write data to the desired number of nodes to achieve storage redundancy
* After writing: register a collection with Arvados
* When reading: parse manifests
* When reading: verify data integrity by comparing locator to MD5 digest of retrieved data

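Two of the SDK duties above, chunking on write and digest verification on read, can be sketched as follows. This is an illustration under stated assumptions rather than the SDK's actual code: the locator is treated as the bare hex MD5 of the block, and the helper names @split_blocks@, @locator_for@, and @verify_block@ are hypothetical. Manifest encoding, replication, and collection registration are not shown.

```python
import hashlib
from typing import Iterator

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MiB upper bound on block size


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> Iterator[bytes]:
    """Yield consecutive chunks of at most block_size bytes."""
    for offset in range(0, len(data), block_size):
        yield data[offset:offset + block_size]


def locator_for(block: bytes) -> str:
    # Assumption: the locator is the bare hex MD5 digest of the block.
    return hashlib.md5(block).hexdigest()


def verify_block(locator: str, block: bytes) -> bytes:
    """Return the block unchanged if it matches its locator, else raise."""
    if hashlib.md5(block).hexdigest() != locator:
        raise ValueError(f"block does not match locator {locator}")
    return block


# A tiny block size keeps the example fast; real blocks would use BLOCK_SIZE.
payload = b"abcdefghij"
blocks = list(split_blocks(payload, block_size=4))
locators = [locator_for(block) for block in blocks]
restored = b"".join(verify_block(loc, blk) for loc, blk in zip(locators, blocks))
```

Since the locator is derived from the content, the reader needs no extra checksum: recomputing the digest of what came back is enough to detect corruption.
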
The responsibilities of management agents are:

* Verify validity of permission tokens
* Determine which blocks have higher or lower redundancy than required
* Monitor disk space and move or delete blocks as needed
* Collect per-user, per-group, per-node, and per-disk usage statistics
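
The redundancy check in the list above amounts to counting replicas per block and comparing the count with the required level. The sketch below assumes the agent already holds an index of which block digests each node stores and a table of required replica counts; both data structures, and the function name @audit_redundancy@, are hypothetical.

```python
from collections import Counter


def audit_redundancy(node_index, required):
    """Classify blocks as under- or over-replicated.

    node_index: dict mapping node name -> set of block digests on that node
    required:   dict mapping block digest -> required number of replicas
    Both inputs are assumed shapes, not a documented Keep interface.
    """
    counts = Counter()
    for digests in node_index.values():
        counts.update(digests)  # each node contributes one replica per block

    under = {d: counts[d] for d, want in required.items() if counts[d] < want}
    over = {d: counts[d] for d, want in required.items() if counts[d] > want}
    return under, over


node_index = {
    "keep0": {"aaa", "bbb"},
    "keep1": {"aaa"},
    "keep2": {"aaa", "ccc"},
}
required = {"aaa": 2, "bbb": 2, "ccc": 1, "ddd": 1}

under, over = audit_redundancy(node_index, required)
# under == {"bbb": 1, "ddd": 0}; over == {"aaa": 3}
```

Under-replicated blocks are candidates for copying to another node, and over-replicated ones are candidates for deletion when disk space runs low.
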