h1. Data Manager

The Data Manager enforces policies and generates reports about storage resource usage. The Data Manager interacts with the [[Keep server]] and the metadata database. Clients/users do not interact with the Data Manager directly: the metadata database service acts as a proxy/cache on their behalf, and is responsible for access controls.

See also:

* [[Keep server]]
* [[Keep manifest format]]
* source: n/a (design phase)

Responsibilities:

* Garbage collector: decide what is eligible for deletion (and some partial order of preference)
* Replication enforcer: copy and delete blocks in various backing stores to achieve the desired replication level (sketched below)
* Rebalancer: move blocks to redistribute free space and reduce client probes
* Data location index: know which backing stores should be contacted to retrieve a given block
* Report query engine

Example reports/queries:

* for managers: how much disk space is being conserved due to CAS
* for managers: how much disk space is occupied in a given backing store service
* for managers: how disk usage would be affected by modifying storage policy
* for managers: how much disk space+time is used (per user, group, node, disk)
* for users: when the replication/policy specified for a collection is not currently satisfied (and why, for how long, etc.)
* for users: how much disk space is represented by a given set of collections
* for users: how much disk space can be made available by garbage collection
* for users: how soon they should expect their cached data to disappear
* for users: performance statistics (how fast should I expect my job to read data?)
* for ops: where each block was most recently read/written, in case data recovery is needed
* for ops: how unbalanced the backing stores are across the cluster
* for ops: activity level and performance statistics
* for ops: activity level vs. amount of space (how much of the data is being accessed by users?)
* for ops: disk performance/error/status trends (and SMART reports) to help identify bad hardware
* for ops: history of disk adds, removals, and moves

Basic kinds of data in the index:

* Which blocks are used by which collections (and which collections are valued by which users/groups)
* Which blocks are stored in which services (local Keep, remote Keep, other storage service)
* Which blocks are stored on which disks
* Which disks are attached to which nodes
* Aggregate read/write activity per block and per disk (where applicable, e.g., block stored in local Keep)
* Read events
* Write events
* Exceptions (checksum mismatch, IO error)
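To make the garbage collector and replication enforcer concrete, here is a minimal sketch of how an index record might drive those decisions. It is illustrative only (Go for concreteness; every type, field, and name below is hypothetical, not a committed design):

<pre><code class="go">
package main

import "fmt"

// BlockID is the content address (hash) of a stored block.
type BlockID string

// BlockState is a hypothetical index record: where a block is stored,
// and how much replication the collections using it want.
type BlockState struct {
	Stores     []string // backing stores currently holding a copy
	WantCopies int      // highest replication wanted by any collection
	InUse      bool     // referenced by at least one collection
}

// Action is one replication-enforcer / garbage-collector decision.
type Action struct {
	Block BlockID
	Op    string // "pull", "trash", or "ok"
}

// plan compares actual vs. desired replication for each block.
func plan(index map[BlockID]BlockState) []Action {
	var out []Action
	for id, st := range index {
		switch {
		case !st.InUse:
			// Unreferenced: eligible for garbage collection.
			out = append(out, Action{id, "trash"})
		case len(st.Stores) < st.WantCopies:
			// Under-replicated: copy to more backing stores.
			out = append(out, Action{id, "pull"})
		default:
			out = append(out, Action{id, "ok"})
		}
	}
	return out
}

func main() {
	index := map[BlockID]BlockState{
		"acbd18db4cc2f85cedef654fccc4a4d8": {Stores: []string{"keep0"}, WantCopies: 2, InUse: true},
		"37b51d194a7513e45b56f6524f2d51f2": {Stores: []string{"keep0", "keep1"}, WantCopies: 2, InUse: false},
	}
	for _, a := range plan(index) {
		fmt.Println(a.Op, a.Block)
	}
}
</code></pre>

A real enforcer would also need the partial order of deletion preference and per-store free-space data mentioned above; this sketch shows only the core comparison.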
h2. Implementation considerations

Overview

* REST service for queries
** All requests require authentication. Token validity is verified against the Metadata server, and cached locally (sketched below).
* Subscribes to system event log
* Connects to metadata API server (has a system_user token), at least periodically, to ensure eventual consistency with the metadata DB's idea of what data is important

Permissions

* Support +A tokens like [[Keep server]] when accepting collection/blob uuids in requests? (sketched below)
* Require admin api_token for some queries, site-configurable?

Distributed/asynchronous

* Easy to run multiple keep index services.
* Most features do not need synchronous operation / real-time data.
* Features that move or delete data should be tied to a single "primary" indexing service (a failover event likely requires resetting some state).
* Substantial disagreement between multiple index services should be easy to flag on the admin dashboard (sketched below).
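Flagging disagreement between index services could start with something as cheap as comparing one summary statistic per service. A hedged sketch, assuming each service can report how many blocks it believes exist (all names and the 0.5% threshold are hypothetical):

<pre><code class="go">
package main

import "fmt"

// IndexSummary is a hypothetical per-service snapshot: the number of
// blocks each index service currently believes the cluster holds.
type IndexSummary struct {
	Service string
	Blocks  int64
}

// disagreement returns the largest relative difference in block
// counts across services. The admin dashboard could flag any value
// above a site-configurable threshold.
func disagreement(sums []IndexSummary) float64 {
	if len(sums) < 2 {
		return 0
	}
	min, max := sums[0].Blocks, sums[0].Blocks
	for _, s := range sums[1:] {
		if s.Blocks < min {
			min = s.Blocks
		}
		if s.Blocks > max {
			max = s.Blocks
		}
	}
	if max == 0 {
		return 0
	}
	return float64(max-min) / float64(max)
}

func main() {
	sums := []IndexSummary{{"index0", 100000}, {"index1", 99400}}
	if d := disagreement(sums); d > 0.005 {
		fmt.Printf("WARNING: index services disagree by %.2f%%\n", 100*d)
	}
}
</code></pre>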
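The authentication bullet under Overview calls for verifying token validity against the Metadata server and caching the verdict locally. A minimal sketch of such a cache, with the actual verification call injected as a callback since that endpoint is not designed yet (all names hypothetical):

<pre><code class="go">
package main

import (
	"fmt"
	"sync"
	"time"
)

// tokenCache remembers recent verdicts so that not every query costs
// a round trip to the Metadata server.
type tokenCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]cacheEntry
	// verify asks the Metadata server whether a token is valid.
	verify func(token string) bool
}

type cacheEntry struct {
	valid   bool
	expires time.Time
}

// Valid returns a cached verdict when fresh, otherwise re-verifies.
func (c *tokenCache) Valid(token string) bool {
	c.mu.Lock()
	e, ok := c.entries[token]
	c.mu.Unlock()
	if ok && time.Now().Before(e.expires) {
		return e.valid
	}
	v := c.verify(token) // cache miss or stale entry
	c.mu.Lock()
	c.entries[token] = cacheEntry{valid: v, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return v
}

func main() {
	calls := 0
	c := &tokenCache{
		ttl:     time.Minute,
		entries: map[string]cacheEntry{},
		verify: func(token string) bool {
			calls++ // stands in for a Metadata server round trip
			return token == "good-token"
		},
	}
	a, b := c.Valid("good-token"), c.Valid("good-token")
	fmt.Println(a, b, calls) // true true 1
}
</code></pre>

Note that a cached verdict delays revocation by at most the TTL, so the TTL should be kept short.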
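The +A question under Permissions assumes a [[Keep server]]-style signature: an HMAC embedded in a blob locator that binds the blob hash, the requesting api_token, and an expiry time to a site-wide secret, so the index service can check authorization offline. The sketch below shows that shape only; the message layout and signature encoding are placeholders, not the real [[Keep server]] format:

<pre><code class="go">
package main

import (
	"crypto/hmac"
	"crypto/sha1"
	"fmt"
)

// signLocator computes a +A-style permission hint. The message
// layout here is a placeholder; the point is that the signature
// binds blob hash, api_token, and expiry to a shared secret.
func signLocator(secret []byte, blobHash, apiToken, expiryHex string) string {
	h := hmac.New(sha1.New, secret)
	fmt.Fprintf(h, "%s@%s@%s", blobHash, apiToken, expiryHex)
	return fmt.Sprintf("%x", h.Sum(nil))
}

// verify checks a presented signature without contacting any other
// service, using a constant-time comparison.
func verify(secret []byte, blobHash, apiToken, expiryHex, sig string) bool {
	want := signLocator(secret, blobHash, apiToken, expiryHex)
	return hmac.Equal([]byte(want), []byte(sig))
}

func main() {
	secret := []byte("example-permission-secret")
	sig := signLocator(secret, "acbd18db4cc2f85cedef654fccc4a4d8", "user-token", "53011e80")
	fmt.Println(verify(secret, "acbd18db4cc2f85cedef654fccc4a4d8", "user-token", "53011e80", sig))
}
</code></pre>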