Node manager policy matrix » History » Version 1
Peter Amstutz, 04/14/2016 06:12 PM
h1. Node manager policy matrix

arvados node state (last_ping_at, crunch_worker_state):
* (no arvados record associated with cloud node) -> unpaired
* last_ping_at is stale -> down
* slurm state is idle -> idle
* slurm state is drng, alloc, maint -> busy
* slurm state is drain, down, error, fail, unknown, any* -> down
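
The mapping above can be sketched as a small classifier. This is an illustration only, not node manager's actual code: the function name, the @STALE_PING_SECONDS@ threshold, and the representation of @last_ping_at@ as epoch seconds are assumptions, as is reading "any*" as "any slurm state with a trailing asterisk". Treating unlisted slurm states as down is also an assumption.

```python
import time

# Hypothetical threshold: the source does not say when a ping counts as stale.
STALE_PING_SECONDS = 600

# slurm states the policy treats as busy.
BUSY_SLURM_STATES = {"drng", "alloc", "maint"}


def crunch_worker_state(arvados_node, slurm_state, now=None):
    """Classify a cloud node as 'unpaired', 'down', 'idle', or 'busy'.

    arvados_node is None when no Arvados record is paired with the cloud
    node; otherwise it is a dict with 'last_ping_at' in epoch seconds
    (an assumed representation).
    """
    now = time.time() if now is None else now

    if arvados_node is None:
        return "unpaired"

    last_ping = arvados_node.get("last_ping_at")
    if last_ping is None or now - last_ping > STALE_PING_SECONDS:
        return "down"

    # Assumed reading of "any*": a state with a trailing asterisk.
    if slurm_state.endswith("*"):
        return "down"
    if slurm_state == "idle":
        return "idle"
    if slurm_state in BUSY_SLURM_STATES:
        return "busy"

    # drain, down, error, fail, unknown, and (by assumption) anything else.
    return "down"
```

The checks run in the same order as the list, so a stale ping wins over whatever slurm reports.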

billing window:
* open
* closed

boot_grace (time since boot):
* under boot expiry -> boot wait
* exceeded boot expiry -> boot exceeded

idle_grace (time since last state change to "idle"):
* arvados node state is not 'idle' -> not idle
* idle and not exceeded grace period -> idle wait
* idle and exceeded grace period -> idle exceeded
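
The two grace classifications can be sketched as follows. The function and parameter names are hypothetical, durations are assumed to be in seconds, and the handling of the exact boundary value (strictly "under" for boot, "not exceeded" for idle) is one reading of the wording above.

```python
def boot_grace(seconds_since_boot, boot_expiry):
    """Return 'boot wait' while the node is still under its boot expiry."""
    if seconds_since_boot < boot_expiry:
        return "boot wait"
    return "boot exceeded"


def idle_grace(node_state, seconds_idle, grace_period):
    """Return the idle_grace label for a node.

    node_state is the arvados node state ('unpaired', 'busy', 'idle',
    'down'); seconds_idle is the time since the last change to 'idle'.
    """
    if node_state != "idle":
        return "not idle"
    if seconds_idle <= grace_period:
        return "idle wait"
    return "idle exceeded"
```

For example, @idle_grace("busy", 9999, 300)@ gives @"not idle"@ regardless of the elapsed time, because only idle nodes accumulate idle time.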

Node manager constructs a state tuple (crunch_worker_state, window, boot_grace, idle_grace) and then consults the following table to determine what action to take. Actions are:

* None (do nothing)
* START_DRAIN (put the node into slurm draining state)
* START_SHUTDOWN (initiate cloud shutdown)

<pre>
crunch_worker_state = ['unpaired', 'busy', 'idle', 'down']
window = ["open", "closed"]
boot_grace = ["boot wait", "boot exceeded"]
idle_grace = ["not idle", "idle wait", "idle exceeded"]

{('busy', 'closed', 'boot exceeded', 'idle exceeded'): None,
('busy', 'closed', 'boot exceeded', 'idle wait'): None,
('busy', 'closed', 'boot exceeded', 'not idle'): None,
('busy', 'closed', 'boot wait', 'idle exceeded'): None,
('busy', 'closed', 'boot wait', 'idle wait'): None,
('busy', 'closed', 'boot wait', 'not idle'): None,
('busy', 'open', 'boot exceeded', 'idle exceeded'): None,
('busy', 'open', 'boot exceeded', 'idle wait'): None,
('busy', 'open', 'boot exceeded', 'not idle'): None,
('busy', 'open', 'boot wait', 'idle exceeded'): None,
('busy', 'open', 'boot wait', 'idle wait'): None,
('busy', 'open', 'boot wait', 'not idle'): None,

('down', 'closed', 'boot exceeded', 'idle exceeded'): "START_SHUTDOWN",
('down', 'closed', 'boot exceeded', 'idle wait'): "START_SHUTDOWN",
('down', 'closed', 'boot exceeded', 'not idle'): "START_SHUTDOWN",
('down', 'closed', 'boot wait', 'idle exceeded'): "START_SHUTDOWN",
('down', 'closed', 'boot wait', 'idle wait'): "START_SHUTDOWN",
('down', 'closed', 'boot wait', 'not idle'): "START_SHUTDOWN",
('down', 'open', 'boot exceeded', 'idle exceeded'): "START_SHUTDOWN",
('down', 'open', 'boot exceeded', 'idle wait'): "START_SHUTDOWN",
('down', 'open', 'boot exceeded', 'not idle'): "START_SHUTDOWN",
('down', 'open', 'boot wait', 'idle exceeded'): "START_SHUTDOWN",
('down', 'open', 'boot wait', 'idle wait'): "START_SHUTDOWN",
('down', 'open', 'boot wait', 'not idle'): "START_SHUTDOWN",

('idle', 'closed', 'boot exceeded', 'idle exceeded'): None,
('idle', 'closed', 'boot exceeded', 'idle wait'): None,
('idle', 'closed', 'boot exceeded', 'not idle'): None,
('idle', 'closed', 'boot wait', 'idle exceeded'): None,
('idle', 'closed', 'boot wait', 'idle wait'): None,
('idle', 'closed', 'boot wait', 'not idle'): None,
('idle', 'open', 'boot exceeded', 'idle exceeded'): "START_DRAIN",
('idle', 'open', 'boot exceeded', 'idle wait'): None,
('idle', 'open', 'boot exceeded', 'not idle'): None,
('idle', 'open', 'boot wait', 'idle exceeded'): "START_DRAIN",
('idle', 'open', 'boot wait', 'idle wait'): None,
('idle', 'open', 'boot wait', 'not idle'): None,

('unpaired', 'closed', 'boot exceeded', 'idle exceeded'): "START_SHUTDOWN",
('unpaired', 'closed', 'boot exceeded', 'idle wait'): "START_SHUTDOWN",
('unpaired', 'closed', 'boot exceeded', 'not idle'): "START_SHUTDOWN",
('unpaired', 'closed', 'boot wait', 'idle exceeded'): None,
('unpaired', 'closed', 'boot wait', 'idle wait'): None,
('unpaired', 'closed', 'boot wait', 'not idle'): None,
('unpaired', 'open', 'boot exceeded', 'idle exceeded'): "START_SHUTDOWN",
('unpaired', 'open', 'boot exceeded', 'idle wait'): "START_SHUTDOWN",
('unpaired', 'open', 'boot exceeded', 'not idle'): "START_SHUTDOWN",
('unpaired', 'open', 'boot wait', 'idle exceeded'): None,
('unpaired', 'open', 'boot wait', 'idle wait'): None,
('unpaired', 'open', 'boot wait', 'not idle'): None}
</pre>
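
Read row by row, the 48 entries reduce to three rules: a down node is always shut down, an unpaired node is shut down once its boot expiry has passed, and an idle node is drained when the billing window is open and its idle grace period has been exceeded. Everything else maps to None. The sketch below regenerates the table from those rules; @policy_action@ and @POLICY@ are hypothetical names chosen for illustration, not identifiers from node manager.

```python
from itertools import product

CRUNCH_WORKER_STATES = ["unpaired", "busy", "idle", "down"]
WINDOWS = ["open", "closed"]
BOOT_GRACE = ["boot wait", "boot exceeded"]
IDLE_GRACE = ["not idle", "idle wait", "idle exceeded"]


def policy_action(state, window, boot, idle):
    """Return the action for one (state, window, boot_grace, idle_grace) tuple."""
    if state == "down":
        return "START_SHUTDOWN"
    if state == "unpaired" and boot == "boot exceeded":
        return "START_SHUTDOWN"
    if state == "idle" and window == "open" and idle == "idle exceeded":
        return "START_DRAIN"
    # busy nodes, and every remaining combination, are left alone.
    return None


# Expand the rules into the full lookup table, keyed by state tuple.
POLICY = {
    key: policy_action(*key)
    for key in product(CRUNCH_WORKER_STATES, WINDOWS, BOOT_GRACE, IDLE_GRACE)
}

# Consulting the table for one node:
action = POLICY[("idle", "open", "boot exceeded", "idle exceeded")]
```

Keeping the explicit table makes every combination reviewable at a glance, while the rule form makes it easier to see which dimensions each decision actually depends on.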

Note on libcloud node states:
* error, unknown -> broken
* everything else -> ok

However, we don't use the libcloud node state: it is expensive to fetch on some clouds and not as useful as knowing whether the node is actually live and in communication. A blanket policy that shuts down nodes that are unavailable to do useful work should also catch broken nodes.