h1. Crunch v2 cloud scheduling

Options:

h2. SLURM (no node sharing)

Don't try to share nodes; run one container per node.

Extend the existing "want list" logic in node manager to count queued/locked/running containers.

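A rough sketch of what that extended calculation could look like, in Python; @size_for_constraints@ is an illustrative stand-in, not node manager's real API:

<pre><code class="python">
# Hypothetical sketch: derive the want list from container records.
# size_for_constraints() is an assumed helper that maps a container's
# runtime constraints to a cloud node size; it is not a real API.
from collections import Counter

ACTIVE_STATES = {"Queued", "Locked", "Running"}

def want_list(containers, size_for_constraints):
    """Count the node sizes needed to satisfy every container that is
    queued, locked, or running."""
    sizes = Counter()
    for container in containers:
        if container["state"] in ACTIVE_STATES:
            sizes[size_for_constraints(container["runtime_constraints"])] += 1
    return sizes
</code></pre>
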
Tasks: update node manager; disable node sharing in the slurm config.

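For reference, a minimal @slurm.conf@ fragment that disables node sharing; node and partition names are placeholders:

<pre>
# select/linear allocates whole nodes, so no two jobs share a node
SelectType=select/linear

# Placeholder node/partition definitions; Shared=EXCLUSIVE also forces
# whole-node allocation per job
NodeName=compute[0-255] State=UNKNOWN
PartitionName=compute Nodes=compute[0-255] Default=YES Shared=EXCLUSIVE State=UP
</pre>
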
h2. SLURM (support node sharing)

https://slurm.schedmd.com/elastic_computing.html

For each node type, list a range of nodes in slurm.conf.

Nodes are in the "CLOUD" state, which hides them from sinfo.

Slurm calls "ResumeProgram" and "SuspendProgram" with the node name when it wants a node or is done with one; these programs are responsible for creating and destroying cloud nodes.

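A sketch of the corresponding @slurm.conf@ for elastic computing; the node type names, counts, hardware figures, and script paths are placeholders:

<pre>
# One NodeName range per cloud node type; State=CLOUD hides
# unprovisioned nodes from sinfo
NodeName=small[0-99] CPUs=2 RealMemory=7680 State=CLOUD
NodeName=large[0-99] CPUs=16 RealMemory=61440 State=CLOUD
PartitionName=compute Nodes=small[0-99],large[0-99] Default=YES State=UP

# Scripts slurm runs to create and destroy cloud nodes
ResumeProgram=/usr/local/bin/slurm-resume
SuspendProgram=/usr/local/bin/slurm-suspend
ResumeTimeout=300   # seconds to wait for a node to come up
SuspendTime=600     # idle seconds before a node is handed to SuspendProgram
</pre>
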
| 23 | "ResumeProgram" maps the nodename to the node type and tells the cloud to create a new node (which must be assigned the provided hostname). This could involve a communication with node manager, or we write new programs that do one-off node creation and deletion. |
||
| 24 | |||
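As a sketch of the standalone-program option, in Python; the prefix-to-size table and @create_node@ are assumptions standing in for a real cloud driver:

<pre><code class="python">
#!/usr/bin/env python3
# Hypothetical ResumeProgram. Slurm passes a hostlist expression
# (e.g. "small[3-5]") as argv[1]; the node type is inferred from the
# hostname prefix. create_node() stands in for the real cloud API call.
import subprocess
import sys

NODE_TYPES = {"small": "m4.large", "large": "m4.4xlarge"}  # assumed mapping

def expand(hostlist):
    # Let slurm expand "small[3-5]" into individual hostnames.
    out = subprocess.check_output(["scontrol", "show", "hostnames", hostlist])
    return out.decode().split()

def create_node(hostname, size):
    raise NotImplementedError("create a cloud node named %s of size %s"
                              % (hostname, size))

def main():
    for hostname in expand(sys.argv[1]):
        prefix = hostname.rstrip("0123456789")
        create_node(hostname, NODE_TYPES[prefix])

if __name__ == "__main__":
    main()
</code></pre>
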
If we use node manager, it needs a mechanism for signaling that specific nodes should be up or down. The current "want list" only provides node sizes, so it must be extended to provide (hostname, nodesize) pairs.

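Sketched as data, the change looks like this (sizes and hostnames are illustrative):

<pre><code class="python">
# Today: the want list is just a multiset of node sizes.
want = ["m4.large", "m4.large", "m4.4xlarge"]

# Proposed: each entry names the specific slurm host to bring up, so
# ResumeProgram/SuspendProgram can request exactly the node slurm asked for.
want = [("small3", "m4.large"), ("small4", "m4.large"), ("large0", "m4.4xlarge")]
</code></pre>
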
Tasks are either:

* Determine how to communicate desired node state to node manager; update node manager; ResumeProgram/SuspendProgram become simple clients that just set the "desired up state" flag (see the sketch after this list).
* Write new standalone ResumeProgram/SuspendProgram programs.

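A sketch of the first option's simple client; the node manager endpoint and payload shape are assumptions, since no such API exists yet:

<pre><code class="python">
#!/usr/bin/env python3
# Hypothetical ResumeProgram/SuspendProgram client. The node manager
# endpoint and payload shape are assumptions, not an existing API.
import json
import subprocess
import sys
import urllib.request

NODE_MANAGER = "http://localhost:8989/desired_state"  # assumed endpoint

def set_desired_state(hostlist, state):
    # Expand slurm's hostlist expression (e.g. "small[3-5]") into hostnames.
    hostnames = subprocess.check_output(
        ["scontrol", "show", "hostnames", hostlist]).decode().split()
    for hostname in hostnames:
        req = urllib.request.Request(
            NODE_MANAGER,
            json.dumps({"hostname": hostname, "desired_state": state}).encode(),
            {"Content-Type": "application/json"})
        urllib.request.urlopen(req)

if __name__ == "__main__":
    # Slurm invokes the program with the hostlist as argv[1]; pick the
    # direction from the program name (slurm-resume vs slurm-suspend).
    set_desired_state(sys.argv[1], "up" if "resume" in sys.argv[0] else "down")
</code></pre>
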
h2. Home-grown

h2. Something else

Mesos, Kubernetes, OpenLava, etc.

Unknown amount of effort.