h1. Crunch v2 cloud scheduling

Options:

h2. SLURM (no node sharing)

Don't try to share nodes; run one container per node.

Extend the existing "want list" logic in node manager to count queued/locked/running containers.

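A rough sketch of what that extended calculation could look like, in Python; @size_for_constraints@ is an illustrative stand-in, not node manager's real API:

<pre><code class="python">
# Hypothetical sketch: derive the want list from container records.
# size_for_constraints() is an assumed helper that maps a container's
# runtime constraints to a cloud node size; it is not a real API.
from collections import Counter

ACTIVE_STATES = {"Queued", "Locked", "Running"}

def want_list(containers, size_for_constraints):
    """Count the node sizes needed to satisfy every container that is
    queued, locked, or running."""
    sizes = Counter()
    for container in containers:
        if container["state"] in ACTIVE_STATES:
            sizes[size_for_constraints(container["runtime_constraints"])] += 1
    return sizes
</code></pre>
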
Tasks: update node manager; disable node sharing in the slurm config.

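For reference, a minimal @slurm.conf@ fragment that disables node sharing; node and partition names are placeholders:

<pre>
# select/linear allocates whole nodes, so no two jobs share a node
SelectType=select/linear

# Placeholder node/partition definitions; Shared=EXCLUSIVE also forces
# whole-node allocation per job
NodeName=compute[0-255] State=UNKNOWN
PartitionName=compute Nodes=compute[0-255] Default=YES Shared=EXCLUSIVE State=UP
</pre>
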
h2. SLURM (support node sharing)

https://slurm.schedmd.com/elastic_computing.html

For each node type, list a range of nodes in slurm.conf.

Nodes are in the "CLOUD" state, which hides them from sinfo.

Slurm calls "ResumeProgram" and "SuspendProgram" with the node name when it wants a node or is done with one; these programs are responsible for creating and destroying cloud nodes.

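A sketch of the corresponding @slurm.conf@ for elastic computing; the node type names, counts, hardware figures, and script paths are placeholders:

<pre>
# One NodeName range per cloud node type; State=CLOUD hides
# unprovisioned nodes from sinfo
NodeName=small[0-99] CPUs=2 RealMemory=7680 State=CLOUD
NodeName=large[0-99] CPUs=16 RealMemory=61440 State=CLOUD
PartitionName=compute Nodes=small[0-99],large[0-99] Default=YES State=UP

# Scripts slurm runs to create and destroy cloud nodes
ResumeProgram=/usr/local/bin/slurm-resume
SuspendProgram=/usr/local/bin/slurm-suspend
ResumeTimeout=300   # seconds to wait for a node to come up
SuspendTime=600     # idle seconds before a node is handed to SuspendProgram
</pre>
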
| 23 | "ResumeProgram" maps the nodename to the node type and tells the cloud to create a new node (which must be assigned the provided hostname). This could involve a communication with node manager, or we write new programs that do one-off node creation and deletion. |
||
| 24 | |||
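As a sketch of the standalone-program option, in Python; the prefix-to-size table and @create_node@ are assumptions standing in for a real cloud driver:

<pre><code class="python">
#!/usr/bin/env python3
# Hypothetical ResumeProgram. Slurm passes a hostlist expression
# (e.g. "small[3-5]") as argv[1]; the node type is inferred from the
# hostname prefix. create_node() stands in for the real cloud API call.
import subprocess
import sys

NODE_TYPES = {"small": "m4.large", "large": "m4.4xlarge"}  # assumed mapping

def expand(hostlist):
    # Let slurm expand "small[3-5]" into individual hostnames.
    out = subprocess.check_output(["scontrol", "show", "hostnames", hostlist])
    return out.decode().split()

def create_node(hostname, size):
    raise NotImplementedError("create a cloud node named %s of size %s"
                              % (hostname, size))

def main():
    for hostname in expand(sys.argv[1]):
        prefix = hostname.rstrip("0123456789")
        create_node(hostname, NODE_TYPES[prefix])

if __name__ == "__main__":
    main()
</code></pre>
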
If we use node manager, it needs a mechanism for signaling that specific nodes should be up or down. The current "want list" only provides node sizes, so it must be extended to provide (hostname, nodesize) pairs.

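Sketched as data, the change looks like this (sizes and hostnames are illustrative):

<pre><code class="python">
# Today: the want list is just a multiset of node sizes.
want = ["m4.large", "m4.large", "m4.4xlarge"]

# Proposed: each entry names the specific slurm host to bring up, so
# ResumeProgram/SuspendProgram can request exactly the node slurm asked for.
want = [("small3", "m4.large"), ("small4", "m4.large"), ("large0", "m4.4xlarge")]
</code></pre>
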
Tasks are either:

* Determine how to communicate desired node state to node manager; update node manager; ResumeProgram/SuspendProgram become simple clients that just set the "desired up state" flag (see the sketch after this list).
* Write new standalone ResumeProgram/SuspendProgram programs.

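A sketch of the first option's simple client; the node manager endpoint and payload shape are assumptions, since no such API exists yet:

<pre><code class="python">
#!/usr/bin/env python3
# Hypothetical ResumeProgram/SuspendProgram client. The node manager
# endpoint and payload shape are assumptions, not an existing API.
import json
import subprocess
import sys
import urllib.request

NODE_MANAGER = "http://localhost:8989/desired_state"  # assumed endpoint

def set_desired_state(hostlist, state):
    # Expand slurm's hostlist expression (e.g. "small[3-5]") into hostnames.
    hostnames = subprocess.check_output(
        ["scontrol", "show", "hostnames", hostlist]).decode().split()
    for hostname in hostnames:
        req = urllib.request.Request(
            NODE_MANAGER,
            json.dumps({"hostname": hostname, "desired_state": state}).encode(),
            {"Content-Type": "application/json"})
        urllib.request.urlopen(req)

if __name__ == "__main__":
    # Slurm invokes the program with the hostlist as argv[1]; pick the
    # direction from the program name (slurm-resume vs slurm-suspend).
    set_desired_state(sys.argv[1], "up" if "resume" in sys.argv[0] else "down")
</code></pre>
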
h2. Home-grown

h2. Something else

Mesos, Kubernetes, OpenLava, etc.

Unknown amount of effort.