Story #11349
Closed: [Node Manager] Add status URL for node manager
Added by Tom Morris almost 8 years ago. Updated over 7 years ago.
100%
Description
Implement an HTTP server that serves a status URL with JSON output, on a configurable port. The output includes:
- List of node sizes
- Number of nodes in each state
- State of each node
Updated by Tom Morris almost 8 years ago
- Description updated (diff)
- Story points set to 2.0
Updated by Tom Clegg almost 8 years ago
See source:sdk/python/tests/keepstub.py and source:sdk/python/tests/test_keep_client.py for an example of starting up a multithreaded HTTP server.
Suggest maintaining a global status variable, protected by a mutex, and just dumping its content in the status.json handler.
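A minimal sketch of that suggestion, using only the Python 3 standard library. The names StatusTracker, StatusHandler, and start_server are hypothetical and chosen for illustration; this is not the code from the branch.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class StatusTracker:
    """Holds the shared status dict; every access goes through one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._status = {}

    def update(self, changes):
        # Merge new values into the shared dict while holding the mutex.
        with self._lock:
            self._status.update(changes)

    def to_json(self):
        # Serialize under the lock so a reader never sees a partial update.
        with self._lock:
            return json.dumps(self._status, sort_keys=True)


# The "global status variable" (hypothetical name).
tracker = StatusTracker()


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != '/status.json':
            self.send_error(404)
            return
        # Just dump the current content of the tracker.
        body = tracker.to_json().encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the example quiet


def start_server(address='127.0.0.1', port=0):
    """Serve in a daemon thread; port 0 lets the OS pick a free port."""
    server = ThreadingHTTPServer((address, port), StatusHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

Other parts of the daemon would call tracker.update({...}) whenever a count changes, and the handler never touches the dict without the lock.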
Updated by Tom Clegg almost 8 years ago
- Assigned To set to Tom Clegg
- Target version changed from Arvados Future Sprints to 2017-04-12 sprint
Updated by Tom Clegg almost 8 years ago
details (proposed):
New config section "[Manage]" with "address" (default 127.0.0.1) and "port" (default -1, which disables management server)
status.json response
{
"nodes_up": 3,
"nodes_shutdown": 1,
"nodes_booting": 2,
"nodes_wish": 4
}
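One way the proposed section could be read, sketched with Python 3's configparser. It reads the defaults as a loopback address of 127.0.0.1 and a port of -1 (disabled). The function name load_manage_config is hypothetical, and treating every negative port as "disabled" is an assumption; the proposal only names -1.

```python
import configparser

# Defaults as read from the proposal: loopback address, port -1 (disabled).
DEFAULTS = {'Manage': {'address': '127.0.0.1', 'port': '-1'}}


def load_manage_config(text):
    """Parse config text and return (address, port, enabled)."""
    parser = configparser.ConfigParser()
    parser.read_dict(DEFAULTS)   # load defaults first
    parser.read_string(text)     # then let the config file override them
    address = parser.get('Manage', 'address')
    port = parser.getint('Manage', 'port')
    # Assumption: any negative port means "do not start the server".
    enabled = port >= 0
    return address, port, enabled
```

With an empty config the server stays off; adding "[Manage]" with "port = 8989" would enable it on 127.0.0.1.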
Updated by Tom Clegg almost 8 years ago
11349-nodemanager-status-api @ ab9a73d2c0b567d3c05d1d4d8463633a69eafda2
Updated by Nico César almost 8 years ago
I see two clusters of questions that I get often: one group is "orchestration-related" or "pipeline-wide", and the other group is about the resources inside a node while a job is running.
From the first group I usually get questions like these (which this should help with):
- "why has my job been queued for X hours?" -> having a historical number of nodes in the wishlist could potentially give a clue.
- "my pipeline ran for 24 hours, which nodes did it use?" -> having a correlation of node with pipeline helps.
- "is my node actually doing something?" -> having a "node38: up" doesn't say much; I think that's a question to answer with logs.
- "how many cores/ram/big should my nodes have/be?" -> this is an analysis of the resources inside the node.
So I think we can pull information from node manager to answer the first group. Usually this implies that the "node size" isn't as important as "how long has it been up and in which state", so uniquely identifying each node and then being able to plot that is good. But I have to admit that too much detail could turn this into a Logstash nightmare-adventure I don't want to go on, so some summarized state values are a good first step.
the proposal is good:
{
"nodes_up": 3,
"nodes_shutdown": 1,
"nodes_booting": 2,
"nodes_wish": 4
}
Later it would be good to have unique node names and a way to report them over time (which is very difficult when they haven't been born yet and are still in the "nodes_wish" pile).
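As a rough illustration of keeping a history of the summarized values so they can be plotted, the sketch below turns timestamped status.json bodies into CSV rows. The function names and the CSV layout are assumptions made for this example; how the samples are fetched and stored is left open.

```python
import csv
import io
import json

# Column order for the summary counters named in the proposal.
FIELDS = ['nodes_up', 'nodes_shutdown', 'nodes_booting', 'nodes_wish']


def status_to_row(timestamp, status_json):
    """Return [timestamp, counters...]; a missing counter becomes ''."""
    status = json.loads(status_json)
    return [timestamp] + [status.get(name, '') for name in FIELDS]


def append_samples(out, samples):
    """Write one CSV row per (timestamp, status_json) pair to file-like out."""
    writer = csv.writer(out)
    for timestamp, status_json in samples:
        writer.writerow(status_to_row(timestamp, status_json))


if __name__ == '__main__':
    buf = io.StringIO()
    sample = '{"nodes_up": 3, "nodes_shutdown": 1, "nodes_booting": 2, "nodes_wish": 4}'
    append_samples(buf, [(1491000000, sample)])
    print(buf.getvalue(), end='')
```

A column of nodes_wish values over time is the kind of series that could help answer the "why is my job queued" question.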
Updated by Tom Clegg over 7 years ago
11349-nodemanager-status-api @ e7876a3ac520b128be7836e30172079ab2af5e45
Updated by Lucas Di Pentima over 7 years ago
Local test run was successful
Questions:
services/nodemanager/arvnodeman/status.py
- Do you think it would be good idea to log messages indicating when no management server is started (and maybe the reason?)
services/nodemanager/tests/test_status.py:43
- Is that assertion superfluous given the following one? if it’s to prove that old values remain, can it be checked outside the loop?
- Is the state of each node going to be included? (asking because it's mentioned on the story description)
Updated by Tom Clegg over 7 years ago
Lucas Di Pentima wrote:
Local test run was successful
services/nodemanager/arvnodeman/status.py
- Do you think it would be good idea to log messages indicating when no management server is started (and maybe the reason?)
Yes, added.
    if not self.enabled:
        _logger.warning("Management server disabled. "+
                        "Use [Manage] config section to enable.")
        return
services/nodemanager/tests/test_status.py:43
- Is that assertion superfluous given the following one? if it’s to prove that old values remain, can it be checked outside the loop?
Yes, moved it outside the loop.
- Is the state of each node going to be included? (asking because it's mentioned on the story description)
Indeed, we seem to have changed our minds about that: for now we just want a summary that we can graph easily.
Suggest adding "/nodes.json" with info about each node. (Not sure if we should keep this issue open for it or make a new one.)
11349-nodemanager-status-api @ a779382603d2da2ec38ceb8a21262cc4f151f077
Updated by Tom Clegg over 7 years ago
- Status changed from In Progress to Resolved