Feature #19377
closed[diagnostics] show health-check response
Added by Tom Clegg over 2 years ago. Updated about 2 years ago.
100%
Description
arvados-client sudo diagnostics
should perform a cluster-wide health check. This should work even if the "health aggregator" service is not running.
(Note we can't do this in the non-sudo case because arvados-client won't have the management token needed to query health-check endpoints. In general, arvados-client diagnostics
doesn't even run on the cluster's internal network, so it can't necessarily connect to all internal services.)
Add text to install guide about how to use the diagnostics and health check to establish confidence that a cluster is correctly installed & working.
Do this assuming correct information from the config file. We suspect some clusters may not have usable InternalURL values, in these cases we will fix the configuration and/or the installer.
Updated by Peter Amstutz over 2 years ago
- Target version changed from 2022-08-31 sprint to 2022-09-14 sprint
Updated by Peter Amstutz over 2 years ago
- Target version changed from 2022-09-14 sprint to 2022-09-28 sprint
Updated by Peter Amstutz over 2 years ago
- Target version changed from 2022-09-28 sprint to 2022-10-12 sprint
Updated by Peter Amstutz over 2 years ago
- Target version changed from 2022-10-12 sprint to 2022-09-28 sprint
Updated by Peter Amstutz over 2 years ago
- Target version changed from 2022-09-28 sprint to 2022-10-12 sprint
Updated by Peter Amstutz over 2 years ago
- Related to Story #18685: Synchronize configuration on multi-node cluster added
Updated by Tom Clegg over 2 years ago
19377-diagnostics-health-check @ fe5fc284f0dd1cdd0c01922a98bcf75b4186e80a -- developer-run-tests: #3309
(wb1 failure is surely not related)
Updated by Tom Clegg over 2 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|ed218eff47b6784b6c40fa919491d78fa78d1a19.
Updated by Peter Amstutz over 2 years ago
- Status changed from Resolved to Feedback
Updated by Peter Amstutz over 2 years ago
Comments about arvados-server check
, which of course also applies to arvados-client diagnostics
.
So arvados-server check
doesn't print anything by default. This doesn't inspire confidence, it just left me wondering if anything happened or if I did something wrong.
Diagnostics is only slightly better:
INFO 5: running health check (same as `arvados-server check`) INFO ... health check: reported clock skew 0s
A message like "all nodes report status OK with version X and with config hash Y, pass -yaml for more information" would be a big improvement. Then have a "-quiet" option that really suppresses all non-error output.
Start time or uptime for each service would also be a handy piece of information to return. I don't know if that's in scope for this story.
Is there a useful distinction between "Checks" and "Services" here? Also I have literally no idea what "N" means:
Checks: arvados-api-server+http://localhost:8004/_health/ping: ClockTime: "2022-10-06T15:17:20Z" ConfigSourceSHA256: 241ebaba8e8336f07978c13dbd9bce185ab4c467a7c306088fbc588131976840 ConfigSourceTimestamp: "2022-06-07T11:10:27.69153Z" HTTPStatusCode: 200 Health: OK Response: health: OK ResponseTime: 0.013167 Version: 2.5.0~dev20221005202835 Services: arvados-api-server: Health: OK "N": 1
Updated by Tom Clegg over 2 years ago
- Status changed from Feedback to In Progress
Added a "health check OK" message, and a -quiet
flag to disable it.
19377-diagnostics-health-check @ c1af8a9ad9d5d8fc0e6bf49ad2bcc9119c85bfdc -- developer-run-tests: #3312 (wb1 fail)
I think we could use a human readable cluster info/status report somewhere in between "OK" and "way too much detail but not much useful information.yaml".
I think that deserves some thought, because it doesn't seem like a great fit for either health checks (which are intended for monitoring/alerting and load balancing, hence primarily machine-readable) or diagnostics (which are intended to exercise the cluster and detect specific problems).
Updated by Peter Amstutz over 2 years ago
- Related to Story #19364: Document use of diagnostics & health check to check running versions, config file matching, & overall cluster functioning added
Updated by Tom Clegg over 2 years ago
-v
verbose mode that prints lots of additional information even when diagnostic tests are passing, so diagnostics can serve as a single report with both (a) assurance that [some] things are working, and (b) cluster details that are likely to help troubleshooting problems that aren't specifically checked by diagnostics. Possible examples:
- Arvados version
- Nginx version
- config file timestamp/checksum
- hostname of machine where
arvados-client diagnostics
is running
Updated by Peter Amstutz over 2 years ago
- Target version changed from 2022-10-12 sprint to 2022-10-26 sprint
Updated by Peter Amstutz over 2 years ago
Tom Clegg wrote in #note-15:
Added a "health check OK" message, and a
-quiet
flag to disable it.19377-diagnostics-health-check @ c1af8a9ad9d5d8fc0e6bf49ad2bcc9119c85bfdc -- developer-run-tests: #3312 (wb1 fail)
I think we could use a human readable cluster info/status report somewhere in between "OK" and "way too much detail but not much useful information.yaml".
I think that deserves some thought, because it doesn't seem like a great fit for either health checks (which are intended for monitoring/alerting and load balancing, hence primarily machine-readable) or diagnostics (which are intended to exercise the cluster and detect specific problems).
This LGTM.
Updated by Tom Clegg over 2 years ago
19377-verbose @ 58afa4202a5ff084fe7eee1a5274e02071858ba5 -- developer-run-tests: #3318
Added version reporting, and some info that we previously showed when using "debug" log level.
(we have "loglevel=debug" and "verbose" as distinct features -- "debug" shows more detail about how the diagnostic checks are succeeding/failing, while "verbose" shows additional information about the state of the cluster -- does this seem right?)
example
api.ce8i5:~# ./arvados-client sudo diagnostics -v INFO ... hostname = api.ce8i5.arvadosapi.com INFO 5: running health check (same as `arvados-server check`) INFO ... reported clock skew = 0s INFO ... arvados version = 2.5.0~dev20221013155415 INFO ... http frontend version = nginx/1.14.0 + Phusion Passenger(R) 6.0.15 INFO ... http frontend version = nginx/1.14.0 (Ubuntu) INFO ... config file sha256 = 241ebaba8e8336f07978c13dbd9bce185ab4c467a7c306088fbc588131976840 INFO 10: getting discovery document from https://ce8i5.arvadosapi.com/discovery/v1/apis/arvados/v1/rest INFO ... BlobSignatureTTL = 1209600 INFO 20: getting exported config from https://ce8i5.arvadosapi.com/arvados/v1/config INFO ... Collections.BlobSigning = true INFO 30: getting current user record INFO ... user uuid = ce8i5-tpzed-000000000000000 INFO 40: connecting to service endpoint https://keep.ce8i5.arvadosapi.com/ INFO 41: connecting to service endpoint https://*.collections.ce8i5.arvadosapi.com/ ...
Updated by Peter Amstutz over 2 years ago
Ok, this is a bit of a weird case. I ran "arvados-client diagnostics" from inside an arvbox instance, but set ARVADOS_API_HOST to point at ce8i5. It turns out that healthcheck reads /etc/arvados/config.yml so it contacted the arvbox instance for the healthcheck but then ran the rest of the diagnostics against ce8i5 as intended.
Could we add a basic sanity check that /etc/arvados/config.yml and ARVADOS_API_HOST are talking about the same cluster, maybe by comparing the ExternalURL for controller?
arvbox@ad53c3218ece:/var/lib/gopath/bin$ ./arvados-client diagnostics -external-client -v INFO ... hostname = ad53c3218ece INFO 5: running health check (same as `arvados-server check`) INFO ... skipping because provided token is not SystemRootToken ERROR health check: keep-balance: MISSING: no InternalURLs configured ERROR health check: version mismatch: arvados-api-server+http://localhost:8004 is running 2.4.2~dev20220803202846 -- expected dev (go1.17.7) ERROR health check: version mismatch: arvados-workbench1+https://172.17.0.2:443 is running 2.3.3~dev20220214191519 -- expected dev (go1.17.7) INFO ... reported clock skew = 0s INFO ... arvados version = dev INFO ... arvados version = 2.4.2~dev20220803202846 INFO ... arvados version = 2.3.3~dev20220214191519 INFO ... http frontend version = nginx/1.15.8 + Phusion Passenger 6.0.2 INFO ... http frontend version = nginx/1.14.2 INFO ... config file sha256 = 9dd86a9f3c90086e5328f3103f34bcf8f13eee96deb529d7a1ab9dd370954427 INFO 10: getting discovery document from https://pirca.arvadosapi.com/discovery/v1/apis/arvados/v1/rest INFO ... BlobSignatureTTL = 1209600
Also, when health check reported an error, it should mention that you can run "arvados-server check -yaml" to get the comprehensive report for each node.
Updated by Tom Clegg over 2 years ago
Peter Amstutz wrote in #note-21:
Could we add a basic sanity check that /etc/arvados/config.yml and ARVADOS_API_HOST are talking about the same cluster, maybe by comparing the ExternalURL for controller?
Oops. I fixed the existing sanity check that nearly worked there -- it said "skipping" but then went on to do the health check anyway. Now it says "skipping" and actually skips.
(My first thought was also to check the controller URL but -- unlike the token -- host:port in a URL can be equivalent without being equal, so I thought checking the token was the easiest way to check whether we're really in the "arvados-client sudo" situation vs. an unrelated config file just happens to be readable.)
Also, when health check reported an error, it should mention that you can run "arvados-server check -yaml" to get the comprehensive report for each node.
Added.
19377-verbose @ 14f6625379992bc3ee054ad419095b476a1c4284 -- developer-run-tests: #3325
Updated by Peter Amstutz about 2 years ago
Tom Clegg wrote in #note-22:
Oops. I fixed the existing sanity check that nearly worked there -- it said "skipping" but then went on to do the health check anyway. Now it says "skipping" and actually skips.
(My first thought was also to check the controller URL but -- unlike the token -- host:port in a URL can be equivalent without being equal, so I thought checking the token was the easiest way to check whether we're really in the "arvados-client sudo" situation vs. an unrelated config file just happens to be readable.)
So it does skip it now, but the error message still doesn't really explain what happened. It should say something like "the system root token that was read from /etc/arvados/config.yml doesn't match the token in $ARVADOS_API_TOKEN".
Actually I'm playing with it some more and I'm more confused,
Also, when health check reported an error, it should mention that you can run "arvados-server check -yaml" to get the comprehensive report for each node.
Added.
This is logged at "INFO" level, which means it doesn't show up in the error summary.
INFO ... consider running `arvados-server check -yaml` for a comprehensive report ...30 more lines of stuff... --- cut here --- error summary --- ERROR health check: keep-balance: MISSING: no InternalURLs configured ERROR health check: version mismatch: arvados-workbench1+https://172.17.0.3:443 is running 2.3.3~dev20220214191519 -- expected dev (go1.17.7) ERROR health check: version mismatch: arvados-api-server+http://localhost:8004 is running 2.4.2~dev20220803202846 -- expected dev (go1.17.7) ERROR 160: running a container (8210 ms): container request x94bf-xvhdp-7blvxox8522h2zk is final but container x94bf-dz642-igejnw1p2dzjb9o did not complete: container state = "Cancelled" $
Also, I had a thought that -json
output option would be handy to be able to run it through jq
.
Updated by Tom Clegg about 2 years ago
So it does skip it now, but the error message still doesn't really explain what happened. It should say something like "the system root token that was read from /etc/arvados/config.yml doesn't match the token in $ARVADOS_API_TOKEN".
Changed to
skipping because SystemRootToken read from /etc/arvados/config.yml does not match $ARVADOS_API_TOKEN (consider using 'arvados-client sudo diagnostics' to load endpoint and token from config file instead of environment)
Actually I'm playing with it some more and I'm more confused,
?
This is logged at "INFO" level, which means it doesn't show up in the error summary.
Right, "consider using x to get more detail" is not an error.
Repeating the errors at the end is meant to (a) ensure that, if any errors/problems are detected, the user's terminal window is clearly showing errors when the command finishes, even if the terminal is 80x25 and lots of other non-error messages have been shown and (b) offer a concise checklist of problems to fix, without by-the-way logging noise. I think repeating non-error informational messages too would tend to defeat both purposes.
19377-verbose @ bef451193ffe6c7239c545ed1c93769b50c9e0a8 -- developer-run-tests: #3336
Updated by Peter Amstutz about 2 years ago
- Target version changed from 2022-10-26 sprint to 2022-11-09 sprint
Updated by Peter Amstutz about 2 years ago
Tom Clegg wrote in #note-24:
Actually I'm playing with it some more and I'm more confused,
?
I think I started writing that, then I figured out what happened and wrote the note about improving the error message, and forgot to delete the original text. Whoops.
This is logged at "INFO" level, which means it doesn't show up in the error summary.
Right, "consider using x to get more detail" is not an error.
Repeating the errors at the end is meant to (a) ensure that, if any errors/problems are detected, the user's terminal window is clearly showing errors when the command finishes, even if the terminal is 80x25 and lots of other non-error messages have been shown and (b) offer a concise checklist of problems to fix, without by-the-way logging noise. I think repeating non-error informational messages too would tend to defeat both purposes.
19377-verbose @ bef451193ffe6c7239c545ed1c93769b50c9e0a8 -- developer-run-tests: #3336
My concern is that because it isn't repeated, most people will never see it, because why would you scroll up to look at the same error messages that you already have at the bottom. So right at the point in time you most need help debugging, the option most likely to help you isn't being presented front and center.
On the same theme, it might be better to have "skipping because SystemRootToken" be logged as an error, because it is almost certainly a user error and having it say "skipping" at the info level gives a false sense that things are fine when actually it has no idea if things are fine or not.
However if you're unconvinced don't want to change it, go ahead and merge.
Updated by Tom Clegg about 2 years ago
Peter Amstutz wrote in #note-26:
So right at the point in time you most need help debugging, the option most likely to help you isn't being presented front and center.
Hm, I guess I'm just not seeing the situation where the full health-check yaml output is especially helpful. Is there something we could add to the error messages instead of advising the user to look for clues in a yaml doc?
On the same theme, it might be better to have "skipping because SystemRootToken" be logged as an error, because it is almost certainly a user error and having it say "skipping" at the info level gives a false sense that things are fine when actually it has no idea if things are fine or not.
I like that option. I've made this an error, with advice about how to get rid of it one way or the other.
19377-verbose @ d5df19fdff62724f9faeb3bee17201363071bf9e -- developer-run-tests: #3344
Updated by Tom Clegg about 2 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|dee23e534ab1c84fbe2e34730c0327989cc263ac.