Feature #16838
closed
Added by Ward Vandewege over 4 years ago.
Updated about 4 years ago.
Estimated time:
(Total: 0.00 h)
Release relationship:
Auto
Description
As an indicator of how healthy our cloud is:
- avg runProbe duration by success/failed state (SummaryVec)
- Status changed from New to In Progress
- Assigned To set to Ward Vandewege
- Target version changed from 2020-10-07 Sprint to 2020-09-23 Sprint
TestProbeAndUpdate panics in WithLabelValues -- could solve this by calling pool.registerMetrics(prometheus.NewRegistry()) at worker_test.go L242
Not sure about calling Observe(0) in setup. Presumably the idea is to bring the success/fail metrics into existence early instead of waiting for the first success/failure. That works well for gauges and counters, where the initial value really is zero, but here it seems to add a fake "probe took 0 seconds" value, so the metrics would always indicate that 1 probe succeeded and 1 probe failed even when nothing of the sort has happened, which seems unfortunate. I don't see a way around this, but I wonder if it would be better to drop it and accept that Prometheus will sometimes say "no data points"...?
Tom Clegg wrote:
TestProbeAndUpdate panics in WithLabelValues -- could solve this by calling pool.registerMetrics(prometheus.NewRegistry()) at worker_test.go L242
Doh, I ran tests, but perhaps not in the correct git tree. Fixed as you suggested.
Not sure about calling Observe(0) in setup. Presumably the idea is to bring the success/fail metrics into existence early instead of waiting for the first success/failure. That works well for gauges and counters, where the initial value really is zero, but here it seems to add a fake "probe took 0 seconds" value, so the metrics would always indicate that 1 probe succeeded and 1 probe failed even when nothing of the sort has happened, which seems unfortunate. I don't see a way around this, but I wonder if it would be better to drop it and accept that Prometheus will sometimes say "no data points"...?
That's fair. I've removed the Observe(0) call.
Changes at 126139084160563c2b4fe3969461c40ecbbf6951 on branch 16838-probe-metrics
Just to be sure, running all tests at developer-run-tests: #2107
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
- Related to Feature #16636: [arvados-dispatch-cloud] Add instance metrics added