Project

General

Profile

Actions

Feature #16838

closed

[a-d-c] probe metrics

Added by Ward Vandewege over 4 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
09/18/2020
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-
Release relationship:
Auto

Description

As an indicator of how healthy our cloud is:

  • avg runProbe duration by success/failed state (SummaryVec)

Subtasks 1 (0 open1 closed)

Task #16845: review 16838-probe-metricsResolvedWard Vandewege09/18/2020

Actions

Related issues 1 (0 open1 closed)

Related to Arvados - Feature #16636: [arvados-dispatch-cloud] Add instance metricsResolvedWard Vandewege08/03/2020

Actions
Actions #1

Updated by Ward Vandewege over 4 years ago

  • Status changed from New to In Progress
  • Assigned To set to Ward Vandewege
  • Target version changed from 2020-10-07 Sprint to 2020-09-23 Sprint

Ready for review in 799f8e333e7067cee0db0ee8bbcf45a56602d1f1 on branch 16838-probe-metrics

Actions #2

Updated by Tom Clegg over 4 years ago

TestProbeAndUpdate panics in WithLabelValues -- could solve this by calling pool.registerMetrics(prometheus.NewRegistry()) at worker_test.go L242

Not sure about calling Observe(0) in setup. Presumably the idea is to bring the success/fail metrics into existence early instead of waiting for the first success/failure, and this works well for gauges and counters where the initial value really is zero, but here it seems to add a fake "probe took 0 seconds" value, so metrics would always indicate that 1 probe succeeded and 1 probe failed even when nothing of the sort has happened, which seems unfortunate. I don't see a way around this, but I wonder if it would be better to drop it, and accept that prometheus will say "no data points" sometimes...?

Actions #3

Updated by Ward Vandewege over 4 years ago

Tom Clegg wrote:

TestProbeAndUpdate panics in WithLabelValues -- could solve this by calling pool.registerMetrics(prometheus.NewRegistry()) at worker_test.go L242

Doh, I ran tests, but perhaps not in the correct git tree. Fixed as you suggested.

Not sure about calling Observe(0) in setup. Presumably the idea is to bring the success/fail metrics into existence early instead of waiting for the first success/failure, and this works well for gauges and counters where the initial value really is zero, but here it seems to add a fake "probe took 0 seconds" value, so metrics would always indicate that 1 probe succeeded and 1 probe failed even when nothing of the sort has happened, which seems unfortunate. I don't see a way around this, but I wonder if it would be better to drop it, and accept that prometheus will say "no data points" sometimes...?

That's fair. I've removed the Observe(0) call.

Changes at 126139084160563c2b4fe3969461c40ecbbf6951 on branch 16838-probe-metrics

Just to be sure, running all tests at developer-run-tests: #2107

Actions #4

Updated by Tom Clegg over 4 years ago

LGTM, thanks!

Actions #5

Updated by Ward Vandewege over 4 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100
Actions #6

Updated by Peter Amstutz about 4 years ago

  • Release set to 25
Actions #7

Updated by Ward Vandewege about 4 years ago

  • Related to Feature #16636: [arvados-dispatch-cloud] Add instance metrics added
Actions

Also available in: Atom PDF