Feature #12260
Closed
Healthcheck endpoint aggregator
100%
Updated by Tom Clegg over 7 years ago
listen on configured host:port (default :9201)
- respond to "GET /_health/all"
- require admin token in "Authorization: Bearer XXX" header (return 500 if token can't be checked because API is down)
- concurrently call /_health/ping on all services that should be up
- when all ping responses have arrived/failed, respond to client with {"health":"OK"} if all succeeded, otherwise {"health":"ERROR"}
- also include detail of each check, using key "{node}/{component}/{health-check-path}":
{
  "health":"OK",
  "checks":{
    "keep/keep-web/_health/ping":{"health":"OK","responseTime":0.00123,"status":200},
    "keep0/keepstore/_health/ping":{"health":"OK","responseTime":0.00123,"status":200},
    ...
  }
}
- (?) allow for multiple instances of a service (e.g., keepstore) on a single node
- report missing services as failed
- in general, the system is healthy if there are enough healthy instances of each component/service
- for now, call 1 "enough" for all services
- /etc/arvados/config.yml is a cluster configuration
- minimum for this story: cluster id (uuid prefix), port numbers (if different from install guide), map of hostname → services ("SystemNodes")
- type SiteConfig struct in sdk/go/arvados
- example /etc/arvados/config.yml:
Clusters:
  qr1hi:
    ManagementToken: s3cr3t
    SystemNodes:
      keep0:
        Keepstore:
          Listen: :25107
      keep1:
        Keepstore:
          Listen: :25107
- package health (lib/health)
- health.Aggregator implements http.Handler
- health.Aggregator has an arvados.SiteConfig field
- package main (services/health → arvados-health_*.deb → /usr/bin/arvados-health)
- main starts http server at (SiteConfig)CurrentSystemNode().Health.Port and attaches a health.Aggregator{SiteConfig: arvados.LoadSiteConfig(nil)}
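The fan-out and aggregation steps described above could be sketched roughly as follows. This is a minimal illustration, not the code from the branch: the names CheckResult, Response, ping, pingAll and summarize are hypothetical, the targets map is hard-coded instead of being derived from SiteConfig, and the token check, the "missing services" rule and the listening HTTP server are left out. Two local test servers stand in for real services so the example can run on its own.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// CheckResult mirrors one entry of the "checks" map (hypothetical type name).
type CheckResult struct {
	Health       string  `json:"health"`
	ResponseTime float64 `json:"responseTime"`
	Status       int     `json:"status,omitempty"`
}

// Response mirrors the body described for GET /_health/all (hypothetical type name).
type Response struct {
	Health string                 `json:"health"`
	Checks map[string]CheckResult `json:"checks"`
}

// ping calls one health-check URL and records the status and elapsed time.
func ping(client *http.Client, url string) CheckResult {
	start := time.Now()
	resp, err := client.Get(url)
	result := CheckResult{Health: "ERROR", ResponseTime: time.Since(start).Seconds()}
	if err != nil {
		return result
	}
	defer resp.Body.Close()
	result.Status = resp.StatusCode
	if resp.StatusCode == http.StatusOK {
		result.Health = "OK"
	}
	return result
}

// pingAll pings every target concurrently and returns once all have finished.
// targets maps "{node}/{component}/{health-check-path}" to a URL.
func pingAll(client *http.Client, targets map[string]string) map[string]CheckResult {
	var mu sync.Mutex
	var wg sync.WaitGroup
	checks := make(map[string]CheckResult, len(targets))
	for key, url := range targets {
		wg.Add(1)
		go func(key, url string) {
			defer wg.Done()
			r := ping(client, url)
			mu.Lock()
			checks[key] = r
			mu.Unlock()
		}(key, url)
	}
	wg.Wait()
	return checks
}

// summarize reports "OK" only if every individual check succeeded.
func summarize(checks map[string]CheckResult) Response {
	resp := Response{Health: "OK", Checks: checks}
	for _, c := range checks {
		if c.Health != "OK" {
			resp.Health = "ERROR"
		}
	}
	return resp
}

func main() {
	// Fake services: one healthy, one failing.
	up := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte(`{"health":"OK"}`))
	}))
	defer up.Close()
	down := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "unavailable", http.StatusServiceUnavailable)
	}))
	defer down.Close()

	targets := map[string]string{
		"keep0/keepstore/_health/ping": up.URL + "/_health/ping",
		"keep1/keepstore/_health/ping": down.URL + "/_health/ping",
	}
	client := &http.Client{Timeout: 5 * time.Second}
	out, _ := json.MarshalIndent(summarize(pingAll(client, targets)), "", "  ")
	fmt.Println(string(out))
}
```

The sketch uses the simple "all checks succeeded" rule; the "enough healthy instances per service" rule would need the checks grouped by component before summarizing.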
Updated by Tom Clegg over 7 years ago
- Subject changed from Healthcheck / status endpoint aggregator to Healthcheck endpoint aggregator
Updated by Tom Morris over 7 years ago
- Target version changed from Arvados Future Sprints to 2017-10-11 Sprint
- Story points set to 2.0
Updated by Tom Clegg over 7 years ago
12260-system-health @ a9497f8d2756104ba07d88d5c8c7b84790fd83f3
Known todos:
- Update package scripts to build deb/rpm packages ("arvados-health")
- Update install guide (or maybe wait until we have seen it work in real life?)
Updated by Tom Clegg over 7 years ago
- Target version changed from 2017-10-11 Sprint to 2017-10-25 Sprint
Updated by Lucas Di Pentima over 7 years ago
Some comments/questions:
- File sdk/go/arvados/config.go
- Lines 53 & 63: Comments seem to be outdated naming funcs & types that aren’t named like that
- Line 60: The fact that GetSystemNode() returns (*SystemNode, error) is enough to avoid explicitly returning 'error' there?
- File sdk/go/health/aggregator.go
- Lines 143 & 146: Is the re-assignment necessary? (I ask because I remember doing some similar test and it seemed to be assigning by reference, not copy)
Updated by Tom Clegg over 7 years ago
Lucas Di Pentima wrote:
Some comments/questions:
- File sdk/go/arvados/config.go
- Lines 53 & 63: Comments seem to be outdated naming funcs & types that aren’t named like that
Fixed
- Line 60: The fact that GetSystemNode() returns (*SystemNode, error) is enough to avoid explicitly returning 'error' there?
Yes. In fact, if GetSystemNode(x) only returned a single value, "return GetSystemNode(x)" wouldn't compile ("not enough values to return").
(You can use the same shortcut with function args: if you have a func foo(*SystemNode, error), you can call foo(GetSystemNode(x)).)
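For illustration, here is a small self-contained sketch of both shortcuts. SystemNode is reduced to a single field, and getSystemNode, currentNode and describe are hypothetical stand-ins, not the functions in config.go.

```go
package main

import (
	"errors"
	"fmt"
)

// SystemNode is a simplified stand-in for the real type.
type SystemNode struct{ Name string }

// getSystemNode is a hypothetical stand-in for GetSystemNode.
func getSystemNode(name string) (*SystemNode, error) {
	if name == "" {
		return nil, errors.New("no node name given")
	}
	return &SystemNode{Name: name}, nil
}

// currentNode forwards both return values with a single return statement.
func currentNode() (*SystemNode, error) {
	return getSystemNode("keep0")
}

// describe takes exactly the values that getSystemNode returns,
// so the call result can be passed straight through.
func describe(node *SystemNode, err error) string {
	if err != nil {
		return "error: " + err.Error()
	}
	return "node: " + node.Name
}

func main() {
	fmt.Println(describe(getSystemNode("keep0")))
	fmt.Println(describe(getSystemNode("")))
	fmt.Println(describe(currentNode()))
}
```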
- File sdk/go/health/aggregator.go
- Lines 143 & 146: Is the re-assignment necessary? (I ask because I remember doing some similar test and it seemed to be assigning by reference, not copy)
Yes. You can read or write from a map, but there's no "update" syntax. If resp.Services were a map[string]*ServiceHealth then you could read a pointer and update the object it points to: "resp.Services[svc].N++". But it's a map[string]ServiceHealth, so we have to do read-update-write explicitly.
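A minimal sketch of the two cases, assuming a ServiceHealth type with only the counter field N (the real struct may have more fields; countByValue and countByPointer are hypothetical helper names):

```go
package main

import "fmt"

// ServiceHealth is reduced to the one field used in the discussion.
type ServiceHealth struct {
	N int
}

// countByValue increments N in a map of struct values.
// Writing m[svc].N++ here would not compile, because a map element
// holding a struct value is not addressable.
func countByValue(m map[string]ServiceHealth, svc string) {
	sh := m[svc] // read a copy (zero value if the key is absent)
	sh.N++       // update the copy
	m[svc] = sh  // write the copy back
}

// countByPointer increments N in a map of pointers, where the
// element can be updated in place through the pointer.
func countByPointer(m map[string]*ServiceHealth, svc string) {
	if m[svc] == nil {
		m[svc] = &ServiceHealth{}
	}
	m[svc].N++
}

func main() {
	byValue := map[string]ServiceHealth{}
	countByValue(byValue, "keepstore")
	countByValue(byValue, "keepstore")
	fmt.Println(byValue["keepstore"].N)

	byPointer := map[string]*ServiceHealth{}
	countByPointer(byPointer, "keepstore")
	fmt.Println(byPointer["keepstore"].N)
}
```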
12260-system-health @ ff100fbf824e2dbc2ff0afd3d746ac562532cfb6
Updated by Tom Clegg over 7 years ago
- Status changed from In Progress to Feedback
Updated by Tom Morris over 7 years ago
- Target version changed from 2017-10-25 Sprint to 2017-11-08 Sprint
Updated by Tom Clegg over 7 years ago
- Target version changed from 2017-11-08 Sprint to 2017-11-22 Sprint
- Story points changed from 1.0 to 0.0
Updated by Tom Clegg about 7 years ago
- Status changed from Feedback to Resolved
Updated by Tom Clegg almost 7 years ago
- Description updated (diff)
- Story points deleted (0.0)