Bug #16217

closed

[arvados-ws] Websocket server stops processing events, but stays connected

Added by Tom Clegg almost 5 years ago. Updated about 4 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: API
Target version:
Start date: 03/12/2020
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release relationship: Auto

Description

Sometimes, after successfully processing hundreds or thousands of events, arvados-ws goes into a state where clients don't receive any events. The EventsIn number at /status.json is static, which indicates arvados-ws isn't receiving events from PostgreSQL.

Clients can still connect and stay connected, and the once-per-minute empty "ping" message still works.

Cause is unknown.


Subtasks 4 (0 open, 4 closed)

Task #16230: Review 16217-ws-ping (Resolved, Tom Clegg, 03/12/2020)
Task #16231: Export event counters as metrics (Resolved, Tom Clegg, 03/31/2020)
Task #16232: [ops] Add arvados-ws to prometheus configs (Resolved, 04/07/2020)
Task #16309: Review 16217-ws-metrics (Resolved, Tom Clegg, 03/12/2020)

Actions #1

Updated by Peter Amstutz almost 5 years ago

  • Target version set to 2020-03-25 Sprint
Actions #2

Updated by Tom Clegg almost 5 years ago

  • Status changed from New to In Progress
  • Assigned To set to Tom Clegg

I'm not sure whether this is related to the observed failures, but it seems worth fixing either way. The periodic listener ping in arvados-ws has not been checking the returned error. With this change, if the ping fails, arvados-ws logs the error and exits so that it can be restarted.

16217-ws-ping @ 9ebf73b1a1229bba507057ed2fb6a39635ce7e24 -- developer-run-tests: #1765

Actions #3

Updated by Lucas Di Pentima almost 5 years ago

16217-ws-ping LGTM, thanks!

Actions #4

Updated by Peter Amstutz over 4 years ago

  • Target version changed from 2020-03-25 Sprint to 2020-04-08 Sprint
Actions #5

Updated by Tom Clegg over 4 years ago

Replaces the old status/debug.json stuff with Prometheus metrics. Also refactors services/ws to share service-startup code and to be distributed inside arvados-server, like controller, boot, install, dispatchcloud, etc.

16217-ws-metrics @ 8d7a94c6799f20028725c1cc00614f1f7ae01209 -- developer-run-tests: #1797

16217-ws-metrics @ 8d7a94c6799f20028725c1cc00614f1f7ae01209 -- developer-run-tests: #1798

16217-ws-metrics @ 8d7a94c6799f20028725c1cc00614f1f7ae01209 -- developer-run-tests: #1800

Actions #6

Updated by Lucas Di Pentima over 4 years ago

This LGTM, thanks!

Actions #7

Updated by Tom Clegg over 4 years ago

  • Status changed from In Progress to Resolved
Actions #8

Updated by Peter Amstutz about 4 years ago

  • Release set to 25