Bug #11070: [arvados-ws] job logs are not being updated in real time - Arvados

Bug #11070

closed

[arvados-ws] job logs are not being updated in real time

Added by Ward Vandewege about 9 years ago. Updated about 9 years ago.

Status: Resolved
Priority: Normal
Assigned To: Tom Clegg
Category: API
Target version: 2017-02-15 sprint
Story points:

Description

In workbench, job logs are updated in real time via websockets. On our test installations with arvados-ws, that is no longer working. According to the arvados-ws logs, workbench is subscribing to the updates, so the problem appears to be somewhere in arvados-ws.

Updated by Ward Vandewege about 9 years ago

Description updated

Updated by Tom Clegg about 9 years ago

Lots of connections from arvados-ws to postgresql sitting in CLOSE_WAIT, which doesn't seem right:

9tee4:~# ls -lart --full-time /proc/2322/fd/ 
total 0
dr-xr-xr-x 9 root root  0 2016-12-28 19:27:58.596931467 +0000 ..
dr-x------ 2 root root  0 2017-01-06 12:22:12.549920285 +0000 .
lrwx------ 1 root root 64 2017-01-31 18:05:35.954714735 +0000 508 -> socket:[661990687]
lrwx------ 1 root root 64 2017-02-01 23:23:12.710549212 +0000 99 -> socket:[129111377]
lrwx------ 1 root root 64 2017-02-01 23:23:12.710549212 +0000 98 -> socket:[129110706]
...
lrwx------ 1 root root 64 2017-02-01 23:23:12.710549212 +0000 1 -> socket:[18678]
lr-x------ 1 root root 64 2017-02-01 23:23:12.710549212 +0000 0 -> /dev/null
lrwx------ 1 root root 64 2017-02-08 05:32:54.873000662 +0000 513 -> socket:[664409635]
9tee4:~# netstat -nape | grep arvados-ws | grep tcp6 | wc -l
509
9tee4:~# netstat -nape | grep arvados-ws | grep CLOSE_WAIT | wc -l
500

Updated by Tom Clegg about 9 years ago

I ran tcpdump on the active connection, and I could see the once-per-minute "ping" between arvados-ws and postgresql. But if I generate a log entry (e.g., by modifying my user record) there is no evidence of a "notify" event on the wire.

Updated by Tom Clegg about 9 years ago

I think I have it: when we hit a connection problem, we stopped our own event loop, but we

* didn't tell the pq library's listener to stop listening (which is why there was still one ping every minute on the wire), and
* didn't exit the program (so clients could still connect to the webserver goroutine and wait for events that would never come).
