Project

General

Profile

Actions

Bug #23403

open

PySDK websocket tests timing out sometimes on Debian 13

Added by Brett Smith about 2 months ago. Updated about 2 months ago.

Status:
In Progress
Priority:
Normal
Assigned To:
Category:
Tests
Target version:
-
Story points:
-

Description

Since we upgraded the Jenkins worker nodes to Debian 13, the PySDK websocket tests have gotten flaky like this:

=================================== FAILURES ===================================
_________ WebsocketTest.test_subscribe_websocket_with_start_time_past __________

self = <tests.test_events.WebsocketTest testMethod=test_subscribe_websocket_with_start_time_past>

    def test_subscribe_websocket_with_start_time_past(self):
>       self._test_subscribe(
            poll_fallback=False, expect_type=arvados.events.EventClient,
            start_time=self.localiso(self.TIME_PAST),
            expected=2)

tests/test_events.py:172: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/test_events.py:106: in _test_subscribe
    self.ws = arvados.events.subscribe(
/home/jenkins/tmp/VENV3DIR/lib/python3.10/site-packages/arvados/events.py:546: in subscribe
    return _subscribe_websocket(api, filters, on_event, last_log_id)
/home/jenkins/tmp/VENV3DIR/lib/python3.10/site-packages/arvados/events.py:502: in _subscribe_websocket
    client = EventClient(uri_with_token, filters, on_event, last_log_id)
/home/jenkins/tmp/VENV3DIR/lib/python3.10/site-packages/arvados/events.py:127: in __init__
    self._connect()
/home/jenkins/tmp/VENV3DIR/lib/python3.10/site-packages/arvados/events.py:134: in _connect
    self._client = ws_client.connect(
/home/jenkins/tmp/VENV3DIR/lib/python3.10/site-packages/websockets/sync/client.py:378: in connect
    connection.handshake(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <websockets.sync.client.ClientConnection object at 0x7fb92f90b250>
additional_headers = None
user_agent_header = 'Python/3.10.19 arvados.events/3.3.0.dev20251121214649'
timeout = 9.997357026000032

    def handshake(
        self,
        additional_headers: HeadersLike | None = None,
        user_agent_header: str | None = USER_AGENT,
        timeout: float | None = None,
    ) -> None:
        """ 
        Perform the opening handshake.

        """ 
        with self.send_context(expected_state=CONNECTING):
            self.request = self.protocol.connect()
            if additional_headers is not None:
                self.request.headers.update(additional_headers)
            if user_agent_header is not None:
                self.request.headers.setdefault("User-Agent", user_agent_header)
            self.protocol.send_request(self.request)

        if not self.response_rcvd.wait(timeout):
>           raise TimeoutError("timed out while waiting for handshake response")
E           TimeoutError: timed out while waiting for handshake response

/home/jenkins/tmp/VENV3DIR/lib/python3.10/site-packages/websockets/sync/client.py:94: TimeoutError
----------------------------- Captured stderr call -----------------------------
2026-01-21 14:56:58 arvados.events[21564] WARNING: Failed to connect to websockets on wss://127.0.0.1:59529/websocket
------------------------------ Captured log call -------------------------------
DEBUG    googleapiclient.discovery:discovery.py:434 URL being requested: GET https://127.0.0.1:46433/discovery/v1/apis/arvados/v1/rest
DEBUG    googleapiclient.discovery:discovery.py:1258 URL being requested: POST https://127.0.0.1:58859/arvados/v1/collections?alt=json
DEBUG    googleapiclient.discovery:discovery.py:434 URL being requested: GET https://127.0.0.1:46433/discovery/v1/apis/arvados/v1/rest
WARNING  arvados.events:events.py:504 Failed to connect to websockets on wss://127.0.0.1:59529/websocket
=========================== short test summary info ============================
FAILED tests/test_events.py::WebsocketTest::test_subscribe_websocket_with_start_time_past - TimeoutError: timed out while waiting for handshake response

This has affected several different test_subscribe_websocket_with_start_time_* tests.

My first suspicion is the issue isn't with the test per se, but with the stack of test servers. Maybe there's some configuration knob we need to tweak or something.

Actions #1

Updated by Brett Smith about 2 months ago

  • Target version set to Development 2026-02-04
  • Assigned To set to Brett Smith
  • Status changed from New to In Progress

I can reproduce on my development VM, 2/50 failed the same way as Jenkins.

Actions #2

Updated by Brett Smith about 2 months ago

I did not see any signs of distress in the nginx or ws logs around the time of the failing test. I believe these are the only two services involved since the nginx test configuration includes a stanza to upgrade websocket connections.

The websockets library we use has had upgrades around this same time. Jenkins currently has 15.0.1. That was true under Debian 12 as well, so that's probably not it.

Actions #3

Updated by Brett Smith about 2 months ago · Edited

https://github.com/python-websockets/websockets/issues/1648

https://github.com/python/cpython/issues/137583

but that bisects the problem to Python commit e047a35 which was authored July 2025 so wouldn't be included in the Python 3.10 we're using.

Actions #4

Updated by Brett Smith about 2 months ago

I tried upgrading websockets to 16.0 (the latest version) just because that was easy.

In general, we should pin to a major version anyway. The library does make API breaking changes between major versions. See the changelog

But this didn't help. In fact I'm seeing new problems. First is the same TimeoutError from WebsocketTest.test_subscribe_websocket—but it might just be coincidence we haven't seen that before. Also got this once:

========================================================== FAILURES ==========================================================
_________________________________ WebsocketTest.test_websocket_reconnect_on_unexpected_close _________________________________

self = <tests.test_events.WebsocketTest testMethod=test_websocket_reconnect_on_unexpected_close>

    def test_websocket_reconnect_on_unexpected_close(self):
>       self._test_websocket_reconnect(True)

tests/test_events.py:284: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/test_events.py:262: in _test_websocket_reconnect
    event = events.get(True, 5)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <queue.Queue object at 0x7f43d7238a90>, block = True, timeout = 5

    def get(self, block=True, timeout=None):
        '''Remove and return an item from the queue.

        If optional args 'block' is true and 'timeout' is None (the default),
        block if necessary until an item is available. If 'timeout' is
        a non-negative number, it blocks at most 'timeout' seconds and raises
        the Empty exception if no item was available within that time.
        Otherwise ('block' is false), return an item if one is immediately
        available, else raise the Empty exception ('timeout' is ignored
        in that case).
        '''
        with self.not_empty:
            if not block:
                if not self._qsize():
                    raise Empty
            elif timeout is None:
                while not self._qsize():
                    self.not_empty.wait()
            elif timeout < 0:
                raise ValueError("'timeout' must be a non-negative number")
            else:
                endtime = time() + timeout
                while not self._qsize():
                    remaining = endtime - time()
                    if remaining <= 0.0:
>                       raise Empty
E                       _queue.Empty

/opt/Python-3.10.19/lib/python3.10/queue.py:179: Empty
----------------------------------------------------- Captured log call ------------------------------------------------------
WARNING  arvados.events:events.py:226 Unexpected close. Reconnecting.
================================================== short test summary info ===================================================
FAILED tests/test_events.py::WebsocketTest::test_websocket_reconnect_on_unexpected_close - _queue.Empty
=============================================== 1 failed, 19 passed in 43.65s ================================================
Actions #5

Updated by Brett Smith about 2 months ago · Edited

I still had my Debian 12 VM lying around so I tried on that. Same version of Python (3.10.19) and websockets (15.0.1), run out of main, fresh test temp dir, 45 runs, zero failures. I can do more runs but my initial take is this means it's very unlikely the issue is anything in the Arvados code base or written in Python.

I'm really hoping nginx is the issue at this point, because if it's an underlying C library that's going to be a giant pain to track down and deal with.

Actions #6

Updated by Brett Smith about 2 months ago

  • Release deleted (83)
  • Target version deleted (Development 2026-02-04)
Actions

Also available in: Atom PDF