Bug #17989
closedPySDK socket read timeout on big collection create calls
100%
Description
While working on #17948 I found an issue with our Python API client while making collection creation request with big payloads:
Traceback (most recent call last): File "tools/test-collection-create/test-collection-create.py", line 165, in <module> sys.exit(main()) File "tools/test-collection-create/test-collection-create.py", line 155, in main "manifest_text": manifest File "/home/lucas/venv-arvados/lib/python3.7/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper return wrapped(*args, **kwargs) File "/home/lucas/venv-arvados/lib/python3.7/site-packages/googleapiclient/http.py", line 835, in execute method=str(self.method), body=self.body, headers=self.headers) File "/home/lucas/venv-arvados/lib/python3.7/site-packages/googleapiclient/http.py", line 179, in _retry_request raise exception File "/home/lucas/venv-arvados/lib/python3.7/site-packages/googleapiclient/http.py", line 162, in _retry_request resp, content = http.request(uri, method, *args, **kwargs) File "/home/lucas/venv-arvados/lib/python3.7/site-packages/arvados/api.py", line 111, in _intercept_http_request return self.orig_http_request(uri, method, headers=headers, **kwargs) File "/home/lucas/venv-arvados/lib/python3.7/site-packages/httplib2/__init__.py", line 1709, in request conn, authority, uri, request_uri, method, body, headers, redirections, cachekey, File "/home/lucas/venv-arvados/lib/python3.7/site-packages/httplib2/__init__.py", line 1424, in _request (response, content) = self._conn_request(conn, request_uri, method, body, headers) File "/home/lucas/venv-arvados/lib/python3.7/site-packages/httplib2/__init__.py", line 1376, in _conn_request response = conn.getresponse() File "/usr/lib/python3.7/http/client.py", line 1352, in getresponse response.begin() File "/usr/lib/python3.7/http/client.py", line 310, in begin version, status, reason = self._read_status() File "/usr/lib/python3.7/http/client.py", line 271, in _read_status line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1") File "/usr/lib/python3.7/socket.py", line 589, in readinto return self._sock.recv_into(b) File "/usr/lib/python3.7/ssl.py", line 1052, in recv_into return self.read(nbytes, buffer) File "/usr/lib/python3.7/ssl.py", line 911, in read return self._sslobj.read(len, buffer) socket.timeout: The read operation timed out
The collection got created but the client failed to wait for the response (it seems that only waits for around 12 seconds no matter which timeout setting is used).
For more information: https://dev.arvados.org/issues/17948#note-8
Updated by Lucas Di Pentima over 3 years ago
- Assigned To set to Lucas Di Pentima
Updated by Lucas Di Pentima over 3 years ago
- Target version changed from 2021-09-01 sprint to 2021-09-15 sprint
Updated by Lucas Di Pentima over 3 years ago
- Status changed from New to In Progress
Updated by Lucas Di Pentima over 3 years ago
Updates at ca109d685 - branch 17989-pysdk-timeout
Test run: developer-run-tests: #2680
- Fixes the
arvados.api_from_config()
call so that it uses the specifiedtimeout
param. - Adds test.
- As a bonus, lowers the max manifest size limit on the test collection creation script to avoid "payload is too large" errors.
Updated by Peter Amstutz over 3 years ago
Lucas Di Pentima wrote:
Updates at ca109d685 - branch
17989-pysdk-timeout
Test run: developer-run-tests: #2680
- Fixes the
arvados.api_from_config()
call so that it uses the specifiedtimeout
param.- Adds test.
- As a bonus, lowers the max manifest size limit on the test collection creation script to avoid "payload is too large" errors.
That was a unfortunate oversight, but it looks like it was easy to find and fix.
On the subject of timeouts:
We already have feedback that a default timeout 10s is too short (e.g. arv-mount reportedly cannot load very large collections).
Timeouts are one of those things that don't matter until you start hitting them and you actually do need it to wait longer, then they matter a lot.
So I have a couple of follow up suggestions:
- Raise the timeout to 20 or 30 seconds (let's see what Tom and Ward think)
- increase the default timeout of arv-mount, and/or add a command line flag that lets you adjust the timeout and/or add an enviroment variable ARVADOS_TIMEOUT or something that lets you adjust the default timeout.
Updated by Peter Amstutz over 3 years ago
I should have said, the 17989-pysdk-timeout branch LGTM.
Updated by Peter Amstutz over 3 years ago
- Related to Bug #17171: [sdk] unreasonable python SDK timeout when ARVADOS_API_HOST is reachable added
Updated by Peter Amstutz over 3 years ago
(copied from slack)
I don't suppose we have any better granularity?
Initial TCP connection/TLS handshake timeout and subsequent response timeout really ought to be different things.
This is one of the reasons we use pycurl for keepclient, because it gave us a more knobs to control exactly when to abandon a connection.
Updated by Peter Amstutz over 3 years ago
From discussion: let's revert the timeout to 300 seconds and investigate further if we can separate the connect/handshake timeout from the response timeout.
Updated by Lucas Di Pentima over 3 years ago
Updates at aab565926 - branch 17989-revert-pysdk-default-timeout
Test run: developer-run-tests: #2681
- Reverts
arvados.api()
's default timeout to 300 seconds. - Updates test.
Updated by Peter Amstutz over 3 years ago
Lucas Di Pentima wrote:
Updates at aab565926 - branch
17989-revert-pysdk-default-timeout
Test run: developer-run-tests: #2681
- Reverts
arvados.api()
's default timeout to 300 seconds.- Updates test.
LGTM
Updated by Lucas Di Pentima over 3 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados-private:commit:arvados|834f7c726f727e04250ed6b3eda287ca996e9c38.