Bug #3857
closed[SDKs] Python KeepClient should use different timeouts for different operations
100%
Description
Executive summary:
If an arvados.KeepClient is connecting to a Keep proxy (service_type == 'proxy'):- The TCP connection timeout should be 2 seconds.
- Otherwise, connection timeout should be 20 seconds.
Leave all other timeouts as they are.
More in detail:
Right now the Python KeepClient takes a single timeout setting when it's instantiated, and uses that for all HTTP requests by passing it along when it instantiates its own httplib2.Http objects. It seems like httplib2.Http is using this same timeout for all aspects of the request. However, we'd like to be smarter than that:
- The initial attempt to connect, to send the request, should use a short timeout.
- After the request is sent, it's appropriate to use a longer timeout. (The worst-case scenario is that we're PUTting 64MiB to a proxy. Just sending the data will take a little while, and then we'll have to wait for it to be duplexed to several servers before we receive any response.)
With the single timeout we currently have, a small value causes premature failures using tools like arv-put (see #3846), but a long value can cause clients to spend needless time waiting to contact a Keep service that's down.
Investigate whether it's possible to set different timeouts for these different operations, and implement it if so. Without looking, this might be the difference between the socket module's timeout and httplib2.Http's, or httplib2.Http might accept different configuration arguments for these.
Updated by Ward Vandewege over 10 years ago
- Target version set to Arvados Future Sprints
Updated by Tom Clegg over 10 years ago
- Target version changed from Arvados Future Sprints to 2014-10-08 sprint
Updated by Ward Vandewege over 10 years ago
- Story points changed from 2.0 to 1.0
Updated by Tim Pierce about 10 years ago
- Target version changed from 2014-10-08 sprint to Bug Triage
Updated by Tom Clegg about 10 years ago
- Target version changed from Bug Triage to Arvados Future Sprints
Updated by Tom Clegg about 10 years ago
- Subject changed from [SDK] Python KeepClient should use different timeouts for different operations to [SDKs] Python KeepClient should use different timeouts for different operations
Updated by Ward Vandewege about 10 years ago
- Target version changed from Arvados Future Sprints to 2014-10-29 sprint
Updated by Ward Vandewege about 10 years ago
- Target version changed from 2014-10-29 sprint to 2014-11-19 sprint
Updated by Tom Clegg about 10 years ago
- Target version changed from 2014-11-19 sprint to Arvados Future Sprints
Updated by Tom Clegg about 10 years ago
- Target version changed from Arvados Future Sprints to 2014-11-19 sprint
Updated by Tim Pierce about 10 years ago
There doesn't appear to be a trivial fix to this one. httplib2 is built on Python's httplib, which takes a single timeout setting when a connection is created and gives no opportunity to change the timeout afterward or on individual requests. Nor is the httplib2 code organized in such a way as to offer a good place for us to inject this. The httplib2.Http
method is implemented as a monolithic class method, so even if we subclass it, there's no obvious place where we can change the socket timeout after creating a connection and before issuing a request.
That said, making this change in the source doesn't look terribly difficult. This is currently marked as an open issue for httplib2. If we were willing to patch the library, we could submit a pull request and hope it makes it into a new httplib2 release. httplib2 appears to have been mostly in maintenance mode, but the author has accepted a few pull requests this year, so it's not impossible.
Updated by Tim Pierce about 10 years ago
- Status changed from New to In Progress
Updated by Tim Pierce about 10 years ago
Alternatively, Brett points out that the Python requests module might be better suited to what we want to do here.
At a glance, it also only supports a single "timeout" parameter that is applied the same way to all connection objects. But it might be possible to use the Session
class to set up a connection in one call and then issue a request on it with a different timeout setting.
Updated by Brett Smith about 10 years ago
Tim Pierce wrote:
Alternatively, Brett points out that the Python requests module might be better suited to what we want to do here.
To be clear, that was very speculative. Someone mentioned downgrading to httplib, so I wondered if we might do just as well to "upgrade" to requests, which has sometimes been called httplib3. But I've never used it and don't know what it's capable of.
Updated by Tim Pierce about 10 years ago
Tim Pierce wrote:
Alternatively, Brett points out that the Python requests module might be better suited to what we want to do here.
At a glance, it also only supports a single "timeout" parameter that is applied the same way to all connection objects. But it might be possible to use the
Session
class to set up a connection in one call and then issue a request on it with a different timeout setting.
Yes, reading more closely, the timeout parameter can be given as a (connection_timeout, read_timeout) tuple:
If you specify a single value for the timeout, like this:
r = requests.get('https://github.com', timeout=5)
The timeout value will be applied to both the connect and the read timeouts. Specify a tuple if you would like to set the values separately:
r = requests.get('https://github.com', timeout=(3.05, 27))
http://requests.readthedocs.org/en/latest/user/advanced/#timeouts
Updated by Tom Clegg about 10 years ago
I can't say I've given this a thorough look but I've noticed a couple of things:
I think it would be valuable to have different timeouts (and different defaults) for proxies. Perhaps we could accept timeout=((a,b),(c,d))
to mean "(a,b)
if non-proxy, (c,d)
if proxy"? In that case I'd suggest ((3,30),(20,60))
as a default.
Is there some way we can test this? Perhaps timeout=(0,12)
reliably causes a connection timeout, etc.? If testing the timeouts turns out to be tricky, we could test some basic stuff like passing timeout=123
and timeout=(123,456)
and timeout=None
doesn't crash.
The doc string bit about the timeout
argument could refer to the Requests documentation for more detail about the timeouts.
On that note, I would like to know whether, in the case of a large PUT, the read_timeout timer starts when the last byte is sent by the client, or when the connection is made (~when the first byte is sent by the client). If the latter, I'd suggest something more like ((3,60),(20,300))
instead of ((3,30),(20,60))
above.
(Perhaps Ward has better ideas about defaults.)
Updated by Tim Pierce about 10 years ago
This certainly should be testable, and some clever use of mocks should allow us to test the timeouts without actually waiting for 30 seconds or whatever. I'll ask our mock expert for advice here.
Separate timeouts for proxies seems like a reasonable thing. Is it okay to push that into another story? I'm eager to wrap up this story so I can finish as much work done as possible on NodeManager before my science support gig starts on Monday.
Updated by Tim Pierce about 10 years ago
On the other hand, the more I think about it, the more it seems like the test we're talking about is just that requests.get
does what it advertises it will. I'm not sure it's a good use of our time to add testing infrastructure for that. Certainly we should make sure that in practice it gives us the result we want, but testing internal behavior of a third-party library is the kind of test that can be difficult to write and prone to break. WDYT?
With respect to your last question, the package defines the read_timeout parameter to mean the maximum amount of time to wait between consecutive read operations for a response from the server. That seems to indicate that the read timer is reset when the client calls socket.recv
, i.e. after the last byte of the request is sent.
Updated by Brett Smith about 10 years ago
Tim Pierce wrote:
This certainly should be testable, and some clever use of mocks should allow us to test the timeouts without actually waiting for 30 seconds or whatever. I'll ask our mock expert for advice here.
Are you talking about me? Oh god, you're talking about me. Horrifying.
Separate timeouts for proxies seems like a reasonable thing. Is it okay to push that into another story?
At the very least, I think this branch needs to use different defaults based on whether or not we're using a proxy. I don't know how else you meet Ward's requirements in this story without regressing #3846. I feel less strongly about whether or not we expose these as separate options to SDK users (as opposed to making them check themselves and pass in one timeout tuple appropriately).
On the other hand, the more I think about it, the more it seems like the test we're talking about is just that requests.get does what it advertises it will. I'm not sure it's a good use of our time to add testing infrastructure for that. Certainly we should make sure that in practice it gives us the result we want, but testing internal behavior of a third-party library is the kind of test that can be difficult to write and prone to break. WDYT?
My feeling is that if we're just passing values directly through to requests.get, that's probably not a big deal to test; but any logic that we have to use different timeout values based on the Keep service type does need to be tested. If there's not an independent unit to test there, then checking what's passed into the requests mock is probably the next best option.
Updated by Tim Pierce about 10 years ago
1fa1006 adds a proxy_timeout configuration setting. The default timeout value is now (3, 30) and the default proxy_timeout is (20, 60).
Added unit tests to check that KeepClient.get
and KeepClient.put
use the appropriate timeout setting when communicating directly with a Keep server and with a proxy, respectively.
Updated by Brett Smith about 10 years ago
Reviewing 1fa1006. I like this branch a lot. Thanks. This is all small potatoes stuff.
- The story says to leave all non-connect timeouts as they are. We've been using 300 for read timeouts. We know for sure we need more than 60 for the proxy case, because that's what caused #3846. Let's go back to that for read timeouts.
- The timeout argument to KeepService.__init__ seems to be a noop. If it is, let's remove it.
- In KeepService.last_status, AttributeError is the only exception that still makes sense to catch.
- In test_collections, _mock_api_call should not be changed to use a Requests error. The Google API client library uses httplib2, so ApiError (which is a replacement for apiclient's own HttpError) is expecting to get httplib2's 2-tuple.
- KeepClientRetryGetTestMixin and KeepClientRetryPutTestMixin seem like unnecessary layering. KeepClientRetryTestMixin users already have to define a few class variables; why not just add mock_method to that list (and document it in the comments)?
Personally, I'd prefer to have them go further and have mixin users define the patch method:TEST_PATCHER = tutil.mock_get_responses
. Then all the test methods callwith self.TEST_PATCHER(args, …)
. This gives the users more flexibility and eliminates the need formock_requestslib_responses
entirely. - I feel like test_proxy_timeout better fits under KeepClientServiceTestCase than KeepClientRetryTestMixin. The timeout isn't related to retries, and there's not much point in retesting it for every single method that we're interested in the retry behavior of.
- PEP 257 calls for no blank line before a docstring's closing triple quotes (KeepClient.__init__).
- Please keep import blocks sorted where they already are (arvados_testutil).
Updated by Brett Smith about 10 years ago
Brett Smith wrote:
- I feel like test_proxy_timeout better fits under KeepClientServiceTestCase than KeepClientRetryTestMixin. The timeout isn't related to retries, and there's not much point in retesting it for every single method that we're interested in the retry behavior of.
Actually, there's an even better reason to move this: the current implementation assumes we know at __init__
time whether or not we're using a proxy, but that's not always true. If no service roots are defined at init time, KeepClient will ask Arvados for available services, and we might learn then that we're using a proxy. Whether or not we're using a proxy might even change over the lifetime of the KeepClient, since it refreshes this list during retries. KeepClient needs to save both timeout values and pick the right one at request time based on the current value of self.using_proxy.
KeepClientServiceTestCase has all the infrastructure for mocking the API service call, so it can help you test this case of "no service roots at init time, learn we have a proxy later."
Updated by Tim Pierce about 10 years ago
All of these changes adopted at 6959469
Brett Smith wrote:
Reviewing 1fa1006. I like this branch a lot. Thanks. This is all small potatoes stuff.
- The story says to leave all non-connect timeouts as they are. We've been using 300 for read timeouts. We know for sure we need more than 60 for the proxy case, because that's what caused #3846. Let's go back to that for read timeouts.
- The timeout argument to KeepService.__init__ seems to be a noop. If it is, let's remove it.
- In KeepService.last_status, AttributeError is the only exception that still makes sense to catch.
- In test_collections, _mock_api_call should not be changed to use a Requests error. The Google API client library uses httplib2, so ApiError (which is a replacement for apiclient's own HttpError) is expecting to get httplib2's 2-tuple.
- KeepClientRetryGetTestMixin and KeepClientRetryPutTestMixin seem like unnecessary layering. KeepClientRetryTestMixin users already have to define a few class variables; why not just add mock_method to that list (and document it in the comments)?
Personally, I'd prefer to have them go further and have mixin users define the patch method:TEST_PATCHER = tutil.mock_get_responses
. Then all the test methods callwith self.TEST_PATCHER(args, …)
. This gives the users more flexibility and eliminates the need formock_requestslib_responses
entirely.- I feel like test_proxy_timeout better fits under KeepClientServiceTestCase than KeepClientRetryTestMixin. The timeout isn't related to retries, and there's not much point in retesting it for every single method that we're interested in the retry behavior of.
- PEP 257 calls for no blank line before a docstring's closing triple quotes (KeepClient.__init__).
Well. Hmph. Somebody should tell Barry Warsaw so M-q does the right thing in python mode, then.
- Please keep import blocks sorted where they already are (arvados_testutil).
I'm confused -- I didn't reorder the import modules.
(I'd like to adopt a convention of putting all imports in alphabetical order, but that's a discussion for another time.)
Updated by Tim Pierce about 10 years ago
Tim Pierce wrote:
I'm confused -- I didn't reorder the import modules.
(I'd like to adopt a convention of putting all imports in alphabetical order, but that's a discussion for another time.)
OOPS
fixed at a98086e.
Updated by Brett Smith about 10 years ago
Reviewing a98086e, and just one small technicality about the comments you added KeepClientRetryTestMixin. TEST_PATCHER doesn't need to be a static method: a subclass could def TEST_PATCHER(self, …)
and it would work fine. Your specific subclasses need to use staticmethod because that's how you convert an independent function to a method, roughly speaking.
With a small tweak there, this is good to merge, thanks.
Updated by Tim Pierce about 10 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:ec75fda0fc2c86a77d831dcd7962ece7a2d6ae6d.