Bug #8206
[Node Manager] GCE compute node driver needs to retry I/O errors initializing libcloud driver
Status: Closed
% Done: 100%
Description
An SSL error raised while starting the daemon actor (at max_total_price=config.getfloat('Daemon', 'max_total_price')).proxy()) will stop the node manager. Here is the stack trace:
2016-01-14_12:42:41.57638 Traceback (most recent call last):
2016-01-14_12:42:41.57641   File "/usr/local/bin/arvados-node-manager", line 6, in <module>
2016-01-14_12:42:41.57643     main()
2016-01-14_12:42:41.57643   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/launcher.py", line 125, in main
2016-01-14_12:42:41.59825     max_total_price=config.getfloat('Daemon', 'max_total_price')).proxy()
2016-01-14_12:42:41.59827   File "/usr/local/lib/python2.7/dist-packages/pykka/actor.py", line 94, in start
2016-01-14_12:42:41.59827     obj = cls(*args, **kwargs)
2016-01-14_12:42:41.59829   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 123, in __init__
2016-01-14_12:42:41.59844     self._cloud_driver = self._new_cloud()
2016-01-14_12:42:41.59846   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/config.py", line 105, in new_cloud_client
2016-01-14_12:42:41.59846     self.get_section('Cloud Create'))
2016-01-14_12:42:41.59847   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/driver/gce.py", line 36, in __init__
2016-01-14_12:42:41.59847     driver_class)
2016-01-14_12:42:41.59847   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/driver/__init__.py", line 40, in __init__
2016-01-14_12:42:41.59848     self.real = driver_class(**auth_kwargs)
2016-01-14_12:42:41.59848   File "/usr/local/lib/python2.7/dist-packages/libcloud/compute/drivers/gce.py", line 1053, in __init__
2016-01-14_12:42:41.59862     self.zone_list = self.ex_list_zones()
2016-01-14_12:42:41.59863   File "/usr/local/lib/python2.7/dist-packages/libcloud/compute/drivers/gce.py", line 1785, in ex_list_zones
2016-01-14_12:42:41.59881     response = self.connection.request(request, method='GET').object
2016-01-14_12:42:41.59883   File "/usr/local/lib/python2.7/dist-packages/libcloud/compute/drivers/gce.py", line 120, in request
2016-01-14_12:42:41.59889     response = super(GCEConnection, self).request(*args, **kwargs)
2016-01-14_12:42:41.59889   File "/usr/local/lib/python2.7/dist-packages/libcloud/common/google.py", line 698, in request
2016-01-14_12:42:41.59895     raise e
2016-01-14_12:42:41.59895 ssl.SSLError: The read operation timed out
It seems that the node manager is stuck after that. It would be good to retry, or at least to die gracefully.
Steps to fix:
- Put the self.real initialization into a retry loop on cloud errors (a minimal sketch follows this list).
- Log the error backtrace.
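A minimal sketch of that fix, assuming a hypothetical helper wrapped around the self.real = driver_class(**auth_kwargs) call from the traceback (the function name, exception list, and wait times here are illustrative, not the actual patch):

import logging
import time

from libcloud.common.types import LibcloudError

logger = logging.getLogger('arvnodeman.cloud_driver')

def create_driver_with_retries(driver_class, auth_kwargs,
                               retry_wait=1, max_retry_wait=180):
    """Construct the libcloud driver, retrying cloud/network errors.

    Illustrative only: on Python 2.7, ssl.SSLError derives from
    socket.error/IOError, so the IOError clause covers the traceback above.
    """
    while True:
        try:
            return driver_class(**auth_kwargs)
        except (LibcloudError, IOError):
            # logger.exception records the full backtrace, per the second step.
            logger.exception("Error initializing cloud driver, retrying in %d seconds",
                             retry_wait)
            time.sleep(retry_wait)
            retry_wait = min(retry_wait * 2, max_retry_wait)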
Updated by Brett Smith about 9 years ago
- Subject changed from [NODEMANAGER] ssl.SSLError when reading config will get nodemanager stuck to [Node Manager] GCE compute node driver needs to retry I/O errors initializing libcloud driver
- Category set to Node Manager
#8226 saw the same thing from another part of the code. The network call happens inside the libcloud driver's __init__ method, so we need to be able to retry that initialization. We need to check whether that is safe; if not, maybe we can override __new__ to get what we need? (One reading of that idea is sketched below.)
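For illustration only, one hypothetical reading of the __new__ idea: allocate the driver object without running its network-touching __init__, then retry __init__ separately. Whether re-running the driver's __init__ after a failure is actually safe is exactly the open question above; simply retrying driver_class(**auth_kwargs) as a whole (as in the earlier sketch) is the simpler option if that turns out to be safe.

import time

def construct_with_init_retries(driver_class, auth_kwargs, attempts=5, wait=2):
    """Hypothetical sketch: skip __init__ at allocation time, then retry it."""
    obj = driver_class.__new__(driver_class)  # no network call yet
    for attempt in range(attempts):
        try:
            obj.__init__(**auth_kwargs)  # this is where ex_list_zones() runs
            return obj
        except IOError:
            if attempt == attempts - 1:
                raise
            time.sleep(wait)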
Updated by Brett Smith about 9 years ago
- Target version set to Arvados Future Sprints
Updated by Peter Amstutz about 9 years ago
- Description updated (diff)
- Story points set to 1.0
Updated by Peter Amstutz about 9 years ago
- Target version changed from Arvados Future Sprints to 2016-02-03 Sprint
Updated by Tom Clegg about 9 years ago
It seems a bit odd at this point to maintain the pretense that _retry()
is "private" enough to have a leading underscore.
Also, now that it's divorced from the ComputeNodeStateChangeBase class, its docstring should mention a full list of assumptions it makes about the calling class (self.retry_wait, self.max_retry_wait, self.min_retry_wait, self._cloud, self._logger, ...?)
Alternatively, this seems like it could be refactored as a mixin, so those things can be passed to a constructor, they can have defaults, it's more obvious what's happening, etc.
Seems like self.retry_wait should get reset to self.min_retry_wait when re-raising an unhandled exception. (OTOH the previous version seems to have survived without this.)
The delay behavior seems to differ depending on whether there is a self._timer:
- with a self._timer, attempt N+1 happens self._retry_wait seconds after the start time of attempt N
- without a self._timer, attempt N+1 happens self._retry_wait seconds after the start time of the first attempt (this is probably unintentional; it seems to mean once we reach max_retry_wait we will stop delaying entirely)
How about having attempt N+1 happen self._retry_wait seconds after the end time of attempt N in both cases?
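For illustration, a rough sketch of the mixin shape and timing being discussed (the names RetryMixin._retry, min_retry_wait, and max_retry_wait follow the discussion above, but the body is hypothetical, not the merged code): it resets the wait before re-raising unhandled errors and measures the delay from the end of each failed attempt.

import functools
import logging
import time

class RetryMixin(object):
    """Sketch only. Assumes the subclass supplies a logger and wait bounds."""

    def __init__(self, min_retry_wait, max_retry_wait, logger):
        self.min_retry_wait = min_retry_wait
        self.max_retry_wait = max_retry_wait
        self.retry_wait = min_retry_wait
        self._logger = logger

    @staticmethod
    def _retry(errors=()):
        """Decorator: retry the wrapped method on the given exception types."""
        def decorator(orig_func):
            @functools.wraps(orig_func)
            def retry_wrapper(self, *args, **kwargs):
                while True:
                    try:
                        result = orig_func(self, *args, **kwargs)
                        self.retry_wait = self.min_retry_wait
                        return result
                    except errors:
                        self._logger.warning("Error, retrying in %d seconds",
                                             self.retry_wait, exc_info=True)
                        # Delay measured from the end of the failed attempt.
                        time.sleep(self.retry_wait)
                        self.retry_wait = min(self.retry_wait * 2,
                                              self.max_retry_wait)
                    except Exception:
                        # Unhandled error: reset the backoff before re-raising.
                        self.retry_wait = self.min_retry_wait
                        raise
            return retry_wrapper
        return decorator

A cloud-facing method would then be decorated with something like @RetryMixin._retry((IOError,)).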
Updated by Peter Amstutz about 9 years ago
Tom Clegg wrote:
> Alternatively, this seems like it could be refactored as a mixin, so those things can be passed to a constructor, they can have defaults, it's more obvious what's happening, etc.
Refactored to RetryMixin.
> Seems like self.retry_wait should get reset to self.min_retry_wait when re-raising an unhandled exception. (OTOH the previous version seems to have survived without this.)
Fixed.
> The delay behavior seems to differ depending on whether there is a self._timer:
> - with a self._timer, attempt N+1 happens self._retry_wait seconds after the start time of attempt N
> - without a self._timer, attempt N+1 happens self._retry_wait seconds after the start time of the first attempt (this is probably unintentional; it seems to mean once we reach max_retry_wait we will stop delaying entirely)
> How about having attempt N+1 happen self._retry_wait seconds after the end time of attempt N in both cases?
Yep, that makes sense. Done.
Updated by Tom Clegg about 9 years ago
Oh. But could you mock sleep() for the test? Otherwise this is nodemanager's first slow test. (Could assert sleep() was called as expected, too)
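A sketch of what that could look like with mock (the worker class here is hypothetical, just to show the patching and the assertion):

import time
import unittest

import mock  # standalone mock package on Python 2.7; unittest.mock on Python 3

class FlakyWorker(object):
    """Hypothetical worker that fails twice before succeeding, sleeping between tries."""
    def __init__(self):
        self.calls = 0

    def do_work(self):
        retry_wait = 1
        while True:
            self.calls += 1
            if self.calls > 2:
                return 'ok'
            time.sleep(retry_wait)
            retry_wait *= 2

class RetrySleepTestCase(unittest.TestCase):
    @mock.patch('time.sleep')
    def test_sleep_is_mocked_and_asserted(self, mock_sleep):
        self.assertEqual('ok', FlakyWorker().do_work())
        # No real delay happens, and we can still verify the backoff schedule.
        mock_sleep.assert_has_calls([mock.call(1), mock.call(2)])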
Updated by Peter Amstutz about 9 years ago
- Status changed from New to Resolved
- % Done changed from 67 to 100
Applied in changeset arvados|commit:40103b455d0df2bc47eaba422a72bf6ec4fd6808.