Bug #8206
[Node Manager] GCE compute node driver needs to retry I/O errors initializing libcloud driver
Status: Resolved
Priority: Normal
Assigned To:
Category: Node Manager
Target version:
Start date: 01/14/2016
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 1.0
Description
An SSL error raised during the call ending in max_total_price=config.getfloat('Daemon', 'max_total_price')).proxy() (launcher.py, line 125) stops the node manager.
Here is the stack trace:
2016-01-14_12:42:41.57638 Traceback (most recent call last):
2016-01-14_12:42:41.57641   File "/usr/local/bin/arvados-node-manager", line 6, in <module>
2016-01-14_12:42:41.57643     main()
2016-01-14_12:42:41.57643   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/launcher.py", line 125, in main
2016-01-14_12:42:41.59825     max_total_price=config.getfloat('Daemon', 'max_total_price')).proxy()
2016-01-14_12:42:41.59827   File "/usr/local/lib/python2.7/dist-packages/pykka/actor.py", line 94, in start
2016-01-14_12:42:41.59827     obj = cls(*args, **kwargs)
2016-01-14_12:42:41.59829   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 123, in __init__
2016-01-14_12:42:41.59844     self._cloud_driver = self._new_cloud()
2016-01-14_12:42:41.59846   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/config.py", line 105, in new_cloud_client
2016-01-14_12:42:41.59846     self.get_section('Cloud Create'))
2016-01-14_12:42:41.59847   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/driver/gce.py", line 36, in __init__
2016-01-14_12:42:41.59847     driver_class)
2016-01-14_12:42:41.59847   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/driver/__init__.py", line 40, in __init__
2016-01-14_12:42:41.59848     self.real = driver_class(**auth_kwargs)
2016-01-14_12:42:41.59848   File "/usr/local/lib/python2.7/dist-packages/libcloud/compute/drivers/gce.py", line 1053, in __init__
2016-01-14_12:42:41.59862     self.zone_list = self.ex_list_zones()
2016-01-14_12:42:41.59863   File "/usr/local/lib/python2.7/dist-packages/libcloud/compute/drivers/gce.py", line 1785, in ex_list_zones
2016-01-14_12:42:41.59881     response = self.connection.request(request, method='GET').object
2016-01-14_12:42:41.59883   File "/usr/local/lib/python2.7/dist-packages/libcloud/compute/drivers/gce.py", line 120, in request
2016-01-14_12:42:41.59889     response = super(GCEConnection, self).request(*args, **kwargs)
2016-01-14_12:42:41.59889   File "/usr/local/lib/python2.7/dist-packages/libcloud/common/google.py", line 698, in request
2016-01-14_12:42:41.59895     raise e
2016-01-14_12:42:41.59895 ssl.SSLError: The read operation timed out
After that, the node manager appears to be stuck. It would be good to retry, or at least die gracefully.
Steps to fix:
Put the self.real initialization into a retry loop on cloud errors.
Log the error backtrace.
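
The two steps above could look roughly like the following. This is only a sketch, not the actual arvnodeman change: the function name init_driver_with_retry, the RETRYABLE_ERRORS tuple, and the max_tries / initial_wait / max_wait parameters are hypothetical, and which exceptions count as a "cloud error" is an assumption. Only ssl.SSLError is confirmed by the stack trace.

```python
import logging
import ssl
import time

logger = logging.getLogger("nodemanager.sketch")  # hypothetical logger name

# Assumption: these are the errors treated as transient "cloud errors".
# ssl.SSLError is the one seen in the stack trace; IOError is a guess
# at other I/O failures worth retrying.
RETRYABLE_ERRORS = (ssl.SSLError, IOError)


def init_driver_with_retry(driver_class, auth_kwargs, max_tries=5,
                           initial_wait=1.0, max_wait=60.0,
                           sleep=time.sleep):
    """Build driver_class(**auth_kwargs), retrying on transient errors.

    Each failure is logged with its backtrace. After max_tries failures
    the last error is re-raised so the caller can exit cleanly instead
    of hanging.
    """
    wait = initial_wait
    for attempt in range(1, max_tries + 1):
        try:
            return driver_class(**auth_kwargs)
        except RETRYABLE_ERRORS:
            # logger.exception records the message plus the full traceback.
            logger.exception("Cloud driver init failed (attempt %d of %d)",
                             attempt, max_tries)
            if attempt == max_tries:
                raise
            sleep(wait)
            wait = min(wait * 2, max_wait)  # exponential backoff, capped
```

In the driver wrapper, the line shown in the trace would then become something like self.real = init_driver_with_retry(driver_class, auth_kwargs). Re-raising after the last attempt covers the "die gracefully" case: the failure surfaces with a logged backtrace rather than leaving the daemon stuck.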