Bug #8913
closed
[Nodemanager] On GCE: 'unicode' object has no attribute 'id', where we should have a NodeSize
Added by Nico César about 9 years ago.
Updated about 9 years ago.
Estimated time:
(Total: 0.00 h)
Description
This happened in qr2hi. (I don't know whether these exceptions are the cause of the manager being wedged.) I restarted the service and the nodes were created.
# grep Traceback arvados-node-manager/log/main/current -A28
2016-04-08_18:00:17.44134 Traceback (most recent call last):
2016-04-08_18:00:17.44134 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/launcher.py", line 128, in main
2016-04-08_18:00:17.44135 signal.pause()
2016-04-08_18:00:17.44136 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/launcher.py", line 90, in shutdown_signal
2016-04-08_18:00:17.44136 node_daemon.shutdown()
2016-04-08_18:00:17.44136 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/baseactor.py", line 25, in __call__
2016-04-08_18:00:17.44137 self.actor_ref.tell(message)
2016-04-08_18:00:17.44137 File "/usr/local/lib/python2.7/dist-packages/pykka/actor.py", line 398, in tell
2016-04-08_18:00:17.44137 raise ActorDeadError('%s not found' % self)
2016-04-08_18:00:17.44137 ActorDeadError: NodeManagerDaemonActor (urn:uuid:e9844486-0662-4b73-bc46-8e64f57ac168) not found
2016-04-08_18:00:17.44211 2016-04-08 18:00:17 pykka[29660] DEBUG: Unregistered ComputeNodeMonitorActor (urn:uuid:1c85ed8e-3b54-43fb-80eb-9cd3a5a9738f)
The last traceback you pasted, the one you based the subject on, is #6225.
The ActorDeadError above that is more interesting; that's almost always going to be a problem. More logs from before that would be good to see.
- Description updated (diff)
Yes... I guess the fingerprint is irrelevant. Probably we should not transform that traceback into a WARNING or something.
I added a log that has the ActorDead. Moved to Arvados private just because it has a log.
- Project changed from 35 to Arvados
- Subject changed from [Nodemanager] GCE returns "ActorDead" to [Nodemanager] 'unicode' object has no attribute 'id'
The original error was aaallllll the way back here:
2016-04-06_16:52:29.77830 2016-04-06 16:52:29 NodeManagerDaemonActor.8e64f57ac168[29660] ERROR: while calculating nodes wanted for size <arvnodeman.jobqueue.CloudSizeWrapper object at 0x261ce90>
2016-04-06_16:52:29.77831 Traceback (most recent call last):
2016-04-06_16:52:29.77831 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 326, in update_server_wishlist
2016-04-06_16:52:29.77831 nodes_wanted = self._nodes_wanted(size)
2016-04-06_16:52:29.77831 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 285, in _nodes_wanted
2016-04-06_16:52:29.77832 total_price = self._total_price()
2016-04-06_16:52:29.77833 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 250, in _total_price
2016-04-06_16:52:29.77834 for i in (self.booted, self.cloud_nodes.nodes)
2016-04-06_16:52:29.77834 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 251, in <genexpr>
2016-04-06_16:52:29.77834 for c in i.itervalues())
2016-04-06_16:52:29.77835 AttributeError: 'unicode' object has no attribute 'id'
From this point on, the daemon actor was dead. The traceback in the description only happened after someone tried to stop the process, and the stopping process failed because the daemon was already dead--the exception came from the signal handler.
- Subject changed from [Nodemanager] 'unicode' object has no attribute 'id' to [Nodemanager] On GCE: 'unicode' object has no attribute 'id', where we should have a NodeSize
This is a bug introduced by #8872. The node returned by search_for doesn't have its size attribute fixed.
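To illustrate the failure mode, here is a minimal sketch. It is not the arvnodeman code: `Size`, `Node`, `SIZES`, `fix_size`, and `total_price` are hypothetical stand-ins, and it is written for Python 3, where the message names 'str' rather than 'unicode'. It assumes a node can come back with `size` set to a bare size ID string, and that a price sum reads `size.id`.

```python
from collections import namedtuple

# Hypothetical stand-in for a cloud driver's size object.
Size = namedtuple("Size", ["id", "price"])

# Hypothetical size catalog keyed by size ID.
SIZES = {"n1-standard-1": Size("n1-standard-1", 0.05)}


class Node:
    """Hypothetical cloud node; size may be a Size or a bare ID string."""

    def __init__(self, name, size):
        self.name = name
        self.size = size


def fix_size(node, sizes=SIZES):
    """Replace a bare size ID string with the matching Size object."""
    if isinstance(node.size, str):
        node.size = sizes[node.size]
    return node


def total_price(nodes, sizes=SIZES):
    """Sum node prices; raises AttributeError if a size is still a string."""
    return sum(sizes[node.size.id].price for node in nodes)
```

Calling `total_price` on a node whose size was never passed through `fix_size` raises the AttributeError; normalizing the node first lets the sum succeed.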
- Status changed from New to In Progress
- Assigned To set to Brett Smith
- Target version set to 2016-04-13 sprint
Brett Smith wrote:
The original error was aaallllll the way back here:
[...]
From this point on, the daemon actor was dead. The traceback in the description only happened after someone tried to stop the process, and the stopping process failed because the daemon was already dead--the exception came from the signal handler.
Related to this, perhaps on_failure() should kill self on all unhandled exceptions and not just certain ones? Currently the policy is to handle recoverable exceptions before it gets to on_failure(), so once an exception gets to on_failure() it means an actor is going to die unexpectedly, which generally results in node manager getting wedged. (Filed a separate report #8932)
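A rough sketch of that policy, using only the standard library rather than pykka: `ToyActor` and its `on_fatal` callback are hypothetical names, and the RuntimeError merely stands in for ActorDeadError. The assumption is that every exception reaching `on_failure()` is treated as fatal and reported right away, instead of leaving a dead actor behind silently.

```python
import queue
import threading


class ToyActor:
    """Hypothetical minimal actor: one thread draining a mailbox."""

    def __init__(self, on_fatal):
        self._mailbox = queue.Queue()
        self._on_fatal = on_fatal  # called when the actor dies
        self.alive = True
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, func):
        """Queue a callable; refuse if the actor has already died."""
        if not self.alive:
            raise RuntimeError("actor not found")  # stand-in for ActorDeadError
        self._mailbox.put(func)

    def _run(self):
        while True:
            func = self._mailbox.get()
            try:
                func()
            except Exception as error:
                # Anything unhandled ends the actor.
                self.alive = False
                self.on_failure(error)
                return

    def on_failure(self, error):
        # Sketched policy: every unhandled exception is fatal, with no
        # allowlist of "certain" exception types.
        self._on_fatal(error)

    def join(self, timeout=None):
        self._thread.join(timeout)
```

In a real service the `on_fatal` callback might log the error and terminate the process so a supervisor can restart it; here it is left to the caller.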
The fix in 8912-node-manager-patch-nodes-wip 8db9ad8 LGTM.
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:788b8d7247da8c4592b1f9d482fff4e1509f57f3.