Story #8437
closed
[Node Manager] Actors define on_failure to terminate the process on exceptions that are difficult to recover
Added by Brett Smith almost 9 years ago.
Updated almost 9 years ago.
Estimated time:
(Total: 0.00 h)
Description
on_failure gets called when an unhandled exception is raised in an actor method. All our actors should use this to detect when there's an exception that means the whole process is unlikely to recover, and kill it (probably by sending SIGKILL to the process group):
- threading.ThreadError (can't create thread due to out of RAM)
- OSError reporting that we can't allocate RAM (hopefully it has errno ENOMEM? Basically we want this to replace the strategy implemented in #6321.)
- MemoryError
Define a new class that just defines this on_failure method; then have all of the other Node Manager actors use it as their superclass.
- Description updated (diff)
- Assigned To set to Peter Amstutz
on_failure is only called when there is no future associated with the message. As it turns out, all calls that use ActorProxy have an associated Future object, and all messaging between actors in node manager uses ActorProxy. This means unhandled exceptions are stored in a Future object to be returned to the caller. However, if the caller never calls get() on the Future object (because it never stored it), this means the exception is silently ignored.
These lingering future objects may also be creating circular references that is causing the memory leak.
- Status changed from New to In Progress
reviewing 0a1c109684c62f0bc42e7dca30319fc8222dbef7
there are 3 cases where this exception is raise but the test only tests MemoryError
It will be cool to also have threading.ThreadError
and OSError
with exception_value.errno == errno.ENOMEM
down the road this could catch other things
the rest LGTM
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:6ed351ec65d657c27b48b4e4ac0c89d880a2fd1a.
Also available in: Atom
PDF