Bug #10805
closed[FUSE] Upgrade llfuse to 1.2, fix deadlock in test suite
Added by Tom Clegg about 8 years ago. Updated over 7 years ago.
100%
Description
Test suite deadlocks frequently if llfuse is upgraded past 0.41.1 (tested: 0.42, 0.42.1, 0.43, 1.1.1)
To reproduce, change llfuse version requirement in services/fuse/setup.py and repeat the fuse tests a few times in a row:
export WORKSPACE=`pwd` ./build/run-tests.sh --temp /tmp/buildtmp --only install ./build/run-tests.sh --temp /tmp/buildtmp --skip-install --only services/fuse --repeat 32
The deadlock seems to happen only while the mount is being shut down. It isn't clear whether the problem could affect real-life usage too, or is specific to the test suite (e.g., combining multiprocessing with threading).
https://pythonhosted.org/llfuse/changes.html#release-0-42-2016-01-30
Updated by Tom Morris almost 8 years ago
- Target version set to Arvados Future Sprints
Updated by Tom Morris over 7 years ago
- Subject changed from [FUSE] Upgrade llfuse to 1.1, fix deadlock in test suite to [FUSE] Upgrade llfuse to 1.2, fix deadlock in test suite
- Target version changed from Arvados Future Sprints to 2017-06-07 sprint
Updated by Tom Morris over 7 years ago
- Target version changed from 2017-06-07 sprint to 2017-06-21 sprint
Updated by Peter Amstutz over 7 years ago
- Target version changed from 2017-06-21 sprint to 2017-07-05 sprint
Updated by Peter Amstutz over 7 years ago
Well this is pretty awesome:
2017-06-26 16:57:26 arvados.arvados_fuse[2301] ERROR: Unhandled exception during FUSE operation Traceback (most recent call last): File "/usr/src/arvados/services/fuse/arvados_fuse/__init__.py", line 293, in catch_exceptions_wrapper return orig_func(self, *args, **kwargs) File "/usr/src/arvados/services/fuse/arvados_fuse/__init__.py", line 370, in destroy if llfuse.lock.acquire(): File "src/misc.pxi", line 173, in llfuse.Lock.acquire (src/llfuse.c:26550) RuntimeError: Global lock cannot be acquired more than once Exception in thread Thread-3: Traceback (most recent call last): File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner self.run() File "/usr/lib/python2.7/threading.py", line 763, in run self.__target(*self.__args, **self.__kwargs) File "/usr/src/arvados/services/fuse/arvados_fuse/command.py", line 142, in <lambda> self.llfuse_thread = threading.Thread(None, lambda: self._llfuse_main()) File "/usr/src/arvados/services/fuse/arvados_fuse/command.py", line 378, in _llfuse_main llfuse.close() File "src/fuse_api.pxi", line 524, in llfuse.close (src/llfuse.c:38158) File "src/handlers.pxi", line 23, in llfuse.fuse_destroy (src/llfuse.c:7084) File "src/handlers.pxi", line 24, in llfuse.fuse_destroy (src/llfuse.c:7034) File "/usr/src/arvados/services/fuse/arvados_fuse/__init__.py", line 306, in catch_exceptions_wrapper raise llfuse.FUSEError(errno.EIO) FUSEError: Input/output error
Updated by Peter Amstutz over 7 years ago
Looks like in llfuse 1.2 the lock is already held, so the solution is to just not try to acquire the lock.
Updated by Peter Amstutz over 7 years ago
According to gdb, this is where it hangs:
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85 #1 0x00000000004f6785 in PyEval_RestoreThread () #2 0x00000000005c8919 in PyGILState_Ensure () #3 0x00007f3f84a09ece in __pyx_f_6llfuse_fuse_destroy (__pyx_v_userdata=<optimized out>) at src/llfuse.c:6950 #4 0x00007f3f8477c7a2 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 #5 0x00007f3f84782da3 in fuse_session_destroy () from /lib/x86_64-linux-gnu/libfuse.so.2 #6 0x00007f3f849c77b9 in __pyx_pf_6llfuse_12close (__pyx_self=<optimized out>, __pyx_v_unmount=0x92a970 <_Py_TrueStruct>) at src/llfuse.c:38008 #7 __pyx_pw_6llfuse_13close (__pyx_self=<optimized out>, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>) at src/llfuse.c:37938 #8 0x00000000004c9e05 in PyEval_EvalFrameEx () #9 0x00000000004ca592 in PyEval_EvalFrameEx () #10 0x00000000004e5fe8 in ?? () #11 0x00000000004cc36b in PyEval_EvalFrameEx () #12 0x00000000004ca592 in PyEval_EvalFrameEx () #13 0x00000000004ca592 in PyEval_EvalFrameEx () #14 0x00000000004e5fe8 in ?? () #15 0x00000000005045d8 in ?? () #16 0x00000000004d1a1b in PyEval_CallObjectWithKeywords () #17 0x00000000005bc102 in ?? () #18 0x00007f3f8bdf9064 in start_thread (arg=0x7f3f81fd9700) at pthread_create.c:309 #19 0x00007f3f8b20b62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
Updated by Peter Amstutz over 7 years ago
Working backwards, here is my theory of the deadlock:
- It gets stuck in
PyEval_RestoreThread()
attempting to acquire the GIL. However, if the GIL is already acquired, this results in a deadlock. - The method
PyGILState_Ensure()
is intended to determine if the calling thread is the interpreter's "current" thread. This is accomplished by testing the variable _PyThreadState_Current against the ThreadState associated with this thread. - The ThreadState is stored in a linked list keyed on the thread id, which has semantics that PyThread_set_key_value() returns the existing value (but doesn't update it) if the key is already present.
- Calls to
PyGILState_Ensure()
need to be paired withPyGILState_Release()
which maintains a gilstate_counter. When gilstate_counter goes to zero, the thread context is cleaned up and removed from the ThreadState list. - PyThread_get_thread_ident() uses pthread_self(). Pthreads specifies that thread ids may be reused.
- llfuse creates worker threads directly using pthread_create. These threads are Cython code which call into Python.
- llfuse also uses pthread_cancel() to terminate its worker threads.
- If pthread_cancel() stops a thread that has a Python thread state, so
PyGILState_Release()
isn't called, it will leak that thread state - If a new thread is created, it may get the same id as the previous thread. Somehow, it is possible for this to result in creation of a new ThreadState object and not taking over the old one.
- Later, the new thread calls
PyGILState_Ensure()
. The _PyThreadState_Current is correct and indicates the GIL is locked, but looking up the ThreadState for this thread using PyThread_get_key_value returns a different struct. This causes Python to believe that it needs to acquire the GIL and swap in the new ThreadState. - Because the GIL is actually already locked, it deadlocks.
Possible fixes:
- Remove pthread_cancel(). This seems to eliminate the problem. The main drawback is long-running request handlers could cause problems of their own by delaying unmount/shutdown.
- Add pthread_cleanup_push() to ensure that PyGILState_Release() gets called when the thread is canceled. May require a bit of tinkering to ensure Cython emits the right code.
So far with solution #1 I got 16 consecutive runs without a deadlock (previously it would usually deadlock after 3-4 tries), then this happened:
---------------------------------------------------------------------- Ran 83 tests in 228.099s OK *** Error in `python': free(): invalid next size (normal): 0x00000000012d8e60 ***
Updated by Peter Amstutz over 7 years ago
The crash may be a red herring. Not long after it crashing, I had to reboot my compute due to a kernel panic (it has some flaky RAM issues...)
Updated by Peter Amstutz over 7 years ago
Other options include:
- Work around deadlock by refactoring test cases so that llfuse only runs in a short-lived multiprocess workers
- Do nothing, keep it limping along and prioritize rewrite of FUSE mount in Go.
Updated by Peter Amstutz over 7 years ago
Filed a bug report on llfuse, with recommendation to remove pthread_cancel().
https://bitbucket.org/nikratio/python-llfuse/issues/108/using-pthread_cancel-on-workers-leaves
(I tinkered with pthread_cleanup_push() a bit, I don't think that's going to work as it looks like it relies on longjmp, which conflicts with Python's own use of setjmp()).
Updated by Peter Amstutz over 7 years ago
- Status changed from New to In Progress
Updated by Peter Amstutz over 7 years ago
- Target version changed from 2017-07-05 sprint to 2017-07-19 sprint
Updated by Tom Morris over 7 years ago
- Target version changed from 2017-07-19 sprint to 2017-08-02 sprint
Updated by Tom Clegg over 7 years ago
- Assigned To changed from Peter Amstutz to Tom Clegg
Updated by Tom Clegg over 7 years ago
- Without this change, I see stack traces all over the test output (even though tests are passing).
- With this change, the stack traces go away.
Updated by Tom Clegg over 7 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:5e5bc2b2153a0af624d75a3c348b996ddf458ef2.