Project

General

Profile

Actions

Bug #10805

closed

[FUSE] Upgrade llfuse to 1.2, fix deadlock in test suite

Added by Tom Clegg about 8 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
FUSE
Target version:
Start date:
01/04/2017
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0

Description

Test suite deadlocks frequently if llfuse is upgraded past 0.41.1 (tested: 0.42, 0.42.1, 0.43, 1.1.1)

To reproduce, change llfuse version requirement in services/fuse/setup.py and repeat the fuse tests a few times in a row:

export WORKSPACE=`pwd`
./build/run-tests.sh --temp /tmp/buildtmp --only install
./build/run-tests.sh --temp /tmp/buildtmp --skip-install --only services/fuse --repeat 32

The deadlock seems to happen only while the mount is being shut down. It isn't clear whether the problem could affect real-life usage too, or is specific to the test suite (e.g., combining multiprocessing with threading).

https://pythonhosted.org/llfuse/changes.html#release-0-42-2016-01-30


Subtasks 1 (0 open1 closed)

Task #11827: ReviewResolvedLucas Di Pentima01/04/2017

Actions

Related issues 2 (0 open2 closed)

Related to Arvados - Story #8345: [FUSE] Support llfuse 0.42+ResolvedTom Clegg02/03/2016

Actions
Related to Arvados - Bug #12134: [arv-mount] Fix test deadlock by using new llfuse in test suiteResolvedTom Clegg08/16/2017

Actions
Actions #1

Updated by Tom Clegg about 8 years ago

  • Description updated (diff)
Actions #2

Updated by Tom Morris almost 8 years ago

  • Target version set to Arvados Future Sprints
Actions #3

Updated by Tom Morris over 7 years ago

  • Subject changed from [FUSE] Upgrade llfuse to 1.1, fix deadlock in test suite to [FUSE] Upgrade llfuse to 1.2, fix deadlock in test suite
  • Target version changed from Arvados Future Sprints to 2017-06-07 sprint
Actions #4

Updated by Tom Morris over 7 years ago

  • Description updated (diff)
Actions #5

Updated by Tom Morris over 7 years ago

  • Target version changed from 2017-06-07 sprint to 2017-06-21 sprint
Actions #6

Updated by Tom Morris over 7 years ago

  • Story points set to 1.0
Actions #7

Updated by Peter Amstutz over 7 years ago

  • Assigned To set to Peter Amstutz
Actions #8

Updated by Peter Amstutz over 7 years ago

  • Target version changed from 2017-06-21 sprint to 2017-07-05 sprint
Actions #9

Updated by Peter Amstutz over 7 years ago

Well this is pretty awesome:

2017-06-26 16:57:26 arvados.arvados_fuse[2301] ERROR: Unhandled exception during FUSE operation
Traceback (most recent call last):
  File "/usr/src/arvados/services/fuse/arvados_fuse/__init__.py", line 293, in catch_exceptions_wrapper
    return orig_func(self, *args, **kwargs)
  File "/usr/src/arvados/services/fuse/arvados_fuse/__init__.py", line 370, in destroy
    if llfuse.lock.acquire():
  File "src/misc.pxi", line 173, in llfuse.Lock.acquire (src/llfuse.c:26550)
RuntimeError: Global lock cannot be acquired more than once
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/src/arvados/services/fuse/arvados_fuse/command.py", line 142, in <lambda>
    self.llfuse_thread = threading.Thread(None, lambda: self._llfuse_main())
  File "/usr/src/arvados/services/fuse/arvados_fuse/command.py", line 378, in _llfuse_main
    llfuse.close()
  File "src/fuse_api.pxi", line 524, in llfuse.close (src/llfuse.c:38158)
  File "src/handlers.pxi", line 23, in llfuse.fuse_destroy (src/llfuse.c:7084)
  File "src/handlers.pxi", line 24, in llfuse.fuse_destroy (src/llfuse.c:7034)
  File "/usr/src/arvados/services/fuse/arvados_fuse/__init__.py", line 306, in catch_exceptions_wrapper
    raise llfuse.FUSEError(errno.EIO)
FUSEError: Input/output error
Actions #10

Updated by Peter Amstutz over 7 years ago

Looks like in llfuse 1.2 the lock is already held, so the solution is to just not try to acquire the lock.

Actions #11

Updated by Peter Amstutz over 7 years ago

According to gdb, this is where it hangs:

#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1  0x00000000004f6785 in PyEval_RestoreThread ()
#2  0x00000000005c8919 in PyGILState_Ensure ()
#3  0x00007f3f84a09ece in __pyx_f_6llfuse_fuse_destroy
(__pyx_v_userdata=<optimized out>) at src/llfuse.c:6950
#4  0x00007f3f8477c7a2 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#5  0x00007f3f84782da3 in fuse_session_destroy () from
/lib/x86_64-linux-gnu/libfuse.so.2
#6  0x00007f3f849c77b9 in __pyx_pf_6llfuse_12close
(__pyx_self=<optimized out>, __pyx_v_unmount=0x92a970
<_Py_TrueStruct>) at src/llfuse.c:38008
#7  __pyx_pw_6llfuse_13close (__pyx_self=<optimized out>,
__pyx_args=<optimized out>, __pyx_kwds=<optimized out>) at
src/llfuse.c:37938
#8  0x00000000004c9e05 in PyEval_EvalFrameEx ()
#9  0x00000000004ca592 in PyEval_EvalFrameEx ()
#10 0x00000000004e5fe8 in ?? ()
#11 0x00000000004cc36b in PyEval_EvalFrameEx ()
#12 0x00000000004ca592 in PyEval_EvalFrameEx ()
#13 0x00000000004ca592 in PyEval_EvalFrameEx ()
#14 0x00000000004e5fe8 in ?? ()
#15 0x00000000005045d8 in ?? ()
#16 0x00000000004d1a1b in PyEval_CallObjectWithKeywords ()
#17 0x00000000005bc102 in ?? ()
#18 0x00007f3f8bdf9064 in start_thread (arg=0x7f3f81fd9700) at
pthread_create.c:309
#19 0x00007f3f8b20b62d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:111
Actions #12

Updated by Peter Amstutz over 7 years ago

Working backwards, here is my theory of the deadlock:

  1. It gets stuck in PyEval_RestoreThread() attempting to acquire the GIL. However, if the GIL is already acquired, this results in a deadlock.
  2. The method PyGILState_Ensure() is intended to determine if the calling thread is the interpreter's "current" thread. This is accomplished by testing the variable _PyThreadState_Current against the ThreadState associated with this thread.
  3. The ThreadState is stored in a linked list keyed on the thread id, which has semantics that PyThread_set_key_value() returns the existing value (but doesn't update it) if the key is already present.
  4. Calls to PyGILState_Ensure() need to be paired with PyGILState_Release() which maintains a gilstate_counter. When gilstate_counter goes to zero, the thread context is cleaned up and removed from the ThreadState list.
  5. PyThread_get_thread_ident() uses pthread_self(). Pthreads specifies that thread ids may be reused.
  6. llfuse creates worker threads directly using pthread_create. These threads are Cython code which call into Python.
  7. llfuse also uses pthread_cancel() to terminate its worker threads.
  8. If pthread_cancel() stops a thread that has a Python thread state, so PyGILState_Release() isn't called, it will leak that thread state
  9. If a new thread is created, it may get the same id as the previous thread. Somehow, it is possible for this to result in creation of a new ThreadState object and not taking over the old one.
  10. Later, the new thread calls PyGILState_Ensure(). The _PyThreadState_Current is correct and indicates the GIL is locked, but looking up the ThreadState for this thread using PyThread_get_key_value returns a different struct. This causes Python to believe that it needs to acquire the GIL and swap in the new ThreadState.
  11. Because the GIL is actually already locked, it deadlocks.

Possible fixes:

  1. Remove pthread_cancel(). This seems to eliminate the problem. The main drawback is long-running request handlers could cause problems of their own by delaying unmount/shutdown.
  2. Add pthread_cleanup_push() to ensure that PyGILState_Release() gets called when the thread is canceled. May require a bit of tinkering to ensure Cython emits the right code.

So far with solution #1 I got 16 consecutive runs without a deadlock (previously it would usually deadlock after 3-4 tries), then this happened:

----------------------------------------------------------------------
Ran 83 tests in 228.099s

OK
*** Error in `python': free(): invalid next size (normal): 0x00000000012d8e60 ***
Actions #13

Updated by Peter Amstutz over 7 years ago

The crash may be a red herring. Not long after it crashing, I had to reboot my compute due to a kernel panic (it has some flaky RAM issues...)

Actions #14

Updated by Peter Amstutz over 7 years ago

Other options include:

  1. Work around deadlock by refactoring test cases so that llfuse only runs in a short-lived multiprocess workers
  2. Do nothing, keep it limping along and prioritize rewrite of FUSE mount in Go.
Actions #15

Updated by Peter Amstutz over 7 years ago

Filed a bug report on llfuse, with recommendation to remove pthread_cancel().

https://bitbucket.org/nikratio/python-llfuse/issues/108/using-pthread_cancel-on-workers-leaves

(I tinkered with pthread_cleanup_push() a bit, I don't think that's going to work as it looks like it relies on longjmp, which conflicts with Python's own use of setjmp()).

Actions #16

Updated by Peter Amstutz over 7 years ago

  • Status changed from New to In Progress
Actions #17

Updated by Peter Amstutz over 7 years ago

  • Target version changed from 2017-07-05 sprint to 2017-07-19 sprint
Actions #18

Updated by Tom Morris over 7 years ago

  • Target version changed from 2017-07-19 sprint to 2017-08-02 sprint
Actions #19

Updated by Tom Clegg over 7 years ago

  • Assigned To changed from Peter Amstutz to Tom Clegg
Actions #20

Updated by Tom Clegg over 7 years ago

10805-llfuse-1.2 @ 7e7ada63bca240416584871398076c1bafc90f76
  • Without this change, I see stack traces all over the test output (even though tests are passing).
  • With this change, the stack traces go away.
Actions #21

Updated by Tom Clegg over 7 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:5e5bc2b2153a0af624d75a3c348b996ddf458ef2.

Actions

Also available in: Atom PDF