Story #8345
closed[FUSE] Support llfuse 0.42+
Added by Brett Smith almost 9 years ago. Updated about 8 years ago.
100%
Files
run-build-packages-all-targets.log.txt (558 KB) run-build-packages-all-targets.log.txt | Tom Clegg, 02/20/2016 09:50 AM | ||
ubuntu1204-ruby-crash.txt (249 KB) ubuntu1204-ruby-crash.txt | Tom Clegg, 03/01/2016 04:10 PM |
Updated by Joshua Randall almost 9 years ago
I've already implemented a version of arvados_fuse that runs against 0.42.1 (because I wanted it for the additional operations that get RequestContext). I'll split out the minimum needed to build and submit a PR.
From https://github.com/python-llfuse/python-llfuse/blob/release-0.42.1/Changes.rst:
BACKWARDS INCOMPATIBLE CHANGES: Ducktyping the EntryAttributes class is no longer allowed, Operations methods must return instances of this class (rather than any object that provides the neccessary attributes). The !st_Xtime attributes of the EntryAttributes have been dropped in favor of the !st_Xtime_ns attributes. The setattr method now always receives a completely filled EntryAttributes instance. To determine which attributes should be changed, a new fields argument has been introduced. The ~Operations.setattr method now also receives an fh parameter. The llfuse.main function now has a workers parameter, and the single parameter was dropped. Almost all request handlers now receive a RequestContext instance in an additional parameter.
Updated by Joshua Randall almost 9 years ago
A pair of pull requests implementing arvados_fuse support for llfuse >= 0.42 (including fuse backports for ubuntu1204):
https://github.com/curoverse/arvados-dev/pull/6
https://github.com/curoverse/arvados/pull/36
This has been tested and appears to be working on ubuntu1204. I also ran the jenkins script for all targets, which had some errors that seemed to be unrelated to the work I did. Not sure if it was a transient failure or what - I leave it to curoverse jenkins to test them completely (N.B. must be tested as an arvados / arvados-dev pair).
Updated by Tom Clegg almost 9 years ago
Thanks for doing this!
On my jessie dev system, pip upgraded my llfuse to 0.42.1, but then:
running build_ext Traceback (most recent call last): File "setup.py", line 43, in <module> cmdclass={'egg_info': tagger}, ... File "/home/tom/src/arvados/services/fuse/tests/test_command_args.py", line 2, in <module> import arvados_fuse File "/home/tom/src/arvados/services/fuse/arvados_fuse/__init__.py", line 77, in <module> llfuse.capi._notify_queue = Queue.Queue() AttributeError: 'module' object has no attribute 'capi'
Haven't yet figured out what that's trying to tell me.
centos6 seems to have 2.8.3-4 -- does this mean centos6 won't be able to install python-arvados-fuse?
AFAIK jenkins doesn't make it easy to test a matched pair of arvados-dev and arvados branches without merging them (one of various pitfalls of having stuff split between the two repos). Will try to build locally, but probably won't get to it right away.
Updated by Joshua Randall almost 9 years ago
Oops - forgot to cherry-pick that change, I should have time to update it tomorrow from the market square in Krakow.
Updated by Joshua Randall almost 9 years ago
Regarding centos6, yes, we'll need a fuse 4.9.1+ in order for llfuse 0.42 to work. I guess I can try `fpm -s deb -t rpm` on the backported ubuntu 12.04 package i've created. I'll give that a try and see if it works...
Updated by Joshua Randall almost 9 years ago
I've managed to do another fuse 2.9.2 backport (this time from centos7 to centos6). Testing now, will push soon if all goes well.
Updated by Joshua Randall almost 9 years ago
I've complete the centos6 backport.
https://github.com/wtsi-hgi/arvados-dev/tree/8345-llfuse-unpin f46095f
0910614
https://github.com/wtsi-hgi/arvados/tree/8345-arvados-fuse-llfuse-0.42
Updated by Joshua Randall almost 9 years ago
Re: jenkins testing of arvados / arvados-dev repo pairs, why not include a sit submodule of arvados-dev in the arvados repo and set up jenkins to `git submodule update` in order to get the matched arvados-dev repo at exactly the submodule'd version.
Updated by Joshua Randall almost 9 years ago
In reviewing the python-llfuse 0.42+ support, I found an issue with the setattr operation, whose signature has changed and needs updating. I'll work on that now.
Updated by Joshua Randall almost 9 years ago
Everything should be working with llfuse 0.42 as of https://github.com/wtsi-hgi/arvados/tree/8345-arvados-fuse-llfuse-0.42 @ de8054d6
Updated by Tom Clegg almost 9 years ago
- 07c0534 Fix dependencies after installing built package. refs #8345
- 4640c72 8345: Add libselinux-dev build-dep missed by awk script, skip "recommended" packages
With those, everything seemed to work except workbench, which dies for seemingly unrelated reasons:
Using rails 4.1.12 Using sass-rails 5.0.1 Using piwik_analytics 1.0.2 Using themes_for_rails 0.5.1 from https://github.com/curoverse/themes_for_rails (at /arvados/apps/workbench/vendor/cache/themes_for_rails-61154877047d@6115487) Updating files in vendor/cache Bundle complete! 43 Gemfile dependencies, 108 gems now installed. Bundled gems are installed into ./vendor/bundle. rake aborted! LoadError: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.14' not found (required by /arvados/apps/workbench/vendor/bundle/ruby/2.1.0/gems/oj-2.11.2/ext/oj/oj.so) - /arvados/apps/workbench/vendor/bundle/ruby/2.1.0/gems/oj-2.11.2/ext/oj/oj.so /arvados/apps/workbench/vendor/bundle/ruby/2.1.0/gems/activesupport-4.1.12/lib/active_support/dependencies.rb:247:in `require' /arvados/apps/workbench/vendor/bundle/ruby/2.1.0/gems/activesupport-4.1.12/lib/active_support/dependencies.rb:247:in `block in require' /arvados/apps/workbench/vendor/bundle/ruby/2.1.0/gems/activesupport-4.1.12/lib/active_support/dependencies.rb:232:in `load_dependency' /arvados/apps/workbench/vendor/bundle/ruby/2.1.0/gems/activesupport-4.1.12/lib/active_support/dependencies.rb:247:in `require' /arvados/apps/workbench/vendor/bundle/ruby/2.1.0/gems/oj-2.11.2/lib/oj.rb:42:in `<top (required)>' /usr/local/rvm/gems/ruby-2.1.5/gems/bundler-1.11.2/lib/bundler/runtime.rb:77:in `require' /usr/local/rvm/gems/ruby-2.1.5/gems/bundler-1.11.2/lib/bundler/runtime.rb:77:in `block (2 levels) in require' /usr/local/rvm/gems/ruby-2.1.5/gems/bundler-1.11.2/lib/bundler/runtime.rb:72:in `each' /usr/local/rvm/gems/ruby-2.1.5/gems/bundler-1.11.2/lib/bundler/runtime.rb:72:in `block in require' /usr/local/rvm/gems/ruby-2.1.5/gems/bundler-1.11.2/lib/bundler/runtime.rb:61:in `each' /usr/local/rvm/gems/ruby-2.1.5/gems/bundler-1.11.2/lib/bundler/runtime.rb:61:in `require' /usr/local/rvm/gems/ruby-2.1.5/gems/bundler-1.11.2/lib/bundler.rb:99:in `require' /arvados/apps/workbench/config/application.rb:5:in `<top (required)>' /arvados/apps/workbench/Rakefile:5:in `require' /arvados/apps/workbench/Rakefile:5:in `<top (required)>' (See full trace by running task with --trace) ERROR: Asset precompilation failed ERROR: build failed with exit status 1.
I suspect this has to do with the apps/workbench/vendor/cache directory being shared among the build jobs for different OS vendors/versions.
I'm giving up for now.
Updated by Nico César almost 9 years ago
maybe is related to the glibc update done last week?
we have 2.13-38+deb7u8 there:
ci:/home/nico# grep libc6 /var/log/dpkg.log 2016-02-17 15:26:13 upgrade libc6-dev:amd64 2.13-38+deb7u8 2.13-38+deb7u10 2016-02-17 15:26:13 status half-configured libc6-dev:amd64 2.13-38+deb7u8 2016-02-17 15:26:13 status unpacked libc6-dev:amd64 2.13-38+deb7u8 2016-02-17 15:26:13 status half-installed libc6-dev:amd64 2.13-38+deb7u8 2016-02-17 15:26:14 status half-installed libc6-dev:amd64 2.13-38+deb7u8 2016-02-17 15:26:14 status unpacked libc6-dev:amd64 2.13-38+deb7u10 2016-02-17 15:26:14 status unpacked libc6-dev:amd64 2.13-38+deb7u10 2016-02-17 15:26:16 upgrade libc6:amd64 2.13-38+deb7u8 2.13-38+deb7u10 2016-02-17 15:26:16 status half-configured libc6:amd64 2.13-38+deb7u8 2016-02-17 15:26:16 status unpacked libc6:amd64 2.13-38+deb7u8 2016-02-17 15:26:16 status half-installed libc6:amd64 2.13-38+deb7u8 2016-02-17 15:26:16 status half-installed libc6:amd64 2.13-38+deb7u8 2016-02-17 15:26:16 status unpacked libc6:amd64 2.13-38+deb7u10 2016-02-17 15:26:16 status unpacked libc6:amd64 2.13-38+deb7u10 2016-02-17 15:26:17 configure libc6:amd64 2.13-38+deb7u10 <none> 2016-02-17 15:26:17 status unpacked libc6:amd64 2.13-38+deb7u10 2016-02-17 15:26:17 status unpacked libc6:amd64 2.13-38+deb7u10 2016-02-17 15:26:17 status half-configured libc6:amd64 2.13-38+deb7u10 2016-02-17 15:26:18 status installed libc6:amd64 2.13-38+deb7u10 2016-02-17 15:26:30 configure libc6-dev:amd64 2.13-38+deb7u10 <none> 2016-02-17 15:26:30 status unpacked libc6-dev:amd64 2.13-38+deb7u10 2016-02-17 15:26:30 status half-configured libc6-dev:amd64 2.13-38+deb7u10 2016-02-17 15:26:30 status installed libc6-dev:amd64 2.13-38+deb7u10
Why is it looking for 2.14?
Updated by Brett Smith almost 9 years ago
- Target version set to 2016-03-16 sprint
Updated by Tom Clegg almost 9 years ago
Hm.
[NOTE] You may have encountered a bug in the Ruby interpreter or extension libraries.
Updated by Brett Smith almost 9 years ago
- Category set to FUSE
- Assigned To set to Tom Clegg
Updated by Joshua Randall almost 9 years ago
I did successfully run `run-build-packages-all-targets.sh` when I completed the work ~2 weeks ago, so this is strange. I can try running it again here to see if it is failing for me now as well.
Updated by Tom Clegg almost 9 years ago
- arvados 8345-arvados-fuse-llfuse-0.42 @ 0bbcf2e (merged master, otherwise unchanged from Josh's branch at de8054d)
- arvados-dev 8345-llfuse-unpin @ arvados-dev|8eda2e7
- arvados-dev|27ac563 8345: Build contextlib2 backport, needed by python-llfuse on debian7.
- arvados-dev|a4f8a4e 8345: Add build-essential to ubuntu1204 package-building image
- arvados-dev|cdbcd6d 8345: Build and install new fuse libs before building llfuse.
- arvados-dev|659d736 8345: Use a separate vendor/cache for each OS target. See dev.arvados.org/issues/8345#note-11
Updated by Brett Smith almost 9 years ago
Reviewing arvados-dev 8345-llfuse-pin at arvados-dev|commit:8eda2e7
I reviewed with git log -p
. Individually all the commits make sense to me and the code's all clear. I'm running the build now, but I won't see the results until the morning. If you want to merge, I think that's fine; if you want to wait and see the results of the sanity check of running on someone else's system, I'll let you know what happens. Thanks.
Updated by Tom Clegg almost 9 years ago
Huh. I'm sure this test has been passing with the st_mtime_ns change, but I just happened to try running tests with a clean --temp, and now Something Is Different:
====================================================================== ERROR: runTest (tests.test_mount.FuseSharedTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/tom/src/arvados/services/fuse/tests/test_mount.py", line 243, in runTest self.assertEqual(st.st_mtime_ns, 1397493304) AttributeError: 'posix.stat_result' object has no attribute 'st_mtime_ns' ----------------------------------------------------------------------
Updated by Tom Clegg almost 9 years ago
Hm, looks like the change to st_mtime_ns might be buggy? I changed the test to use st_mtime if st_mtime_ns isn't available, and I get this:
====================================================================== FAIL: runTest (tests.test_mount.FuseSharedTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/tom/src/arvados/services/fuse/tests/test_mount.py", line 247, in runTest self.assertEqual(mtime, 1397493304) AssertionError: 1.3974933040000002 != 1397493304 ----------------------------------------------------------------------
I suspect we need to fix a9fa507c505ab702cae850f4ccd5aae26f6218a9 to multiply by 10^9...
Updated by Brett Smith almost 9 years ago
run-build-package-all-targets.sh succeeded for me, FWIW.
Updated by Brett Smith almost 9 years ago
Current status: I am on arvados b257d006610748cb133c352f9ce95b61d1c66e17, arvados-dev|2b9b7518a60a71315a1504bf96b3182122bec702.
Package building worked fine for me.
Running tests has been a hassle because they frequently hang. It seems like the hang can strike any test: I have seen it happen to tests.test_mount.FuseMvFileBetweenCollectionsTest, tests.test_mount.FuseWriteFileTest, and a third test that was so verbose I didn't catch the name.
% dpkg -l | grep -i fuse ii fuse 2.9.3-15+deb8u2 amd64 Filesystem in Userspace ii libfuse-dev 2.9.3-15+deb8u2 amd64 Filesystem in Userspace (development) ii libfuse2:amd64 2.9.3-15+deb8u2 amd64 Filesystem in Userspace (library)
I have to SIGKILL the tests when this happens. SIGINT gets me nowhere.
For unrelated work, I installed the FUSE driver from this branch and tried using it to upload data, running cat newfile >~/keep/home/…/oldfile
. That showed me this output, which seems at least potentially interesting:
% arv-mount --read-write --foreground ~/keep/ 2016-03-09 16:50:33 arvados.arvados_fuse[27307] ERROR: Unhandled exception during FUSE operation Traceback (most recent call last): File "/home/brett/.local/lib/python2.7/site-packages/arvados_fuse-0.1.20160309185157-py2.7.egg/arvados_fuse/__init__.py", line 277, in catch_exceptions_wrapper return orig_func(self, *args, **kwargs) File "/home/brett/.local/lib/python2.7/site-packages/arvados_fuse-0.1.20160309185157-py2.7.egg/arvados_fuse/__init__.py", line 397, in on_event parent.update() File "/home/brett/.local/lib/python2.7/site-packages/arvados_fuse-0.1.20160309185157-py2.7.egg/arvados_fuse/fusefile.py", line 98, in update self._mtime = convertTime(obj['modified_at']) if 'modified_at' in obj else 0 TypeError: argument of type 'NoneType' is not iterable Exception in thread WebSocketClient: Traceback (most recent call last): File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner self.run() File "/usr/lib/python2.7/threading.py", line 763, in run self.__target(*self.__args, **self.__kwargs) File "build/bdist.linux-x86_64/egg/ws4py/websocket.py", line 427, in run if not self.once(): File "build/bdist.linux-x86_64/egg/ws4py/websocket.py", line 305, in once if not self.process(b): File "build/bdist.linux-x86_64/egg/ws4py/websocket.py", line 377, in process self.received_message(s.message) File "/home/brett/.local/lib/python2.7/site-packages/arvados_python_client-0.1.20160307054726-py2.7.egg/arvados/events.py", line 46, in received_message self.on_event(json.loads(str(m))) File "/home/brett/.local/lib/python2.7/site-packages/arvados_fuse-0.1.20160309185157-py2.7.egg/arvados_fuse/__init__.py", line 290, in catch_exceptions_wrapper raise llfuse.FUSEError(errno.EIO) FUSEError: Input/output error
Updated by Brett Smith almost 9 years ago
Apparently my write succeeded despite the unhandled exception.
Updated by Brett Smith almost 9 years ago
Ran the tests in arvbox on a different system and reproduced the hang, this time in tests.test_mount.FuseMvFileTest.
% sudo cat /proc/559/stack [<ffffffff810d192c>] futex_wait_queue_me+0xcc/0x130 [<ffffffff810d1c02>] futex_wait+0x162/0x260 [<ffffffff81078454>] get_signal_to_deliver+0x104/0x5a0 [<ffffffff810d3d82>] do_futex+0xf2/0xb60 [<ffffffff81078325>] ptrace_notify+0x55/0x80 [<ffffffff81020c0f>] tracehook_report_syscall_exit+0x2f/0xa0 [<ffffffff810d485e>] SyS_futex+0x6e/0x150 [<ffffffff81012ea8>] do_notify_resume+0x78/0xa0 [<ffffffff81513ecd>] system_call_fast_compare_end+0x10/0x15 [<ffffffffffffffff>] 0xffffffffffffffff
Stray thought: We previously saw hangs in arv-mount because llfuse had a limited message queue that could fill and block. arv-mount monkeypatches it now, but is it possible that code no longer works as intended with newer versions of llfuse?
Updated by Joshua Randall almost 9 years ago
I did have to change the line that you use to override the Queue, as the internals had changed and `capi` was no longer exposed. I thought we were ok since the tests passed, but maybe there isn't a test for the queue overloaded situation that required that hack in the first place?
Updated by Tom Clegg almost 9 years ago
for _notify_queue, see #3198#note-43 (note-43) and 08284382b53f621c09c4ffc87d82fa0261a69d32
FWIW, according to those notes, the "queue full" problem required lots of activity. E.g., it seems unlikely that tests.test_mount.FuseWriteFileTest
would have hit it.
So far I've seen only one hang:
runTest (tests.test_mount.FuseNoAPITest) ...
Updated by Tom Clegg almost 9 years ago
dd40196 fixes a bug that is surely related to #8345#note-25 (note-25).
I don't quite see why that would manifest as hanging tests, though.
Updated by Peter Amstutz almost 9 years ago
Going back over my email exchanges with Nikolaus Rath, I reported the Queue full problem to him and he agreed with my fix and asked me to file an issue to change it. However I can't find an issue on the llfuse Bitbucket repo so I may have failed to do so, and as a result the change never made it in.
Updated by Brett Smith almost 9 years ago
Tom Clegg wrote:
dd40196 fixes a bug that is surely related to #8345#note-25 (note-25).
I don't quite see why that would manifest as hanging tests, though.
Unfortunately it doesn't seem to have made any noticeable improvement in my test runs. Still hanging on a variety of tests.
Updated by Brett Smith almost 9 years ago
Ideas to try from here:
- Try removing the debugging additions from the branch, so we're "just" testing llfuse 0.42 support, to try to bisect the problem.
- Try testing in even more environments: arvbox, different distros, etc.
- Convert tests using the old filesystem access methods to the new multiprocessing-based methods.
- Upgrade directly to 1.0.
Updated by Brett Smith almost 9 years ago
- Target version changed from 2016-03-16 sprint to Arvados Future Sprints
Updated by Tom Clegg almost 9 years ago
I'm not certain this is related, but I'm using a different dev box than usual, and this test keeps hanging. As with note-25 above, SIGINT and SIGTERM don't help: I have to SIGKILL. Notably, I'm at 02a2277 (from #8879), which does not have the changes from #8345 that we've been blaming these deadlocks on.
test_cache_spill (tests.test_cache.TmpCollectionTest) ... [keep0] 2016/04/06 15:46:27 [[::1]:35127] PUT 16a7cbedc0cf22673283940f5bb78f36 9 0.004485s 0.004473s 0.000013s 200 86 "OK" [keep1] 2016/04/06 15:46:27 [[::1]:59524] PUT 16a7cbedc0cf22673283940f5bb78f36 9 0.009449s 0.009432s 0.000016s 200 86 "OK" ... [keep0] 2016/04/06 15:46:30 [[::1]:35172] PUT 5b36a2183b880ca29e01bd23559e7080 6 0.003038s 0.003020s 0.000018s 200 86 "OK" 2016-04-06 15:46:31 arvados.arv-mount[20971] WARNING: Mount.__exit__: llfuse thread still alive 0.100000s after umount -- abandoning and exiting anyway ok Sent SIGTERM to 20985 (/home/tom/src/2arvados/tmp/keep0.pid) [keep0] 2016/04/06 15:46:33 caught signal: terminated [keep0] 2016/04/06 15:46:33 keepstore exiting, pid 20985 Sent SIGTERM to 20996 (/home/tom/src/2arvados/tmp/keep1.pid) [keep1] 2016/04/06 15:46:33 caught signal: terminated [keep1] 2016/04/06 15:46:33 keepstore exiting, pid 20996 test_CreateCollectionWithManyFilesAndMoveAndDeleteFile (tests.performance.test_collection_performance.CreateCollectionWithManyFilesAndMoveAndDeleteFile) ...
It looks like process 20971 ("setup.py test") is waiting for 21114 to exit, meanwhile 21114 is waiting for 20971 to write to a pipe. I'm guessing this means that the test has finished, and the multiprocessing pool worker isn't exiting properly. Perhaps llfuse/arvados_fuse's use of threads conflicts with the multiprocessing module somehow?
$ pstree -p 20971 python(20971)─┬─python(21114) ├─sed(20984) ├─sed(20995) ├─{python}(21006) ├─{python}(21007) ├─{python}(21011) ├─{python}(21012) ├─{python}(21055) ├─{python}(21056) ├─{python}(21071) ├─{python}(21072) ├─{python}(21077) ├─{python}(21078) ├─{python}(21083) ├─{python}(21084) ├─{python}(21089) ├─{python}(21090) ├─{python}(21095) ├─{python}(21096) ├─{python}(21101) ├─{python}(21102) ├─{python}(21115) ├─{python}(21116) ├─{python}(21117) ├─{python}(21121) ├─{python}(21122) ├─{python}(21123) └─{python}(21124) $ ls -l /proc/*/fd/* 2>/dev/null | grep 66717 lr-x------ 1 tom tom 64 Apr 6 15:57 /proc/20971/fd/54 -> pipe:[66717] l-wx------ 1 tom tom 64 Apr 6 15:57 /proc/20971/fd/55 -> pipe:[66717] lr-x------ 1 tom tom 64 Apr 6 15:57 /proc/21114/fd/54 -> pipe:[66717] $ strace -f -p 21114 Process 21114 attached - interrupt to quit read(54, ^C <unfinished ...> Process 21114 detached $ strace -f -p 20971 Process 20971 attached with 26 threads - interrupt to quit [pid 21124] read(59, <unfinished ...> ... [pid 20971] futex(0x4adaca0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...> [pid 21115] <... select resumed> ) = 0 (Timeout) [pid 21115] wait4(21114, 0x7fbe58e43730, WNOHANG, NULL) = 0 [pid 21115] select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout) [pid 21115] wait4(21114, 0x7fbe58e43730, WNOHANG, NULL) = 0 [pid 21115] select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout) ... (last two lines repeat until ^C) $ uname -a Linux belle 3.16.0-0.bpo.4-amd64 #1 SMP Debian 3.16.7-ckt25-1~bpo70+1 (2016-04-02) x86_64 GNU/Linux
Is it possible we're hitting an old Python problem that occurs when you use both threading and multiprocessing in the same program?
Some of these might be relevant:- http://stackoverflow.com/questions/24509650/deadlock-with-logging-multiprocess-multithread-python-script
- https://bugs.python.org/issue9205
- https://bugs.python.org/issue9244
- http://bugs.python.org/issue6721
- test
N-1
runs - test
N-1
has an "llfuse thread still alive", which we log but otherwise ignore - test
N
usesmultiprocessing.Pool
to start a process - something goes wrong because there is >1 thread in the main process when
multiprocessing.Pool
creates its worker(s) - test
N
hangs during pool.join() as a result
Updated by Brett Smith almost 9 years ago
- Target version changed from Arvados Future Sprints to 2016-04-27 sprint
Tom should pursue his idea above and see if it helps.
Updated by Tom Clegg almost 9 years ago
- This seems to affect (sometimes) any test that uses MountTestBase or IntegrationTest. Some seem to crash more than others, although this might be determined by test order rather than anything meaningful:
- tests.test_mount.FuseHomeTest
- tests.test_mount.FuseMagicTest
- tests.test_mount.FuseMagicTestPDHOnly
- tests.test_mount.FuseMagicTestPDHOnly
- tests.test_mount.FuseMagicTestPDHOnly
- tests.test_mount.FuseHomeTest
- tests.test_mount.FuseTagsTest
- tests.test_mount.FuseMountTest
- tests.test_mount.FuseWriteFileTest
- tests.test_mount.FuseProjectMkdirRmdirTest
- tests.test_mount.FuseWriteFileTest
- tests.test_mount.FuseMountTest
- tests.test_mount.FuseWriteFileTest
- tests.test_mount.FuseTagsTest
- tests.test_mount.FuseProjectMkdirRmdirTest
- tests.test_mount.FuseMountTest
- tests.test_mount.FuseMvFileBetweenCollectionsTest
- tests.test_mount.FuseHomeTest
- tests.test_mount.FuseProjectMkdirRmdirTest
- tests.test_mount.FuseProjectMkdirRmdirTest
- tests.test_retry.RetryPUT
- tests.test_mount.FuseProjectMkdirRmdirTest
- Suspect multiprocessing and threading libraries just don't work together reliably. In particular, forking from a multithreaded program is known to be hard to do correctly: the forked child process can inherit mutexes which were locked by other threads when fork() happened, which means they will never be released.
- This problem deadlocks all threads, not just our main thread. I tried running a "thread snapshot" thread that dumps a stack trace for every running thread, but the deadlock stopped that thread, too, so I couldn't get a snapshot of the deadlock itself.
- Running test helpers, and the fusermount command, with a multiprocessing pool in which all workers are started before any other library gets a chance to start any other threads (see 8345-prefill-pool).
- Shorter or longer timeout when joining the llfuse thread
- Abandoning the llfuse thread instead of join()ing it
- time.sleep() before join()ing the llfuse thread
- more/fewer log messages near the fusermount call
- time.sleep() before fusermount -u -z
- fusermount -u instead of fusermount -u -z
- Ran 2x 100x MountTestBase (with no prior IntegrationTest), no hang
- Ran 1x 100x IntegrationTest (with no prior MountTestBase), HANG (54)
- Ran 1x 100x IntegrationTest (with prior MountTestBase), no hang
- Ran 1x 100x MountTestBase (with no prior IntegrationTest), no hang
- Ran 1x 100x IntegrationTest (with prior MountTestBase), HANG (62)
- Ran 3x 100x IntegrationTest (with no prior MountTestBase), no hang
This seems to suggest running after an IntegrationTest makes a test more likely to hang.
Replacing llfuse.listdir() with pool.apply(os.listdir, ...) (5ac2b46 on 8345-pool-listdir) seems to increase the incidence of deadlocks (it's hard to be certain). This is true even with the "don't start workers after starting threads" change in 8345-prefill-pool, which suggests the deadlock is related to using multiprocessing workers, not just starting them.
Upgrading to llfuse 1.0 seems to be a matter of dropping "with llfuse.lock" from Operations.destroy(). Presumably, we should do this and skip 0.42 and 0.43.
So, now using llfuse 1.0 but still having deadlock trouble, we're on 8345-llfuse-1.0 at 494f528
Updated by Tom Clegg over 8 years ago
This might be related.
https://ci.curoverse.com/job/arvados-api-server/2950/consoleText
2016-04-21 19:13:40 arvados.arvados_fuse[2170] ERROR: Unhandled exception during FUSE operation Traceback (most recent call last): File "/data/1/jenkins/workspace/arvados-api-server/services/fuse/arvados_fuse/__init__.py", line 277, in catch_exceptions_wrapper return orig_func(self, *args, **kwargs) File "/data/1/jenkins/workspace/arvados-api-server/services/fuse/arvados_fuse/__init__.py", line 483, in forget ent = self.inodes[inode] File "/data/1/jenkins/workspace/arvados-api-server/services/fuse/arvados_fuse/__init__.py", line 215, in __getitem__ return self._entries[item] KeyError: 3L INIT: 7.18 flags=0x00000021 max_readahead=0x00020000 max_write=0x00020000 max_background=0 congestion_threshold=0 unique: 1, success, outsize: 40 /var/lib/jenkins/workspace/arvados-api-server/build/run-tests.sh: line 477: 2170 Segmentation fault "${3}python" setup.py test ${testargs[$1]}
https://pythonhosted.org/llfuse/operations.html says forget() is not allowed to raise exceptions. Unfortunately our generic @catch_exceptions
handler didn't tell us what the original exception was.
https://github.com/python-llfuse/python-llfuse/blob/master/developer-notes/lookup_counts.rst suggests forget() is supposed to do its own locking:
def forget(self, inode):
with lookup_lock:
lookup_count[inode] -= 1
if lookup_count[inode] > 0:
return
del lookup_count[inode]
self.maybe_remove_inode(inode)
But we don't seem to do any such locking:
@catch_exceptions
def forget(self, inodes):
if self._shutdown_started.is_set():
return
for inode, nlookup in inodes:
ent = self.inodes[inode]
_logger.debug("arv-mount forget: inode %i nlookup %i ref_count %i", inode, nlookup, ent.ref_count)
if ent.dec_ref(nlookup) == 0 and ent.dead:
self.inodes.del_entry(ent)
OTOH, it looks like it's already called from a "with lock:" block, so it seems odd that we'd need our own locking too: https://github.com/python-llfuse/python-llfuse/blob/master/src/handlers.pxi#L51-L58
cdef void fuse_forget (fuse_req_t req, fuse_ino_t ino,
ulong_t nlookup) with gil:
try:
with lock:
operations.forget([(ino, nlookup)])
except:
handle_exc(NULL)
fuse_reply_none(req)
Even if forget() is only called with llfuse.lock acquired, perhaps some other code path accesses the inode cache with llfuse lock released -- and this sort of path is anticipated by the llfuse, motivating the extra locking in the example code?
Updated by Brett Smith over 8 years ago
Tom Clegg wrote:
https://pythonhosted.org/llfuse/operations.html says forget() is not allowed to raise exceptions. Unfortunately our generic
@catch_exceptions
handler didn't tell us what the original exception was.
I'm not sure I follow. Why isn't the KeyError you pasted the original exception? It looks like that would've been caught by the bare except:
block, and logged with _logger.exception()
.
Updated by Tom Clegg over 8 years ago
- Target version deleted (
2016-04-27 sprint)
Updated by Tom Clegg over 8 years ago
Brett Smith wrote:
I'm not sure I follow. Why isn't the KeyError you pasted the original exception? It looks like that would've been caught by the bare
except:
block, and logged with_logger.exception()
.
Aha, you're right. I was just looking at the code, which doesn't pass the caught exception to logger.exception. But logger.exception prefers to figure out the current exception using magic, rather than having one passed to it. So yes, KeyError: 3L really is the original exception.
Updated by Brett Smith over 8 years ago
Tom Clegg wrote:
But logger.exception prefers to figure out the current exception using magic, rather than having one passed to it.
Right. And the magic is as simple as "call sys.exc_info()
," so... not that magical. :)
Updated by Tom Clegg about 8 years ago
Rebased to master (now 8345-llfuse-1.0-rebased). Ran 5 test runs in a row. No deadlock.
I suspect 62d8f575a93374a235097b88499fa65eef2f56ec (from #8460) fixed the deadlock we were seeing here.
Updated by Tom Clegg about 8 years ago
- Status changed from In Progress to Resolved
- % Done changed from 67 to 100
Applied in changeset arvados|commit:bfe7a62407aa0a6bc078659f259468f9b4105581.