Bug #6643
closed[FUSE] Crash on del self._by_uuid[obj.cache_uuid]
100%
Description
It appears that when certain API server requests fail, the internal state of FUSE is left in an inconsistent state that results in a crash.
2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr WARNING:root:Retry #1 for request: GET https://su92l.arvadosapi.com/arvados/v1/collections/69414010a3d0f286ad6eb5a578801aa1%2B11278592?alt=json, following status: 503 2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr WARNING:root:Retry #2 for request: GET https://su92l.arvadosapi.com/arvados/v1/collections/69414010a3d0f286ad6eb5a578801aa1%2B11278592?alt=json, following status: 503 2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr WARNING:root:Retry #3 for request: GET https://su92l.arvadosapi.com/arvados/v1/collections/69414010a3d0f286ad6eb5a578801aa1%2B11278592?alt=json, following status: 503 2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr 2015-07-08 23:35:20 arvados.arvados_fuse[31747] ERROR: arv-mount 69414010a3d0f286ad6eb5a578801aa1+11278592: error 2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr Traceback (most recent call last): 2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/fusedir.py", line 404, in update 2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr num_retries=self.num_retries) 2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 1616, in __init__ 2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr super(CollectionReader, self).__init__(manifest_locator_or_text, *args, **kwargs) 2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 1188, in __init__ 2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr self._populate() 2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 1306, in _populate 2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr error_via_keep)) 2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr NotFoundError: Failed to retrieve collection '69414010a3d0f286ad6eb5a578801aa1+11278592' from either API server (<HttpError 503 when requesting https://su92l.arvadosapi.com/arvados/v1/collec tions/69414010a3d0f286ad6eb5a578801aa1%2B11278592?alt=json returned "Service Unavailable">) or Keep (failed to read 69414010a3d0f286ad6eb5a578801aa1+11278592: no Keep services available (<HttpError 503 when requesting https://su92l.arvadosapi.com/arvados /v1/keep_services/accessible?alt=json returned "Service Unavailable">)). 2015-07-08_23:37:56 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr ERROR:arvados.arvados_fuse:Unhandled exception during FUSE operation 2015-07-08_23:37:56 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr Traceback (most recent call last): 2015-07-08_23:37:56 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/__init__.py", line 253, in catch_exceptions_wrapper 2015-07-08_23:37:56 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr return orig_func(self, *args, **kwargs) 2015-07-08_23:37:56 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/__init__.py", line 526, in opendir 2015-07-08_23:37:56 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr self.inodes.touch(p) 2015-07-08_23:37:56 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/__init__.py", line 218, in touch 2015-07-08_23:37:56 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr self.inode_cache.touch(entry) 2015-07-08_23:37:56 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/__init__.py", line 180, in touch 2015-07-08_23:37:56 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr self._remove(obj, False) 2015-07-08_23:37:56 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/__init__.py", line 150, in _remove 2015-07-08_23:37:56 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr del self._by_uuid[obj.cache_uuid]
Updated by Brett Smith over 9 years ago
- Target version changed from Bug Triage to 2015-08-05 sprint
Updated by Tom Clegg over 9 years ago
Depending on how much time is needed to fix this particular bug, it would be great to add some additional tests designed to find similar sensitivities.
For example, perhaps we can make a "pseudorandomly unreliable API server" mock (sort of like http://golang.org/pkg/testing/iotest/ ?) and verify that, when faced with this situation, a sequence of N userland operations in a FUSE mount- doesn't crash arv-mount, and
- always generates one of an expected list of errors (just "I/O error"?) on the client side in cases where there are enough API failures to make arv-mount give up.
Updated by Peter Amstutz over 9 years ago
"Magic directory" creates an inode for collection directory, but did not properly clean up on failure, resulting in object uuid collision when re-attempting collection directory lookup.
Now deletes stale inode on failed collection lookups.
Updated by Peter Amstutz over 9 years ago
- Status changed from New to In Progress
Updated by Nico César over 9 years ago
reviewing 119d87107846d77274fdaa52dc8cdadd2a7b9765
sdk/python/arvados/events.py @@ -81,14 +81,18 @@ (..) for i in items["items"]:
can be changed to
try: tmp = items.get("items",None) except: # do something clever.. like logging for i in tmp:
and the same with items["items_available"]
to avoid unwanted exception crashing if "items" doesn't exists
note: this comment was edited
Updated by Peter Amstutz over 9 years ago
Nico Cesar wrote:
reviewing 119d87107846d77274fdaa52dc8cdadd2a7b9765
[...]
can be changed to
[...]and the same with
items["items_available"]
to avoid unwanted exception crashing if "items" doesn't exists
note: this comment was edited
This is unrelated to the bug being fixed, and I argue this is not a bug because API server is expected to always return "items" on query, unless there was an error, in which case it raises an exception; in either case getting an API request without "items" will result in an exception.
How do you want to proceed? I can walk you through the code if you like.
Updated by Nico César over 9 years ago
ok .. my tests pass... but I was not able to really test it on my machine.
we should merge and deploy and test it in one cluster
Updated by Peter Amstutz over 9 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:ed5230f0e823445f56e7d6940b6fb178e328718b.