Project

General

Profile

Actions

Bug #6643

closed

[FUSE] Crash on del self._by_uuid[obj.cache_uuid]

Added by Peter Amstutz over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
07/23/2015
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
0.5

Description

It appears that when certain API server requests fail, the internal state of FUSE is left in an inconsistent state that results in a crash.

2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr WARNING:root:Retry #1 for request: GET https://su92l.arvadosapi.com/arvados/v1/collections/69414010a3d0f286ad6eb5a578801aa1%2B11278592?alt=json, following status: 503
2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr WARNING:root:Retry #2 for request: GET https://su92l.arvadosapi.com/arvados/v1/collections/69414010a3d0f286ad6eb5a578801aa1%2B11278592?alt=json, following status: 503
2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr WARNING:root:Retry #3 for request: GET https://su92l.arvadosapi.com/arvados/v1/collections/69414010a3d0f286ad6eb5a578801aa1%2B11278592?alt=json, following status: 503
2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr 2015-07-08 23:35:20 arvados.arvados_fuse[31747] ERROR: arv-mount 69414010a3d0f286ad6eb5a578801aa1+11278592: error
2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr Traceback (most recent call last):
2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/fusedir.py", line 404, in update
2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr     num_retries=self.num_retries)
2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 1616, in __init__
2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr     super(CollectionReader, self).__init__(manifest_locator_or_text, *args, **kwargs)
2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 1188, in __init__
2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr     self._populate()
2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 1306, in _populate
2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr     error_via_keep))
2015-07-08_23:37:55 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr NotFoundError: Failed to retrieve collection '69414010a3d0f286ad6eb5a578801aa1+11278592' from either API server (<HttpError 503 when requesting https://su92l.arvadosapi.com/arvados/v1/collec
tions/69414010a3d0f286ad6eb5a578801aa1%2B11278592?alt=json returned "Service Unavailable">) or Keep (failed to read 69414010a3d0f286ad6eb5a578801aa1+11278592: no Keep services available (<HttpError 503 when requesting https://su92l.arvadosapi.com/arvados
/v1/keep_services/accessible?alt=json returned "Service Unavailable">)).

2015-07-08_23:37:56 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr ERROR:arvados.arvados_fuse:Unhandled exception during FUSE operation
2015-07-08_23:37:56 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr Traceback (most recent call last):
2015-07-08_23:37:56 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/__init__.py", line 253, in catch_exceptions_wrapper
2015-07-08_23:37:56 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr     return orig_func(self, *args, **kwargs)
2015-07-08_23:37:56 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/__init__.py", line 526, in opendir
2015-07-08_23:37:56 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr     self.inodes.touch(p)
2015-07-08_23:37:56 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/__init__.py", line 218, in touch
2015-07-08_23:37:56 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr     self.inode_cache.touch(entry)
2015-07-08_23:37:56 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/__init__.py", line 180, in touch
2015-07-08_23:37:56 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr     self._remove(obj, False)
2015-07-08_23:37:56 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/__init__.py", line 150, in _remove
2015-07-08_23:37:56 su92l-8i9sb-2sy4hr0v4pgr4rn 24832 15 stderr     del self._by_uuid[obj.cache_uuid]

Subtasks 2 (0 open2 closed)

Task #6717: FixResolvedPeter Amstutz07/23/2015

Actions
Task #6718: Review 6643-fuse-del-crashResolvedPeter Amstutz07/23/2015

Actions
Actions #1

Updated by Peter Amstutz over 9 years ago

  • Description updated (diff)
Actions #2

Updated by Brett Smith over 9 years ago

  • Target version changed from Bug Triage to 2015-08-05 sprint
Actions #3

Updated by Tom Clegg over 9 years ago

Depending on how much time is needed to fix this particular bug, it would be great to add some additional tests designed to find similar sensitivities.

For example, perhaps we can make a "pseudorandomly unreliable API server" mock (sort of like http://golang.org/pkg/testing/iotest/ ?) and verify that, when faced with this situation, a sequence of N userland operations in a FUSE mount
  • doesn't crash arv-mount, and
  • always generates one of an expected list of errors (just "I/O error"?) on the client side in cases where there are enough API failures to make arv-mount give up.
Actions #4

Updated by Peter Amstutz over 9 years ago

  • Assigned To set to Peter Amstutz
Actions #5

Updated by Peter Amstutz over 9 years ago

"Magic directory" creates an inode for collection directory, but did not properly clean up on failure, resulting in object uuid collision when re-attempting collection directory lookup.

Now deletes stale inode on failed collection lookups.

Actions #6

Updated by Peter Amstutz over 9 years ago

  • Status changed from New to In Progress
Actions #7

Updated by Nico César over 9 years ago

reviewing 119d87107846d77274fdaa52dc8cdadd2a7b9765

sdk/python/arvados/events.py
@@ -81,14 +81,18 @@
(..)
for i in items["items"]:

can be changed to

try:
  tmp = items.get("items",None)
except:
  # do something clever.. like logging
for i in tmp:

and the same with items["items_available"]

to avoid unwanted exception crashing if "items" doesn't exists

note: this comment was edited

Actions #8

Updated by Peter Amstutz over 9 years ago

Nico Cesar wrote:

reviewing 119d87107846d77274fdaa52dc8cdadd2a7b9765

[...]

can be changed to
[...]

and the same with items["items_available"]

to avoid unwanted exception crashing if "items" doesn't exists

note: this comment was edited

This is unrelated to the bug being fixed, and I argue this is not a bug because API server is expected to always return "items" on query, unless there was an error, in which case it raises an exception; in either case getting an API request without "items" will result in an exception.

How do you want to proceed? I can walk you through the code if you like.

Actions #9

Updated by Nico César over 9 years ago

ok .. my tests pass... but I was not able to really test it on my machine.

we should merge and deploy and test it in one cluster

Actions #10

Updated by Peter Amstutz over 9 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:ed5230f0e823445f56e7d6940b6fb178e328718b.

Actions

Also available in: Atom PDF