Bug #12306
closed
[arv-mount] --unmount should work on an unresponsive mount
Added by Tom Clegg over 7 years ago.
Updated over 7 years ago.
Description
Currently, if an arv-mount process is in some deadlocked/stuck state, running arv-mount --unmount PATH just hangs instead of unmounting.
When this happens, echo 1 > /sys/fs/fuse/connections/NNN/abort revives the stuck unmount command.
It looks like arv-mount --unmount attempts to lstat() all mount points in /proc/self/mounts and lstat(stuck_mount_path) hangs.
This seems to be the fault of realpath() in source:services/fuse/arvados_fuse/unmount.py:
    while True:
        mounted = False
        for m in mountinfo():
            if m.is_fuse and (mnttype is None or mnttype == m.mnttype):
                try:
                    if os.path.realpath(m.path) == path:
On the shell node where this happened, where /home and /home/foo are both symlinks, arv-mount /home/foo/keep results in /data-sdd/foo/keep appearing in /proc/self/mountinfo, which means realpath() is superfluous here. (Is that true on all systems?)
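A minimal sketch of that idea, not the code in unmount.py: if the kernel already reports resolved mount points, the lookup can be a plain string comparison against the mount point field of each /proc/self/mountinfo line, so nothing calls lstat() on a possibly stuck mount. The function names, the sample text, and the "fuse.keep" filesystem type are assumptions made up for illustration.

```python
import re


def parse_mountinfo(text):
    """Yield (mount_point, fs_type) pairs from /proc/self/mountinfo content."""
    for line in text.splitlines():
        fields = line.split(" ")
        try:
            # Optional fields end at a lone "-"; the fs type follows it.
            sep = fields.index("-", 6)
        except ValueError:
            continue  # malformed or empty line
        if sep + 1 >= len(fields):
            continue
        # The kernel escapes spaces etc. as octal sequences such as \040.
        mount_point = re.sub(
            r"\\([0-7]{3})", lambda m: chr(int(m.group(1), 8)), fields[4])
        yield mount_point, fields[sep + 1]


def is_fuse_mounted(path, mountinfo_text, mnttype=None):
    """Compare path to FUSE mount points as strings, without any stat call."""
    for mount_point, fs_type in parse_mountinfo(mountinfo_text):
        is_fuse = fs_type == "fuse" or fs_type.startswith("fuse.")
        if is_fuse and (mnttype is None or mnttype == fs_type):
            if mount_point == path:
                return True
    return False


# Invented sample, shaped like the shell node described above.
SAMPLE = (
    "25 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw\n"
    "90 25 0:45 / /data-sdd/foo/keep rw,nosuid,nodev shared:50 - fuse.keep keep rw\n"
    "91 25 0:46 / /mnt/my\\040keep rw,nosuid,nodev - fuse.keep keep rw\n"
)
```

With this sample, is_fuse_mounted("/data-sdd/foo/keep", SAMPLE) matches, while the symlinked spelling /home/foo/keep does not, which is why the caller's path still has to be normalized somehow before the comparison.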
These stuck mounts come up occasionally on Jenkins. When they do, all builds get stuck ("UnmountTest" -- presumably because of this bug), until someone clears the stuck mounts manually using ".../connections/NNN/abort" or "fusermount -u -z".
- Target version set to 2017-11-08 Sprint
- Assigned To set to Tom Morris
- Status changed from New to In Progress
- Assigned To changed from Tom Morris to Tom Clegg
So following symlinks to mounts seems weird and not something you would normally do. However, the other thing that realpath() does is turn a relative path into an absolute path, which is probably what we were really trying to use it for. So how about adding this back in?
    path = os.path.abspath(path)
(abspath doesn't use stat(), only os.getcwd()).
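To illustrate the difference, here is a small sketch using a throwaway directory (the directory layout and the function name are invented for the example). abspath() only joins the path with the current directory and normalizes it, so a symlink component stays as written, whereas realpath() follows it.

```python
import os
import tempfile


def demo_abspath_vs_realpath():
    """Return (base, abspath result, realpath result) for 'home/keep'."""
    with tempfile.TemporaryDirectory() as tmp:
        # Resolve the temp dir itself so the comparison below is stable
        # even if the system temp location is reached through a symlink.
        base = os.path.realpath(tmp)
        os.mkdir(os.path.join(base, "data"))
        os.symlink("data", os.path.join(base, "home"))  # home -> data
        old_cwd = os.getcwd()
        os.chdir(base)
        try:
            # Pure string work plus os.getcwd(); "home" stays a symlink.
            absolute = os.path.abspath("home/keep")
            # Follows the "home" symlink on the way to "keep".
            resolved = os.path.realpath("home/keep")
        finally:
            os.chdir(old_cwd)
    return base, absolute, resolved
```

Here abspath() yields <base>/home/keep and realpath() yields <base>/data/keep, so abspath() alone is enough for turning a relative argument into an absolute one, but not for matching a symlinked path against mountinfo.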
Peter Amstutz wrote:
So following symlinks to mounts seems weird and not something you would normally do
On our shell nodes $HOME is typically /home/username where /home is a symlink, so ~/keep doesn't appear in mountinfo but realpath(~/keep) does.
I wonder if it's worth implementing a more careful realpath() that can resolve ~/keep in such situations without calling lstat() on ~/keep itself. Seems like a bit of a rabbit hole, though.
(abspath doesn't use stat(), only os.getcwd()).
Indeed, one less opportunity to fall into the realpath() hole. Added.
12306-dont-stat-mounts @ aabf1ca0e99701550f9af785e9f1fee098b0020a
Tom Clegg wrote:
Peter Amstutz wrote:
So following symlinks to mounts seems weird and not something you would normally do
On our shell nodes $HOME is typically /home/username where /home is a symlink, so ~/keep doesn't appear in mountinfo but realpath(~/keep) does.
Got it. But does that mean arv-mount --unmount won't actually work in this case, when the stuck mount you are trying to unmount is on a symlink path?
I wonder if it's worth implementing a more careful realpath() that can resolve ~/keep in such situations without calling lstat() on ~/keep itself. Seems like a bit of a rabbit hole, though.
How about calling realpath() on the parent directory and then joining it with the mount point?
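A rough sketch of that suggestion, under the assumption that only the final path component can be the stuck mount (resolve_via_parent and the demo layout are hypothetical names, not anything in arvados_fuse). The parent directory is resolved with realpath(), and the last component is joined back on untouched, so it is never passed to lstat() or readlink().

```python
import os
import tempfile


def resolve_via_parent(path):
    """Resolve symlinks in the parent directory only; leave the last
    component alone so a stuck mount point there is never examined."""
    path = os.path.abspath(path)          # no stat calls, just getcwd()
    parent, name = os.path.split(path)
    if not name:                          # path is "/"
        return path
    return os.path.join(os.path.realpath(parent), name)


def demo_resolve_via_parent():
    """Return (base, resolved) for <base>/home/keep where home -> data."""
    with tempfile.TemporaryDirectory() as tmp:
        base = os.path.realpath(tmp)
        os.mkdir(os.path.join(base, "data"))
        os.symlink("data", os.path.join(base, "home"))
        # "keep" is a dangling symlink here: it must not be followed.
        os.symlink("nowhere", os.path.join(base, "data", "keep"))
        resolved = resolve_via_parent(os.path.join(base, "home", "keep"))
    return base, resolved
```

This still hangs if the parent directory is itself inside a stuck mount, and it leaves a symlinked final component unresolved, so it narrows the problem rather than removing it.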
Indeed, the previous version would have ended up calling realpath() on ~/keep on a system where $HOME contains symlinks.
I think I made it back from the rabbit hole with a version that avoids calling realpath in those cases.
12306-dont-stat-mounts @ 08a4ebba0e5bfbc179103ac5e6916164bc8083fa
Tom Clegg wrote:
Indeed, the previous version would have ended up calling realpath() on ~/keep on a system where $HOME contains symlinks.
I think I made it back from the rabbit hole with a version that avoids calling realpath in those cases.
12306-dont-stat-mounts @ 08a4ebba0e5bfbc179103ac5e6916164bc8083fa
Tentatively, safer_realpath seems to work.
I just noticed that arv-mount --unmount unnecessarily requires an API token:
$ arv-mount --unmount keep/
2017-11-08 09:49:38 arvados.arv-mount[7740] ERROR: Missing environment: 'ARVADOS_API_TOKEN'
Unmounting an arv-mount which is stuck with SIGSTOP does remove the mount but doesn't kill the daemon:
- start arv-mount
- send it SIGSTOP
- arv-mount --unmount (works)
- send it SIGCONT
- the arv-mount process is still there
Could be a problem if it is occupying a lot of memory and refusing to go away on its own.
My preferred method to bring the hammer down:
- write to .../connections/NNN/abort, if available
- SIGKILL the arv-mount process
- fusermount -u -z
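The sequence above could be sketched roughly as follows. This is an illustration, not what arv-mount does: force_unmount, its parameters, and the dry_run switch are hypothetical, and the caller is assumed to already know the FUSE connection number (the NNN in the abort path) and the daemon's PID.

```python
import os
import signal
import subprocess


def force_unmount(mountpoint, pid=None, conn_id=None, dry_run=False):
    """Escalate: abort the FUSE connection, SIGKILL the daemon, lazy unmount.

    Returns the list of steps taken (or planned, when dry_run is True).
    """
    steps = []
    if conn_id is not None:
        abort_path = "/sys/fs/fuse/connections/%d/abort" % conn_id
        steps.append(("abort", abort_path))
        # "If available": the control file may be absent or unwritable.
        if not dry_run and os.path.exists(abort_path):
            with open(abort_path, "w") as f:
                f.write("1")
    if pid is not None:
        steps.append(("kill", pid))
        if not dry_run:
            try:
                os.kill(pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # daemon already gone
    cmd = ["fusermount", "-u", "-z", mountpoint]
    steps.append(("run", cmd))
    if not dry_run:
        subprocess.call(cmd)
    return steps
```

Calling force_unmount("/data-sdd/foo/keep", pid=7740, conn_id=42, dry_run=True) only returns the planned steps, which is a cheap way to check the ordering before pointing it at a real mount.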
Otherwise, the main goal of this bugfix (don't get stuck on realpath()) seems to be accomplished, so declare victory and merge.
LGTM
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:0af053088c83d1107866cb06fd6c5736d9065eee.