Bug #23136
Closed
Intermittent failures when bind-mounting things into an arv-mount tmp directory
Description
It has been reported that, since 3.1.1, when running this workflow some steps fail due to AWS credentials issues, others do nothing but report as "Complete", and others correctly download the requested data.
Files
Updated by Brett Smith 7 months ago
- Description updated (diff)
If we've seen this in 3.1.1, and that was an "installer bugfixes only" release, I think it's fair to start looking at changes in 3.1.0. On that list, #22420 jumps out as the kind of bug fix that could've accidentally introduced a new race condition or something like that.
Trying a reproduction at pirca-j7d0g-upcguy83q3ya755
Updated by Brett Smith 7 months ago
- File MountTest.sh MountTest.sh added
Tried to reproduce with the attached script and did not succeed. Going to do a version 2 that actually mounts to Docker with options closer to the ones crunch actually uses and try that.
Updated by Brett Smith 7 months ago
Second try to reproduce the issue, this time using more common arv-mount options and making the mount available through Docker. All 2500 runs passed on my machine.
Going to try again with disk cache. If that still passes, I might want to start trying to reproduce on a Linux system as close as possible to ones where we've seen this.
Updated by Brett Smith 7 months ago
- File MountTest.sh MountTest.sh added
Third try with disk cache, still succeeded.
Updated by Brett Smith 7 months ago
One stray thought I had: you know how you can break shell scripts by editing them while they're running? I wonder if there's some situation where the shell gets a partial read of the file, like zero bytes or just the shebang line or something. A read like that would be consistent with the behavior we're seeing, although I have no explanation for how it happens. arv-mount should know immediately how large the script is (from the collection manifest) and be able to report that.
Updated by Brett Smith 7 months ago
I thought this line might be masking the bug:
while ! [ -d mnt/tmp0 ]; do sleep .25s; done
But crunch-run has analogous code in source:lib/crunchrun/crunchrun.go:
go func() {
	for keepStatting {
		time.Sleep(100 * time.Millisecond)
		_, err = os.Stat(fmt.Sprintf("%s/by_id/README", runner.ArvMountPoint))
		if err == nil {
			keepStatting = false
			statReadme <- true
		}
	}
	close(statReadme)
}()
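For illustration, the same readiness poll could be written in Python (a hypothetical helper, not crunch-run code; names are made up):

```python
import os
import time

def wait_for_path(path, interval=0.1, timeout=10.0):
    """Poll until `path` exists, like crunch-run's README stat loop.

    Returns True once os.stat() succeeds, False if the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            os.stat(path)
            return True
        except FileNotFoundError:
            time.sleep(interval)
    return False
```

Note this only tells you the path exists, not that every file under the mount is fully populated, which is part of why this kind of check can mask (or fail to mask) the race discussed below.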
Updated by Tom Clegg 7 months ago
Another place to look: if the local keepstore process fails to retrieve a block requested by arv-mount, does arv-mount reliably propagate that error back to the reader process (shell)? It's conceivable that when lots of containers are downloading from S3, and keepstore is competing with them for S3 bandwidth/quota (especially if the buckets are in the same account?), such errors are more frequent. Perhaps this contention causes a particular class of keepstore/S3 error that isn't handled properly.
Updated by Brett Smith 7 months ago
- Target version changed from Development 2025-09-03 to Development 2025-09-17
Updated by Brett Smith 7 months ago
Tom Clegg wrote in #note-11:
Another place to look: if the local keepstore process fails to retrieve a block requested by arv-mount, does arv-mount reliably propagate that error back to the reader process (shell)?
I rigged up my test so that the host had an /etc/hosts entry for the Keep service pointing to an unreachable address. With that in place, arv-mount logged the exception:
2025-09-03 13:34:17 arvados.arvados_fuse[555194] ERROR: Unhandled exception during FUSE operation
Traceback (most recent call last):
File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.11/site-packages/arvados_fuse/__init__.py", line 547, in catch_exceptions_wrapper
return orig_func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.11/site-packages/arvados_fuse/__init__.py", line 900, in read
r = handle.obj.readfrom(off, size, self.num_retries)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.11/site-packages/arvados_fuse/fusefile.py", line 66, in readfrom
return self.arvfile.readfrom(off, size, num_retries, exact=True, return_memoryview=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.11/site-packages/arvados/arvfile.py", line 1043, in readfrom
block = self.parent._my_block_manager().get_block_contents(lr.locator, num_retries=num_retries, cache_only=(bool(data) and not exact))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.11/site-packages/arvados/arvfile.py", line 696, in get_block_contents
return self._keep.get(locator, num_retries=num_retries)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.11/site-packages/arvados/retry.py", line 245, in num_retries_setter
return orig_func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.11/site-packages/arvados/keep.py", line 1097, in get
return self._get_or_head(loc_s, method="GET", **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.11/site-packages/arvados/keep.py", line 1246, in _get_or_head
raise arvados.errors.KeepReadError(
arvados.errors.KeepReadError: [req-9yia9240s704xz3kyeee] failed to read 2d44620559aa642824845f07d810e07a+4626+A[redacted]@68caedf6 after 5 attempts: service https://keep.pirca.arvadosapi.com:443/ responded with 0 (28, 'Failed to connect to keep.pirca.arvadosapi.com port 443 after 130822 ms: Could not connect to server')
And eventually that bubbles up to the test process:
sha256sum: /keep/by_id/4fce132bc4ae9b4cc844115246a6bd41+175/generatereport.py: Input/output error
/keep/by_id/4fce132bc4ae9b4cc844115246a6bd41+175/generatereport.py: FAILED open or read
This is not perfectly analogous since it's a networking error rather than an HTTP one. But this suggests at least some errors are handled as you'd hope, and makes it more surprising that we haven't seen arv-mount log anything in either failure mode.
Updated by Brett Smith 7 months ago
I am starting to become as suspicious of secret handling as I am of arv-mount. Secret inputs were basically unusable until 3.0.0 shipped 9e030521667e7998f8bde11a6f66dd6fe4c081c3 and this is basically our only workflow that exercises them. Compared to an arv-mount problem, the timeline makes a little less sense; but on the flip side, it's a better explanation for why the problem only seems to be affecting this workflow, and why we're getting the explicit failure case. From there, the silent failure case could be explained by something like "incomplete credentials cause aws s3 to fast exit 0." Very vague theory for sure, but at least as plausible as anything else we have right now.
Updated by Lucas Di Pentima 7 months ago
Brett Smith wrote in #note-15:
From there, the silent failure case could be explained by something like "incomplete credentials cause aws s3 to fast exit 0." Very vague theory for sure, but at least as plausible as anything else we have right now.
Note that logs don't even show the attempt to execute aws s3 cp ..., and the download.sh script has set -x in it.
Updated by Brett Smith 7 months ago
Lucas Di Pentima wrote in #note-16:
Note that logs don't even show the attempt to execute aws s3 cp ..., and the download.sh script has set -x in it.
Yeah, hmmm. Maybe it's the combination, maybe something about the way we set up secret mounts either triggers a latent bug in arv-mount or interferes with the way we make it available to the container. I should trace that code path and try to add it to my reproduction script.
Updated by Brett Smith 7 months ago
- File MountTest.sh MountTest.sh added
I have a meeting now and can't do a full write-up but the attached version successfully reproduces the issue. The issue is that results are unpredictable when a container layers mounts the way the S3 downloader does. The results show a container exiting 0 but not generating any script output, much like we're seeing in the silent success case.
% sh ~/Curii/MountTest.sh
2025-09-04 10:24:57 arvados.arv_put[557765] INFO: Creating new cache file at /home/brett/.cache/arvados/arv-put/6c9d01e002728bb0c2402ddf25b2c529
0M / 0M 100.0%
2025-09-04 10:24:57 arvados.arv_put[557765] INFO:
2025-09-04 10:24:57 arvados.arv_put[557765] INFO: Collection updated: 'Test Script'
Run #5 at 2025-09-04 10:25:03-04:00...
--- expected.out.log	2025-09-04 10:25:03.681983970 -0400
+++ container.out.log	2025-09-04 10:25:04.193978419 -0400
@@ -1,2 +0,0 @@
-cat test.sh zecret.txt | tee cat.log
-Secret000005
ERROR: Container #5 stdout mismatch
% ls -l *.log
-rw-r--r-- 1 166 2025-09-04 10:25 arv-mount.err.log
-rw-r--r-- 1   0 2025-09-04 10:24 arv-mount.out.log
-rw-r----- 1  37 2025-09-04 10:24 cat.log
-rw-r----- 1   0 2025-09-04 10:25 container.err.log
-rw-r----- 1   0 2025-09-04 10:25 container.out.log
-rw-r----- 1  39 2025-09-04 10:24 expected.err.log
-rw-r----- 1  50 2025-09-04 10:25 expected.out.log
Updated by Brett Smith 7 months ago
- Subject changed from aws-s3-bulk-download workflow failures to Inconsistent mount layering causes aws-s3-bulk-download failures
Updated by Tom Clegg 7 months ago
- Related to Bug #23142: lib/crunchrun flaky test singularitySuite.TestImageCache_Concurrency_10 added
Updated by Brett Smith 7 months ago
- File MountTest.sh MountTest.sh added
I have not been able to reproduce the issue using just Docker or just arv-mount. (The arv-mount test ran a script out of by_id and wrote the output to tmp0 in one command.) Right now it seems to be specific to the way the mount gets propagated to the container.
I have simplified the reproduction so it only mounts the script and the tmp collection. This means we can rule out one of the mounts being secret as a factor. I also ran arv-mount with --debug to get more information.
Like the real workflow, it fails two ways. In one case, the container exits 0 but outputs nothing. In another, docker run exits 125 and writes this message to stderr:
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/run/user/1000/MountTest-PbWBJSyj/mnt/by_id/d0b2917ad927fbe02735c59a17891308+60/test.sh" to rootfs at "/mnt/test.sh": change mount propagation through procfd: mount dst=/mnt/test.sh, dstFd=/proc/thread-self/fd/8, flags=0x44000: invalid argument: unknown
See new script attached. I need to pick through the arv-mount debug logs and see if that reveals anything.
Updated by Brett Smith 7 months ago
The script exhibits both failure modes with Docker 27, so I'm back to suspecting arv-mount changes as the source.
At the same time, I'm also aware that all the clusters we've seen this issue on were recently upgraded. It's possible that the real change is in Linux itself. But if that's the case there's basically nothing we're going to be able to do about that.
Updated by Brett Smith 7 months ago
Sometimes docker run also exits 127 with this slightly different error (maybe depending on which mount fails?):
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/run/user/1000/MountTest-pzjmN4uo/mnt/by_id/d0b2917ad927fbe02735c59a17891308+60/test.sh" to rootfs at "/mnt/test.sh": mount src=/run/user/1000/MountTest-pzjmN4uo/mnt/by_id/d0b2917ad927fbe02735c59a17891308+60/test.sh, dst=/mnt/test.sh, dstFd=/proc/thread-self/fd/8, flags=0x5000: no such file or directory: unknown
I have also seen it exit 125 with this error:
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/run/user/1000/MountTest-0QMrSc7D/mnt/by_id/d0b2917ad927fbe02735c59a17891308+60/test.sh" to rootfs at "/mnt/test.sh": possibly malicious path detected -- refusing to operate on /var/lib/docker/overlay2/68268f4da76a8041e365b254e6238cf9fb0e2b4c4a5097ba8aafc6368e796eb9/merged/mnt/test.sh (deleted): unknown
All errors happen even if I add the following code above docker run to try to ensure the mount is ready for work beforehand:
while ! [ -d mnt/tmp0 ]; do sleep .25s; done
diff -u test.sh "mnt/by_id/$SCRIPT_PDH/test.sh" ||
fail 15 "cmp #$RUN exited $?"
Updated by Brett Smith 7 months ago
- Assigned To set to Brett Smith
- Status changed from New to In Progress
- Category changed from Crunch to FUSE
- Subject changed from Inconsistent mount layering causes aws-s3-bulk-download failures to Intermittent failures when bind-mounting things into an arv-mount tmp directory
- File MountTest.sh MountTest.sh added
I have a smaller reproduction: all you need is the arv-mount tmp directory, and then you bind mount something, anything, into it. Doing this exhibits all the different failure modes.
https://github.com/moby/moby/issues/26051 sort of gestures at some of the mechanics that the runtime uses to set up bind mounts. Note in particular that it has to create mount points on the original host (the arv-mount tmp directory in our case) to do this. It seems like there might be some race condition where arv-mount doesn't have the mount point ready in time for the runtime to use. Possibly arv-mount is telling Linux some operation is finished before it actually is.
Further confirmation: if you manually create the destination mount point (e.g., add touch mnt/tmp0/test.sh before docker run), the issues go away and the test script succeeds. You prevent the race by ensuring the destination completely exists before Docker tries to manipulate it.
I can now explain why the S3 download workflow sometimes fails silently and other times complains about lack of credentials: it's just a question of which file fails to mount. If you race such that download.sh is an empty mount point, bash runs an empty script and you get the silent exit 0. If you race such that .aws/credentials is an empty mount point, aws complains about the lack of credentials.
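The "create the destination mount point before docker run" workaround could be sketched in Python; precreate_mount_points and its arguments are hypothetical names for illustration, not crunch-run's actual API:

```python
import pathlib

def precreate_mount_points(mount_root, bind_targets):
    """Workaround sketch: make sure every bind-mount destination already
    exists inside the FUSE tmp directory before `docker run` starts, so
    the runtime never races against arv-mount creating them.

    `bind_targets` maps mount-root-relative paths to "file" or "dir",
    e.g. {"test.sh": "file", ".aws": "dir"}.
    """
    root = pathlib.Path(mount_root)
    created = []
    for rel, kind in bind_targets.items():
        dest = root / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        if kind == "dir":
            dest.mkdir(exist_ok=True)
        else:
            dest.touch(exist_ok=True)  # equivalent of `touch mnt/tmp0/test.sh`
        created.append(str(dest))
    return created
```

This only prevents the race from the container runtime's side; the underlying arv-mount notification-ordering bug discussed later would still be there.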
Updated by Brett Smith 7 months ago
Brett Smith wrote in #note-4:
If we've seen this in 3.1.1, and that was an "installer bugfixes only" release, I think it's fair to start looking at changes in 3.1.0. On that list, #22420 jumps out as the kind of bug fix that could've accidentally introduced a new race condition or something like that.
I just ran my reproduction test script with arv-mount 3.0.0 installed in a virtualenv. The script passed without any failed runs. So yeah now this is 100% my top candidate.
Updated by Brett Smith 6 months ago
- Assigned To deleted (Brett Smith)
- Status changed from In Progress to New
Updated by Brett Smith 6 months ago
- Target version deleted (Development 2025-09-17)
Updated by Brett Smith 6 months ago
- File arv-mount.err.log arv-mount.err.log added
- File arv-mount.empty.log arv-mount.empty.log added
- File arv-mount.success.log arv-mount.success.log added
See logs attached from a successful run; a run where test.sh was not found; and a run where test.sh was empty. It seems like what distinguishes success is that we do GETATTR on test.sh before RELEASEDIR.
Updated by Tom Clegg 6 months ago
Yes, I see the same pattern.
When fusetest.sh succeeds, arv-mount debug logs have MKNOD-NOTIFY-RELEASEDIR-GETATTR, like this
unique: 12, opcode: MKNOD (8), nodeid: 2, insize: 64, pid: 918457
2025-09-22 20:49:44 arvados.arvados_fuse[918394] DEBUG: arv-mount mknod: parent_inode 2 'test.sh' 100644
unique: 12, success, outsize: 144
NOTIFY: code=3 length=40
unique: 14, opcode: RELEASEDIR (29), nodeid: 2, insize: 64, pid: 0
2025-09-22 20:49:44 arvados.arvados_fuse[918394] DEBUG: arv-mount release fh 0
unique: 14, success, outsize: 16
unique: 16, opcode: GETATTR (3), nodeid: 2, insize: 56, pid: 918457
unique: 16, success, outsize: 120
When fusetest.sh fails, arv-mount debug logs have MKNOD-RELEASEDIR-GETATTR-NOTIFY, like this
unique: 12, opcode: MKNOD (8), nodeid: 2, insize: 64, pid: 785032
2025-09-19 11:14:33 arvados.arvados_fuse[784976] DEBUG: arv-mount mknod: parent_inode 2 'test.sh' 100644
unique: 12, success, outsize: 144
unique: 14, opcode: RELEASEDIR (29), nodeid: 2, insize: 64, pid: 0
2025-09-19 11:14:33 arvados.arvados_fuse[784976] DEBUG: arv-mount release fh 0
unique: 14, success, outsize: 16
unique: 16, opcode: GETATTR (3), nodeid: 2, insize: 56, pid: 785032
unique: 16, success, outsize: 120
NOTIFY: code=3 length=40
I think:
- RELEASEDIR timing is irrelevant
- if GETATTR is processed before NOTIFY, filesystem semantics are broken
With some added debug logs, we can see the NOTIFY comes from "add" and "mod" events, as in this failure:
unique: 12, opcode: MKNOD (8), nodeid: 2, insize: 64, pid: 920138
2025-09-22 20:52:33 arvados.arvados_fuse[920079] DEBUG: arv-mount mknod: parent_inode 2 'test.sh' 100644
2025-09-22 20:52:33 arvados.arvados_fuse[920079] DEBUG: event add name test.sh
2025-09-22 20:52:33 arvados.arvados_fuse[920079] DEBUG: event mod name test.sh
unique: 12, success, outsize: 144
unique: 14, opcode: RELEASEDIR (29), nodeid: 2, insize: 64, pid: 0
2025-09-22 20:52:33 arvados.arvados_fuse[920079] DEBUG: arv-mount release fh 0
unique: 14, success, outsize: 16
unique: 16, opcode: GETATTR (3), nodeid: 2, insize: 56, pid: 920138
unique: 16, success, outsize: 120
NOTIFY: code=3 length=40
With a "wait for pending NOTIFY to complete" in place in CollectionDirectoryBase.on_event(), NOTIFY is always immediately after MKNOD:
unique: 12, opcode: MKNOD (8), nodeid: 2, insize: 64, pid: 1005522
2025-09-22 21:29:02 arvados.arvados_fuse[1005465] DEBUG: arv-mount mknod: parent_inode 2 'test.sh' 100644
unique: 12, success, outsize: 144
NOTIFY: code=3 length=40
So far I've run fusetest.sh 6 times with this change, and all 600 trials have succeeded.
This seems promising. There may be more places where a similar fix is needed.
23136-flush-invalidate @ b7c05540d6c5c100394a11b2a668ccb3870993c5 -- developer-run-tests: #4890
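The queued-invalidation fix described above can be sketched with the standard library: the op handler blocks on Queue.join() until a worker thread has processed (and task_done()'d) every pending invalidation, so the NOTIFY reaches the kernel before the handler returns. This is a schematic illustration only; it is not the actual CollectionDirectoryBase/llfuse code, and all names here are invented:

```python
import queue
import threading

class InvalidateQueue:
    """Schematic joinable notify/invalidate queue (not the llfuse API)."""

    def __init__(self):
        self._q = queue.Queue()
        self.delivered = []
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            event = self._q.get()
            try:
                if event is None:
                    return
                # The real code would call the kernel notification here
                # (e.g. an invalidate_entry equivalent).
                self.delivered.append(event)
            finally:
                self._q.task_done()

    def invalidate(self, name):
        self._q.put(name)

    def wait_pending(self):
        # Block until every queued invalidation has been processed, so
        # a subsequent GETATTR cannot overtake the NOTIFY.
        self._q.join()
```

The key property is that wait_pending() returns only after task_done() has been called for every enqueued event, which is exactly the ordering guarantee the failing runs lacked.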
Updated by Brett Smith 6 months ago
- Target version set to Development 2025-10-01
- Assigned To set to Tom Clegg
Updated by Tom Clegg 6 months ago
- Status changed from New to In Progress
I haven't been able to reproduce this with a python test case. (I thought I was onto something for a while because my "repeatedly rename and stat in different threads" test was deadlocking, but it was just that the test tearDown function hangs when it races with an in-progress fuse operation.)
It does seem plausible that the mount syscall would be uniquely sensitive to this race, which makes it annoying to test. Do we want to refactor fusetest.sh into a test case in services/fuse that depends on Docker and runs 100x when explicitly selected, or something like that?
Here's a slightly updated patch, with the "wait for invalidate queue to finish" bit moved up before re-acquiring the collection mutex, which seems safer.
23136-flush-invalidate @ 4851cebeb4c278fd967a76d9c2d0559259b569d2 -- developer-run-tests: #4891
Updated by Brett Smith 6 months ago
Tom Clegg wrote in #note-31:
It does seem plausible that the mount syscall would be uniquely sensitive to this race, which makes it annoying to test. Do we want to refactor fusetest.sh into a test case in services/fuse that depends on Docker and runs 100x when explicitly selected, or something like that?
On my system, once I got it failing reliably I think it always failed within the first five tries. If you're seeing something noticeably different let's talk about it, but I think I'd be open to a test that runs 10 tries as part of the normal test suite. Even if it doesn't fail 100% of the time, failing more often than not is good enough.
Updated by Tom Clegg 6 months ago
I finally noticed that we can do N trials without restarting arv-mount if we just delete test.sh from the temporary collection after each trial.
I've added a test that does 10 iterations, which takes about 10 seconds. With the fix commented out, it usually fails.
23136-flush-invalidate @ 52bea788b3033a1074040c8da906c6422395ae24 -- developer-run-tests: #4894
(workbench2 failed for obviously unrelated reasons)
Updated by Brett Smith 6 months ago
- Target version changed from Development 2025-10-01 to Development 2025-10-15
Updated by Brett Smith 6 months ago
I have pushed some minor style changes to the branch. But even without those, I am seeing ~5% failure rate when running DockerRaceTest repeatedly. And I have seen both assertions fail. I am concerned the branch mitigates but does not actually fix the issue. I am open to the possibility the problem is in the test or the test environment rather than the fix, but I'd need some good convincing as to why that is.
To that point about environment, Docker versions and filesystems for posterity:
arvdev ⟩ findmnt --real
TARGET SOURCE FSTYPE OPTIONS
/ /dev/mapper/xps9310-var--lib--machines[/arvdev] btrfs rw,relatime,ssd,space_cache=v2,subvo
├─/run/host/os-release /dev/mapper/xps9310-root--debian12[/usr/lib/os-release] ext4 ro,nosuid,nodev,noexec,relatime,erro
├─/run/host/os-release /dev/mapper/xps9310-root--debian12[/usr/lib/os-release] ext4 rw,relatime,errors=remount-ro
└─/home/brett/Curii /dev/mapper/xps9310-home[/brett/Curii] ext4 rw,relatime
arvdev ⟩ findmnt /tmp
TARGET SOURCE FSTYPE OPTIONS
/tmp tmpfs tmpfs rw,nosuid,nodev,size=1608148k,nr_inodes=409600,inode64
arvdev ⟩ docker version
Client: Docker Engine - Community
Version: 28.4.0
API version: 1.51
Go version: go1.24.7
Git commit: d8eb465
Built: Wed Sep 3 20:57:37 2025
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 28.4.0
API version: 1.51 (minimum version 1.24)
Go version: go1.24.7
Git commit: 249d679
Built: Wed Sep 3 20:57:37 2025
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.7.27
GitCommit: 05044ec0a9a75232cad458027ca83437aae3f4da
runc:
Version: 1.2.5
GitCommit: v1.2.5-0-g59923ef
docker-init:
Version: 0.19.0
GitCommit: de40ad0
___________________________________________________ DockerRaceTest.runTest ___________________________________________________
self = <tests.test_mount.DockerRaceTest testMethod=runTest>
    def runTest(self):
        self.make_mount(fuse.TmpCollectionDirectory, fuse_options=["allow_other"])
        os.chmod(self.mounttmp, 0o755)
        with tempfile.NamedTemporaryFile(suffix='.sh') as scriptfile:
            scriptfile.write(b"#!/bin/sh\necho OK\n")
            scriptfile.flush()
            os.chmod(scriptfile.name, 0o755)
            for _ in range(10):
                dockerrun = subprocess.run(
                    ["docker", "run",
                     "--rm",
                     "--workdir", "/mnt",
                     "--mount", f"type=bind,dst=/mnt,src={self.mounttmp}",
                     "--mount", f"type=bind,dst=/mnt/test.sh,src={scriptfile.name}",
                     "busybox:uclibc", "sh", "test.sh"],
                    stdout=subprocess.PIPE,
                    stderr=2)
                self.assertEqual(dockerrun.returncode, 0)
>               self.assertEqual(dockerrun.stdout, b"OK\n")
E               AssertionError: b'' != b'OK\n'
tests/test_mount.py:1521: AssertionError
================================================== short test summary info ===================================================
FAILED tests/test_mount.py::DockerRaceTest::runTest - AssertionError: b'' != b'OK\n'
============================================= 1 failed, 148 deselected in 3.32s ==============================================
___________________________________________________ DockerRaceTest.runTest ___________________________________________________
self = <tests.test_mount.DockerRaceTest testMethod=runTest>
    def runTest(self):
        self.make_mount(fuse.TmpCollectionDirectory, fuse_options=["allow_other"])
        os.chmod(self.mounttmp, 0o755)
        scriptfile = tempfile.NamedTemporaryFile(delete=False)
        try:
            scriptfile.write(b"#!/bin/sh\necho OK\n")
            scriptfile.close()
            os.chmod(scriptfile.name, 0o755)
            for _ in range(10):
                dockerrun = subprocess.run(
                    ["docker", "run",
                     "--rm",
                     "--workdir", "/mnt",
                     "--mount", f"type=bind,dst=/mnt,src={self.mounttmp}",
                     "--mount", f"type=bind,dst=/mnt/test.sh,src={scriptfile.name}",
                     "busybox:uclibc", "sh", "test.sh"],
                    stdout=subprocess.PIPE,
                    stderr=2)
>               self.assertEqual(dockerrun.returncode, 0)
E               AssertionError: 125 != 0
tests/test_mount.py:1525: AssertionError
---------------------------------------------------- Captured stderr call ----------------------------------------------------
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/tmp/tmpr7fps_dj" to rootfs at "/mnt/test.sh": possibly malicious path detected -- refusing to operate on /var/lib/docker/overlay2/06fa01931056931304c6e6eeff3e9d6867649ad9b4a2b51910b83e9fa566d297/merged/mnt/test.sh (deleted): unknown
Run 'docker run --help' for more information
================================================== short test summary info ===================================================
FAILED tests/test_mount.py::DockerRaceTest::runTest - AssertionError: 125 != 0
============================================= 1 failed, 148 deselected in 4.90s ==============================================
___________________________________________________ DockerRaceTest.runTest ___________________________________________________
self = <tests.test_mount.DockerRaceTest testMethod=runTest>
    def runTest(self):
        self.make_mount(fuse.TmpCollectionDirectory, fuse_options=["allow_other"])
        os.chmod(self.mounttmp, 0o755)
        with tempfile.NamedTemporaryFile(suffix='.sh') as scriptfile:
            scriptfile.write(b"#!/bin/sh\necho OK\n")
            scriptfile.flush()
            os.chmod(scriptfile.name, 0o755)
            for _ in range(10):
                dockerrun = subprocess.run(
                    ["docker", "run",
                     "--rm",
                     "--workdir", "/mnt",
                     "--mount", f"type=bind,dst=/mnt,src={self.mounttmp}",
                     "--mount", f"type=bind,dst=/mnt/test.sh,src={scriptfile.name}",
                     "busybox:uclibc", "sh", "test.sh"],
                    stdout=subprocess.PIPE,
                    stderr=2)
>               self.assertEqual(dockerrun.returncode, 0)
E               AssertionError: 125 != 0
tests/test_mount.py:1520: AssertionError
---------------------------------------------------- Captured stderr call ----------------------------------------------------
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/tmp/tmpvej09n4p.sh" to rootfs at "/mnt/test.sh": change mount propagation through procfd: mount dst=/mnt/test.sh, dstFd=/proc/thread-self/fd/8, flags=0x44000: invalid argument: unknown
Run 'docker run --help' for more information
================================================== short test summary info ===================================================
FAILED tests/test_mount.py::DockerRaceTest::runTest - AssertionError: 125 != 0
============================================= 1 failed, 148 deselected in 4.06s ==============================================
___________________________________________________ DockerRaceTest.runTest ___________________________________________________
self = <tests.test_mount.DockerRaceTest testMethod=runTest>
    def runTest(self):
        self.make_mount(fuse.TmpCollectionDirectory, fuse_options=["allow_other"])
        os.chmod(self.mounttmp, 0o755)
        with tempfile.NamedTemporaryFile(suffix='.sh') as scriptfile:
            scriptfile.write(b"#!/bin/sh\necho OK\n")
            scriptfile.flush()
            os.chmod(scriptfile.name, 0o755)
            for _ in range(10):
                dockerrun = subprocess.run(
                    ["docker", "run",
                     "--rm",
                     "--workdir", "/mnt",
                     "--mount", f"type=bind,dst=/mnt,src={self.mounttmp}",
                     "--mount", f"type=bind,dst=/mnt/test.sh,src={scriptfile.name}",
                     "busybox:uclibc", "sh", "test.sh"],
                    stdout=subprocess.PIPE,
                    stderr=2)
>               self.assertEqual(dockerrun.returncode, 0)
E               AssertionError: 127 != 0
tests/test_mount.py:1520: AssertionError
---------------------------------------------------- Captured stderr call ----------------------------------------------------
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/tmp/tmpxu_2j088.sh" to rootfs at "/mnt/test.sh": mount src=/tmp/tmpxu_2j088.sh, dst=/mnt/test.sh, dstFd=/proc/thread-self/fd/8, flags=0x5000: no such file or directory: unknown
Run 'docker run --help' for more information
================================================== short test summary info ===================================================
FAILED tests/test_mount.py::DockerRaceTest::runTest - AssertionError: 127 != 0
============================================= 1 failed, 148 deselected in 2.78s ==============================================
Updated by Tom Clegg 6 months ago
Huh, I had started with "with ... as scriptfile" but changed to explicit unlink to avoid having the file still open for writing when executing ("text file busy"). But flush() does seem to be enough. Great!
Still haven't made this test fail on my system. Ran stock (10x) test 10 times. Raised iterations to 100 and ran 3 times.
tom@curve2:~$ findmnt --real
TARGET SOURCE   FSTYPE OPTIONS
/      /dev/vda1 ext4  rw,relatime,errors=remount-ro
tom@curve2:~$ findmnt /tmp
tom@curve2:~$ findmnt $XDG_RUNTIME_DIR
TARGET         SOURCE FSTYPE OPTIONS
/run/user/1000 tmpfs  tmpfs  rw,nosuid,nodev,relatime,size=1223976k,nr_inodes=305994,mode=700,uid=1000,gid=1000,inode64
tom@curve2:~$ docker version
Client: Docker Engine - Community
 Version:           28.4.0
 API version:       1.51
 Go version:        go1.24.7
 Git commit:        d8eb465
 Built:             Wed Sep 3 20:57:37 2025
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          28.4.0
  API version:      1.51 (minimum version 1.24)
  Go version:       go1.24.7
  Git commit:       249d679
  Built:            Wed Sep 3 20:57:37 2025
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.27
  GitCommit:        05044ec0a9a75232cad458027ca83437aae3f4da
 runc:
  Version:          1.2.5
  GitCommit:        v1.2.5-0-g59923ef
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
tom@curve2:~$ uname -a
Linux curve2 6.1.0-30-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.124-1 (2025-01-12) x86_64 GNU/Linux
Updated by Brett Smith 6 months ago
I suspect the fact that your /tmp isn't tmpfs is very relevant. Since that's where we write the script that gets mounted (unless you're messing with $TMPDIR?) and it behaves very differently from a "real" filesystem, that could cause enough discrepancy to get different results.
I believe having /tmp be a tmpfs is standard on most, if not all, of the distros we're currently supporting. One way you could check is by booting a tordo compute node and looking at its mounts. We create that image from a standard Debian AMI and then run our compute node build playbook on it, so it's closest to what real users will get, and the case that I'm most interested in making sure we're good for.
The Jenkins workers are built similarly, we install more software but we still do minimal reconfiguration of the base system. So it's close-ish to real customer compute nodes. I would be interested to see what happens if you make a fork of the branch that runs 500 iterations of the test, and then see what result Jenkins gets. (It would be okay to run only the developer-run-tests subjob that runs FUSE tests for this.)
Updated by Tom Clegg 5 months ago
Well, I think I see the problem, with one unresolved mystery, and I haven't figured out how to fix it.
Problem: llfuse.invalidate_entry() and invalidate_inode() are backed by llfuse's own asynchronous queue, so even if we wait for our own queue to join, that still doesn't guarantee the kernel notification has actually happened. Therefore, sometimes it hasn't happened before the subsequent GETATTR.
- The docs say "This operation is carried out asynchronously, i.e. the method may return before the kernel has executed the request." In fact the method may return before llfuse has even sent the request to the kernel.
Unresolved mystery: why didn't this fail periodically in 3.0.0? (I ran the new test 20x with 500 iterations each against arv-mount 3.0.0 and it succeeded 20/20.)
Failed attempts to fix:
- Update llfuse to call task_done() on _notify_queue, and add a join_notify_queue() function that calls _notify_queue.join(). This just hangs, even when called in a with llfuse.lock_released block. I don't know why.
- Insert time.sleep(1) in the mknod handler. Even this fails eventually in the same way as #note-38. Does this mean it sometimes takes a full second for the llfuse notify thread to process entries?
If I combine those two (sleep, then call join), join still hangs (every single time), even though the queue size is already 0 and we see the NOTIFY: log, which means (at least sometimes) the notifications do get processed before our op handler returns.
unique: 12, opcode: MKNOD (8), nodeid: 1, insize: 64, pid: 3318199
2025-10-14 12:55:25 arvados.arvados_fuse[3318110] DEBUG: arv-mount mknod: parent_inode 1 'test.sh' 100644
2025-10-14 12:55:25 arvados.arvados_fuse[3318110] DEBUG: on_event ADD name test.sh
2025-10-14 12:55:25 arvados.arvados_fuse[3318110] DEBUG: on_event MOD name test.sh
2025-10-14 12:55:25 arvados.arvados_fuse[3318110] DEBUG: arv-mount mknod: llfuse lock released
2025-10-14 12:55:25 arvados.arvados_fuse[3318110] DEBUG: arv-mount mknod: notify queue size 1
NOTIFY: code=3 length=40
2025-10-14 12:55:26 arvados.arvados_fuse[3318110] DEBUG: arv-mount mknod: slept 1 second
2025-10-14 12:55:26 arvados.arvados_fuse[3318110] DEBUG: arv-mount mknod: notify queue size 0, calling join_notify_queue
...(hangs)
Why does this hang...
def join_notify_queue():
_notify_queue.join()
...when clearly the notification has been sent (we see the log), the queue reports size 0 via
def notify_queue_size():
return _notify_queue.qsize()
...and the notification handler loop has been modified to look like
while True:
req = _notify_queue.get()
try:
...
finally:
_notify_queue.task_done()
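One property of `queue.Queue` worth spelling out here: `join()` waits on the count of unfinished tasks, not on `qsize()`. A `get()` drops qsize to 0 immediately, but `join()` still blocks until the matching `task_done()` runs. So "queue size 0 yet join hangs" is exactly what you'd see if the worker got the item and then got stuck before reaching `task_done()`. A minimal demonstration (the 0.5 s sleep stands in for a handler stuck in a C call):

```python
import queue
import threading
import time

q = queue.Queue()
q.put("invalidate")
started = threading.Event()

def worker():
    q.get()          # qsize drops to 0 right here...
    started.set()
    time.sleep(0.5)  # ...but the handler is still "stuck"
    q.task_done()    # join() only unblocks once this runs

threading.Thread(target=worker, daemon=True).start()
started.wait()
assert q.qsize() == 0          # queue looks empty...
t0 = time.monotonic()
q.join()                       # ...yet join() blocks until task_done()
assert time.monotonic() - t0 >= 0.3
```

If the handler never reaches `task_done()` at all (say, because the kernel call deadlocks), `join()` hangs forever, matching the behavior above.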
Here is the 2011 llfuse commit that changed invalidate_entry() and invalidate_inode() to work asynchronously
Updated by Brett Smith 5 months ago
- Assigned To changed from Tom Clegg to Brett Smith
Updated by Brett Smith 5 months ago
- Target version changed from Development 2025-10-15 to Development 2025-10-29
Updated by Tom Clegg 5 months ago
- check out 23136-debug branch of arvados repo
- check out 23136-debug branch of github.com:arvados/arvados-python-llfuse and run
(. $your_arvados_test_temp_dir/VENV3DIR/bin/activate && python setup.py build_cython && python setup.py build_ext --inplace && python setup.py install)
- optionally change the number of iterations on line 1510 of services/fuse/tests/test_mount.py (it's 500 on this branch)
WORKSPACE=$your_arvados_checkout_dir ./build/run-tests.sh --temp $your_arvados_test_temp_dir --interactivetest services/fuse --capture=tee-sys -k Race
It will hang at "calling join_notify_queue". ^C won't work, you'll have to killall run-tests.sh.
Updated by Brett Smith 5 months ago
- Related to Bug #22420: file contents in cached arv-mount directory don't update added
Updated by Brett Smith 5 months ago
Tom Clegg wrote in #note-39:
Unresolved mystery: why didn't this fail periodically in 3.0.0? (I ran the new test 20x with 500 iterations each against arv-mount 3.0.0 and it succeeded 20/20.)
#22420 made significant changes to the way we invalidate things after a MOD event. The range of that branch is dda08e5e2a..0069b81694. Looking at git diff -b dda08e5e2a..0069b81694 services/fuse/arvados_fuse/, this hunk from fusedir.py::CollectionDirectoryBase.on_event jumps out:
@@ -365,10 +372,23 @@ class CollectionDirectoryBase(Directory):
self.inodes.invalidate_entry(self, name)
self.inodes.del_entry(ent)
elif event == arvados.collection.MOD:
- if hasattr(item, "fuse_entry") and item.fuse_entry is not None:
- self.inodes.invalidate_inode(item.fuse_entry)
+ # MOD events have (modified_from, newitem)
+ newitem = item[1]
+ entry = None
+ if hasattr(newitem, "fuse_entry") and newitem.fuse_entry is not None:
+ entry = newitem.fuse_entry
elif name in self._entries:
- self.inodes.invalidate_inode(self._entries[name])
+ entry = self._entries[name]
+
+ if entry is not None:
+ entry.invalidate()
+ self.inodes.invalidate_inode(entry)
+
+ if name in self._entries:
+ self.inodes.invalidate_entry(self, name)
+
+ # TOK and WRITE events just invalidate the
+ # collection record file.
if self.collection_record_file is not None:
self.collection_record_file.invalidate()
Things I notice here:
We've gone from working on item: tuple[Collection, Collection] to working on newitem, the second element of that tuple. In the old code, hasattr(item, "fuse_entry") always returned false (you can't assign arbitrary attributes on tuples) so we never went down that branch.
We introduced a call to entry.invalidate() when the entry is found.
We introduced a call to self.inodes.invalidate_entry() when the name is found.
Things I'm wondering:
For the case under test, were we simply not doing any invalidation after MOD in 3.0.0?
Given all the new calls, is it possible we're doing some kind of mistaken double-invalidation now?
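The first observation is easy to verify in isolation: tuples have fixed slots and no `__dict__`, so `hasattr(some_tuple, "fuse_entry")` is always `False` and attribute assignment raises. That's why the old code's first MOD branch was dead.

```python
# Stand-in for the (modified_from, newitem) tuple the MOD event carries.
item = (object(), object())

# The old branch condition could never be true:
assert hasattr(item, "fuse_entry") is False

# And nothing could ever have set that attribute in the first place:
try:
    item.fuse_entry = "anything"
except AttributeError:
    pass
else:
    raise AssertionError("tuples should reject attribute assignment")
```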
Updated by Brett Smith 5 months ago
Brett Smith wrote in #note-44:
Given all the new calls, is it possible we're doing some kind of mistaken double-invalidation now?
Seems unlikely: the invalidate method just sets an internal flag, and invalidating inodes vs. entries is definitely not the same operation.
For the case under test, were we simply not doing any invalidation after MOD in 3.0.0?
We definitely were not invalidating the entry. We may or may not have been invalidating the inode.
Updated by Brett Smith 5 months ago
Tom Clegg wrote in #note-39:
Why does this hang... when clearly the notification has been sent (we see the log)... and the notification handler loop has been modified to look like [this]?
The simplest possible explanation is that qsize==0 because the notification has been retrieved with queue.get() but there has been no corresponding queue.task_done(). Things I wonder: is it possible one of the C calls hasn't returned yet? Is it possible that they break normal execution flow? Does Cython implement the full Python semantics of try/finally to ensure that the finally block runs even if the try block does a break or continue?
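The last question has a definite answer at the pure-Python level: `finally` runs even when the `try` block exits via `break` or `continue`. (Whether Cython-compiled code preserves this in every case is the open question; plain CPython definitely does.)

```python
# finally runs on normal completion, on continue, and on break.
ran_finally = []
for i in range(3):
    try:
        if i == 1:
            continue
        if i == 2:
            break
    finally:
        ran_finally.append(i)

assert ran_finally == [0, 1, 2]
```

So if `task_done()` sits in a `finally` and never runs, the likelier culprit is a call inside the `try` that simply never returns, rather than broken control flow.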
Updated by Brett Smith 5 months ago
Things I wonder: while invalidate_inode and invalidate_entry are separate operations, is it possible we're not supposed to be doing both? Should we be preferring one over the other? We definitely were not doing both in 3.0.0, we never tried to call invalidate_entry.
Updated by Brett Smith 5 months ago
It is maddeningly difficult to find information about when you should call the invalidate functions. The llfuse documentation doesn't cover it. If you do a search like "linux fuse invalidate_inode," most of the results are directly from LKML, which is a pretty authoritative source but one where it's difficult to find this kind of user-facing information. I also found a couple of individual blog posts bemoaning the lack of low-level FUSE documentation and writing their own, but they don't cover the invalidate functions, possibly because those were added later.
One thing I note is that none of the llfuse examples call any of the invalidate functions. Presumably they work fine. And presumably we were working more or less fine before this change. Based on all this, my working theory is that you are supposed to call these invalidate functions when the filesystem changes in ways that Linux can't otherwise know about. In our case, that means the collection record on the API server changed.
And then I can't help but notice that for the specific case under test with TmpCollectionDirectory, there is no collection record on the API server, by definition. These ADD and MOD events are coming from inside the house. It seems like we're creating an inode+entry, and then immediately invalidating it, which feels like it's probably inefficient in the best case, and wrong in the worst case.
I would be interested in a change that makes it so that on_event is only "active" (implementation TBD) when we are receiving events generated by changes to the upstream collection; i.e., when we explicitly call collection.update() or similar. I think this would get us closer to the 3.0.0 behavior while retaining the fix introduced in 3.1.0.
I am also really not sure when we should call invalidate_entry. Is it:
- when the entry is completely deleted (we got a DEL event)?
- when the entry is completely replaced (a rename-type event, although we don't model that specifically)?
- when there's any change to the entry?
The answers to these questions would go a long way toward determining whether the 3.0.0 or 3.1.0 MOD handling was more correct.
Updated by Brett Smith 5 months ago
- File 23136-chatgpt.md 23136-chatgpt.md added
So I had the thought "you know what, synthesizing authoritative information from a bunch of scattered Linux txt docs and mailing list posts is something an LLM might actually be good at." And then I figured we already live in hell, so sure, why not, I asked ChatGPT. (I wanted to ask Claude but it wanted my phone number and lol no.) Attached is my prompt and the response. In my prompt I tried to avoid making leading statements that would lead to an answer that was just what I wanted to hear.
Its top reference is the fuse_lowlevel docs and I am very annoyed those didn't come up in my search results (I double-checked!). Because the docs for invalidating entries say:
To avoid a deadlock this function must not be called in the execution path of a related filesystem operation or within any code that could hold a lock that could be needed to execute such an operation. As of kernel 4.18, a "related operation" is a lookup(), symlink(), mknod(), mkdir(), unlink(), rename(), link() or create() request for the parent, and a setattr(), unlink(), rmdir(), rename(), setxattr(), removexattr(), readdir() or readdirplus() request for the inode itself.
When called correctly, this function will never block.
The LLM also calls this out a couple of times. This is a function that we were not calling at all in 3.0.0 and I'm pretty sure we're calling exactly the way you're not supposed to in 3.1.0.
This makes me think I'm onto something with my basic "only call on_event from external events" idea. It also makes me think I was right to feel like all the lock manipulation we do in on_event just feels wrong (forcefully unlocking an RLock as many times as needed???): the docs make it sound like we shouldn't need a lock at all, at least not for invalidating entries.
It also suggests that we should be calling fuse_lowlevel_notify_expire_entry specifically to better handle the case where a file is overmounted, which seems very relevant. Unfortunately llfuse does not provide a wrapper for that function. But I'm not too worried about it, at least for the specific case under test, because if we only run on_event for upstream changes that means we'll never run any of this for TmpCollectionDirectory.
I am interested in a fresh set of eyes on all this though.
Updated by Brett Smith 5 months ago
Brett Smith wrote in #note-46:
Tom Clegg wrote in #note-39:
Why does this hang... when clearly the notification has been sent (we see the log)... and the notification handler loop has been modified to look like [this]?
The simplest possible explanation is that qsize==0 because the notification has been retrieved with queue.get() but there has been no corresponding queue.task_done(). Things I wonder: is it possible one of the C calls hasn't returned yet? Is it possible that they break normal execution flow? Does Cython implement the full Python semantics of try/finally to ensure that the finally block runs even if the try block does a break or continue?
Synthesizing this with my last comment, the most likely explanation seems to be:
- notify_queue.get() brings qsize down to 0.
- fuse_lowlevel_notify_inval_entry deadlocks because you're specifically not supposed to call it while holding the FUSE lock.
- Therefore we never call notify_queue.task_done(), and the upper level deadlocks too.
This is consistent with the fact that invalidate_entry() is the last invalidation call we make when handling a MOD event.
Updated by Brett Smith 5 months ago
This libfuse example seems instructive: it has a completely separate thread outside the main FUSE loop periodically calling fuse_lowlevel_notify_inval_inode. Which matches what I would expect given all the other references I read.
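The shape of that libfuse example can be sketched in Python: invalidation work is handed to a dedicated thread that runs outside any FUSE request handler, so the kernel call can never deadlock against a handler holding the FUSE lock. Everything below is illustrative stand-ins, not llfuse's API; in a real mount, `deliver_to_kernel` would be `llfuse.invalidate_inode()` / `llfuse.invalidate_entry()`.

```python
import queue
import threading

delivered = []

def deliver_to_kernel(kind, args):
    # Stub for the real kernel-notification call, which is only safe to
    # make from outside the request path.
    delivered.append((kind, args))

invalidations = queue.Queue()

def invalidation_worker():
    # Runs on its own thread, never inside a FUSE op handler.
    while True:
        job = invalidations.get()
        if job is None:   # shutdown sentinel
            break
        kind, args = job
        deliver_to_kernel(kind, args)

t = threading.Thread(target=invalidation_worker, daemon=True)
t.start()

# A request handler just enqueues and returns immediately:
invalidations.put(("entry", (1, b"test.sh")))
invalidations.put(None)
t.join()
assert delivered == [("entry", (1, b"test.sh"))]
```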
Updated by Tom Clegg 5 months ago
Indeed, the email-archive-as-documentation says:
The 'inval_inode()' and 'inval_entry()' functions are only required to invalidate the cache in the kernel when your file-system makes changes that are NOT driven by the local kernel through the VFS and fuse kernel module.
If I comment out the llfuse.invalidate_entry() call (so we never call it at all) the Race test passes 1000x iterations.
Skipping the invalidate calls for changes that were initiated by fuse is not as easy, but I think you're right that that's what we should be doing.
Updated by Brett Smith 5 months ago
The CollectionDirectoryBase docstring notes:
Most operations act only on the underlying Arvados Collection object. The Collection object signals via a notify callback to CollectionDirectoryBase.on_event that an item was added, removed or modified. FUSE inodes and directory entries are created, deleted or invalidated in response to these events.
This means a fix oriented around "only call on_event on upstream changes" is non-trivial because the entire thing is architected around on_event propagating all changes to FUSE.
The smallest possible fix might be rearranging the event loop so we only claim llfuse.lock for the operations that specifically need it. Other operations like invalidate_entry can stay out.
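A minimal sketch of that rearrangement, with hypothetical names (`fuse_lock`, `apply_change`, `notify_kernel` are all illustrative stand-ins, not the real arv-mount functions): the lock is held only around the state mutation, and the invalidate-style call happens after it is released.

```python
import threading

fuse_lock = threading.Lock()
log = []

def apply_change(event):
    # Mutates FUSE-visible state; genuinely needs the lock.
    log.append(("locked", event))

def notify_kernel(event):
    # invalidate_entry-style call; must NOT run under the lock.
    assert not fuse_lock.locked(), "must not notify while holding the lock"
    log.append(("unlocked", event))

def on_event(event):
    with fuse_lock:
        apply_change(event)
    # Lock released before the kernel notification:
    notify_kernel(event)

on_event("MOD")
assert log == [("locked", "MOD"), ("unlocked", "MOD")]
```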
Updated by Brett Smith 5 months ago
23136-event-locking @ b4baad7687fbe2b4b9d0048eb714338e96bd75f3 passes all existing tests plus 10,000 iterations (200×50) of Tom's new tests. developer-run-tests: #4917
But I think I would like to add some tests for collection updates propagating to FUSE before I feel confident about including this in the release.
Updated by Brett Smith 5 months ago
23136-event-locking-test-wip @ 50a63a83bc35ed3a1c316c17c5985b313979aa2d
This begins to add a test suite that IMO we should've had a long time ago. It mounts a collection, then changes the mount and the original API record at the "same time," then checks the results.
Most of the tests pass. This one usually, but not always, fails:
FAILED tests/test_concurrency.py::test_coll_concurrency[AddInMount-ModInRecord] - AssertionError: assert 3 == 6
i.e., when we add a new file to the mount, and at the "same time" append data to the original file in the collection, we don't see that appended data in the mount. Usually. Sometimes we do! I am so far gone I have no idea where to even start thinking about whether this is a bug or a sadly unavoidable situation when you have multiple clients writing updates without guardrails like etags etc.
I would like a pre-review of these tests for like, are they useful, are they well-written, do they help convince us that the bugfix is safe, what else should we be doing, let's have a conversation about it.
Updated by Brett Smith 5 months ago
23136-event-locking-test-wip @ 33d56d4d4ab34eeeba94c6511b968d320054d00f
This adds tests for simultaneous mount writes and Git clones as discussed at standup. All tests consistently pass except test_git_clone_to_coll which consistently fails like this:
> assert git_proc.returncode == os.EX_OK
E AssertionError: assert 128 == 0
E  + where 128 = CompletedProcess(args=['git', 'clone', '--jobs=3', '--no-hardlinks', '/home/brett/Curii/arvados/.git', '/tmp/arv-mount-sgjbjho8'], returncode=128).returncode
E  + and 0 = os.EX_OK
tests/test_concurrency.py:381: AssertionError
------------------------------------ Captured stdout call ------------------------------------
------------------------------------ Captured stderr call ------------------------------------
Cloning into '/tmp/arv-mount-sgjbjho8'...
fatal: 'origin' does not appear to be a git repository
fatal: Could not read from remote repository.
Please make sure you have the correct access rights and the repository exists.
fuse_releasedir(): fuse_reply_* failed with No such file or directory
fusermount: entry for /tmp/arv-mount-sgjbjho8 not found in /etc/mtab
The fact that cloning to a tmp collection succeeds makes me wonder if the event handler is interfering somehow. Things I need to investigate:
- What does fuse_releasedir(): fuse_reply_* failed with No such file or directory mean? Track down where this is coming from and whether it happens with other tests.
- Try to reproduce this manually, or at least capture the contents of the mount in a way that gives a sense of what might be getting lost/overwritten/mishandled.
- Try these tests against 3.1.2 code and compare results.
Updated by Brett Smith 5 months ago
- File 23136-test-without-fix.log 23136-test-without-fix.log added
- File 23136-test-with-fix.log 23136-test-with-fix.log added
23136-event-locking @ c809dbf2853776d9b311f528ec758ad3593cebda - developer-run-tests: #4926
I have arranged the branch so it adds tests, then adds the fix. Then I've attached the results of running the tests before the fix from d30675362b and with the fix from fb3851bb06. It gets strictly more tests passing, including "Git clone to tmp mount" which seems especially relevant for keeping containers working (remember, tmp mounts failing on containers is how this whole saga started).
I am going to run the Git test in a loop as well. Assuming there are no failures there: there are definitely still bugs here, but I think this restores 3.0.0 reliability.
- All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
- Yes
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- I have at least prominently marked the failing tests so it's hopefully as clear as possible what's going on. I think it's TBD whether we invest fix effort in the Python FUSE or the Go FUSE driver.
- Code is tested and passing, both automated and manual, what manual testing was done is described.
- See above
- Tested code incorporates recent main branch changes.
- Yes
- New or changed UI/UX and has gotten feedback from stakeholders.
- N/A
- Documentation has been updated.
- N/A
- Behaves appropriately at the intended scale (describe intended scale).
- It should have performance characteristics at least as good as 3.0.0, and maybe slightly improved with less cache invalidation.
- Considered backwards and forwards compatibility issues between client and server.
- N/A
- Follows our coding standards and GUI style guidelines.
- Yes
Updated by Brett Smith 5 months ago
Brett Smith wrote in #note-57:
I am going to run the Git test in a loop as well. Assuming there are no failures there: there are definitely still bugs here, but I think this restores 3.0.0 reliability.
I ran 1000 iterations of test_clone_git_to_tmp (by wrapping the body in for _ in range(1000)) and it passed:
======= test services/fuse
==================================================== test session starts =====================================================
platform linux -- Python 3.10.19, pytest-8.4.2, pluggy-1.6.0
rootdir: /home/brett/Curii/arvados/services/fuse
configfile: pytest.ini
testpaths: tests
plugins: cwltest-2.6.20250818005349
collected 166 items / 165 deselected / 1 selected
tests/test_concurrency.py . [100%]
======================================= 1 passed, 165 deselected in 6424.34s (1:47:04) =======================================
======= test services/fuse -- 6425s
Updated by Brett Smith 5 months ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|c6ab55f9a12c437e3f9a06e008f808f3c15491fe.