Bug #23136
Closed
Intermittent failures when bind-mounting things into an arv-mount tmp directory
Description
It has been reported that, since 3.1.1, when running this workflow some steps fail due to AWS credentials issues, others do nothing but report as "Complete", and others correctly download the requested data.
Files
Updated by Brett Smith 7 months ago
- Description updated (diff)
If we've seen this in 3.1.1, and that was an "installer bugfixes only" release, I think it's fair to start looking at changes in 3.1.0. On that list, #22420 jumps out as the kind of bug fix that could've accidentally introduced a new race condition or something like that.
Trying a reproduction at pirca-j7d0g-upcguy83q3ya755
Updated by Brett Smith 7 months ago
- File MountTest.sh MountTest.sh added
Tried to reproduce with the attached script and did not succeed. Going to do a version 2 that actually mounts to Docker with options closer to the ones crunch actually uses and try that.
Updated by Brett Smith 7 months ago
Second try to reproduce the issue, this time using more common arv-mount options and making the mount available through Docker. All 2500 runs passed on my machine.
Going to try again with disk cache. If that still passes, I might want to start trying to reproduce on a Linux system as close as possible to ones where we've seen this.
Updated by Brett Smith 7 months ago
- File MountTest.sh MountTest.sh added
Third try with disk cache, still succeeded.
Updated by Brett Smith 7 months ago
One stray thought I had: you know how you can break shell scripts by editing them while they're running? I wonder if there's some situation where the shell gets a partial read of the file, like zero bytes or just the shebang line or something. A read like that would be consistent with the behavior we're seeing, although I have no explanation for how it happens. arv-mount should know immediately how large the script is (from the collection manifest) and be able to report that.
Updated by Brett Smith 7 months ago
I thought this line might be masking the bug:
while ! [ -d mnt/tmp0 ]; do sleep .25s; done
But crunch-run has analogous code in source:lib/crunchrun/crunchrun.go:
go func() {
	for keepStatting {
		time.Sleep(100 * time.Millisecond)
		_, err = os.Stat(fmt.Sprintf("%s/by_id/README", runner.ArvMountPoint))
		if err == nil {
			keepStatting = false
			statReadme <- true
		}
	}
	close(statReadme)
}()
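For illustration, the same readiness poll could be written in Python (a hypothetical helper, not crunch-run code; names are made up):

```python
import os
import time

def wait_for_path(path, interval=0.1, timeout=10.0):
    """Poll until `path` exists, like crunch-run's README stat loop.

    Returns True once os.stat() succeeds, False if the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            os.stat(path)
            return True
        except FileNotFoundError:
            time.sleep(interval)
    return False
```

Note this only tells you the path exists, not that every file under the mount is fully populated, which is part of why this kind of check can mask (or fail to mask) the race discussed below.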
Updated by Tom Clegg 7 months ago
Another place to look: if the local keepstore process fails to retrieve a block requested by arv-mount, does arv-mount reliably propagate that error back to the reader process (shell)? It's conceivable that when lots of containers are downloading from S3, and keepstore is competing with them for S3 bandwidth/quota (especially if the buckets are in the same account?), such errors are more frequent. Perhaps this contention causes a particular class of keepstore/S3 error that isn't handled properly.
Updated by Brett Smith 7 months ago
- Target version changed from Development 2025-09-03 to Development 2025-09-17
Updated by Brett Smith 7 months ago
Tom Clegg wrote in #note-11:
Another place to look: if the local keepstore process fails to retrieve a block requested by arv-mount, does arv-mount reliably propagate that error back to the reader process (shell)?
I rigged up my test so that the host had an /etc/hosts entry for the Keep service pointing to an unreachable address. With that in place, arv-mount logged the exception:
2025-09-03 13:34:17 arvados.arvados_fuse[555194] ERROR: Unhandled exception during FUSE operation
Traceback (most recent call last):
File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.11/site-packages/arvados_fuse/__init__.py", line 547, in catch_exceptions_wrapper
return orig_func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.11/site-packages/arvados_fuse/__init__.py", line 900, in read
r = handle.obj.readfrom(off, size, self.num_retries)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.11/site-packages/arvados_fuse/fusefile.py", line 66, in readfrom
return self.arvfile.readfrom(off, size, num_retries, exact=True, return_memoryview=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.11/site-packages/arvados/arvfile.py", line 1043, in readfrom
block = self.parent._my_block_manager().get_block_contents(lr.locator, num_retries=num_retries, cache_only=(bool(data) and not exact))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.11/site-packages/arvados/arvfile.py", line 696, in get_block_contents
return self._keep.get(locator, num_retries=num_retries)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.11/site-packages/arvados/retry.py", line 245, in num_retries_setter
return orig_func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.11/site-packages/arvados/keep.py", line 1097, in get
return self._get_or_head(loc_s, method="GET", **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/brett/.cache/arvados-test/VENV3DIR/lib/python3.11/site-packages/arvados/keep.py", line 1246, in _get_or_head
raise arvados.errors.KeepReadError(
arvados.errors.KeepReadError: [req-9yia9240s704xz3kyeee] failed to read 2d44620559aa642824845f07d810e07a+4626+A[redacted]@68caedf6 after 5 attempts: service https://keep.pirca.arvadosapi.com:443/ responded with 0 (28, 'Failed to connect to keep.pirca.arvadosapi.com port 443 after 130822 ms: Could not connect to server')
And eventually that bubbles up to the test process:
sha256sum: /keep/by_id/4fce132bc4ae9b4cc844115246a6bd41+175/generatereport.py: Input/output error
/keep/by_id/4fce132bc4ae9b4cc844115246a6bd41+175/generatereport.py: FAILED open or read
This is not perfectly analogous since it's a networking error rather than an HTTP one. But this suggests at least some errors are handled as you'd hope, and makes it more surprising that we haven't seen arv-mount log anything in either failure mode.
Updated by Brett Smith 7 months ago
I am starting to become as suspicious of secret handling as I am of arv-mount. Secret inputs were basically unusable until 3.0.0 shipped 9e030521667e7998f8bde11a6f66dd6fe4c081c3 and this is basically our only workflow that exercises them. Compared to an arv-mount problem, the timeline makes a little less sense; but on the flip side, it's a better explanation for why the problem only seems to be affecting this workflow, and why we're getting the explicit failure case. From there, the silent failure case could be explained by something like "incomplete credentials cause aws s3 to fast exit 0." Very vague theory for sure, but at least as plausible as anything else we have right now.
Updated by Lucas Di Pentima 7 months ago
Brett Smith wrote in #note-15:
From there, the silent failure case could be explained by something like "incomplete credentials cause aws s3 to fast exit 0." Very vague theory for sure, but at least as plausible as anything else we have right now.
Note that logs don't even show the attempt to execute aws s3 cp ..., and the download.sh script has set -x in it.
Updated by Brett Smith 7 months ago
Lucas Di Pentima wrote in #note-16:
Note that logs don't even show the attempt to execute aws s3 cp ..., and the download.sh script has set -x in it.
Yeah, hmmm. Maybe it's the combination, maybe something about the way we set up secret mounts either triggers a latent bug in arv-mount or interferes with the way we make it available to the container. I should trace that code path and try to add it to my reproduction script.
Updated by Brett Smith 7 months ago
- File MountTest.sh MountTest.sh added
I have a meeting now and can't do a full write-up but the attached version successfully reproduces the issue. The issue is that results are unpredictable when a container layers mounts the way the S3 downloader does. The results show a container exiting 0 but not generating any script output, much like we're seeing in the silent success case.
% sh ~/Curii/MountTest.sh
2025-09-04 10:24:57 arvados.arv_put[557765] INFO: Creating new cache file at /home/brett/.cache/arvados/arv-put/6c9d01e002728bb0c2402ddf25b2c529
0M / 0M 100.0%
2025-09-04 10:24:57 arvados.arv_put[557765] INFO:
2025-09-04 10:24:57 arvados.arv_put[557765] INFO: Collection updated: 'Test Script'
Run #5 at 2025-09-04 10:25:03-04:00...
--- expected.out.log	2025-09-04 10:25:03.681983970 -0400
+++ container.out.log	2025-09-04 10:25:04.193978419 -0400
@@ -1,2 +0,0 @@
-cat test.sh zecret.txt | tee cat.log
-Secret000005
ERROR: Container #5 stdout mismatch
% ls -l *.log
-rw-r--r-- 1 166 2025-09-04 10:25 arv-mount.err.log
-rw-r--r-- 1   0 2025-09-04 10:24 arv-mount.out.log
-rw-r----- 1  37 2025-09-04 10:24 cat.log
-rw-r----- 1   0 2025-09-04 10:25 container.err.log
-rw-r----- 1   0 2025-09-04 10:25 container.out.log
-rw-r----- 1  39 2025-09-04 10:24 expected.err.log
-rw-r----- 1  50 2025-09-04 10:25 expected.out.log
Updated by Brett Smith 7 months ago
- Subject changed from aws-s3-bulk-download workflow failures to Inconsistent mount layering causes aws-s3-bulk-download failures
Updated by Tom Clegg 7 months ago
- Related to Bug #23142: lib/crunchrun flaky test singularitySuite.TestImageCache_Concurrency_10 added
Updated by Brett Smith 7 months ago
- File MountTest.sh MountTest.sh added
I have not been able to reproduce the issue using just Docker or just arv-mount. (The arv-mount test ran a script out of by_id and wrote the output to tmp0 in one command.) Right now it seems to be specific to the way the mount gets propagated to the container.
I have simplified the reproduction so it only mounts the script and the tmp collection. This means we can rule out one of the mounts being secret as a factor. I also ran arv-mount with --debug to get more information.
Like the real workflow, it fails two ways. In one case, the container exits 0 but outputs nothing. In another, docker run exits 125 and writes this message to stderr:
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/run/user/1000/MountTest-PbWBJSyj/mnt/by_id/d0b2917ad927fbe02735c59a17891308+60/test.sh" to rootfs at "/mnt/test.sh": change mount propagation through procfd: mount dst=/mnt/test.sh, dstFd=/proc/thread-self/fd/8, flags=0x44000: invalid argument: unknown
See new script attached. I need to pick through the arv-mount debug logs and see if that reveals anything.
Updated by Brett Smith 7 months ago
The script exhibits both failure modes with Docker 27, so I'm back to suspecting arv-mount changes as the source.
At the same time, I'm also aware that all the clusters we've seen this issue on were recently upgraded. It's possible that the real change is in Linux itself. But if that's the case there's basically nothing we're going to be able to do about that.
Updated by Brett Smith 7 months ago
Sometimes docker run also exits 127 with this slightly different error (maybe depending on which mount fails?):
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/run/user/1000/MountTest-pzjmN4uo/mnt/by_id/d0b2917ad927fbe02735c59a17891308+60/test.sh" to rootfs at "/mnt/test.sh": mount src=/run/user/1000/MountTest-pzjmN4uo/mnt/by_id/d0b2917ad927fbe02735c59a17891308+60/test.sh, dst=/mnt/test.sh, dstFd=/proc/thread-self/fd/8, flags=0x5000: no such file or directory: unknown
I have also seen it exit 125 with this error:
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/run/user/1000/MountTest-0QMrSc7D/mnt/by_id/d0b2917ad927fbe02735c59a17891308+60/test.sh" to rootfs at "/mnt/test.sh": possibly malicious path detected -- refusing to operate on /var/lib/docker/overlay2/68268f4da76a8041e365b254e6238cf9fb0e2b4c4a5097ba8aafc6368e796eb9/merged/mnt/test.sh (deleted): unknown
All errors happen even if I add the following code above docker run to try to ensure the mount is ready for work beforehand:
while ! [ -d mnt/tmp0 ]; do sleep .25s; done
diff -u test.sh "mnt/by_id/$SCRIPT_PDH/test.sh" ||
fail 15 "cmp #$RUN exited $?"
Updated by Brett Smith 7 months ago
- Assigned To set to Brett Smith
- Status changed from New to In Progress
- Category changed from Crunch to FUSE
- Subject changed from Inconsistent mount layering causes aws-s3-bulk-download failures to Intermittent failures when bind-mounting things into an arv-mount tmp directory
- File MountTest.sh MountTest.sh added
I have a smaller reproduction: all you need is the arv-mount tmp directory, and then you bind mount something, anything, into it. Doing this exhibits all the different failure modes.
https://github.com/moby/moby/issues/26051 sort of gestures at some of the mechanics that the runtime uses to set up bind mounts. Note in particular that it has to create mount points on the original host (the arv-mount tmp directory in our case) to do this. It seems like there might be some race condition where arv-mount doesn't have the mount point ready in time for the runtime to use. Possibly arv-mount is telling Linux some operation is finished before it actually is.
Further confirmation: if you manually create the destination mount point (e.g., add touch mnt/tmp0/test.sh before docker run), the issues go away and the test script succeeds. You prevent the race by ensuring the destination completely exists before Docker tries to manipulate it.
I can now explain why the S3 download workflow sometimes fails silently and other times complains about lack of credentials: it's just a question of which file fails to mount. If you race such that download.sh is an empty mount point, bash runs an empty script and you get the silent exit 0. If you race such that .aws/credentials is an empty mount point, aws complains about the lack of credentials.
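The "create the destination mount point before docker run" workaround could be sketched in Python; precreate_mount_points and its arguments are hypothetical names for illustration, not crunch-run's actual API:

```python
import pathlib

def precreate_mount_points(mount_root, bind_targets):
    """Workaround sketch: make sure every bind-mount destination already
    exists inside the FUSE tmp directory before `docker run` starts, so
    the runtime never races against arv-mount creating them.

    `bind_targets` maps mount-root-relative paths to "file" or "dir",
    e.g. {"test.sh": "file", ".aws": "dir"}.
    """
    root = pathlib.Path(mount_root)
    created = []
    for rel, kind in bind_targets.items():
        dest = root / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        if kind == "dir":
            dest.mkdir(exist_ok=True)
        else:
            dest.touch(exist_ok=True)  # equivalent of `touch mnt/tmp0/test.sh`
        created.append(str(dest))
    return created
```

This only prevents the race from the container runtime's side; the underlying arv-mount notification-ordering bug discussed later would still be there.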
Updated by Brett Smith 7 months ago
Brett Smith wrote in #note-4:
If we've seen this in 3.1.1, and that was an "installer bugfixes only" release, I think it's fair to start looking at changes in 3.1.0. On that list, #22420 jumps out as the kind of bug fix that could've accidentally introduced a new race condition or something like that.
I just ran my reproduction test script with arv-mount 3.0.0 installed in a virtualenv. The script passed without any failed runs. So yeah now this is 100% my top candidate.
Updated by Brett Smith 6 months ago
- Assigned To deleted (Brett Smith)
- Status changed from In Progress to New
Updated by Brett Smith 6 months ago
- Target version deleted (Development 2025-09-17)
Updated by Brett Smith 6 months ago
- File arv-mount.err.log arv-mount.err.log added
- File arv-mount.empty.log arv-mount.empty.log added
- File arv-mount.success.log arv-mount.success.log added
See logs attached from a successful run; a run where test.sh was not found; and a run where test.sh was empty. It seems like what distinguishes success is that we do GETATTR on test.sh before RELEASEDIR.
Updated by Tom Clegg 6 months ago
Yes, I see the same pattern.
When fusetest.sh succeeds, arv-mount debug logs have MKNOD-NOTIFY-RELEASEDIR-GETATTR, like this
unique: 12, opcode: MKNOD (8), nodeid: 2, insize: 64, pid: 918457
2025-09-22 20:49:44 arvados.arvados_fuse[918394] DEBUG: arv-mount mknod: parent_inode 2 'test.sh' 100644
unique: 12, success, outsize: 144
NOTIFY: code=3 length=40
unique: 14, opcode: RELEASEDIR (29), nodeid: 2, insize: 64, pid: 0
2025-09-22 20:49:44 arvados.arvados_fuse[918394] DEBUG: arv-mount release fh 0
unique: 14, success, outsize: 16
unique: 16, opcode: GETATTR (3), nodeid: 2, insize: 56, pid: 918457
unique: 16, success, outsize: 120
When fusetest.sh fails, arv-mount debug logs have MKNOD-RELEASEDIR-GETATTR-NOTIFY, like this
unique: 12, opcode: MKNOD (8), nodeid: 2, insize: 64, pid: 785032
2025-09-19 11:14:33 arvados.arvados_fuse[784976] DEBUG: arv-mount mknod: parent_inode 2 'test.sh' 100644
unique: 12, success, outsize: 144
unique: 14, opcode: RELEASEDIR (29), nodeid: 2, insize: 64, pid: 0
2025-09-19 11:14:33 arvados.arvados_fuse[784976] DEBUG: arv-mount release fh 0
unique: 14, success, outsize: 16
unique: 16, opcode: GETATTR (3), nodeid: 2, insize: 56, pid: 785032
unique: 16, success, outsize: 120
NOTIFY: code=3 length=40
I think:
- RELEASEDIR timing is irrelevant
- if GETATTR is processed before NOTIFY, filesystem semantics are broken
With some added debug logs, we can see the NOTIFY comes from "add" and "mod" events, as in this failure:
unique: 12, opcode: MKNOD (8), nodeid: 2, insize: 64, pid: 920138
2025-09-22 20:52:33 arvados.arvados_fuse[920079] DEBUG: arv-mount mknod: parent_inode 2 'test.sh' 100644
2025-09-22 20:52:33 arvados.arvados_fuse[920079] DEBUG: event add name test.sh
2025-09-22 20:52:33 arvados.arvados_fuse[920079] DEBUG: event mod name test.sh
unique: 12, success, outsize: 144
unique: 14, opcode: RELEASEDIR (29), nodeid: 2, insize: 64, pid: 0
2025-09-22 20:52:33 arvados.arvados_fuse[920079] DEBUG: arv-mount release fh 0
unique: 14, success, outsize: 16
unique: 16, opcode: GETATTR (3), nodeid: 2, insize: 56, pid: 920138
unique: 16, success, outsize: 120
NOTIFY: code=3 length=40
With a "wait for pending NOTIFY to complete" in place in CollectionDirectoryBase.on_event(), NOTIFY is always immediately after MKNOD:
unique: 12, opcode: MKNOD (8), nodeid: 2, insize: 64, pid: 1005522
2025-09-22 21:29:02 arvados.arvados_fuse[1005465] DEBUG: arv-mount mknod: parent_inode 2 'test.sh' 100644
unique: 12, success, outsize: 144
NOTIFY: code=3 length=40
So far I've run fusetest.sh 6 times with this change, and all 600 trials have succeeded.
This seems promising. There may be more places where a similar fix is needed.
23136-flush-invalidate @ b7c05540d6c5c100394a11b2a668ccb3870993c5 -- developer-run-tests: #4890
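The queued-invalidation fix described above can be sketched with the standard library: the op handler blocks on Queue.join() until a worker thread has processed (and task_done()'d) every pending invalidation, so the NOTIFY reaches the kernel before the handler returns. This is a schematic illustration only; it is not the actual CollectionDirectoryBase/llfuse code, and all names here are invented:

```python
import queue
import threading

class InvalidateQueue:
    """Schematic joinable notify/invalidate queue (not the llfuse API)."""

    def __init__(self):
        self._q = queue.Queue()
        self.delivered = []
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            event = self._q.get()
            try:
                if event is None:
                    return
                # The real code would call the kernel notification here
                # (e.g. an invalidate_entry equivalent).
                self.delivered.append(event)
            finally:
                self._q.task_done()

    def invalidate(self, name):
        self._q.put(name)

    def wait_pending(self):
        # Block until every queued invalidation has been processed, so
        # a subsequent GETATTR cannot overtake the NOTIFY.
        self._q.join()
```

The key property is that wait_pending() returns only after task_done() has been called for every enqueued event, which is exactly the ordering guarantee the failing runs lacked.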
Updated by Brett Smith 6 months ago
- Target version set to Development 2025-10-01
- Assigned To set to Tom Clegg
Updated by Tom Clegg 6 months ago
- Status changed from New to In Progress
I haven't been able to reproduce this with a python test case. (I thought I was onto something for a while because my "repeatedly rename and stat in different threads" test was deadlocking, but it was just that the test tearDown function hangs when it races with an in-progress fuse operation.)
It does seem plausible that the mount syscall would be uniquely sensitive to this race, which makes it annoying to test. Do we want to refactor fusetest.sh into a test case in services/fuse that depends on Docker and runs 100x when explicitly selected, or something like that?
Here's a slightly updated patch, with the "wait for invalidate queue to finish" bit moved up before re-acquiring the collection mutex, which seems safer.
23136-flush-invalidate @ 4851cebeb4c278fd967a76d9c2d0559259b569d2 -- developer-run-tests: #4891
Updated by Brett Smith 6 months ago
Tom Clegg wrote in #note-31:
It does seem plausible that the mount syscall would be uniquely sensitive to this race, which makes it annoying to test. Do we want to refactor fusetest.sh into a test case in services/fuse that depends on Docker and runs 100x when explicitly selected, or something like that?
On my system, once I got it failing reliably I think it always failed within the first five tries. If you're seeing something noticeably different let's talk about it, but I think I'd be open to a test that runs 10 tries as part of the normal test suite. Even if it doesn't fail 100% of the time, failing more often than not is good enough.
Updated by Tom Clegg 6 months ago
I finally noticed that we can do N trials without restarting arv-mount if we just delete test.sh from the temporary collection after each trial.
I've added a test that does 10 iterations, which takes about 10 seconds. With the fix commented out, it usually fails.
23136-flush-invalidate @ 52bea788b3033a1074040c8da906c6422395ae24 -- developer-run-tests: #4894
(workbench2 failed for obviously unrelated reasons)
Updated by Brett Smith 6 months ago
- Target version changed from Development 2025-10-01 to Development 2025-10-15
Updated by Brett Smith 6 months ago
I have pushed some minor style changes to the branch. But even without those, I am seeing ~5% failure rate when running DockerRaceTest repeatedly. And I have seen both assertions fail. I am concerned the branch mitigates but does not actually fix the issue. I am open to the possibility the problem is in the test or the test environment rather than the fix, but I'd need some good convincing as to why that is.
To that point about environment, Docker versions and filesystems for posterity:
arvdev ⟩ findmnt --real
TARGET SOURCE FSTYPE OPTIONS
/ /dev/mapper/xps9310-var--lib--machines[/arvdev] btrfs rw,relatime,ssd,space_cache=v2,subvo
├─/run/host/os-release /dev/mapper/xps9310-root--debian12[/usr/lib/os-release] ext4 ro,nosuid,nodev,noexec,relatime,erro
├─/run/host/os-release /dev/mapper/xps9310-root--debian12[/usr/lib/os-release] ext4 rw,relatime,errors=remount-ro
└─/home/brett/Curii /dev/mapper/xps9310-home[/brett/Curii] ext4 rw,relatime
arvdev ⟩ findmnt /tmp
TARGET SOURCE FSTYPE OPTIONS
/tmp tmpfs tmpfs rw,nosuid,nodev,size=1608148k,nr_inodes=409600,inode64
arvdev ⟩ docker version
Client: Docker Engine - Community
Version: 28.4.0
API version: 1.51
Go version: go1.24.7
Git commit: d8eb465
Built: Wed Sep 3 20:57:37 2025
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 28.4.0
API version: 1.51 (minimum version 1.24)
Go version: go1.24.7
Git commit: 249d679
Built: Wed Sep 3 20:57:37 2025
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.7.27
GitCommit: 05044ec0a9a75232cad458027ca83437aae3f4da
runc:
Version: 1.2.5
GitCommit: v1.2.5-0-g59923ef
docker-init:
Version: 0.19.0
GitCommit: de40ad0
___________________________________________________ DockerRaceTest.runTest ___________________________________________________
self = <tests.test_mount.DockerRaceTest testMethod=runTest>
    def runTest(self):
        self.make_mount(fuse.TmpCollectionDirectory, fuse_options=["allow_other"])
        os.chmod(self.mounttmp, 0o755)
        with tempfile.NamedTemporaryFile(suffix='.sh') as scriptfile:
            scriptfile.write(b"#!/bin/sh\necho OK\n")
            scriptfile.flush()
            os.chmod(scriptfile.name, 0o755)
            for _ in range(10):
                dockerrun = subprocess.run(
                    ["docker", "run",
                     "--rm",
                     "--workdir", "/mnt",
                     "--mount", f"type=bind,dst=/mnt,src={self.mounttmp}",
                     "--mount", f"type=bind,dst=/mnt/test.sh,src={scriptfile.name}",
                     "busybox:uclibc", "sh", "test.sh"],
                    stdout=subprocess.PIPE,
                    stderr=2)
                self.assertEqual(dockerrun.returncode, 0)
>               self.assertEqual(dockerrun.stdout, b"OK\n")
E               AssertionError: b'' != b'OK\n'
tests/test_mount.py:1521: AssertionError
================================================== short test summary info ===================================================
FAILED tests/test_mount.py::DockerRaceTest::runTest - AssertionError: b'' != b'OK\n'
============================================= 1 failed, 148 deselected in 3.32s ==============================================
___________________________________________________ DockerRaceTest.runTest ___________________________________________________
self = <tests.test_mount.DockerRaceTest testMethod=runTest>
    def runTest(self):
        self.make_mount(fuse.TmpCollectionDirectory, fuse_options=["allow_other"])
        os.chmod(self.mounttmp, 0o755)
        scriptfile = tempfile.NamedTemporaryFile(delete=False)
        try:
            scriptfile.write(b"#!/bin/sh\necho OK\n")
            scriptfile.close()
            os.chmod(scriptfile.name, 0o755)
            for _ in range(10):
                dockerrun = subprocess.run(
                    ["docker", "run",
                     "--rm",
                     "--workdir", "/mnt",
                     "--mount", f"type=bind,dst=/mnt,src={self.mounttmp}",
                     "--mount", f"type=bind,dst=/mnt/test.sh,src={scriptfile.name}",
                     "busybox:uclibc", "sh", "test.sh"],
                    stdout=subprocess.PIPE,
                    stderr=2)
>               self.assertEqual(dockerrun.returncode, 0)
E               AssertionError: 125 != 0
tests/test_mount.py:1525: AssertionError
---------------------------------------------------- Captured stderr call ----------------------------------------------------
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/tmp/tmpr7fps_dj" to rootfs at "/mnt/test.sh": possibly malicious path detected -- refusing to operate on /var/lib/docker/overlay2/06fa01931056931304c6e6eeff3e9d6867649ad9b4a2b51910b83e9fa566d297/merged/mnt/test.sh (deleted): unknown
Run 'docker run --help' for more information
================================================== short test summary info ===================================================
FAILED tests/test_mount.py::DockerRaceTest::runTest - AssertionError: 125 != 0
============================================= 1 failed, 148 deselected in 4.90s ==============================================
___________________________________________________ DockerRaceTest.runTest ___________________________________________________
self = <tests.test_mount.DockerRaceTest testMethod=runTest>
    def runTest(self):
        self.make_mount(fuse.TmpCollectionDirectory, fuse_options=["allow_other"])
        os.chmod(self.mounttmp, 0o755)
        with tempfile.NamedTemporaryFile(suffix='.sh') as scriptfile:
            scriptfile.write(b"#!/bin/sh\necho OK\n")
            scriptfile.flush()
            os.chmod(scriptfile.name, 0o755)
            for _ in range(10):
                dockerrun = subprocess.run(
                    ["docker", "run",
                     "--rm",
                     "--workdir", "/mnt",
                     "--mount", f"type=bind,dst=/mnt,src={self.mounttmp}",
                     "--mount", f"type=bind,dst=/mnt/test.sh,src={scriptfile.name}",
                     "busybox:uclibc", "sh", "test.sh"],
                    stdout=subprocess.PIPE,
                    stderr=2)
>               self.assertEqual(dockerrun.returncode, 0)
E               AssertionError: 125 != 0
tests/test_mount.py:1520: AssertionError
---------------------------------------------------- Captured stderr call ----------------------------------------------------
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/tmp/tmpvej09n4p.sh" to rootfs at "/mnt/test.sh": change mount propagation through procfd: mount dst=/mnt/test.sh, dstFd=/proc/thread-self/fd/8, flags=0x44000: invalid argument: unknown
Run 'docker run --help' for more information
================================================== short test summary info ===================================================
FAILED tests/test_mount.py::DockerRaceTest::runTest - AssertionError: 125 != 0
============================================= 1 failed, 148 deselected in 4.06s ==============================================
___________________________________________________ DockerRaceTest.runTest ___________________________________________________
self = <tests.test_mount.DockerRaceTest testMethod=runTest>
    def runTest(self):
        self.make_mount(fuse.TmpCollectionDirectory, fuse_options=["allow_other"])
        os.chmod(self.mounttmp, 0o755)
        with tempfile.NamedTemporaryFile(suffix='.sh') as scriptfile:
            scriptfile.write(b"#!/bin/sh\necho OK\n")
            scriptfile.flush()
            os.chmod(scriptfile.name, 0o755)
            for _ in range(10):
                dockerrun = subprocess.run(
                    ["docker", "run",
                     "--rm",
                     "--workdir", "/mnt",
                     "--mount", f"type=bind,dst=/mnt,src={self.mounttmp}",
                     "--mount", f"type=bind,dst=/mnt/test.sh,src={scriptfile.name}",
                     "busybox:uclibc", "sh", "test.sh"],
                    stdout=subprocess.PIPE,
                    stderr=2)
>               self.assertEqual(dockerrun.returncode, 0)
E               AssertionError: 127 != 0
tests/test_mount.py:1520: AssertionError
---------------------------------------------------- Captured stderr call ----------------------------------------------------
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/tmp/tmpxu_2j088.sh" to rootfs at "/mnt/test.sh": mount src=/tmp/tmpxu_2j088.sh, dst=/mnt/test.sh, dstFd=/proc/thread-self/fd/8, flags=0x5000: no such file or directory: unknown
Run 'docker run --help' for more information
================================================== short test summary info ===================================================
FAILED tests/test_mount.py::DockerRaceTest::runTest - AssertionError: 127 != 0
============================================= 1 failed, 148 deselected in 2.78s ==============================================
Updated by Tom Clegg 6 months ago
Huh, I had started with "with ... as scriptfile" but changed to explicit unlink to avoid having the file still open for writing when executing ("text file busy"). But flush() does seem to be enough. Great!
Still haven't made this test fail on my system. Ran stock (10x) test 10 times. Raised iterations to 100 and ran 3 times.
tom@curve2:~$ findmnt --real
TARGET SOURCE   FSTYPE OPTIONS
/      /dev/vda1 ext4  rw,relatime,errors=remount-ro
tom@curve2:~$ findmnt /tmp
tom@curve2:~$ findmnt $XDG_RUNTIME_DIR
TARGET         SOURCE FSTYPE OPTIONS
/run/user/1000 tmpfs  tmpfs  rw,nosuid,nodev,relatime,size=1223976k,nr_inodes=305994,mode=700,uid=1000,gid=1000,inode64
tom@curve2:~$ docker version
Client: Docker Engine - Community
 Version:           28.4.0
 API version:       1.51
 Go version:        go1.24.7
 Git commit:        d8eb465
 Built:             Wed Sep 3 20:57:37 2025
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          28.4.0
  API version:      1.51 (minimum version 1.24)
  Go version:       go1.24.7
  Git commit:       249d679
  Built:            Wed Sep 3 20:57:37 2025
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.27
  GitCommit:        05044ec0a9a75232cad458027ca83437aae3f4da
 runc:
  Version:          1.2.5
  GitCommit:        v1.2.5-0-g59923ef
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
tom@curve2:~$ uname -a
Linux curve2 6.1.0-30-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.124-1 (2025-01-12) x86_64 GNU/Linux
Updated by Brett Smith 6 months ago
I suspect the fact that your /tmp isn't tmpfs is very relevant. Since that's where we write the script that gets mounted (unless you're messing with $TMPDIR?) and it behaves very differently from a "real" filesystem, that could cause enough discrepancy to get different results.
I believe having /tmp be a tmpfs is standard on most, if not all, of the distros we're currently supporting. One way you could check is by booting a tordo compute node and looking at its mounts. We create that image from a standard Debian AMI and then run our compute node build playbook on it, so it's closest to what real users will get, and the case that I'm most interested in making sure we're good for.
The Jenkins workers are built similarly, we install more software but we still do minimal reconfiguration of the base system. So it's close-ish to real customer compute nodes. I would be interested to see what happens if you make a fork of the branch that runs 500 iterations of the test, and then see what result Jenkins gets. (It would be okay to run only the developer-run-tests subjob that runs FUSE tests for this.)
Updated by Tom Clegg 5 months ago
Well, I think I see the problem, with one unresolved mystery, and I haven't figured out how to fix it.
Problem: llfuse.invalidate_entry() and invalidate_inode() are backed by llfuse's own asynchronous queue, so even if we wait for our own queue to join, that still doesn't guarantee the kernel notification has actually happened. Therefore, sometimes it hasn't happened before the subsequent GETATTR.
- The docs say "This operation is carried out asynchronously, i.e. the method may return before the kernel has executed the request." In fact the method may return before llfuse has even sent the request to the kernel.
Unresolved mystery: why didn't this fail periodically in 3.0.0? (I ran the new test 20x with 500 iterations each against arv-mount 3.0.0 and it succeeded 20/20.)
Failed attempts to fix:
- Update llfuse to call task_done() on _notify_queue, and add a join_notify_queue() function that calls _notify_queue.join(). This just hangs, even when called in a with llfuse.lock_released block. I don't know why.
- Insert time.sleep(1) in the mknod handler. Even this fails eventually in the same way as #note-38. Does this mean it sometimes takes a full second for the llfuse notify thread to process entries?
If I combine those two (sleep, then call join), join still hangs (every single time), even though the queue size is already 0 and we see the NOTIFY: log, which means (at least sometimes) the notifications do get processed before our op handler returns.
unique: 12, opcode: MKNOD (8), nodeid: 1, insize: 64, pid: 3318199
2025-10-14 12:55:25 arvados.arvados_fuse[3318110] DEBUG: arv-mount mknod: parent_inode 1 'test.sh' 100644
2025-10-14 12:55:25 arvados.arvados_fuse[3318110] DEBUG: on_event ADD name test.sh
2025-10-14 12:55:25 arvados.arvados_fuse[3318110] DEBUG: on_event MOD name test.sh
2025-10-14 12:55:25 arvados.arvados_fuse[3318110] DEBUG: arv-mount mknod: llfuse lock released
2025-10-14 12:55:25 arvados.arvados_fuse[3318110] DEBUG: arv-mount mknod: notify queue size 1
NOTIFY: code=3 length=40
2025-10-14 12:55:26 arvados.arvados_fuse[3318110] DEBUG: arv-mount mknod: slept 1 second
2025-10-14 12:55:26 arvados.arvados_fuse[3318110] DEBUG: arv-mount mknod: notify queue size 0, calling join_notify_queue
...(hangs)
Why does this hang...
def join_notify_queue():
_notify_queue.join()
...when clearly the notification has been sent (we see the log), the queue reports size 0 via
def notify_queue_size():
return _notify_queue.qsize()
...and the notification handler loop has been modified to look like
while True:
req = _notify_queue.get()
try:
...
finally:
_notify_queue.task_done()
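One property of `queue.Queue` worth spelling out here: `join()` waits on the count of unfinished tasks, not on `qsize()`. A `get()` drops qsize to 0 immediately, but `join()` still blocks until the matching `task_done()` runs. So "queue size 0 yet join hangs" is exactly what you'd see if the worker got the item and then got stuck before reaching `task_done()`. A minimal demonstration (the 0.5 s sleep stands in for a handler stuck in a C call):

```python
import queue
import threading
import time

q = queue.Queue()
q.put("invalidate")
started = threading.Event()

def worker():
    q.get()          # qsize drops to 0 right here...
    started.set()
    time.sleep(0.5)  # ...but the handler is still "stuck"
    q.task_done()    # join() only unblocks once this runs

threading.Thread(target=worker, daemon=True).start()
started.wait()
assert q.qsize() == 0          # queue looks empty...
t0 = time.monotonic()
q.join()                       # ...yet join() blocks until task_done()
assert time.monotonic() - t0 >= 0.3
```

If the handler never reaches `task_done()` at all (say, because the kernel call deadlocks), `join()` hangs forever, matching the behavior above.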
Here is the 2011 llfuse commit that changed invalidate_entry() and invalidate_inode() to work asynchronously
Updated by Brett Smith 5 months ago
- Assigned To changed from Tom Clegg to Brett Smith
Updated by Brett Smith 5 months ago
- Target version changed from Development 2025-10-15 to Development 2025-10-29
Updated by Tom Clegg 5 months ago
- check out 23136-debug branch of arvados repo
- check out 23136-debug branch of github.com:arvados/arvados-python-llfuse and run
(. $your_arvados_test_temp_dir/VENV3DIR/bin/activate && python setup.py build_cython && python setup.py build_ext --inplace && python setup.py install)
- optionally change the number of iterations on line 1510 of services/fuse/tests/test_mount.py (it's 500 on this branch)
WORKSPACE=$your_arvados_checkout_dir ./build/run-tests.sh --temp $your_arvados_test_temp_dir --interactivetest services/fuse --capture=tee-sys -k Race
It will hang at "calling join_notify_queue". ^C won't work, you'll have to killall run-tests.sh.
Updated by Brett Smith 5 months ago
- Related to Bug #22420: file contents in cached arv-mount directory don't update added
Updated by Brett Smith 5 months ago
Tom Clegg wrote in #note-39:
Unresolved mystery: why didn't this fail periodically in 3.0.0? (I ran the new test 20x with 500 iterations each against arv-mount 3.0.0 and it succeeded 20/20.)
#22420 made significant changes to the way we invalidate things after a MOD event. The range of that branch is dda08e5e2a..0069b81694. Looking at git diff -b dda08e5e2a..0069b81694 services/fuse/arvados_fuse/, this hunk from fusedir.py::CollectionDirectoryBase.on_event jumps out:
@@ -365,10 +372,23 @@ class CollectionDirectoryBase(Directory):
self.inodes.invalidate_entry(self, name)
self.inodes.del_entry(ent)
elif event == arvados.collection.MOD:
- if hasattr(item, "fuse_entry") and item.fuse_entry is not None:
- self.inodes.invalidate_inode(item.fuse_entry)
+ # MOD events have (modified_from, newitem)
+ newitem = item[1]
+ entry = None
+ if hasattr(newitem, "fuse_entry") and newitem.fuse_entry is not None:
+ entry = newitem.fuse_entry
elif name in self._entries:
- self.inodes.invalidate_inode(self._entries[name])
+ entry = self._entries[name]
+
+ if entry is not None:
+ entry.invalidate()
+ self.inodes.invalidate_inode(entry)
+
+ if name in self._entries:
+ self.inodes.invalidate_entry(self, name)
+
+ # TOK and WRITE events just invalidate the
+ # collection record file.
if self.collection_record_file is not None:
self.collection_record_file.invalidate()
Things I notice here:
We've gone from working on item: tuple[Collection, Collection] to working on newitem, the second element of that tuple. In the old code, hasattr(item, "fuse_entry") always returned false (you can't assign arbitrary attributes on tuples) so we never went down that branch.
We introduced a call to entry.invalidate() when the entry is found.
We introduced a call to self.inodes.invalidate_entry() when the name is found.
Things I'm wondering:
For the case under test, were we simply not doing any invalidation after MOD in 3.0.0?
Given all the new calls, is it possible we're doing some kind of mistaken double-invalidation now?
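The first observation is easy to verify in isolation: tuples have fixed slots and no `__dict__`, so `hasattr(some_tuple, "fuse_entry")` is always `False` and attribute assignment raises. That's why the old code's first MOD branch was dead.

```python
# Stand-in for the (modified_from, newitem) tuple the MOD event carries.
item = (object(), object())

# The old branch condition could never be true:
assert hasattr(item, "fuse_entry") is False

# And nothing could ever have set that attribute in the first place:
try:
    item.fuse_entry = "anything"
except AttributeError:
    pass
else:
    raise AssertionError("tuples should reject attribute assignment")
```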
Updated by Brett Smith 5 months ago
Brett Smith wrote in #note-44:
Given all the new calls, is it possible we're doing some kind of mistaken double-invalidation now?
Seems unlikely: the invalidate method just sets an internal flag, and invalidating inodes vs. entries is definitely not the same operation.
For the case under test, were we simply not doing any invalidation after MOD in 3.0.0?
We definitely were not invalidating the entry. We may or may not have been invalidating the inode.
Updated by Brett Smith 5 months ago
Tom Clegg wrote in #note-39:
Why does this hang... when clearly the notification has been sent (we see the log)... and the notification handler loop has been modified to look like [this]?
The simplest possible explanation is that qsize==0 because the notification has been retrieved with queue.get() but there has been no corresponding queue.task_done(). Things I wonder: is it possible one of the C calls hasn't returned yet? Is it possible that they break normal execution flow? Does Cython implement the full Python semantics of try/finally to ensure that the finally block runs even if the try block does a break or continue?
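The last question has a definite answer at the pure-Python level: `finally` runs even when the `try` block exits via `break` or `continue`. (Whether Cython-compiled code preserves this in every case is the open question; plain CPython definitely does.)

```python
# finally runs on normal completion, on continue, and on break.
ran_finally = []
for i in range(3):
    try:
        if i == 1:
            continue
        if i == 2:
            break
    finally:
        ran_finally.append(i)

assert ran_finally == [0, 1, 2]
```

So if `task_done()` sits in a `finally` and never runs, the likelier culprit is a call inside the `try` that simply never returns, rather than broken control flow.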
Updated by Brett Smith 5 months ago
Things I wonder: while invalidate_inode and invalidate_entry are separate operations, is it possible we're not supposed to be doing both? Should we be preferring one over the other? We definitely were not doing both in 3.0.0, we never tried to call invalidate_entry.
Updated by Brett Smith 5 months ago
It is maddeningly difficult to find information about when you should call the invalidate functions. The llfuse documentation doesn't cover it. If you do a search like "linux fuse invalidate_inode," most of the results are directly from LKML, which is a pretty authoritative source but one where it's difficult to find this kind of user-facing information. I also found a couple of individual blog posts bemoaning the lack of low-level FUSE documentation and writing their own, but they don't cover the invalidate functions, possibly because those were added later.
One thing I note is that none of the llfuse examples call any of the invalidate functions. Presumably they work fine. And presumably we were working more or less fine before this change. Based on all this, my working theory is that you are supposed to call these invalidate functions when the filesystem changes in ways that Linux can't otherwise know about. In our case, that means the collection record on the API server changed.
And then I can't help but notice that for the specific case under test with TmpCollectionDirectory, there is no collection record on the API server, by definition. These ADD and MOD events are coming from inside the house. It seems like we're creating an inode+entry, and then immediately invalidating it, which feels like it's probably inefficient in the best case, and wrong in the worst case.
I would be interested in a change that makes it so that on_event is only "active" (implementation TBD) when we are receiving events generated by changes to the upstream collection; i.e., when we explicitly call collection.update() or similar. I think this would get us closer to the 3.0.0 behavior while retaining the fix introduced in 3.1.0.
I am also really not sure when we should call invalidate_entry. Is it:
- when the entry is completely deleted (we got a DEL event)?
- when the entry is completely replaced (a rename-type event, although we don't model that specifically)?
- when there's any change to the entry?
The answers to these questions would go a long way toward determining whether the 3.0.0 or 3.1.0 MOD handling was more correct.
Updated by Brett Smith 5 months ago
- File 23136-chatgpt.md 23136-chatgpt.md added
So I had the thought "you know what, synthesizing authoritative information from a bunch of scattered Linux txt docs and mailing list posts is something an LLM might actually be good at." And then I figured we already live in hell, so sure, why not, I asked ChatGPT. (I wanted to ask Claude but it wanted my phone number and lol no.) Attached is my prompt and the response. In my prompt I tried to avoid making leading statements that would lead to an answer that was just what I wanted to hear.
Its top reference is the fuse_lowlevel docs and I am very annoyed those didn't come up in my search results (I double-checked!). Because the docs for invalidating entries say:
To avoid a deadlock this function must not be called in the execution path of a related filesystem operation or within any code that could hold a lock that could be needed to execute such an operation. As of kernel 4.18, a "related operation" is a lookup(), symlink(), mknod(), mkdir(), unlink(), rename(), link() or create() request for the parent, and a setattr(), unlink(), rmdir(), rename(), setxattr(), removexattr(), readdir() or readdirplus() request for the inode itself.
When called correctly, this function will never block.
The LLM also calls this out a couple of times. This is a function that we were not calling at all in 3.0.0 and I'm pretty sure we're calling exactly the way you're not supposed to in 3.1.0.
This makes me think I'm onto something with my basic "only call on_event from external events" idea. It also makes me think I was right to feel like all the lock manipulation we do in on_event just feels wrong (forcefully unlocking an RLock as many times as needed???): the docs make it sound like we shouldn't need a lock at all, at least not for invalidating entries.
It also suggests that we should be calling fuse_lowlevel_notify_expire_entry specifically to better handle the case where a file is overmounted, which seems very relevant. Unfortunately llfuse does not provide a wrapper for that function. But I'm not too worried about it, at least for the specific case under test, because if we only run on_event for upstream changes that means we'll never run any of this for TmpCollectionDirectory.
I am interested in a fresh set of eyes on all this though.
Updated by Brett Smith 5 months ago
Brett Smith wrote in #note-46:
Tom Clegg wrote in #note-39:
Why does this hang... when clearly the notification has been sent (we see the log)... and the notification handler loop has been modified to look like [this]?
The simplest possible explanation is that qsize==0 because the notification has been retrieved with queue.get() but there has been no corresponding queue.task_done(). Things I wonder: is it possible one of the C calls hasn't returned yet? Is it possible that they break normal execution flow? Does Cython implement the full Python semantics of try/finally to ensure that the finally block runs even if the try block does a break or continue?
Synthesizing this with my last comment, the most likely explanation seems to be:
- notify_queue.get() brings qsize down to 0.
- fuse_lowlevel_notify_inval_entry deadlocks because you're specifically not supposed to call it while holding the FUSE lock.
- Therefore we never call notify_queue.task_done(), and the upper level deadlocks too.
This is consistent with the fact that invalidate_entry() is the last invalidation call we make when handling a MOD event.
Updated by Brett Smith 5 months ago
This libfuse example seems instructive: it has a completely separate thread outside the main FUSE loop periodically calling fuse_lowlevel_notify_inval_inode. Which matches what I would expect given all the other references I read.
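The shape of that libfuse example can be sketched in Python: invalidation work is handed to a dedicated thread that runs outside any FUSE request handler, so the kernel call can never deadlock against a handler holding the FUSE lock. Everything below is illustrative stand-ins, not llfuse's API; in a real mount, `deliver_to_kernel` would be `llfuse.invalidate_inode()` / `llfuse.invalidate_entry()`.

```python
import queue
import threading

delivered = []

def deliver_to_kernel(kind, args):
    # Stub for the real kernel-notification call, which is only safe to
    # make from outside the request path.
    delivered.append((kind, args))

invalidations = queue.Queue()

def invalidation_worker():
    # Runs on its own thread, never inside a FUSE op handler.
    while True:
        job = invalidations.get()
        if job is None:   # shutdown sentinel
            break
        kind, args = job
        deliver_to_kernel(kind, args)

t = threading.Thread(target=invalidation_worker, daemon=True)
t.start()

# A request handler just enqueues and returns immediately:
invalidations.put(("entry", (1, b"test.sh")))
invalidations.put(None)
t.join()
assert delivered == [("entry", (1, b"test.sh"))]
```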
Updated by Tom Clegg 5 months ago
Indeed, the email-archive-as-documentation says:
The 'inval_inode()' and 'inval_entry()' functions are only required to invalidate the cache in the kernel when your file-system makes changes that are NOT driven by the local kernel through the VFS and fuse kernel module.
If I comment out the llfuse.invalidate_entry() call (so we never call it at all) the Race test passes 1000x iterations.
Skipping the invalidate calls for changes that were initiated by fuse is not as easy, but I think you're right that that's what we should be doing.
Updated by Brett Smith 5 months ago
The CollectionDirectoryBase docstring notes:
Most operations act only on the underlying Arvados Collection object. The Collection object signals via a notify callback to CollectionDirectoryBase.on_event that an item was added, removed or modified. FUSE inodes and directory entries are created, deleted or invalidated in response to these events.
This means a fix oriented around "only call on_event on upstream changes" is non-trivial because the entire thing is architected around on_event propagating all changes to FUSE.
The smallest possible fix might be rearranging the event loop so we only claim llfuse.lock for the operations that specifically need it. Other operations like invalidate_entry can stay out.
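A minimal sketch of that rearrangement, with hypothetical names (`fuse_lock`, `apply_change`, `notify_kernel` are all illustrative stand-ins, not the real arv-mount functions): the lock is held only around the state mutation, and the invalidate-style call happens after it is released.

```python
import threading

fuse_lock = threading.Lock()
log = []

def apply_change(event):
    # Mutates FUSE-visible state; genuinely needs the lock.
    log.append(("locked", event))

def notify_kernel(event):
    # invalidate_entry-style call; must NOT run under the lock.
    assert not fuse_lock.locked(), "must not notify while holding the lock"
    log.append(("unlocked", event))

def on_event(event):
    with fuse_lock:
        apply_change(event)
    # Lock released before the kernel notification:
    notify_kernel(event)

on_event("MOD")
assert log == [("locked", "MOD"), ("unlocked", "MOD")]
```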
Updated by Brett Smith 5 months ago
23136-event-locking @ b4baad7687fbe2b4b9d0048eb714338e96bd75f3 passes all existing tests plus 10,000 iterations (200×50) of Tom's new tests. developer-run-tests: #4917
But I think I would like to add some tests for collection updates propagating to FUSE before I feel confident about including this in the release.
Updated by Brett Smith 5 months ago
23136-event-locking-test-wip @ 50a63a83bc35ed3a1c316c17c5985b313979aa2d
This begins to add a test suite that IMO we should've had a long time ago. It mounts a collection, then changes the mount and the original API record at the "same time," then checks the results.
Most of the tests pass. This one usually, but not always, fails:
FAILED tests/test_concurrency.py::test_coll_concurrency[AddInMount-ModInRecord] - AssertionError: assert 3 == 6
i.e., when we add a new file to the mount, and at the "same time" append data to the original file in the collection, we don't see that appended data in the mount. Usually. Sometimes we do! I am so far gone I have no idea where to even start thinking about whether this is a bug or a sadly unavoidable situation when you have multiple clients writing updates without guardrails like etags etc.
I would like a pre-review of these tests for like, are they useful, are they well-written, do they help convince us that the bugfix is safe, what else should we be doing, let's have a conversation about it.
Updated by Brett Smith 5 months ago
23136-event-locking-test-wip @ 33d56d4d4ab34eeeba94c6511b968d320054d00f
This adds tests for simultaneous mount writes and Git clones as discussed at standup. All tests consistently pass except test_git_clone_to_coll which consistently fails like this:
> assert git_proc.returncode == os.EX_OK
E AssertionError: assert 128 == 0
E  + where 128 = CompletedProcess(args=['git', 'clone', '--jobs=3', '--no-hardlinks', '/home/brett/Curii/arvados/.git', '/tmp/arv-mount-sgjbjho8'], returncode=128).returncode
E  + and 0 = os.EX_OK
tests/test_concurrency.py:381: AssertionError
------------------------------------ Captured stdout call ------------------------------------
------------------------------------ Captured stderr call ------------------------------------
Cloning into '/tmp/arv-mount-sgjbjho8'...
fatal: 'origin' does not appear to be a git repository
fatal: Could not read from remote repository.
Please make sure you have the correct access rights and the repository exists.
fuse_releasedir(): fuse_reply_* failed with No such file or directory
fusermount: entry for /tmp/arv-mount-sgjbjho8 not found in /etc/mtab
The fact that cloning to a tmp collection succeeds makes me wonder if the event handler is interfering somehow. Things I need to investigate:
- What does fuse_releasedir(): fuse_reply_* failed with No such file or directory mean? Track down where this is coming from and whether it happens with other tests.
- Try to reproduce this manually, or at least capture the contents of the mount in a way that gives a sense of what might be getting lost/overwritten/mishandled.
- Try these tests against 3.1.2 code and compare results.
Updated by Brett Smith 5 months ago
- File 23136-test-without-fix.log 23136-test-without-fix.log added
- File 23136-test-with-fix.log 23136-test-with-fix.log added
23136-event-locking @ c809dbf2853776d9b311f528ec758ad3593cebda - developer-run-tests: #4926
I have arranged the branch so it adds tests, then adds the fix. Then I've attached the results of running the tests before the fix from d30675362b and with the fix from fb3851bb06. It gets strictly more tests passing, including "Git clone to tmp mount" which seems especially relevant for keeping containers working (remember, tmp mounts failing on containers is how this whole saga started).
I am going to run the Git test in a loop as well. Assuming there are no failures there: there are definitely still bugs here, but I think this restores 3.0.0 reliability.
- All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
- Yes
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- I have at least prominently marked the failing tests so it's hopefully as clear as possible what's going on. I think it's TBD whether we invest fix effort in the Python FUSE or the Go FUSE driver.
- Code is tested and passing, both automated and manual, what manual testing was done is described.
- See above
- Tested code incorporates recent main branch changes.
- Yes
- New or changed UI/UX and has gotten feedback from stakeholders.
- N/A
- Documentation has been updated.
- N/A
- Behaves appropriately at the intended scale (describe intended scale).
- It should have performance characteristics at least as good as 3.0.0, and maybe slightly improved with less cache invalidation.
- Considered backwards and forwards compatibility issues between client and server.
- N/A
- Follows our coding standards and GUI style guidelines.
- Yes
Updated by Brett Smith 5 months ago
Brett Smith wrote in #note-57:
I am going to run the Git test in a loop as well. Assuming there are no failures there: there are definitely still bugs here, but I think this restores 3.0.0 reliability.
I ran 1000 iterations of test_clone_git_to_tmp (by wrapping the body in for _ in range(1000)) and it passed:
======= test services/fuse
==================================================== test session starts =====================================================
platform linux -- Python 3.10.19, pytest-8.4.2, pluggy-1.6.0
rootdir: /home/brett/Curii/arvados/services/fuse
configfile: pytest.ini
testpaths: tests
plugins: cwltest-2.6.20250818005349
collected 166 items / 165 deselected / 1 selected
tests/test_concurrency.py . [100%]
======================================= 1 passed, 165 deselected in 6424.34s (1:47:04) =======================================
======= test services/fuse -- 6425s
Updated by Brett Smith 5 months ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|c6ab55f9a12c437e3f9a06e008f808f3c15491fe.