Project

General

Profile

Actions

Bug #23078

closed

keep-web leaves filehandle open for deleted cache files

Added by George Chlipala 8 months ago. Updated 6 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Keep
Target version:
Story points:
-
Release relationship:
Auto

Description

We have noticed that keep-web is leaving the file handle open for cache files even after they are deleted. Over time, this causes the apparent disk usage to steadily increase even though the files have been deleted (see https://askubuntu.com/a/390301 for more details).

For example, here is the output of df and du from our machine. df is reporting 30 GB used while du is only reporting 6.2 GB used.

ROOT [rrc-arvados:/] # df -h /var
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/rhel-var   33G   30G  3.5G  90% /var
ROOT [rrc-arvados:/] # du -sh /var
6.2G    /var

When we check for open file handles (lsof) of deleted files we see a large number for keep-web. (this is a partial list)

keep-web      999    5008 keep-web              root   85rR     REG              253,3    237612     634087 /var/cache/arvados/keep/27d/27de0531b5372db170a054e6df0f0aa1.keepcacheblock (deleted)
keep-web      999    5008 keep-web              root   86rR     REG              253,3  26173858   33943175 /var/cache/arvados/keep/904/904b8295a8a86462dbd0c82f642a7939.keepcacheblock (deleted)
keep-web      999    5008 keep-web              root   87rR     REG              253,3   2091436   33633408 /var/cache/arvados/keep/f0b/f0b504b58d0994971385429d751f1757.keepcacheblock (deleted)
keep-web      999    5008 keep-web              root   88rR     REG              253,3    433504   33943310 /var/cache/arvados/keep/4b4/4b4a58916c75265dc4a24192df95ebfa.keepcacheblock (deleted)
keep-web      999    5008 keep-web              root   89rR     REG              253,3    187185   33940364 /var/cache/arvados/keep/060/0602752ad95659553c968d37a0f201e5.keepcacheblock (deleted)
keep-web      999    5008 keep-web              root   90rR     REG              253,3     67557   33945374 /var/cache/arvados/keep/c40/c40243f6e9d0a9033a344712aac4de76.keepcacheblock (deleted)
keep-web      999    5008 keep-web              root   91rR     REG              253,3    307355     631795 /var/cache/arvados/keep/531/53171f07296c0e0ec59d50e40ee6aee1.keepcacheblock (deleted)
keep-web      999    5008 keep-web              root   92rR     REG              253,3   1501172     429471 /var/cache/arvados/keep/487/487a88a1d76a21e28f39fda7db5b3c06.keepcacheblock (deleted)
keep-web      999    5008 keep-web              root   93rR     REG              253,3  46741094     631583 /var/cache/arvados/keep/36b/36b405466bbdf8ae9302fa74524b6581.keepcacheblock (deleted)
keep-web      999    5008 keep-web              root   94rR     REG              253,3     24385   33939523 /var/cache/arvados/keep/696/696e4736c66003d2878ec6b2fcf5030e.keepcacheblock (deleted)
keep-web      999    5008 keep-web              root   95rR     REG              253,3   4284907     429484 /var/cache/arvados/keep/88a/88a97a2b3f9bb8586ffc1b7ab202851a.keepcacheblock (deleted)
keep-web      999    5008 keep-web              root   96rR     REG              253,3  33968301   67189929 /var/cache/arvados/keep/840/8405aa5000e66ff48c368132adf68787.keepcacheblock (deleted)
keep-web      999    5008 keep-web              root   97rR     REG              253,3    728799   67360946 /var/cache/arvados/keep/8cd/8cd0b52649655becc8eb971abc6eb0ab.keepcacheblock (deleted)
keep-web      999    5008 keep-web              root   98rR     REG              253,3    138256     102497 /var/cache/arvados/keep/16a/16af18b45568aa8e58a3385d5fbc87e3.keepcacheblock (deleted)
keep-web      999    5008 keep-web              root   99rR     REG              253,3     68962   33945373 /var/cache/arvados/keep/221/2213ff99c38e1760a63d93ac076618eb.keepcacheblock (deleted)

For right now, we can restart the keep-web service and the file handles will be closed and the space is "recovered". It would be helpful if keep-web can close the file handle before it deletes the file from the cache.

ROOT [rrc-arvados:/] # systemctl restart keep-web
ROOT [rrc-arvados:/] # df -h /var
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/rhel-var   33G  6.9G   26G  22% /var

Thanks.


Subtasks 1 (0 open1 closed)

Task #23100: Review 23078-close-deleted-filesResolvedTom Clegg08/19/2025Actions
Actions #1

Updated by Tom Clegg 7 months ago

  • Target version set to Development 2025-08-21
  • Assigned To set to Tom Clegg
  • Status changed from New to In Progress
Actions #2

Updated by Tom Clegg 7 months ago

  • Subtask #23100 added
Actions #3

Updated by Tom Clegg 7 months ago

{The block cache used by} keep-web does close its filehandles when deleting cache files.
  • Our (pirca) cluster's keep-web process has been up for 3 weeks and has no file descriptors open on deleted files.

However that code only runs when the files are deleted by keep-web itself during cache size limiting -- not when they are deleted by a different process (e.g., arv-mount, or another keep-web process), in which case keep-web doesn't notice, and it leaves the file descriptors open until it reaches its internal limit of file descriptors used for this purpose, typically 10K, at which point it closes all of its cached filehandles. I suspect this is the problem here.

There is also a bug in the "close all cached filehandles" code that causes the cache code to crash, which causes keep-web to fail the current request. The crash also causes a mutex to stay locked, which causes all subsequent keep-web requests to hang. We haven't had any reports of this, and there are no crashes in our cluster logs, probably because it only happens when
  • the disk cache is big enough for 10000 blocks or
  • another process is deleting blocks and the dangling-file-descriptor problem goes unnoticed long enough to accumulate 10000 open files.

Both of these problems are fixed in this branch.

23078-close-deleted-files @ 3a9edff7d50db535a805012821d27b4187249a17 -- developer-run-tests: #4850

  • All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
    • ✅ Detect when cache files are deleted by a different process, and close the corresponding filehandles.
    • ✨ Don't crash when reaching cached file descriptor limit.
    • ✨ Fix mutex handling (using defer) to avoid stuck mutex on crash.
    • ✨ Refactor some code for readability.
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • ✅ n/a
  • Code is tested and passing, both automated and manual, what manual testing was done is described.
    • ✅ Added tests for both issues.
  • The tested code incorporates recent main branch changes.
  • New or changed UI/UX has gotten feedback from stakeholders.
    • n/a
  • Documentation has been updated.
    • n/a
  • Behaves appropriately at the intended scale (describe intended scale).
    • ✅ "concurrent short reads" benchmark is about 12% slower.
  • Considered backwards and forwards compatibility issues between client and server.
    • ✅ n/a
  • Follows our coding standards and GUI style guidelines.
Actions #4

Updated by Brett Smith 7 months ago

Tom Clegg wrote in #note-3:

23078-close-deleted-files @ 3a9edff7d50db535a805012821d27b4187249a17 -- developer-run-tests: #4850

Three tiny things, please address as you see fit and merge:

  • The name testConcurrentReads doesn't really make sense to me because it doesn't really seem to be testing anything. I see that it does concurrent reads and does technically fail the test if any of them fail but that just seems to be a safeguard, not really the point of the function (the fact that you didn't write a failure message seems suggestive). Is there a better name? Happy to bounce around ideas in chat if you like.
  • Small comment typo "helpdopen" on TestHeldOpen_CloseDeletedFiles.
  • In TestHeldOpen_CloseDeletedFiles, using Check for len(cache.sharedCache.heldopen) > targetsize doesn't seem appropriate to me, because if it's false then the later checks are meaningless. IMO this would be better as an Assert.

Thanks.

Actions #5

Updated by Tom Clegg 7 months ago

  • Status changed from In Progress to Resolved
Actions #6

Updated by Brett Smith 6 months ago

  • Release set to 79
Actions

Also available in: Atom PDF