Story #3640
Updated by Tom Clegg over 9 years ago
Background: arv-mount has a block cache, which improves performance when the same blocks are read multiple times. However:
* Currently a new arv-mount process is started for each Crunch task execution. This means tasks don't share a cache, even if they're running at the same time.
* In the common case where multiple crunch tasks run at the same time and use the same data, we have multiple arv-mount processes, each retrieving and caching its own copy of the same data blocks.

Proposed improvement:
* Use large swap on worker nodes (preferably SSD). (We already do this for other reasons.)
* Set up a large tmpfs on worker nodes and use it as crunch job scratch space. (This already gets cleared at the beginning of a job to avoid leakage between jobs/users.)
* Use a directory in that tmpfs as an arv-mount cache. This makes it feasible to use a large cache size, and makes it easy to share the cache between multiple arv-mount processes.

Implementation notes:
* Rely on unix permissions for cache privacy. (Perhaps warn if the cache dir's @mode & 0007 != 0@, but go ahead anyway: there will be cases where that would be useful and not dangerous.)
* Use flock() to avoid races and duplicated effort. (If arv-mount 1 is writing a block to the cache, then arv-mount 2 should wait for arv-mount 1 to finish and then read from the cache, rather than fetch its own copy.) See the sketch below.
* Do not clean up the cache dir at start/exit, at least by default (the general idea is to share the cache with past/future arv-mount procs). An optional @--cache-clear-atexit@ flag would be nice to have.
* Measuring/limiting cache size could be interesting.
* Delete & replace upon finding a corrupt/truncated cache entry.

Integration:
* The default Keep mount on shell nodes should use a filesystem cache, assuming there is an appropriate filesystem for it (i.e., something faster than the network: tmpfs, SSD, or at least a disk with async/barriers=0).
* crunch-job should create a per-job temp dir on each node during the "install" phase, and point all arv-mount processes to it.
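
To illustrate the flock() idea from the implementation notes: a minimal sketch of how one arv-mount process could either read a block from the shared cache dir or fetch and cache it itself, while waiting on any other process that is already writing the same block. This is not arv-mount's actual code; the cache dir path and the @fetch_block@ callable are placeholders.

<pre><code class="python">
import errno
import fcntl
import os

CACHE_DIR = "/tmp/crunch-job/keep-cache"  # hypothetical per-job tmpfs dir


def cached_get(locator, fetch_block):
    """Return block data for `locator`, using a shared on-disk cache.

    `fetch_block` is a callable that retrieves the block from Keep.
    flock() ensures that if another arv-mount is already writing this
    block, we wait for it to finish and read its copy instead of
    fetching our own.
    """
    path = os.path.join(CACHE_DIR, locator)
    while True:
        # Fast path: another process already cached this block.
        try:
            with open(path, "rb") as f:
                fcntl.flock(f, fcntl.LOCK_SH)   # wait for any writer to finish
                data = f.read()
            if data:                            # skip truncated/empty entries
                return data
        except IOError as e:
            if e.errno != errno.ENOENT:
                raise
        # Slow path: fetch the block ourselves under an exclusive lock.
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX)
            if os.fstat(fd).st_size == 0:       # nobody filled it while we waited
                data = fetch_block(locator)
                os.write(fd, data)
                return data
            # Someone else won the race; loop around and read their copy.
        finally:
            os.close(fd)                        # closing also releases the lock
</code></pre>

A corrupt or truncated entry would additionally need the delete-and-replace handling mentioned above; this sketch only skips empty files.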