Bug #13100
closed
[crunch-run] Replace custom manifest-writing code with collectionFS
100%
Description
After a lot of debugging to figure out why every job was being killed for exceeding its memory limit, seemingly regardless of the limit specified, I finally caught crunch-run in the act of consuming a truly massive amount of RAM towards the end of a job. (In the end I just set the container's RAM requirement to nearly all the RAM on the node, so I could see what was going on.)
Mid-run, crunch-run was using ~200MB. It then suddenly started allocating memory at a rate of about 850MB/s; 30s later it had consumed 25.5GB, at which point it levelled off and held steady for 230s, after which the job finished successfully. It kept the full 25.5GB allocated until it exited (or until within a second of exiting).
The container logs show that the point at which it started allocating lots of RAM corresponds to the end of the container run: the growth began at the "Container exited with code: 0" line and continued as crunch-run proceeded to upload the output, which in this case was specified (in CWL) as a Directory.
The output collection (5ca01264c4721b24c9d36320a00027ce+328812) contains 4005 files totalling 8.9GiB, so crunch-run allocated enough memory to hold the full output collection in memory roughly 2.8x over, which seems somewhat excessive.
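For anyone wanting to sanity-check the figures above, here is a minimal Go sketch of the arithmetic. It is not crunch-run code: the helper names `peakGB` and `ratio` are made up for illustration, and it assumes the observed rate was decimal MB/s. The roughly 2.8x figure comes out when 25.5 and 8.9 are treated as being in the same unit; if the 25.5 was decimal GB and the 8.9 binary GiB, the ratio is somewhat lower.

```go
package main

import "fmt"

// Byte sizes for the two unit conventions that appear in the report.
const (
	GB  = 1e9     // decimal gigabyte
	GiB = 1 << 30 // binary gibibyte
)

// peakGB returns the memory (in decimal GB) allocated after growing
// at rateMBps megabytes per second for the given number of seconds.
func peakGB(rateMBps, seconds float64) float64 {
	return rateMBps * seconds / 1000
}

// ratio returns how many times the output fits into the peak allocation.
// Both arguments must be expressed in the same unit.
func ratio(peak, output float64) float64 {
	return peak / output
}

func main() {
	peak := peakGB(850, 30) // observed growth: ~850MB/s for ~30s
	fmt.Printf("peak allocation: %.1f GB\n", peak)

	// Treating 25.5 and 8.9 as the same unit.
	fmt.Printf("ratio (same units): %.2f\n", ratio(peak, 8.9))

	// Treating 25.5 as decimal GB and 8.9 as binary GiB.
	fmt.Printf("ratio (GB vs GiB):  %.2f\n", ratio(peak*GB, 8.9*GiB))
}
```

Either way, the peak is well over twice the size of the output collection itself.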