Bug #19563
closedcrunch-run upload step possibly buffers too much
100%
Description
We have crunch-run processes that are getting OOM killed (and restarted) in the upload phase.
Crunch-run is uploading very large files (30+ GB) and running on very small nodes (t3.small) which have 1 core, 2 GB RAM, and throttled network bandwidth. The hoststat numbers show a much greater amount of data being received than transmitted.
The suspicion is that the crunch-run process is buffering data in RAM, which is piling up until it gets OOM killed.
a) determine if it is true that the queue of blocks to be uploaded is uncapped
b) if so, make it possible to set some cap which ensures there is backpressure that will block the uploader until there is more buffer space. Experimentally, I think we've found optimal upload rates with around 4-6 parallel block uploads.
Updated by Tom Clegg over 2 years ago
- Status changed from New to In Progress
- 4x 64 MiB blocks waiting to write to keepstore
- 1x 64 MiB block waiting for a write-to-keepstore slot
- 1x <64 MiB block at the end of each file (which could cause a problem when writing a large number of files if crunch-run's "flush after writing a large file" thing isn't working, but here we're seeing the crash after writing only 1 or 2 files)
- arv-mount (keep_cache_ram was 268435456 and after reading a lot of data it seems safe to assume it's still got the maximum amount allocated).
I don't think there's any situation where the fuse mount is accessed after the "copying output files" stage starts, and in that case crunch-run could reduce memory use by killing the arv-mount process earlier(in the present case there are individual files from a collection mounted inside the output dir, so the output-copying phase does require arv-mount) - keepstore itself (LocalKeepBlobBuffersPerVCPU=1 and VCPUs=1 so keepstore should only be using 1x 64 MiB)
All this adds up to ~11 blocks, 704 MiB, which doesn't sound like it should be a problem for a node with 2 GB RAM.
However, hoststat.txt shows memory use growing ~1 GB just as the container ends and crunch-run starts copying a 29 GB file, then growing ~100MB soon after it starts copying a 5 GB file.
hoststat 2022-09-15T19:34:44.561985895Z mem 713097216 cache 0 swap 3050 pgmajfault 637743104 rss crunch-run 2022-09-15T19:34:46.079622438Z Container exited with status code 0 crunch-run 2022-09-15T19:34:46.124137478Z copying "/outputs/xxxxxxxx" (29123123123 bytes) hoststat 2022-09-15T19:34:54.561348376Z mem 305278976 cache 0 swap 3391 pgmajfault 1475633152 rss hoststat 2022-09-15T19:44:54.563159520Z mem 110309376 cache 0 swap 26601 pgmajfault 1678643200 rss crunch-run 2022-09-15T19:44:58.123005311Z copying "/outputs/xxxxxxxx" (5123123123 bytes) hoststat 2022-09-15T19:45:04.565925880Z mem 130220032 cache 0 swap 27024 pgmajfault 1678675968 rss hoststat 2022-09-15T19:45:14.570719935Z mem 47415296 cache 0 swap 28308 pgmajfault 1780465664 rss
Updated by Tom Clegg over 2 years ago
The failing container essentially does "gunzip *.gz" and the first two files in the logs quoted above are file1 and file1.gz. This means the ~1 GB increase in memory usage happens when crunch-run starts to copy the uncompressed file1 from local scratch space → native keep client → local keepstore → remove keep volume.
Updated by Tom Clegg over 2 years ago
We haven't seen this on regular (non-fractional) VM types. It's still unclear which process is using all this extra memory. Most likely culprit is crunch-run itself, but could be keepstore, arv-mount, docker, or something else that gets triggered by a container finishing.
We should add hoststat.txt entries for the memory footprint of crunch-run, arv-mount, and keepstore.
This should provide good info for troubleshooting next time this (or another memory issue) comes up.
Updated by Peter Amstutz over 2 years ago
- Target version changed from 2022-10-12 sprint to 2022-10-26 sprint
Updated by Tom Clegg about 2 years ago
19563-log-cr-mem @ e55bd0d3b54494061d54853c4b613ad680ffb6a9 -- developer-run-tests: #3335
adds a "procmem" line to crunchstat output.
Testing dev version on ce8i5:
2022-10-25T14:28:48.056830530Z crunch-run e55bd0d3b54494061d54853c4b613ad680ffb6a9-dev (go1.19.2) started ... 2022-10-25T14:28:51.941425414Z mem 731615232 cache 1485 pgmajfault 179535872 rss 2022-10-25T14:28:51.941494214Z procmem 10196 crunch-run 9701 keepstore ... 2022-10-25T14:29:01.943707056Z mem 1134686208 cache 1485 pgmajfault 539070464 rss 2022-10-25T14:29:01.943801556Z procmem 61176 arv-mount 10196 crunch-run 27555 keepstore
Updated by Tom Clegg about 2 years ago
19563-log-cr-mem @ 8860ada0a6afd6adb2b26d5b0ab161bfe66c6019 -- developer-run-tests: #3339
Fixes units (the numbers in /proc/X/stat are # pages, typically 4 kiB)
2022-10-25T18:12:42.964863796Z crunch-run 8860ada0a6afd6adb2b26d5b0ab161bfe66c6019-dev (go1.19.2) started ... 2022-10-25T18:12:46.971549619Z mem 736526336 cache 1485 pgmajfault 180895744 rss 2022-10-25T18:12:46.991653214Z procmem 45219840 crunch-run 38080512 keepstore ... 2022-10-25T18:12:56.972254253Z mem 1126486016 cache 1485 pgmajfault 550330368 rss 2022-10-25T18:12:56.972366598Z procmem 259072000 arv-mount 45219840 crunch-run 112295936 keepstore
That looks more realistic.
Updated by Peter Amstutz about 2 years ago
- Target version changed from 2022-10-26 sprint to 2022-11-09 sprint
Updated by Peter Amstutz about 2 years ago
Tom Clegg wrote in #note-8:
19563-log-cr-mem @ 8860ada0a6afd6adb2b26d5b0ab161bfe66c6019 -- developer-run-tests: #3339
Fixes units (the numbers in /proc/X/stat are # pages, typically 4 kiB)
[...]
That looks more realistic.
LGTM
Updated by Tom Clegg about 2 years ago
- Status changed from In Progress to Closed
The mystery isn't resolved, but this seems to be all we can do about it for now. Closing.