Bug #12568: [CWL] Cap size of collection cache
Status: Closed
% Done: 100
Description
The collection cache in arvados-cwl-runner is unbounded. When running workflows with many very large collections, this can result in gigabytes of RAM usage (requiring larger nodes for the workflow runner) and difficult-to-debug out-of-memory failures. arvados-cwl-runner should manage the collection cache so that RAM usage is capped.
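For context, a minimal sketch of the general approach (a byte-capped LRU cache with a min_entries floor, as discussed in the comments below). The class name, defaults, and loader interface here are illustrative assumptions, not necessarily the actual a-c-r implementation:

    import collections
    import threading


    class CollectionCache(object):
        """Hypothetical byte-capped LRU cache (illustrative, not the real a-c-r class).

        cap: approximate maximum number of bytes of manifest text to keep cached.
        min_entries: number of most-recently-used entries that are never evicted,
        even if they push the total over the cap.
        """

        def __init__(self, cap=256 * 1024 * 1024, min_entries=2):
            self.cap = cap
            self.min_entries = min_entries
            self.total = 0
            self.lock = threading.Lock()
            self.entries = collections.OrderedDict()  # pdh -> (manifest, size)

        def _evict(self):
            # Drop least-recently-used entries until we are under the cap,
            # but always keep at least min_entries entries.
            while self.total > self.cap and len(self.entries) > self.min_entries:
                _, (_, size) = self.entries.popitem(last=False)
                self.total -= size

        def get(self, pdh, loader):
            """Return the cached manifest for pdh, calling loader(pdh) on a miss."""
            with self.lock:
                if pdh in self.entries:
                    # Cache hit: mark as most recently used.
                    self.entries.move_to_end(pdh)
                    return self.entries[pdh][0]
            # Load outside the lock; the whole collection has to be read
            # no matter how large it is.
            manifest = loader(pdh)
            with self.lock:
                self.entries[pdh] = (manifest, len(manifest))
                self.total += len(manifest)
                self._evict()
                return manifest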
Updated by Peter Amstutz over 7 years ago
- Status changed from New to In Progress
- Assigned To set to Peter Amstutz
- Target version set to 2017-11-08 Sprint
12568-cwl-collection-cache @ d59645f3e566e691ba757f74bba503c13773dbe8
Updated by Peter Amstutz over 7 years ago
Added a couple of unit tests. Now @ a70727762dafee667b022331307f6c0f949fd7e7
Updated by Peter Amstutz over 7 years ago
One more test update for completeness! a18005f8b35a68b4fcd9ccdf76832b28e564289c
Updated by Lucas Di Pentima over 7 years ago
The cache size management looks good to me.
If some enormous collections get into the cache and cause a-c-r to be OOM-killed, shouldn't the cache parameters be modifiable via a-c-r flags?
Updated by Peter Amstutz over 7 years ago
Lucas Di Pentima wrote:
The cache size management looks good to me.
If some enormous collections get into the cache and cause a-c-r to be OOM-killed, shouldn't the cache parameters be modifiable via a-c-r flags?
If the collection is really enormous (such as #12575) it will overflow the cache, and min_entries comes into play. We have to load the whole collection regardless. However, to avoid caching too many huge collections, I reduced min_entries from 4 to 2.
I believe default cache parameters are reasonably scaled to the 1 core / 3.5 GiB nodes that we ideally want to use for running workflows.
Adding a command line option to adjust the cache size isn't quite as straightforward as it seems, because the parameter also needs to be propagated to the job or container request. It is probably a bigger change (with its own testing implications) than the underlying fix.
12568-cwl-collection-cache @ 8de691c25eac0454f8f30cfa35eccff15642e330
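To illustrate the min_entries behavior described above, using the hypothetical CollectionCache sketch from the description (not the actual a-c-r code):

    # Hypothetical demonstration of the min_entries floor: even when a single
    # collection's manifest is larger than the whole cache cap, it still has
    # to be loaded and stays cached, because eviction never drops the cache
    # below min_entries entries.
    cache = CollectionCache(cap=64, min_entries=2)

    cache.get("pdh-small", lambda pdh: "x" * 10)    # total = 10, under cap
    cache.get("pdh-huge", lambda pdh: "y" * 1000)   # total = 1010, over cap
    # Eviction stops at min_entries=2, so both entries remain cached.
    # With min_entries=4 (the old default) up to four such oversized
    # collections could be retained at once, hence the reduction to 2.
    assert len(cache.entries) == 2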
Updated by Lucas Di Pentima over 7 years ago
Ok, this fix seems to be enough. I was just thinking that maybe we could give the user more knobs to address a similar problem in the future, but I suppose they could also change the runtime requirements to ask for bigger nodes when they detect an OOM situation.
LGTM, please merge.
Updated by Peter Amstutz over 7 years ago
Lucas Di Pentima wrote:
Ok, this fix seems to be enough. I was just thinking that maybe we could give the user more knobs to address a similar problem in the future, but I suppose they could also change the runtime requirements to ask for bigger nodes when they detect an OOM situation.
Yes, that's right. I believe that if the default cache settings are not sufficient, fiddling with the cache settings is unlikely to help; the user probably needs a bigger node anyway.
LGTM, please merge.
Thanks.
Updated by Anonymous over 7 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:fc128bc6497aa266f925e2aa4821bde6fce9aade.