Bug #12568: [CWL] Cap size of collection cache
Status: Closed
% Done: 100
Description
The collection cache in arvados-cwl-runner is unbounded. When running workflows with many very large collections, this can result in gigabytes of RAM usage (requiring larger nodes for the workflow runner) and difficult-to-debug out-of-memory failures. arvados-cwl-runner should manage the collection cache so that RAM usage is capped.
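For context, a minimal sketch of the general approach (a byte-capped LRU cache with a min_entries floor, as discussed in the comments below). The class name, defaults, and loader interface here are illustrative assumptions, not necessarily the actual a-c-r implementation:

    import collections
    import threading


    class CollectionCache(object):
        """Hypothetical byte-capped LRU cache (illustrative, not the real a-c-r class).

        cap: approximate maximum number of bytes of manifest text to keep cached.
        min_entries: number of most-recently-used entries that are never evicted,
        even if they push the total over the cap.
        """

        def __init__(self, cap=256 * 1024 * 1024, min_entries=2):
            self.cap = cap
            self.min_entries = min_entries
            self.total = 0
            self.lock = threading.Lock()
            self.entries = collections.OrderedDict()  # pdh -> (manifest, size)

        def _evict(self):
            # Drop least-recently-used entries until we are under the cap,
            # but always keep at least min_entries entries.
            while self.total > self.cap and len(self.entries) > self.min_entries:
                _, (_, size) = self.entries.popitem(last=False)
                self.total -= size

        def get(self, pdh, loader):
            """Return the cached manifest for pdh, calling loader(pdh) on a miss."""
            with self.lock:
                if pdh in self.entries:
                    # Cache hit: mark as most recently used.
                    self.entries.move_to_end(pdh)
                    return self.entries[pdh][0]
            # Load outside the lock; the whole collection has to be read
            # no matter how large it is.
            manifest = loader(pdh)
            with self.lock:
                self.entries[pdh] = (manifest, len(manifest))
                self.total += len(manifest)
                self._evict()
                return manifest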
Updated by Peter Amstutz over 7 years ago
- Status changed from New to In Progress
- Assigned To set to Peter Amstutz
- Target version set to 2017-11-08 Sprint
12568-cwl-collection-cache @ d59645f3e566e691ba757f74bba503c13773dbe8
Updated by Peter Amstutz over 7 years ago
Added a couple of unit tests. Now @ a70727762dafee667b022331307f6c0f949fd7e7
Updated by Peter Amstutz over 7 years ago
One more test update for completeness! a18005f8b35a68b4fcd9ccdf76832b28e564289c
Updated by Lucas Di Pentima over 7 years ago
The cache size management looks good to me.
If some enormous collections get into the cache and cause a-c-r to be OOM-killed, shouldn't the cache parameters be modifiable via a-c-r flags?
Updated by Peter Amstutz over 7 years ago
Lucas Di Pentima wrote:
The cache size management looks good to me.
If some enormous collections get into the cache and cause a-c-r to be OOM-killed, shouldn't the cache parameters be modifiable via a-c-r flags?
If the collection is really enormous (such as #12575) it will overflow the cache, and min_entries comes into play. We have to load the whole collection regardless. However, to avoid caching too many huge collections, I reduced min_entries from 4 to 2.
I believe default cache parameters are reasonably scaled to the 1 core / 3.5 GiB nodes that we ideally want to use for running workflows.
Adding a command line option to adjust the cache size isn't quite as straightforward as it seems, because the parameter also needs to be propagated to the job or container request. It is probably a bigger change (with its own testing implications) than the underlying fix.
12568-cwl-collection-cache @ 8de691c25eac0454f8f30cfa35eccff15642e330
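To illustrate the min_entries behavior described above, using the hypothetical CollectionCache sketch from the description (not the actual a-c-r code):

    # Hypothetical demonstration of the min_entries floor: even when a single
    # collection's manifest is larger than the whole cache cap, it still has
    # to be loaded and stays cached, because eviction never drops the cache
    # below min_entries entries.
    cache = CollectionCache(cap=64, min_entries=2)

    cache.get("pdh-small", lambda pdh: "x" * 10)    # total = 10, under cap
    cache.get("pdh-huge", lambda pdh: "y" * 1000)   # total = 1010, over cap
    # Eviction stops at min_entries=2, so both entries remain cached.
    # With min_entries=4 (the old default) up to four such oversized
    # collections could be retained at once, hence the reduction to 2.
    assert len(cache.entries) == 2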
Updated by Lucas Di Pentima over 7 years ago
Ok, this fix seems to be enough. I was just thinking that maybe we could give the user more knobs to address a similar problem in the future, but I suppose they could also change the runtime requirements to ask for bigger nodes when they detect an OOM situation.
LGTM, please merge.
Updated by Peter Amstutz over 7 years ago
Lucas Di Pentima wrote:
Ok, this fix seems to be enough. I was just thinking that maybe we could give the user more knobs to address a similar problem in the future, but I suppose they could also change the runtime requirements to ask for bigger nodes when they detect an OOM situation.
Yes, that's right. I believe that if the default cache settings are not sufficient, fiddling with the cache settings is unlikely to help; the user probably needs a bigger node anyway.
LGTM, please merge.
Thanks.
Updated by Anonymous over 7 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:fc128bc6497aa266f925e2aa4821bde6fce9aade.