Bug #22827

closed

crunch-run can use excess memory and crash in the "copy" phase of container finalization

Added by Peter Amstutz 11 months ago. Updated 9 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Story points:
-
Release relationship:
Auto

Description

Observed on a user container task:

This "merge" step reads from a large number of files, does some light processing in a python script, then outputs a large number of derived files, producing a new output directory.

Because the actual requirements of the Python script are very lightweight, the workflow only requests 1 GiB of RAM, and that is all it needs to complete the Python script.

Then it goes to the phase of capturing and uploading output.

It is copying a pretty large number of files into the output collection (I'm not sure exactly how many, probably several thousand).

I think it is copying the files that were staged to the working directory (so it is copying from one collection to another). It is probably also supposed to copy some newly created files, but I'm not sure it gets that far.

At some point, it starts massively thrashing and can't work in the 4 GiB of RAM of the instance that it was given, and gets pretty wedged.

A few takeaways:

  1. it really needs to flush logs when the container ends but before starting the copying phase. Right now the last thing that appears in the logs is "Starting container" and that could mean anything.
  2. Even with thousands of files it doesn't seem like this should require gigabytes of RAM. Besides uploading individual 64 MiB blocks it should be primarily manifest manipulation, which I'd expect to require 100s of megabytes, not 1000s, so now I'm suspicious that there is a memory leak or some other unwise use of memory.
    1. One thing I'll note is that the copier implements a basic cache to avoid reloading the same collection, but I don't think it does any management, so if there were lots of different collections, that could be a problem. However, in this specific case, there's only one collection, so that's not it.
  3. I'm not sure how exactly, but it would be really great if crunch-run could self-regulate somehow so it knew when it was potentially hitting memory pressure and could write a log, "oops I don't know if I have enough RAM to do this" and try to commit logs before it dies. Because what really sucks about this is that it just grinds to a halt and is un-debuggable from the user perspective.
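Takeaway 3 above suggests a self-monitoring mechanism. As a rough illustration only (this is not crunch-run's actual API, and `memoryPressureCheck` is a hypothetical name), a Go process can read its own heap statistics and log a warning when usage crosses a soft limit, giving it a chance to flush logs before it grinds to a halt:

```go
package main

import (
	"log"
	"runtime"
)

// memoryPressureCheck is a hypothetical sketch: read Go runtime memory
// statistics and return true (after logging a warning) when heap usage
// exceeds a soft limit, so the caller can try to commit logs before an
// OOM kill or thrashing makes the process unresponsive.
func memoryPressureCheck(softLimitBytes uint64) bool {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	if m.HeapAlloc > softLimitBytes {
		log.Printf("warning: heap usage %d bytes exceeds soft limit %d; flushing logs", m.HeapAlloc, softLimitBytes)
		return true
	}
	return false
}

func main() {
	// With a generous 1 GiB soft limit, a small program should report no pressure.
	if !memoryPressureCheck(1 << 30) {
		log.Print("memory pressure: none")
	}
}
```

A check like this could run on a timer during the copy phase; it only sees the Go heap, though, so cgroup or cgroup-v2 memory statistics would be needed to account for page cache and non-Go allocations.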

Subtasks 2 (0 open, 2 closed)

Task #22836: Review 22827-save-log-before-copy (Resolved, Peter Amstutz, 05/15/2025)
Task #22966: Review 22827-many-mounts-inside-output-dir (Resolved, Peter Amstutz, 06/05/2025)
Actions #1

Updated by Peter Amstutz 11 months ago

  • Position changed from -950309 to -950305
Actions #2

Updated by Peter Amstutz 11 months ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz 11 months ago

  • Description updated (diff)
Actions #4

Updated by Peter Amstutz 11 months ago

  • Description updated (diff)
Actions #5

Updated by Peter Amstutz 11 months ago

func (cp *copier) copyFromCollection(dest string, coll *arvados.Collection, srcRelPath string) error {
    tmpfs, err := coll.FileSystem(cp.client, cp.keepClient)
    if err != nil {
        return err
    }
    snap, err := arvados.Snapshot(tmpfs, srcRelPath)
    if err != nil {
        return err
    }
    // Create ancestors of dest, if necessary.
    for i, c := range dest {
        if i > 0 && c == '/' {
            err = cp.staged.Mkdir(dest[:i], 0777)
            if err != nil && !os.IsExist(err) {
                return err
            }
        }
    }
    return arvados.Splice(cp.staged, dest, snap)
}

How efficient is it to create a new FileSystem object and then make a Snapshot and then do a Splice for every file when you are copying 1000s of files from the same collection?

Actions #6

Updated by Peter Amstutz 11 months ago

  • Target version changed from Development 2025-05-14 to Development 2025-05-28
Actions #7

Updated by Peter Amstutz 11 months ago

  • Target version changed from Development 2025-05-28 to Development 2025-05-14
Actions #8

Updated by Tom Clegg 11 months ago

  • Assigned To set to Tom Clegg
Actions #9

Updated by Tom Clegg 11 months ago

  • Subtask #22836 added
Actions #10

Updated by Tom Clegg 11 months ago

  • Status changed from New to In Progress
Actions #11

Updated by Tom Clegg 10 months ago

> How efficient is it to create a new FileSystem object and then make a Snapshot and then do a Splice for every file when you are copying 1000s of files from the same collection?

That copyFromCollection method runs once per collection mount point, not once per file. Creating a FileSystem and doing Snapshot+Splice with 1000s of files should not be a problem.

I don't have a good theory about the OOM situation yet. I suppose it's even possible the copying stage is causing arv-mount to do something inefficient...?

Meanwhile here's a fix for the "not getting any logs from the container itself because copying failed" issue:

22827-save-log-before-copy @ 48cc8d55837bf8c3bbe3edf4f6a3584b804ef088 -- developer-run-tests: #4775
retry remainder run-tests-remainder: #5231

  • ✅ save log collection after container finishes, before copying outputs
  • ✨ refactor "save log collection" code a little, hoping to make it a little easier to follow (saveLogCollection updates the container record and stashes PDH in a "current log PDH" field, instead of returning a collection record and leaving it up to the caller to update the container record)
Actions #12

Updated by Tom Clegg 10 months ago

  • Target version changed from Development 2025-05-14 to Development 2025-05-28
Actions #13

Updated by Peter Amstutz 10 months ago

Tom Clegg wrote in #note-11:

> > How efficient is it to create a new FileSystem object and then make a Snapshot and then do a Splice for every file when you are copying 1000s of files from the same collection?
>
> That copyFromCollection method runs once per collection mount point, not once per file. Creating a FileSystem and doing Snapshot+Splice with 1000s of files should not be a problem.
>
> I don't have a good theory about the OOM situation yet. I suppose it's even possible the copying stage is causing arv-mount to do something inefficient...?

I still owe you a copy of the container record that was actually causing the problem, that may shed some light on the situation.

> Meanwhile here's a fix for the "not getting any logs from the container itself because copying failed" issue:
>
> 22827-save-log-before-copy @ 48cc8d55837bf8c3bbe3edf4f6a3584b804ef088 -- developer-run-tests: #4775
> retry remainder run-tests-remainder: #5231
>
>   • ✅ save log collection after container finishes, before copying outputs
>   • ✨ refactor "save log collection" code a little, hoping to make it a little easier to follow (saveLogCollection updates the container record and stashes PDH in a "current log PDH" field, instead of returning a collection record and leaving it up to the caller to update the container record)

This LGTM.

Something that has come up a few times: container.json in the log collection is only written once, before the container executes, so anything useful reported in runtime_status isn't available. It would be nice to stick another call to LogContainerRecord() in there before committing the log collection.

Actions #14

Updated by Tom Clegg 10 months ago

  • Status changed from In Progress to Feedback

This is on hold until we see an example.

Actions #15

Updated by Tom Clegg 10 months ago

  • Target version changed from Development 2025-05-28 to Development 2025-06-25
Actions #16

Updated by Peter Amstutz 10 months ago

Tom Clegg wrote in #note-14:

> This is on hold until we see an example.

Here are the logs from the offending run:

https://workbench.jutro.arvadosapi.com/collections/jutro-4zz18-dtz86ioahajk75l

I was able to get logs from this run because I was watching it closely (and actually logged into the compute node) and cancelled it before it went fully unresponsive, so it was able to commit logs.

The symptom in the field was using up 100% of RAM, going completely into an unresponsive state, eventually getting shut down for lack of responsiveness, and not having any logs to see what went wrong.

Actions #17

Updated by Tom Clegg 10 months ago

I haven't yet confirmed this with a test case, but I suspect this is caused by a memory leak in Snapshot/Splice. In Splice(), the "fs" fields of dirnodes and filenodes are updated to point to the new filesystem, but the "kc" field of storedSegments still points to the original source filesystem. Copier creates a new source filesystem for each of 3000 mount points (even if it's the same PDH mounted 3000 times, as it is here) and the lingering "kc" pointer prevents any of the 3000 source filesystems from being garbage-collected. The manifest itself is 356K. If the filesystem representation is 4x that, we would see about 3.2 GB memory usage.

Ways to fix:
  1. When traversing mounts, cache the filesystem representation of each collection, not just the manifest text (this would solve the memory issue and be faster in the current case, but in other cases it will use more memory than the current approach)
  2. When traversing mounts, retain the filesystem representation of the most recent source collection, so we can skip re-generating it when the next one happens to be identical (this would be faster in all cases, and would solve the memory issue in the current case)
  3. Fix the lingering "kc" pointers (this would solve the memory issue in the general case, but wouldn't be faster)

I think we should do options 2 and 3.
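Option 2 above is a one-entry cache keyed on the previously loaded collection. As a minimal sketch (illustrative names like `loadCollectionFS`; not the real crunch-run/arvados identifiers), the idea is:

```go
package main

import "fmt"

// collectionFS stands in for the filesystem representation of a
// collection; in the real code this is built from the manifest text.
type collectionFS struct{ pdh string }

// copier retains only the most recently built filesystem, so consecutive
// mounts of the same collection (identified by portable data hash, PDH)
// skip the rebuild entirely.
type copier struct {
	lastPDH string
	lastFS  *collectionFS
	loads   int // counts how many times a filesystem was actually rebuilt
}

func (cp *copier) loadCollectionFS(pdh string) *collectionFS {
	if cp.lastFS != nil && cp.lastPDH == pdh {
		return cp.lastFS // same collection as the previous mount: reuse
	}
	cp.loads++
	cp.lastPDH = pdh
	cp.lastFS = &collectionFS{pdh: pdh}
	return cp.lastFS
}

func main() {
	cp := &copier{}
	// 3000 mounts of the same collection trigger only one rebuild.
	for i := 0; i < 3000; i++ {
		cp.loadCollectionFS("pdh-1")
	}
	fmt.Println(cp.loads) // prints 1
}
```

Unlike an unbounded cache (option 1), this never holds more than one source filesystem, so worst-case memory matches the current behavior while the common same-collection case becomes both faster and smaller.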

Actions #18

Updated by Tom Clegg 10 months ago

  • Status changed from Feedback to In Progress
Actions #19

Updated by Peter Amstutz 10 months ago

Tom Clegg wrote in #note-17:

> I haven't yet confirmed this with a test case, but I suspect this is caused by a memory leak in Snapshot/Splice. In Splice(), the "fs" fields of dirnodes and filenodes are updated to point to the new filesystem, but the "kc" field of storedSegments still points to the original source filesystem. Copier creates a new source filesystem for each of 3000 mount points (even if it's the same PDH mounted 3000 times, as it is here) and the lingering "kc" pointer prevents any of the 3000 source filesystems from being garbage-collected. The manifest itself is 356K. If the filesystem representation is 4x that, we would see about 3.2 GB memory usage.
>
> Ways to fix:
>   1. When traversing mounts, cache the filesystem representation of each collection, not just the manifest text (this would solve the memory issue and be faster in the current case, but in other cases it will use more memory than the current approach)
>   2. When traversing mounts, retain the filesystem representation of the most recent source collection, so we can skip re-generating it when the next one happens to be identical (this would be faster in all cases, and would solve the memory issue in the current case)
>   3. Fix the lingering "kc" pointers (this would solve the memory issue in the general case, but wouldn't be faster)
>
> I think we should do options 2 and 3.

This sounds like a good plan to me.

Actions #20

Updated by Peter Amstutz 10 months ago

  • Release set to 79
Actions #21

Updated by Tom Clegg 10 months ago

22827-many-mounts-inside-output-dir @ fd026d718bc67e1d9c8320fc646b9109da357370 -- developer-run-tests: #4804

  • All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
    • ✅ Fix memory leak
    • ✅ Don't discard+recompute collection filesystem when consecutive mounts copy from the same collection
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • n/a
  • Code is tested and passing, both automated and manual, what manual testing was done is described.
    • ✅ Test case checks memory before/after copying files from 100 mounts of the same 100-file collection; usage down from 254 MB to 5 MB
    • ✅ Manually checked (by editing test case params) 100 mounts of two alternating 100-file collections; ditto
    • ✅ Manually checked (by editing test case params) 500 mounts of the same 500-file collection; usage down from 1500 MB to 15 MB
  • New or changed UI/UX and has gotten feedback from stakeholders.
    • n/a
  • Documentation has been updated.
    • n/a
  • Behaves appropriately at the intended scale (describe intended scale).
    • ✅ Much improved -- memory usage is now dominated by the output collection manifest, not the inputs
  • Considered backwards and forwards compatibility issues between client and server.
  • Follows our coding standards and GUI style guidelines.
Actions #22

Updated by Tom Clegg 10 months ago

  • Subtask #22966 added
Actions #23

Updated by Peter Amstutz 10 months ago

From what I see in the code, the main fixes are:

  • correctly updating the backlinks from segments to the filesystem object in Splice() and Snapshot()
  • the simplest possible optimization of not reloading the collection when the previously loaded collection was the same one

LGTM
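The backlink fix summarized above, stated abstractly: when a snapshot is spliced into a destination filesystem, the stored segments' backlinks must be repointed along with the nodes' "fs" fields, or each segment keeps its entire source filesystem reachable. A minimal sketch with illustrative types (not the real arvados package):

```go
package main

import "fmt"

// fileSystem stands in for a collection filesystem (which, in the real
// code, also holds a keep client).
type fileSystem struct{ name string }

// storedSegment keeps a backlink to the filesystem it was loaded from;
// this is the pointer that was being left behind.
type storedSegment struct{ fs *fileSystem }

type filenode struct {
	fs       *fileSystem
	segments []*storedSegment
}

// splice attaches node to dest, updating both the node's fs pointer and
// each segment's backlink so the source filesystem becomes unreachable
// and can be garbage-collected.
func splice(dest *fileSystem, node *filenode) {
	node.fs = dest
	for _, seg := range node.segments {
		seg.fs = dest // the missing update behind the leak
	}
}

func main() {
	src := &fileSystem{name: "source"}
	dst := &fileSystem{name: "staged"}
	node := &filenode{fs: src, segments: []*storedSegment{{fs: src}}}
	splice(dst, node)
	fmt.Println(node.segments[0].fs.name) // prints staged
}
```

With every pointer rewritten, nothing in the spliced tree references the source filesystem, so 3000 discarded source filesystems no longer pin gigabytes of memory.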

Actions #24

Updated by Tom Clegg 10 months ago

  • Status changed from In Progress to Resolved