Feature #14538
[keep-web] Do not block writes while flushing blocks to Keep
Status: Closed
Added by Tom Clegg about 6 years ago. Updated almost 6 years ago.
% Done: 100%
Description
Currently, writing a large file via collectionfs (e.g., webdav PUT) pauses every 64 MiB while a block is written to Keep.
The block flushing should be done in the background and subsequent writes should be allowed to continue.
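To illustrate the idea only (a hypothetical sketch, not the collectionfs code on the branch): when a Write call completes a full 64 MiB block, the block can be handed to a goroutine that stores it while the caller keeps filling the next block, and Close waits for the in-flight flushes. Names like asyncFileWriter and storeBlock are made up for this sketch.

package main

import (
	"bytes"
	"crypto/md5"
	"fmt"
	"sync"
)

const maxBlockSize = 1 << 26 // 64 MiB

// asyncFileWriter illustrates the idea: when a full block has accumulated,
// flush it in a goroutine instead of making the caller's Write wait.
type asyncFileWriter struct {
	buf bytes.Buffer
	wg  sync.WaitGroup
}

func (w *asyncFileWriter) Write(p []byte) (int, error) {
	w.buf.Write(p)
	for w.buf.Len() >= maxBlockSize {
		block := make([]byte, maxBlockSize)
		w.buf.Read(block)
		w.wg.Add(1)
		go func(block []byte) {
			defer w.wg.Done()
			w.storeBlock(block)
		}(block)
	}
	return len(p), nil
}

// storeBlock stands in for the network round trip that writes one block to Keep.
func (w *asyncFileWriter) storeBlock(block []byte) {
	fmt.Printf("stored block %x (%d bytes)\n", md5.Sum(block), len(block))
}

// Close flushes any partial final block and waits for in-flight flushes.
func (w *asyncFileWriter) Close() error {
	if w.buf.Len() > 0 {
		w.storeBlock(w.buf.Bytes())
	}
	w.wg.Wait()
	return nil
}

func main() {
	w := &asyncFileWriter{}
	chunk := make([]byte, 1<<20) // 1 MiB per webdav-style Write call
	for i := 0; i < 130; i++ {   // ~130 MiB total: two full blocks flush in the background
		w.Write(chunk)
	}
	w.Close()
}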
Updated by Tom Clegg about 6 years ago
- Assigned To set to Tom Clegg
- Target version changed from To Be Groomed to 2018-12-12 Sprint
Updated by Tom Clegg about 6 years ago
- Category set to Keep
- Status changed from New to In Progress
14538-async-write @ 78c18757e42c40178d7a9eaf78f7b6d167bee926
This allows up to 4 concurrent writes per file while writing, and up to 4 concurrent writes per collection when syncing a collection in MarshalManifest.
TODO:
- make the limit configurable
- limit concurrency at the filesystem level instead of per-file, to accommodate callers that write multiple large files concurrently
I'm not sure whether this should be held up by one or both of those.
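For reference, the kind of throttle being discussed can be as simple as a buffered channel used as a counting semaphore. The sketch below is an assumption, not the type on the 14538-async-write branch; sharing one instance per filesystem rather than per file is the direction the second TODO item points at.

package main

import (
	"fmt"
	"sync"
	"time"
)

// throttle is a counting semaphore: at most cap(c) operations run at once.
type throttle struct {
	c chan struct{}
}

func newThrottle(n int) *throttle {
	return &throttle{c: make(chan struct{}, n)}
}

// Acquire blocks until fewer than n operations are in flight.
func (t *throttle) Acquire() { t.c <- struct{}{} }

// Release frees a slot for the next waiting caller.
func (t *throttle) Release() { <-t.c }

func main() {
	// One throttle shared across all files in a (hypothetical) filesystem
	// would cap total concurrent block writes, not just per-file writes.
	th := newThrottle(4)
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		th.Acquire()
		go func(i int) {
			defer wg.Done()
			defer th.Release()
			time.Sleep(10 * time.Millisecond) // stand-in for one Keep block write
			fmt.Println("flushed block", i)
		}(i)
	}
	wg.Wait()
}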
Updated by Tom Clegg about 6 years ago
I'm thinking the default/initial write concurrency limit (flushing finished blocks while writing a file) should be 1 rather than 4. If the Keep side is slower than the incoming data, more concurrency won't necessarily help -- it'll just use more memory.
Updated by Peter Amstutz about 6 years ago
Overall LGTM, assuming tests pass (link to Jenkins below). Here are some mostly stylistic comments; take what you will.
In dirnode.sync():
- Stylistically, I don't know if flush() really benefits from being declared inline. It could just as easily be a method of dirnode and take throttle as a parameter. In order to follow execution of the code, I have to scroll down to the bottom of the function, and then scroll back up again.
- pendingLen is used before it is assigned. I know this means it has its default value (0), but it would be helpful to make that explicit so it doesn't look like a bug.
In pruneMemSegments():
- This seems redundant:
seg.Len() < maxBlockSize || seg.Len() == 0
Is there a potential race between pruneMemSegments() and sync()? sync() does not check "flushing", so it seems like it might try to flush a block which is already being flushed. By my reading of the code, it probably is not a disaster if this happens (it would replace one storedSegment with an equivalent one), but worth considering. Now I see it calls waitPrune() before that, so no problem.
I'm thinking the default/initial write concurrency limit (flushing finished blocks while writing a file) should be 1 rather than 4. If the Keep side is slower than the incoming data, more concurrency won't necessarily help -- it'll just use more memory.
I expect the common case is slow clients / fast backend, in which case more threads won't get used. And as you said, fast client / slow backend would result in a pileup. The Python client uses 2 threads by default, which enables it to continue making progress if one write stalls due to a transient failure.
I'm a little concerned that flushing a collection seems to require iterating over every single segment of every single file, which could be expensive for very large collections (some benchmarking is warranted).
I submitted a test run here:
https://ci.curoverse.com/view/Developer/job/developer-run-tests/986/
Updated by Tom Clegg about 6 years ago
Peter Amstutz wrote:
- Stylistically, I don't know if flush() really benefits from being declared inline. It could just as easily be a method of dirnode and take throttle as a parameter. In order to follow execution of the code, I have to scroll down to the bottom of the function, and then scroll back up again.
Yes, sync() was getting long. Moved flush to (*dirnode)commitBlock(ctx, throttle, []fnSegmentRef).
pendingLen is used before it is assigned. I know this means it has its default value (0), but it would be helpful to make that explicit so it doesn't look like a bug.
Done.
- This seems redundant:
seg.Len() < maxBlockSize || seg.Len() == 0
Indeed. Removed the == 0 part.
I'm a little concerned that flushing a collection seems to require iterating over every single segment of every single file, which could be expensive for very large collections (some benchmarking is warranted).
This seems unavoidable, since flushing a collection involves producing a manifest which references every segment of every file. Now we iterate twice, but (even without the concurrency being added here) I expect that's still much faster than the ensuing network round-trip with the resulting manifest.
We could stash the last known manifest text and use a "dirty" flag to optimize the no-op case, but it seems out of scope here unless I'm missing something.
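Purely to illustrate that out-of-scope idea, a dirty-flag memoization might look like the sketch below. The type, field, and method names here are assumptions for illustration, not code from the branch; buildManifest stands in for the real segment walk.

package main

import (
	"fmt"
	"sync"
)

// collectionFS is a stand-in for the collection filesystem type.
type collectionFS struct {
	mtx          sync.Mutex
	dirty        bool   // set by any operation that changes file content or layout
	lastManifest string // manifest text from the last successful flush
}

// MarshalManifest returns the cached manifest when nothing has changed,
// and only walks every segment of every file when the collection is dirty.
func (fs *collectionFS) MarshalManifest(prefix string) (string, error) {
	fs.mtx.Lock()
	defer fs.mtx.Unlock()
	if !fs.dirty {
		return fs.lastManifest, nil // no-op case: skip the full iteration
	}
	txt, err := fs.buildManifest(prefix) // stand-in for the full segment walk
	if err != nil {
		return "", err
	}
	fs.lastManifest, fs.dirty = txt, false
	return txt, nil
}

// buildManifest stands in for flushing blocks and producing a manifest
// that references every segment of every file.
func (fs *collectionFS) buildManifest(prefix string) (string, error) {
	return prefix + " d41d8cd98f00b204e9800998ecf8427e+0 0:0:emptyfile\n", nil
}

func main() {
	fs := &collectionFS{dirty: true}
	m1, _ := fs.MarshalManifest(".")
	m2, _ := fs.MarshalManifest(".") // served from cache; no segment walk
	fmt.Print(m1, m2)
}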
test run
Fixed flaky test.
14538-async-write @ a88f7ad9728ee6968367928c6d3d7613bbf290ec https://ci.curoverse.com/view/Developer/job/developer-run-tests/988/
Updated by Peter Amstutz about 6 years ago
Tom Clegg wrote:
I'm a little concerned that flushing a collection seems to require iterating over every single segment of every single file, which could be expensive for very large collections (some benchmarking is warranted).
This seems unavoidable, since flushing a collection involves producing a manifest which references every segment of every file. Now we iterate twice, but (even without the concurrency being added here) I expect that's still much faster than the ensuing network round-trip with the resulting manifest.
The Python client is structured a little differently: block operations go through a BlockManager which tracks which blocks are pending, so there's no need to iterate over all the segments when flushing blocks. But this story is about incrementally flushing a PUT of a single large file, so it isn't going to be iterating over the manifest except at the end.
We could stash the last known manifest text and use a "dirty" flag to optimize the no-op case, but it seems out of scope here unless I'm missing something.
14538-async-write @ a88f7ad9728ee6968367928c6d3d7613bbf290ec https://ci.curoverse.com/view/Developer/job/developer-run-tests/988/
This LGTM.
Updated by Tom Clegg about 6 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|81b8420a261a095a269a46d965b2fc0ee6ecf793.