Bug #10315
closedPerformance issues on arv-put command version from '9701-collection-pack-small-files-alt' branch
100%
Description
It seems that the new arv-put command has problems when asked to upload lots of files.
Updated by Lucas Di Pentima over 8 years ago
Some tests:
- Test 1: File size = 1 byte - Arvbox local keep store - Cache update interval 1s
- 100 files: 0m0.473s
- 1000 files: 0m1.843s
- 10000 files: 3m7.695s (!!!!)
- Test 2: File size 1 Kbyte - Arvbox local keep store - Cache update interval 1s
- 100 files: 0m0.492s
- 1000 files: 0m1.871s
- 10000 files: 3m19.292s (seems that file size is not important)
- Test 3: File size 1 Kbyte - Arvbox local keep store - Cache update interval 60s
- 10000 files: 2m45.130s (seems that the issue is not about updating an increasingly bigger cache file)
Updated by Lucas Di Pentima about 8 years ago
Found what the problem was: For every small file close()
call, the BlockManager.repack_small_files()
method is called, and the first thing that used to do is a list comprehension searching for all pending buffer blocks on its internal list, and then adding the size()
of every one of them to check if there're enough to fill an entire block.
This list comprehension execution was the slow operation, when the number of pending buffer blocks was > 1000, it was starting to add up lots of time.
The solution I came up with is to add an internal variable to the BlockManager
class to keep count of the sum of all the pending buffer block's sizes, this counter get added on every small file close()
call and the slow list comprehension is executed only when the counter gets to at least the keep block size.
My local tests came with around 10X speed improvement when testing with 5000 files of 1 KB each. Improvement ratios should be a lot higher when tests are performed with even more files.
These changes are at the 10315-new-arv-put-performance
branch at: 6f04833
Test suite running at: https://ci.curoverse.com/job/developer-run-tests/47/
Updated by Lucas Di Pentima about 8 years ago
Found a pre-existing locking bug (already there at 9701 branch): When having enough small blocks to fill an entire 64MB block, it gets locked when calling commit_bufferblock()
with sync=False
because that method calls start_put_threads()
and that's a @synchronized method.
Thinking about wrapping a non-synchronized version of it.
Updated by Lucas Di Pentima about 8 years ago
Commit: 078bdb1
Solved the locking issue by moving the start_put_threads()
call into BlockManager's __init__()
New test suite run at: https://ci.curoverse.com/job/developer-run-tests/48/
Updated by Lucas Di Pentima about 8 years ago
Additional stats data: 100000 files of 1KB each: 2 minutes 14 secs at a local arvbox installation.
Updated by Lucas Di Pentima about 8 years ago
- Target version changed from 2016-10-26 sprint to 2016-11-09 sprint
Updated by Peter Amstutz about 8 years ago
Reviewing 10315-new-arv-put-performance @ 078bdb166766423e1a423523e5285966aff7ec6b
The main reason for lazy start of the put threads is to avoid creating put threads for read-only collections. Could you have CollectionReader()
pass in a flag that skips starting the put threads?
Caching _pending_write_size is a good idea. However, would it make sense to recompute pending_write_size after collecting small_blocks and then check it again? (For example, a file could be opened, written, and closed multiple times, which may cause it to be double-counted in this implementation).
Updated by Lucas Di Pentima about 8 years ago
Updates at 95fc987
- About put threads lazy start: I set up an exclusive lock for this operation, and leave the start call where it was previously, at
commit_bufferblock()
method. - Updated pending write size count as requested.
Test suite being run at: https://ci.curoverse.com/job/developer-run-tests/49/
Thanks!
Updated by Peter Amstutz about 8 years ago
Lucas Di Pentima wrote:
Updates at 95fc987
- About put threads lazy start: I set up an exclusive lock for this operation, and leave the start call where it was previously, at
commit_bufferblock()
method.
Looks good.
- Updated pending write size count as requested.
Suggested tweak:
self._pending_write_size = sum([b.size() for b in small_blocks]) if self._pending_write_size < config.KEEP_BLOCK_SIZE and not force: return
Updated by Lucas Di Pentima about 8 years ago
Branch updated with the suggested change and master branch merged in at 734335da27f27e2177d3b931b1e5e9e8e83a042f
Tests successfully run: https://ci.curoverse.com/job/developer-run-tests/51/
Updated by Peter Amstutz about 8 years ago
Lucas Di Pentima wrote:
Branch updated with the suggested change and master branch merged in at 734335da27f27e2177d3b931b1e5e9e8e83a042f
Tests successfully run: https://ci.curoverse.com/job/developer-run-tests/51/
Thanks, LGTM, please merge.
Updated by Lucas Di Pentima about 8 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:15d8499a8d95abbc4bc2dbbd0bcfd2a4c6666408.