Story #6203
closed[API] [Performance] Optimize time spent on the API server side during a large collection creation.
Added by Radhika Chippada over 9 years ago. Updated over 9 years ago.
100%
Description
See https://arvados.org/issues/6087#note-10 for complete detail.
It seems we can reduce several seconds in the overall time taken to create a large collection by optimizing the following:
The workbench test takes 21 seconds to do create. Of these 21 seconds, 17seconds is being spent in the following areas on the API server side: App 7227 stdout: ***** check_signatures start 2015-05-29 16:03:57 -0400 App 7227 stdout: ***** compute_pdh start 2015-05-29 16:03:57 -0400 App 7227 stdout: ***** compute_pdh end 2015-05-29 16:03:59 -0400 App 7227 stdout: ***** check_signatures end 2015-05-29 16:04:04 -0400 App 7227 stdout: ***** strip_manifest_text start 2015-05-29 16:04:04 -0400 App 7227 stdout: ***** strip_manifest_text end 2015-05-29 16:04:06 -0400 App 7227 stdout: ***** set_portable_data_hash start 2015-05-29 16:04:06 -0400 App 7227 stdout: ***** set_portable_data_hash end 2015-05-29 16:04:08 -0400 App 7227 stdout: ***** signed_manifest start 2015-05-29 16:04:09 -0400 App 7227 stdout: ***** signed_manifest end 2015-05-29 16:04:14 -0400 And, compute_pdh is being invoked twice.
Updated by Brett Smith over 9 years ago
- Assigned To set to Radhika Chippada
- Target version set to 2015-06-10 sprint
Updated by Radhika Chippada over 9 years ago
- Status changed from New to In Progress
Updated by Radhika Chippada over 9 years ago
Below are the numbers after I tweaked the above mentioned methods:
Workbench tests: test/integration_performance/collection_unit_test.rb
In my branch:
0.158198771s build example
0.512019084s new (manifest size = 6MiB)
9.831454755s create
3.164605099s read
0.013937852s read(cached)
6.026941976s list
4.689271296s update(name-only)
13.405557288s update
3.584947671s delete
0.052028055s read(404)
BigCollectionTest#test_crud_cycle_for_collection_with_big_manifest_(compress=true) = 42.39 s = .
In master:
0.130952126s build example
0.525899681s new (manifest size = 6MiB)
14.075667663s create
4.108906145s read
0.012675329s read(cached)
8.084185956s list
6.344242344s update(name-only)
19.684942139s update
4.963485125s delete
0.039076585s read(404)
BigCollectionTest#test_crud_cycle_for_collection_with_big_manifest_(compress=true) = 58.95 s = .
API TESTS: test/integration/collections_performance_test.rb
In my branch:
CollectionsApiPerformanceTest#test_crud_cycle_for_a_collection_with_a_big_manifest = 0.28716071s make example
0.134842282s JSON encode 18MiB manifest
11.402078717s create
3.083956521s read
3.086899177s list
12.464819769s update
3.742708626s delete
35.18 s = .
In master:
CollectionsApiPerformanceTest#test_crud_cycle_for_a_collection_with_a_big_manifest = 0.296271169s make example
0.131979662s JSON encode 18MiB manifest
16.23883748s create
3.903416969s read
3.928377929s list
18.550129007s update
4.560380538s delete
48.61 s = .
Updated by Tom Clegg over 9 years ago
6203-collection-perf-api @ 12afe0e
compute_pdh and caching¶
I think the newcompute_pdh
method is dangerous because it doesn't necessarily compute the pdh. For example, this looks like pdh2 will come from c2.manifest_text, but it doesn't:
pdh1 = c1.compute_pdh c1.manifest_text = c2.manifest_text pdh2 = c1.compute_pdh
@need_pdh_validation
compute_pdh()
@computed_pdh
computed_pdh()
computed_pdh
(local variable in check_signatures)
compute_pdh
method computes the pdh right now.computed_pdh
method implements a dependable cache of compute_pdh by doing something like this:def computed_pdh if @computed_pdh_for_manifest_text == manifest_text return @computed_pdh end @computed_pdh = compute_pdh @computed_pdh_for_manifest_text = manifest_text.dup end
- Get rid of
@need_pdh_validation
-- just callcomputed_pdh
instead ofcompute_pdh
. In fact, the wholeset_portable_data_hash
method seems to exist to [a] set a flag that validation is needed, which is irrelevant since validation is always fast now that we have computed_pdh, and [b] transparently modify the "size" part of the client-provided locator to the actual size of portable_manifest_text, which (if we need it at all) would make more sense as an explicit "auto-correct" step inensure_hash_matches_manifest_text
. - Get rid of the
computed_pdh
local variable incheck_signatures
and use thecomputed_pdh
method instead. - After
strip_manifest_text
, take care of the special case where@computed_pdh
is still valid even though we changedmanifest_text
(i.e., if@computed_pdh_for_manifest_text == manifest_text
before stripping, then set@computed_pdh_for_manifest_text = stripped_manifest_text
after stripping).
Aside: I think strip_manifest_text_and_...
should be renamed to strip_signatures_and_...
. The term "strip" is vague, and confusing if you've recently looked at the Python SDK where it refers to what portable_manifest_text
does here.
Aside: ensure_hash_matches_manifest_text
should probably be renamed to ensure_pdh_matches_manifest_text
.
New munge method¶
The new version no longer modifies its argument, so it should be renamed to just munge_manifest_locators
.
manifest.dup
is unnecessary in sign_manifest now that munge doesn't modify its argument. In fact, it can now reduce to:
- m = manifest.dup
- m = munge_manifest_locators!(m) do |match|
- Blob.sign_locator(locator_without_signature(match), signing_opts)
- end
- return m
+ munge_manifest_locators manifest do |match|
+ Blob.sign_locator(locator_without_signature(match), signing_opts)
+ end
Likewise, in portable_manifest_text:
- portable_manifest = self[:manifest_text].dup
- portable_manifest = self.class.munge_manifest_locators!(portable_manifest) do |match|
+ portable_manifest = self.class.munge_manifest_locators!(manifest_text) do |match|
Every non-zero manifest ends with "\n"
so I don't think we need the "omit trailing \n if the input didn't have one" code. The only valid case this covers can be handled more simply by putting return "" if manifest==""
right at the top. Likewise, "return nil if manifest.nil?" should eliminate the .andand
stuff.
Instead of appending "\n" to the string, you can do this, to avoid that last (and presumably most expensive) string concatenation:
- if !new_lines.empty?
- ends_with_newline = manifest.end_with?("\n")
- manifest = new_lines.join("\n")
- manifest += "\n" if ends_with_newline
- end
+ (new_lines + ['']).join("\n")
In self.sign_manifest
we now sign locator_without_signature(match)
. Is there ever a case where sign_manifest is called before the signatures are stripped? I don't think there should be, in which case this could be more efficient as just match[0]
, instead of locator_without_signature(match)
. AFAICT, we didn't need to remove signatures here before this branch...
There are two uses of "word.strip" here which seem to be superfluous, since word
comes from line.split(' ')
.
Could we use Ruby's built-in iterators (manifest.each_line
and line.each_line(' ')
) instead of actually instantiating arrays with split()? Manifests and lines can be many megabytes long, and we don't actually use the third copy of the manifest in order to do what we need. Assuming this change doesn't make performance worse, it seems like a worthwhile memory-consumption optimization. Unfortunately each_line includes the trailing separator with each yielded item, which might make the blocks a bit ugly. We could punt on this -- but please do note if you already tried it and found performance was worse.
locator_without_signature¶
This method looks like it does a lot more work than it should. It only ever needs to do a single replacement. Is the split() and the N appends really faster than this?
def self.locator_without_signature match
match[0].sub(/\+A[^+]*/, '')
end
combined validation method¶
I guess you don't like "_maybe_" in the name, but how about "..._update_replication_confirmed"? I think the new name "..._and_clear_replication_confirmed" suggests it's always going to clear those attrs, making it less helpful than the previous name.
This explanatory comment should probably move to the appropriate place in the new code instead of being lost:- # If the new manifest_text contains locators whose hashes - # weren't in the old manifest_text, storage replication is no - # longer confirmed.
We should check self.replication_confirmed.nil?
directly, instead of adding a new cleared_replication_confirmed
variable to track it: the "not in_old_manifest" check is equally useless regardless of when/why replication_confirmed
became nil. We should also skip building the in_old_manifest
hash in the first place if self.replication_confirmed
is already nil. We should also add something to the API unit tests to reveal how slow the replication_confirmed check is -- IIRC all existing performance tests will skip it once we make this optimization.
AIUI, the reason we use self[:manifest_text]
instead of the usual manifest_text
is to sneakily avoid setting the ActiveRecord "changed" flag when stripping signatures. I think it would be good form to avoid using it in any cases where it isn't necessary. Here, we need to use it when setting, but not when getting.
The part where we build the in_old_manifest
hash used to use an each_manifest_locator
method which was more efficient because it didn't build a new manifest: it just visited every locator, and ignored the block's return values. The new code loses this optimization. Instead of dropping the each_manifest_locator
method, I think we should give it the same optimizations that munge
got (i.e., use each
or split
instead of scan
) and use it to build in_old_manifest
. That should make it even faster (and use less memory) than munge
.
LOCATOR_REGEXP¶
This seems to be copied straight from the Ruby SDK. Can we require 'arvados/keep' and use Keep::Locator::LOCATOR_REGEXP in collection.rb, instead of copying & pasting?
It doesn't seem to have any unit tests. Can we add some in the Ruby SDK test suite?
We pass around "match" objects that depend on the position of regexp matches (match[4]
etc). If this is meant to be a supported use of LOCATOR_REGEXP, we should write a test to ensure the right pieces show up in the right positions.
If the =~
operator is faster than match()
then Keep::Locator::valid() should be updatd to use =~
too.
LOCATOR_REGEXP erroneously passes the following bogus locators, (I think) because of a spurious "+" in the "hint content" part:
d41d8cd98f00b204e9800998ecf8427e+0+B+0 d41d8cd98f00b204e9800998ecf8427e+0+B++
It would be great if we could use the exact regexp documented on the Keep locator format wiki page -- possibly with only the ()
groupings added.
Updated by Tom Clegg over 9 years ago
6203-collection-perf-ruby-sdk @ 77bf746
each_file_spec¶
The splits
variable is poorly named and never used anyway so this should be compressed back to
- splits = line.split
- splits.each do |token|
+ line.split.each do |token|
FILE_REGEXP¶
The colons have no apparent reason to be in regexp groups. How about:- FILE_REGEXP = /^(\d+)(:)(\d+)(:)(.*)$/ + FILE_REGEXP = /^(\d+):(\d+):(.*)$/
files_count¶
I don't see why it's a good idea to re-implement each_file_spec here.
The new method seems to do exactly the same amount of work as the old version, except that it saves work by introducing bugs (e.g., it doesn't unescape stream and file names, and it doesn't realize ["./foo","bar"]
and [".","foo/bar"]
are the same file).
One non-bug tweak I see here is keeping a counter instead of calling seen_files.size each time -- but a quick benchmark indicates the counter method is actually slower, especially if you include the extra hash.include?
check before incrementing the counter.
require 'benchmark' n=4000000 Benchmark.bm(8) do |x| x.report('Hash#size') { h={}; for i in 1..n; h[h.size.to_s]=true; end } x.report('counter1') { h={}; c=0; for i in 1..n; c+=1; h[c.to_s]=true; end } x.report('counter2') { h={}; c=0; for i in 1..n; c+=1 unless h.include? 'foo'; h[c.to_s]=true; end } end
user system total real
Hash#size 6.200000 0.150000 6.350000 ( 6.360331)
counter1 8.740000 0.040000 8.780000 ( 8.801807)
counter2 9.750000 0.100000 9.850000 ( 9.878150)
The other non-bug tweak I see is that it uses FILE_REGEXP instead of not Locator.valid?
. Is FILE_REGEXP faster than LOCATOR_REGEXP? If so, we should use that to make each_file_spec
faster. (It might even turn out that the old version is the fastest if we fix valid?
to use =~
instead of match()
...)
(Meanwhile, files_size
could conceivably benefit by doing something simpler than each_file_spec, because it returns the same number regardless of how the files are named, whether they're fragmented, etc. However, we could easily shoot ourselves in the foot this way by computing files_size a fast way and then having to run the slow loop anyway because the caller also wants to know files_count. So I don't think this is a worthwhile investment.)
Optimizing for the case where files_count and files_size are called, but files()
isn't called, has the same problem -- but at least we know one case where it happens, and it does make it possible to get the file count and size without keeping the big @files
array in memory, so it seems worth keeping.
But if we expect callers to invoke both files_count
and files_size
, we should avoid being sensitive to the order in which they're called, like this:
def files_count(stop_after=nil)
...
@files_size = total_size
@files_count = count_so_far
end
def files_size
files_count
return @files_size if @files_size
files.reduce(0) { |total, (_, _, size)| total + size }
end
Tests¶
There are various new code paths here that could return different results depending on what order the methods are called. If that's our strategy, we need to test accordingly.
Also, when we introduce more manifest-parsing/iterating code (which we should try to avoid) it should at least get tested on things like COLON_FILENAME_MANIFEST, MANY_ESCAPES_MANIFEST, and MULTILEVEL_MANIFEST_WITH_DIRS_IN_FILENAMES.
Updated by Radhika Chippada over 9 years ago
Responses to https://arvados.org/issues/6203#note-4
- Addressed the comments in “compute_pdh and caching”
- Updated munge method accordingly.
- Also, removed “ends_with_newline” and instead corrected the failing unit test to include “\n” at the end of manifest text.
- “In self.sign_manifest we now sign locator_without_signature(match) …” You are right. This is my mistake and corrected it.
- manifest.each_line is indeed faster than split(“\n”). However this still includes the trailing new line character. Stripping it using rstrip! made it the same as split(“\n”). However, given the memory optimization, this is better and hence updated accordingly.
- Continuing to use split instead of each_line(‘ ‘) since this is resulting in some performance loss. We can revisit this if in future we experience memory issues.
- locator_without_signature: using sub (gsub in fact) as you suggested, resulted in another 2s gain in api integration test performance.
- combined validation method
- Added the suggested enhancements
- Did not add tests in the interest of time + as you said no real point in doing this at this stage
- Please check self[:manifest_text] usages; I could not see any misuses after the updates.
- LOCATOR_REGEXP
Thanks.
Updated by Radhika Chippada over 9 years ago
Regarding "Review branch: 6203-collection-perf-ruby-sdk" and https://arvados.org/issues/6203#note-5:
I removed this review task and instead created #6270 and #6271. Based on what we learned from #6087 and #6203 (benchmarking API methods and improving performance), I think there are no short cuts and we need to do a thorough job of first measuring the performance of various methods and making updates to the ruby SDK based on those findings. Thanks for reviewing though.
Updated by Tom Clegg over 9 years ago
Updates look good. I've pushed a few more fixes on 6203-collection-perf-api-TC. Mostly comments and clarity, but also saved a bit of time in "sign" by removing an unused split() left over from changing to each_line. Can you take a look, and merge if these tweaks look OK?
Updated by Radhika Chippada over 9 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:1a98070dcdd379ab6effdc3fc4301c9435a137b1.