Bug #10932
[arv-put] job resume too slow & uninformative
Status: Closed, 100% done
Description
Completed 97% of a 636959M upload (637 GB), then attempted to resume it after arv-put was killed. It took almost 3 hours (175 minutes) of 100% CPU for arv-put to calculate/confirm the restart point (which seemed to be only 45%, not 97%) before beginning to do any additional uploading. During this time there was no user feedback other than "0M / 636959M 0.0%" for hours on end.
Some other, possibly relevant, details:
A 98 MB cache file in ~/.cache/arvados/:
97913649 Jan 18 21:54 476acf1a6d4cfb9b630a13289c4a72be
of which ~7 MB is manifest text:
$ jq .manifest 476acf1a6d4cfb9b630a13289c4a72be | wc
1 240325 6861939
and the rest is info for the 478502 files:
$ jq '.files | length' 476acf1a6d4cfb9b630a13289c4a72be
478502
$ jq -r '.files[] | .size ' 476acf1a6d4cfb9b630a13289c4a72be | wc
478502 478502 3827704
$ jq -r '.files[] | .mtime ' 476acf1a6d4cfb9b630a13289c4a72be | wc
478502 478502 8928344
$ jq -r '.files | keys ' 476acf1a6d4cfb9b630a13289c4a72be | wc
478504 480984 69204103
That's 69 MB of file names, which compress to 1.4 MB due to their high redundancy:
$ jq -r '.files | keys ' 476acf1a6d4cfb9b630a13289c4a72be | gzip | wc
10074 39234 1442694
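For reference, the redundancy is easy to demonstrate in plain Python as well. This is a minimal sketch using synthetic stand-in paths (not the actual cache contents), but the compression ratio behaves similarly:

import zlib

# Synthetic file names mimicking the cache's highly redundant paths;
# the real cache holds 478502 names with long shared prefixes.
paths = "\n".join(
    "dataset/sample_%05d/reads/lane_%d/chunk_%03d.fastq" % (i, i % 8, i % 500)
    for i in range(478502)
).encode()

print("raw bytes:       %d" % len(paths))
print("zlib-compressed: %d" % len(zlib.compress(paths, 9)))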
Updated by Tom Morris almost 8 years ago
I added some logging to the client and it shows that loading from JSON takes just seconds, but the deep object copy takes almost half an hour and the JSON dump takes OVER AN HOUR.
Investigate:
- more compact storage structure like a trie or b-tree
- cPickle for save/restore
- non-JSON storage format, since JSON is overkill
2017-01-19 03:24:49 arvados.arv_put571: About to do setup state
2017-01-19 03:24:49 arvados.arv_put571: Loading JSON cache file
2017-01-19 03:24:53 arvados.arv_put571: Done loading cache file
2017-01-19 03:24:53 arvados.arv_put571: Loading local collection from manifest text
2017-01-19 03:25:08 arvados.arv_put571: Done loading local collection
2017-01-19 03:25:08 arvados.arv_put571: done setup state
2017-01-19 03:25:53 arvados.arv_put571: Removing local files which don't exist any more from collection
2017-01-19 03:26:08 arvados.arv_put571: Entering _update
2017-01-19 03:26:19 arvados.arv_put571: getting nonfinal local manifest text
2017-01-19 03:27:06 arvados.arv_put571: Done getting local manifest text
2017-01-19 03:27:06 arvados.arv_put571: acquired lock doing deep copy self._state - 280 bytes
2017-01-19 03:53:13 arvados.arv_put571: done deep copy - 26 minutes
2017-01-19 03:53:17 arvados.arv_put571: About to do json dump
2017-01-19 04:55:52 arvados.arv_put571: Leaving _save_state - 63 minutes
282275M / 636959M 44.3%
2017-01-19 04:55:56 arvados.arv_put571 WARNING: Leaving _update
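The load/save asymmetry above is easy to reproduce outside arv-put. The following is a minimal sketch (a synthetic state dict standing in for the real ~478k-file cache, scaled down for memory) comparing deepcopy, json.dump through a file object, and json.dumps to a single string:

import copy
import json
import tempfile
import time

# Synthetic stand-in for arv-put's cached state: many small entries.
state = {
    "files": {
        "path/to/file_%06d" % i: {"size": i, "mtime": 1484790000 + i}
        for i in range(200000)
    }
}

t0 = time.time()
copy.deepcopy(state)
print("deepcopy:   %.2fs" % (time.time() - t0))

t0 = time.time()
with tempfile.TemporaryFile(mode="w") as f:
    json.dump(state, f)  # streams many small chunks through f.write()
print("json.dump:  %.2fs" % (time.time() - t0))

t0 = time.time()
json.dumps(state)  # one-shot encode into a single string
print("json.dumps: %.2fs" % (time.time() - t0))

On CPython, json.dump always goes through the chunked pure-Python iterencode path, while json.dumps with default options can use the one-shot C encoder, which is why the two differ so sharply at this scale.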
Updated by Tom Morris almost 8 years ago
- Target version changed from Arvados Future Sprints to 2017-02-15 sprint
Updated by Tom Morris almost 8 years ago
- Target version changed from 2017-02-15 sprint to Arvados Future Sprints
Updated by Tom Morris almost 8 years ago
- Subject changed from arv-put job resume too slow & uninformative to [arv-put] job resume too slow & uninformative
Updated by Lucas Di Pentima almost 8 years ago
- Assigned To changed from Tom Morris to Lucas Di Pentima
- Target version changed from Arvados Future Sprints to 2017-02-15 sprint
- Story points set to 0.5
Updated by Lucas Di Pentima almost 8 years ago
- Status changed from New to In Progress
Updated by Lucas Di Pentima almost 8 years ago
Updates at: 9c9b520
Tests: https://ci.curoverse.com/job/developer-run-tests/151/
A couple of optimizations improved times considerably.
On my tests with a ~12 MB cache, state-saving time went down from ~14 secs to ~0.8 secs just by avoiding copy.deepcopy() and json.dump() and replacing them with json.dumps().
As for the resume start delay, using a set() instead of a list reduced the time spent checking that every file in the local collection still exists on local disk before starting to upload the rest.
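A rough sketch of the shape of both changes (hypothetical names, not the actual arv-put code): serialize the state to one string under the lock instead of deep-copying it and streaming json.dump() through the file, and keep known file paths in a set so membership checks are O(1):

import json
import threading

class UploadState(object):  # hypothetical stand-in for arv-put's writer state
    def __init__(self):
        self._state = {"files": {}, "manifest": ""}
        self._state_lock = threading.Lock()
        self._file_paths = set()  # was a list; a set gives O(1) membership tests

    def _save_state(self, cache_file):
        # Before: deepcopy self._state under the lock, then json.dump() the
        # copy through the file object (many small writes).
        # After: one json.dumps() call and a single write.
        with self._state_lock:
            data = json.dumps(self._state)
        cache_file.write(data)
        cache_file.flush()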
Updated by Peter Amstutz almost 8 years ago
Is there any reason self._file_paths can't be a set()? It looks like the only thing that would need to be changed is to update self._file_paths.append(filename) to self._file_paths.add(filename).
Caching change looks good, might want to add a comment with what you learned about the performance of deepcopy/dump/dumps.
Updated by Lucas Di Pentima almost 8 years ago
Peter Amstutz wrote:
Is there any reason self._file_paths can't be a set()? It looks like the only thing that would need to be changed is to update self._file_paths.append(filename) to self._file_paths.add(filename).
I had read that sets aren't as performant on iteration as lists are, but hadn't tested it myself. As you mentioned, the difference between the two is a matter of microseconds, so it's updated as you requested.
Caching change looks good, might want to add a comment with what you learned about the performance of deepcopy/dump/dumps.
Done.
Test run: https://ci.curoverse.com/job/developer-run-tests/153/
Updates: e41dc06
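For the record, a quick micro-benchmark along the lines discussed above (a sketch, not part of the project's test suite) shows that iterating a set is in the same ballpark as iterating a list, while membership tests are where the set wins decisively:

import timeit

paths = ["file_%06d" % i for i in range(100000)]
as_list, as_set = list(paths), set(paths)

# Full iteration: both linear, roughly comparable.
print("iterate list: %.3fs" % timeit.timeit(lambda: sum(1 for _ in as_list), number=10))
print("iterate set:  %.3fs" % timeit.timeit(lambda: sum(1 for _ in as_set), number=10))

# Membership: O(n) scan for the list vs O(1) hash lookup for the set.
print("'in' list: %.3fs" % timeit.timeit(lambda: "file_099999" in as_list, number=100))
print("'in' set:  %.3fs" % timeit.timeit(lambda: "file_099999" in as_set, number=100))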
Updated by Lucas Di Pentima almost 8 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:a8b210d2bb5b10e6583abf295b99788b3dff7479.