Bug #8769: re-upload seems to consume a lot of space
Status: Closed
Description
We had a 30TiB Keep setup (5x Keepstore nodes each with 6x 1TiB Keepstore volumes) and added another 30TiB (same setup).
Then we uploaded a 25TiB collection. This failed with:
librarian@rockall$ arv-put --replication 1 --no-resume --project-uuid gcam1-j7d0g-k25rlhe6ig8p9na --name DDD_WGS_EGAD00001001114 DDDP*
25956153M / 25956153M 100.0%
arv-put: Error creating Collection on project: <HttpError 422 when requesting https://gcam1.example.com/arvados/v1/collections?ensure_unique_name=true&alt=json returned "#<NoMemoryError: failed to allocate memory>">.
Traceback (most recent call last):
  File "/usr/local/bin/arv-put", line 4, in <module>
    main()
  File "/usr/local/lib/python2.7/dist-packages/arvados/commands/put.py", line 533, in main
    stdout.write(output)
UnboundLocalError: local variable 'output' referenced before assignment
We then started to re-upload the 25TiB collection as 6x subsets, 3x at a time, and all 3 of the first re-uploads failed because of running out of space, as in:
librarian@sole$ time arv-put --replication 1 --no-resume --project-uuid gcam1-j7d0g-k25rlhe6ig8p9na --name DDD_WGS_EGAD00001001114_4 $(< ~/l4)
1241152M / 4228192M 29.4%
Traceback (most recent call last):
  File "/usr/local/bin/arv-put", line 4, in <module>
    main()
  File "/usr/local/lib/python2.7/dist-packages/arvados/commands/put.py", line 484, in main
    path, max_manifest_depth=args.max_manifest_depth)
  File "/usr/local/lib/python2.7/dist-packages/arvados/commands/put.py", line 334, in write_directory_tree
    path, stream_name, max_manifest_depth)
  File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 216, in write_directory_tree
    self.do_queued_work()
  File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 144, in do_queued_work
    self._work_file()
  File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 157, in _work_file
    self.write(buf)
  File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 471, in write
    return super(ResumableCollectionWriter, self).write(data)
  File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 227, in write
    self.flush_data()
  File "/usr/local/lib/python2.7/dist-packages/arvados/commands/put.py", line 310, in flush_data
    super(ArvPutCollectionWriter, self).flush_data()
  File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 264, in flush_data
    copies=self.replication))
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 153, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/keep.py", line 1065, in put
    data_hash, copies, thread_limiter.done()), service_errors, label="service")
arvados.errors.KeepWriteError: failed to write 041e9f3b83a075608ee1227acc757b0c (wanted 1 copies but wrote 0):
  service http://keep9.example.com:25107/ responded with 503 HTTP/1.1 100 Continue HTTP/1.1 503 Service Unavailable;
  service http://keep0.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue HTTP/1.1 503 Service Unavailable;
  service http://keep2.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue HTTP/1.1 503 Service Unavailable;
  service http://keep4.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue HTTP/1.1 503 Service Unavailable;
  service http://keep5.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue HTTP/1.1 503 Service Unavailable;
  service http://keep7.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue HTTP/1.1 503 Service Unavailable;
  service http://keep8.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue HTTP/1.1 503 Service Unavailable;
  service http://keep1.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue HTTP/1.1 503 Service Unavailable;
  service http://keep6.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue HTTP/1.1 503 Service Unavailable;
  service http://keep3.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue HTTP/1.1 503 Service Unavailable

real 2226m47.733s
user 135m7.266s
sys 116m52.827s
The 'arv-put' command is from the Debian package dated 160311.
What perplexed me in the above is that there was still quite a bit of free space. In the attached free space report, the inflection point around "Friday" is when the re-upload was started. I was surprised to see free space decreasing rapidly for uploads of content that had allegedly already been 100% uploaded.
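For cross-checking the graph against what each Keepstore itself reports, a minimal sketch along these lines could be used, assuming the keepN.gcam1.example.com host names from the log above and that this keepstore version serves a /status.json document (its field names vary by version, so this just pretty-prints whatever comes back rather than assuming particular keys):

# Python 2.7, matching the environment in the tracebacks above
# (use urllib.request instead of urllib2 on Python 3).
import json
import urllib2

hosts = ['keep%d.gcam1.example.com' % i for i in range(10)]  # hypothetical host list
for host in hosts:
    url = 'http://%s:25107/status.json' % host
    try:
        status = json.load(urllib2.urlopen(url, timeout=10))
    except Exception as e:
        print('%s: error: %s' % (host, e))
        continue
    print('%s:' % host)
    # Dump the whole status document; look for per-volume free/used byte counts.
    print(json.dumps(status, indent=2, sort_keys=True))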
I have enumerated all the blocks on all 10 Keepstore servers and there are around 950k of them, with around 24k duplicates (and 6 triplicates); that is, there are only about 1.5TB of duplicates. Also, those duplicates are entirely on two Keepstores that were part of the first set of 5, which had filled up before the re-upload (bottom yellow and orange in the graph). There is perhaps a chance that the original upload's "25956153M / 25956153M 100.0%" report was optimistic.
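For reference, a minimal sketch of the kind of enumeration used here, assuming the block listings from each server have already been collected into one text file per Keepstore with one block hash (locator) per line; the file names below are hypothetical:

import collections
import sys

def count_duplicates(listing_files):
    # Map each block hash to the set of servers (listing files) it appears on.
    servers_by_hash = collections.defaultdict(set)
    for path in listing_files:
        with open(path) as f:
            for line in f:
                block_hash = line.strip().split('+')[0]  # drop any +size hints
                if block_hash:
                    servers_by_hash[block_hash].add(path)
    # Histogram: how many blocks exist on exactly 1, 2, 3, ... servers.
    copies = collections.Counter(len(s) for s in servers_by_hash.values())
    return servers_by_hash, copies

if __name__ == '__main__':
    # e.g. python count_dups.py keep0.blocks keep1.blocks ... keep9.blocks
    by_hash, copies = count_duplicates(sys.argv[1:])
    print("total distinct blocks: %d" % len(by_hash))
    for n, count in sorted(copies.items()):
        print("blocks present on %d server(s): %d" % (n, count))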
What worries me is the possibility that different hashes may be assigned to the same content (for example, if a re-upload splits the same files into blocks at different boundaries), so that re-uploaded data does not deduplicate against what is already stored. Suggestions and comments would be welcome.
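To illustrate that concern (this is only a sketch of the failure mode, not a claim about how arv-put actually chooses its block boundaries): if two uploads of identical bytes happen to split them at different offsets, every block hashes differently, so nothing is shared between the two uploads even though the file content is the same. Keep addresses blocks by md5; the block sizes below are small arbitrary values for a quick demo.

import hashlib
import os

def block_hashes(data, block_size):
    # md5 of each fixed-size slice, the way content-addressed blocks are keyed
    return [hashlib.md5(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

data = os.urandom(12 * 1024 * 1024)          # 12 MiB of arbitrary content
first = block_hashes(data, 4 * 1024 * 1024)  # first upload: 4 MiB boundaries
second = block_hashes(data, 3 * 1024 * 1024) # same bytes, different boundaries
print(set(first) & set(second))              # empty set: no block is shared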
Files: free space report graph (attached)