Bug #8769
Closed: re-upload seems to consume a lot of space
Description
We had a 30TiB Keep setup (5x Keepstore nodes each with 6x 1TiB Keepstore volumes) and added another 30TiB (same setup).
Then we uploaded a 25TiB collection. This failed with:
librarian@rockall$ arv-put --replication 1 --no-resume --project-uuid gcam1-j7d0g-k25rlhe6ig8p9na --name DDD_WGS_EGAD00001001114 DDDP*
25956153M / 25956153M 100.0%
arv-put: Error creating Collection on project: <HttpError 422 when requesting https://gcam1.example.com/arvados/v1/collections?ensure_unique_name=true&alt=json returned "#<NoMemoryError: failed to allocate memory>">.
Traceback (most recent call last):
  File "/usr/local/bin/arv-put", line 4, in <module>
    main()
  File "/usr/local/lib/python2.7/dist-packages/arvados/commands/put.py", line 533, in main
    stdout.write(output)
UnboundLocalError: local variable 'output' referenced before assignment
We then started to re-upload the 25TiB collection as 6x subsets, 3x at a time, and the first 3 re-uploads all failed because of running out of space, as in:
librarian@sole$ time arv-put --replication 1 --no-resume --project-uuid gcam1-j7d0g-k25rlhe6ig8p9na --name DDD_WGS_EGAD00001001114_4 $(< ~/l4)
1241152M / 4228192M 29.4%
Traceback (most recent call last):
  File "/usr/local/bin/arv-put", line 4, in <module>
    main()
  File "/usr/local/lib/python2.7/dist-packages/arvados/commands/put.py", line 484, in main
    path, max_manifest_depth=args.max_manifest_depth)
  File "/usr/local/lib/python2.7/dist-packages/arvados/commands/put.py", line 334, in write_directory_tree
    path, stream_name, max_manifest_depth)
  File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 216, in write_directory_tree
    self.do_queued_work()
  File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 144, in do_queued_work
    self._work_file()
  File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 157, in _work_file
    self.write(buf)
  File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 471, in write
    return super(ResumableCollectionWriter, self).write(data)
  File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 227, in write
    self.flush_data()
  File "/usr/local/lib/python2.7/dist-packages/arvados/commands/put.py", line 310, in flush_data
    super(ArvPutCollectionWriter, self).flush_data()
  File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 264, in flush_data
    copies=self.replication))
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 153, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/keep.py", line 1065, in put
    data_hash, copies, thread_limiter.done()), service_errors, label="service")
arvados.errors.KeepWriteError: failed to write 041e9f3b83a075608ee1227acc757b0c (wanted 1 copies but wrote 0):
  service http://keep9.example.com:25107/ responded with 503 HTTP/1.1 100 Continue HTTP/1.1 503 Service Unavailable;
  service http://keep0.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue HTTP/1.1 503 Service Unavailable;
  service http://keep2.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue HTTP/1.1 503 Service Unavailable;
  service http://keep4.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue HTTP/1.1 503 Service Unavailable;
  service http://keep5.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue HTTP/1.1 503 Service Unavailable;
  service http://keep7.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue HTTP/1.1 503 Service Unavailable;
  service http://keep8.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue HTTP/1.1 503 Service Unavailable;
  service http://keep1.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue HTTP/1.1 503 Service Unavailable;
  service http://keep6.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue HTTP/1.1 503 Service Unavailable;
  service http://keep3.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue HTTP/1.1 503 Service Unavailable
real    2226m47.733s
user    135m7.266s
sys     116m52.827s
The 'arv-put' command is from the Debian package dated 160311.
What perplexed me in the above is that there was still quite a bit of free space. In the attached free-space report, the inflection point around "Friday" is when the re-upload was started. I was surprised to see free space decreasing rapidly during re-uploads of content that had allegedly already been 100% uploaded.
I have enumerated all the blocks on all 10 Keepstore servers and there are around 950k of them, with around 24k duplicates (and 6 triplicates); that is, there are only about 1.5TB of duplicates. Also, those duplicates are entirely on two Keepstores that were part of the first set of 5, which had filled up before the re-upload (bottom yellow and orange in the graph). There is perhaps a chance that the original upload's "25956153M / 25956153M 100.0%" report was optimistic.
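For reference, a minimal sketch of how such an enumeration can be done (my own; it assumes blocks are stored as files named by their 32-hex-digit MD5 hash under each Keepstore volume directory, and that per-server listings are gathered and combined):

    import collections, os, re, sys

    def block_counts(volume_roots):
        # Count how many times each block hash appears across the given
        # Keepstore volume directories.
        counts = collections.Counter()
        for root in volume_roots:
            for dirpath, _, filenames in os.walk(root):
                for name in filenames:
                    if re.match(r"^[0-9a-f]{32}$", name):  # looks like a block file
                        counts[name] += 1
        return counts

    if __name__ == "__main__":
        counts = block_counts(sys.argv[1:])  # e.g. /var/lib/keepstore/gcam1-keep-*
        dupes = {h: n for h, n in counts.items() if n > 1}
        print("%d distinct blocks, %d stored more than once" % (len(counts), len(dupes)))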
What worries me is the possibility that different hashes may be assigned to the same content. Suggestions and comments would be interesting.
Updated by Peter Grandi about 9 years ago
From #Arvados: I got pointed to a potential duplication issue in older arv-put: https://dev.arvados.org/issues/6358
Updated by Peter Grandi about 9 years ago
Report from a typical Keepstore server for 3 typical Keepstore filetrees:
manager@keepstore07:~$ sudo du -sm /var/lib/keepstore/gcam1-keep-4[345]
1047961 /var/lib/keepstore/gcam1-keep-43
1047976 /var/lib/keepstore/gcam1-keep-44
1047960 /var/lib/keepstore/gcam1-keep-45
manager@keepstore07:~$ df -T -BG /var/lib/keepstore/gcam1-keep-43 /var/lib/keepstore/gcam1-keep-44 /var/lib/keepstore/gcam1-keep-45
Filesystem     Type 1G-blocks  Used Available Use% Mounted on
/dev/vdc1      xfs      1024G 1024G        1G 100% /var/lib/keepstore/gcam1-keep-43
/dev/vdd1      xfs      1024G 1024G        1G 100% /var/lib/keepstore/gcam1-keep-44
/dev/vde1      xfs      1024G 1024G        1G 100% /var/lib/keepstore/gcam1-keep-45
manager@keepstore07:~$ df -i /var/lib/keepstore/gcam1-keep-43 /var/lib/keepstore/gcam1-keep-44 /var/lib/keepstore/gcam1-keep-45
Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/vdc1       35982 20475 15507   57% /var/lib/keepstore/gcam1-keep-43
/dev/vdd1       28438 20469  7969   72% /var/lib/keepstore/gcam1-keep-44
/dev/vde1       36414 20475 15939   57% /var/lib/keepstore/gcam1-keep-45
Updated by Peter Grandi about 9 years ago
Data Manager report appended. The 447178 unattached blocks it reports match fairly well the size of the 25TiB collection that was reported as 100% uploaded. What is perplexing is that re-uploading the very same files gets as far as 30% (as per the reports above) and then fails with a no-space message.
2016/03/23 17:00:35 Returned 10 keep disks
2016/03/23 17:00:35 Replication level distribution: map[1:936463 2:25380 3:3]
2016/03/23 17:00:38 Blocks In Collections: 514668, Blocks In Keep: 961846.
2016/03/23 17:00:38 Replication Block Counts: Missing From Keep: 0, Under Replicated: 0, Over Replicated: 22639, Replicated Just Right: 492029, Not In Any Collection: 447178. Replication Collection Counts: Missing From Keep: 0, Under Replicated: 0, Over Replicated: 22, Replicated Just Right: 395.
2016/03/23 17:00:38 Blocks Histogram:
2016/03/23 17:00:38 {Requested:0 Actual:1}: 444434
2016/03/23 17:00:38 {Requested:0 Actual:2}: 2744
2016/03/23 17:00:38 {Requested:1 Actual:1}: 492029
2016/03/23 17:00:38 {Requested:1 Actual:2}: 22636
2016/03/23 17:00:38 {Requested:1 Actual:3}: 3
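As a rough cross-check of that match (my own arithmetic, assuming mostly full 64 MiB blocks): 447178 blocks × 64 MiB = 28,619,392 MiB, about 27.3 TiB, an upper bound that is consistent with the 25,956,153 MiB (about 24.8 TiB) the original upload reported as 100% complete, since the trailing block of each stream is smaller than 64 MiB.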
Updated by Peter Grandi about 9 years ago
So thanks to #6358 I discovered (or rediscovered) ARVADOS_DEBUG=2, and applying it to a re-upload showed that arv-put:
- picks "some" Keepstore (probably according to the sorting order mentioned in #6358);
- tries to PUT the block to it, regardless of whether the block is already on it, and if the PUT fails, goes on to the next Keepstore;
- eventually finds a Keepstore where the PUT does not fail, and if by chance that is the one where the block is already stored it won't be stored twice, otherwise it will be.
This is rather disappointing, as it means that failed uploads usually create unwanted duplicates, that is, uploads are not idempotent as to "pragmatics" (a simplified sketch of this probe-and-PUT behaviour is below).
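To illustrate, here is a minimal sketch of a probe-and-PUT loop like the one described above. It is my own simplification, not the SDK's actual code; the ranking in probe_order is a guess at the sorting order mentioned in #6358, and put_to_service is a hypothetical stand-in for the HTTP PUT.

    import hashlib

    def probe_order(block_hash, keepstore_uuids):
        # Rank Keepstores by a hash of (block hash + service UUID), an
        # approximation of the client's per-block sorting order.
        return sorted(keepstore_uuids,
                      key=lambda uuid: hashlib.md5((block_hash + uuid).encode()).hexdigest())

    def put_block(block_hash, data, keepstores, put_to_service):
        # put_to_service(uuid, block_hash, data) stands in for the HTTP PUT;
        # it returns True on success, False on e.g. a 503 "no space" reply.
        for uuid in probe_order(block_hash, keepstores):
            if put_to_service(uuid, block_hash, data):
                return uuid  # the first willing server wins, duplicate or not
        raise RuntimeError("wanted 1 copies but wrote 0")

Note that nothing in this loop asks whether some other Keepstore already holds the block, which is why a retry after a partial failure can end up storing the same block on a second server.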
Also there is the terrible problem that if an upload of, say, 25TB lasts many days and perchance the Data Manager runs before the collection's manifest is registered with the API server, there might be some big disappointment (the not-yet-referenced blocks would look like garbage). IIRC this is an aspect that is being worked on, I guess with a black-grey-white state system.
For arv-put these might be possible improvements (several hopefully non-critical details omitted, like replication):
- The Data Manager or the Keepstores maintain, in the API server database, a periodically updated list of all blocks present but not registered in any collection manifest.
- arv-put then optionally checks existing manifests and that list. If there is no list, or "some" blocks are not present in it, they get uploaded to "some" Keepstore.
- If there is a list and "some" blocks are present in it:
  - arv-put sends a PUT request listing those hashes to the '/reserve' endpoint of all Keepstores.
  - Keepstores reply with a status per hash: 0="have-and-reserved", 1="dont-have-and-waiting", 2="dont-know".
  - Hashes for which all statuses are "dont-have-and-waiting" or "dont-know": PUT the hash and block to some Keepstore's '/upload' endpoint.
- At the end, when a hash is registered in a manifest sent to the API server, send a PUT to the '/registered' endpoint of the relevant Keepstore.
- A Keepstore will refuse to delete a block between its hash being PUT to the '/reserve' endpoint and it being listed in a PUT to the '/registered' endpoint.
(A sketch of this proposed client-side flow follows below.)
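A minimal sketch of the proposed client-side flow. The '/reserve', '/upload' and '/registered' endpoints and the status codes are part of this proposal, not of any existing Keepstore API, and http_put is a hypothetical stand-in for the HTTP PUT requests.

    HAVE_AND_RESERVED, DONT_HAVE_AND_WAITING, DONT_KNOW = 0, 1, 2

    def upload_blocks(blocks, keepstores, http_put):
        # blocks maps hash -> data; keepstores is a list of base URLs.
        # 1. Ask every Keepstore to reserve the hashes we intend to upload.
        statuses = {}  # hash -> list of per-Keepstore status codes
        for ks in keepstores:
            reply = http_put(ks + "/reserve", list(blocks))  # returns {hash: status}
            for h, s in reply.items():
                statuses.setdefault(h, []).append(s)

        # 2. Only upload hashes that no Keepstore reports as already held.
        for h, data in blocks.items():
            if all(s != HAVE_AND_RESERVED for s in statuses.get(h, [])):
                http_put(keepstores[0] + "/upload", {h: data})

        # 3. Once the manifest naming these hashes has been registered with
        #    the API server, release the reservations.
        for ks in keepstores:
            http_put(ks + "/registered", list(blocks))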
IIRC in a previous discussion someone mentioned a more persistent mechanism for the "grey" status (uploaded but not yet registered), like uploading or hard-linking the block into a directory like 'incoming' on the Keepstore volume.
To discuss more on #Arvados I guess.
Updated by Peter Grandi about 9 years ago
Put another way, the current algorithm for selecting a Keepstore as the destination for a block results in no duplication among Keepstores only if the set of Keepstores never changes and (maybe) they all have the same amount of free space (see the illustration below).
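For illustration, a sketch (my own simplification; the real client's ranking may differ in detail) of how adding Keepstores moves the first-choice destination of existing blocks, which is what turns a re-upload after expansion into duplication:

    import hashlib

    def preferred_server(block_hash, servers):
        # Rendezvous-style ranking: score each server against the block
        # hash; the best-scoring server is the first PUT target.
        return min(servers,
                   key=lambda s: hashlib.md5((block_hash + s).encode()).hexdigest())

    old = ["keep%d" % i for i in range(5)]
    new = old + ["keep%d" % i for i in range(5, 10)]

    hashes = [hashlib.md5(str(i).encode()).hexdigest() for i in range(10000)]
    moved = sum(1 for h in hashes
                if preferred_server(h, old) != preferred_server(h, new))
    print("%.0f%% of blocks change their first-choice Keepstore"
          % (100.0 * moved / len(hashes)))
    # Roughly half the blocks move when the number of servers doubles, so a
    # re-upload after expansion writes them to servers that did not hold
    # them before.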
Updated by Peter Grandi about 9 years ago
Just noted on IRC that at one point we had all Keepstores 100% full, and 100% of the blocks being uploaded had already been uploaded. In that case I would have expected all re-uploads to succeed, but instead they all failed at about 30% in.
Updated by Peter Amstutz almost 9 years ago
We covered this on IRC, but to summarize:
Currently, for each directory arv-put concatenates all the files into a single "stream", and it is the "stream" that is chunked into 64 MiB blocks rather than individual files. This means files are not guaranteed to fall on block boundaries and can start in the middle of a block immediately following the end of a previous file. As a result, if files are uploaded in a different order, this results in a different "stream" which is likely to yield different blocks.
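To make that concrete, here is a toy sketch (my own, with a tiny block size standing in for 64 MiB, not arv-put's actual code) of order-dependent stream chunking:

    import hashlib

    BLOCK_SIZE = 8  # tiny stand-in for the real 64 MiB block size

    def stream_block_hashes(files):
        stream = b"".join(files)  # files concatenated in upload order
        return [hashlib.md5(stream[i:i + BLOCK_SIZE]).hexdigest()
                for i in range(0, len(stream), BLOCK_SIZE)]

    a, b = b"AAAAAAAAAA", b"BBBBBB"     # two small "files"
    print(stream_block_hashes([a, b]))  # blocks: "AAAAAAAA", "AABBBBBB"
    print(stream_block_hashes([b, a]))  # blocks: "BBBBBBAA", "AAAAAAAA"

The two orderings share only one block, so uploading the same files in a different order stores mostly new blocks, which is exactly the duplication seen on re-upload.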
This is not an inherent property of Keep. Data chunking decisions are made at the client level during upload, so we can change the data chunking policy without changing any of the Keep infrastructure. For example, creating collections with writable arv-mount ("arv-mount --read-write") creates a separate block stream for each file.
However, the current order-dependent chunking obviously makes deduplication less effective, so I've filed #8791 to change this behavior.
Updated by Tom Morris about 8 years ago
- Status changed from New to Resolved
The new chunking behavior in #8791 should fix this.